CN109189589A

CN109189589A - A kind of distribution big data computing engines and framework method

Info

Publication number: CN109189589A
Application number: CN201810919696.6A
Authority: CN
Inventors: 程捷; 张念礼; 罗俊
Original assignee: Beijing Bo Hongyuan Data Polytron Technologies Inc
Current assignee: Beijing Bo Hongyuan Data Polytron Technologies Inc
Priority date: 2018-08-14
Filing date: 2018-08-14
Publication date: 2019-01-11
Anticipated expiration: 2038-08-14
Also published as: CN109189589B

Abstract

The invention discloses a kind of distributed big data computing engines and framework methods.Computing engines include: distributed coordination service cluster, for providing coordination service for Distributed Application, saving related plug-in card program and service database object set Schema configuration file；Message intermediate cluster, is used for transmission different types of business datum, and the business datum includes initial data, calculated result, snapshot data, base-line data and alert data；Streaming computing cluster, the bottom Computational frame based on Storm, for timing indicator big data treatment process to be abstracted as several processes；Control module is visualized, for data to be showed and managed by way of web；Data buffer storage cluster, for reducing the expense of convection type computing cluster memory when in high volume calculating.The fast and convenient online Stream Processing for realizing magnanimity timing indicator data of the project team of enterprise or the project cycle and manpower anxiety that the present invention can help big data technological accumulation less.

Description

Distributed big data calculation engine and construction method

Technical Field

The present invention relates to computing engine architectures, and in particular, to a distributed big data computing engine and an architecture method.

Background

At present, more and more enterprises recognize the importance of big data to future development of the enterprises, and therefore, the enterprises begin to use the big data and gradually rely on big data processing related technologies. However, as more and more data needs to be processed, the service scenario is more complex, and many problems are encountered in the actual execution process, such as shortage of big data talents, high labor cost, lack of precipitation in related technologies, difficulty in cultivating a relatively mature big data team in a short period, and various and diverse service requirements of different service departments, resulting in repeated development of different project codes, repeated wheeling, and eight-door project technical architecture, which brings great challenges to later maintenance and iteration.

With the development and maturity of big data processing technology, due to the above practical problems, it is considered necessary to abstract and design a flexible, light, general-purpose, stable and efficient unified big data processing engine framework based on the experience of borui data passing through numerous actual big data items to solve the problems.

In the prior art, a big data processing engine is completely based on memory computing, and the throughput is inferior to that of a traditional batch computing framework, such as Spark, MapReduce and the like; the built-in polymerization time has fixed granularity, can not be changed, and can not support the polymerization of larger granularity above the granularity; support for MQ is limited to Kafka, and other MQ support can be considered in the later period; in addition, the prior art is only suitable for structured time sequence index data processing and does not support other scenes such as unstructured big data processing.

Disclosure of Invention

In order to solve the defects of the prior art, the invention provides a distributed big data calculation engine and a construction method, wherein the scope of responsibility of the distributed big data calculation engine comprises raw data preprocessing, quasi-real-time calculation, batch calculation with various time granularities, data landing and various fault-tolerant processing.

In order to solve the technical problems, the invention adopts the technical scheme that: a distributed big data compute engine, comprising:

the distributed coordination service cluster is used for providing coordination service for distributed application and storing a related plug-in program and a business database object set Schema configuration file, wherein the coordination service comprises configuration service, distributed synchronization and node monitoring;

the message intermediate cluster is a distributed multi-partition supporting multi-copy based distributed message system and is used for transmitting different types of service data, wherein the service data comprises original data, a calculation result, snapshot data, baseline data and alarm data;

the flow type computing cluster is a bottom computing framework based on Storm and is used for abstracting a time sequence index big data processing process into the following main processes: preprocessing, quasi-real-time calculation, small-batch calculation, large-batch calculation, landing and warehousing; the stream computing cluster comprises a preprocessing topology, a statistical topology and a storage topology;

the visual control module is used for showing and managing data in a web mode;

and the data caching cluster is a streaming computing cluster auxiliary memory storage cluster and is used for reducing the overhead of the streaming computing cluster memory during mass computing.

Further, in the streaming computing cluster:

the preprocessing topology is used for subscribing an original data source from the message intermediate cluster, preprocessing the data and performing quasi-real-time aggregation processing on a preprocessing result;

the statistical topology is used for carrying out batch aggregation on the preprocessed data according to different time granularities, and comprises two sub-computational topologies: small batch computing topologies and large batch computing topologies.

The storage topology is used for landing and warehousing language data, and basic framework support is provided for landing of a final calculation result; the landing data comprises time sequence index data and snapshot file data.

Furthermore, in the preprocessing topology, a plug-in for preprocessing is developed by a user, and a calculation rule is described in the database object set Schema by the user and is responsible for executing a specific cleaning policy on each piece of original data.

Further, the preprocessing topology sends a data mirror image to the message intermediate cluster after data preprocessing, and the user performs subsequent backup processing.

Furthermore, in the statistical topology, intermediate calculation results of medium and small granularity are cached in the data cache cluster for the next large granularity calculation; meanwhile, each granularity calculation result falls to the message intermediate cluster, and subsequent storage operation is carried out by the storage topology subscription, so that decoupling between data calculation and falling is realized.

Furthermore, the data cache cluster caches the calculation results in the middle of each granularity for direct use in the next time granularity calculation, so that the data processing magnitude is reduced.

Further, the streaming computing cluster also includes a baseline topology and/or an alarm topology.

The invention also includes the architecture method of the distributed big data computing engine, which comprises the following steps:

defining a source data format, packaging data in a uniform format, and identifying a data timestamp;

xml file of concrete processing rule of each service data in the data source is configured, and all data indexes and operation processing rule of dimensionality are described by the file;

developing a data preprocessing plugin by realizing the provided data preprocessing plugin interface class, wherein the data preprocessing plugin runs in a data preprocessing topology and is responsible for executing a specific cleaning strategy on each piece of original data;

the user develops the user-defined operator plug-in by realizing the provided user-defined operator plug-in interface class; the custom operator plug-in runs in the data calculation topology and is responsible for realizing the custom operator required by the user for processing the data index and the dimensionality; the custom operator plug-in receives a batch of data, calculates the data according to a custom calculation rule, and returns the result to the caller.

If the Mysql is used as a final floor database, the processes of table building and result data warehousing can be automatically completed by a frame; if other floor schemes are needed, the data storage plug-in is developed by realizing the data storage plug-in interface class running in the data storage topology and is responsible for storing the calculated data.

And further, the data storage plug-in receives the final calculation result data and stores the data according to the service requirement of the data storage plug-in.

Further, the method also comprises that a user develops the extensible topology according to the business requirement of the user to define processing or computing data, and submits the extensible topology to the engine in a topology independent computing mode, and the engine is loaded and operated.

Further, the method further comprises the steps of configuring basic dependent cluster addresses in the app.xml file and key control parameters during running of each computing topology, and starting running of each service topology through scripts.

The invention provides a distributed big data computing engine, which is characterized in that a great number of plug-ins and extension mechanisms are adopted in the overall structural design of the engine, personalized processing strongly associated with services, such as a data preprocessing strategy, is abstracted into a preprocessing plug-in, operators for processing data and dimension indexes are opened into a statistical plug-in, and a processing result falling strategy is abstracted into a storage plug-in. Meanwhile, in addition to the support of the plug-in technology, in order to enrich the functions of the engine framework, an extension mechanism is also supported. The user can develop the plug-in required by the user on the basis of the existing engine framework and submit the plug-in to the engine in an independent computing topology mode, and the extension of functions is realized because the engine is loaded and operated.

In addition, the Bonree Ants also supports the functions of plug-in dynamic update and schema.

The invention has the following beneficial effects:

1. the system has the advantages of simple and open architecture, less component dependence and low development, deployment and maintenance cost;

2. the engine framework is not coupled with the service, the data processing flow is highly abstract, and the universality is strong;

3. the second-level time delay is realized, the real-time performance is good, and batch calculation support is built in;

4. an extension mechanism is supported, and a user can enrich service scene function support by himself;

5. various fault-tolerant strategies are built in, so that stability and data safety are guaranteed;

6. visual management and monitoring are supported;

the method can help enterprises with less accumulation of big data technology or project teams with project period and manpower shortage to conveniently and quickly realize online streaming processing of mass time sequence index data. For a common service scene of the time sequence index streaming processing, the goal can be realized only by simply configuring and describing a service script on data application by non-research and development service personnel without the participation of research and development personnel; for a complex business scene, a research and development worker hopes to realize the logic of strong correlation between the relevant business and the business by carrying out a small amount of coding through a plug-in mechanism of an engine, and the bottom layer complex resource scheduling, task arrangement and fault-tolerant processing in the big data processing are given the responsibility of the engine, so that the development of the relevant big data processing business is quickly realized, and the relevant development and maintenance cost of an enterprise is greatly reduced. Through the practice of numerous internal projects, after the engine framework of the invention is applied, the large data processing development workload is reduced by 80 percent in whole, and the whole project period is shortened by more than 40 percent.

Drawings

FIG. 1 is a schematic diagram of a distributed data engine architecture diagram of the present invention.

FIG. 2 is a flow chart of the distributed data engine connection of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

[ example 1 ]

As shown in fig. 1, the distributed data engine framework mainly includes five parts, namely, a distributed coordination service cluster (ZooKeeper), a message middleware cluster (Kafka), a streaming computation cluster (Storm), a data caching cluster (Redis), and a visualization control, where ZooKeeper, Kafka, Storm, Redis, and the like are currently popular open source components. Wherein,

distributed coordination service clustering: ZooKeeper is a distributed application program coordination service, provides an efficient and reliable distributed coordination service for distributed applications, and provides distributed basic services such as configuration service, distributed synchronization, node monitoring and the like. In addition to its use to maintain the state of Kafka and Storm clusters in the distributed data engine, we also use it to save related plug-ins and business schema. Xml can be dynamically effective without restarting a topology program, thereby reducing the difficulty of topology maintenance and the problem of data loss caused by topology restart.

Message middleware cluster: kafka is a distributed, multi-partition-supporting, multi-copy, Zookeeper-coordination-based distributed message system, and has the greatest characteristic of processing a large amount of data in real time to meet various demand scenarios. It is a high throughput distributed publish and subscribe messaging system. Its stability and efficiency are also the most well-accepted message middleware within the industry. Different types of traffic data may be transmitted, such as: the system comprises original data, a calculation result, snapshot data, alarm data and the like, can ensure the real-time performance and the safety of the data (the data cannot be lost), can also play a role in buffering the access pressure, and well decouples the data and the service.

Streaming computing cluster: the bottom layer computing framework is based on Storm, so that Storm is selected as the bottom layer computing framework, and the characteristics of high instantaneity, low resource overhead, low external dependence, pure memory computing, good fault tolerance and the like of Storm are mainly considered. The distributed data engine abstracts the time sequence index big data processing process into the following main processes, namely: 5 processes of pretreatment, quasi-real-time calculation, small-batch calculation, large-batch calculation, floor warehousing and the like. The above processes are all accomplished by three types of topologies, namely prepressing Topology, simulation Topology and Storage Topology, which run on Storm.

PreProcessing Topology (PreProcessing Topology): this topology is responsible for subscribing to the original data source from kafka and invoking the etl-plugin to preprocess the data (the etl-plugin is implemented by the client itself), and to aggregate the etl-processed results in near real time (the calculation rules are described by the user in schema. If the user wants to backup original detail data after etl, only related configuration needs to be started in schema.

Statistical Topology (Topology): this topology is responsible for bulk aggregation of data after etl at different time granularity (rules are described by schema. There are two sub-computing topologies inside this process: small batch (minutes scale), large batch (hours scale and days scale). And in the calculation process, intermediate calculation results of small granularity are cached in the redis cluster for the next large granularity calculation. Meanwhile, each granularity calculation result can fall to the topic corresponding to kafka, and subsequent storage operation is carried out through 'data storage topology' subscription, so that decoupling between data calculation and falling is realized. The batch calculation is an aggregation calculation based on time granularity, and the aggregation of five different time granularities, such as 1 second, 1 minute, 10 minutes, 1 hour, 1 day and the like, is supported by default. Because progressive dependency exists among all granularity calculations, in order to reduce the calculation resource overhead and accelerate the calculation process, the calculation results among all the granularities are cached in the Redis cluster for being directly used in the next time granularity calculation, and therefore the data processing magnitude is reduced.

Storage Topology (Storage Topology): the topology is responsible for the floor storage of data, the floor data comprises two types, namely chronological index data (structured) and snapshot file data (unstructured, if the data exists in the service), and only basic framework support is provided for the floor of the final calculation result. Because the storage module is not responsible for final data storage, no limitation is imposed on the final data grounding storage component. The Bonree Ants default built-in support Mysql storage scheme, if Mysql is adopted as the final landing database, the processes of table creation and result data warehousing can be automatically completed by the engine. If other landing schemes such as HBase are needed, a user develops a Storage-plug-in to perform custom support to realize a specific landing strategy.

Caching the data cluster: the data in the whole data processing process of the distributed data engine is not dropped to the ground and is completed in the memory. Because a large-time granularity batch computation service scene needs to be supported, the Redis is introduced into the distributed data engine to serve as a Storm cluster auxiliary memory storage cluster, so that the cost of the Storm cluster memory during batch computation is reduced. Due to the adoption of a memory computing mode, the distributed data engine has high real-time performance and almost has no influence on disk I/O (input/output), but the sacrifice on data processing throughput is brought to a certain extent.

A visual control module: the whole big data related environment can be displayed and managed in a web mode, so that the problem of complex deployment and maintenance is solved. The specific functions are as follows: managing basic configuration; schema service configuration management; plug-in release management; topology release management; monitoring the running states of the clusters such as storm, kafka, redis and the like; and tracking a log service chain.

The distributed data engine overall architecture design largely adopts plug-ins and extension mechanisms, personalized processing strongly associated with services, such as data preprocessing strategy abstraction to preprocessing plug-ins Etl-plugin, operators (built-in basic operators supporting sum, max, min and the like) for data and dimension index processing are opened to statistical plug-ins Operator-plugin, and processing result drop strategy abstraction to Storage plug-ins. Meanwhile, in addition to the support of the plug-in technology, in order to enrich the functions of the distributed data engine framework, an extension mechanism is also supported. The user can develop needed extension on the basis of the existing engine framework and submit the needed extension to the engine in an independent computing topology mode, and the extension of functions is realized due to the loading and running of the engine. Currently, a distributed data engine defaults to a built-in dynamic baseline extension and an alarm condition judgment extension. And in addition, plug-in dynamic update and schema.

In this embodiment:

xml: the system comprises a base dependent cluster address and key control parameters when each computing topology runs;

xml: the data processing method comprises the steps that specific processing rules of all service data in a data source are configured, and operation processing rules of all data indexes and dimensions are described by the file;

etl (data preprocessing) plug-in: the method is developed by a user, runs in a data preprocessing topology and is responsible for executing a specific cleaning strategy on each piece of original data;

operator plug-in: the plug-in is optional, is developed by a user, runs in a data calculation topology and is responsible for realizing a user-defined operator required by the user for processing data indexes and dimensions;

a memory plug-in: the plug-in is optional, is developed by a user, runs in a data storage topology and is responsible for realizing specific floor-type warehousing operation of a data processing result.

[ example 2 ]

As shown in fig. 2, the implementation of the architecture comprises the following steps:

(1) and defining the format of the source data, wherein the data sent to the framework needs to be packaged in a uniform format and the data timestamp is identified because the framework is not limited to the format of the source data.

(2) And configuring a schema of each specific processing rule of the business data in the data source.

(3) And developing a data preprocessing plug-in by realizing the provided data preprocessing plug-in interface class, wherein the data preprocessing plug-in runs in a data preprocessing topology and is responsible for executing a specific cleaning strategy on each piece of original data. When the plug-in is developed, each received original number needs to be cleaned according to configuration items in Schma.xml, and then cleaned data is packaged into a specific object and returned to a caller.

(4) Basic operators such as sum, max, min and the like are supported in the framework design. If the user needs a specific operator, the user-defined operator plug-in can be developed by realizing the provided user-defined operator plug-in interface class. The method runs in a data calculation topology and is responsible for realizing a user-defined operator required by a user for processing data indexes and dimensions. The plug-in receives a batch of data, calculates the data according to a custom calculation rule, and returns the result to the caller.

(5) And the final calculation result of the architecture design is in a floor built-in Mysql supporting storage scheme, and if Mysql is used as a final floor database, the processes of table building and result data warehousing can be automatically completed by the framework. If other ground schemes such as HBase, elastic search and the like are needed, the data storage plug-in can be developed by realizing the provided data storage plug-in interface class, and the data storage plug-in runs in the data storage topology and is responsible for storing the calculated data. The plug-in receives a batch of data and stores the data according to the service requirement of the plug-in.

(6) Besides the support of the plug-in technology, the architecture design also supports extensions extension mechanism for enriching the functions of the engine framework. The user develops the extensible topology according to the self business requirement to define the processing or calculate the data. It is submitted to the engine in a topology independent computing manner, and extension of the engine function is realized due to loading and running of the engine.

(7) And configuring a basic dependency cluster address in the app.xml file and key control parameters during the operation of each computing topology, and starting to operate each service topology through a script.

The data processing of the engine comprises the following steps that a distributed application coordination service cluster is used for coordinating all parts to work:

1. the message intermediate cluster acquires source data from a data source;

2. the intermediate message cluster transmits the source data to a preprocessing topology, and the preprocessing plug-in is processed and then transmitted to the intermediate message cluster;

3. the data cache cluster obtains a preprocessed result from the preprocessing topology;

4. computing topology or other expanded topologies to acquire data from the data cache cluster or the message intermediate cluster, and processing the data to send the response piece data cache cluster;

6. the storage topology acquires data from a computing topology, other expansion topologies or a message intermediate cluster directly;

6. the storage topology sends the data to a data warehouse.

The invention introduces two processing mechanisms of plug-in and schema business rule configuration in the storm computing topology, and can dynamically take effect, which plays a great simplification role in the existing distributed big data technology development. The invention provides a plug-in mechanism (etl preprocessing plug-in, custom operator plug-in, storage plug-in), a schema business rule configuration mechanism and a topology expansion mechanism introduced into an engine architecture. And the related plug-ins, the schema business rule configuration and the topology expansion can be dynamically validated.

The above embodiments are not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make variations, modifications, additions or substitutions within the technical scope of the present invention.

Claims

1. A distributed big data computing engine, comprising:

the visual control module is used for showing and managing data in a web mode;

2. The distributed big data computing engine of claim 1, wherein: in the streaming computing cluster:

3. The distributed big data computing engine of claim 2, wherein: in the preprocessing topology, a preprocessing plug-in is developed by a user, and a calculation rule is described in a database object set Schema by the user and is responsible for executing a specific cleaning strategy on each piece of original data.

4. The distributed big data computing engine of claim 3, wherein: the preprocessing topology sends a data mirror copy to the message intermediate cluster after data preprocessing, and the user performs subsequent backup processing.

5. The distributed big data computing engine of claim 4, wherein: in the statistical topology, intermediate calculation results of medium and small granularities are cached in a data cache cluster for the next large granularity calculation; meanwhile, each granularity calculation result falls to the message intermediate cluster, and subsequent storage operation is carried out by the storage topology subscription, so that decoupling between data calculation and falling is realized.

6. The distributed big data computing engine of claim 5, wherein: the data caching cluster caches the calculation results in the middle of each granularity for direct use in next time granularity calculation, so that the data processing magnitude is reduced.

7. The distributed big data computing engine of claim 6, wherein: the streaming computing cluster further comprises a baseline topology and/or an alarm topology.

8. The architectural method of a distributed big data computing engine of claim 7, comprising:

9. The architectural method of a distributed big data computing engine of claim 8, wherein: and the data storage plug-in receives the final calculation result data and stores the data according to the service requirement of the data storage plug-in.

10. The architectural method of a distributed big data computing engine of claim 9, wherein: the method further comprises the following steps:

the user develops an extensible topology according to self business requirements to define, process or calculate data, and submits the data to the engine in an independent topology calculation mode, and the engine is loaded and operated;

and configuring a basic dependency cluster address and key control parameters of each computing topology during operation in the app-ml file, and starting to operate each service topology through a script.