CN109189589A - Distributed big data computing engine and architecture method - Google Patents
Distributed big data computing engine and architecture method
- Publication number
- CN109189589A (application CN201810919696.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- topology
- cluster
- computing
- calculation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a distributed big data computing engine and an architecture method. The computing engine comprises: a distributed coordination service cluster, which provides coordination services for distributed applications and stores the related plug-in programs and the business Schema configuration file; a message middleware cluster, which transmits different types of business data, including raw data, calculation results, snapshot data, baseline data, and alarm data; a streaming computing cluster, whose underlying computing framework is based on Storm and which abstracts the processing of time-series metric big data into several processes; a visual control module, which displays and manages data through a web interface; and a data caching cluster, which reduces the memory overhead of the streaming computing cluster during large-batch computation. The invention can help enterprises with little accumulated big data expertise, or project teams with tight schedules and limited staff, to implement online stream processing of massive time-series metric data quickly and conveniently.
Description
Technical Field
The present invention relates to computing engine architectures, and in particular, to a distributed big data computing engine and an architecture method.
Background
More and more enterprises now recognize the importance of big data to their future development, and have begun to adopt big data and to rely increasingly on big data processing technologies. However, as data volumes grow and business scenarios become more complex, many problems arise in practice: big data talent is scarce, labor costs are high, accumulated technical experience is lacking, and a reasonably mature big data team is hard to build in a short time. In addition, the varied requirements of different business departments lead to duplicated code across projects, repeated reinvention of the wheel, and widely divergent project architectures, all of which pose great challenges for later maintenance and iteration.
As big data processing technology has developed and matured, and in view of the practical problems above, it is necessary to abstract and design a flexible, lightweight, general-purpose, stable, and efficient unified big data processing engine framework, drawing on the experience Bonree Data has gained from numerous real big data projects.
In the prior art, a big data processing engine of this kind has several limitations. It is based entirely on in-memory computing, so its throughput is lower than that of traditional batch computing frameworks such as Spark and MapReduce. The built-in aggregation time granularities are fixed and cannot be changed, and aggregation at granularities coarser than the largest built-in one is not supported. MQ support is limited to Kafka, with support for other message queues left for later consideration. In addition, the prior art is suitable only for processing structured time-series metric data and does not support other scenarios such as unstructured big data processing.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a distributed big data computing engine and an architecture method. The engine's responsibilities include raw data preprocessing, quasi-real-time computation, batch computation at multiple time granularities, data persistence, and various kinds of fault-tolerant processing.
To solve the above technical problems, the invention adopts the following technical solution: a distributed big data computing engine, comprising:
a distributed coordination service cluster, which provides coordination services for distributed applications and stores the related plug-in programs and the business Schema configuration file, wherein the coordination services comprise configuration service, distributed synchronization, and node monitoring;
a message middleware cluster, which is a distributed messaging system supporting multiple partitions and multiple replicas and is used for transmitting different types of business data, wherein the business data comprises raw data, calculation results, snapshot data, baseline data, and alarm data;
a streaming computing cluster, whose underlying computing framework is based on Storm and which abstracts the processing of time-series metric big data into the following main processes: preprocessing, quasi-real-time computation, small-batch computation, large-batch computation, and persistence to storage; the streaming computing cluster comprises a preprocessing topology, a statistical topology, and a storage topology;
a visual control module, which displays and manages data through a web interface;
and a data caching cluster, which is an auxiliary in-memory storage cluster for the streaming computing cluster and reduces the memory overhead of the streaming computing cluster during large-batch computation.
Further, in the streaming computing cluster:
the preprocessing topology subscribes to the raw data source from the message middleware cluster, preprocesses the data, and performs quasi-real-time aggregation on the preprocessing results;
the statistical topology performs batch aggregation on the preprocessed data at different time granularities and comprises two sub-computing topologies: a small-batch computing topology and a large-batch computing topology;
the storage topology persists the data to storage and provides basic framework support for persisting the final calculation results; the persisted data comprises time-series metric data and snapshot file data.
Further, in the preprocessing topology, the preprocessing plug-in is developed by the user and is responsible for applying a specific cleaning strategy to each piece of raw data, and the calculation rules are described by the user in the Schema.
Further, after preprocessing the data, the preprocessing topology sends a mirror copy of the data to the message middleware cluster so that the user can carry out subsequent backup processing.
Further, in the statistical topology, intermediate calculation results of small and medium granularity are cached in the data caching cluster for use in the next, coarser-granularity computation; meanwhile, the calculation results of each granularity are written to the message middleware cluster, to which the storage topology subscribes in order to carry out the subsequent storage operations, thereby decoupling data computation from persistence.
Further, the data caching cluster caches the intermediate calculation results of each granularity for direct use in the computation of the next time granularity, thereby reducing the volume of data to be processed.
Further, the streaming computing cluster also includes a baseline topology and/or an alarm topology.
The invention also provides an architecture method for the distributed big data computing engine, comprising the following steps:
defining the source data format, packaging the data in a uniform format, and marking the data timestamp;
configuring the schema.xml file containing the specific processing rules for each kind of business data in the data source, wherein this file describes the computation rules for all data metrics and dimensions;
developing a data preprocessing plug-in by implementing the provided data preprocessing plug-in interface class, wherein the data preprocessing plug-in runs in the data preprocessing topology and is responsible for applying a specific cleaning strategy to each piece of raw data;
developing, by the user, a custom operator plug-in by implementing the provided custom operator plug-in interface class; the custom operator plug-in runs in the data calculation topology and implements the custom operators the user needs for processing data metrics and dimensions; it receives a batch of data, computes over it according to the custom calculation rule, and returns the result to the caller;
if MySQL is used as the final persistence database, table creation and result data warehousing are completed automatically by the framework; if another persistence scheme is required, a data storage plug-in is developed by implementing the provided data storage plug-in interface class; the plug-in runs in the data storage topology and is responsible for storing the calculated data.
Further, the data storage plug-in receives the final calculation result data and stores it according to the user's own business requirements.
Further, the method also comprises: the user develops an extension topology according to the user's own business requirements to define custom processing or computation of data, and submits it to the engine as an independent computing topology, which the engine loads and runs.
Further, the method also comprises configuring, in the app.xml file, the addresses of the basic dependent clusters and the key runtime control parameters of each computing topology, and starting each business topology through scripts.
The overall architecture of the engine provided by the invention makes extensive use of plug-in and extension mechanisms. Personalized processing that is strongly tied to the business is abstracted into plug-ins: the data preprocessing strategy is abstracted into a preprocessing plug-in, the operators for processing data metrics and dimensions are opened up as a statistical plug-in, and the strategy for persisting processing results is abstracted into a storage plug-in. In addition to plug-ins, an extension mechanism is supported to enrich the functions of the engine framework: the user can develop the required extension on top of the existing engine framework and submit it to the engine as an independent computing topology, and the engine loads and runs it, thereby extending the engine's functions.
In addition, Bonree Ants also supports dynamic updating of plug-ins and of schema.xml.
The invention has the following beneficial effects:
1. the architecture is simple and open, with few component dependencies and low development, deployment, and maintenance costs;
2. the engine framework is decoupled from the business, the data processing flow is highly abstracted, and the framework is broadly applicable;
3. latency is at the level of seconds, real-time performance is good, and batch computation support is built in;
4. an extension mechanism is supported, so users can add support for their own business scenarios;
5. multiple fault-tolerance strategies are built in, ensuring stability and data safety;
6. visual management and monitoring are supported.
The invention can help enterprises with little accumulated big data expertise, or project teams with tight schedules and limited staff, to implement online stream processing of massive time-series metric data quickly and conveniently. For common time-series metric stream processing scenarios, business staff without a development background can reach the goal simply by configuring and describing the business rules that apply to the data, with no developer involvement. For complex business scenarios, developers need only write a small amount of code through the engine's plug-in mechanism to implement the logic that is strongly tied to the business, while the complex underlying resource scheduling, task orchestration, and fault-tolerant processing of big data processing are left to the engine. Big data processing services can thus be developed quickly, and the enterprise's development and maintenance costs are greatly reduced. In practice across numerous internal projects, applying the engine framework of the invention reduced the overall big data processing development workload by 80 percent and shortened the overall project cycle by more than 40 percent.
Drawings
FIG. 1 is a schematic diagram of the architecture of the distributed data engine of the present invention.
FIG. 2 is a flow chart of the distributed data engine connection of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
[ example 1 ]
As shown in FIG. 1, the distributed data engine framework mainly comprises five parts: a distributed coordination service cluster (ZooKeeper), a message middleware cluster (Kafka), a streaming computing cluster (Storm), a data caching cluster (Redis), and a visual control module. ZooKeeper, Kafka, Storm, and Redis are all currently popular open source components. Specifically:
Distributed coordination service cluster: ZooKeeper is a coordination service for distributed applications. It provides efficient and reliable distributed coordination, including basic distributed services such as configuration service, distributed synchronization, and node monitoring. In the distributed data engine, besides maintaining the state of the Kafka and Storm clusters, it is also used to store the related plug-ins and the business schema.xml, so that they can take effect dynamically without restarting the topology program. This reduces the difficulty of topology maintenance and avoids the data loss caused by topology restarts.
Message middleware cluster: Kafka is a distributed messaging system that supports multiple partitions and multiple replicas and is coordinated by ZooKeeper. Its defining characteristic is the ability to process large volumes of data in real time to meet a variety of demand scenarios. It is a high-throughput distributed publish-subscribe messaging system that is widely recognized in the industry for its stability and efficiency. It can transmit different types of business data, such as raw data, calculation results, snapshot data, and alarm data. It safeguards the timeliness and safety of the data (data is not lost), buffers ingestion pressure, and decouples data from business logic.
Streaming computing cluster: the underlying computing framework is based on Storm. Storm was chosen mainly for its strong real-time performance, low resource overhead, few external dependencies, pure in-memory computing, and good fault tolerance. The distributed data engine abstracts the processing of time-series metric big data into five main processes: preprocessing, quasi-real-time computation, small-batch computation, large-batch computation, and persistence to storage. These processes are carried out by three types of topology running on Storm: the Preprocessing Topology, the Statistical Topology, and the Storage Topology.
Preprocessing Topology: this topology subscribes to the raw data source from Kafka, invokes the etl-plugin to preprocess the data (the etl-plugin is implemented by the user), and aggregates the ETL-processed results in near real time (the calculation rules are described by the user in schema.xml). If the user wants to back up the raw detail data after ETL, only the relevant configuration needs to be enabled in schema.xml.
Statistical Topology: this topology performs batch aggregation of the ETL-processed data at different time granularities (the rules are described in schema.xml). It contains two sub-computing topologies: small batch (minute level) and large batch (hour level and day level). During computation, intermediate results at finer granularities are cached in the Redis cluster for the next, coarser-granularity computation. Meanwhile, the results for each granularity are written to the corresponding Kafka topic, to which the storage topology subscribes for the subsequent storage operations, thereby decoupling data computation from persistence. Batch computation is aggregation based on time granularity; five granularities are supported by default: 1 second, 1 minute, 10 minutes, 1 hour, and 1 day. Because each granularity depends progressively on the finer one, the intermediate results of each granularity are cached in the Redis cluster for direct use in the next granularity's computation. This reduces computing resource overhead, speeds up computation, and reduces the volume of data to be processed.
Storage Topology: this topology is responsible for persisting data to storage. The persisted data is of two kinds: time-series metric data (structured) and snapshot file data (unstructured, if the business has such data). The topology provides only basic framework support for persisting the final calculation results; because it is not itself responsible for the final data store, it places no restriction on the final storage component. Bonree Ants has built-in support for a MySQL storage scheme by default: if MySQL is adopted as the final persistence database, table creation and result data warehousing are completed automatically by the engine. If another persistence scheme such as HBase is needed, the user develops a Storage-plugin to implement the specific persistence strategy.
Data caching cluster: throughout the distributed data engine's processing, data is never written to disk; everything is done in memory. Because batch computation at large time granularities must be supported, Redis is introduced as an auxiliary in-memory storage cluster for the Storm cluster, reducing the Storm cluster's memory overhead during batch computation. Thanks to in-memory computing, the distributed data engine has high real-time performance and is almost unaffected by disk I/O, at the cost of some data processing throughput.
Visual control module: the entire big data environment can be displayed and managed through a web interface, which eases complex deployment and maintenance. Its functions are: basic configuration management; schema business configuration management; plug-in release management; topology release management; monitoring of the running state of the Storm, Kafka, Redis, and other clusters; and log tracing across the service chain.
The overall architecture of the distributed data engine makes extensive use of plug-in and extension mechanisms. Personalized processing that is strongly tied to the business is abstracted into plug-ins: the data preprocessing strategy is abstracted into the preprocessing plug-in Etl-plugin; the operators for processing data metrics and dimensions (basic operators such as sum, max, and min are built in) are opened up as the statistical plug-in Operator-plugin; and the strategy for persisting processing results is abstracted into the storage plug-in Storage-plugin. In addition to plug-ins, an extension mechanism is supported to enrich the functions of the distributed data engine framework: the user can develop the required extension on top of the existing engine framework and submit it to the engine as an independent computing topology, and the engine loads and runs it, thereby extending its functions. Currently, the distributed data engine ships with a dynamic baseline extension and an alarm condition evaluation extension by default. Dynamic updating of plug-ins and of schema.xml is also supported.
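The progressive roll-up performed by the statistical topology can be illustrated with a minimal Python sketch. It assumes that each cached intermediate result is a small partial aggregate (sum, count, max, min) and uses a plain dictionary as a stand-in for the Redis cluster; the function names `bucket_start`, `aggregate_raw`, and `roll_up` are hypothetical and are not taken from the engine itself.

```python
# Time granularities in seconds: 1 s, 1 min, 10 min, 1 h, 1 d (the default ladder in the text).
GRANULARITIES = [1, 60, 600, 3600, 86400]


def bucket_start(timestamp, granularity):
    """Return the start (epoch seconds) of the bucket that contains `timestamp`."""
    return timestamp - (timestamp % granularity)


def aggregate_raw(cache, records, granularity=1):
    """Fold raw (timestamp, value) records into partial aggregates at the finest granularity.

    `cache` is a dict standing in for the Redis cluster (an assumption of this sketch).
    """
    for ts, value in records:
        key = (granularity, bucket_start(ts, granularity))
        acc = cache.get(key)
        if acc is None:
            cache[key] = {"sum": value, "count": 1, "max": value, "min": value}
        else:
            acc["sum"] += value
            acc["count"] += 1
            acc["max"] = max(acc["max"], value)
            acc["min"] = min(acc["min"], value)


def roll_up(cache, fine, coarse):
    """Build `coarse` buckets from cached `fine` partial aggregates only.

    Raw records are never re-read, which is what keeps the processed volume small.
    """
    merged = {}
    for (gran, start), part in list(cache.items()):
        if gran != fine:
            continue
        key = (coarse, bucket_start(start, coarse))
        acc = merged.get(key)
        if acc is None:
            merged[key] = dict(part)  # copy so the fine-grained entry is not mutated
        else:
            acc["sum"] += part["sum"]
            acc["count"] += part["count"]
            acc["max"] = max(acc["max"], part["max"])
            acc["min"] = min(acc["min"], part["min"])
    cache.update(merged)
    return merged
```

Calling `roll_up(cache, 1, 60)` and then `roll_up(cache, 60, 600)` walks up the granularity ladder one step at a time. Sum, count, max, and min merge exactly from partial results; an operator such as a median would need a different intermediate representation.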
In this embodiment:
app.xml: contains the addresses of the basic dependent clusters and the key runtime control parameters of each computing topology;
schema.xml: configures the specific processing rules for each kind of business data in the data source; the computation rules for all data metrics and dimensions are described in this file;
etl (data preprocessing) plug-in: developed by the user, runs in the data preprocessing topology, and applies a specific cleaning strategy to each piece of raw data;
operator plug-in: optional, developed by the user, runs in the data calculation topology, and implements the custom operators the user needs for processing data metrics and dimensions;
storage plug-in: optional, developed by the user, runs in the data storage topology, and implements the specific persistence operations for the data processing results.
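The division of labor among the three plug-in types can be sketched in Python as follows. The patent does not disclose the actual interface classes, so every class, method, and field name here (`EtlPlugin`, `OperatorPlugin`, `StoragePlugin`, `latency_ms`, and so on) is a hypothetical stand-in chosen for illustration.

```python
from abc import ABC, abstractmethod


class EtlPlugin(ABC):
    """Hypothetical preprocessing interface: clean one raw record at a time."""

    @abstractmethod
    def clean(self, raw):
        """Return the cleaned record, or None to drop it."""


class OperatorPlugin(ABC):
    """Hypothetical custom-operator interface: reduce a batch of values to one result."""

    name = ""

    @abstractmethod
    def compute(self, values):
        """Return the aggregate of `values`."""


class StoragePlugin(ABC):
    """Hypothetical storage interface: persist a batch of result rows."""

    @abstractmethod
    def store(self, rows):
        """Persist `rows` in the user's chosen store."""


# Basic operators the text says are built in.
BUILTIN_OPERATORS = {"sum": sum, "max": max, "min": min}


class AverageOperator(OperatorPlugin):
    """Example custom operator that is not among the built-ins."""

    name = "avg"

    def compute(self, values):
        values = list(values)
        return sum(values) / len(values) if values else None


class DropNegativeLatency(EtlPlugin):
    """Example cleaning strategy: discard records with a missing or negative latency."""

    def clean(self, raw):
        if raw.get("latency_ms", -1) < 0:
            return None
        return {"ts": int(raw["ts"]), "latency_ms": float(raw["latency_ms"])}


class InMemoryStorage(StoragePlugin):
    """Example store that keeps rows in a list; a real plug-in would write to a database."""

    def __init__(self):
        self.rows = []

    def store(self, rows):
        self.rows.extend(rows)


def resolve_operator(name, plugins):
    """Look up an operator by name, preferring built-ins over user plug-ins."""
    if name in BUILTIN_OPERATORS:
        return BUILTIN_OPERATORS[name]
    for plugin in plugins:
        if plugin.name == name:
            return plugin.compute
    raise KeyError(name)
```

Keeping each interface to a single method mirrors the description: the etl plug-in sees one record, the operator plug-in sees one batch, and the storage plug-in sees finished results.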
[ example 2 ]
As shown in fig. 2, the implementation of the architecture comprises the following steps:
(1) Define the source data format. The framework places no restriction on the format of the source data itself, so data sent to the framework must be packaged in a uniform format with the data timestamp marked.
(2) Configure schema.xml with the specific processing rules for each kind of business data in the data source.
(3) Develop a data preprocessing plug-in by implementing the provided data preprocessing plug-in interface class. The plug-in runs in the data preprocessing topology and is responsible for applying a specific cleaning strategy to each piece of raw data. When developing the plug-in, each raw record received must be cleaned according to the configuration items in schema.xml, and the cleaned data is then packaged into a specific object and returned to the caller.
(4) The framework supports basic operators such as sum, max, and min. If the user needs a specific operator, a custom operator plug-in can be developed by implementing the provided custom operator plug-in interface class. The plug-in runs in the data calculation topology and implements the custom operators the user needs for processing data metrics and dimensions. It receives a batch of data, computes over it according to the custom calculation rule, and returns the result to the caller.
(5) The framework has built-in support for a MySQL storage scheme for persisting the final calculation results. If MySQL is used as the final persistence database, table creation and result data warehousing are completed automatically by the framework. If another persistence scheme such as HBase or Elasticsearch is needed, a data storage plug-in can be developed by implementing the provided data storage plug-in interface class; it runs in the data storage topology and is responsible for storing the calculated data. The plug-in receives a batch of data and stores it according to the user's own business requirements.
(6) Besides plug-ins, the architecture also supports an extension mechanism for enriching the functions of the engine framework. The user develops an extension topology according to the user's own business requirements to define custom processing or computation of data. It is submitted to the engine as an independent computing topology, and the engine loads and runs it, thereby extending the engine's functions.
(7) Configure, in the app.xml file, the addresses of the basic dependent clusters and the key runtime control parameters of each computing topology, and start each business topology through scripts.
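Step (1) above requires a uniform wrapper that carries the data timestamp. A minimal Python sketch of such an envelope follows; the patent does not specify the wire format, so the use of JSON and the field names `source`, `timestamp_ms`, and `payload` are assumptions made purely for illustration.

```python
import json
import time


def pack(source, payload, timestamp_ms=None):
    """Wrap a source-specific payload in a uniform envelope with an explicit timestamp.

    The envelope layout (JSON with `source`, `timestamp_ms`, `payload`) is hypothetical.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    envelope = {
        "source": source,
        "timestamp_ms": int(timestamp_ms),
        "payload": payload,
    }
    return json.dumps(envelope, sort_keys=True).encode("utf-8")


def unpack(message):
    """Decode an envelope and reject it if the timestamp or payload is missing."""
    envelope = json.loads(message.decode("utf-8"))
    ts = envelope.get("timestamp_ms")
    if isinstance(ts, bool) or not isinstance(ts, int) or ts < 0:
        raise ValueError("envelope must carry a non-negative integer timestamp_ms")
    if "payload" not in envelope:
        raise ValueError("envelope must carry a payload")
    return envelope
```

Because the payload is opaque to the envelope, each data source can keep its own internal format while downstream time-granularity aggregation relies only on `timestamp_ms`.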
The engine's data processing comprises the following steps, with the distributed coordination service cluster coordinating the work of all parts:
1. the message middleware cluster acquires source data from the data source;
2. the message middleware cluster passes the source data to the preprocessing topology, where the preprocessing plug-in processes it, and the result is passed back to the message middleware cluster;
3. the data caching cluster obtains the preprocessed results from the preprocessing topology;
4. the computing topology or other extension topologies acquire data from the data caching cluster or the message middleware cluster, process it, and send the results to the data caching cluster;
5. the storage topology acquires data directly from the computing topology, other extension topologies, or the message middleware cluster;
6. the storage topology sends the data to the data warehouse.
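The decoupling that these steps achieve by routing data through the message middleware can be sketched with an in-memory stand-in. In the Python sketch below, the `Broker` class, the topic names `raw`, `cleaned`, and `results`, and the record fields are all hypothetical; the data caching cluster is omitted for brevity, and a real deployment would use Kafka topics and Storm topologies rather than function calls.

```python
from collections import defaultdict, deque


class Broker:
    """Minimal stand-in for the message middleware: named FIFO topics."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def drain(self, topic):
        """Yield and remove every message currently queued on `topic`."""
        queue = self.topics[topic]
        while queue:
            yield queue.popleft()


def preprocess_stage(broker):
    """Clean raw records and republish them; records without a value are dropped."""
    for raw in broker.drain("raw"):
        if raw.get("value") is None:
            continue
        broker.publish("cleaned", {"ts": raw["ts"], "value": float(raw["value"])})


def statistic_stage(broker, granularity=60):
    """Sum cleaned values per time bucket and publish one result per bucket."""
    sums = defaultdict(float)
    for record in broker.drain("cleaned"):
        sums[record["ts"] - record["ts"] % granularity] += record["value"]
    for start, total in sorted(sums.items()):
        broker.publish("results", {"bucket": start, "granularity": granularity, "sum": total})


def storage_stage(broker, warehouse):
    """Move finished results into the warehouse (a list in this sketch)."""
    for result in broker.drain("results"):
        warehouse.append(result)


def run_pipeline(raw_records):
    """Push records through the three stages, which share nothing but the broker."""
    broker = Broker()
    warehouse = []
    for record in raw_records:
        broker.publish("raw", record)
    preprocess_stage(broker)
    statistic_stage(broker)
    storage_stage(broker, warehouse)
    return warehouse
```

Each stage reads from one topic and writes to another, so a stage can be replaced or restarted without the others knowing, which is the property the text attributes to passing results through the message middleware.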
The invention introduces two processing mechanisms into the Storm computing topologies, plug-ins and schema business rule configuration, both of which can take effect dynamically; this greatly simplifies development with existing distributed big data technology. Specifically, the engine architecture incorporates a plug-in mechanism (etl preprocessing plug-in, custom operator plug-in, and storage plug-in), a schema business rule configuration mechanism, and a topology extension mechanism, and the related plug-ins, schema business rule configuration, and topology extensions can all take effect dynamically.
The above embodiments are not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make variations, modifications, additions or substitutions within the technical scope of the present invention.
Claims (10)
1. A distributed big data computing engine, comprising:
a distributed coordination service cluster, which provides coordination services for distributed applications and stores the related plug-in programs and the business Schema configuration file, wherein the coordination services comprise configuration service, distributed synchronization, and node monitoring;
a message middleware cluster, which is a distributed messaging system supporting multiple partitions and multiple replicas and is used for transmitting different types of business data, wherein the business data comprises raw data, calculation results, snapshot data, baseline data, and alarm data;
a streaming computing cluster, whose underlying computing framework is based on Storm and which abstracts the processing of time-series metric big data into the following main processes: preprocessing, quasi-real-time computation, small-batch computation, large-batch computation, and persistence to storage; the streaming computing cluster comprises a preprocessing topology, a statistical topology, and a storage topology;
a visual control module, which displays and manages data through a web interface;
and a data caching cluster, which is an auxiliary in-memory storage cluster for the streaming computing cluster and reduces the memory overhead of the streaming computing cluster during large-batch computation.
2. The distributed big data computing engine of claim 1, wherein, in the streaming computing cluster:
the preprocessing topology is used for subscribing to the original data source from the message intermediate cluster, preprocessing the data, and performing quasi-real-time aggregation on the preprocessing result;
the statistical topology is used for batch aggregation of the preprocessed data at different time granularities, and comprises two sub-computing topologies: a small-batch computing topology and a large-batch computing topology;
the storage topology is used for landing and warehousing the data, and provides basic framework support for landing the final calculation results; the landed data comprises time-series index data and snapshot file data.
3. The distributed big data computing engine of claim 2, wherein: in the preprocessing topology, a preprocessing plug-in developed by the user is responsible for executing a specific cleaning strategy on each piece of original data, and the calculation rules are described by the user in the Schema (database object set).
4. The distributed big data computing engine of claim 3, wherein: after preprocessing the data, the preprocessing topology sends a mirror copy of the data to the message intermediate cluster, so that the user can perform subsequent backup processing.
5. The distributed big data computing engine of claim 4, wherein: in the statistical topology, the intermediate calculation results of medium and small granularities are cached in the data caching cluster for the next larger-granularity calculation; meanwhile, the calculation result of each granularity is written to the message intermediate cluster, and the storage topology subscribes to it and performs the subsequent storage operation, thereby decoupling data calculation from data landing.
6. The distributed big data computing engine of claim 5, wherein: the data caching cluster caches the intermediate calculation results of each granularity for direct use in the calculation at the next time granularity, thereby reducing the volume of data to be processed.
7. The distributed big data computing engine of claim 6, wherein: the streaming computing cluster further comprises a baseline topology and/or an alarm topology.
8. An architectural method for the distributed big data computing engine of claim 7, comprising:
defining a source data format, packaging data in a uniform format, and identifying a data timestamp;
configuring an XML file of the concrete processing rules for each type of service data in the data source, the file describing the operation and processing rules for all data indexes and dimensions;
developing a data preprocessing plug-in by implementing the provided data preprocessing plug-in interface class, wherein the data preprocessing plug-in runs in the data preprocessing topology and is responsible for executing a specific cleaning strategy on each piece of original data;
developing, by the user, a custom operator plug-in by implementing the provided custom operator plug-in interface class; the custom operator plug-in runs in the data calculation topology and is responsible for implementing the custom operators that the user requires for processing the data indexes and dimensions; the custom operator plug-in receives a batch of data, calculates the data according to a custom calculation rule, and returns the result to the caller;
if MySQL is used as the final landing database, table creation and result data warehousing are completed automatically by the framework; if another landing scheme is needed, a data storage plug-in is developed by implementing the data storage plug-in interface class; the data storage plug-in runs in the data storage topology and is responsible for storing the calculated data.
9. The architectural method of a distributed big data computing engine of claim 8, wherein: the data storage plug-in receives the final calculation result data and stores the data according to its own service requirements.
10. The architectural method of a distributed big data computing engine of claim 9, wherein the method further comprises the following steps:
the user develops an extensible topology according to the user's own business requirements to define, process, or calculate data, and submits it to the engine as an independent computing topology, and the engine loads and runs it;
and configuring, in the app-ml file, the addresses of the basic dependency clusters and the key runtime control parameters of each computing topology, and starting each service topology through a script.
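The granularity caching recited in claims 5 and 6 can be sketched as follows, with hedged assumptions throughout: a plain dictionary stands in for the data caching cluster, a list stands in for the message intermediate cluster, and the function names, the 1-minute and 5-minute granularities, and the use of a sum are all hypothetical choices made for illustration. The sketch relies on the aggregate being decomposable (a sum of sums is still the sum); operators without that property would need different intermediate state.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[int, float]  # (timestamp in seconds, metric value)


def aggregate(events: List[Event], granularity_s: int) -> Dict[int, float]:
    """Sum raw events into windows of the given size, keyed by window start."""
    windows: Dict[int, float] = defaultdict(float)
    for ts, value in events:
        windows[ts - ts % granularity_s] += value
    return dict(windows)


def roll_up(cached: Dict[int, float], granularity_s: int) -> Dict[int, float]:
    """Build a coarser granularity from cached finer-grained subtotals."""
    windows: Dict[int, float] = defaultdict(float)
    for window_start, subtotal in cached.items():
        windows[window_start - window_start % granularity_s] += subtotal
    return dict(windows)


cache: Dict[str, Dict[int, float]] = {}     # stand-in for the data caching cluster
outbox: List[Tuple[str, int, float]] = []   # stand-in for the message cluster

events = [(0, 1.0), (30, 2.0), (60, 4.0), (290, 8.0), (300, 16.0)]

# Small granularity is computed from raw events and cached.
cache["1m"] = aggregate(events, 60)
# Larger granularity reuses the cached subtotals instead of the raw events.
cache["5m"] = roll_up(cache["1m"], 300)

# Every granularity's result is emitted for a separate storage step to consume.
for name, result in cache.items():
    for window_start, total in sorted(result.items()):
        outbox.append((name, window_start, total))
```

Here the 5-minute pass reads four cached subtotals rather than five raw events; with realistic event volumes the reduction is much larger, which is the effect claim 6 describes. Emitting results to `outbox` rather than writing them to a database directly mirrors the decoupling of calculation from landing in claim 5.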
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810919696.6A CN109189589B (en) | 2018-08-14 | 2018-08-14 | Distributed big data calculation engine and construction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109189589A (en) | 2019-01-11 |
CN109189589B (en) | 2020-08-07 |
Family ID: 64921282
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810919696.6A Active CN109189589B (en) | 2018-08-14 | 2018-08-14 | Distributed big data calculation engine and construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109189589B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109918229A (en) * | 2019-02-18 | 2019-06-21 | 国家计算机网络与信息安全管理中心 | The data-base cluster copy construction method and device of non-logging mode |
CN110189039A (en) * | 2019-06-04 | 2019-08-30 | 湖南智慧畅行交通科技有限公司 | Based on distributed charging pile Event processing engine |
CN110502559A (en) * | 2019-07-25 | 2019-11-26 | 浙江公共安全技术研究院有限公司 | A kind of data/address bus and transmission method of credible and secure cross-domain data exchange |
CN110716966A (en) * | 2019-10-16 | 2020-01-21 | 京东方科技集团股份有限公司 | Data visualization processing method and system, electronic device and storage medium |
CN110825604A (en) * | 2019-11-05 | 2020-02-21 | 北京博睿宏远数据科技股份有限公司 | Method, device, equipment and medium for monitoring user track and performance of application |
CN110955734A (en) * | 2020-02-13 | 2020-04-03 | 北京一流科技有限公司 | Distributed signature decision system and method for logic node |
CN111061715A (en) * | 2019-12-16 | 2020-04-24 | 北京邮电大学 | Web and Kafka-based distributed data integration system and method |
CN111221831A (en) * | 2019-12-26 | 2020-06-02 | 杭州顺网科技股份有限公司 | Computing system for real-time processing of advertisement effect data |
CN111752689A (en) * | 2020-06-22 | 2020-10-09 | 深圳鲲云信息科技有限公司 | Neural network multi-engine synchronous computing system based on data flow |
CN112256734A (en) * | 2020-10-20 | 2021-01-22 | 中国农业银行股份有限公司 | Big data processing method, device, system, equipment and storage medium |
CN112328684A (en) * | 2020-11-03 | 2021-02-05 | 浪潮云信息技术股份公司 | Method for synchronizing time sequence data to Kafka in real time based on OpenTsdb |
CN112351096A (en) * | 2020-11-04 | 2021-02-09 | 福建天泉教育科技有限公司 | Method and terminal for processing message in big data scene |
CN112363755A (en) * | 2020-11-20 | 2021-02-12 | 成都秦川物联网科技股份有限公司 | Low-coupling expansion business system based on plug-in engine injection |
CN112529632A (en) * | 2020-12-17 | 2021-03-19 | 深圳市欢太科技有限公司 | Charging method, device, system, medium and equipment based on stream engine |
CN112632127A (en) * | 2020-12-29 | 2021-04-09 | 国华卫星数据科技有限公司 | Data processing method for real-time data acquisition and time sequence of equipment operation |
CN112632091A (en) * | 2020-12-17 | 2021-04-09 | 平安普惠企业管理有限公司 | Index flow real-time calculation method, device, equipment and medium based on big data |
CN112817573A (en) * | 2019-11-18 | 2021-05-18 | 北京沃东天骏信息技术有限公司 | Method, apparatus, computer system, and medium for building streaming computing applications |
CN114090113A (en) * | 2021-10-27 | 2022-02-25 | 北京百度网讯科技有限公司 | Method, device and equipment for dynamically loading data source processing plug-in and storage medium |
CN114297172A (en) * | 2022-01-04 | 2022-04-08 | 北京乐讯科技有限公司 | Cloud-native-based distributed file system |
CN114443626A (en) * | 2020-11-06 | 2022-05-06 | 中国移动通信集团江西有限公司 | Index calculation method and device, storage medium and index calculation platform |
CN118382132A (en) * | 2024-06-21 | 2024-07-23 | 成都谐盈科技有限公司 | Component registration and cancellation method and system based on SCA and SRTF core frames |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104021194A (en) * | 2014-06-13 | 2014-09-03 | 浪潮(北京)电子信息产业有限公司 | Mixed type processing system and method oriented to industry big data diversity application |
CN106294439A (en) * | 2015-05-27 | 2017-01-04 | 北京广通神州网络技术有限公司 | A kind of data recommendation system and data recommendation method thereof |
CN107577805A (en) * | 2017-09-26 | 2018-01-12 | 华南理工大学 | A kind of business service system towards the analysis of daily record big data |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109918229B (en) * | 2019-02-18 | 2021-03-30 | 国家计算机网络与信息安全管理中心 | Database cluster copy construction method and device in non-log mode |
CN109918229A (en) * | 2019-02-18 | 2019-06-21 | 国家计算机网络与信息安全管理中心 | The data-base cluster copy construction method and device of non-logging mode |
CN110189039A (en) * | 2019-06-04 | 2019-08-30 | 湖南智慧畅行交通科技有限公司 | Based on distributed charging pile Event processing engine |
CN110502559A (en) * | 2019-07-25 | 2019-11-26 | 浙江公共安全技术研究院有限公司 | A kind of data/address bus and transmission method of credible and secure cross-domain data exchange |
CN110716966A (en) * | 2019-10-16 | 2020-01-21 | 京东方科技集团股份有限公司 | Data visualization processing method and system, electronic device and storage medium |
CN110825604B (en) * | 2019-11-05 | 2023-06-30 | 北京博睿宏远数据科技股份有限公司 | Method, device, equipment and medium for monitoring user track and performance of application |
CN110825604A (en) * | 2019-11-05 | 2020-02-21 | 北京博睿宏远数据科技股份有限公司 | Method, device, equipment and medium for monitoring user track and performance of application |
CN112817573A (en) * | 2019-11-18 | 2021-05-18 | 北京沃东天骏信息技术有限公司 | Method, apparatus, computer system, and medium for building streaming computing applications |
CN112817573B (en) * | 2019-11-18 | 2024-03-01 | 北京沃东天骏信息技术有限公司 | Method, apparatus, computer system, and medium for building a streaming computing application |
CN111061715A (en) * | 2019-12-16 | 2020-04-24 | 北京邮电大学 | Web and Kafka-based distributed data integration system and method |
CN111061715B (en) * | 2019-12-16 | 2022-07-01 | 北京邮电大学 | Web and Kafka-based distributed data integration system and method |
CN111221831A (en) * | 2019-12-26 | 2020-06-02 | 杭州顺网科技股份有限公司 | Computing system for real-time processing of advertisement effect data |
CN111221831B (en) * | 2019-12-26 | 2024-03-29 | 杭州顺网科技股份有限公司 | Computing system for processing advertisement effect data in real time |
CN110955734A (en) * | 2020-02-13 | 2020-04-03 | 北京一流科技有限公司 | Distributed signature decision system and method for logic node |
WO2021259040A1 (en) * | 2020-06-22 | 2021-12-30 | 深圳鲲云信息科技有限公司 | Data flow-based neural network multi-engine synchronous calculation system |
CN111752689B (en) * | 2020-06-22 | 2023-08-25 | 深圳鲲云信息科技有限公司 | Neural network multi-engine synchronous computing system based on data flow |
CN111752689A (en) * | 2020-06-22 | 2020-10-09 | 深圳鲲云信息科技有限公司 | Neural network multi-engine synchronous computing system based on data flow |
CN112256734A (en) * | 2020-10-20 | 2021-01-22 | 中国农业银行股份有限公司 | Big data processing method, device, system, equipment and storage medium |
CN112328684A (en) * | 2020-11-03 | 2021-02-05 | 浪潮云信息技术股份公司 | Method for synchronizing time sequence data to Kafka in real time based on OpenTsdb |
CN112351096A (en) * | 2020-11-04 | 2021-02-09 | 福建天泉教育科技有限公司 | Method and terminal for processing message in big data scene |
CN112351096B (en) * | 2020-11-04 | 2023-03-24 | 福建天泉教育科技有限公司 | Method and terminal for processing message in big data scene |
CN114443626A (en) * | 2020-11-06 | 2022-05-06 | 中国移动通信集团江西有限公司 | Index calculation method and device, storage medium and index calculation platform |
CN112363755A (en) * | 2020-11-20 | 2021-02-12 | 成都秦川物联网科技股份有限公司 | Low-coupling expansion business system based on plug-in engine injection |
CN112363755B (en) * | 2020-11-20 | 2022-08-16 | 成都秦川物联网科技股份有限公司 | Low-coupling expansion business system based on plug-in engine injection |
CN112632091A (en) * | 2020-12-17 | 2021-04-09 | 平安普惠企业管理有限公司 | Index flow real-time calculation method, device, equipment and medium based on big data |
CN112632091B (en) * | 2020-12-17 | 2023-10-20 | 重庆软江图灵人工智能科技有限公司 | Index flow real-time calculation method, device, equipment and medium based on big data |
CN112529632A (en) * | 2020-12-17 | 2021-03-19 | 深圳市欢太科技有限公司 | Charging method, device, system, medium and equipment based on stream engine |
CN112529632B (en) * | 2020-12-17 | 2024-04-23 | 深圳市欢太科技有限公司 | Charging method, device, system, medium and equipment based on stream engine |
CN112632127B (en) * | 2020-12-29 | 2022-07-15 | 国华卫星数据科技有限公司 | Data processing method for real-time data acquisition and time sequence of equipment operation |
CN112632127A (en) * | 2020-12-29 | 2021-04-09 | 国华卫星数据科技有限公司 | Data processing method for real-time data acquisition and time sequence of equipment operation |
CN114090113A (en) * | 2021-10-27 | 2022-02-25 | 北京百度网讯科技有限公司 | Method, device and equipment for dynamically loading data source processing plug-in and storage medium |
CN114090113B (en) * | 2021-10-27 | 2023-11-10 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for dynamically loading data source processing plug-in |
CN114297172B (en) * | 2022-01-04 | 2022-07-12 | 北京乐讯科技有限公司 | Cloud-native-based distributed file system |
CN114297172A (en) * | 2022-01-04 | 2022-04-08 | 北京乐讯科技有限公司 | Cloud-native-based distributed file system |
CN118382132A (en) * | 2024-06-21 | 2024-07-23 | 成都谐盈科技有限公司 | Component registration and cancellation method and system based on SCA and SRTF core frames |
Also Published As
Publication number | Publication date |
---|---|
CN109189589B (en) | 2020-08-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |