CN110209646A - A data platform system based on real-time stream computing - Google Patents

Info

Publication number
CN110209646A
CN110209646A CN201910397951.XA CN201910397951A CN110209646A CN 110209646 A CN110209646 A CN 110209646A CN 201910397951 A CN201910397951 A CN 201910397951A CN 110209646 A CN110209646 A CN 110209646A
Authority
CN
China
Prior art keywords
data
module
workflow
management
real
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910397951.XA
Other languages
Chinese (zh)
Inventor
钟证业
毕军
胡雨成
胥小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huitongda Network Co Ltd
Original Assignee
Huitongda Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-05-14
Filing date
2019-05-14
Publication date
2019-09-06
Application filed by Huitongda Network Co Ltd
Priority to CN201910397951.XA
Publication of CN110209646A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data platform system based on real-time stream computing. The system acquires and stores the behavioral data of predetermined data and converts that data into a series of short data streams, enabling high-throughput, fault-tolerant processing of real-time streaming data. By converting the behavioral data into a series of short batch jobs and parsing it to extract key information, the system turns large volumes of data into micro-batches and computes on them quickly in a distributed manner. This achieves high throughput and low latency, which meets the timeliness requirements of data delivery systems and suits latency-sensitive services.

Description

A data platform system based on real-time stream computing
Technical field
The present invention relates to data warehouse design technology for the computer Internet, and in particular to a data platform system based on real-time stream computing.
Background
Real-time stream computing segments a data stream into a series of short batch jobs and passes the segmented stream to a batch processing engine. The engine cleans, extracts, and transforms the data, and the processing results are kept in memory.
A traditional standardized data warehouse mainly relies on Oracle mechanisms such as scheduled tasks, triggers, and partitioned tables to run quasi-real-time statistical analysis on stored data. In both performance and functionality, it can no longer meet users' demands for timely and accurate data processing and analysis.
With the rapid development of the Internet, the big data era has arrived and data is growing explosively. Traditional data processing is no longer suited to analyzing massive data, and a new processing method is needed to handle complex business logic in real time.
Summary of the invention
The technical problem to be solved by the present invention is to establish a mature real-time stream computing system that can provide data services quickly, effectively, and accurately. To solve this problem, the present invention provides a data platform system based on real-time stream computing, comprising a data source module, a real-time data computing module, a data dashboard display module, and a job scheduling management platform;
Wherein, the data source module configures data source connection information and extracts the data that the business needs to compute and synchronize in real time;
The real-time data computing module handles real-time data computation and storage, as well as resource management and scheduling for each node;
The data dashboard display module is used for interactive analysis;
The job scheduling management platform handles workflow configuration, keeps the real-time computation scheduling tasks running normally, and schedules workflows on a timer, ensuring data accuracy.
The data source module includes a data volume module and a data origin module;
The data volume module determines whether the data needs to be initialized. When the data volume is below roughly 100,000 records (written as 10W+ in the original), the data only needs to be synchronized with a full-table load, and no initialization is required;
The data origin module covers structured data, for example Oracle databases, MySQL databases, structured files, and Hive tables, together with the business data to be computed and synchronized in real time, for example sales volume, sales orders, member counts, and voucher counts;
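The data volume rule above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the threshold constant reads "10W+" as roughly 100,000 rows, and the names `SyncMode` and `choose_sync_mode` are hypothetical.

```python
from enum import Enum

# Assumption: the text's "10W+" is read here as roughly 100,000 rows.
FULL_LOAD_THRESHOLD = 100_000


class SyncMode(Enum):
    FULL_TABLE = "full_table"                        # reload the whole table on every run
    INIT_THEN_INCREMENTAL = "init_then_incremental"  # initialize once, then sync increments


def choose_sync_mode(row_count: int, threshold: int = FULL_LOAD_THRESHOLD) -> SyncMode:
    """Pick a synchronization strategy from the source table's row count."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count < threshold:
        return SyncMode.FULL_TABLE
    return SyncMode.INIT_THEN_INCREMENTAL


small = choose_sync_mode(50_000)      # small table: full load each time
large = choose_sync_mode(1_200_000)   # large table: needs initialization first
```

Keeping the threshold as a parameter lets each data source tune the cut-over point without changing the rule itself.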
The real-time data computing module includes a real-time data synchronization module, a YARN distributed management system, a data storage module, and a data computation module;
The real-time data synchronization module includes a data volume module and a data origin module;
The functions of the data volume module and the data origin module are the same as those in the data source module;
The YARN distributed management system schedules the resources of each cluster node to achieve efficient resource management. For example, when data needs to be stored, the master node ResourceManager of the YARN distributed management system allocates resources appropriately according to the resource requests from the slave nodes (NodeManager);
The data storage module uses Spark SQL micro-batches to fetch data and synchronize it to HDFS (Hadoop Distributed File System) while holding it in memory; after processing, the final data resides on the nodes of the cluster;
The data computation module registers the data held in cluster memory as a temporary table and performs logical computation with SQL statements. For example, after data is extracted from the data source into Hive in real-time micro-batches, SQL statements filter it and join it with other tables for computation, yielding the required metric data;
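The "register a temporary table, then compute with SQL" pattern can be sketched as follows. This is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for Spark SQL, and the table names, province names, and the `PAID` status value are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for rows already pulled into memory by one micro-batch.
orders = [
    ("A001", "M1", 120.0, "PAID", "ORG1"),
    ("A002", "M2", 80.0, "CANCELLED", "ORG1"),
    ("A003", "M1", 200.0, "PAID", "ORG2"),
]
orgs = [("ORG1", "Jiangsu"), ("ORG2", "Anhui")]

# "Register" the in-memory rows as temporary tables so SQL can address them.
conn.execute(
    "CREATE TEMP TABLE orders_tmp "
    "(orderno TEXT, memberno TEXT, totalprice REAL, orderstatus TEXT, orgid TEXT)"
)
conn.execute("CREATE TEMP TABLE org_dim (orgid TEXT, province TEXT)")
conn.executemany("INSERT INTO orders_tmp VALUES (?, ?, ?, ?, ?)", orders)
conn.executemany("INSERT INTO org_dim VALUES (?, ?)", orgs)

# Filter the batch, join it with another table, and aggregate into a metric.
sales_by_province = conn.execute(
    """
    SELECT d.province, SUM(o.totalprice) AS sales
    FROM orders_tmp AS o
    JOIN org_dim AS d ON d.orgid = o.orgid
    WHERE o.orderstatus = 'PAID'
    GROUP BY d.province
    ORDER BY d.province
    """
).fetchall()
```

In a Spark deployment the same query shape would run against a registered DataFrame instead of a SQLite temporary table.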
The data dashboard display module includes an interactive analysis module, a report display module, and a permission control module;
The interactive analysis module performs logical computation on data sets and analyzes and processes them by different dimensions and themes. For example, the data of one table can be aggregated daily along the time dimension for statistics and metric calculation, or divided by province along the regional dimension, producing different data sets that back different reports;
The report display module presents data through different charts, for example bar charts, line charts, and text boxes.
The permission control module controls the permission to view and modify each report and grants the corresponding permissions to different business parties.
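Analysis by dimension, as described for the interactive analysis module, amounts to grouping the same rows by different keys. The sketch below is a hedged illustration with a hypothetical row layout and the helper name `aggregate_by`; it is not taken from the patent.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical row layout: (date "YYYY-MM-DD", province, amount).
Row = Tuple[str, str, float]


def aggregate_by(rows: Iterable[Row], key: Callable[[Row], str]) -> Dict[str, float]:
    """Sum the amount column per value of the chosen dimension."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


rows = [
    ("2019-05-13", "Jiangsu", 100.0),
    ("2019-05-13", "Anhui", 50.0),
    ("2019-05-14", "Jiangsu", 70.0),
]

by_day = aggregate_by(rows, key=lambda r: r[0])       # time dimension
by_province = aggregate_by(rows, key=lambda r: r[1])  # regional dimension
```

Each resulting dictionary plays the role of one data set that a report could be built on.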
The job scheduling management platform module is used for quasi-real-time scheduling and management of Spark tasks;
Quasi-real-time scheduling of Spark tasks comprises Workflow configuration, Coordinator configuration, and workflow monitoring management. Workflows can be scheduled in parallel or serially, and a failure re-run mechanism allows a failed workflow schedule to be executed again;
The Workflow configuration is used to schedule and manage Spark tasks. Each Spark task requires one job to be configured, and jobs of the same category are configured into one workflow. When multiple workflows are scheduled in parallel, the failure of one Spark task does not affect the other tasks running in the same workflow;
The Coordinator configuration manages the timed scheduling of workflows; it must specify the corresponding workflow, the scheduling time, and the scheduling frequency;
The workflow monitoring management monitors the status and run time of each Spark task.
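The two scheduling properties named above, failure isolation between jobs and re-running after a failure, can be sketched with a thread pool. This is a simplified model rather than Oozie's behavior; `run_workflow`, `rerun_failed`, and the job names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Job = Callable[[], None]


def run_workflow(jobs: Dict[str, Job]) -> Dict[str, str]:
    """Run every job of one workflow in parallel and record a status per job."""
    def run_one(job: Job) -> str:
        try:
            job()
            return "SUCCEEDED"
        except Exception:
            # The failure is recorded for this job only; sibling jobs keep running.
            return "FAILED"

    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = {name: pool.submit(run_one, job) for name, job in jobs.items()}
        return {name: future.result() for name, future in futures.items()}


def rerun_failed(jobs: Dict[str, Job], statuses: Dict[str, str]) -> Dict[str, str]:
    """Re-execute only the failed jobs and merge the new statuses."""
    failed = {name: jobs[name] for name, status in statuses.items() if status == "FAILED"}
    return {**statuses, **run_workflow(failed)} if failed else dict(statuses)


attempts = {"sync_orders": 0}


def sync_orders() -> None:
    attempts["sync_orders"] += 1
    if attempts["sync_orders"] == 1:
        raise RuntimeError("simulated transient failure")


def sync_members() -> None:
    pass


jobs = {"sync_orders": sync_orders, "sync_members": sync_members}
first = run_workflow(jobs)          # sync_orders fails, sync_members is unaffected
second = rerun_failed(jobs, first)  # only sync_orders is executed again
```

Catching the exception inside each job wrapper is what keeps one failure from cancelling its siblings.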
The data storage module uses Spark SQL micro-batches to fetch data and synchronize it to HDFS. The data synchronized to HDFS is a DataFrame (a structured data set), which can be registered as a temporary table and then operated on with SQL.
The YARN distributed management system is responsible for cluster resource management and scheduling: it distributes the data of the data source module across the nodes of HDFS and schedules the resources of each node.
The data stored in distributed form on HDFS is mapped to a GreenPlum external table.
The data of the GreenPlum external table undergoes logical computation through a stored procedure and is inserted into a GreenPlum internal table.
Performing logical computation on the data of the GreenPlum external table through a stored procedure and inserting it into the GreenPlum internal table comprises the following steps:
Step 1, extract the data of the data source module in micro-batches using Spark SQL;
Step 2, register the extracted data as a temporary table;
Step 3, insert the data of the temporary table into a Hive external table;
Step 4, map the data into a GreenPlum external table through the Hive external table;
Step 5, run a stored procedure on the GreenPlum external table to perform logical computation (implemented with SQL statements), and insert the result into a GreenPlum internal table;
Step 6, write the data of the internal table into a data set and display it in reports.
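The hand-off from an external (staging) table to an internal table in steps 4 to 6 can be sketched as below. The sketch substitutes Python's built-in `sqlite3` for GreenPlum, and a plain function stands in for the stored procedure; the table names `sales_external` and `sales_internal` and the `PAID` filter are assumptions for illustration.

```python
import sqlite3


def load_internal_table(conn: sqlite3.Connection) -> int:
    """Hypothetical 'stored procedure': compute on the staging table, fill the internal table."""
    conn.execute(
        """
        INSERT INTO sales_internal (orgid, order_count, total_sales)
        SELECT orgid, COUNT(*), SUM(totalprice)
        FROM sales_external
        WHERE orderstatus = 'PAID'
        GROUP BY orgid
        """
    )
    conn.commit()
    # Report how many result rows the internal table now holds.
    return conn.execute("SELECT COUNT(*) FROM sales_internal").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_external (orderno TEXT, orgid TEXT, totalprice REAL, orderstatus TEXT)"
)
conn.execute("CREATE TABLE sales_internal (orgid TEXT, order_count INTEGER, total_sales REAL)")

# Rows as they might arrive in the external table after the mapping step.
conn.executemany(
    "INSERT INTO sales_external VALUES (?, ?, ?, ?)",
    [
        ("A001", "ORG1", 120.0, "PAID"),
        ("A002", "ORG1", 80.0, "CANCELLED"),
        ("A003", "ORG2", 200.0, "PAID"),
        ("A004", "ORG1", 30.0, "PAID"),
    ],
)

loaded = load_internal_table(conn)
report_rows = conn.execute(
    "SELECT orgid, order_count, total_sales FROM sales_internal ORDER BY orgid"
).fetchall()
```

Separating the raw staging table from the computed internal table means reports only ever read rows that have already passed the business logic.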
The present invention has the following advantages:
Spark SQL acquires and stores the behavioral data of predetermined data and converts it into a series of short data streams, enabling high-throughput, fault-tolerant processing of real-time streaming data. Large volumes of data are converted into micro-batches and computed quickly in a distributed manner, achieving high throughput and low latency;
The YARN distributed management system schedules the resources of each cluster node so as to maximize resource utilization;
The distributed data storage system uses the disk space of every server over the network, combining the scattered storage resources into one virtual storage device. Data is stored dispersed across the network, which yields high resource efficiency and high security;
Real-time data computation converts data into RDD data sets and caches them in memory. Because the data sets are used frequently, this reduces the I/O operations, network transfers, and recomputation time for intermediate results, greatly speeds up the operations that use them, and achieves near-real-time data synchronization and computation;
The interactive analysis module can specify different dimensions and themes according to different requirements in order to produce the corresponding reports;
The permission control module controls permissions per user and per report. Permissions can be set on users and reports according to user needs, so that users see only the reports associated with them;
Job monitoring: the running status of a job can be viewed in real time by job, job name, and state (ready, succeeded, running, failed, alarm), and a job can be re-run with the reset button.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawings and specific embodiments, from which the above and other advantages of the invention will become clearer.
Fig. 1 is a structural diagram of the system of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, the invention discloses a data platform system based on real-time stream computing. It is a system for quasi-real-time data synchronization and computation, comprising a data source module, a real-time data computing module, a data dashboard display module, and a job scheduling management platform;
Data source module: data volume and data origin;
Real-time data computing module: real-time data synchronization module, YARN distributed management system, data storage, data computation;
Data dashboard display module: interactive analysis module, report display module, permission control module;
Job scheduling management platform: Workflow configuration, Coordinator configuration, workflow monitoring management;
1. Data source module
Configures data source connection information and extracts the data that the business needs to compute and synchronize in real time;
1.1 Data volume
When the data volume of a data source is small, all data can be extracted on every run for synchronization and computation. When the data volume is large, all data is first extracted, synchronized, and computed as an initialization step; in subsequent synchronization and computation, a look-back approach re-extracts the data of a recent period, achieving real-time data synchronization;
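The look-back approach can be sketched as a function that returns the time range to re-extract. This is a minimal sketch under assumptions: the one-day default mirrors the embodiment described later in the document, and the table name, the `updated_at` column, and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

Window = Tuple[datetime, datetime]


def extraction_window(now: datetime, initialized: bool,
                      look_back: timedelta = timedelta(days=1)) -> Optional[Window]:
    """Return the (start, end) range to re-extract, or None for a full extraction."""
    if not initialized:
        return None  # first run: extract everything as the initialization step
    return (now - look_back, now)


def build_query(table: str, time_column: str, window: Optional[Window]) -> Tuple[str, tuple]:
    """Build a parameterized SELECT; identifiers are assumed to be trusted, hypothetical names."""
    if window is None:
        return f"SELECT * FROM {table}", ()
    return (
        f"SELECT * FROM {table} WHERE {time_column} >= ? AND {time_column} < ?",
        window,
    )


now = datetime(2019, 5, 14, 12, 0)
init_query = build_query("orders", "updated_at", extraction_window(now, initialized=False))
sync_query = build_query("orders", "updated_at", extraction_window(now, initialized=True))
```

Re-reading an overlapping window on every run trades some repeated work for tolerance of late or corrected rows.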
1.2 Data origin
Data origins include structured data, for example Oracle databases, MySQL databases, structured documents, and Hive tables;
2. Real-time data computing module
Handles real-time data computation and storage, as well as resource management and scheduling for each node;
3. Real-time data synchronization module
The real-time data synchronization module uses Spark SQL micro-batches to process the subset of fields in the data source that the metrics require and synchronizes them to HDFS in real time.
4. YARN distributed management system
The YARN distributed management system consists of the ResourceManager and the ApplicationMaster. The ResourceManager is responsible for resource management and scheduling across the entire cluster, and the ApplicationMaster handles application-level matters such as task scheduling, task monitoring, and fault tolerance;
5. Data storage
Data storage consists of data caching and data persistence. Data is extracted to HDFS and cached in server memory, and the final data is stored on each node of the servers.
6. Data computation
The data cached in server memory is registered as a temporary table through a DataFrame, and complex logical computation is performed with SQL statements. Combining Hive external tables with GreenPlum external tables addresses both the slow loading of massive data and the data type inconsistencies between systems: data is mapped into a GreenPlum external table through a Hive external table, and a stored procedure is called to insert the data into an internal table of the GreenPlum result layer. For example, a stored procedure called on the GreenPlum external table performs the complex logical computation in SQL;
7. Data dashboard display module
Covers the interactive analysis module for data, the report display module, and the permission control module for reports;
8. Interactive analysis module
The interactive analysis module classifies the data in a data set by different dimensions and themes to complete the analysis;
9. Report display module
The report display module consists of two parts, report creation and report display. When creating a report from the metric data in a data set, the data can be dragged and dropped directly and displayed as bar charts, line charts, text boxes, and similar forms. The page refresh interval is set according to different requirements;
10. Permission control module
The permission control module sets each user's permission to view reports. A user can be restricted to viewing only the reports related to that user; other reports do not appear in the interface the user can access;
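The view restriction described for the permission control module can be sketched as a filter over the report catalogue. The grant table, user names, and report names below are all hypothetical, and the sketch covers viewing only.

```python
from typing import Dict, List, Set

# Hypothetical grant table: user -> names of the reports that user may view.
GRANTS: Dict[str, Set[str]] = {
    "east_sales": {"daily_sales", "member_growth"},
    "finance": {"daily_sales", "voucher_usage"},
}


def visible_reports(user: str, all_reports: List[str],
                    grants: Dict[str, Set[str]] = GRANTS) -> List[str]:
    """Return only the reports granted to the user, preserving catalogue order."""
    allowed = grants.get(user, set())
    return [name for name in all_reports if name in allowed]


catalogue = ["daily_sales", "member_growth", "voucher_usage"]
finance_view = visible_reports("finance", catalogue)
unknown_view = visible_reports("guest", catalogue)  # no grants: nothing is listed
```

Filtering the catalogue before rendering, rather than hiding items in the interface, keeps ungranted reports out of the response entirely.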
11. Job scheduling management platform
Handles workflow configuration, keeps the real-time computation scheduling tasks running normally, and schedules workflows on a timer, ensuring data accuracy.
12. Workflow configuration
The Workflow is configured in Oozie. Each job is configured with one Spark program, and multiple jobs can be configured in one workflow for concurrent processing. The data source connection package and the project jar package must be imported during configuration. When multiple jobs are processed concurrently, their states do not affect one another;
13. Coordinator configuration
The Coordinator is configured in Oozie: first select the workflow to be scheduled on a timer, then set the start time, end time, and frequency of the timed schedule;
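The three Coordinator settings (start time, end time, frequency) determine a sequence of trigger times. The generator below is a hedged sketch of that calculation, not Oozie's scheduler; treating the range as half-open, with the end time excluded, is an assumption of this example.

```python
from datetime import datetime, timedelta
from typing import Iterator


def scheduled_runs(start: datetime, end: datetime, frequency: timedelta) -> Iterator[datetime]:
    """Yield each nominal trigger time in [start, end) at the given frequency."""
    if frequency <= timedelta(0):
        raise ValueError("frequency must be positive")
    current = start
    while current < end:
        yield current
        current += frequency


runs = list(
    scheduled_runs(
        start=datetime(2019, 5, 14, 0, 0),
        end=datetime(2019, 5, 14, 1, 0),
        frequency=timedelta(minutes=15),
    )
)
```

With a 15-minute frequency over one hour, the sketch produces four trigger times, each of which would start one run of the selected workflow.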
14. Workflow monitoring management
Workflow monitoring management covers workflow running status and Spark task running status. After a workflow is started by the timed schedule, its Spark tasks begin to run. The running status and time of a Spark task can be looked up by workflow and task name, and the log files of the Spark task run can also be viewed;
In this embodiment, the development and deployment environments are as follows:
Development environment:
Scala version: 2.10.5
Spark: reuses the existing CDH component
IDE: IDEA
Deployment environment:
Scala version: 2.10.5
Spark version: 1.6.0
Zookeeper version: 3.4.5
MySQL version: 5.1.4
Oracle version: 10g
GreenPlum version: 5.7.0
The server deployment of the stream computing system is shown in Table 1:
Table 1
Embodiment
In this embodiment, suppose a table in an Oracle database holds 260 MB of data (the sales data of a company, as shown in Table 2),
Table 2
In Table 2, the first column ORDERNO is the order number, the second column MEMBERNO is the member number, the third column TOTALPRICE is the total price, the fourth column ORDERSTATUS is the order status, the fifth column ORGID is the number of the owning organization, and the sixth column OPERATORID is the operator number;
Through the data source module, configure the Oracle database connection information, including the user name, password, and Oracle driver. Because the data volume is large, reaching 1.2 million rows (120W), a data initialization process must be performed first, followed by the data synchronization process;
The data initialization process first uses Spark SQL micro-batches to extract all the data in Oracle. Through the data computation module, the extracted data is held in memory, converted into a data set by registering a temporary table, and inserted into Hive. Through the data storage module, the data is stored on each node of the servers, and the YARN distributed management system module allocates resources appropriately across the cluster nodes. Using the mapping feature of Hive external tables, the data is mapped and synchronized into a GreenPlum external table. A stored procedure is then called to apply the required logic to the external table data in SQL, including joins with other tables, conditional filtering, and logical computation, and the result is inserted into a GreenPlum internal table;
The data synchronization process likewise first uses Spark SQL micro-batches to extract data from Oracle, but extracts only the data from the current time back one day. The remaining steps are the same as in initialization, except that each run must first empty the Hive external table and the GreenPlum external table before inserting the new data. Each data synchronization run is one scheduling and must be dispatched through the job scheduling management platform; job scheduling requires configuring the jar package of the code to execute, the jar packages of the linked libraries, the allocation of scheduling resources, the time of the timed schedule, and so on. The processed result data goes to the interactive analysis module of the data dashboard display module, which uses data sets to process the data along different dimensions, for example the time dimension (classifying data by time period) or the regional dimension (classifying data by region). The permission control module controls the permissions of each report for different users, and the report display module presents the data visually.
The present invention provides a data platform system based on real-time stream computing. There are many methods and ways to implement this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the invention. Any component not specified in this embodiment can be implemented with existing technology.

Claims (10)

1. A data platform system based on real-time stream computing, characterized by comprising a data source module, a real-time data computing module, a data dashboard display module, and a job scheduling management platform;
Wherein, the data source module configures data source connection information and extracts the data that the business needs to compute and synchronize in real time;
The real-time data computing module handles real-time data computation and storage, as well as resource management and scheduling for each node;
The data dashboard display module is used for interactive analysis;
The job scheduling management platform handles workflow configuration, keeps the real-time computation scheduling tasks running normally, and schedules workflows on a timer, ensuring data accuracy.
2. The system according to claim 1, characterized in that the data source module includes a data volume module and a data origin module;
The data volume module determines whether the data needs to be initialized. When the data volume is below roughly 100,000 records (written as 10W+ in the original), the data only needs to be synchronized with a full-table load, and no initialization is required;
The data origin module covers structured data.
3. The system according to claim 2, characterized in that the real-time data computing module includes a real-time data synchronization module, a YARN distributed management system, a data storage module, and a data computation module;
The real-time data synchronization module includes a data volume module and a data origin module;
The functions of the data volume module and the data origin module are the same as those in the data source module;
The YARN distributed management system schedules the resources of each cluster node to achieve efficient resource management;
The data storage module uses Spark SQL micro-batches to fetch data and synchronize it to HDFS while holding it in memory; after processing, the final data resides on the nodes of the cluster;
The data computation module registers the data held in cluster memory as a temporary table and performs logical computation with SQL statements.
4. The system according to claim 3, characterized in that the data dashboard display module includes an interactive analysis module, a report display module, and a permission control module;
The interactive analysis module performs logical computation on data sets and analyzes and processes them by different dimensions and themes;
The report display module presents data through different charts;
The permission control module controls the permission to view and modify each report and grants the corresponding permissions to different business parties.
5. The system according to claim 4, characterized in that the job scheduling management platform module is used for quasi-real-time scheduling and management of Spark tasks;
Quasi-real-time scheduling of Spark tasks comprises Workflow configuration, Coordinator configuration, and workflow monitoring management. Workflows can be scheduled in parallel or serially, and a failure re-run mechanism allows a failed workflow schedule to be executed again;
The Workflow configuration is used to schedule and manage Spark tasks. Each Spark task requires one job to be configured, and jobs of the same category are configured into one workflow. When multiple workflows are scheduled in parallel, the failure of one Spark task does not affect the other tasks running in the same workflow;
The Coordinator configuration manages the timed scheduling of workflows; it must specify the corresponding workflow, the scheduling time, and the scheduling frequency;
The workflow monitoring management monitors the status and run time of each Spark task.
6. The system according to claim 5, characterized in that the data storage module uses Spark SQL micro-batches to fetch data and synchronize it to HDFS; the data synchronized to HDFS is a DataFrame structured data set, which can be registered as a temporary table and then operated on with SQL.
7. The system according to claim 6, characterized in that the YARN distributed management system is responsible for cluster resource management and scheduling, distributing the data of the data source module across the nodes of HDFS and scheduling the resources of each node.
8. The system according to claim 7, characterized in that the data stored in distributed form on HDFS is mapped to a GreenPlum external table.
9. The system according to claim 8, characterized in that the data of the GreenPlum external table undergoes logical computation through a stored procedure and is inserted into a GreenPlum internal table.
10. The system according to claim 9, characterized in that performing logical computation on the data of the GreenPlum external table through a stored procedure and inserting it into the GreenPlum internal table comprises the following steps:
Step 1, extract the data of the data source module in micro-batches using Spark SQL;
Step 2, register the extracted data as a temporary table;
Step 3, insert the data of the temporary table into a Hive external table;
Step 4, map the data into a GreenPlum external table through the Hive external table;
Step 5, run a stored procedure on the GreenPlum external table to perform logical computation, and insert the result into a GreenPlum internal table;
Step 6, write the data of the internal table into a data set and display it in reports.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910397951.XA CN110209646A (en) 2019-05-14 2019-05-14 A kind of data platform system calculated based on real-time streaming

Publications (1)

Publication Number Publication Date
CN110209646A true CN110209646A (en) 2019-09-06

Family

ID=67787178

Country Status (1)

Country Link
CN (1) CN110209646A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110908641A (en) * 2019-11-27 2020-03-24 中国建设银行股份有限公司 Visualization-based stream computing platform, method, device and storage medium
CN111198884A (en) * 2019-12-27 2020-05-26 福建威盾科技集团有限公司 Information processing method and information processing system for vehicle initial entering city
CN111400352A (en) * 2020-03-18 2020-07-10 北京三维天地科技股份有限公司 Workflow engine capable of processing data in batches
CN112561368A (en) * 2020-12-22 2021-03-26 绿瘦健康产业集团有限公司 Visual achievement calculation method and device of OA examination and approval system
CN112632114A (en) * 2019-10-08 2021-04-09 中国移动通信集团辽宁有限公司 Method and device for MPP database to quickly read data and computing equipment
CN113064704A (en) * 2021-03-18 2021-07-02 北京沃东天骏信息技术有限公司 Task processing method and device, electronic equipment and computer readable medium
CN115618194A (en) * 2022-12-19 2023-01-17 江苏未至科技股份有限公司 Spark-based data processing method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160364211A1 (en) * 2015-06-11 2016-12-15 Electronics And Telecommunications Research Institute Method for generating workflow model and method and apparatus for executing workflow model
CN107515927A (en) * 2017-08-24 2017-12-26 深圳市云房网络科技有限公司 A kind of real estate user behavioural analysis platform
CN108255855A (en) * 2016-12-29 2018-07-06 北京国双科技有限公司 Date storage method and device
CN108446145A (en) * 2018-03-21 2018-08-24 苏州提点信息科技有限公司 A kind of distributed document loads MPP data base methods automatically
CN108681569A (en) * 2018-05-04 2018-10-19 亚洲保理(深圳)有限公司 A kind of automatic data analysis system and its method
CN108984547A (en) * 2017-05-31 2018-12-11 北京京东尚科信息技术有限公司 The method and apparatus of data processing
CN109408546A (en) * 2018-10-17 2019-03-01 深圳中顺易金融服务有限公司 A kind of stream data processing method and processing device

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632114A (en) * 2019-10-08 2021-04-09 中国移动通信集团辽宁有限公司 Method and device for MPP database to quickly read data and computing equipment
CN112632114B (en) * 2019-10-08 2024-03-19 中国移动通信集团辽宁有限公司 Method, device and computing equipment for fast reading data by MPP database
CN110908641A (en) * 2019-11-27 2020-03-24 中国建设银行股份有限公司 Visualization-based stream computing platform, method, device and storage medium
CN110908641B (en) * 2019-11-27 2024-04-26 中国建设银行股份有限公司 Visualization-based stream computing platform, method, device and storage medium
CN111198884A (en) * 2019-12-27 2020-05-26 福建威盾科技集团有限公司 Information processing method and information processing system for vehicle initial entering city
CN111198884B (en) * 2019-12-27 2023-06-06 福建威盾科技集团有限公司 Method and system for processing information of first entering city of vehicle
CN111400352A (en) * 2020-03-18 2020-07-10 北京三维天地科技股份有限公司 Workflow engine capable of processing data in batches
CN112561368A (en) * 2020-12-22 2021-03-26 绿瘦健康产业集团有限公司 Visual achievement calculation method and device of OA examination and approval system
CN112561368B (en) * 2020-12-22 2023-08-01 广东壹健康健康产业集团股份有限公司 Visual performance calculation method and device for OA approval system
CN113064704A (en) * 2021-03-18 2021-07-02 北京沃东天骏信息技术有限公司 Task processing method and device, electronic equipment and computer readable medium
CN115618194A (en) * 2022-12-19 2023-01-17 江苏未至科技股份有限公司 Spark-based data processing method

Similar Documents

Publication Publication Date Title
CN110209646A (en) A kind of data platform system calculated based on real-time streaming
Hu et al. Time- and cost-efficient task scheduling across geo-distributed data centers
CN103092698B (en) Cloud computing application automatic deployment system and method
CN103930875B (en) Software virtual machine for acceleration of transactional data processing
CN107103064B (en) Data statistical method and device
CN111917887A (en) System for realizing data governance under big data environment
Isah et al. A scalable and robust framework for data stream ingestion
Ju et al. iGraph: an incremental data processing system for dynamic graph
CN103885986A (en) Main and auxiliary database synchronization method and device
CN105574082A (en) Storm based stream processing method and system
Huang et al. Yugong: Geo-distributed data and job placement at scale
CN106354729A (en) Graph data handling method, device and system
CN105930417B (en) A kind of big data ETL interactive process platform based on cloud computing
CN110704465B (en) Method, device and storage medium for processing service work list
CN103077192B (en) A kind of data processing method and system thereof
CN110308984A (en) It is a kind of for handle geographically distributed data across cluster computing system
CN106407231A (en) A data multi-thread export method and system
CN110134430A (en) A kind of data packing method, device, storage medium and server
CN105553732B (en) A kind of distributed network analogy method and system
CN102129443A (en) Real-time data transmission channel and method based on USAS (Univac Standard Airline Systems) host
CN113672240A (en) Container-based multi-machine-room batch automatic deployment application method and system
CN114756629B (en) Multi-source heterogeneous data interaction analysis engine and method based on SQL
Sathya et al. Application of Hadoop MapReduce technique to Virtual Database system design
CN108563787A (en) A kind of data interaction management system and method for data center's total management system
CN105824892A (en) Method for synchronizing and processing data by data pool

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination