CN116595071A - Data platform, data processing method, device, medium and equipment - Google Patents

Data platform, data processing method, device, medium and equipment

Info

Publication number
CN116595071A
Authority
CN
China
Prior art keywords
data
real
offline
time
bin
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211660131.3A
Other languages
Chinese (zh)
Inventor
李小刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Energy Chain Holding Co ltd
Original Assignee
Chezhubang Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chezhubang Beijing Technology Co Ltd
Priority to CN202211660131.3A
Publication of CN116595071A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/23 Updating
    • G06F16/2379 Updates performed during online database operations; commit processing
    • G06F16/26 Visual data mining; Browsing structured data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data platform, a data processing method, a medium and a device. A data warehouse system layers and integrates source data to obtain an offline data warehouse table and/or a real-time data warehouse table; a storage layer persists and stores the data of the offline data warehouse table and/or the real-time data warehouse table; a computing layer processes the data provided by the storage layer with different computing engines for the offline scenario and the real-time scenario; a service layer connects the computing layer and the application layer; and the application layer obtains the processed data from the service layer and reads, analyzes or displays it according to application requirements. The application thus provides a unified, integrated data platform suitable for both offline and real-time scenarios, and ensures both the completeness and the timeliness of the data.

Description

Data platform, data processing method, device, medium and equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular to a data platform, a data processing method, a data processing apparatus, a medium, and a device.
Background
With the development of computer technology, data has become a core asset for many enterprises. As enterprises operate more and more business systems and the data volume of each system keeps growing, how to provide a real-time, complete and integrated data platform is a technical problem to be solved.
Disclosure of Invention
In view of this, the present application provides a data platform, a data processing method, a data processing apparatus, a medium and a device, whose main purpose is to provide a real-time, complete and integrated data platform.
According to one aspect of the application, a data platform is provided. The data platform comprises a data warehouse system, a storage layer, a computing layer, a service layer and an application layer, wherein the data warehouse system is used for layering and integrating source data to obtain an offline data warehouse table and/or a real-time data warehouse table; the storage layer is used for accessing the data warehouse system to obtain the offline data warehouse table and/or the real-time data warehouse table, and for persisting and storing the data of those tables; the computing layer is used for processing the data provided by the storage layer with different computing engines for the offline scenario and the real-time scenario, wherein an in-memory distributed computing engine performs data structuring for the offline scenario and a real-time data computing engine performs data structuring for the real-time scenario; the service layer is used for connecting the computing layer and the application layer; and the application layer is used for obtaining the processed data from the service layer and reading, analyzing or displaying it according to application requirements.
According to one aspect of the present application, there is provided a data processing method based on a data platform, including: layering and integrating source data to obtain an offline data warehouse table and/or a real-time data warehouse table; obtaining the offline data warehouse table and/or the real-time data warehouse table, and persisting and storing the data of those tables; processing the data provided by the storage layer with different computing engines for the offline scenario and the real-time scenario, wherein an in-memory distributed computing engine performs data structuring for the offline scenario and a real-time data computing engine performs data structuring for the real-time scenario; and reading, analyzing or displaying the processed data according to application requirements.
According to one aspect of the present application, there is provided an offline-scenario data processing method based on a data platform, including: layering and integrating source data to obtain an offline data warehouse table; persisting and storing the data of the offline data warehouse table; performing data structuring with an in-memory distributed computing engine, wherein the in-memory distributed computing engine comprises a computing node, a resource manager and a job server, the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform and load the data of the offline data warehouse table to complete the data structuring; and reading, analyzing or displaying the processed data according to application requirements.
According to one aspect of the present application, there is provided a real-time-scenario data processing method based on a data platform, including: layering and integrating source data to obtain a real-time data warehouse table; persisting and storing the data of the real-time data warehouse table; performing data structuring with a real-time data computing engine, wherein the real-time data computing engine is a columnar database cluster, the columnar database cluster writes data to shards in columnar storage using a load-balancing strategy based on a sharding algorithm, and the data structuring is completed by establishing connections among the shards; and reading, analyzing or displaying the processed data according to application requirements.
According to an aspect of the present application, there is provided an offline-scenario data processing apparatus based on a data platform, comprising: a data warehouse unit, used for layering and integrating source data to obtain an offline data warehouse table; a data storage unit, used for obtaining the offline data warehouse table and persisting and storing its data; a data computing unit, used for performing data structuring with an in-memory distributed computing engine, wherein the in-memory distributed computing engine comprises a computing node, a resource manager and a job server, the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform and load the data of the offline data warehouse table to complete the data structuring; and a data application unit, used for reading, analyzing or displaying the processed data according to application requirements.
According to an aspect of the present application, there is provided a real-time-scenario data processing apparatus based on a data platform, comprising: a data warehouse processing unit, used for layering and integrating source data to obtain a real-time data warehouse table; a data storage unit, used for obtaining the real-time data warehouse table and persisting and storing its data; a data processing unit, used for performing data structuring with a real-time data computing engine, wherein the real-time data computing engine is a columnar database cluster, the columnar database cluster writes data to shards in columnar storage using a load-balancing strategy based on a sharding algorithm, and the data structuring is completed by establishing connections among the shards; and a data application unit, used for reading, analyzing or displaying the processed data according to application requirements.
According to one aspect of the present application there is provided a computer device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the above method.
According to an aspect of the present application there is provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above method when run.
With the above technical solution, the data platform, data processing method, medium and device provided by the embodiments of the application process data with different computing engines for the offline scenario and the real-time scenario: an in-memory distributed computing engine performs data structuring for the offline scenario, and a real-time data computing engine performs data structuring for the real-time scenario. This yields a unified, integrated data platform suitable for both scenarios. In the real-time scenario, data can be read or analyzed in real time, for example for real-time user portraits or large data dashboards; in the offline scenario, data reports can be presented and business logic can be processed, which ensures the completeness of the data. Because one platform serves both scenarios, the data of every business line can be integrated on it, and the data processing requirements of each business line in both the offline and the real-time scenario are met.
The foregoing description is only an overview of the technical solution of the present application. So that the technical means of the application can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the application are more readily apparent, specific embodiments of the application are set forth below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 shows a schematic diagram of a data platform architecture according to an embodiment of the present application;
FIG. 2 shows a schematic diagram of a data warehouse system according to an embodiment of the present application;
FIG. 3 shows a schematic diagram of the layering logic in a data warehouse system according to an embodiment of the present application;
FIG. 4 shows a schematic diagram of another data platform according to an embodiment of the present application;
FIG. 5 shows a flowchart of a data processing method based on a data platform according to an embodiment of the present application;
FIG. 6 shows a flowchart of an offline processing method based on a data platform according to an embodiment of the present application;
FIG. 7 shows a flowchart of a real-time processing method based on a data platform according to an embodiment of the present application;
FIG. 8 shows a schematic structural diagram of an offline processing device based on a data platform according to an embodiment of the present application;
FIG. 9 shows a schematic structural diagram of a real-time processing device based on a data platform according to an embodiment of the present application;
FIG. 10 shows a schematic structural diagram of a computer device according to an embodiment of the present application;
FIG. 11 shows a schematic structural diagram of another computer device according to an embodiment of the present application.
Detailed Description
So that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some embodiments of the present application, not all of them. All other embodiments that those skilled in the art can obtain from the embodiments of the present application without inventive effort fall within the scope of the present application. It should be noted that, where no conflict arises, the embodiments of the present application and the features of the embodiments may be combined with each other.
As analyzed above, data is a core asset of the enterprise. For example, an energy digitalization enterprise may have several business lines (business systems), such as group-purchase refueling, intelligent charging and business management, and each business line involves various business data. In existing general solutions, a separate data storage scheme is built for each business line, which raises construction cost, lengthens the construction cycle and lowers development efficiency. Moreover, because the storage schemes of the business lines are managed independently, extending or improving them must be done one by one, which adds labor and time cost. In this regard, the embodiments of the application provide a real-time, complete and integrated data platform: the data of each business line is extracted, cleaned and landed on the integrated data platform, so that the data is standardized.
Referring to FIG. 1, a schematic diagram of a data platform architecture according to an embodiment of the present application is shown. The data platform comprises a data warehouse system, a storage layer, a computing layer, a service layer and an application layer, wherein
the data warehouse system is used for layering and integrating source data to obtain an offline data warehouse table and/or a real-time data warehouse table;
the storage layer is used for accessing the data warehouse system to obtain the offline data warehouse table and/or the real-time data warehouse table, and for persisting and storing the data of those tables;
the computing layer is used for processing the data provided by the storage layer with different computing engines for the offline scenario and the real-time scenario, wherein an in-memory distributed computing engine performs data structuring for the offline scenario and a real-time data computing engine performs data structuring for the real-time scenario;
the service layer is used for connecting the computing layer and the application layer; for example, the service layer may be implemented with a BIServer, which can be used to control user and data permissions;
and the application layer is used for obtaining the processed data from the service layer and reading, analyzing or displaying it according to application requirements.
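The division of work between the computing layer and the service layer can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the function names, the `permissions` mapping and the placeholder engines are assumptions made for illustration and do not come from the application, which names Spark and ClickHouse only as examples later on.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, object]

def offline_engine(rows: List[Row]) -> List[Row]:
    # Placeholder for the in-memory distributed engine (batch processing).
    return [dict(row, engine="offline") for row in rows]

def realtime_engine(rows: List[Row]) -> List[Row]:
    # Placeholder for the real-time data computing engine.
    return [dict(row, engine="realtime") for row in rows]

# The computing layer picks an engine according to the scenario.
ENGINES: Dict[str, Callable[[List[Row]], List[Row]]] = {
    "offline": offline_engine,
    "realtime": realtime_engine,
}

def compute_layer(scenario: str, rows: List[Row]) -> List[Row]:
    if scenario not in ENGINES:
        raise ValueError(f"unknown scenario: {scenario}")
    return ENGINES[scenario](rows)

def service_layer(user: str, scenario: str, rows: List[Row],
                  permissions: Dict[str, Set[str]]) -> List[Row]:
    # The service layer sits between the application layer and the computing
    # layer and enforces a (hypothetical) per-user permission check.
    if scenario not in permissions.get(user, set()):
        raise PermissionError(f"{user} may not query the {scenario} scenario")
    return compute_layer(scenario, rows)

result = service_layer("analyst", "offline", [{"id": 1}], {"analyst": {"offline"}})
```

An application-layer caller only ever talks to `service_layer`; which engine runs underneath is decided by the scenario, not by the caller.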
The following first describes the data warehouse system.
A data warehouse, as the name implies, is a warehouse for storing data that integrates the data of the individual business systems. The data warehouse system can connect to various data sources to obtain source data, for example from a relational database (MySQL, Oracle and the like), a big data system (MaxCompute, Hive and the like), file data (Excel, CSV and the like) or an API, and it layers and integrates the source data by topic according to business requirements to obtain an offline data warehouse table or a real-time data warehouse table. The data warehouse system is not shown separately in FIG. 1; it is to be understood that the data warehouse system is logically implemented on both the storage layer and the computing layer.
Referring to FIG. 2, a schematic diagram of a data warehouse system according to an embodiment of the present application is shown.
As shown in FIG. 2, the data sources may include data source 1, data source 2, ..., data source n, representing different data sources. For example, the data sources may include, but are not limited to, business databases, relational databases (e.g., MySQL, Oracle, etc.) and logs (e.g., event-tracking logs, gateway logs, text logs, etc.). The data of the data warehouse system thus comes from multiple data sources, and source data may be stored differently in different data sources, so the source data must pass through the data warehouse layers, in a series of extraction, cleaning and conversion steps, to be integrated into a final data set.
Data from different data sources is layered and integrated as required; that is, it is generally processed around a particular business theme (topic). Data warehouse layering is generally based on business scenarios, and its benefits are as follows. It reduces repeated development: a middle layer is produced during data development, common logic is pushed down into it, and repeated computation is avoided. It expresses the data structure clearly: each layer has a well-defined scope, which makes it easy for developers to understand. It makes problems easy to locate: the layers expose the lineage of the data, so a problem can be traced back when it occurs. And it simplifies complex problems by decomposing them layer by layer.
In the embodiment of the application, the data warehouse system comprises an original data layer, a detail wide-table layer and an application summary layer. For the offline scenario, the data warehouse system performs incremental pulls and full pulls on the source data through the original data layer, so that the pulled data stays consistent with the source data; through the detail wide-table layer, it builds detail wide tables for the data of the different topics of the different business lines; and through the application summary layer, it summarizes the detail wide tables of the topics and generates a cross-topic summary table that serves as the offline data warehouse table. For the real-time scenario, the data warehouse system reads streaming data through the original data layer; through the detail wide-table layer, it associates the streaming data and then writes it into the detail wide tables of the different topics; and through the application summary layer, it summarizes the data in the detail wide tables to obtain a summary table that serves as the real-time data warehouse table.
As shown in FIG. 2, in an embodiment of the present application, the data warehouse layers may include an original data layer (ODS), a detail wide-table layer (CDM) and an application summary layer (ADS). The ODS layer stays consistent with the source data; for example, the source data can be synchronized through incremental pulls and full pulls. The CDM layer builds the detail wide tables of the data warehouse for the different topics of the different business lines. The ADS layer summarizes (aggregates) the data tables across business topics to obtain a summary table.
As shown in FIG. 2, the offline scenario and the real-time scenario are processed by different data engines, which ensures both the timeliness and the completeness of data processing. An example of an offline scenario is computing indicators such as the number of users, the retention rate of existing customers or the transaction repurchase rate over the last 180 days. An example of a real-time scenario is a real-time dashboard that continuously shows today's transaction turnover and today's new visitors; in this case the dashboard data must be updated within seconds.
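As a rough illustration of such an offline indicator, the following Python sketch computes a repurchase rate over a 180-day window. The definition used here (users with at least two orders in the window, divided by users with at least one order in the window) is an assumption made for the example, as are the function name and the sample orders; the application does not define the indicator.

```python
from collections import Counter
from datetime import date, timedelta

def repurchase_rate(orders, today, window_days=180):
    """Share of buying users who placed two or more orders within the window.

    `orders` is an iterable of (user_id, order_date) pairs (hypothetical shape).
    """
    start = today - timedelta(days=window_days)
    # Count orders per user, keeping only orders inside the window.
    counts = Counter(user for user, day in orders if start <= day <= today)
    if not counts:
        return 0.0
    repeat_buyers = sum(1 for n in counts.values() if n >= 2)
    return repeat_buyers / len(counts)

sample_orders = [
    ("u1", date(2022, 12, 1)),
    ("u1", date(2022, 12, 10)),
    ("u2", date(2022, 11, 5)),
    ("u3", date(2022, 1, 1)),   # older than 180 days, ignored
]
rate = repurchase_rate(sample_orders, today=date(2022, 12, 20))
```

A batch job can afford to rescan the full 180-day history on each run, which is why an indicator of this kind suits the offline path rather than the second-level dashboard path.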
In one implementation, for the offline scenario, an in-memory distributed computing engine (e.g., Spark) performs the data layering. An in-memory distributed computing engine has significant advantages in query speed, ease of use and complex analysis. In this implementation, the ODS layer performs incremental pulls and full pulls on the source data, so that the pulled data stays consistent with the source data; the CDM layer builds detail wide tables for the data of the different topics of the different business lines; and the ADS layer summarizes the detail wide tables of the topics to generate a cross-topic summary table.
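The difference between a full pull and an incremental pull at the ODS layer can be sketched in plain Python. The sketch assumes that every source row carries an `id` and a monotonically increasing `updated_at` value used as a watermark; both field names and both functions are hypothetical, and deletions in the source are not handled.

```python
def full_pull(source_rows):
    """Copy every source row into a fresh ODS table keyed by id."""
    return {row["id"]: dict(row) for row in source_rows}

def incremental_pull(source_rows, ods_table, watermark):
    """Upsert only rows changed after the watermark; return the new watermark."""
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            ods_table[row["id"]] = dict(row)
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark

source = [
    {"id": 1, "amount": 10, "updated_at": 1},
    {"id": 2, "amount": 20, "updated_at": 2},
]
ods = full_pull(source)                      # initial full synchronization
watermark = max(row["updated_at"] for row in source)

# The source changes: one row is updated and one row is added.
source[0] = {"id": 1, "amount": 15, "updated_at": 3}
source.append({"id": 3, "amount": 30, "updated_at": 4})
watermark = incremental_pull(source, ods, watermark)
```

After the incremental pull the ODS copy matches what a new full pull would produce, while only the two changed rows were written.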
In another implementation, for the real-time scenario, a stream-processing computing engine (e.g., Flink) performs the data layering. A stream-processing computing engine supports high throughput, low latency and high performance at the same time, which makes it suitable for real-time scenarios. In this implementation, the streaming data is read through the ODS layer; through the CDM layer, the streaming data is associated and then written into the detail wide tables of the different topics; and through the ADS layer, the data in the detail wide tables of the detail wide-table layer is summarized to obtain a summary table.
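The association of streaming data into a wide table can be sketched as a keyed join over an interleaved event stream. This is plain Python rather than Flink code, and the event kinds (`order`, `payment`), the join key `order_id` and the function name are assumptions for illustration; a real stream job would also bound the buffered state, for example with time windows.

```python
def associate(events):
    """Join order and payment events on order_id as they arrive.

    `events` yields (kind, payload) pairs; unmatched events are buffered
    until their counterpart shows up.
    """
    pending_orders = {}
    pending_payments = {}
    wide_rows = []
    for kind, event in events:
        key = event["order_id"]
        if kind == "order":
            payment = pending_payments.pop(key, None)
            if payment is None:
                pending_orders[key] = event       # wait for the payment
            else:
                wide_rows.append({**event, **payment})
        elif kind == "payment":
            order = pending_orders.pop(key, None)
            if order is None:
                pending_payments[key] = event     # wait for the order
            else:
                wide_rows.append({**order, **event})
    return wide_rows

stream = [
    ("order", {"order_id": 1, "user": "u1"}),
    ("payment", {"order_id": 2, "paid": 20}),
    ("payment", {"order_id": 1, "paid": 10}),
    ("order", {"order_id": 2, "user": "u2"}),
]
wide = associate(stream)
```

Each emitted row combines the fields of both events, which is the shape a detail wide table for an order topic would take.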
In summary, the ODS layer synchronizes source data, the CDM layer produces detail data, and the ADS layer summarizes data. The ODS layer mainly interfaces with the data source: the data of the data source is built into tables in the ODS layer as a complete copy. A data source usually contains several tables, and the ODS layer holds a corresponding table for each; as shown in FIG. 3, two tables are synchronized from the data source into the ODS layer. The CDM layer extracts, analyzes and aggregates the data according to the needs of the business topic to obtain detail wide tables; as shown in FIG. 3, the two tables in the ODS layer are refined into two tables and three tables, respectively, in the CDM layer. The ADS layer mainly aggregates the CDM-layer data: the data of every dimension is refined and counted to form a summary table, which can be understood as containing the refined statistics of all dimensions for a specific business topic. As shown in FIG. 3, the tables of the CDM layer are aggregated into one table.
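The three layers can be pictured with a small in-memory example. The sketch below is an assumption-laden illustration, not the platform's implementation: the tables (`source_orders`, `source_users`), their columns and the three `build_*` functions are hypothetical, and plain Python lists stand in for warehouse tables.

```python
from collections import defaultdict

# Two hypothetical source tables.
source_orders = [
    {"order_id": 1, "user_id": "u1", "amount": 30.0},
    {"order_id": 2, "user_id": "u2", "amount": 20.0},
    {"order_id": 3, "user_id": "u1", "amount": 50.0},
]
source_users = [
    {"user_id": "u1", "city": "Beijing"},
    {"user_id": "u2", "city": "Jinan"},
]

def build_ods(*tables):
    # ODS: a complete, unmodified copy of each source table.
    return [[dict(row) for row in table] for table in tables]

def build_cdm(ods_orders, ods_users):
    # CDM: one detail wide table per topic, here orders widened with user city.
    users = {user["user_id"]: user for user in ods_users}
    return [{**order, "city": users[order["user_id"]]["city"]} for order in ods_orders]

def build_ads(wide_orders):
    # ADS: a summary table aggregated over one dimension (city).
    totals = defaultdict(float)
    for row in wide_orders:
        totals[row["city"]] += row["amount"]
    return dict(totals)

ods_orders, ods_users = build_ods(source_orders, source_users)
summary = build_ads(build_cdm(ods_orders, ods_users))
```

Because the wide table is materialized once in the CDM step, further summaries (by user, by day and so on) can reuse it instead of repeating the join, which is the reuse benefit of layering.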
It can be seen that the data warehouse system provided by the embodiment of the application processes data with different computing engines for the offline scenario and the real-time scenario: an in-memory distributed computing engine performs the data layering for the offline scenario, and a stream-processing computing engine performs the data layering for the real-time scenario. This yields a unified, integrated data warehouse system suitable for both scenarios. In the real-time scenario, data can be read or analyzed in real time, for example for real-time user portraits or large data dashboards; in the offline scenario, data reports can be presented and business logic can be processed, which ensures the completeness of the data. Because one data warehouse system serves both scenarios, the data of every business line can be integrated in it, and the data processing requirements of each business line in both the offline and the real-time scenario are met.
Referring to FIG. 4, a schematic diagram of another data platform structure according to an embodiment of the present application is shown. Compared with FIG. 1, FIG. 4 shows specific examples of the storage system and the computing engines (Delta Lake, Spark and ClickHouse). It is to be understood that these examples are illustrative only and are not to be construed as limiting the embodiments of the present application.
In FIG. 4, the storage layer illustratively includes a data caching module and a data persistence module; for example, the caching module is implemented with Cassandra and the data persistence module is implemented with Delta Lake. Cassandra is an open-source distributed NoSQL database system. Its main characteristic is that it is not a single database but a distributed network service made up of a group of database nodes: a write to Cassandra is replicated to other nodes, and a read from Cassandra is routed to one of the nodes. A Cassandra cluster therefore scales well, since capacity is extended simply by adding nodes to the cluster. Delta Lake is an open-source storage framework that can support various query and compute engines. Delta Lake supports concurrent reads and writes from multiple pipelines and stores data in the Parquet format, which provides features such as high-performance compression; it supports both batch and streaming reads and writes of data, and it supports metadata evolution. In the embodiment of the application, Delta Lake is used for accessing the data warehouse system to obtain the offline data warehouse table and/or the real-time data warehouse table, and for persisting and storing the data of those tables.
In fig. 4, for an offline scene, spark is used as a memory distributed computing engine to perform data structuring processing on the offline scene. Spark is a memory distributed computing engine, great advantages exist in scenes such as query speed, usability and complex analysis, the Spark supports a Spark sql computing framework, and the embodiment of the application is mainly based on the Spark sql computing framework to support an offline scene.
In one implementation, the Spark system includes a resource scheduling manager (Spark Master), a computing node (Spark Worker), and a job server (Spark Job Server), where the resource scheduling manager schedules the resources of the computing node and the job server, the computing node reads the offline data warehouse table from the storage layer, and the job server configures one or more jobs (Job) that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing.
A Job is a computation job consisting of one or more scheduling stages. It is a parallel computation made up of multiple tasks and is usually triggered by a Spark action; a Job consists of multiple RDDs (Resilient Distributed Datasets) and the various operations acting on those RDDs. In the embodiment of the application, the data structuring process is completed by writing Spark SQL ETL statements in the Job. ETL (Extract-Transform-Load) describes the process of extracting (Extract), transforming (Transform), and loading (Load) data from a source end to a destination end. It can extract various distributed and heterogeneous source data (such as relational data), clean out dirty data such as incomplete, duplicate, and erroneous records according to predesigned rules, obtain clean data that meets requirements, and load the clean data to the application layer, where it serves as the basis for data reading, analysis, and display.
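The cleaning rules named above (incomplete, duplicate, and erroneous records) can be expressed as a single SQL statement. The following is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for Spark SQL so that it runs without a cluster; the table names `ods_orders` and `dwd_orders`, the columns, and the rule that a negative amount is erroneous are all assumptions for illustration and do not come from the application.

```python
import sqlite3


def run_etl(rows):
    """Extract raw rows, clean them with one SQL statement, and load the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ods_orders (order_id TEXT, user_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO ods_orders VALUES (?, ?, ?)", rows)  # extract
    conn.execute(
        "CREATE TABLE dwd_orders (order_id TEXT PRIMARY KEY, user_id TEXT, amount REAL)"
    )
    # Transform and load: drop incomplete rows (NULL fields), erroneous rows
    # (negative amounts, an assumed rule), and duplicates (same order_id).
    conn.execute(
        """
        INSERT INTO dwd_orders (order_id, user_id, amount)
        SELECT order_id, MIN(user_id), MIN(amount)
        FROM ods_orders
        WHERE order_id IS NOT NULL
          AND user_id IS NOT NULL
          AND amount IS NOT NULL
          AND amount >= 0
        GROUP BY order_id
        """
    )
    return conn.execute(
        "SELECT order_id, user_id, amount FROM dwd_orders ORDER BY order_id"
    ).fetchall()
```

In a Spark deployment a similar `INSERT ... SELECT` statement would be submitted as part of a Job; the filtering logic is the part this sketch is meant to show.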
In fig. 4, for the real-time scene, ClickHouse is used as the real-time data computing engine to perform data structuring processing.
ClickHouse is a columnar database management system (DBMS) for online analytical processing (OLAP). In overview, ClickHouse uses techniques such as data partitioning, columnar storage, a primary index (primary key index), secondary indexes (data skipping indexes), data compression, and data marks, and can therefore provide efficient data query and analysis. Each node in ClickHouse is a shard, and a weight can be configured for each shard; the greater the weight, the more data is allocated to the shard. When data is written through a ClickHouse distributed table, the overall flow is as follows: data is written to one shard (for example, shard 1); data belonging to this shard is written to its local table, and data belonging to other shards is first written to a temporary directory on this shard (for example, the data of the other shards is written to a temporary directory on shard 1); the shard establishes connections with the other shards in the cluster (for example, shard 1 connects to shard 2 and to shard 3); and the data written to the local temporary files is sent asynchronously to the other shards (for example, shard 1 sends the temporary directory data to shards 2 and 3).
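The write flow above can be traced with a small simulation. This is a toy model and not ClickHouse code: `ToyShard`, `ToyCluster`, and the explicit `flush` call (standing in for the asynchronous background send) are hypothetical. The shard is chosen by taking the sharding key modulo the total weight and finding the weight interval that contains the remainder, which is how ClickHouse documents weighted shard selection.

```python
from collections import defaultdict


class ToyShard:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.local_table = []              # rows owned by this shard
        self.temp_dir = defaultdict(list)  # staged rows, keyed by target shard name


class ToyCluster:
    def __init__(self, shards):
        self.shards = shards
        self.total_weight = sum(shard.weight for shard in shards)

    def owner(self, sharding_key):
        """Return the shard whose weight interval contains key % total_weight."""
        remainder = sharding_key % self.total_weight
        upper = 0
        for shard in self.shards:
            upper += shard.weight
            if remainder < upper:
                return shard

    def write(self, entry_shard, rows):
        """Rows arrive at one shard: keep its own rows, stage the rest locally."""
        for key, payload in rows:
            target = self.owner(key)
            if target is entry_shard:
                entry_shard.local_table.append((key, payload))
            else:
                entry_shard.temp_dir[target.name].append((key, payload))

    def flush(self, entry_shard):
        """Connect to the other shards and forward the staged rows to them."""
        by_name = {shard.name: shard for shard in self.shards}
        for name, staged in entry_shard.temp_dir.items():
            by_name[name].local_table.extend(staged)
        entry_shard.temp_dir.clear()
```

With weights 1, 2, and 1, shard 2 owns two of every four remainders and therefore receives about half of the rows, which matches the statement that a greater weight attracts more data.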
In one implementation, ClickHouse is used as a columnar database cluster, where the columnar database cluster writes data into the shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, and the data structuring processing is completed by establishing connections among the shards.
In order to provide an integrated data application, in one implementation, a unified integrated report platform is provided at the application layer, which can implement data reading, analysis, or presentation. The integrated report system displays reports for the data of the offline scene and displays the data of the real-time scene in real time, and it supports visual editing and custom report templates.
For offline scenes, a report can be customized according to service requirements by selecting different template tables or charts. For example, the offline data warehouse is connected through JDBC, the data is synchronized into Delta Lake, and the report is presented visually after Spark SQL ETL processing. Typical business scenarios include computing metrics such as user retention and gas station retention.
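As an illustration of the kind of offline metric mentioned above, the sketch below computes a simple retention rate. The application does not define retention, so the definition used here (the share of users active on a cohort day who are active again a fixed number of days later), the function name, and the event layout are assumptions.

```python
from datetime import date, timedelta


def retention_rate(events, cohort_day, offset_days):
    """Share of users active on cohort_day who are active again offset_days later."""
    cohort = {user for user, day in events if day == cohort_day}
    if not cohort:
        return 0.0
    target_day = cohort_day + timedelta(days=offset_days)
    returned = {user for user, day in events if day == target_day and user in cohort}
    return len(returned) / len(cohort)


# Hypothetical activity log: (user_id, active_day)
events = [
    ("u1", date(2022, 12, 1)),
    ("u2", date(2022, 12, 1)),
    ("u3", date(2022, 12, 1)),
    ("u4", date(2022, 12, 1)),
    ("u1", date(2022, 12, 2)),
    ("u2", date(2022, 12, 2)),
    ("u5", date(2022, 12, 2)),
]
print(retention_rate(events, date(2022, 12, 1), 1))  # 0.5
```

The same calculation could be written as a grouped self-join in SQL; the Python form is used only to keep the example self-contained.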
For real-time scenes, such as reading data on a large screen in real time, take a business line as an example: the screen displays indexes such as turnover, number of new customers, and profit for the current day, for each channel, and for each province. The data is processed by the real-time data warehouse and then by ClickHouse before being sent to the integrated report platform. The report platform can read the data through a direct JDBC connection, and the page updates the large screen at second-level intervals through a second-level refresh mechanism.
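The per-channel and per-province indexes described above amount to a grouped aggregation that the page re-runs on every refresh. The sketch below shows one possible shape of that aggregation in plain Python; the record fields (`day`, `channel`, `province`, `amount`, `cost`, `user_id`, `is_new_customer`) and the definition of profit as amount minus cost are assumptions, not details from the application.

```python
from collections import defaultdict


def dashboard_snapshot(orders, today):
    """Aggregate the current day's orders by (channel, province)."""
    cells = defaultdict(lambda: {"turnover": 0.0, "profit": 0.0, "new_users": set()})
    for order in orders:
        if order["day"] != today:
            continue  # only the current day is shown on the screen
        cell = cells[(order["channel"], order["province"])]
        cell["turnover"] += order["amount"]
        cell["profit"] += order["amount"] - order["cost"]
        if order["is_new_customer"]:
            cell["new_users"].add(order["user_id"])  # count each new user once
    return {
        key: {
            "turnover": cell["turnover"],
            "profit": cell["profit"],
            "new_customers": len(cell["new_users"]),
        }
        for key, cell in cells.items()
    }
```

A page with a second-level refresh would call such a function (or the equivalent query over JDBC) roughly once per second and redraw the result.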
In summary, the data platform provided by the embodiment of the application uses different computing engines for the offline scene and the real-time scene: an in-memory distributed computing engine performs data structuring processing for the offline scene, and a real-time data computing engine performs data structuring processing for the real-time scene. This yields a unified, integrated data platform applicable to both scenes. In the real-time scene, data can be read or analyzed in real time, for example for real-time portraits or large data screens; in the offline scene, data report presentation or business logic processing can be performed while the completeness of the data is ensured. Because the platform is common to both scenes, it can integrate the data of each business line and meet the data processing requirements of each business line in both the offline scene and the real-time scene.
Referring to fig. 5, a flowchart of a data processing method based on a data platform according to an embodiment of the present application is shown.
The data processing method based on the data platform comprises the following steps:
S501: layering and integrating the source data to obtain an offline data warehouse table and/or a real-time data warehouse table;
S502: obtaining the offline data warehouse table and/or the real-time data warehouse table, and performing data persistence processing on the data of the offline data warehouse table and/or the real-time data warehouse table and storing the data;
S503: for an offline scene and a real-time scene, adopting different computing engines to process the data provided by the storage layer, wherein for the offline scene, an in-memory distributed computing engine is adopted to perform data structuring processing, and for the real-time scene, a real-time data computing engine is adopted to perform data structuring processing;
S504: reading, analyzing, or displaying the processed data according to application requirements.
In one implementation, performing data structuring processing by using the in-memory distributed computing engine includes: the in-memory distributed computing engine comprises a computing node, a resource manager, and a job server, where the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing.
In one implementation, performing data structuring processing by using the real-time data computing engine includes: the real-time data computing engine is a columnar database cluster that writes data into shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, where connections are established among the shards to complete the data structuring processing.
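Steps S501 to S504 can be read as a four-stage pipeline in which only the third stage depends on the scene. The sketch below shows that control flow with placeholder stages; every function name is hypothetical, and the two engine stand-ins (sorting for the offline path, pass-through for the real-time path) are arbitrary placeholders rather than descriptions of Spark or ClickHouse behaviour.

```python
OFFLINE, REALTIME = "offline", "realtime"


def build_warehouse_table(source_rows, scene):
    """S501: layer and integrate source data into a warehouse table (stubbed)."""
    return {"scene": scene, "rows": [row for row in source_rows if row is not None]}


def persist(storage, table):
    """S502: store the table so that the computing step can read it back."""
    storage[table["scene"]] = table
    return storage[table["scene"]]


def offline_engine(rows):
    """Placeholder for the in-memory distributed engine (batch processing)."""
    return sorted(rows)


def realtime_engine(rows):
    """Placeholder for the real-time engine (keeps arrival order)."""
    return list(rows)


# S503: the scene decides which engine structures the data.
ENGINES = {OFFLINE: offline_engine, REALTIME: realtime_engine}


def process(source_rows, scene, storage):
    table = persist(storage, build_warehouse_table(source_rows, scene))  # S501, S502
    structured = ENGINES[scene](table["rows"])                           # S503
    return {"scene": scene, "report": structured}                        # S504
```

Keeping the engine choice in a lookup table means the surrounding steps stay identical for both scenes, which is the point of the unified platform described here.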
Referring to fig. 6, a flowchart of an offline processing method based on a data platform according to an embodiment of the present application is shown.
The offline scene data processing method based on the data platform comprises the following steps:
S601: layering and integrating the source data to obtain an offline data warehouse table;
S602: performing data persistence processing on the data of the offline data warehouse table and storing the data;
S603: performing data structuring processing by using an in-memory distributed computing engine, where the in-memory distributed computing engine comprises a computing node, a resource manager, and a job server, the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing;
S604: reading, analyzing, or displaying the processed data according to application requirements.
Referring to fig. 7, a flowchart of a real-time processing method based on a data platform according to an embodiment of the present application is shown.
The real-time scene data processing method based on the data platform comprises the following steps:
S701: layering and integrating the source data to obtain a real-time data warehouse table;
S702: performing data persistence processing on the data of the real-time data warehouse table and storing the data;
S703: performing data structuring processing by using a real-time data computing engine, where the real-time data computing engine is a columnar database cluster that writes data into shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, and the data structuring processing is completed by establishing connections among the shards;
S704: reading, analyzing, or displaying the processed data according to application requirements.
Referring to fig. 8, a schematic structural diagram of an offline processing device based on a data platform according to an embodiment of the present application is shown. The offline scene data processing device based on the data platform comprises:
a data warehouse unit 801, configured to layer and integrate source data to obtain an offline data warehouse table;
a data storage unit 802, configured to obtain the offline data warehouse table, perform data persistence processing on its data, and store the data;
a data computing unit 803, configured to perform data structuring processing by using an in-memory distributed computing engine, where the in-memory distributed computing engine includes a computing node, a resource manager, and a job server, the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing;
a data application unit 804, configured to read, analyze, or display the processed data according to application requirements.
Referring to fig. 9, a schematic structural diagram of a real-time processing device based on a data platform according to an embodiment of the present application is shown. The real-time scene data processing device based on the data platform comprises:
a data warehouse processing unit 901, configured to layer and integrate source data to obtain a real-time data warehouse table;
a data storage unit 902, configured to obtain the real-time data warehouse table, perform data persistence processing on its data, and store the data;
a data processing unit 903, configured to perform data structuring processing by using a real-time data computing engine, where the real-time data computing engine is a columnar database cluster that writes data into shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, and the data structuring processing is completed by establishing connections among the shards;
a data application unit 904, configured to read, analyze, or display the processed data according to application requirements.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes non-volatile and/or volatile storage media and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is for communicating with an external client via a network connection. The computer program is executed by a processor to perform functions or steps of a server side of a data processing method.
In one embodiment, a computer device is provided, which may be a client, the internal structure of which may be as shown in fig. 11. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external server via a network connection. The computer program, when executed by the processor, performs the functions or steps of the client side of a data processing method.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
(1) Layering and integrating the source data to obtain an offline data warehouse table and/or a real-time data warehouse table;
(2) Obtaining the offline data warehouse table and/or the real-time data warehouse table, and performing data persistence processing on the data of the offline data warehouse table and/or the real-time data warehouse table and storing the data;
(3) For an offline scene and a real-time scene, adopting different computing engines to process the data provided by the storage layer, wherein for the offline scene, an in-memory distributed computing engine is adopted to perform data structuring processing, and for the real-time scene, a real-time data computing engine is adopted to perform data structuring processing;
(4) Reading, analyzing, or displaying the processed data according to application requirements.
Or,
(1) Layering and integrating the source data to obtain an offline data warehouse table;
(2) Performing data persistence processing on the data of the offline data warehouse table and storing the data;
(3) Performing data structuring processing by using an in-memory distributed computing engine, where the in-memory distributed computing engine comprises a computing node, a resource manager, and a job server, the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing;
(4) Reading, analyzing, or displaying the processed data according to application requirements.
Or,
(1) Layering and integrating the source data to obtain a real-time data warehouse table;
(2) Performing data persistence processing on the data of the real-time data warehouse table and storing the data;
(3) Performing data structuring processing by using a real-time data computing engine, where the real-time data computing engine is a columnar database cluster that writes data into shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, and the data structuring processing is completed by establishing connections among the shards;
(4) Reading, analyzing, or displaying the processed data according to application requirements.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, performs the following steps:
(1) Layering and integrating the source data to obtain an offline data warehouse table and/or a real-time data warehouse table;
(2) Obtaining the offline data warehouse table and/or the real-time data warehouse table, and performing data persistence processing on the data of the offline data warehouse table and/or the real-time data warehouse table and storing the data;
(3) For an offline scene and a real-time scene, adopting different computing engines to process the data provided by the storage layer, wherein for the offline scene, an in-memory distributed computing engine is adopted to perform data structuring processing, and for the real-time scene, a real-time data computing engine is adopted to perform data structuring processing;
(4) Reading, analyzing, or displaying the processed data according to application requirements.
Or,
(1) Layering and integrating the source data to obtain an offline data warehouse table;
(2) Performing data persistence processing on the data of the offline data warehouse table and storing the data;
(3) Performing data structuring processing by using an in-memory distributed computing engine, where the in-memory distributed computing engine comprises a computing node, a resource manager, and a job server, the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing;
(4) Reading, analyzing, or displaying the processed data according to application requirements.
Or,
(1) Layering and integrating the source data to obtain a real-time data warehouse table;
(2) Performing data persistence processing on the data of the real-time data warehouse table and storing the data;
(3) Performing data structuring processing by using a real-time data computing engine, where the real-time data computing engine is a columnar database cluster that writes data into shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, and the data structuring processing is completed by establishing connections among the shards;
(4) Reading, analyzing, or displaying the processed data according to application requirements.
It should be noted that, the functions or steps implemented by the computer readable storage medium or the computer device may correspond to the relevant descriptions of the server side and the client side in the foregoing method embodiments, and are not described herein for avoiding repetition.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (15)

1. A data platform, characterized by comprising a data warehouse system, a storage layer, a computing layer, a service layer, and an application layer, wherein,
the data warehouse system is used for layering and integrating source data to obtain an offline data warehouse table and/or a real-time data warehouse table;
the storage layer is used for accessing the data warehouse system to obtain the offline data warehouse table and/or the real-time data warehouse table, and for performing data persistence processing on the data of the offline data warehouse table and/or the real-time data warehouse table and storing the data;
the computing layer is used for adopting different computing engines to process the data provided by the storage layer for an offline scene and a real-time scene, wherein for the offline scene, an in-memory distributed computing engine is adopted to perform data structuring processing, and for the real-time scene, a real-time data computing engine is adopted to perform data structuring processing;
the service layer is used for connecting the computing layer and the application layer;
the application layer is used for acquiring the processed data from the service layer and reading, analyzing, or displaying the processed data according to application requirements.
2. The data platform of claim 1, wherein the in-memory distributed computing engine comprises a computing node, a resource scheduling manager, and a job server, wherein the resource scheduling manager schedules resources of the computing node and the job server, the computing node reads the offline data warehouse table from the storage layer, and the job server configures one or more jobs (Job) that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing.
3. The data platform of claim 1, wherein the real-time data computing engine is a columnar database cluster, wherein the columnar database cluster writes data into shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, and the data structuring processing is completed by establishing connections among the shards.
4. The data platform according to any one of claims 1-3, wherein the data warehouse system obtains source data from a relational database, a big data system, file data, or an API interface, and performs topic-based layering and integration on the source data according to service requirements to obtain the offline data warehouse table or the real-time data warehouse table.
5. The data platform of claim 4, wherein the data warehouse system comprises a raw data layer, a detail wide-table layer, and an application summary layer;
for an offline scene, the data warehouse system performs incremental pulls and full pulls of the source data through the raw data layer, so that the pulled data is kept consistent with the source data; builds detail wide tables for the data of different topics of different business lines through the detail wide-table layer; and summarizes the detail wide tables of the topics through the application summary layer to generate a cross-topic summary table as the offline data warehouse table;
for a real-time scene, the data warehouse system reads streaming data through the raw data layer; associates the streaming data and then writes it into the detail wide tables of different topics through the detail wide-table layer; and summarizes the data in the detail wide tables of the detail wide-table layer through the application summary layer to obtain a summary table as the real-time data warehouse table.
6. The data platform according to any one of claims 1 to 3, wherein,
the application layer comprises an integrated report system, wherein the integrated report system displays reports for data of an offline scene and displays data of a real-time scene in real time, and the integrated report system supports visual editing and custom report templates.
7. A data processing method based on a data platform, comprising:
layering and integrating source data to obtain an offline data warehouse table and/or a real-time data warehouse table;
obtaining the offline data warehouse table and/or the real-time data warehouse table, and performing data persistence processing on the data of the offline data warehouse table and/or the real-time data warehouse table and storing the data;
for an offline scene and a real-time scene, adopting different computing engines to process the data provided by the storage layer, wherein for the offline scene, an in-memory distributed computing engine is adopted to perform data structuring processing, and for the real-time scene, a real-time data computing engine is adopted to perform data structuring processing;
and reading, analyzing, or displaying the processed data according to application requirements.
8. The method of claim 7, wherein performing data structuring processing with the in-memory distributed computing engine comprises:
the in-memory distributed computing engine comprises a computing node, a resource manager, and a job server, wherein the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing.
9. The method of claim 7, wherein performing data structuring processing with the real-time data computing engine comprises:
the real-time data computing engine is a columnar database cluster, the columnar database cluster writes data into shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, and the data structuring processing is completed by establishing connections among the shards.
10. An offline scene data processing method based on a data platform, characterized by comprising the following steps:
layering and integrating source data to obtain an offline data warehouse table;
performing data persistence processing on the data of the offline data warehouse table and storing the data;
performing data structuring processing by using an in-memory distributed computing engine, wherein the in-memory distributed computing engine comprises a computing node, a resource manager, and a job server, the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing;
and reading, analyzing, or displaying the processed data according to application requirements.
11. A real-time scene data processing method based on a data platform, characterized by comprising the following steps:
layering and integrating source data to obtain a real-time data warehouse table;
performing data persistence processing on the data of the real-time data warehouse table and storing the data;
performing data structuring processing by using a real-time data computing engine, wherein the real-time data computing engine is a columnar database cluster, the columnar database cluster writes data into shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, and the data structuring processing is completed by establishing connections among the shards;
and reading, analyzing, or displaying the processed data according to application requirements.
12. An offline scene data processing device based on a data platform, comprising:
a data warehouse unit, used for layering and integrating source data to obtain an offline data warehouse table;
a data storage unit, used for obtaining the offline data warehouse table, and performing data persistence processing on the data of the offline data warehouse table and storing the data;
a data computing unit, used for performing data structuring processing by using an in-memory distributed computing engine, wherein the in-memory distributed computing engine comprises a computing node, a resource manager, and a job server, the computing node reads the offline data warehouse table from the storage layer, the resource manager configures resources for the job server, and the job server configures one or more jobs that extract, transform, and load the data of the offline data warehouse table to complete the data structuring processing;
and a data application unit, used for reading, analyzing, or displaying the processed data according to application requirements.
13. A real-time scene data processing device based on a data platform, comprising:
a data warehouse processing unit, used for layering and integrating source data to obtain a real-time data warehouse table;
a data storage unit, used for obtaining the real-time data warehouse table, and performing data persistence processing on the data of the real-time data warehouse table and storing the data;
a data processing unit, used for performing data structuring processing by using a real-time data computing engine, wherein the real-time data computing engine is a columnar database cluster, the columnar database cluster writes data into shards in a columnar storage manner according to a load balancing strategy based on a sharding algorithm, and the data structuring processing is completed by establishing connections among the shards;
and a data application unit, used for reading, analyzing, or displaying the processed data according to application requirements.
14. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 7 to 11.
15. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 7 to 11 when the computer program is executed.
CN202211660131.3A 2022-12-23 2022-12-23 Data platform, data processing method, device, medium and equipment Pending CN116595071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211660131.3A CN116595071A (en) 2022-12-23 2022-12-23 Data platform, data processing method, device, medium and equipment

Publications (1)

Publication Number Publication Date
CN116595071A (en) 2023-08-15

Family

ID=87588588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211660131.3A Pending CN116595071A (en) 2022-12-23 2022-12-23 Data platform, data processing method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN116595071A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240411

Address after: Room 2101, Block B, Platinum Plaza, 5-15 Wenling Road, Laoshan District, Qingdao City, Shandong Province, 266100

Applicant after: Shandong Energy Chain Holding Co.,Ltd.

Country or region after: China

Address before: Room A221, floor 2, building 4, yard 1, yaojiayuan South Road, Chaoyang District, Beijing 100123

Applicant before: CHEZHUBANG (BEIJING) TECHNOLOGY Co.,Ltd.

Country or region before: China