CN114218218A - Data processing method, device and equipment based on data warehouse and storage medium - Google Patents

Data processing method, device and equipment based on data warehouse and storage medium

Info

Publication number
CN114218218A
CN114218218A (Application CN202111541687.6A)
Authority
CN
China
Prior art keywords
data
warehouse
layer
data table
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111541687.6A
Other languages
Chinese (zh)
Inventor
张洪姣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinao Shuneng Technology Co Ltd
Original Assignee
Xinao Shuneng Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinao Shuneng Technology Co Ltd filed Critical Xinao Shuneng Technology Co Ltd
Priority to CN202111541687.6A priority Critical patent/CN114218218A/en
Publication of CN114218218A publication Critical patent/CN114218218A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2282 Tablespace storage structures; Management thereof
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases

Abstract

The disclosure provides a data processing method, apparatus, device and storage medium based on a data warehouse. The method comprises the following steps: extracting source data to obtain data to be processed, storing the data to be processed into an operation data layer in the data warehouse, and generating an operation data table based on the data to be processed; transmitting the operation data table to a detail data layer in the data warehouse, and performing data cleaning and data conversion operations on the operation data table to obtain a detail data table; transmitting the detail data table to a summarized data layer in the data warehouse, and aggregating the detail data table to generate a summarized data table; and transmitting the summarized data table to an application data layer in the data warehouse, performing data processing on the summarized data table to obtain an application data table serving as a data processing object, and performing data processing based on the application data table. The method and the device can reduce the time consumed by data processing, improve data processing efficiency, facilitate problem location and troubleshooting, and make data processing results more accurate.

Description

Data processing method, device and equipment based on data warehouse and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, device, and storage medium based on a data warehouse.
Background
The data warehouse of a traditional big data platform is generally divided into a source data layer, a data warehouse layer, a data mart layer and a data application layer, where the data warehouse layer cleans the data from the source data layer and aggregates fields of the same granularity so that the data mart layer can make better use of the data. In practical applications, a data warehouse is mainly used for storing, cleaning and processing various kinds of data, so that the data required by a user can be provided based on the data processing results of the data warehouse.
In the prior art, when a data warehouse is used to process data, the data is only loosely layered rather than strictly divided into layers, so the data tables in the warehouse are disordered and the warehouse is hard to extend; because the layering is unclear, the upstream-downstream relationships between data tables are difficult to trace, and locating and troubleshooting problems backwards becomes very difficult. Therefore, data processing methods based on existing data warehouses suffer from long processing time, low processing efficiency, inconvenient problem location, inaccurate processing results and the like.
In view of the above problems in the prior art, it is desirable to provide a data processing scheme that reduces the time consumed by data processing, improves data processing efficiency, facilitates problem location and troubleshooting, and makes data processing results more accurate.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a data processing method, apparatus, device and storage medium based on a data warehouse, so as to solve the problems in the prior art that data processing is time-consuming, data processing efficiency is low, problem location is inconvenient, and data processing results are not accurate enough.
In a first aspect of the embodiments of the present disclosure, a data processing method based on a data warehouse is provided, including: performing an extraction operation on source data in a source data system to obtain data to be processed, storing the data to be processed into an operation data layer in the data warehouse, and generating an operation data table in the operation data layer based on the data to be processed; transmitting the operation data table to a detail data layer in the data warehouse, and performing data cleaning and data conversion operations on the operation data table in the detail data layer to obtain a detail data table; transmitting the detail data table to a summarized data layer in the data warehouse, performing an aggregation operation on the detail data table in the summarized data layer, and generating a summarized data table based on the result of the aggregation operation; and transmitting the summarized data table to an application data layer in the data warehouse, performing a data processing operation on the summarized data table in the application data layer to obtain an application data table serving as a data processing object, and performing data processing based on the application data table stored in the data warehouse.
In a second aspect of the embodiments of the present disclosure, a data processing apparatus based on a data warehouse is provided, including: an extraction module configured to perform an extraction operation on source data in a source data system to obtain data to be processed, store the data to be processed into an operation data layer in the data warehouse, and generate an operation data table in the operation data layer based on the data to be processed; a cleaning module configured to transmit the operation data table to a detail data layer in the data warehouse, and perform data cleaning and data conversion operations on the operation data table in the detail data layer to obtain a detail data table; an aggregation module configured to transmit the detail data table to a summarized data layer in the data warehouse, perform an aggregation operation on the detail data table in the summarized data layer, and generate a summarized data table based on the result of the aggregation operation; and a processing module configured to transmit the summarized data table to an application data layer in the data warehouse, perform a data processing operation on the summarized data table in the application data layer to obtain an application data table serving as a data processing object, and perform data processing based on the application data table stored in the data warehouse.
The embodiments of the present disclosure adopt at least one technical solution that can achieve the following beneficial effects:
an extraction operation is performed on source data in a source data system to obtain data to be processed, the data to be processed is stored into an operation data layer in a data warehouse, and an operation data table is generated in the operation data layer based on the data to be processed; the operation data table is transmitted to a detail data layer in the data warehouse, and data cleaning and data conversion operations are performed on the operation data table in the detail data layer to obtain a detail data table; the detail data table is transmitted to a summarized data layer in the data warehouse, an aggregation operation is performed on the detail data table in the summarized data layer, and a summarized data table is generated based on the result of the aggregation operation; and the summarized data table is transmitted to an application data layer in the data warehouse, a data processing operation is performed on the summarized data table in the application data layer to obtain an application data table serving as a data processing object, and data processing is performed based on the application data table stored in the data warehouse. The method and the device can reduce the time consumed by data processing, improve data processing efficiency, facilitate problem location and troubleshooting, and make data processing results more accurate.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present disclosure, and that those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a data warehouse-based data processing method provided by an embodiment of the present disclosure;
Fig. 2 is an architectural diagram of a reconstructed data warehouse provided by an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a data warehouse-based data processing apparatus provided by an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
As described above, the data warehouse of a conventional big data platform is generally divided into a source data layer, a data warehouse layer, a data mart layer and a data application layer; the data warehouse layer cleans the data of the source data layer (de-duplication, dirty-data handling and the like) and aggregates fields of the same granularity, so that the upper layers can make better use of the data. The data processing capability of the data warehouse is therefore important for reducing the time consumed by data processing, improving data processing efficiency and improving the accuracy of data processing results. The problems of performing data processing with a data warehouse in the prior art are described in detail below with reference to specific embodiments, and may specifically include the following:
data warehouses are built in data platforms, and are often referred to as offline data warehouses. The data warehouse is mainly used for extracting, storing, cleaning and processing a series of data such as various measurement data, business data, three-party data, user information, behavior logs, performance logs and the like, and finally provides required data for clients, data analysts, product managers, operation and maintenance teams, decision makers and the like in forms of reports, interfaces, ftp, databases and the like. Thereby providing data support for process improvement, product and performance improvement, cost reduction, quality monitoring, and decision making. The data processing capabilities of the data warehouse are therefore crucial to the user.
The existing data warehouse mainly has two versions. Data warehouse V1.0 mainly comprises four subject databases: dtl (detail data), fdm (basic data), adm (aggregate data) and opt (operational data); a warehouse with this structure can only handle scenarios with small data and business volumes, and cannot achieve the expected effect for complex data and business scenarios. From data warehouse V1.0 to data warehouse V2.0, the data warehouse in the original test cluster was rebuilt in the production environment, the configuration was optimized, the original business themes were kept, and two themes, index results and a general report data set, were added. Due to the above problems of the existing data warehouse, the following problems often arise when it is used for data processing:
1) Since no specification document was developed for the data warehouse and no relevant specifications exist, the warehouse can be used only by experience; the libraries and tables have multiplied and are completely disorganized.
2) The structure of the data warehouse is unreasonable: the whole warehouse only considers measuring-point data, while index data is scattered across the various business libraries. As services increase, horizontal and vertical expansion becomes difficult, and for newly introduced or newly created data it is unclear which database or table it should be added to, so the data is relatively disordered.
3) The subjects of the data warehouse are disordered; for example, the dtl database only stores source data, the fdm database is divided into 60 tables according to 60 devices, and the adm database is a hodgepodge containing all kinds of processed aggregate data, so both the data volume and the number of tables are huge.
4) The data is not layered, the upstream-downstream relationships of tables are hard to find, and backward tracing is very inconvenient when investigating problems.
5) There are no comments or documentation: databases, tables and columns are basically unannotated and hard for developers to use, so databases and tables are repeatedly built and repeatedly developed.
6) No permission function is set and accounts are shared, so accounts can log in without credentials and can create, insert, query and delete at will, which carries a high operational risk.
7) The configuration of the data warehouse has obvious problems and there is no strict usage pattern: double-clicking a huge partitioned table in a data query program can exhaust all the memory; in addition, there is no test environment for referencing and submitting jar packages, and so on.
In view of the problems existing in the existing data warehouse and its data processing, the data in the data warehouse is strictly divided into layers; the different layers have different functions and correspondingly process different data, which improves data processing efficiency, allows problems to be located accurately based on the data layers, and improves the working efficiency of developers. The optimized data warehouse and data processing method in the embodiments of the present disclosure can cope with increasingly complex services and requirements, provide data support for users better and faster, and improve the user experience. The following embodiments describe the data warehouse provided by the embodiments of the present disclosure from the following aspects: hardware configuration, architecture design, theme division, hierarchy division, permission management, queue management and metadata management.
Fig. 1 is a schematic flow chart of a data warehouse-based data processing method according to an embodiment of the present disclosure. The data warehouse-based data processing method of fig. 1 may be performed by a server. As shown in fig. 1, the data processing method based on the data warehouse may specifically include:
s101, extracting source data in a source data system to obtain data to be processed, storing the data to be processed into an operation data layer in a data warehouse, and generating an operation data table in the operation data layer based on the data to be processed;
s102, the operation data table is transmitted to a detail data layer in a data warehouse, and data cleaning and data conversion operation are carried out on the operation data table in the detail data layer to obtain the detail data table;
s103, the detail data table is transmitted to a summarized data layer in the data warehouse, aggregation operation is carried out on the detail data table in the summarized data layer, and the summarized data table is generated based on the result of the aggregation operation;
and S104, transmitting the summarized data table to an application data layer in the data warehouse, performing data processing operation on the summarized data table in the application data layer to obtain an application data table serving as a data processing object, and performing data processing based on the application data table stored in the data warehouse.
Specifically, the Data Warehouse (DW) of the embodiments of the disclosure is obtained by systematically processing, summarizing and organizing the originally dispersed database data after extraction and cleaning. The purpose of building a data warehouse is to provide a basis for front-end query and analysis, supporting functional analysis and decision-making. The embodiments of the present disclosure use the Hive data warehouse tool, which is mainly used to process structured data. The Hive data warehouse is generally divided into 4 levels, i.e., an ODS layer, a DWD layer, a DWS layer and an ADS layer, and each level is used to store a different type of table.
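For illustration only, a minimal HiveQL sketch of this four-level division is given below; the database names ods, dwd, dws and ads are assumptions chosen to mirror the layer names, not names prescribed by the disclosure.

```sql
-- One database per warehouse level (illustrative naming).
CREATE DATABASE IF NOT EXISTS ods;  -- operation data layer: raw extracts
CREATE DATABASE IF NOT EXISTS dwd;  -- detail data layer: cleaned detail tables
CREATE DATABASE IF NOT EXISTS dws;  -- summarized data layer: aggregated tables
CREATE DATABASE IF NOT EXISTS ads;  -- application data layer: report-facing tables
```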
Further, the source data system may be a MySQL database, a relational database management system that is an Oracle product. For web applications, MySQL is one of the best RDBMS (Relational Database Management System) applications; a relational database stores data in separate tables, which increases speed and flexibility. The SQL language used by MySQL is the most common standardized language for accessing databases.
According to the technical solution provided by the embodiments of the present disclosure, data to be processed is obtained by performing an extraction operation on source data in a source data system, the data to be processed is stored into an operation data layer in a data warehouse, and an operation data table is generated in the operation data layer based on the data to be processed; the operation data table is transmitted to a detail data layer in the data warehouse, and data cleaning and data conversion operations are performed on the operation data table in the detail data layer to obtain a detail data table; the detail data table is transmitted to a summarized data layer in the data warehouse, an aggregation operation is performed on the detail data table in the summarized data layer, and a summarized data table is generated based on the result of the aggregation operation; and the summarized data table is transmitted to an application data layer in the data warehouse, a data processing operation is performed on the summarized data table in the application data layer to obtain an application data table serving as a data processing object, and data processing is performed based on the application data table stored in the data warehouse. The method and the device can reduce the time consumed by data processing, improve data processing efficiency, facilitate problem location and troubleshooting, and make data processing results more accurate.
In some embodiments, performing an extraction operation on source data in a source data system to obtain data to be processed includes: for source data stored in a relational database, extracting the source data from the relational database by using a data warehouse scheduling platform to obtain the data to be processed; for source data stored in a distributed message queue, collecting log data from the distributed message queue with a log data collection component and sending it to a distributed file system, and extracting the log data in the distributed file system by using the data warehouse scheduling platform to obtain the data to be processed.
Specifically, the operation of extracting source data to obtain data to be processed mainly covers two cases. Case one: the source data is stored in a relational database, so the data to be processed needs to be extracted from the relational database. Case two: the source data is stored in a distributed message queue, so the data to be processed needs to be extracted from the distributed message queue. In practical applications, the data extracted into the data warehouse includes external data, business data, Internet-of-things data, logs and the like. If the data is stored in MySQL, it can be extracted into the data warehouse by means of DolphinScheduler; if the data is stored in Kafka, it is sent to Hadoop HDFS by means of Flume, completing the data extraction process.
Further, for the data extraction in case two, after the source data is sent to the Kafka distributed message system, a Flume data collection system (also referred to as a log collection system) is used to collect the measuring-point source data from Kafka and send it to a path in the HDFS file system of the big data cluster. After the HDFS file system receives the measuring-point source data, the data scheduling platform synchronizes it from HDFS into the Hive data warehouse tool. In practical applications, an ETL data processing method can be used to synchronize and process the source data; ETL refers to the process of extracting, cleaning, converting and loading data from a source end (such as a business system, log information and Internet-of-things data) to a destination end (typically a data warehouse). ETL acts as a bridge that transfers data from the distributed file system into the data warehouse. In actual practice, the databases, data warehouse tools or programming types employed by different systems may vary.
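As an illustration of the landing step in case two, the following hedged HiveQL sketch loads files that the collection component wrote to HDFS into a partitioned ODS table; the HDFS path, table name and schema are all assumptions.

```sql
-- Hypothetical ODS landing table for measuring-point data.
CREATE TABLE IF NOT EXISTS ods.ods_metric_point (
  device_id  STRING,
  point_code STRING,
  point_val  DOUBLE,
  event_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Move one day's files from the (assumed) Flume output path into the table.
LOAD DATA INPATH '/data/flume/metric_point/2021-12-16'
INTO TABLE ods.ods_metric_point PARTITION (dt = '2021-12-16');
```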
In some embodiments, storing the data to be processed into an operation data layer within the data warehouse and generating an operation data table based on the data to be processed in the operation data layer includes: synchronizing the data to be processed to the data warehouse by using the data warehouse scheduling platform, storing the data to be processed in the operation data layer, and mapping the structured data to be processed into an operation data table, where the operation data layer is used to store the incremental or full data corresponding to the data to be processed.
Specifically, when the source data is stored in a relational database (e.g., a MySQL database), ETL technology may be used to synchronize the preprocessed data from the MySQL database to the Hive data warehouse, and during synchronization the preprocessed data is processed with ETL to obtain the target data. ETL refers to the process of extracting, cleaning, converting and loading data from a source end (e.g., a business system, logs, Internet-of-things data, etc.) to a destination end (usually a data warehouse); it acts as a bridge that transfers data from the MySQL database into the data warehouse. In actual practice, the databases, data warehouse tools or programming types employed by different systems may vary.
Further, the data warehouse of the embodiments of the present disclosure is a new data warehouse built on the Hive data warehouse tool. Hive is a data warehouse tool based on Hadoop that can map structured data files to database tables and provide SQL-like query functionality. The UDF functions adopted in Hive are Hive user-defined functions: since Hive's built-in functions cannot fully meet the business requirements, user-defined functions are needed, and through them different processing logic can be applied to different requirements.
Further, the embodiments of the present disclosure synchronize the preprocessed data in the MySQL database directly to the ODS layer in the Hive data warehouse tool. The ODS (Operational Data Store) layer of Hive is the source data layer used for storing source data; original logs and data can be loaded directly from the MySQL database and kept in their original form without processing. In other words, the ODS layer can directly receive the source data from the database and either modify the table names of the structured source data or do nothing.
In some embodiments, performing data cleaning and data conversion operations on the operation data table in the detail data layer to obtain a detail data table includes: performing data cleaning on the operation data table by using an ETL data processing algorithm, desensitizing the cleaned operation data table, performing a dimension-reduction operation on the desensitized data table, and performing consistency processing on the data in the table to obtain a wide table, which is used as the detail data table.
Specifically, after the data is extracted to Hadoop HDFS by using DolphinScheduler or Flume, the data warehouse may partition the model tables (by day, month and year), compress them (at a compression ratio of about 10:1, e.g., 100 GB of data compressed to 10 GB) and store them column-oriented (ORC format), while Hadoop HDFS provides the storage for the data.
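A hedged sketch of such a table definition is shown below: day-level partitions, ORC column storage and compression are declared at table-creation time. The table name, columns and the SNAPPY codec are assumptions.

```sql
CREATE TABLE IF NOT EXISTS dwd.dwd_metric_point_detail (
  device_id  STRING,
  point_code STRING,
  point_val  DOUBLE,
  event_time TIMESTAMP
)
PARTITIONED BY (dt STRING)                   -- day-level partitions; month/year work the same way
STORED AS ORC                                -- column-oriented storage
TBLPROPERTIES ('orc.compress' = 'SNAPPY');   -- compression to shrink the data on HDFS
```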
Further, ETL data cleaning is performed by means of HQL, for example removing NULL values, filtering out meaningless data in core fields (e.g., rows whose system code is NULL), desensitizing sensitive data such as mobile-phone numbers and identity-card numbers, performing dimension degradation and dimension reduction (reducing JOIN operations) on the tables transmitted by the business servers, and performing data consistency processing on user behavior data and business data, finally forming a wide table that stores a large amount of detail data.
Further, after the source data is mapped into the operation data table (i.e., the ODS-layer table), the ODS-layer table is transmitted to the DWD layer in Hive, and the DWD (Data Warehouse Detail) layer performs data cleaning (such as removing null values and dirty data) and desensitization on the ODS-layer table to obtain the detail data table (i.e., the DWD-layer table), in which the detail data of the source data is stored.
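The cleaning and desensitization step might look like the following HiveQL sketch: NULL core fields are filtered, exact duplicates are removed with DISTINCT, and mobile numbers are masked. The table names, columns and masking rule are assumptions, not the disclosure's own code.

```sql
INSERT OVERWRITE TABLE dwd.dwd_user_detail PARTITION (dt = '2021-12-16')
SELECT DISTINCT                       -- simple de-duplication
  user_id,
  -- desensitize the mobile number: keep the first 3 and last 4 digits
  regexp_replace(mobile, '(\\d{3})\\d{4}(\\d{4})', '$1****$2') AS mobile,
  user_name
FROM ods.ods_user
WHERE dt = '2021-12-16'
  AND user_id IS NOT NULL             -- drop rows whose core field is NULL
  AND user_id != '';                  -- and rows whose core field is meaningless
```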
In some embodiments, performing an aggregation operation on the detail data table in the summarized data layer and generating a summarized data table based on the result of the aggregation operation includes: aggregating the detail data tables with an ETL data processing algorithm according to a preset aggregation granularity and aggregation rule, so that the detail data tables are aggregated into a summarized data table, which stores the detail data corresponding to the same aggregation granularity.
Specifically, after the detail data table is obtained by processing the data in the detail data layer (i.e., the DWD layer) of the data warehouse, the detail data table is transmitted to the summarized data layer (i.e., the DWS layer), which further processes it to obtain the summarized data table. In practical applications, DWS-level data tables are used to store data summarized and aggregated at a given granularity.
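For example, rolling the detail table up to one row per device per hour could be sketched as follows; the hour granularity and all names are illustrative assumptions.

```sql
INSERT OVERWRITE TABLE dws.dws_metric_point_hour PARTITION (dt = '2021-12-16')
SELECT
  device_id,
  date_format(event_time, 'yyyy-MM-dd HH') AS stat_hour,  -- hour-level granularity
  avg(point_val)                           AS avg_val,
  max(point_val)                           AS max_val,
  count(1)                                 AS sample_cnt
FROM dwd.dwd_metric_point_detail
WHERE dt = '2021-12-16'
GROUP BY device_id, date_format(event_time, 'yyyy-MM-dd HH');
```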
Furthermore, the Hive data warehouse has different business themes; that is, the warehouse is divided into different themes according to the business, different themes correspond to different databases, and different themes correspond to different aggregation granularities. When detail data is summarized in the DWS layer, it can be summarized by business theme. The theme division and database naming specification in the embodiments of the present disclosure are described in detail below with reference to specific embodiments, and may specifically include the following:
First, the purpose of theme division is to provide normative guidance for warehouse developers, so that the theme corresponding to a piece of data can be found according to its business and field, making the data more convenient and faster to use.
Second, the theme division method integrates, classifies, analyzes and utilizes the data in the enterprise information system at a higher level to obtain abstract concepts; each theme then corresponds roughly to a macroscopic analysis field. For example, a business analysis field may be quality monitoring, operation and maintenance services, or operation and maintenance data related to products, or it may be the project and delivery information of a rapid delivery platform.
In addition, a database can be created for each business theme, and the database naming specification is as follows:
First, names may consist of English words or Chinese pinyin joined by underscores, and must be all lowercase;
Second, when segments are joined by underscores, there should be no more than 3 segments, each segment should correspond to a definite business meaning and be simple and easy to understand, each segment should not exceed 8 characters, and the overall length should not exceed 26 characters;
Third, if the length exceeds these limits, abbreviations of the English words or Chinese pinyin may be used.
For example, if the data is operation and maintenance data related to service operation, the business theme is the service operation theme and the database is named "vops"; if the data is the project and delivery information of the rapid delivery platform, the business theme is the delivery platform theme and the database is named "fast_delivery_platform".
In some embodiments, performing a data processing operation on the summarized data table in the application data layer to obtain an application data table used as a data processing object includes: processing the summarized data table stored in the application data layer with an ETL data processing algorithm according to preset report content to generate the application data associated with the report content, and generating the application data table from the application data.
Specifically, the summarized data table obtained after summarization is transmitted to the application data layer (i.e., the ADS layer) for further processing, generating the ADS-layer table, which is the object queried by the report platform. That is, in subsequent report development, the data required for the report comes from the application data layer of the data warehouse; the ADS layer is therefore an application-oriented data service layer.
Further, after the application data table is obtained, reports can be developed with a report tool (such as a universal report platform) based on the developed model, and data analysis, data mining and the like can be performed on the report result data, creating value for the enterprise. The report development process using the application data table is described in detail below with reference to specific embodiments.
In some embodiments, performing data processing based on the application data table stored in the data warehouse includes: executing a data query operation on the application data table in the data warehouse by using a preconfigured report development tool and a query data set within it, so as to obtain target data from the application data table and generate a report according to the target data and a preset report configuration.
Specifically, because the ETL-processed table data is stored in the application data layer while the report development stage is completed in the universal report platform (that is, Hive is only used to store data and is not responsible for developing reports), the universal report platform needs to call the data in Hive to develop reports. The data sets of the universal report platform contain SQL scripts for data queries; through the Presto query engine, the platform queries the processed application data in the Hive data warehouse tool to obtain the target data for report development, and then develops reports based on that target data.
Furthermore, although the data for the relevant reports is stored in Hive, it is in a plainly arranged data format that cannot be displayed in reports directly, and report development does not require all the data in a Hive database table; therefore, the data query scripts in the report platform are used to query just the required data, obtaining the target data for report development.
Further, after the target data is obtained with the Presto query engine, the report is custom-developed through the custom plug-in function of the report platform. The custom plug-in executes the report development operation using the preset report configuration and the obtained target data, generating the report finally used for data statistics and analysis.
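The kind of SQL a report data set might submit through Presto against the ADS layer is sketched below; the table, columns and paging limit are assumptions.

```sql
SELECT
  stat_date,
  device_id,
  avg_val
FROM ads.ads_device_usage_day
WHERE stat_date BETWEEN '2021-12-01' AND '2021-12-16'
ORDER BY stat_date, device_id
LIMIT 1000;  -- reports page their data, so bound the result set
```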
The above embodiments describe the data-warehouse-based data processing process in detail. The embodiments of the present disclosure also provide a reconstructed data warehouse (hereinafter simply "the data warehouse"); its architecture and content are described in detail below with reference to the accompanying drawings and specific embodiments, namely in terms of hardware configuration, architecture design, theme division, hierarchy division, permission management, queue management, metadata management and so on. Fig. 2 is an architectural diagram of the reconstructed data warehouse provided by an embodiment of the present disclosure. As shown in fig. 2, the architecture of the reconstructed data warehouse may specifically include the following:
In the overall architecture, the data warehouse is a system that extracts, stores, converts and processes source-system data and then delivers it to dimensional data stores, providing functional analysis and decision support. The data warehouse is the foundation and core of BI, and ETL is the core and soul of the data warehouse. The data warehouse is designed in 4 levels (ODS, DWD, DWS, ADS); the corresponding level must be specified when a table is built, and tables must be created according to their business meaning.
The database tables are classified into conventional tables, test tables, intermediate tables, snapshot tables and export tables; their specifications are described in detail below and may specifically include the following:
First, conventional table: {database_name}.[sub_topic_]{ods/dwd/dws/ads}_{business scenario or custom tag}[_data granularity], for example ads_use_minute (common report model: ADS layer, usage, 15-minute granularity) or cim.ods_device_base_info (cim database, ODS layer, device information table), where [ ] denotes content that may be omitted and { } denotes content that may not be omitted.
Second, test table: in theory, creating test tables in the production environment is not allowed, but because there is currently no test environment, a test table may be created there temporarily and must be deleted immediately after the test is finished. Specification: {table_name}_test.
Third, intermediate table: stores intermediate result tables. Specification: {table_name}_mid.
Fourth, snapshot table: must be a partitioned table that stores the history of the source data at partition granularity. When a table needs to keep the latest data, a partitioned table must be created first, and the data of the latest partition is then imported into the snapshot table each time (see the sketch after this list). Specification: {table_name}_snapshot.
Fifth, export table: when some data needs to be exported to another database (such as MySQL) for use, the export table is created with the specification: {table_name}_export.
In the field naming specification, fields in database tables must be created according to their business meaning. The field naming specification is as follows:
First, names may consist of English words or Chinese pinyin joined by underscores, and must be all lowercase.
Second, when segments are joined by underscores, each segment should not exceed 8 characters and the overall length should not exceed 35 characters.
Third, if the length exceeds these limits, abbreviations of the English words or Chinese pinyin may be used.
Fourth, the suggested data types must be used, to avoid problems such as incompatibility and precision loss.
The Hive data types are divided into suggested types and types whose use is discouraged. The suggested data types include, but are not limited to: integer: INT (BIGINT); floating point: DOUBLE; character: STRING; date: DATE; time: TIMESTAMP; composite: ARRAY, MAP, STRUCT, etc. The discouraged data types include, but are not limited to: the BOOLEAN type, which is not recommended because of potential compatibility and scalability issues and can be replaced by STRING; and the DECIMAL type, which should be used carefully and only in scenarios with extremely strict precision requirements, and can otherwise be replaced by DOUBLE.
The data warehouse often needs to import and export tables. When data is synchronized, the usage principle for the corresponding field specification is that the data types after import and export remain consistent with those in the original table: integer to integer, floating point to floating point, character to character, date to date, and so on.
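Applying the suggested types above, a table definition might look like the following sketch; the table and columns are assumptions. Note the STRING flag standing in for BOOLEAN and DOUBLE standing in for DECIMAL.

```sql
CREATE TABLE IF NOT EXISTS dwd.dwd_order_detail (
  order_id    BIGINT,            -- integer: BIGINT
  amount      DOUBLE,            -- DOUBLE rather than DECIMAL unless precision is critical
  is_valid    STRING,            -- STRING flag ('0'/'1') in place of BOOLEAN
  order_date  DATE,
  create_time TIMESTAMP,
  tags        ARRAY<STRING>      -- composite type
);
```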
In terms of hardware configuration, the whole data warehouse is built on CDH 5.13.3; the HDFS NameNode has 1 active node and 1 standby node, there are 11 DataNodes, and there is 21.3 TB of storage space. Hive is configured with a virtual IP for high availability, and the ResourceManager of the resource manager YARN has an active node and a standby node.
In terms of hierarchy division, the division of levels and the naming of tables are standardized. The purpose of layering the data warehouse is to simplify complex things, reduce dependence on the original data, reduce the access pressure on the original database, make the data more uniform and complete, and make problems easier to troubleshoot and locate.
Further, the workflow specification, work node specification and scheduling specification in the data warehouse scheduling platform (DolphinScheduler) are introduced as follows:
① Workflow name: use popular, easy-to-understand Chinese that includes the theme, business scenario, granularity and so on of the workflow. For example: collection and analysis, index_index calculation results warehousing_day level, measuring point_Internet-of-things data loss rate_hour level, operation_operation value report_day level, operation and maintenance_operation and maintenance log.
② Workflow description: based on the workflow name, describe the workflow in more detail in popular, easy-to-understand Chinese, including the creator, creation time, updater, update time, update functions and so on. For example: workflow name: index_index calculation results warehousing_day level; workflow description: synchronize the day-level index calculation results into the warehouse; creator: XXX; creation time: 2020-06-01; updater: XXX; update time: 2020-08-20; update function: disabled the two node tasks that synchronize data to the MySQL report library.
③ Workflow alarm: enable the workflow alarm and select a reasonable time-limit range.
④ Work node name: use popular, easy-to-understand Chinese to describe the function performed by each node. For example: synchronize_CIM database to the warehouse, clean_measuring-point table dirty data_day level, process_CIM intermediate result wide table.
⑤ Work node description: same as the workflow description.
⑥ Task priority: the default is "MEDIUM" and can be adjusted according to the specific priority of the task.
⑦ Number of failure retries: the default is 3 and can be adjusted according to the specific situation.
⑧ Failure retry interval: the default is 3 and can be adjusted according to the specific situation.
⑨ Timeout alarm: tasks with higher real-time requirements, such as warehousing of index results, need this configured; it can be set according to the specific situation.
⑩ Scheduling specification: includes the schedule time, configured according to the specific situation of the task; the scheduling strategy, where failure and notification strategies are set per task (important tasks must be set to continue and to send mail on failure); the process priority, configured consistently with the priority in the workflow specification; and failure and timeout alarms, for which the notification group with the mail recipients for failures and timeouts must be configured in advance. Important tasks need a notification group, recipients and CC recipients, which can be filled in directly with mailbox addresses.
Further, the resource files in the data warehouse scheduling platform are governed by the following normalized content:
① Resource path: the resource files used in DolphinScheduler workflows are uploaded to data/dolphinscheduler/[tenant]/resources/<resource file name> on the Hadoop cluster, for example data/dolphinscheduler/sc_public/resources/<resource file name>. When used, the required resource files, such as hql files, need to be specified in the task, or resource files such as json files are downloaded locally from the cluster.
② Resource file naming specification: name files with English words joined by underscores; the prefix of the name should describe the function of the file, such as import, insert, drop or export; after the prefix, the specific subject operated on, such as metric, uac or cim, should be specified; after the subject, the tables operated on and affected should be specified, for lookup and use.
③ Resource file format: the currently supported file formats mainly include sh, sql, hql, json, py, datax, java, etc.
④ Resource file description: describe the role or function of the file in popular, easy-to-understand Chinese, and state which tasks of which workflows use it, to facilitate maintenance.
⑤ Resource file content: fill in the content according to actual needs.
Further, the coding specification in the data warehouse scheduling platform may specifically include the following normalized content:
① Writing principles: code should have clear, tidy levels and a certain readability; execution speed should be fully considered when writing code; necessary comments are required in the code to enhance its readability; the specification guides daily code development and is continuously improved and supplemented.
② Basic requirements: all SQL keywords and reserved words used in a code segment are written in full uppercase, such as SELECT, FROM, WHERE, AND, OR, UNION, INSERT, DELETE, GROUP, HAVING, COUNT, etc.; mixed-case forms such as Select are prohibited. Apart from keywords and reserved words, code segments use lowercase. One Tab, i.e., four spaces, is one indent unit, and all indentation is an integral multiple of it. The warehouse is set to strict mode, so queries on a partitioned table must specify the partition; when querying a non-partitioned table, a SELECT without LIMIT is prohibited; corresponding parentheses are usually required to be in the same column position.
③ Coding specification: the code header includes the theme, a function description, the author, the date and other information, and a modification log and title bar are reserved so that colleagues who later modify the code can add modification records.
④ Parameter configuration: after the code header, the compute queue name, the task name and the configuration parameters for merging small files must be added; whether to add parameters such as dynamic partition configuration, map and reduce parameter configuration, memory settings and performance optimization can be decided according to the specific requirements. The following is the kind of parameter configuration template that must be added:
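The template itself is not reproduced in this text; the following is a hedged reconstruction of what such a header might contain, based only on the items listed above (compute queue, task name, small-file merging, optional dynamic partitions). The queue and task names are assumptions.

```sql
SET mapreduce.job.queuename=etl_queue;           -- compute queue name (assumed)
SET mapreduce.job.name=dwd_metric_point_detail;  -- task name (assumed)
SET hive.merge.mapfiles=true;                    -- merge small files after map-only jobs
SET hive.merge.mapredfiles=true;                 -- merge small files after MapReduce jobs
SET hive.exec.dynamic.partition=true;            -- optional: dynamic partitioning, per task needs
SET hive.exec.dynamic.partition.mode=nonstrict;  -- optional, per task needs
```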
code arrangement: the keywords such as SELECT, FROM, JOIN, ON, WHERE, AND, GROUP BY, etc. need to be added with the corresponding fields, table names, association conditions, filtering conditions, grouping fields, etc. The keywords, table names and fields and the like are arranged in a manner of one table name and one field per line. The' separator between two fields immediately precedes the second field, the AS statement should be in the same row with the corresponding field, and the AS of the fields should be aligned AS much AS possible on the same column. A space is reserved before and after the arithmetic operator and the logical operator. The following convention is made for CASE statement layout: the CASE and END keywords should be aligned with fields, AS alias should be added after the END clause, if an ELSE clause is included, the ELSE clause is aligned with the WHEN clause, the first WHEN clause should be written after the CASE keyword is rewound and is retracted by a reduction amount, each WHEN clause is written in one line, of course, if the sentence is long and can be rewound, the THEN keyword is aligned AS much AS possible, and the sentence behind THEN is left aligned AS much AS possible.
Sixthly, SQL annotation: the comments of each SQL statement are individually arranged in rows and placed in front of the statement, the field comments are immediately behind the fields, the branch conditional expressions which are difficult to understand should be annotated, the functions of the SQL statement should be explained for important calculation, the lengthy functions should be realized, the statements should be generally explained according to the realized function segments, and when the constants and the variables are annotated, the meanings (necessary contents) of the stored values and the legal value ranges (optional contents) should be annotated.
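A short query laid out according to items ②, ⑤ and ⑥ above is sketched below (uppercase keywords, one field per line, leading commas, aligned AS clauses, the CASE layout, and a comment in front of the statement); the table and fields are assumptions.

```sql
-- Count daily device status records (illustrative query).
SELECT
     t1.device_id                AS device_id
    ,t1.stat_date                AS stat_date
    ,CASE
         WHEN t1.status = 1 THEN 'online'
         WHEN t1.status = 0 THEN 'offline'
         ELSE 'unknown'
     END                         AS status_desc   -- readable status label
    ,COUNT(1)                    AS record_cnt
FROM dws.dws_device_status_day t1
WHERE t1.stat_date = '2021-12-16'   -- partition filter, required in strict mode
GROUP BY
     t1.device_id
    ,t1.stat_date
    ,CASE
         WHEN t1.status = 1 THEN 'online'
         WHEN t1.status = 0 THEN 'offline'
         ELSE 'unknown'
     END;
```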
Further, the operation and maintenance specification of the data warehouse may specifically include the following normalized content:
① Purpose: development and going online only complete the first stage of a requirement; an effective operation and maintenance mechanism is essential to ensure that programs execute stably and data is produced on time.
② Duty rules: on-duty personnel are arranged every day as the first contact for operation and maintenance, handling task and data problems. The person on duty keeps their phone on 24 hours a day and must be reachable for important problems. When the on-duty person cannot troubleshoot and solve a problem, the task developer and the group leader must be notified immediately for assistance. Task alarms and mails must be answered immediately upon receipt; problems that may affect usage must be solved immediately, while problems with low timeliness requirements can be solved after hours or at the company.
③ Accumulated routine operation and maintenance knowledge: to keep small files and blocks from growing too fast, use dynamic partitioning reasonably — dynamic partitioning can create too many directories in one batch, making files and data blocks grow exponentially; avoid building partitions at too fine a granularity — Hive is not afraid of large files but of many small files, and overly fine partitions create too many directories and increase addressing time. The data of intermediate tables and snapshot tables must be cleaned regularly: since it is not used directly, a timed cleaning command or node task should be added when a task accumulates unnecessary historical data. Small files must be merged in HQL: too many small files increase the NameNode memory footprint, increase the number of blocks, and increase the time spent merging small files during task computation, so commands for merging small files must be added in HQL (see the sketch below).
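A hedged sketch of these routine maintenance actions is given below: dropping expired partitions of an intermediate table and enabling small-file merging in an HQL task. The table name, the 7-day retention and the thresholds are assumptions.

```sql
-- Clean expired partitions of an intermediate table (keep the last 7 days).
ALTER TABLE dwd.dwd_metric_point_mid DROP IF EXISTS PARTITION (dt < '2021-12-09');

-- Merge small files produced by the task.
SET hive.merge.mapfiles=true;                -- merge output of map-only tasks
SET hive.merge.mapredfiles=true;             -- merge output of MapReduce tasks
SET hive.merge.smallfiles.avgsize=16000000;  -- trigger merging when avg file size is small (~16 MB)
```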
Furthermore, table and field comments are added: comments on the tables and fields of the different levels make it easier for developers to understand the function and business meaning of each field and to troubleshoot problems, and the maintenance information is also synchronized into the Excel document.
Further, permission management solves the permission problems of the previous warehouse; Apache Ranger is introduced to manage permissions. Permission management in the data warehouse scheduling platform may specifically include the following:
① Account recovery: revoke some permissions (such as delete) of the previously public user accounts (including bigdata, hive, etc.) to avoid losing important company data.
② User creation: create the user name, password, mailbox address and other information.
③ Permission control: after a user is created, permissions are controlled per user, per database and per table. For example, a dedicated user business_opt is created for the operation team; this user can only add, delete and modify tables in the business_opt database, has no permission to add to, delete from or modify other databases, and must apply separately for any additional permissions, which prevents the user from mis-operating other data (see the sketch after this list).
④ Traceable operations: all operations are carried out in HUE and leave traces; once created, users develop on HUE, and log information for add, delete, modify and query operations is available, which facilitates queries, problem checking, problem tracing and the like.
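In practice, Ranger policies are configured in its admin interface rather than in SQL; purely as a hedged illustration of the intent of item ③, Hive's own GRANT syntax would look roughly like this, with the user and database names taken from the example above.

```sql
GRANT SELECT ON DATABASE business_opt TO USER business_opt;  -- query within its own database
GRANT CREATE ON DATABASE business_opt TO USER business_opt;  -- build tables within it
-- No permissions are granted on other databases; additional rights
-- would require a separate application.
```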
Further, to better manage and use the growing number of libraries, tables and fields in the warehouse, lineage management of metadata and tables is added. For metadata management in the data warehouse scheduling platform, Apache Atlas is deployed.
Based on the foregoing embodiments, the present disclosure reconstructs the existing data warehouse through the following aspects, specifically:
1) developing a digital warehouse development specification document: the method comprises the contents of a data platform architecture, a data processing flow, a data research and development flow, a database specification definition and the like, so that developers can develop according to the specification of documents.
2) Rationalizing a data research and development flow, a data processing flow, hardware configuration and a data warehouse architecture: the data of the Internet of things is divided into measurement data and index data, and the data are respectively put into a data base metric library and an index _ result library. And the data are subjected to processes of data extraction, data storage, data conversion, data processing, application and the like, and then submitted to a dimensional data storage system to provide functional analysis and decision support. The data warehouse also performs hierarchical processing. And data set, chart and instrument board management is performed on the universal report platform, and enterprise report service is provided.
3) Subject definition and database specification: for example, the business_opt database is a dedicated database constructed for the data analysis team and maintained by that team; the CIM database only stores common information model data provided by the CIM team and provides a unified data usage standard for the company. At a higher level, the data in the enterprise information system is integrated, classified, analyzed and utilized to obtain abstract concepts.
4) Warehouse layering and table specification: each layer of the data warehouse stores data of a corresponding type. For example, raw measurement data such as EMS data is stored in ODS (Operational Data Store) layer tables; the detail data after cleaning, conversion and the like is placed in DWD layer tables; DWS layer tables hold data summarized and aggregated at graded granularity; and ADS layer tables hold application-facing report data such as external cooperation reports. The aim is to simplify complex matters, reduce the dependence on raw data, reduce the access pressure on the original database, make the data more uniform and complete, and make problems easier to troubleshoot and locate; a minimal sketch of the layer-to-layer flow is given after this list.
5) Workflow specification, work node specification and scheduling specification in a data warehouse scheduling platform (Dolphin Scheduler).
6) Specification of resource files in the data warehouse scheduling platform (Dolphin Scheduler): including the naming specification, file format, file description and file content.
7) Coding specification: it covers the writing principles and basic requirements (including the code header, parameter configuration, code arrangement, HQL annotation and the like), so that the code is clearly layered and orderly, pleasant to read, and convenient for later problem troubleshooting; an illustrative script skeleton is given after this list.
8) Operation and maintenance specification of the data warehouse: it covers the purpose and rules of operation and maintenance and the accumulation of conventional operation and maintenance knowledge. When a problem occurs that the on-duty operator cannot troubleshoot and solve, the task developer and the group leader are notified at the first moment to assist and support. Task alarms and mails must be answered at the first moment they are received; problems that may affect use need to be solved immediately, while problems with low timeliness requirements can be solved after work and within the company's schedule.
9) Adding table and field comments: comments are added to the tables and fields of each level, so that developers can more easily understand the function and business meaning of a field, problems are convenient to troubleshoot, and the maintenance information is also synchronized into the Excel document. This likewise avoids duplicate table creation and redundant development.
10) Permission management: permission management is carried out with Ranger, including account reclamation, user creation, permission control and the like, and different adding, deleting, modifying and querying permissions are granted to different departments and different accounts.
11) Configuration: strict-mode restrictions are added to prevent full scans of partitioned tables, unbounded sorting, Cartesian products and the like from occupying excessive resources, thereby protecting resources; an illustrative setting is given after this list.
12) Metadata management: lineage management of metadata and tables is added, provided by Atlas, to better manage and use the ever-growing number of libraries, tables and fields in the warehouse. It covers data from the process of its generation, such as ETL, through its storage, to its use. When the upstream ETL data or a downstream data report changes, the affected data can be located conveniently, the lineage of a data set can be browsed visually, and users are allowed to explore operational, security and data-provenance information in depth.
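As referenced in aspect 4) above, the following is a minimal HQL sketch of one step of the layered flow (ODS into DWD); the databases, tables, columns and the ${bizdate} scheduling variable are hypothetical names introduced only for illustration.

-- DWD is derived from ODS: keep only well-formed rows and unify the value type.
INSERT OVERWRITE TABLE dwd.dwd_meter_reading_di PARTITION (dt = '${bizdate}')
SELECT device_id,
       metric_code,
       CAST(metric_val AS DOUBLE) AS metric_val,        -- unify the value type
       read_time
FROM   ods.ods_ems_reading_di
WHERE  dt = '${bizdate}'
  AND  device_id IS NOT NULL                            -- basic cleaning rule
  AND  metric_val RLIKE '^-?[0-9]+(\\.[0-9]+)?$';       -- drop malformed readings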
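As referenced in aspect 7) above, the skeleton below is a hedged illustration of an HQL script conforming to such a coding specification; the script name, queue name and check query are assumptions rather than requirements of this disclosure.

-- ----------------------------------------------------------------------
-- Script  : dws_device_metric_1d.hql          (hypothetical file name)
-- Author  : <developer>
-- Created : 2021-12-16
-- Purpose : Daily device-granularity aggregation, DWD -> DWS
-- Change  : 2021-12-16 initial version
-- ----------------------------------------------------------------------

-- Parameter configuration section.
SET hive.exec.parallel=true;
SET mapreduce.job.queuename=etl;   -- assumed scheduling queue name

-- Body: one statement per logical step, each preceded by an HQL comment.
-- Step 1: sanity-check that the upstream partition is populated.
SELECT COUNT(1) FROM dwd.dwd_meter_reading_di WHERE dt = '${bizdate}';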
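As referenced in aspect 11) above, the following setting illustrates the strict-mode protection in HQL; the table name is the hypothetical one used above, and the behavior noted in the comments is that of Hive's classic strict mode.

-- With strict mode on, Hive rejects queries on partitioned tables that lack a
-- partition filter, ORDER BY without LIMIT, and Cartesian-product joins.
SET hive.mapred.mode=strict;

-- Rejected under strict mode:  SELECT * FROM dwd.dwd_meter_reading_di;
-- Accepted, because it filters a partition and bounds the result:
SELECT * FROM dwd.dwd_meter_reading_di WHERE dt = '${bizdate}' LIMIT 100;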
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic structural diagram of a data processing apparatus based on a data warehouse according to an embodiment of the present disclosure.
As shown in fig. 3, the data warehouse-based data processing apparatus includes:
the extraction module 301 is configured to perform an extraction operation on source data in the source data system to obtain to-be-processed data, store the to-be-processed data into an operation data layer in the data warehouse, and generate an operation data table in the operation data layer based on the to-be-processed data;
a cleaning module 302 configured to transfer the operation data table to a detail data layer in the data warehouse, and perform data cleaning and data conversion operations on the operation data table in the detail data layer to obtain a detail data table;
an aggregation module 303 configured to pass the detail data table into a summarized data layer within the data warehouse, perform an aggregation operation on the detail data table in the summarized data layer, and generate a summarized data table based on a result of the aggregation operation;
and the processing module 304 is configured to transfer the summarized data table to an application data layer in the data warehouse, perform data processing operation on the summarized data table in the application data layer to obtain an application data table used as a data processing object, and perform data processing based on the application data table stored in the data warehouse.
In some embodiments, the extraction module 301 in fig. 3 extracts, by using the data warehouse scheduling platform, source data stored in a relational database to obtain to-be-processed data; for source data stored in the distributed message queue, a log data collection component collects log data from the distributed message queue and sends the log data to a distributed file system, and the data warehouse scheduling platform extracts the log data in the distributed file system to obtain to-be-processed data; an illustrative landing table for such collected logs is sketched below.
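Assuming the collected logs land in an HDFS directory, the HQL below sketches how an ODS-layer external table could expose them to the scheduling platform's extraction job; the path, table name and ${bizdate} variable are hypothetical.

-- External ODS table over the directory where the log collection component
-- writes its output (the location is an assumed example path).
CREATE EXTERNAL TABLE IF NOT EXISTS ods.ods_app_log_di (
  log_line STRING COMMENT 'Raw log record exactly as collected'
)
PARTITIONED BY (dt STRING COMMENT 'Collection date, yyyy-MM-dd')
STORED AS TEXTFILE
LOCATION '/data/collector/app_log';

-- Register the business-date partition before the extraction task reads it.
ALTER TABLE ods.ods_app_log_di ADD IF NOT EXISTS
  PARTITION (dt = '${bizdate}') LOCATION '/data/collector/app_log/dt=${bizdate}';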
In some embodiments, the extraction module 301 of fig. 3 synchronizes the to-be-processed data to the data warehouse by using the data warehouse scheduling platform, stores the to-be-processed data in the operation data layer, and maps the structured to-be-processed data into the operation data table, where the operation data layer is used to store incremental data or full data corresponding to the to-be-processed data.
In some embodiments, the cleaning module 302 in fig. 3 performs data cleaning on the operation data table by using an ETL data processing algorithm, performs desensitization processing on the operation data table after the data cleaning, performs dimension reduction on the data table obtained after the desensitization processing, and performs consistency processing on data in the data table to obtain a wide table, where the wide table is used as a detail data table.
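The HQL below is a hedged sketch of such a cleaning step, combining desensitization, consistency processing and the folding of dimension attributes into a wide table; the tables and columns are hypothetical, and the built-in mask() function assumes Hive 2.1 or later.

INSERT OVERWRITE TABLE dwd.dwd_customer_reading_wide PARTITION (dt = '${bizdate}')
SELECT r.device_id,
       mask(c.phone_no)            AS phone_masked,  -- desensitization of a sensitive field
       upper(trim(c.region_code))  AS region_code,   -- consistency processing
       r.metric_code,
       r.metric_val,
       r.read_time
FROM   dwd.dwd_meter_reading_di r
JOIN   dim.dim_customer        c ON c.device_id = r.device_id  -- fold dimension columns into the wide table
WHERE  r.dt = '${bizdate}';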
In some embodiments, the aggregation module 303 in fig. 3 aggregates the detail data tables by using an ETL data processing algorithm according to a preset aggregation granularity and an aggregation rule, so as to aggregate the detail data tables into a summary data table, where the summary data table stores detail data corresponding to the same aggregation granularity.
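As a sketch of aggregation at a preset granularity, the HQL below summarizes the detail table into one row per device, metric and day; the names are hypothetical and the chosen measures are merely examples of an aggregation rule.

INSERT OVERWRITE TABLE dws.dws_device_metric_1d PARTITION (dt = '${bizdate}')
SELECT device_id,
       metric_code,
       COUNT(1)        AS read_cnt,
       AVG(metric_val) AS avg_val,
       MAX(metric_val) AS max_val,
       MIN(metric_val) AS min_val
FROM   dwd.dwd_meter_reading_di
WHERE  dt = '${bizdate}'
GROUP  BY device_id, metric_code;   -- one row per aggregation-granularity key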
In some embodiments, the processing module 304 in fig. 3 performs data processing on the summarized data table stored in the application data layer by using an ETL data processing algorithm according to preset report content, generates application data associated with the report content, and generates an application data table according to the application data.
In some embodiments, the processing module 304 of fig. 3 performs a data query operation on an application data table in the data warehouse by using a pre-configured report development tool and using a query dataset in the report development tool, so as to obtain target data from the application data table, and generate a report according to the target data and a preset report configuration.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
Fig. 4 is a schematic structural diagram of the electronic device 4 provided in the embodiment of the present disclosure. As shown in fig. 4, the electronic device 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. When the processor 401 executes the computer program 403, the steps in the various method embodiments described above are implemented; alternatively, when executing the computer program 403, the processor 401 implements the functions of the respective modules/units in the above-described apparatus embodiments.
Illustratively, the computer program 403 may be partitioned into one or more modules/units, which are stored in the memory 402 and executed by the processor 401 to accomplish the present disclosure. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 403 in the electronic device 4.
The electronic device 4 may be a desktop computer, a notebook, a palm computer, a cloud server, or another electronic device. The electronic device 4 may include, but is not limited to, the processor 401 and the memory 402. Those skilled in the art will appreciate that fig. 4 is merely an example of the electronic device 4 and does not constitute a limitation of it: the device may include more or fewer components than those shown, combine certain components, or have different components; for example, the electronic device may also include input-output devices, network access devices, buses and the like.
The Processor 401 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or an internal memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) and the like provided on the electronic device 4. Further, the memory 402 may include both the internal storage unit of the electronic device 4 and an external storage device. The memory 402 is used for storing computer programs and other programs and data required by the electronic device. The memory 402 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/computer device and method may be implemented in other ways. For example, the above-described apparatus/computer device embodiments are merely illustrative; the division of modules or units is merely a logical function division, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods in the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program may implement the steps of the above method embodiments. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, and the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium and the like. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased as required by legislation and patent practice in the jurisdiction; for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.
The above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure, and are intended to be included within the scope of the present disclosure.

Claims (10)

1. A data processing method based on a data warehouse, comprising:
performing an extraction operation on source data in a source data system to obtain data to be processed, storing the data to be processed into an operation data layer in a data warehouse, and generating an operation data table in the operation data layer based on the data to be processed;
transferring the operation data table to a detail data layer in the data warehouse, and performing data cleaning and data conversion operations on the operation data table in the detail data layer to obtain a detail data table;
transferring the detail data table into a summary data layer within the data warehouse, performing an aggregation operation on the detail data table in the summary data layer, and generating a summary data table based on a result of the aggregation operation;
and transferring the summarized data table to an application data layer in the data warehouse, performing a data processing operation on the summarized data table in the application data layer to obtain an application data table used as a data processing object, and performing data processing based on the application data table stored in the data warehouse.
2. The method of claim 1, wherein the performing the extraction operation on the source data in the source data system to obtain the data to be processed comprises:
for source data stored in a relational database, extracting the source data in the relational database by using a data warehouse scheduling platform to obtain data to be processed;
for source data stored in a distributed message queue, a log data acquisition component is used for acquiring log data from the distributed message queue, the acquired log data are sent to a distributed file system, and a data warehouse scheduling platform is used for extracting the log data in the distributed file system to obtain data to be processed.
3. The method of claim 1, wherein storing the data to be processed into an operation data layer in a data warehouse and generating an operation data table based on the data to be processed in the operation data layer comprises:
synchronizing the data to be processed to the data warehouse by using a data warehouse scheduling platform, storing the data to be processed in the operation data layer, and mapping the structured data to be processed into the operation data table, wherein the operation data layer is used for storing incremental data or full data corresponding to the data to be processed.
4. The method of claim 1, wherein performing a data washing and data transformation operation on the operation data table in the detail data layer to obtain a detail data table comprises:
and carrying out data cleaning on the operation data table by using an ETL data processing algorithm, carrying out desensitization treatment on the operation data table after the data cleaning, carrying out dimension reduction operation on the data table obtained after the desensitization treatment, carrying out consistency treatment on the data in the data table to obtain a wide table, and taking the wide table as the detail data table.
5. The method of claim 1, wherein performing an aggregation operation on the detailed data table in the summarized data layer and generating a summarized data table based on a result of the aggregation operation comprises:
and according to preset aggregation granularity and an aggregation rule, utilizing an ETL (extract transform and load) data processing algorithm to aggregate the detail data tables so as to aggregate the detail data tables into the summary data table, wherein the summary data table stores detail data corresponding to the same aggregation granularity.
6. The method according to claim 1, wherein the performing a data processing operation on the summary data table in the application data layer to obtain an application data table used as a data processing object comprises:
and according to preset report content, carrying out data processing on the summarized data table stored in the application data layer by utilizing an ETL data processing algorithm to generate application data associated with the report content, and generating the application data table according to the application data.
7. The method of claim 1, wherein the processing data based on the application data table stored in the data warehouse comprises:
and performing data query operation on the application data table in the data warehouse by using a pre-configured report development tool and a query data set in the report development tool so as to obtain target data from the application data table, and generating a report according to the target data and preset report configuration.
8. A data processing apparatus based on a data warehouse, comprising:
the extraction module is configured to perform extraction operation on source data in a source data system to obtain data to be processed, store the data to be processed into an operation data layer in a data warehouse, and generate an operation data table based on the data to be processed in the operation data layer;
the cleaning module is configured to transfer the operation data table to a detail data layer in the data warehouse, and perform data cleaning and data conversion operation on the operation data table in the detail data layer to obtain a detail data table;
an aggregation module configured to pass the itemized data table into a summarized data layer within the data warehouse, perform an aggregation operation on the itemized data table in the summarized data layer, and generate a summarized data table based on a result of the aggregation operation;
and the processing module is configured to transmit the summarized data table to an application data layer in the data warehouse, perform data processing operation on the summarized data table in the application data layer to obtain an application data table used as a data processing object, and perform data processing based on the application data table stored in the data warehouse.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111541687.6A 2021-12-16 2021-12-16 Data processing method, device and equipment based on data warehouse and storage medium Pending CN114218218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111541687.6A CN114218218A (en) 2021-12-16 2021-12-16 Data processing method, device and equipment based on data warehouse and storage medium

Publications (1)

Publication Number Publication Date
CN114218218A true CN114218218A (en) 2022-03-22

Family

ID=80702982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111541687.6A Pending CN114218218A (en) 2021-12-16 2021-12-16 Data processing method, device and equipment based on data warehouse and storage medium

Country Status (1)

Country Link
CN (1) CN114218218A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880308A (en) * 2022-07-12 2022-08-09 山东中创软件商用中间件股份有限公司 Metadata processing method, device and medium based on big data
CN116401325A (en) * 2023-05-15 2023-07-07 广州飞狮数字科技有限公司 Data processing method and device based on data warehouse model
CN116401325B (en) * 2023-05-15 2024-03-05 广州飞狮数字科技有限公司 Data processing method and device based on data warehouse model
CN116521142A (en) * 2023-06-29 2023-08-01 速度科技股份有限公司 Implementation method of spatio-temporal big data SQL engine based on prest expansion
CN116521142B (en) * 2023-06-29 2023-10-03 速度科技股份有限公司 Implementation method of spatio-temporal big data SQL engine based on prest expansion
CN116541887B (en) * 2023-07-07 2023-09-15 云启智慧科技有限公司 Data security protection method for big data platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination