CN116795816A

CN116795816A - A data warehouse construction method and system based on streaming processing

Info

Publication number: CN116795816A
Application number: CN202310603864.1A
Authority: CN
Inventors: 霍伟波; 刘襄雄; 刘超; 张元兰
Original assignee: Xiamen Meiya Pico Information Co Ltd
Current assignee: Xiamen Meiya Pico Information Co Ltd
Priority date: 2023-05-26
Filing date: 2023-05-26
Publication date: 2023-09-22

Abstract

The application discloses a data warehouse construction method and system based on stream processing. The method comprises: parsing and restoring business data from structured or unstructured standard data packets, monitoring and capturing database changes and parsing them, and pushing the data to the data aggregation layer ODS; cleaning, converting, desensitizing and associating the data in the data aggregation layer ODS to form the data detail layer DWD; forming the summary data layer DWS from the data in the data detail layer DWD through data distribution, or synchronizing the data to a cloud component according to business needs to form a standardized data query service; and, in the summary data layer DWS, forming wide tables or thematic libraries from the data detail layer DWD through MYLink SQL data distribution and outputting the calculated data to the cloud component to provide service queries and offline calculation and analysis. The method and system adapt well to scenarios with high real-time data requirements, can be deployed rapidly and are easy to maintain, which greatly reduces enterprise costs and improves adaptability.

Description

A data warehouse construction method and system based on streaming processing

Technical field

The invention relates to the technical field of data warehouse construction, and in particular to a data warehouse construction method and system based on stream processing.

Background

In the computer field, a data warehouse is a system used for data analysis and reporting and is an important component of business intelligence. A data warehouse is a central repository of integrated data from one or more disparate sources, which may be heterogeneous. It generally stores current and historical data together for the creation of business value. A data warehouse represents a way of managing and using data: a complete system covering extraction, transformation, loading, modeling, tiered usage, data integration and an access layer. A data warehouse is a subject-oriented, integrated, changing but relatively stable collection of data used to support management decision-making.

In the Internet era, the number of online users has grown rapidly, especially in the shift from the PC era to the mobile Internet era, and massive volumes of user behavior data are generated every day, from the PB level to the EB level or even the ZB level. Enterprises are eager to mine valuable information from this data, such as business data, user behavior data and abnormal data, and use it for business decisions. As data scale and business data volume have surged, data warehouse construction methods and architectures have iterated continuously, from the traditional data warehouse, to the offline data warehouse, to the current Lambda architecture that combines offline and real-time processing.

A traditional data warehouse periodically loads structured, semi-structured and unstructured source data into the warehouse through offline ETL, obtains results through a computing engine, and then provides them to front ends or services. This combination of offline data warehouse and computing engine is usually applied to large OLTP databases (mainly traditional relational databases such as Oracle and SQL Server). Traditional data warehouses support transactional business processing very well, but they cannot adapt to massive data storage and computation, mainly for the following three reasons:

1. A traditional data warehouse relies on pre-built models, which must be rebuilt repeatedly as data volumes grow or user needs change; this is very inefficient.

2. Because the massive data of big data turns quantitative change into qualitative change, traditional data warehouses cannot meet the need for rapid analysis and response over massive data.

3. Expanding a traditional data warehouse cluster is very costly, and it is difficult to scale horizontally to increase computing power.

As data scale and business data volume surge, traditional data warehouses struggle to carry massive data, and traditional database storage faces storage shortages and rising costs. At the same time, with the spread of big data technology, it has become possible to build offline data warehouses on big data technology, which carries both the storage and the computation. Data warehouse construction in big data derives from the traditional data warehouse architecture: big data tools replace the traditional OLTP database, and the result evolves into the offline big data warehouse architecture. The offline data warehouse construction method solves the shortcomings of traditional data warehouses well. As data processing capabilities and business needs keep changing, however, practice has shown that although offline batch processing has become far more capable, it still cannot satisfy business scenarios with very high requirements on data processing and timeliness. The "offline + streaming computing" dual-link construction method is a compromise, transitional solution, but it has many shortcomings in production:

1. Increased use of computing resources. Because offline and streaming pipelines exist side by side, they may occupy computing resources in different time periods: offline computing mostly runs between midnight and 6 a.m., while streaming computing mostly runs during the day or before midnight. The computing resources of the two pipelines therefore cannot be fully shared, and overall resource usage increases.

2. Two sets of code must be maintained. Of the two pipelines, one requires code for the offline engine and the other requires code for the streaming engine, and two sets of tests are needed. The operation and maintenance cost of the data warehouse business doubles.

3. Poor timeliness of offline computing. As business changes, more and more services demand higher timeliness from existing offline tasks. Because offline computing can only satisfy T+n requirements, the only option is to reduce n to the minute level, which places ever higher demands on server resources, and timeliness still cannot necessarily be guaranteed.

4. High cluster storage requirements. Both the offline and the streaming link need to store data in the cluster, and a large amount of temporary data and logs is produced during intermediate computation, so data expands rapidly and puts great pressure on server storage.

Summary of the invention

To solve the above technical problems in the prior art, the present invention proposes a data warehouse construction method and system based on stream processing.

According to a first aspect of the present invention, a data warehouse construction method based on stream processing is proposed, including:

S1: Parse and restore business data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data aggregation layer ODS;

S2: The data aggregation layer ODS cleans, converts, desensitizes and associates the data to form the data detail layer DWD;

S3: The data in the data detail layer DWD forms the summary data layer DWS through data distribution, or the data is synchronized to a cloud component according to business needs to form a standardized data query service;

S4: In the summary data layer DWS, data from the data detail layer DWD is formed into wide tables or thematic libraries through MYLink SQL data distribution, and the calculated data is output to the cloud component to provide service queries and offline calculation and analysis.

In some specific embodiments, before S1 the method further includes collecting data from the source business database according to the collection rules of the data aggregation layer ODS; one source business database may correspond to one or more data aggregation layers ODS.

In some specific embodiments, S1 specifically includes using the sSend tool to parse business data from structured or unstructured standard data packets, using Datax to parse and restore structured or unstructured data, and using FlinkCDC to monitor and capture database changes and parse them.

In some specific embodiments, in S2 the business database data stored in the data aggregation layer ODS retains the original form of the business data, and the MYLink engine cleans, converts, desensitizes and associates the data by means of SQL+UDF to form the data detail layer DWD.

In some specific embodiments, S2 further includes outputting the data directly to the cloud component through the MYLink engine to provide trace queries on the original data.

In some specific embodiments, the cloud component includes Huawei Cloud certified components or Tencent Cloud.

In some specific embodiments, the sSend tool, Datax and FlinkCDC all support a consumption queue as the data aggregation layer ODS.

According to a second aspect of the present invention, a computer-readable storage medium is provided, on which one or more computer programs are stored. When the one or more computer programs are executed by a computer processor, the above method is implemented.

According to a third aspect of the present invention, a data warehouse construction system based on stream processing is proposed, including:

a data processing unit, configured to parse and restore business data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data aggregation layer ODS, where the data aggregation layer ODS cleans, converts, desensitizes and associates the data to form the data detail layer DWD;

a data distribution unit, configured to form the summary data layer DWS from the data in the data detail layer DWD through data distribution, or to synchronize the data to a cloud component according to business needs to form a standardized data query service;

a query analysis unit, in which the summary data layer DWS forms wide tables or thematic libraries from the data detail layer DWD through MYLink SQL data distribution and outputs the calculated data to the cloud component to provide service queries and offline calculation and analysis.

In some specific embodiments, the system further includes a data collection unit configured to collect data from the source business database according to the collection rules of the data aggregation layer ODS; one source business database may correspond to one or more data aggregation layers ODS.

In some specific embodiments, the sSend tool is used to parse business data from structured or unstructured standard data packets, Datax is used to parse and restore structured or unstructured data, and FlinkCDC is used to monitor and capture database changes and parse them; the business database data stored in the data aggregation layer ODS retains the original form of the business data, and the MYLink engine cleans, converts, desensitizes and associates the data by means of SQL+UDF to form the data detail layer DWD; the MYLink engine also outputs the data directly to the cloud component to provide trace queries on the original data.

The present invention proposes a data warehouse construction method and system based on stream processing that suits business scenarios in which streams and batches coexist. Streams and batches use the same set of code and can share the same resources, so resource utilization is high and resource overhead is low. Because only one set of code has to be implemented, development, testing and release become much easier, and later operation and maintenance costs are lower. The approach adapts well to scenarios with high real-time data requirements, can be deployed rapidly and is easy to maintain, which greatly reduces enterprise costs and improves adaptability.

Description of the drawings

The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and, together with the description, serve to explain the principles of the invention. Other embodiments and many of their intended advantages will be readily appreciated as they become better understood by reference to the following detailed description. Other features, objects and advantages of the present application will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

Figure 1 is a flow chart of a data warehouse construction method based on stream processing according to an embodiment of the present application;

Figure 2 is an overall architecture diagram of a streaming data warehouse according to a specific embodiment of the present application;

Figure 3 is a flow chart of data warehouse construction based on stream processing according to a specific embodiment of the present application;

Figure 4 is a data warehouse architecture diagram of a specific embodiment of the present application;

Figure 5 is an architecture diagram of a data warehouse construction system based on stream processing according to an embodiment of the present application;

Figure 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.

Detailed description

The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the relevant invention and do not limit it. It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings.

It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments can be combined with each other. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

Figure 1 shows a flow chart of a data warehouse construction method based on stream processing according to an embodiment of the present application. As shown in Figure 1, the method includes the following steps:

S101: Parse and restore business data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data aggregation layer ODS.

In a specific embodiment, before step S101 the method further includes collecting data from the source business database according to the collection rules of the data aggregation layer ODS; one source business database may correspond to one or more data aggregation layers ODS. The sSend tool parses business data from structured or unstructured standard data packets, Datax parses and restores structured or unstructured data, and FlinkCDC monitors and captures database changes and parses them.

S102: The data aggregation layer ODS cleans, converts, desensitizes and associates the data to form the data detail layer DWD. The business database data stored in the data aggregation layer ODS retains the original form of the business data, and the MYLink engine cleans, converts, desensitizes and associates the data by means of SQL+UDF to form the data detail layer DWD. The MYLink engine can also output the data directly to the cloud component to provide trace queries on the original data. The sSend tool, Datax and FlinkCDC all support a consumption queue as the data aggregation layer ODS.

S103: The data in the data detail layer DWD forms the summary data layer DWS through data distribution, or the data is synchronized to a cloud component according to business needs to form a standardized data query service.

S104: In the summary data layer DWS, data from the data detail layer DWD is formed into wide tables or thematic libraries through MYLink SQL data distribution, and the calculated data is output to the cloud component to provide service queries and offline calculation and analysis.
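The flow of steps S101 to S104 can be pictured as a chain of streaming stages in which each record passes through the layers one at a time. The following is only a minimal sketch in plain Python under stated assumptions: the function names (`to_ods`, `to_dwd`, `to_dws`), the record fields and the cleaning rule are hypothetical and are not taken from the patent or from MYLink.

```python
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]

def to_ods(raw_events: Iterable[Record]) -> Iterator[Record]:
    """S101 (sketch): pass parsed source records to the ODS layer unchanged."""
    for event in raw_events:
        yield dict(event)

def to_dwd(ods: Iterable[Record]) -> Iterator[Record]:
    """S102 (sketch): clean and convert ODS records into detail records."""
    for rec in ods:
        if rec.get("user_id") is None:       # cleaning: drop incomplete rows
            continue
        yield {"user_id": rec["user_id"], "amount": float(rec["amount"])}

def to_dws(dwd: Iterable[Record]) -> Dict[str, float]:
    """S103/S104 (sketch): distribute detail records into a per-user summary."""
    totals: Dict[str, float] = {}
    for rec in dwd:
        totals[rec["user_id"]] = totals.get(rec["user_id"], 0.0) + rec["amount"]
    return totals

# Hypothetical input records; the stages are generators, so each record
# flows through ODS and DWD before the next one is read.
raw = [
    {"user_id": "u1", "amount": "10.5"},
    {"user_id": None, "amount": "3"},
    {"user_id": "u1", "amount": "2"},
    {"user_id": "u2", "amount": "7"},
]
summary = to_dws(to_dwd(to_ods(raw)))
print(summary)
```

Because the first two stages are generators, the same code handles a finite list (batch) or an endless iterator (stream), which is the property the method relies on when it uses one code path for both.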

Figure 2 shows the overall architecture of a streaming data warehouse according to a specific embodiment of the present application. The streaming data warehouse is built on the new-generation streaming computing engine MYLINK as its base, with TJPlat as the workbench. The core idea of TJPlat is to solve the problems of data warehouse construction by integrating and reworking the streaming computing system, so that real-time computing and batch processing can both be built in the same mode. TJPlat can also recompute historical data when needed, which reflects the flexibility of the streaming data warehouse.

In a specific embodiment, the streaming data warehouse of this application follows the top-level "Four Transformations" construction approach of "everything as a resource, resources cataloged, catalog globalized, and global standardization", which folds data warehouse layering into the construction method. Everything as a resource: all layered data warehouse objects are defined as resources. Resources cataloged: all resources are registered in a resource catalog for resource management, data item management, and graded and classified management; the catalog also defines or designs each resource's downstream resources or extraction rules, data cleaning rules (filtering, deduplication, format conversion, validation and so on), and data associations (Join operations between resources). Catalog globalized: the catalog defines the components on which the data warehouse is deployed and stored (for example ES, HBase, MongoDB, HDFS or ClickHouse). Global standardization: unified standards provide global guidance for subsequent resource service queries, data applications and even BI analysis. The Four Transformations give concrete business guidance for building the streaming data warehouse; they can also be seen as the process of taking the business from the abstract to the concrete.
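To make the resource catalog idea concrete, the following minimal sketch models a catalog entry that records a resource's layer, storage component, upstream resources and cleaning rules. The class names, field names and rule names are illustrative assumptions, not structures defined in the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Resource:
    """One catalog entry (hypothetical shape)."""
    name: str
    layer: str                       # "ODS", "DWD" or "DWS"
    storage: str                     # deployment component, e.g. "Kafka", "ES"
    upstream: List[str] = field(default_factory=list)
    cleaning_rules: List[str] = field(default_factory=list)

class ResourceCatalog:
    """In-memory catalog that answers lineage questions."""

    def __init__(self) -> None:
        self._resources: Dict[str, Resource] = {}

    def register(self, resource: Resource) -> None:
        self._resources[resource.name] = resource

    def downstream_of(self, name: str) -> List[str]:
        """Names of resources that list `name` as an upstream resource."""
        return sorted(r.name for r in self._resources.values() if name in r.upstream)

catalog = ResourceCatalog()
catalog.register(Resource("ods_orders", "ODS", "Kafka"))
catalog.register(Resource(
    "dwd_orders", "DWD", "ES",
    upstream=["ods_orders"],
    cleaning_rules=["filter_null_user", "dedup_order_id"],  # hypothetical rule names
))
catalog.register(Resource("dws_user_orders", "DWS", "MongoDB", upstream=["dwd_orders"]))

print(catalog.downstream_of("ods_orders"))
```

Keeping lineage and rules in one catalog is what would let an engine derive the ODS to DWD to DWS processing from definitions rather than from hand-written jobs.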

In a specific embodiment, data flows from the source business database to the aggregation database (ODS) according to the ODS collection rules defined in the Four Transformations; one source database may feed n (one or more) aggregation databases. The open-source MDatax collects data into an offline database or a consumption queue (Kafka), and business rules for collection can be defined during the collection process. sSend collects structured, semi-structured and even unstructured data of specific business scenarios into an offline database, object storage or a consumption queue, and is suited to collecting big data in specific scenarios. FlinkCDC is mainly used for real-time change data capture (CDC) from mainstream databases such as MySQL, Oracle and Postgres: it monitors and captures database changes (inserts, updates and deletes), records them completely in the order in which they occur, and writes them to the consumption queue (Kafka) for MYLINK to subscribe to and consume. All data collection is controlled through definable rules. Datax, sSend and FlinkCDC must all support the consumption queue (Kafka) as the ODS; this is a prerequisite for building the streaming data warehouse.
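The reason change events must be recorded in the order in which they occurred can be shown with a small replay sketch. The event format below (`seq`, `op`, `key`, `row`) is invented for illustration and is not the format FlinkCDC actually emits; a plain list stands in for the Kafka queue.

```python
from typing import Any, Dict, List

def apply_changes(events: List[Dict[str, Any]]) -> Dict[Any, Dict[str, Any]]:
    """Replay insert/update/delete events in sequence order to rebuild table state."""
    table: Dict[Any, Dict[str, Any]] = {}
    for event in sorted(events, key=lambda e: e["seq"]):
        key = event["key"]
        if event["op"] in ("insert", "update"):
            table[key] = event["row"]
        elif event["op"] == "delete":
            table.pop(key, None)
    return table

# Events arrive out of order here on purpose; the sequence number restores
# the order in which the source database applied them.
events = [
    {"seq": 2, "op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"seq": 1, "op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"seq": 3, "op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"seq": 4, "op": "delete", "key": 2, "row": None},
]
state = apply_changes(events)
print(state)
```

If the update were applied before the insert, row 1 would end in the stale "new" state, which is why the capture step preserves the change order before writing to the queue.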

In a specific embodiment, MYLINK is a new-generation streaming computing engine developed on Flink's unified batch-and-stream approach. It is implemented entirely with SQL-like semantics combined with the Four Transformations concept of this application, which lowers the threshold for building a streaming data warehouse. MYLINK retains Flink's real-time computing features, including: support for both bounded and unbounded data sources; support for Join and Union; low latency; and customizable, unified connectors.

MYLINK also combines the Four Transformations concept to innovate on the streaming data warehouse construction mode, specifically in the following aspects:

(1) Resource mapping connects the ODS layer to the DWD layer. MYLINK performs streaming computation (ETL steps such as data cleaning, data association, and data extraction and distribution) according to the data processing rules defined for the mapped resource, and automatically outputs the resulting stream to the appropriate layer according to the deployment definition;

(2) The order and position of the operators in the streaming computation can be adjusted dynamically and freely to meet business needs;

(3) Within a streaming computation stage, users can write their own SQL statements to handle specific business needs, and can define UDFs for special or complex business logic;

(4) MYLINK provides dynamically adjustable "windows" for allocating the batch computation windows of a business;

(5) More data warehouse components are supported: in addition to open-source components, cloud components from some vendors are supported, such as Huawei Cloud certified components (ES, HIVE, HDFS, MongoDB, etc.) and Tencent Cloud (ES, HIVE, HDFS, etc.).
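Item (4) above, the adjustable window, can be illustrated with a tumbling-window count whose window length is a parameter. This is a generic sketch of windowed aggregation, not MYLINK's implementation; the function name, the event tuples and the use of integer second timestamps are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[int, int]:
    """Count events per fixed-size window, keyed by window start time."""
    counts: Dict[int, int] = defaultdict(int)
    for timestamp, _payload in events:
        window_start = timestamp - timestamp % window_seconds
        counts[window_start] += 1
    return dict(counts)

# (timestamp in seconds, payload); values are illustrative only.
events = [(0, "a"), (30, "b"), (65, "c"), (119, "d"), (120, "e")]

per_minute = tumbling_window_counts(events, 60)
per_two_minutes = tumbling_window_counts(events, 120)
print(per_minute)
print(per_two_minutes)
```

Changing only `window_seconds` moves the same computation from near-real-time granularity toward batch-like granularity, which is the sense in which one code path can serve both modes.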

With reference to the flow chart of data warehouse construction based on stream processing shown in Figure 3, a specific embodiment of the present application includes the following steps:

The platform uses the "sSend" tool to parse business data from structured or unstructured standard data packets; the "MDatax" capability parses and restores structured or unstructured data; and FlinkCDC under "data collection" is dedicated to monitoring and capturing database changes and parsing them. The data is then pushed to the ODS layer database.

The ODS layer (raw database) stores business database data in its original form. Data in the ODS layer can be cleaned, converted, desensitized, associated and so on by the MYLink engine by means of SQL+UDF to form the DWD layer. MYLink can also output the data directly to components such as ES and HDFS so that data analysts can trace and query the original data, and the data can be provided to "data applications" or third-party services through the "cxLevelS" or "Tianhe" service platform.
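The SQL+UDF style of cleaning, desensitizing and associating ODS data can be sketched with Python's built-in `sqlite3` module standing in for the MYLink engine. The table names, columns, sample values and the `mask_phone` function are hypothetical; only the general pattern (a user-defined function registered with the SQL engine and called inside a query) reflects the text.

```python
import sqlite3

def mask_phone(phone):
    """UDF (sketch): keep the first 3 and last 4 characters, mask the rest."""
    if phone is None or len(phone) < 7:
        return phone
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]

conn = sqlite3.connect(":memory:")
conn.create_function("mask_phone", 1, mask_phone)   # register the UDF with SQL

conn.execute("CREATE TABLE ods_users (user_id TEXT, phone TEXT, city_code TEXT)")
conn.execute("CREATE TABLE dim_city (city_code TEXT, city_name TEXT)")
conn.executemany(
    "INSERT INTO ods_users VALUES (?, ?, ?)",
    [
        ("u1", "13812345678", "0592"),
        ("u1", "13812345678", "0592"),   # duplicate row, removed by DISTINCT
        (None, "13900000000", "0592"),   # missing key, removed by WHERE
    ],
)
conn.execute("INSERT INTO dim_city VALUES ('0592', 'Xiamen')")

# Cleaning (WHERE, DISTINCT), desensitization (UDF) and association (JOIN)
# expressed in a single SQL statement.
dwd_rows = conn.execute(
    """
    SELECT DISTINCT u.user_id, mask_phone(u.phone) AS phone, c.city_name
    FROM ods_users AS u
    JOIN dim_city AS c ON c.city_code = u.city_code
    WHERE u.user_id IS NOT NULL
    """
).fetchall()
print(dwd_rows)
```

The ODS table itself is left untouched, matching the requirement that the ODS layer keep the original form of the business data while the DWD rows are derived from it.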

The DWD layer (data detail layer) holds governed data. DWD data can be used for "data distribution" to form the DWS layer, for example user behavior, product attributes or thematic libraries; alternatively, according to business needs, the data can be synchronized to ES through MYLink to provide detail-layer queries, forming various standardized data query services.

The DWS layer is derived from the DWD layer: "data distribution" in MYLink SQL forms wide tables or thematic libraries (indicator libraries), and the calculated data is output to components such as Mongo and HDFS, which provide service queries and offline calculation and analysis respectively.
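As a rough illustration of what a wide table and an indicator library look like, the sketch below joins detail records with user attributes and then aggregates them. The data, field names and helper functions are hypothetical and use plain Python rather than MYLink SQL.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical DWD detail records and a DWD user attribute lookup.
dwd_orders = [
    {"order_id": 1, "user_id": "u1", "amount": 10.0},
    {"order_id": 2, "user_id": "u1", "amount": 5.0},
    {"order_id": 3, "user_id": "u2", "amount": 7.0},
]
dwd_users = {"u1": {"city": "Xiamen"}, "u2": {"city": "Fuzhou"}}

def build_wide_table(
    orders: List[Dict[str, Any]], users: Dict[str, Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Join each order with its user's attributes into one wide row."""
    return [{**order, **users.get(order["user_id"], {})} for order in orders]

def build_indicators(wide_rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate the wide table into per-city order totals (an indicator)."""
    totals: Dict[str, float] = defaultdict(float)
    for row in wide_rows:
        totals[row.get("city", "unknown")] += row["amount"]
    return dict(totals)

wide = build_wide_table(dwd_orders, dwd_users)
indicators = build_indicators(wide)
print(wide[0])
print(indicators)
```

The wide rows suit point lookups by a query service, while the aggregated indicators suit analysis, which mirrors the split between service queries and offline calculation described above.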

In a specific embodiment, the offline data warehouse construction method addresses the shortcomings of traditional data warehouses well:

1. Batch data computation. It largely solves the traditional data warehouse's inability to compute over massive data, and the results can be queried in batches after batch computation.

2. Low-cost expansion of computing power. Because the offline data warehouse uses the Hadoop ecosystem, it runs well on low-cost small machines and provides an inexpensive way to scale offline computing horizontally, unlike a traditional data warehouse, which can only be scaled vertically by purchasing large machines.

3. Offline batch computation can also wait, according to actual business requirements, until the data has arrived and then compute over it as a whole, which benefits data consistency and integrity.

The offline data warehouse uses open-source Datax, Flume or the self-developed sSend tool to collect heterogeneous source data into offline storage (HDFS or HIVE); important basic data may also be collected into relational databases. The collected data produces a collection ledger log. Data collected into the offline database can be stored by partition, by table or by database depending on the resource, and is then computed offline in batches through HiveQL or SparkSQL on QBI (a self-developed offline analysis platform), either as a one-off run or on a schedule. After business data is output according to business needs, data services can be provided through the unified publishing platform CXLeveS for use by data applications or service applications.

An offline data warehouse lends itself to layered construction and management, so that data flows in an orderly way as designed. Data warehouse layering provides an important theoretical basis for the layered construction and management of data. During the construction of an offline data warehouse, the data (tables or resources) tends to form a system with complex dependencies, a confused hierarchy, and even circular dependencies. To build the data system in an orderly manner, an effective, modern data layering scheme is needed for organizing, managing, and using data. The main purpose of data layering is clearer control over data during data management. It has the following advantages:

1. Clear data structure. Each data layer has its own scope, which makes the layers easier to define and understand.

2. Simplification of complex problems. Each complex task is broken into multiple steps, and each layer handles only a specific problem, which keeps it simple and easy to understand and makes data accuracy and consistency easier to maintain.

3. Less duplicated work. With standardized data layering, developing common intermediate-layer tables greatly reduces repeated computation.

4. Unified data caliber. Data layering provides a single data outlet and unifies the definitions of the business data that is output externally, avoiding ambiguity in the business.

5. Traceable data sources. A business table is unique externally, but it may come from one or more sources; tracking table lineage makes it possible to locate the data source quickly.
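Lineage tracking and the circular dependencies mentioned above can be illustrated with a small dependency graph. The following is a minimal sketch, not the applicant's implementation: the table names, the `LINEAGE` map, and both helper functions are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical lineage map: each table lists the tables it is built from.
LINEAGE = {
    "ads_fraud_report": ["dws_user_txn_summary"],
    "dws_user_txn_summary": ["dwd_txn_detail", "dim_user"],
    "dwd_txn_detail": ["ods_txn"],
    "dim_user": ["ods_user"],
    "ods_txn": [],
    "ods_user": [],
}


def trace_sources(table, lineage):
    """Return the set of root tables (tables with no upstream) feeding `table`."""
    roots, stack, seen = set(), [table], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = lineage.get(current, [])
        if not parents:
            roots.add(current)
        stack.extend(parents)
    return roots


def has_cycle(lineage):
    """Detect circular dependencies with a three-state depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = defaultdict(int)

    def visit(node):
        state[node] = GREY
        for parent in lineage.get(node, []):
            if state[parent] == GREY:
                return True  # back edge: the path loops onto itself
            if state[parent] == WHITE and visit(parent):
                return True
        state[node] = BLACK
        return False

    return any(state[n] == WHITE and visit(n) for n in list(lineage))
```

Calling `trace_sources("ads_fraud_report", LINEAGE)` walks upstream to the ODS tables, and `has_cycle` can be run before a new table is registered to reject a definition that would make the layers depend on each other in a loop.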

Figure 4 shows the data warehouse architecture of a specific embodiment of the present application. As shown in Figure 4, a data warehouse is generally divided into three layers: the data aggregation layer (ODS), the data warehouse layer (DW), and the data application layer (ADS). The data warehouse layer (DW) can be further divided into the data detail layer (DWD), the summary data layer (DWS), the common dimension layer (DIM), and the temporary data layer (TMP).
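The layer hierarchy just described can be written down as a small data structure. This is only a sketch: the `<layer>_<name>` table naming convention used by `layer_of` is an assumption made for illustration and does not come from the application.

```python
from enum import Enum


class Layer(Enum):
    ODS = "data aggregation layer"
    DWD = "data detail layer"
    DWS = "summary data layer"
    DIM = "common dimension layer"
    TMP = "temporary data layer"
    ADS = "data application layer"


# Sub-layers of the data warehouse layer (DW), as listed in the text.
DW_SUBLAYERS = {Layer.DWD, Layer.DWS, Layer.DIM, Layer.TMP}


def top_level(layer):
    """Map a layer to one of the three top-level layers: ODS, DW, or ADS."""
    return "DW" if layer in DW_SUBLAYERS else layer.name


def layer_of(table_name):
    """Infer the layer from a hypothetical `<layer>_<name>` naming convention."""
    prefix = table_name.split("_", 1)[0].upper()
    return Layer[prefix]  # raises KeyError for an unknown prefix
```

Encoding the layers this way lets a scheduler or review tool check, for example, that a DWS table only reads from DWD or DIM tables.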

This application proposes a new data warehouse construction method to address the many problems that traditional data warehouses and offline data warehouses (including "offline + streaming computing" data warehouses) encounter in construction and production. This data warehouse construction method based on streaming processing can also be called a data warehouse construction method and system based on unified stream and batch processing. Building a data warehouse with unified stream and batch processing has the following advantages:

1. Same-source computation. The same code and the same logic can handle both streaming tasks and batch tasks, which lowers learning and maintenance costs and improves resource utilization.

2. Same-source storage. With unified stream and batch processing, the storage system can hold both streaming data and batch data, with effective coordination and metadata updates.

3. Low data latency. Streaming processing was created for real-time use; data latency is at the level of seconds or even milliseconds, which suits business scenarios with strict real-time requirements.
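The idea of same-source computation can be shown in plain Python by writing the logic once against an iterable and feeding it either a bounded list (batch) or a generator (stream). This is a sketch of the principle only; it does not use MYLink or any stream engine, and the record fields `user` and `amount` are assumptions.

```python
from typing import Iterable, Iterator


def clean_and_total(records: Iterable[dict]) -> Iterator[dict]:
    """Drop malformed records and emit a running total per user."""
    totals = {}
    for rec in records:
        if rec.get("user") is None or rec.get("amount") is None:
            continue  # cleaning step: skip incomplete records
        totals[rec["user"]] = totals.get(rec["user"], 0) + rec["amount"]
        yield {"user": rec["user"], "running_total": totals[rec["user"]]}


# Batch mode: the input is a bounded, fully materialized list.
batch = [
    {"user": "u1", "amount": 10},
    {"user": "u1", "amount": 5},
    {"user": None, "amount": 1},
]
batch_result = list(clean_and_total(batch))


# Stream mode: the input arrives one record at a time from a generator.
def event_stream():
    for rec in batch:
        yield rec


stream_result = []
for out in clean_and_total(event_stream()):
    stream_result.append(out)  # each result is available as soon as it is computed
```

Because `clean_and_total` never needs the whole input at once, the batch and stream runs produce identical results from a single definition, which is the maintenance benefit the first advantage refers to.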

Figure 5 shows the architecture of a data warehouse construction system based on streaming processing according to one embodiment of the present application. As shown in Figure 5, the system includes a data processing unit 501, a data distribution unit 502, and a query analysis unit 503. The data processing unit 501 is configured to parse and restore business data from structured or unstructured standard data packets, to monitor and capture database changes and parse them, and to push the data to the data aggregation layer ODS, where the data is cleaned, converted, desensitized, and associated to form the data detail layer DWD. The data distribution unit 502 is configured to form the summary data layer DWS from the data in the data detail layer DWD through data distribution, or to synchronize the data to the cloud component according to business needs to form a standardized data query service. In the query analysis unit 503, the summary data layer DWS forms wide tables or thematic libraries from the data detail layer DWD through the data distribution of MYLink SQL, and outputs the computed data to the cloud component to provide service queries and offline computation and analysis.
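The flow from ODS through DWD to DWS can be sketched in a few lines. The application performs these steps with the MYLink engine using SQL+UDF; the plain-Python version below is only an illustration, and the field names, the `DIM_USER` lookup, and the hash-based desensitization are all assumptions.

```python
import hashlib
from collections import defaultdict

# Hypothetical dimension data used for the "associate" step.
DIM_USER = {"u1": {"region": "Xiamen"}, "u2": {"region": "Fuzhou"}}


def desensitize(value):
    """Replace a sensitive value with a truncated SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def ods_to_dwd(ods_rows):
    """Clean, convert, desensitize, and associate raw ODS rows."""
    dwd = []
    for row in ods_rows:
        # Clean: drop rows without a user id or an amount.
        if not row.get("user_id") or row.get("amount") in (None, ""):
            continue
        dwd.append({
            "user_id": row["user_id"],
            "amount": float(row["amount"]),                  # convert
            "phone": desensitize(row.get("phone", "")),      # desensitize
            "region": DIM_USER.get(row["user_id"], {}).get("region", "unknown"),  # associate
        })
    return dwd


def dwd_to_dws(dwd_rows):
    """Summarize detail rows into a per-region indicator table."""
    summary = defaultdict(lambda: {"txn_count": 0, "total_amount": 0.0})
    for row in dwd_rows:
        entry = summary[row["region"]]
        entry["txn_count"] += 1
        entry["total_amount"] += row["amount"]
    return dict(summary)
```

Passing the output of `ods_to_dwd` to `dwd_to_dws` yields one summary row per region, a simple stand-in for the indicator library that the DWS layer exposes to query services.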

In a specific embodiment, a data collection unit precedes the data processing unit 501. It is configured to collect data from the source business library according to the collection rules of the data aggregation layer ODS; one source business library may correspond to one or more data aggregation layers ODS.

The data warehouse construction method and system based on streaming processing of this application are well suited to business scenarios where stream and batch workloads coexist. Stream and batch share one set of code and the same resources, so resource utilization is high and resource overhead is low. Because only one set of code has to be implemented, development, testing, and release become much easier, and later operation and maintenance costs are lower. Streaming processing has broad application to today's time-sensitive business needs, such as fraud prediction. Fraud is frequent in the financial sector; it unfolds quickly and has a large impact, and preventing it is a problem that many financial companies and banks have had to tackle in recent years. Traditional anti-fraud measures are no longer sufficient. In the past it took several hours to compute transaction data and user behavior indicators, identify suspicious users through the corresponding rules, and then screen them through case investigation; by then the funds had long since been transferred around the world several times by criminals. With the streaming data warehouse construction method, streaming computation over the relevant data completes the indicator calculations within seconds or even milliseconds, and real-time transactions can then trigger real-time warnings or be intercepted, avoiding losses. The present invention adapts well to any scenario that requires building a streaming data warehouse and has high real-time data requirements. It can be deployed quickly and is easy to maintain, which greatly reduces enterprise costs and improves adaptability.
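The fraud-warning scenario amounts to maintaining an indicator over a sliding time window and checking it on every incoming transaction. The sketch below is one possible illustration under stated assumptions: the class name, the 60-second window, the amount threshold, and the requirement that events arrive in timestamp order per user are all hypothetical and are not taken from the application.

```python
from collections import defaultdict, deque


class SlidingWindowAlert:
    """Per-user sum of transaction amounts over the last `window_s` seconds."""

    def __init__(self, window_s=60, threshold=10000.0):
        self.window_s = window_s
        self.threshold = threshold
        self.events = defaultdict(deque)   # user -> deque of (timestamp, amount)
        self.totals = defaultdict(float)   # user -> current windowed sum

    def process(self, user, ts, amount):
        """Ingest one transaction; return True if it should raise a warning.

        Assumes timestamps for a given user arrive in non-decreasing order.
        """
        queue = self.events[user]
        queue.append((ts, amount))
        self.totals[user] += amount
        # Evict events that have fallen out of the window.
        while queue and queue[0][0] <= ts - self.window_s:
            _, old_amount = queue.popleft()
            self.totals[user] -= old_amount
        return self.totals[user] > self.threshold
```

Each call does a small, bounded amount of work per event, so the indicator is up to date as soon as the transaction arrives instead of hours later after a batch job.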

Reference is now made to FIG. 6, which shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.

As shown in FIG. 6, the computer system includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage portion 608 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-mentioned functions defined in the method of the present application are performed. It should be noted that the computer-readable medium of the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this application, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.

Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules involved in the embodiments of this application may be implemented in software or in hardware.

As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist on its own without being assembled into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: parse and restore business data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data aggregation layer ODS; clean, convert, desensitize, and associate the data in the data aggregation layer ODS to form the data detail layer DWD; form the summary data layer DWS from the data in the data detail layer DWD through data distribution, or synchronize the data to the cloud component according to business needs to form a standardized data query service; and, in the summary data layer DWS, form wide tables or thematic libraries from the data detail layer DWD through the data distribution of MYLink SQL, and output the computed data to the cloud component to provide service queries and offline computation and analysis.

The above description covers only the preferred embodiments of the present application and an explanation of the technical principles used. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by interchanging the above features with (but not limited to) technical features having similar functions disclosed in this application.

Claims

1. A data warehouse construction method based on streaming processing, which is characterized by including:

S1: Parse and restore business data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data aggregation layer ODS;

S2: The data aggregation layer ODS cleans, converts, desensitizes, and associates the data to form the data detail layer DWD;

S3: The data in the data detail layer DWD forms a summary data layer DWS through data distribution, or the data is synchronized to the cloud component according to business needs to form a standardized data query service;

S4: The summary data layer DWS distributes the data detail layer DWD through MYLink SQL to form a wide table or thematic library, and outputs the calculated data to the cloud component to provide service query and offline calculation analysis.

2. The data warehouse construction method based on streaming processing according to claim 1, characterized in that, before S1, it also includes collecting data from the source business library according to the collection rules of the data aggregation layer ODS, wherein the source business library may correspond to one or more data aggregation layers ODS.

3. The data warehouse construction method based on streaming processing according to claim 1, characterized in that S1 specifically includes using the sSend tool to parse business data from the structured or unstructured standard data packets, using Datax to parse and restore the structured or unstructured data, and using FlinkCDC to monitor and capture database changes and parse them.

4. The data warehouse construction method based on streaming processing according to claim 1, characterized in that, in S2, the data aggregation layer ODS is used to store business library data and preserves the original form of the business data, and the MYLink engine cleans, converts, desensitizes, and associates the data using SQL+UDF to form the data detail layer DWD.

5. The data warehouse construction method based on streaming processing according to claim 4, characterized in that the S2 further includes directly outputting the data to the cloud component through the MYLink engine to provide tracking query of the original data.

6. The data warehouse construction method based on streaming processing according to claim 1, characterized in that the cloud component includes a Huawei Cloud certification component or Tencent Cloud.

7. The data warehouse construction method based on streaming processing according to claim 3, characterized in that the sSend tool, Datax, and FlinkCDC all support using a consumption queue as the data aggregation layer ODS.

8. A computer-readable storage medium on which one or more computer programs are stored, characterized in that when the one or more computer programs are executed by a computer processor, the method of any one of claims 1-7 is implemented.

9. A data warehouse construction system based on streaming processing, which is characterized by including:

A data processing unit configured to parse and restore business data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data aggregation layer ODS, wherein the data aggregation layer ODS cleans, converts, desensitizes, and associates the data to form the data detail layer DWD;

A data distribution unit configured to form a summary data layer DWS from the data in the data detail layer DWD through data distribution, or to synchronize the data to the cloud component according to business needs to form a standardized data query service;

A query analysis unit, in which the summary data layer DWS distributes the data detail layer DWD through MYLink SQL to form a wide table or thematic library, and outputs the calculated data to the cloud component to provide service query and offline calculation analysis.

10. The data warehouse construction system based on streaming processing according to claim 9, characterized in that it also includes a data collection unit configured to collect data from the source business library according to the collection rules of the data aggregation layer ODS, wherein the source business library may correspond to one or more data aggregation layers ODS.

11. The data warehouse construction system based on streaming processing according to claim 9, characterized in that the sSend tool is used to parse business data from the structured or unstructured standard data packets, Datax is used to parse and restore the structured or unstructured data, and FlinkCDC is used to monitor and capture database changes and parse them; the data aggregation layer ODS is used to store business library data and preserves the original form of the business data, and the MYLink engine cleans, converts, desensitizes, and associates the data using SQL+UDF to form the data detail layer DWD; and the data is directly output to the cloud component through the MYLink engine to provide tracking and querying of the original data.

12. The data warehouse construction system based on streaming processing according to claim 11, characterized in that the sSend tool, Datax, and FlinkCDC all support using a consumption queue as the data aggregation layer ODS.