CN114490610A - Data processing method and device for data warehouse, storage medium and electronic device


Info

Publication number
CN114490610A
Authority
CN
China
Prior art keywords
data
warehouse
preset
target
cleaning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210090577.0A
Other languages
Chinese (zh)
Inventor
周波
柴灵俊
张建业
鲁霜腾
杨标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Huifu Network Technology Co ltd
Original Assignee
Zhejiang Huifu Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Huifu Network Technology Co ltd filed Critical Zhejiang Huifu Network Technology Co ltd
Priority to CN202210090577.0A
Publication of CN114490610A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/02 Banking, e.g. interest calculation or account maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Quality & Reliability (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a data processing method and device for a data warehouse, a storage medium, and an electronic device. The method includes: acquiring data from different data sources based on the Flink framework and storing the data in the Hudi data lake; establishing a target data warehouse based on preset data extraction, transformation, and loading processing; and executing a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation. The method and device solve the technical problem of poor storage and computation performance for Internet finance data, support lightweight deployment, and support visual configuration under different business scenarios.

Description

Data processing method and device for data warehouse, storage medium and electronic device

Technical Field

The present application relates to the technical field of data warehouses, and in particular to a data processing method and device for a data warehouse, a storage medium, and an electronic device.

Background Art

Internet finance applies advanced information technology to financial business so that bank staff can obtain a large amount of important customer-related information in real time, thereby facilitating operations and improving efficiency.

Existing data warehouses still fall short of requirements such as handling incremental data and answering data queries in real time.

In the related art, no effective solution has yet been proposed for the poor storage and computation performance of Internet finance data.

Summary of the Invention

The main purpose of the present application is to provide a data processing method and device for a data warehouse, a storage medium, and an electronic device, so as to solve the problem of poor storage and computation performance for Internet finance data, for which no effective solution has yet been proposed.

To achieve the above object, according to one aspect of the present application, a data processing method for a data warehouse is provided.

The data processing method for a data warehouse according to the present application includes: acquiring data from different data sources based on the Flink framework and storing the data in the Hudi data lake; establishing a target data warehouse based on preset data extraction, transformation, and loading (ETL) processing; and executing a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.

Further, establishing the target data warehouse based on the preset data extraction, transformation, and loading processing also includes: establishing the target data warehouse based on preset full and/or incremental data extraction, transformation, and loading processing, where the full/incremental data is extracted to a Kafka queue using the Canal component (written "Canel" in the original text).
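
As a minimal sketch of what this extraction path might look like in Flink SQL (the table, topic, field, and broker names are illustrative assumptions, not taken from the disclosure), a Kafka source table can decode Canal change records using the canal-json format:

    -- Source table over the Kafka queue fed by the Canal component;
    -- topic, fields, and broker address are assumed for illustration.
    CREATE TABLE ods_business (
        id          STRING,
        status      STRING,
        update_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ods_business',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',  -- replay the full load, then follow increments
        'format' = 'canal-json'                   -- decode Canal change-log records
    );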

Further, establishing the target data warehouse based on the preset data extraction, transformation, and loading processing includes: reading the data out of the data source, performing data type conversion and dirty data cleaning, and then loading the data into the target data warehouse.

Further, executing the preset data processing operation based on the target data warehouse also includes a user behavior data analysis operation: printing buried-point (event tracking) logs of the business data to a fixed file through a preset data model and collecting them into a log file; extracting the log file to a Kafka queue; passing the data into the target data warehouse through its input source, cleaning the data with Flink SQL, and then organizing the data into the required form and storing it in HBase.
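
A sketch of the final cleaning-and-storage step in Flink SQL might look as follows; the HBase table, column family, source table, and ZooKeeper address are assumptions for illustration:

    -- Sink table mapping cleaned behavior data to HBase; the HBase
    -- table name, column family 'f', and ZooKeeper address are assumed.
    CREATE TABLE dws_user_behavior (
        rowkey STRING,
        f ROW<event_type STRING, event_cnt BIGINT>,
        PRIMARY KEY (rowkey) NOT ENFORCED
    ) WITH (
        'connector' = 'hbase-2.2',
        'table-name' = 'dws:user_behavior',
        'zookeeper.quorum' = 'zk:2181'
    );

    -- Clean the buried-point events and aggregate per user and event type.
    INSERT INTO dws_user_behavior
    SELECT CONCAT(user_id, '_', event_type) AS rowkey,
           ROW(event_type, event_cnt) AS f
    FROM (
        SELECT user_id, event_type, COUNT(*) AS event_cnt
        FROM ods_behavior_log      -- assumed Kafka-backed log source table
        WHERE user_id IS NOT NULL  -- drop dirty records
        GROUP BY user_id, event_type
    ) t;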

Further, executing the preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, includes: the preset data processing operation includes a report generation operation: after accessing an input source through the target data warehouse, cleaning the data based on SQL to obtain the data required by the target format; accessing an output source through the target data warehouse, storing the data, and displaying the data through a BI tool.
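
By way of illustration, such a report pipeline could be expressed in Flink SQL roughly as below; the intermediate view, the report table, the source table, and the JDBC address are assumed names, and the BI tool would read the resulting ads_daily_report table:

    -- Intermediate cleaning step; several such views or tables may be
    -- produced before the target report format is reached.
    CREATE VIEW dwd_orders_clean AS
    SELECT order_id, amount, create_time
    FROM dw_orders               -- assumed cleaned table in the target warehouse
    WHERE amount > 0;            -- example dirty-data rule

    -- Output table that the BI tool reads; the JDBC address is assumed.
    CREATE TABLE ads_daily_report (
        dt           STRING,
        order_count  BIGINT,
        total_amount DECIMAL(16, 2),
        PRIMARY KEY (dt) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://mysql:3306/report',
        'table-name' = 'ads_daily_report',
        'username' = 'flink',
        'password' = '******'
    );

    INSERT INTO ads_daily_report
    SELECT DATE_FORMAT(create_time, 'yyyy-MM-dd') AS dt,
           COUNT(*) AS order_count,
           CAST(SUM(amount) AS DECIMAL(16, 2)) AS total_amount
    FROM dwd_orders_clean
    GROUP BY DATE_FORMAT(create_time, 'yyyy-MM-dd');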

Further, executing the preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, includes: the preset data processing operation includes a decision engine raw data cleaning operation: accessing a data source through the target data warehouse and cleaning the data using preset Flink SQL, where the Flink SQL is configured according to the business data and the execution result of each Flink SQL is stored into the Hudi data lake as intermediate data; cleaning the data into preset decision raw fields and storing them into the target data warehouse through an output source; and performing extraction, transformation, and loading on incremental data using an update field to establish the target data warehouse.
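
A hypothetical sketch of one such configured cleaning step, with the Hudi path, field names, and source table assumed for illustration, might be:

    -- One configured cleaning step whose result is kept as intermediate
    -- data in Hudi; path, fields, and source table are assumed.
    CREATE TABLE dwd_decision_raw (
        id          STRING,
        risk_field  STRING,        -- example preset decision raw field
        update_time TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'hudi',
        'path' = 'hdfs:///warehouse/dwd_decision_raw',
        'table.type' = 'MERGE_ON_READ',
        'precombine.field' = 'update_time'  -- newer increments supersede older rows
    );

    INSERT INTO dwd_decision_raw
    SELECT id,
           status AS risk_field,   -- cleaning rule reduced to a rename here
           update_time
    FROM ods_business              -- assumed upstream source table
    WHERE id IS NOT NULL;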

Further, executing the preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, includes: the preset data processing operation includes a business data query operation: extracting the full data to the Kafka queue; cleaning the data in the data warehouse through an input source; and configuring ES (Elasticsearch) as the output source for the cleaning result.
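
For example (with the ES host, index name, and source table assumed), the output source could be declared and fed in Flink SQL as follows:

    -- ES output source for business data queries; host and index assumed.
    CREATE TABLE ads_business_search (
        id          STRING,
        status      STRING,
        update_time TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'elasticsearch-7',
        'hosts' = 'http://es:9200',
        'index' = 'business_data'
    );

    -- Cleaned records land in ES continuously, serving the real-time
    -- retrieval requirement of the business data query operation.
    INSERT INTO ads_business_search
    SELECT id, status, update_time
    FROM ods_business              -- assumed Kafka-backed source table
    WHERE status IS NOT NULL;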

To achieve the above object, according to another aspect of the present application, a data processing device for a data warehouse is provided.

The data processing device for a data warehouse according to the present application includes: an establishment module, configured to establish a target data warehouse based on preset data extraction, transformation, and loading processing; and an execution processing module, configured to execute a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.

To achieve the above object, according to yet another aspect of the present application, a computer-readable storage medium is provided, in which a computer program is stored, where the computer program is configured to execute the method when run.

To achieve the above object, according to a further aspect of the present application, an electronic device is provided, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to execute the method.

In the embodiments of the present application, the data processing method and device for a data warehouse, the storage medium, and the electronic device acquire data from different data sources based on the Flink framework and store the data in the Hudi data lake; establish a target data warehouse based on preset data extraction, transformation, and loading processing; and execute a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation. This achieves the purposes of completing report processing, querying business data in a timely manner, and cleaning the raw data of the decision engine, thereby realizing the technical effect of optimizing data storage and the computation process, and thus solving the technical problem of poor storage and computation performance for Internet finance data.

Brief Description of the Drawings

The accompanying drawings, which constitute a part of the present application, are provided for a further understanding of the application and make its other features, objects, and advantages more apparent. The drawings of the exemplary embodiments and their descriptions are used to explain the present application and do not constitute an improper limitation of it. In the drawings:

FIG. 1 is a schematic diagram of the hardware structure for a data processing method for a data warehouse according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of a data processing method for a data warehouse according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a data processing device for a data warehouse according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of a data processing method for a data warehouse according to an embodiment of the present application.

Detailed Description of the Embodiments

To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope of protection of the present application.

It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the application described herein can be implemented. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to those expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.

In the present application, the orientations or positional relationships indicated by the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "transverse", "longitudinal", and the like are based on the orientations or positional relationships shown in the drawings. These terms are mainly used to better describe the present application and its embodiments, and are not intended to limit the indicated device, element, or component to a particular orientation or to being constructed and operated in a particular orientation.

In addition, besides indicating orientation or positional relationship, some of the above terms may also be used to express other meanings; for example, the term "on" may in some cases also express a relationship of attachment or connection. For those of ordinary skill in the art, the specific meanings of these terms in the present application can be understood according to the specific situation.

Furthermore, the terms "mounted", "arranged", "provided with", "connected", "coupled", and "sleeved" should be construed broadly. For example, a connection may be a fixed connection, a detachable connection, or an integral structure; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediary, or internal communication between two devices, elements, or components. For those of ordinary skill in the art, the specific meanings of the above terms in the present application can be understood according to the specific situation.

It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments may be combined with each other. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.

As shown in FIG. 1, which is a schematic diagram of the hardware structure for a data processing method for a data warehouse according to an embodiment of the present application, the data warehouse 100 is a subject-oriented, integrated, and relatively stable collection of data reflecting historical changes, used to support enterprise analysis reports and decision-making. The data in the warehouse comes from the integration of different data sources, such as data source 1, data source 2, and data source 3, and these sources may be stored in different ways, such as MySQL, Oracle, or Hive, so ETL operations are needed to integrate the different data sources. The ETL operations include, but are not limited to: extraction, that is, reading the data out of the data source; transformation, that is, data type conversion and dirty data cleaning; and loading, that is, loading the processed data into a target such as the data warehouse. Based on the data warehouse 100, business functions such as data reporting, data mining, and data analysis can be provided.
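
As an illustrative sketch of these three stages in Flink SQL (all table, field, and connection names below are assumptions), extraction, transformation, and loading can each be expressed as a SQL statement:

    -- Extraction: a source table over an assumed MySQL business table.
    CREATE TABLE src_orders (
        order_id    STRING,
        amount_text STRING,        -- raw value still needing type conversion
        create_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://mysql:3306/biz',
        'table-name' = 'orders',
        'username' = 'flink',
        'password' = '******'
    );

    -- Loading target: the corresponding Hudi table in the data warehouse.
    CREATE TABLE dw_orders (
        order_id    STRING,
        amount      DECIMAL(10, 2),
        create_time TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'hudi',
        'path' = 'hdfs:///warehouse/dw_orders'   -- assumed storage path
    );

    -- Transformation and loading: convert types, drop dirty rows, load.
    INSERT INTO dw_orders
    SELECT order_id,
           CAST(amount_text AS DECIMAL(10, 2)) AS amount,
           create_time
    FROM src_orders
    WHERE order_id IS NOT NULL
      AND amount_text IS NOT NULL;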

As shown in FIG. 2, the method includes the following steps S201 to S203:

Step S201: acquire data from different data sources based on the Flink framework and store the data in the Hudi data lake.

Step S202: establish a target data warehouse based on preset data extraction, transformation, and loading processing.

Step S203: execute a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.

From the above description, it can be seen that the present application achieves the following technical effects:

By establishing a target data warehouse based on preset data extraction, transformation, and loading processing, and executing a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, the purposes of completing report processing, querying business data in a timely manner, and cleaning the raw data of the decision engine are achieved, thereby realizing the technical effect of optimizing data storage and the computation process, and thus solving the technical problem of poor storage and computation performance for Internet finance data.

After the input source is configured on the data warehouse (platform) in the above steps S201 and S202, the data warehouse (platform) extracts the data according to the configured data source, puts the extracted data into the Hudi data lake for cleaning, and then outputs it to the configured data source.

As a preference in this embodiment, establishing the target data warehouse based on the preset data extraction, transformation, and loading processing includes: extracting data from the data sources in a preset component-based integration manner, transforming based on Hudi data lake storage, and loading based on the Flink technical framework; after the data is read out of the data source and subjected to data type conversion and dirty data cleaning, it is loaded into the target data warehouse.

As a preferred implementation, the entire data warehouse system is carried by the Flink technical framework, data storage uses the Hudi data lake, and data sources are integrated as components, supporting conventional data source operations. The data sources include, but are not limited to, Oracle, MySQL, HBase, Kafka, ES, MongoDB, and Redis.

As a preferred implementation, report processing, timely business data queries, and cleaning of the decision engine's raw data are completed based on the data warehouse.

As a preferred implementation, the data warehouse is lightweight and easy to migrate and deploy.

As a preferred implementation, visual configuration based on the data warehouse is simple under different preset business scenarios. The preset business scenarios include, but are not limited to, real-time data requirements, data tracking, and similar scenarios. That is, visual configuration is supported.

In the above step S203, a preset data processing operation is executed based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.

As a preferred implementation, for the incremental data processing mechanism, the database's CDC (change data capture) mechanism is used to extract the data; for databases without a CDC mechanism, a Kafka queue is used to first synchronize the data to Kafka, and the Kafka queue is then configured on the data warehouse (platform) as the input source for the data cleaning operation.
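
A sketch of such a CDC input source, assuming the Flink CDC connector for MySQL and illustrative connection details, might be:

    -- CDC input source via the Flink CDC connector for MySQL;
    -- connection details and table names are illustrative assumptions.
    CREATE TABLE src_accounts_cdc (
        account_id  STRING,
        balance     DECIMAL(16, 2),
        update_time TIMESTAMP(3),
        PRIMARY KEY (account_id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'mysql',
        'port' = '3306',
        'username' = 'flink',
        'password' = '******',
        'database-name' = 'biz',
        'table-name' = 'accounts'
    );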

As a preferred implementation, for the business data query operation, the business system prints the buried-point logs to a fixed file, which is synchronized to Kafka through a file extraction tool; the Kafka input source is then configured on the data warehouse platform to clean the data.

As a preferred implementation, for the decision engine raw data cleaning operation, either database data or Kafka data serves as the data input source, and ETL cleaning is then performed through Flink SQL. This SQL can be configured per business, and the result of each SQL execution is stored in the data lake as intermediate data.

As a preference in this embodiment, establishing the target data warehouse based on the preset data extraction, transformation, and loading processing also includes: establishing the target data warehouse based on preset full and/or incremental data extraction, transformation, and loading processing, where the full/incremental data is extracted to the Kafka queue using the Canal component.

In specific implementation, since both full data and incremental data exist under different scenarios, the two parts of the data need to be synchronized when establishing the data warehouse. That is, the full/incremental data is extracted to the Kafka queue using the Canal component.

As a preference in this embodiment, executing the preset data processing operation based on the target data warehouse also includes a user behavior data analysis operation: printing the buried-point logs of the business data to a fixed file through a preset data model and collecting them into a log file; extracting the log file to the Kafka queue; passing the data into the target data warehouse through its input source, cleaning the data with Flink SQL, and then organizing the data into the required form and storing it in HBase.

In specific implementation, the business data is first collected into log files through the data model; a file extraction component is then used to extract the logs to Kafka, and the input source connected to the data warehouse platform passes the data into the platform, where Flink SQL cleans it; the data is then organized into the required form and put into HBase for business use.

As a preference in this embodiment, executing the preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, includes: the preset data processing operation includes a report generation operation: after accessing the input source through the target data warehouse, cleaning the data based on SQL to obtain the data required by the target format; accessing the output source through the target data warehouse, storing the data, and displaying the data through a BI tool.

In specific implementation, the input source is accessed through the data warehouse platform and the data is cleaned through SQL, which may produce intermediate tables; after multiple rounds of cleaning, the various data required by the target format are finally formed. The output source is then accessed through the data warehouse platform, the data is stored, and the data is displayed through a BI tool.

As a preference in this embodiment, executing the preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, includes: the preset data processing operation includes a decision engine raw data cleaning operation: accessing a data source through the target data warehouse and cleaning the data using preset Flink SQL, where the Flink SQL is configured according to the business data and the execution result of each Flink SQL is stored into the Hudi data lake as intermediate data; cleaning the data into preset decision raw fields and storing them into the target data warehouse through an output source; and performing extraction, transformation, and loading on incremental data using an update field to establish the target data warehouse.

In specific implementation, the data source is accessed through the data warehouse platform, the data is cleaned with SQL into decision raw fields, and the result is stored into the target database through the output source. Incremental data is extracted to the data warehouse (platform) using an update field.

As a preference in this embodiment, executing the preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, includes: the preset data processing operation includes a business data query operation: extracting the full data to the Kafka queue; cleaning the data in the data warehouse through an input source; and configuring ES as the output source for the cleaning result.

In specific implementation, business data has relatively high real-time requirements, so the full data needs to be extracted to the Kafka queue; the data warehouse (platform) cleans the data through the input source, and after cleaning, ES is configured as the output source to facilitate retrieval of the business data.

It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that herein.

According to an embodiment of the present application, a data processing device for a data warehouse for implementing the above method is also provided. As shown in FIG. 3, the device includes:

a data processing module 301, configured to acquire data from different data sources based on the Flink framework and store the data in the Hudi data lake;

an establishment module 302, configured to establish a target data warehouse based on preset data extraction, transformation, and loading processing;

an execution processing module 303, configured to execute a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.

After the input source is configured on the data warehouse (platform) in the data processing module 301 and the establishment module 302 of the present application, the data warehouse (platform) extracts the data according to the configured data source, puts the extracted data into the Hudi data lake for cleaning, and then outputs it to the configured data source.

As a preference in this embodiment, establishing the target data warehouse based on the preset data extraction, transformation, and loading processing includes: extracting data from the data sources in a preset component-based integration manner, transforming based on Hudi data lake storage, and loading based on the Flink technical framework; after the data is read out of the data source and subjected to data type conversion and dirty data cleaning, it is loaded into the target data warehouse.

As a preferred implementation, the entire data warehouse system is carried by the Flink technical framework, data storage uses the Hudi data lake, and data sources are integrated as components, supporting conventional data source operations. The data sources include, but are not limited to, Oracle, MySQL, HBase, Kafka, ES, MongoDB, and Redis.

As a preferred implementation, report processing, timely business data queries, and cleaning of the decision engine's raw data are completed based on the data warehouse.

As a preferred implementation, the data warehouse is lightweight and easy to migrate and deploy.

As a preferred implementation, visual configuration based on the data warehouse is simple under different preset business scenarios. The preset business scenarios include, but are not limited to, real-time data requirements, data tracking, and similar scenarios.

In the execution processing module 303 of the present application, a preset data processing operation is executed based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.

As a preferred implementation, for the incremental data processing mechanism, the database's CDC mechanism is used to extract the data; for databases without a CDC mechanism, a Kafka queue is used to first synchronize the data to Kafka, and the Kafka queue is then configured on the data warehouse (platform) as the input source for the data cleaning operation.

As a preferred implementation, for the business data query operation, the business system prints the buried-point logs to a fixed file, which is synchronized to Kafka through a file extraction tool; the Kafka input source is then configured on the data warehouse platform to clean the data.

As a preferred implementation, for the decision engine raw data cleaning operation, either database data or Kafka data serves as the data input source, and ETL cleaning is then performed through Flink SQL. This SQL can be configured per business, and the result of each SQL execution is stored in the data lake as intermediate data.

Obviously, those skilled in the art should understand that the above modules or steps of the present application can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed on a network composed of multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they can each be made into an individual integrated circuit module, or multiple of the modules or steps can be made into a single integrated circuit module. In this way, the present application is not limited to any particular combination of hardware and software.

To better understand the flow of the above data processing method for a data warehouse, the above technical solutions are explained below with reference to preferred embodiments, which are not intended to limit the technical solutions of the embodiments of the present invention.

The data processing method for a data warehouse in the embodiments of the present application completes report processing, timely business data queries, and cleaning of the decision engine's raw data based on the data warehouse (platform). It is lightweight and easy to migrate and deploy, and visual configuration is simple under different business scenarios.

As shown in FIG. 4, which is a schematic flowchart of the data processing method for a data warehouse in an embodiment of the present application, the specific implementation process includes the following steps:

Step S401: establish a target data warehouse based on preset data extraction, transformation, and loading processing.

Step S402: determine whether there is incremental data.

Step S403: establish the target data warehouse based on preset full and/or incremental data extraction, transformation, and loading processing, where the full/incremental data is extracted to the Kafka queue using the Canal component.

For the incremental data processing mechanism, the database's CDC mechanism is used to extract the data; for databases without a CDC mechanism, Kafka is used to first synchronize the data to Kafka, and Kafka is then configured on the data warehouse platform as the input source for the data cleaning operation.

For process tracking business data, the business system prints the buried-point logs to a fixed file, which is synchronized to Kafka through a file extraction tool; the Kafka queue input source is then configured on the data warehouse platform to clean the data.

In the specific data cleaning process, either database data or Kafka queue data serves as the data input source, and ETL cleaning is then performed through Flink SQL; each business can be configured based on SQL, and the result of each SQL execution is stored in the data lake as intermediate data.

Step S404: extract data from the data sources in the preset component-based integration manner, transform based on Hudi data lake storage, and load based on the Flink technical framework.

Step S405: read the data out of the data source, perform data type conversion and dirty data cleaning, and load the data into the target data warehouse.

Step S406: execute a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.

For the report generation operation, the input source is accessed through the data warehouse platform and the data is cleaned through SQL, which may produce intermediate tables; after multiple rounds of cleaning, the various data required by the target format are finally formed. The output source is then accessed through the data warehouse platform, the data is stored, and the data is displayed through a BI tool.

For the decision engine raw data cleaning operation, the data source is accessed through the data warehouse platform, the data is cleaned with SQL into decision raw fields, and the result is stored into the target database through the output source. Incremental data is extracted to the data warehouse platform using an update field.

For the business data query operation, business data has relatively high real-time requirements, so the full data needs to be extracted to Kafka; the data warehouse (platform) cleans the data through the input source, and after cleaning, ES is configured as the output source to facilitate retrieval of the business data. The full and incremental data here are extracted to the Kafka queue using the Canal component.

In addition, for the user behavior data analysis operation, the business data is first collected into log files through the data model; a component then extracts the logs to the Kafka queue, and the input source connected to the data warehouse platform passes the data into the data warehouse (platform), where Flink SQL cleans it; the data is then organized into the required form and put into HBase for business use.

The above are only preferred embodiments of the present application and are not intended to limit it; for those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within its scope of protection.

Claims (10)

1. A data processing method for a data warehouse, comprising:
acquiring data from different data sources based on the Flink framework, and storing the data in the Hudi data lake;
establishing a target data warehouse based on preset data extraction, transformation, and loading processing;
executing a preset data processing operation based on the target data warehouse, wherein the preset data processing operation comprises at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.
2. The method of claim 1, wherein establishing the target data warehouse based on the preset data extraction, transformation, and loading processing further comprises:
establishing the target data warehouse based on preset full and/or incremental data extraction, transformation, and loading processing, wherein the full/incremental data is extracted to the Kafka queue using the Canal component.
3. The method of claim 2, wherein establishing the target data warehouse based on the preset data extraction, transformation, and loading processing comprises:
reading the data out of the data source, performing data type conversion and dirty data cleaning, and then loading the data into the target data warehouse.
4. The method of claim 1, wherein executing the preset data processing operation based on the target data warehouse further comprises a user behavior data analysis operation:
printing buried-point logs of the business data to a fixed file through a preset data model, and collecting them into a log file;
extracting the log file to a Kafka queue;
passing the data into the target data warehouse through the input source of the target data warehouse, cleaning the data using Flink SQL, and then organizing the data into the required form and storing it in HBase.
5. The method of claim 1, wherein the preset data processing operation executed based on the target data warehouse comprises at least one of: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, and the preset data processing operation comprises a report generation operation:
after accessing an input source through the target data warehouse, cleaning the data based on SQL to obtain the data required by the target format;
accessing an output source through the target data warehouse, storing the data, and displaying the data through a BI tool.
6. The method of claim 1, wherein the preset data processing operation executed based on the target data warehouse comprises at least one of: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, and the preset data processing operation comprises a decision engine raw data cleaning operation:
accessing a data source through the target data warehouse, and cleaning the data using preset Flink SQL, wherein the Flink SQL is configured according to the business data, and the execution result of each Flink SQL is stored into the Hudi data lake as intermediate data;
cleaning the data into preset decision raw fields, and storing them into the target data warehouse through an output source;
performing extraction, transformation, and loading on incremental data using an update field, to establish the target data warehouse.
7. The method of claim 1, wherein the preset data processing operation executed based on the target data warehouse comprises at least one of: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation, and the preset data processing operation comprises a business data query operation:
extracting the full data to the Kafka queue;
cleaning the data in the data warehouse through an input source;
configuring ES as the output source for the cleaning result.
8. A data processing apparatus for a data warehouse, comprising:
a data processing module, configured to acquire data from different data sources based on the Flink framework and store the data in the Hudi data lake;
an establishment module, configured to establish a target data warehouse based on preset data extraction, transformation, and loading processing;
an execution processing module, configured to execute a preset data processing operation based on the target data warehouse, wherein the preset data processing operation comprises at least one of: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any one of claims 1 to 7 when run.
10. An electronic device comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is arranged to run the computer program to perform the method of any one of claims 1 to 7.
CN202210090577.0A 2022-01-25 2022-01-25 Data processing method and device for data warehouse, storage medium and electronic device — Pending — CN114490610A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210090577.0A CN114490610A 2022-01-25 2022-01-25 Data processing method and device for data warehouse, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210090577.0A CN114490610A (en) 2022-01-25 2022-01-25 Data processing method and device for data bin, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN114490610A 2022-05-13

Family

ID=81473619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210090577.0A Pending CN114490610A 2022-01-25 Data processing method and device for data warehouse, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN114490610A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8452636B1 (en) * 2007-10-29 2013-05-28 United Services Automobile Association (Usaa) Systems and methods for market performance analysis
CN108846076A (en) * 2018-06-08 2018-11-20 山大地纬软件股份有限公司 The massive multi-source ETL process method and system of supporting interface adaptation
CN112527886A (en) * 2021-02-09 2021-03-19 中关村科学城城市大脑股份有限公司 Data warehouse system based on urban brain
CN113032495A (en) * 2021-03-23 2021-06-25 深圳市酷开网络科技股份有限公司 Multi-layer data storage system based on data warehouse, processing method and server
CN113704178A (en) * 2021-09-18 2021-11-26 京东方科技集团股份有限公司 Big data management method, system, electronic device and storage medium
CN113836235A (en) * 2021-09-29 2021-12-24 平安医疗健康管理股份有限公司 Data processing method based on data center and related equipment thereof

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115357654A (en) * 2022-08-22 2022-11-18 迪爱斯信息技术股份有限公司 Data processing method and device, and readable storage medium based on data lake
CN117609315A (en) * 2024-01-22 2024-02-27 中债金融估值中心有限公司 Data processing method, device, equipment and readable storage medium
CN117609315B (en) * 2024-01-22 2024-04-16 中债金融估值中心有限公司 Data processing method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN109753531A (en) A kind of big data statistical method, system, computer equipment and storage medium
CN102426609B (en) Index generation method and index generation device based on MapReduce programming architecture
CN107861859A (en) A kind of blog management method and system based on micro services framework
CN111966677B (en) Data report processing method and device, electronic equipment and storage medium
CN105138661A (en) Hadoop-based k-means clustering analysis system and method of network security log
CN109753502B (en) Data acquisition method based on NiFi
CN102254024A (en) Mass data processing system and method
CN106294715B (en) Association rule data mining method and device
CN102236672A (en) Method and device for importing data
CN114490610A (en) Data processing method and device for data bin, storage medium and electronic device
CN105488231A (en) Self-adaption table dimension division based big data processing method
CN112347071A (en) Power distribution network cloud platform data fusion method and power distribution network cloud platform
CN112115114A (en) Method, device, device and storage medium for log processing
CN113032495B (en) Multi-layer data storage system, processing method and server based on data warehouse
CN115033646B (en) Method for constructing real-time warehouse system based on Flink and Doris
CN106503274A (en) A kind of Data Integration and searching method and server
CN112817958A (en) Electric power planning data acquisition method and device and intelligent terminal
CN107682395A (en) A kind of big data cloud computing runtime and method
CN113918532A (en) Image tag aggregation method, electronic device and storage medium
CN110704442A (en) Real-time acquisition method and device for big data
CN117573752A (en) Big data information statistics acquisition method and system
CN104050291B (en) A kind of method for parallel processing and system of account balance data
CN112579567A (en) MinIO-based industrial quality inspection file distributed storage system and method
CN111125045A (en) Lightweight ETL processing platform
CN115062028A (en) Method for multi-table join query in OLTP field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination