WO2023082681A1 - Data processing method and apparatus based on batch-stream integration, computer device, and medium - Google Patents

Data processing method and apparatus based on batch-stream integration, computer device, and medium

Info

Publication number
WO2023082681A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
processing
layer
processed
module
Application number
PCT/CN2022/105078
Other languages
French (fr)
Chinese (zh)
Inventor
罗静
王博一
王晓
霍星志
郭宇鹏
毛少将
Original Assignee
通号通信信息集团有限公司
Application filed by 通号通信信息集团有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Definitions

  • the data to be processed is processed layer by layer to obtain the first data; wherein, in each processing layer, the data input to that layer is processed to obtain processed data, the processed data being real-time data; the processed real-time data is stored in the Hive module based on a Flink stream, and the processed data is input to the next processing layer; the first data is the processed data obtained by the last processing layer in the processing link;
  • the first processing module is configured to process the data to be processed layer by layer according to the processing link to obtain first data
  • the offline data is obtained from the Hive module, and the offline data is used to correct the erroneous real-time data.
  • Since the data is passed and processed layer by layer, a change in the processing result of one processing layer causes corresponding changes in the results of the subsequent layers, so the corrected data needs to be input to the next processing layer, which re-processes the data.
  • the data processing device can connect to visual display components (such as Tableau), and query the full amount of data in a custom way on the web client (Web), thereby supporting the visual display of front-end data.
  • FIG. 3 is a schematic diagram of the first structure of the data processing device provided by the embodiment of the present disclosure.
  • the data processing device includes an acquisition module 101, a first processing module 102 and a second processing module 103; the second processing module 103 forms a data application layer, the first processing module 102 includes a plurality of processing layers that together form a processing link, and each processing layer includes a first processing unit 1021 and a second processing unit 1022.
  • the acquiring module 101 is configured to acquire data to be processed, and the data to be processed includes real-time data.
  • the ODS layer, DWD layer and DWS layer are connected through the Kafka module, so that data is exchanged and passed layer by layer.
  • the second processing module 203 is located at the ADS layer and may be an OLAP module.
  • the query module 204 is respectively connected to the Hive module of each processing layer and the OLAP module of the ADS layer, so as to realize cross-source query.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a data processing method based on batch-stream integration, comprising: obtaining data to be processed; processing the data to be processed layer by layer according to a processing link to obtain first data, wherein, in each processing layer, the data input to that layer is processed to obtain processed real-time data, the processed real-time data is stored in a Hive module on the basis of a Flink stream, and the processed data is input to the next processing layer; processing the first data in a data application layer to obtain second data; and, in response to detecting an error in the second data, correcting the erroneous data in the processing layer in which the data error occurred according to the offline data of that layer, and inputting the corrected data to the next processing layer so that the next processing layer processes the input data. The present disclosure further provides a data processing apparatus, a computer device, and a medium.

Description

Data processing method and apparatus based on batch-stream integration, computer device, and medium
Technical Field
The present disclosure relates to, but is not limited to, the technical field of big data processing.
Background
A real-time/offline fusion platform is essentially a kind of data warehouse. As product requirements and internal decision-making demand increasingly timely data, real-time data warehouse capabilities are needed. The data timeliness of a traditional offline data warehouse is T+1, with scheduling performed on a daily basis, which cannot support the data needs of real-time scenarios. Even if the scheduling frequency is set to hourly, this only addresses scenarios with modest timeliness requirements and still cannot satisfy scenarios with strict timeliness requirements.
A real-time data warehouse can effectively solve the above problems. However, Kafka (an open-source stream processing platform) is only a temporary storage medium, and its data is subject to a timeout period, for example keeping only 7 days of data. This leads to the loss of historical data: when a real-time task fails, the data cannot be recomputed and corrected because no historical data is available.
Moreover, in the related art, a real-time data warehouse built on the Lambda architecture suffers from a split between offline and real-time processing: the same data source produces two different computation results, one offline and one real-time. In addition, two frameworks, one real-time and one offline, must be maintained, which increases operation and maintenance costs.
Summary
The present disclosure provides a data processing method and apparatus based on batch-stream integration, a computer device, and a medium.
In a first aspect, an embodiment of the present disclosure provides a data processing method based on batch-stream integration. The method is applied to a data processing apparatus that includes a data application layer and a plurality of processing layers, the processing layers forming a processing link. The method includes:
acquiring data to be processed, the data to be processed being real-time data;
processing the data to be processed layer by layer according to the processing link to obtain first data; wherein, in each processing layer, the data input to that layer is processed to obtain processed data, the processed data being real-time data, the processed real-time data is stored in a Hive module based on a Flink stream, and the processed data is input to the next processing layer; the first data is the processed data obtained by the last processing layer in the processing link;
processing the first data in the data application layer to obtain second data;
in response to detecting that the second data is erroneous, correcting, in the processing layer in which the data error occurred, the erroneous data according to the offline data of that layer to obtain corrected data, and inputting the corrected data to the next processing layer so that the next processing layer processes the input data.
In another aspect, an embodiment of the present disclosure further provides a data processing apparatus including an acquisition module, a first processing module, and a second processing module. The second processing module forms a data application layer; the first processing module includes a plurality of processing layers that form a processing link; and each processing layer includes a first processing unit and a second processing unit;
the acquisition module is configured to acquire data to be processed, the data to be processed including real-time data;
the first processing module is configured to process the data to be processed layer by layer according to the processing link to obtain first data;
wherein the first processing unit is configured to: process the data input to its processing layer to obtain processed data, the processed data being real-time data; store the processed real-time data in a Hive module based on a Flink stream; and input the processed data to the next processing layer, the first data being the processed data obtained by the last processing layer in the processing link; and to receive corrected data sent by the second processing unit and input the corrected data to the first processing unit of the next processing layer, so that the first processing unit of the next processing layer processes the input data;
the second processing unit is configured to, in response to a data error occurring in its processing layer, correct the erroneous data according to the offline data of that layer to obtain corrected data, and send the corrected data to the first processing unit;
the second processing module is configured to process the first data in the data application layer to obtain second data.
In another aspect, an embodiment of the present disclosure further provides a computer device including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the data processing method based on batch-stream integration described above.
In another aspect, an embodiment of the present disclosure further provides a computer-readable medium on which a computer program is stored, wherein the program, when executed, implements the data processing method based on batch-stream integration described above.
Brief Description of the Drawings
FIG. 1 is a first schematic flowchart of the data processing method based on batch-stream integration provided by an embodiment of the present disclosure;
FIG. 2 is a second schematic flowchart of the data processing method based on batch-stream integration provided by an embodiment of the present disclosure;
FIG. 3 is a first schematic structural diagram of the data processing apparatus provided by an embodiment of the present disclosure;
FIG. 4 is a second schematic structural diagram of the data processing apparatus provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a specific example of the data processing apparatus provided by an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the embodiments of the present disclosure are described in further detail below through specific implementations in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only intended to explain the present disclosure, not to limit it.
An embodiment of the present disclosure provides a data processing method based on batch-stream integration. The method is applied to a data processing apparatus that includes a data application (Application Data Service, ADS) layer and a plurality of processing layers. The processing layers form a processing link, and data processed along the link reaches the data application layer. In some embodiments, the processing layers may include an ODS (Operational Data Store) layer, a DWD (Data Warehouse Detail) layer, and a DWS (Data Warehouse Service) layer, forming the processing link ODS layer -> DWD layer -> DWS layer. Data is passed through the processing layers in the order of the processing link, and the processing result of one layer serves as the data source of the next layer.
FIG. 1 is a first schematic flowchart of the data processing method based on batch-stream integration provided by an embodiment of the present disclosure. As shown in FIG. 1, the method includes steps 11 to 14.
Step 11: acquire data to be processed, the data to be processed being real-time data.
Real-time data refers to data whose lifetime (that is, the length of time the data has existed) is less than or equal to the timeout period.
Step 12: process the data to be processed layer by layer according to the processing link to obtain first data; wherein, in each processing layer, the data input to that layer is processed to obtain processed data, the processed data being real-time data, the processed real-time data is stored in the Hive module based on a Flink stream, and the processed data is input to the next processing layer; the first data is the processed data obtained by the last processing layer in the processing link.
In this step, the data to be processed is processed layer by layer along the processing link formed by the processing layers; the processing result of one layer is input to the next layer as its data source. In each processing layer, the real-time data input to that layer is processed to obtain processed real-time data. The processed real-time data is then handled in two ways. On the one hand, it is stored in the Hive module based on a Flink stream, to facilitate subsequent querying and retrieval. It should be noted that the processed real-time data becomes offline data once its lifetime exceeds the timeout period. Hive is a Hadoop-based data warehouse tool used for data extraction, transformation, and loading; it provides a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive is not suited to online transaction processing and does not provide real-time queries; it is suited to batch jobs over large volumes of immutable data. On the other hand, the processed real-time data is input, as the processing result of the current layer, to the next processing layer in the processing link so that subsequent processing can continue. In this manner, the processing result of the last processing layer in the processing link is obtained, and this result is the first data.
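The layer-by-layer flow of step 12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ProcessingLayer` class, the in-memory `archive` list standing in for each layer's Hive module, and the three toy transforms are all hypothetical and are not part of the disclosure, which uses Flink and Hive for these roles.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class ProcessingLayer:
    """One layer of the chain: process input, archive a copy, forward the result."""

    def __init__(self, name: str, transform: Callable[[Record], Record]):
        self.name = name
        self.transform = transform
        # Hypothetical in-memory stand-in for this layer's Hive module.
        self.archive: List[Record] = []

    def process(self, records: List[Record]) -> List[Record]:
        processed = [self.transform(r) for r in records]
        self.archive.extend(processed)  # keep a copy for later query or correction
        return processed                # becomes the input of the next layer


def run_chain(layers: List[ProcessingLayer], pending: List[Record]) -> List[Record]:
    """Pass the data through every layer in order; the last output is the first data."""
    data = pending
    for layer in layers:
        data = layer.process(data)
    return data


# Toy transforms, assumed purely for illustration.
ods = ProcessingLayer("ODS", lambda r: dict(r))
dwd = ProcessingLayer("DWD", lambda r: {**r, "amount": float(r["amount"])})
dws = ProcessingLayer("DWS", lambda r: {"user": r["user"], "total": r["amount"]})

first_data = run_chain([ods, dwd, dws], [{"user": "u1", "amount": "12.5"}])
```

Each layer both archives and forwards its output, so every intermediate result remains available after the data has moved on down the link.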
Step 13: process the first data in the data application layer to obtain second data.
The second processing (that is, the processing performed in the data application layer) refers to data analysis, for example OLAP (Online Analytical Processing). OLAP is a collection of analysis-oriented operations built on the multidimensional model of a data warehouse. The strengths of OLAP derive from the subject-oriented, integrated, history-preserving, and immutable data storage of the data warehouse, and from the multi-perspective, multi-level data organization of the multidimensional model.
In this step, the OLAP module analyzes the first data obtained from the processing link. The analysis result is the second data, which is the full data set: it includes both real-time data and offline data, all stored in the OLAP module, thereby fusing real-time data and offline data.
Step 14: in response to detecting that the second data is erroneous, correct, in the processing layer in which the data error occurred, the erroneous data according to the offline data of that layer to obtain corrected data, and input the corrected data to the next processing layer so that the next processing layer processes the input data.
Offline data, also called historical data, refers to data whose lifetime is greater than the timeout period. For example, if the timeout period is 7 days, data whose lifetime is less than or equal to 7 days is real-time data; once the lifetime of the data exceeds 7 days, the data becomes offline data.
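The distinction between real-time and offline data reduces to comparing a lifetime against the timeout period. A minimal sketch follows; the function name `classify` and the use of a creation timestamp to measure lifetime are assumptions made for illustration, and the 7-day value is the example given in the text.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(days=7)  # example timeout period from the text


def classify(created_at: datetime, now: datetime, timeout: timedelta = TIMEOUT) -> str:
    """Return "real-time" while lifetime <= timeout, otherwise "offline"."""
    lifetime = now - created_at
    return "real-time" if lifetime <= timeout else "offline"
```

Note that the boundary is inclusive: data that has existed for exactly the timeout period still counts as real-time data.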
In this step, if the data in the ADS layer is found to be erroneous, the processing layer in which the data error occurred can be identified. In that processing layer, offline data is obtained from the Hive module and used to correct the erroneous real-time data. Because data is passed and processed layer by layer, a change in the processing result of one layer causes corresponding changes in the results of the subsequent layers. The corrected data therefore needs to be input to the next processing layer, which re-processes the data.
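The correction and re-propagation described in step 14 can be sketched as follows. The text does not specify how the faulty layer is located or how the offline record is selected, so both are simply passed in as parameters here; the function name `correct_and_replay` and the toy transforms are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Transform = Callable[[Record], Record]


def correct_and_replay(
    transforms: List[Transform],
    faulty_layer: int,
    offline_record: Record,
) -> Record:
    """Replace the faulty layer's output with its offline copy, then re-run
    every downstream layer so that later results reflect the correction."""
    data = offline_record
    for transform in transforms[faulty_layer + 1:]:
        data = transform(data)
    return data


# Toy three-layer link, assumed purely for illustration.
transforms: List[Transform] = [
    lambda r: dict(r),                        # layer 0: pass-through
    lambda r: {**r, "gross": r["net"] * 2},   # layer 1: derive a field
    lambda r: {"total": r["gross"] + 1},      # layer 2: summarize
]

# Suppose layer 1 emitted a wrong "gross"; its offline copy holds the right value.
corrected = correct_and_replay(transforms, 1, {"net": 10, "gross": 20})
```

Only the layers after the faulty one are re-run, which mirrors the statement that the corrected data is input to the next processing layer rather than re-entering the link from the start.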
In the data processing method based on batch-stream integration provided by the embodiments of the present disclosure, the data processing apparatus includes a data application layer and a plurality of processing layers that form a processing link. The method includes: acquiring data to be processed, the data to be processed being real-time data; processing the data to be processed layer by layer according to the processing link to obtain first data, wherein, in each processing layer, the data input to that layer is processed to obtain processed data, the processed data being real-time data, the processed real-time data is stored in the Hive module based on a Flink stream, and the processed data is input to the next processing layer; processing the first data in the data application layer to obtain second data; and, in response to detecting that the second data is erroneous, correcting, in the processing layer in which the data error occurred, the erroneous data according to the offline data of that layer to obtain corrected data, and inputting the corrected data to the next processing layer so that the next processing layer processes the input data. The embodiments of the present disclosure support the collection and preprocessing of both real-time and offline data and fuse the two, so that the data shares the same source, the same computing engine, and the same computation rules. This simplifies the data application architecture: a single system architecture supports both offline and real-time data analysis, reducing architectural complexity and operation and maintenance costs.
FIG. 2 is a second schematic flowchart of the data processing method based on batch-stream integration provided by an embodiment of the present disclosure. In some embodiments, as shown in FIG. 2, after the first data is processed in the data application layer to obtain the second data, the method may further include steps 21 and 22.
Step 21: in response to receiving a data query request, obtain a query result, the query result including at least one of the following: the offline data of each processing layer, and the second data.
In this step, different data sources can be queried separately, or associated queries can be performed across data sources. A separate query retrieves the offline data of an individual processing layer; an associated query jointly retrieves the offline data of the processing layers and the second data.
In some embodiments, data queries can be implemented with OpenLooKeng or Presto. OpenLooKeng is an open-source, high-performance data virtualization engine that provides a unified SQL (Structured Query Language) interface, supports analysis across data sources and data centers, and targets converged query scenarios such as interactive, batch, and stream queries. OpenLooKeng can connect to the Hive module and the OLAP module to provide unified querying of offline and real-time data. Presto is a data query engine capable of fast interactive analysis over more than 250 PB of data.
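The difference between a separate query and an associated query can be illustrated with Python's built-in `sqlite3` module, where an attached database plays the role of a second data source. This is only an analogy: SQLite's `ATTACH DATABASE` is used here as a stand-in for a cross-source engine and is not the OpenLooKeng or Presto interface, and the schema name `hive`, the table names, and the sample rows are all hypothetical.

```python
import sqlite3

# Main database: stand-in for the OLAP source holding the second data.
conn = sqlite3.connect(":memory:")
# Attached database: stand-in for one processing layer's Hive module.
conn.execute("ATTACH DATABASE ':memory:' AS hive")

conn.execute("CREATE TABLE hive.dwd_orders (user_id TEXT, amount REAL)")
conn.execute("CREATE TABLE ads_totals (user_id TEXT, total REAL)")
conn.executemany(
    "INSERT INTO hive.dwd_orders VALUES (?, ?)",
    [("u1", 5.0), ("u1", 7.5), ("u2", 3.0)],
)
conn.executemany(
    "INSERT INTO ads_totals VALUES (?, ?)",
    [("u1", 12.5), ("u2", 3.0)],
)

# Separate query: a single source on its own.
separate = conn.execute("SELECT COUNT(*) FROM hive.dwd_orders").fetchone()[0]

# Associated query: one SQL statement joining both sources.
associated = conn.execute(
    """
    SELECT a.user_id, a.total, COUNT(o.amount) AS n_orders
    FROM ads_totals AS a
    JOIN hive.dwd_orders AS o ON o.user_id = a.user_id
    GROUP BY a.user_id, a.total
    ORDER BY a.user_id
    """
).fetchall()
```

The point of the sketch is that the caller writes one SQL statement and does not need to know which source holds which table.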
Step 22: send the query result.
The embodiments of the present disclosure enable unified querying of offline data and real-time data.
In the related art, data processing schemes based on batch-stream integration lack a unified external query interface, which makes managing persisted data complex. To solve this problem, the embodiments of the present disclosure provide a unified external query interface.
In some embodiments, sending the query result includes the following step: sending the query result through a preset query interface, where the query interface may include at least one of the following: a JDBC API and a REST API.
The JDBC (Java Database Connectivity) API (Application Programming Interface) is a Java interface for executing SQL. Through the JDBC API, a program can connect to a relational database and use SQL statements to query and update data.
REST in "RESTful API" stands for Representational State Transfer. Put simply, resources are identified by URLs (Uniform Resource Locators), and operations on those resources are expressed with HTTP methods. A RESTful API is an API in the REST style, typically built on the HTTP protocol, which ensures secure transmission of the exchanged data. After a terminal sends a data query request to the server, if a RESTful API is not used, a corresponding return format has to be defined for each terminal's data query request to suit the front-end display. A RESTful API, by contrast, requires the front end to send data query requests in a predefined syntax, so the server can define a single unified response interface and no longer needs to parse data query requests in a variety of formats, which simplifies interface management.
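The idea of a predefined request format paired with one unified response format can be sketched without any web framework. Everything named below is an assumption made for illustration: the request fields `source` and `limit`, the envelope fields `code`, `message`, and `data`, and the set of source names are not specified in the disclosure.

```python
import json
from typing import Any, Dict, List

ALLOWED_SOURCES = {"ods", "dwd", "dws", "ads"}  # hypothetical source names


def handle_query(request: Dict[str, Any], tables: Dict[str, List[dict]]) -> str:
    """Validate a request in the predefined format and return a JSON string
    that always has the same envelope, whatever the source or outcome."""
    source = request.get("source")
    if source not in ALLOWED_SOURCES:
        return json.dumps({"code": 400, "message": "unknown source", "data": []})
    limit = int(request.get("limit", 100))
    rows = tables.get(source, [])[:limit]
    return json.dumps({"code": 200, "message": "ok", "data": rows})
```

Because every response shares one envelope, a front end needs only a single parser regardless of which layer's data it requested.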
By providing a REST API, the data processing apparatus can connect to visualization components (such as Tableau), and the full data set can be queried in a user-defined manner from a web client, thereby supporting front-end data visualization.
In some embodiments, processing the data input to the current processing layer includes the following step: processing the data input to the current processing layer with a stream data processing engine. In the embodiments of the present disclosure, the Kafka module of each processing layer uses Flink to process the data input to that layer.
Kafka is an open-source stream processing platform written in Scala and Java. It is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data generated by consumers on a website.
Flink is an open-source stream processing framework whose core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime can execute both batch and stream processing programs. When executed, a Flink program is mapped to a streaming dataflow; each Flink dataflow starts with one or more sources (data inputs, such as a message queue or a file system) and ends with one or more sinks (data outputs, such as a message queue, a file system, or a database).
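The source, operator, and sink structure of a dataflow can be illustrated with plain Python generators. This sketch does not use the Flink API; the function names `source`, `map_op`, `filter_op`, and `sink` are hypothetical and only mirror the concepts named in the text.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def source(events: Iterable[T]) -> Iterator[T]:
    """Data input: emits records one at a time."""
    for event in events:
        yield event


def map_op(fn: Callable[[T], U], stream: Iterator[T]) -> Iterator[U]:
    """Operator: transforms each record as it arrives."""
    for item in stream:
        yield fn(item)


def filter_op(pred: Callable[[T], bool], stream: Iterator[T]) -> Iterator[T]:
    """Operator: keeps only records that satisfy the predicate."""
    for item in stream:
        if pred(item):
            yield item


def sink(stream: Iterator[T], out: List[T]) -> None:
    """Data output: drains the stream into a destination."""
    for item in stream:
        out.append(item)


results: List[int] = []
sink(filter_op(lambda n: n % 2 == 0, map_op(int, source(["1", "2", "3", "4"]))), results)
```

Because generators are lazy, each record flows through every operator before the next record is read, which is the pipelined behavior the paragraph describes, here without any parallelism.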
It should be noted that a batch data processing engine can also be used to process the data input to the current processing layer, but its real-time performance is poorer than that of a stream data processing engine.
In some embodiments, correcting the erroneous data according to the offline data of the current processing layer includes the following step: using a stream data processing engine to correct the erroneous data according to the offline data of the current processing layer. In the embodiments of the present disclosure, the Hive module of each processing layer uses Flink to correct the erroneous data of that layer according to the stored offline data of that layer. That is, the Hive module uses Flink to modify the topic of the real-time data according to the offline data and returns the modified real-time data to the Kafka module.
在一些实施例中,待处理数据可以包括日志数据和业务数据,相应的,所述获取待处理数据,可以包括以下步骤:通过CDC(Change Data Capture,变更数据获取)的方式从业务数据库中获取业务数据,并根据日志收集系统(Flume)获取日志数据。In some embodiments, the data to be processed may include log data and business data. Correspondingly, the acquisition of the data to be processed may include the following steps: obtain from the business database by way of CDC (Change Data Capture, change data acquisition) Business data, and obtain log data according to the log collection system (Flume).
CDC可以监测并捕获数据库的变动(包括数据或数据表的插入、更新以及删除等),将数据库的变动按发生的顺序完整记录下来,写入到消息中间件中以供其他服务进行订阅及消费。CDC can monitor and capture changes in the database (including insertion, update and deletion of data or data tables, etc.), record the changes in the database in the order they occur, and write them into the message middleware for other services to subscribe and consume .
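The ordering guarantee described above is what allows a subscriber to rebuild the table state from the change stream. The following sketch illustrates the idea with an in-memory list standing in for the message middleware; `ChangeEvent`, `ChangeLog`, and `replay` are hypothetical names, and no real CDC tool or database is involved.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    seq: int              # position in the change log; preserves commit order
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row image; None for deletes


class ChangeLog:
    """In-memory stand-in for the message middleware that receives captured changes."""

    def __init__(self) -> None:
        self.events: List[ChangeEvent] = []

    def append(self, op: str, key: int, row: Optional[dict] = None) -> ChangeEvent:
        event = ChangeEvent(len(self.events), op, key, row)
        self.events.append(event)
        return event


def replay(events: List[ChangeEvent]) -> Dict[int, dict]:
    """Consume the events in sequence order and rebuild the current table state."""
    table: Dict[int, dict] = {}
    for event in sorted(events, key=lambda e: e.seq):
        if event.op == "delete":
            table.pop(event.key, None)
        else:  # insert and update both carry the full new row image
            table[event.key] = dict(event.row)
    return table


log = ChangeLog()
log.append("insert", 1, {"name": "alice", "balance": 10})
log.append("insert", 2, {"name": "bob", "balance": 5})
log.append("update", 1, {"name": "alice", "balance": 25})
log.append("delete", 2)
replica = replay(log.events)
```

Replaying the same events in a different order could resurrect the deleted row or keep a stale balance, which is why the changes must be recorded in the order they occur.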
In the embodiments of the present disclosure, the business database may be a relational database, for example a MySQL database.
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transmitting massive volumes of log data. Flume supports customizing various data senders within a logging system in order to collect data; it can also perform simple processing on the data and write the processed data to data receivers.
The batch-stream-integrated data processing solution of the embodiments of the present disclosure supports the collection and preprocessing of both real-time data and offline data, supports unified data queries, and, by exposing external interfaces, supports JDBC and RESTful publishing. It can fuse real-time data with offline data and can solve problems found in big data platforms such as inconsistent processing of real-time and offline data and complicated management of persisted data.
The batch-stream-integrated data processing solution of the embodiments of the present disclosure supports batch-stream integration and extends OLAP capability to all scenarios. With one data model and one SQL statement, both batch data and stream data can be accessed, and data applications are given a unified query interface. Compared with the Lambda architecture, the solution uses the same data source, the same computing engine, and the same calculation criteria, supports analysis of both historical data and near-real-time data, reduces architectural complexity, and lowers operation and maintenance costs. It can help enterprises greatly simplify their data application architecture, meeting different requirements with a single system architecture and thereby responding to business needs with greater agility.
An embodiment of the present disclosure further provides a data processing apparatus. FIG. 3 is a first schematic structural diagram of the data processing apparatus provided by an embodiment of the present disclosure. As shown in FIG. 3, the data processing apparatus includes an acquisition module 101, a first processing module 102, and a second processing module 103. The second processing module 103 forms a data application layer. The first processing module 102 includes a plurality of processing layers, the processing layers form a processing chain, and each processing layer includes a first processing unit 1021 and a second processing unit 1022.
The acquisition module 101 is configured to acquire data to be processed, the data to be processed including real-time data.
The first processing module 102 is configured to process the data to be processed layer by layer along the processing chain to obtain first data.
The first processing unit 1021 is configured to: process the data input to the current processing layer to obtain processed data, the processed data being real-time data; store the processed real-time data in the Hive module by means of a Flink stream; and input the processed data to the next processing layer, the first data being the processed data obtained by the last processing layer in the processing chain. The first processing unit 1021 is further configured to receive corrected data sent by the second processing unit 1022 and to input the corrected data to the first processing unit of the next processing layer, so that the first processing unit of the next processing layer processes the input data.
The second processing unit 1022 is configured to, in response to a data error occurring in the current processing layer, correct the erroneous data according to the offline data of the current processing layer to obtain corrected data, and send the corrected data to the first processing unit 1021.
The second processing module 103 is configured to process the first data in the data application layer to obtain second data.
In some embodiments, the second processing unit 1022 is configured to store the processed data so as to generate the offline data of the current processing layer.
In some embodiments, the first processing unit 1021 is configured to process the data input to the current processing layer by using a stream data processing engine.
FIG. 4 is a second schematic structural diagram of the data processing apparatus provided by an embodiment of the present disclosure. In some embodiments, as shown in FIG. 4, the data processing apparatus further includes a query module 104. The query module 104 is configured to, in response to receiving a data query request, obtain a query result, the query result including at least one of the following: the offline data of each processing layer, and the second data; and to send the query result.
In some embodiments, the query module 104 is configured to send the query result through a preset query interface, the query interface including at least a JDBC API interface and a Rest API interface.
In some embodiments, the second processing unit 1022 is configured to correct the erroneous data according to the offline data of the current processing layer by using a stream data processing engine.
In some embodiments, the data to be processed includes log data and business data, and the acquisition module 101 is configured to acquire the business data from a business database by means of change data capture (CDC) and to acquire the log data by means of a log collection system.
To describe the technical solution of the embodiments of the present disclosure clearly, the solution is explained below through a specific example with reference to FIG. 5. FIG. 5 is a schematic structural diagram of a specific example of the data processing apparatus provided by an embodiment of the present disclosure. As shown in FIG. 5, an embodiment of the present disclosure provides a batch-stream-integrated data processing apparatus that includes an acquisition module 201, a first processing module 202, a second processing module 203, and a query module 204. The first processing module 202 includes an ODS layer, a DWD layer, and a DWS layer. Each of these three processing layers includes a Kafka module and a Hive module, and the Kafka module and the Hive module of one processing layer form one processing unit. The three processing layers form a processing chain in the order ODS layer -> DWD layer -> DWS layer; the ODS, DWD, and DWS layers are connected to one another through their Kafka modules, so that data is passed on layer by layer. The second processing module 203 is located in the ADS layer and may be an OLAP module. The query module 204 is connected to the Hive module of each processing layer and to the OLAP module of the ADS layer, enabling cross-source queries.
The acquisition module 201 can acquire business data from a MySQL database by means of CDC and collect log data from Flume, and sends the business data and the log data to the Kafka module in the ODS layer.
Taking the ODS layer as an example, the Kafka module uses Flink to process the real-time data input to the ODS layer, obtains the processed real-time data, and loads it through a Flink stream into the Hive module for storage. The Kafka module of the ODS layer sends the processed real-time data to the Kafka module of the DWD layer, so that data processing continues in the DWD layer.
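The per-layer behavior described above (process the record, persist the result as the layer's offline data, forward it to the next layer) can be sketched in a few lines. This is an illustrative model only: the `Layer` class, the list used in place of a Hive module, and the three transforms are assumptions, and neither Kafka nor Flink is actually invoked.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class Layer:
    """One processing layer: a transform plus an offline store standing in for its Hive module."""

    def __init__(self, name: str, transform: Callable[[Record], Record]) -> None:
        self.name = name
        self.transform = transform
        self.offline: List[Record] = []  # processed records persisted by this layer

    def process(self, record: Record) -> Record:
        out = self.transform(record)
        self.offline.append(out)  # keep a copy as this layer's offline data
        return out                # the same record is handed to the next layer


def run_chain(layers: List[Layer], record: Record) -> Record:
    """Pass one record through the layers in order; the last output is the first data."""
    for layer in layers:
        record = layer.process(record)
    return record


# Hypothetical transforms for the three layers of the example.
ods = Layer("ODS", lambda r: {**r, "amount": float(r["amount"])})
dwd = Layer("DWD", lambda r: {**r, "amount_cents": round(r["amount"] * 100)})
dws = Layer("DWS", lambda r: {"user": r["user"], "total_cents": r["amount_cents"]})

first_data = run_chain([ods, dwd, dws], {"user": "u1", "amount": "12.50"})
```

After the run, every layer holds its own copy of the intermediate result, so a later query or correction can address the output of any individual layer rather than only the final one.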
The query module 204 uses an OpenLooKeng connector provided with a JDBC API interface and a Rest API interface. After receiving a data query request through one of these interfaces, it issues a data query to at least one of the following modules: the Hive modules of the processing layers and the OLAP module, and returns at least one of the following kinds of retrieved data through the same interface: offline data and real-time data.
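The cross-source behavior of the query module can be pictured as one entry point that fans a request out to several registered stores and tags each returned row with its origin. The sketch below is a plain in-memory model under that assumption; `QueryModule`, the source names, and the sample rows are hypothetical, and it does not reproduce the OpenLooKeng, JDBC, or REST interfaces.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class QueryModule:
    """Single entry point that queries several registered stores and merges the results."""

    def __init__(self) -> None:
        self._sources: Dict[str, List[Row]] = {}

    def register(self, name: str, rows: List[Row]) -> None:
        # Attach a store, e.g. one layer's offline data or the OLAP result set.
        self._sources[name] = rows

    def query(self, names: Iterable[str],
              predicate: Callable[[Row], bool] = lambda row: True) -> List[Row]:
        results: List[Row] = []
        for name in names:
            for row in self._sources[name]:
                if predicate(row):
                    results.append({"source": name, **row})  # tag the row with its origin
        return results


qm = QueryModule()
qm.register("dwd_hive", [{"user": "u1", "amount_cents": 1250},
                         {"user": "u2", "amount_cents": 300}])
qm.register("ads_olap", [{"user": "u1", "total_cents": 1250}])
rows = qm.query(["dwd_hive", "ads_olap"], lambda r: r["user"] == "u1")
```

Tagging rows with their source makes it possible to compare the value held by an intermediate layer with the value served by the application layer in a single request.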
When a data query reveals that the second data in the OLAP module is erroneous, and the error occurred in the DWD layer, the offline data stored in the Hive module of the DWD layer is used to correct the erroneous data. The corrected data is input to the Kafka module of the DWS layer, and the Kafka module of the DWS layer continues the data processing.
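A minimal sketch of this correction path follows. The disclosure does not specify how the offline data is matched against the erroneous records, so the sketch assumes that the DWD offline store holds a reference copy keyed by a record `id`; `repair_from_offline`, `dws_transform`, and the sample values are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, int]


def repair_from_offline(realtime: List[Record],
                        offline_by_id: Dict[int, Record]) -> List[Record]:
    """Replace every real-time record that disagrees with its offline reference copy."""
    repaired: List[Record] = []
    for rec in realtime:
        ref = offline_by_id.get(rec["id"])
        if ref is not None and ref != rec:
            repaired.append(dict(ref))  # assumption: the offline copy is taken as correct
        else:
            repaired.append(rec)
    return repaired


def dws_transform(rec: Record) -> Record:
    # Stand-in for the processing performed by the next (DWS) layer.
    return {"id": rec["id"], "total_cents": rec["amount_cents"]}


# Record 2 carries a sign error introduced in the DWD layer.
dwd_realtime = [{"id": 1, "amount_cents": 1250}, {"id": 2, "amount_cents": -300}]
dwd_offline = {2: {"id": 2, "amount_cents": 300}}

corrected = repair_from_offline(dwd_realtime, dwd_offline)
dws_output = [dws_transform(r) for r in corrected]  # corrected data re-enters the chain
```

Only the layers downstream of the faulty one are re-run, so the upstream ODS results do not need to be recomputed.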
An embodiment of the present disclosure further provides a computer device that includes one or more processors and a storage device. One or more programs are stored on the storage device, and when the one or more programs are executed by the one or more processors, the one or more processors implement the batch-stream-integrated data processing method provided by the foregoing embodiments.
An embodiment of the present disclosure further provides a computer-readable medium on which a computer program is stored, wherein the computer program, when executed, implements the batch-stream-integrated data processing method provided by the foregoing embodiments.
Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the apparatuses, may be implemented as software, firmware, hardware, or any suitable combination thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, as will be apparent to those skilled in the art, features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics, and/or elements described in connection with other embodiments, unless expressly stated otherwise. Accordingly, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the present invention as set forth in the appended claims.

Claims (9)

  1. A batch-stream-integrated data processing method, wherein the method is applied to a data processing apparatus, the data processing apparatus comprises a data application layer and a plurality of processing layers, the processing layers form a processing chain, and the method comprises:
    acquiring data to be processed, the data to be processed being real-time data;
    processing the data to be processed layer by layer along the processing chain to obtain first data, wherein, in each of the processing layers, the data input to the current processing layer is processed to obtain processed data, the processed data being real-time data, the processed real-time data is stored in a Hive module by means of a Flink stream, and the processed data is input to the next processing layer; the first data being the processed data obtained by the last processing layer in the processing chain;
    processing the first data in the data application layer to obtain second data; and
    in response to detecting that the second data is erroneous, correcting, in the processing layer in which the data error occurred, the erroneous data according to the offline data of that processing layer to obtain corrected data, and inputting the corrected data to the next processing layer so that the next processing layer processes the input data.
  2. The method according to claim 1, wherein processing the data input to the current processing layer comprises: processing the data input to the current processing layer by using a stream data processing engine.
  3. The method according to claim 1, wherein, after processing the first data in the data application layer to obtain the second data, the method further comprises:
    in response to receiving a data query request, obtaining a query result, the query result comprising at least one of the following: the offline data of each processing layer, and the second data; and
    sending the query result.
  4. The method according to claim 3, wherein sending the query result comprises: sending the query result through a preset query interface, the query interface comprising at least one of the following: a JDBC API interface and a Rest API interface.
  5. The method according to claim 1, wherein correcting the erroneous data according to the offline data of the current processing layer comprises: correcting the erroneous data according to the offline data of the current processing layer by using a stream data processing engine.
  6. The method according to any one of claims 1-5, wherein the data to be processed comprises log data and business data, and acquiring the data to be processed comprises:
    acquiring the business data from a business database by means of change data capture (CDC), and acquiring the log data by means of a log collection system.
  7. A data processing apparatus, comprising an acquisition module, a first processing module, and a second processing module, wherein the second processing module forms a data application layer, the first processing module comprises a plurality of processing layers, the processing layers form a processing chain, and each of the processing layers comprises a first processing unit and a second processing unit;
    the acquisition module is configured to acquire data to be processed, the data to be processed comprising real-time data;
    the first processing module is configured to process the data to be processed layer by layer along the processing chain to obtain first data;
    the first processing unit is configured to: process the data input to the current processing layer to obtain processed data, the processed data being real-time data; store the processed real-time data in a Hive module by means of a Flink stream; and input the processed data to the next processing layer, the first data being the processed data obtained by the last processing layer in the processing chain; and to receive corrected data sent by the second unit and input the corrected data to the first processing unit of the next processing layer, so that the first processing unit of the next processing layer processes the input data;
    the second processing unit is configured to, in response to a data error occurring in the current processing layer, correct the erroneous data according to the offline data of the current processing layer to obtain corrected data, and send the corrected data to the first processing unit; and
    the second processing module is configured to process the first data in the data application layer to obtain second data.
  8. A computer device, comprising:
    one or more processors; and
    a storage device on which one or more programs are stored;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the batch-stream-integrated data processing method according to any one of claims 1-6.
  9. A computer-readable medium on which a computer program is stored, wherein the program, when executed, implements the batch-stream-integrated data processing method according to any one of claims 1-6.
PCT/CN2022/105078 2021-11-09 2022-07-12 Data processing method and apparatus based on batch-stream integration, computer device, and medium WO2023082681A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111318823.5 2021-11-09
CN202111318823.5A CN113779094B (en) 2021-11-09 2021-11-09 Batch-flow-integration-based data processing method and device, computer equipment and medium

Publications (1)

Publication Number Publication Date
WO2023082681A1 true WO2023082681A1 (en) 2023-05-19

Family

ID=78956925

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/105078 WO2023082681A1 (en) 2021-11-09 2022-07-12 Data processing method and apparatus based on batch-stream integration, computer device, and medium

Country Status (2)

Country Link
CN (1) CN113779094B (en)
WO (1) WO2023082681A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117724706A (en) * 2024-02-06 2024-03-19 湖南盛鼎科技发展有限责任公司 Method and system for batch-flow integrated flow real-time processing of heterogeneous platform mass data

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779094B (en) * 2021-11-09 2022-03-22 通号通信信息集团有限公司 Batch-flow-integration-based data processing method and device, computer equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473480A (en) * 2013-10-08 2013-12-25 武汉大学 Online monitoring data correction method based on improved universal gravitation support vector machine
US20150341231A1 (en) * 2014-05-21 2015-11-26 Asif Khan Distributed system architecture using event stream processing
CN112000636A (en) * 2020-08-31 2020-11-27 民生科技有限责任公司 User behavior statistical analysis method based on Flink streaming processing
CN113515363A (en) * 2021-08-10 2021-10-19 中国人民解放军61646部队 Special-shaped task high-concurrency multi-level data processing system dynamic scheduling platform
CN113779094A (en) * 2021-11-09 2021-12-10 通号通信信息集团有限公司 Batch-flow-integration-based data processing method and device, computer equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10936585B1 (en) * 2018-10-31 2021-03-02 Splunk Inc. Unified data processing across streaming and indexed data sets
US11526539B2 (en) * 2019-01-31 2022-12-13 Salesforce, Inc. Temporary reservations in non-relational datastores
CN112507029B (en) * 2020-12-18 2022-11-04 上海哔哩哔哩科技有限公司 Data processing system and data real-time processing method
CN113220521A (en) * 2021-02-04 2021-08-06 北京易车互联信息技术有限公司 Real-time monitoring system
CN112905595A (en) * 2021-03-05 2021-06-04 腾讯科技(深圳)有限公司 Data query method and device and computer readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117724706A (en) * 2024-02-06 2024-03-19 湖南盛鼎科技发展有限责任公司 Method and system for batch-flow integrated flow real-time processing of heterogeneous platform mass data
CN117724706B (en) * 2024-02-06 2024-05-03 湖南盛鼎科技发展有限责任公司 Method and system for batch-flow integrated flow real-time processing of heterogeneous platform mass data

Also Published As

Publication number Publication date
CN113779094A (en) 2021-12-10
CN113779094B (en) 2022-03-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891490

Country of ref document: EP

Kind code of ref document: A1