一种增量数据一致性实现方法及装置Method and device for realizing consistency of incremental data
技术领域Technical field
本发明涉及数据仓库技术领域,尤其涉及一种增量数据一致性实现方法及装置。The invention relates to the technical field of data warehouses, in particular to a method and device for realizing incremental data consistency.
背景技术Background technique
大数据数据仓库构建ODS(Operational Data Store,操作数据存储)负责关系数据表,需要构建一致性增量数据表,以确保具有关联关系的各数据表之间的增量数据保持一致。以零售交易订单进行举例,订单头表与各订单子表之间的一致性增量数据表能够保证变化单号在每个订单增量数据表中都存在,不会出现一个变化单号在某些表中存在,而在某些表中不存在,导致产生增量表数据无法关联的问题。The ODS (Operational Data Store) of the big data data warehouse is responsible for relational data tables, and it is necessary to construct a consistent incremental data table to ensure that the incremental data between the associated data tables are consistent. Take a retail transaction order as an example. The consistent incremental data table between the order header table and each order sub-table can ensure that the change order number exists in each order incremental data table, and there will be no change order number in a certain order. Some tables exist, but do not exist in some tables, resulting in the problem that the incremental table data cannot be correlated.
现在技术中,一般会采取如下方法来实现增量数据的一致性:In current technology, the following methods are generally adopted to achieve the consistency of incremental data:
方法一:增量获取业务系统中的订单头表和各子表数据到大数据平台的增量数据表中,再使用hive/spark生成对应的全量数据表,根据增量数据表生成完整变化单号,去暴力匹配各表,最终生成各表的一致性增量数据表。Method 1: Incrementally obtain the order header table and sub-table data in the business system to the incremental data table of the big data platform, and then use hive/spark to generate the corresponding full data table, and generate a complete change sheet based on the incremental data table No., to brute force match each table, and finally generate a consistent incremental data table for each table.
方法二:增量获取业务系统订单头表和各子表的单号数据到业务系统的单号变化中间表中,根据单号变化中间表中的单号到业务系统通过数据库索引获取头表和各子表对应单号的业务数据到数据仓库的一致性增量数据表中。Method 2: Incrementally obtain the order number data of the order header table and each sub-table of the business system to the order number change intermediate table of the business system, and change the order number in the intermediate table according to the order number to the business system to obtain the header table and data through the database index Each sub-table corresponds to the business data of the single number to the consistent incremental data table of the data warehouse.
以上两种方案实现简单,但存在的一定缺陷和不足:The above two schemes are simple to implement, but there are certain defects and shortcomings:
针对方法一:Hive生成全量数据表,需要全量读写全量订单数据,假设100亿订单,每天订单增量200万,更新200万数据每次需要读写100亿数据,生成一致性增量数据表需要再次全量读取全量数据表,对全量数据表两次全量读 取和一次全量写,大数据平台资源损耗大,效率低。For method 1: Hive generates a full data table, which needs to read and write full order data. Assuming 10 billion orders, an order increase of 2 million per day, updating 2 million data requires reading and writing 10 billion data each time to generate a consistent incremental data table It is necessary to read the full data table in full again, to read the full data table twice, and to write the full data once. The big data platform has high resource consumption and low efficiency.
针对方法二:需要业务系统创建单号变化中间表,并对此表有写权限,两次读取业务系统表数据,依赖业务系统数据表索引使用,整个过程对业务系统具有较大的依赖,并且抽取过程可能会产生数据库锁,特别大促期间系统降级,直接导致数据无法抽取,整个大数据计算因为数据无法抽取而停滞,导致分析数据无法正常按时生产。For method two: the business system needs to create an intermediate table of order number change, and have write permission for this table, read the business system table data twice, rely on the use of the business system data table index, and the entire process has a greater dependence on the business system. In addition, the extraction process may cause database locks, especially during the promotion period, the system is degraded, which directly causes data to be unable to be extracted, and the entire big data calculation is stagnated because the data cannot be extracted, resulting in the analysis of data cannot be normally produced on time.
此外,基于Hive的数据仓库无法支持单号索引查询,对订单回溯场景无法支持,比如售后客服等业务分析场景,需要关联业务对应的订单数据,对应订单时间范围较广不太确定,可能1个月内或超过1年,由于Hive表基本无索引能力,这类业务分析的基于Hive表实现较困难。In addition, the Hive-based data warehouse cannot support order number index query, and it cannot support order backtracking scenarios. For example, business analysis scenarios such as after-sales customer service need to associate the order data corresponding to the business. The corresponding order time range is wide and uncertain, maybe one Within a month or more than one year, because Hive tables are basically incapable of indexing, it is difficult to implement this type of business analysis based on Hive tables.
发明内容Summary of the invention
本发明旨在至少解决现有技术或相关技术中存在的技术问题之一,为此本发明提供一种增量数据一致性实现方法及装置。The present invention aims to solve at least one of the technical problems existing in the prior art or related technologies. To this end, the present invention provides a method and device for achieving incremental data consistency.
本发明实施例提供的具体技术方案如下:The specific technical solutions provided by the embodiments of the present invention are as follows:
第一方面,本发明提供了一种增量数据一致性实现方法,包括:In the first aspect, the present invention provides a method for achieving incremental data consistency, including:
初始化业务系统中具有关联关系的各数据表的全部数据,加载至第一数据库中以生成多个全量数据表;Initialize all the data of each data table that has an association relationship in the business system, and load it into the first database to generate multiple full data tables;
基于所述业务系统的数据库日志,将所述各数据表的实时数据分别同步至所述多个全量数据表中以及第二数据库的多个增量数据表中;Based on the database log of the business system, synchronizing the real-time data of each data table to the multiple full data tables and the multiple incremental data tables of the second database respectively;
提取所述多个增量数据表中的所有的业务唯一标识,在所述第二数据库中合并生成增量标识合并表;Extracting all the unique identifiers of services in the plurality of incremental data tables, and merge them in the second database to generate an incremental identifier merged table;
根据所述增量标识合并表,查询得到所述多个全量数据表中与所述增量标识合并表相关的业务数据,并对应写入所述第二数据库的一致性增量数据表中。According to the incremental identification combination table, the business data related to the incremental identification combination table among the multiple full data tables is obtained by querying, and correspondingly written into the consistent incremental data table of the second database.
在一个优选的实施方式中,所述基于所述业务系统的数据库日志,将所述各数据表的实时数据分别同步至所述多个全量数据表中以及第二数据库的多个增 量数据表中,包括:In a preferred embodiment, based on the database log of the business system, the real-time data of each data table is synchronized to the multiple full data tables and the multiple incremental data tables of the second database. Include:
从所述业务系统的数据库日志中解析出所述各数据表的实时数据,并同步至实时数据流中;Analyze the real-time data of each data table from the database log of the business system, and synchronize it to the real-time data stream;
将所述实时数据流中的数据落地所述多个全量数据表中;以及Landing the data in the real-time data stream in the multiple full data tables; and
将所述实时数据流中的数据写入所述多个增量数据表中。The data in the real-time data stream is written into the multiple incremental data tables.
在一个优选的实施方式中,所述第一数据库为KV数据库,所述第二数据库为Hive数据库。In a preferred embodiment, the first database is a KV database, and the second database is a Hive database.
在一个优选的实施方式中,所述根据所述增量标识合并表,查询得到所述多个全量数据表中与所述增量标识合并表相关的业务数据,包括:In a preferred embodiment, the query to obtain the business data related to the incremental identification combination table among the multiple full data tables according to the incremental identification combination table includes:
针对所述增量标识合并表的每一个订单号,通过SQL查询接口分别查询所述多个全量数据表中匹配于所述订单号的业务数据,以得到所述查询结果。For each order number in the incremental identification merged table, the service data matching the order number in the multiple full data tables are respectively queried through the SQL query interface to obtain the query result.
在一个优选的实施方式中,所述方法还包括:In a preferred embodiment, the method further includes:
接收数据回溯查询指令,通过SQL查询接口在所述第一数据库中查询与所述数据回溯查询指令相关联的业务数据,并返回数据回溯查询结果。Receiving a data backtracking query instruction, querying the first database for business data associated with the data backtracking query instruction through a SQL query interface, and returning a data backtracking query result.
第二方面,提供了一种增量数据一致性实现装置,包括:In the second aspect, a device for realizing incremental data consistency is provided, including:
初始化模块,用于初始化业务系统中具有关联关系的各数据表的全部数据,加载至第一数据库中以生成多个全量数据表;The initialization module is used to initialize all the data of the associated data tables in the business system, and load them into the first database to generate multiple full data tables;
实时同步模块,用于基于所述业务系统的数据库日志,将所述各数据表的实时数据分别同步至所述多个全量数据表中以及第二数据库的多个增量数据表中;The real-time synchronization module is configured to synchronize the real-time data of each data table to the multiple full data tables and multiple incremental data tables of the second database based on the database log of the business system;
标识合并模块,用于提取所述多个增量数据表中的所有的业务唯一标识,在所述第二数据库中合并生成增量标识合并表;An identifier merging module, configured to extract all the unique identifiers of the services in the plurality of incremental data tables, and merge them in the second database to generate an incremental identifier merging table;
查询模块,用于根据所述增量标识合并表,查询得到所述多个全量数据表中与所述增量标识合并表相关的业务数据;A query module, configured to query and obtain business data related to the incremental identifier merge table among the multiple full data tables according to the incremental identifier merge table;
写入模块,用于将与所述增量标识合并表相关的业务数据对应写入所述第二数据库的一致性增量数据表中。The writing module is configured to write the business data related to the incremental identification merge table into the consistent incremental data table of the second database.
在一个优选的实施方式中,所述实时同步模块具体用于:In a preferred embodiment, the real-time synchronization module is specifically used for:
从所述业务系统的数据库日志中解析出所述各数据表的实时数据,并同步至实时数据流中;Analyze the real-time data of each data table from the database log of the business system, and synchronize it to the real-time data stream;
将所述实时数据流中的数据落地所述多个全量数据表中;以及Landing the data in the real-time data stream in the multiple full data tables; and
将所述实时数据流中的数据写入所述多个增量数据表中。The data in the real-time data stream is written into the multiple incremental data tables.
在一个优选的实施方式中,所述第一数据库为KV数据库,所述第二数据库为Hive数据库。In a preferred embodiment, the first database is a KV database, and the second database is a Hive database.
在一个优选的实施方式中,所述查询模块具体用于:In a preferred embodiment, the query module is specifically used for:
针对所述增量标识合并表的每一个订单号,通过SQL查询接口分别查询所述多个全量数据表中匹配于所述订单号的业务数据。For each order number in the incremental identification combined table, query the service data matching the order number in the multiple full data tables through the SQL query interface.
在一个优选的实施方式中,所述查询模块还用于:In a preferred embodiment, the query module is also used for:
接收数据回溯查询指令,通过SQL查询接口在所述第一数据库中查询与所述数据回溯查询指令相关联的业务数据,并返回数据回溯查询结果。Receiving a data backtracking query instruction, querying the first database for business data associated with the data backtracking query instruction through a SQL query interface, and returning a data backtracking query result.
本发明提供的一种增量数据一致性实现方法及装置,由于通过利用数据库日志将业务数据库中的各数据表的实时数据同步至数据仓库中,相对于现有技术中通过创建单号变化中间表读取业务系统表数据,强依赖于业务系统数据表索引使用,本发明在业务数据库的数据采集过程中,对业务数据库的正常运行基本无影响;且在从多个增量数据表查询与增量标识合并表相关的业务数据时,仅需一次全量读取,对数据库资源损耗较小,且通过写入查询结果而得到的一致性增量数据表能够确保各数据表之间的增量数据保持一致;此外,由于一致性增量数据表中的数据分析支持基于增量数据的分析,因此日数据分析场景下仅需要检索各表中的日数据就能完成订单相关所有分析,并不需要检索历史分区数据,数据库资源消耗较小。The method and device for realizing incremental data consistency provided by the present invention, because the real-time data of each data table in the business database is synchronized to the data warehouse by using the database log, compared with the prior art by creating a single number to change the middle The table reads the business system table data and strongly relies on the use of the business system data table index. During the data collection process of the business database, the present invention has basically no impact on the normal operation of the business database; and in the query and query from multiple incremental data tables When incrementally identifying the business data related to the merged table, it only needs to be read in full once, which consumes less database resources, and the consistent incremental data table obtained by writing the query result can ensure the increment between the data tables The data is consistent; in addition, because the data analysis in the consistent incremental data table supports the analysis based on incremental data, in the daily data analysis scenario, you only need to retrieve the daily data in each table to complete all the analysis related to the order. Need to retrieve historical partition data, the database resource consumption is small.
附图说明Description of the drawings
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的 一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly describe the technical solutions in the embodiments of the present invention, the following will briefly introduce the accompanying drawings used in the description of the embodiments. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention. For those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
图1示出了一种增量数据一致性实现方法的流程图;Figure 1 shows a flow chart of a method for achieving incremental data consistency;
图2示出了操作性数据仓库ODS的订单增量数据一致性的实现流程图;Figure 2 shows the implementation flow chart of the consistency of order incremental data of the operational data warehouse ODS;
图3示出了一种增量数据一致性实现装置的框图。Figure 3 shows a block diagram of a device for realizing incremental data consistency.
具体实施方式detailed description
为使本发明的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。In order to make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are merely Some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
除非上下文明确要求,否则整个说明书和权利要求书中的“包括”、“包含”等类似词语应当解释为包含的含义而不是排他或穷举的含义;也就是说,是“包括但不限于”的含义。Unless the context clearly requires, the words "including", "including" and other similar words in the entire specification and claims should be interpreted as inclusive rather than exclusive or exhaustive meanings; in other words, "including but not limited to" Meaning.
在本发明的描述中,需要理解的是,术语“第一”、“第二”等仅用于描述目的,而不能理解为指示或暗示相对重要性。此外,在本发明的描述中,除非另有说明,“多个”的含义是两个或两个以上。In the description of the present invention, it should be understood that the terms "first", "second", etc. are only used for descriptive purposes and cannot be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise specified, "plurality" means two or more.
实施例一Example one
本发明实施例提供了一种增量数据一致性实现方法,该方法可以应用于数据仓库(例如,操作性数据仓库ODS),如图1所示,该方法包括步骤:The embodiment of the present invention provides a method for achieving incremental data consistency, which can be applied to a data warehouse (for example, an operational data warehouse ODS). As shown in FIG. 1, the method includes the steps:
S1:初始化业务系统中具有关联关系的各数据表的全部数据,加载至第一数据库中以生成多个全量数据表。S1: Initialize all the data of each data table having an association relationship in the business system, and load it into the first database to generate multiple full data tables.
本实施例中,具有关联关系的各数据表可以是一对一关系或一对多关系,具有一对多关系的各数据表可以是以一个数据表作为父表,其他的多个数据表分别作为子表。例如,在零售交易订单场景中,订单头表为父表,订单商品表、订 单支付表、订单扩展表等均为子表。In this embodiment, each data table with an association relationship may be a one-to-one relationship or a one-to-many relationship, and each data table with a one-to-many relationship may use one data table as the parent table, and the other multiple data tables respectively As a child table. For example, in a retail transaction order scenario, the order header table is the parent table, and the order item table, order payment table, and order extension table are all child tables.
具体的,基于ETL工具从业务系统对应的业务数据库中抽取数据表的全部数据,经过清洗转换后,加载至第一数据库中,形成与各数据表的对应的多个全量数据表。Specifically, all data of the data table is extracted from the business database corresponding to the business system based on the ETL tool, and after cleaning and conversion, it is loaded into the first database to form multiple full data tables corresponding to each data table.
比如,可以将业务数据库中订单头表、订单商品表、订单支付表和订单扩展表各自的全部数据,加载到第一数据库中,生成订单头表、订单商品表、订单支付表和订单扩展表各自对应的全量数据表。For example, all data of the order header table, order product table, order payment table, and order extension table in the business database can be loaded into the first database to generate the order header table, order product table, order payment table, and order extension table Each corresponding full data sheet.
其中,第一数据库可以为KV(Key-Value)数据库。Key-Value数据库是一种以键值对存储数据的一种数据库,故也称键值数据库,其数据的存储与访问均是以键值对作为标识进行的,可以通过key快速查询到对应的value值,并可以对外提供很好地读写操作。Key-value数据库代表的有redis。Among them, the first database may be a KV (Key-Value) database. The Key-Value database is a kind of database that stores data in key-value pairs, so it is also called a key-value database. The storage and access of its data are carried out using key-value pairs as identifiers, and the corresponding ones can be quickly queried by key value value, and can provide good read and write operations externally. The key-value database represents redis.
S2:基于业务系统的数据库日志,将各数据表的实时数据分别同步至多个全量数据表中以及第二数据库的多个增量数据表中。S2: Based on the database log of the business system, the real-time data of each data table is synchronized to multiple full data tables and multiple incremental data tables of the second database.
其中,实时数据为各数据表中新增加或新修改的数据。Among them, real-time data is newly added or newly modified data in each data table.
其中,第二数据库为Hive数据库。Hive数据库是基于Hadoop的一个数据仓库工具,可以将结构化的数据文件映射为一张数据库表,并提供简单的SQL查询功能,可以将SQL语句转换为MapReduce任务进行运行。其优点是学习成本低,可以通过类SQL语句快速实现简单的MapReduce统计,不必开发专门的MapReduce应用,十分适合数据仓库的统计分析。Among them, the second database is the Hive database. Hive database is a data warehouse tool based on Hadoop, which can map structured data files into a database table, and provides simple SQL query functions, which can convert SQL statements into MapReduce tasks for execution. Its advantages are low learning costs, simple MapReduce statistics can be quickly realized through SQL-like statements, no need to develop special MapReduce applications, and it is very suitable for statistical analysis of data warehouses.
具体的,从业务系统的数据库日志解析出各数据表的实时数据,并将实时数据同步至实时数据流中;Specifically, parse the real-time data of each data table from the database log of the business system, and synchronize the real-time data to the real-time data stream;
将实时数据流中的数据落地多个全量数据表中;Put the data in the real-time data stream into multiple full data tables;
以及将实时数据流中的数据写入第二数据库的多个增量数据表中。And write the data in the real-time data stream into multiple incremental data tables in the second database.
其中,数据库日志中记录了对业务数据库的操作信息,数据库日志具体可以为Binlog数据库日志,可以通过Binlog解析器定期解析Binlog数据库日志。Among them, the database log records operation information on the business database. The database log can specifically be a Binlog database log, and the Binlog database log can be parsed regularly through the Binlog parser.
于本实施例中,可以在数据库日志发生更新时获取数据库日志,其中,更新 包括对业务数据库的数据表的字段进行了增加、删除或修改操作。In this embodiment, the database log can be obtained when the database log is updated, where the update includes adding, deleting, or modifying fields of the data table of the business database.
需要说明的是,将实时数据流中的数据落地多个全量数据表中的步骤,以及将实时数据流中的数据落地多个全量数据表中的步骤,本发明实施例对这两个步骤的执行顺序不作具体限定,以同时执行上述两个步骤作为优选方案。It should be noted that the steps of landing the data in the real-time data stream in multiple full data tables, and the steps of landing the data in the real-time data stream in multiple full data tables, the embodiment of the present invention has two steps The order of execution is not specifically limited, and it is preferable to execute the above two steps simultaneously.
S3:提取多个增量数据表中的所有的业务唯一标识,在第二数据库中合并生成增量标识合并表。S3: Extract all service unique identifiers in multiple incremental data tables, and merge them in the second database to generate an incremental identifier merged table.
其中,业务唯一标识可以唯一标识数据库表中的一条业务记录,在订单应用场景中,业务唯一标识为订单单号。Among them, the business unique identifier can uniquely identify a business record in the database table. In the order application scenario, the business unique identifier is the order number.
具体的,提取多个增量数据表中所有的业务唯一标识,对所有的业务唯一标识进行合并去重,生成增量标识合并表。Specifically, all the unique identifiers of the services in the multiple incremental data tables are extracted, and all the unique identifiers of the services are merged and deduplicated to generate an incremental identifier merge table.
本实施例中,可以对所有的业务唯一标识合并成集合,并剔除其中重复的业务唯一标识,对去重后的业务唯一标识生成增量标识合并表,存储在Hive数据库中。In this embodiment, all service unique identifiers can be merged into a set, and the duplicate service unique identifiers can be eliminated, and the incremental identifier merge table is generated for the deduplicated service unique identifiers and stored in the Hive database.
S4:根据增量标识合并表,查询得到多个全量数据表中与增量标识合并表相关的业务数据,并对应写入第二数据库的一致性增量数据表中。S4: According to the incremental identification merging table, query and obtain the business data related to the incremental identification merging table from the multiple full data tables, and correspondingly write them into the consistent incremental data table of the second database.
具体的,该过程可以包括:Specifically, the process may include:
针对增量标识合并表的每一个订单号,通过SQL查询接口分别查询多个全量数据表中匹配于订单号的业务数据。For each order number in the incremental identification combined table, the business data matching the order number in the multiple full data tables is queried through the SQL query interface.
在具体实施过程中,可以通过开发SQL查询接口,将KV数据库查询集成到SQL中,降低开发难度,实现了通过SQL方式将Hive数据库和KV数据库进行实时关联。In the specific implementation process, the KV database query can be integrated into SQL by developing the SQL query interface to reduce the difficulty of development and realize the real-time association of the Hive database and the KV database through SQL.
由于Hive库和KV库能够通过SQL实现关联,使得KV库中的全量数据表可以支持基于单号的快速检索,提供数据索引检索能力,不增加hadoop平台或者业务系统的压力。Since the Hive library and the KV library can be associated through SQL, the full data tables in the KV library can support fast retrieval based on single numbers, provide data index retrieval capabilities, and do not increase the pressure on the hadoop platform or business system.
进一步地,除上述步骤之外,本发明实施例提供的方法还可以包括:Further, in addition to the above steps, the method provided in the embodiment of the present invention may further include:
基于第二数据库中的一致性增量数据表进行与业务主题相关的指标、维度、 属性分析,其中,业务主题可以是下单、执法、换退货。Based on the consistent incremental data table in the second database, analyze indicators, dimensions, and attributes related to the business theme, where the business theme can be order placement, law enforcement, exchange and return.
由于一致性增量数据表的数据分析支持基于增量数据的分析,因此日数据分析场景下仅需要检索各表中的日数据就能完成订单相关所有分析,包括下单宽表、支付宽表、退换货分析宽表等,分析过程中并不需要检索历史分区数据,因此数据库资源消耗较小。Since the data analysis of the consistent incremental data table supports the analysis based on incremental data, in the daily data analysis scenario, you only need to retrieve the daily data in each table to complete all the analysis related to the order, including the order wide table and the payment wide table , Return and exchange analysis wide table, etc., there is no need to retrieve historical partition data during the analysis process, so the database resource consumption is small.
进一步地,除上述步骤之外,本发明实施例提供的方法还可以包括:Further, in addition to the above steps, the method provided in the embodiment of the present invention may further include:
接收数据回溯查询指令,通过SQL查询接口在第一数据库中查询与数据回溯查询指令相关联的业务数据,并返回数据回溯查询结果。The data backtracking query instruction is received, the business data associated with the data backtracking query instruction is queried in the first database through the SQL query interface, and the data backtracking query result is returned.
示例性地,以客服投诉分析举例,当日客户投诉的订单可能是很长时间前的订单数据,Hive表难以实现长时间历史高效快速检索,而通过KV数据库的SQL查询接口进行查找全量数据表中的客户订单信息,实现业务数据的回溯查询,从而有力解决售后客服等业务需要检索以往订单数据关联业务场景以及获取关联订单作为维度分析业务的场景,且检索性能高,数据库资源消耗小。Exemplarily, take customer service complaint analysis as an example. The order complained by the customer on the day may be the order data a long time ago. The Hive table is difficult to achieve long-term historical efficient and fast retrieval, and the SQL query interface of the KV database is used to search the full data table The customer order information can realize the retrospective query of business data, which effectively solves the need for after-sales customer service and other services to retrieve past order data related business scenarios and obtain related orders as a dimensional analysis business scenario, with high retrieval performance and low database resource consumption.
下面以订单场景为例,对本发明实施例一提供的增量数据一致性实现方法进行进一步说明,如图2所示,图2示出一种操作性数据仓库ODS的订单增量数据一致性的实现过程,该过程包括:The following takes an order scenario as an example to further illustrate the method for implementing incremental data consistency provided in the first embodiment of the present invention, as shown in Figure 2, which shows an operational data warehouse ODS order incremental data consistency The realization process, which includes:
步骤1:初始化业务系统中父子表各表的全部数据,对应加载到KV库中以形成多个全量数据表;Step 1: Initialize all the data of each table of the parent-child table in the business system and load it into the KV library to form multiple full data tables;
步骤2:从业务系统中通过数据库日志实时同步数据到数据流中;Step 2: Synchronize data from the business system to the data stream in real time through the database log;
步骤3:将实时数据流数据对应落地Hive库的增量数据表中;Step 3: Map the real-time data stream data to the incremental data table of the Hive library;
步骤4:将实时数据流数据对应写入KV库中的全量数据表中;Step 4: Write the real-time data stream data into the full data table in the KV library;
步骤5:将各个增量数据表提取所有的单号合并去重写入Hive库的增量单号合并表中;Step 5: Extract all the tracking numbers from each incremental data table and merge them into the incremental tracking number merge table of the Hive library;
步骤6:根据增量单号合并表通过SQL查询接口查询调用各全量数据表的数据,并将查询结果写入Hive库的一致性增量数据表中。Step 6: Query and call the data of each full data table through the SQL query interface according to the incremental order number merge table, and write the query result into the consistent incremental data table of the Hive library.
通过上述步骤,最终可以生成数仓ODS的Hive库的一致性增量数据表和 和KV库的全量数据表。Through the above steps, the consistent incremental data table of the Hive library of the data warehouse ODS and the full data table of the KV library can be generated.
本发明提供的增量数据一致性实现方法,由于通过利用数据库日志将业务数据库中的各数据表的实时数据同步至数据仓库中,相对于现有技术中通过创建单号变化中间表读取业务系统表数据,强依赖于业务系统数据表索引使用,本发明在业务数据库的数据采集过程中,对业务数据库的正常运行基本无影响;且在从多个增量数据表查询与增量标识合并表相关的业务数据时,仅需一次全量读取,对数据库资源损耗较小,且通过写入查询结果而得到的一致性增量数据表能够确保各数据表之间的增量数据保持一致;此外,由于一致性增量数据表中的数据分析支持基于增量数据的分析,因此日数据分析场景下仅需要检索各表中的日数据就能完成订单相关所有分析,并不需要检索历史分区数据,数据库资源消耗较小。The method for realizing incremental data consistency provided by the present invention, because the real-time data of each data table in the business database is synchronized to the data warehouse by using the database log, compared to the prior art by creating a single number change intermediate table to read business The system table data strongly depends on the use of the business system data table index. The present invention has basically no impact on the normal operation of the business database during the data collection process of the business database; and when querying from multiple incremental data tables and combining incremental identification When table-related business data is read, it only needs to be read in full once, which consumes less database resources, and the consistent incremental data table obtained by writing the query results can ensure that the incremental data between the data tables is consistent; In addition, because the data analysis in the consistent incremental data table supports analysis based on incremental data, in the daily data analysis scenario, you only need to retrieve the daily data in each table to complete all the order-related analysis, without the need to retrieve historical partitions Data, database resource consumption is small.
实施例二Example two
本发明实施例提供一种增量数据一致性实现装置,参照图3所示,该装置包括:The embodiment of the present invention provides an incremental data consistency realization device. As shown in FIG. 3, the device includes:
初始化模块31,用于初始化业务系统中具有关联关系的各数据表的全部数据,加载至第一数据库中以生成多个全量数据表;The initialization module 31 is used to initialize all the data of each data table having an association relationship in the business system, and load it into the first database to generate multiple full data tables;
实时同步模块32,用于基于业务数据库的日志,将各数据表的实时数据分别同步至多个全量数据表中以及第二数据库的多个增量数据表中;The real-time synchronization module 32 is used to synchronize the real-time data of each data table to multiple full data tables and multiple incremental data tables of the second database based on the log of the business database;
标识合并模块33,用于提取多个增量数据表中的所有的业务唯一标识,在第二数据库中合并生成增量标识合并表;The identifier merging module 33 is used to extract all the unique identifiers of the services in the multiple incremental data tables, and merge them in the second database to generate an incremental identifier merge table;
查询模块34,用于根据增量标识合并表,查询得到多个全量数据表中与增量标识合并表相关的业务数据;以及The query module 34 is configured to query and obtain business data related to the incremental identifier merged table from the multiple full data tables according to the incremental identifier merged table; and
写入模块35,用于将与增量标识合并表相关的业务数据对应写入第二数据库的一致性增量数据表中。The writing module 35 is configured to write the business data related to the incremental identification merge table into the consistent incremental data table of the second database.
进一步地,实时同步模块32具体用于:Further, the real-time synchronization module 32 is specifically used for:
从业务系统的数据库日志解析出各数据表的实时数据,并将实时数据同步 至实时数据流中;Analyze the real-time data of each data table from the database log of the business system, and synchronize the real-time data to the real-time data stream;
将实时数据流中的数据落地多个全量数据表中;以及Put the data in the real-time data stream into multiple full data tables; and
将实时数据流中的数据写入多个增量数据表中。Write the data in the real-time data stream into multiple incremental data tables.
进一步地,第一数据库为KV数据库,第二数据库为Hive数据库。Further, the first database is a KV database, and the second database is a Hive database.
进一步地,查询模块34具体用于:Further, the query module 34 is specifically used for:
针对增量标识合并表的每一个订单号,通过SQL查询接口分别查询多个全量数据表中匹配于订单号的业务数据。For each order number in the incremental identification combined table, the business data matching the order number in the multiple full data tables is queried through the SQL query interface.
进一步地,查询模块34还用于:Further, the query module 34 is also used for:
接收数据回溯查询指令,通过SQL查询接口在第一数据库中查询与数据回溯查询指令相关联的业务数据,并返回数据回溯查询结果。The data backtracking query instruction is received, the business data associated with the data backtracking query instruction is queried in the first database through the SQL query interface, and the data backtracking query result is returned.
本发明提供的增量数据一致性实现装置,由于通过利用数据库日志将业务数据库中的各数据表的实时数据同步至数据仓库中,相对于现有技术中通过创建单号变化中间表读取业务系统表数据,强依赖于业务系统数据表索引使用,本发明在业务数据库的数据采集过程中,对业务数据库的正常运行基本无影响;且在从多个增量数据表查询与增量标识合并表相关的业务数据时,仅需一次全量读取,对数据库资源损耗较小,且通过写入查询结果而得到的一致性增量数据表能够确保各数据表之间的增量数据保持一致;此外,由于一致性增量数据表中的数据分析支持基于增量数据的分析,因此日数据分析场景下仅需要检索各表中的日数据就能完成订单相关所有分析,并不需要检索历史分区数据,数据库资源消耗较小。The incremental data consistency realization device provided by the present invention, because the real-time data of each data table in the business database is synchronized to the data warehouse by using the database log, compared to the prior art by creating a single number change intermediate table to read business The system table data strongly depends on the use of the business system data table index. The present invention has basically no impact on the normal operation of the business database during the data collection process of the business database; and when querying from multiple incremental data tables and combining incremental identification When table-related business data is read, it only needs to be read in full once, which consumes less database resources, and the consistent incremental data table obtained by writing the query results can ensure that the incremental data between the data tables is consistent; In addition, because the data analysis in the consistent incremental data table supports analysis based on incremental data, in the daily data analysis scenario, you only need to retrieve the daily data in each table to complete all the order-related analysis, without the need to retrieve historical partitions Data, database resource consumption is small.
上述所有可选技术方案,可以采用任意结合形成本发明的可选实施例,在此不再一一赘述。All the above-mentioned optional technical solutions can be combined in any way to form an optional embodiment of the present invention, which will not be repeated here.
需要说明的是:上述实施例提供的增量数据一致性实现装置在执行增量数据一致性实现方法时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将增量数据一致性实现装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分 功能。另外,上述实施例提供的增量数据一致性实现方法与增量数据一致性实现装置实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。It should be noted that, when the incremental data consistency realization device provided in the above embodiment executes the incremental data consistency realization method, only the division of the above-mentioned functional modules is used for illustration. In practical applications, the above-mentioned Function allocation is completed by different functional modules, that is, the internal structure of the device for achieving incremental data consistency is divided into different functional modules to complete all or part of the functions described above. In addition, the method for realizing incremental data consistency provided by the foregoing embodiment belongs to the same concept as the embodiment of the device for realizing incremental data consistency. For the specific implementation process, please refer to the method embodiment, which will not be repeated here.
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。Those of ordinary skill in the art can understand that all or part of the steps in the foregoing embodiments can be implemented by hardware, or by a program instructing relevant hardware to be completed. The program can be stored in a computer-readable storage medium. The storage medium mentioned can be a read-only memory, a magnetic disk or an optical disk, etc.
以上所述仅为本发明的较佳实施例,并不用以限制本发明,凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection of the present invention. Within range.