CN112749226A - Hive incremental data synchronization method and device, computer equipment and storage medium - Google Patents

Publication number
CN112749226A
Authority
CN
China
Prior art keywords
hbase
hive
mapping table
data
incremental data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911045304.9A
Other languages
Chinese (zh)
Inventor
薛星海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201911045304.9A
Publication of CN112749226A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275 Synchronous replication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity

Abstract

The application relates to a Hive incremental data synchronization method and apparatus, a computer device, and a storage medium. The method comprises: determining the Hbase corresponding to Hive; obtaining an association table from Hive to Hbase; writing the incremental data into the Hbase through the Hbase API to update the data in the Hbase; and synchronizing the Hive incremental data according to the association table and the updated Hbase. Throughout the process, incremental data is recorded in Hbase and synchronized to Hive through the Hive-to-Hbase association table. This avoids the limitations of operating directly on Hive and, by relying on the association mapping between Hbase and Hive, achieves efficient incremental data synchronization in Hive.

Description

Hive incremental data synchronization method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a Hive incremental data synchronization method, apparatus, computer device, and storage medium.
Background
With the development of computer technology, many data warehouse tools have appeared, among which Hive is currently the mainstream.
Hive is a data warehouse tool based on Hadoop. It can map structured data files to database tables, provides a simple SQL (Structured Query Language) query function, and can convert SQL statements into MapReduce tasks for execution. Its advantages are a low learning cost and the ability to implement simple MapReduce statistics quickly through SQL-like statements without developing a dedicated MapReduce application, which makes it well suited to statistical analysis in a data warehouse.
Despite these functional advantages, incremental synchronization in Hive is a relatively complex process. The traditional method synchronizes incremental data to Hive through Hive's own insert, update and delete operations. However, Hive is a data warehouse and supports insert, update and delete poorly, so the traditional Hive incremental synchronization method cannot achieve incremental synchronization efficiently.
Disclosure of Invention
In view of the above, it is necessary to provide a Hive incremental data synchronization method, an apparatus, a computer device, and a storage medium capable of efficiently implementing incremental synchronization in order to solve the above technical problems.
A Hive incremental data synchronization method, the method comprising:
determining Hbase corresponding to Hive;
acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the first mapping table and the second mapping table are associated with each other in field names and field types;
writing incremental data into the Hbase through an API (Application Program Interface) in the Hbase, wherein the incremental data is data which needs to be incremented into the Hive;
and synchronizing Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
In one embodiment, the obtaining the first mapping table in the Hive and the second mapping table in the Hbase includes:
creating a first mapping table in Hive;
calling the Hbase API, and creating a second mapping table in the Hbase, wherein the second mapping table is associated with the first mapping table.
In one embodiment, the obtaining the first mapping table in the Hive and the second mapping table in the Hbase includes:
creating a hivetable table in Hive, and creating an hbasetable table in Hbase;
mapping, in an associated manner, the table names of the hivetable table and the hbasetable table, the rowkey field in the hivetable table and the row key in the hbasetable table, and the column strings in the hivetable table and the column strings in the hbasetable table;
and establishing communication between the Hive and the Hbase.
In one embodiment, before writing the incremental data into the Hbase through the API in the Hbase, the method further includes:
identifying a type of the incremental data;
the writing incremental data to the Hbase via the API in the Hbase comprises:
calling the API in the Hbase corresponding to the type of the incremental data to write the incremental data into the underlying Hbase data.
In one embodiment, the calling the API in the Hbase corresponding to the type of the incremental data to write the incremental data into the underlying Hbase data includes:
if the type of the incremental data is newly added, calling an API (application programming interface) with a data adding function in the Hbase, and adding the incremental data;
if the type of the incremental data is updating, updating the incremental data through Hbase rowkey;
and if the incremental data type is deletion, deleting by Hbase rowkey.
In one embodiment, the synchronizing Hive increment data according to the first mapping table, the second mapping table and the updated Hbase comprises:
storing the data written into the Hbase into an HDFS (Hadoop Distributed File System);
and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
In one embodiment, the synchronizing Hive increment data according to the first mapping table, the second mapping table and the updated Hbase comprises:
identifying the updated field name and field type in the updated Hbase;
and adjusting the metadata corresponding to the updated field name and field type according to the first mapping table and the second mapping table, and synchronizing Hive incremental data.
A Hive incremental data synchronization apparatus, the apparatus comprising:
the database determination module is used for determining Hbase corresponding to Hive;
an association table obtaining module, configured to obtain a first mapping table in the Hive and a second mapping table in the Hbase, where the first mapping table and the second mapping table are associated with each other in a field name and a field type;
the increment module is used for writing increment data into the Hbase through an API (application program interface) in the Hbase, wherein the increment data is data needing to be incremented into the Hive;
and the synchronization module is used for synchronizing the Hive increment data according to the first mapping table, the second mapping table and the updated Hbase.
A computer device comprising at least one processor, at least one memory, and a bus; the processor and the memory complete mutual communication through the bus; the processor is configured to call program instructions in the memory to perform the method as described above.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as described above.
The Hive incremental data synchronization method, apparatus, computer device and storage medium described above determine the Hbase corresponding to Hive; obtain an association table from Hive to Hbase; write the incremental data through the Hbase API to update the data in the Hbase; and synchronize the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase. Throughout the process, incremental data is recorded in Hbase and synchronized to Hive through the Hive-to-Hbase association table, which avoids the limitations of operating directly on Hive. Relying on the association mapping between Hbase and Hive, incremental data synchronization in Hive is achieved efficiently.
Drawings
FIG. 1 is a diagram of an embodiment of an application environment of the Hive incremental data synchronization method;
FIG. 2 is a flow diagram illustrating a Hive incremental data synchronization method according to one embodiment;
FIG. 3 is a flow chart illustrating a Hive incremental data synchronization method according to another embodiment;
FIG. 4 is a block diagram of the Hive incremental data synchronizer in one embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The Hive incremental data synchronization method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 writes the incremental data to the server 104 for Hive incremental data synchronization. The server 104 receives the incremental data, determines the Hbase corresponding to Hive, acquires an association table from Hive to Hbase, writes the incremental data into the Hbase through the Hbase API to update the data in the Hbase, and synchronizes the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a Hive incremental data synchronization method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
S200: determining the Hbase corresponding to Hive.
HBase is a distributed, column-oriented, open-source database that provides Bigtable-like capabilities on top of Hadoop. It is a subproject of the Apache Hadoop project. Unlike a typical relational database, HBase is suited to storing unstructured data and uses a column-based rather than row-based schema. Because Hive and Hbase are both Hadoop big data components, they are already interconnected at the underlying layer. Here, the Hbase previously allocated to the current Hive in the server is determined; that is, when the current Hive needs incremental synchronization, the Hbase that supports it is determined first. Hive and its corresponding Hbase share their underlying data: the data of both is stored at the same location, specifically in HDFS.
S400: acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the field names and field types of the first mapping table and the second mapping table are associated with each other.
The first mapping table and the second mapping table may be constructed and stored in advance, or may be constructed on demand. Specifically, the first mapping table includes the hivetable table in Hive, and the second mapping table includes the hbasetable table in Hbase, which has a mapping relationship with the hivetable table. Through these two tables, data in Hbase can be synchronized into Hive: if data in Hbase is added, updated or deleted, the resulting data is also synchronized into Hive through the first mapping table and the second mapping table. The field names and field types of the first mapping table and the second mapping table are associated with each other.
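The association between the two mapping tables can be pictured as a small data structure. The following is a minimal Python sketch under stated assumptions, not the implementation of the application: the names `FieldMapping`, `HIVETABLE_TO_HBASETABLE` and `check_association` are hypothetical, and the column layout simply mirrors the hivetable/hbasetable example used elsewhere in this description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FieldMapping:
    hive_field: str                 # field name in the Hive-side (first) mapping table
    hive_type: str                  # field type declared in Hive
    hbase_family: Optional[str]     # Hbase column family; None marks the row key
    hbase_qualifier: Optional[str]  # Hbase column qualifier; None for the row key


# Hypothetical association between hivetable and hbasetable.
HIVETABLE_TO_HBASETABLE = [
    FieldMapping("rowkey", "string", None, None),
    FieldMapping("column1", "string", "columnfamily1", "column1"),
    FieldMapping("column2", "string", "columnfamily1", "column2"),
    FieldMapping("column3", "string", "columnfamily2", "column3"),
]


def check_association(mappings: List[FieldMapping]) -> bool:
    """True if exactly one field is the row key and no Hive field or Hbase column repeats."""
    row_keys = [m for m in mappings if m.hbase_family is None]
    hive_names = [m.hive_field for m in mappings]
    hbase_cols = [(m.hbase_family, m.hbase_qualifier) for m in mappings]
    return (
        len(row_keys) == 1
        and len(set(hive_names)) == len(hive_names)
        and len(set(hbase_cols)) == len(hbase_cols)
    )
```

A check of this kind is one way to confirm that every Hive field has exactly one counterpart in Hbase before any data is written.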
S600: writing incremental data into the Hbase through the Hbase API, wherein the incremental data is data that needs to be added to Hive.
An API is a call interface that an operating system exposes to application programs; an application causes the operating system to execute its commands by calling that API. Hive and Hbase each provide multiple APIs with different functions. Here, the incremental data is written into the Hbase through the Hbase API, which updates the data in the Hbase. Specifically, because the types of incremental data include addition, update and deletion, a different API is used for each type to update the data in Hbase. The incremental data is the data that currently needs to be added to Hive; it is first written into Hbase and then synchronized into Hive by the following operation.
S800: synchronizing the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
Because the first mapping table and the second mapping table represent the data relationship between Hive and Hbase, once the Hbase data has been updated (yielding the updated Hbase), the updated data can be synchronized into Hive through the two mapping tables. Specifically, Hive and Hbase share the same underlying data files stored on HDFS, so when data is added, updated or deleted in Hbase, the change can be seen from Hive according to the first mapping table and the second mapping table.
The Hive incremental data synchronization method described above determines the Hbase corresponding to Hive; obtains an association table from Hive to Hbase; writes the incremental data into the Hbase through the Hbase API to update the data in the Hbase; and synchronizes the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase. Throughout the process, incremental data is recorded in Hbase and synchronized to Hive through the Hive-to-Hbase association table, which avoids the limitations of operating directly on Hive. Relying on the association mapping between Hbase and Hive, incremental data synchronization in Hive is achieved efficiently.
In one embodiment, obtaining the first mapping table in Hive and the second mapping table in Hbase comprises: creating a first mapping table in Hive; the Hbase API is called, and a second mapping table associated with the first mapping table is created in the Hbase.
In this embodiment, the first mapping table in Hive and the second mapping table in Hbase are generated on demand. As described above, because Hive and Hbase are both Hadoop big data components, they are already interconnected at the underlying layer. To create the mapping table, a Hive shell command or the Hive API is used. After a mapping table is created in Hive, Hive in turn calls the Hbase API and creates an associated table in Hbase; this call is completed automatically under Hive's control. Specifically, the server may generate a command and send it to Hive and Hbase, so that Hive and Hbase cooperatively generate the Hive-to-Hbase association table (a first mapping table and an associated second mapping table whose field names and field types are associated with each other).
As shown in fig. 3, in one embodiment, step S400 includes:
s420: a hivetable is created in Hive and a hbasetable table is created in Hbase.
The server may generate a table build statement, create a hivetable in Hive, and create a hbasetable table in Hbase.
S440: and associating and mapping table names of the hievable table and the hbasetable, rowkey fields in the hievable table and row keys in the hbasetable, and column character strings in the hievable table and column character strings in the hbasetable.
Establishing the association mapping between the hivetable table and the hbasetable table specifically includes: mapping the table names of the hivetable table and the hbasetable table; mapping the rowkey field in the hivetable table to the row key in the hbasetable table; and mapping the column strings in the hivetable table to the column strings in the hbasetable table, where column1 in hivetable maps to the column1 field in column family1 of hbasetable, column2 in hivetable maps to the column2 field in column family1 of hbasetable, and column3 in hivetable maps to the column3 field in column family2 of hbasetable.
[Inline figures BDA0002253975290000061, BDA0002253975290000062, BDA0002253975290000071, BDA0002253975290000072 and BDA0002253975290000073 omitted]
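To illustrate how such a column mapping is applied, the following Python sketch splits a Hive-shaped row into an Hbase row key and a set of cells addressed by column family and qualifier. It is a sketch only: `COLUMN_MAPPING` and `to_hbase_cells` are hypothetical names, and the `family:qualifier` notation is assumed for readability.

```python
from typing import Dict, Tuple

# Hypothetical mapping: Hive column name -> "family:qualifier" in hbasetable.
COLUMN_MAPPING = {
    "column1": "columnfamily1:column1",
    "column2": "columnfamily1:column2",
    "column3": "columnfamily2:column3",
}


def to_hbase_cells(hive_row: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Split a Hive-shaped row into (row key, {"family:qualifier": value})."""
    row_key = hive_row["rowkey"]
    cells = {
        COLUMN_MAPPING[name]: value
        for name, value in hive_row.items()
        if name != "rowkey"
    }
    return row_key, cells
```

For example, `to_hbase_cells({"rowkey": "r1", "column1": "a"})` yields the row key `"r1"` and a single cell under `columnfamily1:column1`.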
S460: establishing communication between Hive and Hbase.
The external API interfaces of Hive and Hbase are called to establish communication between them, which finally yields the Hive-to-Hbase association table. Specifically, the server can generate an integration statement between Hive and Hbase, and the communication between Hive and Hbase is established through the hive-hbase-handler tool class. Further, the tool class hive-hbase-handler.
In one embodiment, before writing the incremental data into the Hbase through the Hbase API, the method further comprises identifying the type of the incremental data. Writing the incremental data then comprises calling the API in the Hbase that corresponds to that type to write the incremental data into the underlying Hbase data.
There are APIs with different functions in Hbase, and here, for the type of incremental data, the API corresponding to the type is selectively called to operate on the incremental data. Specifically, if the type of the incremental data is newly added, calling an API with a data adding function in Hbase, and adding the incremental data; if the type of the incremental data is updating, updating the incremental data through Hbase rowkey; and if the incremental data type is deletion, deleting by Hbase rowkey.
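The type-based selection described above can be sketched as a small dispatcher. This is a minimal Python illustration, not the Hbase client API: a plain dictionary stands in for an Hbase table, and the function name `apply_increment` and the type labels `"add"`, `"update"` and `"delete"` are assumptions made for the example.

```python
from typing import Dict, Optional

# In-memory stand-in for an Hbase table: row key -> {"family:qualifier": value}.
# This is NOT the Hbase client API; it only mimics put/delete semantics.
Table = Dict[str, Dict[str, str]]


def apply_increment(table: Table, kind: str, row_key: str,
                    cells: Optional[Dict[str, str]] = None) -> None:
    """Apply one incremental record according to its type."""
    if kind == "add":
        # New row: write all supplied cells under the row key.
        table[row_key] = dict(cells or {})
    elif kind == "update":
        # Update: locate the row by row key and overwrite the supplied cells.
        table.setdefault(row_key, {}).update(cells or {})
    elif kind == "delete":
        # Delete: remove the row identified by the row key, if present.
        table.pop(row_key, None)
    else:
        raise ValueError(f"unknown increment type: {kind}")
```

In the Hbase Java client, additions and updates are both expressed as `Put` operations and deletions as `Delete` operations, each addressed by row key, so a real dispatcher would likely map the three types onto those two calls.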
In one embodiment, synchronizing the Hive increment data according to the first mapping table, the second mapping table and the updated Hbase comprises: storing the data written into the Hbase into the HDFS; and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
Hbase writes the data updated in it to the distributed storage HDFS, so when the data is viewed from Hive according to the first mapping table and the second mapping table, the changed data is also visible, and operations such as querying and statistics can be performed on it. In short, Hive and Hbase share the same data files stored on HDFS. When data is added, updated or deleted in Hbase, the change can be observed in Hive; if the Hbase data changes, the data mapped by Hive changes as well. A single record can therefore be added or modified through Hbase, and querying and statistics over the Hbase data can then be performed in Hive using the association table.
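The effect of sharing one set of underlying data can be shown with a toy model in Python. Here a single dictionary plays the role of the data files on HDFS, and `hive_view` (a hypothetical name) projects it into Hive-shaped rows through an assumed column mapping; no real Hive or Hbase call is involved.

```python
from typing import Dict, List

# Hypothetical mapping: Hive column -> Hbase "family:qualifier".
MAPPING = {
    "column1": "columnfamily1:column1",
    "column2": "columnfamily1:column2",
    "column3": "columnfamily2:column3",
}


def hive_view(shared_store: Dict[str, Dict[str, str]]) -> List[Dict[str, object]]:
    """Project the shared store into Hive-shaped rows; missing cells become None."""
    rows = []
    for row_key in sorted(shared_store):
        cells = shared_store[row_key]
        row = {"rowkey": row_key}
        for hive_col, hbase_col in MAPPING.items():
            row[hive_col] = cells.get(hbase_col)
        rows.append(row)
    return rows


# One store plays the role of the data files shared on HDFS.
store = {"r1": {"columnfamily1:column1": "a"}}
before = hive_view(store)
store["r1"]["columnfamily1:column1"] = "a2"   # update written on the Hbase side
store["r2"] = {"columnfamily2:column3": "c"}  # new row written on the Hbase side
after = hive_view(store)
```

Because the view is computed from the same store that the Hbase side modifies, no separate copy step is needed for the Hive side to see the change.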
In one embodiment, synchronizing the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase comprises: identifying the updated field name and field type in the updated Hbase; and adjusting the metadata corresponding to the updated field name and field type according to the first mapping table and the second mapping table, and synchronizing the Hive incremental data.
The field names and field types of the first mapping table and the second mapping table are associated with each other. Based on this association, metadata can be adjusted in both Hbase and Hive so that the data conforms to the format requirements of each database. In practice, when incremental data is written into Hbase, Hbase updates the field name and field type, and the metadata is adjusted so that the data meets the Hbase format requirements at write time. During synchronization, the metadata for the field name and field type updated by Hbase is adjusted again, according to the field name and field type association between the first mapping table and the second mapping table, so that the data meets the Hive format requirements and the Hive incremental data is synchronized.
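One way to read the metadata adjustment is as a type conversion driven by the associated field types. The Python sketch below assumes that Hbase holds each value as UTF-8 encoded bytes and that the Hive side declares one of three field types; the names `DECODERS` and `adjust_for_hive`, and the fields `name`, `count` and `score` used in the usage note, are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical Hive field types and how to decode UTF-8 text into them.
DECODERS: Dict[str, Callable[[str], object]] = {
    "string": str,
    "int": int,
    "double": float,
}


def adjust_for_hive(raw_cells: Dict[str, bytes],
                    hive_types: Dict[str, str]) -> Dict[str, object]:
    """Decode raw Hbase byte values into values typed per the Hive field types."""
    adjusted = {}
    for field, field_type in hive_types.items():
        raw = raw_cells.get(field)
        # A cell missing in Hbase surfaces as a null (None) on the Hive side.
        adjusted[field] = None if raw is None else DECODERS[field_type](raw.decode("utf-8"))
    return adjusted
```

For instance, `adjust_for_hive({"name": b"A", "count": b"3"}, {"name": "string", "count": "int", "score": "double"})` gives a string, an integer, and a null for the absent `score` cell.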
In one application example, suppose a piece of data A currently needs to be written to Hive. The Hive incremental data synchronization method first determines the Hbase that shares underlying data with Hive. A hivetable table is created in Hive with the attribute of being an Hbase mapping table, and an hbasetable table is automatically generated in Hbase; the field names and field types of the hivetable table and the hbasetable table are associated. Data A is then written into Hbase in the Hbase database format using an Hbase command and stored in the underlying Hbase data. When Hive is queried, the requested data is looked up in the underlying Hbase data, and metadata adjustment (that is, database format adjustment) is performed on it according to the hbasetable table and the hivetable table to obtain the query result, thereby achieving Hive incremental synchronization.
It should be understood that although the steps in the flow charts of figs. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in figs. 2-3 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In order to further explain the technical solution and effect of the Hive incremental data synchronization method of the present application in detail, the following describes the whole process in detail by using a specific example and combining with a command statement corresponding to the implementation process. In practical application, the Hive incremental data synchronization method of the application can comprise the following parts:
1. Establishing a Hive-to-Hbase association table.
Because Hive and Hbase are both Hadoop big data components, they are already interconnected at the underlying layer. To create the mapping table, a Hive shell command or the Hive API is used. After a mapping table is created in Hive, Hive in turn calls the Hbase API and creates an associated table in Hbase; this call can be completed automatically under Hive's control. For example, the following commands create an association table between Hive and Hbase:
Table creation statement: create external table hivetable (rowkey string, column1 string, column2 string, column3 string). This statement creates a table named hivetable in Hive, mapped to a table named hbasetable in Hbase, with the following mapping relationship:
the rowkey field in hivetable is associated with the row key in hbasetable;
column1 in hivetable maps to the column1 field in column family1 of hbasetable;
column2 in hivetable maps to the column2 field in column family1 of hbasetable;
column3 in hivetable maps to the column3 field in column family2 of hbasetable.
[Inline figures BDA0002253975290000091 to BDA0002253975290000095 omitted]
Integration statement: stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties ("hbase.columns.mapping" = ":key,columnfamily1:column1,columnfamily1:column2,columnfamily2:column3"). The integration (intercommunication) between Hive and Hbase is implemented mainly through hive-hbase-handler.
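The table creation and integration statements can be combined into one Hive DDL string. The Python helper below is a sketch that only assembles text; `build_mapping_ddl` is a hypothetical name. The `TBLPROPERTIES ("hbase.table.name" = ...)` clause is an assumption about how the Hbase table name is supplied, since that part of the statement is not legible in this text. Depending on the Hive version, the Hbase table may need to exist before an external table is declared over it.

```python
def build_mapping_ddl(hive_table, hbase_table, columns):
    """Build a Hive DDL string for a table backed by Hbase.

    columns: list of (hive_column, hive_type, hbase_column); the row key
    uses ":key" as its hbase_column.
    """
    column_defs = ", ".join(f"{name} {col_type}" for name, col_type, _ in columns)
    mapping = ",".join(hbase_col for _, _, hbase_col in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f'WITH SERDEPROPERTIES ("hbase.columns.mapping" = "{mapping}")\n'
        # Assumed clause: names the Hbase table the Hive table maps to.
        f'TBLPROPERTIES ("hbase.table.name" = "{hbase_table}")'
    )


ddl = build_mapping_ddl(
    "hivetable",
    "hbasetable",
    [
        ("rowkey", "string", ":key"),
        ("column1", "string", "columnfamily1:column1"),
        ("column2", "string", "columnfamily1:column2"),
        ("column3", "string", "columnfamily2:column3"),
    ],
)
print(ddl)
```

Generating the statement from one column list keeps the Hive column definitions and the Hbase column mapping in the same order, which the mapping string relies on.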
2. Writing the incremental data through the Hbase API: if the incremental data is new data, it is added to Hbase; if it is updated data, it is updated by Hbase rowkey; if it is deleted data, it is deleted by rowkey.
3. Querying the data from Hive. The queried data reflects the changes, which realizes the incremental synchronization scenario. Hive and Hbase actually share the same underlying data files stored on HDFS, so when data is added, updated or deleted in Hbase, the change can be observed in Hive.
As shown in fig. 4, the present application further provides a Hive incremental data synchronization apparatus, which includes:
a database determining module 200, configured to determine an Hbase corresponding to Hive;
an association table obtaining module 400, configured to obtain a first mapping table in Hive and a second mapping table in Hbase, where the first mapping table and the second mapping table are associated with each other in field name and field type;
an increment module 600, configured to write increment data into the Hbase through an API in the Hbase, where the increment data is data to be incremented into Hive;
and a synchronizing module 800, configured to synchronize the Hive incremental data according to the first mapping table, the second mapping table, and the updated Hbase.
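As a rough structural illustration, the four modules can be modeled as methods of one class. This Python sketch is a toy model under stated assumptions: dictionaries stand in for Hbase and HDFS, and the class name `HiveIncrementalSync`, the registry key `"hive_main"` and the single-column mapping are all hypothetical.

```python
class HiveIncrementalSync:
    """Toy model of the four modules; a dict stands in for Hbase and HDFS."""

    def __init__(self):
        # Hypothetical registry: Hive instance name -> its Hbase store.
        self.hbase_by_hive = {"hive_main": {}}
        self.mapping = {}

    def determine_hbase(self, hive_name):
        # Database determining module 200.
        return self.hbase_by_hive[hive_name]

    def obtain_mapping(self):
        # Association table obtaining module 400 (hypothetical one-column mapping).
        self.mapping = {"column1": "columnfamily1:column1"}
        return self.mapping

    def write_increment(self, hbase, row_key, cells):
        # Increment module 600: write cells under the given row key.
        hbase.setdefault(row_key, {}).update(cells)

    def synchronize(self, hbase):
        # Synchronization module 800: expose Hbase rows in Hive shape.
        return [
            {"rowkey": key, **{h: cells.get(c) for h, c in self.mapping.items()}}
            for key, cells in sorted(hbase.items())
        ]


sync = HiveIncrementalSync()
hbase = sync.determine_hbase("hive_main")
sync.obtain_mapping()
sync.write_increment(hbase, "r1", {"columnfamily1:column1": "A"})
rows = sync.synchronize(hbase)
```

The usage lines at the end follow the same order as modules 200, 400, 600 and 800.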
The Hive incremental data synchronization apparatus determines the Hbase corresponding to Hive; obtains an association table from Hive to Hbase; writes the incremental data into the Hbase through the Hbase API to update the data in the Hbase; and synchronizes the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase. Throughout the process, incremental data is recorded in Hbase and synchronized to Hive through the Hive-to-Hbase association table, which avoids the limitations of operating directly on Hive. Relying on the association mapping between Hbase and Hive, incremental data synchronization in Hive is achieved efficiently.
In one embodiment, the association table obtaining module 400 is further configured to create a first mapping table in Hive; the Hbase API is called, and a second mapping table associated with the first mapping table is created in the Hbase.
In one embodiment, the association table obtaining module 400 is further configured to create a hivetable table in Hive and an hbasetable table in Hbase; map, in an associated manner, the table names of the hivetable table and the hbasetable table, the rowkey field in the hivetable table and the row key in the hbasetable table, and the column strings in the hivetable table and the column strings in the hbasetable table; and establish communication between Hive and Hbase.
In one embodiment, the increment module 600 is further configured to identify the type of the incremental data, and to call the API in the Hbase corresponding to that type to write the incremental data into the underlying Hbase data.
In one embodiment, the increment module 600 is further configured to call an API with a data addition function in the Hbase to add the incremental data if the type of the incremental data is newly added; if the type of the incremental data is updating, updating the incremental data through Hbase rowkey; and if the incremental data type is deletion, deleting by Hbase rowkey.
In one embodiment, the synchronization module 800 is further configured to store the updated data in Hbase to the HDFS; and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
In one embodiment, the synchronizing module 800 is further configured to synchronize the Hive increment data according to the first mapping table, the second mapping table, and the updated Hbase, including: identifying the updated field name and field type in the updated Hbase; and adjusting the updated field name and the metadata corresponding to the field type according to the first mapping table and the second mapping table, and synchronizing the Hive incremental data.
For specific limitations of the Hive incremental data synchronization apparatus, reference may be made to the above limitations on the Hive incremental data synchronization method, which are not repeated here. The modules in the Hive incremental data synchronization apparatus may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data such as Hbase and a preset association table. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a Hive incremental data synchronization method.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
determining Hbase corresponding to Hive;
acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the first mapping table and the second mapping table are related to each other in field names and field types;
writing incremental data into the Hbase through an API (application program interface) in the Hbase, wherein the incremental data is data that needs to be added incrementally to the Hive;
and synchronizing the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
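The four steps above can be illustrated end to end with a small in-memory sketch. Everything here is hypothetical: `hive_view` and `synchronize` are illustrative names, the Hbase store is a plain dictionary, and the `":key"` marker for the rowkey is borrowed from the Hive HBase column-mapping convention as an assumption about how the two mapping tables relate fields.

```python
def hive_view(hbase_rows, column_mapping):
    """Project Hbase rows into Hive-style rows using the field mapping.

    hbase_rows:     {rowkey: {hbase_column: value}}
    column_mapping: {hive_field: hbase_column}, with ":key" denoting the rowkey
    """
    view = []
    for rowkey in sorted(hbase_rows):  # Hbase keeps rows sorted by rowkey
        cells = hbase_rows[rowkey]
        row = {}
        for hive_field, hbase_column in column_mapping.items():
            if hbase_column == ":key":
                row[hive_field] = rowkey
            else:
                row[hive_field] = cells.get(hbase_column)
        view.append(row)
    return view


def synchronize(hbase_rows, deltas, column_mapping):
    """Write incremental rows into the store, then return the Hive-side view."""
    for rowkey, cells in deltas:
        hbase_rows.setdefault(rowkey, {}).update(cells)
    return hive_view(hbase_rows, column_mapping)
```

Because the Hive side is derived from the Hbase store through the mapping, writing only the incremental rows is enough for the next read to reflect them; no full reload is modeled or needed in this sketch.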
In one embodiment, the processor, when executing the computer program, further performs the steps of:
creating a first mapping table in Hive; and calling the Hbase API to create, in the Hbase, a second mapping table associated with the first mapping table.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
creating a hivetable in Hive and an hbasetable in Hbase; associating and mapping the table name of the hivetable with that of the hbasetable, the rowkey field in the hivetable with the row key in the hbasetable, and the column strings in the hivetable with the column strings in the hbasetable; and establishing communication between Hive and Hbase.
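One common way to express such an association in Hive is a table backed by the HBase storage handler, where the `hbase.table.name` property links the table names and `hbase.columns.mapping` links the rowkey and columns. The source does not state that this is the mechanism used, so the sketch below is an assumption; `build_hive_mapping_ddl` is a hypothetical helper that only builds the statement text and does not connect to Hive.

```python
def build_hive_mapping_ddl(hive_table, hbase_table, fields, hbase_columns):
    """Build a Hive DDL string that maps a Hive table onto an Hbase table.

    fields:        list of (hive_field_name, hive_type), in column order
    hbase_columns: list of Hbase column specs aligned with `fields`;
                   ":key" stands for the Hbase rowkey
    """
    if len(fields) != len(hbase_columns):
        raise ValueError("each Hive field needs exactly one Hbase column spec")
    columns_sql = ", ".join(f"{name} {hive_type}" for name, hive_type in fields)
    mapping = ",".join(hbase_columns)  # no spaces between the mapped entries
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({columns_sql})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


# Usage: map the rowkey and one column, using the table names from the text.
ddl = build_hive_mapping_ddl(
    "hivetable",
    "hbasetable",
    [("rowkey", "string"), ("name", "string")],
    [":key", "cf:name"],
)
```

The column family `cf` and the field names are placeholders chosen for the example.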
In one embodiment, the processor, when executing the computer program, further performs the steps of:
identifying the type of the incremental data; and calling the API in the Hbase that corresponds to the type of the incremental data to write the incremental data into the underlying Hbase data.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
if the type of the incremental data is addition, calling an API with a data-adding function in the Hbase to add the incremental data; if the type of the incremental data is update, updating the incremental data by Hbase rowkey; and if the type of the incremental data is deletion, deleting the corresponding data by Hbase rowkey.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
storing the data written into the Hbase into the HDFS; and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
identifying the updated field name and field type in the updated Hbase; and adjusting the updated field name and the metadata corresponding to the field type according to the first mapping table and the second mapping table, and synchronizing the Hive incremental data.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
determining Hbase corresponding to Hive;
acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the first mapping table and the second mapping table are related to each other in field names and field types;
writing incremental data into the Hbase through an API (application program interface) in the Hbase, wherein the incremental data is data that needs to be added incrementally to the Hive;
and synchronizing the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
In one embodiment, the computer program when executed by the processor further performs the steps of:
creating a first mapping table in Hive; and calling the Hbase API to create, in the Hbase, a second mapping table associated with the first mapping table.
In one embodiment, the computer program when executed by the processor further performs the steps of:
creating a hivetable in Hive and an hbasetable in Hbase; associating and mapping the table name of the hivetable with that of the hbasetable, the rowkey field in the hivetable with the row key in the hbasetable, and the column strings in the hivetable with the column strings in the hbasetable; and establishing communication between Hive and Hbase.
In one embodiment, the computer program when executed by the processor further performs the steps of:
identifying the type of the incremental data; and calling the API in the Hbase that corresponds to the type of the incremental data to write the incremental data into the underlying Hbase data.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if the type of the incremental data is addition, calling an API with a data-adding function in the Hbase to add the incremental data; if the type of the incremental data is update, updating the incremental data by Hbase rowkey; and if the type of the incremental data is deletion, deleting the corresponding data by Hbase rowkey.
In one embodiment, the computer program when executed by the processor further performs the steps of:
storing the data written into the Hbase into the HDFS; and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
In one embodiment, the computer program when executed by the processor further performs the steps of:
identifying the updated field name and field type in the updated Hbase; and adjusting the updated field name and the metadata corresponding to the field type according to the first mapping table and the second mapping table, and synchronizing the Hive incremental data.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that contains no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A Hive incremental data synchronization method, comprising:
determining Hbase corresponding to Hive;
acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the first mapping table and the second mapping table are associated with each other in field names and field types;
writing incremental data into the Hbase through an API in the Hbase, wherein the incremental data is data that needs to be added incrementally to the Hive;
and synchronizing Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
2. The method of claim 1, wherein the obtaining the first mapping table in the Hive and the second mapping table in the Hbase comprises:
creating a first mapping table in Hive;
calling the Hbase API, and creating a second mapping table in the Hbase, wherein the second mapping table is associated with the first mapping table.
3. The method of claim 1, wherein the obtaining the first mapping table in the Hive and the second mapping table in the Hbase comprises:
creating a hivetable in Hive and an hbasetable in Hbase;
associating and mapping the table name of the hivetable with that of the hbasetable, the rowkey field in the hivetable with the row key in the hbasetable, and the column strings in the hivetable with the column strings in the hbasetable;
and establishing communication between the Hive and the Hbase.
4. The method of claim 1, wherein, before the writing incremental data into the Hbase through the API in the Hbase, the method further comprises:
identifying a type of the incremental data;
and the writing incremental data into the Hbase through the API in the Hbase comprises:
calling the API in the Hbase that corresponds to the type of the incremental data to write the incremental data into the underlying Hbase data.
5. The method of claim 4, wherein the calling the API in the Hbase that corresponds to the type of the incremental data to write the incremental data into the underlying Hbase data comprises:
if the type of the incremental data is addition, calling an API with a data-adding function in the Hbase to add the incremental data;
if the type of the incremental data is update, updating the incremental data by Hbase rowkey;
and if the type of the incremental data is deletion, deleting the corresponding data by Hbase rowkey.
6. The method of claim 1, wherein the synchronizing Hive incremental data according to the first mapping table, the second mapping table, and the updated Hbase comprises:
storing the data written into the Hbase into an HDFS;
and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
7. The method of claim 1, wherein the synchronizing Hive incremental data according to the first mapping table, the second mapping table, and the updated Hbase comprises:
identifying the updated field name and field type in the updated Hbase;
and adjusting the updated field name and the metadata corresponding to the field type according to the first mapping table and the second mapping table, and synchronizing Hive incremental data.
8. A Hive incremental data synchronization apparatus, comprising:
the database determination module is used for determining Hbase corresponding to Hive;
an association table obtaining module, configured to obtain a first mapping table in the Hive and a second mapping table in the Hbase, where the first mapping table and the second mapping table are associated with each other in a field name and a field type;
the increment module is used for writing incremental data into the Hbase through an API (application program interface) in the Hbase, wherein the incremental data is data that needs to be added incrementally to the Hive;
and the synchronization module is used for synchronizing the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
9. A computer device comprising at least one processor, at least one memory, and a bus; the processor and the memory complete mutual communication through the bus; the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201911045304.9A 2019-10-30 2019-10-30 Hive incremental data synchronization method and device, computer equipment and storage medium Pending CN112749226A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911045304.9A CN112749226A (en) 2019-10-30 2019-10-30 Hive incremental data synchronization method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112749226A true CN112749226A (en) 2021-05-04

Family

ID=75640597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911045304.9A Pending CN112749226A (en) 2019-10-30 2019-10-30 Hive incremental data synchronization method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112749226A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201403929D0 (en) * 2013-03-13 2014-04-23 Cloudera Inc Low Latency query engine for apache hadoop
CN104298760A (en) * 2014-10-23 2015-01-21 北京京东尚科信息技术有限公司 Data processing method and data processing device applied to data warehouse
US20170075965A1 (en) * 2015-09-16 2017-03-16 Turn Inc. Table level distributed database system for big data storage and query
CN107203594A (en) * 2017-04-28 2017-09-26 努比亚技术有限公司 A kind of data processing equipment, method and computer-readable recording medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
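汤羽;王英杰;范爱华;姚远哲: "基于HDFS开源架构与多级索引表的海量数据检索mDHT算法" [mDHT algorithm for massive data retrieval based on the HDFS open-source architecture and multi-level index tables], 计算机科学 [Computer Science], no. 02 *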

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination