CN112749226A - Hive incremental data synchronization method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN112749226A (application number CN201911045304.9A)
- Authority
- CN
- China
- Prior art keywords
- hbase
- hive
- mapping table
- data
- incremental data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/275—Synchronous replication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2365—Ensuring data consistency and integrity
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Computer Security & Cryptography (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to a Hive incremental data synchronization method and device, computer equipment, and a storage medium. The method comprises: determining the Hbase corresponding to Hive; obtaining an association table from Hive to Hbase; writing the incremental data into the Hbase through the API in the Hbase to update the data in the Hbase; and synchronizing the Hive incremental data according to the association table and the updated Hbase. Throughout the process, incremental data are recorded in Hbase and synchronized to Hive on the basis of the Hive-to-Hbase association table. This avoids the limitations of operating directly in Hive and, by relying on the association mapping between Hbase and Hive, achieves efficient incremental data synchronization in Hive.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a Hive incremental data synchronization method, apparatus, computer device, and storage medium.
Background
With the development of computer technology, many data warehouse tools have appeared, such as Hive, which is currently mainstream.
Hive is a data warehouse tool based on Hadoop. It can map structured data files into database tables, provides a simple SQL (Structured Query Language) query function, and can convert SQL statements into MapReduce tasks for execution. Its advantages are a low learning cost and the ability to implement simple MapReduce statistics quickly through SQL-like statements without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.
Despite these advantages, incremental synchronization in Hive is a relatively complex process. The traditional method synchronizes incremental data to Hive through Hive's own insert, update and delete operations. Because Hive is a data warehouse, its support for insert, update and delete is poor, so the traditional Hive incremental synchronization method cannot achieve incremental synchronization efficiently.
Disclosure of Invention
In view of the above, it is necessary to provide a Hive incremental data synchronization method, an apparatus, a computer device, and a storage medium capable of efficiently implementing incremental synchronization in order to solve the above technical problems.
A Hive incremental data synchronization method, the method comprising:
determining Hbase corresponding to Hive;
acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the first mapping table and the second mapping table are associated with each other in field names and field types;
writing incremental data into the Hbase through an API (Application Program Interface) in the Hbase, wherein the incremental data is data that needs to be added incrementally to the Hive;
and synchronizing Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
In one embodiment, the obtaining the first mapping table in the Hive and the second mapping table in the Hbase includes:
creating a first mapping table in Hive;
calling the Hbase API, and creating a second mapping table in the Hbase, wherein the second mapping table is associated with the first mapping table.
In one embodiment, the obtaining the first mapping table in the Hive and the second mapping table in the Hbase includes:
creating a hivetable table in Hive, and creating an hbasetable table in Hbase;
associating and mapping the table names of the hivetable table and the hbasetable table, the rowkey field in the hivetable table and the row key in the hbasetable table, and the column strings in the hivetable table and the column strings in the hbasetable table;
and establishing communication between the Hive and the Hbase.
In one embodiment, before writing the incremental data into the Hbase through the API in the Hbase, the method further includes:
identifying a type of the incremental data;
the writing incremental data to the Hbase via the API in the Hbase comprises:
and calling the API (application programming interface) in the Hbase that corresponds to the type of the incremental data to write the incremental data into the Hbase underlying data.
In one embodiment, the calling the API in the Hbase corresponding to the type of the incremental data to write the incremental data into the Hbase underlying data includes:
if the type of the incremental data is newly added, calling an API (application programming interface) with a data adding function in the Hbase, and adding the incremental data;
if the type of the incremental data is updating, updating the incremental data through Hbase rowkey;
and if the incremental data type is deletion, deleting by Hbase rowkey.
In one embodiment, the synchronizing Hive increment data according to the first mapping table, the second mapping table and the updated Hbase comprises:
storing the data written into the Hbase into an HDFS (Hadoop Distributed File System);
and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
In one embodiment, the synchronizing Hive increment data according to the first mapping table, the second mapping table and the updated Hbase comprises:
identifying the updated field name and field type in the updated Hbase;
and adjusting the metadata corresponding to the updated field name and field type according to the first mapping table and the second mapping table, and synchronizing Hive incremental data.
A Hive incremental data synchronization apparatus, the apparatus comprising:
the database determination module is used for determining Hbase corresponding to Hive;
an association table obtaining module, configured to obtain a first mapping table in the Hive and a second mapping table in the Hbase, where the first mapping table and the second mapping table are associated with each other in a field name and a field type;
the increment module is used for writing incremental data into the Hbase through an API (application program interface) in the Hbase, wherein the incremental data is data that needs to be added incrementally to the Hive;
and the synchronization module is used for synchronizing the Hive increment data according to the first mapping table, the second mapping table and the updated Hbase.
A computer device comprising at least one processor, at least one memory, and a bus; the processor and the memory complete mutual communication through the bus; the processor is configured to call program instructions in the memory to perform the method as described above.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as described above.
The Hive incremental data synchronization method, device, computer equipment and storage medium determine the Hbase corresponding to Hive; obtain an association table from Hive to Hbase; write the incremental data through the API in the Hbase to update the data in the Hbase; and synchronize the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase. Throughout the process, incremental data are recorded in Hbase and synchronized to Hive on the basis of the Hive-to-Hbase association table. This avoids the limitations of operating directly in Hive and, by relying on the association mapping between Hbase and Hive, achieves efficient incremental data synchronization in Hive.
Drawings
FIG. 1 is a diagram of an embodiment of an application environment of the Hive incremental data synchronization method;
FIG. 2 is a flow diagram illustrating a Hive incremental data synchronization method according to one embodiment;
FIG. 3 is a flow chart illustrating a Hive incremental data synchronization method according to another embodiment;
FIG. 4 is a block diagram of the Hive incremental data synchronization device in one embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The Hive incremental data synchronization method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 writes the incremental data to the server 104 for Hive incremental data synchronization. The server 104 receives the incremental data, determines the Hbase corresponding to the Hive, acquires an association table from the Hive to the Hbase, writes the incremental data into the Hbase through an API in the Hbase to update the data in the Hbase, and synchronizes the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a Hive incremental data synchronization method is provided. Taking its application to the server in fig. 1 as an example, the method includes the following steps:
S200: determining the Hbase corresponding to Hive.
HBase is a distributed, column-oriented, open-source database that provides Bigtable-like capabilities on top of Hadoop. It is a subproject of the Apache Hadoop project. Unlike a typical relational database, HBase is suited to unstructured data storage and uses a column-based rather than row-based schema. Because Hive and Hbase are both Hadoop big data components, they can already communicate with each other at the underlying layer. In this step, the Hbase previously allocated to the current Hive in the server is determined; that is, when the current Hive needs incremental synchronization, the Hbase providing support is determined first. The underlying data of the Hive and its corresponding Hbase is shared: the underlying data of both is stored at the same location, specifically in the HDFS.
S400: acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the field names and field types of the first mapping table and the second mapping table are associated with each other.
The first mapping table and the second mapping table may be constructed and stored in advance, or may be constructed on the fly. Specifically, the first mapping table includes a hivetable table in Hive, and the second mapping table includes an hbasetable table in Hbase that has a mapping relationship with the hivetable table. Through these two tables, data in the Hbase can be synchronized into Hive; that is, if data in the Hbase is added, updated or deleted, the resulting data can also be synchronized into Hive through the first mapping table and the second mapping table. The field names and field types of the first mapping table and the second mapping table are associated with each other.
S600: writing incremental data into the Hbase through an API in the Hbase, wherein the incremental data is data that needs to be added incrementally to Hive.
An API is a call interface that an operating system exposes to application programs; an application has the operating system execute its commands by calling this API. Hive and Hbase each provide multiple APIs with different functions. Here, the incremental data is written into the Hbase through the API in the Hbase, which updates the data in the Hbase. Specifically, since the types of incremental data include addition, update and deletion, different APIs are used for different types of incremental data to update the data in the Hbase. The incremental data is the data that currently needs to be added to Hive; it is first written into the Hbase and then synchronized into Hive by the following operation.
S800: synchronizing the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
Since the first mapping table and the second mapping table represent the data relationship between Hive and Hbase, once the Hbase data has been updated (that is, the updated Hbase is obtained), the updated data can be synchronized into Hive through the two mapping tables. Specifically, the underlying data of Hive and Hbase is the same shared data file stored on the HDFS, so when data is added, updated or deleted in Hbase, the change can be seen in Hive according to the first mapping table and the second mapping table.
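The four steps S200 to S800 can be illustrated with a minimal in-memory sketch. This is not the implementation of the application: a Python dict stands in for the shared underlying data, and every name here (`HIVE_TO_HBASE`, `STORE`, `MAPPING`, `determine_hbase`, `write_increment`, `hive_select_all`) is a hypothetical helper introduced only for illustration. The column family names follow the example mapping used elsewhere in this description.

```python
# Minimal in-memory sketch of steps S200-S800. All names are hypothetical;
# a dict stands in for the HBase table whose files Hive also reads.

# S200: registry recording which HBase table backs each Hive table (assumed).
HIVE_TO_HBASE = {"hivetable": "hbasetable"}

# Shared underlying data: HBase table -> {rowkey: {"family:qualifier": value}}
STORE = {"hbasetable": {}}

# S400: association between Hive columns and HBase cells (":key" = row key).
MAPPING = {
    "rowkey": ":key",
    "column1": "columnfamily1:column1",
    "column2": "columnfamily1:column2",
    "column3": "columnfamily2:column3",
}


def determine_hbase(hive_table):
    """S200: look up the HBase table assigned to a Hive table."""
    return HIVE_TO_HBASE[hive_table]


def write_increment(hbase_table, rowkey, cells):
    """S600: write incremental data into the HBase stand-in."""
    STORE[hbase_table].setdefault(rowkey, {}).update(cells)


def hive_select_all(hive_table):
    """S800: read rows through the mapping; Hive keeps no copy of its own."""
    rows = STORE[determine_hbase(hive_table)]
    result = []
    for rowkey in sorted(rows):
        cells = rows[rowkey]
        record = {}
        for hive_col, target in MAPPING.items():
            record[hive_col] = rowkey if target == ":key" else cells.get(target)
        result.append(record)
    return result


hbase_table = determine_hbase("hivetable")
write_increment(hbase_table, "r1", {"columnfamily1:column1": "a"})
rows_seen_by_hive = hive_select_all("hivetable")
print(rows_seen_by_hive)
```

Because the read path goes through the same store that the write path modifies, no separate copy step is needed in this sketch, which mirrors the shared-file arrangement described above.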
The Hive incremental data synchronization method determines the Hbase corresponding to Hive; obtains an association table from Hive to Hbase; writes the incremental data into the Hbase through an API in the Hbase to update the data in the Hbase; and synchronizes the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase. Throughout the process, incremental data are recorded in Hbase and synchronized to Hive on the basis of the Hive-to-Hbase association table. This avoids the limitations of operating directly in Hive and, by relying on the association mapping between Hbase and Hive, achieves efficient incremental data synchronization in Hive.
In one embodiment, obtaining the first mapping table in Hive and the second mapping table in Hbase comprises: creating a first mapping table in Hive; the Hbase API is called, and a second mapping table associated with the first mapping table is created in the Hbase.
In this embodiment, the first mapping table in the Hive and the second mapping table in the Hbase are generated on the fly. As described above, because Hive and Hbase are both Hadoop big data components, they can already call each other at the underlying layer. To create the mapping table, a Hive shell command or the Hive API is used. After a mapping table is created in Hive, Hive in turn calls the API of Hbase and creates an associated table in Hbase; this calling process is completed automatically under the control of Hive. Specifically, the server may generate a command and send it to Hive and Hbase, so that they cooperatively generate the association table from Hive to Hbase (that is, a first mapping table and a second mapping table associated with it, with the field names and field types of the two tables associated).
As shown in fig. 3, in one embodiment, step S400 includes:
S420: creating a hivetable table in Hive and creating an hbasetable table in Hbase.
The server may generate a table-building statement to create the hivetable table in Hive and the hbasetable table in Hbase.
S440: associating and mapping the table names of the hivetable table and the hbasetable table, the rowkey field in the hivetable table and the row key in the hbasetable table, and the column strings in the hivetable table and the column strings in the hbasetable table.
An association mapping relationship is established between the hivetable table and the hbasetable table. It specifically covers: the table names of the hivetable table and the hbasetable table; the rowkey field in the hivetable table and the row key in the hbasetable table; and the column strings in the hivetable table and the column strings in the hbasetable table. Here, column1 in the hivetable table is mapped to the column1 field on column family1 in the hbasetable table, column2 in the hivetable table is mapped to the column2 field on column family1 in the hbasetable table, and column3 in the hivetable table is mapped to the column3 field on column family2 in the hbasetable table; that is, the mapped fields are column1, column2 and column3.
S460: establishing communication between Hive and Hbase.
The external API interfaces of Hive and Hbase are called to establish communication between them, finally yielding the association table from Hive to Hbase. Specifically, the server can generate an integration function statement between Hive and Hbase, and the communication between them is established through the hive_hbase-handler tool class. Further, the tool class hive_hbase-handler.
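As a rough illustration of the association built in S440, the sketch below pairs each Hive column with its Hbase target. It assumes the mapping is written as a comma-separated string in which ":key" denotes the row key and every other entry has the form "family:qualifier"; the function name `parse_columns_mapping` is hypothetical.

```python
def parse_columns_mapping(hive_columns, mapping):
    """Pair each Hive column with its HBase target, in declaration order.

    `mapping` is a comma-separated string: ":key" for the row key,
    otherwise "family:qualifier".
    """
    targets = [part.strip() for part in mapping.split(",")]
    if len(targets) != len(hive_columns):
        raise ValueError("each Hive column needs exactly one HBase target")
    association = {}
    for column, target in zip(hive_columns, targets):
        if target == ":key":
            association[column] = ("row key", None)
        else:
            family, qualifier = target.split(":", 1)
            association[column] = (family, qualifier)
    return association


association = parse_columns_mapping(
    ["rowkey", "column1", "column2", "column3"],
    ":key,columnfamily1:column1,columnfamily1:column2,columnfamily2:column3",
)
for hive_column, target in association.items():
    print(hive_column, "->", target)
```

Checking that the number of targets matches the number of Hive columns is one simple way to catch a mapping that has drifted out of step with the table definition.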
In one embodiment, before writing the incremental data into the Hbase through the API in the Hbase, the method further comprises identifying the type of the incremental data; writing the incremental data into the Hbase through the API in the Hbase then comprises calling the API in the Hbase that corresponds to the type of the incremental data to write the incremental data into the Hbase underlying data.
Hbase provides APIs with different functions; here, the API corresponding to the type of the incremental data is selected and called to operate on the incremental data. Specifically, if the type of the incremental data is addition, an API with a data adding function in Hbase is called to add the incremental data; if the type is update, the incremental data is updated by Hbase rowkey; and if the type is deletion, the data is deleted by Hbase rowkey.
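The type-based selection described above can be sketched as a small dispatcher. The class `InMemoryHBaseTable` is a hypothetical stand-in for an Hbase table, not the Hbase client API, and the existence check on update is an illustrative assumption rather than something the application specifies.

```python
from enum import Enum


class IncrementType(Enum):
    ADD = "add"
    UPDATE = "update"
    DELETE = "delete"


class InMemoryHBaseTable:
    """Hypothetical stand-in for an HBase table keyed by rowkey."""

    def __init__(self):
        self.rows = {}

    def put(self, rowkey, cells):
        # Creates the row if needed, then overwrites the given cells.
        self.rows.setdefault(rowkey, {}).update(cells)

    def delete(self, rowkey):
        self.rows.pop(rowkey, None)


def apply_increment(table, kind, rowkey, cells=None):
    """Route one incremental record to the operation matching its type."""
    if kind is IncrementType.ADD:
        table.put(rowkey, cells or {})
    elif kind is IncrementType.UPDATE:
        # Assumption for this sketch: an update must target an existing row.
        if rowkey not in table.rows:
            raise KeyError(f"cannot update missing rowkey {rowkey!r}")
        table.put(rowkey, cells or {})
    elif kind is IncrementType.DELETE:
        table.delete(rowkey)
    else:
        raise ValueError(f"unknown increment type: {kind!r}")


table = InMemoryHBaseTable()
apply_increment(table, IncrementType.ADD, "r1", {"columnfamily1:column1": "a"})
apply_increment(table, IncrementType.UPDATE, "r1", {"columnfamily1:column1": "b"})
apply_increment(table, IncrementType.ADD, "r2", {"columnfamily1:column1": "c"})
apply_increment(table, IncrementType.DELETE, "r2")
print(table.rows)
```

Update and delete both address the row purely by its rowkey, which is the property the method relies on to avoid row-level modification inside Hive itself.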
In one embodiment, synchronizing the Hive increment data according to the first mapping table, the second mapping table and the updated Hbase comprises: storing the data written into the Hbase into the HDFS; and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
For data updated in the Hbase, the Hbase writes the data to the distributed storage HDFS, so that when the data is viewed from Hive, the changed data is seen according to the first mapping table and the second mapping table, and operations such as querying and statistics can be performed on it. In short, Hive and Hbase actually share the same data file stored on the HDFS: when data is added, updated or deleted in Hbase, the change can be observed in Hive, and whenever the Hbase data changes, the data mapped by Hive changes as well. A single record can therefore be added or modified through Hbase, and querying and statistics over the Hbase data can then be performed using the association table in Hive.
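The effect of sharing one copy of the data can be sketched as follows. A single dict plays the role of the shared data file, and `hive_count_by` is a hypothetical Hive-side statistic; neither is part of Hive or Hbase.

```python
from collections import Counter

# Hypothetical shared storage: one dict plays the role of the data files
# on HDFS that both the HBase table and the Hive mapping table point to.
shared_rows = {
    "r1": {"columnfamily1:column1": "x"},
    "r2": {"columnfamily1:column1": "y"},
}


def hive_count_by(column_target):
    """Hive-side statistic: count rows per value of one mapped column."""
    return Counter(cells.get(column_target) for cells in shared_rows.values())


before = hive_count_by("columnfamily1:column1")

# HBase-side changes: add one row by rowkey, delete another.
shared_rows["r3"] = {"columnfamily1:column1": "x"}
del shared_rows["r2"]

after = hive_count_by("columnfamily1:column1")
print(dict(before), dict(after))
```

The statistic changes without any explicit synchronization call, because the query side and the write side read and modify the same rows.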
In one embodiment, synchronizing the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase comprises: identifying the updated field name and field type in the updated Hbase; and adjusting the metadata corresponding to the updated field name and field type according to the first mapping table and the second mapping table, and synchronizing the Hive incremental data.
The field names and field types of the first mapping table and the second mapping table are associated with each other. Based on this association, metadata can be adjusted in both Hbase and Hive so that the data conforms to the format requirements of the respective database. In practice, when incremental data is written into Hbase, the Hbase updates the field names and field types, and metadata adjustment is performed so that the written data meets the Hbase format requirements. During synchronization, metadata adjustment is performed again on the field names and field types updated by Hbase, according to the field name and field type association between the first mapping table and the second mapping table, so that the data meets the Hive format requirements and the Hive incremental data is synchronized.
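One possible reading of this format adjustment is a type conversion driven by the field association. The sketch below assumes that cell values are stored as UTF-8 encoded bytes and that the association records a Hive type per column; the `int` and `double` columns, `FIELD_TYPES`, `CONVERTERS` and `to_hive_record` are all hypothetical and chosen only to make the conversion visible.

```python
# Hypothetical field association: Hive column -> (HBase cell, Hive type).
FIELD_TYPES = {
    "column1": ("columnfamily1:column1", "string"),
    "column2": ("columnfamily1:column2", "int"),
    "column3": ("columnfamily2:column3", "double"),
}

# Converters from stored bytes to a Python value, keyed by Hive type name.
# Assumption: values are stored as UTF-8 text.
CONVERTERS = {
    "string": lambda raw: raw.decode("utf-8"),
    "int": lambda raw: int(raw.decode("utf-8")),
    "double": lambda raw: float(raw.decode("utf-8")),
}


def to_hive_record(rowkey, cells):
    """Convert one HBase row (bytes values) into a typed Hive-style record."""
    record = {"rowkey": rowkey}
    for hive_col, (target, hive_type) in FIELD_TYPES.items():
        raw = cells.get(target)
        record[hive_col] = None if raw is None else CONVERTERS[hive_type](raw)
    return record


record = to_hive_record(
    "r1",
    {"columnfamily1:column1": b"abc", "columnfamily1:column2": b"42"},
)
print(record)
```

A cell that is absent on the Hbase side is surfaced as a missing value rather than an error, since rows in a column-family store need not populate every column.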
In one application example, assume that a piece of data a currently needs to be written to Hive. The Hive incremental data synchronization method first determines the Hbase that shares underlying data with Hive and creates a hivetable table in Hive whose attribute is an Hbase mapping table, whereupon an Hbase table is automatically generated in the Hbase; the field names and field types of the hivetable table and the Hbase table are associated. Data a is then written into the Hbase in the Hbase database format using an Hbase command and stored in the underlying data of the Hbase. When Hive is queried, the requested data is looked up in the Hbase underlying data, and metadata adjustment (that is, database format adjustment) is performed on it according to the Hbase table and the hivetable table to obtain the query result, thereby achieving Hive incremental synchronization.
It should be understood that although the steps in the flow charts of fig. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2-3 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
To further explain the technical solution and effect of the Hive incremental data synchronization method of the present application, the whole process is described in detail below using a specific example together with the command statements corresponding to the implementation. In practical application, the method can comprise the following parts:
1. Establishing a Hive-to-Hbase association table.
Because Hive and Hbase are both Hadoop big data components, they can already communicate with each other at the underlying layer. To create the mapping table, a Hive shell command or the Hive API is used. After a mapping table is created in Hive, Hive in turn calls the API of Hbase and creates an associated table in Hbase, which can be completed automatically under the control of Hive. For example, the following commands create an association table of Hive and Hbase:
Table-building statement: create external table hivetable (rowkey string, column1 string, column2 string, column3 string). This statement creates a table named hivetable in Hive, and the table mapped to it in Hbase is named hbasetable. The mapping relationship is as follows: rowkey (the rowkey field in Hive is associated with the row key in Hbase); column1 (column1 in hivetable maps to the column1 field on column family1 in hbasetable); column2 (column2 in hivetable maps to the column2 field on column family1 in hbasetable); column3 (column3 in hivetable maps to the column3 field on column family2 in hbasetable).
Integration function statement: stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties ("hbase.columns.mapping" = ":key,columnfamily1:column1,columnfamily1:column2,columnfamily2:column3"). The Hive and Hbase integration function (intercommunication) is realized mainly through the hive_hbase-handler.
2. Passing the incremental data through the API of the Hbase: if the incremental data is new data, it is added to the Hbase; if it is updated data, the data is updated by Hbase rowkey; if it is deleted data, the data is deleted by rowkey.
3. Querying the data from Hive: the queried data reflects the changes, which realizes the incremental synchronization scenario. The underlying data of Hive and Hbase is actually a shared data file stored on the HDFS, so when data is added, updated or deleted in Hbase, the change can be observed in Hive.
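The table-building and integration function statements of part 1 can also be assembled programmatically, for example when the server generates the command to send to Hive. The sketch below only renders the statement text and does not execute it; `build_mapping_ddl` is a hypothetical helper, and the column family names are assumed to be `columnfamily1` and `columnfamily2`.

```python
def build_mapping_ddl(hive_table, columns, targets):
    """Render a Hive DDL string for an HBase-backed mapping table.

    `columns` is a list of (name, hive_type); `targets` lists the HBase
    target for each column in the same order (":key" or "family:qualifier").
    """
    if len(columns) != len(targets):
        raise ValueError("columns and targets must have the same length")
    column_sql = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    mapping = ",".join(targets)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_sql}) "
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
        f'WITH SERDEPROPERTIES ("hbase.columns.mapping" = "{mapping}")'
    )


ddl = build_mapping_ddl(
    "hivetable",
    [("rowkey", "string"), ("column1", "string"),
     ("column2", "string"), ("column3", "string")],
    [":key", "columnfamily1:column1", "columnfamily1:column2",
     "columnfamily2:column3"],
)
print(ddl)
```

Generating the column list and the mapping string from the same ordered inputs keeps the two in step, since the mapping entries are matched to the Hive columns by position.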
As shown in fig. 4, the present application further provides a Hive incremental data synchronization apparatus, which includes:
a database determining module 200, configured to determine an Hbase corresponding to Hive;
an association table obtaining module 400, configured to obtain a first mapping table in Hive and a second mapping table in Hbase, where the first mapping table and the second mapping table are associated with each other in field name and field type;
an increment module 600, configured to write incremental data into the Hbase through an API in the Hbase, where the incremental data is data that needs to be added incrementally to Hive;
and a synchronization module 800, configured to synchronize the Hive incremental data according to the first mapping table, the second mapping table, and the updated Hbase.
The Hive incremental data synchronization device determines the Hbase corresponding to Hive; obtains an association table from Hive to Hbase; writes the incremental data into the Hbase through an API in the Hbase to update the data in the Hbase; and synchronizes the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase. Throughout the process, incremental data are recorded in Hbase and synchronized to Hive on the basis of the Hive-to-Hbase association table. This avoids the limitations of operating directly in Hive and, by relying on the association mapping between Hbase and Hive, achieves efficient incremental data synchronization in Hive.
In one embodiment, the association table obtaining module 400 is further configured to create a first mapping table in Hive; the Hbase API is called, and a second mapping table associated with the first mapping table is created in the Hbase.
In one embodiment, the association table obtaining module 400 is further configured to create a hivetable table in Hive and an hbasetable table in Hbase; to associate and map the table names of the hivetable table and the hbasetable table, the rowkey field in the hivetable table and the row key in the hbasetable table, and the column strings in the hivetable table and the column strings in the hbasetable table; and to establish communication between Hive and Hbase.
In one embodiment, the increment module 600 is further configured to identify the type of the incremental data, and to call the API (application programming interface) in the Hbase that corresponds to the type of the incremental data to write the incremental data into the Hbase underlying data.
In one embodiment, the increment module 600 is further configured to call an API with a data addition function in the Hbase to add the incremental data if the type of the incremental data is newly added; if the type of the incremental data is updating, updating the incremental data through Hbase rowkey; and if the incremental data type is deletion, deleting by Hbase rowkey.
In one embodiment, the synchronization module 800 is further configured to store the updated data in Hbase to the HDFS; and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
In one embodiment, the synchronization module 800 is further configured to identify the updated field name and field type in the updated Hbase, and to adjust the metadata corresponding to the updated field name and field type according to the first mapping table and the second mapping table, so as to synchronize the Hive incremental data.
For specific limitations of the Hive incremental data synchronization device, reference may be made to the above limitations on the Hive incremental data synchronization method, which are not repeated here. The modules in the Hive incremental data synchronization device can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data such as Hbase data and a preset association table. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a Hive incremental data synchronization method.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
determining Hbase corresponding to Hive;
acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the first mapping table and the second mapping table are related to each other in field names and field types;
writing incremental data into the Hbase through an API (application programming interface) in the Hbase, wherein the incremental data are data that need to be incrementally added to Hive;
and synchronizing the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
creating a first mapping table in Hive; and calling the Hbase API to create, in the Hbase, a second mapping table associated with the first mapping table.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
creating a hivetable table in Hive, and creating an hbasetable table in Hbase; associating and mapping the table names of the hivetable table and the hbasetable table, the rowkey field in the hivetable table and the row key in the hbasetable table, and the column strings in the hivetable table and the column strings in the hbasetable table; and establishing communication between Hive and Hbase.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
identifying the type of the incremental data; and calling the API (application programming interface) in the Hbase corresponding to the type of the incremental data to write the incremental data into the underlying Hbase data.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
if the type of the incremental data is addition, calling an API (application programming interface) with a data-adding function in Hbase to add the incremental data; if the type of the incremental data is update, updating the data by Hbase rowkey; and if the type of the incremental data is deletion, deleting the data by Hbase rowkey.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
storing the data written into the Hbase into the HDFS; and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
In one embodiment, the processor, when executing the computer program, further performs the steps of: identifying the updated field name and field type in the updated Hbase; and adjusting the updated field name and the metadata corresponding to the field type according to the first mapping table and the second mapping table, and synchronizing the Hive incremental data.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
determining Hbase corresponding to Hive;
acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the first mapping table and the second mapping table are related to each other in field names and field types;
writing incremental data into the Hbase through an API (application programming interface) in the Hbase, wherein the incremental data are data that need to be incrementally added to Hive;
and synchronizing the Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
In one embodiment, the computer program when executed by the processor further performs the steps of:
creating a first mapping table in Hive; and calling the Hbase API to create, in the Hbase, a second mapping table associated with the first mapping table.
In one embodiment, the computer program when executed by the processor further performs the steps of:
creating a hivetable table in Hive, and creating an hbasetable table in Hbase; associating and mapping the table names of the hivetable table and the hbasetable table, the rowkey field in the hivetable table and the row key in the hbasetable table, and the column strings in the hivetable table and the column strings in the hbasetable table; and establishing communication between Hive and Hbase.
In one embodiment, the computer program when executed by the processor further performs the steps of:
identifying the type of the incremental data; and calling the API (application programming interface) in the Hbase corresponding to the type of the incremental data to write the incremental data into the underlying Hbase data.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if the type of the incremental data is addition, calling an API (application programming interface) with a data-adding function in Hbase to add the incremental data; if the type of the incremental data is update, updating the data by Hbase rowkey; and if the type of the incremental data is deletion, deleting the data by Hbase rowkey.
In one embodiment, the computer program when executed by the processor further performs the steps of:
storing the data written into the Hbase into the HDFS; and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
In one embodiment, the computer program when executed by the processor further performs the steps of:
identifying the updated field name and field type in the updated Hbase; and adjusting the updated field name and the metadata corresponding to the field type according to the first mapping table and the second mapping table, and synchronizing the Hive incremental data.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of these technical features should be considered to fall within the scope of the present specification as long as the combination contains no contradiction.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A Hive incremental data synchronization method, comprising:
determining Hbase corresponding to Hive;
acquiring a first mapping table in the Hive and a second mapping table in the Hbase, wherein the first mapping table and the second mapping table are associated with each other in field names and field types;
writing incremental data into the Hbase through an API in the Hbase, wherein the incremental data is data which needs to be incremented into the Hive;
and synchronizing Hive incremental data according to the first mapping table, the second mapping table and the updated Hbase.
2. The method of claim 1, wherein the obtaining the first mapping table in the Hive and the second mapping table in the Hbase comprises:
creating a first mapping table in Hive;
calling the Hbase API, and creating a second mapping table in the Hbase, wherein the second mapping table is associated with the first mapping table.
3. The method of claim 1, wherein the obtaining the first mapping table in the Hive and the second mapping table in the Hbase comprises:
creating a hivetable in Hive, and creating a hbasetable in Hbase;
associating and mapping the table names of the hivetable table and the hbasetable table, the rowkey field in the hivetable table and the row key in the hbasetable table, and the column strings in the hivetable table and the column strings in the hbasetable table;
and establishing communication between the Hive and the Hbase.
4. The method of claim 1, wherein prior to writing incremental data into the Hbase through the API in the Hbase, the method further comprises:
identifying a type of the incremental data;
the writing incremental data to the Hbase via the API in the Hbase comprises:
calling the API (application programming interface) in the Hbase corresponding to the type of the incremental data to write the incremental data into the underlying Hbase data.
5. The method of claim 4, wherein said calling the API in the Hbase corresponding to the type of the incremental data to write the incremental data into the underlying Hbase data comprises:
if the type of the incremental data is newly added, calling an API (application programming interface) with a data adding function in the Hbase, and adding the incremental data;
if the type of the incremental data is updating, updating the incremental data through Hbase rowkey;
and if the incremental data type is deletion, deleting by Hbase rowkey.
6. The method of claim 1, wherein synchronizing Hive incremental data according to the first mapping table, the second mapping table, and the updated Hbase comprises:
storing the data written into the Hbase into an HDFS;
and accessing the HDFS according to the first mapping table and the second mapping table, and synchronizing Hive query data.
7. The method of claim 1, wherein synchronizing Hive incremental data according to the first mapping table, the second mapping table, and the updated Hbase comprises:
identifying the updated field name and field type in the updated Hbase;
and adjusting the updated field name and the metadata corresponding to the field type according to the first mapping table and the second mapping table, and synchronizing Hive incremental data.
8. A Hive incremental data synchronization apparatus, comprising:
the database determination module is used for determining Hbase corresponding to Hive;
an association table obtaining module, configured to obtain a first mapping table in the Hive and a second mapping table in the Hbase, where the first mapping table and the second mapping table are associated with each other in a field name and a field type;
an increment module, configured to write incremental data into the Hbase through an API (application programming interface) in the Hbase, where the incremental data is data that needs to be incrementally added to the Hive;
and a synchronization module, configured to synchronize the Hive incremental data according to the first mapping table, the second mapping table, and the updated Hbase.
9. A computer device comprising at least one processor, at least one memory, and a bus; the processor and the memory complete mutual communication through the bus; the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911045304.9A CN112749226A (en) | 2019-10-30 | 2019-10-30 | Hive incremental data synchronization method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112749226A true CN112749226A (en) | 2021-05-04 |
Family
ID=75640597
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911045304.9A Pending CN112749226A (en) | 2019-10-30 | 2019-10-30 | Hive incremental data synchronization method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112749226A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201403929D0 (en) * | 2013-03-13 | 2014-04-23 | Cloudera Inc | Low Latency query engine for apache hadoop |
CN104298760A (en) * | 2014-10-23 | 2015-01-21 | 北京京东尚科信息技术有限公司 | Data processing method and data processing device applied to data warehouse |
US20170075965A1 (en) * | 2015-09-16 | 2017-03-16 | Turn Inc. | Table level distributed database system for big data storage and query |
CN107203594A (en) * | 2017-04-28 | 2017-09-26 | 努比亚技术有限公司 | A kind of data processing equipment, method and computer-readable recording medium |
Non-Patent Citations (1)
Title |
---|
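汤羽; 王英杰; 范爱华; 姚远哲: "mDHT algorithm for massive data retrieval based on the HDFS open-source architecture and multi-level index tables", 计算机科学 (Computer Science), no. 02 * |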
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |