CN115794843A - Hive-based data updating method and device and readable storage medium - Google Patents

Hive-based data updating method and device and readable storage medium

Info

Publication number
CN115794843A
Authority
CN
China
Prior art keywords
data
data table
updating
record
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211383394.4A
Other languages
Chinese (zh)
Inventor
徐烨
侯鑫磊
刘晓龙
刘云飞
牛佩云
周帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202211383394.4A
Publication of CN115794843A
Legal status: Pending

Abstract

The application discloses a Hive-based data updating method, a Hive-based data updating device, and a readable storage medium. After data records in a target data table are updated, version information of the target data table is recorded. When the version information meets a preset condition, the valid data records that have not been deleted are obtained, the unique identifier (UUID) and latest operation version number of each valid data record are queried, the target data to be retained is determined, the operation version numbers are reset, a temporary data table is generated to replace the historical data in the target data table, and the version information of the target data table is reset. Because the version information of the data table is monitored in real time as update operations are performed, and redundant data is automatically cleared and the version information initialized once the preset condition is met, the computational complexity stays within a reasonable range even when data is updated continuously over a long period, so data processing efficiency is maintained while data updating is supported.

Description

Hive-based data updating method and device and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a Hive-based data updating method, apparatus, and readable storage medium.
Background
In the era of rapidly developing big data, the data warehouse tool Hive is often used to solve statistics problems over massive structured logs because it supports large-scale data processing. Hive is a data warehouse analysis system built on Hadoop that provides rich SQL query capabilities for analyzing data stored in the Hadoop Distributed File System (HDFS). Hive has a significant advantage in processing large amounts of data because it runs on a cluster and can compute in parallel using MapReduce.
However, because Hive is based on HDFS, and HDFS does not support modifying existing data files, Hive itself has poor support for data update operations such as update and delete. When a certain row of data needs to be modified, the whole table or the whole partition has to be deleted and regenerated, so the cost of modifying data is very high. To reduce this cost and avoid deleting existing data, existing Hive data modification methods manage each piece of data in Hive in groups and introduce a warehousing timestamp, so that the most recently warehoused data in a group is retrieved as the current value. However, after frequent updates over a long period of operation, the redundant data in each group keeps growing, so the computational complexity of retrieving the latest data keeps increasing and computational efficiency drops.
Disclosure of Invention
In view of this, the present application provides a Hive-based data updating method, device, and readable storage medium, which address the problem that, after data in Hive has been continuously updated for a long time, the redundant data stored in Hive grows, increasing computational complexity and reducing computational efficiency.
In order to achieve the above object, the following solutions are proposed:
a Hive-based data updating method comprises the following steps:
receiving a data updating operation instruction, and determining a target data record to be updated in a target data table, wherein the data updating operation instruction is a data modifying instruction or a data deleting instruction;
updating the target data record according to the data updating operation instruction, and updating the version information of the target data table, wherein the version information is used for recording the updating times or updating information of the target data table;
judging whether the updated version information meets a preset condition or not;
if the updated version information meets the preset condition, querying the UUID and latest operation version number of each valid data record in the target data table, wherein the UUID is used for uniquely identifying one valid data record, the latest operation version number is used for identifying the updating times of the valid data record, and the valid data record is a data record on which the data deleting instruction has not been executed;
acquiring target data corresponding to the latest operation version number of the valid data record, setting an initial operation version number for the target data, and generating a temporary data table, wherein the temporary data table comprises the target data, the UUID and the initial operation version number;
and clearing the target data table, inserting the temporary data table into the target data table, and initializing the version information of the target data table.
Optionally, before the determining whether the updated version information meets the preset condition, the method further includes:
receiving a newly added data instruction, and acquiring a newly added data record;
determining the UUID value and the initial operation version number of the newly added data record;
and executing the new data instruction, recording the new data record, the UUID value and the initial operation version number in the target data table, and updating the version information of the target data table.
Optionally, the method further comprises:
receiving a query data instruction, and determining a valid data set, wherein the valid data set contains data information corresponding to the latest operation version number of each record in the target data table, and the query data instruction comprises a query condition;
and executing the query data instruction, searching the valid data set for data information meeting the query condition, obtaining a target set and outputting the target set.
Optionally, the updating the target data record according to the data updating operation instruction includes:
when the data updating operation instruction is a data modification instruction, executing the data modification instruction to obtain a modified updating data record;
recording the updated data record in the target data table;
determining and recording the UUID and the operation version number of the updated data record;
or,
and when the data updating operation instruction is a data deleting instruction, setting the operation version number of the target data record as a preset value.
Optionally, the method further comprises:
inquiring whether a first operation version number contains the preset value, wherein the first operation version number is a historical operation version number of a data record in the target data table after the updating operation is completed;
and if the first operation version number does not contain the preset value, determining the data record as the valid data record.
Optionally, the determining whether the updated version information meets a preset condition includes:
judging whether the number of times of the completed updating operation of the target data table reaches a first preset threshold value or not;
and/or,
judging whether the total number of the data records in the target data table reaches a second preset threshold value or not;
and/or,
and judging whether the time recorded at the current moment reaches preset time, wherein the moment when the version information of the target data table is initialized at the last time is a timing zero point.
Optionally, the method further comprises:
and recording the version information of the target data table in a management record table, wherein the management record table is used for carrying out version management on the data table in the Hive library.
A Hive-based data update apparatus, comprising:
the data record query unit is used for receiving a data updating operation instruction and determining a target data record to be updated in a target data table, wherein the data updating operation instruction is a data modification instruction or a data deletion instruction;
the data record updating unit is used for updating the target data record according to the data updating operation instruction and updating the version information of the target data table, wherein the version information is used for recording the updating times or updating information of the target data table;
the first judging unit is used for judging whether the updated version information meets the preset condition or not;
an update information query unit, configured to query, when the updated version information meets a preset condition, a UUID and a latest operation version number of each valid data record in the target data table, where the UUID is used to uniquely identify one valid data record, the latest operation version number is used to identify the update times of the valid data record, and the valid data record is a data record in which the delete data instruction is not executed;
a temporary data table generating unit, configured to obtain target data corresponding to a latest operation version number of the valid data record, set an initial operation version number for the target data, and generate a temporary data table, where the temporary data table includes the target data, the UUID, and the initial operation version number;
and the target data table resetting unit is used for emptying the target data table, inserting the temporary data table into the target data table and initializing the version information of the target data table.
Optionally, before the first judging unit, the apparatus further includes:
the data record acquisition unit is used for receiving a newly added data instruction and acquiring a newly added data record;
the updating identification setting unit is used for determining the UUID value and the initial operation version number of the newly added data record;
and the data record recording unit is used for executing the newly increased data instruction, recording the newly increased data record, the UUID value and the initial operation version number in the target data table, and updating the version information of the target data table.
Optionally, the apparatus further comprises:
the valid data query unit is used for receiving a query data instruction and determining a valid data set, wherein the valid data set contains data information corresponding to the latest operation version number of each record in the target data table, and the query data instruction comprises a query condition;
and the target set query unit is used for executing the query data instruction, searching the valid data set for data information meeting the query condition, obtaining a target set and outputting the target set.
A readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of a Hive-based data updating method as described above.
It can be seen from the foregoing technical solutions that, in the Hive-based data updating method provided in the embodiments of the application, a target data record to be updated in a target data table is updated according to a received data updating operation instruction, and version information recording the update times or update information of the target data table is updated. When the updated version information meets a preset condition, the valid data records in the target data table that have not been deleted are obtained, and the unique identifier (UUID) and latest operation version number of each valid data record are queried. The target data to be retained in the target data table is then determined from the latest operation version number, an initial operation version number is set for the target data, and a temporary data table is generated. Finally, the target data table is emptied, the temporary data table is inserted into it, and the version information of the target data table is initialized, which completes the cleaning and resetting of the target data table. With this scheme, the version information of the data table is monitored in real time after each data updating operation, and once the preset condition is met, redundant data in the data table is automatically cleared and the version information is initialized. Even when data is continuously updated for a long time, the computational complexity remains within a reasonable range, so data processing efficiency is maintained while data updating is realized.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a Hive-based data updating method according to an embodiment of the present disclosure;
FIGS. 2-4 are schematic diagrams illustrating data update results of several Hive data tables;
Fig. 5 is a schematic structural diagram of a Hive-based data updating apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Investigation shows that Hive is currently a mainstream tool for analyzing and processing massive data. Because Hive is based on the HDFS file system, which does not support rewriting data, a data update has to be realized by changing the whole data table, and the modification cost is very high. The existing method retrieves the most recently warehoused data in a group of data as the latest data after modification, so a certain row of data can be modified without deleting and regenerating the whole data table, which reduces the modification cost.
The inventors found that, after long-running operation with frequent updates, the prior art keeps entering new data without cleaning the historical data in the data table, so the amount of data in the data table keeps growing and each group accumulates excessive redundant data, which increases computational complexity and reduces computational efficiency.
In view of this, the inventor proposes a Hive-based data updating method, and fig. 1 shows a flowchart of the Hive-based data updating method provided in an embodiment of the present application, and as shown in fig. 1, the flowchart may include:
Step S10, receiving a data updating operation instruction, and determining a target data record to be updated in a target data table.
The data updating operation instruction can be a data modification instruction or a data deletion instruction, the data modification instruction indicates that data of the target data record is modified in the target data table, and the data deletion instruction indicates that all historical data of the target data record is deleted in the target data table.
Specifically, the data update operation instruction may include a filtering condition of the target data record to be updated, and when the data update operation instruction is received, the data record satisfying the filtering condition may be searched in the target data table, and is determined as the target data record. Alternatively, the data record set may also be obtained as the target data record by pre-screening the data record set that needs to be updated.
Step S11, updating the target data record according to the data updating operation instruction, and updating the version information of the target data table.
Optionally, when the received data updating operation instruction is a data modification instruction, a data modification operation is started. Specifically, the data modification instruction may include a data modification command, for example, adding 1 to the data in the target data record. The modified data record is obtained by executing the data modification instruction, and the modified data record and the update identifier of the current update are recorded in the target data table, thereby modifying the target data record in the target data table. When the received data updating operation instruction is a data deletion instruction, a data deletion operation is started: after the target data record to be deleted is determined, the update identifier of the target data record may be set to a deletion identifier by executing the data deletion instruction, thereby deleting the target data record.
The update identifier of the data record may include a UUID and an operation version number, where the UUID is used to uniquely identify one data record, and the operation version number is used to identify the update times or update status of one data record.
In an optional case, for a newly entered data record, a UUID value of the newly entered data record may be calculated when the data record is entered and recorded in the data table, and an initial operation version number of the newly entered data record may be a preset initial value, which indicates that an update operation has not been performed on the data record; for the modified data record, the UUID value is not changed, and the UUID value can be used as the unique UUID value of the data record, and the operation version number can be a value obtained by adding 1 to the maximum operation version number of the data record, and is used for recording the number of times of modification of the data record; for a deleted data record, the UUID value is unchanged, the operation version number may be a preset value, and the preset value may be a deletion identifier, which indicates that the update state of the data record is a deleted state.
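The three cases above can be illustrated with a minimal Python sketch that models the target data table as an append-only list of rows. This is an illustration only, not the implementation of the application: the function names, the row layout, the initial value 0, and the use of -1 as the preset deletion identifier are all assumptions, and the deletion is modeled here by appending a marker row because existing rows cannot be rewritten.

```python
import uuid

INITIAL_VERSION = 0  # assumed preset initial value (record never updated)
DELETED = -1         # assumed preset value acting as the deletion identifier


def insert_record(table, data):
    """Append a new record with a fresh UUID and the initial version number."""
    uid = str(uuid.uuid4())
    table.append({"uuid": uid, "version": INITIAL_VERSION, "data": data})
    return uid


def modify_record(table, uid, new_data):
    """Append a modified copy: same UUID, version = current maximum + 1."""
    max_version = max(r["version"] for r in table if r["uuid"] == uid)
    table.append({"uuid": uid, "version": max_version + 1, "data": new_data})


def delete_record(table, uid):
    """Append a marker row whose version number is the deletion identifier."""
    table.append({"uuid": uid, "version": DELETED, "data": None})
```

In this sketch no existing row is ever rewritten, which mirrors the constraint that HDFS files cannot be modified in place; every historical version of a record stays in the table until a cleaning operation runs.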
Further optionally, the version information of the target data table may be the number of completed update operations on the target data table, in which case the number is increased by 1 each time an update is completed. The version information may also be the number of all historical records in the target data table, in which case, after the data updating operation is finished, the number of target data records updated this time is added to the value before the update. The version information may also be a recorded time; preferably, timing starts from the moment the version information of the target data table was last reset, and the time elapsed from that moment to the current moment is recorded. In any of these ways, the version information of the target data table can be updated.
Step S12, judging whether the updated version information meets a preset condition.
Optionally, the preset condition may be one or more of the following update conditions:
(1) The number of completed update operations on the target data table reaches a first preset threshold.
Specifically, it has been described in step S11 that the version information of the target data table is updated immediately after the update of the data record in the target data table is completed, and the version information may include the number of update operations of the target data table, and each time the update of the target data table is completed, the number of update operations is added by 1, so as to determine whether the number of update operations reaches the first preset threshold, and if so, the preset condition is met.
Optionally, the first preset threshold may be configured according to cluster resources and data usage frequency. When the number of updates reaches the first preset threshold, it indicates that, after that many data update operations, the computational efficiency during operation may fall short of the ideal value because data has been stored continuously; at this point, the historical data of the target data table needs to be cleaned.
(2) The total number of data records in the target data table reaches a second preset threshold.
Specifically, if the version information includes the number of all the history records in the target data table, after the target data table is updated each time, the total number of the data records entered in the target data table is updated, whether the total number reaches a second preset threshold value is judged, and if yes, a preset condition is met.
Optionally, the second preset threshold may likewise be configured according to cluster resources and data usage frequency. When the total number of data records reaches the second preset threshold, the data volume stored in the target data table exceeds a reasonable range; because the proportion of redundant data grows with the number of updates, the redundant data in the target data table needs to be cleaned at this point.
(3) The time recorded at the current moment reaches the preset time.
The time recorded at the current moment may be the time obtained at the current moment when the time of last initializing the version information of the target data table is used as the zero point of timing to start timing. Specifically, if the version information includes the recorded time, when the update of the target data table is completed each time, the timing time of the current time is recorded, whether the timing time recorded at the current time reaches the preset time is judged, and if the timing time reaches the preset time, the preset condition is met.
Optionally, the preset time may be configured according to the data update frequency and the data usage frequency of the target data table. Further optionally, the recorded time is monitored in real time, and the data table is automatically cleaned once the recorded time reaches the preset time. By recording the time elapsed since the version information of the target data table was last initialized and comparing it with the preset time, the historical data of the target data table is cleaned periodically according to the preset time.
Preferably, in order to ensure that the cleaning operation does not affect the data updating process, the monitoring of the recording time may be performed from the moment when the updating is finished, if the preset time is reached in the data updating process, the cleaning operation may not be triggered, after the updating is finished, the judgment of whether the version information meets the preset condition may be performed, and the cleaning operation is triggered by determining that the time recorded at the current moment reaches the preset time, so as to implement the real-time monitoring of the target data table and the cleaning of the historical data in an idle time.
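The judgment in step S12 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the `VersionInfo` class, the `needs_cleanup` function, and the default threshold values are hypothetical and not taken from the application, and the three conditions are combined here with a logical OR, which is one of the combinations the "one or more" wording permits.

```python
import time
from dataclasses import dataclass, field


@dataclass
class VersionInfo:
    """Hypothetical version information kept for one target data table."""
    update_count: int = 0   # completed update operations since last reset
    record_count: int = 0   # total rows currently stored in the table
    last_reset: float = field(default_factory=time.time)  # timing zero point


def needs_cleanup(info, max_updates=1000, max_records=1_000_000,
                  max_age_seconds=86_400.0, now=None):
    """Return True when any of the three preset conditions is met."""
    now = time.time() if now is None else now
    return (
        info.update_count >= max_updates               # condition (1)
        or info.record_count >= max_records            # condition (2)
        or now - info.last_reset >= max_age_seconds    # condition (3)
    )
```

Calling `needs_cleanup` only after an update has finished, rather than from a background timer, corresponds to the preferred variant above in which cleaning never interrupts an update in progress.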
Step S13, if the updated version information meets the preset condition, querying the UUID and the latest operation version number of each valid data record in the target data table.
The valid data record of the target data table may be a data record on which a delete data instruction has not been executed; optionally, whether a data record is valid may be determined by whether the deletion identifier exists for that data record. The latest operation version number may be used to identify the number of times the valid data record has been modified.
When the updated version information satisfies the preset condition, indicating that the target data table satisfies the condition for performing the cleaning operation, the data cleaning may be started to be performed.
First, the valid data records are determined, where the valid data records are the data records that need to be retained when the data table is cleaned. Specifically, it may be judged whether a preset value or deletion identifier exists among the historical operation version numbers of a data record; if so, the data record has been deleted and is an invalid data record, and if not, the data record is a valid data record.
The UUID value of each valid data record is then queried. Optionally, since the data records correspond to the UUID values one by one, and each data record corresponds to one UUID value, it may be determined whether the UUID value is the UUID of a valid data record by determining whether a preset value exists in the operation version number corresponding to the UUID value.
Finally, the latest operation version number of each valid data record is queried, where the latest operation version number may be the maximum of the operation version numbers corresponding to the UUID value of the valid data record. Optionally, the UUID of each valid data record corresponds to one latest operation version number, which is used to determine the latest data of the valid data record; the latest data may be the most recently modified data or the newly entered data.
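The three queries of step S13 can be sketched in Python over an in-memory list of rows. The row layout, the function name `latest_versions`, and the value -1 for the deletion identifier are assumptions made for illustration; in Hive the same logic would typically be expressed as a grouped query.

```python
DELETED = -1  # assumed preset value acting as the deletion identifier


def latest_versions(rows):
    """Map the UUID of each valid record to its latest operation version.

    A record is invalid if the deletion identifier appears anywhere in its
    version history; for valid records the latest version is the maximum.
    """
    deleted = {r["uuid"] for r in rows if r["version"] == DELETED}
    latest = {}
    for r in rows:
        uid = r["uuid"]
        if uid in deleted:
            continue  # skip every historical row of a deleted record
        if uid not in latest or r["version"] > latest[uid]:
            latest[uid] = r["version"]
    return latest
```

For example, given rows for record "a" with versions 0 and 1 and rows for record "b" with versions 0 and -1, the function returns only the entry for "a", with latest version 1.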
Step S14, acquiring target data corresponding to the latest operation version number of the valid data record, setting an initial operation version number for the target data, and generating a temporary data table, wherein the temporary data table comprises the target data, the UUID and the initial operation version number.
Specifically, each valid data record corresponds to a latest operation version number, and optionally, by determining the latest operation version number of each valid data record in step S13 and further querying data corresponding to the latest operation version number, the latest entered data of each valid data record may be determined, and the target data to be retained is obtained.
Further alternatively, because Hive only supports updating and deleting operations in units of data tables, in order to change row-level data of the target data table, delete redundant parts, and only retain target data, the determined target data may be cached in a newly created data table, an initial operation version number is set in the new data table for each target data, and the UUID obtained in step S13 is recorded in the new data table, so as to generate a temporary data table. The target data and the UUID values are in one-to-one correspondence, the initial operation version number can be a preset initial value and used for identifying that the target data in the temporary data table is not subjected to updating operation, the target data, the corresponding UUID values and the initial operation version number can be recorded in the temporary data table at the same time, and an initial data table containing the target data is generated.
Step S15, emptying the target data table, inserting the temporary data table into the target data table, and initializing the version information of the target data table.
Specifically, in order to update the target data table, the entire content of the current target data table may be deleted, and the generated temporary data table is inserted into the emptied target data table, where the data in the target data table has been changed into the target data, and other history data is deleted, and the name of the target data table has not changed, which is equivalent to replacing the content in the target data table with the content in the temporary data table, and does not affect the data call operation.
Optionally, the version information of the target data table is initialized when the temporary data table is inserted. Specifically, the number of times of updating the target data table may be reset to an initial value, which is used to identify that the replaced target data table does not perform the updating operation, and at the same time, the number of data records of the target data table may also be reset to the number of target data records, and the time at the current time may also be reset to a zero point of timing, so as to start timing at the time when the temporary data table is inserted, and monitor the timing time.
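Steps S14 and S15 can be sketched together in Python as follows. This is a minimal in-memory illustration, not the implementation of the application: the function name `compact`, the dictionary used for version information, and the values 0 and -1 for the initial version number and the deletion identifier are all assumptions.

```python
import time

INITIAL_VERSION = 0  # assumed preset initial value
DELETED = -1         # assumed preset value acting as the deletion identifier


def compact(table, version_info):
    """Replace the table contents with the latest row of each valid record."""
    deleted = {r["uuid"] for r in table if r["version"] == DELETED}

    # Keep, per UUID, the row with the maximum operation version number.
    newest = {}
    for r in table:
        if r["uuid"] in deleted:
            continue
        current = newest.get(r["uuid"])
        if current is None or r["version"] > current["version"]:
            newest[r["uuid"]] = r

    # Step S14: build the temporary table with versions reset to the initial value.
    temp_table = [
        {"uuid": uid, "version": INITIAL_VERSION, "data": r["data"]}
        for uid, r in newest.items()
    ]

    # Step S15: empty the target table in place and insert the temporary table,
    # so the table object (its "name") is unchanged for callers.
    table.clear()
    table.extend(temp_table)

    # Initialize the version information of the target table.
    version_info["update_count"] = 0
    version_info["record_count"] = len(table)
    version_info["last_reset"] = time.time()
```

Mutating the list in place rather than rebinding it mirrors the point made above that the name of the target data table does not change, so data call operations that reference the table are unaffected.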
Further alternatively, a management record table may be created in advance for performing version management on all the data tables in Hive. The version information of the target data table is recorded in the management record table, so that the monitoring of the version information is realized, the data recorded in the management record table is updated in real time when the version information is updated or initialized every time, and the version management of the target data table is realized through visual data.
According to the embodiment of the application, a target data record to be updated in a target data table and the version information of the table are updated according to a received data updating operation instruction. When the updated version information meets a preset condition, the valid data records in the target data table that have not been deleted are obtained, and the unique identifier (UUID) and latest operation version number of each valid data record are queried. The target data to be retained in the target data table is then determined according to the latest operation version number, an initial operation version number is set for the target data, and a temporary data table is generated. The target data table is emptied, the temporary data table is inserted, and the version information of the target data table is initialized, so that the target data table is cleaned and reset. With this scheme, the version information of the data table is monitored in real time after each data updating operation, and once the preset condition is met, redundant data in the data table is automatically cleared and the version information is initialized. Even when data is continuously updated for a long time, the computational complexity remains within a reasonable range, so data processing efficiency is maintained while data updating is realized.
In some embodiments of the present application, considering that the data update process of the target data table may further include an operation of adding new data, this embodiment provides an optional implementation of that operation, which may specifically include the following steps:
S1, receiving a newly added data instruction and acquiring a newly added data record.
Specifically, since adding new data does not change any data record already entered in the target data table, no target data record needs to be determined, and steps S10 and S11 above need not be performed. The newly added data instruction may contain the data information to be entered; when the instruction is received, this data information is obtained and determined to be the newly added data record. Optionally, the newly added data record may be stored in a blank data table in advance, and when the newly added data instruction is received, the newly added data record may be obtained from the blank data table.
S2, determining the UUID value and the initial operation version number of the newly added data record.
Optionally, before a newly added data record is entered into the target data table, a unique UUID is calculated for each newly added data record, and the operation version number of each newly added data record is set to an initial value, which indicates that no data update operation has been performed on the data record. The UUID may be a group of randomly generated numbers, or may be calculated from data such as the current time, a counter, and a hardware identifier, where the hardware identifier may be the MAC address of the wireless network card.
S3, executing the newly added data instruction, recording the newly added data record, the UUID value and the initial operation version number in the target data table, and updating the version information of the target data table.
Optionally, the newly added data instruction is executed, and the obtained newly added data record, the corresponding UUID and the initial operation version number are entered into the target data table.
Further optionally, since the newly added data increases the total number of data records in the target data table, and the version information changes, after the operation of the newly added data is completed, the version information of the target data table may be updated. Specifically, the number of newly added data records may be added to the previous data amount, and meanwhile, 1 may be added to the number of updates of the target data table to obtain updated version information, and the operations of the above steps S12 to S15 may be further performed.
In this embodiment of the application, a newly added data instruction is received, the newly added data record is obtained and entered into the target data table, a corresponding UUID and initial operation version number are set for each newly added data record, and the version information of the target data table is updated. New data can thus be added to the target data table, which makes the data updating function more comprehensive and satisfies more Hive-based data updating requirements.
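Steps S1 to S3 of the new data operation can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the method of the application: the table is modelled as a list of dictionaries, the function name `add_records`, the field names `uuid`, `op_version`, `update_count` and `record_count`, and the sample ages are hypothetical, and the UUID is generated randomly (one of the options mentioned above).

```python
import uuid

INITIAL_OP_VERSION = 0  # assumed initial operation version number


def add_records(table, version_info, new_records):
    """Append new records, each with a fresh UUID and the initial version number."""
    for record in new_records:
        row = dict(record)
        row["uuid"] = str(uuid.uuid4())        # randomly generated unique identifier
        row["op_version"] = INITIAL_OP_VERSION  # no update performed on this record yet
        table.append(row)
    # Update the table-level version information: add the number of new
    # records to the previous data amount and add 1 to the update count.
    version_info["record_count"] += len(new_records)
    version_info["update_count"] += 1
    return table


table_a = []
info_a = {"update_count": 0, "record_count": 0}
add_records(table_a, info_a, [
    {"name": "Zhang San", "sex": "male", "age": 30},   # ages are made-up sample values
    {"name": "Li Si", "sex": "female", "age": 25},
    {"name": "Wang Wu", "sex": "male", "age": 40},
])
print(len(table_a), info_a)
```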
In some embodiments of the present application, considering that the data update process of the target data table may further include a query data operation, an alternative implementation manner of the query data operation is provided in this embodiment of the present application, which may specifically include the following steps:
S1, receiving a query data instruction and determining a valid data set.
Optionally, the valid data set contains the most recently entered data of each record and does not contain deleted data. Specifically, when a query data instruction is received, the latest operation version number of each record in the target data table is determined, and the data information corresponding to that version number is taken as the most recently entered data of the record, forming the valid data set. The method for determining the latest operation version number is described in detail in step S13 of the above embodiments and is not repeated here.
S2, executing the query data instruction, searching the valid data set for data information meeting the query condition, obtaining a target set and outputting the target set.
The query data instruction can contain query conditions, data information meeting the query conditions is screened out from the effective data set by executing the query data instruction to form a target set, and the target set is output as a query result.
In the embodiment of the application, the latest operation version number of each record is inquired in the target data table to determine the effective data set, the data information meeting the inquiry condition is determined in the effective data set, and the inquiry result is obtained, so that the data inquired in the target data table is the latest data information of each record, and the latest data inquiry can be realized in Hive. Because the data to be updated meeting the updating condition needs to be searched in the process of modifying and deleting the data, the implementation of the data query operation provides a basis for data updating.
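The query operation can be illustrated with a short Python sketch. It is a sketch under assumptions, not the implementation of the application: rows are dictionaries with hypothetical `uuid` and `op_version` fields, a deletion is assumed to be stored as an extra row whose operation version number equals the preset value -1, and the latest version of a record is assumed to be the row with the highest operation version number.

```python
DELETED = -1  # assumed preset value that marks a deleted record


def valid_data_set(table):
    """Return the latest row of every record that has not been deleted."""
    deleted = {row["uuid"] for row in table if row["op_version"] == DELETED}
    latest = {}
    for row in table:
        if row["uuid"] in deleted:
            continue  # any record carrying the deletion marker is skipped entirely
        seen = latest.get(row["uuid"])
        if seen is None or row["op_version"] > seen["op_version"]:
            latest[row["uuid"]] = row
    return list(latest.values())


def query(table, predicate):
    """Apply the query condition to the valid data set only."""
    return [row for row in valid_data_set(table) if predicate(row)]


table = [
    {"uuid": "u1", "op_version": 0, "sex": "male", "age": 30},
    {"uuid": "u2", "op_version": 0, "sex": "female", "age": 25},
    {"uuid": "u3", "op_version": 0, "sex": "male", "age": 40},
    {"uuid": "u1", "op_version": 1, "sex": "male", "age": 31},   # modified version of u1
    {"uuid": "u3", "op_version": -1, "sex": "male", "age": 40},  # u3 has been deleted
]
result = query(table, lambda row: row["sex"] == "male")
print(result)
```

In this example only the latest version of record `u1` is returned: the older version of `u1` is superseded and `u3` is excluded because it carries the deletion marker.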
Next, a data updating method based on Hive disclosed in this application will be further described with reference to the data updating results of several Hive data tables illustrated in fig. 2, fig. 3, and fig. 4.
Referring to FIG. 2, FIG. 2 illustrates a result presented after performing an add data operation on data Table A. Wherein, the three rows of data records in the data table a are all new data records.
When a newly added data instruction is received, the target data table is determined to be data table A, and the name, gender and age information of three persons is obtained and entered into data table A as three newly added data records. The UUIDs of the three data records of "Zhang San", "Li Si" and "Wang Wu" are then determined and recorded in data table A. Finally, the initial operation version number of each record is set to 0, and the version information of data table A is updated.
With further reference to FIG. 3, FIG. 3 illustrates a result presented after performing a modify data operation on data Table A. The update result shown in fig. 3 is obtained by modifying the data record in the data table a in fig. 2 as the existing record.
The received data updating operation instruction is the following statement:
update data table A set age = age + 1 where sex = male
Through a query data operation, the data records in data table A whose gender is male are screened out as the target data records to be modified, which yields the two data records corresponding to Zhang San and Wang Wu. The operation instruction is then executed to modify the ages of Zhang San and Wang Wu, and the operation version number of the modified data records is set to 1, giving the last two rows of data shown in fig. 3. These rows are recorded in data table A, which completes the data modification of data table A in fig. 2, and the version information of data table A is updated.
Specifically, the existing information items of all data to be updated are retrieved according to the query condition in the operation instruction; after the fields that need updating have been inserted, replaced or deleted, the update operation is converted into an insert operation for execution. The insert operation command is used to insert data into the Hive data table.
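The conversion of an update into an insert can be sketched in Python as follows. The sketch rests on assumptions that are not stated in the application: rows are dictionaries with hypothetical `uuid` and `op_version` fields, the new operation version number is assumed to be the previous one plus 1, deleted records are ignored for brevity, and the ages and the gender of Li Si are made-up sample values.

```python
def latest_rows(table):
    """Return the highest-versioned row of each UUID (deletions ignored for brevity)."""
    latest = {}
    for row in table:
        seen = latest.get(row["uuid"])
        if seen is None or row["op_version"] > seen["op_version"]:
            latest[row["uuid"]] = row
    return list(latest.values())


def modify(table, predicate, change):
    """Apply an update by appending new row versions instead of editing in place."""
    appended = []
    for row in latest_rows(table):
        if predicate(row):
            new_row = dict(row)                            # copy the existing fields
            change(new_row)                                # replace the fields to update
            new_row["op_version"] = row["op_version"] + 1  # assumed numbering scheme
            appended.append(new_row)
    table.extend(appended)                                 # the "insert" operation
    return len(appended)


def add_one_year(row):
    row["age"] += 1


table_a = [
    {"uuid": "u1", "op_version": 0, "name": "Zhang San", "sex": "male", "age": 30},
    {"uuid": "u2", "op_version": 0, "name": "Li Si", "sex": "female", "age": 25},
    {"uuid": "u3", "op_version": 0, "name": "Wang Wu", "sex": "male", "age": 40},
]
changed = modify(table_a, lambda row: row["sex"] == "male", add_one_year)
print(changed, len(table_a))
```

The original rows stay untouched, which suits an append-oriented store; the cost is that superseded rows accumulate until a cleaning operation removes them.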
With further reference to FIG. 4, FIG. 4 illustrates a result presented after performing a clean data operation on data Table A. The update result shown in fig. 4 is obtained by performing the above steps S12 to S15 on the data table a with the data record in the data table a in fig. 3 as the updated record.
In this embodiment, the preset value corresponding to the deletion flag is set to -1, the preset threshold for the number of updates in the version information is set to 1, the number of updates of data table A in fig. 2 is set to its initial value of 0, and the following operations are performed on data table A.
After the data updating operation is completed, the update result shown in fig. 3 is obtained. Since the number of updates in the version information of data table A in fig. 2 is 0 and one data updating operation has been performed to obtain data table A in fig. 3, the number of updates of data table A becomes 1. The version information is checked after the updating operation is completed; at this point the number of updates of data table A has reached the preset threshold of 1, so the cleaning operation for data table A is started.
First, the valid data records of data table A are determined. Since no data record with an operation version number of -1 exists in data table A, all three data records of data table A are valid data records. The UUID and the latest operation version number of each valid data record are then determined: the latest operation version number corresponding to Zhang San is 1, that corresponding to Li Si is 0, and that corresponding to Wang Wu is 1. The three rows of data corresponding to these version numbers are then determined to be the target data and cached in a temporary data table together with the corresponding UUIDs, and the operation version numbers of the three rows are set to the initial value of 0. Finally, the data in data table A is cleared, the temporary data table is inserted into data table A to obtain the cleaned data table A, and the number of updates of data table A is reset to the initial value of 0, which gives the update result shown in fig. 4.
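The cleaning operation walked through above can be sketched in Python. This is an illustrative sketch under assumptions, not the implementation of the application: rows are dictionaries with hypothetical `uuid` and `op_version` fields, deletions are assumed to appear as rows whose operation version number is -1, the version information is a dictionary with hypothetical `update_count` and `record_count` fields, and the ages are made-up sample values.

```python
DELETED = -1            # preset value of the deletion flag in this example
INITIAL_OP_VERSION = 0  # initial operation version number


def clean_table(table, version_info):
    """Keep only the latest row of each valid record and reset the version data."""
    deleted = {row["uuid"] for row in table if row["op_version"] == DELETED}
    latest = {}
    for row in table:
        if row["uuid"] in deleted:
            continue
        seen = latest.get(row["uuid"])
        if seen is None or row["op_version"] > seen["op_version"]:
            latest[row["uuid"]] = row
    # Temporary data table: target data, UUID and initial operation version number.
    temp_table = [dict(row, op_version=INITIAL_OP_VERSION) for row in latest.values()]
    table.clear()              # empty the target data table
    table.extend(temp_table)   # insert the temporary data table
    version_info["update_count"] = 0
    version_info["record_count"] = len(table)
    return table


table_a = [
    {"uuid": "u1", "op_version": 0, "name": "Zhang San", "age": 30},
    {"uuid": "u2", "op_version": 0, "name": "Li Si", "age": 25},
    {"uuid": "u3", "op_version": 0, "name": "Wang Wu", "age": 40},
    {"uuid": "u1", "op_version": 1, "name": "Zhang San", "age": 31},
    {"uuid": "u3", "op_version": 1, "name": "Wang Wu", "age": 41},
]
info_a = {"update_count": 1, "record_count": 5}
UPDATE_THRESHOLD = 1
if info_a["update_count"] >= UPDATE_THRESHOLD:
    clean_table(table_a, info_a)
print(table_a, info_a)
```

After cleaning, the table again holds one row per record, so later queries no longer have to scan superseded versions.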
According to the scheme, the update condition of each data record in the data sheet is monitored through the UUID and the operation version number, the update condition of the data sheet is monitored through the version information of the recorded data sheet, once the update times meet the preset threshold value, the redundant data in the data sheet is automatically cleared, the version information of the data sheet is initialized, the redundant data of the data sheet can be regularly cleared, and when the data is continuously updated for a long time, the calculation complexity can be always kept in a reasonable range, so that the data processing efficiency is maintained in a reasonable range.
The following describes a Hive-based data updating apparatus provided in an embodiment of the present application, and reference may be made to the Hive-based data updating apparatus described below and the Hive-based data updating method described above in a corresponding manner.
Referring to fig. 5, the Hive-based data updating apparatus may include:
a data record query unit 100, configured to receive a data update operation instruction, and determine a target data record to be updated in a target data table, where the data update operation instruction is a data modification instruction or a data deletion instruction;
a data record updating unit 110, configured to update the target data record according to the data update operation instruction, and update version information of the target data table, where the version information is used to record update times or update information of the target data table;
a first judging unit 120, configured to judge whether the updated version information meets a preset condition;
an update information querying unit 130, configured to query, when the updated version information meets a preset condition, a UUID and a latest operation version number of each valid data record in the target data table, where the UUID is used to uniquely identify one valid data record, the latest operation version number is used to identify the update times of the valid data record, and the valid data record is a data record in which the delete data instruction is not executed;
a temporary data table generating unit 140, configured to obtain target data corresponding to a latest operation version number of the valid data record, set an initial operation version number for the target data, and generate a temporary data table, where the temporary data table includes the target data, the UUID, and the initial operation version number;
a target data table resetting unit 150, configured to empty the target data table, insert the temporary data table into the target data table, and initialize version information of the target data table.
Optionally, the Hive-based data updating apparatus provided in this application may further include:
the data record acquisition unit is used for receiving a newly added data instruction and acquiring a newly added data record;
the updating identification setting unit is used for determining the UUID value and the initial operation version number of the newly added data record;
and the data record recording unit is used for executing the newly added data instruction, recording the newly added data record, the UUID value and the initial operation version number in the target data table, and updating the version information of the target data table.
Optionally, the Hive-based data updating apparatus provided in this application may further include:
the valid data query unit is used for receiving a query data instruction and determining a valid data set, wherein the valid data set contains data information corresponding to the latest operation version number of each record in the target data table, and the query data instruction comprises a query condition;
and the target set query unit is used for executing the query data instruction, searching the data information meeting the query condition in the effective data set, obtaining a target set and outputting the target set.
Alternatively, the data record updating unit 110 may include:
a data modification unit and a data deletion unit.
Optionally, the data modification unit may include:
the update data record acquisition unit is used for executing the data modification instruction to obtain a modified update data record when the data update operation instruction is the data modification instruction;
the updating data record recording unit is used for recording the updating data record in the target data table;
and the updating identification recording unit is used for determining and recording the UUID and the operation version number of the updating data record.
Alternatively, the data deleting unit may include:
and the operation version number setting unit is used for setting the operation version number of the target data record as a preset value when the data updating operation instruction is a data deleting instruction.
Optionally, the Hive-based data updating apparatus provided in this application may further include:
a second judging unit, configured to query whether the first operation version number includes the preset value, where the first operation version number is a historical operation version number of a data record in the target data table after the update operation is completed;
and the valid data record identification unit is used for determining the data record as the valid data record when the first operation version number does not contain the preset value.
Alternatively, the first judging unit 120 may include:
the update count judging unit is used for judging whether the number of completed update operations on the target data table reaches a first preset threshold;
and/or,
the data record quantity judging unit is used for judging whether the total number of data records in the target data table reaches a second preset threshold;
and/or,
the recording time judging unit is used for judging whether the time elapsed up to the current moment reaches a preset time, where the moment at which the version information of the target data table was last initialized is the timing zero point.
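The three judgments listed above can be combined in a small Python sketch. It is only an illustration under assumptions: the function name `meets_preset_condition`, its parameters, and the version information fields `update_count`, `record_count` and `timer_start` are hypothetical, and the "and/or" combination is interpreted here as "any enabled check is satisfied", which is one possible reading.

```python
def meets_preset_condition(info, now, max_updates=None, max_records=None,
                           max_elapsed=None):
    """Return True if any enabled check on the version information is satisfied."""
    checks = []
    if max_updates is not None:   # first preset threshold: completed update operations
        checks.append(info["update_count"] >= max_updates)
    if max_records is not None:   # second preset threshold: total number of records
        checks.append(info["record_count"] >= max_records)
    if max_elapsed is not None:   # preset time, measured from the last initialization
        checks.append(now - info["timer_start"] >= max_elapsed)
    return any(checks)


info = {"update_count": 1, "record_count": 5, "timer_start": 100.0}
print(meets_preset_condition(info, now=160.0, max_updates=1))
print(meets_preset_condition(info, now=160.0, max_records=10, max_elapsed=3600.0))
```

Passing `now` explicitly keeps the check deterministic; a caller could supply the current clock time instead.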
Optionally, the Hive-based data updating apparatus provided in this application may further include:
and the version information recording unit is used for recording the latest version information of the target data table in a management record table, and the management record table is used for carrying out version management on the data tables in the Hive library.
Embodiments of the present application further provide a readable storage medium, where the storage medium may store a program suitable for being executed by a processor, where the program is used to implement each processing flow in the foregoing Hive-based data updating scheme.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A Hive-based data updating method is characterized by comprising the following steps:
receiving a data updating operation instruction, and determining a target data record to be updated in a target data table, wherein the data updating operation instruction is a data modifying instruction or a data deleting instruction;
updating the target data record according to the data updating operation instruction, and updating the version information of the target data table, wherein the version information is used for recording the updating times or updating information of the target data table;
judging whether the updated version information meets a preset condition or not;
if the updated version information meets the preset condition, inquiring UUID and latest operation version number of each valid data record in the target data table, wherein the UUID is used for uniquely identifying one valid data record, the latest operation version number is used for identifying the updating times of the valid data record, and the valid data record is a data record which does not execute the data deleting instruction;
acquiring target data corresponding to the latest operation version number of the valid data record, setting an initial operation version number for the target data, and generating a temporary data table, wherein the temporary data table comprises the target data, the UUID and the initial operation version number;
and clearing the target data table, inserting the temporary data table into the target data table, and initializing the version information of the target data table.
2. The method according to claim 1, before said determining whether the updated version information satisfies the preset condition, further comprising:
receiving a new data instruction and acquiring a new data record;
determining the UUID value and the initial operation version number of the newly added data record;
and executing the new data instruction, recording the new data record, the UUID value and the initial operation version number in the target data table, and updating the version information of the target data table.
3. The method of claim 1, further comprising:
receiving a query data instruction, and determining an effective data set, wherein the effective data set contains data information corresponding to the latest operation version number of each record in the target data table, and the query data instruction comprises a query condition;
and executing the query data instruction, searching data information meeting the query condition in the effective data set, obtaining a target set and outputting the target set.
4. The method of claim 1, wherein updating the target data record according to the data update operation instruction comprises:
when the data updating operation instruction is a data modification instruction, executing the data modification instruction to obtain a modified updating data record;
recording the updated data record in the target data table;
determining and recording the UUID and the operation version number of the update data record;
or,
and when the data updating operation instruction is a data deleting instruction, setting the operation version number of the target data record as a preset value.
5. The method of claim 4, further comprising:
inquiring whether a first operation version number contains the preset value, wherein the first operation version number is a historical operation version number of a data record in the target data table after the updating operation is completed;
and if the first operation version number does not contain the preset value, determining the data record as the valid data record.
6. The method according to claim 1, wherein the determining whether the updated version information satisfies a preset condition includes:
judging whether the number of finished updating operations of the target data table reaches a first preset threshold value or not;
and/or,
judging whether the total number of the data records in the target data table reaches a second preset threshold value or not;
and/or,
and judging whether the time recorded at the current moment reaches preset time, wherein the moment when the version information of the target data table is initialized at the last time is a timing zero point.
7. The method of claim 1, further comprising:
and recording the version information of the target data table in a management record table, wherein the management record table is used for carrying out version management on the data table in the Hive library.
8. A Hive-based data update apparatus, comprising:
the data record query unit is used for receiving a data updating operation instruction and determining a target data record to be updated in a target data table, wherein the data updating operation instruction is a data modification instruction or a data deletion instruction;
the data record updating unit is used for updating the target data record according to the data updating operation instruction and updating the version information of the target data table, wherein the version information is used for recording the updating times or updating information of the target data table;
the first judging unit is used for judging whether the updated version information meets the preset condition or not;
an update information query unit, configured to query, when the updated version information meets a preset condition, a UUID and a latest operation version number of each valid data record in the target data table, where the UUID is used to uniquely identify one valid data record, the latest operation version number is used to identify the update times of the valid data record, and the valid data record is a data record in which the delete data instruction is not executed;
a temporary data table generating unit, configured to obtain target data corresponding to a latest operation version number of the valid data record, set an initial operation version number for the target data, and generate a temporary data table, where the temporary data table includes the target data, the UUID, and the initial operation version number;
and the target data table resetting unit is used for emptying the target data table, inserting the temporary data table into the target data table and initializing the version information of the target data table.
9. The apparatus according to claim 8, wherein the first determining unit further comprises:
the data record acquisition unit is used for receiving a newly added data instruction and acquiring a newly added data record;
the updating identification setting unit is used for determining the UUID value and the initial operation version number of the newly added data record;
a data record entering unit, configured to execute the new data command, record the new data record, the UUID value, and the initial operation version number in the target data table, and update version information of the target data table;
the valid data query unit is used for receiving a query data instruction and determining a valid data set, wherein the valid data set contains data information corresponding to the latest operation version number of each record in the target data table, and the query data instruction comprises a query condition;
and the target set query unit is used for executing the query data instruction, searching the data information meeting the query condition in the effective data set, obtaining a target set and outputting the target set.
10. A readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of a Hive-based data update method according to any of claims 1 to 7.
CN202211383394.4A 2022-11-07 2022-11-07 Hive-based data updating method and device and readable storage medium Pending CN115794843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211383394.4A CN115794843A (en) 2022-11-07 2022-11-07 Hive-based data updating method and device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211383394.4A CN115794843A (en) 2022-11-07 2022-11-07 Hive-based data updating method and device and readable storage medium

Publications (1)

Publication Number Publication Date
CN115794843A true CN115794843A (en) 2023-03-14

Family

ID=85435830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211383394.4A Pending CN115794843A (en) 2022-11-07 2022-11-07 Hive-based data updating method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN115794843A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination