CN118012826A - Data query method and related device - Google Patents

Info

Publication number
CN118012826A
CN118012826A CN202410138600.8A CN202410138600A CN118012826A CN 118012826 A CN118012826 A CN 118012826A CN 202410138600 A CN202410138600 A CN 202410138600A CN 118012826 A CN118012826 A CN 118012826A
Authority
CN
China
Prior art keywords
data
hash
range
field
field value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410138600.8A
Other languages
Chinese (zh)
Inventor
吕虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202410138600.8A
Publication of CN118012826A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data query method and a related device. In the method, a data lake comprises a plurality of hash slots, and each hash slot comprises a plurality of pieces of data. Field value ranges respectively corresponding to the plurality of hash slots are obtained; among the plurality of hash slots, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold are merged into a file group; based on a target field value corresponding to a target field in a data query condition, a target file group whose field value range includes the target field value is determined from a plurality of file groups; and the target file group is queried based on the data query condition. In this way, when a data query is performed, the target file group that satisfies the data query condition can be determined from the field value range of each file group, which improves data query efficiency. Moreover, obtaining the field value ranges does not require sorting or rewriting the data, so no heavy load is placed on the system hosting the data lake.

Description

Data query method and related device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data query method and a related device.
Background
Data lakes such as Hudi and Iceberg use a distributed file system to store data centrally and support streaming inserts and updates. A data lake can also store data in a form that is convenient for big data computing engines such as Spark, Flink, and Presto to read. A data lake typically includes a plurality of file groups (File Group), and one file group contains a plurality of pieces of data. When an operation such as a data query, data modification, or data addition is to be performed on the data lake, the corresponding file group must be located first, and the subsequent operation is then performed. For example, when a data query is required, the corresponding file group is located according to the index in the query condition (the index may also be referred to as a primary key field; each piece of data has a different value under the primary key field), and the corresponding data is then located within that file group. However, when a data lake is used for data storage, the file groups need to be traversed one by one, which reduces query efficiency.
In the related art, after a plurality of pieces of data are stored in a plurality of file groups of a data lake, the data stored in those file groups is sorted and re-laid out, and the sorted data improves query efficiency. However, sorting and re-laying out the data amounts to re-reading the data and storing it in the data lake again, which places a significant load on the system hosting the data lake.
Therefore, how to improve the query efficiency and avoid increasing the system load is a current urgent problem to be solved.
Disclosure of Invention
In view of this, the embodiments of the present application provide a data query method and related device, so as to improve the query efficiency and avoid increasing the system load.
In a first aspect, an embodiment of the present application provides a data query method, where a data lake includes a plurality of hash slots, and each hash slot includes a plurality of pieces of data, the method includes:
Acquiring field value ranges corresponding to the hash slots respectively; the field value range is obtained based on field values of a plurality of pieces of data in the hash slot under a target field;
merging, among the plurality of hash slots, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold into a file group; the number of file groups is smaller than the number of hash slots; the range similarity indicates the degree of overlap between two field value ranges;
determining a target file group with a field value range comprising the target field value from a plurality of file groups based on the target field value corresponding to the target field in the data query condition;
and querying the target file group based on the data query condition.
Optionally, the range similarity is determined by:
Determining intersection ranges and union ranges of field value ranges corresponding to the two hash slots respectively in the plurality of hash slots;
calculating the difference between the maximum value and the minimum value in the intersection range to obtain a first difference value, and calculating the difference between the maximum value and the minimum value in the union range to obtain a second difference value;
And taking the ratio of the first difference value and the second difference value as the range similarity of the two hash slots.
Optionally, the target field includes a first field and a second field, and the first field has a higher priority than the second field; the obtaining the field value ranges corresponding to the hash slots respectively includes:
Acquiring a field value range of a first field and a field value range of a second field, which correspond to the hash slots respectively;
the merging, as a file group, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold value, of the plurality of hash slots includes:
Merging at least two hash slots with the range similarity greater than or equal to the range similarity threshold value as a file group based on the field value ranges of the first fields respectively corresponding to the hash slots;
If a plurality of first hash slots which are not combined exist in the plurality of hash slots, combining at least two first hash slots with the range similarity greater than or equal to the range similarity threshold value as a file group based on the field value ranges of the second fields corresponding to the plurality of first hash slots respectively.
Optionally, the method further comprises:
If a plurality of second hash slots that have not been merged remain among the plurality of hash slots, the plurality of second hash slots are merged randomly.
Optionally, the merging, as the file group, at least two hash slots with a range similarity greater than or equal to a range similarity threshold, from the plurality of hash slots includes:
And combining at least two hash slots with the range similarity being greater than or equal to the range similarity threshold and the data volume stored after combination exceeding the data volume threshold as a plurality of file groups, wherein the data volume stored in each file group does not exceed the data volume threshold.
Optionally, the method further comprises:
After a batch of data is newly stored in the data lake, determining a file group to be split, wherein the stored data quantity of the file group exceeds a data quantity threshold value, from the file groups;
splitting the file group to be split based on field value ranges respectively corresponding to a plurality of hash slots in the file group to be split.
Optionally, the method further comprises:
converting a field value in the data lake under the target field into an integer type field value;
the obtaining the field value ranges corresponding to the hash slots respectively includes:
and acquiring field value ranges of integer types corresponding to the hash slots respectively.
Optionally, the method further comprises:
calculating the ratio of the field value range of each hash slot to the total field value range corresponding to the data lake to obtain the range ratio corresponding to each hash slot; the total field value range is obtained based on the field value of the data in the data lake under the target field;
the merging, as a file group, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold value, of the plurality of hash slots includes:
determining a target hash slot with the range proportion being greater than a range proportion threshold value from the plurality of hash slots;
and merging at least two hash slots with the range similarity larger than or equal to the range similarity threshold value as a file group in the hash slots except the target hash slot.
Optionally, the number of the hash slots is determined based on the data amount corresponding to the data stored in the data lake for the first time.
In a second aspect, an embodiment of the present application provides a data query apparatus, where a data lake includes a plurality of hash slots, each hash slot including a plurality of pieces of data, the apparatus including:
The range acquisition module is used for acquiring field value ranges corresponding to the hash slots respectively; the field value range is obtained based on field values of a plurality of pieces of data in the hash slot under a target field;
The hash slot merging module is used for merging, among the plurality of hash slots, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold into a file group; the number of file groups is smaller than the number of hash slots; the range similarity indicates the degree of overlap between two field value ranges;
the file group determining module is used for determining a target file group with a field value range comprising the target field value from a plurality of file groups based on the target field value corresponding to the target field in the data query condition;
And the data query module is used for querying the target file group based on the data query condition.
In a third aspect, an embodiment of the present application provides a data query device, including a memory and a processor:
The memory is used for storing a computer program and transmitting the computer program to the processor;
The processor is configured to execute the computer program, so that the device executes the data query method described in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium, where a computer program is stored, and when the computer program is executed, a device running the computer program implements the data query method described in the foregoing first aspect.
Compared with the prior art, the embodiment of the application has the following beneficial effects:
The embodiment of the application provides a data query method and a related device. In the method, a data lake comprises a plurality of hash slots, and each hash slot comprises a plurality of pieces of data. Field value ranges respectively corresponding to the plurality of hash slots are obtained, where a field value range is obtained based on the field values, under a target field, of the plurality of pieces of data in the hash slot. Among the plurality of hash slots, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold are merged into a file group; the number of file groups is smaller than the number of hash slots, and the range similarity indicates the degree of overlap between two field value ranges. Based on a target field value corresponding to the target field in a data query condition, a target file group whose field value range includes the target field value is determined from a plurality of file groups, and the target file group is queried based on the data query condition. In this way, when a data query is performed, the file groups do not need to be traversed one by one; the target file group that satisfies the data query condition can be determined from the field value range of each file group, which improves data query efficiency. Moreover, the hash slots are merged based on field value ranges, and obtaining the field value ranges does not require sorting or rewriting the data, so no heavy load is placed on the system hosting the data lake.
Drawings
In order to more clearly illustrate this embodiment or the technical solutions of the prior art, the drawings that are required for the description of the embodiment or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a data query method according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for determining a range similarity according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a data query device according to an embodiment of the present application.
Detailed Description
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
At present, an existing data query method proceeds as follows: a plurality of pieces of data are first stored in a plurality of file groups; the pieces of data in those file groups are then sorted by primary key and merged into a smaller number of larger file groups that each contain more data, that is, the data is re-laid out, with each larger file group corresponding to a primary key range. When a data query is performed, the corresponding file group can then be located directly from the primary key range, which improves data query efficiency. However, sorting and re-laying out the data amounts to re-reading the data and storing it in the data lake again, which places a significant load on the system hosting the data lake.
In order to solve the above problems, an embodiment of the present application provides a data query method and a related device. In the method, a data lake comprises a plurality of hash slots, and each hash slot comprises a plurality of pieces of data. Field value ranges respectively corresponding to the plurality of hash slots are obtained, where a field value range is obtained based on the field values, under a target field, of the plurality of pieces of data in the hash slot. Among the plurality of hash slots, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold are merged into a file group; the number of file groups is smaller than the number of hash slots, and the range similarity indicates the degree of overlap between two field value ranges. Based on a target field value corresponding to the target field in a data query condition, a target file group whose field value range includes the target field value is determined from a plurality of file groups, and the target file group is queried based on the data query condition. In this way, when a data query is performed, the file groups do not need to be traversed one by one; the target file group that satisfies the data query condition can be determined from the field value range of each file group, which improves data query efficiency. Moreover, the hash slots are merged based on field value ranges, and obtaining the field value ranges does not require sorting or rewriting the data, so no heavy load is placed on the system hosting the data lake.
It should be noted that, the embodiment of the present application may not limit the execution subject of the data query method, for example, the data query method of the embodiment of the present application may be applied to a data processing device such as a terminal device or a server. The terminal equipment can be electronic equipment such as a smart phone, a computer, a tablet personal computer and the like. The server may be a stand-alone server, a cluster server, a cloud server, or the like. The present application is not particularly limited to the above-mentioned terminal device or server.
The following describes in detail, by way of example, specific implementation of the data query method and related apparatus in the embodiments of the present application with reference to the accompanying drawings.
Referring to FIG. 1, which is a flowchart of a data query method provided by an embodiment of the present application, a data lake includes a plurality of hash slots and each hash slot includes a plurality of pieces of data. As shown in FIG. 1, the method may specifically include:
S101: Acquire the field value ranges respectively corresponding to the plurality of hash slots.
The hash slots may be referred to as hash buckets, and the data lake includes a plurality of hash slots, each having a corresponding hash slot number. When data is stored in a data lake, hash calculation can be performed based on a primary key value of the data to obtain a hash value, the hash value is divided by the number of hash slots and the remainder is taken as a hash slot number, and then the data is stored in the hash slot with the corresponding hash slot number. Thus, after a batch of data is stored in a data lake, each hash slot may include a plurality of pieces of data. The primary key value refers to a value corresponding to a primary key field, and the primary key value of each piece of data is unique, that is, different data can be distinguished by using the primary key value.
As an example, hash slot numbers of 100 hash slots may be 0-99, respectively. The application does not limit the number of hash slots.
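The slot assignment described above (hash the primary key value, then take the remainder modulo the number of hash slots) can be sketched as follows. This is a minimal illustration only: the function name `hash_slot_number`, the use of MD5 as the hash function, and the sample keys are assumptions of this sketch and are not specified by the application.

```python
import hashlib


def hash_slot_number(primary_key: str, num_slots: int) -> int:
    """Map a primary key value to a hash slot number in [0, num_slots)."""
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    # A stable digest keeps the mapping identical across processes.
    # MD5 is an assumption of this sketch, not part of the described method.
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    # The remainder of the hash value divided by the slot count is the slot number.
    return hash_value % num_slots


# Store a few hypothetical records into 100 hash slots numbered 0-99.
slots = {}
for key in ["C0001", "C0002", "C0003", "C0004"]:
    slots.setdefault(hash_slot_number(key, 100), []).append(key)
```

Because the slot number depends on the number of hash slots, changing that number later would move existing records to different slots, which is consistent with the slot count being fixed once data has been stored.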
In one possible embodiment of the present application, the number of the plurality of hash slots is determined based on the data amount corresponding to the data stored in the data lake for the first time.
In some embodiments, the number of hash slots may be obtained by dividing the amount of data inserted into the data lake for the first time by the average amount of data to be stored in each hash slot.
As an example, suppose the data inserted into the data lake for the first time amounts to 1 billion KB, a file group obtained by merging hash slots can store at most 128 MB, and each piece of data is 1 KB. From the ratio of 128 MB to 1 KB, one file group can store about 130,000 pieces of data. Assuming that 13 hash slots form one file group, the ratio of 130,000 to 13 means that one hash slot stores 10,000 pieces of data, that is, about 10,000 KB. Dividing the 1 billion KB inserted for the first time by the 10,000 KB stored per hash slot gives 100,000 hash slots.
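The sizing calculation above can be written out as a short helper. This is a sketch under stated assumptions: the function and parameter names are hypothetical, all sizes are expressed in KB, and the example call uses the rounded capacity of 130,000 records per file group from the example rather than the exact value of 128 MB.

```python
def estimate_hash_slot_count(first_batch_kb: int,
                             record_kb: int,
                             records_per_file_group: int,
                             slots_per_file_group: int) -> int:
    """Estimate how many hash slots to create from the first batch size."""
    # Records one hash slot should hold so that a full file group stays within its cap.
    records_per_slot = records_per_file_group // slots_per_file_group
    # Total number of records in the first batch.
    total_records = first_batch_kb // record_kb
    # Ceiling division, so that leftover records still get a slot.
    return -(-total_records // records_per_slot)


# Figures from the example: 1 billion KB in total, 1 KB per record,
# about 130,000 records per file group, 13 hash slots per file group.
slot_count = estimate_hash_slot_count(1_000_000_000, 1, 130_000, 13)
```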
The number of the hash slots cannot be changed after the data lake begins to store the data, and the performance of the data lake can be influenced by excessive or insufficient number of the hash slots in the data lake, so that the number of the hash slots which are suitable can be determined according to the data volume of a batch of data stored in the data lake for the first time, and the influence on the performance of the data lake is avoided.
In the embodiment of the application, the field value range is obtained based on the field values of the pieces of data in the hash slots under the target field. The field value under the target field refers to one or more of the values respectively corresponding to the multiple fields included in each piece of data, that is, the target field refers to one or more of the multiple fields, and the number of the target fields is not limited in the application.
As one example, the plurality of pieces of data are data related to banking or insurance business. The data may include field values under a plurality of fields such as a client number, a sex, a region, a service category, etc., and the client number may be a primary key field, and the region, the client number, etc. may be used as a target field.
In some embodiments, each hash slot may count a field value range of a target field when a batch of data is stored in a data lake. For example, after one hash slot stores two pieces of data, a maximum value and a minimum value under a target field may be determined, then when a third piece of data is stored, a field value of the third piece of data under the target field may be compared with the maximum value and the minimum value, if the field value is greater than the maximum value, a field value of the third piece of data under the target field is taken as the maximum value, if the field value is less than the minimum value, a field value of the third piece of data under the target field is taken as the minimum value, and so on, a field value range may be updated until a batch of data is stored, and a field value range of each hash slot may be obtained.
In some embodiments, while a batch of data is being stored in the data lake, each hash slot also updates its stored data amount based on the size of each piece of data written to it; once the batch has been stored, the data amount stored in each hash slot is known.
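The per-slot statistics described above can be maintained incrementally at write time. The sketch below assumes integer field values and a hypothetical `SlotStats` class; it updates the minimum, the maximum, and the stored data amount as each record arrives, so no later pass over the stored data is needed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotStats:
    """Running statistics for one hash slot (hypothetical helper)."""
    min_value: Optional[int] = None
    max_value: Optional[int] = None
    data_bytes: int = 0

    def add(self, field_value: int, record_bytes: int) -> None:
        # Widen the field value range only when the new value falls outside it.
        if self.min_value is None or field_value < self.min_value:
            self.min_value = field_value
        if self.max_value is None or field_value > self.max_value:
            self.max_value = field_value
        # Accumulate the data amount stored in this slot.
        self.data_bytes += record_bytes


stats = SlotStats()
for value in (188, 260, 200):
    stats.add(value, record_bytes=1024)
```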
S102: Merge, among the plurality of hash slots, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold into a file group.
In the embodiment of the application, each file group is obtained by merging at least two hash slots, so the number of file groups is smaller than the number of hash slots. The range similarity indicates the degree of overlap between two field value ranges.
As an example, the range similarity threshold may be 0.5 or 0.6, etc., as the application is not limited in this regard.
In some embodiments, after the file groups are merged, the mapping relationship between each hash slot and each file group may be recorded. For example, file group 1 corresponds to hash slot 3 and hash slot 6.
In one possible implementation of the present application, referring to FIG. 2, which is a flowchart of a method for determining the range similarity provided by an embodiment of the present application, the range similarity is determined by the following steps:
S1: Determine the intersection range and the union range of the field value ranges respectively corresponding to two hash slots among the plurality of hash slots.
As an example, if the field value range corresponding to hash slot 1 is [0, 100] and the field value range corresponding to hash slot 2 is [50, 120], the intersection range is [50, 100] and the union range is [0, 120].
S2: Calculate the difference between the maximum value and the minimum value of the intersection range to obtain a first difference, and calculate the difference between the maximum value and the minimum value of the union range to obtain a second difference.
Based on the above example, the first difference is 100 - 50 = 50 and the second difference is 120 - 0 = 120.
S3: Take the ratio of the first difference to the second difference as the range similarity of the two hash slots.
Based on the above example, the range similarity of the two hash slots is 50/120 = 5/12, or about 0.42.
Thus, the larger the intersection range is relative to the union range, the higher the degree of overlap between the field value ranges of the two hash slots, which makes it convenient to subsequently merge hash slots with similar field value ranges.
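Steps S1 to S3 can be sketched as a small function over closed integer ranges. The function name and the handling of edge cases are assumptions of this sketch: ranges that do not overlap are given a similarity of 0, the union range of two ranges is taken as the span from the smaller minimum to the larger maximum, and two identical single-point ranges are given a similarity of 1.

```python
from typing import Tuple

Range = Tuple[int, int]  # (minimum, maximum) of a field value range


def range_similarity(a: Range, b: Range) -> float:
    """Ratio of the intersection span to the union span of two ranges."""
    # S1: intersection range and union range.
    inter_lo, inter_hi = max(a[0], b[0]), min(a[1], b[1])
    union_lo, union_hi = min(a[0], b[0]), max(a[1], b[1])
    # S2: second difference (span of the union range).
    second_diff = union_hi - union_lo
    if second_diff == 0:
        return 1.0  # both ranges are the same single point (assumed convention)
    if inter_hi < inter_lo:
        return 0.0  # no intersection (assumed convention)
    # S2: first difference (span of the intersection range).
    first_diff = inter_hi - inter_lo
    # S3: ratio of the first difference to the second difference.
    return first_diff / second_diff


similarity = range_similarity((0, 100), (50, 120))
```

With this definition the value always lies between 0 and 1, so it can be compared directly against a threshold such as 0.5 or 0.6.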
In one possible embodiment of the present application, the target field may include a first field and a second field, where the first field has a higher priority than the second field, and S101 may be: and acquiring a field value range of a first field and a field value range of a second field, which correspond to the hash slots respectively.
As an example, the first field may be a customer number and the second field may be a region, as the application is not limited in this regard.
It should be noted that the target field includes the first field and the second field only as examples, and may include more fields, which is not limited in the present application.
Accordingly, S102 may include: and merging at least two hash slots with the range similarity greater than or equal to the range similarity threshold value as a file group based on the field value ranges of the first fields respectively corresponding to the hash slots. If a plurality of first hash slots which are not combined exist in the plurality of hash slots, combining at least two first hash slots with the range similarity greater than or equal to the range similarity threshold value as a file group based on the field value ranges of the second fields corresponding to the plurality of first hash slots respectively.
As an example, a total of 100 hash slots are included, at least two hash slots with a range similarity greater than or equal to the range similarity threshold are combined as a file group based on a field value range of a first field corresponding to each of the 100 hash slots, if 10 file groups are obtained and 20 hash slots remain, at least two hash slots with a range similarity greater than or equal to the range similarity threshold may be continuously combined as a file group based on a field value range of a second field corresponding to each of the 20 hash slots.
In some embodiments, the priorities of the multiple target fields may be preset according to the number of different values corresponding to the fields, that is, the more different values corresponding to the fields, the larger the range, the higher the priority.
Merging based on only one field may leave some hash slots unmerged, because their field value ranges have no intersection or differ too much; merging the remaining hash slots based on a further field therefore increases the likelihood that hash slots can be merged.
In addition, in a possible implementation of the present application, the data query method may further include: if a plurality of second hash slots that have not been merged remain among the plurality of hash slots, merging the plurality of second hash slots randomly.
As an example, continuing the above example, suppose that merging based on the field value ranges of the second field corresponding to the 20 hash slots yields 8 file groups and leaves 4 hash slots. The remaining 4 hash slots may then be merged randomly, for example in pairs.
This handles hash slots that still cannot be merged because their field value ranges have no intersection or differ too much, and further increases the likelihood that hash slots can be merged.
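The priority-ordered merge with a random fallback can be sketched as follows. The application does not prescribe a particular grouping algorithm, so the greedy strategy here (each unmerged slot acts as a seed and collects the remaining slots whose range is similar enough to the seed's) is an assumption, as are all function names and the sample data.

```python
import random


def similarity(a, b):
    """Intersection span divided by union span; 0.0 when ranges do not overlap."""
    span = max(a[1], b[1]) - min(a[0], b[0])
    if span == 0:
        return 1.0
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return 0.0 if hi < lo else (hi - lo) / span


def merge_by_field(slot_ids, ranges, threshold):
    """One greedy pass over a single field; returns (file_groups, unmerged_slots)."""
    groups, leftover, remaining = [], [], list(slot_ids)
    while remaining:
        seed = remaining.pop(0)
        members = [s for s in remaining
                   if similarity(ranges[seed], ranges[s]) >= threshold]
        if members:
            groups.append([seed] + members)
            remaining = [s for s in remaining if s not in members]
        else:
            leftover.append(seed)
    return groups, leftover


def merge_slots(slot_ranges, fields, threshold, rng):
    """slot_ranges: {slot_id: {field: (lo, hi)}}; fields are ordered by priority."""
    groups, pending = [], sorted(slot_ranges)
    for field in fields:
        # Only slots left unmerged by higher-priority fields are considered.
        ranges = {s: slot_ranges[s][field] for s in pending}
        merged, pending = merge_by_field(pending, ranges, threshold)
        groups.extend(merged)
    # Fallback: merge whatever is still unmerged at random, in pairs.
    rng.shuffle(pending)
    pairs = [pending[i:i + 2] for i in range(0, len(pending), 2)]
    if len(pairs) > 1 and len(pairs[-1]) == 1:
        last = pairs.pop()
        pairs[-1].extend(last)  # avoid leaving a single-slot file group
    groups.extend(pairs)
    return groups


slot_ranges = {
    0: {"customer": (0, 100), "region": (1, 5)},
    1: {"customer": (20, 110), "region": (1, 5)},
    2: {"customer": (500, 600), "region": (10, 20)},
    3: {"customer": (900, 1000), "region": (12, 20)},
    4: {"customer": (2000, 2100), "region": (30, 31)},
    5: {"customer": (3000, 3100), "region": (40, 41)},
}
groups = merge_slots(slot_ranges, ["customer", "region"], 0.5, random.Random(0))
```

In this sample, slots 0 and 1 merge on the first field, slots 2 and 3 merge on the second field, and slots 4 and 5, which match on neither field, are paired by the random fallback.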
In one possible embodiment of the present application, S102 may specifically be: and combining at least two hash slots with the range similarity being greater than or equal to the range similarity threshold and the data volume stored after combination exceeding the data volume threshold as a plurality of file groups, wherein the data volume stored in each file group does not exceed the data volume threshold.
It should be appreciated that in order to avoid excessive amounts of storage of the consolidated file groups, resulting in high loads on the system, a data amount threshold is typically set for the file groups.
As an example, if there are 8 hash slots that can be merged into 1 file group, but the total data volume corresponding to the 8 hash slots exceeds the data volume threshold, the 8 hash slots can be merged into 2 or 3 file groups, so as to ensure that the data volume stored by the merged file group does not exceed the data volume threshold.
In some embodiments, only hash slots with more similar field value ranges may be merged into a file group; for example, the range similarity threshold may be raised so that hash slots with more similar field value ranges are selected for merging. Illustratively, the raised range similarity threshold may be 0.8 or the like.
Therefore, the overlarge data volume of the file group is avoided, and the performance of the data lake is prevented from being influenced.
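One simple way to respect the data amount threshold is to pack a set of mergeable hash slots into file groups sequentially, starting a new file group whenever the next slot would exceed the cap. This is only a sketch: the function name, the sequential packing strategy, and the sample sizes are assumptions. A single slot that is itself larger than the cap would still form an oversized group, and the last group may contain a single slot.

```python
def pack_by_data_amount(candidate_slots, slot_bytes, max_group_bytes):
    """Split one set of mergeable slots into file groups that fit the cap."""
    groups, current, current_bytes = [], [], 0
    for slot in candidate_slots:
        size = slot_bytes[slot]
        # Close the current file group if adding this slot would exceed the cap.
        if current and current_bytes + size > max_group_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(slot)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
# Eight mergeable slots of 40 MB each against a 128 MB cap (hypothetical sizes).
slot_bytes = {slot: 40 * MB for slot in range(8)}
file_groups = pack_by_data_amount(list(range(8)), slot_bytes, 128 * MB)
```

With these sample sizes the eight slots end up in three file groups rather than one, in line with the example above.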
Based on the above description, the field value ranges used to merge the hash slots are determined while the data is being written to the data lake (that is, to the disk system), so the subsequent merge can be performed directly, without re-reading and rewriting the data.
S103: Based on the target field value corresponding to the target field in the data query condition, determine, from a plurality of file groups, a target file group whose field value range includes the target field value.
After the hash slots are combined, a data query condition for querying the data can be obtained, and based on a target field value under a target field in the data query condition, a field value range comprising the target field value is determined from field value ranges corresponding to a plurality of file groups respectively, wherein the corresponding file group is the target file group to be queried.
As an example, the target field is a client number and the data query condition is "client number = 200". If the field value range of file group 1 is 188-260, the target field value falls within that range, so file group 1 is taken as the target file group and can be located directly, without traversing the other file groups.
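Locating the target file group amounts to a containment check against each file group's recorded range. In the sketch below the function name and the ranges of file groups 2 and 3 are hypothetical. The function returns a list because the field value ranges of different file groups may overlap, in which case more than one file group can contain the target value.

```python
def find_target_file_groups(group_ranges, target_value):
    """Return ids of file groups whose field value range includes target_value."""
    return [group_id
            for group_id, (lo, hi) in group_ranges.items()
            if lo <= target_value <= hi]


# File group 1 uses the range from the example; the others are made up.
group_ranges = {1: (188, 260), 2: (300, 420), 3: (500, 900)}
targets = find_target_file_groups(group_ranges, 200)
```

Only the returned file groups then need to be read to evaluate the query condition; the remaining file groups are skipped on the basis of their metadata alone.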
S104: querying the target file group based on the data query condition.
After the target file group is located, the corresponding data can be queried from the target file group based on the data query condition.
It should be appreciated that after the first batch of data is stored in the data lake, a new batch of data may be further stored later, which may cause the data volume of the file group to exceed the data volume threshold.
In order to solve the above problem, in one possible embodiment of the present application, the data query method may further include: after a new batch of data is stored in the data lake, determining, from the file groups, a file group to be split whose stored data volume exceeds the data volume threshold; and splitting the file group to be split based on the field value ranges respectively corresponding to the plurality of hash slots in the file group to be split.
In some embodiments, a file group whose data volume exceeds the data volume threshold and which includes a plurality of hash slots is determined to be a file group to be split. The file group to be split may then be split based on the field value ranges corresponding to its hash slots, and the hash slots may be recombined to obtain a plurality of new file groups; that is, S101-S102 may be executed again on the hash slots in the file group to be split. For the specific implementation, refer to the foregoing embodiments, which are not repeated here.
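One possible way to split an oversized file group is sketched below. It is a simplification and an assumption: instead of re-running the similarity-based merge of S101-S102, it orders the hash slots by the lower bound of their field value range and cuts the sequence whenever the data volume threshold would be exceeded. The name `split_group` and the tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

# (range_min, range_max, stored_size) for one hash slot; layout is illustrative.
SlotInfo = Tuple[int, int, int]


def split_group(slots: Dict[int, SlotInfo], max_group_size: int) -> List[List[int]]:
    """Split one oversized file group into smaller groups of neighbouring ranges."""
    # Slots with nearby ranges end up adjacent after sorting by (min, max).
    ordered = sorted(slots, key=lambda sid: (slots[sid][0], slots[sid][1]))
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for sid in ordered:
        size = slots[sid][2]
        if current and current_size + size > max_group_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(sid)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

No data is rewritten in this sketch; only the assignment of hash slots to file groups changes.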
In the related art, a hash bucket is generally bound to each file group, and each hash bucket has a corresponding number. When an operation needs to be performed on a piece of data, the bucket number for that data is calculated to determine the corresponding hash bucket, that is, to locate the corresponding file group. However, in this scheme the number of hash buckets cannot be changed once it has been determined, while the amount of data to be stored in the data lake cannot be estimated in advance. If the number of hash buckets is too small, each file group becomes too large, which easily causes write amplification and reduces the concurrency of the query engine.
Based on the above description, the present application likewise leaves the number of hash slots unchanged, but it does not bind hash slots to file groups, so the number of file groups is not limited by the number of hash slots. Two or more hash slots may form one file group, and when a file group becomes too large it can be split to obtain new file groups. Data capacity can thus be expanded conveniently, and the storage capacity of the data lake is increased without affecting its performance.
In addition, in some embodiments, after the new file groups are obtained by merging, the mapping relationship between each hash slot and each file group can be updated, so that data query is facilitated.
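The decoupling of hash slots from file groups may be illustrated as follows. This is a sketch under stated assumptions: the slot count of 16, the use of CRC32 as the hash function, and the names `slot_of` and `rebuild_mapping` are all hypothetical and are not specified by the application.

```python
import zlib
from typing import Dict, List

NUM_SLOTS = 16  # assumed fixed number of hash slots


def slot_of(record_key: str) -> int:
    """Map a record key to a hash slot; the slot count never changes."""
    return zlib.crc32(record_key.encode("utf-8")) % NUM_SLOTS


def rebuild_mapping(groups: Dict[str, List[int]]) -> Dict[int, str]:
    """Rebuild the slot -> file group mapping after a merge or a split."""
    return {slot: group_id for group_id, slots in groups.items() for slot in slots}
```

Because only the mapping is rebuilt, a record's slot stays stable while the file group that holds the slot may change over time.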
It should be understood that the field values under the target field in the data lake may be of a character type or a timestamp type. To facilitate statistics of the field value range, in one possible embodiment of the present application, the data query method may further include: converting the field values under the target field in the data lake into field values of an integer type. Accordingly, S101 may be: acquiring the integer-type field value ranges corresponding to the hash slots respectively.
In some embodiments, assuming that a field value of a character type exists under the target field in the data lake, the field value of the character type may first be converted into a field value of an integer type. If the number of digits of the integer-type field value is greater than a preset digit count threshold, the value may be truncated to the preset digit count threshold; if the number of digits is smaller than the preset digit count threshold, the value may be padded to the preset digit count threshold.
As an example, the preset digit count threshold may be set to 8. If the integer-type field value has 10 digits, its last two digits may be truncated; if it has 6 digits, two zeros may be appended at the end.
Integer field values with the same number of digits make it easier to compile statistics on the field value range.
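The digit normalization can be sketched as follows. The function name `normalize_digits` is hypothetical, and the sketch assumes non-negative integers; how a character-type value is first converted into an integer is not specified here and is left out.

```python
def normalize_digits(value: int, width: int = 8) -> int:
    """Truncate or zero-pad a non-negative integer to exactly `width` digits."""
    if value < 0:
        raise ValueError("this sketch assumes non-negative field values")
    digits = str(value)
    if len(digits) > width:
        digits = digits[:width]  # drop the trailing digits
    else:
        digits = digits.ljust(width, "0")  # append zeros at the end
    return int(digits)
```

With the threshold of 8 from the example, the 10-digit value 1234567890 becomes 12345678 and the 6-digit value 123456 becomes 12345600.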
In one possible implementation of the present application, the data query method may further include: calculating the ratio of the field value range of each hash slot to the total field value range corresponding to the data lake, to obtain the range ratio corresponding to each hash slot, wherein the total field value range is derived from the field values, under the target field, of the data in the data lake.
As an example, if the maximum field value under the target field of the data stored in the data lake is 1000 and the minimum is 0, the total field value range is [0, 1000]. Assuming that the field value range of hash slot 3 is [100, 900], its range ratio is (900 - 100) / (1000 - 0) = 0.8, that is, 80%.
Accordingly, S102 may specifically include: determining a target hash slot with the range proportion being greater than a range proportion threshold value from the plurality of hash slots; and merging at least two hash slots with the range similarity larger than or equal to the range similarity threshold value as a file group in the hash slots except the target hash slot.
As an example, the range ratio threshold may be 75%. Since the range ratio of hash slot 3 (80%) is greater than 75%, hash slot 3 is determined to be a target hash slot. Hash slot 3 is then removed from the plurality of hash slots, and among the remaining hash slots, at least two hash slots whose range similarity is greater than or equal to the range similarity threshold are merged as a file group.
A range ratio that exceeds the range ratio threshold indicates that the field value range of the hash slot is too wide, and merging such a hash slot would make the field value range of the resulting file group too wide as well. Such hash slots are therefore excluded before merging, so that data query efficiency is not affected.
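The range ratio and the exclusion of overly wide hash slots may be sketched as below. The names `range_ratio` and `partition_slots` are hypothetical, and treating a zero-width total range as a ratio of 1.0 is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # (min, max) of a field value range


def range_ratio(slot_range: Range, total_range: Range) -> float:
    """Width of a slot's range divided by the width of the total range."""
    total_width = total_range[1] - total_range[0]
    if total_width == 0:
        return 1.0  # assumption: a single-valued data lake counts as fully covered
    return (slot_range[1] - slot_range[0]) / total_width


def partition_slots(
    slot_ranges: Dict[int, Range], total_range: Range, ratio_threshold: float = 0.75
) -> Tuple[List[int], List[int]]:
    """Return (excluded target slots, slots that remain candidates for merging)."""
    excluded = [s for s, r in slot_ranges.items()
                if range_ratio(r, total_range) > ratio_threshold]
    candidates = [s for s in slot_ranges if s not in excluded]
    return excluded, candidates
```

With the figures from the example, hash slot 3 with range [100, 900] in a total range of [0, 1000] has a ratio of 0.8 and is excluded at a threshold of 75%.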
In addition, in some embodiments, assuming that fields A, B, and C are all target fields, if the range ratio of a hash slot for field A exceeds the range ratio threshold, the remaining hash slots may first be merged based on field A; that hash slot and any hash slots left unmerged are then merged based on field B, and so on.
The embodiments of the present application provide some specific implementations of a data query method, and based on this, the present application also provides a corresponding data query device. The data query device provided by the embodiment of the application will be described from the aspect of function modularization.
Referring to fig. 3, which is a schematic structural diagram of a data query device according to an embodiment of the present application, a data lake includes a plurality of hash slots, each hash slot includes a plurality of pieces of data, and the data query device 300 may include:
a range obtaining module 310, configured to obtain field value ranges corresponding to the hash slots respectively; the field value range is obtained based on field values of a plurality of pieces of data in the hash slot under a target field;
a hash slot merging module 320, configured to merge, as a file group, at least two hash slots with a range similarity greater than or equal to a range similarity threshold, from the plurality of hash slots; the number of the file groups is smaller than the number of hash slots; the range similarity is used for indicating the coincidence degree of the two field value ranges;
a file group determining module 330, configured to determine, from a plurality of file groups, a target file group whose field value range includes the target field value based on a target field value corresponding to the target field in the data query condition;
a data query module 340, configured to query the target file group based on the data query condition.
As one embodiment, the range similarity is determined by:
A range determining unit, configured to determine, among the plurality of hash slots, an intersection range and a union range of field value ranges corresponding to each of the two hash slots;
The difference value calculation unit is used for calculating the difference value between the maximum value and the minimum value in the intersection range to obtain a first difference value, and calculating the difference value between the maximum value and the minimum value in the union range to obtain a second difference value;
And the similarity determining module is used for taking the ratio of the first difference value to the second difference value as the range similarity of the two hash slots.
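The range similarity computed by the units above may be sketched as follows. The function name `range_similarity` is hypothetical. Two points are assumptions of the sketch: disjoint ranges are given a similarity of 0, and two identical single-point ranges (where the second difference is 0) are given a similarity of 1.0.

```python
from typing import Tuple

Range = Tuple[int, int]  # (min, max) of a field value range


def range_similarity(a: Range, b: Range) -> float:
    """Ratio of the intersection width to the union width of two ranges."""
    # First difference: max minus min of the intersection range (0 if disjoint).
    first_diff = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    # Second difference: max minus min of the union range.
    second_diff = max(a[1], b[1]) - min(a[0], b[0])
    if second_diff == 0:
        return 1.0  # assumption: both ranges are the same single point
    return first_diff / second_diff
```

For example, the ranges [0, 100] and [50, 150] intersect in [50, 100] and span [0, 150], giving a similarity of 50 / 150, which is below a threshold such as 0.8.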
As one embodiment, the target field includes a first field and a second field, the first field having a higher priority than the second field; the range acquisition module 310 may specifically be configured to:
Acquiring a field value range of a first field and a field value range of a second field, which correspond to the hash slots respectively;
accordingly, the hash slot merging module 320 may specifically include:
A first hash slot merging unit, configured to merge, as a file group, at least two hash slots with a range similarity greater than or equal to the range similarity threshold, based on a field value range of a first field corresponding to each of the plurality of hash slots;
a second hash slot merging unit, configured to, if a plurality of first hash slots that have not been merged exist in the plurality of hash slots, merge, as a file group, at least two first hash slots with a range similarity greater than or equal to the range similarity threshold, based on the field value ranges of the second field respectively corresponding to the plurality of first hash slots.
As an embodiment, the data query device 300 may further include:
a random combination module, configured to, if a plurality of second hash slots that have not been merged exist in the plurality of hash slots, randomly combine the plurality of second hash slots.
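The cascade described above (merge on the first field, then merge the leftovers on the second field, then combine what remains at random) may be sketched as follows. All names are hypothetical. The greedy grouping rule and the pairing of leftover slots two at a time are assumptions of the sketch, since the application does not fix a grouping procedure.

```python
import random
from typing import Dict, List, Tuple

Range = Tuple[int, int]


def similarity(a: Range, b: Range) -> float:
    """Intersection width over union width; 0 for disjoint ranges."""
    first = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    second = max(a[1], b[1]) - min(a[0], b[0])
    return first / second if second else 1.0


def merge_pass(
    slot_ids: List[int], ranges: Dict[int, Range], threshold: float
) -> Tuple[List[List[int]], List[int]]:
    """Greedy pass: a slot joins the first group whose members are all similar."""
    groups: List[List[int]] = []
    for sid in slot_ids:
        for group in groups:
            if all(similarity(ranges[sid], ranges[o]) >= threshold for o in group):
                group.append(sid)
                break
        else:
            groups.append([sid])
    merged = [g for g in groups if len(g) >= 2]
    leftover = [g[0] for g in groups if len(g) == 1]
    return merged, leftover


def merge_with_priority(
    first: Dict[int, Range], second: Dict[int, Range], threshold: float, seed: int = 0
) -> List[List[int]]:
    """Merge on the higher-priority field, then the second field, then at random."""
    merged, leftover = merge_pass(sorted(first), first, threshold)
    merged_second, leftover = merge_pass(leftover, second, threshold)
    merged.extend(merged_second)
    random.Random(seed).shuffle(leftover)
    # Pair up whatever is still unmerged; an odd slot out stays on its own.
    merged.extend(leftover[i:i + 2] for i in range(0, len(leftover), 2))
    return merged
```

The first field takes precedence simply because its pass runs first; only hash slots that found no partner on the first field are considered on the second field.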
As an embodiment, the hash slot merging module 320 may specifically be used to:
merge at least two hash slots whose range similarity is greater than or equal to the range similarity threshold, and whose combined stored data volume exceeds the data volume threshold, into a plurality of file groups, wherein the data volume stored in each file group does not exceed the data volume threshold.
As an embodiment, the data query device 300 may further include:
The file group to be split determining module is used for determining file groups to be split, of which the stored data quantity exceeds a data quantity threshold value, from the file groups after a batch of data is newly stored in the data lake;
and the file group splitting module is used for splitting the file group to be split based on the field value ranges respectively corresponding to the hash slots in the file group to be split.
As an embodiment, the data query device 300 may further include:
the type conversion module is used for converting the field value under the target field in the data lake into the field value of an integer type;
accordingly, the range obtaining module 310 may specifically be configured to:
and acquiring field value ranges of integer types corresponding to the hash slots respectively.
As an embodiment, the data query device 300 may further include:
the range ratio calculation module is used for calculating the ratio of the field value range of each hash slot to the total field value range corresponding to the data lake to obtain the range ratio corresponding to each hash slot; the total field value range is obtained based on the field value of the data in the data lake under the target field;
accordingly, the hash slot merging module 320 may specifically include:
a hash slot determining unit, configured to determine a target hash slot whose range ratio is greater than a range ratio threshold from the plurality of hash slots;
and a third hash slot merging unit configured to merge, as a file group, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold, among hash slots other than the target hash slot.
As one embodiment, the number of the plurality of hash slots is determined based on the data amount corresponding to the data stored in the data lake for the first time.
The embodiment of the application also provides corresponding data query equipment and a computer readable storage medium, which are used for realizing the scheme provided by the embodiment of the application.
The device comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for executing the computer program so as to enable the device to execute the data query method according to any embodiment of the application.
The computer readable storage medium stores a computer program, and when the computer program is executed, a device executing the computer program implements the data query method according to any embodiment of the present application.
Terms such as "first" and "second" (where present) in the embodiments of the present application are used only to distinguish names and do not indicate an order.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described example methods may be implemented in software plus general hardware platforms. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a readable storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a router) to perform the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment is mainly described in a different point from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, wherein elements illustrated as separate elements may or may not be physically separate, and elements illustrated as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present application should be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (10)

1. A data query method, wherein a data lake includes a plurality of hash slots, each hash slot including a plurality of pieces of data therein, the method comprising:
Acquiring field value ranges corresponding to the hash slots respectively; the field value range is obtained based on field values of a plurality of pieces of data in the hash slot under a target field;
merging at least two hash slots with the range similarity being greater than or equal to a range similarity threshold value in the plurality of hash slots as a file group; the number of the file groups is smaller than the number of hash slots; the range similarity is used for indicating the coincidence degree of the two field value ranges;
determining a target file group with a field value range comprising the target field value from a plurality of file groups based on the target field value corresponding to the target field in the data query condition;
and querying the target file group based on the data query condition.
2. The method of claim 1, wherein the range similarity is determined by:
Determining intersection ranges and union ranges of field value ranges corresponding to the two hash slots respectively in the plurality of hash slots;
calculating the difference between the maximum value and the minimum value in the intersection range to obtain a first difference value, and calculating the difference between the maximum value and the minimum value in the union range to obtain a second difference value;
And taking the ratio of the first difference value and the second difference value as the range similarity of the two hash slots.
3. The method of claim 1, wherein the target field comprises a first field and a second field, the first field having a higher priority than the second field; the obtaining the field value ranges corresponding to the hash slots respectively includes:
Acquiring a field value range of a first field and a field value range of a second field, which correspond to the hash slots respectively;
the merging, as a file group, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold value, of the plurality of hash slots includes:
Merging at least two hash slots with the range similarity greater than or equal to the range similarity threshold value as a file group based on the field value ranges of the first fields respectively corresponding to the hash slots;
If a plurality of first hash slots which are not combined exist in the plurality of hash slots, combining at least two first hash slots with the range similarity greater than or equal to the range similarity threshold value as a file group based on the field value ranges of the second fields corresponding to the plurality of first hash slots respectively.
4. A method according to claim 3, characterized in that the method further comprises:
If there are a plurality of second hash slots that are not combined in the plurality of hash slots, then the plurality of second hash slots are randomly combined.
5. The method of claim 1, wherein merging at least two hash slots having a range similarity greater than or equal to a range similarity threshold among the plurality of hash slots as a file group comprises:
merging at least two hash slots whose range similarity is greater than or equal to the range similarity threshold, and whose combined stored data volume exceeds the data volume threshold, into a plurality of file groups, wherein the data volume stored in each file group does not exceed the data volume threshold.
6. The method according to claim 1, wherein the method further comprises:
after a new batch of data is stored in the data lake, determining, from the file groups, a file group to be split whose stored data volume exceeds a data volume threshold;
splitting the file group to be split based on field value ranges respectively corresponding to a plurality of hash slots in the file group to be split.
7. The method according to claim 1, wherein the method further comprises:
converting a field value in the data lake under the target field into an integer type field value;
the obtaining the field value ranges corresponding to the hash slots respectively includes:
and acquiring field value ranges of integer types corresponding to the hash slots respectively.
8. The method according to claim 1, wherein the method further comprises:
calculating the ratio of the field value range of each hash slot to the total field value range corresponding to the data lake to obtain the range ratio corresponding to each hash slot; the total field value range is obtained based on the field value of the data in the data lake under the target field;
the merging, as a file group, at least two hash slots whose range similarity is greater than or equal to a range similarity threshold value, of the plurality of hash slots includes:
determining a target hash slot with the range proportion being greater than a range proportion threshold value from the plurality of hash slots;
and merging at least two hash slots with the range similarity larger than or equal to the range similarity threshold value as a file group in the hash slots except the target hash slot.
9. The method of claim 1, wherein the number of the plurality of hash slots is determined based on an amount of data corresponding to the data stored for the first time in the data lake.
10. A data querying device, wherein the data lake comprises a plurality of hash slots, each hash slot comprising a plurality of pieces of data, the device comprising:
The range acquisition module is used for acquiring field value ranges corresponding to the hash slots respectively; the field value range is obtained based on field values of a plurality of pieces of data in the hash slot under a target field;
The hash slot merging module is used for merging at least two hash slots with the range similarity being greater than or equal to a range similarity threshold value in the plurality of hash slots to be used as file groups; the number of the file groups is smaller than the number of hash slots; the range similarity is used for indicating the coincidence degree of the two field value ranges;
the file group determining module is used for determining a target file group with a field value range comprising the target field value from a plurality of file groups based on the target field value corresponding to the target field in the data query condition;
And the data query module is used for querying the target file group based on the data query condition.
CN202410138600.8A 2024-01-31 2024-01-31 Data query method and related device Pending CN118012826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410138600.8A CN118012826A (en) 2024-01-31 2024-01-31 Data query method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410138600.8A CN118012826A (en) 2024-01-31 2024-01-31 Data query method and related device

Publications (1)

Publication Number Publication Date
CN118012826A true CN118012826A (en) 2024-05-10

Family

ID=90942218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410138600.8A Pending CN118012826A (en) 2024-01-31 2024-01-31 Data query method and related device

Country Status (1)

Country Link
CN (1) CN118012826A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination