CN110990394B - Method, device and storage medium for counting number of rows of distributed column database table - Google Patents

Info

Publication number
CN110990394B
Authority
CN
China
Prior art keywords
data
data file
hbase table
column group
target column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811143454.9A
Other languages
Chinese (zh)
Other versions
CN110990394A (en
Inventor
刘勇
郭峰
于峰
张泉锦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811143454.9A priority Critical patent/CN110990394B/en
Publication of CN110990394A publication Critical patent/CN110990394A/en
Application granted granted Critical
Publication of CN110990394B publication Critical patent/CN110990394B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method and a device for counting the number of rows of a distributed column-oriented database table, and a readable storage medium, and belongs to the technical field of computers. The method comprises: obtaining the number of rows of each data file based on the data files corresponding to a target column group in an HBase table of a distributed column-oriented database, and obtaining the number of rows of the HBase table by accumulating the numbers of rows corresponding to the data files. The method does not need to traverse all data of the HBase table, reduces the amount of data cached and read during row counting, reduces the consumption of Input/Output (I/O) resources, and improves I/O efficiency.

Description

Method, device and storage medium for counting number of rows of distributed column database table
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a method, a device and a storage medium for counting the number of rows of a distributed column database table.
Background
HBase (Hadoop DataBase) is a distributed column-oriented database for the storage and management of large-scale data. The data in the HBase database is stored in the form of HBase tables, with one record stored in one row of an HBase table, and one HBase table may include up to several billion records (i.e., rows). In business scenarios that use the HBase database, many requirements involve counting the number of records in an HBase table, i.e., the number of rows, and for such a huge data volume an efficient row counting method for the HBase table needs to be designed.
The row counting (RowCounter) function of the HBase database can count the rows of an HBase table. When the RowCounter function is used for row counting, all data in the data files included in the HBase table, i.e., the Hadoop Files (HFiles), needs to be scanned; in this process, the scanned HFiles need to be cached and the rows need to be counted according to the row key values.
Using the RowCounter function for row counting therefore occupies a large amount of cache for a long time and consumes a large amount of Input/Output (I/O) resources.
Disclosure of Invention
The present disclosure provides a method, an apparatus, and a storage medium for counting the number of rows of a distributed column-oriented database table, so as to overcome the problem in the related art that counting the rows of an HBase table occupies a large amount of input/output resources. The technical scheme is as follows:
In one aspect, a method for counting the number of rows of a distributed column-oriented database table is provided, the method comprising: acquiring data files corresponding to a target column group in an HBase table, wherein there is at least one data file, each data file comprises data located in the target column group, and the target column group comprises at least one column of the HBase table whose data is not empty; acquiring the number of rows corresponding to each data file based on the data in each data file; and accumulating the numbers of rows corresponding to the data files, and obtaining the number of rows of the HBase table according to the accumulated result.
Optionally, the obtaining a data file corresponding to the target column group in the HBase table includes: acquiring initial cache data in the HBase table; converting the initial cache data into one or more initial data files; and acquiring a data file corresponding to the target column group in the HBase table from the initial data file.
Optionally, the obtaining a data file corresponding to the target column group in the HBase table includes: acquiring target cache data corresponding to a target column group in the HBase table; and converting the target cache data into a data file to obtain the data file corresponding to the target column group.
Optionally, the obtaining a data file corresponding to the target column group in the HBase table includes: and determining a data file corresponding to a target column group in each block of the HBase table, wherein the block comprises data positioned in one or more rows of the HBase table and positioned in the target column group of the HBase table.
Optionally, the method further comprises: when any block is deleted from the HBase table, removing the number of rows of the data file corresponding to the target column group in the deleted block from the number of rows of the HBase table.
Optionally, the method further comprises: if the data in any block changes, acquiring a data file corresponding to the target column group in the changed block, and updating the number of rows of the HBase table according to the data file corresponding to the target column group in the changed block.
Optionally, the method further comprises: when any block corresponds to a plurality of data files, merging the plurality of data files corresponding to that block into one data file, and deleting overlapping data to obtain an optimized data file; and acquiring the number of rows corresponding to the optimized data file based on the data in the optimized data file, and acquiring the number of rows of the HBase table according to the number of rows corresponding to the optimized data file.
Optionally, the acquiring the number of rows corresponding to each data file based on the data in each data file includes: performing row-type Bloom filtering, i.e., Bloom filtering keyed on the rows of the HBase table, on the data in each data file; and obtaining the number of rows of each data file based on the result of the Bloom filtering.
Optionally, the method further comprises: when the HBase table is created, enabling row-type Bloom filtering for the target column group in the HBase table.
Optionally, the method further comprises: selecting, from the columns of the HBase table, a column whose proportion of null data is not greater than a proportion threshold, to obtain the target column group.
In another aspect, there is provided a device for counting the number of rows of a distributed column-oriented database table, the device comprising: a first acquisition module, used for acquiring data files corresponding to a target column group in an HBase table, wherein there is at least one data file, each data file comprises data located in the target column group, and the target column group comprises at least one column of the HBase table whose data is not empty; a second acquisition module, used for acquiring the number of rows corresponding to each data file based on the data in each data file; and a third acquisition module, used for accumulating the numbers of rows corresponding to the data files and obtaining the number of rows of the HBase table according to the accumulated result.
Optionally, the first obtaining module is configured to obtain initial cache data in the HBase table; converting the initial cache data into one or more initial data files; and acquiring a data file corresponding to the target column group in the HBase table from the initial data file.
Optionally, the first obtaining module is configured to obtain target cache data corresponding to a target column group in the HBase table; and converting the target cache data into a data file to obtain the data file corresponding to the target column group.
Optionally, the first obtaining module is configured to determine a data file corresponding to a target column group in each block of the HBase table, where the block includes at least data located in one or more rows of the HBase table and located in the target column group of the HBase table.
Optionally, the apparatus further comprises: and the deleting module is used for deleting the row number of the data file corresponding to the target column group in the deleted block from the row number of the HBase table when any block is deleted from the HBase table.
Optionally, the apparatus further comprises: and the updating module is used for acquiring a data file corresponding to the target column group in the changed block if the data in any block is changed, and updating the row number of the HBase table according to the data file corresponding to the target column group in the changed block.
Optionally, the apparatus further comprises: a merging module, used for merging the plurality of data files corresponding to any block into one data file when that block corresponds to a plurality of data files, and deleting overlapping data to obtain an optimized data file; and a statistics module, used for acquiring the number of rows corresponding to the optimized data file based on the data in the optimized data file, and acquiring the number of rows of the HBase table according to the number of rows corresponding to the optimized data file.
Optionally, the second acquisition module is configured to perform row-type Bloom filtering, i.e., Bloom filtering keyed on the rows of the HBase table, on the data in each data file, and to obtain the number of rows of each data file based on the result of the Bloom filtering.
Optionally, the apparatus further comprises: an enabling module, used for enabling row-type Bloom filtering for the target column group in the HBase table when the HBase table is created.
Optionally, the apparatus further comprises: and a selecting module, configured to select a column with a data null ratio not greater than a ratio threshold from columns of the HBase table, to obtain the target column group.
In another aspect, a device for counting the number of rows of a distributed column-oriented database table is provided, the device comprising a processor and a memory, wherein at least one instruction is stored in the memory, and the instruction is loaded and executed by the processor to implement any one of the above-mentioned methods for counting the number of rows of the distributed column-oriented database table.
In another aspect, a computer-readable storage medium is provided, the storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement any one of the above-mentioned methods for counting the number of rows of a distributed column-oriented database table.
The technical scheme provided by the disclosure at least comprises the following beneficial effects:
The number of rows of each data file is obtained based on the data files corresponding to a target column group in an HBase table of a distributed column-oriented database, and the number of rows of the HBase table is obtained by accumulating the numbers of rows corresponding to the data files. The method does not need to traverse all data of the HBase table, reduces the amount of data cached and read during row counting, reduces the consumption of I/O resources, and improves I/O efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 illustrates an architecture diagram of an HBase database storage system provided by an embodiment of the present disclosure;
Fig. 2 is a schematic diagram illustrating a storage structure of an HBase table according to an embodiment of the present disclosure;
Fig. 3 illustrates a schematic diagram of data overlap in a column family provided by an embodiment of the present disclosure;
Fig. 4 shows a flowchart of a method for counting the number of rows of an HBase table according to an embodiment of the present disclosure;
Fig. 5 shows a flowchart of a method for counting the number of rows of an HBase table according to an embodiment of the present disclosure;
Fig. 6 shows a block diagram of a configuration of a device for counting the number of rows of an HBase table according to an embodiment of the present disclosure;
Fig. 7 shows a block diagram of a configuration of a device for counting the number of rows of an HBase table according to an embodiment of the present disclosure.
Detailed Description
For the purposes of clarity, technical solutions and advantages of the present disclosure, the following further details the embodiments of the present disclosure with reference to the accompanying drawings.
HBase is a distributed column-oriented database used for the storage and management of large-scale data. The data in the HBase database is stored in the form of HBase tables, with one record stored in one row of an HBase table, and one HBase table may include up to several billion records (i.e., rows). In business scenarios that use the HBase database, many requirements involve counting the number of records in an HBase table, i.e., the number of rows, and for such a huge data volume an efficient row counting method for the HBase table needs to be designed.
Referring to Fig. 1, a schematic diagram of an HBase database storage system according to an embodiment of the present disclosure is shown. The HBase database storage system is a distributed system and at least comprises an HBase Master (controller), a Region server and a ZooKeeper (coordinator).
The Region server is used for data read-write service; the HBase Master is used for data distribution, database creation, database deletion and other operations; the ZooKeeper is used for election, message notification, state coordination, service discovery and other operations.
Optionally, the Region server may be deployed on a first physical host, the HBase Master on a second physical host, and the ZooKeeper on a third physical host, where the first, second and third physical hosts may be the same host or different hosts.
It should be noted that, in addition to the components shown in Fig. 1, the HBase database storage system may further include a plurality of Region servers, where each Region server is deployed on a physical host. Furthermore, the architecture of the HBase database storage system is not limited to the above examples.
In the HBase database, the HBase table storing data is a distributed multidimensional table, and the data in the HBase table is indexed, queried and located by the following parameters:
Row key (rowkey): identifies the row in which the data is located in the HBase table; the rowkey may also be used to sort the records in the HBase table.
Column family: identifies the column family in which the data is located in the HBase table; a column family comprises one or more columns.
Column name: identifies the column, within its column family, in which the data is located in the HBase table.
Timestamp: identifies the time at which the data was written into the HBase table.
Referring to table 1, one example of the logical structure of the HBase table is shown. The HBase table comprises at least three records. The Rowkey of the first record is Rowkey001, which includes data of: data "4" located in column family info1, column c; data "1" located in column family info1, column a. The Rowkey of the second record is Rowkey002, which includes data of: data "2" located in column family info1, column a; data "6" located in column family info2, column f; data "2" located in column family info2, column e. The Rowkey of the third record is Rowkey003, which includes data of: data "2" located in column family info1, column b; data "3" located in column family info1, column a.
TABLE 1
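As a rough illustration only, the logical structure described for Table 1 can be modelled as nested mappings from rowkey to column family to column name to value. The names `table` and `get_cell` are hypothetical and not part of HBase, and timestamps are omitted because the example above does not state them.

```python
# Logical view of the example table: rowkey -> column family -> column -> value.
# Timestamps are left out because the example does not give them.
table = {
    "Rowkey001": {"info1": {"a": "1", "c": "4"}},
    "Rowkey002": {"info1": {"a": "2"}, "info2": {"e": "2", "f": "6"}},
    "Rowkey003": {"info1": {"a": "3", "b": "2"}},
}


def get_cell(table, rowkey, family, column):
    """Return the value at (rowkey, family, column), or None if the cell is empty."""
    return table.get(rowkey, {}).get(family, {}).get(column)


print(get_cell(table, "Rowkey002", "info2", "f"))  # 6
print(get_cell(table, "Rowkey001", "info2", "f"))  # None
```

Note that Rowkey001 and Rowkey003 hold no data in column family info2, which is why the choice of column family matters when rows are counted from the files of a single family.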
The HBase table is stored in a distributed structure. Referring to Fig. 2, a schematic diagram of the storage structure of the HBase table is shown. The HBase table shown in Fig. 2 includes a plurality of Regions (blocks) in the row dimension; each Region is divided into a plurality of column families; and each column family includes a MemStore and a plurality of HFiles.
An HBase table may be partitioned into multiple Regions in the row dimension. One Region is the smallest unit of distributed storage, i.e., one Region cannot be stored across different servers, and the sizes of the Regions may differ. As more records and data are inserted, a Region grows, and when the Region reaches a Region file threshold it can be split into two new Regions. Optionally, the splitting operation may be implemented by a split command.
A Region is divided into a plurality of column families. One column family consists of a MemStore stored in memory, or of a MemStore stored in memory together with one or more HFiles stored on hard disk. As data is continuously added, the data is first written into the MemStore as cache data. When the amount of cache data in the MemStore reaches the cache threshold, the cache data in the MemStore is converted into an HFile and stored on the physical host on which the Region is located. The cache threshold may be set according to the actual situation. The conversion operation may be implemented by a flush command.
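The write path just described can be sketched as a toy model. This is a simplified illustration under stated assumptions, not HBase's implementation: the class `ColumnFamilyStore`, its fields, and the choice to measure the cache threshold in number of cells are all hypothetical.

```python
class ColumnFamilyStore:
    """Toy model of one column family in one Region: a MemStore plus HFiles."""

    def __init__(self, cache_threshold):
        self.cache_threshold = cache_threshold  # assumed unit: number of cached cells
        self.memstore = {}  # (rowkey, column) -> value, the in-memory cache
        self.hfiles = []    # each HFile: list of (rowkey, column, value), sorted

    def put(self, rowkey, column, value):
        """Write to the MemStore first; flush once the threshold is reached."""
        self.memstore[(rowkey, column)] = value
        if len(self.memstore) >= self.cache_threshold:
            self.flush()

    def flush(self):
        """Convert the cached cells into one file sorted by rowkey, then column."""
        if not self.memstore:
            return
        hfile = sorted((rk, col, val) for (rk, col), val in self.memstore.items())
        self.hfiles.append(hfile)
        self.memstore = {}
```

Because each flush produces a new file, a column family that receives writes over time typically ends up with several HFiles, which is why the row counts of multiple files have to be combined later.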
When data in the HBase table is updated or deleted, the update or deletion operation generates a new HFile. The new HFile and the HFiles that existed before the operation are data files with different timestamps, and data may overlap between them; the overlap may be between expired data and valid data. Accordingly, all HFiles within a Region can be merged so that the expired data in the overlapping data is deleted. The merge operation may be implemented by a major compact command.
Referring to Fig. 3, an example of data overlap in one column family is shown. As shown in Fig. 3, the column family includes a MemStore and N HFiles; the first HFile is HFile1 and the second HFile is HFile2. The first data with row key Rowkey001 in HFile1 overlaps with the first data with row key Rowkey001 in HFile3, and the second data with row key Rowkey001 in HFile1 overlaps with the second data with row key Rowkey001 in HFile3. HFile1 and HFile3 can be merged by a major compact command, and the expired data is deleted.
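A minimal sketch of the merge idea follows, assuming each HFile is a list of `(rowkey, column, timestamp, value)` tuples and that a value of `None` stands for a delete marker. The function name `major_compact` and this data layout are illustrative assumptions, not the HBase implementation.

```python
def major_compact(hfiles):
    """Merge several HFiles into one, keeping only the newest version of each cell.

    Each HFile is a list of (rowkey, column, timestamp, value) tuples.
    A value of None is treated as a delete marker in this toy model.
    """
    newest = {}
    for hfile in hfiles:
        for rowkey, column, timestamp, value in hfile:
            key = (rowkey, column)
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    merged = [
        (rowkey, column, ts, value)
        for (rowkey, column), (ts, value) in newest.items()
        if value is not None  # drop cells whose newest version is a delete
    ]
    return sorted(merged)


hfile1 = [("Rowkey001", "a", 1, "1"), ("Rowkey001", "c", 1, "4")]
hfile3 = [("Rowkey001", "a", 2, "9"), ("Rowkey001", "c", 2, None)]
print(major_compact([hfile1, hfile3]))  # [('Rowkey001', 'a', 2, '9')]
```

Before the merge, Rowkey001 appears in both files; after it, the row appears once, so a per-file row count no longer counts it twice.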
Referring to fig. 4, a flowchart of a method for counting the number of rows of a distributed column-oriented database table provided in an embodiment of the present disclosure is shown, where the method is executed by an HBase database storage system, and the method for counting the number of rows of the HBase table includes:
Step 401, determining a target column group in the HBase table.
The target column group is used for counting the number of rows of the HBase table, and comprises at least one column of the HBase table whose data is not empty.
In one possible implementation, step 401 includes: selecting, from the columns of the HBase table, a column whose proportion of null data is not greater than a proportion threshold, to obtain the target column group. That is, the target column group includes a column whose proportion of null data is not greater than the proportion threshold. The smaller the proportion threshold, the smaller the proportion of null data in the target column group, and the more accurate the number of rows obtained through the target column group.
For example, referring to the HBase table shown in Table 1, if none of the data values of the column named a in column group info1 is null, column group info1 is determined as the target column group of the HBase table.
As another example, M records, i.e., M rows of data, are obtained from the HBase table, where M is smaller than the total number of records stored in the HBase table. If the number of data values of a first column group in the M rows of data is not less than M×p, where p is a proportion threshold and 0 < p ≤ 1, the first column group is determined as the target column group of the HBase table and is used to count the total number of records stored in the HBase table, i.e., the number of rows.
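The sampling rule above can be sketched as follows. This is an illustrative reading only: the function name is hypothetical, each sampled row is assumed to be a mapping from column group name to its non-empty columns, and "number of data values" is interpreted as the number of sampled rows that hold data in the column group.

```python
def is_target_column_group(sample_rows, family, p):
    """Return True if at least M*p of the M sampled rows hold data in `family`.

    sample_rows: list of dicts mapping column group name -> {column: value}.
    p: proportion threshold, 0 < p <= 1.
    """
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    m = len(sample_rows)
    non_empty = sum(1 for row in sample_rows if row.get(family))
    return m > 0 and non_empty >= m * p


# The three records of the Table 1 example, used here as the sample.
sample = [
    {"info1": {"a": "1", "c": "4"}},
    {"info1": {"a": "2"}, "info2": {"e": "2", "f": "6"}},
    {"info1": {"a": "3", "b": "2"}},
]
print(is_target_column_group(sample, "info1", 1.0))  # True
print(is_target_column_group(sample, "info2", 0.5))  # False
```

With p close to 1, only column groups that are populated in nearly every sampled row qualify, which keeps the resulting row count close to the true value.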
It should be noted that this step 401 is optional. For example, the target column group in the HBase table may also be specified by the user, in which embodiment step 401 need not be performed.
Step 402, obtaining a data file corresponding to a target column group in the HBase table.
There is at least one data file, and each data file comprises data located in the target column group.
In an example of an HBase table stored on a column basis, the HBase table includes one or more data files: a first data file includes data located in a first row set and a first column group, a second data file includes data located in the first row set and a second column group, a third data file includes data located in a second row set and the first column group, and a fourth data file includes data located in the second row set and the second column group. Taking the HBase table in this example, if the target column group is determined to be the first column group, step 402 includes:
acquiring the first data file and the third data file in the HBase table, based on the first column group, as the data files.
In one possible implementation, the data of the HBase table is stored in the form of a block comprising at least one data file, step 402 comprising:
and determining the data file corresponding to the target column group in each block of the HBase table. Wherein the block includes data located in one or more rows of the HBase table and located in a target column group of the HBase table.
Optionally, the block may further include data located in one or more rows of the HBase table and located in other column groups of the HBase table that are not target column groups.
Illustratively, when the block is a Region and the data file is an HFile, step 402 includes:
acquiring, as the data files, the HFiles corresponding to the target column group from the column groups of the one or more Regions in which the data of the target column group in the HBase table is located.
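A minimal sketch of this selection, assuming each Region is represented as a mapping from column group name to a list of HFile names; the representation, the function name and the file names are hypothetical.

```python
def target_data_files(regions, target_family):
    """Collect the HFiles of the target column group from every Region."""
    files = []
    for region in regions:
        # A Region with no files for the target column group contributes nothing.
        files.extend(region.get(target_family, []))
    return files


regions = [
    {"info1": ["r1-info1-hfile1", "r1-info1-hfile2"], "info2": ["r1-info2-hfile1"]},
    {"info1": ["r2-info1-hfile1"]},
]
print(target_data_files(regions, "info1"))
```

Only the files of the target column group are returned, so the files of other column groups never need to be read for counting.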
In another possible implementation, the data in the HBase table may be stored as initial cache data in addition to being stored in data files, and step 402 includes:
acquiring initial cache data in an HBase table;
converting the initial cache data into one or more initial data files;
And acquiring a data file corresponding to the target column group in the HBase table from the initial data file.
Optionally, the initial data file includes a data file corresponding to a target column group in the HBase table and a data file corresponding to other column groups.
In another possible implementation, the cache data corresponding to the target column group in the HBase table is target cache data, and step 402 includes:
acquiring target cache data corresponding to a target column group in an HBase table;
and converting the target cache data into a data file.
Illustratively, the target cache data is data in MemStore, the data file is an HFile file, and the implementation may include:
acquiring MemStore in one or more column families of Region where data of the HBase table is located;
the one or more MemStores are converted into one or more HFile files through flush commands, respectively, and the one or more HFile files are used as data files.
Step 403, acquiring the row number corresponding to each data file based on the data in each data file.
If the HBase table has one data file, the number of rows corresponding to that data file is obtained. If there are multiple data files, the number of rows corresponding to one data file is obtained based on the data in that file, and the operation is repeated until the number of rows corresponding to each data file has been obtained.
Further, step 403 may include:
acquiring the rowkey of one piece of data in a data file, and acquiring a temporary variable R that holds the row count of the data file, where R is a positive integer;
comparing the rowkey of the data with the rowkey of the previous piece of data in the data file;
if the rowkey of the data is different from the rowkey of the previous piece of data, adding 1 to the value of the temporary variable R;
repeating the above steps until all data in the data file has been processed, to obtain the final value of R, which is the number of rows of the data file.
In this process, when the first piece of data in the data file is processed, the above comparison is not required, and the value of the temporary variable R is set to 1.
Illustratively, the data file is an HFile. When a flush operation is triggered, data is written into the HFile in rowkey order, so all data with the same rowkey is written adjacently. The rows can therefore be counted by comparing the rowkey of each piece of data with the rowkey of the previous piece of data: when the two differ, the row count is updated.
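The counting steps above can be sketched as follows, assuming the rowkeys of a data file are available as a sequence already sorted by rowkey. The function name `count_rows` is hypothetical; the local variable `row_count` plays the role of the temporary variable R.

```python
def count_rows(sorted_rowkeys):
    """Count distinct rowkeys in a sequence in which equal rowkeys are adjacent."""
    row_count = 0    # the temporary variable R
    previous = None
    for index, rowkey in enumerate(sorted_rowkeys):
        # The first piece of data needs no comparison; it always starts a new row.
        if index == 0 or rowkey != previous:
            row_count += 1
        previous = rowkey
    return row_count


cells = ["Rowkey001", "Rowkey001", "Rowkey002", "Rowkey003", "Rowkey003"]
print(count_rows(cells))  # 3
```

The sketch relies on the adjacency guarantee stated above; on an unsorted sequence it would overcount, because the same rowkey could start several separate runs.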
Further, the above implementation of obtaining the number of rows of a data file may be realised with the Bloom filter function of the HBase database. The Bloom filter function performs Bloom filtering on data in the HBase database within a specified data range according to set conditions. Here, row-type Bloom filtering, i.e., Bloom filtering keyed on the rows of the HBase table, can be performed on the data in each data file, and the number of rows of each data file is obtained from the Bloom filtering result.
An example of the Bloom filtering result obtained for one HFile is shown below. In this example, the value 1511212 corresponding to "No of Keys in bloom" is the number of rows of the HFile.
Bloom filter:
BloomSize:6356992
No of Keys in bloom:1511212
Max Keys for bloom:1515867
Percentage filled:100%
Number of chunks:49
Comparator:RawBytesComparator
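As a hedged sketch, the row count can be read out of a textual summary like the one above with a small parser. The function name `keys_in_bloom` is hypothetical, and the code assumes only the `No of Keys in bloom:<number>` line format shown in the example; it does not call any HBase API.

```python
import re


def keys_in_bloom(summary_text):
    """Extract the 'No of Keys in bloom' value from a Bloom filter summary."""
    match = re.search(r"No of Keys in bloom\s*:\s*(\d+)", summary_text)
    if match is None:
        raise ValueError("no Bloom filter key count found")
    return int(match.group(1))


summary = """Bloom filter:
BloomSize:6356992
No of Keys in bloom:1511212
Max Keys for bloom:1515867
Percentage filled:100%
Number of chunks:49
Comparator:RawBytesComparator"""

print(keys_in_bloom(summary))  # 1511212
```

Reading this single metadata value avoids scanning the cells of the file, which is where the I/O saving described in this disclosure comes from.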
An example in which embodiments of the present disclosure obtain the above Bloom filtering result through computer code is shown below.
In this implementation manner, the method for counting the number of rows of the HBase table further includes:
when the HBase table is created, enabling row-type Bloom filtering for the target column group in the HBase table.
Step 404, accumulating the numbers of rows corresponding to the data files, and obtaining the number of rows of the HBase table according to the accumulated result.
If there is one data file, the number of rows of that data file is the number of rows of the HBase table.
If there are multiple data files, the numbers of rows of the data files are accumulated, and the number of rows of the HBase table is obtained according to the accumulated result.
In one possible implementation manner, the HBase table includes one or more regions where data is located, where each Region includes one or more target HFile files, where the target HFile files are HFile files corresponding to a target column group in the HBase table. The number of rows of the HBase table is obtained by respectively accumulating the number of rows of all the target HFile files in the one or more regions.
Illustratively, the number of rows of the HBase table is calculated as shown in Equation 1.
$$\mathrm{sumOfRowkey} = \sum_{i=1}^{n} \sum_{j=1}^{g(i)} f(i, j) \qquad (1)$$
Here sumOfRowkey is the number of rows of the HBase table, n is the number of Regions included in the HBase table, the function g(i) is the number of target HFiles in the i-th Region of the HBase table, and the function f(i, j) is the number of rows of the j-th target HFile in the i-th Region of the HBase table.
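Using the definitions above, the accumulation can be sketched as a double sum over Regions and their target HFiles. The nested-list representation and the function name `total_row_count` are assumptions made for illustration.

```python
def total_row_count(regions):
    """Sum f(i, j) over all Regions i and all target HFiles j in each Region.

    regions[i] is the list of row counts of the target HFiles in the i-th
    Region, so len(regions) is n and len(regions[i]) is g(i).
    """
    return sum(sum(hfile_row_counts) for hfile_row_counts in regions)


# Three Regions holding 2, 1 and 0 target HFiles respectively.
print(total_row_count([[10, 5], [7], []]))  # 22
```

A Region with no target HFiles contributes zero to the total.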
In an optional implementation, the data of the HBase table is stored in the form of blocks. If one of the blocks is deleted, the number of rows of the HBase table decreases accordingly, and the number of rows of the HBase table needs to be recounted. In order to improve the accuracy of the row count of the HBase table, the method for counting the number of rows of the HBase table further comprises:
Step 405, when any block is deleted from the HBase table, removing the number of rows of the data file corresponding to the target column group in the deleted block from the number of rows of the HBase table.
In another optional implementation manner, if the data of the block where the data in the HBase table is located changes, the number of rows in the HBase table may also change, so as to improve the accuracy of the number of rows statistics of the HBase table, the method for counting the number of rows of the HBase table further includes:
Step 406, if the data in any block changes, acquiring a data file corresponding to the target column group in the changed block, and updating the number of rows of the HBase table according to the data file corresponding to the target column group in the changed block.
For example, if the data in a Region of the HBase table changes, the one or more target HFiles corresponding to the target column group in that Region are obtained; the numbers of rows of the one or more target HFiles are obtained according to the method in step 403; and these numbers of rows are used to update the corresponding terms in Equation 1, thereby obtaining the updated number of rows of the HBase table.
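The handling of deleted and changed blocks can be sketched by keeping the per-Region terms of the sum and replacing or removing only the affected ones. The class `TableRowCounter` and its method names are hypothetical, introduced only for this illustration.

```python
class TableRowCounter:
    """Keeps per-Region row counts so that the table total can be updated cheaply."""

    def __init__(self):
        self.region_counts = {}  # region id -> list of per-HFile row counts

    def set_region(self, region_id, hfile_row_counts):
        """Record the counts of a new Region, or refresh those of a changed one."""
        self.region_counts[region_id] = list(hfile_row_counts)

    def delete_region(self, region_id):
        """Remove a deleted Region's contribution from the total."""
        self.region_counts.pop(region_id, None)

    def total(self):
        """Row count of the table: the sum over all Regions and their HFiles."""
        return sum(sum(counts) for counts in self.region_counts.values())


counter = TableRowCounter()
counter.set_region("region-1", [10, 5])
counter.set_region("region-2", [7])
counter.set_region("region-1", [12, 5])  # data in region-1 changed
counter.delete_region("region-2")        # region-2 was deleted
print(counter.total())  # 17
```

Only the changed or deleted Region is touched; the counts of all other Regions are reused as they are.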
In another possible implementation, because of operations such as data updates and data deletions, the HBase table may store expired data, and data may overlap between different data files of the HBase table. Because of this overlap, the number of rows of the HBase table obtained from the row counts of the data files may exceed the true value. To solve the data overlap problem, the row count method of the HBase table further includes:
Step 407, when any block corresponds to multiple data files, merging the multiple data files corresponding to that block into one data file and deleting the overlapping data to obtain an optimized data file; acquiring the number of rows corresponding to the optimized data file based on the data in the optimized data file, and acquiring the number of rows of the HBase table according to the number of rows corresponding to the optimized data file.
Optionally, the plurality of HFile files in a column group in a Region of the HBase table are merged into one HFile file by a major compaction command, and the overlapping data is deleted.
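To see why overlapping data inflates the total and how merging corrects it, consider the simplified model below. It assumes each data file is a list of `(row_key, timestamp, is_delete_marker)` records, keeps the newest record per row key, and drops rows whose newest record is a delete marker. The record layout and the function name `merge_data_files` are illustrative assumptions; the real HBase merge works per cell and is considerably more involved.

```python
from typing import Dict, Iterable, List, Tuple

# (row_key, timestamp, is_delete_marker): a simplified, hypothetical record layout
Entry = Tuple[str, int, bool]


def merge_data_files(files: Iterable[List[Entry]]) -> List[Entry]:
    """Merge several data files into one, keeping only the newest record per
    row key and dropping rows whose newest record is a delete marker."""
    newest: Dict[str, Entry] = {}
    for data_file in files:
        for entry in data_file:
            key, timestamp, _ = entry
            if key not in newest or timestamp > newest[key][1]:
                newest[key] = entry
    return sorted(e for e in newest.values() if not e[2])


file_a = [("row1", 1, False), ("row2", 1, False)]
file_b = [("row2", 2, False), ("row3", 2, False)]  # row2 overlaps with file_a
file_c = [("row1", 3, True)]                        # row1 was deleted later

naive_count = len(file_a) + len(file_b) + len(file_c)  # per-file sum, overcounts
optimized = merge_data_files([file_a, file_b, file_c])
optimized_count = len(optimized)
```

In this toy input the per-file sum is 5, while the merged file holds only the two rows that actually exist.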
In the embodiments of the present disclosure, the number of rows of each data file is obtained based on the data files corresponding to a target column group in an HBase table of a distributed column-oriented database, and the number of rows of the HBase table is obtained based on the accumulated row counts of the data files. The method does not need to traverse all data of the HBase table, which reduces the amount of data cached and read during row counting, reduces I/O resource consumption, and improves I/O efficiency.
In addition, the number of rows of each data file is obtained based on the result of Bloom filtering, which makes the row count method more efficient.
Referring to fig. 5, a flowchart of a method for counting the number of rows of an HBase table according to an embodiment of the present disclosure is shown. The method is applied to the HBase database storage system shown in fig. 1 and includes:
Step 501, the HBase Master determines the target column group in the HBase table.
See step 401 for details, which are not repeated here. This step is optional.
Step 502, the HBase Master queries a Region server where the Region included in the HBase table is located.
The HBase table may include one or more Regions; these Regions may be located on one or more Region servers, and a single Region server may host one or more of the Regions.
In one possible implementation, the HBase Master queries the Region server in which the Region included in the HBase table is located based on the identity of the Region server.
Optionally, the identifier of the Region server is stored in the relevant information of the HBase table.
Step 503, the Region server converts MemStore included in the HBase table into an HFile file.
In the HBase database storage system shown in fig. 1, a plurality of regions included in the HBase table may be distributed over a plurality of Region servers, and one implementation of step 503 includes:
The Region server looks up the one or more Regions of the HBase table that are located on its own host;
the Region server converts the MemStore in the one or more column families included in the one or more Regions into HFile files via a flush command.
The above steps are repeated for each of the Region servers over which the Regions of the HBase table are distributed, until the cached data in all Regions included in the HBase table has been converted into data files.
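Because the row count is derived from data files, data still held in the write cache has to be flushed first or it would be missed. The toy model below illustrates the idea with an in-memory buffer that is turned into a sorted, immutable file on flush; the class `ToyColumnFamilyStore` is a hypothetical stand-in and does not reflect how HBase implements MemStore or HFile internally.

```python
from typing import Dict, List, Tuple


class ToyColumnFamilyStore:
    """Toy model: an in-memory write buffer plus a list of immutable data files."""

    def __init__(self) -> None:
        self.memstore: Dict[str, str] = {}              # row key -> value, not yet on disk
        self.hfiles: List[List[Tuple[str, str]]] = []   # each file is a sorted list

    def put(self, row_key: str, value: str) -> None:
        """Buffer a write in memory."""
        self.memstore[row_key] = value

    def flush(self) -> None:
        """Turn the buffered data into one new sorted data file and clear the buffer."""
        if not self.memstore:
            return
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}


store = ToyColumnFamilyStore()
store.put("row2", "b")
store.put("row1", "a")
files_before_flush = len(store.hfiles)  # buffered rows are invisible to file-based counting
store.flush()
files_after_flush = len(store.hfiles)
```

Before the flush, a count based only on data files would see no rows in this store; after it, both rows are present in a file.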
Step 504, the Region server merges the multiple HFile files in each Region included in the HBase table into one HFile file through the major compaction command, and deletes the overlapping data.
That is, the multiple HFile files in a Region are merged into one HFile file through the major compaction command and the overlapping data is deleted; this operation is performed on each Region included in the HBase table until all Regions included in the HBase table have completed the merging of their HFile files.
Step 505, the Region server obtains the target HFile included in the HBase table based on the target column family.
Optionally, the target HFile is the one or more HFiles included in the column group corresponding to the target column group.
The Region server acquires the Regions of the HBase table located on its own host; for each Region, it acquires the column group corresponding to the target column group in that Region and acquires the one or more target HFiles included in that column group; this operation is repeated for each of the Regions included in the HBase table.
Step 506, the Region server obtains the number of rows of the target HFile file from the Bloom filtering result.
See step 403, which is not described in detail herein.
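Step 403 is not reproduced in this passage, so the following is only a guess at the general idea rather than the disclosed procedure: a row-type Bloom filter is built from the row keys of a data file, and a count of the distinct keys added can be kept alongside it. The sketch is a generic Bloom filter with such a counter; the class name, the hash scheme, and the assumption that keys arrive in sorted order (so repeated row keys are adjacent) are all illustrative assumptions.

```python
import hashlib
from typing import Iterator, Optional


class CountingRowBloomFilter:
    """Generic Bloom filter that also counts the distinct sorted keys added."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity
        self.key_count = 0
        self._last_key: Optional[str] = None

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, row_key: str) -> None:
        # Assumes sorted input: a repeated row key is adjacent to its predecessor.
        if row_key == self._last_key:
            return
        for pos in self._positions(row_key):
            self.bits[pos] = 1
        self.key_count += 1
        self._last_key = row_key

    def might_contain(self, row_key: str) -> bool:
        """True if the key may be present; False means definitely absent."""
        return all(self.bits[pos] for pos in self._positions(row_key))


bloom = CountingRowBloomFilter()
for key in ["row1", "row1", "row2", "row3", "row3"]:
    bloom.add(key)
rows_in_file = bloom.key_count
```

Under these assumptions the row count is read from a small piece of metadata instead of being recomputed by scanning the file's contents.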
Step 507, obtaining the accumulated result of the row counts of the target HFile files obtained by the one or more Region servers, and obtaining the number of rows of the HBase table according to the accumulated result.
See step 404, which is not described in detail herein.
Optionally, the accumulated result may be calculated by a scripting tool and obtained by the HBase Master.
In this embodiment, the number of rows of each data file is obtained based on the data files corresponding to a target column group in an HBase table of a distributed column-oriented database, and the number of rows of the HBase table is obtained based on the accumulated row counts of the data files. The method does not need to traverse all data of the HBase table, which reduces the amount of data cached and read during row counting, reduces I/O resource consumption, and improves I/O efficiency.
In addition, the number of rows of each data file is obtained based on the result of Bloom filtering, which makes the row count method more efficient.
The following are device embodiments of the present disclosure, and for details of the device embodiments that are not described in detail, reference may be made to the method embodiments described above.
Referring to fig. 6, a block diagram of a configuration of an HBase table row count apparatus 600 according to an embodiment of the present disclosure is shown. The device comprises: a first acquisition module 610, a second acquisition module 620, and a third acquisition module 630.
The first obtaining module 610 is configured to obtain the data files corresponding to a target column group in the HBase table. There is at least one data file, each data file includes data located in the target column group, and the target column group includes at least one column whose data in the HBase table is non-empty.
The second obtaining module 620 is configured to obtain, based on the data in each data file, the number of rows corresponding to each data file.
The third obtaining module 630 is configured to obtain the accumulated result of the row counts corresponding to the data files, and to obtain the number of rows of the HBase table according to the accumulated result.
Optionally, the data file is an HFile file.
In one possible implementation, the second obtaining module 620 is configured to perform Bloom filtering of the row type on the data in each data file, and to obtain the number of rows of each data file based on the result of the Bloom filtering.
Optionally, the row count apparatus 600 further includes an opening module, where the opening module is configured to enable, when the HBase table is created, Bloom filtering of the row type for the target column group in the HBase table.
In another possible implementation, the first obtaining module 610 is configured to obtain initial cache data in the HBase table; convert the initial cache data into one or more initial data files; and acquire the data file corresponding to the target column group in the HBase table from the initial data files.
In another possible implementation, the first obtaining module 610 is configured to obtain target cache data corresponding to the target column group in the HBase table, and convert the target cache data into a data file.
In another possible implementation, the first obtaining module 610 is configured to determine the data file corresponding to the target column group in each block of the HBase table, where a block includes data located in one or more rows of the HBase table and in the target column group of the HBase table.
Optionally, the row count apparatus 600 further includes a deletion module, configured to, when any block is deleted from the HBase table, subtract the number of rows of the data file corresponding to the target column group in the deleted block from the number of rows of the HBase table.
Optionally, the row count apparatus 600 further includes an updating module, configured to, if the data in any block changes, obtain the data file corresponding to the target column group in the changed block, and update the number of rows of the HBase table according to that data file.
Optionally, the row count apparatus 600 further includes a merging module and a statistics module. The merging module is configured to, when any block corresponds to multiple data files, merge the multiple data files corresponding to that block into one data file and delete the overlapping data to obtain an optimized data file. The statistics module is configured to acquire the number of rows corresponding to the optimized data file based on the data in the optimized data file, and to acquire the number of rows of the HBase table according to the number of rows corresponding to the optimized data file.
Optionally, the row count apparatus 600 further includes a selecting module, configured to select, from the columns in the HBase table, the columns whose proportion of empty data is not greater than a proportion threshold, thereby obtaining the target column group.
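The selection rule applied by the selecting module can be illustrated with a short sketch. It assumes rows are available as dictionaries in which an empty cell is represented by `None`; the function name `select_target_columns`, the sample data, and the threshold value are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def select_target_columns(
    rows: Sequence[Dict[str, Optional[str]]],
    columns: Sequence[str],
    ratio_threshold: float,
) -> List[str]:
    """Return the columns whose proportion of empty values is <= ratio_threshold."""
    if not rows:
        return []
    selected = []
    for column in columns:
        empty = sum(1 for row in rows if row.get(column) is None)
        if empty / len(rows) <= ratio_threshold:
            selected.append(column)
    return selected


sample_rows = [
    {"name": "a", "phone": None, "email": "a@example.com"},
    {"name": "b", "phone": None, "email": None},
    {"name": "c", "phone": "123", "email": "c@example.com"},
    {"name": "d", "phone": None, "email": "d@example.com"},
]
target = select_target_columns(sample_rows, ["name", "phone", "email"], 0.25)
```

With a threshold of 0.25, `name` (no empty cells) and `email` (one empty cell in four) qualify, while `phone` (three empty cells in four) does not. A column with few empty cells appears in nearly every row, so files built from it miss few rows.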
With the apparatus of this embodiment, the number of rows of each data file is obtained based on the data files corresponding to a target column group in an HBase table of a distributed column-oriented database, and the number of rows of the HBase table is obtained based on the accumulated row counts of the data files. There is no need to traverse all data of the HBase table, which reduces the amount of data cached and read during row counting, reduces I/O resource consumption, and improves I/O efficiency.
In addition, the number of rows of each data file is obtained based on the result of Bloom filtering, which makes the row counting more efficient.
Referring to fig. 7, a schematic structural diagram of a device 700 for counting the number of rows of an HBase table according to an embodiment of the present disclosure is shown. The device may be a server or a terminal, in particular:
The row count apparatus 700 of the HBase table includes a Central Processing Unit (CPU) 701, a system memory 704 including a Random Access Memory (RAM) 702 and a Read Only Memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701. The row count apparatus 700 of the HBase table further includes a basic input/output system (I/O system) 706 that facilitates the transfer of information between the various devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse, keyboard, or the like, for a user to input information. Wherein both the display 708 and the input device 709 are coupled to the central processing unit 701 through an input output controller 710 coupled to the system bus 705. The basic input/output system 706 may also include an input/output controller 710 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 710 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer readable medium provide non-volatile storage for the row count apparatus 700 of the HBase table. That is, the mass storage device 707 may include a computer readable medium (not shown) such as a hard disk or CD-ROM drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 704 and mass storage device 707 described above may be collectively referred to as memory.
According to various embodiments of the present disclosure, the row count apparatus 700 of the HBase table may also operate by connecting, through a network such as the Internet, to a remote computer on the network. That is, the row count apparatus 700 of the HBase table may be connected to the network 712 through a network interface unit 711 connected to the system bus 705; the network interface unit 711 may also be used to connect to other types of networks or remote computer systems (not shown).
The memory further stores one or more programs, which are configured to be executed by the CPU. The one or more programs include instructions for performing the method of counting the number of rows of the HBase table provided in either of fig. 4 and fig. 5.
The disclosed embodiments also provide a non-transitory computer readable storage medium storing instructions that, when executed by a processor of a computing system, enable the computing system to perform the method of counting the number of rows of the HBase table provided in either of fig. 4 and fig. 5.
Also provided is a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of counting the number of rows of the HBase table provided in either of fig. 4 and fig. 5.
Optionally, the above computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The foregoing description of the exemplary embodiments of the application is not intended to limit the application to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the application.

Claims (20)

1. A method for counting the number of rows in a distributed column-oriented database table, the method comprising:
selecting, from the columns of an HBase table of a distributed column-oriented database, columns whose proportion of empty data is not greater than a proportion threshold, to obtain a target column group;
acquiring data files corresponding to a target column group in the HBase table, wherein at least one data file is provided, each data file comprises data positioned in the target column group, and the target column group comprises at least one column with data which is not empty in the HBase table;
acquiring the row number corresponding to each data file based on the data in each data file;
and obtaining an accumulated result of the row counts corresponding to the data files, and obtaining the number of rows of the HBase table according to the accumulated result.
2. The method according to claim 1, wherein the obtaining the data file corresponding to the target column group in the HBase table includes:
acquiring initial cache data in the HBase table;
converting the initial cache data into one or more initial data files;
and acquiring a data file corresponding to the target column group in the HBase table from the initial data file.
3. The method according to claim 1, wherein the obtaining the data file corresponding to the target column group in the HBase table includes:
acquiring target cache data corresponding to a target column group in the HBase table;
and converting the target cache data into a data file to obtain the data file corresponding to the target column group.
4. The method according to claim 1, wherein the obtaining the data file corresponding to the target column group in the HBase table includes:
and determining a data file corresponding to a target column group in each block of the HBase table, wherein the block comprises data positioned in one or more rows of the HBase table and positioned in the target column group of the HBase table.
5. The method according to claim 4, wherein the method further comprises:
when any block is deleted from the HBase table, removing the number of rows of the data file corresponding to the target column group in the deleted block from the number of rows of the HBase table.
6. The method according to claim 4, wherein the method further comprises:
if the data in any block changes, acquiring a data file corresponding to the target column group in the changed block, and updating the number of rows of the HBase table according to the data file corresponding to the target column group in the changed block.
7. The method according to claim 4, wherein the method further comprises:
when any block corresponds to multiple data files, merging the multiple data files corresponding to that block into one data file, and deleting overlapping data to obtain an optimized data file;
and acquiring the number of rows corresponding to the optimized data file based on the data in the optimized data file, and acquiring the number of rows of the HBase table according to the number of rows corresponding to the optimized data file.
8. The method according to claim 1, wherein the obtaining the number of lines corresponding to each data file based on the data in each data file comprises:
performing Bloom filtering of the row type on the data in each data file;
obtaining the number of rows of each data file based on the result of the Bloom filtering.
9. The method of claim 8, wherein the method further comprises:
when the HBase table is created, enabling Bloom filtering of the row type for the target column group in the HBase table.
10. A device for counting the number of rows of a distributed column-oriented database table, the device comprising:
the selection module is used for selecting, from the columns of an HBase table of a distributed column-oriented database, columns whose proportion of empty data is not greater than a proportion threshold, to obtain a target column group;
the first acquisition module is used for acquiring data files corresponding to a target column group in the HBase table, wherein the number of the data files is at least one, each data file comprises data positioned in the target column group, and the target column group comprises at least one column with non-empty data in the HBase table;
the second acquisition module is used for acquiring the row number corresponding to each data file based on the data in each data file;
and the third obtaining module is used for obtaining the accumulated result of the row counts corresponding to the data files and obtaining the number of rows of the HBase table according to the accumulated result.
11. The apparatus of claim 10, wherein the first acquisition module is configured to acquire initial cache data in the HBase table; convert the initial cache data into one or more initial data files; and acquire a data file corresponding to the target column group in the HBase table from the initial data file.
12. The apparatus of claim 10, wherein the first obtaining module is configured to obtain target cache data corresponding to a target column group in the HBase table; and converting the target cache data into a data file to obtain the data file corresponding to the target column group.
13. The apparatus of claim 10, wherein the first obtaining module is configured to determine a data file corresponding to a target column group in each block of the HBase table, the block including data located in one or more rows of the HBase table and located in the target column group of the HBase table.
14. The apparatus of claim 13, wherein the apparatus further comprises:
and the deleting module is used for removing the number of rows of the data file corresponding to the target column group in the deleted block from the number of rows of the HBase table when any block is deleted from the HBase table.
15. The apparatus of claim 13, wherein the apparatus further comprises:
and the updating module is used for acquiring a data file corresponding to the target column group in the changed block if the data in any block is changed, and updating the row number of the HBase table according to the data file corresponding to the target column group in the changed block.
16. The apparatus of claim 13, wherein the apparatus further comprises:
the merging module is used for merging, when any block corresponds to multiple data files, the multiple data files corresponding to that block into one data file and deleting overlapping data to obtain an optimized data file;
the statistics module is used for acquiring the number of rows corresponding to the optimized data file based on the data in the optimized data file, and acquiring the number of rows of the HBase table according to the number of rows corresponding to the optimized data file.
17. The apparatus of claim 10, wherein the second obtaining module is configured to perform Bloom filtering of the row type on the data in each data file, and to obtain the number of rows of each data file based on the result of the Bloom filtering.
18. The apparatus of claim 17, wherein the apparatus further comprises:
and the starting module is used for enabling Bloom filtering of the row type for the target column group in the HBase table when the HBase table is created.
19. A distributed column-oriented database table row number statistics apparatus, characterized in that the apparatus comprises a processor and a memory, the memory having stored therein at least one instruction, which is loaded and executed by the processor to implement the distributed column-oriented database table row number statistics method according to any of claims 1-9.
20. A computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the distributed column-oriented database table row count method of any of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811143454.9A CN110990394B (en) 2018-09-28 2018-09-28 Method, device and storage medium for counting number of rows of distributed column database table

Publications (2)

Publication Number Publication Date
CN110990394A CN110990394A (en) 2020-04-10
CN110990394B true CN110990394B (en) 2023-10-20

Family

ID=70059736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811143454.9A Active CN110990394B (en) 2018-09-28 2018-09-28 Method, device and storage medium for counting number of rows of distributed column database table

Country Status (1)

Country Link
CN (1) CN110990394B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737243A (en) * 2020-06-19 2020-10-02 中国银行股份有限公司 Historical data cleaning method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617211A (en) * 2013-11-20 2014-03-05 浪潮电子信息产业股份有限公司 HBase loaded data importing method
CN103617232A (en) * 2013-11-26 2014-03-05 北京京东尚科信息技术有限公司 Paging inquiring method for HBase table
CN103631940A (en) * 2013-12-09 2014-03-12 中国联合网络通信集团有限公司 Data writing method and data writing system applied to HBASE database
CN105117433A (en) * 2015-08-07 2015-12-02 北京思特奇信息技术股份有限公司 Method and system for statistically querying HBase based on analysis performed by Hive on HFile
CN105989076A (en) * 2015-02-10 2016-10-05 腾讯科技(深圳)有限公司 Data statistical method and device
WO2016180123A1 (en) * 2015-09-25 2016-11-17 中兴通讯股份有限公司 Hbase second-level index creation method and device
WO2017174013A1 (en) * 2016-04-06 2017-10-12 中兴通讯股份有限公司 Data storage management method and apparatus, and data storage system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205082A (en) * 2014-06-27 2015-12-30 国际商业机器公司 Method and system for processing file storage in HDFS
US10282349B2 (en) * 2015-08-26 2019-05-07 International Business Machines Corporation Method for storing data elements in a database

Also Published As

Publication number Publication date
CN110990394A (en) 2020-04-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant