CN111563109B - Radix statistics method, apparatus, system, device, and computer-readable storage medium - Google Patents

Info

Publication number
CN111563109B
Authority
CN
China
Prior art keywords
hash
data
dimension
bitmap
target data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010339945.1A
Other languages
Chinese (zh)
Other versions
CN111563109A (en)
Inventor
杜红光
罗华林
何凯
夏春伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010339945.1A
Publication of CN111563109A
Application granted
Publication of CN111563109B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the invention provide a cardinality statistics method, apparatus, system, device, and computer-readable storage medium. On the data node side, the method comprises: obtaining target data and the dimension data corresponding to the target data; hashing the target data with a hash algorithm to obtain a hash value corresponding to the target data; generating a bitmap array for the dimension data corresponding to the target data using a bitmap algorithm; storing the hash value and the bitmap array in association with each other in a preset hash table; and sending the hash table to a preset computing node. On the computing node side, the method comprises: receiving the hash tables sent by a plurality of data nodes; merging those hash tables; and, in the merged hash table, performing bitwise cardinality statistics over the plurality of bitmap arrays. Through this two-layer hash approach, the invention saves data storage space and completes cardinality statistics quickly and accurately.

Description

Radix statistics method, apparatus, system, device, and computer-readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a radix statistics method, apparatus, system, device, and computer readable storage medium.
Background
Cardinality statistics count the number of distinct elements in a batch of data. They are commonly used in scenarios such as computing the number of unique visitors (UV) or the number of distinct values of a dimension. In real production environments, scenarios that require an exact cardinality arise frequently. For example, after the entry page of a website has been adjusted in two different ways (such as giving it two different appearances), an A/B experiment must be run on the two entry pages. In that experiment, the number of unique users visiting each entry page within a preset time period must be counted exactly, so that the data can be analyzed before the two entry pages go live.
Currently, in a distributed computing environment, a commonly used cardinality statistics algorithm is the HyperLogLog algorithm.
Because the storage space needed grows in proportion to the volume of element data, the HyperLogLog algorithm compresses the element data when storing it, so that enough elements can be tracked within limited storage space. This compression loses information about the element data; as a result, the HyperLogLog algorithm has lower accuracy and a larger error, and cannot meet the requirement of exact statistics.
Disclosure of Invention
An object of embodiments of the invention is to provide a cardinality statistics method, apparatus, system, device, and computer-readable storage medium, so as to solve the problem that existing cardinality statistics approaches have low accuracy.
To address the above technical problem, embodiments of the invention provide the following technical solutions:
In a first aspect of the present invention, there is provided a cardinality statistics method, comprising the following steps performed on a data node side: acquiring target data and the dimension data corresponding to the target data; hashing the target data with a preset hash algorithm to obtain a hash value corresponding to the target data; generating a bitmap array for the dimension data corresponding to the target data using a preset bitmap algorithm, wherein each bit in the bitmap array maps to one dimension element on which cardinality statistics are to be performed; storing the hash value corresponding to the target data and the bitmap array generated for its dimension data in association with each other in a preset hash table; and sending the hash table to a preset computing node, so that the computing node performs bitwise cardinality statistics on the bitmap arrays in the received hash table.
The dimension data corresponding to the target data comprises a plurality of dimension values of the target data. Generating a bitmap array for the dimension data corresponding to the target data using a preset bitmap algorithm comprises: querying a mapping relation table preset for the data node, and determining the dimension elements corresponding to the plurality of dimension values in the dimension data, wherein the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in the bitmap array; and, in the bitmap array, setting the bits mapped by those dimension elements to a first bit value and setting the remaining bits to a second bit value, thereby obtaining the bitmap array corresponding to the dimension data.
Storing the hash value corresponding to the target data and the bitmap array corresponding to the dimension data of the target data in association with each other in a preset hash table comprises: in a Java language environment, storing the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data into the hash table using the Trove package.
After the hash value and the bitmap array have been stored in association in the preset hash table, and before the hash table is sent to the preset computing node, the method further comprises: querying whether identical hash values exist in the hash table; and, if identical hash values exist, aggregating the plurality of bitmap arrays corresponding to the identical hash value.
In a second aspect of the present invention, there is also provided a cardinality statistics method, comprising the following steps performed on a computing node side: receiving the hash tables sent by a plurality of data nodes, wherein each hash table stores hash values and bitmap arrays in association with each other, each hash value is obtained by hashing target data with a preset hash algorithm, each bitmap array is generated for the dimension data corresponding to the target data using a preset bitmap algorithm, and each bit in the bitmap array maps to one dimension element on which cardinality statistics are to be performed; and merging the hash tables sent by the plurality of data nodes and, in the merged hash table, performing bitwise cardinality statistics over the plurality of bitmap arrays.
Performing bitwise cardinality statistics over the plurality of bitmap arrays in the merged hash table comprises: querying whether identical hash values exist in the merged hash table; if so, performing an aggregation operation on the plurality of bitmap arrays corresponding to the identical hash value to obtain one aggregated bitmap array for that hash value; and, in the merged hash table, performing bitwise cardinality statistics over the aggregated bitmap array and the bitmap arrays corresponding to each of the remaining hash values.
Merging the hash tables sent by the plurality of data nodes and performing bitwise cardinality statistics over the plurality of bitmap arrays in the merged hash table comprises: dividing the hash tables into a plurality of mutually disjoint hash table sets; for each hash table set, merging the hash tables in the set and performing bitwise cardinality statistics over the plurality of bitmap arrays in the merged hash table, to obtain the cardinality statistics result for that set; and, after the results for all hash table sets have been obtained, aggregating them to obtain the final cardinality statistics result.
Before the plurality of hash tables are merged, the method further comprises: obtaining the mapping relation table corresponding to each data node, wherein the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in a preset bitmap array; and, according to the mapping relation table corresponding to each data node, aligning the bitmap arrays in the hash tables sent by the plurality of data nodes, so that the bitmap arrays in the hash tables have the same number of bits and the bits at corresponding positions map to the same dimension elements.
In a third aspect of the present invention, there is also provided a cardinality statistics apparatus, provided on a data node side, comprising: an acquisition module, configured to acquire target data and the dimension data corresponding to the target data; a first hash module, configured to hash the target data with a preset hash algorithm to obtain a hash value corresponding to the target data; a bitmap generation module, configured to generate a bitmap array for the dimension data corresponding to the target data using a preset bitmap algorithm, wherein each bit in the bitmap array corresponds to one dimension element on which cardinality statistics are to be performed; a second hash module, configured to store the hash value corresponding to the target data and the bitmap array generated for its dimension data in association in a preset hash table; and a sending module, configured to send the hash table to a preset computing node, so that the computing node performs bitwise cardinality statistics on the bitmap arrays in the received hash table.
In a fourth aspect of the present invention, there is also provided a radix statistic device, provided on a computing node side, including: the receiving module is used for receiving hash tables respectively sent by a plurality of data nodes; wherein, hash value and bitmap array are stored in each hash table correspondingly; the hash value is obtained by carrying out hash calculation on target data by using a preset hash algorithm, and the bitmap array is generated for dimension data corresponding to the target data by using the preset bitmap algorithm; each bit in the bitmap array corresponds to a dimension element for base statistics; and the statistics module is used for merging hash tables respectively sent by the plurality of data nodes, and executing base statistics processing on the plurality of bitmap arrays according to bit positions in the hash tables obtained after merging.
In a fifth aspect of the implementation of the present invention, there is also provided a radix statistic system, where the radix statistic system includes: a plurality of data nodes and computing nodes respectively connected with each data node; each of the data nodes comprises: the first hash interface and the second hash interface are connected with each other; the first hash interface is used for acquiring target data and dimension data corresponding to the target data; carrying out hash calculation on the target data by using a preset hash algorithm so as to obtain a hash value corresponding to the target data; the second hash interface is used for generating a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm; wherein, each bit in the bitmap array corresponds to a dimension element for base statistics; correspondingly storing a hash value corresponding to the target data and a bitmap array generated for dimension data corresponding to the target data by using a preset hash table; transmitting the hash table to a preset computing node; the computing node is used for receiving hash tables respectively sent by the plurality of data nodes; and merging hash tables respectively sent by the plurality of data nodes, and executing base statistics processing on the plurality of bitmap arrays according to bit positions in the hash tables obtained after merging.
In a sixth aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus; a memory for storing a computer program; and the processor is used for realizing any of the steps of the radix statistic method executed on the data node side or the steps of any of the radix statistic method executed on the computing node side when executing the program stored in the memory.
In a seventh aspect of the present invention, there is further provided a computer readable storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the steps of any of the above-described radix statistic method performed on the data node side, or to implement the steps of any of the above-described radix statistic method performed on the compute node side.
In an eighth aspect of the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of any of the above-described radix statistic method performed on the data node side, or to implement the steps of any of the above-described radix statistic method performed on the compute node side.
The cardinality statistics method, the cardinality statistics device, the cardinality statistics system, the cardinality statistics equipment and the cardinality statistics computer readable storage medium provided by the embodiment of the invention provide an accurate and rapid cardinality statistics method in a double-layer hash mode. Further, the embodiment of the invention firstly converts the target data into the hash value so as to save the storage space of the data; then, the data of the dimension elements requiring the base statistics are expressed in the form of a bitmap array, so that the data of the dimension elements cannot be lost, the effect of compressing the data quantity corresponding to the dimension elements is achieved, the accuracy of the base statistics is improved, and the storage space of the data can be further saved; and finally, storing the hash value of the target data and the bitmap array of the dimension data corresponding to the target data into a hash table, and rapidly and accurately completing the base statistics by utilizing the characteristic that the hash table has rapid inquiry, so that the data retrieval efficiency in the base statistics process is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of a radix statistics method performed at a data node side according to an embodiment of the present invention;
FIG. 2 is a flow chart of a radix statistics method performed at the computing node side according to an embodiment of the present invention;
FIG. 3 is a block diagram of a radix statistic device disposed on a data node side according to one embodiment of the invention;
FIG. 4 is a block diagram of a radix statistic device disposed on a compute node side according to one embodiment of the invention;
FIG. 5 is a block diagram of a radix statistics system according to an embodiment of the present invention;
FIG. 6 is a block diagram of a radix statistics system according to an embodiment of the present invention;
FIG. 7 is a diagram of a radix statistics system for performing radix statistics according to an embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
The embodiment of the invention provides a radix statistic method executed on a data node side. Referring now to FIG. 1, a flow chart of a radix statistics method performed on the data node side according to one embodiment of the present invention is shown.
Step S110, acquiring target data and dimension data corresponding to the target data.
Target data refers to an object of interest in radix statistics. For example: the target data is an independent user.
Dimension data refers to the element information, associated with the object of interest, on which cardinality statistics are to be performed.
In this embodiment, the dimension data corresponding to the target data includes: a plurality of dimension values of the target data. Further, the dimension data corresponding to the target data includes a plurality of dimensions of the target data and a dimension value of each dimension.
For example: the target data may be a Unique user code (Unique ID), which may be used to represent an individual user.
As another example, the plurality of dimensions may include gender, age, region, version number, page number, and so on; the corresponding dimension values may include male, 20 years old, Beijing, version V1, page P1, and so on.
If there are multiple pieces of target data, the multiple pieces of target data and the dimension data corresponding to each can be acquired. Further, when a user logs in to and accesses the website through a data node, the user's access records are stored in that data node. The access records include, but are not limited to, the user's login information and information about the web pages the user accessed. The login information includes, but is not limited to, the user's Unique ID, gender, age, and region. The accessed web page information includes, but is not limited to, the version number and page number of the web page.
Step S120, performing hash computation on the target data by using a preset hash algorithm, so as to obtain a hash value corresponding to the target data.
If there are multiple pieces of target data, each is hashed separately with the preset hash algorithm to obtain its own hash value.
In this embodiment, the types of hash algorithm include, but are not limited to, non-cryptographic hash algorithms and cryptographic hash algorithms. Compared with a cryptographic hash algorithm, a non-cryptographic hash algorithm omits the additional processing that cryptographic security requires, which reduces the computational complexity and running time of the hash calculation. The hash algorithm of this embodiment is therefore preferably a non-cryptographic hash algorithm.
In this embodiment, the non-cryptographic hash algorithm may be a 64-bit hash algorithm (the xxHash64 algorithm), which converts target data into a long integer (Long) value.
Applying the hash algorithm to the target data also compresses it. For example, using the xxHash64 algorithm to convert target data into Long values turns a 32-byte character string into an 8-byte Long value, a compression ratio of 4, and the collision rate of the converted values is 0 at a scale on the order of one billion.
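As a minimal sketch of step S120, the following Python snippet maps a 32-byte identifier to a signed 64-bit integer. It is illustrative only: the embodiment names xxHash64, whereas this sketch substitutes an 8-byte BLAKE2b digest from the Python standard library so that it runs without third-party packages. The function name hash_target and the sample identifier are assumptions, not part of the patent.

```python
import hashlib


def hash_target(target: str) -> int:
    """Map target data to a signed 64-bit integer.

    Assumption: an 8-byte BLAKE2b digest stands in for xxHash64 here;
    any 64-bit hash would illustrate the same size reduction.
    """
    digest = hashlib.blake2b(target.encode("utf-8"), digest_size=8).digest()
    # Interpret the 8 bytes as a signed value, like a Java long.
    return int.from_bytes(digest, byteorder="big", signed=True)


# Hypothetical 32-character (32-byte) unique user ID.
unique_id = "0123456789abcdef0123456789abcdef"
hash_value = hash_target(unique_id)

# 32 bytes of text become 8 bytes of hash: a compression ratio of 4.
compression_ratio = len(unique_id.encode("utf-8")) / 8
```

The same identifier always yields the same 64-bit value, which is what later allows records for one user to be grouped under a single key.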
Step S130, generating a bitmap array for dimension data corresponding to the target data by using a preset bitmap algorithm; wherein each bit in the bitmap array maps a dimension element to be subjected to radix statistics.
At least one bit is included in the bitmap array. Each bit in the bitmap array maps a dimension element to be subjected to radix statistics.
A dimension element represents either a single dimension value or a dimension value combination formed from several dimension values. Dimension elements can be set according to the cardinality statistics requirements. When the bitmap array is generated for the dimension data, a preset bit value is set in the bit mapped by a dimension element, and that bit value indicates whether the dimension element (dimension value or dimension value combination) is present in the dimension data. With this approach, large amounts of element data (dimension data) do not need to be compressed for storage, so no element data is lost; the bitmap array shows exactly whether each dimension element is present in the dimension data, which is the key to exact cardinality statistics on the dimension elements.
Specifically, if there are multiple pieces of target data, a bitmap array is generated for the dimension data corresponding to each piece using the preset bitmap algorithm.
The bitmap array comprises one or more bits, and the dimension element mapped by each bit is given by a mapping relation table.
The mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in the bitmap array. In other words, each bit in the bitmap array has its own meaning: it stands for one element on which cardinality statistics are to be performed.
A bitmap array is generated for the dimension data of each piece of target data using the preset bitmap algorithm and the mapping relation table set for the data node, as follows: query the mapping relation table preset for the data node and determine the dimension elements corresponding to the plurality of dimension values in the dimension data; then, in the bitmap array, set the bits mapped by those dimension elements to the first bit value and set the remaining bits to the second bit value, thereby generating the bitmap array corresponding to the dimension data. Because a dimension element is a dimension value or a dimension value combination, the mapping relation table can first be queried, using the plurality of dimension values of the target data contained in the dimension data, for the dimension elements that correspond to those dimension values, and then for the bits to which those dimension elements are mapped.
The first bit value indicates that the dimension element mapped by the current bit is present in the dimension data. The first bit value may be 1.
The second bit value indicates that the dimension element mapped by the current bit is not present in the dimension data. The second bit value may be 0.
For example, suppose the dimension elements are dimension value combinations, the first bit value is 1, and the second bit value is 0. After the website version has been modified, if the number of unique users who browsed version V1, page P1 is wanted, (version V1, page P1) is set as the first dimension value combination; if the number of unique users who browsed version V2, page P7 is wanted, (version V2, page P7) is set as the second dimension value combination. The mapping relation table then records the dimension value combination mapped by each bit in the bitmap array: the first bit maps to the first dimension value combination and the second bit maps to the second dimension value combination. Querying the mapping relation table with the acquired dimension data of the target data determines which dimension value combinations the dimension values in the dimension data correspond to. Suppose that, in the dimension data, the version number dimension has the value version V1 and the page number dimension has the three values page P1, page P2, and page P3. These dimension values (version V1, page P1, page P2, page P3) correspond to the first dimension value combination (version V1, page P1), but not to the second, because neither version V2 nor page P7 appears among them. Accordingly, in the bitmap array, the first bit, mapped to the first dimension value combination, is set to 1, and the second bit, mapped to the second dimension value combination, is set to 0.
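The worked example above can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the names MAPPING_TABLE and build_bitmap, the (dimension, value) tuple encoding, and the choice of storing bit 0 in the lowest-order position of the first byte are all assumptions.

```python
# Hypothetical mapping relation table: list index = bit position,
# entry = the dimension value combination mapped to that bit.
MAPPING_TABLE = [
    {("version", "V1"), ("page", "P1")},  # first bit
    {("version", "V2"), ("page", "P7")},  # second bit
]


def build_bitmap(dimension_values, mapping_table=MAPPING_TABLE):
    """Return a bitmap (bytes) marking which dimension elements are present.

    A bit is set to 1 when every value of its dimension value combination
    appears in the record's dimension values, and left at 0 otherwise.
    """
    bitmap = bytearray((len(mapping_table) + 7) // 8)
    for bit, element in enumerate(mapping_table):
        if element <= dimension_values:  # subset test
            bitmap[bit // 8] |= 1 << (bit % 8)
    return bytes(bitmap)


# Dimension data from the example: version V1 and pages P1, P2, P3.
dims = {("version", "V1"), ("page", "P1"), ("page", "P2"), ("page", "P3")}
bitmap = build_bitmap(dims)  # first bit set, second bit clear
```

Only the first combination is fully contained in the record's dimension values, so only the first bit is set.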
Step S140, using a preset hash table to store the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data.
If there are multiple pieces of target data, then for each piece the hash value corresponding to the target data and the bitmap array generated for its dimension data are stored in association in the preset hash table.
In a Java language environment, the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data are stored into the hash table using the Trove package.
Specifically, big data computing frameworks mainly run in the Java virtual machine (JVM) runtime environment. Because Java is object-oriented, hash tables in the JVM runtime environment support Object-type data by default and do not support primitive-type data. Primitive types include, but are not limited to, int, double, and long.
In this embodiment, the Trove package provides customizable hash tables. In the Java language environment, a basic data structure is built with the Trove package; it can be denoted double level HashData. The basic data structure includes, but is not limited to, the custom hash table TLongObjectMap<Data, byte[]>. It may further include the mapping relation table set for the data node.
Data represents the key in the hash table TLongObjectMap. In this embodiment, Data is defined to store the hash value of the target data, which may be a Long-type value.
byte[] represents the value associated with the key in the hash table TLongObjectMap. In this embodiment, byte[] is defined to store the bitmap array, that is, to mark the dimension elements present in the dimension data.
Furthermore, in this embodiment the native Java hash table is replaced by the basic data structure built with the Trove package, so that the hash table can hold primitive-type data in the Java language environment. The Trove package uses open addressing, which reduces the overhead of linked-list references, and it avoids the extra storage (such as an index value) incurred when primitive-type data is boxed into non-primitive-type data, so that the compression ratio can reach 2.16 at a load factor of 0.75.
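To illustrate what open addressing means for the hash-value-to-bitmap table, the toy Python class below keeps keys and values in two flat parallel arrays and resolves collisions by probing neighbouring slots instead of chaining linked nodes. It is a sketch under stated assumptions: the class name OpenAddressingMap is hypothetical, linear probing is chosen only for brevity and is not claimed to be the probing scheme Trove uses, and the table has a fixed capacity with no resizing or deletion.

```python
class OpenAddressingMap:
    """Toy fixed-capacity map from integer keys to values (linear probing)."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, capacity=16):
        self._keys = [self._EMPTY] * capacity
        self._values = [None] * capacity

    def _slot(self, key):
        """Return the slot holding key, or the first empty slot on its probe path."""
        capacity = len(self._keys)
        index = key % capacity  # non-negative in Python, even for negative keys
        for _ in range(capacity):
            if self._keys[index] is self._EMPTY or self._keys[index] == key:
                return index
            index = (index + 1) % capacity  # probe the next slot
        raise RuntimeError("table is full")

    def put(self, key, value):
        index = self._slot(key)
        self._keys[index] = key
        self._values[index] = value

    def get(self, key, default=None):
        index = self._slot(key)
        if self._keys[index] is self._EMPTY:
            return default
        return self._values[index]


table = OpenAddressingMap(capacity=8)
table.put(3, b"\x01")    # hash value 3 -> bitmap
table.put(11, b"\x02")   # 11 % 8 == 3, so this key probes to the next slot
found = table.get(11)
missing = table.get(19)  # same probe path, but never inserted
```

Because every entry lives directly in the arrays, there are no per-entry node objects or next-pointers, which is the kind of saving the paragraph above attributes to open addressing.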
And step S150, the hash table is sent to a preset computing node, so that the computing node executes base statistic processing according to bits according to the bitmap array in the received hash table.
In this embodiment, in order to reduce duplicate data in the hash table, it may be queried whether the same hash value exists in the hash table before sending the hash table to the computing node; if the same hash value exists in the hash table, aggregation processing is carried out on a plurality of bitmap arrays corresponding to the same hash value.
The same hash value includes: conflicting hash values obtained by performing hash calculation on different target data, and/or identical hash values obtained by performing hash calculation on the same target data.
To aggregate the bitmap arrays corresponding to the same hash value, an OR operation may be performed on the corresponding bits of those bitmap arrays.
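The aggregation just described can be sketched as follows. This is a minimal illustration, not the patented implementation: a plain java.util.HashMap stands in for Trove's TLongObjectMap so that the example has no external dependency, and the class name BitmapTable and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; a plain HashMap stands in for Trove's TLongObjectMap. */
final class BitmapTable {
    private BitmapTable() {}

    /** Returns a new array holding the bitwise OR of two bitmap arrays of equal length. */
    static byte[] or(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("bitmap arrays must have the same length");
        }
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] | b[i]);
        }
        return out;
    }

    /** Stores a bitmap under a hash value, OR-ing it with any bitmap already stored there. */
    static void put(Map<Long, byte[]> table, long hash, byte[] bitmap) {
        // Map.merge inserts the value if the key is absent, otherwise combines old and new.
        table.merge(hash, bitmap.clone(), BitmapTable::or);
    }
}
```

Because the OR is applied at insertion time, the table never holds two bitmap arrays for the same hash value, so no separate deduplication pass is needed before the table is sent.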
The embodiment of the invention provides an accurate and fast radix statistics method based on a double-layer hash mode. First, the target data is converted into hash values to save storage space. Then, the dimension elements that require radix statistics are represented as a bitmap array, so that no dimension element data is lost while the data volume corresponding to the dimension elements is compressed; this improves the accuracy of the radix statistics and further saves storage space. Finally, the hash value of the target data and the bitmap array of the dimension data corresponding to the target data are stored in a hash table; because a hash table supports fast lookup, data retrieval during radix statistics is more efficient and the radix statistics can be completed quickly.
Because a distributed environment contains a plurality of nodes that form a communication network, a single node can only reflect local characteristics and cannot reflect global characteristics. For example: users in region A access the website via data node 1, and users in region B access the website via data node 2. Therefore, embodiments of the present invention designate data nodes and computing nodes in the distributed environment. The hash tables of all the data nodes are merged, and radix statistics is performed on the merged hash table, so that the radix statistics result reflects the global characteristics. Of course, if radix statistics is only needed for a local part, it may be performed bit by bit on the bitmap arrays in the hash table of the data node concerned to obtain the radix statistics result.
In a distributed environment, the embodiment of the invention also provides a radix statistic method executed on a computing node side aiming at the radix statistic method executed on the data node side. Referring now to FIG. 2, a flowchart of a radix statistics method performed at a compute node side according to one embodiment of the present invention is shown.
Step S210, receiving hash tables respectively sent by a plurality of data nodes.
A hash value and a bitmap array are correspondingly stored in each hash table. The hash value is obtained by carrying out hash calculation on target data by using a preset hash algorithm, and the bitmap array is generated for dimension data corresponding to the target data by using the preset bitmap algorithm. Each bit in the bitmap array maps a dimension element to be subjected to radix statistics.
If there are multiple target data on the data node side, multiple groups of hash values and bitmap arrays are correspondingly stored in the hash table. In each group, the hash value is obtained by performing hash calculation on one of the target data by using a preset hash algorithm, and the bitmap array is generated for the dimension data corresponding to that target data by using the preset bitmap algorithm.
A dimension element represents a dimension value or a combination of multiple dimension values.
A bitmap array comprises at least one bit, and each bit maps one dimension value or one combination of dimension values. The dimension value or combination of dimension values mapped by each bit is a dimension element to be subjected to radix statistics.
Step S220: merge the hash tables respectively sent by the plurality of data nodes and, in the merged hash table, perform radix statistics processing bit by bit on the plurality of bitmap arrays.
Since the hash tables from the data nodes have the same data structure, that is, they are all TLongObjectMap<Data, byte[]> structures, the hash tables from the data nodes may be merged into one total hash table, or merged in batches into multiple hash tables.
Merging in batches into multiple hash tables means: the hash tables are divided into a plurality of mutually disjoint hash table sets. For example: the hash tables from data nodes 1 through 3 form one hash table set, and the hash tables from data nodes 4 through 6 form another. For each hash table set, the hash tables in the set are merged, and radix statistics processing is performed bit by bit on the bitmap arrays in the merged hash table to obtain the radix statistics result corresponding to that set; after the results for all hash table sets have been obtained, they are aggregated to obtain the final radix statistics result.
Further, since the same target data may have dimension data on different data nodes (for example, the same user logs in to the same website through different data nodes), the merged hash table can be queried for identical hash values. If identical hash values exist in the hash table, an aggregation operation is performed on the bitmap arrays corresponding to the same hash value to obtain an aggregated bitmap array for that hash value. Then, in the merged hash table, radix statistics processing is performed bit by bit on the aggregated bitmap arrays and on the bitmap arrays corresponding to the remaining hash values. The aggregation operation may be an OR operation.
Performing radix statistics processing bit by bit includes: summing the bit values at the same bit position across the bitmap arrays to obtain the radix statistics result for that bit, which is the radix statistics result of the dimension element mapped by that bit.
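The merge and bitwise summation can be sketched as below. This is an illustrative sketch under stated assumptions: java.util.HashMap again stands in for TLongObjectMap, the bitmap arrays are assumed to be already aligned to the same length, bit i is assumed to live in byte i / 8 at position i % 8 (the patent does not fix a bit order), and the names RadixCounter, merge and countPerBit are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical computing-node helper; HashMap stands in for TLongObjectMap. */
final class RadixCounter {
    private RadixCounter() {}

    /** Merges per-node tables; bitmaps stored under the same hash value are OR-ed. */
    static Map<Long, byte[]> merge(List<Map<Long, byte[]>> tables) {
        Map<Long, byte[]> merged = new HashMap<>();
        for (Map<Long, byte[]> table : tables) {
            for (Map.Entry<Long, byte[]> entry : table.entrySet()) {
                // Clone so the merged table never aliases a node's own array.
                merged.merge(entry.getKey(), entry.getValue().clone(), RadixCounter::orInto);
            }
        }
        return merged;
    }

    /** ORs incoming into accumulated (assumed equal length) and returns accumulated. */
    private static byte[] orInto(byte[] accumulated, byte[] incoming) {
        for (int i = 0; i < accumulated.length; i++) {
            accumulated[i] |= incoming[i];
        }
        return accumulated;
    }

    /** Sums, for each bit position, the bit values across all bitmap arrays in the table. */
    static long[] countPerBit(Map<Long, byte[]> table, int bitCount) {
        long[] counts = new long[bitCount];
        for (byte[] bitmap : table.values()) {
            for (int bit = 0; bit < bitCount; bit++) {
                counts[bit] += (bitmap[bit >> 3] >> (bit & 7)) & 1;
            }
        }
        return counts;
    }
}
```

With this sketch, a target datum that appears on two data nodes with the same bit set contributes 1, not 2, to that bit's count, because its bitmap arrays are OR-ed before the summation.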
In this embodiment, since the mapping relationship tables are set correspondingly for the data nodes, if different data nodes correspond to different mapping relationship tables, it is necessary to acquire the mapping relationship table corresponding to each data node before merging multiple hash tables; and according to the mapping relation table corresponding to each data node, carrying out alignment processing on bitmap arrays in the hash tables respectively sent by the plurality of data nodes, so that the bitmap arrays in the hash tables have the same bit number and the dimension elements of bit mapping of the corresponding positions are the same.
For example: one bit is recorded in the mapping relationship table of data node 1, and this bit maps version V1; two bits are recorded in the mapping relationship table of data node 2, where the first bit maps version V1 and the second bit maps version V2. During alignment, one bit may be added to each bitmap array in the hash table from data node 1; the added bit is the second bit and maps version V2. As a result, the bitmap arrays in the hash table from data node 1 and in the hash table from data node 2 have the same number of bits, and bits at the same position map the same dimension element.
In this embodiment, on the data node side, the dimension elements are compressed into the bitmap array, and the bitmap array marks whether each dimension element exists in the dimension data. This avoids losing dimension elements, and the dimension elements that appear in the dimension data can be read accurately from the bitmap array, so that the computing node side can perform radix statistics with higher accuracy based on the bitmap array.
Because radix statistics based on a bitmap algorithm can only be performed on Int-type data, without the double-layer hash mode of the embodiment of the invention a global data dictionary would have to be maintained between nodes. The global data dictionary records the Int-type element data corresponding to string-type element data: string-type data is first converted into Int-type data through the global data dictionary, and radix statistics is then performed on the converted Int-type data. In the embodiment of the invention, no global data dictionary needs to be maintained between the data nodes and the computing node, and none is needed to perform radix statistics on the dimension elements of target data on the order of hundreds of millions of records. This avoids the step of querying the global data dictionary to convert data into the Int type, and it also avoids the lower statistical accuracy that results from performing radix statistics with a global data dictionary that has not been maintained in time. The double-layer hash mode of the embodiment of the invention therefore achieves accurate radix statistics while using less storage space during the radix statistics process. Further, the radix statistics method of this embodiment may be applied in big data scenarios, for either offline or online radix statistics.
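A minimal sketch of the alignment step follows. It assumes that each mapping relationship table can be modeled as a Map from a dimension element name to a zero-based bit index, that a unified target table containing every element of the source table is available, and that bit i is stored in byte i / 8 at position i % 8. The class name BitmapAligner and its method are hypothetical.

```java
import java.util.Map;

/** Hypothetical helper that re-maps a bitmap array from one mapping table to another. */
final class BitmapAligner {
    private BitmapAligner() {}

    /**
     * Rewrites source so that each dimension element sits at the bit index given by
     * targetMapping. Elements present only in targetMapping are left as 0.
     */
    static byte[] align(byte[] source,
                        Map<String, Integer> sourceMapping,
                        Map<String, Integer> targetMapping) {
        int targetBits = 0;
        for (int bit : targetMapping.values()) {
            targetBits = Math.max(targetBits, bit + 1);
        }
        byte[] aligned = new byte[(targetBits + 7) / 8];
        for (Map.Entry<String, Integer> entry : sourceMapping.entrySet()) {
            int from = entry.getValue();
            Integer to = targetMapping.get(entry.getKey());
            if (to == null) {
                throw new IllegalArgumentException(
                        "element missing from target mapping: " + entry.getKey());
            }
            // Copy the bit only if it is set in the source bitmap array.
            if (((source[from >> 3] >> (from & 7)) & 1) == 1) {
                aligned[to >> 3] |= (byte) (1 << (to & 7));
            }
        }
        return aligned;
    }
}
```

For the example above, aligning a bitmap array from data node 1 (V1 at bit 0) against the table of data node 2 (V1 at bit 0, V2 at bit 1) keeps the V1 bit in place and leaves the added V2 bit at 0.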
The embodiment of the invention provides a radix statistic device arranged at a data node side. As shown in fig. 3, a block diagram of a radix statistic device disposed on a data node side according to an embodiment of the invention is shown.
A radix statistic device provided on a data node side, comprising: the device comprises an acquisition module 310, a first hash module 320, a bitmap generation module 330, a second hash module 340 and a transmission module 350.
The acquiring module 310 is configured to acquire target data and dimension data corresponding to the target data.
The first hash module 320 is configured to perform hash computation on the target data respectively by using a preset hash algorithm, so as to obtain hash values corresponding to the target data respectively.
The bitmap generation module 330 is configured to generate bitmap arrays for the dimension data corresponding to the target data respectively by using a preset bitmap algorithm; wherein, each bit in the bitmap array corresponds to a dimension element to be subjected to radix statistics.
The second hash module 340 is configured to correspondingly store, using a preset hash table, a hash value corresponding to the target data and a bitmap array generated for the dimension data corresponding to the target data.
The sending module 350 is configured to send the hash table to a preset computing node, so that the computing node performs radix statistics processing bit by bit on the bitmap arrays in the received hash table.
The functions of the apparatus according to this embodiment of the invention have already been described in the foregoing method embodiments; for details not exhaustively described here, refer to the related descriptions in the foregoing embodiments, which are not repeated here.
The embodiment of the invention also provides a radix statistics apparatus arranged on the computing node side. FIG. 4 shows a block diagram of a radix statistics apparatus arranged on the computing node side according to an embodiment of the invention.
A radix statistic device provided on a computing node side, comprising: a receiving module 410 and a statistics module 420.
A receiving module 410, configured to receive hash tables sent by a plurality of data nodes respectively; wherein, hash value and bitmap array are stored in each hash table correspondingly; the hash value is obtained by carrying out hash calculation on target data by using a preset hash algorithm, and the bitmap array is generated for dimension data corresponding to the target data by using the preset bitmap algorithm; each bit in the bitmap array corresponds to a dimension element to be subjected to radix statistics.
The statistics module 420 is configured to merge the hash tables respectively sent by the plurality of data nodes and, in the merged hash table, to perform radix statistics processing bit by bit on the plurality of bitmap arrays.
The functions of the apparatus according to this embodiment of the invention have already been described in the foregoing method embodiments; for details not exhaustively described here, refer to the related descriptions in the foregoing embodiments, which are not repeated here.
The embodiment of the invention also provides a radix statistics system. FIG. 5 is a block diagram of a radix statistics system according to an embodiment of the present invention.
The radix statistics system comprises: a plurality of data nodes 510 and a computing node 520. The computing node 520 is connected to each data node 510. Each data node 510 comprises a first hash interface 511 and a second hash interface 512 that are connected to each other. Only the structure of one data node 510 is shown in FIG. 5.
The first hash interface 511 is configured to obtain target data and dimension data corresponding to the target data; and respectively carrying out hash calculation on the target data by using a preset hash algorithm so as to respectively obtain hash values corresponding to the target data. Further, the first hash interface 511 may be an xxhash64 algorithm conversion service interface, and the first hash interface 511 may convert data of multiple data types (such as string-type data) into Long-type values (8 bytes). The preliminary compression of data may be achieved through the first hash interface 511.
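The conversion of string-type target data into an 8-byte Long value can be sketched as follows. The embodiment names xxhash64, which is not part of the Java standard library; to keep the example self-contained, this sketch swaps in 64-bit FNV-1a, another non-cryptographic hash, purely as a stand-in. The class name TargetDataHasher and the method hash64 are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the first hash interface; uses FNV-1a, not xxhash64. */
final class TargetDataHasher {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    private TargetDataHasher() {}

    /** Maps arbitrary string-type target data to a 64-bit (8-byte) long value. */
    static long hash64(String targetData) {
        long hash = FNV_OFFSET_BASIS;
        for (byte b : targetData.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);   // mix in one byte, treated as unsigned
            hash *= FNV_PRIME;    // multiplication wraps modulo 2^64
        }
        return hash;
    }
}
```

Whatever hash function is chosen, the property this step relies on is that the same target data always yields the same Long value on every data node, so that identical target data meet under the same key when the hash tables are merged.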
The second hash interface 512 is configured to generate, by using a preset bitmap algorithm, a bitmap array for the dimension data corresponding to each target data, where each bit in the bitmap array corresponds to a dimension element to be subjected to radix statistics; to store, in a preset hash table, the hash value corresponding to the target data in correspondence with the bitmap array generated for the dimension data corresponding to the target data; and to send the hash table to the preset computing node 520.
The second hash interface 512 is implemented based on the hash table TLongObjectMap: the hash value corresponding to the target data is a key in the hash table, and the bitmap array generated for the dimension data corresponding to the target data is the value associated with that key. Further compression of the data may be achieved through the second hash interface 512, and the hash table provided by the second hash interface 512 supports fast lookup of whether a key exists.
The computing node 520 is configured to receive the hash tables respectively sent by the plurality of data nodes, to merge them and, in the merged hash table, to perform radix statistics processing bit by bit on the plurality of bitmap arrays.
Further, as shown in fig. 6, in the data node 510, it may further include: the mapping module 513. Included in the computing node 520 may be: the data merge module 521 and the data aggregation module 522.
The mapping module 513 is configured to store the mapping relationship table set for the data node 510. The bit to which a dimension element maps in the bitmap array can be looked up in the mapping module.
The data merging module 521 is configured to obtain the mapping relationship table corresponding to each data node, where the mapping relationship table records at least one dimension element and the bit to which each dimension element maps in a preset bitmap array; and, according to the mapping relationship table corresponding to each data node, to align the bitmap arrays in the hash tables respectively sent by the data nodes, so that the bitmap arrays in the hash tables have the same number of bits and the bits at corresponding positions map the same dimension elements. Further, after the bitmap arrays in the hash tables TLongObjectMap sent by the data nodes have been aligned, the TLongObjectMap hash tables are merged to obtain a new double level HashData.
The data aggregation module 522 is configured to perform radix statistics processing bit by bit on the plurality of bitmap arrays in the merged hash table. Further, the data aggregation module may perform an aggregation calculation on the double level HashData newly generated by the data merging module. The aggregation calculation includes: traversing all dimension elements and obtaining the bit mapped by each dimension element; and, for each dimension element, traversing the bit mapped by that dimension element in all bitmap arrays in the TLongObjectMap and counting the number of bits set to 1 as the radix statistics result of that dimension element.
For example: fig. 7 is a schematic diagram of a radix statistics system according to an embodiment of the present invention. The radix statistics system comprises: two data nodes 510 and one compute node 520.
The number of unique visitors in an AB experiment is determined based on the radix statistics system. In the AB experiment, different improvements are made in advance to one page (page number P1) of a website, where improvement 1 corresponds to version V1 and improvement 2 corresponds to version V2. Users in different regions access the website through the two data nodes 510 (data node 1 and data node 2), respectively. Both data nodes 510 may return different versions of the page to different users.
The following description will take data node 1 as an example:
The data node 510 constructs a hash table TLongObjectMap with the Trove package, in which the keys store a plurality of hash values and VALUES stores the corresponding values, each value being a bitmap array.
The data node 510 sets up a mapping relationship table specifying that each bit in the bitmap array maps one combination of dimension values, and stores the mapping relationship table in the mapping module 513. In FIG. 6, the first bit of the bitmap array maps the dimension value combination of version V1 and page number P1, the second bit maps the dimension value combination of version V2 and page number P1, and the remaining bits map other dimension value combinations.
The first hash interface 511 of the data node 510 obtains a plurality of target data and the dimension data corresponding to each target data, and performs hash calculation on each target data in turn; the second hash interface 512 stores the calculated hash values in the hash table. Take the first two target data as examples. The first hash interface 511 performs hash calculation on the Unique ID "xxxxx", and the second hash interface 512 stores the resulting hash value "100001" as the first key. The dimension data of the Unique ID "xxxxx" is version V1 and page number P1, so the second hash interface 512 queries the mapping relationship table, determines that version V1 and page number P1 map to the first bit, looks up the bitmap array corresponding to the first key in the TLongObjectMap, and sets the first bit of that bitmap array to 1 and the second bit to 0. Likewise, the first hash interface 511 performs hash calculation on the Unique ID "yyyyy", and the second hash interface 512 stores the resulting hash value "200090" as the second key. The dimension data of the Unique ID "yyyyy" is version V2 and page number P1, so the second hash interface 512 queries the mapping relationship table, determines that version V2 and page number P1 map to the second bit, looks up the bitmap array corresponding to the second key in the TLongObjectMap, and sets the first bit of that bitmap array to 0 and the second bit to 1.
After the second hash interface 512 of the data node 510 has stored the hash values and bitmap arrays corresponding to the plurality of target data in the TLongObjectMap, it sends the TLongObjectMap to the computing node.
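The data-node steps of this example can be sketched as below. The hash values 100001 and 200090 are taken from the text; everything else is an assumption made for illustration: java.util.HashMap stands in for TLongObjectMap, the mapping relationship table is modeled as a Map from a dimension element name to a zero-based bit index, the element names "V1|P1" and "V2|P1" are an invented encoding, and the class DataNodeTable is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a data node's table; HashMap stands in for TLongObjectMap. */
final class DataNodeTable {
    private final Map<String, Integer> bitOfElement;        // mapping relationship table
    private final Map<Long, byte[]> table = new HashMap<>(); // hash value -> bitmap array
    private final int bitmapBytes;

    DataNodeTable(Map<String, Integer> bitOfElement) {
        this.bitOfElement = new HashMap<>(bitOfElement);
        int bits = 0;
        for (int bit : bitOfElement.values()) {
            bits = Math.max(bits, bit + 1);
        }
        this.bitmapBytes = (bits + 7) / 8;
    }

    /** Marks that the target data with this hash value has the given dimension element. */
    void record(long hash, String dimensionElement) {
        Integer bit = bitOfElement.get(dimensionElement);
        if (bit == null) {
            throw new IllegalArgumentException("unmapped dimension element: " + dimensionElement);
        }
        // A new bitmap array starts as all zeros; only the mapped bit is set to 1.
        byte[] bitmap = table.computeIfAbsent(hash, key -> new byte[bitmapBytes]);
        bitmap[bit >> 3] |= (byte) (1 << (bit & 7));
    }

    /** Returns the bitmap array stored for a hash value, or null if none is stored. */
    byte[] bitmapOf(long hash) {
        return table.get(hash);
    }
}
```

Recording ("V1|P1") under 100001 and ("V2|P1") under 200090 with a table that maps "V1|P1" to bit 0 and "V2|P1" to bit 1 yields bitmap arrays whose first byte is 0b01 and 0b10 respectively, matching the 1/0 and 0/1 markings described above.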
The data node 2 refers to the process execution of the data node 1, and will not be described in detail herein.
The computing node 520 receives the TLongObjectMaps sent by the two data nodes 510. The data merging module 521 of the computing node 520 compares the mapping relationship tables of the two data nodes 510 and determines that the bitmap arrays in the TLongObjectMaps from the two data nodes 510 have the same number of bits and that corresponding bits map the same dimension value combinations, so the data merging module 521 can merge the two TLongObjectMaps directly.
The data aggregation module 522 of the computing node 520 traverses each bitmap array in the VALUES, and performs a summation operation on the bit VALUES of the corresponding bits to obtain a summation result corresponding to the bits, where the summation result corresponding to the bits is a radix statistical result corresponding to the dimension value combination of the bit mapping. Namely: the sum result corresponding to the first bit is the number of independent users accessing the version V1 and the page P1, and the sum result corresponding to the second bit is the number of independent users accessing the version V2 and the page P1.
In this embodiment, the efficiency of identifying whether a dimension element exists determines the efficiency of the radix statistics. Because a hash table is used to query and store the dimension element data, the time complexity of accessing a dimension element is O(1), so the radix statistics is more efficient.
In this embodiment, radix statistics often has to be performed on multiple dimension elements. If a hash table were used to store each dimension element directly, it would occupy excessive storage space. This embodiment therefore uses a byte array and, following the BitMap principle, stores in each bit of the array whether a dimension element exists. This effectively compresses the data volume corresponding to the dimension elements: data on the order of hundreds of millions of records with 128 dimension values occupies only 4.4 GB of storage.
In this embodiment, the dimension elements are compressed into the bitmap array, that is, the bitmap array marks whether each dimension element exists in the dimension data, so no dimension element data is lost and the radix statistics based on the bitmap array is more accurate. With this embodiment, accurate radix statistics can be performed for AB experiments, the number of daily active users (DAU), deduplicated impression counts, and other application scenarios that require accurate radix statistics.
The embodiment of the invention also provides an electronic device, as shown in fig. 8, which includes a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 complete communication with each other through the communication bus 840.
Memory 830 for storing a computer program.
The processor 810 is configured to implement the above-described steps of the radix statistic method performed on the data node side or the above-described steps of the radix statistic method performed on the computing node side when executing the program stored in the memory 830.
When the processor 810 executes the program stored in the memory 830 to implement the above radix statistics method performed on the data node side, the following steps are included: acquiring target data and the dimension data corresponding to each target data; performing hash calculation on the target data by using a preset hash algorithm to obtain the hash value corresponding to each target data; generating a bitmap array for the dimension data corresponding to each target data by using a preset bitmap algorithm, where each bit in the bitmap array maps a dimension element to be subjected to radix statistics; storing, in a preset hash table, the hash value corresponding to the target data in correspondence with the bitmap array generated for the dimension data corresponding to the target data; and sending the hash table to a preset computing node, so that the computing node performs radix statistics processing bit by bit on the bitmap arrays in the received hash table.
The dimension data corresponding to the target data includes a plurality of dimension values of the target data. Generating a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm includes: querying the mapping relationship table preset for the data node and determining the dimension elements corresponding to the dimension values in the dimension data, where the mapping relationship table records at least one dimension element and the bit to which each dimension element maps in the bitmap array; and, in the bitmap array, setting the bits mapped by those dimension elements to a first bit value and setting the remaining bits to a second bit value, thereby generating the bitmap array corresponding to the dimension data.
After the preset hash table is utilized to correspondingly store the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data, before the hash table is sent to the preset computing node, the method further comprises the following steps: querying whether the same hash value exists in the hash table; and if the same hash value exists in the hash table, performing aggregation processing on a plurality of bitmap arrays corresponding to the same hash value.
Storing, by using a preset hash table, the hash value corresponding to the target data in correspondence with the bitmap array corresponding to the dimension data of the target data includes: in a Java language environment, storing the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data into the hash table by using the Trove package.
The hash algorithm is a non-cryptographic hash algorithm.
When the processor 810 executes the program stored in the memory 830 to implement the above radix statistics method performed on the computing node side, the following steps are included: receiving hash tables respectively sent by a plurality of data nodes, where a hash value and a bitmap array are correspondingly stored in each hash table, the hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, the bitmap array is generated for the dimension data corresponding to the target data by using a preset bitmap algorithm, and each bit in the bitmap array maps a dimension element to be subjected to radix statistics; and merging the hash tables respectively sent by the plurality of data nodes and, in the merged hash table, performing radix statistics processing bit by bit on the plurality of bitmap arrays.
Wherein before said merging a plurality of said hash tables, further comprising: obtaining a mapping relation table corresponding to each data node; the mapping relation table is used for recording at least one dimension element and bits mapped by each dimension element in a preset bitmap array; and according to the mapping relation table corresponding to each data node, carrying out alignment processing on bitmap arrays in the hash tables respectively sent by the data nodes, so that the bitmap arrays in the hash tables have the same bit numbers and dimension elements of bit mapping of corresponding positions.
Performing radix statistics processing bit by bit on the plurality of bitmap arrays in the merged hash table includes: querying the merged hash table for identical hash values; if identical hash values exist in the hash table, performing an aggregation operation on the bitmap arrays corresponding to the same hash value to obtain an aggregated bitmap array for that hash value; and, in the merged hash table, performing radix statistics processing bit by bit on the aggregated bitmap arrays and on the bitmap arrays corresponding to the remaining hash values.
Merging the hash tables respectively sent by the data nodes and performing radix statistics processing bit by bit on the bitmap arrays in the merged hash table includes: dividing the hash tables into a plurality of mutually disjoint hash table sets; for each hash table set, merging the hash tables in the set and performing radix statistics processing bit by bit on the bitmap arrays in the merged hash table to obtain the radix statistics result corresponding to that set; and, after the radix statistics results corresponding to all the hash table sets have been obtained, aggregating them to obtain the final radix statistics result.
The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is shown in the figure as a single bold line, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, there is further provided a computer-readable storage medium having instructions stored therein that, when run on a computer, cause the computer to perform the steps of the radix statistics method performed on the data node side, or the steps of the radix statistics method performed on the computing node side, as described in any of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product comprising instructions that, when run on a computer, cause the computer to perform the steps of the radix statistics method performed on the data node side, or the steps of the radix statistics method performed on the computing node side, as described in any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
It is noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, see the corresponding parts of the description of the method embodiments.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims (11)

1. A radix statistics method, characterized in that the steps performed at a data node side comprise:
acquiring target data and dimension data corresponding to the target data;
performing hash calculation on the target data by using a preset hash algorithm to obtain a hash value corresponding to the target data;
generating a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm, wherein each bit in the bitmap array maps to one dimension element on which radix statistics are to be performed;
wherein the dimension data corresponding to the target data comprises a plurality of dimension values of the target data;
and wherein the generating a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm comprises:
querying a mapping relation table preset for the data node, and determining the dimension elements corresponding to the plurality of dimension values in the dimension data, wherein the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in the bitmap array;
according to the dimension elements corresponding to the plurality of dimension values in the dimension data, setting the bits in the bitmap array to which those dimension elements are mapped to a first bit value and setting the remaining bits to a second bit value, thereby obtaining the bitmap array corresponding to the dimension data;
storing, in association with each other and by using a preset hash table, the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data;
and sending the hash table to a preset computing node, so that the computing node performs bitwise radix statistics processing according to the bitmap arrays in the received hash table.
2. The method according to claim 1, further comprising, after the storing, by using the preset hash table, the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data, and before the sending the hash table to the preset computing node:
querying whether the same hash value exists in the hash table;
and if the same hash value exists in the hash table, aggregating the plurality of bitmap arrays corresponding to the same hash value.
3. The method according to any one of claims 1-2, wherein the storing, by using a preset hash table, the hash value corresponding to the target data and the bitmap array corresponding to the dimension data of the target data comprises:
in a Java language environment, storing the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data into the hash table by using the Trove package.
4. A radix statistics method, characterized in that the steps performed at a computing node side comprise:
receiving hash tables respectively sent by a plurality of data nodes, wherein each hash table stores hash values and bitmap arrays in association with each other; each hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, and each bitmap array is generated for the dimension data corresponding to the target data by using a preset bitmap algorithm; and each bit in the bitmap array maps to one dimension element on which radix statistics are to be performed;
obtaining a mapping relation table corresponding to each data node, wherein the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in a preset bitmap array;
according to the mapping relation table corresponding to each data node, aligning the bitmap arrays in the hash tables respectively sent by the plurality of data nodes, so that the bitmap arrays in the hash tables have the same number of bits and the bits at corresponding positions map to the same dimension elements;
and merging the hash tables respectively sent by the plurality of data nodes, and performing bitwise radix statistics processing on the plurality of bitmap arrays in the merged hash table.
5. The method of claim 4, wherein the performing bitwise radix statistics processing on the plurality of bitmap arrays in the merged hash table comprises:
querying whether the same hash value exists in the merged hash table;
if the same hash value exists in the hash table, performing an aggregation operation on the plurality of bitmap arrays corresponding to the same hash value to obtain an aggregated bitmap array corresponding to the same hash value;
and, in the merged hash table, performing bitwise radix statistics processing on the aggregated bitmap array corresponding to the same hash value and on the bitmap arrays respectively corresponding to the remaining hash values.
6. The method of claim 4, wherein the merging the hash tables respectively sent by the plurality of data nodes, and performing bitwise radix statistics processing on the plurality of bitmap arrays in the merged hash table, comprises:
dividing the plurality of hash tables into a plurality of mutually disjoint hash table sets;
for each hash table set, merging the hash tables in the set, and performing bitwise radix statistics processing on the plurality of bitmap arrays in the merged hash table, to obtain the radix statistics result corresponding to that hash table set;
and, after the radix statistics results respectively corresponding to the plurality of hash table sets are obtained, aggregating those results to obtain the final radix statistics result.
7. A radix statistics apparatus, disposed on a data node side, comprising:
an acquisition module, configured to acquire target data and dimension data corresponding to the target data;
a first hash module, configured to perform hash calculation on the target data by using a preset hash algorithm to obtain a hash value corresponding to the target data;
a bitmap generation module, configured to generate a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm, wherein each bit in the bitmap array corresponds to one dimension element on which radix statistics are to be performed, and the dimension data corresponding to the target data comprises a plurality of dimension values of the target data;
the bitmap generation module being further configured to query a mapping relation table preset for the data node and determine the dimension elements corresponding to the plurality of dimension values in the dimension data, wherein the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in the bitmap array; and, according to the dimension elements corresponding to the plurality of dimension values in the dimension data, to set the bits in the bitmap array to which those dimension elements are mapped to a first bit value and set the remaining bits to a second bit value, thereby obtaining the bitmap array corresponding to the dimension data;
a second hash module, configured to store, in association with each other and by using a preset hash table, the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data;
and a sending module, configured to send the hash table to a preset computing node, so that the computing node performs bitwise radix statistics processing according to the bitmap arrays in the received hash table.
8. A radix statistics apparatus, disposed on a computing node side, comprising:
a receiving module, configured to receive hash tables respectively sent by a plurality of data nodes, wherein each hash table stores hash values and bitmap arrays in association with each other; each hash value is obtained by performing hash calculation on target data by using a preset hash algorithm, and each bitmap array is generated for the dimension data corresponding to the target data by using a preset bitmap algorithm; and each bit in the bitmap array corresponds to one dimension element on which radix statistics are to be performed;
the receiving module being further configured to obtain a mapping relation table corresponding to each data node, wherein the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in a preset bitmap array; and, according to the mapping relation table corresponding to each data node, to align the bitmap arrays in the hash tables respectively sent by the plurality of data nodes, so that the bitmap arrays in the hash tables have the same number of bits and the bits at corresponding positions map to the same dimension elements;
and a statistics module, configured to merge the hash tables respectively sent by the plurality of data nodes, and to perform bitwise radix statistics processing on the plurality of bitmap arrays in the merged hash table.
9. A radix statistics system, comprising:
a plurality of data nodes and a computing node, the computing node being connected to each of the data nodes;
each of the data nodes comprises a first hash interface and a second hash interface that are connected to each other;
the first hash interface is configured to acquire target data and dimension data corresponding to the target data, and to perform hash calculation on the target data by using a preset hash algorithm to obtain a hash value corresponding to the target data;
the second hash interface is configured to generate a bitmap array for the dimension data corresponding to the target data by using a preset bitmap algorithm, wherein each bit in the bitmap array corresponds to one dimension element on which radix statistics are to be performed; to store, in association with each other and by using a preset hash table, the hash value corresponding to the target data and the bitmap array generated for the dimension data corresponding to the target data; and to send the hash table to a preset computing node; wherein the dimension data corresponding to the target data comprises a plurality of dimension values of the target data;
the second hash interface is further configured to query a mapping relation table preset for the data node and determine the dimension elements corresponding to the plurality of dimension values in the dimension data, wherein the mapping relation table records at least one dimension element and the bit to which each dimension element is mapped in the bitmap array; and, according to the dimension elements corresponding to the plurality of dimension values in the dimension data, to set the bits in the bitmap array to which those dimension elements are mapped to a first bit value and set the remaining bits to a second bit value, thereby obtaining the bitmap array corresponding to the dimension data;
the computing node is configured to receive the hash tables respectively sent by the plurality of data nodes, to merge the hash tables respectively sent by the plurality of data nodes, and to perform bitwise radix statistics processing on the plurality of bitmap arrays in the merged hash table.
10. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of claims 1-3 or the method steps of any one of claims 4-6 when executing the program stored in the memory.
11. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method steps of any one of claims 1-3 or the method steps of any one of claims 4-6.
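As an informal illustration of the data-node steps recited in claims 1 and 2 (and not part of the claims), the following Python sketch hashes each piece of target data, builds a bitmap from a mapping relation table, and aggregates bitmaps that share a hash value. Everything named here is an assumption made for the example: the mapping table contents, the function names, the use of truncated SHA-256 as the "preset hash algorithm", the choice of 1 and 0 as the first and second bit values, and bitwise OR as the aggregation.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping relation table: dimension element -> bit position.
MAPPING_TABLE: Dict[str, int] = {
    "platform=ios": 0,
    "platform=android": 1,
    "region=north": 2,
}


def hash_target(target: str) -> int:
    """Hash the target data; truncated SHA-256 stands in for the preset hash algorithm."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_bitmap(dimension_values: Iterable[str], mapping: Dict[str, int]) -> int:
    """Set the mapped bit to 1 for every dimension value; all other bits stay 0."""
    bitmap = 0
    for value in dimension_values:
        bitmap |= 1 << mapping[value]
    return bitmap


def build_hash_table(
    records: Iterable[Tuple[str, List[str]]], mapping: Dict[str, int]
) -> Dict[int, int]:
    """Store hash value -> bitmap; bitmaps with the same hash value are OR-ed."""
    table: Dict[int, int] = {}
    for target, dimension_values in records:
        key = hash_target(target)
        table[key] = table.get(key, 0) | build_bitmap(dimension_values, mapping)
    return table
```

The resulting dictionary plays the role of the hash table that a data node would send to the computing node; claim 3 names the Trove package in Java for this purpose, which this sketch does not use.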
CN202010339945.1A 2020-04-26 2020-04-26 Radix statistics method, apparatus, system, device, and computer-readable storage medium Active CN111563109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010339945.1A CN111563109B (en) 2020-04-26 2020-04-26 Radix statistics method, apparatus, system, device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010339945.1A CN111563109B (en) 2020-04-26 2020-04-26 Radix statistics method, apparatus, system, device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN111563109A CN111563109A (en) 2020-08-21
CN111563109B CN111563109B (en) 2023-09-01

Family

ID=72070594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010339945.1A Active CN111563109B (en) 2020-04-26 2020-04-26 Radix statistics method, apparatus, system, device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN111563109B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162918A (en) * 2020-09-07 2021-01-01 北京达佳互联信息技术有限公司 Application program testing method and device and electronic equipment
CN112114984A (en) * 2020-09-17 2020-12-22 清华大学 Graph data processing method and device
CN112612790A (en) * 2020-12-17 2021-04-06 深圳前海微众银行股份有限公司 Card number configuration method, device, equipment and computer storage medium
CN113282247A (en) * 2021-06-24 2021-08-20 京东科技控股股份有限公司 Data storage method, data reading method, data storage device, data reading device and electronic equipment
CN113468179B (en) * 2021-07-09 2024-03-19 北京东方国信科技股份有限公司 Base number estimation method, base number estimation device, base number estimation equipment and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
US6957222B1 (en) * 2001-12-31 2005-10-18 Ncr Corporation Optimizing an outer join operation using a bitmap index structure
CN104866608A (en) * 2015-06-05 2015-08-26 中国人民大学 Query optimization method based on join index in data warehouse
CN108256087A (en) * 2018-01-22 2018-07-06 北京腾云天下科技有限公司 A kind of data importing, inquiry and processing method based on bitmap structure

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CA2317081C (en) * 2000-08-28 2004-06-01 Ibm Canada Limited-Ibm Canada Limitee Estimation of column cardinality in a partitioned relational database

Also Published As

Publication number Publication date
CN111563109A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN111563109B (en) Radix statistics method, apparatus, system, device, and computer-readable storage medium
CN108427725B (en) Data processing method, device and system
WO2019095416A1 (en) Information pushing method and apparatus, and terminal device and storage medium
CN107798038B (en) Data response method and data response equipment
CN112039979B (en) Distributed data cache management method, device, equipment and storage medium
CN106487939B (en) A kind of method and apparatus, a kind of electronic equipment of determining User IP subnet
TW201800967A (en) Method and device for processing distributed streaming data
WO2022057357A1 (en) Data query method and apparatus, and database system
CN112307062B (en) Database aggregation query method, device and system
WO2020088262A1 (en) Data analysis method and device, and storage medium
CN111443899B (en) Element processing method and device, electronic equipment and storage medium
WO2016165542A1 (en) Method for analyzing cache hit rate, and device
KR100906454B1 (en) Database log data management apparatus and method thereof
CN114528231A (en) Data dynamic storage method and device, electronic equipment and storage medium
CN110020166B (en) Data analysis method and related equipment
CN114595215A (en) Data processing method and device, electronic equipment and storage medium
CN111158994A (en) Pressure testing performance testing method and device
Liu et al. SEAD counter: Self-adaptive counters with different counting ranges
CN112261134B (en) Network data access auditing method, device, equipment and storage medium
CN111538730A (en) Data statistics method and system based on Hash bucket algorithm
CN110909029A (en) Method and medium for realizing cache based on Nosql
CN115499338B (en) Data processing method, device, medium and cloud network observation system
CN112860712B (en) Block chain-based transaction database construction method, system and electronic equipment
CN112527787B (en) Safe and reliable multiparty data deduplication system, method and device
US11924097B2 (en) Traffic monitoring device, method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant