CN110580307B - Processing method and device for fast statistics - Google Patents

Processing method and device for fast statistics

Info

Publication number
CN110580307B
Authority
CN
China
Prior art keywords
data
statistical
hash
barrel
bucket
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910736362.XA
Other languages: Chinese (zh)
Other versions: CN110580307A (en)
Inventor
姜海鸥
蔡华谦
常媛
黄罡
景翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201910736362.XA
Publication of CN110580307A
Application granted
Publication of CN110580307B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/901: Indexing; Data structures therefor; Storage structures
    • G06F16/9024: Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a processing method and device for fast statistics, applied to a graph-structured distributed ledger. The method comprises the following steps. Step S1: acquiring the user's statistical requirement, the master node broadcasting the statistical requirement to each data-chain node. Step S2: reading, according to the statistical requirement, a first data set of the corresponding start-stop sequence numbers or start-stop times in the data-chain node, and extracting the data corresponding to the statistical object in the first data set as raw data. Step S3: taking the raw data as input to a cardinality-estimation method, setting the target parameters, hashing the raw data, computing the bucket number of each hashed datum and the position of the first 1 in the remaining bits after the bucket-number bits are removed, and updating the bucket information of that bucket number. The invention enables fast, accurate and real-time statistics over graph-structured distributed ledger data.

Description

Processing method and device for fast statistics
Technical Field
The present invention relates to the field of data statistics, and in particular, to a fast statistical processing method and a fast statistical processing apparatus.
Background
Statistical requirements are among the most common requirements in the internet field. In the context of big data, statistical requirements fall into two categories: accurate statistics, and fuzzy statistics, i.e. estimation, over big data. The latter arises for two reasons. First, accurate statistics over big data consumes a large amount of time and space and cannot be achieved with limited resources. Second, users usually do not care about the exact value; they tolerate a certain error and only need the order of magnitude of the data or its trend.
Distributed ledger technology is characterized by decentralization, openness and autonomy, with tamper-resistant and anonymous information. As a novel decentralized tool, it is important for establishing a shared value system and securing transactions. At present, distributed ledgers are generally realized with blockchain technology. With their rapid development, numerous consensus mechanisms and network structures have been proposed by many scholars; one of them is the graph-structured distributed ledger. It replaces the traditional chain structure with a directed acyclic graph (DAG) in which all nodes are interconnected, greatly improving network throughput and enhancing the distributed nature of the network.
In the graph-structured distributed ledger, resources in the network need to be counted (e.g. daily logins, the number of registered users, the daily transaction volume) for traffic monitoring or business processing. Such tasks belong to a subclass of the fuzzy statistics mentioned above: cardinality statistics (the cardinality is the number of distinct elements in a set that contains duplicates). Traditional statistical methods, such as traversing and counting the data of every node, occupy large amounts of network bandwidth and computing resources and seriously impact the real-time performance and effectiveness of the network. A more efficient method of counting or estimating data in the network is therefore needed.
Disclosure of Invention
The invention provides a processing method and a processing apparatus for fast statistics, enabling fast, accurate and real-time statistics over graph-structured distributed ledger data.
To solve the above problem, the invention discloses a processing method for fast statistics, applied to a graph-structured distributed ledger, the method comprising:
Step S1: acquiring the user's statistical requirement, the master node broadcasting the statistical requirement to each data-chain node;
Step S2: reading, according to the statistical requirement, a first data set of the corresponding start-stop sequence numbers or start-stop times in the data-chain node, and extracting the data corresponding to the statistical object in the first data set as raw data;
Step S3: taking the raw data as input to a cardinality-estimation method, setting the target parameters, hashing the raw data, computing the bucket number of each hashed datum and the position of the first 1 in the remaining bits after the bucket-number bits are removed, and updating the bucket information of that bucket number;
Step S4: after all data of the first data set has been processed, storing all bucket information of the first data set in the memory of the data-chain node;
Step S5: reading, according to the statistical requirement, all data updated after the first data set from the data-chain node as a second data set, and extracting the data corresponding to the statistical object in the second data set as newly added raw data;
Step S6: feeding the newly added raw data into the same statistical model as in step S3 and computing the corresponding bucket information;
Step S7: merging the bucket information of the newly added raw data with the bucket information in the memory of the data-chain node to obtain complete bucket information, and storing the complete bucket information in the memory of the data-chain node;
Step S8: returning the bucket information in the memory of each data-chain node to the master node, the master node performing the statistical operation on the bucket information returned by each data-chain node and feeding the result of the statistical operation back to the user.
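The merge in step S7 is possible because bucket information is mergeable: each bucket entry stores only a maximum, so taking the per-bucket maximum of two snapshots yields the same result as computing over the union of both data sets. A minimal sketch (the function name is ours, for illustration only):

```python
def merge_buckets(old: list[int], new: list[int]) -> list[int]:
    """Combine two bucket-info arrays by taking the per-bucket maximum.

    Because each entry holds the maximum first-1 position seen so far,
    max(old, new) equals the value that would have been computed over
    the union of both underlying data sets.
    """
    assert len(old) == len(new)
    return [max(a, b) for a, b in zip(old, new)]

# Example with 4 buckets: an initial snapshot merged with newly added data.
complete = merge_buckets([3, 0, 5, 2], [1, 4, 5, 0])
# complete is the element-wise maximum of the two arrays
```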
Optionally, the step S3 specifically includes the following sub-steps:
selecting, according to the statistical requirement, a hash function H that hashes each piece of raw data stored in the data-chain node as a binary buffer, outputting one hash result per datum;
dividing the hash space evenly into m parts, each called a bucket, where m is an integer power of 2;
taking each hash result as a hash sample of hash-value length L, using the first k bits of the hash value as the sample's bucket number, where m = 2^k, and the subsequent L-k bits as the bit string for subsequent estimation; hash samples with the same bucket number are assigned to the same bucket;
computing, for each hash sample, the position of the first 1 in the remaining bits after the bucket-number bits are removed, denoted m[i];
for hash samples of the same bucket, comparing the currently computed m[i] with the previous bucket information m[j]; if m[i] is larger than m[j], updating the bucket information of that bucket number to m[i], where i and j are positive integers.
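The sub-steps above can be sketched as follows. This is an illustrative sketch only: the hash function (SHA-256 truncated to 32 bits), L = 32 and k = 4 are our assumptions, not values fixed by the patent.

```python
import hashlib

K = 4                  # bucket-number bits, so m = 2**K = 16 buckets
M = [0] * (1 << K)     # bucket info m[i]: max first-1 position per bucket

def first_one_position(bits: str) -> int:
    """1-based position of the first '1' in a bit string (0 if all zeros)."""
    idx = bits.find("1")
    return idx + 1 if idx >= 0 else 0

def add(raw: bytes) -> None:
    # Uniform randomization: hash the datum to a fixed-length bit string.
    h = int.from_bytes(hashlib.sha256(raw).digest()[:4], "big")
    bits = format(h, "032b")
    bucket = int(bits[:K], 2)             # first K bits -> bucket number
    pos = first_one_position(bits[K:])    # first '1' in the remaining L-K bits
    if pos > M[bucket]:                   # keep the per-bucket maximum
        M[bucket] = pos

for x in [b"tx-1", b"tx-2", b"tx-3"]:
    add(x)
```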
Optionally, the hash result satisfies the following condition:
A1: the hash result has a fixed length;
A2: the hash results are uniformly distributed;
A3: the probability that two hash results are identical approaches zero.
Optionally, the method further includes:
removing outliers from the bucket information;
and performing deviation correction on the data subjected to outlier elimination processing.
Optionally, the processing of excluding outliers from the bucket information further includes:
removing outliers from the bucket information with the random sample consensus (RANSAC) algorithm, which is based on the following assumptions:
B1: the inlier data can be described by a model with certain parameters, while the outlier data cannot be fitted by that model;
B2: outlier data is noise, and extreme noise can lead to misinterpretation of the data;
B3: given a set of inlier data, usually a small amount, there is a procedure that estimates the parameters of a model suitable for explaining that data.
Optionally, the step of removing outliers from the bucket-averaged data with the random sample consensus RANSAC algorithm includes:
the first substep: extracting a sample from the bucket information and computing the sample's coverage of the bucket information to obtain inlier data;
the second substep: repeating the first substep several times and selecting the group with the highest coverage as the in-bucket inlier data.
Optionally, the performing deviation correction on the data after the outlier elimination processing further includes:
performing bias correction on the outlier-free data with the HyperLogLog Counting (HLLC) algorithm.
Optionally, the method further includes:
the master node making a second request to any target data-chain node whose transmission is abnormal, and stopping communication with that target data-chain node when no bucket information is returned for the second request.
Optionally, the method further includes:
when the master node receives no bucket information from the target data-chain node for the second request, querying whether the same statistical requirement exists in the historical statistics; if so, reusing the previously existing statistical result.
In order to solve the above problem, the present invention also discloses a processing apparatus for fast statistics, where the apparatus is applied in a graph structure distributed ledger, and the apparatus includes:
the statistical-requirement broadcasting module, used for acquiring the user's statistical requirement, the master node broadcasting the statistical requirement to each data-chain node;
the raw-data generation module, used for reading, according to the statistical requirement, a first data set of the corresponding start-stop sequence numbers or start-stop times in the data-chain node, and extracting the data corresponding to the statistical object in the first data set as raw data;
the first calculation module, used for taking the raw data as input to the cardinality-estimation method, setting the target parameters, hashing the raw data, computing the bucket number of each hashed datum and the position of the first 1 in the remaining bits after the bucket-number bits are removed, and updating the bucket information of that bucket number;
the first storage module, used for storing all bucket information of the first data set in the memory of the data-chain node after all data of the first data set has been processed;
the newly-added raw-data module, used for reading, according to the statistical requirement, all data updated after the first data set from the data-chain node as a second data set, and extracting the data corresponding to the statistical object in the second data set as newly added raw data;
the second calculation module, used for feeding the newly added raw data into the same statistical model as the first calculation module and computing the corresponding bucket information;
the second storage module, used for merging the bucket information of the newly added raw data with the bucket information in the memory of the data-chain node to obtain complete bucket information and storing it in the memory of the data-chain node;
and the statistical-result feedback module, used for returning the bucket information in the memory of each data-chain node to the master node, the master node performing the statistical operation on the returned bucket information and feeding the result back to the user.
Compared with the prior art, the invention has the following advantages:
According to the invention, to cope with the scattered nature of graph-structured data, statistics are computed directly on each node per the user's statistical requirement, saving statistical time; only the intermediate result (the bucket information), which occupies very little space, is transmitted, saving transmission load. Meanwhile, newly generated data is merged into the previously initialized result to keep the statistics current, so that the data fed back to the user is up to date. This achieves fast, accurate and real-time statistics over graph-structured distributed ledger data.
Drawings
FIG. 1 is a flow chart of the steps of a fast statistics processing method of the present invention;
FIG. 2 is a hash space numbering diagram of the present invention;
FIG. 3 is a schematic diagram of hash sample binning in accordance with the present invention;
FIG. 4 is a frame architecture diagram of the present invention;
FIG. 5 is a dynamic display panel for BDCHain block transactions;
FIG. 6 is a plot of the block compression experiment;
FIG. 7 is a plot of block bucket quantity versus time;
FIG. 8 is a plot of the block node-time experiment;
FIG. 9 is a plot of the block bucket quantity versus error experiment;
fig. 10 is a block diagram of a fast statistics processing apparatus according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fast data statistics is one of the most common computing scenarios in practical internet applications, with related requirements in data analysis, network monitoring, optimization work and other fields. In practice, the fast statistics process has three steps: 1. collecting and summarizing the data scattered across the databases to obtain the data set to be counted; 2. deduplicating (removing repeated data from) the collected data set to obtain the required statistic and feeding it back to the user; 3. for frequently counted scenarios such as network monitoring, updating the data rapidly by repeatedly executing the first two steps.
For the graph-structured distributed ledger, its decentralized data structure makes these three steps hard to complete quickly and greatly increases the difficulty of fast statistics. In particular, the following three typical characteristics of graph-structured distributed ledger data make fast statistics difficult:
(1) The storage nodes of the graph-structured distributed ledger are dispersed and use a distributed storage structure. The nodes are physically located in different places, even across countries or continents, and the distributed ledger has no central node. During data statistics, collecting the summarized data consumes a large amount of time in long-distance transmission; large transfers may also suffer packet loss, and the data must be secured in transit against theft.
(2) Graph structure distributed ledger data has a large amount of redundant data. According to the storage mode of the distributed ledger data, single data (mainly transactions) are stored on a plurality of nodes, so that sharing among transactions is guaranteed, and a trust mechanism is established. For data statistics, the large amount of redundant data is a great challenge for data deduplication, and a large amount of memory resources are consumed.
(3) The graph-structured distributed ledger holds massive data, and the data volume grows rapidly in real time. In frequently counted scenarios, newly added data must be fetched in time and merged with the previous mass of data for deduplication; this continuous massive computation is a great challenge to both memory resources and computing performance.
The graph-structured distributed ledger is decentralized; data is stored in transactions, a certain number of transactions are replicated into multiple copies and packaged into a block, and the blocks are again replicated into multiple copies and stored on each data-chain node.
In view of the above technical problems and the characteristics of the graph structure distributed ledger, referring to fig. 1, a flowchart of steps of a processing method for fast statistics according to the present invention is shown, where the method is applied to the graph structure distributed ledger, and the method may specifically include the following steps:
Step S1: acquiring the user's statistical requirement, the master node broadcasting the statistical requirement to each data-chain node;
in the invention, an asynchronous client-server service is directly constructed for all nodes, a client is deployed on a main node, and an agent server is deployed on each numerical chain node. Firstly, a user calls a web-service, a client of a main node inputs parameter fields of statistical requirements, including a statistical quantity field name, a statistical object (mainly transaction or block in a distributed account book), an index format, and statistics according to a serial number index or a time index, and the starting serial number or the ending serial number or the time is counted. Then, after the main node receives the user statistical requirements, the main node packs and broadcasts the requirements to distribute to all the numerical chain nodes, namely the server of each numerical chain node asynchronously requested by the main node client. And finally, after the server of each numerical chain node receives the statistical requirement, entering a corresponding statistical operation process.
Step S2: reading, according to the statistical requirement, a first data set of the corresponding start-stop sequence numbers or start-stop times in the data-chain node, and extracting the data corresponding to the statistical object in the first data set as raw data;
Each data-chain node has a server and a database in which data is stored. In the invention, the server of a data-chain node reads from the database the first data set corresponding to the start-stop sequence numbers or start-stop times in the statistical requirement, and then takes the data corresponding to the statistical object in the first data set as the raw data to be counted.
In actual use, the end time or end sequence number of most statistical requirements is the latest data, so these parameters in the statistical requirement default to the latest data.
To design a fast statistical method for graph-structured blockchain data, we first analyze the statistical requirements and the characteristics of the statistical data. In the context of big data, there are two main statistical requirements. One is to require accurate statistical results, e.g. error audits and corrections, or clearing and settlement of money that depends on the data. Accurate statistics over massive data consumes a large amount of time and space, and if the data is too large the program may even crash; this requirement suits cases with little data or where time consumption is negligible. The other is to require substantially accurate, i.e. probabilistically accurate, results, such as controlling data growth, monitoring traffic anomalies, or real-time or fast settlement of large amounts, where the user can tolerate error. Such requirements call for fast, low-cost and substantially accurate statistics; users often care not about the exact count but about its magnitude or trend.
The statistical method of the present invention is directed to the latter case, i.e., substantially accurate statistics, i.e., estimates. Next, the statistical method of the present invention is set forth in steps S3 to S7, the input of the method is raw data, and the output is statistical results.
Step S3: taking the raw data as input to the cardinality-estimation method, setting the target parameters, hashing the raw data, computing the bucket number of each hashed datum and the position of the first 1 in the remaining bits after the bucket-number bits are removed, and updating the bucket information of that bucket number;
In the invention, an R-HLLC cardinality-estimation method based on RANSAC (random sample consensus) is designed, comprising four steps: first, uniform randomization; second, bucket averaging; third, outlier removal; fourth, bias correction.
The first step, uniform randomization, comprises the steps of:
Step 301: in uniform randomization, a hash function H is selected according to the statistical requirement to hash each piece of raw data stored on the data-chain node as a binary buffer, outputting one hash result per datum. The hash results satisfy the following conditions. A1: fixed length; A2: uniform distribution; A3: the probability that two results are identical approaches zero.
Second, bucket averaging, which rests on the following mathematical basis:
assuming that a is a hash sample, the hash value has a length of L, i.e., is fixed to L bits. a is a binary bit string with length of L bits. Referring to fig. 2, which is a hash space numbering diagram of the present invention, the bit string with length L in fig. 2 is numbered 0, 1, 2, 3, … …, and L-1 from left to right according to the bit. Since the hash space is subject to uniform distribution and a is a sample drawn randomly, each bit of a should be independent of each other and satisfy the following distribution:
p (X ═ k) { (0.5(k ═ 0) @0.5(k ═ 1)) formula (1);
i.e. each bit of a has an equal probability of 0 or 1 and the bits are independent of each other. Let p (a) represent the first "1" in the hash space of a "The position of occurrence, ignoring the special case that the bit string is all 0 (probability is 1/2L), if all samples in the hash space, i.e. the hash value formed by all data, are traversed, and the maximum value of p (a) is taken as pmax, then 2 can be takenpmaxAs a coarse cardinality estimate of this time sample. I.e. the result of the estimation is:
n̂ = 2^pmax    formula (2);
at this time, the rough estimation that can be made is based on the following fact: statistically, for a string of bits that are independent for each bit and follow a 0-1 distribution, scanning sequentially from left to right resembles throwing a coin of uniform texture up to L times until a result is obtained that is right side up. Statistically, this is called a Bernoulli process. It is clear that the problem is that the probability of a face being hit once is 1/2, that of a face being hit twice is 1/4, and that the probability of hitting k times is 1/2k
Scan the hash result a from left to right until the first "1" is encountered. The probability that more than k bits must be scanned is obviously 1/2^k, i.e. the probability that the first k bits are all "0". Hence, while scanning a, the probability that the number of scanned bits does not exceed k is 1 - 1/2^k, and for the hash results of n data, the probability that none of the n scans exceeds k bits is
Pn(X ≤ k) = (1 - 1/2^k)^n    formula (3);
similarly, scanning n hash results, the probability that at least one scan yields p(a) ≥ k is
Pn(X ≥ k) = 1 - (1 - 1/2^(k-1))^n    formula (4);
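As a quick numeric sanity check of formulas (3) and (4), the probabilities can be evaluated directly; the function names and the example values of n and k below are ours, chosen only to illustrate the two limiting regimes:

```python
def p_all_short(n: int, k: int) -> float:
    """Formula (3): probability that all n scans stop within k bits."""
    return (1 - 1 / 2 ** k) ** n

def p_some_long(n: int, k: int) -> float:
    """Formula (4): probability that at least one of n scans reaches k bits."""
    return 1 - (1 - 1 / 2 ** (k - 1)) ** n

# n far greater than 2**k: some scan almost surely exceeds k bits,
# so P_n(X <= k) is vanishingly small.
tiny = p_all_short(10 ** 6, 10)

# n far smaller than 2**k: a run of ~k leading zeros is almost never
# observed, so P_n(X >= k) is vanishingly small.
rare = p_some_long(4, 20)
```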
Thus, when n is far greater than 2^k, the probability Pn(X ≤ k) approaches zero; when n is far smaller than 2^k, the probability Pn(X ≥ k) approaches zero. Therefore, for a data set of cardinality n, if n is far greater than 2^pmax, the probability that pmax is the current value is almost 0; likewise, if n is far smaller than 2^pmax, the probability that pmax is the current value is almost 0. We therefore consider 2^pmax and n to be of similar size, i.e. 2^pmax is a coarse estimate of n.
The above cardinality estimation with 2^pmax as the statistic incurs large errors due to chance, so we adopt bucket averaging to reduce their influence. The specific operation is as follows:
step 302: the hash space is divided equally into m shares, each of which is called a bucket, where m is an integer power of 2.
Step 303: each hash result is taken as a hash sample of hash-value length L; the first k bits of the hash value serve as the sample's bucket number, where m = 2^k, and the subsequent L-k bits serve as the bit string for subsequent estimation; hash samples with the same bucket number are assigned to the same bucket.
Step 304: the position of the first 1 in the remaining bits of each hash sample after the bucket-number bits are removed is computed and recorded as m[i]; this is the pmax of each bucket. Referring to fig. 3, a hash-sample bucketing diagram of the invention is shown.
Step 305: for hash samples of the same bucket, the currently computed m[i] is compared with the previous bucket information m[j]; if m[i] is larger than m[j], the bucket information of that bucket number is updated to m[i], where i and j are positive integers.
To illustrate the bucketing operation with a concrete example: assume the fixed hash length is 32 bits, k = 4 and m = 16, and the hash value of a is "00100000001011001101001000100000". Then a is assigned by its first 4 bits, "0010", to bucket No. 2; in the remaining bits the first "1" appears at position 7, so the p(a) corresponding to a is 7, and if the p of every other sample in the bucket is less than 7, m[2] = 7.
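The worked example with the 32-bit hash value can be checked directly (illustrative Python):

```python
a = "00100000001011001101001000100000"  # the 32-bit hash value from the text
k = 4
bucket = int(a[:k], 2)          # first 4 bits "0010" -> bucket No. 2
p = a[k:].find("1") + 1         # 1-based position of the first "1" in the rest
assert (bucket, p) == (2, 7)    # matches the values stated in the example
```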
Theoretically, averaging over the m buckets eliminates accidental error to a certain extent; that is, the estimate is:
n̂ = m · 2^((1/m) · Σ_{i=1}^{m} m[i])    formula (5);
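The bucket-averaged estimate (m times 2 raised to the arithmetic mean of the bucket values, without any bias-correction constant) can be sketched as follows; the function name is ours and the sketch is illustrative only:

```python
def rough_estimate(M: list[int]) -> float:
    """Estimate the cardinality as m * 2**(arithmetic mean of the m[i])."""
    m = len(M)
    return m * 2 ** (sum(M) / m)

# If all 4 buckets record a first-1 position of 3, the estimate is
# 4 * 2**3 = 32.0.
est = rough_estimate([3, 3, 3, 3])
```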
In practice, however, since no hash function distributes samples perfectly uniformly, the sample data in a bucket may contain abnormal values caused by non-uniformity, and once an abnormal value distorts the bucket information it affects the result. We call such distorted bucket information an outlier.
Next, in a preferred embodiment of the present invention, a method of processing distorted bucket information is proposed, the method further comprising:
step 306: removing outliers from the bucket information;
step 307: and performing deviation correction on the data subjected to outlier elimination processing.
For step 306, specifically, the random sample consensus RANSAC algorithm is used to remove outliers from the bucket information. The RANSAC algorithm, proposed by Fischler and Bolles in 1981, iteratively estimates the parameters of a predetermined model from a set of samples containing outliers in order to remove those outliers. RANSAC is a non-deterministic algorithm that produces a reasonable result only with a certain probability; more iterations increase that probability. The RANSAC algorithm is based on the following assumptions:
b1: there is a model with some parameters for the inlie data to describe its distribution, and the outlier data outler cannot be fitted by the model;
b2: outlier data is noise, and extreme noise can lead to misinterpretation of the data;
b3: given a set of inlier data, usually a small amount of in-memory data, there is a procedure that can estimate parameters suitable for explaining this set of data.
The RANSAC is realized by the following steps:
1) randomly selecting some original data, assuming that they are a subset of the inlier data;
2) establishing model fitting and formulating a model loss function;
3) verifying the model by using the residual data and calculating a loss function;
4) counting the data volume of the inner group contained in the model;
5) and (4) performing multiple iterations, and selecting the model with the largest data quantity of the inner group as a final result.
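Steps 1) to 5) above can be sketched as a generic RANSAC loop (an illustrative sketch; the fit and loss functions shown are assumptions, not the patent's concrete model):

```python
import random

def ransac(data, fit, loss, threshold, n_sample, n_iter=100, seed=0):
    """Generic RANSAC loop: repeatedly sample a subset assumed to be inliers,
    fit a model, count the data points the model explains, and keep the
    model with the largest inlier set."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iter):
        subset = rng.sample(data, n_sample)        # 1) random subset, assumed inliers
        model = fit(subset)                        # 2) fit a candidate model
        inliers = [x for x in data                 # 3)-4) score data against the model
                   if loss(model, x) <= threshold]
        if len(inliers) > len(best_inliers):       # 5) keep the best model
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy example: estimate a constant value from data polluted by outliers
data = [7, 7, 7, 8, 7, 30, 7, 6, 7, 50]
fit = lambda s: sum(s) / len(s)          # model = mean of the subset
loss = lambda m, x: abs(x - m)           # residual = distance to the mean
model, inliers = ransac(data, fit, loss, threshold=2, n_sample=3)
# the outliers 30 and 50 are excluded from the inlier set
```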
In the invention, the step of removing outliers from the data after barrel averaging by using RANSAC algorithm comprises:
the first substep: extracting a sample from the barrel information, and calculating the coverage rate of the sample on the barrel information to obtain inner group data;
and a second substep: repeating the first substep a plurality of times, and selecting the group with the highest coverage rate as the in-bucket inlier data. Finally, the in-bucket inlier data are averaged to calculate the final estimation result.
Although RANSAC can, to some extent, resist the interference of outliers on the statistical result, it requires many sampling iterations, and if the sample size is too small the result is inaccurate; the method is therefore only suitable when the bucket amount is large enough.
Although an estimator with controlled error can be obtained by the three steps above, statistical analysis shows that it is not an unbiased estimate but an asymptotically unbiased one. In practice this asymptotic unbiasedness is acceptable. We do not expand on why this conclusion holds, and only analyze how to correct the deviation.
Specifically, in step 307, the present invention uses the HyperLogLog Counting (HLLC) algorithm to perform bias correction on the data after outlier elimination.
Since the statistics are very sensitive to outliers, we use harmonic means instead of geometric means in order to further reduce the interference of outliers. After using the harmonic mean, the result of the estimation becomes:
$$\hat{E} = \alpha_m \cdot m^{2} \cdot \left( \sum_{j=1}^{m} 2^{-m[j]} \right)^{-1}$$
wherein:
$$\alpha_m = \left( m \int_{0}^{\infty} \left( \log_{2}\frac{2+u}{1+u} \right)^{m} \mathrm{d}u \right)^{-1} \qquad (7)$$
the invention can complete the statistics of the data through the four steps. In the following, we analyze memory usage, time consumption, error control and merging characteristics of the R-HLLC algorithm to verify that the three-point problem proposed by the present invention can be solved by this method, so as to realize fast statistics on the graph structure distributed book data.
Memory usage analysis: the R-HLLC algorithm achieves O(log n) compression of the data, and the memory it uses is related only to the bucket amount m and the hash result length L. The memory size in bytes equals the bucket amount m multiplied by the bucket-information size divided by 8, and by definition the bucket-information size is smaller than L. The raw data are first compressed to hash values by the hash function, an O(log n) compression, and the hash values are compressed again into bucket information, another O(log n) compression, so the whole is an O(log n) compression. In fact, subsequent statistical operations only need to keep the bucket information, greatly reducing resource occupation and traffic consumption.
Time consumption: each numerical chain node can complete its statistics in linear time; the time complexity is O(n).
And (3) error control: the error formula is:
$$\sigma \approx \frac{1.04}{\sqrt{m}} \qquad (8)$$
As can be seen from formula (8), the error is related only to the bucket amount m, so the error can be controlled simply by controlling the bucket amount. It should be noted that the O(log n) compression mentioned above is based on the Linear Counting algorithm with probability-model complexity O(q) proposed by Kyu-Young Whang et al. in 1990. Formula (8) is derived from the improved LogLog Counting algorithm built on Linear Counting; the calculation principle is prior art and is omitted here to save space.
Characteristics can be merged: for data using the same hash function and sub-buckets, merging can be completed only by storing larger bucket information in the corresponding buckets during merging, so that the method has an efficient merging characteristic.
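The mergeable characteristic described above reduces to an element-wise maximum over bucket arrays produced with the same hash function and the same bucket count (a minimal sketch):

```python
def merge_buckets(a, b):
    """Merge two bucket-information arrays produced with the same hash
    function and the same bucket count m: keep the larger value per bucket."""
    assert len(a) == len(b), "both sides must use the same bucket amount m"
    return [max(x, y) for x, y in zip(a, b)]

merged = merge_buckets([3, 0, 7, 2], [1, 5, 4, 2])
print(merged)  # -> [3, 5, 7, 2]
```

Because max is associative and commutative, bucket information from any number of nodes, or from historical and newly added data, can be merged in any order with the same result.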
Step S4: after all data processing of the first data set is completed, all barrel information of the first data set is stored in a memory of the data link node;
step S5: reading all data updated after the first data set from the chain nodes according to statistical requirements, taking the data as a second data set, and extracting data corresponding to corresponding statistical objects in the second data set as newly-added original data;
step S6: inputting the newly added original data into the same statistical model as the step S3, and calculating corresponding bucket information;
step S7: combining the bucket information of the newly added original data with the bucket information in the memories of the chain nodes to obtain complete bucket information, and storing the complete bucket information in the memories of the chain nodes;
the solution proposed in step S3 can basically cover most of the basic statistics processes, but for the high-frequency statistics requirements, such as real-time statistics of block production rate for traffic monitoring, the present invention only needs to keep the bucket information of the historical statistics, and superimpose and combine the newly added bucket information of the statistical data and the bucket information of the historical statistical data through the steps proposed in steps S4 to S7, and does not need to repeatedly perform statistics on the historical data. The processing methods of step S4 to step S7 refer to step S1 to step S3. Of course, in the process of processing the newly added original data, the same problem as in the process of processing the original data may also occur, and the present invention may still adopt steps 306 to 307, that is, the bucket information is processed by removing outliers, and then the data processed by removing outliers is subjected to deviation correction, so as to further improve the accuracy of the statistical result of the present invention.
In the invention, for timed statistical tasks, such as counting the transaction rate every ten seconds, a transceiving server agent runs continuously on each data link node: at each fixed interval it collects the data generated during that interval, completes the counting step, stores the result in memory, and waits for the master node's client to broadcast statistical requirements at any time so as to update the next bucket information.
Step S8: and returning the barrel information in the memory of each numerical chain node to the main node, wherein the main node is used for carrying out statistical operation on the barrel information returned by each numerical chain node and feeding back the result of the statistical operation to the user.
In the invention, after each numerical chain node completes the statistical requirement, the result of the statistical process, a string of intermediate statistical information, is stored in memory, and the numerical chain node returns this intermediate statistical information to the master node through an asynchronous HTTP call made by its server.
Then, the master node collects information from all child nodes (the several numerical chain nodes). At this point, the method of the embodiment of the present invention further includes:
and the main node makes a secondary request to the target numerical link node with abnormal transmission, and stops communication with the target numerical link node when bucket information returned by the target numerical link node aiming at the secondary request is not received.
The method for solving the transmission abnormity comprises the following steps:
when the main node does not receive the bucket information returned by the target number chain node aiming at the secondary request, inquiring whether the same statistical requirements exist in historical statistics; if yes, the statistical result which is existed before is reused.
A secondary request is made to any node whose transmission failed or timed out; if no statistical result is obtained even then, that node's result is rejected. It should be noted that, for a single failed node, directly discarding its statistical result does not overly affect the overall result: under the distributed-ledger data-storage mechanism, each piece of data is randomly replicated and backed up on several nodes, so removing a single failed node has no impact. Theoretically, if each piece of data is replicated across n nodes, only the simultaneous failure of all n nodes would make the data backed up on them unavailable, and in practice the amount of data affected is extremely small compared with the whole.
In conclusion, in response to the scattered nature of graph-structured data, the present invention performs data statistics directly on each node according to the user's statistical requirements, saving statistical time; only intermediate results (bucket information) with an extremely small space footprint are transmitted, saving the transmission load required for statistics. Meanwhile, newly generated results are merged into the previously initialized results to ensure the timeliness of the statistics, so that the data fed back to the user are always the latest.
Next, to show how the above steps are implemented, the present invention designs a framework for the above fast-statistics processing method, comprising a transceiving part and a statistics part; referring to fig. 4, a framework diagram of the present invention is shown. In fig. 4, the transceiving part is handled by the communication module, and the statistics part by the preprocessing module and the real-time statistics module. The communication module corresponds to steps S1 and S8 of the processing method of the present invention; the preprocessing module corresponds to steps S3 to S4; the real-time statistics module corresponds to steps S5 to S7.
In the invention, the transceiving part is the communication framework of the overall deployment scheme, whose main purpose is to solve the communication problem among the nodes. First, for the communication protocol, the invention adopts XML-RPC (Extensible Markup Language Remote Procedure Call). XML-RPC is a distributed computing protocol that uses XML for remote method calls; it uses HTTP as the transport mechanism, can call or request programs on other operating systems, and expects a reply. In other words, XML-RPC allows a user or developer to send requests in XML format to an external program. Second, for asynchronous transceiving, to let the master node control each numerical chain node asynchronously, the invention adopts the python thread pool ThreadPool and uses it to implement the communication required by each node. Meanwhile, a timeout fault-tolerance mechanism is added to ensure that the statistical requirement can be completed within a specified time.
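The XML-RPC plus thread-pool arrangement can be sketched with Python's standard library as follows. The method name `get_buckets`, the ports, and the dummy bucket payload are illustrative assumptions, not the patent's actual interfaces:

```python
import threading
import xmlrpc.client
from multiprocessing.pool import ThreadPool
from xmlrpc.server import SimpleXMLRPCServer

def serve_node():
    """Start a toy 'chain node' that exposes its bucket information over XML-RPC."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(lambda: [0] * 16, "get_buckets")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_address[1]  # ephemeral port chosen by the OS

ports = [serve_node() for _ in range(3)]

def fetch(port):
    """Master-side call: request one node's bucket information over HTTP/XML-RPC."""
    with xmlrpc.client.ServerProxy(f"http://127.0.0.1:{port}") as proxy:
        return proxy.get_buckets()

# The master node queries all nodes concurrently via a thread pool; the
# timeout on get() plays the role of the fault-tolerance mechanism above.
pool = ThreadPool(len(ports))
async_results = [pool.apply_async(fetch, (p,)) for p in ports]
results = [r.get(timeout=5) for r in async_results]
```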
In the statistics part, for data hashing the invention adopts murmurhash, a non-cryptographic hash function. It is well suited to general hash-based retrieval operations; it exhibits a good random distribution even for keys with strong regularity, has a low collision rate, and thus meets the algorithm's requirements on data hashing well.
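In Python, murmurhash is typically provided by the third-party `mmh3` package (`mmh3.hash(key)` returns a signed 32-bit integer). To stay self-contained, the sketch below derives a fixed-length 32-bit value from the standard library's hashlib instead, as a stand-in illustrating the properties the algorithm needs: fixed output length and a scattered distribution even for regular keys:

```python
import hashlib

def hash32(key: str) -> int:
    """Stand-in 32-bit hash (first 4 bytes of MD5); the patent itself
    uses murmurhash, for which this is only an illustrative substitute."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

# Strongly regular keys ("block-0", "block-1", ...) still map to
# well-scattered fixed-length values.
values = [hash32(f"block-{i}") for i in range(1000)]
print(all(0 <= v < 2 ** 32 for v in values))  # -> True (fixed 32-bit range)
```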
To further verify the fast statistical method for graph-structured distributed ledger data designed and implemented by the invention, a case study is carried out on block-total statistics in the existing BDChain distributed ledger: the accuracy of the statistical results obtained by the method is compared against the true statistical results, and the spatial compression ratio and statistical duration are analyzed, thereby verifying the effectiveness of the method.
Example study-Block Total statistics
Fig. 5 is a BDChain block-transaction dynamic display panel, which dynamically displays to the user detailed transaction information, client information, block total, transaction total, block production rate, transaction production rate, physical location information of nodes, current statistical information of each node, and the like. The block-total statistics on the panel are updated every 5 seconds, with newly generated blocks superimposed into the statistics. When the number of blocks is small, data can be collected and statistics redone by polling the nodes, but for the millions of blocks held at present, that approach is clearly no longer applicable. Note that the display panel only serves traffic monitoring and lets the user grasp the current block amount and its growth trend; the precise value of the block total is not of concern. Therefore, the invention takes the block-total statistics of the BDChain distributed ledger as an example to show the application of the method in fast statistics of distributed ledger data, together with evaluation and analysis of the experimental results.
The study is deployed on five real data-chain nodes of BDChain, with another master node responsible for distributing the statistics. Because the method's statistics, although error-controlled, involve a degree of randomness, five groups of statistical experiments on the total number of blocks were performed.
First, experimental preparation was performed. In the experimental environment, the experimental data comprise five groups of blocks; each group contains about five hundred thousand blocks distributed across the nodes, that is, about one hundred thousand blocks per node on average, and serves as the raw data to be counted. Since the hash field of a block is its unique identifier, counting the hash fields of the blocks yields the total number of blocks. The invention extracts the hash fields of all block data for statistics; each group of hash data totals about 33 MB.
The invention verifies the applicability of the method on the block total quantity statistics from three evaluation indexes of compression rate, statistical consumed time and accuracy.
(1) High compression ratio: the memory occupied by the fast statistical method used by the invention is related only to the bucket amount and the hash function. Typically, the bucket amount is an integer power of 2 and the hash space is 32 or 64 bits. Taking the actually used 32-bit hash as an example, the maximum value in each bucket is 32, which needs 5 bits of storage; if the bucket amount is m, the memory occupied by m buckets is only m × 5/8 bytes. Thus 2^12 buckets require 2.5 KB of memory, 2^16 buckets require 40 KB, and 2^20 buckets require 640 KB, so the statistical task can be completed with very little memory. Transmission likewise only requires sending the bucket information, so the compression is from raw data to bucket information, an O(log n) compression.
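The byte counts above follow directly from m × 5/8; a quick check, assuming, as in the text, 5 bits per bucket for a 32-bit hash:

```python
def bucket_memory_bytes(m: int, bits_per_bucket: int = 5) -> int:
    """Memory footprint: m buckets times 5 bits per bucket (values up to 32),
    divided by 8 to convert bits to bytes."""
    return m * bits_per_bucket // 8

print(bucket_memory_bytes(2 ** 12))  # -> 2560   (2.5 KB)
print(bucket_memory_bytes(2 ** 16))  # -> 40960  (40 KB)
print(bucket_memory_bytes(2 ** 20))  # -> 655360 (640 KB)
```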
Before this method, the statistical approach used by the distributed ledger was to aggregate all data and then deduplicate: the data were compressed and then transmitted over HTTP, so the compression achieved was only plain data compression. Referring to fig. 6, a block compression experiment diagram is shown. Taking zip compression as the baseline, the comparison of compression amounts between the two approaches demonstrates the advantage of the present method in compressed transmission.
(2) And (3) fast statistics: for the statistical requirement of high-frequency use, the invention verifies the time consumed by statistics except the initialization time of data. The statistical duration includes two parts: firstly, network transmission time consumed by demand distribution and bucket information return is counted; and secondly, calculating bucket information by each numerical chain node and calculating the statistical time consumed by combining the bucket information by the main node and calculating the final result.
The invention evaluates the statistical duration from two aspects, first verifying the influence of different bucket amounts on the total statistical time. Taking the 8th, 12th, 16th, and 20th powers of 2 as the bucket-amount parameter, statistical verification is performed on the 5 groups of data under 5 numerical chain nodes, and the final results are shown as a box plot in FIG. 7, a Block bucket amount-time experimental graph. As the experimental results show, the time required for statistics grows as the bucket amount increases, but overall it can still be completed within seconds. For bucket amounts below 2^16, fast statistics can essentially be completed within one second.
Second, the influence of the number of nodes on the total statistical time is verified: taking 5 nodes as the increment unit, the statistical time of the distributed ledger is simulated from 5, 10, and so on up to 100 chain nodes. Statistical times for the 5 groups of data are collected with a bucket amount of 2^16; the experimental results are shown in fig. 8, a Block node-time experimental graph. As can be seen from fig. 8, because an asynchronous communication mechanism is adopted, increasing the number of nodes has no obvious influence on the statistical duration. The method can therefore scale to a large number of nodes and meet actual statistical requirements.
(3) The accuracy is as follows: the accuracy is a key part for verifying the method. The invention utilizes error rate to measure the accuracy of the experimental result, and the error rate is defined as follows:
error rate = |accurate value − estimated value| / accurate value × 100%
For the invention, the accurate value is the exact statistical value of the data, and the estimated value is the statistical value obtained by the method. Accuracy is evaluated from two aspects. First, the effect of different bucket amounts on the statistical error rate is verified. Taking the 8th, 12th, 16th, and 20th powers of 2 as the bucket-amount parameter, statistical verification is performed on the 5 groups of data under 5 numerical chain nodes, and the final results are shown as a box plot in fig. 9, a Block bucket amount-error experimental graph. As can be seen from the graph, the statistical error rate decreases rapidly as the bucket amount increases. With 2^16 buckets the error can be controlled to about 1%, and with 2^20 buckets to below one part in a thousand. The fast statistical method can therefore satisfy users' statistical requirements well in the face of big-data statistics.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 10, a block diagram of a fast statistics processing apparatus according to the present invention is shown, where the apparatus is applied to a graph-structured distributed ledger, and the apparatus may specifically include the following modules:
a statistical requirement broadcasting module 1001, configured to obtain statistical requirements of users, where the master node broadcasts and distributes the statistical requirements to each of the daisy-chained nodes;
an original data generating module 1002, configured to read a first data set of a corresponding start-stop sequence number or start-stop time from a daisy chain node according to the statistical requirement, and extract data corresponding to a corresponding statistical object in the first data set as original data;
a first calculating module 1003, configured to use the original data as an input of a radix estimation method, set a target parameter, hash the original data, calculate a barrel number corresponding to the hashed original data and a position where a first "1" of a remaining digit of the original data after the barrel number is removed occurs, and update barrel information of the barrel number;
a first saving module 1004, configured to, after all data processing of the first data set is completed, save all bucket information of the first data set in the memory of the daisy-chain node;
a newly added original data module 1005, configured to read all data updated after the first data set from the chain nodes according to the statistical requirement, and use the data as a second data set, and extract data corresponding to a corresponding statistical object in the second data set as newly added original data;
a second calculation module 1006, configured to input the newly added original data into a statistical model that is the same as the first calculation module, and calculate corresponding bucket information;
a second storing module 1007, configured to merge the bucket information of the newly added original data with the bucket information in the memories of the plurality of link points to obtain complete bucket information, and store the complete bucket information in the memories of the plurality of link points;
a statistical result feedback module 1008, configured to return the bucket information in the memory of each numerical link node to the master node, where the master node is configured to perform statistical operation on the bucket information returned by each numerical link node, and feed back a result of the statistical operation to the user.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
An embodiment of the present invention further provides an apparatus, including:
one or more processors; and
one or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform one or more fast statistical processing methods as described in embodiments of the invention.
Embodiments of the present invention further provide a computer-readable storage medium, in which a stored computer program enables a processor to execute the processing method for fast statistics according to the embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above method and apparatus for processing fast statistics provided by the present invention are introduced in detail, and a specific example is applied in the text to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (9)

1. A processing method for fast statistics applied to a graph structure distributed ledger is characterized by comprising the following steps:
step S1: acquiring the statistical requirements of users, and broadcasting and distributing the statistical requirements to each digital chain node by the main node;
step S2: reading a first data set of corresponding start-stop sequence numbers or start-stop time in the digital chain nodes according to the statistical requirements, and extracting data corresponding to corresponding statistical objects in the first data set as original data;
step S3: taking the original data as input of a radix estimation method, setting a target parameter, hashing the original data, calculating a barrel number corresponding to the hashed original data and a position where a first 1 of a residual digit of the original data after the barrel number is removed appears, and updating barrel information of the barrel number;
step S4: after all data processing of the first data set is completed, all barrel information of the first data set is stored in a memory of the data link node;
step S5: reading all data updated after the first data set from the chain nodes according to the statistical requirements, taking the data as a second data set, and extracting data corresponding to a corresponding statistical object in the second data set as newly-added original data;
step S6: processing the newly added original data by adopting the method of the step S3 to obtain bucket information of the newly added original data;
step S7: combining the bucket information of the newly added original data with the bucket information in the memories of the chain nodes to obtain complete bucket information, and storing the complete bucket information in the memories of the chain nodes;
step S8: the method comprises the steps that bucket information in a memory of each numerical chain node is returned to a main node, and the main node is used for carrying out statistical operation on the bucket information returned by each numerical chain node and feeding back a statistical operation result to a user;
the step S3 specifically includes the following sub-steps:
according to the statistical requirement, selecting a hash function H to take each piece of original data stored in the data link node as binary cache buffer data for hashing, and outputting a plurality of hash results;
averagely dividing the hash space into m parts, wherein each part is called a bucket, and m is an integer power of 2;
taking the hash result as a hash sample, wherein the hash value length of the hash sample is L, the first k bits of the hash value are taken as the barrel number of the hash sample, and m = 2^k; the subsequent L−k bits serve as a bit string for the subsequent estimation, and the hash samples with the same barrel number are distributed to the same barrel;
calculating, for each hash sample, the position where the first 1 appears in the remaining bits of the hash value after the barrel number is removed, and recording it as m[i];
and aiming at the Hash sample of the same bucket, comparing the currently calculated m [ i ] with the previous bucket information m [ j ], if m [ i ] is larger than m [ j ], updating the bucket information of the bucket number to m [ i ], wherein i and j are positive integers.
2. The method of claim 1, wherein the hash results satisfy the following conditions:
a1: each hash result has a fixed length;
a2: the hash results are uniformly distributed;
a3: the probability that two hash results are identical approaches zero.
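A quick empirical check of conditions a1-a3, using a 64-bit truncation of SHA-256 as an illustrative choice of H (the patent does not mandate a particular hash function):

```python
import hashlib

def h64(item):
    """Fixed-length 64-bit hash (a1); SHA-256 output is close to uniform (a2)
    and 64-bit truncations collide with negligible probability (a3)."""
    return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")

values = [h64(i) for i in range(10000)]
assert len(set(values)) == len(values)   # a3: no collisions among 10000 samples

bins = [0] * 16
for v in values:
    bins[v >> 60] += 1                   # a2: histogram over the top 4 bits
# each bin should be near 10000 / 16 = 625
```

Condition a2 matters because the bucket number is taken from the first k bits: a non-uniform hash would fill some buckets far more than others and bias the estimate.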
3. The method of claim 1, further comprising:
removing outliers from the bucket information;
and performing bias correction on the data after outlier removal.
4. The method of claim 3, wherein removing outliers from the bucket information further comprises:
removing outliers from the bucket information using a random sample consensus (RANSAC) algorithm, wherein the RANSAC algorithm is based on the following assumptions:
b1: the inlier data can be described by a model with certain parameters, while the outlier data cannot be fitted by that model;
b2: the outlier data are noise, and extreme noise can lead to misinterpretation of the data;
b3: given a set of inlier data (here, in-memory data of small volume), there exists a procedure for estimating the parameters of a model that explains this set of data.
5. The method of claim 4, wherein removing outliers from the bucket information using the RANSAC algorithm comprises:
a first substep: drawing a sample from the bucket information and computing the coverage of that sample over the bucket information to obtain candidate inlier data;
and a second substep: repeating the first substep a plurality of times and selecting the group with the highest coverage as the inlier data of the buckets.
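A hedged sketch of the two substeps above. The fitted "model" (the sample mean), the tolerance, and the round count are all assumed parameters; claim 5 fixes none of them.

```python
import random

def ransac_inliers(bucket_info, rounds=20, sample_size=4, tol=2.0, seed=0):
    """First substep: draw a sample, fit a trivial model, and measure its
    coverage over the bucket information. Second substep: repeat and keep
    the candidate set with the highest coverage."""
    rng = random.Random(seed)
    best_inliers, best_coverage = list(bucket_info), 0.0
    for _ in range(rounds):
        sample = rng.sample(bucket_info, sample_size)
        center = sum(sample) / sample_size              # fit the model
        inliers = [v for v in bucket_info if abs(v - center) <= tol]
        coverage = len(inliers) / len(bucket_info)      # coverage of the sample
        if coverage > best_coverage:
            best_coverage, best_inliers = coverage, inliers
    return best_inliers

# one extreme bucket value (30) among otherwise typical first-1 positions
clean = ransac_inliers([4, 5, 5, 6, 4, 30, 5, 6])
```

A single extreme bucket value enters the estimate as 2^(-m[i]) and can dominate the result, which is why it is cheaper to discard it than to correct for it downstream.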
6. The method of claim 3, wherein performing bias correction on the data after outlier removal further comprises:
performing bias correction on the data after outlier removal using the HyperLogLog Counting (HLLC) algorithm.
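A sketch of the HLLC estimate with its standard bias correction: a harmonic mean over the bucket values scaled by the constant alpha_m, plus the usual small-range (linear counting) correction. The patent does not spell out its exact correction; these are the formulas from the HyperLogLog literature.

```python
import math

def hllc_estimate(buckets):
    """Bias-corrected cardinality estimate from bucket information."""
    m = len(buckets)
    alpha = 0.7213 / (1 + 1.079 / m)              # bias-correction constant
    raw = alpha * m * m / sum(2.0 ** -b for b in buckets)
    if raw <= 2.5 * m:                            # small-range correction
        zeros = buckets.count(0)
        if zeros:
            return m * math.log(m / zeros)        # linear counting
    return raw
```

The harmonic mean damps the influence of unusually large bucket values, and the linear-counting branch repairs the known upward bias of the raw formula when many buckets are still empty.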
7. The method of claim 1, further comprising:
the main node issuing a secondary request to a target datachain node whose transmission is abnormal, and stopping communication with the target datachain node when no bucket information is returned by the target datachain node for the secondary request.
8. The method of claim 7, further comprising:
when the main node receives no bucket information from the target datachain node for the secondary request, querying whether the same statistical requirement exists in the historical statistics, and if so, reusing the previously existing statistical result.
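The fault handling of claims 7 and 8 can be sketched as follows; `request_fn` and the history cache are illustrative stand-ins, not interfaces defined by the patent.

```python
def collect_bucket_info(node, request_fn, history, requirement):
    """Issue the initial request plus one secondary request; if both fail,
    stop contacting the node and fall back to a cached historical result
    for the same statistical requirement, when one exists."""
    for _ in range(2):                   # initial request + secondary request
        result = request_fn(node, requirement)
        if result is not None:
            return result
    return history.get(requirement)      # reuse a prior statistical result

calls = []
def flaky_request(node, requirement):
    calls.append(node)
    return None                          # simulate an abnormal datachain node

history = {"distinct users in July": 42}
result = collect_bucket_info("node-3", flaky_request, history,
                             "distinct users in July")
# result == 42, reached after exactly two attempts
```

Reusing a cached result trades freshness for availability: the answer may miss the newest data, but the statistical request still completes despite the failed node.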
9. A processing apparatus for fast statistics in a graph-structured distributed ledger, the apparatus comprising:
a statistical requirement broadcasting module, configured to acquire a user's statistical requirement, the main node broadcasting and distributing the statistical requirement to each datachain node;
an original data generation module, configured to read, according to the statistical requirement, a first data set of corresponding start and end sequence numbers or start and end times in the datachain node, and to extract the data corresponding to the statistical object from the first data set as original data;
a first calculation module, configured to take the original data as the input of a cardinality estimation method, set the target parameters, hash the original data, calculate the bucket number of each hashed datum and the position at which the first 1 appears in the remaining bits after the bucket number is removed, and update the bucket information of that bucket number;
a first storage module, configured to store all bucket information of the first data set in the memory of the datachain node after all data of the first data set have been processed;
a newly added original data module, configured to read, according to the statistical requirement, all data updated after the first data set from the datachain node as a second data set, and to extract the data corresponding to the statistical object from the second data set as newly added original data;
a second calculation module, configured to process the newly added original data with the method of step S3 to obtain bucket information of the newly added original data;
a second storage module, configured to merge the bucket information of the newly added original data with the bucket information in the memory of the datachain node to obtain complete bucket information, and to store the complete bucket information in the memory of the datachain node;
a statistical result feedback module, configured to return the bucket information in the memory of each datachain node to the main node, the main node performing the statistical operation on the bucket information returned by each datachain node and feeding the statistical result back to the user;
the first calculation module being specifically configured to: select a hash function H according to the statistical requirement, hash each piece of original data stored in the datachain node as binary buffer data, and output a plurality of hash results; divide the hash space evenly into m parts, each part being called a bucket, where m is an integer power of 2; take each hash result as a hash sample whose hash value has length L bits, take the first k bits of the hash value as the bucket number of the sample, where m = 2^k, take the remaining L-k bits as the bit string used for the subsequent estimation, and assign hash samples with the same bucket number to the same bucket; calculate the position at which the first 1 appears in the remaining bits of each hash sample after the bucket number is removed, recorded as m[i]; and, for the hash samples of the same bucket, compare the currently calculated m[i] with the previous bucket information m[j], and if m[i] > m[j], update the bucket information of that bucket number to m[i], where i and j are positive integers.
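The modules of claim 9 can be tied together in one end-to-end sketch: bucketize a first data set, bucketize newly added data starting from the stored bucket information, and form the final estimate. Parameters (k = 6, 32-bit hashes, truncated SHA-256) are illustrative assumptions, not values mandated by the patent.

```python
import hashlib
import math

K, HASH_BITS = 6, 32
M = 2 ** K

def bucketize(items, buckets=None):
    """First/second calculation modules: hash, bucket by the first k bits,
    and keep the per-bucket maximum first-1 position of the rest."""
    buckets = list(buckets) if buckets is not None else [0] * M
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:4], "big")
        idx = h >> (HASH_BITS - K)                      # bucket number
        rest = h & ((1 << (HASH_BITS - K)) - 1)         # remaining L-k bits
        pos = (HASH_BITS - K) - rest.bit_length() + 1   # first-1 position
        buckets[idx] = max(buckets[idx], pos)
    return buckets

def estimate(buckets):
    """Statistical operation on the returned bucket information."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -b for b in buckets)
    if raw <= 2.5 * M and 0 in buckets:
        return M * math.log(M / buckets.count(0))       # small-range correction
    return raw

first = bucketize(range(5000))                   # first data set
complete = bucketize(range(5000, 8000), first)   # merge newly added data
# estimate(complete) approximates the 8000 distinct items
```

With m = 64 buckets the expected relative error is roughly 1.04 / sqrt(64), about 13 percent, so the estimate for 8000 distinct items typically lands within a few hundred to roughly a thousand of the true count while each node stores only 64 small integers.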
CN201910736362.XA 2019-08-09 2019-08-09 Processing method and device for fast statistics Active CN110580307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910736362.XA CN110580307B (en) 2019-08-09 2019-08-09 Processing method and device for fast statistics


Publications (2)

Publication Number Publication Date
CN110580307A CN110580307A (en) 2019-12-17
CN110580307B true CN110580307B (en) 2021-09-24

Family

ID=68810709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910736362.XA Active CN110580307B (en) 2019-08-09 2019-08-09 Processing method and device for fast statistics

Country Status (1)

Country Link
CN (1) CN110580307B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112433886B (en) * 2020-11-24 2024-08-23 厦门美图之家科技有限公司 Data processing method and device, server and storage medium
CN113672619B (en) * 2021-08-17 2024-02-06 天津南大通用数据技术股份有限公司 Method for segmenting data according to hash rule to make data more uniform

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709001A (en) * 2016-12-22 2017-05-24 西安电子科技大学 Cardinality estimation method aiming at streaming big data
CN108769154A (en) * 2018-05-15 2018-11-06 北京工业大学 Date storage method based on directed acyclic graph and distributed account book
CN109558727A (en) * 2018-10-25 2019-04-02 中国科学院计算技术研究所 A kind of routing safety detection method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10120905B2 (en) * 2014-12-22 2018-11-06 Amazon Technologies, Inc. Efficient determination of join paths via cardinality estimation



Similar Documents

Publication Publication Date Title
Nguyen et al. Federated learning with buffered asynchronous aggregation
CN113557512B (en) Secure multi-party arrival frequency and frequency estimation
US20200159702A1 (en) Method, apparatus, and computer program product for data quality analysis
Li et al. Privacy for free: Communication-efficient learning with differential privacy using sketches
JP6518838B6 (en) Method and apparatus for distributed database in a network
Ganti et al. PoolView: stream privacy for grassroots participatory sensing
AlTurki et al. PVeStA: A parallel statistical model checking and quantitative analysis tool
KR101660853B1 (en) Generating test data
US11899647B2 (en) Documenting timestamps within a blockchain
WO2019024780A1 (en) Light-weight processing method for blockchain, and blockchain node and storage medium
CN110580307B (en) Processing method and device for fast statistics
CN111694839B (en) Time sequence index construction method and device based on big data and computer equipment
Zhu et al. Networked multisensor decision and estimation fusion: based on advanced mathematical methods
CN115186304A (en) Transaction data verification method and system based on block chain
CN112600697A (en) QoS prediction method and system based on federal learning, client and server
Tu et al. Byzantine-robust distributed sparse learning for M-estimation
CN110489460B (en) Optimization method and system for rapid statistics
Van Tran et al. FRESQUE: A scalable ingestion framework for secure range query processing on clouds
Cafaro et al. Data stream fusion for accurate quantile tracking and analysis
Kancharla et al. Dependable Industrial Crypto Computing
US20230059130A1 (en) Method and device for generating random numbers
Ali Consistency analysis of replication-based probabilistic key-value stores
US11675777B2 (en) Method, apparatus, and computer readable medium for generating an audit trail of an electronic data record
Ramkumar Scalable Computing in a Blockchain
Derei Accelerating the PlonK zkSNARK Proving System using GPU Architectures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant