CN113364465A

CN113364465A - Percentile-based statistical data compression method and system

Info

Publication number: CN113364465A
Application number: CN202110626628.2A
Authority: CN
Inventors: 周奕庆; 蔡晓华
Original assignee: Shanghai Netis Technologies Co ltd
Current assignee: Shanghai Netis Technologies Co ltd
Priority date: 2021-06-04
Filing date: 2021-06-04
Publication date: 2021-09-07
Anticipated expiration: 2041-06-04
Also published as: CN113364465B

Abstract

The invention provides a percentile-based statistical data compression method and system. The method comprises: step 1: checking the number of statistical units against an upper limit to obtain the full set of statistical units; step 2: screening out, from the full set, the statistical units that meet a preset condition according to a preset sampling rate; step 3: calculating an index percentile threshold from the screened statistical units; and step 4: filtering the statistical units according to the index percentile threshold, removing all statistical units whose index values are less than or equal to the threshold, and aggregating the removed statistical units to a dimension lower than that of the original statistical units. The invention improves processing efficiency and extends the period over which data can be usefully stored long term.

Description

Percentile-based statistical data compression method and system

Technical Field

The invention relates to the technical field of data compression, in particular to a method and a system for statistical data compression based on percentile.

Background

Stream data is a sequential, large-volume, fast, continuously arriving data sequence, and can generally be viewed as a dynamic data set that grows without bound over time. It is used in fields such as network monitoring, sensor networks, aerospace, meteorological measurement and control, and financial services.

During stream data statistics, data arrives continuously. Along a given dimension, the number of allocated statistical units keeps growing, and memory usage grows with it. Memory usage is typically controlled by capping the number of statistical units: once the cap is reached, no new statistical unit is allocated for new data. New units can be allocated again only after the reporting period ends, the statistical units are reported, and the cached statistical units are cleared. This strategy is effectively random in which data it keeps, and it easily causes important data that appears later in the period, such as statistical units with large index (metric) values, to be excluded from the statistics.

Another control method is, once the upper limit is reached, to rank all statistical units by index value (TopN) and aggregate the units that fall outside the TopN. Aggregation here means reducing the number of dimensions used for the statistics. For example, in a monitoring scenario, statistics are kept on the combination of the five dimensions of the five-tuple SourceIp, SourcePort, DestIp, DestPort and IpProtocolByte; after aggregation, only SourceIp and DestIp are used as the combined dimension, which greatly reduces the number of statistical units. This approach sacrifices dimensional detail for unimportant data. Moreover, although it preserves the statistical units with larger index values, when the total number of statistical units is large, running TopN over the full set consumes too much CPU time, degrades statistical performance, causes input data to back up in buffers, and may even block the entire stream processing pipeline.

Patent document CN100385437C (application number: CN200510115119.4) discloses a real-time data compression method for compressing data packets in a process control system, wherein the real-time data includes analog values. The method includes: 1) initializing a dictionary with the characters that may appear during compression; 2) reading a numerical value; 3) subtracting adjacent values of the real-time data to obtain difference values, and storing the first value read in the compressed file; and 4) compressing the difference values using the LZW algorithm.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a percentile-based statistical data compression method and system.

The statistical data compression method based on percentile provided by the invention comprises the following steps:

Step 1: checking the number of statistical units against an upper limit to obtain the full set of statistical units;

Step 2: screening out, from the full set, the statistical units that meet a preset condition according to a preset sampling rate;

Step 3: calculating an index percentile threshold from the screened statistical units;

Step 4: filtering the statistical units according to the index percentile threshold, removing all statistical units whose index values are less than or equal to the threshold, and aggregating the removed statistical units to a dimension lower than that of the original statistical units.

Preferably, step 1 comprises: predefining an upper limit on the number of statistical units; when new input data carries a new dimension value and a new statistical unit needs to be allocated, determining whether the current number of statistical units has reached the predefined upper limit; if it has not, creating a new statistical unit; and if it has, triggering data compression of the statistical units.

Preferably, step 2 comprises: multiplying the preset sampling rate by the maximum value of uint32 to obtain a sampling threshold; traversing all current statistical units and calculating a hash value for each; selecting a statistical unit when its hash value is less than or equal to the sampling threshold; and ignoring it when its hash value is greater than the sampling threshold.

Preferably, step 3 comprises: calculating the number K of statistical units as $K = N \times (1 - P/100)$, where N is the total number of selected statistical units and P is the set percentile; and using the BFPRT algorithm to obtain the TopK index value in the ordering by index value, which is used as the index percentile threshold.

Preferably, step 4 comprises: during filtering, if the data has a lookup index on the index values, filtering the data that meets the index percentile threshold directly through the index mechanism; otherwise, traversing the full set of statistical units and filtering according to the index percentile threshold; and, after filtering, selecting for aggregation the dimensions of the statistical units whose values have lower dispersion.

The invention provides a percentile-based statistical data compression system, which comprises:

Module M1: checking the number of statistical units against an upper limit to obtain the full set of statistical units;

Module M2: screening out, from the full set, the statistical units that meet a preset condition according to a preset sampling rate;

Module M3: calculating an index percentile threshold from the screened statistical units;

Module M4: filtering the statistical units according to the index percentile threshold, removing all statistical units whose index values are less than or equal to the threshold, and aggregating the removed statistical units to a dimension lower than that of the original statistical units.

Preferably, module M1 comprises: predefining an upper limit on the number of statistical units; when new input data carries a new dimension value and a new statistical unit needs to be allocated, determining whether the current number of statistical units has reached the predefined upper limit; if it has not, creating a new statistical unit; and if it has, triggering data compression of the statistical units.

Preferably, module M2 comprises: multiplying the preset sampling rate by the maximum value of uint32 to obtain a sampling threshold; traversing all current statistical units and calculating a hash value for each; selecting a statistical unit when its hash value is less than or equal to the sampling threshold; and ignoring it when its hash value is greater than the sampling threshold.

Preferably, module M3 comprises: calculating the number K of statistical units as $K = N \times (1 - P/100)$, where N is the total number of selected statistical units and P is the set percentile; and using the BFPRT algorithm to obtain the TopK index value in the ordering by index value, which is used as the index percentile threshold.

Preferably, module M4 comprises: during filtering, if the data has a lookup index on the index values, filtering the data that meets the index percentile threshold directly through the index mechanism; otherwise, traversing the full set of statistical units and filtering according to the index percentile threshold; and, after filtering, selecting for aggregation the dimensions of the statistical units whose values have lower dispersion.

Compared with the prior art, the invention has the following beneficial effects:

(1) the invention controls the memory used for statistics by capping the number of statistical units, which prevents memory allocation for statistical units from growing uncontrollably as data keeps arriving in stream statistics scenarios;

(2) the invention calculates the index percentile threshold by sampling, so the amount of data involved is small; even when the number of statistical units is very large, the calculation is relatively accurate and takes little CPU time, which avoids degrading the statistical performance of stream processing;

(3) by applying dimension-reduction aggregation to statistical units with low index values, the invention frees the large amount of memory those units occupy while retaining the full dimension information of statistical units with high index values, so that statistical business analysis can still examine the details of the objects that rank highest by impact. This improves processing efficiency, extends the period over which data can be usefully stored long term, and improves economic benefit; it can also broaden the range of applications of the system.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.

Example 1:

The method uses the upper limit on the number of statistical units to control when data compression is triggered. Once compression is triggered, the statistical units are sampled, the set of statistical units with lower index values is determined by a fast method, and those units are reduced in dimension and aggregated into low-dimension statistical units. When statistics are reported, the low-dimension statistical units are output together with the normal statistical units.

Referring to FIG. 1, the method comprises the following steps:

Step 1: checking the upper limit on the number of statistical units. Specifically, an upper limit on the number of statistical units is predefined. When new input data carries a new dimension value and a new statistical unit needs to be allocated, it is determined whether the current number of statistical units has reached the upper limit. If not, a new statistical unit is created. If so, data compression of the statistical units is triggered and the method proceeds to step 2.
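The upper-limit check of step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `StatTable`, the `compress` callback, and the choice to allocate the new unit right after compression are all assumptions made for the example.

```python
class StatTable:
    """Holds statistical units keyed by their dimension tuple (hypothetical helper)."""

    def __init__(self, max_units, compress):
        self.max_units = max_units   # predefined upper limit on the number of units
        self.compress = compress     # callback that shrinks the unit dict in place
        self.units = {}              # dimension tuple -> index (metric) value

    def add(self, dims, value):
        if dims in self.units:
            # Known dimension value: accumulate onto the existing unit.
            self.units[dims] += value
            return
        if len(self.units) >= self.max_units:
            # Upper limit reached: trigger compression before allocating.
            self.compress(self.units)
        # Assumption: the new unit is allocated once compression has run.
        self.units[dims] = value
```

Any function that removes entries from the dictionary can serve as `compress`; the later steps describe the percentile-based one.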

Step 2: sampling the statistical units. A small subset of statistical units is screened out from the full set according to the specified sampling rate, for use in the index percentile calculation of the next step. Specifically: the sampling threshold H is obtained by multiplying the predefined sampling rate by the maximum value of uint32. All current statistical units are traversed and a hash value is calculated for each. The hash input should be a dimension value of a simple type, for example an integer ID. When the hash value is less than or equal to the sampling threshold, the statistical unit is selected; when it is greater than the sampling threshold, the statistical unit is ignored. Finally, all selected statistical units are passed to the next step.
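A minimal sketch of this hash-threshold sampling is shown below. The source does not name a hash function for the general case, so `zlib.crc32` is used here purely as a stand-in that yields an unsigned 32-bit value; the function name `sample_units` is likewise an assumption.

```python
import zlib

UINT32_MAX = 0xFFFFFFFF


def sample_units(units, sampling_rate):
    """Return the keys whose 32-bit hash is at or below the sampling threshold H."""
    threshold = int(sampling_rate * UINT32_MAX)  # H = sampling rate * max(uint32)
    selected = []
    for key in units:
        # Stand-in hash (assumption): CRC-32 of the key's text representation.
        h = zlib.crc32(repr(key).encode("utf-8"))
        if h <= threshold:
            selected.append(key)
    return selected
```

Because the decision depends only on the hash of the key, the same unit is always either sampled or skipped, and no random number generator is needed in the hot path.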

Step 3: calculating the index percentile threshold according to the specified percentile. Specifically: let N be the total number of selected statistical units and P the set percentile; the number of statistical units expected to rank higher by index value is $K = N \times (1 - P/100)$. The BFPRT algorithm is then used to obtain the TopK index value in the ordering by index value, and this value is used as the index percentile threshold.
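The threshold calculation can be sketched with a median-of-medians selection, which is what BFPRT refers to. The function names, the integer form of the K formula, and the clamp of K to at least 1 are assumptions added for the example; the source only gives the formula and names the algorithm.

```python
def select_kth_smallest(values, k):
    """Return the k-th smallest element (0-based) using median-of-medians pivots."""
    values = list(values)
    while True:
        if len(values) <= 5:
            return sorted(values)[k]
        # Median of each group of five, then the median of those medians as pivot.
        groups = [sorted(values[i:i + 5]) for i in range(0, len(values), 5)]
        medians = [g[len(g) // 2] for g in groups]
        pivot = select_kth_smallest(medians, len(medians) // 2)
        lows = [v for v in values if v < pivot]
        highs = [v for v in values if v > pivot]
        equal = len(values) - len(lows) - len(highs)
        if k < len(lows):
            values = lows
        elif k < len(lows) + equal:
            return pivot
        else:
            k -= len(lows) + equal
            values = highs


def percentile_threshold(sample_values, percentile):
    """Index value of the K-th largest sampled unit, with K = N * (1 - P/100)."""
    n = len(sample_values)
    # Integer arithmetic avoids float rounding; the clamp to 1 is an assumption.
    k = max(1, int(n * (100 - percentile) // 100))
    return select_kth_smallest(sample_values, n - k)
```

Selection runs in linear time in the sample size, whereas a full TopN sort of every unit would cost O(n log n) on a much larger n.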

Step 4: filtering the statistical units by the index percentile threshold. All statistical units whose index values are less than or equal to the index percentile threshold are filtered out and passed to the next step for dimension-reduction aggregation.

Step 4.1: during filtering, if the data has a lookup index on the index values, the units that meet the index percentile threshold can be filtered directly through the index mechanism, which speeds up the process further.

Step 4.2: otherwise, the full set of statistical units is traversed and filtered according to the index percentile threshold.
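The full-traversal path of step 4.2 amounts to a single partition pass. The sketch below assumes units are held in a dictionary from dimension tuple to one index value; `split_by_threshold` is a hypothetical name.

```python
def split_by_threshold(units, threshold):
    """Partition units into (kept, removed) by comparing index values to the threshold."""
    kept, removed = {}, {}
    for dims, value in units.items():
        # Units at or below the threshold are removed and handed on for aggregation.
        target = removed if value <= threshold else kept
        target[dims] = value
    return kept, removed
```

Note that the comparison is inclusive, so the unit whose value equals the threshold is itself removed.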

Step 5: dimension-reduction aggregation of the statistical units. The statistical units whose index values are at or below the index percentile threshold are aggregated using fewer dimensions than the normal statistical units have. Generally, a small subset of the normal dimensions with low value dispersion is selected, so that the number of aggregated statistical units stays very small and has no significant effect on memory usage. For example, suppose n statistical units U1, U2, …, Un, each with three dimensions A, B and C and two summation-type indexes X and Y, are to be aggregated. Denote the dimension values of the i-th statistical unit by Ai, Bi and Ci, and its index values by Xi and Yi. Assume that dimension A takes the same value a in every statistical unit, that is, Ai = a, so the value dispersion of dimension A is small, whereas the values of dimensions B and C differ from unit to unit, so their value dispersion is large. Dimension A is therefore selected as the aggregation dimension, and dimensions B and C are discarded. The n statistical units are thus aggregated into one statistical unit that contains only dimension A with value a, whose X index value is $X = \sum_{i=1}^{n} X_i$ and whose Y index value is $Y = \sum_{i=1}^{n} Y_i$.

Because only one aggregated statistical unit remains in place of the original n statistical units, the data is compressed to 1/n of its original size, which reduces the total number of statistical units and frees memory.
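The A/B/C example above can be reproduced with a short sketch. The source does not define how dispersion is measured, so the number of distinct values per dimension is used here as a simple proxy; that choice, the function names, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict


def pick_low_dispersion_dims(units, dim_names, keep=1):
    """Rank dimensions by distinct-value count (a dispersion proxy) and keep the lowest."""
    distinct = {
        name: len({dims[i] for dims, _ in units})
        for i, name in enumerate(dim_names)
    }
    return sorted(dim_names, key=lambda name: distinct[name])[:keep]


def aggregate(units, dim_names, kept_names):
    """Sum the summation-type index values of units that share the kept dimensions."""
    positions = [dim_names.index(name) for name in kept_names]
    totals = defaultdict(lambda: [0, 0])
    for dims, (x, y) in units:
        key = tuple(dims[p] for p in positions)
        totals[key][0] += x
        totals[key][1] += y
    return {key: tuple(v) for key, v in totals.items()}


# Three units with dimensions (A, B, C) and index values (X, Y).
units = [
    (("a", "b1", "c1"), (1, 10)),
    (("a", "b2", "c2"), (2, 20)),
    (("a", "b3", "c3"), (3, 30)),
]
dim_names = ["A", "B", "C"]
kept = pick_low_dispersion_dims(units, dim_names)
result = aggregate(units, dim_names, kept)
```

Here dimension A has a single distinct value, so it is kept, and the three units collapse into one unit keyed by `("a",)` with the summed X and Y values.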

Example 2:

Example 2 is a preferred example of Example 1.

In a monitoring scenario, TCP/IP sessions need to be analyzed and counted in time order. The statistical dimension is then the five-tuple <SourceIp, SourcePort, DestIp, DestPort, IpProtocolByte>. In some large-scale monitoring scenarios, such as the online banking system of a large bank or the external connection ports of a data center, the number of sessions per minute may reach 1 million or more, which greatly exceeds the upper limit of a given monitoring cluster. At the same time, retaining a large number of unimportant detailed sessions has little practical value, and such data can be aggregated to the two-tuple <SourceIp, DestIp> for long-term storage.

Assume that the upper monitoring limit is set to 2 million statistical units per minute.

Step 1: compute the five-tuple of a newly input data packet and look up, in the set of statistical units cached for the current minute, a statistical unit with the same five-tuple. If one is found, accumulate the packet's statistical values onto the indexes of that statistical unit; statistical processing of the packet is then complete. Otherwise, go to step 2.

Step 2: determine whether the number of statistical units cached for the current minute has reached the predefined upper limit. If it has not, create a new statistical unit, allocate memory for it, and accumulate the packet's statistical values onto the indexes of the new unit; statistical processing of the packet is then complete. Otherwise, go to step 3 to start compressing the number of statistical units.

Step 3: obtain the sampling threshold H by multiplying the predefined sampling rate by the maximum value of uint32. Traverse the statistical units and compute the five-tuple hash value of each one. Specifically, when SourceIp or SourcePort is greater than DestIp or DestPort, compute (((SourceIp × 31 + SourcePort) × 31 + DestIp) × 31 + DestPort) × 31 + IpProtocolByte; otherwise compute (((DestIp × 31 + DestPort) × 31 + SourceIp) × 31 + SourcePort) × 31 + IpProtocolByte. When the hash value is less than or equal to the sampling threshold, put the statistical unit into the sample set used for percentile threshold calculation, then go to step 4.
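The five-tuple hash of this step can be sketched as below. Two points are assumptions rather than statements from the source: the ordering condition is read as a lexicographic comparison of the (IP, port) pairs, and the result is wrapped to 32 bits so that it can be compared against a threshold derived from the uint32 maximum. IP addresses are taken as integers.

```python
UINT32_MASK = 0xFFFFFFFF


def five_tuple_hash(src_ip, src_port, dst_ip, dst_port, protocol):
    """Direction-normalized polynomial hash with multiplier 31, wrapped to uint32."""
    # Assumption: "greater" means lexicographic comparison of (ip, port) pairs.
    if (src_ip, src_port) > (dst_ip, dst_port):
        first, second = (src_ip, src_port), (dst_ip, dst_port)
    else:
        first, second = (dst_ip, dst_port), (src_ip, src_port)
    h = first[0]
    for part in (first[1], second[0], second[1], protocol):
        h = (h * 31 + part) & UINT32_MASK  # emulate uint32 overflow
    return h
```

Placing the larger endpoint first means both directions of a session hash to the same value, so a session is sampled or skipped as a whole.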

Step 4: let N be the total number of statistical units in the sample set and P the set percentile; the number of statistical units expected to rank higher by index value is then $K = N \times (1 - P/100)$. The BFPRT algorithm is then used to obtain the TopK index value in the ordering of statistical units by index value, and this value is used as the index percentile threshold. If a lookup index on the index values exists, the BFPRT algorithm can be applied directly to that index to speed up the calculation.
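As a worked illustration of the formula for K, using hypothetical numbers that do not come from the source: suppose the limit of 2 million units has been reached, the sampling rate is 1%, so that roughly N = 20,000 units land in the sample set, and the set percentile is P = 95.

```latex
K = N \times \left(1 - \frac{P}{100}\right)
  = 20000 \times \left(1 - \frac{95}{100}\right)
  = 1000
```

Under these assumed values, the threshold is the index value of the 1000th-largest unit in the sample, and it is then applied to all cached units rather than only to the sample.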

Step 5: traverse all statistical units; if a unit's index value is less than or equal to the index percentile threshold calculated in step 4, remove it from the statistical unit set and put it into the aggregation cache. If a lookup index on the index values exists, it can be traversed directly instead: as soon as an index value greater than the index percentile threshold is encountered, the index entries traversed so far are removed, the statistical units corresponding to those entries are removed from the statistical unit set and put into the aggregation cache, and the traversal ends early.
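The index-driven path with early exit might look like the following sketch. The index is modelled as a list of (index value, key) pairs kept in ascending order; this representation and the function name are assumptions, since the source does not describe the index structure.

```python
def remove_low_units_via_index(units, sorted_index, threshold):
    """Move units at or below the threshold into an aggregation cache.

    sorted_index is a list of (value, key) pairs in ascending order of value
    (a hypothetical stand-in for the lookup index on index values).
    """
    aggregation_cache = {}
    cut = 0
    for value, key in sorted_index:
        if value > threshold:
            break  # every later entry ranks higher, so stop early
        aggregation_cache[key] = units.pop(key)
        cut += 1
    del sorted_index[:cut]  # drop the index entries that were traversed
    return aggregation_cache
```

With an ordered index, the work is proportional to the number of removed units instead of the total number of units.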

Step 6: traverse the aggregation cache and compute the two-tuple <SourceIp, DestIp> of each statistical unit; sum the index values of all statistical units with the same two-tuple, merge them into one aggregated-dimension statistical unit, and put it into the aggregated statistical unit set.
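A compact sketch of this five-tuple to two-tuple aggregation follows, assuming each cached unit carries a single summation-type index value; the function name and data layout are illustrative only.

```python
from collections import Counter


def aggregate_to_two_tuple(aggregation_cache):
    """Sum index values of five-tuple units that share the same (SourceIp, DestIp)."""
    totals = Counter()
    for five_tuple, value in aggregation_cache.items():
        src_ip, _src_port, dst_ip, _dst_port, _protocol = five_tuple
        totals[(src_ip, dst_ip)] += value
    return dict(totals)
```

Ports and protocol are discarded, so many short-lived sessions between the same pair of hosts collapse into a single entry.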

Step 7: before the next minute arrives, if newly input packets cause the number of cached normal statistical units to reach 2 million again, the compression process is run again, so that the number of cached normal statistical units in the current minute never exceeds 2 million. When the next minute arrives, the normal statistical unit set and the aggregated statistical unit set are reported and output: normal statistical units output the five-tuple and statistical index information, and aggregated statistical units output the two-tuple and statistical index information. Finally, all statistical unit sets are emptied, statistics for the next minute begin, and the above steps repeat, achieving dynamic control of stream statistics memory.
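The per-minute cycle of Example 2 can be condensed into one small class. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, each unit has one index value, and for brevity the threshold is taken from a full sort of all cached values rather than from hash sampling plus BFPRT as the example describes.

```python
class MinuteWindow:
    """One reporting period: capped normal units plus aggregated two-tuple units."""

    def __init__(self, max_units, percentile):
        self.max_units = max_units
        self.percentile = percentile
        self.normal = {}      # five-tuple -> index value
        self.aggregated = {}  # (SourceIp, DestIp) -> summed index value

    def _compress(self):
        # Simplification (assumption): sort all values instead of sampling + BFPRT.
        values = sorted(self.normal.values())
        n = len(values)
        k = max(1, n * (100 - self.percentile) // 100)
        threshold = values[n - k]
        low_keys = [key for key, v in self.normal.items() if v <= threshold]
        for key in low_keys:
            value = self.normal.pop(key)
            pair = (key[0], key[2])
            self.aggregated[pair] = self.aggregated.get(pair, 0) + value

    def add(self, five_tuple, value):
        if five_tuple in self.normal:
            self.normal[five_tuple] += value
            return
        if len(self.normal) >= self.max_units:
            self._compress()  # may run several times within one minute
        self.normal[five_tuple] = value

    def flush(self):
        """Report both sets at the minute boundary and start the next period empty."""
        report = (self.normal, self.aggregated)
        self.normal, self.aggregated = {}, {}
        return report
```

Because the unit sitting exactly at the threshold is removed along with everything below it, each compression frees at least one slot, so the number of normal units never exceeds the cap.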

Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A percentile-based statistical data compression method is characterized by comprising the following steps:

2. The percentile-based statistical data compression method according to claim 1, wherein step 1 comprises: predefining an upper limit on the number of statistical units; when new input data carries a new dimension value and a new statistical unit needs to be allocated, determining whether the current number of statistical units has reached the predefined upper limit; if it has not, creating a new statistical unit; and if it has, triggering data compression of the statistical units.

3. The percentile-based statistical data compression method according to claim 1, wherein step 2 comprises: multiplying the preset sampling rate by the maximum value of uint32 to obtain a sampling threshold; traversing all current statistical units and calculating a hash value for each; selecting a statistical unit when its hash value is less than or equal to the sampling threshold; and ignoring it when its hash value is greater than the sampling threshold.

4. The percentile-based statistical data compression method according to claim 1, wherein step 3 comprises: calculating the number K of statistical units as $K = N \times (1 - P/100)$, where N is the total number of selected statistical units and P is the set percentile; and using the BFPRT algorithm to obtain the TopK index value in the ordering by index value, which is used as the index percentile threshold.

5. The percentile-based statistical data compression method according to claim 1, wherein step 4 comprises: during filtering, if the data has a lookup index on the index values, filtering the data that meets the index percentile threshold directly through the index mechanism; otherwise, traversing the full set of statistical units and filtering according to the index percentile threshold; and, after filtering, selecting for aggregation the dimensions of the statistical units whose values have lower dispersion.

6. A percentile-based statistical data compression system, comprising:

7. The percentile-based statistical data compression system of claim 6, wherein module M1 comprises: predefining an upper limit on the number of statistical units; when new input data carries a new dimension value and a new statistical unit needs to be allocated, determining whether the current number of statistical units has reached the predefined upper limit; if it has not, creating a new statistical unit; and if it has, triggering data compression of the statistical units.

8. The percentile-based statistical data compression system of claim 6, wherein module M2 comprises: multiplying the preset sampling rate by the maximum value of uint32 to obtain a sampling threshold; traversing all current statistical units and calculating a hash value for each; selecting a statistical unit when its hash value is less than or equal to the sampling threshold; and ignoring it when its hash value is greater than the sampling threshold.

9. The percentile-based statistical data compression system of claim 6, wherein module M3 comprises: calculating the number K of statistical units as $K = N \times (1 - P/100)$, where N is the total number of selected statistical units and P is the set percentile; and using the BFPRT algorithm to obtain the TopK index value in the ordering by index value, which is used as the index percentile threshold.

10. The percentile-based statistical data compression system of claim 6, wherein module M4 comprises: during filtering, if the data has a lookup index on the index values, filtering the data that meets the index percentile threshold directly through the index mechanism; otherwise, traversing the full set of statistical units and filtering according to the index percentile threshold; and, after filtering, selecting for aggregation the dimensions of the statistical units whose values have lower dispersion.