CN118467582A - General stream big data statistics system - Google Patents


Info

Publication number
CN118467582A
CN118467582A (application CN202310840778.2A)
Authority
CN
China
Prior art keywords
statistics
statistical
data
message
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310840778.2A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of CN118467582A publication Critical patent/CN118467582A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution
    • G06F 16/24553 Query execution of query operations
    • G06F 16/24554 Unary operations; Data partitioning operations
    • G06F 16/24556 Aggregation; Duplicate elimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/547 Remote procedure calls [RPC]; Web services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a streaming big data statistics system, belonging to the technical field of big data application. The invention establishes a configuration method for describing streaming statistical operations; it has rich built-in conversion functions, supports expression parsing, can satisfy various complex conditional filtering and logical judgments, and supports multidimensional calculation and statistics at multiple time granularities. Based on this configuration specification, the invention abstracts streaming statistics requirements into a set of operation scenarios and packages them as uniform modules, with each operation component emphasizing the optimization of memory footprint and network IO. The system's whole data consumption link decreases layer by layer, with messages aggregated at each stage, which effectively reduces downstream computation; all statistical tasks in the cluster share the cluster's computing resources, reducing unnecessary resource waste. The invention can help enterprises reduce their investment in streaming statistics.

Description

General stream big data statistics system
Technical Field
The invention relates to the technical field of big data application, in particular to a general stream big data statistics system.
Background
With the continuous development of various industries, enterprises attach increasing importance to data timeliness, and streaming data statistics technology is being adopted by more and more enterprises across industries. For example: Internet enterprises count product PV (page views) and UV (unique visitors) in real time; e-commerce enterprises count platform order volume and transaction amount in real time; telecom operators count network packet transmission volume and transmission efficiency in real time; intelligent traffic systems count pedestrian and vehicle flow on roads in real time. The widespread application of streaming statistics technology improves enterprises' operational efficiency and brings great convenience to daily life.
Streaming data statistics technology has great industrial value; however, many bottlenecks still constrain the field. Current industry implementations of streaming statistics services are mostly based on Flink SQL, Spark SQL, OLAP engines and other derived technical schemes. These schemes perform data statistics and analysis through the SQL language. Because SQL processes data based on the concept of a data table, it must hold large amounts of raw or intermediate-state data in memory, causing unavoidable memory waste. Distributed SQL triggers Shuffle during data processing, causing heavy network transmission and reducing execution efficiency. SQL can also cause serious data skew in some grouped aggregation operations, severely affecting normal program execution. Each specific statistical requirement needs an independent computing task, and the inability to share computing resources between tasks causes further resource waste. In addition, implementing the corresponding functions depends on professional data engineers, so streaming statistics tasks have high development costs and long cycles. As enterprise data indicators grow exponentially, these bottlenecks become more prominent, consuming substantial development, data maintenance and server operation costs.
In view of this situation, no mature solution exists in the industry, so the invention provides a general streaming big data statistics system. The scheme defines its own streaming-statistics configuration specification for describing statistical requirements in various forms, and can replace the SQL language in this subdivided field. The configuration method is powerful, easily extensible, and has a simple, clear syntax that is easy to understand and use. Based on this configuration specification, all statistical tasks share cluster computing resources; the system implementation focuses on avoiding Shuffle operations and reducing network data transmission during computation; and the various operation scenarios of streaming statistics are packaged as uniform components with emphasis on optimizing memory footprint and network IO, so the operation components can be reused. The invention can help enterprises cope with complicated streaming data statistics demands, solving the problems of high development cost, high difficulty, long cycles and serious resource waste. It helps enterprises build a more complete, stable and reliable data operation system more quickly, saves investment in data operations, and has high practical application value.
Disclosure of Invention
The invention provides a general streaming big data statistics system that can help enterprises cope with complicated streaming data statistics demands. To achieve this, the system defines its own streaming-statistics configuration specification, which comprises the following contents:
All statistical indicators are managed using a two-level hierarchy of statistical groups and statistical items; each statistical item corresponds to one statistical indicator, and a statistical group is a composite of one or more statistical items based on one piece of metadata. Metadata refers to the data structure of the raw statistical message corresponding to a statistical group, including the field names and field types. All statistical items in a group perform their index statistics based on the same raw statistical message, which reduces repeated transmission of message data and thus network transmission. The configuration of each statistical item mainly includes: statistical template configuration, statistical period configuration and data validity period configuration.
Further, the statistical template configuration includes: the statistical expression configuration, used to specify the calculation rule of the statistical item; the dimension expression configuration, used to specify the dimension information of the statistical item; and the result screening expression configuration, used to filter the statistical results.
Further, the statistical template configuration may be implemented based on XML, JSON, YAML, CSON, TOML or other key-value text configuration formats.
Further, the invention abstracts streaming-statistics operation scenarios into a number of operation units, including: a count operation unit, a sum operation unit, a maximum operation unit, a minimum operation unit, an average operation unit, a time-series operation unit and a cardinality operation unit; the operation units can be extended as required.
Further, a statistical expression is composed of one or more statistical operation units, and arithmetic operations can be performed between multiple statistical operation units. The configuration format of a statistical operation unit is:
function_name(related_column, filter_unit1, filter_unit2, ...) where:
function_name is the name of the statistical function;
related_column is the operation-related parameter;
filter_unit is a filtering parameter.
Further, the operation-related parameter depends on the type of the statistical operation unit. For the count operation, the related parameter defaults to 1 and needs no additional assignment. For the sum, maximum, minimum, average and time-series operations, the related parameter is the field on which the calculation is performed, and its value is of numeric type. For the cardinality operation, the related parameter is the field on which the distinct count is performed, and its value is of string type.
Further, the filtering parameters are expressions with boolean results, used to filter the raw messages within the time window. Each statistical operation unit can specify zero or more filtering parameters as needed; the filtering parameters are combined with logical AND, and each filtering parameter consists of one or more filtering conditions connected by logical operators.
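The AND combination of filter units described above can be sketched as follows. This is an illustrative interpretation only: the predicates are plain Python callables standing in for the parsed filter expressions, and the helper names are invented.

```python
def passes(message, filter_units):
    """A message survives only if every filter unit accepts it (logical AND)."""
    return all(f(message) for f in filter_units)

def count_unit(messages, *filter_units):
    """count() over the messages in a window that pass all filter units."""
    return sum(1 for m in messages if passes(m, filter_units))

msgs = [
    {"biz": "food", "amount": 30.0},
    {"biz": "food", "amount": 80.0},
    {"biz": "book", "amount": 80.0},
]
# Roughly corresponds to count(biz == 'food', amount > 50) in the
# configuration syntax: both filter units must hold.
n = count_unit(msgs, lambda m: m["biz"] == "food", lambda m: m["amount"] > 50)
```

With zero filter units, `passes` returns True for every message, matching the "zero or more filtering parameters" rule.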
Further, transformation functions can be used in the statistical expression and the dimension expression; a transformation function transforms the relevant fields of the raw message in a specified way before subsequent operations are performed. The statistical period specifies the time window of the streaming computation and comprises rolling and sliding window types; the time granularity of each window type includes second, minute, hour and day levels, and a user can customize the statistical period. The dimension expression may specify zero or more dimensions, separated by a specified delimiter. The result screening expression filters the statistical results and can be used for common topN or lastN operations.
The invention provides a set of general stream big data statistical system based on the configuration specification, which comprises the following contents:
1. The system mainly comprises the following modules: the Client module, the SDK integrated by the business side, used to report raw statistical message data; the RPC module, used to receive the statistical message data reported by clients and to provide a statistical-result query interface; the operation module, used to functionally encapsulate the various streaming statistical operation units, execute rate-limit rule judgment, parse the configuration information of each statistical item, consume message data, calculate according to the statistical configuration and store statistical results; and the Web module, used to manage and maintain statistical groups and statistical items, view statistical results, set rate-limit rules and manage statistical-indicator access permissions.
2. The system manages all statistical indicators using a hierarchy of statistical groups and statistical items; each statistical item corresponds to one statistical indicator, and one or more statistical items based on the same piece of metadata are called a statistical group. The Web module manages the execution state of each statistical item's computing task: a user can start, stop and delete a specified statistical item on the Web page; a statistical item in the running state performs data statistics normally, and one in the non-running state performs no statistical operation.
3. The whole consumption link of the system comprises the following stages: the Client module's message-reporting stage, the RPC module's message-processing stage, the operation module's expand-and-group stage, and the statistical-result storage stage. The system uses asynchronous processing and batch consumption at each stage: each stage places received messages into a message buffer pool, divides them into different message types according to that stage's predefined aggregation logic, and aggregates messages of the same type within a single-node process. This design reduces the data transmitted downstream, improves network IO efficiency, and directly reduces the downstream computation volume and the DB write pressure.
4. In the message-reporting stage, before executing the aggregation operation, the Client module modifies the original timestamp of each message to its minimum batch time. The Client module uses the greatest common divisor of the statistical periods of all active statistical items in the current statistical group as the time window, and calculates the minimum batch time corresponding to a message from this time window and the message's original timestamp. In addition, according to the configuration information of all active statistical items in the current group, the Client module filters out the fields of the raw message that are irrelevant to statistical calculation, and then performs the aggregation operation. The Client module's aggregation logic aggregates messages by content identity; modifying timestamps to the minimum batch time and removing statistics-irrelevant fields serve to aggregate as many messages as possible.
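The minimum-batch-time calculation described above, using the greatest common divisor of the active statistical periods as the window, can be sketched in a few lines (function and parameter names are illustrative):

```python
from functools import reduce
from math import gcd

def min_batch_time(ts_millis, periods_sec):
    """Floor a message timestamp (ms) to the start of its batch window.

    The window size is the greatest common divisor of the statistical
    periods (in seconds) of all active items in the group, as described
    in the specification above.
    """
    window_ms = reduce(gcd, periods_sec) * 1000
    return ts_millis - ts_millis % window_ms

# Items with 10 s, 60 s and 300 s periods share a 10 s aggregation window,
# so timestamps within the same 10 s window collapse to one value.
batch = min_batch_time(1_690_000_123_456, [10, 60, 300])
```

Collapsing timestamps this way makes messages that differ only in their timestamp byte-identical, so the content-identity aggregation can merge them.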
5. The message buffer pool relied on by the aggregation operations is implemented on a bounded priority blocking queue. The system divides the buffer pool into a number of Slots; each Slot consists of a BoundedPriorityBlockingQueue and the Slot's last-access timestamp. Before data is placed into the buffer pool, the Key of the message event is generated according to the aggregation logic of the current stage, and the buffer pool assigns a Slot to each message according to its Key. The system divides messages into different processing periods according to a predefined time window: messages within the same period are ordered by Key priority, messages of different periods are ordered by window time, and consumer threads consume the buffer pool in batches in a predefined strategy order.
6. The consumer thread group of the buffer pool polls each Slot and judges whether the Slot's used capacity has reached a threshold of batchsize x backlog_factor, where batchsize is the specified maximum number of messages per consumption batch and backlog_factor is the specified message backlog coefficient. If the used capacity has reached the threshold, a batch is read in order and consumed; if not, the Slot's last access time is checked, and if it exceeds its threshold a batch is likewise read and consumed, otherwise the task is skipped. After a Slot's messages are consumed, the Slot's used capacity and last-access timestamp are updated. This buffer-pool design arranges messages of the same type sequentially within a fixed time period, so the consumer thread group can aggregate more messages of the same type together, directly reducing the downstream computation volume and the DB write pressure.
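The Slot-polling rule above can be sketched as follows. This is a simplified single-threaded interpretation: a plain list stands in for the bounded priority blocking queue, and `idle_sec` is an assumed name for the last-access-time threshold, which the specification does not name.

```python
import time

class Slot:
    def __init__(self):
        self.queue = []            # stand-in for the bounded priority queue
        self.last_access = time.time()

def drain_slot(slot, batchsize=100, backlog_factor=0.5, idle_sec=1.0, now=None):
    """Consume one batch from a Slot following the polling rule:
    read when the backlog reaches batchsize * backlog_factor, or when the
    Slot has been idle longer than idle_sec; otherwise skip this round."""
    now = time.time() if now is None else now
    threshold = batchsize * backlog_factor
    if len(slot.queue) < threshold and now - slot.last_access < idle_sec:
        return []                  # skip: not enough backlog, recently served
    batch, slot.queue = slot.queue[:batchsize], slot.queue[batchsize:]
    slot.last_access = now         # update capacity and last-access timestamp
    return batch

slot = Slot()
slot.queue = list(range(60))       # 60 backlogged messages, threshold is 50
first = drain_slot(slot, now=slot.last_access + 0.1)
```

The idle-time fallback ensures low-traffic Slots are still flushed periodically rather than waiting indefinitely for a full batch.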
7. Cardinality statistics in the system refers to distinct operations (counting non-duplicate values), such as the common UV indicator. Cardinality statistics is one of the most resource-intensive operation types in streaming big data statistics: fully storing the original cardinality values before counting would cause large write overhead and memory occupation. To avoid this, the invention provides a cardinality-statistics implementation with low memory occupation. The implementation has a built-in duplicate-data filter, and cardinality statistics is realized by counting the number of cardinality values not yet present in the filter. The filter is implemented on RoaringBitmap: the system sends an original cardinality value to the filter and judges whether it already exists by checking whether the corresponding Index position in the RoaringBitmap structure is 1; if so, the value is filtered out, otherwise the corresponding Index is set to 1. The number of values not previously present in the filter is counted, and the externally stored result is then updated. The Index value is obtained by first computing a Hash value of the original value with a specified Hash algorithm, and then converting the Hash value into an Index in the RoaringBitmap structure through a custom conversion function. The Hash algorithm may use MurmurHash to obtain a corresponding Long-type Hash value. The filter comprises an in-memory cardinality filter and a distributed cardinality filter: the in-memory filter is highly efficient and suitable for preliminary screening, and the preliminarily screened cardinality values are sent in batches to the distributed filter for a second judgment.
The distributed cardinality filter comprises a number of shards, each corresponding to one RoaringBitmap storage structure; the accuracy of cardinality statistics can be tuned by adjusting the number of shards, and the system determines a value's shard by the Hash remainder of the original cardinality value. This scheme does not require storing the original cardinality values, and computing the RoaringBitmap Index position via hashing avoids maintaining a mapping between original values and Indexes. The RoaringBitmap structure greatly reduces memory occupation for sparse data, improves write and computation performance, and allows the precision of cardinality statistics to be conveniently improved by expanding the number of shards.
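The sharded bitmap filter described above can be sketched as follows. This is an illustrative simplification: a Python set of indices stands in for RoaringBitmap, SHA-1 stands in for MurmurHash for determinism, and all names are invented.

```python
import hashlib

class ShardedBitmapCounter:
    """Minimal sketch of the cardinality filter: each shard holds a bit set
    (a stand-in for RoaringBitmap); the shard is chosen by hash remainder,
    and a value counts only when its bit index was previously unset."""

    def __init__(self, num_shards=4, bits_per_shard=1 << 20):
        self.shards = [set() for _ in range(num_shards)]  # indices ~ set bits
        self.bits = bits_per_shard
        self.count = 0

    def _hash(self, value):
        # Deterministic 64-bit hash; the system described above would use
        # MurmurHash to obtain a Long-type hash value.
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def add(self, value):
        h = self._hash(value)
        shard = self.shards[h % len(self.shards)]       # hash remainder picks shard
        idx = (h // len(self.shards)) % self.bits       # custom index conversion
        if idx in shard:
            return False          # bit already 1: duplicate, filtered out
        shard.add(idx)            # set the bit and count a new value
        self.count += 1
        return True

c = ShardedBitmapCounter()
for uid in ["u1", "u2", "u1", "u3", "u2"]:
    c.add(uid)
```

Note that neither the original values nor a value-to-index mapping is stored, which is the memory saving the scheme relies on; more shards (or bits) reduce the chance of hash collisions and thus raise precision.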
8. All statistical items in a statistical group share one message stream; when the operation module receives message data, it performs expand and group operations on the statistical messages. The expand operation queries all active statistical items in the group, extracts each statistical item's related fields (the fields involved in that item's calculation), and copies an independent message for each statistical item, retaining only its operation-related fields. The purpose of the expand operation is to prevent the subsequent operation logic of different statistical items from influencing each other. The group operation extracts the statistical-period attribute of the statistical item, divides time windows according to the statistical period, and groups the message data by time window; it then judges whether the statistical item contains multiple statistical operation units and, if so, regroups by operation unit; finally, it judges whether the statistical item contains dimension attributes and, if so, extracts the dimension information and regroups by dimension. The purpose of the group operation is to decompose the overall computation, aggregating and processing messages of the same type so that the computation of different message types does not interfere.
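The expand operation described above can be sketched as follows; the field names and the `fields` attribute are illustrative stand-ins for each item's configured related columns.

```python
def unfold(message, items):
    """Expand one raw message into an independent copy per active statistical
    item, keeping only the fields that item's calculation needs."""
    return {
        item["title"]: {k: message[k] for k in item["fields"] if k in message}
        for item in items
    }

items = [
    {"title": "order_count", "fields": []},                    # count() needs no field
    {"title": "order_amount", "fields": ["amount"]},           # sum(amount)
    {"title": "province_amount", "fields": ["amount", "province"]},
]
msg = {"orderId": "o1", "province": "Guangdong", "city": "Shenzhen", "amount": 99.0}
copies = unfold(msg, items)
```

Each item now owns a pruned copy, so later per-item grouping and calculation can mutate its data without affecting sibling items.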
9. To avoid system instability caused by the sudden onboarding of a large statistical demand or a traffic surge of a particular statistical item, the system has a rate-limiting protection mechanism comprising the following strategies: first, limiting the raw message volume of a statistical group; second, limiting the result volume of a statistical item; third, limiting the computation volume of a statistical item. The group message-volume limit targets the message volume of the current statistical group per unit time: when the message volume reaches the threshold, all statistical items under the group stop executing statistical calculation. The item result-volume and computation-volume limits target the statistical result volume and calculation complexity of the current statistical item per unit time, and affect only the execution of that item. This rate-limiting protection mechanism better guarantees the stability of the overall service; the thresholds can be flexibly adjusted on the Web module pages, and the limiter has an automatic recovery function: a statistical task resumes automatically once the data volume falls below the threshold.
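A minimal sketch of the per-group message-volume limiter with automatic recovery follows. The one-second fixed window, class name and method names are assumptions for illustration, not the patent's API.

```python
import time

class GroupRateLimiter:
    """Per-group message-volume limiter: when the count in the current
    one-second window reaches the threshold, further messages are rejected;
    the limiter recovers automatically when a new window starts."""

    def __init__(self, threshold_per_sec):
        self.threshold = threshold_per_sec
        self.window_start = 0
        self.count = 0

    def allow(self, now=None):
        now = time.time() if now is None else now
        window = int(now)
        if window != self.window_start:       # new second: auto-recover
            self.window_start, self.count = window, 0
        self.count += 1
        return self.count <= self.threshold

limiter = GroupRateLimiter(threshold_per_sec=2)
decisions = [limiter.allow(now=100.0), limiter.allow(now=100.2),
             limiter.allow(now=100.5), limiter.allow(now=101.1)]
```

The same shape would apply to the item-level result-volume and computation-volume limits, just keyed per statistical item instead of per group.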
10. In streaming-statistics scenarios, the Key of a statistical result contains a time parameter and the Value is the statistical result value; the data structures are broadly similar. Under huge data volumes, to improve write efficiency and reduce storage waste, the invention stores statistical result data using a delta-timestamp compression mode. The system divides result data whose calculation periods are seconds or minutes into different hour-granularity time segments; for each statistical item, the multiple statistical result values within the same time segment under the same dimension are stored in different columns, and data in the same time segment share the same Key. This design reduces the repetition of Key values and can markedly reduce the DB write volume under huge data volumes.
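The hour-bucketed storage layout above can be sketched as follows: the row key is the (item, dimension, hour) triple, and each result becomes a column named by its second offset inside the hour. All names here are illustrative.

```python
def bucket_results(results, bucket_sec=3600):
    """Group per-second/per-minute results into hour-granularity rows sharing
    one key, as in the delta-timestamp storage layout described above.
    `results` is an iterable of (item, dimension, epoch_sec, value)."""
    rows = {}
    for item, dimension, ts, value in results:
        hour = ts - ts % bucket_sec                   # hour-aligned segment
        rows.setdefault((item, dimension, hour), {})[ts - hour] = value
    return rows

rows = bucket_results([
    ("pv", "all", 7200, 10),
    ("pv", "all", 7260, 12),   # same hour: same row key, new column
    ("pv", "all", 10800, 7),   # next hour: new row key
])
```

Only the small in-hour offsets (deltas) are stored per column, so the full timestamp and dimension key appear once per hour instead of once per data point.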
11. In the execution of big data tasks, the Shuffle is a major factor affecting performance; besides heavy network overhead, it may cause data skew and even OOM. The operation module adopts a computing mode that avoids the Shuffle: task parallelism is adjusted by setting the number of computing nodes, and all statistical indicators share cluster computing resources. Statistical result data and intermediate-state data are kept in external storage; during computation, each node communicates only with the external storage, and different computing nodes do not affect each other. This design avoids problems such as data skew and OOM, and makes the load of the computing nodes in the cluster more balanced.
12. The Client module has timeout-fusing and exception-fusing protection mechanisms, with a built-in exception counter and automatic recovery component. When calls by the business side to the Client module's interface raise exceptions and the count reaches a threshold, the fusing operation is executed: after the interface is fused, the sending of statistical messages is automatically discarded. The fuse recovers automatically once the fusing time reaches a threshold. This implementation maximally guarantees the stability of the integrating party's service.
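The exception-count fuse with automatic recovery can be sketched as a small state machine. All names, the failure threshold and the cooldown are illustrative assumptions.

```python
class ClientBreaker:
    """Sketch of the client-side fuse: failures beyond a threshold open the
    breaker and statistical messages are dropped; after cooldown_sec the
    breaker closes again (automatic recovery)."""

    def __init__(self, max_failures=3, cooldown_sec=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown_sec
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now              # trip the fuse

    def allow_send(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0   # auto-recover
            return True
        return False                          # fused: drop the message

b = ClientBreaker(max_failures=2, cooldown_sec=10)
b.record_failure(now=0.0); b.record_failure(now=1.0)
dropped = not b.allow_send(now=5.0)
recovered = b.allow_send(now=12.0)
```

Dropping sends while fused is the trade-off named in the specification: some statistical messages are lost, but the business side's own service is never blocked by a failing reporting path.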
Compared with the prior art, the configuration method for describing streaming statistical operations is powerful, easily extensible, and has a simple, clear syntax that is easy to understand and use; even people without a technical background can use it well. In the general streaming big data statistics system realized on this configuration method, the whole data consumption link decreases layer by layer, and messages are aggregated at each stage, effectively reducing the downstream computation volume; all statistical tasks in the cluster share cluster computing resources, reducing unnecessary resource waste; the system implementation focuses on avoiding Shuffle operations and reducing network data transmission during computation; and the various operation scenarios of streaming statistics are packaged as uniform modules with emphasis on optimizing memory footprint and network IO, so the operation components can be reused. The system has high overall computing performance and low resource occupation, can cope with complicated streaming data statistics demands through simple page configuration and data onboarding, can effectively reduce enterprises' development and data-maintenance costs for streaming statistics, and lowers the technical threshold for small and medium enterprises to use streaming big data statistics, thus having high practical application value.
Drawings
FIG. 1 is a system architecture diagram of the present disclosure;
FIG. 2 is a diagram of a statistical term management architecture in the present invention;
FIG. 3 is a diagram of the statistical sample data of the order stream in the present invention;
FIG. 4 is an exemplary diagram of a statistical configuration method in accordance with the present invention;
FIG. 5 is a diagram of a complementary example of a statistical configuration method in accordance with the present invention;
FIG. 6 is an exemplary diagram of client module message aggregation in the present invention;
FIG. 7 is a diagram illustrating the data flow of each component module according to the present invention;
FIG. 8 is a flow chart of the data processing of the Tasks operation module in the present invention;
FIG. 9 is a flow chart of the calculation of each operational function in the present invention;
FIG. 10 is a flow chart of a radix statistics implementation of the present invention;
FIG. 11 is a flow chart of message buffer pool processing in the present invention;
FIG. 12 is a frame diagram of a flow restrictor assembly of the present invention;
Detailed Description
The following describes embodiments of the invention with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown by way of illustration. Those skilled in the art can make modifications and improvements based on these embodiments without departing from the technical solution, and such embodiments all fall within the scope of the invention.
Unless otherwise limited, the terms "comprises" and "comprising" as used herein are open-ended, so that a process comprising certain elements does not exclude the presence of other elements. As used herein, "zero" or "a plurality" is a descriptive count and may mean 0, 1 or more, without limiting the specific number of elements.
The present embodiment provides a set of configuration specifications for describing the stream statistics operation mode. The use of this configuration method is illustrated with reference to fig. 3, taking real-time statistics of e-commerce orders as an example. The metadata configuration in this example includes the following fields:
orderId: order id, string type;
province: province, string type;
city: city, string type;
userId: user id, string type;
biz: business type, string type;
sellerId: merchant id, string type;
amount: order amount, numeric type;
The statistical template configuration of the present embodiment uses an XML-based expression that includes the following attributes: title, used to describe the name of the statistics item; stat, a required attribute, the statistical expression describing the statistical calculation mode; dimens, an optional attribute, the dimension expression describing dimension information; limit, an optional attribute, the result screening expression used to filter the statistical results. The embodiment abstracts stream statistics operation scenarios into a set of operation units, including: a count operation unit (counting), a sum operation unit (summation), a max operation unit (maximum), a min operation unit (minimum), an avg operation unit (average), a seq operation unit (time series) and a bitcount operation unit (cardinality); the operation units can be extended as required. The configuration format of each operation unit is described in conjunction with the order statistics example of fig. 3.
(1) Count operation
Counting the number of orders: count()
(2) Sum operation
Counting the total order amount: sum(amount)
(3) Max operation
Counting the maximum order amount: max(amount)
(4) Min operation
Counting the minimum order amount: min(amount)
(5) Bitcount operation
Counting the number of users who placed orders: bitcount(userId)
(6) Avg operation
Counting the average order amount: avg(amount)
(7) Seq operation
Seq is used for time-series data storage and computation, for example monitoring server load in a server performance monitoring scenario: seq(loadAverage)
As shown in fig. 4 and fig. 5, a statistical expression is composed of at least one statistical operation unit, arithmetic operations can be performed between multiple operation units, and each operation unit can specify 0 or more filtering parameters. A filtering parameter performs a filtering judgment on the original streaming-data message and retains the messages that match the filtering rule. Multiple filtering parameters are separated by commas and combined with a logical AND; each filtering parameter can consist of one or more filtering conditions, between which logical operations are performed. Example statistical expressions are as follows:
(1) Counting the number of orders with an amount greater than 500 yuan: count(amount>'500')
(2) Counting the average order amount: sum(amount)/count()
(3) Counting the average consumption per user: sum(amount)/bitcount(userId)
(4) Counting the number of users of the mobile phone business: bitcount(userId, biz=='cellphone')
(5) Counting the total amount of mobile phone business orders: sum(amount, biz=='cellphone')
(6) Counting the maximum order amount in the Beijing area: max(amount, province=='beijing')
(7) Counting the number of mobile phone business orders with an amount greater than 500 yuan: count(biz=='cellphone', amount>'500')
(8) Counting the number of orders in Beijing and Shanghai: count(province=='beijing'||province=='shanghai')
(9) Counting the average order amount of the food and beverage business: avg(amount, biz=='food'||biz=='drinks')
(10) Counting the per-capita consumption of the mobile phone business: sum(amount, biz=='cellphone')/bitcount(userId, biz=='cellphone')
(11) Counting the proportion of orders with an amount greater than 500 yuan: count(amount>'500')/count()
(12) Counting the proportion of transaction amount contributed by the mobile phone business: sum(amount, biz=='cellphone')/sum(amount)
(13) Counting the proportion of users in the Beijing area: bitcount(userId, province=='beijing')/bitcount(userId)
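The behavior of these operation units can be sketched in plain Python (this is an illustrative stand-in, not the patent's implementation; the field names follow the metadata example above and the sample order values are invented):

```python
# Minimal sketch of count / sum / bitcount operation units with ANDed
# filtering parameters, evaluated over in-memory order messages.

def _matches(msg, conds):
    # Comma-separated filtering parameters in the DSL combine with logical AND.
    return all(cond(msg) for cond in conds)

def count(msgs, *conds):
    return sum(1 for m in msgs if _matches(m, conds))

def sum_(msgs, field, *conds):
    return sum(m[field] for m in msgs if _matches(m, conds))

def bitcount(msgs, field, *conds):
    # Cardinality: number of distinct values of the field.
    return len({m[field] for m in msgs if _matches(m, conds)})

orders = [
    {"userId": "u1", "biz": "cellphone", "amount": 600},
    {"userId": "u1", "biz": "cellphone", "amount": 200},
    {"userId": "u2", "biz": "food", "amount": 80},
]

# Example (10): sum(amount, biz=='cellphone') / bitcount(userId, biz=='cellphone')
is_phone = lambda m: m["biz"] == "cellphone"
per_capita = sum_(orders, "amount", is_phone) / bitcount(orders, "userId", is_phone)
print(per_capita)  # 800.0
```

Arithmetic between units, as in examples (11)–(13), then reduces to ordinary division over the two unit results.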
The statistical period supports rolling windows and sliding windows, and each window type can use a time granularity of seconds, minutes, hours or days as required. In this embodiment, to make the Web page convenient to operate, the statistical period is displayed with a drop-down box containing the following options: 1-minute, 2-minute, 5-minute, 10-minute, 30-minute, 1-hour, 2-hour, 3-hour, 6-hour, 1-day, recent-5-minute, recent-1-hour, recent-2-hour, recent-3-hour, recent-6-hour, recent-1-day. The options prefixed with recent- represent sliding-window statistics; the others represent rolling-window statistics. The statistical period can also be customized as needed. After the user sets the statistical period, the system divides time windows according to the specified period and performs statistics on the messages within each time window.
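The rolling-window division described above can be sketched as a timestamp alignment (a minimal assumption-laden helper; the function name and sample values are not from the patent):

```python
# Align a message timestamp (ms) to the start of the rolling window that
# contains it; all messages in the same window are then counted together.

def window_start(ts_ms, period_ms):
    return ts_ms - ts_ms % period_ms

MINUTE = 60_000
ts = 1670810400000 + 90_000            # 1.5 minutes past an aligned boundary
print(window_start(ts, 5 * MINUTE))    # floors to the 5-minute boundary
```

A sliding ("recent-") statistic would instead read back across the most recent such windows rather than a single aligned one.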
The data validity period is used for setting the expiration time of the statistical result, and in this embodiment, in order to facilitate the user to operate the Web page, the data validity period is displayed by using a drop-down frame, and includes the following screening items: 3day,7day,14day,1month,2month,3month,6month,1year,2year. The data validity period can be set in a self-defined mode according to the requirement. After the user sets the data validity period, the system deletes the expiration data according to the appointed validity period.
As shown in fig. 4 and fig. 5, the result screening expression is used to filter statistical results, for example the common topN and lastN operations, where N is the screening limit and can be set as required; example formats: top50, last50.
As shown in fig. 4 and fig. 5, the dimension component can specify one or more dimensions as desired; in this embodiment a semicolon separates the dimensions.
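Combining the four attributes, a statistical item configuration could look like the following (a hypothetical example in the XML template format; the attribute names come from the text, but the concrete title, stat, dimens and limit values are invented for illustration, since the FIG. 4/5 examples are not reproduced here):

```xml
<stat-item title="top 50 cities by order amount"
           stat="sum(amount)"
           dimens="province;city"
           limit="top50"/>
```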
Referring to fig. 4 and 5, transformation functions transform fields of the original message in a specified manner before they participate in the statistical operation. Transformation functions may be used in both the statistical expression and the dimension expression, and can be extended as needed. Examples are as follows:
(1) section function, for numeric interval calculation. Example: section('600', '100,500,1000,500'); output: (500-1000]
(2) date_format function, for timestamp formatting. Example: date_format(1670810400000, 'yyyy-MM-dd HH:mm:ss'); output: '2022-12-12 10:00:00'
(3) date_parse function, for converting a date to a timestamp. Example: date_parse('2022-12-12 10:00:00', 'yyyy-MM-dd HH:mm:ss'); output: 1670810400000
(4) substr function, for string interception. Example: substr('abcde', '2', '4'); output: 'cd'
(5) to_upper function, for case conversion. Example: to_upper('abc'); output: 'ABC'
(6) to_lower function, for case conversion. Example: to_lower('ABC'); output: 'abc'
(7) contains function, for string containment judgment. Example: contains('abcd', 'bc'); output: true
(8) reverse function, for string reversal. Example: reverse('abc'); output: 'cba'
(9) start_with function, for string prefix judgment. Example: start_with('abcd', 'bc'); output: false
(10) end_with function, for string suffix judgment. Example: end_with('abcd', 'cd'); output: true
(11) len function, for obtaining string length. Example: len('abcd'); output: 4
(12) left function, for string interception. Example: left('abcde', '2'); output: 'ab'
(13) right function, for string interception. Example: right('abcde', '2'); output: 'de'
(14) concat function, for string concatenation. Example: concat('ab', 'cd'); output: 'abcd'
(15) in function, for judging whether an array contains an element. Example: in('1', '1,2,3'); output: true
(16) replace function, for string replacement. Example: replace('abcde', 'ab', 'cd'); output: 'cdcde'
(17) trim function, for removing leading and trailing spaces. Example: trim(' abc '); output: 'abc'
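A few of these transformation functions can be sketched in plain Python (semantics inferred from the examples above; trailing underscores avoid Python's reserved word `in`; these are illustrative stand-ins, not the patent's code):

```python
def section(value, breakpoints):
    """Numeric interval bucketing: section('600', '100,500,1000') -> '(500-1000]'."""
    v = float(value)
    pts = sorted(float(p) for p in breakpoints.split(','))
    for lo, hi in zip(pts, pts[1:]):
        if lo < v <= hi:
            return f"({lo:g}-{hi:g}]"
    return f"({pts[-1]:g}-)" if v > pts[-1] else f"(-{pts[0]:g}]"

def substr(s, start, end):
    """substr('abcde', '2', '4') -> 'cd' (0-based, end exclusive)."""
    return s[int(start):int(end)]

def left(s, n):
    return s[:int(n)]

def right(s, n):
    return s[-int(n):]

def in_(elem, arr):
    return elem in arr.split(',')

def replace_(s, old, new):
    return s.replace(old, new)

print(section('600', '100,500,1000'))  # (500-1000]
print(replace_('abcde', 'ab', 'cd'))   # cdcde
```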
The above is a specific description of the configuration method; a specific scheme for implementing the general-purpose stream statistics system based on this configuration method is set forth below.
The system shown in connection with fig. 1 and 7 comprises the following constituent modules:
The Client module is the SDK accessed by the business side. Its main functions are: providing the SDK interface for the access party; after receiving an original statistical message, judging the running state of the current statistical group and discarding messages of statistical groups in an abnormal state; verifying the statistical group key; aggregating statistical messages; compressing statistical messages in real time; asynchronously sending messages to the statistics service in batches; and exception fusing (circuit breaking) with automatic recovery.
The RPC module is the system's RPC service module. Its main functions are: receiving statistical messages from each terminal and further aggregating them; sending statistical messages to the message middleware; and providing a statistical result query interface and a statistical group configuration query interface.
The Tasks module is the system's core operation module. Its main functions are: encapsulating the concrete implementation logic of each operation unit; receiving statistical messages and decompressing them in real time; judging current-limiting rules; parsing statistical configuration information; parsing operation expressions; parsing transformation functions and variables; dimension calculation and storage; statistical operation and result storage; system monitoring; and the system Track function.
The Web module is the Web-side display module. Its main functions are: viewing statistical results; management and maintenance of statistical engineering, statistical groups and statistical items; configuring current-limiting thresholds; data index authority management; and system user management.
As shown in fig. 2, the present embodiment uses a three-layer structure of statistical engineering, statistical groups and statistical items to manage a large number of statistical requirements. Each statistical requirement corresponds to one statistical item; a user may create several statistical engineerings as needed; each statistical engineering may include several statistical items; and statistical items based on the same metadata are called a statistical group. The steps for a business side to access the system are as follows:
1. Creating statistical engineering
The user can manage statistical engineering in the system through the Web module page. When creating an engineering, information such as the engineering name, engineering manager and engineering description must be specified. For the order example, an engineering may be created: "order data statistics". The operation authority of a statistical engineering includes engineering manager authority and access authority: access authority can only access the data indexes under the engineering, while the engineering manager can also modify the engineering, delete the engineering, and manage the statistical groups and statistical items under it.
2. Creating a statistics group
The user can manage the statistical groups under each statistical engineering through the Web module page, including creating a statistical group, deleting a statistical group, modifying a group's metadata configuration, adjusting a group's execution state, modifying a group's current-limiting threshold, and so on; all nodes in the cluster automatically load all change operations on the statistical groups. Creating a statistical group requires specifying the statistical group token and the metadata configuration corresponding to the group, where the token distinguishes different statistical groups. For example, in the order case a statistical group may be created: "order_stat", whose metadata configuration includes the relevant field names, field types and field descriptions.
After the statistics group is established, the system automatically generates key information for verification of the client message.
The statistics set contains the following states:
(1) And a normal state in which the statistical message is normally received and the statistical operation is normally performed.
(2) The current limiting state: after the statistical group's current-limiting policy is triggered, the group enters the current limiting state; statistical messages are discarded within the current-limiting time range, none of the statistical items under the group execute statistical tasks, and the group automatically returns to the normal state after the current-limiting time expires.
(3) And in a deactivated state, the user manually deactivates the statistical group, and all statistical items in the statistical group do not execute statistical tasks.
(4) And deleting the state, wherein the user manually deletes the statistical group.
3. Creating statistical items
The user can manage all statistical items under a statistical group through the Web module page, including creating, modifying and deleting statistical items, adjusting the execution state of statistical items, adjusting the current-limiting threshold of statistical items, and so on; all nodes in the cluster automatically load all change operations on the statistical items. Each statistical requirement corresponds to one statistical item, and creating a statistical item requires specifying the corresponding statistical template configuration, statistical period configuration and data validity period configuration. Suppose we have the following two statistical requirements: counting the order quantity per minute and counting the number of ordering users per minute; the following statistical items are created:
(1) Order quantity per minute statistics
Statistical template: <stat-item title="order quantity per minute statistics" stat="count()"/>
Statistical period: 1-minute
Data validity period: 14day
(2) Ordering user count per minute statistics
Statistical template: <stat-item title="per-minute order user count" stat="bitcount(userId)"/>
Statistical period: 1-minute
Data validity period: 14day
The statistics include the following states:
(1) And a normal state in which the statistical message is normally received and the statistical operation is normally performed.
(2) The current limiting state: after the statistical item's current-limiting policy is triggered, the item enters the current limiting state; statistical messages are discarded within the current-limiting time range, the current statistical item does not execute statistical tasks, and the item automatically returns to the normal state after the current-limiting time expires.
(3) And in a deactivated state, the user manually deactivates the statistical item, and the current statistical item does not execute the statistical task.
(4) And deleting the state, wherein the user manually deletes the statistical item.
After receiving the statistical information, the system firstly judges the running state of the corresponding statistical group, if the state of the statistical group is normal, the system continues to judge the running state of the statistical item, and if the statistical group and the statistical item are both in normal states, the system normally executes the statistical task.
The operation module of this embodiment is implemented based on Structured Streaming: the RPC module writes message data into Kafka, after which the operation module consumes the data from Kafka, and the parallelism of the operation module is adjusted by adjusting the executor count. In this embodiment, statistical result data and statistical dimension data are stored in HBase; the distributed filtering device used for cardinality operations is implemented with the Redis-Roaring plug-in that extends Redis; the limit operation is implemented with Redis SortedSet; and the Web module's statistical configuration information, access rights, user information and other data are stored in MySQL.
Fig. 7 shows a data flow description among the constituent modules of the present embodiment, which includes the following steps:
1. The business side reports the original message through the Client module;
2. The RPC module receives the statistical message and writes it into the Kafka message middleware;
3. The operation module consumes the Kafka data, reads the statistical configuration information, executes the statistical operation and writes the results into the DB;
4. The statistical results are viewed through the Web module;
Details of implementation inside each step are described in the following decomposition of each step.
Step one, a service party reports an original message through a Client module
1. Statistical group status verification
The access party reports the original streaming-data message to the statistics service through the SDK interface; the statistical group token, statistical group key, original message content and message timestamp must be specified when calling the SDK interface. The Client module calls the RPC module interface to obtain the configuration information of the current statistical group, which includes the current statistical group state, the statistical group key, the statistical group's associated fields and other information. The Client module reads the state flag of the statistical group and discards the corresponding message if the group is in an abnormal state. Statistical group state verification shields message transmission for groups in an abnormal state and reduces unnecessary network transfers.
2. Key verification
The key information is read from the statistical group configuration and compared with the key supplied by the access party; if the match fails, an exception is thrown. Key verification prevents third parties from reporting original message data that would distort the statistical results, improving the data security of the overall service.
3. Message body unnecessary parameter filtering
To increase message transmission speed and improve the message aggregation efficiency of subsequent steps, the Client module clips the original message to remove statistics-irrelevant fields. The statistics-irrelevant fields are calculated by the system from all effective statistical items in the statistical group; fields irrelevant to every effective statistical item are filtered out before the Client module reports the data, avoiding unnecessary data transmission.
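The clipping step amounts to keeping only the union of fields referenced by the group's effective statistical items (a one-line sketch; the helper name and sample field set are assumptions, not from the patent):

```python
# Strip statistics-irrelevant fields before the Client reports a message.
# `relevant_fields` is assumed to be the union of fields referenced by all
# effective statistical items in the group.

def prune(message, relevant_fields):
    return {k: v for k, v in message.items() if k in relevant_fields}

# Suppose only count(biz==...) and sum(amount) items are effective,
# so only {"biz", "amount"} survive.
msg = {"orderId": "o1", "province": "beijing", "biz": "cellphone", "amount": 600}
print(prune(msg, {"biz", "amount"}))  # {'biz': 'cellphone', 'amount': 600}
```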
4. Message aggregation at the client
Stream statistics application scenarios involve many repeated operations. For example, when counting the call volume of a service interface that may be called 10000 times in one minute, executing the entire operation flow for every call would waste substantial resources. To improve overall service performance, this embodiment adopts asynchronous processing and batch consumption, aggregating the repetitive computation. Repeated messages are aggregated at every link from the Client to the final warehousing of statistical results, so the system's whole consumption link has a layer-by-layer decreasing structure. The message aggregation of the Client module comprises two steps:
(1) Tamper message body timestamp
Before executing the aggregation operation, the Client module's message reporting link modifies the message's original timestamp to the minimum batch time. The purpose of tampering with the message body timestamp is to aggregate as many messages as possible while preserving the accuracy of the data in subsequent steps, thereby reducing network transmission and downstream computation. The Client module takes the greatest common divisor of the statistical periods of all effective statistical items in the current statistical group as the time window, and calculates the minimum batch time of the message from this window and the message's original timestamp. The Client module modifies the message's original timestamp to the minimum batch time and then places the message in the buffer pool. FIG. 6 shows the aggregation flow in a user-behavior-log real-time statistics scenario where hourly PV and UV data are counted from the logs.
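The minimum batch time computation can be sketched as follows (the function name and period values are assumptions; the GCD-of-periods rule is from the text):

```python
from math import gcd
from functools import reduce

# The batch window is the greatest common divisor of all effective
# statistical periods in the group; the message timestamp is floored to it.

def min_batch_time(ts_ms, periods_ms):
    window = reduce(gcd, periods_ms)
    return ts_ms - ts_ms % window

MINUTE = 60_000
# Items with 1-minute, 5-minute and 1-hour periods share a 1-minute window.
print(min_batch_time(1670810400000 + 75_000, [MINUTE, 5 * MINUTE, 60 * MINUTE]))
```

Flooring every timestamp in the same batch window to one value lets otherwise-identical messages collapse in the aggregation step without changing any window-level result.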
(2) Polymerization operation
As shown in fig. 6, the aggregation operation merges messages of the same type. In the Client module's aggregation logic, messages with consistent content are messages belonging to the same statistical group with the same parameter values. The aggregation operation uses asynchronous batch-processing threads: after original messages are placed in the buffer pool, a consuming thread reads messages from the pool in batches at a specified interval and aggregates the messages that match the rule. After aggregation, the data structure of the message body changes from the message body content alone to two attributes: the message body content and the message repetition count.
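The collapse into (content, repetition count) pairs can be sketched like this (field names such as `group`, `params` and `repeat` are illustrative assumptions):

```python
from collections import Counter

# Messages with the same statistical group and identical parameter values
# collapse into one record plus a repetition count.

def aggregate(batch):
    counts = Counter()
    for group_token, params in batch:
        counts[(group_token, tuple(sorted(params.items())))] += 1
    return [{"group": g, "params": dict(p), "repeat": n}
            for (g, p), n in counts.items()]

batch = [("order_stat", {"biz": "cellphone"}),
         ("order_stat", {"biz": "cellphone"}),
         ("order_stat", {"biz": "food"})]
print(aggregate(batch))
```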
5. Compressing and transmitting to RPC service in real time
The Client module asynchronously reads the message data of the buffer pool in batches, compresses the message body data by using snappy and then sends the data to the RPC module, and the real-time compression is used for improving the network transmission efficiency.
6. Abnormal fusing mechanism
The abnormal fusing mechanism is used for guaranteeing the stability of the service of the business party and avoiding the influence on the service of the business party caused by the instability of the statistical service. The abnormal fusing mechanism is that when the Client interface is called, if the abnormal times in unit time exceeds a threshold value, the abnormal fusing mechanism enters a fusing state, and at the moment, the Client module automatically skips the statistical message sending logic. After entering the fusing state, the Client module periodically detects whether the statistical service state is recovered to be normal, and if the statistical service is recovered to be normal, the statistical service is automatically reconnected without manual restarting.
Step two, the RPC module receives the statistical message and writes in the Kafka message middleware
1. Message reception
The RPC module provides a unified data receiving interface for all access parties of the statistics service. In this embodiment, a Disruptor lock-free queue is used to receive message data, which improves the interface's concurrency capability.
2. Message aggregation
After the RPC module receives the messages of each terminal, all the messages are further aggregated, and the aggregation logic is similar to the Client module, and is also aggregated according to the consistency of the message body content, namely the same statistical group and the same parameter value.
3. Asynchronous transmission
Using asynchronous processing and batch consumption, the RPC module compresses the statistical messages in real time and sends them to the Kafka message middleware. When sending to Kafka, a random number is assigned as the partition key; the random key balances the data volume across partitions, avoids data skew, and improves the processing efficiency of the Tasks operation module.
Step three, the Tasks operation module consumes data, reads statistical configuration information, executes statistical operation and writes the result into the DB
1. The Task module data processing flow shown in fig. 8 comprises the following steps:
(1) Reading and parsing the message from the message middleware;
(2) Judging the state of the statistical group, and filtering the statistical group information of abnormal states;
(3) Carrying out statistics on message quantity current limiting rule judgment of a group;
(4) Spreading the message according to all effective statistical items in the statistical group;
(5) Judging the result flow limit rule of each statistical item;
(6) Performing message grouping and writing into a buffer pool;
(7) The consumption thread reads the buffer pool and carries out corresponding operation according to the statistical operation type;
(8) Updating an operation result;
2. Message spreading and grouping
As shown in fig. 8, in this embodiment all statistical items under a statistical group share one message data, so the statistical items do not need separate message data to be sent, which reduces data transmission and improves network IO efficiency. After receiving message data, the Tasks module performs unfolding and grouping operations on the statistical message.
The unfolding operation is the process of searching all effective statistical items in the statistical group, extracting the associated fields of each statistical item, copying a piece of independent message data for each statistical item and only reserving the operation related fields thereof. The purpose of the unrolling operation is to avoid that the subsequent operational logic of the statistical terms influence each other.
The grouping operation is to extract the statistical period attribute of the statistical item, divide the time window according to the statistical period and group the information after the unfolding operation according to the time window; then judging whether the statistical item contains a plurality of statistical operation units, and if so, grouping again according to the statistical operation units; and judging whether the statistical item contains dimension attributes, if so, extracting dimension information and regrouping according to the dimension. The grouping operation aims at decomposing the statistical operation process, aggregating and processing the same type of information, and the operation processes of different types of information are not mutually influenced.
After the grouping operation, the system further aggregates by message type, i.e. messages with the same statistical item, the same statistical batch, the same dimension and the same operation unit.
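The four grouping coordinates named above can be sketched as a composite key (names and sample values are assumptions; only the key structure follows the text):

```python
# Group unfolded per-item messages by statistical item, statistical batch
# (time-window-aligned timestamp), dimension value and operation unit, so
# identical work can be merged before hitting the result store.

def group_key(item_id, ts_ms, period_ms, dimension, unit):
    batch = ts_ms - ts_ms % period_ms   # time-window alignment
    return (item_id, batch, dimension, unit)

k1 = group_key("item-1", 1670810415000, 60_000, "beijing", "sum")
k2 = group_key("item-1", 1670810455000, 60_000, "beijing", "sum")
print(k1 == k2)  # same 1-minute batch -> True
```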
3. Current limiting protection mechanism
To avoid system instability caused by the sudden access of a high-volume statistical requirement or a traffic surge on a particular statistical item, the system includes a current-limiting protection mechanism that safeguards the stability of the overall service. FIG. 12 shows the structure of the system's flow restrictor; the protection mechanism comprises the following strategies:
(1) Statistical group message volume flow limiting
The statistical group message volume limit is a current-limiting policy on the number of statistical group messages received per unit time. A built-in counter calculates the number of messages received by the statistical group per unit time; when the message count exceeds the threshold within the unit time, current limiting is triggered and the statistical group enters the current limiting state. The Client module and the Tasks module automatically discard messages of statistical groups in an abnormal state. Since a statistical group may correspond to one or more statistical items, this policy affects the normal statistics of all statistical items under the group. After the statistical group enters the current limiting state, its messages are automatically discarded for a specified time (20 minutes by default in this embodiment); after the current-limiting time reaches the time threshold, the group automatically returns to the normal state.
(2) Statistical term result current limiting
The statistical item result limit is a current-limiting policy on the number of statistical results generated by a statistical item per unit time. A built-in counter calculates the number of results generated per unit time; when the result count exceeds the threshold within the unit time, current limiting is triggered and the statistical item enters the current limiting state. The result volume of a statistical item relates to two factors. The first is the time granularity of the statistical period: the finer the granularity, the larger the index data volume; for example, second-level and minute-level statistics produce more data than hour-level and day-level statistics. The second is dimension: the more dimension values, the more statistical results are generated per unit time; for example, a statistical index with city as the dimension produces more results than one with province as the dimension. The result limit applies only to the current statistical item and does not affect other statistical items in the group. After the statistical item enters the current limiting state, its messages are automatically discarded for a specified time (20 minutes by default in this embodiment); after the current-limiting time reaches the time threshold, the item automatically returns to the normal state.
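The per-unit-time threshold with a self-expiring limited state can be sketched as follows (a simplified single-process stand-in for the system's limiter; class and parameter names are assumptions, the 20-minute default is from the text):

```python
import time

# Count events per unit window; exceeding the threshold enters a LIMITED
# state that auto-expires after `limited_s` seconds (20 minutes by default).

class Limiter:
    def __init__(self, threshold, unit_s=60, limited_s=20 * 60):
        self.threshold, self.unit_s, self.limited_s = threshold, unit_s, limited_s
        self.window_start, self.count, self.limited_until = 0.0, 0, 0.0

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if now < self.limited_until:              # still in the limited state
            return False
        if now - self.window_start >= self.unit_s:
            self.window_start, self.count = now, 0  # new unit-time window
        self.count += 1
        if self.count > self.threshold:
            self.limited_until = now + self.limited_s
            return False
        return True

lim = Limiter(threshold=2, unit_s=60)
print([lim.allow(now=t) for t in (0, 1, 2, 3)])  # [True, True, False, False]
```

The same structure serves both policies: keyed by statistical group for the message-volume limit and by statistical item for the result limit.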
4. Radix operation
The bitcount cardinality operation of this embodiment uses radix filtering devices to filter out existing cardinality values, determines the number of values not present in the filter, and then updates the statistical result in the DB, thereby implementing cardinality statistics. The radix filtering devices include a memory radix filter and a distributed radix filter. The memory radix filter makes a preliminary judgment on whether a cardinality value exists; because in-memory judgment is efficient, it minimizes the impact of repeated cardinality judgments on overall performance, and is implemented with the RoaringBitMap toolkit. The distributed radix filter comprises several shards, each corresponding to one RoaringBitMap data storage structure; the number of shards can be specified as needed, and increasing the shard count improves the accuracy of the cardinality operation. As shown in fig. 10, the implementation of the distributed radix filter comprises the following steps:
(1) Generate a Long-type hash value for the original value with 128-bit MurmurHash.
(2) Set the number of shards required by the statistical task, each shard corresponding to one RoaringBitmap data structure. The filter of this embodiment is implemented with the Redis-Roaring extension plug-in for Redis; the shard for an original value is obtained by hashing and taking the remainder.
(3) Split the Long hash value into two Int integers by its high 32 bits and low 32 bits, taking the absolute value of any negative Int; the combination of the two Int values is the Index of the original value in the RoaringBitmap data structure.
(4) Send the Int-value combinations for a batch of cardinality values to Redis, merging the multiple membership checks into a single execution with a Lua script. If both Int values already exist in the filter, the original value exists; otherwise it does not, and its Index values are added to the filter once the check completes.
(5) Count the number of original values not present in the filter and update the count into the database.
The advantages of this implementation are: the cardinality operation does not store original values, reducing memory usage; Index values are generated with 128-bit MurmurHash, so no mapping between original values and Index needs to be maintained; the RoaringBitmap algorithm compresses the bitmap, mitigating resource usage when the cardinality is sparse; and implementing the filtering in a Lua script reduces the number of Redis round trips, improving overall performance.
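Steps (1)–(3) can be sketched as follows. This is a hedged illustration: the patent's 128-bit MurmurHash is replaced by a stdlib stand-in hash (a real implementation would use e.g. the `mmh3` package), and all function names are hypothetical.

```python
import hashlib

def long_hash(value: str) -> int:
    # Stand-in 64-bit hash; the patent uses 128-bit MurmurHash, swapped
    # here for a stdlib hash so the sketch stays self-contained.
    digest = hashlib.blake2b(value.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)  # signed, like a Java Long

def shard_and_index(value: str, num_shards: int):
    """Map an original value to (shard, (hi, lo)) per steps (1)-(3)."""
    h = long_hash(value)
    shard = abs(h) % num_shards                 # hash remainder picks the shard

    def to_i32(x):                              # reinterpret 32 bits as signed Int
        return x - (1 << 32) if x >= (1 << 31) else x

    hi = abs(to_i32((h >> 32) & 0xFFFFFFFF))    # high 32 bits, abs if negative
    lo = abs(to_i32(h & 0xFFFFFFFF))            # low 32 bits, abs if negative
    return shard, (hi, lo)
```

The (hi, lo) pair is then checked against the shard's RoaringBitmap in Redis; the batch check and insert of step (4) run inside one Lua script to save round trips.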
5. Statistical operations and storage
The processing flow of the operation module is, for each statistical item: divide time windows according to the statistical period, filter the messages in each time window with the filtering unit so that only messages meeting the conditions are kept, regroup the message data if the item contains several statistical operation units, and regroup again if it contains dimension attributes. Each resulting group is an independent calculation type; statistics are computed according to the operation function type and the associated fields, and the results are written to the DB. As shown in fig. 9, this embodiment packages the various operation scenarios of stream statistics in a modular manner and optimizes each operation unit for its characteristics, achieving reusability. The processing flow of an operation unit includes the following steps:
(1) Statistical expression parsing: structurally split the user-defined configuration and extract each statistical operation unit along with its associated parameters and filtering parameters;
(2) Judge whether the associated parameters and filtering parameters contain conversion functions or variables, and if so, evaluate them;
(3) Select the corresponding calculation strategy according to the type of the statistical operation unit;
(4) Combine the filtering parameters into a filtering logic expression and evaluate it;
(5) Filter out messages that do not satisfy the filtering rules;
(6) Compute in memory and write the calculation result into the buffer pool;
(7) Judge whether a Limit operation is included, and if so, perform it;
(8) Asynchronously aggregate the statistical results and update the DB.
As shown in fig. 9, the calculation logic of each operation unit differs according to its characteristics:
(1) count and sum operations: an asynchronous thread aggregates the messages in the buffer pool, performs the summation, and then updates the DB.
(2) max and min operations: an asynchronous thread aggregates the messages in the buffer pool, performs the maximum and minimum operations, and then updates the DB.
(3) avg operation: the avg operation is decomposed into a sum operation and a count operation, each processed according to its own logic with the results written to the DB separately; the average is computed when the data is read.
(4) seq operation: after filtering, the data is written to the DB directly in bulkload mode without passing through the buffer pool (the DB used in this embodiment is HBase; for other types of storage engine the operation can be adjusted to the engine's write characteristics).
(5) bitcount operation: values are filtered first by the in-memory cardinality filter and then by the distributed cardinality filter; the number of values not contained in the filters is counted, and the operation is then converted into a count operation.
The system has a built-in statistical operation unit management component through which operation units can be added or deleted. The statistical result storage link re-aggregates messages; its aggregation logic is consistent with the grouping link of the operation module, i.e. messages with the same statistical item, the same time batch, the same dimension and the same operation unit are aggregated together.
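The re-aggregation keyed by (statistical item, time batch, dimension, operation unit), together with the read-time division used by the avg operation, might look like this sketch. The tuple layout and all names are assumptions for illustration.

```python
def aggregate(buffer):
    """Merge buffered results keyed by (item, time batch, dimension, unit)."""
    agg = {}
    for item, batch, dim, unit, value in buffer:
        key = (item, batch, dim, unit)
        if key not in agg:
            agg[key] = value
        elif unit in ("count", "sum"):
            agg[key] += value                # count/sum: summation
        elif unit == "max":
            agg[key] = max(agg[key], value)  # max: keep the larger
        elif unit == "min":
            agg[key] = min(agg[key], value)  # min: keep the smaller
    return agg                               # one DB upsert per key

def read_avg(db, item, batch, dim):
    # avg is stored as separate sum and count rows; divide at read time
    return db[(item, batch, dim, "sum")] / db[(item, batch, dim, "count")]
```

Merging per key before touching the DB is what lets the storage link issue one batched write per (item, batch, dimension, unit) instead of one write per message.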
6. Storage structure optimization
This embodiment further optimizes the data storage format for the stream-statistics scenario, aiming to improve DB throughput and system performance. The data-structure Key of every statistical result contains a time value, and the Value is the statistical result itself, so the structures are broadly similar. To improve write efficiency and reduce wasted storage at very large volumes, this embodiment applies timestamp compression to the stored results: according to the statistical period, data within the same hour or the same day is compressed and stored as a block. Result data with second- or minute-level calculation periods is divided into hour-granularity time segments; the multiple result values within the same segment, under the same dimension of the same statistical item, are stored in different columns, and data in the same segment shares the same Key. This design reduces Key repetition and significantly reduces the DB write volume when the data volume is huge.
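A sketch of the timestamp-compressed key layout: the separator, key format, and function name are assumptions; the patent fixes only the idea that results in the same hour block share one row Key, with the sub-hour offset as the column qualifier.

```python
from datetime import datetime

def compressed_cell(item, dim, ts: datetime, period="minute"):
    """Hypothetical HBase-style cell address: second/minute-period results
    share one hour-level row key; the sub-hour offset becomes the column."""
    row_key = f"{item}|{dim}|{ts.strftime('%Y%m%d%H')}"   # same Key per hour block
    # minute-period results: one column per minute; second-period: per second
    column = ts.strftime("%M%S") if period == "second" else ts.strftime("%M")
    return row_key, column
```

All 60 minute-level results of an hour then land in one row, so the Key bytes are written once instead of 60 times.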
7. Message buffer pool
In this embodiment, multiple links aggregate message data in order to merge messages of the same type and reduce unnecessary network transmission and downstream computation. The message buffer pool on which aggregation depends is implemented with bounded priority blocking queues. The system divides the buffer pool into a plurality of Slots; each Slot consists of a BoundedPriorityBlockingQueue and the Slot's last access timestamp. As shown in fig. 11, the processing logic of the message buffer pool includes the following steps:
(1) The Producer generates a Key for each message event according to the aggregation logic of the link; the Key distinguishes whether messages are of the same type;
(2) The message buffer pool assigns each message to a Slot by hashing its Key and taking the remainder;
(3) Messages are divided into different processing periods according to a predefined time window;
(4) Within a Slot, messages in the same processing period are ordered by Key priority, and messages in different processing periods are ordered by window time;
(5) The consumer thread group polls each Slot at regular intervals;
(6) Judge whether the used capacity of the Slot exceeds a threshold of batchsize × backlog_factor, where batchsize is the specified maximum number of messages per single consumption and backlog_factor is the specified message backlog coefficient;
(7) If the used capacity of the Slot does not exceed the threshold, further check the Slot's last consumption time; if the elapsed time exceeds the time threshold, read and consume the messages in batch, otherwise skip the task;
(8) After the Slot's messages are consumed, update the Slot's used capacity and last access time.
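The consumption decision of steps (6)–(7) can be sketched as follows; the threshold constants are illustrative, not values from the patent.

```python
import time

BATCHSIZE = 100          # max messages per single consumption (assumed)
BACKLOG_FACTOR = 3       # message backlog coefficient (assumed)
TIME_THRESHOLD_S = 1.0   # max idle time before forced consumption (assumed)

def should_consume(slot_size, last_access, now=None):
    """Decide whether a polling consumer thread should drain a Slot."""
    now = time.time() if now is None else now
    if slot_size >= BATCHSIZE * BACKLOG_FACTOR:
        return True                                  # capacity threshold exceeded
    return now - last_access >= TIME_THRESHOLD_S     # idle too long: flush anyway
```

The two conditions trade latency against batching: a backlogged Slot is drained immediately, while a quiet Slot is still flushed once its idle time passes the threshold.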
8. Time window grouping
Creating a statistical item requires specifying the time window of the stream statistics. Stream statistics splits the event stream according to the time window; the system calls each time window a batch, and stream statistics is the process of computing over the data in each batch. For example, counting the order amount every 5 minutes uses a 5-minute time window, counting it every 2 hours uses a 2-hour window, and daily PV and UV use a 1-day window. Time windows include rolling windows and sliding windows. A rolling window splits the event stream by a fixed window size, and different statistical windows do not overlap. A sliding window first fixes the window size of the statistical period and then specifies a sliding Step; different statistical windows may overlap.
Rolling window example (every 5 minutes): 10:00-10:05; 10:05-10:10; 10:10-10:15; 10:15-10:20;
Sliding window example (last 5 minutes, step size 1 minute): 10:00-10:05; 10:01-10:06; 10:02-10:07; 10:03-10:08;
In this embodiment, rolling-window statistics uses the start time of the rolling window as the statistical period identifier. Parameters such as the statistical group identifier, statistical item identifier, dimension, statistical operation unit and statistical period start time generate a unique Key for the current rolling window, and data with the same Key is counted in real time, implementing statistics within the rolling window.
Sliding-window statistics likewise uses the start time of the sliding window as the statistical period identifier, with the same parameters generating a unique Key for the current sliding window. Because sliding windows may overlap, after the task operation module receives a statistical message it determines the one or more sliding windows the message belongs to according to the message event timestamp, the fixed window size and the step, and expands and groups the message across those windows, implementing statistics within sliding windows. In this embodiment, the sliding-window step is selected automatically by the system according to the statistical period.
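The window assignment described above can be sketched with epoch-second timestamps. Function names are assumptions; the expansion step then emits one group per returned window start.

```python
def rolling_window_start(ts, window_s):
    """Start time of the single rolling window containing event time ts."""
    return ts - ts % window_s

def sliding_window_starts(ts, window_s, step_s):
    """Start times of every sliding window covering event time ts; the
    message is expanded into one group (one unique Key) per window."""
    latest = ts - ts % step_s                  # latest window that can cover ts
    earliest = latest - window_s + step_s      # earliest covering window
    return [s for s in range(max(earliest, 0), latest + step_s, step_s)
            if s <= ts < s + window_s]
```

With a 5-minute window and 1-minute step, each event falls into window_s / step_s = 5 overlapping windows, which is exactly the fan-out of the grouping expansion.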
9. Avoiding shuffle
During the execution of a big-data task, the shuffle is a major performance factor: besides heavy network overhead, it can cause data skew and even OOM. This embodiment avoids the uncontrollable factor of the shuffle and the unexpected problems it can bring. The task operation module is developed on Structured Streaming; the parallelism of task execution is adjusted by specifying the number of Executors, and both the statistical result data and the intermediate state data of the calculation are kept in external storage. In this embodiment, statistical results are stored in HBase, the intermediate state of the bitcount cardinality operation is stored in Redis, and the sorting data of the limit operation is stored in Redis. During calculation each Executor communicates only with external storage, and different Executors do not affect each other. When writing data to the Kafka message middleware, the system uses a random string as the Key to keep data balanced across Executor nodes. Through the message buffer pool design, the use of Lua scripts for Redis in the cardinality-judgment link, and batch interfaces for reading and writing HBase in the Limit operation link, the read-write pressure on the DB is reduced as much as possible; distributed locks guard concurrent access between processes. The shuffle is thereby avoided.
10. Limit operation
In this embodiment, the limit operation filters the statistical results: after the operation unit's logic finishes, the system judges whether the statistical item contains a limit attribute and, if so, performs the limit operation. The topN and lastN operations of limit are implemented on the SortedSet feature of Redis.
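A sketch of topN/lastN over per-dimension results. The embodiment keeps these in a Redis SortedSet (ZADD plus range queries); this illustration replaces that with Python's stdlib heapq, so it is a stand-in rather than the patent's implementation.

```python
import heapq

def top_n(results, n):
    """topN: the n (dimension, value) results with the largest values."""
    return heapq.nlargest(n, results, key=lambda kv: kv[1])

def last_n(results, n):
    """lastN: the n (dimension, value) results with the smallest values."""
    return heapq.nsmallest(n, results, key=lambda kv: kv[1])
```

In the Redis version the score of the sorted set plays the role of the value here, and the range query replaces the heap selection.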
11. Data expiration date
In this embodiment, HBase stores the statistical results. When a user creates a statistical item, the validity period of the data is specified; the statistical result is written to HBase with a TTL set according to the specified validity period, and the system implements data deletion through HBase's expiration mechanism.

Claims (16)

1. A universal stream big data statistics system, characterized by comprising the following modules: a Client module, used by the business side to access the SDK and report original statistical message data; an RPC module, used to receive statistical message data reported by clients and provide a statistical result query interface; an operation module, used to functionally encapsulate the various stream statistical operation units, execute rate-limiting rule judgments, parse the configuration information of each statistical item, consume message data, calculate according to the statistical item configuration and store statistical results; and a Web module, used to manage and maintain statistical groups and statistical items, view statistical results, set rate-limiting rules and manage statistical indicator access permissions.
2. The system of claim 1, wherein the statistical item configuration information is a set of configuration specifications describing the stream-statistics operation mode, the specifications comprising a statistical template configuration, a statistical period configuration and a data validity period configuration; the statistical template is configuration information in XML, JSON or another key-value format, the statistical period is the time window of the stream statistics, and the data validity period sets the storage duration of the statistical results.
3. The universal stream big data statistics system of claim 2, wherein the statistical template configuration mainly comprises: a statistical expression configuration, which specifies the calculation rule of a statistical item and consists of one or more operation units, between which arithmetic operations may be performed; a dimension expression configuration, which specifies the statistical dimension information, multi-dimensional statistics being separated with a specific separator; and a result filtering expression configuration, which executes filtering operations on the statistical results; built-in custom functions and variables may be used in the statistical template.
4. The universal stream big data statistics system of claim 3, wherein the statistical operation units comprise a count operation unit, a summation operation unit, a maximum operation unit, a minimum operation unit, a cardinality operation unit and a time-series operation unit.
5. The universal stream big data statistics system of claim 1, wherein the system manages all statistical indicators with a hierarchy of statistical groups and statistical items; each statistical item corresponds to one statistical indicator, and multiple statistical items based on the same piece of metadata are called a statistical group.
6. The universal stream big data statistics system of claim 1, wherein the system has a rate-limiting protection mechanism comprising the following policies: first, limiting the original message data volume of a statistical group; second, limiting the result volume of a statistical item; third, limiting the calculation volume of a statistical item. The statistical group message limit is a rate-limiting policy on the message volume per unit time of the current statistical group; when the message volume reaches a threshold, none of the statistical items under that statistical group executes statistical calculation. The statistical item result limit and calculation limit are rate-limiting policies on the calculation complexity and the statistical result volume per unit time of the current statistical item, and affect only the execution of the current statistical item. The rate-limiting protection mechanism has an automatic recovery function and recovers automatically after the data volume falls back below the threshold.
7. The universal stream big data statistics system of claim 1, wherein the cardinality operation unit of the operation module maintains its statistical results in external storage. The built-in duplicate-data filter implements cardinality statistics by counting the number of cardinality values not present in the filter; the filter is implemented on RoaringBitmap and comprises an in-memory cardinality filter and a distributed cardinality filter. The in-memory filter performs an efficient preliminary screening; the values that pass the screening are sent in batches to the distributed filter, the number of values not present there is counted, and the result in external storage is then updated.
8. The universal stream big data statistics system of claim 7, wherein the distributed cardinality filter comprises a plurality of shards, each corresponding to one RoaringBitmap storage structure, and the accuracy of the cardinality statistics can be adjusted by adjusting the number of shards. The system determines the shard from the original cardinality value: before the data passes through the filter, the original value is mapped to a hash value by a specified hash algorithm, and a custom conversion function computes from the hash value the Index of the cardinality value in the RoaringBitmap storage structure.
9. The universal stream big data statistics system of claim 1, wherein the built-in anomaly counting component of the Client module has a fusing protection mechanism: when exceptions occur in calls to the data reporting interface, the module judges from the number of exceptions per unit time whether fusing is needed; while the interface is fused, the corresponding statistical messages are automatically discarded, and the interface recovers automatically after the fusing duration reaches the threshold.
10. The universal stream big data statistics system of claim 1, wherein the overall consumption pipeline of the system comprises the following links: the Client module's message data reporting link, the RPC module's message data processing link, the operation module's expansion and grouping link, and the statistical result storage link. Each link uses asynchronous processing and batch consumption; after receiving messages, each link puts them into the message buffer pool, the system divides the messages into different message types according to the link's predefined aggregation logic, and messages of the same type are aggregated.
11. The system of claim 10, wherein before executing the aggregation operation the Client module modifies the original timestamp of the message to the minimum batch time, taking the greatest common divisor of the statistical periods of all effective statistical items in the current statistical group as the time window and calculating the minimum batch time of the message from that time window and the message's original timestamp.
12. The system of claim 10, wherein the Client module, according to the configuration information of all effective statistical items in the current statistical group, filters out of the reported original message the fields irrelevant to statistical calculation, and then performs the aggregation operation.
13. The system of claim 10, wherein the message buffer pool is implemented on bounded priority blocking queues and comprises a plurality of Slots, each Slot consisting of a bounded priority blocking queue and the Slot's last access timestamp. Before data is put into the buffer pool, a Key is generated for each message event according to the aggregation logic of the link, and the buffer pool assigns messages to Slots by Key. The system divides messages into different processing periods by predefined time windows; messages in the same period are ordered by Key priority, messages in different periods are ordered by window time, and consumer threads consume the buffer pool data in batches in the order given by a predefined policy.
14. The system of claim 13, wherein the predefined consumption policy is: the consumer thread group polls each Slot; judge whether the used capacity of the Slot reaches a threshold of batchsize × backlog_factor, where batchsize is the specified maximum number of messages per single consumption and backlog_factor is the specified message backlog coefficient; if the used capacity does not reach the threshold, further judge the Slot's last access time, and if the elapsed time exceeds the time threshold, read and consume the Slot in batch, otherwise skip the task; after consuming the Slot, update the Slot's used capacity and last access timestamp.
15. The system of claim 10, wherein the statistical items in a statistical group share one piece of message data through the operation module's expansion and grouping operations. The expansion operation is the process by which the operation module, after receiving message data, looks up all effective statistical items in the statistical group and, for each statistical item, copies a single message retaining only that item's relevant fields. The grouping operation extracts the statistical period attribute of each statistical item, divides time windows by the statistical period and groups the messages by time window, then further divides according to whether the statistical item contains multiple statistical operation units and dimension attributes. The grouping operation decomposes the whole calculation process, and the message aggregation processes of different types do not affect each other.
16. The universal stream big data statistics system of claim 1, wherein the calculation module adopts a calculation mode that avoids the shuffle: the execution parallelism of tasks is adjusted by setting the number of calculation nodes, and the statistical indicators share cluster calculation resources. Within a single node process, the system decomposes the whole calculation process into different calculation types by statistical item identifier, dimension identifier, time batch and statistical operation unit, and messages of the same type are processed together. Statistical result data and intermediate state data are kept in external storage; during calculation each node communicates only with the external storage, and different calculation nodes do not affect each other.
CN202310840778.2A 2022-10-17 2023-07-10 General stream big data statistics system Pending CN118467582A (en)
