CN115510110A - Universal and reusable stream type big data statistics realization method and system - Google Patents

Universal and reusable stream type big data statistics realization method and system

Info

Publication number
CN115510110A
Authority
CN
China
Prior art keywords
statistics
statistical
data
big data
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211263338.7A
Other languages
Chinese (zh)
Inventor
不公告发明人
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202211263338.7A
Publication of CN115510110A
Priority to CN202310418409.4A (patent CN116561196A)
Priority to CN202310840778.2A (patent CN118467582A)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G06F16/24556 Aggregation; Duplicate elimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/547 Remote procedure calls [RPC]; Web services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a general, reusable method and system for streaming big data statistics, and belongs to the field of big data. The invention abstracts streaming data statistics requirements into several operation scenarios, including the count, sum, max, min, avg, bitcount, topN, lastN and seq operations, and defines a set of configuration specifications for describing complex streaming data statistics requirements. The specifications provide a rich set of built-in conversion functions and support expression parsing, so they can express a wide range of complex condition filtering and logical judgments. They also support multi-dimensional calculation, statistics at day, hour, minute and second granularity, and custom statistical periods. The method helps enterprises handle complex streaming data statistics and can greatly reduce their research and development costs and data maintenance costs in this area.

Description

Universal and reusable stream type big data statistics realization method and system
Technical Field
The invention relates to the technical field of big data applications, and in particular to a general, reusable method and system for streaming big data statistics.
Background
In the internet industry, the mobile internet has matured, traffic growth has peaked, the traffic dividend has disappeared, competition among enterprises is increasingly fierce, and the cost of acquiring new users keeps rising. Many enterprises have begun to realize that market share cannot be won through crude measures such as subsidies, price wars and advertising, and that such an operating model is difficult to sustain. The idea of reducing cost, improving efficiency and maximizing the value of each user through refined, data-driven operation is gradually being accepted by more and more enterprises. Data-driven operation presupposes a complete system of data indicators, and as enterprises attach more importance to it, a large number of data statistics requirements inevitably arise. The mainstream big data statistics schemes in the industry are currently as follows. In the first, enterprises develop their own statistical computing services. This approach is costly, and as data volumes grow and data requirements accumulate, later development and maintenance impose a heavy expense on the enterprise. In the second, the enterprise builds a self-service statistical analysis platform on components such as Flink SQL and Spark SQL. The research, development, operation and maintenance costs of this approach are high, a corresponding task must be submitted on the platform for each specific requirement, and server resources are seriously wasted.
The third is based on OLAP components such as ClickHouse. OLAP components are designed for multi-dimensional analysis in complex scenarios and are used by specialist engineers, so this heavyweight approach is not well suited to a large number of lightweight data statistics requirements. In the fourth, many small and medium-sized enterprises perform data statistics and analysis entirely offline. Offline statistics sacrifices the real-time availability of data and adds the cost of maintaining and storing offline data. In view of this situation, the invention provides a general, reusable method and system for streaming big data statistics, which helps enterprises meet complex streaming data statistics requirements and addresses the high development cost, high difficulty, long development cycle, high data maintenance cost and serious resource waste of the traditional approaches.
Disclosure of Invention
The invention provides a general, reusable method for streaming big data statistics that helps enterprises meet complex streaming data statistics requirements. Taking streaming big data statistics as its entry point, it promotes the rapid adoption and large-scale application of streaming statistics. It is positioned as a big data platform in which a single set of services, using few server resources, supports tens of thousands of streaming data statistics requirements at the same time. It is dedicated to the problems that arise as streaming data statistics requirements grow explosively, and aims to help enterprises reduce the cost of data-driven operation with a technical scheme that fits these scenarios more closely and has more practical value. To achieve the above object, the technical solution of the invention includes the following:
S1. Streaming statistics can be applied in many business scenarios, and the requirements are numerous and varied. In an e-commerce scenario it can compute order volume, transaction amount and average order amount in real time; in a news and information scenario it can compute the PV and UV of a service in real time; in a technical scenario it can compute the call volume and error rate of an interface. Although these scenarios differ greatly, they have a great deal in common. The invention divides the various streaming data statistics requirements into operation scenarios according to their computational characteristics, and operation scenarios of the same type share the same calculation logic, which makes the method general. The invention abstracts streaming data statistics requirements into several operation scenarios, including the count, sum, max, min, avg, bitcount, lastN, topN and seq operations, and implements each operation with high performance so that it can be reused without limit. In the invention, each streaming statistics requirement consists of at least one operation unit, and multiple operation units can be combined using the four arithmetic operations of addition, subtraction, multiplication and division. The format of each operation unit is as follows:
1) count operation unit
Format: count(filterParam1, filterParam2)
Description: counts occurrences. The parameters are 0 or more Boolean expressions used for condition filtering.
2) bitcount operation unit
Format: bitcount(relatedColumn, filterParam1, filterParam2)
Description: cardinality (radix) statistics, that is, distinct counting. relatedColumn is the associated field on which the cardinality is computed and is a required parameter; it may be followed by 0 or more Boolean expressions.
3) sum operation unit
Format: sum(relatedColumn, filterParam1, filterParam2)
Description: sum statistics. relatedColumn is the associated field to be summed and is a required parameter; its value must be numeric, and it may be followed by 0 or more Boolean expressions used for condition filtering.
4) max operation unit
Format: max(relatedColumn, filterParam1, filterParam2)
Description: maximum value statistics. relatedColumn is the associated field whose maximum is computed and is a required parameter; its value must be numeric, and it may be followed by 0 or more Boolean expressions used for condition filtering.
5) min operation unit
Format: min(relatedColumn, filterParam1, filterParam2)
Description: minimum value statistics. relatedColumn is the associated field whose minimum is computed and is a required parameter; its value must be numeric, and it may be followed by 0 or more Boolean expressions used for condition filtering.
6) avg operation unit
Format: avg(relatedColumn, filterParam1, filterParam2)
Description: average value statistics. relatedColumn is the associated field to be averaged and is a required parameter; its value must be numeric, and it may be followed by 0 or more Boolean expressions used for condition filtering.
7) seq operation unit
Format: seq(relatedColumn, filterParam1, filterParam2)
Description: time-series data storage. relatedColumn is the associated field of the time-series data and is a required parameter; its value must be numeric, and it may be followed by 0 or more Boolean expressions used for condition filtering.
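As a minimal sketch of how operation units of one type could share a single calculation logic, the Python below evaluates count, sum, max and avg over a list of messages and combines two units arithmetically. The function names, the use of Python callables in place of Boolean filter expressions, and the sample fields are illustrative assumptions, not the implementation described in the patent.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
Filter = Callable[[Message], bool]


def _select(messages: List[Message], filters) -> List[Message]:
    """Keep only the messages that satisfy every Boolean filter."""
    return [m for m in messages if all(f(m) for f in filters)]


def _values(messages: List[Message], column: str, filters) -> List[float]:
    """Read the numeric related column from the filtered messages."""
    return [float(m[column]) for m in _select(messages, filters)]


def count(messages: List[Message], *filters: Filter) -> int:
    return len(_select(messages, filters))


def sum_(messages: List[Message], column: str, *filters: Filter) -> float:
    return sum(_values(messages, column, filters))


def max_(messages: List[Message], column: str, *filters: Filter) -> Optional[float]:
    vals = _values(messages, column, filters)
    return max(vals) if vals else None


def avg(messages: List[Message], column: str, *filters: Filter) -> Optional[float]:
    vals = _values(messages, column, filters)
    return sum(vals) / len(vals) if vals else None


# Hypothetical order messages for an e-commerce scenario.
orders = [
    {"amount": 10, "state": "paid"},
    {"amount": 30, "state": "paid"},
    {"amount": 5, "state": "refund"},
]


def paid(m: Message) -> bool:
    return m["state"] == "paid"


# Two operation units combined with division, in the spirit of sum(...)/count(...).
average_paid_amount = sum_(orders, "amount", paid) / count(orders, paid)
```

Because every unit goes through the same `_select` step, adding a filter expression affects all operation types in the same way.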
S2. In real business scenarios, data usually has to be aggregated along different dimensions for different purposes. For example, the order volume per day and per hour must be counted, and the order volume of each region must also be counted by the region dimension. To handle this, the invention defines a set of configuration specifications for describing complex streaming data statistics requirements. The specification supports multi-dimensional calculation, statistics at day, hour, minute and second granularity, and custom statistical periods. The specification has three components: the statistical configuration template, an XML-based expression that describes how a streaming statistic is calculated; the statistical period, the time window of the streaming statistic, which can be set to day, hour, minute or second granularity as needed; and the data validity period, the length of time for which the statistical result is stored.
S3. To further extend the scope of the statistical configuration specification in real business scenarios, the specification can provide a rich set of built-in conversion functions and variables and supports expression parsing, so it can express a wide range of complex condition filtering and logical judgments. The conversion functions and built-in variables of the system can be extended flexibly as needed.
S4. Enterprises have many kinds of streaming data statistics requirements, and a large enterprise may have tens of thousands of them. Data indicators can be divided into types according to their target users, such as business indicators, technical indicators and product indicators, and to keep the data secure each indicator usually needs access control at a different granularity. To manage a large number of statistics requirements under these conditions, the invention uses a three-layer structure of statistical projects, statistical groups and statistical items. A user can create multiple statistical projects as needed, and each statistical project can contain multiple statistical items. Statistical items based on the same metadata form a statistical group. Each statistical group corresponds to one metadata configuration, which is the data structure of the original message and comprises field names and field types. This design has two advantages. First, all statistics requirements in a statistical group share one copy of the original message data, which simplifies access for the business party and improves processing performance. Second, the business party can set access permissions at the granularity of a statistical project, and data indicators with different business attributes belong to different statistical projects, which both protects the data and makes the system more convenient to use.
S5. Streaming data statistics involves many repeated operations. For example, when counting the calls to an interface per minute, the interface may be called 10,000 times within one minute, and executing the whole processing flow once per call would clearly waste resources. To improve overall performance in this situation, the invention uses asynchronous processing, batch consumption and aggregation of repeated calculations. Every stage, from the client sending a message to the final storage of the statistical result, merges repeated messages, so the data volume of the consumption pipeline decreases stage by stage. This reduces the amount of data passed downstream, improves network I/O efficiency, saves memory, and directly reduces downstream computation and write pressure on the DB.
S6. Cardinality (radix) statistics is one of the more resource-intensive operation types in streaming big data statistics. Storing all original values and then computing over them would incur considerable write overhead and memory usage. To avoid this, the invention provides a cardinality statistics implementation with a low resource footprint, which uses a duplicate-data filtering device built into the system. The filtering device comprises multiple shards, each corresponding to a Roaring bitmap storage structure. The number of shards can be set flexibly as needed, and increasing it improves the accuracy of the cardinality statistics. When data passes through the filtering device, a long-type hash of the original value is first computed with the 128-bit MurmurHash algorithm, and the long value is then converted into the index of the original value in the Roaring bitmap. The filtering device performs cardinality statistics by checking whether that index already exists. This implementation does not store the original values; because the bitmap index is computed by hashing, no mapping between original values and indexes has to be maintained, which greatly improves write and computation performance.
S7. To prevent the system from becoming unstable when a statistics requirement with a large data volume is suddenly connected or the traffic of a statistical item surges, the system has a rate-limiting (current-limiting) protection mechanism with two aspects: a limit on the message volume of a statistical group, and a limit on the result volume of a statistical item. The statistical group message limit applies to the number of messages received by the current statistical group within a unit time window; if it is triggered, all statistical items in that group are affected. The statistical item result limit applies to the number of statistical results produced by the current statistical item within a unit time window; if it is triggered, only that statistical item is affected. The rate-limiting mechanism better protects the stability of the whole service, and the thresholds can be adjusted flexibly from the web end. The rate-limiting device includes an automatic recovery component: when the data volume falls below the threshold, the statistical service resumes automatically.
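The unit-time-window limit with automatic recovery can be sketched as a fixed-window counter. This is only one plausible reading: the class name, the fixed-window policy and the reset-on-new-window recovery rule are assumptions made for illustration, and the patent does not specify the algorithm.

```python
class WindowLimiter:
    """Fixed-window limiter: at most `threshold` units per time window.

    Hypothetical sketch. One instance could guard the message volume of a
    statistical group, and another the result volume of a statistical item.
    """

    def __init__(self, threshold: int, window_seconds: int) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now: float, amount: int = 1) -> bool:
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # A new window starts: the counter resets, which models the
            # automatic recovery once the volume is back under the threshold.
            self.current_window = window
            self.count = 0
        self.count += amount
        return self.count <= self.threshold


group_limiter = WindowLimiter(threshold=2, window_seconds=60)
decisions = [group_limiter.allow(t) for t in (0, 1, 2, 61)]
```

Passing the clock value in as `now` keeps the sketch deterministic; a real service would more likely read the system clock.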
S8. In streaming statistics, the key of each statistical result is a value containing a timestamp and the value is the statistical result itself, so the records are highly similar to one another. Because the number of records is very large, statistical results are stored using delta timestamp compression to improve write efficiency and reduce wasted storage. Depending on the calculation period of the statistical task, the results for the same hour or the same day are compressed and stored in one block, and all data in a block share the same key. This avoids writing the timestamp data repeatedly and saves a large amount of storage.
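A minimal sketch of the block layout described above, assuming that each block is keyed by the start of its hour and that every entry stores only its offset (delta) from that key. The function names and the in-memory dictionary are illustrative stand-ins for the real storage layer, which the patent does not detail.

```python
from typing import Dict, List, Tuple

Result = Tuple[int, float]  # (timestamp in seconds, statistical value)


def pack(results: List[Result], block_seconds: int = 3600) -> Dict[int, List[Tuple[int, float]]]:
    """Group results into blocks that share one key (the block start time)."""
    blocks: Dict[int, List[Tuple[int, float]]] = {}
    for ts, value in sorted(results):
        base = ts - ts % block_seconds          # shared key for the block
        blocks.setdefault(base, []).append((ts - base, value))  # store the delta only
    return blocks


def unpack(blocks: Dict[int, List[Tuple[int, float]]]) -> List[Result]:
    """Rebuild the full timestamps from block key plus delta."""
    return sorted((base + delta, value)
                  for base, entries in blocks.items()
                  for delta, value in entries)


minute_results = [(3600, 5.0), (3660, 7.0), (7200, 1.0)]
packed = pack(minute_results)
```

With minute-level results and hourly blocks, up to 60 entries share a single key, so the full timestamp is written once per block instead of once per result.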
S9. The client module has timeout and exception circuit-breaking (fusing) protection, with a built-in exception counter and automatic recovery component. When calls from the business party's service to the API provided by the client module fail, the system decides automatically, based on the number of failed calls, whether to trip the circuit breaker. While the breaker is open, statistical messages are discarded instead of being sent, and once the open duration reaches the system threshold the client recovers automatically.
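The circuit-breaking behaviour can be sketched as a failure counter plus an open timer. The class name, the parameters and the exact trip and recovery rules are assumptions made for illustration; the patent only states that tripping depends on the number of abnormal calls and that recovery follows once the open duration reaches a threshold.

```python
class CircuitBreaker:
    """Hypothetical client-side breaker: trips on failures, recovers on a timer."""

    def __init__(self, failure_threshold: int, open_seconds: float) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def record_failure(self, now: float) -> None:
        """Count a failed or timed-out send and trip when the threshold is hit."""
        self.failures += 1
        if self.opened_at is None and self.failures >= self.failure_threshold:
            self.opened_at = now

    def allow(self, now: float) -> bool:
        """Return True if a statistical message may be sent at time `now`."""
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.open_seconds:
            # Open duration reached: recover automatically.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # breaker open: the message is dropped, not sent


breaker = CircuitBreaker(failure_threshold=2, open_seconds=30)
breaker.record_failure(now=1)
breaker.record_failure(now=2)   # second failure trips the breaker at t=2
```

Dropping messages while the breaker is open protects the business service from being slowed down by the statistics SDK, at the cost of losing some statistical data for that interval.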
Compared with the prior art, the invention has the following beneficial effects. Complex streaming data statistics requirements can be met through simple page configuration and data access, which greatly reduces an enterprise's development and data maintenance costs for streaming data statistics. The method saves enterprises time and cost and supports rapid product iteration. The invention also lowers the technical threshold for small and medium-sized enterprises to use streaming big data statistics. Finally, the technical implementation uses few resources, delivers high computing performance and provides a more stable service, which improves server utilization and reduces an enterprise's investment in server resources.
Drawings
FIG. 1 is a system architecture diagram of the invention;
FIG. 2 is a diagram of the management structure to which statistical items belong;
FIG. 3 is a sample data diagram for the order volume statistics example of the invention;
FIG. 4 is an example diagram of message aggregation by the client module;
FIG. 5 is a diagram of the data flow among the constituent modules of the invention;
FIG. 6 is a usage illustration of the configuration specification of the invention;
FIG. 7 is a flow diagram of the cardinality (radix) statistics implementation of the invention;
FIG. 8 is a structural diagram of the rate-limiting (current-limiting) device of the invention.
Detailed Description
The technical solutions in the embodiments of the invention are described clearly below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention, not all of them. Based on these embodiments, those skilled in the art can make variations and improvements without departing from the technical solution, and all other embodiments obtained without creative effort fall within the protection scope of the invention.
This embodiment provides an implementation of a general streaming big data statistics system. It helps an enterprise meet complex streaming data statistics requirements and addresses the problems of high development cost, high difficulty, long development cycle and heavy resource usage. The system helps the enterprise quickly build a relatively complete, stable and reliable data-driven operation system and reduces its investment in data-driven operation.
As shown in FIG. 1 and FIG. 5, the system includes the following constituent modules:
(1) Client module: the SDK through which the business party accesses the system. Its main functions are: on receiving a statistical message, checking the running state of the current statistical group and statistical tasks and discarding original messages for tasks in an abnormal state; aggregating original statistical messages; sending messages to the back-end service asynchronously and in batches; and providing built-in exception and timeout circuit breaking, which trips automatically when the number of failed or timed-out sends exceeds a threshold and returns to the normal state automatically once the open duration reaches the system threshold.
(2) ICE module: the system's RPC service module. Its main functions are: receiving statistical messages from each client and further aggregating them; sending statistical messages to the message middleware; and exposing a query interface for statistical results.
(3) Tasks module: the core computation module of the system. Its main functions are: encapsulating the implementation logic of each operation unit; receiving statistical messages and decompressing them in real time; verifying keys and parameter formats and discarding messages that fail verification; evaluating the rate-limiting logic; parsing statistical task configurations and operation expressions; parsing built-in conversion functions and variables; storing dimension information; and performing the statistical operations and aggregating and storing the statistical results.
(4) Web module: the web display module. Its main functions are: viewing statistical results; managing statistical tasks; configuring rate-limiting rules; managing data indicator permissions; and managing system users.
The invention defines a set of configuration specifications for describing complex streaming data statistics requirements. The specification can provide a rich set of built-in conversion functions and variables and supports expression parsing, so it can express a wide range of complex condition filtering and logical judgments; it supports multi-dimensional calculation, statistics at day, hour, minute and second granularity, and custom statistical periods. The specification has three components: the statistical template, an XML-based expression that describes how a streaming statistic is calculated; the statistical period, the time window of the streaming statistic, which can be set to day, hour, minute or second granularity as needed; and the data validity period, the length of time for which the statistical result is stored. The statistical template is the core component of the configuration specification. FIG. 3 and FIG. 6 illustrate its use in an e-commerce order count scenario. The expression has four basic attributes:
(1) The title attribute (required) gives the name of the statistical item.
(2) The stat attribute (required) expresses how the statistic is calculated. It consists of at least one operation unit, and multiple operation units are combined using the four arithmetic operations of addition, subtraction, multiplication and division.
(3) The dimensions attribute (optional) describes the statistical dimensions; multiple dimensions are separated by semicolons.
(4) The limit attribute (optional) describes a topN or lastN operation, where N can be set as needed.
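To make the four attributes concrete, the sketch below parses a template with Python's standard `xml.etree.ElementTree`. The element name `stat-item`, the sample attribute values (including the `top10` form of the limit) and the `StatTemplate` class are hypothetical; the text specifies only that the template is XML-based and carries the title, stat, dimensions and limit attributes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StatTemplate:
    title: str
    stat: str
    dimensions: List[str]
    limit: Optional[str]


def parse_template(xml_text: str) -> StatTemplate:
    """Parse one template element and enforce the two required attributes."""
    node = ET.fromstring(xml_text)
    title = node.get("title")
    stat = node.get("stat")
    if not title or not stat:
        raise ValueError("title and stat are required attributes")
    # Multiple dimensions are separated by semicolons.
    dims = [d.strip() for d in node.get("dimensions", "").split(";") if d.strip()]
    return StatTemplate(title=title, stat=stat, dimensions=dims, limit=node.get("limit"))


# Hypothetical template for an order count broken down by region.
template = parse_template(
    '<stat-item title="Order count by region" stat="count()" '
    'dimensions="province;city" limit="top10"/>'
)
```

The stat attribute is kept as a raw string here; turning it into operation units would be the job of the expression parser mentioned in the text.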
To further extend the scope of the configuration specification, as shown in FIG. 6, the specification can provide a rich set of built-in conversion functions and variables and supports expression parsing, so it can express a wide range of complex condition filtering and logical judgments. The conversion functions and built-in variables can be extended flexibly as needed.
In this embodiment, a three-layer structure of statistical projects, statistical groups and statistical items is used to manage a very large number of statistics requirements, with each requirement corresponding to one statistical item. As shown in FIG. 2, a user can create multiple statistical projects as needed, and each statistical project can contain multiple statistical items. Statistical items based on the same metadata form a statistical group, and each statistical group corresponds to one piece of metadata. The metadata configuration must be specified when the statistical group is created; it is the data structure of the original message and comprises field names and field types. A business party accesses the system through the following steps:
(1) Create the corresponding statistical project.
(2) Create the corresponding statistical group under the project, specifying the fields and field types contained in the group's metadata.
(3) Create statistical items under the statistical group as needed.
(4) Report original messages through the Client module; each original message contains the statistical group identifier and the field data.
This embodiment uses asynchronous processing, batch consumption and aggregation of repeated calculations. Every stage, from the client sending a message to the final storage of the statistical result, merges repeated messages, so the data volume of the consumption pipeline decreases stage by stage. This reduces the amount of data passed downstream, improves network I/O efficiency, saves memory, and directly reduces downstream computation and write pressure on the DB. As shown in FIG. 4, message aggregation in the client module includes the following steps:
(1) Remove the fields of the original message that are not relevant to any statistical task.
(2) Calculate the minimum batch time, which is the greatest common divisor of the calculation periods of the active statistical tasks, and align the timestamp of the original message to the minimum batch time.
(3) Aggregate repeated messages and send them asynchronously to the back-end service.
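The three aggregation steps above can be sketched as follows. The sketch assumes that aligning a timestamp means rounding it down to a multiple of the minimum batch time, and that merged messages are represented as a key plus a repeat count; both are interpretations, and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import gcd
from typing import Dict, Iterable, List, Set, Tuple


def min_batch_seconds(periods: Iterable[int]) -> int:
    """Greatest common divisor of the active tasks' calculation periods."""
    return reduce(gcd, periods)


def aggregate(messages: List[Tuple[int, Dict[str, str]]],
              relevant_fields: Set[str],
              periods: List[int]) -> Counter:
    """Merge repeated messages into (aligned timestamp, fields) -> repeat count."""
    batch = min_batch_seconds(periods)
    merged: Counter = Counter()
    for ts, fields in messages:
        # Step 1: drop fields that no statistical task uses.
        kept = tuple(sorted((k, v) for k, v in fields.items() if k in relevant_fields))
        # Step 2: align the timestamp to the minimum batch time.
        window = ts - ts % batch
        # Step 3: identical messages collapse into one entry with a count.
        merged[(window, kept)] += 1
    return merged


# Hypothetical interface-call messages; "trace" is not used by any task.
calls = [
    (125, {"api": "login", "trace": "x1"}),
    (170, {"api": "login", "trace": "x2"}),
    (185, {"api": "login", "trace": "x3"}),
]
batched = aggregate(calls, relevant_fields={"api"}, periods=[60, 300])
```

Here three raw messages shrink to two entries, and only those entries and their counts would need to be sent downstream.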
The built-in duplicate-data filtering device in this embodiment implements cardinality (radix) statistics. The filtering device comprises multiple shards, each corresponding to a RoaringBitmap storage structure, and the number of shards can be specified as needed. As shown in FIG. 7, the cardinality statistics scheme includes the following steps:
(1) And (4) the original numerical value is subjected to MurmurHash-128Bit to generate a Long type Hash value corresponding to the original numerical value.
(2) The method includes the steps that the number of fragments required by a statistical task is set, each fragment corresponds to a repeated data filtering device, the method is achieved by means of Redis expansion Redis-Roaring plug-ins, each fragment corresponds to a key value of a Roaring BitMap storage structure, and the original numerical value Hash is used for selecting the corresponding fragment.
(3) Since the Redis-rounding plug-in only supports 32-bit integers, the absolute value of the hash value of the long type is split into two int type integers according to the high 32-bit and the low 32-bit, and the combination of the two int values corresponds to the index value of the original value in the rounding bitmap data structure.
(4) Sending the Int value pair combination to Redis service in batches, aggregating a plurality of operations of repeated judgment to Lua script for merging execution, judging whether two Int values exist in the filtering device, filtering out corresponding original values by the device if the Int values exist, counting the number of the original values which do not exist in the filtering device, and updating the original values to a database.
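The hash, shard and split steps above can be illustrated with the following in-memory sketch. Several substitutions are assumptions made only to keep the example self-contained: `hashlib.blake2b` stands in for MurmurHash 128-bit, a Python `set` per shard stands in for a RoaringBitmap key in Redis, and the class and method names are hypothetical. How the (high, low) pair is mapped onto bitmap offsets is not specified in the text, so the pair is simply stored as a tuple.

```python
import hashlib


class ShardedDistinctCounter:
    def __init__(self, shard_count=4):
        # One set of (high, low) pairs per shard stands in for one bitmap key.
        self.shards = [set() for _ in range(shard_count)]
        self.distinct = 0

    @staticmethod
    def _hash64(value):
        # Stand-in hash (assumption); the text names MurmurHash 128-bit.
        digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big", signed=True)

    def add_batch(self, values):
        """Process a batch and return how many values were not seen before."""
        new = 0
        for value in values:
            h = abs(self._hash64(value))
            # Split into high and low 32-bit halves, as in step (3).
            high, low = h >> 32, h & 0xFFFFFFFF
            shard = self.shards[h % len(self.shards)]
            if (high, low) not in shard:
                shard.add((high, low))
                new += 1
        self.distinct += new
        return new
```

Calling `add_batch` once per batch mirrors the batched duplicate check of step (4): only the count of previously unseen values is added to the running total.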
To prevent system instability caused by the sudden onboarding of a statistical requirement with a large data volume, or by a traffic surge on a particular statistical item, the system provides a rate-limiting ("current-limiting") protection mechanism covering two aspects: limiting the message volume of a statistic group, and limiting the result volume of a statistical item. Statistic group message-volume limiting is a policy on the number of statistical messages of the current statistic group within a unit time window; if it is triggered, all statistical items under that statistic group are affected. Statistical item result-volume limiting is a policy on the number of statistical results of the current statistical item within a unit time window; if it is triggered, only that statistical item is affected. The mechanism better guarantees the stability of the overall service, and the thresholds can be adjusted flexibly on the web side. In addition, the rate-limiting device has an automatic recovery component: once the data volume drops below the threshold, the statistical service recovers automatically. As shown in Fig. 8, the operation of the rate-limiting protection device includes the following steps:
(1) The rate-limiting device comprises a statistic group message-volume limiting component, a statistical item result-volume limiting component, and a component that automatically recovers from the rate-limited state.
(2) When a statistical message enters the rate-limiting device, the statistic group message-volume rule is evaluated first. The device uses a 10-second counting window (this value can be adjusted as needed) and checks whether the message count of the statistic group in the current window exceeds the threshold. If it does, the system places the statistic group in the rate-limited state, and all statistical messages belonging to a group in that state are automatically discarded. This policy affects all statistical items in the statistic group.
(3) If statistic group message-volume limiting is not triggered, the system goes on to evaluate the statistical item result-volume rule. The device again uses a 10-second counting window (adjustable as needed) and checks whether the result count of the statistical item in the current window exceeds the threshold. If it does, the system places the current statistical item in the rate-limited state, and all statistical messages belonging to an item in that state are automatically discarded. This policy affects only the current statistical item.
(4) The rate-limited state lasts 10 minutes by default, and this value can be adjusted according to actual needs. When the rate-limited duration reaches this threshold, the system automatically returns to the normal state.
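A minimal sketch of the two-level check above, assuming a fixed-window counter and an explicitly passed clock (`now`, in seconds) so that the behaviour is easy to follow. The class name, function name and parameters are hypothetical; the 10-second window and 10-minute limited duration are the defaults stated in the text. Discarding the message that first pushes the count over the threshold is an assumption.

```python
class WindowLimiter:
    """Fixed-window counter that blocks a key once a threshold is exceeded."""

    def __init__(self, threshold, window=10, block_for=600):
        self.threshold, self.window, self.block_for = threshold, window, block_for
        self.counts = {}         # key -> (window start, count in that window)
        self.blocked_until = {}  # key -> time at which the limit is lifted

    def allow(self, key, now, amount=1):
        if now < self.blocked_until.get(key, 0):
            return False  # still in the rate-limited state: discard
        start = now - now % self.window
        prev_start, count = self.counts.get(key, (start, 0))
        if prev_start != start:
            count = 0  # a new counting window has begun
        count += amount
        self.counts[key] = (start, count)
        if count > self.threshold:
            self.blocked_until[key] = now + self.block_for
            return False
        return True


def admit(group_limiter, item_limiter, group, item, now, results=1):
    # The group rule is checked first; the item rule only if the group passes.
    return (group_limiter.allow(group, now)
            and item_limiter.allow((group, item), now, results))
```

Because the group limiter is keyed by group alone, tripping it blocks every item in that group, whereas the item limiter is keyed by (group, item) and blocks only that item.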
This embodiment targets the specific scenario of streaming statistics. The key of every statistical result is a value that contains a timestamp, and the value is the statistical result itself, so the data are structurally similar. To improve write efficiency and reduce wasted storage when the number of results is huge, statistical results are stored using delta timestamp compression: depending on the calculation period of the statistical task, data from the same hour or the same day are compressed and stored in one block. In this embodiment, data of statistical tasks whose calculation period is in seconds or minutes are divided into time periods at hour granularity, and data in the same time period are stored in one region.
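One possible reading of this storage layout is sketched below. The text does not give the on-disk format, so everything here is an assumption: results are grouped into hourly blocks, and each timestamp is stored as a small slot index relative to the block's base hour instead of as a full timestamp. The function names are hypothetical, and timestamps are assumed to be in seconds and aligned to the calculation period.

```python
from collections import defaultdict


def pack_by_hour(results, period):
    """Group (timestamp, value) results into {hour_base: (offsets, values)}."""
    blocks = defaultdict(list)
    for ts, value in results:
        blocks[ts - ts % 3600].append((ts, value))
    packed = {}
    for base, rows in blocks.items():
        rows.sort()
        # Keep only the slot index relative to the block start, not the full timestamp.
        offsets = [(ts - base) // period for ts, _ in rows]
        packed[base] = (offsets, [v for _, v in rows])
    return packed


def unpack(packed, period):
    """Rebuild the original (timestamp, value) pairs from the packed blocks."""
    return sorted((base + off * period, v)
                  for base, (offsets, values) in packed.items()
                  for off, v in zip(offsets, values))
```

For a task with a 60-second period, one hourly block holds at most 60 slot indexes in the range 0 to 59, which is what makes the per-record timestamp cost small.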
The client module of this embodiment adopts a timeout circuit-breaking ("fusing") mechanism, with a built-in exception counting component and automatic recovery component. When an exception occurs while the business side's own service calls the API provided by the client module, the system automatically decides whether to trip the circuit breaker according to the number of abnormal calls. If the interface is tripped, statistical messages are dropped instead of being sent, and once the tripped duration reaches the system threshold the interface automatically returns to the normal state. This implementation maximizes the stability of the business side's own service.
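The breaker behaviour described above can be sketched as follows. This is an illustrative simplification under stated assumptions: timeouts are treated as ordinary exceptions, failures are counted consecutively, and the clock is passed in as `now` (seconds). The class name, the `send` callback and the default thresholds are hypothetical and do not come from the patent text.

```python
class CircuitBreaker:
    def __init__(self, max_failures=3, open_for=60):
        self.max_failures, self.open_for = max_failures, open_for
        self.failures = 0    # exception counting component
        self.open_until = 0  # time at which the breaker recovers automatically

    def call(self, send, message, now):
        """Try to send; return True on success, False if dropped or failed."""
        if now < self.open_until:
            return False  # tripped: drop the message without calling send
        try:
            send(message)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Trip the breaker and restart counting for the next cycle.
                self.open_until = now + self.open_for
                self.failures = 0
            return False
        self.failures = 0
        return True
```

Because a failed or dropped send only returns `False`, a fault in the statistics pipeline never propagates as an exception into the business side's own request path.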

Claims (10)

1. A general-purpose streaming big data statistics implementation method, characterized in that streaming data statistics requirements are abstracted and classified into multiple operation scenarios, including count, sum, max, min, avg, bitcount, topN, lastN and seq operations, and each operation is implemented with high performance, thereby achieving generality and unlimited reuse.
2. The general-purpose streaming big data statistics implementation method as claimed in claim 1, wherein a set of configuration specifications is formulated for describing complex streaming big data statistics requirements; the configuration specification supports multidimensional calculation, statistics at day, hour, minute and second time granularities, and custom statistical periods, and comprises three components: the statistical template, an XML-based expression describing how the stream statistics are computed; the statistical period, the time window for stream data statistics, selectable among day, hour, minute and second granularities as needed; and the data validity period, the storage duration of the statistical result.
3. The method as claimed in claim 1, wherein the configuration specification has rich built-in transformation functions and variables, supports expression parsing, and can satisfy various complex condition filtering and logical judgments.
4. A general-purpose streaming big data statistics system, characterized by comprising the following modules: the Client module, the SDK through which the business side accesses the system; the ICE module, the RPC service module of the system, used for receiving statistical message data from each terminal; the Task module, the core statistical computation module; and the Web module, which provides management of statistical tasks, viewing of statistical results, rate-limit settings and permission settings.
5. The general-purpose streaming big data statistics system according to claim 4, wherein all statistical requirements are managed in a three-level structure of statistical projects, statistic groups and statistical items; a user can create multiple statistical projects as needed, each statistical project can contain multiple statistical items, multiple statistical items based on the same metadata are called a statistic group, and each statistic group corresponds to one piece of metadata.
6. The general-purpose streaming big data statistics system of claim 4, wherein asynchronous processing, batch consumption and aggregation of repeated computations are adopted; at every stage, from the client sending a message to the final storage of the statistical result, duplicate messages are merged, and the data volume decreases layer by layer along the whole consumption pipeline of the system.
7. The general-purpose streaming big data statistics system as claimed in claim 4, wherein a duplicate-data filtering device is built into the system for implementing cardinality statistics; the filtering device comprises a plurality of shards, each shard corresponds to a RoaringBitmap data storage structure, the accuracy of the cardinality statistics can be improved by increasing the number of shards, and the number of shards can be flexibly set as required; the filtering device computes the hash value of the original value with the MurmurHash 128-bit algorithm, and implements cardinality statistics by checking whether the hash value already exists.
8. The general-purpose streaming big data statistics system according to claim 4, wherein the system has a rate-limiting protection mechanism covering two aspects: limiting the message volume of a statistic group, and limiting the result volume of a statistical item; the mechanism better guarantees the stability of the system; the thresholds can be flexibly adjusted on the web side; and the rate-limiting protection device has an automatic recovery component, so that the statistical service recovers automatically when the data volume drops below the threshold.
9. The general-purpose streaming big data statistics system of claim 4, wherein statistical results are stored using delta timestamp compression, and data from the same hour or the same day are compressed and stored in one block according to the calculation period of the statistical item.
10. The system of claim 4, wherein the client module has timeout-based and exception-based circuit-breaking mechanisms, with a built-in exception counting component and automatic recovery component; when an exception occurs while the business side calls the API (application program interface) provided by the client module, the system decides whether to trip the circuit breaker according to the number of exceptions; when the API is tripped, statistical messages are automatically discarded; and the API recovers automatically once the tripped duration reaches the system threshold.
CN202211263338.7A 2022-10-17 2022-10-17 Universal and reusable stream type big data statistics realization method and system Pending CN115510110A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202211263338.7A CN115510110A (en) 2022-10-17 2022-10-17 Universal and reusable stream type big data statistics realization method and system
CN202310418409.4A CN116561196A (en) 2022-10-17 2023-04-18 Configuration method for describing stream statistics operation mode
CN202310840778.2A CN118467582A (en) 2022-10-17 2023-07-10 General stream big data statistics system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211263338.7A CN115510110A (en) 2022-10-17 2022-10-17 Universal and reusable stream type big data statistics realization method and system

Publications (1)

Publication Number Publication Date
CN115510110A true CN115510110A (en) 2022-12-23

Family

ID=84510454

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202211263338.7A Pending CN115510110A (en) 2022-10-17 2022-10-17 Universal and reusable stream type big data statistics realization method and system
CN202310418409.4A Pending CN116561196A (en) 2022-10-17 2023-04-18 Configuration method for describing stream statistics operation mode
CN202310840778.2A Pending CN118467582A (en) 2022-10-17 2023-07-10 General stream big data statistics system

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN202310418409.4A Pending CN116561196A (en) 2022-10-17 2023-04-18 Configuration method for describing stream statistics operation mode
CN202310840778.2A Pending CN118467582A (en) 2022-10-17 2023-07-10 General stream big data statistics system

Country Status (1)

Country Link
CN (3) CN115510110A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910324A (en) * 2023-07-14 2023-10-20 北京三维天地科技股份有限公司 Visual report configuration method and system for experimental big data
CN116910324B (en) * 2023-07-14 2024-02-06 北京三维天地科技股份有限公司 Visual report configuration method and system for experimental big data
CN118095444A (en) * 2024-04-23 2024-05-28 创新奇智(青岛)科技有限公司 Optimization method and device for large model reasoning, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN118467582A (en) 2024-08-09
CN116561196A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN115510110A (en) Universal and reusable stream type big data statistics realization method and system
US8566527B2 (en) System and method for usage analyzer of subscriber access to communications network
CN113312376B (en) Method and terminal for real-time processing and analysis of Nginx logs
CN107346270B (en) Method and system for real-time computation based radix estimation
CN110442602B (en) Data query method, device, server and storage medium
CN109039817B (en) Information processing method, device, equipment and medium for flow monitoring
CN105405070A (en) Distributed memory power grid system construction method
CN113448812A (en) Monitoring alarm method and device under micro-service scene
CN112559634A (en) Big data management system based on computer cloud computing
CN113141410A (en) Dynamically adjusted QPS control method, system, device and storage medium
CN115168400A (en) External data management system and method
CN107609172A (en) A kind of cross-system multi-dimensional data search processing method and device
WO2024152746A1 (en) Nginx log compression and analysis method and device, and readable storage medium
CN112711614B (en) Service data management method and device
CN106557483B (en) Data processing method, data query method, data processing equipment and data query equipment
CN114596046A (en) Integrated platform based on unified digital model of business center station and data center station
CN114356712A (en) Data processing method, device, equipment, readable storage medium and program product
CN111049898A (en) Method and system for realizing cross-domain architecture of computing cluster resources
EP2770447B1 (en) Data processing method, computational node and system
CN115599871A (en) Lake and bin integrated data processing system and method
CN115098542A (en) Flow type big data frequency division pre-polymerization and query method
CN114090686A (en) Account-out acceleration method and device
CN113626516A (en) Data increment synchronization method and system
CN113810231A (en) Log analysis method, system, electronic equipment and storage medium
CN111125161A (en) Real-time data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20221223
