CN117312375A - Real-time business data stream analysis processing method and system based on clickhouse - Google Patents


Publication number
CN117312375A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210696796.3A
Other languages
Chinese (zh)
Inventor
陈均
黎君
李宁波
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210696796.3A priority Critical patent/CN117312375A/en
Publication of CN117312375A publication Critical patent/CN117312375A/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/24568: Data stream processing; Continuous queries
    • G06F 16/24552: Database cache management
    • G06F 16/25: Integrating or interfacing systems involving database management systems
    • G06F 16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a clickhouse-based real-time business data stream analysis processing method and a clickhouse-based real-time business data stream analysis processing system. The method comprises the following steps: performing data parsing and filtering on the real-time service data stream in each real-time data source to obtain structured data to be analyzed; performing data accuracy check and data attribute check on the structured data to obtain structured real-time data passing the checks; and importing the structured real-time data into a library table of the self-constructed clickhouse cluster. Data analysis processing is then performed on the structured real-time data in the clickhouse cluster by a target processing engine matched with the structured real-time data to generate corresponding data analysis results, and each data analysis result is respectively stored, according to the data application scene, into the target cache corresponding to the clickhouse cluster. The method ensures the consistency and accuracy of the structured real-time data, separates the data analysis processing (computation) from the storage of the data analysis results, reduces the development cost of the processing procedure, and improves the accuracy of the data analysis results.

Description

Real-time business data stream analysis processing method and system based on clickhouse
Technical Field
The application relates to the technical field of big data processing, in particular to a clickhouse-based real-time business data stream analysis processing method and a clickhouse-based real-time business data stream analysis processing system.
Background
With the development of big data processing technology and the wide application of various applications, platforms and services, massive real-time service data are produced. To make effective use of the service data, for example to update and improve the corresponding application or service according to it, the massive real-time service data need to be put into storage, so that all parties of the platform can conveniently perform subsequent data analysis or data query.
Conventionally, when a large amount of real-time service data is put into storage, a large number of message queues are consumed to complete the warehousing operation. When data analysis or query functions are then provided to other users, analysis must be performed over the massive data, which occupies a large amount of memory resources, incurs high development cost, and makes query and analysis slow; moreover, erroneous source data or data deviations introduced in different warehousing links easily make the data analysis results inaccurate.
Disclosure of Invention
Based on this, there is a need to provide a clickhouse-based real-time business data stream analysis processing method, system, computer device, computer readable storage medium and computer program product, which can reduce development cost in the process of analyzing and processing mass data, while guaranteeing accuracy of data analysis results.
In a first aspect, the present application provides a clickhouse-based real-time business data stream analysis processing method. The method comprises the following steps:
acquiring real-time service data streams in each real-time data source, and carrying out data analysis and filtering on each real-time service data stream to obtain structured data to be analyzed;
performing data accuracy check and data attribute check on the structured data to obtain structured real-time data passing the check, and importing the structured real-time data into a library table of a self-constructed clickhouse cluster;
performing data analysis processing on the structured real-time data in the clickhouse cluster by using a target processing engine matched with the structured real-time data to generate a corresponding data analysis result; the target processing engine is determined according to a data application scene corresponding to the structured real-time data;
storing each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scene; and the target cache is matched with the data application scene.
In one embodiment, before the target processing engine matched with the structured real-time data is used to perform data analysis processing on the structured real-time data in the clickhouse cluster, the method further includes:
receiving a data query request, and acquiring the object permission and the data application scene carried by the data query request;
matching a corresponding data query interface according to the object permission and the data application scene; the data query interface is used for accessing a data analysis result corresponding to the data query request;
and calling a target processing engine matched with the structured real-time data to execute a distributed timing task associated with the data application scene.
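The interface-matching step above can be sketched as a lookup keyed by the permission/scene pair carried in the request. This is a hedged illustration only: the permission names, scene names, and interface paths below are hypothetical, not from the patent.

```python
# Hypothetical mapping from (object permission, data application scene) to the
# data query interface that serves the matching data analysis result.
INTERFACES = {
    ("advertiser", "realtime_dashboard"): "/api/v1/ad_metrics/realtime",
    ("advertiser", "funnel_analysis"): "/api/v1/ad_metrics/funnel",
    ("platform", "activity_observation"): "/api/v1/platform/user_activity",
}

def match_query_interface(object_permission, scene):
    """Return the query interface for this permission/scene pair, if allowed."""
    try:
        return INTERFACES[(object_permission, scene)]
    except KeyError:
        # the object has no right to query this scene
        raise PermissionError(
            f"object {object_permission!r} cannot query scene {scene!r}"
        )

iface = match_query_interface("advertiser", "funnel_analysis")
```

A request whose pair is absent from the mapping is rejected, which mirrors the idea that data types accessible to different objects differ.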
In one embodiment, performing data analysis processing on the structured real-time data in the clickhouse cluster by the target processing engine matched with the structured real-time data to generate a corresponding data analysis result includes:
executing, according to the target processing engine matched with the structured real-time data, a target script file corresponding to the distributed timing task;
performing, by executing the target script file, data analysis processing on the structured real-time data in the clickhouse cluster to obtain a corresponding data analysis result.
In one embodiment, after storing each data analysis result in each target cache corresponding to the clickhouse cluster according to the data application scenario, the method further includes:
accessing a target cache corresponding to the data application scene based on the data query interface;
and acquiring a data analysis result stored in the target cache, and feeding back the data analysis result to a target object corresponding to the data query request.
In one embodiment, importing the structured real-time data into a library table of a self-built clickhouse cluster includes:
determining a target processing engine matched with the structured real-time data from a table engine corresponding to the clickhouse cluster based on a data application scene of the structured real-time data;
establishing a library table corresponding to the structured real-time data in the clickhouse cluster according to the service data characteristics corresponding to the target processing engine and the structured real-time data, and storing the structured real-time data into the library table; when a library table corresponding to the structured real-time data is established, determining a primary key, a partition and a data storage period in the library table according to the structured real-time data.
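The library-table step above fixes a primary key, a partition, and a data storage period at table-creation time. As a hedged sketch (the patent does not give a concrete schema; the table name, columns, and 30-day period below are illustrative assumptions), the DDL for such a table can be composed like this:

```python
# Illustrative sketch: composing the CREATE TABLE statement for a library
# table in a clickhouse cluster, with primary key (ORDER BY), partition
# (PARTITION BY) and storage period (TTL). All names are hypothetical.
def build_library_table_ddl(table, columns, primary_key, partition_expr, ttl_days):
    """Compose a MergeTree DDL string with primary key, partition and TTL."""
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n    {cols}\n) "
        f"ENGINE = MergeTree "
        f"PARTITION BY {partition_expr} "
        f"ORDER BY ({', '.join(primary_key)}) "
        f"TTL event_time + INTERVAL {ttl_days} DAY"
    )

ddl = build_library_table_ddl(
    table="ad_events",
    columns=[("event_time", "DateTime"), ("ad_id", "UInt64"), ("action", "String")],
    primary_key=["ad_id", "event_time"],
    partition_expr="toYYYYMMDD(event_time)",
    ttl_days=30,
)
```

Partitioning by day and expiring rows via TTL matches the idea of a per-table data storage period; the actual engine and columns would follow the service data characteristics of the matched target processing engine.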
In one embodiment, performing data accuracy check and data attribute check on the structured data to obtain structured real-time data passing the check, including:
acquiring data attributes of each tuple in the structured data to be analyzed, and acquiring different data attribute sets corresponding to each tuple in the structured data according to each data attribute; the data attributes among the data attribute sets are independent or related to each other;
sequentially checking the data accuracy of the data attribute sets corresponding to each tuple to obtain first structured data passing through the data accuracy check;
and determining the target attribute of the first structured data according to the data application scene, and carrying out probability distribution verification of the data attribute based on the target attribute of the first structured data to obtain structured real-time data passing the probability distribution verification.
In one embodiment, the sequentially performing data accuracy verification on the data attribute set corresponding to each tuple to obtain first structured data passing the data accuracy verification, including:
carrying out Cartesian product calculation on the data attribute set corresponding to each tuple to obtain a corresponding possible attribute set;
acquiring an expected attribute set and an actual attribute set corresponding to the data attribute set;
and carrying out data accuracy verification based on the possible attribute set, the expected attribute set and the actual attribute set, and obtaining first structured data passing through the data accuracy verification.
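One plausible reading of the three steps above is: the Cartesian product of a tuple's data attribute sets gives every possible attribute combination, and a tuple passes the accuracy check when its actual combinations lie within both the possible set and the expected set. The sketch below illustrates that reading; the attribute values are invented for the example and the exact comparison rule is an assumption.

```python
from itertools import product

# Hedged sketch of the Cartesian-product data accuracy check: build the
# possible attribute set from the per-tuple attribute value sets, then test
# the actual attribute set against the possible and expected sets.
def accuracy_check(attribute_sets, expected_set, actual_set):
    """True when every actual combination is both possible and expected."""
    possible_set = set(product(*attribute_sets))  # all possible combinations
    return actual_set <= possible_set and actual_set <= expected_set

attribute_sets = [{"click", "view"}, {"ios", "android"}]
expected = {("click", "ios"), ("view", "ios"), ("click", "android")}
ok = accuracy_check(attribute_sets, expected, {("click", "ios")})
bad = accuracy_check(attribute_sets, expected, {("view", "android")})
```

Here ("view", "android") is a possible combination but not an expected one, so the second check fails, which is the kind of inconsistency the verification is meant to surface.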
In one embodiment, the determining the target attribute of the first structured data according to the data application scenario, and performing probability distribution verification of the data attribute based on the target attribute of the first structured data, to obtain structured real-time data that passes the probability distribution verification, includes:
determining a target attribute of the first structured data according to the data application scene;
sampling the first structured data based on the data application scene and the data scale of the real-time service data stream, and extracting a preset number of rows of data;
acquiring attribute classifications corresponding to the target attributes, and acquiring actual observation times corresponding to each attribute classification;
determining a theoretical expected number of times corresponding to each attribute classification, based on a preset probability value that an extracted row of data falls into the corresponding attribute classification and the preset number of extracted rows;
determining a corresponding chi-square value according to the theoretical expected times and the actual observation times corresponding to each attribute classification;
and carrying out probability distribution verification of data attributes on the chi-square value by utilizing preset confidence data to obtain structured real-time data passing through the probability distribution verification.
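The chi-square steps above reduce to computing the statistic from actual observation times and theoretical expected times per attribute classification, then comparing it with a critical value determined by the preset confidence data. A minimal sketch, with invented counts and probabilities (the 7.815 critical value is the standard table entry for 3 degrees of freedom at 95% confidence):

```python
# Hedged sketch of the probability distribution (chi-square) check on a
# target attribute: sum((observed - expected)^2 / expected) over the
# attribute classifications, then test against a table critical value.
def chi_square_passes(observed, expected, critical_value):
    """True when the chi-square statistic stays under the critical value."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic <= critical_value

# e.g. 1000 sampled rows over four attribute classifications with preset
# probabilities 0.4 / 0.3 / 0.2 / 0.1 (illustrative values)
n = 1000
probabilities = [0.4, 0.3, 0.2, 0.1]
expected = [n * p for p in probabilities]   # theoretical expected times
observed = [396, 310, 205, 89]              # actual observation times
# 7.815: 95%-confidence chi-square critical value for 3 degrees of freedom
passed = chi_square_passes(observed, expected, critical_value=7.815)
```

Data whose observed distribution stays under the critical value passes the verification and counts as structured real-time data; data that exceeds it is treated as deviating from the expected probability distribution.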
In one embodiment, the data application scenario comprises a low frequency query scenario; after the data analysis processing is carried out on the structured real-time data in the clickhouse cluster according to the target processing engine matched with the structured real-time data, the method further comprises the steps of:
and acquiring the data analysis result in real time based on the data query interface.
In one embodiment, the method further comprises:
when a plurality of data query requests exist, determining a threshold value of the number of the target script files which are executed concurrently according to the processing performance of the clickhouse cluster.
In a second aspect, the present application further provides a clickhouse-based real-time business data stream analysis processing system. The system comprises:
the structured data generation module is used for acquiring real-time service data streams in each real-time data source, and carrying out data analysis and filtering on each real-time service data stream to obtain structured data to be analyzed;
the data verification module is used for carrying out data accuracy verification and data attribute verification on the structured data to obtain verified structured real-time data, and importing the structured real-time data into a library table of a self-constructed clickhouse cluster;
the data analysis result generation module is used for carrying out data analysis processing on the structured real-time data in the clickhouse cluster by utilizing a target processing engine matched with the structured real-time data to generate a corresponding data analysis result; the target processing engine is determined according to a data application scene corresponding to the structured real-time data;
the data analysis result storage module is used for storing each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scene; and the target cache is matched with the data application scene.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring real-time service data streams in each real-time data source, and carrying out data analysis and filtering on each real-time service data stream to obtain structured data to be analyzed;
performing data accuracy check and data attribute check on the structured data to obtain structured real-time data passing the check, and importing the structured real-time data into a library table of a self-constructed clickhouse cluster;
performing data analysis processing on the structured real-time data in the clickhouse cluster by using a target processing engine matched with the structured real-time data to generate a corresponding data analysis result; the target processing engine is determined according to a data application scene corresponding to the structured real-time data;
storing each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scene; and the target cache is matched with the data application scene.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring real-time service data streams in each real-time data source, and carrying out data analysis and filtering on each real-time service data stream to obtain structured data to be analyzed;
performing data accuracy check and data attribute check on the structured data to obtain structured real-time data passing the check, and importing the structured real-time data into a library table of a self-constructed clickhouse cluster;
performing data analysis processing on the structured real-time data in the clickhouse cluster by using a target processing engine matched with the structured real-time data to generate a corresponding data analysis result; the target processing engine is determined according to a data application scene corresponding to the structured real-time data;
storing each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scene; and the target cache is matched with the data application scene.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring real-time service data streams in each real-time data source, and carrying out data analysis and filtering on each real-time service data stream to obtain structured data to be analyzed;
performing data accuracy check and data attribute check on the structured data to obtain structured real-time data passing the check, and importing the structured real-time data into a library table of a self-constructed clickhouse cluster;
performing data analysis processing on the structured real-time data in the clickhouse cluster by using a target processing engine matched with the structured real-time data to generate a corresponding data analysis result; the target processing engine is determined according to a data application scene corresponding to the structured real-time data;
storing each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scene; and the target cache is matched with the data application scene.
In the method, the system, the computer equipment, the storage medium and the computer program product for analyzing and processing the real-time business data stream based on the clickhouse, the real-time business data stream in each real-time data source is obtained, and the data analysis and the filtration are carried out on each real-time business data stream to obtain the structured data to be analyzed. And further performing data accuracy check and data attribute check on the structured data to obtain structured real-time data passing through the data attribute check, and importing the structured real-time data into a library table of a self-constructed clickhouse cluster so as to ensure consistency and accuracy of the imported structured real-time data and improve accuracy of a data analysis result obtained subsequently. And further, utilizing a target processing engine matched with the structured real-time data to perform data analysis processing on the structured real-time data in the clickhouse cluster, generating corresponding data analysis results, and respectively storing each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scene. The target processing engine is determined according to the data application scene corresponding to the structured real-time data, and the target cache is matched with the data application scene. The method realizes the separation of data analysis processing and storage of data analysis results, so as to avoid memory occupation of a clickhouse cluster, reduce development cost in the process of mass data processing and analysis, and improve the accuracy of the obtained data analysis results.
Drawings
FIG. 1 is an application environment diagram of a clickhouse-based real-time business data stream analysis processing method in one embodiment;
FIG. 2 is a flow diagram of a clickhouse-based real-time business data stream analysis processing method in one embodiment;
FIG. 3 is a flow diagram of obtaining structured real-time data in one embodiment;
FIG. 4 is a flow chart of a clickhouse-based real-time business data stream analysis processing method in another embodiment;
FIG. 5 is a schematic diagram of a query flow of data analysis results in one embodiment;
FIG. 6 is a diagram of real-time query results for advertisement metrics, in one embodiment;
FIG. 7 is a diagram of user funnel analysis based on advertisement reach in one embodiment;
FIG. 8 is a diagram of user activity observations of a launch platform in one embodiment;
FIG. 9 is a flow diagram of obtaining structured real-time data that passes data attribute verification in one embodiment;
FIG. 10 is a flow chart of a clickhouse-based real-time business data stream analysis processing method in yet another embodiment;
FIG. 11 is a block diagram of a real-time business data stream analysis processing system based on clickhouse in one embodiment;
FIG. 12 is a block diagram of a clickhouse-based real-time business data stream analysis processing system in another embodiment;
FIG. 13 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The method for analyzing and processing the real-time business data stream based on the clickhouse, provided by the embodiment of the application, can be applied to an application environment shown in figure 1, wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process; it may be integrated on the server 104, or located on a cloud or other network server. The server 104 obtains the structured data to be analyzed by acquiring the real-time service data streams in each real-time data source and performing data parsing and filtering on each real-time service data stream. The real-time service data streams in each real-time data source may be stored in the data storage system, or in the local storage of different types of terminals 102. The server 104 performs data accuracy verification and data attribute verification on the structured data to obtain structured real-time data passing the verification, and imports the structured real-time data into a library table of the self-constructed clickhouse cluster, where the clickhouse cluster may be deployed on the server 104. Further, the server 104 performs data analysis processing on the structured real-time data in the clickhouse cluster by using a target processing engine matched with the structured real-time data, generates a corresponding data analysis result, and stores each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scene. The target processing engine is determined according to the data application scene corresponding to the structured real-time data, the target cache is matched with the data application scene, and the target cache may be of different types, such as mysql (a relational database), redis (a non-relational in-memory database), and localcache (a local in-process cache).
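Routing each data analysis result to the cache type matched with its data application scene can be sketched as a small dispatch table. This is a hedged stand-in: plain dicts play the roles of mysql, redis and localcache, and the scene names are invented for illustration.

```python
# Hypothetical sketch: store each data analysis result into the target cache
# matched with its data application scene. Dicts stand in for mysql, redis
# and localcache; scene names and keys are illustrative.
CACHES = {"mysql": {}, "redis": {}, "localcache": {}}

SCENE_TO_CACHE = {
    "report_export": "mysql",        # durable results for relational consumers
    "realtime_dashboard": "redis",   # shared low-latency reads
    "low_frequency_query": "localcache",
}

def store_result(scene, key, analysis_result):
    """Place one analysis result into the cache matched to its scene."""
    cache_name = SCENE_TO_CACHE[scene]
    CACHES[cache_name][key] = analysis_result
    return cache_name

target = store_result("realtime_dashboard", "ad_ctr:20240601", {"ctr": 0.031})
```

Keeping results in these scene-matched caches rather than in the clickhouse cluster itself is what separates the storage of analysis results from the analysis computation, so reads do not occupy cluster memory.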
The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, an intelligent voice interaction device, an internet of things device, or a portable wearable device, where the internet of things devices may be intelligent home appliances (such as intelligent televisions, intelligent air conditioners, intelligent refrigerators), intelligent sound boxes, intelligent vehicle devices, and the like. The portable wearable device may be a smart watch, a smart bracelet, a headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
The embodiments of the invention can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, big data processing, intelligent transportation, assisted driving and the like. Taking an advertisement information pushing scenario within big data processing as an example: a large amount of service data is generated in real time during the pushing and delivery of advertisement information, and factors such as continuous change of the advertisement delivery service or defects in the service management logic can cause the service data to be distorted, erroneous, or inconsistent. By applying data processing operations such as parsing, filtering, data accuracy checking, and probability distribution checking based on data attributes to the service data streams generated in real time during delivery, the accuracy and consistency of the verified real-time data are ensured, the problem of erroneous data reducing the accuracy of data analysis results is avoided, and the accuracy of the data analysis results provided at query time is therefore guaranteed.
Further, the different objects involved in the pushing and delivery of advertisement information, such as the advertiser and the advertisement delivery platform, care about different data types, and the data types that different objects can access and query also differ. For example, the data analysis results that an advertiser can query include: the target delivery quantity of the advertisement; the arrival rate of the advertisement on the delivery platform; the exposure rate, click rate and purchase rate of the promoted products; and the actual conversion rate at each layer obtained through user funnel analysis, such as the conversion rate between the target delivery quantity and the arrival quantity, or between the arrival quantity and the exposure quantity. Based on the queried data analysis results, problems or faults possibly existing in the advertisement delivery process can then be evaluated and predicted, such as the quality of the advertisement materials themselves, or whether the delivery platform has interception errors, so that problems in the advertisement delivery process can be accurately located and adjusted or improved in time, achieving the purpose of delivering advertisement push information effectively.
In one embodiment, as shown in fig. 2, a clickhouse-based real-time service data stream analysis processing method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
step S202, obtaining real-time service data streams in each real-time data source, and carrying out data analysis and filtering on each real-time service data stream to obtain structured data to be analyzed.
The real-time data source represents a data source from which a service data stream can be generated and reported in real time. For different objects that need to perform data parsing and filtering operations on the data, such as an advertiser or an advertisement delivery platform, the corresponding real-time data sources differ. The advertisement delivery platform may be any of various application platforms, websites, application programs and the like; application programs of different purposes and types, including instant messaging software, audio and video software, shopping software, games and the like, can all serve as advertisement delivery platforms.
For example, the real-time data sources may be of many different types, such as Kafka (a distributed, partitioned, multi-replica, zookeeper-coordinated messaging system for processing large amounts of data in real time to meet various demand scenarios), RabbitMQ (open source message broker software implementing the Advanced Message Queuing Protocol), Pulsar (message middleware with high throughput, low latency, compute-storage separation, multi-tenancy, and geo-replication capabilities), and NiFi (a streaming data processing and distribution system).
Specifically, as required by different service scenarios, each real-time data source continuously generates various types of service data streams, which may include unstructured data, semi-structured data, structured data and other types of data. During data production, influence factors such as service changes, logic defects, lack of standardization and unknown indexes can cause the data to be distorted, misaligned, erroneous, or inconsistent, and subsequent analysis based on such problematic service data streams directly affects the accuracy of the data analysis results. The real-time service data streams in each real-time data source therefore need to be parsed and filtered, so that invalid data are filtered out and data of different types are unified in format and extracted into structured data, finally obtaining the structured data to be analyzed. For example, json data (a lightweight data exchange format) may be parsed into fixed-field data using a json-schema (a tool for validating json data formats).
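As a minimal stand-in for that json-schema step (real json-schema tooling offers much more; the field names and types below are invented for illustration), a raw message can be parsed and kept only when it matches a fixed schema of field names and types:

```python
import json

# Hedged sketch: parse one raw message and keep only records whose fields
# match a fixed (field name -> type) schema. Field names are hypothetical.
SCHEMA = {"ad_id": int, "action": str, "ts": int}

def parse_record(raw):
    """Return the structured record, or None when it fails the schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # unparseable data is filtered out
    if set(obj) != set(SCHEMA):
        return None  # missing or extra fields
    if any(not isinstance(obj[k], t) for k, t in SCHEMA.items()):
        return None  # wrong field type
    return obj

good = parse_record('{"ad_id": 7, "action": "click", "ts": 1718000000}')
bad = parse_record('{"ad_id": "7", "action": "click"}')
```

Records that survive this filter are uniform in shape and type, which is what makes them structured data ready for the later accuracy and attribute checks.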
Structured data is represented and stored in a relational database and appears as two-dimensional data; its characteristics are that each row represents the information of one entity and the attributes of every row are the same. Semi-structured data does not conform to the data model of a relational database or other data tables, but contains tags that separate semantic elements and stratify records and fields; it is also known as a self-describing structure, such as the XML format or the json format. Unstructured data represents data without a fixed structure, such as documents, pictures, video or audio.
Further, the method may specifically access the different real-time data sources based on a Flink engine (a distributed processing engine for processing data streams), and utilize the multiple APIs (application programming interfaces) supported by the Flink engine to parse and filter the real-time service data streams in each real-time data source, so as to obtain the structured data to be analyzed.
The Flink engine may be deployed in a variety of cluster environments, including without limitation YARN clusters (Yet Another Resource Negotiator, a resource manager), Mesos clusters (a universal resource management platform for managing computing resources), Kubernetes clusters (a container cluster management platform), bare-metal clusters, and the like.
Step S204, data accuracy check and data attribute check are carried out on the structured data to obtain structured real-time data passing the data attribute check, and the structured real-time data is imported into a library table of the self-constructed clickhouse cluster.
The data accuracy check may specifically be a Cartesian product check, and the data attribute check may specifically be a chi-square check; the structured real-time data passing both the data accuracy check and the data attribute check is finally stored, i.e., imported into a library table of the self-constructed clickhouse cluster. The data attribute check may also be a T-test (the Student's t-test, used to test the degree of difference between two means on small samples, e.g., a sample size less than 30), a Z-test (used to test the degree of difference between means on large samples, e.g., a sample size greater than 30, which specifically uses the theory of the standard normal distribution to infer the probability that the difference occurs and thus compare whether the difference between two means is significant), an F-test (a joint hypothesis test, also called a variance ratio test or homogeneity-of-variance test, used in statistical models with more than one parameter to determine whether all or part of the parameters in the model are suitable), stratified sampling, or a sampling survey.
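For reference, the chi-square check above is based on the standard chi-square statistic (a textbook formula, not specific to this embodiment), which compares the observed count of each attribute-value category with its expected count under the assumed probability distribution:

```latex
\chi^2 \;=\; \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = N \, p_i
```

Here $O_i$ is the observed frequency of the $i$-th category of the checked data attribute, $p_i$ is its probability under the assumed distribution, $N$ is the total number of samples, and a large $\chi^2$ value indicates that the attribute deviates from the expected distribution.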
Here clickhouse represents a columnar database management system for online analytical processing, and a clickhouse cluster represents a physical cluster or a logical cluster comprising a plurality of clickhouse nodes. Each clickhouse node in a physical cluster is managed by the same zookeeper cluster, and the various DDL (data definition language) operations on the data are valid for the whole cluster. A logical cluster comprises physical clusters that have no fixed physical relationship; for example, three mutually independent physical clusters may form one logical cluster, and the data operations of one physical cluster cannot be perceived by the other two, but for the whole logical cluster, a data change in any one physical cluster can be obtained by query.
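As a hedged sketch of the cluster-wide DDL behavior described above (the cluster name, database and table schema are illustrative assumptions, not taken from the embodiment), a statement issued with the ON CLUSTER clause is propagated through zookeeper to every clickhouse node of the physical cluster:

```sql
-- Hypothetical physical cluster 'ad_physical_cluster'; the DDL takes
-- effect on every clickhouse node managed by the same zookeeper cluster.
CREATE TABLE IF NOT EXISTS ads.click_log ON CLUSTER ad_physical_cluster
(
    ad_id      UInt64,
    user_id    UInt64,
    op_label   LowCardinality(String),  -- e.g. view / click / download
    event_time DateTime
)
ENGINE = MergeTree()
ORDER BY (ad_id, event_time);
```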
Specifically, the data attributes of each tuple in the structured data are obtained, the data attribute set corresponding to each tuple is obtained from these data attributes, and data accuracy verification is then performed in turn on the data attribute set corresponding to each tuple, so as to obtain the first structured data passing the data accuracy verification.
For example, taking advertisement delivery as an example, the data attributes of the advertisement delivery data may include advertisement slot, target delivery amount, arrival rate, exposure rate, click rate, purchase rate, and the like. For the data attribute set of each tuple in the advertisement delivery data, the accuracy of each tuple is checked in turn to determine whether erroneous data exists; if erroneous data exists, it needs to be fed back to the data provider, such as an advertiser, a delivery platform, or another data provider, so as to correct or delete the erroneous data and obtain the first structured data passing the data accuracy check. After the erroneous data is corrected, it can be included in the data accuracy verification of the next period or the next batch, while structured data that has already passed the data accuracy verification is not checked again, thereby avoiding repeated verification operations and reducing resource occupation.
Further, the target attribute of the first structured data is determined according to the data application scene, and a probability distribution verification of the data attribute is performed based on the target attribute of the first structured data, so as to obtain structured real-time data passing the probability distribution verification. The data application scene represents the specific scene in which an object queries the data analysis result, which may be a high concurrency query scene or a low frequency query scene; specifically, whether the data application scene is a high concurrency query scene or a low frequency query scene can be judged according to performance indexes such as query volume, latency requirement, concurrency number, throughput, thread number, transactions per second, and response time.
For example, the high concurrency query scene may be a massive online query service (such as a flash-sale activity of a shopping platform, a holiday ticket-grabbing service, or hot news of an information platform), while the low frequency query scene may be automatic queries by robots, website queries, a data visualization display scene of a data dashboard (i.e., Dashboard, a business intelligence dashboard used to implement data visualization, which can present metric information and key business indicator status information to an enterprise), that is, a scene in which data is presented in a visual form such as a chart or a map, or an information display scene of a BI tool (i.e., a business intelligence analysis tool, used to convert complex business data into simple and intuitive information for presentation).
In one embodiment, because the query volume, concurrency number and latency requirements under different data application scenes are not fixed, data application scenes such as robot automatic queries or website queries are classified as high concurrency query scenes when their query volume and concurrency number exceed the query limits of the clickhouse cluster, that is, when the data analysis result cannot be obtained by directly querying the clickhouse cluster. Similarly, the visual display scene of the data dashboard and the information display scene of the BI tool can be classified as high concurrency query scenes according to performance indexes such as query volume or concurrency number, and are not limited to being low frequency query scenes.
For example, taking a massive online query service such as a flash-sale activity of a shopping platform as an example, the target attribute of the current data application scene may be the click rate or the purchase rate of a product; the click rate or the purchase rate of the product is then used as the target attribute of the first structured data, and data attribute verification is performed based on the target attribute to obtain structured real-time data passing the data attribute verification.
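A minimal sketch of computing such a target attribute, assuming a hypothetical stream table ads.click_log whose op_label column records 'exposure' and 'click' events (table and label names are illustrative, not from the embodiment), could derive the click rate per advertisement directly in clickhouse:

```sql
-- Click rate per advertisement: clicks / exposures.
SELECT
    ad_id,
    countIf(op_label = 'click')    AS clicks,
    countIf(op_label = 'exposure') AS exposures,
    clicks / exposures             AS click_rate
FROM ads.click_log
GROUP BY ad_id
ORDER BY click_rate DESC;
```

The resulting per-advertisement click_rate column is then the target attribute on which the probability distribution verification described above can be performed.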
In one embodiment, as shown in fig. 3, a manner of obtaining structured real-time data is provided. Referring to fig. 3, each real-time data source, such as data source 1, data source 2, … …, and data source N, is accessed to obtain the real-time service data stream in each real-time data source, and data cleansing, including data parsing and data filtering, is then performed on each real-time service data stream by the StreamETL application in the Flink engine to obtain the structured data to be analyzed. Data accuracy check and data attribute check are further performed on the structured data to obtain structured real-time data passing the checks. The data accuracy check and the data attribute check may specifically be a Cartesian product check and a chi-square check, and the structured real-time data passing the checks is finally stored, specifically imported into a library table of the self-constructed clickhouse cluster for storage.
In one embodiment, importing structured real-time data into a constructed clickhouse cluster includes:
determining a target processing engine matched with the structured real-time data from a table engine corresponding to the clickhouse cluster based on a data application scene of the structured real-time data;
establishing, according to the target processing engine and the service data characteristics corresponding to the structured real-time data, a library table corresponding to the structured real-time data in the clickhouse cluster, and storing the structured real-time data into the library table; when the library table corresponding to the structured real-time data is established, the primary key, the partition and the data storage period in the library table are determined according to the structured real-time data.
Specifically, different data application scenes of the structured real-time data match different target processing engines; that is, different data application scenes have different corresponding data sizes, data processing requirements and the like. Accordingly, the target processing engine matched with the structured real-time data is determined in real time, from the table engines corresponding to the clickhouse cluster, according to the data application scene of the structured real-time data, and the matched structured real-time data is analyzed and processed by the target processing engine. clickhouse can support a plurality of different table engines, which can be divided into different categories according to purpose, each category comprising a plurality of different processing engines.
Further, a database is created in the clickhouse cluster, and a library table corresponding to the structured real-time data is built in the clickhouse cluster according to the target processing engine and the service data characteristics corresponding to the structured real-time data, so that the structured real-time data is stored in the corresponding library table. The service data features corresponding to the structured real-time data may include different data features such as a service type, an actual service content, an application platform to which the service belongs, and a service application range corresponding to the structured real-time data.
When the library table corresponding to the structured real-time data is established, the primary key, the partition and the data storage period in the library table are determined according to the structured real-time data. Taking advertisement stream data as an example, the primary key may include an advertisement identifier, a time window, a user identifier, an operation label, and the like. For example, different operations of a user correspond to different stream data; when statistics and analysis of the stream data are performed based on the user identifier, the specific operations performed by the user are distinguished by different labels, such as different operation types of viewing, clicking, downloading, and the like.
The partition represents the storage partition of the structured real-time data in the clickhouse cluster, and the data storage period indicates how long the corresponding structured real-time data is retained in the clickhouse cluster; when the corresponding data storage period is exceeded, the corresponding structured real-time data can be deleted.
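The determination of primary key, partition and data storage period described above may be sketched as follows (the table name, columns and the 90-day period are illustrative assumptions):

```sql
CREATE TABLE ads.delivery_stream
(
    ad_id       UInt64,
    time_window DateTime,
    user_id     UInt64,
    op_label    LowCardinality(String)   -- view / click / download ...
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(time_window)     -- one storage partition per day
ORDER BY (ad_id, time_window, user_id)   -- primary key / sort key
TTL time_window + INTERVAL 90 DAY;       -- data storage period: rows removed after 90 days
```

Partitioning by day lets queries restricted to a time range skip whole partitions, while the TTL clause implements the data storage period automatically during background merges.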
Step S206, data analysis processing is performed on the structured real-time data in the clickhouse cluster by a target processing engine matched with the structured real-time data, and a corresponding data analysis result is generated, wherein the target processing engine is determined according to the data application scene corresponding to the structured real-time data.
The processing engines may be the table engines of different purposes supported by clickhouse, wherein the table engines can be divided into different categories according to purpose, each category comprising a plurality of different processing engines.
For example, the table engines supported by clickhouse may include: 1) the MergeTree series, a general-purpose engine family for performing large-data-volume analysis; 2) the Log series, a lightweight engine family for performing small-table analysis; 3) the Integration series, for integration with other data storage and processing systems; 4) the Special series, which may include a variety of special engines for use with a particular function or in a particular context.
The MergeTree series is mainly used for analyzing massive data, supports functions such as data partitioning, ordered storage, primary key indexing, sparse indexing, and data TTL, and also supports all of the clickhouse SQL (structured query language) syntax, including statements implementing different functions such as data query, data update, data deletion, data insertion, database creation, database modification, table creation, table alteration, table deletion, index creation, and index deletion.
Specifically, the corresponding target processing engine is determined according to the data application scene corresponding to the structured real-time data. That is, according to the differences between data application scenes, when the data is further processed, the target processing engine needs to be selected according to requirements such as whether the data in the corresponding data application scene has redundancy, whether the data has duplicates, whether the data needs to be pre-aggregated, and whether security needs to be ensured; meanwhile, from the angle of high availability of the data, whether the data needs replication, i.e., whether the data needs to be backed up through replicas, can also be considered.
Further, the MergeTree series may specifically include the following engines:
1) The MergeTree engine, whose specific application scene is as follows: supporting data partitioning, ordered storage, primary key indexing, sparse indexing, data TTL (i.e., Time To Live, used to specify the life cycle of data; when the data age expires, the piece of data is automatically deleted), and the like; 2) the ReplacingMergeTree engine, used to solve the problem that rows with the same primary key cannot be deduplicated in the MergeTree engine, its main effect being deduplication; 3) the CollapsingMergeTree engine, used to implement asynchronous deletion (or collapsing); 4) the VersionedCollapsingMergeTree engine, which uses a version column to assist in correctly deleting duplicate rows; 5) the SummingMergeTree engine, used to sum multiple rows with the same primary key and replace the multiple rows having the same primary key with one row of summed data.
For example, taking advertisement delivery as an example, when the advertisement stream data in the advertisement delivery process is processed, a ReplicatedReplacingMergeTree engine is adopted: the data is asynchronously synchronized to other replica tables through zookeeper (distributed application coordination service software), which performs replica election, thereby realizing data backup, and meanwhile the deduplication characteristic of ReplacingMergeTree is utilized to solve the problem of repeated data writing.
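A hedged sketch of this engine choice (the zookeeper path, the {shard}/{replica} macros and the schema are illustrative assumptions) might look like:

```sql
-- Replicated variant: replicas coordinate through zookeeper, giving data
-- backup; ReplacingMergeTree semantics deduplicate rows with the same
-- sorting key during background merges, tolerating repeated writes.
CREATE TABLE ads.delivery_stream_replicated
(
    ad_id       UInt64,
    time_window DateTime,
    user_id     UInt64,
    op_label    String
)
ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/{shard}/ads/delivery_stream', '{replica}')
PARTITION BY toYYYYMMDD(time_window)
ORDER BY (ad_id, time_window, user_id);
```

Note that deduplication happens at merge time, so duplicates may be visible until a background merge occurs (or a query uses the FINAL modifier).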
Step S208, storing the data analysis results into target caches corresponding to the clickhouse clusters according to the data application scene, wherein the target caches are matched with the data application scene.
Specifically, the target cache may be a cache of a different type corresponding to the clickhouse cluster, such as mysql (a relational database), redis (a non-relational database), or localcache (a local cache). According to the different data application scenes, such as a high concurrency query scene, a low frequency query scene, or a data visualization scene, the data analysis results are respectively stored into the different target caches corresponding to the clickhouse cluster.
Further, for massive high-concurrency online queries, the asynchronously updated data analysis results can be stored in a redis cache, and massive high-concurrency queries can be supported through the redis cache. In some application scenes with only low frequency queries, the clickhouse cluster can be queried directly; when the clickhouse cluster is queried directly, the target processing engine is called to perform data analysis processing on the structured real-time data in the clickhouse cluster, and the corresponding data analysis result is obtained in real time and fed back or displayed.
That is, when the mode of directly querying the clickhouse cluster is adopted, each query means that one round of data analysis processing needs to be executed to obtain the corresponding data analysis result in real time. If a large number of query demands exist, the clickhouse cluster resources are heavily consumed and queries time out; therefore, when a large number of query demands occur, the data analysis results need to be stored in the target cache corresponding to the clickhouse cluster, and data query efficiency is improved by accessing the target cache, thereby avoiding a large number of queries against the clickhouse cluster.
Specifically, whether the data analysis result is written into the target cache in an overwriting or incremental manner may be determined according to whether historical data needs to be saved: if the historical data does not need to be saved, it is directly overwritten; if the historical data needs to be saved, new cache space needs to be occupied. By separating the data analysis process from the storage of the data analysis results and storing the data analysis results in the target cache corresponding to the clickhouse cluster, even if the clickhouse cluster fails, the historically updated execution results are guaranteed to remain available in the target cache for query.
In the above clickhouse-based real-time service data stream analysis processing method, the real-time service data stream in each real-time data source is obtained, and each real-time service data stream is parsed and filtered to obtain the structured data to be analyzed. Data accuracy check and data attribute check are then performed on the structured data to obtain structured real-time data passing the data attribute check, and the structured real-time data is imported into a library table of the self-constructed clickhouse cluster, so as to ensure the consistency and accuracy of the imported structured real-time data and improve the accuracy of the subsequently obtained data analysis results. Data analysis processing is further performed on the structured real-time data in the clickhouse cluster by the target processing engine matched with the structured real-time data, corresponding data analysis results are generated, and each data analysis result is stored, according to the data application scene, into the corresponding target cache of the clickhouse cluster. The target processing engine is determined according to the data application scene corresponding to the structured real-time data, and the target cache is matched with the data application scene. The method separates the data analysis processing from the storage of the data analysis results, so as to avoid occupying the memory of the clickhouse cluster, reduce the development cost of massive data processing and analysis, and improve the accuracy of the obtained data analysis results.
In one embodiment, as shown in fig. 4, there is provided a clickhouse-based real-time service data stream analysis processing method, which as can be seen with reference to fig. 4, includes:
step S402, a data query request is received, and the object rights carried by the data query request and the data application scene are acquired.
Specifically, data query requests initiated by different objects through different paths or channels are received, and the object rights carried by each data query request and the specific data application scene of the data to be queried are acquired.
Taking advertisement delivery as an example, the object performing the data query may be an advertiser or a delivery platform, and correspondingly the object rights of the advertiser and of the delivery platform differ. For example, under the object rights of the advertiser, the queryable data analysis results include data of different stages such as the package volume of the advertisement, the target delivery amount of the advertisement, the arrival rate of the advertisement on the delivery platform, the exposure rate, the click rate, and the purchase rate of the promoted product; under the object rights of the delivery platform side, the queryable data analysis results include different data such as the number of registered users of the platform, the number of active users of the platform, the number of invalid users of the platform, the arrival amount of the advertisement, the exposure rate of the advertisement, the click rate, and the purchase rate of the promoted product.
Further, the types of data analysis results required by the query differ according to the object rights of the object initiating the data query request, and the specific data application scenes of the different types of data analysis results also differ. The data application scenes of the data analysis results can be distinguished according to performance indexes such as query volume, latency requirement, concurrency number, throughput, thread number, transactions per second, and response time, for example into different types such as a high concurrency query scene and a low frequency query scene, each of which may comprise different specific data application scenes.
For example, the corresponding data application scenes may also differ according to the query channel or query path, and include for example massive online queries (such as a flash-sale activity of a shopping platform, a holiday ticket-grabbing service, or hot news of an information platform), robot automatic queries (such as intelligent analysis robots of different application programs, used to call related website content in real time according to user requirements for analysis and obtain the corresponding query result), website queries, a data visualization display scene of a data dashboard, an information display scene of a BI tool (i.e., a business intelligence analysis tool), and the like.
When scene division is performed on the massive online query service, robot automatic queries, website queries, the data visualization display scene of the data dashboard, the information display scene of the BI tool and the like, the massive online query service is classified as a high concurrency query scene, while the other data application scenes, including robot automatic queries, website queries, the data visualization display scene of the data dashboard and the information display scene of the BI tool, are classified as low frequency query scenes.
Likewise, when the query volume or the concurrency number exceeds the query limit of the clickhouse cluster, that is, when the data analysis result cannot be obtained by directly querying the clickhouse cluster, the robot automatic queries, website queries, the data visualization display scene of the data dashboard, the information display scene of the BI tool and the like can also be classified as high concurrency query scenes, and are not limited to being low frequency query scenes.
Step S404, matching corresponding data query interfaces according to the object rights and the data application scene.
Specifically, the data query interface is configured to access the data analysis result corresponding to the data query request; that is, different data query interfaces are matched according to the different object rights of the object initiating the data query request and the different data application scenes of the data analysis results to be queried, and the data analysis results at different storage locations are then accessed through the data query interfaces.
The storage locations may be the different target caches corresponding to the clickhouse cluster, such as mysql (a relational database), redis (a non-relational database), or a local cache, and may also be the clickhouse cluster itself.
For example, the storage location of the data analysis result is determined according to the data application scene, such as a high concurrency query scene or a low frequency query scene. That is, if the data application scene is a high concurrency query scene, the corresponding data analysis result may be stored into a target cache (such as redis) corresponding to the clickhouse cluster, so as to separate the analysis process from the data analysis result and avoid the problem that a large number of concurrent queries occupy a large amount of clickhouse cluster memory and cause excessive resource consumption.
Similarly, if only low frequency queries are performed and the clickhouse cluster can meet the query volume and time-consumption requirements of the query scene, the target processing engine can be called directly to perform data analysis processing on the structured real-time data in the clickhouse cluster, and the corresponding data analysis result is obtained in real time and fed back or displayed. That is, in the low frequency query scene, when the mode of directly querying the clickhouse cluster is adopted, each query means that data analysis processing is executed once to obtain the corresponding data analysis processing result in real time.
Step S406, a target processing engine matched with the structured real-time data is called to execute the distributed timing task associated with the data application scene.
The distributed timing task is used to trigger execution of a corresponding SQL (structured query language) script, and the SQL script comprises SQL statements of one or more SQL commands, such as statements for different commands including data query, data update, data deletion, data insertion, database creation, and database modification. The SQL statements are constructed on the library tables of the clickhouse cluster and are specifically used for realizing data statistics and analysis.
Specifically, the distributed timing task is executed by calling the target processing engine matched with the structured real-time data, and the SQL script corresponding to the distributed timing task is executed by the target processing engine, so that the execution result of the SQL script, namely the obtained data analysis result, is stored in the target cache corresponding to the clickhouse cluster, thereby avoiding the resource overhead and query timeout problems caused by repeatedly querying clickhouse.
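Under illustrative assumptions (a source table ads.click_log and a pre-created result table ads.click_rate_result with matching columns, neither taken from the embodiment), the SQL script triggered by such a timing task might pre-aggregate the latest window into the result table, whose contents are then pushed to the target cache:

```sql
-- Executed periodically by the distributed timing task; the result table
-- feeds the target cache, so online queries never hit the raw stream table.
INSERT INTO ads.click_rate_result
SELECT
    ad_id,
    toStartOfHour(event_time)      AS time_window,
    countIf(op_label = 'click')    AS clicks,
    countIf(op_label = 'exposure') AS exposures
FROM ads.click_log
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY ad_id, time_window;
```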
Further, different object authorities and data application scenes are matched with different data query interfaces, and further, a target processing engine matched with the structured real-time data can be called by executing a distributed timing task associated with the corresponding data application scene, and further, data analysis processing of the structured real-time data in the clickhouse cluster is realized by the target processing engine.
In one embodiment, when there are multiple data query requests, a threshold number of concurrently executing target script files is determined based on the processing performance of the clickhouse cluster.
Specifically, when a plurality of data query requests exist, by controlling the number of SQL scripts (namely target script files) executed during concurrent queries, the problems that the processor becomes unavailable under high load and queries time out, due to too many SQL scripts being executed in parallel by the clickhouse cluster, are avoided.
Further, according to the processing performance of the clickhouse cluster, the threshold value of the number of SQL scripts concurrently executed in the clickhouse cluster is set, for example, the threshold value of the number of script files concurrently executed may be set to be 100, or may be 150, 200 or more different values, that is, the number of the script files may be adjusted and modified according to the actual requirement and the built processing performance of the clickhouse cluster, and the method is not limited to a certain specific value or a certain specific values.
Step S408, executing the target script file corresponding to the distributed timing task according to the target processing engine matched with the structured real-time data.
Specifically, the distributed timing task corresponding to the data application scene is executed by calling the target processing engine matched with the structured real-time data, so that the target script file corresponding to the distributed timing task is executed by the target processing engine.
The target script file may be specifically an SQL script, and performs data analysis processing on the structured real-time data in the clickhouse cluster by executing the target script file, so as to obtain a corresponding data analysis result.
Step S410, performing data analysis processing on the structured real-time data in the clickhouse cluster by executing the target script file to obtain a corresponding data analysis result.
Specifically, the target script file corresponding to the distributed timing task, specifically the corresponding SQL script, is executed by calling the target processing engine matched with the structured real-time data, so that data analysis processing is performed on the structured real-time data in the clickhouse cluster and the corresponding data analysis result is obtained. The SQL script may include SQL statements of one or more SQL commands; keywords in clickhouse SQL statements are not reserved words and are case-insensitive, and clickhouse supports multiple data types such as integers, floating point numbers, strings, dates, enumerated values and arrays, as well as various common functions and aggregate functions.
When the clickhouse cluster needs to be queried many times, query timeouts can occur due to the limits of processor load and disk input/output, affecting data analysis efficiency; the SQL statements can therefore be optimized by tuning, which specifically comprises the following steps:

1) Set indexes and select suitable filtering conditions to reduce the amount of data traversed. 2) Set corresponding storage partitions for different data and set a suitable TTL (data life cycle) to reduce the size of the queried data. 3) In a Join operation (connect operation), place the small table on the right and the large table on the left to reduce the number of lookups against the data. 4) Sort the data when writing, to reduce the influence of unordered data on data merging and compression. 5) Replace the original count function with a more efficient function, for example the uniq function, which has only a small error when performing approximate statistics. 6) Establish a materialized view, so that data aggregation is realized through the materialized view when data is inserted. A materialized view is a database object that contains the results of a query, used as a local copy of remote data or to generate a summary table based on aggregation of data tables; the data stored by a materialized view based on a remote table may also be called a snapshot of the remote data.

Step S412, storing the data analysis results into the target caches corresponding to the clickhouse cluster according to the data application scene.
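Optimization item 6) above, a materialized view that aggregates data at insert time, may be sketched as follows (the view, source table and column names are illustrative assumptions):

```sql
-- Aggregation is performed automatically on every insert into ads.click_log;
-- SummingMergeTree merges the per-block partial sums in the background.
CREATE MATERIALIZED VIEW ads.click_log_daily
ENGINE = SummingMergeTree()
ORDER BY (ad_id, day)
AS SELECT
    ad_id,
    toDate(event_time)             AS day,
    countIf(op_label = 'click')    AS clicks,
    countIf(op_label = 'exposure') AS exposures
FROM ads.click_log
GROUP BY ad_id, day;
```

Queries against ads.click_log_daily then read the pre-aggregated daily rows instead of scanning the raw stream table.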
The data application scene may specifically include different scenes such as massive online query services, automatic robot queries, website queries, data visualization for a data dashboard, and information display for a BI tool. Whether each data application scene is a high-frequency query scene or a low-frequency query scene can further be determined according to performance indexes such as query volume, latency requirements, concurrency, throughput, thread count, transactions per second, and response time.
Further, according to each data application scene, the data analysis results are stored into the target caches corresponding to the clickhouse cluster. The target cache may be a cache of a different type corresponding to the clickhouse cluster, such as mysql (a relational database), redis (a non-relational database), a local cache, and the like. According to the different data application scenes, such as a high-frequency concurrent query scene, a low-frequency query scene and a data visualization scene, the data analysis results are respectively stored into different target caches corresponding to the clickhouse cluster. For example, massive highly concurrent online queries may be served from a redis cache, which can support that level of concurrency.
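A minimal sketch of this scene-to-cache routing idea follows; the scene names and cache labels are illustrative assumptions, not identifiers from the original method.

```python
# Hypothetical mapping from data application scene to target cache type,
# following the text above: high-concurrency scenes go to redis,
# low-frequency scenes to mysql, visualization scenes to a local cache.
SCENE_TO_CACHE = {
    "high_concurrency_query": "redis",
    "low_frequency_query": "mysql",
    "data_visualization": "local_cache",
}

def pick_target_cache(scene: str) -> str:
    """Return the cache a data analysis result should be written to;
    unknown scenes fall back to querying the clickhouse cluster directly."""
    return SCENE_TO_CACHE.get(scene, "clickhouse_direct")
```

A query interface can then read from `pick_target_cache(scene)` without knowing where each result was stored.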
Step S414, based on the data query interface, accesses the target cache corresponding to the data application scenario.
The data query interface is used for accessing the data analysis results corresponding to the data query request. That is, according to the different object rights of the object initiating the data query request and the different data application scenes of the data analysis results to be queried, different data query interfaces are matched, and the data analysis results at different storage locations are then accessed based on these data query interfaces.
Specifically, when the data application scenario is a high concurrency query scenario, a target cache corresponding to the high concurrency query scenario, such as a redis cache or a mysql cache, may be accessed based on the data query interface.
Furthermore, the data application scenario further includes a low-frequency query scenario; that is, if only low-frequency queries are needed in some application scenarios, the clickhouse cluster can also be queried directly. When the clickhouse cluster is queried directly, the target processing engine needs to be called to perform data analysis processing on the structured real-time data in the clickhouse cluster, and the corresponding data analysis result is obtained in real time and fed back or displayed. Real-time data application scenes with different access paths or channels can thus obtain data analysis results either by uniformly querying the target caches or by directly querying the clickhouse cluster, so that the data analysis results are used stably and efficiently in different services.
In one embodiment, as shown in fig. 5, a query flow of data analysis results is provided. Referring to fig. 5, first, by calling the target processing engine matched with the structured real-time data, each distributed timing task in the timing scheduling system is executed, and the target script file corresponding to each distributed timing task is run. By executing the target script file, data analysis processing is performed on the structured real-time data in the clickhouse cluster to obtain corresponding data analysis results. The data analysis results are written into the target caches corresponding to the clickhouse cluster, so that corresponding data query interfaces are respectively provided for different data application scenes based on the unified query service, the target cache corresponding to each data application scene is accessed through the data query interface, and the corresponding data analysis result is obtained from the target cache.
Specifically, the target cache may be different types of caches corresponding to the clickhouse cluster, such as mysql (relational database), redis (non-relational database), localcache (local cache), and the like, where the data application scenario supported by the unified query service includes: different scenes such as massive online query service, robot automatic query, website query, data visualization for a data dashboard, information display for BI tools and the like.
Further, according to performance indexes such as query volume, latency requirement, concurrency, throughput, thread count, transactions per second and response time, the manner of acquiring data analysis results differs across data application scenes. For example, a massive online query service belongs to a high-concurrency query scene and needs to acquire the data analysis results from a target cache, whereas when the clickhouse cluster itself can support the query volume or concurrency of the corresponding scene, the data analysis results are acquired by querying the clickhouse cluster directly.
The timing scheduling system is constructed based on redis, mysql and zookeeper and is used to ensure stable execution of the target script file corresponding to each distributed timing task. That is, a lock is acquired when the script starts to execute, and after execution completes and the data has been updated to the cache-success state, the lock is released, thereby avoiding execution conflicts. For example, a corresponding expiration time, such as 10 s, is set for each target script file; if the lock is still held when the expiration time (i.e., 10 s) within the execution time of the target script file is reached, the lock is released to free the execution resources and provide them to the next target script file. The period for executing the timing scheduling task can be determined according to the update frequency of the data, for example, different time periods such as 10 s, 30 s and 1 min, and can be set according to actual requirements, so that the data is processed in a timely manner.
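A minimal sketch of this lock-with-expiry pattern follows, using an in-memory stand-in that mimics the redis `SET key value NX EX ttl` semantics; the class and key names are illustrative assumptions, not the patent's implementation.

```python
import time

class FakeRedisLock:
    """In-memory stand-in mimicking redis SET NX EX: a key can be acquired
    only if it is absent or its ttl has expired."""
    def __init__(self):
        self._locks = {}  # key -> expiry timestamp

    def acquire(self, key, ttl, now=None):
        now = time.time() if now is None else now
        expiry = self._locks.get(key)
        if expiry is not None and expiry > now:
            return False          # another script still holds the lock
        self._locks[key] = now + ttl
        return True

    def release(self, key):
        self._locks.pop(key, None)

# Acquire a 10-second lock for a hypothetical script, as in the text above.
locks = FakeRedisLock()
locks.acquire("script:ad_stats", ttl=10, now=0.0)
```

A second attempt before the ttl elapses fails, and an attempt after it elapses succeeds, so a crashed script cannot hold the lock forever.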
Step S416, the data analysis result stored in the target cache is obtained, and the data analysis result is fed back to the target object corresponding to the data query request.
Specifically, the data analysis result stored in the target cache is obtained and fed back to the target object that initiated the data query request, where the target object may be an advertiser or the advertisement delivery platform side.
In one embodiment, as shown in fig. 6, a real-time query result diagram of an advertisement indicator is provided, and referring to fig. 6, the advertisement indicator may include a delivery time, a delivery platform (such as different instant messaging platforms, shopping platforms, games, etc.), a scheduling name, a scheduling identifier, a package output amount, a target delivery amount, a delivery mode, an advertisement identifier, an arrival amount, an outer exposure amount, an outer click amount, an inner exposure amount, an inner click amount, an activity exposure amount, etc.
Specifically, when querying each advertisement delivery condition, the query basis may be the advertisement identifier, advertisement slot, advertisement material, bound recommended activity link, and the like. Through different query bases, index data of the delivered advertisement in different dimensions can be obtained for multi-dimensional aggregate display, such as aggregate display of arrival amount, exposure rate, click amount and the like.
For example, objects initiating the query request have different rights, and the data types that can be presented to them differ correspondingly. The data analysis results that an advertiser can query include data of different stages such as the package amount of the advertisement, the target delivery amount of the advertisement, and the arrival rate, exposure rate and click rate of the advertisement on the delivery platform as well as the purchase rate of the activity product. The data analysis results that the delivery platform side can query include different data such as the number of registered users of the platform, the number of active users of the platform, the number of invalid users of the platform, the arrival amount of the advertisement, the exposure rate and click rate of the advertisement, and the purchase rate of the activity product (the data types queryable by the delivery platform side are not fully shown in fig. 6).
In one embodiment, as shown in fig. 7, a schematic diagram of user funnel analysis according to advertisement arrival conditions is provided. Referring to fig. 7 (A), when user funnel analysis is performed according to advertisement arrival conditions, the corresponding conversion rate is calculated in turn for each layer of data based on each advertisement index, and then, according to the conversion rate of each layer, it is determined whether a fault or error has occurred in the delivery process, so that adjustments can be made at any time. Specifically, through top-down funnel analysis, the conversion rate of each layer relative to the previous layer is used to evaluate the quality of the advertisement (such as whether the material style is unpopular) and whether the platform has problems (such as whether the platform intercepts advertisements erroneously), so that problems in the advertisement delivery process are discovered and corrected in time, ensuring that advertisement delivery is effective.
For example, the adjustment method may include: for example, advertisement delivery decision adjustment is performed, including modes of material adjustment, advertisement creation and the like, and modes of delivery platform change and the like. Specifically, for the advertisement delivery platform, according to the analysis condition of the corresponding user funnel, the advertisement delivery platform needs to be evaluated and predicted from different angles, for example, whether the advertisement delivery strategy of the platform itself, the advertisement restriction strategy (i.e. the interception degree or interception condition of the advertisement) and the like are reasonable or not, whether the pushing mode or the delivery mode and the like are reasonable or not, whether the user activity of the delivery platform is enough or not, and whether the platform needs to be expanded to meet the higher delivery requirement or not are judged.
Further, as can be seen from fig. 7 (B), in funnel analysis of the user, when the advertisement is delivered, the arrival rate of the advertisement on the delivery platform is calculated from the target delivery amount and the arrival amount. For example, if the target delivery amount is to push the advertisement to 10 million users but the actual number of users reached is 8 million, funnel analysis is performed based on the target delivery amount and the arrival amount to obtain the arrival rate of the advertisement on the delivery platform and the conversion rate of the current layer.
Similarly, for example, the conversion rate between the number of users actually reached and the outer-layer exposure is calculated, where the outer-layer exposure indicates whether the user directly saw the delivered advertisement. That is, when the advertisement is delivered on different pages, for example at a position on the dynamic sharing page of social software, it is determined whether the user scrolled to that position; if the user did not scroll to the advertisement position, i.e., the position was missed, the delivered advertisement has not completed outer-layer exposure for that user.
Further, referring to fig. 7 (B), the funnel may include the conversion rate between outer-layer exposure and outer-layer click, between outer-layer click and inner-layer exposure, between inner-layer exposure and inner-layer click, and between inner-layer click and activity exposure. An outer-layer click indicates that the user directly saw the advertisement and clicked on it. Inner-layer exposure means that the user clicked on the advertisement and then entered the advertisement link to see the advertisement content it contains. An inner-layer click indicates that, after seeing the advertisement content of the advertisement link, the user further clicked the activity link in the advertisement content. Activity exposure indicates that the user clicked the activity link and chose to jump to or access the corresponding promotion activity, for example, clicking the activity link to jump to another application interface. For different application platforms, for example, on a shopping platform operations such as participating in the promotion, purchasing and placing orders can be performed, while on a game platform operations such as participating in a game release, purchasing props, or having new users download the game can be performed.
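The layer-by-layer conversion rates described above can be sketched as follows; the layer names and counts are illustrative assumptions chosen to match the 10-million/8-million example in the text.

```python
def funnel_conversion(layers):
    """Given an ordered list of (layer_name, count) pairs, return each
    layer's conversion rate relative to the previous layer."""
    rates = {}
    for (_, prev), (name, cur) in zip(layers, layers[1:]):
        rates[name] = cur / prev if prev else 0.0
    return rates

# Hypothetical funnel: target delivery -> arrival -> outer exposure -> outer click.
layers = [
    ("target_delivery", 10_000_000),
    ("arrival", 8_000_000),
    ("outer_exposure", 4_000_000),
    ("outer_click", 400_000),
]
rates = funnel_conversion(layers)
# rates["arrival"] is 0.8, i.e. the 80% arrival rate from the example above.
```

A sharp drop in one layer's rate points at that layer (material, slot position, or platform interception) as the place to investigate.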
Referring to fig. 7 (C), it can be seen that different delivery platforms are respectively provided with different limiting items, such as the first, second and third limiting items and other limiting items in fig. 7. According to the limiting items of each platform, the conditions under which the delivery platform intercepts advertisements can be obtained, whether the platform's degree of advertisement interception is erroneous can be judged, and the fault condition can be fed back in time or the delivery platform modified. The first limiting item may include limiting conditions such as condition limits, online game frequency control and message delivery problems; the number of users and the actual arrival ratio under different limiting conditions differ, so the operations executable by users of the game platform also differ.
In one embodiment, as shown in fig. 8, a schematic view of user activity observation data of a delivery platform is provided. Referring to fig. 8 (D), the activity of each user of the current delivery platform may be observed and recorded, where the activity of a user may be represented by a user login value, a user registration value, and so on. For example, taking the delivery platform as a game application platform, a user may register or log in by entering the game through an in-game entry point or by directly accessing the game application.
As can be seen from fig. 8 (E), according to different access modes and access periods, the active users of the game platform can be observed and intuitively displayed as a line chart. The delivery platform side can thus obtain the user activity of the current platform in real time, push activity information to active users, strengthen the interaction between users and the platform, and improve platform revenue. When the game platform performs advertisement delivery, it can further locate and troubleshoot problems in the delivery process according to different indexes such as user activity, advertisement arrival rate, exposure rate and click rate, and perform processing operations such as data analysis and platform characteristic evaluation on the advertisement flow data, so as to subsequently provide an access path to the data analysis results.
In this embodiment, a data query request is received, the object rights and the data application scene carried by the data query request are acquired, and a corresponding data query interface is matched according to the object rights and the data application scene. The distributed timing task associated with the data application scene is executed by calling the target processing engine matched with the structured real-time data, triggering execution of the target script file corresponding to the distributed timing task. By executing the target script file, data analysis processing can be performed on the structured real-time data in the clickhouse cluster to generate corresponding data analysis results. The data analysis results are stored into the target caches corresponding to the clickhouse cluster according to the data application scene; the target cache corresponding to the data application scene is accessed based on the data query interface, the data analysis result stored in the target cache is acquired, and the data analysis result is fed back to the target object corresponding to the data query request. This separates data analysis processing from the storage of data analysis results during data query, so as to avoid frequent access to and memory occupation of the clickhouse cluster, reduce development cost in massive data processing and analysis, and reduce the probability of access timeouts.
In one embodiment, as shown in fig. 9, the step of obtaining the structured real-time data passing through the data attribute verification, that is, performing the data accuracy check and the data attribute verification on the structured data, and obtaining the structured real-time data passing through the data attribute verification specifically includes:
step S902, obtaining data attributes of each tuple in the structured data to be analyzed, and obtaining different data attribute sets corresponding to each tuple in the structured data according to each data attribute, wherein the data attributes of each data attribute set are mutually independent or related.
Specifically, the data attribute of each tuple in the structured data to be analyzed is different, for example, the data attribute in a certain tuple is an advertisement position, where the advertisement position may include different situations such as red dot prompt, announcement prompt, interface advertisement prompt, etc., and when the data attribute is the advertisement position, the number of data attributes of the corresponding tuple is 3, which can be understood as that the data attribute set of the tuple is a 3-element attribute column set. For other data attributes, and for possible cases included by each data attribute, a set of data attributes for the corresponding tuple is determined.
Step S904, sequentially checking the data accuracy of the data attribute sets corresponding to the tuples to obtain first structured data passing the data accuracy check.
The structured data may include a plurality of tuples, where the data attributes of the data attribute sets of different tuples do not overlap. For example, the advertisement slots may include different positions such as red dot prompt, announcement prompt and interface advertisement prompt, and for each advertisement slot, the arrival amount, exposure amount, click amount, download amount, etc. may be used as the data attributes of the corresponding advertisement slot.
Specifically, a cartesian product calculation is performed on a data attribute set corresponding to each tuple to obtain a corresponding possible attribute set, an expected attribute set corresponding to the data attribute set and an actual attribute set are obtained, so that data accuracy verification is performed based on the possible attribute set, the expected attribute set and the actual attribute set, and first structured data passing through the data accuracy verification is obtained.
The Cartesian product calculation represents the Cartesian product of set X and set Y, denoted X × Y: the set of all possible ordered pairs whose first member belongs to X and whose second member belongs to Y. For example, if set A = {a, b} and set B = {0, 1, 2}, then the Cartesian product of the two sets is {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}.
In one embodiment, taking the n-element attribute column set R and the m-element attribute column set S of the structured data as an example, the following formula (1) is adopted to calculate the cartesian product between the n-element attribute column set R and the m-element attribute column set S, to obtain a possible attribute set T:
T = R × S = {r ∪ s | r ∈ R, s ∈ S}; (1)
Taking as an example the case where the advertisement slot may occupy different positions such as red dot prompt, announcement prompt and interface advertisement prompt, i.e., a 3-element attribute column set R, and the arrival amount, exposure amount, click amount and download amount of the advertisement on the slot, i.e., a 4-element attribute column set S, the possible attribute set T obtained as the Cartesian product of set R and set S is: {(red dot prompt, arrival amount), (red dot prompt, exposure amount), (red dot prompt, click amount), (red dot prompt, download amount), (announcement prompt, arrival amount), (announcement prompt, exposure amount), (announcement prompt, click amount), (announcement prompt, download amount), (interface advertisement prompt, arrival amount), (interface advertisement prompt, exposure amount), (interface advertisement prompt, click amount), (interface advertisement prompt, download amount)}.
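The Cartesian product step above can be sketched directly with the standard library; the set names follow the example in the text.

```python
from itertools import product

# 3-element attribute column set R (advertisement slot positions) and
# 4-element attribute column set S (per-slot metrics), as in the example.
R = ["red dot prompt", "announcement prompt", "interface advertisement prompt"]
S = ["arrival amount", "exposure amount", "click amount", "download amount"]

# Possible attribute set T = R x S: every (slot position, metric) pair.
T = set(product(R, S))
# |T| = 3 * 4 = 12 pairs, matching the enumeration above.
```

The later accuracy check then compares this T against the expected and actual attribute sets.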
In one embodiment, data accuracy verification is performed by obtaining an expected attribute set corresponding to a data attribute set, and an actual attribute set, and based on the likely attribute set, the expected attribute set, and the actual attribute set.
For example, again taking the 3-element attribute column set R of advertisement slot positions (red dot prompt, announcement prompt, interface advertisement prompt) and the 4-element attribute column set S of arrival amount, exposure amount, click amount and download amount, when a data query is performed on advertisement flow data in the advertisement delivery process, the expectation may be to fully know the arrival amount, exposure amount, click amount and download amount of each advertisement slot; in that case the expected attribute set M should be equal to the possible attribute set T.
Likewise, with the same sets R and S, the actual attribute set N represents the attribute set actually present in the queried data. If, when the data is queried, the actual attribute set N includes: {(red dot prompt, arrival amount), (red dot prompt, exposure amount), (red dot prompt, click amount), (red dot prompt, download amount), (announcement prompt, arrival amount), (interface advertisement prompt, download amount)}, then it can be judged from the data included in the actual attribute set N that data is currently missing; that is, the arrival amount, exposure amount, click amount and download amount of each advertisement slot cannot be fully known.
Further, when verifying the possible attribute set T, the expected attribute set M and the actual attribute set N, the data accuracy is verified according to the relationship shown in the following formula (2):

N ⊆ M ⊆ T; (2)

When the relationships shown in formula (2) are satisfied simultaneously, the corresponding structured data passes the data accuracy check, and the first structured data passing the data accuracy check is obtained. That is, if any element in set N does not belong to set M, or any element in set N does not belong to set T, or any element in set M does not belong to set T, i.e., the relationship shown in formula (2) is broken, the real-time data source is considered to have generated erroneous data; such data cannot pass the data accuracy check and must be corrected or deleted.
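Under the set definitions above, the check reduces to a pair of subset tests; a minimal sketch, with example values taken from the text:

```python
def passes_accuracy_check(N, M, T):
    """Formula (2): the data passes only if N ⊆ M ⊆ T."""
    return N <= M and M <= T

# Example: M equals T (all 12 slot/metric pairs are expected), while N
# contains only the pairs actually observed in the queried data.
T = {(slot, metric)
     for slot in ("red dot prompt", "announcement prompt", "interface advertisement prompt")
     for metric in ("arrival amount", "exposure amount", "click amount", "download amount")}
M = set(T)
N = {("red dot prompt", "arrival amount"), ("announcement prompt", "arrival amount")}

# N ⊆ M ⊆ T holds, so the accuracy check passes; the missing pairs T - N
# still reveal which expected attributes were not observed.
```

An element of N outside M or T, on the other hand, indicates the real-time data source produced erroneous data and the check fails.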
Step S906, determining target attributes of the first structured data according to the data application scene, and performing probability distribution verification of the data attributes based on the target attributes of the first structured data to obtain structured real-time data passing the probability distribution verification.
The target attributes to be verified differ according to the data application scene, so the target attribute of the first structured data needs to be determined according to the data application scene. For example, if the data application scene is the data visualization display scene of a data dashboard, the target attribute to be checked may be the advertisement slot; the advertisement slot may include, for example, the two classifications of red dot prompt and interface advertisement prompt. If the target attribute is the advertisement slot, probability distribution verification is performed based on the advertisement slot to obtain the structured real-time data passing the probability distribution verification.
Specifically, the target attribute of the first structured data is determined according to the data application scene, the first structured data is sampled based on the data application scene and the data scale of the real-time service data stream, and a preset number of rows of data are extracted. The attribute classifications corresponding to the target attribute and the actual observation count corresponding to each attribute classification are obtained; the theoretical expected count corresponding to each attribute classification is determined based on the preset probability value of the corresponding attribute classification and the preset number of extracted rows; the corresponding chi-square value is determined according to the theoretical expected counts and the actual observation counts of each attribute classification; and data attribute verification is then performed on the chi-square value using preset confidence data, so as to obtain the structured real-time data passing the probability distribution verification.
For example, taking the case where the target attribute of the first structured data, determined according to the data application scene, is the advertisement slot, and the advertisement slot includes the two classifications of red dot prompt and interface advertisement prompt, the first structured data is sampled according to the data application scene and the data scale of the real-time service data stream, and a preset number n of rows of data are extracted. The preset number n may take different values such as 5,000 rows, 10,000 rows or 20,000 rows; it needs to be set or adjusted according to the data application scene and the data scale of the real-time service data stream, and is not limited to one or several specific values.
The attribute classification count k corresponds to the target attribute; for example, when the advertisement slot includes the two classifications of red dot prompt and interface advertisement prompt, k is 2, and the obtained actual observation counts x_i corresponding to each attribute classification specifically include the actual observation count x_1 corresponding to the red dot prompt and the actual observation count x_2 corresponding to the interface advertisement prompt.
Further, the preset probability value P_i that data falls into the corresponding attribute classification is obtained, i.e., a predefined theoretical probability value for each classification, such as the preset probability value P_1 corresponding to the red dot prompt and the preset probability value P_2 corresponding to the interface advertisement prompt. Based on the preset probability values P_i (including P_1 for the red dot prompt and P_2 for the interface advertisement prompt) and the preset number n of extracted rows, the theoretical expected count m_i corresponding to each attribute classification is calculated by the following formula (3):

m_i = n·P_i; with the constraints Σ P_i = 1 and Σ m_i = Σ x_i = n, summing over i = 1, …, k. (3)

That is, the sum of the preset probability value P_1 corresponding to the red dot prompt and the preset probability value P_2 corresponding to the interface advertisement prompt is 1; the sum of the theoretical expected count m_1 corresponding to the red dot prompt and the theoretical expected count m_2 corresponding to the interface advertisement prompt equals n; and similarly, the sum of the actual observation count x_1 corresponding to the red dot prompt and the actual observation count x_2 corresponding to the interface advertisement prompt equals n.
In one embodiment, as the preset number n of extracted rows becomes larger, i.e., the data scale increases, the limit distribution of the statistic for the corresponding target attribute tends to the chi-square (χ²) distribution, expressed by the following formula (4):

χ² = Σ (x_i − m_i)² / m_i, summing over i = 1, …, k. (4)

In the process of verifying the data attribute of the target attribute, the corresponding chi-square value is calculated according to the theoretical expected counts and the actual observation counts of each attribute classification. Specifically, during verification, as the size n of the extracted data increases and the theoretical expected counts m_i become large enough, the calculated χ² obeys the χ² distribution with k − 1 degrees of freedom. For the attribute classification count k, for example, when the advertisement slot includes the two classifications of red dot prompt and interface advertisement prompt, k takes the value 2. The smaller the calculated chi-square value, the better, and the more accurate and effective the corresponding structured data.
Further, probability distribution verification of the data attribute is performed on the chi-square value using the chi-square table and the corresponding preset confidence data (for example, 95% confidence or 90% confidence, or other confidence values; the verification process is not specifically limited), so as to obtain the structured real-time data passing the probability distribution verification. Specifically, through probability distribution verification of the data attribute, it can be determined whether the actual observation counts match the theoretical probability distribution, thereby determining the validity of the generated structured data.
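A minimal sketch of this chi-square check follows; the sample counts and preset probabilities are illustrative assumptions, and 3.841 is the standard 95% critical value of the χ² distribution with 1 degree of freedom.

```python
def chi_square(observed, probabilities, n):
    """Formula (4): chi² = Σ (x_i − m_i)² / m_i, with m_i = n · P_i
    from formula (3)."""
    assert abs(sum(probabilities) - 1.0) < 1e-9 and sum(observed) == n
    return sum((x - n * p) ** 2 / (n * p)
               for x, p in zip(observed, probabilities))

# Hypothetical sample of n = 10,000 rows with k = 2 classifications:
# red dot prompt observed 5,100 times, interface advertisement prompt 4,900,
# under preset probabilities P1 = P2 = 0.5 (so m1 = m2 = 5,000).
stat = chi_square([5100, 4900], [0.5, 0.5], 10_000)  # = 4.0
# With k − 1 = 1 degree of freedom at 95% confidence the critical value is
# about 3.841; since 4.0 > 3.841, this sample would fail the check.
```

A sample with counts close to the theoretical expectation yields a statistic near zero and passes.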
In one embodiment, in continuous periodic testing, the chi-square statistics of the target attribute over different time periods are continuously calculated to further determine whether the structured data in the service data stream remains consistent over time. Since the χ² distribution is additive, the χ² statistics of multiple key attributes of the structured data can also be summed, so that whether the structured data as a whole is accurate and effective can be judged by examining the key attributes of multiple pieces of structured data at the same time.
In this embodiment, the data attribute of each tuple in the structured data to be analyzed is acquired, and the different data attribute sets corresponding to each tuple are obtained according to these data attributes. The data attribute sets corresponding to each tuple are sequentially subjected to a data accuracy check to obtain first structured data that passes the check; the target attribute of the first structured data is then determined according to the data application scene, and a data attribute check based on that target attribute yields structured real-time data that passes the data attribute check. The accuracy check and the data attribute check of the structured data thus ensure the consistency and accuracy of the structured real-time data subsequently imported into the clickhouse cluster, improving the accuracy of the data analysis results obtained later.
In one embodiment, as shown in fig. 10, a clickhouse-based real-time service data stream analysis processing method is provided, which specifically includes the following steps:
step S1001, acquiring real-time service data streams in each real-time data source, and performing data analysis and filtering on each real-time service data stream to obtain structured data to be analyzed.
Step S1002, obtaining data attributes of each tuple in the structured data to be analyzed, and obtaining different data attribute sets corresponding to each tuple in the structured data according to each data attribute.
Step S1003, a Cartesian product calculation is performed on the data attribute set corresponding to each tuple to obtain a corresponding possible attribute set, and an expected attribute set and an actual attribute set corresponding to the data attribute set are acquired.
Step S1004, performing data accuracy verification based on the possible attribute set, the expected attribute set and the actual attribute set, to obtain first structured data passing the data accuracy verification.
Step S1005, determining a target attribute of the first structured data according to the data application scene, sampling the first structured data based on the data application scene and the data scale of the real-time service data stream, and extracting a preset number of rows of data.
Step S1006, obtaining attribute classifications corresponding to the target attributes and actual observation times corresponding to the attribute classifications, and determining theoretical expected times corresponding to the attribute classifications based on preset probability values of the preset row data falling into the corresponding attribute classifications and preset numbers corresponding to the extracted data.
Step S1007, according to the theoretical expected times and the actual observed times corresponding to each attribute classification, determining the corresponding chi-square value, and carrying out probability distribution verification of the data attribute on the chi-square value by utilizing the preset confidence coefficient data to obtain the structured real-time data passing the probability distribution verification.
Step S1008, determining a target processing engine matched with the structured real-time data from the table engines corresponding to the clickhouse clusters based on the data application scene of the structured real-time data.
Step S1009, according to the business data characteristics corresponding to the target processing engine and the structured real-time data, establishing a library table corresponding to the structured real-time data in the clickhouse cluster in real time, and storing the structured real-time data into the library table.
Step S1010, receiving a data query request, and acquiring an object authority carried by the data query request and a data application scene.
Step S1011, matching corresponding data query interfaces according to the object rights and the data application scene, wherein the data query interfaces are used for accessing data analysis results corresponding to the data query requests.
Step S1012, the target processing engine matched with the structured real-time data is invoked to execute the distributed timing task associated with the data application scenario.
Step S1013, executing the target script file corresponding to the distributed timing task according to the target processing engine matched with the structured real-time data.
Step S1014, performing data analysis processing on the structured real-time data in the clickhouse cluster by executing the target script file to obtain a corresponding data analysis result.
Step S1015, based on the data query interface, accessing the target cache corresponding to the data application scene.
Step S1016, the data analysis result stored in the target cache is obtained, and the data analysis result is fed back to the target object corresponding to the data query request.
In the above clickhouse-based real-time service data stream analysis processing method, the real-time service data streams in each real-time data source are acquired, and data analysis and filtering are performed on each real-time service data stream to obtain the structured data to be analyzed. A data accuracy check and a data attribute check are then performed on the structured data to obtain structured real-time data that passes the data attribute check, and the structured real-time data is imported into a library table of a self-constructed clickhouse cluster, so as to ensure the consistency and accuracy of the imported structured real-time data and improve the accuracy of the data analysis results obtained subsequently. Further, a target processing engine matched with the structured real-time data is used to perform data analysis processing on the structured real-time data in the clickhouse cluster and generate corresponding data analysis results, and each data analysis result is stored into the target cache corresponding to the clickhouse cluster according to the data application scene. The target processing engine is determined according to the data application scene corresponding to the structured real-time data, and the target cache matches the data application scene. The method thus separates data analysis processing from the storage of data analysis results, so as to avoid occupying the memory of the clickhouse cluster, reduce the development cost of mass data processing and analysis, and improve the accuracy of the obtained data analysis results.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the steps are not strictly limited in their order of execution and may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include a plurality of sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times, and which are not necessarily executed sequentially but may be executed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, the embodiments of the present application further provide a clickhouse-based real-time service data stream analysis processing system for implementing the above clickhouse-based real-time service data stream analysis processing method. The implementation of the solution provided by the system is similar to that described for the method above, so for the specific limitations of the one or more clickhouse-based real-time service data stream analysis processing system embodiments provided below, reference may be made to the limitations of the clickhouse-based real-time service data stream analysis processing method above, which are not repeated here.
In one embodiment, as shown in FIG. 11, there is provided a clickhouse-based real-time traffic data stream analysis processing system comprising: a structured data generation module 1102, a data verification module 1104, a data analysis result generation module 1106, and a data analysis result storage module 1108, wherein:
the structured data generation module 1102 is configured to obtain real-time service data flows in each real-time data source, and perform data analysis and filtering on each real-time service data flow to obtain structured data to be analyzed.
The data verification module 1104 is configured to perform data accuracy verification and data attribute verification on the structured data, obtain structured real-time data that passes the data attribute verification, and import the structured real-time data into a library table of the self-constructed clickhouse cluster.
The data analysis result generating module 1106 is configured to perform data analysis processing on the structured real-time data in the clickhouse cluster by using a target processing engine matched with the structured real-time data, to generate a corresponding data analysis result; the target processing engine is determined according to the data application scenario corresponding to the structured real-time data.
The data analysis result storage module 1108 is configured to store each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scenario, where the target cache matches with the data application scenario.
In the clickhouse-based real-time service data stream analysis processing system, the real-time service data streams in each real-time data source are acquired, and data analysis and filtering are performed on each real-time service data stream to obtain the structured data to be analyzed. A data accuracy check and a data attribute check are then performed on the structured data to obtain structured real-time data that passes the data attribute check, and the structured real-time data is imported into a library table of a self-constructed clickhouse cluster, so as to ensure the consistency and accuracy of the imported structured real-time data and improve the accuracy of the data analysis results obtained subsequently. Further, a target processing engine matched with the structured real-time data is used to perform data analysis processing on the structured real-time data in the clickhouse cluster and generate corresponding data analysis results, and each data analysis result is stored into the target cache corresponding to the clickhouse cluster according to the data application scene. The target processing engine is determined according to the data application scene corresponding to the structured real-time data, and the target cache matches the data application scene. The system thus separates data analysis processing from the storage of data analysis results, so as to avoid occupying the memory of the clickhouse cluster, reduce the development cost of mass data processing and analysis, and improve the accuracy of the obtained data analysis results.
In one embodiment, a clickhouse-based real-time traffic data stream analysis processing system is provided, the system comprising:
the data query request receiving module is used for receiving the data query request and acquiring the object permission carried by the data query request and the data application scene;
the data query interface matching module is used for matching the corresponding data query interface according to the object authority and the data application scene; the data query interface is used for accessing data analysis results corresponding to the data query request;
The distributed timing task execution module is used for invoking the target processing engine matched with the structured real-time data and executing the distributed timing task associated with the data application scene.
The target script file execution module is used for executing the target script file corresponding to the distributed timing task according to the target processing engine matched with the structured real-time data.

The data analysis result generation module is used for performing data analysis processing on the structured real-time data in the clickhouse cluster by executing the target script file, to obtain a corresponding data analysis result.
And the data analysis result storage module is used for respectively storing each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scene.
And the target cache access module is used for accessing the target cache corresponding to the data application scene based on the data query interface.
The data analysis result acquisition module is used for acquiring the data analysis result stored in the target cache and feeding the data analysis result back to the target object corresponding to the data query request.
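One way to picture the query flow these modules handle — matching a data query interface from the object permission and data application scene, then reading the analysis result from the scene's target cache — is the following sketch. All interface, scene, cache, and metric names here are hypothetical illustrations, not part of the patent.

```python
# Minimal sketch of the query-routing flow above (all names hypothetical):
# a (object permission, data application scene) pair is matched to a data
# query interface, which reads the analysis result from the scene's cache.

TARGET_CACHES = {
    "realtime_report": {"ad_clicks": {"total": 1024}},
    "offline_report":  {"ad_clicks": {"total": 98765}},
}

# (object permission, data application scene) -> data query interface name
INTERFACE_TABLE = {
    ("admin",  "realtime_report"): "internal_api",
    ("viewer", "offline_report"):  "readonly_api",
}

def match_interface(permission, scene):
    """Match the data query interface; unmatched pairs are rejected."""
    try:
        return INTERFACE_TABLE[(permission, scene)]
    except KeyError:
        raise PermissionError(f"no interface for {permission!r}/{scene!r}")

def handle_query(permission, scene, metric):
    """Match the interface, then access the scene's target cache."""
    interface = match_interface(permission, scene)
    result = TARGET_CACHES[scene].get(metric)
    return {"interface": interface, "result": result}

response = handle_query("admin", "realtime_report", "ad_clicks")
```

A request whose permission/scene pair has no registered interface never reaches the cache at all.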
In one embodiment, a clickhouse-based real-time traffic data stream analysis processing system is provided, the system comprising:
the target processing engine determining module is used for determining a target processing engine matched with the structured real-time data from a table engine corresponding to the clickhouse cluster based on the data application scene of the structured real-time data;
the structured real-time data storage module is used for establishing a library table corresponding to the structured real-time data in the clickhouse cluster according to the service data characteristics corresponding to the target processing engine and the structured real-time data, and storing the structured real-time data into the library table; when a library table corresponding to the structured real-time data is established, determining a primary key, a partition and a data storage period in the library table according to the structured real-time data.
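The library-table creation this module describes — choosing a table engine, then fixing the primary key, partition, and data storage period from the structured real-time data — can be sketched as DDL construction. A minimal sketch with hypothetical table and column names; the ORDER BY, PARTITION BY, and TTL clauses are standard ClickHouse MergeTree syntax for the primary key, partition, and storage period.

```python
# Sketch of building the library-table DDL described above. Table and column
# names are hypothetical; MergeTree's ORDER BY / PARTITION BY / TTL clauses
# express the primary key, the partition, and the data-storage period.

def build_create_table(table, columns, engine, order_by, partition_by, ttl_days):
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n    {cols}\n)\n"
        f"ENGINE = {engine}\n"
        f"PARTITION BY {partition_by}\n"
        f"ORDER BY ({', '.join(order_by)})\n"        # doubles as primary key
        f"TTL event_time + INTERVAL {ttl_days} DAY"  # data storage period
    )

ddl = build_create_table(
    table="ad_events_local",
    columns=[("event_time", "DateTime"),
             ("ad_slot", "String"),
             ("clicks", "UInt64")],
    engine="MergeTree",
    order_by=["event_time", "ad_slot"],
    partition_by="toYYYYMMDD(event_time)",
    ttl_days=30,
)
```

The generated statement would then be executed against the self-constructed clickhouse cluster by whatever client the deployment uses.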
In one embodiment, the data verification module is further configured to:
Acquiring data attributes of each tuple in the structured data to be analyzed, and acquiring different data attribute sets corresponding to each tuple in the structured data according to each data attribute; the data attributes among the data attribute sets are mutually independent or related; sequentially checking the data accuracy of the data attribute sets corresponding to each tuple to obtain first structured data passing the data accuracy check; and determining the target attribute of the first structured data according to the data application scene, and carrying out probability distribution verification of the data attribute based on the target attribute of the first structured data to obtain structured real-time data passing the probability distribution verification.
In one embodiment, the data verification module is further configured to:
carrying out Cartesian product calculation on the data attribute set corresponding to each tuple to obtain a corresponding possible attribute set; acquiring an expected attribute set and an actual attribute set corresponding to the data attribute set; and carrying out data accuracy verification based on the possible attribute set, the expected attribute set and the actual attribute set, and obtaining first structured data passing the data accuracy verification.
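The accuracy check described here can be sketched with Python's itertools: the possible attribute set is the Cartesian product of the per-field data attribute sets, and a tuple passes when its actually observed attribute combinations fall within both the possible set and the expected set. A minimal sketch with hypothetical attribute values; the patent's exact pass criterion may differ.

```python
# Sketch of the accuracy check above: the possible attribute set is the
# Cartesian product of each tuple's data attribute sets; observed attribute
# combinations must be both possible and expected to pass.
import itertools

def possible_attribute_set(attribute_sets):
    """Cartesian product of the per-field attribute value sets."""
    return set(itertools.product(*attribute_sets))

def accuracy_check(attribute_sets, expected_set, actual_set):
    """Pass when every actually observed combination is possible and expected."""
    possible = possible_attribute_set(attribute_sets)
    return actual_set <= possible and actual_set <= expected_set

# Hypothetical ad-event tuple: platform x ad-slot type.
attribute_sets = [{"ios", "android"}, {"red_dot", "interface_ad"}]
ok = accuracy_check(
    attribute_sets,
    expected_set={("ios", "red_dot"), ("ios", "interface_ad"),
                  ("android", "interface_ad")},
    actual_set={("ios", "red_dot"), ("android", "interface_ad")},
)
```

An observed combination outside the expected set (for instance `("android", "red_dot")` here) fails the check even though it is Cartesian-possible.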
In one embodiment, the data verification module is further configured to:
Determining a target attribute of the first structured data according to the data application scene; sampling the first structured data based on the data application scene and the data scale of the real-time service data stream, and extracting a preset number of rows of data; acquiring the attribute classifications corresponding to the target attribute, and acquiring the actual observation times corresponding to each attribute classification; determining the theoretical expected times corresponding to each attribute classification based on preset probability values of the extracted rows of data falling into the corresponding attribute classification and the preset number corresponding to the extracted data; determining the corresponding chi-square value according to the theoretical expected times and the actual observation times corresponding to each attribute classification; and performing probability distribution verification of the data attribute on the chi-square value by using preset confidence data, to obtain structured real-time data that passes the probability distribution verification.
In one embodiment, the data analysis result generation module is further configured to:
when there are multiple data query requests, a threshold number of concurrently executing target script files is determined according to the processing performance of the clickhouse cluster.
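A minimal sketch of such a concurrency cap follows, assuming a hypothetical sizing rule that derives the threshold from the cluster's core count — the patent does not specify the formula, and the script "execution" here is a stand-in callable.

```python
# Sketch of capping concurrently executing target script files by a threshold
# derived from cluster processing performance (the sizing rule is hypothetical).
from concurrent.futures import ThreadPoolExecutor

def concurrency_threshold(cluster_cores, per_query_cores=4, floor=1):
    """Hypothetical rule: leave each running script a few cores."""
    return max(floor, cluster_cores // per_query_cores)

def run_scripts(script_files, cluster_cores, execute):
    """Execute script files with at most `limit` running at once."""
    limit = concurrency_threshold(cluster_cores)
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(execute, script_files))

results = run_scripts(
    ["daily_report.sql", "ad_click_stats.sql"],
    cluster_cores=16,
    execute=lambda script: f"executed {script}",
)
```

With 16 cluster cores this rule allows four scripts in flight; a small cluster still gets the floor of one, so queries queue rather than fail.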
In one embodiment, the data application scenario comprises a low frequency query scenario; the clickhouse-based real-time business data stream analysis processing system further comprises:
And the data analysis result real-time acquisition module is used for acquiring the data analysis result in real time based on the data query interface.
In one embodiment, as shown in fig. 12, there is provided a clickhouse-based real-time traffic data stream analysis processing system, which, as can be seen with reference to fig. 12, includes:
The data preprocessing module 1202 is configured to perform data cleaning and data checking to obtain structured real-time data. The data cleaning includes data analysis and data filtering, and the data checking includes the data accuracy check and the data attribute check.
The data storage module 1204 is configured to build a clickhouse cluster, select a target processing engine that matches the structured real-time data, and according to the characteristics of the target processing engine and the service data corresponding to the structured real-time data, build a library table corresponding to the structured real-time data in the clickhouse cluster in real time, and store the structured real-time data into the library table.
The data analysis module 1206 is configured to construct SQL statements, and perform tuning processing on each SQL statement to obtain a tuned SQL script.
The data query module 1208 is configured to perform a distributed timing task, perform data analysis processing on the structured real-time data in the clickhouse cluster, obtain a corresponding data analysis result, and store the data analysis result in the target cache. And providing real-time query service, providing different data query interfaces according to different object rights and different data application scenes, and respectively accessing different target caches through the different data query interfaces to obtain a data analysis result.
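The storage/serving split these modules describe — distributed timing tasks write analysis results into per-scenario target caches, while real-time queries read only from those caches rather than from the clickhouse cluster — can be sketched as follows. All class, scenario, and key names are hypothetical.

```python
# Sketch of the storage/serving split above: the timing task writes analysis
# results into a per-scenario target cache; the query service reads only from
# the cache, keeping query load off the clickhouse cluster (names hypothetical).

class ScenarioCaches:
    def __init__(self, scenarios):
        self._caches = {scene: {} for scene in scenarios}

    def store(self, scene, key, result):
        """Called by the distributed timing task after analysis completes."""
        self._caches[scene][key] = result

    def fetch(self, scene, key):
        """Called by the real-time query service; None on a cache miss."""
        return self._caches[scene].get(key)

caches = ScenarioCaches(["realtime_report", "low_frequency_query"])
caches.store("realtime_report", "ad_clicks:today", {"total": 1024})
hit = caches.fetch("realtime_report", "ad_clicks:today")
miss = caches.fetch("low_frequency_query", "ad_clicks:today")
```

Because each data application scene owns its own cache, a result written for one scene is invisible to queries routed to another.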
The modules in the clickhouse-based real-time traffic data stream analysis processing system described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 13. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer equipment is used for storing data such as real-time business data streams, structured data to be analyzed, structured real-time data, target processing engines, data analysis results, data application scenes, target caches and the like. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a clickhouse-based real-time traffic data stream analysis processing method.
It will be appreciated by those skilled in the art that the structure shown in fig. 13 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. When advertisements are pushed, users on each platform or application program can refuse, or conveniently opt out of, the pushed advertisement information.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (12)

1. A clickhouse-based real-time traffic data stream analysis processing method, the method comprising:
acquiring real-time service data streams in each real-time data source, and carrying out data analysis and filtering on each real-time service data stream to obtain structured data to be analyzed;
performing data accuracy check and data attribute check on the structured data to obtain structured real-time data passing the check, and importing the structured real-time data into a library table of a self-constructed clickhouse cluster;
Performing data analysis processing on the structured real-time data in the clickhouse cluster by using a target processing engine matched with the structured real-time data to generate a corresponding data analysis result; the target processing engine is determined according to a data application scene corresponding to the structured real-time data;
storing each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scene; and the target cache is matched with the data application scene.
2. The method of claim 1, further comprising, prior to said performing data analysis processing on the structured real-time data in the clickhouse cluster using the target processing engine that matches the structured real-time data to generate a corresponding data analysis result:
receiving a data query request, and acquiring an object authority carried by the data query request and a data application scene;
matching corresponding data query interfaces according to the object rights and the data application scene; the data query interface is used for accessing a data analysis result corresponding to the data query request;
And calling a target processing engine matched with the structured real-time data to execute a distributed timing task associated with the data application scene.
3. The method of claim 2, wherein the target processing engine that matches the structured real-time data performs data analysis processing on the structured real-time data in the clickhouse cluster to generate a corresponding data analysis result, and the method comprises:
executing a target script file corresponding to the distributed timing task according to a target processing engine matched with the structured real-time data;
and executing the target script file, and carrying out data analysis processing on the structured real-time data in the clickhouse cluster to obtain a corresponding data analysis result.
4. A method according to claim 2 or 3, further comprising, after said storing each of said data analysis results in each of the target caches corresponding to said clickhouse clusters, respectively, according to said data application scenario:
accessing a target cache corresponding to the data application scene based on the data query interface;
and acquiring a data analysis result stored in the target cache, and feeding back the data analysis result to a target object corresponding to the data query request.
5. A method according to any one of claims 1 to 3, wherein importing the structured real-time data into a library table of a self-built clickhouse cluster comprises:
determining a target processing engine matched with the structured real-time data from a table engine corresponding to the clickhouse cluster based on a data application scene of the structured real-time data;
establishing a library table corresponding to the structured real-time data in the clickhouse cluster according to the service data characteristics corresponding to the target processing engine and the structured real-time data, and storing the structured real-time data into the library table; when a library table corresponding to the structured real-time data is established, determining a primary key, a partition and a data storage period in the library table according to the structured real-time data.
6. A method according to any one of claims 1 to 3, wherein performing a data accuracy check and a data attribute check on the structured data to obtain structured real-time data passing the check comprises:
acquiring data attributes of each tuple in the structured data to be analyzed, and acquiring different data attribute sets corresponding to each tuple in the structured data according to each data attribute; the data attributes among the data attribute sets are independent or related to each other;
Sequentially checking the data accuracy of the data attribute sets corresponding to each tuple to obtain first structured data passing through the data accuracy check;
and determining the target attribute of the first structured data according to the data application scene, and carrying out probability distribution verification of the data attribute based on the target attribute of the first structured data to obtain structured real-time data passing the probability distribution verification.
7. The method according to claim 6, wherein sequentially performing data accuracy verification on the data attribute sets corresponding to each tuple to obtain first structured data passing the data accuracy verification, includes:
carrying out Cartesian product calculation on the data attribute set corresponding to each tuple to obtain a corresponding possible attribute set;
acquiring an expected attribute set and an actual attribute set corresponding to the data attribute set;
and carrying out data accuracy verification based on the possible attribute set, the expected attribute set and the actual attribute set, and obtaining first structured data passing through the data accuracy verification.
8. The method according to claim 6, wherein determining the target attribute of the first structured data according to the data application scenario, and performing probability distribution verification of the data attribute based on the target attribute of the first structured data, to obtain structured real-time data that passes the probability distribution verification, includes:
determining a target attribute of the first structured data according to the data application scenario;
sampling the first structured data based on the data application scenario and the data scale of the real-time business data stream, and extracting a preset number of rows of data;
acquiring attribute classifications corresponding to the target attribute, and acquiring an actual observation count for each attribute classification;
determining a theoretical expected count for each attribute classification based on a preset probability that an extracted row falls into the corresponding attribute classification and the preset number of extracted rows;
determining a corresponding chi-square value according to the theoretical expected count and the actual observation count for each attribute classification;
and performing probability distribution verification of the data attribute on the chi-square value by using preset confidence data, to obtain structured real-time data that passes the probability distribution verification.
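The distribution check of claim 8 is a standard Pearson chi-square goodness-of-fit test: each classification's expected count is its preset probability times the sample size, and the statistic is compared against a critical value at the preset confidence level. A minimal sketch, with illustrative names and thresholds (3.841 is the standard chi-square critical value for 1 degree of freedom at the 0.05 significance level):

```python
def chi_square_check(observed, expected_probs, n, critical_value):
    """Sketch of the claim-8 chi-square check (names are illustrative).

    observed:       actual observation count per attribute classification
    expected_probs: preset probability of a sampled row falling into each class
    n:              preset number of sampled rows
    critical_value: chi-square critical value at the preset confidence level
    """
    chi2 = 0.0
    for cls, obs in observed.items():
        exp = expected_probs[cls] * n       # theoretical expected count
        chi2 += (obs - exp) ** 2 / exp      # Pearson chi-square contribution
    return chi2, chi2 <= critical_value     # pass if the statistic is below the threshold

obs = {"A": 48, "B": 52}
chi2, ok = chi_square_check(obs, {"A": 0.5, "B": 0.5}, 100, critical_value=3.841)
print(chi2, ok)  # 0.16 True -> the observed split matches the expected 50/50 distribution
```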
9. A clickhouse-based real-time business data stream analysis processing system, the system comprising:
the structured data generation module, used for acquiring a real-time business data stream from each real-time data source, and performing data parsing and filtering on each real-time business data stream to obtain structured data to be analyzed;
the data verification module, used for performing data accuracy verification and data attribute verification on the structured data to obtain verified structured real-time data, and importing the structured real-time data into a library table of a self-built clickhouse cluster;
the data analysis result generation module, used for performing data analysis processing on the structured real-time data in the clickhouse cluster by using a target processing engine matched with the structured real-time data, to generate a corresponding data analysis result; the target processing engine being determined according to a data application scenario corresponding to the structured real-time data;
the data analysis result storage module, used for storing each data analysis result into each target cache corresponding to the clickhouse cluster according to the data application scenario; the target cache being matched with the data application scenario.
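The four modules of claim 9 form a pipeline: parse and filter the stream, verify, import into the clickhouse library table, analyze with a scenario-matched engine, and cache per scenario. The sketch below wires these stages together in memory; the class names, the stand-in cluster, the lambda engine, and the simple filter/verification rules are all illustrative assumptions, not the patented implementation.

```python
class FakeClickHouse:
    """In-memory stand-in for the self-built clickhouse cluster's library table."""
    def __init__(self):
        self.table = []

    def insert(self, rows):
        self.table.extend(rows)

class Pipeline:
    def __init__(self, cluster, engines, caches):
        self.cluster = cluster   # library table target
        self.engines = engines   # data application scenario -> matched processing engine
        self.caches = caches     # data application scenario -> target cache

    def run(self, stream, scenario):
        # structured data generation: parse and filter the raw stream
        rows = [r for r in stream if isinstance(r, dict) and "value" in r]
        # data verification: keep rows whose value is a non-negative number
        verified = [r for r in rows
                    if isinstance(r["value"], (int, float)) and r["value"] >= 0]
        self.cluster.insert(verified)                        # import into the library table
        result = self.engines[scenario](verified)            # engine matched to the scenario
        self.caches.setdefault(scenario, []).append(result)  # target cache per scenario
        return result

cluster, caches = FakeClickHouse(), {}
pipe = Pipeline(cluster, {"report": lambda rows: sum(r["value"] for r in rows)}, caches)
total = pipe.run([{"value": 3}, {"value": -1}, "garbage", {"value": 4}], "report")
print(total)  # 7: the -1 row fails verification, "garbage" fails parsing
```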
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 8.
11. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
CN202210696796.3A 2022-06-20 2022-06-20 Real-time business data stream analysis processing method and system based on clickhouse Pending CN117312375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210696796.3A CN117312375A (en) 2022-06-20 2022-06-20 Real-time business data stream analysis processing method and system based on clickhouse


Publications (1)

Publication Number Publication Date
CN117312375A true CN117312375A (en) 2023-12-29

Family

ID=89248550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210696796.3A Pending CN117312375A (en) 2022-06-20 2022-06-20 Real-time business data stream analysis processing method and system based on clickhouse

Country Status (1)

Country Link
CN (1) CN117312375A (en)

Similar Documents

Publication Publication Date Title
US11036735B2 (en) Dimension context propagation techniques for optimizing SQL query plans
US11625381B2 (en) Recreating an OLTP table and reapplying database transactions for real-time analytics
US11412343B2 (en) Geo-hashing for proximity computation in a stream of a distributed system
Sumbaly et al. The big data ecosystem at linkedin
US10121169B2 (en) Table level distributed database system for big data storage and query
US8725730B2 (en) Responding to a query in a data processing system
US10831619B2 (en) Fault-tolerant stream processing
US8108367B2 (en) Constraints with hidden rows in a database
US9747349B2 (en) System and method for distributing queries to a group of databases and expediting data access
US20100005114A1 (en) Efficient Delta Handling In Star and Snowflake Schemes
CN112434015B (en) Data storage method and device, electronic equipment and medium
CN111966692A (en) Data processing method, medium, device and computing equipment for data warehouse
CN110414259A (en) A kind of method and apparatus for constructing data element, realizing data sharing
US20220044144A1 (en) Real time model cascades and derived feature hierarchy
CN115516432A (en) Method and system for identifying, managing and monitoring data dependencies
Hasan et al. Data transformation from sql to nosql mongodb based on r programming language
CN115481026A (en) Test case generation method and device, computer equipment and storage medium
CN117312375A (en) Real-time business data stream analysis processing method and system based on clickhouse
CN116302867A (en) Behavior data analysis method, apparatus, computer device, medium, and program product
CN115658680A (en) Data storage method, data query method and related device
Engle A Methodology for Evaluating Relational and NoSQL Databases for Small-Scale Storage and Retrieval
US11663216B2 (en) Delta database data provisioning
Nguyen An application-oriented comparison of two NoSQL database systems: MongoDB and VoltDB
CN116860541A (en) Service data acquisition method, device, computer equipment and storage medium
CN116820326A (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination