CN116881277A - Data aggregation method, apparatus and computer readable medium - Google Patents

Publication number
CN116881277A
Authority
CN
China
Prior art keywords
event data
data
aggregation
target
target event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310777468.0A
Other languages
Chinese (zh)
Inventor
高圣巍
张丽斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN202310777468.0A
Publication of CN116881277A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/244 Grouping and aggregation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data aggregation method, an apparatus, and a computer-readable medium. After receiving target event data sent by a target data source, the method first queries a cache database for associated event data corresponding to the target event data. If no such associated event data is found in the cache database, the target event data is stored in a delay queue to wait. Once it is detected that the cache database stores the associated event data corresponding to the target event data, the target event data is read from the delay queue, aggregation statistics are performed on the target event data and the associated event data, and the result is written into an aggregation database. In this way, the target event data and the associated event data can be aggregated accurately even when real-time stream events arrive out of order.

Description

Data aggregation method, apparatus and computer readable medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data aggregation method, apparatus, and computer readable medium.
Background
This section is intended to provide background or context for the embodiments of the application recited in the claims. The description here is not admitted to be prior art by virtue of its inclusion in this section.
In e-commerce business, real-time data statistics are important: they affect whether operators can make timely decisions, they affect the effectiveness of recommendation algorithms, and they help the business as a whole detect abnormal situations in time.
In practice, however, the data to be counted often comes from different data sources and is transmitted through real-time event streams. Once the data has been aggregated, further processing such as data analysis and attribution can be performed. Because the real-time event streams of different data sources reach the statistics device at different times, the data from different sources arrives out of order, which makes efficient aggregation difficult.
Disclosure of Invention
Aspects of the present application provide a data aggregation method, an apparatus, and a computer-readable storage medium, so as to accurately aggregate data from different data sources when real-time stream events arrive out of order.
In one aspect of the present application, a data aggregation method is provided, where the method includes:
Receiving target event data sent by a target data source;
if the associated event data corresponding to the target event data is not queried in the cache database, storing the target event data into a delay queue for waiting;
after it is detected that the cache database stores the associated event data corresponding to the target event data, reading the target event data from the delay queue, performing aggregation statistics on the target event data and the associated event data, and writing the result into an aggregation database.
In another aspect of the present application, there is provided a data aggregation apparatus, wherein the apparatus includes:
the target data receiving module is used for receiving target event data sent by a target data source;
the delay storage module is used for storing the target event data into a delay queue for waiting if the associated event data corresponding to the target event data are not queried in the cache database;
and the data aggregation module is used for reading the target event data from the delay queue after it is detected that the cache database stores the associated event data corresponding to the target event data, performing aggregation statistics on the target event data and the associated event data, and writing the result into the aggregation database.
In another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data aggregation method as described above.
In another aspect of the application, a computer readable storage medium having stored thereon computer program instructions executable by a processor to implement a data aggregation method as described above is provided.
In the scheme provided by the embodiment of the application, after target event data sent by a target data source is received, the cache database is first queried for associated event data corresponding to the target event data. If no such associated event data is found in the cache database, the target event data is stored in a delay queue to wait. Once it is detected that the cache database stores the associated event data corresponding to the target event data, the target event data is read from the delay queue, aggregation statistics are performed on the target event data and the associated event data, and the result is written into the aggregation database. In this way, the target event data and the associated event data can be aggregated accurately even when real-time stream events arrive out of order.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of a data aggregation method according to an embodiment of the present application;
FIG. 2 is a flow chart of a data aggregation method according to another embodiment of the present application;
FIG. 3 is a flowchart illustrating a data aggregation method according to another embodiment of the present application;
FIG. 4 is a flowchart illustrating a data aggregation method according to another embodiment of the present application;
FIG. 5 is a flowchart illustrating a data aggregation method according to an embodiment of an order application scenario of the present application;
FIG. 6 is a flowchart illustrating a data aggregation method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data aggregation apparatus according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an electronic device suitable for implementing aspects of embodiments of the present application;
the same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In one exemplary configuration of the application, the terminal, the devices of the services network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer program instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The embodiment of the application provides a data aggregation method. After target event data sent by a target data source is received, the cache database is first queried for associated event data corresponding to the target event data. If no such associated event data is found in the cache database, the target event data is stored in a delay queue to wait. Once it is detected that the cache database stores the associated event data corresponding to the target event data, the target event data is read from the delay queue, aggregation statistics are performed on the target event data and the associated event data, and the result is written into an aggregation database. In this way, the target event data and the associated event data can be aggregated accurately even when real-time stream events arrive out of order.
In an actual scenario, the execution body of the method may be a server on the server side, or an application program installed on that server. The server may be an independent server, one server in a server cluster, or a virtual server in a cloud computing environment, and it is mainly used to aggregate data from different data sources so that attribution analysis can then be performed on the aggregated data. In some cases, the execution body may instead be a user device provided on the server side, or an application program installed on that user device. Business personnel use the user device to aggregate data from different data sources and to perform attribution analysis on the aggregated data, so that they can take corresponding countermeasures according to the attribution analysis results.
Fig. 1 shows a process flow of a data aggregation method according to an embodiment of the present application, where the method at least includes the following processing steps:
Step S101, receiving target event data sent by a target data source.
The target data source is a data source that sends target event data. It may be a server or device that processes the target event, or a server that generates or forwards the target event data. The target event may include, but is not limited to, events that need attribution analysis, such as an order placement event, a customer complaint event, or a commodity purchasing event. Attribution analysis of the target event requires data of associated events, that is, events that have a certain causal relationship with the target event. For example, if the target event is an order placement event, the associated event may be a commodity browsing event, a commodity display event, a commodity link clicking event, a commodity purchasing event, and the like.
It should be noted that, in some embodiments, the target event data and the associated event data may be transmitted as real-time streams, for example order events and commodity link click events generated by a short video application.
Step S102, if the associated event data corresponding to the target event data is not queried in the cache database, storing the target event data into a delay queue for waiting.
In the system architecture of the present application, a cache database is provided for the associated event data. When associated event data arrives, it may be temporarily stored in the cache database. It should be noted that the associated event data may be the original data sent by the data source of the associated event, or data related to the target event that is extracted from the original data as required. For example, if the associated event is a commodity browsing event, data such as the user ID of the browsing user, the commodity ID, and the browsing time is extracted and stored in the cache database as the associated event data.
In addition, in the system architecture of the application, a delay queue is provided for the target event data. The delay queue can be implemented as a memory queue, which gives higher processing and polling efficiency. After target event data is received, the cache database is first queried for the associated event data corresponding to the target event data. If it is found, aggregation is performed directly. If it is not found, aggregation cannot be performed yet, so the target event data is temporarily stored in the delay queue to wait. Then, according to a set polling period or polling logic, each piece of target event data in the delay queue is polled, and the cache database is queried for the corresponding associated event data until that data is found. The polling process can be understood as a monitoring process, namely monitoring whether the cache database stores the associated event data corresponding to the target event data.
Step S103, after it is detected that the cache database stores the associated event data corresponding to the target event data, reading the target event data from the delay queue, performing aggregation statistics on the target event data and the associated event data, and writing the result into an aggregation database.
With the data aggregation method provided by the embodiment of the application, target event data that arrives before its associated event data waits in the delay queue until the cache database holds the corresponding associated event data, and only then is it read from the delay queue, aggregated with the associated event data, and written into the aggregation database. The target event data and the associated event data can therefore be aggregated accurately even when real-time stream events arrive out of order.
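The flow of steps S101 to S103 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the cache database, delay queue, and aggregation database are replaced by in-memory Python structures, and the names `cache_db`, `delay_queue`, `aggregation_db`, `on_target_event`, and `poll_delay_queue` are hypothetical.

```python
from collections import deque

# Hypothetical in-memory stand-ins: a dict for the cache database,
# a deque for the delay queue, a list for the aggregation database.
cache_db = {}
delay_queue = deque()
aggregation_db = []


def aggregate(target, associated):
    """Combine a target event with its associated events and persist the result."""
    aggregation_db.append({"target": target, "associated": list(associated)})


def on_target_event(target):
    """Steps S101-S102: aggregate at once if possible, otherwise wait."""
    associated = cache_db.get(target["key"])
    if associated:
        aggregate(target, associated)
    else:
        delay_queue.append(target)


def poll_delay_queue():
    """Step S103: one polling pass over the waiting target events."""
    for _ in range(len(delay_queue)):
        target = delay_queue.popleft()
        associated = cache_db.get(target["key"])
        if associated:
            aggregate(target, associated)
        else:
            delay_queue.append(target)  # still waiting for associated data


# Out-of-order arrival: the order shows up before its click event.
on_target_event({"key": ("u1", "item9"), "type": "order"})
poll_delay_queue()                               # nothing to aggregate yet
cache_db[("u1", "item9")] = [{"type": "click"}]  # associated data arrives
poll_delay_queue()                               # the order is aggregated now
```

Keying both structures by a (user ID, commodity ID) pair is an assumption made for the sketch; the text only requires that the associated event data "corresponding to" a target event can be looked up.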
Based on the embodiment shown in fig. 1, referring to fig. 2, in some modified embodiments, the method further includes:
Step S104, receiving the associated event data sent by the associated data source;
Step S105, storing the associated event data in a cache database.
The associated data source is a data source different from the target data source. For example, in some examples, target event data such as order placement data comes from a MySQL database, while associated event data such as commodity browsing data and commodity purchasing data comes from a Kafka message platform. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all of the user activity stream data of a website.
The cache database can be implemented with a Redis database. Redis (Remote Dictionary Server) is an open-source, log-type key-value database written in ANSI C, with network support, that can be memory-based and can also be persistent. It supports multiple data structure types and offers fast reads and writes and highly concurrent access. Because the volume of associated event data in a real-time streaming scenario is large, using Redis as the cache database meets the requirements for fast reads and writes and highly concurrent access in this scenario.
In the above embodiment, since the associated event data mostly occurs before the target event data, it can be stored in the cache database as soon as it is acquired, and extracted from the cache database for aggregation once the target event data arrives.
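Steps S104 and S105 can be illustrated with a short sketch. It assumes, purely for illustration, that only a few fields of each raw associated event are kept and that the cache is keyed by a (user ID, commodity ID) pair; a `defaultdict` stands in for the Redis cache, and `KEPT_FIELDS` and `on_associated_event` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical stand-in for the cache database (Redis in the text):
# key = (user_id, item_id), value = list of trimmed associated events.
cache_db = defaultdict(list)

# Assumed subset of fields worth keeping for later attribution.
KEPT_FIELDS = ("user_id", "item_id", "event_type", "event_time")


def on_associated_event(raw_event):
    """Steps S104-S105: keep only the needed fields and cache them."""
    trimmed = {name: raw_event[name] for name in KEPT_FIELDS}
    cache_db[(trimmed["user_id"], trimmed["item_id"])].append(trimmed)
    return trimmed


on_associated_event({
    "user_id": "u1",
    "item_id": "item9",
    "event_type": "browse",
    "event_time": "2023-02-06 15:01:12",
    "user_agent": "not needed for attribution",
})
```

Storing a trimmed record rather than the raw event keeps the cache small, which matters when the associated event stream is much larger than the order stream.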
With continued reference to fig. 2, in some modified embodiments, after receiving the target event data sent by the target data source in step S101, the method further includes:
Step S106, if the associated event data corresponding to the target event data is found in the cache database, performing aggregation statistics on the target event data and the associated event data and writing the result into an aggregation database.
In this embodiment, when the associated event data corresponding to the target event data is found in the cache database, aggregation statistics of the target event data and the associated event data can be triggered directly without waiting. This avoids writing the target event data to the delay queue, reading it back, and monitoring for it, and so improves overall aggregation efficiency.
On the basis of any of the foregoing embodiments, referring to fig. 3, in some modified embodiments, performing aggregation statistics on the target event data and the associated event data and writing the result into an aggregation database in step S103 includes:
Step S1031, extracting aggregate statistics from the target event data and the associated event data according to preset statistics configuration information;
Step S1032, writing the extracted aggregate statistics into an aggregation database.
The statistics configuration information specifies which data contents need to be counted. It can be set flexibly according to actual requirements, and the application does not limit its specific contents. Its purpose is to extract, from the target event data and the associated event data, the data contents that can be used for attribution analysis of the target event. After the aggregate statistics are extracted, they can be written into the aggregation database, which is a database for storing aggregate statistics and can be implemented with a MySQL database or the like.
On the basis of the above embodiment, referring to fig. 4, in some modified embodiments, extracting aggregate statistics from the target event data and the associated event data according to preset statistics configuration information in step S1031 includes:
Step S10311, determining, according to the preset statistics configuration information, the dimension combinations to be counted and the dimension values corresponding to each dimension to be counted in those dimension combinations;
Step S10312, combining the dimension values corresponding to the different dimensions to be counted in a dimension combination to obtain the dimension value combinations to be counted;
Step S10313, extracting the aggregate statistics corresponding to each dimension value combination to be counted from the target event data and the associated event data.
In the attribution process, a target event often needs to be analyzed from multiple dimensions. Each dimension may have multiple dimension values, and different dimensions can be combined to produce more accurate attribution results. The target event data and the associated event data therefore need to be aggregated on different dimensions and dimension combinations according to the attribution analysis task at hand.
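Steps S10311 to S10313 can be sketched as a Cartesian product over the configured dimension values followed by a per-combination count. The configuration format, the dimension names, and the helper names below are assumptions made for illustration; the application does not prescribe them, and the sketch handles a single dimension combination (category with new customer).

```python
from itertools import product

# Hypothetical statistics configuration: each dimension to be counted
# maps to the dimension values of interest.
stat_config = {
    "category": ["1", "2"],
    "new_customer": ["yes", "no"],
}


def dimension_value_combinations(config):
    """Step S10312: Cartesian product of the values of each dimension."""
    names = list(config)
    return [dict(zip(names, values)) for values in product(*config.values())]


def extract_statistics(events, config):
    """Step S10313: count the events matching each dimension value combination."""
    stats = {}
    for combo in dimension_value_combinations(config):
        key = "/".join(f"{name}={value}" for name, value in combo.items())
        stats[key] = sum(
            all(event.get(name) == value for name, value in combo.items())
            for event in events
        )
    return stats


events = [
    {"category": "1", "new_customer": "yes"},
    {"category": "1", "new_customer": "yes"},
    {"category": "2", "new_customer": "no"},
]
stats = extract_statistics(events, stat_config)
```

Here the statistic is a simple event count; any other aggregate (sum of order amounts, distinct users, and so on) could be computed per combination in the same loop.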
Based on the above embodiment, in some modified implementations, writing the extracted aggregation statistics into an aggregation database includes:
and writing the extracted aggregate statistics into the aggregation database row by row according to the dimension value combinations to be counted.
After the aggregate statistics are extracted, a traditional approach stores each record in its own row. With a very large data volume, this approach has drawbacks for both storage and querying. For example, in some examples, a short video platform collects real-time data every day for 30000 commodities at 5-minute granularity over 38 different dimension combinations, so about 320 million records are written to the database per day, which challenges the efficiency of both recording and querying. In an ordinary data table design, the four pieces of information <commodity id, dimension, time, statistics value> form one row of data, as shown in Table 1 below:
TABLE 1
Commodity ID | Statistical dimension value combination | Time | Statistics value
1 | Category=1/new customer=2 | 2023-2-6 15:00:00 | 347
1 | Category=1/new customer=2 | 2023-2-6 15:05:00 | 625
1 | Category=1/new customer=2 | 2023-2-6 15:10:00 | 576
... | ... | ... | ...
2 | Category=2/activity=3 | 2023-2-6 15:05:00 | 380
When report data is stored and queried, this design has two defects:
1. The time is stored at a precision of 5-minute granularity, which is redundant.
2. Querying the data of a certain day requires a range query on the time column, which is unfriendly to database indexes.
In view of these problems, in the above embodiments of the present application, the extracted aggregate statistics are written into the aggregation database row by row according to the dimension value combinations to be counted. That is, different dimension value combinations are placed in different rows, and each row records the aggregate statistics corresponding to its dimension value combination. The data is thus stored in rows grouped by dimension value combination, which simplifies the storage structure and improves query efficiency.
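The daily record count of the one-row-per-record design can be checked with a short calculation. The figures for commodities, dimension combinations, and granularity come from the example in the text; treating every commodity as having a record for every combination in every 5-minute period is an assumption of this sketch.

```python
ITEMS = 30_000                   # commodities tracked per day (from the text)
DIMENSION_COMBINATIONS = 38      # dimension combinations (from the text)
PERIODS_PER_DAY = 24 * 60 // 5   # number of 5-minute periods in one day

# One row per (commodity, dimension combination, period).
narrow_rows_per_day = ITEMS * DIMENSION_COMBINATIONS * PERIODS_PER_DAY
```

Under that assumption the result is 328,320,000 rows per day, which is of the same order as the figure of about 320 million quoted above.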
Taking the case of time statistics as an example, in some modified embodiments, the aggregation database records data in the form of a flattened table, where the flattened table includes a statistics dimension field and a plurality of time period fields, and records corresponding to different time periods in each statistics dimension are stored in the same row of the flattened table.
For example, when Table 1 above is stored in the manner provided in this embodiment, one day of data for a given <commodity id, dimension> pair can be stored in a single row, as in the flattened table shown in Table 2 below:
TABLE 2
Compared with the ordinary table design, the flattened table stores only one date for a full day of <commodity id, dimension> data, which reduces storage redundancy. When querying the data of a certain day, only the date column needs to be matched, so the range query becomes a single-value query and the index is used efficiently.
In this embodiment, for application scenarios involving time-based statistics, a statistics dimension field and a plurality of time period fields are set in the flattened table, and the records of different time periods for each statistics dimension are stored in the same row. This effectively improves data storage efficiency and query efficiency. In particular, a query for the statistics of a time period can be converted from a range query into a single-value query, which substantially improves query efficiency.
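The conversion from the one-row-per-record layout to a flattened layout can be sketched as a pivot. This is an illustrative assumption about how the flattening might be done, not the layout of Table 2 itself: each wide row is keyed by (commodity ID, dimension value combination, date) and holds one slot per 5-minute period, and the names `period_index` and `flatten` are hypothetical.

```python
from datetime import datetime

PERIOD_MINUTES = 5
PERIODS_PER_DAY = 24 * 60 // PERIOD_MINUTES  # 288 five-minute periods


def period_index(ts):
    """Index of the five-minute period within the day for a timestamp."""
    return (ts.hour * 60 + ts.minute) // PERIOD_MINUTES


def flatten(narrow_rows):
    """Pivot (item, dims, timestamp, value) rows into one row per
    (item, dims, date), with one slot per time period."""
    wide = {}
    for item_id, dims, ts_text, value in narrow_rows:
        ts = datetime.strptime(ts_text, "%Y-%m-%d %H:%M:%S")
        key = (item_id, dims, ts.date().isoformat())
        row = wide.setdefault(key, [None] * PERIODS_PER_DAY)
        row[period_index(ts)] = value
    return wide


narrow = [
    (1, "category=1/new_customer=2", "2023-02-06 15:00:00", 347),
    (1, "category=1/new_customer=2", "2023-02-06 15:05:00", 625),
    (1, "category=1/new_customer=2", "2023-02-06 15:10:00", 576),
    (2, "category=2/activity=3", "2023-02-06 15:05:00", 380),
]
wide = flatten(narrow)
```

The four narrow rows collapse into two wide rows, and the date is stored once per row instead of once per 5-minute record.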
Based on the above embodiments, in some modified implementations, the flattened table further includes a date field, and records of the same date and different time periods are stored in the same row of the flattened table;
the method further comprises the steps of:
in response to a statistical data query request for a specified date, aggregate statistical data for the specified date is obtained by a single value query for the date field.
In this embodiment, after a statistical data query request for a specified date is received, the aggregate statistics for that date can be obtained through a single-value query on the date field. This makes full use of the flattened table: the range query is converted into a single-value query, and query efficiency improves accordingly.
Based on any of the foregoing embodiments, in some variations, the target event data comprises order event data, and the associated event data comprises at least one of commodity link click event data, commodity display event data, and add-to-cart event data.
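The single-value date query can be illustrated with an in-memory SQLite database standing in for the MySQL aggregation database named in the text. The table name `stats_flat`, the column names, and the choice to show only three of the time period columns are assumptions made to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened table: only three of the time period columns shown.
conn.execute(
    """CREATE TABLE stats_flat (
           item_id INTEGER, dims TEXT, stat_date TEXT,
           p180 INTEGER, p181 INTEGER, p182 INTEGER,
           PRIMARY KEY (stat_date, item_id, dims))"""
)
conn.executemany(
    "INSERT INTO stats_flat VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "category=1/new_customer=2", "2023-02-06", 347, 625, 576),
        (2, "category=2/activity=3", "2023-02-06", None, 380, None),
        (1, "category=1/new_customer=2", "2023-02-07", 10, 20, 30),
    ],
)


def query_day(connection, day):
    """Fetch one day of statistics with an equality match on the date column."""
    cursor = connection.execute(
        "SELECT item_id, dims, p180, p181, p182 FROM stats_flat "
        "WHERE stat_date = ? ORDER BY item_id",
        (day,),
    )
    return cursor.fetchall()


rows = query_day(conn, "2023-02-06")
```

With the one-row-per-record layout, the same request would need a predicate such as `time >= ? AND time < ?` on the time column; here a single equality comparison on `stat_date` selects the whole day.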
For ease of understanding, the above data aggregation method of the present application is further described below with reference to a specific example.
In a specific example, the data aggregation method is applied to aggregating order data and associated data of an e-commerce service. To address the low accuracy of real-time order attribution, an event cache and a delay queue are introduced so that order events are attributed to the correct events. To address the insufficient query performance of the statistics database, an ultra-wide table structure suited to report queries is used, which turns range queries into single-value queries, reduces the number of rows scanned, and improves query performance. Specifically, with reference to fig. 5, the data aggregation method comprises the following steps:
(1) Flink receives real-time click, display, and add-to-cart events. Flink is an open-source framework for processing real-time event streams and aggregating real-time stream data, and it is the execution body of the data aggregation method in this embodiment.
(2) Flink records the basic information of each event, such as the user ID, commodity ID, and event time, into the cache database Redis as associated event data.
(3) Flink receives an order event, that is, target event data, and handles it in one of two cases, A or B:
A. If the user's browse and click events have arrived, that is, the associated event data has arrived, the flow proceeds directly to the next step of data aggregation.
B. If the user's browse and click events have not arrived, that is, the associated event data has not arrived, the order event is placed in the delay queue and aggregated after the browse and click events arrive.
(4) Collect the dimension information required for the order event, namely the dimension combinations to be counted and the dimension values corresponding to each dimension to be counted in those combinations.
(5) Generate the required statistical dimensions, namely the dimension value combinations to be counted, according to the configuration.
(6) Aggregate the target event data and the associated event data into the aggregation database MySQL.
(7) Serve background report queries.
For order aggregation, the data that the real-time streaming system needs to count comes from different data sources: for example, the user's browse and click data comes from Kafka, while the user's order and add-to-cart data comes from the MySQL binlog. With multiple data sources, ordering problems arise during order attribution: for example, if the binlog event of a user's order arrives earlier than the browse and click events to which it should be attributed, the order cannot be attributed effectively. To handle out-of-order events across multiple data sources, the embodiment of the application introduces a Redis cache queue in the browse and click event stream and a delay queue in the order attribution process.
As further understood with reference to FIG. 6, when an order event arrives before the click/display and add-to-cart events, the order event is placed in an in-memory queue, referred to as the delay queue, which is polled periodically. If the click and add-to-cart events corresponding to the order have arrived, the delayed order event is dequeued and enters the subsequent statistical flow. If an order event has waited longer than the specified time without its corresponding click and add-to-cart events being received, the order event is discarded or marked as an unattributed order.
If the user does not place an order after browsing, the user's corresponding Redis event cache is cleared once the maximum attribution window has elapsed.
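The cache and delay-queue behaviour described above can be sketched roughly as follows. This is a minimal single-process illustration, not the embodiment's Flink/Redis implementation: the class name AttributionBuffer, the field names user_id and item_id, the matching key, and both timeout parameters are assumptions made for the example, and a plain dict stands in for Redis.

```python
from collections import deque


class AttributionBuffer:
    """In-memory stand-in for the event cache (Redis in the text) and the delay queue."""

    def __init__(self, max_wait_s=60.0, attribution_window_s=3600.0):
        self.max_wait_s = max_wait_s                      # how long an order may wait
        self.attribution_window_s = attribution_window_s  # how long cached events live
        self.cache = {}         # (user_id, item_id) -> (associated event, cached_at)
        self.delayed = deque()  # (order event, enqueued_at)
        self.aggregated = []    # (order, associated event) pairs passed downstream
        self.unattributed = []  # orders that timed out without a matching event

    def _key(self, event):
        # Assumed matching rule: same user and same commodity.
        return (event["user_id"], event["item_id"])

    def on_associated_event(self, event, now):
        """Cache a click / display / add-to-cart event."""
        self.cache[self._key(event)] = (event, now)

    def on_order(self, order, now):
        """Case A: aggregate immediately. Case B: park the order in the delay queue."""
        cached = self.cache.get(self._key(order))
        if cached is not None:
            self.aggregated.append((order, cached[0]))
        else:
            self.delayed.append((order, now))

    def poll(self, now):
        """Periodic poll: release matched orders, time out stale ones, expire the cache."""
        still_waiting = deque()
        while self.delayed:
            order, enqueued_at = self.delayed.popleft()
            cached = self.cache.get(self._key(order))
            if cached is not None:
                self.aggregated.append((order, cached[0]))
            elif now - enqueued_at > self.max_wait_s:
                self.unattributed.append(order)
            else:
                still_waiting.append((order, enqueued_at))
        self.delayed = still_waiting
        # Clear cached events that are older than the maximum attribution window.
        self.cache = {
            key: (event, cached_at)
            for key, (event, cached_at) in self.cache.items()
            if now - cached_at <= self.attribution_window_s
        }
```

Passing the current time in as `now` keeps the sketch deterministic; a real job would more likely rely on the stream processor's timers and on key expiry in the cache database.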
Further, different analysis tasks need to aggregate real-time data over different dimensions and dimension combinations. For example, the business splits the user dimension, the commodity dimension, and the access dimension into multiple values, as shown in Table 3 below:
Table 3
Dimension name    Dimension values
User              male; figurine enthusiast
Commodity         electronics and digital; participating in activity
Access            App access; night access
Assuming that 2 analysis tasks need to aggregate real-time data over the <user, commodity> and <user, access> dimension combinations respectively, the real-time stream statistics system needs to aggregate each stream record into 8 statistical dimensions:
(1) <male, electronics and digital>
(2) <male, participating in activity>
(3) <figurine enthusiast, electronics and digital>
(4) <figurine enthusiast, participating in activity>
(5) <male, App access>
(6) <male, night access>
(7) <figurine enthusiast, App access>
(8) <figurine enthusiast, night access>
The automatic combination of statistical dimensions may be accomplished by a permutation + ordering algorithm, which may be implemented with reference to the following pseudocode:
void permuteAndSort(map<dimension name -> dimension value list>)
    queue = empty queue
    foreach (dimension name, dimension value list) in map
        if queue is empty
            initialize queue with one secondary list per dimension value, each containing that single value
        else
            foreach secondary list lv2_list in queue (as it stood before this dimension)
                foreach dimension value dim_val in dimension value list
                    append dim_val to a copy of lv2_list and add the copy to queue
                remove lv2_list from queue
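As a hedged illustration, the pseudocode can be written in Python as below. The function names permute and all_combinations are chosen for this example only, and the dimension values are English renderings of the Table 3 entries. The queue-based function follows the pseudocode step by step; for one dimension combination it yields the same result as the standard library's itertools.product.

```python
from collections import deque
from itertools import product


def permute(dim_map):
    """Queue-based combination of dimension values, mirroring the pseudocode.

    dim_map maps each dimension name to its list of dimension values.
    """
    queue = deque()
    for values in dim_map.values():
        if not queue:
            # First dimension: one secondary list per dimension value.
            queue.extend([v] for v in values)
        else:
            # Extend every list already queued with each value of this dimension.
            for _ in range(len(queue)):
                prefix = queue.popleft()
                for v in values:
                    queue.append(prefix + [v])  # prefix + [v] builds a new list (a copy)
    return [tuple(combo) for combo in queue]


def all_combinations(dimension_values, dimension_combinations):
    """Dimension value combinations for every configured dimension combination."""
    combos = []
    for dims in dimension_combinations:
        combos.extend(permute({d: dimension_values[d] for d in dims}))
    return combos


if __name__ == "__main__":
    values = {
        "user": ["male", "figurine enthusiast"],
        "commodity": ["electronics and digital", "participating in activity"],
        "access": ["App access", "night access"],
    }
    tasks = [("user", "commodity"), ("user", "access")]
    for combo in all_combinations(values, tasks):
        print(combo)
```

With the two analysis tasks from the example, the script prints the 8 dimension value combinations listed above, in the same order.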
Further, the multi-dimensional real-time streaming system supports real-time data collection for 30000 commodities per day, at 5-minute granularity, over 38 different dimension combinations. About 320 million records are written to the database every day, which challenges both write and query efficiency.
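The daily record count can be checked with a short calculation. The sketch below assumes one record per commodity, per 5-minute period, per dimension combination, which is one plausible reading of the figures in the text; the function name is hypothetical.

```python
def records_per_day(commodities, granularity_minutes, dimension_combinations):
    """Rows written per day when every 5-minute period gets its own record."""
    periods_per_day = 24 * 60 // granularity_minutes  # 288 periods at 5 minutes
    return commodities * periods_per_day * dimension_combinations


if __name__ == "__main__":
    print(records_per_day(30_000, 5, 38))  # 328320000
```

Under that assumption the result is 328,320,000 records per day, in line with the order of magnitude quoted above.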
In a typical data table design, the 4 pieces of information <commodity id, dimension, time, statistics> form one row of data, as shown in the foregoing Table 1. Such a design has 2 shortcomings when storing and querying report-type data:
1. Time is stored at 5-minute granularity in every row, which introduces redundancy.
2. When querying the data of a given day, a range query is needed on the time column, which is unfriendly to database indexes.
In this embodiment, the database table storing real-time data adopts a flattened table design in which all data of a <commodity id, dimension> pair for one day is stored in one row. Compared with the conventional table design, the flattened table stores <commodity id, dimension> only once per day, which reduces storage redundancy. When querying the data of a given day, only the date column needs to be matched, so the range query becomes a single-value query and the index is used efficiently.
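The difference between the two table designs can be illustrated with a small sketch. It uses Python's built-in sqlite3 module in place of MySQL so that it runs standalone; the table names narrow and flat, the column names (stat_date, s000 to s287), and the sample rows are all assumptions made for this example and do not come from the embodiment.

```python
import sqlite3

SLOTS_PER_DAY = 24 * 60 // 5                             # 288 five-minute periods
SLOT_COLS = [f"s{i:03d}" for i in range(SLOTS_PER_DAY)]  # hypothetical column names


def build_demo():
    """Create both table layouts in memory and load the same sample data."""
    conn = sqlite3.connect(":memory:")
    # Conventional layout: one row per <commodity id, dimension, time, statistic>.
    conn.execute(
        "CREATE TABLE narrow (item_id INTEGER, dimension TEXT, ts TEXT, orders INTEGER)"
    )
    # Flattened layout: one row per <date, commodity id, dimension>, one column per period.
    slot_defs = ", ".join(f"{col} INTEGER DEFAULT 0" for col in SLOT_COLS)
    conn.execute(
        f"CREATE TABLE flat (stat_date TEXT, item_id INTEGER, dimension TEXT, {slot_defs}, "
        "PRIMARY KEY (stat_date, item_id, dimension))"
    )
    conn.executemany(
        "INSERT INTO narrow VALUES (?, ?, ?, ?)",
        [
            (1, "male", "2023-06-28 00:00", 2),
            (1, "male", "2023-06-28 00:05", 1),
            (1, "male", "2023-06-28 12:00", 4),
            (1, "male", "2023-06-29 00:00", 9),
        ],
    )
    conn.execute(
        "INSERT INTO flat (stat_date, item_id, dimension, s000, s001, s144) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        ("2023-06-28", 1, "male", 2, 1, 4),
    )
    return conn


def narrow_day_total(conn, item_id, dimension, day, next_day):
    """Range query on the time column: every period row of the day is scanned."""
    row = conn.execute(
        "SELECT COALESCE(SUM(orders), 0) FROM narrow "
        "WHERE item_id = ? AND dimension = ? AND ts >= ? AND ts < ?",
        (item_id, dimension, day + " 00:00", next_day + " 00:00"),
    ).fetchone()
    return row[0]


def flat_day_total(conn, item_id, dimension, day):
    """Single-value match on the date column: exactly one row is read."""
    row = conn.execute(
        f"SELECT {', '.join(SLOT_COLS)} FROM flat "
        "WHERE stat_date = ? AND item_id = ? AND dimension = ?",
        (day, item_id, dimension),
    ).fetchone()
    return sum(row) if row else 0
```

Both functions return the same daily total for the sample data; the flattened layout reaches it through an equality match on the date column and a single row read.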
By the above specific embodiments, the following beneficial effects can be obtained:
1. To address the low accuracy of real-time order attribution, this embodiment introduces an event cache and a delay queue so that order events are attributed to the correct events. Even when real-time stream events arrive out of order, the cache and delay-queue techniques largely ensure the accuracy of order attribution.
2. To address the insufficient query performance of the statistical database, this embodiment designs an ultra-wide table structure suited to report queries. The wide-table design saves storage for the statistical data and turns range queries into single-value queries, which reduces the number of rows scanned and improves SQL query performance.
Based on the same inventive concept, the embodiment of the present application further provides a data aggregation device, where the corresponding method may be the data aggregation method in the foregoing embodiment, and the principle of solving the problem is similar to that of the method. The data aggregation device provided by the embodiment of the application can implement the data aggregation method, and the data aggregation device can be realized by software, hardware or a combination of software and hardware. For example, the data aggregation apparatus may comprise integrated or separate functional modules or units to perform the corresponding steps in the methods described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative. Referring to fig. 7, the data aggregation apparatus may include:
a target data receiving module 101, configured to receive target event data sent by a target data source;
a delay storage module 102, configured to store the target event data into a delay queue to wait if no associated event data corresponding to the target event data is found in the cache database;
and a data aggregation module 103, configured to read the target event data from the delay queue after detecting that the cache database stores the associated event data corresponding to the target event data, perform aggregation statistics on the target event data and the associated event data, and write the result into an aggregation database.
In some variations, the apparatus further comprises:
the associated data receiving module is used for receiving associated event data sent by an associated data source;
and the associated data caching module is used for storing the associated event data into a cache database.
In some variations, the apparatus further comprises:
and the data aggregation triggering module is used for triggering, if the associated event data corresponding to the target event data is found in the cache database, aggregation statistics on the target event data and the associated event data and writing the result into the aggregation database.
In some variant embodiments, the data aggregation module 103 includes:
the aggregation data extraction unit is used for extracting aggregation statistical data from the target event data and the associated event data according to preset statistical configuration information;
and the aggregation data storage unit is used for writing the extracted aggregation statistical data into an aggregation database.
In some variant embodiments, the aggregate data extraction unit includes:
the dimension value determining subunit is used for determining dimension combinations to be counted and dimension values corresponding to each dimension to be counted in the dimension combinations to be counted according to preset statistic configuration information;
the dimension value combination subunit is used for combining the dimension values corresponding to different dimension to be counted in the dimension combination to be counted to obtain the dimension value combination to be counted;
and the aggregation data statistics subunit is used for extracting aggregation statistics data corresponding to each dimension value combination to be counted from the target event data and the associated event data.
In some variations, the aggregate data storage unit comprises:
and the aggregation data storage subunit is used for writing the extracted aggregation statistical data into the aggregation database in a row mode according to the dimension value combination to be counted.
In some variations, the aggregated database records data in the form of a flattened table that includes a statistical dimension field and a plurality of time period fields, the records for each statistical dimension corresponding to a different time period being stored in the same row of the flattened table.
In some modified embodiments, the flattened table further includes a date field, and records of the same date and different time periods are stored in the same row of the flattened table;
the apparatus further comprises:
the data query module is used for responding to a statistical data query request aiming at a specified date and obtaining the aggregated statistical data of the specified date through single-value query aiming at the date field.
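A minimal in-memory sketch of how a record could be written into, and read back from, such a flattened row is given below. The 5-minute period length is taken from the specific example above; the function names, the dict standing in for the aggregation database, and the row layout are assumptions made for illustration.

```python
from datetime import datetime

GRANULARITY_MIN = 5                          # period length, from the specific example
SLOTS_PER_DAY = 24 * 60 // GRANULARITY_MIN   # 288 time period fields per row


def slot_index(ts):
    """Index of the time period field that a timestamp falls into."""
    return (ts.hour * 60 + ts.minute) // GRANULARITY_MIN


def record(table, dimension_key, ts, amount=1):
    """Add a statistic to the row keyed by (date, dimension value combination)."""
    row_key = (ts.date().isoformat(), dimension_key)
    row = table.setdefault(row_key, [0] * SLOTS_PER_DAY)
    row[slot_index(ts)] += amount


def query_date(table, date_str):
    """Single-value match on the date field: return every row stored for that date."""
    return {dim: row for (day, dim), row in table.items() if day == date_str}
```

For example, two events at 12:03 and 12:04 on the same date fall into the same time period field (index 144) of the same row, so a query for that date returns one row per dimension value combination.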
In some variations, the target event data comprises order event data, and the associated event data comprises at least one of commodity link click event data, commodity display event data, and add-to-cart event data.
The data aggregation device provided by the embodiment of the application has the same beneficial effects as the data aggregation method provided by the previous embodiment of the application due to the same inventive concept.
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, where the corresponding method of the electronic device may be the data aggregation method in the foregoing embodiment, and the principle of solving the problem is similar to that of the method. The electronic equipment provided by the embodiment of the application comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data aggregation methods and/or techniques of the various embodiments of the present application described above.
The electronic device may be a user device, a device formed by integrating a user device and a network device through a network, or an application running on such a device. The user device includes, but is not limited to, a computer, a mobile phone, a tablet computer, a smart watch, a bracelet, and other terminal devices. The network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a cloud-computing-based set of computers, and may be used to implement part of the processing functions when setting an alarm clock. Here, the cloud is composed of a large number of hosts or network servers based on cloud computing, which is a kind of distributed computing: one virtual computer composed of a group of loosely coupled computers.
Fig. 8 shows a structure of an electronic device suitable for implementing the method and/or the technical solution in the embodiment of the present application. The electronic device 1200 includes a central processing unit (CPU) 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage section 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for system operation. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other through a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, mouse, touch screen, microphone, infrared sensor, etc.; an output section 1207 including a display such as a cathode ray tube (CRT), a liquid crystal display (LCD), an LED display, or an OLED display, and a speaker; a storage section 1208 comprising one or more computer-readable media such as a hard disk, optical disk, magnetic disk, or semiconductor memory; and a communication section 1209 including a network interface card such as a LAN (local area network) card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the Internet.
In particular, the methods and/or embodiments of the present application may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 1201.
Another embodiment of the present application also provides a computer readable storage medium having stored thereon computer program instructions executable by a processor to implement the method and/or the technical solution of any one or more of the embodiments of the present application described above.
In particular, the present embodiments may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple elements or page components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (12)

1. A method of data aggregation, wherein the method comprises:
receiving target event data sent by a target data source;
if the associated event data corresponding to the target event data is not queried in the cache database, storing the target event data into a delay queue for waiting;
after detecting that the associated event data corresponding to the target event data is stored in the cache database, reading the target event data from the delay queue, performing aggregation statistics on the target event data and the associated event data, and writing the result into an aggregation database.
2. The data aggregation method of claim 1, wherein the method further comprises:
receiving associated event data sent by an associated data source;
and storing the associated event data into a cache database.
3. The data aggregation method according to claim 1, wherein after receiving the target event data transmitted by the target data source, further comprising:
if the associated event data corresponding to the target event data is found in the cache database, triggering aggregation statistics on the target event data and the associated event data and writing the result into the aggregation database.
4. The data aggregation method according to claim 1, wherein the writing the aggregate statistics of the target event data and the associated event data into an aggregate database includes:
extracting aggregate statistics from the target event data and the associated event data according to preset statistics configuration information;
and writing the extracted aggregation statistical data into an aggregation database.
5. The data aggregation method according to claim 4, wherein the extracting the aggregate statistics from the target event data and the associated event data according to the preset statistics configuration information comprises:
according to preset statistical configuration information, determining dimension combinations to be counted and dimension values corresponding to each dimension to be counted in the dimension combinations to be counted;
combining the dimension values corresponding to different dimension to be counted in the dimension combination to be counted to obtain a dimension value combination to be counted;
and extracting aggregation statistical data corresponding to each dimension value combination to be counted from the target event data and the associated event data.
6. The data aggregation method according to claim 5, wherein the writing the extracted aggregation statistics into an aggregation database comprises:
and writing the extracted aggregation statistical data into the aggregation database in a row manner according to the dimension value combination to be counted.
7. The data aggregation method of claim 1, wherein the aggregation database records data in the form of a flattened table comprising a statistical dimension field and a plurality of time period fields, the records for each statistical dimension corresponding to a different time period being stored in a same row of the flattened table.
8. The data aggregation method of claim 7, wherein the flattened table further comprises a date field, records of the same date and different time periods being stored in the same row of the flattened table;
the method further comprises the steps of:
in response to a statistical data query request for a specified date, aggregate statistical data for the specified date is obtained by a single value query for the date field.
9. The data aggregation method of claim 1, wherein the target event data comprises order event data and the associated event data comprises at least one of commodity link click event data, commodity display event data, and add-to-cart event data.
10. A data aggregation apparatus, wherein the apparatus comprises:
the target data receiving module is used for receiving target event data sent by a target data source;
the delay storage module is used for storing the target event data into a delay queue for waiting if the associated event data corresponding to the target event data are not queried in the cache database;
and the data aggregation module is used for reading the target event data from the delay queue after detecting that the cache database stores the associated event data corresponding to the target event data, performing aggregation statistics on the target event data and the associated event data, and writing the result into the aggregation database.
11. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
12. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any of claims 1 to 9.
CN202310777468.0A 2023-06-28 2023-06-28 Data aggregation method, apparatus and computer readable medium Pending CN116881277A (en)

Publications (1)

Publication Number Publication Date
CN116881277A true CN116881277A (en) 2023-10-13

Family

ID=88267213

Country Status (1)

Country Link
CN (1) CN116881277A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370400A (en) * 2023-12-05 2024-01-09 民航成都信息技术有限公司 Aviation data processing aggregation processing method and device, electronic equipment and medium
CN117370400B (en) * 2023-12-05 2024-02-13 民航成都信息技术有限公司 Aviation data processing aggregation processing method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination