CN112115163A - Data aggregation method and device, storage medium and electronic equipment - Google Patents

Data aggregation method and device, storage medium and electronic equipment

Info

Publication number
CN112115163A
Authority
CN
China
Prior art keywords
data
main table
aggregation
change
main
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910533378.0A
Other languages
Chinese (zh)
Inventor
韦永剑
周涛
王昭博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910533378.0A priority Critical patent/CN112115163A/en
Publication of CN112115163A publication Critical patent/CN112115163A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2358 Change logging, detection, and notification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G06F16/24556 Aggregation; Duplicate elimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]

Abstract

The embodiments of the invention provide a data aggregation method, a data aggregation device, a storage medium and electronic equipment. The method comprises: receiving a change message through an upstream stream processing platform, wherein the change message comprises a target identifier and change data; reading, from a distributed storage system, main table data that has the target identifier as its main key, and aggregating the main table data based on the change data to generate aggregated data; and sending the aggregated data to a downstream stream processing platform. By combining Flink distributed stream processing with the mass storage capacity of a distributed storage system, real-time, high-throughput data aggregation can be achieved while preserving message ordering.

Description

Data aggregation method and device, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a data aggregation method, a data aggregation device, a storage medium and electronic equipment.
Background
With the development of the internet, a search engine is a main means for users to quickly acquire information.
In the e-commerce search field, most of the raw data underlying search is stored in various MySQL tables, and before an index is built the data in these tables needs to be aggregated into one complete record. Search data in the e-commerce field is massive and changes frequently, so it generates a massive volume of real-time messages; aggregating this real-time data with low latency and high throughput is a major challenge. For example, during e-commerce promotions commodity information changes frequently. If aggregation is delayed and messages back up in the face of massive real-time data, the timeliness of the messages cannot be guaranteed and the user experience suffers: when price messages back up, a merchant's promotional price adjustment cannot take effect in real time; when inventory changes are delayed, a user may see a searched commodity shown as out of stock when it is actually in stock.
Therefore, a new data aggregation method, device, storage medium, and electronic device are needed to implement low-latency high-throughput data aggregation of millions of mass data.
The above information disclosed in this background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the present invention provides a data aggregation method, an apparatus, a storage medium, and an electronic device, which implement low-latency high-throughput data aggregation of millions of mass data.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to a first aspect of the present invention, there is provided a data aggregation method, wherein the method comprises: receiving a change message through an upstream stream processing platform, wherein the change message comprises a target identifier and change data; reading main table data with the target identifier as a main key from a distributed storage system, and aggregating the main table data based on the change data to generate aggregated data; and sending the aggregated data to a downstream stream processing platform.
In some exemplary embodiments of the invention, based on the foregoing scheme, the aggregated data comprises: main table aggregated data;
the aggregating the main table data based on the change data to generate aggregated data includes: judging whether the column in a main table of the distributed storage system that corresponds to a change field in the change data is empty; and if the column corresponding to the change field is empty, writing the change field into the column, and generating main table aggregated data based on the main table into which the change field has been written.
In some exemplary embodiments of the invention, based on the foregoing, the method further comprises: if the column in the main table corresponding to the change field in the change data is not empty, comparing the version number of the change field with the version number of the column; if the version number of the change field is not greater than the version number of the column, generating main table aggregated data based on the main table; and if the version number of the change field is greater than the version number of the column, replacing the data of the column with the change field, and generating main table aggregated data based on the replaced main table data.
In some exemplary embodiments of the invention, based on the foregoing scheme, the aggregated data further comprises: composite aggregated data;
the method further comprises the following steps: reading, from an attached table of the distributed storage system, attached table data that takes a specified column in the main table aggregated data as a main key; judging whether the attached table data is empty; and if the attached table data is empty, generating composite aggregated data based on the main table aggregated data.
In some exemplary embodiments of the invention, based on the foregoing, the method further comprises: if the attached table data is not empty, generating composite aggregated data based on the main table aggregated data and the attached table.
In some exemplary embodiments of the present invention, based on the foregoing scheme, the target identifier includes a commodity identifier.
In some exemplary embodiments of the invention, based on the foregoing scheme, the method is performed by a Flink-based data aggregation apparatus.
According to a second aspect of the present invention, there is provided a data aggregation apparatus, wherein the apparatus comprises: a receiving module, configured to receive a change message through an upstream stream processing platform, the change message comprising a target identifier and change data; a generating module, configured to read main table data with the target identifier as a main key from the distributed storage system, and aggregate the main table data based on the change data to generate aggregated data; and a sending module, configured to send the aggregated data to a downstream stream processing platform.
In some exemplary embodiments of the invention, based on the foregoing scheme, the aggregated data comprises: main table aggregated data;
the generating module comprises: a main table judging unit, configured to judge whether the column in a main table of the distributed storage system that corresponds to a change field in the change data is empty; and a first main table aggregation unit, configured to write the change field into the column if the column corresponding to the change field is empty, and generate main table aggregated data based on the main table into which the change field has been written.
In some exemplary embodiments of the present invention, based on the foregoing scheme, the generating module further includes: a comparison unit, configured to compare the version number of the change field with the version number of the column if the column in the main table corresponding to the change field in the change data is not empty; a second main table aggregation unit, configured to generate main table aggregated data based on the main table if the version number of the change field is not greater than the version number of the column; and a third main table aggregation unit, configured to replace the data of the column with the change field and generate main table aggregated data based on the replaced main table data if the version number of the change field is greater than the version number of the column.
In some exemplary embodiments of the invention, based on the foregoing scheme, the aggregated data further comprises: composite aggregated data;
the generating module further comprises: a reading unit, configured to read, from an attached table of the distributed storage system, attached table data that has a specified column of the main table aggregated data as a main key; a composite judging unit, configured to judge whether the attached table data is empty; and a first composite aggregation unit, configured to generate composite aggregated data based on the main table aggregated data if the attached table data is empty.
In some exemplary embodiments of the invention, based on the foregoing scheme, the generating module further includes: a second composite aggregation unit, configured to generate composite aggregated data based on the main table aggregated data and the attached table if the attached table data is not empty.
In some exemplary embodiments of the present invention, based on the foregoing scheme, the target identifier includes a commodity identifier.
In some exemplary embodiments of the invention, based on the foregoing scheme, the apparatus is a Flink-based apparatus.
According to a third aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, performs the method steps as set forth in the first aspect.
According to a fourth aspect of the present invention, there is provided an electronic apparatus, comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method steps as described in the first aspect.
In the embodiment of the invention, a change message is received through an upstream stream processing platform, the change message comprising a target identifier and change data; main table data with the target identifier as a main key is read from a distributed storage system, and the main table data is aggregated based on the change data to generate aggregated data; and the aggregated data is sent to a downstream stream processing platform. By combining Flink distributed stream processing with the mass storage capacity of a distributed storage system, real-time, high-throughput data aggregation can be achieved while preserving message ordering.
In the embodiment of the invention, based on Flink distributed stream processing and the mass storage capacity of a distributed storage system, high-throughput aggregation of not only main table data but also attached table data can be supported. Compared with the prior art, which supports aggregation of main table data only, the composite aggregated data in the embodiment of the invention is more comprehensive and complete.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 is a block diagram illustrating a data aggregation system in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of data aggregation in accordance with an exemplary embodiment;
FIG. 3 is a diagram illustrating a structure of a main table data according to an exemplary embodiment;
FIG. 4 is a diagram illustrating a structure of attached table data in accordance with an exemplary embodiment;
FIG. 5 is a flowchart illustrating a method of aggregating data for a master table in accordance with an exemplary embodiment;
FIG. 6 is a flow diagram illustrating a method for aggregating composite aggregated data in accordance with an exemplary embodiment;
FIG. 7 is a data flow diagram illustrating an exemplary embodiment;
FIG. 8 is a schematic diagram illustrating a structure of a data aggregation apparatus in accordance with an exemplary embodiment;
FIG. 9 is a schematic diagram illustrating a configuration of an electronic device in accordance with an exemplary embodiment;
FIG. 10 is a block diagram illustrating a program product according to one exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The related art provides a real-time data flow for e-commerce search, which comprises a MySQL binlog subscription module, a data aggregation module, a message queue module of a stream processing platform (such as Kafka), an index production module and a search engine service module. A change to commodity information is written into a commodity MySQL table and generates a binlog change message; according to the commodity ID in the change message, the data aggregation module queries the inventory table, the commodity comment table, the price table and so on through SQL statements to aggregate the data, and finally sends the aggregated data to the index module to build an index. This approach has many disadvantages: during large e-commerce promotions, for example, commodities change frequently, the database comes under heavy pressure from massive real-time data, the performance of the aggregation module degrades, and throughput drops severely.
Fig. 1 is a schematic diagram illustrating a data aggregation system according to an exemplary embodiment. As shown in fig. 1, in the embodiment of the present invention, Kafka is taken as an example of the stream processing platform. The source data is stored in different tables of the MySQL database, such as a commodity table, an inventory table, a price table and a shop table. When the data changes, a binlog change message is generated and sent to a data aggregation device (module) through an upstream Kafka. After the data aggregation device (module) aggregates the data, the aggregated data is sent to a downstream Kafka and, through the downstream Kafka, to an index module, which builds a real-time index. It should be noted that the data aggregation device (module) is based on Flink, a distributed processing engine for stream data and batch data that is mainly implemented in Java. For Flink, the main scenario is stream data, and batch data is merely a bounded special case of stream data. In other words, Flink handles all tasks as streams and can thus support high-throughput, low-latency, high-performance stream processing.
A data aggregation method proposed by an embodiment of the present invention is described in detail below with reference to specific embodiments. The data aggregation method provided by the embodiment of the invention is executed by a data aggregation device (module) based on Flink.
Fig. 2 is a flow diagram illustrating a method of data aggregation in accordance with an example embodiment.
As shown in fig. 2, in S210, a change message is received through the upstream stream processing platform, the change message including a target identifier and change data.
In the embodiment of the present invention, the change data may include, but is not limited to: change field, version number of change field. The change field may be a change field corresponding to at least one table of an inventory table, a product table, and a price table in the database, such as a brand title in the product table.
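As a rough illustration of the message shape described above, the following Python sketch models a change message as a target identifier plus a set of changed fields, each carrying its own version number. The class name, field names and the `family:column` naming are assumptions made for illustration; the source does not specify a wire format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ChangeMessage:
    """Hypothetical in-memory form of one change message."""
    # Target identifier, e.g. a commodity (SKU) ID; it is the main key of the main table.
    target_id: str
    # Change data: column name -> (new value, version number of the change field).
    changes: Dict[str, Tuple[str, int]]


# Example: the brand column of the commodity table changed, with version number 7.
msg = ChangeMessage(target_id="sku-1001", changes={"product:brand": ("Acme", 7)})
```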
In the embodiment of the invention, the target identifier can be a commodity identifier.
In the embodiment of the invention, Kafka is taken as an example of the stream processing platform. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of a consumer-scale website.
In S220, the master table data with the target identifier as the master key is read from the distributed storage system, and the master table data is aggregated based on the changed data, so as to generate aggregated data.
In the embodiment of the invention, the distributed storage system can be HBase. HBase is a distributed, column-oriented open source database that is highly reliable, high-performance and scalable, and can support million-level query rates.
In the embodiment of the invention, main table data is stored in HBase. The main table is a table that takes a commodity identifier as its main key and comprises a plurality of rows, so a plurality of pieces of main table data are stored in HBase. When the change message is received, the main table data taking the target identifier as its main key is read from HBase according to the target identifier included in the change message, and the main table data is then aggregated according to the change data.
FIG. 3 is a diagram illustrating a structure of main table data, according to an example embodiment. As shown in fig. 3, the column families (the columns other than the main key) in the main table can be divided into three parts: a commodity table, an inventory table, and a price table. The commodity table corresponds to a plurality of columns, such as the name of the commodity, the identifier of the store, the brand of the commodity, etc.
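To make the layout concrete, here is a minimal Python sketch of one main-table row keyed by a commodity identifier, with the three column families as nested dictionaries. A plain dictionary stands in for the distributed store, and the column names under the commodity family are assumptions; the columns of the inventory and price families are not listed in this passage, so they are left empty.

```python
# Hypothetical in-memory stand-in for the main table: main key -> column families.
main_table = {
    "sku-1001": {
        "product": {"name": "Kettle", "shop_id": "shop-9", "brand": "Acme"},
        "inventory": {},  # columns not specified in the text
        "price": {},      # columns not specified in the text
    }
}


def read_main_row(table, target_id):
    """Return the row stored under target_id, or an all-empty row if none exists."""
    empty_row = {"product": {}, "inventory": {}, "price": {}}
    return table.get(target_id, empty_row)
```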
In an embodiment of the present invention, the aggregated data includes main table aggregated data. The main table aggregated data has the same structure as the main table and is generated by aggregating the main table data. When aggregating the main table data, the data aggregation module judges whether the column in the main table read from HBase that corresponds to the change field in the change data is empty:
1. If the column corresponding to the change field in the change data is empty, the change field is written into the column, and main table aggregated data is generated based on the main table into which the change field has been written.
2. If the column corresponding to the change field in the change data is not empty, the version number of the change field is compared with the version number of the column. It should be noted that this comparison checks the ordering of the change, so that a delayed change message does not overwrite newer main table data.
1) If the version number of the change field is not greater than the version number of the column, main table aggregated data is generated based on the main table.
2) If the version number of the change field is greater than the version number of the column, the data of the column is replaced with the change field, and main table aggregated data is generated based on the replaced main table data.
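The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a row is modelled as a dictionary from column name to a `(value, version)` pair or `None` when empty, the function name is hypothetical, and writing the result back to the distributed store is omitted.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[str, int]            # (value, version number)
Row = Dict[str, Optional[Cell]]   # column name -> cell, or None when the column is empty


def aggregate_main_row(row: Row, column: str, value: str, version: int) -> Row:
    """Apply one change field to a main-table row and return the aggregated row."""
    merged = dict(row)            # work on a copy so the input row is left untouched
    current = merged.get(column)
    if current is None:
        # Case 1: the column is empty, so the change field is written.
        merged[column] = (value, version)
    elif version > current[1]:
        # Case 2.2: the change is newer than the stored column, so it replaces it.
        merged[column] = (value, version)
    # Case 2.1: the change is not newer (stale or duplicate), so the row is kept as-is.
    return merged
```

A stale message therefore leaves the row unchanged, while the aggregated row is still returned so that it can be forwarded downstream.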
In the embodiment of the present invention, HBase further stores attached table data. An attached table is a table whose main key is a certain column (a non-main-key column) of the main table. It should be noted that not every column in the main table has an attached table.
Fig. 4 is a diagram illustrating a structure of attached table data according to an exemplary embodiment. As shown in fig. 4, the attached table uses the store identifier as its main key, and its column family is a store table comprising a plurality of columns, such as: store name, store identification, brand of goods, etc.
In this embodiment of the present invention, the aggregated data may further include composite aggregated data. The composite aggregated data comprises a main table part and an attached table part, wherein the main table part is the main table aggregated data and the attached table part is an attached table of the main table.
In the embodiment of the invention, after the main table aggregated data is generated, the attached table data taking the specified column in the main table aggregated data as its main key is read from the attached table in HBase, and whether the attached table data is empty is judged. If the attached table data is empty, composite aggregated data is generated based on the main table aggregated data. If the attached table data is not empty, composite aggregated data is generated based on the main table aggregated data and the attached table.
Since the main table aggregated data is the same as the main table data in HBase, the attached table data whose main key is the specified column of the main table aggregated data may be read directly from the attached table in HBase or, equivalently, the attached table data whose main key is the specified column of the main table may be read.
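A minimal Python sketch of the composite step follows, assuming the specified column is a store identifier named `shop_id` and that a dictionary stands in for the attached table in the distributed store. The function name and the `main`/`attached` keys of the result are hypothetical.

```python
def build_composite(main_agg, attached_table, key_column="shop_id"):
    """Combine main table aggregated data with its attached-table row, if any."""
    key = main_agg.get(key_column)           # value of the specified column
    attached = attached_table.get(key) or {}  # look up the attached row by that key
    # The attached data counts as empty when no row exists or every column is empty.
    is_empty = all(value is None for value in attached.values())
    return {
        "main": main_agg,
        "attached": {} if is_empty else dict(attached),
    }


# Usage with a hypothetical store table keyed by store identifier.
shops = {"shop-9": {"shop_name": "Acme Store"}}
with_shop = build_composite({"shop_id": "shop-9", "price": "9.90"}, shops)
without_shop = build_composite({"shop_id": "shop-0"}, shops)
```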
In S230, the aggregated data is sent to a downstream stream processing platform.
In the embodiment of the invention, the data aggregation module sends the aggregated data to a downstream stream processing platform, such as Kafka, and the downstream Kafka forwards the aggregated data to the index module to build the real-time index.
In the embodiment of the invention, a change message is received through an upstream stream processing platform, the change message comprising a target identifier and change data; main table data with the target identifier as a main key is read from a distributed storage system, and the main table data is aggregated based on the change data to generate aggregated data; and the aggregated data is sent to a downstream stream processing platform. By combining Flink distributed stream processing with the mass storage capacity of a distributed storage system, real-time, high-throughput data aggregation can be achieved while preserving message ordering.
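The three steps S210 to S230 can be traced end to end with a small single-process Python sketch. It deliberately swaps the real components for stand-ins: a `deque` plays the upstream platform, a dictionary plays the distributed store, and a list plays the downstream platform. None of this uses the Flink, Kafka or HBase APIs, and the tuple layout of a message is an assumption.

```python
from collections import deque


def run_pipeline(upstream, store, downstream):
    """Drain upstream, aggregate each change into store, and emit rows downstream."""
    while upstream:
        # S210: receive a change message (target id, changed column, value, version).
        target_id, column, value, version = upstream.popleft()
        # S220: read the main-table row by main key and aggregate the change into it.
        row = store.setdefault(target_id, {})
        current = row.get(column)
        if current is None or version > current[1]:
            row[column] = (value, version)
        # S230: send a snapshot of the aggregated row downstream.
        downstream.append((target_id, dict(row)))


upstream = deque([
    ("sku-1", "price", "9.90", 1),
    ("sku-1", "stock", "5", 1),
    ("sku-1", "price", "9.50", 2),
])
store = {}
downstream = []
run_pipeline(upstream, store, downstream)
```

Each emitted record is the full aggregated row rather than the single changed field, which matches the idea that the index module receives one complete record per commodity.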
The following describes in detail an aggregation process of main table aggregated data according to an embodiment of the present invention, with reference to a specific application scenario.
Fig. 5 is a flowchart illustrating a method for aggregating data of a master table according to an exemplary embodiment.
As shown in fig. 5, the method may include, but is not limited to, the following steps:
S501, reading main table data with a target identifier as a main key from the distributed storage system.
The target identifier is, for example, the ID of a commodity, that is, the ID of a SKU. The distributed storage system stores a main table with the SKU ID as its main key; upon receiving the change message, the main table data with that SKU ID as its main key is read from the distributed storage system based on the SKU ID in the change message.
S502, judging whether the column corresponding to the change field in the change data in the main table is empty or not.
It should be noted that each column in the main table stored in the distributed storage system may be empty.
If so, go to S503, otherwise, go to S504.
S503, writing the change field into the column, and generating main table aggregated data based on the main table into which the change field has been written.
For example, with the main table structure shown in fig. 3: if the price column is empty and the change field is a price field, the price in the change field is written into the main table data, and the main table aggregated data is generated based on the main table with the price written.
Note that the main table aggregate data has the same data structure as the main table.
S504, comparing the version number of the change field with the version number of the column.
S505, if the version number of the change field is not greater than the version number of the column, generating main table aggregated data based on the main table.
For example, with the main table structure shown in fig. 3, if the change field is a price field, the version number (e.g., offset) of the change field is compared with the version number (offset) of the price column family in the main table data. If the version number of the change field is not greater than (that is, equal to or less than) the version number of the price column family, this indicates that the price in the main table is the same as the price in the change field, or is a newer version than the price in the change field. The price in the main table data therefore does not need to be changed, and the main table aggregated data is generated directly based on the main table data.
S506, if the version number of the change field is greater than the version number of the column, replacing the data of the column with the change field, and generating main table aggregated data based on the replaced main table data.
For example, with the main table structure shown in fig. 3, if the version number of the change field is greater than the version number of the price column family in the main table data, this indicates that the price in the change field is a newer version than the price in the main table and is the latest price. The price in the original main table data is replaced with the price in the change field, and the main table aggregated data is generated based on the replaced main table data.
In the embodiment of the invention, the Flink distributed streaming processing and the mass storage capacity of the distributed storage system are utilized, and the real-time high-throughput data aggregation of the main table data can be realized under the condition of ensuring the message time sequence.
The following describes in detail the aggregation process of the composite aggregation data according to the embodiment of the present invention with reference to a specific application scenario.
Fig. 6 is a flowchart illustrating an aggregation method for composite aggregated data according to an exemplary embodiment.
As shown in fig. 6, the method may include, but is not limited to, the following steps:
S601, reading, from the attached table of the distributed storage system, the attached table data that uses the specified column in the main table aggregated data as its main key.
It should be noted that the main table aggregated data matches the main table data stored in the distributed storage system, and therefore the attached table data keyed by the specified column of the main table aggregated data is the attached table data keyed by the specified column of the main table.
S602, judging whether the attached table data is empty or not.
It should be noted that each column in the attached table stored in the distributed storage system may be empty. When every column in the attached table is empty, the attached table data is empty; if at least one column in the attached table is not empty, the attached table data is not empty.
If the attached table data is empty, S603 is executed, and if the attached table data is not empty, S604 is executed.
S603, generating composite aggregation data based on the main table aggregation data.
For example, in the schematic structure diagram of the attached table shown in fig. 4, if all the column groups in the attached table are empty, the composite aggregation data is generated directly based on the main table aggregation data, and all the column groups (non-main keys) of the attached table in the composite aggregation data are empty.
S604, generating composite aggregation data based on the main table aggregation data and the attached table.
For example, in the schematic structure diagram of the attached table shown in fig. 4, if not all the column groups in the attached table are empty, that is, at least one column group is not empty, the composite aggregation data is generated based on the main table aggregation data and the attached table.
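By way of illustration only, steps S601 to S604 can be read as a left join of the main table aggregation data against the attached table. The sketch below assumes an in-memory dictionary in place of the attached table of the distributed storage system. The function name `left_join` and the column names in the usage example (`vendor_id`, `brand`, `category`) are hypothetical and do not appear in the embodiment.

```python
from typing import Any, Dict, List


def left_join(main_agg: Dict[str, Any],
              attached_table: Dict[Any, Dict[str, Any]],
              key_column: str,
              attached_columns: List[str]) -> Dict[str, Any]:
    """Build composite aggregation data from main table aggregation data."""
    # S601: read the attached row keyed by the specified column.
    attached_row = attached_table.get(main_agg.get(key_column)) or {}
    # S602: the attached data is empty when every column is empty.
    is_empty = all(attached_row.get(c) is None for c in attached_columns)
    composite = dict(main_agg)
    for column in attached_columns:
        if is_empty:
            # S603: keep the main data, leave attached columns empty.
            composite[column] = None
        else:
            # S604: copy whatever the attached table holds for this column.
            composite[column] = attached_row.get(column)
    return composite


attached = {"v1": {"brand": "Acme", "category": None}}
main = {"sku_id": "1", "vendor_id": "v1", "price": 10}
print(left_join(main, attached, "vendor_id", ["brand", "category"]))
```

In both branches every main table column is preserved, which is what distinguishes this left join from an inner join that would drop rows without attached data.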
In the embodiment of the invention, based on Flink distributed streaming processing and the mass storage capacity of a distributed storage system, high-throughput aggregation is supported not only for main table data but also for attached table data. Compared with the prior art, which supports aggregation of main table data only, the composite aggregated data in the embodiment of the invention is more comprehensive and complete.
The data aggregation method proposed by the present invention will be described in detail below with reference to specific examples.
Fig. 7 is a data flow diagram illustrating an exemplary embodiment. As shown in fig. 7, the data aggregation apparatus (module) is based on the Flink framework and includes a Kafka source section, an operator merge section, an operator leftjoin section, and a Kafka sink section. The Kafka source section is operable to receive change messages, such as binlog messages, generated by a data source. The change messages flow into the operator merge section. The operator merge mainly performs data aggregation on the main table in the Hbase according to the ID of the SKU in the change message to generate main table aggregation data. The generated main table aggregation data flows into the operator leftjoin section. The operator leftjoin performs data aggregation on the attached table in the Hbase according to the specified column in the main table aggregation data to generate composite aggregation data. The generated composite aggregation data flows into the Kafka sink section and is sent to the downstream Kafka via the Kafka sink.
It should be noted that the main table aggregation data generated by the operator merge may alternatively flow directly to the Kafka sink. In that case the operator leftjoin is not needed to aggregate the attached table in the Hbase according to the specified column in the main table aggregation data and generate composite aggregation data.
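The data flow of fig. 7 can be approximated, purely for illustration, by chaining Python generators. This sketch does not use the Flink, Kafka or Hbase APIs: lists and dictionaries stand in for the source, the sink and the two tables, the version check of the merge step is omitted for brevity, and all function names, the message layout (`sku_id`, `fields`) and the `use_leftjoin` flag are assumptions.

```python
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def merge_stage(changes: Iterable[Record],
                main_table: Dict[str, Record]) -> Iterator[Record]:
    """Stand-in for the operator merge: update the main row keyed by SKU ID."""
    for msg in changes:
        row = main_table.setdefault(msg["sku_id"], {"sku_id": msg["sku_id"]})
        row.update(msg["fields"])  # version comparison omitted in this sketch
        yield dict(row)            # main table aggregation data


def leftjoin_stage(rows: Iterable[Record],
                   attached_table: Dict[Any, Record],
                   key_column: str) -> Iterator[Record]:
    """Stand-in for the operator leftjoin: enrich with attached table columns."""
    for row in rows:
        extra = attached_table.get(row.get(key_column), {})
        yield {**row, **extra}     # composite aggregation data


def run_pipeline(messages: Iterable[Record],
                 main_table: Dict[str, Record],
                 attached_table: Dict[Any, Record],
                 key_column: str,
                 sink: List[Record],
                 use_leftjoin: bool = True) -> None:
    """Source -> merge -> (optional leftjoin) -> sink."""
    stream = merge_stage(messages, main_table)
    if use_leftjoin:
        stream = leftjoin_stage(stream, attached_table, key_column)
    for record in stream:
        sink.append(record)        # stand-in for sending to the downstream Kafka


out: List[Record] = []
run_pipeline([{"sku_id": "1", "fields": {"price": 10, "vendor_id": "v1"}}],
             {}, {"v1": {"brand": "Acme"}}, "vendor_id", out)
print(out)
```

Setting `use_leftjoin=False` models the variant in which the main table aggregation data flows directly to the sink.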
It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. In the following description of the apparatus, the same parts as those of the foregoing method will not be described again.
Fig. 8 is a schematic structural diagram illustrating a data aggregation apparatus according to an exemplary embodiment, and as shown in fig. 8, the data aggregation apparatus 800 may include:
a receiving module 810, configured to receive a change message through an upstream kafka, where the change message includes a target identifier and change data;
a generating module 820, configured to read, from the Hbase, main table data with the target identifier as a main key, and aggregate the main table data based on the change data to generate aggregated data;
a sending module 830, configured to send the aggregated data to a downstream kafka.
In the embodiment of the invention, a change message is received through an upstream kafka, the change message comprising a target identifier and change data; main table data with the target identifier as a main key is read from the Hbase, and the main table data is aggregated based on the change data to generate aggregated data; and the aggregated data is sent to a downstream kafka. By utilizing Flink distributed streaming processing and the mass storage capacity of the Hbase, real-time high-throughput data aggregation can be realized while the message time sequence is ensured.
The specific details of each module of the data aggregation apparatus have been described in detail in the corresponding data aggregation method, and therefore are not described herein again.
It should be noted that although several modules or units of the apparatus are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 900 according to this embodiment of the invention is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
The storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform steps according to various exemplary embodiments of the present invention described in the above section "exemplary methods" of the present specification. For example, the processing unit 910 may execute the steps shown in fig. 2: S210, receiving a change message through an upstream stream processing platform, the change message including a target identifier and change data; S220, reading main table data with the target identifier as a main key from the distributed storage system, and aggregating the main table data based on the change data to generate aggregated data; and S230, sending the aggregated data to a downstream stream processing platform.
The storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access memory unit (RAM) 9201 and/or a cache memory unit 9202, and may further include a read only memory unit (ROM) 9203.
Storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 930 can be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 970 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
Referring to fig. 10, a program product 1000 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (10)

1. A method for data aggregation, the method comprising:
receiving a change message through an upstream stream processing platform, wherein the change message comprises a target identifier and change data;
reading main table data with the target identifier as a main key from a distributed storage system, and aggregating the main table data based on the change data to generate aggregated data;
sending the aggregated data to a downstream stream processing platform.
2. The method of claim 1, wherein the aggregated data comprises: main table aggregated data;
the aggregating the main table data based on the change data to generate aggregated data includes:
judging whether a column corresponding to a change field in the change data in a main table of the distributed storage system is empty or not;
and if the column corresponding to the change field in the change data in the main table is empty, writing the change field into the column, and generating main table aggregated data based on the main table written with the change field.
3. The method of claim 2, wherein the method further comprises:
if the column corresponding to the change field in the change data in the main table is not empty, comparing the version number of the change field with the version number of the column;
if the version number of the change field is not larger than the version number of the column, generating main table aggregation data based on the main table;
and if the version number of the change field is greater than the version number of the column, replacing the data of the column with the change field, and generating main table aggregated data based on the replaced main table data.
4. The method of claim 3, wherein the aggregated data further comprises: composite aggregated data;
the method further comprises the following steps:
reading attached table data taking a specified column in the main table aggregated data as a main key from an attached table of the distributed storage system;
judging whether the attached table data is empty or not;
and if the attached table data is empty, generating composite aggregated data based on the main table aggregated data.
5. The method of claim 4, wherein the method further comprises:
and if the attached table data is not empty, generating composite aggregated data based on the main table aggregated data and the attached table.
6. The method of claim 1, wherein the target identification comprises a merchandise identification.
7. The method according to any of claims 1-6, wherein the method is performed by a Flink-based data aggregation apparatus.
8. A data aggregation apparatus, characterized in that the apparatus comprises:
the receiving module is used for receiving a change message through an upstream stream processing platform, and the change message comprises a target identifier and change data;
the generating module is used for reading main table data with the target identifier as a main key from the distributed storage system, and aggregating the main table data based on the change data to generate aggregated data;
and the sending module is used for sending the aggregated data to a downstream stream processing platform.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
10. An electronic device, comprising: one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method steps of any of claims 1-7.
CN201910533378.0A 2019-06-19 2019-06-19 Data aggregation method and device, storage medium and electronic equipment Pending CN112115163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910533378.0A CN112115163A (en) 2019-06-19 2019-06-19 Data aggregation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910533378.0A CN112115163A (en) 2019-06-19 2019-06-19 Data aggregation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112115163A true CN112115163A (en) 2020-12-22

Family

ID=73795595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910533378.0A Pending CN112115163A (en) 2019-06-19 2019-06-19 Data aggregation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112115163A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821245A (en) * 2023-07-05 2023-09-29 贝壳找房(北京)科技有限公司 Data aggregation synchronization method and storage medium in distributed scene

Similar Documents

Publication Publication Date Title
CN109471863B (en) Information query method and device based on distributed database and electronic equipment
US10038619B2 (en) Providing a monitoring service in a cloud-based computing environment
CN109189835B (en) Method and device for generating data wide table in real time
US20210263906A1 (en) Recreating an oltp table and reapplying database transactions for real-time analytics
US20180075125A1 (en) Data partitioning and parallelism in a distributed event processing system
US20140279074A1 (en) Data management platform for digital advertising
JP2017530469A (en) Enriching events with dynamically typed big data for event processing
US10877971B2 (en) Logical queries in a distributed stream processing system
CN112711581B (en) Medical data checking method and device, electronic equipment and storage medium
CN111127051B (en) Multi-channel dynamic attribution method, device, server and storage medium
CN111562885A (en) Data processing method and device, computer equipment and storage medium
US11048703B2 (en) Minimizing processing using an index when non leading columns match an aggregation key
US9734178B2 (en) Searching entity-key associations using in-memory objects
EP3182298B1 (en) Smart elastic scaling based on application scenarios
US20230360083A1 (en) Deep learning-based revenue-per-click prediction model framework
US10244017B2 (en) Processing of streaming data with a keyed join
US11935004B2 (en) Reading and writing processing improvements as a single command
CN114282129A (en) Information system page generation method, system, electronic equipment and storage medium
CN112115163A (en) Data aggregation method and device, storage medium and electronic equipment
US10262061B2 (en) Hierarchical data classification using frequency analysis
CN111010453A (en) Service request processing method, system, electronic device and computer readable medium
CN111949678A (en) Method and device for processing non-accumulation indexes across time windows
US8645377B2 (en) Aggregating data from a work queue
CN112115113B (en) Data storage system, method, device, equipment and storage medium
EP3667602A1 (en) Multi-factor routing system for exchanging business transactions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination