CN112115163A

CN112115163A - Data aggregation method and device, storage medium and electronic equipment

Info

Publication number: CN112115163A
Application number: CN201910533378.0A
Authority: CN
Inventors: 韦永剑; 周涛; 王昭博
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-06-19
Filing date: 2019-06-19
Publication date: 2020-12-22

Abstract

The embodiment of the invention provides a data aggregation method, a data aggregation device, a storage medium and electronic equipment, wherein the method comprises the following steps: receiving a change message through an upstream stream processing platform, wherein the change message comprises a target identifier and change data; reading main table data with the target identifier as a primary key from a distributed storage system, and aggregating the main table data based on the change data to generate aggregated data; and sending the aggregated data to a downstream stream processing platform. By using Flink distributed stream processing together with the mass storage capacity of the distributed storage system, real-time, high-throughput data aggregation can be realized while the time ordering of messages is guaranteed.

Description

Data aggregation method and device, storage medium and electronic equipment

Technical Field

The invention relates to the technical field of computers, in particular to a data aggregation method, a data aggregation device, a storage medium and electronic equipment.

Background

With the development of the internet, search engines have become a main means for users to quickly acquire information.

In the e-commerce search field, most of the raw data underlying search is stored in various MySQL tables, and before an index is built, the data in these tables needs to be aggregated into a complete record. Search data in the e-commerce field is massive and changes frequently, so a massive volume of real-time messages is generated, and aggregating this real-time data with low latency and high throughput is a major challenge. For example, during an e-commerce promotion, commodity information changes frequently. If aggregation is delayed and messages back up in the face of massive real-time data, the timeliness of the messages cannot be guaranteed and the user experience suffers: with a backlog of price messages, a merchant's promotional price adjustment cannot take effect in real time, and with delayed inventory changes, a user may see a searched commodity shown as out of stock when it is actually in stock.

Therefore, a new data aggregation method, device, storage medium, and electronic device are needed to implement low-latency, high-throughput aggregation of massive data at the scale of millions of records.

The above information disclosed in this background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.

Disclosure of Invention

In view of this, the present invention provides a data aggregation method, an apparatus, a storage medium, and an electronic device, which implement low-latency, high-throughput aggregation of massive data at the scale of millions of records.

Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.

According to a first aspect of the present invention, there is provided a data aggregation method, wherein the method comprises: receiving a change message through an upstream stream processing platform, wherein the change message comprises a target identifier and change data; reading main table data with the target identifier as a primary key from a distributed storage system, and aggregating the main table data based on the change data to generate aggregated data; and sending the aggregated data to a downstream stream processing platform.

In some exemplary embodiments of the invention, based on the foregoing scheme, the aggregated data comprises: main table aggregated data;

the aggregating the main table data based on the changed data to generate aggregated data includes: judging whether a column corresponding to a change field in the change data in a main table of the distributed storage system is empty or not; and if the column corresponding to the changed field in the changed data in the main table is empty, writing the changed field into the column, and generating main table aggregated data based on the main table written with the changed field.

In some exemplary embodiments of the invention, based on the foregoing, the method further comprises: if the column corresponding to the change field in the change data in the main table is not empty, comparing the version number of the change field with the version number of the column; if the version number of the change field is not larger than the version number of the column, generating main table aggregation data based on the main table; and if the version number of the change field is greater than the version number of the column, replacing the data of the column with the change field, and generating main table aggregated data based on the replaced main table data.

In some exemplary embodiments of the invention, based on the foregoing scheme, the aggregated data further comprises: composite aggregated data;

the method further comprises the following steps: reading, from an attached table of the distributed storage system, attached table data that takes a specified column in the main table aggregated data as its primary key; determining whether the attached table data is empty; and if the attached table data is empty, generating composite aggregated data based on the main table aggregated data.

In some exemplary embodiments of the invention, based on the foregoing, the method further comprises: if the attached table data is not empty, generating composite aggregated data based on the main table aggregated data and the attached table.

In some exemplary embodiments of the present invention, based on the foregoing scheme, the target identifier includes a commodity identifier.

In some exemplary embodiments of the invention, based on the foregoing scheme, the method is performed by a Flink-based data aggregation apparatus.

According to a second aspect of the present invention, there is provided a data aggregation apparatus, wherein the apparatus comprises: a receiving module, configured to receive a change message through an upstream stream processing platform, the change message comprising a target identifier and change data; a generating module, configured to read main table data with the target identifier as a primary key from the distributed storage system, and aggregate the main table data based on the change data to generate aggregated data; and a sending module, configured to send the aggregated data to a downstream stream processing platform.

the generation module comprises: a main table judging unit, configured to judge whether a column corresponding to a change field in the change data in a main table of the distributed storage system is empty; and the first main table aggregation unit is used for writing the change fields into columns in the main table if the columns corresponding to the change fields in the change data are empty, and generating main table aggregation data based on the main table written with the change fields.

In some exemplary embodiments of the present invention, based on the foregoing scheme, the generating module further includes: the comparison module is used for comparing the version number of the change field with the version number of the column if the column corresponding to the change field in the change data in the main table is not empty; a second main table aggregation unit, configured to generate main table aggregation data based on the main table if the version number of the change field is not greater than the version number of the column; and the third main table aggregation unit is used for replacing the data of the column with the change field and generating main table aggregation data based on the replaced main table data if the version number of the change field is greater than the version number of the column.

the generation module further comprises: a reading unit, configured to read, from an attached table of the distributed storage system, attached table data that takes a specified column of the main table aggregated data as its primary key; a composite judgment unit, configured to determine whether the attached table data is empty; and a first composite aggregation unit, configured to generate composite aggregated data based on the main table aggregated data if the attached table data is empty.

In some exemplary embodiments of the invention, based on the foregoing scheme, the generating module further includes: a second composite aggregation unit, configured to generate composite aggregated data based on the main table aggregated data and the attached table if the attached table data is not empty.

In some exemplary embodiments of the invention, based on the foregoing scheme, the apparatus is a Flink-based apparatus.

According to a third aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, performs the method steps as set forth in the first aspect.

According to a fourth aspect of the present invention, there is provided an electronic apparatus, comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method steps as described in the first aspect.

In the embodiment of the invention, a change message is received through an upstream stream processing platform, the change message comprising a target identifier and change data; main table data with the target identifier as a primary key is read from a distributed storage system, and the main table data is aggregated based on the change data to generate aggregated data; and the aggregated data is sent to a downstream stream processing platform. By using Flink distributed stream processing together with the mass storage capacity of the distributed storage system, real-time, high-throughput data aggregation can be realized while the time ordering of messages is guaranteed.

In the embodiment of the invention, based on Flink distributed streaming processing and mass storage capacity of a distributed storage system, not only can high-throughput aggregation of main table data be supported, but also high-throughput aggregation of auxiliary table data can be supported, and compared with the prior art that only aggregation of main table data is supported, the composite aggregated data in the embodiment of the invention is more comprehensive and complete.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.

FIG. 1 is a block diagram illustrating a data aggregation system in accordance with an exemplary embodiment;

FIG. 2 is a flow diagram illustrating a method of data aggregation in accordance with an exemplary embodiment;

FIG. 3 is a diagram illustrating a structure of a main table data according to an exemplary embodiment;

FIG. 4 is a diagram illustrating a structure of additional table data in accordance with an exemplary embodiment;

FIG. 5 is a flowchart illustrating a method of aggregating data for a master table in accordance with an exemplary embodiment;

FIG. 6 is a flow diagram illustrating a method for aggregating composite aggregated data in accordance with an exemplary embodiment;

FIG. 7 is a data flow diagram illustrating an exemplary embodiment;

FIG. 8 is a schematic diagram illustrating a structure of a data aggregation apparatus in accordance with an exemplary embodiment;

FIG. 9 is a schematic diagram illustrating a configuration of an electronic device in accordance with an exemplary embodiment;

FIG. 10 is a block diagram illustrating a program product according to one exemplary embodiment.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

The related art provides a real-time data flow for e-commerce search, which comprises a MySQL database binlog subscription module, a data aggregation module, a message queue module of a stream processing platform (such as Kafka), an index production module and a search engine service module. When commodity information changes, the change is written into a commodity MySQL table and a binlog change message is generated; the data aggregation module then queries the inventory table, the commodity comment table, the price table and so on through SQL statements according to the commodity ID in the change message, thereby aggregating the data, and finally sends the aggregated data to the index module to construct an index. This aggregation approach has many disadvantages. For example, during major e-commerce promotions, commodities change frequently, and in the face of massive real-time data the database comes under heavy pressure, the performance of the aggregation module degrades, and throughput drops severely.

Fig. 1 is a schematic diagram illustrating a data aggregation system according to an exemplary embodiment. As shown in fig. 1, in the embodiment of the present invention, Kafka is taken as an example of the stream processing platform. The data sources are stored in different tables of a MySQL database, such as a commodity table, an inventory table, a price table and a shop table. When the data changes, a binlog change message is generated and sent to a data aggregation device (module) through an upstream Kafka; after the data aggregation device (module) aggregates the data, the aggregated data is sent to a downstream Kafka and, through the downstream Kafka, to an index module, which constructs a real-time index. It should be noted that the data aggregation device (module) is based on Flink, a distributed processing engine for stream data and batch data that is implemented mainly in Java. For Flink, the main scenario is stream data, and batch data is only a limiting special case of stream data. In other words, Flink handles all tasks as streams and can thus support high-throughput, low-latency, high-performance stream processing.

A data aggregation method proposed by an embodiment of the present invention is described in detail below with reference to specific embodiments. The data aggregation method provided by the embodiment of the invention is executed by a data aggregation device (module) based on Flink.

Fig. 2 is a flow diagram illustrating a method of data aggregation in accordance with an example embodiment.

As shown in fig. 2, in S210, a change message is received through the upstream stream processing platform, the change message including a target identifier and change data.

In the embodiment of the present invention, the change data may include, but is not limited to: change field, version number of change field. The change field may be a change field corresponding to at least one table of an inventory table, a product table, and a price table in the database, such as a brand title in the product table.

In the embodiment of the invention, the target identifier can be a commodity identifier.
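As an illustration of the message structure described above, the following minimal Python sketch models a change message as a target identifier plus change data, where each change field carries its own version number. The class name `ChangeMessage`, the payload keys (`sku_id`, `fields`, `column`, `value`, `version`) and the `family:column` naming are assumptions made for this example; the embodiment does not specify a wire format.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class ChangeMessage:
    """One change message: a target identifier plus the change data."""
    target_id: str  # e.g. a commodity (SKU) identifier
    # Change data: column name -> (new value, version number of the change field).
    changes: Dict[str, Tuple[str, int]] = field(default_factory=dict)


def parse_change_message(raw: dict) -> ChangeMessage:
    """Build a ChangeMessage from a decoded payload (hypothetical layout)."""
    changes = {
        item["column"]: (item["value"], int(item["version"]))
        for item in raw["fields"]
    }
    return ChangeMessage(target_id=str(raw["sku_id"]), changes=changes)


# Example: a price change for one commodity.
msg = parse_change_message({
    "sku_id": 1001,
    "fields": [{"column": "price:price", "value": "19.90", "version": 42}],
})
```

Keeping the version number next to each change field, rather than once per message, matches the per-column version comparison that the aggregation step performs later.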

In the embodiment of the invention, Kafka is taken as an example of the stream processing platform. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of a consumer-scale website.

In S220, the main table data with the target identifier as the primary key is read from the distributed storage system, and the main table data is aggregated based on the change data, so as to generate aggregated data.

In the embodiment of the invention, the distributed storage system can be HBase. HBase is a distributed, column-oriented open source database that is highly reliable, high-performance and scalable, and that can support query rates at the million level.

In the embodiment of the invention, main table data is stored in Hbase, the main table data is a table which takes a commodity identifier as a main key and comprises a plurality of rows, and a plurality of main table data are stored in Hbase. And when the change message is received, reading the main table data taking the target identifier as a main key from the Hbase according to the target identifier included in the change message, and then aggregating the main table data according to the change data.

FIG. 3 is a diagram illustrating a structure of main table data, according to an example embodiment. As shown in fig. 3, the column families of the main table (the columns other than the primary key) can be divided into three parts: a commodity table, an inventory table, and a price table. The commodity table corresponds to a plurality of columns, such as the commodity name, the store identifier, the commodity brand, and so on.
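To make the layout concrete, the sketch below represents one main table row as a nested Python dictionary: column family, then column, then a cell holding a value and its version number, with `None` standing for an empty column. This is an in-memory stand-in only, not HBase client code, and the column names `stock` and `price` and the row key `sku-1001` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Cell = Optional[Tuple[str, int]]   # (value, version number), or None if the column is empty
Row = Dict[str, Dict[str, Cell]]   # column family -> column -> cell


def empty_main_row() -> Row:
    """Return a main table row in which every column is still empty."""
    return {
        "commodity": {"name": None, "store_id": None, "brand": None},
        "inventory": {"stock": None},   # hypothetical column name
        "price": {"price": None},       # hypothetical column name
    }


def empty_columns(row: Row) -> List[str]:
    """List the 'family:column' names whose cells are empty."""
    return sorted(
        f"{family}:{column}"
        for family, columns in row.items()
        for column, cell in columns.items()
        if cell is None
    )


# The main table is keyed by the commodity identifier (the primary key).
main_table: Dict[str, Row] = {"sku-1001": empty_main_row()}
main_table["sku-1001"]["commodity"]["store_id"] = ("store-7", 3)
```

Because any column may be empty until its first change message arrives, the aggregation step has to distinguish an empty column from one that already holds a versioned value.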

In an embodiment of the present invention, the aggregated data includes main table aggregated data. The main table aggregated data has the same structure as the main table and is generated by aggregating the main table data. When the main table data is aggregated, it is determined whether the column corresponding to the change field in the change data is empty in the main table that the data aggregation module has read from HBase:

1. If the column corresponding to the change field in the change data is empty in the main table, the change field is written into the column, and main table aggregated data is generated based on the main table into which the change field has been written.

2. If the column corresponding to the change field in the change data is not empty in the main table, the version number of the change field is compared with the version number of the column. It should be noted that this comparison checks the ordering of the change field, so that a delayed change message cannot overwrite newer main table data.

1) If the version number of the change field is not greater than the version number of the column, main table aggregated data is generated based on the main table.

2) If the version number of the change field is greater than the version number of the column, the data of the column is replaced with the change field, and main table aggregated data is generated based on the replaced main table data.
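The three cases above can be sketched as a single merge function. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the main table row is modelled as a flat dictionary from column name to a `(value, version)` pair, an empty column is `None`, and the function name `merge_change` is hypothetical.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[str, int]  # (value, version number)


def merge_change(row: Dict[str, Optional[Cell]],
                 column: str, value: str, version: int) -> Dict[str, Optional[Cell]]:
    """Apply one change field to a main table row and return the aggregated row."""
    current = row.get(column)
    if current is None:
        # Case 1: the column is empty, so the change field is written directly.
        row[column] = (value, version)
    elif version > current[1]:
        # Case 2.2: the change field is newer, so it replaces the column data.
        row[column] = (value, version)
    # Case 2.1: the change field is not newer; the row is left unchanged.
    # In every case the aggregated data has the same structure as the row.
    return dict(row)


row = {"price": None, "stock": ("5", 10)}
after_first_price = merge_change(row, "price", "19.90", 7)   # empty column: written
after_stale_stock = merge_change(row, "stock", "4", 9)       # older version: ignored
after_new_stock = merge_change(row, "stock", "3", 11)        # newer version: replaces
```

Note that aggregated data is produced in all three cases, including the stale one, which follows the text: a delayed message still yields output, but that output reflects the newer stored value.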

In the embodiment of the present invention, HBase further stores attached table data. An attached table is a table whose primary key is a certain column (not the primary key) of the main table. It should be noted that not every column in the main table has an attached table.

Fig. 4 is a diagram illustrating a structure of additional table data according to an exemplary embodiment. As shown in fig. 4, the attached table uses the store identifier as a main key, and the column family is a store table comprising a plurality of columns, such as: store name, store identification, brand of goods, etc.

In this embodiment of the present invention, the aggregated data may further include composite aggregated data. The composite aggregated data comprises a main table part and an attached table part, wherein the main table part is the main table aggregated data, and the attached table part is an attached table of the main table.

In the embodiment of the invention, after main table aggregation data are generated, the auxiliary table data taking the specified column in the main table aggregation data as the main key are read from the auxiliary table of Hbase, whether the auxiliary table data are empty or not is judged, and if the auxiliary table data are empty, composite aggregation data are generated based on the main table aggregation data. And if the auxiliary table data are not empty, generating composite aggregation data based on the main table aggregation data and the auxiliary table.

Since the main table aggregated data is consistent with the main table data in HBase, the attached table data may be read from the attached table in HBase using as the primary key either the specified column in the main table aggregated data or the specified column in the main table.

In S230, the aggregated data is sent to a downstream stream processing platform.

In the embodiment of the invention, the data aggregation module sends the aggregated data to a downstream stream processing platform, such as kafka, and sends the aggregated data to the indexing module through the downstream kafka to construct the real-time index.

The following describes in detail an aggregation process of main table aggregated data according to an embodiment of the present invention, with reference to a specific application scenario.

Fig. 5 is a flowchart illustrating a method for aggregating data of a master table according to an exemplary embodiment.

As shown in fig. 5, the method may include, but is not limited to, the following steps:

S501, reading main table data with a target identifier as a primary key from the distributed storage system.

For example, the target identifier is a commodity identifier, that is, the ID of an SKU. The distributed storage system stores a main table with the SKU ID as its primary key, and when the change message is received, the main table data with that SKU ID as the primary key is read from the distributed storage system according to the SKU ID in the change message.

S502, judging whether the column corresponding to the change field in the change data in the main table is empty or not.

It should be noted that each column in the main table stored in the distributed storage system may be empty.

If so, go to S503, otherwise, go to S504.

S503, writing the change fields into the columns, and generating main table aggregation data based on the main table written with the change fields.

For example, taking the structure of the main table data shown in fig. 3: if the price column is empty and the change field is a price field, the price in the change field is written into the main table data, and the main table aggregated data is generated based on the main table with the price written.

Note that the main table aggregate data has the same data structure as the main table.

S504, comparing the version number of the change field with the version number of the column.

S505, if the version number of the change field is not greater than the version number of the column, generating main table aggregated data based on the main table.

For example, in the structure of the main table data shown in fig. 3, if the change field is a price field, the version number (e.g., offset) of the change field is compared with the version number (offset) of the price column family in the main table data. If the version number of the change field is not greater than (that is, equal to or less than) the version number of the price column family in the main table data, this indicates that the price in the main table is the same as the price in the change field, or that the price in the main table is a newer version than the price in the change field. The price in the main table data therefore does not need to be changed, and the main table aggregated data is generated directly based on the main table data.

S506, if the version number of the change field is greater than the version number of the column, replacing the data of the column with the change field, and generating main table aggregated data based on the replaced main table data.

For example, in the structure of the main table data shown in fig. 3, if the version number of the change field is greater than the version number of the price column family in the main table data, this indicates that the price in the change field is a newer version than the price in the main table, that is, the price in the change field is the latest price. The price in the original main table data is therefore replaced with the price in the change field, and the main table aggregated data is generated based on the replaced main table data.
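The effect of the version check in S504 to S506 on message ordering can be illustrated with a short Python sketch. It replays the same three price changes in every possible arrival order and collects the final stored price. The sketch assumes that version numbers (for example, offsets) are distinct and increase with each change to a column; the function name `apply_change` and the sample values are hypothetical.

```python
from itertools import permutations
from typing import Dict, Optional, Tuple

Cell = Optional[Tuple[str, int]]  # (value, version number), or None if empty


def apply_change(row: Dict[str, Cell], column: str, value: str, version: int) -> None:
    """S502 to S506 in one step: write if empty or newer, otherwise keep the row."""
    current = row.get(column)
    if current is None or version > current[1]:
        row[column] = (value, version)


# Three successive price changes for one SKU, identified by their version numbers.
messages = [("price", "21.00", 1), ("price", "19.90", 2), ("price", "18.50", 3)]

final_states = set()
for order in permutations(messages):
    row: Dict[str, Cell] = {"price": None}
    for column, value, version in order:
        apply_change(row, column, value, version)
    final_states.add(row["price"])
```

Under the stated assumption, every arrival order ends in the same state, the value carrying the highest version number, which is why a delayed message cannot roll the main table back to an older price.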

In the embodiment of the invention, the Flink distributed streaming processing and the mass storage capacity of the distributed storage system are utilized, and the real-time high-throughput data aggregation of the main table data can be realized under the condition of ensuring the message time sequence.

The following describes in detail the aggregation process of the composite aggregation data according to the embodiment of the present invention with reference to a specific application scenario.

Fig. 6 is a flowchart illustrating an aggregation method for composite aggregated data according to an exemplary embodiment.

As shown in fig. 6, the method may include, but is not limited to, the following steps:

S601, reading, from the attached table of the distributed storage system, the attached table data that takes the specified column in the main table aggregated data as its primary key.

It should be noted that the aggregated main table aggregated data matches the main table data stored in the distributed storage system, and therefore, the attached table data having the specified column of the main table aggregated data as the main key is the attached table data having the specified column of the main table as the main key.

S602, judging whether the attached table data is empty or not.

It should be noted that each column in the attached table stored in the distributed storage system may be empty. And when each column in the attached table is empty, the data of the attached table is empty, and if at least one column in the attached table is not empty, the data of the attached table is not empty.

If the attached table data is empty, S603 is executed, and if the attached table data is not empty, S604 is executed.

S603, generating composite aggregated data based on the main table aggregated data.

For example, in the structure of the attached table shown in fig. 4, if all the column families in the attached table are empty, the composite aggregated data is generated directly based on the main table aggregated data, and all the column families (other than the primary key) in the attached table part of the composite aggregated data are empty.

S604, generating composite aggregation data based on the main table aggregation data and the auxiliary table.

For example, in the structure of the attached table shown in fig. 4, if the columns in the attached table are not all empty, composite aggregated data is generated based on the main table aggregated data and the attached table.
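Steps S601 to S604 can be sketched in Python as a keyed lookup followed by an emptiness check. The sketch uses plain dictionaries in place of HBase tables; the function name `build_composite`, the default key column `store_id`, and the choice to represent an empty attached part as an empty dictionary are assumptions made for this example.

```python
from typing import Dict, Optional

AttachedTable = Dict[str, Dict[str, Optional[str]]]  # primary key -> column -> value


def build_composite(main_agg: Dict[str, Optional[str]],
                    attached_table: AttachedTable,
                    key_column: str = "store_id") -> dict:
    """Generate composite aggregated data from main table aggregated data."""
    # S601: read the attached row whose primary key is the specified column's value.
    key = main_agg.get(key_column)
    attached = attached_table.get(key, {}) if key is not None else {}
    # S602: the attached data is empty when every one of its columns is empty.
    is_empty = all(value is None for value in attached.values())
    # S603 / S604: the main part is always present; the attached part is filled
    # only when at least one attached column is not empty.
    return {"main": dict(main_agg), "attached": {} if is_empty else dict(attached)}


attached_table: AttachedTable = {
    "store-7": {"store_name": "Acme", "brand": None},
    "store-9": {"store_name": None, "brand": None},
}
with_store = build_composite({"sku_id": "1001", "store_id": "store-7"}, attached_table)
empty_store = build_composite({"sku_id": "1002", "store_id": "store-9"}, attached_table)
```

The behaviour is that of a left join: the main table aggregated data is always emitted, and a missing or empty attached row only leaves the attached part blank.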

The data aggregation method proposed by the present invention will be described in detail below with reference to specific examples.

Fig. 7 is a data flow diagram illustrating an exemplary embodiment. As shown in fig. 7, the data aggregation apparatus (module) is based on the Flink framework and includes a Kafka source section, an operator merge section, an operator leftjoin section, and a Kafka sink section. The Kafka source section receives change messages, such as binlog messages, generated by a data source. The change message flows into the operator merge section. The operator merge mainly performs data aggregation on the main table in HBase according to the ID of the SKU in the change message to generate main table aggregated data. The generated main table aggregated data flows into the operator leftjoin section. The operator leftjoin performs data aggregation on the attached table in HBase according to the specified column in the main table aggregated data to generate composite aggregated data. The generated composite aggregated data flows into the Kafka sink section and is sent to the downstream Kafka via the Kafka sink.

It should be noted that the main table aggregated data generated by the operator merge can also flow directly to the Kafka sink, in which case the operator leftjoin is not needed to aggregate the attached table in HBase according to the specified column in the main table aggregated data and generate composite aggregated data.
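The data flow of fig. 7 can be approximated with chained Python generators, one per stage. This is a plain-Python sketch and does not use the Flink or Kafka APIs: lists stand in for the Kafka source and sink, dictionaries stand in for the HBase main and attached tables, and the names `merge`, `left_join`, `run` and the message tuple layout are assumptions for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Change = Tuple[str, str, str, int]                  # (sku_id, column, value, version)
MainTable = Dict[str, Dict[str, Tuple[str, int]]]   # sku_id -> column -> (value, version)
AttachedTable = Dict[str, Dict[str, str]]           # key -> column -> value


def merge(stream: Iterable[Change], main_table: MainTable) -> Iterator[dict]:
    """Operator merge: version-checked update of the main table row per message."""
    for sku_id, column, value, version in stream:
        row = main_table.setdefault(sku_id, {})
        current = row.get(column)
        if current is None or version > current[1]:
            row[column] = (value, version)
        yield {"sku_id": sku_id, "main": {c: cell[0] for c, cell in row.items()}}


def left_join(stream: Iterable[dict], attached_table: AttachedTable,
              key_column: str = "store_id") -> Iterator[dict]:
    """Operator leftjoin: add the attached row keyed by the specified column."""
    for record in stream:
        key = record["main"].get(key_column)
        yield {**record, "attached": dict(attached_table.get(key, {}))}


def run(messages: Iterable[Change], main_table: MainTable,
        attached_table: AttachedTable, with_join: bool = True) -> List[dict]:
    """Source -> merge -> (optional leftjoin) -> sink, with lists as source and sink."""
    stream: Iterator[dict] = merge(iter(messages), main_table)
    if with_join:
        stream = left_join(stream, attached_table)
    return list(stream)


attached = {"store-7": {"store_name": "Acme"}}
out = run(
    [("1001", "store_id", "store-7", 1),
     ("1001", "price", "19.90", 2),
     ("1001", "price", "25.00", 1)],   # delayed message with an older version
    {}, attached,
)
main_only = run([("1002", "price", "5.00", 1)], {}, attached, with_join=False)
```

Passing `with_join=False` mirrors the variant in which main table aggregated data flows straight from the operator merge to the sink.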

It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.

The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. In the following description of the apparatus, the same parts as those of the foregoing method will not be described again.

Fig. 8 is a schematic structural diagram illustrating a data aggregation apparatus according to an exemplary embodiment, and as shown in fig. 8, the data aggregation apparatus 800 may include:

a receiving module 810, configured to receive a change message through an upstream Kafka, where the change message includes a target identifier and change data;

a generating module 820, configured to read, from HBase, main table data with the target identifier as the primary key, and aggregate the main table data based on the change data to generate aggregated data;

a sending module 830, configured to send the aggregated data to a downstream Kafka.

In the embodiment of the invention, a change message is received through an upstream Kafka, the change message including a target identifier and change data; main table data with the target identifier as the primary key is read from HBase and aggregated based on the change data to generate aggregated data; and the aggregated data is sent to a downstream Kafka. By using Flink distributed stream processing and the mass storage capacity of HBase, real-time, high-throughput data aggregation can be achieved while preserving the time order of messages.
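As a rough illustration of how the three modules of apparatus 800 fit together, the following sketch wires them in sequence. It is only a sketch under stated assumptions: `collections.deque` objects stand in for the upstream and downstream Kafka, a dictionary stands in for HBase, the class and field names (`target_id`, `change_data`) are hypothetical, and the aggregation is simplified to overwriting columns without any version check.

```python
from collections import deque
from typing import Deque, Dict, Optional


class ReceivingModule:
    """Module 810: takes change messages from an upstream queue (Kafka stand-in)."""

    def __init__(self, upstream: Deque[dict]) -> None:
        self.upstream = upstream

    def receive(self) -> Optional[dict]:
        # Messages are taken in arrival order; None signals an empty queue.
        return self.upstream.popleft() if self.upstream else None


class GeneratingModule:
    """Module 820: reads the row keyed by the target identifier and aggregates it."""

    def __init__(self, store: Dict[str, dict]) -> None:
        self.store = store  # dictionary standing in for the HBase main table

    def generate(self, message: dict) -> dict:
        row = self.store.setdefault(message["target_id"], {})
        row.update(message["change_data"])  # simplified: no version comparison
        return {"target_id": message["target_id"], **row}


class SendingModule:
    """Module 830: appends aggregated data to a downstream queue (Kafka stand-in)."""

    def __init__(self, downstream: Deque[dict]) -> None:
        self.downstream = downstream

    def send(self, aggregated: dict) -> None:
        self.downstream.append(aggregated)


class DataAggregationApparatus:
    """Apparatus 800: receive, generate, send, one message at a time."""

    def __init__(self, upstream: Deque[dict], store: Dict[str, dict],
                 downstream: Deque[dict]) -> None:
        self.receiver = ReceivingModule(upstream)
        self.generator = GeneratingModule(store)
        self.sender = SendingModule(downstream)

    def run(self) -> int:
        """Drain the upstream queue and return the number of messages processed."""
        processed = 0
        message = self.receiver.receive()
        while message is not None:
            self.sender.send(self.generator.generate(message))
            processed += 1
            message = self.receiver.receive()
        return processed
```

Each aggregated record is a fresh copy of the row at the moment it is sent, so later changes to the same row do not alter records already delivered downstream.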

The specific details of each module of the data aggregation apparatus have been described in detail in the corresponding data aggregation method, and therefore are not described herein again.

It should be noted that although several modules or units of the apparatus are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."

An electronic device 900 according to this embodiment of the invention is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.

As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of the electronic device 900 may include, but are not limited to: at least one processing unit 910, at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.

The storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform the steps according to various exemplary embodiments of the present invention described in the "exemplary methods" section above. For example, the processing unit 910 may execute the steps shown in fig. 2: S210, receiving a change message through an upstream stream processing platform, the change message including a target identifier and change data; S220, reading main table data with the target identifier as the primary key from the distributed storage system, and aggregating the main table data based on the change data to generate aggregated data; and S230, sending the aggregated data to a downstream stream processing platform.

The storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access memory unit (RAM) 9201 and/or a cache memory unit 9202, and may further include a read only memory unit (ROM) 9203.

Storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 930 can be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 900 may also communicate with one or more external devices 970 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.

Referring to fig. 10, a program product 1000 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims

1. A method for data aggregation, the method comprising:

receiving a change message through an upstream stream processing platform, wherein the change message comprises a target identifier and change data;

reading main table data with the target identification as a main key from a distributed storage system, and aggregating the main table data based on the changed data to generate aggregated data;

sending the aggregated data to a downstream stream processing platform.

2. The method of claim 1, wherein the aggregated data comprises: main table aggregated data;

the aggregating the main table data based on the changed data to generate aggregated data includes:

judging whether a column corresponding to a change field in the change data in a main table of the distributed storage system is empty or not;

and if the column corresponding to the changed field in the changed data in the main table is empty, writing the changed field into the column, and generating main table aggregated data based on the main table written with the changed field.

3. The method of claim 2, wherein the method further comprises:

if the column corresponding to the change field in the change data in the main table is not empty, comparing the version number of the change field with the version number of the column;

if the version number of the change field is not larger than the version number of the column, generating main table aggregation data based on the main table;

and if the version number of the change field is greater than the version number of the column, replacing the data of the column with the change field, and generating main table aggregated data based on the replaced main table data.

4. The method of claim 3, wherein the aggregated data further comprises: composite aggregated data;

the method further comprises the following steps:

reading supplementary table data taking a specified column in the main table aggregation data as a main key from supplementary tables of the distributed storage system;

judging whether the supplementary table data is empty or not;

and if the supplementary table data is empty, generating composite aggregation data based on the main table aggregation data.

5. The method of claim 4, wherein the method further comprises:

and if the supplementary table data is not empty, generating composite aggregation data based on the main table aggregation data and the supplementary table data.

6. The method of claim 1, wherein the target identification comprises a merchandise identification.

7. The method according to any of claims 1-6, wherein the method is performed by a Flink-based data aggregation apparatus.

8. A data aggregation apparatus, characterized in that the apparatus comprises:

the receiving module is used for receiving a change message through an upstream stream processing platform, and the change message comprises a target identifier and change data;

the generating module is used for reading main table data with the target identification as a main key from the distributed storage system, and aggregating the main table data based on the changed data to generate aggregated data;

and the sending module is used for sending the aggregated data to a downstream stream processing platform.

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.

10. An electronic device, comprising: one or more processors;

storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method steps of any of claims 1-7.