CN111209278A - Apparatus and method for streaming real-time processing of on-line production data - Google Patents

Apparatus and method for streaming real-time processing of on-line production data


Publication number
CN111209278A
Authority
CN
China
Prior art keywords
data
data source
key
value pair
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811388835.3A
Other languages
Chinese (zh)
Inventor
任文治
袁建军
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201811388835.3A
Publication of CN111209278A
Legal status: Pending

Abstract

The invention discloses an apparatus and a method for processing online production data in a streaming real-time manner. In the invention, the online production data of a plurality of data sources, read in real time in a streaming processing mode, is transferred and stored as single data source sub-tables in a key-value pair format, and the single data source sub-tables of each data source are periodically merged into a multi-data source wide table to realize the integration of multiple data sources. In this way, the integration of multiple data sources can be realized with low development cost and high consumption efficiency while ensuring data integrity, so as to meet the requirement for data timeliness. Moreover, the multi-data source wide table can be provided to a distributed file system to realize relatively simple data checking, and can also be fed back to a distributed publish-subscribe message system for use by downstream services.

Description

Apparatus and method for streaming real-time processing of on-line production data
Technical Field
The present invention relates to the field of big data processing, and more particularly, to an apparatus for streaming real-time processing of online production data, a method for streaming real-time processing of online production data, a data processing device, and a non-transitory computer-readable storage medium.
Background
With the development of big data, the amount of data that needs to be processed keeps increasing. In this situation, the traditional offline T+1 processing mode can hardly guarantee timeliness, and hardly meets the business requirement for real-time data.
Disclosure of Invention
One embodiment of the present invention provides an apparatus for streaming real-time processing of online production data, comprising:
the data access module reads online production data of a plurality of data sources from the distributed publish-subscribe message system based on a streaming processing mode;
the data buffer module transfers and stores the online production data of each data source into a single data source sub-table in a key-value pair format;
and the data merging module periodically merges the single data source sub-tables in the key-value pair format of each data source into a multi-data source wide table.
Optionally, the data access module reads the online production data by invoking a distributed processing engine for streaming data and batch data.
Optionally, the online production data of each data source is stored in a distributed publish-subscribe message system in a partition by a primary key, and the data access module reads the online production data of each data source from each partition in parallel.
Optionally, the data buffer module caches a single data source sub-table in key-value pair format in a distributed column-oriented database.
Optionally, the data buffering module sets a primary key identifier, a database row key, an inter-table association field, and whether a modified content field exists for an update operation in a corresponding single data source sub-table according to a configuration file preset for each data source.
Optionally, the data buffer module directly converts newly added data in the online production data into a key-value pair format, converts updated data into the key-value pair format after data coverage, and converts deleted data into the key-value pair format after adding deletion marks; the data buffer module stores the newly added data, the updated data and the deleted data in the key-value pair format into the corresponding single data source sub-tables, wherein, if a plurality of corresponding database row keys exist in the same single data source sub-table, the data buffer module splices the plurality of corresponding database row keys; and if the modified content field for the update operation exists in the single data source sub-table, the data buffer module performs an update judgment.
Optionally, the data merging module merges in a relational database to obtain the multi-data source wide table.
Optionally, the data merging module realizes the merging into the multi-data source wide table through inter-table association based on a structured query language.
Optionally, the data merging module periodically merges the data in a preset time window to obtain the multi-data source wide table.
Optionally, further comprising: a data production module which stores the multi-data source wide table into a distributed file system and/or converts the multi-data source wide table into a log format and provides it to the distributed publish-subscribe message system.
Another embodiment of the present invention provides a method for streaming real-time processing of online production data, comprising:
reading online production data of a plurality of data sources from a distributed publish-subscribe message system based on a streaming processing mode;
transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format;
periodically merging the single data source sub-tables of the key-value pair format of each data source into a multi-data source wide table.
Optionally, reading the online production data of the plurality of data sources comprises: reading the online production data by invoking a distributed processing engine for streaming data and batch data.
Optionally, the online production data of each data source is stored in a distributed publish-subscribe message system in a partitioned manner according to the primary key, and reading the online production data of a plurality of data sources includes: the on-line production data of each data source is read in parallel from each partition.
Optionally, transferring the online production data of each data source into a single data source sub-table in a key-value pair format includes: caching the single data source sub-table in the key-value pair format in a distributed column-oriented database.
Optionally, transferring the online production data of each data source into a single data source sub-table in a key-value pair format includes: setting, according to a configuration file preset for each data source, a primary key identifier, a database row key, an inter-table association field, and whether a modified content field for the update operation exists in the corresponding single data source sub-table; directly converting newly added data in the online production data into the key-value pair format, converting updated data into the key-value pair format after data coverage, and converting deleted data into the key-value pair format after adding deletion marks; and storing the newly added data, the updated data and the deleted data in the key-value pair format into the corresponding single data source sub-tables, wherein if a plurality of corresponding database row keys exist in the same single data source sub-table, the plurality of corresponding database row keys are spliced, and if the modified content field for the update operation exists in the single data source sub-table, an update judgment is performed.
Optionally, merging the single data source sub-tables in the key-value pair format of each data source into the multi-data source wide table includes: merging in a relational database to obtain the multi-data source wide table.
Optionally, merging the single data source sub-tables in the key-value pair format of each data source into the multi-data source wide table includes: realizing the merging into the multi-data source wide table through inter-table association based on a structured query language.
Optionally, further comprising: storing the multi-data source wide table into a distributed file system, and/or converting the multi-data source wide table into a log format and providing it to the distributed publish-subscribe message system.
Another embodiment of the invention provides a data processing apparatus comprising a processor for performing the steps of the method as described above.
Another embodiment of the invention provides a non-transitory computer readable storage medium storing instructions for the processor to perform the steps of the method as described above.
As can be seen from the above, based on the above embodiments, the online production data of multiple data sources read in real time in a streaming processing mode may be transferred and stored as single data source sub-tables in a key-value pair format, and the single data source sub-tables of each data source may be periodically merged into a multi-data source wide table to implement integration of multiple data sources. Integration of multiple data sources can thus be realized with high consumption efficiency while ensuring data integrity, so as to meet the requirement for data timeliness. Moreover, as a further continuation, the multi-data source wide table can be provided to a distributed file system to achieve relatively simple data checking, and can also be fed back to the distributed publish-subscribe message system for use by downstream services.
Drawings
FIG. 1 is a schematic diagram showing an exemplary configuration of an apparatus for streaming real-time processing of online production data in one embodiment;
FIG. 2 is a schematic diagram of a data access module in the apparatus shown in FIG. 1;
FIG. 3 is a schematic diagram of a data buffering module in the apparatus shown in FIG. 1;
FIG. 4 is a schematic diagram of a data merge module in the apparatus of FIG. 1;
FIG. 5 is a schematic flow chart diagram illustrating a method for streaming real-time processing of online production data in one embodiment;
FIG. 6 is an expanded flow diagram of the method of FIG. 5;
FIG. 7 is a flow diagram of one embodiment of the method of FIG. 5;
fig. 8 is a schematic configuration diagram of a data processing apparatus in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
FIG. 1 is a schematic diagram showing an exemplary configuration of an apparatus for streaming real-time processing of online production data in one embodiment. Referring to fig. 1, in one embodiment, an apparatus for streaming real-time processing of online production data may include: a data access module 11, a data buffer module 12, a data merge module 13, and a data production module 14.
The data access module 11 may read the online production data of the plurality of data sources from the distributed publish-subscribe message system 20 based on a streaming process.
The distributed publish-subscribe message system 20 described above may be a high-throughput open-source streaming platform such as Kafka, with the capability to process all the action stream data of multiple data sources. The online production data accessible in the distributed publish-subscribe message system 20 may be in a log format, such as a binary log (binlog) included in topic (Topic) data.
Also, the data access module 11 may invoke a distributed processing engine for streaming data and batch data, such as Flink, to read the online production data from the distributed publish-subscribe message system 20 through the invoked distributed processing engine. Compared with Storm, which is commonly used in traditional streaming processing, Flink offers more functions, such as machine learning, batch processing and stream processing, and its data consumption efficiency is higher.
In addition, the distributed publish-subscribe message system 20 may store the online production data of each data source in a partitioned manner according to the primary key, and accordingly, the online production data of each data source is read from each partition in parallel, so as to improve the efficiency of data consumption and ensure the integrity and consistency of the data of the same primary key at the same time.
Fig. 2 is a schematic diagram of the data access module in the apparatus shown in fig. 1. Referring to fig. 2 in conjunction with fig. 1, assume that the distributed publish-subscribe message system 20 is connected to M data sources 20_1 to 20_M (M is a positive integer greater than 1), and that the online production data of each of the M data sources 20_1 to 20_M has at most N primary keys (N is a positive integer greater than 1); then, for each of the M data sources 20_1 to 20_M, there are at most N primary-key partitions in the distributed publish-subscribe message system 20. Accordingly, for the online production data of each of the M data sources 20_1 to 20_M, the data access module 11 can read the corresponding partitions in parallel.
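The parallel per-partition reading described above can be sketched as follows. This is a simplified illustration only, not the patented implementation: the partition contents, primary-key names and the `consume` helper are hypothetical stand-ins for a real Kafka/Flink consumer.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for primary-key partitions in the publish-subscribe
# system: each partition holds the ordered change records of one primary key.
partitions = {
    "pk_a": [{"opt": "INSERT", "id": 1}, {"opt": "UPDATE", "id": 1}],
    "pk_b": [{"opt": "INSERT", "id": 2}],
}

def consume(name, records):
    # Read one partition sequentially: ordering within a primary key is
    # preserved, keeping data of the same key complete and consistent.
    return name, list(records)

def read_all(parts):
    # Partitions are consumed in parallel to raise consumption efficiency.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        futures = [pool.submit(consume, n, r) for n, r in parts.items()]
        return dict(f.result() for f in futures)

result = read_all(partitions)
```

Reading partitions concurrently while keeping each partition's internal order is what lets throughput scale without breaking per-key consistency.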
Fig. 3 is a schematic diagram of the data buffering module in the apparatus shown in fig. 1. Referring to fig. 3 in conjunction with fig. 1, the data buffer module 12 may transfer and store the online production data of each data source into a single data source sub-table in a key-value pair format. The data buffer module 12 may store the single data source sub-tables in the key-value pair format in a distributed column-oriented database such as the Hadoop database (HBase); the online production data of each data source corresponds to one single data source sub-table, so the number of single data source sub-tables may be M.
For the single data source sub-table 400 corresponding to each data source, a corresponding configuration file 40 may be prepared in advance, and the data buffering module may set a primary key identifier 41, a database row key (Rowkey) 42, a modified content field (UpdateCol) 43 for the update operation, and an inter-table association field 44 in the corresponding single data source sub-table 400 according to the configuration file 40 preset for each data source. The modified content field 43 for the update operation is an optional configuration item; whether the corresponding data needs a modification check can be determined according to whether this configuration item exists in the single data source sub-table 400.
The configuration files 40 of different data sources may contain the same configuration items, but the specific contents of the configuration items may differ from each other. For example, the configuration file 40 of each data source sets a configuration item for the primary key of the single data source sub-table, but the values of the primary key in this configuration item may be different.
The configuration file 40 is loaded to configure the single data source sub-table 400, so that the single data source sub-table 400 can be easily adjusted as required. That is, the single data source sub-table 400 can be flexibly adjusted by changing only the configuration file 40, without changing the program logic of the data buffer module 12.
The single data source sub-table 400 configured according to the configuration file 40 may be stored in a cache medium 121, such as a Google Guava cache, for subsequent reading.
Still referring to fig. 3, taking HBase as an example of the distributed column-oriented database 120, the operation types corresponding to the online production data 50 may include three types, i.e., insert, update and delete, where the delete operation may be a logical delete. Accordingly:
for newly added data corresponding to an insert operation in the online production data 50, the data buffer module 12 may directly perform key-value pair format conversion 53 on it;
for updated data corresponding to an update operation in the online production data 50, the data buffer module 12 may perform the key-value pair format conversion 53 after performing data coverage 51 on it, where data coverage 51 means overwriting the original data with the latest data;
for deleted data, the data buffer module 12 may perform the key-value pair format conversion after performing deletion marker addition 52 on it.
After the format conversion, the data buffer module 12 reads the single data source sub-table 400 configuration from the cache medium 121, and may store the newly added data, updated data and deleted data in the key-value pair format into the corresponding single data source sub-table 400 stored in the distributed column-oriented database 120.
If the database row key 42 configured for the same single data source sub-table 400 contains more than one Rowkey, then, to match the storage format of HBase, the data buffer module 12 needs to perform row key concatenation 54, i.e., concatenate the multiple corresponding Rowkeys together with a symbol such as an underscore "_".
In addition, if the modified content field 43 for the update operation exists in the single data source sub-table 400, the data buffer module 12 also needs to make an update decision 55. For example, the data buffer module 12 first determines whether the business operation time corresponding to the current data is later than or equal to the business operation time of the data stored in the corresponding single data source sub-table 400 in HBase 120. If so, the current data is the latest data, and the stored data in the corresponding single data source sub-table 400 in HBase 120 is updated to keep the data consistent; otherwise, the corresponding single data source sub-table 400 in HBase 120 is not updated, and the current data may be regarded as dirty data that has been superseded and may be discarded.
By analogy, HBase 120 stores all the single data source sub-tables 400 according to this rule in real time.
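A minimal sketch of the conversion and storage rules above — direct conversion for inserts, data coverage for updates, a logical-delete marker for deletes, and underscore splicing of multiple configured row keys. The field names and the in-memory dict standing in for an HBase sub-table are assumptions for illustration only.

```python
table = {}  # stand-in for one single data source sub-table in HBase

def make_rowkey(record, rowkey_fields):
    # Row key concatenation 54: multiple configured Rowkeys are spliced
    # together with an underscore to match HBase's storage format.
    return "_".join(str(record[f]) for f in rowkey_fields)

def apply_change(record, opt, rowkey_fields):
    key = make_rowkey(record, rowkey_fields)
    value = dict(record)            # key-value pair format conversion 53
    if opt == "UPDATE":
        table[key] = value          # data coverage 51: latest overwrites old
    elif opt == "DELETE":
        value["deleted"] = True     # deletion marker addition 52 (logical delete)
        table[key] = value
    else:                           # INSERT: store directly
        table[key] = value

apply_change({"pk": 16777921, "pid": 8226277, "qty": 1}, "INSERT", ["pk", "pid"])
apply_change({"pk": 16777921, "pid": 8226277, "qty": 2}, "UPDATE", ["pk", "pid"])
```

After these two changes the sub-table holds one row under the spliced key `16777921_8226277`, with the update's values covering the insert's.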
The business operation time can be judged by comparing the mid sequences in the key-value pair format; see the following example:
it is assumed that the key value of the key value pair format on-line production data 1 includes a mid sequence having a value of "200135521945" and an opt field of "INSERT" indicating a new addition operation, and the key value of the key value pair format on-line production data 1 is 16777921_80463060328_ 8226277.
Further, assume that the key value of the key-value pair format on-line production data 2 is 16777921_80463060328_8226277, and the key value of the key also includes a mid sequence having a value of "200135521951" and an opt field representing an UPDATE operation "UPDATE".
The primary keys of the two on-line production data 1 and 2 are each composed of a primary key identification, an old identification, and a product ID, and the values of the primary keys of the two on-line production data 1 and 2 are the same.
The key value of the two online production data 1 and 2 is in JSON (JSON object notation) string format, which includes the contents of the online production data 1 and 2.
As can be seen from the above, the opt field of online production data 1 is "INSERT", indicating an insert operation, and the opt field of online production data 2 is "UPDATE", indicating an update operation. Because the modified content field (UpdateCol) for the update operation is configured, the data buffer module 12 determines that it must check, through the mid sequence that increases over time, whether online production data 2 is indeed newer data, i.e., the update decision 55.
Since the mid sequence "200135521951" of online production data 2 indicates a business operation time later than that indicated by the mid sequence "200135521945" of online production data 1 stored in the corresponding single data source sub-table 400 in HBase 120, for the above example the data buffer module 12 recognizes, through the update decision 55 on the mid sequence, that online production data 2 is newer data, and uses it to update online production data 1 stored in the corresponding single data source sub-table 400 in HBase 120.
However, if the operation time indicated by the mid sequence of online production data 2 were not later than, or even earlier than, the operation time indicated by the mid sequence of online production data 1 stored in the corresponding single data source sub-table 400 in HBase 120, then, even though the opt field of online production data 2 is "UPDATE" indicating an update operation, the data buffer module 12 would determine through the update decision 55 on the mid sequence that online production data 2 is dirty data that has been superseded.
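The update decision on the mid sequence can be expressed as a simple comparison. Since mid grows with business operation time and the example mids have equal length, a plain string comparison suffices; this is an illustrative sketch, not the patented code.

```python
def update_decision(stored_mid, incoming_mid):
    # Accept the incoming record only if its business operation time is
    # later than or equal to that of the stored record; otherwise it is
    # dirty (superseded) data and is discarded.
    return incoming_mid >= stored_mid

# mids from the worked example above
newer = update_decision("200135521945", "200135521951")  # data 2 is newer: update
stale = update_decision("200135521951", "200135521945")  # would be discarded
```

A real implementation would need to guarantee that mids are comparable (e.g. fixed width or numeric) across all records of a key; that property is assumed here.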
Fig. 4 is a schematic diagram of the data merging module in the apparatus shown in fig. 1. Referring to fig. 4 in conjunction with fig. 1, the data merging module 13 may periodically merge the single data source sub-tables 400 in the key-value pair format of each data source into a multi-data source wide table 600.
Specifically, the data merging module 13 may periodically, in a preset time window (e.g., a Flink time window), obtain the data stored in the single data source sub-tables 400 from the distributed column-oriented database (e.g., HBase) 120, merge the data in a relational database such as the H2 in-memory database to obtain the multi-data source wide table 600, and finally insert the multi-data source wide table 600 into the distributed column-oriented database (e.g., HBase) 120.
The length of the time window can be set arbitrarily, so as to adjust the degree of real-time processing. The length of the time window can be measured based on time or on the amount of data, and is generally adapted to millisecond-level processing speeds.
Moreover, the merging in a relational database such as the H2 in-memory database may be implemented through inter-table association between the single data source sub-tables 400 based on the Structured Query Language (SQL). Compared with implementing the table associations in Java, the SQL-based association can save development cost and shorten the development cycle; in addition, since offline data is also produced with SQL processing, the real-time data and the offline data can be kept consistent more easily.
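The SQL-based inter-table association might look like the following sketch, which uses Python's sqlite3 as a stand-in for the H2 in-memory database named in the text; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two single data source sub-tables, related through the inter-table
# association field product_id.
conn.execute("CREATE TABLE orders (order_id TEXT, product_id TEXT)")
conn.execute("CREATE TABLE products (product_id TEXT, name TEXT)")
conn.execute("INSERT INTO orders VALUES ('o1', 'p1')")
conn.execute("INSERT INTO products VALUES ('p1', 'widget')")

# One SQL join over the association field yields the multi-data source
# wide table, with columns drawn from both sources.
wide = conn.execute(
    "SELECT o.order_id, o.product_id, p.name "
    "FROM orders o JOIN products p ON o.product_id = p.product_id"
).fetchall()
```

Expressing the merge as one declarative join, rather than hand-written Java association code, is exactly the development-cost saving the paragraph above points to.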
In addition, the association between the single data source sub-tables 400 may be determined by the above-mentioned inter-table association fields; that is, the single data source sub-tables 400 corresponding to some primary keys of some data sources may be associated and merged into one multi-data source wide table 600, or all the single data source sub-tables 400 corresponding to all primary keys may be associated and merged into one multi-data source wide table 600.
Referring back to fig. 1, the data production module 14 may store the multi-data source wide table into the distributed file system 30; that is, the data production module may persist the table to the distributed file system 30 for checking against the offline data. For example, the distributed file system 30 may be the Hadoop Distributed File System (HDFS). And/or, the data production module 14 may convert the multi-data source wide table into a log format and provide it to the distributed publish-subscribe message system (e.g., Kafka) 20, for example for display on a large screen, so that downstream services can continue to consume it through the distributed publish-subscribe message system (e.g., Kafka) 20.
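Converting wide-table rows into a log format for the publish-subscribe system could be sketched as below; JSON lines are an assumption here, chosen because the accessed binlog data described earlier is also log-formatted, and the column names are illustrative.

```python
import json

def to_log_lines(rows, columns):
    # Serialize each merged wide-table row as one JSON log line that a
    # distributed publish-subscribe system such as Kafka could carry.
    return [json.dumps(dict(zip(columns, row))) for row in rows]

lines = to_log_lines([("o1", "p1", "widget")],
                     ["order_id", "product_id", "name"])
```

Each emitted line is self-describing, so downstream consumers can parse it without knowing the wide table's schema in advance.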
FIG. 5 is a schematic flow diagram illustrating a method for streaming real-time processing of online production data in one embodiment. Referring to fig. 5, in one embodiment, a method of streaming real-time processing of online production data may include:
S71, reading the online production data of a plurality of data sources from the distributed publish-subscribe message system in a streaming processing mode;
S72, transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format;
and S73, periodically merging the single data source sub-tables in the key-value pair format of each data source into a multi-data source wide table.
Based on the above process, the online production data of multiple data sources read in S71 can be saved as single data source sub-tables in the key-value pair format in S72, and the single data source sub-tables of each data source can be periodically merged into a multi-data source wide table in S73 to realize integration of multiple data sources, so that integration of multiple data sources can be realized with high consumption efficiency while ensuring data integrity.
Fig. 6 is an expanded flow diagram of the method shown in fig. 5. Compared to fig. 5, the flow shown in fig. 6 further includes the following steps after S73:
S75, storing the multi-data source wide table into the distributed file system;
and S76, converting the multi-data source wide table into a log format and providing it to the distributed publish-subscribe message system.
Thus, as a further continuation of the flow shown in FIG. 5, the multi-data source wide table may be provided to the distributed file system to enable relatively simple data checking, and may also be fed back to the distributed publish-subscribe message system for downstream business use.
FIG. 7 is a flow chart of an embodiment of the method shown in FIG. 5. In fig. 7, the method of streaming real-time processing on-line production data in this embodiment may be embodied as including the steps of:
and S710, reading the online production data stored by each data source according to the main key partition in parallel from the distributed publishing and subscribing message system by calling the distributed processing engine aiming at the stream data and the batch data.
S720, transferring and storing the online production data of each data source, in a distributed column-oriented database, into single data source sub-tables in a key-value pair format.
Wherein, S720 may specifically include:
S721, setting, according to a configuration file preset for each data source, a primary key identifier, a database row key, an inter-table association field, and whether a modified content field for the update operation exists in the corresponding single data source sub-table;
S722, directly converting newly added data in the online production data into the key-value pair format, converting updated data into the key-value pair format after data coverage, and converting deleted data into the key-value pair format after adding deletion marks;
and S723, storing the newly added data, the updated data and the deleted data in the key-value pair format into the corresponding single data source sub-tables, wherein if a plurality of corresponding database row keys exist in the same single data source sub-table, the plurality of corresponding database row keys are spliced, and if the modified content field for the update operation exists in the single data source sub-table, an update judgment is performed.
And S730, merging the single data source sub-tables in the key-value pair format of each data source into a multi-data source wide table in a relational database through SQL-based inter-table association.
Fig. 8 is a schematic configuration diagram of a data processing apparatus in one embodiment. As shown in fig. 8, in one embodiment, a data processing apparatus comprises a processor 81 and a non-transitory computer readable storage medium 82, wherein the non-transitory computer readable storage medium 82 has stored therein instructions executable by the processor 81 to cause the processor 81 to perform the steps of the method shown in fig. 5 or fig. 6 or fig. 7, for example:
reading online production data of a plurality of data sources from a distributed publish-subscribe message system in a streaming processing mode (for example, by invoking a distributed processing engine for streaming data and batch data), wherein the online production data of the plurality of data sources is stored in partitions according to the primary key;
transferring and storing the online production data of each data source (for example, in a distributed column-oriented database) into single data source sub-tables in a key-value pair format, where the specific transfer mode can refer to the mode described above;
periodically merging the single data source sub-tables in the key-value pair format of each data source (for example, through SQL-based association in a relational database) into a multi-data source wide table.
Further, the instructions stored in the non-transitory computer readable storage medium 82, when executed by the processor 81, may further cause the processor 81 to store the multiple data source wide table into the distributed file system and/or convert the multiple data source wide table into a log format for provision to the distributed publish-subscribe message system.
In addition, the data processing device may also include a mass storage medium 84 for storing a distributed column-oriented database, such as HBase, and a relational database, such as an H2 in-memory database.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (20)

1. An apparatus for streaming real-time processing of online production data, comprising:
a data access module that reads online production data of a plurality of data sources from a distributed publish-subscribe message system in a streaming processing mode;
a data buffer module that transfers and stores the online production data of each data source into a single data source sub-table in a key-value pair format; and
a data merging module that periodically merges the single data source sub-tables in the key-value pair format of each data source into a multi-data-source wide table.
2. The apparatus of claim 1, wherein the data access module reads the online production data by invoking a distributed processing engine for streaming data and batch data.
3. The apparatus of claim 1, wherein the online production data of each data source is stored in the distributed publish-subscribe message system in partitions according to a primary key, and the data access module reads the online production data of each data source from each partition in parallel.
4. The apparatus of claim 1, wherein the data buffer module stores the single data source sub-table in the key-value pair format in a distributed column-oriented database.
5. The apparatus of claim 4, wherein the data buffer module sets, according to a configuration file preset for each data source, a primary key identifier, a database row key, an inter-table association field, and whether a modified-content field for update operations exists in the corresponding single data source sub-table.
6. The apparatus of claim 5, wherein:
the data buffer module directly converts newly added data in the online production data into the key-value pair format, converts updated data into the key-value pair format after overwriting the existing data, and converts deleted data into the key-value pair format after adding deletion marks;
the data buffer module stores the newly added data, the updated data, and the deleted data in the key-value pair format into the corresponding single data source sub-table, wherein:
if a plurality of corresponding database row keys exist for the same single data source sub-table, the data buffer module splices the plurality of corresponding database row keys; and
if the modified-content field for update operations exists in the single data source sub-table, the data buffer module performs an update determination.
7. The apparatus of claim 1, wherein the data merging module performs the merging in a relational database to obtain the multi-data-source wide table.
8. The apparatus of claim 1, wherein the data merging module implements the merging into the multi-data-source wide table via inter-table association based on a structured query language.
9. The apparatus of claim 1, wherein the data merging module performs the periodic merging into the multi-data-source wide table according to a predetermined time window.
10. The apparatus of claim 1, further comprising:
a data production module that stores the multi-data-source wide table into a distributed file system and/or converts the multi-data-source wide table into a log format and provides the log format to the distributed publish-subscribe message system.
11. A method of streaming real-time processing of on-line production data, comprising:
reading online production data of a plurality of data sources from a distributed publish-subscribe message system based on a streaming processing mode;
transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format; and
periodically merging the single data source sub-tables in the key-value pair format of each data source into a multi-data-source wide table.
12. The method of claim 11, wherein reading the online production data of the plurality of data sources comprises: reading the online production data by invoking a distributed processing engine for streaming data and batch data.
13. The method of claim 11, wherein the online production data of each data source is stored in the distributed publish-subscribe message system in partitions according to a primary key, and wherein reading the online production data of the plurality of data sources comprises: reading the online production data of each data source in parallel from each partition.
14. The method of claim 11, wherein transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format comprises: storing the single data source sub-table in the key-value pair format in a distributed column-oriented database.
15. The method of claim 14, wherein transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format comprises:
setting, according to a configuration file preset for each data source, a primary key identifier, a database row key, an inter-table association field, and whether a modified-content field for update operations exists in the corresponding single data source sub-table;
directly converting newly added data in the online production data into the key-value pair format, converting updated data into the key-value pair format after overwriting the existing data, and converting deleted data into the key-value pair format after adding deletion marks; and
storing the newly added data, the updated data, and the deleted data in the key-value pair format into the corresponding single data source sub-tables, wherein if a plurality of corresponding database row keys exist for the same single data source sub-table, the plurality of corresponding database row keys are spliced; and if the modified-content field for update operations exists in the single data source sub-table, an update determination is performed.
16. The method of claim 11, wherein merging the single data source sub-tables in the key-value pair format of each data source into a multi-data-source wide table comprises: performing the merging in a relational database to obtain the multi-data-source wide table.
17. The method of claim 11, wherein merging the single data source sub-tables in the key-value pair format of each data source into a multi-data-source wide table comprises: implementing the merging via inter-table association based on a structured query language.
18. The method of claim 11, further comprising:
storing the multi-data-source wide table into a distributed file system, and/or converting the multi-data-source wide table into a log format and providing the log format to the distributed publish-subscribe message system.
19. A data processing apparatus comprising a processor for performing the steps of the method of any one of claims 11 to 18.
20. A non-transitory computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the steps of the method of any one of claims 11 to 18.
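The change handling recited in claims 6 and 15 (direct conversion of newly added data, overwriting for updates, deletion marks for deletes, and splicing of multiple row keys) can be sketched as follows. This is a hypothetical illustration only, not part of the claims: the operation names, the `__deleted__` mark, and the `|` splicing separator are assumptions chosen for the sketch.

```python
# Hypothetical sketch of the insert/update/delete handling for one
# single-data-source sub-table buffered in key-value-pair format.
subtable = {}  # database row key -> key-value record for one data source

def apply_change(op, row_key, fields):
    if op == "insert":
        # newly added data is converted to the key-value pair format directly
        subtable[row_key] = dict(fields)
    elif op == "update":
        # updated data overwrites the buffered record before conversion
        record = subtable.setdefault(row_key, {})
        record.update(fields)
    elif op == "delete":
        # deleted data is kept, but a deletion mark is added so the
        # periodic merge can filter it out
        subtable.setdefault(row_key, {})["__deleted__"] = True

def make_row_key(*parts):
    # when several database row keys correspond to the same sub-table,
    # splice (concatenate) them into a single row key
    return "|".join(str(p) for p in parts)

apply_change("insert", make_row_key("src1", 1), {"id": 1, "qty": 5})
apply_change("update", make_row_key("src1", 1), {"qty": 7})
apply_change("insert", make_row_key("src1", 2), {"id": 2, "qty": 3})
apply_change("delete", make_row_key("src1", 2), {})

# only rows without a deletion mark take part in the wide-table merge
live_rows = {k: v for k, v in subtable.items() if not v.get("__deleted__")}
# live_rows == {'src1|1': {'id': 1, 'qty': 7}}
```

Keeping deleted rows with a mark, rather than removing them, lets the periodic merge distinguish "never seen" from "seen and deleted", which preserves data integrity across merge windows.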
CN201811388835.3A 2018-11-21 2018-11-21 Apparatus and method for streaming real-time processing of on-line production data Pending CN111209278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811388835.3A CN111209278A (en) 2018-11-21 2018-11-21 Apparatus and method for streaming real-time processing of on-line production data


Publications (1)

Publication Number Publication Date
CN111209278A true CN111209278A (en) 2020-05-29

Family

ID=70783971



Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108845794A (en) * 2018-05-16 2018-11-20 山东浪潮商用系统有限公司 A kind of streaming operation frame, method, readable medium and storage control
CN112100147A (en) * 2020-07-27 2020-12-18 杭州玳数科技有限公司 Method and system for realizing real-time acquisition from Bilog to HIVE based on Flink
CN112256700A (en) * 2020-10-19 2021-01-22 北京字节跳动网络技术有限公司 Data storage method and device, electronic equipment and computer readable storage medium
CN113468246A (en) * 2021-07-20 2021-10-01 上海齐屹信息科技有限公司 Intelligent data counting and subscribing system and method based on OLTP



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination