CN111209278A - Apparatus and method for streaming real-time processing of on-line production data - Google Patents

Apparatus and method for streaming real-time processing of on-line production data


Publication number
CN111209278A
Authority
CN
China
Prior art keywords
data
data source
key
value pair
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811388835.3A
Other languages
Chinese (zh)
Inventor
任文治
袁建军
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201811388835.3A
Publication of CN111209278A
Legal status: Pending

Abstract

The invention discloses an apparatus and a method for processing online production data in a streaming real-time manner. In the invention, the online production data of a plurality of data sources, read in real time in a streaming processing mode, is transferred and stored as single data source sub-tables in a key-value pair format, and the single data source sub-tables of each data source are periodically merged into a multi-data source wide table to realize the integration of multiple data sources. In this way, the integration of multiple data sources can be realized with low development cost and high consumption efficiency while ensuring data integrity, so as to meet the requirement for data timeliness. Moreover, the multi-data source wide table can be provided to a distributed file system to realize relatively simple data checking, and can also be fed back to a distributed publish-subscribe message system for use by downstream services.

Description

Apparatus and method for streaming real-time processing of on-line production data
Technical Field
The present invention relates to the field of big data processing, and more particularly, to an apparatus for streaming real-time processing of online production data, a method for streaming real-time processing of online production data, a data processing device, and a non-transitory computer-readable storage medium.
Background
With the development of big data, the amount of data that needs to be processed keeps increasing. In this situation, the traditional offline T+1 processing mode can hardly guarantee timeliness, and hardly meets the business requirement for real-time data.
Disclosure of Invention
One embodiment of the present invention provides an apparatus for streaming real-time processing of online production data, comprising:
the data access module reads online production data of a plurality of data sources from the distributed publish-subscribe message system based on a streaming processing mode;
the data buffer module transfers and stores the online production data of each data source into a single data source sub-table in a key-value pair format;
and the data merging module periodically merges the single data source sub-tables in the key-value pair format of each data source into a multi-data source wide table.
Optionally, the data access module reads the online production data by invoking a distributed processing engine for streaming data and batch data.
Optionally, the online production data of each data source is stored in a distributed publish-subscribe message system in a partition by a primary key, and the data access module reads the online production data of each data source from each partition in parallel.
Optionally, the data buffer module caches a single data source sub-table in key-value pair format in a distributed column-oriented database.
Optionally, the data buffering module sets a primary key identifier, a database row key, an inter-table association field, and whether a modified content field exists for an update operation in a corresponding single data source sub-table according to a configuration file preset for each data source.
Optionally, the data buffer module directly converts newly added data in the online production data into a key-value pair format, converts updated data into the key-value pair format after data coverage, and converts deleted data into the key-value pair format after adding deletion marks; the data buffer module stores the newly added data, the updated data and the deleted data in the key-value pair format into the corresponding single data source sub-tables, wherein, if a plurality of corresponding database row keys exist in the same single data source sub-table, the data buffer module splices the plurality of corresponding database row keys; and if the modified content field for the update operation exists in the single data source sub-table, the data buffer module performs an update judgment.
Optionally, the data merging module merges in a relational database to obtain the multi-data source wide table.
Optionally, the data merging module realizes the merging into the multi-data source wide table through inter-table association based on a structured query language.
Optionally, the data merging module periodically merges the data in a preset time window to obtain the multi-data source wide table.
Optionally, further comprising: a data production module which stores the multi-data source wide table into a distributed file system and/or converts the multi-data source wide table into a log format and provides it to the distributed publish-subscribe message system.
Another embodiment of the present invention provides a method for streaming real-time processing of online production data, comprising:
reading online production data of a plurality of data sources from a distributed publish-subscribe message system based on a streaming processing mode;
transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format;
periodically merging the single data source sub-tables of the key-value pair format of each data source into a multi-data source wide table.
Optionally, reading the online production data of the plurality of data sources comprises: reading the online production data by invoking a distributed processing engine for streaming data and batch data.
Optionally, the online production data of each data source is stored in a distributed publish-subscribe message system in a partitioned manner according to the primary key, and reading the online production data of a plurality of data sources includes: the on-line production data of each data source is read in parallel from each partition.
Optionally, transferring the online production data of each data source into a single data source sub-table in a key-value pair format includes: caching the single data source sub-table in the key-value pair format in a distributed column-oriented database.
Optionally, transferring the online production data of each data source into a single data source sub-table in a key-value pair format includes: setting, according to a configuration file preset for each data source, a primary key identifier, a database row key, an inter-table association field, and whether a modified content field for the update operation exists in the corresponding single data source sub-table; directly converting newly added data in the online production data into the key-value pair format, converting updated data into the key-value pair format after data coverage, and converting deleted data into the key-value pair format after adding deletion marks; and storing the newly added data, the updated data and the deleted data in the key-value pair format into the corresponding single data source sub-tables, wherein if a plurality of corresponding database row keys exist in the same single data source sub-table, the plurality of corresponding database row keys are spliced, and if the modified content field for the update operation exists in the single data source sub-table, an update judgment is performed.
Optionally, merging the single data source sub-tables in the key-value pair format of each data source into the multi-data source wide table includes: merging in a relational database to obtain the multi-data source wide table.
Optionally, merging the single data source sub-tables in the key-value pair format of each data source into the multi-data source wide table includes: realizing the merging into the multi-data source wide table through inter-table association based on a structured query language.
Optionally, further comprising: storing the multi-data source wide table into a distributed file system, and/or converting the multi-data source wide table into a log format and providing it to the distributed publish-subscribe message system.
Another embodiment of the invention provides a data processing apparatus comprising a processor for performing the steps of the method as described above.
Another embodiment of the invention provides a non-transitory computer readable storage medium storing instructions for the processor to perform the steps of the method as described above.
As can be seen from the above, based on the above embodiments, the online production data of multiple data sources read in real time in a streaming processing mode may be transferred and stored as single data source sub-tables in a key-value pair format, and the single data source sub-tables of each data source may be periodically merged into a multi-data source wide table to implement integration of multiple data sources. Integration of multiple data sources can thus be realized with high consumption efficiency while ensuring data integrity, so as to meet the requirement for data timeliness. Moreover, as a further continuation, the multi-data source wide table can be provided to a distributed file system to achieve relatively simple data checking, and can also be fed back to the distributed publish-subscribe message system for use by downstream services.
Drawings
FIG. 1 is a schematic diagram showing an exemplary configuration of an apparatus for streaming real-time processing of online production data in one embodiment;
FIG. 2 is a schematic diagram of a data access module in the apparatus shown in FIG. 1;
FIG. 3 is a schematic diagram of a data buffering module in the apparatus shown in FIG. 1;
FIG. 4 is a schematic diagram of a data merge module in the apparatus of FIG. 1;
FIG. 5 is a schematic flow chart diagram illustrating a method for streaming real-time processing of online production data in one embodiment;
FIG. 6 is an expanded flow diagram of the method of FIG. 5;
FIG. 7 is a flow diagram of one embodiment of the method of FIG. 5;
fig. 8 is a schematic configuration diagram of a data processing apparatus in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
FIG. 1 is a schematic diagram showing an exemplary configuration of an apparatus for streaming real-time processing of online production data in one embodiment. Referring to fig. 1, in one embodiment, an apparatus for streaming real-time processing of online production data may include: a data access module 11, a data buffer module 12, a data merge module 13, and a data production module 14.
The data access module 11 may read the online production data of the plurality of data sources from the distributed publish-subscribe message system 20 based on a streaming process.
The distributed publish-subscribe message system 20 described above may be a high-throughput open-source streaming platform such as Kafka, with the capability to process all the action stream data of multiple data sources. The online production data accessible in the distributed publish-subscribe message system 20 may be in a log format, such as a binary log (binlog) included in topic (Topic) data.
Also, the data access module 11 may invoke a distributed processing engine for streaming data and batch data, such as Flink, to read the online production data from the distributed publish-subscribe message system 20 through the invoked distributed processing engine. Compared with Storm, which is commonly used in traditional streaming processing, Flink offers more functions, such as machine learning, batch processing and stream processing, and its data consumption efficiency is higher.
In addition, the distributed publish-subscribe message system 20 may store the online production data of each data source in a partitioned manner according to the primary key, and accordingly, the online production data of each data source is read from each partition in parallel, so as to improve the efficiency of data consumption and ensure the integrity and consistency of the data of the same primary key at the same time.
Fig. 2 is a schematic diagram of the data access module in the apparatus shown in fig. 1. Referring to fig. 2 in conjunction with fig. 1, assume that the distributed publish-subscribe message system 20 is connected to M data sources 20_1 to 20_M (M is a positive integer greater than 1), and that the online production data of each of the M data sources 20_1 to 20_M has at most N primary keys (N is a positive integer greater than 1); then, for each of the M data sources 20_1 to 20_M, there are at most N primary-key partitions in the distributed publish-subscribe message system 20. Accordingly, for the online production data of each of the M data sources 20_1 to 20_M, the data access module 11 can read the corresponding partitions in parallel.
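The parallel per-partition reading described above can be sketched as follows. This is a simplified illustration only, not the patented implementation: the partition contents, primary-key names and the `consume` helper are hypothetical stand-ins for a real Kafka/Flink consumer.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for primary-key partitions in the publish-subscribe
# system: each partition holds the ordered change records of one primary key.
partitions = {
    "pk_a": [{"opt": "INSERT", "id": 1}, {"opt": "UPDATE", "id": 1}],
    "pk_b": [{"opt": "INSERT", "id": 2}],
}

def consume(name, records):
    # Read one partition sequentially: ordering within a primary key is
    # preserved, keeping data of the same key complete and consistent.
    return name, list(records)

def read_all(parts):
    # Partitions are consumed in parallel to raise consumption efficiency.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        futures = [pool.submit(consume, n, r) for n, r in parts.items()]
        return dict(f.result() for f in futures)

result = read_all(partitions)
```

Reading partitions concurrently while keeping each partition's internal order is what lets throughput scale without breaking per-key consistency.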
Fig. 3 is a schematic diagram of the data buffering module in the apparatus shown in fig. 1. Referring to fig. 3 in conjunction with fig. 1, the data buffer module 12 may transfer and store the online production data of each data source into a single data source sub-table in a key-value pair format. The data buffer module 12 may store the single data source sub-tables in the key-value pair format in a distributed column-oriented database such as the Hadoop database (HBase); the online production data of each data source corresponds to one single data source sub-table, so the number of single data source sub-tables may be M.
For the single data source sub-table 400 corresponding to each data source, a corresponding configuration file 40 may be prepared in advance, and the data buffering module may set a primary key identifier 41, a database row key (Rowkey) 42, a modified content field (UpdateCol) 43 for the update operation, and an inter-table association field 44 in the corresponding single data source sub-table 400 according to the configuration file 40 preset for each data source. The modified content field 43 for the update operation is an optional configuration item; whether the corresponding data needs a modification check can be determined according to whether this configuration item exists in the single data source sub-table 400.
The configuration files 40 of different data sources may contain the same configuration items, but the specific contents of the configuration items may differ from each other. For example, the configuration file 40 of each data source sets a configuration item for the primary key of the single data source sub-table, but the values of the primary key in this configuration item may be different.
The configuration file 40 is loaded to configure the single data source sub-table 400, so that the single data source sub-table 400 can be easily adjusted as required. That is, the single data source sub-table 400 can be flexibly adjusted by changing only the configuration file 40, without changing the program logic of the data buffer module 12.
The single data source sub-table 400 configured according to the configuration file 40 may be stored in a cache medium 121, such as a Google Guava cache, for subsequent reading.
Still referring to fig. 3, taking HBase as an example of the distributed column-oriented database 120, the operation types corresponding to the online production data 50 may include three types, i.e., insert, update and delete, where the delete operation may be a logical delete. Accordingly:
for newly added data corresponding to an insert operation in the online production data 50, the data buffer module 12 may directly perform key-value pair format conversion 53 on it;
for updated data corresponding to an update operation in the online production data 50, the data buffer module 12 may perform the key-value pair format conversion 53 after performing data coverage 51 on it, where data coverage 51 means overwriting the original data with the latest data;
for deleted data, the data buffer module 12 may perform the key-value pair format conversion after performing deletion marker addition 52 on it.
After the format conversion, the data buffer module 12 reads the single data source sub-table 400 configuration from the cache medium 121, and may store the newly added data, updated data and deleted data in the key-value pair format into the corresponding single data source sub-table 400 stored in the distributed column-oriented database 120.
If the database row key 42 configured for the same single data source sub-table 400 contains more than one Rowkey, then, to match the storage format of HBase, the data buffer module 12 needs to perform row key concatenation 54, i.e., concatenate the multiple corresponding Rowkeys together with a symbol such as an underscore "_".
In addition, if the modified content field 43 for the update operation exists in the single data source sub-table 400, the data buffer module 12 also needs to make an update decision 55. For example, the data buffer module 12 first determines whether the business operation time corresponding to the current data is later than or equal to the business operation time of the data stored in the corresponding single data source sub-table 400 in HBase 120. If so, the current data is the latest data, and the stored data in the corresponding single data source sub-table 400 in HBase 120 is updated to keep the data consistent; otherwise, the corresponding single data source sub-table 400 in HBase 120 is not updated, and the current data may be regarded as dirty data that has been superseded and may be discarded.
By analogy, HBase 120 stores all the single data source sub-tables 400 according to this rule in real time.
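A minimal sketch of the conversion and storage rules above — direct conversion for inserts, data coverage for updates, a logical-delete marker for deletes, and underscore splicing of multiple configured row keys. The field names and the in-memory dict standing in for an HBase sub-table are assumptions for illustration only.

```python
table = {}  # stand-in for one single data source sub-table in HBase

def make_rowkey(record, rowkey_fields):
    # Row key concatenation 54: multiple configured Rowkeys are spliced
    # together with an underscore to match HBase's storage format.
    return "_".join(str(record[f]) for f in rowkey_fields)

def apply_change(record, opt, rowkey_fields):
    key = make_rowkey(record, rowkey_fields)
    value = dict(record)            # key-value pair format conversion 53
    if opt == "UPDATE":
        table[key] = value          # data coverage 51: latest overwrites old
    elif opt == "DELETE":
        value["deleted"] = True     # deletion marker addition 52 (logical delete)
        table[key] = value
    else:                           # INSERT: store directly
        table[key] = value

apply_change({"pk": 16777921, "pid": 8226277, "qty": 1}, "INSERT", ["pk", "pid"])
apply_change({"pk": 16777921, "pid": 8226277, "qty": 2}, "UPDATE", ["pk", "pid"])
```

After these two changes the sub-table holds one row under the spliced key `16777921_8226277`, with the update's values covering the insert's.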
The business operation time can be judged by comparing the mid sequences in the key-value pair format; see the following example:
it is assumed that the key value of the key value pair format on-line production data 1 includes a mid sequence having a value of "200135521945" and an opt field of "INSERT" indicating a new addition operation, and the key value of the key value pair format on-line production data 1 is 16777921_80463060328_ 8226277.
Further, assume that the key value of the key-value pair format on-line production data 2 is 16777921_80463060328_8226277, and the key value of the key also includes a mid sequence having a value of "200135521951" and an opt field representing an UPDATE operation "UPDATE".
The primary keys of the two on-line production data 1 and 2 are each composed of a primary key identification, an old identification, and a product ID, and the values of the primary keys of the two on-line production data 1 and 2 are the same.
The key value of the two online production data 1 and 2 is in JSON (JSON object notation) string format, which includes the contents of the online production data 1 and 2.
As can be seen from the above, the opt field of online production data 1 is "INSERT", indicating an insert operation, and the opt field of online production data 2 is "UPDATE", indicating an update operation. Because the modified content field (UpdateCol) for the update operation is configured, the data buffer module 12 determines that it must check, through the mid sequence that increases over time, whether online production data 2 is indeed newer data, i.e., the update decision 55.
Since the mid sequence "200135521951" of online production data 2 indicates a business operation time later than that indicated by the mid sequence "200135521945" of online production data 1 stored in the corresponding single data source sub-table 400 in HBase 120, for the above example the data buffer module 12 recognizes, through the update decision 55 on the mid sequence, that online production data 2 is newer data, and uses it to update online production data 1 stored in the corresponding single data source sub-table 400 in HBase 120.
However, if the operation time indicated by the mid sequence of online production data 2 were not later than, or even earlier than, the operation time indicated by the mid sequence of online production data 1 stored in the corresponding single data source sub-table 400 in HBase 120, then, even though the opt field of online production data 2 is "UPDATE" indicating an update operation, the data buffer module 12 would determine through the update decision 55 on the mid sequence that online production data 2 is dirty data that has been superseded.
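The update decision on the mid sequence can be expressed as a simple comparison. Since mid grows with business operation time and the example mids have equal length, a plain string comparison suffices; this is an illustrative sketch, not the patented code.

```python
def update_decision(stored_mid, incoming_mid):
    # Accept the incoming record only if its business operation time is
    # later than or equal to that of the stored record; otherwise it is
    # dirty (superseded) data and is discarded.
    return incoming_mid >= stored_mid

# mids from the worked example above
newer = update_decision("200135521945", "200135521951")  # data 2 is newer: update
stale = update_decision("200135521951", "200135521945")  # would be discarded
```

A real implementation would need to guarantee that mids are comparable (e.g. fixed width or numeric) across all records of a key; that property is assumed here.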
Fig. 4 is a schematic diagram of the data merging module in the apparatus shown in fig. 1. Referring to fig. 4 in conjunction with fig. 1, the data merging module 13 may periodically merge the single data source sub-tables 400 in the key-value pair format of each data source into a multi-data source wide table 600.
Specifically, the data merging module 13 may periodically, in a preset time window (e.g., a Flink time window), obtain the data stored in the single data source sub-tables 400 from the distributed column-oriented database (e.g., HBase) 120, merge the data in a relational database such as the H2 in-memory database to obtain the multi-data source wide table 600, and finally insert the multi-data source wide table 600 into the distributed column-oriented database (e.g., HBase) 120.
The length of the time window can be set arbitrarily, so as to adjust the degree of real-time processing. The length of the time window can be measured based on time or on the amount of data, and is generally adapted to millisecond-level processing speeds.
Moreover, the merging in a relational database such as the H2 in-memory database may be implemented through inter-table association between the single data source sub-tables 400 based on the Structured Query Language (SQL). Compared with implementing the table associations in Java, the SQL-based association can save development cost and shorten the development cycle; in addition, since offline data is also produced with SQL processing, the real-time data and the offline data can be kept consistent more easily.
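The SQL-based inter-table association might look like the following sketch, which uses Python's sqlite3 as a stand-in for the H2 in-memory database named in the text; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two single data source sub-tables, related through the inter-table
# association field product_id.
conn.execute("CREATE TABLE orders (order_id TEXT, product_id TEXT)")
conn.execute("CREATE TABLE products (product_id TEXT, name TEXT)")
conn.execute("INSERT INTO orders VALUES ('o1', 'p1')")
conn.execute("INSERT INTO products VALUES ('p1', 'widget')")

# One SQL join over the association field yields the multi-data source
# wide table, with columns drawn from both sources.
wide = conn.execute(
    "SELECT o.order_id, o.product_id, p.name "
    "FROM orders o JOIN products p ON o.product_id = p.product_id"
).fetchall()
```

Expressing the merge as one declarative join, rather than hand-written Java association code, is exactly the development-cost saving the paragraph above points to.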
In addition, the association between the single data source sub-tables 400 may be determined by the above-mentioned inter-table association fields; that is, the single data source sub-tables 400 corresponding to some primary keys of some data sources may be associated and merged into one multi-data source wide table 600, or all the single data source sub-tables 400 corresponding to all primary keys may be associated and merged into one multi-data source wide table 600.
Referring back to fig. 1, the data production module 14 may store the multi-data source wide table into the distributed file system 30; that is, the data production module may persist the table to the distributed file system 30 for checking against the offline data. For example, the distributed file system 30 may be the Hadoop Distributed File System (HDFS). And/or, the data production module 14 may convert the multi-data source wide table into a log format and provide it to the distributed publish-subscribe message system (e.g., Kafka) 20, for example for display on a large screen, so that downstream services can continue to consume it through the distributed publish-subscribe message system (e.g., Kafka) 20.
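Converting wide-table rows into a log format for the publish-subscribe system could be sketched as below; JSON lines are an assumption here, chosen because the accessed binlog data described earlier is also log-formatted, and the column names are illustrative.

```python
import json

def to_log_lines(rows, columns):
    # Serialize each merged wide-table row as one JSON log line that a
    # distributed publish-subscribe system such as Kafka could carry.
    return [json.dumps(dict(zip(columns, row))) for row in rows]

lines = to_log_lines([("o1", "p1", "widget")],
                     ["order_id", "product_id", "name"])
```

Each emitted line is self-describing, so downstream consumers can parse it without knowing the wide table's schema in advance.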
FIG. 5 is a schematic flow diagram illustrating a method for streaming real-time processing of online production data in one embodiment. Referring to fig. 5, in one embodiment, a method of streaming real-time processing of online production data may include:
S71, reading the online production data of a plurality of data sources from the distributed publish-subscribe message system in a streaming processing mode;
S72, transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format;
and S73, periodically merging the single data source sub-tables in the key-value pair format of each data source into a multi-data source wide table.
Based on the above process, the online production data of multiple data sources read in S71 can be saved as single data source sub-tables in the key-value pair format in S72, and the single data source sub-tables of each data source can be periodically merged into a multi-data source wide table in S73 to realize integration of multiple data sources, so that integration of multiple data sources can be realized with high consumption efficiency while ensuring data integrity.
Fig. 6 is an expanded flow diagram of the method shown in fig. 5. Compared to fig. 5, the flow shown in fig. 6 further includes the following steps after S73:
S75, storing the multi-data source wide table into the distributed file system;
and S76, converting the multi-data source wide table into a log format and providing it to the distributed publish-subscribe message system.
Thus, as a further continuation of the flow shown in FIG. 5, the multi-data source wide table may be provided to the distributed file system to enable relatively simple data checking, and may also be fed back to the distributed publish-subscribe message system for downstream business use.
FIG. 7 is a flow chart of an embodiment of the method shown in FIG. 5. In fig. 7, the method of streaming real-time processing on-line production data in this embodiment may be embodied as including the steps of:
and S710, reading the online production data stored by each data source according to the main key partition in parallel from the distributed publishing and subscribing message system by calling the distributed processing engine aiming at the stream data and the batch data.
S720, transferring and storing the online production data of each data source, in a distributed column-oriented database, into single data source sub-tables in a key-value pair format.
Wherein, S720 may specifically include:
S721, setting, according to a configuration file preset for each data source, a primary key identifier, a database row key, an inter-table association field, and whether a modified content field for the update operation exists in the corresponding single data source sub-table;
S722, directly converting newly added data in the online production data into the key-value pair format, converting updated data into the key-value pair format after data coverage, and converting deleted data into the key-value pair format after adding deletion marks;
and S723, storing the newly added data, the updated data and the deleted data in the key-value pair format into the corresponding single data source sub-tables, wherein if a plurality of corresponding database row keys exist in the same single data source sub-table, the plurality of corresponding database row keys are spliced, and if the modified content field for the update operation exists in the single data source sub-table, an update judgment is performed.
And S730, merging the single data source sub-tables in the key-value pair format of each data source into a multi-data source wide table in a relational database through SQL-based inter-table association.
Fig. 8 is a schematic configuration diagram of a data processing apparatus in one embodiment. As shown in fig. 8, in one embodiment, a data processing apparatus comprises a processor 81 and a non-transitory computer readable storage medium 82, wherein the non-transitory computer readable storage medium 82 has stored therein instructions executable by the processor 81 to cause the processor 81 to perform the steps of the method shown in fig. 5 or fig. 6 or fig. 7, for example:
reading online production data of a plurality of data sources from a distributed publish-subscribe message system in a streaming processing mode (for example, by invoking a distributed processing engine for streaming data and batch data), wherein the online production data of the plurality of data sources is stored in partitions according to the primary key;
transferring and storing the online production data of each data source (for example, in a distributed column-oriented database) into single data source sub-tables in a key-value pair format, where the specific transfer mode can refer to the mode described above;
periodically merging the single data source sub-tables in the key-value pair format of each data source (for example, through SQL-based association in a relational database) into a multi-data source wide table.
Further, the instructions stored in the non-transitory computer readable storage medium 82, when executed by the processor 81, may further cause the processor 81 to store the multiple data source wide table into the distributed file system and/or convert the multiple data source wide table into a log format for provision to the distributed publish-subscribe message system.
In addition, the data processing device may also include a mass storage medium 84 for storing a distributed column-oriented database, such as HBase, and a relational database, such as an H2 in-memory database.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (20)

1. An apparatus for streaming real-time processing of online production data, comprising:
a data access module that reads online production data of a plurality of data sources from a distributed publish-subscribe message system in a streaming processing mode;
a data buffer module that transfers and stores the online production data of each data source into a single data source sub-table in a key-value pair format; and
a data merging module that periodically merges the single data source sub-tables in the key-value pair format of each data source into a multi-data-source wide table.
2. The apparatus of claim 1, wherein the data access module reads the online production data by invoking a distributed processing engine for streaming data and batch data.
3. The apparatus of claim 1, wherein the online production data of each data source is stored in the distributed publish-subscribe message system in partitions according to a primary key, and the data access module reads the online production data of each data source from each partition in parallel.
4. The apparatus of claim 1, wherein the data buffer module stores the single data source sub-table in the key-value pair format in a distributed column-oriented database.
5. The apparatus of claim 4, wherein the data buffer module sets, according to a configuration file preset for each data source, a primary key identifier, a database row key, an inter-table association field, and whether a modified-content field for update operations exists in the corresponding single data source sub-table.
6. The apparatus of claim 5, wherein:
the data buffer module directly converts newly added data in the online production data into the key-value pair format, converts updated data into the key-value pair format after overwriting the existing data, and converts deleted data into the key-value pair format after adding deletion marks;
the data buffer module stores the newly added data, the updated data, and the deleted data in the key-value pair format into the corresponding single data source sub-table, wherein:
if a plurality of corresponding database row keys exist for the same single data source sub-table, the data buffer module splices the plurality of corresponding database row keys; and
if the modified-content field for update operations exists in the single data source sub-table, the data buffer module performs an update determination.
7. The apparatus of claim 1, wherein the data merging module performs the merging in a relational database to obtain the multi-data-source wide table.
8. The apparatus of claim 1, wherein the data merging module implements the merging into the multi-data-source wide table via inter-table association based on a structured query language.
9. The apparatus of claim 1, wherein the data merging module performs the periodic merging into the multi-data-source wide table according to a predetermined time window.
10. The apparatus of claim 1, further comprising:
a data production module that stores the multi-data-source wide table into a distributed file system and/or converts the multi-data-source wide table into a log format and provides the log format to the distributed publish-subscribe message system.
11. A method of streaming real-time processing of on-line production data, comprising:
reading online production data of a plurality of data sources from a distributed publish-subscribe message system based on a streaming processing mode;
transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format; and
periodically merging the single data source sub-tables in the key-value pair format of each data source into a multi-data-source wide table.
12. The method of claim 11, wherein reading the online production data of the plurality of data sources comprises: reading the online production data by invoking a distributed processing engine for streaming data and batch data.
13. The method of claim 11, wherein the online production data of each data source is stored in the distributed publish-subscribe message system in partitions according to a primary key, and wherein reading the online production data of the plurality of data sources comprises: reading the online production data of each data source in parallel from each partition.
14. The method of claim 11, wherein transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format comprises: storing the single data source sub-table in the key-value pair format in a distributed column-oriented database.
15. The method of claim 14, wherein transferring and storing the online production data of each data source into a single data source sub-table in a key-value pair format comprises:
setting, according to a configuration file preset for each data source, a primary key identifier, a database row key, an inter-table association field, and whether a modified-content field for update operations exists in the corresponding single data source sub-table;
directly converting newly added data in the online production data into the key-value pair format, converting updated data into the key-value pair format after overwriting the existing data, and converting deleted data into the key-value pair format after adding deletion marks; and
storing the newly added data, the updated data, and the deleted data in the key-value pair format into the corresponding single data source sub-tables, wherein if a plurality of corresponding database row keys exist for the same single data source sub-table, the plurality of corresponding database row keys are spliced; and if the modified-content field for update operations exists in the single data source sub-table, an update determination is performed.
16. The method of claim 11, wherein merging the single data source sub-tables in the key-value pair format of each data source into a multi-data-source wide table comprises: performing the merging in a relational database to obtain the multi-data-source wide table.
17. The method of claim 11, wherein merging the single data source sub-tables in the key-value pair format of each data source into a multi-data-source wide table comprises: implementing the merging via inter-table association based on a structured query language.
18. The method of claim 11, further comprising:
storing the multi-data-source wide table into a distributed file system, and/or converting the multi-data-source wide table into a log format and providing the log format to the distributed publish-subscribe message system.
19. A data processing apparatus comprising a processor for performing the steps of the method of any one of claims 11 to 18.
20. A non-transitory computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the steps of the method of any one of claims 11 to 18.
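The change handling recited in claims 6 and 15 (direct conversion of newly added data, overwriting for updates, deletion marks for deletes, and splicing of multiple row keys) can be sketched as follows. This is a hypothetical illustration only, not part of the claims: the operation names, the `__deleted__` mark, and the `|` splicing separator are assumptions chosen for the sketch.

```python
# Hypothetical sketch of the insert/update/delete handling for one
# single-data-source sub-table buffered in key-value-pair format.
subtable = {}  # database row key -> key-value record for one data source

def apply_change(op, row_key, fields):
    if op == "insert":
        # newly added data is converted to the key-value pair format directly
        subtable[row_key] = dict(fields)
    elif op == "update":
        # updated data overwrites the buffered record before conversion
        record = subtable.setdefault(row_key, {})
        record.update(fields)
    elif op == "delete":
        # deleted data is kept, but a deletion mark is added so the
        # periodic merge can filter it out
        subtable.setdefault(row_key, {})["__deleted__"] = True

def make_row_key(*parts):
    # when several database row keys correspond to the same sub-table,
    # splice (concatenate) them into a single row key
    return "|".join(str(p) for p in parts)

apply_change("insert", make_row_key("src1", 1), {"id": 1, "qty": 5})
apply_change("update", make_row_key("src1", 1), {"qty": 7})
apply_change("insert", make_row_key("src1", 2), {"id": 2, "qty": 3})
apply_change("delete", make_row_key("src1", 2), {})

# only rows without a deletion mark take part in the wide-table merge
live_rows = {k: v for k, v in subtable.items() if not v.get("__deleted__")}
# live_rows == {'src1|1': {'id': 1, 'qty': 7}}
```

Keeping deleted rows with a mark, rather than removing them, lets the periodic merge distinguish "never seen" from "seen and deleted", which preserves data integrity across merge windows.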
CN201811388835.3A 2018-11-21 2018-11-21 Apparatus and method for streaming real-time processing of on-line production data Pending CN111209278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811388835.3A CN111209278A (en) 2018-11-21 2018-11-21 Apparatus and method for streaming real-time processing of on-line production data


Publications (1)

Publication Number Publication Date
CN111209278A true CN111209278A (en) 2020-05-29

Family

ID=70783971



Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108845794A (en) * 2018-05-16 2018-11-20 山东浪潮商用系统有限公司 A kind of streaming operation frame, method, readable medium and storage control
CN112100147A (en) * 2020-07-27 2020-12-18 杭州玳数科技有限公司 Method and system for realizing real-time acquisition from Bilog to HIVE based on Flink
CN112256700A (en) * 2020-10-19 2021-01-22 北京字节跳动网络技术有限公司 Data storage method and device, electronic equipment and computer readable storage medium
CN113468246A (en) * 2021-07-20 2021-10-01 上海齐屹信息科技有限公司 Intelligent data counting and subscribing system and method based on OLTP



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination