CN116842055A - System and method for integrated processing of internet of things data batch flow - Google Patents

System and method for integrated processing of internet of things data batch flow

Info

Publication number
CN116842055A
CN116842055A (application number CN202310787447.7A)
Authority
CN
China
Prior art keywords
data
layer
lake
processing
buffer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310787447.7A
Other languages
Chinese (zh)
Inventor
路培杰
曾光
刘文虎
周志忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Yungu Technology Co Ltd
Original Assignee
Zhongke Yungu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Yungu Technology Co Ltd filed Critical Zhongke Yungu Technology Co Ltd
Priority claimed from application CN202310787447.7A
Publication of CN116842055A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/22: Indexing; Data structures therefor; Storage structures
    • G06F 16/2282: Tablespace storage structures; Management thereof
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/24552: Database cache management
    • G06F 16/24568: Data stream processing; Continuous queries
    • G06F 16/2457: Query processing with adaptation to user needs
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/546: Message passing systems or structures, e.g. queues
    • G06F 2209/00: Indexing scheme relating to G06F9/00
    • G06F 2209/54: Indexing scheme relating to G06F9/54
    • G06F 2209/548: Queue

Abstract

The application discloses a system and method for integrated batch-stream processing of internet of things data. The system comprises: a cloud platform, configured to receive working data uploaded by edge-side internet of things devices and store it in a Kafka message queue; a Flink computing engine, in communication with the cloud platform and the data lake, configured to process the message data in the Kafka message queue, store it in the data lake, and obtain snapshot information at a set moment from the data lake for data analysis according to analysis requirements; and a data lake, configured to store the data of each stage of the processing pipeline in layers. By integrating the Flink computing engine with the data lake, and building on the data lake's snapshot mechanism and Flink's support for both bounded and unbounded data streams, the application achieves centralized storage and unified computation of internet of things data, effectively guarantees data consistency, and reduces the complexity of the system architecture.

Description

System and method for integrated processing of internet of things data batch flow
Technical Field
The application relates to the technical field of data processing, and in particular to a system and method for integrated batch-stream processing of internet of things data.
Background
In the field of the industrial internet, with the advent of scenarios such as the interconnection of everything and human-machine interaction, massive numbers of industrial devices need to be connected to a cloud platform via the internet, 5G, and similar means. Large amounts of internet of things data are therefore transmitted from the edge side to the cloud platform in real time, and operation instructions are sent back to the edge through the cloud platform. The data transmitted to the cloud platform must undergo statistical analysis and data mining in many different forms in order to enable the business and guide operational decisions.
To meet the data analysis requirements of massive internet of things data in different scenarios, the current common practice in the industry is to introduce a different big data component for each scenario. For example: statistical analysis by day or month uses the Spark computing engine with a Hive data warehouse; the latest working-condition data of industrial equipment is recorded with the Cassandra non-relational database and the Elasticsearch index database; historical working-condition data of industrial equipment is stored in the HBase non-relational database; and real-time analysis of device internet of things data uses the Flink real-time computing engine, the Kafka message queue, the MPP database DorisDB, and other storage systems that support real-time analysis requirements. The overall analysis process can therefore involve several complex distributed data storage systems and distributed computing engines. Operating and maintaining such a big data analysis platform requires a large technical investment; its operation-and-maintenance cost and technical complexity are high, and small and medium-sized enterprises generally cannot maintain and operate such a complex system. At the same time, the same piece of internet of things data must be stored repeatedly in different storage systems to satisfy the different analysis scenarios, which causes substantial storage waste and risks data inconsistency, and there is currently no good solution for these scenarios.
Therefore, the architectures adopted in the prior art for massive internet of things data analysis are complex in design, high in both operation-and-maintenance and economic cost, and carry a risk of data inconsistency.
Disclosure of Invention
The embodiments of the application aim to provide a system and method for integrated batch-stream processing of internet of things data, to solve the problems of the prior art: complex architecture design for massive internet of things data analysis, high operation-and-maintenance and economic costs, and the risk of data inconsistency.
To achieve the above object, a first aspect of the present application provides a system for integrated batch-stream processing of internet of things data, the system comprising:
a cloud platform, configured to receive the working data uploaded by edge-side internet of things devices and store the working data in a Kafka message queue;
a Flink computing engine, in communication with the cloud platform and the data lake respectively, configured to perform data processing on the message data in the Kafka message queue, store it in the data lake, and obtain snapshot information at a set moment from the data lake for data analysis according to data analysis requirements;
and a data lake, configured to store the data of each stage of the data processing pipeline in layers.
In an embodiment of the application, the Flink calculation engine is further configured to:
acquiring input instruction parameters;
opening a batch interface to execute batch logic under the condition that the instruction parameter is a batch instruction parameter;
and opening the real-time stream processing interface to execute the real-time stream processing logic in the case that the instruction parameter is the real-time stream processing instruction parameter.
In an embodiment of the application, the system further comprises a query engine, in communication with the data lake, configured to query data in the data lake according to the query requirement;
the data lake is also configured to send the query results of the query engine to the industrial application.
In an embodiment of the application, the data lake includes a data buffer layer, the Flink calculation engine is further configured to:
a data buffer table for storing data is newly built in the data buffer layer;
the message data in the kafka message queue is written into the data buffer table.
In an embodiment of the application, the Flink calculation engine is further configured to:
reading buffer data in a data buffer table;
and processing and storing the buffer data layer by layer based on the storage rules and preset indexes corresponding to each data layer in the data lake.
In an embodiment of the present application, the data lake further includes a data original layer, a data standard layer, a data integration layer, a data application layer, and a dimension layer, and the Flink calculation engine is further configured to:
after regularizing and structurally cleansing the buffer data, storing it in the data original layer;
normalizing the data in the data original layer;
storing the standardized data to a data standard layer;
lightly aggregating the data of the data standard layer according to preset indexes, dividing it by subject, and storing it in the data integration layer;
comprehensively analyzing the data of the data integration layer, the data of the data standard layer and the data of the dimension layer according to preset indexes to obtain an analysis result;
and storing the analysis result into a data application layer.
The second aspect of the application provides a method for integrated processing of data batch of the internet of things, which is applied to a system for integrated processing of data batch of the internet of things, wherein the system comprises a cloud platform, a data lake and a Flink computing engine, and the Flink computing engine is respectively communicated with the cloud platform and the data lake, and the method comprises the following steps:
receiving the working data uploaded by the edge-side internet of things devices through the cloud platform, and storing the working data in a Kafka message queue;
performing data processing on the message data in the Kafka message queue through the Flink computing engine and storing it in the data lake;
data of each stage in the data processing process is stored in a layering manner through a data lake;
And obtaining snapshot files at set time in the data lake according to analysis requirements by the Flink calculation engine to perform data analysis.
In an embodiment of the present application, the method further includes:
acquiring input instruction parameters;
opening the batch processing interface to execute batch processing logic under the condition that the instruction parameters are batch processing instruction parameters;
and opening the real-time stream processing interface to execute the real-time stream processing logic in the case that the instruction parameter is the real-time stream processing instruction parameter.
In an embodiment of the present application, the data lake includes a data buffer layer, and performing data processing on and storing the message data in the Kafka message queue by the Flink computing engine includes:
a data buffer table for storing data is newly built in the data buffer layer;
writing the message data in the Kafka message queue into the data buffer table;
reading buffer data in a data buffer table;
and processing and storing the buffer data layer by layer based on the storage rules and preset indexes corresponding to each data layer in the data lake.
In the embodiment of the application, the data lake further comprises a data original layer, a data standard layer, a data integration layer, a data application layer, and a dimension layer, and processing and storing the buffer data layer by layer based on the storage rules and preset indexes corresponding to each data layer in the data lake comprises the following steps:
after regularizing and structurally cleansing the buffer data, storing it in the data original layer;
normalizing the data in the data original layer;
storing the standardized data to a data standard layer;
lightly aggregating the data of the data standard layer according to preset indexes, dividing it by subject, and storing it in the data integration layer;
comprehensively analyzing the data of the data integration layer, the data of the data standard layer and the data of the dimension layer according to preset indexes to obtain an analysis result;
and storing the analysis result into a data application layer.
Through the above technical scheme, the system for integrated batch-stream processing of internet of things data comprises a cloud platform, a Flink computing engine, and a data lake, the Flink computing engine communicating with the cloud platform and the data lake respectively. The cloud platform receives the working data uploaded by edge-side internet of things devices and stores it in a Kafka message queue; the Flink computing engine processes the message data in the Kafka message queue, stores it in the data lake, and obtains snapshot information at a set moment from the data lake for data analysis according to analysis requirements; the data lake stores the data of each processing stage in layers, comprising a data buffer layer, a data original layer, a data standard layer, a data integration layer, a data application layer, a temporary layer, and a dimension layer. By integrating the Flink computing engine with the data lake, and building on the data lake's snapshot mechanism and Flink's support for bounded and unbounded data streams, the application achieves centralized storage and unified computation of internet of things data, breaks the information islands between the systems of different departments, effectively guarantees data consistency, improves data quality, and greatly reduces system complexity compared with the architecture of a traditional internet of things data analysis system.
Additional features and advantages of embodiments of the application will be set forth in the detailed description which follows.
Drawings
The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain, without limitation, the embodiments of the application. In the drawings:
fig. 1 is a schematic structural diagram of a system for integrated processing of internet of things data batch flow according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data layering process in a data lake according to an embodiment of the present application;
FIG. 3 is a schematic illustration of a scheduling flow of a Flink computing engine according to an embodiment of the present application;
fig. 4 is a flow chart of a method for integrated processing of internet of things data batch flow according to an embodiment of the present application.
Description of the reference numerals
10 Cloud platform; 20 Flink computing engine; 30 Data lake; 40 Query engine; 100 Data buffer layer; 200 Data original layer; 300 Data standard layer; 400 Data integration layer; 500 Data application layer; 600 Dimension layer; 700 Temporary layer; 800 Task scheduling monitoring system; 900 Metadata management data dictionary
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the detailed description described herein is merely for illustrating and explaining the embodiments of the present application, and is not intended to limit the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that, if directional indications (such as up, down, left, right, front, and rear … …) are included in the embodiments of the present application, the directional indications are merely used to explain the relative positional relationship, movement conditions, etc. between the components in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indications are correspondingly changed.
In addition, if there is a description of "first", "second", etc. in the embodiments of the present application, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but it is necessary to base that the technical solutions can be realized by those skilled in the art, and when the technical solutions are contradictory or cannot be realized, the combination of the technical solutions should be considered to be absent and not within the scope of protection claimed in the present application.
Fig. 1 is a schematic structural diagram of a system for integrated processing of internet of things data batch flow according to an embodiment of the present application. As shown in fig. 1, an embodiment of the present application provides a system for integrated processing of internet of things data batch flows, where the system may include:
the cloud platform 10 is configured to receive the work data uploaded by the edge side internet of things device and store the work data to the kafka message queue;
the Flink calculation engine 20 is respectively communicated with the cloud platform 10 and the data lake 30, and is configured to perform data processing on message data in the kafka message queue, store the message data in the data lake 30 and obtain snapshot information at a set moment in the data lake 30 according to data analysis requirements for data analysis;
a data lake 30 configured to hierarchically store data of each stage in the data processing process;
the data lake 30 includes a data buffer layer, a data raw layer, a data standard layer, a data integration layer, a data application layer, a temporary layer, and a dimension layer.
To simplify the complex architecture of current massive internet of things data analysis, reduce operation-and-maintenance and economic costs, improve data quality, and avoid the data consistency problems caused by repeated storage, an embodiment of the application provides a system for integrated batch-stream processing of internet of things data, which may comprise a cloud platform 10, a Flink computing engine 20, and a data lake 30, the cloud platform 10 communicating with the Flink computing engine 20, and the data lake 30 communicating with the Flink computing engine 20.
In the embodiment of the present application, the cloud platform 10 generally refers to an industrial internet platform. The cloud platform 10 may receive the working data uploaded by edge-side internet of things devices and store the working data in the Kafka message queue. Edge-side internet of things devices include vehicles, production-line equipment, instruments, sensors, and the like. The working data they upload, i.e. internet of things data, may include working-condition data of engineering machinery, geographic position data of engineering machinery, working-condition data of production-line equipment, water and gas meter data of an industrial park, and access control and camera data of an industrial park. In one example, the edge-side internet of things devices may upload working data to the gateway side of the cloud platform 10 in real time through a 5G network, and the gateway side of the cloud platform 10 is responsible for forwarding the working data to the Kafka message queue. The data in the Kafka message queue is then stored in the data lake 30.
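The ingestion path described above (edge device to gateway to message queue) can be sketched as follows. This is an illustrative Python simulation: the in-memory queue stands in for the Kafka topic, and the device names and payload fields are hypothetical, since the application does not specify a topic name, message schema, or producer configuration.

```python
import json
import queue

# In-memory queue standing in for the Kafka message queue; a real deployment
# would use a Kafka producer with whatever topic and config the platform defines.
kafka_queue = queue.Queue()

def gateway_forward(device_id, payload):
    """Gateway side of the cloud platform: wrap raw device telemetry as a
    message and forward it onto the (simulated) Kafka message queue."""
    message = json.dumps({"device_id": device_id, "data": payload})
    kafka_queue.put(message)

# Edge devices upload working-condition data in real time (hypothetical values).
gateway_forward("excavator-001", {"engine_temp_c": 92, "fuel_pct": 61})
gateway_forward("water-meter-17", {"flow_m3": 3.4})
print(kafka_queue.qsize())  # 2
```

Downstream, the Flink job would consume from this queue and land the messages in the lake's buffer layer.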
In the embodiment of the present application, the Flink computation engine 20 has the capability of processing data along the event dimension; that is, it can process bounded data streams on the one hand and unbounded data streams on the other. The Flink computation engine 20 is therefore selected as the distributed computing engine for internet of things data. Specifically, the Flink computation engine 20 communicates with the cloud platform 10 and the data lake 30 respectively, and may perform data processing on the message data in the Kafka message queue and store the processed data in the data lake 30. In one example, the Flink computation engine 20 may connect to the Kafka message queue through the Kafka connector on the one hand, connect to the data lake 30 through the Iceberg connector on the other, and write the massive internet of things data to the data lake 30 through Flink SQL.
In the embodiment of the present application, the data lake 30 (Iceberg) is used as an open table format for big data analysis scenarios. It provides a high-performance table format similar to an SQL table: an Iceberg table can store tens of PB of data, adapts to different computing engines, provides high-performance read-write and metadata management functions, supports real-time and batch data writing and reading, supports ACID transactions, supports adding, deleting, and updating data, is not bound to any underlying storage, supports the Parquet, ORC, and Avro formats (compatible with both row and column storage), supports repeated queries over snapshot data, and has a version rollback function; the data lake 30 also operates in a serverless mode. The embodiments of the present application therefore innovatively combine the Flink compute engine 20 with the data lake 30, with the data lake 30 storing the data of each stage of the data processing pipeline in layers.
In the embodiment of the application, in order to enable all data to flow orderly, the whole life cycle of the data can be clearly and definitely perceived, and the embodiment of the application designs the data layering of the data lake 30 according to the characteristics of the data of the internet of things, so that the data structure is clearer and is convenient for later maintenance. In particular, the data layering of the data lake 30 may include a data buffer layer, a data raw layer, a data standard layer, a data integration layer, a data application layer, a temporary layer, and a dimension layer.
The data buffer layer temporarily stores the raw unstructured internet of things data from the Kafka message queue. The data original layer stores the raw data after regularization and structured cleansing. The data standard layer stores the valuable business data obtained by standardizing and cleansing the data of the data original layer. The data integration layer stores the data obtained by lightly aggregating the business data of the data standard layer and computing it by subject. The data application layer stores statistical data of different subject types, generally wide-table data, obtained by in-depth analysis of the data integration layer and classified summarization of indexes by subject. In addition, the temporary layer stores the intermediate data generated during layer-by-layer processing for subsequent calls, and the dimension layer stores the public dimension data needed by the other data layers during analysis.
Therefore, through the hierarchical design of the data lake, the logical relationship between the data of the Internet of things and the dimension data can be clearly defined, repeated development work is reduced, the data processing flow is clear, the structure is clear, and the functions are clear.
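The layer-by-layer flow described above can be illustrated with a minimal Python sketch. The field names, cleansing rule, and aggregation below are invented for illustration; the application specifies only the roles of the layers, not their schemas.

```python
import json

# Hypothetical raw telemetry as it would sit in the buffer layer.
raw_messages = [
    '{"dev": "A", "temp": "92", "ts": 1}',
    '{"dev": "A", "temp": "95", "ts": 2}',
    'not-json',  # malformed telemetry, dropped during structuring
]

def to_original_layer(buffer_rows):
    """Buffer layer -> original layer: structure the raw text, drop rubbish."""
    out = []
    for row in buffer_rows:
        try:
            out.append(json.loads(row))
        except json.JSONDecodeError:
            continue
    return out

def to_standard_layer(rows):
    """Original -> standard layer: normalize field names and types."""
    return [{"device": r["dev"], "temp_c": float(r["temp"]), "ts": r["ts"]}
            for r in rows]

def to_integration_layer(rows):
    """Standard -> integration layer: light aggregation by subject (device)."""
    agg = {}
    for r in rows:
        agg.setdefault(r["device"], []).append(r["temp_c"])
    return {dev: sum(v) / len(v) for dev, v in agg.items()}

original = to_original_layer(raw_messages)
standard = to_standard_layer(original)
integration = to_integration_layer(standard)
print(integration)  # {'A': 93.5}
```

Each function corresponds to one hop between adjacent layers; in the actual system these hops would be Flink jobs reading from and writing to Iceberg tables.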
In the embodiment of the present application, the Flink calculation engine 20 may further obtain snapshot information at a set moment from the data lake 30 for data analysis according to the data analysis requirement, where the requirement may be batch data analysis or real-time data analysis. Specifically, based on the time-travel characteristic of the data lake 30, the data lake 30 stores snapshot information of its data files at different moments. A snapshot records the complete list of data files at a certain moment, and as time passes, each new snapshot contains all the data file information recorded by the previous snapshot. Data in any time period can therefore be read according to the time attribute of a snapshot. Meanwhile, since snapshot-1 of the data files at the current moment contains all the information of snapshot-0 at the previous moment, the increment between snapshot-1 and snapshot-0 is exactly the latest data. By adjusting the time period recorded by the Iceberg snapshots, any recent data (within, say, the last 0-30 minutes) can be obtained, and analyzing this latest incremental data yields a near-real-time analysis result.
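A toy model of the snapshot mechanism just described: each snapshot records the complete data-file list at its commit time, so the increment between two snapshots is a set difference, and time travel is a lookup by commit timestamp. The file names, timestamps, and dictionary layout are illustrative only and do not reflect Iceberg's actual metadata structures.

```python
# Each snapshot is cumulative: it lists every data file present at commit time.
snapshots = {
    "snapshot-0": {"files": {"f1.parquet", "f2.parquet"}, "ts": 1000},
    "snapshot-1": {"files": {"f1.parquet", "f2.parquet", "f3.parquet"}, "ts": 1600},
}

def incremental_files(old, new):
    """Files added between two snapshots: the latest data for near-real-time analysis."""
    return snapshots[new]["files"] - snapshots[old]["files"]

def snapshots_in_window(start_ts, end_ts):
    """Time travel: select the snapshots committed inside a given time window."""
    return [name for name, s in snapshots.items()
            if start_ts <= s["ts"] <= end_ts]

print(incremental_files("snapshot-0", "snapshot-1"))  # {'f3.parquet'}
print(snapshots_in_window(0, 1200))                   # ['snapshot-0']
```

Batch analysis reads all files of a snapshot in a day- or month-sized window; near-real-time analysis reads only the incremental set.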
In the embodiment of the application, batch data analysis can be performed according to the characteristics of the data lake 30: for statistical analysis by day or month, only the data of a one-day or one-month span is read per analysis according to the snapshots of the data lake 30; on the other hand, the latest real-time analysis results can be obtained at the same time. To improve calculation efficiency and the timeliness of data analysis while saving computing and storage resources, the Flink distributed computing engine is selected for data calculation, because Flink was designed from the start to analyze and process both bounded and unbounded data; that is, Flink integrates operators, functions, and methods for processing bounded and unbounded data streams alike. The embodiments of the present application therefore effectively integrate the batch and real-time data processing characteristics of the Flink computing engine and the Iceberg data lake 30, performing offline and real-time analysis of internet of things data according to the data layering of the Iceberg data lake 30, which greatly reduces the complexity and cost of a massive internet of things data processing system and improves data analysis efficiency.
Through the above technical scheme, the system for integrated batch-stream processing of internet of things data comprises a cloud platform, a Flink computing engine, and a data lake, the Flink computing engine communicating with the cloud platform and the data lake respectively. The cloud platform receives the working data uploaded by edge-side internet of things devices and stores it in a Kafka message queue; the Flink computing engine processes the message data in the Kafka message queue, stores it in the data lake, and obtains snapshot information at a set moment from the data lake for data analysis according to analysis requirements; the data lake stores the data of each processing stage in layers, comprising a data buffer layer, a data original layer, a data standard layer, a data integration layer, a data application layer, a temporary layer, and a dimension layer. By integrating the Flink computing engine with the data lake, and building on the data lake's snapshot mechanism and Flink's support for bounded and unbounded data streams, the application achieves centralized storage and unified computation of internet of things data, breaks the information islands between the systems of different departments, effectively guarantees data consistency, improves data quality, and greatly reduces system complexity compared with the architecture of a traditional internet of things data analysis system.
In embodiments of the present application, the Flink computation engine 20 may also be configured to:
acquiring input instruction parameters;
opening a batch interface to execute batch logic under the condition that the instruction parameter is a batch instruction parameter;
and opening the real-time stream processing interface to execute the real-time stream processing logic in the case that the instruction parameter is the real-time stream processing instruction parameter.
In the embodiment of the application, the processing of a bounded data stream is generally called batch processing. Batch processing does not require ordered input: in batch processing mode, the data stream is first persisted into a storage system, then the whole data set is read, sorted, counted or summarized, and finally the result is output. The processing of an unbounded data stream, which is usually performed as the data is generated, is called real-time stream processing; because the input of an unbounded data stream is infinite, it must be processed continuously. Flink was designed from the beginning with the ability to analyze and process both bounded and unbounded data. Specifically, the input instruction parameters are acquired first; the instruction parameters may be batch processing instruction parameters or real-time stream processing instruction parameters, and can be input by developers through code written according to the data analysis requirements. Further, in the case where the instruction parameter is a batch processing instruction parameter, the batch processing interface of the Flink computing engine 20 may be opened to execute the batch processing logic; in the case where the instruction parameter is a real-time stream processing instruction parameter, the real-time stream processing interface of the Flink computing engine 20 may be opened to execute the real-time stream processing logic. In this way, the system can perform both offline analysis and real-time analysis, with the two sharing the same storage system and computing resources, thereby reducing operation and maintenance costs; moreover, offline analysis and real-time analysis share one set of analysis code, which reduces development cost and improves development efficiency.
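The switch between the two interfaces can be sketched in Flink SQL, where the runtime mode plays the role of the instruction parameter; the table names below (sdl_device_data, adl_device_stats) are hypothetical, and the statement is only a minimal illustration of how one set of analysis code is shared between batch and stream execution:

```sql
-- Batch processing instruction parameter: execute in batch mode over bounded data
SET 'execution.runtime-mode' = 'batch';

-- Real-time stream processing instruction parameter: uncomment to run the
-- same statement continuously over the unbounded stream instead
-- SET 'execution.runtime-mode' = 'streaming';

-- One shared piece of analysis code for both offline and real-time analysis
INSERT INTO adl_device_stats
SELECT device_id,
       COUNT(*)        AS msg_cnt,
       MAX(event_time) AS last_seen
FROM sdl_device_data
GROUP BY device_id;
```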
In an embodiment of the application, the system may further include a query engine 40 in communication with the data lake 30, configured to query the data in the data lake 30 according to query requirements;
the data lake 30 may also be configured to send the query results of the query engine 40 to an industrial application.
In an embodiment of the present application, the system may further include a query engine 40, which communicates with the data lake 30 and can query data in the data lake 30 according to query requirements. Further, the data lake 30 may return the query results of the query engine 40 to the corresponding industrial application. The query engine 40 may be the Presto data query engine. Specifically, after the Internet of Things data is analyzed and processed layer by layer, the final statistical indexes are stored in the data application layer 500 of the Iceberg data lake 30 in the form of wide tables. In order to further simplify the big data architecture for statistical analysis of Internet of Things data, the embodiment of the application does not export the data to an external storage system, but directly queries the statistical analysis results in Iceberg through PrestoSQL. By adding the relevant Iceberg configuration under the etc/catalog directory of the PrestoSQL installation, the connection between PrestoSQL and Iceberg is established, so that the data in the Iceberg data lake 30 can be queried directly through the query engine 40. Users can quickly query and compute the data in the Iceberg data lake 30 through the JDBC connection address provided by PrestoSQL, and the results are returned from the data lake 30 directly to the industrial application.
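Assuming a Presto catalog named "iceberg" has been registered under etc/catalog (with connector.name=iceberg and the metastore address of the data lake), a statistical result in the data application layer 500 could then be queried directly; the schema, table and column names below are hypothetical:

```sql
-- Query a wide-table statistical index in the ADL layer directly from Iceberg,
-- without exporting the data to an external storage system
SELECT device_type,
       stat_date,
       avg_daily_runtime
FROM iceberg.adl.device_runtime_wide
WHERE stat_date = DATE '2023-06-01'
ORDER BY avg_daily_runtime DESC;
```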
Fig. 2 is a schematic flow chart of data layering processing in a data lake according to an embodiment of the application. As shown in fig. 2, the Internet of Things data is written into the data lake 30 for data analysis, and data application is realized after the data passes through the data buffer layer 100, the data original layer 200, the data standard layer 300, the data integration layer 400 and the data application layer 500. The data lake further comprises a dimension layer 600 and a temporary layer 700, and data processing task scheduling and data processing among the layers are realized through the task scheduling monitoring system 800 and the metadata management data dictionary 900. Through the hierarchical design of the data lake 30, the logical relationships and application scope of business and financial data are clearly defined, much repeated development work is avoided, and the data processing flow is clear in process, structure and function.
In embodiments of the present application, the Flink computation engine 20 may also be configured to:
a data buffer table for storing data is newly built in the data buffer layer 100;
the message data in the kafka message queue is written into the data buffer table.
In the embodiment of the present application, the data buffer table is a newly created MOR (Merge-on-Read) table in the buffer data layer of the data lake 30. According to the data characteristics, the data lake 30 may be layered, wherein the data buffer layer 100 (BDL, Buffer Data Layer) is mainly used to temporarily store the original data written by Flink. The validity period of the BDL layer data is relatively short, for example 1-7 days, and the data needs to be deleted after the validity period expires.
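The buffer-table step can be sketched in Flink SQL as follows; the topic, field and table names are illustrative assumptions, while 'format-version'='2' and 'write.upsert.enabled'='true' are the Iceberg table properties that make the new table a v2 MOR table:

```sql
-- Kafka source mapped onto the message queue that the cloud platform writes to
CREATE TABLE kafka_iot_source (
  device_id STRING,
  metric    STRING,
  val       DOUBLE,
  ts        TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'iot-work-data',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- BDL data buffer table as an Iceberg v2 Merge-on-Read table
CREATE TABLE bdl.iot_buffer (
  device_id STRING,
  metric    STRING,
  val       DOUBLE,
  ts        TIMESTAMP(3),
  PRIMARY KEY (device_id, ts) NOT ENFORCED
) WITH (
  'format-version' = '2',
  'write.upsert.enabled' = 'true'
);

-- Continuously write the message data into the data buffer table
INSERT INTO bdl.iot_buffer
SELECT device_id, metric, val, ts
FROM kafka_iot_source;
```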
In embodiments of the present application, the Flink computation engine 20 may also be configured to:
reading buffer data in a data buffer table;
and processing and storing the buffer data layer by layer based on the storage rules and preset indexes corresponding to the data layers in the data lake 30.
In the embodiment of the present application, the data layering is designed for the data lake 30 according to the characteristics of the data, and the storage rule corresponding to the data layering includes a storage form of the data stored in the data layering. The data layering has a certain dependency relationship, and the data is processed layer by layer according to the business logic and the dependency relationship among the data layering.
In embodiments of the present application, the Flink computation engine 20 may also be configured to:
after regular and structured cleaning is performed on the buffered data, the buffered data is stored in the data original layer 200;
normalizing the data in the data original layer 200;
storing the normalized data to the data standard layer 300;
slightly summarizing the data of the data standard layer 300 according to preset indexes, dividing the topics, and storing the data into the data integration layer 400;
comprehensively analyzing the data of the data integration layer 400, the data of the data standard layer 300 and the data of the dimension layer 600 according to preset indexes to obtain an analysis result;
The analysis results are stored in the data application layer 500.
Specifically, the Flink computing engine 20 performs regularization and structuring processing on the unstructured buffer data and stores it in the data original layer 200. In one example, the Iceberg tables corresponding to each service of the data original layer 200 may be created first; each service's data in the Iceberg tables of the data buffer layer 100 is then read in batches by the Flink computing engine 20 and written in batches into the Iceberg tables of the data original layer 200. The main development language of this step is SQL: by writing and running data-structuring code, simple and clear structuring of the data is achieved, and through this operation all structured service data that retains all features of the original data is obtained.
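A minimal sketch of this BDL-to-ODL structuring step, with hypothetical table and column names; the cleaning here is only simple type casting and null filtering:

```sql
-- Batch-read the buffer table and write structured rows into the ODL table
INSERT INTO odl.device_work_data
SELECT device_id,
       metric,
       CAST(val AS DOUBLE) AS val,
       ts
FROM bdl.iot_buffer
WHERE device_id IS NOT NULL
  AND ts IS NOT NULL;
```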
Further, since Iceberg supports full or incremental reading of data from a snapshot, the first analysis may read the full data from a snapshot, analyze it and store it to the data original layer 200; afterwards, incremental analysis may be performed through incremental reading by configuring certain Iceberg parameters. After the data of the data original layer 200 is ready, the Iceberg partition tables of the data standard layer 300 may be created; the data of the data original layer 200 is then read through the Flink computing engine 20, cleaned through SQL code to remove dirty, non-compliant and outdated data, and the code is run on the Flink computing engine 20 to write the data original layer 200 into the data standard layer 300 in batches. According to the layering principle of the Iceberg data lake 30, the data standard layer 300 needs to retain both historical data and the current latest data, that is, the full data.
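The switch from full read to incremental read can be expressed with Iceberg's Flink read options; 'streaming', 'monitor-interval' and 'start-snapshot-id' are real Iceberg options, while the table names and the snapshot id below are placeholders:

```sql
-- First analysis: a plain batch scan reads the full data of the latest snapshot
INSERT INTO sdl.device_work_data
SELECT * FROM odl.device_work_data;

-- Subsequent analyses: incremental read starting after a known snapshot,
-- picking up only newly appended data as it arrives
INSERT INTO sdl.device_work_data
SELECT * FROM odl.device_work_data
/*+ OPTIONS('streaming'='true',
            'monitor-interval'='60s',
            'start-snapshot-id'='4872123401217821904') */;
```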
In one example, since Internet of Things data is updated constantly, in order to fully exploit the value of the data, the historical data and the latest working-condition data of the Internet of Things devices need to be saved at the same time. The conventional method is to make a data slice every day, where each slice records the total historical data before the current day; with one slice per day, after a long time a large amount of duplicated historical data accumulates in the database, occupying storage resources and computing resources. In the embodiment of the application, the row-level update and real-time insert (i.e., read-write separation) capabilities of Iceberg table data are fully utilized: only one data slice is needed in the data standard layer to store the current and historical full data. When the data changes, the update and insert of the data can be completed directly through an update operation, without re-creating data partitions and slices, and at the same time an incremental analysis result, that is, a real-time analysis result, can be obtained. In this way, the utilization of system storage resources and computing resources can be improved.
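With row-level update enabled, keeping the current and historical full data in a single slice reduces to an upsert-style write; a sketch, assuming a hypothetical device-status table keyed by device_id:

```sql
-- SDL table with a primary key: rows with an existing key are updated in
-- place instead of creating a new daily slice
CREATE TABLE sdl.device_latest_status (
  device_id   STRING,
  status      STRING,
  update_time TIMESTAMP(3),
  PRIMARY KEY (device_id) NOT ENFORCED
) WITH (
  'format-version' = '2',
  'write.upsert.enabled' = 'true'
);

-- New working-condition data either inserts a new row or updates the old one
INSERT INTO sdl.device_latest_status
SELECT device_id, status, update_time
FROM odl.device_work_status;
```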
Further, after the data update of the data standard layer 300 is completed, the data of the data standard layer 300 needs to be lightly summarized and then written into the data integration layer 400 in batches. Specifically, the table-creation statements of the data integration layer 400 are created first; after the Iceberg tables of the data integration layer 400 are created, the service table data of the data standard layer 300 is read by the Flink computing engine 20 and then written into the data integration layer 400 in batches, completing the light summarization of each service's data and its division into different topics.
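The light summarization into the IDL layer is essentially a grouped aggregation; a sketch with hypothetical names, bucketing detail rows into daily per-device summaries:

```sql
-- Lightly summarize SDL detail data into a per-device, per-day IDL topic table
INSERT INTO idl.device_daily_summary
SELECT device_id,
       DATE_FORMAT(ts, 'yyyy-MM-dd') AS stat_date,
       COUNT(*)                      AS record_cnt,
       AVG(val)                      AS avg_val
FROM sdl.device_work_data
GROUP BY device_id, DATE_FORMAT(ts, 'yyyy-MM-dd');
```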
Further, the Flink computing engine 20 may comprehensively analyze the data of the data integration layer 400, the data of the data standard layer 300 and the data of the dimension layer 600 according to preset indexes to obtain analysis results, and store the analysis results in the data application layer 500. Specifically, the Iceberg tables of the data application layer 500 are created first; then, according to the different business logics, the dimension data of the dimension layer 600, the lightly summarized data of the data integration layer 400 and the detail data of the data standard layer 300 are comprehensively processed and analyzed by the Flink computing engine 20, and the results are written in batches into the corresponding Iceberg tables of the data application layer 500.
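The comprehensive analysis into the ADL wide table can be sketched as a join of the light-summary data with the dimension data; table and column names are hypothetical:

```sql
-- Join IDL summaries with dimension-layer attributes into an ADL wide table
INSERT INTO adl.device_runtime_wide
SELECT d.device_type,
       s.stat_date,
       SUM(s.record_cnt) AS total_records,
       AVG(s.avg_val)    AS avg_daily_runtime
FROM idl.device_daily_summary AS s
JOIN dim.device_info          AS d
  ON s.device_id = d.device_id
GROUP BY d.device_type, s.stat_date;
```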
In the embodiment of the application, since the processing of the wide table involves a plurality of tables, much historical data may be loaded for comprehensive processing and analysis, which generally occupies a large amount of computing resources. Therefore, in the embodiment of the application, the full data is analyzed only once: after the first initialization operation, all subsequent analyses process only the incremental data, making full use of Iceberg's capability of incremental data reading. The results of incremental analysis are merged and updated into the previous historical analysis results through appending writes, which ensures the completeness of the analysis results while also allowing the incremental (real-time) analysis results to be backed up, saved and updated, so that both batch analysis results and real-time analysis results can be obtained; analyzing only the incremental data saves a significant amount of computing and storage resources. Through the above operations, real-time and batch analysis statistical results of the Internet of Things data can be obtained.
FIG. 3 is a schematic illustration of a scheduling flow of a Flink computing engine according to an embodiment of the present application. As shown in fig. 3, in order to automatically integrate multi-system front-end service data and achieve the goals of real-time service and timely discovery of service anomalies, the data processing tasks of the Flink computing engine 20 between the different layers of the data lake 30 (BDL → ODL → SDL → IDL → ADL) are scheduled at a certain timed frequency according to their interdependence. With such timed scheduling, the machine can automatically compute the Internet of Things data, greatly improving data development efficiency.
Fig. 4 is a flow chart of a method for integrated batch-stream processing of Internet of Things data according to an embodiment of the present application. As shown in fig. 4, an embodiment of the present application provides a method for integrated batch-stream processing of Internet of Things data, which is applied to a system for integrated batch-stream processing of Internet of Things data. The system includes a cloud platform, a data lake and a Flink computing engine, and the Flink computing engine communicates with the cloud platform and the data lake respectively. The method may include the following steps:
step 101, receiving working data uploaded by edge side internet of things equipment through a cloud platform, and storing the working data to a kafka message queue;
step 102, performing data processing on the message data in the kafka message queue through the Flink computing engine and storing the message data in the data lake;
step 103, data of each stage in the data processing process are stored in a layering manner through a data lake;
step 104, obtaining snapshot files at set time in the data lake according to analysis requirements through the Flink calculation engine to perform data analysis;
the data lake comprises a data buffer layer, a data original layer, a data standard layer, a data integration layer, a data application layer, a temporary layer and a dimension layer.
In the embodiment of the application, the cloud platform is generally an industrial internet platform, and can receive the working data uploaded by the edge-side internet of things equipment and store the working data into the kafka message queue. The edge side internet of things equipment comprises vehicles, production line equipment, instruments, sensors and the like. In one example, the edge-side internet of things device may upload the working data to the gateway side of the cloud platform in real time through the 5G network, and the gateway side of the cloud platform is responsible for forwarding the working data to the kafka message queue. Further, the data in the kafka message queue is stored in a data lake.
In the embodiment of the application, the Flink computing engine communicates with the cloud platform and the data lake respectively, and can process the message data in the kafka message queue and store it in the data lake. In one example, the Flink computing engine may interface with the kafka message queue through the kafka-connector on the one hand, and connect with the data lake through the iceberg-connector on the other hand, writing the massive Internet of Things data to the data lake through Flink SQL. Furthermore, the embodiment of the application creatively combines the Flink computing engine with the data lake, and uses the data lake to store the data of each stage of the data processing process in layers.
In the embodiment of the application, in order to make all data flow orderly so that the whole life cycle of the data can be clearly perceived, the data layering of the Iceberg data lake is designed according to the characteristics of Internet of Things data, making the data structure clearer and easier to maintain later. In particular, the data layering of the data lake may include a data buffer layer, a data original layer, a data standard layer, a data integration layer, a data application layer, a temporary layer and a dimension layer. In this way, through the hierarchical design of the data lake, the logical relationship between the Internet of Things data and the dimension data can be clearly defined, repeated development work is reduced, and the data processing flow is clear in process, structure and function.
In the embodiment of the application, the Flink computing engine can also obtain snapshot information at a set moment in the data lake according to the data analysis requirements to perform data analysis. Specifically, based on the time-travel characteristic of the Iceberg data lake, the data lake stores Snapshot information of the data files at different moments; each Snapshot records the complete list of data files at a certain moment, and as time goes by, the Snapshot at a later moment contains all the data file information recorded by the Snapshot at the previous moment. Therefore, the data in any time period can be read according to the time characteristic of the Snapshot. Meanwhile, since the data file Snapshot-1 at the current moment contains all the information of the data file Snapshot-0 at the previous moment, the increment between Snapshot-1 and Snapshot-0 is the latest data; any latest data within a recent window (0-30 min) can be acquired by adjusting the time period recorded by the Iceberg Snapshot, and analyzing this latest incremental data yields a near-real-time analysis result.
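Both the time-travel read and the incremental read described above map onto Iceberg's Flink read options; 'snapshot-id', 'start-snapshot-id' and 'end-snapshot-id' are real options, while the table name and snapshot ids below are placeholders:

```sql
-- Time travel: read the table exactly as it was at a given snapshot
SELECT * FROM sdl.device_work_data
/*+ OPTIONS('snapshot-id'='3821550127947089987') */;

-- Incremental read: the delta between two snapshots is the latest data,
-- which yields a near-real-time analysis result when processed
SELECT * FROM sdl.device_work_data
/*+ OPTIONS('start-snapshot-id'='3821550127947089987',
            'end-snapshot-id'='5719342003908127451') */;
```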
According to the embodiment of the application, batch data analysis can be performed based on the characteristics of the data lake, that is, data statistical analysis by day or by month, reading only one day's or one month's span of data from a Snapshot of the data lake for each analysis; on the other hand, the latest real-time analysis results can be obtained at the same time. The embodiment of the application effectively integrates the characteristics of the Flink data computing engine and the Iceberg data lake storage system in both batch and real-time data processing, and performs offline analysis and real-time analysis of the Internet of Things data according to the data layering in the Iceberg data lake, which greatly reduces the complexity and cost of a massive Internet of Things data processing system and improves data analysis efficiency.
In an embodiment of the present application, the method may further include:
acquiring input instruction parameters;
opening the batch processing interface to execute the batch processing logic under the condition that the instruction parameters are batch processing instruction parameters;
and opening the real-time stream processing interface to execute the real-time stream processing logic in the case that the instruction parameter is the real-time stream processing instruction parameter.
In the embodiment of the application, Flink was designed from the beginning with the ability to analyze and process both bounded and unbounded data. Specifically, the input instruction parameters are acquired first; the instruction parameters may be batch processing instruction parameters or real-time stream processing instruction parameters, and can be input by developers through code written according to the data analysis requirements. Further, in the case where the instruction parameter is a batch processing instruction parameter, the batch processing interface of the Flink computing engine may be opened to execute the batch processing logic; in the case where the instruction parameter is a real-time stream processing instruction parameter, the real-time stream processing interface of the Flink computing engine may be opened to execute the real-time stream processing logic. In this way, the system can perform both offline analysis and real-time analysis, with the two sharing the same storage system and computing resources, thereby reducing operation and maintenance costs; moreover, offline analysis and real-time analysis share one set of analysis code, which reduces development cost and improves development efficiency.
In an embodiment of the present application, step 102, performing data processing on the message data in the kafka message queue through the Flink computing engine and storing it in the data lake, may include:
A data buffer table for storing data is newly built in the data buffer layer;
writing the message data in the kafka message queue into a data buffer table;
reading buffer data in a data buffer table;
and processing and storing the buffer data layer by layer based on the storage rules and preset indexes corresponding to each data layer in the data lake.
In the embodiment of the application, the data buffer table is a newly created MOR (Merge-on-Read) table in the buffer data layer of the data lake. According to the data characteristics, the data lake may be layered, wherein the data buffer layer (BDL, Buffer Data Layer) is mainly used to temporarily store the original data written by Flink; the validity period of the BDL layer data is relatively short, for example 1-7 days, and the data needs to be deleted after the validity period expires. In the embodiment of the application, the data layering is designed for the data lake according to the characteristics of the data, and the storage rule corresponding to each data layer includes the storage form of the data stored in that layer. The data layers have certain dependency relationships, and the data is processed layer by layer according to the business logic and the dependency relationships among the data layers.
In the embodiment of the present application, based on the storage rule and the preset index corresponding to each data layer in the data lake, the layer-by-layer processing and storage of the buffered data may include:
After regular and structured cleaning is carried out on the buffer data, the buffer data is stored in a data original layer;
normalizing the data in the data original layer;
storing the standardized data to a data standard layer;
slightly summarizing the data of the data standard layer according to preset indexes, dividing the theme, and storing the theme into the data integration layer;
comprehensively analyzing the data of the data integration layer, the data of the data standard layer and the data of the dimension layer according to preset indexes to obtain an analysis result;
and storing the analysis result into a data application layer.
Specifically, the Flink computing engine performs regularization and structuring processing on the unstructured buffer data and then stores it in the data original layer. In one example, the Iceberg tables corresponding to each service of the data original layer can be created first; each service's data in the Iceberg tables of the data buffer layer is then read in batches through the Flink computing engine and written in batches into the Iceberg tables of the data original layer.
Further, since Iceberg supports full or incremental reading of data from a snapshot, the first analysis may read the full data from a snapshot and store it to the data original layer; afterwards, incremental analysis may be performed through incremental reading by configuring certain Iceberg parameters. After the data of the data original layer is ready, the Iceberg partition tables of the data standard layer can be created; the data of the data original layer is then read through the Flink computing engine, cleaned through SQL code to remove dirty, non-compliant and outdated data, and the code is run on the Flink computing engine to write the data original layer into the data standard layer in batches. According to the layering principle of the Iceberg data lake, the data standard layer needs to retain both historical data and the current latest data, that is, the full data.
In one example, since Internet of Things data is updated constantly, in order to fully exploit the value of the data, the historical data and the latest working-condition data of the Internet of Things devices need to be saved at the same time. The conventional method is to make a data slice every day, where each slice records the total historical data before the current day; with one slice per day, after a long time a large amount of duplicated historical data accumulates in the database, occupying storage resources and computing resources. In the embodiment of the application, the row-level update and real-time insert (i.e., read-write separation) capabilities of Iceberg table data are fully utilized: only one data slice is needed in the data standard layer to store the current and historical full data. When the data changes, the update and insert of the data can be completed directly through an update operation, without re-creating data partitions and slices, and at the same time an incremental analysis result, that is, a real-time analysis result, can be obtained. In this way, the utilization of system storage resources and computing resources can be improved.
Further, after the data of the data standard layer is updated, the data of the data standard layer needs to be lightly summarized and then written into the data integration layer in batches. Specifically, the table-creation statements of the data integration layer are created first; after the Iceberg tables of the data integration layer are created, the service table data of the data standard layer is read through the Flink computing engine and then written into the data integration layer in batches, completing the light summarization of each service's data and its division into different topics.
Further, the Flink computing engine can comprehensively analyze the data of the data integration layer, the data of the data standard layer and the data of the dimension layer according to preset indexes to obtain analysis results, and store the analysis results in the data application layer. Specifically, the Iceberg tables of the data application layer are created first; then, according to the different business logics, the dimension data of the dimension layer, the lightly summarized data of the data integration layer and the detail data of the data standard layer are comprehensively processed and analyzed through the Flink computing engine, and the results are written in batches into the corresponding Iceberg tables of the data application layer.
In the embodiment of the application, since the processing of the wide table involves a plurality of tables, much historical data may be loaded for comprehensive processing and analysis, which generally occupies a large amount of computing resources. Therefore, in the embodiment of the application, the full data is analyzed only once: after the first initialization operation, all subsequent analyses process only the incremental data, making full use of Iceberg's capability of incremental data reading. The results of incremental analysis are merged and updated into the previous historical analysis results through appending writes, which ensures the completeness of the analysis results while also allowing the incremental (real-time) analysis results to be backed up, saved and updated, so that both batch analysis results and real-time analysis results can be obtained; analyzing only the incremental data saves a significant amount of computing and storage resources. Through the above operations, real-time and batch analysis statistical results of the Internet of Things data can be obtained.
Through the above technical scheme, the system for integrated batch-stream processing of Internet of Things data comprises a cloud platform, a Flink computing engine and a data lake, wherein the Flink computing engine communicates with the cloud platform and the data lake respectively. The cloud platform receives the working data uploaded by the edge-side Internet of Things devices and stores the working data to a kafka message queue; the Flink computing engine performs data processing on the message data in the kafka message queue, stores it in the data lake, and obtains snapshot information at a set moment in the data lake according to data analysis requirements to perform data analysis; the data lake stores the data of each stage of the data processing process in layers; the data lake comprises a data buffer layer, a data original layer, a data standard layer, a data integration layer, a data application layer, a temporary layer and a dimension layer. By integrating the Flink computing engine with the data lake, and based on the snapshot mechanism of the data lake and Flink's support for both bounded and unbounded data streams, the application realizes centralized storage and unified computation of Internet of Things data, breaks the information islands among different systems of different departments, effectively ensures data consistency, improves data quality, and greatly reduces system complexity compared with the architecture design of traditional Internet of Things data analysis systems.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely illustrative of the present application and is not intended to limit it. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the application shall be included in the scope of the claims of the present application.

Claims (10)

1. A system for integrated processing of internet of things data batch flows, comprising:
the cloud platform is configured to receive the working data uploaded by the edge side Internet of things equipment and store the working data to the kafka message queue;
the Flink calculation engine is respectively communicated with the cloud platform and the data lake, and is configured to perform data processing on the message data in the kafka message queue, store the message data in the data lake and obtain snapshot information at a set moment in the data lake according to data analysis requirements to perform data analysis;
the data lake is configured to store data of each stage in the data processing process in a layered manner.
2. The system of claim 1, wherein the Flink calculation engine is further configured to:
acquiring input instruction parameters;
opening a batch interface to execute batch logic when the instruction parameter is a batch instruction parameter;
and opening a real-time stream processing interface to execute real-time stream processing logic when the instruction parameter is a real-time stream processing instruction parameter.
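Claim 2's dispatch — selecting a batch interface or a real-time stream interface from an input instruction parameter — can be sketched as a small selector. The parameter values `"batch"` and `"stream"` and the handler behavior are hypothetical illustrations; the claim only specifies that one interface or the other is opened based on the parameter.

```python
def select_interface(instruction_param: str):
    """Return the processing handler matching the instruction parameter."""
    def run_batch(records):
        # Bounded input: process the whole collection at once.
        return [r * 2 for r in records]

    def run_stream(records):
        # Unbounded input: yield results one record at a time.
        for r in records:
            yield r * 2

    if instruction_param == "batch":
        return run_batch
    if instruction_param == "stream":
        return run_stream
    raise ValueError(f"unknown instruction parameter: {instruction_param}")

print(select_interface("batch")([1, 2, 3]))         # [2, 4, 6]
print(list(select_interface("stream")([1, 2, 3])))  # [2, 4, 6]
```

Both handlers apply the same transformation, which is the point of batch-stream unification: one body of processing logic, two execution modes chosen by a parameter.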
3. The system of claim 1, further comprising a query engine in communication with the data lake configured to query data in the data lake according to a query requirement;
the data lake is further configured to send query results of the query engine to an industrial application.
4. The system of claim 1, wherein the Flink calculation engine is further configured to:
a data buffer table for storing data is newly built in the data buffer layer;
and writing the message data in the kafka message queue into the data buffer table.
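Claim 4's two steps — create a buffer table in the data buffer layer, then write the queued messages into it — can be sketched with an in-memory stand-in. The table name `iot_buffer` and the record fields are hypothetical; a real deployment would create the table in the lake via the compute engine rather than in a Python dict.

```python
class DataBufferLayer:
    """Toy data buffer layer: tables are named lists of rows."""
    def __init__(self):
        self.tables = {}

    def create_table(self, name):
        # Idempotent creation, so re-running ingestion is safe.
        self.tables.setdefault(name, [])

    def write(self, name, row):
        self.tables[name].append(row)

def ingest(queue_messages, buffer_layer, table="iot_buffer"):
    """Create the buffer table, then write each queued message into it."""
    buffer_layer.create_table(table)
    for msg in queue_messages:
        buffer_layer.write(table, msg)
    return len(buffer_layer.tables[table])

buffer_layer = DataBufferLayer()
n = ingest([{"device": "a", "v": 1}, {"device": "b", "v": 2}], buffer_layer)
print(n)  # 2
```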
5. The system of claim 4, wherein the data lake comprises a data buffer layer, the Flink calculation engine further configured to:
reading the buffer data in the data buffer table;
and processing and storing the buffer data layer by layer based on a storage rule and a preset index corresponding to each data layer in the data lake.
6. The system of claim 5, wherein the data lake further comprises a data raw layer, a data standard layer, a data integration layer, a data application layer, and a dimension layer, the Flink calculation engine further configured to:
after performing regularized and structured cleansing on the buffer data, storing the buffer data into the data original layer;
normalizing the data in the data original layer;
storing the standardized data to the data standard layer;
lightly summarizing the data of the data standard layer according to a preset index, partitioning it by subject, and storing the result into the data integration layer;
comprehensively analyzing the data of the data integration layer, the data of the data standard layer and the data of the dimension layer according to preset indexes to obtain an analysis result;
and storing the analysis result into the data application layer.
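The layer-by-layer flow of claim 6 — cleanse into the original layer, standardize into the standard layer, lightly summarize by subject into the integration layer, then analyze against dimension data into the application layer — can be sketched as a chain of functions. Field names, units, and the per-device averaging are illustrative assumptions, not the patent's preset indexes.

```python
def clean(buffer_rows):
    """Data original layer: keep only rows with a regular, complete structure."""
    return [r for r in buffer_rows if "device" in r and "value" in r]

def standardize(raw_rows):
    """Data standard layer: normalize field names and value types."""
    return [{"device": r["device"], "value_c": float(r["value"])} for r in raw_rows]

def summarize(std_rows):
    """Data integration layer: light aggregation per subject (here, per device)."""
    groups = {}
    for r in std_rows:
        groups.setdefault(r["device"], []).append(r["value_c"])
    return {d: sum(vs) / len(vs) for d, vs in groups.items()}

def analyze(integrated, std_rows, dimensions):
    """Data application layer: join the summary with standard-layer counts
    and dimension-layer attributes into an analysis result."""
    counts = {}
    for r in std_rows:
        counts[r["device"]] = counts.get(r["device"], 0) + 1
    return [{"device": d, "avg": avg, "n": counts[d],
             "site": dimensions.get(d, "unknown")}
            for d, avg in sorted(integrated.items())]

buffer = [{"device": "a", "value": "1.0"}, {"device": "a", "value": "3.0"}, {"bad": 1}]
std = standardize(clean(buffer))
result = analyze(summarize(std), std, {"a": "plant-1"})
print(result)  # [{'device': 'a', 'avg': 2.0, 'n': 2, 'site': 'plant-1'}]
```

Each function's output is what the claim stores into the next layer, so every intermediate stage remains queryable in the lake.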
7. The method for integrally processing the data batch of the Internet of things is characterized by being applied to a system for integrally processing the data batch of the Internet of things, wherein the system comprises a cloud platform, a data lake and a Flink computing engine, and the Flink computing engine is respectively communicated with the cloud platform and the data lake, and the method comprises the following steps:
receiving working data uploaded by the edge side Internet of things equipment through the cloud platform, and storing the working data into a kafka message queue;
carrying out data processing on the message data in the kafka message queue through the Flink calculation engine and storing the message data in the data lake;
The data of each stage in the data processing process is stored in a layered manner through the data lake;
and obtaining, by the Flink calculation engine, snapshot files at a set time in the data lake according to analysis requirements to perform data analysis.
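The "snapshot at a set time" read in the method can be sketched with a toy commit log: each write records a timestamp and table state, and a read resolves the latest snapshot at or before the requested time. Lake-format engines such as Apache Iceberg or Hudi provide this time-travel behavior natively; the class below is only an illustrative stand-in with hypothetical names.

```python
import bisect

class SnapshotLake:
    """Toy snapshot mechanism: each commit stores (timestamp, state),
    so reads can target any set moment in the past."""
    def __init__(self):
        self.commits = []  # (timestamp, state) pairs, timestamps ascending

    def commit(self, ts, state):
        self.commits.append((ts, dict(state)))

    def read_as_of(self, ts):
        """Return the latest snapshot committed at or before ts."""
        idx = bisect.bisect_right([t for t, _ in self.commits], ts) - 1
        if idx < 0:
            raise LookupError("no snapshot at or before the requested time")
        return self.commits[idx][1]

lake = SnapshotLake()
lake.commit(100, {"rows": 10})
lake.commit(200, {"rows": 25})
print(lake.read_as_of(150))  # {'rows': 10}
print(lake.read_as_of(250))  # {'rows': 25}
```

Because each snapshot is immutable, batch analysis over a fixed snapshot and streaming ingestion of new commits can proceed concurrently without interfering, which is what makes the batch-stream integration consistent.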
8. The method of claim 7, wherein the method further comprises:
acquiring input instruction parameters;
opening a batch processing interface to execute batch processing logic under the condition that the instruction parameters are batch processing instruction parameters;
and opening a real-time stream processing interface to execute the real-time stream processing logic under the condition that the instruction parameter is the real-time stream processing instruction parameter.
9. The method of claim 7, wherein the data lake comprises a data buffer layer, and wherein performing, by the Flink calculation engine, data processing on the message data in the kafka message queue and storing the message data comprises:
a data buffer table for storing data is newly built in the data buffer layer;
writing the message data in the kafka message queue into the data buffer table;
reading the buffer data in the data buffer table;
and processing and storing the buffer data layer by layer based on a storage rule and a preset index corresponding to each data layer in the data lake.
10. The method of claim 9, wherein the data lake further comprises a data original layer, a data standard layer, a data integration layer, a data application layer and a dimension layer, and wherein processing and storing the buffer data layer by layer based on the storage rule and the preset index corresponding to each data layer in the data lake comprises:
after performing regularized and structured cleansing on the buffer data, storing the buffer data into the data original layer;
normalizing the data in the data original layer;
storing the standardized data to the data standard layer;
lightly summarizing the data of the data standard layer according to a preset index, partitioning it by subject, and storing the result into the data integration layer;
comprehensively analyzing the data of the data integration layer, the data of the data standard layer and the data of the dimension layer according to preset indexes to obtain an analysis result;
and storing the analysis result into the data application layer.
CN202310787447.7A 2023-06-29 2023-06-29 System and method for integrated processing of internet of things data batch flow Pending CN116842055A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310787447.7A CN116842055A (en) 2023-06-29 2023-06-29 System and method for integrated processing of internet of things data batch flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310787447.7A CN116842055A (en) 2023-06-29 2023-06-29 System and method for integrated processing of internet of things data batch flow

Publications (1)

Publication Number Publication Date
CN116842055A true CN116842055A (en) 2023-10-03

Family

ID=88164621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310787447.7A Pending CN116842055A (en) 2023-06-29 2023-06-29 System and method for integrated processing of internet of things data batch flow

Country Status (1)

Country Link
CN (1) CN116842055A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251414A (en) * 2023-11-17 2023-12-19 太极计算机股份有限公司 Data storage and processing method based on heterogeneous technology
CN117251414B (en) * 2023-11-17 2024-03-26 太极计算机股份有限公司 Data storage and processing method based on heterogeneous technology
CN117609315A (en) * 2024-01-22 2024-02-27 中债金融估值中心有限公司 Data processing method, device, equipment and readable storage medium
CN117609315B (en) * 2024-01-22 2024-04-16 中债金融估值中心有限公司 Data processing method, device, equipment and readable storage medium
CN117708219A (en) * 2024-02-06 2024-03-15 中科云谷科技有限公司 Processing method, processing device and storage medium for data of Internet of things

Similar Documents

Publication Publication Date Title
CN116842055A (en) System and method for integrated processing of internet of things data batch flow
CN110825801B (en) Train signal system vehicle-mounted log analysis system and method based on distributed architecture
US6879984B2 (en) Analytical database system that models data to speed up and simplify data analysis
US7844570B2 (en) Database generation systems and methods
CN110334274A (en) Information-pushing method, device, computer equipment and storage medium
CN102999537A (en) System and method for data migration
CN107809467B (en) Method for deleting container mirror image data in cloud environment
CN111061788A (en) Multi-source heterogeneous data conversion integration system based on cloud architecture and implementation method thereof
CN112181960B (en) Intelligent operation and maintenance framework system based on AIOps
CN111400288A (en) Data quality inspection method and system
CN112527886A (en) Data warehouse system based on urban brain
CN112148718A (en) Big data support management system for city-level data middling station
CN112347071A (en) Power distribution network cloud platform data fusion method and power distribution network cloud platform
CN112817958A (en) Electric power planning data acquisition method and device and intelligent terminal
CN105138676B (en) Table merge querying methods are divided in point storehouse for concurrently polymerizeing calculating based on high-level language
CN116166757A (en) Multi-source heterogeneous lake and warehouse integrated data processing method, equipment and medium
CN115292414A (en) Method for synchronizing service data to data bins
CN112435022B (en) Dynamic retrieval system and method based on user real-time data
CN115878623A (en) Logistics industry data asset catalog management method and system
CN114817226A (en) Government data processing method and device
Grambau et al. Reference Architecture framework for enhanced social media data analytics for Predictive Maintenance models
CN114691762A (en) Intelligent construction method for enterprise data
Chen et al. Towards low-latency big data infrastructure at sangfor
CN116450620B (en) Database design method and system for multi-source multi-domain space-time reference data
CN116644039B (en) Automatic acquisition and analysis method for online capacity operation log based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination