CN114579097A

CN114579097A - Cloud native data API construction method based on single data stream

Info

Publication number: CN114579097A
Application number: CN202210244542.8A
Authority: CN
Inventors: 郭晨
Original assignee: Jiangsu Yisi Changtian Digital Intelligent Technology Co ltd
Current assignee: Jiangsu Yisi Changtian Digital Intelligent Technology Co ltd
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2022-06-03

Abstract

The invention relates to the technical field of networks, and in particular to a cloud native data API construction method based on a single data stream, comprising the following steps: 10: building a multi-source heterogeneous data exchange framework; 20: building a stream-batch integrated data processing framework; 30: storing data in an Apache Hudi data lake; 40: building APIs on a FaaS platform; 50: performing data queries based on the Presto platform. By adopting Apache Hudi data lake storage and an OLAP query engine based on an MPP architecture, the method can handle most data query requests and, compared with most Restful interfaces currently in use, can effectively meet the data requirements of various users. Because the APIs are built on a FaaS platform, the method also effectively solves the problem that, when an abnormal API crashes the service, all data APIs become unavailable and fault isolation cannot be achieved.

Description

Cloud native data API construction method based on single data stream

Technical Field

The invention relates to the technical field of networks, and in particular to a cloud native data API construction method based on a single data stream.

Background

In data analysis and utilization, the data itself, the analysis models built on the data, and the data applications built on both the data and the models all have great value when opened and shared. The traditional means of opening and sharing data is data export, for example to local disks, FTP servers, distributed file systems, and the like. This approach suits temporary, large-volume data exchange, but once the data is exported the platform loses all ability to collect information about it: the platform cannot collect or audit information about data users, and it cannot export models or data applications, which greatly limits the functional boundary of the data service. Therefore, where an application ecosystem is to be built, interfaces, and Restful interfaces in particular, have become the more popular form of service provision. Data, models and applications can all be opened through Restful interfaces, and the platform can obtain basic information about the caller whenever a user calls an interface, which facilitates authority management and traffic and concurrency control and so provides a better and more stable data service.

However, the Restful interfaces currently supplied by many vendors require customers to state their requirements explicitly in advance; the vendor then invests research and development effort, and the interfaces are delivered together with the platform. This mode is clearly inflexible and costly. For data APIs, some open-source technologies and platform vendors implement an explicit data API customization function, so that users can define their data requirements and the platform provides open data services in the form of APIs. However, most current technologies run all interfaces on a single set of services and cannot dynamically adjust resources for each interface at fine granularity; when an abnormal API crashes the service, all data APIs become unavailable and fault isolation cannot be achieved.

Therefore, a cloud native data API construction method based on a single data stream is needed to solve the above problems.

Disclosure of Invention

The invention aims to provide a cloud native data API construction method based on a single data stream, so as to solve the problems in the background technology.

In order to achieve the purpose, the invention provides the following technical scheme:

A cloud native data API construction method based on a single data stream comprises the following steps:

10: building a multi-source heterogeneous data exchange framework;

20: building a stream-batch integrated data processing framework;

30: storing data in an Apache Hudi data lake;

40: building APIs on a FaaS platform;

50: performing data queries based on the Presto platform.

As a preferred embodiment of the present invention, step 30 may also use an Alluxio storage system to reduce I/O overhead.

As a preferred embodiment of the present invention, the step 10 comprises the steps of:

101: data source abstraction: supporting reading from and writing to general data sources such as JDBC, file systems and message queues, and providing a development framework and an integration method so that users can develop drivers for other data sources themselves;

102: exchange behavior abstraction: for each data source, user-definable behaviors such as the exchange task run-time strategy, the strategy for writing new and old data, the dirty data filtering strategy and the task run configuration are abstracted, and each data source may implement them according to what its underlying technology supports;

103: external metadata import: if the data source side stores metadata of the imported data, such as field remarks and primary and foreign key relationships, a development framework is provided to implement import of that metadata;

104: page-based data source management, task monitoring and alarming, data collection cataloging and data set relationship management are supported, as is extension of the management console using a low-code development framework.
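As a non-limiting illustration, the data source abstraction of step 101 might be sketched in Python as follows. The names `DataSource`, `register_driver`, `InMemorySource`, `create_source` and `exchange` are hypothetical and do not appear in the invention; the in-memory driver merely stands in for a real JDBC, file system or message queue driver.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List, Optional, Type

Record = Dict[str, object]


class DataSource(ABC):
    """Hypothetical common interface that every data source driver implements."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        """Yield records from the source."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        """Write records to the source and return how many were written."""


# Registry mapping a source type name to its driver class (illustrative only).
_DRIVERS: Dict[str, Type[DataSource]] = {}


def register_driver(name: str):
    """Class decorator that lets users plug in their own drivers."""
    def decorator(cls: Type[DataSource]) -> Type[DataSource]:
        _DRIVERS[name] = cls
        return cls
    return decorator


@register_driver("memory")
class InMemorySource(DataSource):
    """Stand-in for a real JDBC, file system or message queue driver."""

    def __init__(self, rows: Optional[List[Record]] = None) -> None:
        self.rows: List[Record] = list(rows or [])

    def read(self) -> Iterator[Record]:
        yield from self.rows

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before


def create_source(name: str, **kwargs) -> DataSource:
    """Instantiate a registered driver by its type name."""
    return _DRIVERS[name](**kwargs)


def exchange(source: DataSource, sink: DataSource) -> int:
    """Copy every record from the source to the sink."""
    return sink.write(source.read())
```

Under these assumptions, a user-developed driver only needs to subclass `DataSource` and register itself; the exchange logic does not change.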

As a preferred embodiment of the present invention, the specific steps of step 20 include the following:

201: constructing stream-batch integrated data processing tasks;

202: scheduling the processing tasks;

203: hierarchical management of the processed data;

204: UDF management, wherein UDF data processing functions written by users are uploaded to the platform to be called by the processing tasks.

As a preferred embodiment of the present invention, step 201 supports data processing through the SQL language, Spark programs and Flink programs; where the underlying framework supports it, the same processing task code can be switched between a stream operation mode and a batch operation mode. Step 202 supports timed scheduling, dependency scheduling, and bringing scheduled tasks online and offline, so that the processing tasks form a workflow, and also supports a timeout early-warning function.
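By way of illustration only, the dependency scheduling and timeout early warning of step 202 might be sketched as below. The function `run_workflow` and its parameters are assumptions made for this sketch; the ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), and a real platform would raise an alert where the sketch merely records the task name.

```python
import time
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(tasks: Dict[str, Callable[[], None]],
                 depends_on: Dict[str, Set[str]],
                 timeout_s: float) -> List[str]:
    """Run tasks in dependency order and return the tasks that exceeded timeout_s.

    Every name used in depends_on is assumed to be a key of tasks.
    """
    # Map each task to the set of tasks that must finish before it starts.
    graph = {name: depends_on.get(name, set()) for name in tasks}
    overdue: List[str] = []
    for name in TopologicalSorter(graph).static_order():
        started = time.monotonic()
        tasks[name]()
        if time.monotonic() - started > timeout_s:
            overdue.append(name)  # a real platform would send an early warning here
    return overdue
```

The sketch runs tasks sequentially for clarity; timed scheduling and parallel execution of independent tasks are left out.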

As a preferred embodiment of the present invention, the step 40 comprises the following steps:

401: a FaaS-based data open interface engine, namely a data ad-hoc query interface engine based on FaaS technology, in which each ad-hoc query service corresponds to one container cluster, and which provides unified interface access, load balancing and fault isolation;

402: a data push service, which pushes data to users in the form of a message queue;

403: data desensitization management: desensitization rules for the data objects in a data service are configured according to the authority of the data service caller; in addition to the common field desensitization modes currently available for ID card numbers, mobile phone numbers and the like, a character-filling desensitization mode is provided;

404: SLA-based storage scheduling.
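As a minimal sketch of the desensitization management of step 403, the following Python fragment masks fields by character filling according to the caller's authority. The role names, the rule table `RULES` and the functions `mask_fill` and `desensitize` are hypothetical choices made for illustration and are not prescribed by the invention.

```python
from typing import Callable, Dict

Masker = Callable[[str], str]


def mask_fill(value: str, keep_head: int = 3, keep_tail: int = 4, fill: str = "*") -> str:
    """Character-filling mask: keep the head and tail, fill the middle."""
    if len(value) <= keep_head + keep_tail:
        return fill * len(value)  # too short to keep anything readable
    tail_start = len(value) - keep_tail
    return value[:keep_head] + fill * (tail_start - keep_head) + value[tail_start:]


# Hypothetical rule table: caller role -> field name -> masking function.
RULES: Dict[str, Dict[str, Masker]] = {
    "admin": {},  # full authority, nothing is masked
    "analyst": {
        "phone": lambda v: mask_fill(v, 3, 4),
        "name": lambda v: mask_fill(v, 1, 0),
    },
}


def desensitize(record: Dict[str, str], role: str) -> Dict[str, str]:
    """Apply the caller's rules; callers with an unknown role see every field fully masked."""
    rules = RULES.get(role)
    if rules is None:
        return {field: mask_fill(value, 0, 0) for field, value in record.items()}
    return {field: rules.get(field, str)(value) for field, value in record.items()}
```

Masking unknown callers completely is a conservative default chosen for this sketch, not a requirement stated in the text.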

In step 404, the different data stores and query frameworks can provide APIs with different SLAs. The framework can therefore provide different types of data storage, each corresponding to data APIs of a different SLA type, and at the same time predicts, from the current data API call situation and the data integration speed, whether the SLA of a new data API can be met, so that the underlying storage is scaled out or scaled in accordingly.
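The prediction described for step 404 is not specified in detail. One possible sketch, assuming a simple linear capacity model that is not taken from the invention, is shown below: each storage node is assumed to sustain a fixed query rate and a fixed ingest rate, and the names `StorageTier`, `required_nodes` and `plan_scaling`, as well as the 20% headroom, are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class StorageTier:
    """One type of data storage serving data APIs of one SLA type."""
    name: str
    nodes: int
    qps_per_node: float     # assumed query throughput of a single node
    ingest_per_node: float  # assumed ingest throughput of a single node, records/s


def required_nodes(tier: StorageTier, query_qps: float, ingest_rate: float,
                   headroom: float = 0.2) -> int:
    """Estimate the nodes needed to serve the query load and the data integration load."""
    for_queries = query_qps / tier.qps_per_node
    for_ingest = ingest_rate / tier.ingest_per_node
    return max(1, math.ceil((for_queries + for_ingest) * (1 + headroom)))


def plan_scaling(tier: StorageTier, current_qps: float, ingest_rate: float,
                 new_api_qps: float) -> Dict[str, object]:
    """Predict whether a new data API's SLA can be met and how many nodes to add or remove."""
    needed = required_nodes(tier, current_qps + new_api_qps, ingest_rate)
    return {
        "sla_met_now": needed <= tier.nodes,
        "target_nodes": needed,
        "delta": needed - tier.nodes,  # positive: scale out, negative: scale in
    }
```

A production predictor would more likely be fitted to observed latency and call history; the linear model only shows where the call situation and the data integration speed enter the decision.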

Compared with the prior art, the invention has the beneficial effects that:

By adopting the most efficient data storage and query mode (an Apache Hudi data lake plus Presto, an OLAP query engine based on an MPP architecture), with Alluxio used to reduce I/O overhead where necessary, the invention can handle most data query requests and, compared with most Restful interfaces currently in use, can effectively meet the data requirements of various users. Because the APIs are built on a FaaS platform, the invention also effectively solves the problem that, when an abnormal API crashes the service, all data APIs become unavailable and fault isolation cannot be achieved.

Drawings

FIG. 1 is a flow block diagram of API construction according to the present invention;

FIG. 2 is a flow block diagram of building the multi-source heterogeneous data exchange framework according to the present invention;

FIG. 3 is a flow block diagram of building the stream-batch integrated data processing framework according to the present invention;

FIG. 4 is a flow block diagram of API building based on the FaaS platform according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments, and all other embodiments obtained by a person of ordinary skill in the art without any creative work based on the embodiments of the present invention belong to the protection scope of the present invention.

To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Several embodiments of the invention are presented. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

Embodiment: referring to FIGS. 1-4, the method comprises the following steps:

10: building a multi-source heterogeneous data exchange framework, which provides a highly abstract, flexible and extensible technical framework for data exchange and supports data collection, data export, conversion between data stores during open sharing, and similar functions;

20: building a stream-batch integrated data processing framework;

30: storing data in an Apache Hudi data lake;

40: building APIs on a FaaS platform: data in the platform can be exported using the multi-source heterogeneous data exchange framework, or an API can be generated directly to be called by other systems, or the data can be pushed to downstream business systems through a message queue;

50: performing data queries based on the Presto platform.

In step 30, an Alluxio storage system may also be used to reduce I/O overhead.

Step 10 comprises the steps of:

102: exchange behavior abstraction: for each data source, user-definable behaviors such as the exchange task run-time strategy, the strategy for writing new and old data, the dirty data filtering strategy and the task run configuration are abstracted, and each data source may implement them according to what its underlying technology supports. For example, for the task run time, immediate execution, timed execution, periodic execution and streaming execution can be supported; for the handling of new and old data, full replacement, ignoring updates, saving as a new data version and the like can be supported; for the task run configuration, different task executors (a single machine with a single thread, a Spark cluster, a Flink cluster and the like), rate limiting, resumable transfer from a breakpoint and the like can be supported;

103: external metadata import: if the data source side stores metadata of the imported data, such as field remarks and primary and foreign key relationships, a development framework is provided to implement import of that metadata.
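The write strategies for new and old data named in step 102 (full replacement, ignoring updates, saving as a new data version) together with dirty data filtering might, purely as an illustration, look like the following. The in-memory `Store` type, the strategy names in `STRATEGIES` and the function `run_exchange` are assumptions made for this sketch.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
# A store maps a primary key to the list of versions kept for that key.
Store = Dict[object, List[Record]]


def full_replace(store: Store, batch: List[Record], key: str) -> None:
    """Discard all existing data and keep only the incoming batch."""
    store.clear()
    for rec in batch:
        store[rec[key]] = [rec]


def ignore_updates(store: Store, batch: List[Record], key: str) -> None:
    """Insert new keys only; records whose key already exists are skipped."""
    for rec in batch:
        store.setdefault(rec[key], [rec])


def save_new_version(store: Store, batch: List[Record], key: str) -> None:
    """Append each incoming record as a new version under its key."""
    for rec in batch:
        store.setdefault(rec[key], []).append(rec)


STRATEGIES: Dict[str, Callable[[Store, List[Record], str], None]] = {
    "full_replace": full_replace,
    "ignore_updates": ignore_updates,
    "save_new_version": save_new_version,
}


def run_exchange(store: Store, batch: List[Record], key: str, strategy: str,
                 is_dirty: Callable[[Record], bool] = lambda rec: False) -> int:
    """Filter dirty records, apply the chosen write strategy, return the number dropped."""
    clean = [rec for rec in batch if not is_dirty(rec)]
    STRATEGIES[strategy](store, clean, key)
    return len(batch) - len(clean)
```

Keeping each strategy as a separate function means a data source that cannot support one of them simply omits it from its own strategy table.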

The specific steps of step 20 include the following:

201: constructing stream-batch integrated data processing tasks;

202: scheduling the processing tasks;

203: hierarchical management of the processed data;

204: UDF management, namely uploading UDF data processing functions written by users to the platform to be called by the processing tasks.

Step 201 supports data processing through the SQL language, Spark programs and Flink programs; where the underlying framework supports it, the same processing task code can be switched between a stream operation mode and a batch operation mode. Step 202 supports timed scheduling, dependency scheduling, and bringing scheduled tasks online and offline, so that the processing tasks form a workflow, and also supports a timeout early-warning function.
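The idea that the same processing task code can run in either mode can be illustrated without Spark or Flink by a plain Python sketch in which the task logic is written once over an iterable. The temperature conversion and the names `process`, `run_batch` and `run_stream` are hypothetical examples, not part of the invention.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, float]


def process(records: Iterable[Record]) -> Iterator[Record]:
    """Shared task logic: drop negative readings and convert Celsius to Fahrenheit."""
    for rec in records:
        if rec["celsius"] >= 0:
            yield {"fahrenheit": rec["celsius"] * 9 / 5 + 32}


def run_batch(dataset: List[Record]) -> List[Record]:
    """Batch mode: the whole bounded dataset is processed and materialized at once."""
    return list(process(dataset))


def run_stream(source: Iterable[Record]) -> Iterator[Record]:
    """Stream mode: results are emitted one by one as records arrive."""
    return process(source)
```

Only the runner differs between the two modes; `process` itself is unchanged, which is the property the step describes.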

Step 40 comprises the steps of:

401: the FaaS-based data open interface engine is a data ad-hoc query interface engine based on FaaS technology; each ad-hoc query service corresponds to one container cluster, and unified interface access, load balancing and fault isolation are provided. This effectively solves the problem with Restful interfaces that, when an abnormal API crashes the service, all data APIs become unavailable and fault isolation cannot be achieved;

402: the data push service pushes data to users in the form of a message queue;

404: SLA-based storage scheduling: the different data stores and query frameworks can provide APIs with different SLAs. The framework can therefore provide different types of data storage, each corresponding to data APIs of a different SLA type, and at the same time predicts, from the current data API call situation and the data integration speed, whether the SLA of a new data API can be met, so that the underlying storage is scaled out or scaled in accordingly.
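The fault isolation described in step 401 can be illustrated with a small simulation. In the sketch below each deployed function stands in for one ad-hoc query service with its own container cluster, and the class `ApiGateway`, its methods and the status codes are assumptions chosen for illustration; a real FaaS platform isolates services at the container level rather than with a `try`/`except` block.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[dict], dict]


class ApiGateway:
    """Hypothetical unified entry point; each API stands in for its own container cluster."""

    def __init__(self) -> None:
        self._functions: Dict[str, Handler] = {}
        self._healthy: Dict[str, bool] = {}

    def deploy(self, name: str, fn: Handler) -> None:
        """Register one data API as an independently managed function."""
        self._functions[name] = fn
        self._healthy[name] = True

    def call(self, name: str, params: dict) -> Tuple[int, dict]:
        """Route a request; a failure marks only the failing API as unavailable."""
        if name not in self._functions:
            return 404, {"error": "unknown API"}
        if not self._healthy[name]:
            return 503, {"error": "API unavailable"}
        try:
            return 200, self._functions[name](params)
        except Exception as exc:
            self._healthy[name] = False  # isolate the fault to this API only
            return 500, {"error": str(exc)}
```

In this simulation a crashing API is taken out of service while every other API keeps answering, which is the behavior the step contrasts with running all interfaces on a single set of services.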

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A cloud native data API construction method based on a single data stream, comprising the following steps:

10: building a multi-source heterogeneous data exchange framework;

20: building a stream-batch integrated data processing framework;

30: storing data in an Apache Hudi data lake;

40: building APIs on a FaaS platform;

50: performing data queries based on the Presto platform.

2. The cloud native data API construction method based on the single data stream according to claim 1, characterized in that: the step 30 may also employ an Alluxio storage system to reduce I/O overhead.

3. The cloud native data API construction method based on the single data stream according to claim 1, characterized in that: the step 10 comprises the following steps:

4. The cloud native data API construction method based on the single data stream according to claim 1, characterized in that: the specific steps of step 20 include the following:

201: constructing stream-batch integrated data processing tasks;

202: scheduling the processing tasks;

203: hierarchical management of the processed data;

5. The cloud native data API construction method based on the single data stream according to claim 4, wherein: step 201 supports data processing through the SQL language, Spark programs and Flink programs; where the underlying framework supports it, the same processing task code can be switched between a stream operation mode and a batch operation mode; step 202 supports timed scheduling, dependency scheduling, and bringing scheduled tasks online and offline, so that the processing tasks form a workflow, and also supports a timeout early-warning function.

6. The cloud native data API construction method based on the single data stream according to claim 1, characterized in that: the step 40 comprises the steps of:

402: the data pushing service pushes data to a user in a message queue form;

404: SLA-based storage scheduling.

7. The cloud native data API construction method based on the single data stream according to claim 6, characterized in that: in step 404, the different data stores and query frameworks can provide APIs with different SLAs; the framework can provide different types of data storage, each corresponding to data APIs of a different SLA type, and predicts, from the current data API call situation and the data integration speed, whether the SLA of a new data API can be met, so that the underlying storage is scaled out or scaled in accordingly.