CN117743470A - Processing system for heterogeneous big data - Google Patents

Processing system for heterogeneous big data

Info

Publication number
CN117743470A
Authority
CN
China
Prior art keywords
metadata
platform
computing
data
distributed storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410168476.XA
Other languages
Chinese (zh)
Other versions
CN117743470B (en)
Inventor
路培杰
杨辉
周志忠
刘文虎
罗颖
陈威余
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Yungu Technology Co Ltd
Original Assignee
Zhongke Yungu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Yungu Technology Co Ltd filed Critical Zhongke Yungu Technology Co Ltd
Priority to CN202410168476.XA priority Critical patent/CN117743470B/en
Publication of CN117743470A publication Critical patent/CN117743470A/en
Application granted granted Critical
Publication of CN117743470B publication Critical patent/CN117743470B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application provides a processing system for heterogeneous big data, comprising: a plurality of distributed storage systems for storing data and the metadata of each datum; a unified access platform comprising a metadata storage system for each distributed storage system, each metadata storage system acquiring and storing the metadata of its corresponding distributed storage system; a metadata configuration management system for acquiring the metadata of the various distributed storage systems from the unified access platform and organizing and storing all of the metadata in tabular form; and a computing engine platform comprising a plurality of computing engines. When the computing engine platform receives a computing task, it provides a corresponding target computing engine to parse the task, determines the target data for the task from the metadata storage table in the metadata configuration management system, and accesses the target distributed storage system holding that data through the unified access platform to retrieve the data for computation.

Description

Processing system for heterogeneous big data
Technical Field
The present application relates to the technical field of Internet of Things data, and in particular to a processing system for heterogeneous big data.
Background
Prior-art solutions for analyzing cross-source heterogeneous data fall into three approaches. The first collects data from the different storage systems into an HDFS distributed file system through ETL tools, builds a Hive data warehouse on top of HDFS, and analyzes the data in Hive comprehensively with computing engines such as Spark and Flink. The second analyzes the data on each public or private cloud independently, aggregates the individual analysis results, and finally performs a secondary statistical analysis on the aggregate. The third stores the heterogeneous data uniformly using data lake technology and then analyzes it uniformly with a computing engine. All three approaches must synchronize and migrate massive data from different storage systems to one place, which incurs a large migration and synchronization cost; because such migration usually takes a long time, the timeliness of data analysis suffers severely. Moreover, aggregation requires a unified data format and unified storage, so the original data cannot be preserved, data consistency problems arise, analysis efficiency is low, and preprocessing costs are high.
Disclosure of Invention
Embodiments of the present application aim to provide a processing system for heterogeneous big data that overcomes the technical defects of the prior art, namely the high cost and high data latency caused by unified collection and unified analysis when statistically analyzing large volumes of multi-source heterogeneous data in a hybrid cloud environment.
To achieve the above object, a first aspect of the present application provides a processing system for heterogeneous big data, comprising:
a plurality of distributed storage systems for storing data and the metadata of each datum, wherein the data stored in any two distributed storage systems of different types are heterogeneous with respect to each other;
a unified access platform comprising a metadata storage system for each distributed storage system, each metadata storage system being configured to acquire and store the metadata of its corresponding distributed storage system;
a metadata configuration management system, connected to the unified access platform and the computing engine platform, for acquiring the metadata of the various distributed storage systems from the unified access platform and organizing and storing all of the metadata in tabular form;
a computing engine platform, connected to the unified access platform and the metadata configuration management system, comprising a plurality of computing engines; when a computing task is received, the computing engine platform provides a target computing engine appropriate to the task, and the target computing engine parses the task, determines the target data for the task from the metadata storage table in the metadata configuration management system, and accesses the target distributed storage system corresponding to that data through the unified access platform to retrieve the target data for computation.
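The control flow of this claim can be reduced to a small sketch (all table names, storage labels, and paths below are invented for illustration): the metadata store's table maps each logical table to the storage system holding it, so an engine can locate a task's target data without migrating anything.

```python
# Hypothetical sketch of the claimed flow: the metadata storage table maps
# logical table names to the distributed storage system and path that hold
# them, so a compute engine resolves target data in place.

METADATA_TABLE = {
    # logical table -> (storage system, path inside that system)
    "device_events":  ("minio", "jfs://events/device_events"),
    "order_history":  ("oss",   "jfs://orders/order_history"),
    "sensor_archive": ("hdfs",  "jfs://archive/sensor_archive"),
}

def resolve_targets(task_tables):
    """For each table a computing task references, return the target
    distributed storage system the unified access platform must query."""
    return {t: METADATA_TABLE[t] for t in task_tables}

targets = resolve_targets(["device_events", "order_history"])
```

The point of the lookup is that only metadata crosses system boundaries up front; the data itself stays in its original storage system until the engine reads it.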
In an embodiment of the present application, the processing system further comprises: a multi-tenant platform, connected to the computing engine platform, for providing an SQL query interface, obtaining SQL query statements submitted by SQL clients through that interface, converting the SQL into computing tasks, and sending them to the computing engine platform; and a container orchestration platform for providing a plurality of containers for the multi-tenant platform and the computing engine platform.
In an embodiment of the present application, the metadata configuration management system is further configured to: obtain its installation package and decompress it into a corresponding decompressed package; obtain the dependency package of the unified access platform and add it to the dependency library directory of the decompressed package; build a first image file for the metadata configuration management system based on the container orchestration platform; obtain a first containerized resource configuration file for the metadata configuration management system and first configuration information of the unified access platform, and add the first configuration information to the first containerized resource configuration file, the first configuration information comprising a plurality of storage implementation classes and the metadata engine address of each metadata storage system; and, on the container orchestration platform, execute the first containerized resource configuration file and the first image file in sequence to deploy the metadata configuration management system into a plurality of containers, then start it to establish the connection between the metadata configuration management system and the unified access platform.
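The key data-handling step above, adding the first configuration information into the first containerized resource configuration file, can be sketched as a merge of two structures; every key and value below is a placeholder, not a real JuiceFS or Hive Metastore property name.

```python
# Illustrative only: merging the unified access platform's "first
# configuration information" (storage implementation classes and metadata
# engine addresses, values invented here) into the metadata configuration
# management system's containerized resource configuration file.

first_resource_config = {
    "kind": "Deployment",
    "metadata": {"name": "hive-metastore"},
    "env": {},  # container environment to be populated
}

first_configuration = {
    # hypothetical property names, one per metadata storage system
    "juicefs.minio.storage-class": "io.juicefs.JuiceFileSystem",
    "juicefs.minio.meta-url": "mysql://meta-host:3306/jfs_minio_db",
}

def add_configuration(resource_config, configuration):
    """Return a new resource config with the configuration folded into
    the container environment, leaving the original untouched."""
    merged = dict(resource_config)
    merged["env"] = {**resource_config["env"], **configuration}
    return merged

deployable = add_configuration(first_resource_config, first_configuration)
```

Executing the resulting file on the orchestration platform would then give every container of the metadata configuration management system the addresses it needs to reach the metadata storage systems.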
In an embodiment of the present application, the multi-tenant platform is further configured to: build a second image file for the multi-tenant platform based on the container orchestration platform; obtain a second containerized resource configuration file for the multi-tenant platform and set in it the environment variables, external service ports, operation mode, image file name, and second configuration information that the multi-tenant platform needs to run; and execute the second containerized resource configuration file and the second image file in sequence on the container orchestration platform to deploy the multi-tenant platform into the plurality of containers.
In an embodiment of the present application, for any one of the computing engines, the computing engine platform is further configured to: build a third image file for the computing engine based on the container orchestration platform; rebuild the second image file of the multi-tenant platform based on the third image file; and execute the updated second image file on the container orchestration platform to deploy the computing engine into the plurality of containers.
In an embodiment of the present application, for any one of the computing engines, the computing engine platform is further configured to: obtain third configuration information of the metadata configuration management system and add it to the third image file; and redeploy the computing engine into the plurality of containers based on the updated third image file, thereby establishing the connection between the computing engine and the metadata configuration management system.
In an embodiment of the present application, for any one of the distributed storage systems, the computing engine platform is further configured to: add the dependency package of the unified access platform, which contains the access parameters of the plurality of distributed storage systems, to the third image file; obtain fourth configuration information of the distributed storage system, add the first configuration information of the unified access platform to it, and add the updated fourth configuration information to the third image file; and redeploy the computing engine into the plurality of containers based on the updated third image file, thereby establishing the connection between the computing engine and the unified access platform.
In an embodiment of the present application, for any one of the distributed storage systems, the unified access platform is further configured to: build the metadata storage system corresponding to that distributed storage system; build a corresponding storage bucket in the distributed storage system based on the metadata storage system and set the bucket's access information; define a connection script between the bucket and its metadata storage system based on that access information; and execute the connection script to establish the connection between the bucket and the metadata storage system, so that the connection information carried in the script is stored in the metadata storage system and the metadata of the data held in the bucket is transmitted to the metadata storage system for storage.
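A connection script of the kind described above could be generated as follows. The command shape follows the JuiceFS CLI's `juicefs format` subcommand, which links an object-storage bucket to a metadata engine, but the bucket URL, credentials, database, and volume name here are all hypothetical.

```python
def make_connection_script(bucket_url, access_key, secret_key,
                           meta_url, volume_name):
    """Build a juicefs-format style connection script linking a storage
    bucket to its metadata storage system. Flag names follow the JuiceFS
    CLI; all argument values are illustrative placeholders."""
    return (
        "juicefs format "
        f"--storage minio --bucket {bucket_url} "
        f"--access-key {access_key} --secret-key {secret_key} "
        f"{meta_url} {volume_name}"
    )

script = make_connection_script(
    "http://minio:9000/raw-data",                 # bucket access info
    "AK_PLACEHOLDER", "SK_PLACEHOLDER",
    "mysql://jfs:pw@(mysql:3306)/jfs_minio_db",   # metadata engine address
    "minio-vol",
)
```

Running such a script once per bucket is what persists the bucket-to-metadata-system connection information and starts metadata flowing into the metadata storage system.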
In an embodiment of the present application, the metadata configuration management system is further configured to: build a fusion-analysis data warehouse for the plurality of distributed storage systems, the warehouse comprising at least an original layer, a standard layer, an integration layer, and an application layer; build, in the original layer, a table-format metadata repository corresponding to the storage buckets of the plurality of distributed storage systems; and determine a table creation script for each distributed storage system. For the table creation script of any distributed storage system, after the script has been executed through the multi-tenant platform, the metadata configuration management system acquires the metadata of that storage system from the unified access platform and stores it in the table-format metadata repository in tabular form.
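The four-layer warehouse named above can be sketched with a naming rule. The text does not prescribe layer prefixes; the ods/dwd/dws/ads abbreviations used here are a common data-warehouse convention assumed for illustration only.

```python
# Sketch of the fusion-analysis data warehouse layers from this
# embodiment. Prefixes are an assumed convention, not from the patent:
# ods = original, dwd = standard, dws = integration, ads = application.

LAYERS = {
    "original":    "ods",  # caches raw data pulled from the buckets
    "standard":    "dwd",  # standardized/cleaned layer
    "integration": "dws",  # integrated/aggregated layer
    "application": "ads",  # application-facing layer
}

def table_name(layer, bucket):
    """Derive a warehouse table name for a bucket at a given layer."""
    return f"{LAYERS[layer]}_{bucket}"

names = [table_name(layer, "device_events") for layer in LAYERS]
```

A table creation script per storage system would then create the original-layer tables, after which the metadata configuration management system fills them from the unified access platform.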
In an embodiment of the present application, the computing engine platform is further configured to: for any computing engine, after the engine determines the target data for a computing task from the metadata storage table in the metadata configuration management system, access the target distributed storage system corresponding to the target data through the unified access platform and retrieve the target data from it; cache the target data in the original layer of the fusion-analysis data warehouse; and compute on the target data successively through the standard, integration, and application layers of the warehouse.
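The cache-then-compute-layer-by-layer sequence can be modeled as a pipeline of functions; the transforms at each layer below (filtering, summing, wrapping a metric) are invented stand-ins for whatever business logic a real task would run.

```python
# Toy pipeline mirroring this embodiment: the original layer caches the
# retrieved target data, then the standard, integration, and application
# layers compute on it in order. Each transform is a placeholder.

cache = {}

def original(rows):                 # original layer: cache raw target data
    cache["ods"] = rows
    return rows

def standard(rows):                 # standard layer: e.g. drop bad records
    return [r for r in rows if r >= 0]

def integration(rows):              # integration layer: e.g. aggregate
    return sum(rows)

def application(total):             # application layer: e.g. expose metric
    return {"metric": total}

result = application(integration(standard(original([3, -1, 4]))))
```

Because the original layer keeps an untouched copy, later layers can be recomputed without re-reading the target distributed storage system.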
The above technical solution provides a processing system for heterogeneous big data comprising a unified access platform, a metadata configuration management system, and a computing engine platform integrating multiple computing engines. The unified access platform exposes a uniform data access interface to every distributed storage system, so the metadata of the data in each system can be collected uniformly; it is integrated with the metadata configuration management system and the computing engine platform, and the metadata configuration management system stores the metadata held in the unified access platform in a unified tabular form. Because the metadata configuration management system is also integrated with the computing engine platform, a computing engine can access the metadata through it, parse the business logic of a computing task, locate the data the task needs, and collect that data from the corresponding distributed storage system through the unified access platform for computation. A task thus interacts only with the single file system of the unified access platform, which realizes unified storage and unified analysis of the data in form and achieves joint analysis of multi-source heterogeneous data in a simple and efficient way.
Additional features and advantages of embodiments of the present application will be set forth in the detailed description that follows.
Drawings
The accompanying drawings are included to provide a further understanding of embodiments of the present application and are incorporated in and constitute a part of this specification, illustrate embodiments of the present application and together with the description serve to explain, without limitation, the embodiments of the present application. In the drawings:
FIG. 1 schematically illustrates a block diagram of a processing system for heterogeneous big data according to an embodiment of the present application;
FIG. 2 schematically illustrates a flow chart of data reading and writing by the JuiceFS distributed file system according to an embodiment of the present application;
FIG. 3 schematically illustrates a block diagram of yet another processing system for heterogeneous big data according to an embodiment of the present application;
FIG. 4 schematically illustrates a flow chart of fusion analysis of cross-source heterogeneous data according to an embodiment of the present application;
FIG. 5 schematically shows a flow chart of Kyuubi submitting a Spark SQL analysis task according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. It should be understood that the specific implementations described here serve only to illustrate and explain the embodiments of the present application and are not intended to limit them. All other embodiments obtainable by one of ordinary skill in the art from the present disclosure without undue burden fall within the scope of the present application.
In addition, where the embodiments of the present application describe elements as "first", "second", and so on, those labels serve description only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features referred to; a feature qualified by "first" or "second" may thus explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may also be combined with one another, but only on the basis that the combination can be realized by those skilled in the art; where combined solutions are contradictory or cannot be realized, the combination is deemed not to exist and falls outside the protection scope of the present application.
FIG. 1 schematically illustrates a block diagram of a processing system for heterogeneous big data according to an embodiment of the present application. As shown in fig. 1, an embodiment of the present application provides a processing system for heterogeneous big data, the processing system including:
A plurality of distributed storage systems 110 for storing data and the metadata of each datum, wherein the data stored in any two distributed storage systems 110 of different types are heterogeneous with respect to each other.
In this technical solution the various distributed storage systems may be distributed across different public and private clouds. Heterogeneous data means data of different sources and different types, which may differ in structure, format, semantics, and purpose. Specifically, the plurality of distributed storage systems may be MinIO, Alibaba Cloud OSS, Amazon AWS S3, and HDFS, each of which may store data together with the metadata of each datum. MinIO is a high-performance distributed object storage platform providing S3-protocol-compatible object storage and is well suited to storing large volumes of unstructured data. As an object storage platform, MinIO handles objects at file granularity: storage is completed simply by uploading the data files (electronic files) carrying the mass data through a MinIO client. HDFS is a highly fault-tolerant system suited to deployment on inexpensive machines; it provides high-throughput data access and is well suited to large-scale data sets.
The unified access platform 120 comprises a metadata storage system for each distributed storage system 110, each metadata storage system acquiring and storing the metadata of its corresponding distributed storage system 110.
In this technical solution the unified access platform may be JuiceFS, a high-performance distributed file system designed for cloud-native environments. JuiceFS adopts an architecture in which data and metadata are stored separately, realizing a distributed design of the file system. FIG. 2 shows a flow chart of data reading and writing in the JuiceFS distributed file system: the file data itself is split and stored in the object storage platform, while the metadata can be stored in any of several databases such as Redis, MySQL, TiKV, and SQLite. As a cloud-native distributed file system fully compatible with the POSIX, HDFS, S3, and WebDAV access protocols, JuiceFS makes it easy to store, access, process, and share all unstructured and semi-structured data. JuiceFS provides a rich set of API interfaces; a third-party application or system can manage, analyze, archive, back up, read, and write the heterogeneous data on JuiceFS-managed storage through the APIs provided by the JuiceFS Hadoop SDK. JuiceFS also has good cross-platform capability, supporting almost all mainstream operating systems, including but not limited to Linux, macOS, and Windows.
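The data/metadata separation just described can be illustrated with a toy in-memory model: file bytes are chunked into an object store while a metadata engine records only the chunk list. The 4-byte chunk size is purely for demonstration; JuiceFS actually splits files into far larger blocks.

```python
# Toy model of the separation JuiceFS performs (cf. FIG. 2): file content
# goes to the object store in chunks, metadata (the chunk list) goes to a
# separate metadata engine. Chunk size is illustrative only.

CHUNK = 4  # bytes; real systems use much larger blocks
object_store = {}   # stands in for MinIO/OSS/S3/HDFS
meta_engine = {}    # stands in for Redis/MySQL/TiKV/SQLite

def write(path, data):
    """Split data into chunks, store them, record the chunk keys."""
    keys = []
    for offset in range(0, len(data), CHUNK):
        key = f"{path}/{offset // CHUNK}"
        object_store[key] = data[offset:offset + CHUNK]
        keys.append(key)
    meta_engine[path] = keys  # metadata: which objects make up the file

def read(path):
    """Reassemble a file from its chunk list in the metadata engine."""
    return b"".join(object_store[k] for k in meta_engine[path])

write("/events/log", b"heterogeneous")
```

The read path consults only the metadata engine to find the chunks, which is why swapping the object store underneath does not change how clients see the file system.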
Because the unified access platform JuiceFS offers rich API interfaces, it can be integrated with every type of distributed storage system to collect its metadata. JuiceFS accordingly holds a plurality of metadata storage systems, one per distributed storage system, each storing the metadata of its corresponding storage system. Specifically, since the distributed storage systems in this solution may comprise MinIO, Alibaba Cloud OSS, Amazon AWS S3, and HDFS, JuiceFS may be integrated with all four, and a metadata storage system corresponding to each may be built in JuiceFS to store its metadata. In this solution a MySQL database may be chosen as the JuiceFS metadata repository, and the MySQL databases corresponding to MinIO, Alibaba Cloud OSS, Amazon AWS S3, and HDFS (jfs_minio_db, jfs_ali_db, jfs_aws_db, jfs_hdfs_db) may be created in MySQL respectively. It should be understood that the four hybrid distributed storage systems MinIO, Alibaba Cloud OSS, Amazon AWS S3, and HDFS serve mainly as an example; in actual use, as many corresponding MySQL metadata storage systems are created as there are kinds of distributed storage system, and each MySQL metadata storage system has its own user name, password, and read-write permissions on its database.
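The one-database-per-storage-system rule scales by construction, as a short sketch shows; the `jfs_<system>_db` pattern matches the database names listed above, and extending the storage list is all that is needed to add a fifth system.

```python
# The embodiment creates one MySQL metadata database per distributed
# storage system. This sketch captures that naming rule so that adding
# another storage system only extends the list.

def metadata_db_name(storage_system):
    """Derive the per-system metadata database name, matching the
    jfs_minio_db / jfs_ali_db / jfs_aws_db / jfs_hdfs_db pattern."""
    return f"jfs_{storage_system}_db"

STORAGE_SYSTEMS = ["minio", "ali", "aws", "hdfs"]
META_DBS = {s: metadata_db_name(s) for s in STORAGE_SYSTEMS}
```

Each generated database would still need its own credentials and grants, per the embodiment's note on independent user names, passwords, and read-write permissions.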
The metadata configuration management system 130 is connected to the unified access platform 120 and the computing engine platform 140, and is configured to acquire the metadata of the various distributed storage systems 110 from the unified access platform 120 and to organize and store all of the metadata in tabular form.
The metadata configuration management system may be the Hive Metastore, a service designed by Apache Hive for conveniently managing metadata; the Spark, Flink, and Trino distributed computing engines all have mature support for Hive. In this technical solution, therefore, the Hive Metastore may be chosen as the table-format metadata storage system over the JuiceFS file system, making it convenient for the various computing engines to perform business-logic computation. Specifically, the Hive Metastore of the metadata configuration management system is integrated with both the unified access platform JuiceFS and the computing engine platform: after classifying, collecting, and storing the metadata of each distributed storage system, JuiceFS can write that metadata uniformly into the Hive Metastore for uniform tabular storage, ready for the analysis and computation of each engine in the computing platform.
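The "uniform tabular storage" step amounts to normalizing per-system metadata records into one row shape before they reach the table-format store. The record shapes and field names below are invented; the patent does not specify a schema.

```python
# Sketch of the tabular organization step: per-storage-system metadata
# records (shapes invented here) are normalized into one row format of
# the kind a table-format metadata store could hold.

def to_rows(system, records):
    """Normalize one storage system's metadata records into unified rows."""
    return [
        {"system": system, "table": r["name"], "location": r["loc"]}
        for r in records
    ]

unified = (
    to_rows("minio", [{"name": "t1", "loc": "jfs://bucket1/t1"}])
    + to_rows("hdfs", [{"name": "t2", "loc": "jfs://bucket2/t2"}])
)
```

Once every system's metadata shares one row format, a single storage table can answer "where does table X live?" for any engine, which is what the computing step relies on.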
The computing engine platform 140 is connected to the unified access platform 120 and the metadata configuration management system 130 and comprises a plurality of computing engines. When a computing task is received, the computing engine platform 140 provides a target computing engine appropriate to the task; the target engine parses the task, determines the target data for the task from the metadata storage table in the metadata configuration management system 130, and accesses the target distributed storage system corresponding to that data through the unified access platform 120 to retrieve the target data for computation.
The computing engine platform may be a service platform integrating several computing engines of different types. To make big-data computation and analysis more convenient and more universal, the computing engine platform in this solution must satisfy the data analysis and computation needs of a variety of scenarios; a service is therefore needed that unifies the computing engines and can easily integrate different engines together. Specifically, the computing platform in this solution integrates several different distributed computing engines such as Spark, Flink, and Trino. Each engine in the platform is integrated with the Hive Metastore of the metadata configuration management system and with the unified access platform JuiceFS, and when the computing engine platform receives a computing task, it provides a target computing engine appropriate to the task. The target engine parses the task and, being integrated with the Hive Metastore and JuiceFS, determines the target data for the task from the metadata storage table in the Hive Metastore, then accesses the target distributed storage system corresponding to that data through JuiceFS to retrieve the target data for computation.
FIG. 4 provides a flow chart of the fusion analysis of cross-source heterogeneous data. The cross-source heterogeneous data may come from the distributed storage systems of the data storage layer: MinIO, Alibaba Cloud OSS, Amazon AWS S3, and HDFS. Each distributed storage system is integrated with the unified access platform JuiceFS of the file system layer, and JuiceFS contains a MySQL metadata storage system for each of MinIO, Alibaba Cloud OSS, Amazon AWS S3, and HDFS, into which the metadata of the corresponding storage system is written. JuiceFS is in turn integrated with the Hive Metastore of the metadata layer, and the metadata held in each MySQL store in JuiceFS is written uniformly into the Hive Metastore for table-format storage. Because the Hive Metastore supports many computing engines, it can be integrated with the engines of the computing layer, such as Spark, Flink, and Trino; when any engine in the computing layer receives a computing task, it can parse the task against the metadata stored in table format in the Hive Metastore to determine the task's target data.
So that a computing engine can retrieve the target data from the corresponding target distributed storage system once that data has been determined, the engines Spark, Flink, Trino, and so on are integrated with the unified access platform JuiceFS. Through JuiceFS they access MinIO, Alibaba Cloud OSS, Amazon AWS S3, and HDFS, retrieve the target data for the computing task from each, and complete the computation against the Hive Metastore of the metadata layer.
The above technical solution provides a processing system for heterogeneous big data comprising a unified access platform, a metadata configuration management system, and a computing engine platform integrating multiple computing engines. The unified access platform exposes a uniform data access interface to every distributed storage system, so the metadata of the data in each system can be collected uniformly; it is integrated with the metadata configuration management system and the computing engine platform, and the metadata configuration management system stores the metadata held in the unified access platform in a unified tabular form. Because the metadata configuration management system is also integrated with the computing engine platform, a computing engine can access the metadata through it, parse the business logic of a computing task, locate the data the task needs, and collect that data from the corresponding distributed storage system through the unified access platform for computation. A task thus interacts only with the single file system of the unified access platform, which realizes unified storage and unified analysis of the data in form and achieves joint analysis of multi-source heterogeneous data in a simple and efficient way.
In an embodiment of the present application, as shown in fig. 3, there is provided a block diagram of still another processing system for heterogeneous big data, the processing system further including:
The multi-tenant platform 150 is connected to the computing engine platform 140 and is used to provide an SQL query interface, obtain SQL query statements submitted by SQL clients through that interface, convert the SQL into computing tasks, and send them to the computing engine platform 140.
in the technical scheme, the multi-tenant platform can be a kyuubi which is a distributed multi-tenant thread JDBC/ODBC server and is used for large-scale data management, processing and analysis, kyuubi can provide serverless SQL to support various computing engines, and the simplification and safe access to any computing engine cluster resource can be realized through a unified gateway to deploy different workloads for interrupt users. Where serverless SQL refers to the ability to run SQL queries without relying on any database server. The kyuubi comprises a user layer, a kyuubi server layer and a kyuubi engine layer, wherein the user layer can provide JDBC interface service for the SQL client, the kyuubi server layer can convert SQL query sentences submitted by the SQL client and acquired by the JDBC interface into corresponding computing tasks, and the kyuubi engine layer can be matched with the corresponding computing engines according to the computing tasks, so that the computing tasks are sent to the corresponding computing engines in the computing engine platform.
A container orchestration platform 160 for providing multiple containers for multi-tenant platform 150 and compute engine platform 140.
In the technical solution, the container orchestration platform may refer to kubernetes, abbreviated K8s, an open-source system for managing containerized applications across multiple hosts in a cloud platform; the goal of kubernetes is to make deploying containerized applications simple and efficient. Containerization in K8s is based on Docker, and K8s organizes the Docker containers together through various resource definitions. The minimum unit of management in K8s is the pod: since K8s needs to manage containers, the pod is abstracted as the management unit. Specifically, in the technical scheme, K8s can be used to provide containers for the multi-tenant platform kyuubi and for the various computing engines in the computing engine platform, thereby providing a running environment for kyuubi and the computing engines. After kyuubi and the computing engines complete containerized deployment in K8s, they can run normally in the service system, and thereby schedule and manage the various SQL query statements submitted by users through the SQL client.
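As a minimal illustration of how a resident service is described to K8s (every name below is hypothetical, not part of the scheme), a Deployment resource runs a container inside a pod:

```yaml
# Hypothetical minimal Deployment: K8s wraps the container in a pod,
# its smallest unit of management.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-server
  template:
    metadata:
      labels:
        app: example-server
    spec:
      containers:
        - name: example-server
          image: example-harbor/example-server:latest  # placeholder image
          ports:
            - containerPort: 10009
```

Applying such a resource with kubectl is what the later sections mean by "deploying in the form of a deployment resource".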
In an embodiment of the present application, the metadata configuration management system is further configured to: acquiring an installation package of the metadata configuration management system, and decompressing the installation package to obtain a corresponding decompressed package; acquiring a dependency package of the unified access platform, and adding the dependency package into a dependency library catalog of the decompressed package; constructing a first image file for the metadata configuration management system based on the container orchestration platform; acquiring a first containerized resource configuration file of the metadata configuration management system and first configuration information of the unified access platform, and adding the first configuration information into the first containerized resource configuration file, wherein the first configuration information comprises a plurality of storage implementation classes and the metadata engine address of each metadata storage system; and based on the container orchestration platform, sequentially executing the first containerized resource configuration file and the first image file to deploy the metadata configuration management system into a plurality of containers, and starting the metadata configuration management system to establish a connection between the metadata configuration management system and the unified access platform.
The metadata configuration management system may refer to Hive Metastore, since the Spark, Flink and Trino distributed computing engines all have relatively complete support for Hive. Therefore, in the technical scheme, Hive Metastore can be selected as the table-format metadata storage system over the JuiceFS file system, so that the various computing engines can conveniently perform business logic computation. Specifically, the metadata configuration management system Hive Metastore is integrated with the unified access platform JuiceFS and with the computing engine platform respectively. Hive Metastore needs to record and store the metadata information of the various table-building statements based on JuiceFS, and when a computing engine performs data computation, it can find the data distributed over the various distributed storage systems according to the metadata information recorded by Hive Metastore and perform business logic computation, so communication between Hive Metastore and JuiceFS needs to be opened up. Specifically, integrating Hive Metastore with JuiceFS first requires obtaining an installation package of Hive Metastore, decompressing the installation package to obtain a corresponding decompressed package, obtaining the dependency package of the unified access platform JuiceFS, and adding the dependency package into the dependency library catalog of the decompressed package. Here, the installation package may refer to apache-hive-metastore-3.1.3-bin.tar.gz. Specifically, the installation package apache-hive-metastore-3.1.3-bin.tar.gz is obtained and decompressed to obtain an apache-hive decompressed package, and the client dependency package juicefs-hadoop-1.1.0.jar is added under the lib catalog of the apache-hive decompressed package.
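The unpack-and-add-dependency step can be sketched as follows. The tarball and jar are created here as empty placeholders so the commands can run anywhere; the file names follow the text:

```shell
set -eu
# Placeholders standing in for the real downloads (names from the text):
mkdir -p apache-hive-metastore-3.1.3-bin/lib
tar -czf apache-hive-metastore-3.1.3-bin.tar.gz apache-hive-metastore-3.1.3-bin
rm -r apache-hive-metastore-3.1.3-bin
touch juicefs-hadoop-1.1.0.jar   # JuiceFS Hadoop client jar

# The actual integration steps: decompress the installation package, then add
# the JuiceFS client dependency package under the lib catalog of the
# decompressed package.
tar -zxvf apache-hive-metastore-3.1.3-bin.tar.gz
cp juicefs-hadoop-1.1.0.jar apache-hive-metastore-3.1.3-bin/lib/
ls apache-hive-metastore-3.1.3-bin/lib/
```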
Secondly, a first image file for the metadata configuration management system Hive Metastore is constructed based on the container orchestration platform K8s, the first containerized resource configuration file of the metadata configuration management system Hive Metastore and the first configuration information of the unified access platform are acquired, and the first configuration information is added into the first containerized resource configuration file, wherein the first configuration information comprises a plurality of storage implementation classes and the metadata engine address of each metadata storage system. In this technical solution, because fusion analysis of the data of multiple distributed storage systems such as minio, Ali Cloud (OSS), AWS (S3) and HDFS is to be implemented, the information of each of the distributed storage systems corresponding to JuiceFS is configured in the first containerized resource configuration file hive-metastore-configmap.yaml. Specifically, according to the image making rules of K8s, a containerized deployment image of apache-hive-metastore can be made, and the resulting apache-hive-metastore image is uploaded to a preset harbor image server for storage. The hive-metastore-configmap.yaml configuration file is then edited, and the key configuration information of JuiceFS is added to it, mainly comprising the two storage implementation classes of JuiceFS, fs.jfs.impl=io.juicefs.JuiceFileSystem and fs.AbstractFileSystem.jfs.impl=io.juicefs.JuiceFS, and the metadata engine address juicefs.meta corresponding to each metadata storage system in JuiceFS.
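The JuiceFS entries in hive-metastore-configmap.yaml might be sketched as follows; the metadata engine address is a placeholder, while the two property names are those of the JuiceFS Hadoop SDK:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hive-metastore-config
data:
  core-site.xml: |
    <configuration>
      <!-- the two JuiceFS storage implementation classes -->
      <property>
        <name>fs.jfs.impl</name>
        <value>io.juicefs.JuiceFileSystem</value>
      </property>
      <property>
        <name>fs.AbstractFileSystem.jfs.impl</name>
        <value>io.juicefs.JuiceFS</value>
      </property>
      <!-- metadata engine address of one JuiceFS volume (placeholder) -->
      <property>
        <name>juicefs.meta</name>
        <value>mysql://user:password@(mysql-host:3306)/jfs_minio_db</value>
      </property>
    </configuration>
```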
After the configuration of the first image file and the first containerized resource configuration file is completed, the first containerized resource configuration file and the first image file are sequentially executed based on the container orchestration platform K8s to deploy the metadata configuration management system Hive Metastore into a plurality of containers, and the metadata configuration management system Hive Metastore is started to establish a connection between the metadata configuration management system Hive Metastore and the unified access platform JuiceFS. Specifically, after configuration is completed, hive-metastore-configmap.yaml is deployed into the K8s container in the form of a configmap resource, and the above apache-hive-metastore image is deployed into the K8s container in the form of a deployment resource. The configuration information in hive-metastore-configmap is automatically loaded while the Hive Metastore service starts, and once the Hive Metastore service has started, it is installed and integrated with the unified access platform JuiceFS.
In an embodiment of the present application, the multi-tenant platform is further configured to: constructing a second image file of the multi-tenant platform based on the container orchestration platform; acquiring a second containerized resource configuration file of the multi-tenant platform, and setting, in the second containerized resource configuration file, the environment variables, external service port, operation mode, image file name and second configuration information required for the operation of the multi-tenant platform; and sequentially executing the second containerized resource configuration file and the second image file based on the container orchestration platform to deploy the multi-tenant platform into a plurality of containers.
The multi-tenant platform can refer to kyuubi; in order to better submit SQL query statements to the computing engines in the computing engine platform and to isolate computing resources, the unified multi-tenant SQL development framework kyuubi can be selected in the technical scheme to integrate with each computing engine. kyuubi is integrated with the computing engine platform, which integrates a plurality of computing engines, through the container orchestration platform K8s: after kyuubi and the computing engines finish containerized deployment in K8s, when there is a data analysis task, kyuubi can pull up the computing engine corresponding to that task in a K8s container. After the data analysis task is finished, the computing engine automatically exits the container to release its resources, so as to save resource space and improve response speed and resource utilization. Specifically, when the multi-tenant platform kyuubi is deployed onto the container orchestration platform K8s, an image file of kyuubi must first be constructed, the second containerized resource configuration file of kyuubi is obtained, and the environment variables, external service port, operation mode, image file name and second configuration information required for the operation of the multi-tenant platform kyuubi are set in the second containerized resource configuration file. Specifically, an image file for running kyuubi in a container can be made according to the making rules of K8s container images; after the kyuubi image file is made, the configuration file of kyuubi is modified, and the environment variable (JAVA_HOME) required for kyuubi to run, the service port of kyuubi, the operation mode of kyuubi, the image name of kyuubi, the specific resource configuration information of kyuubi and the like are specified in that file.
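The items listed above map onto kyuubi's configuration files roughly as follows. This is a sketch with placeholder values: kyuubi-env.sh carries environment variables, and kyuubi-defaults.conf carries service settings such as the port and engine type:

```properties
# kyuubi-env.sh (environment required to run)
#   export JAVA_HOME=/usr/lib/jvm/java-8-openjdk

# kyuubi-defaults.conf (sketch)
kyuubi.frontend.thrift.binary.bind.port=10009   # external service port
kyuubi.engine.type=SPARK_SQL                    # engine pulled up on demand
kyuubi.engine.share.level=USER                  # multi-tenant isolation level
```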
It should be understood that only the service layer, the kyuubi server layer, is a resident service and runs in the form of a pod in the K8s container; the engine layer, kyuubi engine, is a functional module and does not run as a resident service.
After the containerized resource configuration file and the image file of kyuubi are configured, the second containerized resource configuration file and the second image file are sequentially executed based on the container orchestration platform K8s so as to deploy the multi-tenant platform kyuubi into a plurality of containers. Specifically, the configuration file is deployed into the K8s container as a configmap resource, then the image file of kyuubi is deployed into the K8s container as a deployment resource, and the kyuubi-server service automatically loads the configuration information in the configmap during deployment. After the kyuubi-server container is started, SQL analysis tasks can be submitted through the hive jdbc port provided by kyuubi-server.
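A Hive JDBC client (e.g. beeline) would then reach the service with a connection string of roughly this shape, where the host, namespace and port are placeholders:

```
jdbc:hive2://kyuubi-server.default.svc.cluster.local:10009/default
```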
In an embodiment of the present application, for any one of the compute engines, the compute engine platform is further configured to: constructing a third image file of the computing engine based on the container orchestration platform; reconstructing the second image file of the multi-tenant platform based on the third image file; and executing the updated second image file based on the container orchestration platform to deploy the compute engine into a plurality of containers.
In this technical solution, the computing engine platform may refer to a service platform integrating various distributed computing engines such as spark, Flink and Trino. In order to better submit SQL query statements to the computing engines in the computing engine platform and to isolate computing resources, the unified multi-tenant SQL development framework kyuubi can be integrated with each computing engine based on the container orchestration platform K8s in the technical scheme. Specifically, for any computing engine, when that engine is containerized in the container orchestration platform K8s, a third image file of the computing engine needs to be built based on K8s, and the second image file of the multi-tenant platform kyuubi needs to be reconstructed based on the third image file. Specifically, taking the computing engine spark as an example: first, an image file for running spark in a container is made according to the making rules of K8s container images; and since kyuubi pulls up the computing engine spark in a K8s container according to the analysis task after kyuubi-server submits an SQL analysis task through the provided hive jdbc port, the image file of kyuubi needs to be constructed on the basis of the spark image file. After the image file of kyuubi is reconstructed, the image file name of spark and its specific configuration resource information can be added into the containerized resource configuration file of kyuubi. The updated kyuubi image file is executed based on K8s to deploy spark into a plurality of containers. The K8s containerized deployment processes of Flink and Trino are substantially similar to the spark containerized deployment process described above and will not be described in detail herein.
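The "kyuubi image built on the basis of the spark image" step can be sketched as a Dockerfile; every image name and path below is an assumption for illustration only:

```dockerfile
# Hypothetical layering: rebuild the kyuubi image on the spark engine image,
# so kyuubi can pull up spark inside the same container environment.
FROM example-harbor/spark:3.3.1
# ADD auto-extracts a local tarball into the target directory
ADD apache-kyuubi-bin.tgz /opt/
ENV KYUUBI_HOME=/opt/apache-kyuubi-bin
CMD ["sh", "-c", "$KYUUBI_HOME/bin/kyuubi run"]
```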
As shown in fig. 5, a flow chart of kyuubi submitting a spark SQL analysis task is provided. kyuubi and spark are each deployed in the container orchestration platform kubernetes, and after the kyuubi-server container is started, a spark SQL analysis task can be submitted through the hive jdbc port provided by kyuubi-server. At this time, kyuubi-server automatically pulls up a containerized spark cluster in containers provided by K8s according to the kyuubi configuration file, where the spark cluster consists of two services, driver and executor. The driver is responsible for analyzing the spark SQL transmitted by kyuubi-server in the order of logical execution plan, physical execution plan and DAG (directed acyclic graph), and finally submits the analysis result to the executor for concrete execution and computation, so two kinds of containers, a driver pod and an executor pod, exist in a containerized spark cluster. In order to avoid repeatedly pulling up spark clusters, the kyuubi-engine service of kyuubi maintains the session information of the spark clusters: when new SQL is submitted through kyuubi-server, a suitable spark cluster is searched for among the spark sessions maintained by kyuubi-engine to execute the SQL task; if no suitable spark cluster is found, kyuubi is responsible for pulling up a new session, that is, establishing a new spark cluster, and then executing the corresponding SQL computing task. In the technical scheme, the operation modes of the Flink and Trino computing clusters are similar to spark and are not described in detail here.
In an embodiment of the present application, for any one of the compute engines, the compute engine platform is further to: acquiring third configuration information of the metadata configuration management system, and adding the third configuration information into a third image file; the computing engine is redeployed into the plurality of containers based on the updated third image file to establish a connection between the computing engine and the metadata configuration management system.
In this technical solution, the computing engine platform may refer to a service platform integrating various distributed computing engines such as spark, Flink and Trino. When a computing engine receives an SQL analysis task transmitted by the multi-tenant platform kyuubi, the SQL has to be analyzed through the metadata information stored by the metadata configuration management system Hive Metastore to find the storage location information of the corresponding target data. Therefore, each computing engine needs to be integrated with Hive Metastore, so that it can find the corresponding data according to the metadata information of Hive Metastore and load the data into memory for computation. Specifically, for any computing engine, integrating the computing engine with Hive Metastore requires first obtaining third configuration information of Hive Metastore, so that the third configuration information can be added to the image file of the computing engine. Specifically, taking the computing engine spark as an example, during spark containerized deployment the relevant configuration information of Hive Metastore can be written into hive-site.xml, and the hive-site.xml file can be placed into the ${SPARK_HOME}/conf directory of the spark image file.
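The hive-site.xml dropped into ${SPARK_HOME}/conf needs at least the Metastore's thrift address; a sketch with a placeholder service name:

```xml
<configuration>
  <!-- address of the Hive Metastore service deployed in K8s (placeholder) -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hive-metastore.default.svc:9083</value>
  </property>
</configuration>
```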
After the image file of spark is reconfigured, spark can be redeployed into a plurality of containers based on the updated image file to establish a connection between spark and the metadata configuration management system Hive Metastore. Specifically, the spark container is deployed and the spark cluster container service is started; a connection can then be established with the Hive Metastore service through the configuration information in hive-site.xml, and when spark receives an SQL analysis task transmitted by the multi-tenant platform kyuubi, it can read and write metadata through the metadata information stored by Hive Metastore, and search for and load the data. In the present technical solution, the integration processes of Flink and Trino with Hive Metastore are similar, and this step will not be described in detail.
In an embodiment of the present application, for any one of the distributed storage systems, the compute engine platform is further to: adding a dependent package of the unified access platform into a third image file, wherein the dependent package comprises access parameters of a plurality of distributed storage systems; acquiring fourth configuration information of the distributed storage system, adding the first configuration information of the unified access platform into the fourth configuration information, and adding the updated fourth configuration information into a third image file; and redeploying the computing engine into a plurality of containers based on the updated third image file so as to establish connection between the computing engine and the unified access platform.
In the technical scheme, the computing engine platform may refer to a service platform integrating various distributed computing engines such as spark, Flink and Trino, and the various distributed storage systems may refer to minio, Ali Cloud (OSS), Amazon AWS (S3) and HDFS. Each computing engine has already been containerized in K8s and integrated with the metadata configuration management system Hive Metastore; when a computing engine performs SQL task computation, it needs to obtain the metadata corresponding to the SQL through Hive Metastore and parse it, and it also needs to access, through JuiceFS, the target data corresponding to that metadata in each distributed storage system such as minio, Ali Cloud (OSS), Amazon AWS (S3) and HDFS. Therefore, the computing engine container must be able to access JuiceFS and acquire the connection information of each distributed storage system through JuiceFS, so as to establish a connection with each distributed storage system based on that connection information and acquire the target data.
Specifically, for any one of the distributed storage systems, for a computing engine in the computing engine platform to access that distributed storage system, the dependency package of the unified access platform JuiceFS must be added to the image file of the computing engine, wherein the dependency package includes the access parameters of the plurality of distributed storage systems; fourth configuration information of the distributed storage system is then obtained, the first configuration information of the unified access platform JuiceFS is added to the fourth configuration information, and the updated fourth configuration information is added to the image file of the computing engine. The dependency package of the unified access platform JuiceFS may refer to the juicefs-hadoop-1.1.0.jar client program of JuiceFS. Specifically, taking the computing engine spark as an example, when the spark computing container is deployed, the juicefs-hadoop-1.1.0.jar client dependency package can be added under the ${SPARK_HOME}/jars directory of the spark container image. This jar is a Java client provided by JuiceFS that is highly compatible with the HDFS interface, so the various applications in the Hadoop ecosystem, including the various computing engines in the scheme, can use the data stored in JuiceFS without changing code. The jar packages a series of methods including the storage implementation classes, the client access methods, efficient data transmission, and the like. In this technical solution, since much stock history data is stored in the HDFS file system, the computing engine needs to access HDFS data directly, and also needs to access the data in HDFS and in the distributed storage systems minio, AWS (S3) and Ali Cloud (OSS) through JuiceFS. It is therefore necessary to add the configuration files core-site.xml and hdfs-site.xml under the container image directory ${SPARK_HOME}/conf, and to add the relevant configuration information of JuiceFS in core-site.xml, including the mysql metadata storage system information corresponding to each distributed storage system in JuiceFS and the storage implementation class information of JuiceFS. After configuring the image file of spark, spark is redeployed into a plurality of containers based on the updated spark image file to integrate spark with the unified access platform JuiceFS.
After spark is redeployed, the spark container service is started; it can access HDFS data according to the configuration file information, and can access the data in the JuiceFS file system through the juicefs-hadoop Java client. When a spark SQL analysis task transmitted by the kyuubi client is received, spark can access the library table metadata information in Hive Metastore, and then directly obtain the data from the distributed storage system corresponding to that metadata in JuiceFS through the JuiceFS Java client and analyze it. In the technical scheme, the integration processes of Flink and Trino with JuiceFS are basically the same as that of spark, and the description of this step is not repeated.
In an embodiment of the present application, for any one of the distributed storage systems, the unified access platform is further configured to: constructing a metadata storage system corresponding to the distributed storage system; constructing a corresponding storage bucket in the distributed storage system based on the metadata storage system, and setting the access information of the storage bucket; defining a connection script between the storage bucket and the corresponding metadata storage system based on the access information; and executing the connection script to establish a connection between the storage bucket and the metadata storage system, so as to store, into the metadata storage system, the connection information between the storage bucket and the metadata storage system carried in the connection script, and to transmit the metadata of the data stored in the storage bucket to the metadata storage system for storage.
In the technical scheme, the various distributed storage systems may refer to minio, Ali Cloud (OSS), Amazon AWS (S3) and HDFS, and the unified access platform may refer to JuiceFS. Because the unified access platform JuiceFS has rich api interfaces, it can be integrated with each type of distributed storage system to collect the metadata in each of them. Meanwhile, a mysql database can be selected as the metadata repository of JuiceFS, and the mysql databases corresponding to minio, Ali Cloud (OSS), Amazon AWS (S3) and HDFS (jfs_minio_db, jfs_ali_db, jfs_aws_db, jfs_hdfs_db) can be created in mysql. Specifically, for any one of the distributed storage systems, when integrating the unified access platform JuiceFS with that system, a JuiceFS client needs to be deployed first. In the technical scheme, the juicefs client can be deployed on one Linux machine, to facilitate the various subsequent integration operations and instruction input. The specific installation process comprises the following steps: first download the latest client file juicefs-1.1.0-linux-arm64.tar.gz of JuiceFS from github to the Linux /opt path, then decompress the tar.gz file (tar -zxvf "juicefs-1.1.0-linux-arm64.tar.gz"), and finally execute the command sudo install juicefs /usr/local/bin to complete the installation of the JuiceFS client. Meanwhile, the juicefs --help command can be executed on the linux server to verify whether the JuiceFS client has finished installation.
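Creating the four JuiceFS metadata repositories in mysql is plain DDL; the database names are the ones given above:

```sql
-- one metadata database per distributed storage system
CREATE DATABASE jfs_minio_db;
CREATE DATABASE jfs_ali_db;
CREATE DATABASE jfs_aws_db;
CREATE DATABASE jfs_hdfs_db;
```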
After the installation of the JuiceFS client is finished, a metadata storage system mysql corresponding to the distributed storage system is built in JuiceFS, a corresponding storage bucket is built in the distributed storage system based on the metadata storage system mysql, the access information of the storage bucket is set, a connection script between the storage bucket and the corresponding metadata storage system mysql is defined based on the access information, and the connection script is executed to establish the connection between the storage bucket and the metadata storage system mysql. Specifically, in order for JuiceFS to interact with the data on minio, HDFS, Ali Cloud (OSS) and AWS (S3) simultaneously, a file system needs to be created by the JuiceFS client on each of the public cloud and private cloud distributed storage systems. Creating the JuiceFS file system on the various cloud storages mainly opens up the connection and storage correspondence between JuiceFS and each distributed storage system, and maintains the associated connection information in the metadata storage system of JuiceFS. In the technical scheme, mysql can be selected as the metadata storage system of JuiceFS, and the mysql databases corresponding to minio, Ali Cloud, Amazon AWS (S3) and HDFS (jfs_minio_db, jfs_ali_db, jfs_aws_db, jfs_hdfs_db) are created in mysql respectively. Meanwhile, a corresponding bucket for storing data also needs to be created in each of the different distributed storage systems, with the login account and password of the bucket and the read-write permission information of the data in the bucket set, so that data from different areas can subsequently be written into the public cloud object storage buckets of designated areas, and data analysis can then read the data from the corresponding buckets of the cloud storage.
After the creation of the storage buckets on the public clouds or private cloud is completed, the connection scripts are executed through the JuiceFS client, and the connection information between JuiceFS and the bucket of each storage system, together with other metadata information, is written into the corresponding mysql library tables. Specifically, each execution script defines the connection address of the distributed storage system, the login user name, the password and the name of the bucket, as well as the metadata base connection information of JuiceFS corresponding to each distributed storage system created in the mysql library. After the execution scripts of the connection information corresponding to minio, AWS, HDFS and Ali Cloud (OSS) are executed on the linux server, JuiceFS establishes a direct connection relationship with each of the distributed storage systems. Meanwhile, the connection information between JuiceFS and the storage bucket in each distributed storage system and other metadata information are written into the corresponding mysql library tables, and whenever subsequent operations with a distributed storage system occur, the connection information and correspondence information can be looked up from the mysql metadata library, so that the connection with the distributed storage system is established based on that connection information.
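One such connection script might look like the following sketch for minio, where juicefs format is the client command that creates a JuiceFS file system over a bucket; all addresses, credentials and names are placeholders:

```shell
#!/bin/sh
# Hypothetical connect-minio.sh: bind a minio bucket to the jfs_minio_db
# metadata database; running it writes the connection info into mysql.
juicefs format \
  --storage minio \
  --bucket http://minio-host:9000/jfs-minio-bucket \
  --access-key minio-user \
  --secret-key minio-password \
  "mysql://user:password@(mysql-host:3306)/jfs_minio_db" \
  jfs-minio
```

The analogous scripts for AWS (S3), Ali Cloud (OSS) and HDFS differ only in the --storage type, the bucket address and the metadata database URL.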
In an embodiment of the present application, the metadata configuration management system is further configured to: constructing a fusion analysis data warehouse for the plurality of distributed storage systems, wherein the fusion analysis data warehouse at least comprises an original layer, a standard layer, an integration layer and an application layer; constructing, in the original layer, a table-format metadata repository corresponding to the storage buckets in the plurality of distributed storage systems; determining the table-building script of each distributed storage system; and, for the table-building script of any one distributed storage system, after the table-building script is executed through the multi-tenant platform, the metadata configuration management system acquires the metadata in that distributed storage system from the unified access platform and stores the metadata into the table-format metadata repository for table-format storage.
In the technical scheme, the metadata configuration management system may refer to Hive Metastore, and the various distributed storage systems may refer to minio, Ali Cloud (OSS), Amazon AWS (S3) and HDFS.
At this point, the integration of the distributed file system JuiceFS with the object storage systems in the hybrid cloud has been opened up and containerized deployment has been completed, and SQL can be submitted to the overall system through kyuubi. The following mainly describes how fusion analysis of the data in the different distributed systems HDFS, minio, Ali Cloud (OSS) and AWS (S3) is realized.
In the technical scheme, Hive Metastore is adopted as the unified table-format storage system for metadata, so the fusion analysis process in Hive Metastore can be designed based on the layering theory of Hive data warehouses. Specifically, the fusion analysis data warehouse for the various distributed storage systems is constructed according to the layering concept of the odl original layer, sdl standard layer, idl integration layer and adl application layer, which reduces repeated development to the greatest extent and improves development efficiency. The original data stored in HDFS, minio, Ali Cloud (OSS) and AWS (S3) belongs to the original layer odl. First, a table-format metadata repository (hive database) corresponding to the data of each kind of distributed storage system is created in the original layer odl through spark-sql, and the table-building script of each distributed storage system is determined. Each table-building script may define a partition field (Partition, generally partitioned by time), define the storage format of the data, which can be specified according to the actual storage format of the data (text, parquet, orc, avro, etc.), and define the storage path (Location) of the data in the designated minio. After the table-building script of each distributed storage system has been defined, for the table-building script of any one distributed storage system, after the script is executed by the multi-tenant platform kyuubi, the metadata configuration management system acquires the metadata of that distributed storage system from the unified access platform JuiceFS and stores it into the table-format metadata repository Hive Metastore for table-format storage.
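A table-building script of the kind described might be sketched as follows; the database, table, columns and path are hypothetical, and jfs:// is the JuiceFS path scheme:

```sql
-- hypothetical odl-layer table over data stored in minio via JuiceFS
CREATE DATABASE IF NOT EXISTS odl_minio;

CREATE EXTERNAL TABLE IF NOT EXISTS odl_minio.device_events (
  device_id STRING,
  payload   STRING
)
PARTITIONED BY (dt STRING)                      -- partition field, by time
STORED AS PARQUET                               -- matches the actual storage format
LOCATION 'jfs://jfs-minio/odl/device_events';   -- storage path in minio
```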
Specifically, by executing the table-creation statement through Kyuubi, the data corresponding to MinIO is organized in a table format after execution completes, and the metadata and library-table information of the data are stored in the Hive Metastore. Then Spark SQL can be executed through the Kyuubi client to query the data tables in all the object storage systems (SHOW TABLES), and the data in any one of the distributed storage systems can be viewed. Meanwhile, since Spark natively integrates with Hive to operate on data in HDFS, the data in the original Hive data warehouse can also be queried and operated on directly through the Kyuubi client, so that the data in all the different distributed storage systems can be uniformly organized in table format and processed centrally without any data migration or synchronization.
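The query flow above can be illustrated with SQL submitted through the Kyuubi JDBC endpoint (for example via beeline). The table and database names below are hypothetical, not from the patent text:

```sql
-- Hedged sketch of queries submitted through the Kyuubi client.
SHOW TABLES IN odl_minio;        -- list tables backed by the MinIO bucket
SELECT * FROM odl_minio.device_events
WHERE dt = '2024-02-06' LIMIT 10;

-- Cross-system analysis: join object-storage data with a table in the
-- original Hive/HDFS warehouse, with no migration or synchronization.
SELECT d.device_id, d.model, e.payload
FROM hive_dw.device_dim d
JOIN odl_minio.device_events e ON d.device_id = e.device_id;
```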
In an embodiment of the present application, the computing engine platform is further configured to: for any computing engine, after the computing engine determines the target data corresponding to a computing task based on the storage table of metadata in the metadata configuration management system, access the target distributed storage system corresponding to the target data based on the unified access platform, and acquire the target data from the target distributed storage system; cache the target data based on the original layer of the fusion-analysis data warehouse, and sequentially compute the target data based on the standard layer, the integration layer and the application layer of the fusion-analysis data warehouse.
In this technical solution, the computing engine platform may refer to a service platform integrating various distributed computing engines such as Spark, Flink and Trino. For any computing engine, after the computing engine determines the target data corresponding to a computing task based on the storage table of metadata in the metadata configuration management system (Hive Metastore), the target distributed storage system corresponding to the target data is accessed through the unified access platform JuiceFS, and the target data is acquired from it. As described above, the fusion-analysis data warehouse for the various distributed storage systems has already been constructed in the Hive Metastore, so after the computing engine obtains the target data from the corresponding distributed storage system, the target data may be cached in the original layer odl of the warehouse and computed sequentially in the standard layer sdl, the integration layer idl and the application layer adl. Specifically, Spark/Flink/Trino SQL that operates on the data in all the different distributed storage systems of the hybrid cloud is submitted through the Kyuubi unified SQL client; then, according to business logic, association and fusion analysis can be carried out on different tables in different databases in the Hive Metastore, with layer-by-layer processing in the order odl (original layer), sdl (standard layer), idl (integration layer) and adl (application layer). In this way, without any data ETL synchronization or migration, the timeliness of joint data analysis is greatly improved and analysis efficiency is raised.
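The layer-by-layer odl → sdl → idl → adl processing can be sketched as a chain of SQL statements. This is a hedged illustration under assumed schemas: every database, table and column name is hypothetical.

```sql
-- odl -> sdl: standardize raw records from the original layer
INSERT OVERWRITE TABLE sdl.device_events_std PARTITION (dt = '2024-02-06')
SELECT device_id, get_json_object(payload, '$.temp') AS temp
FROM odl_minio.device_events
WHERE dt = '2024-02-06';

-- sdl -> idl: integrate standardized data from different source systems
INSERT OVERWRITE TABLE idl.device_events_all PARTITION (dt = '2024-02-06')
SELECT device_id, temp FROM sdl.device_events_std WHERE dt = '2024-02-06'
UNION ALL
SELECT device_id, temp FROM sdl.oss_events_std   WHERE dt = '2024-02-06';

-- idl -> adl: application-facing aggregate
INSERT OVERWRITE TABLE adl.device_daily_stats PARTITION (dt = '2024-02-06')
SELECT device_id, COUNT(*) AS event_cnt, AVG(temp) AS avg_temp
FROM idl.device_events_all
WHERE dt = '2024-02-06'
GROUP BY device_id;
```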
Meanwhile, without moving data, metadata in the different distributed storage systems is organized in a centralized form by combining the Hive Metastore with JuiceFS; data of different structures (parquet, ORC, text, OSS, S3, BSS and the like) in JuiceFS is fused by the containerized distributed computing engines, and fusion analysis can be carried out directly on the data in Hive together with the data in the different storage objects of the JuiceFS file system, thereby realizing efficient analysis of cross-source heterogeneous data.
According to the technical scheme, through innovative process design, containerized deployment of the unified SQL client service Kyuubi, the distributed computing engines Spark/Flink/Trino and the metadata system hive-metastore is achieved. By introducing the distributed file system JuiceFS, integration of JuiceFS with the object storage systems of different cloud vendors is realized: the different object storage systems in the hybrid cloud serve as the storage of JuiceFS, while MySQL stores the metadata information of JuiceFS and of the buckets in the various object stores. Based on container cloud technology, the integration of hive-metastore and JuiceFS is realized, and the Hive Metastore is creatively used to manage the metadata of the data managed by JuiceFS, so that the data in the JuiceFS distributed system can be organized in the form of Hive tables. Because most SQL computing engines support Hive well, SQL structured processing of JuiceFS by Spark/Flink/Trino and other distributed computing engines is indirectly realized, so that the Spark/Flink/Trino computing engines can read and operate on the data in JuiceFS according to the metadata information in Hive.
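Using an object store as JuiceFS storage with MySQL as the metadata engine can be sketched with the `juicefs format` CLI. This is a hedged example: endpoints, bucket names, credentials and the volume name are placeholders, and the same pattern is repeated per cloud vendor.

```shell
# Hedged sketch: create one JuiceFS volume backed by a MinIO bucket, with
# MySQL storing the volume's metadata. All values below are placeholders.
juicefs format \
  --storage minio \
  --bucket http://minio.example.com:9000/jfs-bucket \
  --access-key "$MINIO_AK" --secret-key "$MINIO_SK" \
  "mysql://jfs:password@(mysql.example.com:3306)/juicefs_meta" \
  myjfs-minio

# The same command shape applies to other object stores in the hybrid cloud,
# e.g. --storage oss (Alibaba Cloud) or --storage s3 (AWS), each with its
# own bucket endpoint and credentials.
```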
According to the scheme, the juicefs-hadoop Java client and the JuiceFS-related configuration information are introduced into the running container services of the Spark/Flink/Trino distributed computing engines, finally enabling the Spark/Flink/Trino engines to read and write data in the JuiceFS distributed file system. Based on the unified metadata management of the Hive Metastore, the unified storage management of JuiceFS and the unified SQL client function of Kyuubi, the unified computing capability of the Spark/Flink/Trino engines is finally realized: fusion analysis of the data in the different distributed object stores of the hybrid cloud, unified SQL (structured query language) processing, and direct association analysis between data in object storage and data in the original Hive data warehouse. By implementing this technical scheme, the workload of data migration and synchronization under the existing hybrid cloud deployment architecture is greatly reduced, achieving cost reduction and efficiency gains; unified management and analysis of data is realized without data migration or a single unified data store; the latency of data analysis is greatly reduced, data analysis is more efficient and markedly more real-time, the cost of data analysis is significantly lowered, and the efficiency of both cross-source heterogeneous data analysis and data development is significantly improved.
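Wiring the juicefs-hadoop client into an engine container typically amounts to placing the client JAR on the classpath and setting a few Hadoop properties. The fragment below is a hedged sketch for Spark (`spark-defaults.conf`); the metadata URL is a placeholder, while the `fs.jfs.impl` class names follow the JuiceFS Hadoop SDK convention.

```properties
# Hedged sketch: JuiceFS configuration for a containerized Spark engine.
# The juicefs-hadoop JAR must already be in the container's jars directory.
spark.hadoop.fs.jfs.impl                     io.juicefs.JuiceFileSystem
spark.hadoop.fs.AbstractFileSystem.jfs.impl  io.juicefs.JuiceFS
# Metadata engine URL (placeholder) matching the volume created with `juicefs format`
spark.hadoop.juicefs.meta                    mysql://jfs:password@(mysql.example.com:3306)/juicefs_meta
```

With this in place, `jfs://` paths referenced by Hive table locations resolve inside the engine container, which is what lets Spark/Flink/Trino read and write JuiceFS-managed data via the metadata in Hive.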
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises that element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (10)

1. A processing system for heterogeneous big data, the processing system comprising:
a plurality of distributed storage systems, used for storing data and metadata of each piece of data, wherein the data stored in any two different types of distributed storage systems are heterogeneous data;
The unified access platform comprises metadata storage systems corresponding to each distributed storage system, and each metadata storage system is used for acquiring and storing metadata in the corresponding distributed storage system;
the metadata configuration management system is connected with the unified access platform and the computing engine platform, and is used for acquiring metadata of the various distributed storage systems from the unified access platform and organizing and storing all the metadata in table format;
the computing engine platform is connected with the unified access platform and the metadata configuration management system, the computing engine platform comprises a plurality of computing engines, the computing engine platform provides corresponding target computing engines according to computing tasks under the condition that the computing tasks are received, the target computing engines analyze the computing tasks, determine target data corresponding to the computing tasks based on a storage table of metadata in the metadata configuration management system, and access a target distributed storage system corresponding to the target data based on the unified access platform so as to acquire the target data from the target distributed storage system for computing.
2. The processing system for heterogeneous big data of claim 1, wherein the processing system further comprises:
the multi-tenant platform is connected with the computing engine platform and is used for providing an SQL query interface, acquiring an SQL query statement submitted by an SQL client based on the query interface, converting the SQL query statement into a computing task, and sending the computing task to the computing engine platform;
a container orchestration platform for providing a plurality of containers for the multi-tenant platform and the compute engine platform.
3. The processing system for heterogeneous big data of claim 2, wherein the metadata configuration management system is further configured to:
acquiring an installation package of the metadata configuration management system, and decompressing the installation package to obtain a corresponding decompressed package;
acquiring a dependency package of the unified access platform, and adding the dependency package into a dependency library catalog of the decompression package;
constructing a first image file for the metadata configuration management system based on the container orchestration platform;
acquiring a first containerized resource configuration file of the metadata configuration management system and first configuration information of the unified access platform, and adding the first configuration information into the first containerized resource configuration file, wherein the first configuration information comprises a plurality of storage implementation classes and metadata engine addresses of each metadata storage system;
And based on a container arrangement platform, sequentially executing the first containerized resource configuration file and the first mirror image file to deploy the metadata configuration management system into the containers, and starting the metadata configuration management system to establish connection between the metadata configuration management system and the unified access platform.
4. The processing system for heterogeneous big data of claim 3, wherein the multi-tenant platform is further configured to:
constructing a second image file of the multi-tenant platform based on the container orchestration platform;
acquiring a second containerized resource configuration file of the multi-tenant platform, and setting environment variables, external service ports, operation modes, mirror file names and second configuration information required by the operation of the multi-tenant platform in the second containerized resource configuration file;
and based on the container orchestration platform, sequentially executing the second containerized resource configuration file and the second image file to deploy the multi-tenant platform into the plurality of containers.
5. The processing system for heterogeneous big data of claim 4, wherein for any one of the compute engines, the compute engine platform is further configured to:
Constructing a third image file of the computing engine based on the container orchestration platform;
reconstructing a second image file of the multi-tenant platform based on the third image file;
and executing the updated second image file based on the container orchestration platform to deploy the computing engine into the plurality of containers.
6. The processing system for heterogeneous big data of claim 5, wherein for any one computing engine, the computing engine platform is further configured to:
acquiring third configuration information of the metadata configuration management system, and adding the third configuration information into the third image file;
and redeploying the computing engine into the plurality of containers based on the updated third image file so as to establish connection between the computing engine and the metadata configuration management system.
7. The processing system for heterogeneous big data of claim 6, wherein for any one of the distributed storage systems, the compute engine platform is further configured to:
adding a dependent package of the unified access platform to the third image file, wherein the dependent package comprises access parameters of various distributed storage systems;
Acquiring fourth configuration information of the distributed storage system, adding the first configuration information of the unified access platform into the fourth configuration information, and adding the updated fourth configuration information into the third image file;
and redeploying the computing engine into the plurality of containers based on the updated third image file so as to establish connection between the computing engine and the unified access platform.
8. The processing system for heterogeneous big data of claim 1, wherein for any one of the distributed storage systems, the unified access platform is further configured to:
constructing a metadata storage system corresponding to the distributed storage system;
constructing a corresponding storage bucket in the distributed storage system based on the metadata storage system, and setting access information of the storage bucket;
defining a connection script between the bucket and a corresponding metadata storage system based on the access information;
executing the connection script to establish connection between the storage bucket and the metadata storage system, so as to store the connection information of the storage bucket and the metadata storage system carried in the connection script to the metadata storage system, and transmitting metadata of data stored in the storage bucket to the metadata storage system for storage.
9. The processing system for heterogeneous big data of claim 8, wherein the metadata configuration management system is further configured to:
constructing a fusion-analysis data warehouse for the plurality of distributed storage systems, wherein the fusion-analysis data warehouse at least comprises an original layer, a standard layer, an integration layer and an application layer;
constructing a table format metadata repository corresponding to a storage bucket in a plurality of distributed storage systems in the original layer;
determining a table-creation script of each distributed storage system;
and for the table-creation script of any one of the distributed storage systems, after the table-creation script is executed by the multi-tenant platform, the metadata configuration management system acquires metadata in the distributed storage system from the unified access platform and stores the metadata into the table-format metadata repository for table-format storage.
10. The processing system for heterogeneous big data of claim 9, wherein the compute engine platform is further configured to:
for any computing engine, after the computing engine determines target data corresponding to the computing task based on a storage table of metadata in the metadata configuration management system, accessing a target distributed storage system corresponding to the target data based on the unified access platform, and acquiring the target data from the target distributed storage system;
and caching the target data based on the original layer of the fusion-analysis data warehouse, and sequentially computing the target data based on the standard layer, the integration layer and the application layer of the fusion-analysis data warehouse.
CN202410168476.XA 2024-02-06 2024-02-06 Processing system for heterogeneous big data Active CN117743470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410168476.XA CN117743470B (en) 2024-02-06 2024-02-06 Processing system for heterogeneous big data

Publications (2)

Publication Number Publication Date
CN117743470A true CN117743470A (en) 2024-03-22
CN117743470B CN117743470B (en) 2024-05-07

Family

ID=90253034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410168476.XA Active CN117743470B (en) 2024-02-06 2024-02-06 Processing system for heterogeneous big data

Country Status (1)

Country Link
CN (1) CN117743470B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106062790A (en) * 2014-02-24 2016-10-26 微软技术许可有限责任公司 Unified presentation of contextually connected information to improve user efficiency and interaction performance
US20200125542A1 (en) * 2018-10-17 2020-04-23 Oracle International Corporation Dynamic Database Schema Allocation on Tenant Onboarding for a Multi-Tenant Identity Cloud Service
CN112817997A (en) * 2021-02-24 2021-05-18 广州市品高软件股份有限公司 Method and device for accessing S3 object storage by using dynamic user through distributed computing engine
CN113704178A (en) * 2021-09-18 2021-11-26 京东方科技集团股份有限公司 Big data management method, system, electronic device and storage medium
CN115248799A (en) * 2021-12-17 2022-10-28 朱辉 Large data warehouse multi-tenant management system and method
CN115840731A (en) * 2022-12-09 2023-03-24 阿里巴巴(中国)有限公司 File processing method, computing device and computer storage medium
CN116028463A (en) * 2022-12-28 2023-04-28 中科云谷科技有限公司 Method for constructing large data platform with separated storage and calculation
CN116189330A (en) * 2022-12-01 2023-05-30 中联重科股份有限公司 Processing method, storage medium and processor for working condition data of engineering vehicle
CN116860746A (en) * 2023-06-15 2023-10-10 中科云谷科技有限公司 Processing system for lightweight big data
CN117009038A (en) * 2023-10-07 2023-11-07 之江实验室 Graph computing platform based on cloud native technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHERRY: "Storage technology service provider Juicedata receives several million yuan in angel-round financing, led by China Growth Capital (华创资本)", Technology and Finance (科技与金融), no. 11, 9 November 2018 (2018-11-09) *
JUICEDATA: "Storage-compute separation in practice: building a lightweight, cloud-neutral big data platform", pages 1 - 7, Retrieved from the Internet <URL:https://cloud.tencent.com/developer/article/2313473> *

Also Published As

Publication number Publication date
CN117743470B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
Sun et al. An open IoT framework based on microservices architecture
CN107450961B (en) Distributed deep learning system based on Docker container and construction method and working method thereof
US10338958B1 (en) Stream adapter for batch-oriented processing frameworks
Padhy Big data processing with Hadoop-MapReduce in cloud systems
US10255286B2 (en) File metadata handler for storage and parallel processing of files in a distributed file system, and associated systems and methods
CN111538590A (en) Distributed data acquisition method and system based on CS framework
CN111274223A (en) One-key deployment big data and deep learning container cloud platform and construction method thereof
CN109643305A (en) Data query method, application and database server, middleware and system
CN116860746A (en) Processing system for lightweight big data
CN116069778A (en) Metadata management method, related device, equipment and storage medium
CN113127526A (en) Distributed data storage and retrieval system based on Kubernetes
CN112667747B (en) Dynamic configuration multi-database distributed persistence method supporting user-defined plug-in
CN113032356A (en) Cabin distributed file storage system and implementation method
CN113721856A (en) Digital community management data storage system
CN117743470B (en) Processing system for heterogeneous big data
Wrzeszcz et al. Towards trasparent data access with context awareness
Eyzenakh et al. High performance distributed web-scraper
Singh Cluster-level logging of containers with containers: Logging challenges of container-based cloud deployments
Zburivsky Hadoop cluster deployment
CN114969199A (en) Method, device and system for processing remote sensing data and storage medium
US11354312B2 (en) Access-plan-based querying for federated database-management systems
CN117743471B (en) Processing system for data of Internet of things
Branco et al. Managing very-large distributed datasets
Lyon et al. Taking Global Scale Data Handling to the Fermilab Intensity Frontier
Sudheer et al. Performance evaluation of Hadoop distributed file system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant