CN118245044A - Heterogeneous data source access method and device for a decoupled compute engine - Google Patents


Info

Publication number
CN118245044A
Authority
CN
China
Prior art keywords
data
partition
interface
reading
writing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311706616.6A
Other languages
Chinese (zh)
Inventor
周朝卫
邱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unihub China Information Technology Co Ltd
Original Assignee
Unihub China Information Technology Co Ltd
Filing date
Publication date
Application filed by Unihub China Information Technology Co Ltd filed Critical Unihub China Information Technology Co Ltd
Publication of CN118245044A

Abstract

The invention discloses a method and device for accessing heterogeneous data sources with a decoupled compute engine, wherein the method comprises the following steps: abstracting a unified data row, i.e. abstracting the data row type model, the data row model, and the data types, thereby realizing unified access and storage for various heterogeneous data sources, with extension capability in both structure and function; reading and writing data through a plug-in mechanism, in which each kind of data source implements its own read and write plug-ins and the plug-in implementation specification is defined through Java parent classes; and adapting the compute engines, whereby each engine defines a dedicated data source through its own data read/write definition interfaces. The method and device process data source queries and writes through a unified API, decoupling the dependency on any particular compute engine, supporting multiple compute engines, and greatly reducing development and maintenance cost.

Description

Heterogeneous data source access method and device for a decoupled compute engine
Technical Field
The invention relates to the field of data sources, and in particular to a method and device for accessing heterogeneous data sources with a decoupled compute engine.
Background
In a production system there are a large number of heterogeneous data sources, such as MySQL tables, Kafka topics, Elasticsearch indices, the nodes and edges of graph databases, Hive tables, object-store files, and the like. At the same time, enterprises typically operate multiple compute engines in parallel, such as Spark, Flink, and Presto.
This situation brings several pain points:
Code redundancy and maintenance difficulty: to access different data sources, a data source plug-in must be developed separately for each compute engine. For example, to process the vertex/edge data of a graph database, read and write plug-ins must be developed for Spark, Flink, and Presto, resulting in three copies of essentially the same code. Such repeated development and maintenance not only increases the workload but also increases the likelihood of errors.
Parameter inconsistency: because each compute engine has its own data source plug-in, each plug-in may have different parameters and configurations. Accessing the same data source therefore requires different parameter settings per engine, which is a burden for developers and increases the complexity of use and maintenance.
High cost of use: because a data source plug-in must be developed and maintained for each compute engine, enterprises must devote significant time and resources. In addition, because of the redundancy and maintenance difficulty of the plug-in code, adding a new data source or replacing a compute engine again requires repeated development and debugging.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method and device for accessing heterogeneous data sources with a decoupled compute engine, which process data source queries and writes through a unified API, thereby decoupling the dependency on any particular compute engine, supporting multiple compute engines, and greatly reducing development and maintenance cost.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In an embodiment of the present invention, a method for accessing heterogeneous data sources with a decoupled compute engine is provided, the method comprising:
S01, abstracting a unified data row: abstracting the data row type model, the data row model, and the data types, thereby realizing unified access and storage for various heterogeneous data sources, with extension capability in both structure and function.
Further, S01 includes:
S011, abstracting the data row type model, which describes the fields in a data row and their attributes; a field's attributes include its name, data type, and whether null values are allowed, equivalent to the definition of a table structure;
S012, the data row model, which defines the information of one row of data and is used for exchanging and transferring data while reading from and writing to a data source.
Further, the data row model includes:
A row type definition: the field information of the row data is defined through the data row type model;
An object array: stores each field value of the record, with each position corresponding to one field;
A field access index: the object array is accessed directly by index rather than through Java reflection, which gives higher performance;
A field name mapping: ensures that the field names in the row type model correspond to positions in the object array, supporting flexible access by name;
A serializer: establishes the association between the data row type model and the data row, and is responsible for serialization and deserialization of the data row for efficient data transfer;
Metadata: records auxiliary information describing the data row, such as its source, length, and version;
A persistence processor: supports persisting the data row into a file, a database table, or the like, implemented by converting the data row type.
S013, abstracting unified field data types, where a field's data type represents the type of one field in the row type; a mapping is established between each data source's native data types and the unified data types, so that data access and storage can be realized through the unified API; the mapping is established by each data source's read and write plug-ins, and when a data source is read, its native data types are converted into the unified data types.
Further, the data types include:
Basic data types: integer, string, boolean, and the like;
Composite data type: a data type composed of several basic data types, for example a composite type consisting of an integer and a string;
Array type: defines the data type of an array, where the data type of its elements is defined through the basic data types;
Map data type: defines a data type of key-value relationships, where the data types of the key and the value are defined separately.
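The row abstraction described above can be sketched in Java as follows. This is a minimal illustration only; the class and member names (`FieldType`, `RowType`, `Row`) are assumptions for exposition, not the patent's actual identifiers.

```java
import java.util.HashMap;
import java.util.Map;

// Unified field data types (a subset; composite/array/map omitted for brevity).
enum FieldType { INTEGER, STRING, BOOLEAN, ARRAY, MAP }

// Data row type model: each field's name, type, and nullability,
// analogous to a table-structure definition.
final class RowType {
    final String[] names;
    final FieldType[] types;
    final boolean[] nullable;
    private final Map<String, Integer> index = new HashMap<>();

    RowType(String[] names, FieldType[] types, boolean[] nullable) {
        this.names = names; this.types = types; this.nullable = nullable;
        for (int i = 0; i < names.length; i++) index.put(names[i], i);  // field-name mapping
    }
    int indexOf(String field) { return index.get(field); }
}

// Data row model: an object array holds the field values; access goes
// through the positional index rather than Java reflection, keeping reads cheap.
final class Row {
    final RowType type;
    final Object[] values;
    Row(RowType type, Object[] values) { this.type = type; this.values = values; }
    Object get(int pos) { return values[pos]; }
    Object get(String field) { return values[type.indexOf(field)]; }
}
```

A row is then created against a row type and read either by position or by name, mirroring the positional-array plus name-mapping design above.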
S02, reading and writing data based on a plug-in mechanism, which can be implemented in various ways; data sources such as MySQL and Elasticsearch each implement plug-ins for reading and writing that source, and the plug-in implementation specification is defined through Java parent classes.
Further, the S02 includes:
S021, the data source read standard interface, which defines the metadata and basic behavior of a data source; the data source read standard interface comprises: a data source read definition interface, a data source dynamic partition definition interface, and a partition data read interface;
The data source read definition interface is responsible for describing the data source as a whole and for creating the data source dynamic partition definition interface and the partition data read interface;
The data source dynamic partition definition interface is responsible for continuously producing a partition object describing each partition; the partition objects are distributed by the scheduling framework to the subtasks executed by different tasks;
The partition data read interface is called by a subtask to read the data of the partition object assigned to it; it reads and processes the real data records.
Data sources such as MySQL and Elasticsearch each implement these three interfaces, forming a complete path from a data source to a parallel-processed dataset in the unified data-row format and improving the abstraction and generality of data source reads.
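The three read-side interfaces can be sketched in Java as below, together with a toy in-memory source showing how they cooperate. All type names (`SourceReadDefinition`, `PartitionEnumerator`, `PartitionReader`, `InputPartition`, `ListSource`) are illustrative assumptions, and the unified data row is simplified to `Object[]`.

```java
import java.util.ArrayList;
import java.util.List;

// Partition object handed from the enumerator to a subtask.
final class InputPartition {
    final int id;
    InputPartition(int id) { this.id = id; }
}

// Data source read definition interface: describes the source as a whole
// and creates the other two interfaces.
abstract class SourceReadDefinition {
    abstract PartitionEnumerator createEnumerator();
    abstract PartitionReader createReader(InputPartition partition);
}

// Data source dynamic partition definition interface: keeps producing
// partition objects until none remain.
interface PartitionEnumerator {
    InputPartition next();   // null when no further partition exists
}

// Partition data read interface: called by a subtask to read one
// partition's records in the unified data-row format.
interface PartitionReader {
    List<Object[]> read();
}

// Toy source that splits an in-memory row list into a fixed number of
// partitions, so each partition can be read by a concurrent subtask.
final class ListSource extends SourceReadDefinition {
    private final List<Object[]> rows;
    private final int partitions;
    ListSource(List<Object[]> rows, int partitions) { this.rows = rows; this.partitions = partitions; }
    PartitionEnumerator createEnumerator() {
        return new PartitionEnumerator() {
            int next = 0;
            public InputPartition next() { return next < partitions ? new InputPartition(next++) : null; }
        };
    }
    PartitionReader createReader(InputPartition p) {
        return () -> {
            List<Object[]> out = new ArrayList<>();
            for (int i = p.id; i < rows.size(); i += partitions) out.add(rows.get(i));
            return out;
        };
    }
}
```

A real MySQL or Elasticsearch plug-in would implement the same three pieces against its own connection and query logic.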
Further, the data source read definition interface describes the metadata and basic behavior of a data source.
Further, the functions of the data source read definition interface include:
Defining the connection parameters of a data source: providing connection addresses, configuration, and similar information, for example Kafka brokers or a Hive address;
Providing a data read interface: reading data from the data source and converting it into the unified data-row format;
Defining the data structure: defining the field structure of the data source's rows based on the data row type model;
Managing partitions: partitioning the accessed data source data, thereby enabling parallel processing;
Unifying the APIs of different data sources: records produced by different data sources are all represented as unified data rows, so a uniform API (map, filter, and the like) can be provided for processing records regardless of where the data comes from.
Further, the data source dynamic partition definition interface dynamically generates the partition rules for partitioning a data source, uniformly manages the partition rules of different data sources (such as Kafka and HDFS), and provides a consistent partition interface for upper-layer applications;
The partitions of a data source are generated dynamically, which amounts to splitting the data source to be read into several partitions; each partition corresponds to one concurrent task, so distributed parallel processing can be realized.
Further, the functions of the data source dynamic partition definition interface include:
Defining a partition enumeration interface: providing a next() method that returns partition objects in sequence;
Providing a partition description: summary information describing each partition, such as the partition identifier and path range;
Supporting dynamic partitioning: the partition set may not be fixed, so dynamically acquiring newly added partitions must be supported;
Concurrency safety: partition access under multithreading must be thread safe;
Data source transparency: the underlying data source is hidden behind the partition enumeration interface, so callers need not be concerned with its internal partition rules.
Further, the partition data read interface encapsulates the differing read details of different data sources and provides a uniform, high-performance data read service for upper-layer applications; it performs the real reading and processing of each partition object's data records in the source, with each partition read producing a list of output data whose element type is the unified data-row format; the partition data read interface is implemented per specific data source, and different sources have different implementations.
Further, the functions of the partition data read interface include:
Providing a data read interface: reading records out of the data source in parallel and packaging them in the unified data-row format for return;
Masking data source differences: providing a unified read API regardless of the form of the data source;
Concurrency control: supporting multi-threaded concurrent reads to ensure read performance;
Fault-tolerance mechanism: on a source exception the read can be retried, or on repeated failure the records are diverted to an exception output.
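The fault-tolerance behaviour listed above (retry, then divert to an exception output) can be sketched as a small wrapper around any partition read. This is a hedged illustration of the idea only; the method and class names are assumptions, not the patent's API.

```java
import java.util.List;
import java.util.function.Supplier;

// Retry a read up to maxAttempts times; if every attempt fails,
// record the failure on an exception output and return null so the
// caller knows the partition was diverted rather than read.
final class RetryingRead {
    static <T> T retryingRead(Supplier<T> read, int maxAttempts, List<String> errorOutput) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.get();
            } catch (RuntimeException e) {
                last = e;   // remember the failure and try again
            }
        }
        errorOutput.add("read failed after " + maxAttempts + " attempts: "
                + (last == null ? "no attempt made" : last.getMessage()));
        return null;
    }
}
```

The same wrapper shape applies on the write side, where a failed partition write is retried or diverted in the same way.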
S022, the data storage standard interface, which receives data and writes it into various data stores; the data storage standard interface comprises: a data source storage definition interface and a partition data write interface;
The data source storage definition interface is responsible for describing the data store as a whole and for creating the partition data write interface;
The partition data write interface receives the data of a partition object; each partition object is assigned by the scheduling framework to a subtask executed by a task, and the subtask calls the partition data write interface to write the partition object's data into the target store.
Further, the data source storage definition interface describes the metadata and basic behavior of the data store, specifies how to connect to and operate the stores of various data sources, and provides a general standard interface specification for customizing the storage of various data sources.
Further, the specific flow of the data source storage definition interface includes:
Setting the field structure definition of the received data rows, including the field names and the data type of each field;
Passing the index number of the subtask corresponding to the data partition object to the partition data write interface, so that several partitions can be written in parallel;
Creating a partition data write interface.
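The write-side flow above can be sketched in Java as follows, with an in-memory sink standing in for a real target store. The type names (`SinkDefinition`, `PartitionWriter`, `ListSink`) are illustrative assumptions, and the unified data row is again simplified to `Object[]`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Data source storage definition interface: records the field structure of
// incoming rows and creates one writer per subtask index.
abstract class SinkDefinition {
    String[] fieldNames;   // field structure definition of the received rows
    void setSchema(String[] fieldNames) { this.fieldNames = fieldNames; }
    abstract PartitionWriter createWriter(int subtaskIndex);
}

// Partition data write interface: receives one partition object's rows
// in the unified data-row format and writes them to the target store.
interface PartitionWriter {
    void write(List<Object[]> rows);
}

// Toy sink: each subtask writes into its own bucket, mimicking parallel
// partition writes into a target store keyed by subtask index.
final class ListSink extends SinkDefinition {
    final Map<Integer, List<Object[]>> store = new HashMap<>();
    PartitionWriter createWriter(int subtaskIndex) {
        List<Object[]> bucket = store.computeIfAbsent(subtaskIndex, k -> new ArrayList<>());
        return bucket::addAll;
    }
}
```

A real MySQL or Elasticsearch sink would replace the map with batched inserts or bulk requests while keeping the same two-interface shape.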
Further, the partition data write interface encapsulates the differing write details of different data sources and provides a uniform, high-performance data write service for upper-layer applications; it performs the real writing of each partition object's data records, where the input for each partition is a list of the partition object's data whose element type is the unified data-row format; the partition data write interface is implemented per specific data source, and different sources have different implementations.
Further, the functions of the partition data write interface include:
Providing a data write interface: receiving the data of a partition object in the unified data-row format and, according to the type of data source, performing data type conversion and writing the data into the target store;
Masking data source differences: providing a unified write API regardless of the form of the data source;
Concurrency control: supporting multi-threaded concurrent writes to ensure write performance;
Fault-tolerance mechanism: on a data write failure the write can be retried, or on repeated failure the records are diverted to an exception output.
S03, compute engine adaptation: engines such as Spark, Flink, and Presto each define a dedicated data source through their own data read/write definition interfaces.
Further, S03 includes:
S031, when reading data, converting the data accessed through the data source read standard interface into the format of the corresponding compute engine; the accessed data consists of several data partition objects, so distributed data access can be realized;
S032, when writing data, converting the data format processed by the compute engine into the unified data-row format, receiving the data through the data storage standard interface, and writing it into various data stores.
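The adaptation step amounts to one conversion in each direction per engine: unified rows into the engine's dataset type on read (S031), and back into unified rows on write (S032). A hedged sketch follows; `EngineAdapter` is an illustrative stand-in, not a real Spark or Flink API, and the "engine dataset" is parameterized so any engine representation can be plugged in.

```java
import java.util.List;
import java.util.function.Function;

// One adapter per compute engine: D is the engine's dataset type
// (e.g. a Spark DataFrame or Flink DataSet in a real adapter).
final class EngineAdapter<D> {
    private final Function<List<Object[]>, D> toEngine;
    private final Function<D, List<Object[]>> fromEngine;

    EngineAdapter(Function<List<Object[]>, D> toEngine, Function<D, List<Object[]>> fromEngine) {
        this.toEngine = toEngine;
        this.fromEngine = fromEngine;
    }
    // S031: unified data rows -> engine-specific dataset
    D read(List<Object[]> unifiedRows) { return toEngine.apply(unifiedRows); }
    // S032: engine-specific dataset -> unified data rows
    List<Object[]> write(D engineData) { return fromEngine.apply(engineData); }
}
```

Because the two conversion functions are the only engine-specific code, adding a new engine does not touch the data source plug-ins at all, which is the decoupling the invention targets.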
In an embodiment of the present invention, there is further provided a device for accessing heterogeneous data sources with a decoupled compute engine, the device comprising:
A data abstraction module, which abstracts the unified data row, i.e. the data row type model, the data row model, and the data types, thereby realizing unified access and storage for various heterogeneous data sources, with extension capability in both structure and function;
A data read/write module, which reads and writes data based on the plug-in mechanism, in which each kind of data source implements its own read and write plug-ins and the plug-in implementation specification is defined through Java parent classes;
An engine adaptation module, which adapts the compute engines, whereby each engine defines a dedicated data source through its own data read/write definition interfaces.
In an embodiment of the present invention, a computer device is further provided, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the above method for accessing heterogeneous data sources with a decoupled compute engine when executing the computer program.
In an embodiment of the present invention, a computer-readable storage medium is also provided, storing a computer program for executing the above method for accessing heterogeneous data sources with a decoupled compute engine.
The beneficial effects are that:
The method and device for accessing heterogeneous data sources with a decoupled compute engine support multiple compute engines while only one set of read/write code needs to be developed per data source; by introducing unified standard interfaces, the reader and writer for each data source are developed against the standard interface only once, and any standard-compliant compute engine can use the connector directly, greatly reducing development and maintenance cost; the standard interfaces define capabilities such as partitioned reading and writing, so the data of several partitions is processed simultaneously, fully utilizing the parallelism of the computing resources and supporting high-throughput, large-scale data processing tasks.
Drawings
FIG. 1 is a flow chart of the method for accessing heterogeneous data sources with a decoupled compute engine according to the present invention;
FIG. 2 is a schematic diagram of the device for accessing heterogeneous data sources with a decoupled compute engine according to the present invention;
FIG. 3 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments, with the understanding that these embodiments are merely provided to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the invention may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.
The terminology and interpretation related to the invention:
According to the embodiments of the invention, a method and device for accessing heterogeneous data sources with a decoupled compute engine are provided, which process data source queries and writes through a unified API, thereby decoupling the dependency on any particular compute engine, supporting multiple compute engines, and greatly reducing development and maintenance cost.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments thereof.
As shown in FIG. 1, the method for accessing heterogeneous data sources with a decoupled compute engine according to the present invention includes:
S01, abstracting a unified data row: abstracting the data row type model, the data row model, and the data types, thereby realizing unified access and storage for various heterogeneous data sources, with extension capability in both structure and function.
Further, S01 includes:
S011, abstracting the data row type model, which describes the fields in a data row and their attributes; a field's attributes include its name, data type, and whether null values are allowed, equivalent to the definition of a table structure;
S012, the data row model, which defines the information of one row of data and is used for exchanging and transferring data while reading from and writing to a data source.
Further, the data row model includes:
A row type definition: the field information of the row data is defined through the data row type model;
An object array: stores each field value of the record, with each position corresponding to one field;
A field access index: the object array is accessed directly by index rather than through Java reflection, which gives higher performance;
A field name mapping: ensures that the field names in the row type model correspond to positions in the object array, supporting flexible access by name;
A serializer: establishes the association between the data row type model and the data row, and is responsible for serialization and deserialization of the data row for efficient data transfer;
Metadata: records auxiliary information describing the data row, such as its source, length, and version;
A persistence processor: supports persisting the data row into a file, a database table, or the like, implemented by converting the data row type.
S013, abstracting unified field data types, where a field's data type represents the type of one field in the row type; a mapping is established between each data source's native data types and the unified data types, so that data access and storage can be realized through the unified API; the mapping is established by each data source's read and write plug-ins, and when a data source is read, its native data types are converted into the unified data types.
Further, the data types include:
Basic data types: integer, string, boolean, and the like;
Composite data type: a data type composed of several basic data types, for example a composite type consisting of an integer and a string;
Array type: defines the data type of an array, where the data type of its elements is defined through the basic data types;
Map data type: defines a data type of key-value relationships, where the data types of the key and the value are defined separately.
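The native-to-unified type mapping that each read/write plug-in establishes can be sketched as a lookup table. The sketch below uses ordinary MySQL column type names on the source side; the mapping table itself and the `TypeMapping.unify` helper are illustrative assumptions, not the patent's actual code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Per-plug-in mapping from a source's native types to the unified types,
// consulted when converting values during a data source read.
final class TypeMapping {
    private static final Map<String, String> MYSQL_TO_UNIFIED = new HashMap<>();
    static {
        MYSQL_TO_UNIFIED.put("INT", "INTEGER");
        MYSQL_TO_UNIFIED.put("BIGINT", "INTEGER");
        MYSQL_TO_UNIFIED.put("VARCHAR", "STRING");
        MYSQL_TO_UNIFIED.put("TEXT", "STRING");
        MYSQL_TO_UNIFIED.put("TINYINT(1)", "BOOLEAN");
    }

    // Resolve a MySQL column type to the unified type, failing loudly on
    // an unmapped type rather than guessing.
    static String unify(String mysqlType) {
        String unified = MYSQL_TO_UNIFIED.get(mysqlType.toUpperCase(Locale.ROOT));
        if (unified == null) throw new IllegalArgumentException("unmapped type: " + mysqlType);
        return unified;
    }
}
```

An Elasticsearch or Kafka plug-in would carry its own table of the same shape, which is what lets the unified API stay identical across sources.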
S02, reading and writing data based on a plug-in mechanism, which can be implemented in various ways; for example, the factory design pattern or the template method design pattern can be used to make the plug-ins extensible. Data sources such as MySQL and Elasticsearch each implement plug-ins for reading and writing that source, and the plug-in implementation specification is defined through Java parent classes.
Further, the S02 includes:
S021, the data source read standard interface, which defines the metadata and basic behavior of a data source; the data source read standard interface comprises: a data source read definition interface, a data source dynamic partition definition interface, and a partition data read interface;
The data source read definition interface is responsible for describing the data source as a whole and for creating the data source dynamic partition definition interface and the partition data read interface;
The data source dynamic partition definition interface is responsible for continuously producing a partition object describing each partition; the partition objects are distributed by the scheduling framework to the subtasks executed by different tasks;
The partition data read interface is called by a subtask to read the data of the partition object assigned to it; it reads and processes the real data records.
Data sources such as MySQL and Elasticsearch each implement these three interfaces, forming a complete path from a data source to a parallel-processed dataset in the unified data-row format and improving the abstraction and generality of data source reads.
Further, the data source read definition interface describes the metadata and basic behavior of a data source.
Further, the functions of the data source read definition interface include:
Defining the connection parameters of a data source: providing connection addresses, configuration, and similar information, for example Kafka brokers or a Hive address;
Providing a data read interface: reading data from the data source and converting it into the unified data-row format;
Defining the data structure: defining the field structure of the data source's rows based on the data row type model;
Managing partitions: partitioning the accessed data source data, thereby enabling parallel processing;
Unifying the APIs of different data sources: records produced by different data sources are all represented as unified data rows, so a uniform API (map, filter, and the like) can be provided for processing records regardless of where the data comes from.
Further, the data source dynamic partition definition interface dynamically generates the partition rules for partitioning a data source, uniformly manages the partition rules of different data sources (such as Kafka and HDFS), and provides a consistent partition interface for upper-layer applications;
The partitions of a data source are generated dynamically, which amounts to splitting the data source to be read into several partitions; each partition corresponds to one concurrent task, so distributed parallel processing can be realized.
Further, the functions of the data source dynamic partition definition interface include:
Defining a partition enumeration interface: providing a next() method that returns partition objects in sequence;
Providing a partition description: summary information describing each partition, such as the partition identifier and path range;
Supporting dynamic partitioning: the partition set may not be fixed, so dynamically acquiring newly added partitions must be supported;
Concurrency safety: partition access under multithreading must be thread safe;
Data source transparency: the underlying data source is hidden behind the partition enumeration interface, so callers need not be concerned with its internal partition rules.
Further, the partition data read interface encapsulates the differing read details of different data sources and provides a uniform, high-performance data read service for upper-layer applications; it performs the real reading and processing of each partition object's data records in the source, with each partition read producing a list of output data whose element type is the unified data-row format; the partition data read interface is implemented per specific data source, and different sources have different implementations.
Further, the functions of the partition data read interface include:
Providing a data read interface: reading records out of the data source in parallel and packaging them in the unified data-row format for return;
Masking data source differences: providing a unified read API regardless of the form of the data source;
Concurrency control: supporting multi-threaded concurrent reads to ensure read performance;
Fault-tolerance mechanism: on a source exception the read can be retried, or on repeated failure the records are diverted to an exception output.
S022, the data storage standard interface, which receives data and writes it into various data stores; the data storage standard interface comprises: a data source storage definition interface and a partition data write interface;
The data source storage definition interface is responsible for describing the data store as a whole and for creating the partition data write interface;
The partition data write interface receives the data of a partition object; each partition object is assigned by the scheduling framework to a subtask executed by a task, and the subtask calls the partition data write interface to write the partition object's data into the target store.
Further, the data source storage definition interface describes the metadata and basic behavior of the data store, specifies how to connect to and operate the stores of various data sources, and provides a general standard interface specification for customizing the storage of various data sources.
Further, the specific flow of the data source storage definition interface includes:
Setting the field structure definition of the received data rows, including the field names and the data type of each field;
Passing the index number of the subtask corresponding to the data partition object to the partition data write interface, so that several partitions can be written in parallel;
Creating a partition data write interface.
Further, the partitioned data writing interface encapsulates the detail difference of writing different data sources, and provides unified high-performance data writing service for upper-layer application; the partition data writing interface is used for truly writing the data record of each partition object in the process that the input of each partition data is a data list set of the partition object, and the data types of the lists are in a unified data line format; the partition data writing interface is formulated for specific data sources, and different sources have different partition data writing implementations;
further, the functions of the partition data writing interface include:
Providing a data writing interface: receiving the data of a partition object in the unified data line format, performing data type conversion according to the target data source type, and writing the data into the target store;
masking data source differences: providing a unified write API regardless of the form of the data source;
concurrency control: supporting multi-threaded concurrent writing to ensure write performance;
fault-tolerance mechanism: on data write failure, retrying or routing the failed records to an exception output, etc.
S03, adaptation of the computing engines: Spark, Flink, Presto and other computing engines define a dedicated data source through each engine's own data reading and writing definition interfaces.
Further, the step S03 includes:
S031, when reading data, converting the data accessed through the data source reading standard interface into the format of the corresponding computing engine; the accessed data has multiple data partition objects, so distributed data access can be realized;
for example, for the Spark computing engine, the unified data lines are converted into a Spark DataFrame; for the Flink computing engine, the unified data lines are converted into a Flink DataSet.
S032, when writing data, converting the data format processed by the computing engine into a uniform data line format, receiving data through a data storage standard interface, and writing into various data storage.
For example, for the Spark computing engine, the processed data is a DataFrame; the DataFrame is converted into the unified data line format, and the data is then received through the data storage standard interface;
for the Flink computing engine, the processed data is a DataSet; the DataSet is converted into the unified data line format, and the data is then received through the data storage standard interface.
Because Spark, Flink, etc. natively have distributed data processing capability, the data of each Spark or Flink partition can be mapped to the data of a partition object of the data storage standard interface, so distributed data writing can be realized.
It should be noted that although the operations of the method of the present invention are described in a particular order in the above embodiments and the accompanying drawings, this does not require or imply that the operations must be performed in the particular order or that all of the illustrated operations be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
In order to more clearly explain the above method for accessing heterogeneous data sources of the decoupled computing engine, the following description is provided with reference to specific embodiments, however, it should be noted that the embodiments are only for better explaining the present invention, and are not meant to limit the present invention unduly.
S01, abstracting unified data lines: abstracting the data line type model, the data line model and the data types, thereby realizing unified access and storage of various heterogeneous data sources, with extensibility in both structure and function;
the S01 includes:
s011, the abstract data line type model describes the fields in a data line and the attributes of those fields, the attributes of a field comprising: the field name, the data type, whether null values are allowed, etc.; this is equivalent to the definition of a table structure;
s012, the data line model defines the information of line data, and is used for exchanging and transmitting data in the reading and writing process of a data source;
The data line model includes:
defining the type of line data, and defining field information corresponding to the line data through a data line type model;
an object array for storing each field value of the record, each position corresponding to a field;
The field access index: the contents of the object array are accessed directly by index rather than through Java reflection, giving higher performance;
field name mapping: ensures that the field names in the line type model definition correspond to positions in the object array, supporting flexible access by name;
the serializer: establishes the association between the data line type model and the data line, and is responsible for the serialization and deserialization of data lines to realize efficient data transmission;
metadata: records auxiliary information describing the data line, such as its source, length and version;
and the persistence processor: supports persisting data lines into files, database tables, etc., implemented through data line type conversion.
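As a concrete illustration of the components listed above, a minimal Java sketch follows; all class and method names (RowType, DataRow, and so on) are hypothetical assumptions for illustration, not names taken from the patent:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the data line type model and data line model described above.
class RowModelSketch {
    // Row type model: field names and types, like a table structure definition.
    static final class RowType {
        final String[] names;
        final String[] types;
        final Map<String, Integer> indexByName = new HashMap<>();
        RowType(String[] names, String[] types) {
            this.names = names;
            this.types = types;
            for (int i = 0; i < names.length; i++) indexByName.put(names[i], i);
        }
    }

    // Data line: one Object per field; accessed directly by index (no reflection)
    // or by field name through the name-to-position mapping.
    static final class DataRow {
        final RowType type;
        final Object[] values;
        DataRow(RowType type, Object[] values) { this.type = type; this.values = values; }
        Object get(int i) { return values[i]; }
        Object get(String field) { return values[type.indexByName.get(field)]; }
    }
}
```

The serializer, metadata and persistence pieces would attach to these two classes; they are omitted to keep the sketch small.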
S013, abstracting unified data types of fields, wherein the data types of the fields represent the data types of one field in the data line types; establishing a mapping relation between the data types of each data source and the unified data types, so that the unified API can be used for realizing the access and storage of data; the mapping relation is established by reading and writing plug-ins of each data source; when the data source reads, the data type of the data source is converted into a unified data type.
The data types include:
the basic data types include: integer, string, boolean, etc.;
a composite data type, a data type composed of a plurality of basic data types; for example, a composite data type consisting of both an integer and a string of basic data types.
Array type: defines the data type of an array, where the data type of the array's elements is defined through the basic data types;
Map data type: defines the data type of key-value relations, specifying the data types of the key and the value separately.
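To make the type-mapping idea of S013 concrete, here is a small hedged Java sketch; the MySQL type names chosen and the UnifiedType enum are illustrative assumptions, not part of the patent:

```java
import java.util.Map;

// Sketch of mapping a source's native types to the unified data types.
class TypeMappingSketch {
    enum UnifiedType { INTEGER, STRING, BOOLEAN, ARRAY, MAP }

    // A read/write plug-in for a given source registers a mapping like this one.
    static final Map<String, UnifiedType> MYSQL_TO_UNIFIED = Map.of(
            "INT", UnifiedType.INTEGER,
            "VARCHAR", UnifiedType.STRING,
            "TINYINT(1)", UnifiedType.BOOLEAN);

    // On read, the source's native type is converted to the unified type.
    static UnifiedType toUnified(String sourceType) {
        UnifiedType t = MYSQL_TO_UNIFIED.get(sourceType);
        if (t == null) throw new IllegalArgumentException("unmapped type: " + sourceType);
        return t;
    }
}
```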
S02, the reading and writing of data are based on a plug-in mechanism, which can be realized in various ways; for example, the factory design pattern, the template design pattern, etc. can be used to realize plug-in extension. MySQL, Elasticsearch and other data sources each implement plug-ins for reading and writing the respective data source, and the implementation specification of the plug-ins is defined through a Java parent class.
The S02 includes:
s021, a data source reading standard interface, which defines metadata and basic behaviors of a data source by utilizing the data source reading definition interface, wherein the data source reading standard interface comprises: a data source reading definition interface, a data source dynamic partition definition interface and a partition data reading interface;
the data source reading definition interface is responsible for describing the whole data source and creating a data source dynamic partition definition interface and a partition data reading interface;
the data source dynamic partition definition interface is responsible for continuously producing the partition object describing each partition, and the scheduling framework distributes the partition objects to subtasks executed by different tasks;
The partition data reading interface is called by the subtask, reads the data of the partition object distributed to the subtask, and reads and processes the real data record.
MySQL, Elasticsearch and other data sources each implement these three interfaces, thereby forming a complete, parallel pipeline from data source to a dataset in the unified data line format, which improves the abstraction and generality of data source reading.
The data source read definition interface: metadata and basic behavior of a data source are described using the data source read definition interface.
The functions of the data source reading definition interface include:
Defining the connection parameters of a data source: providing the connection address, configuration and other information to configure data sources such as Kafka and Hive addresses;
Providing a data reading interface: reading data from a data source and converting the data into a uniform data line format;
Defining a data structure: defining field structure information of a data line of a data source based on a data line type model;
management partition: partitioning the accessed data source data, thereby realizing parallel processing;
Unifying different data source APIs: records generated by different data sources are all represented using a uniform data line, providing a uniform API, such as map, filter, etc., for the processing of the records, regardless of where the data is coming from.
The data source dynamic partition definition interface: partition rules for dynamically generating partitions of data sources, uniformly managing partition rules of different data sources (such as Kafka, HDFS and the like), and providing a consistent partition interface for upper-layer applications;
The partitions of the data source are generated dynamically, which is equivalent to splitting the data source to be read into multiple partitions; each partition corresponds to one concurrent task, so distributed parallel processing can be realized;
the functions of the data source dynamic partition definition interface include:
defining a partition enumeration interface: providing a next () method to sequentially return partition objects;
Providing a partition description: the partition description comprises partition summary information such as the partition identifier and path range;
supporting dynamic partitioning: the partition set may not be fixed, so dynamic acquisition of newly added partitions must be supported;
concurrency safety: partition access under multithreading must guarantee thread safety;
data source transparency: the underlying data source is hidden behind the partition enumeration interface, so callers need not be concerned with its internal partition rules.
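The functions above can be sketched in Java as follows; the class names and the null-when-exhausted convention are illustrative assumptions, not from the patent:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the partition enumeration behavior: a thread-safe next() that
// hands out partition descriptions until the partition set is exhausted.
class PartitionEnumSketch {
    // Partition summary info: identifier plus a path/range description.
    static final class PartitionDesc {
        final int id;
        final String range;
        PartitionDesc(int id, String range) { this.id = id; this.range = range; }
    }

    static final class PartitionEnumerator {
        private final List<PartitionDesc> partitions;
        private final AtomicInteger cursor = new AtomicInteger(); // concurrency safety
        PartitionEnumerator(List<PartitionDesc> partitions) { this.partitions = partitions; }
        // Returns the next partition object, or null when none remain.
        PartitionDesc next() {
            int i = cursor.getAndIncrement();
            return i < partitions.size() ? partitions.get(i) : null;
        }
    }
}
```

A dynamic implementation would refresh the partition list (e.g., new Kafka partitions) instead of holding a fixed one.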
The partition data reading interface encapsulates the detailed differences of reading from different data sources and provides a unified, high-performance data reading service for upper-layer applications; the partition data reading interface actually reads and processes the data records of each partition object in the source, and each partition read outputs a list of data whose element type is the unified data line format; the implementation of the partition data reading interface is specific to each data source, and different sources have different partition data reading implementations;
The functions of the partition data reading interface include:
Providing a data reading interface: reading records from the data sources in parallel and packaging them into the unified data line format for return;
masking data source differences: providing a unified read API regardless of the form of the data source;
concurrency control: supporting multi-threaded concurrent reading to ensure read performance;
fault-tolerance mechanism: on source exceptions, retrying or routing the failure to an exception output.
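The fault-tolerance point above can be sketched as a bounded retry wrapper; the helper name and retry policy are assumptions for illustration:

```java
import java.util.function.Supplier;

// Sketch of the read-side fault-tolerance mechanism: retry a partition read
// a bounded number of times before surfacing the failure.
class RetryReadSketch {
    static <T> T readWithRetry(Supplier<T> read, int maxRetries) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return read.get();
            } catch (RuntimeException e) {
                last = e; // transient source exception: try again
            }
        }
        throw last; // retries exhausted: caller routes this to the exception output
    }
}
```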
S022, a data storage standard interface, which receives data and writes it to various data stores, wherein the data storage standard interface comprises: a data source storage definition interface and a partition data writing interface;
The data source storage definition interface is responsible for describing the data source store as a whole and for creating the partition data writing interface;
the partition data writing interface receives the data of partition objects; the scheduling framework assigns each partition object to a subtask executed by a task, and the subtask calls the partition data writing interface to write the partition object's data into the target store.
The data source storage definition interface describes the metadata and basic behavior of the data source store, specifies how to connect to and operate the stores of various data sources, and provides a universal standard interface specification for customizing the storage of various data sources.
the specific flow of the data source storage definition interface comprises the following steps:
Setting the field structure definition of the data lines of the received data, comprising: the field name, the data type of the field, etc.;
transmitting the index number of the subtask corresponding to the data partition object to the partition data writing interface, so as to support parallel writing of multiple partitions;
creating a partition data writing interface.
The partition data writing interface encapsulates the detailed differences of writing to different data sources and provides a unified, high-performance data writing service for upper-layer applications; the input of each partition write is the list of data of one partition object, where the element type of the list is the unified data line format, and the partition data writing interface actually writes the data records of each partition object; the implementation of the partition data writing interface is specific to each data source, and different sources have different partition data writing implementations;
the functions of the partition data writing interface include:
Providing a data writing interface: receiving the data of a partition object in the unified data line format, performing data type conversion according to the target data source type, and writing the data into the target store;
masking data source differences: providing a unified write API regardless of the form of the data source;
concurrency control: supporting multi-threaded concurrent writing to ensure write performance;
fault-tolerance mechanism: on data write failure, retrying or routing the failed records to an exception output, etc.
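The write-side interface can be sketched as follows; the interface name, the toy in-memory target, and the per-subtask indexing are hypothetical illustrations of the behavior described above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the partition data writing interface: each subtask hands its
// partition's list of unified-format rows to a writer bound to one target store.
class PartitionWriteSketch {
    interface PartitionWriter {
        void write(int subtaskIndex, List<Object[]> rows);
    }

    // Toy target store; a real plug-in would convert the unified rows to the
    // target's native types here and perform the actual write.
    static final class InMemoryWriter implements PartitionWriter {
        final Map<Integer, List<Object[]>> store = new HashMap<>();
        public void write(int subtaskIndex, List<Object[]> rows) {
            store.computeIfAbsent(subtaskIndex, k -> new ArrayList<>()).addAll(rows);
        }
    }
}
```

Because each subtask writes under its own index, multiple partitions can be written in parallel without coordination.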
S03, adaptation of the computing engines: Spark, Flink, Presto and other computing engines define a dedicated data source through each engine's own data reading and writing definition interfaces.
The step S03 comprises the following steps:
S031, when reading data, converting the data accessed through the data source reading standard interface into the format of the corresponding computing engine; the accessed data has multiple data partition objects, so distributed data access can be realized;
for example, for the Spark computing engine, the unified data lines are converted into a Spark DataFrame; for the Flink computing engine, the unified data lines are converted into a Flink DataSet.
S032, when writing data, converting the data format processed by the computing engine into a uniform data line format, receiving data through a data storage standard interface, and writing into various data storage.
For example, for the Spark computing engine, the processed data is a DataFrame; the DataFrame is converted into the unified data line format, and the data is then received through the data storage standard interface;
for the Flink computing engine, the processed data is a DataSet; the DataSet is converted into the unified data line format, and the data is then received through the data storage standard interface.
Because Spark, Flink, etc. natively have distributed data processing capability, the data of each Spark or Flink partition can be mapped to the data of a partition object of the data storage standard interface, so distributed data writing can be realized.
Taking Spark computing engine as an example, the code for reading data is as follows:
The format method specifies the name of the data source, dedicated to reading data sources defined through the data source reading standard interface; the options method specifies information such as the host, port, database name and table of the MySQL database; and the load method loads the data, so a database table is loaded without writing complex data reading logic.
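The original code listing is not reproduced in this text; a hedged sketch of what such Spark (Java API) read code could look like follows — the source name "unified-source" and the option keys are assumptions for illustration, not names from the patent, and a running SparkSession (`spark`) is assumed:

```java
// Illustrative sketch only; requires a SparkSession and the custom connector.
Dataset<Row> df = spark.read()
        .format("unified-source")        // name of the custom data source (assumed)
        .option("host", "127.0.0.1")     // MySQL connection info via options
        .option("port", "3306")
        .option("database", "db1")
        .option("table", "t1")
        .load();                         // loads the table, no custom read logic
```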
Taking Spark computing engine as an example, the code for writing data is as follows:
To write data into the database, the format method specifies the name of the data source, dedicated to converting the data into the unified data format so that it is received and written by the data storage standard interface; the options method specifies information such as the host, port, database name, table and write mode of the MySQL database; and the save method executes the write operation, so data is written into a database table without writing complex data writing logic.
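A matching hedged sketch of the write side follows; again, "unified-source" and the option keys (including the write-mode key) are assumptions, and `df` is a DataFrame produced by earlier processing:

```java
// Illustrative sketch only; names and option keys are assumptions.
df.write()
        .format("unified-source")        // converts rows to the unified format (assumed name)
        .option("host", "127.0.0.1")
        .option("port", "3306")
        .option("database", "db1")
        .option("table", "t1")
        .option("writeMode", "append")   // data writing mode passed via options
        .save();                         // executes the write, no custom write logic
```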
Based on the same inventive concept, the invention also provides an access device for decoupling heterogeneous data sources of the calculation engine. The implementation of the device can be referred to as implementation of the above method, and the repetition is not repeated. The term "module" as used below may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 2 is a schematic diagram of an access device for decoupling heterogeneous data sources of a compute engine according to the present invention. As shown in fig. 2, the apparatus includes:
the data abstraction module 110: abstracts unified data lines, abstracting the data line type model, the data line model and the data types, thereby realizing unified access and storage of various heterogeneous data sources, with extensibility in both structure and function;
the data read-write module 120, the reading and writing of the data are based on a plug-in mechanism, each type of data source respectively realizes the plug-in of the respective data source reading and data writing, and the implementation specification of the plug-in is defined by the parent class of Java;
The engine adaptation module 130: adaptation of the computing engines, where each computing engine defines a dedicated data source through its own data reading and writing definition interfaces.
It should be noted that while several modules of access means for decoupling heterogeneous data sources of a computing engine are mentioned in the above detailed description, such a partitioning is merely exemplary and not mandatory. Indeed, the features and functions of two or more modules described above may be embodied in one module in accordance with embodiments of the present invention. Conversely, the features and functions of one module described above may be further divided into a plurality of modules to be embodied.
Based on the foregoing inventive concept, as shown in fig. 3, the present invention further proposes a computer device 200, including a memory 210, a processor 220, and a computer program 230 stored in the memory 210 and capable of running on the processor 220, where the processor 220 implements the method for accessing heterogeneous data sources of the foregoing decoupled computing engine when executing the computer program 230.
Based on the foregoing inventive concept, the present invention also proposes a computer-readable storage medium storing a computer program for executing the method of accessing heterogeneous data sources of the foregoing decoupled compute engine.
The invention relates to a method and device for accessing heterogeneous data sources of a decoupled computing engine, which support multiple computing engines while requiring only one set of read/write code per data source; by introducing a unified standard interface, the read and write of each data source need be developed against the standard interface only once, and any standard-compliant computing engine can use the connector directly, greatly reducing development and maintenance costs; the standard interface defines capabilities such as partitioned reading and writing, processing the data of multiple partitions simultaneously, fully utilizing the parallelism of computing resources, and thereby supporting high-throughput, large-scale data processing tasks.
While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects imply that features in those aspects cannot be combined to advantage; such division is merely for convenience of presentation. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Various implementations of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor, which may be special-purpose or general-purpose, may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
It should be apparent that, based on the technical solutions of the present invention, those skilled in the art can make various modifications or variations without any inventive effort.

Claims (19)

1. A method for accessing heterogeneous data sources of a decoupled compute engine, the method comprising:
s01, abstracting unified data lines, and abstracting the data line type model, the data line model and the data types, so that unified access and storage of various heterogeneous data sources are realized, and the system has expansion capability in structural and functional aspects;
S02, reading and writing data are based on a plug-in mechanism, plug-ins for respectively realizing the reading and writing of the data sources are respectively realized by various data sources, and the realization specification of the plug-ins is defined by a parent class of Java;
S03, adapting a computing engine, wherein the computing engine defines a special data source through a definition interface of data reading and writing of the engine.
2. The method for accessing heterogeneous data sources of a decoupled compute engine of claim 1, wherein S01 comprises:
s011, describing fields in a data row and attributes of the fields by an abstract data row type model, wherein the attributes of the fields comprise: the name of the field, the data type, whether null values are allowed;
s012, the data line model defines the information of line data, and is used for exchanging and transmitting data in the reading and writing process of a data source;
S013, abstracting unified data types of fields, wherein the data types of the fields represent the data types of one field in the data line types; establishing a mapping relation between the data types of each data source and the unified data types, and realizing the access and storage of data by using the unified API; the mapping relation is established by reading and writing plug-ins of each data source; when the data source reads, the data type of the data source is converted into a unified data type.
3. The method of accessing heterogeneous data sources of a decoupled compute engine of claim 2, wherein the data line model comprises:
defining the type of line data, and defining field information corresponding to the line data through a data line type model;
an object array for storing each field value of the record, each position corresponding to a field;
the field access index, through which the object array content is directly accessed;
Mapping field names, namely ensuring that the field names in the definition of the line type model correspond to the object array positions, and supporting flexible access;
the serializer is used for establishing the association between the data line type model and the data line, and is responsible for the serialization and deserialization of the data line so as to realize efficient data transmission;
metadata, recording auxiliary information describing the data line, such as its source, length and version;
And the persistence processor supports the persistence of the data line into a file or a database table and the like and is realized by converting the data line type.
4. The method of accessing heterogeneous data sources of a decoupled compute engine of claim 2, wherein the data types comprise:
the basic data types include: integer, string, boolean;
composite data type: a data type composed of a plurality of basic data types;
Array type: defining the data type of the array, wherein the data type of the element of the data is defined through the basic data type;
map data type: data types defining relationships between keys and values define data types of the keys and values, respectively.
5. The method for accessing heterogeneous data sources of a decoupled compute engine of claim 1, wherein S02 comprises:
s021, a data source reading standard interface, which defines metadata and basic behaviors of a data source by utilizing the data source reading definition interface, wherein the data source reading standard interface comprises: a data source reading definition interface, a data source dynamic partition definition interface and a partition data reading interface;
the data source reading definition interface is responsible for describing the whole data source and creating a data source dynamic partition definition interface and a partition data reading interface;
the data source dynamic partition definition interface is responsible for continuously producing the partition object describing each partition, and the scheduling framework distributes the partition objects to subtasks executed by different tasks;
The partition data reading interface is called by the subtask, reads the data of the partition object distributed to the subtask, and reads and processes the real data record;
s022, a data storage standard interface, which receives data and writes various data storage, wherein the data storage standard interface comprises: a data source storage definition interface and a partition data writing interface;
The data source storage definition interface is responsible for describing the whole data source storage and creating a partition data writing interface;
the partition data writing interface receives data of partition objects, each partition object is distributed with a subtask executed by a task through a scheduling framework, the subtask calls the partition data writing interface, and the data of the partition objects are written into a target storage.
6. The method of claim 5, wherein the data source reading definition interface describes the metadata and basic behavior of a data source.
7. The method of claim 5, wherein the functions of the data source reading definition interface comprise:
defining connection parameters of a data source: providing the connection address, configuration, and other information for configuring the data source;
providing a data reading interface: reading data from the data source and converting it into the unified data row format;
defining a data structure: defining the field structure information of the data source's data rows based on the data row type model;
managing partitions: partitioning the accessed data source data, thereby enabling parallel processing;
unifying different data source APIs: records produced by different data sources are all represented as the unified data row, providing a unified API for processing the records.
8. The method of claim 5, wherein the data source dynamic partition definition interface is used for dynamically generating the partitions of a data source, uniformly managing the partition rules of different data sources, and providing a consistent partition interface for upper-layer applications.
9. The method of claim 5, wherein the functions of the data source dynamic partition definition interface comprise:
defining a partition enumeration interface: providing a next() method to return partition objects in sequence;
providing a partition description: the partition description comprises a partition identifier and a path range that summarize the partition;
supporting dynamic partitioning: the partition set need not be fixed, so that new partitions can be acquired dynamically;
concurrency safety: partition access under multithreading is thread-safe;
data source transparency: the details of the data source are hidden behind the partition enumeration interface.
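The enumeration behavior above can be sketched with a queue-backed planner. This is an illustrative example with assumed names: next() hands out one partition per call, new partitions may be added while the job runs, and concurrent calls from several subtask threads are safe because each queued partition is handed to exactly one caller:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

class DynamicPartitionPlanner {
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();

    // Called when new partitions are discovered at runtime,
    // e.g. new files appearing under a watched directory.
    void addPartition(String partitionId) { pending.add(partitionId); }

    // Thread-safe: each caller receives a distinct partition,
    // or null when no partition is currently available.
    String next() { return pending.poll(); }
}

public class DynamicPartitionDemo {
    public static void main(String[] args) throws InterruptedException {
        DynamicPartitionPlanner planner = new DynamicPartitionPlanner();
        for (int i = 0; i < 10; i++) planner.addPartition("partition-" + i);

        // Four subtask threads drain the planner concurrently; the queue
        // guarantees no partition is claimed twice.
        var claimed = ConcurrentHashMap.<String>newKeySet();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < 4; t++) {
            workers[t] = new Thread(() -> {
                for (String p = planner.next(); p != null; p = planner.next()) claimed.add(p);
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println("partitions claimed: " + claimed.size());   // partitions claimed: 10
    }
}
```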
10. The method for accessing heterogeneous data sources of a decoupled compute engine of claim 5, wherein the partition data reading interface encapsulates the reading-detail differences of different data sources, providing a unified high-performance data reading service for upper-layer applications;
the partition data reading interface actually reads and processes the data records of each partition object in the source; each partition read outputs a list of data, and the element type of the list is the unified data row format;
the implementation of the partition data reading interface is made for a specific data source, and different partition data reading implementations exist for different sources.
11. The method of claim 5, wherein the functions of the partition data reading interface comprise:
providing a data reading interface: records are read from the data sources in parallel and packaged into the unified data row format for return;
masking data source differences: providing a unified read API interface;
concurrency control: multi-threaded concurrent reading is supported to ensure reading performance;
fault-tolerance mechanism: source exceptions are handled; a read can be retried or, on final failure, converted into an exception output.
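The fault-tolerance function can be illustrated with a small retry helper. This is an assumed sketch, not the patent's implementation: a partition read is retried a bounded number of times, and once the retries are exhausted the failure degrades to a designated exception output instead of crashing the whole job:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class RetryingRead {
    static <T> T withRetry(Supplier<T> read, int maxAttempts, T exceptionOutput) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.get();
            } catch (RuntimeException e) {
                last = e;                       // remember the error and retry
            }
        }
        System.err.println("read failed after " + maxAttempts + " attempts: " + last);
        return exceptionOutput;                 // degrade instead of failing the job
    }
}

public class RetryDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Fails on the first two attempts, succeeds on the third.
        Supplier<List<Map<String, Object>>> flaky = () -> {
            if (calls.incrementAndGet() < 3) throw new RuntimeException("transient source error");
            return List.of(Map.of("id", 1));
        };
        List<Map<String, Object>> rows = RetryingRead.withRetry(flaky, 5, List.of());
        System.out.println("rows: " + rows.size() + ", attempts: " + calls.get());  // rows: 1, attempts: 3
    }
}
```

A production reader would typically add a backoff delay between attempts and route the exception output to a dedicated error sink.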
12. The method of claim 5, wherein the data source storage definition interface describes the metadata and basic behavior of data source storage, specifies how to connect to and operate the storage of various data sources, and provides a generic standard interface specification for customizing the storage of various data sources.
13. The method for accessing heterogeneous data sources of a decoupled compute engine of claim 5, wherein the specific flow of the data source storage definition interface comprises:
setting the field structure definition of the data rows of the received data, including: the field name and the data type of each field;
transmitting the index number of the subtask corresponding to each data partition object to the partition data writing interface, so as to support parallel writing into a plurality of partitions;
creating a partition data writing interface.
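The three steps above can be sketched as follows. The names SinkDefinition and PartitionWriter are hypothetical stand-ins: the sink definition records the field structure of the incoming rows, then creates a partition data writer bound to a subtask's index number so that several subtasks can write their partitions in parallel:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

interface PartitionWriter {
    void write(List<Map<String, Object>> rows);
}

class SinkDefinition {
    private List<String> fieldNames = List.of();

    // Step 1: set the field structure of the rows to be received.
    void setSchema(List<String> fieldNames) { this.fieldNames = fieldNames; }

    List<String> schema() { return fieldNames; }

    // Steps 2-3: create a writer bound to the subtask's index number,
    // so each subtask writes its own slot of the target store.
    PartitionWriter createWriter(int subtaskIndex,
                                 Map<Integer, List<Map<String, Object>>> targetStore) {
        return rows -> targetStore
                .computeIfAbsent(subtaskIndex, k -> Collections.synchronizedList(new ArrayList<>()))
                .addAll(rows);
    }
}

public class WriteSideSketch {
    public static void main(String[] args) {
        Map<Integer, List<Map<String, Object>>> store = new ConcurrentHashMap<>();
        SinkDefinition sink = new SinkDefinition();
        sink.setSchema(List.of("id", "name"));

        // Two subtasks write their own partitions under index 0 and 1.
        sink.createWriter(0, store).write(List.of(Map.of("id", 1, "name", "a")));
        sink.createWriter(1, store).write(List.of(Map.of("id", 2, "name", "b")));
        System.out.println("partitions written: " + store.size());   // partitions written: 2
    }
}
```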
14. The method for accessing heterogeneous data sources of a decoupled compute engine of claim 5, wherein the partition data writing interface encapsulates the writing-detail differences of different data sources, providing a unified high-performance data writing service for upper-layer applications; the partition data writing interface actually writes the data records of each partition object, the input for each partition being a list of the partition object's data, where the element type of the list is the unified data row format; the implementation of the partition data writing interface is made for a specific data source, and different partition data writing implementations exist for different sources.
15. The method of claim 5, wherein the functions of the partition data writing interface comprise:
providing a data writing interface: receiving the data of a partition object in the unified data row format and, according to the type of the target data source, performing data type conversion and writing the data into the target storage;
masking data source differences: providing a unified write API interface regardless of the form of the data source;
concurrency control: multi-threaded concurrent writing is supported to ensure writing performance;
fault-tolerance mechanism: a failed data write can be retried or, on final failure, converted into an exception output.
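The per-target type conversion mentioned above can be illustrated by formatting the same unified row differently depending on the store being written. Both formatters are toy examples with assumed names, not the patent's implementation:

```java
import java.util.*;

interface RowFormatter {
    String format(List<String> fields, Map<String, Object> row);
}

// Target that expects comma-separated values, e.g. a CSV file sink.
class CsvFormatter implements RowFormatter {
    public String format(List<String> fields, Map<String, Object> row) {
        StringJoiner j = new StringJoiner(",");
        for (String f : fields) j.add(String.valueOf(row.get(f)));
        return j.toString();
    }
}

// Target that expects key=value pairs, e.g. a log-style sink.
class KeyValueFormatter implements RowFormatter {
    public String format(List<String> fields, Map<String, Object> row) {
        StringJoiner j = new StringJoiner(" ");
        for (String f : fields) j.add(f + "=" + row.get(f));
        return j.toString();
    }
}

public class FormatDemo {
    public static void main(String[] args) {
        List<String> fields = List.of("id", "name");
        Map<String, Object> row = Map.of("id", 7, "name", "alice");
        // The same unified row, converted per target store type.
        System.out.println(new CsvFormatter().format(fields, row));        // 7,alice
        System.out.println(new KeyValueFormatter().format(fields, row));   // id=7 name=alice
    }
}
```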
16. The method for accessing heterogeneous data sources of a decoupled compute engine of claim 1, wherein S03 comprises:
S031, when reading data, converting the data accessed through the data source reading standard interface into the format corresponding to the computing engine; the accessed data has a plurality of data partition objects, so that distributed data access can be realized;
S032, when writing data, converting the data format processed by the computing engine into the unified data row format, receiving the data through the data storage standard interface, and writing it into various data stores.
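The engine adaptation of S031/S032 can be sketched as a round-trip conversion. EngineRecord and EngineAdapter are assumed stand-ins, where EngineRecord plays the role of an engine-specific type such as Spark's InternalRow or Flink's RowData:

```java
import java.util.*;

class EngineRecord {
    final Object[] values;                     // positional, engine-style layout
    EngineRecord(Object[] values) { this.values = values; }
}

class EngineAdapter {
    private final List<String> fields;
    EngineAdapter(List<String> fields) { this.fields = fields; }

    // S031: unified data row -> engine format.
    EngineRecord toEngine(Map<String, Object> row) {
        Object[] v = new Object[fields.size()];
        for (int i = 0; i < fields.size(); i++) v[i] = row.get(fields.get(i));
        return new EngineRecord(v);
    }

    // S032: engine format -> unified data row.
    Map<String, Object> toUnified(EngineRecord rec) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < fields.size(); i++) row.put(fields.get(i), rec.values[i]);
        return row;
    }
}

public class AdapterDemo {
    public static void main(String[] args) {
        EngineAdapter adapter = new EngineAdapter(List.of("id", "name"));
        Map<String, Object> row = Map.of("id", 3, "name", "carol");
        // Round trip: unified row -> engine record -> unified row.
        Map<String, Object> back = adapter.toUnified(adapter.toEngine(row));
        System.out.println(row.equals(back));   // true
    }
}
```

Because each engine only needs such an adapter pair, the plug-ins that read and write the unified data rows stay independent of any particular compute engine.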
17. An access device for heterogeneous data sources of a decoupled compute engine, the device comprising:
a data abstraction module, which abstracts the unified data row, abstracting the data row type model, the data row model, and the data types, thereby realizing unified access and storage of various heterogeneous data sources with structural and functional extensibility;
a data reading and writing module, which reads and writes data based on a plug-in mechanism; the reading and writing of each type of data source are implemented by separate plug-ins, and the implementation specification of the plug-ins is defined through a Java parent class;
an engine adaptation module, which adapts the computing engine; the computing engine defines a specific data source through its own definition interfaces for reading and writing data.
18. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-16 when executing the computer program.
19. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program for performing the method of any one of claims 1-16.
CN202311706616.6A 2023-12-13 Heterogeneous data source access method and device of decoupling calculation engine Pending CN118245044A (en)

Publications (1)

Publication Number Publication Date
CN118245044A 2024-06-25
