WO2024087736A1 - Data processing method, data processing engine, computing device and storage medium - Google Patents
Data processing method, data processing engine, computing device and storage medium
- Publication number
- WO2024087736A1 (PCT/CN2023/106888)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data processing
- metadata
- external database
- memory
- computing device
- Prior art date
Links
- 238000012545 processing Methods 0.000 title claims abstract description 215
- 238000003672 processing method Methods 0.000 title claims abstract description 41
- 230000015654 memory Effects 0.000 claims abstract description 158
- 238000000034 method Methods 0.000 claims abstract description 66
- 230000004044 response Effects 0.000 claims description 25
- 238000004590 computer program Methods 0.000 claims description 13
- 238000004891 communication Methods 0.000 description 20
- 230000006870 function Effects 0.000 description 18
- 238000012790 confirmation Methods 0.000 description 14
- 238000010586 diagram Methods 0.000 description 12
- 230000008569 process Effects 0.000 description 12
- 238000012795 verification Methods 0.000 description 7
- 238000005516 engineering process Methods 0.000 description 6
- 238000004458 analytical method Methods 0.000 description 4
- 238000012986 modification Methods 0.000 description 4
- 230000004048 modification Effects 0.000 description 4
- 230000003287 optical effect Effects 0.000 description 4
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000013500 data storage Methods 0.000 description 3
- 238000007726 management method Methods 0.000 description 3
- 230000008520 organization Effects 0.000 description 3
- 238000005192 partition Methods 0.000 description 3
- 239000004065 semiconductor Substances 0.000 description 2
- 239000007787 solid Substances 0.000 description 2
- 230000002159 abnormal effect Effects 0.000 description 1
- 238000007792 addition Methods 0.000 description 1
- 238000013475 authorization Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000012217 deletion Methods 0.000 description 1
- 230000037430 deletion Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 239000000047 product Substances 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
Definitions
- the present application relates to the field of computer technology, and in particular to a data processing method, a data processing engine, a computing device and a storage medium.
- when executing a data processing job, a data processing engine, such as the Spark SQL engine, needs to obtain, through a catalog, the metadata of database objects such as databases, tables, partitions and views, as well as functions and information stored in other external systems, in order to process the data. Therefore, the catalog is also called the "brain" of the big data processing engine.
- in the related art, a big data processing engine can access the catalog of an external database through a network to obtain metadata.
- however, when a big data processing engine needs to frequently access the catalog of an external database, or needs to obtain the same metadata through the network multiple times, network delay and jitter are caused, resulting in a decrease in the processing performance of the big data processing engine and a decrease in data processing efficiency.
- the embodiment of the present application provides a data processing method, a data processing engine, a computing device and a storage medium, which can avoid accessing the directory of an external database through the network multiple times, thereby reducing network overhead, while avoiding the impact of network unavailability on the execution of data processing jobs, improving the performance of the data processing engine, and improving the efficiency of data processing.
- the technical solution is as follows.
- a data processing method, which is applied to a data processing engine of a cloud service, and the method comprises: determining an external database to be accessed by a data processing job and a target access object in the external database; caching metadata of the target access object from the external database into a memory of the data processing engine; and accessing the metadata from the memory to execute the data processing job.
- in the above method, the metadata in the external database is cached in the memory of the data processing engine, and the metadata is accessed from the memory, which can avoid accessing the directory of the external database through the network multiple times, thereby reducing network overhead and avoiding the impact of network unavailability on the execution of data processing jobs, thereby improving the performance of the data processing engine and improving the efficiency of data processing.
- the method also includes: establishing a heartbeat channel between the memory and the external database, the heartbeat channel being used to determine whether the metadata in the external database is updated, and the number of the heartbeat channels being equal to the number of the external databases; if the metadata in the external database has been updated, the updated metadata is cached from the external database to the memory.
- if the data processing job needs to access multiple external databases, there can be multiple heartbeat channels to separately determine whether the data in each external database has been updated; further, if the metadata of the target access object is not stored in the memory, the heartbeat channel is established after the metadata caching ends; if the metadata of the target access object is already stored in the memory, the heartbeat channel is established as soon as the data processing job is started.
- the method further includes: in response to completion of execution of the data processing job, closing the heartbeat channel.
- the heartbeat channel is closed after the data processing job is executed, which can save computing resources and improve the processing performance of the data processing engine.
- the external database stores a write-ahead log, which is used to record operations performed on the external database. If the metadata in the external database has been updated, the updated metadata is cached from the external database to the memory, including: if the write-ahead log includes an update operation on the metadata, the updated metadata is cached from the external database to the memory.
- the caching, if the metadata in the external database has been updated, of the updated metadata from the external database into the memory includes: if the metadata in the memory is inconsistent with the metadata in the external database, caching the updated metadata from the external database into the memory.
- the metadata of the target access object is cached from the external database to the memory of the data processing engine, including: caching the metadata from the external database to a target subdirectory of a directory in the memory of the data processing engine, the directory being used to store metadata of a database associated with the data processing engine, and the target subdirectory being used to store metadata from the external database.
- metadata from the external database is stored in the memory according to the structure of directory.subdirectory.database/data table.database object, which can avoid conflicts between the path of accessing metadata from the external database from the memory and the path of accessing metadata in the external database through the network.
- the method further includes: if the metadata is already stored in the memory, executing the step of accessing the metadata from the memory to execute the data processing job.
- the metadata that has been cached in the memory does not need to be cached again, which can save storage resources and improve the efficiency of data processing.
- a data processing device which is applied to a data processing engine of a cloud service, and the device includes:
- the confirmation module is used to determine the external database to be accessed by the data processing job and the target access object in the external database.
- the cache module is used to cache the metadata of the target access object from the external database into the memory of the data processing engine.
- the execution module is used to access the metadata from the memory to execute the data processing job.
- the device further includes:
- the communication module is used to establish a heartbeat channel between the memory and the external database.
- the heartbeat channel is used to determine whether the metadata in the external database is updated.
- the number of the heartbeat channels is equal to the number of the external databases.
- the cache module is also used to cache the updated metadata from the external database into the memory if the metadata in the external database has been updated.
- the communication module is further used for:
- in response to completion of execution of the data processing job, close the heartbeat channel.
- the external database stores a write-ahead log
- the write-ahead log is used to record operations performed on the external database
- the cache module is used to:
- if the write-ahead log includes an update operation on the metadata, cache the updated metadata from the external database into the memory.
- the cache module is used to:
- if the metadata in the memory is inconsistent with the metadata in the external database, cache the updated metadata from the external database into the memory.
- the cache module is used to:
- the metadata is cached from the external database to a target subdirectory of a directory in the memory of the data processing engine, the directory being used to store metadata of a database associated with the data processing engine, and the target subdirectory being used to store metadata from the external database.
- the execution module is further used to:
- if the metadata is already stored in the memory, perform the step of accessing the metadata from the memory to execute the data processing job.
- a data processing engine for a cloud service, which includes a metadata cache interface, the metadata cache interface being used to cache the metadata of a target access object from an external database into the memory of the data processing engine, so as to implement the data processing method provided by the first aspect or any optional manner of the first aspect.
- a computing device which includes a processor and a memory, wherein the processor of the computing device is used to execute instructions stored in the memory of the computing device, so that the computing device performs a data processing method provided in the first aspect or any optional manner of the first aspect.
- a computing device cluster comprising at least one computing device, each computing device comprising a processor and a memory;
- the processor of the at least one computing device is used to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the data processing method provided by the first aspect or any optional manner of the first aspect.
- a computer program product comprising computer instructions, the computer instructions being stored in a computer-readable storage medium.
- a processor of a computing device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computing device executes the data processing method provided by the first aspect or any optional manner of the first aspect.
- a computer-readable storage medium in which at least one instruction is stored.
- the instruction is read by a processor to enable a computing device to execute the data processing method provided by the first aspect or any optional method of the first aspect.
- FIG1 is a schematic diagram of an application scenario of a data processing method provided in an embodiment of the present application.
- FIG2 is a flow chart of a data processing method provided in an embodiment of the present application.
- FIG3 is a flow chart of a data processing method provided in an embodiment of the present application.
- FIG4 is a schematic diagram of a data processing method provided in an embodiment of the present application.
- FIG5 is a structural block diagram of a data processing device provided in an embodiment of the present application.
- FIG6 is a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
- FIG7 is a schematic diagram of a computing device cluster provided in an embodiment of the present application.
- FIG8 is a schematic diagram of a possible implementation of a computing device cluster provided in an embodiment of the present application.
- FIG. 1 is a schematic diagram of an application scenario of a data processing method provided in the embodiment of the present application.
- a data processing engine is running on a computing device, and the data processing engine is used to perform data processing jobs.
- the data processing engine can be Spark SQL or Flink SQL, etc., which is not limited in the embodiment of the present application.
- the process of the data processing engine executing a data processing job includes steps such as structured query language (SQL) syntax analysis (parse), catalog (catalog), analysis (analysis), optimization (optimize), physical plan (physical plan) and program execution, wherein program execution includes a resilient distributed dataset program (resilient distributed dataset program), a data stream program (datastream program) and a map-reduce program (map-reduce).
- the data processing engine includes a memory (in memory), a metadata management system (Hive Metastore, HMS) and a catalog interface (application programming interface, API).
- the memory includes at least one cache and is used to store the catalog of the database associated with the data processing engine.
- the metadata management system is used to manage the catalog of the database associated with the data processing engine.
- the catalog interface is used to access the catalog in the memory and metadata management system, obtain the metadata of database objects (database, DB), tables, partitions, views, etc., and execute the data processing job.
- the data processing engine is associated with an external database, that is, an external service (extend service), and the data processing engine accesses the catalog in the external database through the catalog interface to obtain the metadata in the external database.
- the data processing engine and the external database are connected in communication via a wired network or a wireless network.
- the wireless network or wired network uses standard communication technologies and/or protocols.
- the network is typically the Internet, but can also be any network, including but not limited to a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile, wired or wireless network, a private network, or any combination of virtual private networks.
- the data processing engine and the external database communicate based on Java database connectivity (JDBC), remote procedure call protocol (RPC), or hypertext transfer protocol (HTTP).
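- as one concrete possibility, when the external database is reachable over JDBC, the column metadata of a table such as mysql.table1 could be read roughly as sketched below; the connection URL, credentials and class name are placeholders for illustration, not details specified in this application.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Rough illustration of reading column metadata over JDBC; URL and credentials are placeholders. */
public class JdbcMetadataFetcher {

    public static Map<String, String> fetchColumns(String jdbcUrl, String user, String password,
                                                   String database, String table) throws SQLException {
        Map<String, String> columns = new LinkedHashMap<>();
        try (Connection connection = DriverManager.getConnection(jdbcUrl, user, password)) {
            DatabaseMetaData dbMetadata = connection.getMetaData();
            // one row per column of the target table, e.g. database "mysql", table "table1"
            try (ResultSet rs = dbMetadata.getColumns(database, null, table, null)) {
                while (rs.next()) {
                    columns.put(rs.getString("COLUMN_NAME"), rs.getString("TYPE_NAME"));
                }
            }
        }
        return columns; // this is the kind of metadata the engine would cache in its memory
    }
}
```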
- JDBC Java database connectivity
- RPC remote procedure call protocol
- HTTP hypertext transfer protocol
- the data processing engine and the external database use technologies and/or formats including hypertext markup language (HTML), extensible markup language (XML), etc. to represent metadata exchanged through the network.
- in addition, conventional encryption technologies such as secure socket layer (SSL), transport layer security (TLS), virtual private network (VPN) and Internet protocol security (IPsec) can be used to encrypt all or some of the links; in other embodiments, customized and/or dedicated data communication technologies can also be used in place of or in addition to the above data communication technologies.
- FIG2 is a flow chart of a data processing method provided by the embodiment of the present application. As shown in FIG2, the method is applied to the data processing engine of the cloud service, and includes the following steps 201 to 203.
- the data processing engine determines an external database to be accessed by a data processing job and a target access object in the external database.
- the data processing job refers to a series of program instructions executed in sequence by the data processing engine, that is, a query statement, which includes statements for performing operations such as adding, deleting, modifying and querying the target access object, as well as statements for obtaining the target access object.
- the target access object refers to a database object in an external database, including a database, table, partition, view, row, column, etc.
- the data processing engine caches the metadata of the target access object from the external database into the memory of the data processing engine.
- the metadata is used to describe the data attributes of the target access object to indicate the target access object.
- the external database stores a catalog of database objects. If there is a new database object in the external database, the metadata of the database object is written into the catalog. If the database object stored in the external database is updated, the metadata stored in the catalog is updated.
- the data processing engine caches the required metadata from the catalog of the external database to the catalog in the memory.
- the data processing engine accesses the metadata from the memory to execute the data processing job.
- the data processing engine accesses the metadata from the memory based on the data format of the metadata in the memory.
- the metadata is cached in the memory of the data processing engine, and the metadata is accessed from the memory, which can avoid accessing the directory of the external database through the network multiple times, thereby reducing network overhead, and at the same time avoiding the impact of network unavailability on the execution of data processing jobs, thereby improving the performance of the data processing engine and improving the efficiency of data processing.
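- as a rough illustration of this three-step flow, the sketch below caches the metadata of the target access object on first use and serves every later lookup from local memory; the class, method and parameter names are hypothetical and do not belong to any actual Spark SQL or Flink SQL API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

/** Hypothetical illustration of steps 201 to 203; not an actual engine API. */
public class MetadataCachingRunner {

    /** in-memory catalog: "database.table" -> metadata (e.g. column name -> column type) */
    private final Map<String, Map<String, String>> inMemoryCatalog = new ConcurrentHashMap<>();

    /**
     * @param database      external database determined in step 201, e.g. "mysql"
     * @param table         target access object determined in step 201, e.g. "table1"
     * @param remoteCatalog callback that fetches metadata from the external database's catalog
     *                      over the network; it is only invoked on a cache miss (step 202)
     * @return the metadata that step 203 reads from local memory to execute the job
     */
    public Map<String, String> metadataFor(String database, String table,
            BiFunction<String, String, Map<String, String>> remoteCatalog) {
        // cache the metadata once; later accesses never leave local memory,
        // so the catalog of the external database is not hit over the network again
        return inMemoryCatalog.computeIfAbsent(database + "." + table,
                key -> remoteCatalog.apply(database, table));
    }
}
```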
- FIG2 shows only the basic process of the present application, and the solution provided by the present application is further described below.
- FIG3 is a flow chart of a data processing method provided by an embodiment of the present application, as shown in FIG3, a data processing engine applied to a cloud service includes the following steps 301 to 306.
- a data processing engine determines an external database to be accessed by the data processing job and a target access object in the external database.
- when the data processing job is started, that is, after the data processing job is pulled up, the driver of the computing device is started.
- the data processing engine scans the query statement in the data processing job to determine the external database and the target access object.
- the data processing engine scans the keywords in the query statement to determine the external database and the target access object. For example, the query statement "alter table mysql.table1 add column column1 (col int)" indicates adding a column column1 to table1.
- the data processing engine determines that the external database to be accessed is mysql and the target access object is table1 in mysql by scanning the content "mysql.table1" after the keyword "table".
- the data processing engine includes a metadata cache interface, and determines the external database and the target access object by scanning the metadata cache interface called in the query statement.
- the query statement "catch (mysqlcatalog.table1)" indicates calling the metadata cache interface "mysqlcatalog.table1" to obtain the metadata of table1 in the mysql database.
- the data processing engine scans the metadata cache interface "mysqlcatalog.table1" called by the catch instruction, and determines that the external database is mysql and the target access object is table1 in mysql.
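- a minimal way to picture this scanning step is a pair of regular expressions, one for the object reference after the table keyword and one for an explicit cache-interface call such as catch (mysqlcatalog.table1); the patterns and class below are illustrative assumptions, not the engine's actual parser.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical query scanner for the examples above. */
public class QueryScanner {

    // matches the object reference after the TABLE keyword, e.g. "alter table mysql.table1 ..."
    private static final Pattern TABLE_REF =
            Pattern.compile("\\btable\\s+(\\w+)\\.(\\w+)", Pattern.CASE_INSENSITIVE);

    // matches an explicit cache-interface call, e.g. "catch (mysqlcatalog.table1)"
    private static final Pattern CACHE_CALL =
            Pattern.compile("\\bcatch\\s*\\(\\s*(\\w+?)catalog\\.(\\w+)\\s*\\)", Pattern.CASE_INSENSITIVE);

    /** returns {externalDatabase, targetObject}, or empty if no external object is referenced */
    public static Optional<String[]> scan(String statement) {
        Matcher cacheCall = CACHE_CALL.matcher(statement);
        if (cacheCall.find()) {
            return Optional.of(new String[] {cacheCall.group(1), cacheCall.group(2)});
        }
        Matcher tableRef = TABLE_REF.matcher(statement);
        if (tableRef.find()) {
            return Optional.of(new String[] {tableRef.group(1), tableRef.group(2)});
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // both statements resolve to external database "mysql" and target object "table1"
        System.out.println(Arrays.toString(scan("alter table mysql.table1 add column column1 (col int)").get()));
        System.out.println(Arrays.toString(scan("catch (mysqlcatalog.table1)").get()));
    }
}
```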
- the above step 301 is introduced by taking the database to be accessed as an external database as an example.
- the database associated with the data processing engine may include both an external database and an internal database, that is, the database to be accessed by the data processing job may be an external database or an internal database.
- the data processing engine stores information about the associated database, including information about the internal database and information about the external database.
- the data processing engine determines whether the database to be accessed in the query statement is an internal database or an external database based on the database information and the scan result of the query statement. If the database to be accessed is an internal database, the metadata of the target access object in the internal database is directly accessed from the memory. The specific access process is introduced in the subsequent steps and will not be repeated here.
- the data processing engine determines the external database to be accessed and the target access object by scanning the query statement in the data processing job, which is simple to operate and has accurate results.
- the data processing engine caches the metadata from the external database to a target subdirectory of a directory in the data processing engine memory through a metadata cache interface with the external database.
- the directory is used to store metadata of the database associated with the data processing engine, and the target subdirectory is used to store metadata from the external database.
- the metadata cache interface is used for data interaction between the data processing engine and the external database to cache metadata from the external database to the memory of the data processing engine.
- the external database provides a metadata cache interface, so that the data processing engine can localize the metadata in the external database, thereby speeding up the data processing engine's access to metadata and avoiding inaccessibility or slow access due to network reasons.
- the data processing engine also provides a metadata cache interface. Users edit query statements that call the metadata cache interface to cache metadata required for data processing jobs from the external database. It should be noted that the above-mentioned method of caching metadata through the metadata cache interface is only exemplary, and metadata can also be cached based on other methods. The embodiments of the present application do not limit this.
- the data processing engine caches the metadata to the target subdirectory of the directory in memory through the metadata cache interface.
- the data processing engine creates a new subdirectory under the directory in memory, which is also the target subdirectory.
- the data organization model of the directory in memory changes from the original three-layer structure (directory.database/data table.database object), for example, (catalog.database/schema.table), to a four-layer structure (directory.sub-directory.database/data table.database object), for example, (catalog.sub-catalog.database/schema.table).
- the structure of the data organization model of the directory in memory is changed to avoid conflicts between the path of accessing metadata from the external database from the memory and the path of accessing metadata in the external database through the network.
- the directory in memory already includes the target subdirectory, and the data processing engine stores the metadata directly to the target subdirectory without creating a new subdirectory, thereby improving the efficiency of caching metadata and avoiding the directory including multiple subdirectories indicating the same external database, thereby avoiding interference with subsequent access to the metadata.
- a subdirectory can be created to store metadata from different external databases separately, thereby facilitating access to the metadata.
- the subdirectories under the directory in the memory include not only the subdirectories of the external database, but also the subdirectories of the internal database, so that the metadata in the internal database can be accessed from the memory.
- the data processing engine selectively caches the metadata in the external database into the memory through the metadata cache interface based on the external database to be accessed and the target access object, instead of caching all the metadata in the external database. This can reduce the number of metadata cached in the external database by the data processing engine during the execution of the entire data processing job, thereby reducing the cache overhead and cache maintenance overhead.
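- the extra sub-catalog layer can be pictured as one more level of nesting in the in-memory structure, as in the hypothetical sketch below of the directory.subdirectory.database/data table.database object model; all class and method names are invented for this illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory catalog with the added sub-catalog layer. */
public class InMemoryCatalog {

    // sub-catalog -> database -> table -> metadata (e.g. column name -> column type)
    private final Map<String, Map<String, Map<String, Map<String, String>>>> root = new ConcurrentHashMap<>();

    /** stores metadata under catalog.subCatalog.database.table, creating the sub-catalog on first use */
    public void put(String subCatalog, String database, String table, Map<String, String> metadata) {
        root.computeIfAbsent(subCatalog, s -> new ConcurrentHashMap<>())
            .computeIfAbsent(database, d -> new ConcurrentHashMap<>())
            .put(table, metadata);
    }

    /** local lookup used in step 305, e.g. get("mysqlcatalog", "database", "table1") */
    public Map<String, String> get(String subCatalog, String database, String table) {
        return root.getOrDefault(subCatalog, Map.of())
                   .getOrDefault(database, Map.of())
                   .get(table);
    }

    /** cache-hit check used in step 302 to decide whether caching can be skipped */
    public boolean contains(String subCatalog, String database, String table) {
        return get(subCatalog, database, table) != null;
    }
}
```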
- step 302 is a step executed when the data processing engine determines that the metadata of the target access object is not stored in the memory.
- if the metadata is already stored in the memory, the data processing engine does not execute step 302, but executes the step of accessing the metadata from the memory to execute the data processing job, that is, it directly accesses the metadata from the memory.
- the data processing engine queries the subdirectory corresponding to the database to be accessed under the directory in the memory. If the directory does not include the subdirectory corresponding to the database, or the subdirectory corresponding to the database does not include the metadata of the target access object, it means that the metadata is not stored in the memory.
- for metadata that has already been cached in the memory, there is no need to cache it again, which can save storage resources and improve the efficiency of data processing.
- the data processing engine establishes a heartbeat channel between the memory and the external database.
- the heartbeat channel is used to determine whether the metadata in the external database is updated.
- the number of the heartbeat channels is equal to the number of the external databases.
- if there are multiple external databases, the memory establishes a heartbeat channel with each of the multiple external databases, and each external database corresponds to one heartbeat channel.
- the data processing engine regularly sends a data synchronization request to the external database through the heartbeat channel. After the external database receives the data synchronization request, it replies with a response message to the data processing engine, that is, it determines whether the metadata in the external database is updated through the heartbeat mechanism.
- alternatively, the external database may actively send a response message to the data processing engine on a regular basis; the embodiments of the present application do not limit this. After receiving the response message, the data processing engine determines the update status of the metadata in the external database based on the response message.
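- a heartbeat channel of this kind could be approximated with one periodic task per external database, as in the hypothetical sketch below; the 10-second interval and the Supplier/Runnable callbacks are assumptions for illustration, not details given in this application.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical heartbeat channels, one per external database (steps 303, 304 and 306). */
public class HeartbeatManager {

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    private final Map<String, ScheduledFuture<?>> channels = new ConcurrentHashMap<>();

    /** opens one heartbeat channel for the given external database */
    public void open(String externalDatabase, Supplier<Boolean> metadataUpdated, Runnable recache) {
        channels.computeIfAbsent(externalDatabase, db ->
                scheduler.scheduleAtFixedRate(() -> {
                    // periodic data synchronization request; the response indicates whether
                    // the metadata in the external database has been updated
                    if (metadataUpdated.get()) {
                        recache.run(); // step 304: re-cache the updated metadata into memory
                    }
                }, 0, 10, TimeUnit.SECONDS));
    }

    /** closes the channel once the data processing job finishes (step 306) */
    public void close(String externalDatabase) {
        ScheduledFuture<?> channel = channels.remove(externalDatabase);
        if (channel != null) {
            channel.cancel(true);
        }
    }
}
```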
- the above step 303 is explained by taking the establishment of a heartbeat channel after caching the metadata as an example.
- if the metadata to be accessed is already stored in the memory, the data processing engine does not execute the step of caching the metadata, but directly establishes a heartbeat channel between the memory and the external database, which can save storage resources.
- that is, if the metadata is not stored in the memory, the heartbeat channel is established after the metadata caching is completed;
- if the metadata is already stored in the memory, the data processing engine establishes the heartbeat channel as soon as the data processing job is started.
- if the metadata in the external database has been updated, the data processing engine caches the updated metadata from the external database into the memory.
- the data processing engine determines whether the metadata in the external database has been updated based on the response information; depending on the content of the response information, this determination is made in one of the following two ways.
- the first method: the external database stores a write-ahead log, which is used to record operations performed on the external database. If the write-ahead log includes an update operation on the metadata, the updated metadata is cached from the external database to the memory.
- the response information includes the write-ahead log (WAL) stored in the external database.
- the update operation of the metadata includes the addition, deletion, modification, and query operations on the database object corresponding to the metadata based on the data definition language (DDL).
- the data processing engine filters the operations recorded in the write-ahead log based on the target access object corresponding to the cached metadata. If the update operation of the target access object is obtained after filtering, it is determined that the metadata in the external database has been updated. For example, the external database is mysql, and the write-ahead log of mysql is binlog.
- the data processing engine identifies the change of metadata by searching for the update operation of the target access object in binlog.
- the previously cached metadata is deleted and the updated metadata is cached into the memory, which can save storage resources and avoid access conflicts between the metadata before and after the update in the memory.
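- the first method can be pictured as filtering the write-ahead log entries for operations that touch the cached target access object, as in the sketch below; the string-based entry format and the operation keywords are assumptions for illustration, and a real binlog would be read with a proper parser.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical write-ahead-log filter for the first method. */
public class WalFilter {

    /** e.g. isMetadataUpdated(entries, "mysql", "table1") */
    public static boolean isMetadataUpdated(List<String> walEntries, String database, String table) {
        String target = (database + "." + table).toLowerCase(Locale.ROOT);
        return walEntries.stream()
                .map(entry -> entry.toLowerCase(Locale.ROOT))
                // keep only update operations on the database object corresponding to the cached metadata
                .anyMatch(entry -> entry.contains(target)
                        && (entry.startsWith("alter table") || entry.startsWith("create table")
                            || entry.startsWith("drop table") || entry.startsWith("truncate table")));
    }
}
```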
- the second method: if the metadata in the memory is inconsistent with the metadata in the external database, the updated metadata is cached from the external database into the memory.
- the response information includes the metadata stored in the external database, that is, the external database sends the metadata to the data processing engine.
- the external database determines the target access object corresponding to the metadata cached by the data processing engine based on the call record of the metadata cache interface, and then regularly sends the metadata of the target access object to the data processing engine for verification. It should be noted that the above method is only an example of an external database determining the metadata to be sent to the data processing engine, which can be customized according to actual needs, and the embodiments of the present application are not limited to this.
- the data processing engine receives the response information and obtains the metadata in the external database.
- the data processing engine verifies the metadata in the memory with the metadata obtained from the response information. If they are consistent, it means that the metadata in the external database has not been updated. If they are inconsistent, it means that the metadata in the external database has been updated.
- the data processing engine caches the updated metadata in the memory.
- the method of verifying the metadata includes hash value verification, MD5 verification and CRC32 verification, etc. The verification method can be determined according to actual needs, and the embodiment of the present application does not limit this.
- by verifying whether the metadata in the external database is consistent with the metadata cached in the memory, it is confirmed whether the metadata in the external database has been updated.
- this verification has high accuracy; moreover, since the response information already includes the metadata, if the metadata in the external database has been updated, the metadata can be stored into the memory directly, without caching it through the metadata cache interface, which can improve the efficiency of data processing.
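- the second method amounts to comparing a digest of the cached metadata with a digest of the metadata carried in the response information; the sketch below uses MD5, one of the verification options mentioned above, over a canonical string form of the metadata, and the serialization and class names are assumptions (HexFormat requires Java 17 or later).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Hypothetical consistency check for the second method. */
public class MetadataVerifier {

    /** returns true if the cached metadata matches the metadata reported by the external database */
    public static boolean isConsistent(String cachedMetadata, String reportedMetadata) {
        return digest(cachedMetadata).equals(digest(reportedMetadata));
    }

    private static String digest(String canonicalMetadata) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return HexFormat.of().formatHex(md5.digest(canonicalMetadata.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest not available", e);
        }
    }
}
```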
- the response information is encrypted to ensure the security of the metadata in the response information.
- the method of encrypting the response information can be customized according to actual needs, and the embodiments of the present application do not limit this.
- the data processing engine accesses the metadata from the memory to execute the data processing job.
- the data processing engine accesses the metadata from the memory based on the data format of the metadata in the memory, that is, the data organization model of the directory in the memory, for example, catalog.mysqlcatalog.database.table1.
- the data processing engine accesses the metadata in the memory and performs the query.
- the query statement is parsed, optimized, and a physical execution plan is generated.
- the data processing engine executes the physical execution plan, that is, executes the data processing job.
- the above steps 302 to 305 are explained by taking the data processing engine first caching metadata into the memory and then accessing the metadata from the memory to perform the data processing operation as an example.
- in other embodiments, the data processing engine remotely accesses the external database through the network to obtain the metadata and perform the data processing job, and caches the metadata while executing the data processing job;
- alternatively, the metadata is first cached into the memory through the metadata cache interface, and then the metadata is accessed from the memory;
- the embodiments of the present application are not limited to this.
- the embodiments of the present application are described by taking the example of first updating the metadata and then executing the data processing job.
- the processes of updating the metadata and executing the data processing job can be carried out in parallel.
- the data processing job can also be re-executed based on the updated metadata.
- in response to completion of execution of the data processing job, the data processing engine closes the heartbeat channel.
- the heartbeat channel is closed after the data processing job is executed, which can save computing resources and improve the processing performance of the data processing engine.
- FIG. 4 is a schematic diagram of a data processing method provided by an embodiment of the present application.
- the Flink SQL engine determines the target access object in the external database that needs to be accessed by scanning the data processing job, and caches part of the metadata in the external database into the directory in the memory of the Flink SQL engine through the metadata cache interface provided by the Flink SQL engine, so as to improve access efficiency.
- a layer of "sub-catalog" needs to be added to the directory structure to avoid conflicts in the access paths of the metadata.
- the Flink SQL engine establishes a heartbeat channel between the memory and the external database to identify whether the metadata in the external database is updated to avoid the occurrence of metadata inconsistency problems.
- the Flink SQL engine iterates over all query statements in the data processing job, accesses the metadata in the memory, parses and optimizes each query statement and generates a physical execution plan for it, and closes the heartbeat channel after the data processing job is executed.
- the metadata in the external database is cached in the memory of the data processing engine, and the metadata is accessed from the memory, which can avoid accessing the directory of the external database through the network multiple times, thereby reducing network overhead, while avoiding the impact of network unavailability on the execution of data processing jobs, improving the performance of the data processing engine, and improving the efficiency of data processing. Furthermore, by adding a layer of subdirectories in the directory structure, conflicts between the path for accessing the metadata in the memory and the path for accessing the metadata in the external database are avoided; by establishing a heartbeat channel between the memory and the external database, the metadata is regularly updated during the execution of the data processing job to ensure the consistency of the metadata accessed from the memory and the metadata in the external database.
- Fig. 5 is a structural block diagram of a data processing device provided in an embodiment of the present application. As shown in Fig. 5 , the device is applied to a data processing engine of a cloud service, and the device includes a confirmation module 501 , a cache module 502 and an execution module 503 .
- the confirmation module 501 is used to determine the external database to be accessed by the data processing job and the target access object in the external database.
- the cache module 502 is used to cache the metadata of the target access object from the external database into the memory of the data processing engine.
- the execution module 503 is used to access the metadata from the memory to execute the data processing operation.
- the device further includes:
- the communication module is used to establish a heartbeat channel between the memory and the external database.
- the heartbeat channel is used to determine whether the metadata in the external database is updated.
- the number of the heartbeat channels is equal to the number of the external databases.
- the cache module 502 is also used to cache the updated metadata from the external database into the memory if the metadata in the external database has been updated.
- the communication module is further used for:
- in response to completion of execution of the data processing job, close the heartbeat channel.
- the external database stores a write-ahead log
- the write-ahead log is used to record operations performed on the external database.
- the cache module 502 is used to:
- if the write-ahead log includes an update operation on the metadata, cache the updated metadata from the external database into the memory.
- the cache module 502 is used to:
- if the metadata in the memory is inconsistent with the metadata in the external database, cache the updated metadata from the external database into the memory.
- the cache module 502 is used to:
- the metadata is cached from the external database to a target subdirectory of a directory in the memory of the data processing engine, the directory being used to store metadata of a database associated with the data processing engine, and the target subdirectory being used to store metadata from the external database.
- the execution module 503 is further configured to:
- if the metadata is already stored in the memory, perform the step of accessing the metadata from the memory to execute the data processing job.
- the confirmation module 501, the cache module 502, and the execution module 503 can all be implemented by software, or can be implemented by hardware.
- the implementation of these modules is introduced below by taking the confirmation module 501 as an example.
- the implementation of the cache module 502 and the execution module 503 can refer to the implementation of the confirmation module 501.
- the confirmation module 501 may include code running on a computing instance.
- the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Further, the above-mentioned computing instance may be one or more.
- the confirmation module 501 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region (region) or in different regions.
- the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. Usually, a region may include multiple AZs.
- multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or in multiple VPCs.
- a VPC is set up in a region.
- a communication gateway needs to be set up in each VPC to achieve interconnection between VPCs through the communication gateway.
- the confirmation module 501 may include at least one computing device, such as a server, etc.
- the confirmation module 501 may also be a device implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
- the PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
- the multiple computing devices included in the confirmation module 501 can be distributed in the same region or in different regions.
- the multiple computing devices included in the confirmation module 501 can be distributed in the same AZ or in different AZs.
- the multiple computing devices included in the confirmation module 501 can be distributed in the same VPC or in multiple VPCs.
- the multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
- the steps that the above modules are responsible for implementing can be specified as needed, and all the functions of the above device can be realized by respectively implementing different steps in the above data processing method through the above modules. That is, the data processing device provided in the above embodiment only uses the division of the above functional modules as an example when implementing data processing. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the device provided in the above embodiment belongs to the same concept as the corresponding method embodiment. The specific implementation process is detailed in the method embodiment, which will not be repeated here.
- FIG6 is a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
- the computing device 600 includes: a bus 601, a processor 602, a memory 603, and a communication interface 604.
- the processor 602, the memory 603, and the communication interface 604 communicate with each other through the bus 601.
- the computing device 600 can be a server or a terminal device. It should be understood that the present application does not limit the number of processors and memories in the computing device 600.
- the bus 601 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
- the bus may be divided into an address bus, a data bus, a control bus, etc.
- the bus is represented by only one line in FIG. 6, but this does not mean that there is only one bus or one type of bus.
- the bus 601 may include a path for transmitting information between various components of the computing device 600 (e.g., the memory 603, the processor 602, and the communication interface 604).
- Processor 602 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
- the memory 603 may include a volatile memory, such as a random access memory (RAM).
- the memory 603 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
- the memory 603 stores executable program codes, and the processor 602 executes the executable program codes to respectively implement the functions of the aforementioned confirmation module 501, cache module 502 and execution module 503, thereby implementing the data processing method. That is, the memory 603 stores instructions for executing the data processing method.
- the communication interface 604 uses a transceiver module such as, but not limited to, a network interface card or a transceiver to implement communication between the computing device 600 and other devices or a communication network.
- FIG7 is a schematic diagram of a computing device cluster provided in an embodiment of the present application.
- the computing device cluster includes at least one computing device.
- the computing device may be a server, such as a central server, an edge server, or a local server in a local data center.
- the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.
- the computing device cluster includes at least one computing device 600.
- the memory 603 in one or more computing devices 600 in the computing device cluster may store the same instructions for executing the data processing method.
- the memory 603 of one or more computing devices 600 in the computing device cluster may also store partial instructions for executing the data processing method.
- the combination of one or more computing devices 600 may jointly execute instructions for executing the data processing method.
- the memory 603 in different computing devices 600 in the computing device cluster can store different instructions, which are respectively used to execute part of the functions of the data processing device. That is, the instructions stored in the memory 603 in different computing devices 600 can implement the functions of one or more modules among the confirmation module 501, the cache module 502 and the execution module 503.
- one or more computing devices in the computing device cluster may be connected via a network.
- the network may be a wide area network or a local area network, etc.
- FIG. 8 shows a possible implementation.
- FIG. 8 is a schematic diagram of a possible implementation of a computing device cluster provided in an embodiment of the present application. As shown in FIG. 8 , two computing devices 600A and 600B are connected via a network. Specifically, the network is connected via a communication interface in each computing device.
- the memory 603 in the computing device 600A stores instructions for executing the functions of the confirmation module 501.
- the memory 603 in the computing device 600B stores instructions for executing the functions of the cache module 502 and the execution module 503.
- the connection mode of the computing device cluster shown in FIG. 8 may take into account that the data processing method provided in the present application requires a large amount of data storage; therefore, the functions implemented by the cache module 502 and the execution module 503 are handed over to the computing device 600B for execution.
- the functions of the computing device 600A shown in FIG8 may also be completed by multiple computing devices 600.
- the functions of the computing device 600B may also be completed by multiple computing devices 600.
- the embodiment of the present application also provides another computing device cluster.
- the connection relationship between the computing devices in the computing device cluster can be similar to the connection mode of the computing device cluster shown in Figures 7 and 8.
- the difference is that the memory 603 in one or more computing devices 600 in the computing device cluster can store the same instructions for executing the data processing method.
- the memory 603 of one or more computing devices 600 in the computing device cluster may also store partial instructions for executing the data processing method.
- the combination of one or more computing devices 600 may jointly execute instructions for executing the data processing method.
- the embodiment of the present application also provides a computer program product including instructions.
- the computer program product may be a software or program product including instructions that can be run on a computing device or stored in any available medium.
- when the computer program product is run on a computing device, the computing device is caused to execute the data processing method.
- the embodiment of the present application also provides a computer-readable storage medium.
- the computer-readable storage medium can be any available medium that can be accessed by the computing device, or a data storage device such as a data center containing one or more available media.
- the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state hard disk).
- the computer-readable storage medium includes instructions that instruct the computing device to execute the data processing method.
- it should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the metadata involved in the requests is obtained with full authorization.
- the disclosed systems, devices and methods can be implemented in other ways.
- the device embodiments described above are only schematic.
- the division of the unit is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
- the mutual coupling or direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, or it can be an electrical, mechanical or other form of connection.
- the unit described as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
- each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit may be implemented in the form of hardware or software unit.
- the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- the technical solution of the present application is essentially or the part that contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including a number of instructions to enable a computing device (which can be a personal computer, server, or computing device, etc.) to execute all or part of the steps of the method in each embodiment of the present application.
- the aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc., various media that can store program code.
- the terms "first", "second", etc. are used to distinguish between identical or similar items having substantially the same effects and functions. It should be understood that there is no logical or temporal dependency between "first", "second", and "nth", nor is there a limitation on quantity and execution order. It should also be understood that although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another. For example, without departing from the scope of various examples, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image. Both the first image and the second image may be images, and in some cases, may be separate and different images.
- the term “if” may be interpreted to mean “when” or “upon” or “in response to determining” or “in response to detecting.”
- the phrase "if it is determined that..." or "if [a stated condition or event] is detected" may be interpreted to mean "upon determining that..." or "in response to determining that..." or "upon detecting [a stated condition or event]" or "in response to detecting [a stated condition or event]," depending on the context.
- the computer program product includes one or more computer program instructions.
- the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer program instructions can be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired or wireless means.
- the computer readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server or data center that includes one or more available media.
- the available medium can be a magnetic medium (such as a floppy disk, a hard disk, or a tape), an optical medium (such as a digital video disc (DVD)), or a semiconductor medium (such as a solid state drive), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application provides a data processing method, a data processing engine, a computing device and a storage medium, and belongs to the field of computer technology. The method includes: determining an external database to be accessed by a data processing job and a target access object in the external database; caching metadata of the target access object from the external database into a memory of the data processing engine; and accessing the metadata from the memory to execute the data processing job. With the above method, the metadata in the external database is cached in the memory of the data processing engine and accessed from the memory, which can avoid accessing the directory of the external database through the network multiple times, thereby reducing network overhead, while avoiding the impact of network unavailability on the execution of data processing jobs, improving the performance of the data processing engine and improving the efficiency of data processing.
Description
This application claims priority to the Chinese patent application No. 202211309305.1, filed on October 25, 2022 and entitled "Data processing method, apparatus, computing device cluster and storage medium", and to the Chinese patent application No. 202211679701.3, filed on December 26, 2022 and entitled "Data processing method, data processing engine, computing device and storage medium", both of which are incorporated herein by reference in their entirety.
The present application relates to the field of computer technology, and in particular to a data processing method, a data processing engine, a computing device and a storage medium.
When executing a data processing job, a data processing engine, such as the Spark SQL engine, needs to obtain, through a catalog, the metadata of database objects such as databases, tables, partitions and views, as well as functions and information stored in other external systems, in order to process the data. Therefore, the catalog is also called the "brain" of the big data processing engine.
In the related art, a big data processing engine can access the catalog of an external database through a network to obtain metadata. However, when the big data processing engine needs to frequently access the catalog of the external database, or needs to obtain the same metadata through the network multiple times, network delay and jitter are caused, resulting in a decrease in the processing performance of the big data processing engine and a decrease in data processing efficiency.
Summary of the invention
The embodiments of the present application provide a data processing method, a data processing engine, a computing device and a storage medium, which can avoid accessing the directory of an external database through the network multiple times, thereby reducing network overhead, while avoiding the impact of network unavailability on the execution of data processing jobs, improving the performance of the data processing engine and improving the efficiency of data processing. The technical solution is as follows.
In a first aspect, a data processing method is provided, which is applied to a data processing engine of a cloud service, and the method includes:
determining an external database to be accessed by a data processing job and a target access object in the external database; caching metadata of the target access object from the external database into a memory of the data processing engine; and accessing the metadata from the memory to execute the data processing job.
In the above method, the metadata in the external database is cached into the memory of the data processing engine, and the metadata is accessed from the memory, which can avoid accessing the directory of the external database through the network multiple times, thereby reducing network overhead, while avoiding the impact of network unavailability on the execution of data processing jobs, improving the performance of the data processing engine and improving the efficiency of data processing.
Optionally, the method further includes: establishing a heartbeat channel between the memory and the external database, where the heartbeat channel is used to determine whether the metadata in the external database is updated, and the number of heartbeat channels is equal to the number of external databases; and if the metadata in the external database has been updated, caching the updated metadata from the external database into the memory.
If the data processing job needs to access multiple external databases, there can be multiple heartbeat channels to separately determine whether the data in each external database has been updated; further, if the metadata of the target access object is not stored in the memory, the heartbeat channel is established after the metadata caching ends; if the metadata of the target access object is already stored in the memory, the heartbeat channel is established as soon as the data processing job is started.
In the above method, since updates to the data in the external database are reflected in the metadata, once the metadata is updated, it indicates that the data in the external database has been updated. By establishing a heartbeat channel between the memory of the data processing engine and the external database, it can be periodically confirmed whether the metadata in the external database has been updated, so as to ensure the consistency between the metadata in the memory and the metadata in the external database, that is, the consistency of the structure of the target access object, thereby ensuring the correct execution of the data processing job.
Optionally, the method further includes: in response to completion of execution of the data processing job, closing the heartbeat channel.
In the above method, the heartbeat channel is closed after the data processing job is executed, which can save computing resources and improve the processing performance of the data processing engine.
Optionally, the external database stores a write-ahead log, and the write-ahead log is used to record operations performed on the external database; and the caching, if the metadata in the external database has been updated, of the updated metadata from the external database into the memory includes: if the write-ahead log includes an update operation on the metadata, caching the updated metadata from the external database into the memory.
In the above method, when the external database stores a write-ahead log, whether the metadata in the external database has been updated is confirmed based on the operation information recorded in the write-ahead log, which can quickly determine the update status of the metadata in the external database, save computing resources and improve the efficiency of the data processing job.
Optionally, the caching, if the metadata in the external database has been updated, of the updated metadata from the external database into the memory includes: if the metadata in the memory is inconsistent with the metadata in the external database, caching the updated metadata from the external database into the memory.
In the above method, whether the metadata in the external database has been updated is confirmed by verifying whether the metadata in the external database is consistent with the metadata cached in the memory, and the verification has high accuracy.
Optionally, the caching the metadata of the target access object from the external database into the memory of the data processing engine includes: caching the metadata from the external database to a target subdirectory of a directory in the memory of the data processing engine, where the directory is used to store metadata of a database associated with the data processing engine, and the target subdirectory is used to store metadata from the external database.
In the above method, the metadata from the external database is stored in the memory according to the structure of directory.subdirectory.database/data table.database object, which can avoid conflicts between the path of accessing the metadata from the external database from the memory and the path of accessing the metadata in the external database through the network.
Optionally, after the determining the external database to be accessed by the data processing job and the target access object in the external database, the method further includes: if the metadata is already stored in the memory, executing the step of accessing the metadata from the memory to execute the data processing job.
In the above method, the metadata that has already been cached in the memory does not need to be cached again, which can save storage resources and improve the efficiency of data processing.
According to a second aspect, a data processing apparatus is provided, applied to a data processing engine of a cloud service, the apparatus including:
a determining module, configured to determine an external database to be accessed by a data processing job and a target access object in the external database;
a caching module, configured to cache metadata of the target access object from the external database into a memory of the data processing engine; and
an execution module, configured to access the metadata from the memory to execute the data processing job.
In a possible implementation, the apparatus further includes:
a communication module, configured to establish a heartbeat channel between the memory and the external database, where the heartbeat channel is used to determine whether the metadata in the external database has been updated, and the number of heartbeat channels is equal to the number of external databases.
The caching module is further configured to, if the metadata in the external database has been updated, cache the updated metadata from the external database into the memory.
In a possible implementation, the communication module is further configured to:
close the heartbeat channel in response to completion of the execution of the data processing job.
In a possible implementation, the external database stores a write-ahead log, the write-ahead log is used to record operations performed on the external database, and the caching module is configured to:
if the write-ahead log includes an update operation on the metadata, cache the updated metadata from the external database into the memory.
In a possible implementation, the caching module is configured to:
if the metadata in the memory is inconsistent with the metadata in the external database, cache the updated metadata from the external database into the memory.
In a possible implementation, the caching module is configured to:
cache the metadata from the external database into a target sub-catalog of a catalog in the memory of the data processing engine, where the catalog is used to store the metadata of the databases associated with the data processing engine, and the target sub-catalog is used to store the metadata from the external database.
In a possible implementation, the execution module is further configured to:
if the metadata is already stored in the memory, perform the step of accessing the metadata from the memory to execute the data processing job.
According to a third aspect, a data processing engine of a cloud service is provided. The data processing engine includes a metadata caching interface, and the metadata caching interface is used to cache metadata of a target access object from an external database into a memory of the data processing engine, so as to implement the data processing method provided in the first aspect or any optional manner of the first aspect.
According to a fourth aspect, a computing device is provided. The computing device includes a processor and a memory, and the processor of the computing device is configured to execute instructions stored in the memory of the computing device, so that the computing device performs the data processing method provided in the first aspect or any optional manner of the first aspect.
According to a fifth aspect, a computing device cluster is provided. The computing device cluster includes at least one computing device, and each computing device includes a processor and a memory.
The processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the data processing method provided in the first aspect or any optional manner of the first aspect.
According to a sixth aspect, a computer program product is provided. The computer program product includes computer instructions stored in a computer-readable storage medium. A processor of a computing device reads the computer instructions from the computer-readable storage medium and executes them, so that the computing device performs the data processing method provided in the first aspect or any optional manner of the first aspect.
According to a seventh aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is read by a processor to cause a computing device to perform the data processing method provided in the first aspect or any optional manner of the first aspect.
FIG. 1 is a schematic diagram of an application scenario of a data processing method according to an embodiment of this application;
FIG. 2 is a flowchart of a data processing method according to an embodiment of this application;
FIG. 3 is a flowchart of a data processing method according to an embodiment of this application;
FIG. 4 is a schematic diagram of a data processing method according to an embodiment of this application;
FIG. 5 is a structural block diagram of a data processing apparatus according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a computing device according to an embodiment of this application;
FIG. 7 is a schematic diagram of a computing device cluster according to an embodiment of this application;
FIG. 8 is a schematic diagram of a possible implementation of a computing device cluster according to an embodiment of this application.
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings.
The following describes the application scenarios of the embodiments of this application.
The data processing method provided in the embodiments of this application can be applied to a scenario in which a data processing engine of a cloud service accesses metadata to execute a data processing job. FIG. 1 is a schematic diagram of an application scenario of a data processing method according to an embodiment of this application. As shown in FIG. 1, a data processing engine runs on a computing device and is used to execute data processing jobs. The data processing engine may be Spark SQL, Flink SQL, or the like, which is not limited in the embodiments of this application. The process in which the data processing engine executes a data processing job includes steps such as structured query language (SQL) parsing, catalog lookup, analysis, optimization, physical planning, and program execution, where program execution includes resilient distributed dataset programs, datastream programs, and map-reduce programs.
The data processing engine includes an in-memory store, a metadata management system (hive metastore, HMS), and a catalog application programming interface (API). The memory includes at least one cache and is used to store the catalog of the databases associated with the data processing engine. The metadata management system is used to manage the catalog of the databases associated with the data processing engine. The catalog interface is used to access the catalogs in the memory and in the metadata management system, obtain the metadata of database objects such as databases (DB), tables, partitions, and views, and execute the data processing job. In some embodiments, the data processing engine is associated with an external database, that is, an external service (extend service), and the data processing engine accesses the catalog in the external database through the catalog interface to obtain the metadata in the external database. The data processing engine communicates with the external database through a wired or wireless network.
In some embodiments, the wireless or wired network uses standard communication technologies and/or protocols. The network is typically the Internet, but may be any network, including but not limited to any combination of a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile network, a wired or wireless network, a private network, or a virtual private network. In some embodiments, the data processing engine and the external database communicate based on Java database connectivity (JDBC), through the remote procedure call protocol (RPC), or through the hypertext transfer protocol (HTTP). In some embodiments, technologies and/or formats including the hypertext markup language (HTML) and the extensible markup language (XML) are used between the data processing engine and the external database to represent the metadata exchanged over the network. In addition, conventional encryption technologies such as the secure sockets layer (SSL), transport layer security (TLS), virtual private network (VPN), and internet protocol security (IPsec) can be used to encrypt all or some of the links. In other embodiments, customized and/or dedicated data communication technologies may also be used in place of or in addition to the foregoing data communication technologies.
The foregoing describes the application scenarios of the embodiments of this application. The following describes a flowchart of a data processing method according to an embodiment of this application. FIG. 2 is a flowchart of a data processing method according to an embodiment of this application. As shown in FIG. 2, the method is applied to a data processing engine of a cloud service and includes the following steps 201 to 203.
201. The data processing engine determines an external database to be accessed by a data processing job and a target access object in the external database.
The data processing job refers to a series of program instructions executed in order by the data processing engine, that is, query statements. The query statements include statements that perform insert, delete, update, and query operations on the target access object, as well as statements that fetch the target access object. The target access object refers to a database object in the external database, including databases, tables, partitions, views, rows, columns, and the like.
202. The data processing engine caches metadata of the target access object from the external database into a memory of the data processing engine.
The metadata describes the data attributes of the target access object and thereby identifies the target access object. The external database stores a catalog of database objects. If a new database object is added to the external database, the metadata of that database object is written into the catalog; if a database object already stored in the external database is updated, the metadata already stored in the catalog is updated. The data processing engine caches the required metadata from the catalog of the external database into the catalog in the memory.
203. The data processing engine accesses the metadata from the memory to execute the data processing job.
The data processing engine accesses the metadata from the memory based on the data format of the metadata in the memory.
With the technical solutions in the embodiments of this application, the metadata is cached into the memory of the data processing engine and accessed from the memory, which avoids repeatedly accessing the catalog of the external database over the network, thereby reducing network overhead, preventing network unavailability from affecting the execution of the data processing job, improving the performance of the data processing engine, and increasing the efficiency of data processing.
FIG. 2 shows only the basic procedure of this application; the solution provided in this application is further elaborated below. FIG. 3 is a flowchart of a data processing method according to an embodiment of this application. As shown in FIG. 3, the method is applied to a data processing engine of a cloud service and includes the following steps 301 to 306.
301. In response to the start of a data processing job, the data processing engine determines an external database to be accessed by the data processing job and a target access object in the external database.
The start of the data processing job means that, after the data processing job is launched, the driver of the computing device starts. The data processing engine scans the query statements in the data processing job to determine the external database and the target access object. In some embodiments, the data processing engine scans keywords in the query statements to determine the external database and the target access object. For example, the query statement "alter table mysql.table1 add column column1(col int)" indicates adding a column column1 to table1; by scanning the content "mysql.table1" following the keyword "table", the data processing engine determines that the external database to be accessed is mysql and the target access object is table1 in mysql. In other embodiments, the data processing engine includes a metadata caching interface, and the external database and the target access object are specified by scanning the metadata caching interface invoked in the query statement. For example, the query statement "catch(mysqlcatalog.table1)" indicates invoking the metadata caching interface "mysqlcatalog.table1" to obtain the metadata of table1 in the mysql database; the data processing engine scans the metadata caching interface "mysqlcatalog.table1" invoked by the catch instruction and specifies that the external database is mysql and the target access object is table1 in mysql.
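By way of illustration only, the following is a minimal sketch of the keyword-scanning approach described above. The class and method names (QueryTargetScanner, scanTarget, QueryTarget) and the regular expression are assumptions introduced for this example; they are not part of Spark SQL, Flink SQL, or the engine described in this application.

```java
// Minimal sketch: derive the external database and target access object from a query
// statement by keyword scanning, as in the "alter table mysql.table1 ..." example above.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class QueryTargetScanner {

    /** Holds the external database name and the object to be accessed. */
    public record QueryTarget(String database, String object) { }

    // Matches "table <db>.<object>" (case-insensitive), e.g. "table mysql.table1".
    private static final Pattern TABLE_REF =
            Pattern.compile("\\btable\\s+([A-Za-z_][\\w]*)\\.([A-Za-z_][\\w]*)",
                            Pattern.CASE_INSENSITIVE);

    public static QueryTarget scanTarget(String sql) {
        Matcher m = TABLE_REF.matcher(sql);
        if (m.find()) {
            return new QueryTarget(m.group(1), m.group(2));
        }
        return null; // no external reference found in this statement
    }

    public static void main(String[] args) {
        QueryTarget t = scanTarget("alter table mysql.table1 add column column1(col int)");
        System.out.println(t); // QueryTarget[database=mysql, object=table1]
    }
}
```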
It should be noted that the above two methods for determining the external database to be accessed and the target access object are merely exemplary and can be customized according to actual requirements, which is not limited in the embodiments of this application.
It should also be noted that step 301 is described by taking a case where the database to be accessed is an external database as an example. In some embodiments, the databases associated with the data processing engine include both external databases and internal databases, and the databases to be accessed in a single data processing job may likewise include both external databases and internal databases. In some embodiments, the data processing engine stores information about the associated databases, including information about the internal databases and information about the external databases, and determines, based on the database information and the result of scanning the query statements, whether a database to be accessed in a query statement is an internal database or an external database. If the database to be accessed is an internal database, the metadata of the target access object in the internal database is accessed directly from the memory; the specific access process is described in subsequent steps and is not repeated here.
In the above method, the data processing engine determines the external database to be accessed and the target access object by scanning the query statements in the data processing job, which is simple to perform and yields accurate results.
302. The data processing engine caches, through a metadata caching interface between the data processing engine and the external database, the metadata from the external database into a target sub-catalog of the catalog in the memory of the data processing engine, where the catalog is used to store the metadata of the databases associated with the data processing engine, and the target sub-catalog is used to store the metadata from the external database.
The metadata caching interface is used for data interaction between the data processing engine and the external database, so as to cache the metadata from the external database into the memory of the data processing engine. The external database provides a metadata caching interface, which enables the data processing engine to localize the metadata in the external database, thereby speeding up the engine's access to the metadata and avoiding inaccessibility or slow access caused by the network. The data processing engine also provides a metadata caching interface, and the user writes query statements that invoke this metadata caching interface to cache, from the external database, the metadata required by the data processing job. It should be noted that caching metadata through a metadata caching interface is merely an example; metadata may also be cached in other ways, which is not limited in the embodiments of this application.
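The following is a hedged sketch of what such a metadata caching interface might look like. The interface name, its methods, and the Metadata record are assumptions made for illustration; they are not the API of Spark SQL, Flink SQL, or any particular database.

```java
// Hedged sketch of a metadata caching interface between the engine and an external database:
// fetch metadata over the network once, then write it into the engine's in-memory catalog.
import java.util.List;
import java.util.Map;

public interface MetadataCacheInterface {

    /** Opaque description of a database object (table, view, partition, ...). */
    record Metadata(String objectName, Map<String, String> properties) { }

    /** Fetch the metadata of the named objects from the external database. */
    List<Metadata> fetch(String externalDatabase, List<String> objectNames);

    /** Write fetched metadata into the engine's in-memory catalog under a sub-catalog. */
    void cacheInto(String subCatalog, List<Metadata> metadata);
}
```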
The data processing engine caches the metadata into the target sub-catalog of the catalog in the memory through the metadata caching interface. The data processing engine creates a new sub-catalog under the catalog in the memory, and this sub-catalog is the target sub-catalog. After the sub-catalog is created, the data organization model of the catalog in the memory changes from the original three-layer structure (catalog.database/table.database object), for example, (catalog.database/schema.table), to a four-layer structure (catalog.sub-catalog.database/table.database object), for example, (catalog.sub-catalog.database/schema.table). In the above method, by changing the structure of the data organization model of the catalog in the memory, conflicts are avoided between the path for accessing, from the memory, the metadata from the external database and the path for accessing the metadata in the external database over the network.
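As an illustration of the four-layer layout just described, the following sketch models the in-memory catalog as nested maps keyed by sub-catalog, database, and object. The class name InMemoryCatalog, its methods, and the nested-map representation are assumptions for this example only, not the engine's actual data structures.

```java
// Sketch of the four-layer in-memory catalog layout:
// catalog -> sub-catalog -> database/schema -> database object.
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public final class InMemoryCatalog {

    // subCatalog -> database -> objectName -> serialized metadata
    private final Map<String, Map<String, Map<String, String>>> layers = new ConcurrentHashMap<>();

    /** Store metadata under catalog.subCatalog.database.object. */
    public void put(String subCatalog, String database, String object, String metadata) {
        layers.computeIfAbsent(subCatalog, k -> new ConcurrentHashMap<>())
              .computeIfAbsent(database, k -> new ConcurrentHashMap<>())
              .put(object, metadata);
    }

    /** Resolve a path such as "mysqlcatalog.database.table1". */
    public Optional<String> resolve(String path) {
        String[] parts = path.split("\\.", 3);
        if (parts.length < 3) {
            return Optional.empty();
        }
        return Optional.ofNullable(layers.getOrDefault(parts[0], Map.of())
                                         .getOrDefault(parts[1], Map.of())
                                         .get(parts[2]));
    }
}
```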
In some embodiments, the catalog in the memory already includes the target sub-catalog, and the data processing engine stores the metadata directly into the target sub-catalog without creating a new sub-catalog, which improves the efficiency of caching metadata and prevents the catalog from containing multiple sub-catalogs pointing to the same external database, thereby avoiding interference with subsequent metadata access.
It should be noted that, in the above method, a sub-catalog may be created for each external database, so that the metadata from different external databases is stored separately, which facilitates access to the metadata. In addition, the sub-catalogs under the catalog in the memory include not only the sub-catalogs of external databases but also the sub-catalogs of internal databases, so that the metadata in the internal databases can also be accessed from the memory.
In the above method, based on the external database to be accessed and the target access object, the data processing engine selectively caches the metadata in the external database into the memory through the metadata caching interface, rather than caching all the metadata in the external database. This reduces the amount of metadata from the external database that the data processing engine caches during the execution of the entire data processing job, thereby reducing the caching overhead and the cache maintenance overhead.
It should be noted that step 302 is performed when the data processing engine determines that the metadata of the target access object is not stored in the memory. In some embodiments, the metadata is already stored in the memory; in that case, the data processing engine does not perform step 302 but performs the step of accessing the metadata from the memory to execute the data processing job, that is, accesses the metadata directly from the memory. The data processing engine looks up, under the catalog in the memory, the sub-catalog corresponding to the database to be accessed; if the catalog does not include a sub-catalog for that database, or the sub-catalog for that database does not include the metadata of the target access object, the metadata is not stored in the memory. In the above method, metadata that has already been cached in the memory does not need to be cached again, which saves storage resources and improves the efficiency of data processing.
303. If the metadata has been cached, the data processing engine establishes a heartbeat channel between the memory and the external database, where the heartbeat channel is used to determine whether the metadata in the external database has been updated, and the number of heartbeat channels is equal to the number of external databases.
If the data processing job needs to access multiple external databases, that is, in a multi-catalog scenario, the memory establishes heartbeat channels with the multiple external databases, with one heartbeat channel per external database. In some embodiments, the data processing engine periodically sends a data synchronization request to the external database through the heartbeat channel, and the external database replies with response information after receiving the data synchronization request; that is, whether the metadata in the external database has been updated is determined through a heartbeat mechanism. In other embodiments, the external database proactively and periodically sends response information to the data processing engine, which is not limited in the embodiments of this application. After receiving the response information, the data processing engine determines, based on the response information, the update status of the metadata in the external database.
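A minimal sketch of one such heartbeat channel is given below, continuing the illustrative InMemoryCatalog above: a scheduled task periodically asks the external database whether its metadata changed and re-caches it if so. The MetadataSource interface and the refresh logic are assumptions for illustration, not the method's actual implementation.

```java
// Hedged sketch: one heartbeat channel per external database.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class HeartbeatChannel implements AutoCloseable {

    /** What the engine needs from one external database for heartbeat checks. */
    public interface MetadataSource {
        boolean metadataUpdatedSince(long epochMillis);   // answer to a sync request
        String fetchLatestMetadata();                      // updated metadata to re-cache
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile long lastCheck = System.currentTimeMillis();

    public HeartbeatChannel(MetadataSource source, InMemoryCatalog catalog,
                            String subCatalog, String database, String object,
                            long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            if (source.metadataUpdatedSince(lastCheck)) {
                // Replace the stale entry with the updated metadata (step 304).
                catalog.put(subCatalog, database, object, source.fetchLatestMetadata());
            }
            lastCheck = System.currentTimeMillis();
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    /** Closed when the data processing job finishes (step 306). */
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```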
In the above embodiment, since an update to the data in the external database is reflected in the metadata, once the metadata is updated, the data in the external database has changed. By establishing a heartbeat channel between the memory of the data processing engine and the external database, whether the metadata in the external database has been updated can be confirmed periodically, ensuring consistency between the metadata in the memory and the metadata in the external database, that is, consistency of the structure of the target access object, thereby ensuring correct execution of the data processing job.
Step 303 is described by taking the case where the heartbeat channel is established after the metadata has been cached as an example. In some embodiments, the metadata to be accessed is already stored in the memory, and the data processing engine skips the metadata caching step and directly establishes the heartbeat channel between the memory and the external database, which saves storage resources.
It should be noted that, in step 303, the heartbeat channel is established after the metadata caching is completed. In some embodiments, the metadata is already stored in the memory, and the data processing engine establishes the heartbeat channel as soon as the data processing job is started.
304. If the metadata in the external database has been updated, the data processing engine caches the updated metadata from the external database into the memory.
The data processing engine determines, based on the response information, whether the metadata in the external database has been updated. Depending on the response information, the data processing engine determines whether the metadata in the external database has been updated in the following two ways.
First way: the external database stores a write-ahead log, where the write-ahead log is used to record operations performed on the external database; if the write-ahead log includes an update operation on the metadata, the updated metadata is cached from the external database into the memory.
For an external database that has a write-ahead log, the response information includes the write-ahead log (write-ahead logging, WAL) stored in the external database. Update operations on the metadata include operations, based on the data definition language (DDL), that add, delete, modify, or query the database objects corresponding to the metadata. The data processing engine filters the operations recorded in the write-ahead log based on the target access objects corresponding to the cached metadata; if the filtering yields an update operation on a target access object, it determines that the metadata in the external database has been updated. For example, the external database is mysql, whose write-ahead log is the binlog; the data processing engine identifies metadata changes by searching the binlog for update operations on the target access object. In the above method, when the external database stores a write-ahead log, whether the metadata in the external database has been updated is confirmed based on the operation information recorded in the write-ahead log, which allows the update status of the metadata in the external database to be determined quickly, saving computing resources and improving the efficiency of the data processing job.
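The following sketch illustrates the first way under simplifying assumptions: the WAL entries are treated as raw statement text and scanned for DDL statements that touch the cached target object. The WalEntry shape and keyword set are assumptions; real binlog records would be decoded with a dedicated client library rather than plain string matching.

```java
// Hedged sketch of the first way: scan write-ahead-log entries (e.g. a MySQL binlog dump)
// for DDL statements that touch the cached target object.
import java.util.List;
import java.util.Locale;
import java.util.Set;

public final class WalChangeDetector {

    /** Simplified WAL record: the raw statement text recorded by the external database. */
    public record WalEntry(long position, String statement) { }

    private static final Set<String> DDL_KEYWORDS = Set.of("create", "alter", "drop", "rename");

    /** Returns true if any WAL entry is a DDL operation on the given target object. */
    public static boolean metadataUpdated(List<WalEntry> entries, String targetObject) {
        String target = targetObject.toLowerCase(Locale.ROOT);
        return entries.stream()
                .map(e -> e.statement().toLowerCase(Locale.ROOT))
                .anyMatch(s -> s.contains(target)
                        && DDL_KEYWORDS.stream().anyMatch(s::startsWith));
    }
}
```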
In some embodiments, if the metadata in the external database has been updated, the already cached metadata is deleted and the updated metadata is cached into the memory, which saves storage resources and avoids access conflicts in the memory between the stale metadata and the updated metadata.
Second way: if the metadata in the memory is inconsistent with the metadata in the external database, the updated metadata is cached from the external database into the memory.
For an external database without a write-ahead log, the response information includes the metadata stored in the external database, that is, the external database sends the metadata to the data processing engine. In some embodiments, the external database determines, based on the invocation records of the metadata caching interface, the target access objects corresponding to the metadata cached by the data processing engine, and then periodically sends the metadata of these target access objects to the data processing engine for verification. It should be noted that this is only an example of how the external database determines the metadata to be sent to the data processing engine; it can be customized according to actual requirements, which is not limited in the embodiments of this application.
The data processing engine receives the response information and obtains the metadata in the external database. The data processing engine verifies the metadata in the memory against the metadata obtained from the response information. If they are consistent, the metadata in the external database has not been updated; if they are inconsistent, the metadata in the external database has been updated, and the data processing engine caches the updated metadata into the memory. The verification methods include hash verification, MD5 verification, CRC32 verification, and the like; the verification method can be determined according to actual requirements, which is not limited in the embodiments of this application.
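A minimal sketch of the second way is shown below, comparing digests of the cached and the returned metadata. MD5 is used only because the text lists it as one option; CRC32 or another hash could be substituted. The serialization of the metadata into a string is an assumption for this example.

```java
// Hedged sketch of the second way: digest comparison between cached and external metadata.
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class MetadataChecksum {

    /** Hex MD5 digest of a serialized metadata string. */
    public static String md5(String serializedMetadata) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(serializedMetadata.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** True when the external metadata no longer matches what is cached in memory. */
    public static boolean updated(String cachedMetadata, String externalMetadata) {
        return !md5(cachedMetadata).equals(md5(externalMetadata));
    }
}
```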
In the above method, whether the metadata in the external database has been updated is confirmed by verifying whether the metadata in the external database is consistent with the metadata already cached in the memory, which provides high verification accuracy. Moreover, since the response information includes the metadata, if the metadata in the external database has been updated, the metadata can be stored into the memory directly without caching it through the metadata caching interface, which improves the efficiency of data processing.
In some embodiments, the response information is encrypted to ensure the security of the metadata it carries. The way of encrypting the response information can be customized according to actual requirements, which is not limited in the embodiments of this application.
It should be noted that the process of updating changed metadata through the established heartbeat channel, shown in steps 303 to 304, is optional, which is not limited in the embodiments of this application.
305. The data processing engine accesses the metadata from the memory to execute the data processing job.
The data processing engine accesses the metadata from the memory based on the data format of the metadata in the memory, that is, the data organization model of the catalog in the memory, for example, catalog.mysqlcatalog.database.table1. The data processing engine accesses the metadata in the memory to parse and optimize the query statements and generate a physical execution plan; the data processing engine then executes the physical execution plan, that is, executes the data processing job.
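A short usage sketch, continuing the illustrative InMemoryCatalog above: step 302 cached the metadata under the sub-catalog "mysqlcatalog", and step 305 resolves it by path instead of going back over the network. The metadata payload shown is a made-up placeholder, and the path shape mirrors catalog.mysqlcatalog.database.table1, with the leading "catalog." segment naming the in-memory catalog itself.

```java
// Usage sketch: resolve cached metadata by its sub-catalog path (step 305).
public final class Step305Example {
    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();
        catalog.put("mysqlcatalog", "database", "table1", "{\"columns\":[\"col\"]}");

        // Resolves the sub-catalog.database.object layers below the in-memory catalog root.
        catalog.resolve("mysqlcatalog.database.table1")
               .ifPresent(meta -> System.out.println("plan query against: " + meta));
    }
}
```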
It should be noted that steps 302 to 305 are described by taking the case where the data processing engine first caches the metadata into the memory and then accesses the metadata from the memory to execute the data processing job as an example. In some embodiments, if the network connection is normal, the data processing engine remotely accesses the external database over the network to obtain the metadata and execute the data processing job, and caches the metadata while executing the data processing job; if the network connection is abnormal, the engine first caches the metadata into the memory through the metadata caching interface and then accesses the metadata from the memory. This is not limited in the embodiments of this application.
It should be noted that the embodiments of this application are described by taking the case where the metadata is updated first and the data processing job is executed afterwards as an example. In some embodiments, updating the metadata and executing the data processing job may proceed in parallel; of course, if the metadata is updated during the execution of the data processing job, the data processing job may also be re-executed based on the updated metadata.
306. The data processing engine closes the heartbeat channel in response to completion of the execution of the data processing job.
In step 306, closing the heartbeat channel after the data processing job finishes saves computing resources and improves the processing performance of the data processing engine.
The procedure shown in steps 301 to 306 is illustrated below with reference to FIG. 4. FIG. 4 is a schematic diagram of a data processing method according to an embodiment of this application. As shown in FIG. 4, in response to the start of a data processing job, the Flink SQL engine determines the target access objects in the external database that need to be accessed by scanning the data processing job, or caches part of the metadata in the external database into the catalog in the memory of the Flink SQL engine through the metadata caching interface provided by the Flink SQL engine, improving access efficiency. Meanwhile, when the metadata in the external database is cached into the in-memory catalog, an additional "sub-catalog" layer is added to the catalog structure to avoid conflicts in the metadata access paths. During the execution of the data processing job, the Flink SQL engine establishes a heartbeat channel between the memory and the external database to identify whether the metadata in the external database has been updated, so as to avoid metadata inconsistency. The Flink SQL engine iterates over all the query statements in the data processing job and accesses the metadata in the memory to parse and optimize the query statements and generate physical execution plans. After the data processing job is completed, the heartbeat channel is closed.
With the technical solutions in the embodiments of this application, the metadata in the external database is cached into the memory of the data processing engine and accessed from the memory, which avoids repeatedly accessing the catalog of the external database over the network, thereby reducing network overhead, preventing network unavailability from affecting the execution of the data processing job, improving the performance of the data processing engine, and increasing the efficiency of data processing. Further, by adding a sub-catalog layer to the catalog structure, conflicts between the path for accessing the metadata in the memory and the path for accessing the metadata in the external database are avoided; by establishing a heartbeat channel between the memory and the external database, the metadata is updated periodically during the execution of the data processing job to ensure consistency between the metadata accessed from the memory and the metadata in the external database.
FIG. 5 is a structural block diagram of a data processing apparatus according to an embodiment of this application. As shown in FIG. 5, the apparatus is applied to a data processing engine of a cloud service and includes a determining module 501, a caching module 502, and an execution module 503.
The determining module 501 is configured to determine an external database to be accessed by a data processing job and a target access object in the external database.
The caching module 502 is configured to cache metadata of the target access object from the external database into a memory of the data processing engine.
The execution module 503 is configured to access the metadata from the memory to execute the data processing job.
In a possible implementation, the apparatus further includes:
a communication module, configured to establish a heartbeat channel between the memory and the external database, where the heartbeat channel is used to determine whether the metadata in the external database has been updated, and the number of heartbeat channels is equal to the number of external databases.
The caching module 502 is further configured to, if the metadata in the external database has been updated, cache the updated metadata from the external database into the memory.
In a possible implementation, the communication module is further configured to:
close the heartbeat channel in response to completion of the execution of the data processing job.
In a possible implementation, the external database stores a write-ahead log, the write-ahead log is used to record operations performed on the external database, and the caching module 502 is configured to:
if the write-ahead log includes an update operation on the metadata, cache the updated metadata from the external database into the memory.
In a possible implementation, the caching module 502 is configured to:
if the metadata in the memory is inconsistent with the metadata in the external database, cache the updated metadata from the external database into the memory.
In a possible implementation, the caching module 502 is configured to:
cache the metadata from the external database into a target sub-catalog of a catalog in the memory of the data processing engine, where the catalog is used to store the metadata of the databases associated with the data processing engine, and the target sub-catalog is used to store the metadata from the external database.
In a possible implementation, the execution module 503 is further configured to:
if the metadata is already stored in the memory, perform the step of accessing the metadata from the memory to execute the data processing job.
The determining module 501, the caching module 502, and the execution module 503 can all be implemented by software or by hardware. As an example, the following describes the implementation of the determining module 501; similarly, the implementations of the caching module 502 and the execution module 503 can refer to that of the determining module 501.
As an example of a module serving as a software functional unit, the determining module 501 may include code running on a computing instance, where the computing instance may include at least one of a physical host (computing device), a virtual machine, or a container. Further, there may be one or more computing instances. For example, the determining module 501 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Further, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. Typically, one region may include multiple AZs.
Similarly, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same virtual private cloud (VPC) or in multiple VPCs. Typically, one VPC is located within one region. For cross-region communication between two VPCs in the same region, or between VPCs in different regions, a communication gateway needs to be configured in each VPC, and the interconnection between VPCs is implemented through the communication gateways.
As an example of a module serving as a hardware functional unit, the determining module 501 may include at least one computing device, such as a server. Alternatively, the determining module 501 may be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD may be implemented as a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
The multiple computing devices included in the determining module 501 may be distributed in the same region or in different regions, in the same AZ or in different AZs, and likewise in the same VPC or in multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
It should be noted that, in other embodiments, the steps that the above modules are responsible for implementing may be specified as required, and the full functionality of the apparatus is achieved by having the above modules respectively implement different steps of the data processing method. That is, the data processing apparatus provided in the above embodiment is illustrated only by the division of the above functional modules when performing data processing; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus is divided into different functional modules to complete all or some of the functions described above. In addition, the apparatus provided in the above embodiment belongs to the same concept as the corresponding method embodiment; for its specific implementation process, refer to the method embodiment, which is not repeated here.
This application further provides a computing device 600. FIG. 6 is a schematic structural diagram of a computing device according to an embodiment of this application. As shown in FIG. 6, the computing device 600 includes a bus 601, a processor 602, a memory 603, and a communication interface 604. The processor 602, the memory 603, and the communication interface 604 communicate with one another through the bus 601. The computing device 600 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 600.
The bus 601 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one line is used in FIG. 6, but this does not mean that there is only one bus or one type of bus. The bus 601 may include a path for transferring information between the components of the computing device 600 (for example, the memory 603, the processor 602, and the communication interface 604).
The processor 602 may include any one or more of processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 603 may include a volatile memory, for example, a random access memory (RAM). The memory 603 may also include a non-volatile memory, for example, a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 603 stores executable program code, and the processor 602 executes the executable program code to implement the functions of the aforementioned determining module 501, caching module 502, and execution module 503, respectively, thereby implementing the data processing method. That is, the memory 603 stores instructions for performing the data processing method.
The communication interface 604 uses a transceiver module, such as but not limited to a network interface card or a transceiver, to implement communication between the computing device 600 and other devices or a communication network.
An embodiment of this application further provides a computing device cluster. FIG. 7 is a schematic diagram of a computing device cluster according to an embodiment of this application. As shown in FIG. 7, the computing device cluster includes at least one computing device. The computing device may be a server, for example, a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
As shown in FIG. 7, the computing device cluster includes at least one computing device 600. The memories 603 of one or more computing devices 600 in the computing device cluster may store the same instructions for performing the data processing method.
In some possible implementations, the memories 603 of one or more computing devices 600 in the computing device cluster may also each store part of the instructions for performing the data processing method. In other words, a combination of one or more computing devices 600 may jointly execute the instructions for performing the data processing method.
It should be noted that the memories 603 of different computing devices 600 in the computing device cluster may store different instructions, which are respectively used to perform part of the functions of the data processing apparatus. That is, the instructions stored in the memories 603 of different computing devices 600 may implement the functions of one or more of the determining module 501, the caching module 502, and the execution module 503.
In some possible implementations, one or more computing devices in the computing device cluster may be connected through a network, where the network may be a wide area network, a local area network, or the like. FIG. 8 shows one possible implementation. FIG. 8 is a schematic diagram of a possible implementation of a computing device cluster according to an embodiment of this application. As shown in FIG. 8, two computing devices 600A and 600B are connected through a network; specifically, each computing device connects to the network through its communication interface. In this type of possible implementation, the memory 603 of the computing device 600A stores instructions for performing the functions of the determining module 501, and the memory 603 of the computing device 600B stores instructions for performing the functions of the caching module 502 and the execution module 503.
The connection manner between the computing devices of the cluster shown in FIG. 8 may take into account that the data processing method provided in this application requires a large amount of data to be stored; therefore, the functions implemented by the caching module 502 and the execution module 503 are assigned to the computing device 600B.
It should be understood that the functions of the computing device 600A shown in FIG. 8 may also be performed by multiple computing devices 600; similarly, the functions of the computing device 600B may also be performed by multiple computing devices 600.
An embodiment of this application further provides another computing device cluster. For the connection relationship between the computing devices in this cluster, reference may be made to the connection manner of the computing device cluster shown in FIG. 7 and FIG. 8. The difference is that the memories 603 of one or more computing devices 600 in this computing device cluster may store the same instructions for performing the data processing method.
In some possible implementations, the memories 603 of one or more computing devices 600 in the computing device cluster may also each store part of the instructions for performing the data processing method. In other words, a combination of one or more computing devices 600 may jointly execute the instructions for performing the data processing method.
An embodiment of this application further provides a computer program product containing instructions. The computer program product may be software or a program product that contains instructions and can run on a computing device or be stored in any available medium. When the computer program product runs on a computing device, the computing device is caused to perform the data processing method.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium may be any available medium that a computing device can store, or a data storage device, such as a data center, that contains one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state drive), or the like. The computer-readable storage medium includes instructions that instruct a computing device to perform the data processing method.
It should be noted that the information (including but not limited to user device information, user personal information, and the like), data (including but not limited to data used for analysis, stored data, displayed data, and the like), and signals involved in this application are all authorized by the users or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the metadata involved in this application is obtained with full authorization.
A person of ordinary skill in the art may be aware that the method steps and units described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the steps and composition of each embodiment have been described above generally in terms of functions. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solutions. A person of ordinary skill in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.
It can be clearly understood by a person skilled in the art that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiment described above is merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of this application.
In addition, the units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computing device (which may be a personal computer, a server, a computing device, or the like) to perform all or some of the steps of the methods in the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In this application, the terms "first", "second", and the like are used to distinguish between identical or similar items having substantially the same functions and effects. It should be understood that there is no logical or temporal dependency among "first", "second", and "nth", and that they do not limit the quantity or execution order. It should also be understood that although the following description uses the terms first, second, and the like to describe various elements, these elements should not be limited by the terms; these terms are only used to distinguish one element from another. For example, without departing from the scope of the various examples, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image. Both the first image and the second image may be images, and in some cases, may be separate and different images.
In this application, the term "at least one" means one or more, and the term "multiple" means two or more; for example, multiple second packets means two or more second packets. The terms "system" and "network" are often used interchangeably herein.
It should also be understood that the term "if" may be interpreted to mean "when", "upon", "in response to determining", or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined that..." or "if [a stated condition or event] is detected" may be interpreted to mean "upon determining that...", "in response to determining that...", "upon detecting [the stated condition or event]", or "in response to detecting [the stated condition or event]".
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and such modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used for implementation, all or some of the embodiments may be implemented in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, all or some of the procedures or functions according to the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid state drive), or the like.
A person of ordinary skill in the art may understand that all or some of the steps for implementing the foregoing embodiments may be completed by hardware, or may be completed by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing embodiments are merely intended to describe the technical solutions of this application, rather than to limit them. Although this application is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements to some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.
Claims (13)
- A data processing method, applied to a data processing engine of a cloud service, the method comprising: determining an external database to be accessed by a data processing job and a target access object in the external database; caching metadata of the target access object from the external database into a memory of the data processing engine; and accessing the metadata from the memory to execute the data processing job.
- The method according to claim 1, wherein the method further comprises: establishing a heartbeat channel between the memory and the external database, wherein the heartbeat channel is used to determine whether the metadata in the external database has been updated, and the number of heartbeat channels is equal to the number of external databases; and if the metadata in the external database has been updated, caching the updated metadata from the external database into the memory.
- The method according to claim 2, wherein the method further comprises: closing the heartbeat channel in response to completion of the execution of the data processing job.
- The method according to claim 2, wherein the external database stores a write-ahead log, the write-ahead log is used to record operations performed on the external database, and the caching, if the metadata in the external database has been updated, the updated metadata from the external database into the memory comprises: if the write-ahead log comprises an update operation on the metadata, caching the updated metadata from the external database into the memory.
- The method according to claim 2, wherein the caching, if the metadata in the external database has been updated, the updated metadata from the external database into the memory comprises: if the metadata in the memory is inconsistent with the metadata in the external database, caching the updated metadata from the external database into the memory.
- The method according to any one of claims 1 to 5, wherein the caching the metadata of the target access object from the external database into the memory comprises: caching the metadata from the external database into a target sub-catalog of a catalog in the memory of the data processing engine, wherein the catalog is used to store the metadata of the databases associated with the data processing engine, and the target sub-catalog is used to store the metadata from the external database.
- The method according to any one of claims 1 to 6, wherein after the determining an external database to be accessed by a data processing job and a target access object in the external database, the method further comprises: if the metadata is already stored in the memory, performing the step of accessing the metadata from the memory to execute the data processing job.
- A data processing engine of a cloud service, wherein the data processing engine comprises a metadata caching interface, and the metadata caching interface is used to cache metadata of a target access object from an external database into a memory of the data processing engine, so as to implement the data processing method according to any one of claims 1 to 7.
- A data processing apparatus, applied to a data processing engine of a cloud service, the apparatus comprising: a determining module, configured to determine an external database to be accessed by a data processing job and a target access object in the external database; a caching module, configured to cache metadata of the target access object from the external database into a memory of the data processing engine; and an execution module, configured to access the metadata from the memory to execute the data processing job.
- A computing device, comprising a processor and a memory, wherein the processor of the computing device is configured to execute instructions stored in the memory of the computing device, so that the computing device performs the data processing method according to any one of claims 1 to 7.
- A computing device cluster, comprising at least one computing device, each computing device comprising a processor and a memory, wherein the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the data processing method according to any one of claims 1 to 7.
- A computer program product containing instructions, wherein when the instructions are run by a computing device cluster, the computing device cluster is caused to perform the data processing method according to any one of claims 1 to 7.
- A computer-readable storage medium, comprising computer program instructions, wherein when the computer program instructions are executed by a computing device cluster, the computing device cluster performs the data processing method according to any one of claims 1 to 7.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211309305.1 | 2022-10-25 | ||
CN202211309305 | 2022-10-25 | ||
CN202211679701.3A CN117971892A (zh) | 2022-10-25 | 2022-12-26 | Data processing method, data processing engine, computing device, and storage medium |
CN202211679701.3 | 2022-12-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024087736A1 (zh) | 2024-05-02 |
Family
ID=90829920
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/106888 WO2024087736A1 (zh) | Data processing method, data processing engine, computing device, and storage medium | 2022-10-25 | 2023-07-12 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024087736A1 (zh) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108920616A (zh) * | 2018-06-28 | 2018-11-30 | Zhengzhou Yunhai Information Technology Co., Ltd. | Metadata access performance optimization method, system and apparatus, and storage medium |
CN113032349A (zh) * | 2019-12-25 | 2021-06-25 | Alibaba Group Holding Limited | Data storage method and apparatus, electronic device, and computer-readable medium |
CN114238518A (zh) * | 2021-12-22 | 2022-03-25 | China Construction Bank Corporation | Data processing method, apparatus and device, and storage medium |
CN114756577A (zh) * | 2022-03-25 | 2022-07-15 | Beijing Youyou Tianyu System Technology Co., Ltd. | Multi-source heterogeneous data processing method, computer device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23881324; Country of ref document: EP; Kind code of ref document: A1 |