CN116069759A

CN116069759A - Data processing method, device and computer equipment

Info

Publication number: CN116069759A
Application number: CN202211711665.4A
Authority: CN
Inventors: 张�林; 安培; 张鹏; 杨志欣; 张小勇
Original assignee: Tianjin Zhongke Shuguang Storage Technology Co ltd
Current assignee: Tianjin Zhongke Shuguang Storage Technology Co ltd
Priority date: 2022-12-29
Filing date: 2022-12-29
Publication date: 2023-05-05

Abstract

The application relates to a data processing method, a data processing apparatus, and a computer device. The method comprises: in response to a metadata operation request, a new cluster gateway determines a target cluster and sends the operation request to the name node of the target cluster, instructing that name node to execute the metadata operation. The target cluster comprises the new cluster and/or the old cluster to which the new cluster gateway points, and the directory metadata of the new cluster has been synchronized in advance with the directory metadata of the old cluster. With this method, waste of system resources can be avoided.

Description

Data processing method, device and computer equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data processing method, apparatus, and computer device.

Background

With the rapid development of big data, the data volume and cluster size of distributed file systems such as HDFS have grown explosively. As a result, the data of the old and new systems often has to be integrated so that the new cluster can take over from the old one.

In the related art, a complete data migration is performed with the data migration command distcp: the full data of the old cluster is copied to the new cluster, which then provides the service; the old cluster is then taken offline, which completes the replacement of the old cluster by the new one.

However, in the related art the old cluster is taken offline after the migration, so the two clusters cannot provide service as a whole, and the resources of the old cluster are wasted.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a data processing method, apparatus, and computer device capable of avoiding resource waste.

In a first aspect, the present application provides a data processing method, the method comprising:

in response to a metadata operation request, a new cluster gateway determines a target cluster; the target cluster comprises the new cluster and/or the old cluster to which the new cluster gateway points; the directory metadata of the new cluster has been synchronized in advance with the directory metadata of the old cluster;

the new cluster gateway sends the operation request to the name node of the target cluster and instructs the name node of the target cluster to execute the metadata operation.

In the technical solution of this embodiment of the application, in response to a metadata operation request, the new cluster gateway determines the target cluster and sends the operation request to the name node of the target cluster, instructing it to execute the metadata operation. The target cluster comprises the new cluster and/or the old cluster to which the new cluster gateway points, and the directory metadata of the new cluster has been synchronized in advance with that of the old cluster. Because the target cluster is determined by the new cluster gateway, the gateway can combine the resources of the new and old clusters and serve external clients as a single whole. Furthermore, because the gateway sends the operation request to the name node of the target cluster, the name nodes of the new and old clusters are unified behind the new cluster gateway while their data nodes remain independent, so the resources of the old cluster are not wasted.

In one embodiment, the new cluster gateway includes a pre-built user connection pool for caching connections to name nodes of the old cluster.

In this technical solution, connections for accessing the name node of the old cluster are cached in the pre-built user connection pool, so that a metadata operation request directed to the old cluster can be answered quickly. In addition, the user connection pool is a shared connection pool: its connections can be used by different callers as long as their call times do not conflict, so the connections are reusable.

In one embodiment, if a data transmission request is received after the metadata operation, the name node of the new cluster or the name node of the old cluster directs the data transmission in the data nodes of its own cluster.

In the technical solution of this embodiment, when a data transmission request is received after the metadata operation, the new cluster gateway hands the corresponding work to the name node of the cluster, so that the data transmission takes place on the data nodes of the new or old cluster. This follows the execution logic of the different node types in the distributed file system and also relieves the gateway of part of its load.

In one embodiment, the metadata is file metadata, and the determining, by the new cluster gateway, the target cluster includes:

the new cluster gateway detects whether file metadata exists in the old cluster;

if the file metadata exists in the old cluster, the new cluster gateway determines that the target cluster is the old cluster;

if not, the new cluster gateway determines the target cluster as the new cluster.

In the technical solution of this embodiment, the target cluster is determined by checking whether the file metadata exists in the old cluster. The check is simple to perform and its result is reliable, so the target cluster can be determined accurately.

In one embodiment, the name node of the target cluster executing the metadata operation includes:

if the target cluster is an old cluster, the name node of the old cluster maintains file metadata in the old cluster;

if the target cluster is a new cluster, the name node of the new cluster creates file metadata in the new cluster.

In the technical solution of this embodiment, new files are created in the new cluster and old files are maintained in the old cluster; with the gateway as the intermediary, the resources of the new and old clusters are combined to operate on metadata whose operation object is a file. This management mode allows the new and old clusters to provide service externally as a single cluster while each remains largely independent, which reduces the cost and risk of migrating or merging them.

In one embodiment, the metadata is directory metadata, and determining, by the new cluster gateway, the target cluster includes:

the new cluster gateway determines that the target cluster is both the new cluster and the old cluster;

accordingly, the name node of the target cluster executing the metadata operation includes:

the name node of the new cluster and the name node of the old cluster synchronously execute the directory metadata operation in their respective clusters.

In the technical solution of this embodiment, when the metadata is directory metadata, the new cluster gateway takes both the new cluster and the old cluster as target clusters, and each target cluster executes the metadata operation synchronously, which keeps operations on directory metadata efficient.

In one embodiment, the synchronization process of the directory metadata of the new cluster and the old cluster includes:

in response to a configuration request from a client, pointing the distributed file system cluster used by the client to the new cluster gateway;

the new cluster gateway synchronizes the directory metadata of the old cluster into the directory metadata of the new cluster;

if the new cluster gateway determines that the directory metadata of the old cluster has been completely synchronized, the new cluster is set to provide service externally.

According to the technical scheme, the new cluster gateway is used for completing synchronization of the directory metadata of the new cluster and the directory metadata of the old cluster, so that the directory metadata of the old cluster is updated into the directory metadata of the new cluster, and the new cluster and the old cluster can serve as a whole. Meanwhile, the file data of the old cluster is still stored in the old cluster, so that the problem of resource waste caused by overall migration can be solved.

In one embodiment, the new cluster gateway synchronizes the directory metadata of the old cluster into the directory metadata of the new cluster, including:

the new cluster gateway traverses the existing directory metadata of the old cluster and performs a full synchronization of the existing directory metadata into the directory metadata of the new cluster;

after the full synchronization is completed, the new cluster gateway obtains the temporary target metadata that was recorded in the incremental data temporary database of the old cluster during the full synchronization;

the new cluster gateway performs an incremental synchronization of the temporary target metadata into the directory metadata of the new cluster; the new cluster and the old cluster stop providing service externally during the incremental synchronization.

In this technical solution, the existing directory metadata is synchronized by full synchronization while external service continues, and the temporary target metadata is synchronized by incremental synchronization while external service is stopped. Synchronizing in these two batches guarantees the completeness of the synchronized content and keeps the service interruption short.
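The two-phase synchronization described above can be illustrated with a minimal Python sketch. It is not the implementation of the application: `ToyCluster`, `sync_directory_metadata`, and the `incremental_log` list (standing in for the incremental data temporary database) are hypothetical names, and directory metadata is modeled as a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster: directory path -> attributes."""
    dirs: dict = field(default_factory=dict)
    serving: bool = True


def sync_directory_metadata(old, new, incremental_log):
    """Copy directory metadata from `old` to `new` in two phases.

    `incremental_log` is an assumed list of (path, attrs) entries that the
    old cluster recorded while the full pass was running.
    """
    # Phase 1: full synchronization; external service continues.
    for path, attrs in list(old.dirs.items()):
        new.dirs[path] = dict(attrs)

    # Phase 2: incremental synchronization; external service is paused
    # only for the (normally much shorter) replay of the recorded changes.
    old.serving = new.serving = False
    try:
        for path, attrs in incremental_log:
            new.dirs[path] = dict(attrs)
    finally:
        old.serving = new.serving = True
    return new


old = ToyCluster(dirs={"/a": {"owner": "u1"}, "/a/b": {"owner": "u2"}})
new = ToyCluster()
log = [("/c", {"owner": "u3"}), ("/a", {"owner": "u9"})]
sync_directory_metadata(old, new, log)
print(sorted(new.dirs))
```

The pause is limited to the second phase, which is why the interruption scales with the number of changes recorded during the full pass rather than with the total amount of directory metadata.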

In a second aspect, the present application also provides a data processing apparatus. The device comprises:

a cluster determining module, configured to determine a target cluster in response to a metadata operation request; the target cluster comprises the new cluster and/or the old cluster to which the new cluster gateway points; the directory metadata of the new cluster has been synchronized in advance with the directory metadata of the old cluster;

an execution indication module, configured to send the operation request to the name node of the target cluster and instruct the name node of the target cluster to execute the metadata operation.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the method in any of the embodiments of the first aspect described above when the computer program is executed.

In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method in any of the embodiments of the first aspect described above.

In a fifth aspect, the present application also provides a computer program product. The computer program product comprising a computer program which, when executed by a processor, implements the steps of the method in any of the embodiments of the first aspect described above.

Drawings

FIG. 1 is a diagram of an application environment for a data processing method in one embodiment;

FIG. 2 is a schematic diagram of a distributed file system in one embodiment;

FIG. 3 is a flow diagram of a data processing method in one embodiment;

FIG. 4 is a flow diagram of a cluster determination step in one embodiment;

FIG. 5 is a flow chart illustrating metadata manipulation steps in one embodiment;

FIG. 6 is a flow diagram of a data processing method in one embodiment;

FIG. 7 is a flow diagram of a data synchronization process in one embodiment;

FIG. 8 is a flow chart of a data synchronization process in another embodiment;

FIG. 9 is a flow diagram that illustrates the steps of a directory synchronization process in one embodiment;

FIG. 10 is a block diagram of a data processing apparatus in one embodiment;

FIG. 11 is an internal block diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

The data processing method provided by the embodiments of the application can be applied in the application environment shown in FIG. 1, in which the gateway 102 communicates with the server 104 via a network. The client generates operation requests for metadata, and the gateway 102 is communicatively connected to the client to receive those operation requests. The gateway 102 may be implemented as a stand-alone server or as a cluster of servers.

HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware. It is highly fault tolerant, suitable for deployment on inexpensive machines, provides high-throughput data access, and is well suited to applications with large data sets. It is one of the infrastructure components of today's big data systems.

With the rapid development of big data, the data volume and cluster size of distributed file systems such as HDFS have grown explosively. For example, the total size of ByteDance's HDFS clusters already exceeds 100,000, and their data volume has reached 10 EB (1 EB = 1024 PB and 1 PB = 1024 TB, so 1 EB is roughly the capacity of one million 1 TB hard disks). The HDFS system is capable of storing data at this scale; its basic structure is shown in FIG. 2.

As can be seen from FIG. 2, the HDFS system itself consists mainly of two parts: the name node, which stores the metadata information of files (such as the file directory structure, file owners, and directory and file permissions), and the data nodes, which store file contents and can be scaled horizontally. A file operation by a user can be abstracted into two steps: a metadata operation (metadata ops, such as adding, deleting, or modifying the basic information of a directory or file) and the transmission of file content. The system splits the content of a file into blocks (128 MB by default) and forms logically redundant copies through multiple replicas or EC erasure coding. The copies are stored on different data nodes, so that data is not lost when a single node becomes unavailable.
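As a small worked example of the block splitting described above, the following Python sketch computes how many 128 MB blocks a file occupies and how much raw storage full replication consumes. The function names are hypothetical, and the replica count of 3 is an assumption made for illustration; the text does not state one.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default block size stated in the text (128 MB)


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file of `file_size` bytes (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def raw_bytes_with_replicas(file_size, replicas=3):
    """Raw storage used when every block is kept as `replicas` full copies.

    The default of 3 is an assumption for this sketch, not a value from the text.
    """
    return file_size * replicas


GIB = 1024 ** 3
MIB = 1024 ** 2
print(block_count(GIB))        # a 1 GiB file
print(block_count(300 * MIB))  # a 300 MiB file; the last block is only partly filled
```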

For a smooth transition between the new and old clusters, the mainstream approach is a complete data migration with the officially provided data migration command distcp: the full data of the old cluster is copied to the new cluster, which then provides the service; the old cluster is then taken offline, completing the replacement.

However, although a complete data migration with distcp is a workable way for the new and old clusters to cooperate, it has significant limitations:

(1) The two clusters cannot provide service as a whole. The new and old clusters can only serve independently and are never truly merged, and the old cluster is usually taken offline after the migration.

(2) The time needed for data migration grows with the data scale and commonly reaches several hours or even days. During the migration, neither cluster can accept writes, because distcp must keep the data of the new and old clusters consistent; if distcp finds the data inconsistent, the migration is interrupted and the whole process fails.

Based on the above technical problems, the present application provides a data processing method in which a name node (namenode) gateway is added to the new cluster. After receiving a metadata operation request (adding, deleting, or querying a directory or file, or operating on permissions, extended attributes, quotas, and so on), the gateway routes it: it decides whether the request should be executed in the new cluster, in the old cluster, or in both at the same time. If data transmission follows the metadata operation, the name node of the corresponding cluster, after executing the metadata operation, directs the user to transfer the data in the data nodes of that cluster. The data nodes of the new and old clusters are independent of each other and are not shared.

In one embodiment, as shown in fig. 3, there is provided a data processing method including the steps of:

S302, in response to a metadata operation request, the new cluster gateway determines a target cluster; the target cluster comprises the new cluster and/or the old cluster to which the new cluster gateway points; the directory metadata of the new cluster has been synchronized in advance with the directory metadata of the old cluster.

Metadata is data that describes other data (data about data); it is descriptive information about data and information resources. Metadata supports functions such as indicating storage locations, recording history, locating resources, and recording files, and it enables simple, efficient management of large amounts of networked data as well as effective discovery, search, integrated organization, and management of the resources in use. In the embodiments of the application, metadata acts as an electronic catalog: to serve the purpose of cataloging, it must describe and collect the content or characteristics of the data, and it thereby assists data retrieval.

A metadata operation request is a request to add, delete, or query a directory or file, or to operate on permissions, extended attributes, quotas, and the like. It is generated at the client based on the user's operation on a file or on data. Operation requests fall into two types: requests to operate on metadata and requests for data transmission. After a metadata operation request is generated, it must be sent through the gateway to the cluster that corresponds to the metadata so that the metadata operation can be executed.

A cluster is a group of mutually independent computers interconnected by a high-speed network, which form a group and are managed in a single system mode. When a client interacts with a cluster, the cluster appears as an independent server. In the embodiment of the application, the availability and scalability can be improved through reasonable configuration of the clusters.

A gateway, also called an internetwork connector or protocol converter, is a computer system or device that performs conversion between systems that use different communication protocols, data formats, or languages, or even between two systems with completely different architectures. The new cluster gateway in the embodiments of the application is a server that forwards communication data for other clusters. It acts on the new cluster and, when it receives an operation request from a client, handles the request as if it were the origin server that holds the resource. It supports multiple protocols and is highly scalable.

In response to a metadata operation request, the new cluster gateway determines, according to the metadata, which of the new and old clusters corresponds to the metadata and takes that cluster as the target cluster. The target cluster corresponds to the metadata operation request: it contains the metadata and executes the operation on it. The target cluster can therefore be the new cluster, the old cluster, or both. Illustratively, if the request operates on metadata of the new cluster, the gateway takes the new cluster as the target cluster; if it operates on metadata of the old cluster, the gateway takes the old cluster as the target cluster; and if it operates on metadata of both clusters, the gateway takes both the new cluster and the old cluster as target clusters.

It should be noted that, to respond to a metadata operation request, the new cluster gateway first determines the target cluster according to the metadata and then instructs the target cluster to execute the specific operation. The directory metadata includes information such as permission information, extended attributes, quotas, and the flag indicating whether the snapshot function may be enabled. To reduce the time the new cluster gateway needs to determine the target cluster, the directory metadata of the old cluster is copied into the directory metadata of the new cluster before the gateway starts responding to metadata operation requests, which completes the synchronization of the directory metadata. Synchronizing the directory metadata in advance in this way speeds up the gateway's response to metadata operation requests.

S304, the new cluster gateway sends an operation request to the name node of the target cluster, and instructs the name node of the target cluster to execute metadata operation.

In the embodiments of the present application, an HDFS cluster includes a plurality of nodes, which are divided by type into name nodes and data nodes. The name node, also called the master node, is unique within an HDFS cluster. At startup it rebuilds the data block information by scanning the block lists reported by all data nodes, and during operation it keeps that information up to date from the block lists that the data nodes send periodically. The data nodes, also called slave nodes, are multiple within an HDFS cluster. They are responsible for storing and reading data: they store and retrieve data as scheduled by a client or the name node, and they periodically send the list of blocks they store to the name node.

After the target cluster is determined, the new cluster gateway sends the metadata operation request to the name node of the target cluster, and indicates the name node of the target cluster to execute the metadata operation corresponding to the metadata operation request after receiving the metadata operation request.

For example, if the target cluster is an old cluster, the new cluster gateway sends an operation request to the name node of the old cluster, and instructs the name node of the old cluster to execute metadata operation; if the target cluster is a new cluster, the new cluster gateway sends an operation request to a name node of the new cluster, and instructs the name node of the new cluster to execute metadata operation; if the target cluster is a new cluster and an old cluster, the new cluster gateway sends an operation request to the name node of the new cluster and the name node of the old cluster, and indicates the name node of the new cluster and the name node of the old cluster to execute metadata operation.
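The three dispatch cases in step S304 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the gateway of the application: `Target`, `ToyNameNode`, and `ToyGateway` are hypothetical names, and a name node is reduced to an object that records the requests it is asked to execute.

```python
from enum import Enum


class Target(Enum):
    NEW = "new"
    OLD = "old"
    BOTH = "both"


class ToyNameNode:
    """Hypothetical name node that just records the operations it executes."""

    def __init__(self, name):
        self.name = name
        self.executed = []

    def execute(self, request):
        self.executed.append(request)
        return f"{self.name}:{request}"


class ToyGateway:
    """Hypothetical gateway holding one name node per cluster."""

    def __init__(self, new_nn, old_nn):
        self.new_nn = new_nn
        self.old_nn = old_nn

    def dispatch(self, request, target):
        # Forward the request to the name node(s) of the target cluster(s).
        nodes = {
            Target.NEW: [self.new_nn],
            Target.OLD: [self.old_nn],
            Target.BOTH: [self.new_nn, self.old_nn],
        }[target]
        return [nn.execute(request) for nn in nodes]


gateway = ToyGateway(ToyNameNode("new"), ToyNameNode("old"))
print(gateway.dispatch("mkdir /a", Target.BOTH))
print(gateway.dispatch("chmod /f", Target.OLD))
```

In this sketch the gateway never executes the operation itself; it only selects the name nodes and forwards the request, which mirrors the division of work described in the text.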

In the data processing method provided by the embodiments of the application, in response to a metadata operation request, the new cluster gateway determines the target cluster and sends the operation request to the name node of the target cluster, instructing it to execute the metadata operation. The target cluster comprises the new cluster and/or the old cluster to which the new cluster gateway points, and the directory metadata of the new cluster has been synchronized in advance with that of the old cluster. Because the target cluster is determined by the new cluster gateway, the gateway can combine the resources of the new and old clusters and serve external clients as a single whole. Furthermore, because the gateway sends the operation request to the name node of the target cluster, the name nodes of the new and old clusters are unified behind the new cluster gateway while their data nodes remain independent, so the resources of the old cluster are not wasted.

Before the new cluster gateway sends an operation request to the target cluster, the new cluster and the old cluster typically have to be connected so that the target cluster can execute the metadata operation. Accordingly, the way the new cluster and the old cluster are connected is described below by way of an embodiment.

In one embodiment, the new cluster gateway includes a pre-built user connection pool for caching connections to the name node of the old cluster.

A connection pool is a pool (set) formed by a group of connections. In the embodiments of the application, the required connections, namely the connections for accessing the name node of the old cluster, are placed in the user connection pool as shared connections. While a connection from the user connection pool is in use, one and only one user can use it; once it is returned, it becomes a shared connection again and can be used for later calls. The connections are thereby reused, which reduces the cost of repeatedly opening and closing connections. It should be noted that a user can only take an idle connection from the pool that has already been successfully bound to that user. The reason is that when a user makes a remote procedure call (Remote Procedure Call, RPC) for the first time, the connection that carries the request must be authenticated; the connection is then bound to the user, and every request subsequently made over that bound connection is regarded as a request from that user.

If a user initiates a connection request while the connection pool holds idle connections, but none of those idle connections is bound to that user, the user still cannot use them. Instead, the connection used for the request must be authenticated anew and bound to the user, forming a bound connection; once that bound connection is idle again, the user can obtain it from the connection pool.

In an HDFS system, a user's identity is tied mainly to the transmission control protocol (Transmission Control Protocol, TCP) connection. A user-based connection pool therefore has to be built inside the new cluster gateway to cache the connections that access the name node of the old cluster, which ensures that when a request is forwarded to the old cluster, the user bound to the connection is the real client user rather than a local user of the gateway. TCP is a connection-oriented, reliable, byte-stream-based transport layer communication protocol.

In this embodiment, connections for accessing the name node of the old cluster are cached in the pre-built user connection pool, so that a metadata operation request directed to the old cluster can be answered quickly. In addition, the user connection pool is a shared connection pool: its connections can be used by different callers as long as their call times do not conflict, so the connections are reusable.
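The binding rule described for the user connection pool can be sketched in Python as follows. This is a minimal illustration, not the gateway's actual pool: `BoundConnection` and `UserConnectionPool` are hypothetical names, authentication is reduced to a counter, and no real network connection is opened. The sketch follows the rule that an idle connection is handed out only to the user it was bound to at authentication time.

```python
from collections import defaultdict


class BoundConnection:
    """Hypothetical connection to the old cluster's name node, bound to one user."""

    def __init__(self, user):
        self.user = user  # binding established when the connection is authenticated


class UserConnectionPool:
    """Caches idle connections per user so that later calls skip authentication."""

    def __init__(self):
        self._idle = defaultdict(list)  # user -> idle connections bound to that user
        self.authentications = 0        # counts how often a new connection was needed

    def acquire(self, user):
        if self._idle[user]:
            # Reuse an idle connection that is already bound to this user.
            return self._idle[user].pop()
        # No idle connection bound to this user: authenticate and bind a new one.
        self.authentications += 1
        return BoundConnection(user)

    def release(self, conn):
        # Returning the connection makes it available for the user's next call.
        self._idle[conn.user].append(conn)


pool = UserConnectionPool()
first = pool.acquire("alice")
pool.release(first)
again = pool.acquire("alice")   # reused, no second authentication
other = pool.acquire("bob")     # idle connections of alice are not handed to bob
print(again is first, pool.authentications)
```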

In response to a metadata operation request, the operation may be executed by the gateway itself, or the gateway may instruct other nodes to execute it. Accordingly, the following embodiment describes which entity executes the operation.

In one embodiment, if a data transmission request is received after the metadata operation, the name node of the new cluster or the name node of the old cluster directs the data transmission in the data nodes of its own cluster.

A data transmission request is a request to read or write file content.

When the operation request is a data transmission request, it is executed by the name node of the cluster: the new cluster gateway first sends the data transmission request to the name node of the target cluster, and that name node then carries out the data transmission on the data nodes of its own cluster according to the request. The target cluster corresponds to the data transmission request and may be the new cluster or the old cluster.

Illustratively, when the operation request is a request to operate on metadata, the new cluster gateway has the operation executed by sending the request to the name node of the target cluster.

In this embodiment, when a data transmission request is received after the metadata operation, the new cluster gateway hands the corresponding work to the name node of the cluster, so that the data transmission takes place on the data nodes of the new or old cluster. This follows the execution logic of the different node types in the distributed file system and also relieves the gateway of part of its load.
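The separation between metadata handling and data transmission can be illustrated with a short Python sketch. All names (`ClusterNameNode`, `transfer_target`, the data node labels) are hypothetical, and the round-robin placement is only a placeholder policy; the point of the sketch is that each name node refers the client only to data nodes of its own cluster.

```python
class ClusterNameNode:
    """Hypothetical name node that only knows the data nodes of its own cluster."""

    def __init__(self, cluster, data_nodes):
        self.cluster = cluster
        self.data_nodes = list(data_nodes)
        self.locations = {}  # file path -> data node holding the content

    def locate_for_write(self, path):
        # Round-robin placement is a placeholder policy for this sketch.
        node = self.data_nodes[len(self.locations) % len(self.data_nodes)]
        self.locations[path] = node
        return node

    def locate_for_read(self, path):
        return self.locations[path]


def transfer_target(name_node, path, write):
    """Return the data node the client should exchange file content with.

    The gateway only forwards the request to `name_node`; the content itself
    moves between the client and a data node of that name node's cluster.
    """
    if write:
        return name_node.locate_for_write(path)
    return name_node.locate_for_read(path)


old_nn = ClusterNameNode("old", ["old-dn1", "old-dn2"])
new_nn = ClusterNameNode("new", ["new-dn1"])
print(transfer_target(new_nn, "/f1", write=True))
print(transfer_target(old_nn, "/f2", write=True))
```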

When determining the target cluster, the cluster is generally chosen according to the type of the metadata, so that the target cluster can execute the metadata operation request. Accordingly, the step of determining the target cluster is described below by way of an embodiment.

In one embodiment, as shown in fig. 4, where the metadata is file metadata, the determining, by the new cluster gateway, the target cluster includes:

s402, the new cluster gateway detects whether file metadata exists in the old cluster.

When the metadata is file metadata, the new cluster gateway needs to judge the cluster to which the file metadata belongs so as to execute the operation on the file metadata. The new cluster gateway detects in the old cluster according to the file metadata, and determines the cluster to which the file metadata belongs as a target cluster by taking the detection result of whether the file metadata exists in the old cluster as a basis.

The new cluster gateway may also determine the target cluster by detecting whether file metadata exists for the new cluster, for example.

S404, if the target cluster exists, the new cluster gateway determines that the target cluster is an old cluster.

If the file metadata is in the old cluster, the file metadata can perform related operations in the old cluster, and at the moment, the new cluster gateway determines the target cluster as the old cluster.

And S406, if the target cluster does not exist, the new cluster gateway determines that the target cluster is a new cluster.

If the file metadata is not in the old cluster, the file metadata is indicated to be in the new cluster, and related operations of the file metadata can be executed in the new cluster, and at the moment, the new cluster gateway determines the target cluster as the new cluster.

In the embodiment of the application, the target cluster is determined by detecting whether the file metadata exists in the old cluster. The check is simple to perform, the reliability of its result is ensured as far as possible, and it yields an accurate determination of the target cluster.
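The decision in S402 to S406 can be illustrated with a minimal Python sketch. The `Cluster` class, its `has_file` method, and the example paths are hypothetical stand-ins introduced only for illustration; the application does not specify how the gateway queries the old cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for the namespace held by one cluster's name node (hypothetical)."""
    name: str
    files: set = field(default_factory=set)

    def has_file(self, path: str) -> bool:
        return path in self.files


def choose_target_for_file(path: str, old: Cluster, new: Cluster) -> Cluster:
    """S402-S406: pick the old cluster if it already holds the file, else the new one."""
    if old.has_file(path):  # S402 detects, S404 selects the old cluster
        return old
    return new              # S406 selects the new cluster


old_cluster = Cluster("old", {"/data/a.txt"})
new_cluster = Cluster("new")
print(choose_target_for_file("/data/a.txt", old_cluster, new_cluster).name)
print(choose_target_for_file("/data/b.txt", old_cluster, new_cluster).name)
```

In this sketch a path that the old cluster does not know is always routed to the new cluster, which matches the assumption in S406 that such file metadata belongs to the new cluster.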

When the operation object of the metadata is a file, executing the metadata operation typically means managing the metadata in a unified way through a gateway or a name node in a cluster. In view of this, the following describes, by way of one embodiment, the operation procedure for metadata whose operation object is a file.

In one embodiment, as shown in FIG. 5, the name node of the target cluster executing the metadata operation includes:

S502, if the target cluster is the old cluster, the name node of the old cluster maintains the file metadata in the old cluster.

The file metadata is stored in the data node of the cluster, and when the metadata is file metadata, the new cluster gateway needs to send an operation request of the metadata to the name node of the target cluster, and instruct the name node of the target cluster to execute the operation of the metadata.

When the target cluster is the old cluster, the file metadata exists in the old cluster and is therefore old file metadata. The new cluster gateway can then distribute the metadata operation request to the old cluster alone, instructing the name node of the old cluster to continue maintaining the file metadata in the old cluster.

S504, if the target cluster is a new cluster, creating file metadata in the new cluster by the name node of the new cluster.

When the target cluster is the new cluster, the old cluster does not hold the file metadata, so the metadata is new file metadata. The new cluster gateway can then distribute the metadata operation request to the new cluster alone, instructing the name node of the new cluster to create and maintain the file metadata in the new cluster.

In the embodiment of the application, new files are created in the new cluster and old files are maintained in the old cluster; with the gateway as the intermediary, the resources of the new and old clusters are combined to operate on metadata whose operation object is a file. This unified management approach allows the new cluster and the old cluster to provide service externally as a single cluster while each retains a degree of independence, which reduces the cost and risk of migrating or merging the two clusters.
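As a rough sketch of S502 and S504, the following Python code dispatches a file-metadata request to one of two in-memory name nodes. The `NameNode` class and the `handle_file_request` function are hypothetical; in particular, the branch that maintains a file already present in the new cluster is an assumption added for completeness, not something the text states.

```python
class NameNode:
    """Hypothetical in-memory name node: maps a file path to its attributes."""

    def __init__(self, cluster_name):
        self.cluster_name = cluster_name
        self.file_meta = {}

    def create(self, path, attrs):
        self.file_meta[path] = dict(attrs)

    def maintain(self, path, attrs):
        self.file_meta[path].update(attrs)


def handle_file_request(path, attrs, old_nn, new_nn):
    """Dispatch one file-metadata request and return the cluster that served it."""
    if path in old_nn.file_meta:      # S502: old files stay in the old cluster
        old_nn.maintain(path, attrs)
        return old_nn.cluster_name
    if path in new_nn.file_meta:      # assumption: files already in the new cluster
        new_nn.maintain(path, attrs)  # are maintained there as well
    else:                             # S504: new files are created in the new cluster
        new_nn.create(path, attrs)
    return new_nn.cluster_name


old_nn, new_nn = NameNode("old"), NameNode("new")
old_nn.create("/data/a.txt", {"owner": "alice"})
served_a = handle_file_request("/data/a.txt", {"mode": "644"}, old_nn, new_nn)
served_b = handle_file_request("/data/b.txt", {"owner": "bob"}, old_nn, new_nn)
print(served_a, served_b)
```

Neither name node ever touches the other's namespace here, which is the independence the embodiment relies on to keep migration cost and risk low.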

When the operation object of the metadata is a directory, the metadata can likewise be managed in a unified way through a gateway or a name node in a cluster. In view of this, the following describes, by way of one embodiment, the operation procedure for metadata whose operation object is a directory.

In one embodiment, the metadata is directory metadata, and the determining, by the new cluster gateway, the target cluster includes: the new cluster gateway determines the target cluster as a new cluster and an old cluster; accordingly, the operation of the name node of the target cluster to execute the metadata includes: the name nodes of the new cluster and the name nodes of the old cluster are synchronized to perform the operation of directory metadata in the respective clusters.

When the operation object of the metadata is a directory, the operation must be performed synchronously in both the new cluster and the old cluster. The new cluster gateway determines that the target cluster is both the new cluster and the old cluster and distributes the metadata operation request to both at the same time; the operation is reported as successful only when it succeeds in both clusters. The two clusters perform the directory metadata operation synchronously: the name node of the new cluster performs it in the new cluster, and the name node of the old cluster performs it in the old cluster.

To ensure reliability as far as possible, the operation is performed in the old cluster first. In addition, an idempotent implementation is provided wherever possible for each specific metadata operation request, so that if the operation fails it can be retried through the retry mechanism of the client until processing succeeds. An idempotent operation is one for which the effect of executing it any number of times is the same as the effect of executing it once.

In the embodiment of the application, when the metadata is directory metadata, the new cluster gateway takes the new cluster and the old cluster as target clusters, and each target cluster synchronously operates the metadata, so that the efficiency of operating the directory metadata can be ensured. Meanwhile, accidental errors in the metadata operation process can be avoided by combining idempotent operation and retry mechanisms, so that the reliability of the metadata operation process is guaranteed to the greatest extent.
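The combination of dual-cluster directory operations, idempotency, and client retry can be sketched as follows. This is only an illustration under stated assumptions: the `Namespace` class, the simulated `ConnectionError`, and the retry count are hypothetical, and applying the old cluster before the new one is one reading of the preference stated above.

```python
class Namespace:
    """Hypothetical directory namespace of one cluster, with simulated transient failures."""

    def __init__(self, name, fail_times=0):
        self.name = name
        self.dirs = {"/"}
        self._fail_times = fail_times

    def mkdir(self, path):
        """Idempotent: creating an existing directory leaves the namespace unchanged."""
        if self._fail_times > 0:
            self._fail_times -= 1
            raise ConnectionError(f"{self.name}: transient failure")
        self.dirs.add(path)


def mkdir_on_both(path, old, new):
    """Apply the directory operation to the old cluster first, then to the new cluster."""
    old.mkdir(path)
    new.mkdir(path)


def with_retry(operation, attempts=3):
    """Client-side retry: safe only because the operation is idempotent (attempts >= 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            operation()
            return True
        except ConnectionError as exc:
            last_error = exc
    raise last_error


old_ns = Namespace("old")
new_ns = Namespace("new", fail_times=1)  # the first call on the new cluster fails
ok = with_retry(lambda: mkdir_on_both("/logs", old_ns, new_ns))
print(ok, sorted(old_ns.dirs), sorted(new_ns.dirs))
```

On the first attempt the old cluster succeeds and the new cluster fails; the retry repeats the operation on both, and because `mkdir` is idempotent the old cluster ends up in the same state as if it had been executed once.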

In one embodiment, as shown in FIG. 6, an HDFS unified management method based on a metadata gateway is provided. As can be seen from FIG. 6, before unified management the client invokes metadata operations or data transfer directly on the HDFS system. After unified management, that is, after the name node gateway is added, the client's metadata operations on the name nodes go through the name node gateway, which may invoke the metadata operation of the new name node or operate on the metadata of the old name node. In addition, after unified management the client can work with both the new data nodes and the old data nodes.

In the unified management method of the embodiment of the application, a gateway is added to the new name node. After receiving a metadata operation request (adding, deleting, modifying, or querying directories/files, permissions, extended attributes, quotas, and the like), the gateway routes the request and determines whether it should be executed in the new cluster, in the old cluster, or in both clusters simultaneously. If data transmission (reading or writing file content) is required after the metadata operation, the name node of the corresponding cluster directs the user to perform the data transmission on the data nodes of that cluster; the data nodes of the new and old clusters are independent and not shared. With this method, the name nodes of the new and old clusters are unified behind the new cluster gateway while the data nodes remain independent, so that the resources of the old cluster are not wasted.
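The routing rule and the independence of the data nodes might be pictured with the short Python sketch below. The `CLUSTERS` table, the data node names, and both functions are hypothetical illustrations; the application does not define the format of the routing rules.

```python
# Hypothetical view of the two clusters as seen by the gateway.
CLUSTERS = {
    "old": {"files": {"/data/a.txt"}, "data_nodes": ["old-dn1", "old-dn2"]},
    "new": {"files": set(), "data_nodes": ["new-dn1"]},
}


def route(kind, path):
    """Return the names of the clusters whose name nodes should execute the request."""
    if kind == "directory":
        return ["old", "new"]  # directory operations go to both clusters
    # file operations go to whichever cluster owns the file
    return ["old"] if path in CLUSTERS["old"]["files"] else ["new"]


def data_nodes_for_file(path):
    """After the metadata operation, data transfer uses only the owning cluster's data nodes."""
    (target,) = route("file", path)
    return CLUSTERS[target]["data_nodes"]


print(route("directory", "/data"))
print(data_nodes_for_file("/data/a.txt"))
print(data_nodes_for_file("/data/new.txt"))
```

A file request never returns data nodes from both clusters, reflecting the statement that the data nodes of the new and old clusters are not shared.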

In the process of unified management of the new and old clusters by the new cluster gateway, the directory metadata of the new and old clusters must be synchronized, and whether the synchronization succeeds directly affects the unified management operation. The directory metadata of the new cluster is therefore synchronized in advance with the directory metadata of the old cluster before the new cluster gateway sends the operation request. On this basis, the pre-synchronization process for the directory metadata of the new cluster and the old cluster is explained below by way of an embodiment.

In one embodiment, as shown in FIG. 7, the synchronization process of the directory metadata of the new cluster and the old cluster includes:

S702, in response to a configuration request of the client, the distributed file system cluster used by the client is pointed to the new cluster gateway.

In the embodiment of the application, a gateway is added in the new cluster to act as a proxy for the name node of the new cluster, so that the new cluster and the old cluster remain independent internally while serving externally as a whole. To ensure that metadata operation requests are sent to the new cluster gateway, the client configuration must be changed first so that the distributed file system cluster used by the client points to the new cluster gateway.

S704, the new cluster gateway synchronizes the directory metadata of the old cluster into the directory metadata of the new cluster.

In the process of unified management by the new cluster gateway, as users continue to issue metadata operation requests, the scale of the name nodes and data nodes of both the new cluster and the old cluster keeps growing. Because the logical relationship between the name nodes and the data nodes within each cluster is already established, the name nodes of the new and old clusters can be associated to form a new cluster gateway containing routing rules and a user connection pool. A new cluster gateway created in this way can manage the new and old clusters in a unified manner through the preset routing rules.

The new cluster gateway may synchronize the directory metadata of the old cluster into the directory metadata of the new cluster in different ways. For example, after service on both the new and old clusters has been suspended, the directory metadata may be synchronized by a complete data migration using the distcp instruction; alternatively, the directory metadata may be synchronized while the new and old clusters are in service by combining a full synchronization mode with an incremental synchronization mode.

S706, if the new cluster gateway determines that synchronization of the directory metadata of the old cluster is complete, the new cluster is set to provide service externally.

After the new cluster gateway has synchronized all directory metadata of the old cluster into the directory metadata of the new cluster, it determines that synchronization of the directory metadata is complete. After synchronization, the old cluster and the new cluster share a single set of directory metadata, namely the directory metadata of the new cluster as it stands when synchronization completes.

According to the embodiment of the invention, the new cluster gateway is used for completing the synchronization of the directory metadata of the new cluster and the directory metadata of the old cluster, so that the directory metadata of the old cluster is updated into the directory metadata of the new cluster, and the new and the old clusters can serve as a whole. Meanwhile, the file data of the old cluster is still stored in the old cluster, so that the problem of resource waste caused by overall migration can be solved.

In the process of migrating old cluster data to the new cluster, unified management of the distributed file system can be achieved either through complete data migration or through partial data migration. On this basis, the data migration process of the old cluster is described below by way of one embodiment.

In one embodiment, as shown in FIG. 8, the new cluster gateway synchronizes the directory metadata of the old cluster into the directory metadata of the new cluster, including:

S802, the new cluster gateway traverses the existing directory metadata of the old cluster and fully synchronizes the existing directory metadata into the directory metadata of the new cluster.

Database synchronization includes both full synchronization and incremental synchronization. The full-volume synchronization is to synchronize all data at one time, and the incremental synchronization is to synchronize only different parts of two databases.

It should be noted that, during full synchronization, both the new cluster and the old cluster continue to provide service externally, and the directory metadata of both clusters is still being written. The new cluster gateway synchronizes the directory metadata of the old cluster with that of the new cluster, treating the directory metadata that the old cluster held before full synchronization as the existing directory metadata.

In the process of full synchronization, the new cluster gateway firstly traverses the database where the whole old cluster is located to acquire the existing directory metadata of the old cluster, and then synchronizes the full amount of the existing directory metadata into the directory metadata of the new cluster at one time to complete database synchronization of the directory metadata of the new cluster. The new cluster gateway marks the existing directory metadata of the old cluster while performing the full synchronization operation so as to distinguish the existing directory metadata and the non-existing directory metadata of the old cluster in the full synchronization process.

S804, after the full synchronization is completed, the new cluster gateway acquires temporary target metadata recorded in the incremental data temporary database of the old cluster in the full synchronization process.

It should be noted that, in the incremental synchronization process, both the new cluster and the old cluster are in a state of stopping external service, and at this time, the directory metadata of both the new cluster and the old cluster are no longer written.

In the process of full synchronization, the old cluster can generate temporary data due to external services. And determining all temporary data generated by the old cluster in the full synchronization process as temporary target metadata of the old cluster, and recording the temporary target metadata in an incremental data temporary database of the old cluster. That is, the directory metadata of the old cluster can be divided into the existing directory metadata generated before the full-size synchronization and the temporary target metadata generated in the full-size synchronization process according to the time of the full-size synchronization. Obviously, after full synchronization, the temporary target metadata of the old cluster is also stored in the incremental data temporary database entirely, and the new cluster gateway needs to acquire the temporary target metadata stored in the incremental data temporary database to synchronize all metadata of the old cluster.

S806, the new cluster gateway incrementally synchronizes the temporary target metadata into the directory metadata of the new cluster; wherein the new cluster and the old cluster stop providing service externally during the incremental synchronization.

After full synchronization is complete, the directory metadata of the new cluster contains the existing directory metadata of the old cluster. However, in the full synchronization process, the old cluster generates temporary target metadata due to the external service, and the temporary target metadata is obviously not synchronized to the directory metadata of the new cluster. Therefore, after the new cluster gateway acquires the temporary target metadata, incremental synchronization needs to be performed to synchronize the temporary target metadata of the old cluster into the directory metadata of the new cluster.

In the incremental synchronization process, the new cluster and the old cluster stop serving outside, and the new cluster gateway synchronizes the existing directory metadata and temporary target metadata of the old cluster into the directory metadata of the new cluster through full synchronization and incremental synchronization respectively, so that synchronization of all directory metadata of the old cluster is completed.

In the embodiment of the application, the existing directory metadata is synchronized by full synchronization while external service continues, and the temporary target metadata is synchronized by incremental synchronization while external service is stopped. This batched synchronization mode both ensures the integrity of the synchronized content and reduces the service interruption time.
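The two-phase synchronization of S802 to S806 can be sketched in Python as follows. This is a simplified model under stated assumptions: directories are represented as a set of paths, the incremental data temporary database is a plain list named `change_log`, and the operation names `mkdir` and `rmdir` are hypothetical.

```python
def full_sync(existing_dirs, new_dirs):
    """S802: copy the directory metadata that existed before full synchronization began."""
    new_dirs.update(existing_dirs)


def incremental_sync(change_log, new_dirs):
    """S804-S806: replay the changes recorded while full synchronization was running."""
    for op, path in change_log:
        if op == "mkdir":
            new_dirs.add(path)
        elif op == "rmdir":
            new_dirs.discard(path)


old_dirs = {"/", "/a", "/b"}
new_dirs = {"/"}
change_log = []  # stands in for the incremental data temporary database


def proxied_change(op, path):
    """A change arriving during full sync is applied to the old cluster and recorded."""
    if op == "mkdir":
        old_dirs.add(path)
    else:
        old_dirs.discard(path)
    change_log.append((op, path))


snapshot = set(old_dirs)            # existing directory metadata, taken while in service
proxied_change("mkdir", "/c")       # writes that arrive while the full sync is running
proxied_change("rmdir", "/a")
full_sync(snapshot, new_dirs)       # service continues during this phase

# Service is stopped here, so the log can no longer grow while it is replayed.
incremental_sync(change_log, new_dirs)
print(sorted(new_dirs))
```

Only the replay of `change_log` happens while service is stopped, so the length of the interruption depends on how many changes arrived during the full synchronization rather than on the total size of the namespace.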

In one embodiment, as shown in FIG. 9, a flow chart of the directory synchronization process is provided, which specifically includes:

(1) The gateway of the new cluster is started and pointed to the old cluster.

(2) The client configuration is modified so that the HDFS cluster used by the client points to the new cluster; at this point the new cluster acts as a transparent proxy of the old cluster.

(3) The new cluster starts full data synchronization, traversing the directory structure of the old cluster and synchronizing it to the new cluster.

(4) During full synchronization, each user request that changes metadata is recorded in the incremental data temporary file as it is proxied.

(5) After full synchronization is completed, the new cluster performs incremental synchronization according to the records in the incremental data temporary file.

(6) During incremental synchronization, client access is stopped; neither the new cluster nor the old cluster provides service externally, and the whole HDFS service is interrupted.

(7) Synchronization is confirmed to be complete; the new cluster provides service externally as normal, and the old cluster no longer provides service externally.

(8) Client access to the new cluster is restored, and the unified management flow ends.

In the embodiment of the application, the synchronization of the directory metadata of the new and old clusters is the only step in the whole unified management flow that can block service. The existing directory metadata is synchronized by full synchronization while external service continues, and the temporary target metadata is synchronized by incremental synchronization while external service is stopped. In summary, the directory synchronization optimization method provided by the embodiment of the application can reduce the overall blocking time.

It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a data processing device for realizing the above related data processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the data processing device provided below may refer to the limitation of the data processing method hereinabove, and will not be repeated herein.

In one embodiment, as shown in FIG. 10, there is provided a data processing apparatus 1000 comprising: a cluster determination module 1020 and an execution indication module 1040, wherein:

a cluster determination module 1020, configured to determine a target cluster in response to an operation request for metadata; the target cluster comprises a new cluster and/or an old cluster pointed to by a new cluster gateway; wherein the directory metadata of the new cluster is synchronized in advance with the directory metadata of the old cluster;

an execution indication module 1040, configured to send the operation request to the name node of the target cluster and instruct the name node of the target cluster to execute the metadata operation.

In one embodiment, cluster determination module 1020 includes:

a connection pool unit, configured such that the new cluster gateway includes a pre-constructed user connection pool, the user connection pool being used for caching connections for accessing the name node of the old cluster.

In one embodiment, cluster determination module 1020 further comprises:

a transmission unit, configured such that, if a request for data transmission is received after the metadata operation, the name node of the new cluster or the name node of the old cluster performs the data transmission in the data nodes of its own cluster.

In one embodiment, cluster determination module 1020 further comprises:

a judging unit, configured for the new cluster gateway to detect whether the file metadata exists in the old cluster;

a first determining unit, configured for the new cluster gateway to determine that the target cluster is the old cluster if the file metadata exists in the old cluster;

a second determining unit, configured for the new cluster gateway to determine that the target cluster is the new cluster if the file metadata does not exist in the old cluster.

In one embodiment, the execution indication module 1040 includes:

the maintenance unit is used for maintaining file metadata in the old cluster by the name node of the old cluster if the target cluster is the old cluster;

and the creating unit is used for creating file metadata in the new cluster by the name node of the new cluster if the target cluster is the new cluster.

In one embodiment, the cluster determination module 1020 is configured to determine that the target cluster is the new cluster and the old cluster; accordingly, the execution indication module 1040 is configured for the name node of the new cluster and the name node of the old cluster to synchronously perform the directory metadata operation in their respective clusters.

In one embodiment, cluster determination module 1020 further comprises:

the pointing unit is used for responding to the configuration request of the client and pointing the distributed file system cluster used by the client to the new cluster gateway;

the synchronization unit is used for synchronizing the directory metadata of the old cluster into the directory metadata of the new cluster by the new cluster gateway;

a setting unit, configured to set the new cluster to provide service externally if the new cluster gateway determines that synchronization of the directory metadata of the old cluster is complete.

In one embodiment, the synchronization unit comprises:

a traversing subunit, configured to traverse the existing directory metadata of the old cluster by using the new cluster gateway, and synchronize the total amount of the existing directory metadata to the directory metadata of the new cluster;

the acquisition subunit is used for acquiring temporary target metadata recorded in the incremental data temporary database of the old cluster in the full synchronization process after the full synchronization of the new cluster gateway is completed;

the synchronization subunit is used for synchronizing the temporary target metadata increment to the directory metadata of the new cluster by the new cluster gateway; wherein the new cluster and the old cluster stop servicing the outside during the incremental synchronization.

Each of the modules in the above-described data processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 11. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing data processing data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data processing method.

It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:

In one embodiment, the processor when executing the computer program further performs the steps of:

the new cluster gateway comprises a pre-constructed user connection pool, and the user connection pool is used for caching the connection of the name node accessing the old cluster.

if a request for data transmission is received after the metadata operation, the name node of the new cluster or the name node of the old cluster performs the data transmission in the data nodes of its own cluster.

after the completion of the full synchronization of the new cluster gateway, acquiring temporary target metadata recorded in an incremental data temporary database of an old cluster in a full synchronization process;

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:

In one embodiment, the computer program when executed by the processor further performs the steps of:

In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.

The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

The foregoing examples represent only a few embodiments of the present application, which are described in more detail and are not thereby to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A method of data processing, the method comprising:

in response to an operation request for metadata, determining, by a new cluster gateway, a target cluster; the target cluster comprises the new cluster and/or an old cluster pointed to by the new cluster gateway; wherein the directory metadata of the new cluster is synchronized in advance with the directory metadata of the old cluster;

and the new cluster gateway sends the operation request to the name node of the target cluster, and instructs the name node of the target cluster to execute the metadata operation.

2. The method of claim 1, wherein the new cluster gateway includes a pre-built user connection pool for caching connections of name nodes accessing the old cluster.

3. The method according to claim 1 or 2, wherein, if a request for data transmission is received after the operation of the metadata, the name node of the new cluster or the name node of the old cluster performs the data transmission in its respective data nodes.

4. The method according to claim 1 or 2, wherein the metadata is file metadata, and wherein the determining the target cluster by the new cluster gateway comprises:

the new cluster gateway detects whether the file metadata exists in the old cluster;

if so, the new cluster gateway determines that the target cluster is the old cluster;

and if not, the new cluster gateway determines that the target cluster is the new cluster.

5. The method of claim 4, wherein the executing of the metadata operation by the name node of the target cluster comprises:

if the target cluster is the old cluster, the name node of the old cluster maintains the file metadata in the old cluster;

and if the target cluster is the new cluster, the name node of the new cluster creates the file metadata in the new cluster.
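
For illustration, the file metadata routing of claims 4 and 5 can be sketched as below. This is a hedged sketch only: the in-memory `NameNode` class, its `has_file` and `apply` methods, and the function `route_file_operation` are hypothetical names introduced here, and a real gateway would issue remote calls to the name nodes rather than operate on dictionaries.

```python
class NameNode:
    """Hypothetical in-memory stand-in for a cluster's name node."""

    def __init__(self, name):
        self.name = name
        self.files = {}  # path -> file metadata attributes

    def has_file(self, path):
        return path in self.files

    def apply(self, path, attrs):
        # Creates the entry if absent, otherwise updates (maintains) it.
        self.files.setdefault(path, {}).update(attrs)


def route_file_operation(old_nn, new_nn, path, attrs):
    """Gateway decision: files already known to the old cluster stay there;
    all other files are created in the new cluster."""
    target = old_nn if old_nn.has_file(path) else new_nn
    target.apply(path, attrs)
    return target.name


old_nn, new_nn = NameNode("old"), NameNode("new")
old_nn.files["/data/a.txt"] = {"owner": "alice"}

routed_existing = route_file_operation(old_nn, new_nn, "/data/a.txt", {"mode": "644"})
routed_new = route_file_operation(old_nn, new_nn, "/data/b.txt", {"owner": "bob"})
```

Under these assumptions, `/data/a.txt` continues to be maintained by the old cluster, while `/data/b.txt` is created only in the new cluster, so no file data has to be copied between the two clusters.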

6. The method according to claim 1 or 2, wherein the metadata is the directory metadata, and the determining, by the new cluster gateway, the target cluster comprises:

the new cluster gateway determines that the target cluster is both the new cluster and the old cluster;

and the name node of the new cluster and the name node of the old cluster synchronously perform the directory metadata operation in their respective clusters.
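
The directory handling of claim 6 can be illustrated with the following minimal sketch, in which the same directory operation is applied on the name nodes of both clusters. The `NameNode` class, the `mkdir` and `rmdir` method names, and `apply_directory_operation` are hypothetical, and error handling for the case where one cluster fails is omitted.

```python
class NameNode:
    """Hypothetical in-memory stand-in holding only directory metadata."""

    def __init__(self, name):
        self.name = name
        self.directories = set()

    def mkdir(self, path):
        self.directories.add(path)

    def rmdir(self, path):
        self.directories.discard(path)


ALLOWED_OPERATIONS = ("mkdir", "rmdir")


def apply_directory_operation(name_nodes, operation, path):
    """Apply one directory operation on every target cluster's name node."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported directory operation: {operation}")
    for nn in name_nodes:
        getattr(nn, operation)(path)
    return [nn.name for nn in name_nodes]


old_nn, new_nn = NameNode("old"), NameNode("new")
targets = apply_directory_operation([new_nn, old_nn], "mkdir", "/warehouse/logs")
```

Keeping the directory trees identical in this way is what allows a later file operation to be routed to either cluster without first creating missing parent directories.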

7. The method according to claim 1 or 2, wherein the synchronization procedure of the directory metadata of the new cluster and the old cluster comprises:

in response to a configuration request from a client, pointing the distributed file system cluster used by the client to the new cluster gateway;

and the new cluster gateway synchronizes the directory metadata of the old cluster into the directory metadata of the new cluster.

8. The method of claim 7, wherein the synchronizing, by the new cluster gateway, of the directory metadata of the old cluster into the directory metadata of the new cluster comprises:

after completing full synchronization, the new cluster gateway acquires temporary target metadata that was recorded in an incremental data temporary database of the old cluster during the full synchronization;

and the new cluster gateway incrementally synchronizes the temporary target metadata into the directory metadata of the new cluster; wherein the new cluster and the old cluster stop providing external service during the incremental synchronization.
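
The two-phase synchronization of claim 8 might be sketched as follows. This is an illustrative sketch under assumptions: directory metadata is modelled as a set of paths, the incremental data temporary database is modelled as a list of `(operation, path)` pairs, and the `Gateway` class with its `serving` flag is a hypothetical way to represent the pause in external service.

```python
class Gateway:
    """Hypothetical gateway exposing a flag for whether requests are accepted."""

    def __init__(self):
        self.serving = True


def synchronize_directories(gateway, old_snapshot, new_dirs, incremental_log):
    """Full synchronization followed by a short incremental catch-up."""
    # Phase 1: full synchronization from a snapshot of the old cluster;
    # the clusters keep serving, and changes are recorded in incremental_log.
    new_dirs.update(old_snapshot)

    # Phase 2: pause external service and replay the recorded changes.
    gateway.serving = False
    try:
        for operation, path in incremental_log:
            if operation == "mkdir":
                new_dirs.add(path)
            elif operation == "rmdir":
                new_dirs.discard(path)
        incremental_log.clear()
    finally:
        gateway.serving = True  # resume service once the new cluster has caught up


gateway = Gateway()
new_dirs = set()
log = [("mkdir", "/c"), ("rmdir", "/b")]  # changes made during the full sync
synchronize_directories(gateway, {"/a", "/b"}, new_dirs, log)
```

Because only the changes recorded during the full synchronization are replayed while service is paused, the pause in this sketch is proportional to the size of the incremental log rather than to the size of the whole directory tree.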

9. A data processing apparatus, the apparatus comprising:

a cluster determining module, configured to determine a target cluster in response to a metadata operation request; the target cluster comprises a new cluster and/or an old cluster pointed to by a new cluster gateway; wherein the directory metadata of the new cluster is synchronized in advance with the directory metadata of the old cluster;

and an execution indication module, configured to send the operation request to the name node of the target cluster and to instruct the name node of the target cluster to execute the metadata operation.

10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 8.