CN117992419A - Data processing method and device based on HDFS (Hadoop distributed File System)

Info

Publication number: CN117992419A
Application number: CN202211350151.0A
Authority: CN
Inventors: 何洋; 罗先强; 王�锋
Original assignee: Chengdu Huawei Technology Co Ltd
Current assignee: Chengdu Huawei Technology Co Ltd
Priority date: 2022-10-31
Filing date: 2022-10-31
Publication date: 2024-05-07

Abstract

The application provides a data processing method and device based on an HDFS system. The method proposes to configure a first storage unit for storing HDFS system metadata in a data processing device connected to a host, and a second storage unit for storing hotspot data required by the host. When the first data is needed by the host, the data processing device rapidly acquires the first data from the HDFS system through the metadata in the first storage unit and provides the first data for the host when the first data is not stored in the second storage unit, so that the efficiency of data processing of the host is improved.

Description

Data processing method and device based on HDFS (Hadoop distributed File System)

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus based on an HDFS system.

Background

The Hadoop distributed file system (Hadoop Distributed File System, HDFS) is composed of metadata nodes (NameNodes) and data nodes (DataNodes). The metadata nodes provide metadata management services for the HDFS system. The data nodes store data in the form of data blocks and provide read and write services for those data blocks to clients.

In a specific implementation, taking a data read as an example, the client first sends a data read request to the metadata node and receives the metadata returned by the metadata node, which represents the storage location of the data to be read. The client then interacts with the corresponding data node, as indicated by the received metadata, to acquire the data. When executing a processing procedure, the client may need to read data multiple times. Because of the separate storage architecture of the HDFS system, the client must interact with a metadata node to obtain metadata every time it reads data, which makes data processing based on the HDFS system inefficient.
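The cost described above can be illustrated with a minimal sketch. It assumes in-memory stand-ins for the metadata node and the data nodes; the names MetadataNode, DataNode, and baseline_read are hypothetical and are not part of the HDFS API or of the application.

```python
class MetadataNode:
    """Hypothetical stand-in for a metadata node (not the HDFS API)."""

    def __init__(self, block_locations):
        self.block_locations = block_locations  # path -> (data node id, block id)
        self.lookups = 0                        # number of client round trips

    def locate(self, path):
        self.lookups += 1
        return self.block_locations[path]


class DataNode:
    """Hypothetical stand-in for a data node holding blocks in memory."""

    def __init__(self, blocks):
        self.blocks = blocks  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def baseline_read(path, metadata_node, data_nodes):
    # Conventional flow: query the metadata node first, then the data node.
    node_id, block_id = metadata_node.locate(path)
    return data_nodes[node_id].read_block(block_id)


metadata_node = MetadataNode({"/logs/a": ("dn1", "blk_1")})
data_nodes = {"dn1": DataNode({"blk_1": b"hello"})}

# Reading the same data three times costs three metadata lookups.
results = [baseline_read("/logs/a", metadata_node, data_nodes) for _ in range(3)]
```

In this sketch the lookup counter grows with every read, which is the per-read metadata interaction that the background identifies as the source of inefficiency.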

Disclosure of Invention

The embodiment of the application provides a data processing method and device based on an HDFS (Hadoop distributed File System), which are used for improving the efficiency of data processing of a host.

In a first aspect, the present application provides a data processing method based on an HDFS system, where the HDFS system includes a metadata node and a data node, the method is applied to a data processing apparatus, where the data processing apparatus is connected to a host, the data processing apparatus includes a first storage unit and a second storage unit, the first storage unit stores metadata in the metadata node of the HDFS system, and the second storage unit stores hot spot data required by the host, where the method includes:

receiving a first request from the host; the first request is used for requesting first data;

when the first data is determined not to be stored in the second storage unit, sending the first request to the data node according to the metadata of the first data stored in the first storage unit; metadata of the first data indicates a storage location of the first data at the data node;

and receiving the first data sent by the data node, and sending the first data to the host.
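The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the two storage units are modelled as dictionaries, the data nodes as nested dictionaries, and the class name DataProcessingDevice and its attributes are hypothetical rather than taken from the application.

```python
class DataProcessingDevice:
    """Hypothetical model of the data processing apparatus."""

    def __init__(self, metadata, hot_data, data_nodes):
        self.metadata = metadata      # first storage unit: key -> (node id, block id)
        self.hot_data = hot_data      # second storage unit: key -> bytes
        self.data_nodes = data_nodes  # node id -> {block id: bytes}
        self.forwarded = []           # locations requested from data nodes

    def handle_first_request(self, key):
        if key not in self.hot_data:
            # Miss in the second storage unit: resolve the storage location
            # from the locally stored metadata and ask the data node.
            node_id, block_id = self.metadata[key]
            self.forwarded.append((node_id, block_id))
            return self.data_nodes[node_id][block_id]
        # Hit in the second storage unit: serve the data locally.
        return self.hot_data[key]


device = DataProcessingDevice(
    metadata={"/data/a": ("dn1", "blk_7")},
    hot_data={},
    data_nodes={"dn1": {"blk_7": b"payload"}},
)
first_data = device.handle_first_request("/data/a")
```

No metadata node appears in this read path, because the storage location is resolved from the first storage unit.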

Based on the above scheme, the application provides a data processing device which is provided with partial processing functions of a host and an HDFS system and is configured with metadata and hot spot data of the HDFS system, wherein the host can interact with the data processing device through an I/O interface when processing data, and can quickly acquire required data according to the metadata stored in the data processing device. Compared with the prior art that a host needs to establish TCP links with metadata nodes and data nodes of an HDFS system respectively to acquire data, the scheme of the application can effectively improve the efficiency of host data processing.

In some embodiments, the method further comprises:

receiving a second request from the host; the second request is used for requesting metadata of second data;

reading the metadata of the second data from the first storage unit when it is determined that the metadata of the second data is stored in the first storage unit;

and sending the metadata of the second data to the host.

In the prior art, a host needs to establish a network link with a metadata node to acquire metadata every time data processing is performed. In the scheme of the application, the host can directly acquire the required metadata through the connected data processing device.

In some embodiments, the method further comprises:

transmitting the second request to the metadata node when it is determined that the metadata of the second data is not stored in the first storage unit;

receiving the metadata of the second data sent by the metadata node, and sending the metadata of the second data to the host;

and storing the metadata of the second data in the first storage unit.
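The metadata path for the second request (serve from the first storage unit when possible, otherwise ask the metadata node and keep the answer) can be sketched as below. The class MetadataCache and the callable fetch_from_metadata_node are hypothetical names, and the metadata node is replaced by a local function for illustration.

```python
class MetadataCache:
    """Hypothetical model of the first storage unit with a remote fallback."""

    def __init__(self, fetch_from_metadata_node):
        self._entries = {}                      # path -> metadata
        self._fetch = fetch_from_metadata_node  # called only on a miss

    def get(self, path):
        if path in self._entries:
            return self._entries[path]          # stored locally: no network access
        metadata = self._fetch(path)            # forward the request to the metadata node
        self._entries[path] = metadata          # keep it for later requests
        return metadata


remote_calls = []


def fake_metadata_node(path):
    # Stand-in for the metadata node; records every remote request.
    remote_calls.append(path)
    return {"node": "dn1", "block": "blk_" + path.strip("/")}


cache = MetadataCache(fake_metadata_node)
first = cache.get("/a")   # miss: fetched from the metadata node and stored
second = cache.get("/a")  # hit: served from the first storage unit
```

Only the first call reaches the stand-in metadata node; the second is answered locally.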

In some embodiments, the method further comprises:

reading the first data from the second storage unit when it is determined that the first data is included in the hotspot data;

and sending the first data to the host.

Based on the above scheme, the present application proposes that the second storage unit, which stores hot spot data commonly used by the host, is configured in the data processing device. When the data processing device receives a data read request from the host, it may first determine whether the requested data is in the second storage unit. If so, the data can be read directly from the second storage unit and sent to the host. Compared with the prior art, this saves not only the step of establishing a link between the host and the metadata node but also the step of establishing a link with the data node, which further improves the efficiency of data processing.

In some embodiments, the first request is generated during execution of a first processing procedure by the host, and the second storage unit includes at least one storage space; the determining that the hotspot data stored in the second storage unit does not include the first data includes:

determining that the first data is not included in a first storage space corresponding to the first processing procedure; the first storage space is one of the at least one storage space.

In some embodiments, the method further comprises:

after receiving the first data sent by the data node, storing the first data into the first storage space.

Based on the above scheme, the application proposes that, when the first storage space does not hold the data required by the host, the data processing device acquires that data and then stores it in the first storage space as new hot spot data.

In some embodiments, the at least one storage space further includes a second storage space storing the first data required in performing a second processing procedure, the method further comprising:

deleting the first data in the second storage space when the first data in the first storage space is modified.

In some embodiments, the method further comprises:

transmitting an update instruction to the data node when the first data in the second storage unit is modified; the update instruction carries the modified first data.

Based on the scheme, the application provides an invalidation mechanism for hot spot data across the different processing procedures of a host. When any data in the storage space corresponding to one processing procedure is updated, the data processing device deletes that data from every other storage space that holds it, so that the accuracy of the data in each storage space is ensured.
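The per-procedure storage spaces and the invalidation on modification can be sketched as follows. This is an illustrative model only: HotDataStore, its methods, and the send_update callback (standing in for the update instruction sent to the data node) are hypothetical names.

```python
class HotDataStore:
    """Hypothetical second storage unit with one storage space per processing procedure."""

    def __init__(self, send_update):
        self.spaces = {}                # procedure id -> {key: bytes}
        self.send_update = send_update  # stand-in for the update instruction

    def put(self, procedure_id, key, value):
        self.spaces.setdefault(procedure_id, {})[key] = value

    def get(self, procedure_id, key):
        return self.spaces.get(procedure_id, {}).get(key)

    def modify(self, procedure_id, key, value):
        # Update the copy in the modifying procedure's own storage space.
        self.spaces.setdefault(procedure_id, {})[key] = value
        # Delete the copies held in every other storage space.
        for other_id, space in self.spaces.items():
            if other_id != procedure_id:
                space.pop(key, None)
        # Propagate the modified data to the data node.
        self.send_update(key, value)


updates = []
store = HotDataStore(send_update=lambda key, value: updates.append((key, value)))
store.put("proc1", "/a", b"v1")
store.put("proc2", "/a", b"v1")
store.modify("proc1", "/a", b"v2")
```

After the modification, the second procedure no longer sees a copy of the data and would fetch the updated version on its next read.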

In some embodiments, the method further comprises:

deleting the expired data in the hot spot data when the remaining capacity of the second storage unit is smaller than a capacity threshold; the expired data is hot spot data for which the time difference between the most recent access time and the current time is larger than a time threshold, or hot spot data whose number of accesses is smaller than a count threshold.
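The eviction rule above can be sketched as a single function. The names HotEntry, evict_expired, max_idle_seconds, and min_access_count are hypothetical, and the units (bytes and seconds) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HotEntry:
    size: int           # bytes occupied in the second storage unit
    last_access: float  # timestamp of the most recent access, in seconds
    access_count: int   # number of accesses so far


def evict_expired(entries, capacity, capacity_threshold, now,
                  max_idle_seconds, min_access_count):
    """Delete expired entries only when remaining capacity is below the threshold."""
    used = sum(entry.size for entry in entries.values())
    if capacity - used >= capacity_threshold:
        return []  # enough room left: nothing is deleted
    expired = [
        key for key, entry in entries.items()
        if now - entry.last_access > max_idle_seconds  # idle for too long
        or entry.access_count < min_access_count       # accessed too rarely
    ]
    for key in expired:
        del entries[key]
    return expired


entries = {
    "a": HotEntry(size=40, last_access=0.0, access_count=10),  # idle too long
    "b": HotEntry(size=40, last_access=95.0, access_count=1),  # rarely accessed
    "c": HotEntry(size=10, last_access=99.0, access_count=5),  # still hot
}
evicted = evict_expired(entries, capacity=100, capacity_threshold=20,
                        now=100.0, max_idle_seconds=60.0, min_access_count=2)
```

Here the remaining capacity is 10, which is below the threshold of 20, so the idle entry and the rarely accessed entry are removed while the recently and frequently used entry is kept.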

In some embodiments, the first request carries a storage location of first data at the data node, the method further comprising:

after the first request is sent to the data node, receiving a first response returned by the data node; the first response is used for representing that the first data does not exist at the storage location carried by the first request;

sending a third request to the metadata node; the third request is used for requesting the metadata of the first data in the metadata node;

and replacing the metadata of the first data stored in the first storage unit by adopting the metadata of the first data returned by the metadata node.

Based on the scheme, the application provides an updating mechanism of the metadata stored in the first storage unit, so that the consistency of the metadata of the data processing device and the metadata of the metadata node of the HDFS system is improved, and the problem that the data cannot be read due to untimely updating of the metadata is avoided.
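The metadata update mechanism can be sketched as below. The sketch models the first response as an exception and retries the read after replacing the stored metadata; the retry is an assumption added for illustration, since the text only states that the stored metadata is replaced. The names BlockNotFound, read_with_refresh, read_block, and fetch_metadata are hypothetical.

```python
class BlockNotFound(Exception):
    """Stands in for the first response: no data at the requested location."""


def read_with_refresh(key, metadata_cache, read_block, fetch_metadata):
    try:
        return read_block(metadata_cache[key])
    except BlockNotFound:
        # Third request: ask the metadata node for current metadata and
        # replace the stale entry in the first storage unit.
        metadata_cache[key] = fetch_metadata(key)
        # Assumption: the read is retried with the refreshed location.
        return read_block(metadata_cache[key])


blocks = {("dn2", "blk_1"): b"moved"}  # the block now lives on dn2


def read_block(location):
    if location not in blocks:
        raise BlockNotFound(location)
    return blocks[location]


metadata_cache = {"/a": ("dn1", "blk_1")}  # stale location
data = read_with_refresh("/a", metadata_cache, read_block,
                         fetch_metadata=lambda key: ("dn2", "blk_1"))
```

After the call, the first storage unit holds the location returned by the stand-in metadata node, so later reads avoid the failed attempt.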

In a second aspect, an embodiment of the present application provides a data processing apparatus based on an HDFS system, where the HDFS system includes a metadata node and a data node, where the apparatus is connected to a host, and the apparatus includes a first storage unit and a second storage unit, where metadata in the metadata node of the HDFS system is stored in the first storage unit, and where hot spot data required by the host is stored in the second storage unit, where the apparatus further includes:

A communication unit configured to receive a first request from the host; the first request is used for requesting first data;

The processing unit is used for sending the first request to the data node according to the metadata of the first data stored in the first storage unit when the first data is determined not to be stored in the second storage unit; metadata of the first data indicates a storage location of the first data at the data node;

the communication unit is further configured to receive the first data sent by the data node, and send the first data to the host.

In some embodiments, the communication unit is further configured to receive a second request from the host; the second request is for requesting metadata of second data;

The processing unit is further used for reading the metadata of the second data from the first storage unit when it is determined that the metadata of the second data is stored in the first storage unit;

The communication unit is further configured to send metadata of the second data to the host.

In some embodiments, the communication unit is further configured to:

And receiving the metadata of the second data sent by the metadata node, and sending the metadata of the second data to the host.

In some embodiments, the processing unit is further configured to:

The communication unit is instructed to send the first data to the host.

In some embodiments, the first request is generated during execution of a first processing procedure by the host, and the second storage unit includes at least one storage space; the processing unit is specifically configured to:

In some embodiments, the processing unit is further configured to:

In some embodiments, the at least one memory space further comprises a second memory space storing first data required in performing a second processing procedure, the processing unit further configured to:

In some embodiments, the communication unit is further configured to:

In some embodiments, the processing unit is further configured to:

In some embodiments, the first request carries the storage location of the first data at the data node, and the communication unit is further configured to receive a first response returned by the data node after sending the first request to the data node; the first response is used for representing that the first data does not exist in a storage position carried by the first request;

The processing unit is further configured to replace metadata of the first data stored in the first storage unit with metadata of the first data returned by the metadata node.

In a third aspect, an embodiment of the present application provides a data processing apparatus based on an HDFS system, including: a processor and a memory; the memory comprises a first memory unit and a second memory unit; metadata in metadata nodes of the HDFS system are stored in the first storage unit, and hot spot data required by the host are stored in the second storage unit; the memory is also used for storing programs; the processor is configured to execute the program stored in the memory, so that the apparatus implements the method as described in the first aspect or any of the possible designs of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing program code which, when run on a computer, causes the computer to perform the method of the first aspect and possible designs of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect and possible designs of the first aspect.

In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor, and the processor is configured to invoke a computer program or a computer instruction stored in a memory, so as to execute the method described in the first aspect and the possible designs of the first aspect. Optionally, the memory includes a first storage unit and a second storage unit, where the first storage unit stores metadata in a metadata node of the HDFS system, and the second storage unit stores hot spot data required by the host.

In a seventh aspect, embodiments of the present application provide a processor for invoking a computer program or computer instructions stored in a memory to cause the processor to perform the above described method of the first aspect and possible designs of the first aspect.

Based on the implementation provided in the above aspects, the embodiments of the present application may be further combined to provide further implementations.

The technical effects that may be achieved by any one of the possible designs of the second aspect to the seventh aspect may be correspondingly described with reference to the technical effects that may be achieved by any one of the possible designs of the first aspect, and the description will not be repeated.

Drawings

FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present application;

FIG. 2 is a schematic diagram of another data processing system architecture according to an embodiment of the present application;

FIG. 3 is a schematic diagram illustrating a process of reading data by a host according to an embodiment of the present application;

FIG. 4 is a schematic diagram of another data processing system according to an embodiment of the present application;

FIG. 5 is a flowchart of a data processing method based on an HDFS system according to an embodiment of the present application;

FIG. 6 is a flowchart of another data processing method based on an HDFS system according to an embodiment of the present application;

FIG. 7 is a flowchart of a method for reading first data according to an embodiment of the present application;

FIG. 8 is a schematic diagram of another data processing system architecture according to an embodiment of the present application;

FIG. 9 is a flowchart of a data processing method according to an embodiment of the present application;

FIG. 10 is a schematic diagram of a data processing apparatus based on an HDFS system according to an embodiment of the present application;

FIG. 11 is a schematic diagram of another data processing apparatus based on an HDFS system according to an embodiment of the present application.

Detailed Description

In order to facilitate understanding of the data processing method based on the HDFS system provided by the embodiment of the present application, concepts and terms related to the embodiment of the present application will be first described briefly.

(1) Data processing unit (data processing unit, DPU): a DPU can be understood as a special-purpose processor and is regarded as the third most important computing chip in data center scenarios, after the central processing unit (central processing unit, CPU) and the graphics processing unit (graphics processing unit, GPU). The DPU provides a compute engine for high-bandwidth, low-latency, data-intensive computing scenarios. The DPU may be connected to a host and may assist the host in performing certain data processing tasks. Because of its strong data processing performance, the DPU can support offloading of basic functions such as virtualization, storage, networking, and security. That is, these basic functions of the host can be offloaded to the DPU, which implements them and thereby accelerates processing. In some scenarios, to facilitate data processing by the DPU, a dedicated memory is also configured in the DPU.

(2) Virtual machine (VM): a virtual environment that operates in a manner similar to a computer. A VM runs in an isolated partition of its host and has its own CPU capacity, memory, operating system (e.g., Windows, Linux, macOS), and other resources. A hypervisor separates the host's resources from the hardware and allocates them appropriately so that they can be used by the VM.

(3) File system (file system): a file system is the method and data structure that an operating system uses to organize files on a storage device (typically a disk, but also a NAND flash-based solid state disk) or partition. The software mechanism responsible for managing and storing file information in an operating system is called a file management system, or file system for short.

The file system consists of three parts: the interface of the file system, the set of software for object manipulation and management, and the objects and their attributes. From a system perspective, a file system organizes and allocates the space of file storage devices, is responsible for storing files, and protects and retrieves stored files. Specifically, the file system is responsible for creating files for users; storing, reading, modifying, and dumping files; controlling access to files; and deleting files when users no longer use them.

(4) Virtual file system (VFS): the VFS is a kernel software layer that serves as an interface layer between the physical file systems and the services that use them. It abstracts the details of each file system of the operating system so that different file systems appear identical to the operating system kernel and to other processes running in the operating system. Strictly speaking, the VFS is not an actual file system: it exists only in memory and not in any external storage space. The VFS is established at system start-up and disappears at system shutdown.

(5) High-speed serial computer expansion bus standard (peripheral component interconnect express, PCIe): a standard for data transmission inside a computer. PCIe provides high-speed, serial, point-to-point, dual-channel, high-bandwidth transmission. Each connected device is allocated dedicated channel bandwidth and does not share bus bandwidth. Each lane in a PCIe connection consists of two pairs of wires, one pair for transmitting and the other for receiving.

(6) Remote procedure call protocol (remote procedure call protocol, RPC): a protocol for requesting services from a program on a remote computer over a network without requiring knowledge of the underlying network technology. The RPC protocol allows a program running on one computer to call a subroutine on another computer. The RPC protocol assumes the presence of a transport protocol, such as the transmission control protocol (transmission control protocol, TCP) or the user datagram protocol (user datagram protocol, UDP), to carry data between the communicating programs. In the open systems interconnection (open system interconnection, OSI) network communication model, RPC spans the transport layer and the application layer.

(7) Metadata: information that describes attributes of data and supports functions such as indicating the storage location of data, recording historical data, resource lookup, and file recording.

Referring to FIG. 1, a schematic diagram of a data processing system architecture is provided according to an embodiment of the present application. Included in the data processing system are a host 100 and a file storage system 300. The host 100 may be a computing device deployed on the user side, and may be a physical machine or a virtual machine. Physical machines include, but are not limited to, desktop computers, servers (e.g., application servers, file servers, database servers, etc.), notebook computers, physical servers in a cloud computing cluster, computing devices in a computing device cluster, servers of a network management center, and mobile devices. The host 100 may communicate with other devices in the system and receive data sent by the other devices.

In FIG. 1, HDFS is taken as an example of the file storage system. HDFS may be understood as a separate storage architecture. The HDFS system is a distributed file storage system composed of metadata nodes and data nodes; data stored in the HDFS system is split into multiple data blocks (blocks) that are stored on different data nodes. The metadata node stores the metadata of the HDFS system and is used for managing the namespace of the HDFS system, maintaining the directory tree structure of the file system, and maintaining the location mapping of data blocks. The data node stores the data of the HDFS system in the form of data blocks, manages the stored data blocks, and periodically synchronizes a list of its stored data blocks to the metadata node, where the data block list includes the storage addresses of the stored data blocks. The data node also provides data read and write services for clients. Illustratively, to address the problem that metadata nodes cannot be scaled out, the related art also proposes that an HDFS system may use a resolution mode to divide the namespace into a plurality of partitions, each partition having a metadata node that stores the metadata of that partition.
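The block splitting and the periodic block list synchronization described above can be sketched as follows. The names split_into_blocks, BlockLocationIndex, and apply_block_report are hypothetical, the block size is an arbitrary small value chosen for illustration, and the sketch does not reproduce the real HDFS data structures.

```python
def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockLocationIndex:
    """Hypothetical model of the block location mapping kept by a metadata node."""

    def __init__(self):
        self.holders = {}  # block id -> set of data node ids

    def apply_block_report(self, node_id, block_ids):
        # A report replaces whatever was previously known about this data node.
        for nodes in self.holders.values():
            nodes.discard(node_id)
        for block_id in block_ids:
            self.holders.setdefault(block_id, set()).add(node_id)


blocks = split_into_blocks(b"abcdefgh", block_size=3)

index = BlockLocationIndex()
index.apply_block_report("dn1", ["blk_0", "blk_1"])
index.apply_block_report("dn2", ["blk_1"])
index.apply_block_report("dn1", ["blk_0"])  # dn1 no longer reports blk_1
```

The third report shows why the synchronization is periodic: the mapping on the metadata node follows whatever each data node currently stores.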

In some possible scenarios, at the software level, the host may include one or more client Java virtual machines (client Java virtual machine, client JVM), as shown in FIG. 2. The client JVM includes an application, a distributed file system module (distributed filesystem), and an HDFS input stream module (HDFS inputstream). The distributed file system module is used for acquiring metadata from the metadata nodes of the HDFS system, and the HDFS input stream module is used for acquiring data from the data nodes of the HDFS system. The applications in the host can integrate the software development kit (software development kit, SDK) of the HDFS system. The SDK provides a plurality of application programming interfaces (application programming interface, API), through which the host can establish network links and exchange information with the metadata nodes or data nodes of the HDFS system. When a host initiates a data processing request, for example to read a certain item of data, it can first establish a TCP link with a metadata node to request the location information of the data to be read, and then establish a TCP link with the corresponding data node according to the obtained location information to obtain the data. For ease of understanding, refer to FIG. 3, which is a schematic diagram of a process in which a host reads data; the reference numerals indicate the specific sequence of steps.

In some possible embodiments, a data processing apparatus 200 may also be included in the data processing system, as shown in FIG. 4. The host 100 is connected to the data processing device 200, for example, the data processing device 200 may be plugged into the host 100. In some scenarios, data processing apparatus 200 may also be deployed inside host 100.

Illustratively, the host 100 may have a data processing function, and may feed back the data processing result to other devices. In hardware, host 100 may include an I/O interface 110, a processor 120, and a memory 130.

The host 100 includes an I/O interface 110 for communicating with devices external to the host 100. For example, the external device may transmit data to the host 100 through the I/O interface 110, and after the host 100 processes the input data, the output result after the data processing is transmitted to the external device through the I/O interface 110.

The host 100 includes a processor 120, which may be a central processing unit (central processing unit, CPU) or another specific integrated circuit, and which is the arithmetic core and control core of the host 100. The processor 120 may also be another general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.

The host 100 includes a memory 130 for storing the computer program instructions running in the operating system of the host 100, data to be processed, data processing results, and the like. To increase the access speed of the processor 120, the memory 130 needs to offer fast access. The memory 130 typically employs dynamic random access memory (dynamic random access memory, DRAM). In addition to DRAM, the memory 130 may be another random access memory, such as static random access memory (static random access memory, SRAM). The memory 130 may also be a read-only memory (read only memory, ROM), for example a programmable read-only memory (programmable read only memory, PROM) or an erasable programmable read-only memory (erasable programmable read only memory, EPROM). The memory 130 may also be a flash memory medium (flash), a hard disk drive (hard disk drive, HDD), a solid state drive (solid state disk, SSD), or the like.

The data processing apparatus 200 is connected to the host 100. It may be used as an external device of the host 100, or it may be disposed inside the host 100, for example on a motherboard or backplane of the host 100. The data processing apparatus 200 may exchange data with the host 100 through a bus, for example a PCIe bus, a compute express link (compute express link, CXL) bus, a universal serial bus (universal serial bus, USB), or a bus of another protocol.

Illustratively, the data processing apparatus 200 may serve as a module with data processing functions attached to the host 100 and may take over part of the functions of the host 100. That is, some of the functions of the host 100 are offloaded onto the data processing apparatus 200; the embodiments of the present application do not limit which functions of the host 100 are offloaded. Alternatively, some of the functions of the file storage system 300 may be offloaded to the data processing apparatus 200, which the application likewise does not limit. For example, the data processing apparatus 200 may implement, in place of the host, the functions of interacting with the file storage system 300, parsing metadata, and reading data. For example, the data processing apparatus 200 may be a DPU that is plugged into the host 100.

In hardware, the data processing apparatus 200 may include a processor 210, a memory 220, a front-end interface 230 to interact with a host, and a back-end interface 240 to interact with an HDFS system.

The processor 210 is the main arithmetic unit and core unit of the data processing apparatus 200 and carries the main functions of the data processing apparatus 200. For example, some of the functions of the host 100 may be offloaded to the processor 210, which processes the data and executes the tasks that the host 100 delivers to the data processing apparatus 200.

The memory 220 of the data processing apparatus 200 supports the data processing operations of the processor 210 and provides data storage space for the processor 210. Similar to the memory 130 of the host, the memory 220 may also be used for storing computer program instructions that the processor 210 needs to call, data that needs to be processed, data processing results, and so on. The type of the memory 220 is similar to the type of the memory 130; refer to the foregoing description, which is not repeated here. The memory 220 can be understood as the memory of the data processing apparatus 200. In the present application, the memory 220 is provided with a first storage unit 221, in which metadata of the HDFS system is stored. The memory 220 is also provided with a second storage unit 222, in which hot spot data commonly used by the host is stored. The hot spot data may be data in the data node whose number of accesses is greater than a count threshold, or data in the data node for which the time difference between the most recent access time and the current time is less than a time threshold.
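The two alternative criteria for hot spot data can be written as a single predicate. This is only a sketch: the function name is_hotspot, the parameter names, and the use of seconds as the time unit are assumptions, and the text leaves the threshold values open.

```python
def is_hotspot(access_count, last_access, now, count_threshold, recency_seconds):
    """Return True when either hot spot criterion from the description is met."""
    accessed_often = access_count > count_threshold
    accessed_recently = (now - last_access) < recency_seconds
    return accessed_often or accessed_recently


# Accessed 12 times (more than 10), although the last access was long ago.
often = is_hotspot(access_count=12, last_access=0.0, now=1000.0,
                   count_threshold=10, recency_seconds=60.0)
# Accessed only once, but 5 seconds ago (less than 60).
recent = is_hotspot(access_count=1, last_access=995.0, now=1000.0,
                    count_threshold=10, recency_seconds=60.0)
# Accessed once, long ago: meets neither criterion.
cold = is_hotspot(access_count=1, last_access=0.0, now=1000.0,
                  count_threshold=10, recency_seconds=60.0)
```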

As shown in FIG. 4, the data processing apparatus 200 may be directly inserted into a card slot on the motherboard of the host 100 and exchange data with the host 100 through the PCIe bus. It should be noted that the PCIe bus in FIG. 4 can be replaced with a compute express link (compute express link, CXL) bus, a universal serial bus (universal serial bus, USB), or a bus of another protocol to enable data transfer by the data processing apparatus 200.

The file storage system 300 includes metadata nodes and data nodes, and the data processing apparatus 200 may establish a TCP link or a remote direct memory access (remote direct memory access, RDMA) link through the back-end interface 240 to exchange information with the file storage system 300. The file storage system may be an HDFS system; for convenience of description, the HDFS system is taken as an example below.

At present, in an HDFS system, a host does not have a function of storing data, and each time data is read, the host interacts directly with a metadata node and a data node. For example, each time the host reads data, it needs to establish a TCP link with the metadata node to obtain metadata. Establishing the TCP link increases the processing delay of the data read and reduces the efficiency of data processing.

The application provides a data processing method and a data processing device based on an HDFS system. A first storage unit for storing metadata of the HDFS system and a second storage unit for storing hot spot data commonly used by the host are configured on the application side. When the host needs to read data, it can read the data directly from the second storage unit. When the second storage unit does not store the required data, the storage location of the data in a data node can be determined according to the metadata stored in the first storage unit, and a link can then be established with the data node according to that storage location to acquire the data. Compared with the prior art, the scheme provided by the application does not need to establish a communication link with the metadata node each time data processing is performed; instead, the data or the metadata can be acquired on the host side, so that the efficiency of host data processing can be improved.

Next, the scheme of the present application will be described. For example, the solution may be implemented based on the system architecture shown in fig. 1 or on the system architecture shown in fig. 4. The two scenarios are described separately below.

Scenario 1: the data processing method is implemented based on the system architecture shown in fig. 1.

In some embodiments, the host shown in fig. 1 may be configured with a first storage unit for storing metadata of the HDFS system and a second storage unit for storing hot spot data commonly used by the host. For example, the first storage unit and the second storage unit may be memory space, cache space, or storage of the host. When data processing is performed, the host can acquire the required data directly from the second storage unit. When the second storage unit does not store the required data, the host can acquire the metadata of the required data from the first storage unit, determine the storage location of the required data in the data node based on the metadata, and read the data from the data node of the HDFS system based on the determined storage location.

Fig. 5 is a flowchart of a data processing method based on an HDFS system provided in an embodiment of the present application. The method flow may be implemented by a host and a data node of the HDFS system included in the system architecture shown in fig. 1. The host includes a first storage unit, in which metadata from a metadata node of the HDFS system is stored. The host also includes a second storage unit, in which hot spot data commonly used by the host is stored. The method shown in fig. 5 specifically includes:

501, the host determines that the first data is not stored in the second storage unit in response to an operation of reading the first data.

Optionally, the host may determine, according to the name of the first data, that the first data is not included in the hot spot data stored in the second storage unit.

502, the host determines the metadata of the first data among the metadata stored in the first storage unit.

The metadata of the first data is used for representing the storage position of the first data in the data node of the HDFS system.

For example, the operation of reading the first data may be generated by the host during execution of a certain processing procedure. That is, when the host needs to read the first data in the HDFS system during the execution of a certain processing procedure and the first data is not included in the second storage unit, the metadata of the first data may be obtained from the metadata stored in the first storage unit.

503, the host sends a first request to the data node according to the storage location indicated by the metadata of the first data.

Wherein the first request is for requesting first data.

For example, the host may determine a data node storing the first data according to a storage location indicated by metadata of the first data, and send a first request to the determined data node.

504, the data node receives the first request and sends the first data to the host.

Illustratively, the data node may determine the first data based on a storage location carried by the first request.

Optionally, the host may further store the first data to the second storage unit after receiving the first data.
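The flow of steps 501 to 504 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the classes `DataNode` and `Host`, the dictionary-based storage units, and the `(node_id, location)` form of the metadata are all hypothetical.

```python
class DataNode:
    """Hypothetical stand-in for an HDFS data node: maps a storage location to bytes."""

    def __init__(self, blocks):
        self.blocks = blocks

    def read(self, location):
        return self.blocks[location]


class Host:
    def __init__(self, metadata, data_nodes):
        self.first_storage_unit = dict(metadata)  # name -> (node_id, location)
        self.second_storage_unit = {}             # name -> data (hot spot data)
        self.data_nodes = data_nodes              # node_id -> DataNode

    def read(self, name):
        # Step 501: check the hot spot data by name.
        if name in self.second_storage_unit:
            return self.second_storage_unit[name]
        # Step 502: look up the metadata of the requested data locally.
        node_id, location = self.first_storage_unit[name]
        # Steps 503-504: request the data from the data node at that location.
        data = self.data_nodes[node_id].read(location)
        # Optional step: keep the data as hot spot data for later reads.
        self.second_storage_unit[name] = data
        return data
```

In this sketch a second read of the same name is served from the second storage unit and does not contact the data node at all.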

In the related art, in a file system such as network attached storage (NAS), the application side includes a VFS layer that caches metadata. In the HDFS file system, however, the host interacts directly with the HDFS system to obtain data and does not include a VFS layer, so metadata caching cannot be achieved. Based on this, the application proposes to set a storage unit on the host side for storing the metadata of the metadata nodes of the HDFS system. When the host reads data, it can acquire the metadata directly from its own storage unit and determine the storage location of the data in the data node of the HDFS system according to the metadata. In the prior art, metadata is stored centrally in the metadata nodes of the HDFS system, and the host needs to establish a link with a metadata node to acquire metadata each time it performs data processing. By contrast, the scheme of the application acquires the metadata directly on the host, which improves the efficiency of data processing.

Scenario 2: the data processing method is implemented based on the system architecture shown in fig. 4.

Fig. 6 is an exemplary flowchart of a data processing method provided by the present application. The flow shown in fig. 6 may be implemented by the host 100, the data processing apparatus 200, and a data node included in the system shown in fig. 4. The data processing apparatus 200 includes a first storage unit for storing metadata of the HDFS system and a second storage unit for storing hot spot data required by the host. In the following, for convenience of description, the host 100 is simply referred to as the host, and the data processing apparatus 200 is simply referred to as the data processing apparatus. The method shown in fig. 6 specifically includes:

Step 610: the host sends a first request to the data processing apparatus.

For example, the first request may be generated by the host when executing a certain processing procedure, and the first request is used to request the first data. That is, when the host needs to read the first data while executing a certain processing procedure, the host may send a first request to the data processing apparatus to request the first data.

Step 620: the data processing device receives the first request, and when the first data is determined not to be stored in the second storage unit, the data processing device sends the first request to the data node according to the metadata of the first data stored in the first storage unit.

Wherein the metadata of the first data is used to indicate a storage location of the first data in the data node.

Optionally, after receiving the first request, the data processing apparatus may parse the first request according to the semantic parsing logic of the HDFS system (this function may be offloaded from the HDFS system to the data processing apparatus), determine that the first data is not stored in the second storage unit, and determine the metadata of the first data among the metadata stored in the first storage unit. Further, the data processing apparatus may determine, according to the metadata of the first data, the data node in the HDFS system at which the first data is located. Still further, the data processing apparatus may send the first request to the determined data node to request the first data.

In a possible implementation manner, the first request sent by the data processing apparatus to the data node may include a storage location of a data block containing the first data, for requesting the data block containing the first data. In another possible implementation, the first request may include a storage location of a data block including the first data, and a specific starting location in the data block where the first data is located and a length of the first data, for requesting the first data.
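The two possible forms of the first request can be illustrated with a short sketch. The `ReadRequest` type, its field names, and the `serve` handler are hypothetical and chosen only for illustration; the application does not define a wire format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadRequest:
    block_location: str            # storage location of the data block
    offset: Optional[int] = None   # starting location of the first data in the block
    length: Optional[int] = None   # length of the first data


def serve(blocks: dict, req: ReadRequest) -> bytes:
    """Hypothetical data-node handler covering both request forms."""
    block = blocks[req.block_location]
    if req.offset is None or req.length is None:
        # First form: only the block location is given, return the whole block.
        return block
    # Second form: return only the first data inside the block.
    return block[req.offset:req.offset + req.length]
```

The second form transfers less data when the first data is much smaller than the data block that contains it.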

Step 630: the data node returns the first data to the data processing device.

Step 640: the data processing device receives the first data and sends the first data to the host.

Optionally, the data processing device may further store the first data to the second storage unit after receiving the first data returned by the data node.

Based on the above scheme, the application provides a data processing device that stores the metadata of the HDFS system and the hot spot data and that offloads part of the processing functions of the host and the HDFS system. When processing data, the host can interact with the data processing device through an I/O interface to acquire the required data. Compared with the prior art, in which the host establishes a TCP link with the HDFS system to acquire data, the scheme of the application can effectively improve the efficiency of host data processing.

In some scenarios, when performing task creation or data management, the host may need to obtain the metadata of one or more items of data and determine the information contained in the metadata. In this scenario, the host may send a second request to the data processing apparatus, the second request requesting metadata of the second data.

In one possible case, the data processing apparatus may return the metadata of the second data directly to the host upon determining that the metadata of the second data is stored in the first storage unit.

In another possible case, the data processing apparatus may send the second request to the metadata node of the HDFS system upon determining that the metadata of the second data is not stored in the first storage unit. Upon receiving metadata of the second data returned by the metadata node, the data processing apparatus may send the metadata of the second data to the host. Optionally, the data processing apparatus may further store metadata of the second data into the first storage unit.
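The two cases for the second request can be sketched as a lookup with a fallback. The `MetadataNode` class, the `get_metadata` function, and the string form of the metadata are assumptions made for illustration; writing the returned metadata back to the first storage unit corresponds to the optional step described above.

```python
class MetadataNode:
    """Hypothetical stand-in for the HDFS metadata node."""

    def __init__(self, table):
        self.table = table
        self.requests = 0  # counts how often the metadata node is contacted

    def lookup(self, name):
        self.requests += 1
        return self.table[name]


def get_metadata(first_storage_unit, metadata_node, name):
    # Case 1: the metadata is already stored in the first storage unit.
    if name in first_storage_unit:
        return first_storage_unit[name]
    # Case 2: forward the request to the metadata node.
    meta = metadata_node.lookup(name)
    # Optional: keep the returned metadata for later requests.
    first_storage_unit[name] = meta
    return meta
```

With the optional write-back, repeated requests for the same metadata contact the metadata node only once in this sketch.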

In some embodiments, the data processing apparatus determines whether metadata of the first data is included in metadata stored in the first storage unit after receiving the first request. In one possible case, the first storage unit includes metadata of the first data, and the data processing apparatus may read the metadata of the first data from the first storage unit. In another possible case, the first storage unit does not include metadata of the first data, and the data processing apparatus may send a request message to a metadata node of the HDFS system for requesting metadata stored by the metadata node. For example, the data processing apparatus may store metadata returned by the metadata node to the first storage unit.

In one possible implementation manner, the data processing apparatus may determine, after receiving the first request, whether the hot spot data stored in the second storage unit includes the first data. If so, the data processing apparatus may read the first data directly from the second storage unit and send it to the host. If not, the data processing device may acquire metadata of the first data from the first storage unit, and request the first data from the data node of the HDFS system according to the metadata of the first data. To facilitate understanding of the step of acquiring the first data, fig. 7 shows a flowchart of a method for reading the first data provided by an embodiment of the present application. The method specifically includes:

701, a data processing apparatus receives a first request from a host to access a second storage unit.

Wherein the first request is for requesting first data.

702, the data processing apparatus determines whether the first data is stored in the second storage unit.

If yes, go on to step 703.

If not, proceed to step 704.

703, the data processing apparatus reads the first data from the second storage unit and sends the first data to the host.

704, the data processing apparatus determines whether metadata of the first data is stored in the first storage unit.

If yes, go on to step 705.

If not, go on to step 707.

705, the data processing apparatus sends a first request to the data node according to the metadata of the first data stored in the first storage unit.

706, the data processing device receives the first data returned by the data node, and sends the first data to the host.

The data processing device may also store the first data returned by the data node to the second storage unit, for example.

707, the data processing apparatus sends a fourth request to the metadata node.

For example, the fourth request may be used to request all metadata of the metadata node, or may be used to request the metadata of the first data.

708, the data processing apparatus sends a first request to the data node according to metadata of the first data returned by the metadata node.

Execution returns to step 706.
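The branching of steps 701 to 708 can be traced with a small sketch. The function `handle_first_request`, the callables `fetch_metadata` and `fetch_data`, and the returned list of step numbers are hypothetical devices used only to make the control flow visible; they are not part of the application.

```python
def handle_first_request(name, second_unit, first_unit, fetch_metadata, fetch_data):
    """Return (data, steps), where steps lists the step numbers that were taken."""
    steps = [701, 702]
    if name in second_unit:
        steps.append(703)                        # hit in the second storage unit
        return second_unit[name], steps
    steps.append(704)
    if name in first_unit:
        steps.append(705)                        # metadata found locally
    else:
        steps.append(707)                        # ask the metadata node
        first_unit[name] = fetch_metadata(name)  # optional write-back
        steps.append(708)
    data = fetch_data(first_unit[name])          # first request to the data node
    steps.append(706)
    second_unit[name] = data                     # optional: keep as hot spot data
    return data, steps
```

Calling the function three times for the same name, clearing the second storage unit after the first call, exercises the metadata-miss path, the metadata-hit path, and the hot spot hit path in turn.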

Based on the above scheme, the present application proposes to configure, in the data processing device, the second storage unit for storing hot spot data commonly used by the host. When the data processing device receives a request for reading data from the host, it may first determine whether the requested data is in the second storage unit. If not, the data is acquired according to the metadata stored in the first storage unit and sent to the host, which, compared with the prior art, saves the step of establishing a link with the metadata node. If so, the data can be read directly from the second storage unit and sent to the host, which saves not only the step of establishing a link with the metadata node but also the step of establishing a link with the data node, further improving the efficiency of data processing.

In some embodiments, the second storage unit in the data processing apparatus may include at least one storage space, each used for storing the hot spot data of one of at least one processing procedure of the host. The first request may be generated by the host during execution of a first processing procedure. When judging whether the hot spot data stored in the second storage unit includes the first data, the data processing device may judge whether the first storage space corresponding to the first processing procedure in the second storage unit includes the first data. By way of example, the data processing apparatus may determine whether the first data is included in the first storage space by comparing the names of the items of hot spot data in the first storage space with the name of the first data.

In one possible case, the data processing apparatus may directly read the first data from the first storage space and transmit the first data to the host when it is determined that the first data is included in the first storage space. In another possible case, the data processing apparatus may request the first data from the data node of the HDFS system according to the metadata stored in the first storage unit and send the first data to the host when it is determined that the first data is not included in the first storage space. The data processing device may also store the first data to the first storage space when receiving the first data returned by the data node, where the first data is used as hot spot data of the first processing process of the host.

In some embodiments, the at least one storage space of the second storage unit may further include a second storage space, where the second storage space stores the first data required in performing a second processing procedure. For example, the data processing apparatus may delete the first data in the second storage space upon determining that the first data in the first storage space has been modified. In one possible implementation, the first data in the first storage space may be modified by the host during execution of the first processing procedure.
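The per-procedure storage spaces and the deletion of stale copies can be sketched as follows. The `SecondStorageUnit` class, its method names, and the use of a process identifier as the key of a storage space are assumptions for illustration only.

```python
class SecondStorageUnit:
    def __init__(self):
        self.spaces = {}  # process id -> {data name -> data}

    def get(self, process_id, name):
        """Look up data by name in the storage space of one processing procedure."""
        return self.spaces.get(process_id, {}).get(name)

    def put(self, process_id, name, data):
        self.spaces.setdefault(process_id, {})[name] = data

    def modify(self, process_id, name, new_data):
        """Modify the copy in one space and delete the copies held by other spaces."""
        self.spaces.setdefault(process_id, {})[name] = new_data
        for other_id, space in self.spaces.items():
            if other_id != process_id:
                space.pop(name, None)
```

After a modification by the first processing procedure, a later read by the second processing procedure misses in its own storage space and therefore falls back to the data node rather than returning an outdated copy.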

The data processing apparatus may further send an update instruction to the data node when the first data in the second storage unit is modified, the update instruction carrying the modified first data. After receiving the update instruction, the data node may replace the original first data with the updated first data. Because the HDFS system serves multiple clients, the present application proposes that a lease (Lease) mechanism may be used to avoid concurrent writes when a data node updates data. For example, before sending the update instruction, the data processing apparatus may apply to the metadata node for a lease time (for example, 10 seconds). During the lease time, the data processing apparatus may send the update instruction and the modified first data to the data node, so as to instruct the data node to update the first data. When the lease time ends, the data processing apparatus no longer has the right to send data to the data node.
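The lease idea can be illustrated with a simplified sketch. The classes `Lease` and `LeaseManager` and the function `send_update` are hypothetical, time is passed in explicitly to keep the example deterministic, and the sketch is not the lease implementation of HDFS itself; only the 10-second duration comes from the example above.

```python
class Lease:
    def __init__(self, holder, granted_at, duration=10.0):
        self.holder = holder
        self.expires_at = granted_at + duration

    def valid(self, now):
        return now < self.expires_at


class LeaseManager:
    """Hypothetical lease bookkeeping on the metadata node side."""

    def __init__(self):
        self.leases = {}  # data name -> Lease

    def acquire(self, name, holder, now, duration=10.0):
        current = self.leases.get(name)
        if current and current.valid(now) and current.holder != holder:
            return None  # another writer still holds a valid lease
        lease = Lease(holder, now, duration)
        self.leases[name] = lease
        return lease


def send_update(data_node, name, new_data, lease, now):
    """Apply the update only while the lease is still valid."""
    if lease is None or not lease.valid(now):
        return False
    data_node[name] = new_data
    return True
```

A second writer that asks for the lease while the first lease is valid is refused, and an update sent after the lease time has ended is rejected, which is how concurrent writes are avoided in this sketch.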

In some scenarios, to ensure the accuracy of the metadata maintained in the first storage unit of the data processing device, the application proposes that the data processing device periodically download the metadata of the metadata nodes of the HDFS system and update the metadata in the first storage unit with the downloaded metadata, so that the metadata in the data processing device is consistent with the metadata of the metadata nodes. As an option, the data processing apparatus may send a third request to the metadata node at a set interval, where the third request is used to request the metadata of the metadata node. The data processing device replaces the metadata stored in the first storage unit with the metadata returned by the metadata node.

In other cases, when the data processing device sends the first request to the data node of the HDFS system according to the metadata of the first data stored in the first storage unit, if the data node returns the first response, the data processing device may determine that the first data does not exist in the data node, and further may determine that the metadata of the first data stored in the first storage unit is inaccurate. Wherein the first response is used to characterize that the first data is not present in the storage location carried by the first request. For example, after receiving the first response, the data processing apparatus may send a third request to the metadata node, the third request requesting metadata in the metadata node.

In one possible implementation, the third request may be used to request metadata of the first data, and the data processing apparatus may replace metadata of the first data stored in the first storage unit with metadata of the first data returned by the metadata node.

In another possible implementation manner, the third request may be used to request all metadata stored by the metadata node, and the data processing apparatus may update all metadata stored by the first storage unit according to metadata returned by the metadata node.
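The handling of inaccurate metadata can be sketched as follows. The function `read_with_refresh`, the `NOT_FOUND` marker standing in for the first response, and the dictionary stand-ins for the data node and metadata node are hypothetical. Retrying the read after the refresh is an assumption of this sketch; the description only states that the metadata is re-requested and replaced.

```python
NOT_FOUND = object()  # hypothetical marker standing in for the "first response"


def read_with_refresh(name, first_unit, data_node, metadata_node, refresh_all=False):
    """Read data via cached metadata; refresh the metadata if the location is stale."""
    location = first_unit[name]
    data = data_node.get(location, NOT_FOUND)
    if data is NOT_FOUND:
        # Third request: the cached metadata is inaccurate, ask the metadata node.
        if refresh_all:
            # Second implementation: update all metadata in the first storage unit.
            first_unit.clear()
            first_unit.update(metadata_node)
        else:
            # First implementation: replace only the metadata of the first data.
            first_unit[name] = metadata_node[name]
        data = data_node[first_unit[name]]  # assumed retry with fresh metadata
    return data
```

Refreshing only one entry keeps the third request small, while refreshing everything also repairs other entries that may have become stale at the same time.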

In some embodiments, the data processing apparatus may manage the hot spot data stored in the second storage unit. For example, the data processing apparatus may delete expired data from the hot spot data when the remaining capacity of the second storage unit is less than a capacity threshold. The expired data is hot spot data for which the time difference between the most recent access time and the current time is greater than a time threshold, or hot spot data whose access count is less than a count threshold.
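The capacity-triggered deletion of expired data can be sketched as follows. The function `evict_expired`, the layout of each entry (`size`, `last_access`, `count`), and all threshold values used with it are assumptions for illustration.

```python
def evict_expired(entries, capacity, now, capacity_threshold,
                  time_threshold, count_threshold):
    """entries: name -> dict(size, last_access, count).

    Deletes expired entries in place and returns the names that were deleted.
    """
    used = sum(e["size"] for e in entries.values())
    if capacity - used >= capacity_threshold:
        return []  # enough remaining capacity, nothing to delete
    expired = [
        name for name, e in entries.items()
        if now - e["last_access"] > time_threshold or e["count"] < count_threshold
    ]
    for name in expired:
        del entries[name]
    return expired
```

Entries that are both recently accessed and frequently accessed survive, so the hot spot data most likely to be requested again stays in the second storage unit.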

In some scenarios, the management of the first storage unit, the second storage unit, the metadata, and the hot spot data may be implemented by specific processing modules in the data processing apparatus. As an example, fig. 8 shows another system architecture diagram provided for an embodiment of the present application. Fig. 8 shows a host 810, a data processing apparatus 820, and an HDFS system 830. The host 810 includes a metadata processing module 811, a data processing module 812, and a message processing module 813. The data processing apparatus 820 includes a message processing module 821, a metadata processing module 822, a data processing module 823, an HDFS interface management module 824, a file caching module 825, a data updating module 826, and a communication processing module 827. The HDFS system 830 includes metadata nodes and data nodes.

It should be noted that the above-mentioned division of the respective functional modules is only an example, and other manners of dividing the modules of the host and the data processing apparatus may be adopted, which is not limited in the present application. The functions of the respective modules shown in fig. 8 are described below:

The metadata processing module 811 in the host 810 is used to determine the storage location of data based on the received metadata. The data processing module 812 in the host is configured to implement various processing procedures according to the received data. The message processing module 813 in the host is used to interact with the data processing device 820, for example to send and receive metadata, to send and receive data, and to exchange requests and responses. For example, the message processing module 813 may be the I/O interface 110 in the system of fig. 4.

The message processing module 821 in the data processing apparatus 820 is configured to implement information interaction with the host. The metadata processing module 822 is configured to implement metadata management, such as metadata update. The data processing module 823 is used for managing the hot spot data. The file caching module 825 includes a first storage unit and a second storage unit, which are used for storing metadata and hot spot data, respectively. The data updating module 826 is used to update the data of the HDFS system based on the Lease mechanism. The HDFS interface management module 824 includes various API interfaces for accessing HDFS. The communication processing module 827 is configured to implement information interaction with the HDFS system based on the HDFS interface management module 824.

The data processing procedure proposed by the present application is described below based on the system architecture shown in fig. 8. Fig. 9 is a flowchart of a data processing method provided in an embodiment of the present application. The method specifically includes:

901, the data processing module 812 sends a first request to the message processing module 813.

902, the message processing module 813 encapsulates the first request and sends it to the data processing module 823.

Illustratively, the message processing module 813 may perform encapsulation processing on the first request using the RPC protocol, and may send the encapsulated first request to the data processing module 823 through the message processing module 821.

For example, after receiving the first request, the data processing module 823 may also perform operations such as decapsulation and path translation on the first request.
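Encapsulation, decapsulation, and path translation can be illustrated with a simplified sketch. The length-prefixed JSON frame used here is a hypothetical stand-in and not the RPC format of the application or of HDFS, and the mount point `/mnt/hdfs` used for path translation is likewise an assumption.

```python
import json
import struct


def encapsulate(method: str, path: str) -> bytes:
    """Wrap a request in a frame: 4-byte big-endian length followed by a JSON body."""
    payload = json.dumps({"method": method, "path": path}).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def decapsulate(frame: bytes) -> dict:
    """Recover the request from a frame produced by encapsulate()."""
    (length,) = struct.unpack(">I", frame[:4])
    return json.loads(frame[4:4 + length].decode("utf-8"))


def translate_path(path: str, mount_point: str = "/mnt/hdfs") -> str:
    """Map a host-side path under the assumed mount point to a path in HDFS."""
    if path.startswith(mount_point + "/"):
        return path[len(mount_point):]
    return path
```

For instance, a read request for `/mnt/hdfs/logs/a.txt` would be framed on the host side, unpacked by the data processing module, and translated to `/logs/a.txt` before the lookup in the second storage unit.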

903, the data processing module 823 determines whether the first data is included in the second storage unit.

If so, execution continues with step 904.

If not, execution continues with step 906.

904, the data processing module 823 returns the first data to the message processing module 813.

905, the message processing module 813 returns the first data to the data processing module 812.

906, the data processing module 823 sends a request to read the first data to the metadata processing module 822.

907, the metadata processing module 822 determines whether metadata of the first data is included in the first storage unit.

If so, execution continues with step 908.

If not, execution continues with step 911.

908, the metadata processing module 822 sends the metadata of the first data to the data processing module 823.

909, the data processing module 823 sends a first request to the data node according to the metadata of the first data.

The data processing module 823 may send a first request to the data node via the communication processing module 827, for example. For example, the first data may be requested by the communication processing module 827 establishing an RDMA link with the data node.

910, the data node returns the first data to the data processing module 823.

After receiving the first data, the data processing module 823 may return to executing step 904.

911, the metadata processing module 822 sends a fourth request to the metadata node.

Wherein the fourth request is for requesting metadata of the first data.

912, the metadata node receives the fourth request and returns metadata of the first data to the metadata processing module 822.

The metadata processing module 822 may return to execute step 908 after receiving metadata for the first data.

The embodiment of the application also provides a data processing device based on the HDFS system, which is used for executing each step in the above method. For the related features, refer to the description in the above embodiments, which is not repeated here. Referring to fig. 10, an HDFS system-based data processing apparatus 1000 may include a communication unit 1001 and a processing unit 1002. Although not shown in fig. 10, the data processing apparatus 1000 based on the HDFS system may further include a first storage unit, in which metadata from a metadata node of the HDFS system is stored, and a second storage unit, in which hot spot data required by the host is stored.

A communication unit 1001 for receiving a first request from the host; the first request is used for requesting first data;

A processing unit 1002, configured to send, when it is determined that the first data is not stored in the second storage unit, the first request to the data node according to metadata of the first data stored in the first storage unit; metadata of the first data indicates a storage location of the first data at the data node;

the communication unit 1001 is further configured to receive the first data sent by the data node, and send the first data to the host.

In some embodiments, the communication unit 1001 is further configured to receive a second request from the host; the second request is for requesting metadata of second data;

The processing unit 1002 is further configured to, when determining that metadata of the second data is stored in the first storage unit, read metadata of the second data from the first storage unit;

the communication unit 1001 is further configured to send metadata of the second data to the host.

In some embodiments, the communication unit 1001 is further configured to:

In some embodiments, the processing unit 1002 is further configured to:

the communication unit 1001 is instructed to transmit the first data to the host.

In some embodiments, the first request is generated during execution of a first processing procedure by the host, and the second storage unit includes at least one storage space; the processing unit 1002 is specifically configured to:

In some embodiments, the processing unit 1002 is further configured to:

In some embodiments, the at least one memory space further comprises a second memory space storing first data required in performing a second processing procedure, the processing unit 1002 further being configured to:

In some embodiments, the communication unit 1001 is further configured to:

In some embodiments, the processing unit 1002 is further configured to:

In some embodiments, the first request carries the storage location of the first data at the data node, and the communication unit 1001 is further configured to receive a first response returned by the data node after sending the first request to the data node; the first response is used for representing that the first data does not exist in a storage position carried by the first request;

The processing unit 1002 is further configured to replace metadata of the first data stored in the first storage unit with metadata of the first data returned by the metadata node.

In a simple embodiment, it will be appreciated by those skilled in the art that the HDFS system based data processing device of the above embodiment may take the form shown in fig. 11.

The apparatus 1100, as shown in fig. 11, includes at least one processor 1111, memory 1120, and optionally a communication interface 1130.

The specific connection medium between the processor 1111 and the memory 1120 is not limited in the embodiment of the present application. The memory 1120 includes a first storage unit and a second storage unit; metadata from metadata nodes of the HDFS system is stored in the first storage unit, and hot spot data needed by the host is stored in the second storage unit.

In the apparatus of fig. 11, a communication interface 1130 is further included, and the processor 1111 may perform data transmission through the communication interface 1130 when communicating with other devices.

When the HDFS system based data processing device takes the form shown in fig. 11, the processor 1111 in fig. 11 may cause the device 1100 to perform the method performed by the HDFS system based data processing device in any of the method embodiments described above by invoking computer-executable instructions stored in the memory 1120.

The embodiment of the application also relates to a chip system, which comprises a processor for calling a computer program or computer instructions stored in a memory, so that the processor performs the method of any of the embodiments described above.

In one possible implementation, the processor is coupled to the memory through an interface.

In one possible implementation, the chip system further includes a memory having a computer program or computer instructions stored therein. Optionally, the memory of the chip system further includes a first storage unit and a second storage unit, where the first storage unit stores metadata from a metadata node of the HDFS system, and the second storage unit stores hot spot data required by the host.

The embodiments of the present application also relate to a processor for invoking a computer program or computer instructions stored in a memory to cause the processor to perform the method according to any of the embodiments above.

It should be appreciated that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present application without departing from the scope of the embodiments of the application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims and the equivalents thereof, the present application is also intended to include such modifications and variations.

Claims

1. A data processing method based on an HDFS system, wherein the HDFS system includes metadata nodes and data nodes, the method is applied to a data processing device, the data processing device is connected to a host, the data processing device includes a first storage unit and a second storage unit, metadata in the metadata nodes of the HDFS system is stored in the first storage unit, and hot spot data required by the host is stored in the second storage unit, the method includes:

2. The method according to claim 1, wherein the method further comprises:

sending metadata of the second data to the host.

3. The method according to claim 2, wherein the method further comprises:

4. A method according to any one of claims 1-3, wherein the method further comprises:

and sending the first data to the host.

5. A method according to any of claims 1-3, wherein the first request is generated during execution of a first processing procedure by the host, and the second storage unit comprises at least one storage space; the determining that the hotspot data stored in the second storage unit does not include the first data includes:

6. The method of claim 5, wherein the method further comprises:

7. The method of claim 5 or 6, wherein the at least one storage space further comprises a second storage space storing first data required in performing a second processing procedure, the method further comprising:

8. The method according to any one of claims 1-7, further comprising:

9. The method according to any one of claims 1-7, further comprising:

10. The method of any of claims 1-8, wherein the first request carries a storage location of the first data at the data node, the method further comprising:

11. A data processing apparatus based on an HDFS system, wherein the HDFS system includes metadata nodes and data nodes, the apparatus is connected to a host, the apparatus includes a first storage unit and a second storage unit, the first storage unit stores metadata in the metadata nodes of the HDFS system, and the second storage unit stores hot spot data required by the host, and the apparatus further includes:

12. The apparatus of claim 11, wherein the communication unit is further configured to receive a second request from the host; the second request is for requesting metadata of second data;

13. The apparatus of claim 12, wherein the communication unit is further configured to:

14. The apparatus according to any one of claims 11-13, wherein the processing unit is further configured to:

instruct the communication unit to send the first data to the host.

15. The apparatus of any of claims 11-13, wherein the first request is generated during execution of a first processing procedure by the host, the second storage unit comprising at least one storage space; the processing unit is specifically configured to:

16. The apparatus of claim 15, wherein the processing unit is further configured to:

17. The apparatus according to claim 15 or 16, wherein the at least one storage space further comprises a second storage space storing first data required in performing a second processing procedure, the processing unit further configured to:

18. The apparatus according to any of claims 11-17, wherein the communication unit is further configured to:

19. The apparatus according to any one of claims 11-17, wherein the processing unit is further configured to:

20. The apparatus according to any of claims 11-18, wherein the first request carries a storage location of the first data at the data node, the communication unit further configured to receive a first response returned by the data node after sending the first request to the data node; the first response indicates that the first data does not exist at the storage location carried by the first request;

21. A data processing apparatus based on an HDFS system, comprising: a processor and a memory;

the memory comprises a first storage unit and a second storage unit; metadata in metadata nodes of the HDFS system is stored in the first storage unit, and hot spot data required by the host is stored in the second storage unit;

the memory is also used for storing programs;

the processor is configured to execute the program stored in the memory, to cause the apparatus to implement the method according to any one of claims 1-10.

22. A computer readable storage medium storing computer executable instructions which, when invoked by an electronic device, cause the electronic device to perform the method of any one of claims 1-10.

23. A computer program product, characterized in that the computer program product, when run on a computer, causes the computer to perform the method of any of claims 1 to 10.