CN114281636B - Method and device for processing user space file system fault - Google Patents


Info

Publication number
CN114281636B
Authority
CN
China
Prior art keywords
file system
user space
daemon
space file
response
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111339749.5A
Other languages
Chinese (zh)
Other versions
CN114281636A (en)
Inventor
吴广远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111339749.5A
Publication of CN114281636A
Application granted
Publication of CN114281636B

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method, a system, a device, and a storage medium for handling user space file system failures, wherein the method comprises the following steps: dynamically acquiring a list of all computing nodes in the cluster, and distributing daemons to all computing nodes according to the list; detecting, by the daemon, whether the management process of the computing node is normal, and, in response to the management process of the computing node being normal, detecting, by the daemon, whether the user space file system mount point of the computing node has failed; in response to the user space file system mount point of the computing node being normal, detecting, by the daemon, whether the distributed file system file can be accessed through the user space file system mount point; and, in response to the failure to access the distributed file system file through the user space file system mount point, canceling the user space file system mount point and re-mounting. The method and device can greatly improve the operation and maintenance efficiency of the Hadoop cluster, reduce the waste of computing resources, and improve user satisfaction with the Hadoop cluster.

Description

Method and device for processing user space file system fault
Technical Field
The present application relates to the field of big data, and more particularly, to a method, system, device, and storage medium for handling user space file system failures.
Background
Faced with massive unstructured data processing tasks, the computing power of a single machine is insufficient. If multi-machine parallel processing is adopted instead, the application vendor has to develop a distributed file system and a scheduling framework by itself, which on the one hand is relatively difficult and consumes a great deal of manpower and material resources, and on the other hand prevents the vendor from concentrating on the development of its data processing algorithms. In this scenario, most application vendors therefore choose the open-source Hadoop architecture as the underlying platform, and application programs process massive unstructured data based on the Hadoop distributed file system (Hdfs) and the distributed scheduling framework (Yarn).
The development language of Hadoop is Java, but in pursuit of maximum performance, traditional unstructured data processing algorithms are mostly developed in the C language, and the support of Hdfs for C is very limited. Therefore, Fuse (Filesystem in Userspace, user space file system) is adopted to mount Hdfs onto the Hadoop computing nodes, so that the distributed file system can be operated like a local file system through Fuse.
In such a usage scenario, Yarn is responsible for managing the computing resources (CPU, memory) of all Hadoop computing nodes, but Yarn cannot manage the computing resources occupied by Fuse. As a result, the data processing subtasks and Fuse often contend for resources, causing the Fuse process to die or the mount point to fail, and finally causing all computing tasks allocated to that node to fail.
Because of limitations of Yarn's own scheduling algorithm, the nodes on which resource contention will occur cannot be predicted. Normally, the abnormal Fuse mount points can only be handled manually after a large number of computing tasks have already failed, and the data processing tasks must then be submitted again. This makes the maintenance of the Hadoop platform burdensome and seriously wastes the computing resources of the Hadoop cluster.
Disclosure of Invention
In view of the above, an object of the embodiments of the present application is to provide a method, a system, a computer device, and a computer-readable storage medium for handling user space file system failures.
Based on the above objects, an aspect of the embodiments of the present application provides a method for handling user space file system failures, including the following steps: dynamically acquiring a list of all computing nodes in the cluster, and distributing daemons to all computing nodes according to the list; detecting, by the daemon, whether the management process of the computing node is normal, and, in response to the management process of the computing node being normal, detecting, by the daemon, whether the user space file system mount point of the computing node has failed; in response to the user space file system mount point of the computing node being normal, detecting, by the daemon, whether the distributed file system file can be accessed through the user space file system mount point; and canceling the user space file system mount point and re-mounting in response to the inability to access the distributed file system file through the user space file system mount point.
In some embodiments, the method further comprises: monitoring the daemon running state of all computing nodes, and restarting the daemon in response to the daemon running abnormally; and replacing the daemon with a new daemon in response to the daemon running abnormally and the number of restarts reaching a threshold.
In some embodiments, the method further comprises: dynamically acquiring the health status of the distributed file system, and, in response to the distributed file system being abnormal, terminating the daemons of all computing nodes and canceling the user space file system mounts of all computing nodes.
In some embodiments, the method further comprises: re-mounting the user space file system in response to the failure of the user space file system mount point of the computing node.
In another aspect of the embodiments of the present application, there is provided a system for handling user space file system failures, including: a distribution module configured to dynamically acquire a list of all computing nodes in the cluster and distribute daemons to all computing nodes according to the list; a first detection module configured to detect, by the daemon, whether the management process of the computing node is normal, and, in response to the management process of the computing node being normal, detect, by the daemon, whether the user space file system mount point of the computing node has failed; a second detection module configured to detect, by the daemon, whether the distributed file system file can be accessed through the user space file system mount point in response to the user space file system mount point of the computing node being normal; and an execution module configured to cancel the user space file system mount point and re-mount in response to the inability to access the distributed file system file through the user space file system mount point.
In some embodiments, the system further comprises a monitoring module configured to: monitor the daemon running state of all computing nodes, and restart the daemon in response to the daemon running abnormally; and replace the daemon with a new daemon in response to the daemon running abnormally and the number of restarts reaching a threshold.
In some embodiments, the system further comprises a second monitoring module configured to: dynamically acquire the health status of the distributed file system, and, in response to the distributed file system being abnormal, terminate the daemons of all computing nodes and cancel the user space file system mounts of all computing nodes.
In some embodiments, the system further comprises a second execution module configured to: re-mount the user space file system in response to the failure of the user space file system mount point of the computing node.
In yet another aspect of the embodiment of the present application, there is also provided a computer apparatus, including: at least one processor; and a memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method as above.
In yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method steps as described above.
The application has the following beneficial technical effects: by deploying the user space file system daemon on all computing nodes of the Hadoop cluster, abnormal scenarios in which the user space file system mount point fails or is blocked can be identified, and the user space file system mount point can be repaired automatically, which greatly improves the operation and maintenance efficiency of the Hadoop cluster, reduces the waste of computing resources, and improves user satisfaction with the Hadoop cluster.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of an embodiment of a method for handling user space file system failures provided by the present application;
FIG. 2 is a schematic diagram of an embodiment of a system for handling user space file system failures provided by the present application;
FIG. 3 is a schematic hardware architecture diagram of an embodiment of a computer device for handling user space file system failures provided by the present application;
FIG. 4 is a schematic diagram of an embodiment of a computer storage medium for handling user space file system failures provided by the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the following embodiments of the present application will be described in further detail with reference to the accompanying drawings.
It should be noted that, in the embodiments of the present application, the expressions "first" and "second" are used to distinguish two entities or parameters that have the same name but are not identical. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present application, and this is not repeated in the following embodiments.
In a first aspect of the embodiment of the present application, an embodiment of a method for handling a user space file system failure is provided. FIG. 1 is a schematic diagram illustrating an embodiment of a method for handling a user space file system failure provided by the present application. As shown in fig. 1, the embodiment of the present application includes the following steps:
S1, dynamically acquiring a list of all computing nodes in the cluster, and distributing daemons to all computing nodes according to the list;
S2, detecting, by the daemon, whether the management process of the computing node is normal, and, in response to the management process of the computing node being normal, detecting, by the daemon, whether the user space file system mount point of the computing node has failed;
S3, in response to the user space file system mount point of the computing node being normal, detecting, by the daemon, whether the distributed file system file can be accessed through the user space file system mount point; and
S4, canceling the user space file system mount point and re-mounting in response to the inability to access the distributed file system file through the user space file system mount point.
The application program submits a task to the Resource Manager node of the distributed scheduling framework. The Resource Manager node decomposes the task into a number of Map tasks and Reduce tasks, and the Map tasks are distributed to different computing nodes according to a certain algorithm. Each computing node accesses the data on Hdfs through its local Fuse mount to perform the computation and writes the computation result back to Hdfs; the Reduce tasks then gather the previously written results through Fuse to complete the computation. In this process, once the Fuse mount of a node fails, all the computing tasks running on that node fail, which slows down the overall computing efficiency of the Hadoop cluster and, in serious cases, causes the computing job to fail.
According to the embodiment of the present application, a monitoring node is added, the Fuse daemon is deployed in batches to all nodes of the cluster, and the Fuse mount status of all nodes of the cluster is monitored.
A list of all computing nodes in the cluster is dynamically acquired, and daemons are distributed to all computing nodes according to the list. Specifically, the monitoring node dynamically acquires the list of all computing nodes of the cluster and distributes the daemon file to every computing node, as sketched below.
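For illustration only (not part of the patent text), the following minimal Python sketch shows one way such a monitoring node could obtain the node list and push the daemon. It assumes the Hadoop client command `yarn node -list` is available on the monitoring node and that passwordless ssh/scp to every computing node is configured; the file name fuse_daemon.py and the installation path are hypothetical.
```python
import subprocess

DAEMON_FILE = "/opt/monitor/fuse_daemon.py"   # hypothetical daemon script
REMOTE_PATH = "/opt/monitor/fuse_daemon.py"   # hypothetical install path on each node

def list_compute_nodes():
    """Parse host names out of `yarn node -list` (simplified parsing)."""
    out = subprocess.run(["yarn", "node", "-list"],
                         capture_output=True, text=True, check=True).stdout
    nodes = []
    for line in out.splitlines():
        fields = line.split()
        # data rows look like: "host1:45454  RUNNING  host1:8042  3"
        if len(fields) >= 4 and ":" in fields[0] and fields[1].isupper():
            nodes.append(fields[0].split(":")[0])
    return nodes

def distribute_daemon(nodes):
    """Copy the daemon file to every computing node and start it."""
    for host in nodes:
        subprocess.run(["scp", DAEMON_FILE, f"{host}:{REMOTE_PATH}"], check=True)
        subprocess.run(["ssh", host,
                        f"nohup python3 {REMOTE_PATH} >/dev/null 2>&1 &"], check=True)

if __name__ == "__main__":
    distribute_daemon(list_compute_nodes())
```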
In some embodiments, the method further comprises: monitoring the daemon running state of all computing nodes, and restarting the daemon in response to the daemon running abnormally; and replacing the daemon with a new daemon in response to the daemon running abnormally and the number of restarts reaching a threshold. The monitoring node monitors the daemon running state of all computing nodes and promptly restarts the daemon of any node on which the daemon is found to have failed.
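A possible shape of this supervision loop is sketched below; it is only an illustration under the same assumptions as above (hypothetical daemon file name and paths, passwordless ssh), and the restart threshold of 3 is an arbitrary example.
```python
import subprocess
import time

RESTART_THRESHOLD = 3          # example value; the method only requires "a threshold"
restart_counts = {}

def daemon_alive(host):
    """True if the fuse daemon process is found on the remote host."""
    r = subprocess.run(["ssh", host, "pgrep -f fuse_daemon.py"], capture_output=True)
    return r.returncode == 0

def restart_daemon(host):
    subprocess.run(["ssh", host,
                    "nohup python3 /opt/monitor/fuse_daemon.py >/dev/null 2>&1 &"])

def redeploy_daemon(host):
    """Replace the daemon with a fresh copy before starting it again."""
    subprocess.run(["scp", "/opt/monitor/fuse_daemon.py",
                    f"{host}:/opt/monitor/fuse_daemon.py"], check=True)
    restart_daemon(host)
    restart_counts[host] = 0

def supervise(nodes, interval=30):
    while True:
        for host in nodes:
            if not daemon_alive(host):
                restart_counts[host] = restart_counts.get(host, 0) + 1
                if restart_counts[host] >= RESTART_THRESHOLD:
                    redeploy_daemon(host)   # restarts exhausted: replace the daemon
                else:
                    restart_daemon(host)    # plain restart
        time.sleep(interval)
```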
In some embodiments, the method further comprises: dynamically acquiring the health status of the distributed file system, and, in response to the distributed file system being abnormal, terminating the daemons of all computing nodes and canceling the user space file system mounts of all computing nodes. The monitoring node dynamically senses the health status of Hdfs; for example, when the Hdfs service is terminated, the daemons of all computing nodes are terminated in time and the Fuse mounts of all computing nodes are canceled.
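A minimal sketch of such a health probe and shutdown path follows. It assumes the Hadoop client is installed on the monitoring node and treats a failed `hdfs dfs -ls /` as "Hdfs abnormal"; the remote stop/unmount commands and the mount point path are illustrative assumptions.
```python
import subprocess

def hdfs_healthy(timeout=30):
    """Probe Hdfs by listing its root directory; failure or timeout means abnormal."""
    try:
        r = subprocess.run(["hdfs", "dfs", "-ls", "/"],
                           capture_output=True, timeout=timeout)
        return r.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def shut_down_node(host, mount_point="/mnt/hdfs"):
    """Terminate the node's fuse daemon, then cancel its Fuse mount."""
    subprocess.run(["ssh", host, "pkill -f fuse_daemon.py"])
    subprocess.run(["ssh", host, f"fusermount -u {mount_point}"])

def handle_hdfs_failure(nodes):
    if not hdfs_healthy():
        for host in nodes:
            shut_down_node(host)
```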
Whether the management process of the computing node is normal is detected by the daemon, and, in response to the management process of the computing node being normal, the daemon detects whether the user space file system mount point of the computing node has failed. In practice, the daemon checks whether the NodeManager process of each computing node is running normally.
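As a non-authoritative illustration, the daemon-side checks for this step might look like the following, where matching the NodeManager by process name and the mount point path /mnt/hdfs are assumptions rather than details given in the patent.
```python
import os
import subprocess

MOUNT_POINT = "/mnt/hdfs"   # assumed Fuse mount point of Hdfs

def nodemanager_running():
    """Check the node-management (NodeManager) process by name."""
    return subprocess.run(["pgrep", "-f", "NodeManager"],
                          capture_output=True).returncode == 0

def mount_point_valid():
    """A failed Fuse mount point typically disappears from the mount table
    (os.path.ismount also returns False if stat-ing the path fails)."""
    return os.path.ismount(MOUNT_POINT)
```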
In some embodiments, the method further comprises: re-mounting the user space file system in response to the failure of the user space file system mount point of the computing node.
In response to the user space file system mount point of the computing node being normal, the daemon detects whether the distributed file system file can be accessed through the user space file system mount point.
In response to the failure to access the distributed file system file through the user space file system mount point, the user space file system mount point is canceled and re-mounted. Specifically, the daemon detects whether Hdfs files can be accessed normally through the Fuse mount point (Hdfs cannot be accessed normally if the Fuse process is blocked); if they cannot, the Fuse mount point is canceled and re-mounted, as in the sketch below.
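The sketch below illustrates one way to implement this probe-and-repair step: the access test runs in a subprocess with a timeout so that a blocked Fuse process cannot hang the daemon itself. The remount command is an assumption; it depends on which Hdfs Fuse client is deployed (hadoop-fuse-dfs is used here only as an example) and on the actual NameNode address.
```python
import subprocess

MOUNT_POINT = "/mnt/hdfs"            # assumed Fuse mount point
NAMENODE = "dfs://namenode:8020"     # hypothetical NameNode URI

def hdfs_accessible(timeout=10):
    """List the mount point in a child process; a hung Fuse process times out."""
    try:
        r = subprocess.run(["ls", MOUNT_POINT], capture_output=True, timeout=timeout)
        return r.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def remount():
    """Cancel the Fuse mount point (lazily, in case it is stuck) and mount it again."""
    subprocess.run(["fusermount", "-u", "-z", MOUNT_POINT])
    subprocess.run(["hadoop-fuse-dfs", NAMENODE, MOUNT_POINT])

if not hdfs_accessible():
    remount()
```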
According to the embodiment of the present application, by deploying the user space file system daemon on all computing nodes of the Hadoop cluster, abnormal scenarios in which the user space file system mount point fails or is blocked can be identified and the mount point can be repaired automatically, which greatly improves the operation and maintenance efficiency of the Hadoop cluster, reduces the waste of computing resources, and improves user satisfaction with the Hadoop cluster.
It should be noted that, in the embodiments of the method for handling a user space file system failure, the steps may be interchanged, replaced, added, or deleted; methods obtained through such reasonable permutations and combinations shall therefore also fall within the protection scope of the present application, and the protection scope shall not be limited to the embodiments.
Based on the above object, a second aspect of the embodiments of the present application proposes a system for handling user space file system failures. As shown in fig. 2, the system 200 includes the following modules: a distribution module configured to dynamically acquire a list of all computing nodes in the cluster and distribute daemons to all computing nodes according to the list; a first detection module configured to detect, by the daemon, whether the management process of the computing node is normal, and, in response to the management process of the computing node being normal, detect, by the daemon, whether the user space file system mount point of the computing node has failed; a second detection module configured to detect, by the daemon, whether the distributed file system file can be accessed through the user space file system mount point in response to the user space file system mount point of the computing node being normal; and an execution module configured to cancel the user space file system mount point and re-mount in response to the inability to access the distributed file system file through the user space file system mount point.
In some embodiments, the system further comprises a monitoring module configured to: monitor the daemon running state of all computing nodes, and restart the daemon in response to the daemon running abnormally; and replace the daemon with a new daemon in response to the daemon running abnormally and the number of restarts reaching a threshold.
In some embodiments, the system further comprises a second monitoring module configured to: dynamically acquire the health status of the distributed file system, and, in response to the distributed file system being abnormal, terminate the daemons of all computing nodes and cancel the user space file system mounts of all computing nodes.
In some embodiments, the system further comprises a second execution module configured to: re-mount the user space file system in response to the failure of the user space file system mount point of the computing node.
In view of the above object, a third aspect of the embodiments of the present application provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, performing the following steps: S1, dynamically acquiring a list of all computing nodes in the cluster, and distributing daemons to all computing nodes according to the list; S2, detecting, by the daemon, whether the management process of the computing node is normal, and, in response to the management process of the computing node being normal, detecting, by the daemon, whether the user space file system mount point of the computing node has failed; S3, in response to the user space file system mount point of the computing node being normal, detecting, by the daemon, whether the distributed file system file can be accessed through the user space file system mount point; and S4, canceling the user space file system mount point and re-mounting in response to the inability to access the distributed file system file through the user space file system mount point.
In some embodiments, the steps further comprise: monitoring the daemon running state of all computing nodes, and restarting the daemon in response to the daemon running abnormally; and replacing the daemon with a new daemon in response to the daemon running abnormally and the number of restarts reaching a threshold.
In some embodiments, the steps further comprise: dynamically acquiring the health status of the distributed file system, and, in response to the distributed file system being abnormal, terminating the daemons of all computing nodes and canceling the user space file system mounts of all computing nodes.
In some embodiments, the steps further comprise: re-mounting the user space file system in response to the failure of the user space file system mount point of the computing node.
As shown in fig. 3, a hardware structure diagram of an embodiment of the computer device for handling a user space file system failure according to the present application is shown.
Taking the example of the device shown in fig. 3, a processor 301 and a memory 302 are included in the device.
The processor 301 and the memory 302 may be connected by a bus or otherwise, for example in fig. 3.
The memory 302 serves as a non-volatile computer readable storage medium, and may be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the method of handling user space file system failures in embodiments of the present application. The processor 301 executes various functional applications of the server and data processing, i.e., implements a method of handling user space file system failures, by running non-volatile software programs, instructions, and modules stored in the memory 302.
The memory 302 may include a program storage area and a data storage area, where the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data created according to the use of the method of handling user space file system failures, and the like. In addition, the memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 302 may optionally include memory located remotely from the processor 301, and such remote memory may be connected to the local module via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more computer instructions 303 corresponding to a method of handling a user space file system failure are stored in the memory 302, which when executed by the processor 301, perform the method of handling a user space file system failure in any of the method embodiments described above.
Any one of the embodiments of the computer apparatus that performs the above-described method of handling a user space file system failure may achieve the same or similar effects as any of the previously-described method embodiments that correspond thereto.
The present application also provides a computer readable storage medium storing a computer program which when executed by a processor performs a method of handling a user space file system failure.
FIG. 4 is a schematic diagram of an embodiment of a computer storage medium for handling a user space file system failure according to the present application. Taking a computer storage medium as shown in fig. 4 as an example, the computer readable storage medium 401 stores a computer program 402 that when executed by a processor performs the above method.
Finally, it should be noted that, as will be appreciated by those skilled in the art, all or part of the processes of the above method embodiments may be implemented by a computer program instructing relevant hardware. The program of the method for handling user space file system failures may be stored in a computer-readable storage medium, and, when executed, may include the flows of the embodiments of the above methods. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like. The computer program embodiments described above may achieve the same or similar effects as any of the method embodiments described above.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will appreciate that the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the application, including the claims, is limited to these examples. Combinations of features of the above embodiments or of different embodiments are also possible within the idea of the embodiments of the application, and many other variations of the different aspects of the embodiments described above exist which, for brevity, are not provided in detail. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present application.

Claims (10)

1. A method of handling user space file system failures, comprising the steps of:
dynamically acquiring lists of all computing nodes in the cluster, and distributing daemons to all computing nodes according to the lists;
detecting whether the management process condition of the computing node is normal or not through the daemon, and detecting whether a user space file system mounting point of the computing node is invalid or not through the daemon in response to the normal management process condition of the computing node;
detecting, by the daemon, whether a distributed file system file can be accessed through the user space file system mount point in response to the user space file system mount point of the computing node being normal; and
and canceling the user space file system mounting point and re-mounting in response to the failure to access the distributed file system file through the user space file system mounting point.
2. The method according to claim 1, wherein the method further comprises:
monitoring the daemon running state of all computing nodes, and restarting the daemon in response to the daemon running abnormality; and
in response to the daemon running abnormally and the number of restarts reaching a threshold, the daemon is replaced with a new daemon.
3. The method according to claim 1, wherein the method further comprises:
and dynamically acquiring the health condition of the distributed file system, and responding to the abnormality of the distributed file system, terminating daemons of all computing nodes and canceling the user space file system mounting of all computing nodes.
4. The method according to claim 1, wherein the method further comprises:
and re-mounting the user space file system in response to the failure of the mounting point of the user space file system of the computing node.
5. A system for handling user space file system failures, comprising:
the distribution module is configured to dynamically acquire lists of all computing nodes in the cluster and distribute daemons to all the computing nodes according to the lists;
the first detection module is configured to detect whether the management process condition of the computing node is normal through the daemon, and detect whether the user space file system mounting point of the computing node is invalid through the daemon in response to the normal management process condition of the computing node;
the second detection module is configured to respond to the fact that the user space file system mounting point of the computing node is normal, and detect whether the distributed file system file can be accessed through the user space file system mounting point through the daemon; and
and the execution module is configured to cancel the user space file system mounting point and re-mount the user space file system in response to the fact that the distributed file system file cannot be accessed through the user space file system mounting point.
6. The system of claim 5, further comprising a monitoring module configured to:
monitoring the daemon running state of all computing nodes, and restarting the daemon in response to the daemon running abnormality; and
in response to the daemon running abnormally and the number of restarts reaching a threshold, the daemon is replaced with a new daemon.
7. The system of claim 5, further comprising a second monitoring module configured to:
and dynamically acquiring the health condition of the distributed file system, and responding to the abnormality of the distributed file system, terminating daemons of all computing nodes and canceling the user space file system mounting of all computing nodes.
8. The system of claim 5, further comprising a second execution module configured to:
and re-mounting the user space file system in response to the failure of the mounting point of the user space file system of the computing node.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method of any one of claims 1-4.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any of claims 1-4.
CN202111339749.5A 2021-11-12 2021-11-12 Method and device for processing user space file system fault Active CN114281636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111339749.5A CN114281636B (en) 2021-11-12 2021-11-12 Method and device for processing user space file system fault

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111339749.5A CN114281636B (en) 2021-11-12 2021-11-12 Method and device for processing user space file system fault

Publications (2)

Publication Number Publication Date
CN114281636A CN114281636A (en) 2022-04-05
CN114281636B true CN114281636B (en) 2023-08-25

Family

ID=80869037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111339749.5A Active CN114281636B (en) 2021-11-12 2021-11-12 Method and device for processing user space file system fault

Country Status (1)

Country Link
CN (1) CN114281636B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104301442A (en) * 2014-11-17 2015-01-21 浪潮电子信息产业股份有限公司 Method for realizing client of access object storage cluster based on fuse
CN108920628A (en) * 2018-06-29 2018-11-30 郑州云海信息技术有限公司 A kind of distributed file system access method and device being adapted to big data platform
CN110365839A (en) * 2019-07-04 2019-10-22 Oppo广东移动通信有限公司 Closedown method, device, medium and electronic equipment
JP2021022357A (en) * 2019-07-26 2021-02-18 広東叡江云計算股▲分▼有限公司Guangdong Eflycloud Computing Co., Ltd Hybrid file construction method and system therefor based on fuse technology
CN113127437A (en) * 2019-12-31 2021-07-16 阿里巴巴集团控股有限公司 File system management method, cloud system, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114281636A (en) 2022-04-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant