CN114281636A - Method and device for processing user space file system fault

Info

Publication number: CN114281636A
Application number: CN202111339749.5A
Authority: CN
Inventors: 吴广远
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2022-04-05
Anticipated expiration: 2041-11-12
Also published as: CN114281636B

Abstract

The invention provides a method, a system, a device, and a storage medium for handling faults of a user space file system. The method comprises the following steps: dynamically acquiring a list of all computing nodes in a cluster, and distributing a daemon to all computing nodes according to the list; detecting, through the daemon, whether the management process of a computing node is running normally, and, in response to the management process running normally, detecting through the daemon whether the user space file system mount point of the computing node is invalid; in response to the user space file system mount point of the computing node being normal, detecting through the daemon whether a distributed file system file can be accessed through the mount point; and in response to the distributed file system file not being accessible through the mount point, unmounting the user space file system mount point and mounting it again. The invention can greatly improve the operation and maintenance efficiency of a Hadoop cluster, reduce the waste of computing resources, and improve user satisfaction with the Hadoop cluster.

Description

Method and device for processing user space file system fault

Technical Field

The present invention relates to the field of big data, and more particularly, to a method, system, device, and storage medium for handling a failure of a user space file system.

Background

The computing power of a single machine is insufficient for massive unstructured data processing tasks. If multi-machine parallel computation is adopted, application vendors would need to develop a distributed file system and a scheduling framework themselves. On one hand, this is difficult and consumes substantial manpower and material resources; on the other hand, the vendors cannot concentrate on developing data processing algorithms. In this scenario, most application vendors therefore choose the open-source Hadoop as the underlying platform, and their applications process massive unstructured data on top of the Hadoop distributed file system (HDFS) and the distributed scheduling framework (YARN).

The primary development language of Hadoop is Java, but most traditional unstructured data processing algorithms are written in C in pursuit of maximum performance, and HDFS has very limited support for C. FUSE (Filesystem in Userspace) is therefore used to mount HDFS on the Hadoop computing nodes, so that the distributed file system can be operated through FUSE as if it were a local file system.

In such usage scenarios, YARN manages the computing resources (CPU and memory) of all Hadoop computing nodes, but it cannot manage the computing resources occupied by FUSE. As a result, data processing subtasks and FUSE frequently contend for resources, the FUSE process hangs or its mount point becomes invalid, and eventually all computing tasks allocated to the node fail.

Because of the way the YARN scheduling algorithm works, the nodes on which resource contention will occur cannot be predicted. Normally, the abnormal FUSE mount point is handled manually and the data processing task is resubmitted only after a large number of computing tasks have failed. This makes maintenance of the Hadoop platform burdensome and seriously wastes the computing resources of the Hadoop cluster.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, a system, a computer device, and a computer-readable storage medium for handling a user space file system fault. By deploying a user space file system daemon on all computing nodes of a Hadoop cluster, the invention can identify abnormal situations in which a user space file system mount point has become invalid or is stuck, and automatically repair the mount point, thereby greatly improving the operation and maintenance efficiency of the Hadoop cluster, reducing the waste of computing resources, and improving user satisfaction with the Hadoop cluster.

Based on the above object, one aspect of the embodiments of the present invention provides a method for handling a user space file system fault, including the following steps: dynamically acquiring a list of all computing nodes in a cluster, and distributing a daemon to all computing nodes according to the list; detecting, through the daemon, whether the management process of a computing node is running normally, and, in response to the management process running normally, detecting through the daemon whether the user space file system mount point of the computing node is invalid; in response to the user space file system mount point of the computing node being normal, detecting through the daemon whether a distributed file system file can be accessed through the mount point; and in response to the distributed file system file not being accessible through the mount point, unmounting the user space file system mount point and mounting it again.

In some embodiments, the method further comprises: monitoring the running state of the daemons on all computing nodes, and restarting a daemon in response to the daemon running abnormally; and in response to the daemon running abnormally and the number of restarts reaching a threshold, replacing the daemon with a new daemon.

In some embodiments, the method further comprises: dynamically acquiring the health status of the distributed file system, and in response to an abnormality occurring in the distributed file system, terminating the daemons on all computing nodes and unmounting the user space file system on all computing nodes.

In some embodiments, the method further comprises: in response to the user space file system mount point of the computing node being invalid, mounting the user space file system again.

In another aspect of the embodiments of the present invention, a system for handling a user space file system fault is provided, including: a distribution module configured to dynamically acquire a list of all computing nodes in a cluster and distribute a daemon to all computing nodes according to the list; a first detection module configured to detect, through the daemon, whether the management process of a computing node is running normally, and, in response to the management process running normally, detect through the daemon whether the user space file system mount point of the computing node is invalid; a second detection module configured to, in response to the user space file system mount point of the computing node being normal, detect through the daemon whether a distributed file system file can be accessed through the mount point; and an execution module configured to, in response to the distributed file system file not being accessible through the mount point, unmount the user space file system mount point and mount it again.

In some embodiments, the system further comprises a monitoring module configured to: monitor the running state of the daemons on all computing nodes, and restart a daemon in response to the daemon running abnormally; and in response to the daemon running abnormally and the number of restarts reaching a threshold, replace the daemon with a new daemon.

In some embodiments, the system further comprises a second monitoring module configured to: dynamically acquire the health status of the distributed file system, and in response to an abnormality occurring in the distributed file system, terminate the daemons on all computing nodes and unmount the user space file system on all computing nodes.

In some embodiments, the system further comprises a second execution module configured to: in response to the user space file system mount point of the computing node being invalid, mount the user space file system again.

In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method as above.

In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, storing a computer program that, when executed by a processor, implements the above method steps.

The invention has the following beneficial technical effects: by deploying a user space file system daemon on all computing nodes of the Hadoop cluster, abnormal situations in which a user space file system mount point has become invalid or is stuck can be identified and the mount point can be repaired automatically. This greatly improves the operation and maintenance efficiency of the Hadoop cluster, reduces the waste of computing resources, and improves user satisfaction with the Hadoop cluster.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings without creative effort.

FIG. 1 is a diagram illustrating an embodiment of a method for handling a failure of a user space file system according to the present invention;

FIG. 2 is a diagram illustrating an embodiment of a system for handling a failure of a user space file system according to the present invention;

FIG. 3 is a diagram illustrating a hardware structure of an embodiment of a computer device for handling a failure of a user space file system according to the present invention;

FIG. 4 is a diagram of an embodiment of a computer storage medium for handling a user space file system failure provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name but are not the same. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

In a first aspect of an embodiment of the present invention, an embodiment of a method for handling a failure of a user space file system is provided. Fig. 1 is a schematic diagram illustrating an embodiment of a method for handling a failure of a user space file system according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:

S1, dynamically acquiring a list of all computing nodes in the cluster, and distributing a daemon to all computing nodes according to the list;

S2, detecting, through the daemon, whether the management process of the computing node is running normally, and, in response to the management process running normally, detecting through the daemon whether the user space file system mount point of the computing node is invalid;

S3, in response to the user space file system mount point of the computing node being normal, detecting through the daemon whether a distributed file system file can be accessed through the mount point; and

S4, in response to the distributed file system file not being accessible through the mount point, unmounting the user space file system mount point and mounting it again.
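The branching in steps S2 to S4 can be sketched as a small decision function. This is only an illustrative sketch: the names `Action` and `decide` are hypothetical, and the `SKIP` outcome for an abnormal management process is an assumption, since the text does not say what happens in that case. The `REMOUNT` outcome for an invalid mount point follows the optional embodiment described earlier.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"                                # assumption: no mount action when the management process is abnormal
    REMOUNT = "remount"                          # mount point invalid: mount again
    UNMOUNT_AND_REMOUNT = "unmount_and_remount"  # mounted, but files cannot be accessed
    NONE = "none"                                # all checks passed


def decide(node_manager_ok: bool, mount_valid: bool, file_accessible: bool) -> Action:
    """Map the results of the three checks to a single repair action."""
    if not node_manager_ok:
        return Action.SKIP
    if not mount_valid:
        return Action.REMOUNT
    if not file_accessible:
        return Action.UNMOUNT_AND_REMOUNT
    return Action.NONE
```

Each later check is only consulted when the earlier one passed, which mirrors the "in response to" ordering of the steps.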

The application program submits a task to the ResourceManager node of the distributed scheduling framework. The ResourceManager node divides the task into multiple Map and Reduce tasks and distributes the Map tasks to different computing nodes according to a scheduling algorithm. The computing nodes access data on HDFS through their local FUSE mounts to perform the computation and write the results back to HDFS; the Reduce tasks then aggregate the results through FUSE to complete the computing task. During this process, once the FUSE mount on a node becomes invalid, all computing tasks running on that node fail, which slows down the overall computing efficiency of the Hadoop cluster and, in severe cases, causes the whole computing task to fail.

In the embodiment of the invention, a monitoring node is added, the FUSE daemon is deployed in batch to all nodes of the cluster, and the FUSE mount status of all nodes of the cluster is monitored.

A list of all computing nodes in the cluster is dynamically acquired, and the daemon is distributed to all computing nodes according to the list. Specifically, the monitoring node dynamically acquires the list of all computing nodes of the cluster and distributes the daemon's files to all of them.
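The text does not say how the monitoring node obtains the node list or copies the files. One possible approach, shown here purely as a sketch, is to parse the output of the `yarn node -list` command and build one `scp` command per host. The column layout assumed by the parser (node ID as `host:port` first, node state second), the function names, and the file paths are all assumptions made for illustration.

```python
from typing import List


def parse_running_nodes(listing: str) -> List[str]:
    """Extract hostnames of RUNNING nodes from `yarn node -list` style output.

    Assumes each data row starts with "<host>:<port>" followed by the node state.
    Header and summary lines do not match this shape and are skipped.
    """
    hosts = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and ":" in fields[0] and fields[1] == "RUNNING":
            hosts.append(fields[0].rsplit(":", 1)[0])
    return hosts


def build_copy_commands(hosts: List[str], local_path: str, remote_path: str) -> List[List[str]]:
    """Build one scp argument list per host; the caller decides how to run them."""
    return [["scp", local_path, f"{host}:{remote_path}"] for host in hosts]
```

Re-running the parser periodically would pick up nodes that join or leave the cluster, which is one reading of "dynamically acquiring" the list.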

In some embodiments, the method further comprises: monitoring the running state of the daemons on all computing nodes, and restarting a daemon in response to the daemon running abnormally; and in response to the daemon running abnormally and the number of restarts reaching a threshold, replacing the daemon with a new daemon. The monitoring node monitors the running state of the daemons on all computing nodes and, upon finding a node whose daemon has failed, promptly restarts the daemon on that node.
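The restart-then-replace policy can be sketched as a small per-node counter. The class name `DaemonSupervisor`, the default threshold of 3, and the choice to reset the counter after a healthy report or a replacement are assumptions; the text only states that the daemon is replaced once the number of restarts reaches a threshold.

```python
from dataclasses import dataclass


@dataclass
class DaemonSupervisor:
    """Tracks restarts of one node's daemon and decides when to replace it."""

    restart_threshold: int = 3  # assumed value; the text does not specify one
    restarts: int = 0

    def on_health_report(self, healthy: bool) -> str:
        """Return "ok", "restart", or "replace" for the latest health report."""
        if healthy:
            self.restarts = 0  # assumption: a healthy report clears the count
            return "ok"
        if self.restarts >= self.restart_threshold:
            self.restarts = 0  # a freshly deployed daemon starts with a clean count
            return "replace"
        self.restarts += 1
        return "restart"
```

With a threshold of 2, two consecutive abnormal reports yield "restart" and the third yields "replace".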

In some embodiments, the method further comprises: dynamically acquiring the health status of the distributed file system, and in response to an abnormality occurring in the distributed file system, terminating the daemons on all computing nodes and unmounting the user space file system on all computing nodes. The monitoring node dynamically senses the health status of HDFS; for example, when the HDFS service is stopped, it promptly stops the daemons on all computing nodes and unmounts FUSE on all computing nodes.
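A minimal sketch of this cluster-wide reaction is given below. It only builds a plan of per-host actions; how HDFS health is actually determined and how the actions are executed remotely are left open, and the function name and action labels are hypothetical.

```python
from typing import Iterable, List, Tuple


def plan_cluster_shutdown(hdfs_healthy: bool, hosts: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (host, action) pairs to execute when HDFS is abnormal.

    The daemon is stopped before the unmount on each host so that a still-running
    daemon does not immediately mount the file system again.
    """
    if hdfs_healthy:
        return []
    plan = []
    for host in hosts:
        plan.append((host, "stop_daemon"))
        plan.append((host, "unmount_fuse"))
    return plan
```

Stopping the daemon first is a design choice of this sketch, not something the text prescribes beyond listing the two actions in that order.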

The daemon detects whether the management process of the computing node is running normally and, in response to the management process running normally, detects whether the user space file system mount point of the computing node is invalid. Specifically, the daemon detects whether the NodeManager process of each computing node is running normally.
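The two checks in this step might look like the following sketch. It assumes that the NodeManager process is detected from the output of the JDK `jps` tool (lines of the form `<pid> <main class>`) and that an "invalid" mount point means one missing from the kernel mount table in `/proc/mounts`. Both are assumptions, as the text does not name a detection mechanism. Note that a presence check alone does not catch a mount that is listed but unresponsive; that case is what the later access check is for.

```python
def node_manager_running(jps_output: str) -> bool:
    """Return True if a NodeManager JVM appears in `jps` style output."""
    for line in jps_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "NodeManager":
            return True
    return False


def fuse_mount_present(proc_mounts: str, mount_point: str) -> bool:
    """Return True if mount_point is listed as a FUSE file system.

    /proc/mounts fields are: device, mount point, file system type, options, dump, pass.
    """
    for line in proc_mounts.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point and fields[2].startswith("fuse"):
            return True
    return False
```

Both functions take text as input so that they can be exercised without a live cluster; a daemon would feed them the real `jps` output and the contents of `/proc/mounts`.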

In response to the user space file system mount point of the computing node being normal, the daemon detects whether a distributed file system file can be accessed through the mount point.
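Because a stuck FUSE process can cause file operations on the mount point to block, an access check is safer when run in a child process with a timeout, so that the daemon itself does not hang. The sketch below assumes the check is a plain `stat` of a known file; the function names, the probe command, and the 10-second default are illustrative choices, not taken from the text.

```python
import subprocess
from typing import Sequence


def probe_succeeds(cmd: Sequence[str], timeout_s: float) -> bool:
    """Run cmd in a child process; True only if it exits with status 0 in time."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # the probe hung, e.g. because the mount is unresponsive
    except OSError:
        return False  # the probe command itself could not be started
    return result.returncode == 0


def hdfs_file_accessible(path_under_mount: str, timeout_s: float = 10.0) -> bool:
    """Check a file under the FUSE mount point by running `stat` on it."""
    return probe_succeeds(["stat", path_under_mount], timeout_s)
```

On timeout, `subprocess.run` kills the child and waits for it to exit, so the probe still depends on the blocked child being killable.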

In response to the distributed file system file not being accessible through the user space file system mount point, the mount point is unmounted and mounted again. Specifically, the daemon detects whether an HDFS file can be accessed normally through the FUSE mount point (if the FUSE process is stuck, the HDFS file cannot be accessed normally); if it cannot, the FUSE mount point is unmounted and mounted again.
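The unmount-and-remount repair can be sketched as follows. The sketch uses `fusermount -u` with the lazy `-z` option to detach the mount point, and takes the mount command as a parameter because the exact command for mounting HDFS through FUSE depends on the Hadoop distribution. The function names and the injectable `runner` are hypothetical and exist mainly to make the logic testable.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Default runner: execute cmd and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def remount(mount_point: str, mount_cmd: Sequence[str], runner: Runner = run_command) -> bool:
    """Unmount the FUSE mount point, then mount again; True if the mount succeeded."""
    # The unmount result is ignored on purpose: the mount may already be gone.
    runner(["fusermount", "-uz", mount_point])
    return runner(list(mount_cmd)) == 0
```

A caller would typically repeat the file access check after `remount` returns True to confirm that the repair worked.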

According to the embodiment of the invention, the user space file system daemon is deployed in all the computing nodes of the Hadoop cluster, so that the abnormal scene that the mounting points of the user space file system are invalid or jammed can be identified, the mounting points of the user space file system are automatically repaired, the operation and maintenance efficiency of the Hadoop cluster is greatly improved, the waste of computing resources is reduced, and the satisfaction degree of users on the Hadoop cluster is improved.

It should be particularly noted that the steps in the embodiments of the method for handling a user space file system fault described above can be interleaved, replaced, added, and deleted. Methods for handling a user space file system fault that result from such reasonable permutations and combinations therefore also fall within the scope of the present invention, and the scope of the present invention should not be limited to the embodiments.

In view of the above object, a second aspect of the embodiments of the present invention provides a system for handling a user space file system fault. As shown in fig. 2, the system 200 includes the following modules: a distribution module configured to dynamically acquire a list of all computing nodes in a cluster and distribute a daemon to all computing nodes according to the list; a first detection module configured to detect, through the daemon, whether the management process of a computing node is running normally, and, in response to the management process running normally, detect through the daemon whether the user space file system mount point of the computing node is invalid; a second detection module configured to, in response to the user space file system mount point of the computing node being normal, detect through the daemon whether a distributed file system file can be accessed through the mount point; and an execution module configured to, in response to the distributed file system file not being accessible through the mount point, unmount the user space file system mount point and mount it again.

In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to perform the following steps: S1, dynamically acquiring a list of all computing nodes in a cluster, and distributing a daemon to all computing nodes according to the list; S2, detecting, through the daemon, whether the management process of a computing node is running normally, and, in response to the management process running normally, detecting through the daemon whether the user space file system mount point of the computing node is invalid; S3, in response to the user space file system mount point of the computing node being normal, detecting through the daemon whether a distributed file system file can be accessed through the mount point; and S4, in response to the distributed file system file not being accessible through the mount point, unmounting the user space file system mount point and mounting it again.

In some embodiments, the steps further comprise: monitoring the running state of the daemons on all computing nodes, and restarting a daemon in response to the daemon running abnormally; and in response to the daemon running abnormally and the number of restarts reaching a threshold, replacing the daemon with a new daemon.

In some embodiments, the steps further comprise: dynamically acquiring the health status of the distributed file system, and in response to an abnormality occurring in the distributed file system, terminating the daemons on all computing nodes and unmounting the user space file system on all computing nodes.

In some embodiments, the steps further comprise: in response to the user space file system mount point of the computing node being invalid, mounting the user space file system again.

Fig. 3 is a schematic hardware structural diagram of an embodiment of the computer device for processing a user space file system failure according to the present invention.

Taking the device shown in fig. 3 as an example, the device includes a processor 301 and a memory 302.

The processor 301 and the memory 302 may be connected by a bus or other means, such as the bus connection in fig. 3.

The memory 302 is a non-volatile computer-readable storage medium, and can be used for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the method for handling a user space file system failure in the embodiments of the present application. The processor 301 executes various functional applications of the server and data processing, i.e., implements a method of handling user-space file system failures, by running non-volatile software programs, instructions, and modules stored in the memory 302.

The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a method of handling a user space file system failure, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 302 optionally includes memory located remotely from processor 301, which may be connected to a local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Corresponding computer instructions 303 of one or more methods of handling a user space file system failure are stored in the memory 302 and when executed by the processor 301 perform the method of handling a user space file system failure in any of the above-described method embodiments.

Any embodiment of a computer device implementing the method for handling a user space file system failure as described above may achieve the same or similar effects as any of the preceding method embodiments corresponding thereto.

The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs a method of handling a user space file system failure.

FIG. 4 is a schematic diagram of an embodiment of a computer storage medium for handling a user space file system failure according to the present invention. Taking the computer storage medium as shown in fig. 4 as an example, the computer readable storage medium 401 stores a computer program 402 which, when executed by a processor, performs the method as described above.

Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program of the method for handling a user space file system fault can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A method for handling a user space file system failure, comprising the steps of:

dynamically acquiring a list of all computing nodes in a cluster, and distributing a daemon to all computing nodes according to the list;

detecting, through the daemon, whether the management process of a computing node is running normally, and, in response to the management process running normally, detecting through the daemon whether the user space file system mount point of the computing node is invalid;

in response to the user space file system mount point of the computing node being normal, detecting through the daemon whether a distributed file system file can be accessed through the mount point; and

in response to the distributed file system file not being accessible through the mount point, unmounting the user space file system mount point and mounting it again.

2. The method of claim 1, further comprising:

monitoring the running state of the daemons on all computing nodes, and restarting a daemon in response to the daemon running abnormally; and

in response to the daemon running abnormally and the number of restarts reaching a threshold, replacing the daemon with a new daemon.

3. The method of claim 1, further comprising:

dynamically acquiring the health status of the distributed file system, and in response to an abnormality occurring in the distributed file system, terminating the daemons on all computing nodes and unmounting the user space file system on all computing nodes.

4. The method of claim 1, further comprising:

in response to the user space file system mount point of the computing node being invalid, mounting the user space file system again.

5. A system for handling a user space file system failure, comprising:

a distribution module configured to dynamically acquire a list of all computing nodes in a cluster and distribute a daemon to all computing nodes according to the list;

a first detection module configured to detect, through the daemon, whether the management process of a computing node is running normally, and, in response to the management process running normally, detect through the daemon whether the user space file system mount point of the computing node is invalid;

a second detection module configured to, in response to the user space file system mount point of the computing node being normal, detect through the daemon whether a distributed file system file can be accessed through the mount point; and

an execution module configured to, in response to the distributed file system file not being accessible through the mount point, unmount the user space file system mount point and mount it again.

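6. The system of claim 5, further comprising a monitoring module configured to:

monitor the running state of the daemons on all computing nodes, and restart a daemon in response to the daemon running abnormally; and

in response to the daemon running abnormally and the number of restarts reaching a threshold, replace the daemon with a new daemon.

7. The system of claim 5, further comprising a second monitoring module configured to:

dynamically acquire the health status of the distributed file system, and in response to an abnormality occurring in the distributed file system, terminate the daemons on all computing nodes and unmount the user space file system on all computing nodes.

8. The system of claim 5, further comprising a second execution module configured to:

in response to the user space file system mount point of the computing node being invalid, mount the user space file system again.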

9. A computer device, comprising:

at least one processor; and

a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 4.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.