CN113791919A - Method for multi-node parallel processing of massive archive files with any directory structure - Google Patents


Info

Publication number
CN113791919A
Authority
CN
China
Prior art keywords
task
node
control node
tasks
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111031407.7A
Other languages
Chinese (zh)
Inventor
扈紫豪
黄祥志
王宝玉
臧文乾
王更科
余涛
赵亚萌
王栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Siwei New Century Information Technology Co ltd
Langfang Zhongke Space Information Technology Co ltd
Zhongke Xingtong Langfang Information Technology Co ltd
Original Assignee
Beijing Siwei New Century Information Technology Co ltd
Langfang Zhongke Space Information Technology Co ltd
Zhongke Xingtong Langfang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Siwei New Century Information Technology Co ltd, Langfang Zhongke Space Information Technology Co ltd, Zhongke Xingtong Langfang Information Technology Co ltd filed Critical Beijing Siwei New Century Information Technology Co ltd
Priority to CN202111031407.7A priority Critical patent/CN113791919A/en
Publication of CN113791919A publication Critical patent/CN113791919A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0643Management of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for multi-node parallel processing of massive archive files with any directory structure, comprising the following steps: a user inputs a file path and issues a file-processing task to a task control node; the task control node sends the task to the task issuing message queue and designates work nodes to execute the task; the work nodes that execute tasks become execution nodes, which execute the tasks and feed completion information back to the task feedback message queue; and the task control node pulls the task completion status from the task feedback message queue and stops the working threads after judging that the stop condition is met. With this method, capacity is dynamically expanded or reduced during task execution according to task completion status; and because tasks are distributed through the task issuing message queue, the task load of each node is better balanced and processing efficiency is higher.

Description

Method for multi-node parallel processing of massive archive files with any directory structure
Technical Field
The invention belongs to the field of data processing, and particularly relates to a method for multi-node parallel processing of massive archived files with any directory structure.
Background
In the current information industry, data storage work mostly involves operations such as unified reading, warehousing, and copying when facing a large number of small files. These files are characterized by: (1) each individual file is small, while the total data volume is very large; (2) file storage is not standardized and directory hierarchies are very deep.
In the prior art, the file list is obtained by single-node recursive traversal of the directory; when the data volume is especially large or the file directory is especially deep, recursion efficiency drops sharply, so directories that are too deep or contain too many entries cannot be handled. The prior art also relies on static data allocation: the batch of data is statically split into several tasks according to the data directory, typically by top-level parent directory, and each task is handed to a corresponding node for processing.
Disclosure of Invention
In view of the above-mentioned drawbacks and deficiencies of the prior art, the present invention is directed to a method for multi-node parallel processing of large volumes of archive files with arbitrary directory structure.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in a first aspect, a method for multi-node parallel processing of a large number of archived files with any directory structure is provided, which is characterized by comprising the following steps:
inputting a path of a task and issuing the task to a task control node;
the task control node sends the task to the task issuing message queue and designates work nodes to execute the task;
the work nodes that execute tasks become execution nodes; the execution nodes execute the tasks and feed completion information back to the task feedback message queue after each task is completed;
and the task control node pulls the task completion status from the task feedback message queue, and stops the working threads after judging that the stop condition is met.
According to the technical scheme provided by the embodiment of the application, the task control node pulls task records from the execution nodes, and writes the pulled task records and task completion status to the database.
According to the technical scheme provided by the embodiment of the application, the number of work nodes is n; the task control node designates 1 to n work nodes to start, and the started work nodes become execution nodes.
According to the technical scheme provided by the embodiment of the application, the stop condition is that the task control node judges, after a specified time, that there are no new tasks and that all dispatched tasks are completed; when the task control node determines that the stop condition is met, it stops all working threads.
According to the technical scheme provided by the embodiment of the application, the task execution of any execution node comprises the following steps:
pulling a task from the task issuing message queue;
judging whether a task was pulled;
if no task was pulled, the execution node pulls again;
if a task was pulled, judging whether subtasks exist under the task path;
if subtasks exist under the task path, sending each subtask to the task issuing message queue as a new task;
if no subtasks exist under the task path, processing the task directly;
and after the task is completed, feeding task completion information back to the task feedback message queue.
In a second aspect, there is provided a computer apparatus, the apparatus comprising: a memory for storing executable program code; one or more processors configured to read executable program code stored in the memory to perform the method for multi-node parallel processing of large volumes of archive files with arbitrary directory structures according to the first aspect.
In a third aspect, a computer-readable storage medium is provided, which includes instructions that, when executed on a computer, cause the computer to perform the method for multi-node parallel processing of large volumes of arbitrary directory structure archived files according to the first aspect.
The invention has the following beneficial effects:
the task control nodes are arranged, and the task control nodes designate certain working nodes to execute the working tasks, so that the working nodes can be started or closed at any time in the task execution process, dynamic capacity expansion and capacity reduction are realized according to the task completion condition in the task execution process, and the task processing efficiency is improved; tasks are distributed to the task issuing message queue, and the execution nodes pull the tasks from the task issuing message queue, so that the problem of task load imbalance caused by static task distribution is well solved, the task load of each node is more balanced, and the processing efficiency is better; the traditional task scheduling system adopts network communication between a task control node and a task execution node to realize task dispatching and task result feedback.
In addition, each execution node first judges whether subtasks exist; if they do, the subtasks are sent back to the task issuing message queue as new tasks. Tasks can therefore be distributed dynamically, and processing performance is better than the existing recursive traversal of task directories.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a schematic diagram of a multi-node architecture according to the present application;
FIG. 2 is a schematic diagram of a workflow of an executing node according to the present application;
fig. 3 is a schematic diagram of a control node workflow according to the present application;
fig. 4 is a schematic diagram of a control node stop work flow according to the present application.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Example 1:
A method for multi-node parallel processing of massive archive files with any directory structure comprises the following steps: inputting the path of a task and issuing the task to a task control node; the task control node sends the task to the task issuing message queue and designates work nodes to execute the task; the work nodes that execute tasks become execution nodes, which execute the tasks and feed completion information back to the task feedback message queue; and the task control node pulls the task completion status from the task feedback message queue and stops the working threads after judging that the stop condition is met.
Specifically, in the existing allocation mode, tasks are statically and evenly allocated after the task path is received: 10 tasks are evenly allocated to 5 nodes, 2 tasks per node. When the data volume of each task is comparable, this allocation is fairly balanced; but when the data volumes differ greatly, with some tasks extremely large and others extremely small, the load of each node becomes uneven. Moreover, once a node finishes its own tasks, pulling remaining tasks over from other nodes is very troublesome.

The method of the present application does not allocate a task to a specific node. Referring to fig. 1, the task control node distributes tasks to a task issuing message queue, and the execution nodes pull tasks from that queue. This solves the load-imbalance problem caused by static task allocation, keeps the load of each execution node relatively balanced, and improves task processing efficiency. A traditional task scheduling system uses network communication between the task control node and the task execution nodes to dispatch tasks and feed back results; when too many subtasks are generated, such frequent node communication degrades data-processing performance. By setting up the task issuing message queue and the task feedback message queue, the execution nodes and the task control node pull tasks and completion status directly from the corresponding queues, and the asynchronous message queues eliminate the performance loss caused by excessive node communication when the control node and execution nodes handle many subtasks.
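The contrast described above, static even splitting versus pulling from a shared queue, can be sketched in a single Python process. This is an illustrative sketch only: `queue.Queue` and threads stand in for the task issuing message queue, the task feedback message queue, and the distributed execution nodes; a real multi-machine deployment would use a message broker, which the patent does not name.

```python
import queue
import threading

task_queue = queue.Queue()      # stands in for the task issuing message queue
feedback_queue = queue.Queue()  # stands in for the task feedback message queue

def execution_node(node_id):
    # Each execution node pulls a task whenever it is free, so a node that
    # happens to receive small tasks simply pulls more of them: the load
    # balances itself without any static split.
    while True:
        try:
            task = task_queue.get(timeout=0.2)
        except queue.Empty:
            return  # queue drained and no new task pulled: this sketch exits
        feedback_queue.put((node_id, task))  # report completion
        task_queue.task_done()

# Control node: dispatch 10 tasks of very uneven size to the shared queue.
for size in [100, 1, 1, 1, 1, 1, 1, 1, 1, 1]:
    task_queue.put(size)

workers = [threading.Thread(target=execution_node, args=(i,)) for i in range(5)]
for w in workers:
    w.start()
for w in workers:
    w.join()

completed = []
while not feedback_queue.empty():
    completed.append(feedback_queue.get())
print(len(completed))  # all 10 tasks completed, whichever nodes pulled them
```

Whichever node pulls the oversized task, the other nodes keep draining the queue, which is exactly the imbalance-avoidance the paragraph above describes.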
Meanwhile, the task control node designates several work nodes to start and become execution nodes, so work nodes can be added or shut down according to task completion status during task execution, realizing dynamic capacity expansion and reduction and further improving task processing efficiency.
Example 2:
In this embodiment, on the basis of embodiment 1, the task control node pulls task records from the execution nodes, and writes the pulled task records and task completion status to the database.
Specifically, as shown in fig. 3, a user can query the data records of the database through the task control node to check task status and task completion, and the task control node can also query the data records from the database for further arrangement and processing.
Example 3:
In this embodiment, on the basis of embodiment 1, the number of work nodes is n; the task control node designates 1 to n work nodes to start, and the started work nodes become execution nodes.
Specifically, because the task control node does not fixedly allocate tasks to particular nodes but distributes them to the task issuing message queue, every started execution node with spare processing capacity can pull tasks from the queue, which improves work efficiency. Several work nodes are set up, and the task control node designates which of them to start; a started work node can execute tasks and thereby becomes an execution node. Through data exchange with the database, the task control node obtains task status and completion status, and can start all work nodes, designate only some of them to start, or shut several down as needed. This achieves dynamic addition and deletion of nodes, allows capacity to be expanded and reduced dynamically during task execution, and further improves task processing efficiency.
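The 1-to-n designation can be sketched as follows. The helper name `start_worker`, the `stop_event` signal, and the backlog threshold that triggers scaling are all illustrative assumptions, not details specified by the patent; `queue.Queue` again stands in for the task issuing message queue.

```python
import queue
import threading
import time

N_TASKS = 60
task_queue = queue.Queue()
stop_event = threading.Event()
started = []  # (node_id, thread) for each work node the control node designated

def worker(node_id):
    # An execution node keeps pulling until the control node signals stop.
    while not stop_event.is_set():
        try:
            task_queue.get(timeout=0.05)
            time.sleep(0.001)  # simulate a little work per task
            task_queue.task_done()
        except queue.Empty:
            pass

def start_worker(node_id):
    # Hypothetical helper: the control node designates one work node to start.
    t = threading.Thread(target=worker, args=(node_id,))
    t.start()
    started.append((node_id, t))

for i in range(N_TASKS):
    task_queue.put(i)

start_worker(0)   # control node initially designates 2 of the n work nodes
start_worker(1)
if N_TASKS > 20:  # assumed scaling policy: large backlog, designate a third
    start_worker(2)

task_queue.join()  # all dispatched tasks completed
stop_event.set()   # stop condition met: control node stops all working threads
for _, t in started:
    t.join()
print(len(started), task_queue.qsize())
```

The same mechanism shuts nodes down: setting the event (or a per-node flag) lets any subset of workers drain out while the queue keeps feeding the rest.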
Example 4:
In this embodiment, on the basis of embodiment 1, the stop condition is that the task control node determines, after a specified time, that there are no new tasks and all dispatched tasks are completed; when the task control node determines that the stop condition is met, it stops all working threads.
Specifically, as shown in fig. 4, when the task control node determines after the specified time that there are no new tasks and all dispatched tasks are completed, it judges that the stop condition is met and stops all working threads. This stop-judgment flow prevents nodes from running idle and improves the working efficiency of the task control node.
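The stop judgment can be sketched as a loop over the task feedback queue. `SPECIFIED_TIME` and the two counters are illustrative names assumed for this sketch, with `queue.Queue` standing in for the task feedback message queue:

```python
import queue

SPECIFIED_TIME = 0.3  # assumed grace period for "no new task after a specified time"
dispatched = 5        # tasks the control node has dispatched so far
completed = 0

feedback_queue = queue.Queue()
for i in range(dispatched):  # execution nodes have reported 5 completions
    feedback_queue.put(i)

stopped = False
while not stopped:
    try:
        feedback_queue.get(timeout=SPECIFIED_TIME)
        completed += 1  # a completion record arrived within the window: keep running
    except queue.Empty:
        # No new record within the specified time...
        if completed == dispatched:
            stopped = True  # ...and all dispatched tasks are done: stop the threads
print(stopped, completed)  # → True 5
```

If some dispatched task were still outstanding, the timeout alone would not trigger the stop, which matches the two-part condition in the embodiment.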
Example 5:
In this embodiment, on the basis of embodiment 1, the task execution of any execution node comprises the following steps: pulling a task from the task issuing message queue; judging whether a task was pulled; if no task was pulled, the execution node pulls again; if a task was pulled, judging whether subtasks exist under the task path; if subtasks exist, sending each subtask to the task issuing message queue as a new task; if no subtasks exist, processing the task directly; and after the task is completed, feeding task completion information back to the task feedback message queue.
Specifically, as shown in fig. 2, when most tasks are executed by a single execution node that first traverses the tasks and then executes them directly, efficiency is very low when there are many task levels. The execution node of the present application adds a judgment step for whether subtasks exist: based on the task issuing message queue, each subtask is placed at the tail of the queue as a new task; when another execution node (or the current one) pulls a task, it obtains such a new task and continues processing. This process repeats continuously, changing static task allocation into a dynamic process and improving task-processing efficiency.
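The steps of this embodiment can be sketched with an in-memory task tree. The `(path, children)` task shape is an assumption used only to model whether subtasks exist under a task path; a single loop plays the role of the execution nodes.

```python
import queue

task_queue = queue.Queue()      # task issuing message queue
feedback_queue = queue.Queue()  # task feedback message queue

# A task is a (path, children) pair; non-empty children mean subtasks exist.
root = ("archive", [("archive/a", []), ("archive/b", [("archive/b/c", [])])])
task_queue.put(root)

while True:
    try:
        path, children = task_queue.get_nowait()  # pull a task
    except queue.Empty:
        break  # nothing pulled; a real node would retry, this sketch stops
    if children:
        # Subtasks exist under the task path: each one goes to the tail of
        # the task issuing message queue as a new task.
        for child in children:
            task_queue.put(child)
    else:
        # No subtasks: process the task directly, then report completion.
        feedback_queue.put(path)

done = []
while not feedback_queue.empty():
    done.append(feedback_queue.get())
print(sorted(done))  # → ['archive/a', 'archive/b/c']
```

Only leaf tasks are "processed"; parent paths exist solely to fan their subtasks back into the queue, which is what replaces the single-node recursion.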
In a preferred embodiment of the present application, a computer apparatus, the apparatus comprising: a memory for storing executable program code; one or more processors configured to read executable program code stored in the memory to perform the method for multi-node parallel processing of large number of archive files with arbitrary directory structure according to any of the above embodiments.
The computer system includes a Central Processing Unit (CPU) that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) or a program loaded from a storage section into a Random Access Memory (RAM). In the RAM, various programs and data necessary for system operation are also stored. The CPU, ROM, and RAM are connected to each other via a bus. An input/output (I/O) interface is also connected to the bus.
The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card, a modem, or the like. The communication section performs communication processing via a network such as the internet. The drive is also connected to the I/O interface as needed. A removable medium such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive as necessary, so that a computer program read out therefrom is mounted into the storage section as necessary.
In particular, the processes described above for the method of multi-node parallel processing of large volumes of arbitrary directory structure archive files may be implemented as computer software programs, according to embodiments of the present invention. For example, an embodiment of the present invention, which relates to a method for multi-node parallel processing of archived files in vast numbers of arbitrary directory structures, comprises a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus, and computer program products for multi-node parallel processing of large volumes of arbitrary directory structure archived files in accordance with the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or by hardware, and the described units may also be disposed in a processor, for example described as: a processor comprising a first generation module, an acquisition module, a search module, a second generation module and a merging module. The name of a unit or module does not in itself constitute a limitation on that unit or module.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method for multi-node parallel processing of a large number of archive files with arbitrary directory structure as described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Application scenarios:
The method for multi-node parallel processing of massive archived files with any directory structure is mainly applied to file-processing scenarios, where processing a file task specifically means reading, warehousing, or copying file data, and a node can be a computer, a processor, or a server. When an open-source component faces millions of files stored as tile records, further processing of the file data is very inconvenient and inefficient. With this method, a user inputs a path, and the task control node obtains the file-processing task and dispatches it to the task issuing message queue. An execution node obtains a task from the task issuing message queue and scans all entries in the current-level folder: when an entry contains file data, it processes the data and returns completion information to the task feedback queue after processing; when an entry is a directory, it sends the directory back to the task issuing message queue as a new subtask, and the process repeats continuously. Thus, when an execution node pulls a task from the task issuing message queue, it may pull a task dispatched by the task control node or one sent back by an execution node (including itself); the task distribution process is dynamic, which solves the performance problem of the traditional recursive directory traversal, improves data-acquisition efficiency, and imposes no requirement on the regularity of the file directory hierarchy.
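The folder-scanning behaviour of this scenario can be sketched against a throwaway directory tree. A single loop stands in for many execution nodes, and "processing" a file is reduced to reporting its completion; the tree layout is invented for the sketch.

```python
import os
import queue
import tempfile

# Build a small nested directory tree to stand in for the archive store.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "d1", "d2"))
for rel in ("f0.txt", os.path.join("d1", "f1.txt"),
            os.path.join("d1", "d2", "f2.txt")):
    with open(os.path.join(root, rel), "w") as fh:
        fh.write("data")

task_queue = queue.Queue()      # task issuing message queue
feedback_queue = queue.Queue()  # task feedback message queue
task_queue.put(root)            # control node dispatches the user-input path

while True:
    try:
        path = task_queue.get_nowait()
    except queue.Empty:
        break
    for entry in os.scandir(path):  # scan the current-level folder only
        if entry.is_dir():
            task_queue.put(entry.path)      # directory: back onto the queue
        else:
            feedback_queue.put(entry.path)  # file "processed": report done

processed = []
while not feedback_queue.empty():
    processed.append(os.path.basename(feedback_queue.get()))
print(sorted(processed))  # → ['f0.txt', 'f1.txt', 'f2.txt']
```

No node ever recurses below one directory level, so arbitrarily deep or irregular hierarchies cost nothing extra per pull: depth is paid for one queue round-trip at a time.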
Meanwhile, in the scene, the task control node also pulls the task record from the task issuing queue, and the starting or closing of the starting working node is increased by combining the task completion condition of the task feedback message queue, so that the working nodes can be dynamically increased or decreased, and the efficiency of processing the file and acquiring the file data is further improved.
The task control node exchanges the acquired file data, task records, and task completion data to the database for storage, which on the one hand makes it convenient for the task control node to acquire data and control the increase and decrease of work nodes, and on the other hand makes it convenient for users to acquire data through the task control node.
When the task control node judges that no file path has been input by the user beyond the specified time (that is, no new task is dispatched) and that all dispatched tasks are completed, it stops all working threads, further improving the efficiency of reading, warehousing, or copying file data.
The foregoing description is only exemplary of the preferred embodiments of the invention and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features and (but not limited to) features having similar functions disclosed in the present invention are mutually replaced to form the technical solution.

Claims (7)

1. A method for multi-node parallel processing of massive archive files with any directory structure, characterized by comprising the following steps:
inputting a path of a task and issuing the task to a task control node;
the task control node sending the task to the task issuing message queue and designating working nodes to execute the task;
the working nodes executing the task becoming execution nodes, the execution nodes executing the task and feeding completion information back to the task feedback message queue after the task is completed;
and the task control node pulling the task completion status from the task feedback message queue and stopping the working threads after determining that a stop condition is met.
2. The method for multi-node parallel processing of massive archive files with any directory structure according to claim 1, wherein the task control node pulls task records from the execution nodes and writes the pulled task records and task completion status to the database.
3. The method for multi-node parallel processing of massive archive files with any directory structure according to claim 1, wherein the number of working nodes is n, the task control node designates 1 to n working nodes to start, and the started working nodes become execution nodes.
4. The method for multi-node parallel processing of massive archive files with any directory structure according to claim 1, wherein the stop condition is that the task control node determines that no new task has arrived beyond a specified time and that all dispatched tasks have been completed; when the task control node determines that the stop condition is met, the task control node stops all working threads.
5. The method for multi-node parallel processing of massive archive files with any directory structure according to claim 1, wherein any execution node performs the task through the following steps:
pulling a task from a task issuing message queue;
judging whether a task is pulled;
if the execution node does not pull a task, the execution node attempts to pull a task again;
if the execution node pulls a task, judging whether subtasks exist under the task path;
if subtasks exist under the task path, sending each subtask to the task issuing message queue as a new task;
if no subtask exists under the task path, processing the task directly;
and after the task is completed, feeding task completion information back to the task feedback message queue.
6. A computer device, comprising: a memory for storing executable program code; and one or more processors configured to read the executable program code stored in the memory to perform the method for multi-node parallel processing of massive archive files with any directory structure according to any one of claims 1 to 5.
7. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method for multi-node parallel processing of massive archive files with any directory structure according to any one of claims 1 to 5.
CN202111031407.7A 2021-09-03 2021-09-03 Method for multi-node parallel processing of massive archive files with any directory structure Pending CN113791919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111031407.7A CN113791919A (en) 2021-09-03 2021-09-03 Method for multi-node parallel processing of massive archive files with any directory structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111031407.7A CN113791919A (en) 2021-09-03 2021-09-03 Method for multi-node parallel processing of massive archive files with any directory structure

Publications (1)

Publication Number Publication Date
CN113791919A true CN113791919A (en) 2021-12-14

Family

ID=79182602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111031407.7A Pending CN113791919A (en) 2021-09-03 2021-09-03 Method for multi-node parallel processing of massive archive files with any directory structure

Country Status (1)

Country Link
CN (1) CN113791919A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416346A (en) * 2021-12-23 2022-04-29 广州市玄武无线科技股份有限公司 Multi-node task scheduling method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US8935702B2 (en) Resource optimization for parallel data integration
Gautam et al. A survey on job scheduling algorithms in big data processing
US8387066B1 (en) Dependency-based task management using set of preconditions to generate scheduling data structure in storage area network
Singh et al. Optimizing grid-based workflow execution
US20110314233A1 (en) Multi-core query processing using asynchronous buffers
EP4160405A1 (en) Task execution method and storage device
JP2012221273A (en) Method, system and program for dynamically assigning resource
CN105786603B (en) Distributed high-concurrency service processing system and method
WO2020125396A1 (en) Processing method and device for shared data and server
Viswanathan et al. Query and resource optimization: Bridging the gap
Tang et al. Dynamic memory-aware scheduling in spark computing environment
WO2021031583A1 (en) Method and apparatus for executing statements, server and storage medium
Gautam et al. Empirical study of job scheduling algorithms in hadoop MapReduce
WO2023124543A1 (en) Data processing method and data processing apparatus for big data
Carretero et al. Mapping and scheduling HPC applications for optimizing I/O
CN113791919A (en) Method for multi-node parallel processing of massive archive files with any directory structure
US11675515B2 (en) Intelligent partitioning engine for cluster computing
Grandl et al. Whiz:{Data-Driven} Analytics Execution
Shang et al. A strategy for scheduling reduce task based on intermediate data locality of the MapReduce
Shi et al. Performance models of data parallel DAG workflows for large scale data analytics
CN116974994A (en) High-efficiency file collaboration system based on clusters
CN111767126A (en) System and method for distributed batch processing
Liu et al. Run-time dynamic resource adjustment for mitigating skew in mapreduce
Heyman et al. A scalable parallel algorithm for reachability analysis of very large circuits
Singh et al. IMSM: An interval migration based approach for skew mitigation in mapreduce

Legal Events

Date Code Title Description
PB01 Publication