CN109376137B - File processing method and device

Info

Publication number: CN109376137B
Application number: CN201811541562.1A
Authority: CN
Inventors: 张铮; 潘传幸; 邬江兴; 王晓梅; 王俊超; 谢光伟; 王立群; 李卫超; 刘镇武
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2018-12-17
Filing date: 2018-12-17
Publication date: 2021-03-23
Anticipated expiration: 2038-12-17
Also published as: CN109376137A

Abstract

The file processing method and the file processing device classify each file of the task according to the dependency relationship among the files of the task, classify each file with the dependency relationship in the task into the same class, distribute the files on the basis, and specifically distribute each file classified into the same class to the same node of the distributed cluster; therefore, based on the scheme of the application, the files with the dependency relationship in the task can be distributed to the same node of the cluster, and correspondingly, the dependency relationship does not exist among the files of different nodes in the cluster, so that convenience is brought to the processing of the files by the nodes, cross-node reference is not needed, the nodes in the distributed system can be effectively prevented from generating wrong results, and meanwhile, the effective utilization of computing resources of the distributed system is facilitated.

Description

File processing method and device

Technical Field

The invention belongs to the field of distributed computing, network communication and network security, and particularly relates to a file processing method and device.

Background

With the advent of the "Internet+" era, networks have not only profoundly affected people's lifestyles but have also placed serious demands on the computing power of servers. More and more enterprises deploy their back-end servers on distributed systems and rely on distributed computing to break through development bottlenecks.

For some real-time, large-scale batch computing tasks, fast and efficient file distribution is crucial to making full and efficient use of computing resources. In view of this, the present invention aims to realize efficient file distribution in a distributed environment, specifically by decomposing a large problem into smaller ones and applying ideas such as divide-and-conquer and pipelining to distribute files quickly and efficiently.

The inventors have found that if the task files are simply divided and distributed to the distributed nodes according to the above idea, processing may become inconvenient and the processing results may even be wrong.

Disclosure of Invention

In view of the above, an object of the present invention is to provide a file processing method and apparatus, so as to perform file division and distribution on tasks according to dependency relationships among files of the tasks, thereby avoiding occurrence of erroneous results of a distributed system, and being more beneficial to effective utilization of computing resources of the distributed system.

Therefore, the invention discloses the following technical scheme:

a method of file processing, comprising:

analyzing the dependency relationship among files included in the task to be processed;

classifying the files included in the task to be processed according to the dependency relationship among the files to obtain a plurality of classification categories, and recording the path structure of each file of the task to be processed; wherein each classification category comprises a plurality of files of the task to be processed that have dependency relationships with one another;

distributing each file of the task to be processed to a plurality of nodes of a distributed cluster based on a preset file distribution strategy; wherein files belonging to the same classification category are distributed to the same node;

and acquiring the processing result of each node on the distributed files, and merging the processing result of each file based on the path structure of each file so as to restore the task structure of the task to be processed.

In the above method, preferably, if the task to be processed is a project source code to be processed, the classifying the files included in the task to be processed according to the dependency relationship among the files to obtain a plurality of classification categories, including:

and classifying the source code files according to the dependency relationship among the source code files of the engineering source code to obtain a plurality of classification type source code files, wherein each classification type correspondingly comprises a plurality of source code files with the dependency relationship.

Preferably, in the method, the distributing each file of the task to be processed to the plurality of nodes of the distributed cluster based on the predetermined file distribution policy includes:

acquiring the use condition of computing resources, the network load condition and the congestion condition of each node in the cluster;

performing task distribution on each source code file of the engineering source code based on the computing resource use condition, the network load condition and the congestion condition of each node in the cluster; wherein the respective source code files belonging to the same classification category are distributed to the same node.

Preferably, the obtaining of the processing result of each node on the distributed file and merging the processing results of each file based on the path structure of each file so as to restore the task structure of the task to be processed includes:

monitoring the processing condition of each node in the cluster based on multiple threads or multiple processes;

when monitoring a compiling result of a certain node for the distributed source code files of the corresponding category, receiving the compiling result and acquiring a path structure of each source code file in the category;

and writing the compiling result of each source code file in the category into a corresponding position of a storage medium based on the path structure of each source code file in the category so as to restore the engineering structure of the engineering source code.

Preferably, the method for writing the compiling result of the source code file into the corresponding position of the storage medium includes:

and storing the compiling result of the source code file in a preset sharing directory.

A file processing apparatus comprising:

the analysis unit is used for analyzing the dependency relationship among the files included in the task to be processed;

the classification and information recording unit is used for classifying the files included in the task to be processed according to the dependency relationship among the files to obtain a plurality of classification categories and recording the path structure of each file of the task to be processed; wherein each classification category comprises at least one file of the task to be processed;

the distribution unit is used for distributing each file of the task to be processed to a plurality of nodes of the distributed cluster based on a preset file distribution strategy; wherein files belonging to the same classification category are distributed to the same node;

and the result processing unit is used for acquiring the processing result of each node on the distributed files and merging the processing result of each file based on the path structure of each file so as to restore the task structure of the task to be processed.

Preferably, in the apparatus, the task to be processed is an engineering source code to be processed;

the classification and information recording unit classifies the files included in the task to be processed according to the dependency relationship among the files to obtain a plurality of classification categories, and the classification and information recording unit specifically comprises the following steps:

and classifying the source code files according to the dependency relationship among the source code files of the engineering source code to obtain a plurality of classification type source code files, wherein each classification type correspondingly comprises at least one source code file.

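The above apparatus, preferably, the distribution unit is specifically configured to:

acquire the use condition of computing resources, the network load condition and the congestion condition of each node in the cluster;

perform task distribution on each source code file of the engineering source code based on the computing resource use condition, the network load condition and the congestion condition of each node in the cluster; wherein the respective source code files belonging to the same classification category are distributed to the same node.

The above apparatus, preferably, the result processing unit is specifically configured to:

monitor the processing condition of each node in the cluster based on multiple threads or multiple processes;

when a compiling result of a certain node for the distributed source code files of the corresponding category is monitored, receive the compiling result and acquire the path structure of each source code file in the category;

and write the compiling result of each source code file in the category into a corresponding position of a storage medium based on the path structure of each source code file in the category so as to restore the engineering structure of the engineering source code.

Preferably, in the apparatus, the writing, by the result processing unit, the compiling result of the source code file into a corresponding location of the storage medium includes:

storing the compiling result of the source code file in a preset sharing directory.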

According to the scheme, the file processing method and the file processing device provided by the application classify the files of the task according to the dependency relationship among the files of the task, classify the files with the dependency relationship in the task into the same class, distribute the files on the basis, and specifically distribute the files classified into the same class to the same node of the distributed cluster; therefore, based on the scheme of the application, the files with the dependency relationship in the task can be distributed to the same node of the cluster, and correspondingly, the dependency relationship does not exist among the files of different nodes in the cluster, so that convenience is brought to the processing of the files by the nodes, cross-node reference is not needed, the nodes in the distributed system can be effectively prevented from generating wrong results, and meanwhile, the effective utilization of computing resources of the distributed system is facilitated.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic flowchart of a file processing method according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a file distribution model in a distributed environment according to an embodiment of the present application;

FIG. 3 is a schematic workflow diagram of modules of a file distribution model in a distributed environment according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a file processing apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The inventors have found that dependency relationships often exist among some of the files included in the same task. If the files of a task are simply divided (for example, split by data volume into several file groups of uniform size) and then distributed, processing may become inconvenient and the processing results may even be wrong (for example, if a file distributed to one node needs to reference a variable defined in a file on another node, the processing result may be wrong because the cross-node reference fails). To solve this problem, the present application provides a file processing method and device suitable for file distribution in a distributed environment. The file processing method and device of the present application are described in detail below through specific embodiments.

Example one

Referring to FIG. 1, a flow diagram of a file processing method is shown, the method comprising the steps of:

Step 101, analyzing the dependency relationship among the files included in the task to be processed.

The task to be processed may be, but is not limited to, an engineering source code to be processed (e.g., to be compiled). The file processing method of the present application is described in detail below mainly by taking the engineering source code to be processed as the task to be processed.

In view of this, in this step, the dependency relationship between the source code files included in the engineering source code to be processed may be specifically analyzed for the engineering source code to be processed.

For example, if file A needs to reference a variable defined in file B, file B must be compiled before file A (where "before" means earlier in time), and file A depends on file B, so that a dependency relationship exists between files A and B. In this application, one file being dependent on another file means that the use (e.g., compilation or execution) of the one file is premised on the other file; if the other file is missing, the one file cannot be used.
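As a minimal sketch of the dependency analysis in step 101, assume C-style source files whose dependencies are expressed through local `#include "..."` directives. The application does not prescribe a source language or a parsing technique; the function name `analyze_dependencies`, the regular expression and the sample files are illustrative assumptions only.

```python
import re
from typing import Dict, Set

# Hypothetical pattern: matches local includes such as  #include "b.h"
INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"', re.MULTILINE)


def analyze_dependencies(sources: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each file name to the set of task files it depends on.

    `sources` maps a file name to its text. Only includes that name
    another file of the same task are kept as dependencies.
    """
    deps: Dict[str, Set[str]] = {name: set() for name in sources}
    for name, text in sources.items():
        for target in INCLUDE_RE.findall(text):
            if target in sources and target != name:
                deps[name].add(target)
    return deps


# Assumed sample task: a.c uses a variable declared in b.h; c.c is independent.
example = {
    "a.c": '#include "b.h"\nint main(void) { return counter; }\n',
    "b.h": "int counter;\n",
    "c.c": "int unrelated(void) { return 0; }\n",
}
dependencies = analyze_dependencies(example)
```

Here `dependencies["a.c"]` contains `"b.h"`, mirroring the A-depends-on-B case described above, while `c.c` has no dependencies.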

Step 102, classifying the files included in the task to be processed according to the dependency relationship among the files to obtain a plurality of classification categories, and recording the path structure of each file of the task to be processed; wherein each classification category comprises at least one file of the task to be processed.

After the dependency relationships among the source code files of the engineering source code have been analyzed, the source code files can be classified according to those relationships. Specifically, source code files that have dependency relationships with one another are placed in the same classification category, so that each classification category may comprise a plurality of mutually dependent source code files; correspondingly, source code files without dependency relationships are placed in different classification categories.
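One way to read this classification is as finding the connected components of the dependency graph, with each dependency treated as an undirected edge. The application does not name an algorithm; the union-find approach below, the function name `classify_files` and the sample input are assumptions made for illustration. The sketch also assumes that every dependency target appears as a key of the input mapping.

```python
from typing import Dict, List, Set


def classify_files(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group files so that any two files linked by a dependency share a category."""
    parent = {name: name for name in deps}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every file and each file it depends on.
    for name, targets in deps.items():
        for target in targets:
            parent[find(name)] = find(target)

    groups: Dict[str, List[str]] = {}
    for name in deps:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Assumed sample input: two dependent pairs and one independent file.
categories = classify_files({
    "a.c": {"b.h"},
    "b.h": set(),
    "c.c": {"d.h"},
    "d.h": set(),
    "e.c": set(),
})
```

With this input the result has three categories: `a.c` with `b.h`, `c.c` with `d.h`, and `e.c` alone, so a file without dependencies forms a category of its own.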

In this step, after classifying the source code files of the engineering source code according to the dependency relationship, the path structure of each source code file in the original engineering source code is also recorded at the same time, so as to provide a basis for subsequently recovering the engineering structure of the engineering source code after obtaining the processing result of each node on the source code file.
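The recorded path structure can be as simple as a mapping from each file to its location relative to the project root. The following is a sketch under that assumption; the function name `record_path_structure` and the sample paths are hypothetical, and the application does not specify the format in which the path structure is recorded.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def record_path_structure(project_root: str, files: List[str]) -> Dict[str, str]:
    """Map each file path to its path relative to the project root."""
    root = PurePosixPath(project_root)
    return {path: str(PurePosixPath(path).relative_to(root)) for path in files}


# Assumed sample project layout.
manifest = record_path_structure(
    "/work/project",
    ["/work/project/src/a.c", "/work/project/include/b.h"],
)
```

Keeping relative paths rather than absolute ones lets the results be written under a different root (for example a shared directory) while preserving the original layout.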

Step 103, distributing each file of the task to be processed to a plurality of nodes of the distributed cluster based on a preset file distribution strategy; wherein files belonging to the same classification category are distributed to the same node.

After each source code file of the engineering source code is divided into each category, each source code file of the engineering source code can be continuously distributed to a plurality of nodes of the distributed cluster according to the category to which the source code belongs, and each source code file of the same category is specifically distributed to the same node of the cluster, so that each source code file on the same node is ensured to have a complete dependency relationship in the node without any cross-node reference, the phenomenon that the processing result of the node is wrong is avoided, and meanwhile, the effective utilization of computing resources of a distributed system is facilitated.

In practical implementation, when file distribution is needed, the computing resource use condition, the network load condition and the congestion condition of each node in the cluster can be obtained in real time, and task distribution can be performed on each source code file of the engineering source code by combining the computing resource use condition, the network load condition and the congestion condition of each node in the cluster, so that the resource use condition, the network load condition and the congestion condition of each node are relatively balanced.
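A distribution strategy of this kind might be sketched as a greedy assignment that always sends the next category, as a whole, to the node whose combined load is currently lowest. The application does not define how the three conditions are measured or weighted; the `NodeStatus` fields, the equal weighting and the `cost_per_file` estimate below are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeStatus:
    name: str
    cpu_usage: float      # fraction of computing resources in use, 0.0-1.0
    network_load: float   # fraction of network capacity in use, 0.0-1.0
    congestion: float     # hypothetical congestion indicator, 0.0-1.0

    def load(self) -> float:
        # Equal weights are an arbitrary illustrative choice.
        return (self.cpu_usage + self.network_load + self.congestion) / 3


def distribute(categories: List[List[str]], nodes: List[NodeStatus],
               cost_per_file: float = 0.25) -> Dict[str, List[str]]:
    """Assign every category, as a whole, to the currently least-loaded node."""
    projected = {node.name: node.load() for node in nodes}
    assignment: Dict[str, List[str]] = {node.name: [] for node in nodes}
    # Place large categories first so they do not pile up on one node.
    for category in sorted(categories, key=len, reverse=True):
        target = min(projected, key=projected.get)
        assignment[target].extend(category)
        # Assumed load estimate: each file adds a fixed cost to the node.
        projected[target] += cost_per_file * len(category)
    return assignment


nodes = [NodeStatus("node1", 0.2, 0.1, 0.0), NodeStatus("node2", 0.6, 0.5, 0.4)]
plan = distribute([["a.c", "b.h"], ["c.c", "d.h"], ["e.c"]], nodes)
```

Because a category is never split, files with dependency relationships always land on the same node, whichever node the load estimate selects.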

Step 104, acquiring the processing result of each node on the distributed files, and merging the processing results of each file based on the path structure of each file so as to restore the task structure of the task to be processed.

After distributing each source code file of the engineering source code to each node of the distributed cluster according to the classification category to which the node belongs and by combining the computing resource use condition, the network load condition and the congestion condition of each node, each node of the distributed cluster performs required processing on each distributed source code file of the corresponding category, such as compiling, encrypting or code obfuscating the source code file.

Meanwhile, multiple threads or processes can be started to monitor the processing status of each node in the cluster. Each time a processing result of a node for its distributed category of source code files is detected (such as a compilation result, a ciphertext obtained by encryption, or obfuscated code obtained by code obfuscation), the processing result is received, and the path structure of each source code file in that category is obtained from the recorded path structure information. The processing result of each source code file in the category can then be written to the corresponding location of the storage medium in a predetermined manner (for example, the compilation result of a source code file can be stored in a predetermined shared directory) based on the path structure of that file, so as to restore the engineering structure of the engineering source code. Once the processing results of all source code files of the engineering source code have been received and written to the corresponding locations of the storage medium according to their path structures, the merging of the processing results and the restoration of the structure of the entire engineering source code are complete.
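The monitoring and merging described above can be sketched with a thread pool, where each worker thread stands in for waiting on one node and results are written out as soon as they arrive. This is only an illustration: `process_on_node` is a local placeholder for the remote processing, and `collect_and_merge`, the sample plan and the temporary output directory are assumptions, not part of the application.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, List


def process_on_node(node: str, files: List[str]) -> Dict[str, bytes]:
    """Stand-in for remote processing; a real node would compile, encrypt, etc."""
    return {name: f"processed {name} on {node}".encode() for name in files}


def collect_and_merge(plan: Dict[str, List[str]], manifest: Dict[str, str],
                      output_root: Path) -> List[str]:
    """Wait for each node's results and write them at their recorded paths."""
    written: List[str] = []
    with ThreadPoolExecutor(max_workers=len(plan)) as pool:
        futures = [pool.submit(process_on_node, node, files)
                   for node, files in plan.items()]
        for future in as_completed(futures):  # handle results as they arrive
            for name, payload in future.result().items():
                target = output_root / manifest[name]
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_bytes(payload)
                written.append(manifest[name])
    return sorted(written)


# Assumed distribution plan and recorded path structure.
plan = {"node1": ["a.c", "b.h"], "node2": ["c.c"]}
manifest = {"a.c": "src/a.c", "b.h": "include/b.h", "c.c": "src/util/c.c"}
with tempfile.TemporaryDirectory() as tmp:
    restored = collect_and_merge(plan, manifest, Path(tmp))
    layout = sorted(p.relative_to(tmp).as_posix()
                    for p in Path(tmp).rglob("*") if p.is_file())
```

After the run, the files found under the output directory match the recorded relative paths, which is the sense in which the engineering structure is restored.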

The following provides a specific application example of the file processing method of the present application. In this example, a file distribution model in a distributed environment is realized based on the above method. As shown in FIG. 2, the model includes a file splitting module 201, a file distribution module 202, a file receiving module 203, and a file storage module 204, where:

the file splitting module 201 first analyzes the dependency relationship among the files included in the task to be processed, for example, specifically analyzes the dependency relationship among the source code files of the engineering source code, then classifies the files according to the dependency relationship, and finally transmits the classification result to the file distributing module 202;

after receiving the result transmitted by the file splitting module 201, the file distributing module 202 records the path structure of each file, then selects a node for each classification category according to a certain policy (for example, performs file distribution based on the use condition of computing resources, the network load condition and the congestion condition of each node in the cluster), and distributes all files in each classification category to a corresponding node;

the file receiving module 203 starts multiple threads/processes, continuously monitors for incoming processing results, and calls the file storage module 204 whenever the arrival of a processing result is detected;

the file storage module 204 first requests the recorded path structure information of each file from the file distribution module 202 for the processing result of the file, and then writes the processing result (such as the compiling result or the encryption result) of the file into the corresponding position of the storage medium in a predetermined manner based on the path structure information of the file, so as to complete the structure restoration of the whole engineering source code.

For the workflow of each module of the model, refer to FIG. 3.
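The interaction between the file receiving module 203 and the file storage module 204 can be sketched as a small producer-consumer arrangement. In this sketch a `queue.Queue` stands in for the channel on which results arrive from the nodes, and an in-memory dictionary stands in for the storage medium; these stand-ins, the function names and the `None` sentinel are assumptions made for illustration.

```python
import queue
import threading
from typing import Dict, Optional, Tuple

# Channel on which (file name, processing result) pairs arrive; None ends a receiver.
results: "queue.Queue[Optional[Tuple[str, str]]]" = queue.Queue()
stored: Dict[str, str] = {}          # stand-in for the storage medium
stored_lock = threading.Lock()
path_records = {"a.c": "src/a.c", "b.h": "include/b.h"}  # kept by module 202


def file_storage_module(name: str, payload: str) -> None:
    """Module 204: look up the recorded path and store the result under it."""
    with stored_lock:
        stored[path_records[name]] = payload


def file_receiving_module() -> None:
    """Module 203: wait for results and hand each one to the storage module."""
    while True:
        item = results.get()
        if item is None:             # sentinel: no more results
            break
        file_storage_module(*item)


receivers = [threading.Thread(target=file_receiving_module) for _ in range(2)]
for thread in receivers:
    thread.start()
for name in path_records:            # stand-in for results arriving from nodes
    results.put((name, f"result of {name}"))
for _ in receivers:
    results.put(None)
for thread in receivers:
    thread.join()
```

Once all receiver threads have finished, `stored` holds each result under its recorded path, regardless of which receiver handled it.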

According to the scheme, the file processing method provided by the application carries out classification on each file of the task according to the dependency relationship among the files of the task, divides each file with the dependency relationship in the task into the same class, carries out file distribution on the basis, and particularly distributes each file divided into the same class to the same node of the distributed cluster; therefore, based on the scheme of the application, the files with the dependency relationship in the task can be distributed to the same node of the cluster, and correspondingly, the dependency relationship does not exist among the files of different nodes in the cluster, so that convenience is brought to the processing of the files by the nodes, cross-node reference is not needed, the nodes in the distributed system can be effectively prevented from generating wrong results, and meanwhile, the effective utilization of computing resources of the distributed system is facilitated.

Example two

Corresponding to the file processing method in the first embodiment, the second embodiment of the present application further provides a file processing apparatus, referring to the schematic structural diagram of the file processing apparatus shown in fig. 4, the apparatus includes an analyzing unit 401, a classifying and information recording unit 402, a distributing unit 403, and a result processing unit 404, where:

the analysis unit 401 is configured to analyze a dependency relationship between files included in the task to be processed.

In view of this, the analysis unit 401 may specifically analyze the dependency relationships among the source code files included in the engineering source code to be processed; the engineering source code may include a plurality of source code files, which form a project with corresponding functions through a certain organizational structure.

A classification and information recording unit 402, configured to classify files included in the task to be processed according to a dependency relationship among the files, obtain multiple classification categories, and record a path structure of each file of the task to be processed; wherein each classification category comprises at least one file of the task to be processed.

The task to be processed is an engineering source code to be processed; the classifying and information recording unit 402 classifies the files included in the task to be processed according to the dependency relationship among the files to obtain a plurality of classification categories, including: and classifying the source code files according to the dependency relationship among the source code files of the engineering source code to obtain a plurality of classification type source code files, wherein each classification type correspondingly comprises at least one source code file.

Specifically, after analyzing the dependency relationship among the source codes of the engineering source codes, the source code files of the engineering source codes can be continuously classified according to the dependency relationship among the source code files, and the source code files with the dependency relationship are specifically classified into the same classification category, that is, each classification category may include a plurality of source code files with dependency relationship, and correspondingly, source code files without dependency relationship are classified into different classification categories.

After the source code files of the engineering source codes are classified according to the dependency relationship, the path structure of each source code file in the original engineering source codes is also recorded at the same time, so that a basis is provided for recovering the engineering structure of the engineering source codes after the processing result of each node on the source code files is obtained subsequently.

A distributing unit 403, configured to distribute, based on a predetermined file distribution policy, each file of the task to be processed to multiple nodes of a distributed cluster; wherein files belonging to the same classification category are distributed to the same node.

Further, the distributing unit 403 is specifically configured to: acquire the computing resource usage, the network load condition and the congestion condition of each node in the cluster; and perform task distribution on each source code file of the engineering source code based on the computing resource usage, the network load condition and the congestion condition of each node in the cluster; wherein the source code files belonging to the same classification category are distributed to the same node.

Specifically, after dividing each source code file of the engineering source code into each category, each source code file of the engineering source code can be continuously distributed to a plurality of nodes of the distributed cluster according to the category to which the source code belongs, and each source code file of the same category is specifically distributed to the same node of the cluster, so that each source code file on the same node is ensured to have a complete dependency relationship in the node without any cross-node reference, the phenomenon that the processing result of the node is wrong is avoided, and meanwhile, the effective utilization of computing resources of the distributed system is facilitated.

A result processing unit 404, configured to obtain a processing result of each node on the distributed file, and merge the processing result of each file based on the path structure of each file, so as to restore the task structure of the to-be-processed task.

Further, the result processing unit 404 is specifically configured to: monitoring the processing condition of each node in the cluster based on multiple threads or multiple processes; when monitoring a compiling result of a certain node for the distributed source code files of the corresponding category, receiving the compiling result and acquiring a path structure of each source code file in the category; and writing the compiling result of each source code file in the category into a corresponding position of a storage medium based on the path structure of each source code file in the category so as to restore the engineering structure of the engineering source code.

Specifically, after distributing each source code file of the engineering source code to each node of the distributed cluster according to the classification category to which the node belongs and by combining the computing resource usage condition, the network load condition, and the congestion condition of each node, each node of the distributed cluster performs required processing on each distributed source code file of the corresponding category, such as compiling, encrypting, code obfuscating, and the like on the source code file.

According to the above scheme, the file processing device provided by the application classifies the files of a task according to the dependency relationships among them, places the files having dependency relationships in the same category, and distributes files on this basis, specifically distributing the files of the same category to the same node of the distributed cluster; therefore, based on the scheme of the application, the files with the dependency relationship in the task can be distributed to the same node of the cluster, and correspondingly, no dependency relationship exists among the files of different nodes in the cluster, so that the processing of the files by the nodes is made more convenient, cross-node reference is not needed, the nodes in the distributed system can be effectively prevented from generating wrong results, and meanwhile, the effective utilization of computing resources of the distributed system is facilitated.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

For convenience of description, the above system or apparatus is described as being divided by function into various modules or units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A file processing method, comprising:

acquiring the processing result of each node on the distributed files, and merging the processing results of each file based on the path structure of each file so as to restore the task structure of the task to be processed;

if the task to be processed is the engineering source code to be processed, classifying the files included in the task to be processed according to the dependency relationship among the files to obtain a plurality of classification categories, including:

classifying the source code files according to the dependency relationship among the source code files of the engineering source code to obtain a plurality of classification type source code files, wherein each classification type correspondingly comprises a plurality of source code files with the dependency relationship;

the dependency relationship among the files represents that the compiling or running of one file needs to be premised on another file, and if the another file is missing, the one file cannot be used.

2. The method of claim 1, wherein distributing the respective files of the pending task to the plurality of nodes of the distributed cluster based on a predetermined file distribution policy comprises:

3. The method according to claim 2, wherein the obtaining of the processing result of each node on the distributed files and merging the processing result of each file based on the path structure of each file, so as to restore the task structure of the task to be processed, comprises:

4. The method of claim 3, wherein writing the compiled result of the source code file to a corresponding location on the storage medium comprises:

5. A file processing apparatus, characterized by comprising:

the result processing unit is used for acquiring the processing result of each node on the distributed files and merging the processing result of each file based on the path structure of each file so as to restore the task structure of the task to be processed;

the task to be processed is an engineering source code to be processed;

classifying the source code files according to the dependency relationship among the source code files of the engineering source code to obtain a plurality of classification type source code files, wherein each classification type correspondingly comprises at least one source code file;

6. The apparatus according to claim 5, wherein the distribution unit is specifically configured to:

7. The apparatus of claim 6, wherein the result processing unit is specifically configured to:

8. The apparatus of claim 6, wherein the result processing unit writes the compilation result of the source code file into a corresponding location of the storage medium, and specifically comprises: