CN110287159B

CN110287159B - File processing method and device

Info

Publication number: CN110287159B
Application number: CN201910476837.6A
Authority: CN
Inventors: 梁亚辉; 卢鑫悦; 李建平
Original assignee: Beijing Yilan Qunzhi Data Technology Co ltd
Current assignee: Beijing Yilan Qunzhi Data Technology Co ltd
Priority date: 2019-06-03
Filing date: 2019-06-03
Publication date: 2021-11-12
Anticipated expiration: 2039-06-03
Also published as: CN110287159A

Abstract

The application discloses a file processing method and a file processing device. The method comprises: monitoring whether a newly added file exists in an FTP directory and, if so, determining whether an idle processor resource exists; if an idle processor resource exists, taking the newly added file as a file to be processed, moving it into a folder corresponding to the processor resource, and storing the file name of the file to be processed, the number of the processor resource and the processing state of the file to be processed in a database; and creating a subtask that processes the file to be processed using the processor resource, and pushing the processing result to a Kafka server.

Description

File processing method and device

Technical Field

The application relates to a file processing method and device.

Background

Nodejs is a single-threaded server technology: it can process only one file at a time, so its processing efficiency is low. If clients upload a large volume of data concurrently, the server cannot keep up, processing tasks back up, and problems such as the server's disk filling up and server failure follow.

In some scenarios, information is exchanged with third-party systems in the form of files. The server therefore needs strong concurrent file processing capability, but because Nodejs is single-threaded, it has no parallel processing capability.

Disclosure of Invention

In order to solve the above technical problem, embodiments of the present application provide a file processing method and apparatus.

The file processing method provided by the embodiment of the application comprises the following steps:

monitoring whether a newly added file exists in the FTP directory, and if so, judging whether idle processor resources exist;

if an idle processor resource exists, taking the newly added file as a file to be processed, moving the file to be processed into a folder corresponding to the processor resource, and storing the file name of the file to be processed, the number of the processor resource and the processing state of the file to be processed in a database;

and creating a subtask to process the file to be processed by using the processor resource, and pushing a processing result to the Kafka server.

In an embodiment, monitoring whether a newly added file exists in the FTP directory and, if so, determining whether an idle processor resource exists includes:

the Watch module polls the FTP directory at regular intervals to check whether a newly added file exists in the FTP directory, and if so, the Watch module passes the file name of the newly added file to the scheduling module;

and the scheduling module, after receiving the file name of the newly added file, determines whether an idle processor resource exists.
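As an illustrative sketch only, and not the applicant's code, the polling described above could be written as follows. The names findNewFiles, startWatch, listDir and onNewFile are hypothetical; the 100 ms default is the interval mentioned later in the detailed description. In practice listDir might wrap a directory listing call such as fs.readdirSync on the FTP directory.

```javascript
// Hypothetical helper: returns the names in `current` that have not been seen before.
function findNewFiles(seen, current) {
  const fresh = current.filter((name) => !seen.has(name));
  for (const name of fresh) seen.add(name); // remember, so each file is reported once
  return fresh;
}

// Hypothetical Watch module: calls `listDir()` every `intervalMs` milliseconds and
// hands each newly seen file name to `onNewFile` (the scheduling module's entry point).
function startWatch(listDir, onNewFile, intervalMs = 100) {
  const seen = new Set();
  const timer = setInterval(() => {
    for (const name of findNewFiles(seen, listDir())) onNewFile(name);
  }, intervalMs);
  return () => clearInterval(timer); // call the returned function to stop polling
}
```

Keeping the comparison in a separate pure function makes the "is this file new?" decision easy to check without touching a real directory.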

In an embodiment, taking the newly added file as the file to be processed and moving it into the folder corresponding to the processor resource includes:

the scheduling module judges whether a folder corresponding to the processor resource exists or not; if not, creating a folder corresponding to the processor resource, and storing the file to be processed into the folder; if yes, directly storing the file to be processed into the folder;

and the folder name of the folder corresponding to the processor resource comprises the number of the processor resource.
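A minimal sketch of this folder handling, under stated assumptions, is given below. The function names and the "cpu-N" naming pattern are hypothetical; the description only requires that the folder name include the number of the processor resource. The sketch also assumes the FTP directory and the processor folders are on the same file system, so that a rename is sufficient to move the file.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical naming scheme: the folder name embeds the processor number.
function folderForProcessor(baseDir, cpuNumber) {
  return path.join(baseDir, `cpu-${cpuNumber}`);
}

// Moves `fileName` from `ftpDir` into the processor's folder,
// creating that folder first if it does not exist yet.
function moveToProcessorFolder(ftpDir, baseDir, fileName, cpuNumber) {
  const folder = folderForProcessor(baseDir, cpuNumber);
  if (!fs.existsSync(folder)) {
    fs.mkdirSync(folder, { recursive: true });
  }
  const target = path.join(folder, fileName);
  fs.renameSync(path.join(ftpDir, fileName), target); // same-file-system move
  return target;
}
```

Because only the scheduling module performs this move, each file ends up in exactly one processor's folder before any subtask starts.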

In an embodiment, creating a subtask to process the file to be processed using the processor resource includes:

the scheduling module starts the subtask through Fork and passes the file name of the file to be processed and the number of the processor resource to the subtask;

and after the subtask is started, it obtains the path of the file to be processed from the file name of the file to be processed and the number of the processor resource, updates the processing state of the file to be processed in the database to "processing", and then starts parsing the file to be processed.

In an embodiment, the method further comprises:

and after the subtask finishes processing successfully, updating the processing state of the file to be processed in the database to "finished", and moving the processed file into the processed folder.

The file processing apparatus provided in the embodiment of the present application includes:

the Watch module is used for monitoring whether the FTP directory has a newly added file;

the scheduling module is used for determining, if a newly added file exists, whether an idle processor resource exists; if an idle processor resource exists, taking the newly added file as a file to be processed, moving the file to be processed into a folder corresponding to the processor resource, and storing the file name of the file to be processed, the number of the processor resource and the processing state of the file to be processed in a database; and creating a subtask to process the file to be processed using the processor resource, and pushing the processing result to the Kafka server.

In one embodiment, the Watch module is configured to poll the FTP directory at regular intervals to check whether a newly added file exists in the FTP directory and, if so, to pass the file name of the newly added file to the scheduling module;

and the scheduling module is used for receiving the file name of the newly added file and then judging whether idle processor resources exist.

In an embodiment, the scheduling module is configured to determine whether a folder corresponding to the processor resource exists; if not, creating a folder corresponding to the processor resource, and storing the file to be processed into the folder; if yes, directly storing the file to be processed into the folder; and the folder name of the folder corresponding to the processor resource comprises the number of the processor resource.

In an embodiment, the scheduling module is configured to start a subtask through Fork and pass the file name of the file to be processed and the number of the processor resource to the subtask; after the subtask is started, it obtains the path of the file to be processed from the file name of the file to be processed and the number of the processor resource, updates the processing state of the file to be processed in the database to "processing", and then starts parsing the file to be processed.

In an embodiment, the scheduling module is configured to, after the subtask finishes processing successfully, update the processing state of the file to be processed in the database to "finished", and move the processed file into the processed folder.

The technical scheme of the embodiments of the application improves the processing capacity of Nodejs and makes full use of the server's hardware resources: the Cluster library of Nodejs, together with the added scheduling module, gives Nodejs parallel processing capability, so that server resources are fully used and processing efficiency is improved. It also avoids problems such as accumulation of files to be processed, insufficient disk space and service interruption caused by insufficient processing capacity of the server.

Drawings

Fig. 1 is a schematic flowchart of a file processing method according to an embodiment of the present application;

Fig. 2 is a Nodejs-based file processing architecture diagram according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a file processing apparatus according to an embodiment of the present application.

Detailed Description

In order to facilitate understanding of the technical solutions of the embodiments of the present application, the following description is made of related art of the embodiments of the present application.

In the related art, a data collection provider pushes collected data to a data exchange server in a zip File form through a File Transfer Protocol (FTP); and analyzing and decrypting the zip file uploaded to the data exchange server by using a Nodejs program at the data exchange server, and pushing a data result after processing and analyzing to Kafka.

The above related art has the following problems:

1) Nodejs is a single-threaded program. Although its execution speed is much faster than that of Java and its development efficiency is high, its lack of parallel processing capability limits the means available for improving data processing capacity; once the data volume surges, the server runs out of disk space and crashes because its processing capacity is insufficient.

2) Because of the single-threaded nature, the multi-core CPU resources of the server cannot be fully used, and resources are wasted.

3) In a scheme that uses only Cluster, all processes execute in parallel, which can cause resource preemption, and errors occur when tasks process the same file at the same time.

4) If a lock mechanism is used to divide the work, the lock blocks tasks and performance degrades.

To solve the above problems, the following technical solutions of the embodiments of the present application are proposed.

So that the features and elements of the present embodiments can be understood in detail, the embodiments are described more particularly below with reference to the appended drawings.

Fig. 1 is a schematic flow chart of a file processing method provided in an embodiment of the present application, and as shown in fig. 1, the file processing method includes the following steps:

Step 101: monitoring whether a newly added file exists in the FTP directory, and if so, determining whether an idle processor resource exists.

In this embodiment of the present application, the file may be in JSON format.

In the embodiment of the present application, monitoring whether a newly added file exists in the FTP directory and, if so, determining whether an idle processor resource exists can be implemented in the following manner:

1) the Watch module polls the FTP directory at regular intervals to check whether a newly added file exists in the FTP directory, and if so, the Watch module passes the file name of the newly added file to the scheduling module;

2) the scheduling module, after receiving the file name of the newly added file, determines whether an idle processor resource exists.

Here, processor resources include, but are not limited to, CPU resources.

Here, the function of the scheduling module is implemented by Nodejs; for example, the scheduling module is the Nodejs Master.

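Step 102: if an idle processor resource exists, taking the newly added file as a file to be processed, moving it into a folder corresponding to the processor resource, and storing the file name of the file to be processed, the number of the processor resource and the processing state of the file to be processed in a database.

In the embodiment of the present application, taking the newly added file as the file to be processed and moving it into the folder corresponding to the processor resource may be implemented in the following manner: the scheduling module determines whether a folder corresponding to the processor resource exists; if not, it creates the folder and stores the file to be processed in it; if so, it stores the file to be processed in the folder directly. The folder name of the folder corresponding to the processor resource includes the number of the processor resource.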

Step 103: creating a subtask to process the file to be processed using the processor resource, and pushing the processing result to the Kafka server.

In the embodiment of the present application, creating a subtask to process the file to be processed using the processor resource may be implemented in the following manner:

1) the scheduling module starts the subtask through Fork and passes the file name of the file to be processed and the number of the processor resource to the subtask;

2) after the subtask is started, it obtains the path of the file to be processed from the file name of the file to be processed and the number of the processor resource, updates the processing state of the file to be processed in the database to "processing", and then starts parsing the file to be processed.
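The two steps above can be sketched as follows, purely as an illustration and not as the applicant's code. The sketch assumes the file name and processor number are passed as environment variables through the env argument of cluster.fork in the Nodejs Cluster module; the variable names TASK_FILE and TASK_CPU, the function names, and the "cpu-N" folder pattern are all hypothetical.

```javascript
const cluster = require("cluster");
const path = require("path");

// Hypothetical environment variable names carrying the task data.
function buildTaskEnv(fileName, cpuNumber) {
  return { TASK_FILE: fileName, TASK_CPU: String(cpuNumber) };
}

// Scheduling module side: fork a subtask and hand it the file name and
// processor number. The returned worker exposes the subtask's process id.
function startSubtask(fileName, cpuNumber) {
  return cluster.fork(buildTaskEnv(fileName, cpuNumber));
}

// Subtask side: rebuild the path of the file to be processed from the
// values it was given (in a worker, `env` would be process.env).
function resolveTaskPath(baseDir, env) {
  return path.join(baseDir, `cpu-${env.TASK_CPU}`, env.TASK_FILE);
}
```

Since the subtask derives its path only from the values it receives, it never needs to scan a shared directory, which is what removes the need for a lock.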

In the embodiment of the application, after the subtask finishes processing successfully, the processing state of the file to be processed in the database is updated to "finished", and the processed file is moved into the processed folder.

In order to more thoroughly understand the technical solutions of the embodiments of the present application, the following detailed description is given to the technical solutions of the embodiments of the present application with reference to specific examples.

Fig. 2 is a Nodejs-based file processing architecture diagram according to an embodiment of the present application. As shown in Fig. 2, the data exchange server includes an FTP directory, and the client can push collected data to the FTP directory of the data exchange server through FTP. A scheduling module, namely a Nodejs Master module (hereinafter the Master module), is established in the data exchange server. The Master module uses a Watch module (a module that polls a folder at regular intervals) to monitor whether a newly added file exists in the FTP directory. If a newly added file exists, the Master module determines whether idle CPU resources exist; if they do, it moves the newly added file into the folder named with the number of the corresponding CPU, stores the name of the file to be processed, the CPU number and the processing state in a database (Redis), and creates a subprocess to process the file; the processing result is pushed to a Kafka server (or a Kafka cluster). This scheme effectively prevents the same data file from being requested by multiple tasks, avoids the use of locks, and improves parallel processing capability.
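The bookkeeping kept in the database can be pictured with the following sketch, which is an assumption-laden illustration rather than the applicant's code. A plain Map stands in for Redis so that the example is self-contained; the function names and the state names "pending", "processing" and "finished" are hypothetical labels for the states the description mentions.

```javascript
// Allowed state transitions; the state names are illustrative only.
const NEXT_STATE = { pending: "processing", processing: "finished" };

// `store` is a Map standing in for Redis, keyed by file name.
function recordTask(store, fileName, cpuNumber) {
  store.set(fileName, { cpuNumber, state: "pending" });
}

// Moves a task to its next state and returns the new state.
function advanceTask(store, fileName) {
  const task = store.get(fileName);
  if (!task) throw new Error(`unknown task: ${fileName}`);
  const next = NEXT_STATE[task.state];
  if (!next) throw new Error(`task already ${task.state}: ${fileName}`);
  task.state = next;
  return next;
}
```

Recording the processor number together with the state lets the Master module tell, after a subtask exits, which file that processor was working on and whether it was completed.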

Referring to fig. 2, the specific flow of the architecture shown in fig. 2 is as follows:

1) The Master module (i.e. the scheduling module) is started through the PM2 tool. Because each server has only one Master module, if the Master module terminates abnormally the entire service stops, and the server's disk is exhausted by accumulating data files, interrupting service. PM2 monitors Nodejs programs and restarts them automatically and immediately, which makes the Master module highly available.

2) The Master module, through the Watch module, polls the FTP directory at a fixed interval (currently set to 100 ms) to check whether a file needs to be processed; if so, the file name is passed to the Master module.

3) After the Master module receives the name of the file to be processed, it determines whether the CPU has idle resources. If not, it sleeps for 100 ms while listening for subtask exit events, and after the sleep ends returns to step 2). If the CPU has idle resources, it proceeds to step 4).

4) The Master module stores the name of the file to be processed and the CPU number in Redis, and marks the task as pending.

5) The Master module determines whether a folder named with the CPU number exists. If not, it creates the folder and stores the file to be processed in it; if so, it stores the file to be processed in the folder directly.

6) The Master module starts a subtask through Fork, creating a subprocess (using the Cluster module of Nodejs), passes the file name and the CPU number to the subtask, and stores the pid of the subtask in Redis under the corresponding CPU number.

7) After the subtask starts, it obtains the file name and the CPU number, derives the path of the file to be processed from them, updates the state of the file processing task in Redis to "processing", and then starts parsing the file.

8) If the file is parsed successfully, the parsed result is pushed to the Kafka server; if parsing fails, the subtask exits with exit code 999.

9) When the subtask finishes processing successfully, it updates the processing state of the file in Redis to "finished" and exits with exit code 0.

10) The Master module learns that the subtask has ended through the exit event and reads the file processing state in Redis. If the state is "finished", it moves the processed file into the processed folder; if not, the subtask terminated abnormally, so the subtask must be restarted and the file reprocessed.

11) If the subtask exits with exit code 0, step 2) is executed again.
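The Master module's handling of a subtask exit can be sketched as a small decision function. This is an illustration under stated assumptions, not the applicant's code: the function name and the returned action names are hypothetical, and treating a parse failure as something to report rather than retry is an assumption, since the flow above does not say what the Master module does with exit code 999.

```javascript
// Returns the action the Master module should take after a subtask exits.
// `stateInStore` is the processing state read back from the database.
// `parseFailedCode` defaults to the exit code named in the description.
function decideOnExit(exitCode, stateInStore, parseFailedCode = 999) {
  if (stateInStore === "finished") return "move-to-processed";
  if (exitCode === parseFailedCode) return "report-parse-failure"; // assumption
  return "restart-subtask"; // abnormal termination: reprocess the file
}
```

The failure code is a parameter because, on POSIX systems, an exit status is generally truncated to 8 bits, so a parent process may observe 231 (999 modulo 256) rather than 999; an implementation may need to compare against the truncated value.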

The technical scheme of the embodiments of the application provides a task scheduling scheme that optimizes the parallel computing efficiency of Nodejs programs on a multi-core CPU. Distributing tasks through a single task scheduling unit (namely the Master module) avoids conflicts among parallel tasks and improves the concurrent processing efficiency of Nodejs.
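Why a single scheduling unit avoids conflicts can be illustrated with the following sketch, which is not the applicant's code. The class name CpuPool and its methods are hypothetical, and sizing the pool from os.cpus() is an assumption; the description only speaks of checking for idle processor resources.

```javascript
const os = require("os");

// Hypothetical allocator owned by the single scheduling unit.
class CpuPool {
  constructor(size = os.cpus().length) {
    this.size = size;
    this.busy = new Map(); // processor number -> file name being processed
  }

  // Returns an idle processor number for `fileName`, or null if none is idle.
  acquire(fileName) {
    for (let n = 0; n < this.size; n++) {
      if (!this.busy.has(n)) {
        this.busy.set(n, fileName);
        return n;
      }
    }
    return null;
  }

  // Marks a processor as idle again, e.g. when its subtask exits.
  release(cpuNumber) {
    this.busy.delete(cpuNumber);
  }
}
```

Because every assignment goes through one allocator running in one process, a file is bound to at most one processor number at a time, and no lock is needed between subtasks.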

The above scheme of the embodiment of the present application can be implemented by the following code. It should be noted that the following code is pseudocode; without departing from the spirit of the technical scheme of the embodiment of the present application, the same functions implemented by code in any form fall within the protection scope of the embodiment of the present application:

Fig. 3 is a schematic structural diagram of a file processing apparatus according to an embodiment of the present application. As shown in Fig. 3, the file processing apparatus includes:

the Watch module 301 is configured to monitor whether an FTP directory has a newly added file;

a scheduling module 302, configured to determine, if a newly added file exists, whether an idle processor resource exists; if an idle processor resource exists, to take the newly added file as a file to be processed, move the file to be processed into a folder corresponding to the processor resource, and store the file name of the file to be processed, the number of the processor resource and the processing state of the file to be processed in a database; and to create a subtask to process the file to be processed using the processor resource, and push the processing result to the Kafka server.

In an embodiment, the Watch module 301 is configured to poll the FTP directory at regular intervals to check whether a newly added file exists in the FTP directory and, if so, to pass the file name of the newly added file to the scheduling module 302;

the scheduling module 302 is configured to determine, after receiving the file name of the newly added file, whether an idle processor resource exists.
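One possible way to connect the two modules, shown only as a sketch and not as the applicant's design, is an event emitter: the Watch module emits an event per new file and the scheduling module subscribes to it. The class names, the "newFile" event name, and the accepted and deferred lists are hypothetical; the idle check is injected as a function so the example stays self-contained.

```javascript
const { EventEmitter } = require("events");

// Hypothetical Watch module: emits one "newFile" event per newly found file.
class WatchModule extends EventEmitter {
  report(fileName) {
    this.emit("newFile", fileName);
  }
}

// Hypothetical scheduling module: decides per file whether it can be scheduled now.
class SchedulingModule {
  constructor(hasIdleProcessor) {
    this.hasIdleProcessor = hasIdleProcessor; // injected idle-resource check
    this.accepted = []; // files that can be scheduled immediately
    this.deferred = []; // files that must wait for an idle processor
  }

  attach(watch) {
    watch.on("newFile", (name) => this.onNewFile(name));
  }

  onNewFile(name) {
    (this.hasIdleProcessor() ? this.accepted : this.deferred).push(name);
  }
}
```

Keeping the Watch module unaware of processor resources matches the division of the apparatus into the two modules 301 and 302.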

In an embodiment, the scheduling module 302 is configured to determine whether a folder corresponding to the processor resource exists; if not, creating a folder corresponding to the processor resource, and storing the file to be processed into the folder; if yes, directly storing the file to be processed into the folder; and the folder name of the folder corresponding to the processor resource comprises the number of the processor resource.

In an embodiment, the scheduling module 302 is configured to start a subtask through Fork and pass the file name of the file to be processed and the number of the processor resource to the subtask; after the subtask is started, it obtains the path of the file to be processed from the file name of the file to be processed and the number of the processor resource, updates the processing state of the file to be processed in the database to "processing", and then starts parsing the file to be processed.

In an embodiment, the scheduling module 302 is configured to, after the subtask finishes processing successfully, update the processing state of the file to be processed in the database to "finished", and move the processed file into the processed folder.

Those skilled in the art will appreciate that the functions implemented by the modules in the file processing apparatus shown in Fig. 3 can be understood by referring to the related description of the file processing method described above. The functions of the respective modules in the file processing apparatus shown in Fig. 3 may be implemented by a program running on a processor, or may be implemented by specific logic circuits.

The technical solutions described in the embodiments of the present application can be arbitrarily combined without conflict.

In the several embodiments provided in the present application, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all functional units in the embodiments of the present application may be integrated into one second processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims

1. A method of file processing, the method comprising:

if the processor resources are idle, the scheduling module judges whether a folder corresponding to the processor resources exists; if not, creating a folder corresponding to the processor resource, and storing the file to be processed into the folder; if yes, directly storing the file to be processed into the folder; the folder name of the folder corresponding to the processor resource comprises the number of the processor resource, and the file name of the file to be processed, the number of the processor resource and the processing state of the file to be processed are stored in a database;

the scheduling module starts a subtask through Fork and passes the file name of the file to be processed and the number of the processor resource to the subtask; and after the subtask is started, it obtains the path of the file to be processed from the file name of the file to be processed and the number of the processor resource, updates the processing state of the file to be processed in the database to "processing", then starts parsing the file to be processed, and pushes the processing result to the Kafka server.

2. The method of claim 1, wherein the monitoring whether there are new files in the FTP directory, and if there are new files, determining whether there are idle processor resources, comprises:

3. The method of claim 1, further comprising:

4. A file processing apparatus, characterized in that the apparatus comprises:

the scheduling module is used for determining, if a newly added file exists, whether an idle processor resource exists; if an idle processor resource exists, determining whether a folder corresponding to the processor resource exists; if not, creating a folder corresponding to the processor resource, and storing the file to be processed into the folder; if yes, directly storing the file to be processed into the folder; the folder name of the folder corresponding to the processor resource comprises the number of the processor resource, and the file name of the file to be processed, the number of the processor resource and the processing state of the file to be processed are stored in a database; starting a subtask through Fork and passing the file name of the file to be processed and the number of the processor resource to the subtask; and after the subtask is started, obtaining the path of the file to be processed from the file name of the file to be processed and the number of the processor resource, updating the processing state of the file to be processed in the database to "processing", then starting to parse the file to be processed, and pushing the processing result to the Kafka server.

5. The apparatus of claim 4, wherein,

the Watch module is used for polling the FTP directory at regular intervals to check whether a newly added file exists in the FTP directory, and if so, passing the file name of the newly added file to the scheduling module;

6. The apparatus according to claim 4, wherein the scheduling module is configured to, after the successful processing of the subtask is completed, update the processing status of the file to be processed in the database to be completed, and move the processed file into a processed folder.