CN111625507A - File processing method and device

File processing method and device

Info

Publication number
CN111625507A
CN111625507A
Authority
CN
China
Prior art keywords
file
server
processed
processing
subtasks
Prior art date
Legal status
Pending
Application number
CN202010478332.6A
Other languages
Chinese (zh)
Inventor
符修亮
万磊
李毅
钱进
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202010478332.6A
Publication of CN111625507A
Status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10: File systems; File servers
    • G06F 16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F 16/18: File system types
    • G06F 16/182: Distributed file systems
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/445: Program loading or initiating
    • G06F 9/44505: Configuring for program initiating, e.g. using registry, configuration files
    • G06F 9/4451: User profiles; Roaming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a file processing method and a file processing apparatus, which are suitable for a distributed cluster comprising a plurality of servers. The method comprises the following steps: a first server determines, according to a file processing configuration, the subtasks corresponding to the fragments of a file to be processed; the first server sends each subtask to a third server and writes the number of subtasks into the third server; the third server receives the subtasks and broadcasts them to each second server; each second server processes its own subtasks in parallel according to the processing logic in the file processing configuration and sends the processing result to the third server; and when the third server determines that the number of received processing results equals the number of subtasks, the processing of the file to be processed is finished. With this method, the fragments of the file to be processed can be processed simultaneously and separately by multiple threads, and setting the file processing configuration reduces the time cost and the labor cost.

Description

File processing method and device
Technical Field
The invention relates to the technical field of big data, in particular to a file processing method and device.
Background
With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech); however, the financial industry's requirements for security and real-time performance also place higher demands on these technologies. The application of computer technology not only greatly increases the speed and accuracy of business processing, but also frees up a large amount of manpower and material resources. Nevertheless, as the industry develops, the existing technology cannot meet current business processing requirements.
In the prior art, a single server processes the business: a worker writes the corresponding processing logic, and the server runs that logic to complete the processing. For example, a client's financial information file is loaded locally on the server, and the server runs the processing logic to process the file locally. Compared with traditional manual processing, this approach is faster and more accurate. However, with rapid social and economic development, financial awareness has grown, and the people who pay attention to wealth management are no longer limited to a fixed group such as young people or people in particular occupations or regions; people of almost all ages, industries and regions now have some knowledge of wealth-management products. The base of wealth-management customers has therefore grown, and sales of wealth-management products have increased accordingly. As a result, compiling statistics on clients' financial information has become a heavy workload, and the processing approach in the prior art cannot keep up with the current volume of business.
Therefore, there is a need for a file processing method and apparatus that can increase the processing speed of files and reduce the time cost and the labor cost.
Disclosure of Invention
The embodiment of the invention provides a file processing method and device, which can accelerate the processing speed of files and reduce the time cost and the labor cost.
In a first aspect, an embodiment of the present invention provides a file processing method, which is applicable to a distributed cluster including multiple servers; the method comprises the following steps: the first server determines sub-tasks corresponding to the fragments of the file to be processed according to the file processing configuration; the first server sends each subtask to a third server, and writes the number of the subtasks into the third server; the third server receives the subtasks and broadcasts the subtasks to the second servers; the second servers process their respective subtasks in parallel according to the processing logic in the file processing configuration and send the processing results to the third server; and when the third server determines that the number of the received processing results is equal to the number of the subtasks, the processing of the file to be processed is completed.
By adopting the method, the first server determines each subtask corresponding to each fragment of the file to be processed according to the file processing configuration, and further sends each subtask to the third server. Therefore, the third server manages each subtask, the subsequent second server can conveniently acquire and process each subtask, and the third server is an atomic server, so that the task can be finished after all the subtasks of the file to be processed are processed, and the integrity and the accuracy of the processing result of the file to be processed are ensured. Further, the third server broadcasts to enable the second server to sequentially acquire the subtasks, so that the second server can obtain the corresponding fragments to be processed according to each subtask and process the fragments to be processed. Therefore, compared with the prior art that one server processes the file to be processed; the file processing method and the file processing system have the advantages that multithreading is achieved through the plurality of second servers, the files to be processed are processed simultaneously and respectively, processing speed can be increased, and processing pressure of the second servers is reduced. In addition, by setting the file processing configuration, time cost and labor cost can also be reduced.
In one possible design, the determining, by the first server, the subtasks corresponding to the segments of the file to be processed according to the file processing configuration includes: the method comprises the steps that a first server obtains a file to be processed according to a file path in file processing configuration; the first server determines the positions of all the fragments of the file to be processed according to the fragment rule in the file processing configuration; and the first server determines each subtask according to each fragment position.
By adopting the method, the first server acquires the file to be processed according to the file path in the file processing configuration, and further determines the positions of the fragments according to the fragment rule in the file processing configuration. Finally, the first server may determine each subtask according to each slice position. In this way, the part of each fragment in the file to be processed can be determined by the position of each fragment in each subtask. Further, after the second server acquires each subtask, the second server may acquire a corresponding fragment to be processed according to each subtask, and process the fragment to be processed. Therefore, the files to be processed are processed simultaneously and respectively by a plurality of second servers and multiple threads, the processing speed can be increased, and the processing pressure of the second servers is reduced. In addition, by setting the file processing configuration, time cost and labor cost can also be reduced.
In a possible design, the determining, by the first server, the positions of the fragments of the file to be processed according to the fragment rule in the file processing configuration includes: the first server determines a file body of the file to be processed; the first server determines the slicing positions of the file body according to the slicing threshold value in the slicing rule; for each slicing position, the following method is adopted to determine: counting from the starting position of the fragment, and determining whether the current character is a line feed character at the position meeting the fragment threshold; if not, continuing to count until the position of the first line break is taken as the end position of the slicing.
By adopting the method, after the first server determines the file body of the file to be processed, the fragment positions of the file body are determined according to the fragment threshold in the fragment rule. Therefore, the file body which does not comprise the file head and the file tail is obtained, the file body is segmented, the content of the segmented file can be ensured to be the content needing to be processed, and the processing accuracy is improved. And because the fragmentation threshold can be determined by the size of the file to be processed, the size of the file body, the processing capacity of the first server and the second server, and other factors. In this way, each server can perform multithreading processing on the files to be processed with the fastest efficiency. And if the ending position determined by counting from the slicing starting position according to the slicing threshold is not the line break, continuing counting until the first line break is determined, and taking the position of the line break as the ending position. Therefore, the integrity of the file information in the fragments to be processed can be ensured, and the accuracy of the subsequent server for processing the fragments to be processed is improved.
In one possible design, the determining, by the first server, each fragmentation position of the file body according to a fragmentation threshold in the fragmentation rule includes: the fragmentation threshold may be determined by the following formula:
P=MIN(G,MAX(P/(2*N),L))
wherein P is the fragmentation threshold; g is the maximum value of the fragmentation threshold value, L is the minimum value of the fragmentation threshold value, and G and L are determined by historical experience values and the performance of the server; and N is the number of processor cores of the server.
By adopting the method, when the file to be processed is fragmented, the file to be processed is fragmented according to the fragmentation threshold value. And the fragmentation threshold is the size of the best fragment as determined by historical experience and server performance. Therefore, the speed of the second server for processing the to-be-processed file fragments corresponding to the subtasks is increased, and time cost is saved.
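The following sketch computes the fragmentation threshold from the formula above. Because the formula as printed reuses the symbol P inside MAX, the inner term is read here as the size of the file body in bytes; that reading, and the example bounds, are assumptions made for illustration.
public final class SliceThreshold {
    // Computes the fragmentation threshold P = MIN(G, MAX(size / (2 * N), L)).
    // bodySizeBytes is read as the inner term of the formula (an assumption);
    // maxSliceBytes is G, minSliceBytes is L, and N is taken from the runtime.
    static long compute(long bodySizeBytes, long maxSliceBytes, long minSliceBytes) {
        int cores = Runtime.getRuntime().availableProcessors(); // N, the number of processor cores
        long candidate = Math.max(bodySizeBytes / (2L * cores), minSliceBytes);
        return Math.min(maxSliceBytes, candidate);
    }

    public static void main(String[] args) {
        // Example: a 3 000 000-byte file body, with slices bounded between 64 KiB and 8 MiB.
        System.out.println(compute(3_000_000L, 8L << 20, 64L << 10));
    }
}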
In one possible design, the determining, by the first server, a file body of the pending file includes: the first server determines the file header of the file to be processed according to the file header line number in the file processing configuration from the character of the first non-line feed character of the file to be processed; the first server determines the file tail of the file to be processed according to the file tail line number in the file processing configuration from the last character of the non-line feed character of the file to be processed; and the first server determines the part of the file to be processed except the file head and the file tail as the file body.
By adopting the method, the first server can accurately determine the file body according to the type of the file to be processed and the file head line number and file tail line number configured for each file type in the file processing configuration. This increases the accuracy of fragmenting the file to be processed and guarantees the accuracy of the content of each fragment. In addition, because the file tail is determined by counting backwards from the last non-line-feed character of the file to be processed, the first server does not need to traverse the whole file from head to tail, which speeds up the determination of the file body.
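A minimal sketch, under assumed parameter names, of locating the file body from the configured number of file head lines and file tail lines: the head lines are skipped forwards from the first non-line-feed byte, and the tail lines are counted backwards from the last non-line-feed byte.
final class FileBodyLocator {
    // Returns {bodyStartOffset, bodyEndOffsetExclusive} within the file content.
    static long[] locate(byte[] content, int headerLines, int tailLines) {
        // File head: start at the first non-line-feed byte and skip headerLines lines.
        int start = 0;
        while (start < content.length && content[start] == '\n') start++;
        int skipped = 0;
        while (start < content.length && skipped < headerLines) {
            if (content[start] == '\n') skipped++;
            start++;
        }
        // File tail: start at the last non-line-feed byte and count tailLines lines backwards.
        int end = content.length - 1;
        while (end >= 0 && content[end] == '\n') end--;
        int counted = 0;
        while (end >= 0 && counted < tailLines) {
            if (content[end] == '\n') counted++;
            if (counted < tailLines) end--;
        }
        return new long[] {start, end + 1}; // the body is everything between head and tail
    }
}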
In one possible design, the third server receiving the subtasks and broadcasting the subtasks to the second servers includes: the third server receives the subtasks and broadcasts the subtasks to the second servers; the second server receives the broadcast and then sequentially acquires subtasks from the third server, and acquires fragments to be processed from the files to be processed according to the fragment positions in the subtasks; the parallel processing of the respective subtasks by the second servers according to the processing logic in the file processing configuration and the sending of the processing result to the third server include: and the second servers process the respective fragments to be processed in parallel according to the processing logic in the file processing configuration, and send the processing results of the fragments to be processed to the third server.
By adopting the method, the third server broadcasts to the second servers after receiving the subtasks, so that the second servers sequentially acquire the subtasks from the third server after receiving the broadcast, and acquire the to-be-processed fragments from the to-be-processed files according to the positions of the fragments in the subtasks. Therefore, each second server cannot acquire the same subtasks, the efficiency of the second servers in processing the subtasks is improved, and the accuracy of the processing result of the file to be processed is ensured. And the second server processes respective fragments to be processed in parallel according to the processing logic in the file processing configuration. The required processing logic can be flexibly set in the file processing configuration, the processing result of the required file to be processed can be obtained according to the processing logic in the file processing configuration, and the second server processes all the subtasks in parallel to increase the processing speed of the file to be processed. The second server sends the processing result to the third server, so that the unified management of the processing result of each subtask is facilitated, the atomicity of the processing task is confirmed by the third server, and the completeness of the processing of the file to be processed is guaranteed.
In a possible design, the obtaining, by the second server, the to-be-processed fragment from the to-be-processed file according to the fragment position in the subtask includes: and the second server acquires the fragments to be processed from the files to be processed to the memory of the second server according to the fragment positions in the subtasks in a memory mapping mode.
By adopting the method, the second server obtains the fragments to be processed according to the fragment positions in the subtasks in a memory mapping mode, and stores and processes the fragments to be processed, so that occupied resources in the second server are reduced, the processing pressure of the second server is reduced, and the processing speed of the second server is accelerated. In addition, the second server can also sequentially acquire partial contents of the fragments to be processed according to the sequence, and process the partial contents to obtain a processing result. Therefore, the processing result is reserved, and part of the content corresponding to the result is deleted, so that the memory resource of the second server is always in the maximum available state, the resource occupation of the second server is reduced, the processing performance of the second server is improved, and the processing speed of the second server is further accelerated.
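A minimal sketch of loading a single fragment to be processed through memory mapping, as described above; the parameters (the file path plus the start and end offsets taken from the subtask) are illustrative.
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SliceReader {
    // Maps only the [start, end) region of the file into memory, so the whole
    // file never has to be copied into the second server's memory.
    static MappedByteBuffer mapSlice(Path filePath, long start, long end) throws IOException {
        try (FileChannel channel = FileChannel.open(filePath, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, start, end - start);
        }
    }
}
The mapping stays valid after the channel is closed, and only the mapped region of the fragment occupies the second server's address space.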
In one possible design, the file processing configuration is written on a preset template; the preset template is provided with function code for opening the file, reading the file and closing the file.
By adopting the method, the file processing configuration is written on the preset template, and the function code for opening the file, reading the file and closing the file is already set in the preset template. Workers therefore do not need to write this function code themselves, which reduces their working pressure and the time cost. In addition, given the complexity of this part of the function code, the method also eliminates the situation in which a file is damaged because a worker wrote the code for opening or reading the file incorrectly, and the situation in which a file to be processed permanently occupies server resources because a worker forgot to write, or wrote incorrectly, the code for closing the file.
In a second aspect, an embodiment of the present invention provides a file processing apparatus, which is suitable for a distributed cluster including a plurality of servers; the device comprises:
the processing module is used for determining each subtask corresponding to each fragment of the file to be processed according to the file processing configuration;
the receiving and sending module is used for sending each subtask to a third server and writing the number of each subtask into the third server;
the receiving and sending module is used for receiving each subtask and broadcasting each subtask to each second server;
the processing module is further configured to process respective subtasks in parallel according to processing logic in the file processing configuration, and send a processing result to the third server;
and the transceiver module is further configured to determine that the number of the received processing results is equal to the number of the sub-tasks, and then the processing of the file to be processed is completed.
In a third aspect, an embodiment of the present invention further provides a computing device, including: a memory for storing program instructions; a processor for calling program instructions stored in said memory to execute the method as described in the various possible designs of the first aspect according to the obtained program.
In a fourth aspect, embodiments of the present invention also provide a computer-readable non-transitory storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method as set forth in the various possible designs of the first aspect.
These and other implementations of the invention will be more readily understood from the following description of the embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic diagram of a file processing method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a file processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a file processing method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a file processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a system architecture for processing a file according to an embodiment of the present invention, where a first server 102 may be any one of second servers 104, and the first server 102 obtains a file to be processed from a local or file server 101. Here, the file to be processed may be obtained from the file server in a memory mapping manner, so that the resource occupation of the first server 102 may be reduced. The first server 102 fragments the file to be processed, determines the starting position and the ending position of each fragment to be processed, determines each subtask according to the position of each fragment, and sends each subtask to each second server 104, so that each second server 104 obtains the fragment to be processed according to each subtask for processing. Here, after the first server 102 determines each subtask, each subtask may also be sent to a third server, so that each second server 104 obtains each subtask from the third server, and further obtains the to-be-processed fragment for processing. Therefore, due to the availability and the atomicity of the third server in a high concurrency scene, the speed of processing the file to be processed and the integrity and the accuracy of the result can be ensured. Here, the third server may be any server in which a program or algorithm having a high concurrency and atomicity function resides, for example, a Redis server, or a server containing a Java high concurrency and atomicity function program, or the like.
Based on this, an embodiment of the present invention provides a flow of a file processing method, as shown in fig. 2, including:
step 201, the first server determines each subtask corresponding to each fragment of the file to be processed according to the file processing configuration;
here, each subtask may include a position of each fragment, a file address of the fragment to be processed, an identifier of the file to be processed, and the like. The processing logic in the file processing configuration can be a corresponding operation mode; for example, the information of the financial management purchased by the file to be processed for the client of the company in one year can be determined by setting the corresponding processing logic of the corresponding operation mode. Corresponding analysis processing can also be carried out; for example, the customer that purchased the company for the year with the most amount to manage money is determined. The processing logic is not particularly limited herein.
Step 202, the first server sends each subtask to a third server, and writes the number of each subtask into the third server;
here, while the first server sends each subtask to the third server, the number of the subtasks, that is, the number taskCount of the fragments to be processed, is written into the third server, so that the following third server determines the execution state of each second server for processing each subtask. The database processing speed of the second server is slow, and the state of one or more second servers may not be successfully synchronized in a high concurrency scene, so that the task execution state cannot be updated all the time. Therefore, the state synchronization of the second server is performed by setting the third server. The processing task can be completed when part of the fragments to be processed are not processed, or the part of the fragments to be processed are processed for multiple times, so that the processing result of the files to be processed is not accurate.
Step 203, the third server receives the subtasks and broadcasts the subtasks to the second servers;
step 204, the second servers process respective subtasks in parallel according to the processing logic in the file processing configuration, and send the processing result to the third server;
here, the processing result is sent to the third server, so that the third server is ensured to acquire the subtask processing state of the second server, and a new subtask is further provided for the second server. And the third server acquires the state of each subtask of the file to be processed by the second server, so that the processing speed of the file to be processed is increased.
Step 205, the third server determines that the number of the received processing results is equal to the number of the sub-tasks, and then the processing of the file to be processed is completed.
Here, the third server determines that the number of processing results is equal to the number of subtasks, and thereby determines that the processing of the file to be processed is completed. In this way, the integrity of the processing task is guaranteed; otherwise, if the task processing were not completed, for example because a network exception occurred, the processing result obtained by the second servers would be an incomplete one.
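A hedged sketch of step 205 under the same Redis assumption: each received processing result atomically increments a counter, and the file is considered fully processed once the counter equals the stored taskCount. The key names are illustrative.
import redis.clients.jedis.Jedis;

final class CompletionChecker {
    // Returns true once every fragment of the file has reported a processing result.
    static boolean onResultReceived(Jedis jedis, String fileId) {
        long done = jedis.incr("task:" + fileId + ":done");              // atomic even under high concurrency
        long total = Long.parseLong(jedis.get("task:" + fileId + ":taskCount"));
        return done == total;
    }
}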
By adopting the method, the first server determines each subtask corresponding to each fragment of the file to be processed according to the file processing configuration, and further sends each subtask to the third server. Therefore, the third server manages each subtask, the subsequent second server can conveniently acquire and process each subtask, and the third server is an atomic server, so that the task can be finished after all the subtasks of the file to be processed are processed, and the integrity and the accuracy of the processing result of the file to be processed are ensured. Further, the third server broadcasts to enable the second server to sequentially acquire the subtasks, so that the second server can obtain the corresponding fragments to be processed according to each subtask and process the fragments to be processed. Therefore, compared with the prior art that one server processes the file to be processed; the file processing method and the file processing system have the advantages that multithreading is achieved through the plurality of second servers, the files to be processed are processed simultaneously and respectively, processing speed can be increased, and processing pressure of the second servers is reduced. In addition, by setting the file processing configuration, time cost and labor cost can also be reduced.
The embodiment of the application provides a file processing method, wherein a first server determines sub-tasks corresponding to fragments of a file to be processed according to file processing configuration, and the method comprises the following steps: the method comprises the steps that a first server obtains a file to be processed according to a file path in file processing configuration; the first server determines the positions of all the fragments of the file to be processed according to the fragment rule in the file processing configuration; and the first server determines each subtask according to each fragment position.
Here, the file path in the file processing configuration may be entered directly by the user or may be preset. The file path may be the physical address of the file to be processed in the memory of the first server; or the address of the file server where the file to be processed is located together with the physical address of the file in the file server's memory; or a logical address generated by memory mapping from the physical address of the file server memory storing the file to be processed. The file path is not specifically limited here. The fragmentation rule in the file processing configuration is the rule for fragmenting the file to be processed; it may include the size of the fragments to be processed, the way fragment positions are recorded, and so on, and is not specifically limited. The fragment positions of the file to be processed are the position information of each fragment within the file, and can be recorded as the byte of the file at which the fragment starts and the byte at which it ends. For example, if the file to be processed is 3000 bytes in total, the starting position of the 1st fragment to be processed is at the 1st byte of the file and its end position is at the 100th byte; the starting position of the 2nd fragment is at the 101st byte and its end position is at the 200th byte. In this way, the information such as the start position and the end position of a fragment to be processed is recorded to generate the subtask corresponding to that fragment.
The embodiment of the present application provides a fragmentation rule in the file processing configuration, where the determining, by the first server, of each fragmentation position of the file to be processed according to the fragmentation rule includes: the first server determines the file body of the file to be processed; the first server determines the fragmentation positions of the file body according to the fragmentation threshold in the fragmentation rule; each fragmentation position is determined as follows: counting from the starting position of the fragment, and checking whether the character at the position reached when the fragmentation threshold is met is a line feed character; if not, counting continues until the first line feed character is reached, and its position is taken as the end position of the fragment. That is to say, under the fragmentation rule the file body of the file to be processed is determined first, and each fragmentation position of the file body is then determined according to the fragmentation threshold set in the rule. The fragmentation threshold represents the size, in bytes, of each fragment to be processed, and can be determined according to the size of the file to be processed (in bytes) and the number N of processor cores of the server. In this way, fragments that are too large, which would cause excessive memory consumption when the fragments are subsequently processed, can be avoided; likewise, fragments that are too small, which would produce a large number of tiny fragments and waste available memory, can also be avoided. A maximum value G and a minimum value L of the fragment size, in bytes, are therefore also set so that, in order to make full use of the server's performance, a file to be processed can be divided into fragments of a suitable size for parallel processing; the fragment size may finally be determined by the formula P (fragmentation threshold) = MIN(G, MAX(P/(2*N), L)). Continuing the previous example, the file to be processed is 3000 bytes; if the fragmentation threshold is set to 100 bytes, the starting position of the 1st fragment to be processed is at the 1st byte of the file and its end position is at the 100th byte; the starting position of the 2nd fragment is at the 101st byte and its end position is at the 200th byte. The 1st fragment to be processed is thus bytes 1 to 100 of the file, the 2nd fragment is bytes 101 to 200, and the 30th fragment is bytes 2901 to 3000. If the end position determined by the fragmentation threshold is not a line feed character, counting continues to the first line feed character. In this way, the integrity of the content of each fragment to be processed is ensured.
Continuing the example, if the end position of the 1st fragment to be processed falls at the 100th byte of the file but that byte is not a line feed character, the current line has not ended, and counting must continue until the first line feed character is found, which marks the end of the line. If that first line feed character is the 120th byte of the file, the 1st fragment to be processed is bytes 1 to 120 of the file, and the 2nd fragment to be processed is bytes 121 to 220.
The embodiment of the present application further provides pseudocode implementing the file fragmentation process, as follows:
local filePairList; // result set holding the fragments to be processed of the file to be processed.
for long point = 0; point <= n - 1 do; // traverse the file to be processed; point is the current pointer position and n the size of the file in bytes.
local end = point + P; // slice at the fixed size P.
if end >= n - 1; // the end of the file has been reached.
end = n - 1; // indicates that the end of the file has been read.
else;
end = seekRowEnd(); // traverse forwards from the position end and take the first line feed character found as the end position of the fragment to be processed. The seekRowEnd() method traverses the file to be processed byte by byte from the specified position until the first line feed character is found, and returns the position of that line feed character in the file.
end; // end of the slicing branch.
local filePair; // create a new fragment to be processed.
filePair.start = point; // set the start position of the fragment to be processed.
filePair.end = end; // set the end position of the fragment to be processed.
filePairList.add(filePair); // add the position information of the fragment to be processed to the fragment result set.
point = end + 1; // move the file pointer to the byte after the fragment and continue slicing.
end;
Here, the fragmentation of the file to be processed is realized by the method above, and the position information of each fragment obtained during slicing is recorded for the subsequent processing of the fragments to be processed.
In this way, a file body that does not include the file head and the file tail is obtained and is then fragmented, which ensures that the fragmented content is exactly the content that needs to be processed and improves the processing accuracy. Moreover, the fragmentation threshold can be determined by factors such as the size of the file to be processed, the size of the file body, and the processing capacity of the first server and the second servers, so that each server can process the file to be processed with multiple threads at the highest efficiency. And if the end position obtained by counting from the fragment's start position according to the fragmentation threshold is not a line feed character, counting continues until the first line feed character is found, and the position of that line feed character is taken as the end position. This ensures the integrity of the file information in each fragment to be processed and improves the accuracy of the subsequent processing of the fragments by the servers.
An embodiment of the present application provides a fragmentation rule in another file processing configuration, where the determining, by the first server, a file body of the file to be processed includes: the first server determines the file header of the file to be processed according to the file header line number in the file processing configuration from the character of the first non-line feed character of the file to be processed; the first server determines the file tail of the file to be processed according to the file tail line number in the file processing configuration from the last character of the non-line feed character of the file to be processed; and the first server determines the part of the file to be processed except the file head and the file tail as the file body.
Here, the character of the first non-line break of the file to be processed marks the start of the header, and therefore, the header of the file to be processed can be determined in accordance with the number of header lines in the file processing configuration, starting from the character of the first non-line break of the file to be processed. The character of the last non-line break of the file to be processed marks the beginning of the file tail, so that the file tail of the file to be processed can be determined according to the number of the file tail lines in the file processing configuration from the character of the last non-line break of the file to be processed. In this way, the file body of the file to be processed can be determined through the file head line number and the file tail line number recorded in the fragmentation rule in the file processing configuration. Therefore, the file body can be accurately determined according to the type of the file to be processed and the configuration of the file head line number and the file tail line number of each type of file in the file processing configuration, and further according to the file head line number and the file tail line number. The accuracy of the fragmentation to-be-processed file is increased, and the accuracy of each to-be-processed fragmentation content is guaranteed.
An embodiment of the present application provides a file processing method, where the third server receives the subtasks and broadcasts the subtasks to the second servers, including: the third server receives the subtasks and broadcasts the subtasks to the second servers; the second server receives the broadcast and then sequentially acquires subtasks from the third server, and acquires fragments to be processed from the files to be processed according to the fragment positions in the subtasks; the parallel processing of the respective subtasks by the second servers according to the processing logic in the file processing configuration and the sending of the processing result to the third server include: and the second servers process the respective fragments to be processed in parallel according to the processing logic in the file processing configuration, and send the processing results of the fragments to be processed to the third server.
After the first server obtains the files to be processed and fragments the files, each subtask is determined according to each fragment position of each fragment to be processed, each subtask is sent to the third server, the third server sends out a broadcast to the second server after receiving each subtask, and each second server is informed to obtain each subtask. Here, the second server may monitor the fragment processing state of the first server all the time, and when the first server completes the fragment processing, the second server actively acquires the subtasks from the third server. Or the first server and the second server share the processing state, so that the second server actively acquires the subtasks from the third server after the first server fragmentation is completed. In addition, the second servers obtain corresponding fragments to be processed through the fragment positions in the obtained subtasks, and each second server simultaneously processes the obtained fragments to be processed according to the processing logic in the file processing configuration, so that the processing speed of the files to be processed is increased. And the second server acquires the to-be-processed fragments corresponding to the subtasks according to the positions of the fragments, and the to-be-processed files are not stored in the second server, so that the to-be-processed files do not occupy the memory resources of the second server. The processing performance of the second server is guaranteed.
The embodiment of the application provides a file processing method, where the second server obtains a fragment to be processed from the file to be processed according to a fragment position in the subtask, and the method includes: and the second server acquires the fragments to be processed from the files to be processed to the memory of the second server according to the fragment positions in the subtasks in a memory mapping mode. That is to say, the second server obtains the to-be-processed fragment corresponding to the subtask in a memory mapping manner, and after the to-be-processed fragment is processed, only the processing result of the to-be-processed fragment is stored, and then the next to-be-processed fragment is processed continuously, so that only one to-be-processed fragment in the second server occupies a small memory, the occupancy rate of resources is ensured to be minimum all the time, and the performance of the second server is ensured.
Here, according to the embodiment of the fragmentation process, an embodiment of the present application further provides a to-be-processed fragment reading method, which is as follows:
The second server loads a fragment to be processed into memory through memory mapping, where filePair(k) denotes the k-th fragment in the fragment result set. The content of the fragment to be processed is mapped into virtual memory according to the fragment's start position and end position, and is then loaded into memory through a cache a set number of rows at a time, so that at any moment only the set number of rows of the fragment resides in the memory of the second server. Let the set number of rows be V; then:
local rowList; // buffer holding the rows read from the fragment to be processed.
local row; // content of the row currently being read.
local count = 0; // number of rows read in the current batch.
for byte byteData in filePair(k).data do; // byteData is one byte of the content of the fragment filePair(k) to be processed.
if byteData == '\n'; // determine whether the byte is a line feed character.
rowList.add(row); // if it is a line feed character, a full row has been read; add the row content to rowList.
row.clear(); // clear row, ready to read the next row of data.
count++; // the number of rows read + 1.
if count == V; // check whether the total number of rows read this time has reached the value V.
callBusinessProcess(rowList); // V rows of data have been read; call the processing logic corresponding to the file to be processed.
rowList.clear(); // clear the row data that has been read.
count = 0; // reset the counter to zero and count again.
end;
else;
row.append(byteData); // the byte is not the end of a row; append it to the current row content.
end;
end;
When the content of a fragment to be processed is read, the method also provides a configuration table for configuring illegal characters: each byte of the fragment being read is checked against the configuration table, so that content consisting of illegal characters can be filtered out while the fragment is read.
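A minimal sketch of the configurable illegal-character check described above; how the configuration table is loaded is left out, and the class and method names are assumptions.
import java.util.Arrays;
import java.util.Set;

final class IllegalCharFilter {
    private final Set<Byte> illegal; // bytes listed as illegal in the configuration table

    IllegalCharFilter(Set<Byte> illegalBytesFromConfig) {
        this.illegal = illegalBytesFromConfig;
    }

    // Copies the fragment content, skipping every configured illegal byte.
    byte[] filter(byte[] sliceContent) {
        byte[] out = new byte[sliceContent.length];
        int n = 0;
        for (byte b : sliceContent) {
            if (!illegal.contains(b)) {
                out[n++] = b;
            }
        }
        return Arrays.copyOf(out, n);
    }
}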
The embodiment of the application provides a method of setting the file processing configuration, where the file processing configuration is written on a preset template; the preset template is provided with function code for opening the file, reading the file and closing the file. Because this function code is already set in the preset template, workers do not need to write it themselves, which reduces their working pressure and the time cost. In addition, given the complexity of this part of the function code, the method also eliminates the situation in which a file is damaged because a worker wrote the code for opening or reading the file incorrectly, and the situation in which a file to be processed permanently occupies server resources because a worker forgot to write, or wrote incorrectly, the code for closing the file. A configuration item corresponding to each type of file to be processed can also be added to the preset template, with the fixed parameters of that file type, such as the number of file head lines and the number of file tail lines, recorded in the configuration item; the workers therefore do not need to set these parameters separately, which further reduces their workload and the time cost.
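A sketch of the preset-template idea under assumed names: opening, reading and closing the file are fixed in the template (here via try-with-resources), and the worker supplies only the per-line processing logic.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

abstract class FileProcessingTemplate {
    // The only part the worker writes: the processing logic for one line.
    protected abstract void processLine(String line);

    // Opening, reading and closing the file are provided by the template.
    public final void run(Path filePath) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(filePath)) { // open
            String line;
            while ((line = reader.readLine()) != null) {                  // read
                processLine(line);
            }
        }                                                                 // close happens automatically
    }
}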
Based on the above flow, an embodiment of the present invention provides a flow of a file processing method, where a third server is described by taking a Redis server as an example, as shown in fig. 3, and the flow includes:
Step 301, a first server acquires a file to be processed;
step 302, determining a fragmentation threshold according to the file size of the file to be processed and the performance of the first server and the second server;
step 303, the first server fragments the file to be processed according to the fragmentation threshold value to generate fragmentation positions of the fragments to be processed, and generates each subtask according to the fragmentation positions of each fragment to be processed;
step 304, sending each subtask to a Redis server;
Step 305, after receiving each subtask, the Redis server sends a broadcast to a second server;
step 306, the second server obtains the subtasks from the Redis server after receiving the broadcast;
step 307, the second server obtains the corresponding fragment to be processed according to the subtask, processes the fragment to be processed according to the processing logic of the fragment to be processed, and records the processing result of the fragment to be processed;
step 308, the second server may share the processing result of each fragment to be processed, and one or more second servers determine the processing result of the file to be processed according to the processing result of each fragment to be processed; or the second server may also send the processing result of each fragment to be processed to the file server, and determine the processing result of the file to be processed in the file server.
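As an illustration of step 308, the processing results of the individual fragments can be merged into the result for the whole file; representing each fragment's result as a map of aggregate values is an assumption made for this sketch.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ResultMerger {
    // Merges the per-fragment results into the processing result for the whole file.
    static Map<String, Long> merge(List<Map<String, Long>> fragmentResults) {
        Map<String, Long> fileResult = new HashMap<>();
        for (Map<String, Long> partial : fragmentResults) {
            partial.forEach((key, value) -> fileResult.merge(key, value, Long::sum));
        }
        return fileResult;
    }
}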
Based on the same concept, an embodiment of the present invention provides a file processing apparatus, and fig. 4 is a schematic diagram of a file processing apparatus according to an embodiment of the present invention, as shown in fig. 4, including:
the processing module 401 is configured to determine, according to the file processing configuration, each subtask corresponding to each fragment of the file to be processed;
a transceiver module 402, configured to send each subtask to a third server, and write the number of each subtask in the third server;
the transceiver module 402 is further configured to receive each subtask, and broadcast each subtask to each second server;
the processing module 401 is further configured to process respective subtasks in parallel according to the processing logic in the file processing configuration, and send a processing result to the third server;
the processing module 401 is further configured to determine that the number of the received processing results is equal to the number of each subtask, and then the processing of the file to be processed is completed.
In one possible design, the processing module 401 is specifically configured to: the method comprises the steps that a first server obtains a file to be processed according to a file path in file processing configuration; the first server determines the positions of all the fragments of the file to be processed according to the fragment rule in the file processing configuration; and the first server determines each subtask according to each fragment position. In a possible design, the determining, by the first server, the positions of the fragments of the file to be processed according to the fragment rule in the file processing configuration includes: the first server determines a file body of the file to be processed; the first server determines the slicing positions of the file body according to the slicing threshold value in the slicing rule; for each slicing position, the following method is adopted to determine: counting from the starting position of the fragment, and determining whether the current character is a line feed character at the position meeting the fragment threshold; if not, continuing to count until the position of the first line break is taken as the end position of the slicing.
In one possible design, the processing module 401 is specifically configured to: the fragmentation threshold may be determined by the following formula:
P=MIN(G,MAX(P/(2*N),L))
wherein P is the fragmentation threshold; g is the maximum value of the fragmentation threshold value, L is the minimum value of the fragmentation threshold value, and G and L are determined by historical experience values and the performance of the server; and N is the number of processor cores of the server.
In one possible design, the processing module 401 is specifically configured to: the first server determines the file header of the file to be processed according to the file header line number in the file processing configuration from the character of the first non-line feed character of the file to be processed; the first server determines the file tail of the file to be processed according to the file tail line number in the file processing configuration from the last character of the non-line feed character of the file to be processed; and the first server determines the part of the file to be processed except the file head and the file tail as the file body.
In one possible design, the transceiver module 402 is specifically configured to: the third server receives the subtasks and broadcasts the subtasks to the second servers; the second server receives the broadcast and then sequentially acquires subtasks from the third server, and acquires fragments to be processed from the files to be processed according to the fragment positions in the subtasks; the parallel processing of the respective subtasks by the second servers according to the processing logic in the file processing configuration and the sending of the processing result to the third server include: and the second servers process the respective fragments to be processed in parallel according to the processing logic in the file processing configuration, and send the processing results of the fragments to be processed to the third server.
In one possible design, the transceiver module 402 is specifically configured to: and the second server acquires the fragments to be processed from the files to be processed to the memory of the second server according to the fragment positions in the subtasks in a memory mapping mode.
In one possible design, the file processing configuration is compiled on a preset template; the preset template is provided with function codes for opening the file, reading the file and closing the file.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (11)

1. A file processing method, adapted to a distributed cluster including a plurality of servers; the method comprises the following steps:
the first server determines, according to a file processing configuration, the subtasks corresponding to the fragments of a file to be processed;
the first server sends each subtask to a third server and writes the number of subtasks to the third server;
the third server receives the subtasks and broadcasts the subtasks to the second servers;
the second servers process respective subtasks in parallel according to the processing logic in the file processing configuration and send processing results to the third server;
and the third server determines that processing of the file to be processed is complete when the number of received processing results is equal to the number of subtasks.
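For illustration, a minimal sketch (hypothetical names, not part of the claims) of the completion check performed by the third server: the expected subtask count is written first, and the file to be processed is considered fully processed once the number of received processing results reaches that count.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CompletionTracker {
    private volatile int expectedSubtasks = -1;            // written by the first server
    private final AtomicInteger receivedResults = new AtomicInteger();

    void setExpectedSubtasks(int count) {
        this.expectedSubtasks = count;
    }

    // Called for each processing result sent by a second server;
    // returns true when the last outstanding result has arrived.
    boolean onResultReceived() {
        return receivedResults.incrementAndGet() == expectedSubtasks;
    }
}
```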
2. The method of claim 1, wherein the first server determines the subtasks corresponding to the fragments of the file to be processed according to the file processing configuration, including:
the first server obtains the file to be processed according to a file path in the file processing configuration;
the first server determines the positions of all the fragments of the file to be processed according to the fragmentation rule in the file processing configuration; and the first server determines each subtask according to each fragment position.
3. The method of claim 2, wherein the determining, by the first server, the positions of the fragments of the file to be processed according to the fragmentation rule in the file processing configuration comprises:
the first server determines a file body of the file to be processed;
the first server determines the fragment positions of the file body according to the fragmentation threshold in the fragmentation rule;
each fragment position is determined as follows: counting from the start position of the fragment, it is determined whether the character at the position that meets the fragmentation threshold is a line feed character; if not, counting continues until the first line feed character is reached, and that position is taken as the end position of the fragment.
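For illustration, a minimal sketch (assumed byte-array representation and hypothetical names) of this boundary rule: advance to the position that meets the fragmentation threshold and then keep going until the first line feed, which closes the fragment.

```java
public class FragmentBoundary {
    // Returns the exclusive end position of a fragment that starts at fragmentStart,
    // given the fragmentation threshold in bytes.
    static int findFragmentEnd(byte[] fileBody, int fragmentStart, int threshold) {
        int pos = fragmentStart + threshold - 1;          // position that meets the threshold
        if (pos >= fileBody.length) {
            return fileBody.length;                       // last fragment: ends with the file body
        }
        while (pos < fileBody.length && fileBody[pos] != '\n') {
            pos++;                                        // current character is not a line feed: keep counting
        }
        return Math.min(pos + 1, fileBody.length);        // fragment ends just after the first line feed
    }
}
```

This keeps every fragment aligned on line boundaries, so no record is split across two subtasks.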
4. The method of claim 3, wherein the determining, by the first server, of the fragment positions of the file body according to the fragmentation threshold in the fragmentation rule comprises:
the fragmentation threshold is determined by the following formula:
P=MIN(G,MAX(P/(2*N),L))
wherein P is the fragmentation threshold; G is the maximum value of the fragmentation threshold and L is the minimum value of the fragmentation threshold, G and L being determined from historical experience and the performance of the server; and N is the number of processor cores of the server.
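For illustration, a minimal sketch of the clamping in this formula. The operand inside MAX is taken here as the file body size divided by 2*N, which is an assumption on our part (the claim as translated reuses the symbol P for that operand); G and L are the configured upper and lower bounds.

```java
public class ThresholdCalculator {
    // P = MIN(G, MAX(candidate, L)), with candidate assumed to be fileBodySize / (2 * N).
    static long fragmentationThreshold(long fileBodySize, int processorCores, long g, long l) {
        long candidate = fileBodySize / (2L * processorCores);
        return Math.min(g, Math.max(candidate, l));
    }
}
```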
5. The method of claim 3, wherein the first server determining the file body of the file to be processed comprises:
starting from the first non-line-feed character of the file to be processed, the first server determines the file header of the file to be processed according to the number of file header lines in the file processing configuration;
starting from the last non-line-feed character of the file to be processed, the first server determines the file tail of the file to be processed according to the number of file tail lines in the file processing configuration;
and the first server determines the part of the file to be processed other than the file header and the file tail as the file body.
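A minimal line-oriented sketch of this split (hypothetical names; the byte-level detail of starting from the first and last non-line-feed characters is omitted for brevity): drop the configured number of header lines from the front and tail lines from the back, and what remains is the file body.

```java
import java.util.List;

public class FileBodyLocator {
    static List<String> fileBody(List<String> lines, int headerLines, int tailLines) {
        int from = Math.min(headerLines, lines.size());        // skip the file header
        int to = Math.max(from, lines.size() - tailLines);     // stop before the file tail
        return lines.subList(from, to);                        // remaining lines are the file body
    }
}
```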
6. The method of claim 1, wherein the third server receiving the subtasks and broadcasting the subtasks to the second servers comprises:
the third server receives the subtasks and broadcasts the subtasks to the second servers; after receiving the broadcast, each second server sequentially acquires subtasks from the third server and acquires the fragments to be processed from the file to be processed according to the fragment positions in the subtasks;
the parallel processing of the respective subtasks by the second servers according to the processing logic in the file processing configuration and the sending of the processing result to the third server include:
and the second servers process the respective fragments to be processed in parallel according to the processing logic in the file processing configuration, and send the processing results of the fragments to be processed to the third server.
7. The method of claim 6, wherein the acquiring of the fragments to be processed from the file to be processed according to the fragment positions in the subtasks comprises:
the second server loads, by memory mapping, the fragments to be processed from the file to be processed into the memory of the second server according to the fragment positions in the subtasks.
8. The method according to any one of claims 1 to 7, wherein the file processing configuration is written based on a preset template; the preset template is provided with function code for opening the file, reading the file and closing the file.
9. A file processing apparatus, applicable to a distributed cluster comprising a plurality of servers; the apparatus comprises:
a processing module, configured to determine the subtasks corresponding to the fragments of the file to be processed according to the file processing configuration;
a transceiver module, configured to send each subtask to a third server and write the number of subtasks to the third server;
the transceiver module is further configured to receive the subtasks and broadcast the subtasks to the second servers;
the processing module is further configured to process the respective subtasks in parallel according to the processing logic in the file processing configuration, and send the processing results to the third server;
and the transceiver module is further configured to determine that processing of the file to be processed is complete when the number of received processing results is equal to the number of subtasks.
10. A computing device, comprising:
a memory for storing a computer program;
a processor for calling a computer program stored in said memory and executing the method of any one of claims 1 to 8 in accordance with the obtained program.
11. A computer-readable non-transitory storage medium including a computer-readable program which, when read and executed by a computer, causes the computer to perform the method of any one of claims 1 to 8.
CN202010478332.6A 2020-05-29 2020-05-29 File processing method and device Pending CN111625507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010478332.6A CN111625507A (en) 2020-05-29 2020-05-29 File processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010478332.6A CN111625507A (en) 2020-05-29 2020-05-29 File processing method and device

Publications (1)

Publication Number Publication Date
CN111625507A true CN111625507A (en) 2020-09-04

Family

ID=72271950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010478332.6A Pending CN111625507A (en) 2020-05-29 2020-05-29 File processing method and device

Country Status (1)

Country Link
CN (1) CN111625507A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579297A (en) * 2020-12-25 2021-03-30 中国农业银行股份有限公司 Data processing method and device
CN114363321A (en) * 2021-12-30 2022-04-15 支付宝(杭州)信息技术有限公司 File transmission method, equipment and system

Similar Documents

Publication Publication Date Title
WO2019153973A1 (en) Event driving method and device
CN107729135B (en) Method and device for parallel data processing in sequence
CN104317928A (en) Service ETL (extraction-transformation-loading) method and service ETL system both based on distributed database
CN107402950B (en) File processing method and device based on sub-base and sub-table
CN107590207A (en) Method of data synchronization and device, electronic equipment
CN111625507A (en) File processing method and device
CN109241165B (en) Method, device and equipment for determining database synchronization delay
CN111125719B (en) Method, device, computer equipment and readable storage medium for improving code security detection efficiency
CN111784318B (en) Data processing method, device, electronic equipment and storage medium
CN110515795A (en) A kind of monitoring method of big data component, device, electronic equipment
CN108399175A (en) A kind of storage of data, querying method and its device
CN113672375B (en) Resource allocation prediction method, device, equipment and storage medium
CN112631731A (en) Data query method and device, electronic equipment and storage medium
CN110221914B (en) File processing method and device
CN111553652A (en) Service processing method and device
CN111625505A (en) File splitting method and device
CN113342897A (en) Data synchronization method and device
CN112104403A (en) Message queue-based multithreading remote sensing satellite baseband data processing method and device
CN106803202B (en) Method and device for extracting transaction records to be tested
CN113657084A (en) Method and system for automatically reading Excel content
CN112581277A (en) Data processing method and device
CN114625546A (en) Data processing method and device
CN105931091B (en) File generation method and device
EP3418914A1 (en) Data management apparatuses, methods, and non-transitory tangible machine-readable media thereof
CN106559445B (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination