CN115114247A - File data processing method and device - Google Patents
File data processing method and device
- Publication number
- CN115114247A (application number CN202210605685.7A)
- Authority
- CN
- China
- Prior art keywords
- file
- thread
- threads
- files
- subfiles
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0659—Command handling arrangements, e.g. command buffers, queues, command scheduling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5011—Pool
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5018—Thread allocation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a file data processing method and a device, wherein the method comprises the following steps: acquiring configuration files of batch files, and analyzing configuration information of each file from the configuration files; under the condition that a target file with the file size larger than a preset threshold exists in the batch files, splitting the target file to obtain a plurality of subfiles of the target file; calling a plurality of first threads from the thread pool, and writing the subfile corresponding to each first thread into a cache queue based on each first thread in the plurality of first threads; calling a plurality of second threads from the thread pool, reading the subfile corresponding to each second thread from the cache queue based on each second thread in the plurality of second threads, and executing corresponding processing logic on the subfile corresponding to each second thread; and acquiring the processing result of the target file according to the processing results of the plurality of subfiles. The invention effectively improves the flexibility, performance and efficiency of file data processing.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a file data processing method and apparatus.
Background
At present, in the program development of many internet companies, business scenarios are complex: batch text files are downloaded and exchanged between systems ever more frequently, and ever larger text data sets are issued. Not only is the total volume of text data growing, but the data content of a single text file is also growing; for example, the content of a single txt file may exceed 300G. Large text data is issued more and more frequently, and the timeliness requirements of the business are higher and higher. As business scenarios multiply and services change more frequently, more and more systems are connected, each bringing its own large text files. Therefore, in order to enable business personnel to view and use data quickly and efficiently, the demand for processing large text data efficiently and quickly is increasingly urgent.
In the prior art, large text data in program development is generally processed by manually written code: the text data is read record by record, the currently read record is processed, the result is stored, and only then is the next record read, processed, and its result stored. Such line-by-line serial processing makes data processing very slow, inflexible, poor in performance, and low in efficiency.
Disclosure of Invention
The invention provides a file data processing method and device, which are used for overcoming the defects of poor flexibility, poor performance and low efficiency when large text data is processed by manually written code in the prior art, and for improving the flexibility, performance and efficiency of file data processing.
The invention provides a file data processing method, which comprises the following steps:
acquiring configuration files of batch files, and analyzing configuration information of each file from the configuration files; the configuration information comprises file size and processing logic;
under the condition that a target file with the file size larger than a preset threshold exists in the batch of files, splitting the target file to obtain a plurality of subfiles of the target file;
calling a plurality of first threads from a thread pool, and writing a subfile corresponding to each first thread into a cache queue based on each first thread in the plurality of first threads; the plurality of first threads are to perform a write operation in parallel;
calling a plurality of second threads from the thread pool, reading a subfile corresponding to each second thread from the cache queue based on each second thread in the plurality of second threads, and executing corresponding processing logic on the subfile corresponding to each second thread; the plurality of second threads to perform read operations in parallel and to perform processing logic in parallel;
and acquiring the processing result of the target file according to the processing results of the plurality of subfiles.
According to the file data processing method provided by the invention, the configuration file further comprises thread pool parameters;
the method for calling the plurality of first threads from the thread pool and writing the subfile corresponding to each first thread in the plurality of first threads into the cache queue based on each first thread comprises the following steps:
determining the total number of the first threads in the thread pool according to the thread pool parameters;
determining the target number of calling the first threads according to the total number of the first threads and the running condition of each first thread;
calling the first threads with the target number, reading the subfiles corresponding to each first thread based on each first thread in the first threads with the target number, and writing the read subfiles into the cache queue.
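A minimal sketch of the target-number determination described above, assuming the simple rule that the number of first threads to call is bounded both by the idle first threads in the pool and by the subfiles still waiting to be written (the class, method and parameter names are invented; the patent does not fix a formula):

```java
// Hypothetical helper: how many first threads to call right now.
public class ThreadAllocator {
    // totalFirst: total first threads configured in the thread pool parameters
    // busyFirst:  first threads currently running (from monitoring)
    // pending:    subfiles not yet written into the cache queue
    public static int targetCount(int totalFirst, int busyFirst, int pending) {
        int idle = Math.max(0, totalFirst - busyFirst);
        return Math.min(idle, pending);
    }

    public static void main(String[] args) {
        System.out.println(targetCount(8, 3, 10)); // 5 idle threads, 10 subfiles -> 5
    }
}
```

When the target number is smaller than the number of pending subfiles, the remainder waits for a first thread to become idle, as the next paragraph describes.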
According to a file data processing method provided by the present invention, the invoking the first threads of the target number, reading a subfile corresponding to each first thread based on each first thread of the first threads of the target number, and writing the read subfile into the cache queue includes:
calling the first threads with the target number under the condition that the target number is smaller than the number of the subfiles to be written in the plurality of subfiles, reading the subfiles to be written corresponding to each first thread based on each first thread in the first threads with the target number, writing the read subfiles to be written into the cache queue, continuously monitoring the running condition of each first thread in the thread pool and acquiring the rest subfiles to be written in the plurality of subfiles;
and under the condition that a target first thread with an idle running state exists in the thread pool, calling the target first thread, reading the remaining subfiles to be written corresponding to the target first thread, and writing the read remaining subfiles to be written into the cache queue until the plurality of subfiles are written into the cache queue.
According to the file data processing method provided by the invention, the method further comprises the following steps:
writing the processing result of the subfile corresponding to each second thread into the storage file corresponding to each second thread;
acquiring files with the same processing logic in the batch files according to the processing logic of each file in the batch files;
according to the storage files, processing results corresponding to the files with the same processing logic are merged and written into a database;
or, according to the storage file, writing the processing results corresponding to the files with the same processing logic into the database in batch.
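Collecting the files that share the same processing logic, as described above, can be sketched with a stream grouping; the map layout (file name mapped to logic name) and all names here are assumptions for illustration:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: group files whose configured processing logic is the same,
// so their results can later be merged and written to the database together.
public class LogicGrouper {
    public static Map<String, List<String>> groupByLogic(Map<String, String> fileToLogic) {
        return fileToLogic.entrySet().stream()
                .collect(Collectors.groupingBy(
                        Map.Entry::getValue,                   // key: processing logic
                        Collectors.mapping(Map.Entry::getKey,  // values: file names
                                Collectors.toList())));
    }

    public static void main(String[] args) {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("a.txt", "translate");
        cfg.put("b.txt", "reformat");
        cfg.put("c.txt", "translate");
        System.out.println(groupByLogic(cfg));
    }
}
```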
According to the file data processing method provided by the invention, the acquiring of the configuration files of the batch files and the parsing of the configuration information of each file from the configuration files comprise the following steps:
acquiring a configuration file of the batch file, and verifying the correctness of the configuration file;
under the condition that the configuration files pass verification, analyzing the configuration information of each file from the configuration files;
under the condition that the configuration file is not verified, updating the configuration file according to the attribute information of the batch files, and verifying the correctness of the updated configuration file until the updated configuration file is verified;
and under the condition that the updated configuration file passes verification, analyzing the configuration information of each file from the updated configuration file.
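The verify-then-update loop described above can be sketched as follows. The verification predicate and the attribute-based rebuild step are placeholders (the real checks are not specified here), and the bounded retry count is an added safeguard rather than part of the described method:

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Illustrative only: keep rebuilding the configuration from the batch files'
// attribute information until it passes correctness verification.
public class ConfigVerifier {
    public static String ensureValid(String config,
                                     Predicate<String> verify,
                                     UnaryOperator<String> rebuildFromAttributes,
                                     int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (verify.test(config)) {
                return config;          // verified: ready for parsing
            }
            config = rebuildFromAttributes.apply(config);
        }
        throw new IllegalStateException("configuration never passed verification");
    }

    public static void main(String[] args) {
        // Placeholder check: a "valid" config here is just one that starts with '<'.
        String ok = ensureValid("config", s -> s.startsWith("<"), s -> "<" + s, 3);
        System.out.println(ok);
    }
}
```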
According to a file data processing method provided by the present invention, the splitting the target file to obtain a plurality of subfiles of the target file includes:
determining the splitting number of the target file according to the preset threshold and the attribute information of the target file;
splitting the target file according to the splitting number to obtain a plurality of sub-files of which the number is the splitting number;
or determining the splitting size of the target file according to the preset threshold and the attribute information of the target file;
splitting the target file according to the split size to obtain a plurality of subfiles of which the file size is smaller than or equal to the split size; wherein the split size is less than the preset threshold.
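Both split strategies above reduce to simple arithmetic. The sketch below is one reasonable concretization and an assumption, not the patented rule: the number-based split uses ceiling division so every subfile fits within the threshold, and the size-based split picks a chunk size strictly below the threshold (here half of it):

```java
// Hypothetical split planning for a target file larger than the threshold.
public class SplitPlanner {
    // Number-based split: smallest count such that each part is <= threshold.
    public static long splitCount(long fileSize, long threshold) {
        return (fileSize + threshold - 1) / threshold;   // ceiling division
    }

    // Size-based split: a chunk size strictly below the preset threshold.
    public static long splitSize(long threshold) {
        return Math.max(1, threshold / 2);
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        System.out.println(splitCount(300 * gb, 2 * gb)); // 300 GB in 2 GB parts -> 150
        System.out.println(splitSize(2 * gb));
    }
}
```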
According to the file data processing method provided by the invention, the configuration information further comprises a decompression state;
before the splitting the target file to obtain a plurality of subfiles of the target file, the method further includes:
judging whether the decompression state of each file in the batch of files is a to-be-decompressed state or not;
and decompressing the file with the decompression state to be the to-be-decompressed state in the batch of files.
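The check described above is a simple filter over the parsed configuration; in this sketch the state values and the map layout are invented for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative: select the files whose configured decompression state says
// they must be decompressed before splitting.
public class DecompressionFilter {
    public static List<String> needsDecompression(Map<String, String> fileToState) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : fileToState.entrySet()) {
            if ("toBeDecompressed".equals(e.getValue())) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> states = new LinkedHashMap<>();
        states.put("a.zip", "toBeDecompressed");
        states.put("b.txt", "none");
        System.out.println(needsDecompression(states)); // [a.zip]
    }
}
```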
The present invention also provides a file data processing apparatus, comprising:
the analysis module is used for acquiring configuration files of the batch files and analyzing the configuration information of each file from the configuration files; the configuration information comprises file size and processing logic;
the splitting module is used for splitting a target file with the file size larger than a preset threshold value in the batch of files to obtain a plurality of subfiles of the target file;
the cache module is used for calling a plurality of first threads from a thread pool and writing a subfile corresponding to each first thread into a cache queue based on each first thread in the plurality of first threads; the plurality of first threads are to perform a write operation in parallel;
the processing module is used for calling a plurality of second threads from the thread pool, reading a subfile corresponding to each second thread from the cache queue based on each second thread in the plurality of second threads, and executing corresponding processing logic on the subfile corresponding to each second thread; the plurality of second threads to perform read operations in parallel and to perform processing logic in parallel;
and the acquisition module is used for acquiring the processing result of the target file according to the processing results of the plurality of subfiles.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the file data processing method.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the file data processing method as described in any of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the file data processing method as described in any one of the above.
According to the file data processing method and device provided by the invention, on one hand, different processing operations are executed on each file in the batch of files according to the configuration file of the batch of files, so that a plurality of files in the batch are processed in a unified way; the method is applicable to various batch files, with good universality, high reusability and good flexibility. On the other hand, when a target file whose file size is larger than the preset threshold exists in the batch of files, a plurality of first threads are adopted to write the fragments of the target file into the cache queue, and a plurality of second threads are adopted to read the subfile fragments from the cache queue and logically process the subfiles; file writing, file reading and file processing operations are thereby effectively isolated, which can improve the accuracy and reliability of file data processing, improve file processing efficiency, and further improve the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a document data processing method provided by the present invention;
FIG. 2 is a schematic diagram of a document data processing apparatus according to the present invention;
fig. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, there is no configuration-driven processing method for large files; most approaches implement file processing logic by hand-written code, whose universality, flexibility and efficiency are very poor. Files cannot be processed quickly, and human factors greatly affect the efficiency of processing large files.
Therefore, in the prior art, file processing is realized by hand-written code, so the processing speed is low; different code needs to be written for different processing rules of large text data, which adds unnecessary workload, and the processing mode is not uniform. Moreover, because business change scenarios differ, the processing method is difficult to reuse, so the reuse rate of the developed processing method is low, development efficiency is low, and quality risk is introduced into high-efficiency data processing. In addition, when the data volume of a text is too large, processing efficiency is low, the risks of parsing errors and delay increase, the workload of developers increases, and the user experience is poor.
In order to solve the problems in the prior art, this embodiment provides a file data processing method in which the configuration file of a batch of files is acquired to obtain the configuration information of each file in the batch, and the file size and processing logic of each file are obtained from the configuration information. When a file is larger than a preset threshold, the file is split and written in parallel, then read and logically processed in parallel, so that the write operation is decoupled from the read and processing operations and the two execute asynchronously. This effectively improves the flexibility, performance and efficiency of text data processing; the method is applicable to various kinds of batch file processing, has relatively good universality and a relatively high reuse rate, and effectively improves the user experience.
It should be noted that the execution subject of the method may be an electronic device, a component or a chip in the electronic device. The electronic device may be a mobile electronic device or a non-mobile electronic device. For example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, etc., and the non-mobile electronic device may be a server, a network attached storage, a personal computer, etc., and the present invention is not limited in particular.
The file data processing method of this embodiment may be implemented in the Java language or the C language.
The file data processing method of the present invention is described below with reference to fig. 1, and includes the steps of:
101, acquiring a configuration file of a batch of files, and analyzing the configuration information of each file from the configuration file;
the batch files are multiple files with the same processing logic or the same type, and may be batch files to be translated, batch files to be format updated, and the like, which is not specifically limited in this embodiment.
Each batch of files corresponds to a configuration file, and the format of the configuration file may be XML (Extensible Markup Language) format, or another format such as JSON (JavaScript Object Notation) format, which is not specifically limited in this embodiment.
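As a purely illustrative example of such a configuration file (every element and attribute name below is invented for the example, not prescribed by the method), an XML configuration might look like:

```xml
<!-- Hypothetical configuration for a batch of files -->
<batch>
  <threadPool firstThreads="8" secondThreads="8"/>
  <file name="orders.txt" size="322122547200" processingLogic="parseOrders"
        decompressionState="toBeDecompressed"/>
  <file name="users.txt" size="1048576" processingLogic="parseUsers"
        decompressionState="none"/>
</batch>
```

In this sketch, the per-file `size` and `processingLogic` attributes carry the configuration information the method parses, and `threadPool` carries the thread pool parameters mentioned later.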
The configuration file includes configuration information of each file in the batch file, where the configuration information includes a file size and processing logic, and may further include a file name and a processing step, and the like, which is not specifically limited in this embodiment.
The processing logic is business logic which needs to be executed on the file, and the processing steps are concrete implementation steps for executing the business logic on the file.
The configuration file can be expanded according to actual requirements, such as modification, deletion and addition of configuration information.
Optionally, acquiring batch files to be processed, and acquiring configuration files of the batch files;
after the configuration files are obtained, the configuration files can be directly analyzed to obtain the configuration information of each file; or the configuration file may be processed, and then the processed configuration file is analyzed to obtain the configuration information of each file, which is not specifically limited in this embodiment. The processing mode of the configuration file includes, but is not limited to, checking, format conversion, or the like.
102, splitting a target file with the file size larger than a preset threshold value in the batch of files to obtain a plurality of subfiles of the target file;
the preset threshold may be preset according to a thread performance parameter in the thread pool or preset according to an actual requirement, which is not specifically limited in this embodiment. Such as based on the transmission capacity of each thread in the thread pool.
The number of target files may be one or more, and the processing steps for each target file are the same. To simplify the description, the file data processing method of this embodiment is developed below by taking one target file as an example.
Optionally, after the configuration information of each file is obtained, the file size of each file is obtained from the configuration information.
Comparing the file size of each file with the preset threshold determines whether each file is a large file. If no target file with a file size larger than the preset threshold exists in the batch, it indicates that all files in the batch are small files, and they can be processed directly by multiple threads. If a target file with a file size larger than the preset threshold exists in the batch, it indicates that the batch contains a target file with a large amount of data content, and the target file needs to be split into a plurality of subfiles.
The splitting mode includes but is not limited to: determining the splitting number according to the preset threshold and the file size of the target file, and splitting the target file into a plurality of subfiles according to the splitting number; or determining the splitting size according to the preset threshold and the file size, and splitting the target file into a plurality of subfiles according to the splitting size.
103, calling a plurality of first threads from a thread pool, and writing a subfile corresponding to each first thread into a cache queue based on each first thread in the plurality of first threads; the plurality of first threads are to perform a write operation in parallel;
here, multithreading refers to a technique in which multiple threads are concurrently executed from software or hardware. The computer with multithreading capability can execute more than one thread at the same time due to the hardware support, thereby improving the overall processing performance. Systems with this capability include symmetric multiprocessors, multi-core processors, and chip-level multiprocessing or simultaneous multi-threaded processors. In a program, these independently running program segments are threads, and the concept of programming by using them is multithread processing. The computer with multithreading capability can execute more than one thread at the same time due to the hardware support, thereby improving the overall processing performance.
Thread pools are a form of multi-threaded processing in which tasks are added to a queue and then automatically started after a thread is created. The thread pool threads are all background threads. Each thread uses a default stack size, runs at a default priority, and is in a multi-threaded unit. If a thread is idle in managed code (e.g., waiting for an event), the thread pool may insert another helper thread to keep all processors busy. If all thread pool threads remain busy all the time, but the queue contains pending work, the thread pool will create another helper thread after a period of time, but the number of threads never exceeds the maximum. Threads that exceed the maximum value may be queued, but they wait until other threads are completed before starting.
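A thread pool sized from the configured totals of first and second threads can be built with the standard java.util.concurrent API; the class and parameter names below are assumptions for illustration:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative: one pool whose maximum size is the configured total of
// first (write) threads plus second (read/process) threads; tasks submitted
// beyond that maximum wait in the queue, as described above.
public class PoolFactory {
    public static ThreadPoolExecutor fromConfig(int firstThreads, int secondThreads) {
        int total = firstThreads + secondThreads;
        return new ThreadPoolExecutor(total, total,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());   // pending tasks wait here
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = fromConfig(4, 4);
        System.out.println(pool.getMaximumPoolSize()); // 8
        pool.shutdown();
    }
}
```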
The first thread is used for executing write operation so as to write the file into the cache queue;
Optionally, after obtaining the plurality of subfiles of the target file, a plurality of first threads may be called from the thread pool to write the corresponding subfiles into the cache queue in parallel;
The number of calls of the first thread can be determined according to the number of first threads in the thread pool, the running condition of the first threads, and the number of subfiles that need to be written into the cache queue.
104, calling a plurality of second threads from the thread pool, reading the subfile corresponding to each second thread from the cache queue based on each second thread in the plurality of second threads, and executing corresponding processing logic on the subfile corresponding to each second thread;
The second thread is used for executing the reading operation and the processing logic, so as to read the file from the cache queue and execute corresponding processing logic on the file.
It should be noted that the first thread and the second thread execute their corresponding operations asynchronously.
Optionally, while the first thread is called, multiple second threads may be asynchronously called from the thread pool to perform a parallel action, so as to read the subfiles corresponding to the second threads from the buffer queue, execute corresponding processing logic on the subfiles corresponding to the second threads, and after the processing logic is executed, save the execution results of the subfiles corresponding to the second threads.
The number of calls of the second thread may be determined according to the number of threads of the second thread in the thread pool, the running condition of the second thread, and the number of subfiles that need to be read and logically processed in the cache queue.
In the embodiment, the asynchronous calling is performed on the plurality of first threads and the plurality of second threads, so that the isolation of file writing, file reading and file processing operations can be realized, the influence on data processing caused by abnormal operations is effectively avoided, and the accuracy and the reliability of data processing are improved; and when the plurality of first threads process the writing task of the subfile in parallel, the plurality of second threads can also process the reading task and the processing task of the subfile in parallel, so that the efficiency of file data processing is effectively improved.
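The asynchronous first-thread/second-thread arrangement described above is, in essence, a producer-consumer pipeline. The following Java sketch is illustrative only, not the patented implementation: the class and method names are invented, the cache queue is modeled as a LinkedBlockingQueue, and the processing logic is a stand-in that upper-cases each subfile's content.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: "first threads" write subfiles into a cache queue,
// "second threads" read from the queue and run the processing logic in parallel.
public class Pipeline {
    public static List<String> process(List<String> subfiles, int readers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> results = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(subfiles.size() + readers);
        CountDownLatch done = new CountDownLatch(subfiles.size());

        // First threads: write each subfile into the cache queue in parallel.
        for (String sub : subfiles) {
            pool.submit(() -> queue.offer(sub));
        }
        // Second threads: read subfiles from the cache queue and process them.
        for (int i = 0; i < readers; i++) {
            pool.submit(() -> {
                try {
                    while (done.getCount() > 0) {
                        String sub = queue.poll(50, TimeUnit.MILLISECONDS);
                        if (sub != null) {
                            results.add(sub.toUpperCase()); // stand-in processing logic
                            done.countDown();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            done.await();            // all subfile results are in
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdownNow();
        return results;
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList("part1", "part2"), 2));
    }
}
```

Because writers and readers touch only the queue, the write, read and processing operations stay isolated in the sense described above, and a failure in one reader does not disturb the writers.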
105, acquiring the processing result of the target file according to the processing results of the plurality of subfiles.
Optionally, when it is determined that all the subfiles of the target file are processed, the processing results of the subfiles are obtained, and the processing results of the files are summarized to obtain the processing result of the target file.
According to the file data processing method in this embodiment, on the one hand, different processing operations are executed on each file in a batch of files according to the configuration file of the batch, so that a plurality of files in the batch are processed in a unified manner; the method is applicable to various batch files and offers good universality, high reusability, and good flexibility. On the other hand, when the batch contains a target file whose file size is larger than the preset threshold, a plurality of first threads fragment the target file and write it into the cache queue, and a plurality of second threads read the subfile fragments from the cache queue and logically process them. This effectively isolates the file writing, file reading, and file processing operations, improves the accuracy and reliability of file data processing, improves file processing efficiency, and further improves the user experience.
In some embodiments, the configuration file further includes thread pool parameters; the method for calling the plurality of first threads from the thread pool and writing the subfile corresponding to each first thread in the plurality of first threads into the cache queue based on each first thread comprises the following steps: determining the total number of the first threads in the thread pool according to the thread pool parameters; determining the target number of calling the first threads according to the total number of the first threads and the running condition of each first thread; calling the first threads with the target number, reading the subfiles corresponding to each first thread based on each first thread in the first threads with the target number, and writing the read subfiles into the cache queue.
The thread pool parameters at least comprise the total number of the first threads and the total number of the second threads;
optionally, in step 103, the step of calling the plurality of first threads to write the plurality of subfiles into the cache queue includes:
firstly, acquiring the total number of first threads from thread pool parameters, and monitoring the running condition of each first thread in a thread pool;
then, determining the total number of the first threads in the idle state in the thread pool according to the total number of the first threads in the thread pool and the running condition of each first thread, and determining the target number of calling the first threads according to the total number of the first threads in the idle state;
and then, calling the first threads with the target number, and executing the writing operation of the subfiles corresponding to the first threads in the first threads with the target number in parallel to read the subfiles corresponding to the first threads in parallel and write the read subfiles into the cache queue in parallel.
Similarly, the step of invoking the second threads to read the subfiles and logically process the subfiles in step 104 comprises:
firstly, acquiring the total number of second threads from thread pool parameters, and monitoring the running condition of each second thread in a thread pool;
Then, the total number of second threads in the idle state in the thread pool is determined according to the total number of second threads in the thread pool and the running condition of each second thread, and the number of second threads to be called is determined according to that total. The corresponding number of second threads are called, and the read operations and processing logic of their corresponding subfiles are executed in parallel: each second thread reads its corresponding subfile from the cache queue and executes the corresponding processing logic on it.
In the embodiment, different file processing strategies are adaptively determined according to the running condition of each thread in the thread pool, and the subfiles in the target file are fragmented according to the corresponding file processing strategies to execute the writing operation, the reading operation and the processing logic in parallel, so that the effectiveness and the accuracy of file processing are effectively ensured, and file data loss and file data processing abnormity caused by congestion are avoided.
In some embodiments, the invoking the target number of first threads, based on each first thread in the target number of first threads, reading a subfile corresponding to each first thread, and writing the read subfile into the cache queue includes: calling the first threads with the target number under the condition that the target number is smaller than the number of the subfiles to be written in the plurality of subfiles, reading the subfiles to be written corresponding to each first thread based on each first thread in the first threads with the target number, writing the read subfiles to be written into the cache queue, continuously monitoring the running condition of each first thread in the thread pool and acquiring the rest subfiles to be written in the plurality of subfiles; and under the condition that a target first thread with an idle running state exists in the thread pool, calling the target first thread, reading the remaining subfiles to be written corresponding to the target first thread, and writing the read remaining subfiles to be written into the cache queue until the plurality of subfiles are written into the cache queue.
The subfile to be written is a subfile to be written in the target file.
The remaining subfiles to be written are those subfiles to be written in the target file that are left after the currently called first threads have performed the file write operation.
Optionally, in step 103, the step of calling the plurality of first threads to write the plurality of subfiles into the cache queue further includes:
After the target number of first threads to call is determined, the target number may be compared with the number of subfiles to be written in the target file. When the target number is greater than or equal to the number of subfiles to be written, all of the subfiles to be written can be written into the cache queue in parallel at one time; in this case, the target number of first threads are called directly, and the corresponding subfiles to be written are written into the cache queue in parallel at one time based on those threads.
When the target number is smaller than the number of subfiles to be written in the target file, the subfiles to be written cannot all be written into the cache queue in parallel at one time. In this case, the target number of first threads are called first, and the corresponding subfiles to be written are written into the cache queue in parallel based on those threads; the running condition of each first thread in the thread pool is then continuously monitored, and whenever a target first thread whose running state is idle exists in the thread pool, that thread is called to write the corresponding remaining subfiles to be written into the cache queue, until all the subfiles in the target file have been written into the cache queue.
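A minimal sketch of this dispatch behavior, using Python's `ThreadPoolExecutor` as a stand-in for the thread pool (the helper names are assumptions): with fewer workers than subfiles, each worker picks up the next pending write as soon as it becomes idle, until every subfile has been written.

```python
from concurrent.futures import ThreadPoolExecutor

def write_all_subfiles(subfiles, write_one, max_workers=4):
    """Sketch: when fewer idle first threads exist than subfiles, the pool
    keeps reassigning idle workers to the remaining writes until all
    subfiles have been written to the cache queue."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit() queues every pending subfile; each worker takes the next
        # pending write as soon as it finishes its current one
        futures = [pool.submit(write_one, sub) for sub in subfiles]
        # results come back in submission order
        return [f.result() for f in futures]
```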
Similarly, the specific steps of calling a plurality of second threads in step 104 to read a plurality of subfiles and logically process the plurality of subfiles further include:
after the target number of calling the second thread is determined, the target number of the second thread can be compared with the number of the subfiles to be processed in the cache queue, and under the condition that the target number of the second thread is larger than or equal to the number of the subfiles to be processed in the cache queue, the plurality of the subfiles to be processed can be read out from the cache queue in parallel at one time, and corresponding processing logic is executed on the read subfiles to be processed; and at the moment, directly calling second threads with target quantity, and reading the sub-files to be processed corresponding to the second threads with the target quantity from the cache queue at one time and executing corresponding processing logic.
When the target number of second threads is smaller than the number of subfiles to be processed in the cache queue, the subfiles to be processed cannot all be read in parallel and have their corresponding processing logic executed at one time. In this case, the target number of second threads are called first, and the subfiles to be processed are read from the cache queue and the corresponding processing logic is executed in parallel based on those threads; the running condition of each second thread in the thread pool is then continuously monitored, and whenever a target second thread whose running state is idle exists in the thread pool, that thread is called to read the corresponding remaining subfiles to be processed from the cache queue and execute the corresponding processing logic, until all the subfiles in the target file have been processed.
The subfiles to be processed are the subfiles to be processed in the target file.
The remaining subfiles to be processed are those subfiles to be processed in the target file that are left after the currently called second threads have performed the file processing operation.
In the embodiment, different file processing strategies are adaptively determined according to the running condition of each thread in the thread pool and the number of the subfiles to be written in the target file and the number of the subfiles to be processed in the cache queue, and the subfiles in the target file are fragmented according to the corresponding file processing strategies to execute writing operation, reading operation and processing logic in parallel, so that the effectiveness and the accuracy of file processing are effectively ensured.
In some embodiments, the method further comprises: writing the processing result of the subfile corresponding to each second thread into the storage file corresponding to each second thread; acquiring files with the same processing logic in the batch files according to the processing logic of each file in the batch files; according to the storage files, processing results corresponding to the files with the same processing logic are merged and written into a database; or, according to the storage file, writing the processing results corresponding to the files with the same processing logic into the database in batch.
Each second thread corresponds to one storage file.
Optionally, after the processing of the subfile corresponding to each second thread is completed, the execution result of the subfile corresponding to each second thread may be written into the storage file corresponding to each second thread based on each second thread;
then, comparing the processing logic of each file in the batch files with each other, and determining the files with the same processing logic in the batch files;
Then, the processing results corresponding to the files with the same processing logic are merged and written into the database, or written into the database in batches. The specific writing step includes: determining, from the storage files, the execution result corresponding to each subfile of the files with the same processing logic; merging the execution results of all such subfiles; and writing the merged execution results into the database, or writing them into the database in batches.
According to the embodiment, files with the same processing logic can be stored in batches, and the file storage efficiency is effectively improved.
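The grouping-and-merging step above can be sketched as follows (a hypothetical in-memory model: the mapping names and result shapes are assumptions, and the actual database write is omitted):

```python
from collections import defaultdict

def group_results_by_logic(file_logic, file_results):
    """Sketch: group per-file results by processing logic so that results
    for files sharing the same logic can be merged, or batch-written, in
    one database operation.  file_logic maps file name -> logic id;
    file_results maps file name -> list of subfile results."""
    grouped = defaultdict(list)
    for name, logic in file_logic.items():
        grouped[logic].extend(file_results.get(name, []))
    return dict(grouped)
```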
In some embodiments, the obtaining the configuration file of the batch file, and parsing the configuration information of each file from the configuration file includes: acquiring a configuration file of the batch file, and verifying the correctness of the configuration file; under the condition that the configuration files pass verification, analyzing the configuration information of each file from the configuration files; under the condition that the configuration file is not verified, updating the configuration file according to the attribute information of the batch files, and verifying the correctness of the updated configuration file until the updated configuration file is verified; and under the condition that the updated configuration file passes verification, analyzing the configuration information of each file from the updated configuration file.
Optionally, the specific step of parsing the configuration information of each file in the step 101 includes:
First, the configuration file of the batch of files is acquired; before the configuration file is parsed, the correctness of each item of configuration information in it is verified, in order to ensure the correctness of the configuration file and thereby the correctness of the parsing result.
determining that the configuration information passes the verification under the condition that all the configuration information in the configuration file passes the verification; at this time, the configuration file can be directly analyzed to obtain the configuration information of each file in the configuration file.
And under the condition that any configuration information in the configuration files fails to pass the verification, updating the configuration files according to the attribute information of the batch files, and continuously performing correctness verification on the updated configuration files until all the configuration information in the updated configuration files passes the correctness verification.
Then, the configuration information of each file is parsed from the configuration file that has passed the correctness verification.
In the embodiment, when the configuration file is analyzed, the configuration file is correctly verified and corrected and updated, so that the correctness of the configuration file is effectively ensured, and the correctness of file processing is further ensured.
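The verify-then-update loop can be expressed compactly (a sketch under the assumption that `verify` and `update` stand for the correctness check and the attribute-based correction the embodiment describes):

```python
def load_verified_config(config, batch_attributes, verify, update):
    """Sketch: if verification fails, re-derive the configuration from the
    batch files' attribute information and re-verify, repeating until the
    updated configuration passes; only then is it parsed further."""
    while not verify(config):
        config = update(config, batch_attributes)
    return config
```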
In some embodiments, the splitting the target file to obtain a plurality of subfiles of the target file includes: determining the splitting number of the target file according to the preset threshold and the attribute information of the target file; splitting the target file according to the split number to obtain a plurality of subfiles of which the number is the split number; or determining the splitting size of the target file according to the preset threshold and the attribute information of the target file; splitting the target file according to the split size to obtain a plurality of subfiles of which the file size is smaller than or equal to the split size; wherein the split size is less than the preset threshold.
The attribute information of the target file includes, but is not limited to, a file size of the target file.
Optionally, the splitting step of the target file in step 102 specifically includes:
Determine the split number of the target file according to the preset threshold and the file size of the target file, then split the target file according to that number, so that the number of resulting subfiles equals the split number.
Or determining the splitting size of the target file according to a preset threshold and the file size of the target file, and splitting the target file into a plurality of subfiles of which the file sizes are smaller than or equal to the splitting size according to the splitting size.
According to this embodiment, the target file can be quickly and accurately split into a plurality of subfiles according to the attribute information of the target file and the preset threshold, so that the target file can then be quickly and effectively processed in parallel by fragments.
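Both splitting strategies can be sketched on a byte string (the function names, and the use of `len` as the "file size", are illustrative assumptions):

```python
import math

def split_by_count(data, threshold):
    """Split-by-number sketch: derive the split number from the file size
    and the preset threshold, then cut the file into that many subfiles."""
    count = math.ceil(len(data) / threshold)
    size = math.ceil(len(data) / count)
    return [data[i:i + size] for i in range(0, len(data), size)]

def split_by_size(data, split_size):
    """Split-by-size sketch: every subfile is at most split_size bytes,
    where split_size is chosen below the preset threshold."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]
```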
In some embodiments, the configuration information further includes a decompression state; before the splitting the target file to obtain a plurality of subfiles of the target file, the method further includes: judging whether the decompression state of each file in the batch of files is a to-be-decompressed state or not; and decompressing the file with the decompression state to be the to-be-decompressed state in the batch of files.
The configuration information further includes a decompression state of each file, which is used to represent whether each file needs to be decompressed.
Optionally, before step 102 is executed, it is further necessary to determine whether the decompression state of each file is the to-be-decompressed state, and to decompress any file whose decompression state is the to-be-decompressed state, so as to obtain the decompressed file;
and executing the steps 102 to 104 according to the decompressed files to obtain a processing result of each file in the batch of files.
The following specifically describes the complete flow of the file data processing method in this embodiment; the steps are as follows:
step (1), acquiring the configuration file of the batch of files and verifying its correctness;
step (2), reading and parsing the configuration file that has passed the correctness verification to obtain the configuration information of each file, the configuration information including, but not limited to, the file name, file size, decompression state, processing steps, and thread pool parameters of each file;
step (3), judging the decompression state of each file in the batch of files to determine whether the file needs to be decompressed, and decompressing the file first under the condition that the decompression operation needs to be performed;
step (4), judging whether the file size of each file in the batch of files is larger than a preset threshold value, and splitting the target file larger than the preset threshold value to obtain a plurality of subfiles of the target file;
step (5), calling a plurality of first threads from the thread pool to execute writing operation in parallel, wherein each first thread reads one subfile and then writes the subfile into a cache queue;
step (6), calling a plurality of second threads from the thread pool to execute reading operation and parallel execution processing operation in parallel, wherein each second thread reads one subfile from the cache queue and then executes the relevant processing logic;
step (7), after the processing is finished, storing the processing results; see steps (8) and (9) for details;
step (8), after each second thread finishes processing its corresponding subfile, writing the processing result of that subfile into the corresponding storage file, each second thread corresponding to one storage file;
and (9) writing processing results corresponding to the files with the same processing logic into the database in batches or writing the processing results into the database after merging according to the stored files.
In the prior art, the files of each processing logic are handled by manually writing code for them one by one; as a result, processing performance is low, data-use latency is high, and because code is hand-written for every file, the processing is inflexible, poorly universal, and development efficiency is low.
The file data processing method provided in this embodiment can not only flexibly configure the processing method according to file size, but also process files asynchronously in stages and handle the file write tasks and file processing tasks with fragment-level concurrency, thereby effectively reducing file processing latency and improving the efficiency and accuracy of file processing. In addition, a user only needs to supply the configuration file and follow the operation steps of the method to process large files efficiently; development and use can be completed directly according to those steps, and the method can be applied to files of various types and sizes without additional development, sparing developers a great deal of unnecessary work. It also addresses the difficulty of uniformly processing large volumes of text data in batches and stores data in real time, so that batch files are processed quickly and efficiently, subsequent data use becomes more efficient, and latency stays low. Moreover, multiple kinds of processing logic can be configured in the configuration file, so the processing logic is flexible, the universality is better, and the reuse rate of developed functions is higher.
The following describes a file data processing apparatus provided by the present invention, and the file data processing apparatus described below and the file data processing method described above may be referred to in correspondence with each other.
As shown in fig. 2, this embodiment provides a file data processing apparatus, which includes a parsing module 201, a splitting module 202, a caching module 203, a processing module 204, and an obtaining module 205, where:
the parsing module 201 is configured to obtain configuration files of batch files, and parse configuration information of each file from the configuration files; the configuration information comprises file size and processing logic;
optionally, acquiring batch files to be processed, and acquiring configuration files of the batch files;
after the configuration files are obtained, the configuration files can be directly analyzed to obtain the configuration information of each file; or the configuration file may be processed, and then the processed configuration file is analyzed to obtain the configuration information of each file, which is not specifically limited in this embodiment. The processing mode of the configuration file includes, but is not limited to, checking, format conversion, and the like.
The splitting module 202 is configured to split a target file with a file size larger than a preset threshold in the batch of files to obtain multiple subfiles of the target file;
optionally, after the configuration information of each file is obtained, the file size of each file is obtained from the configuration information.
The file size of each file is compared with the preset threshold to determine whether it is a large file. If no target file with a file size larger than the preset threshold exists in the batch, this indicates that all the files in the batch are small and they can be processed directly by multiple threads; if such a target file does exist, the batch contains a file with a large amount of data, and that target file needs to be split into a plurality of subfiles.
The cache module 203 is configured to invoke a plurality of first threads from a thread pool, and based on each first thread in the plurality of first threads, write a subfile corresponding to each first thread into a cache queue; the plurality of first threads are to perform a write operation in parallel;
Optionally, after the plurality of subfiles of the target file are obtained, a plurality of first threads may be called from the thread pool to write the corresponding subfiles into the cache queue in parallel;
the calling number of the first thread may be determined according to the thread number of the first thread in the thread pool, the running condition of the first thread, and the number of subfiles that need to be written into the cache queue.
The processing module 204 is configured to invoke a plurality of second threads from the thread pool, based on each second thread in the plurality of second threads, read a subfile corresponding to each second thread from the cache queue, and execute corresponding processing logic on the subfile corresponding to each second thread; the plurality of second threads to perform read operations in parallel and to perform processing logic in parallel;
Optionally, while the first threads are called, a plurality of second threads may be asynchronously called from the thread pool to run in parallel, so as to read the subfile corresponding to each second thread from the cache queue, execute the corresponding processing logic on that subfile, and, after the processing logic is executed, save the execution result of that subfile.
The number of calls of the second thread may be determined according to the number of threads of the second thread in the thread pool, the running condition of the second thread, and the number of subfiles that need to be read and logically processed in the cache queue.
The obtaining module 205 is configured to obtain a processing result of the target file according to the processing results of the plurality of subfiles.
Optionally, when it is determined that all the subfiles of the target file have been processed, the processing results of the subfiles are obtained and summarized to obtain the processing result of the target file.
According to the file data processing apparatus in this embodiment, on the one hand, different processing operations are executed on each file in a batch of files according to the configuration file of the batch, so that a plurality of files in the batch are processed in a unified manner; the apparatus is applicable to various batch files and offers good universality, high reusability, and good flexibility. On the other hand, when the batch contains a target file whose file size is larger than the preset threshold, a plurality of first threads fragment the target file and write it into the cache queue, and a plurality of second threads read the subfile fragments from the cache queue and logically process them. This effectively isolates the file writing, file reading, and file processing operations, improves the accuracy and reliability of file data processing, improves file processing efficiency, and further improves the user experience.
Fig. 3 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 3: a processor (processor)301, a communication Interface (communication Interface)302, a memory (memory)303 and a communication bus 304, wherein the processor 301, the communication Interface 302 and the memory 303 complete communication with each other through the communication bus 304. Processor 301 may call logic instructions in memory 303 to perform a file data processing method comprising: acquiring configuration files of batch files, and analyzing configuration information of each file from the configuration files; the configuration information comprises file size and processing logic; under the condition that a target file with the file size larger than a preset threshold exists in the batch of files, splitting the target file to obtain a plurality of subfiles of the target file; calling a plurality of first threads from a thread pool, and writing a subfile corresponding to each first thread into a cache queue based on each first thread in the plurality of first threads; the plurality of first threads are to perform a write operation in parallel; calling a plurality of second threads from the thread pool, reading the subfile corresponding to each second thread from the cache queue based on each second thread in the plurality of second threads, and executing corresponding processing logic on the subfile corresponding to each second thread; the plurality of second threads to perform read operations in parallel and to perform processing logic in parallel; and acquiring the processing result of the target file according to the processing results of the plurality of subfiles.
In addition, the logic instructions in the memory 303 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program being capable of executing, when executed by a processor, a file data processing method provided by the above methods, the method including: acquiring configuration files of batch files, and analyzing configuration information of each file from the configuration files; the configuration information comprises file size and processing logic; under the condition that a target file with a file size larger than a preset threshold value exists in the batch of files, splitting the target file to obtain a plurality of subfiles of the target file; calling a plurality of first threads from a thread pool, and writing a subfile corresponding to each first thread into a cache queue based on each first thread in the plurality of first threads; the plurality of first threads are to perform a write operation in parallel; calling a plurality of second threads from the thread pool, reading the subfile corresponding to each second thread from the cache queue based on each second thread in the plurality of second threads, and executing corresponding processing logic on the subfile corresponding to each second thread; the plurality of second threads to perform read operations in parallel and to perform processing logic in parallel; and acquiring the processing result of the target file according to the processing results of the plurality of subfiles.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a file data processing method provided by the above methods, the method including: acquiring configuration files of batch files, and analyzing configuration information of each file from the configuration files; the configuration information comprises file size and processing logic; under the condition that a target file with the file size larger than a preset threshold exists in the batch of files, splitting the target file to obtain a plurality of subfiles of the target file; calling a plurality of first threads from a thread pool, and writing a subfile corresponding to each first thread into a cache queue based on each first thread in the plurality of first threads; the plurality of first threads are to perform a write operation in parallel; calling a plurality of second threads from the thread pool, reading the subfile corresponding to each second thread from the cache queue based on each second thread in the plurality of second threads, and executing corresponding processing logic on the subfile corresponding to each second thread; the plurality of second threads to perform read operations in parallel and to perform processing logic in parallel; and acquiring the processing result of the target file according to the processing results of the plurality of subfiles.
The above-described embodiments of the apparatus are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units — they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for processing file data, comprising:
acquiring configuration files of batch files, and parsing configuration information of each file from the configuration files; wherein the configuration information comprises file size and processing logic;
under the condition that a target file with the file size larger than a preset threshold exists in the batch of files, splitting the target file to obtain a plurality of subfiles of the target file;
calling a plurality of first threads from a thread pool, and writing a subfile corresponding to each first thread into a cache queue based on each first thread in the plurality of first threads; wherein the plurality of first threads perform write operations in parallel;
calling a plurality of second threads from the thread pool, reading the subfile corresponding to each second thread from the cache queue based on each second thread in the plurality of second threads, and executing corresponding processing logic on the subfile corresponding to each second thread; wherein the plurality of second threads perform read operations in parallel and execute the processing logic in parallel;
and acquiring the processing result of the target file according to the processing results of the plurality of subfiles.
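The producer/consumer pipeline of claim 1 can be illustrated with a minimal sketch. This is not the patented implementation; all helper names (`split_file`, `run_pipeline`) and the use of Python's `ThreadPoolExecutor` and `queue.Queue` are illustrative assumptions, with `len()` standing in for whatever processing logic the configuration file specifies.

```python
# Sketch of the claimed pipeline: "first threads" write subfiles to a cache
# queue in parallel, "second threads" read and process them in parallel, and
# the per-subfile results are gathered into the target file's result.
import queue
from concurrent.futures import ThreadPoolExecutor

def split_file(data: bytes, chunk_size: int) -> list:
    # Split the oversized target file into subfiles of at most chunk_size bytes.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def run_pipeline(data: bytes, chunk_size: int, workers: int = 4) -> list:
    subfiles = split_file(data, chunk_size)
    cache: "queue.Queue" = queue.Queue()

    def writer(subfile):            # role of a "first thread"
        cache.put(subfile)          # write operation into the cache queue

    def reader_processor(_):        # role of a "second thread"
        subfile = cache.get()       # read operation from the cache queue
        return len(subfile)         # stand-in for the configured processing logic

    with ThreadPoolExecutor(max_workers=workers) as pool:       # shared thread pool
        list(pool.map(writer, subfiles))                        # parallel writes
        results = list(pool.map(reader_processor, subfiles))    # parallel reads
    return results  # per-subfile results, merged into the target-file result
```

Because the writers run concurrently, the order in which subfiles land in the queue is not guaranteed, so downstream merging should not rely on arrival order.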
2. The file data processing method according to claim 1, wherein the configuration file further includes a thread pool parameter;
wherein the calling a plurality of first threads from the thread pool, and writing the subfile corresponding to each first thread in the plurality of first threads into the cache queue based on each first thread, comprises:
determining the total number of the first threads in the thread pool according to the thread pool parameters;
determining a target number of first threads to call according to the total number of the first threads and the running state of each first thread;
calling the first threads with the target number, reading the subfiles corresponding to each first thread based on each first thread in the first threads with the target number, and writing the read subfiles into the cache queue.
3. The file data processing method according to claim 2, wherein the invoking the target number of first threads, based on each first thread in the target number of first threads, reading a subfile corresponding to each first thread, and writing the read subfile into the cache queue comprises:
under the condition that the target number is smaller than the number of subfiles to be written among the plurality of subfiles, calling the target number of first threads, reading, based on each first thread in the target number of first threads, the subfile to be written corresponding to that first thread, writing the read subfiles to be written into the cache queue, continuously monitoring the running state of each first thread in the thread pool, and acquiring the remaining subfiles to be written among the plurality of subfiles;
and under the condition that a target first thread whose running state is idle exists in the thread pool, calling the target first thread, reading the remaining subfile to be written corresponding to the target first thread, and writing the read remaining subfile to be written into the cache queue, until the plurality of subfiles are all written into the cache queue.
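The dispatch loop of claims 2–3 — size the pool from the configured parameters, and hand remaining subfiles to whichever first thread becomes idle — behaves like a fixed-size executor whose internal work queue holds the excess tasks. The following sketch is an assumption-laden illustration (names such as `write_all` are hypothetical), not the patented scheduler:

```python
# When the target number of first threads is smaller than the number of
# subfiles, surplus subfiles wait inside the executor and are dispatched as
# worker threads become idle, mirroring the "monitor running state, call the
# idle target first thread" step of claim 3.
import threading
from concurrent.futures import ThreadPoolExecutor

def write_all(subfiles, cache, target_number):
    active = set()
    lock = threading.Lock()

    def write(subfile):
        with lock:                       # record which pool threads actually ran
            active.add(threading.current_thread().name)
        cache.append(subfile)            # stand-in for writing to the cache queue

    with ThreadPoolExecutor(max_workers=target_number) as pool:
        for sf in subfiles:
            pool.submit(write, sf)       # extra subfiles queue for idle threads
    return active                        # at most `target_number` distinct threads
```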
4. A file data processing method according to any one of claims 1 to 3, characterized in that the method further comprises:
writing the processing result of the subfile corresponding to each second thread into the storage file corresponding to each second thread;
acquiring files with the same processing logic in the batch files according to the processing logic of each file in the batch files;
merging, according to the storage files, the processing results corresponding to the files with the same processing logic, and writing the merged results into a database;
or, writing, according to the storage files, the processing results corresponding to the files with the same processing logic into the database in batches.
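The grouping step of claim 4 can be sketched in a few lines. The data shapes below (`file_configs`, `results`) are assumptions chosen for illustration; the claim itself does not fix how results are stored before the database write.

```python
# Group files by their configured processing logic, then merge each group's
# stored results so they can be written to the database as one combined or
# batch operation per logic type.
from collections import defaultdict

def group_results_by_logic(file_configs, results):
    # file_configs: {filename: logic_name}; results: {filename: [rows]}
    grouped = defaultdict(list)
    for filename, logic in file_configs.items():
        grouped[logic].extend(results.get(filename, []))
    return dict(grouped)  # one merged batch per processing logic
```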
5. The method according to any one of claims 1 to 3, wherein the obtaining the configuration files of the batch files and parsing the configuration information of each file from the configuration files comprises:
acquiring a configuration file of the batch file, and verifying the correctness of the configuration file;
under the condition that the configuration files pass verification, analyzing configuration information of each file from the configuration files;
under the condition that the configuration file fails verification, updating the configuration file according to the attribute information of the batch files, and verifying the correctness of the updated configuration file until the updated configuration file passes verification;
and under the condition that the updated configuration file passes verification, analyzing the configuration information of each file from the updated configuration file.
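The verify/update loop of claim 5 reduces to a retry pattern: check the configuration, and on failure rebuild it from the batch files' attribute information before checking again. The helper names (`verify`, `rebuild`) and the bounded retry count are assumptions for the sketch; the claim does not bound the loop.

```python
# Verify the configuration file; if verification fails, regenerate it from the
# batch files' attribute information and re-verify until it passes.
def load_verified_config(config, batch_attrs, verify, rebuild, max_tries=3):
    for _ in range(max_tries):
        if verify(config):
            return config              # parse configuration info from this config
        config = rebuild(batch_attrs)  # update config from the files' attributes
    raise ValueError("configuration file could not be verified")
```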
6. The file data processing method according to any one of claims 1 to 3, wherein the splitting the target file to obtain a plurality of subfiles of the target file comprises:
determining the splitting number of the target file according to the preset threshold and the attribute information of the target file;
splitting the target file according to the split number to obtain a plurality of subfiles of which the number is the split number;
or determining the splitting size of the target file according to the preset threshold and the attribute information of the target file;
splitting the target file according to the split size to obtain a plurality of subfiles of which the file size is smaller than or equal to the split size; wherein the split size is less than the preset threshold.
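The two splitting strategies of claim 6 are a small arithmetic choice: derive a split count from the threshold, or derive a split size that stays below it. The function names below are illustrative assumptions:

```python
# Either compute a split count from the threshold and the file's size, or
# compute a per-subfile split size below the threshold; both strategies yield
# subfiles no larger than the preset threshold.
import math

def split_count(file_size: int, threshold: int) -> int:
    # number of subfiles so that each subfile is at most `threshold` bytes
    return math.ceil(file_size / threshold)

def split_size(file_size: int, threshold: int, parts: int) -> int:
    # per-subfile size for a fixed number of parts, kept under the threshold
    size = math.ceil(file_size / parts)
    if size >= threshold:
        raise ValueError("choose more parts: split size must be < threshold")
    return size
```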
7. The file data processing method according to any one of claims 1 to 3, wherein the configuration information further includes a decompression state;
before the splitting the target file to obtain a plurality of subfiles of the target file, the method further includes:
judging whether the decompression state of each file in the batch of files is a to-be-decompressed state or not;
and decompressing the file with the decompression state to be the to-be-decompressed state in the batch of files.
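The pre-splitting step of claim 7 — check each file's configured decompression state and decompress only those marked as awaiting decompression — can be sketched as follows. The field names and the choice of gzip are assumptions; the patent does not specify an archive format.

```python
# Before splitting, decompress only the files whose configured decompression
# state is "to be decompressed"; gzip stands in for the actual archive format.
import gzip

def prepare_files(batch):
    # batch: list of {"data": bytes, "state": "plain" | "to_decompress"}
    prepared = []
    for f in batch:
        if f["state"] == "to_decompress":
            prepared.append(gzip.decompress(f["data"]))
        else:
            prepared.append(f["data"])
    return prepared
```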
8. A file data processing apparatus, characterized by comprising:
the parsing module is used for acquiring configuration files of the batch files and parsing the configuration information of each file from the configuration files; the configuration information comprises file size and processing logic;
the splitting module is used for splitting a target file with the file size larger than a preset threshold value in the batch of files to obtain a plurality of subfiles of the target file;
the cache module is used for calling a plurality of first threads from a thread pool and writing a subfile corresponding to each first thread into a cache queue based on each first thread in the plurality of first threads; the plurality of first threads are to perform a write operation in parallel;
the processing module is used for calling a plurality of second threads from the thread pool, reading a subfile corresponding to each second thread from the cache queue based on each second thread in the plurality of second threads, and executing corresponding processing logic on the subfile corresponding to each second thread; the plurality of second threads to perform read operations in parallel and to perform processing logic in parallel;
and the acquisition module is used for acquiring the processing result of the target file according to the processing results of the plurality of subfiles.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the file data processing method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the file data processing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210605685.7A CN115114247A (en) | 2022-05-30 | 2022-05-30 | File data processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115114247A true CN115114247A (en) | 2022-09-27 |
Family
ID=83327246
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210605685.7A Pending CN115114247A (en) | 2022-05-30 | 2022-05-30 | File data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115114247A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116541102A (en) * | 2023-06-27 | 2023-08-04 | 荣耀终端有限公司 | File preloading method and device, electronic equipment and readable storage medium |
CN117762873A (en) * | 2023-12-20 | 2024-03-26 | 中邮消费金融有限公司 | Data processing method, device, equipment and storage medium |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591725A (en) * | 2011-12-20 | 2012-07-18 | 浙江鸿程计算机系统有限公司 | Method for multithreading data interchange among heterogeneous databases |
US9316689B2 (en) * | 2014-04-18 | 2016-04-19 | Breker Verification Systems | Scheduling of scenario models for execution within different computer threads and scheduling of memory regions for use with the scenario models |
CN108984177A (en) * | 2018-06-21 | 2018-12-11 | 中国铁塔股份有限公司 | A kind of data processing method and system |
CN109446173A (en) * | 2018-09-18 | 2019-03-08 | 平安科技(深圳)有限公司 | Daily record data processing method, device, computer equipment and storage medium |
CN109525632A (en) * | 2018-09-30 | 2019-03-26 | 珠海派诺科技股份有限公司 | Gateway data uploads database connection digital control method, device, equipment and medium |
CN110389957A (en) * | 2019-07-24 | 2019-10-29 | 深圳市盟天科技有限公司 | Divide document generating method, device, storage medium and the equipment of table based on point library |
CN110851246A (en) * | 2019-09-30 | 2020-02-28 | 天阳宏业科技股份有限公司 | Batch task processing method, device and system and storage medium |
CN111338787A (en) * | 2020-02-04 | 2020-06-26 | 浙江大华技术股份有限公司 | Data processing method and device, storage medium and electronic device |
CN111797417A (en) * | 2020-07-06 | 2020-10-20 | 上海明略人工智能(集团)有限公司 | File uploading method and device, storage medium and electronic device |
CN111966656A (en) * | 2020-07-17 | 2020-11-20 | 苏州浪潮智能科技有限公司 | Method, system, terminal and storage medium for simulating high-load scene of storage file |
CN112445596A (en) * | 2020-11-27 | 2021-03-05 | 平安普惠企业管理有限公司 | Multithreading-based data import method and system and storage medium |
CN114297002A (en) * | 2021-12-30 | 2022-04-08 | 南京壹进制信息科技有限公司 | Mass data backup method and system based on object storage |
Non-Patent Citations (2)
Title |
---|
JASON P. SERMENO et al.: "Integrating Fuzzy Logic and Dynamic Programming in Multithreaded Concurrent File Transfer Schemes", IEEE, 3 February 2022 (2022-02-03), pages 56-62 *
JIANG Huiyou et al.: "Research on compatibility of high-performance network protocol stacks" (高性能网络协议栈兼容性研究), Telecommunications Science (电信科学), 20 May 2019 (2019-05-20), pages 25-31 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115114247A (en) | File data processing method and device | |
EP2876556A1 (en) | Fast restart of applications using shared memory | |
US8352946B2 (en) | Managing migration ready queue associated with each processor based on the migration ready status of the tasks | |
CN111400011B (en) | Real-time task scheduling method, system, equipment and readable storage medium | |
EP2506145A1 (en) | Method and device for pass-by breakpoint setting and debugging | |
CN112052082B (en) | Task attribute optimization method, device, server and storage medium | |
CN115086298A (en) | File transmission method and device | |
CN110851276A (en) | Service request processing method, device, server and storage medium | |
CN115408391A (en) | Database table changing method, device, equipment and storage medium | |
CN113127057B (en) | Method and device for parallel execution of multiple tasks | |
CN115185787A (en) | Method and device for processing transaction log | |
CN108062224B (en) | Data reading and writing method and device based on file handle and computing equipment | |
CN115729552A (en) | Method and device for setting parallelism of operator level | |
CN113778581A (en) | Page loading method, electronic equipment and storage medium | |
CN112748855B (en) | Method and device for processing high concurrency data request | |
CN113051055A (en) | Task processing method and device | |
CN115687491A (en) | Data analysis task scheduling system based on relational database | |
CN115437766A (en) | Task processing method and device | |
US8359602B2 (en) | Method and system for task switching with inline execution | |
CN117093335A (en) | Task scheduling method and device for distributed storage system | |
CN114253720A (en) | Multithreading task execution method and device, electronic equipment and storage medium | |
CN111881212A (en) | Method, device, equipment and storage medium for executing check point operation | |
CN111782426B (en) | Method and device for processing client tasks and electronic equipment | |
CN110780983A (en) | Task exception handling method and device, computer equipment and storage medium | |
CN113407331A (en) | Task processing method and device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |