CN107402950B

CN107402950B - File processing method and device based on sub-database and sub-table

Info

Publication number: CN107402950B
Application number: CN201710296156.2A
Authority: CN
Inventors: 丁彬
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2017-04-28
Filing date: 2017-04-28
Publication date: 2020-05-29
Anticipated expiration: 2037-04-28
Also published as: CN107402950A

Abstract

The application provides a file processing method and device based on sub-database and sub-table. The method comprises the following steps: splitting a file to be processed into a plurality of subfiles according to a preset service dimension; writing the data in the subfiles into corresponding database sub-tables; and calling a service system to process the data in the database sub-tables in a distributed parallel processing mode, so as to process the file to be processed. By persisting the data of the file to be processed in the database, the application can improve the file processing speed while also saving system resources and improving the data utilization rate.

Description

File processing method and device based on sub-database and sub-table

Technical Field

The application relates to the field of computer technology, and in particular to a file processing method and device based on sub-database and sub-table.

Background

In the related art, data can be stored in a file, and when the file is processed, a file system needs to read the data in the file line by line and call a service system for processing. How to increase the processing rate of the file has become an urgent problem to be solved at present.

Disclosure of Invention

In view of the above, the present application provides a file processing method and device based on sub-database and sub-table.

Specifically, the application is implemented through the following technical solutions:

A file processing method based on sub-database and sub-table includes:

splitting a file to be processed into a plurality of subfiles according to a preset service dimension;

writing the data in the subfiles into corresponding database sub-tables;

and calling a service system to process the data in the database sub-tables in a distributed parallel processing mode, so as to process the file to be processed.

A file processing device based on sub-database and sub-table comprises:

a splitting unit, configured to split the file to be processed into a plurality of subfiles according to a preset service dimension;

a writing unit, configured to write the data in the subfiles into corresponding database sub-tables;

and a processing unit, configured to call a service system to process the data in the database sub-tables in a distributed parallel processing mode, so as to process the file to be processed.

As can be seen from the above description, by persisting the data of the file to be processed in the database, the application can improve the file processing speed while also saving system resources and improving the data utilization rate.

Drawings

Fig. 1 is a flowchart illustrating a file processing method based on sub-database and sub-table according to an exemplary embodiment of the present application.

Fig. 2 is a flowchart illustrating the splitting of a file to be processed into a plurality of subfiles according to an exemplary embodiment of the present application.

Fig. 3 is a schematic flowchart illustrating another process of splitting a file to be processed into a plurality of subfiles according to an exemplary embodiment of the present application.

Fig. 4 is a schematic diagram of a file system according to an exemplary embodiment of the present application.

Fig. 5 is a block diagram of a file processing device based on sub-database and sub-table according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon", "when", or "in response to a determination", depending on the context.

In the related art, when processing a file, the file system may split the entire file into a plurality of sub-blocks and then process each sub-block in parallel. However, if a certain piece of data in the sub-block fails to be processed, the whole sub-block needs to be processed again, which causes low processing efficiency and waste of a large amount of processing resources.

To solve the above problem, the application provides a file processing scheme based on sub-database and sub-table.

Referring to Fig. 1, the file processing method based on sub-database and sub-table may be applied in a file system, where the file system is typically a server or a server cluster deployed by a service provider. The file processing method may include the following steps:

Step 101: split a file to be processed into a plurality of subfiles according to a preset service dimension.

Referring to Fig. 2 and Fig. 3, the process of splitting the file to be processed into a plurality of subfiles in this embodiment may include the following steps:

Step 1011: split the file to be processed into M logical partitions according to a preset block size.

In this embodiment, the block size may be set by a developer, for example 1 MB or 2 MB. Assuming that the file to be processed is 100 MB and the block size is 1 MB, in this step the file may be split into 100 logical partitions of 1 MB each. Specifically, the 1st logical partition corresponds to the data in the 0 MB to 1 MB interval of the file to be processed, the 2nd logical partition corresponds to the data in the 1 MB to 2 MB interval, and so on.

It should be noted that the splitting in this step is purely logical; the data of the file to be processed has not actually been processed at this point.
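The logical split of step 1011 can be sketched as a pure offset calculation in which no file data is read. The function name `logical_partitions` and the half-open byte-range representation are assumptions made for illustration; the source does not say how partition boundaries are represented or how a record that straddles a boundary is handled.

```python
from typing import List, Tuple

MB = 1024 * 1024  # assumption: "1M" in the text is read as one mebibyte


def logical_partitions(file_size: int, block_size: int) -> List[Tuple[int, int]]:
    """Return half-open byte ranges [start, end) that cover the whole file.

    Only offsets are computed; the file itself is not opened or read.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (start, min(start + block_size, file_size))
        for start in range(0, file_size, block_size)
    ]


if __name__ == "__main__":
    parts = logical_partitions(100 * MB, 1 * MB)
    print(len(parts))  # number of logical partitions (M)
    print(parts[1])    # byte range of the 2nd logical partition
```

A real splitter would probably also need to move each boundary to the nearest line break so that no record is cut in half; that adjustment is left out of this sketch.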

Step 1012: split each logical partition into N small files according to the preset service dimension, in a distributed parallel processing mode.

Based on the foregoing step 1011, after the file to be processed is split into M logical partitions, each logical partition may be split into N small files according to the preset service dimension in a distributed parallel processing mode. The service dimension may be set by a developer according to the service scenario, and may be a user ID, a bill ID, a flow ID, or the like, which is not particularly limited in the present application.

In this embodiment, M subtasks may be created, each corresponding to one logical partition and used to split that logical partition into N small files according to the service dimension. In practical applications, the subtasks can be distributed to different devices for execution through a message center, which improves the splitting efficiency.

Assume that the service dimension is the last two digits of the user ID. Taking one logical partition as an example, all data in the logical partition may be split into 100 parts: data whose user ID ends in 00 forms the first part, data whose user ID ends in 01 forms the second part, and so on. The data in the logical partition is thus split into 100 small files, that is, N equals 100. Of course, in practical applications, the data in the logical partition may also be split into 50 parts, where data whose user ID ends in 00 or 01 forms the first part, and so on; the application does not specifically limit this.
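As a rough illustration of step 1012, the following sketch buckets the records of one logical partition by the last two digits of the user ID. The record layout (comma-separated text with the user ID as the first field) and the names `dimension_key` and `split_partition` are assumptions; the source does not describe the file format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def dimension_key(user_id: str) -> str:
    """Service-dimension value: the last two digits of the user ID."""
    return user_id[-2:].zfill(2)


def split_partition(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group the records of one logical partition into small files.

    Each returned entry stands for one small file, keyed by the
    service-dimension value ("00" through "99").
    """
    small_files: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        user_id = line.split(",", 1)[0]  # assumption: user ID is field 0
        small_files[dimension_key(user_id)].append(line)
    return dict(small_files)


if __name__ == "__main__":
    partition = ["1000,pay", "2001,refund", "3100,pay"]
    for key, records in sorted(split_partition(partition).items()):
        print(key, records)
```

Splitting into 50 parts instead, as the text also allows, would only require a different `dimension_key`, for example one that maps 00 and 01 to the same bucket.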

Step 1013: merge the M small files with the same service dimension into one subfile, so that the file to be processed is split into N subfiles.

Based on the foregoing step 1012, after each logical partition is split into N small files, the small files with the same service dimension may be merged into one subfile.

Continuing the example in which the service dimension is the last two digits of the user ID and the data in each logical partition is split into 100 parts, the data whose user ID ends in 00 in the M logical partitions may be merged into one subfile, the data whose user ID ends in 01 may be merged into another, and so on, so that the file to be processed is split into 100 subfiles.
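The merge of step 1013 can be sketched as follows, assuming that each logical partition has already been turned into a mapping from service-dimension value to a list of records. The in-memory dictionaries and the name `merge_small_files` are illustrative stand-ins for the small files and subfiles of the text.

```python
from typing import Dict, List


def merge_small_files(
    partitions: List[Dict[str, List[str]]]
) -> Dict[str, List[str]]:
    """Merge the small files that share a service-dimension value.

    `partitions` holds one mapping per logical partition (M in total).
    The result holds one entry per subfile (up to N in total).
    """
    subfiles: Dict[str, List[str]] = {}
    for small_files in partitions:
        for key, records in small_files.items():
            subfiles.setdefault(key, []).extend(records)
    return subfiles


if __name__ == "__main__":
    partition_1 = {"00": ["1000,pay"], "01": ["2001,refund"]}
    partition_2 = {"00": ["3100,pay"]}
    print(merge_small_files([partition_1, partition_2]))
```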

Step 102: write the data in the subfiles into the corresponding database sub-tables.

In this embodiment, the data in each subfile may be inserted into the corresponding database sub-table in batches. Continuing the foregoing example, 100 databases may be provided, each with a corresponding sub-table for storing the data of one subfile. Referring to Fig. 3, the data in subfile 00 may be inserted into the sub-table of DB00.
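A minimal sketch of the batch insert in step 102 is shown below, using an in-memory SQLite database as a stand-in for one of the databases such as DB00. The table name `file_data`, its columns, the `PENDING` status value, and the batch size are all assumptions; the source does not specify a schema or a database product.

```python
import sqlite3
from typing import Iterable, Tuple


def open_sub_table_db() -> sqlite3.Connection:
    """Open an in-memory database that stands in for one shard (e.g. DB00)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_data ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " user_id TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'PENDING')"
    )
    return conn


def batch_insert(
    conn: sqlite3.Connection,
    rows: Iterable[Tuple[str, str]],
    batch_size: int = 500,
) -> int:
    """Insert (user_id, payload) rows in batches, one transaction per batch."""
    rows = list(rows)
    for i in range(0, len(rows), batch_size):
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO file_data (user_id, payload) VALUES (?, ?)",
                rows[i:i + batch_size],
            )
    return len(rows)


if __name__ == "__main__":
    db00 = open_sub_table_db()
    print(batch_insert(db00, [("1000", "pay"), ("3100", "pay")]))
```

The `status` column is included because the later processing step marks each piece of data as succeeded or failed; storing that mark next to the data is one possible design, not something the source prescribes.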

In this embodiment, when another service scenario needs to read the data in a file, it may first be determined whether the data in the file has been written into the database sub-tables; if so, the relevant data may be read directly from the database. Because the file data is persisted in the database, the file does not need to be processed again when its data is read again, which shortens the file reading time and improves the data utilization rate.
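The read path described above might look like the following sketch. How the file system records that a file has already been written to the database is not stated in the source; the `loaded_files` registry table, the `file_name` column, and the function names here are hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical registry table and a data table."""
    conn.execute("CREATE TABLE loaded_files (file_name TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE file_data ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " file_name TEXT NOT NULL,"
        " user_id TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )


def read_file_data(
    conn: sqlite3.Connection, file_name: str
) -> Optional[List[Tuple[str, str]]]:
    """Return the persisted rows of a file, or None if it was never loaded.

    A None result tells the caller to fall back to processing the file.
    """
    loaded = conn.execute(
        "SELECT 1 FROM loaded_files WHERE file_name = ?", (file_name,)
    ).fetchone()
    if loaded is None:
        return None
    return conn.execute(
        "SELECT user_id, payload FROM file_data"
        " WHERE file_name = ? ORDER BY id",
        (file_name,),
    ).fetchall()
```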

Step 103: call a service system to process the data in the database sub-tables in a distributed parallel processing mode, so as to process the file to be processed.

Based on the foregoing step 102, after the file to be processed has been split into a plurality of subfiles and the data in the subfiles has been stored in the database, a service system may be called to process the data in the database sub-tables in a distributed parallel processing mode, so as to process the file to be processed.

In this embodiment, when the file system processes the file data in the database sub-tables in a distributed parallel processing mode, a file processing main task may be created first and then divided into a plurality of subtasks. The number of subtasks is equal to the number of subfiles of the file to be processed. Continuing the foregoing example, the file processing main task may be split into 100 subtasks that correspond one to one with the database sub-tables shown in Fig. 3: subtask 00 corresponds to DB00, subtask 01 corresponds to DB01, and so on.
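The relationship between the main task and its subtasks can be sketched as below. A local thread pool stands in for the distributed dispatch of subtasks to different devices, which the source does not detail; `run_main_task`, `main_task_succeeded`, and the boolean subtask result are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_main_task(
    shard_keys: List[str],
    subtask: Callable[[str], bool],
    max_workers: int = 8,
) -> Dict[str, bool]:
    """Run one subtask per database sub-table and collect the results.

    `subtask` receives a shard key such as "00" (for DB00) and returns
    True when all data in that sub-table was processed successfully.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(subtask, shard_keys))
    return dict(zip(shard_keys, results))


def main_task_succeeded(results: Dict[str, bool]) -> bool:
    """The main task succeeds only when every subtask succeeded."""
    return all(results.values())


if __name__ == "__main__":
    keys = [f"{i:02d}" for i in range(100)]  # "00" .. "99"
    results = run_main_task(keys, lambda key: True)
    print(main_task_succeeded(results))
```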

In this embodiment, taking subtask 00 as an example, the service system may be called to process the data stored in the sub-table of DB00 one by one, that is, to process subfile 00 of the file to be processed. In general, subtask 00 may traverse the sub-table of DB00, marking a piece of data as successfully processed when the service system processes it successfully, and as failed when the service system fails to process it.

After subtask 00 has traversed the sub-table of DB00 once, it may traverse the sub-table again, extract the data that failed to be processed, and call the service system to process that data again, repeating until all data in the sub-table has been processed successfully, at which point subtask 00 has been executed successfully. When all 100 subtasks have been executed successfully, it may be determined that the file processing main task has been executed successfully, that is, the processing of the file to be processed is complete.
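The traverse, mark, and retry behaviour of a single subtask can be sketched as follows. Rows are held in memory as dictionaries rather than in a real sub-table, and the status strings, the `process` callback standing in for the service system, and the `max_passes` safety limit are assumptions; the source describes traversing until all data succeeds and does not mention a limit.

```python
from typing import Callable, Dict, List

PENDING, SUCCESS, FAILED = "PENDING", "SUCCESS", "FAILED"


def run_subtask(
    rows: List[Dict[str, str]],
    process: Callable[[Dict[str, str]], bool],
    max_passes: int = 10,
) -> bool:
    """Process each row, then re-process only rows not yet successful.

    Returns True when every row ends up marked SUCCESS.
    """
    for _ in range(max_passes):
        todo = [row for row in rows if row["status"] != SUCCESS]
        if not todo:
            return True
        for row in todo:
            try:
                ok = process(row)
            except Exception:
                ok = False  # treat an exception as a processing failure
            row["status"] = SUCCESS if ok else FAILED
    return all(row["status"] == SUCCESS for row in rows)
```

Because each pass selects only rows whose status is not `SUCCESS`, a single failing row costs one extra call on the next pass rather than a repeat of the whole sub-table.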

In this embodiment, although a subtask repeatedly traverses its database sub-table, only the data that has not yet been processed successfully is processed during each traversal. Compared with the related-art scheme in which a whole sub-block must be reprocessed if a single piece of data fails, this can greatly increase the file processing speed, save the processing resources of the file system, and reduce the dependency on the stability of associated systems.

Corresponding to the embodiment of the file processing method based on sub-database and sub-table, the application also provides an embodiment of a file processing device based on sub-database and sub-table.

The embodiment of the file processing device based on sub-database and sub-table can be applied to a file system. The device embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device, as a logical device, is formed by the processor of the file system in which it is located reading the corresponding computer program instructions from the non-volatile memory into the memory and running them. In terms of hardware, Fig. 4 shows a hardware structure diagram of the file system in which the file processing device based on sub-database and sub-table is located. In addition to the processor, the memory, the network interface, and the non-volatile memory shown in Fig. 4, the file system in which the device is located may also include other hardware according to its actual function, which is not described again here.

Referring to Fig. 5, the file processing device 400 based on sub-database and sub-table can be applied to the file system shown in Fig. 4, and includes: a splitting unit 401, a writing unit 402, a processing unit 403, and a reading unit 404.

The splitting unit 401 splits the file to be processed into a plurality of subfiles according to a preset service dimension.

The writing unit 402 writes the data in the subfiles into the corresponding database sub-tables.

The processing unit 403 calls a service system to process the data in the database sub-tables in a distributed parallel processing mode, so as to process the file to be processed.

Optionally, the splitting unit 401:

splits the file to be processed into M logical partitions according to a preset block size;

splits each logical partition into N small files according to the preset service dimension, in a distributed parallel processing mode;

merges the M small files with the same service dimension into one subfile, so that the file to be processed is split into N subfiles;

wherein, M and N are both natural numbers larger than 1.

Optionally, the service dimension includes: user ID, flow ID, bill ID.

Optionally, the processing unit 403:

creating a file processing main task, and splitting the file processing main task into a plurality of subtasks;

the subtasks correspond to the database sub-tables one to one, the database sub-tables correspond to the subfiles one to one, and the subtasks are used for calling the service system to process data in the corresponding database sub-tables one by one and marking the processing result of each piece of data;

when all data in the database sub-tables are processed successfully, determining that the subtasks are executed successfully;

and when all the subtasks are successfully executed, determining that the file processing main task is successfully executed, and processing the file to be processed is realized.

Optionally, for the data that fails to be processed in the sub-table of the database, the subtask is further configured to continue to call the service system to re-process the data that fails to be processed, and update the processing result of the data.

The reading unit 404, when receiving a file reading instruction, determines whether a file to be read has been written into the database sub-table; and if the file to be read is determined to be written into the database sub-table, reading the data of the file to be read from the database sub-table.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims

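1. A file processing method based on sub-database and sub-table includes:

splitting a file to be processed into a plurality of subfiles according to a preset service dimension;

writing the data in the subfiles into corresponding database sub-tables;

calling a service system to process the data in the database sub-tables in a distributed parallel processing mode, so as to process the file to be processed;

wherein the splitting of the file to be processed into a plurality of subfiles according to the preset service dimension includes:

splitting the file to be processed into M logical partitions according to a preset block size;

splitting each logical partition into N small files according to the preset service dimension, in a distributed parallel processing mode;

merging the M small files with the same service dimension into one subfile, so that the file to be processed is split into N subfiles;

wherein M and N are both natural numbers larger than 1.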

2. The method of claim 1, wherein the service dimension comprises: user ID, flow ID, bill ID.

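3. The method of claim 1, wherein the calling a service system to process the data in the database sub-tables in a distributed parallel processing mode comprises:

creating a file processing main task, and splitting the file processing main task into a plurality of subtasks;

wherein the subtasks correspond to the database sub-tables one to one, the database sub-tables correspond to the subfiles one to one, and each subtask is used for calling the service system to process the data in the corresponding database sub-table one by one and marking the processing result of each piece of data;

when all data in a database sub-table is processed successfully, determining that the corresponding subtask is executed successfully;

and when all the subtasks are executed successfully, determining that the file processing main task is executed successfully, so that the processing of the file to be processed is completed.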

4. The method according to claim 3, wherein for the data that fails to be processed in the database sub-table, the subtask is further configured to continue to call the service system to re-process the data that fails to be processed, and to update the processing result of the data.

5. The method of claim 1, further comprising:

when a file reading instruction is received, judging whether a file to be read is written into a database sub-table or not;

and if the file to be read is determined to be written into the database sub-table, reading the data of the file to be read from the database sub-table.

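6. A file processing device based on sub-database and sub-table comprises:

a splitting unit, configured to split a file to be processed into a plurality of subfiles according to a preset service dimension;

a writing unit, configured to write the data in the subfiles into corresponding database sub-tables;

a processing unit, configured to call a service system to process the data in the database sub-tables in a distributed parallel processing mode, so as to process the file to be processed;

wherein the splitting unit:

splits the file to be processed into M logical partitions according to a preset block size;

splits each logical partition into N small files according to the preset service dimension, in a distributed parallel processing mode;

merges the M small files with the same service dimension into one subfile, so that the file to be processed is split into N subfiles;

wherein M and N are both natural numbers larger than 1.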

7. The apparatus of claim 6, wherein the service dimension comprises: user ID, flow ID, bill ID.

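8. The apparatus of claim 6, wherein the processing unit:

creates a file processing main task, and splits the file processing main task into a plurality of subtasks;

wherein the subtasks correspond to the database sub-tables one to one, the database sub-tables correspond to the subfiles one to one, and each subtask is used for calling the service system to process the data in the corresponding database sub-table one by one and marking the processing result of each piece of data;

when all data in a database sub-table is processed successfully, determines that the corresponding subtask is executed successfully;

and when all the subtasks are executed successfully, determines that the file processing main task is executed successfully, so that the processing of the file to be processed is completed.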

9. The apparatus according to claim 8, wherein for the data that fails to be processed in the database sub-table, the subtask is further configured to continue to call the service system to re-process the data that fails to be processed, and to update the processing result of the data.

10. The apparatus of claim 6, further comprising:

the reading unit is used for judging whether the file to be read is written into the sub-table of the database or not when receiving the file reading instruction; and if the file to be read is determined to be written into the database sub-table, reading the data of the file to be read from the database sub-table.