CN112416865A - File processing method and device based on big data - Google Patents


Info

Publication number
CN112416865A
CN112416865A (application CN202011315296.8A)
Authority
CN
China
Prior art keywords
file
processing
parameters
data processing
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011315296.8A
Other languages
Chinese (zh)
Inventor
张�浩
陈军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202011315296.8A
Publication of CN112416865A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/35 Creation or generation of source code model driven

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a file processing method and device based on big data, and relates to the technical field of computers. A specific implementation of the big-data-based file processing method comprises the following steps: parsing data processing parameters and file parameters from a file processing model that defines the parameters required for file processing, the file parameters being parameters related to an input file and an output file; and calling a corresponding processor according to the data processing parameters, performing data processing on the input file identified by the file parameters, and writing the processing result into the output file. This implementation encapsulates big-data file processing capability and isolates applications from technologies, so that application developers can build big-data applications without mastering specific big data development technologies and tools.

Description

File processing method and device based on big data
Technical Field
The invention relates to the technical field of computers, in particular to a file processing method and device based on big data.
Background
Existing file processing approaches tightly couple services to technologies: the file processing process requires customized development, and developers must master a particular big data technology. When that technology iterates, migration and upgrading become difficult.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for processing a file based on big data, which can solve the problem of high coupling between services and technologies in the existing file processing method.
To achieve the above object, according to an aspect of an embodiment of the present invention, a file processing method based on big data is provided.
The file processing method based on big data comprises the following steps:
analyzing data processing parameters and file parameters from a file processing model for defining parameters required by file processing; the file parameters are parameters related to an input file and an output file;
and calling a corresponding processor according to the data processing parameters, carrying out data processing on the input file of the file parameters, and writing a result obtained by the processing into the output file.
Optionally, after the step of parsing out the data processing parameters and the file parameters from the file processing model for defining the parameters required for file processing, the method further includes:
managing metadata of an input file and an output file in a file proxy mode, and determining a distribution path for distributing the output file to a distributed file cluster;
after the steps of calling a corresponding processor according to the data processing parameters, performing data processing on the input file of the file parameters, and writing a result obtained by the processing into the output file, the method further comprises:
and distributing the output file to the corresponding distributed file cluster according to the distribution path.
Optionally, after the step of parsing out the data processing parameters and the file parameters from the file processing model for defining the parameters required for file processing, the method further includes:
when file access is carried out, index information of the file is inquired; wherein the file may be an input file or an output file;
if the index information exists, returning the real path of the file;
and if the index information does not exist, generating a physical path of the file.
Optionally, querying the file index information includes:
inquiring file index information according to the batch number ID, the branch, and the KEY value KEY;
generating the physical path of the file comprises:
and acquiring a file root directory from the file root path mapping rule according to the branches, and generating a physical path of the file according to the directory splitting rule.
Optionally, the file proxy approach supports one or more of: local single path, local random sharding path, open source database, distributed file system, log type database, and distributed document storage database.
Optionally, the file processing model includes an input file list, an output file list, and an operation set, wherein the operation set is a set formed by at least one operator.
Optionally, the operation set comprises at least one or more operators: association, aggregation, summation, and procedural processing.
Optionally, the data processing parameters include at least: data fragmentation rules and operators;
calling a corresponding processor according to the data processing parameters, performing data processing on the input file of the file parameters, and writing a result obtained by the processing into the output file, wherein the step comprises the following steps:
fragmenting data in the input file according to a data fragmentation rule;
and calling a processor corresponding to the operator according to the operator to perform data processing on the input file, and writing a processing result into the output file.
Optionally, after the step of fragmenting the data in the input file according to the data fragmentation rule, the method further includes:
and performing association, sorting or filtering operation on the input file after the fragmentation processing.
Optionally, after the step of parsing out the data processing parameters and the file parameters from the file processing model for defining the parameters required for file processing, the method further includes:
and generating a path and a file format of the input file and a path and a file format of the output file according to the file parameters obtained by analysis.
Optionally, the invoked processor supports open-source programs such as Spark, Flink, or Java.
Optionally, the invoked processor supports both distributed file systems and shared storage based on the network file system (NFS) protocol.
To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided a file processing apparatus based on big data.
The file processing device based on big data of the embodiment of the invention comprises:
the analysis module is used for analyzing data processing parameters and file parameters from a file processing model used for defining parameters required by file processing; the file parameters are parameters related to an input file and an output file;
and the processing module is used for calling a corresponding processor according to the data processing parameters, carrying out data processing on the input file of the file parameters and writing a result obtained by the processing into the output file.
To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided a server.
The server of the embodiment of the invention comprises:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided a computer-readable medium.
A computer-readable medium of an embodiment of the invention has stored thereon a computer program which, when executed by a processor, implements the method as described above.
One embodiment of the above invention has the following advantages or benefits:
1) This embodiment isolates applications from technologies. Data processing is realized on top of a data processing model, which can be instantiated with different technologies to raise processing capacity. When the underlying technology architecture changes, the application itself requires no modification; when a better technology emerges, the relevant services can be migrated to the new platform simply by instantiating the model on it.
2) This implementation encapsulates big-data file processing capability, so that application developers can develop and deliver big-data functionality without mastering specific big data development technologies and tools. It lowers the difficulty of data processing for developers: the design is highly abstract, greatly reducing developers' dependence on concrete big data technologies.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a flowchart illustrating a big data based file processing method according to a first embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a file processing model according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a big data based file processing method according to a second embodiment of the present invention;
FIG. 4 is a block diagram of a big data based file processing apparatus according to a first embodiment of the present invention;
FIG. 5 is a block diagram of a big data based file processing apparatus according to a second embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In a mainframe environment, data sets are generally processed with DFSORT; when a system migrates to an open environment, an equivalent capability is needed to support offline batch services. In an open environment, there are two ways to handle large-scale data processing: one is table association using the SQL (Structured Query Language) capability of a database; the other is an offline mode that processes offline files with big data technology.
For a single table holding hundreds of millions of rows, data retrieval through table association is inefficient because of the limits of the database's processing capacity and temporary table space, and it heavily impacts online transactions.
To solve the problems of the existing file processing approaches, the embodiment of the invention abstracts a file processing model by analyzing the structure and processing flow of big data files, and designs a file processor and a file proxy framework around that model. With this design, the underlying big data processing technology can be replaced flexibly without redevelopment.
It can be understood that the operation of big data files is realized on top of a file processing model: a file processor and a file proxy mechanism are constructed using object-oriented design concepts, and business encapsulation of different big data technologies can be realized in the processor. The processor has processing capability over HDFS (Hadoop Distributed File System) and over shared storage based on the NFS (Network File System) protocol, can perform flow-based scheduling control, and can compress and decompress different kinds of files.
1) File handling encapsulation mechanism
Through interface-oriented design, the file processing model is operated on, and a big data processing technology is invoked to instantiate it, for example as a Spark processing model or a Flink processing model.
2) File proxy encapsulation
In the process of processing files, a file proxy design is added so that file processing is independent of storage: the method can support the HDFS distributed file processing technology and can also process shared storage (NAS and the like) based on the NFS protocol. Traffic scheduling for shared storage can also be implemented on top of the file proxy.
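To make the storage-independence idea concrete, the following is a minimal sketch of a file proxy interface. The names `FileProxy` and `LocalProxy` and the in-memory backend are illustrative assumptions, not the patent's actual implementation; a real HDFS or NFS backend would implement the same interface.

```python
from abc import ABC, abstractmethod

class FileProxy(ABC):
    """Storage-agnostic file access: processing code talks only to this
    interface, never to a concrete storage technology (HDFS, NFS, ...)."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class LocalProxy(FileProxy):
    """Stand-in for the 'local single path' backend, using an in-memory dict."""

    def __init__(self):
        self.store = {}

    def read(self, path: str) -> bytes:
        return self.store[path]

    def write(self, path: str, data: bytes) -> None:
        self.store[path] = data
```

Swapping `LocalProxy` for a hypothetical `HdfsProxy` or `NfsProxy` would then change the storage without touching any processing code.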
Based on the above analysis, the embodiment of the present invention provides a big data-based file processing method, and the file processing method may adopt a big data processing technology to perform operations such as association, sorting, and filtering on a file set. Fig. 1 is a flowchart illustrating a big data based file processing method according to a first embodiment of the present invention, and as shown in fig. 1, the file processing method may include steps S101 to S102 as follows.
Step S101: analyzing data processing parameters and file parameters from a file processing model for defining parameters required by file processing; the file parameters are parameters relating to the input file and the output file.
In step S101, the file processing model is used to define the parameters required for file processing and includes an input file list, an output file list, and an operation set; the structure of the file processing model is shown in FIG. 2. The input file list comprises at least the input file names; the output file list comprises at least the output file names, the information contained in each output file, and the like. The operation set is a set formed by at least one operator; an operator represents the type of operation executed on a file, and the operation set comprises at least one or more of the following operators: association, aggregation, summation, and procedural processing.
Further, data processing parameters and file parameters can be parsed from the file processing model by the file processing model parser. Wherein the data processing parameters are used to indicate parameters required for big data processing, such as: operation type, operation mode, operation object, and the like. The file parameters are parameters relating to the input file and the output file. The file parameters at least comprise: the name of the input file, the name of the output file, the format of the output file, etc. After the file parameters are obtained through analysis, the path and the file format of the input file and the path and the file format of the output file can be generated according to the file parameters obtained through analysis.
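As an illustrative sketch of the parsing step, the model and parser below split a file processing model into data processing parameters and file parameters. The names `FileProcessingModel` and `parse_model` and the field layout are assumptions for illustration; the patent does not specify them.

```python
from dataclasses import dataclass

@dataclass
class FileProcessingModel:
    """Defines the parameters required for file processing."""
    input_files: list    # input file list: at least the input file names
    output_files: list   # output file list: names plus contained information
    operations: list     # operation set: at least one operator

def parse_model(model: FileProcessingModel):
    """Split the model into data processing parameters and file parameters,
    mirroring the role of the file processing model parser in the text."""
    data_processing_params = {
        # operator types, e.g. association, aggregation, summation
        "operators": [op["type"] for op in model.operations],
    }
    file_params = {
        "input_names": model.input_files,
        "output_names": [f["name"] for f in model.output_files],
    }
    return data_processing_params, file_params
```

From the file parameters, the paths and file formats of the input and output files would then be generated, as the text describes.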
After step S101, the metadata of the input file and the output file may be managed in a file proxy manner, and a distribution path for distributing the output file to a distributed file cluster is determined. Then, after step S102, the output file is distributed to the corresponding distributed file cluster according to the distribution path. It should be noted that the file proxy mode supports one or more of the following: local single path, local random sharding path, open source database, distributed file system, log type database, and distributed document storage database.
It is emphasized that file routing, file load control, file path management, and the like may be implemented by way of the file proxy. The file proxy can also write files into different file systems in parallel when files are generated, improving overall big-data processing performance. The file proxy mode in big data processing mainly addresses the problem that too many files in a single distributed file system put management pressure on its metadata: through the file proxy, the large number of files belonging to one system can be split across different distributed file clusters, avoiding the performance pressure of holding too many files under a single cluster.
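A minimal sketch of the cluster-splitting idea, assuming a routing scheme in which the distribution path's prefix decides the target cluster (the names `CLUSTERS` and `distribute` and the prefix convention are hypothetical):

```python
# Hypothetical registry of distributed file clusters, keyed by path prefix.
CLUSTERS = {"/clusterA": [], "/clusterB": []}

def distribute(output_file: str, distribution_path: str) -> str:
    """Route an output file to the cluster whose prefix matches its
    distribution path, so one system's files spread over several clusters."""
    for prefix, files in CLUSTERS.items():
        if distribution_path.startswith(prefix):
            files.append(output_file)
            return prefix
    raise ValueError(f"no cluster registered for {distribution_path}")
```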
When file access is carried out, index information of the file is inquired; wherein the file may be an input file or an output file; if the index information exists, returning the real path of the file; and if the index information does not exist, generating a physical path of the file. Further, file index information can be queried according to a batch number (ID), a branch and a KEY value (KEY); and acquiring a file root directory from the file root path mapping rule according to the branches, and generating a physical path of the file according to the directory splitting rule.
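The index lookup and path generation just described can be sketched as follows. The root-path mapping, the 16-way directory-splitting rule, and the names `resolve_file_path`, `ROOT_PATH_MAP`, and `FILE_INDEX` are assumptions made for illustration only.

```python
import zlib

# Hypothetical file root path mapping rule: branch -> file root directory.
ROOT_PATH_MAP = {"branch_a": "/data/cluster1", "branch_b": "/data/cluster2"}

# File index: (batch ID, branch, KEY) -> real path of an already-known file.
FILE_INDEX = {}

def resolve_file_path(batch_id: str, branch: str, key: str) -> str:
    """Return the real path on an index hit; otherwise generate and register
    a physical path from the root-path mapping and directory-splitting rules."""
    index_key = (batch_id, branch, key)
    if index_key in FILE_INDEX:
        return FILE_INDEX[index_key]          # index exists: real path
    root = ROOT_PATH_MAP[branch]              # root directory for this branch
    # Hypothetical directory-splitting rule: hash KEY into 16 subdirectories.
    subdir = f"{zlib.crc32(key.encode()) % 16:02d}"
    path = f"{root}/{subdir}/{batch_id}_{key}"
    FILE_INDEX[index_key] = path              # register the generated path
    return path
```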
Step S102: and calling a corresponding processor according to the data processing parameters, carrying out data processing on the input file of the file parameters, and writing a result obtained by the processing into the output file.
In step S102, the data processing parameters are obtained through the file processing model parser; they comprise at least an operation type, an operation mode, and an operation object. A corresponding processor is called according to these, and the big data processing result is generated by executing the processor. The invoked processor supports open-source programs such as Spark, Flink, or Java (an object-oriented programming language). Spark is a fast, general-purpose computing engine designed for large-scale data processing. Flink is an open-source stream processing framework whose core is a distributed streaming data engine written in Java and Scala (a programming language). The invoked processors support both distributed file systems and shared storage based on the NFS (Network File System) protocol.
Further, the data processing parameters include at least: data fragmentation rules and operators; wherein the fragmentation rule at least comprises one or more of the following: number of pieces, file size, and slice key. And after the corresponding processor is called, fragmenting the data in the input file according to the data fragmentation rule through the called processor. And then calling a processor corresponding to the operator according to the operator to perform data processing on the input file, and writing a processing result into the output file. It will be appreciated that when a processor is invoked, different operators will invoke different processors to perform data processing on the input file.
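The fragmentation step can be sketched as below, using the shard-count and slice-key rule kinds the text names (file-size-based sharding would follow the same pattern). The function name `shard_records` and the record layout are assumptions.

```python
import zlib

def shard_records(records, num_shards, slice_key):
    """Fragment input records according to a data fragmentation rule:
    route each record to a shard by hashing its slice-key field."""
    shards = [[] for _ in range(num_shards)]
    for rec in records:
        # stable hash of the slice-key value selects the target shard
        idx = zlib.crc32(str(rec[slice_key]).encode()) % num_shards
        shards[idx].append(rec)
    return shards
```

After sharding, each shard would be handed to the processor matching the model's operator, and the results written to the output file.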
It should be noted that after the input file is sliced, the input file after the slicing may be associated, sorted, or filtered.
Through modeling technology, a file operation model is designed, covering the input, output, processing mode, and so on of a file. The code of the file processing process is shown in Figure BDA0002791160750000081 (available only as an image in the original publication).
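Since the original listing survives only as an image, the following is a hedged reconstruction sketch of what such a model definition and its execution might look like. The model layout, the `PROCESSORS` registry, and `run_model` are all assumptions; the in-process functions merely stand in for the Spark/Flink processors the text describes.

```python
# A hypothetical model instance: input file list, output file list, operation set.
model = {
    "input_files":  [{"name": "trades.dat", "format": "csv"}],
    "output_files": [{"name": "summary.dat", "format": "csv"}],
    "operations":   [{"type": "summation", "field": "amount"}],
}

# Minimal processor registry: each operator type maps to a processing function.
PROCESSORS = {
    "summation": lambda rows, op: [{"total": sum(r[op["field"]] for r in rows)}],
}

def run_model(model, rows):
    """Apply each operation in the model's operation set to the input rows,
    dispatching to the processor registered for that operator."""
    for op in model["operations"]:
        rows = PROCESSORS[op["type"]](rows, op)
    return rows
```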
The embodiment of the invention has the following advantages:
1) This embodiment isolates applications from technologies. Data processing is realized on top of a data processing model, which can be instantiated with different technologies to raise processing capacity. When the underlying technology architecture changes, the application itself requires no modification; when a better technology emerges, the relevant services can be migrated to the new platform simply by instantiating the model on it.
2) This implementation encapsulates big-data file processing capability, so that application developers can develop and deliver big-data functionality without mastering specific big data development technologies and tools. It lowers the difficulty of data processing for developers: the design is highly abstract, greatly reducing developers' dependence on concrete big data technologies.
Based on the above analysis, the embodiment of the present invention provides a file processing method based on big data. Fig. 3 is a flowchart illustrating a big data based file processing method according to a second embodiment of the present invention, and as shown in fig. 3, the file processing method may include steps S301 to S304 as follows.
Step S301: analyzing data processing parameters and file parameters from a file processing model for defining parameters required by file processing; the file parameters are parameters relating to the input file and the output file.
The file processing model is used to define the parameters required for file processing and comprises an input file list, an output file list, and an operation set. The input file list comprises at least the input file names; the output file list comprises at least the output file names, the information contained in each output file, and the like. The operation set is a set formed by at least one operator; an operator represents the type of operation executed on a file, and the operation set comprises at least one or more of the following operators: association, aggregation, summation, and procedural processing.
Further, data processing parameters and file parameters can be parsed from the file processing model by the file processing model parser. Wherein the data processing parameters are used to indicate parameters required for big data processing, such as: operation type, operation mode, operation object, and the like. The file parameters are parameters relating to the input file and the output file. The file parameters at least comprise: the name of the input file, the name of the output file, the format of the output file, etc. After the file parameters are obtained through analysis, the path and the file format of the input file and the path and the file format of the output file can be generated according to the file parameters obtained through analysis.
Step S302: and managing the metadata of the input file in a file proxy mode, and determining a distribution path for distributing the file to the distributed file cluster.
In step S302, when performing file access to an input file and an output file in a file proxy manner, index information of the file is queried; wherein the file may be an input file or an output file; if the index information exists, returning the real path of the file; and if the index information does not exist, generating a physical path of the file. Further, file index information can be queried according to a batch number (ID), a branch and a KEY value (KEY); and acquiring a file root directory from the file root path mapping rule according to the branches, and generating a physical path of the file according to the directory splitting rule.
It should be noted that the file proxy mode supports one or more of the following: local single path, local random sharding path, open source database, distributed file system, log type database, and distributed document storage database.
Step S303: and calling a corresponding processor according to the data processing parameters, carrying out data processing on the input file of the file parameters, and writing a result obtained by the processing into the output file.
In step S303, the called processor supports open-source programs such as Spark, Flink, or Java (an object-oriented programming language). Spark is a fast, general-purpose computing engine designed for large-scale data processing; Flink is an open-source stream processing framework whose core is a distributed streaming data engine written in Java and Scala. The invoked processors support both distributed file systems and shared storage based on the NFS (Network File System) protocol.
Further, the data processing parameters include at least: data fragmentation rules and operators; wherein the fragmentation rule at least comprises one or more of the following: number of pieces, file size, and slice key. And after the corresponding processor is called, fragmenting the data in the input file according to the data fragmentation rule through the called processor. And then calling a processor corresponding to the operator according to the operator to perform data processing on the input file, and writing a processing result into the output file. It will be appreciated that when a processor is invoked, different operators will invoke different processors to perform data processing on the input file.
It should be noted that after the input file is sliced, the input file after the slicing may be associated, sorted, or filtered.
Step S304: and distributing the files to the corresponding distributed file clusters according to the distribution paths.
The embodiment of the invention has the following advantages:
1) This embodiment isolates applications from technologies. Data processing is realized on top of a data processing model, which can be instantiated with different technologies to raise processing capacity. When the underlying technology architecture changes, the application itself requires no modification; when a better technology emerges, the relevant services can be migrated to the new platform simply by instantiating the model on it.
2) This implementation encapsulates big-data file processing capability, so that application developers can develop and deliver big-data functionality without mastering specific big data development technologies and tools. It lowers the difficulty of data processing for developers: the design is highly abstract, greatly reducing developers' dependence on concrete big data technologies.
Fig. 4 is a block diagram of a big data based file processing apparatus according to a first embodiment of the present invention, and referring to fig. 4, the file processing apparatus 400 may include the following modules:
the analysis module 401 is configured to analyze a data processing parameter and a file parameter from a file processing model used for defining parameters required for file processing; the file parameters are parameters related to an input file and an output file;
and the processing module 402 is configured to call a corresponding processor according to the data processing parameters, perform data processing on the input file of the file parameters, and write a result obtained by the processing into the output file.
Optionally, the file processing apparatus 400 further includes:
the determining module is used for managing metadata of the input file and the output file in a file proxy mode and determining a distribution path for distributing the output file to the distributed file cluster;
and the distribution module is used for distributing the output file to the corresponding distributed file cluster according to the distribution path.
Optionally, the file processing apparatus 400 further includes:
the query module is used for querying the index information of the file when the file is accessed; wherein the file may be an input file or an output file;
the return module is used for returning the real path of the file if the index information exists;
and the generating module is used for generating a physical path of the file if the index information does not exist.
Optionally, the generating module is further configured to:
inquiring file index information according to the batch number ID, the branch and the KEY value KEY;
generating the physical path of the file comprises:
and acquiring a file root directory from the file root path mapping rule according to the branches, and generating a physical path of the file according to the directory splitting rule.
Optionally, the file proxy approach supports one or more of: local single path, local random sharding path, open source database, distributed file system, log type database, and distributed document storage database.
Optionally, the file processing model includes: an input file list, an output file list, and an operation set, where the operation set refers to a set formed by at least one operator.
Optionally, the operation set includes at least one of the following operators: association, aggregation, summation, and program processing.
Optionally, the data processing parameters include at least: data fragmentation rules and operators; the processing module 402 is further configured to:
fragmenting data in the input file according to a data fragmentation rule;
and calling the processor corresponding to the operator to perform data processing on the input file, and writing the processing result into the output file.
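A minimal sketch of the fragment-then-dispatch step is shown below, using plain Python in place of a real Spark or Flink processor; the fragmentation rule (round-robin by number of pieces) and the operator table are illustrative assumptions.

```python
def shard(records, pieces):
    """Fragment the input records round-robin into the given number of pieces."""
    shards = [[] for _ in range(pieces)]
    for i, rec in enumerate(records):
        shards[i % pieces].append(rec)
    return shards

# Operator -> processor table; each processor works on one fragment.
PROCESSORS = {
    "summation": lambda recs: [sum(recs)],
    "aggregation": lambda recs: [len(recs)],  # record count as a stand-in aggregate
}

def process(records, operator, pieces=2):
    """Fragment the data, run the operator's processor on each fragment,
    and merge the partial results (both sample operators merge by summation)."""
    parts = shard(records, pieces)
    partials = [PROCESSORS[operator](p) for p in parts]
    return [sum(x[0] for x in partials)]

out = process([1, 2, 3, 4, 5], "summation")
```

A production version would replace `PROCESSORS` with calls into the underlying engine, but the dispatch shape stays the same: the operator selects the processor, and the fragmentation rule selects the parallelism.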
Optionally, the file processing apparatus 400 further includes:
and the operation module is used for performing association, sorting, or filtering operations on the input file after fragmentation processing.
Optionally, the file processing apparatus 400 further includes:
and the first generation module is used for generating a path and a file format of the input file and a path and a file format of the output file according to the file parameters obtained by analysis.
Optionally, the invoked processor supports an open-source program: Spark, Flink, or Java.
Optionally, the invoked processor is provided with the processing capability of a distributed file system, and may be based on shared storage using the Network File System (NFS) protocol.
The embodiment of the invention has the following advantages:
1) This embodiment can isolate applications from technologies. Data processing is realized through the data processing model design, and the data processing capability can be improved by instantiating the model with different technologies. When the technology architecture changes, no change to the application is required; when a better technology becomes available, the service can be migrated to the new technology platform simply by re-instantiating the model on that technology.
2) This embodiment encapsulates big-data-based file processing capabilities, so that application developers can develop and implement big data processing without mastering specific big data development technologies and tools. This reduces the difficulty of data processing for developers; the highly abstract design greatly reduces their dependence on concrete big data technologies.
Fig. 5 is a block diagram of a big data based file processing apparatus according to a second embodiment of the present invention, and referring to fig. 5, the file processing apparatus may include: the system comprises a file processing model parser, a file agent and a big data executor.
The file agent: big data processing uses shared or distributed storage, which becomes a performance bottleneck when the number of files is too large. The file agent can be understood as the metadata manager of the big data processing apparatus; through the file agent, multiple distributed file system clusters can be used, solving the performance problem caused by an excessive number of files. That is, data distribution is realized through the file agent, improving big data processing efficiency. The file agent supports local shared-storage files, HDFS files, HBase (a distributed, column-oriented open-source database), Redis (an open-source, network-enabled key-value database written in ANSI C that can be log-persisted or held in memory), MongoDB (a distributed document storage database), and the like.
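The file agent's role as a metadata manager can be sketched as a thin routing layer over several storage backends. The registry layout, the least-loaded routing policy, and the method names are illustrative assumptions, not details from the patent.

```python
class FileAgent:
    """Minimal metadata manager: records, per logical file, which backend
    cluster holds it, so callers never address storage directly."""

    def __init__(self, backends):
        self.backends = list(backends)   # e.g. ["local", "hdfs://c1", "hdfs://c2"]
        self.metadata = {}               # logical file name -> backend

    def register(self, logical_name):
        """Assign the file to the backend currently holding the fewest files
        (an assumed load-spreading policy)."""
        counts = {b: 0 for b in self.backends}
        for b in self.metadata.values():
            counts[b] += 1
        target = min(self.backends, key=lambda b: counts[b])
        self.metadata[logical_name] = target
        return target

    def locate(self, logical_name):
        return self.metadata.get(logical_name)

agent = FileAgent(["local", "hdfs://c1", "hdfs://c2"])
first = agent.register("batch1/out.dat")
second = agent.register("batch1/out2.dat")
```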
The file processing model parser: it mainly parses the big data file processing model, resets parameters in combination with the file agent, acts as a storage proxy, and realizes the operations on files and data.
Big data executor: it obtains the data processing parameters from the file processing model parser, invokes the corresponding processor according to the operation type, operation method, and operation object of the big data, and generates the big data processing result through the processor's execution. The big data executor can also perform operations such as join, sort, aggregate, classify, and filter; these are implemented by encapsulating underlying technologies such as Spark, Flink, and parallel-fragmented Java programs, and can be dynamically configured and replaced. To facilitate these operations, the file processing apparatus further includes a join executor, a sort executor, a filter executor, and the like.
The file processing flow is roughly as follows: first, the file processing model parser parses the file processing model and generates a processing object according to the parameters required for file processing obtained by parsing. The parameters required for file processing include at least: an input file list, an output file list, and an operation set, where the operation set can be understood as a set formed by operators. An operator defines the category to which an operation belongs; that is, the operation to be performed on a file can be known through its operator, and different operators correspond to different processors. The operation set includes at least one or more of the following operators: association, aggregation, summation, program processing, and the like, and may further include operations such as join, sort, aggregate, classify, and filter. For the input file and the output file, the file agent dynamically generates their paths and file formats, and the file parameters and data processing parameters are sent to the big data executor. The big data executor invokes open-source Spark, Flink, or Java programs for processing, fragments the data according to the data fragmentation rules (for example, by number of pieces, file size, or fragmentation key), invokes the corresponding executor according to the operation type of the file model to perform data processing, and writes the processing result into the output file, which is output to the client.
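The end-to-end flow above (parser, file agent, big data executor) can be tied together in a small sketch, with every component reduced to a stub; all names are illustrative and not taken from the patent.

```python
def parse(model):
    """Parser stub: split the model into input list, output list, operation set."""
    return model["inputs"], model["outputs"], model["operations"]

def agent_resolve(logical_name):
    """File agent stub: map a logical file name to a physical path."""
    return f"/dfs/{logical_name}"

def execute(op, data):
    """Executor stub: a single sample operator."""
    return sum(data) if op == "summation" else data

def run(model, data_by_file):
    """Orchestrate: parse the model, resolve paths via the agent,
    run each operation, and write the result under the output path."""
    inputs, outputs, ops = parse(model)
    data = [x for name in inputs for x in data_by_file[agent_resolve(name)]]
    result = data
    for op in ops:
        result = execute(op, result)
    return {agent_resolve(outputs[0]): result}

model = {"inputs": ["in.csv"], "outputs": ["out.csv"], "operations": ["summation"]}
result = run(model, {"/dfs/in.csv": [1, 2, 3]})
```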
It should be explained that the file processing model is used to define the parameters required for file processing; model instantiation for big data processing is performed through the file processing model definition rules to generate a processing instantiation file. The file processing model is a high-level abstraction of big data processing that models and structures its operation objects and operation methods; when designing big data processing, an application developer only needs to configure and generate an XML (Extensible Markup Language) instantiation file for the big data. The advantage is that the abstract big data file processing model shields the application from the technology, so that future technology upgrades are transparent to the application: when the technology advances, only the executor layer needs to be upgraded, and the system can be improved quickly.
Fig. 6 illustrates an exemplary system architecture 600 of a big-data based file processing method or a big-data based file processing apparatus to which an embodiment of the present invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various types of connections, such as wired or wireless communication links, or fiber-optic cables, among others.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server that provides various services. It should be noted that the file processing method based on big data provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the file processing apparatus based on big data is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: analyzing data processing parameters and file parameters from a file processing model for defining parameters required by file processing; the file parameters are parameters related to an input file and an output file; and calling a corresponding processor according to the data processing parameters, carrying out data processing on the input file of the file parameters, and writing a result obtained by the processing into the output file.
According to the technical scheme of the embodiment of the invention, the file processing capacity can be packaged based on the big data, the application and the technology can be isolated, and an application developer can develop and implement the big data technology without mastering the specific big data development technology and tool.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A big data-based file processing method is characterized by comprising the following steps:
analyzing data processing parameters and file parameters from a file processing model for defining parameters required by file processing; the file parameters are parameters related to an input file and an output file;
and calling a corresponding processor according to the data processing parameters, carrying out data processing on the input file of the file parameters, and writing a result obtained by the processing into the output file.
2. The method of claim 1, wherein after the step of parsing out data processing parameters and document parameters from a document processing model defining parameters required for document processing, the method further comprises:
managing metadata of an input file and an output file in a file proxy mode, and determining a distribution path for distributing the output file to a distributed file cluster;
after the steps of calling a corresponding processor according to the data processing parameters, performing data processing on the input file of the file parameters, and writing a result obtained by the processing into the output file, the method further comprises:
and distributing the output file to the corresponding distributed file cluster according to the distribution path.
3. The method of claim 2, wherein after the step of parsing out the data processing parameters and the document parameters from a document processing model defining parameters required for document processing, the method further comprises:
when file access is carried out, index information of the file is inquired; wherein the file may be an input file or an output file;
if the index information exists, returning the real path of the file;
and if the index information does not exist, generating a physical path of the file.
4. The method of claim 3, wherein querying file index information comprises:
inquiring file index information according to the batch number ID, the branch and the key value KEY;
generating the physical path of the file comprises:
and acquiring a file root directory from the file root path mapping rule according to the branches, and generating a physical path of the file according to the directory splitting rule.
5. The method of claim 2, wherein the file proxy approach supports one or more of: local single path, local random sharding path, open source database, distributed file system, log type database, and distributed document storage database.
6. The method of claim 1, wherein the file processing model comprises: an input file list, an output file list, and an operation set, wherein the operation set refers to a set formed by at least one operator.
7. The method of claim 6, wherein the operation set comprises at least one of the following operators: association, aggregation, summation, and program processing.
8. The method according to claim 1, characterized in that said data processing parameters comprise at least: data fragmentation rules and operators;
the step of calling a corresponding processor according to the data processing parameters, performing data processing on the input file of the file parameters, and writing a result obtained by the processing into the output file comprises:
fragmenting data in the input file according to a data fragmentation rule;
and calling the processor corresponding to the operator to perform data processing on the input file, and writing the processing result into the output file.
9. The method of claim 8, wherein after the step of fragmenting data in the input file according to a data fragmentation rule, the method further comprises:
and performing association, sorting or filtering operation on the input file after the fragmentation processing.
10. The method of claim 1, wherein after the step of parsing out data processing parameters and document parameters from a document processing model defining parameters required for document processing, the method further comprises:
and generating a path and a file format of the input file and a path and a file format of the output file according to the file parameters obtained by analysis.
11. The method of claim 1, wherein the invoked processor supports an open source program that is Spark, Flink, or Java.
12. The method of claim 1, wherein the invoked processor is provided with a processing capability of a distributed file system and may be based on a shared storage of a Network File System (NFS) protocol.
13. A big-data-based file processing apparatus, comprising:
the analysis module is used for analyzing data processing parameters and file parameters from a file processing model used for defining parameters required by file processing; the file parameters are parameters related to an input file and an output file;
and the processing module is used for calling a corresponding processor according to the data processing parameters, carrying out data processing on the input file of the file parameters and writing a result obtained by the processing into the output file.
14. A server, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-12.
15. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-12.
CN202011315296.8A 2020-11-20 2020-11-20 File processing method and device based on big data Pending CN112416865A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011315296.8A CN112416865A (en) 2020-11-20 2020-11-20 File processing method and device based on big data


Publications (1)

Publication Number Publication Date
CN112416865A true CN112416865A (en) 2021-02-26

Family

ID=74778704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011315296.8A Pending CN112416865A (en) 2020-11-20 2020-11-20 File processing method and device based on big data

Country Status (1)

Country Link
CN (1) CN112416865A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114220054A (en) * 2021-12-15 2022-03-22 北京中科智易科技有限公司 Method for analyzing tactical action of equipment and synchronously displaying equipment based on equipment bus data

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103885762A (en) * 2012-12-21 2014-06-25 中国银联股份有限公司 File dictionary assembly based file development tool and method
CN104317839A (en) * 2014-10-10 2015-01-28 北京国双科技有限公司 Method and device for generating report form template
CN106951475A (en) * 2017-03-07 2017-07-14 郑州铁路职业技术学院 Big data distributed approach and system based on cloud computing
CN108256249A (en) * 2018-01-26 2018-07-06 重庆市环境保护信息中心 A kind of reservoir area of Three Gorges EFDC model integrated methods
CN109150583A (en) * 2018-06-28 2019-01-04 中兴通讯股份有限公司 A kind of management method and device of northbound interface
US20190079853A1 (en) * 2017-09-08 2019-03-14 Devfactory Fz-Llc Automating Identification of Test Cases for Library Suggestion Models
CN110825694A (en) * 2019-11-01 2020-02-21 北京锐安科技有限公司 Data processing method, device, equipment and storage medium
CN111694870A (en) * 2020-06-17 2020-09-22 科技谷(厦门)信息技术有限公司 Big data model execution engine system and implementation method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination