CN111400257B - Object storage based Hadoop submitter implementation method and device - Google Patents

Object storage based Hadoop submitter implementation method and device

Info

Publication number
CN111400257B
Authority
CN
China
Prior art keywords
file
files
directory
job
temporary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010188188.2A
Other languages
Chinese (zh)
Other versions
CN111400257A (en)
Inventor
战策
张旭明
王豪迈
胥昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xingchen Tianhe Technology Co ltd
Original Assignee
Xsky Beijing Data Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xsky Beijing Data Technology Corp ltd filed Critical Xsky Beijing Data Technology Corp ltd
Priority to CN202010188188.2A priority Critical patent/CN111400257B/en
Publication of CN111400257A publication Critical patent/CN111400257A/en
Application granted granted Critical
Publication of CN111400257B publication Critical patent/CN111400257B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/17: Details of further file system functions
    • G06F16/172: Caching, prefetching or hoarding of files
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Abstract

The invention discloses a method and a device for implementing a Hadoop committer based on object storage. The method comprises the following steps: reading one or more files, wherein the custom metadata of each file represents its description information; merging the one or more files by using an instant file-merge function to generate a new file; and storing the new file into the target directory, wherein the new file serves as a soft link pointing to the corresponding source data files. The invention solves the technical problem in the related art that file storage is inefficient because of the commit protocol used when files are stored.

Description

Object storage based Hadoop submitter implementation method and device
Technical Field
The invention relates to the field of file storage, and in particular to a method and a device for implementing a Hadoop committer based on object storage.
Background
In the related art, MapReduce in Hadoop splits a user-submitted job into many individual tasks (map tasks and reduce tasks) and executes them on multiple nodes; after the tasks finish, their output is saved into the final result directory through an output commit protocol. Job-side commit work is performed across the nodes of the cluster and may occur outside the critical section of job execution. However, unless the output commit protocol requires all tasks to wait for a signal from the job driver, task commit cannot instantiate task output in the final directory; instead, it is used to promote the output of a successful task to a state in which the job can be committed, which handles speculative execution and failures.
Therefore, the output committer must be able to handle failure and restart of the job driver: the restarted job driver reruns only the incomplete tasks, and when the restarted job completes, the outputs of the already-completed tasks are recovered and committed.
The commonly used output committers include: FileOutputCommitter, the Staging Committer, and the Magic Committer.
FileOutputCommitter: the standard commit algorithm (in its v1 and v2 variants) relies on directory rename being an O(1) atomic operation: callers write their work to temporary directories in the target file system and then rename those directories to the final destination as the way to commit. The commit depends on a consistent directory listing and on the file system's rename() being an O(1) atomic operation: rename lets each task work in its own temporary directory, and the whole job can be committed explicitly and finally by using rename as the atomic operation. Because renaming is very cheap, it adds minimal delay to task and job commit. Note that HDFS locks the namespace metadata during a rename, so all rename() calls are serialized; however, since they only update the metadata of two directory entries, the lock is held only briefly. On object storage, by contrast, rename is not an atomic O(1) operation but a two-phase copy-then-delete; during the copy phase the data itself must be copied, which is very inefficient.
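For illustration, the rename-based commit can be sketched with the standard Hadoop FileSystem API. This is a minimal sketch, not the actual FileOutputCommitter source; class and path names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RenameCommitSketch {
        // Task commit: promote a finished task's output by a single directory
        // rename. On HDFS this is an O(1) metadata operation; on an object
        // store it degrades into copy-then-delete of every byte.
        public static void commitTask(Configuration conf, Path taskAttemptPath,
                                      Path jobAttemptPath) throws Exception {
            FileSystem fs = taskAttemptPath.getFileSystem(conf);
            if (fs.exists(taskAttemptPath)) {
                fs.mkdirs(jobAttemptPath);
                fs.rename(taskAttemptPath,
                          new Path(jobAttemptPath, taskAttemptPath.getName()));
            }
        }

        // Job commit: list the committed task directories and rename their
        // contents into the final destination directory $dest.
        public static void commitJob(Configuration conf, Path jobAttemptPath,
                                     Path dest) throws Exception {
            FileSystem fs = jobAttemptPath.getFileSystem(conf);
            for (FileStatus status : fs.listStatus(jobAttemptPath)) {
                fs.rename(status.getPath(),
                          new Path(dest, status.getPath().getName()));
            }
        }
    }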
Staging Committer: with this committer, data must first be written locally and then uploaded to the S3 object store, which is inefficient. Moreover, a third-party strongly consistent storage system (such as HDFS) must be introduced, which complicates the architecture and increases the operations and maintenance burden.
Magic Committer: this committer uploads the output files of a task commit through multipart upload, but it does not distinguish by file size. Multipart upload is generally recommended only for large files (for example, larger than 100 MB) to improve upload efficiency, whereas the Magic Committer uses multipart upload no matter how small the file is, resulting in poor I/O efficiency. The task commit and job commit phases require continuous reading, writing, and merging of .pendingset files, which further hurts I/O efficiency. This committer was released in Hadoop 3 and does not support Hadoop 2, which is currently the mainstream version in the market.
It should also be noted that S3-compatible object stores differ in their consistency behavior, falling roughly into eventually consistent (weakly consistent) and strongly consistent systems. Both the Staging Committer and the Magic Committer are designed for weakly consistent storage systems. For strongly consistent S3-compatible object stores, no extra consistency component needs to be introduced.
In view of the above problems in the related art, no effective solution has been proposed.
Disclosure of Invention
The invention mainly aims to provide a method and a device for implementing an object-storage-based Hadoop committer, so as to solve the technical problem of an inefficient file storage mechanism in the related art.
To achieve the above object, according to one aspect of the present invention, an implementation method of an object-storage-based Hadoop committer is provided. The method comprises the following steps: reading one or more files, wherein the custom metadata of each file represents its description information; merging the one or more files by using an instant file-merge function to generate a new file; and storing the new file into the target directory, wherein the new file serves as a soft link pointing to the corresponding source data files.
Further, before reading the one or more files, the method further comprises: creating a target directory and creating a job directory under a specified file directory; when a task in the job is executed, creating a temporary commit file directory in the job directory, wherein the temporary commit file directory is used for storing the files generated by executing the tasks of the job; after one or more tasks in the job are executed successfully, generating one or more files; and saving the output file or files under the temporary commit file directory.
Further, merging the one or more files by using the instant file-merge function to generate a new file comprises: after all tasks in the job have been executed, merging all files in the temporary commit file directory to generate a new file, and committing the new file to the target directory.
Further, after saving the output file or files under the temporary commit file directory, the method further comprises: deleting all files under the temporary commit file directory.
Further, when the one or more files are merged, the file-merge API generates a soft link, wherein the job information corresponding to the files is stored in the metadata of the soft link.
Further, after storing the new file under the target directory, the method further comprises: generating a marker file under the target directory, wherein the marker file marks that the job has finished executing.
To achieve the above object, according to another aspect of the present invention, an implementation apparatus for an object-storage-based Hadoop committer is provided. The apparatus includes: a reading module for reading one or more files, wherein the custom metadata of each file represents its description information; a merging module for merging the one or more files by using an instant file-merge function to generate a new file; and a storage module for storing the new file into the target directory, wherein the new file serves as a soft link pointing to the corresponding source data files.
In order to achieve the above object, according to another aspect of the present invention, a storage medium is provided. The storage medium includes a stored program, wherein the program performs the implementation method of the object-storage-based Hadoop committer described above.
To achieve the above object, according to another aspect of the present invention, a processor is provided. The processor is used for running a program, wherein the program, when running, executes the implementation method of the object-storage-based Hadoop committer described above.
The invention adopts the following steps: reading one or more files, wherein the custom metadata of each file represents its description information; merging the one or more files by using an instant file-merge function to generate a new file; and storing the new file into the target directory, wherein the new file serves as a soft link pointing to the corresponding source data files. This solves the technical problem in the related art that file storage is inefficient because of the commit protocol used when files are stored, so that work is committed efficiently and reliably to a strongly consistent, S3-compatible object storage device.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of an implementation method of an object-storage-based Hadoop committer according to an embodiment of the present invention; and
FIG. 2 is a schematic diagram of an implementation apparatus for an object-storage-based Hadoop committer according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and the features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below through the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of description, some terms or expressions referring to the embodiments of the present invention are explained below:
Hadoop: an open-source software framework for storing data and running applications on clusters of commodity hardware; it provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually unlimited number of concurrent tasks or jobs.
MapReduce: a programming model for parallel processing of large-scale data sets; map performs mapping and reduce performs reduction.
OutputCommitter: the commit protocol of MapReduce in Hadoop; in practice a set of abstract interfaces covering Job Setup, Task Commit, Task Abort, Job Commit, Job Abort, Job Cleanup, and Job Recovery.
Job: a complete MapReduce application. Job execution: the job is split into multiple tasks whose completion is managed, typically by a single process. If the job completes successfully, its output becomes visible to later stages in a larger sequence of operations or to other applications.
Job Driver: manages the state of the job. Whatever process schedules task execution also tracks job success or failure and decides when to commit the output. It may also determine that the job has failed and cannot recover, in which case the job is aborted. In MapReduce this process runs in the YARN AppMaster.
Final directory: the directory that holds the final output files generated once job execution completes.
Task: a single operation within a job, running as a single process; one process generates one or more files. After a job completes successfully, the data of every successfully executed task must be visible in the final directory.
Task Working Directory: a directory to which a single task has exclusive access and in which uncommitted work can be placed.
Task Commit: takes the output of a task from the task working directory and makes it visible in the final directory; typically implemented through a file-system rename() call.
Object storage: a computer data storage architecture that manages data as objects, in contrast to architectures such as file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks within sectors. Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier.
S3: the de facto standard object storage protocol.
S3A: the native Hadoop object-store connector, which allows an S3-protocol object store to be used as a Hadoop storage back end.
Staging Committer: ryan Blue from Netflix submitted to a commatter for S3 object oriented storage in Hadoop community.
S3Guard: mainly solves the consistency gap between S3 and HDFS. Changes to metadata are recorded in an additional metadata store. When a client issues a metadata query, S3Guard queries both S3 and the additional metadata store and then merges the results.
Object storage custom metadata: several custom metadata entries (key/value pairs) can be attached to an object to describe it.
Object storage instant file merge (file "second-merge"): merges one or more objects into a new object saved to a designated directory. The content of the new object is the concatenation of the contents of all merged files (in input order). The new object produced by this function causes no data movement in the storage system; only a soft link is generated. This API is not part of the standard S3 protocol.
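Because the instant file merge is a vendor extension outside the standard S3 protocol, its exact signature varies. The interface below is only a hypothetical Java sketch of the semantics described above; the names InstantFileMerge, mergeObjects, and the parameters are assumptions, not a real SDK API:

    import java.util.List;
    import java.util.Map;

    // Hypothetical vendor extension: merge existing objects into a new
    // object without moving any data. The result behaves like a soft link
    // whose content is the concatenation of the sources in input order.
    public interface InstantFileMerge {
        // bucket:       target bucket
        // targetKey:    key of the new (soft-link) object
        // sourceKeys:   keys of the source objects, in merge order
        // userMetadata: custom key/value metadata attached to the soft link
        void mergeObjects(String bucket,
                          String targetKey,
                          List<String> sourceKeys,
                          Map<String, String> userMetadata);
    }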
According to an embodiment of the invention, an implementation method of a Hadoop committer based on object storage is provided.
FIG. 1 is a flowchart of an implementation method of an object-storage-based Hadoop committer according to an embodiment of the present invention. As shown in FIG. 1, the method comprises the following steps:
Step S101, reading one or more files, wherein the custom metadata of each file is used for representing the description information of the file.
Step S102, merging the one or more files by using the instant file-merge function to generate a new file.
Step S103, storing the new file into the target directory, wherein the new file serves as a soft link pointing to the corresponding source data file.
As described above, FileOutputCommitter is the default committer for MapReduce in Hadoop in the related art. Its algorithm comes in two versions, "V1" and "V2".
The "V1" commit algorithm is the default commit algorithm for FileOutputCommitter in Hadoop version 2. x. The algorithm is used for processing the fault and the restart of the jobdriver, and the restarted jobdriver only reruns unfinished tasks; when the restarted job is completed, the output of the completed task will be restored for submission. The cost is as follows: all files in all committed task directories are listed by recursion and committed by the rename operation. First, a jobAttemptPath (a temporary output directory of the jobresult) is created, '$ dest/_ temporary/$ appAttemptidId/', where $ dest denotes the final directory of the jobs. Next, a task creating directory is created, taskatemptympath $ dest/_ temporary/$ appattattemptid/_ temporary/$ taskatemptid'. The file output by the Task is stored in the taskatemppath, and when the Task is successfully executed, the output file is submitted to the jobattamptpath directory through rename operation. When all tasks are completed, the jobdriver calls the listStatus and rename methods to submit the result file aggregate in the jobAttemptPath to the $ dest directory. At which time one job is completed.
The "V2" algorithm is generally similar in flow to the "V1" algorithm, except that "V2" submits the task output directly from the taskAttemptPath to the $ dest directory. During execution, the intermediate data becomes visible. When Job fails, all outputs must be deleted and the Job restarted.
In contrast to a "real" file system, the S3A object store (like most object stores) does not support rename() at all. To emulate rename, the Hadoop S3A connector must copy the data into a new object under the target name and then delete the original entry. The copy may be performed server side, but since the operation does not return until the in-cluster copy finishes, the time it takes is proportional to the amount of data.
The rename overhead is the most visible problem, but the most dangerous one is that directory listings carry no consistency guarantee. The S3 object store is weakly consistent and asynchronous: even though a copy operation reports success, a client listing the directory may not yet see the file. Files that are not listed are not copied by the commit operation and therefore never appear in the final output.
There are two main open-source committer implementations for S3-compatible object storage: the Staging Committer and the Magic Committer.
Staging Committer: this committer writes the task output to a temporary directory on the local FS; getTaskAttemptPath and getWorkPath direct the task output to the local FS. At task commit, the committer enumerates the files in the task attempt directory (ignoring hidden files). Finally, each file is uploaded to S3 using the multipart upload API. Because task data is first written to the local disk, the information required to complete the upload must be saved in HDFS; since HDFS is a strongly consistent distributed storage system, no files can be missed during the commit. The core algorithm is as follows:
The target output directory is a reference to a directory on the local file system.
Task commit initiates the multipart upload PUTs to the target object store.
The list of pending PUTs for each task is persisted to a single file in a distributed file system; for Netflix, this is HDFS.
The standard FileOutputCommitter (algorithm 1) is used to manage the commit/abort of these files; that is, it copies only the lists of files to be committed from successful tasks into the (temporary) job commit directory.
The S3 job committer reads the pending-file lists of every committed task from HDFS and completes these PUT requests.
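The deferred-commit trick at the heart of this algorithm (and of the Magic Committer described next) can be sketched with the AWS SDK for Java (v1): a multipart upload is initiated and its parts are uploaded, but the object stays invisible until completeMultipartUpload is called at job commit. Bucket, key, and file names below are placeholders:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
    import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
    import com.amazonaws.services.s3.model.UploadPartRequest;
    import com.amazonaws.services.s3.model.UploadPartResult;
    import java.io.File;
    import java.util.Arrays;

    public class DeferredUploadSketch {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            String bucket = "demo-bucket";
            String key = "output/part-00000";
            File data = new File("/tmp/part-00000");

            // Task commit: start the upload and push the data ...
            String uploadId = s3.initiateMultipartUpload(
                    new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
            UploadPartResult part = s3.uploadPart(new UploadPartRequest()
                    .withBucketName(bucket).withKey(key)
                    .withUploadId(uploadId).withPartNumber(1)
                    .withFile(data).withPartSize(data.length())
                    .withLastPart(true));
            // ... then persist (bucket, key, uploadId, part ETag) instead of
            // completing; the object is not yet visible under `key`.

            // Job commit: load the persisted pending upload and finish it,
            // making the object visible atomically.
            s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                    bucket, key, uploadId, Arrays.asList(part.getPartETag())));
        }
    }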
Magic Committer: development of this committer began before Netflix donated the Staging Committer. The Magic Committer defers all writes made to files under a special directory ("magic"); the final destination of each write is changed to the final job destination. When a task is committed, each task output file is uploaded to S3 using the multipart upload API. It differs from the Staging Committer in that it streams data directly to S3 (no staging) and also stores the list of pending commits in S3. This requires consistent S3 metadata, which is provided by S3Guard. In the task commit phase of the Magic Committer, the information needed to commit the task is transferred from the task attempt to the job attempt.
The task commit operation lists all .pending file contents in the task attempt directory, loads them into an upload list, merges them into a single Pendingset structure saved as a .pendingset file in the job attempt directory, and finally deletes the task attempt directory.
In the job commit phase of the Magic Committer, the committer loads all the .pendingset files in its job attempt directory. A .pendingset file that fails to load is treated as a job failure, and all loadable pending sets are aborted. If all the .pendingsets load successfully, every pending commit in the job is committed; if any single commit fails, all successful commits are rolled back by deleting their target files. The related art therefore suffers from the problems described above.
This application provides an implementation method of an object-storage-based Hadoop committer, also called the Merge Committer, which solves the above problems in the related art by combining the custom metadata of object storage with the instant file merge. The instant file merge is an atomic operation with O(1) time complexity, but the file generated by the merge is not a file with actual content; it is a soft link, so a special directory (defined as .mystery) has to be introduced to manage the source files behind the soft links.
The implementation method of the object-storage-based Hadoop committer provided by this embodiment of the invention reads one or more files, wherein the custom metadata of each file represents its description information; merges the one or more files by using the instant file-merge function to generate a new file; and stores the new file in the target directory, wherein the new file serves as a soft link pointing to the corresponding source data files. This solves the technical problem in the related art that file storage is inefficient because of the commit protocol used when files are stored, so that work is committed efficiently and reliably to a strongly consistent, S3-compatible object storage device.
It should also be noted that the soft-link file generated at Task Commit establishes a mapping to its source files through the object metadata function, which facilitates management of the source objects.
Optionally, before reading the one or more files, the method further comprises: creating a target directory and creating a job directory under a specified file directory; when a task in the job is executed, creating a temporary commit file directory in the job directory, wherein the temporary commit file directory is used for storing the files generated by executing the tasks of the job; after one or more tasks in the job are executed successfully, generating one or more files; and saving the output file or files under the temporary commit file directory.
In the embodiment provided in this application, task setup (taskSetup) performs no operation. Before reading the one or more files, however, a target directory, i.e., the final output directory, must be created, and a job directory (the job attempt directory) must be created under the specified file directory /.mystery/, i.e., jobAttemptPath = /.mystery/$jobID/.
Further, when a task is executed, a temporary commit directory taskAttemptPath is created in the job attempt directory, i.e., taskAttemptPath = /.mystery/$jobID/$taskAttemptID/.
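A minimal sketch of this setup phase, using the Hadoop FileSystem API (the /.mystery/ layout follows the text above; class and method names are illustrative):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MergeCommitterSetupSketch {
        // Job setup: create the final output directory and the job attempt
        // directory /.mystery/$jobID/.
        public static Path setupJob(FileSystem fs, Path dest, String jobId)
                throws Exception {
            fs.mkdirs(dest);
            Path jobAttemptPath = new Path("/.mystery/" + jobId + "/");
            fs.mkdirs(jobAttemptPath);
            return jobAttemptPath;
        }

        // Task setup: create the temporary commit directory that holds
        // this task attempt's output files.
        public static Path setupTask(FileSystem fs, Path jobAttemptPath,
                                     String taskAttemptId) throws Exception {
            Path taskAttemptPath = new Path(jobAttemptPath, taskAttemptId);
            fs.mkdirs(taskAttemptPath);
            return taskAttemptPath;
        }
    }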
Optionally, merging the one or more files by using the instant file-merge function to generate a new file comprises: after all tasks in the job have been executed, merging all files in the temporary commit file directory to generate a new file, and committing the new file to the target directory.
As stated above, after a task executes successfully, the job driver calls the Task Commit method to merge the result files in the task working directory into the target directory, i.e., the final output directory, through the file-merge method.
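Continuing the sketch, a task commit could then invoke the hypothetical InstantFileMerge interface from the terminology section roughly as follows; how object keys are derived from paths, and the listing order, are assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MergeCommitSketch {
        // Merge the task's result files into a single soft-link object in
        // the target directory: an O(1) operation that copies no data.
        public static void commitTask(FileSystem fs, InstantFileMerge api,
                                      String bucket, Path taskAttemptPath,
                                      String targetKey,
                                      Map<String, String> jobMetadata)
                throws Exception {
            List<String> sourceKeys = new ArrayList<>();
            for (FileStatus status : fs.listStatus(taskAttemptPath)) {
                // Assumption: the object key equals the path inside the bucket.
                sourceKeys.add(status.getPath().toUri().getPath());
            }
            api.mergeObjects(bucket, targetKey, sourceKeys, jobMetadata);
        }
    }

Job commit would repeat the same merge over all files under the temporary commit directory and then write the _SUCCESS marker described below.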
Optionally, after saving the output one or more files under the temporary commit file directory, the method further comprises: deleting all files under the temporary commit file directory.
Specifically, after the one or more files have been saved under the temporary commit file directory, all data under the directory jobAttemptPath also needs to be deleted.
It should be noted that, in Task Recovery as provided in this embodiment, i.e., when recovering a single task of the job, no data has been written to the target directory, so the job driver can clear that task simply by deleting its taskAttemptPath.
It should also be noted that, in the Job Recovery process provided by this embodiment, the output files of the successfully executed tasks have already been renamed into the final directory and are therefore recoverable; the job driver only needs to re-execute the incomplete tasks.
Optionally, when the one or more files are merged, the file-merge API generates a soft link, wherein the job information corresponding to the files is stored in the metadata of the soft link.
As noted above, there is a problem that when a delete operation is executed on a soft-link file generated by the object file-merge API, only the soft link itself is deleted; the resource file, i.e., the source file, is not deleted, so the source file would otherwise be left behind. Here, $jobID and $taskAttemptID are stored as key/value pairs in the metadata of the soft-link object through the object's custom metadata function, e.g., {"jid": $jobID, "tid": $taskAttemptID}. Thus, when a soft-link file is deleted, the corresponding source files can be located under the /.mystery/ directory via $jobID and $taskAttemptID.
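For instance, the metadata attached to the soft link might be assembled as below; the key names jid and tid follow the example above, while how the map is passed to the store depends on the vendor's merge API:

    import java.util.HashMap;
    import java.util.Map;

    public class SoftLinkMetadataSketch {
        // Record the job and task attempt IDs on the soft-link object so
        // that its source files under /.mystery/ can be located later.
        public static Map<String, String> buildMetadata(String jobId,
                                                        String taskAttemptId) {
            Map<String, String> meta = new HashMap<>();
            meta.put("jid", jobId);          // {"jid": $jobID, ...}
            meta.put("tid", taskAttemptId);  // {..., "tid": $taskAttemptID}
            return meta;
        }
    }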
Optionally, after storing the new file under the target directory, the method further comprises: generating a marker file under the target directory, wherein the marker file marks that the job has finished executing.
Specifically, when all task output has completed, the job driver calls the job commit method, which generates an empty _SUCCESS file in the target directory (the final directory) and uses this _SUCCESS file as the marker that the job has completed.
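Creating the marker is a one-liner with the Hadoop FileSystem API; a minimal sketch:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SuccessMarkerSketch {
        // Write an empty _SUCCESS file into the final directory to mark
        // the job as complete.
        public static void markSuccess(FileSystem fs, Path dest)
                throws Exception {
            fs.create(new Path(dest, "_SUCCESS"), true).close();
        }
    }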
Furthermore, replacing the object rename() calls in the output commit with object file merging improves commit efficiency.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The embodiment of the invention further provides an implementation apparatus for the object-storage-based Hadoop committer. It should be noted that this apparatus can be used to execute the implementation method provided by the embodiment of the invention. The implementation apparatus is described below.
FIG. 2 is a schematic diagram of an implementation apparatus for an object-storage-based Hadoop committer according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes: a reading module 201 configured to read one or more files, wherein the custom metadata of each file represents its description information; a merging module 202 configured to merge the one or more files by using an instant file-merge function to generate a new file; and a storage module 203 configured to store the new file in the target directory, wherein the new file serves as a soft link pointing to the corresponding source data files.
In this implementation apparatus, the reading module 201 reads one or more files, wherein the custom metadata of each file represents its description information; the merging module 202 merges the one or more files by using the instant file-merge function to generate a new file; and the storage module 203 stores the new file in the target directory, wherein the new file serves as a soft link pointing to the corresponding source data files. This solves the technical problem in the related art that file storage is inefficient because of the commit protocol used when files are stored, so that work is committed efficiently and reliably to a strongly consistent, S3-compatible object storage device.
Optionally, the apparatus further comprises: a first creating module for creating the target directory and creating a job directory under a specified file directory; a second creating module for creating a temporary commit file directory in the job directory when a task in the job is executed, wherein the temporary commit file directory is used for storing the files generated by executing the tasks of the job; a generating module for generating one or more files after one or more tasks in the job are executed successfully; and a saving module for saving the output one or more files into the temporary commit file directory.
Optionally, the merging module 202 includes: a sub-saving module for merging all files in the temporary commit file directory after all tasks in the job have been executed, generating a new file, and committing the new file to the target directory.
Optionally, the apparatus further comprises: a deleting module for deleting all files in the temporary commit file directory.
Optionally, the apparatus further comprises: an interface merging module for generating the soft link by merging the files through the merge API when the one or more files are merged, wherein the job information corresponding to the files is stored in the metadata of the soft link.
Optionally, the apparatus further comprises: a marking module for generating a marker file under the target directory, wherein the marker file marks that the job has finished executing.
The apparatus for implementing the object-storage-based Hadoop committer comprises a processor and a memory. The reading module 201 and the other modules are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor comprises a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels can be configured, and the technical problem of inefficient file storage caused by the commit protocol used during file storage in the related art is solved by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the invention provides a storage medium on which a program is stored; when executed by a processor, the program implements the implementation method of the object-storage-based Hadoop committer.
An embodiment of the invention provides a processor for running a program, wherein the implementation method of the object-storage-based Hadoop committer is executed when the program runs.
An embodiment of the invention provides a device comprising a processor, a memory, and a program stored on the memory and runnable on the processor; when executing the program, the processor realizes the following steps: reading one or more files, wherein the custom metadata of each file represents its description information; merging the one or more files by using an instant file-merge function to generate a new file; and storing the new file into the target directory, wherein the new file serves as a soft link pointing to the corresponding source data files.
Further, before reading the one or more files, the method further comprises: creating a target directory and creating a job directory under a specified file directory; when a task in the job is executed, creating a temporary commit file directory in the job directory, wherein the temporary commit file directory is used for storing the files generated by executing the tasks of the job; after one or more tasks in the job are executed successfully, generating one or more files; and saving the output file or files under the temporary commit file directory.
Further, merging the one or more files by using the instant file-merge function to generate a new file comprises: after all tasks in the job have been executed, merging all files in the temporary commit file directory to generate a new file, and committing the new file to the target directory.
Further, after saving the output one or more files under the temporary commit file directory, the method further comprises: deleting all files under the temporary commit file directory.
Further, when the one or more files are merged, the file-merge API generates a soft link, wherein the job information corresponding to the files is stored in the metadata of the soft link.
Further, after storing the new file under the target directory, the method further comprises: generating a marker file under the target directory, wherein the marker file marks that the job has finished executing. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The invention also provides a computer program product adapted to perform, when executed on a data processing device, a program initializing the following method steps: reading one or more files, wherein the custom metadata of each file represents its description information; merging the one or more files by using an instant file-merge function to generate a new file; and storing the new file into the target directory, wherein the new file serves as a soft link pointing to the corresponding source data files.
Further, before reading the one or more files, the method further comprises: creating a target directory and creating a job directory under a specified file directory; when a task in the job is executed, creating a temporary commit file directory in the job directory, wherein the temporary commit file directory is used for storing the files generated by executing the tasks of the job; after one or more tasks in the job are executed successfully, generating one or more files; and saving the output file or files under the temporary commit file directory.
Further, merging the one or more files by using the instant file-merge function to generate a new file comprises: after all tasks in the job have been executed, merging all files in the temporary commit file directory to generate a new file, and committing the new file to the target directory.
Further, after saving the output one or more files under the temporary commit file directory, the method further comprises: deleting all files under the temporary commit file directory.
Further, when the one or more files are merged, the file-merge API generates a soft link, wherein the job information corresponding to the files is stored in the metadata of the soft link.
Further, after storing the new file under the target directory, the method further comprises: generating a marker file under the target directory, wherein the marker file marks that the job has finished executing.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present invention, and are not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (7)

1. An implementation method of a Hadoop committer based on object storage, characterized by comprising the following steps:
reading one or more files, wherein the custom metadata of the files is used for representing the description information of the files;
merging the one or more files by using an instant file-merge function to generate a new file; and
storing the new file into a target directory, wherein the new file serves as a soft link pointing to a corresponding source data file;
wherein, prior to reading the one or more files, the method further comprises:
creating the target directory and creating a job directory under a specified file directory;
when a task in a job is executed, creating a temporary commit file directory in the job directory, wherein the temporary commit file directory is used for storing a file generated by executing the task in the job;
after one or more tasks in the job are executed successfully, generating one or more files; and
saving the output one or more files under the temporary commit file directory;
wherein merging the one or more files by using the instant file-merge function to generate the new file comprises:
after all tasks in the job have been executed, merging all files in the temporary commit file directory to generate the new file, and committing the new file to the target directory.
2. The method of claim 1, wherein after saving the output one or more files under the temporary commit file directory, the method further comprises: deleting all files in the temporary commit file directory.
3. The method of claim 1, wherein when the one or more files are merged, merging the files through the merge API generates the soft link, and wherein the job information corresponding to the files is stored in the metadata of the soft link.
4. The method of any of claims 1 to 3, wherein after storing the new file under the target directory, the method further comprises: generating a marker file under the target directory, wherein the marker file is used for marking that the job has been completed.
5. An implementation apparatus for a Hadoop committer based on object storage, characterized by comprising:
a reading module for reading one or more files, wherein the custom metadata of the files is used for representing the description information of the files;
a merging module for merging the one or more files by using an instant file-merge function to generate a new file; and
a storage module for storing the new file into a target directory, wherein the new file serves as a soft link pointing to a corresponding source data file;
the apparatus further comprising:
a first creating module for creating the target directory and creating a job directory under a specified file directory;
a second creating module for creating a temporary commit file directory in the job directory when a task in the job is executed, wherein the temporary commit file directory is used for storing a file generated by executing the task in the job;
a generating module for generating one or more files after one or more tasks in the job are executed successfully; and
a saving module for saving the output one or more files to the temporary commit file directory;
the apparatus further comprising:
a sub-saving module for merging all files in the temporary commit file directory after all tasks in the job have been executed, generating a new file, and committing the new file to the target directory.
6. A storage medium comprising a stored program, wherein the program performs the implementation method of the object-storage-based Hadoop committer according to any one of claims 1 to 4.
7. A processor configured to run a program, wherein the program executes the implementation method of the object-storage-based Hadoop committer according to any one of claims 1 to 4.
CN202010188188.2A 2020-03-17 2020-03-17 Object storage based Hadoop submitter implementation method and device Active CN111400257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010188188.2A CN111400257B (en) 2020-03-17 2020-03-17 Object storage based Hadoop submitter implementation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010188188.2A CN111400257B (en) 2020-03-17 2020-03-17 Object storage based Hadoop submitter implementation method and device

Publications (2)

Publication Number Publication Date
CN111400257A CN111400257A (en) 2020-07-10
CN111400257B true CN111400257B (en) 2021-10-01

Family

ID=71428856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010188188.2A Active CN111400257B (en) 2020-03-17 2020-03-17 Object storage based Hadoop submitter implementation method and device

Country Status (1)

Country Link
CN (1) CN111400257B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8533231B2 (en) * 2011-08-12 2013-09-10 Nexenta Systems, Inc. Cloud storage system with distributed metadata
CN107622064A (en) * 2016-07-14 2018-01-23 中国移动通信集团重庆有限公司 A kind of method for reading data and system

Also Published As

Publication number Publication date
CN111400257A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
US11263173B2 (en) Transaction log index generation in an enterprise backup system
US20230251945A1 (en) Snapshot-based disaster recovery orchestration of virtual machine failover and failback operations
US11106632B2 (en) Dynamically configuring a proxy server using containerization for concurrent and/or overlapping backup, restore, and/or test operations
US8719767B2 (en) Utilizing snapshots to provide builds to developer computing devices
US11442768B2 (en) Cross-hypervisor live recovery of virtual machines
US9235482B2 (en) Consistent data retrieval in a multi-site computing infrastructure
CN107003890B (en) Efficiently providing virtual machine reference points
US20130262387A1 (en) Utilizing snapshots for access to databases and other applications
US20230328151A1 (en) Data storage system with rapid restore capability
US11003364B2 (en) Write-once read-many compliant data storage cluster
US11645169B2 (en) Dynamic resizing and re-distribution of destination data storage resources for bare metal restore operations in a data storage management system
US10896167B2 (en) Database recovery using persistent address spaces
US20220237084A1 (en) Concurrent transmission of multiple extents during backup of extent-eligible files
CN110837441A (en) KVM virtual machine backup method based on dirty data bitmap and network block equipment
US20220197518A1 (en) Ensuring the integrity of data storage volumes used in block-level live synchronization operations in a data storage management system
US20220318199A1 (en) Seamless migration of stubs and data between different filer types
CA3099104A1 (en) Client managed data backup process within an enterprise information management system
CN111400257B (en) Object storage based Hadoop submitter implementation method and device
US20220011938A1 (en) System and method for selectively restoring data
US10877868B2 (en) Applying a log to storage segments
CN117093332B (en) Method and device for realizing cloning of virtual machine
US11151101B2 (en) Adjusting growth of persistent log
US11455292B2 (en) Brokering persisted and unpersisted log records
US20230153010A1 (en) Pruning data segments stored in cloud storage to reclaim cloud storage space
US20220382648A1 (en) Method and apparatus for phased transition of legacy systems to a next generation backup infrastructure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: 100094 101, floors 1-5, building 7, courtyard 3, fengxiu Middle Road, Haidian District, Beijing

Patentee after: Beijing Xingchen Tianhe Technology Co.,Ltd.

Address before: 100097 room 806-1, block B, zone 2, Jinyuan times shopping center, indigo factory, Haidian District, Beijing

Patentee before: XSKY BEIJING DATA TECHNOLOGY Corp.,Ltd.

CP03 Change of name, title or address