CN114237892A - Key value data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114237892A
CN114237892A
Authority
CN
China
Prior art keywords
value data
key
shuffle
task
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111555578.XA
Other languages
Chinese (zh)
Inventor
李超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111555578.XA
Publication of CN114237892A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a key-value data processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a task to be processed and meta information corresponding to the task to be processed; processing the task to be processed through a mapping task in a computing engine to obtain key-value data, and transmitting the key-value data to a shuffle processing node, where the shuffle processing node is a node independently packaged outside the computing engine; and running, by the shuffle processing node, sorting logic corresponding to the meta information to sort the key-value data. According to the disclosed scheme, key-value data generated by various computing engines are processed by a universal sorting mechanism, so that the sorting requirements of the key-value data of multiple computing engines can be met simultaneously.

Description

Key value data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for processing key value data, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Current computing engines such as MapReduce (a map-reduce programming model) and Spark have a shuffle mechanism. The shuffle mechanism handles data transfer between a MapTask and a ReduceTask within a job.
Currently, shuffle mechanism implementations fall largely into two categories: sort-based shuffle and hash-based shuffle. MapReduce mainly adopts sort-based shuffle: key-value data are sorted by partition and key on the mapping-task side and written to the local disk as an ordered shuffle file, so the portion of the shuffle file obtained by the reduction-task side is locally ordered. Spark mainly adopts hash-based shuffle: no key sorting is performed on the mapping-task side; data are classified by partition and written to the local disk as unordered shuffle files, so the reduction-task side needs to globally sort the set of shuffle files it pulls.
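As an illustrative sketch (not taken from the patent), the map-side ordering that sort-based shuffle applies — primary order by partition, secondary order by key — can be written as follows; the class and record names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: sort-based shuffle orders map-side records by
// (partition, key) before spilling them to a shuffle file, so each
// reducer's slice of the file is already locally ordered.
public class SortBasedShuffleSketch {
    // A minimal key-value record together with its target partition.
    record KV(int partition, String key, String value) {}

    static List<KV> sortForShuffle(List<KV> records) {
        List<KV> sorted = new ArrayList<>(records);
        // Primary order: partition; secondary order: key.
        sorted.sort(Comparator.<KV>comparingInt(KV::partition)
                              .thenComparing(KV::key));
        return sorted;
    }
}
```

Hash-based shuffle, by contrast, would skip the `thenComparing(KV::key)` step and only group records by partition, leaving key ordering to the reduce side.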
However, most streaming shuffle mechanisms currently used within enterprises support only the Spark engine. When Spark and MapReduce jobs coexist in an enterprise, the hash-based shuffle of the Spark engine cannot meet the key-value-data sorting requirement of MapReduce jobs, so a method that can flexibly sort the key-value data of multiple computing engines running simultaneously is urgently needed.
Disclosure of Invention
The present disclosure provides a method and an apparatus for processing key-value data, an electronic device, a computer-readable storage medium, and a computer program product, so as to provide a way to flexibly sort the key-value data of multiple computing engines when those engines run simultaneously. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a method for processing key-value data, the method including:
acquiring a task to be processed and meta-information corresponding to the task to be processed;
processing the task to be processed through a mapping task in a computing engine to obtain key-value data, and transmitting the key-value data to a shuffle processing node, wherein the shuffle processing node is a node independently packaged outside the computing engine;
running, by the shuffle processing node, sorting logic corresponding to the meta information to sort the key-value data.
In one embodiment, a first class loader is deployed in the shuffle processing node, and the meta information includes a storage path of a sorting-logic file package;
the executing, by the shuffle processing node, ordering logic corresponding to the meta information includes:
obtaining, by the first class loader in the shuffle processing node, the ordered logical bundle of files stored in the storage path;
and loading the sorting logic file package through the first type loader to run the sorting logic.
In one embodiment, a second class loader is also deployed in the shuffle processing node;
before the obtaining, by the first class loader in the shuffle processing node, of the sorting-logic file package stored in the storage path, the method further includes:
in a case that the task to be processed is submitted through a preset tool, determining, through the first class loader, that no sorting logic corresponding to the preset tool exists in the second class loader.
In one embodiment, the method further comprises:
when the first class loader determines that sorting logic corresponding to the preset tool exists in the second class loader, running the sorting logic corresponding to the preset tool.
In one embodiment, a third class loader is also deployed in the shuffle processing node; the meta information further includes key class information of the task to be processed;
before the obtaining, by the first class loader in the shuffle processing node, of the sorting-logic file package stored in the storage path, the method further includes:
determining, by the first class loader, that no sorting logic corresponding to the key class information exists in the third class loader.
In one embodiment, the method further comprises:
when the first class loader determines that sorting logic corresponding to the key class information exists in the third class loader, running the sorting logic corresponding to the key class information.
In one embodiment, the method further comprises:
registering the meta information to a global management component, wherein the global management component is a component independently packaged outside the computing engine;
sending, by the global management component, the meta-information to the shuffle processing node.
In one embodiment, there are multiple shuffle processing nodes; the transmitting of the key-value data to a shuffle processing node includes:
obtaining a mapping relation between a partition and a shuffle processing node through a shuffle write node, wherein the mapping relation is constructed in advance by the global management component according to the partitions of the task to be processed, and the shuffle write node is a node independently packaged outside the computing engine;
determining, by the shuffle write node, key value data corresponding to each of the partitions, and sending the key value data corresponding to each of the partitions to the shuffle processing node corresponding to each of the partitions according to the mapping relationship;
the executing, by the shuffle processing node, ordering logic corresponding to the meta information to order the key-value data, comprising:
executing the sorting logic by the shuffle processing node corresponding to each of the partitions to sort the key-value data corresponding to each of the partitions.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for processing key-value data, the apparatus comprising:
an acquisition module configured to acquire a task to be processed and meta information corresponding to the task to be processed;
the mapping module is configured to process the to-be-processed task through a mapping task in a computing engine to obtain key value data, and transmit the key value data to a shuffle processing node, wherein the shuffle processing node is a node independently packaged outside the computing engine;
a data sorting module configured to run, through the shuffle processing node, sorting logic corresponding to the meta information to sort the key-value data.
In one embodiment, a first class loader is deployed in the shuffle processing node, and the meta information includes a storage path of a sorting-logic file package;
the data sorting module comprises:
a file-package acquisition unit configured to obtain, through the first class loader in the shuffle processing node, the sorting-logic file package stored in the storage path;
a first loading unit configured to load the sorting-logic file package through the first class loader to run the sorting logic.
In one embodiment, a second class loader is also deployed in the shuffle processing node;
the data sorting module further comprises:
a submission-tool determining unit configured to determine, through the first class loader, that no sorting logic corresponding to a preset tool exists in the second class loader when the task to be processed is submitted through the preset tool.
In one embodiment, the data sorting module further includes:
a second loading unit configured to run, when the first class loader determines that sorting logic corresponding to the preset tool exists in the second class loader, the sorting logic corresponding to the preset tool.
In one embodiment, a third class loader is also deployed in the shuffle processing node; the meta information further includes key class information of the task to be processed;
the data sorting module further comprises:
a query unit configured to determine, through the first class loader, that no sorting logic corresponding to the key class information exists in the third class loader.
In one embodiment, the data sorting module further includes:
a third loading unit configured to run, when the first class loader determines that sorting logic corresponding to the key class information exists in the third class loader, the sorting logic corresponding to the key class information.
In one embodiment, the apparatus further comprises:
a registration module configured to perform registration of the meta-information to a global management component, the global management component being a component independently packaged outside the compute engine;
an information sending module configured to perform sending the meta information to the shuffle processing node through the global management component.
In one embodiment, the number of the shuffle processing nodes is multiple;
the mapping module includes:
a mapping-relation obtaining unit configured to obtain, through a shuffle write node, a mapping relation between partitions and shuffle processing nodes, the mapping relation being constructed in advance by the global management component according to the partitions of the task to be processed, the shuffle write node being a node independently packaged outside the computing engine;
a data transmission unit configured to determine, through the shuffle write node, the key-value data corresponding to each of the partitions, and to transmit the key-value data corresponding to each of the partitions to the shuffle processing node corresponding to each of the partitions according to the mapping relation;
the data sorting module is configured to run, through the shuffle processing node corresponding to each of the partitions, the sorting logic to sort the key-value data corresponding to each of the partitions.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to execute the instructions to implement the key-value data processing method according to any one of the above embodiments.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the processing method of key-value data according to any one of the above embodiments.
According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, where the computer program product includes instructions, and when the instructions are executed by a processor of an electronic device, the electronic device is enabled to execute the processing method of key-value data according to any one of the above embodiments.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps that the shuffling processing nodes which are independently packaged are deployed outside a computing engine, after a task to be processed is processed through a mapping task to obtain key value data, the shuffling processing nodes can obtain a sorting logic corresponding to meta information of the task to be processed, the key value data are sorted by adopting the sorting logic, the key value data generated by various computing engines can be processed by adopting a universal sorting mechanism under the condition that various computing engines exist at the same time, and therefore the sorting requirement of the key value data of various computing engines can be met at the same time. In addition, the embodiment of the disclosure provides the independently deployed shuffle processing nodes, reduces the modification degree of the native computing engine code to the maximum extent in a decoupling mode, has higher universality, and is convenient for deployment and maintenance.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a diagram illustrating an application environment of a method for processing key-value data according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of processing key-value data, according to an example embodiment.
FIG. 3 is a schematic flow chart illustrating sorting of key-value data in accordance with an example embodiment.
FIG. 4 is a flow diagram illustrating processing of key-value data by loading a package of sorted logical files in accordance with an illustrative embodiment.
FIG. 5 is a diagram illustrating processing of key-value data by loading a package of sorted logical files in accordance with an illustrative embodiment.
FIG. 6 is a schematic diagram illustrating a shuffle processing node in accordance with an exemplary embodiment.
FIG. 7 is a flowchart illustrating a method of processing key-value data, in accordance with an exemplary embodiment.
FIG. 8 is a schematic diagram illustrating a shuffle service in accordance with an exemplary embodiment.
Fig. 9 is a block diagram illustrating a key-value data processing apparatus according to an example embodiment.
FIG. 10 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should also be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are both information and data that are authorized by the user or sufficiently authorized by various parties.
The key-value data processing method provided by the present disclosure may be applied to an application environment as shown in fig. 1, in which the terminal 110 interacts with the server 120 through the network. At least one computing engine is deployed in the server 120, and an independently packaged shuffle processing node is deployed in addition to the at least one computing engine. The server 120 obtains the to-be-processed task uploaded by the terminal 110 and the meta information corresponding to the task. The task is processed through the mapping tasks in the computing engine to obtain key-value data; the partition to which each piece of key-value data belongs is determined; the key-value data belonging to the same partition are aggregated; and the aggregated key-value data are sent to the shuffle processing node. The shuffle processing node obtains the sorting logic corresponding to the meta information of the task and sorts the key-value data under each partition according to that logic, so that each reduction task can pull the sorted key-value data under its corresponding partition.
The terminal 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The portable wearable device can be a smart watch, a smart bracelet, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
Fig. 2 is a flowchart illustrating a processing method of key-value data according to an exemplary embodiment, as shown in fig. 2, including the following steps.
In step S210, the to-be-processed task and meta information corresponding to the to-be-processed task are acquired.
The task to be processed may be a set of tasks that a user requires a computing engine to complete in the course of solving a computational problem or handling a transaction. The task to be processed can be uploaded to the computing engine by a user through the terminal device, or scheduled to the computing engine by a task scheduler.
Meta information is information that describes other information, such as its structure, semantics, and usage. In the embodiments of the present disclosure, the meta information may be information related to the processing of the key-value data, for example, the Key Class information of the task, the storage path of the file package containing the key-value-data processing logic, the type of the computing engine to which the task belongs, and the like, which are not described herein in detail. The key class information may be used to uniquely identify a class. A class is a construct in an object-oriented programming language that describes the behavior rules of objects, which are called instances of the class.
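The meta-information fields enumerated above could be carried by a structure along the following lines; this is a hedged sketch, and the class and field names are illustrative rather than taken from the patent:

```java
// Hypothetical sketch of the meta information described above; field names
// are illustrative, not from the patent.
public class TaskMetaInfo {
    private final String keyClassName;   // key class information of the task
    private final String sortJarPath;    // storage path of the sorting-logic file package
    private final String engineType;     // type of computing engine, e.g. "spark" or "mapreduce"

    public TaskMetaInfo(String keyClassName, String sortJarPath, String engineType) {
        this.keyClassName = keyClassName;
        this.sortJarPath = sortJarPath;
        this.engineType = engineType;
    }

    public String getKeyClassName() { return keyClassName; }
    public String getSortJarPath()  { return sortJarPath; }
    public String getEngineType()   { return engineType; }
}
```

Such a structure would be what the compute engine registers with the global management component and what the shuffle processing node later consults when choosing sorting logic.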
The meta-information and the task have a mapping relation, so that the server can acquire the meta-information corresponding to the task to be processed according to the mapping relation after acquiring the task to be processed.
In step S220, the to-be-processed task is processed by the mapping task in the computing engine to obtain key value data, and the key value data is transmitted to the shuffle processing node.
The shuffle processing node is an independently packaged node deployed outside the computing engine; it has its own attributes and methods, and can independently perform processing such as sorting key-value data.
Specifically, after the computing engine starts the task to be processed, it starts the mapping tasks to process the task, generating a series of key-value data. For each piece of key-value data output by each mapping task, the partition can be obtained by computing the hash value of the key and taking that hash value modulo the number of reduction tasks. The mapping task then aggregates the key-value data belonging to the same partition and sends the aggregated key-value data to the shuffle processing node.
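The hash-then-modulo partitioning step just described can be sketched as follows; the sign-bit masking is an assumption borrowed from common Hadoop-style partitioners, not something the patent specifies:

```java
// Sketch of the partitioning step described above: hash each key, then take
// the result modulo the number of reduction tasks to pick the target partition.
public class HashPartitioner {
    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is always non-negative,
        // mirroring what Hadoop-style partitioners do with hashCode().
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the key and the number of reduction tasks, every mapping task routes identical keys to the same partition, which is what allows the per-partition aggregation that follows.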
In step S230, sorting logic corresponding to the meta information is executed by the shuffle processing node to sort the key-value data.
Specifically, after receiving the key value data corresponding to each partition, the shuffle processing node obtains the sorting logic corresponding to the meta information of the task to be processed from the mapping relationship between the pre-deployed sorting logic and the meta information, and loads the sorting logic to sort the key value data corresponding to each partition. The sorted key-value data corresponding to the respective partitions are stored into a file system (e.g., HDFS, Hadoop distributed file system).
In one example, the meta information includes the computing-engine type corresponding to the pending task. The sorting logic pre-deployed for Spark is hash-based shuffle, and the sorting logic for MapReduce is sort-based shuffle. If the computing-engine type of the task is Spark, then after receiving the key-value data aggregated by partition, the shuffle processing node can directly execute the hash-based shuffle mode without sorting the key-value data of each partition. If the computing-engine type is MapReduce, the shuffle processing node sorts the key-value data of each partition in the sorting mode corresponding to MapReduce after receiving the aggregated key-value data.
In another example, the meta information includes key class information corresponding to the task to be processed. And pre-deploying the mapping relation between the key class information and the sequencing logic. After receiving the key value data aggregated according to the partitions, the shuffle processing node may sort the key value data of each partition according to the sorting logic corresponding to the key information of the task to be processed, based on the mapping relationship between the key information and the sorting logic.
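The pre-deployed mapping between key class information and sorting logic described in this example could be sketched as a simple registry; the class and method names here are hypothetical, and a lookup miss is shown as the trigger for falling back to loading the Jar from the storage path in the meta information:

```java
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the pre-deployed mapping between key class
// information and sorting logic; names are illustrative, not from the patent.
public class SortLogicRegistry {
    private final Map<String, Comparator<String>> byKeyClass = new ConcurrentHashMap<>();

    void register(String keyClassName, Comparator<String> cmp) {
        byKeyClass.put(keyClassName, cmp);
    }

    // Returns null when no sorting logic is deployed for this key class, in
    // which case the shuffle processing node can fall back to loading the
    // sorting-logic file package from the storage path in the meta information.
    Comparator<String> lookup(String keyClassName) {
        return byKeyClass.get(keyClassName);
    }
}
```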
In one embodiment, after the shuffle processing node finishes processing the tasks to be processed, the computing engine may start the reduction tasks, so that each reduction task pulls from the file system to obtain the key value data under the partition corresponding to the reduction task.
In this key-value data processing method, an independently packaged shuffle processing node is deployed outside the computing engine. After the task to be processed is handled by the mapping task to obtain key-value data, the shuffle processing node can obtain the sorting logic corresponding to the meta information of the task and sort the key-value data with that logic. When multiple computing engines coexist, the key-value data generated by the various engines can be processed by one universal sorting mechanism, so the sorting requirements of the key-value data of all of the engines can be met at the same time. In addition, because the shuffle processing node is independently deployed, the embodiments of the disclosure minimize modification of the native computing-engine code through decoupling, offer high universality, and are convenient to deploy and maintain.
In an exemplary embodiment, the method further comprises: registering the meta-information to a global management component; the meta-information is sent to the shuffle processing node by a global management component.
The global management component is an independently packaged component deployed outside the compute engine, and may be used for, but not limited to, global resource scheduling, global task management, lifecycle management of the shuffle processing node, and processing a heartbeat request of the shuffle processing node.
Specifically, after the computing engine starts the to-be-processed task, the meta information of the task may be registered to the global management component. In one embodiment, the global management component may push the meta information of the pending task to the shuffle processing node. In another embodiment, the shuffle processing node may actively obtain the meta information of the pending task from the registration information of the global management component after receiving the aggregated key-value data corresponding to each partition.
In the embodiment, the independently packaged global management component is deployed outside the computing engine, so that resources can be centrally scheduled, information can be centrally managed, and global control and management are facilitated.
In an exemplary embodiment, the number of shuffle processing nodes is plural. As shown in fig. 3, transmitting the key value data to the shuffle processing node in step S220 includes:
in step S310, a mapping relationship between the partition and the shuffle processing node is obtained by the shuffle write node, where the mapping relationship is constructed in advance by the global management component according to the partition of the task to be processed.
The shuffle write node may be an independently packaged node deployed outside the computing engine. In one embodiment, the shuffle write node may be deployed alongside a mapping task of the computing engine, belonging to the same process as the mapping task and sharing the same JVM (Java Virtual Machine). In one example, the shuffle write node may be embedded as an SDK (Software Development Kit) in the mapping task on the computing-engine side.
Specifically, after the to-be-processed task is started, the global management component obtains the partitions of the task, generates a mapping relation between shuffle processing nodes and partitions, and sends the mapping relation to the shuffle write node. In one example, the pending task includes N partitions, and the global management component may determine N currently idle nodes among the shuffle processing nodes and generate a one-to-one correspondence between the partitions and those shuffle processing nodes.
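The one-to-one assignment of partitions to idle shuffle processing nodes described in this example can be sketched as follows; the class name and the behavior when too few nodes are idle are assumptions for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the one-to-one mapping the global management component builds
// between the task's partitions and currently idle shuffle processing nodes.
public class PartitionAssigner {
    static Map<Integer, String> assign(int numPartitions, List<String> idleNodes) {
        if (idleNodes.size() < numPartitions) {
            // Assumed behavior: fail fast when there are not enough idle nodes.
            throw new IllegalStateException("not enough idle shuffle processing nodes");
        }
        Map<Integer, String> mapping = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            mapping.put(p, idleNodes.get(p)); // partition p -> one dedicated node
        }
        return mapping;
    }
}
```

The resulting map is what the shuffle write node would consult to route each partition's aggregated key-value data to its dedicated shuffle processing node.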
In step S320, the key value data corresponding to each partition is determined by the shuffle write node, and the key value data corresponding to each partition is transmitted to the shuffle processing node corresponding to each partition in accordance with the mapping relationship.
Specifically, after the mapping task processes the to-be-processed task to obtain key value data, the key value data are sent to the shuffle write-in node, and the shuffle write-in node sends the key value data corresponding to each partition to the shuffle processing node corresponding to each partition according to the mapping relation between the partitions and the shuffle processing node.
In one embodiment, the shuffle write nodes may have a one-to-one correspondence with the mapping tasks. That is, if there are M mapping tasks, a shuffle write node corresponding to each mapping task is deployed. Each shuffle write node is configured to send the key value data sent by the corresponding mapping task to the shuffle processing node corresponding to the partition, respectively.
In this embodiment, in step S230, the shuffle processing node runs the sorting logic corresponding to the meta information to sort the key-value data, which may be implemented by step S330. In step S330, sorting logic is run by the shuffle processing node corresponding to each partition to sort the key value data corresponding to each partition.
In this embodiment, by deploying the independently packaged shuffle write node and the multiple shuffle processing nodes, and establishing a mapping relationship between the partitions and the shuffle processing nodes, the output data of multiple mapping tasks can be aggregated by partition, in contrast to the isolated per-mapping-task data processing of the related art. As a result, a reduction task does not need to pull data from a local file of each mapping task, which reduces the number of data IO (input/output) operations and improves the input/output efficiency of the shuffle.
In an exemplary embodiment, as shown in fig. 4, a first class loader is deployed in the shuffle processing node, and the meta information includes a storage path of the sorting logic file package.
A class loader is responsible for loading classes and generates a Class object for each class it loads into memory. Once a class has been loaded into a JVM, the same class is not loaded again. As described in the above embodiments, each class corresponds to unique key class information. The sorting logic file package refers to a file package formed by packaging the classes written in code, for example a Jar (Java Archive) package.
In this embodiment, in step S230, the shuffle processing node runs the sorting logic corresponding to the meta information, which may specifically be implemented by the following steps:
in step S410, the sorting logic file package stored in the storage path is obtained by the first class loader in the shuffle processing node.
In step S420, the sorting logic file package is loaded through the first class loader to run the sorting logic.
In one embodiment, when the task to be processed is started, a storage path of the sorting logic file package can be obtained and registered with the global management component, so that the shuffle processing node obtains the storage path from the global management component. The first class loader in the shuffle processing node acquires the sorting logic file package stored in the storage path, loads it, and runs it to sort the key value data under each partition.
In another embodiment, the sorting logic file package of the task to be processed can be acquired when the task to be processed is started, and uploaded to a file system. A storage path of the sorting logic file package in the file system is then acquired and registered with the global management component, so that the shuffle processing node obtains the storage path from the global management component. The first class loader in the shuffle processing node acquires the sorting logic file package stored in the storage path, loads it, and runs it to sort the key value data under each partition.
In another embodiment, the tasks to be processed and the first class loaders may have a one-to-one correspondence. Instantiating a dedicated first class loader for each task to be processed ensures class loading isolation at the task level and prevents the classes of different tasks from affecting one another.
FIG. 5 illustrates a diagram of processing key-value data based on a specified sorting logic file package. As shown in fig. 5, this can be achieved by the following steps:
(1) When a task to be processed (Job) is started, a sorting logic file package (Jar package) containing the custom sorting logic is uploaded to remote HDFS storage.
(2) The storage path of the sorting logic file package on HDFS, together with the key class information of the sorting logic file package, is registered with the global management component.
(3) The shuffle processing node acquires the meta information of the task to be processed from the global management component, where the meta information includes the storage path of the sorting logic file package.
(4) When the shuffle processing node determines that the key value data has reached a certain threshold and needs to be written out to HDFS, it dynamically pulls the Jar package from HDFS to the local node on demand and loads the corresponding classes into memory.
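Step (4), pulling the package on demand and caching it locally, can be modeled as below. `fetch_from_hdfs` stands in for a real HDFS client call, and the local-cache layout is an assumption; the patent does not prescribe an API.

```python
import os

def ensure_package(remote_path, local_dir, fetch_from_hdfs):
    """Pull the sorting logic file package from remote storage once, on demand."""
    local_path = os.path.join(local_dir, os.path.basename(remote_path))
    if not os.path.exists(local_path):      # only fetch if not already cached
        fetch_from_hdfs(remote_path, local_path)
    return local_path
```

A second request for the same package hits the local copy, avoiding repeated network IO, which is the behavior the later embodiments optimize further with class-loader caches.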
In this embodiment, allowing the user to upload a custom sorting logic file package flexibly satisfies advanced sorting requirements. By deploying the first class loader, class loading is performed at the task level, so that class loading for different tasks does not interfere; task-level class loading isolation is achieved, and data consistency can thus be ensured.
In an exemplary embodiment, in some cases, a large proportion of tasks are submitted through a preset tool and share the same key class information. For example, in some systems 90% of tasks are submitted via Hive (a data warehouse facility), and the pending tasks submitted via Hive have the same key class information. Therefore, in the present embodiment, a second class loader is also deployed in the shuffle processing node. The second class loader may be regarded as a cache layer for the first class loader. Before the sorting logic file package stored in the storage path is obtained by the first class loader in the shuffle processing node in step S410, the following processing may also be performed.
Specifically, after the shuffle processing node receives the key value data corresponding to each partition, and it is determined that the task to be processed was submitted through the preset tool, the first class loader checks whether the sorting logic corresponding to the preset tool exists in the second class loader. If the first class loader determines that the second class loader does not hold the sorting logic corresponding to the preset tool, it proceeds to acquire the storage path of the sorting logic file package and obtains the sorting logic file package stored there, so as to sort the key value data. If the first class loader determines that the second class loader does hold the sorting logic corresponding to the preset tool, that sorting logic can be loaded through the first class loader to sort the key value data of each partition.
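The cache-first lookup this embodiment describes might look as follows, with a plain dictionary standing in for the second class loader's cache and a callback standing in for the first class loader's pull-and-load path; all names here are hypothetical.

```python
def resolve_sorting_logic(task, shared_cache, load_package_from_path):
    """Prefer the shared cache for preset-tool tasks; otherwise pull and load."""
    if task.get("submitted_by") == "hive" and "hive" in shared_cache:
        return shared_cache["hive"]         # cache hit: no pull, no load
    # Cache miss (or not a preset-tool task): fall back to the storage path.
    return load_package_from_path(task["storage_path"])
```

The point of the design is visible in the first branch: for the dominant class of tasks the storage path is never consulted at all.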
In this embodiment, the sorting logic of a task to be processed is predicted from the tool that submitted it, which reduces the number of times the sorting logic file package must be pulled and loaded. This improves the processing efficiency of the key value data, reduces the network and disk IO consumed by loading the sorting logic file package, and improves the performance of the shuffle service.
In an exemplary embodiment, a third class loader is also deployed in the shuffle processing node. The third class loader may be regarded as a cache layer of the first class loader and may be the same loader as the second class loader. In this embodiment, the meta information further includes the key class information of the task to be processed. Before the sorting logic file package stored in the storage path is obtained by the first class loader in the shuffle processing node in step S410, the method further includes: determining, by the first class loader, that no sorting logic corresponding to the key class information exists in the third class loader.
Specifically, after the shuffle processing node receives the key value data corresponding to each partition, the first class loader checks whether sorting logic corresponding to the key class information exists in the third class loader. If not, the storage path of the sorting logic file package is acquired and the sorting logic file package stored there is loaded. If so, the sorting logic corresponding to the key class information may be run through the first class loader.
In one embodiment, priorities may be configured among the key class information, the preset tool, and the sorting logic file package, and the key value data is sorted using the available sorting mode with the highest priority.
In this embodiment, by carrying the key class information in the task to be processed and checking in advance whether sorting logic corresponding to that key class information already exists, the number of times the sorting logic file package must be pulled and loaded can be reduced, thereby increasing the processing efficiency of the key value data and reducing the network and disk IO consumed by loading the sorting logic file package.
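One possible reading of the configurable priority, sketched with invented source names: try each candidate sorting-logic source in priority order and take the first one that is available.

```python
def pick_sorting_logic(sources, priority=("key_class", "preset_tool", "file_package")):
    """Return (source_name, logic) for the highest-priority available source."""
    for name in priority:
        logic = sources.get(name)
        if logic is not None:
            return name, logic
    raise LookupError("no sorting logic available")
```

The default ordering above is an assumption; the patent only says the priorities "may be configured", so a deployment could pass any permutation as `priority`.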
In a particular embodiment, FIG. 6 shows an internal flow diagram of a shuffle processing node. It is to be understood that, where there are multiple shuffle processing nodes, fig. 6 shows the workflow of each of them. In fig. 6, the second class loader and the third class loader are the same, globally shared class loader (CommonClassLoader in fig. 6). There are multiple first class loaders (SessionClassLoader), with a one-to-one correspondence between tasks to be processed and first class loaders. The flow can be realized by the following steps:
(1) When the shuffle processing node starts, it acquires the meta information of the task to be processed from the global management component, and pre-loads the existing sorting logic file packages from the local database through the second class loader.
(2) Using the first class loader corresponding to the task to be processed, check whether the second class loader holds sorting logic corresponding to the key class information of the preset tool; or, check whether the second class loader holds sorting logic corresponding to the key class information of the task to be processed.
(3) If neither exists in step (2), acquire the sorting logic file package from HDFS according to the storage path contained in the meta information, and run it to sort the key value data corresponding to each partition.
FIG. 7 is a flowchart illustrating a method of processing key-value data according to an exemplary embodiment, as applied to a shuffle service deployed outside of a compute engine. FIG. 8 illustrates a schematic diagram of the shuffle service. As shown in fig. 8, the shuffle service includes an independently packaged App Shuffle Master (ASM), a global management component (Shuffle Master), a shuffle write node (Shuffle Writer), a shuffle processing node (Shuffle Worker), and a shuffle read node (Shuffle Reader). The functions of the respective components are explained below.
Global management component: responsible for global resource scheduling, global task management, life cycle management of the shuffle processing nodes, handling heartbeat requests from the shuffle processing nodes, and the like.
Task management component: it can be deployed in correspondence with the task manager (ApplicationMaster) in the compute engine, belongs to the same process as the task manager, and shares the same JVM. In one example, the task management component may be embedded in the task manager as an SDK. It is responsible for resource management of individual tasks, handling RPC (Remote Procedure Call) requests from the shuffle write nodes and shuffle read nodes, and managing the life cycles of the shuffle write nodes and shuffle read nodes.
Shuffle write node: it can be embedded as an SDK in a mapping task on the compute engine side and is responsible for sending the key value data produced by the mapping task to the corresponding shuffle processing nodes along the partition dimension. It exits safely after the shuffle processing nodes have fully persisted the sorted key-value data.
Shuffle processing node: it aggregates and sorts the key value data along the partition dimension and spills it to the remote HDFS. After the spill completes, it notifies the task management component and the shuffle write node of the persisted result.
Shuffle read node: it can be deployed in correspondence with a reduction task on the compute engine side, belongs to the same process as the reduction task, and shares the same JVM. In one embodiment, the shuffle read node may be embedded in the reduction task as an SDK. In another embodiment, the shuffle read nodes and the reduction tasks may have a one-to-one correspondence. The shuffle read node is responsible for pulling the set of shuffle files to be processed from HDFS, deduplicating them locally according to consistency metadata, and returning them to the compute engine side.
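The shuffle processing node's core duty above, aggregating key-value data from all write nodes by partition and sorting each partition before spilling, can be sketched as follows. Sorted in-memory lists stand in for the spill files written to HDFS, and the `(partition, key, value)` record layout is an assumption.

```python
from collections import defaultdict

def aggregate_and_sort(incoming):
    """incoming: iterable of (partition, key, value) from all shuffle write nodes."""
    by_partition = defaultdict(list)
    for partition, key, value in incoming:
        by_partition[partition].append((key, value))
    # Sort each partition's records by key before "spilling".
    return {p: sorted(kvs) for p, kvs in by_partition.items()}
```

Because records from many mapping tasks land in the same per-partition bucket, a reduction task later reads one aggregated, sorted stream per partition instead of M scattered local files.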
A method of processing key-value data will be described with reference to fig. 7, taking a Mapreduce Job as an example of the task to be processed, in combination with the shuffle service shown in fig. 8.
In step S702, a Mapreduce Job is started, whereupon the task manager (ApplicationMaster) in Mapreduce starts, and with it the task management component.
In step S704, the sorting logic file package corresponding to the Mapreduce Job is stored in HDFS, and the corresponding storage path is obtained. The meta information of the Mapreduce Job is then registered with the global management component; the meta information includes the storage path and the key class information of the sorting logic file package of the Mapreduce Job.
In step S706, the task management component applies to the global management component for resources and requests the global management component to determine the mapping relationship between the partitions of the task to be processed and the shuffle processing nodes. After the application succeeds, the shuffle service mode is enabled.
In step S708, the task manager starts the mapping tasks. Each shuffle write node starts with its mapping task, the two sharing the same process. After starting, the shuffle write node obtains the mapping relationship between the partitions and the shuffle processing nodes from the task management component.
In step S710, each mapping task writes the key value data it produces into the local buffer of its corresponding shuffle write node through an interface. The shuffle write node then actively sends the key value data in the buffer to the shuffle processing node corresponding to each partition, according to the mapping relationship between the partitions and the shuffle processing nodes.
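Step S710's buffer-then-send behavior might be modeled like this. The threshold-triggered flush and the `send` callback are illustrative choices, since the patent only says the write node "actively sends" the buffered data.

```python
class BufferedWriter:
    """Per-partition buffering in a shuffle write node (illustrative sketch)."""

    def __init__(self, send, threshold=2):
        self.send = send            # send(partition, batch): deliver to the worker
        self.threshold = threshold  # flush a partition once it holds this many records
        self.buffers = {}

    def write(self, partition, key, value):
        batch = self.buffers.setdefault(partition, [])
        batch.append((key, value))
        if len(batch) >= self.threshold:    # buffer full: push to the shuffle worker
            self.send(partition, batch)
            self.buffers[partition] = []
```

Batching amortizes the RPC cost: the mapping task writes record-by-record, but the shuffle processing node receives fewer, larger messages.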
In step S712, each shuffle processing node sorts the key value data according to the partition dimension, and persists the sorted key value data to the HDFS. The sorting of the key value data can be specifically realized by the following steps:
(1) When the Mapreduce Job was submitted via the preset tool (Hive), the first class loader checks whether the second class loader holds sorting logic corresponding to the key class information of Hive. If so, that sorting logic is run to sort the key value data under each partition.
(2) When the Mapreduce Job was not submitted via Hive, the first class loader checks whether the second class loader holds sorting logic corresponding to the key class information of the Mapreduce Job. If so, that sorting logic is run to sort the key value data under each partition.
(3) When the second class loader holds no sorting logic corresponding to the key class information of the Mapreduce Job, the storage path of the sorting logic file package is acquired from the global management component, the sorting logic file package is fetched from that path, and it is loaded to sort the key value data under each partition.
In step S714, each shuffle processing node reports to the task management component the HDFS storage paths of the key-value data sorted under each partition.
In step S716, the task manager starts the reduction tasks after the mapping phase is completed. Each shuffle read node starts with its reduction task. The started shuffle read node acquires the storage paths corresponding to the partitions from the task management component, reads the key value data from HDFS according to those paths, deduplicates it locally, and returns it to the reduction task on the compute engine side.
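The local deduplication the shuffle read node performs against consistency metadata is not spelled out in the patent. One plausible sketch, assuming retries or speculative execution produce duplicate map attempts and the metadata records the committed attempt per mapping task:

```python
def dedup(records, committed_attempts):
    """records: (map_id, attempt_id, key, value); keep committed attempts only."""
    return [(k, v) for map_id, attempt_id, k, v in records
            if committed_attempts.get(map_id) == attempt_id]
```

Records from a non-committed attempt of the same mapping task are dropped, so the reduction task sees each map output exactly once.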
It should be understood that, although the steps in the above flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, there is no strict ordering restriction, and the steps may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or in alternation with other steps or with sub-steps or stages of other steps.
It is understood that the same or similar parts of the method embodiments described above in this specification may be referred to one another; each embodiment focuses on its differences from the other embodiments, and for the common points reference may be made to the descriptions of the other method embodiments.
Fig. 9 is a block diagram illustrating a processing apparatus X00 of key-value data according to an exemplary embodiment. Referring to fig. 9, the apparatus includes an obtaining module X02, a mapping module X04, and a data sorting module X06.
An obtaining module X02 configured to obtain the task to be processed and the meta information corresponding to the task to be processed; a mapping module X04 configured to process the task to be processed through a mapping task in the compute engine to obtain key value data, and transmit the key value data to a shuffle processing node, where the shuffle processing node is a node independently packaged outside the compute engine; and a data sorting module X06 configured to run, through the shuffle processing node, sorting logic corresponding to the meta information to sort the key value data.
In an exemplary embodiment, a first class loader is deployed in the shuffle processing node, and the meta information includes a storage path of the sorting logic file package. The data sorting module X06 includes: a package obtaining unit configured to obtain, through the first class loader in the shuffle processing node, the sorting logic file package stored in the storage path; and a first loading unit configured to load the sorting logic file package through the first class loader so as to run the sorting logic.
In an exemplary embodiment, a second class loader is also deployed in the shuffle processing node. The data sorting module X06 further includes: a submission tool determining unit configured to, when the task to be processed is submitted through a preset tool, determine through the first class loader that no sorting logic corresponding to the preset tool exists in the second class loader.
In an exemplary embodiment, the data sorting module X06 further includes: a second loading unit configured to run, through the first class loader, the sorting logic corresponding to the preset tool when the first class loader determines that such sorting logic exists in the second class loader.
In an exemplary embodiment, a third class loader is also deployed in the shuffle processing node, and the meta information further includes the key class information of the task to be processed. The data sorting module X06 further includes: a query unit configured to determine, through the first class loader, that no sorting logic corresponding to the key class information exists in the third class loader.
In an exemplary embodiment, the data sorting module X06 further includes: a third loading unit configured to run the sorting logic corresponding to the key class information when the first class loader determines that such sorting logic already exists in the third class loader.
In an exemplary embodiment, the apparatus X00 further includes: the registration module is configured to register the meta information to a global management component, and the global management component is a component independently packaged outside the computing engine; an information sending module configured to perform sending the meta information to the shuffle processing node through the global management component.
In an exemplary embodiment, there are multiple shuffle processing nodes. The mapping module X04 includes: a mapping relationship obtaining unit configured to obtain, through the shuffle write node, the mapping relationship between the partitions and the shuffle processing nodes, where the mapping relationship is pre-constructed by the global management component according to the partitions of the task to be processed, and the shuffle write node is a node independently packaged outside the compute engine; and a data sending unit configured to determine, through the shuffle write node, the key value data corresponding to each partition and send it to the shuffle processing node corresponding to that partition according to the mapping relationship. The data sorting module X06 is configured to run the sorting logic through the shuffle processing node corresponding to each partition to sort the key value data corresponding to that partition.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 10 is a block diagram illustrating an electronic device S00 for processing key-value data, according to an example embodiment. For example, the electronic device S00 may be a server. Referring to FIG. 10, electronic device S00 includes a processing component S20 that further includes one or more processors and memory resources represented by memory S22 for storing instructions, such as applications, that are executable by processing component S20. The application program stored in the memory S22 may include one or more modules each corresponding to a set of instructions. Further, the processing component S20 is configured to execute instructions to perform the above-described method.
The electronic device S00 may further include: the power supply module S24 is configured to perform power management of the electronic device S00, the wired or wireless network interface S26 is configured to connect the electronic device S00 to a network, and the input/output (I/O) interface S28. The electronic device S00 may operate based on an operating system stored in the memory S22, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory S22 comprising instructions, executable by the processor of the electronic device S00 to perform the above method is also provided. The storage medium may be a computer-readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising instructions executable by a processor of the electronic device S00 to perform the above method.
It should be noted that the descriptions of the above-mentioned apparatus, the electronic device, the computer-readable storage medium, the computer program product, and the like according to the method embodiments may also include other embodiments, and specific implementations may refer to the descriptions of the related method embodiments, which are not described in detail herein.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for processing key-value data, the method comprising:
acquiring a task to be processed and meta-information corresponding to the task to be processed;
processing the task to be processed through a mapping task in a calculation engine to obtain key value data, and transmitting the key value data to a shuffle processing node, wherein the shuffle processing node is a node independently packaged outside the calculation engine;
executing, by the shuffle processing node, a sort logic corresponding to the meta information to sort the key-value data.
2. The method of processing key-value data of claim 1, wherein a first class loader is deployed in the shuffle processing node, and wherein the meta information includes a storage path of a sorting logic file package;
the executing, by the shuffle processing node, of the sorting logic corresponding to the meta information comprises:
obtaining, by the first class loader in the shuffle processing node, the sorting logic file package stored in the storage path;
and loading the sorting logic file package through the first class loader to run the sorting logic.
3. The method of processing key-value data of claim 2, wherein a second class loader is further deployed in the shuffle processing node;
prior to the obtaining, by the first class loader in the shuffle processing node, of the sorting logic file package stored in the storage path, the method further comprises:
when the task to be processed is submitted through a preset tool, determining, through the first class loader, that no sorting logic corresponding to the preset tool exists in the second class loader.
4. The method of processing key-value data according to claim 3, characterized in that the method further comprises:
when the first class loader determines that the sorting logic corresponding to the preset tool exists in the second class loader, running, through the first class loader, the sorting logic corresponding to the preset tool.
5. The method of processing key-value data of claim 2, wherein a third class loader is further deployed in the shuffle processing node; the meta information further comprises key class information of the task to be processed;
prior to the obtaining, by the first class loader in the shuffle processing node, of the sorting logic file package stored in the storage path, the method further comprises:
determining, by the first class loader, that no sorting logic corresponding to the key class information exists in the third class loader.
6. The method of processing key-value data according to claim 5, characterized in that the method further comprises:
and when the first class loader determines that the sorting logic corresponding to the key class information exists in the third class loader, running the sorting logic corresponding to the key class information.
7. An apparatus for processing key-value data, the apparatus comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is configured to acquire a task to be processed and meta-information corresponding to the task to be processed;
the mapping module is configured to process the to-be-processed task through a mapping task in a computing engine to obtain key value data, and transmit the key value data to a shuffle processing node, wherein the shuffle processing node is a node independently packaged outside the computing engine;
a data sorting module configured to run, through the shuffle processing node, sorting logic corresponding to the meta information to sort the key value data.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a method of processing key-value data according to any one of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of processing key-value data of any one of claims 1 to 6.
10. A computer program product comprising instructions which, when executed by a processor of an electronic device, enable the electronic device to carry out the method of processing key-value data according to any one of claims 1 to 6.
CN202111555578.XA 2021-12-17 2021-12-17 Key value data processing method and device, electronic equipment and storage medium Pending CN114237892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111555578.XA CN114237892A (en) 2021-12-17 2021-12-17 Key value data processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114237892A true CN114237892A (en) 2022-03-25

Family

ID=80758403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111555578.XA Pending CN114237892A (en) 2021-12-17 2021-12-17 Key value data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114237892A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116048817A (en) * 2023-03-29 2023-05-02 腾讯科技(深圳)有限公司 Data processing control method, device, computer equipment and storage medium
CN116048817B (en) * 2023-03-29 2023-06-27 腾讯科技(深圳)有限公司 Data processing control method, device, computer equipment and storage medium
CN116205411A (en) * 2023-04-27 2023-06-02 山东铁路投资控股集团有限公司 Material consumption checking method, device, equipment and medium based on big data
Similar Documents

Publication Publication Date Title
US11681564B2 (en) Heterogeneous computing-based task processing method and software and hardware framework system
CN113243005B (en) Performance-based hardware emulation in an on-demand network code execution system
US10831562B2 (en) Method and system for operating a data center by reducing an amount of data to be processed
CN107710161B (en) Independent networkable hardware accelerator for increased workflow optimization
CN110262901B (en) Data processing method and data processing system
US9304835B1 (en) Optimized system for analytics (graphs and sparse matrices) operations
US20160188391A1 (en) Sophisticated run-time system for graph processing
CN109447274B (en) Distributed system for performing machine learning and method thereof
CN114237892A (en) Key value data processing method and device, electronic equipment and storage medium
CN103930875A (en) Software virtual machine for acceleration of transactional data processing
CN107729353B (en) Distributed system for performing machine learning and method thereof
CN114237510B (en) Data processing method, device, electronic equipment and storage medium
CN112256414A (en) Method and system for connecting multiple computing storage engines
US11880703B2 (en) Optimization of multi-layered images
CN116302574B (en) Concurrent processing method based on MapReduce
CN113485830A (en) Micro-service automatic capacity expansion method for power grid monitoring system
CN114237891A (en) Resource scheduling method and device, electronic equipment and storage medium
CN113886111B (en) Workflow-based data analysis model calculation engine system and operation method
US11144359B1 (en) Managing sandbox reuse in an on-demand code execution system
US9659041B2 (en) Model for capturing audit trail data with reduced probability of loss of critical data
US8832176B1 (en) Method and system for processing a large collection of documents
CN115629860A (en) Software parameter tuning method, container management platform, storage medium and system
CN115756811A (en) Data transmission method, device, equipment and storage medium
US11734291B2 (en) Parallel execution of API calls using local memory of distributed computing devices
US20240354218A1 (en) Machine learning pipeline performance acceleration with optimized data access interfaces using in-memory and distributed store

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination