CN116185559A - Data processing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN116185559A
Authority
CN
China
Prior art keywords
processing
node
processing node
candidate
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211586510.2A
Other languages
Chinese (zh)
Inventor
周昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202211586510.2A
Publication of CN116185559A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/4557 Distribution of virtual machine instances; Migration and load balancing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Retry When Errors Occur (AREA)
  • Hardware Redundancy (AREA)

Abstract

One or more embodiments of the present disclosure provide a data processing method, apparatus, device, and storage medium, applied to a management node in a cluster. The method comprises: in response to acquiring a data processing task, locking a data table that stores node information of a plurality of processing nodes in the cluster, and traversing the node information stored in the locked data table; for each piece of traversed node information, determining whether the number of processing nodes in a candidate processing node set is smaller than a first threshold, if the number is smaller than the first threshold, determining whether the processing node corresponding to the node information is operating normally, and if it is operating normally, adding the processing node to the candidate processing node set; determining a target processing node from the candidate processing node set based on the weights corresponding to the respective processing nodes in the candidate processing node set; and allocating the data processing task to the target processing node and unlocking the data table.

Description

Data processing method, device, equipment and storage medium
Technical Field
One or more embodiments of the present disclosure relate to the field of computer application, and in particular, to a data processing method, apparatus, device, and storage medium.
Background
Today, as the volume of business data, user data, and other data continues to grow, applications that process such data face ever-increasing demands on scalability and availability. These demands are often difficult to meet on a single device, which has given rise to clustered and distributed systems. An application can be deployed on a cluster or distributed system whose number of devices is sized to the data volume, thereby improving the application's ability to provide services externally. In a clustered or distributed system, a device is often referred to as a node; a node can be a physical device such as a physical server, or a virtual device such as a virtual server, a cloud server, a virtual machine, or a Docker container.
For an application program deployed on a cluster, data processing tasks generated by the application program in the running process are generally distributed to different nodes in the cluster, and each node executes the data processing tasks distributed to the node, so that load balancing is realized on the different nodes in the cluster, and the problem that the data processing efficiency is low due to the fact that a single node executes excessive data processing tasks is avoided.
In a cluster whose nodes are virtual devices, new virtual devices are usually created continuously according to actual requirements and added to the cluster as new nodes, while old virtual devices are destroyed so that the old nodes are removed from the cluster; the nodes in the cluster therefore change dynamically. How to achieve load balancing across the nodes of such a dynamically changing cluster is a problem to be solved.
Disclosure of Invention
One or more embodiments of the present disclosure provide the following technical solutions:
The specification provides a data processing method applied to a management node in a cluster; the cluster further includes a plurality of processing nodes; the management node maintains a data table for storing node information of the plurality of processing nodes; the method comprises the following steps:
in response to acquiring a data processing task, locking the data table, and traversing the node information stored in the locked data table;
for traversed node information, determining whether the number of processing nodes in a candidate processing node set is smaller than a first threshold, if the number is smaller than the first threshold, determining whether the processing node corresponding to the node information is operating normally, and if the processing node is operating normally, adding the processing node to the candidate processing node set;
determining a target processing node from the candidate processing node set based on the weights corresponding to the respective processing nodes in the candidate processing node set;
and allocating the data processing task to the target processing node, and unlocking the data table.
The specification also provides a data processing apparatus applied to a management node in a cluster; the cluster further includes a plurality of processing nodes; the management node maintains a data table for storing node information of the plurality of processing nodes; the apparatus comprises:
a traversing module, configured to lock the data table in response to acquiring a data processing task, and traverse the node information stored in the locked data table;
an adding module, configured to determine, for traversed node information, whether the number of processing nodes in a candidate processing node set is smaller than a first threshold, if the number is smaller than the first threshold, determine whether the processing node corresponding to the node information is operating normally, and if the processing node is operating normally, add the processing node to the candidate processing node set;
a determining module, configured to determine a target processing node from the candidate processing node set based on the weights corresponding to the respective processing nodes in the candidate processing node set;
and an allocation module, configured to allocate the data processing task to the target processing node and unlock the data table.
The present specification also provides an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the steps of the method as described in any of the preceding claims by executing the executable instructions.
The present specification also provides a computer readable storage medium having stored thereon computer instructions which when executed by a processor perform the steps of the method as claimed in any one of the preceding claims.
In the above technical solution, a management node in a cluster may maintain a data table for storing node information of a plurality of processing nodes in the cluster. In response to acquiring a data processing task, the management node may lock the data table and traverse the node information stored in the locked data table. For each piece of traversed node information, if the number of processing nodes in a candidate processing node set is less than a preset threshold and the processing node corresponding to the node information is operating normally, the processing node is added to the candidate processing node set. After the traversal is completed, a target processing node is determined from the candidate processing node set based on the weights corresponding to the respective processing nodes in the set, the data processing task is allocated to the target processing node, and the data table is unlocked.
With this approach, on the one hand, when the plurality of processing nodes in the cluster change dynamically, the node information stored in the data table maintained by the management node also changes dynamically; once the data table is locked, however, the node information stored in it no longer changes, which makes it convenient to allocate data processing tasks based on that information. On the other hand, because the traversal can end as soon as the number of processing nodes in the candidate processing node set reaches the preset threshold, less node information needs to be traversed, which shortens the traversal and improves the efficiency of allocating data processing tasks.
Drawings
Fig. 1 is a schematic architecture diagram of a cluster according to an exemplary embodiment of the present disclosure.
Fig. 2 is a schematic architecture diagram of a batch computing cluster according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flow chart of a data processing method according to an exemplary embodiment of the present disclosure.
Fig. 4 is a flow chart of a candidate processing node set generation phase according to an exemplary embodiment of the present disclosure.
Fig. 5 is a hardware configuration diagram of an electronic device in which a data processing apparatus is located according to an exemplary embodiment of the present disclosure.
Fig. 6 is a block diagram of a data processing apparatus according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
In order to achieve load balancing on different nodes in a cluster, in the related art, data processing tasks are typically distributed to the different nodes in the cluster in three ways.
The first is the polling mode. The polling mode refers to sequentially allocating data processing tasks to nodes in the cluster in the time sequence in which they are generated. In the polling mode, node information of all nodes in the cluster needs to be maintained, so that data processing tasks are sequentially distributed to the nodes in the cluster according to the sequence of the maintained node information. In the case that the nodes in the cluster are dynamically changed, the operations of adding, updating, modifying and removing are continuously performed on the maintained node information, so that it is difficult to determine the order of the maintained node information, that is, it is difficult to allocate the data processing tasks in a polling manner.
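As a rough illustration of the polling (round-robin) mode, the sketch below hands tasks to a fixed list of nodes in order. The `RoundRobin` class and the node names are hypothetical and not taken from this specification; the point is only that the scheme relies on a stable ordering of the maintained node information.

```python
class RoundRobin:
    """Assigns tasks to nodes in a fixed order (hypothetical helper)."""

    def __init__(self, nodes):
        # The order must stay stable for polling to be meaningful.
        self.nodes = list(nodes)
        self.next_index = 0

    def assign(self, task):
        # Pick the next node in order, wrapping around at the end of the list.
        node = self.nodes[self.next_index % len(self.nodes)]
        self.next_index += 1
        return node


rr = RoundRobin(["node-1", "node-2", "node-3"])
assignments = [rr.assign(f"task-{i}") for i in range(4)]
```

If nodes were added to or removed from `self.nodes` between calls, the position tracked by `next_index` would no longer correspond to a well-defined turn, which is the difficulty described above.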
The second way is a weighted way. The weighting mode refers to distributing more data processing tasks for nodes with higher weights in the cluster. In general, weights may be set for each node in the cluster according to actual needs. For example, the weight may be proportional to the data processing capacity, i.e., a higher weight may be set for nodes with more data processing capacity; alternatively, the weight may be proportional to the access ratio, i.e., a higher weight may be set for nodes with higher access ratios; etc. In the weighting mode, the weights corresponding to all the nodes in the cluster are required to be known, so as to determine the node with the highest weight in the cluster. In the case that the nodes in the cluster are dynamically changed, it is necessary to traverse all the nodes in the cluster once every time a data processing task is allocated to obtain weights corresponding to the respective nodes in the cluster, so if the cluster contains a huge number of nodes, a great deal of time is required to traverse the huge number of nodes when a data processing task is allocated, resulting in lower efficiency.
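The cost of the weighted mode can be sketched as follows, assuming the weights are held in a plain dictionary keyed by a hypothetical node id. Finding the highest weight requires looking at every entry, so the work grows with the number of nodes in the cluster.

```python
def pick_highest_weight(weights):
    """Return the node id with the largest weight; scans every entry."""
    best_id, best_weight = None, float("-inf")
    for node_id, weight in weights.items():  # full traversal of all nodes
        if weight > best_weight:
            best_id, best_weight = node_id, weight
    return best_id


# Hypothetical weights for three nodes.
weights = {"node-1": 2, "node-2": 5, "node-3": 3}
chosen = pick_highest_weight(weights)
```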
The third method is a minimum connection number method. The minimum connection number mode refers to that the current data processing task is distributed to the node with the minimum connection number in the cluster. In practical applications, a client may initiate a corresponding data processing task by sending a data processing request to a node in a cluster serving as a server, and the node may establish a connection with the client in response to the data processing request to execute the data processing task. Thus, for each node in a cluster, the number of connections corresponding to that node may represent the number of data processing tasks allocated to that node. In the minimum connection number method, the connection number corresponding to all the nodes in the cluster needs to be known to determine the node with the minimum connection number in the cluster. In the case that the nodes in the cluster are dynamically changed, it is necessary to traverse all the nodes in the cluster once every time a data processing task is allocated to obtain the number of connections corresponding to each node in the cluster, so if the cluster contains a huge number of nodes, a great deal of time is required to traverse the huge number of nodes when a data processing task is allocated, resulting in lower efficiency.
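A minimal sketch of the minimum connection number mode, assuming connection counts are tracked in a dictionary keyed by a hypothetical node id. As with the weighted mode, `min` has to inspect every node on each allocation.

```python
def pick_least_connections(connections):
    """Return the node id with the fewest connections (scans all nodes)."""
    return min(connections, key=connections.get)


# Hypothetical connection counts for three nodes.
connections = {"node-1": 12, "node-2": 4, "node-3": 9}
chosen = pick_least_connections(connections)
connections[chosen] += 1  # the new task opens one more connection on that node
```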
The present description aims to provide a technical solution for data processing that optimizes the process of allocating data processing tasks to nodes in a cluster, so that load balancing is achieved across the nodes in the cluster. In this technical solution, a management node in the cluster may maintain a data table for storing node information of a plurality of processing nodes in the cluster. In response to acquiring a data processing task, the management node may lock the data table and traverse the node information stored in the locked data table. For each piece of traversed node information, if the number of processing nodes in a candidate processing node set is smaller than a preset threshold and the processing node corresponding to the node information is operating normally, the processing node is added to the candidate processing node set. After the traversal is finished, a target processing node is determined from the candidate processing node set based on the weights corresponding to the processing nodes in the set, the data processing task is allocated to the target processing node, and the data table is unlocked.
In particular implementations, a cluster may include a management node and a plurality of processing nodes. Wherein the management node may maintain a data table for storing node information for the plurality of processing nodes.
The control node may be configured to, when acquiring a data processing task, allocate the data processing task to one of the plurality of processing nodes based on node information stored in the data table. In order to determine which processing node to assign a data processing task to, the above-described data table may be locked in response to acquiring the data processing task, and node information stored in the locked data table may be traversed.
When a piece of node information is reached during the traversal, the management node may first determine whether the number of processing nodes in the candidate processing node set is smaller than a preset threshold (referred to as a first threshold). The processing nodes in the candidate processing node set are those selected in the first round of screening as nodes that may be considered for executing the data processing task to be allocated.
If the number is smaller than the first threshold at this time, it may be determined whether the processing node corresponding to the traversed node information is a processing node that operates normally.
Accordingly, if the number is greater than or equal to the first threshold at this time, the present traversal may be ended.
If the processing node corresponding to the traversed node information is a processing node which operates normally, the processing node can be added to the candidate processing node set so as to update the candidate processing node set until the traversal is finished.
Accordingly, if the processing node corresponding to the traversed node information is operating abnormally, the traversal may continue. In this case, if there is no next node information, i.e., the current node information is already the last one, the traversal can end; if there is next node information, it is determined whether the number of processing nodes in the candidate processing node set is smaller than the first threshold, if the number is smaller than the first threshold, it is determined whether the processing node corresponding to the next node information is operating normally, and if it is operating normally, the processing node is added to the candidate processing node set; and so on.
When the control node obtains the candidate processing node set corresponding to the current traversal, one processing node (referred to as a target processing node) may be determined from the candidate processing node set based on the weights corresponding to the processing nodes in the candidate processing node set. The target processing node is the processing node which is screened out for the second time and is considered to be most suitable for executing the data processing task to be distributed.
Under the condition that the target processing node is determined, the control node can allocate the data processing task to be allocated to the target processing node and unlock the data table, so that node information stored in the data table can change along with the change of the processing nodes in the cluster.
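The overall flow described above (lock, traverse, collect candidates up to the first threshold, pick a target by weight, unlock) might be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: `NodeInfo`, `assign_task`, and the `healthy` flag are hypothetical names, and the rule that the candidate with the highest weight becomes the target is an assumption, since the text here only says the choice is based on weights.

```python
import threading
from dataclasses import dataclass


@dataclass
class NodeInfo:
    node_id: str
    weight: int
    healthy: bool  # stands in for the "operates normally" check


def assign_task(table, table_lock, first_threshold):
    """Pick a target node for one task; returns its id, or None if no candidate."""
    with table_lock:  # lock on entry, unlock on exit, even if an error occurs
        candidates = []
        for info in table.values():  # traverse in storage order
            if len(candidates) >= first_threshold:
                break  # enough candidates: end the traversal early
            if info.healthy:
                candidates.append(info)
        if not candidates:
            return None
        # Assumption: the candidate with the highest weight becomes the target.
        target = max(candidates, key=lambda info: info.weight)
        # A real system would dispatch the task to the target here, before unlocking.
        return target.node_id


table = {
    "n1": NodeInfo("n1", weight=3, healthy=True),
    "n2": NodeInfo("n2", weight=9, healthy=False),
    "n3": NodeInfo("n3", weight=5, healthy=True),
    "n4": NodeInfo("n4", weight=8, healthy=True),
}
lock = threading.Lock()
target_id = assign_task(table, lock, first_threshold=2)
```

With a first threshold of 2, the traversal stops before `n4` is examined, so `n4` is never considered even though its weight is higher than that of any candidate. This is the trade-off the scheme makes: less traversal in exchange for choosing among a bounded candidate set.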
With this approach, on the one hand, when the plurality of processing nodes in the cluster change dynamically, the node information stored in the data table maintained by the management node also changes dynamically; once the data table is locked, however, the node information stored in it no longer changes, which makes it convenient to allocate data processing tasks based on that information. On the other hand, because the traversal can end as soon as the number of processing nodes in the candidate processing node set reaches the preset threshold, less node information needs to be traversed, which shortens the traversal and improves the efficiency of allocating data processing tasks.
Referring to fig. 1, fig. 1 is a schematic diagram of a cluster architecture according to an exemplary embodiment of the present disclosure.
As shown in fig. 1, the cluster may include a management node and a plurality of processing nodes. Wherein the management and control node may be configured to manage and control the processing node, for example: the management and control node can distribute data processing tasks generated in the running process of the application program deployed on the cluster to different processing nodes; the processing node may be configured to process data, for example: each processing node may perform data processing tasks assigned to that processing node by the management node.
In practical applications, each node may be physical devices such as a physical server, or may be virtual devices such as a virtual server, a cloud server, or even a virtual machine, a Docker container, or the like.
It should be noted that, the plurality of processing nodes in the cluster may be dynamically changed. For example, each processing node in the cluster may periodically disconnect from a management node in the cluster and reestablish a connection with the management node after a period of time; alternatively, each processing node in the cluster may disconnect from the management node in the cluster after a period of time and establish a connection with the management node by the new processing node.
Taking a batch computing cluster as an example, the batch computing cluster may provide a batch computing service externally. The batch computing service provides RESTful-style APIs (Application Programming Interfaces). On top of the APIs, a user may use the batch computing service through an SDK (Software Development Kit), a command line tool, a console, and so on. The batch computing service allows users to highly customize the operating environment through custom virtual machine images or Docker container images, so that user applications run in an isolated virtualized environment, thereby ensuring the security of the user environment and user data. The batch computing service may use object storage (OSS) or file storage (NAS) as persistent storage for input/output data; for example, user applications, custom Docker container images, and running logs are stored in OSS.
As shown in FIG. 2, the batch computing cluster described above may include a batch computing management node and a plurality of virtual machines. One virtual machine is a processing node.
Referring to fig. 3, fig. 3 is a flowchart illustrating a data processing method according to an exemplary embodiment of the present disclosure.
The above described data processing method may be applied to a management node in a cluster as shown in fig. 1. In connection with fig. 1, the cluster may also include a plurality of processing nodes. Wherein the management node may maintain a data table for storing node information for the plurality of processing nodes.
In practical applications, for any processing node in the cluster, the node information of the processing node stored in the data table may include: the node identification of the processing node, the node access address (e.g., IP address, MAC address, URL address, etc.), the weight (also referred to as static weight) set in advance for the processing node, the number of connections corresponding to the processing node, the running state of the processing node, the load state of the processing node, etc.; this description is not limiting.
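One possible shape for a single entry of node information is sketched below. The field names and the string encodings of the states are assumptions chosen for illustration; the specification lists the kinds of information but does not prescribe a layout.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    node_id: str           # node identification
    access_address: str    # e.g. IP address, MAC address, or URL
    static_weight: int     # weight set in advance for the node
    connection_count: int  # number of connections for the node
    run_state: str         # e.g. "normal" or "abnormal" (assumed encoding)
    load_state: str        # e.g. "normal" or "overloaded" (assumed encoding)


info = NodeInfo("node-1", "10.0.0.1", static_weight=3, connection_count=0,
                run_state="normal", load_state="normal")
```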
In some embodiments, the data table may be a hash table. In this case, for any processing node in the cluster, the key in the hash table may be the node identifier of the processing node, and the corresponding value may be the node information of the processing node other than the node identifier. That is, the hash value of the node identifier of the processing node may be mapped to the address of the node information of the processing node in the hash table.
In practice, a hash table may be made up of multiple buckets. The hash values of all elements in any one bucket are the same. Therefore, when node information stored in the hash table is added, updated, modified, or removed, only the bucket corresponding to the hash value of the node identifier in that node information needs to be locked, rather than the entire hash table, which improves the efficiency of changing the node information stored in the hash table.
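Per-bucket locking can be sketched as below. The `BucketLockedTable` class is hypothetical, and mapping a node identifier to a bucket with `hash(node_id) % bucket_count` is an assumption made for the example; the point is that an update takes only the lock of the affected bucket, so changes to other buckets can proceed concurrently.

```python
import threading


class BucketLockedTable:
    """Hash table sketch with one lock per bucket."""

    def __init__(self, bucket_count=8):
        self.buckets = [{} for _ in range(bucket_count)]
        self.locks = [threading.Lock() for _ in range(bucket_count)]

    def _index(self, node_id):
        # Assumed mapping from node identifier to bucket.
        return hash(node_id) % len(self.buckets)

    def put(self, node_id, info):
        i = self._index(node_id)
        with self.locks[i]:  # only this bucket is locked
            self.buckets[i][node_id] = info

    def remove(self, node_id):
        i = self._index(node_id)
        with self.locks[i]:
            self.buckets[i].pop(node_id, None)

    def get(self, node_id):
        i = self._index(node_id)
        with self.locks[i]:
            return self.buckets[i].get(node_id)


table = BucketLockedTable()
table.put("node-1", {"weight": 3})
table.put("node-2", {"weight": 5})
table.remove("node-1")
```

Locking the whole table for task allocation, as in step 302, would in this sketch amount to acquiring every bucket lock in a fixed order before traversing.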
The data processing method may include the steps of:
Step 302: in response to acquiring a data processing task, locking the data table, and traversing the node information stored in the locked data table.
In this embodiment, when the management node acquires a data processing task, the management node may allocate the data processing task to a certain processing node of the plurality of processing nodes based on node information stored in the data table. In order to determine which processing node to assign a data processing task to, the above-described data table may be locked in response to acquiring the data processing task, and node information stored in the locked data table may be traversed.
In the case where the plurality of processing nodes in the cluster are dynamically changed, the node information stored in the data table is also dynamically changed. However, after the data table is locked, the node information stored in the locked data table is not changed any more, so that the data processing task is conveniently distributed based on the node information stored in the locked data table.
In practical application, the node information stored in the data table has a certain storage sequence, so that the node information stored in the locked data table can be traversed according to the storage sequence. Suppose that the locked data table is shown in table 1 below:
Node identification    Node access address    Weight      Number of connections
Identifier 1           Address 1              Weight 1    Number of connections 1
Identifier 2           Address 2              Weight 2    Number of connections 2
Identifier 3           Address 3              Weight 3    Number of connections 3
TABLE 1
Here, identifier 1, address 1, weight 1, and number of connections 1 are the node information of processing node 1; identifier 2, address 2, weight 2, and number of connections 2 are the node information of processing node 2; and identifier 3, address 3, weight 3, and number of connections 3 are the node information of processing node 3. The storage order of the node information in the data table is then: the node information of processing node 1, the node information of processing node 2, and the node information of processing node 3. In this case, traversing the locked data table reaches the node information of processing node 1 first, then that of processing node 2, and finally that of processing node 3. Since the node information of processing node 3 is the last in the storage order, the traversal can end after it has been traversed.
Step 304: for traversed node information, determining whether the number of processing nodes in a candidate processing node set is smaller than a first threshold value, if the number is smaller than the first threshold value, determining whether the processing node corresponding to the node information is a processing node with normal operation, and if the processing node is a processing node with normal operation, adding the processing node to the candidate processing node set.
In this embodiment, when a piece of node information is traversed, it may first be determined whether the number of processing nodes in the candidate processing node set is smaller than a preset threshold (referred to as a first threshold). The processing nodes in the candidate processing node set are those selected in a first screening pass as nodes that may be considered for executing the data processing task to be allocated. The first threshold may be a value preset according to actual requirements, or a default value; this description is not limiting.
If the number is smaller than the first threshold at this time, it may be determined whether the processing node corresponding to the traversed node information is operating normally. For example, the node information may include an operation state and a load state of the processing node. If both the operation state and the load state are determined to be normal based on the node information, the processing node may be determined to be a normally operating processing node; otherwise, it may be determined to be an abnormally operating processing node.
Accordingly, if the number is greater than or equal to the first threshold at this time, the present traversal may be ended.
If the processing node corresponding to the traversed node information is a processing node which operates normally, the processing node can be added to the candidate processing node set so as to update the candidate processing node set until the traversal is finished.
Accordingly, if the processing node corresponding to the traversed node information is operating abnormally, the traversal may continue. In this case, if there is no next node information, i.e., the current node information is already the last one, the traversal can be ended. If there is next node information, it is again determined whether the number of processing nodes in the candidate processing node set is smaller than the first threshold; if so, it is determined whether the processing node corresponding to the next node information is operating normally, and if it is, that processing node is added to the candidate processing node set; and so on.
The candidate processing node set obtained after the current traversal is finished is the candidate processing node set corresponding to the current traversal. The traversing can be finished after the number of the processing nodes in the candidate processing node set reaches the first threshold, so that the data quantity of the node information to be traversed can be reduced, the duration required by the traversing is shortened, and the distribution efficiency of the data processing task is improved.
Continuing with the locked data table shown in table 1 as an example, assume that the first threshold is 3, that processing node 1 and processing node 3 are operating normally, and that processing node 2 is operating abnormally. When the node information of processing node 1 is traversed, the number of processing nodes in the candidate processing node set is 0 and processing node 1 is operating normally, so processing node 1 is added to the candidate processing node set. When the node information of processing node 2 is traversed, the number of processing nodes in the candidate processing node set is 1, which is below the first threshold, but processing node 2 is operating abnormally, so it is not added. When the node information of processing node 3 is traversed, the number of processing nodes in the candidate processing node set is still 1 and processing node 3 is operating normally, so processing node 3 is added. Since the node information of processing node 3 is the last node information, the traversal then ends. In this case, the candidate processing node set corresponding to the current traversal includes processing node 1 and processing node 3.
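The traversal of step 304 and the example above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the `NodeInfo` class, its field names, the `healthy` flag (standing in for "operation state and load state are both normal") and the function name `build_candidates` are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    # Hypothetical field names; the description only lists identifier,
    # access address, weight and number of connections as node information.
    node_id: str
    address: str
    weight: float
    connections: int
    healthy: bool = True  # assumed stand-in for "operating normally"


def build_candidates(table, first_threshold):
    """Traverse node information in storage order and collect healthy nodes."""
    candidates = []
    for info in table:
        # End the traversal once the candidate set reaches the first threshold.
        if len(candidates) >= first_threshold:
            break
        # Abnormally operating nodes are skipped and the traversal continues.
        if info.healthy:
            candidates.append(info)
    return candidates


# Table 1 with processing node 2 operating abnormally, first threshold 3.
table_1 = [
    NodeInfo("id-1", "addr-1", 1.0, 0, healthy=True),
    NodeInfo("id-2", "addr-2", 1.0, 0, healthy=False),
    NodeInfo("id-3", "addr-3", 1.0, 0, healthy=True),
]
picked = build_candidates(table_1, first_threshold=3)
print([n.node_id for n in picked])  # ['id-1', 'id-3']
```

With a smaller first threshold, for instance 1, the loop would stop after processing node 1 and the remaining node information would not be examined.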
Step 306: a target processing node is determined from the set of candidate processing nodes based on weights corresponding to respective processing nodes in the set of candidate processing nodes.
In the present embodiment, when a candidate processing node set corresponding to the current traversal is obtained, one processing node (referred to as a target processing node) may be determined from the candidate processing node set based on weights corresponding to the respective processing nodes in the candidate processing node set. The target processing node is the processing node which is screened out for the second time and is considered to be most suitable for executing the data processing task to be distributed.
Step 308: and distributing the data processing task to the target processing node, and unlocking the data table.
In this embodiment, in the case where the target processing node is determined, the data processing task to be allocated may be allocated to the target processing node, and the data table may be unlocked, so that node information stored in the data table may change along with a change of the processing nodes in the cluster.
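Taken together, the lock, traverse, select, assign and unlock sequence might look like the sketch below. It assumes an in-process `threading.Lock` guarding a dictionary keyed by node identifier; the class `ControlNode`, the `send` callback and the `healthy` and `weight` keys are hypothetical names chosen for illustration, and a real cluster could use a different locking mechanism.

```python
import threading


class ControlNode:
    """Hypothetical management and control node holding the data table."""

    def __init__(self, table, first_threshold):
        self.table = table                  # dict: node_id -> node information
        self.first_threshold = first_threshold
        self._lock = threading.Lock()       # guards the data table

    def dispatch(self, task, send):
        # Lock the table so node information cannot change during selection.
        with self._lock:
            # Step 304: collect healthy nodes until the first threshold is hit.
            candidates = []
            for node_id, info in self.table.items():
                if len(candidates) >= self.first_threshold:
                    break
                if info["healthy"]:
                    candidates.append(node_id)
            if not candidates:
                return None
            # Step 306: choose the candidate with the largest weight.
            target = max(candidates, key=lambda n: self.table[n]["weight"])
            # Step 308: assign the task; leaving the block unlocks the table.
            send(target, task)
            return target


sent = []
node = ControlNode(
    {
        "node-1": {"weight": 1, "healthy": True},
        "node-2": {"weight": 5, "healthy": False},
        "node-3": {"weight": 3, "healthy": True},
    },
    first_threshold=3,
)
chosen = node.dispatch("task-A", lambda target, task: sent.append((target, task)))
print(chosen, sent)  # node-3 [('node-3', 'task-A')]
```

Using a `with` block means the table is unlocked even if `send` raises, which is one possible way to avoid leaving the data table locked after a failed assignment.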
In the above technical solution, the management and control node in the cluster maintains a data table that stores node information of the plurality of processing nodes in the cluster. In response to acquiring a data processing task, the management and control node locks the data table and traverses the node information stored in the locked data table. For each piece of traversed node information, if the number of processing nodes in the candidate processing node set is smaller than a preset threshold and the processing node corresponding to the node information is operating normally, the processing node is added to the candidate processing node set. After the traversal ends, a target processing node is determined from the candidate processing node set based on the weights corresponding to the processing nodes in the set, the data processing task is allocated to the target processing node, and the data table is unlocked.
By adopting the mode, on one hand, under the condition that a plurality of processing nodes in the cluster are dynamically changed, node information stored in a data table maintained by a management and control node in the cluster is also dynamically changed, but after the data table is locked, the node information stored in the locked data table is not changed any more, so that data processing tasks are conveniently distributed based on the node information stored in the locked data table; on the other hand, the data quantity of node information to be traversed can be reduced because the traversal can be finished after the number of the processing nodes in the candidate processing node set reaches the preset threshold, so that the duration required by the traversal is reduced, and the distribution efficiency of the data processing task is improved.
As described above, the above-described data processing method can be roughly divided into three stages, which are a candidate processing node set generation stage, a target processing node determination stage, and a data processing task allocation stage, respectively. These three stages are each described in detail below.
(1) Candidate processing node set generation phase
Referring to fig. 4, fig. 4 is a flow chart illustrating a candidate processing node set generation phase according to an exemplary embodiment of the present disclosure.
As shown in fig. 4, in the above-described candidate processing node set generation stage, node information stored in the locked data table may be traversed, and when the traversed node information satisfies a specific requirement, a processing node corresponding to the traversed node information may be added to the above-described candidate processing node set.
In some embodiments, when traversing the node information stored in the locked data table, specifically, one node information (called target node information) may be randomly selected from the locked data table, and the node information stored in the locked data table may be traversed with the target node information as a traversing start point.
Continuing with the locked data table shown in table 1 as an example, assuming that the target node information randomly selected from the locked data table is the node information of the processing node 2, the node information of the processing node 2 may be used as a traversal start point to traverse the node information stored in the locked data table according to the storage order. At this time, the node information of the processing node 2 may be traversed first, and then the node information of the processing node 3 may be traversed, without traversing the node information of the processing node 1. By adopting the mode, the data volume of node information to be traversed can be reduced, so that the duration required by the traversal is reduced, and the distribution efficiency of data processing tasks is improved.
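The random starting point can be sketched as below. Following the example above, the sketch traverses from the randomly chosen entry to the end of the storage order and does not wrap around to earlier entries; the function name `traverse_from_random_start` and the `rng` parameter are assumptions made for illustration.

```python
import random


def traverse_from_random_start(table, rng=random):
    """Return the node information visited when starting at a random entry.

    `table` is a list in storage order. Entries stored before the randomly
    chosen start point are not visited, as in the example in the text.
    """
    if not table:
        return []
    start = rng.randrange(len(table))  # index of the target node information
    return table[start:]


table_1 = ["node-1", "node-2", "node-3"]
visited = traverse_from_random_start(table_1, rng=random.Random())
print(visited)  # a suffix of table_1, e.g. starting at node-2
```

Because earlier entries are skipped, the candidate set can end up smaller than the first threshold even when enough healthy nodes exist; whether that trade-off is acceptable depends on the deployment.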
In some embodiments, where traversing to certain node information, it may be determined first whether the number of processing nodes in the candidate set of processing nodes is less than a first threshold.
If the number is smaller than the first threshold at this time, it may be determined whether or not the processing node corresponding to the node information is a processing node that operates normally, based on the traversed node information.
Accordingly, if the number is greater than or equal to the first threshold at this time, the present traversal may be ended.
If the processing node corresponding to the traversed node information is a processing node that is operating normally, it may be determined whether the number of processing nodes in the candidate processing node set is less than a preset threshold (referred to as a second threshold). Wherein the second threshold is less than the first threshold. The second threshold value can be a preset value according to actual demands, or a default value; this description is not limiting.
If the number is smaller than the second threshold value, the processing nodes corresponding to the traversed node information can be directly added to the candidate processing node set, so that a certain number of processing nodes are contained in the candidate processing node set; if the number is greater than or equal to the second threshold value, processing nodes corresponding to the traversed node information may be added to the candidate processing node set based on a preset probability value, thereby enhancing randomness of the processing nodes added to the candidate processing node set; thus, the candidate processing node set can be updated until the traversal is finished. The probability value may be a value preset according to actual requirements (for example, the probability value may be preset to be 50%, which indicates that the probability of adding the processing node to the candidate processing node set is 50%), or may be a default value; this description is not limiting.
Accordingly, if the processing node corresponding to the traversed node information is operating abnormally, the traversal may continue. In this case, if there is no next node information, the present traversal may be ended; if there is next node information, it is determined whether the next node information meets the specific requirement; and so on.
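The two-threshold variant of the candidate processing node set generation stage can be sketched as follows. The function name, the `(node_id, healthy)` tuple layout and the injectable `rng` are assumptions for illustration; only the two thresholds and the preset probability value come from the description.

```python
import random


def build_candidates(nodes, first_threshold, second_threshold, probability, rng=random):
    """Collect candidates using a hard cap and a probabilistic second phase.

    `nodes` is an iterable of (node_id, healthy) pairs in traversal order.
    """
    if not second_threshold < first_threshold:
        raise ValueError("second threshold must be smaller than first threshold")
    candidates = []
    for node_id, healthy in nodes:
        # The first threshold ends the traversal.
        if len(candidates) >= first_threshold:
            break
        # Abnormally operating nodes are skipped.
        if not healthy:
            continue
        if len(candidates) < second_threshold:
            # Below the second threshold: always add, to guarantee a minimum set.
            candidates.append(node_id)
        elif rng.random() < probability:
            # At or above the second threshold: add with the preset probability.
            candidates.append(node_id)
    return candidates


nodes = [("n1", True), ("n2", False), ("n3", True), ("n4", True), ("n5", True)]
print(build_candidates(nodes, first_threshold=3, second_threshold=2, probability=0.5))
```

With `probability=0.0` the result never exceeds the second threshold, and with `probability=1.0` the behaviour reduces to the single-threshold traversal.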
(2) Target processing node determination stage
In the target processing node determining stage, a target processing node may be determined from the candidate processing node set based on weights corresponding to the respective processing nodes in the candidate processing node set.
In practical application, weights can be set for each node in the cluster according to practical requirements. For example, the weight may be proportional to the data processing capacity, i.e., a higher weight may be set for nodes with more data processing capacity; alternatively, the weight may be proportional to the access ratio, i.e., a higher weight may be set for nodes with higher access ratios; etc.
Thus, in some embodiments, when determining the target processing node from the candidate processing node set based on the weights corresponding to the processing nodes in the candidate processing node set, the processing node with the largest weight in the candidate processing node set may be specifically determined as the target processing node based on the weights corresponding to the processing nodes in the candidate processing node set. At this time, the weight may be a static weight set in advance for each processing node in the cluster.
Further, in some embodiments, for any one of the set of candidate processing nodes, the dynamic weight corresponding to that processing node may be calculated according to the following formula:
W=w+w÷c
wherein W represents the dynamic weight corresponding to the processing node; w represents the static weight corresponding to the processing node; c represents the number of connections corresponding to the processing node.
When the dynamic weights corresponding to the processing nodes in the candidate processing node set are calculated, the processing node with the largest dynamic weight in the candidate processing node set may be determined as the target processing node. In this way, the advantages of the weighting mode and the minimum connection number mode are combined, so that the processing node most suitable for executing the data processing task to be allocated can be determined more effectively.
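As a worked example of the formula, a node with static weight w = 10 and c = 5 connections has W = 10 + 10 ÷ 5 = 12, while a node with w = 8 and c = 1 has W = 8 + 8 ÷ 1 = 16, so the second node is chosen despite its lower static weight. The sketch below computes this selection. The description does not say how c = 0 is handled; treating it as 1 is an assumption made here purely so that the division is defined, and the dictionary keys are hypothetical.

```python
def dynamic_weight(static_weight, connections):
    """Compute W = w + w / c for one processing node."""
    # Assumption: the source formula is undefined for c = 0, so the
    # connection count is clamped to at least 1 in this sketch.
    c = max(connections, 1)
    return static_weight + static_weight / c


def pick_target(candidates):
    """Return the candidate with the largest dynamic weight."""
    return max(
        candidates,
        key=lambda n: dynamic_weight(n["weight"], n["connections"]),
    )


candidates = [
    {"id": "a", "weight": 10, "connections": 5},  # W = 12.0
    {"id": "b", "weight": 8, "connections": 1},   # W = 16.0
]
print(pick_target(candidates)["id"])  # b
```

Since w ÷ c shrinks as the connection count grows, W approaches the static weight for busy nodes, which is how the formula blends the weighted mode with the minimum connection number mode.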
(3) Data processing task allocation phase
In the above-described data processing task allocation stage, the data processing task to be allocated may be allocated to the target processing node. After the data processing task is distributed, the data table can be unlocked.
When the plurality of processing nodes in the cluster change dynamically, the target processing node was determined to be operating normally based on the node information stored in the locked data table, but its actual state may have changed during the traversal.
Thus, in some embodiments, when the above data processing task is allocated to the target processing node, the availability of the target processing node may be verified first, for example: it is verified whether the target processing node has been disconnected, has been deleted or is about to be unavailable.
If the availability verification for the target processing node is passed, the data processing task may be assigned to the target processing node.
Accordingly, if the availability verification for the target processing node fails, it may be retried, for example: the node information stored in the locked data table can be traversed again; or, the data table may be unlocked first, so that node information stored in the data table may change along with the change of the processing nodes in the cluster, and then the data table is locked again, and the node information stored in the locked data table is traversed.
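The verify-then-assign step with retry can be sketched as follows. The callbacks `select_target`, `is_available` and `send`, as well as the `max_attempts` bound, are assumptions introduced for illustration; the description does not specify how many retries are made or how availability is checked.

```python
def assign_with_retry(select_target, is_available, send, task, max_attempts=3):
    """Select a target, verify its availability, and assign the task.

    `select_target` re-runs the traversal and weight-based selection;
    `is_available` checks that the node is not disconnected, deleted,
    or about to become unavailable. Returns the chosen node or None.
    """
    for _ in range(max_attempts):
        target = select_target()
        if target is None:
            return None  # no candidate could be found
        if is_available(target):
            send(target, task)
            return target
        # Verification failed: retry with a fresh traversal.
    return None


# Hypothetical stubs: the first selection returns a node that has gone away.
attempts = iter(["node-2", "node-3"])
delivered = []
result = assign_with_retry(
    select_target=lambda: next(attempts, None),
    is_available=lambda node: node != "node-2",
    send=lambda node, task: delivered.append((node, task)),
    task="task-A",
)
print(result, delivered)  # node-3 [('node-3', 'task-A')]
```

Whether `select_target` re-traverses the still-locked table or unlocks and relocks it first corresponds to the two retry options described above.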
Corresponding to the embodiments of the data processing method described above, the present description also provides embodiments of the data processing apparatus.
Fig. 5 is a schematic block diagram of an apparatus according to an exemplary embodiment. Referring to fig. 5, at the hardware level, the device includes a processor 502, an internal bus 504, a network interface 506, a memory 508, and a non-volatile storage 510, although other hardware may be required. One or more embodiments of the present description may be implemented in a software-based manner, such as by the processor 502 reading a corresponding computer program from the non-volatile storage 510 into the memory 508 and then running. Of course, in addition to software implementation, one or more embodiments of the present disclosure do not exclude other implementation manners, such as a logic device or a combination of software and hardware, etc., that is, the execution subject of the following processing flow is not limited to each logic module, but may also be hardware or a logic device.
Referring to fig. 6, fig. 6 is a block diagram of a data processing apparatus according to an exemplary embodiment of the present specification.
The data processing device described above may be applied to the electronic device shown in fig. 5 to implement the technical solution of the present specification. The electronic device can serve as the management and control node in the cluster; the cluster further includes a plurality of processing nodes; the management and control node maintains a data table for storing node information of the plurality of processing nodes.
The data processing apparatus may include:
the traversing module 602 is configured to lock the data table in response to acquiring a data processing task, and traverse node information stored in the locked data table;
an adding module 604, configured to determine, for the traversed node information, whether the number of processing nodes in a candidate processing node set is smaller than a first threshold, if the number is smaller than the first threshold, determine whether a processing node corresponding to the node information is a processing node that operates normally, and if the processing node is a processing node that operates normally, add the processing node to the candidate processing node set;
a determining module 606, configured to determine a target processing node from the candidate processing node set based on weights corresponding to the processing nodes in the candidate processing node set;
an allocation module 608, configured to allocate the data processing task to the target processing node, and unlock the data table.
Optionally, the traversing module 602 is specifically configured to:
and randomly selecting target node information from the locked data table, and traversing the node information stored in the locked data table by taking the target node information as a traversing starting point.
Optionally, the adding module 604 is specifically configured to:
determining whether the number of processing nodes in the candidate set of processing nodes is less than a second threshold; wherein the second threshold is less than the first threshold;
if the number is less than the second threshold, adding the processing node to the candidate set of processing nodes;
if the number is greater than or equal to the second threshold, the processing node is added to the candidate set of processing nodes based on a pre-set probability value.
Optionally, the determining module 606 is specifically configured to:
and determining the processing node with the largest weight in the candidate processing node set as a target processing node based on the weight corresponding to each processing node in the candidate processing node set.
Optionally, the determining module 606 is specifically configured to:
calculating dynamic weights corresponding to all processing nodes in the candidate processing node set according to the following formula, and determining the processing node with the largest dynamic weight in the candidate processing node set as a target processing node:
W=w+w÷c
wherein W represents a dynamic weight corresponding to each processing node in the candidate set of processing nodes; w represents a static weight corresponding to each processing node in the candidate set of processing nodes; c represents the number of connections corresponding to each processing node in the set of candidate processing nodes.
Optionally, the allocation module 608 is specifically configured to:
and carrying out availability verification on the target processing node, and if the availability verification on the target processing node is passed, distributing the data processing task to the target processing node.
Optionally, the data table is a hash table; the key word of the hash table is the node identification of the processing node.
The implementation process of the functions and roles of each module in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, they essentially correspond to the method embodiments, so that reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.
The foregoing description of the preferred embodiment(s) is (are) merely intended to illustrate the embodiment(s) of the present invention, and it is not intended to limit the embodiment(s) of the present invention to the particular embodiment(s) described.

Claims (10)

1. A data processing method, applied to a management and control node in a cluster; the cluster further includes a plurality of processing nodes; the management and control node maintains a data table for storing node information of the plurality of processing nodes; the method comprises the following steps:
in response to acquiring a data processing task, locking the data table, and traversing node information stored in the locked data table;
for the traversed node information, determining whether the number of processing nodes in a candidate processing node set is smaller than a first threshold; if the number is smaller than the first threshold, determining whether the processing node corresponding to the node information is a processing node that operates normally; and if the processing node is a processing node that operates normally, adding the processing node to the candidate processing node set;
determining a target processing node from the candidate processing node set based on weights corresponding to each processing node in the candidate processing node set;
and distributing the data processing task to the target processing node, and unlocking the data table.
2. The method of claim 1, the traversing node information stored in the locked data table, comprising:
and randomly selecting target node information from the locked data table, and traversing the node information stored in the locked data table by taking the target node information as a traversing starting point.
3. The method of claim 1, the adding the processing node to the candidate set of processing nodes comprising:
determining whether the number of processing nodes in the candidate set of processing nodes is less than a second threshold; wherein the second threshold is less than the first threshold;
if the number is less than the second threshold, adding the processing node to the candidate set of processing nodes;
if the number is greater than or equal to the second threshold, the processing node is added to the candidate set of processing nodes based on a pre-set probability value.
4. The method of claim 1, the determining a target processing node from the set of candidate processing nodes based on weights corresponding to respective processing nodes in the set of candidate processing nodes, comprising:
and determining the processing node with the largest weight in the candidate processing node set as a target processing node based on the weight corresponding to each processing node in the candidate processing node set.
5. The method of claim 4, the determining, based on weights corresponding to respective processing nodes in the candidate set of processing nodes, the processing node in the candidate set of processing nodes having the greatest weight as a target processing node, comprising:
calculating dynamic weights corresponding to all processing nodes in the candidate processing node set according to the following formula, and determining the processing node with the largest dynamic weight in the candidate processing node set as a target processing node:
W=w+w÷c
wherein W represents a dynamic weight corresponding to each processing node in the candidate set of processing nodes; w represents a static weight corresponding to each processing node in the candidate set of processing nodes; c represents the number of connections corresponding to each processing node in the set of candidate processing nodes.
6. The method of claim 1, the assigning the data processing task to the target processing node, comprising:
and carrying out availability verification on the target processing node, and if the availability verification on the target processing node is passed, distributing the data processing task to the target processing node.
7. The method of claim 1, the data table being a hash table; the key word of the hash table is the node identification of the processing node.
8. A data processing device, applied to a management and control node in a cluster; the cluster further includes a plurality of processing nodes; the management and control node maintains a data table for storing node information of the plurality of processing nodes; the device comprises:
a traversing module, configured to lock the data table in response to acquiring a data processing task, and traverse the node information stored in the locked data table;
an adding module, configured to determine, for traversed node information, whether a number of processing nodes in a candidate processing node set is smaller than a first threshold, if the number is smaller than the first threshold, determine whether a processing node corresponding to the node information is a processing node that operates normally, and if the processing node is a processing node that operates normally, add the processing node to the candidate processing node set;
a determining module, configured to determine a target processing node from the candidate processing node set based on weights corresponding to respective processing nodes in the candidate processing node set;
and an allocation module, configured to allocate the data processing task to the target processing node and unlock the data table.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 7 by executing the executable instructions.
10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any of claims 1 to 7.
CN202211586510.2A 2022-12-09 2022-12-09 Data processing method, device, equipment and storage medium Pending CN116185559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211586510.2A CN116185559A (en) 2022-12-09 2022-12-09 Data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211586510.2A CN116185559A (en) 2022-12-09 2022-12-09 Data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116185559A true CN116185559A (en) 2023-05-30

Family

ID=86447911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211586510.2A Pending CN116185559A (en) 2022-12-09 2022-12-09 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116185559A (en)

Similar Documents

Publication Publication Date Title
EP3221795B1 (en) Service addressing in distributed environment
US11709843B2 (en) Distributed real-time partitioned MapReduce for a data fabric
US20180144025A1 (en) Map-reduce job virtualization
US20200042364A1 (en) Movement of services across clusters
EP3210134B1 (en) Composite partition functions
CN108881512B (en) CTDB virtual IP balance distribution method, device, equipment and medium
US10394782B2 (en) Chord distributed hash table-based map-reduce system and method
CN109032803B (en) Data processing method and device and client
US11074179B2 (en) Managing objects stored in memory
JP2012048424A (en) Method and program for allocating identifier
CN110321225B (en) Load balancing method, metadata server and computer readable storage medium
CN106952085B (en) Method and device for data storage and service processing
CN114884962A (en) Load balancing method and device and electronic equipment
CN116595015B (en) Data processing method, device, equipment and storage medium
CN115378799B (en) Election method and device in equipment cluster based on PaxosLease algorithm
CN116185559A (en) Data processing method, device, equipment and storage medium
CN111198756A (en) Application scheduling method and device of Kubernetes cluster
US8850440B2 (en) Managing the processing of processing requests in a data processing system comprising a plurality of processing environments
CN113505111A (en) Shared directory mounting method and distributed network additional storage system
CN109787899B (en) Data partition routing method, device and system
KR20150093979A (en) Method and apparatus for assigning namenode in virtualized cluster environments
US11163462B1 (en) Automated resource selection for software-defined storage deployment
CN111338752B (en) Container adjusting method and device
US11683374B2 (en) Containerized gateways and exports for distributed file systems
US20240272822A1 (en) Dynamic over-provisioning of storage devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination