WO2018072618A1 - Method for allocating streaming computing tasks and control server - Google Patents

Method for allocating streaming computing tasks and control server

Info

Publication number
WO2018072618A1
Authority
WO
WIPO (PCT)
Prior art keywords
streaming computing
cluster
server
center server
streaming
Prior art date
Application number
PCT/CN2017/105360
Other languages
English (en)
French (fr)
Inventor
张钊
李名浩
胡四海
陈友林
汪光炼
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2018072618A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing
    • H04L67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1034 Reaction to server failures by a load balancer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • The present application relates to the field of streaming computing technologies, and in particular to a method for allocating streaming computing tasks and a control server, a method for executing streaming computing tasks and a streaming computing center server cluster, a streaming computing system, and an off-site multi-live (multi-active) system.
  • In streaming computing, the arrival time and arrival order of the data cannot be determined in advance, and not all of the data can be stored. The servers involved therefore do not store the streaming data; instead, they compute on the data in memory, in real time, as it arrives.
  • Requirements for the timeliness, quality, service stability, and availability of streaming data keep rising, which also poses a challenge to traditional distributed web service systems. Because a streaming computing system handles a huge volume of real-time computation and data reads, distributing streaming computing tasks across multiple locations raises many difficulties: for example, merging deduplicated statistics from different locations in real time, keeping data consistent across locations, and the uncontrollable geographic origin of data sources. Multi-region coordination of streaming computing, together with real-time disaster recovery, is therefore necessary.
  • The present application provides a method for allocating streaming computing tasks and a method for executing streaming computing tasks. A control server uniformly distributes the streaming computing tasks, and the streaming computing center server clusters and streaming computing unit server clusters deployed in multiple locations execute different streaming computing tasks.
  • Each streaming computing center server cluster reserves preset computing resources. Data is synchronized among the central storage clusters, and the data in the unit storage cluster of each streaming computing unit server cluster is also synchronized to each central storage cluster.
  • On this basis, when a streaming computing unit server cluster or a streaming computing center server cluster becomes abnormal, the not-yet-executed part of a running streaming computing task can be reallocated to a streaming computing center server cluster in another location. The task is thus quickly resumed remotely without configuring idle servers, which saves system resources.
  • the application also provides a control server, a streaming computing center server cluster and a streaming computing system to ensure the implementation and application of the above method in practice.
  • The present application discloses a method for allocating streaming computing tasks, applied to a control server that is connected to streaming computing center server clusters and streaming computing unit server clusters, where the streaming computing center server clusters reserve preset computing resources.
  • The method includes:
  • while a target streaming computing center server cluster or a target streaming computing unit server cluster executes a streaming computing task, determining whether that target cluster is abnormal and, if so, allocating the unexecuted tasks in the streaming computing task to a candidate streaming computing center server cluster.
  • the method further comprises:
  • the control server periodically sends a heartbeat message to the streaming computing center server cluster and the streaming computing unit server cluster, where the heartbeat message is used to detect whether the control server can communicate with the streaming computing center server cluster and whether it can communicate with the streaming computing unit server cluster;
  • determining whether the target streaming computing center server cluster or the target streaming computing unit server cluster is abnormal specifically comprises: determining whether that cluster has failed to feed back a heartbeat response within a preset feedback time.
  • Allocating the unexecuted tasks in the streaming computing task to the candidate streaming computing center server cluster includes:
  • the control server acquires the load condition of each streaming computing center server cluster in real time;
  • the control server allocates, according to the load condition, the unexecuted tasks in the streaming computing task to the streaming computing center server cluster with the smallest current load.
  • Each streaming computing center server cluster has a central storage cluster; the central storage clusters of the streaming computing center server clusters synchronize intermediate state data and intermediate result data with one another, and each streaming computing unit server cluster synchronizes its intermediate state data and intermediate result data to the central storage cluster of each streaming computing center server cluster. The method further includes:
  • the control server stores the execution state and configuration information of each streaming computing task in a control database; the execution state indicates the part of each streaming computing task already executed on the corresponding streaming computing center server cluster or streaming computing unit server cluster; the configuration information indicates the correspondence between each streaming computing task and the streaming computing center server cluster executing it, or between each streaming computing task and the streaming computing unit server cluster executing it;
  • allocating the unexecuted tasks in the streaming computing task to the streaming computing center server cluster with the smallest current load includes:
  • the control server calculates the unexecuted tasks in the streaming computing task according to the execution state and configuration information stored in the control database;
  • the control server allocates the unexecuted tasks to the streaming computing center server cluster with the smallest current load.
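The reallocation steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `ControlDatabase` layout, the representation of a task as a list of named parts, and the numeric load values are all hypothetical and are not specified by the application.

```python
from dataclasses import dataclass, field


@dataclass
class ControlDatabase:
    """Hypothetical stand-in for the control database."""
    # Configuration information: task id -> cluster executing the task.
    config: dict = field(default_factory=dict)
    # All parts of each task (assumed representation of a task).
    parts: dict = field(default_factory=dict)
    # Execution state: task id -> set of parts already executed.
    executed: dict = field(default_factory=dict)


def unexecuted_parts(db, task_id):
    """Derive the unexecuted parts of a task from the stored execution state."""
    done = db.executed.get(task_id, set())
    return [p for p in db.parts[task_id] if p not in done]


def reassign(db, task_id, center_loads, failed_cluster):
    """Move the unexecuted parts to the least-loaded healthy center cluster.

    center_loads maps center cluster names to their current load; unit
    clusters are never candidates because they reserve no resources.
    """
    candidates = {c: load for c, load in center_loads.items()
                  if c != failed_cluster}
    if not candidates:
        raise RuntimeError("no healthy streaming computing center server cluster")
    target = min(candidates, key=candidates.get)
    remaining = unexecuted_parts(db, task_id)
    db.config[task_id] = target  # record the new task-to-cluster correspondence
    return target, remaining
```

Only center clusters appear in `center_loads`, which mirrors the rule that candidates are drawn from clusters with reserved computing resources.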
  • The present application also provides a method for executing streaming computing tasks, applied to any current streaming computing center server cluster, with preset computing resources reserved, in a streaming computing system. The streaming computing system includes streaming computing center server clusters, streaming computing unit server clusters, and a control server; each streaming computing center server cluster has a central storage cluster, the central storage clusters synchronize intermediate state data and intermediate result data with one another, and the unit storage cluster of each streaming computing unit server cluster synchronizes intermediate state data and intermediate result data to each central storage cluster. The method includes:
  • the current streaming computing center server cluster obtains intermediate state data and intermediate result data required to execute the unexecuted task from the central storage cluster;
  • the current streaming computing center server cluster executes the unexecuted tasks by using the preset computing resources, intermediate state data, and intermediate result data.
  • the method further comprises:
  • in response to the control server periodically sending a heartbeat message, the current streaming computing center server cluster periodically feeds back a heartbeat response to the control server; the heartbeat message is used to detect whether the control server and the current streaming computing center server cluster can communicate.
  • the method further comprises:
  • the current streaming computing center server cluster detects whether the number of consecutive failures to send the heartbeat response to the control server exceeds a preset threshold and, if so, stops executing the unexecuted tasks.
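A minimal sketch of this cluster-side check is shown below. The class name `HeartbeatGuard` and its interface are hypothetical. The text does not say why the cluster stops; one plausible reading, offered here only as an assumption, is that it avoids running tasks the control server may already have reassigned elsewhere.

```python
class HeartbeatGuard:
    """Counts consecutive heartbeat-response failures on the cluster side."""

    def __init__(self, max_failures):
        self.max_failures = max_failures  # preset threshold
        self.consecutive_failures = 0
        self.running = True               # whether task execution continues

    def record(self, response_delivered):
        """Record one heartbeat round and return whether execution continues."""
        if response_delivered:
            # Any successful response resets the consecutive-failure count.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures > self.max_failures:
                self.running = False      # stop executing the unexecuted tasks
        return self.running
```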
  • the present application further provides a control server, where the control server is connected to a cluster of a streaming computing center server and a cluster of streaming computing unit servers, and a predetermined proportion of computing resources are reserved in the cluster of the streaming computing center server;
  • the control server includes:
  • a first allocating unit configured to allocate the streaming computing task to a target streaming computing center server cluster or a target streaming computing unit server cluster in response to receiving the streaming computing task
  • a determining unit configured to determine, while the target streaming computing center server cluster or the target streaming computing unit server cluster executes the streaming computing task, whether that cluster is abnormal;
  • a second allocation unit configured to allocate the unexecuted tasks in the streaming computing task to the candidate streaming computing center server cluster if the result of the determining unit is yes.
  • the control server further includes:
  • a sending unit configured to periodically send a heartbeat message to the streaming computing center server cluster and the streaming computing unit server cluster, where the heartbeat message is used to detect whether the control server can communicate with the streaming computing center server cluster and whether it can communicate with the streaming computing unit server cluster;
  • the determining unit is specifically configured to: determine whether the target streaming computing center server cluster or the target streaming computing unit server cluster does not feed back a heartbeat response within a preset feedback time.
  • the second distribution unit includes:
  • a load acquisition subunit configured to acquire, in real time, the load condition of the streaming computing center server clusters and the streaming computing unit server clusters;
  • a first allocation subunit configured to allocate, according to the load condition of the streaming computing center server clusters, the unexecuted tasks in the streaming computing task to the streaming computing center server cluster with the smallest current load.
  • Each streaming computing center server cluster has a central storage cluster; the central storage clusters of the streaming computing center server clusters synchronize intermediate state data and intermediate result data with one another, and each streaming computing unit server cluster synchronizes intermediate state data and intermediate result data to the central storage cluster of each streaming computing center server cluster. The control server further includes:
  • a storage unit configured to store the execution state and configuration information of each streaming computing task in a control database;
  • the execution state indicates the part of each streaming computing task already executed on the corresponding streaming computing center server cluster or streaming computing unit server cluster;
  • the configuration information indicates the correspondence between each streaming computing task and the streaming computing center server cluster executing it, or between each streaming computing task and the streaming computing unit server cluster executing it;
  • the first allocation subunit includes:
  • a calculating subunit configured to calculate an unexecuted task in the streaming computing task according to an execution state and configuration information stored in the control database
  • a second allocation subunit configured to allocate the unexecuted task to a cluster of streaming computing center servers with a minimum current load.
  • The present application also provides a streaming computing center server cluster that reserves preset computing resources. The streaming computing center server cluster is connected to a control server, and the control server is also connected to a streaming computing unit server cluster. The streaming computing center server cluster has a central storage cluster; the central storage clusters synchronize intermediate state data and intermediate result data with one another, and the unit storage cluster of the streaming computing unit server cluster synchronizes intermediate state data and intermediate result data to the central storage cluster. The streaming computing center server cluster includes:
  • a task execution unit configured to execute the unexecuted tasks by using the preset computing resources, the intermediate state data, and the intermediate result data.
  • the streaming computing center server cluster further includes:
  • a feedback unit configured to periodically send a heartbeat response to the control server in response to the control server periodically sending a heartbeat message; the heartbeat message is used to detect the control server and the current streaming computing center server Whether the clusters can communicate with each other.
  • the streaming computing center server cluster further includes:
  • a detecting unit configured to detect whether the number of consecutive failures to send the heartbeat response to the control server exceeds a preset threshold;
  • a stopping unit configured to stop execution of the unexecuted task if the result of the detecting unit is YES.
  • the application also provides a streaming computing system, the streaming computing system comprising: a streaming computing central server cluster and a streaming computing unit server cluster, a control server;
  • a central storage cluster corresponding to the streaming computing center server cluster, a control database corresponding to the control server, and a unit storage cluster corresponding to the streaming computing unit server cluster.
  • The present application further provides an off-site multi-live system.
  • The off-site multi-live system includes: a first streaming computing center server cluster, a plurality of streaming computing unit server clusters, and a control server, where the first streaming computing center server cluster is the aforementioned streaming computing center server cluster and the control server is the aforementioned control server;
  • the plurality of streaming computing unit server clusters are deployed in a plurality of second geographic locations respectively; the first streaming computing center server cluster is deployed in a first geographic location, and the second geographic locations differ from the first geographic location.
  • The off-site multi-live system further includes a second streaming computing center server cluster, where the second streaming computing center server cluster and the first streaming computing center server cluster are deployed in different first geographic locations.
  • the application also provides an off-site multi-live system, including:
  • a first streaming computing center server configured at least to provide computing resources externally, where the first streaming computing center server includes a first central storage unit;
  • a second streaming computing center server configured at least to provide computing resources externally, where the second streaming computing center server includes a second central storage unit;
  • the first streaming computing center server and the second streaming computing center server perform load balancing based on a unified load balancing policy, and the first central storage unit and the second central storage unit serve as hot standbys for each other;
  • when the first streaming computing center server can no longer provide computing resources externally, a first streaming computing task running on it is terminated there and continues to run on the second streaming computing center server, based on the intermediate state data and intermediate result data in the second central storage unit of the second streaming computing center server.
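The hot-standby handover described above can be illustrated with a toy Python example. Everything here is an assumption made for illustration: the `CentralStorage` class, the use of a running sum as the streaming task, the `offset` field standing in for intermediate state data, the `total` field standing in for intermediate result data, and the synchronous mirroring to the peer.

```python
class CentralStorage:
    """Toy central storage unit that mirrors every write to its hot-standby peer."""

    def __init__(self):
        self.state = {}
        self.peer = None

    def put(self, task_id, state):
        self.state[task_id] = dict(state)
        if self.peer is not None:
            # Synchronize intermediate data to the other central storage unit.
            self.peer.state[task_id] = dict(state)


def run(task_id, events, storage, fail_after=None):
    """Sum the events, checkpointing after each one.

    Returns None if the server stops providing computing resources
    (simulated by fail_after), otherwise the final total.
    """
    state = storage.state.get(task_id, {"offset": 0, "total": 0})
    for i in range(state["offset"], len(events)):
        if fail_after is not None and i >= fail_after:
            return None  # simulated failure of this server
        state = {"offset": i + 1, "total": state["total"] + events[i]}
        storage.put(task_id, state)
    return state["total"]
```

In this sketch the second server resumes from the last mirrored checkpoint rather than from the beginning, so the part of the task already executed is not repeated.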
  • the present application includes the following advantages:
  • In the present application, a control server uniformly allocates the tasks executed by the streaming computing center server clusters and streaming computing unit server clusters deployed in multiple locations, which achieves unified scheduling and allocation of streaming computing tasks. Together with data synchronization among the central storage clusters, this allows clusters deployed in multiple locations to compute parts of the same streaming computing task, or different streaming computing tasks, at the same time.
  • When a cluster becomes abnormal, the streaming computing task being executed can be quickly resumed on a streaming computing center server cluster in another location. System resources are thus not left idle, and streaming computing tasks are multi-live across locations: when one location becomes abnormal, its tasks are quickly resumed elsewhere, which gives the streaming computing service high availability.
  • FIG. 1 is a scenario architecture diagram of the present application in actual use
  • FIG. 2 is a flowchart of an embodiment of a method for allocating a streaming computing task of the present application
  • FIG. 3 is a flowchart of an embodiment of a method for executing a streaming computing task of the present application
  • FIG. 5 is a structural block diagram of an embodiment of a control server of the present application.
  • FIG. 6 is a structural block diagram of an embodiment of a streaming computing center server cluster of the present application.
  • A server cluster groups one or more servers together to perform the same service, so that to the client there appears to be only one server. A server cluster can use multiple computers for parallel computing to achieve high computing speed, and can also use multiple computers as backups, so that the cluster as a whole keeps working even if any one machine fails.
  • A streaming computing center server cluster is a server cluster used to perform streaming computing tasks. Such a cluster reserves preset computing resources and stores the intermediate result data and intermediate state data generated while executing streaming computing tasks in a central storage cluster.
  • A streaming computing unit server cluster is likewise a server cluster used to perform streaming computing tasks. It stores the intermediate result data and intermediate state data generated during execution in a unit storage cluster, but it need not reserve preset computing resources.
  • A storage cluster aggregates the storage space of one or more storage devices into a storage pool that provides a unified access interface and management interface to the server cluster.
  • Through the unified access interface, the server cluster can transparently access and use the disks on all storage devices, so the storage cluster can take full advantage of the performance and disk utilization of the storage devices.
  • A central storage cluster is a storage cluster that provides storage space for a streaming computing center server cluster.
  • A unit storage cluster is a storage cluster that provides storage space for a streaming computing unit server cluster.
  • Referring to FIG. 1, a scenario architecture diagram of the streaming computing task allocation method of the present application in actual use is shown.
  • A control server 101, m streaming computing center server clusters 102, and n streaming computing unit server clusters 103 can be configured, where m and n are each an integer greater than one.
  • For example, two streaming computing center server clusters 102 can be configured.
  • The control server 101 can allocate streaming computing tasks to each streaming computing center server cluster 102 and each streaming computing unit server cluster 103. Each streaming computing center server cluster 102 reserves a portion of its computing resources, whereas the streaming computing unit server clusters 103 do not need to reserve computing resources.
  • When a streaming computing center server cluster 102 or a streaming computing unit server cluster 103 becomes abnormal, the control server 101 can detect the abnormality and reassign the tasks not yet executed by that cluster to another, normal candidate streaming computing center server cluster 102 for execution. Note that, because the streaming computing unit server clusters 103 do not reserve computing resources, the control server 101 selects only normal streaming computing center server clusters 102 when reallocating unexecuted tasks; it does not select a streaming computing unit server cluster 103 as the candidate.
  • So that a streaming computing task can continue to be executed after it is switched between different streaming computing center server clusters 102, or from a streaming computing unit server cluster 103 to a streaming computing center server cluster 102, the intermediate state data and intermediate result data need to be synchronized among the central storage clusters 104 connected to the streaming computing center server clusters 102.
  • The unit storage clusters 105 connected to the streaming computing unit server clusters 103 need to synchronize their intermediate state data and intermediate result data to each central storage cluster 104. The unit storage clusters 105 need not synchronize with one another; synchronizing only to the central storage clusters 104 is sufficient, which saves the resources that synchronization among the unit storage clusters 105 would consume.
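The synchronization topology described above can be enumerated with a short Python sketch. The function name `sync_links` and the cluster labels are hypothetical; the sketch only lists which storage cluster sends data to which, under the assumption that central storage clusters synchronize in both directions.

```python
from itertools import combinations


def sync_links(central, units):
    """List the directed synchronization links (source, destination).

    Central storage clusters synchronize with one another in both
    directions; each unit storage cluster pushes to every central storage
    cluster; unit storage clusters never synchronize with each other.
    """
    links = []
    for a, b in combinations(central, 2):
        links.append((a, b))
        links.append((b, a))
    for u in units:
        for c in central:
            links.append((u, c))
    return links
```

With m central and n unit storage clusters this yields m(m-1) + n*m directed links and none of the n(n-1) links that full synchronization among unit storage clusters would add.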
  • The control server 101 is also connected to a control database, which stores the configuration information produced when the control server 101 assigns tasks and the execution state produced while tasks execute.
  • The execution state indicates the part of each streaming computing task that has already been executed on the corresponding streaming computing center server cluster or streaming computing unit server cluster; the configuration information indicates the correspondence between each streaming computing task and the streaming computing center server cluster executing it, or between each streaming computing task and the streaming computing unit server cluster executing it.
  • The streaming computing center server clusters 102 can be deployed in the same first geographic location or, preferably, in different first geographic locations.
  • the first geographic location may be a city, including a municipality, a regional capital, a prefecture-level city, a county-level city, etc., for example, Beijing, Hangzhou, Nanjing, and the like.
  • For example, in the same-location case, one streaming computing center server cluster is deployed in Hangzhou and another is also deployed in Hangzhou; in the different-location case, one streaming computing center server cluster is deployed in Hangzhou and another is deployed in Nanjing or Shanghai, a geographic location different from Hangzhou.
  • Each of the streaming computing unit server clusters 103 can also be deployed in different second geographic locations, including municipalities, provincial capitals, prefecture-level cities, county-level cities, and the like, for example, Suzhou, Xiamen, Shenzhen, and the like.
  • the first geographic location is used to indicate the geographic location of the streaming computing center server cluster 102 deployment
  • the second geographic location is used to represent the geographic location of the streaming computing unit server cluster deployment.
  • The control server 101 assigns streaming computing tasks to the streaming computing center server clusters and streaming computing unit server clusters deployed in these different geographic locations.
  • Referring to FIG. 2, a flowchart of an embodiment of the streaming computing task allocation method based on the application scenario shown in FIG. 1 is illustrated.
  • This embodiment is applied to the control server in FIG. 1.
  • the present embodiment may include the following steps:
  • Step 201: The control server periodically sends heartbeat messages to the streaming computing center server clusters and the streaming computing unit server clusters respectively.
  • The control server is connected to each streaming computing center server cluster and each streaming computing unit server cluster, and a heartbeat feedback mechanism is established between the control server and each of these clusters. On this basis, the control server periodically sends a heartbeat message to each cluster; the heartbeat message is used to detect whether the control server can communicate normally with the streaming computing center server cluster and with the streaming computing unit server cluster.
  • If a cluster can communicate normally with the control server, the cluster is considered normal; if normal communication is not possible, the cluster has usually become abnormal and cannot execute tasks normally.
  • If the control server receives the heartbeat response fed back by a streaming computing center server cluster or a streaming computing unit server cluster, that cluster is considered able to communicate normally with the control server, that is, no abnormality has occurred. Otherwise, the cluster cannot communicate normally with the control server, that is, an abnormality has occurred.
  • the period for sending the heartbeat message may be a heartbeat duration, for example, 1 second. Of course, those skilled in the art can set the heartbeat duration autonomously.
  • Step 202 In response to receiving the streaming computing task, the control server assigns the streaming computing task to the target streaming computing center server cluster or the target streaming computing unit server cluster.
  • The control server can be controlled by the system administrator.
  • The control server can provide a human-computer interaction interface.
  • The system administrator inputs a task instruction through this interface, and the control server, according to the task instruction input by the system administrator, sends the streaming computing task to the streaming computing center server cluster or streaming computing unit server cluster designated by the system administrator, that is, the target streaming computing center server cluster or the target streaming computing unit server cluster.
  • Other methods may also be used to determine the target streaming computing center server cluster or the target streaming computing unit server cluster.
  • For example, the control server picks, in a polling (round-robin) manner, a streaming computing center server cluster as the target streaming computing center server cluster, or a streaming computing unit server cluster as the target streaming computing unit server cluster.
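  • The selection described above can be sketched as follows. This is a minimal illustration that assumes the "rotation" manner means round-robin polling over the known clusters; the helper name `make_round_robin_picker` and the cluster identifiers are hypothetical and do not come from the application.

```python
from itertools import cycle
from typing import Callable, Sequence


def make_round_robin_picker(cluster_ids: Sequence[str]) -> Callable[[], str]:
    """Return a function that yields cluster ids in rotation (hypothetical helper)."""
    if not cluster_ids:
        raise ValueError("at least one cluster is required")
    rotation = cycle(cluster_ids)
    return lambda: next(rotation)


# Hypothetical cluster identifiers, not taken from the source.
pick_target = make_round_robin_picker(["center-1", "center-2", "unit-1"])
targets = [pick_target() for _ in range(4)]
print(targets)
```

  • Each call to `pick_target()` returns the next cluster in the list and wraps around at the end, so successive tasks are spread across the clusters without any load information.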
  • step 203 may also be performed:
  • Step 203 The control server stores the execution status and configuration information of each streaming computing task in the control database.
  • The control server may store the configuration information of each streaming computing task in a control database connected to it, for example, the correspondence between each streaming computing task and the streaming computing center server cluster or streaming computing unit server cluster that executes it.
  • The control server may further store in the control database the execution status of each streaming computing task on its streaming computing center server cluster or streaming computing unit server cluster, wherein the execution status may indicate the part of each streaming computing task that has already been executed on the corresponding streaming computing center server cluster or streaming computing unit server cluster.
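  • As a rough illustration of step 203, the sketch below keeps the configuration information and the execution status of each task in one record. The class names, the field names, and the in-memory dictionary standing in for the control database are all assumptions made for the example; the application does not specify a schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRecord:
    """One row of the control database (field names are illustrative)."""
    task_id: str
    cluster_id: str          # configuration info: which cluster executes the task
    total_items: int         # size of the task, e.g. number of source records
    executed_items: int = 0  # execution status: how much has been executed


class ControlDatabase:
    """In-memory stand-in for the control database."""

    def __init__(self) -> None:
        self._records: Dict[str, TaskRecord] = {}

    def save_config(self, task_id: str, cluster_id: str, total_items: int) -> None:
        self._records[task_id] = TaskRecord(task_id, cluster_id, total_items)

    def update_state(self, task_id: str, executed_items: int) -> None:
        self._records[task_id].executed_items = executed_items

    def tasks_on(self, cluster_id: str) -> List[TaskRecord]:
        # Look up the tasks that the given cluster is executing.
        return [r for r in self._records.values() if r.cluster_id == cluster_id]


db = ControlDatabase()
db.save_config("task-1", "unit-cluster-1", total_items=10000)
db.update_state("task-1", executed_items=4000)
print(db.tasks_on("unit-cluster-1"))
```

  • Keeping both pieces of information keyed by task lets the control server later find which tasks were running on an abnormal cluster and how far each had progressed.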
  • Step 204 While the target streaming computing center server cluster or the target streaming computing unit server cluster executes the streaming computing task, determine whether an abnormal situation occurs in the target streaming computing center server cluster or the target streaming computing unit server cluster. If yes, proceed to step 205; if not, continue to perform this step.
  • After the control server allocates the streaming computing task, it detects in real time, while the target streaming computing center server cluster or the target streaming computing unit server cluster executes the task, whether the connection between itself and the target cluster is normal. If the connection is normal, no abnormality has occurred in the target streaming computing center server cluster or the target streaming computing unit server cluster. If the connection is not normal, for example, the control server does not receive the heartbeat response fed back by the target cluster within the preset feedback time, the target streaming computing center server cluster or the target streaming computing unit server cluster may have encountered an abnormal situation.
  • If the target streaming computing unit server cluster includes only one streaming computing unit server, step 205 needs to be entered when that server becomes abnormal.
  • If the target streaming computing unit server cluster includes multiple streaming computing unit servers, the entire streaming computing unit server cluster is judged to be abnormal only when its streaming computing unit servers are abnormal to the extent that the connection between the control server and the target streaming computing unit server cluster is broken. For example, in a practical application, a power outage or a fire occurs in the machine room where the target streaming computing unit server cluster is located.
  • If only an individual streaming computing unit server in the target streaming computing unit server cluster is abnormal, for example, that server is down, the unexecuted part of the task being executed on the abnormal server is switched to another normal streaming computing unit server in the same cluster. The tasks executed by the streaming computing unit server cluster as a whole can then proceed smoothly, which keeps the cluster in a normal running state overall.
  • The control server may determine whether the target streaming computing center server cluster or the target streaming computing unit server cluster is abnormal according to whether a heartbeat response is received within the preset feedback time after the heartbeat message is sent in step 201. For example, if no heartbeat response fed back by the target cluster is received for one continuous minute, the target cluster is determined to be abnormal and the process proceeds to step 205. If a heartbeat response fed back by the target cluster is received within one minute, the target cluster is determined not to be abnormal, and step 204 continues to be performed for real-time judgment.
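  • The timeout check of step 204 can be sketched as below. The 60-second feedback time is the example value from the text; the class `HeartbeatMonitor`, its method names, and the explicit `now` timestamps (used instead of a real clock so the example is deterministic) are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict

FEEDBACK_TIMEOUT_S = 60.0  # example preset feedback time from the text


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat response seen from each cluster (hypothetical helper)."""
    timeout_s: float = FEEDBACK_TIMEOUT_S
    last_response: Dict[str, float] = field(default_factory=dict)

    def register(self, cluster_id: str, now: float) -> None:
        # Start the clock when the cluster is first monitored.
        self.last_response[cluster_id] = now

    def on_response(self, cluster_id: str, now: float) -> None:
        # Called whenever a heartbeat response arrives from a cluster.
        self.last_response[cluster_id] = now

    def is_abnormal(self, cluster_id: str, now: float) -> bool:
        # Abnormal when no response arrived within the preset feedback time.
        return now - self.last_response[cluster_id] > self.timeout_s


monitor = HeartbeatMonitor()
monitor.register("unit-cluster-1", now=0.0)
monitor.register("center-cluster-1", now=0.0)

# center-cluster-1 keeps answering; unit-cluster-1 goes silent after t=5.
monitor.on_response("unit-cluster-1", now=5.0)
monitor.on_response("center-cluster-1", now=64.0)

print(monitor.is_abnormal("unit-cluster-1", now=66.0))    # silent for 61 s
print(monitor.is_abnormal("center-cluster-1", now=66.0))  # answered 2 s ago
```

  • A cluster flagged by `is_abnormal` would be the trigger for step 205; a cluster that answers again simply refreshes its timestamp.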
  • When an abnormality is determined, the control server can alert the system administrator.
  • If the system administrator confirms that the streaming computing center server cluster or streaming computing unit server cluster does have an abnormal condition, for example, a network disconnection or a power failure, repair operations can be performed. After the abnormal streaming computing center server cluster or streaming computing unit server cluster has been repaired, it can again be assigned streaming computing tasks as a normal cluster.
  • Step 205 Assign the unexecuted tasks in the streaming computing task to the candidate streaming computing center server cluster.
  • The unexecuted task may be: the remaining tasks in the streaming computing task other than the tasks that the target streaming computing center server cluster or the target streaming computing unit server cluster has already executed.
  • step 205 can include:
  • Step A1 The control server acquires the load status of the plurality of streaming computing center server clusters in real time.
  • The control server can obtain the load status of each streaming computing center server cluster and each streaming computing unit server cluster in real time.
  • The load status may be expressed by hardware parameter values such as CPU utilization, memory read speed, and disk input/output (I/O) performance, and the load status of each streaming computing center server cluster and streaming computing unit server cluster can be determined from these hardware parameter values.
  • In this way, when a task later needs to be reassigned, it can be assigned to a less-loaded streaming computing center server cluster or streaming computing unit server cluster.
  • The streaming computing center server cluster needs to reserve computing resources. Assuming that the number of streaming computing center server clusters is N, where N is an integer greater than 1, the reserved computing resources can be "N*10%", so that the tasks of other streaming computing center server clusters or streaming computing unit server clusters can be taken over as far as possible.
  • The computing resource may be a hardware resource such as a CPU, a memory, and a disk.
  • For example, the streaming computing center server cluster can always keep 20% of its computing resources idle, and this idle 20% can be used to execute tasks that have not been executed on other streaming computing center server clusters or streaming computing unit server clusters.
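  • One way to picture the reserved capacity is as an admission check, sketched below. The policy itself (regular tasks may only use the non-reserved share, reassigned tasks may also use the reserved share) is an assumption derived from the 20% example above, and the function name `can_accept` and its parameters are hypothetical.

```python
RESERVED_PCT = 20  # example reservation from the text


def can_accept(used_pct: int, demand_pct: int, is_failover: bool,
               reserved_pct: int = RESERVED_PCT) -> bool:
    """Admission check for a center cluster (assumed policy, not from the source)."""
    # Regular tasks may only use the non-reserved share; reassigned
    # (failover) tasks may also draw on the reserved share.
    limit = 100 if is_failover else 100 - reserved_pct
    return used_pct + demand_pct <= limit


print(can_accept(60, 15, is_failover=False))  # 75 <= 80
print(can_accept(70, 20, is_failover=False))  # 90 > 80
print(can_accept(70, 20, is_failover=True))   # 90 <= 100
```

  • Percentages are kept as integers here to avoid floating-point rounding at the boundary.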
  • Step A2 The control server allocates the unexecuted tasks in the streaming computing task to the cluster of the streaming computing center server with the smallest current load.
  • the control server then allocates the unexecuted tasks to the cluster of the streaming computing center server with the smallest current load determined according to the load condition of each streaming computing central server cluster in step A1.
  • step A2 may include:
  • Step A21 The control server calculates an unexecuted task in the streaming computing task according to the execution state and configuration information stored in the control database.
  • The control server may determine, according to the configuration information, the streaming computing task that the abnormal cluster was executing, then determine, according to the execution status, the part of that streaming computing task that has already been executed, and from these calculate the unexecuted tasks in the streaming computing task.
  • Step A22 The control server allocates the unexecuted task to the cluster of the streaming computing center server with the smallest current load.
  • The control server then reassigns the unexecuted tasks to the streaming computing center server cluster with the smallest current load for execution.
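  • Steps A1 to A22 can be sketched together as follows. Load is represented here by a single CPU-utilization figure per center cluster, which is a simplification of the hardware parameters mentioned above; the function names and the sample values are illustrative assumptions.

```python
from typing import Dict, Tuple


def unexecuted_items(total_items: int, executed_items: int) -> int:
    """Step A21: remaining work = total work minus the executed part."""
    return max(total_items - executed_items, 0)


def least_loaded(loads: Dict[str, float]) -> str:
    """Steps A1/A2: pick the center cluster with the smallest current load."""
    if not loads:
        raise ValueError("no candidate center cluster available")
    return min(loads, key=loads.get)


def reassign(total_items: int, executed_items: int,
             loads: Dict[str, float]) -> Tuple[str, int]:
    """Step A22: hand the remaining work to the least-loaded center cluster."""
    return least_loaded(loads), unexecuted_items(total_items, executed_items)


# Load expressed as CPU utilization; the values are illustrative.
center_loads = {"center-1": 0.40, "center-2": 0.60}
print(reassign(10000, 4000, center_loads))
```

  • In a real deployment the load figure would likely combine several metrics; the selection step stays the same as long as the metrics reduce to one comparable value per cluster.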
  • After the unexecuted tasks have been reassigned in step 205, the control server may return to step 202 to continue assigning newly received streaming computing tasks.
  • In this embodiment, the streaming computing tasks executed by the streaming computing center server clusters and streaming computing unit server clusters deployed in multiple places are uniformly allocated by one control server, thereby implementing unified scheduling and allocation of streaming computing tasks.
  • When an abnormality occurs in a streaming computing center server cluster or a streaming computing unit server cluster, the streaming computing task being executed can be quickly resumed on a remote streaming computing center server cluster. This ensures that system resources do not normally sit idle, and that under abnormal conditions the streaming computing task can be quickly recovered on a remote streaming computing center server cluster, achieving high availability of the streaming computing service.
  • Referring to FIG. 3, a flowchart of an embodiment of a method for executing a streaming computing task according to the present application is shown.
  • The method is applied to any current streaming computing center server cluster in the streaming computing system shown in FIG. 1.
  • The streaming computing system may include: a plurality of streaming computing center server clusters, a plurality of streaming computing unit server clusters, and a control server. Each streaming computing center server cluster has a central storage cluster, and the central storage clusters of the streaming computing center server clusters synchronize intermediate state data and intermediate result data with one another.
  • The unit storage cluster of each streaming computing unit server cluster synchronizes intermediate state data and intermediate result data to the central storage cluster of each streaming computing center server cluster.
  • this embodiment may include:
  • Step 301 In response to the control server reassigning an unexecuted task in a streaming computing task when an abnormal situation occurs in another streaming computing center server cluster or a streaming computing unit server cluster in the streaming computing system, the current streaming computing center server cluster obtains, from the connected central storage cluster, the intermediate state data and intermediate result data required to execute the unexecuted task.
  • As in the embodiment shown in FIG. 2, the control server reassigns the task that was being executed by the abnormal streaming computing center server cluster or streaming computing unit server cluster to the current streaming computing center server cluster.
  • The current streaming computing center server cluster then obtains the intermediate state data and intermediate result data required to execute the unexecuted task from the connected central storage cluster.
  • The intermediate state data may be: the task state produced by the streaming computing center server cluster or the streaming computing unit server cluster while executing the streaming computing task before the abnormal situation occurred, for example, which part of the streaming computing task has already been executed. The intermediate result data may be: the result data generated by the part of the task that has already been executed.
  • The current streaming computing center server cluster therefore does not need to re-execute the part of the streaming computing task that has already been executed, and can execute only the unexecuted part according to the intermediate state data and the intermediate result data.
  • Step 302 The current streaming computing center server cluster executes the unexecuted task by using the intermediate state data and intermediate result data.
  • the current streaming computing center server cluster then references the intermediate state data and the intermediate result data to perform the re-allocated unexecuted task.
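  • Steps 301 and 302 can be illustrated with a sum-style task. In this sketch the intermediate state is reduced to a count of processed records and the intermediate result to a partial total; the dictionary keys, the function name `resume_task`, and the sample amounts are all assumptions for the example rather than the data layout used by the application.

```python
from typing import Dict, List


def resume_task(source_amounts: List[int],
                intermediate_state: Dict[str, int],
                intermediate_result: Dict[str, int]) -> int:
    """Continue a sum-style task from synchronized checkpoint data."""
    start = intermediate_state["processed_count"]      # which part is already done
    total = intermediate_result["transaction_volume"]  # result of that part
    for amount in source_amounts[start:]:
        total += amount
    return total


amounts = [10, 20, 30, 40, 50]
state = {"processed_count": 2}       # the failed cluster handled the first two records
result = {"transaction_volume": 30}  # 10 + 20
print(resume_task(amounts, state, result))
```

  • The resumed run touches only the last three records, yet the final figure equals what a full run over all five records would produce.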
  • the method may further include:
  • Step 303 In response to the control server periodically sending a heartbeat message, the current streaming computing center server cluster periodically feeds back a heartbeat response to the control server.
  • The control server establishes a heartbeat mechanism with the streaming computing center server cluster.
  • The control server periodically sends a heartbeat message to the current streaming computing center server cluster.
  • The heartbeat message is used to detect whether the control server and the current streaming computing center server cluster can communicate with each other. If they can, the current streaming computing center server cluster periodically feeds back a heartbeat response to the control server.
  • the method may further include:
  • Step 304 The current streaming computing center server cluster detects whether the continuous number of times the heartbeat response fails to be fed back to the control server exceeds a preset number of thresholds, and if so, the current streaming computing center server cluster stops the streaming computing task Execution.
  • The current streaming computing center server cluster can also detect whether the heartbeat mechanism between itself and the control server is normal, for example, by detecting whether the number of consecutive failures to feed back the heartbeat response to the control server exceeds a preset count threshold, such as whether feeding back the heartbeat response to the control server has failed 10 consecutive times.
  • If yes, the current streaming computing center server cluster has an abnormality and can stop executing the streaming computing task. If no, the current streaming computing center server cluster is normal, and step 303 continues to be performed, that is, the heartbeat response is periodically fed back to the control server.
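  • The cluster-side check of step 304 can be sketched as a counter of consecutive failures. The threshold of 10 is the example value from the text, and "exceeds" is read here as strictly greater than the threshold; the class `HeartbeatFeedback` and its attributes are hypothetical names.

```python
class HeartbeatFeedback:
    """Counts consecutive failed heartbeat responses on the cluster side."""

    def __init__(self, max_consecutive_failures: int = 10) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.running = True  # whether the cluster keeps executing its task

    def record(self, delivered: bool) -> bool:
        """Record one feedback attempt and return whether execution continues."""
        if delivered:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures > self.max_consecutive_failures:
                self.running = False       # stop executing the streaming task
        return self.running


feedback = HeartbeatFeedback()
for _ in range(10):
    feedback.record(delivered=False)
print(feedback.running)                   # 10 failures do not exceed the threshold
print(feedback.record(delivered=False))   # the 11th consecutive failure does
```

  • Stopping on the cluster side complements the control server's timeout: when the link is cut, the isolated cluster stops working on a task that the control server is about to reassign.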
  • In this embodiment, the tasks executed by the streaming computing center server clusters and streaming computing unit server clusters deployed in multiple places are uniformly allocated by one control server, thereby implementing unified scheduling and allocation of streaming computing tasks. Real-time data synchronization between the central storage clusters allows the streaming computing center server clusters or streaming computing unit server clusters deployed in multiple places to compute, at the same time, different parts of the same streaming computing task or different streaming computing tasks.
  • When an abnormality occurs in a streaming computing center server cluster or a streaming computing unit server cluster, the running streaming computing task can be quickly resumed on a remote streaming computing center server cluster. This ensures that system resources do not normally sit idle, and that the streaming computing task can be quickly recovered in an abnormal situation, achieving high availability of the streaming computing service.
  • Step 401 The control server sends a heartbeat message to the streaming computing center server clusters 1 and 2, and the streaming computing unit server clusters 1 and 2.
  • Assume that there are two streaming computing center server clusters, namely streaming computing center server cluster 1 and streaming computing center server cluster 2, and also two streaming computing unit server clusters, namely streaming computing unit server cluster 1 and streaming computing unit server cluster 2. The control server then sends heartbeat messages to each streaming computing center server cluster and each streaming computing unit server cluster, with a heartbeat duration of 1 second.
  • The streaming computing center server clusters 1 and 2 can be deployed at different sites in Hangzhou; they can of course also be deployed in different cities.
  • The streaming computing unit server cluster 1 is deployed in Hangzhou, and the streaming computing unit server cluster 2 is deployed in Nanjing.
  • Step 402 The streaming computing center server clusters 1 and 2, and the streaming computing unit server clusters 1 and 2 respectively feed back the heartbeat response to the control server.
  • Step 403 The control server allocates the streaming computing task to the streaming computing unit server cluster 1 for execution.
  • The system administrator submits a streaming computing task to the control server, for example, counting the transaction volume of Hangzhou on August 15, 2016, and designates that the task be executed by the streaming computing unit server cluster deployed in Hangzhou. The control server then, according to the system administrator's instruction, allocates the transaction volume statistics task to the streaming computing unit server cluster 1 and triggers it to start counting the transaction volume.
  • The streaming computing center server cluster 1 has its own central storage cluster 1, and the streaming computing center server cluster 2 has its own central storage cluster 2.
  • The streaming computing unit server cluster 1 has its own unit storage cluster 1, and the streaming computing unit server cluster 2 has its own unit storage cluster 2.
  • The streaming computing unit server cluster 1 can obtain from the data source the source data required for counting the transaction volume, for example, order information whose IP address is located in Hangzhou, and count the transaction volume according to this source data.
  • The local data sources of each location can be synchronized to the central data source corresponding to the streaming computing center server clusters, and both the streaming computing center server clusters and the streaming computing unit server clusters can pull source data from the central data source.
  • Step 404 In the process of the streaming computing unit server cluster 1 performing the streaming computing task, the unit storage cluster 1 connected by the streaming computing unit server cluster 1 synchronizes the intermediate state and intermediate result data generated during the execution to the central storage cluster. 1 and the central storage cluster 2, at the same time, the control server stores the execution status and configuration information of the streaming computing task into the control database.
  • The control server can acquire the execution status of the task in real time and store it in the control database, together with the configuration information indicating that the streaming computing task is executed by the streaming computing unit server cluster 1.
  • For example, the execution status may indicate that, at a certain moment, the streaming computing unit server cluster 1 has obtained a total of 10000 pieces of source data, has already counted 4000 of them, and has not yet counted the other 6000.
  • the execution state can also be expressed in other ways.
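  • The synchronization in step 404 can be pictured as the unit storage cluster pushing its latest checkpoint to every central storage cluster. The sketch below uses in-memory dictionaries as stand-ins for the storage clusters; the class and function names, the checkpoint keys, and the transaction volume figure are illustrative assumptions, and only the processed count of 4000 comes from the example above.

```python
from typing import Dict, List

Checkpoint = Dict[str, int]


class StorageCluster:
    """In-memory stand-in for a unit or central storage cluster."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.checkpoints: Dict[str, Checkpoint] = {}


def synchronize(task_id: str, unit: StorageCluster,
                centers: List[StorageCluster]) -> None:
    """Copy the unit cluster's latest checkpoint to every central storage cluster."""
    checkpoint = unit.checkpoints[task_id]
    for center in centers:
        center.checkpoints[task_id] = dict(checkpoint)  # copy, not a shared reference


unit_1 = StorageCluster("unit-storage-1")
center_1 = StorageCluster("central-storage-1")
center_2 = StorageCluster("central-storage-2")

# The transaction volume value is made up for the example.
unit_1.checkpoints["task-1"] = {"processed_count": 4000, "transaction_volume": 123456}
synchronize("task-1", unit_1, [center_1, center_2])
print(center_1.checkpoints["task-1"] == center_2.checkpoints["task-1"])
```

  • Because every central storage cluster holds the same checkpoint, whichever center cluster is later chosen for reassignment can resume the task from its own local storage.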
  • Step 405 The flow computing unit server cluster 1 detects whether the continuous number of times the heartbeat response fails to be fed back to the control server exceeds a preset number of thresholds, and if so, the streaming computing unit server cluster stops execution of the streaming computing task, If no, step 405 is performed.
  • While executing the task, the streaming computing unit server cluster 1 also detects in real time whether it has failed to feed back the heartbeat response to the control server. If it has failed, the number of consecutive failures is counted. If the number of consecutive failures exceeds the preset count threshold, for example, 10 times, the streaming computing unit server cluster 1 and the control server can no longer communicate normally. In this case, an abnormal situation such as a network disconnection or power failure may have occurred at the streaming computing unit server cluster 1.
  • The streaming computing unit server cluster 1 then exits the process of counting the transaction volume.
  • Step 406 The control server determines whether the streaming computing unit server cluster 1 feeds back the heartbeat response within the preset feedback time. If not, it proceeds to step 407, and if yes, proceeds to step 406.
  • The control server also determines in real time whether the streaming computing unit server cluster 1 feeds back the heartbeat response within the preset feedback time, for example, within 1 minute. If no heartbeat response fed back by the streaming computing unit server cluster 1 is received, the streaming computing unit server cluster 1 cannot execute the task normally. Otherwise, the control server continues to monitor the heartbeat response, that is, continues to perform this step.
  • Step 407 The control server acquires the load status of each streaming computing center server cluster in real time, and determines an unexecuted task of the streaming computing task according to the execution state and the configuration information.
  • The control server can also obtain the load status of the streaming computing center server clusters 1 and 2 in real time. Suppose it determines that the load of the streaming computing center server cluster 1 is a CPU utilization of 40% and the load of the streaming computing center server cluster 2 is a CPU utilization of 60%; in this case, the load of the streaming computing center server cluster 1 is the smaller one.
  • According to the execution status and configuration information stored in the control database, the control server determines that 40% of the transaction volume statistics task has been executed and that the remaining 6000 pieces of source data have not been counted.
  • Step 408 The control server allocates the unexecuted tasks to the cluster of the streaming computing center server with the smallest current load for execution.
  • Step 409 The streaming computing center server cluster 1 continues to execute the unexecuted tasks based on the synchronized intermediate state data and intermediate result data in the central storage cluster 1.
  • The control server allocates the remaining 60% of unexecuted tasks to the streaming computing center server cluster 1. Because the intermediate state data and intermediate result data stored in the central storage cluster 1 are synchronized in real time from the unit storage clusters 1 and 2, the streaming computing center server cluster 1 can obtain the intermediate state data and intermediate result data of the transaction volume statistics task directly from the central storage cluster 1, and then continue to execute the remaining 60% of the task according to that data, without repeating the 40% that has already been executed.
  • The present application further provides an embodiment of a control server. The control server is connected to the plurality of streaming computing center server clusters and to the plurality of streaming computing unit server clusters respectively, wherein each streaming computing center server cluster reserves a preset proportion of computing resources.
  • the control server may include:
  • the first allocating unit 501 is configured to allocate the streaming computing task to the target streaming computing center server cluster or the target streaming computing unit server cluster in response to receiving the streaming computing task.
  • The determining unit 502 is configured to determine, while the target streaming computing center server cluster or the target streaming computing unit server cluster executes the streaming computing task, whether an abnormal situation occurs in the target streaming computing center server cluster or the target streaming computing unit server cluster.
  • The second allocating unit 503 is configured to allocate an unexecuted task in the streaming computing task to a candidate streaming computing center server cluster; the unexecuted task is: the remaining tasks in the streaming computing task other than the tasks that the target streaming computing center server cluster or the target streaming computing unit server cluster has already executed.
  • the second allocating unit 503 may specifically include:
  • a load obtaining subunit, configured to acquire, in real time, the load status of the plurality of streaming computing center server clusters and the plurality of streaming computing unit server clusters;
  • a first allocation subunit, configured to allocate, according to the load status of the streaming computing center server clusters, the unexecuted tasks in the streaming computing task to the streaming computing center server cluster with the smallest current load.
  • the control server may further include:
  • a sending unit, configured to periodically send a heartbeat message to the streaming computing center server clusters and the streaming computing unit server clusters respectively, where the heartbeat message is used to detect whether the control server and the streaming computing center server cluster can communicate with each other, and whether the control server and the streaming computing unit server cluster can communicate with each other;
  • the determining unit 502 is specifically configured to: determine whether the target streaming computing center server cluster or the target streaming computing unit server cluster does not feed back a heartbeat response within a preset feedback time.
  • Each streaming computing center server cluster has a central storage cluster; the central storage clusters of the streaming computing center server clusters synchronize intermediate state data and intermediate result data with one another, and the streaming computing unit server clusters synchronize intermediate state data and intermediate result data to the central storage clusters. The control server may further include:
  • a storage unit configured to store execution state and configuration information of each flow computing task into a control database;
  • the execution state is used to indicate the part of each streaming computing task that has been executed on the corresponding streaming computing center server cluster or streaming computing unit server cluster;
  • the configuration information is used to indicate the correspondence between each streaming computing task and the streaming computing center server cluster that executes it, or the correspondence between each streaming computing task and the streaming computing unit server cluster that executes it;
  • the first allocating subunit may specifically include:
  • a calculating subunit configured to calculate an unexecuted task in the streaming computing task according to an execution state and configuration information stored in the control database
  • a second allocation subunit configured to allocate the unexecuted task to a cluster of streaming computing center servers with a minimum current load.
  • The control server of this embodiment can uniformly allocate the tasks executed by the streaming computing center server clusters and streaming computing unit server clusters deployed in multiple places, realizing unified scheduling and allocation of streaming computing tasks. Real-time data synchronization between the central storage clusters allows the streaming computing center server clusters or streaming computing unit server clusters deployed in multiple places to compute, at the same time, different parts of the same streaming computing task or different streaming computing tasks.
  • When an abnormality occurs in a streaming computing center server cluster or a streaming computing unit server cluster, the streaming computing task being executed can be quickly resumed on a remote streaming computing center server cluster, so that system resources do not normally sit idle. It also ensures that the streaming computing task can be quickly restored in an abnormal situation, achieving high availability of the streaming computing service.
  • The present application further provides an embodiment of a streaming computing center server cluster.
  • There are multiple streaming computing center server clusters in the streaming computing system, and each reserves a preset proportion of computing resources.
  • The multiple streaming computing center server clusters are respectively connected to the control server, and the control server is also connected to the plurality of streaming computing unit server clusters;
  • each streaming computing center server cluster has a central storage cluster, the central storage clusters of the streaming computing center server clusters synchronize intermediate state data and intermediate result data with one another, and the unit storage cluster of each streaming computing unit server cluster synchronizes intermediate state data and intermediate result data to the central storage clusters of the streaming computing center server clusters;
  • the streaming computing center server cluster may include:
  • The data obtaining unit 601 is configured to: in response to the control server reassigning an unexecuted task in a streaming computing task when an abnormal situation occurs in another streaming computing center server cluster or a streaming computing unit server cluster in the streaming computing system, obtain from the central storage cluster the intermediate state data and intermediate result data required to execute the unexecuted task.
  • the execution task unit 602 is configured to execute the unexecuted task by using the preset computing resource, the intermediate state data, and the intermediate result data.
  • The streaming computing center server cluster may further include:
  • a feedback unit, configured to periodically return a heartbeat response to the control server in response to the control server periodically sending a heartbeat message, where the heartbeat message is used to detect whether the control server and the current streaming computing center server cluster can communicate with each other.
  • The streaming computing center server cluster may further include:
  • a detecting unit, configured to detect whether the number of consecutive failures to send the heartbeat response to the control server exceeds a preset count threshold; and a stopping unit, configured to stop executing the unfinished tasks if the result of the detecting unit is yes.
  • The streaming computing center server cluster of this embodiment can receive and execute the streaming computing tasks uniformly allocated by the control server. By synchronizing data in real time between the central storage clusters, center or unit server clusters deployed in multiple locations can simultaneously compute different parts of the same streaming computing task, or different streaming computing tasks. When a streaming computing center server cluster or streaming computing unit server cluster becomes abnormal, the streaming computing task being executed can be quickly resumed on a center server cluster at another location. System resources therefore do not sit idle in normal operation, and tasks can be recovered quickly under abnormal conditions, giving the streaming computing service high availability.
  • An embodiment of the present application further provides a system for allocating and executing streaming computing tasks. The system may include the control server shown in FIG. 5, multiple streaming computing center server clusters as shown in FIG. 6, and multiple streaming computing unit server clusters. Each center server cluster has its own central storage cluster, each unit server cluster has its own unit storage cluster, and the control server has its own control database. For a structural block diagram of the system, see FIG. 1; for details not covered here, see the detailed description of the foregoing embodiments, which is not repeated.
  • An embodiment of the present application further provides an off-site multi-active system, which includes a first streaming computing center server cluster, a second streaming computing center server cluster, multiple streaming computing unit server clusters, and a control server. The first and second streaming computing center server clusters are streaming computing center server clusters as shown in FIG. 6, and for the control server see FIG. 5. The multiple streaming computing unit server clusters are deployed in multiple second geographic locations, respectively, and the first and second streaming computing center server clusters are deployed in the same or in different first geographic locations.
  • In this embodiment, the streaming computing center server clusters and the streaming computing unit server clusters are deployed in first geographic locations and second geographic locations, respectively. When a streaming computing unit server cluster becomes abnormal, the streaming computing task it was executing can therefore be recovered on the first or second streaming computing center server cluster at another location, and the unfinished part of the task continues to execute on that remote center server cluster, realizing the off-site multi-active function. In addition, when the first and second streaming computing center server clusters are deployed in different first geographic locations and one of them becomes abnormal, the streaming computing task being executed by the abnormal cluster can likewise be recovered on the other center server cluster at the other location, where the unfinished part continues to execute. This also realizes the off-site multi-active function.
  • The present application further provides an off-site multi-active system, which may specifically include: a first streaming computing center server, at least configured to provide computing resources externally and including a first central storage unit; and a second streaming computing center server, at least configured to provide computing resources externally and including a second central storage unit. The first and second streaming computing center servers perform load balancing based on a unified load balancing policy, and the first and second central storage units serve as hot standbys for each other. For a first streaming computing task running on the first streaming computing center server, when the first streaming computing center server fails and can no longer provide computing resources externally, the task is terminated on the first streaming computing center server and, based on the intermediate state data and intermediate result data in the second central storage unit of the second streaming computing center server, continues to run on the second streaming computing center server.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Hardware Redundancy (AREA)

Abstract

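The present application provides a method for allocating streaming computing tasks and a control server. The allocation method is applied on a control server connected to streaming computing center server clusters and streaming computing unit server clusters, and includes: allocating a streaming computing task to a target streaming computing center server cluster or a target streaming computing unit server cluster; and determining whether the target cluster is in an abnormal condition and, if so, allocating the unfinished tasks of the streaming computing task to a candidate streaming computing center server cluster. With the embodiments of the present application, when one streaming computing center server cluster or streaming computing unit server cluster becomes abnormal, the unfinished tasks can continue to be executed on another, normal streaming computing center server cluster, ensuring that the streaming computing task is executed smoothly.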

Description

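Method for allocating streaming computing tasks and control server
This application claims priority to Chinese patent application No. 201610908946.7, filed on October 18, 2016 and entitled "流式计算任务的分配方法和控制服务器" (Method for allocating streaming computing tasks and control server), the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of streaming computing, and in particular to a method for allocating streaming computing tasks and a control server, a method for executing streaming computing tasks and a streaming computing center server cluster, a streaming computing system, and an off-site multi-active system.
Background
In streaming computing, neither the arrival time nor the arrival order of data can be determined, and it is impossible to store all of the data. The servers involved therefore no longer store the streaming data; instead, they compute on the data in memory in real time as it arrives. With the rapid development of streaming computing in the era of Internet big data, the requirements on the timeliness, quality, service stability, and availability of streaming data keep rising, which also challenges traditional distributed web service systems. Because a streaming computing system computes on and reads a huge volume of data in real time, distributing streaming computing tasks across multiple locations raises many difficulties, for example merging deduplicated statistics across locations in real time, guaranteeing data consistency across locations, and the uncontrollable geographic origin of the data. Multi-region coordination of streaming computing with real-time disaster recovery is therefore highly necessary.
In the prior art, streaming tasks are usually allocated using off-site cold standby: an idle server is deployed in another region so that, when the service in one region becomes unavailable, the streaming computing task can be temporarily restored onto the idle server in the other region. However, that server sits idle most of the time, which wastes a large amount of system resources. In another approach, servers are deployed in a single machine room or in several machine rooms in the same region, and the data of those machine rooms is stored in one storage system to implement streaming computing. In that case, however, if the region's network becomes unavailable (for example, an optical cable is accidentally cut by construction machinery), if the region's storage system becomes unavailable, or if the region's machine resources have reached their expansion limit and cannot be expanded further, the streaming computing system becomes unavailable, and the smooth allocation and subsequent execution of streaming computing tasks cannot be guaranteed.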
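Summary of the Invention
On this basis, the present application provides a method for allocating streaming computing tasks and a method for executing streaming computing tasks, in which one control server uniformly allocates the streaming computing tasks, and the streaming computing center server clusters and streaming computing unit server clusters deployed in multiple locations execute different streaming computing tasks. Each center server cluster reserves preset computing resources, data is synchronized among the central storage clusters, and the data in the unit storage cluster of each unit server cluster is also synchronized to each central storage cluster. As a result, when a unit server cluster or center server cluster becomes abnormal, the portion of the running streaming computing task that has not yet been executed can be reallocated to a center server cluster at another location, so that the task is quickly recovered and executed normally there. No idle servers need to be provisioned, which also saves system resources.
The present application also provides a control server, a streaming computing center server cluster, and a streaming computing system, to ensure that the above methods can be implemented and applied in practice.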
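To solve the above problems, the present application discloses a method for allocating computing tasks. The method is applied on a control server connected to streaming computing center server clusters and streaming computing unit server clusters, where the streaming computing center server clusters reserve a preset proportion of computing resources. The method includes:
in response to receiving a streaming computing task, allocating the streaming computing task to a target streaming computing center server cluster or a target streaming computing unit server cluster; and
while the target streaming computing center server cluster or target streaming computing unit server cluster executes the streaming computing task, determining whether the target cluster is in an abnormal condition and, if so, allocating the unfinished tasks of the streaming computing task to a candidate streaming computing center server cluster.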
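The method further includes:
the control server periodically sending heartbeat messages to the streaming computing center server clusters and the streaming computing unit server clusters, respectively, where the heartbeat messages are used to detect whether the control server can communicate with the streaming computing center server clusters and whether the control server can communicate with the streaming computing unit server clusters;
correspondingly, determining whether the target streaming computing center server cluster or target streaming computing unit server cluster is in an abnormal condition is specifically:
determining whether the target cluster has failed to return a heartbeat response within a preset feedback time.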
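Allocating the unfinished tasks of the streaming computing task to a candidate streaming computing center server cluster includes:
the control server obtaining the load of the streaming computing center server clusters in real time; and
the control server allocating, according to the load, the unfinished tasks of the streaming computing task to the streaming computing center server cluster with the lowest current load.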
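The streaming computing center server clusters have central storage clusters; the central storage clusters of the center server clusters synchronize intermediate state data and intermediate result data with one another, and each streaming computing unit server cluster synchronizes intermediate state data and intermediate result data to the central storage cluster of each center server cluster. The method further includes:
the control server storing the execution state and configuration information of each streaming computing task in a control database, where the execution state indicates the part of each streaming computing task that has already been executed on the corresponding center or unit server cluster, and the configuration information indicates the correspondence between each streaming computing task and the center server cluster executing it, or between each streaming computing task and the unit server cluster executing it;
correspondingly, allocating the unfinished tasks of the streaming computing task to the streaming computing center server cluster with the lowest current load includes:
the control server calculating the unfinished tasks of the streaming computing task according to the execution state and configuration information stored in the control database; and
the control server allocating the unfinished tasks to the streaming computing center server cluster with the lowest current load.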
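The present application further provides a method for executing streaming computing tasks. The method is applied on any current streaming computing center server cluster, in a streaming computing system, that reserves preset computing resources. The streaming computing system includes streaming computing center server clusters, streaming computing unit server clusters, and a control server. The center server clusters have central storage clusters; the central storage clusters synchronize intermediate state data and intermediate result data with one another, and the unit storage cluster of each unit server cluster synchronizes intermediate state data and intermediate result data to each central storage cluster. The method includes:
in response to unfinished tasks of a streaming computing task that the control server reallocates when another center server cluster or unit server cluster in the streaming computing system becomes abnormal, the current streaming computing center server cluster obtaining, from the central storage cluster, the intermediate state data and intermediate result data required to execute the unfinished tasks; and
the current streaming computing center server cluster executing the unfinished tasks by using the preset computing resources, the intermediate state data, and the intermediate result data.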
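The method further includes:
in response to the control server periodically sending a heartbeat message, the current streaming computing center server cluster periodically returning a heartbeat response to the control server, where the heartbeat message is used to detect whether the control server and the current streaming computing center server cluster can communicate.
The method further includes:
the current streaming computing center server cluster detecting whether the number of consecutive failures to return the heartbeat response to the control server exceeds a preset count threshold and, if so, stopping execution of the unfinished tasks.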
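The present application further provides a control server. The control server is connected to streaming computing center server clusters and streaming computing unit server clusters, and the streaming computing center server clusters reserve a preset proportion of computing resources. The control server includes:
a first allocating unit, configured to, in response to receiving a streaming computing task, allocate the streaming computing task to a target streaming computing center server cluster or a target streaming computing unit server cluster;
a determining unit, configured to determine, while the target streaming computing center server cluster or target streaming computing unit server cluster executes the streaming computing task, whether the target cluster is in an abnormal condition; and
a second allocating unit, configured to allocate the unfinished tasks of the streaming computing task to a candidate streaming computing center server cluster if the result of the determining unit is yes.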
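The control server further includes:
a sending unit, configured to periodically send heartbeat messages to the streaming computing center server clusters and the streaming computing unit server clusters, respectively, where the heartbeat messages are used to detect whether the control server can communicate with the streaming computing center server clusters and whether the control server can communicate with the streaming computing unit server clusters;
correspondingly, the determining unit is specifically configured to determine whether the target streaming computing center server cluster or target streaming computing unit server cluster has failed to return a heartbeat response within a preset feedback time.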
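The second allocating unit includes:
a load obtaining subunit, configured to obtain the load of the streaming computing center server clusters and the streaming computing unit server clusters in real time; and
a first allocating subunit, configured to allocate, according to the load of each streaming computing center server cluster, the unfinished tasks of the streaming computing task to the streaming computing center server cluster with the lowest current load.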
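The streaming computing center server clusters have central storage clusters; the central storage clusters of the center server clusters synchronize intermediate state data and intermediate result data with one another, and each streaming computing unit server cluster synchronizes intermediate state data and intermediate result data to the central storage cluster of each center server cluster. The control server further includes:
a storage unit, configured to store the execution state and configuration information of each streaming computing task in a control database, where the execution state indicates the part of each streaming computing task that has already been executed on the corresponding center or unit server cluster, and the configuration information indicates the correspondence between each streaming computing task and the center server cluster executing it, or between each streaming computing task and the unit server cluster executing it;
the first allocating subunit includes:
a calculating subunit, configured to calculate the unfinished tasks of the streaming computing task according to the execution state and configuration information stored in the control database; and
a second allocating subunit, configured to allocate the unfinished tasks to the streaming computing center server cluster with the lowest current load.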
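The present application further provides a streaming computing center server cluster. The streaming computing center server cluster reserves preset computing resources and is connected to a control server, and the control server is also connected to streaming computing unit server clusters. The streaming computing center server cluster has a central storage cluster; central storage clusters synchronize intermediate state data and intermediate result data with one another, and the unit storage clusters of the unit server clusters synchronize intermediate state data and intermediate result data to the central storage clusters. The streaming computing center server cluster includes:
a data obtaining unit, configured to, in response to unfinished tasks of a streaming computing task that the control server reallocates when another center server cluster or unit server cluster in the streaming computing system becomes abnormal, obtain from the central storage cluster the intermediate state data and intermediate result data required to execute the unfinished tasks; and
a task executing unit, configured to execute the unfinished tasks by using the preset computing resources, the intermediate state data, and the intermediate result data.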
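The streaming computing center server cluster further includes:
a feedback unit, configured to periodically return a heartbeat response to the control server in response to the control server periodically sending a heartbeat message, where the heartbeat message is used to detect whether the control server and the current streaming computing center server cluster can communicate.
The streaming computing center server cluster further includes:
a detecting unit, configured to detect whether the number of consecutive failures to send the heartbeat response to the control server exceeds a preset count threshold; and
a stopping unit, configured to stop executing the unfinished tasks if the result of the detecting unit is yes.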
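The present application further provides a streaming computing system, which includes streaming computing center server clusters, streaming computing unit server clusters, and a control server; and
a central storage cluster corresponding to each streaming computing center server cluster, a control database corresponding to the control server, and a unit storage cluster corresponding to each streaming computing unit server cluster.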
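The present application further provides an off-site multi-active system, which includes a first streaming computing center server cluster, multiple streaming computing unit server clusters, and a control server, where the first streaming computing center server cluster is the streaming computing center server cluster described above and the control server is the control server described above; and
the multiple streaming computing unit server clusters are deployed in multiple second geographic locations, respectively, and the first streaming computing center server cluster is deployed in a first geographic location, the second geographic locations being different from the first geographic location. The off-site multi-active system further includes a second streaming computing center server cluster, which is deployed in a first geographic location different from that of the first streaming computing center server cluster.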
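The present application further provides an off-site multi-active system, including:
a first streaming computing center server, at least configured to provide computing resources externally, where the first streaming computing center server includes a first central storage unit; and
a second streaming computing center server, at least configured to provide computing resources externally, where the second streaming computing center server includes a second central storage unit;
where the first and second streaming computing center servers perform load balancing based on a unified load balancing policy, and the first and second central storage units serve as hot standbys for each other; and
where, for a first streaming computing task running on the first streaming computing center server, when the first streaming computing center server fails and cannot provide computing resources externally, the task is terminated on the first streaming computing center server and, based on the intermediate state data and intermediate result data in the second central storage unit of the second streaming computing center server, continues to run on the second streaming computing center server.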
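Compared with the prior art, the present application has the following advantages:
In the embodiments of the present application, one control server uniformly allocates the tasks executed by the streaming computing center server clusters and streaming computing unit server clusters deployed in multiple locations, achieving unified scheduling and allocation of streaming computing tasks. By synchronizing data in real time between the central storage clusters, center or unit server clusters deployed in multiple locations can simultaneously compute parts of the same streaming computing task, or different streaming computing tasks. When the center or unit server cluster at one location becomes abnormal, the streaming computing task being executed can be quickly recovered on a center server cluster at another location. System resources therefore do not sit idle in normal operation, and the streaming computing tasks are multi-active across locations: even when an abnormal condition occurs locally, the task can be recovered quickly at another location, giving the streaming computing service high availability.
Of course, a product implementing the present application does not necessarily need to achieve all of the above advantages at the same time.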
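Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is an architecture diagram of a scenario in which the present application is applied in practice;
FIG. 2 is a flowchart of an embodiment of the method for allocating streaming computing tasks of the present application;
FIG. 3 is a flowchart of an embodiment of the method for executing streaming computing tasks of the present application;
FIG. 4 is a method flowchart of a specific example of the present application;
FIG. 5 is a structural block diagram of an embodiment of the control server of the present application;
FIG. 6 is a structural block diagram of an embodiment of the streaming computing center server cluster of the present application.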
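Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present application.
To help a person skilled in the art better understand the technical terms used in the present application, the terms are explained below.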
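A server cluster is one or more servers grouped together to provide the same service, so that to a client they appear to be a single server. A server cluster can use multiple computers for parallel computation to achieve high computing speed, and can also use multiple computers as backups, so that the cluster as a whole keeps running normally even if any one computer fails.
A streaming computing center server cluster is a server cluster used to execute streaming computing tasks. Such a cluster must reserve preset computing resources, and it stores the intermediate result data and intermediate state data produced while executing streaming computing tasks in a central storage cluster.
A streaming computing unit server cluster is likewise a server cluster used to execute streaming computing tasks, and it stores the intermediate result data and intermediate state data produced while executing streaming computing tasks in a unit storage cluster. Unlike a center server cluster, it need not reserve preset computing resources.
A storage cluster aggregates the storage space of one or more storage devices into a storage pool that offers a server cluster a unified access interface and management interface. Through this interface, the server cluster can transparently access and use the disks on all storage devices, so a storage cluster can make full use of storage device performance and disk capacity.
A central storage cluster is a storage cluster that provides storage space for a streaming computing center server cluster; a unit storage cluster is a storage cluster that provides storage space for a streaming computing unit server cluster.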
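FIG. 1 shows the scenario architecture in which the method for allocating streaming computing tasks of the present application is applied in practice. The streaming computing system shown in FIG. 1 may be configured with one control server 101, m streaming computing center server clusters 102, and n streaming computing unit server clusters 103, where m and n are each integers greater than 1. Preferably, two streaming computing center server clusters 102 are configured. The control server 101 can allocate streaming computing tasks to the center server clusters 102 and the unit server clusters 103. Each center server cluster 102 may reserve part of its computing resources, whereas the unit server clusters 103 need not reserve any. On this basis, when a center server cluster 102 or unit server cluster 103 in the system becomes abnormal, the control server 101 can detect the abnormality and reallocate the unfinished tasks of the abnormal cluster to another, normal candidate center server cluster 102. Note that because the unit server clusters 103 do not reserve computing resources, the control server 101 selects only normal center server clusters 102, and never a unit server cluster 103, as the candidate when reallocating unfinished tasks.
In addition, in FIG. 1, to ensure that execution stays in step when a streaming computing task switches between different center server clusters 102, or from a unit server cluster 103 to a center server cluster 102, the central storage clusters 104 connected to the center server clusters 102 must synchronize intermediate state data and intermediate result data with one another in real time. The unit storage clusters 105 connected to the unit server clusters 103 must synchronize their intermediate state data and intermediate result data to each central storage cluster 104. The unit storage clusters need not synchronize with one another; synchronizing only to the central storage clusters 104 avoids the resources that synchronization among the unit storage clusters 105 would consume. The control server 101 is also connected to a control database, which can store the configuration information produced when the control server 101 allocates tasks and the execution state produced while tasks are executed. The execution state can indicate the part of each streaming computing task that has already been completed on the corresponding center or unit server cluster. The configuration information can indicate the correspondence between each streaming computing task and the center server cluster executing it, or between each streaming computing task and the unit server cluster executing it.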
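The synchronization topology described above (unit storage pushes to every central storage cluster, central storage clusters synchronize with each other, unit storage clusters do not synchronize among themselves) can be sketched as follows. This is only an illustration under stated assumptions: the `StorageCluster` class, both function names, and the last-writer-wins merge are hypothetical and are not specified by the application.

```python
from dataclasses import dataclass, field


@dataclass
class StorageCluster:
    """Hypothetical stand-in for a central or unit storage cluster."""
    name: str
    data: dict = field(default_factory=dict)  # task_id -> intermediate state/result


def sync_unit_to_centers(unit, centers):
    """Push a unit storage cluster's intermediate data to every central cluster.

    Unit clusters never push to other unit clusters, mirroring the text.
    """
    for center in centers:
        center.data.update(unit.data)


def sync_between_centers(centers):
    """Make all central clusters hold the same data (assumed last-writer-wins)."""
    merged = {}
    for center in centers:
        merged.update(center.data)
    for center in centers:
        center.data.update(merged)


unit1 = StorageCluster("unit-1", {"task-1": {"processed": 4000}})
unit2 = StorageCluster("unit-2")
center1 = StorageCluster("center-1", {"task-2": {"processed": 10}})
center2 = StorageCluster("center-2")

sync_unit_to_centers(unit1, [center1, center2])
sync_between_centers([center1, center2])
```

After both calls, either central cluster holds enough data to resume `task-1`, while `unit-2` holds nothing about it.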
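It can be understood that the center server clusters 102 may be deployed in the same first geographic location or, preferably, in different first geographic locations. A first geographic location may be a city, including a municipality directly under the central government, a provincial capital, a prefecture-level city, or a county-level city, for example Beijing, Hangzhou, or Nanjing. For example, one center server cluster is deployed in Hangzhou and another is also deployed in Hangzhou; or one is deployed in Hangzhou and another in Nanjing, Shanghai, or some other location different from Hangzhou. The unit server clusters 103 may also be deployed in different second geographic locations, again including municipalities, provincial capitals, prefecture-level cities, and county-level cities, for example Suzhou, Xiamen, or Shenzhen. Here, a first geographic location denotes where a center server cluster 102 is deployed, and a second geographic location denotes where a unit server cluster is deployed. In practice, wherever the center and unit server clusters are deployed, the control server 101 allocates streaming computing tasks to all of them.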
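Having introduced the application scenario, FIG. 2 shows the flow of an embodiment of a method for allocating streaming computing tasks based on the scenario of FIG. 1. This embodiment is applied on the control server in FIG. 1 and may include the following steps:
Step 201: the control server periodically sends heartbeat messages to the streaming computing center server clusters and the streaming computing unit server clusters, respectively.
In this embodiment, the control server is connected to every center server cluster and every unit server cluster, and a heartbeat feedback mechanism is established between the control server and each center server cluster and between the control server and each unit server cluster. On this basis, the control server periodically sends a heartbeat message to each center server cluster and each unit server cluster. The heartbeat message is used to detect whether the control server can communicate normally with the center server clusters and with the unit server clusters. Whether each cluster returns a heartbeat response normally shows whether that cluster can communicate normally. If it cannot, this usually means that the cluster is in an abnormal condition and can no longer execute tasks normally.
Specifically, if the control server receives the heartbeat response returned by a center or unit server cluster normally, that cluster is considered able to communicate normally with the control server, that is, no abnormal condition has occurred. Otherwise, the cluster is considered unable to communicate normally with the control server, that is, an abnormal condition has occurred. The period for sending heartbeat messages is the heartbeat interval, for example 1 second. A person skilled in the art may of course set the heartbeat interval as needed.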
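Step 202: in response to receiving a streaming computing task, the control server allocates the streaming computing task to a target streaming computing center server cluster or a target streaming computing unit server cluster.
In practice, the control server may be operated by a system administrator. The control server can provide a human-machine interface through which the administrator enters task instructions, and, following those instructions, sends the streaming computing task to the center server cluster or unit server cluster specified by the administrator (that is, the target streaming computing center server cluster or target streaming computing unit server cluster). Other ways of determining the target cluster may also be used in practice. For example, the control server may pick one center server cluster as the target center server cluster by polling or at random, or pick one unit server cluster at random as the target unit server cluster.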
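Optionally, step 203 may be performed between step 202 and step 204:
Step 203: the control server stores the execution state and configuration information of each streaming computing task in the control database.
In this embodiment, optionally, after allocating a streaming computing task the control server may store the task's configuration information in the control database connected to it, for example the correspondence between each streaming computing task and the center server cluster executing it, or between each streaming computing task and the unit server cluster executing it. The control server may also store in the control database the execution state of each streaming computing task on its center or unit server cluster, where the execution state can indicate the part of the task that has already been completed on the corresponding cluster.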
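One way to picture the two kinds of records kept in the control database is the following sketch. It assumes an in-memory dictionary and hypothetical names (`TaskRecord`, `ControlDatabase`, and the item-count representation of progress); the application does not prescribe any schema or storage engine.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    assigned_cluster: str    # configuration information: task -> executing cluster
    total_items: int
    executed_items: int = 0  # execution state: the part already executed


class ControlDatabase:
    """Hypothetical in-memory stand-in for the control database."""

    def __init__(self):
        self._records = {}

    def save(self, record):
        self._records[record.task_id] = record

    def update_progress(self, task_id, executed_items):
        self._records[task_id].executed_items = executed_items

    def tasks_on(self, cluster):
        """Return the tasks currently mapped to the given cluster."""
        return [r for r in self._records.values() if r.assigned_cluster == cluster]


db = ControlDatabase()
db.save(TaskRecord("count-transactions", "unit-1", total_items=10000))
db.update_progress("count-transactions", 4000)
running = db.tasks_on("unit-1")
```

With both pieces of information stored, the control server can later look up which tasks an abnormal cluster was running and how far each had progressed.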
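Step 204: while the target streaming computing center server cluster or target streaming computing unit server cluster executes the streaming computing task, determine whether the target cluster is in an abnormal condition; if so, proceed to step 205; if not, continue to perform this step.
After the control server has allocated the streaming computing task, and while the target cluster is executing it, the control server checks in real time whether its connection to the target cluster is normal. If the connection is normal, the target cluster is not in an abnormal condition. If the connection is not normal, for example because the control server does not receive a heartbeat response from the target cluster within the preset feedback time, the target cluster may be in an abnormal condition.
It can be understood that if the target unit server cluster contains only one streaming computing unit server, an abnormality in that server requires proceeding to step 205. If the target unit server cluster contains multiple unit servers, the connection between the control server and the cluster is lost only when all of its unit servers become abnormal, and only then does this step determine that the entire unit server cluster is abnormal. In practice this happens, for example, when the machine room housing the target unit server cluster loses power or catches fire. Another possibility in practice is that only some of the unit servers in the target cluster become abnormal, for example because a unit server crashes. In that case, the unfinished part of the tasks running on the abnormal unit server is switched to other normal unit servers, so that the tasks executed by the cluster as a whole proceed smoothly and the cluster as a whole remains in normal operation.
Of course, the control server may determine whether the target cluster is abnormal based on whether, after sending the heartbeat message in step 201, it receives a heartbeat response within the preset feedback time. For example, if no heartbeat response is received from the target cluster for one continuous minute, the target cluster is determined to be abnormal and the flow proceeds to step 205. If a heartbeat response is received from the target cluster within one minute, the target cluster is determined not to be abnormal, and step 204 continues to be performed for real-time determination.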
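The timeout check on the control server side can be sketched as below. The class name `HeartbeatMonitor`, the injectable clock (used so the example runs without waiting), and the cluster identifiers are illustrative assumptions; only the 60-second feedback time comes from the one-minute example in the text.

```python
import time


class HeartbeatMonitor:
    """Hypothetical tracker of the last heartbeat response per cluster."""

    def __init__(self, feedback_timeout_s=60.0, clock=time.monotonic):
        self.feedback_timeout_s = feedback_timeout_s
        self._clock = clock
        self._last_response = {}

    def register(self, cluster_id):
        # Start the timer when the cluster is first monitored.
        self._last_response[cluster_id] = self._clock()

    def record_response(self, cluster_id):
        # Called whenever a heartbeat response arrives from the cluster.
        self._last_response[cluster_id] = self._clock()

    def abnormal_clusters(self):
        """Clusters with no heartbeat response within the preset feedback time."""
        now = self._clock()
        return [cid for cid, seen in self._last_response.items()
                if now - seen > self.feedback_timeout_s]


# Simulated clock so the example is deterministic.
fake_now = [0.0]
monitor = HeartbeatMonitor(60.0, clock=lambda: fake_now[0])
monitor.register("unit-1")
monitor.register("center-1")

fake_now[0] = 30.0
monitor.record_response("center-1")  # center-1 answers; unit-1 stays silent

fake_now[0] = 61.0
suspects = monitor.abnormal_clusters()
```

Here `unit-1` has been silent for 61 seconds and is reported, while `center-1` answered 31 seconds ago and is not.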
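It can be understood that when a center or unit server cluster becomes abnormal, the control server may alert the system administrator. Once the administrator confirms that the cluster really is in an abnormal condition, for example a network or power outage, repair work can be carried out. After the abnormal cluster has been repaired successfully, it can again be allocated streaming computing tasks as a normal center or unit server cluster.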
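Step 205: allocate the unfinished tasks of the streaming computing task to a candidate streaming computing center server cluster.
In this step, the unfinished tasks may be the tasks of the streaming computing task that remain apart from those the target center or unit server cluster has already executed.
Specifically, so that the unfinished tasks can be executed quickly, they may be allocated to the center server cluster with the lowest current load for continued execution. Correspondingly, step 205 may include:
Step A1: the control server obtains the load of the multiple streaming computing center server clusters in real time.
In step A1, the control server can obtain the load of each center server cluster and each unit server cluster in real time. The load may be expressed through hardware parameter values such as CPU utilization, memory read speed, and disk input/output (I/O) performance. These values determine the load of each center and unit server cluster, so that when a task later needs to be reallocated it can be given to a center or unit server cluster with a lower load.
It can be understood that, in practice, unit server clusters do not need to reserve computing resources, whereas center server clusters do. Assuming there are N center server clusters, where N is an integer greater than 1, the reserved computing resources may be "N*10%". This helps ensure that, when another center or unit server cluster becomes abnormal, a normal center server cluster has enough computing resources to execute the tasks the control server reallocates to it. The computing resources may be hardware resources such as CPU, memory, and disk. For example, while executing the tasks allocated by the control server, a center server cluster may always keep 20% of its computing resources free, and this free 20% can be used to execute the unfinished tasks of other center or unit server clusters.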
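The reservation rule can be worked through numerically. The sketch below assumes that "N*10%" means each center cluster keeps a fraction of N × 10% of its own capacity free, which matches the 20% figure for two center clusters; the function names and the cap at 100% are illustrative additions, not part of the application.

```python
def reserved_fraction(num_center_clusters, step=0.10):
    """Fraction of a center cluster's resources kept free, read as N * 10%."""
    if num_center_clusters <= 1:
        raise ValueError("the text assumes N is an integer greater than 1")
    # The cap at 1.0 is an added safeguard for large N (an assumption).
    return min(1.0, num_center_clusters * step)


def usable_capacity(total_units, num_center_clusters):
    """Capacity left for normally allocated tasks after the reservation."""
    return total_units * (1.0 - reserved_fraction(num_center_clusters))


fraction = reserved_fraction(2)      # two center clusters, as in the preferred setup
capacity = usable_capacity(100, 2)   # e.g. 100 CPU cores per cluster
```

For N = 2 this gives a 20% reservation, leaving about 80 of 100 capacity units for regular work.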
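Step A2: the control server allocates the unfinished tasks of the streaming computing task to the streaming computing center server cluster with the lowest current load.
The control server then allocates the unfinished tasks to the center server cluster that, according to the loads obtained in step A1, has the lowest current load.
Specifically, based on the execution state and configuration information from step 203, step A2 may include:
Step A21: the control server calculates the unfinished tasks of the streaming computing task according to the execution state and configuration information stored in the control database.
When a target center or unit server cluster becomes abnormal, the control server can use the configuration information to determine which streaming computing task it was executing, use the execution state to determine the part of that task that has already been completed, and from these calculate the unfinished tasks of the streaming computing task.
Step A22: the control server allocates the unfinished tasks to the streaming computing center server cluster with the lowest current load.
The control server then reallocates the unfinished tasks to the center server cluster with the lowest current load for execution.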
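Steps A1 to A22 can be sketched together as follows. The sketch assumes load is a single number per center cluster (for instance CPU utilization) and that progress is counted in items; the function names and the dictionary layout are hypothetical.

```python
def unfinished_items(total_items, executed_items):
    """Step A21: the part of the task not yet executed."""
    return max(0, total_items - executed_items)


def pick_least_loaded(center_loads):
    """Steps A1/A2: choose the center cluster with the lowest current load."""
    if not center_loads:
        raise ValueError("no candidate center cluster available")
    return min(center_loads, key=center_loads.get)


def reassign(task, center_loads, failed_cluster):
    """Step A22: map the unfinished part onto the least-loaded normal center cluster.

    Only center clusters are candidates, because unit clusters reserve no resources.
    """
    candidates = {c: load for c, load in center_loads.items() if c != failed_cluster}
    return {
        "task_id": task["task_id"],
        "cluster": pick_least_loaded(candidates),
        "remaining": unfinished_items(task["total"], task["executed"]),
    }


task = {"task_id": "count-transactions", "total": 10000, "executed": 4000}
loads = {"center-1": 0.40, "center-2": 0.60}
plan = reassign(task, loads, failed_cluster="unit-1")
```

Excluding `failed_cluster` from the candidates also covers the case where the abnormal cluster is itself a center cluster.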
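It can be understood that after step 205 has reallocated the unfinished tasks, the flow may return to step 202, where the control server goes on to allocate the streaming computing tasks it currently receives.
In this embodiment, one control server uniformly allocates the streaming computing tasks executed by the center and unit server clusters deployed in multiple locations, achieving unified scheduling and allocation of streaming computing tasks. By synchronizing data in real time between the central storage clusters, center or unit server clusters deployed in multiple locations can simultaneously compute different parts of the same streaming computing task, or different streaming computing tasks. When a center or unit server cluster becomes abnormal, the streaming computing task being executed can be quickly recovered on a center server cluster at another location. System resources therefore do not sit idle in normal operation, and tasks can be recovered quickly from a remote center server cluster under abnormal conditions, giving the streaming computing service high availability.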
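FIG. 3 shows a flowchart of an embodiment of a method for executing streaming computing tasks of the present application. The method is applied on any current streaming computing center server cluster shown in FIG. 1. The streaming computing system may include multiple center server clusters, multiple unit server clusters, and a control server. The center server clusters have central storage clusters; the central storage clusters synchronize intermediate state data and intermediate result data with one another, and each unit server cluster synchronizes intermediate state data and intermediate result data to the central storage cluster of each center server cluster. Specifically, this embodiment may include:
Step 301: in response to unfinished tasks of a streaming computing task that the control server reallocates when another center server cluster or unit server cluster in the streaming computing system becomes abnormal, the current streaming computing center server cluster obtains, from its connected central storage cluster, the intermediate state data and intermediate result data required to execute the unfinished tasks.
In this embodiment, suppose the control server detects that another center or unit server cluster has become abnormal. Following the embodiment shown in FIG. 2, it then reallocates the tasks that the abnormal cluster was executing to a center server cluster. The current center server cluster then obtains, from its connected storage cluster, the intermediate state data and intermediate result data required to execute the unfinished tasks. The intermediate state data may be the task state produced by the abnormal cluster while executing the streaming computing task before the abnormality occurred, for example which parts of the task have already been executed. The intermediate result data may be the result data produced by the part of the task already executed. On this basis, the current center server cluster does not need to repeat the parts of the streaming computing task that have already been executed; it only needs to execute the unfinished part according to the intermediate state data and intermediate result data.
Step 302: the current streaming computing center server cluster executes the unfinished tasks by using the intermediate state data and intermediate result data.
The current center server cluster then uses the intermediate state data and intermediate result data to execute the reallocated unfinished tasks.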
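Resuming from intermediate state rather than recomputing can be sketched as follows. The sketch assumes the task is a running total over an ordered list of records, that the intermediate state is the count of records already processed, and that the intermediate result is the partial total; the `process` function and both dictionary layouts are illustrative only.

```python
def process(records, state=None, result=None, stop_at=None):
    """Process records starting from the saved offset.

    state  -- intermediate state data: how many records were already processed
    result -- intermediate result data: the partial total so far
    stop_at simulates the original cluster failing part-way through.
    """
    state = dict(state or {"processed": 0})
    result = dict(result or {"total": 0})
    end = len(records) if stop_at is None else stop_at
    for i in range(state["processed"], end):
        result["total"] += records[i]
        state["processed"] = i + 1
    return state, result


records = list(range(1, 11))

# The original cluster handles the first 4 records, then becomes abnormal.
saved_state, saved_result = process(records, stop_at=4)

# The current center cluster resumes from the synchronized state and result.
final_state, final_result = process(records, saved_state, saved_result)
```

The resumed run touches only records 5 to 10 and still produces the same total as an uninterrupted run.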
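After step 302, the method may further include:
Step 303: in response to the control server periodically sending a heartbeat message, the current streaming computing center server cluster periodically returns a heartbeat response to the control server.
Where a heartbeat mechanism is established between the control server and the center server cluster, if the control server periodically sends the current center server cluster a heartbeat message, which is used to detect whether the two can communicate, the current center server cluster can periodically return a heartbeat response to the control server.
After step 303, the method may further include:
Step 304: the current streaming computing center server cluster detects whether the number of consecutive failures to return the heartbeat response to the control server exceeds a preset count threshold and, if so, stops executing the streaming computing task.
The current center server cluster can also check in real time whether the heartbeat mechanism between itself and the control server is working normally, for example by detecting whether the number of consecutive failures to return the heartbeat response exceeds a preset count threshold, such as 10 consecutive failures. If so, the current center server cluster is abnormal and can stop executing the streaming computing task. If not, the current center server cluster is normal and can continue with step 303, periodically returning heartbeat responses to the control server.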
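The cluster-side check in step 304 can be sketched as a counter of consecutive failures. The class name `HeartbeatResponder`, the `send` callable, and the strict "greater than" comparison against the threshold are assumptions; the text only says the count must exceed a preset threshold and gives 10 as an example.

```python
class HeartbeatResponder:
    """Hypothetical cluster-side logic for returning heartbeat responses."""

    def __init__(self, send, max_consecutive_failures=10):
        self._send = send  # callable returning True when the response was delivered
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = 0
        self.running = True  # whether the cluster keeps executing its task

    def respond(self):
        """Return one heartbeat response and update the failure counter."""
        try:
            delivered = bool(self._send())
        except OSError:
            delivered = False
        if delivered:
            self.failures = 0  # any success resets the consecutive count
        else:
            self.failures += 1
            if self.failures > self.max_consecutive_failures:
                self.running = False  # stop executing the task
        return self.running


# Simulated delivery outcomes: two failures, one success, then failures only.
outcomes = iter([False, False, True] + [False] * 4)
responder = HeartbeatResponder(lambda: next(outcomes), max_consecutive_failures=3)
history = [responder.respond() for _ in range(7)]
```

Stopping on its own side keeps a cut-off cluster from continuing work that the control server has already reassigned elsewhere.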
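As can be seen, in the embodiments of the present application one control server uniformly allocates the tasks executed by the center and unit server clusters deployed in multiple locations, achieving unified scheduling and allocation of streaming computing tasks. By synchronizing data in real time between the central storage clusters, center or unit server clusters deployed in multiple locations can simultaneously compute different parts of the same streaming computing task, or different streaming computing tasks. When a center or unit server cluster becomes abnormal, the streaming computing task being executed can be quickly recovered on a center server cluster at another location. System resources therefore do not sit idle in normal operation, and tasks can be recovered quickly under abnormal conditions, giving the streaming computing service high availability.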
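To give a person skilled in the art a clearer understanding of how the present application is implemented, a specific example is described in detail below. The example may include the following steps:
Step 401: the control server sends heartbeat messages to streaming computing center server clusters 1 and 2 and to streaming computing unit server clusters 1 and 2.
In this example, suppose there are two center server clusters, center server cluster 1 and center server cluster 2, and two unit server clusters, unit server cluster 1 and unit server cluster 2. The control server sends heartbeat messages to each center and unit server cluster at a heartbeat interval of 1 second. Center server clusters 1 and 2 may both be deployed at different sites within Hangzhou, or they may be deployed in different cities. Unit server cluster 1 is deployed in Hangzhou and unit server cluster 2 in Nanjing.
Step 402: center server clusters 1 and 2 and unit server clusters 1 and 2 each return a heartbeat response to the control server.
Step 403: the control server allocates the streaming computing task to unit server cluster 1 for execution.
The system administrator triggers a streaming computing task on the control server, for example counting the transaction volume of Hangzhou on August 15, 2016, and assigns it to unit server cluster 1 deployed in Hangzhou. Following the administrator's instruction, the control server allocates the transaction-counting task to unit server cluster 1 and triggers it to start counting. In this example, center server cluster 1 has its own central storage cluster 1, center server cluster 2 has its own central storage cluster 2, unit server cluster 1 has its own unit storage cluster 1, and unit server cluster 2 has its own unit storage cluster 2. In practice, unit storage clusters 1 and 2 do not need to synchronize intermediate state data and intermediate result data with each other; each only needs to synchronize its own intermediate state data and intermediate result data to central storage clusters 1 and 2, while central storage clusters 1 and 2 also synchronize intermediate state data and intermediate result data with each other.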
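Specifically, while counting the transaction volume, unit server cluster 1 can obtain from a data source the source data needed for the count, for example order information whose IP address is in Hangzhou, and count the transaction volume from that source data. The local data sources at each location may all be synchronized to the central data source corresponding to the center server clusters, and the center server clusters and the unit server clusters at each location may all pull source data from the central data source.
Step 404: while unit server cluster 1 executes the streaming computing task, unit storage cluster 1, which is connected to unit server cluster 1, synchronizes the intermediate state data and intermediate result data produced during execution to central storage cluster 1 and central storage cluster 2; at the same time, the control server stores the execution state and configuration information of the streaming computing task in the control database.
While unit server cluster 1 executes the task, the intermediate state data and intermediate result data it produces in real time are stored in unit storage cluster 1, and unit storage cluster 1 synchronizes them in real time to central storage cluster 1 and central storage cluster 2. Meanwhile, the control server can obtain the execution state of the task in real time and store in the control database both the execution state and the configuration information recording that the task was allocated to unit server cluster 1. For example, the execution state may show that, at a given moment, the unit server cluster has obtained 10,000 source data records in total, has counted 4,000 of them, and has not yet counted the other 6,000. The execution state may of course be represented in other ways.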
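Step 405: unit server cluster 1 detects whether the number of consecutive failures to return the heartbeat response to the control server exceeds the preset count threshold; if so, the unit server cluster stops executing the streaming computing task; if not, step 405 is performed again.
While executing the task, unit server cluster 1 also checks in real time whether its heartbeat responses to the control server fail. If they do, it counts the consecutive failures. If the number of consecutive failures exceeds the preset count threshold, for example 10, unit server cluster 1 can no longer communicate normally with the control server. In that case unit server cluster 1 may be in an abnormal condition such as a network or power outage, and it exits the transaction-counting flow.
Step 406: the control server determines whether unit server cluster 1 has returned a heartbeat response within the preset feedback time; if not, proceed to step 407; if so, continue to perform step 406.
The control server also determines in real time whether unit server cluster 1 returns a heartbeat response within the preset feedback time, for example 1 minute. If no heartbeat response is received from unit server cluster 1, the unit server cluster can no longer execute tasks normally. Otherwise, the control server simply continues monitoring heartbeat responses by performing this step.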
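Step 407: the control server obtains the load of each center server cluster in real time and determines the unfinished tasks of the streaming computing task according to the execution state and configuration information.
The control server can also obtain the load of center server clusters 1 and 2 in real time and determine that the CPU utilization of center server cluster 1 is 40% while that of center server cluster 2 is 60%, so center server cluster 1 has the lower load. At the same time, according to the execution state and configuration information stored in the control database, the control server determines that 40% of the transaction-counting task has been executed and that 6,000 source data records remain to be counted.
Step 408: the control server allocates the unfinished tasks to the center server cluster with the lowest current load for execution.
Step 409: center server cluster 1 continues to execute the unfinished tasks according to the intermediate state data and intermediate result data synchronized to central storage cluster 1.
The control server therefore allocates the remaining 60% of the task to center server cluster 1. Because the intermediate state data and intermediate result data stored in central storage cluster 1 are synchronized in real time from unit storage clusters 1 and 2, center server cluster 1 can obtain the intermediate state data and intermediate result data of the transaction-counting task directly from central storage cluster 1 and continue with the remaining 60% of the task, without repeating the 40% that has already been executed.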
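For simplicity of description, the foregoing method embodiments are each presented as a series of combined actions. A person skilled in the art should understand, however, that the present application is not limited by the order of actions described, because according to the present application some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.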
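Corresponding to the above embodiment of the method for allocating streaming computing tasks, and referring to FIG. 5, the present application further provides an embodiment of a control server. The control server is connected to multiple streaming computing center server clusters and multiple streaming computing unit server clusters, and the center server clusters reserve a preset proportion of computing resources. In this embodiment, the control server may include:
a first allocating unit 501, configured to, in response to receiving a streaming computing task, allocate the streaming computing task to a target streaming computing center server cluster or a target streaming computing unit server cluster;
a determining unit 502, configured to determine, while the target center server cluster or target unit server cluster executes the streaming computing task, whether the target cluster is in an abnormal condition; and
a second allocating unit 503, configured to allocate the unfinished tasks of the streaming computing task to a candidate streaming computing center server cluster if the result of the determining unit is yes, where the unfinished tasks are the tasks of the streaming computing task that remain apart from those the target center or unit server cluster has already executed.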
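The second allocating unit 503 may specifically include:
a load obtaining subunit, configured to obtain the load of the multiple center server clusters and multiple unit server clusters in real time; and
a first allocating subunit, configured to allocate, according to the load of each center server cluster, the unfinished tasks of the streaming computing task to the center server cluster with the lowest current load.
The control server may further include:
a sending unit, configured to periodically send heartbeat messages to the center server clusters and the unit server clusters, respectively, where the heartbeat messages are used to detect whether the control server can communicate with the center server clusters and whether the control server can communicate with the unit server clusters;
correspondingly, the determining unit 502 is specifically configured to determine whether the target center server cluster or target unit server cluster has failed to return a heartbeat response within the preset feedback time.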
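The center server clusters have storage clusters; the storage clusters of the center server clusters synchronize intermediate state data and intermediate result data with one another, and each unit server cluster synchronizes intermediate state data and intermediate result data to each central storage cluster. The control server may further include:
a storage unit, configured to store the execution state and configuration information of each streaming computing task in the control database, where the execution state indicates the part of each streaming computing task that has already been executed on the corresponding center or unit server cluster, and the configuration information indicates the correspondence between each streaming computing task and the center server cluster executing it, or between each streaming computing task and the unit server cluster executing it;
correspondingly, the first allocating subunit may specifically include:
a calculating subunit, configured to calculate the unfinished tasks of the streaming computing task according to the execution state and configuration information stored in the control database; and
a second allocating subunit, configured to allocate the unfinished tasks to the center server cluster with the lowest current load.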
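The control server of this embodiment uniformly allocates the tasks executed by the center and unit server clusters deployed in multiple locations, achieving unified scheduling and allocation of streaming computing tasks. By synchronizing data in real time between the central storage clusters, center or unit server clusters deployed in multiple locations can simultaneously compute different parts of the same streaming computing task, or different streaming computing tasks. When a center or unit server cluster becomes abnormal, the streaming computing task being executed can be quickly recovered on a center server cluster at another location. System resources therefore do not sit idle in normal operation, and tasks can be recovered quickly under abnormal conditions, giving the streaming computing service high availability.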
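Corresponding to the above embodiment of the method for executing streaming computing tasks, and referring to FIG. 6, the present application further provides an embodiment of a streaming computing center server cluster. In this embodiment, the streaming computing system contains multiple center server clusters, each of which reserves preset computing resources. The center server clusters are each connected to the control server, and the control server is also connected to multiple unit server clusters. Each center server cluster has a central storage cluster; the central storage clusters synchronize intermediate state data and intermediate result data with one another, and the unit storage cluster of each unit server cluster synchronizes intermediate state data and intermediate result data to the storage cluster of each center server cluster. The center server cluster may include:
a data obtaining unit 601, configured to, in response to unfinished tasks of a streaming computing task that the control server reallocates when another center server cluster or unit server cluster in the streaming computing system becomes abnormal, obtain from the central storage cluster the intermediate state data and intermediate result data required to execute the unfinished tasks; and
a task executing unit 602, configured to execute the unfinished tasks by using the preset computing resources, the intermediate state data, and the intermediate result data.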
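The center server cluster may further include:
a feedback unit, configured to periodically return a heartbeat response to the control server in response to the control server periodically sending a heartbeat message, where the heartbeat message is used to detect whether the control server and the current center server cluster can communicate.
The center server cluster may further include:
a detecting unit, configured to detect whether the number of consecutive failures to send the heartbeat response to the control server exceeds a preset count threshold; and a stopping unit, configured to stop executing the unfinished tasks if the result of the detecting unit is yes.
The center server cluster of this embodiment can receive and execute the streaming computing tasks uniformly allocated by the control server. By synchronizing data in real time between the central storage clusters, center or unit server clusters deployed in multiple locations can simultaneously compute different parts of the same streaming computing task, or different streaming computing tasks. When a center or unit server cluster becomes abnormal, the streaming computing task being executed can be quickly recovered on a center server cluster at another location. System resources therefore do not sit idle in normal operation, and tasks can be recovered quickly under abnormal conditions, giving the streaming computing service high availability.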
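An embodiment of the present application further provides a system for allocating and executing streaming computing tasks. The system may include the control server shown in FIG. 5, multiple center server clusters as shown in FIG. 6, and multiple unit server clusters. Each center server cluster has its own central storage cluster, each unit server cluster has its own unit storage cluster, and the control server has its own control database. For a structural block diagram of the system, see FIG. 1; for details not covered here, see the detailed description of the foregoing embodiments, which is not repeated.
An embodiment of the present application further provides an off-site multi-active system, which includes a first streaming computing center server cluster, a second streaming computing center server cluster, multiple streaming computing unit server clusters, and a control server. The first and second center server clusters are center server clusters as shown in FIG. 6, and for the control server see FIG. 5. The multiple unit server clusters are deployed in multiple second geographic locations, respectively, and the first and second center server clusters are deployed in the same or in different first geographic locations.
In this embodiment, the center server clusters and the unit server clusters are deployed in first geographic locations and second geographic locations, respectively. When a unit server cluster becomes abnormal, the streaming computing task it was executing can therefore be recovered on the first or second center server cluster at another location, and the unfinished part of the task continues to execute on that remote center server cluster, realizing the off-site multi-active function. In addition, when the first and second center server clusters are deployed in different first geographic locations and one of them becomes abnormal, the streaming computing task being executed by the abnormal cluster can likewise be recovered on the other center server cluster at the other location, where the unfinished part continues to execute. This also realizes the off-site multi-active function.
The present application further provides an off-site multi-active system, which may specifically include: a first streaming computing center server, at least configured to provide computing resources externally and including a first central storage unit; and a second streaming computing center server, at least configured to provide computing resources externally and including a second central storage unit. The first and second streaming computing center servers perform load balancing based on a unified load balancing policy, and the first and second central storage units serve as hot standbys for each other. For a first streaming computing task running on the first streaming computing center server, when the first streaming computing center server fails and cannot provide computing resources externally, the task is terminated on the first streaming computing center server and, based on the intermediate state data and intermediate result data in the second central storage unit of the second streaming computing center server, continues to run on the second streaming computing center server.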
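It should be noted that the embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another. Because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant details, see the corresponding parts of the method embodiments.
Finally, it should also be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it.
The method for allocating streaming computing tasks and control server, the method for executing streaming computing tasks and streaming computing center server cluster, the streaming computing system, and the off-site multi-active system provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. A person of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.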

Claims (13)

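  1. A method for allocating computing tasks, characterized in that the method is applied on a control server connected to streaming computing center server clusters and streaming computing unit server clusters, the streaming computing center server clusters reserving a preset proportion of computing resources; the method includes:
    in response to receiving a streaming computing task, allocating the streaming computing task to a target streaming computing center server cluster or a target streaming computing unit server cluster; and
    while the target streaming computing center server cluster or target streaming computing unit server cluster executes the streaming computing task, determining whether the target streaming computing center server cluster or target streaming computing unit server cluster is in an abnormal condition and, if so, allocating the unfinished tasks of the streaming computing task to a candidate streaming computing center server cluster.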
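  2. The method according to claim 1, characterized by further including:
    the control server periodically sending heartbeat messages to the streaming computing center server clusters and the streaming computing unit server clusters, respectively, the heartbeat messages being used to detect whether the control server can communicate with the streaming computing center server clusters and whether the control server can communicate with the streaming computing unit server clusters;
    correspondingly, determining whether the target streaming computing center server cluster or target streaming computing unit server cluster is in an abnormal condition is specifically:
    determining whether the target streaming computing center server cluster or target streaming computing unit server cluster has failed to return a heartbeat response within a preset feedback time.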
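  3. The method according to claim 1, characterized in that allocating the unfinished tasks of the streaming computing task to a candidate streaming computing center server cluster includes:
    the control server obtaining the load of the streaming computing center server clusters in real time; and
    the control server allocating, according to the load, the unfinished tasks of the streaming computing task to the streaming computing center server cluster with the lowest current load.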
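  4. The method according to claim 3, characterized in that the streaming computing center server clusters have central storage clusters, the central storage clusters of the streaming computing center server clusters synchronize intermediate state data and intermediate result data with one another, and each streaming computing unit server cluster synchronizes intermediate state data and intermediate result data to the central storage cluster of each streaming computing center server cluster; the method further includes:
    the control server storing the execution state and configuration information of each streaming computing task in a control database, the execution state indicating the part of each streaming computing task that has already been executed on the corresponding streaming computing center server cluster or streaming computing unit server cluster, and the configuration information indicating the correspondence between each streaming computing task and the streaming computing center server cluster executing it, or between each streaming computing task and the streaming computing unit server cluster executing it;
    correspondingly, allocating the unfinished tasks of the streaming computing task to the streaming computing center server cluster with the lowest current load includes:
    the control server calculating the unfinished tasks of the streaming computing task according to the execution state and configuration information stored in the control database; and
    the control server allocating the unfinished tasks to the streaming computing center server cluster with the lowest current load.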
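  5. A method for executing streaming computing tasks, characterized in that the method is applied on any current streaming computing center server cluster, in a streaming computing system, that reserves preset computing resources, the streaming computing system including streaming computing center server clusters, streaming computing unit server clusters, and a control server; the streaming computing center server clusters have central storage clusters, the central storage clusters synchronize intermediate state data and intermediate result data with one another, and the unit storage clusters of the streaming computing unit server clusters synchronize intermediate state data and intermediate result data to the central storage clusters; the method includes:
    in response to unfinished tasks of a streaming computing task that the control server reallocates when another streaming computing center server cluster or streaming computing unit server cluster in the streaming computing system becomes abnormal, the current streaming computing center server cluster obtaining, from the central storage cluster, the intermediate state data and intermediate result data required to execute the unfinished tasks; and
    the current streaming computing center server cluster executing the unfinished tasks by using the preset computing resources, the intermediate state data, and the intermediate result data.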
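  6. The method according to claim 5, characterized by further including:
    in response to the control server periodically sending a heartbeat message, the current streaming computing center server cluster periodically returning a heartbeat response to the control server, the heartbeat message being used to detect whether the control server and the current streaming computing center server cluster can communicate.
  7. The method according to claim 6, characterized by further including:
    the current streaming computing center server cluster detecting whether the number of consecutive failures to return the heartbeat response to the control server exceeds a preset count threshold and, if so, stopping execution of the unfinished tasks.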
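  8. A control server, characterized in that the control server is connected to streaming computing center server clusters and streaming computing unit server clusters, the streaming computing center server clusters reserving a preset proportion of computing resources; the control server includes:
    a first allocating unit, configured to, in response to receiving a streaming computing task, allocate the streaming computing task to a target streaming computing center server cluster or a target streaming computing unit server cluster;
    a determining unit, configured to determine, while the target streaming computing center server cluster or target streaming computing unit server cluster executes the streaming computing task, whether the target streaming computing center server cluster or target streaming computing unit server cluster is in an abnormal condition; and
    a second allocating unit, configured to allocate the unfinished tasks of the streaming computing task to a candidate streaming computing center server cluster if the result of the determining unit is yes.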
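  9. A streaming computing center server cluster, characterized in that the streaming computing center server cluster reserves preset computing resources and is connected to a control server, the control server also being connected to streaming computing unit server clusters; the streaming computing center server cluster has a central storage cluster, and central storage clusters synchronize intermediate state data and intermediate result data with one another; the streaming computing unit servers have unit storage clusters, and the unit storage clusters synchronize intermediate state data and intermediate result data to the central storage clusters; the streaming computing center server cluster includes:
    a data obtaining unit, configured to, in response to unfinished tasks of a streaming computing task that the control server reallocates when another streaming computing center server cluster or streaming computing unit server cluster in the streaming computing system becomes abnormal, obtain from the central storage cluster the intermediate state data and intermediate result data required to execute the unfinished tasks; and
    a task executing unit, configured to execute the unfinished tasks by using the preset computing resources, the intermediate state data, and the intermediate result data.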
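  10. A streaming computing system, characterized in that the streaming computing system includes: the streaming computing center server cluster according to claim 9 and streaming computing unit server clusters, and the control server according to claim 8; and
    a central storage cluster corresponding to the streaming computing center server cluster, a control database corresponding to the control server, and a unit storage cluster corresponding to each streaming computing unit server cluster.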
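  11. An off-site multi-active system, characterized in that the off-site multi-active system includes: a first streaming computing center server cluster, multiple streaming computing unit server clusters, and a control server, where the first streaming computing center server cluster is the streaming computing center server cluster according to claim 9 and the control server is the control server according to claim 8;
    and
    the multiple streaming computing unit server clusters are deployed in multiple second geographic locations, respectively, and the first streaming computing center server cluster is deployed in a first geographic location.
  12. The system according to claim 11, characterized in that the off-site multi-active system further includes a second streaming computing center server cluster, the second streaming computing center server cluster being deployed in a first geographic location different from that of the first streaming computing center server cluster.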
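  13. An off-site multi-active system, characterized by including:
    a first streaming computing center server, at least configured to provide computing resources externally, where the first streaming computing center server includes a first central storage unit;
    a second streaming computing center server, at least configured to provide computing resources externally, where the second streaming computing center server includes a second central storage unit;
    where the first streaming computing center server and the second streaming computing center server perform load balancing based on a unified load balancing policy, and the first central storage unit and the second central storage unit serve as hot standbys for each other; and
    where, for a first streaming computing task running on the first streaming computing center server, when the first streaming computing center server fails and cannot provide computing resources externally, the task is terminated on the first streaming computing center server and, based on the intermediate state data and intermediate result data in the second central storage unit of the second streaming computing center server, continues to run on the second streaming computing center server.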
PCT/CN2017/105360 2016-10-18 2017-10-09 Method for allocating streaming computing tasks and control server WO2018072618A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610908946.7A CN107959705B (zh) 2016-10-18 2016-10-18 Method for allocating streaming computing tasks and control server
CN201610908946.7 2016-10-18

Publications (1)

Publication Number Publication Date
WO2018072618A1 true WO2018072618A1 (zh) 2018-04-26

Family

ID=61954266

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/105360 WO2018072618A1 (zh) 2016-10-18 2017-10-09 Method for allocating streaming computing tasks and control server

Country Status (3)

Country Link
CN (1) CN107959705B (zh)
TW (1) TWI755417B (zh)
WO (1) WO2018072618A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090502A (zh) * 2018-10-24 2020-05-01 阿里巴巴集团控股有限公司 一种流数据任务调度方法和装置
CN111124812A (zh) * 2019-12-02 2020-05-08 深圳市智微智能软件开发有限公司 服务器的监测方法及系统
CN112732491A (zh) * 2021-01-22 2021-04-30 中国人民财产保险股份有限公司 数据处理系统、基于数据处理系统的业务数据处理方法
CN113472662A (zh) * 2021-07-09 2021-10-01 武汉绿色网络信息服务有限责任公司 路径重分配方法和网络业务系统
CN114884946A (zh) * 2022-04-28 2022-08-09 抖动科技(深圳)有限公司 基于人工智能的异地多活实现方法及相关设备
CN115242648A (zh) * 2022-07-19 2022-10-25 北京百度网讯科技有限公司 扩缩容判别模型训练方法和算子扩缩容方法
WO2023077451A1 (zh) * 2021-11-05 2023-05-11 中国科学院计算技术研究所 一种基于列存数据库的流式数据处理方法及系统
CN113283803B (zh) * 2021-06-17 2024-04-23 金蝶软件(中国)有限公司 一种物资需求计划的制定方法、相关装置及存储介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
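CN108737270B (zh) * 2018-05-07 2021-01-26 北京京东尚科信息技术有限公司 Resource management method and apparatus for a server cluster
CN109358983A (zh) * 2018-09-04 2019-02-19 深圳市宝德计算机系统有限公司 Server data processing method, apparatus, and storage medium
CN109656782A (zh) * 2018-12-24 2019-04-19 成都四方伟业软件股份有限公司 Visual scheduling monitoring method, apparatus, and server
CN112148439B (zh) * 2019-06-28 2024-03-08 浙江宇视科技有限公司 Task processing method, apparatus, device, and storage medium
CN111092931B (zh) * 2019-11-15 2021-08-06 中国科学院计算技术研究所 Method and system for fast distribution of streaming data for online faster-than-real-time simulation of power systems
CN113190364A (zh) * 2021-04-30 2021-07-30 平安壹钱包电子商务有限公司 Remote call management method, apparatus, computer device, and readable storage medium
CN113391902B (zh) * 2021-06-22 2023-03-31 未鲲(上海)科技服务有限公司 Task scheduling method, device, and storage medium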

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
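CN101483673A (zh) * 2009-02-20 2009-07-15 杭州华三通信技术有限公司 Method and system for implementing remote hot standby
CN102158387A (zh) * 2010-02-12 2011-08-17 华东电网有限公司 Protection fault information processing system based on dynamic load balancing and mutual hot standby
CN103973725A (zh) * 2013-01-28 2014-08-06 阿里巴巴集团控股有限公司 Distributed coordination method and coordinator
CN104683488A (zh) * 2015-03-31 2015-06-03 百度在线网络技术(北京)有限公司 Streaming computing system and scheduling method and apparatus thereof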
US20160239350A1 (en) * 2015-02-12 2016-08-18 Netapp, Inc. Load balancing and fault tolerant service in a distributed data system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6779016B1 (en) * 1999-08-23 2004-08-17 Terraspring, Inc. Extensible computing system
EP2511821B1 (en) * 2005-10-07 2021-06-09 Citrix Systems, Inc. Method and system for accessing a file in a directory structure associated with an application
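TWI476610B (zh) * 2008-04-29 2015-03-11 Maxiscale Inc Peer-to-peer redundant file server system and method
CN103703830B (zh) * 2013-05-31 2017-11-17 华为技术有限公司 Physical resource adjustment method, apparatus, and controller
CN103763378A (zh) * 2014-01-24 2014-04-30 中国联合网络通信集团有限公司 Task processing method, system, and node based on a distributed streaming computing system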

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
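CN111090502A (zh) * 2018-10-24 2020-05-01 阿里巴巴集团控股有限公司 Stream data task scheduling method and apparatus
CN111090502B (zh) * 2018-10-24 2024-05-17 阿里巴巴集团控股有限公司 Stream data task scheduling method and apparatus
CN111124812A (zh) * 2019-12-02 2020-05-08 深圳市智微智能软件开发有限公司 Server monitoring method and system
CN112732491A (zh) * 2021-01-22 2021-04-30 中国人民财产保险股份有限公司 Data processing system and service data processing method based on the data processing system
CN112732491B (zh) * 2021-01-22 2024-03-12 中国人民财产保险股份有限公司 Data processing system and service data processing method based on the data processing system
CN113283803B (zh) * 2021-06-17 2024-04-23 金蝶软件(中国)有限公司 Method for formulating a material requirements plan, related apparatus, and storage medium
CN113472662A (zh) * 2021-07-09 2021-10-01 武汉绿色网络信息服务有限责任公司 Path reallocation method and network service system
WO2023077451A1 (zh) * 2021-11-05 2023-05-11 中国科学院计算技术研究所 Streaming data processing method and system based on a column-store database
CN114884946A (zh) * 2022-04-28 2022-08-09 抖动科技(深圳)有限公司 Artificial-intelligence-based method for implementing multi-site active-active deployment, and related device
CN114884946B (zh) * 2022-04-28 2024-01-16 抖动科技(深圳)有限公司 Artificial-intelligence-based method for implementing multi-site active-active deployment, and related device
CN115242648A (zh) * 2022-07-19 2022-10-25 北京百度网讯科技有限公司 Method for training a scale-out/scale-in discrimination model, and operator scaling method
CN115242648B (zh) * 2022-07-19 2024-05-28 北京百度网讯科技有限公司 Method for training a scale-out/scale-in discrimination model, and operator scaling method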

Also Published As

Publication number Publication date
CN107959705A (zh) 2018-04-24
TW201816616A (zh) 2018-05-01
TWI755417B (zh) 2022-02-21
CN107959705B (zh) 2021-08-20

Similar Documents

Publication Publication Date Title
WO2018072618A1 (zh) Method for allocating streaming computing tasks, and control server
US11307943B2 (en) Disaster recovery deployment method, apparatus, and system
US10609159B2 (en) Providing higher workload resiliency in clustered systems based on health heuristics
WO2017067484A1 (zh) Virtualized data center scheduling system and method
US8862928B2 (en) Techniques for achieving high availability with multi-tenant storage when a partial fault occurs or when more than two complete faults occur
US20170279674A1 (en) Method and apparatus for expanding high-availability server cluster
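TWI701916B (zh) Method and apparatus for self-recovery of management capability in a distributed system
WO2016058307A1 (zh) Resource fault handling method and apparatus
CN105703940A (zh) Monitoring system and monitoring method for multi-level-scheduling distributed parallel computing
CN105337780B (zh) Server node configuration method and physical node
CN105471622A (zh) Galera-based high-availability method and system for active/standby switchover of control nodes
CN105069152B (zh) Data processing method and apparatus
JP2020115330A (ja) System and method for monitoring software application processes
CN112631764A (zh) Task scheduling method, apparatus, computer device, and computer-readable medium
CN104158707A (zh) Method and apparatus for detecting and handling cluster split-brain
CN104484228B (zh) Distributed parallel task processing system based on Intelli‑DSC
CN114338670B (zh) Edge cloud platform and three-level cloud control platform for connected traffic comprising the same
CN104123183A (zh) Cluster job scheduling method and apparatus
CN101442437A (zh) Method, system, and device for achieving high availability
CN111200518B (zh) Decentralized HPC computing cluster management method and system based on the paxos algorithm
JPH09293059A (ja) Distributed system and operation management method thereof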
US10001939B1 (en) Method and apparatus for highly available storage management using storage providers
CN103973811A (zh) High-availability cluster management method supporting dynamic migration
CN116055314A (zh) Configuration synchronization method and apparatus
Li et al. Design and implementation of high availability distributed system based on multi-level heartbeat protocol

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17861368

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17861368

Country of ref document: EP

Kind code of ref document: A1