WO2023231398A1 - Monitoring method and device for distributed processing system - Google Patents

Monitoring method and device for distributed processing system

Info

Publication number
WO2023231398A1
Authority
WO
WIPO (PCT)
Prior art keywords
operator
data
instance
working state
upstream
Prior art date
Application number
PCT/CN2022/142237
Other languages
English (en)
French (fr)
Inventor
张俊鹏
周文明
叶姣荣
Original Assignee
杭州数梦工场科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州数梦工场科技有限公司
Publication of WO2023231398A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Definitions

  • the present disclosure relates to the field of distributed technology, and in particular, to a monitoring method and device for a distributed processing system.
  • the present disclosure provides a monitoring method and device for a distributed processing system.
  • a method for monitoring a distributed processing system is provided, where the distributed processing system includes operators with upstream and downstream relationships;
  • the method includes:
  • the operator corresponds to at least one operator instance, and each operator instance has a corresponding data processing rate;
  • Determining at least one working state of the operator based on the data processing rate corresponding to the operator and/or its upstream operator includes:
  • At least one working state of the first operator is determined according to the error value between the data processing rates corresponding to each operator instance corresponding to the first operator and/or the second operator; wherein the first operator is any one of the operators; the second operator is the upstream operator of the first operator.
  • the data processing rate includes a first data production rate; wherein the first data production rate represents the rate at which an operator produces data; the working state of the first operator includes a first working state;
  • the first working state of the first operator indicates whether the data produced by the upstream operator corresponding to the first operator is evenly distributed
  • Determining at least one working state of the first operator based on the error value between the data processing rates corresponding to each operator instance corresponding to the first operator and/or the second operator includes:
  • the first working state of the first operator is determined.
  • determining the first working state of the first operator according to the first error value includes:
  • determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed includes:
  • determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.
  • the first alarm information is output; wherein the first alarm information is used to prompt an increase in the concurrency of downstream operators.
  • when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, each third operator located above the first operator is determined based on the execution process corresponding to the distributed task;
  • when the first working state corresponding to each third operator indicates that the data produced by its upstream operator is unevenly distributed, the downstream operators of the source operator corresponding to the distributed task, other than the first downstream operator, are determined; wherein the first downstream operator is a downstream operator of the source operator and is a third operator;
  • Alarm operations are performed based on the data consumption rates corresponding to other downstream operators.
  • the upstream operator and downstream operator are connected through channels;
  • the alarm operation is performed based on the data consumption rate corresponding to other downstream operators, including:
  • second alarm information is output; wherein the second alarm information is used to prompt that there is too much data at the source end and to increase the number of downstream operator concurrency degrees.
  • the upstream operator and the downstream operator are connected through a channel; when the first working state corresponding to each third operator indicates that the data produced by the upstream operator is unevenly distributed, the method also includes:
  • a first channel reallocation operation is performed; wherein the first channel reallocation operation indicates controlling the other downstream operators to consume the data in the input buffer corresponding to the first downstream operator; the data in the input buffer is the data produced by the source operator.
  • the method further includes:
  • a speed limit instruction is generated and sent to the source operator, so that the source operator adjusts the first data production rate corresponding to the source operator based on the minimum value.
  • the method further includes:
  • a second channel reallocation operation is performed; wherein the second channel reallocation operation instructs the normal operator instance to consume the data in the input buffer corresponding to the abnormal operator instance; the normal operator instance is an operator instance, among the operator instances corresponding to the first operator, whose first working state indicates that the data produced by the upstream operator is evenly distributed;
  • the abnormal operator instance is an operator instance, among the operator instances corresponding to the first operator, whose first working state indicates that the data produced by the upstream operator is unevenly distributed.
  • the method further includes:
  • the data processing rate includes a data consumption rate; the data consumption rate represents the rate at which an operator consumes data generated by an upstream operator;
  • the working state of the first operator includes a second working state
  • the second working state of the first operator indicates whether the data consumption capability is normal
  • Determining at least one working state of the first operator based on the error value between the data processing rates corresponding to each operator instance corresponding to the first operator and/or the second operator includes:
  • the second working state corresponding to each operator instance corresponding to the first operator is determined.
  • the operator is scheduled to at least one resource group slot; the slot corresponds to the operator instance one-to-one;
  • alarm operations including:
  • the corresponding alarm operation is performed according to the second working state corresponding to each first operator instance, including:
  • the abnormality prompt information of the operator instances is output.
  • the operator instance exception prompt information is output.
  • the operator corresponds to at least one operator instance; the operator instance has a corresponding buffer;
  • the method also includes:
  • the buffer is controlled to stop expansion.
  • the operator corresponds to at least one operator instance; the method further includes:
  • a monitoring device for a distributed processing system is provided, where the distributed processing system includes operators with upstream and downstream relationships;
  • the device includes:
  • a rate acquisition module used to acquire the data processing rate corresponding to the operator for distributed tasks
  • a rate processing module configured to determine at least one working state of the operator based on the data processing rate corresponding to the operator and/or its upstream operator;
  • An alarm module configured to perform corresponding alarm operations in response to any working status of the operator indicating that the operator is abnormal.
  • the operator corresponds to at least one operator instance, and each operator instance has a corresponding data processing rate;
  • the rate processing module is specifically used for:
  • At least one working state of the first operator is determined according to the error value between the data processing rates corresponding to each operator instance corresponding to the first operator and/or the second operator; wherein the first operator is any one of the operators; the second operator is the upstream operator of the first operator.
  • the data processing rate includes a first data production rate; wherein the first data production rate represents the rate at which an operator produces data; the working state of the first operator includes a first working state;
  • the first working state of the first operator indicates whether the data produced by the upstream operator corresponding to the first operator is evenly distributed
  • the rate processing module is specifically used for:
  • the first working state of the first operator is determined.
  • the rate processing module is also used to:
  • the rate processing module is also used to:
  • determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.
  • the alarm module is specifically used for:
  • the first alarm information is output; wherein the first alarm information is used to prompt an increase in the concurrency of downstream operators.
  • the alarm module is specifically used for:
  • when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, each third operator located above the first operator is determined based on the execution process corresponding to the distributed task;
  • when the first working state corresponding to each third operator indicates that the data produced by its upstream operator is unevenly distributed, the downstream operators of the source operator corresponding to the distributed task, other than the first downstream operator, are determined; wherein the first downstream operator is a downstream operator of the source operator and is a third operator;
  • Alarm operations are performed based on the data consumption rates corresponding to other downstream operators.
  • the upstream operator and downstream operator are connected through channels;
  • the alarm module is specifically used for:
  • second alarm information is output; wherein the second alarm information is used to prompt that there is too much data at the source end and to increase the number of downstream operator concurrency degrees.
  • the upstream operator and the downstream operator are connected through a channel; the device also includes a first channel processing module;
  • the first channel processing module is specifically used for:
  • a first channel reallocation operation is performed; wherein the first channel reallocation operation indicates controlling the other downstream operators to consume the data in the input buffer corresponding to the first downstream operator; the data in the input buffer is the data produced by the source operator.
  • the device also includes a speed limiting module
  • the speed limiting module is specifically used for:
  • a speed limit instruction is generated and sent to the source operator, so that the source operator adjusts the first data production rate corresponding to the source operator based on the minimum value.
  • the device further includes a second channel processing module
  • the second channel processing module is specifically used for:
  • after the first working state corresponding to each third operator is obtained, and when the first working states of all third operators indicate that the data produced by the upstream operator is evenly distributed, the second channel reallocation operation is performed; wherein the second channel reallocation operation instructs the normal operator instance to consume the data in the input buffer corresponding to the abnormal operator instance; the normal operator instance is an operator instance, among the operator instances corresponding to the first operator, whose first working state indicates that the data produced by the upstream operator is evenly distributed;
  • the abnormal operator instance is an operator instance, among the operator instances corresponding to the first operator, whose first working state indicates that the data produced by the upstream operator is unevenly distributed.
  • the device further includes a data recording module
  • the data logging module is specifically used for:
  • the data processing rate includes a data consumption rate; the data consumption rate represents the rate at which an operator consumes data generated by an upstream operator; the working state of the first operator includes a second working state;
  • the second working state of the first operator indicates whether the data consumption capability is normal
  • the rate processing module is specifically used for:
  • the second working state corresponding to each operator instance corresponding to the first operator is determined.
  • the operator is scheduled to at least one resource group slot; the slot corresponds to the operator instance one-to-one;
  • the alarm module is specifically used for:
  • the alarm module is also used to:
  • the target task manager fault prompt information is output;
  • the abnormality prompt information of the operator instances is output.
  • the alarm module is also used to:
  • the operator instance exception prompt information is output.
  • the operator corresponds to at least one operator instance; the operator instance has a corresponding buffer;
  • the alarm module is also used to:
  • the buffer is controlled to stop expansion.
  • the operator corresponds to at least one operator instance; the alarm module is also used to:
  • a computer-readable storage medium is provided.
  • Computer-executable instructions are stored in the computer-readable storage medium.
  • When the processor executes the computer-executable instructions, the monitoring method of the distributed processing system described in the first aspect and the various possible designs of the first aspect is implemented.
  • a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the processor executes the program, the above is implemented.
  • a computer program product including a computer program.
  • When the computer program is executed by a processor, it implements the monitoring method of the distributed processing system described in the first aspect and the various possible designs of the first aspect.
  • The distributed processing system includes operators with upstream and downstream relationships. The data processing rate corresponding to each operator involved in a distributed task is obtained. For each operator, whether the operator processes data abnormally is determined based on the data processing rates corresponding to the operator and its upstream operator, thereby determining at least one working state of the operator so that faulty operators are detected in time. When any working state of the operator indicates that the operator is abnormal, the corresponding alarm operation is performed. This ensures timely alarms and realizes intelligent monitoring of the distributed processing system, so that abnormalities are resolved promptly and the robustness and operability of the distributed processing system are improved.
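  • As a rough illustration of the flow summarized above (obtain rates, determine a working state, raise an alarm), the following Python sketch may help. The names `monitor_once`, `is_abnormal`, and `alarm` are hypothetical and introduced here only for illustration; the disclosure does not prescribe this interface.

```python
from typing import Callable, Dict, List, Sequence


def monitor_once(
    rates_by_operator: Dict[str, Sequence[float]],
    is_abnormal: Callable[[Sequence[float]], bool],
    alarm: Callable[[str], None],
) -> List[str]:
    """Run one monitoring pass over all operators of a distributed task.

    rates_by_operator maps an operator identifier to the data processing
    rates of its operator instances. is_abnormal and alarm are hypothetical
    callbacks standing in for the rate processing and alarm modules.
    """
    abnormal: List[str] = []
    for operator_id, rates in rates_by_operator.items():
        # Determine a working state for this operator from its instance rates.
        if is_abnormal(rates):
            # Perform the alarm operation for the abnormal operator.
            alarm(operator_id)
            abnormal.append(operator_id)
    return abnormal


if __name__ == "__main__":
    alerts: List[str] = []
    monitor_once(
        {"map": [5.0, 5.0], "reduce": [1.0, 9.0]},
        lambda r: max(r) - min(r) > 2.0,  # assumed abnormality rule
        alerts.append,
    )
    print(alerts)
```

  • Passing the abnormality rule and the alarm action as callbacks keeps the sketch neutral about how a working state is actually derived, which the later embodiments describe in more detail.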
  • Figure 1 is a flow chart of a monitoring method for a distributed processing system according to an embodiment of the present disclosure.
  • Figure 2 is a schematic diagram of an operator according to an embodiment of the present disclosure.
  • Figure 3 is a schematic diagram of an execution plan according to an embodiment of the present disclosure.
  • Figure 4 is a schematic diagram of another operator according to an embodiment of the present disclosure.
  • Figure 5 is a flow chart of yet another monitoring method for a distributed processing system according to an embodiment of the present disclosure.
  • Figure 6 is a schematic diagram of yet another operator according to an embodiment of the present disclosure.
  • Figure 7 is a flow chart of yet another monitoring method for a distributed processing system according to an embodiment of the present disclosure.
  • Figure 8 is a hardware structure diagram of the electronic equipment where the monitoring device of the distributed processing system is located according to the embodiment of the present disclosure.
  • Figure 9 is a block diagram of a monitoring device of a distributed processing system according to an embodiment of the present disclosure.
  • Although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, first information may also be called second information, and similarly, second information may also be called first information.
  • Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • Figure 1 is a flow chart of a monitoring method for a distributed processing system according to an embodiment of the present disclosure.
  • The distributed processing system includes processing nodes with operators installed on them. There is an upstream and downstream relationship between the operators.
  • the execution subject of this method is the main control server. Specifically, it is a computer device, that is, the processor in the main control server. The method includes the following steps:
  • Step 101: For a distributed task, obtain the data processing rate corresponding to the operator.
  • the distributed processing system is the Flink system.
  • The distributed task is a Flink Job, also known as a Flink job.
  • A Flink job includes an operator chain formed by multiple operators. For two adjacent operators in the operator chain, the one in front (that is, the one above) is called the upstream operator, and the one behind (that is, the one below) is called the downstream operator. Traffic always flows from upstream to downstream, that is, the downstream operator processes the data produced by the upstream operator. For each operator in the operator chain, the data processing rate corresponding to the operator is determined.
  • the data processing rate includes a data consumption rate and/or a first data production rate.
  • the data consumption rate indicates the rate at which an operator consumes data generated by an upstream operator.
  • the first data production rate represents the rate at which the operator produces data.
  • The data consumption rate indicates the rate at which the operator processes the data generated by the upstream operator within a first preset unit time, which reflects the processing capability of the operator.
  • The first data production rate indicates the rate at which the operator produces data within the first preset unit time, which reflects the data produced by the upstream operator.
  • For example, the operators involved in a distributed task (that is, the operator chain) include operator 1 and operator 2.
  • Operator 1 is the upstream operator of operator 2.
  • Operator 1 transmits data (that is, traffic) to operator 2.
  • This data is the data generated by the upstream operator of operator 2.
  • Operator 2 processes the data, for example, filters it, and the filtered data becomes the data produced by operator 2.
  • the upstream operator saves the data produced in the input buffer, the downstream operator consumes the data in the input buffer, and the downstream operator saves the data it produces in the output buffer.
  • the data consumption rate corresponding to the operator indicates the rate at which the operator consumes the data in the input buffer;
  • The first data production rate corresponding to the operator indicates the rate at which the operator fills the output buffer. For example, to calculate the first data production rate, the amount of data added to the output buffer within a certain period of time is obtained and divided by the length of that period.
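  • The calculation described above can be sketched in Python as follows. This is a minimal sketch that assumes the client can read a cumulative count of data written to the output buffer at the start and end of the period; the function name `production_rate` and its parameters are hypothetical.

```python
def production_rate(written_before: float, written_after: float, seconds: float) -> float:
    """Return the first data production rate for one sampling period.

    written_before and written_after are assumed cumulative amounts of data
    added to the output buffer at the start and end of the period; seconds
    is the length of the period.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Amount of data added to the output buffer, divided by the elapsed time.
    return (written_after - written_before) / seconds


if __name__ == "__main__":
    # 3000 units added to the output buffer over 2 seconds.
    print(production_rate(1000, 4000, 2.0))
```

  • The same division applies to the data consumption rate if the counter tracks data read from the input buffer instead.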
  • the data processing rate corresponding to the operator can be collected by the client plug-in installed on the operator. After collecting the data processing rate corresponding to the operator, the client sends it to the main control server.
  • the number of upstream operators of an operator is at least one. Upstream and downstream operators are connected through channels. Each operator includes or corresponds to at least one operator instance, and the operator instance is obtained by instantiating the operator.
  • the data processing rate corresponding to an operator includes the data processing rate corresponding to each operator instance corresponding to the operator.
  • For example, the data processing rate includes the data consumption rate, and the operators include operator D, which corresponds to two operator instances: operator instance d1 and operator instance d2. The data consumption rate corresponding to operator instance d1 is 3m/s, and the data consumption rate corresponding to operator instance d2 is 5m/s. The data consumption rate corresponding to operator D therefore includes the data consumption rate corresponding to operator instance d1 (that is, 3m/s) and the data consumption rate corresponding to operator instance d2 (that is, 5m/s).
  • An operator instance corresponding to the operator is connected through channels to at least one operator instance corresponding to the upstream operator of the operator (that is, an upstream operator instance), and each channel corresponds to an input buffer and an output buffer.
  • For example, operator 2 corresponds to 2 instances of operator 2, operator 1 corresponds to 6 instances of operator 1, and operator 1 is the upstream operator of operator 2. Each instance of operator 2 is connected to 3 instances of operator 1 through three channels, that is, one instance of operator 1 and one instance of operator 2 are connected through one channel. The instance of operator 2 consumes the data in the input buffer that stores the data produced by the instance of operator 1, that is, the data in the input buffer corresponding to the channel.
  • the Flink job architecture consists of three parts: job Client (client), Flink Jobmanager (job manager), and Flink TaskManager (task manager).
  • The job is parsed three times, through the Client and the Jobmanager, into a stream graph, a job graph, and an execution graph.
  • The main control server monitors the Jobmanager. If a new job (that is, new data) is submitted, the main control server obtains the execution graph to query the taskmanager and slots to which the job is scheduled, that is, the addresses of the slots involved in the job, the operator list, the dependencies between operators, the operator concurrency, and so on.
  • In the execution plan of the job (that is, the execution process):
  • vertex represents a vertex on the DAG;
  • Jobvertex is a job vertex;
  • executionvertex is an execution vertex;
  • Resultpartition represents the output of a vertex.
  • Flink will instantiate it according to the concurrency degree of the operator to obtain the operator instance corresponding to the operator.
  • the resultsubpartition i.e., subpartition
  • The downstream inputgate receives the upstream data, and the number of inputgates is controlled by the downstream concurrency.
  • the map operator represents the source operator
  • the reduce operator represents the destination operator.
  • These two operators are used as two vertices on the DAG graph.
  • One vertex can correspond to one or more operators (an operator chain). For convenience of description, this disclosure associates one vertex with one operator.
  • From the job execution process described above, the following can be determined: 1) the task slots of the taskmanager to which the job is assigned; 2) the identifiers of the operators involved in the job and the parent-child dependencies between the operators, that is, the upstream and downstream relationships; 3) the identifier of the slot to which each operator is scheduled, the concurrency of the operator, the channel data between operators, and so on; 4) the inputgate and resultsubpartition corresponding to each operator, which are used to track the corresponding upstream and downstream operators.
  • the concurrency of upstream and downstream operators can be different.
  • Each operator can be scheduled to at least one slot.
  • the slot corresponds to the operator instance one-to-one.
  • an operator is scheduled to 3 slots.
  • each slot corresponds to an operator instance of this operator.
  • The upstream operator and the downstream operator communicate through channels, and the number of channels is determined by the concurrency of the upstream and downstream operators.
  • For example, the upstream operator has a concurrency of 10 and the downstream operator has a concurrency of 2.
  • In this case, every 5 processing threads of the upstream operator send data to one thread of the downstream operator. That is, the upstream operator has a total of 10 operator instances, the downstream operator has 2 operator instances, and every 5 operator instances of the upstream operator communicate with one operator instance of the downstream operator.
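  • The 10-to-2 example can be sketched in Python as follows. This is only an illustration: it assumes that upstream instances are grouped contiguously and that the upstream concurrency is a multiple of the downstream concurrency. Neither assumption is stated in the disclosure, and the function name `assign_channels` is hypothetical.

```python
from typing import Dict, List


def assign_channels(upstream: int, downstream: int) -> Dict[int, List[int]]:
    """Map each downstream instance index to the upstream instances feeding it.

    Assumes contiguous grouping and an upstream concurrency that is an exact
    multiple of the downstream concurrency (illustrative assumptions only).
    """
    if upstream <= 0 or downstream <= 0:
        raise ValueError("concurrency values must be positive")
    if upstream % downstream != 0:
        raise ValueError("upstream concurrency must be a multiple of downstream concurrency")
    group = upstream // downstream
    # One channel per (upstream instance, downstream instance) pair listed here.
    return {d: list(range(d * group, (d + 1) * group)) for d in range(downstream)}


if __name__ == "__main__":
    mapping = assign_channels(10, 2)
    for downstream_index, upstream_indices in mapping.items():
        print(downstream_index, upstream_indices)
```

  • With a concurrency of 10 upstream and 2 downstream, each downstream instance is fed by 5 upstream instances, giving 10 channels in total.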
  • the channels connecting upstream operators and downstream operators are not fixed.
  • each operator needs to record its own upstream and downstream channels.
  • the client will also record the channel status between operators, that is, between operator instances.
  • the above identification includes name, ID and other information.
  • the operator identifier is operator ID.
  • the data processing rate corresponding to the operator may also include a second data production rate.
  • the second data production rate represents the rate at which the upstream operator of the operator produces data, that is, the rate at which the input buffer is filled.
  • the data processing rate corresponding to the operator is the data processing rate corresponding to the operator instance.
  • the data consumption rate corresponding to the operator instance represents the rate at which the operator instance consumes data generated by the upstream operator instance.
  • the first data production rate indicates the rate at which the operator instance produces data.
  • each operator instance corresponds to an input buffer and an output buffer.
  • the total space size of the input buffer and the output buffer can be changed, that is, it can be expanded or reduced.
  • The main control server can obtain the input buffer information and output buffer information corresponding to the operator instance in real time or periodically. The input buffer information includes the current total size, remaining size, and other information of the input buffer; similarly, the output buffer information includes the current total size, remaining size, and other information of the output buffer.
  • The client corresponding to the operator records the input buffer information and output buffer information corresponding to each operator instance of the operator and sends them to the main control server.
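  • One possible shape for the buffer information reported by the client is sketched below in Python. The class name `BufferInfo`, its field names, and the derived `used_fraction` value are hypothetical; the disclosure only states that the total size and remaining size are recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferInfo:
    """Snapshot of one input or output buffer of an operator instance."""

    total_size: int      # current total size of the buffer
    remaining_size: int  # space still free in the buffer

    def used_fraction(self) -> float:
        """Fraction of the buffer currently occupied (hypothetical derived metric)."""
        if self.total_size <= 0:
            raise ValueError("total_size must be positive")
        return (self.total_size - self.remaining_size) / self.total_size


if __name__ == "__main__":
    input_buffer = BufferInfo(total_size=100, remaining_size=25)
    print(input_buffer.used_fraction())
```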
  • the main control server can save it to the target location for use in aggregation processing calculations.
  • the first preset unit time can be seconds.
  • In that case, the data consumption rate represents the amount of data, generated by the upstream operator (that is, the upstream operator instance), that the operator (that is, the operator instance) consumes per second.
  • the first data production rate represents the amount of data produced by the operator per second.
  • When the data processing rate is aggregated, it is aggregated according to a second preset unit time, where the second preset unit time is at the level of minutes, hours, and so on. For example, when the second preset unit time is at the minute level, the aggregated first data production rate represents the amount of data produced by the operator per minute.
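  • A minute-level aggregation of per-second samples might look like the following Python sketch. It assumes each sample is a pair of (timestamp in whole seconds, amount of data produced in that second); the function name `aggregate_per_minute` and the sample layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_per_minute(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Sum per-second amounts into per-minute totals.

    Each sample is assumed to be (timestamp_in_seconds, amount_in_that_second).
    The result maps the first second of each minute to the total amount
    produced during that minute.
    """
    totals: Dict[int, float] = defaultdict(float)
    for timestamp, amount in samples:
        minute_start = timestamp - timestamp % 60
        totals[minute_start] += amount
    return dict(totals)


if __name__ == "__main__":
    per_second = [(0, 3.0), (59, 5.0), (60, 2.0)]
    print(aggregate_per_minute(per_second))
```

  • Changing the bucket width from 60 to 3600 would give an hour-level aggregation in the same way.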
  • the target location includes databases, ES and other devices that can store data.
  • After the data processing rate is obtained, it is aggregated so that business and operation and maintenance personnel can refer to the aggregated data processing rate to analyze and tune the relevant settings of the Flink system (for example, the concurrency of operators), thereby ensuring the performance of the Flink system.
  • Step 102: Determine at least one working state of the operator based on the data processing rate corresponding to the operator and/or its upstream operator.
  • Based on the data processing rate corresponding to the operator and/or the data processing rate corresponding to its upstream operator, it is determined whether the operator's ability to process data (for example, to produce or consume it) is abnormal in each dimension. This yields the working state of the operator in each dimension, that is, the various working states of the operator. Each working state indicates whether the operator's ability to process data is abnormal, so that abnormal operators can be identified and operators can be monitored intelligently.
  • the dimensions include a first dimension and/or a second dimension, where the first dimension indicates a production dimension and the second dimension indicates a consumption dimension.
  • the working state of the operator includes the first working state of the operator and/or the second working state of the operator.
  • the first working state of the operator indicates whether the data produced by the upstream operator is evenly distributed, that is, it indicates whether the difference in the amount of data produced by all upstream operator instances corresponding to the operator is small.
  • the second working state indicates whether the data consumption capacity is normal, that is, whether the difference in the amount of data consumed by each operator instance corresponding to the operator is small.
  • the first operator is any operator among the operators.
  • the second operator is the upstream operator of the first operator.
  • The first working state of the first operator is determined based on the error value between the data processing rates corresponding to the operator instances of the upstream operator (i.e., the second operator) of the first operator.
  • the second working state of the first operator is determined according to the error value between the data processing rates corresponding to the respective operator instances corresponding to the first operator. Specifically, the second working state of each operator instance corresponding to the first operator can be determined.
  • The first working state of the first operator is determined based on the error value between the data processing rates corresponding to the operator instances of the upstream operator of the first operator, and the second working state of the first operator is determined based on the error value between the data processing rates corresponding to the operator instances of the first operator.
  • Step 103 In response to any working status of the operator indicating that the operator is abnormal, perform corresponding alarm operations.
  • Among the working states of the operator, when any working state indicates that the operator is abnormal, the operator is a faulty operator, and a corresponding alarm operation is performed so that relevant personnel can discover the faulty operator promptly and resolve the fault in time, ensuring the normal operation and performance of the distributed processing system.
  • the distributed processing system includes multiple operators, and there are upstream and downstream relationships between the operators.
  • the data processing rate includes the data consumption rate and/or the first data production rate.
  • The data consumption rate indicates the rate at which the operator consumes the data produced by its corresponding upstream operator.
  • The first data production rate represents the rate at which the operator produces data. Whether the operator's production and/or consumption of data is abnormal is determined according to the data consumption rate and/or the first data production rate corresponding to the operator, thereby determining the working state of the operator in at least one dimension and discovering faulty operators in a timely manner.
  • Figure 5 is a flow chart of another monitoring method of a distributed processing system according to an embodiment of the present disclosure. The process of determining the first working state of an operator is described in detail below in conjunction with a specific embodiment. As shown in Figure 5, the method includes the following steps:
  • Step 501 For distributed tasks, obtain the data processing rate corresponding to the operator. Wherein, the data processing rate includes the first data production rate.
  • the first data production rate represents the rate at which the operator produces data.
  • Step 502 For the upstream operator of the first operator, that is, the second operator, calculate the first error value between any two of the first data production rates corresponding to its operator instances.
  • the first operator is any one of the above operators.
  • the operator is regarded as the first operator, and the upstream operator of the first operator is regarded as the second operator.
  • For each second operator instance, calculate the difference between the first data production rate corresponding to that second operator instance and the first data production rate corresponding to each other second operator instance, and use it as the first error value corresponding to the second operator instance. The first error value indicates the difference in the amount of data produced by the two second operator instances in the first preset unit time, that is, the difference in their data production capacity.
  • Step 503 Determine the first working state of the first operator according to the first error value.
  • After the first error values are obtained, they are used to determine whether the second operator instances differ only slightly in their ability to produce data, that is, whether the data produced by the upstream operator of the first operator is evenly distributed, thereby obtaining the first working state of the first operator.
  • Determining the first working state of the first operator based on the first error value includes:
  • When the first error value reaches the first preset value, it is determined that the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed.
  • Here, "reaches" means greater than and/or equal to, and "does not reach" means less than.
  • When the first error value between the first data production rates corresponding to two second operator instances reaches the first preset value, the data production capabilities of the two second operator instances differ considerably: one second operator instance produces a large amount of data and the other produces a small amount. In other words, the data produced by the operator instances of the upstream operator of the first operator is unevenly distributed, so the first working state of the first operator indicates that the data produced by its upstream operator is unevenly distributed.
  • Optionally, a value calculated from the first error value can be used to determine the first working state of the first operator. For example, after the first error value between the first data production rates corresponding to two second operator instances is obtained, the ratio of the first error value to the first data production rate corresponding to either of the two second operator instances is used as the error value.
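  • Steps 502 and 503 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary layout, the use of the absolute difference, and the choice of the first instance of each pair as the denominator for the ratio variant are all hypothetical; "reaches" is implemented as greater than or equal to.

```python
from itertools import combinations

def first_error_values(production_rates, relative=False):
    """Pairwise first error values between second operator instances.

    production_rates: {instance_id: first data production rate} (hypothetical layout).
    relative=True returns the ratio of the difference to the first instance's rate.
    """
    errors = {}
    for (a, rate_a), (b, rate_b) in combinations(production_rates.items(), 2):
        error = abs(rate_a - rate_b)  # absolute difference is an assumption
        if relative:
            if rate_a == 0:
                # Guard against division by zero for an idle instance.
                error = 0.0 if rate_b == 0 else float("inf")
            else:
                error = error / rate_a
        errors[(a, b)] = error
    return errors

def upstream_is_uneven(production_rates, first_preset_value, relative=False):
    """True if any pairwise error reaches (>=) the first preset value."""
    errors = first_error_values(production_rates, relative)
    return any(error >= first_preset_value for error in errors.values())

rates = {"s1": 100, "s2": 105, "s3": 400}
uneven = upstream_is_uneven(rates, first_preset_value=200)
```

  • With these sample rates, instance `s3` differs from the other two by roughly 300 records per second, so the check reports uneven distribution for a preset value of 200.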
  • Optionally, the first working state of each operator instance corresponding to the first operator can be determined first, and then used to determine the first working state of the first operator. The process includes: for each operator instance corresponding to the first operator, obtaining each second operator instance corresponding to that operator instance; if no first error value between these second operator instances reaches the first preset value, the upstream operator instances of the operator instance differ only slightly in their data production capabilities, and it is determined that the first working state of the operator instance indicates that the data produced by the upstream operator is evenly distributed.
  • If a first error value between the second operator instances corresponding to the operator instance reaches the first preset value, the data production capabilities of the upstream operator instances of the operator instance differ considerably, and it is determined that the first working state of the operator instance indicates that the data produced by the upstream operator is unevenly distributed, and therefore that the first working state of the operator indicates the same.
  • When the first working state of any operator instance corresponding to the first operator indicates that the data produced by the upstream operator is unevenly distributed, it is determined that the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed.
  • When the first working states of all operator instances corresponding to the first operator indicate that the data produced by the upstream operator is evenly distributed, it is determined that the first working state of the first operator indicates that the data produced by the upstream operator is evenly distributed.
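  • The rule for combining per-instance states into the state of the first operator can be sketched as below. The function name and the string labels "even" and "uneven" are assumptions chosen for this example.

```python
def operator_first_state(instance_states):
    """Combine per-instance first working states into the operator's state.

    instance_states: {instance_id: "even" | "uneven"} (hypothetical labels).
    The operator is "uneven" if any instance is "uneven", otherwise "even".
    """
    if any(state == "uneven" for state in instance_states.values()):
        return "uneven"
    return "even"

state = operator_first_state({"i1": "even", "i2": "uneven"})
```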
  • Optionally, the first working state of the first operator can be further determined by using the outbound buffer corresponding to the second operator.
  • The specific determination process includes:
  • The outbound buffer is used to store the data produced by the second operator.
  • It is determined that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.
  • Optionally, the expansion rate of the outbound buffer corresponding to the second operator instance is further obtained; the expansion rate indicates the growth rate of the total space size of the outbound buffer. For example, if the total space size of the outbound buffer corresponding to the second operator instance is 100MB at time 1 and 200MB at time 2, the expansion rate of that outbound buffer is (200MB-100MB)/(time 2-time 1).
  • When the expansion rate of the outbound buffer corresponding to the second operator instance reaches the first preset rate, the buffer is expanding too fast, which means that the second operator instance is producing too much data. It is therefore determined that the data produced by the second operator instance is unevenly distributed, and hence that the first working state of the downstream operator instance connected to the second operator instance (that is, the corresponding operator instance in the first operator) indicates that the data produced by the upstream operator is unevenly distributed; that is, the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed.
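  • The expansion-rate check can be sketched as follows, using the 100MB to 200MB example from the text. The function names, the units (MB and seconds), and the 10-second interval are assumptions for illustration only.

```python
def expansion_rate(size_1_mb, time_1_s, size_2_mb, time_2_s):
    """Growth rate of a buffer's total space size, in MB per second."""
    if time_2_s <= time_1_s:
        raise ValueError("time 2 must be later than time 1")
    return (size_2_mb - size_1_mb) / (time_2_s - time_1_s)

def expanding_too_fast(rate, first_preset_rate):
    """True if the expansion rate reaches (>=) the first preset rate."""
    return rate >= first_preset_rate

# 100MB at time 1 and 200MB at time 2; the two samples are assumed to be 10 s apart.
rate = expansion_rate(100, 0, 200, 10)
```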
  • the first working state of the operator instance may be determined according to the above-mentioned process of determining the first working state of the first operator.
  • Step 504 In response to the first working state of the first operator indicating that the operator is abnormal, perform a corresponding alarm operation.
  • When the alarm operation is performed according to the first working state of the first operator, the alarm can be raised in either of the following two ways.
  • One way is to output the first alarm information when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.
  • The first alarm information is used to prompt increasing the concurrency of downstream operators.
  • When the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed, that is, when the first working state of an operator instance corresponding to the first operator indicates this for its upstream operator instance (i.e., the second operator instance), the first alarm information is output to indicate that a computing bottleneck has occurred in the first operator, that is, in that operator instance.
  • The ability of the upstream operator to produce data exceeds the ability of the downstream operator to consume data. To meet business needs, more consumers of the data need to be added, that is, the concurrency of the first operator needs to be increased. In other words, relevant personnel are prompted to increase the number of downstream operator instances connected to the second operator instance whose produced data is unevenly distributed.
  • For example, the first operator includes two operator instances, namely instances 1 and 2. The upstream operator of the first operator, that is, the second operator, includes 6 second operator instances; instance 1 is connected to 3 of them and instance 2 to the other 3. If the data produced by one second operator instance connected to instance 1 is unevenly distributed, the first alarm information is output to prompt increasing the concurrency of the downstream operator instances connected to that second operator instance. For example, instance 3 is added to the first operator and is also connected to that second operator instance; instance 3 also consumes the data produced by that second operator instance, thereby increasing the number of consumers corresponding to it.
  • Another way is, when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, to determine each third operator located upstream of the first operator based on the execution process corresponding to the distributed task, and to obtain the first working state corresponding to each third operator.
  • Alarm operations are performed based on the data consumption rates corresponding to other downstream operators.
  • The execution plan is traversed upward with the first operator as the child node; that is, based on the job execution process, or in other words based on the operator chain in which the first operator is located, the traversal proceeds upward until the source operator is reached, and each traversed operator (excluding the source operator) is used as a third operator.
  • When the first working state of every third operator indicates that the data produced by the upstream operator is unevenly distributed, the source data is at fault: the data volume of some partitions of the data source, that is, of the partition consumed by the first downstream operator, is too high.
  • Operator a1 consumes data in partition 1 in the message queue
  • operator b1 consumes data in partition 2.
  • When the first working state of operator a3 indicates that the data produced by the upstream operator is unevenly distributed, the working states of operators a1 and a2 in the first dimension are determined in turn. Since source operator A has no upstream operator, there is no need to obtain the first working state of source operator A.
  • When the working states of operators a1 and a2 in the first dimension both indicate that the data produced by the upstream operator is unevenly distributed, there is too much data in partition 1. Operator a1 is then used as the first downstream operator and operator b1 as another downstream operator, and the data consumption rate corresponding to operator b1 is used to perform alarm operations.
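  • The upward traversal and the check on the third operators can be sketched as below, using the a1, a2, a3 example. The `upstream_of` mapping, the function names, and the state labels are assumptions for this illustration; the chain is assumed to be linear and acyclic.

```python
def third_operators(upstream_of, first_operator, source_operators):
    """Walk upward from the first operator until a source operator is reached.

    upstream_of: {operator: its upstream operator} (hypothetical chain layout).
    Returns the traversed operators, excluding the source operator.
    """
    result = []
    current = upstream_of.get(first_operator)
    while current is not None and current not in source_operators:
        result.append(current)
        current = upstream_of.get(current)
    return result

def source_partition_overloaded(first_states, third_ops):
    """True if every third operator reports unevenly distributed upstream data."""
    return bool(third_ops) and all(first_states[op] == "uneven" for op in third_ops)

# Chain from the example: source operator A -> a1 -> a2 -> a3.
chain = {"a3": "a2", "a2": "a1", "a1": "A"}
thirds = third_operators(chain, "a3", {"A"})
overloaded = source_partition_overloaded({"a1": "uneven", "a2": "uneven"}, thirds)
```

  • Here the third operators are a2 and a1; because both report uneven upstream data, the sketch concludes that the source partition holds too much data.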
  • operators a1, a2 and a3 each correspond to an operator instance, and partition 1 indicates the input buffer of operator a1.
  • Operators b1, b2, and b3 each correspond to an operator instance, and partition 2 indicates the input buffer of operator b1.
  • partition 1 corresponds to multiple sub-partitions, that is, input buffers.
  • Each operator instance of operator a1 consumes data in an input buffer.
  • both the first operator and the third operator can indicate operator instances, and correspondingly, the first downstream operator and other downstream operators can also indicate operator instances.
  • second alarm information is output.
  • The second alarm information is used to indicate that there is too much data at the source end and to prompt increasing the concurrency of downstream operators.
  • When the second error value does not reach the third preset value, the first channel reallocation operation is performed.
  • The first channel reallocation operation instructs other downstream operators to consume data in the input buffer corresponding to the first downstream operator.
  • The data in the input buffer is the data produced by the source operator.
  • When the second error value does not reach the third preset value, other downstream operators can consume more data, and the channels can be reallocated according to the data consumption rates of the downstream operators. That is, the first channel reallocation operation is performed so that other downstream operators with higher data consumption rates consume data from the input buffer corresponding to the first downstream operator.
  • the data consumption rate corresponding to other downstream operators may refer to the data consumption rate corresponding to each operator instance corresponding to other downstream operators
  • the first data production rate corresponding to the source operator may refer to the data consumption rate corresponding to the other downstream operators.
  • the second error value corresponding to each operator instance can be determined.
  • When the second error value corresponding to an operator instance does not reach the third preset value, that operator instance can consume more data, and it is allowed to consume the data of the affected channel, that is, the data in the input buffer corresponding to the first downstream operator. In other words, a channel is established between the operator instance and the source operator, so that channels are connected automatically and intelligently.
  • the second preset value and the third preset value may be the same or different.
  • Establishing a channel between the operator instance and the source operator actually means establishing a channel connection between the operator instance and the operator instance of the first downstream operator, that is, the first downstream operator corresponding to the input buffer with too much data; in effect, the channel between the upstream slot and the downstream slot is rebuilt.
  • Optionally, the minimum of the data consumption rates corresponding to the other downstream operators and the data consumption rate corresponding to the first downstream operator is obtained. Based on this minimum value, a speed limit instruction is generated and sent to the source operator, so that the source operator adjusts its first data production rate based on the minimum value. This prevents back pressure across the entire system, which would affect mechanisms such as checkpoints and watermarks.
  • The adjustment can be made according to a preset adjustment rule; for example, the first data production rate corresponding to the source operator is adjusted to the minimum value. The adjustment rule is not restricted here.
  • the first data production rate corresponding to the source operator is adjusted.
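  • Generating the speed limit instruction can be sketched as follows. The instruction is shown as a plain dictionary whose field names are hypothetical, and the adjustment rule used here is the example rule from the text (set the source rate to the minimum value).

```python
def build_speed_limit(first_downstream_rate, other_downstream_rates):
    """Build a speed limit instruction for the source operator.

    The limit is the minimum of the first downstream operator's data consumption
    rate and the other downstream operators' data consumption rates.
    The dictionary layout is an assumption, not a format from the disclosure.
    """
    limit = min([first_downstream_rate, *other_downstream_rates])
    return {"target": "source_operator", "max_production_rate": limit}

instruction = build_speed_limit(300, [500, 450])
```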
  • the second channel reallocation operation instructs the normal operator instance to consume the data in the input buffer corresponding to the abnormal operator instance;
  • The normal operator instance is an operator instance, among the operator instances corresponding to the first operator, whose first working state indicates that the data produced by the upstream operator is evenly distributed.
  • the abnormal operator instance is an operator instance in which the first working state in the operator instance corresponding to the first operator indicates that the data produced by the upstream operator is unevenly distributed.
  • When the first working state of every third operator indicates that the data produced by the upstream operator is evenly distributed, only the data produced by the upstream operator of the first operator is unevenly distributed, that is, the rate at which the upstream operator produces data exceeds the consumption capacity of the first operator. In that case, the operator instances of the first operator whose first working state indicates that the data produced by the upstream operator is evenly distributed are determined and used as normal operator instances. For each normal operator instance, the difference between the data consumption rate corresponding to the normal operator instance and its corresponding first data production rate is calculated to obtain a fourth error value, which is used to determine whether the normal operator instance can consume more data.
  • An operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is unevenly distributed is regarded as an abnormal operator instance.
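  • Splitting the instances of the first operator into normal and abnormal ones and computing the fourth error value can be sketched as below. The names and labels are hypothetical, and "its corresponding first data production rate" is read here as a per-instance rate supplied by the caller, which is an assumption. The sketch stops at the fourth error value because the text does not state the rule for deciding whether an instance can consume more data.

```python
def split_instances(first_states):
    """Partition operator instances by first working state.

    first_states: {instance_id: "even" | "uneven"} (hypothetical labels).
    Returns (normal_instances, abnormal_instances).
    """
    normal = [i for i, state in first_states.items() if state == "even"]
    abnormal = [i for i, state in first_states.items() if state == "uneven"]
    return normal, abnormal

def fourth_error_values(normal, consumption_rate, production_rate):
    """Fourth error value per normal instance: consumption rate minus production rate."""
    return {i: consumption_rate[i] - production_rate[i] for i in normal}

normal, abnormal = split_instances({"i1": "even", "i2": "uneven", "i3": "even"})
errors = fourth_error_values(
    normal,
    consumption_rate={"i1": 500, "i2": 100, "i3": 400},
    production_rate={"i1": 450, "i2": 900, "i3": 400},
)
```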
  • Optionally, the key value (i.e., key) corresponding to the data produced by the upstream operator of the first operator, that is, by the second operator instance whose produced data is determined to be unevenly distributed, is obtained, so that the key value of the skewed data is recorded. The key value is output so that relevant personnel can learn which data is skewed and, based on it, reset the key values that the first operator needs to process, that is, reset the flow direction of the data. This avoids an unreasonable key setting causing the first operator to consume too much data and become a computing bottleneck.
  • the main control server can also obtain the expansion rate corresponding to the buffer corresponding to each operator instance, where the buffer includes an outbound buffer and/or an inbound buffer.
  • When the expansion rate reaches the second preset rate, the buffer is expanding too fast and its expansion rate needs to be limited. The buffer is controlled to stop expanding, and bufferpool (buffer pool) pre-allocation is implemented.
  • Channels can be reallocated according to the data processing rates of the operators so that traffic is switched automatically, without intervention by business or operation and maintenance personnel. This avoids problems caused by back pressure in the distributed processing system being noticed by operation and maintenance personnel only after a large number of checkpoint failures and compression failures, and ensures the performance of the distributed processing system.
  • Optionally, the data processing rates corresponding to the operator, that is, to the operator instance, are compared horizontally: all data processing rates corresponding to the operator instance within a set time are obtained, and if none of them reaches the historical average processing rate, the operator instance has become a slow node, and slow operator alarm information is output so that relevant personnel can maintain the slow operator. The historical average rate is the average of the collected data processing rates.
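  • The slow-node check can be sketched as follows. The function name and the two input lists are assumptions; "has not reached" is implemented as strictly less than, consistent with the definition of "reach" given earlier.

```python
from statistics import mean

def is_slow_node(recent_rates, collected_rates):
    """True if every rate in the set time window is below the historical average.

    recent_rates: data processing rates observed within the set time.
    collected_rates: previously collected rates used for the historical average.
    Both inputs are hypothetical; the disclosure does not fix their format.
    """
    if not recent_rates or not collected_rates:
        return False  # nothing to compare
    historical_average = mean(collected_rates)
    return all(rate < historical_average for rate in recent_rates)

slow = is_slow_node([80, 90], [100, 120, 110])
```

  • With a historical average of 110, a window of 80 and 90 marks the instance as slow, while a window containing 115 would not.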
  • The working state of the operator in the first dimension is determined based on the data processing rate corresponding to the upstream operator of the operator; that is, it is determined whether the data produced by the upstream operator is evenly distributed, and hence whether the concurrency of the operator is reasonable. This decides whether to issue a corresponding alarm, that is, whether to prompt adjusting the concurrency of the operator, so that abnormalities are corrected in time and the running performance of the distributed processing system is ensured.
  • When the data is unevenly distributed across the Kafka partitions, some partitions hold a large amount of data. This causes the Flink Kafka consumer, that is, the downstream operator of the source operator, to hit a bottleneck when consuming the partitions with large amounts of data. Therefore, when the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed, it is determined whether the load of the downstream operator is too high because of uneven partitioning at the source end, and a corresponding alarm operation is performed, for example, prompting to increase the concurrency of the downstream operators of the source operator, to avoid system congestion.
  • In addition, when the business logic includes a keystream, an unreasonable partition key setting will also cause a computing bottleneck in the downstream operator. Therefore, the key value of the skewed data is recorded so that relevant personnel can resolve the issue as soon as possible.
  • The operation and maintenance indicators, that is, the data processing rates corresponding to the operators, can be used to identify abnormalities in the system in a timely manner. Before any serious problem occurs, operation and maintenance personnel can be told whether the concurrency settings of the system are unreasonable and whether a job has key value data skew or unreasonable settings, so that they can solve problems in the distributed processing system as early as possible and ensure the reliability of the system.
  • This operator is regarded as the first operator, and the first data production rate corresponding to each operator instance of the second operator located upstream of the first operator is obtained. These first data production rates are used to determine whether the data production capabilities of the operator instances differ too much, that is, whether some operator instance produces too much or too little data. It can thus be determined whether the data produced by the operator instances of the upstream operator of the first operator is evenly distributed, and the first working state corresponding to the first operator can be obtained accurately.
  • Figure 7 is a flow chart of yet another monitoring method for a distributed processing system according to an embodiment of the present disclosure. The process of determining the second working state corresponding to the operator is described in detail below in conjunction with a specific embodiment. As shown in Figure 7, the method includes the following steps:
  • Step 701 For distributed tasks, obtain the data processing rate corresponding to the operator. Among them, the data processing rate includes the data consumption rate. The data consumption rate indicates the rate at which an operator consumes data generated by an upstream operator.
  • Step 702 Calculate the third error value between any two data consumption rates among the data consumption rates corresponding to the first operator.
  • the first operator is any one of the above operators.
  • This operator is used as the first operator, and the data consumption rate corresponding to each operator instance of the first operator is obtained.
  • The third error value corresponding to an operator instance indicates the difference in the amount of data consumed by two operator instances of the first operator in the first preset unit time, that is, the difference in the two operator instances' ability to consume data.
  • the data consumption rate corresponding to the operator instance indicates the data consumption rate of the operator instance in its corresponding slot.
  • Step 703 Determine the second working state corresponding to each operator instance corresponding to the first operator according to the third error value.
  • When the third error value between an operator instance and another operator instance reaches the fourth preset value, the operator instance has a poor ability to consume data, and it is determined that the second working state of the operator instance indicates that the data consumption capability is abnormal; otherwise, it is determined that the second working state of the operator instance indicates that the data consumption capability is normal.
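  • Steps 702 and 703 can be sketched as below. To decide which instance of a pair is the weak one, this sketch treats the third error value as signed (another instance's rate minus this instance's rate); that reading, the function name, and the state labels are assumptions for illustration.

```python
def second_working_states(consumption_rates, fourth_preset_value):
    """Second working state per operator instance of the first operator.

    consumption_rates: {instance_id: data consumption rate} (hypothetical layout).
    An instance is "abnormal" if some other instance consumes faster than it by
    at least the fourth preset value; otherwise it is "normal".
    """
    states = {}
    for instance, rate in consumption_rates.items():
        lagging = any(
            other_rate - rate >= fourth_preset_value
            for other, other_rate in consumption_rates.items()
            if other != instance
        )
        states[instance] = "abnormal" if lagging else "normal"
    return states

states = second_working_states({"i1": 1000, "i2": 980, "i3": 300}, 500)
```

  • In this example only `i3` lags the fastest instance by at least 500, so only `i3` is marked abnormal.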
  • Step 704 In response to the second working state of the operator instance indicating that the data consumption capacity is abnormal, perform a corresponding alarm operation.
  • alarms can be generated based on the following two methods.
  • One way is to determine the target task manager to which the slot corresponding to the operator instance belongs when the second working state of the operator instance indicates that the data consumption capacity is abnormal. Determine all first operator instances other than the operator instance included in the target task manager, and obtain the second working status of each first operator instance. According to the second working state of each first operator instance, corresponding alarm operations are performed.
  • When the second working state of an operator instance corresponding to the first operator indicates that the data consumption capacity is abnormal, there is a computing bottleneck in the operator instance, that is, in the slot where the operator instance is located. To determine whether this is due to a failure of the task manager to which the slot belongs (i.e., the target task manager), the other operator instances on the target task manager are obtained and used as first operator instances. The second working states of the first operator instances are then used to determine why the data consumption capacity of the operator instance corresponding to the first operator is abnormal, that is, why the slot has a computing bottleneck, so that the problem is located accurately and the alarm is accurate.
  • When the proportion of abnormal first operator instances reaches the first preset proportion, the target task manager failure prompt information is output to remind relevant personnel that the target task manager has failed, so that the problem is located accurately.
  • When the proportion of abnormal first operator instances does not reach the first preset proportion, only the operator instance of the first operator whose second working state indicates abnormal data consumption capacity has a computing bottleneck, that is, the bottleneck is in the slot to which that operator instance belongs. The operator instance exception prompt information is then output to remind relevant personnel that the operator instance is abnormal.
  • the target task manager fault prompt information may also include the affected job name and the affected operator identifier, that is, the identifiers of all operator instances corresponding to the target task manager that indicate abnormal data consumption capabilities.
  • Another way is to directly output the operator instance exception prompt information when the operator instance indicates that the data consumption capacity is abnormal to inform relevant personnel that the operator instance is abnormal.
  • In this embodiment, the operation and maintenance indicator, that is, the data consumption rate corresponding to the operator, can be used to detect abnormalities in the system in a timely manner and to locate faults precisely, so that operation and maintenance personnel are informed of faulty taskmanagers and slots before serious problems occur and can solve problems in the distributed processing system as early as possible to ensure system reliability.
  • In this embodiment, each operator is used as the first operator, the data consumption rate of each operator instance of the first operator is obtained, and the second working state is determined from the differences between the consumption rates of the operator instances, which allows the second working state of each operator instance of the first operator to be determined accurately.
  • the present disclosure also provides embodiments of apparatuses and computer equipment to which they are applied.
  • Embodiments of the monitoring device of the distributed processing system of the present disclosure can be applied to computer equipment, such as terminal equipment (for example, servers, computers, etc.).
  • the device embodiments may be implemented by software, or may be implemented by hardware or a combination of software and hardware.
  • Taking software implementation as an example, the device in a logical sense is formed when the processor of the computer equipment where it is located reads the corresponding computer program instructions from the non-volatile memory into memory and runs them. At the hardware level, Figure 8 is a hardware structure diagram of the computer equipment where the monitoring device of the distributed processing system according to the embodiment of the present disclosure is located.
  • the computer equipment where the monitoring device 831 of the distributed processing system is located in the embodiment may also include other hardware according to the actual functions of the computer equipment, which will not be described again.
  • Figure 9 is a block diagram of a monitoring device of a distributed processing system according to an embodiment of the present disclosure.
  • the device includes:
  • the rate acquisition module 910 is used to obtain the data processing rate corresponding to the operator for distributed tasks
  • the rate processing module 920 is used to determine at least one working state of the operator based on the data processing rate corresponding to the operator and/or its upstream operator;
  • the alarm module 930 is used to perform corresponding alarm operations in response to any working status of the operator indicating that the operator is abnormal.
  • the operator corresponds to at least one operator instance, and each operator instance has a corresponding data processing rate.
  • the rate processing module 920 is specifically configured to determine at least one working state of the first operator based on the error value between the data processing rates corresponding to each operator instance corresponding to the first operator and/or the second operator.
  • the first operator is any operator among the operators.
  • the second operator is the upstream operator of the first operator.
  • the data processing rate includes a first data production rate.
  • the first data production rate represents the rate at which the operator produces data.
  • the working state of the first operator includes the first working state.
  • the first working state of the first operator indicates whether the data produced by the upstream operator corresponding to the first operator is evenly distributed.
  • the rate processing module 920 is specifically configured to calculate a first error value between any two of the first data production rates respectively corresponding to the operator instances of the second operator.
  • the first working state of the first operator is determined based on the first error value.
  • the rate processing module 920 is also configured to: when any first error value reaches the first preset value, determine that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed; and when none of the first error values reaches the first preset value, determine that it indicates that the data is evenly distributed.
  • the rate processing module 920 is also configured to obtain the expansion rate corresponding to the outgoing buffer corresponding to the second operator.
  • the outgoing buffer is used to store the data produced by the second operator.
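  • when a first error value reaches the first preset value and the expansion rate reaches the first preset rate, the first working state of the first operator is determined to indicate that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.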
  • the alarm module 930 is specifically configured to: output the first alarm information when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.
  • the first alarm information is used to prompt an increase in the parallelism of the downstream operators.
  • the alarm module 930 is specifically configured to: when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, determine, based on the execution process corresponding to the distributed task, each third operator above the first operator, and obtain the first working state corresponding to each third operator.
  • when the first working status corresponding to each third operator indicates that the data produced by the upstream operator is unevenly distributed, determine the downstream operators of the source operator of the distributed task other than the first downstream operator. The first downstream operator is a downstream operator of the source operator and is one of the third operators.
  • alarm operations are performed based on the data consumption rates corresponding to the other downstream operators.
  • the upstream operator and the downstream operator are connected through channels.
  • the alarm module 930 is specifically configured to determine the second error value between the data consumption rate corresponding to other downstream operators and the first data production rate corresponding to the source operator.
  • when the second error value reaches the second preset value, second alarm information is output.
  • the second alarm information is used to prompt that there is too much data at the source end and that the parallelism of the downstream operators should be increased.
  • the upstream operator and the downstream operator are connected through channels.
  • the device also includes a first channel processing module.
  • the first channel processing module is specifically used to determine, when the first working status corresponding to each third operator indicates that the data produced by the upstream operator is unevenly distributed, a second error value between the data consumption rate corresponding to the other downstream operators and the first data production rate corresponding to the source operator.
  • when the second error value does not reach the third preset value, a first channel reallocation operation is performed; the first channel reallocation operation instructs the other downstream operators to consume the data in the input buffer corresponding to the first downstream operator.
  • the data in the input buffer is the data produced by the source operator.
  • the device also includes a rate limiting module.
  • the rate limiting module is specifically used to: when the first working status corresponding to each third operator indicates that the data produced by the upstream operator is unevenly distributed, obtain the minimum value among the data consumption rates corresponding to the other downstream operators and the data consumption rate corresponding to the first downstream operator.
  • a rate limit instruction is generated based on the minimum value and sent to the source operator, so that the source operator adjusts the first data production rate corresponding to the source operator based on the minimum value.
  • the device also includes a second channel processing module.
  • the second channel processing module is specifically used to: after the first working status corresponding to each third operator is obtained, perform a second channel reallocation operation when the first working status of all third operators indicates that the data produced by the upstream operator is evenly distributed.
  • the second channel reallocation operation instructs a normal operator instance to consume the data in the input buffer corresponding to an abnormal operator instance.
  • the normal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is evenly distributed.
  • the abnormal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is unevenly distributed.
  • the device also includes a data recording module.
  • the data recording module is specifically used to: when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, record the key values corresponding to the data produced by the upstream operator corresponding to the first operator, and output the key values.
  • Data processing rate includes data consumption rate.
  • the data consumption rate indicates the rate at which an operator consumes data generated by an upstream operator.
  • the working state of the first operator includes the second working state.
  • the second working state of the first operator indicates whether the data consumption capability is normal.
  • the rate processing module 920 is specifically configured to calculate a third error value between any two data consumption rates among the data consumption rates corresponding to the first operator.
  • based on the third error value, the second working state corresponding to each operator instance of the first operator is determined.
  • the operator is scheduled to at least one resource group slot.
  • the alarm module 930 is specifically used to determine the target task manager to which the slot corresponding to the operator instance belongs when the second working state of the operator instance indicates abnormal data consumption capacity.
  • All first operator instances other than operator instances included in the target task manager are determined, and second working states corresponding to each first operator instance are obtained.
  • the alarm module 930 is also configured to: calculate the ratio of the number of first operator instances whose second working status indicates abnormal data consumption capabilities to the total number of first operator instances, to obtain the ratio of abnormal first operator instances.
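  • when the proportion of abnormal first operator instances reaches the first preset proportion, the target task manager fault prompt information is output.
  • when the proportion of abnormal first operator instances does not reach the first preset proportion, the operator instance exception prompt information is output.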
  • the alarm module 930 is also configured to: output operator instance exception prompt information when the second working state of the operator instance indicates abnormal data consumption capability.
  • the operator corresponds to at least one operator instance.
  • the alarm module 930 is also used to obtain the expansion rate of the buffer corresponding to each operator instance.
  • when the expansion rate of the buffer reaches the second preset rate, the buffer is controlled to stop expanding.
  • the operator corresponds to at least one operator instance.
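  • the alarm module 930 is also used to: obtain all data processing rates of the operator instance collected within the set time, and obtain the historical average processing rate corresponding to the operator instance; if none of these data processing rates reaches the historical average processing rate, slow-operator alarm information is output.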
  • the present disclosure also provides a computer-readable storage medium in which computer-executable instructions are stored.
  • when the processor executes the computer-executable instructions, the method as described above is implemented.
  • the present disclosure also provides a computer program product, which includes a computer program.
  • when the computer program is executed by a processor, the method as described above is implemented.
  • for the device embodiment, since it basically corresponds to the method embodiment, please refer to the description of the method embodiment for relevant details.
  • the device embodiments described above are only illustrative.
  • the modules described as separate components may or may not be physically separated.
  • the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. Persons of ordinary skill in the art can understand and implement it without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

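A monitoring method and apparatus for a distributed processing system, in which operators are deployed on the nodes of the distributed processing system. The monitoring method includes: for a distributed task, obtaining the data processing rate corresponding to an operator; determining at least one working state of the operator according to the data processing rates corresponding to the operator and/or its upstream operator; and, in response to the working state of the operator in any dimension indicating that the operator is abnormal, performing a corresponding alarm operation, so that abnormal operators are discovered in time.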

Description

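Monitoring method and apparatus for a distributed processing system
Technical Field
The present disclosure relates to the field of distributed technology, and in particular to a monitoring method and apparatus for a distributed processing system.
Background
As big data technology becomes widespread across industries, the value generated by data matters more and more to customers. In some fields, the hour-level or day-level latency of large-batch offline computing does not adequately support the timeliness of the business, and customers pay increasing attention to real-time data. Distributed real-time computing technology has evolved over several generations, from Storm and Spark Streaming to Flink, and has matured in terms of low latency, high throughput, and strong consistency.
At present, while the spread of distributed real-time computing brings timeliness to customers' data analysis, operators on the Flink platform frequently become abnormal, which puts new pressure on the platform's data operation and maintenance. Flink provides some built-in operation and maintenance monitoring metrics to help identify faulty operators, but problems are often discovered with some delay, so abnormal operators cannot be found in time.
Summary
To overcome the problems in the related art, the present disclosure provides a monitoring method and apparatus for a distributed processing system.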
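According to a first aspect of the embodiments of the present disclosure, a monitoring method for a distributed processing system is provided, the distributed processing system including operators that have upstream and downstream relationships;
the method includes:
for a distributed task, obtaining the data processing rate corresponding to the operator;
determining at least one working state of the operator according to the data processing rates corresponding to the operator and/or its upstream operator;
in response to any working state of the operator indicating that the operator is abnormal, performing a corresponding alarm operation.
Optionally, the operator corresponds to at least one operator instance, and each operator instance has a corresponding data processing rate;
determining at least one working state of the operator according to the data processing rates corresponding to the operator and/or its upstream operator includes:
determining at least one working state of a first operator according to the error values between the data processing rates respectively corresponding to the operator instances of the first operator and/or a second operator; where the first operator is any one of the operators, and the second operator is the upstream operator of the first operator.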
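Optionally, the data processing rate includes a first data production rate, where the first data production rate represents the rate at which an operator produces data; the working state of the first operator includes a first working state;
where the first working state of the first operator indicates whether the data produced by the upstream operator corresponding to the first operator is evenly distributed;
determining at least one working state of the first operator according to the error values between the data processing rates respectively corresponding to the operator instances of the first operator and/or the second operator includes:
calculating a first error value between any two of the first data production rates respectively corresponding to the operator instances of the second operator;
determining the first working state of the first operator according to the first error value.
Optionally, determining the first working state of the first operator according to the first error value includes:
when any first error value reaches a first preset value, determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed;
when none of the first error values reaches the first preset value, determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is evenly distributed.
Optionally, determining, when any first error value reaches the first preset value, that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed includes:
obtaining the expansion rate of the outgoing buffer corresponding to the second operator, where the outgoing buffer is used to store the data produced by the second operator;
when a first error value reaches the first preset value and the expansion rate reaches a first preset rate, determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.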
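Optionally, performing a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal includes:
when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, outputting first alarm information, where the first alarm information is used to prompt an increase in the parallelism of the downstream operator.
Optionally, performing a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal includes:
when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, determining, based on the execution flow corresponding to the distributed task, each third operator located above the first operator;
obtaining the first working state corresponding to each third operator;
when the first working states of all third operators indicate that the data produced by the upstream operator is unevenly distributed, determining the downstream operators of the source operator of the distributed task other than a first downstream operator, where the first downstream operator is a downstream operator of the source operator and is one of the third operators;
performing an alarm operation according to the data consumption rates corresponding to the other downstream operators.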
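Optionally, the upstream operator and the downstream operator are connected through a channel;
performing an alarm operation according to the data consumption rates corresponding to the other downstream operators includes:
determining a second error value between the data consumption rate corresponding to the other downstream operator and the first data production rate corresponding to the source operator;
when the second error value reaches a second preset value, outputting second alarm information, where the second alarm information is used to prompt that there is too much data at the source and that the parallelism of the downstream operator should be increased.
Optionally, the upstream operator and the downstream operator are connected through a channel; when the first working states of all third operators indicate that the data produced by the upstream operator is unevenly distributed, the method further includes:
determining a second error value between the data consumption rate corresponding to the other downstream operator and the first data production rate corresponding to the source operator;
when the second error value does not reach a third preset value, performing a first channel reallocation operation, where the first channel reallocation operation instructs the other downstream operators to consume the data in the input buffer corresponding to the first downstream operator; the data in the input buffer is data produced by the source operator.
Optionally, when the first working states of all third operators indicate that the data produced by the upstream operator is unevenly distributed, the method further includes:
obtaining the minimum of the data consumption rates corresponding to the other downstream operators and the data consumption rate corresponding to the first downstream operator;
generating a rate-limiting instruction according to the minimum value and sending it to the source operator, so that the source operator adjusts its first data production rate based on the minimum value.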
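Optionally, after obtaining the first working state corresponding to each third operator, the method further includes:
when the first working states of all third operators indicate that the data produced by the upstream operator is evenly distributed, performing a second channel reallocation operation, where the second channel reallocation operation instructs a normal operator instance to consume the data in the input buffer corresponding to an abnormal operator instance; the normal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is evenly distributed, and the abnormal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is unevenly distributed.
Optionally, when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, the method further includes:
recording the key values corresponding to the data produced by the upstream operator corresponding to the first operator, and outputting the key values.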
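Optionally, the data processing rate includes a data consumption rate; the data consumption rate represents the rate at which an operator consumes the data produced by its upstream operator;
the working state of the first operator includes a second working state;
where the second working state of the first operator indicates whether the data consumption capability is normal;
determining at least one working state of the first operator according to the error values between the data processing rates respectively corresponding to the operator instances of the first operator and/or the second operator includes:
calculating a third error value between any two of the data consumption rates corresponding to the first operator;
determining, according to the third error value, the second working state corresponding to each operator instance of the first operator.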
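Optionally, the operator is scheduled onto at least one resource-group slot; slots and operator instances are in one-to-one correspondence;
performing a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal includes:
when the second working state of an operator instance indicates that the data consumption capability is abnormal, determining the target task manager to which the slot corresponding to the operator instance belongs;
determining all first operator instances included in the target task manager other than the operator instance, and obtaining the second working state corresponding to each first operator instance;
performing a corresponding alarm operation according to the second working states of the first operator instances.
Optionally, performing a corresponding alarm operation according to the second working states of the first operator instances includes:
calculating the ratio of the number of first operator instances whose second working state indicates abnormal data consumption capability to the total number of first operator instances, to obtain the proportion of abnormal first operator instances;
when the proportion of abnormal first operator instances reaches a first preset proportion, outputting target task manager fault prompt information;
when the proportion of abnormal first operator instances does not reach the first preset proportion, outputting operator instance exception prompt information.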
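The ratio-based decision above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the disclosed implementation: the function name `classify_consumption_alarm`, the dictionary input, and the returned labels are hypothetical, and "reaches" is read as greater than or equal to.

```python
from typing import Dict


def classify_consumption_alarm(peer_states: Dict[str, bool],
                               first_preset_ratio: float) -> str:
    """Decide which prompt to output for an abnormal operator instance.

    peer_states maps each *other* operator instance on the same task manager
    (the "first operator instances") to True when its second working state
    indicates abnormal data consumption capability. Hypothetical data shape.
    """
    if not peer_states:
        # No peers to compare against: only the instance itself is suspect.
        return "operator_instance_abnormal"
    abnormal = sum(1 for is_abnormal in peer_states.values() if is_abnormal)
    ratio = abnormal / len(peer_states)
    if ratio >= first_preset_ratio:
        return "task_manager_fault"
    return "operator_instance_abnormal"
```

With two of three peers abnormal and a preset proportion of 0.5, the sketch reports a task manager fault; with no abnormal peers it reports only the single operator instance.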
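Optionally, performing a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal includes:
when the second working state of the operator instance indicates that the data consumption capability is abnormal, outputting the operator instance exception prompt information.
Optionally, the operator corresponds to at least one operator instance; each operator instance has a corresponding buffer;
the method further includes:
obtaining the expansion rate of the buffer corresponding to each operator instance;
when the expansion rate of the buffer reaches a second preset rate, controlling the buffer to stop expanding.
Optionally, the operator corresponds to at least one operator instance; the method further includes:
obtaining all data processing rates of the operator instance collected within a set time, and obtaining the historical average processing rate corresponding to the operator instance;
if none of these data processing rates reaches the historical average processing rate, outputting slow-operator alarm information.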
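As a rough illustration of the slow-operator check, the sketch below compares every rate collected in the set time window with the historical average. The function name `is_slow_operator` and the treatment of an empty window as "not slow" are assumptions made here, not details from the disclosure.

```python
from typing import Sequence


def is_slow_operator(recent_rates: Sequence[float],
                     historical_average: float) -> bool:
    """Return True when every rate in the window is below the historical average.

    "Does not reach" is read as strictly less than. An empty window yields
    False so that missing samples do not raise an alarm (an assumption).
    """
    return len(recent_rates) > 0 and all(
        rate < historical_average for rate in recent_rates
    )
```

A single sample at or above the historical average is enough to suppress the alarm, which matches the "all rates" wording of the condition.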
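According to a second aspect of the embodiments of the present disclosure, a monitoring apparatus for a distributed processing system is provided, the distributed processing system including operators that have upstream and downstream relationships;
the apparatus includes:
a rate acquisition module, configured to obtain, for a distributed task, the data processing rate corresponding to the operator;
a rate processing module, configured to determine at least one working state of the operator according to the data processing rates corresponding to the operator and/or its upstream operator;
an alarm module, configured to perform a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal.
Optionally, the operator corresponds to at least one operator instance, and each operator instance has a corresponding data processing rate;
the rate processing module is specifically configured to:
determine at least one working state of a first operator according to the error values between the data processing rates respectively corresponding to the operator instances of the first operator and/or a second operator; where the first operator is any one of the operators, and the second operator is the upstream operator of the first operator.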
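Optionally, the data processing rate includes a first data production rate, where the first data production rate represents the rate at which an operator produces data; the working state of the first operator includes a first working state;
where the first working state of the first operator indicates whether the data produced by the upstream operator corresponding to the first operator is evenly distributed;
the rate processing module is specifically configured to:
calculate a first error value between any two of the first data production rates respectively corresponding to the operator instances of the second operator;
determine the first working state of the first operator according to the first error value.
Optionally, the rate processing module is further configured to:
when any first error value reaches a first preset value, determine that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed;
when none of the first error values reaches the first preset value, determine that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is evenly distributed.
Optionally, the rate processing module is further configured to:
obtain the expansion rate of the outgoing buffer corresponding to the second operator, where the outgoing buffer is used to store the data produced by the second operator;
when a first error value reaches the first preset value and the expansion rate reaches a first preset rate, determine that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.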
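Optionally, the alarm module is specifically configured to:
when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, output first alarm information, where the first alarm information is used to prompt an increase in the parallelism of the downstream operator.
Optionally, the alarm module is specifically configured to:
when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, determine, based on the execution flow corresponding to the distributed task, each third operator located above the first operator;
obtain the first working state corresponding to each third operator;
when the first working states of all third operators indicate that the data produced by the upstream operator is unevenly distributed, determine the downstream operators of the source operator of the distributed task other than a first downstream operator, where the first downstream operator is a downstream operator of the source operator and is one of the third operators;
perform an alarm operation according to the data consumption rates corresponding to the other downstream operators.
Optionally, the upstream operator and the downstream operator are connected through a channel;
the alarm module is specifically configured to:
determine a second error value between the data consumption rate corresponding to the other downstream operator and the first data production rate corresponding to the source operator;
when the second error value reaches a second preset value, output second alarm information, where the second alarm information is used to prompt that there is too much data at the source and that the parallelism of the downstream operator should be increased.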
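Optionally, the upstream operator and the downstream operator are connected through a channel; the apparatus further includes a first channel processing module;
the first channel processing module is specifically configured to:
when the first working states of all third operators indicate that the data produced by the upstream operator is unevenly distributed, determine a second error value between the data consumption rate corresponding to the other downstream operator and the first data production rate corresponding to the source operator;
when the second error value does not reach a third preset value, perform a first channel reallocation operation, where the first channel reallocation operation instructs the other downstream operators to consume the data in the input buffer corresponding to the first downstream operator; the data in the input buffer is data produced by the source operator.
Optionally, the apparatus further includes a rate-limiting module;
the rate-limiting module is specifically configured to:
when the first working states of all third operators indicate that the data produced by the upstream operator is unevenly distributed, obtain the minimum of the data consumption rates corresponding to the other downstream operators and the data consumption rate corresponding to the first downstream operator;
generate a rate-limiting instruction according to the minimum value and send it to the source operator, so that the source operator adjusts its first data production rate based on the minimum value.
Optionally, the apparatus further includes a second channel processing module;
the second channel processing module is specifically configured to:
after the first working state corresponding to each third operator is obtained, and when the first working states of all third operators indicate that the data produced by the upstream operator is evenly distributed, perform a second channel reallocation operation, where the second channel reallocation operation instructs a normal operator instance to consume the data in the input buffer corresponding to an abnormal operator instance; the normal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is evenly distributed, and the abnormal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is unevenly distributed.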
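Optionally, the apparatus further includes a data recording module;
the data recording module is specifically configured to:
when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, record the key values corresponding to the data produced by the upstream operator corresponding to the first operator, and output the key values.
Optionally, the data processing rate includes a data consumption rate; the data consumption rate represents the rate at which an operator consumes the data produced by its upstream operator; the working state of the first operator includes a second working state;
where the second working state of the first operator indicates whether the data consumption capability is normal;
the rate processing module is specifically configured to:
calculate a third error value between any two of the data consumption rates corresponding to the first operator;
determine, according to the third error value, the second working state corresponding to each operator instance of the first operator.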
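Optionally, the operator is scheduled onto at least one resource-group slot; slots and operator instances are in one-to-one correspondence;
the alarm module is specifically configured to:
when the second working state of an operator instance indicates that the data consumption capability is abnormal, determine the target task manager to which the slot corresponding to the operator instance belongs;
determine all first operator instances included in the target task manager other than the operator instance, and obtain the second working state corresponding to each first operator instance;
perform a corresponding alarm operation according to the second working states of the first operator instances.
Optionally, the alarm module is further configured to:
calculate the ratio of the number of first operator instances whose second working state indicates abnormal data consumption capability to the total number of first operator instances, to obtain the proportion of abnormal first operator instances;
when the proportion of abnormal first operator instances reaches a first preset proportion, output target task manager fault prompt information;
when the proportion of abnormal first operator instances does not reach the first preset proportion, output operator instance exception prompt information.
Optionally, the alarm module is further configured to:
when the second working state of the operator instance indicates that the data consumption capability is abnormal, output the operator instance exception prompt information.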
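Optionally, the operator corresponds to at least one operator instance; each operator instance has a corresponding buffer;
the alarm module is further configured to:
obtain the expansion rate of the buffer corresponding to each operator instance;
when the expansion rate of the buffer reaches a second preset rate, control the buffer to stop expanding.
Optionally, the operator corresponds to at least one operator instance; the alarm module is further configured to:
obtain all data processing rates of the operator instance collected within a set time, and obtain the historical average processing rate corresponding to the operator instance;
if none of these data processing rates reaches the historical average processing rate, output slow-operator alarm information.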
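According to a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the monitoring method for a distributed processing system described in the first aspect and its various possible designs is implemented.
According to a fourth aspect of the embodiments of the present disclosure, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the monitoring method for a distributed processing system described in the first aspect and its various possible designs.
According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program; when the computer program is executed by a processor, the monitoring method for a distributed processing system described in the first aspect and its various possible designs is implemented.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
In the embodiments of the present disclosure, the distributed processing system includes operators with upstream and downstream relationships. The data processing rate of each operator involved in a distributed task is obtained; for each operator, whether its data processing is abnormal is determined from the data processing rates of the operator and its upstream operator, which gives at least one working state of the operator, so that faulty operators are discovered in time. When a working state of an operator indicates that the operator is abnormal, a corresponding alarm operation is performed. This keeps alarms timely and enables intelligent monitoring of the distributed processing system, so that abnormalities can be resolved promptly and the robustness and maintainability of the distributed processing system are improved.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present disclosure.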
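Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this specification and, together with the description, serve to explain its principles.
Figure 1 is a flowchart of a monitoring method for a distributed processing system according to an embodiment of the present disclosure.
Figure 2 is a schematic diagram of operators according to an embodiment of the present disclosure.
Figure 3 is a schematic diagram of an execution plan according to an embodiment of the present disclosure.
Figure 4 is another schematic diagram of operators according to an embodiment of the present disclosure.
Figure 5 is a flowchart of another monitoring method for a distributed processing system according to an embodiment of the present disclosure.
Figure 6 is a further schematic diagram of operators according to an embodiment of the present disclosure.
Figure 7 is a flowchart of a further monitoring method for a distributed processing system according to an embodiment of the present disclosure.
Figure 8 is a hardware structure diagram of the electronic device where the monitoring apparatus of the distributed processing system according to an embodiment of the present disclosure is located.
Figure 9 is a block diagram of a monitoring apparatus for a distributed processing system according to an embodiment of the present disclosure.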
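Detailed Description
Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The singular forms "a", "said", and "the" used in this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in the present disclosure to describe various kinds of information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".
The embodiments of the present disclosure are described in detail below.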
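As shown in Figure 1, Figure 1 is a flowchart of a monitoring method for a distributed processing system according to an embodiment of the present disclosure. Operators are deployed on the processing nodes of the distributed processing system, and the operators have upstream and downstream relationships. The method is executed by a master control server, specifically by a computer device, that is, the processor in the master control server. The method includes the following steps:
Step 101: For a distributed task, obtain the data processing rate corresponding to the operator.
In this embodiment, the distributed processing system is a Flink system. The distributed task refers to a Flink Job. A Flink job includes an operator chain formed by multiple operators. For two adjacent operators in the chain, the earlier one (that is, the one above) may be called the upstream operator, and the later one (that is, the one below) may be called the downstream operator. Traffic always flows from upstream to downstream, that is, the downstream operator processes the data produced by the upstream operator. For each operator in the chain, the data processing rate corresponding to that operator is determined.
Optionally, the data processing rate includes a data consumption rate and/or a first data production rate. The data consumption rate represents the rate at which an operator consumes the data produced by its upstream operator. The first data production rate represents the rate at which an operator produces data.
The data consumption rate indicates the rate at which the operator processes data produced by the upstream operator within a first preset unit time, and reflects the processing capability of the operator; the first data production rate indicates the rate at which the operator produces data within the first preset unit time, and reflects how the upstream operator is producing data. For example, the operators involved in the distributed task, that is, the operator chain, include operator 1 and operator 2, where operator 1 is the upstream operator of operator 2. Operator 1 transmits data (traffic) to operator 2; this is the data produced by operator 2's upstream operator. Operator 2 processes the data, for example by filtering it, and the filtered data becomes the data produced by operator 2.
Optionally, the upstream operator saves the data it produces to an input buffer, the downstream operator consumes the data in the input buffer, and the downstream operator saves the data it produces to an outgoing buffer. Accordingly, the data consumption rate of an operator indicates the rate at which the operator consumes data in the input buffer, and the first data production rate of an operator indicates the rate at which the operator fills the outgoing buffer. For example, to calculate the first data production rate, the amount of data added to the outgoing buffer within a certain period is obtained and divided by that period.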
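The calculation described above (buffer growth divided by elapsed time) can be written as a small sketch. The function name `first_data_production_rate` and its parameters are hypothetical and chosen only for illustration; units are whatever the caller uses for data volume and time.

```python
def first_data_production_rate(buffer_amount_start: float,
                               buffer_amount_end: float,
                               elapsed_seconds: float) -> float:
    """Estimate how fast an operator fills its outgoing buffer.

    The rate is the amount of data added to the outgoing buffer during the
    observation window divided by the length of that window.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (buffer_amount_end - buffer_amount_start) / elapsed_seconds
```

For instance, a buffer that grows from 100 units to 400 units over 60 seconds gives a rate of 5 units per second.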
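Optionally, the data processing rate of an operator may be collected by a client plug-in installed on the operator; after collecting it, the client sends it to the master control server.
Optionally, an operator has at least one upstream operator. Upstream and downstream operators are connected through channels. Each operator includes, that is, corresponds to, at least one operator instance, which is obtained by instantiating the operator. Accordingly, the data processing rate of an operator includes the data processing rates of each of its operator instances. For example, when the data processing rate includes the data consumption rate, suppose operator D corresponds to two operator instances, d1 and d2. If the data consumption rate of d1 is 3m/s and that of d2 is 5m/s, then the data consumption rate of operator D includes the data consumption rate of d1 (3m/s) and that of d2 (5m/s).
Optionally, one operator instance of an operator is connected through channels to at least one operator instance of its upstream operator, that is, to upstream operator instances; each channel corresponds to an input buffer and an outgoing buffer. For example, as shown in Figure 2, operator 2 corresponds to 2 instances of operator 2 and operator 1 corresponds to 6 instances of operator 1, with operator 1 being the upstream operator of operator 2. One instance of operator 2 is connected to 3 instances of operator 1 through three channels, that is, one instance of operator 1 and one instance of operator 2 are connected through one channel, and the instance of operator 2 consumes the data in the input buffer that stores the data produced by that instance of operator 1, namely the input buffer corresponding to the channel.
Specifically, the Flink job architecture has three parts: the job Client, the Flink JobManager, and the Flink TaskManager. A job is parsed by the Client and the JobManager in three passes into a stream graph, a job graph, and an execution graph. The master control server monitors the JobManager; if a new job (that is, new data) is submitted, it obtains the execution graph to find out which TaskManagers and slots the job has been scheduled to, that is, the addresses of the slots involved in the job, the operator list, the dependencies between operators, the parallelism of the operators, and so on. In the execution plan (execution flow) of a job shown in Figure 3, a vertex represents a vertex of the DAG, and JobVertex and ExecutionVertex represent the logical and physical execution plans respectively. ResultPartition represents the output of a vertex. Flink instantiates an operator according to its parallelism to obtain the operator instances of that operator. As shown in Figure 4, the number of ResultSubpartitions matches the parallelism and is initialized to 2; the downstream InputGate receives upstream data, and the number of gates is controlled by the downstream parallelism. Here the map operator represents the source-side operator and the reduce operator represents the destination-side operator, and these two operators are used as two vertices of the DAG. A vertex may correspond to one or more operators (an operator chain); for ease of description, the present disclosure treats one vertex as one operator.
Specifically, from the job execution flow described above, the following can be determined: 1) the task slots of the TaskManagers to which the job is assigned; 2) the identifiers of the operators involved in the job and the parent-child dependencies between operators, that is, the upstream and downstream relationships; 3) the identifiers of the slots to which operators are scheduled, the parallelism of the operators, the channel data between operators, and so on; 4) the InputGate and ResultSubpartition corresponding to each operator, which are used to track the operator's upstream and downstream operators.
Optionally, upstream and downstream operators may have different degrees of parallelism. Each operator can be scheduled onto at least one slot, and slots correspond one-to-one with operator instances. For example, if an operator is scheduled onto 3 slots, each slot corresponds to one operator instance of the operator. The threads on each slot are split according to the parallelism. For example, if a job includes 10 TaskManagers, each with 2 slots, and the parallelism of map is 10, then map can be assigned to 5 TaskManagers, with one map thread started on each slot.
Optionally, the upstream operator and the downstream operator communicate through channels, and the number of channels is determined by the upstream and downstream parallelism. For example, if the upstream operator has a parallelism of 10 and the downstream operator has a parallelism of 2, then with an even allocation every 5 processing threads send data to one thread of the downstream operator; that is, the upstream operator has 10 operator instances, the downstream operator has 2 operator instances, and every 5 upstream operator instances communicate with one downstream operator instance. In the Flink engine, the channels connecting upstream and downstream operators are not fixed.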
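The even allocation in the example above (10 upstream instances, 2 downstream instances, 5 channels each) can be illustrated with the sketch below. The mapping rule and the name `assign_channels_evenly` are assumptions made for illustration; this is not a description of how the Flink engine actually wires channels.

```python
from typing import Dict, List


def assign_channels_evenly(upstream_parallelism: int,
                           downstream_parallelism: int) -> Dict[int, List[int]]:
    """Map each downstream instance index to the upstream instances feeding it.

    Upstream instances are split into contiguous, near-equal groups, one
    group per downstream instance (a hypothetical allocation rule).
    """
    if upstream_parallelism <= 0 or downstream_parallelism <= 0:
        raise ValueError("parallelism must be positive")
    channels: Dict[int, List[int]] = {d: [] for d in range(downstream_parallelism)}
    for u in range(upstream_parallelism):
        d = u * downstream_parallelism // upstream_parallelism
        channels[d].append(u)
    return channels
```

Calling `assign_channels_evenly(6, 2)` reproduces the Figure 2 layout described earlier, where each of the two downstream instances is connected to three upstream instances.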
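Each operator needs to record its own upstream and downstream channels; accordingly, the client also records the channels between operators, that is, between operator instances.
The identifiers mentioned above include information such as names and IDs. For example, an operator identifier is an operator ID.
Optionally, the data processing rate of an operator may also include a second data production rate, which represents the rate at which the operator's upstream operator produces data, that is, the rate at which the input buffer is filled.
Optionally, the data processing rate of an operator is the data processing rate of its operator instances. Specifically, the data consumption rate of an operator instance represents the rate at which the operator instance consumes the data produced by upstream operator instances, and the first data production rate represents the rate at which the operator instance produces data.
Optionally, each operator instance corresponds to one input buffer and one outgoing buffer. The total size of the input and outgoing buffers can change, that is, the buffers can be expanded or shrunk. The master control server can obtain the input buffer information and outgoing buffer information of an operator instance in real time or periodically. The input buffer information includes the current total size and the remaining size of the input buffer; likewise, the outgoing buffer information includes the current total size and the remaining size of the outgoing buffer.
Optionally, the client corresponding to an operator records the input buffer information and outgoing buffer information of each operator instance of the operator and sends them to the master control server.
Optionally, after obtaining the data processing rates of an operator, that is, of each of its operator instances, the master control server can save them to a target location for aggregation. The first preset unit time may be at the level of seconds; accordingly, the data consumption rate represents the amount of data produced by the upstream operator (upstream operator instance) that the operator (operator instance) consumes per second, and the first data production rate represents the amount of data the operator produces per second. When data processing rates are aggregated, they are aggregated by a second preset unit time, which is at the level of minutes, hours, and so on. For example, when the second preset unit time is at the minute level, the aggregated first data production rate represents the amount of data the operator produces per minute.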
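A minimal sketch of the second-to-minute aggregation is shown below, assuming each sample is a pair of a timestamp in seconds and the amount of data produced during that second. The function name `aggregate_per_minute` and the sample layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_per_minute(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Roll per-second amounts up into per-minute totals.

    samples: (timestamp_in_seconds, amount_produced_in_that_second).
    Returns a mapping from minute index (timestamp // 60) to the total
    amount produced in that minute.
    """
    totals: Dict[int, float] = defaultdict(float)
    for timestamp, amount in samples:
        totals[timestamp // 60] += amount
    return dict(totals)
```

The same grouping idea extends to hour-level aggregation by dividing the timestamp by 3600 instead of 60.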
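The target location includes a database, ES, or any other facility capable of storing data.
In this embodiment, after the data processing rates are obtained, they are aggregated so that business and operation and maintenance personnel can refer to the aggregated rates when tuning the Flink system (for example, the parallelism of operators) to ensure its performance. Tuning can of course also be done directly with the unaggregated data processing rates.
Step 102: Determine at least one working state of the operator according to the data processing rates corresponding to the operator and/or its upstream operator.
In this embodiment, for each operator, after its data processing rate is obtained, the data processing rate of the operator and/or that of its upstream operator is used to determine whether the operator's ability to process data (for example, to produce or consume it) is abnormal in each dimension. This gives the working state of the operator in each dimension, that is, each working state of the operator. A working state indicates whether the operator's data processing ability is abnormal, so it can be determined whether an abnormal operator exists, which enables intelligent monitoring of operators.
Optionally, the dimensions include a first dimension and/or a second dimension, where the first dimension is the production dimension and the second dimension is the consumption dimension. Accordingly, the working states of an operator include a first working state and/or a second working state. The first working state indicates whether the data produced by the upstream operator is evenly distributed, that is, whether the amounts of data produced by all upstream operator instances of the operator differ only slightly. The second working state indicates whether the data consumption capability is normal, that is, whether the amounts of data consumed by the operator's instances differ only slightly.
Optionally, when determining the working state of any operator, at least one working state of the first operator is determined according to the error values between the data processing rates respectively corresponding to the operator instances of the first operator and/or the second operator. The first operator is any one of the operators, and the second operator is the upstream operator of the first operator.
Specifically, when the working states include the first working state, the first working state of the first operator is determined from the error values between the data processing rates of the operator instances of its upstream operator (the second operator).
When the working states include the second working state, the second working state of the first operator is determined from the error values between the data processing rates of the first operator's own operator instances. Specifically, the second working state of each operator instance of the first operator can be determined.
When the working states include both the first working state and the second working state, the first working state of the first operator is determined from the error values between the data processing rates of the operator instances of its upstream operator, and at the same time the second working state of the first operator is determined from the error values between the data processing rates of its own operator instances.
Step 103: In response to any working state of the operator indicating that the operator is abnormal, perform a corresponding alarm operation.
In this embodiment, after the working states of an operator are determined, if any working state indicates that the operator is abnormal, the operator is a faulty operator and a corresponding alarm operation is performed, so that relevant personnel can discover the faulty operator and resolve the fault in time, ensuring the normal operation and performance of the distributed processing system.
As described above, the distributed processing system includes multiple operators with upstream and downstream relationships. The data processing rate of each operator involved in processing a distributed task is obtained; it includes the data consumption rate and/or the first data production rate, where the data consumption rate represents the rate at which an operator consumes the data produced by its upstream operator and the first data production rate represents the rate at which an operator produces data. Whether an operator is abnormal in producing and/or consuming data is determined from its data consumption rate and/or first data production rate, which gives the working state of the operator in at least one dimension, so that faulty operators are discovered in time. When the working state of an operator in some dimension indicates that the operator is abnormal, a corresponding alarm operation is performed. This keeps alarms timely and enables intelligent monitoring of the distributed processing system, so that abnormalities can be resolved promptly and the robustness and maintainability of the distributed processing system are improved.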
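As shown in Figure 5, Figure 5 is a flowchart of another monitoring method for a distributed processing system according to an embodiment of the present disclosure. The process of determining the first working state of an operator is described in detail below with reference to a specific embodiment. The method includes the following steps:
Step 501: For a distributed task, obtain the data processing rate corresponding to the operator. The data processing rate includes the first data production rate, which represents the rate at which an operator produces data.
Step 502: Calculate the first error value between any two of the first data production rates respectively corresponding to the operator instances of the upstream operator of the first operator, that is, the second operator. The first operator is any one of the operators described above.
In this embodiment, each operator in the distributed processing system is taken in turn as the first operator, and its upstream operator is taken as the second operator. The first data production rate of each operator instance of the second operator (each second operator instance) is obtained. For each second operator instance, the difference between its first data production rate and that of each other second operator instance is calculated and used as a first error value of that second operator instance. A first error value indicates the difference in the amount of data produced by two second operator instances within the first preset unit time, that is, the difference in their data production capability.
Step 503: Determine the first working state of the first operator according to the first error value.
In this embodiment, after all first error values of the second operator are obtained, that is, after the first error values of every second operator instance are obtained, they are used to determine whether the data production capabilities of the second operator instances differ only slightly, that is, whether the data produced by the first operator's upstream operator is evenly distributed, which gives the first working state of the first operator.
Optionally, determining the first working state of the first operator according to the first error value includes:
When any first error value reaches the first preset value, determining that the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed.
When none of the first error values reaches the first preset value, determining that the first working state of the first operator indicates that the data produced by the upstream operator is evenly distributed.
Here, "reaches" means greater than and/or equal to, and "does not reach" means less than.
Specifically, when the first error value between the first data production rates of two second operator instances reaches the first preset value, the data production capabilities of the two instances differ considerably: one produces a large amount of data and the other a small amount. The data produced by the operator instances of the first operator's upstream operator is therefore unevenly distributed, that is, the first working state of the first operator indicates that the data produced by its upstream operator is unevenly distributed.
When none of the first error values reaches the first preset value, that is, when the first error value between the first data production rates of any two second operator instances does not reach the first preset value, the data production capabilities of any two second operator instances differ only slightly. The data produced by the operator instances of the first operator's upstream operator is therefore evenly distributed, that is, the first working state of the first operator indicates that the data produced by its upstream operator is evenly distributed.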
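Steps 502 and 503 can be illustrated with a short sketch. It assumes the first error value is the absolute difference between two production rates (the text says only "difference") and reads "reaches" as greater than or equal to; the name `upstream_distribution_is_uneven` is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def upstream_distribution_is_uneven(production_rates: Sequence[float],
                                    first_preset_value: float) -> bool:
    """Return True when any pair of upstream instances differs too much.

    production_rates holds the first data production rate of each second
    operator instance. The pairwise absolute difference plays the role of
    the first error value (an assumption about the sign convention).
    """
    return any(
        abs(rate_a - rate_b) >= first_preset_value
        for rate_a, rate_b in combinations(production_rates, 2)
    )
```

A result of `True` corresponds to a first working state that indicates unevenly distributed upstream data; `False` corresponds to evenly distributed data. With a single upstream instance there are no pairs, so the sketch reports an even distribution.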
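Other mathematical calculations can of course be based on the first error value, with the resulting value used to determine the first working state of the first operator. For example, after the first error value between the first data production rates of two second operator instances is obtained, the ratio of that first error value to the first data production rate of either of the two second operator instances is used as the error value. When this error value reaches a preset value, the first working state of the first operator is determined to indicate that the data produced by the upstream operator is unevenly distributed; otherwise, it is determined to indicate that the data produced by the upstream operator is evenly distributed.
Optionally, the first working state of each operator instance of the first operator can be determined first and then used to determine the first working state of the first operator. The process is as follows: for each operator instance of the first operator, the second operator instances corresponding to that operator instance are obtained. When none of the first error values between these second operator instances reaches the first preset value, the data production capabilities of the upstream operator instances of that operator instance differ only slightly, and the first working state of the operator instance is determined to indicate that the data produced by the upstream operator is evenly distributed.
When a first error value between the second operator instances corresponding to the operator instance reaches the first preset value, the data production capabilities of the upstream operator instances of that operator instance differ considerably, and the first working state of the operator instance is determined to indicate that the data produced by the upstream operator is unevenly distributed, so the first working state of the operator is determined to indicate that the data produced by the upstream operator is unevenly distributed.
In this embodiment, when the first working state of any operator instance of the first operator indicates that the data produced by the upstream operator is unevenly distributed, the first working state of the first operator is determined to indicate that the data produced by the upstream operator is unevenly distributed. When the first working states of all operator instances of the first operator indicate that the data produced by the upstream operator is evenly distributed, the first working state of the first operator is determined to indicate that the data produced by the upstream operator is evenly distributed.
Optionally, in addition to determining the first working state of the first operator from the first error value, the outgoing buffer corresponding to the second operator can also be used. The specific process includes:
Obtaining the expansion rate of the outgoing buffer corresponding to the second operator. The outgoing buffer is used to store the data produced by the second operator.
When a first error value reaches the first preset value and the expansion rate reaches the first preset rate, determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.
Specifically, for each second operator instance, when all first error values of that second operator instance reach the first preset value, the instance may be producing too much data, so the expansion rate of its outgoing buffer is further obtained. The expansion rate indicates the growth rate of the total size of the outgoing buffer. For example, if the total size of the outgoing buffer of a second operator instance is 100MB at time 1 and 200MB at time 2, then the expansion rate of that outgoing buffer is (200MB-100MB)/(time 2-time 1).
When the expansion rate of the outgoing buffer of a second operator instance reaches the first preset rate, the buffer is expanding too quickly, which means the second operator instance is producing too much data. The data produced by that second operator instance is then determined to be unevenly distributed, so the first working state of the downstream operator instance connected to it, that is, an operator instance of the first operator, is determined to indicate that the data produced by the upstream operator is unevenly distributed, and so is the first working state of the first operator.
When the expansion rate of the outgoing buffer of a second operator instance does not reach the first preset rate, the data produced by that second operator instance is determined to be evenly distributed. Optionally, for each operator instance of the first operator, its first working state can be determined by the same process used above to determine the first working state of the first operator.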
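The expansion-rate formula and the combined condition can be sketched as follows. Both function names are hypothetical, sizes are in MB and times in seconds purely for illustration, and the combined check follows the "a first error value reaches the first preset value and the expansion rate reaches the first preset rate" wording.

```python
from typing import Sequence


def expansion_rate(size_at_t1: float, size_at_t2: float,
                   t1: float, t2: float) -> float:
    """Growth rate of a buffer's total size between two observations."""
    if t2 <= t1:
        raise ValueError("t2 must be later than t1")
    return (size_at_t2 - size_at_t1) / (t2 - t1)


def uneven_with_buffer_check(first_error_values: Sequence[float],
                             first_preset_value: float,
                             outgoing_buffer_rate: float,
                             first_preset_rate: float) -> bool:
    """Flag uneven distribution only when both conditions hold."""
    error_reached = any(e >= first_preset_value for e in first_error_values)
    return error_reached and outgoing_buffer_rate >= first_preset_rate
```

Requiring both conditions makes the check stricter than the error value alone: a large rate difference without a fast-growing outgoing buffer does not trigger the uneven state in this sketch.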
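Step 504: In response to the first working state of the first operator indicating that the operator is abnormal, perform a corresponding alarm operation.
In this embodiment, after the first working state of the first operator is obtained, an alarm operation is performed when it indicates that the data produced by the upstream operator is unevenly distributed, so that the alarm is raised in time.
Optionally, when an alarm operation is performed according to the first working state of the first operator, the alarm can be raised in either of the following two ways.
In one way, when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, first alarm information is output. The first alarm information is used to prompt an increase in the parallelism of the downstream operator.
Specifically, when the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed, that is, when the first working state of an operator instance of the first operator indicates this, there is an upstream operator instance of the first operator (a second operator instance) that produces a relatively large amount of data. First alarm information is output to indicate that the first operator, that is, the operator instance, has hit a computing bottleneck: the upstream operator produces data faster than the downstream operator can consume it, and more consumers are needed to meet business requirements. In other words, the parallelism of the first operator needs to be increased, and relevant personnel are prompted to increase the number of downstream operator instances connected to the second operator instance whose produced data is unevenly distributed. For example, the first operator includes 2 operator instances, instance 1 and instance 2. Its upstream operator, the second operator, includes 6 second operator instances; instance 1 is connected to 3 of them and instance 2 to the other 3. If the data produced by one second operator instance connected to instance 1 is unevenly distributed, first alarm information is output so that the parallelism of the downstream operator instances connected to that second operator instance is increased. For example, the first operator adds instance 3, which is also connected to that second operator instance and also consumes the data it produces, increasing the number of consumers for that second operator instance.
In the other way, when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, each third operator located above the first operator is determined based on the execution flow of the distributed task. The first working state of each third operator is obtained. When the first working states of all third operators indicate that the data produced by the upstream operator is unevenly distributed, the downstream operators of the source operator of the distributed task other than the first downstream operator are determined, where the first downstream operator is a downstream operator of the source operator and is one of the third operators. An alarm operation is performed according to the data consumption rates of the other downstream operators.
Specifically, when the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed, the execution plan is traversed upward with the first operator as the child node; that is, the traversal goes upward along the job's execution flow, in other words along the operator chain containing the first operator, until the source operator at the very top is reached. The operators traversed (excluding the source operator) are taken as third operators. When the first working states of all third operators indicate that the data produced by the upstream operator is unevenly distributed, the data volume in some partitions of the source data is too high, that is, the partition consumed by the first downstream operator contains too much data, which overloads the first downstream operator and the operators in its operator chain. The other downstream operators connected to the source operator are then obtained, so that their data consumption rates can be used to determine whether they can process more data, and a corresponding alarm operation is performed.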
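The upward traversal can be sketched for the simple case of a linear operator chain, where each operator has at most one upstream operator. The `upstream_of` mapping, both function names, and the single-upstream restriction are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def third_operators(first_operator: str,
                    upstream_of: Dict[str, Optional[str]]) -> List[str]:
    """Collect the operators above first_operator, excluding the source.

    upstream_of maps an operator to its single upstream operator, or None
    for the source operator (a hypothetical representation of the plan).
    """
    collected: List[str] = []
    current = upstream_of.get(first_operator)
    # Stop once the source (the operator with no upstream) is reached.
    while current is not None and upstream_of.get(current) is not None:
        collected.append(current)
        current = upstream_of.get(current)
    return collected


def source_partition_overloaded(first_operator: str,
                                upstream_of: Dict[str, Optional[str]],
                                uneven: Dict[str, bool]) -> bool:
    """True when every third operator also reports uneven upstream data."""
    operators = third_operators(first_operator, upstream_of)
    return bool(operators) and all(uneven[name] for name in operators)
```

For a chain A → a1 → a2 → a3 with a3 as the first operator, the traversal yields a2 and a1, and the source operator A is skipped because it has no upstream operator.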
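Take a specific application scenario as an example. In the connection relationship between operators shown in Figure 6, operator a1 consumes the data in partition 1 of the message queue and operator b1 consumes the data in partition 2. When the first working state of operator a3 indicates that the data produced by the upstream operator is unevenly distributed, the working states of operators a1 and a2 in the first dimension are determined in turn. Since source operator A has no upstream operator, there is no need to obtain its first working state. When the working states of a1 and a2 in the first dimension both indicate that the data produced by the upstream operator is unevenly distributed, partition 1 contains too much data; operator a1 is then taken as the first downstream operator and operator b1 as the other downstream operator, and the data consumption rate of b1 is used for the alarm operation.
For ease of description, operators a1, a2, and a3 each correspond to one operator instance, and partition 1 refers to the input buffer of a1. Operators b1, b2, and b3 each correspond to one operator instance, and partition 2 refers to the input buffer of b1.
In addition, when operator a1 corresponds to multiple operator instances, partition 1 corresponds to multiple sub-partitions, that is, input buffers, and each operator instance of a1 consumes the data in one input buffer.
It can be understood that the first operator and the third operators may each refer to operator instances; accordingly, the first downstream operator and the other downstream operators may also refer to operator instances.
Optionally, performing an alarm operation according to the data consumption rates of the other downstream operators includes:
Determining a second error value between the data consumption rate of the other downstream operator and the first data production rate of the source operator.
When the second error value reaches the second preset value, outputting second alarm information. The second alarm information is used to prompt that there is too much data at the source and that the parallelism of the downstream operator should be increased.
Optionally, when the second error value does not reach the third preset value, a first channel reallocation operation is performed. The first channel reallocation operation instructs the other downstream operators to consume the data in the input buffer corresponding to the first downstream operator. The data in the input buffer is data produced by the source operator.
Specifically, for each other downstream operator, the difference between its data consumption rate and the first data production rate of the source operator is calculated and used as the second error value of that operator. When the second error value does not reach the third preset value, the other downstream operator can consume more data, and channels can be reallocated according to how fast the downstream operators consume data; that is, the first channel reallocation operation is performed so that other downstream operators with higher data consumption rates consume from the input buffer of the first downstream operator.
When all second error values reach the second preset value, all other downstream operators are also under heavy computing pressure and cannot consume more data, so second alarm information is output to prompt that there is too much data at the source and that the parallelism of the downstream operator should be increased, that is, the parallelism of the first downstream operator.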
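Optionally, the data consumption rate of an other downstream operator may refer to the data consumption rate of each of its operator instances, and the first data production rate of the source operator may refer to the first data production rate of the source operator instance connected to that operator instance. Correspondingly, a second error value can be determined for each such operator instance. When the second error value of an operator instance does not reach the third preset value, that instance can consume more data; it is then made to consume the data of the affected channel, i.e. the data in the inbound buffer of the first downstream operator. In other words, a channel is established between that operator instance and the source operator, so that channels are connected automatically and intelligently.
The second preset value and the third preset value may be the same or different.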
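The choice between the first channel reallocation and the second alarm can be illustrated with a short sketch. The text only says the second error value is a "difference", so the sign convention used here (fill rate minus consumption rate) is an assumption, as are the function name, the return labels and the example thresholds.

```python
from typing import Dict, List, Tuple


def decide_source_skew_action(consume_rates: Dict[str, float],
                              fill_rates: Dict[str, float],
                              second_preset: float,
                              third_preset: float) -> Tuple[str, List[str]]:
    """Decide what to do once the source partitions are known to be skewed.

    consume_rates: consumption rate of each other downstream operator (or instance).
    fill_rates: rate at which the source fills that operator's inbound buffer.
    """
    # Assumed convention: second error = source fill rate - consumption rate,
    # so a small (or negative) error means the consumer has spare capacity.
    errors = {op: fill_rates[op] - consume_rates[op] for op in consume_rates}

    with_headroom = [op for op, err in errors.items() if err < third_preset]
    if with_headroom:
        # Prefer the fastest consumers when reassigning the overloaded buffer.
        with_headroom.sort(key=lambda op: consume_rates[op], reverse=True)
        return "reallocate", with_headroom

    if all(err >= second_preset for err in errors.values()):
        return "alarm", []  # second alarm: too much source data, raise parallelism
    return "none", []
```

For example, with a third preset value of 5 and a second preset value of 30, an operator b1 that consumes 100 records/s from a partition filled at 90 records/s is selected for reallocation, whereas one that consumes only 50 records/s leads to the second alarm.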
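It will be understood that establishing a channel between this operator instance and the source operator actually means establishing a channel connection between this operator instance and the operator instance of the first downstream operator whose inbound buffer holds too much data, i.e. rebuilding the channel between the upstream slot and the downstream slot.
Optionally, the channel between the first downstream operator and the source operator may also be closed, i.e. the affected downstream operator instance of the source operator is shut off.
Continuing the application scenario above, the difference between the data consumption rate of operator b1 and the first data production rate of source operator A is calculated, i.e. the difference between the rate at which b1 consumes data from partition 2 and the rate at which source operator A fills partition 2, giving the second error value. When this second error value does not reach the third preset value, b1 has ample consumption capacity and can consume more data, so b1 is made to consume the data in partition 1.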
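Optionally, when the second alarm information is output, the minimum of the data consumption rates of the other downstream operators and of the first downstream operator is obtained. A rate-limiting instruction is generated from this minimum and sent to the source operator, so that the source operator adjusts its first data production rate with the minimum as the baseline. This prevents system-wide backpressure from affecting mechanisms such as checkpoints and watermarks.
Optionally, when the first data production rate of the source operator is adjusted with the minimum as the baseline, the adjustment may follow a preset adjustment rule, for example setting the first data production rate of the source operator to the minimum. The adjustment rule is not limited here.
It will be understood that adjusting the first data production rate of the source operator means adjusting the first data production rate of each operator instance of the source operator.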
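As a small illustration of the rate-limiting step, the sketch below builds an instruction from the minimum consumption rate. The dictionary layout and field names are hypothetical; the text does not specify the format of the rate-limiting instruction, and the rule shown (limit equals the minimum) is just the example rule mentioned above.

```python
from typing import Dict, List, Union


def build_rate_limit_instruction(first_downstream_rate: float,
                                 other_downstream_rates: List[float]) -> Dict[str, Union[str, float]]:
    """Build a rate-limiting instruction for the source operator.

    The limit is the smallest consumption rate among the first downstream
    operator and the other downstream operators.
    """
    limit = min([first_downstream_rate, *other_downstream_rates])
    # Hypothetical message format; each source operator instance would apply it.
    return {"command": "limit_rate", "records_per_second": limit}
```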
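Optionally, when the first working states of all third operators indicate that the data produced by their upstream operators is evenly distributed, a second channel reallocation operation is performed. The second channel reallocation operation directs a normal operator instance to consume the data in the inbound buffer of an abnormal operator instance. A normal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is evenly distributed; an abnormal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is unevenly distributed.
Specifically, when the first working states of all third operators indicate even distribution, only the data produced by the upstream operator of the first operator is unevenly distributed; that is, the upstream operator produces data faster than the first operator can consume it. The operator instances of the first operator whose first working state indicates even distribution are then identified and taken as normal operator instances. For each normal operator instance, the difference between its data consumption rate and the corresponding first data production rate is calculated to obtain a fourth error value, which is used to decide whether the normal operator instance can consume more data. If it can, the operator instances of the first operator whose first working state indicates uneven distribution are taken as abnormal operator instances, and a channel is established between the normal operator instance and the upstream operator instance of the abnormal operator instance, i.e. the upstream operator instance that produces too much data. The normal operator instance can then consume data produced by that upstream operator instance, which relieves the consumption pressure on the abnormal operator instance. Channels are thus reallocated according to data consumption capacity, preserving the data processing performance of the distributed system and keeping the business running normally.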
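Optionally, when the data produced by the upstream operator of the first operator is based on a keystream (keyed stream), and the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed, the key values of the data produced by the upstream operator are recorded; that is, the key values of the data produced by the second operator instance whose data is unevenly distributed are determined, so that the keys of the skewed data are recorded. The key values are then output so that the relevant personnel can identify the skewed data and reset the key values that the first operator must process, i.e. reset the flow of the data. This avoids a computing bottleneck caused by unreasonable key settings that force the first operator to consume too much data.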
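One way the recorded keys could be summarized for the operator is to count how often each key appears and report the most frequent ones. The text only requires that the key values be recorded and output; ranking them by frequency, and the function name `skewed_keys`, are assumptions made for this sketch.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def skewed_keys(keys: Iterable[str], top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the top_n most frequent keys seen in the skewed instance's output."""
    return Counter(keys).most_common(top_n)
```

Given the keys `["u1", "u2", "u1", "u3", "u1", "u2"]` and `top_n=2`, the call reports `u1` (3 records) and `u2` (2 records) as the candidates for re-keying.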
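In this embodiment, the master server may also obtain the expansion rate of the buffer of each operator instance, where the buffer includes an outbound buffer and/or an inbound buffer. When the expansion rate reaches a second preset rate, the buffer is expanding too fast and its expansion must be limited; once the total size of the buffer has grown to a preset size, the buffer is made to stop expanding, which realizes buffer pool (bufferpool) pre-allocation.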
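A minimal sketch of the buffer check follows, assuming the expansion rate is measured as the change in buffer size between two samples divided by the elapsed time. The `BufferSample` type, the units and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BufferSample:
    timestamp_s: float  # when the buffer size was sampled, in seconds
    size_bytes: int     # total buffer size at that moment


def expansion_rate(prev: BufferSample, curr: BufferSample) -> float:
    """Bytes per second by which the buffer grew between two samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.size_bytes - prev.size_bytes) / elapsed


def should_stop_expanding(prev: BufferSample, curr: BufferSample,
                          second_preset_rate: float,
                          preset_size_bytes: int) -> bool:
    """Stop expansion once a fast-growing buffer has reached the preset size."""
    too_fast = expansion_rate(prev, curr) >= second_preset_rate
    return too_fast and curr.size_bytes >= preset_size_bytes
```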
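In this embodiment, channels can be reallocated according to the data processing rates of the operators without affecting the business, so traffic is switched automatically without intervention by business or operations personnel. This avoids the situation where operations staff notice a problem only after the distributed processing system is under backpressure and a large number of checkpoints have failed, and it preserves the performance of the distributed processing system.
In this embodiment, the data processing rates of an operator, i.e. of an operator instance, are also compared horizontally: all data processing rates of the operator instance within a set time are obtained, and if none of them reaches the historical average processing rate, the operator instance has become a slow node. Slow-operator alarm information is then output so that the relevant personnel can maintain the slow operator. The historical average processing rate is the mean of the collected data processing rates.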
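The slow-node check can be sketched as below. The function name and the choice to return `False` when either list is empty are assumptions; the text does not say how an empty window should be handled.

```python
from statistics import mean
from typing import Sequence


def is_slow_instance(window_rates: Sequence[float],
                     history_rates: Sequence[float]) -> bool:
    """True if every rate in the set time window is below the historical average.

    window_rates: processing rates sampled within the set time.
    history_rates: previously collected rates used to compute the average.
    """
    if not window_rates or not history_rates:
        return False  # not enough data to judge (assumed behaviour)
    historical_average = mean(history_rates)
    return all(rate < historical_average for rate in window_rates)
```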
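In this embodiment, the working state of an operator in the first dimension is determined from the data processing rate of its upstream operator; that is, it is determined whether the data produced by the upstream operator is evenly distributed and hence whether the operator's parallelism is reasonable. This in turn determines whether to raise the corresponding alarm, i.e. whether to prompt an adjustment of the operator's parallelism, so that abnormalities are corrected in time and the performance of distributed processing is maintained.
In this embodiment, take a Kafka message queue at the source as an example. Data is unevenly spread across Kafka partitions and some partitions hold a large amount of data, so the Flink Kafka consumer, i.e. the downstream operator of the source operator, hits a bottleneck when consuming a large partition. Therefore, when the first working state of the first operator indicates that the data produced by the upstream operator is unevenly distributed, it is determined whether uneven source partitions are overloading the downstream operator, and the corresponding alarm operation is performed, for example prompting an increase in the parallelism of the downstream operator of the source operator, to avoid blocking the system.
In this embodiment, when the business logic includes a keystream, an unreasonable partition key setting can likewise cause a computing bottleneck in the downstream operator. The key values of the skewed data are therefore recorded so that the relevant personnel can resolve the problem as soon as possible.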
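In this embodiment, abnormalities in the system can be detected in time from an operations metric, namely the data processing rates of the operators. Before a serious problem arises, operations personnel are told whether the parallelism is set unreasonably and whether the job suffers from key skew or unreasonable key settings, so that they can resolve problems in the distributed processing system early and keep the system reliable.
In this embodiment, each operator is taken in turn as the first operator, and the first data production rates of the operator instances of the second operator located above it are obtained. From these rates it is determined whether the instances differ too much in their ability to produce data, i.e. whether some instance produces too much or too little data. This determines whether the data produced by the operator instances of the upstream operator is evenly distributed and thus yields the first working state of the first operator accurately. When the first working state indicates that the data produced by the upstream operator is unevenly distributed, an alarm operation is performed in time so that the relevant personnel can promptly resolve the uneven distribution.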
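The pairwise comparison that yields the first working state can be sketched as follows. Taking the first error value as the absolute difference between two production rates is an assumption (the text only speaks of an error value between any two rates), and the function name is hypothetical.

```python
from itertools import combinations
from typing import Dict


def data_unevenly_distributed(production_rates: Dict[str, float],
                              first_preset: float) -> bool:
    """First working state of the first operator.

    production_rates: first data production rate of each second operator instance.
    Returns True (uneven) if any pair of rates differs by at least first_preset.
    """
    return any(abs(a - b) >= first_preset
               for a, b in combinations(production_rates.values(), 2))
```

For example, rates of 100, 105 and 180 records/s with a first preset value of 50 give an uneven state, because the third instance produces far more data than the other two.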
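FIG. 7 is a flowchart of yet another monitoring method for a distributed processing system according to an embodiment of the present disclosure. The process of determining the second working state of an operator is described in detail below with reference to a specific embodiment. As shown in FIG. 7, the method includes the following steps:
Step 701: for a distributed task, obtain the data processing rates of the operators. The data processing rate includes a data consumption rate, which is the rate at which an operator consumes the data produced by its upstream operator.
Step 702: calculate a third error value between any two of the data consumption rates of the first operator. The first operator is any one of the operators above.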
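In this embodiment, each operator in the distributed processing system is taken in turn as the first operator, and the data consumption rate of each operator instance of the first operator is obtained.
For each operator instance of the first operator, the difference between its data consumption rate and the data consumption rate of each other operator instance of the first operator is calculated and taken as a third error value of that instance. The third error value indicates the difference in the amount of data consumed by the two operator instances of the first operator within a first preset unit of time, i.e. the difference in the data consumption capacity of the two operator instances.
Specifically, the data consumption rate of an operator instance is its data consumption rate on its slot.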
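Step 703: determine, according to the third error values, the second working state of each operator instance of the first operator.
In this embodiment, for each operator instance, when the third error values between it and all the other operator instances reach a fourth preset value, the instance has poor data consumption capacity and its slot has a computing bottleneck; its second working state is determined to indicate abnormal data consumption capacity. Otherwise its second working state is determined to indicate normal data consumption capacity.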
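Steps 702 and 703 can be illustrated with the sketch below. It assumes the third error value is signed (another instance's rate minus this instance's rate), so that only instances slower than all their peers are flagged; the text itself only speaks of a difference. The state labels and function name are hypothetical.

```python
from typing import Dict


def second_working_states(consumption_rates: Dict[str, float],
                          fourth_preset: float) -> Dict[str, str]:
    """Second working state of each operator instance of the first operator."""
    states: Dict[str, str] = {}
    for instance, rate in consumption_rates.items():
        other_rates = [r for other, r in consumption_rates.items() if other != instance]
        # Abnormal only if this instance lags behind every other instance
        # by at least the fourth preset value.
        lagging = bool(other_rates) and all(r - rate >= fourth_preset
                                            for r in other_rates)
        states[instance] = "abnormal" if lagging else "normal"
    return states
```

With consumption rates of 100, 95 and 40 records/s and a fourth preset value of 30, only the third instance is marked abnormal.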
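Step 704: in response to the second working state of an operator instance indicating abnormal data consumption capacity, perform a corresponding alarm operation.
In this embodiment, after the second working states of the operator instances of the first operator are obtained, an operator instance whose second working state indicates abnormal data consumption capacity is abnormal, and the corresponding alarm operation is performed in time.
Optionally, when an alarm operation is performed based on the second working state of an operator instance, the alarm may be raised in either of the following two ways.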
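In one way, when the second working state of the operator instance indicates abnormal data consumption capacity, the target task manager to which the instance's slot belongs is determined. All first operator instances contained in the target task manager, other than that operator instance, are determined, and the second working state of each first operator instance is obtained. The corresponding alarm operation is performed according to these second working states.
Specifically, when the second working state of an operator instance of the first operator indicates abnormal data consumption capacity, that instance, and hence its slot, has a computing bottleneck. To judge whether the bottleneck is caused by a fault of the task manager to which the slot belongs (the target task manager), the other operator instances on the target task manager are obtained and taken as first operator instances. Their second working states are used to determine why the operator instance of the first operator shows abnormal data consumption capacity, i.e. why the slot has a computing bottleneck, so that the problem is located accurately and the alarm is precise.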
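Optionally, when the alarm is raised based on the second working states of the first operator instances, the ratio of the number of first operator instances whose second working state indicates abnormal data consumption capacity to the total number of first operator instances is calculated to obtain the abnormal first operator instance ratio. When this ratio reaches a first preset ratio, the target task manager is faulty, which has caused problems for operator instances on more than one slot and has affected multiple jobs. Target task manager fault prompt information is then output to tell the relevant personnel that the target task manager is faulty, locating the problem precisely.
When the ratio does not reach the first preset ratio, only the operator instance of the first operator whose second working state indicates abnormal data consumption capacity has a computing bottleneck, i.e. only the slot to which it belongs. Operator instance abnormality prompt information is then output to tell the relevant personnel that this operator instance is abnormal.
Optionally, the target task manager fault prompt information may also include the names of the affected jobs and the identifiers of the affected operators, i.e. the identifiers of all operator instances on the target task manager that indicate abnormal data consumption capacity.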
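The ratio-based diagnosis can be sketched as follows. The return labels and the handling of a task manager that hosts no other instances are assumptions made for illustration.

```python
from typing import Sequence


def diagnose_task_manager(first_instance_states: Sequence[str],
                          first_preset_ratio: float) -> str:
    """Decide which prompt to output for an abnormal operator instance.

    first_instance_states: second working states ("abnormal" or "normal") of the
    other operator instances hosted on the same target task manager.
    """
    if not first_instance_states:
        # No co-located instances to compare against (assumed behaviour).
        return "operator_instance_abnormal"
    abnormal = sum(1 for state in first_instance_states if state == "abnormal")
    ratio = abnormal / len(first_instance_states)
    if ratio >= first_preset_ratio:
        return "task_manager_fault"
    return "operator_instance_abnormal"
```

For example, if three of the four co-located instances are abnormal and the first preset ratio is 0.5, the sketch reports a task manager fault rather than a single-instance problem.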
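In the other way, when the operator instance indicates abnormal data consumption capacity, operator instance abnormality prompt information is output directly to inform the relevant personnel that the operator instance is abnormal.
In this embodiment, abnormalities in the system can be detected in time and faults located precisely from an operations metric, namely the data consumption rates of the operators. Before a serious problem arises, operations personnel are told which task managers (taskmanager) and slots in the system are faulty, so that they can resolve problems in the distributed processing system early and keep the system reliable.
In this embodiment, each operator is taken in turn as the first operator, and the data consumption rates of its operator instances are obtained. From these rates it is determined whether the instances differ too much in their ability to consume data, i.e. whether some instance consumes too much or too little data. This determines whether the data consumption capacity of each instance is normal and thus yields the second working state of each operator instance of the first operator accurately. When the second working state of an operator instance indicates abnormal data consumption capacity, an alarm operation is performed in time so that the relevant personnel can promptly resolve the problem.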
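Corresponding to the foregoing method embodiments, the present disclosure also provides embodiments of an apparatus and of the computer device to which it is applied.
Embodiments of the monitoring apparatus for a distributed processing system of the present disclosure may be applied to a computer device, for example a terminal device (such as a server or a computer). The apparatus embodiments may be implemented in software, in hardware, or in a combination of both. Taking a software implementation as an example, the apparatus in the logical sense is formed when the processor of the computer device on which it resides reads the corresponding computer program instructions from non-volatile memory into main memory and runs them. At the hardware level, FIG. 8 shows a hardware structure diagram of the computer device on which the monitoring apparatus of an embodiment of the present disclosure resides. In addition to the processor 810, memory 830, network interface 820 and non-volatile memory 840 shown in FIG. 8, the computer device on which the monitoring apparatus 831 resides may include other hardware depending on its actual functions, which is not described further here.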
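FIG. 9 is a block diagram of a monitoring apparatus for a distributed processing system according to an embodiment of the present disclosure. The apparatus includes:
a rate obtaining module 910, configured to obtain, for a distributed task, the data processing rates of the operators;
a rate processing module 920, configured to determine at least one working state of an operator according to the data processing rates of the operator and/or its upstream operator; and
an alarm module 930, configured to perform a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal.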
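Optionally, an operator corresponds to at least one operator instance, and each operator instance has a corresponding data processing rate.
The rate processing module 920 is specifically configured to determine at least one working state of the first operator according to the error values between the data processing rates of the operator instances of the first operator and/or the second operator. The first operator is any one of the operators. The second operator is the upstream operator of the first operator.
Optionally, the data processing rate includes a first data production rate, which is the rate at which an operator produces data. The working state of the first operator includes a first working state.
The first working state of the first operator indicates whether the data produced by the upstream operator of the first operator is evenly distributed.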
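The rate processing module 920 is specifically configured to: calculate a first error value between any two of the first data production rates of the operator instances of the second operator; and
determine the first working state of the first operator according to the first error values.
Optionally, the rate processing module 920 is further configured to: when any first error value reaches a first preset value, determine that the first working state of the first operator indicates that the data produced by the upstream operator of the first operator is unevenly distributed; and
when no first error value reaches the first preset value, determine that the first working state of the first operator indicates that the data produced by the upstream operator of the first operator is evenly distributed.
Optionally, the rate processing module 920 is further configured to: obtain the expansion rate of the outbound buffer of the second operator, where the outbound buffer stores the data produced by the second operator; and
when any first error value reaches the first preset value and the expansion rate reaches a first preset rate, determine that the first working state of the first operator indicates that the data produced by the upstream operator of the first operator is unevenly distributed.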
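Optionally, the alarm module 930 is specifically configured to output first alarm information when the first working state of the first operator indicates that the data produced by the upstream operator of the first operator is unevenly distributed. The first alarm information prompts an increase in the parallelism of the downstream operator.
Optionally, the alarm module 930 is specifically configured to: when the first working state of the first operator indicates that the data produced by the upstream operator of the first operator is unevenly distributed, determine, based on the execution flow of the distributed task, the third operators located above the first operator;
obtain the first working state of each third operator;
when the first working states of all third operators indicate that the data produced by their upstream operators is unevenly distributed, determine the downstream operators of the source operator of the distributed task other than the first downstream operator, where the first downstream operator is a downstream operator of the source operator and is one of the third operators; and
perform an alarm operation according to the data consumption rates of the other downstream operators.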
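Optionally, upstream and downstream operators are connected through channels.
Optionally, the alarm module 930 is specifically configured to: determine a second error value between the data consumption rate of an other downstream operator and the first data production rate of the source operator; and
output second alarm information when the second error value reaches a second preset value. The second alarm information indicates that there is too much source data and prompts an increase in the parallelism of the downstream operator.
Optionally, upstream and downstream operators are connected through channels. The apparatus further includes a first channel processing module.
The first channel processing module is specifically configured to: when the first working states of all third operators indicate that the data produced by their upstream operators is unevenly distributed, determine a second error value between the data consumption rate of an other downstream operator and the first data production rate of the source operator; and
perform a first channel reallocation operation when the second error value does not reach a third preset value. The first channel reallocation operation directs the other downstream operator to consume the data in the inbound buffer of the first downstream operator. The data in that inbound buffer is data produced by the source operator.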
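Optionally, the apparatus further includes a rate limiting module.
The rate limiting module is specifically configured to: when the first working states of all third operators indicate that the data produced by their upstream operators is unevenly distributed, obtain the minimum of the data consumption rates of the other downstream operators and of the first downstream operator; and
generate a rate-limiting instruction according to the minimum and send it to the source operator, so that the source operator adjusts its first data production rate based on the minimum.
Optionally, the apparatus further includes a second channel processing module.
The second channel processing module is specifically configured to: after the first working state of each third operator is obtained, perform a second channel reallocation operation when the first working states of all third operators indicate that the data produced by their upstream operators is evenly distributed. The second channel reallocation operation directs a normal operator instance to consume the data in the inbound buffer of an abnormal operator instance. A normal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is evenly distributed; an abnormal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is unevenly distributed.
Optionally, the apparatus further includes a data recording module.
The data recording module is specifically configured to: when the first working state of the first operator indicates that the data produced by the upstream operator of the first operator is unevenly distributed, record the key values of the data produced by that upstream operator and output the key values.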
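Optionally, the data processing rate includes a data consumption rate, which is the rate at which an operator consumes the data produced by its upstream operator. The working state of the first operator includes a second working state.
The second working state of the first operator indicates whether the data consumption capacity is normal.
The rate processing module 920 is specifically configured to: calculate a third error value between any two of the data consumption rates of the first operator; and
determine, according to the third error values, the second working state of each operator instance of the first operator.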
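Optionally, the operators are scheduled onto at least one resource-group slot. Slots correspond one-to-one to operator instances.
The alarm module 930 is specifically configured to: when the second working state of an operator instance indicates abnormal data consumption capacity, determine the target task manager to which the instance's slot belongs;
determine all first operator instances contained in the target task manager other than that operator instance, and obtain the second working state of each first operator instance; and
perform the corresponding alarm operation according to the second working states of the first operator instances.
Optionally, the alarm module 930 is further configured to: calculate the ratio of the number of first operator instances whose second working state indicates abnormal data consumption capacity to the total number of first operator instances, to obtain the abnormal first operator instance ratio;
output target task manager fault prompt information when the abnormal first operator instance ratio reaches a first preset ratio; and
output operator instance abnormality prompt information when the abnormal first operator instance ratio does not reach the first preset ratio.
Optionally, the alarm module 930 is further configured to output operator instance abnormality prompt information when the second working state of an operator instance indicates abnormal data consumption capacity.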
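Optionally, an operator corresponds to at least one operator instance. Each operator instance has a corresponding buffer.
The alarm module 930 is further configured to: obtain the expansion rate of the buffer of each operator instance; and
make the buffer stop expanding when its expansion rate reaches a second preset rate.
Optionally, an operator corresponds to at least one operator instance. The alarm module 930 is further configured to: obtain all data processing rates of an operator instance collected within a set time, and obtain the historical average processing rate of that operator instance; and
output slow-operator alarm information if none of these data processing rates reaches the historical average processing rate.
For the implementation of the functions and roles of the modules in the above apparatus, see the implementation of the corresponding steps in the above method; details are not repeated here.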
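In one embodiment, the present disclosure further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described above.
In one embodiment, the present disclosure further provides a computer program product comprising a computer program which, when executed by a processor, implements the method described above.
Since the apparatus embodiments basically correspond to the method embodiments, reference may be made to the relevant description of the method embodiments. The apparatus embodiments described above are merely illustrative: modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present disclosure. A person of ordinary skill in the art can understand and implement it without creative effort.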
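Specific embodiments of the present disclosure have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the present disclosure and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses or adaptations that follow its general principles and include common general knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
The above are merely preferred embodiments of the present disclosure and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.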

Claims (21)

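  1. A monitoring method for a distributed processing system, wherein the distributed processing system comprises operators having upstream-downstream relationships;
    the method comprises:
    obtaining, for a distributed task, data processing rates corresponding to the operators;
    determining at least one working state of an operator according to the data processing rates corresponding to the operator and/or its upstream operator; and
    performing a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal.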
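  2. The method according to claim 1, wherein the operator corresponds to at least one operator instance, and each operator instance has a corresponding data processing rate;
    the determining at least one working state of the operator according to the data processing rates corresponding to the operator and/or its upstream operator comprises:
    determining at least one working state of a first operator according to error values between the data processing rates respectively corresponding to the operator instances of the first operator and/or a second operator; wherein the first operator is any one of the operators, and the second operator is the upstream operator of the first operator.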
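  3. The method according to claim 2, wherein the data processing rate comprises a first data production rate; wherein the first data production rate represents the rate at which an operator produces data; the working state of the first operator comprises a first working state;
    wherein the first working state of the first operator indicates whether the data produced by the upstream operator corresponding to the first operator is evenly distributed;
    the determining at least one working state of the first operator according to error values between the data processing rates respectively corresponding to the operator instances of the first operator and/or the second operator comprises:
    calculating a first error value between any two of the first data production rates respectively corresponding to the operator instances of the second operator; and
    determining the first working state of the first operator according to the first error value.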
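  4. The method according to claim 3, wherein the determining the first working state of the first operator according to the first error value comprises:
    when any first error value reaches a first preset value, determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed; and
    when no first error value reaches the first preset value, determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is evenly distributed.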
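  5. The method according to claim 4, wherein the determining, when any first error value reaches the first preset value, that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed comprises:
    obtaining an expansion rate of an outbound buffer corresponding to the second operator; wherein the outbound buffer is used to store the data produced by the second operator; and
    when any first error value reaches the first preset value and the expansion rate reaches a first preset rate, determining that the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed.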
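  6. The method according to claim 3, wherein the performing a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal comprises:
    outputting first alarm information when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed; wherein the first alarm information is used to prompt an increase in the parallelism of the downstream operator.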
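  7. The method according to claim 3, wherein the performing a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal comprises:
    when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, determining, based on the execution flow corresponding to the distributed task, the third operators located above the first operator;
    obtaining the first working state respectively corresponding to each third operator;
    when the first working states respectively corresponding to the third operators all indicate that the data produced by the upstream operator is unevenly distributed, determining the downstream operators, other than a first downstream operator, of the source operator corresponding to the distributed task; wherein the first downstream operator is a downstream operator of the source operator and is one of the third operators; and
    performing an alarm operation according to the data consumption rates corresponding to the other downstream operators.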
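  8. The method according to claim 7, wherein upstream operators and downstream operators are connected through channels;
    the performing an alarm operation according to the data consumption rates corresponding to the other downstream operators comprises:
    determining a second error value between the data consumption rate corresponding to the other downstream operator and the first data production rate corresponding to the source operator; and
    outputting second alarm information when the second error value reaches a second preset value; wherein the second alarm information is used to indicate that there is too much source data and to prompt an increase in the parallelism of the downstream operator.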
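  9. The method according to claim 7, wherein upstream operators and downstream operators are connected through channels; when the first working states respectively corresponding to the third operators all indicate that the data produced by the upstream operator is unevenly distributed, the method further comprises:
    determining a second error value between the data consumption rate corresponding to the other downstream operator and the first data production rate corresponding to the source operator; and
    performing a first channel reallocation operation when the second error value does not reach a third preset value; wherein the first channel reallocation operation directs the other downstream operator to consume the data in the inbound buffer corresponding to the first downstream operator; the data in the inbound buffer is data produced by the source operator.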
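  10. The method according to claim 7, wherein, when the first working states respectively corresponding to the third operators all indicate that the data produced by the upstream operator is unevenly distributed, the method further comprises:
    obtaining the minimum of the data consumption rates corresponding to the other downstream operators and the data consumption rate corresponding to the first downstream operator; and
    generating a rate-limiting instruction according to the minimum and sending the rate-limiting instruction to the source operator, so that the source operator adjusts, based on the minimum, the first data production rate corresponding to the source operator.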
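  11. The method according to claim 7, wherein, after the first working state respectively corresponding to each third operator is obtained, the method further comprises:
    performing a second channel reallocation operation when the first working states of all third operators indicate that the data produced by the upstream operator is evenly distributed; wherein the second channel reallocation operation directs a normal operator instance to consume the data in the inbound buffer corresponding to an abnormal operator instance; the normal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is evenly distributed, and the abnormal operator instance is an operator instance of the first operator whose first working state indicates that the data produced by the upstream operator is unevenly distributed.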
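  12. The method according to claim 3, wherein, when the first working state of the first operator indicates that the data produced by the upstream operator corresponding to the first operator is unevenly distributed, the method further comprises:
    recording the key values corresponding to the data produced by the upstream operator corresponding to the first operator, and outputting the key values.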
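  13. The method according to claim 2, wherein the data processing rate comprises a data consumption rate; the data consumption rate represents the rate at which an operator consumes the data produced by its upstream operator;
    the working state of the first operator comprises a second working state;
    wherein the second working state of the first operator indicates whether the data consumption capacity is normal;
    the determining at least one working state of the first operator according to error values between the data processing rates respectively corresponding to the operator instances of the first operator and/or the second operator comprises:
    calculating a third error value between any two of the data consumption rates corresponding to the first operator; and
    determining, according to the third error value, the second working state respectively corresponding to each operator instance of the first operator.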
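  14. The method according to claim 13, wherein the operator is scheduled onto at least one resource-group slot; the slots correspond one-to-one to the operator instances;
    the performing a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal comprises:
    when the second working state of the operator instance indicates abnormal data consumption capacity, determining the target task manager to which the slot corresponding to the operator instance belongs;
    determining all first operator instances, other than the operator instance, contained in the target task manager, and obtaining the second working state respectively corresponding to each first operator instance; and
    performing the corresponding alarm operation according to the second working states respectively corresponding to the first operator instances.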
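  15. The method according to claim 14, wherein the performing the corresponding alarm operation according to the second working states respectively corresponding to the first operator instances comprises:
    calculating the ratio of the number of first operator instances whose second working state indicates abnormal data consumption capacity to the total number of first operator instances, to obtain an abnormal first operator instance ratio;
    outputting target task manager fault prompt information when the abnormal first operator instance ratio reaches a first preset ratio; and
    outputting abnormality prompt information for the operator instance when the abnormal first operator instance ratio does not reach the first preset ratio.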
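  16. The method according to claim 13, wherein the performing a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal comprises:
    outputting abnormality prompt information for the operator instance when the second working state of the operator instance indicates abnormal data consumption capacity.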
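  17. The method according to any one of claims 1 to 16, wherein the operator corresponds to at least one operator instance; the operator instance has a corresponding buffer;
    the method further comprises:
    obtaining the expansion rate of the buffer respectively corresponding to each operator instance; and
    making the buffer stop expanding when the expansion rate of the buffer reaches a second preset rate.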
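  18. The method according to any one of claims 1 to 16, wherein the operator corresponds to at least one operator instance; the method further comprises:
    obtaining all data processing rates corresponding to the operator instance that are collected within a set time, and obtaining the historical average processing rate corresponding to the operator instance; and
    outputting slow-operator alarm information if none of the data processing rates reaches the historical average processing rate.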
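  19. A monitoring apparatus for a distributed processing system, wherein operators are deployed on the processing nodes of the distributed processing system;
    the apparatus comprises:
    a rate obtaining module, configured to obtain, for a distributed task, data processing rates corresponding to the operators;
    a rate processing module, configured to determine at least one working state of an operator according to the data processing rates corresponding to the operator and/or its upstream operator; and
    an alarm module, configured to perform a corresponding alarm operation in response to any working state of the operator indicating that the operator is abnormal.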
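  20. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the monitoring method for a distributed processing system according to any one of claims 1 to 18.
  21. A computer program product, comprising a computer program which, when executed by a processor, implements the monitoring method for a distributed processing system according to any one of claims 1 to 18.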
PCT/CN2022/142237 2022-05-31 2022-12-27 Monitoring method and apparatus for distributed processing system WO2023231398A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210615433.2A CN114896121A (zh) 2022-05-31 2022-05-31 Monitoring method and apparatus for distributed processing system
CN202210615433.2 2022-05-31

Publications (1)

Publication Number Publication Date
WO2023231398A1 true WO2023231398A1 (zh) 2023-12-07

Family

ID=82726099

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142237 WO2023231398A1 (zh) 2022-05-31 2022-12-27 Monitoring method and apparatus for distributed processing system

Country Status (2)

Country Link
CN (1) CN114896121A (zh)
WO (1) WO2023231398A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117931756A (zh) * 2024-03-25 2024-04-26 广州睿帆科技有限公司 Flink-based real-time FTP file monitoring and analysis system and method
CN117931756B (zh) * 2024-03-25 2024-06-04 广州睿帆科技有限公司 Flink-based real-time FTP file monitoring and analysis system and method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896121A (zh) * 2022-05-31 2022-08-12 杭州数梦工场科技有限公司 Monitoring method and apparatus for distributed processing system
WO2024045016A1 (zh) * 2022-08-31 2024-03-07 华为技术有限公司 Node configuration method, apparatus and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10394664B1 (en) * 2017-08-04 2019-08-27 EMC IP Holding Company LLC In-memory parallel recovery in a distributed processing system
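CN110795151A (zh) * 2019-10-08 2020-02-14 支付宝(杭州)信息技术有限公司 Operator parallelism adjustment method, apparatus and device
CN111143143A (zh) * 2019-12-26 2020-05-12 北京神州绿盟信息安全科技股份有限公司 Performance testing method and apparatus
CN114896121A (zh) * 2022-05-31 2022-08-12 杭州数梦工场科技有限公司 Monitoring method and apparatus for distributed processing system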

Also Published As

Publication number Publication date
CN114896121A (zh) 2022-08-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22944699

Country of ref document: EP

Kind code of ref document: A1