WO2018019232A1 - Stream computing method, apparatus and system - Google Patents

Stream computing method, apparatus and system

Info

Publication number
WO2018019232A1
WO2018019232A1 (PCT/CN2017/094331, CN2017094331W)
Authority
WO
WIPO (PCT)
Prior art keywords
operator
flow graph
node
information
logical
Prior art date
Application number
PCT/CN2017/094331
Other languages
English (en)
French (fr)
Inventor
史云龙
方丰斌
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP21192261.2A (EP3975004A1)
Priority to EP17833535.2A (EP3483740B1)
Publication of WO2018019232A1
Priority to US16/261,014 (US11132402B2)
Priority to US17/486,066 (US20220012288A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G06F8/61 Installation
    • G06F8/63 Image based installation; Cloning; Build to order
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/543 User-generated data transfer, e.g. clipboards, dynamic data exchange [DDE], object linking and embedding [OLE]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Definitions

  • the embodiments of the present application relate to the field of big data computing, and in particular, to a stream computing method, apparatus, and system.
  • data streams are characterized as real-time, volatile, bursty, disordered, and unbounded.
  • Stream computing systems have been widely used.
  • the processing logic of a streaming application deployed in the stream computing system (the processing logic is usually also referred to simply as the streaming application) can be characterized by a directed acyclic graph (English: Directed Acyclic Graph, DAG for short), which is also called a flow graph.
  • DAG Directed Acyclic Graph
  • the processing logic of the streaming application is characterized using flow graph 100.
  • Each of the directed edges in the flow graph 100 represents a data stream (English: Stream), and each node represents an operator (English: Operator); each operator in the graph has at least one input data stream and at least one output data stream.
  • An operator is the smallest unit in a stream computing system that can be scheduled to perform computational tasks. An operator can also be called an execution operator.
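The flow-graph structure described above (operators as nodes, data streams as directed edges) can be sketched as a small adjacency-list DAG. The class and method names below are illustrative and not taken from the application:

```python
from collections import defaultdict

class FlowGraph:
    """Minimal DAG sketch: nodes are operators, directed edges are data streams."""

    def __init__(self):
        self.edges = defaultdict(list)   # operator -> downstream operators
        self.nodes = set()

    def add_stream(self, upstream, downstream):
        self.nodes.update((upstream, downstream))
        self.edges[upstream].append(downstream)

    def topological_order(self):
        # Kahn's algorithm; a valid ordering exists because the graph is acyclic.
        indegree = {n: 0 for n in self.nodes}
        for src in self.edges:
            for dst in self.edges[src]:
                indegree[dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        order = []
        while ready:
            n = ready.pop()
            order.append(n)
            for m in self.edges[n]:
                indegree[m] -= 1
                if indegree[m] == 0:
                    ready.append(m)
        return order

g = FlowGraph()
g.add_stream("source", "filter")
g.add_stream("filter", "sink")
order = g.topological_order()
```

A topological order of the operators is one valid schedule for dispatching them to working nodes.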
  • the flow computing system provides an integrated development environment (English: Integrated Development Environment, IDE for short), which provides a graphical user interface for constructing a flow graph; the graphical user interface includes a plurality of basic operators, and the user constructs the flow graph by selecting and connecting these basic operators in the graphical user interface.
  • IDE Integrated Development Environment
  • the embodiment of the present application provides a stream computing method, device, and system.
  • the technical solution is as follows:
  • Stream computing systems typically employ a distributed computing architecture.
  • the distributed computing architecture includes: a management node and at least one working node.
  • the user configures the flow graph in the management node through the client, and the management node dispatches each operator in the flow graph to the working node for operation.
  • an embodiment of the present application provides a flow calculation method, which is applied to a flow computing system including a management node and a working node, where the method includes: the management node acquires input channel description information, a Structured Query Language (SQL) statement, and output channel description information from a client.
  • the management node generates a flow graph according to the input channel description information, the SQL statement, and the output channel description information, where the flow graph is used to define the calculation logic of a plurality of operators performing a flow calculation task and the input/output relationships of the data flows between the operators; the management node controls operators in the working node to perform the flow calculation task according to the flow graph; the plurality of operators are scheduled to be executed on one or more working nodes of the flow computing system;
  • SQL Structured Query Language
  • the input channel description information is used to define an input channel
  • the input channel is a logical channel that inputs a data stream from a data production system into the flow graph
  • the output channel description information is used to define an output channel
  • the output channel is a logical channel through which the output data stream of the flow graph is output to the data consumption system.
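The application does not fix a concrete encoding for the channel description information; the following sketch merely illustrates the kind of fields such a logical-channel definition might carry. All field names here are assumptions, not taken from the application:

```python
from dataclasses import dataclass

@dataclass
class ChannelDescription:
    """Illustrative shape of channel description information.

    Field names are assumptions for illustration only.
    """
    name: str          # logical channel name referenced by the flow graph
    system: str        # external data production/consumption system it docks to
    transport: str     # transmission path/protocol, e.g. "tcp" or "file"
    data_format: str   # serialization format of records on the channel

# An input channel from a data production system and an output channel
# to a data consumption system, described symmetrically.
input_channel = ChannelDescription("in", "sensor_feed", "tcp", "csv")
output_channel = ChannelDescription("out", "recommender", "tcp", "json")
```

The point of such a description is that the flow graph only ever refers to the logical channel, while the transport/format details stay in the channel definition.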
  • in the embodiment of the present application, the management node generates an executable flow graph according to the input channel description information, the SQL statement, and the output channel description information, and then controls the working node to perform the flow calculation according to the flow graph.
  • this solves, to some extent, the problem that a current stream computing system builds a flow graph through the basic operators provided by the IDE, where the function of each basic operator is divided at very fine granularity, which results in higher complexity of constructing the flow graph and poor overall computational performance of the generated flow graph; SQL is a relatively common database management language, and a stream computing system that supports building flow graphs from SQL statements improves system usability and enhances the user experience.
  • users use the programming-language features of SQL to define the processing logic of the flow graph with SQL statements, and the management node dynamically generates the flow graph according to the processing logic defined by the SQL statements, thereby improving the overall computing performance of the flow graph.
  • the SQL statement includes a plurality of SQL rules, each of the SQL rules includes at least one SQL sub-statement;
  • the management node generates a flow graph according to the input channel description information, the SQL statement, and the output channel description information, which specifically includes:
  • the management node generates a first flow graph according to the input channel description information, the plurality of SQL rules, and the output channel description information, where the first flow graph includes a plurality of logical layer nodes;
  • the management node divides the logical nodes in the first flow graph to obtain a plurality of logical node groups, selects a common operator from a preset operator library for each of the logical node groups, and generates a second flow graph from the selected common operators; each operator in the second flow graph is used to implement one or more logical nodes in the logical node group corresponding to that operator.
  • the user only needs to write the SQL rules at the logical level; the management node generates the first flow graph according to the SQL rules, and the first flow graph includes several logical nodes. After the management node divides the logical nodes in the first flow graph using the preset operator library, each logical node group is converted into an operator in the second flow graph, and each operator in the second flow graph implements the logical nodes belonging to the same logical node group in the first flow graph. The user therefore does not need streaming programming expertise and does not need to care about the division logic of the operators: writing SQL rules at the logical level is enough to construct the flow graph, and the management node generates the operators in the flow graph by itself, thereby reducing the code editing work when the user constructs a stream computing application and reducing the complexity of doing so.
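The two-phase generation described above (SQL rules become logical nodes in a first flow graph, which are then grouped and converted into operators of a second flow graph) can be sketched as follows. The trivial fixed-size grouping stands in for the management node's real placement policy, and all names are illustrative:

```python
def group_logical_nodes(logical_nodes, group_size=2):
    """Partition logical nodes into groups.

    Illustrative only: the real management node would use load-balancing
    and concurrency heuristics rather than a fixed group size.
    """
    return [logical_nodes[i:i + group_size]
            for i in range(0, len(logical_nodes), group_size)]

def to_second_flow_graph(groups):
    # Each logical node group becomes one operator that implements
    # every logical node in the group.
    return [{"operator": f"op_{i}", "implements": group}
            for i, group in enumerate(groups)]

logical = ["rule_1", "rule_2", "rule_3"]
ops = to_second_flow_graph(group_logical_nodes(logical))
```

With three logical nodes and a group size of two, the first operator implements two rules and the second operator implements the remaining rule.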
  • the first flow graph includes a source logical node, an intermediate logical node, and a target logical node connected by a directed edge
  • the management node generates the first flow graph according to the input channel description information, the plurality of SQL rules, and the output channel description information, specifically:
  • the management node generates the source logical node in the first flow graph according to the input channel description information, where the source logical node is configured to receive an input data stream from the data production system;
  • the management node generates the intermediate logical nodes in the first flow graph according to the selection sub-statement in each of the SQL rules, where an intermediate logical node is used to indicate the calculation logic applied when calculating the input data stream, and each intermediate logical node corresponds to one SQL rule;
  • the management node generates the target logical node in the first flow graph according to the output channel description information, where the target logical node is configured to send an output data stream to the data consumption system;
  • the management node generates the directed edges between the source logical node, the intermediate logical nodes, and the target logical node according to the input sub-statement and/or the output sub-statement in each of the SQL rules.
  • the flow calculation method converts the input sub-statement, the selection sub-statement, and the output sub-statement of the SQL language within the flow computing system, so that the flow computing system supports the user in defining a logical node in the flow graph with a single SQL rule; using familiar SQL syntax reduces the difficulty of defining the flow computing application and provides a flow graph customization method with extremely high ease of use.
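As a rough illustration of how one SQL rule could yield one logical node plus its directed edges, assume a hypothetical rule shape `INSERT INTO <out> SELECT <expr> FROM <in>` (the actual rule grammar is not specified in this excerpt). The input sub-statement determines the incoming edge and the output sub-statement determines the outgoing edge:

```python
import re

# Assumed rule grammar for illustration only.
RULE = re.compile(
    r"INSERT INTO\s+(\w+)\s+SELECT\s+(.+?)\s+FROM\s+(\w+)", re.IGNORECASE)

def rule_to_edges(sql_rule, rule_name):
    """Derive the directed edges around one intermediate logical node
    from a single SQL rule."""
    m = RULE.match(sql_rule.strip())
    out_stream, _select_expr, in_stream = m.groups()
    # input stream -> rule node -> output stream
    return [(in_stream, rule_name), (rule_name, out_stream)]

edges = rule_to_edges("INSERT INTO alerts SELECT temp FROM sensor", "rule_1")
```

Each rule thus contributes one intermediate logical node, wired between the streams named in its input and output sub-statements.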
  • the second flow graph includes a source operator, an intermediate operator, and a target operator connected by directed edges, and the preset operator library comprises: a common source operator, a common intermediate operator, and a common target operator;
  • the management node divides the logical nodes in the first flow graph, selects common operators from the preset operator library according to the divided logical nodes, and generates the second flow graph from the selected common operators, including:
  • the management node compiles the common source operator to obtain a source operator in the second flow graph
  • the management node selects at least one common intermediate operator from the preset operator library for each logical node group that includes an intermediate logical node, and compiles the selected common intermediate operator to obtain an intermediate operator in the second flow graph;
  • the management node compiles the common target operator to obtain a target operator in the second flow graph
  • the management node generates a directed edge between the operators in the second flow graph according to the directed edge between the source logical node, the intermediate logical node, and the target logical node.
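The last step above, deriving directed edges between operators in the second flow graph from the edges between logical nodes, can be sketched like this (all names are illustrative). Edges between logical nodes that land inside the same operator simply disappear:

```python
def lift_edges(logical_edges, node_to_operator):
    """Map directed edges between logical nodes onto edges between the
    operators that implement them, dropping edges internal to one operator."""
    op_edges = set()
    for src, dst in logical_edges:
        src_op, dst_op = node_to_operator[src], node_to_operator[dst]
        if src_op != dst_op:          # intra-operator edge: no stream needed
            op_edges.add((src_op, dst_op))
    return op_edges

# Two SQL-rule nodes grouped into one intermediate operator.
mapping = {"src": "op_source", "r1": "op_mid", "r2": "op_mid", "sink": "op_target"}
edges = lift_edges([("src", "r1"), ("r1", "r2"), ("r2", "sink")], mapping)
```

Here the logical edge between `r1` and `r2` stays inside `op_mid`, so the second flow graph has only two data streams.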
  • in the flow calculation method provided by this implementation manner, the management node divides the plurality of logical nodes in the first flow graph, and the logical nodes divided into the same logical node group are executed by the same common intermediate operator.
  • this implementation does not require the user to consider factors such as load balancing and concurrent execution.
  • the management node determines the generation of the second flow graph according to factors such as load balancing and concurrent execution, which further reduces the difficulty for the user: the user only needs the ability to build a logical-level first flow graph through SQL.
  • the management node controls the working node to perform the flow calculation according to the flow graph, including:
  • the management node schedules each of the operators in the second flow graph to at least one working node in the flow computing system, where the working node is configured to execute the operator;
  • the management node generates, according to the output data stream of each of the operators, subscription release information corresponding to the operator, and configures the subscription release information to the operator;
  • the management node generates input stream definition information corresponding to the operator according to the input data stream of each of the operators, and configures the input stream definition information to the operator;
  • the subscription release information is used to indicate a sending manner of the output data stream corresponding to the current operator
  • the input stream definition information is used to indicate a receiving manner of the input data stream corresponding to the current operator.
  • the flow calculation method decouples the reference relationship between the input data stream and the output data stream of each operator in the second flow graph by setting up a subscription mechanism, and provides the ability to dynamically adjust individual operators in the second flow graph even after the second flow graph has started executing, which can improve the overall ease of use and maintainability of the stream computing application.
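A minimal sketch of such a subscription mechanism: operators publish and subscribe by stream name through a broker, so neither side holds a direct reference to the other and an individual operator can be replaced without touching its peers. The `StreamBroker` name and API are illustrative, not from the application:

```python
class StreamBroker:
    """Sketch of the subscription mechanism.

    Operators publish output streams by name and subscribe to input
    streams by name; the broker decouples producers from consumers.
    """

    def __init__(self):
        self.subscribers = {}   # stream name -> list of callbacks

    def subscribe(self, stream_name, callback):
        # Corresponds to an operator's input stream definition information.
        self.subscribers.setdefault(stream_name, []).append(callback)

    def publish(self, stream_name, record):
        # Corresponds to an operator's subscription release information.
        for callback in self.subscribers.get(stream_name, []):
            callback(record)

broker = StreamBroker()
received = []
broker.subscribe("alerts", received.append)   # downstream operator's input
broker.publish("alerts", {"temp": 99})        # upstream operator's output
```

Because the coupling is only through the stream name, swapping the downstream operator amounts to replacing one subscription entry.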
  • the method further includes:
  • the management node receives first modification information from the client, where the first modification information is information for modifying the SQL rules;
  • management node adds, modifies, or deletes the corresponding intermediate operator in the second flow graph according to the first modification information.
  • in the flow calculation method provided by this implementation manner, the client sends the first modification information to the management node, and the management node adds, modifies, or deletes the intermediate operator in the second flow graph according to the first modification information, providing the management node with the ability to dynamically adjust the intermediate operators in the second flow graph after the second flow graph is generated.
  • the method further includes:
  • the management node receives second modification information from the client, where the second modification information is information for modifying the input channel description information, and the source operator in the second flow graph is added, modified, or deleted according to the second modification information;
  • the management node receives third modification information from the client, where the third modification information is information for modifying the output channel description information, and the target operator in the second flow graph is added, modified, or deleted according to the third modification information.
  • in the flow calculation method provided by this implementation manner, the client sends the second modification information and/or the third modification information to the management node, and the management node adds, modifies, or deletes the source operator and/or the target operator in the second flow graph, which provides the management node with the ability to dynamically adjust the source and/or target operators in the second flow graph after the second flow graph is generated.
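The three kinds of modification information can be pictured as a single dispatch that edits the second flow graph in place. The dict fields used here (`target`, `action`, `operator`, `definition`) are assumptions for illustration only:

```python
def apply_modification(flow_graph, modification):
    """Illustrative handling of modification information from the client.

    `modification` uses assumed fields: 'target' names the affected
    operator kind ("source", "intermediate", or "target"), and 'action'
    is one of "add", "modify", or "delete".
    """
    target = modification["target"]
    action = modification["action"]
    name = modification["operator"]
    ops = flow_graph.setdefault(target, {})
    if action in ("add", "modify"):
        ops[name] = modification.get("definition")
    elif action == "delete":
        ops.pop(name, None)
    return flow_graph

graph = {"intermediate": {"op_1": "SELECT a FROM s"}}
apply_modification(graph, {"target": "intermediate", "action": "delete",
                           "operator": "op_1"})
```

The same dispatch handles the first, second, and third modification information, differing only in which operator kind it targets.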
  • in a second aspect, an embodiment of the present application provides a flow computing device comprising at least one unit, the at least one unit being configured to implement the flow calculation method provided by the first aspect or any possible implementation manner of the first aspect.
  • in a third aspect, an embodiment of the present application provides a management node comprising a processor and a memory; the memory is configured to store one or more instructions to be executed by the processor, and by executing the instructions the processor implements the stream computing method provided by the first aspect or any possible implementation manner of the first aspect.
  • in a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing an executable program for implementing the flow calculation method provided by the first aspect or any possible implementation manner of the first aspect.
  • in a fifth aspect, an embodiment of the present application provides a flow computing system comprising: a management node and at least one working node, the management node being the management node described in the third aspect.
  • FIG. 1 is a schematic structural diagram of a flow graph provided in the prior art;
  • FIG. 2A is a structural block diagram of a stream computing system provided by an embodiment of the present application;
  • FIG. 2B is a structural block diagram of a stream computing system provided by another embodiment of the present application;
  • FIG. 3A is a structural block diagram of a management node according to an embodiment of the present application;
  • FIG. 3B is a structural block diagram of a management node according to another embodiment of the present application;
  • FIG. 4 is a schematic diagram of a flow calculation process provided by an embodiment of the present application;
  • FIG. 5 is a flowchart of a stream calculation method provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of the principle of a flow calculation method provided by an embodiment of the present application;
  • FIG. 7 is a flowchart of a stream calculation method provided by another embodiment of the present application;
  • FIG. 8A is a flowchart of a stream calculation method provided by another embodiment of the present application;
  • FIG. 8B is a schematic diagram of the principle of a flow calculation method according to another embodiment of the present application;
  • FIG. 8C is a flowchart of a stream calculation method provided by another embodiment of the present application;
  • FIG. 8D is a flowchart of a stream calculation method provided by another embodiment of the present application;
  • FIG. 8E is a flowchart of a stream calculation method provided by another embodiment of the present application;
  • FIG. 9A is a schematic diagram of the principle of a flow calculation method provided by an embodiment of the present application;
  • FIG. 9B is a schematic diagram of the principle of a flow calculation method according to another embodiment of the present application;
  • FIG. 10 is a structural block diagram of a stream computing device according to another embodiment of the present application;
  • FIG. 11 is a structural block diagram of a stream computing system according to another embodiment of the present application.
  • FIG. 2A is a schematic structural diagram of a flow computing system provided by an embodiment of the present application.
  • the flow computing system is a distributed computing system comprising: a terminal 220, a management node 240, and a plurality of working nodes 260.
  • the terminal 220 is an electronic device such as a mobile phone, a tablet computer, a laptop computer, or a desktop computer.
  • the hardware form of the terminal 220 is not limited in this embodiment.
  • a client is running in the terminal 220, and the client is used to provide a human-computer interaction entry between the user and the distributed computing system.
  • the client has the ability to obtain input channel description information, several SQL rules, and output channel description information according to user input.
  • the client is a native client provided by a distributed computing system, or the client is a client developed by the user.
  • Terminal 220 is coupled to management node 240 via a wired network, a wireless network, or a dedicated hardware interface.
  • the management node 240 is a server or a combination of a plurality of servers.
  • the hardware form of the management node 240 is not limited in this embodiment.
  • Management node 240 is a node for managing individual work nodes 260 in a distributed computing system.
  • the management node 240 is configured to perform at least one of resource management, active/standby management, application management, and task management on each working node 260.
  • Resource management refers to management of computing resources in each working node 260.
  • Active/standby management refers to implementing active/standby switchover management for each working node 260 in the event of a failure; application management refers to managing at least one stream computing application running on the distributed computing system; task management refers to managing the computing tasks of the various operators in a stream computing application.
  • the management node 240 may have different names, such as a master node (English: master node).
  • the management node 240 is coupled to the worker node 260 via a wired network, a wireless network, or a dedicated hardware interface.
  • the working node 260 is a server or a combination of a plurality of servers.
  • the hardware form of the working node 260 is not limited in this embodiment.
  • an operator in the flow computing application is running in the working node 260.
  • Each worker node 260 is responsible for the computational tasks of one or more operators; for example, each process in the worker node 260 is responsible for the calculation task of one operator.
  • the plurality of working nodes 260 are connected by a wired network, a wireless network, or a dedicated hardware interface.
  • FIG. 2B is a schematic structural diagram of a flow computing system provided by another embodiment of the present application.
  • the flow computing system includes: a distributed computing platform formed by a plurality of computing devices 22, each computing device 22 having at least one virtual machine running in it, and each virtual machine being a management node 240 or a working node 260.
  • Management node 240 and worker node 260 are different virtual machines located on the same computing device 22 (as shown in Figure 2B). Alternatively, management node 240 and worker node 260 are different virtual machines located on different computing devices 22.
  • each worker node 260 is running on each computing device 22, and each worker node 260 is a virtual machine.
  • the number of working nodes 260 that can be run on each computing device 22 is determined by the computing power of computing device 22.
  • the various computing devices 22 are connected by a wired network, a wireless network, or a dedicated hardware interface.
  • the dedicated hardware interface is an optical fiber, a cable of a predetermined interface type, or the like.
  • the embodiment of the present application does not limit whether the management node 240 is a physical entity or a logical entity, and does not limit whether the working node 260 is a physical entity or a logical entity.
  • the structure and function of the management node 240 will be further described below.
  • FIG. 3A is a structural diagram of a management node 240 provided by an embodiment of the present application.
  • the management node 240 includes a processor 241, a network interface 242, a bus 243, and a memory 244.
  • the processor 241 is connected to the network interface 242 and the memory 244 via a bus 243, respectively.
  • Network interface 242 is used to effect communication with terminal 220 and worker node 260.
  • Processor 241 includes one or more processing cores.
  • the processor 241 implements management functions in the stream computing system by running an operating system or application modules.
  • the memory 244 can store an operating system 245 and an application module 25 required for at least one function.
  • the application module 25 includes an acquisition module 251, a generation module 252, an execution module 253, and the like.
  • the obtaining module 251 is configured to obtain input channel description information, an SQL statement, and an output channel description information from the client.
  • the generating module 252 is configured to generate a flow graph according to the input channel description information, the SQL statement, and the output channel description information, where the flow graph is used to define the calculation logic of each operator that performs the flow computing task and the input/output relationships of the data flows between the operators.
  • the execution module 253 controls the working node to perform a flow calculation task according to the flow graph.
  • the memory 244 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (English: static random access memory, SRAM for short), electrically erasable programmable read-only memory (English: electrically erasable programmable read-only memory, EEPROM for short), erasable programmable read-only memory (EPROM), programmable read-only memory (English: programmable read-only memory, PROM for short), read-only memory (English: read-only memory, ROM for short), magnetic memory, flash memory, a magnetic disk, or an optical disc.
  • the structure shown in FIG. 3A does not constitute a limitation on the management node 240, which may include more or fewer components, combine certain components, or use a different arrangement of components.
  • FIG. 3B illustrates an embodiment of a management node 240 in a virtualization scenario.
  • the management node 240 is a virtual machine (English: Virtual Machine, VM for short) 224 running on the computing device 22.
  • the computing device 22 includes a hardware layer 221, a virtual machine monitor (English: Virtual Machine Monitor, VMM for short) 222 running on the hardware layer 221, and a host Host 223 and several virtual machines VM running on the VMM 222, where the hardware layer 221 includes but is not limited to: an I/O device, a central processing unit (English: Central Processing Unit, CPU for short), and a memory.
  • An executable program runs in the VM; during the running of the program, the VM calls the hardware resources of the hardware layer 221 through the host Host 223 to implement the functions of the foregoing obtaining module 251, generating module 252, and executing module 253. Specifically, the obtaining module 251, the generating module 252, and the executing module 253 may be included in the executable program in the form of software modules or functions, and the VM 224 executes the program by invoking resources such as the CPU and memory in the hardware layer 221 to implement the functions of the obtaining module 251, the generating module 252, and the executing module 253.
  • FIG. 4 is a schematic diagram showing the principle of a flow calculation process provided by an embodiment of the present application.
  • the entire stream computing process involves a data production system 41, a stream computing system 42, and a data consuming system 43.
  • the data production system 41 is used to generate data.
  • the data production system 41 may be a financial system, a network monitoring system, a manufacturing system, a web application system, a sensing system, or the like.
  • the storage form of the data generated by the data production system 41 includes, but is not limited to, at least one of a file, a network data packet, and a database.
  • the storage form of the data in the embodiment of the present application is not limited.
  • data production system 41 is coupled to stream computing system 42 via hardware circuitry such as a network, fiber optic, hardware interface card.
  • data production system 41 is coupled to stream computing system 42 via input channel 411.
  • the input channel 411 is a logical channel for inputting data streams from the data production system 41 into the flow graphs in the stream computing system 42, and is used to implement the docking of transmission paths, transmission protocols, data formats, data encoding/decoding methods, and the like between the data production system 41 and the stream computing system 42.
  • Stream computing system 42 typically includes a flow graph of a plurality of operators. This flow graph can be considered a flow computing application.
  • the flow graph includes a source operator 421, at least one intermediate operator 422, and a destination operator 423.
  • Source operator 421 is used to receive the input data stream from the data production system 41, and is also used to send the input data stream to the intermediate operator 422; the intermediate operator 422 is used to perform calculation on the input data stream and to input the resulting output data stream to a next-stage intermediate operator 422 or to the destination operator 423; the destination operator 423 is used to send an output data stream to the data consumption system 43.
  • Each of the above-mentioned operators is scheduled by the management node shown in FIG. 2, and runs in a distributed form in a plurality of work nodes 260 shown in FIG. 2, and each work node 260 runs at least one operator.
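The source, intermediate, and destination operator chain described above can be sketched as a pipeline of generators. This is a toy stand-in for illustration: real operators run distributed across working nodes, and the function names and the trivial scaling computation are assumptions:

```python
def source(records):
    """Source operator: ingest records from the data production system."""
    yield from records

def intermediate(stream):
    """Intermediate operator: apply calculation logic to the input
    stream (here, a trivial illustrative scaling)."""
    for record in stream:
        yield record * 10

def destination(stream, sink):
    """Destination operator: hand results to the data consumption system."""
    for record in stream:
        sink.append(record)

consumed = []
destination(intermediate(source([1, 2, 3])), consumed)
```

Chaining generators mirrors how each operator consumes its upstream data stream record by record rather than waiting for a complete batch.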
  • the stream computing system 42 is coupled to the data consuming system 43 via hardware circuitry such as a network, an optical fiber, or a hardware interface card.
  • stream computing system 42 is coupled to data consuming system 43 via output channel 421.
  • the output channel 421 is a logical channel that outputs the output data stream of the stream computing system 42 to the data consuming system 43, and that implements the docking of the transmission path, transmission protocol, data format, data encoding/decoding methods, etc. between the stream computing system 42 and the data consuming system 43.
  • The data consumption system 43 is used to utilize the output data stream calculated by the stream computing system 42.
  • the data consuming system 43 performs persistent storage or secondary utilization of the output data stream.
  • the data consumption system 43 is a recommendation system that recommends web pages, text, audio, video, shopping information, and the like of interest to the user based on the output data stream.
  • the flow graph in the stream computing system 42 is generated, deployed, or adjusted by the user through the client 44.
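  • To make the source → intermediate → destination structure described above concrete, the following minimal Python sketch models a flow graph as chained operators. All class and variable names are illustrative, not part of the patent.

```python
# Minimal sketch of a flow graph: a source operator feeds an
# intermediate operator, which feeds a destination (sink) operator.

class Operator:
    def __init__(self, fn):
        self.fn = fn          # per-tuple calculation logic
        self.downstream = []  # directed edges to next-stage operators

    def send(self, item):
        result = self.fn(item)
        for op in self.downstream:
            op.send(result)

collected = []
source = Operator(lambda x: x)                  # receives the input data stream
cep = Operator(lambda x: x * 2)                 # intermediate calculation
sink = Operator(lambda x: collected.append(x))  # outputs to the data consumption system
source.downstream.append(cep)
cep.downstream.append(sink)

for item in [1, 2, 3]:  # input data stream from the data production system
    source.send(item)

print(collected)  # [2, 4, 6]
```

  • The sketch shows only the data-flow relationship; in the patent's system each operator would additionally run on a working node under management-node scheduling.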
  • FIG. 5 is a flowchart of a flow calculation method provided by an embodiment of the present application. This embodiment is exemplified by the flow calculation method applied to the management nodes shown in FIGS. 2A-2B and 3A-3B. The method includes:
  • Step 501 The management node acquires input channel description information, an SQL statement, and an output channel description information from the client.
  • the user sends the input channel description information, the SQL statement, and the output channel description information to the management node through the client.
  • the input channel description information is used to define an input channel, or the input channel description information is used to describe the input mode of the input data stream, or the input channel description information is used to describe the construction information of the input channel.
  • An input channel is a logical channel used to input a data stream from a data production system into a flow graph.
  • the input channel description information includes at least one of: transmission medium information, transmission path information, data format information, and data decoding mode information.
  • one example of input channel description information includes: an Ethernet medium, an Internet Protocol (English: Internet Protocol, IP for short) address and port number, a Transmission Control Protocol (English: Transmission Control Protocol, TCP for short) data packet, and a default decoding mode; another example of input channel description information includes: a file storage path and an Excel file.
  • the SQL statement is used to define the calculation logic for each operator in the flow graph, as well as the input data stream and output data stream for each operator.
  • each operator has at least one input data stream and at least one output data stream.
  • the output channel description information is used to define an output data stream, or the output channel description information is used to describe the output mode of the output data stream, or the output channel description information is used to describe the construction information of the output channel.
  • the output channel is a logical channel for outputting the output data stream of the flow graph to a data consuming system.
  • the output channel description information includes at least one of transmission medium information, transmission path information, data format information, and data encoding mode information.
  • one example of output channel description information includes: a file storage path and a CSV file.
  • the management node receives input channel description information, SQL, and output channel description information sent by the client.
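  • As an illustrative sketch only, the input and output channel description information listed above could be represented as simple key-value structures; the field names, the example address, and the file path here are hypothetical, not defined by the patent.

```python
# Illustrative channel description structures, mirroring the fields listed
# above: transmission medium, transmission path, data format, codec mode.
input_channel_desc = {
    "medium": "ethernet",
    "path": {"ip": "192.0.2.10", "port": 9000},  # hypothetical address/port
    "format": "tcp_packet",
    "decode": "default",
}
output_channel_desc = {
    "path": {"file": "/data/out/result.csv"},  # hypothetical file storage path
    "format": "csv",
}

def is_valid(desc):
    # a channel description must carry at least one of the defined fields
    return any(k in desc for k in ("medium", "path", "format", "decode", "encode"))

print(is_valid(input_channel_desc), is_valid(output_channel_desc))
```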
  • Step 502 The management node generates a flow graph according to the input channel description information, the SQL statement, and the output channel description information, where the flow graph is used to define calculation logic of each operator in the flow calculation and an input/output relationship of the data flow between each operator;
  • the SQL statement includes a number of SQL rules, each of which is used to define the computational logic of a logical operator, and the input data stream and the output data stream of the operator.
  • Each SQL rule includes at least one SQL substatement.
  • each operator has at least one input data stream, each operator having at least one output data stream.
  • an executable flow graph includes: a source operator (English: Source), an intermediate operator, and a target operator (English: Sink).
  • the source operator is used to receive the input data stream from the data production system and to input the input data stream into the intermediate operator.
  • the intermediate operator is used to calculate the input data stream from the source operator, or the intermediate operator is used to calculate the input data stream from other intermediate operators.
  • the target operator is used to send an output data stream to the data consuming system based on the calculation results from the intermediate operator.
  • Step 503 The management node controls the working node to perform a flow calculation task according to the flow graph.
  • the management node controls the flow calculation tasks of each working node in the flow calculation system according to the flow graph.
  • the management node schedules the generated flow graph to each working node for distributed execution; the plurality of working nodes perform flow calculation on the input data stream from the data production system according to the flow graph to obtain a final output data stream, and output it to the data consumption system.
  • In the flow calculation method provided by this implementation manner, the management node generates an executable flow graph according to the input channel description information, the SQL statement, and the output channel description information, and then controls the working nodes to perform flow calculation according to the flow graph. This solves the problem that the current flow computing system constructs the flow graph using the basic operators provided by the IDE, where the function of each basic operator is divided at very fine granularity, resulting in poor overall computing performance of the generated flow graph.
  • On the one hand, the flow computing system supports building flow graphs with SQL statements; SQL is a common database management language, so building flow graphs with SQL statements remains very easy to use. On the other hand, the user uses the features of the SQL programming language to define the processing logic of the flow graph with SQL statements.
  • the management node dynamically generates a flow graph with a reasonable number of operators according to the processing logic defined by the SQL statement, thereby improving the overall computing performance of the flow graph.
  • From the client side, the user needs to configure the input channel description information 61a, the service-related SQL rules 62a, and the output channel description information 63a. From the management node side, the management node introduces the input data stream 61b from the data production system according to the input channel description information, constructs the operators 62b in the flow graph through the SQL statement, and sends the output data stream 63b to the data consumption system according to the output channel description information. From the working node side, the source operator Source, the intermediate operator CEP, and the target operator Sink in the stream computing application generated by the management node need to be executed.
  • step 502 can be implemented by a number of more subdivided steps.
  • the above step 502 can be implemented as step 502a and step 502b instead, as shown in FIG.
  • Step 502a The management node generates a first flow graph according to the input channel description information, the plurality of SQL rules, and the output channel description information, where the first flow graph includes a plurality of logical nodes;
  • Step 502b: The management node divides the logical nodes in the first flow graph to obtain a plurality of logical node groups; selects a common operator corresponding to each logical node group in the preset operator library, and generates a second flow graph according to the selected common operators; each operator in the second flow graph is used to implement the functions of the one or more logical nodes in the logical node group corresponding to the operator.
  • the first flow graph is a temporary flow graph of a logical layer
  • the second flow graph is an executable flow graph of a code level.
  • the first flow graph is a temporary flow graph obtained after first-layer compilation according to the several SQL rules in the SQL statement; the second flow graph is an executable flow graph obtained after second-layer compilation according to the first flow graph.
  • the operators in the second flow graph can be scheduled by the management node to the working nodes.
  • After obtaining the input channel description information, the several SQL rules, and the output channel description information, the management node first obtains the first flow graph through first-layer compilation; the first flow graph includes a source logical node, a plurality of intermediate logical nodes, and a target logical node connected by directed edges.
  • the first flow graph includes nodes of several logical levels.
  • the management node divides the logical nodes in the first flow graph, and uses the common operators in the preset operator library to perform second-layer compilation on each logical node group in the first flow graph to obtain the second flow graph; each operator in the second flow graph is used to implement the logical nodes in the first flow graph that are divided into the same logical node group.
  • a public operator is a general-purpose operator that is preset to implement a certain function or a certain number of functions.
  • an operator is used to implement the function of a source logical node; or an operator is used to implement the function of one or more intermediate logical nodes; or an operator is used to implement the function of a target logical node.
  • An operator may also be used to implement the functions of one source logical node and one intermediate logical node; or of one source logical node and multiple intermediate logical nodes; or of multiple intermediate logical nodes; or of one intermediate logical node and one destination node; or of multiple intermediate logical nodes and one destination node.
  • the management node may perform the division of logical nodes according to at least one of load balancing, operator concurrency, affinity between logical nodes, and mutual exclusion between logical nodes.
  • When dividing according to load balancing, the management node considers the computing power of each operator and the computing resources consumed by each logical node, and divides the logical nodes so that the calculation amount undertaken by each operator is relatively balanced. For example, suppose the computing power of an operator is 100%, the computing resource consumed by logical node A is 30%, the computing resource consumed by logical node B is 40%, the computing resource consumed by logical node C is 50%, and logical node D needs to consume 70% of the computing resources; then logical node A and logical node D are divided into the same logical node group, and logical node B and logical node C are divided into another logical node group.
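  • The load-balanced division in the example above can be sketched as a simple first-fit-decreasing packing of logical nodes into groups bounded by one operator's computing power. The greedy strategy is illustrative; the patent does not prescribe a specific packing algorithm.

```python
# Greedy sketch of load-balanced division: pack logical nodes into groups
# so that each group stays within one operator's computing power (100%).

def divide_by_load(nodes, capacity=100):
    groups = []
    # place larger nodes first (first-fit decreasing)
    for name, cost in sorted(nodes.items(), key=lambda kv: -kv[1]):
        for group in groups:
            if group["cost"] + cost <= capacity:
                group["members"].append(name)
                group["cost"] += cost
                break
        else:
            groups.append({"members": [name], "cost": cost})
    return groups

# resource consumption percentages from the example above
nodes = {"A": 30, "B": 40, "C": 50, "D": 70}
groups = divide_by_load(nodes)
print([sorted(g["members"]) for g in groups])  # [['A', 'D'], ['B', 'C']]
```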
  • When dividing according to operator concurrency, the management node acquires the data stream size of each input data stream, and determines the number of logical nodes for processing each input data stream according to its data stream size, so that the calculation speed of each input data stream remains the same or similar.
  • When dividing according to the affinity between logical nodes, the management node calculates the affinity between logical nodes according to the types of the input data streams and/or the dependency relationships between the logical nodes, and then divides the logical nodes with higher affinity into the same logical node group. For example, if input data stream 1 is the input data stream of both logical node A and logical node D, the affinity between logical node A and logical node D is high, and logical node A and logical node D are divided into the same logical node group.
  • In this way, the amount of data stream transmission between operators can be reduced. As another example, if the output data stream of logical node A is the input data stream of logical node B, logical node B depends on logical node A, so the affinity between logical node A and logical node B is relatively high; dividing logical node A and logical node B into the same logical node group, implemented by the same operator, also reduces the amount of data stream transmission between operators.
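  • The two affinity signals described above (a shared input data stream, and a direct producer-consumer dependency) can be sketched as a simple scoring function. The scoring weights and data structures are illustrative assumptions, not taken from the patent.

```python
# Sketch of affinity scoring: nodes that share an input stream, or that are
# connected producer -> consumer, get nonzero affinity and are candidates
# for the same logical node group.

def affinity(n1, n2, inputs, edges):
    score = 0
    if set(inputs[n1]) & set(inputs[n2]):  # common input data stream
        score += 1
    if (n1, n2) in edges or (n2, n1) in edges:  # direct dependency
        score += 1
    return score

inputs = {"A": ["stream1"], "B": ["s_a"], "D": ["stream1"]}
edges = {("A", "B")}  # output of A is the input of B

print(affinity("A", "D", inputs, edges))  # A and D share stream1
print(affinity("A", "B", inputs, edges))  # B depends on A
print(affinity("B", "D", inputs, edges))  # unrelated nodes
```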
  • When dividing according to the mutual exclusion between logical nodes, the management node detects whether the operation logic between logical nodes is mutually exclusive; when the operation logic between two logical nodes is mutually exclusive, it divides the two logical nodes into different logical node groups. Since the basis of a distributed computing system is the concurrency and cooperation of multiple operators, mutually exclusive access by multiple operators to shared resources is inevitable; to avoid access conflicts, two mutually exclusive logical nodes need to be divided into different logical node groups.
  • the user only needs to write the SQL rule at the logical level
  • the management node generates the first flow graph according to the SQL rule
  • the first flow graph includes several logical nodes
  • the management node divides the logical nodes in the first flow graph through the preset operator library to obtain a plurality of logical node groups; each logical node group is converted into an operator in the second flow graph, and each operator in the second flow graph is used to implement the logical nodes belonging to the same logical node group. In this way, the user does not need streaming programming thinking and does not need to care about the division logic of the operators; it is only necessary to write SQL rules at the logical level to construct the flow graph.
  • the management node generates the operator in the second flow graph by itself, thereby reducing the code editing work when the user constructs the stream computing application, and reducing the complexity of the user constructing the stream computing application.
  • FIG. 8A is a flowchart of a flow calculation method provided by another embodiment of the present application. This embodiment is exemplified by the flow calculation method applied to the flow computing system shown in FIG. 2. The method includes:
  • Step 801 The management node acquires input channel description information, an SQL statement, and output channel description information from the client.
  • Input channel description information is used to define an input channel, which is a logical channel that inputs a data stream from a data production system into a flow graph.
  • an illustrative example of input channel description information is as follows:
  • the input data stream from the data production system may be a TCP or UDP data stream, a file, a database, or a distributed file system (English: Hadoop Distributed File System, HDFS for short).
  • SQL is used to define the calculation logic of each operator in the flow graph, as well as the input data stream and output data stream of each operator.
  • SQL includes: the Data Definition Language (English: Data Definition Language, DDL for short) and the Data Manipulation Language (English: Data Manipulation Language, DML for short).
  • the input data stream and/or the output data stream are usually defined using the DDL language, for example with the create (English: create) sub-statement; the calculation logic is defined using the DML language, for example with the insert into (English: insert into) sub-statement and the select (English: select) sub-statement.
  • SQL usually includes multiple SQL rules.
  • Each SQL rule includes at least one SQL sub-statement.
  • Each SQL rule is used to define a logical node in the flow graph.
  • a typical set of SQL rules includes:
  • the Insert into substatement is a statement in SQL that inserts data into a data table.
  • the Select substatement is a statement in SQL for selecting data from a data table.
  • the from sub-statement is the statement in SQL for reading data from a data table; the where sub-statement is a conditional statement added to the select sub-statement when data needs to be selected from a data table according to a condition.
  • the input data stream is A and the output data stream is B.
  • In an SQL rule, the insert into sub-statement is used as the statement defining the output data stream, the select sub-statement is used as the statement representing the calculation logic, the from sub-statement is used as the statement defining the input data stream, and the where sub-statement is used as the statement for selecting data.
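  • The sub-statement roles above can be sketched by extracting the output and input stream names from one rule. The rule text and the regex-based extraction are illustrative simplifications, not a full SQL parser and not the patent's actual compilation logic.

```python
# Illustrative parse of one SQL rule into (output stream, input stream),
# using the sub-statement roles described above: insert into -> output,
# from -> input.
import re

rule = "insert into B select * from A where value > 10"

out_stream = re.search(r"insert\s+into\s+(\w+)", rule, re.I).group(1)
in_stream = re.search(r"from\s+(\w+)", rule, re.I).group(1)
print(out_stream, in_stream)  # B A
```

  • This matches the example above where the input data stream is A and the output data stream is B.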
  • the user enters several SQL rules for configuring a flow graph including the following:
  • the input data stream of SQL rule 1 is tcp_channel_edr, the output data stream is s_edr;
  • the input data stream of SQL rule 2 is tcp_channel_xdr, the output data stream is s_xdr;
  • the input data stream of SQL rule 3 is tcp_channel_edr, and the output data stream is s;
  • the input data streams of SQL rule 4 are s_xdr and temp1, and the output data stream is file_channel_result1;
  • the input data stream of SQL rule 5 is s_xdr, and the output data stream is file_channel_result2.
  • the output channel description information is used to define the output channel, and the output channel is the logical channel that sends the output data stream to the data consumption system.
  • correspondingly, the output data stream sent to the data consumption system may be a TCP or UDP data stream, a file, a database, or a distributed file system (HDFS).
  • the first flow graph is a temporary flow graph including a source logical node, an intermediate logical node, and a target logical node.
  • the first flow graph is a flow graph at the logical level.
  • the generating process of the first flow graph may include the following steps 802 to 805:
  • Step 802 The management node generates a source logical node according to the input channel description information.
  • the source logical node is configured to receive an input data stream from a data production system.
  • each source logical node is used to receive an input data stream from a data production system.
  • Step 803: For each SQL rule in the SQL statement, the management node generates an intermediate logical node according to the select sub-statement in the SQL rule;
  • an intermediate logical node is generated according to the calculation logic defined by the select sub-statement in the SQL rule.
  • For example, an intermediate logical node used to calculate the input data stream tcp_channel_edr is generated according to the select sub-statement in SQL rule 1, and an intermediate logical node used to calculate the input data stream tcp_channel_xdr is generated according to the select sub-statement in SQL rule 2.
  • Step 804 The management node generates a target logical node according to the output channel description information.
  • the target logical node is configured to send an output data stream to the data consumption system.
  • each target logical node is used to output an output data stream.
  • Step 805: The management node generates the directed edges between the source logical nodes and the intermediate logical nodes, between the intermediate logical nodes themselves, and between the intermediate logical nodes and the target logical nodes according to the input sub-statement and the output sub-statement in each SQL rule.
  • the input edge of the intermediate logical node corresponding to the SQL rule is generated according to the from substatement in the SQL rule.
  • the other end of the input side is connected to the source logical node, or the other end of the input side is connected to other intermediate logical nodes.
  • According to the insert into sub-statement in the SQL rule, an output edge of the intermediate logical node corresponding to the SQL rule is generated.
  • the other end of the output side is connected to other intermediate logical nodes, or the other end of the output side is connected to the target logical node.
  • the input edge is a directed edge pointing to the intermediate logical node
  • the output edge is a directed edge from the intermediate logical node to other intermediate logical nodes or target logical nodes.
  • the first flow graph includes: a first source logical node 81, a second source logical node 82, a first intermediate logical node 83, a second intermediate logical node 84, and a third intermediate logical node 85, The fourth intermediate logical node 86, the fifth intermediate logical node 87, the first target logical node 88, and the second target logical node 89.
  • the output data stream tcp_channel_edr of the first source logical node 81 is the input data stream of the first intermediate logical node 83.
  • the output data stream tcp_channel_xdr of the second source logical node 82 is the input data stream of the second intermediate logical node 84.
  • the output data stream s_edr of the first intermediate logic node 83 is the input data stream of the third intermediate logic node 85.
  • the output data stream temp1 of the third intermediate logic node 85 is the input data stream of the fourth intermediate logic node 86.
  • the output data stream s_xdr of the second intermediate logic node 84 is the input data stream of the fourth intermediate logic node 86.
  • the output data stream s_xdr of the second intermediate logic node 84 is the input data stream of the fifth intermediate logic node 87.
  • the output data stream file_channel_result1 of the fourth intermediate logical node 86 is the input data stream of the first target logical node 88.
  • the output data stream file_channel_result2 of the fifth intermediate logical node 87 is the input data stream of the second target logical node 89.
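  • The wiring of the first flow graph can be sketched by matching each rule's input stream names against the node that produces them. The rule set below is adjusted to be consistent with the node topology described above (the stream names in the listed SQL rules differ slightly), so it is illustrative only.

```python
# Sketch of first-flow-graph wiring: directed edges are created by matching
# each node's input stream names to the node that produces that stream.
# Streams with no internal producer (tcp_channel_*) come from source nodes.

rules = {  # node: (input streams, output stream)
    "node83": (["tcp_channel_edr"], "s_edr"),
    "node84": (["tcp_channel_xdr"], "s_xdr"),
    "node85": (["s_edr"], "temp1"),
    "node86": (["temp1", "s_xdr"], "file_channel_result1"),
    "node87": (["s_xdr"], "file_channel_result2"),
}

producer = {out: node for node, (_, out) in rules.items()}
edges = sorted(
    (producer[s], node)
    for node, (ins, _) in rules.items()
    for s in ins
    if s in producer
)
print(edges)
```

  • Note that s_xdr fans out to two consumers (node86 and node87), mirroring how the output of the second intermediate logical node 84 feeds both the fourth and fifth intermediate logical nodes.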
  • Step 802, step 803, and step 804 may be performed in parallel, or they may be performed in series.
  • the second flow graph is an executable stream computing application, and the second flow graph is a flow graph at the code level.
  • the generating process of the second flow graph may include the following steps 806 to 808:
  • Step 806 the management node compiles the common source operator to obtain the source operator in the second flow graph
  • the management node selects a common source operator in the preset operator library according to the source logical node, and compiles the source operator in the second flow graph according to the common source operator;
  • One or more common source operators are set in the preset operator library, for example: a common source operator corresponding to the TCP protocol, a common source operator corresponding to the User Datagram Protocol (English: User Datagram Protocol, UDP for short), a common source operator corresponding to file type A, a common source operator corresponding to file type B, a common source operator corresponding to database type A, a common source operator corresponding to database type B, and the like.
  • the management node divides each source logical node into its own logical node group, and each source logical node is implemented as a source operator.
  • the management node compiles the corresponding common source operator from the preset operator library according to the source logical node in the first flow graph, and can obtain the source operator in the second flow graph.
  • the source operator is used to receive input data streams from the data production system.
  • Step 807: The management node selects at least one common intermediate operator for each logical node group containing intermediate logical nodes in the preset operator library, and compiles the selected common intermediate operators to obtain the intermediate operators in the second flow graph;
  • the management node divides the at least one intermediate logical node to obtain a plurality of logical node groups; for each logical node group, it selects the common intermediate operator corresponding to the intermediate logical nodes in the same group from the preset operator library, and compiles the intermediate operator in the second flow graph according to the common intermediate operator;
  • one or more common intermediate operators are set in the preset operator library, for example: a common intermediate operator for implementing a multiplication operation, a common intermediate operator for implementing a subtraction operation, a common intermediate operator for implementing a sort operation, a common intermediate operator for implementing a filter operation, and so on.
  • the functions of the same common intermediate operator can be multiple, that is, the common intermediate operator is an operator with multiple computing functions.
  • multiple logical nodes can be implemented on one common intermediate operator.
  • the management node divides the intermediate logical nodes according to at least one of load balancing, concurrency requirements, affinity between logical nodes, and mutual exclusion between logical nodes; the intermediate logical nodes divided into the same logical node group are compiled by the same common intermediate operator in the preset operator library to obtain an intermediate operator in the second flow graph.
  • For example, the management node divides intermediate logical nodes with less computational complexity into the same group. For instance, the management node divides intermediate logical node A, intermediate logical node B, and intermediate logical node C into the same group, where the output data stream of intermediate logical node A is the input data stream of intermediate logical node B, and the output data stream of intermediate logical node B is the input data stream of intermediate logical node C. As another example, the management node divides intermediate logical node A and intermediate logical node D, which have the same input data stream, into the same group.
  • Step 808 the management node compiles the common target operator to obtain the target operator in the second flow graph
  • the management node selects a common target operator in the preset operator library according to the target logical node, and compiles the target operator in the second flow graph according to the common target operator.
  • one or more public destination operators are set in the preset operator library, for example: a public destination operator corresponding to the TCP protocol, a public destination operator corresponding to the UDP protocol, a public destination operator corresponding to file type A, and the like.
  • the management node divides each target logical node into a logical node group, and each target logical node is implemented as a target operator.
  • the management node selects the corresponding common target operator from the preset operator library according to the target logical node in the first flow graph to compile, and can obtain the target operator in the second flow graph.
  • the target operator is used to send the final output data stream to the data consuming system.
  • The first source logical node 81 in the first flow graph is compiled by a common source operator to obtain the first source operator source1; the second source logical node 82 in the first flow graph is compiled by a common source operator to obtain the second source operator source2; the first intermediate logical node 83 to the fifth intermediate logical node 87 in the first flow graph are divided into the same group and compiled by the same common intermediate operator to obtain the intermediate operator CEP; the first target logical node in the first flow graph is compiled by a public destination operator to obtain the first destination operator sink1; the second target logical node in the first flow graph is compiled by a public destination operator to obtain the second destination operator sink2.
  • the second flow graph includes: a first source operator source1, a second source operator source2, an intermediate operator CEP, a first destination operator sink1, and a second destination operator sink2.
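  • The group-to-operator mapping of the second-layer compilation can be sketched as follows; the logical node names (src81, node83, dst88, etc.) are shorthand for the numbered nodes above and are illustrative.

```python
# Sketch of the second-layer compilation result: each logical node group in
# the first flow graph maps to one operator in the second flow graph.
# The grouping mirrors the example above: nodes 83-87 become one CEP operator.

groups = {
    "source1": ["src81"],
    "source2": ["src82"],
    "CEP": ["node83", "node84", "node85", "node86", "node87"],
    "sink1": ["dst88"],
    "sink2": ["dst89"],
}

def implementing_operator(logical_node):
    # find which second-flow-graph operator implements a given logical node
    for op, members in groups.items():
        if logical_node in members:
            return op
    return None

print(implementing_operator("node85"))  # CEP
print(len(groups))                      # 5 operators in the second flow graph
```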
  • Step 809: The management node generates the directed edges between the operators in the second flow graph according to the directed edges between the source logical nodes and the intermediate logical nodes, between the intermediate logical nodes themselves, and between the intermediate logical nodes and the target logical nodes in the first flow graph.
  • the management node correspondingly generates the directed edges between the operators in the second flow graph according to the respective directed edges in the first flow graph.
  • the flow graph can also be considered a stream computing application.
  • Step 806, step 807, and step 808 may be performed in parallel, or they may be performed in series.
  • Step 810 The management node schedules each operator in the second flow graph to at least one working node in the distributed computing system, where the working node is used to execute an operator;
  • the distributed computing system includes a plurality of working nodes, and the management node dispatches each operator in the second flow graph to a plurality of working nodes for execution according to the decided physical execution plan.
  • Each worker node is used to execute at least one operator.
  • the first source operator source1 is scheduled to the working node 1
  • the second source operator source2 is scheduled to the working node 2
  • the intermediate operator CEP is scheduled to the working node 3
  • the first destination operator sink1 is dispatched to the working node 4.
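  • The operator-to-working-node assignment above can be sketched with a simple placement loop. The round-robin policy here is an illustrative assumption; the patent does not specify the scheduling algorithm.

```python
# Sketch of scheduling each operator in the second flow graph onto working
# nodes, matching the example placement above (source1 -> worker1, etc.).

operators = ["source1", "source2", "CEP", "sink1", "sink2"]
workers = ["worker1", "worker2", "worker3", "worker4"]

# round-robin placement: operator i goes to worker i mod N
placement = {op: workers[i % len(workers)] for i, op in enumerate(operators)}
print(placement)
```

  • With four workers and five operators, sink2 wraps around to worker1; in practice the management node would also weigh load and locality, as described earlier for logical node division.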
  • this embodiment also introduces a subscription mechanism.
  • Step 811 The management node generates, according to the output data stream of each operator, subscription release information corresponding to the operator, and configures the subscription release information to the operator.
  • the subscription release information is used to indicate the manner in which the output data stream corresponding to the current operator is published.
  • the management node generates the subscription release information corresponding to the current operator according to the output data stream of the current operator, the directed edges in the second flow graph, and the topology structure between the working nodes.
  • the output data stream of the first source operator source1 is tcp_channel_edr, and the directed edge corresponding to tcp_channel_edr in the second flow graph points to the intermediate operator CEP, and the network interface 3 of the working node 1 is connected to the network interface 4 of the working node 3.
  • the management node generates subscription release information that issues the output data stream tcp_channel_edr from the network interface 3 of the worker node 1 in a predetermined form.
  • the management node sends the subscription release information to the first source operator source1 located in the working node 1, and the first source operator source1 issues the output data stream tcp_channel_edr according to the subscription release information.
  • The first source operator source1 does not need to know which operator is downstream, nor on which working node the downstream operator is located; it only needs to publish the output data stream from network interface 3 of working node 1 according to the subscription release information.
  • Step 812 The management node generates input stream definition information corresponding to the operator according to the input data stream of each operator, and configures input stream definition information to the operator.
  • the input stream definition information is used to indicate the manner in which the input data stream corresponding to the current operator is received.
  • the management node generates input stream definition information corresponding to the current operator according to the input data stream of the current operator, the directed edges in the second flow graph, and the topology between the working nodes.
  • For example, the input data stream of the intermediate operator CEP includes tcp_channel_edr, the directed edge corresponding to tcp_channel_edr in the second flow graph originates from the first source operator source1, and network interface 3 of working node 1 is connected to network interface 4 of working node 3. The management node therefore generates input stream definition information indicating that the stream is received from network interface 4 in a predetermined form, and sends it to the intermediate operator CEP located on working node 3; the intermediate operator CEP then receives the input data stream tcp_channel_edr according to the input stream definition information.
  • The intermediate operator CEP does not need to know which operator is upstream, nor on which working node the upstream operator is located; it only needs to receive the input data stream from network interface 4 of working node 3 according to the input stream definition information.
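The decoupling provided by the subscription mechanism of steps 811 and 812 can be sketched minimally in Python. All names below (Channel, subscribe, publish) are hypothetical illustrations of the idea, not the system's actual implementation: an operator publishes its output stream to a named channel and downstream operators subscribe to that channel, so neither side references the other directly.

```python
# Minimal sketch of the subscription mechanism (hypothetical names; not the
# actual system implementation). An operator publishes records to a named
# channel; downstream operators subscribe to the channel by name, so
# upstream and downstream operators never reference each other directly.

class Channel:
    """A named data-stream endpoint, e.g. tcp_channel_edr on a network interface."""
    def __init__(self, name):
        self.name = name
        self.subscribers = []

    def subscribe(self, callback):
        # Corresponds to an operator's input stream definition information.
        self.subscribers.append(callback)

    def publish(self, record):
        # Corresponds to an operator's subscription release information.
        for cb in self.subscribers:
            cb(record)

# The management node wires operators together via channels only.
edr = Channel("tcp_channel_edr")
received = []
edr.subscribe(received.append)  # CEP registers as a consumer
edr.publish({"MSISDN": "13900000000", "QuotaName": "GPRS"})  # source1 publishes
```

Because both sides name only the channel, replacing the upstream or downstream operator requires reconfiguring only that operator's channel binding, which is exactly what makes the dynamic modifications of steps 814 to 825 possible.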
  • Step 813 Each working node executes the operators in the second flow graph.
  • Each working node executes each operator in the second flow graph according to the scheduling of the management node. For example, each process is responsible for the computational task of an operator.
  • In the stream computing method provided by this embodiment, the management node generates an executable flow graph according to the input channel description information, the SQL statement, and the output channel description information, and then controls the working nodes to execute stream computation according to the flow graph.
  • This solves the problem that when current stream computing systems build flow graphs from the basic operators provided by an IDE, each basic operator's function is divided at very fine granularity, resulting in poor overall computing performance of the generated flow graph.
  • The stream computing system supports building flow graphs with SQL statements. SQL is a common database management language, so building flow graphs with SQL statements remains very easy to use.
  • Moreover, the user exploits SQL's programming-language features to define the processing logic of the flow graph.
  • The management node dynamically generates a flow graph with a reasonable number of operators according to the processing logic defined by the SQL statement, thereby improving the flow graph's overall computing performance.
  • In addition, the logical nodes divided into the same group are implemented by the same common intermediate operator, so the user does not need to consider factors such as load balancing, concurrent execution, affinity, and mutual exclusion; the management node decides these factors itself when generating the second flow graph, further reducing the difficulty for the user.
  • The user only needs the ability to build the logical-level first flow graph through SQL.
  • By setting up a subscription mechanism, the reference relationships between the input and output data streams of the operators in the second flow graph are decoupled, so that the user can still dynamically adjust the operators in the second flow graph after it has been deployed in the stream computing system, which improves the overall usability and maintainability of the stream computing application.
  • After the second flow graph has been deployed and is running in the stream computing system, business functions change and are adjusted in actual usage scenarios, so the running second flow graph must also be changed to meet the new requirements. Unlike the prior art, which generally requires rebuilding the second flow graph,
  • this embodiment of the application provides the capability of dynamically modifying the running second flow graph, as described with reference to FIG. 8C to FIG. 8E.
  • Step 814 The client sends the first modification information to the management node.
  • the first modification information is information for modifying the SQL rule; or the first modification information carries the modified SQL rule.
  • the client sends the first modification information for modifying the SQL rule to the management node.
  • Step 815 The management node receives the first modification information from the client.
  • Step 816 The management node adds, modifies, or deletes the intermediate operator in the second flow graph according to the first modification information.
  • the modification process of replacing an original intermediate operator with a new intermediate operator may be to delete the original intermediate operator and add a new intermediate operator.
  • Step 817 The management node reconfigures the subscription release information and/or the input flow definition information to the modified intermediate operator.
  • If the input data stream of the modified intermediate operator is a new or changed data stream, the management node needs to reconfigure the input stream definition information for the intermediate operator.
  • If the output data stream of the modified intermediate operator is a new or changed data stream, the management node also needs to reconfigure the subscription release information for the intermediate operator.
  • In this way, the client sends the first modification information to the management node, and the management node adds, modifies, or deletes intermediate operators in the second flow graph according to the first modification information, giving the management node the ability to dynamically adjust the intermediate operators in the second flow graph.
  • Step 818 The client sends second modification information to the management node.
  • the second modification information is information for modifying the input channel description information; or the second modification information carries the modified input channel description information.
  • the client sends second modification information for modifying the input channel description information to the management node.
  • Step 819 the management node receives second modification information from the client.
  • Step 820 The management node adds, modifies, or deletes the source operator in the second flow graph according to the second modification information.
  • The process of replacing an original source operator with a new source operator may be to delete the original source operator and add the new source operator.
  • step 821 the management node reconfigures the subscription release information to the modified source operator.
  • the management node further needs to re-configure the subscription release information to the source operator.
  • the second modification information is sent by the client to the management node, and the management node adds, modifies, or deletes the source operator in the second flow graph according to the second modification information. Provides the ability for the management node to dynamically adjust the source operators in the second flow graph.
  • Step 822 The client sends third modification information to the management node.
  • The third modification information is information for modifying the output channel description information; or the third modification information carries the modified output channel description information.
  • That is, the client sends third modification information for modifying the output channel description information to the management node.
  • Step 823 The management node receives the third modification information from the client.
  • Step 824 The management node adds, modifies, or deletes the target operator in the second flow graph according to the third modification information.
  • The process of replacing an original target operator with a new target operator may be to delete the original target operator and add the new target operator.
  • Step 825 The management node reconfigures the input stream definition information for the modified target operator.
  • If the input data stream of the modified target operator is a new or changed data stream, the management node needs to reconfigure the input stream definition information for the target operator.
  • the third modification information is sent by the client to the management node, and the management node adds, modifies, or deletes the target operator in the second flow graph according to the third modification information. Provides the ability for the management node to dynamically adjust the target operator in the second flow graph.
  • The stream computing system provides the user with two kinds of clients: a native client 92 provided by the stream computing system, and a client 94 developed by the user through secondary development.
  • Both the native client 92 and the user-developed client 94 provide a SQL application programming interface (API), which is used to define flow graphs in the SQL language.
  • The user enters the input/output channel description information and SQL statements on the native client 92 or the user-developed client 94, which sends the input/output channel description information and the SQL statement to the management node Master, that is, step 1 in the figure.
  • The management node Master establishes a connection with the native client 92 or the user-developed client 94 through the App connection service.
  • the management node Master obtains input/output channel description information and SQL statements, and the SQL engine 96 generates an executable flow graph based on the input/output channel description information and the SQL statement, that is, step 2 in the figure.
  • The management node Master further includes a flow platform execution framework management module 98, which is configured to perform management tasks such as resource management, application management, active/standby management, and task management on the executable flow graph generated by the SQL engine 96.
  • the flow platform execution framework management module 98 plans and determines the execution plan of the flow graph on each work node Worker, that is, step 3 in the figure.
  • On each working node, a processing element container (PEC) includes a plurality of processing elements (PEs); each PE is used to run a source operator Source, an intermediate operator CEP, or a target operator Sink in the executable flow graph. The operators in the executable flow graph are processed through cooperation among the PEs.
  • FIG. 9B is a schematic diagram of the principle of the SQL engine 96 provided by an embodiment of this application in a specific implementation. After the SQL engine 96 obtains the input/output channel description information and the SQL statement, the following processes are performed:
    1. The SQL engine 96 parses each SQL rule in the SQL statement;
    2. The SQL engine 96 generates a temporary first flow graph according to the parsing result;
    3. The SQL engine 96 divides the logical nodes in the first flow graph according to load balancing, affinity, and mutual exclusion factors to obtain at least one logical node group, each logical node group including one or more logical nodes;
    4. The SQL engine 96 performs a simulation of operator concurrency and adjusts each logical node group according to the simulation result;
    5. The SQL engine 96 generates a second flow graph from the adjusted logical node groups; the logical nodes belonging to the same logical node group are assigned to one executable operator in the second flow graph;
    6. The SQL engine 96 parses each executable operator in the second flow graph and analyzes the operational requirements of each operator;
    7. The SQL engine 96 generates a logical execution plan for each executable operator in the second flow graph;
    8. The SQL engine 96 performs code compilation and optimization on the logical execution plan of the second flow graph to generate a physical execution plan;
    9. The SQL engine 96 sends the physical execution plan to the flow platform execution framework management module 98, which executes the stream computing application according to the physical execution plan.
  • Steps 1 to 5 constitute the first-layer compilation process, and steps 6 to 9 constitute the second-layer compilation process.
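The two-layer compilation flow above can be sketched as follows. This is an illustrative Python model under assumed names (compile_layer1, compile_layer2, and a trivial pair-wise grouping standing in for the load-balancing, affinity, and mutual-exclusion decisions); the real SQL engine is far more involved.

```python
# Hedged sketch of the SQL engine's two-layer compilation: layer 1 turns SQL
# rules into logical nodes and groups them; layer 2 maps each group onto one
# executable operator. All names are illustrative assumptions.

def compile_layer1(sql_rules):
    """Parse each rule into one logical node, then group the nodes.
    Here a trivial pairing stands in for the real grouping decisions."""
    nodes = [f"node_{i}" for i, _ in enumerate(sql_rules)]
    return [nodes[i:i + 2] for i in range(0, len(nodes), 2)]

def compile_layer2(groups):
    """Map every logical-node group onto a single executable operator."""
    return [f"operator({'+'.join(g)})" for g in groups]

groups = compile_layer1(["rule1", "rule2", "rule3"])
operators = compile_layer2(groups)
```

The two phases mirror the split above: steps 1 to 5 correspond to `compile_layer1` (first flow graph and grouping), and steps 6 to 9 correspond to `compile_layer2` plus plan generation.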
  • the following is a device embodiment of the present application.
  • the device embodiment corresponds to the foregoing method embodiment.
  • FIG. 10 is a block diagram showing the structure of a stream computing device provided by an embodiment of the present application.
  • the stream computing device can be implemented as all or part of the management node 240 by dedicated hardware circuitry, or a combination of hardware and software.
  • the stream computing device includes an obtaining unit 1020, a generating unit 1040, and an executing unit 1060.
  • the obtaining unit 1020 is configured to implement the functions of the foregoing steps 501 and 801;
  • a generating unit 1040 configured to implement the functions of step 502, step 502a, step 502b, and step 802 to step 808;
  • the executing unit 1060 is configured to implement the functions of step 503 and step 810 to step 812 described above.
  • the device further includes: a modifying unit 1080;
  • the modifying unit 1080 is configured to implement the functions of the foregoing steps 815 to 825.
  • the obtaining unit 1020 is implemented by the network interface 242 of the management node 240 and the processor 241 executing the obtaining module 251 in the memory 244.
  • the network interface 242 is an Ethernet network card, a fiber optic transceiver, a Universal Serial Bus (USB) interface, or other I/O interfaces.
  • the generating unit 1040 is implemented by the processor 241 of the management node 240 executing the generating module 252 in the memory 244.
  • the flow graph generated by the generating unit 1040 is an executable distributed stream computing application formed by a plurality of operators, and each operator in the distributed stream computing application can be dispatched to a different working node for execution.
  • The execution unit 1060 described above is implemented by the network interface 242 of the management node 240 and by the processor 241 executing the execution module 253 in the memory 244.
  • the network interface 242 is an Ethernet network card, a fiber optic transceiver, a USB interface, or other I/O interface.
  • the processor 241 dispatches the various operators in the flow graph to different working nodes via the network interface 242, and then performs data calculations for the operators by the respective working nodes.
  • modification unit 1080 is implemented by the processor 241 of the management node 240 executing a modification module (not shown) in the memory 244.
  • When the stream computing device provided by the foregoing embodiments generates a flow graph and performs stream computation, the division into the functional modules described above is merely an example; in practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • In addition, the stream computing device provided by the foregoing embodiments belongs to the same concept as the method embodiments of the stream computing method; for the specific implementation process, refer to the method embodiments, and details are not described here again.
  • FIG. 11 is a block diagram showing the structure of a stream computing system provided by an embodiment of the present application.
  • the flow computing system includes a terminal 1120, a management node 1140, and a work node 1160.
  • the terminal 1120 is configured to perform the steps performed by the terminal or the client in the foregoing method embodiment.
  • the management node 1140 is configured to perform the steps performed by the management node in the foregoing method embodiment.
  • the working node 1160 is configured to perform the steps performed by the working node in the foregoing method embodiment.
  • A person skilled in the art may understand that all or part of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing related hardware; the program may be stored in a computer-readable storage medium.
  • The storage medium mentioned may be a read-only memory, a magnetic disk, an optical disc, or the like.


Abstract

A stream computing method, apparatus, and system, belonging to the field of big data computing. The method includes: a management node obtains input channel description information, an SQL statement, and output channel description information; generates a flow graph according to the input channel description information, the SQL statement, and the output channel description information; and controls working nodes to execute stream computing tasks according to the flow graph. The method alleviates, to a certain extent, the problem that the basic operators provided by current stream computing systems are too fine-grained, which makes flow graph construction complex and yields poor performance. Moreover, because the processing logic of the flow graph is defined with SQL statements and the management node dynamically generates the flow graph according to that logic, both the ease of building flow graphs and the overall computing performance of the flow graph are improved.

Description

Stream Computing Method, Apparatus, and System
This application claims priority to Chinese Patent Application No. 201610617253.2, filed with the Chinese Patent Office on July 29, 2016 and entitled "Stream Computing Method, Apparatus, and System", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of this application relate to the field of big data computing, and in particular, to a stream computing method, apparatus, and system.
Background
In application fields such as financial services, sensor monitoring, and network monitoring, data streams are real-time, volatile, bursty, unordered, and unbounded. Stream computing systems, which can process real-time data streams, are therefore increasingly widely used.
The processing logic of a streaming application (also commonly called a stream application) deployed in a stream computing system can be represented by a directed acyclic graph (DAG), which is also called a flow graph. Referring to FIG. 1, the processing logic of a stream application is represented by flow graph 100. Each directed edge in flow graph 100 represents a data stream (Stream), and each node represents an operator (Operator); each operator in the graph has at least one input data stream and at least one output data stream. An operator is the smallest unit in a stream computing system that can be scheduled to execute a computing task; an operator may also be called an execution operator.
When deploying a stream application to a stream computing system, the user must construct a flow graph for the application in advance; the application is then compiled and run in the system in the form of the flow graph to execute its data stream processing tasks. The stream computing system provides the user with an integrated development environment (IDE) that offers a graphical user interface for building flow graphs. The interface includes a number of basic operators; the user constructs a flow graph by dragging and dropping basic operators on the interface and must configure various runtime parameters for the flow graph.
Although building a flow graph by dragging basic operators is intuitive, to allow users to build relatively complex flow graphs, the function of each basic operator provided in the IDE is pre-divided to very fine granularity. As a result, flow graph construction is complex, the flow graphs users actually build are bloated, and the overall computing performance of the flow graphs is poor.
Summary
To improve the overall computing performance of flow graphs, embodiments of this application provide a stream computing method, apparatus, and system. The technical solutions are as follows:
A stream computing system usually adopts a distributed computing architecture that includes a management node and at least one working node. A user configures a flow graph on the management node through a client, and the management node schedules the operators in the flow graph to run on the working nodes.
According to a first aspect, an embodiment of this application provides a stream computing method applied to a stream computing system including a management node and working nodes. The method includes: the management node obtains input channel description information, a Structured Query Language (SQL) statement, and output channel description information from a client; the management node generates a flow graph according to the input channel description information, the SQL statement, and the output channel description information, where the flow graph defines the computing logic of multiple operators that execute stream computing tasks and the input/output relationships of data streams between the operators; and the management node controls the operators on the working nodes to execute stream computing tasks according to the flow graph, the multiple operators being scheduled to one or more working nodes of the stream computing system for execution;
where the input channel description information defines an input channel, the input channel being a logical channel through which data streams from a data production system are input into the flow graph; and the output channel description information defines an output channel, the output channel being a logical channel through which the output data stream of the flow graph is output to a data consumption system.
In the embodiments of this application, the management node generates an executable flow graph according to the input channel description information, the SQL statement, and the output channel description information, and then controls the working nodes to execute stream computation according to the flow graph. This solves, to a certain extent, the problem that when current stream computing systems build flow graphs from the basic operators provided by an IDE, each basic operator's function is divided at very fine granularity, making flow graph construction complex and the generated flow graph's overall computing performance poor. SQL is a common database management language, so supporting SQL statements for building flow graphs improves the system's ease of use and the user experience. Moreover, the user exploits SQL's programming-language features to define the processing logic of the flow graph, and the management node dynamically generates the flow graph according to the processing logic defined by the SQL statement, thereby improving the flow graph's overall computing performance.
With reference to the first aspect, in a first possible implementation of the first aspect, the SQL statement includes a number of SQL rules, and each SQL rule includes at least one SQL sub-statement;
the generating, by the management node, a flow graph according to the input channel description information, the SQL statement, and the output channel description information specifically includes:
generating, by the management node, a first flow graph according to the input channel description information, the SQL rules, and the output channel description information, where the first flow graph includes a number of logical-level nodes; and
dividing, by the management node, the logical nodes in the first flow graph to obtain a number of logical node groups; selecting common operators from a preset operator library according to the logical node groups; and generating a second flow graph from the selected common operators, where each operator in the second flow graph implements one or more logical nodes in the logical node group corresponding to that operator.
In summary, with the stream computing method provided in this implementation, the user only needs to write SQL rules at the logical level. The management node generates the first flow graph, which includes a number of logical nodes, from the SQL rules; it then divides the logical nodes in the first flow graph using the preset operator library and converts each logical node group into one operator of the second flow graph, where each operator implements the logical nodes of the first flow graph that belong to the same logical node group. The user therefore needs neither a stream-programming mindset nor knowledge of operator-partitioning logic; writing SQL rules at the logical level suffices to build a flow graph, and the management node generates the operators of the flow graph by itself, reducing the code-editing work and the complexity for the user when building a stream computing application.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the first flow graph includes a source logical node, intermediate logical nodes, and a target logical node connected by directed edges, and the generating, by the management node, a first flow graph according to the input channel description information, the SQL rules, and the output channel description information specifically includes:
generating, by the management node, the source logical node in the first flow graph according to the input channel description information, where the source logical node receives the input data stream from the data production system;
generating, by the management node, the intermediate logical nodes in the first flow graph according to the select sub-statement in each SQL rule, where the intermediate logical nodes indicate the computing logic applied to the input data stream and each intermediate logical node corresponds to one SQL rule;
generating, by the management node, the target logical node in the first flow graph according to the output channel description information, where the target logical node sends the output data stream to the data consumption system; and
generating, by the management node, the directed edges among the source logical node, the intermediate logical nodes, and the target logical node according to the input sub-statement and/or the output sub-statement of each SQL rule.
In summary, with the stream computing method provided in this implementation, by repurposing the input, select, and output sub-statements of the SQL language, the stream computing system allows the user to define one logical node of a flow graph at the logical level with a single SQL rule. Using SQL syntax familiar to the user lowers the difficulty of defining a stream computing application and provides a highly usable way of customizing flow graphs.
With reference to the first or the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the second flow graph includes source operators, intermediate operators, and target operators connected by directed edges, and the preset operator library includes common source operators, common intermediate operators, and common target operators;
the dividing, by the management node, the logical nodes in the first flow graph, selecting common operators from the preset operator library according to the divided logical nodes, and generating the second flow graph from the selected common operators includes:
compiling, by the management node, the common source operator to obtain the source operator in the second flow graph;
selecting, by the management node, at least one common intermediate operator from the preset operator library for each logical node group that includes intermediate logical nodes, and compiling the selected common intermediate operators to obtain the intermediate operators in the second flow graph;
compiling, by the management node, the common target operator to obtain the target operator in the second flow graph; and
generating, by the management node, the directed edges between the operators in the second flow graph according to the directed edges among the source logical node, the intermediate logical nodes, and the target logical node.
In summary, with the stream computing method provided in this implementation, the management node divides the multiple logical nodes of the first flow graph and implements the logical nodes divided into the same logical node group with one common intermediate operator. The user does not need to consider factors such as load balancing and concurrent execution; the management node itself decides how to generate the second flow graph based on such factors, which further lowers the difficulty for the user of generating the second flow graph: the user only needs the ability to build the logical-level first flow graph through SQL.
With reference to the first, the second, or the third possible implementation of the first aspect, in a fourth possible implementation, the controlling, by the management node, the working nodes to execute stream computation according to the flow graph includes:
scheduling, by the management node, the operators in the second flow graph to at least one working node in the stream computing system, where the working nodes execute the operators;
generating, by the management node, subscription release information corresponding to each operator according to the operator's output data stream, and configuring the subscription release information for the operator; and
generating, by the management node, input stream definition information corresponding to each operator according to the operator's input data stream, and configuring the input stream definition information for the operator,
where the subscription release information indicates how the output data stream corresponding to the current operator is sent, and the input stream definition information indicates how the input data stream corresponding to the current operator is received.
In summary, with the stream computing method provided in this implementation, a subscription mechanism decouples the reference relationships between the input and output data streams of the operators in the second flow graph, so that the operators in the second flow graph can still be dynamically adjusted after the second flow graph has been deployed, which improves the overall usability and maintainability of the stream computing application.
With reference to the first through the fourth possible implementations of the first aspect, in a fifth possible implementation, the method further includes:
receiving, by the management node, first modification information from the client, where the first modification information is information for modifying the SQL rules; and
adding, modifying, or deleting, by the management node, the corresponding intermediate operator in the second flow graph according to the first modification information.
In summary, with the stream computing method provided in this implementation, the client sends the first modification information to the management node, and the management node adds, modifies, or deletes intermediate operators in the second flow graph accordingly, giving the management node the ability to dynamically adjust the intermediate operators of the second flow graph after it has been generated.
With reference to the first through the fifth possible implementations of the first aspect, in a sixth possible implementation, the method further includes:
receiving, by the management node, second modification information from the client, where the second modification information is information for modifying the input channel description information, and adding, modifying, or deleting the source operator in the second flow graph according to the second modification information;
and/or,
receiving, by the management node, third modification information from the client, where the third modification information is information for modifying the output channel description information, and adding, modifying, or deleting the target operator in the second flow graph according to the third modification information.
In summary, with the stream computing method provided in this implementation, the client sends the second modification information and/or the third modification information to the management node, and the management node adds, modifies, or deletes the source operator and/or the target operator in the second flow graph, giving the management node the ability to dynamically adjust the source operator and/or the target operator of the second flow graph after it has been generated.
According to a second aspect, a stream computing apparatus is provided. The apparatus includes at least one unit configured to implement the stream computing method provided in the first aspect or any one of the possible implementations of the first aspect.
According to a third aspect, a management node is provided. The management node includes a processor and a memory; the memory stores one or more instructions that are executed by the processor, and the processor is configured to implement the stream computing method provided in the first aspect or any one of the possible implementations of the first aspect.
According to a fourth aspect, an embodiment of this application provides a computer-readable storage medium that stores an executable program for implementing the stream computing method provided in the first aspect or any one of the possible implementations of the first aspect.
According to a fifth aspect, a stream computing system is provided, including a management node and at least one working node, where the management node is the management node according to the third aspect.
Brief Description of Drawings
FIG. 1 is a schematic structural diagram of a flow graph provided in the prior art;
FIG. 2A is a structural block diagram of a stream computing system according to an embodiment of this application;
FIG. 2B is a structural block diagram of a stream computing system according to another embodiment of this application;
FIG. 3A is a structural block diagram of a management node according to an embodiment of this application;
FIG. 3B is a structural block diagram of a management node according to another embodiment of this application;
FIG. 4 is a schematic diagram of the principle of a stream computing process according to an embodiment of this application;
FIG. 5 is a flowchart of a stream computing method according to an embodiment of this application;
FIG. 6 is a schematic diagram of the principle of a stream computing method according to an embodiment of this application;
FIG. 7 is a flowchart of a stream computing method according to another embodiment of this application;
FIG. 8A is a flowchart of a stream computing method according to another embodiment of this application;
FIG. 8B is a schematic diagram of the principle of a stream computing method according to another embodiment of this application;
FIG. 8C is a flowchart of a stream computing method according to another embodiment of this application;
FIG. 8D is a flowchart of a stream computing method according to another embodiment of this application;
FIG. 8E is a flowchart of a stream computing method according to another embodiment of this application;
FIG. 9A is a schematic diagram of the principle of a stream computing method in a specific implementation according to an embodiment of this application;
FIG. 9B is a schematic diagram of the principle of a stream computing method in a specific implementation according to another embodiment of this application;
FIG. 10 is a structural block diagram of a stream computing apparatus according to another embodiment of this application;
FIG. 11 is a structural block diagram of a stream computing system according to another embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings. FIG. 2A shows a schematic structural diagram of a stream computing system according to an embodiment of this application. Illustratively, the stream computing system is a distributed computing system that includes a terminal 220, a management node 240, and multiple working nodes 260.
The terminal 220 is an electronic device such as a mobile phone, a tablet computer, a laptop computer, or a desktop computer; this embodiment does not limit the hardware form of the terminal 220. A client runs on the terminal 220 and provides the human-machine interaction entry between the user and the distributed computing system. The client is capable of obtaining input channel description information, SQL rules, and output channel description information from the user's input.
Optionally, the client is a native client provided by the distributed computing system, or a client developed by the user.
The terminal 220 is connected to the management node 240 through a wired network, a wireless network, or a dedicated hardware interface.
The management node 240 is one server or a combination of several servers; this embodiment does not limit the hardware form of the management node 240. The management node 240 is the node in the distributed computing system that manages the working nodes 260. Optionally, the management node 240 performs at least one of resource management, active/standby management, application management, and task management for the working nodes 260. Resource management refers to managing the computing resources of the working nodes 260; active/standby management refers to performing active/standby switchover when a working node 260 fails; application management refers to managing at least one stream computing application running on the distributed computing system; task management refers to managing the computing tasks of the operators of a stream computing application. In different stream computing systems, the management node 240 may have different names, for example, master node.
The management node 240 is connected to the working nodes 260 through a wired network, a wireless network, or a dedicated hardware interface.
A working node 260 is one server or a combination of several servers; this embodiment does not limit the hardware form of the working node 260. Optionally, operators of a stream computing application run on the working node 260. Each working node 260 is responsible for the computing tasks of one or more operators; for example, each process on a working node 260 is responsible for the computing task of one operator.
When there are multiple working nodes 260, the working nodes 260 are connected to one another through a wired network, a wireless network, or a dedicated hardware interface.
It can be understood that, in a virtualization scenario, the management node 240 and the working nodes 260 of the stream computing system may also be implemented by virtual machines running on general-purpose hardware. FIG. 2B shows a schematic structural diagram of a stream computing system according to another embodiment of this application. Illustratively, the stream computing system includes a distributed computing platform formed by several computing devices 22; at least one virtual machine runs on each computing device 22, and each virtual machine is a management node 240 or a working node 260.
The management node 240 and a working node 260 are different virtual machines on the same computing device 22 (as shown in FIG. 2B). Optionally, the management node 240 and a working node 260 are different virtual machines on different computing devices 22.
Optionally, more than one working node 260 runs on each computing device 22, and each working node 260 is one virtual machine. The number of working nodes 260 that can run on each computing device 22 depends on the computing capability of the computing device 22.
Optionally, the computing devices 22 are connected to one another through a wired network, a wireless network, or a dedicated hardware interface. Optionally, the dedicated hardware interface is an optical fiber, a cable of a predetermined interface type, or the like.
That is, the embodiments of this application limit neither whether the management node 240 is a physical entity or a logical entity, nor whether a working node 260 is a physical entity or a logical entity. The structure and functions of the management node 240 are further described below.
FIG. 3A shows a structural diagram of the management node 240 according to an embodiment of this application. The management node 240 includes a processor 241, a network interface 242, a bus 243, and a memory 244.
The processor 241 is connected to the network interface 242 and the memory 244 through the bus 243.
The network interface 242 is used to implement communication with the terminal 220 and the working nodes 260.
The processor 241 includes one or more processing cores. The processor 241 implements the management functions of the stream computing system by running an operating system or application program modules.
Optionally, the memory 244 stores an operating system 245 and the application program modules 25 required by at least one function. The application program modules 25 include an obtaining module 251, a generating module 252, an executing module 253, and the like.
The obtaining module 251 is configured to obtain the input channel description information, the SQL statement, and the output channel description information from the client.
The generating module 252 is configured to generate a flow graph according to the input channel description information, the SQL statement, and the output channel description information, where the flow graph defines the computing logic of the operators that execute stream computing tasks and the input/output relationships of data streams between the operators.
The executing module 253 is configured to control the working nodes to execute stream computing tasks according to the flow graph.
In addition, the memory 244 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disc.
A person skilled in the art can understand that the structure shown in FIG. 3A does not constitute a limitation on the management node 240, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
FIG. 3B shows an embodiment of the management node 240 in a virtualization scenario. As shown in FIG. 3B, the management node 240 is a virtual machine (VM) 224 running on a computing device 22. The computing device 22 includes a hardware layer 221, a virtual machine monitor (VMM) 222 running on the hardware layer, and a host Host 223 and several virtual machines VM running on the VMM 222, where the hardware layer 221 includes but is not limited to I/O devices, a central processing unit (CPU), and memory. An executable program runs in the VM; the VM runs the executable program and, during its execution, calls the hardware resources of the hardware layer 221 through the host Host 223 to implement the functions of the obtaining module 251, the generating module 252, and the executing module 253. Specifically, the obtaining module 251, the generating module 252, and the executing module 253 may be included in the executable program in the form of software modules or functions, and the VM 224 runs the executable program by calling resources such as the CPU and memory of the hardware layer 221, thereby implementing the functions of the obtaining module 251, the generating module 252, and the executing module 253.
With reference to FIG. 2A-2B and FIG. 3A-3B, the following introduces the overall process by which the stream computing system performs stream computation. FIG. 4 shows a schematic diagram of the principle of the stream computing process according to an embodiment of this application. The entire stream computing process involves a data production system 41, a stream computing system 42, and a data consumption system 43.
The data production system 41 is used to produce data. Depending on the implementation environment, the data production system 41 may be a financial system, a network monitoring system, a manufacturing system, a Web application system, a sensor detection system, or the like.
Optionally, the storage form of the data produced by the data production system 41 includes but is not limited to at least one of files, network data packets, and databases. The embodiments of this application do not limit the storage form of the data.
Optionally, at the hardware level, the data production system 41 is connected to the stream computing system 42 through hardware lines such as a network, an optical fiber, or a hardware interface card. At the software level, the data production system 41 is connected to the stream computing system 42 through an input channel 411. The input channel 411 is a logical channel through which data streams from the data production system 41 are input into the flow graph in the stream computing system 42; the logical channel bridges the data production system 41 and the stream computing system 42 in terms of transmission path, transmission protocol, data format, data encoding/decoding, and so on.
The stream computing system 42 usually contains a flow graph formed by multiple operators; the flow graph can be regarded as a stream computing application. The flow graph includes a source operator 421, at least one intermediate operator 422, and a destination operator 423. The source operator 421 receives the input data stream from the data production system 41 and sends the input data stream to the intermediate operator 422; the intermediate operator 422 computes on the input data stream and passes the resulting output data stream to the next-level intermediate operator 422 or to the destination operator 423; the destination operator 423 sends the output data stream to the data consumption system 43. These operators are scheduled by the management node shown in FIG. 2 and run in distributed form on the multiple working nodes 260 shown in FIG. 2, with at least one operator running on each working node 260.
Optionally, at the hardware level, the stream computing system 42 is connected to the data consumption system 43 through hardware lines such as a network, an optical fiber, or a hardware interface card. At the software level, the stream computing system 42 is connected to the data consumption system 43 through an output channel 421. The output channel 421 is a logical channel through which the output data stream of the stream computing system 42 is output to the data consumption system 43; the logical channel bridges the stream computing system 42 and the data consumption system 43 in terms of transmission path, transmission protocol, data format, data encoding/decoding, and so on.
The data consumption system 43 utilizes the output data stream computed by the stream computing system 42, storing it persistently or reusing it. For example, the data consumption system 43 is a recommendation system that recommends web pages, text, audio, video, shopping information, and the like of interest to users according to the output data stream.
The flow graph in the stream computing system 42 is generated, deployed, or adjusted by the user through the client 44.
In the embodiments of this application, the stream computing system provides a way of building flow graphs with SQL statements. Illustratively, FIG. 5 shows a flowchart of a stream computing method according to an embodiment of this application. This embodiment is described with the method applied to the management node shown in FIG. 2A-2B and FIG. 3A-3B. The method includes:
Step 501: The management node obtains input channel description information, an SQL statement, and output channel description information from the client.
The user sends the input channel description information, the SQL statement, and the output channel description information to the management node through the client.
The input channel description information defines the input channel; in other words, it describes how the input data stream is input, or describes how the input channel is constructed. The input channel is a logical channel through which data streams from the data production system are input into the flow graph.
Optionally, the input channel description information includes at least one of transmission medium information, transmission path information, data format information, and data decoding information. Illustratively, one piece of input channel description information includes: an Ethernet medium, an Internet Protocol (IP) address and port number, Transmission Control Protocol (TCP) packets, and a default decoding scheme; another piece of input channel description information includes: a file storage path and an Excel file.
The SQL statement defines the computing logic of each operator in the flow graph, as well as each operator's input data streams and output data streams. Optionally, each operator has at least one input data stream and at least one output data stream.
The output channel description information defines the output data stream; in other words, it describes how the output data stream is output, or describes how the output channel is constructed. The output channel is a logical channel through which the output data stream of the flow graph is output to the data consumption system.
Optionally, the output channel description information includes at least one of transmission medium information, transmission path information, data format information, and data encoding information. Illustratively, one piece of output channel description information includes: a file storage path and a CSV file.
The management node receives the input channel description information, the SQL statement, and the output channel description information sent by the client.
Step 502: The management node generates a flow graph according to the input channel description information, the SQL statement, and the output channel description information, where the flow graph defines the computing logic of the operators in the stream computation and the input/output relationships of data streams between the operators.
Optionally, the SQL statement includes several SQL rules, and each SQL rule defines the computing logic of one logical operator and that operator's input data stream and output data stream. Each SQL rule includes at least one SQL sub-statement.
Optionally, each operator has at least one input data stream and at least one output data stream.
Optionally, an executable flow graph includes source operators (Source), intermediate operators, and target operators (Sink). A source operator receives the input data stream from the data production system and feeds the input data stream to an intermediate operator. An intermediate operator computes on the input data stream from a source operator, or on the input data stream from another intermediate operator. A target operator sends the output data stream to the data consumption system according to the computation results from the intermediate operators.
Step 503: The management node controls the working nodes to execute stream computing tasks according to the flow graph.
The management node controls the working nodes in the stream computing system to execute stream computing tasks according to the flow graph. Here, the "flow graph" should be understood as an executable stream application.
Optionally, the management node schedules the generated flow graph to the working nodes for distributed execution; the multiple working nodes perform stream computation on the input data streams from the data production system according to the flow graph, obtain the final output data stream, and output it to the data consumption system.
In summary, with the stream computing method provided in this implementation, the management node generates an executable flow graph according to the input channel description information, the SQL statement, and the output channel description information, and then controls the working nodes to execute stream computation according to the flow graph. This solves the problem that when current stream computing systems build flow graphs from the basic operators provided by an IDE, each basic operator's function is divided at very fine granularity, resulting in poor overall computing performance of the generated flow graph. The stream computing system supports SQL statements for building flow graphs; SQL is a common database management language, so building flow graphs with SQL statements remains very easy to use. Moreover, the user exploits SQL's programming-language features to define the processing logic of the flow graph, and the management node dynamically generates a flow graph with a reasonable number of operators according to the processing logic defined by the SQL statement, thereby improving the flow graph's overall computing performance.
To understand the computing principle of the stream computing method provided in the embodiment of FIG. 5 more clearly, refer to FIG. 6. On the user plane, the user needs to configure the input channel description information 61a, the business-related SQL rules 62a, and the output channel description information 63a. On the management node, the management node introduces the input data stream 61b from the data production system according to the input channel description information, builds the operators 62b in the flow graph from the SQL statement, and sends the output data stream 63b to the data consumption system according to the output channel description information. On the working nodes, the source operator Source, the intermediate operator CEP, and the target operator Sink of the stream computing application generated by the management node are executed.
Step 502 above may be implemented by several more fine-grained steps. In an optional embodiment, step 502 may be replaced by step 502a and step 502b, as shown in FIG. 7:
Step 502a: The management node generates a first flow graph according to the input channel description information, the SQL rules, and the output channel description information, where the first flow graph includes several logical nodes.
Step 502b: The management node divides the logical nodes in the first flow graph to obtain several logical node groups, selects from a preset operator library the common operator corresponding to each logical node group, and generates a second flow graph from the selected common operators; each operator in the second flow graph implements the functions of one or more logical nodes in the logical node group corresponding to that operator.
Optionally, the first flow graph is a temporary flow graph at the logical level, and the second flow graph is an executable flow graph at the code level. The first flow graph is the temporary flow graph obtained by first-layer compilation of the SQL rules in the SQL statement; the second flow graph is the executable flow graph obtained by second-layer compilation of the first flow graph. The operators in the second flow graph can be scheduled by the management node to the working nodes for execution.
After obtaining the input channel description information, the SQL rules, and the output channel description information, the management node first performs first-layer compilation to obtain the first flow graph, which includes a source logical node, several intermediate logical nodes, and a target logical node connected by directed edges. The first flow graph includes a number of logical-level nodes.
Then, the management node divides the logical nodes in the first flow graph and, using the common operators in the preset operator library, performs second-layer compilation on the logical node groups of the first flow graph to obtain the second flow graph; each operator in the second flow graph implements the logical nodes of the first flow graph that were divided into the same logical node group.
A common operator is a preset, general-purpose operator for implementing one function or several functions.
Illustratively, one operator implements the function of one source logical node; or one operator implements the functions of one or more intermediate logical nodes; or one operator implements the function of one target logical node.
Illustratively, one operator implements the functions of one source logical node and one intermediate logical node; or one operator implements the functions of one source logical node and multiple intermediate logical nodes; or one operator implements the functions of multiple intermediate logical nodes; or one operator implements the functions of one intermediate logical node and one destination node; or one operator implements the functions of multiple intermediate logical nodes and one destination node.
When dividing the logical nodes in the first flow graph, the management node may divide them according to at least one of the following factors: load balancing, operator concurrency, the affinity between logical nodes, and the mutual exclusion between logical nodes.
When the management node divides by load balancing, it divides the logical nodes with reference to each operator's computing capability and the computing resources each logical node needs to consume, so that the computing load borne by each operator is relatively balanced. For example, if an operator's computing capability is 100%, logical node A needs to consume 30% of the computing resources, logical node B 40%, logical node C 50%, and logical node D 70%, then logical node A and logical node D are divided into one logical node group, and logical node B and logical node C into another logical node group.
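The load-balanced grouping in the example above (an operator capacity of 100% and logical nodes consuming 30%, 40%, 50%, and 70%) can be reproduced with a simple greedy first-fit-decreasing packing. This is an illustrative sketch, not the patent's algorithm; the function name and data model are assumptions.

```python
# Greedy first-fit-decreasing packing of logical nodes into groups so that
# no group exceeds one operator's capacity (100%). Illustrative only; the
# actual division strategy of the management node is not specified here.

def group_by_load(loads, capacity=100):
    groups = []  # each entry: [total_load, [node names]]
    for name, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        for g in groups:
            if g[0] + load <= capacity:  # node still fits in this group
                g[0] += load
                g[1].append(name)
                break
        else:  # no existing group has room: open a new group
            groups.append([load, [name]])
    return [sorted(g[1]) for g in groups]

# Logical nodes A/B/C/D consume 30/40/50/70 of an operator's capacity.
print(group_by_load({"A": 30, "B": 40, "C": 50, "D": 70}))
# -> [['A', 'D'], ['B', 'C']]
```

The packing yields exactly the grouping described in the text: nodes A and D (30% + 70%) share one operator, and nodes B and C (40% + 50%) share another.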
When the management node divides by operator concurrency, it obtains the data-stream size of each input data stream and determines, according to that size, the number of logical nodes used to process the input data stream, so that the computing speed of every input data stream remains the same or similar.
When the management node divides by the affinity between logical nodes, it computes the affinity between logical nodes according to the type of the input data streams and/or the dependencies between logical nodes, and then divides logical nodes with high affinity into the same logical node group. For example, if input data stream 1 is an input data stream of both logical node A and logical node D, the affinity between A and D is high; dividing A and D into the same logical node group, implemented by the same operator, reduces the data-stream traffic between operators. As another example, if the output data stream of logical node A is the input data stream of logical node B, logical node B depends on logical node A and their affinity is high; dividing A and B into the same logical node group, implemented by the same operator, also reduces the data-stream traffic between operators.
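A minimal sketch of this affinity heuristic: two logical nodes have high affinity when they share an input stream, or when one consumes the other's output. The data model and function name below are hypothetical; the actual affinity computation is not specified at this level of detail.

```python
# Illustrative affinity check between two logical nodes (hypothetical data
# model): affinity is high if the nodes share an input stream, or if one
# node's output stream is the other node's input stream.

def affinity(node_a, node_b):
    shared_inputs = set(node_a["in"]) & set(node_b["in"])
    depends = set(node_a["out"]) & set(node_b["in"]) or set(node_b["out"]) & set(node_a["in"])
    return bool(shared_inputs or depends)

A = {"in": ["stream1"], "out": ["sA"]}
B = {"in": ["sA"], "out": ["sB"]}       # B consumes A's output stream
D = {"in": ["stream1"], "out": ["sD"]}  # D shares stream1 with A

print(affinity(A, B), affinity(A, D))  # -> True True
```

High-affinity pairs such as (A, B) and (A, D) would then be placed in the same logical node group, so their data exchange stays inside one operator instead of crossing operator boundaries.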
When the management node divides by the mutual exclusion between logical nodes, it checks whether the computation logic of any two logical nodes is mutually exclusive; when the computation logic of two logical nodes is mutually exclusive, the two logical nodes are divided into different logical node groups. Since the foundation of a distributed computing system is concurrency and cooperation among operators, mutually exclusive access by multiple operators to shared resources is unavoidable; to avoid access conflicts, two mutually exclusive logical nodes must be divided into different logical node groups.
In summary, with the stream computing method provided in this embodiment, the user only needs to write SQL rules at the logical level. The management node generates the first flow graph, which includes several logical nodes, from the SQL rules; it then divides the logical nodes of the first flow graph using the preset operator library to obtain several logical node groups, and converts each logical node group into one operator of the second flow graph, where each operator implements the logical nodes belonging to the same logical node group. The user therefore needs neither a stream-programming mindset nor knowledge of operator-partitioning logic; writing SQL rules at the logical level suffices to build a flow graph, and the management node generates the operators of the second flow graph by itself, reducing the code-editing work and the complexity for the user when building a stream computing application.
下文采用图8A实施例对上述流计算方法进行更为详细的示例和阐述。
图8A示出了本申请另一个实施例提供的流计算方法的流程图。本实施例以该流计算方法应用于图2所示的流计算系统中来举例说明。该方法包括:
步骤801,管理节点从客户端获取输入通道描述信息、SQL语句和输出通道描述信息;
一、输入通道描述信息用于定义输入通道,输入通道是将来自数据生产系统的数据流输入流图的逻辑通道。
以输入通道描述信息采用XML文件方式为例,一个示意性的输入通道描述信息如下:
(输入通道描述信息的XML示例以附图PCTCN2017094331-appb-000001的形式给出,此处从略。)
本申请实施例对输入通道描述信息的具体形式不加以限定,上述例子仅为示意性说明。
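由于原文的XML示例以图片形式给出、文字不可恢复,下面补充一个纯属假设的片段,仅示意输入通道描述信息可能包含的字段(元素名、地址、端口与流名均为虚构,并非专利所限定的格式):

```xml
<!-- 假设性示例:输入通道描述信息的一种可能XML形式,所有字段名均为虚构 -->
<input-channel name="tcp_channel_edr">
  <protocol>TCP</protocol>
  <address>10.0.0.1</address>
  <port>9000</port>
  <stream>edr_event</stream>
</input-channel>
```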
可选地,来自数据生产系统的输入数据流是TCP或UDP数据流、文件、数据库、分布式文件系统(英文:Hadoop Distributed File System,简称:HDFS)中的数据流等。
二、SQL用于定义流图中每个算子的计算逻辑,以及每个算子的输入数据流和输出数据流。
SQL包括:数据定义语言(英文:Data Definition Language,简称:DDL)和数据操纵语言(英文:Data Manipulation Language,简称:DML)。通过SQL来定义流图中的各个算子时,通常采用DDL语言定义输入数据流和/或输出数据流,比如创建(英文:create)子语句;采用DML语言定义计算逻辑,比如插入(英文:insert into)子语句、选择(英文:select)子语句。
为了定义流图中的多个算子,SQL通常包括多条SQL规则,每条SQL规则包括至少一条SQL子语句,每条SQL规则用于定义流图中的一个逻辑节点。
示意性的,一组典型的SQL规则包括:
Insert into B…
Select…
From A…
Where….
在数据库领域中,Insert into子语句是SQL中向数据表中插入数据的语句,Select子语句是SQL中用于从数据表中选取数据的语句,from子语句是SQL中用于从数据表中读取数据的语句,Where子语句是在需要按照条件从数据表中选取数据时,添加在Select子语句中的条件语句。在上述例子中,输入数据流为A,输出数据流为B。
在本实施例的SQL中,Insert into子语句被转用为用于定义输出数据流的语句,Select子语句被转用为表示计算逻辑的语句,from子语句被转用为用于定义输入数据流的语句,Where子语句被转用为选择数据的语句。
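为直观起见,下面用Python给出一个极简的SQL规则解析草图:仅用正则表达式提取insert into子语句定义的输出数据流与from子语句定义的输入数据流。函数名与规则文本均为示意性假设,远非完整的SQL解析器:

```python
import re

def parse_rule(rule):
    """从一条SQL规则中提取输出数据流(insert into子语句)
    和输入数据流(from子语句)。仅为示意性的简化实现。"""
    out = re.search(r"insert\s+into\s+(\S+)", rule, re.I)
    ins = re.findall(r"from\s+(\S+)", rule, re.I)
    # 去掉形如 temp1.win:time_sliding(...) 的窗口后缀,仅保留流名
    return (out.group(1) if out else None,
            [s.split(".")[0] for s in ins])

rule3 = "insert into temp1 select * from s_edr as a where a.QuotaName='GPRS'"
print(parse_rule(rule3))  # ('temp1', ['s_edr'])
```

解析出的输入、输出数据流名即对应第一流图中该中间逻辑节点的输入边与输出边。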
示意性的,用户输入用于配置一个流图的若干条SQL规则包括如下:
Create stream s_edr(TriggerType uint32,MSISDN string,QuotaName string,QuotaConsumption uint32,QuotaBalance uint32,CaseID uint32)
as select * from tcp_channel_edr.edr_event;//SQL规则1
Create stream s_xdr(MSISDN string,Host string,CaseID uint32,CI uint32,App_Category uint32,App_sub_Category uint32,Up_Thoughput uint32,Down_Thoughput uint32)
as select * from tcp_channel_xdr.xdr_event;//SQL规则2
insert into temp1
select * from s_edr as a
where a.QuotaName='GPRS' and a.QuotaConsumption*10>=a.QuotaBalance*8;//SQL规则3
insert into file_channel_result1.cep_result
select b.*,1 as Fixnum
from s_xdr as a,temp1.win:time_sliding(15sec) as b
where a.MSISDN=b.MSISDN;//SQL规则4
insert into file_channel_result2.cep_result
select MSISDN,App_Category,App_sub_Category,
sum(Up_Thoughput+Down_Thoughput) as Thoughput
from s_xdr.win:time_tumbling(5min)
group by MSISDN,App_Category,App_sub_Category//SQL规则5
其中,SQL规则1的输入数据流是tcp_channel_edr,输出数据流是s_edr;SQL规则2的输入数据流是tcp_channel_xdr,输出数据流是s_xdr;SQL规则3的输入数据流是s_edr,输出数据流是temp1;SQL规则4的输入数据流是s_xdr和temp1,输出数据流是file_channel_result1;SQL规则5的输入数据流是s_xdr,输出数据流是file_channel_result2。
三、输出通道描述信息用于定义输出通道,输出通道是向数据消费系统发送输出数据流的逻辑通道。
以输出通道描述信息采用XML文件方式为例,一个示意性的输出通道描述信息如下:
(输出通道描述信息的XML示例以附图PCTCN2017094331-appb-000002的形式给出,此处从略。)
可选地,发送至数据消费系统的输出数据流是TCP或UDP数据流、文件、数据库、分布式文件系统(英文:Hadoop Distributed File System,简称:HDFS)中的数据流等。
第一流图是包括源逻辑节点、中间逻辑节点和目标逻辑节点的临时性流图。第一流图是逻辑层面的流图。该第一流图的生成过程,可包括如下步骤802至步骤805:
步骤802,管理节点根据输入通道描述信息生成源逻辑节点;
可选地,源逻辑节点用于接收来自数据生产系统的输入数据流。通常,每个源逻辑节点用于接收来自数据生产系统的一个输入数据流。
步骤803,管理节点根据SQL语句中每条SQL规则的选择子语句生成中间逻辑节点;
可选地,对于每个SQL规则,根据该SQL规则中的select子语句所限定的计算逻辑,生成中间逻辑节点。
比如,根据SQL规则1中的select语句,生成用于对输入数据流tcp_channel_edr进行 计算的中间逻辑节点。又比如,根据SQL规则2中的select语句,生成用于对输入数据流tcp_channel_xdr进行计算的中间逻辑节点。
步骤804,管理节点根据输出通道描述信息生成目标逻辑节点;
可选地,目标逻辑节点用于向数据消费系统发送输出数据流。通常,每个目标逻辑节点用于输出一个输出数据流。
步骤805,管理节点根据SQL规则中的输入子语句和输出子语句,生成源逻辑节点与中间逻辑节点、中间逻辑节点与中间逻辑节点、中间逻辑节点与目标逻辑节点之间的有向边。
根据SQL规则中的from子语句,生成与该SQL规则对应的中间逻辑节点的输入边。该输入边的另一端与源逻辑节点相连,或者,该输入边的另一端与其它中间逻辑节点相连。
根据SQL规则中的Insert into子语句,生成与该SQL规则对应的中间逻辑节点的输出边。该输出边的另一端与其它中间逻辑节点相连,或者,该输出边的另一端与目标逻辑节点相连。
对于一个中间逻辑节点来讲,输入边是指向该中间逻辑节点的有向边,输出边是从该中间逻辑节点指向其它中间逻辑节点或目标逻辑节点的有向边。
示意性的,如图8B所示,第一流图包括:第一源逻辑节点81、第二源逻辑节点82、第一中间逻辑节点83、第二中间逻辑节点84、第三中间逻辑节点85、第四中间逻辑节点86、第五中间逻辑节点87、第一目标逻辑节点88和第二目标逻辑节点89。
第一源逻辑节点81的输出数据流tcp_channel_edr,是第一中间逻辑节点83的输入数据流。
第二源逻辑节点82的输出数据流tcp_channel_xdr,是第二中间逻辑节点84的输入数据流。
第一中间逻辑节点83的输出数据流s_edr,是第三中间逻辑节点85的输入数据流。
第三中间逻辑节点85的输出数据流temp1,是第四中间逻辑节点86的输入数据流。
第二中间逻辑节点84的输出数据流s_xdr,是第四中间逻辑节点86的输入数据流。
第二中间逻辑节点84的输出数据流s_xdr,是第五中间逻辑节点87的输入数据流。
第四中间逻辑节点86的输出数据流file_channel_result1,是第一目标逻辑节点88的输入数据流。
第五中间逻辑节点87的输出数据流file_channel_result2,是第二目标逻辑节点89的输入数据流。
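上述各条有向边可以用邻接表直观地表示,并做一个简单的可达性校验。下面的Python草图中的节点命名为示意性假设:

```python
# 用邻接表示意图8B的第一流图:键为上游节点,值为其下游节点列表
flow_graph = {
    "source1": ["mid1"],          # 输出数据流 tcp_channel_edr
    "source2": ["mid2"],          # 输出数据流 tcp_channel_xdr
    "mid1":    ["mid3"],          # 输出数据流 s_edr
    "mid2":    ["mid4", "mid5"],  # 输出数据流 s_xdr 同时流向两个节点
    "mid3":    ["mid4"],          # 输出数据流 temp1
    "mid4":    ["sink1"],         # 输出数据流 file_channel_result1
    "mid5":    ["sink2"],         # 输出数据流 file_channel_result2
}

def reachable(graph, start):
    """返回从 start 出发沿有向边可达的所有节点,用于校验流图连通性。"""
    seen, stack = set(), [start]
    while stack:
        for m in graph.get(stack.pop(), []):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

# 两个源逻辑节点都应能到达目标逻辑节点
assert "sink1" in reachable(flow_graph, "source1")
assert {"sink1", "sink2"} <= reachable(flow_graph, "source2")
```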
需要说明的是,本实施例对步骤802、步骤803和步骤804互相之间的执行先后顺序不加以限定,可选地,上述步骤802、步骤803和步骤804是并行执行的步骤,或者,上述步骤802、步骤803和步骤804是先后串行执行的步骤。
第二流图是一个可执行的流计算应用,第二流图是代码层面的流图。该第二流图的生成过程,可包括如下步骤806至步骤808:
步骤806,管理节点编译公共源算子以得到第二流图中的源算子;
可选地,管理节点根据源逻辑节点在预设算子库中选择出公共源算子,根据公共源算子编译得到第二流图中的源算子;
可选地,预设算子库中设置有一种或多种公共源算子,比如,对应于TCP协议的公共源算子、对应于用户数据报协议(英文:User Datagram Protocol,简称:UDP)的公共源算子、对应于文件类型A的公共源算子、对应于文件类型B的公共源算子、对应于数据库类型A的公共源算子和对应于数据库类型B的公共源算子等。
可选地,管理节点将每个源逻辑节点划分为一个逻辑节点组,每个源逻辑节点实现成为一个源算子。
管理节点根据第一流图中的源逻辑节点,从预设算子库中选择出对应的公共源算子进行编译,能够得到第二流图中的源算子。源算子用于接收来自数据生产系统的输入数据流。
步骤807,管理节点在预设算子库中为每个包括中间逻辑节点的逻辑节点组选择出至少一个公共中间算子,编译选择出的公共中间算子以得到第二流图中的中间算子;
可选地,管理节点将至少一个中间逻辑节点进行划分,得到若干个逻辑节点组;根据划分为同一逻辑节点组中的各个中间逻辑节点,在预设算子库中选择出与该逻辑节点组对应的公共中间算子,根据公共中间算子编译得到第二流图中的中间算子;
可选地,预设算子库中设置有一种或多种公共计算算子,比如,用于实现乘法运算的公共中间算子、用于实现减法运算的公共中间算子、用于实现排序运算的公共中间算子、用于筛选运算的公共中间算子等等。当然,同一个公共中间算子的功能可以为多种,也即,公共中间算子是具有多种计算功能的算子。当同一个公共中间算子的功能为多种时,能够在一个公共中间算子上实现多个逻辑节点。
由于第一流图中每个中间逻辑节点的计算类型和/或计算量不同,管理节点根据负载均衡、并发度要求、各个逻辑节点之间的亲密度和各个逻辑节点之间的互斥性中的至少一个因素将各个中间逻辑节点进行划分,被划分至同一逻辑节点组的各个中间逻辑节点通过预设算子库中的同一个公共中间算子进行编译,得到第二流图中的一个中间算子。
比如,管理节点将两个计算量较少的中间逻辑节点划分为同一组;又比如,管理节点将中间逻辑节点A、中间逻辑节点B、中间逻辑节点C划分为同一组,其中,中间逻辑节点A的输出数据流是中间逻辑节点B的输入数据流,中间逻辑节点B的输出数据流是中间逻辑节点C的输入数据流;再比如,管理节点将具有相同输入数据流的中间逻辑节点A和中间逻辑节点D划分为同一组。
步骤808,管理节点编译公共目标算子以得到第二流图中的目标算子;
可选地,管理节点根据目标逻辑节点在预设算子库中选择出公共目标算子,根据公共目标算子编译得到第二流图中的目标算子。
可选地,预设算子库中设置有一种或多种公共目的算子,比如,对应于TCP协议的公共目的算子、对应于UDP协议的公共目的算子、对应于文件类型A的公共目的算子、对应于文件类型B的公共目的算子、对应于数据库类型A的公共目的算子和对应于数据库类型B的公共目的算子等。
可选地,管理节点将每个目标逻辑节点划分为一个逻辑节点组,每个目标逻辑节点实现成为一个目标算子。
管理节点根据第一流图中的目标逻辑节点,从预设算子库中选择出对应的公共目标算子进行编译,能够得到第二流图中的目标算子。目标算子用于向数据消费系统发送最终的输出数据流。
示意性的,参考图8B所示,将第一流图中的第一源逻辑节点81通过公共源算子进行编译,得到第一源算子source1;将第一流图中的第二源逻辑节点82通过公共源算子进行编译,得到第二源算子source2;将第一流图中的第一中间逻辑节点83至第五中间逻辑节点87划分为同一组,通过同一个公共中间算子进行编译,得到中间算子CEP;将第一流图中的第一目标逻辑节点通过公共目的算子进行编译,得到第一目的算子sink1;将第一流图中的第二目标逻辑节点通过公共目的算子进行编译,得到第二目的算子sink2。
最终,第二流图包括:第一源算子source1、第二源算子source2、中间算子CEP、第一目的算子sink1和第二目的算子sink2。
步骤809,管理节点根据源逻辑节点与中间逻辑节点、中间逻辑节点与中间逻辑节点、中间逻辑节点与目标逻辑节点之间的有向边,生成第二流图中的各个算子之间的有向边。
管理节点根据第一流图中的各个有向边,对应地生成第二流图中的各个算子之间的有向边。
至此,一个可执行的流图被生成。该流图也可被认为是一个流计算应用。
需要说明的是,本实施例对步骤806、步骤807和步骤808互相之间的执行先后顺序不加以限定,可选地,上述步骤806、步骤807和步骤808是并行执行的步骤,或者,上述步骤806、步骤807和步骤808是先后串行执行的步骤。
步骤810,管理节点将第二流图中的各个算子调度至分布式计算系统中的至少一个工作节点中,该工作节点用于执行算子;
分布式计算系统中包括有多个工作节点,管理节点按照自身决策的物理执行计划,将第二流图中的各个算子调度至多个工作节点中进行执行。每个工作节点用于执行至少一个算子。通常情况下,每个工作节点中运行有至少一个进程,每个进程用于执行一个算子。
比如,第一源算子source1被调度至工作节点1、第二源算子source2被调度至工作节点2、中间算子CEP被调度至工作节点3、第一目的算子sink1和第二目的算子sink2被调度至工作节点4。
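该调度过程可以示意为算子到工作节点的一个映射。下面的Python草图采用最简单的轮转策略,仅为示意性假设;实际分派由管理节点的物理执行计划按资源、并发度等因素决策:

```python
def schedule(operators, workers):
    """将算子按轮转方式分派到工作节点,返回 工作节点 -> 算子列表 的映射。
    仅为示意性策略,非专利所述的实际调度算法。"""
    plan = {}
    for i, op in enumerate(operators):
        plan.setdefault(workers[i % len(workers)], []).append(op)
    return plan

ops = ["source1", "source2", "CEP", "sink1", "sink2"]
plan = schedule(ops, ["worker1", "worker2", "worker3", "worker4"])
print(plan)
```

每个工作节点随后为分到的每个算子启动至少一个进程来执行其计算任务。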
为了解耦各个算子之间的数据流引用关系,本实施例还引入了订阅机制。
步骤811,管理节点根据每个算子的输出数据流,生成与该算子对应的订阅发布信息,向该算子配置该订阅发布信息;
订阅发布信息用于指示与当前算子对应的输出数据流的发布方式。
管理节点根据当前算子的输出数据流、第二流图中的有向边和各个工作节点之间的拓扑结构,生成与当前算子对应的订阅发布信息。
比如,第一源算子source1的输出数据流是tcp_channel_edr,在第二流图中与tcp_channel_edr对应的有向边指向中间算子CEP,工作节点1的网络接口3与工作节点3的网络接口4相连,则管理节点生成将输出数据流tcp_channel_edr从工作节点1的网络接口3以预定形式进行发布的订阅发布信息。然后,管理节点将该订阅发布信息下发给位于工作节点1中的第一源算子source1,第一源算子source1根据该订阅发布信息向外发布输出数据流tcp_channel_edr,此时,第一源算子source1并不需要关心下游的算子具体是哪一个算子,也不需要关心下游的算子位于哪一个工作节点,只需要按照订阅发布信息从工作节点1的网络接口3发布输出数据流即可。
步骤812,管理节点根据每个算子的输入数据流,生成与该算子对应的输入流定义信息,向该算子配置输入流定义信息;
输入流定义信息用于指示与当前算子对应的输入数据流的接收方式。
管理节点根据当前算子的输入数据流、第二流图中的有向边和各个工作节点之间的拓扑结构,生成与当前算子对应的输入流定义信息。
比如,中间算子CEP的输入数据流包括tcp_channel_edr,在第二流图中与tcp_channel_edr对应的有向边来源于第一源算子Source1,工作节点1的网络接口3与工作节点3的网络接口4相连,则管理节点生成从网络接口4以预定形式进行接收的输入流定义信息。然后,管理节点将该输入流定义信息下发给位于工作节点3中的中间算子CEP,中间算子CEP根据该输入流定义信息接收输入数据流tcp_channel_edr,此时,中间算子CEP并不需要关心上游的算子具体是哪一个算子,上游的算子具体位于哪一个工作节点,只需要按照输入流定义信息从工作节点3的网络接口4接收输入数据流即可。
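订阅发布机制的解耦效果可以用一个极简的Python草图说明:上游算子只按订阅发布信息向通道发布,下游算子只按输入流定义信息从通道接收,二者互不感知对方的身份与位置。类名与字段均为虚构:

```python
class Channel:
    """示意性的数据通道:按流名转发记录,使上下游算子互相解耦。"""
    def __init__(self):
        self.subscribers = {}  # 流名 -> 回调列表

    def subscribe(self, stream, callback):
        """下游算子按输入流定义信息登记接收方式。"""
        self.subscribers.setdefault(stream, []).append(callback)

    def publish(self, stream, record):
        """上游算子按订阅发布信息对外发布,不感知下游是谁。"""
        for cb in self.subscribers.get(stream, []):
            cb(record)

received = []
bus = Channel()
bus.subscribe("tcp_channel_edr", received.append)   # 中间算子CEP侧
bus.publish("tcp_channel_edr", {"MSISDN": "user-1"})  # source1侧
assert received == [{"MSISDN": "user-1"}]
```

正因为这种解耦,后文对第二流图中算子的动态增加、修改或删除,只需重新配置相应的订阅发布信息和/或输入流定义信息即可。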
步骤813,工作节点对第二流图中的各个算子进行执行;
各个工作节点根据管理节点的调度,对第二流图中的各个算子进行执行。比如,每个进程用于负责一个算子的计算任务。
综上所述,本实施例提供的流计算方法,通过由管理节点根据输入通道描述信息、SQL语句和输出通道描述信息生成可执行的流图,然后由管理节点根据流图控制工作节点执行流计算;解决了目前的流计算系统通过IDE提供的基本算子来构建流图时,每个基本算子的功能被划分为非常细的粒度,导致生成的流图的整体计算性能较差的问题;达到了流计算系统支持SQL语句来构建流图,SQL是较为常见的数据库管理语言,用户使用SQL语句来构建流图仍然非常易用的效果,另一方面,由用户利用SQL语言的编程语言特性,采用SQL语句对流图的处理逻辑进行定义,由管理节点按照SQL语句所定义的处理逻辑动态生成具有合理数量的算子的流图,从而提高流图的整体计算性能。
还通过由管理节点将第一流图中的多个逻辑节点进行划分,将划分为同一组的各个逻辑节点通过同一个公共中间算子进行实现,不需要用户考虑负载均衡、并发执行、亲密度和互斥性等因素,由管理节点自行决策负载均衡、并发执行、亲密度和互斥性等因素来进行第二流图的生成,进一步地降低了用户生成第二流图时的难度,只需要用户具有通过SQL构建逻辑层面的第一流图的能力即可。
还通过设置订阅机制,将第二流图中的各个算子的输入数据流和输出数据流之间的引用关系解耦,提供了在第二流图被执行后,用户仍然可以在流计算系统中动态调整第二流图中的各个算子的能力,提高了流计算应用的整体易用性和可维护性。
当第二流图在流计算系统中被执行以后,随着业务功能在实际使用场景中的变更和调整,已经执行的第二流图也需要进行改变才能适应新的需求。与现有技术中通常需要重新构建第二流图不同的是,本申请实施例提供了对已执行的第二流图进行动态修改的能力,具体参考如下图8C至图8E。
在第二流图被执行以后,用户还可对第二流图中的中间算子进行修改。如图8C所示:
步骤814,客户端向管理节点发送第一修改信息;
第一修改信息是对SQL规则进行修改的信息;或者说,第一修改信息携带有修改后的SQL规则。
若第二流图中的中间算子需要修改,则客户端向管理节点发送用于对SQL规则进行修改的第一修改信息。
步骤815,管理节点接收来自客户端的第一修改信息;
步骤816,管理节点根据第一修改信息对第二流图中的中间算子进行增加、修改或删除。
可选地,对一个原有的中间算子,替换为一个新的中间算子的修改过程,可以是将原有的中间算子进行删除,再增加新的中间算子。
步骤817,管理节点向修改后的中间算子重新配置订阅发布信息和/或输入流定义信息;
可选地,若修改后的中间算子的输入数据流是新增的数据流或发生改变的数据流,则管理节点还需要重新向该中间算子配置输入流定义信息。
若修改后的中间算子的输出数据流是新增的数据流或发生改变的数据流,则管理节点还需要重新向该中间算子配置订阅发布信息。
综上所述,本实施例提供的流计算方法,通过客户端向管理节点发送第一修改信息,由管理节点根据第一修改信息对第二流图中的中间算子进行增加、修改或删除,为管理节点提供了动态调整第二流图中的中间算子的能力。
在第二流图被执行以后,用户还可对第二流图中的源算子进行修改。如图8D所示:
步骤818,客户端向管理节点发送第二修改信息;
第二修改信息是对输入通道描述信息进行修改的信息;或者说,第二修改信息携带有修改后的输入通道描述信息。
若第二流图中的源算子需要修改,则客户端向管理节点发送用于对输入通道描述信息进行修改的第二修改信息。
步骤819,管理节点接收来自客户端的第二修改信息;
步骤820,管理节点根据第二修改信息对第二流图中的源算子进行增加、修改或删除。
可选地,对一个原有的源算子,替换为一个新的源算子的修改过程,可以是将原有的源算子进行删除,再增加新的源算子。
步骤821,管理节点向修改后的源算子重新配置订阅发布信息。
可选地,若修改后的源算子的输出数据流是新增的数据流或发生改变的数据流,则管理节点还需要重新向该源算子配置订阅发布信息。
综上所述,本实施例提供的流计算方法,通过客户端向管理节点发送第二修改信息,由管理节点根据第二修改信息对第二流图中的源算子进行增加、修改或删除,为管理节点提供了动态调整第二流图中的源算子的能力。
在第二流图被执行以后,用户还可对第二流图中的目标算子进行修改。如图8E所示:
步骤822,客户端向管理节点发送第三修改信息;
第三修改信息是对输出通道描述信息进行修改的信息;或者说,第三修改信息携带有修改后的输出通道描述信息。
若第二流图中的目标算子需要修改,则客户端向管理节点发送用于对输出通道描述信息进行修改的第三修改信息。
步骤823,管理节点接收来自客户端的第三修改信息;
步骤824,管理节点根据第三修改信息对第二流图中的目标算子进行增加、修改或删除。
可选地,对一个原有的目标算子,替换为一个新的目标算子的修改过程,可以是将原有的目标算子进行删除,再增加新的目标算子。
步骤825,管理节点向修改后的目标算子重新配置输入流定义信息。
可选地,若修改后的目标算子的输入数据流是新增的数据流或发生改变的数据流,则管理节点还需要重新向该目标算子配置输入流定义信息。
综上所述,本实施例提供的流计算方法,通过客户端向管理节点发送第三修改信息,由管理节点根据第三修改信息对第二流图中的目标算子进行增加、修改或删除,为管理节点提供了动态调整第二流图中的目标算子的能力。
在一个具体的实施例中,如图9A所示,流计算系统向用户提供两种客户端:由流计算系统提供的原生客户端92,和,由用户二次开发的客户端94。原生客户端92和二次开发的客户端94都提供有SQL应用程序编程接口(英文:Application Programming Interface,简称:API),该SQL API用于实现使用SQL语言来定义流图的功能。用户在原生客户端92或二次开发的客户端94录入输入/输出通道描述信息和SQL语句,原生客户端92或二次开发的客户端94将输入/输出通道描述信息和SQL语句发送给管理节点Master,也即图中的步骤1。
管理节点Master通过App连接服务与原生客户端92或二次开发的客户端94建立连接。管理节点Master获取输入/输出通道描述信息和SQL语句,由SQL引擎96根据输入/输出通道描述信息和SQL语句生成可执行的流图,也即图中的步骤2。
管理节点Master还包括流平台执行框架管理模块98,该流平台执行框架管理模块98用于实现资源管理、应用管理、主备管理和任务管理等管理事务。对于SQL引擎96所生成的一个可执行流图。由流平台执行框架管理模块98规划和决策该流图在各个工作节点Worker上的执行计划,也即图中的步骤3。
各个工作节点Worker上的处理单元集(英文:processing element container,简称:PEC)包括多个处理单元PE,每个PE用于调用可执行流图中的一个源算子Source、或者一个中间算子CEP、或者一个目标算子Sink。通过各个PE之间的协作,对可执行流图中的各个算子进行处理。
图9B示出了本申请一个实施例提供的SQL引擎96在具体实施时的原理示意图。SQL引擎96在获取到输入/输出通道描述信息和SQL语句后,执行如下几个过程:
1、SQL引擎96对SQL语句中的各个SQL规则进行解析;2、SQL引擎96根据解析结果生成临时的第一流图;3、SQL引擎96对第一流图中的各个逻辑节点按照负载均衡、亲密度和互斥性等因素进行划分,得到至少一个逻辑节点组,每个逻辑节点组包括一个或多个逻辑节点;4、SQL引擎96进行算子并发度计算的模拟,按照算子并发度计算的模拟结果对各个逻辑节点组进行调整;5、SQL引擎96根据调整后的各个逻辑节点组生成第二流图,属于同一个逻辑节点组中的各个逻辑节点被分配至第二流图中的一个可执行算子;6、SQL引擎96对第二流图中的各个可执行算子进行解析,分析每个算子的运算要求等信息;7、SQL引擎96对第二流图中的各个可执行算子生成逻辑执行计划;8、SQL引擎96对第二流图的逻辑执行计划进行代码编辑优化,生成物理执行计划;9、SQL引擎96向流平台执行框架管理模块98发送物理执行计划,由流平台执行框架管理模块98按照物理执行计划进行流计算应用的执行。
其中,步骤1至步骤5属于一层编译过程,步骤6至步骤9属于二层编译过程。
以下为本申请的装置实施例,该装置实施例与上述方法实施例对应,装置实施例中未详细阐述的细节,可参考上述方法实施例中的描述。
图10示出了本申请一个实施例提供的流计算装置的结构方框图。该流计算装置可以通过专用硬件电路,或者,软硬件的组合实现成为管理节点240的全部或一部分。该流计算装置包括:获取单元1020、生成单元1040和执行单元1060。
获取单元1020,用于实现上述步骤501、步骤801的功能;
生成单元1040,用于实现上述步骤502、步骤502a、步骤502b、步骤802至步骤808的功能;
执行单元1060,用于实现上述步骤503、步骤810至步骤812的功能。
可选地,该装置还包括:修改单元1080;
修改单元1080,用于实现上述步骤815至步骤825的功能。
相关细节可结合参考图5、图6、图7、图8A、图8B、图8C、图8D和图8E所述的方法实施例。
可选地,上述获取单元1020通过管理节点240的网络接口242以及处理器241执行存储器244中的获取模块251来实现。该网络接口242是以太网网卡、光纤收发器、通用串行总线(英文:Universal Serial Bus,简称:USB)接口或者其它I/O接口。
可选地,上述生成单元1040通过管理节点240的处理器241执行存储器244中的生成模块252来实现。该生成单元1040所生成的流图是由多个算子所形成的可执行的分布式流计算应用,该分布式流计算应用中的每个算子可分派至不同的工作节点去执行。
可选地,上述执行单元1060通过管理节点240的网络接口242以及处理器241执行存储器244中的执行模块253来实现。该网络接口242是以太网网卡、光纤收发器、USB接口或者其它I/O接口。换句话说,处理器241将流图中的各个算子通过网络接口242分派至不同的工作节点,然后由各个工作节点执行该算子的数据计算。
可选地,上述修改单元1080通过管理节点240的处理器241执行存储器244中的修改模块(图中未示出)来实现。
需要说明的是:上述实施例提供的流计算装置在生成流图并进行流计算时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的流计算装置与流计算方法的方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图11示出了本申请一个实施例提供的流计算系统的结构方框图。该流计算系统包括:终端1120、管理节点1140和工作节点1160。
终端1120,用于执行上述方法实施例中由终端或客户端所执行的步骤。
管理节点1140,用于执行上述方法实施例中由管理节点所执行的步骤。
工作节点1160,用于执行上述方法实施例中由工作节点所执行的步骤。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。

Claims (17)

  1. 一种流计算方法,其特征在于,应用于包括管理节点和工作节点的流计算系统中,所述方法包括:
    所述管理节点从客户端获取输入通道描述信息、结构化查询语言SQL语句和输出通道描述信息;
    所述管理节点根据所述输入通道描述信息、所述SQL语句和所述输出通道描述信息生成流图,所述流图用于定义执行流计算任务的多个算子的计算逻辑以及所述多个算子之间数据流的输入输出关系;所述管理节点根据所述流图控制所述工作节点执行所述流计算任务;
    其中,所述输入通道描述信息用于定义输入通道,所述输入通道是用于将来自数据生产系统的数据流输入所述流图的逻辑通道;所述输出通道描述信息用于定义输出通道,所述输出通道是用于将所述流图的输出数据流输出至数据消费系统的逻辑通道。
  2. 根据权利要求1所述的方法,其特征在于,所述SQL语句包括若干条SQL规则,每条SQL规则包括至少一条SQL子语句;
    所述管理节点根据所述输入通道描述信息、所述SQL语句和所述输出通道描述信息生成流图,包括:
    所述管理节点根据所述输入通道描述信息、所述若干条SQL规则和所述输出通道描述信息生成第一流图,所述第一流图包括若干个逻辑节点;
    所述管理节点将所述第一流图中的各个逻辑节点进行划分,以得到若干个逻辑节点组;在预设算子库中选择每个所述逻辑节点组对应的公共算子,并根据选择出的所述公共算子生成第二流图;所述第二流图中的每个算子用于实现所述算子对应的逻辑节点组中的一个或多个逻辑节点的功能。
  3. 根据权利要求2所述的方法,其特征在于,所述第一流图包括通过有向边相连的源逻辑节点、中间逻辑节点和目标逻辑节点;所述第二流图包括通过有向边相连的源算子、中间算子和目标算子。
  4. 根据权利要求3所述的方法,其特征在于,所述管理节点根据所述输入通道描述信息、所述若干条SQL规则和所述输出通道描述信息生成第一流图,包括:
    所述管理节点根据所述输入通道描述信息生成所述第一流图中的所述源逻辑节点,所述源逻辑节点用于接收来自所述数据生产系统的输入数据流;
    所述管理节点根据每条所述SQL规则中的选择子语句生成所述第一流图中的所述中间逻辑节点,所述中间逻辑节点用于指示对所述输入数据流进行计算时的计算逻辑,每个中间逻辑节点对应一条SQL规则;
    所述管理节点根据所述输出通道描述信息生成所述第一流图中的目标逻辑节点,所述目标逻辑节点用于向所述数据消费系统发送输出数据流;
    所述管理节点根据每条所述SQL规则中的输入子语句和/或输出子语句,生成所述源逻辑节点、所述中间逻辑节点以及所述目标逻辑节点之间的有向边。
  5. 根据权利要求3或4所述的方法,其特征在于,所述预设算子库包括:公共源算子、公共中间算子和公共目标算子;
    所述在预设算子库中选择每个所述逻辑节点组对应的公共算子,并根据选择出的所述公共算子生成第二流图包括:
    编译所述公共源算子以得到所述第二流图中的源算子;
    在所述预设算子库中为每个包括所述中间逻辑节点的所述逻辑节点组选择出至少一个公共中间算子,编译选择出的所述公共中间算子以得到所述第二流图中的中间算子;
    编译所述公共目标算子以得到所述第二流图中的目标算子;
    根据所述源逻辑节点、所述中间逻辑节点与所述目标逻辑节点之间的有向边,生成所述第二流图中的各个算子之间的有向边。
  6. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    所述管理节点接收来自所述客户端的第一修改信息,所述第一修改信息是对所述SQL规则进行修改的信息;
    所述管理节点根据所述第一修改信息对所述第二流图中对应的所述中间算子进行增加、修改或删除。
  7. 根据权利要求5或6所述的方法,其特征在于,所述方法还包括:
    所述管理节点接收来自所述客户端的第二修改信息,所述第二修改信息是对所述输入通道描述信息进行修改的信息;根据所述第二修改信息对所述第二流图中的所述源算子进行增加、修改或删除;
    和/或,
    所述管理节点接收来自所述客户端的第三修改信息,所述第三修改信息是对所述输出通道描述信息进行修改的信息;根据所述第三修改信息对所述第二流图中的所述目标算子进行增加、修改或删除。
  8. 根据权利要求2至5任一所述的方法,其特征在于,所述流计算系统包括多个工作节点,所述管理节点根据所述流图控制所述工作节点执行所述流计算任务,包括:
    所述管理节点将所述第二流图中的各个所述算子调度至所述流计算系统中的至少一个工作节点中,所述至少一个工作节点用于执行所述算子;
    所述管理节点根据每个所述算子的所述输出数据流,生成与所述算子对应的订阅发布信息,向所述算子配置所述订阅发布信息;
    所述管理节点根据每个所述算子的所述输入数据流,生成与所述算子对应的输入流定义信息,向所述算子配置所述输入流定义信息;
    其中,所述订阅发布信息用于指示与当前算子对应的输出数据流的发送方式;所述输入流定义信息用于指示与当前算子对应的输入数据流的接收方式。
  9. 一种流计算装置,其特征在于,所述装置包括:
    获取单元,用于从客户端获取输入通道描述信息、结构化查询语言SQL语句和输出通道描述信息;
    生成单元,用于根据所述输入通道描述信息、所述SQL语句和所述输出通道描述信息生成流图,所述流图用于定义执行流计算任务的多个算子的计算逻辑以及所述多个算子之间数据流的输入输出关系;
    执行单元,根据所述流图控制流计算系统中工作节点执行所述流计算任务;
    其中,所述输入通道描述信息用于定义输入通道,所述输入通道是用于将来自数据生产系统的数据流输入所述流图的逻辑通道;所述输出通道描述信息用于定义输出通道,所述输出通道是用于将所述流图的输出数据流输出至数据消费系统的逻辑通道。
  10. 根据权利要求9所述的装置,其特征在于,所述SQL语句包括若干条SQL规则,每条SQL规则包括至少一条SQL子语句;
    所述生成单元,用于根据所述输入通道描述信息、所述若干条SQL规则和所述输出通道描述信息生成第一流图,所述第一流图包括若干个逻辑层面的节点;
    所述生成单元,还用于将所述第一流图中的各个逻辑节点进行划分,以得到若干个逻辑节点组;在预设算子库中选择每个所述逻辑节点组对应的公共算子,根据选择出的所述公共算子生成第二流图;所述第二流图中的每个算子用于实现该算子对应的逻辑节点组中的一个或多个逻辑节点的功能。
  11. 根据权利要求10所述的装置,其特征在于,所述第一流图包括通过有向边相连的源逻辑节点、中间逻辑节点和目标逻辑节点;所述第二流图包括通过有向边相连的源算子、中间算子和目标算子。
  12. 根据权利要求11所述的装置,其特征在于,
    所述生成单元,具体用于根据所述输入通道描述信息生成所述第一流图中的所述源逻辑节点,所述源逻辑节点用于接收来自所述数据生产系统的输入数据流;根据每条所述SQL规则中的选择子语句生成所述第一流图中的所述中间逻辑节点,所述中间逻辑节点用于指示对所述输入数据流进行计算时的计算逻辑,每个中间逻辑节点对应一条SQL规则;根据所述输出通道描述信息生成所述第一流图中的目标逻辑节点,所述目标逻辑节点用于向所述数据消费系统发送输出数据流;根据每条所述SQL规则中的输入子语句和/或输出子语句,生成所述源逻辑节点、所述中间逻辑节点、以及所述目标逻辑节点之间的有向边。
  13. 根据权利要求11或12所述的装置,其特征在于,所述预设算子库包括:公共源算子、公共中间算子和公共目标算子;
    所述生成单元,具体用于编译所述公共源算子以得到所述第二流图中的源算子;在所述预设算子库中为每个包括所述中间逻辑节点的所述逻辑节点组选择出至少一个公共中间算子,编译选择出的所述公共中间算子以得到所述第二流图中的中间算子;编译所述公共目标算子以得到所述第二流图中的目标算子;根据所述源逻辑节点、所述中间逻辑节点与所述目标逻辑节点之间的有向边,生成所述第二流图中的各个算子之间的有向边。
  14. 根据权利要求13所述的装置,其特征在于,所述装置还包括:修改单元;
    所述获取单元,还用于接收来自所述客户端的第一修改信息,所述第一修改信息是对所述SQL规则进行修改的信息;
    所述修改单元,用于根据所述第一修改信息对所述第二流图中对应的所述中间算子进行增加、修改或删除。
  15. 根据权利要求13或14所述的装置,其特征在于,所述装置还包括:修改单元;
    所述获取单元,还用于接收来自所述客户端的第二修改信息,所述第二修改信息是对所述输入通道描述信息进行修改的信息;所述修改单元,用于根据所述第二修改信息对所述第二流图中的所述源算子进行增加、修改或删除;
    和/或,
    所述获取单元,还用于接收来自所述客户端的第三修改信息,所述第三修改信息是对所述输出通道描述信息进行修改的信息;所述修改单元,用于根据所述第三修改信息对所述第二流图中的所述目标算子进行增加、修改或删除。
  16. 根据权利要求10至13任一所述的装置,其特征在于,所述执行单元,用于将所述第二流图中的各个所述算子调度至所述流计算系统中的至少一个工作节点中,所述工作节点用于执行所述算子;根据每个所述算子的所述输出数据流,生成与所述算子对应的订阅发布信息,向所述算子配置所述订阅发布信息;根据每个所述算子的所述输入数据流,生成与所述算子对应的输入流定义信息,向所述算子配置所述输入流定义信息;
    其中,所述订阅发布信息用于指示与当前算子对应的输出数据流的发送方式;所述输入流定义信息用于指示与当前算子对应的输入数据流的接收方式。
  17. 一种流计算系统,其特征在于,所述系统包括:管理节点和至少一个工作节点;
    所述管理节点包括如权利要求9至16任一所述的流计算装置。
PCT/CN2017/094331 2016-07-29 2017-07-25 流计算方法、装置及系统 WO2018019232A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP21192261.2A EP3975004A1 (en) 2016-07-29 2017-07-25 Stream computing method, apparatus, and system
EP17833535.2A EP3483740B1 (en) 2016-07-29 2017-07-25 Method, device and system for stream computation
US16/261,014 US11132402B2 (en) 2016-07-29 2019-01-29 Stream computing method, apparatus, and system
US17/486,066 US20220012288A1 (en) 2016-07-29 2021-09-27 Stream Computing Method, Apparatus, and System

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610617253.2A CN107678790B (zh) 2016-07-29 2016-07-29 流计算方法、装置及系统
CN201610617253.2 2016-07-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/261,014 Continuation US11132402B2 (en) 2016-07-29 2019-01-29 Stream computing method, apparatus, and system

Publications (1)

Publication Number Publication Date
WO2018019232A1 true WO2018019232A1 (zh) 2018-02-01

Family

ID=61015659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/094331 WO2018019232A1 (zh) 2016-07-29 2017-07-25 流计算方法、装置及系统

Country Status (4)

Country Link
US (2) US11132402B2 (zh)
EP (2) EP3483740B1 (zh)
CN (1) CN107678790B (zh)
WO (1) WO2018019232A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020041872A1 (en) * 2018-08-30 2020-03-05 Streamworx.Ai Inc. Systems, methods and computer program products for scalable, low-latency processing of streaming data
CN112099788A (zh) * 2020-09-07 2020-12-18 北京红山信息科技研究院有限公司 一种可视化数据开发方法、系统、服务器和存储介质

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363615B (zh) * 2017-09-18 2019-05-14 清华大学 用于可重构处理系统的任务分配方法和系统
US10860576B2 (en) 2018-01-26 2020-12-08 Vmware, Inc. Splitting a query into native query operations and post-processing operations
US11144570B2 (en) 2018-01-26 2021-10-12 Vmware, Inc. Data ingestion by distributed-computing systems
US11016971B2 (en) 2018-01-26 2021-05-25 Vmware, Inc. Splitting a time-range query into multiple sub-queries for parallel execution
US11016972B2 (en) 2018-01-26 2021-05-25 Vmware, Inc. Splitting a time-range query into multiple sub-queries for serial execution
US11178213B2 (en) * 2018-02-28 2021-11-16 Vmware, Inc. Automated configuration based deployment of stream processing pipeline
US10812332B2 (en) 2018-02-28 2020-10-20 Vmware Inc. Impartial buffering in stream processing
US10824623B2 (en) 2018-02-28 2020-11-03 Vmware, Inc. Efficient time-range queries on databases in distributed computing systems
CN109345377B (zh) * 2018-09-28 2020-03-27 北京九章云极科技有限公司 一种数据实时处理系统及数据实时处理方法
CN109840305B (zh) * 2019-03-26 2023-07-18 中冶赛迪技术研究中心有限公司 一种蒸汽管网水力-热力计算方法及系统
CN112801439A (zh) * 2019-11-14 2021-05-14 深圳百迈技术有限公司 任务管理方法及装置
CN110908641B (zh) * 2019-11-27 2024-04-26 中国建设银行股份有限公司 基于可视化的流计算平台、方法、设备和存储介质
CN111127077A (zh) * 2019-11-29 2020-05-08 中国建设银行股份有限公司 一种基于流计算的推荐方法和装置
CN111104214B (zh) * 2019-12-26 2020-12-15 北京九章云极科技有限公司 一种工作流应用方法及装置
US11792014B2 (en) * 2020-03-16 2023-10-17 Uatc, Llc Systems and methods for vehicle message signing
CN111324345A (zh) * 2020-03-19 2020-06-23 北京奇艺世纪科技有限公司 数据处理方式生成方法、数据处理方法、装置及电子设备
CN113806429A (zh) * 2020-06-11 2021-12-17 深信服科技股份有限公司 基于大数据流处理框架的画布式日志分析方法
CN111679916B (zh) * 2020-08-11 2020-11-27 北京搜狐新媒体信息技术有限公司 一种视频推荐方法、目标服务提供端、服务调用端及系统
CN112181477B (zh) * 2020-09-02 2024-05-10 广州市双照电子科技有限公司 复杂事件处理方法、装置及终端设备
CN112417014A (zh) * 2020-11-16 2021-02-26 杭州安恒信息技术股份有限公司 动态修改执行计划方法、系统、计算机可读存储介质
US20220300417A1 (en) * 2021-03-19 2022-09-22 Salesforce.Com, Inc. Concurrent computation on data streams using computational graphs
CN113434282B (zh) * 2021-07-20 2024-03-26 支付宝(杭州)信息技术有限公司 流计算任务的发布、输出控制方法及装置
CN114296947B (zh) * 2022-03-09 2022-07-08 四川大学 一种面向复杂场景的多计算模型管理方法
CN116501805A (zh) * 2023-06-29 2023-07-28 长江三峡集团实业发展(北京)有限公司 一种流数据系统、计算机设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101192239A (zh) * 2006-12-01 2008-06-04 国际商业机器公司 实现用于集成系统的统一模型的系统和方法
CN102915303A (zh) * 2011-08-01 2013-02-06 阿里巴巴集团控股有限公司 一种etl测试的方法和装置
CN104866310A (zh) * 2015-05-20 2015-08-26 百度在线网络技术(北京)有限公司 知识数据的处理方法和系统
CN105786808A (zh) * 2014-12-15 2016-07-20 阿里巴巴集团控股有限公司 一种用于分布式执行关系型计算指令的方法与设备

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9076448B2 (en) 1999-11-12 2015-07-07 Nuance Communications, Inc. Distributed real time speech recognition system
US7103590B1 (en) * 2001-08-24 2006-09-05 Oracle International Corporation Method and system for pipelined database table functions
US7383246B2 (en) 2003-10-31 2008-06-03 International Business Machines Corporation System, method, and computer program product for progressive query processing
US7310638B1 (en) * 2004-10-06 2007-12-18 Metra Tech Method and apparatus for efficiently processing queries in a streaming transaction processing system
US8396886B1 (en) * 2005-02-03 2013-03-12 Sybase Inc. Continuous processing language for real-time data streams
US9727604B2 (en) * 2006-03-10 2017-08-08 International Business Machines Corporation Generating code for an integrated data system
US8352456B2 (en) * 2007-05-11 2013-01-08 Microsoft Corporation Producer/consumer optimization
US7676461B2 (en) * 2007-07-18 2010-03-09 Microsoft Corporation Implementation of stream algebra over class instances
US8725707B2 (en) * 2009-03-26 2014-05-13 Hewlett-Packard Development Company, L.P. Data continuous SQL process
JP5395565B2 (ja) * 2009-08-12 2014-01-22 株式会社日立製作所 ストリームデータ処理方法及び装置
US9037571B1 (en) * 2013-03-12 2015-05-19 Amazon Technologies, Inc. Topology service using closure tables and metagraphs
US9405854B2 (en) * 2013-12-16 2016-08-02 Sybase, Inc. Event stream processing partitioning
CN103870340B (zh) * 2014-03-06 2017-11-07 华为技术有限公司 流计算系统中的数据处理方法、控制节点及流计算系统
CN103970602B (zh) * 2014-05-05 2017-05-10 华中科技大学 一种面向x86多核处理器的数据流程序调度方法
CN104199831B (zh) * 2014-07-31 2017-10-24 深圳市腾讯计算机系统有限公司 信息处理方法及装置
JP6516110B2 (ja) * 2014-12-01 2019-05-22 日本電気株式会社 SQL−on−Hadoopシステムにおける複数クエリ最適化
CN104504143B (zh) * 2015-01-04 2017-12-29 华为技术有限公司 一种流图优化方法及其装置
CN105404690B (zh) * 2015-12-16 2019-06-21 华为技术服务有限公司 查询数据库的方法和装置

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101192239A (zh) * 2006-12-01 2008-06-04 国际商业机器公司 实现用于集成系统的统一模型的系统和方法
CN102915303A (zh) * 2011-08-01 2013-02-06 阿里巴巴集团控股有限公司 一种etl测试的方法和装置
CN105786808A (zh) * 2014-12-15 2016-07-20 阿里巴巴集团控股有限公司 一种用于分布式执行关系型计算指令的方法与设备
CN104866310A (zh) * 2015-05-20 2015-08-26 百度在线网络技术(北京)有限公司 知识数据的处理方法和系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3483740A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020041872A1 (en) * 2018-08-30 2020-03-05 Streamworx.Ai Inc. Systems, methods and computer program products for scalable, low-latency processing of streaming data
US11146467B2 (en) 2018-08-30 2021-10-12 Streamworx.Ai Inc. Systems, methods and computer program products for scalable, low-latency processing of streaming data
US11425006B2 (en) 2018-08-30 2022-08-23 Streamworx.Ai Inc. Systems, methods and computer program products for scalable, low-latency processing of streaming data
CN112099788A (zh) * 2020-09-07 2020-12-18 北京红山信息科技研究院有限公司 一种可视化数据开发方法、系统、服务器和存储介质

Also Published As

Publication number Publication date
US11132402B2 (en) 2021-09-28
EP3483740A4 (en) 2019-05-15
US20220012288A1 (en) 2022-01-13
US20190155850A1 (en) 2019-05-23
EP3483740A1 (en) 2019-05-15
CN107678790A (zh) 2018-02-09
CN107678790B (zh) 2020-05-08
EP3975004A1 (en) 2022-03-30
EP3483740B1 (en) 2021-09-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17833535

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017833535

Country of ref document: EP

Effective date: 20190211