WO2016206567A1 - Distributed stream computing system, method and apparatus - Google Patents

Distributed stream computing system, method and apparatus

Info

Publication number
WO2016206567A1
WO2016206567A1 PCT/CN2016/086105 CN2016086105W WO2016206567A1 WO 2016206567 A1 WO2016206567 A1 WO 2016206567A1 CN 2016086105 W CN2016086105 W CN 2016086105W WO 2016206567 A1 WO2016206567 A1 WO 2016206567A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
data
processing
module
dag
Prior art date
Application number
PCT/CN2016/086105
Other languages
English (en)
French (fr)
Inventor
魏蒲萌
李闪
段培乐
喻奎
孙敬
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2016206567A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present invention belongs to the field of Internet technologies and, in particular, relates to a distributed stream computing system, method and apparatus.
  • Stream computing is an important part of the current data processing field. A traditional data processing system first stores data on a hard disk or in another storage service and then processes it, whereas stream computing processes incoming data in real time and reflects the value of the data in real time. It is generally believed that the value of a data stream decreases as time passes.
  • The more mature stream processing systems include Yahoo's S4 (open source), Twitter's Storm (open source), Google's MillWheel, and Amazon's Kinesis.
  • A user's stream computing needs depend on the user's own data processing logic.
  • A Spout node (which can be understood as a message source) sends a message stream (Stream) to the next-level Bolt nodes.
  • Each level of Bolt node implements the processing logic for messages, such as filtering and aggregation.
  • The calculation logic of the Bolt nodes (and the data generation logic of the Spout node) is implemented by the user through the interfaces provided by Storm.
  • The structure of a Storm topology is shown in FIG. 1. As with Storm, in stream processing systems such as Kinesis the processing logic must also be implemented by the user.
  • The present application provides a distributed stream computing system, method and apparatus that solve the technical problem in the prior art that a user of a stream computing system must implement the stream computing processing logic himself.
  • The present application discloses a distributed stream computing system, including a first node and second nodes. The first node converts the input offline SQL operation logic into a DAG (Directed Acyclic Graph), which represents the logical relationship between the operators in the offline SQL operation logic. The first node divides the DAG into multiple parts according to the logical relationship between the operators and allocates them to a corresponding plurality of second nodes, and the plurality of second nodes form a plurality of levels according to the allocated partial DAGs. The plurality of second nodes receive the real-time data stream and complete the stream computing processing level by level according to the DAG.
  • Dividing the DAG into a plurality of parts according to the logical relationship between the operators, allocating them to a corresponding plurality of second nodes, and forming the plurality of levels according to the allocated partial DAGs includes: determining, in the logical relationship between the operators, the position at which shuffling of the data has been completed, dividing the DAG into corresponding parts according to that position, and allocating them to a plurality of second nodes, which form a plurality of levels according to the allocated partial DAGs.
  • The DAG includes first-type operators, which have no logical state, and second-type operators, which have a logical state; during stream computing processing, a second-type operator adds a logical state identifier to its processing result.
  • The second node includes a data driving module, a stream computing module, and an output module. The data driving module receives the real-time data stream and sends it to the stream computing module; the stream computing module completes the stream computing processing according to the logical relationship between the operators in the allocated partial DAG and sends the processing result to the output module; the output module sends the processing result to the second node of the next level or to an external storage device.
  • The output module includes a scheduling sub-module and a writing sub-module; the output module sends the processing result to the second node of the next level through the scheduling sub-module, or sends the processing result to an external storage device through the writing sub-module.
  • The present application also discloses a distributed stream computing method, comprising: receiving a real-time data stream from a client or from the second node of the previous level according to the partial DAG allocated by the first node; performing stream computing processing on the real-time data stream according to the logical relationship between the operators in the partial DAG to obtain a processing result; and sending the processing result to the second node of the next level or to an external storage device.
  • Performing stream computing processing on the real-time data stream according to the logical relationship between the operators in the partial DAG to obtain a processing result includes: determining whether the current operator is a second-type operator and, when the current operator is a second-type operator, adding a logical state identifier to the processing result.
  • Adding the logical state identifier to the processing result includes: adding an update identifier to the processing result and sending it to the second node of the next level; or adding an append/delete identifier to the processing result and sending it to the second node of the next level.
  • Adding the append/delete identifier to the processing result and sending it to the second node of the next level includes: when first data is generated according to the processing result, sending the first data with an append identifier to the second node of the next level, so that the second node of the next level adds the first data; and when the first data changes into second data according to the processing result, sending the first data with a delete identifier to the second node of the next level, so that it deletes the first data, and sending the second data with an append identifier to the second node of the next level, so that it adds the second data.
  • Performing stream computing processing on the real-time data stream according to the logical relationship between the operators in the partial DAG to obtain a processing result further includes: at every preset interval, stopping the processing of newly received data and, after the data currently being processed has been completed, generating a snapshot of the second-type operators having a logical state in the partial DAG; and recording the memory image file of the snapshot as a checkpoint.
  • After receiving the real-time data stream from the client or from the second node of the previous level according to the partial DAG allocated by the first node, the method further includes: writing the received real-time data stream to a redo log; when a failure occurs, reading the checkpoint closest to the current time; restoring the logical state of the second-type operators according to the memory image file of the checkpoint; reading the data received after the checkpoint from the redo log and processing it; and, when the data in the redo log has been processed, continuing the stream computing processing of the received real-time data stream.
  • The present application further discloses a distributed stream computing device, comprising: a receiving module, configured to receive a real-time data stream from a client or from the second node of the previous level according to the partial DAG allocated by the first node; a first processing module, configured to perform stream computing processing on the real-time data stream according to the logical relationship between the operators in the partial DAG to obtain a processing result; and a sending module, configured to send the processing result to the second node of the next level or to an external storage device.
  • The first processing module includes: a determining sub-module, configured to determine whether the current operator is a second-type operator; and a processing sub-module, configured to add a logical state identifier to the processing result when the current operator is a second-type operator.
  • The processing sub-module includes: a first sending unit, configured to add an update identifier to the processing result and send it to the second node of the next level; or a second sending unit, configured to add an append/delete identifier to the processing result and send it to the second node of the next level.
  • The second sending unit includes: a first sending subunit, configured to, when first data is generated according to the processing result, send the first data with an append identifier to the second node of the next level, so that the second node of the next level adds the first data; and a second sending subunit, configured to, when the first data changes into second data according to the processing result, send the first data with a delete identifier to the second node of the next level, so that it deletes the first data, and send the second data with an append identifier to the second node of the next level, so that it adds the second data.
  • The first processing module further includes: a generating sub-module, configured to stop processing newly received data at every preset interval and, after the data currently being processed has been completed, generate a snapshot of the second-type operators having a logical state in the partial DAG; and a mirror sub-module, configured to record the memory image file of the snapshot as a checkpoint.
  • The device further includes: a log module, configured to write the received real-time data stream to a redo log; a reading module, configured to read the checkpoint closest to the current time when a failure occurs; a recovery module, configured to restore the logical state of the second-type operators according to the memory image file of the checkpoint; a second processing module, configured to read the data received after the checkpoint from the redo log and process it; and a third processing module, configured to continue the stream computing processing of the received real-time data stream when the data in the redo log has been processed.
  • The present application can achieve the following technical effects: the operators of offline SQL operations familiar to the user are implemented in the stream computing system, so the user can quickly convert offline SQL into stream computing processing logic that the system supports. The system also contains processing logic for failures, so the logical state of each operator can be restored through checkpoints and the redo log.
  • FIG. 1 is a schematic diagram of the topology of a Storm stream processing system in the prior art;
  • FIG. 2 is a schematic diagram of the topology of a distributed stream computing system according to an embodiment of the present application;
  • FIG. 3 is a schematic diagram of the internal structure of a second node in the embodiment of the present application;
  • FIG. 4 is a schematic flowchart of a distributed stream computing method according to an embodiment of the present application;
  • FIG. 5 is a schematic diagram of the processing procedure when a second-type operator adds an update identifier in the embodiment of the present application;
  • FIG. 6 is a schematic diagram of the processing procedure when a second-type operator adds an append/delete identifier in the embodiment of the present application;
  • FIG. 7 is a schematic flowchart of a distributed stream computing method according to an embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of a distributed stream computing device according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the topology of a distributed stream computing system according to an embodiment of the present application, which includes a first node 10 and second nodes 11.
  • The first node 10 converts the offline SQL (Structured Query Language) operation logic input by the user into a DAG (Directed Acyclic Graph).
  • The DAG includes the operators in the input offline SQL operation logic and represents the logical relationship between them.
  • The first node 10 divides the DAG into a plurality of parts according to the logical relationship between the operators and allocates them to a corresponding plurality of second nodes 11. The second nodes 11 are divided into a plurality of levels according to the allocated partial DAGs, thus forming upper-level and lower-level node relationships. If the partial DAGs allocated to the second nodes 11 are spliced together according to the level relationship of the second nodes 11, the complete DAG converted by the first node 10 is obtained.
  • Some operations need to hash the data on a specific column, so the nodes produced by a split need to use different hashes to shuffle the data. When the first node 10 divides the converted DAG into a plurality of parts, it determines, in the logical relationship between the operators, the positions at which shuffling of the data has been completed, divides the DAG into corresponding parts at those positions, and distributes the parts to a plurality of second nodes 11, which form different levels according to the allocated partial DAGs.
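  • The division described above can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the method claimed in the application: the DAG is reduced to a linear operator chain, and the `Operator` class, its `shuffle_key` field, and the `split_into_stages` function are hypothetical names. A cut is made wherever an operator needs its input hashed on a new key, so each resulting stage would correspond to one level of second nodes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operator:
    name: str
    # Hypothetical field: the column the input must be hashed on before this
    # operator runs; None means no shuffle is needed.
    shuffle_key: Optional[str] = None


def split_into_stages(chain: List[Operator]) -> List[List[Operator]]:
    """Cut a linear operator chain wherever a new shuffle is required."""
    stages: List[List[Operator]] = []
    current: List[Operator] = []
    for op in chain:
        # A shuffle requirement closes the current stage: the data must be
        # re-hashed between levels before this operator can run.
        if op.shuffle_key is not None and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


chain = [
    Operator("filter"),
    Operator("groupby_A", shuffle_key="A"),
    Operator("groupby_count", shuffle_key="count(A)"),
    Operator("output"),
]
stages = split_into_stages(chain)
stage_names = [[op.name for op in stage] for stage in stages]
```

  • In this sketch the chain yields three stages, so three levels of second nodes would be formed; splicing the stages back together in order reproduces the original chain.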
  • The real-time data stream from the client is received by the second node 11 of the highest level, and the stream computing processing is completed level by level according to the DAG.
  • The internal structure of each second node 11 is shown in FIG. 3 and includes a data driving module 110, a stream computing module 111, and an output module 112.
  • The data driving module 110 receives the real-time data stream and sends it to the stream computing module 111. The data driving module 110 of the highest-level second node 11 receives the real-time data stream from the client, and the data driving modules 110 of the second nodes 11 of the other levels receive the real-time data stream from the second node 11 of the previous level.
  • The stream computing module 111 stores the allocated partial DAG, completes the stream computing processing according to the logical relationship between the operators in the allocated partial DAG, and sends the processing result to the output module 112. The output module 112 sends the processing result to the next-level second node 11 or to an external storage device: the output module 112 of the lowest-level second node 11 sends the processing result to an external storage device, and the output modules 112 of the second nodes 11 of the other levels send the processing result to their next-level second node 11.
  • The output module 112 further includes a scheduling sub-module 1121 and a writing sub-module 1122; the output module 112 sends the processing result of the stream computation to the next-level second node through the scheduling sub-module 1121, or to an external storage device through the writing sub-module 1122.
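  • The three modules of a second node can be pictured with the following sketch. It is an illustration only: the `SecondNode` class, its method names, and the in-memory list standing in for the external storage device are all assumptions, and the partial DAG is simplified to an ordered list of functions.

```python
from typing import Callable, List, Optional


class SecondNode:
    """One level: data driving -> stream computing -> output."""

    def __init__(self, operators: List[Callable[[dict], Optional[dict]]],
                 next_node: Optional["SecondNode"] = None) -> None:
        self.operators = operators      # the allocated partial DAG, as a chain
        self.next_node = next_node      # None marks the lowest level
        self.external_storage: List[dict] = []  # stand-in for external storage

    def receive(self, record: dict) -> None:
        # Data driving module: accept one record and hand it to computation.
        result = self.compute(record)
        if result is not None:
            self.output(result)

    def compute(self, record: dict) -> Optional[dict]:
        # Stream computing module: run the operators in order.
        current: Optional[dict] = record
        for op in self.operators:
            current = op(current)
            if current is None:   # an operator dropped the record
                return None
        return current

    def output(self, result: dict) -> None:
        # Output module: scheduling path to the next level, or write path.
        if self.next_node is not None:
            self.next_node.receive(result)
        else:
            self.external_storage.append(result)


lowest = SecondNode([lambda r: {**r, "seen": True}])
highest = SecondNode([lambda r: r if r["value"] > 0 else None], next_node=lowest)
highest.receive({"value": 5})
highest.receive({"value": -1})
```

  • Here the highest-level node filters records and forwards survivors, while the lowest-level node, having no next level, writes its results to the storage stand-in.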
  • The DAG includes two types of operators: first-type operators without a logical state and second-type operators with a logical state.
  • A first-type operator does not add a logical state to the data. Its computation on the real-time data stream neither depends on the logical state of the data nor affects the logical state of the data of other operators. Examples include:
  • the filter operator, used to implement the function of the where/having clause in SQL;
  • the transform (converter) operator, used to provide the various conversion functions of the select statement in offline SQL;
  • the window (time window) operator, used to segment data by time; for example, when website visits are counted by day, the result is reset to zero at 0:00 every day and the statistics for the next day begin;
  • the various final output operators, such as writing data to the cloud service OTS (Open Table Service, an open structured data service).
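  • As a rough illustration of first-type operators, the sketch below applies a filter, a transform, and a time window operator to single records. The record fields (`status`, `url`, `ts`) and the function names are hypothetical; the point is only that each function looks at one record and keeps no state between calls.

```python
from datetime import datetime, timezone
from typing import Optional


def filter_op(record: dict) -> Optional[dict]:
    # Plays the role of a WHERE clause: keep only successful requests.
    return record if record["status"] == 200 else None


def transform_op(record: dict) -> dict:
    # Plays the role of a SELECT expression: derive a new column.
    return {**record, "path": record["url"].split("?")[0]}


def window_op(record: dict) -> dict:
    # Tag the record with its UTC day so downstream counts restart each day.
    day = datetime.fromtimestamp(record["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
    return {**record, "window": day}


def run_stateless(record: dict) -> Optional[dict]:
    kept = filter_op(record)
    if kept is None:
        return None
    return window_op(transform_op(kept))


hit = run_stateless({"status": 200, "url": "/home?ref=a", "ts": 0})
miss = run_stateless({"status": 404, "url": "/missing", "ts": 0})
```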
  • The processing of the real-time data stream by a second-type operator depends on the logical state: data is processed differently according to different logical states. The processing may also change the logical state of the data, thereby affecting how other second-type operators process it. Examples include: the groupby operator, used to divide a data set into several small regions for processing, similar to grouping in offline SQL, together with the specific aggregation used, such as the functions count, sum, and average that return a single value; the top operator, used to limit the number of returned records, that is, to take part of the results from a finite set according to a certain rule; the join operator, which establishes a connection between multiple tables of a finite set in order to query data; and other operators for which one incoming piece of data may cause multiple changes in the real-time data.
  • The distributed stream computing system provided by the embodiment of the present application implements, in the stream computing system, the offline SQL operators familiar to the user, so the user can quickly convert offline SQL into stream computing processing logic that the system supports. This reduces the user's workload and the difficulty of the work, and improves the efficiency of stream computing processing.
  • FIG. 4 shows a distributed stream computing method provided by an embodiment of the present application. The method is applicable to a second node and includes the following steps.
  • In step S20, a real-time data stream is received from the client or from the second node of the previous level according to the partial DAG allocated by the first node.
  • The second nodes are divided into a plurality of levels according to the partial DAGs allocated by the first node.
  • The second node of the highest level receives the real-time data stream from the client for processing, while the second nodes of the lower levels receive the real-time data stream from the second node of the previous level.
  • In step S21, stream computing processing is performed on the real-time data stream according to the logical relationship between the operators in the partial DAG, and a processing result is obtained.
  • The partial DAG allocated to the second node includes various operators, and the received real-time data stream is processed according to the logical relationship between them.
  • The allocated partial DAG includes first-type operators without a logical state and second-type operators with a logical state.
  • A first-type operator, having no logical state, processes the real-time data stream relatively directly: it modifies part of the content of a piece of data (for example the time window operator and the converter operator), decides whether to filter out the current data (for example the filter operator), or outputs data externally. These first-type operators neither have a logical state of their own nor add a logical state to the data, and they do not affect how other operators continue to process the real-time data stream.
  • A second-type operator, having a logical state, adds a logical state identifier to each piece of data when processing the real-time data stream and may generate multiple data outputs from one data input; when a second-type operator receives data carrying a logical state identifier, it processes the data differently according to that identifier. Therefore, during stream computing processing it is necessary to determine whether the current operator is a second-type operator and, when it is, to add a logical state identifier to the processing result. In this way, the embodiment of the present application solves the problem of updating in real time the multiple data changes caused by one data input in a distributed system.
  • The stream computing processing of second-type operators is described below by way of example.
  • In a two-level grouping operation, the first level hashes on column A and the second level hashes on the count value of column A.
  • The two levels of grouping operators must therefore hash separately, on column A and on the count value of column A, on second nodes of two levels.
  • The two levels of second nodes complete the stream computing processing by adding an "update" identifier. Owing to the nature of stream computing, the data has no boundary and is endless. This differs from offline SQL, where the second-level processing is performed only after the first-level processing is complete.
  • In order to ensure the real-time performance of stream computing, in the embodiment of the present application each piece of data, once processed by the second node of one level, is handed over as soon as possible to the second node of the next level, so the entry of one piece of data may produce changes to multiple pieces of data.
  • When the grouping operator of the second node of the previous level again receives a record whose column A value is a, its data becomes A: a, Count(A): 2. This change produces data carrying the logical state identifier "update: Count(A) 1->2", which is sent to the grouping operator of the second node of the next level; after receiving it, that grouping operator parses the logical state identifier and updates its data from Count(A): 1 to Count(A): 2.
  • The foregoing process is used only to describe stream computing processing with an added update identifier and does not limit the protection scope of the present application.
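  • The update-identifier exchange can be sketched as follows. This is a simplified illustration, not the application's implementation: the message layout (`flag`, `old`, `new`) and the class names are assumptions, and the downstream side is reduced to a table that mirrors the latest Count(A) per key.

```python
from typing import Dict, Optional


class UpstreamGroupBy:
    """Counts records per value of column A and emits flagged changes."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def process(self, record: dict) -> dict:
        key = record["A"]
        old: Optional[int] = self.counts.get(key)
        new = (old or 0) + 1
        self.counts[key] = new
        # The "update" flag carries both the old and the new value.
        return {"flag": "update", "A": key, "old": old, "new": new}


class DownstreamTable:
    """Keeps the latest Count(A) per key by applying update messages."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = {}

    def apply(self, message: dict) -> None:
        if message["flag"] != "update":
            raise ValueError("unexpected flag: " + message["flag"])
        self.rows[message["A"]] = message["new"]


upstream = UpstreamGroupBy()
downstream = DownstreamTable()
messages = [upstream.process({"A": "a"}) for _ in range(2)]
for message in messages:
    downstream.apply(message)
```

  • The second message in this run corresponds to the "update: Count(A) 1->2" case in the text: the downstream row for a moves from 1 to 2.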
  • Because the second node of the previous level may generate multiple pieces of data at the same time for updating the second node of the next level, and other logical relationships may exist between the two nodes, the "append/delete" identifier can be added instead, in order to make the framework design and code logic between the previous-level and next-level second nodes clearer and easier to analyze. As shown in FIG., when the grouping operator of the second node of the previous level again receives a record whose column A value is a, its data becomes A: a, Count(A): 2. The grouping operator of the previous-level second node generates the pre-change data "A: a, Count(A): 1" and adds a "delete" identifier; after receiving this data, the grouping operator of the next-level second node deletes its Count(A): 1 data. The grouping operator of the previous-level second node then generates the post-change data "A: a, Count(A): 2" and adds an "append" identifier; after receiving this data, the grouping operator of the next-level second node adds a piece of data with Count(A): 2, thereby completing the computation on the real-time data stream.
  • Multi-level grouping operations, and the real-time stream computing processing of other second-type operators, can be completed through the above process.
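  • A minimal sketch of the append/delete variant follows, again under stated assumptions: messages are hypothetical `(flag, key, count)` tuples, and the downstream operator is modeled as a second-level grouping that tracks how many keys currently have each count value. A changed row is first retracted with "delete" and then re-sent with "append".

```python
from collections import Counter
from typing import Dict, List, Tuple

Message = Tuple[str, str, int]  # (flag, value of A, Count(A))


class UpstreamGroupBy:
    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def process(self, key: str) -> List[Message]:
        messages: List[Message] = []
        old = self.counts.get(key)
        if old is not None:
            # Retract the row as it looked before the change.
            messages.append(("delete", key, old))
        new = (old or 0) + 1
        self.counts[key] = new
        # Then send the row as it looks after the change.
        messages.append(("append", key, new))
        return messages


class DownstreamGroupByCount:
    """Second-level grouping: how many keys currently have each count."""

    def __init__(self) -> None:
        self.keys_per_count: Counter = Counter()

    def apply(self, message: Message) -> None:
        flag, _key, count = message
        if flag == "append":
            self.keys_per_count[count] += 1
        elif flag == "delete":
            self.keys_per_count[count] -= 1
            if self.keys_per_count[count] == 0:
                del self.keys_per_count[count]
        else:
            raise ValueError("unexpected flag: " + flag)


upstream = UpstreamGroupBy()
downstream = DownstreamGroupByCount()
sent: List[Message] = []
for key in ["a", "b", "a"]:
    for message in upstream.process(key):
        sent.append(message)
        downstream.apply(message)
```

  • One design point this illustrates: the downstream operator only needs to understand two primitive actions, so it does not have to interpret what kind of change happened upstream.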
  • In step S22, the processing result is sent to the second node of the next level or to an external storage device.
  • After obtaining the processing result, the second node sends it to the second node of the next level for further processing.
  • The second node of the lowest level sends the processing result to an external storage device, such as a memory or a hard disk.
  • Step S21, performing stream computing processing on the real-time data stream according to the logical relationship between the operators in the partial DAG to obtain the processing result, further includes the following steps:
  • In step S210, at every preset interval the processing of newly received data is stopped and, after the data currently being processed has been completed, a snapshot is generated for the second-type operators having a logical state in the partial DAG.
  • The second node keeps receiving the real-time data stream; at every preset interval it stops processing newly received real-time data and continues processing only the data whose processing has started but not yet finished.
  • When that data has been processed, the second node generates a snapshot of the second-type operators having a logical state in the allocated partial DAG; the snapshot records the logical state, at that moment, of each piece of data in all second-type operators in the second node.
  • In step S211, the memory image file of the snapshot is recorded as a checkpoint.
  • The second node saves the memory image file of the snapshot (for example, a dump file) to a memory and records it as a checkpoint, which is used, when the second node fails, to restore each piece of data of the second-type operators to its logical state at the moment of the checkpoint. After the checkpoint is established, the second node continues to process the received real-time data stream.
  • When the second node fails, each second-type operator can be restored to its previous state from the established checkpoint.
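  • The checkpointing steps can be sketched as below. This is an assumption-laden illustration: the interval is counted in processed records rather than elapsed time, `pickle` stands in for writing a memory image file, and the `CheckpointingOperator` class is hypothetical.

```python
import pickle
from typing import Dict, List, Tuple


class CheckpointingOperator:
    """A grouping operator whose state is snapshotted periodically."""

    def __init__(self, interval: int) -> None:
        # Assumption: the interval is counted in records, standing in for
        # the preset time period described in the text.
        self.interval = interval
        self.counts: Dict[str, int] = {}
        self.processed = 0
        self.checkpoints: List[Tuple[int, bytes]] = []

    def process(self, key: str) -> None:
        self.counts[key] = self.counts.get(key, 0) + 1
        self.processed += 1
        # Each record is fully processed before the snapshot is taken, so
        # the checkpoint never captures a half-applied change.
        if self.processed % self.interval == 0:
            self.take_checkpoint()

    def take_checkpoint(self) -> None:
        image = pickle.dumps(self.counts)   # serialized memory image
        self.checkpoints.append((self.processed, image))


operator = CheckpointingOperator(interval=2)
for key in ["a", "b", "a"]:
    operator.process(key)
position, image = operator.checkpoints[-1]
restored = pickle.loads(image)
```

  • The stored position records how far processing had advanced when the snapshot was taken, which is what a later recovery would need in order to know where replay should start.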
  • So that the second node can automatically restore correct processing when a failure occurs, the method, as shown in FIG. 7, includes the following steps.
  • In step S301, a real-time data stream is received from the client or from the second node of the previous level according to the partial DAG allocated by the first node.
  • In step S302, the received real-time data stream is written to the redo log.
  • The redo log records each piece of real-time data received by the second node, including the data content and the time at which the data was received.
  • In step S303, at every preset interval the processing of newly received data is stopped and, after the data currently being processed has been completed, a snapshot is generated for the second-type operators having a logical state in the partial DAG.
  • In step S304, the memory image file of the snapshot is recorded as a checkpoint.
  • In step S305, when a failure occurs, the checkpoint closest to the current time is read.
  • The checkpoint closest to the current time, that is, the memory image file of that checkpoint, is read from the memory.
  • In step S306, the logical state of the second-type operators is restored according to the memory image file of the checkpoint.
  • The memory image file that is read includes a snapshot of each second-type operator of the second node, that is, the logical state of each piece of data at that moment, so the data of the second-type operators is restored, according to the memory image file, to the logical state at the checkpoint.
  • In step S307, the data received after the checkpoint is read from the redo log and processed.
  • The second node determines the time of the checkpoint, reads from the redo log the data received after that time, and processes the data piece by piece with its internal operators.
  • In step S308, when the data in the redo log has been processed, the stream computing processing of the received real-time data stream is continued.
  • When the second node has processed all the data in the redo log received after the time of the checkpoint, the data of each operator of the second node has been restored to its logical state at the time of the failure. At this point, the second node can continue to perform stream computing processing on the received real-time data stream, so the second node recovers automatically from a failure through the checkpoint and the redo log.
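  • Recovery from a checkpoint plus the redo log might look like the following sketch. The assumptions are the same kind as before: redo-log entries are hypothetical `(sequence number, key)` pairs in which the sequence number stands in for the receive time, checkpoints are `(position, pickled state)` pairs, and the operator state is a simple per-key count.

```python
import pickle
from typing import Dict, List, Tuple


def apply_record(counts: Dict[str, int], key: str) -> None:
    # The same update the operator performs during normal processing.
    counts[key] = counts.get(key, 0) + 1


def recover(checkpoints: List[Tuple[int, bytes]],
            redo_log: List[Tuple[int, str]]) -> Dict[str, int]:
    """Rebuild operator state from the latest checkpoint plus the redo log."""
    if checkpoints:
        # Read the checkpoint closest to the current time.
        position, image = max(checkpoints, key=lambda cp: cp[0])
        counts: Dict[str, int] = pickle.loads(image)
    else:
        position, counts = 0, {}
    # Replay only the records received after the checkpoint.
    for seq, key in redo_log:
        if seq > position:
            apply_record(counts, key)
    return counts


redo_log = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
checkpoints = [(2, pickle.dumps({"a": 1, "b": 1}))]
state = recover(checkpoints, redo_log)
```

  • In this run the recovered state matches what a full replay of the log from the start would produce, while only the two records after the checkpoint are actually reprocessed.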
  • In step S309, the processing result is sent to the second node of the next level or to an external storage device.
  • The user does not need to implement the stream computing processing logic himself, and when a second node fails, it can recover its data by itself and continue the stream computing processing of the real-time data stream.
  • FIG. 8 is a distributed flow computing device provided by an embodiment of the present application, including:
  • the receiving module 40 is configured to receive, according to the partial DAG allocated from the first node, a real-time data stream from the client or the second node;
  • the first processing module 41 is configured to perform stream calculation processing on the real-time data stream according to a logical relationship between the operators in the partial DAG, to obtain a processing result;
  • the sending module 42 is configured to send the processing result to the second node of the next level or an external storage device.
  • the first processing module 41 includes:
  • a determining submodule configured to determine whether the current operator is a second-type operator;
  • a processing submodule configured to add a logical state identifier to the processing result when the current operator is a second-type operator.
  • the processing submodule includes:
  • a first sending unit configured to add an update identifier to the processing result and send it to the next-level second node; or
  • a second sending unit configured to add an append/delete identifier to the processing result and send it to the next-level second node.
  • the second sending unit includes:
  • a first sending subunit configured to, when first data is generated according to the processing result, send the first data with an append identifier to the next-level second node, so that the next-level second node adds the first data;
  • a second sending subunit configured to, when the first data changes into second data according to the processing result, send the first data with a delete identifier to the next-level second node, so that the next-level second node deletes the first data, and then send the second data with an append identifier to the next-level second node, so that the next-level second node adds the second data.
  • the first processing module 41 further includes:
  • a mirror submodule that records a snapshot's memory image file as a checkpoint.
  • the device also includes:
  • a log module configured to write the received real-time data stream to the redo log
  • a reading module for reading a checkpoint that is closest to the current time when a failure occurs
  • a recovery module configured to restore a logic state of the second type of operator according to the memory image file of the checkpoint
  • a second processing module configured to read and process the data received after the checkpoint from the redo log
  • the third processing module is configured to continue to perform stream calculation processing on the received real-time data stream when the data processing in the redo log is completed.
  • Internet data statistics and analysis service providers need to provide professional, authoritative and independent website data statistics and analysis services to all kinds of websites and enterprises.
  • Large providers typically serve more than one million customers and process more than one billion statistical records per day.
  • The browsing behavior of network users occurs in real time, so website data statistics are a real-time stream computing task.
  • Website statistics usually include basic statistics such as page views (Page View, PV), unique visitors (Unique Visitor, UV), IP addresses, visit duration and number of visits (the series of activities from a visitor entering the website to leaving it is recorded as one visit, and one visit may produce multiple page views), as well as advanced statistics such as same-day returning visitors, new unique visitors, average visit frequency, average visit duration, average visit depth and pages viewed per visitor.
  • Same-day returning visitors are unique visitors who visit the website more than once within a day; they are determined from the counted unique visitors and number of visits.
  • New unique visitors are the new visitors arising each day: the IP addresses counted in real time are compared with the historical IP addresses to find newly appearing IP addresses, and the unique visitors counted in real time under those newly appearing IP addresses are the new unique visitors.
  • Average visit frequency is the average number of visits to the website per unique visitor within one day: average visit frequency = number of visits / unique visitors.
  • Average visit duration is the average time spent on the website per visit: average visit duration = visit duration / number of visits.
  • Average visit depth is the average number of page views per visit: average visit depth = page views / number of visits.
  • Pages viewed per visitor is the average number of page views per unique visitor: pages viewed per visitor = page views / unique visitors.
  • The distributed stream computing system of the embodiments of the present application can therefore be applied to the data statistics of any website.
  • The statistics and processing logic for the above statistics can be converted by the first node into a directed acyclic graph.
  • The directed acyclic graph is divided into multiple parts, which are assigned to multiple levels of second nodes.
  • The lower-level second nodes compute the basic statistics, counting page views, unique visitors, IP addresses, number of visits and so on.
  • The higher-level second nodes compute the advanced statistics in real time from the basic statistics gathered in real time, namely same-day returning visitors, new unique visitors, average visit frequency, average visit duration, average visit depth and pages viewed per visitor.
  • For example, a first-level second node counts the page views of unique visitors. Internally it uses a first-level second-type operator to count the unique visitors entering the website in real time and a second-level second-type operator to count each unique visitor's page views on the website, and it outputs the result to a second-level second node that computes statistics related to per-visitor page views, for example a second-level second node that computes the total page views of the website by summing the page views of every unique visitor.
  • The first-level second node also outputs the counted number of unique visitors to another second-level second node that computes statistics related to unique visitors, for example pages visited per visitor and average visit frequency.
  • The second-type operators add a state identifier (an "update" identifier or an "append/delete" identifier) to the unique visitors and per-visitor page views counted in real time, so that these values, as well as the total page views, pages visited per visitor and average visit frequency computed in real time by the next-level second nodes, are continuously updated.
  • The first-level and second-level second nodes write the received data to the redo log as a backup.
  • At every preset interval, the first-level and second-level second nodes stop processing newly received data and, once the data already being processed has been counted and computed, generate a snapshot of their internal second-type operators.
  • The snapshot includes the current statistical value of each second-type operator and the logical state identifier of that value.
  • The above application example of a distributed statistics system for website data is an exemplary illustration of the embodiments of the present application and does not limit the scope of protection of the present application.
  • The distributed stream computing system and method provided by the embodiments of the present application are equally applicable to any other real-time data statistics system.
  • A content delivery network (Content Delivery Network, CDN) distributes the content of the origin site to nodes throughout the country, which shortens the delay users experience when viewing objects, improves the response speed and availability of the website, and addresses problems such as limited network bandwidth, heavy user traffic and unevenly distributed points of presence; website content is distributed across the whole network, accelerating sites across operators and regions.
  • When the system is applied to a CDN, the sources of user access to a website are used as the input of the distributed stream computing system to detect whether access to the website is abnormal.
  • By counting the number of unique visitors, the IP addresses and each unique visitor's page views, data such as the average visit frequency and the access frequency and page views of each IP address are further computed, and each unique visitor's page views and average visit frequency are sorted to predict whether the website is under attack, for example a distributed denial-of-service (Distributed Denial of Service, DDoS) attack.
  • The first node of the distributed stream computing system converts the above logic into a directed acyclic graph, divides the directed acyclic graph into multiple parts and distributes them to multiple second nodes.
  • The first-level second nodes count the number of unique visitors, the IP addresses and each unique visitor's page views, and the second-level second nodes compute and sort each unique visitor's page views, the average visit frequency, and the access frequency and page views of each IP address.
  • Visitors or IP addresses whose visit frequency or page views are excessively high within a short time may be attacking the website. For example, if the access requests or page view requests initiated simultaneously by several IP addresses within a unit of time exceed the throughput of the website server and occupy almost all of its resources for a short time, so that the normal access of other users cannot be completed, the computers at those IP addresses may be attacking the website server.
  • The distributed stream computing system then notifies the CDN of the abnormal result, and access from those IP addresses to the website server is blocked for a certain period, which prevents network users from attacking the website server and keeps it running normally.
  • Each second node in the distributed stream computing system can likewise recover by itself after a failure, so real-time statistics on user access data are not affected.
  • The distributed stream computing system provided by the embodiments of the present application can also report a website's current content distribution within the CDN, using unique visitors and IP addresses to confirm whether users from different regions and different operators can access the website normally.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include forms of computer-readable media such as non-persistent memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM).
  • Memory is an example of a computer readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
  • Thus, if the text describes a first device as coupled to a second device, the first device may be directly electrically coupled to the second device, or indirectly electrically coupled to it through other devices or coupling means.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

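A distributed stream computing system, method and apparatus, wherein the system comprises a first node (10) and second nodes (11). The first node (10) converts input offline sql operation logic into a directed acyclic graph (DAG), the DAG representing the logical relationships among the operators in the offline sql operation logic. The first node (10) divides the DAG into multiple parts according to the logical relationships among the operators and assigns the parts to corresponding multiple second nodes (11), the multiple second nodes (11) forming multiple levels according to the partial DAGs assigned to them. The multiple second nodes (11) receive a real-time data stream and complete stream computing processing level by level according to the DAG. Because the operators of offline sql operations familiar to users are implemented in the stream computing system, users can quickly convert offline sql into stream computing processing logic that the system supports.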

Description

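Distributed stream computing system, method and apparatus
This application claims priority to Chinese patent application No. 201510360023.8, filed on June 26, 2015 and entitled "Distributed stream computing system, method and apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention belongs to the field of Internet technologies and, in particular, relates to a distributed stream computing system, method and apparatus.
Background
Stream computing is an important part of the data processing field today. Whereas a traditional data processing system stores data on a hard disk or in another storage service before computing on it, stream computing processes incoming data in real time and realizes the value of the data in real time; it is generally held that the value of a data stream decreases as time passes.
Relatively mature stream processing systems currently include Yahoo's S4 (open source), Twitter's Storm (open source), Google's MillWheel and Amazon's Kinesis. In all of them, a user's stream computing requirements must be implemented by data processing logic code that the user prepares.
Taking Storm as an example, stream processing is carried out mainly by two kinds of nodes: a Spout node (which can be understood as a message source) sends a message stream (Stream) to the next-level Bolt nodes, and the Bolt nodes implement the message processing logic, such as filtering and aggregation. The computing logic of the Bolt nodes (and the data generation logic of the Spout nodes) is supplied by the user by implementing interfaces that Storm provides. The principle of a Storm topology (Topology) is shown in FIG. 1. As with Storm, stream processing systems such as Kinesis also leave the computing logic to the user.
Stream computing systems of this kind, such as Storm and Kinesis, all require users to implement the stream processing logic themselves. Users must ensure that the logic they implement is free of errors, especially under the various occasional boundary conditions; otherwise it is hard to build a stream computing service that runs stably over a long period. Furthermore, when users do not understand the system framework well enough or overlook some cases, they can hardly guarantee that their logic handles every exception correctly and stays correct when the system fails. These are very demanding requirements, and they make it difficult for users to adopt a stream processing system quickly and correctly. When using the above systems, users usually have to take on several roles at once, such as operations, testing and development. With traditional data processing (such as sql queries), by contrast, users only need to think through their own logic and write the sql (Structured Query Language) query; they need not consider how the sql query is executed or whether its execution is correct.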
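Summary
In view of this, the present application provides a distributed stream computing system, method and apparatus, which solve the technical problem in the prior art that users of a stream computing system must implement the stream computing processing logic themselves.
To solve the above technical problem, the present application discloses a distributed stream computing system comprising a first node and second nodes. The first node converts input offline sql operation logic into a DAG (Directed Acyclic Graph), the DAG representing the logical relationships among the operators in the offline sql operation logic; the first node divides the DAG into multiple parts according to the logical relationships among the operators and assigns the parts to corresponding multiple second nodes, the multiple second nodes forming multiple levels according to the partial DAGs assigned to them; the multiple second nodes receive a real-time data stream and complete stream computing processing level by level according to the DAG.
The first node dividing the DAG into multiple parts according to the logical relationships among the operators and assigning the parts to corresponding multiple second nodes, with the multiple second nodes forming multiple levels according to the partial DAGs assigned to them, comprises: determining, in the logical relationships among the operators, the positions at which shuffle processing of the data has been completed, dividing the DAG into corresponding multiple parts according to those positions and assigning the parts to multiple second nodes, the multiple second nodes forming multiple levels according to the partial DAGs assigned to them.
The DAG comprises first-type operators, which have no logical state, and second-type operators, which have a logical state; in the stream computing processing, a second-type operator adds a logical state identifier to its processing result.
The second node comprises a data-driven module, a stream computing module and an output module. The data-driven module receives the real-time data stream and sends it to the stream computing module; the stream computing module completes the stream computing processing according to the logical relationships among the operators in the assigned partial DAG and sends the processing result to the output module; the output module sends the processing result to the next-level second node or to an external storage device.
The output module comprises a scheduling submodule and a writing submodule; the output module sends the processing result to the next-level second node through the scheduling submodule, or sends the processing result to the external storage device through the writing submodule.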
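To solve the above technical problem, the present application further discloses a distributed stream computing method, the method comprising: receiving a real-time data stream from the client or from the upper-level second node according to the partial DAG assigned by the first node; performing stream computing processing on the real-time data stream according to the logical relationships among the operators in the partial DAG, to obtain a processing result; and sending the processing result to the next-level second node or to an external storage device.
Performing stream computing processing on the real-time data stream according to the logical relationships among the operators in the partial DAG to obtain a processing result comprises: determining whether the current operator is a second-type operator, and when the current operator is a second-type operator, adding a logical state identifier to the processing result.
Adding a logical state identifier to the processing result comprises: adding an update identifier to the processing result and sending it to the next-level second node; or adding an append/delete identifier to the processing result and sending it to the next-level second node.
Adding an append/delete identifier to the processing result and sending it to the next-level second node comprises: when first data is generated according to the processing result, sending the first data with an append identifier to the next-level second node, so that the next-level second node adds the first data; when the first data changes into second data according to the processing result, sending the first data with a delete identifier to the next-level second node, so that the next-level second node deletes the first data, and then sending the second data with an append identifier to the next-level second node, so that the next-level second node adds the second data.
Performing stream computing processing on the real-time data stream according to the logical relationships among the operators in the partial DAG to obtain a processing result further comprises: stopping the processing of received data at every preset interval and, after the data already being processed has been completed, generating a snapshot of the second-type operators with logical state in the partial DAG; and recording the memory image file of the snapshot as a checkpoint.
After the real-time data stream is received from the client or from the upper-level second node according to the partial DAG assigned by the first node, the method further comprises: writing the received real-time data stream to a redo log; when a failure occurs, reading the checkpoint closest to the current time; restoring the logical state of the second-type operators according to the memory image file of the checkpoint; reading from the redo log the data received after the checkpoint and processing it; and when the data in the redo log has been processed, continuing stream computing processing of the received real-time data stream.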
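To solve the above technical problem, the present application further discloses a distributed stream computing apparatus, comprising: a receiving module configured to receive a real-time data stream from the client or from the upper-level second node according to the partial DAG assigned by the first node; a first processing module configured to perform stream computing processing on the real-time data stream according to the logical relationships among the operators in the partial DAG, to obtain a processing result; and a sending module configured to send the processing result to the next-level second node or to an external storage device.
The first processing module comprises: a determining submodule configured to determine whether the current operator is a second-type operator; and a processing submodule configured to add a logical state identifier to the processing result when the current operator is a second-type operator.
The processing submodule comprises: a first sending unit configured to add an update identifier to the processing result and send it to the next-level second node; or a second sending unit configured to add an append/delete identifier to the processing result and send it to the next-level second node.
The second sending unit comprises: a first sending subunit configured to, when first data is generated according to the processing result, send the first data with an append identifier to the next-level second node, so that the next-level second node adds the first data; and a second sending subunit configured to, when the first data changes into second data according to the processing result, send the first data with a delete identifier to the next-level second node, so that the next-level second node deletes the first data, and then send the second data with an append identifier to the next-level second node, so that the next-level second node adds the second data.
The first processing module further comprises: a generating submodule configured to stop the processing of received data at every preset interval and, after the data already being processed has been completed, generate a snapshot of the second-type operators with logical state in the partial DAG; and a mirroring submodule configured to record the memory image file of the snapshot as a checkpoint.
The apparatus further comprises: a log module configured to write the received real-time data stream to a redo log; a reading module configured to read, when a failure occurs, the checkpoint closest to the current time; a recovery module configured to restore the logical state of the second-type operators according to the memory image file of the checkpoint; a second processing module configured to read from the redo log the data received after the checkpoint and process it; and a third processing module configured to continue stream computing processing of the received real-time data stream when the data in the redo log has been processed.
Compared with the prior art, the present application can achieve technical effects including the following: the operators of offline sql operations familiar to users are implemented in the stream computing system, so users can quickly convert offline sql into stream computing processing logic that the system supports; and the system includes the logic for handling failures, so the logical state of each operator can be restored by means of checkpoints and the redo log.
Of course, a product implementing the present application does not necessarily need to achieve all of the above technical effects at the same time.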
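Brief Description of the Drawings
The drawings described here provide a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their description explain the present application and do not unduly limit it. In the drawings:
FIG. 1 is a schematic diagram of the topology of the Storm stream processing system in the prior art;
FIG. 2 is a schematic diagram of the topology of a distributed stream computing system provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the internal topology of a second node in an embodiment of the present application;
FIG. 4 is a schematic flowchart of a distributed stream computing method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of the processing performed when a second-type operator adds an update identifier in an embodiment of the present application;
FIG. 6 is a schematic diagram of the processing performed when a second-type operator adds an append/delete identifier in an embodiment of the present application;
FIG. 7 is a schematic flowchart of a distributed stream computing method provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a distributed stream computing apparatus provided by an embodiment of the present application.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the way in which the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and put into practice.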
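FIG. 2 is a schematic diagram of the topology of a distributed stream computing system provided by an embodiment of the present application, which comprises a first node 10 and second nodes 11. The first node 10 converts offline sql (Structured Query Language) operation logic input by the user into a DAG (Directed Acyclic Graph). The DAG contains the operators of the input offline sql operation logic and represents the logical relationships among them.
The first node 10 divides the DAG into multiple parts according to the logical relationships among the operators and assigns the parts to corresponding multiple second nodes 11. The second nodes 11 are divided into multiple levels according to the partial DAGs assigned to them, which gives rise to upper-level and lower-level nodes. Splicing together the partial DAGs of all second nodes 11 according to their level relationships yields the complete DAG produced by the first node 10.
In the offline sql operation logic input by the user, some operations require the data to be hashed on a specific column; when the hashing scheme changes, the nodes must be split so that a different hash can be used to shuffle the data. When dividing the converted DAG into multiple parts, the first node 10 determines, in the logical relationships among the operators, the positions at which shuffle processing of the data has been completed, divides the DAG into corresponding parts according to those positions, and assigns the parts to multiple second nodes 11, which form different levels according to the partial DAGs assigned to them.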
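The splitting rule can be illustrated with a minimal Python sketch. It assumes, for simplicity, that the DAG is a linear chain of operators and that each operator optionally declares the column its input must be hashed on; the `Operator` class, the `hash_key` field and the operator names are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operator:
    name: str
    # Column the operator needs its input hashed on; None means "no requirement".
    # Hypothetical field, introduced only for this illustration.
    hash_key: Optional[str] = None


def split_at_shuffle(ops: List[Operator]) -> List[List[Operator]]:
    """Cut a linear operator chain wherever the required hash key changes.

    Each returned part would be assigned to one level of second nodes.
    """
    parts: List[List[Operator]] = []
    current_key: Optional[str] = None
    for op in ops:
        needs_shuffle = (
            op.hash_key is not None
            and current_key is not None
            and op.hash_key != current_key
        )
        if needs_shuffle or not parts:
            parts.append([])  # start a new part (a new level)
        if op.hash_key is not None:
            current_key = op.hash_key
        parts[-1].append(op)
    return parts


chain = [
    Operator("filter"),
    Operator("groupby_A", hash_key="A"),
    Operator("transform"),
    Operator("groupby_count_A", hash_key="count(A)"),
    Operator("output"),
]
levels = split_at_shuffle(chain)
level_names = [[op.name for op in part] for part in levels]
print(level_names)
```

In this sketch the chain is cut once, in front of the operator that needs the data re-hashed on `count(A)`, which gives two levels. A real DAG with branches would need a graph traversal rather than a single pass over a list.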
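Among the levels formed by the second nodes 11, the highest-level second node 11 receives the real-time data stream from the client, and stream computing processing is completed level by level according to the DAG. The internal topology of each second node 11 is shown in FIG. 3 and comprises a data-driven module 110, a stream computing module 111 and an output module 112. The data-driven module 110 receives the real-time data stream and sends it to the stream computing module 111: the data-driven module 110 of the highest-level second node 11 receives the real-time data stream from the client, while the data-driven modules 110 of second nodes 11 at other levels receive it from the upper-level second node 11. The stream computing module 111 stores the assigned partial DAG, completes the stream computing processing according to the logical relationships among the operators in that partial DAG, and sends the processing result to the output module 112. The output module 112 sends the processing result to the next-level second node 11 or to an external storage device: the output module 112 of the lowest-level second node 11 sends it to the external storage device, while the output modules 112 of second nodes 11 at other levels send it to their next-level second node 11. As shown in FIG. 3, the output module 112 further comprises a scheduling submodule 1121 and a writing submodule 1122; the output module 112 sends the stream computing result to the next-level second node through the scheduling submodule 1121, or to the external storage device through the writing submodule 1122.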
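The three-module structure of a second node can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the partial DAG is reduced to a single Python callable, the external storage device is a plain list, and the class and method names (`SecondNode`, `on_data`, `_output`) are hypothetical.

```python
class SecondNode:
    """Toy second node: data-driven module -> stream computing module -> output module."""

    def __init__(self, compute, next_node=None, storage=None):
        self.compute = compute      # stands in for the assigned partial DAG
        self.next_node = next_node  # next-level second node, if any
        self.storage = storage      # stand-in for the external storage device

    def on_data(self, record):
        # Data-driven module: receives a record and hands it to the computation.
        result = self.compute(record)  # stream computing module
        self._output(result)           # output module

    def _output(self, result):
        if self.next_node is not None:
            # Role of the scheduling submodule: forward to the next level.
            self.next_node.on_data(result)
        else:
            # Role of the writing submodule: persist the final result.
            self.storage.append(result)


store = []
lowest = SecondNode(lambda x: x + 1, storage=store)
highest = SecondNode(lambda x: x * 2, next_node=lowest)
for value in [1, 2, 3]:
    highest.on_data(value)
print(store)
```

Only the lowest-level node writes to storage here, mirroring the description above; every other node forwards its result downstream.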
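The DAG contains two types of operators: first-type operators, which have no logical state, and second-type operators, which have a logical state. A first-type operator adds no logical state to the data; when it processes the real-time data stream it neither depends on the logical state of the data nor affects the logical state of the data of other operators. Examples are the filter operator, which implements the function of the where/having clauses in sql; the transform operator, which provides processing similar to the various conversion functions of a select statement in offline sql; the window (time window) operator, which partitions data by time, for example counting website visits per day, with the result reset to zero at 0:00 every day so that the next day's count can begin; and the various final output operators, such as writing data to the cloud service OTS (Open Table Service). The way a second-type operator processes the real-time data stream depends on logical state: it handles different logical states differently, and its processing may change the logical state of the data and thereby affect how other second-type operators process the data. Examples are the Groupby (grouping) operator, which divides a data set into several small regions that are processed separately, like grouping in offline sql (in this system the operator also includes the aggregation actually used, such as functions like count/sum/average that can return the unique distinct values in a table); the top operator, which limits the number of returned records, that is, takes part of the results of a finite set according to some rule; the join operator, which establishes relationships between multiple tables of a finite set in order to query data; and any other operator for which one input record may cause multiple records to change in real time. The use of second-type operators in the processing of the real-time data stream is described in the embodiments that follow.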
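The difference between the two operator types can be shown with a small Python sketch. The function and class names are hypothetical; the point is only that a first-type operator's output depends on the current record alone, while a second-type operator's output also depends on what it has seen before.

```python
def filter_op(record, predicate):
    """First-type (stateless) operator: keeps or drops the current record."""
    return record if predicate(record) else None


class GroupCount:
    """Second-type (stateful) operator: running count per value of one column."""

    def __init__(self, key):
        self.key = key
        self.counts = {}  # the operator's logical state

    def process(self, record):
        value = record[self.key]
        self.counts[value] = self.counts.get(value, 0) + 1
        return {self.key: value, "count": self.counts[value]}


record = {"A": "a"}

# The stateless operator gives the same answer for the same input every time.
first = filter_op(record, lambda r: r["A"] == "a")
second = filter_op(record, lambda r: r["A"] == "a")

# The stateful operator gives a different answer the second time.
group = GroupCount("A")
out1 = group.process(record)
out2 = group.process(record)
print(first == second, out1, out2)
```

Because `GroupCount` carries state between records, it is the kind of operator whose state would have to be captured in a snapshot, whereas `filter_op` has nothing to save.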
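The distributed stream computing system provided by the embodiments of the present application implements the operators of offline sql operations, which users already know, inside the stream computing system. Users can quickly convert offline sql into stream computing processing logic that the system supports, which reduces their workload and the difficulty of their work and improves the efficiency of stream computing processing.
FIG. 4 shows a distributed stream computing method provided by an embodiment of the present application. The method is applicable to a second node and comprises the following steps.
In step S20, a real-time data stream is received from the client or from the upper-level second node according to the partial DAG assigned by the first node.
The second nodes are divided into multiple levels according to the partial DAGs assigned by the first node. The highest-level second node receives the real-time data stream from the client and processes it, while second nodes at lower levels receive the real-time data stream from their upper-level second node and process it.
In step S21, stream computing processing is performed on the real-time data stream according to the logical relationships among the operators in the partial DAG, to obtain a processing result.
The partial DAG assigned to a second node contains various operators, and the received real-time data stream is processed according to the logical relationships among them. The partial DAG contains first-type operators without a logical state and second-type operators with a logical state. During stream computing processing, a first-type operator handles the real-time data stream fairly directly: it modifies part of a record (as the time window and transform operators do), decides whether to filter out the current record (as the filter operator does), or outputs data externally. First-type operators have no logical state of their own, add no logical state to the data, and do not affect how other operators go on to process the real-time data stream.
When a second-type operator processes the real-time data stream, it adds a logical state identifier to each record and may produce multiple output records from a single input record; when a second-type operator receives a record carrying a logical state identifier, it processes the record differently depending on the identifier. During stream computing processing it is therefore necessary to determine whether the current operator is a second-type operator and, if it is, to add a logical state identifier to the processing result. In this way the embodiments of the present application solve the problem of real-time updating in a distributed system when one input record causes multiple records to change. The stream computing processing performed by second-type operators is illustrated below by example.
As shown in FIG. 5, in a stream computing DAG with two levels of grouping, the first level hashes on column A and the second level hashes on the count value of column A. By the principles of distributed processing, the two grouping operators must run on two levels of second nodes, hashing on column A and on the count of column A respectively. In this example the two levels of second nodes complete the stream computing processing by adding an "update" identifier. By its nature, the data in stream computing is unbounded and endless, unlike offline sql, where the second level is processed only after the first level has finished. To keep stream computing processing real-time, the embodiments of the present application hand results to the next-level second node as soon as possible after each level of second node has processed them, so that the arrival of one record may cause multiple records to change. In FIG. 5, on receiving a record whose column A value is a, the grouping operator of the upper-level second node adds a row "A:a, Count(A):1"; this change produces a record carrying the logical state identifier "update: count(A) Null->1", which is sent to the grouping operator of the next-level second node, and that operator adds a row Count(A):1 on receiving it. When the grouping operator of the upper-level second node again receives a record whose column A value is a, its row becomes A:a, Count(A):2; this change produces a record carrying the logical state identifier "update: count(A) 1->2", which is sent to the grouping operator of the next-level second node, and that operator parses the identifier on receipt and updates the row Count(A):1 to Count(A):2.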
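The two-level grouping with "update" identifiers can be sketched in Python as below. This is a simplified reading of the example rather than the application's implementation: the message layout (`flag`, `old`, `new`) and the class names are assumptions, and the lower-level operator is modeled as keeping, for each count value, the number of A-groups that currently have that count.

```python
class Level1GroupBy:
    """Upper-level grouping operator: running count per value of column A."""

    def __init__(self):
        self.counts = {}

    def process(self, record):
        a = record["A"]
        old = self.counts.get(a)  # None the first time a value is seen
        new = (old or 0) + 1
        self.counts[a] = new
        # The message layout is hypothetical; it mirrors "update: count(A) old->new".
        return {"flag": "update", "field": "count(A)", "old": old, "new": new}


class Level2GroupBy:
    """Lower-level grouping operator, keyed by the value of count(A)."""

    def __init__(self):
        self.rows = {}  # count value -> number of A-groups with that count

    def process(self, msg):
        if msg["flag"] != "update":
            raise ValueError("this sketch only handles update messages")
        old, new = msg["old"], msg["new"]
        if old is not None:  # retire the row for the old value
            self.rows[old] -= 1
            if self.rows[old] == 0:
                del self.rows[old]
        self.rows[new] = self.rows.get(new, 0) + 1


level1, level2 = Level1GroupBy(), Level2GroupBy()
messages = []
for rec in [{"A": "a"}, {"A": "a"}, {"A": "b"}]:
    msg = level1.process(rec)  # the upper level emits as soon as its state changes
    messages.append(msg)
    level2.process(msg)        # the lower level applies the update immediately
print(messages)
print(level2.rows)
```

After the three records, value `a` has count 2 and value `b` has count 1, so the lower level holds one group under each of the count values 1 and 2.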
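The above process illustrates stream computing processing with update identifiers and does not limit the scope of protection of the present application. In practice, the upper-level second node may produce multiple records for the next-level second node to update at the same time, and those records may have further logical relationships with one another. To make the framework design and code logic between the upper-level and next-level second nodes clearer and easier to analyze, the above example can also be handled by adding "append/delete" identifiers. As shown in FIG. 6, on receiving a record whose column A value is a, the grouping operator of the upper-level second node adds a row "A:a, Count(A):1", attaches an "append" identifier to this change and sends it to the next-level second node, whose grouping operator adds a row Count(A):1 on receiving it. When the grouping operator of the upper-level second node again receives a record whose column A value is a, its row becomes A:a, Count(A):2. It then produces the pre-change row "A:a, Count(A):1" with a "delete" identifier; on receiving this record, the grouping operator of the next-level second node deletes the row Count(A):1. Next, the upper-level grouping operator produces the post-change row "A:a, Count(A):2" with an "append" identifier; on receiving it, the grouping operator of the next-level second node adds a row Count(A):2, which completes the processing of this piece of the real-time data stream. Multi-level grouping, or real-time stream computing with other second-type operators (such as the top and join operators), can be carried out in the same way.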
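The append/delete variant can be sketched the same way. The tuple layout `(flag, row)` and the function names are hypothetical; the sketch only shows that a change is expressed as a delete of the old row followed by an append of the new row, so the receiver needs no knowledge of how the old and new values relate.

```python
def level1_emit(counts, record):
    """Upper-level grouping operator: returns the messages caused by one record."""
    a = record["A"]
    old = counts.get(a)
    new = (old or 0) + 1
    counts[a] = new
    msgs = []
    if old is not None:
        # Retract the row as it was before the change.
        msgs.append(("delete", {"A": a, "count": old}))
    # Then publish the row as it is after the change.
    msgs.append(("append", {"A": a, "count": new}))
    return msgs


def level2_apply(rows, msg):
    """Lower-level operator: applies one append/delete message to its rows."""
    flag, row = msg
    if flag == "append":
        rows.append(row)
    elif flag == "delete":
        rows.remove(row)
    else:
        raise ValueError(f"unknown flag: {flag}")


counts, rows, log = {}, [], []
for rec in [{"A": "a"}, {"A": "a"}]:
    for msg in level1_emit(counts, rec):
        log.append(msg[0])
        level2_apply(rows, msg)
print(log)
print(rows)
```

The second record produces two messages instead of one, which is the cost of this scheme; in exchange, the receiver's logic is reduced to adding and removing rows.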
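In step S22, the processing result is sent to the next-level second node or to an external storage device.
As in the example above, after obtaining a processing result a second node sends it to the next-level second node for further processing. Alternatively, the lowest-level second node sends the processing result to an external storage device, such as a memory or a hard disk.
In one embodiment, to prevent the whole stream computing system from becoming inoperable because one second node fails, a mechanism for handling second-node failures is needed. Step S21, in which stream computing processing is performed on the real-time data stream according to the logical relationships among the operators in the partial DAG to obtain a processing result, further comprises the following steps:
In step S210, processing of received data is stopped at every preset interval, and after the data already being processed has been completed, a snapshot is generated of the second-type operators with logical state in the partial DAG.
The second node keeps receiving the real-time data stream at all times. At every preset interval it stops processing newly received data and only continues with the data that is in progress but not yet finished. Once that in-progress data has been processed, the second node generates a snapshot of the second-type operators with logical state in its assigned partial DAG. The snapshot records the logical state of every record in all second-type operators of the second node at that moment.
In step S211, the memory image file of the snapshot is recorded as a checkpoint.
The second node saves the image file of the snapshot (for example, a dump file) to memory and records it as a checkpoint, which is used, when the second node fails, to restore the records in the second-type operators to their logical state at the time of the checkpoint. After the checkpoint has been established, the second node resumes processing the received real-time data stream.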
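When a second node fails, the established checkpoint can be used to return each second-type operator to its earlier state. Through the following steps a second node automatically resumes correct processing after a failure. As shown in FIG. 7, the method comprises the following steps.
In step S301, a real-time data stream is received from the client or from the upper-level second node according to the partial DAG assigned by the first node.
In step S302, the received real-time data stream is written to the redo log.
The redo log records every real-time record received by the second node, including its content and the time at which it was received.
In step S303, processing of received data is stopped at every preset interval, and after the data already being processed has been completed, a snapshot is generated of the second-type operators with logical state in the partial DAG.
In step S304, the memory image file of the snapshot is recorded as a checkpoint.
In step S305, when a failure occurs, the checkpoint closest to the current time is read.
When the second node fails, it reads from memory the checkpoint closest to the current time, that is, the memory image file of that checkpoint.
In step S306, the logical state of the second-type operators is restored according to the memory image file of the checkpoint.
The memory image file that is read contains the snapshot of each second-type operator of the second node, that is, the logical state of every record at that moment, so the data of the second-type operators is restored from the memory image file to its logical state at the checkpoint.
In step S307, the data received after the checkpoint is read from the redo log and processed.
The second node determines the time of the checkpoint, reads from the redo log the data received after that time, and has its internal operators process the records one by one.
In step S308, when the data in the redo log has been processed, stream computing processing of the received real-time data stream continues.
When the second node has processed all the data in the redo log that was received after the checkpoint time, the data of each operator of the second node is restored to its logical state at the time of the failure. The second node can then continue stream computing processing of the received real-time data stream, so that it recovers automatically from a failure by means of the checkpoint and the redo log.
In step S309, the processing result is sent to the next-level second node or to an external storage device.
With the above method, users do not need to implement the stream computing processing logic themselves, and when a second node fails it can recover all of its data by itself and continue stream computing processing of the real-time data stream.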
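Steps S302 to S308 can be sketched as a small Python class. The sketch makes several assumptions that are not in the application: the operator state is a per-key counter, a sequence number stands in for the receive time stored in the redo log, the checkpoint is an in-memory deep copy rather than a memory image file, and a failure is simulated by wiping the state. The class name `RecoverableNode` is hypothetical.

```python
import copy


class RecoverableNode:
    """Toy second node with a redo log and a single most-recent checkpoint."""

    def __init__(self):
        self.state = {}             # logical state of the stateful operators
        self.redo_log = []          # (sequence number, record) pairs
        self.checkpoint = (0, {})   # (last sequence number covered, snapshot)
        self.seq = 0                # stands in for the receive time

    def receive(self, record):
        self.seq += 1
        self.redo_log.append((self.seq, record))  # S302: log the record first
        self._process(record)

    def _process(self, record):
        self.state[record] = self.state.get(record, 0) + 1

    def take_checkpoint(self):
        # S303/S304: called when no record is half-processed; snapshot the state.
        self.checkpoint = (self.seq, copy.deepcopy(self.state))

    def recover(self):
        ckpt_seq, snapshot = self.checkpoint          # S305: latest checkpoint
        self.state = copy.deepcopy(snapshot)          # S306: restore the state
        for seq, record in self.redo_log:             # S307: replay later records
            if seq > ckpt_seq:
                self._process(record)
        # S308: the caller can now go back to calling receive()


node = RecoverableNode()
node.receive("a")
node.receive("b")
node.take_checkpoint()
node.receive("a")
expected = dict(node.state)

node.state = {}   # simulate a failure that loses the in-memory state
node.recover()
print(node.state == expected)
```

Replay in `recover` calls `_process` directly rather than `receive`, so replayed records are not written to the redo log a second time.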
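FIG. 8 shows a distributed stream computing apparatus provided by an embodiment of the present application, comprising:
a receiving module 40 configured to receive a real-time data stream from the client or from the upper-level second node according to the partial DAG assigned by the first node;
a first processing module 41 configured to perform stream computing processing on the real-time data stream according to the logical relationships among the operators in the partial DAG, to obtain a processing result;
a sending module 42 configured to send the processing result to the next-level second node or to an external storage device.
The first processing module 41 comprises:
a determining submodule configured to determine whether the current operator is a second-type operator;
a processing submodule configured to add a logical state identifier to the processing result when the current operator is a second-type operator.
The processing submodule comprises:
a first sending unit configured to add an update identifier to the processing result and send it to the next-level second node; or
a second sending unit configured to add an append/delete identifier to the processing result and send it to the next-level second node.
The second sending unit comprises:
a first sending subunit configured to, when first data is generated according to the processing result, send the first data with an append identifier to the next-level second node, so that the next-level second node adds the first data;
a second sending subunit configured to, when the first data changes into second data according to the processing result, send the first data with a delete identifier to the next-level second node, so that the next-level second node deletes the first data, and then send the second data with an append identifier to the next-level second node, so that the next-level second node adds the second data.
In one embodiment, the first processing module 41 further comprises:
a generating submodule configured to stop the processing of received data at every preset interval and, after the data already being processed has been completed, generate a snapshot of the second-type operators with logical state in the partial DAG;
a mirroring submodule configured to record the memory image file of the snapshot as a checkpoint.
The apparatus further comprises:
a log module configured to write the received real-time data stream to the redo log;
a reading module configured to read, when a failure occurs, the checkpoint closest to the current time;
a recovery module configured to restore the logical state of the second-type operators according to the memory image file of the checkpoint;
a second processing module configured to read from the redo log the data received after the checkpoint and process it;
a third processing module configured to continue stream computing processing of the received real-time data stream when the data in the redo log has been processed.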
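The technical solutions of the embodiments of the present application are further described below through application examples.
Internet data statistics and analysis service providers need to provide professional, authoritative and independent website data statistics and analysis services to all kinds of websites, enterprises and institutions. Large providers typically serve more than one million customers and process more than one billion statistical records per day. The browsing behavior of network users occurs in real time, so website data statistics are a real-time stream computing task.
Website statistics usually include basic statistics such as page views (Page View, PV), unique visitors (Unique Visitor, UV), IP addresses, visit duration and number of visits (the series of activities from a visitor entering the website to leaving it is recorded as one visit, and one visit may produce multiple page views), as well as advanced statistics such as same-day returning visitors, new unique visitors, average visit frequency, average visit duration, average visit depth and pages viewed per visitor.
Same-day returning visitors are unique visitors who visit the website more than once within a day; they are determined from the counted unique visitors and number of visits. New unique visitors are the new visitors arising each day: the IP addresses counted in real time are compared with the historical IP addresses to find newly appearing IP addresses, and the unique visitors counted in real time under those newly appearing IP addresses are the new unique visitors. Average visit frequency is the average number of visits to the website per unique visitor within one day: average visit frequency = number of visits / unique visitors. Average visit duration is the average time spent on the website per visit: average visit duration = visit duration / number of visits. Average visit depth is the average number of page views per visit: average visit depth = page views / number of visits. Pages viewed per visitor is the average number of page views per unique visitor: pages viewed per visitor = page views / unique visitors.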
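The four ratio definitions above translate directly into code. The following Python sketch uses hypothetical parameter names and made-up sample figures purely to show the arithmetic; it does not guard against division by zero.

```python
def advanced_stats(page_views, unique_visitors, visits, total_visit_seconds):
    """Derive the advanced statistics from the basic counters.

    All four formulas follow the definitions in the text; the parameter
    names are illustrative.
    """
    return {
        "avg_visit_frequency": visits / unique_visitors,
        "avg_visit_duration": total_visit_seconds / visits,
        "avg_visit_depth": page_views / visits,
        "pages_per_visitor": page_views / unique_visitors,
    }


# Sample figures chosen only for illustration.
stats = advanced_stats(
    page_views=1200,
    unique_visitors=200,
    visits=300,
    total_visit_seconds=54000,
)
print(stats)
```

With these figures each unique visitor makes 1.5 visits on average, each visit lasts 180 seconds and covers 4 pages, and each unique visitor views 6 pages.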
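As can be seen, the advanced statistics are obtained by further computation on the real-time basic statistics, so the distributed stream computing system of the embodiments of the present application can be applied to the data statistics of any website. The statistics and processing logic for the above statistics can be converted by the first node into a directed acyclic graph, which is divided into multiple parts and assigned to multiple levels of second nodes. The lower-level second nodes compute the basic statistics, counting page views, unique visitors, IP addresses, number of visits and so on, and the higher-level second nodes compute the advanced statistics in real time from the basic statistics gathered in real time, namely same-day returning visitors, new unique visitors, average visit frequency, average visit duration, average visit depth and pages viewed per visitor.
For example, a first-level second node counts the page views of unique visitors. Internally it uses a first-level second-type operator to count the unique visitors entering the website in real time and a second-level second-type operator to count each unique visitor's page views on the website, and it outputs the result to a second-level second node that computes statistics related to per-visitor page views, for example a second-level second node that computes the total page views of the website by summing the page views of every unique visitor. The first-level second node also outputs the counted number of unique visitors to another second-level second node that computes statistics related to unique visitors, for example pages visited per visitor and average visit frequency. The second-type operators add a state identifier (an "update" identifier or an "append/delete" identifier) to the unique visitors and per-visitor page views counted in real time, so that these values, as well as the total page views, pages visited per visitor and average visit frequency computed in real time by the next-level second nodes, are continuously updated. The first-level and second-level second nodes write the received data to the redo log as a backup. At every preset interval they stop processing newly received data and, once the data already being processed has been counted and computed, generate a snapshot of their internal second-type operators. The snapshot includes the current statistical value of each second-type operator and the logical state identifier of that value. The memory image file of the snapshot serves as a checkpoint for data recovery in case of failure. When the first-level second node fails while the unique visitor count is 230, the checkpoint closest to the current time is read, and the corresponding second-type operators are restored to the state of 10 minutes earlier, when the unique visitor count was 220, together with each unique visitor's page views at that time. The data received during those 10 minutes is then read from the redo log, and the second-type operators redo the statistics for those 10 minutes from it; once the unique visitor count of 230 and each unique visitor's page views have been recomputed, real-time statistics on unique visitors continue. Every second node of the above distributed statistics system for website data can recover quickly by itself in this way when a failure occurs, thereby providing users with reliable real-time statistics.
The above application example of a distributed statistics system for website data is an exemplary illustration of the embodiments of the present application and does not limit the scope of protection of the present application; the distributed stream computing system and method provided by the embodiments of the present application are equally applicable to any other real-time data statistics system.
A content delivery network (Content Delivery Network, CDN) distributes the content of the origin site to nodes throughout the country, which shortens the delay users experience when viewing objects, improves the response speed and availability of the website, and addresses problems such as limited network bandwidth, heavy user traffic and unevenly distributed points of presence; website content is distributed across the whole network, accelerating sites across operators and regions.
When the distributed stream computing system provided by the embodiments of the present application is applied to a CDN, the sources of user access to a website are used as the input of the distributed stream computing system to detect whether access to the website is abnormal. By counting the number of unique visitors, the IP addresses and each unique visitor's page views, data such as the average visit frequency and the access frequency and page views of each IP address are further computed, and each unique visitor's page views and average visit frequency are sorted to predict whether the website is under attack, for example a distributed denial-of-service (Distributed Denial of Service, DDoS) attack. The first node of the distributed stream computing system converts the above logic into a directed acyclic graph, divides the directed acyclic graph into multiple parts and assigns them to multiple second nodes. The first-level second nodes count the number of unique visitors, the IP addresses and each unique visitor's page views, and the second-level second nodes compute and sort each unique visitor's page views, the average visit frequency, and the access frequency and page views of each IP address. Visitors or IP addresses whose visit frequency or page views are excessively high within a short time may be attacking the website. For example, if the access requests or page view requests initiated simultaneously by several IP addresses within a unit of time exceed the throughput of the website server and occupy almost all of its resources for a short time, so that the normal access of other users cannot be completed, the computers at those IP addresses may be attacking the website server. The distributed stream computing system then notifies the CDN of the abnormal result, and access from those IP addresses to the website server is blocked for a certain period, which prevents network users from attacking the website server and keeps it running normally. Each second node in the distributed stream computing system can likewise recover by itself after a failure, so real-time statistics on user access data are not affected. The distributed stream computing system provided by the embodiments of the present application can also report a website's current content distribution within the CDN, using unique visitors and IP addresses to confirm whether users from different regions and different operators can access the website normally.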
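The count-then-sort step of the detection logic can be sketched in Python as follows. The threshold rule (a fixed number of requests per second over a time window), the function name and the sample IP addresses are all assumptions made for illustration; the text only says that unusually high frequencies within a short time are treated as suspicious.

```python
from collections import Counter


def suspicious_ips(requests, window_seconds, max_requests_per_second):
    """Rank IP addresses by request count and return those above a limit.

    `requests` is an iterable of source IP strings observed in one window.
    """
    per_ip = Counter(requests)          # role of the first level: count per IP
    ranked = per_ip.most_common()       # role of the second level: sort descending
    limit = window_seconds * max_requests_per_second
    return [ip for ip, count in ranked if count > limit]


# Made-up traffic for a 10-second window; the addresses are placeholders.
window = ["10.0.0.1"] * 50 + ["10.0.0.2"] * 3 + ["10.0.0.3"] * 40
flagged = suspicious_ips(window, window_seconds=10, max_requests_per_second=2)
print(flagged)
```

The returned list is ordered from the busiest address downward, so a caller that can only block a few addresses would start from the front. A production rule would more likely compare against measured server throughput than a fixed constant.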
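In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
The memory may include forms of computer-readable media such as non-persistent memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Certain terms are used in the specification and claims to refer to particular components. Those skilled in the art will understand that hardware manufacturers may use different names for the same component. The specification and claims distinguish components not by differences in name but by differences in function. The term "comprising" as used throughout the specification and claims is open-ended and should be interpreted as "including but not limited to". "Substantially" means that, within an acceptable error range, a person skilled in the art can solve the technical problem and basically achieve the technical effect. In addition, the term "coupled" here covers any direct or indirect means of electrical coupling. Thus, if the text describes a first device as coupled to a second device, the first device may be directly electrically coupled to the second device, or indirectly electrically coupled to it through other devices or coupling means. The remainder of the specification describes preferred embodiments for carrying out the invention; the description is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the invention is defined by the appended claims.
It should also be noted that the terms "include", "comprise" and any variants of them are intended to cover non-exclusive inclusion, so that a product or system comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a product or system. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the product or system that comprises it.
The above description shows and describes several preferred embodiments of the invention. As stated earlier, however, the invention is not limited to the forms disclosed herein; the description should not be read as excluding other embodiments, and the invention can be used in various other combinations, modifications and environments and can be altered, within the scope of the inventive concept described herein, through the above teachings or the skill or knowledge of the relevant art. Alterations and changes made by those skilled in the art that do not depart from the spirit and scope of the invention fall within the scope of protection of the appended claims.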

Claims (17)

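  1. A distributed stream computing system, characterized by comprising a first node and second nodes, wherein the first node converts input offline sql operation logic into a DAG, the DAG representing the logical relationships among the operators in the offline sql operation logic;
    the first node divides the DAG into multiple parts according to the logical relationships among the operators and assigns the parts to corresponding multiple second nodes, the multiple second nodes forming multiple levels according to the partial DAGs assigned to them;
    the multiple second nodes receive a real-time data stream and complete stream computing processing level by level according to the DAG.
  2. The system according to claim 1, characterized in that the first node dividing the DAG into multiple parts according to the logical relationships among the operators and assigning the parts to corresponding multiple second nodes, with the multiple second nodes forming multiple levels according to the partial DAGs assigned to them, comprises:
    determining, in the logical relationships among the operators, the positions at which shuffle processing of the data has been completed, dividing the DAG into corresponding multiple parts according to those positions and assigning the parts to multiple second nodes, the multiple second nodes forming multiple levels according to the partial DAGs assigned to them.
  3. The system according to claim 1, characterized in that the DAG comprises first-type operators without a logical state and second-type operators with a logical state; in the stream computing processing, a second-type operator adds a logical state identifier to its processing result.
  4. The system according to claim 1, characterized in that the second node comprises a data-driven module, a stream computing module and an output module, wherein the data-driven module receives the real-time data stream and sends it to the stream computing module; the stream computing module completes the stream computing processing according to the logical relationships among the operators in the assigned partial DAG and sends the processing result to the output module; and the output module sends the processing result to the next-level second node or to an external storage device.
  5. The system according to claim 4, characterized in that the output module comprises a scheduling submodule and a writing submodule; the output module sends the processing result to the next-level second node through the scheduling submodule, or sends the processing result to the external storage device through the writing submodule.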
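  6. A distributed stream computing method, characterized in that the method comprises:
    receiving a real-time data stream from the client or from the upper-level second node according to the partial DAG assigned by the first node;
    performing stream computing processing on the real-time data stream according to the logical relationships among the operators in the partial DAG, to obtain a processing result;
    sending the processing result to the next-level second node or to an external storage device.
  7. The method according to claim 6, characterized in that performing stream computing processing on the real-time data stream according to the logical relationships among the operators in the partial DAG to obtain a processing result comprises:
    determining whether the current operator is a second-type operator;
    when the current operator is a second-type operator, adding a logical state identifier to the processing result.
  8. The method according to claim 7, characterized in that adding a logical state identifier to the processing result comprises:
    adding an update identifier to the processing result and sending it to the next-level second node; or
    adding an append/delete identifier to the processing result and sending it to the next-level second node.
  9. The method according to claim 8, characterized in that adding an append/delete identifier to the processing result and sending it to the next-level second node comprises:
    when first data is generated according to the processing result, sending the first data with an append identifier to the next-level second node, so that the next-level second node adds the first data;
    when the first data changes into second data according to the processing result, sending the first data with a delete identifier to the next-level second node, so that the next-level second node deletes the first data, and then sending the second data with an append identifier to the next-level second node, so that the next-level second node adds the second data.
  10. The method according to claim 6, characterized in that performing stream computing processing on the real-time data stream according to the logical relationships among the operators in the partial DAG to obtain a processing result further comprises:
    stopping the processing of received data at every preset interval and, after the data already being processed has been completed, generating a snapshot of the second-type operators with logical state in the partial DAG;
    recording the memory image file of the snapshot as a checkpoint.
  11. The method according to claim 10, characterized in that, after the real-time data stream is received from the client or from the upper-level second node according to the partial DAG assigned by the first node, the method further comprises:
    writing the received real-time data stream to a redo log;
    when a failure occurs, reading the checkpoint closest to the current time;
    restoring the logical state of the second-type operators according to the memory image file of the checkpoint;
    reading from the redo log the data received after the checkpoint and processing it;
    when the data in the redo log has been processed, continuing stream computing processing of the received real-time data stream.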
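  12. A distributed stream computing apparatus, characterized by comprising:
    a receiving module configured to receive a real-time data stream from the client or from the upper-level second node according to the partial DAG assigned by the first node;
    a first processing module configured to perform stream computing processing on the real-time data stream according to the logical relationships among the operators in the partial DAG, to obtain a processing result;
    a sending module configured to send the processing result to the next-level second node or to an external storage device.
  13. The apparatus according to claim 12, characterized in that the first processing module comprises:
    a determining submodule configured to determine whether the current operator is a second-type operator;
    a processing submodule configured to add a logical state identifier to the processing result when the current operator is a second-type operator.
  14. The apparatus according to claim 13, characterized in that the processing submodule comprises:
    a first sending unit configured to add an update identifier to the processing result and send it to the next-level second node; or
    a second sending unit configured to add an append/delete identifier to the processing result and send it to the next-level second node.
  15. The apparatus according to claim 14, characterized in that the second sending unit comprises:
    a first sending subunit configured to, when first data is generated according to the processing result, send the first data with an append identifier to the next-level second node, so that the next-level second node adds the first data;
    a second sending subunit configured to, when the first data changes into second data according to the processing result, send the first data with a delete identifier to the next-level second node, so that the next-level second node deletes the first data, and then send the second data with an append identifier to the next-level second node, so that the next-level second node adds the second data.
  16. The apparatus according to claim 12, characterized in that the first processing module further comprises:
    a generating submodule configured to stop the processing of received data at every preset interval and, after the data already being processed has been completed, generate a snapshot of the second-type operators with logical state in the partial DAG;
    a mirroring submodule configured to record the memory image file of the snapshot as a checkpoint.
  17. The apparatus according to claim 16, characterized in that the apparatus further comprises:
    a log module configured to write the received real-time data stream to a redo log;
    a reading module configured to read, when a failure occurs, the checkpoint closest to the current time;
    a recovery module configured to restore the logical state of the second-type operators according to the memory image file of the checkpoint;
    a second processing module configured to read from the redo log the data received after the checkpoint and process it;
    a third processing module configured to continue stream computing processing of the received real-time data stream when the data in the redo log has been processed.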
PCT/CN2016/086105 2015-06-26 2016-06-17 分布式流计算系统、方法和装置 WO2016206567A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510360023.8A CN106293892B (zh) 2015-06-26 2015-06-26 分布式流计算系统、方法和装置
CN201510360023.8 2015-06-26

Publications (1)

Publication Number Publication Date
WO2016206567A1 true WO2016206567A1 (zh) 2016-12-29

Family

ID=57584648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/086105 WO2016206567A1 (zh) 2015-06-26 2016-06-17 分布式流计算系统、方法和装置

Country Status (2)

Country Link
CN (1) CN106293892B (zh)
WO (1) WO2016206567A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189746A (zh) * 2018-07-12 2019-01-11 北京百度网讯科技有限公司 通用流式Shuffle引擎的实现方法、装置、设备及存储介质
CN109800069A (zh) * 2018-12-25 2019-05-24 北京明略软件系统有限公司 一种实现数据治理的方法及装置
CN111414264A (zh) * 2020-03-20 2020-07-14 北京奇艺世纪科技有限公司 数据处理方法、装置、电子设备及存储介质
CN111984380A (zh) * 2020-08-21 2020-11-24 北京金山云网络技术有限公司 流计算服务系统及其控制方法和装置
CN114676324A (zh) * 2022-03-28 2022-06-28 网易(杭州)网络有限公司 一种数据处理方法、装置及设备
US11546162B2 (en) 2017-11-09 2023-01-03 Nchain Licensing Ag Systems and methods for ensuring correct execution of computer program using a mediator computer system
US11575511B2 (en) 2017-11-09 2023-02-07 Nchain Licensing Ag System for simplifying executable instructions for optimised verifiable computation
US11888976B2 (en) 2017-12-13 2024-01-30 Nchain Licensing Ag System and method for multi-party generation of blockchain-based smart contract

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
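CN107273193A (zh) * 2017-04-28 2017-10-20 中国科学院信息工程研究所 DAG-based data processing method and system for multiple computing frameworks
CN109033109B (zh) * 2017-06-09 2020-11-27 杭州海康威视数字技术股份有限公司 Data processing method and system
CN109426574B (zh) * 2017-08-31 2022-04-05 华为技术有限公司 Distributed computing system, and data transmission method and apparatus in a distributed computing system
CN107665241B (zh) * 2017-09-07 2020-09-29 北京京东尚科信息技术有限公司 Method and apparatus for multi-dimensional deduplication of real-time data
CN108984155B (zh) * 2018-05-17 2021-09-07 创新先进技术有限公司 Method and apparatus for setting a data processing flow
CN108777612B (zh) * 2018-05-18 2020-03-20 中科声龙科技发展(北京)有限公司 Optimization method and circuit for the core computing component of a proof-of-work computing chip
CN109063056A (zh) * 2018-07-20 2018-12-21 阿里巴巴集团控股有限公司 Data query method, system and terminal device
CN109799973B (zh) * 2018-12-11 2022-02-11 极道科技(北京)有限公司 Data-driven, user-transparent, scalable programming method
CN111435352A (zh) * 2019-01-11 2020-07-21 北京京东尚科信息技术有限公司 Distributed real-time computing method, apparatus, system and storage medium thereof
CN112148762A (zh) * 2019-06-28 2020-12-29 西安京迅递供应链科技有限公司 Method and apparatus for statistics on real-time data streams
CN110532072A (zh) * 2019-07-24 2019-12-03 中国科学院计算技术研究所 Distributed streaming data processing method and system based on a microkernel operating system
CN110795151A (zh) * 2019-10-08 2020-02-14 支付宝(杭州)信息技术有限公司 Method, apparatus and device for adjusting operator concurrency
CN112988239A (zh) * 2019-12-17 2021-06-18 深圳市优必选科技股份有限公司 Data operation method, apparatus and terminal device
CN113515285A (zh) * 2020-04-10 2021-10-19 北京沃东天骏信息技术有限公司 Method and apparatus for generating real-time computing logic data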

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120137018A1 (en) * 2010-11-30 2012-05-31 Volkmar Uhlig Methods and systems for reconfiguration and repartitioning of a parallel distributed stream process
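CN102609451A (zh) * 2012-01-11 2012-07-25 华中科技大学 SQL query plan generation method for streaming data processing
CN104123374A (zh) * 2014-07-28 2014-10-29 北京京东尚科信息技术有限公司 Method and apparatus for aggregate queries in a distributed database
CN104580322A (zh) * 2013-10-25 2015-04-29 华为技术有限公司 Distributed data stream processing method and apparatus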

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7200623B2 (en) * 1998-11-24 2007-04-03 Oracle International Corp. Methods to perform disk writes in a distributed shared disk system needing consistency across failures
US9430117B2 (en) * 2012-01-11 2016-08-30 International Business Machines Corporation Triggering window conditions using exception handling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120137018A1 (en) * 2010-11-30 2012-05-31 Volkmar Uhlig Methods and systems for reconfiguration and repartitioning of a parallel distributed stream process
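CN102609451A (zh) * 2012-01-11 2012-07-25 华中科技大学 SQL query plan generation method for streaming data processing
CN104580322A (zh) * 2013-10-25 2015-04-29 华为技术有限公司 Distributed data stream processing method and apparatus
CN104123374A (zh) * 2014-07-28 2014-10-29 北京京东尚科信息技术有限公司 Method and apparatus for aggregate queries in a distributed database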

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11546162B2 (en) 2017-11-09 2023-01-03 Nchain Licensing Ag Systems and methods for ensuring correct execution of computer program using a mediator computer system
US11575511B2 (en) 2017-11-09 2023-02-07 Nchain Licensing Ag System for simplifying executable instructions for optimised verifiable computation
US11635950B2 (en) 2017-11-09 2023-04-25 Nchain Licensing Ag Arithmetic enhancement of C-like smart contracts for verifiable computation
US11658801B2 (en) 2017-11-09 2023-05-23 Nchain Licensing Ag System for securing verification key from alteration and verifying validity of a proof of correctness
US11888976B2 (en) 2017-12-13 2024-01-30 Nchain Licensing Ag System and method for multi-party generation of blockchain-based smart contract
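CN109189746A (zh) * 2018-07-12 2019-01-11 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for implementing a general-purpose streaming Shuffle engine
CN109189746B (zh) * 2018-07-12 2021-01-22 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for implementing a general-purpose streaming Shuffle engine
CN109800069A (zh) * 2018-12-25 2019-05-24 北京明略软件系统有限公司 Method and apparatus for implementing data governance
CN109800069B (zh) * 2018-12-25 2021-04-30 北京明略软件系统有限公司 Method and apparatus for implementing data governance
CN111414264A (zh) * 2020-03-20 2020-07-14 北京奇艺世纪科技有限公司 Data processing method, apparatus, electronic device and storage medium
CN111984380A (zh) * 2020-08-21 2020-11-24 北京金山云网络技术有限公司 Stream computing service system and control method and apparatus therefor
CN114676324A (zh) * 2022-03-28 2022-06-28 网易(杭州)网络有限公司 Data processing method, apparatus and device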

Also Published As

Publication number Publication date
CN106293892B (zh) 2019-03-19
CN106293892A (zh) 2017-01-04

Similar Documents

Publication Publication Date Title
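WO2016206567A1 (zh) Distributed stream computing system, method and apparatus
CN110521171B (zh) Flow cluster resolution for application performance monitoring and management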
US10560465B2 (en) Real time anomaly detection for data streams
JP6723329B2 (ja) System, method, and computer-readable storage medium for customizable event-triggered computation at edge locations
US11329904B2 (en) Using subject alternative names for aggregate network traffic monitoring
US11646972B2 (en) Dynamic allocation of network resources using external inputs
US9830240B2 (en) Smart storage recovery in a distributed storage system
CN105917632B (zh) Method for scalable distributed network traffic analytics in telecommunications
US10686807B2 (en) Intrusion detection system
US8805849B1 (en) Enabling use of analytic functions for distributed storage system data
Laboshin et al. The big data approach to collecting and analyzing traffic data in large scale networks
JP2015508543A (ja) Processing store visit data
US11297105B2 (en) Dynamically determining a trust level of an end-to-end link
US10698863B2 (en) Method and apparatus for clearing data in cloud storage system
CN102082800A (zh) Method and server for processing user requests
US10630818B2 (en) Increasing data resiliency operations based on identifying bottleneck operators
JP2023534696A (ja) Anomaly detection in network topology
CN106649344B (zh) Network log compression method and apparatus
US20210089556A1 (en) Asynchronous row to object enrichment of database change streams
US20210133186A1 (en) Capture and replay of user requests for performance analysis
US11258860B2 (en) System and method for bot detection and classification
US10528400B2 (en) Detecting deadlock in a cluster environment using big data analytics
Sasidharan Implementation of High Available and Scalable Syslog Server with NoSQL Cassandra Database and Message Queue
Kumar et al. Raw Cardinality Information Discovery for Big Datasets
US20230370426A1 (en) Sensitive Data Identification In Real-Time for Data Streaming

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 16813683; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 16813683; Country of ref document: EP; Kind code of ref document: A1