CN108696559B - Stream processing method and device


Info

Publication number
CN108696559B
Authority
CN
China
Prior art keywords
stream processing
block
data storage
management unit
storage node
Prior art date
Legal status
Active
Application number
CN201710233425.0A
Other languages
Chinese (zh)
Other versions
CN108696559A (en)
Inventor
曹俊
胡斐然
林铭
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201710233425.0A
Priority to PCT/CN2018/082641
Publication of CN108696559A
Application granted
Publication of CN108696559B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/561 Adding application-functional data or data for application control, e.g. adding metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention disclose a stream processing method and apparatus. In the method, a stream processing management unit receives a stream processing task sent by a client; the stream processing management unit acquires, from a metadata management node, the block number of each block corresponding to the path of a file to be processed and the network address of the data storage node where each block is located; the stream processing management unit sends the stream processing logic and the block number of each block to the stream processing computing unit of the data storage node where that block is located; and each stream processing computing unit acquires the block data corresponding to the received block number from its data storage node and executes the stream processing logic on that block data. The method overcomes the technical problem that a low network transmission speed between the stream processing system and the data storage nodes limits the stream processing speed.

Description

Stream processing method and device
Technical Field
The present invention relates to the field of information technologies, and in particular, to a stream processing method and apparatus.
Background
A workflow is an abstraction, generalization, and description of a business process and of the logical rules by which the services within it are organized. The workflow concept originated in production organization and office automation; it was proposed for routine activities with fixed procedures, with the aim of decomposing work into well-defined steps or roles, executing and monitoring those steps according to defined rules and procedures, and thereby improving work efficiency, controlling processes better, improving customer service, and managing business processes effectively. Workflow modeling means representing a workflow in a computer with a suitable model and performing computation on it; through workflow modeling, workflows can be managed by a workflow system.
The main function of a stream processing system is to define, execute, and manage workflows with the support of computer technology, and to coordinate the exchange of information between the streams and the participants during workflow execution. A stream processing system generally includes a workflow design tool with which a user designs workflow definitions, and a workflow management tool that manages the execution of workflows. While the workflow system runs, a workflow instance includes one or more tasks, each representing a piece of work that needs to be done.
Apache Storm is a typical prior-art stream processing system. It uses a master-slave architecture: Nimbus is the master process, and Supervisor is the slave process that runs the tasks. The Storm stream processing system is connected over a network to a distributed file system that stores the data Storm needs to process. The distributed file system includes a Master Server and Data Servers: the Master Server is the metadata management node and manages the distribution of data blocks, and each Data Server is a data storage node that stores block data. Storm and the data storage nodes are deployed on different servers.
In a Storm stream processing job, Storm first needs to acquire the data to be stream processed from the data servers. Specifically, a data server provides a data query interface; Storm passes parameters to the data query interface over the network, acquires the data from the data server over the network, and then loads the acquired data into the Supervisor.
Because the prior-art stream processing system must acquire data from the data storage nodes over the network, the speed of data acquisition is limited by network performance, so the performance of the whole stream processing pipeline is bounded by the network; when the network transmission speed between the stream processing system and the data storage nodes is low, the stream processing speed is severely affected.
Disclosure of Invention
To solve the problems in the prior art, embodiments of the present invention provide a stream processing method and apparatus, which overcome the technical problem that a low network transmission speed between the stream processing system and the data storage nodes limits the stream processing speed.
In a first aspect, an embodiment of the present invention provides a stream processing method, where the method is applied to a stream processing system, where the stream processing system includes a stream processing management unit and a stream processing computing unit, and the method includes:
the method comprises the steps that a stream processing management unit receives a stream processing task sent by a client, wherein the stream processing task comprises stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system comprises a metadata management node and a plurality of data storage nodes, and each data storage node is provided with a stream processing calculation unit;
the stream processing management unit acquires the block number of each block corresponding to the path of the file to be processed and the network address of the data storage node where each block is located from the metadata management node;
the stream processing management unit sends the stream processing logic and the block number of each block to the stream processing computing unit of the data storage node where that block is located;
and the stream processing computing unit acquires, from its data storage node, the block data corresponding to the received block number, and executes the stream processing logic on that block data.
Because the stream processing computing units are distributed over the data storage nodes, the stream processing management unit dispatches the stream processing task to the corresponding data storage nodes according to the path of the file to be processed, and the stream processing computing unit on each of those nodes reads the block data of the file to be processed directly from local storage and runs the stream processing logic on it. Since the file to be processed is read locally, the technical problem that a low network transmission speed between the stream processing system and the data storage nodes limits the stream processing speed is overcome.
In addition, the file to be processed is split into blocks of data, and the stream processing logic is executed in parallel on different stream processing computing units, which further increases the stream processing speed and improves processing efficiency.
In one implementation manner of the embodiment of the present invention, the data storage node is provided with a data management unit, the stream processing calculation unit is provided as a program library, and the data management unit executes the function of the stream processing calculation unit by loading the program library.
Because the stream processing calculation unit is embedded in the data management unit as a library and the data management unit can read the block data directly, the stream processing logic can be executed as soon as the block data has been read, which increases the stream processing speed.
In another implementation manner of the embodiment of the present invention, the method further includes:
the stream processing calculation unit transmits a processing result obtained by executing the stream processing logic to the stream processing management unit.
In another implementation manner of the embodiment of the present invention, the metadata management node records a first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the step in which the stream processing management unit acquires, from the metadata management node, the block number of each block corresponding to the path and the network address of the data storage node where each block is located specifically includes:
and the stream processing management unit acquires the block number of each block from the first corresponding relation according to the path of the file to be processed in the distributed file system.
In another implementation manner of the embodiment of the present invention, the metadata management node records a second correspondence between the block number of each block and the network address of the data storage node where that block is located, and the step in which the stream processing management unit acquires, from the metadata management node, the block number of each block corresponding to the path and the network address of the data storage node where each block is located specifically includes:
and the stream processing management unit acquires the network address of the data storage node where each block number is located from the second corresponding relation according to each block number.
In a second aspect, an embodiment of the present invention provides a stream processing system, including a stream processing management unit and a stream processing computing unit,
the system comprises a stream processing management unit, a stream processing calculation unit and a file processing unit, wherein the stream processing management unit is used for receiving a stream processing task sent by a client, the stream processing task comprises stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system comprises a metadata management node and a plurality of data storage nodes, and each data storage node is provided with a stream processing calculation unit;
the stream processing management unit is also used for acquiring each block number corresponding to the path and the network address of the data storage node where each block number is located from the metadata management node;
the stream processing management unit is also used for respectively sending the stream processing logic and the block numbers corresponding to the network addresses to the stream processing units of the corresponding data storage nodes;
and the stream processing calculation unit is used for acquiring the block data corresponding to the received block number from the data storage node, and executing stream processing logic aiming at the block data corresponding to the received block number.
In one implementation manner of the embodiment of the present invention, the data storage node is provided with a data management unit, the stream processing calculation unit is provided as a program library, and the data management unit executes the function of the stream processing calculation unit by loading the program library.
In another implementation manner of the embodiment of the present invention, the stream processing calculating unit is further configured to send a processing result obtained by executing the stream processing logic to the stream processing managing unit.
In another implementation manner of the embodiment of the present invention, the metadata management node records a first correspondence between a path of the file to be processed in the distributed file system and a block number of each block, and the stream processing management unit is specifically configured to:
and acquiring the block number of each block from the first corresponding relation according to the path of the file to be processed in the distributed file system.
In another implementation manner of the embodiment of the present invention, the metadata management node records a second correspondence between the block number of each block and the network address of the data storage node where that block is located, and the stream processing management unit is specifically configured to:
and acquiring the network address of the data storage node where each block number is located from the second corresponding relation according to the block number of each block.
In a third aspect, an embodiment of the present invention provides a stream processing management unit, which executes a function of the stream processing management unit in the stream processing system.
In a fourth aspect, an embodiment of the present invention provides a host, including a memory, a processor, and a bus, where the memory and the processor are connected to the bus, the memory stores program instructions, and the processor executes the program instructions to implement functions of a stream processing management unit in the stream processing system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic block diagram of a stream processing system according to an embodiment of the present invention;
FIG. 2 is another schematic diagram of a stream processing system according to an embodiment of the present invention;
FIG. 3 is a data interaction diagram of a stream processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an apparatus structure of a stream processing management unit according to an embodiment of the present invention;
fig. 5 is a schematic device structure diagram of a host according to an embodiment of the present invention.
Detailed Description
Referring first to fig. 1, fig. 1 is a schematic diagram illustrating a connection between a stream processing system and a distributed file system and a client according to an embodiment of the present invention, where, as shown in fig. 1, the stream processing system includes a stream processing management unit 302 and a plurality of stream processing computing units 1011, 1021, … …, and 1031, and the distributed file system includes a metadata management node 201 and a plurality of data storage nodes 101, 102, … …, and 103.
In the embodiment of the present invention, the client 301 is connected to the stream processing management unit 302, and the stream processing management unit 302 is connected to the metadata management node 201 and the plurality of data storage nodes 101, 102, … …, and 103, respectively.
The client 301 is configured to receive a stream processing job submitted by a user, and in the embodiment of the present invention, when the user submits the stream processing job, the user specifies a path of data to be processed in the distributed file system, and specifies what processing is to be performed on the data to be processed.
The path of the file to be processed, carried in the stream processing task, may be, for example, a URL (Uniform Resource Locator) in the distributed file system; the URL is a storage identifier of the distributed file system, and the block number of each block of the file to be processed can be looked up in the metadata management node 201 through the URL.
The client 301 generates a stream processing task according to a stream processing job submitted by a user, where the stream processing task includes stream processing logic and a path of data to be processed in the distributed file system, where the stream processing logic defines what kind of processing is performed on the data to be processed, and for example, the stream processing logic may specify that an abnormal event is searched for in the data to be processed.
The client 301 sends a stream processing task to the stream processing management unit 302, and the stream processing management unit 302 performs scheduling according to the stream processing task, selects the stream processing computing unit to obtain the file to be processed from the distributed file system, and processes the file to be processed with the stream processing logic.
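For illustration only, the stream processing task described above could be represented roughly as in the following minimal Java sketch; the class and interface names (StreamProcessingTask, StreamProcessingLogic) and the process method are assumptions of this illustration, not part of the embodiment or of any existing API.

// Minimal sketch of a stream processing task: a path plus the logic to run on the data.
import java.io.Serializable;

public final class StreamProcessingTask implements Serializable {
    private final String filePath;              // path of the file to be processed in the distributed file system
    private final StreamProcessingLogic logic;  // what to do with each record of block data

    public StreamProcessingTask(String filePath, StreamProcessingLogic logic) {
        this.filePath = filePath;
        this.logic = logic;
    }

    public String filePath() { return filePath; }
    public StreamProcessingLogic logic() { return logic; }
}

// The logic is shipped to the stream processing computing units, so it must be serializable.
interface StreamProcessingLogic extends Serializable {
    void process(byte[] record);                // e.g. scan the record for an abnormal event
}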
For example, the stream processing system may be implemented based on the Apache Flink architecture, where the client 301 is a Flink client process, the stream processing management unit 302 is a Flink JobManager process, and each stream processing computing unit is a Flink TaskManager process.
The metadata management node 201 is provided with a metadata management unit 2011 and a database 2012, the metadata management unit 2011 provides an interface through which an external device can query the database 2012. The database 2012 records a first corresponding relationship between a path of a file to be processed in the distributed file system and a block number of each block in the distributed file system, and a second corresponding relationship between the block number of each block and a network address of a data storage node where each block is located.
In the distributed storage system, a file to be processed is stored in fragments in the databases of the data storage nodes, where each fragment is a piece of block data identified by a block number. The metadata management node records, for every file in the distributed storage system, the correspondence between the file path and the block numbers of its blocks, as well as the data storage node in whose database each block number is stored.
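A minimal sketch of these two correspondences, assuming plain in-memory maps stand in for database 2012; all type and method names here are illustrative assumptions.

// Sketch of the metadata kept by the metadata management node.
import java.util.List;
import java.util.Map;

public final class MetadataCatalog {
    // first correspondence: file path -> block numbers of its blocks
    private final Map<String, List<Long>> pathToBlockNumbers;
    // second correspondence: block number -> network address of the data storage node holding it
    private final Map<Long, String> blockNumberToNodeAddress;

    public MetadataCatalog(Map<String, List<Long>> pathToBlockNumbers,
                           Map<Long, String> blockNumberToNodeAddress) {
        this.pathToBlockNumbers = pathToBlockNumbers;
        this.blockNumberToNodeAddress = blockNumberToNodeAddress;
    }

    public List<Long> blockNumbersOf(String filePath) {
        return pathToBlockNumbers.getOrDefault(filePath, List.of());
    }

    public String nodeAddressOf(long blockNumber) {
        return blockNumberToNodeAddress.get(blockNumber);
    }
}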
The data storage node 101 is provided with a stream processing calculation unit 1011 and a database 1012, the database 1012 being recorded with block data and a correspondence relationship between block numbers and the block data, the stream processing calculation unit 1011 being able to access the database 1012 and acquire the corresponding block data from the database 1012 by the block numbers.
In fig. 1, the data storage nodes 102 and 103 have a similar structure to the data storage node 101, except that the block data recorded in the database is different, and will not be described herein.
For example, the distributed file system may be implemented by Hadoop, the databases 2012, 1012, 1022, … …, and 1032 may be implemented by HBase (Hadoop Database), and the metadata management unit 2011 may be an HMaster process of the HBase database.
In the embodiment of the present invention, the client 301 and the stream processing management unit 302 may be disposed on the same host, and establish data connection with the metadata management node 201 and the data storage nodes 101 and 102 … … 103, respectively, through a network.
In some examples, the client 301 and the stream processing management unit 302 may also be disposed on different hosts, which is not limited by the embodiment of the present invention.
For ease of understanding, referring to fig. 2, fig. 2 is another schematic structural diagram of the stream processing system according to the embodiment of the present invention. As shown in fig. 2, the client 301 and the stream processing management unit 302 are disposed on a host 10. The host 10 further includes an operating system 303 and hardware 304; the hardware 304 supports the running of the operating system 303 and includes a physical network card 3041. The client 301 and the stream processing management unit 302 each run on the operating system 303 in the form of a process and access the network 50 through the physical network card 3041.
The metadata management node 201 includes a database 2012, a metadata management unit 2011, an operating system 2013, and hardware 2014. The database 2012 and the metadata management unit 2011 run on the operating system 2013 in the form of processes; the hardware 2014 supports the running of the operating system 2013 and includes a physical network card 20141 connected to the network 50. The metadata management unit 2011 provides an interface through which an external device can access the database 2012.
Moreover, the data storage node 101 includes a database 1012, a stream processing computing unit 1011, an operating system 1013, and hardware 1014. The database 1012 and the stream processing computing unit 1011 run on the operating system 1013 in the form of processes; the hardware 1014 supports the running of the operating system 1013 and includes a physical network card 10141 connected to the network 50. In the embodiment of the present invention, the stream processing computing unit 1011 can access the database 1012.
The structure of data storage nodes 102 and 103 is similar to data storage node 101 and will not be described in detail herein.
For example, communication between the stream processing management unit 302 and the client 301, the metadata management unit 2011, and each of the stream processing computing units 1011, 1021, … …, and 1031 may be implemented by remote procedure call (RPC).
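For illustration, the RPC surface implied by this architecture might look roughly like the following sketch; the interface and method names are assumptions (reusing the hypothetical StreamProcessingLogic type from the earlier sketch) and do not correspond to any specific RPC framework.

// Hypothetical RPC interfaces between the units; names are illustrative only.
import java.util.List;
import java.util.Map;

// Exposed by the metadata management unit 2011 (query of database 2012).
interface MetadataService {
    // returns block number -> network address of the data storage node holding that block
    Map<Long, String> locateBlocks(String filePath);
}

// Exposed by each stream processing computing unit (1011, 1021, ...).
interface StreamComputeService {
    // receive the stream processing logic plus the block numbers stored on this node
    void runTask(StreamProcessingLogic logic, List<Long> blockNumbers);
}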
Based on the above architecture, in the embodiment of the present invention, the stream processing management unit 302 receives a stream processing task sent by the client 301, where the stream processing task includes stream processing logic and a path of a file to be processed in the distributed file system; the stream processing management unit 302 acquires, from the metadata management node 201, the block number of each block corresponding to the path and the network address of the data storage node where each block is located; the stream processing management unit 302 sends the stream processing logic and the corresponding block numbers to the stream processing computing units of the corresponding data storage nodes; and each stream processing computing unit acquires the block data corresponding to the received block number from its data storage node and executes the stream processing logic on that block data.
Because the stream processing computing units are distributed over the data storage nodes, the stream processing management unit dispatches the stream processing task to the corresponding data storage nodes according to the path of the file to be processed, and the stream processing computing unit on each of those nodes reads the block data of the file to be processed directly from local storage and runs the stream processing logic on it. Since the file to be processed is read locally, the technical problem that a low network transmission speed between the stream processing system and the data storage nodes limits the stream processing speed is overcome.
For further clarity, please refer to fig. 3, fig. 3 is a data interaction diagram of a stream processing method according to an embodiment of the present invention, and as shown in fig. 3, the stream processing method includes the following steps:
step 401: the stream processing management unit 302 receives a stream processing task sent by the client 301, where the stream processing task includes stream processing logic and a path of a file to be processed in the distributed file system.
For example, the client 301 may be a client process in the Apache Flink system, and the stream processing management unit 302 may be a JobManager process in the Apache Flink system.
Step 402: the stream processing management unit 302 sends an inquiry request to the metadata management node 201, where the inquiry request carries a path of a file to be processed in the distributed file system.
For example, the query request includes an input parameter and a query instruction: the stream processing management unit 302 takes the path of the file to be processed in the distributed file system as the input parameter and sends the input parameter and the query instruction to the interface, provided by the metadata management unit 2011 of the metadata management node 201, for accessing the database 2012.
Step 403: the metadata management node 201 returns the block number of each block corresponding to the path and the network address of the data storage node corresponding to each block to the stream processing management unit 302 according to the query request.
As described above, the database 2012 of the metadata management node 201 records a first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and a second correspondence between the block number of each block and the network address of the data storage node where each block is located. The metadata management node 201 therefore obtains the block number of each block from the first correspondence according to the path of the file to be processed, obtains the network address of the data storage node where each block is located from the second correspondence according to each block number, and returns the block numbers and network addresses to the stream processing management unit 302.
Assume that the block numbers acquired by the stream processing management unit 302 are block number 1 and block number 2. (In practical applications there are usually many block numbers; only two are used here for brevity.) The stream processing management unit 302 obtains the network address A of the data storage node 101 according to block number 1 and the network address B of the data storage node 102 according to block number 2.
Step 404: the stream processing management unit 302 sends the stream processing logic and the block number 1 to the stream processing calculation unit 1011.
In this step, after obtaining the network address A of the data storage node 101 according to block number 1, the stream processing management unit 302 sends the stream processing logic and block number 1, which corresponds to network address A, to the stream processing computing unit 1011 of the data storage node 101.
Step 405: the stream processing management unit 302 transmits the stream processing logic and block number 2 to the stream processing calculation unit 1021.
In this step, after obtaining the network address B of the data storage node 102 according to block number 2, the stream processing management unit 302 sends the stream processing logic and block number 2, which corresponds to network address B, to the stream processing computing unit 1021 of the data storage node 102.
In steps 404 and 405, the stream processing computing unit 1011 may be, for example, one TaskManager process in the Apache Flink system, and the stream processing computing unit 1021 may be, for example, another TaskManager process in the Apache Flink system.
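A hedged sketch of steps 402 to 405 as seen from the stream processing management unit 302, reusing the hypothetical MetadataService and StreamComputeService interfaces assumed above; it only illustrates grouping block numbers by node address and dispatching them, and is not the embodiment's actual implementation.

// Sketch of the dispatch performed by the stream processing management unit.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class StreamProcessingManager {
    private final MetadataService metadata;                       // RPC stub to metadata management unit 2011
    private final Map<String, StreamComputeService> computeUnits; // node network address -> RPC stub

    public StreamProcessingManager(MetadataService metadata,
                                   Map<String, StreamComputeService> computeUnits) {
        this.metadata = metadata;
        this.computeUnits = computeUnits;
    }

    public void dispatch(StreamProcessingTask task) {
        // steps 402/403: query block numbers and node addresses for the file path
        Map<Long, String> blockToNode = metadata.locateBlocks(task.filePath());

        // group block numbers by the node that stores them
        Map<String, List<Long>> blocksPerNode = new HashMap<>();
        blockToNode.forEach((blockNumber, nodeAddress) ->
                blocksPerNode.computeIfAbsent(nodeAddress, a -> new ArrayList<>()).add(blockNumber));

        // steps 404/405: send the logic and the matching block numbers to each node's computing unit
        blocksPerNode.forEach((nodeAddress, blockNumbers) ->
                computeUnits.get(nodeAddress).runTask(task.logic(), blockNumbers));
    }
}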
Step 406: the stream processing calculation unit 1011 acquires the block data corresponding to the received block number 1 from the data storage node 101 where it is located, and executes stream processing logic for the block data corresponding to the received block number 1.
In this step, the stream-processing calculating unit 1011 acquires the block data corresponding to the block number 1 received from the stream-processing managing unit 302 from the database 1012 of the data storage node 101 where it is located, and executes the stream processing logic with respect to the block data corresponding to the block number 1.
In some examples, the data storage node 101 is further provided with a data management unit for accessing the database 1012 to manage the block data in the database 1012.
For example, the distributed file system may be Hadoop, its databases may be implemented by HBase, the metadata management unit 2011 may be an HMaster process of the HBase database, the stream processing computing unit may be provided as a program library, and the data management unit executes the function of the stream processing computing unit by loading the program library.
Further, the data management unit is, for example, an HRegionServer process of the HBase database, and a TaskManager process is embedded into the HRegionServer process: the TaskManager process may be provided as a library in jar-package or .so file format and exposes a start interface, and the HRegionServer process implements the function of the TaskManager process by loading the library and then invoking the start interface.
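Purely as an illustration, the library-loading step could be sketched as follows; the jar path, the entry-point class name TaskManagerStarter, and the start method are assumptions of this sketch, not HBase or Flink APIs.

// Sketch of a data management process loading the computing unit packaged as a jar
// and invoking its start interface via reflection.
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public final class EmbeddedComputeUnitLoader {
    public static AutoCloseable startComputeUnit(String jarPath) throws Exception {
        URLClassLoader loader = new URLClassLoader(
                new URL[] { new URL("file:" + jarPath) },
                EmbeddedComputeUnitLoader.class.getClassLoader());

        // hypothetical entry point exposed by the library
        Class<?> starter = Class.forName("com.example.stream.TaskManagerStarter", true, loader);
        Method start = starter.getMethod("start");
        start.invoke(starter.getDeclaredConstructor().newInstance());

        return loader; // caller closes the loader when the computing unit is stopped
    }
}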
In the embodiment of the present invention, the HRegionServer process that implements the function of the TaskManager process can read the block data of the database 1012 locally, so the acquisition of block data is not affected by external network performance; moreover, the HRegionServer process accesses the database 1012 within its own process, i.e., it reads the block data directly from memory, so the block data is acquired faster and the efficiency of stream processing can be effectively improved.
In other examples, the data management unit and the stream processing computing unit 1011 may run on the operating system 1013 as separate processes, with the stream processing computing unit 1011 accessing the database 1012 through an interface provided by the data management unit. In these examples, although the database 1012 is not accessed in-process by the HRegionServer process, the stream processing computing unit 1011 still accesses the database 1012 locally, so the impact of external network performance is still avoided.
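For illustration, the local read-and-execute behaviour of step 406 might be sketched as follows, reusing the hypothetical interfaces from the earlier sketches; BlockStore and its readBlock method are assumptions standing in for local access to database 1012, not a real HBase API.

// Sketch of a computing unit reading block data locally and applying the logic.
import java.util.List;

interface BlockStore {
    Iterable<byte[]> readBlock(long blockNumber);   // local read, no external network hop
}

final class StreamComputeUnit implements StreamComputeService {
    private final BlockStore localStore;

    StreamComputeUnit(BlockStore localStore) { this.localStore = localStore; }

    @Override
    public void runTask(StreamProcessingLogic logic, List<Long> blockNumbers) {
        for (long blockNumber : blockNumbers) {
            for (byte[] record : localStore.readBlock(blockNumber)) {
                logic.process(record);              // execute the stream processing logic per record
            }
        }
    }
}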
Step 407: the stream processing calculation unit 1021 acquires the block data corresponding to the received block number 2 from the data storage node 102 where it is located, and executes stream processing logic for the block data corresponding to the received block number 2.
Similar to the above steps, in some examples, the data storage node 102 is provided with a data management unit for accessing the database 1022 to manage the block data. The distributed file system may be Hadoop, its databases may be implemented by HBase, the metadata management unit 2011 may be an HMaster process of the HBase database, the stream processing computing unit 1021 may be provided as a program library, and the data management unit executes the function of the stream processing computing unit by loading the program library.
Further, the data management unit is, for example, an HRegionServer process of the HBase database, and a TaskManager process is embedded into the HRegionServer process: the TaskManager process may be provided as a library in jar-package or .so file format and exposes a start interface, and the HRegionServer process implements the function of the TaskManager process by loading the library and then invoking the start interface.
In the embodiment of the present invention, the HRegionServer process that implements the function of the TaskManager process can read the block data of the database 1022 locally, so the acquisition of block data is not affected by external network performance; moreover, the HRegionServer process accesses the database 1022 within its own process, so the block data is acquired faster and the efficiency of stream processing can be effectively improved.
In other examples, the data management unit and the stream processing computing unit 1021 may run on the operating system 1023 as separate processes, with the stream processing computing unit 1021 accessing the database 1022 through an interface provided by the data management unit. In this case, although the database 1022 is not accessed in-process by the HRegionServer process, the stream processing computing unit 1021 still accesses the database 1022 locally, so the impact of external network performance is still avoided.
Step 408: the stream processing calculation unit transmits a first processing result acquired by performing stream processing logic on the block data corresponding to the block number 1 to the stream processing management unit 302.
Step 409: the stream processing calculation unit transmits the second processing result acquired by performing the stream processing logic on the block data corresponding to the block number 2 to the stream processing management unit 302.
In summary, in the embodiments of the present invention, the stream processing computing units are distributed over the data storage nodes, the stream processing management unit dispatches the stream processing task to the corresponding data storage nodes according to the path of the file to be processed, and the stream processing computing unit on each of those nodes reads the block data of the file to be processed directly from local storage and runs the stream processing logic on it; because the file to be processed is read locally, the stream processing speed is no longer limited by the network transmission speed between the stream processing system and the data storage nodes.
In addition, the file to be processed is split into blocks of data, and the stream processing logic is executed in parallel on different stream processing computing units, which further increases the stream processing speed and improves processing efficiency.
It is noted that in alternative embodiments of the present invention, stream processing system 90 may also be implemented based on the Storm, Spark, or Samza architectures.
Referring to fig. 4, fig. 4 is a schematic diagram of an apparatus structure of a stream processing management unit according to an embodiment of the present invention. As shown in fig. 4, the stream processing management unit 302 includes:
the receiving module 601 is configured to receive a stream processing task sent by a client, where the stream processing task includes a stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system includes a metadata management node and a plurality of data storage nodes, and each data storage node is provided with a stream processing computing unit;
the query module 602 is configured to obtain, from the metadata management node, a block number of each block corresponding to the path and a network address of a data storage node where each block is located;
a sending module 603, configured to send the stream processing logic and the block number of each block to the stream processing computing unit of the data storage node where that block is located.
Optionally, the receiving module 601 is further configured to receive a processing result, obtained by executing the stream processing logic, sent by the stream processing computing unit.
Optionally, the metadata management node records a first corresponding relationship between a path of the file to be processed in the distributed file system and a block number of each block, and a second corresponding relationship between a block number of each block and a network address of the data storage node where each block is located, and the query module 602 is specifically configured to:
acquiring the block number of each block from the first corresponding relation according to the path of the file to be processed in the distributed file system;
and acquiring the network address of the data storage node where each block is located from the second corresponding relation according to the block number of each block.
Referring to fig. 5, fig. 5 is a schematic device structure diagram of a host according to an embodiment of the present invention, as shown in fig. 5, the host 50 includes a memory 502, a processor 501 and a bus 503, the memory 502 and the processor 501 are connected to the bus 503, the memory 502 stores program instructions, and the processor 501 executes the program instructions to implement the functions of the stream processing management unit 302 in the stream processing system.
Because the stream processing computing units are distributed over the data storage nodes, the stream processing management unit dispatches the stream processing task to the corresponding data storage nodes according to the path of the file to be processed, and the stream processing computing unit on each of those nodes reads the block data of the file to be processed directly from local storage and runs the stream processing logic on it. Since the file to be processed is read locally, the technical problem that a low network transmission speed between the stream processing system and the data storage nodes limits the stream processing speed is overcome.
In addition, the file to be processed is split into blocks of data, and the stream processing logic is executed in parallel on different stream processing computing units, which further increases the stream processing speed and improves processing efficiency.
It should be noted that the device embodiments described above are merely schematic. The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the drawings of the device embodiments provided by the present invention, the connection relationships between modules indicate that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines. A person of ordinary skill in the art can understand and implement the embodiments without creative effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus necessary general-purpose hardware, or by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can also be implemented by corresponding hardware, and the specific hardware structure used to implement the same function may vary, for example an analog circuit, a digital circuit, or a dedicated circuit; however, for the present invention, a software implementation is usually the preferred embodiment. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc of a computer, and which includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present invention.
It is clear to those skilled in the art that the specific working process of the above-described system, apparatus or unit may refer to the corresponding process in the foregoing method embodiments, and is not described herein again.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (12)

1. A stream processing method applied to a stream processing system including a stream processing management unit and a stream processing calculation unit, the method comprising:
the method comprises the steps that a stream processing management unit receives a stream processing task sent by a client, wherein the stream processing task comprises stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system comprises a metadata management node and a plurality of data storage nodes, and each data storage node is provided with a stream processing calculation unit;
the stream processing management unit acquires the block number of each block corresponding to the path of the file to be processed and the network address of the data storage node where each block is located from the metadata management node;
the stream processing management unit respectively sends the stream processing logic and the block number of each block to the stream processing calculation unit of the data storage node where each block is located;
and the stream processing calculation unit acquires, from the data storage node, the block data corresponding to the received block number, and executes the stream processing logic on the block data corresponding to the received block number.
2. The method according to claim 1, wherein the data storage node is provided with a data management unit, the stream processing calculation unit is provided as a library, and the data management unit executes the function of the stream processing calculation unit by loading the library.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
and the stream processing calculation unit sends the processing result obtained by executing the stream processing logic to the stream processing management unit.
4. The method according to any one of claims 1 to 3, wherein the metadata management node records a first correspondence between a path of the file to be processed in the distributed file system and a block number of each block, and the step in which the stream processing management unit obtains, from the metadata management node, the block number of each block corresponding to the path and the network address of the data storage node where each block is located specifically comprises:
and the stream processing management unit acquires the block number of each block from the first corresponding relation according to the path of the file to be processed in the distributed file system.
5. The method according to claim 4, wherein the metadata management node records a second correspondence between a block number of each block and a network address of the data storage node where each block is located, and the step in which the stream processing management unit acquires, from the metadata management node, the block number of each block corresponding to the path and the network address of the data storage node where each block is located specifically comprises:
and the stream processing management unit acquires the network address of the data storage node where each block is located from the second corresponding relation according to the block number of each block.
6. A stream processing system is characterized by comprising a stream processing management unit and a stream processing calculation unit,
the stream processing management unit is used for receiving a stream processing task sent by a client, wherein the stream processing task comprises stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system comprises a metadata management node and a plurality of data storage nodes, and each data storage node is provided with a stream processing calculation unit;
the stream processing management unit is further configured to obtain, from the metadata management node, a block number of each block corresponding to the path and a network address of a data storage node where each block is located;
the stream processing management unit is further configured to send the stream processing logic and the block number of each block to the stream processing calculation unit of the data storage node where each block is located;
and the stream processing calculation unit is configured to acquire, from the data storage node, the block data corresponding to the received block number, and to execute the stream processing logic on the block data corresponding to the received block number.
7. The system according to claim 6, wherein the data storage node is provided with a data management unit, the stream processing calculation unit is provided as a library, and the data management unit executes the function of the stream processing calculation unit by loading the library.
8. The system of claim 6,
the stream processing calculation unit is further configured to send a processing result obtained by executing the stream processing logic to the stream processing management unit.
9. The system according to claim 6, wherein the metadata management node records a first correspondence between a path of the file to be processed in the distributed file system and a block number of each block, and the stream processing management unit is specifically configured to:
and acquiring the block number of each block from the first corresponding relation according to the path of the file to be processed in the distributed file system.
10. The system according to claim 9, wherein the metadata management node records a second correspondence between the block number of each block and a network address of the data storage node where each block is located, and the stream processing management unit is specifically configured to:
and acquiring the network address of the data storage node where each block is located from the second corresponding relation according to the block number of each block.
11. A stream processing management unit, comprising:
the system comprises a receiving module, a processing module and a processing module, wherein the receiving module is used for receiving a stream processing task sent by a client, the stream processing task comprises stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system comprises a metadata management node and a plurality of data storage nodes, and each data storage node is provided with a stream processing computing unit;
the query module is used for acquiring the block number of each block corresponding to the path from the metadata management node and the network address of the data storage node where each block is located;
and the sending module is used for respectively sending the stream processing logic and the block number of each block to the stream processing computing unit of the data storage node where each block is located.
12. A host, comprising a memory, a processor, and a bus, wherein the memory and the processor are connected to the bus, the memory stores program instructions, and the processor executes the program instructions to cause the host to perform the steps of:
receiving a stream processing task sent by a client, wherein the stream processing task comprises stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system comprises a metadata management node and a plurality of data storage nodes, and each data storage node is provided with a stream processing computing unit;
acquiring the block number of each block corresponding to the path and the network address of the data storage node where each block is located from the metadata management node;
and respectively sending the stream processing logic and the block number of each block to the stream processing computing unit of the data storage node where each block is located.
CN201710233425.0A 2017-04-11 2017-04-11 Stream processing method and device Active CN108696559B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710233425.0A CN108696559B (en) 2017-04-11 2017-04-11 Stream processing method and device
PCT/CN2018/082641 WO2018188607A1 (en) 2017-04-11 2018-04-11 Stream processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710233425.0A CN108696559B (en) 2017-04-11 2017-04-11 Stream processing method and device

Publications (2)

Publication Number Publication Date
CN108696559A CN108696559A (en) 2018-10-23
CN108696559B (en) 2021-08-20

Family

ID=63792265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710233425.0A Active CN108696559B (en) 2017-04-11 2017-04-11 Stream processing method and device

Country Status (2)

Country Link
CN (1) CN108696559B (en)
WO (1) WO2018188607A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435938B (en) * 2019-01-14 2022-11-29 阿里巴巴集团控股有限公司 Data request processing method, device and equipment
CN110046131A (en) * 2019-01-23 2019-07-23 阿里巴巴集团控股有限公司 The Stream Processing method, apparatus and distributed file system HDFS of data
CN111290744B (en) * 2020-01-22 2023-07-21 北京百度网讯科技有限公司 Stream type computing job processing method, stream type computing system and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101089819A (en) * 2006-06-13 2007-12-19 国际商业机器公司 Method for dynamic stationary flow processing system and upstream processing node
CN101741885A (en) * 2008-11-19 2010-06-16 珠海市西山居软件有限公司 Distributed system and method for processing task flow thereof
CN102456185A (en) * 2010-10-29 2012-05-16 金蝶软件(中国)有限公司 Distributed workflow processing method and distributed workflow engine system
CN102467411A (en) * 2010-11-19 2012-05-23 金蝶软件(中国)有限公司 Workflow processing and workflow agent method, device and system
CN102542367A (en) * 2010-12-10 2012-07-04 金蝶软件(中国)有限公司 Cloud computing network workflow processing method, device and system based on domain model
CN104536814A (en) * 2015-01-16 2015-04-22 北京京东尚科信息技术有限公司 Method and system for processing workflow
CN106155791A (en) * 2016-06-30 2016-11-23 电子科技大学 A kind of workflow task dispatching method under distributed environment

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6415297B1 (en) * 1998-11-17 2002-07-02 International Business Machines Corporation Parallel database support for workflow management systems
US7849289B2 (en) * 2003-10-27 2010-12-07 Turbo Data Laboratories, Inc. Distributed memory type information processing system
US20090125553A1 (en) * 2007-11-14 2009-05-14 Microsoft Corporation Asynchronous processing and function shipping in ssis
US8150889B1 (en) * 2008-08-28 2012-04-03 Amazon Technologies, Inc. Parallel processing framework
US20110313934A1 (en) * 2010-06-21 2011-12-22 Craig Ronald Van Roy System and Method for Configuring Workflow Templates
US9361323B2 (en) * 2011-10-04 2016-06-07 International Business Machines Corporation Declarative specification of data integration workflows for execution on parallel processing platforms
CN103309867A (en) * 2012-03-09 2013-09-18 句容智恒安全设备有限公司 Web data mining system on basis of Hadoop platform
US9292815B2 (en) * 2012-03-23 2016-03-22 Commvault Systems, Inc. Automation of data storage activities
BR112016026524A2 (en) * 2014-05-13 2017-08-15 Cloud Crowding Corp DATA STORAGE FOR SECURE DISTRIBUTION AND TRANSMISSION OF STREAM MEDIA CONTENT.
CN104063486B (en) * 2014-07-03 2017-07-11 四川中亚联邦科技有限公司 A kind of big data distributed storage method and system
CN105608077A (en) * 2014-10-27 2016-05-25 青岛金讯网络工程有限公司 Big data distributed storage method and system
CN104657497A (en) * 2015-03-09 2015-05-27 国家电网公司 Mass electricity information concurrent computation system and method based on distributed computation
CN105468756A (en) * 2015-11-30 2016-04-06 浪潮集团有限公司 Design and implementation method of mass data processing system
CN106339415B (en) * 2016-08-12 2019-08-23 北京奇虎科技有限公司 Querying method, the apparatus and system of data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101089819A (en) * 2006-06-13 2007-12-19 国际商业机器公司 Method for dynamic stationary flow processing system and upstream processing node
CN101741885A (en) * 2008-11-19 2010-06-16 珠海市西山居软件有限公司 Distributed system and method for processing task flow thereof
CN102456185A (en) * 2010-10-29 2012-05-16 金蝶软件(中国)有限公司 Distributed workflow processing method and distributed workflow engine system
CN102467411A (en) * 2010-11-19 2012-05-23 金蝶软件(中国)有限公司 Workflow processing and workflow agent method, device and system
CN102542367A (en) * 2010-12-10 2012-07-04 金蝶软件(中国)有限公司 Cloud computing network workflow processing method, device and system based on domain model
CN104536814A (en) * 2015-01-16 2015-04-22 北京京东尚科信息技术有限公司 Method and system for processing workflow
CN106155791A (en) * 2016-06-30 2016-11-23 电子科技大学 A kind of workflow task dispatching method under distributed environment

Also Published As

Publication number Publication date
WO2018188607A1 (en) 2018-10-18
CN108696559A (en) 2018-10-23

Similar Documents

Publication Publication Date Title
CN109643312B (en) Hosted query service
US11836533B2 (en) Automated reconfiguration of real time data stream processing
CN109997126B (en) Event driven extraction, transformation, and loading (ETL) processing
JP6732798B2 (en) Automatic scaling of resource instance groups in a compute cluster
US20190138639A1 (en) Generating a subquery for a distinct data intake and query system
US8914469B2 (en) Negotiating agreements within a cloud computing environment
US9043445B2 (en) Linking instances within a cloud computing environment
US10970303B1 (en) Selecting resources hosted in different networks to perform queries according to available capacity
US20200169534A1 (en) Enabling access across private networks for a managed blockchain service
CN111258978B (en) Data storage method
CA2829915C (en) Method and system for dynamically tagging metrics data
CN108573029B (en) Method, device and storage medium for acquiring network access relation data
US20160006600A1 (en) Obtaining software asset insight by analyzing collected metrics using analytic services
CN108696559B (en) Stream processing method and device
CN110096521A (en) Log information processing method and device
US10951540B1 (en) Capture and execution of provider network tasks
US20150163111A1 (en) Managing resources in a distributed computing environment
US10944814B1 (en) Independent resource scheduling for distributed data processing programs
US20220044144A1 (en) Real time model cascades and derived feature hierarchy
CN115617480A (en) Task scheduling method, device and system and storage medium
WO2018200167A1 (en) Managing asynchronous analytics operation based on communication exchange
US11288291B2 (en) Method and system for relation discovery from operation data
US20210286819A1 (en) Method and System for Operation Objects Discovery from Operation Data
WO2018217406A1 (en) Providing instant preview of cloud based file
CN103176847A (en) Virtual machine distribution method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220216

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Patentee after: Huawei Cloud Computing Technologies Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.
