CN115051981A

CN115051981A - Zookeeper-based asynchronous downloading method and device

Info

Publication number: CN115051981A
Application number: CN202210515994.5A
Authority: CN
Inventors: 许吉来; 罗晓峰
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2022-05-12
Filing date: 2022-05-12
Publication date: 2022-09-13

Abstract

Embodiments of the application provide a Zookeeper-based asynchronous downloading method and device for reducing the time consumed by data downloading. The method comprises: creating m dynamic queues in Zookeeper according to m Hadoop data sources, where m is the number of components supporting a query function in the Hadoop cluster and is a positive integer greater than or equal to 1, each dynamic queue corresponds to one data source, and the data downloading tasks of the m dynamic queues are controlled in parallel; monitoring, by a monitoring program, the changes of the m dynamic queues in real time and, when a dynamic queue changes, calling the query component corresponding to that queue to download data from the corresponding data source; and obtaining the state of the data downloading task and the downloaded data file.

Description

Zookeeper-based asynchronous downloading method and device

Technical Field

The invention relates to the technical field of electronics, in particular to a Zookeeper-based asynchronous downloading method and device.

Background

With the rapid development of information technology, big data technology is widely applied across industries, and Hadoop provides big data solutions by virtue of its low software and hardware cost and strong parallel computing capability. Because Hadoop stores large volumes of data, a user querying data on a front-end page has to query in pages, and downloading a large volume of data from the front-end page takes a long time, which is very inconvenient for the user.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide an asynchronous downloading method based on Zookeeper, so as to overcome or at least partially solve the above problems, and the specific scheme is as follows:

in a first aspect, an embodiment of the present invention discloses an asynchronous downloading method based on Zookeeper, where the method includes:

creating m dynamic queues in Zookeeper according to m Hadoop data sources, wherein m is the number of components supporting a query function in the Hadoop cluster, m is a positive integer greater than or equal to 1, and the data downloading tasks of the m dynamic queues are controlled in parallel; each dynamic queue corresponds to one data source;

monitoring, by a monitoring program, the changes of the m dynamic queues in real time, and calling, according to those changes, the query component corresponding to each changed dynamic queue to download data from the corresponding data source;

and acquiring the state of the data downloading task and the downloaded data file.

Optionally, the creating m dynamic queues in the Zookeeper includes:

acquiring the m dynamic queues, wherein each dynamic queue comprises a primary node; the primary node corresponds to a query component;

when a downloading request is received, determining a query component corresponding to the downloading request;

determining a primary node corresponding to the downloading request according to the query component corresponding to the downloading request;

under the primary node corresponding to the downloading request, sequentially creating secondary nodes according to the sequence of receiving the downloading request; the secondary nodes correspond to the download requests one to one.

Optionally, sequentially creating, under the primary node corresponding to the download request, secondary nodes according to the order of receiving the download request further includes:

when a plurality of secondary nodes exist under any one primary node, the serial numbers of the plurality of secondary nodes are sequentially increased according to the sequence of creating the secondary nodes.

Optionally, the monitoring program monitors the change conditions of the m dynamic queues in real time, and invokes a query component corresponding to the dynamic queues to download data according to the change conditions of the m dynamic queues, including:

setting m monitoring programs, wherein each monitoring program respectively monitors the change condition of a secondary node in a dynamic queue in real time;

when the secondary node changes, acquiring information of all the secondary nodes in the dynamic queue where the changed secondary node is located according to a notification sent by a Zookeeper;

acquiring the secondary node with the minimum sequence number; according to the download request, encapsulating the data information stored in that secondary node into a data-volume query statement, and querying through the query component corresponding to the primary node; and deleting the secondary node with the minimum sequence number.

Optionally, the monitoring program monitors the change conditions of the m dynamic queues in real time, and invokes query components corresponding to the dynamic queues to download data according to the change conditions of the m dynamic queues, and the method further includes:

and when the concurrency of the query component exceeds a preset threshold value, setting a monitoring program to be in a waiting state until the concurrency of the query component is smaller than the preset threshold value, and continuing monitoring the change conditions of the m dynamic queues.

In a second aspect, the present invention discloses an asynchronous downloading device based on Zookeeper, which includes:

the creating unit is used for creating m dynamic queues in the Zookeeper according to the m Hadoop data sources, wherein m is the number of components supporting the query function in a Hadoop cluster, and m is a positive integer greater than or equal to 1, and the data downloading tasks of the m dynamic queues are controlled in parallel; each dynamic queue corresponds to a data source;

the monitoring unit is used for monitoring the change conditions of the m dynamic queues in real time through a monitoring program, and calling the query assemblies corresponding to the changed dynamic queues to download data from corresponding data sources according to the change conditions of the m dynamic queues;

and the acquisition unit is used for acquiring the data downloading task state and the downloaded data file.

Optionally, the creating unit is specifically configured to:

Optionally, the monitoring unit is specifically configured to:

Optionally, the monitoring unit is further configured to:

and when the concurrency of the query component exceeds a preset threshold, setting a monitoring program to be in a waiting state until the concurrency of the query component is smaller than the preset threshold, and continuing monitoring the change conditions of the m dynamic queues.

Compared with the prior art, the invention has the following beneficial effects:

In the method and device, a plurality of dynamic queues are created in Zookeeper for the different data sources, the changes of each dynamic queue are monitored, and the corresponding query component is called to download data from the corresponding data source when a queue changes. Data downloading for each Hadoop component is thereby placed under queue control, downloads through different components do not interfere with one another, asynchronous downloading from multiple Hadoop data sources is parallelized and automated, and the range of uses of Zookeeper is expanded.

Drawings

FIG. 1 is a diagram of an association relationship between roles of Zookeeper;

FIG. 2 is a schematic flowchart of a Zookeeper-based asynchronous downloading method in an embodiment of the present application;

FIG. 3 is a Zookeeper node directory structure diagram;

FIG. 4 is a schematic diagram of a Zookeeper-based asynchronous downloading device in an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As the background shows, in the prior art downloading a large amount of data from multiple Hadoop components takes a long time. The embodiments of the present application provide a solution for this situation; specific implementations are described in detail below.

FIG. 1 shows the association relationship among the roles of Zookeeper. Zookeeper is used in Hadoop big data technology and holds an irreplaceable position in distributed systems. Acting as the coordinator of a distributed system, Zookeeper allows the distributed system to be used like a single machine while being more reliable than a single machine. Zookeeper is a cluster of multiple servers comprising one leader and multiple followers. Each server stores a copy of the data, the data is globally consistent, reads and writes are distributed, and followers forward data update requests to the leader, which applies them uniformly.

Referring to fig. 2, fig. 2 is a schematic flowchart of an asynchronous downloading method based on Zookeeper according to an embodiment of the present application, where the method specifically includes:

S201: m dynamic queues are created in Zookeeper according to the m Hadoop data sources, wherein m is the number of components supporting a query function in the Hadoop cluster, m is a positive integer greater than or equal to 1, and the data downloading tasks of the m dynamic queues are controlled in parallel; each dynamic queue corresponds to one data source.

Zookeeper nodes form a tree structure with one root node and multiple branch nodes. The method makes combined use of the real-time, sequential, atomic and consistent properties of Zookeeper node updates, and manages the Zookeeper nodes dynamically through the Java client API.

The number of components supporting the query function in the Hadoop cluster is determined from the local configuration of the Hadoop cluster; such components may be Impala, HBase, Phoenix, Kylin and the like. m dynamic queues are then created, where m is that number. For example, when the determined query components are Impala and HBase, m equals 2; when they are Impala, HBase and Phoenix, m equals 3. In other words, as many dynamic queues are created as there are types of query component, and each dynamic queue corresponds to one data source. Download tasks are processed through these dynamic queues, which realizes parallel control of the download tasks.
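As a rough illustration of how m follows from the configured query components, the sketch below derives one queue root per component. The configuration layout (`query_components`) and the function name `plan_dynamic_queues` are assumptions made for this example and are not taken from the application; only the component names come from the description.

```python
# Hypothetical configuration: the component names come from the description,
# but this dictionary layout is an assumption made for the sketch.
cluster_config = {
    "query_components": ["Impala", "HBase", "Phoenix"],
}


def plan_dynamic_queues(config):
    """Return one queue root path per configured query component."""
    components = config["query_components"]
    if not components:
        # The description requires m to be a positive integer of at least 1.
        raise ValueError("at least one query component is required")
    return {name: "/" + name for name in components}


queues = plan_dynamic_queues(cluster_config)
m = len(queues)
print(m, queues)
```

With the three components above, m is 3 and each component maps to its own primary-node path.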

In an embodiment, step S201 specifically includes:

A1, acquiring the m dynamic queues, wherein each dynamic queue comprises one primary node; each primary node corresponds to one query component.

A2, when a download request is received, determining the query component corresponding to the download request.

The received download request includes information such as a task number, a user number, a data table name and query conditions, and the query component corresponding to the download request is determined according to the query conditions.

A3, determining a primary node corresponding to the download request according to the query component corresponding to the download request.

A4, under the primary node corresponding to the download request, sequentially creating secondary nodes according to the order of receiving the download request; the secondary nodes correspond to the download requests one to one.

When multiple download requests that need the same query component are received, multiple secondary nodes are created under the primary node corresponding to that query component, and the sequence numbers of the secondary nodes under that primary node increase in the order in which the secondary nodes are created.

The primary nodes of the m dynamic queues each receive the download requests for a different query component. Once the query component required by a download request has been determined, a secondary node is created under the primary node corresponding to that query component. If multiple secondary nodes exist under a primary node, their sequence numbers increase in the order in which they were created, and the sequence numbers of the secondary nodes under one primary node are unrelated to those under other primary nodes.

For example, four primary nodes (Znodes), "/Impala", "/HBase", "/Phoenix" and "/Kylin", are created to receive data download requests for Impala, HBase, Phoenix and Kylin respectively. After a user submits a data download request, a secondary node is created under the corresponding primary node. The node name has the prefix "queue-" and the node type is PERSISTENT_SEQUENTIAL (PERSISTENT means the node is stored persistently until it is explicitly deleted; SEQUENTIAL means it is assigned an increasing unique sequence number). The sequence number is formatted as 10 digits, for example 0000000001. Referring to FIG. 3, under the primary node "/Impala" the secondary nodes are named "/Impala/queue-0000000001", "/Impala/queue-0000000002" and "/Impala/queue-0000000003"; under the primary node "/Kylin" the secondary nodes are named "/Kylin/queue-0000000001", "/Kylin/queue-0000000002", and so on. Download requests with different query conditions are thus distributed to the corresponding dynamic queues, and the corresponding secondary nodes are created in the order in which the download requests are received.
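The naming scheme of steps A1 to A4 can be sketched with a small in-memory model. This is not a Zookeeper client: the class `InMemoryQueueTree`, its method names and the request fields are hypothetical, and a real Zookeeper server would assign the sequence suffix itself. The counter starts at 1 only to match the example names given in the description.

```python
class InMemoryQueueTree:
    """In-memory stand-in for the node tree (not a real Zookeeper client)."""

    def __init__(self, components):
        # One primary node per query component, e.g. "/Impala".
        self.children = {"/" + name: {} for name in components}
        # Each primary node has its own counter, so sequence numbers under
        # different primary nodes are unrelated to one another.
        self.next_seq = {"/" + name: 1 for name in components}

    def enqueue(self, component, request):
        """Create the next secondary node under the component's primary node."""
        primary = "/" + component
        if primary not in self.children:
            raise KeyError("no dynamic queue for component: " + component)
        seq = self.next_seq[primary]
        self.next_seq[primary] = seq + 1
        # Prefix "queue-" followed by a zero-padded 10-digit sequence number.
        path = "%s/queue-%010d" % (primary, seq)
        self.children[primary][path] = request
        return path


tree = InMemoryQueueTree(["Impala", "HBase", "Phoenix", "Kylin"])
first = tree.enqueue("Impala", {"task_no": "T1", "table": "orders"})
second = tree.enqueue("Impala", {"task_no": "T2", "table": "users"})
kylin = tree.enqueue("Kylin", {"task_no": "T3", "table": "sales"})
print(first, second, kylin)
```

Two requests routed to Impala receive consecutive numbers, while the first Kylin request starts its own numbering under "/Kylin".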

S202: the monitoring program monitors the changes of each dynamic queue in real time and, according to those changes, calls the query component corresponding to the changed dynamic queue to download data from the corresponding data source.

Each dynamic queue is monitored in real time for changes; when a dynamic queue changes, the changed queue is identified and the query component corresponding to it is selected to download the data.

In an embodiment, step S202 specifically includes:

B1, setting m monitoring programs, wherein each monitoring program monitors, in real time, the changes of the secondary nodes in one dynamic queue. The m monitoring programs correspond one-to-one to the m dynamic queues.

B2, when a secondary node changes, acquiring the information of all secondary nodes in the dynamic queue where the changed secondary node is located, according to the notification sent by Zookeeper.

The m monitoring programs monitor the changes of the secondary nodes in the dynamic queues. If a secondary node in a dynamic queue changes, the information of all secondary nodes in that dynamic queue is obtained through the getChildren() method according to the notification sent by Zookeeper.

B3, acquiring the secondary node with the minimum sequence number; according to the downloading request, packaging the data information stored in the secondary node with the minimum serial number into a data quantity query statement, and querying through a query component corresponding to the primary node; and deleting the secondary node with the minimum sequence number.

The secondary node with the minimum sequence number is selected. Because the data information of the download request is stored in the secondary node, the data table name and query conditions stored in that node are obtained through the getData method and encapsulated into a data-volume query statement; the Hadoop query component corresponding to the primary node of the dynamic queue is selected to run the query, and the secondary node is deleted from the directory structure. For example, when the client monitoring program Impala watch finds that the secondary nodes under the "/Impala" node have changed, it reads the information in the secondary node with the minimum sequence number, "/Impala/queue-XXXXXXXXXX" (XXXXXXXXXX is initially 0000000001), queries the Impala component, and then deletes the secondary node "/Impala/queue-0000000001" from the directory structure. The download requests corresponding to the secondary nodes are thus processed in the order in which they were submitted.
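Steps B2 and B3 amount to a first-in, first-out pop driven by change notifications. The sketch below models one dynamic queue as a plain dictionary rather than a live Zookeeper connection. The function names, the request fields and the `SELECT COUNT(*)` form of the data-volume statement are assumptions for illustration; the description only states that the table name and query conditions are encapsulated into such a statement.

```python
def build_count_query(table, condition):
    # Hypothetical statement shape. A production version should not build
    # SQL by string formatting from untrusted input.
    return "SELECT COUNT(*) FROM %s WHERE %s" % (table, condition)


def handle_queue_change(queue, run_query):
    """Process the secondary node with the smallest sequence number.

    queue maps secondary-node names (e.g. "queue-0000000002") to request data.
    run_query is a caller-supplied callable standing in for the query component.
    Returns (node_name, query_result), or None when the queue is empty.
    """
    if not queue:
        return None
    # Zero-padded 10-digit suffixes sort correctly as plain strings.
    head = min(queue)
    request = queue[head]  # plays the role of getData on the node
    statement = build_count_query(request["table"], request["condition"])
    result = run_query(statement)
    del queue[head]  # plays the role of deleting the secondary node
    return head, result


executed = []


def fake_impala(statement):
    """Stand-in query component that records the statement it receives."""
    executed.append(statement)
    return 42  # pretend row count


impala_queue = {
    "queue-0000000002": {"table": "orders", "condition": "amount > 100"},
    "queue-0000000001": {"table": "users", "condition": "age >= 18"},
}
outcome = handle_queue_change(impala_queue, fake_impala)
print(outcome, executed)
```

Calling `handle_queue_change` once per notification drains the queue in submission order, because the smallest remaining sequence number is always taken first.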

A task whose queried data volume exceeds 1048576 rows (the upper limit of Excel) has its download request rejected. When the queried data volume is less than 1048576 rows, the data table name and query conditions are encapsulated into a query statement and the query component is queried again for the data. Finally, the query result is written to an Excel file, which is compressed and stored in a directory on the WAS server, named after the data download task number with a ".zip" extension, for the user to download.
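The row-limit check and the packaging step might look like the following sketch. Several things here are assumptions: CSV is used as a stand-in for the Excel file because writing .xlsx needs a third-party library, the function names are hypothetical, and a result of exactly 1048576 rows is treated as acceptable, a boundary the description does not spell out.

```python
import csv
import io
import zipfile

EXCEL_MAX_ROWS = 1048576  # row limit cited in the description


def should_reject(row_count, limit=EXCEL_MAX_ROWS):
    """Reject tasks whose result would exceed the row limit."""
    return row_count > limit


def package_result(task_no, header, rows):
    """Return (archive_name, zip_bytes) for a query result.

    The archive is named after the task number; the inner file is CSV
    here, which is an assumption standing in for the Excel file.
    """
    text = io.StringIO()
    writer = csv.writer(text)
    writer.writerow(header)
    writer.writerows(rows)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(task_no + ".csv", text.getvalue())
    return task_no + ".zip", buffer.getvalue()


name, payload = package_result(
    "T20220512001", ["id", "amount"], [[1, 250], [2, 980]]
)
print(name, len(payload), should_reject(2000000))
```

Running the count query first keeps oversized requests from ever reaching the second, more expensive data query.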

A threshold is set for the concurrency of the query component. When the concurrency of the query component exceeds the preset threshold, the monitoring program is put into a waiting state until the concurrency falls below the preset threshold, at which point the monitoring program is woken up and continues monitoring the changes of the secondary nodes in the dynamic queue. Different preset thresholds may be set for different query components, the same preset threshold may be set for all query components, or a single overall preset threshold may be set across all query components. This keeps the query components operating normally and avoids conditions such as excessive CPU and memory utilization.
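One way to picture the waiting state is a small gate built on a condition variable. This is only a sketch: the class name `ConcurrencyGate` is hypothetical, and the gate blocks as soon as the number of running queries reaches the threshold, which is one possible reading of "exceeds the preset threshold".

```python
import threading
import time


class ConcurrencyGate:
    """Blocks callers while the number of running queries is at the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.running = 0
        self.cond = threading.Condition()

    def acquire(self):
        with self.cond:
            # Waiting state: sleep until a running query finishes.
            while self.running >= self.threshold:
                self.cond.wait()
            self.running += 1

    def release(self):
        with self.cond:
            self.running -= 1
            self.cond.notify()  # wake one waiting caller


gate = ConcurrencyGate(threshold=2)
stats = {"active": 0, "peak": 0}
stats_lock = threading.Lock()


def run_query(task_no):
    """Simulated query guarded by the gate; records peak concurrency."""
    gate.acquire()
    try:
        with stats_lock:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
        time.sleep(0.01)  # pretend to query the component
        with stats_lock:
            stats["active"] -= 1
    finally:
        gate.release()


threads = [threading.Thread(target=run_query, args=(i,)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(stats["peak"])
```

Per-component thresholds would use one gate per query component, while a single overall threshold would share one gate across all of them.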

S203: the state of the data downloading task and the downloaded data file are acquired.

When a user asks to track a download request that the user submitted, the current execution state of the download task is displayed if the task has not finished; if the download request has finished, the user can download the execution result.
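The tracking behaviour can be sketched as a simple lookup. The state names in `TaskState` and the task record layout are hypothetical; the description only distinguishes unfinished tasks, whose state is displayed, from finished tasks, whose result can be downloaded.

```python
from enum import Enum


class TaskState(Enum):
    # Hypothetical state names; the description does not enumerate them.
    QUEUED = "queued"
    RUNNING = "running"
    FINISHED = "finished"


def track(task):
    """Return what a user sees when tracking a download request."""
    if task["state"] is TaskState.FINISHED:
        # Finished: expose the result file for download.
        return {"state": task["state"].value, "download": task["file"]}
    # Not finished: show the current execution state only.
    return {"state": task["state"].value, "download": None}


running = track({"state": TaskState.RUNNING, "file": None})
done = track({"state": TaskState.FINISHED, "file": "T20220512001.zip"})
print(running, done)
```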

A statistics and summary function is provided so that a system administrator can conveniently analyze the download time of each task and tune the preset thresholds of the query components.

The method comprises the steps of establishing a plurality of dynamic queues aiming at different data source data in the Zookeeper, monitoring the change condition of each dynamic queue, calling the corresponding query component to download data from the corresponding data source according to the change condition, and thus realizing queue control on data downloading of all components of Hadoop.

The embodiment of the invention provides an asynchronous downloading device based on Zookeeper, which comprises the following units:

A creating unit 401, configured to create m dynamic queues in Zookeeper according to the m Hadoop data sources, where m is the number of components supporting a query function in the Hadoop cluster, m is a positive integer greater than or equal to 1, and the data downloading tasks of the m dynamic queues are controlled in parallel; each dynamic queue corresponds to one data source.

A monitoring unit 402, configured to monitor the changes of the m dynamic queues in real time through a monitoring program, and to call, according to those changes, the query component corresponding to each changed dynamic queue to download data from the corresponding data source.

An obtaining unit 403, configured to obtain a data downloading task state and a downloaded data file.

A creating unit 401, configured to obtain the m dynamic queues, where each dynamic queue includes one primary node; the primary node corresponds to a query component;

A monitoring unit 402, configured to set m monitoring programs, where each monitoring program monitors a change condition of a secondary node in a dynamic queue in real time;

The method comprises the steps of establishing a plurality of dynamic queues aiming at different data source data in the Zookeeper, monitoring the change condition of each dynamic queue, calling corresponding query components to download data from corresponding data sources according to the change condition, and accordingly realizing queue control of data downloading of all components of Hadoop.

It is to be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A Zookeeper-based asynchronous downloading method is characterized by comprising the following steps:

2. The method of claim 1, wherein creating m dynamic queues in a Zookeeper comprises:

3. The method according to claim 2, wherein said creating, under the primary node corresponding to the download request, secondary nodes in order according to the order in which the download request is received further comprises:

4. The method according to claim 2, wherein the monitoring program monitors the change of the m dynamic queues in real time, and invokes the query component corresponding to the dynamic queue to download data according to the change of the m dynamic queues, including:

5. The method according to claim 4, wherein the monitor monitors the change of the m dynamic queues in real time, and invokes the query component corresponding to the dynamic queue to download data according to the change of the m dynamic queues, further comprising:

6. An asynchronous Zookeeper-based downloading device, the device comprising:

7. The apparatus according to claim 6, wherein the creating unit is specifically configured to:

8. The apparatus according to claim 7, wherein the creating unit is specifically configured to:

9. The apparatus of claim 7, wherein the listening unit is specifically configured to:

when the secondary node changes, acquiring information of all secondary nodes in the dynamic queue where the changed secondary node is located according to a notification sent by the Zookeeper;

10. The apparatus of claim 9, wherein the listening unit is further configured to: