CN112527880B - Method, device, equipment and medium for collecting metadata information of big data cluster - Google Patents


Info

Publication number
CN112527880B
CN112527880B (Application No. CN202011483745.XA)
Authority
CN
China
Prior art keywords
execution plan
metadata information
big data
data cluster
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011483745.XA
Other languages
Chinese (zh)
Other versions
CN112527880A (en)
Inventor
陆魏
胡凭智
Current Assignee
Ping An E Wallet Electronic Commerce Co Ltd
Original Assignee
Ping An E Wallet Electronic Commerce Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An E Wallet Electronic Commerce Co Ltd filed Critical Ping An E Wallet Electronic Commerce Co Ltd
Priority to CN202011483745.XA priority Critical patent/CN112527880B/en
Publication of CN112527880A publication Critical patent/CN112527880A/en
Application granted granted Critical
Publication of CN112527880B publication Critical patent/CN112527880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of data acquisition and discloses a method, a device, equipment, and a medium for collecting metadata information of a big data cluster. The method comprises the following steps: receiving a task submitted by a user to the big data cluster, and parsing the task to obtain the execution plan corresponding to the task; performing the computing operation on the nodes of the big data cluster through the execution plan, and, when the computing operation is monitored to be complete, receiving the execution plan returned by the interface corresponding to the big data cluster; parsing the execution plan, acquiring the metadata information corresponding to the execution plan, and storing the metadata information in a relational database; and importing the metadata information stored in the relational database into the big data warehouse by means of Sqoop data import. The present application also relates to blockchain technology, in which the metadata information may be stored. By parsing the execution plan, the method achieves complete collection of the metadata information and improves the efficiency with which metadata information is collected.

Description

Method, device, equipment and medium for collecting metadata information of big data cluster
Technical Field
The present disclosure relates to the field of data acquisition technologies, and in particular, to a method, an apparatus, a device, and a medium for acquiring metadata information of a big data cluster.
Background
Metadata information is an important concept in the field of big data: it reflects the real data stored in the current big data cluster. For example, metadata information generally includes the corresponding real data's storage location, data size, data storage mode, and the like, and is the basic unit by which a big data cluster manages its stored data. However, with the advent of the big data age, the volume of user data has grown explosively; the growth in data volume leads to excessive redundancy of cluster data, which poses a great challenge to the storage of big data clusters. At the same time, the data must be managed through corresponding metadata information, which in turn increases the maintenance overhead of cluster metadata and degrades cluster performance, so the metadata information used in the cluster needs to be collected.
In the existing metadata collection method, metadata information is analyzed and collected at the stage when a user submits a task to the big data cluster. However, because tasks are submitted to the cluster by users through many channels and the data volume is large, collecting metadata in this way easily leads to inaccurate or incomplete collection and to missed metadata, so collection efficiency is low, which is not conducive to managing the big data cluster. A method that can improve the efficiency of collecting metadata information for big data clusters is therefore needed.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device, equipment and a medium for acquiring metadata information of a big data cluster so as to improve the acquisition efficiency of the metadata information of the big data cluster.
In order to solve the above technical problems, an embodiment of the present application provides a method for collecting metadata information of a big data cluster, including:
receiving a task submitted to a big data cluster by a user, and analyzing the task to obtain an execution plan corresponding to the task;
performing calculation operation on the nodes of the big data cluster through the execution plan, and monitoring the calculation operation;
when the completion of the execution of the computing operation is monitored, receiving the execution plan returned by the interface corresponding to the big data cluster;
analyzing the execution plan, obtaining metadata information corresponding to the execution plan, and storing the metadata information in a relational database;
and importing the metadata information stored in the relational database into the big data warehouse by means of Sqoop data import.
In order to solve the above technical problem, an embodiment of the present application provides a device for collecting metadata information of a big data cluster, including:
The execution plan generation module is used for receiving tasks submitted to the big data cluster by a user, analyzing the tasks and obtaining an execution plan corresponding to the tasks;
the execution plan execution module is used for carrying out calculation operation on the nodes of the big data cluster through the execution plan and monitoring the calculation operation;
the execution plan receiving module is used for receiving the execution plan returned by the interface corresponding to the big data cluster when the completion of the execution of the computing operation is monitored;
the execution plan analysis module is used for analyzing the execution plan, acquiring metadata information corresponding to the execution plan and storing the metadata information in a relational database;
and the metadata information importing module is used for importing the metadata information stored in the relational database into the big data warehouse by means of Sqoop data import.
In order to solve the technical problems, the invention adopts a technical scheme that: a computer device is provided comprising one or more processors; and the memory is used for storing one or more programs, so that the one or more processors can realize the method for acquiring the metadata information of the big data cluster.
In order to solve the technical problems, the invention adopts a technical scheme that: a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for collecting metadata information of a big data cluster according to any one of the above.
The embodiment of the invention provides a method, a device, equipment, and a medium for collecting metadata information of a big data cluster. The method comprises the following steps: receiving a task submitted by a user to the big data cluster, and parsing the task to obtain the execution plan corresponding to the task; performing the computing operation on the nodes of the big data cluster through the execution plan, and monitoring the computing operation; when the computing operation is monitored to be complete, receiving the execution plan returned by the interface corresponding to the big data cluster; parsing the execution plan, acquiring the metadata information corresponding to the execution plan, and storing the metadata information in a relational database; and importing the metadata information stored in the relational database into the big data warehouse by means of Sqoop data import. By parsing the execution plan, the embodiment of the invention completely collects the metadata information of tasks submitted by users through various channels and improves metadata collection efficiency.
Drawings
For a clearer description of the solutions in the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
Fig. 1 is an application environment schematic diagram of a method for collecting metadata information of a big data cluster according to an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of a method for collecting metadata information of a big data cluster according to an embodiment of the present application;
FIG. 3 is a flowchart of an implementation of a sub-process in the method for collecting metadata information of a big data cluster according to the embodiment of the present application;
FIG. 4 is a flowchart of another implementation of a sub-process in the method for collecting metadata information of a big data cluster according to the embodiment of the present application;
FIG. 5 is a flowchart of another implementation of a sub-process in the method for collecting metadata information of a big data cluster according to the embodiment of the present application;
FIG. 6 is a flowchart of another implementation of a sub-process in the method for collecting metadata information of a big data cluster according to the embodiment of the present application;
FIG. 7 is a flowchart of another implementation of a sub-process in the method for collecting metadata information of a big data cluster according to the embodiment of the present application;
FIG. 8 is a flowchart of another implementation of a sub-process in the method for collecting metadata information of a big data cluster according to the embodiment of the present application;
fig. 9 is a schematic diagram of a device for collecting metadata information of a big data cluster according to an embodiment of the present application;
fig. 10 is a schematic diagram of a computer device provided in an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application. The terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the above figures are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the description and claims or in the above figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to better understand the technical solutions of the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings.
The present invention will be described in detail with reference to the drawings and embodiments.
Referring to fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a search class application, an instant messaging tool, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the method for collecting metadata information of a big data cluster provided in the embodiments of the present application is generally executed by a server, and accordingly, the device for collecting metadata information of a big data cluster is generally configured in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 illustrates an embodiment of a method for collecting metadata information of a big data cluster.
It should be noted that, provided substantially the same result is obtained, the method of the present invention is not limited to the flow sequence shown in fig. 2. The method includes the following steps:
s1: and receiving the task submitted to the big data cluster by the user, and analyzing the task to obtain an execution plan corresponding to the task.
Specifically, different users send task instructions to the big data cluster through different channels. When the server receives a task, it parses the task, acquires the task content, and obtains the execution plan corresponding to the task; through the execution plan, the task's corresponding operations can be performed on the big data cluster framework.
A big data cluster refers to a data analysis and mining platform that provides processing capacity for industry big data; it generally adopts big data processing technologies and patterns to construct an intermediate big data statistics, analysis, and mining platform loosely coupled with specific business. Spark is a fast, general-purpose computing engine designed for large-scale data processing. The big data platform in the embodiment of the present invention uses Spark as its basic framework to manage big data cluster tasks.
A task is a specific method of processing big data performed according to a user request, and includes uploading a file, transferring a file, computing statistics over specific data, and the like. A task is performed in response to a start-task request sent by a user, and includes dependency data tables, execution-result data tables, logic code, and so on. An execution plan is a detailed scheme generated for the corresponding task: it describes what transformation and computing operations the task is about to perform, and includes the input data tables and output data table information in the big data cluster that must be accessed or generated to complete the task. An execution plan comprises a logical execution plan and a physical execution plan.
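The execution plan described above can be sketched, purely as an illustration (the patent gives no code, and all names here are hypothetical), as a minimal structure recording the input and output tables a task touches:

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionPlan:
    """Minimal model of an execution plan: the tables a task reads and writes."""
    task_id: str
    input_tables: list = field(default_factory=list)   # tables accessed to complete the task
    output_tables: list = field(default_factory=list)  # tables generated by the task

plan = ExecutionPlan(
    task_id="job-001",
    input_tables=["warehouse.orders"],
    output_tables=["warehouse.order_stats"],
)
print(plan.input_tables)   # tables the task depends on
print(plan.output_tables)  # tables the task produces
```

A real logical or physical plan carries much more (operators, partitioning, and so on); only the input/output table information matters for the metadata collection described here.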
S2: and performing calculation operation on the nodes of the big data cluster through the execution plan, and monitoring the calculation operation.
Specifically, to execute the task submitted by the user, the execution plan must be distributed to the corresponding nodes of the big data cluster, and the computing operations it describes are performed on the different nodes. During computation, a corresponding listener is registered on the corresponding interface; the listener monitors the computing operation in real time, and from the information the computing operation feeds back to the listener it can be judged whether the computing operation, i.e. the execution plan, has finished.
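The listener described in S2 follows the observer pattern. A minimal plain-Python sketch with hypothetical names (the real system would register a listener on the Spark interface) looks like:

```python
class JobListener:
    """Minimal listener: records completion events fed back by the computation."""
    def __init__(self):
        self.finished = set()

    def on_job_end(self, job_id):
        # Called by the cluster when a computing operation finishes.
        self.finished.add(job_id)

    def is_complete(self, job_id):
        return job_id in self.finished

listener = JobListener()
listener.on_job_end("job-001")          # cluster reports the computation finished
print(listener.is_complete("job-001"))  # once True, the execution plan can be fetched
```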
S3: and when the completion of the execution of the computing operation is monitored, receiving an execution plan returned by the interface corresponding to the big data cluster.
Specifically, because processing is based on a big data cluster under the Spark framework, the interface corresponding to the big data cluster can return the execution plan once its execution is complete, so that the execution plan can be analyzed further. If the computing operation of the execution plan has not completed, the execution plan cannot be analyzed. Therefore the computing operation of the execution plan must be tracked and monitored; when the computing operation, i.e. the execution plan, is monitored to be complete, the execution plan returned by the interface corresponding to the big data cluster is received and analyzed, thereby obtaining the metadata information.
S4: and analyzing the execution plan, acquiring metadata information corresponding to the execution plan, and storing the metadata information in a relational database.
Specifically, the purpose of the present application is to collect metadata information in a big data cluster. Tasks submitted by users to the big data cluster are eventually converted into an execution plan, which includes the input data tables and output data table information in the big data cluster that must be accessed or generated to complete the task. After the execution plan is parsed, which data in the big data cluster was accessed and called can be determined, the input data tables and output data table information can be obtained, and the corresponding metadata information can be acquired. The acquired metadata is stored in an external relational database, which is convenient for subsequent use.
Metadata, also called intermediate data or relay data, is data about data: it mainly describes data attributes and supports functions such as indicating storage location, historical data, resource searching, and file recording. Metadata is an electronic catalog: to achieve cataloging, it describes and collects the contents or characteristics of data, thereby assisting data retrieval. In this embodiment of the present application, metadata information refers to the real data information reflecting what the big data cluster stores, including the corresponding real data's storage location, data size, and data storage mode, which are the basic units by which the big data cluster manages its stored data.
S5: metadata information stored in the relational database is imported into the large data warehouse in the manner of Sqoop data importation.
Specifically, if a large amount of metadata information is kept in the relational database, the database easily becomes overloaded and the metadata becomes inconvenient to access. The metadata information stored in the relational database is therefore imported into the big data warehouse, which is convenient for subsequent processing.
Sqoop is a distributed data migration tool that can import data from a relational database into Hadoop's HDFS, or export HDFS data into a relational database. In the present application, the metadata information stored in the relational database is imported into the big data warehouse by means of Sqoop data import. Further, the big data warehouse referred to in this application is the data warehouse tool Hive.
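As a hedged illustration of the Sqoop import step, a typical `sqoop import` invocation moving a relational table into Hive might look like the following; the connection string, credentials, and table names are hypothetical, and the actual invocation would depend on the deployment:

```shell
# Hypothetical example: import the metadata table from MySQL into Hive.
sqoop import \
  --connect jdbc:mysql://db-host:3306/meta_db \
  --username collector \
  --password-file /user/collector/.pw \
  --table metadata_info \
  --hive-import \
  --hive-table warehouse.metadata_info
```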
In this embodiment, a task submitted by a user to the big data cluster is received and parsed to obtain the corresponding execution plan; the computing operation is performed on the nodes of the big data cluster through the execution plan and is monitored; when the computing operation is monitored to be complete, the execution plan returned by the interface corresponding to the big data cluster is received; the execution plan is parsed, the corresponding metadata information is acquired and stored in a relational database; and the metadata information stored in the relational database is imported into the big data warehouse by means of Sqoop data import. In this way, for tasks submitted by users through various channels, the metadata information is completely collected, and metadata collection efficiency is improved.
Referring to fig. 3, fig. 3 shows a specific implementation of step S4, in which the execution plan is parsed, the metadata information corresponding to the execution plan is acquired, and the metadata information is stored in a relational database. The details are as follows:
s41: and analyzing the execution plan to acquire corresponding content related to the execution plan.
Specifically, the execution plan includes the input data tables and output data table information in the big data cluster that must be accessed or generated to complete the task submitted by the user. Therefore, after the execution plan is parsed, which data in the big data cluster was accessed and called can be determined, and the input data tables and output data table information of the big data cluster can be obtained; that is, the corresponding content related to the execution plan is acquired.
The corresponding content refers, for a task submitted by a user, to the content related to its execution plan, including information on which data tables, or which data within those tables, were accessed, called, and so on.
S42: and acquiring an input-output identifier in the execution plan, and distinguishing source information attributes of corresponding contents according to the input-output identifier to obtain target contents, wherein the source information attributes comprise input source information and output source information.
Specifically, the metadata information to be collected in the big data cluster includes the data storage location, data size, and data storage mode, and further includes whether the data was accessed as input or as output. The attribute by which the server can access data as input or output is taken as the source information attribute, so the source information attributes include input source information and output source information.
Input source information means that the data's attribute is that of data input into the big data cluster; output source information means that the data's attribute is that of data output from the big data cluster. Target content means the corresponding content whose source information attributes have been distinguished, i.e. it is known which data belongs to input source information and which belongs to output source information. In addition, when the execution plan is constructed, the task submitted by the user is parsed to obtain various identifiers, among them the input-output identifiers, which are used to distinguish the source information attributes of the corresponding content.
S43: metadata information in the target content is extracted and stored in a relational database.
Specifically, since the target content has been obtained in the above steps, the data in the target content need only be extracted to obtain the metadata information therein, and the metadata information is then stored in the relational database.
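Steps S41 to S43 can be sketched as follows, assuming a toy plan representation in which every entry carries an input-output identifier; the real system parses a Spark execution plan, and all structures and names here are hypothetical:

```python
# Toy execution-plan entries: each carries an I/O identifier and table metadata.
plan_entries = [
    {"io": "input",  "table": "warehouse.orders",      "location": "/data/orders",      "size_mb": 512},
    {"io": "output", "table": "warehouse.order_stats", "location": "/data/order_stats", "size_mb": 8},
]

def extract_metadata(entries):
    """S42/S43: distinguish source-information attributes by the I/O identifier,
    then collect the metadata rows to be stored in a relational database."""
    target = {"input": [], "output": []}
    for entry in entries:
        target[entry["io"]].append(
            {"table": entry["table"], "location": entry["location"], "size_mb": entry["size_mb"]}
        )
    return target

meta = extract_metadata(plan_entries)
print([m["table"] for m in meta["input"]])
print([m["table"] for m in meta["output"]])
```

The rows in `meta` correspond to the metadata (storage location, size, and so on) that step S43 would write into the relational database.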
In this embodiment, the corresponding content related to the execution plan is acquired by parsing the execution plan; the input-output identifiers in the execution plan are acquired, and the source information attributes of the corresponding content are distinguished according to those identifiers to obtain the target content; finally the metadata information in the target content is extracted and stored in the relational database. Extraction of the metadata information is thereby realized, and the collection efficiency of big data cluster metadata is improved.
Referring to fig. 4, fig. 4 shows a specific implementation of step S1, in which a task submitted by a user to the big data cluster is received and parsed to obtain the execution plan corresponding to the task. The details are as follows:
s11: and receiving tasks submitted to the big data cluster by the user, and analyzing the tasks into SQL statement files in an SQL analysis mode.
Specifically, in the SQL parsing mode, the task submitted by the user is parsed by an SQL parsing engine to form a corresponding SQL statement file. Further, SQL parsing engines include tools such as the Hive SQL parsing engine and the Spark SQL parsing engine. The SQL statement file is the file formed from the corresponding SQL statements after the user-submitted task is parsed, and can be read by the subsequent syntax-analysis tool.
S12: and constructing a grammar tree by carrying out grammar analysis on the SQL sentence file.
Specifically, the open-source parser ANTLR parses the SQL statement file and constructs its syntax tree from the parse.
The open-source parser ANTLR is a visual, open-source syntax analyzer that can automatically generate a syntax tree from an input SQL statement file. ANTLR provides a framework for automatically constructing recognizers, compilers, and interpreters of custom languages from grammatical descriptions, for a variety of languages. In the embodiment of the present application, ANTLR parses the SQL statement file and constructs its syntax tree from the parse.
Syntax analysis means reading the input SQL statement file through the open-source parser ANTLR, parsing out the relevant keywords and identifiers, performing grammatical construction from them, and finally forming a syntax tree.
S13: and compiling and analyzing the grammar tree through a compiler to obtain an execution plan.
Specifically, the above steps have constructed the syntax tree; the execution plan corresponding to the user-submitted task is obtained simply by compiling and analyzing the syntax tree. In the embodiment of the present application, the compiler adopted is AstBuilder, an open-source code parser. The syntax tree is compiled and analyzed through AstBuilder to finally obtain the execution plan.
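As an illustrative sketch only (the real system compiles an ANTLR syntax tree through Spark's AstBuilder), compiling a toy syntax tree for an INSERT-SELECT statement into a plan recording input and output tables might look like this; the tree shape and all names are hypothetical:

```python
# Hypothetical toy syntax tree for: INSERT INTO order_stats SELECT * FROM orders
syntax_tree = {
    "node": "InsertInto",
    "target": "order_stats",
    "child": {"node": "Select", "source": "orders"},
}

def compile_plan(tree):
    """Walk the tree and record which tables are read (input) and written (output)."""
    plan = {"input_tables": [], "output_tables": []}
    def walk(node):
        if node["node"] == "InsertInto":
            plan["output_tables"].append(node["target"])
            walk(node["child"])
        elif node["node"] == "Select":
            plan["input_tables"].append(node["source"])
    walk(tree)
    return plan

plan = compile_plan(syntax_tree)
print(plan)  # {'input_tables': ['orders'], 'output_tables': ['order_stats']}
```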
In this embodiment, a task submitted by a user to the big data cluster is received and parsed into an SQL statement file in the SQL parsing mode; a syntax tree is constructed by syntax analysis of the SQL statement file; and finally the syntax tree is compiled and analyzed by the compiler to obtain the execution plan. The execution plan is thus obtained, providing a basis for subsequently monitoring its execution.
Referring to fig. 5, fig. 5 shows a specific implementation of step S12, in which the syntax tree is constructed by syntax analysis of the SQL statement file. The details are as follows:
S121: parsing the SQL statement file through a lexical analyzer to obtain the keywords and identifiers in the SQL statement file.
A lexical analyzer is also known as a scanner or tokenizer. Since an SQL statement file is composed of keywords and a strictly defined grammatical structure, the task of the lexical analyzer is to analyze and quantize the otherwise meaningless character stream, translating it into discrete character groups (i.e. tokens), including keywords, identifiers, and so on. These parsed keywords and identifiers are provided to the syntax analyzer in the subsequent step, which finally forms the syntax tree.
S122: performing grammar construction on the keywords and identifiers through a parser to generate a syntax tree.
Specifically, when scanning the character stream, the lexical analyzer does not concern itself with the grammatical meaning of the individual token groups it produces, nor with their relation to the context; that is the work of the parser. The parser organizes the received token groups and converts them into sequences allowed by the grammar definition of the target language. In this embodiment, the parser organizes and builds the token groups (keywords, identifiers, and so on) generated by the lexical analyzer and converts them into a syntax tree.
Grammar construction therefore means arranging, through the parser, the token groups generated by the lexical analyzer according to their grammatical meaning and their contextual relations, finally forming a syntax tree.
In this embodiment, the SQL statement file is analyzed by the lexical analyzer to obtain its keywords and identifiers, and the parser then performs grammar construction on these keywords and identifiers to generate a syntax tree. Construction of the syntax tree is thereby realized, laying a foundation for the subsequent generation of the execution plan.
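Continuing the sketch (again with assumed, illustrative structures rather than the patent's actual ANTLR-generated parser), a toy parser can organize such token groups into a syntax tree for a simple SELECT statement:

```python
def parse_select(tokens):
    """Organize a (type, text) token sequence for a simple SELECT
    statement into a syntax tree; nested dicts stand in for tree nodes."""
    # Expect: SELECT <identifier> [, <identifier>]... FROM <identifier>
    assert tokens[0] == ("KEYWORD", "SELECT"), "not a SELECT statement"
    columns, i = [], 1
    while tokens[i] != ("KEYWORD", "FROM"):
        if tokens[i][0] == "IDENTIFIER":
            columns.append(tokens[i][1])
        i += 1  # skip commas and other symbols
    table = tokens[i + 1][1]
    return {"type": "Select",
            "columns": columns,
            "from": {"type": "Table", "name": table}}

tokens = [("KEYWORD", "SELECT"), ("IDENTIFIER", "id"), ("SYMBOL", ","),
          ("IDENTIFIER", "name"), ("KEYWORD", "FROM"), ("IDENTIFIER", "users")]
print(parse_select(tokens))
# → {'type': 'Select', 'columns': ['id', 'name'], 'from': {'type': 'Table', 'name': 'users'}}
```

The tree makes the contextual relations explicit: the column identifiers hang off the Select node, and the table identifier is its FROM child, which is exactly the structure the compiler consumes in step S13.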
Referring to fig. 6, fig. 6 shows a specific implementation of step S2. The process in step S2 of performing a computing operation on the nodes of the big data cluster through the execution plan and monitoring the computing operation is described in detail as follows:
S21: after the logical execution plan in the execution plan has been executed in parallel, translating the logical execution plan into a physical execution plan.
Specifically, the execution plan comprises a logical execution plan and a physical execution plan. The logical execution plan is only a data structure and contains no data information, so the data source and data type cannot be obtained from it, nor can it be known which table the different columns come from. The logical execution plan therefore needs to be converted into a physical execution plan, which is mainly responsible for the unified management of function resource information and metadata information (including databases, data tables, data views, data partitions, functions, and so on), so that the physical execution plan can complete the task submitted by the user. After the logical execution plan has been executed in parallel, it is translated into a physical execution plan.
Here, parallel execution means allowing multiple program sets to coexist and run simultaneously on the same server. In this embodiment, the logical execution plan is executed on multiple program sets simultaneously; after execution is complete, it is translated and converted into the physical execution plan.
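As a purely illustrative sketch (the class names, the in-memory catalog, and the lookup logic below are assumptions, not the patent's implementation), the translation of a logical execution plan, which is only a data structure, into a physical execution plan bound to concrete metadata can be pictured as:

```python
from dataclasses import dataclass, field

# Hypothetical metadata catalog; a real physical plan binds to the
# cluster's databases, tables, partitions, views, and functions.
CATALOG = {"users": {"columns": ["id", "name"], "location": "/warehouse/users"}}

@dataclass
class LogicalPlan:      # pure structure: no data source information
    operation: str
    table: str

@dataclass
class PhysicalPlan:     # bound to concrete metadata, ready to run
    operation: str
    table: str
    columns: list = field(default_factory=list)
    location: str = ""

def to_physical(logical):
    """Translate a logical plan into a physical plan by resolving the
    referenced table against the metadata catalog."""
    meta = CATALOG[logical.table]
    return PhysicalPlan(logical.operation, logical.table,
                        meta["columns"], meta["location"])

plan = to_physical(LogicalPlan("scan", "users"))
print(plan.location)  # the physical plan now knows where the data lives
# → /warehouse/users
```

The point of the translation is visible in the two dataclasses: only after resolution against the catalog does the plan know which columns exist and which storage location to read.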
S22: distributing the physical execution plan to the computing nodes of the big data cluster based on the physical execution plan.
Specifically, because the physical execution plan contains the task submitted by the user, it specifies which conversion and computing operations are to be performed. The physical execution plan is therefore distributed to the computing nodes of the big data cluster according to its content, which facilitates the subsequent execution of the corresponding computing operations.
S23: executing the computing operation corresponding to the task on the computing nodes, and monitoring the computing process.
Specifically, the computing operation refers to the operations required to complete the task submitted by the user. In this embodiment, the content of the computing operation need not be known; only the computing process needs to be monitored, so as to learn when it has completed. The server can further analyze the physical execution plan to obtain the metadata information only after the computing operation, that is, the physical execution plan, has finished executing; the server therefore judges whether the computing operation is complete by monitoring the computing process.
In this embodiment, the logical execution plan in the execution plan is executed in parallel and then translated into a physical execution plan; the physical execution plan is distributed to the computing nodes of the big data cluster based on its content; and the computing operation corresponding to the task is executed on the computing nodes while the computing process is monitored. The execution plan is thereby executed and its progress monitored, providing a basis for the subsequent analysis of the execution plan and thus improving the efficiency of collecting the metadata information of the big data cluster.
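The distribute-and-monitor flow of steps S21 to S23 can be sketched in miniature with a thread pool standing in for the cluster's computing nodes (all names here are illustrative assumptions; a real big data cluster would ship the physical plan's operators to remote executors):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_on_node(node, plan):
    """Stand-in for the computing operation a node performs; a real
    cluster would execute the physical plan's operators here."""
    return {"node": node, "plan": plan, "status": "finished"}

def distribute_and_monitor(plan, nodes):
    """Distribute the plan to every node and monitor completion by
    collecting each node's feedback as it arrives."""
    feedback = []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(run_on_node, n, plan) for n in nodes]
        for fut in as_completed(futures):  # the monitoring loop
            feedback.append(fut.result())
    return feedback

results = distribute_and_monitor("physical-plan-1", ["node-1", "node-2", "node-3"])
print(all(r["status"] == "finished" for r in results))
# → True
```

Only once every node has reported back does the monitor consider the computing operation complete, which mirrors the completion judgment described in step S3 below.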
Referring to fig. 7, fig. 7 shows a specific implementation of step S3. The process in step S3 of receiving, when the computing operation is monitored to have finished executing, the execution plan returned through the interface corresponding to the big data cluster is described in detail as follows:
S31: when feedback information returned by the nodes of the big data cluster is received, judging that the computing operation has finished executing.
Specifically, since the preceding steps monitor the computing operation of the execution plan on the nodes in real time through a listener, after the feedback information returned by the nodes of the big data cluster is received, the feedback information is analyzed and it is judged that the computing operation, and hence the execution plan, has finished executing.
S32: receiving, through the interface corresponding to the big data cluster, the execution plan returned after the computing operation is completed.
Specifically, when the execution plan has finished executing, it is returned to the listener through the interface corresponding to the big data cluster, so that by receiving the returned execution plan the server can analyze it and obtain the metadata information it contains.
In this embodiment, after the feedback information returned by the nodes of the big data cluster is received, it is determined that the computing operation has finished executing, and the execution plan returned after the completion of the computing operation is received through the interface corresponding to the big data cluster. The execution plan is thereby obtained, which facilitates its analysis in the subsequent steps and hence the acquisition of the metadata information.
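A hypothetical listener combining steps S31 and S32, judging completion from node feedback and then accepting the execution plan returned through the cluster's interface, might be sketched as follows (the class and method names are assumptions for illustration, not the patent's API):

```python
class ExecutionPlanListener:
    """Illustrative listener: it judges completion from node feedback
    and then accepts the execution plan returned by the cluster."""

    def __init__(self, expected_nodes):
        self.expected_nodes = set(expected_nodes)
        self.reported = set()
        self.returned_plan = None

    def on_feedback(self, node):
        self.reported.add(node)
        return self.finished()       # S31: judge whether execution is done

    def finished(self):
        return self.reported == self.expected_nodes

    def receive_plan(self, plan):
        if self.finished():          # S32: accept the returned plan
            self.returned_plan = plan
        return self.returned_plan

listener = ExecutionPlanListener(["node-1", "node-2"])
listener.on_feedback("node-1")
listener.on_feedback("node-2")
print(listener.receive_plan({"plan": "physical-plan-1"}))
# → {'plan': 'physical-plan-1'}
```

The guard in `receive_plan` reflects the ordering in the method: the plan is only handed over for analysis once every node's feedback confirms the computing operation has completed.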
Referring to fig. 8, fig. 8 shows a specific implementation following step S5. This embodiment includes:
S51: identifying the data information in the metadata information that is identical to the historical data in the big data warehouse as duplicate data information.
Specifically, since the collected metadata information may contain data information identical to the historical data in the big data warehouse, such data information is identified as duplicate data information in order to reduce data redundancy and thus the load on the big data cluster.
S52: deleting the duplicate data information from the metadata information in the big data warehouse to obtain the newly added metadata information.
Specifically, the duplicate data information in the metadata information is deleted; the remaining metadata information is thereby distinguished from the historical data in the big data warehouse and serves as the newly added metadata information.
In this embodiment, the data information in the metadata information that is identical to the historical data in the big data warehouse is identified as duplicate data information, and the duplicate data information is deleted from the metadata information in the big data warehouse to obtain the newly added metadata information. This reduces data redundancy and the load on the big data cluster, thereby improving the efficiency of collecting the metadata information of the big data cluster.
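Steps S51 and S52 can be sketched as a simple set difference over metadata records (the record shape with a `key` field is an assumption for illustration):

```python
def deduplicate(collected, history):
    """Drop metadata records already present in the warehouse history;
    what remains is the newly added metadata information."""
    history_keys = {rec["key"] for rec in history}
    duplicates = [r for r in collected if r["key"] in history_keys]      # S51
    new_records = [r for r in collected if r["key"] not in history_keys]  # S52
    return new_records, duplicates

history = [{"key": "db1.users", "type": "table"}]
collected = [{"key": "db1.users", "type": "table"},
             {"key": "db1.orders", "type": "table"}]
new_records, duplicates = deduplicate(collected, history)
print([r["key"] for r in new_records])  # only db1.orders is newly added
# → ['db1.orders']
```

Building the history keys as a set keeps the duplicate check O(1) per record, which matters when the warehouse history is large.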
It is emphasized that to further guarantee the privacy and security of metadata information, the metadata information may also be stored in a node of a blockchain.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program stored in a computer-readable storage medium; when executed, the program may comprise the steps of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or a random access memory (RAM).
Referring to fig. 9, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a device for collecting metadata information of a big data cluster, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device may be specifically applied to various electronic devices.
As shown in fig. 9, the apparatus for collecting metadata information of a big data cluster according to the present embodiment includes: an execution plan generation module 61, an execution plan execution module 62, an execution plan reception module 63, an execution plan analysis module 64, and a metadata information import module 65, wherein:
The execution plan generating module 61 is configured to receive a task submitted by a user to the big data cluster, and parse the task to obtain an execution plan corresponding to the task;
an execution plan execution module 62, configured to perform a computing operation on the nodes of the big data cluster through the execution plan and to monitor the computing operation;
the execution plan receiving module 63 is configured to receive an execution plan returned by the interface corresponding to the big data cluster when it is monitored that the execution of the computing operation is completed;
the execution plan analysis module 64 is configured to analyze the execution plan, obtain metadata information corresponding to the execution plan, and store the metadata information in the relational database;
the metadata information importing module 65 is configured to import the metadata information stored in the relational database into the big data warehouse in the manner of Sqoop data import.
Further, the execution plan parsing module 64 includes:
the corresponding content acquisition unit is used for analyzing the execution plan and acquiring corresponding content related to the execution plan;
the target content acquisition unit is used for acquiring an input and output identifier in the execution plan, distinguishing source information attributes of corresponding content according to the input and output identifier, and obtaining target content, wherein the source information attributes comprise input source information and output source information;
And the metadata information extraction unit is used for extracting metadata information in the target content and storing the metadata information in the relational database.
Further, the execution plan generation module 61 includes:
the task parsing unit is used for receiving the task submitted by a user to the big data cluster and parsing the task into an SQL statement file by means of SQL parsing;
the syntax tree construction unit is used for constructing a syntax tree by performing syntax analysis on the SQL statement file;
and the syntax tree compiling unit is used for compiling and analyzing the syntax tree through a compiler to obtain the execution plan.
Further, the syntax tree construction unit includes:
the statement file parsing subunit is used for parsing the SQL statement file through the lexical analyzer to obtain the keywords and identifiers in the SQL statement file;
and the grammar construction subunit is used for performing grammar construction on the keywords and identifiers through a parser to generate the syntax tree.
Further, the execution plan execution module 62 includes:
the physical execution plan acquisition unit is used for translating the logical execution plan in the execution plan into a physical execution plan after the logical execution plan has been executed in parallel;
a physical execution plan distribution unit for distributing the physical execution plan to the computing nodes of the big data cluster based on the physical execution plan;
And the computing operation monitoring unit is used for executing the computing operation corresponding to the task on the computing node and monitoring the computing operation process.
Further, the execution plan receiving module 63 includes:
the feedback information receiving unit is used for judging that the computing operation has finished executing when feedback information returned by the nodes of the big data cluster is received;
and the execution plan acquisition unit is used for receiving, through the interface corresponding to the big data cluster, the execution plan returned after the computing operation is completed.
Further, after the metadata information importing module 65, the device for collecting metadata information of the big data cluster further includes:
the duplicate data information identification module, which is used for identifying data information whose metadata information is identical to the historical data in the big data warehouse as duplicate data information;
and the metadata information deleting module, which is used for deleting the duplicate data information from the metadata information in the big data warehouse to obtain the newly added metadata information.
It is emphasized that, to further ensure the privacy and security of the metadata information, the metadata information may also be stored in a blockchain node.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Referring specifically to fig. 10, fig. 10 is a block diagram of the basic structure of the computer device according to this embodiment.
The computer device 7 comprises a memory 71, a processor 72, and a network interface 73 communicatively connected to each other via a system bus. It is noted that only a computer device 7 having the three components memory 71, processor 72, and network interface 73 is shown in the figure, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, whose hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and so on.
The computer device may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The computer device can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 71 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and so on. In some embodiments, the memory 71 may be an internal storage unit of the computer device 7, such as the hard disk or main memory of the computer device 7. In other embodiments, the memory 71 may also be an external storage device of the computer device 7, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card provided on the computer device 7. Of course, the memory 71 may also comprise both an internal storage unit and an external storage device of the computer device 7. In this embodiment, the memory 71 is generally used to store the operating system and the various types of application software installed on the computer device 7, such as the program code of the method for collecting metadata information of a big data cluster. In addition, the memory 71 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 72 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 72 is typically used to control the overall operation of the computer device 7. In this embodiment, the processor 72 is configured to run the program code stored in the memory 71 or to process data, for example to run the program code of the method for collecting metadata information of a big data cluster.
The network interface 73 may comprise a wireless network interface or a wired network interface, which network interface 73 is typically used to establish a communication connection between the computer device 7 and other electronic devices.
The present application further provides another embodiment, namely a computer-readable storage medium storing a computer program executable by at least one processor, so that the at least one processor performs the steps of the method for collecting metadata information of a big data cluster described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is preferable. Based on this understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic means, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It is apparent that the embodiments described above are only some of the embodiments of the present application, not all of them; the preferred embodiments are given in the drawings, but they do not limit the patent scope of the present application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the embodiments may be modified, or equivalents may be substituted for some of their elements. All equivalent structures made on the basis of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, likewise fall within the protection scope of this application.

Claims (8)

1. A method for collecting metadata information of a big data cluster, characterized by comprising the following steps:
receiving a task submitted to a big data cluster by a user, and analyzing the task to obtain an execution plan corresponding to the task;
after the logical execution plan in the execution plan has been executed in parallel, translating the logical execution plan into a physical execution plan;
distributing the physical execution plan to computing nodes of the big data cluster based on the physical execution plan;
executing the computing operation corresponding to the task on the computing node, and monitoring the computing operation process;
when the completion of the execution of the computing operation is monitored, receiving the execution plan returned by the interface corresponding to the big data cluster;
analyzing the execution plan to obtain corresponding content related to the execution plan;
acquiring an input-output identifier in an execution plan, and distinguishing source information attributes of the corresponding content according to the input-output identifier to obtain target content, wherein the source information attributes comprise input source information and output source information;
extracting metadata information in the target content, and storing the metadata information in a relational database;
importing the metadata information stored in the relational database into the big data warehouse in the manner of Sqoop data import.
2. The method for collecting metadata information of a big data cluster according to claim 1, wherein receiving the task submitted by a user to the big data cluster and parsing the task to obtain the execution plan corresponding to the task comprises:
receiving the task submitted by a user to the big data cluster, and parsing the task into an SQL statement file by means of SQL parsing;
constructing a syntax tree by performing syntax analysis on the SQL statement file;
and compiling and analyzing the syntax tree through a compiler to obtain the execution plan.
3. The method for collecting metadata information of big data clusters according to claim 2, wherein said constructing a syntax tree by parsing the SQL statement file comprises:
parsing the SQL statement file through a lexical analyzer to obtain the keywords and identifiers in the SQL statement file;
and performing grammar construction on the keywords and identifiers through a parser to generate the syntax tree.
4. The method for collecting metadata information of a big data cluster according to claim 1, wherein receiving, when the computing operation is monitored to have finished executing, the execution plan returned by the interface corresponding to the big data cluster comprises:
when feedback information returned by the nodes of the big data cluster is received, judging that the execution of the computing operation is completed;
and receiving, through the interface corresponding to the big data cluster, the execution plan returned after the computing operation is completed.
5. The method for collecting metadata information of a big data cluster according to any one of claims 1 to 4, wherein, after the metadata information stored in the relational database is imported into the big data warehouse in the manner of Sqoop data import, the method further comprises:
identifying data information whose metadata information is identical to the historical data in the big data warehouse as duplicate data information;
and deleting the duplicate data information from the metadata information in the big data warehouse to obtain newly added metadata information.
6. A device for collecting metadata information of a big data cluster, comprising:
the execution plan generation module is used for receiving tasks submitted to the big data cluster by a user, analyzing the tasks and obtaining an execution plan corresponding to the tasks;
the physical execution plan acquisition unit is used for translating the logical execution plan in the execution plan into a physical execution plan after the logical execution plan has been executed in parallel;
A physical execution plan distribution unit, configured to distribute the physical execution plan to computing nodes of the big data cluster based on the physical execution plan;
a computing operation monitoring unit, configured to execute a computing operation corresponding to the task on the computing node, and monitor the computing operation process;
the execution plan receiving module is used for receiving the execution plan returned by the interface corresponding to the big data cluster when the completion of the execution of the computing operation is monitored;
the corresponding content acquisition unit is used for analyzing the execution plan and acquiring corresponding content related to the execution plan;
the target content acquisition unit is used for acquiring an input and output identifier in an execution plan, distinguishing source information attributes of the corresponding content according to the input and output identifier, and obtaining target content, wherein the source information attributes comprise input source information and output source information;
a metadata information extraction unit for extracting metadata information in the target content and storing the metadata information in a relational database;
and the metadata information importing module is used for importing the metadata information stored in the relational database into the big data warehouse in the manner of Sqoop data import.
7. A computer device comprising a memory and a processor, the memory having a computer program stored therein, wherein the processor, when executing the computer program, implements the method for collecting metadata information of a big data cluster according to any one of claims 1 to 5.
8. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the method for collecting metadata information of a big data cluster according to any one of claims 1 to 5.
CN202011483745.XA 2020-12-16 2020-12-16 Method, device, equipment and medium for collecting metadata information of big data cluster Active CN112527880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011483745.XA CN112527880B (en) 2020-12-16 2020-12-16 Method, device, equipment and medium for collecting metadata information of big data cluster


Publications (2)

Publication Number Publication Date
CN112527880A CN112527880A (en) 2021-03-19
CN112527880B true CN112527880B (en) 2023-08-08

Family

ID=75000556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011483745.XA Active CN112527880B (en) 2020-12-16 2020-12-16 Method, device, equipment and medium for collecting metadata information of big data cluster

Country Status (1)

Country Link
CN (1) CN112527880B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279568A (en) * 2013-06-18 2013-09-04 无锡紫光存储系统有限公司 System and method for metadata management
CN104063486A (en) * 2014-07-03 2014-09-24 四川中亚联邦科技有限公司 Big data distributed storage method and system
CN106202378A (en) * 2016-07-08 2016-12-07 中国地质大学(武汉) The immediate processing method of a kind of streaming meteorological data and system
CN106651633A (en) * 2016-10-09 2017-05-10 国网浙江省电力公司信息通信分公司 Power utilization information acquisition system and method based on big data technology
CN110704417A (en) * 2019-10-10 2020-01-17 南方电网数字电网研究院有限公司 Metadata management method, equipment and storage medium
CN110968592A (en) * 2019-12-06 2020-04-07 深圳前海环融联易信息科技服务有限公司 Metadata acquisition method and device, computer equipment and computer-readable storage medium
CN111104548A (en) * 2019-12-18 2020-05-05 腾讯科技(深圳)有限公司 Data feedback method, system and storage medium
CN111651315A (en) * 2020-04-15 2020-09-11 北京皮尔布莱尼软件有限公司 Page data acquisition method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10725833B2 (en) * 2016-10-28 2020-07-28 Nicira, Inc. Monitoring and optimizing interhost network traffic


Also Published As

Publication number Publication date
CN112527880A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN112860727B (en) Data query method, device, equipment and medium based on big data query engine
EP4099170B1 (en) Method and apparatus of auditing log, electronic device, and medium
CN109522341B (en) Method, device and equipment for realizing SQL-based streaming data processing engine
CN111666490A (en) Information pushing method, device, equipment and storage medium based on kafka
CN112491602B (en) Behavior data monitoring method and device, computer equipment and medium
CN107506256B (en) Method and device for monitoring crash data
CN108694221B (en) Data real-time analysis method, module, equipment and device
CN111709527A (en) Operation and maintenance knowledge map library establishing method, device, equipment and storage medium
CN111309760A (en) Data retrieval method, system, device and storage medium
CN111859969B (en) Data analysis method and device, electronic equipment and storage medium
CN112394908A (en) Method and device for automatically generating embedded point page, computer equipment and storage medium
CN113282611A (en) Method and device for synchronizing stream data, computer equipment and storage medium
CN113010542B (en) Service data processing method, device, computer equipment and storage medium
CN111797297B (en) Page data processing method and device, computer equipment and storage medium
CN113326261A (en) Data blood relationship extraction method and device and electronic equipment
CN113609008A (en) Test result analysis method and device and electronic equipment
CN112527880B (en) Method, device, equipment and medium for collecting metadata information of big data cluster
CN113836235B (en) Data processing method based on data center and related equipment thereof
CN113138767B (en) Code language conversion method, device, electronic equipment and storage medium
CN113792138B (en) Report generation method and device, electronic equipment and storage medium
CN114968725A (en) Task dependency relationship correction method and device, computer equipment and storage medium
CN111159213A (en) Data query method, device, system and storage medium
CN114610769A (en) Data analysis method, device, equipment and storage medium
CN114169318A (en) Process identification method, apparatus, device, medium, and program
CN112925889A (en) Natural language processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant