CN117271584A - Data processing method and device, computer readable storage medium and electronic equipment


Info

Publication number
CN117271584A
Authority
CN
China
Prior art keywords
metadata
target
task
data source
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311197119.8A
Other languages
Chinese (zh)
Inventor
周佳文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202311197119.8A
Publication of CN117271584A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 - Querying
    • G06F 16/245 - Query processing
    • G06F 16/2455 - Query execution
    • G06F 16/24553 - Query execution of query operations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 - Querying
    • G06F 16/245 - Query processing
    • G06F 16/2455 - Query execution
    • G06F 16/24552 - Database cache management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 - Integrating or interfacing systems involving database management systems
    • G06F 16/258 - Data format conversion from or to a database
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to the field of data processing technologies, and provides a data processing method and apparatus, a computer readable storage medium, and an electronic device. The method comprises the following steps: acquiring task configuration information, starting a corresponding task flow according to the task configuration information, and determining a target data source of the task flow; when executing the task flow, acquiring original metadata of the target data source, and converting the original metadata into target metadata in a predetermined format; and matching the target metadata with the cached metadata corresponding to the original metadata, and determining the metadata to be output according to the matching result. The present disclosure can improve the efficiency and accuracy of discovering data assets.

Description

Data processing method and device, computer readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technology, and more particularly, to a data processing method, a data processing apparatus, a computer-readable storage medium, and an electronic device.
Background
With the advent of the digital age, data has become an asset of significant value, and data asset management has likewise taken on significant meaning and impact. Data asset discovery refers to the ability to automatically or semi-automatically discover data, including structured data, unstructured data, semi-structured data and the like, and to classify, describe and organize these data assets; for a data asset management system, data asset discovery is not only a basic function, but also one of the key steps in realizing the full flow of data asset management.
However, the data asset discovery method in the related art requires each service system to report its data on its own, and suffers from low efficiency and low accuracy, which affects the effective management and application of data assets to a certain extent.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure, and thus may include information that does not constitute prior art already known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a data processing method and apparatus, a computer-readable storage medium, and an electronic device, so as to improve, at least to some extent, the efficiency and accuracy of discovering data assets.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided a data processing method including: acquiring task configuration information, starting a corresponding task flow according to the task configuration information, and determining a target data source of the task flow; when the task flow is executed, acquiring original metadata in the target data source, and converting the original metadata into target metadata in a preset format; and matching the target metadata with the cache metadata corresponding to the original metadata, and determining metadata to be output according to a matching result.
In an exemplary embodiment of the present disclosure, the acquiring task configuration information includes: responding to configuration operation aiming at a preset task configuration file, and generating task configuration information according to the task configuration file and configuration information of the configuration operation; wherein the task configuration information is used to indicate that original metadata is extracted from the target data source.
In an exemplary embodiment of the present disclosure, the task configuration information includes data transmission configurations of a plurality of data sources, and the determining the target data source of the task flow includes: determining the target data source from the plurality of data sources according to source data type information in configuration information of the configuration operation; and calling the data transmission configuration of the target data source based on the connection configuration information in the configuration information of the configuration operation so as to establish data transmission connection with the target data source.
In one exemplary embodiment of the present disclosure, the task configuration information includes a timer configuration; when the task flow is executed, the original metadata in the target data source is obtained, and the original metadata is converted into target metadata in a preset format, which comprises the following steps: when executing the task flow, triggering metadata extraction tasks based on the timer configuration; extracting metadata from the target data source according to an information extraction mode corresponding to the metadata extraction task to obtain the original metadata; and formatting the original metadata to obtain target metadata with the preset format.
In an exemplary embodiment of the present disclosure, the formatting the original metadata to obtain the target metadata having the predetermined format includes: acquiring a data template; and parsing the original metadata, and filling the parsing result into the data template to obtain the target metadata having the predetermined format.
In an exemplary embodiment of the present disclosure, after the converting the original metadata into target metadata in a predetermined format, the target metadata is stored in a first cache region; the matching the target metadata with the cached metadata corresponding to the target metadata, and determining metadata to be output according to a matching result includes: acquiring the target metadata from the first cache region, wherein the target metadata comprises a metadata identifier; accessing a second cache region, and acquiring the cached metadata corresponding to the metadata identifier from the second cache region, wherein the second cache region stores the historical metadata of the target data source; and matching the target metadata with the cached metadata to determine the metadata to be output according to a matching result.
In an exemplary embodiment of the present disclosure, after determining the metadata to be output, the method further includes: updating the historical metadata of the second cache region based on the metadata to be output.
In an exemplary embodiment of the present disclosure, the second cache region is a checkpoint cache region, and before the accessing the second cache region, the method further includes: loading the checkpoint cache into a target thread, wherein the target thread is a thread corresponding to the target data source, and the target thread is used for acquiring the metadata to be output of the target data source.
In an exemplary embodiment of the present disclosure, the matching the target metadata with the cached metadata corresponding to the original metadata, determining metadata to be output according to a matching result, includes: matching the target metadata with cache metadata corresponding to the original metadata, and acquiring changed metadata according to a matching result; and determining the changed metadata as the metadata to be output.
In an exemplary embodiment of the present disclosure, there are a plurality of the target data sources, and the task configuration information includes an asynchronous thread number; the method further includes: creating a plurality of metadata extraction tasks having the asynchronous thread number; and extracting metadata from each target data source asynchronously and in real time based on the metadata extraction tasks, so as to obtain the original metadata corresponding to each target data source.
According to an aspect of the present disclosure, there is provided a data processing apparatus comprising: the configuration module is used for acquiring task configuration information, starting a corresponding task flow according to the task configuration information, and determining a target data source of the task flow; the metadata extraction module is used for acquiring original metadata in the target data source and converting the original metadata into target metadata in a preset format when the task stream is executed; and the metadata processing module is used for matching the target metadata with the cache metadata corresponding to the original metadata, and determining metadata to be output according to a matching result.
According to one aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
According to one aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any of the above via execution of the executable instructions.
According to the data processing method in the exemplary embodiments of the present disclosure, on one hand, task configuration information is obtained, a task flow is started according to the task configuration information, and a target data source of the task flow is determined, so that the original metadata of the target data source is obtained when the task flow is executed; the original metadata of external data sources can thus be read based on the task configuration information, access requirements for various data sources can be met as needed, the original metadata can be obtained, and the streaming processing mode improves the real-time performance of discovering the original metadata, thereby improving the data discovery efficiency. On the other hand, after the original metadata is obtained, it is converted into target metadata in a predetermined format, so that metadata from different target data sources can be unified in format; when downstream data asset management is performed, only the metadata in the unified format needs to be converted into data assets and no other additional formatting processing is required, which facilitates interfacing with various downstream data asset management systems and improves the efficiency with which each data asset management system obtains data assets. In still another aspect, by determining the metadata to be output in combination with the matching result of the target metadata and the cached metadata, the matching process can accurately obtain the metadata that needs to be sent to the downstream data asset management system, which avoids sending all the target metadata to the downstream data asset management system in every case, improves the accuracy of data asset discovery, reduces the impact on the downstream data management system, and improves the operation stability of the system.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of a system to which exemplary embodiments of the present disclosure may be applied;
FIG. 2 illustrates a flow chart of a data processing method according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flowchart of a method for determining a target data source for a task flow in accordance with an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a system architecture diagram according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a flowchart of one implementation of obtaining target metadata in a target data source, according to an exemplary embodiment of the present disclosure;
FIG. 6 illustrates a schematic diagram of a data template according to an exemplary embodiment of the present disclosure;
FIG. 7 illustrates a flowchart of one implementation of determining metadata to be output according to an exemplary embodiment of the present disclosure;
FIG. 8 illustrates a system diagram for determining metadata to be output according to an exemplary embodiment of the present disclosure;
FIG. 9 illustrates a data asset discovery Flink flow for a billing data asset management system Alioth according to an exemplary embodiment of the disclosure;
FIG. 10 illustrates a schematic composition of a data processing apparatus according to an exemplary embodiment of the present disclosure;
fig. 11 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus detailed descriptions thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known structures, methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 shows a schematic diagram of a system 100 to which exemplary embodiments of the present disclosure may be applied. As shown in fig. 1, the system 100 may include a server 101 and a terminal 102; data assets may be stored on the server 101 and managed on the terminal 102.
In an exemplary embodiment, the data processing method provided by the exemplary embodiment of the present disclosure may be performed by the server 101, and the corresponding data processing apparatus is disposed in the server 101. Correspondingly, in this manner performed by the server 101, the server 101 may initiate execution of the steps in the solution of the exemplary embodiments of the present disclosure in response to an execution trigger, which may be sent by the terminal used by the user or may be generated locally by the server in response to some automated event.
The server 101 can acquire task configuration information, then start a corresponding task flow according to the task configuration information, and determine a target data source of the task flow; when the task flow is executed, acquire the original metadata of the target data source, and convert the original metadata into target metadata in a predetermined format; and then match the target metadata with the cached metadata corresponding to the original metadata, and determine the metadata to be output according to the matching result.
The server 101 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like. The server 101 may perform background tasks.
Furthermore, in another exemplary embodiment, the terminal 102 may also have a similar function as the server 101, thereby performing the data processing method provided by the exemplary embodiment of the present disclosure.
The terminal 102 may be an edge device such as a smart phone, a computer, etc. The user may perform task configuration through the operation page on the terminal 102, and perform operations such as updating, modifying, deleting, etc. on the data asset. The terminal 102 may also be referred to as a mobile terminal, a terminal device, a mobile device, etc., and the exemplary embodiments of the present disclosure do not limit the type of terminal 102.
In addition, the technical solution of the exemplary embodiment of the present disclosure may also be cooperatively performed by the terminal 102 and the server 101. In this manner, which is cooperatively performed by the terminal 102 and the server 101, some of the steps in the technical solution provided by the exemplary embodiment of the present disclosure are performed by the terminal 102, and other parts of the steps are performed by the server 101. For example, the terminal 102 acquires the task configuration information, and transmits the task configuration information to the server 101, and the server 101 performs the subsequent data processing process.
In this manner, the steps performed by the terminal 102 and the server 101 may be dynamically adjusted according to the actual situation, and are not particularly limited.
Wherein the terminal 102 and the server 101 may be directly or indirectly connected through wireless communication, the exemplary embodiments of the present disclosure are not particularly limited herein.
Data asset management has great significance and impact in a number of ways, for example:
1) Improving decision quality: data asset management can help extract valuable information and insight from massive data and provide accurate and comprehensive data support for decision makers, thereby improving the quality and effect of decisions. 2) Enhancing competitiveness: making full use of data assets can help obtain key information such as market insight, customer behavior patterns and competitor dynamics, for example, understanding market trends, optimizing products and services, improving customer satisfaction, enhancing competitiveness and achieving business growth. 3) Risk management and control: data asset management helps identify and manage potential risks and vulnerabilities. By monitoring, analyzing and reporting the data, abnormal conditions, security threats and compliance problems can be found in time, so that the risks of data leakage, fraudulent behavior and compliance violations are reduced, the security of sensitive data is protected, and data abuse is prevented. 4) Improving efficiency and benefit: data asset management can improve the accessibility, reliability and repeatability of data, simplify the collection, storage and processing flow of data, reduce data management costs and realize the maximum utilization of resources.
Data asset discovery is both a basic function for a data asset management system and one of the key steps in implementing the full flow of data asset management. Data asset discovery refers to the ability to automatically or semi-automatically discover data, including structured data, unstructured data, semi-structured data, etc., and to categorize, describe, and organize these data assets.
In the related art in this field, data are scattered across a plurality of service systems, and the data of each service system are neither exchanged nor shared. Each service system has to report its own data, which makes it difficult to acquire data assets in a standardized way; developing a separate data collection tool for every service system makes the tools too difficult to maintain.
In addition, each service system stores data in different data sources. For example, common structured data such as order information, user information, commodity information and configuration information are strongly related to the service system and its service logic and are generally stored in relational databases such as MySQL and TiDB; unstructured data such as client logs and server logs are generally collected and stored by log collection tools such as Flume and Kafka; there are also common tools providing unstructured data search, such as Elasticsearch. The metadata formats of data in these different storage media (data sources) differ significantly, requiring compatible processing for metadata of different formats.
In addition, after data assets are discovered, all the discovered data assets are sent to the downstream system, so that the data that actually need to be sent to the downstream system cannot be accurately identified, which causes an impact on the downstream system.
As can be seen from one or more of the above problems, the existing methods of discovering data assets have the disadvantages of low efficiency and low accuracy. Based on this, exemplary embodiments of the present disclosure provide a data processing method. Fig. 2 shows a flowchart of a data processing method according to an exemplary embodiment of the present disclosure; referring to fig. 2, the data processing method includes steps S210 to S230, which are described in detail as follows:
in step S210, task configuration information is obtained, a corresponding task flow is started according to the task configuration information, and a target data source of the task flow is determined.
In an exemplary embodiment of the present disclosure, the task configuration information is relevant configuration information for indicating extraction of metadata from a target data source, based on which real-time computing streams may be initiated at different stream computing platforms.
Generally, a data stream may include a bounded stream and an unbounded stream. An unbounded stream is characterized by being unbounded and real-time and by not requiring the whole data set to be available; processing an unbounded stream means processing each piece of data as it is generated in and flows in from a predefined data source (the target data source). The processing of an unbounded data stream is endless: as soon as data in the data source is generated and flows in, it is processed immediately, which ensures the real-time performance of data stream processing. The task flow of the exemplary embodiment of the present disclosure is used for processing such an unbounded data stream.
Exemplary embodiments of the present disclosure may initiate a task flow, such as a Flink task flow, based on the task configuration information. A target data source of the task flow may be determined at the same time based on the task configuration information, i.e., the task flow is executed by retrieving metadata from the target data source. The target data source may be any data source such as ES, MySQL, Redis, TiDB, Kafka, ClickHouse or Hive, and the target data source corresponding to the task flow may be determined based on the task configuration information.
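For illustration only, a minimal Java sketch of starting such an unbounded task flow with the Flink DataStream API is given below; the class names RawMetadata and MetadataSourceFunction, as well as the job name, are assumptions made for this sketch and do not describe the actual implementation of the exemplary embodiments:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class MetadataDiscoveryJob {

    // Hypothetical record type for raw metadata read from a target data source.
    public static class RawMetadata {
        public String sourceId;  // which target data source the record came from
        public String payload;   // raw metadata content
    }

    // Hypothetical unbounded source: keeps running and would emit metadata records.
    public static class MetadataSourceFunction implements SourceFunction<RawMetadata> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<RawMetadata> ctx) throws Exception {
            while (running) {
                // In the described method the ingester reads from the target data
                // source here; this sketch only keeps the unbounded stream alive.
                Thread.sleep(60_000L);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new MetadataSourceFunction())
           .name("metadata-ingestion")
           .print(); // placeholder sink; a real job would route metadata downstream
        env.execute("metadata-discovery-task-flow");
    }
}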
In step S220, when the task flow is executed, the original metadata in the target data source is acquired, and the original metadata is converted into target metadata in a predetermined format.
In the exemplary embodiment of the present disclosure, metadata, also called meta information, intermediary data or relay data, is data describing data, mainly information describing data attributes, used to support functions such as indicating storage locations, historical data, resource searching and file recording.
The formats of the original metadata from different target data sources usually differ considerably; the original metadata can therefore be converted into target metadata in a predetermined format so as to be compatible with different downstream systems.
In step S230, the target metadata is matched with the cached metadata corresponding to the original metadata, and the metadata to be output is determined according to the matching result.
In an exemplary embodiment of the present disclosure, the cached metadata corresponding to the original metadata is historical metadata of the target data source, and is updated with the determined metadata to be output after each execution of the task flow. The metadata to be output is metadata that is determined after execution of the task flow and that needs to be sent to a downstream system.
Matching the target metadata with the cached metadata means comparing the field contents of the target metadata and the cached metadata; according to the comparison result, the metadata to be output, which may be all or part of the target metadata, is determined.
According to the data processing method of the exemplary embodiment of the present disclosure, on one hand, by acquiring the configuration information, starting the task flow according to the configuration information, and determining the target data source of the task flow, the original metadata of the target data source is acquired when the task flow is executed; the original metadata of external data sources can thus be read based on the configuration information, access requirements for various data sources can be met as needed, the original metadata can be obtained, and the streaming processing mode improves the real-time performance of discovering the original metadata, thereby improving the data discovery efficiency. On the other hand, after the original metadata is obtained, it is converted into target metadata in a predetermined format, so that metadata from different target data sources can be unified in format; when downstream data asset management is performed, only the metadata in the unified format needs to be converted into data assets and no other additional formatting processing is required, which facilitates interfacing with various downstream data asset management systems and improves the efficiency with which each data asset management system obtains data assets. In still another aspect, by determining the metadata to be output in combination with the matching result of the target metadata and the cached metadata, the matching process can accurately obtain the metadata that needs to be sent to downstream data asset management, which avoids sending all the target metadata downstream in every case, improves the accuracy of data asset discovery, reduces the impact on downstream data management, and improves the operation stability of the system.
The following describes step S210 to step S230 in detail.
In an exemplary embodiment, an implementation of obtaining task configuration information is provided. Acquiring task configuration information may include: responding to configuration operation aiming at a preset task configuration file, and generating task configuration information according to the task configuration file and configuration information of the configuration operation; wherein the task configuration information is used to indicate that the original metadata is extracted from the target data source.
The preset configuration file is a set of data processing description grammar (configuration grammar), that is, a set of configuration files for describing data processing. Based on the configuration file, corresponding processing logic may be implemented by modifying the contents of the target fields in the configuration file.
In particular, exemplary embodiments of the present disclosure may be applied to a distributed system in which a distributed processing engine and framework, such as Flink, may operate. To improve the processing operability and flexibility of the task flow, the APIs (Application Programming Interfaces) of the distributed processing engine and framework may be secondarily packaged to obtain an SDK (software development kit) based on the distributed processing engine and framework. The SDK defines a configuration grammar and comprises a configuration grammar parser which can parse variables in the configuration grammar into variables readable by the distributed processing engine and framework, thereby realizing the corresponding calculation logic through the parsed variables.
Based on the preset configuration file, the user can obtain task configuration information through configuration operation (such as writing configuration grammar), and further based on the task configuration information, corresponding data processing logic can be realized.
The software development kit based on the distributed processing engine and framework of the exemplary embodiment of the present disclosure integrates rich data input and output connectors (data transmission configurations of data sources) and various data processing operators, realizes a complete dynamic class loading and reflection engine, and can be configured to extend data sources and operators; a real-time computing task flow can be started on any platform, system or program simply by writing configuration as required in the configuration grammar defined by the SDK.
The integrated data input and output connectors include ES, MySQL, Redis, TiDB, Kafka, ClickHouse, Hive and the like, and the integrated data processing operators include Parser, Filter, Aggregator, Sorter, Watermarker, Session, Join, Gauge and various others.
The data processing operator is used for converting input data and outputting the converted input data, and is a data processing unit and comprises input, conversion and output functions, wherein the conversion of the data can be composed of one or a series of functions, and has a specific data conversion function, for example, a filtering operator is used for filtering the input data under specific conditions and outputting a filtering result, an aggregation operator is used for aggregating the input data in a specific mode and outputting an aggregation result, and the specific conversion function of the operator is not limited herein. The data processing method of the exemplary embodiment of the present disclosure may determine a data acquisition operator based on task configuration information to start a task flow based on the data acquisition operator and perform a subsequent data processing procedure to determine metadata to be output.
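As an illustration of how such operators compose, the following Java sketch chains a filtering operator and an aggregation operator on a Flink DataStream; the element type, key and threshold are illustrative assumptions only and are not part of the disclosed operator set:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative input: (table name, record count) pairs.
        DataStream<Tuple2<String, Integer>> input = env.fromElements(
                Tuple2.of("orders", 3), Tuple2.of("users", 0), Tuple2.of("orders", 5));

        input.filter(t -> t.f1 > 0)   // filtering operator: drop records matching a condition
             .keyBy(t -> t.f0)        // group by key (here, the table name)
             .sum(1)                  // aggregation operator: sum counts per key
             .print();

        env.execute("operator-sketch");
    }
}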
Further, as described above, the task configuration file includes data transmission configurations of a plurality of data sources, based on which a target data source of the task flow may be determined through a configuration operation of a user. Fig. 3 shows a flowchart of determining a target data source of a task flow according to an exemplary embodiment of the present disclosure; the process includes steps S310 and S320:
step S310: determining a target data source from a plurality of data sources according to source data type information in configuration information of configuration operation;
step S320: and calling the data transmission configuration of the target data source based on the connection configuration information in the configuration information of the configuration operation so as to establish the data transmission connection with the target data source.
Specifically, the configuration information of the configuration operation includes source data type information, such as a "connection.type" field, and connection configuration information, such as a "properties" field. The task configuration information can be obtained by configuring these two fields of the task configuration file. The target data source can be determined according to the "connection.type" field, and the data transmission configuration of the target data source is invoked according to the "properties" field to establish a data transmission connection with the target data source, such as establishing a data transmission connection between the ingester and the target data source.
For example, if the target data source is a relational database, such as MySQL or TiDB, the configuration includes: "connection.type: JDBC", "properties: xxx; url: jdbc:mysql://xxx……". The task configuration information obtained based on the configuration operation for the preset task configuration file can then be used to obtain metadata of the corresponding database. For another example, if the target data source is a message storage system, such as Kafka, the configuration includes: "connection.type: KAFKA", "properties: xxx; url: KAFKA……". For another example, if the target data source is a full text search engine, such as Elasticsearch, the configuration includes: "connection.type: ELASTIC", "properties: xxx; url: http://xxx……".
Based on the configuration operation for the preset task configuration file, even a new data source can be quickly interfaced and a data transmission connection established. That is, different data sources may be requested and associated with the task flow through a simple configuration grammar, without the programming rules defined by a complex distributed processing framework.
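A hypothetical Java sketch of this configuration-driven selection is shown below; the Connector interface, the factory logic and the printed messages are assumptions for illustration and are not the SDK's actual API, although the "connection.type" and "properties" keys mirror the fields described above:

import java.util.Map;

public class ConnectorFactory {

    // Hypothetical minimal connector abstraction.
    public interface Connector {
        void connect(String properties); // e.g. "xxx; url: jdbc:mysql://..."
    }

    public static Connector create(Map<String, String> taskConfig) {
        String type = taskConfig.get("connection.type"); // e.g. JDBC, KAFKA, ELASTIC
        String properties = taskConfig.get("properties");

        Connector connector;
        switch (type) {
            case "JDBC":    connector = props -> System.out.println("JDBC connect: " + props);    break;
            case "KAFKA":   connector = props -> System.out.println("Kafka connect: " + props);   break;
            case "ELASTIC": connector = props -> System.out.println("Elastic connect: " + props); break;
            default: throw new IllegalArgumentException("Unsupported connection.type: " + type);
        }
        connector.connect(properties); // establish the data transmission connection
        return connector;
    }
}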
A system architecture diagram according to an exemplary embodiment of the present disclosure is shown in fig. 4, and a process of acquiring task configuration information, starting a corresponding task flow according to the task configuration information, and determining a target data source of the task flow is described below in connection with fig. 4.
Referring to fig. 4, a metadata ingest manager may be created and initialized based on the source data type information in the configuration information of the configuration operation, and is responsible for managing the creation, start-up, operation and destruction of the metadata ingester as well as the reception and transmission of metadata. The metadata ingester, hereinafter referred to as the ingester for short, is responsible for interacting with the target data source to perform metadata ingestion and metadata generation; it is created by the metadata ingest manager, is initialized based on the connection configuration information in the configuration information of the configuration operation, is connected with the target data source, and waits for scheduling by the metadata ingest manager. Once an ingestion instruction of the metadata ingest manager is received, the ingester reads the original metadata from the target data source through the connection.
The metadata buffer may be used to store the original metadata read by the ingester. The metadata ingest manager is also responsible for sending the determined metadata to downstream systems, and for destroying the ingester when the present task flow needs to be stopped.
It should be noted that different data sources store data of different structures, so different types of metadata ingesters differ in their ingestion process and in the format of the ingested metadata. For example, to ingest metadata from a relational database, the following instruction needs to be executed:
SQL("SELECT TABLE_SCHEMA,TABLE_NAME FROM INFORMATION_SCHEMA.TABLES")。
The database name and table name are read from the SQL execution result, and the specific table structure of each table is then read according to the database name and table name. Briefly, the data asset meta-information of a relational database is stored in a ResultSet object. Ingestion of metadata by ingesters of the Kafka and Elasticsearch types is comparatively simple: at initialization, the ingester establishes a data transmission connection to the target data source as a client and directly invokes the metadata retrieval method provided by the client connection; the metadata of Kafka data assets is stored in Collection<TopicListing> objects, and the metadata of Elasticsearch data assets is stored in AliasesResponse objects.
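The two ingestion styles can be sketched in Java as follows, assuming a JDBC driver and the kafka-clients library are on the classpath; the connection parameters are placeholders and the method names are not the ingester's actual interface:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicListing;

public class IngestionSketch {

    // Relational ingestion: query INFORMATION_SCHEMA, then read each table's structure.
    static void ingestRelational(String jdbcUrl, String user, String password) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES")) {
            while (rs.next()) {
                String schema = rs.getString("TABLE_SCHEMA");
                String table = rs.getString("TABLE_NAME");
                // A real ingester would now read the table structure for (schema, table).
                System.out.println("found table: " + schema + "." + table);
            }
        }
    }

    // Kafka ingestion: the client connection directly exposes a metadata listing call.
    static void ingestKafka(String bootstrapServers) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        try (AdminClient admin = AdminClient.create(props)) {
            Collection<TopicListing> topics = admin.listTopics().listings().get();
            topics.forEach(t -> System.out.println("found topic: " + t.name()));
        }
    }
}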
That is, after the metadata ingest manager and the ingester are created based on the configuration information, ingesters of different types have their respective information extraction manners and format types of extracted metadata, which may be preconfigured according to actual requirements; the exemplary embodiments of the present disclosure place no particular limitation on the information extraction manner corresponding to an ingester or on the format type of the extracted metadata.
According to the exemplary embodiment of the present disclosure, the task configuration information can be obtained through a configuration operation based on the preset task configuration file, so that an operator is determined based on the task configuration information and used for extracting metadata of the target data source. Various types of data sources can therefore be conveniently interfaced with and their metadata extracted; the configuration is convenient and highly universal, the professional skill required of operating personnel is low, the operability of acquiring metadata is improved, and the access requirements for various data sources are met.
In an exemplary embodiment, it is considered that data is constantly generated and near-real-time analysis may be achieved through streaming. In practice, however, the update frequency of metadata is generally lower than the data generation frequency, because metadata updates are generally driven by external factors, such as changes in data sources, data formats or data requirements, and only such changes cause metadata updates; for example, tables are added or deleted, column information is changed, or an index is updated. Thus, metadata updates are low-frequency operations, and there is no need to continuously capture metadata from the target data source in the streaming real-time job. Accordingly, an implementation of obtaining target metadata from a target data source is provided in the exemplary embodiments of the present disclosure.
As shown in fig. 5, when the task flow is executed, acquiring the original metadata in the target data source and converting the original metadata into the target metadata in the predetermined format may include steps S510 to S530:
step S510: when executing task flow, the metadata extraction task is triggered regularly based on timer configuration.
The task configuration information includes a timer configuration. Referring to fig. 4, a timer may be registered during the initialization stage of the metadata ingest manager, and relevant configuration such as the trigger time interval of the timer may be obtained by configuring the task configuration file. For example, the timer configuration may include: "interval.ms: 60000". The metadata extraction task may be triggered periodically based on the timer configuration, so as to trigger the ingester to extract the original metadata in the target data source.
Step S520: and extracting the metadata from the target data source according to the information extraction mode corresponding to the metadata extraction task to obtain the original metadata.
After the metadata extraction task is triggered, metadata extraction can be performed on the target data source according to an information extraction mode corresponding to the metadata extraction task (i.e., an information extraction mode corresponding to the ingester) so as to obtain original metadata.
Based on timer configuration, the timing task can be set according to actual conditions so as to update relevant information in time when metadata changes and optimize data processing efficiency.
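A minimal Java sketch of such a timer-driven trigger is given below, using a plain ScheduledExecutorService; the 60000 ms interval mirrors the "interval.ms: 60000" example above, and the runExtraction hook is a hypothetical stand-in for the ingester's extraction task:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ExtractionTimer {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(long intervalMs, Runnable runExtraction) {
        // Triggers the metadata extraction task periodically; because metadata changes
        // are low-frequency, there is no need to poll the data source continuously.
        scheduler.scheduleAtFixedRate(runExtraction, 0L, intervalMs, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        ExtractionTimer timer = new ExtractionTimer();
        timer.start(60_000L, () -> System.out.println("extract metadata from target data source"));
    }
}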
Step S530: the original metadata is formatted to obtain target metadata having a predetermined format.
Because the metadata obtained from the target data source is to be sent to the downstream system, and different metadata extraction tasks have different information extraction modes (i.e., the information extraction modes corresponding to the ingesters) and different format types of extracted metadata, the metadata is converted into a unified format so that the downstream system only needs to convert the received metadata into data assets, with no additional formatting processing required, thereby avoiding any redundant processing on the downstream side. Exemplary embodiments of the present disclosure also provide an implementation of formatting the original metadata. The process may include:
Acquiring a data template;
and parsing the original metadata, and filling the parsing result into a data template to obtain target metadata having the predetermined format.
The data template is used for indicating the format of the target metadata, and the target metadata having the predetermined format can be obtained by parsing the original metadata and filling the parsing result into the corresponding positions of the data template.
By way of example, a schematic diagram of a data template of an exemplary embodiment of the present disclosure is shown in fig. 6, which may include the following: a unique identifier of the metadata extractor, a unique identifier of the metadata, a metadata name, metadata content, whether the metadata is available, the time when the metadata was discovered, etc. The parsing result of the original metadata can be filled into the data template to obtain the target metadata having the predetermined format. Of course, fig. 6 is merely an example of a data template; the exemplary embodiments of the present disclosure do not particularly limit the specific content of the data template, which may be adjusted according to actual data requirements.
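For illustration, a Java sketch of a target-metadata record mirroring the template fields listed above is given below; the field and method names are assumptions and do not reproduce the exact template of fig. 6:

import java.time.Instant;

public class TargetMetadata {
    public String extractorId;   // unique identifier of the metadata extractor (ingester)
    public String metaId;        // unique identifier of the metadata
    public String metaName;      // metadata name, e.g. "db_name.table_name"
    public String content;       // metadata content, e.g. a serialized table structure
    public boolean available;    // whether the metadata is available
    public Instant discoveredAt; // time when the metadata was discovered

    // Filling the template: parse the original metadata and copy the results into the fields.
    public static TargetMetadata fill(String extractorId, String metaId,
                                      String metaName, String parsedContent) {
        TargetMetadata t = new TargetMetadata();
        t.extractorId = extractorId;
        t.metaId = metaId;
        t.metaName = metaName;
        t.content = parsedContent;
        t.available = true;
        t.discoveredAt = Instant.now();
        return t;
    }
}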
In an exemplary embodiment, after converting the original metadata into target metadata in a predetermined format, the target metadata may be stored in the first cache region, such as the metadata cache shown in fig. 4.
Further, in an exemplary embodiment, an implementation of determining metadata to be output is also provided. As shown in fig. 7, matching the target metadata with the cache metadata corresponding to the target metadata, and determining the metadata to be output according to the matching result may include steps S710 to S730:
step S710: and acquiring target metadata from the first cache region, wherein the target metadata comprises metadata identification.
The target metadata may be obtained from the first cache region by the metadata ingest manager, the metadata identification being a unique identification of the target metadata, see "Public String Meta ID" in the data template of fig. 6.
Step S720: and accessing a second cache region, acquiring cache metadata corresponding to the metadata identification from the second cache region, wherein the second cache region stores historical metadata of the target data source.
The second buffer area is used for storing the historical metadata of the target data source, and the metadata of the target data source is stored in the local storage in a lasting way in the real-time streaming (task streaming) operation process, namely the data in the second buffer area cannot be lost. When the real-time stream is started for the first time, the second buffer area is initialized to be empty, all the ingested metadata are stored in the second buffer area, and the buffer metadata of the second buffer area are updated after each task stream is executed.
Step S730: and matching the target metadata with the cache metadata to determine metadata to be output according to a matching result.
The target metadata and the cache metadata corresponding to the original metadata can be matched, the changed metadata is obtained according to the matching result, and the changed metadata is determined to be the metadata to be output.
Specifically, whether the contents of each field of the target metadata and the cached metadata are identical may be compared; if no cached metadata corresponding to the metadata identifier exists in the second cache region, or the metadata contents are not identical, the target metadata is determined as metadata to be output. Conversely, target metadata that has not changed is not metadata to be output.
Based on this, in the exemplary embodiment of the present disclosure, except for the first initial run of the task flow, only the changed (updated) target metadata is automatically discovered as metadata to be output during each run of the task flow.
Because the metadata to be output is sent to the downstream system, taking only the changed metadata as the metadata to be output ensures that metadata already received downstream will not be sent repeatedly, which greatly reduces the data processing pressure of the downstream system.
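The matching step can be sketched in Java as follows; the in-memory map is only an illustrative stand-in for the second cache region (checkpoint cache), not the disclosed storage mechanism:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class MetadataMatcher {

    // metadata identifier -> last known metadata content (the cached metadata)
    private final Map<String, String> cache = new HashMap<>();

    // Returns true if the target metadata should be output (new or changed content).
    public boolean isToBeOutput(String metaId, String content) {
        String cached = cache.get(metaId);
        return cached == null || !Objects.equals(cached, content);
    }

    // Updates the historical metadata after the metadata to be output is determined.
    public void update(String metaId, String content) {
        cache.put(metaId, content);
    }
}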
In an exemplary embodiment, after determining the metadata to be output, the historical metadata of the second cache region may also be updated based on the metadata to be output. Further, when the task flow is executed later, the metadata that has not yet been received downstream can be acquired each time based on the historical metadata.
In an exemplary embodiment, the second cache region is a checkpoint cache region, which is suited to handling large and complex data sets by writing data to the file system and cutting off its association with previous dependencies.
Before the second cache region is accessed, the checkpoint cache can be loaded into a target thread, wherein the target thread is a thread corresponding to the target data source, and the target thread is used for acquiring the metadata to be output of the target data source.
Based on the checkpoint caching mechanism, each time the task flow is executed, the checkpoint cache is loaded into the main thread; only the checkpoint cache needs to be loaded and accessed to compare the target metadata with the cached metadata. Because the checkpoint cache keeps metadata processing and caching in the same process, cache requests are fast, excessive network overhead is avoided, and the real-time performance of the task flow is not affected at all.
Fig. 8 is a system diagram illustrating a determination of metadata to be output according to an exemplary embodiment of the present disclosure, and a description will be given below of a process of determining metadata to be output by taking a metadata ingest manager as an execution subject and taking a process of executing a task flow as an example, with reference to fig. 8.
Firstly, triggering a metadata extraction task based on timer configuration, acquiring original metadata by an ingester, formatting the obtained original metadata to obtain target metadata with a preset format, and storing the target metadata in a first cache region.
Secondly, a thread in the metadata ingest manager continuously accesses the first cache region to acquire target metadata, and the corresponding cached metadata is acquired by accessing the checkpoint cache region according to the metadata identifier.
Then, the target metadata is matched against the field contents of the cached metadata. If no cached metadata corresponding to the metadata identifier exists in the checkpoint cache region, or the metadata contents are not completely identical, the target metadata is routed to the standard output so that the downstream system perceives the metadata change and executes the corresponding processing logic; unchanged target metadata is routed to the side output, so that it is not perceived by the downstream system and no processing is performed.
The metadata ingest manager may send metadata to be output to a downstream system in a unified and general JSON format, and the exemplary embodiment of the present disclosure may determine a specific JSON format according to actual requirements, and the specific format is not limited in particular.
Finally, the historical metadata of the second cache region (checkpoint cache) is updated based on the metadata to be output (the standard output).
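One possible way to express this routing in Flink is sketched below, where a KeyedProcessFunction's MapState plays the role of the checkpoint cache (it is persisted with Flink checkpoints) and records are modeled as Tuple2<metaId, content>; this mapping is an assumption for illustration, not the disclosed implementation:

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class MetadataDiffFunction
        extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>> {

    // Unchanged metadata goes to this side output and is not sent downstream.
    public static final OutputTag<Tuple2<String, String>> UNCHANGED =
            new OutputTag<Tuple2<String, String>>("unchanged-metadata") {};

    private transient MapState<String, String> checkpointCache;

    @Override
    public void open(Configuration parameters) {
        checkpointCache = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("checkpoint-cache", String.class, String.class));
    }

    @Override
    public void processElement(Tuple2<String, String> meta, Context ctx,
                               Collector<Tuple2<String, String>> out) throws Exception {
        String cached = checkpointCache.get(meta.f0);
        if (cached == null || !cached.equals(meta.f1)) {
            out.collect(meta);                     // standard output: new or changed metadata
            checkpointCache.put(meta.f0, meta.f1); // update the historical metadata
        } else {
            ctx.output(UNCHANGED, meta);           // side output: the downstream system is not notified
        }
    }
}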
In the above process, if all the target metadata extracted each time were sent to the downstream system, it would be difficult to support such a large amount of data processing, since the data processing capability of a typical downstream system is weaker than that of the real-time computation; in particular, outputting all target metadata at once every time the real-time task flow is started may cause an impact on the downstream system. The exemplary embodiment of the present disclosure introduces a checkpoint cache mechanism, and the target metadata can be matched with the cached metadata simply by accessing the checkpoint cache, which improves the accuracy of discovering data assets and effectively reduces the adverse effect on the downstream system.
In an exemplary embodiment, there are a plurality of target data sources, and the corresponding task configuration information may include an asynchronous thread number. Further, exemplary embodiments of the present disclosure may further include:
creating a plurality of metadata extraction tasks having asynchronous thread numbers;
and extracting metadata from each target data source in real time and asynchronously based on the plurality of metadata extraction tasks to obtain the original metadata corresponding to each target data source.
Specifically, the ingester of the exemplary embodiment of the present disclosure is implemented in an asynchronous manner, and a plurality of metadata extraction tasks respectively perform metadata extraction for one target data source, so as to obtain original metadata corresponding to each target data source.
Correspondingly, for the original metadata corresponding to each target data source, the original metadata can be formatted in real time and asynchronously to obtain the respective target metadata of each target data source, and the respective target metadata of each target data source are matched with the corresponding cache metadata to obtain the respective metadata to be output of each target data source.
Based on the method, a plurality of target data sources can be connected in parallel and asynchronously, the access requirements for the plurality of target data sources are met, and the efficiency of data asset discovery is improved.
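A brief Java sketch of this asynchronous arrangement is given below, using a fixed thread pool sized by the configured asynchronous thread number with one extraction task per target data source; the task bodies are placeholders only:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncExtraction {

    public static void extractAll(List<Runnable> perSourceExtractionTasks, int asyncThreadNumber) {
        ExecutorService pool = Executors.newFixedThreadPool(asyncThreadNumber);
        // Each task extracts the original metadata of one target data source.
        perSourceExtractionTasks.forEach(pool::submit);
        pool.shutdown();
    }

    public static void main(String[] args) {
        List<Runnable> tasks = List.of(
                () -> System.out.println("extract metadata from TiDB"),
                () -> System.out.println("extract metadata from Kafka"));
        extractAll(tasks, 2); // asynchronous thread number taken from the task configuration
    }
}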
Fig. 9 illustrates a data asset discovery Flink flow of a billing data asset management system Alioth according to an exemplary embodiment of the disclosure. The task flow takes a relational database as a target data source. As shown in fig. 9, the process may include:
Step S910, obtaining metadata to be output: reading the metadata table of the TiDB cluster, generating the metadata, and judging whether it needs to be sent to the downstream system.
Step S920, data processing: the received metadata is taken as the body portion of the request to the downstream system.
Step S930, data aggregation: calling the Alioth data asset creation API interface to convert the metadata into a managed asset in the Alioth system.
When a database or a table is newly added to the TiDB database read by the data source, the corresponding update can be seen in the downstream system immediately, i.e., the system gains one more asset named after the database name and table name; likewise, when a database table is deleted, the corresponding asset can be seen to be automatically taken offline.
It should be noted that, in step S910, the metadata may be automatically found in real time by performing the configuration operation based on the task configuration file, i.e. writing the configuration, so as to improve the efficiency and accuracy of finding the data asset, and details of step S910 are described in the above exemplary embodiments and are not described herein.
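For completeness, a heavily simplified Java sketch of sending the metadata to be output to a downstream asset-creation API is shown below (assuming Java 11+ HttpClient); the endpoint URL, request body and absence of authentication are placeholders and do not describe Alioth's real interface:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AliothSyncSketch {

    static void sendToDownstream(String metadataJson) throws Exception {
        // S920: the metadata to be output becomes the body portion of the request;
        // S930: the downstream data asset creation API turns it into a managed asset.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://alioth.example.com/api/asset/create")) // placeholder URL
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(metadataJson))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("asset creation status: " + response.statusCode());
    }
}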
According to the data processing method in the exemplary embodiment of the present disclosure, on one hand, by acquiring the configuration information, starting the task flow according to the configuration information, and determining the target data source of the task flow, the original metadata of the target data source is acquired when the task flow is executed; the original metadata of external data sources can thus be read based on the configuration information, access requirements for various data sources can be met as needed, the original metadata can be obtained, and the streaming processing mode improves the real-time performance of discovering the original metadata, thereby improving the data discovery efficiency. On the other hand, after the original metadata is obtained, it is converted into target metadata in a predetermined format, so that metadata from different target data sources can be unified in format; when downstream data asset management is performed, only the metadata in the unified format needs to be converted into data assets and no other additional formatting processing is required, which facilitates interfacing with various downstream data asset management systems and improves the efficiency with which each data asset management system obtains data assets. In still another aspect, by determining the metadata to be output in combination with the matching result of the target metadata and the cached metadata, the matching process can accurately obtain the metadata that needs to be sent to downstream data asset management, which avoids sending all the target metadata downstream in every case, improves the accuracy of data asset discovery, reduces the impact on downstream data management, and improves the operation stability of the system. In addition, the output metadata is accurately consistent: even if the data asset discovery task of the task flow is executed multiple times, the metadata that needs to be sent to the downstream system is guaranteed to be consistent.
In an exemplary embodiment of the present disclosure, a data processing apparatus is also provided. Referring to fig. 10, the data processing apparatus 1000 may include a configuration module 1010, a metadata extraction module 1020, and a metadata processing module 1030. Specifically:
the configuration module 1010 is configured to obtain task configuration information, start a corresponding task flow according to the task configuration information, and determine a target data source of the task flow;
the metadata extraction module 1020 is configured to obtain original metadata in the target data source and convert the original metadata into target metadata in a predetermined format when the task flow is executed;
and the metadata processing module 1030 is configured to match the target metadata with the cached metadata corresponding to the original metadata, and determine metadata to be output according to a matching result.
In one exemplary embodiment of the present disclosure, the configuration module 1010 is configured to perform: responding to configuration operation aiming at a preset task configuration file, and generating task configuration information according to the task configuration file and configuration information of the configuration operation; wherein the task configuration information is used to indicate that original metadata is extracted from the target data source.
In an exemplary embodiment of the present disclosure, the task profile includes a data transmission configuration of a plurality of data sources, and the configuration module 1010 is configured to perform: determining the target data source from the plurality of data sources according to source data type information in configuration information of the configuration operation; and calling the data transmission configuration of the target data source based on the connection configuration information in the configuration information of the configuration operation so as to establish data transmission connection with the target data source.
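As an illustration of the selection logic described above, the following plain-Java sketch assumes that the task configuration file has already been parsed into a map of per-source transmission configurations; the DataSourceConfig class, its fields, and the {host}/{port} placeholders are hypothetical names introduced only for the sketch.

import java.util.Map;

public class DataSourceSelector {

    // Transmission configuration for one data source type, as read from the task profile (assumed shape).
    public static class DataSourceConfig {
        public final String sourceType;  // e.g. "tidb", "mysql", "hive"
        public final String jdbcUrl;     // connection endpoint template
        public DataSourceConfig(String sourceType, String jdbcUrl) {
            this.sourceType = sourceType;
            this.jdbcUrl = jdbcUrl;
        }
    }

    // Picks the target data source by the source data type named in the configuration operation,
    // then substitutes the operator-supplied connection information into that source's template.
    public static DataSourceConfig selectTarget(Map<String, DataSourceConfig> profile,
                                                String requestedType,
                                                String host, int port) {
        DataSourceConfig template = profile.get(requestedType);
        if (template == null) {
            throw new IllegalArgumentException("unsupported data source type: " + requestedType);
        }
        String url = template.jdbcUrl
                .replace("{host}", host)
                .replace("{port}", Integer.toString(port));
        return new DataSourceConfig(template.sourceType, url);
    }
}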
In one exemplary embodiment of the present disclosure, the task configuration information includes a timer configuration; the metadata extraction module 1020 is configured to perform: when executing the task flow, triggering metadata extraction tasks based on the timer configuration; extracting metadata from the target data source according to an information extraction mode corresponding to the metadata extraction task to obtain the original metadata; and formatting the original metadata to obtain target metadata with the preset format.
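A minimal sketch of the timer-driven extraction, using a standard ScheduledExecutorService in place of whatever timer mechanism the task flow actually uses; extractRawMetadata() and format() are hypothetical placeholders for the real extraction and formatting steps.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TimedExtraction {
    public static void main(String[] args) {
        long intervalSeconds = 60;  // value taken from the timer configuration (assumed unit)
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            String raw = extractRawMetadata();  // pull the original metadata from the target data source
            String formatted = format(raw);     // convert it into the predetermined format
            System.out.println("target metadata: " + formatted);
        }, 0, intervalSeconds, TimeUnit.SECONDS);
    }

    private static String extractRawMetadata() { return "{\"table\":\"demo\"}"; }
    private static String format(String raw)   { return raw; /* real code would normalize the fields */ }
}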
In one exemplary embodiment of the present disclosure, the metadata extraction module 1020 is configured to perform: acquiring a data template; and analyzing the original metadata, and filling the analysis result into the data template to obtain the target metadata with the preset format.
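The template-filling step could look like the following sketch, which assumes the predetermined format is a flat key/value structure; the slot names (sourceType, database, table, columns) are assumptions about what such a data template might contain.

import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataTemplate {
    // Parses one raw metadata record into its parts and fills them into a fixed-shape template.
    public static Map<String, Object> fill(String database, String table, Map<String, String> columns) {
        Map<String, Object> target = new LinkedHashMap<>();
        target.put("sourceType", "relational");  // fixed slot defined by the template
        target.put("database", database);        // slots filled from the parsed original metadata
        target.put("table", table);
        target.put("columns", columns);
        return target;                           // target metadata in the predetermined format
    }
}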
In one exemplary embodiment of the present disclosure, the metadata extraction module 1020 is configured to perform: storing the target metadata in a first cache region; the metadata processing module 1030 is configured to perform: acquiring the target metadata from the first cache region, wherein the target metadata comprises metadata identification; accessing a second cache region, and acquiring the cache metadata corresponding to the metadata identification from the second cache region, wherein the second cache region stores the history metadata of the target data source; and matching the target metadata with the cache metadata to determine the metadata to be output according to a matching result.
In one exemplary embodiment of the present disclosure, the metadata processing module 1030 is configured to perform: updating the history metadata of the second cache region based on the metadata to be output.
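The matching and history-update logic can be summarized by the following plain-Java sketch, in which two in-memory maps stand in for the first and second cache regions, and string equality of the metadata content is an assumed matching rule.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MetadataMatcher {
    // Returns only the metadata whose content differs from the cached history,
    // and updates the history cache with the records that will be output.
    public static List<Map.Entry<String, String>> diffAndUpdate(
            Map<String, String> firstCache,     // metadata identification -> target metadata
            Map<String, String> historyCache) { // metadata identification -> cached metadata
        List<Map.Entry<String, String>> toOutput = new ArrayList<>();
        for (Map.Entry<String, String> entry : firstCache.entrySet()) {
            String cached = historyCache.get(entry.getKey());
            if (cached == null || !cached.equals(entry.getValue())) {
                toOutput.add(entry);                                 // changed metadata is output
                historyCache.put(entry.getKey(), entry.getValue());  // keep the history in sync
            }
        }
        return toOutput;
    }
}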
In an exemplary embodiment of the present disclosure, the second cache region is a checkpoint cache, and the metadata processing module 1030 is configured to perform: loading the checkpoint cache to a target thread, wherein the target thread is a thread corresponding to the target data source, and the target thread is used for acquiring the metadata to be output of the target data source.
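A sketch of how the checkpoint cache might be loaded into the thread serving one target data source before matching begins; CheckpointStore is a hypothetical abstraction over whatever snapshot mechanism is actually used (for example, the checkpointed state of the Flink flow in Fig. 9).

import java.util.HashMap;
import java.util.Map;

public class CheckpointLoader {
    // Hypothetical snapshot store keyed by data source name.
    public interface CheckpointStore {
        Map<String, String> loadSnapshot(String dataSourceName);
    }

    // History metadata local to the thread that handles one target data source.
    private static final ThreadLocal<Map<String, String>> HISTORY =
            ThreadLocal.withInitial(HashMap::new);

    // Loads the checkpointed cache into the current (per-source) target thread before matching.
    public static void loadIntoCurrentThread(CheckpointStore store, String dataSourceName) {
        Map<String, String> snapshot = store.loadSnapshot(dataSourceName);
        if (snapshot != null) {
            HISTORY.get().putAll(snapshot);
        }
    }

    public static Map<String, String> currentHistory() {
        return HISTORY.get();
    }
}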
In one exemplary embodiment of the present disclosure, the metadata processing module 1030 is configured to perform: matching the target metadata with cache metadata corresponding to the original metadata, and acquiring changed metadata according to a matching result; and determining the changed metadata as the metadata to be output.
In an exemplary embodiment of the present disclosure, the number of the target data sources is a plurality, and the task configuration information includes an asynchronous thread number;
the metadata extraction module 1020 is configured to perform: creating a plurality of metadata extraction tasks having the asynchronous thread number; and extracting metadata from each target data source in real time and asynchronously based on the metadata extraction tasks to obtain original metadata corresponding to each target data source.
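The asynchronous extraction could be sketched as follows, with a fixed thread pool sized by the asynchronous thread number and one extraction task per target data source; extractFrom() is a hypothetical stand-in for the real per-source extraction call.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class AsyncExtraction {
    public static void extractAll(List<String> targetDataSources, int asyncThreadNumber) {
        ExecutorService pool = Executors.newFixedThreadPool(asyncThreadNumber);
        List<CompletableFuture<String>> futures = targetDataSources.stream()
                .map(source -> CompletableFuture.supplyAsync(() -> extractFrom(source), pool))
                .collect(Collectors.toList());
        // Each future yields the original metadata of one target data source.
        futures.forEach(f -> System.out.println(f.join()));
        pool.shutdown();
    }

    private static String extractFrom(String source) {
        return "original metadata of " + source;  // a real task would connect to the source here
    }
}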
Since the details of the respective functional modules of the data processing apparatus of the exemplary embodiment of the present disclosure have been described in the exemplary embodiments of the data processing method described above, a detailed description thereof is omitted herein.
It should be noted that although in the above detailed description several modules or units of the data processing apparatus are mentioned, this division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, in exemplary embodiments of the present disclosure, a computer storage medium capable of implementing the above-described method is also provided, on which a program product capable of implementing the method described above in this specification is stored. In some possible embodiments, the various aspects of the present disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification when the program product is run on the terminal device; for example, the following steps may be carried out:
acquiring task configuration information, starting a corresponding task flow according to the task configuration information, and determining a target data source of the task flow; when the task flow is executed, acquiring original metadata in the target data source, and converting the original metadata into target metadata in a preset format; and matching the target metadata with the cache metadata corresponding to the original metadata, and determining metadata to be output according to a matching result.
In an exemplary embodiment of the present disclosure, the acquiring task configuration information includes:
Responding to configuration operation aiming at a preset task configuration file, and generating task configuration information according to the task configuration file and configuration information of the configuration operation; wherein the task configuration information is used to indicate that original metadata is extracted from the target data source.
In an exemplary embodiment of the present disclosure, the task profile includes a data transmission configuration of a plurality of data sources, and the determining the target data source of the task flow includes: determining the target data source from the plurality of data sources according to source data type information in configuration information of the configuration operation; and calling the data transmission configuration of the target data source based on the connection configuration information in the configuration information of the configuration operation so as to establish data transmission connection with the target data source.
In one exemplary embodiment of the present disclosure, the task configuration information includes a timer configuration; when the task flow is executed, the original metadata in the target data source is obtained, and the original metadata is converted into target metadata in a preset format, which comprises the following steps: when executing the task flow, triggering metadata extraction tasks based on the timer configuration; extracting metadata from the target data source according to an information extraction mode corresponding to the metadata extraction task to obtain the original metadata; and formatting the original metadata to obtain target metadata with the preset format.
In an exemplary embodiment of the present disclosure, the formatting the original metadata to obtain the target metadata having the predetermined format includes: acquiring a data template; and analyzing the original metadata, and filling the analysis result into the data template to obtain the target metadata with the preset format.
In an exemplary embodiment of the present disclosure, after the converting the original metadata into target metadata in a predetermined format, the target metadata is stored in a first cache region; the matching the target metadata with the cache metadata corresponding to the original metadata, and determining metadata to be output according to a matching result includes: acquiring the target metadata from the first cache region, wherein the target metadata comprises metadata identification; accessing a second cache region, and acquiring the cache metadata corresponding to the metadata identification from the second cache region, wherein the second cache region stores the history metadata of the target data source; and matching the target metadata with the cache metadata to determine the metadata to be output according to a matching result.
In an exemplary embodiment of the present disclosure, after determining the metadata to be output, the method further includes: updating the history metadata of the second cache region based on the metadata to be output.
In an exemplary embodiment of the present disclosure, the second cache region is a checkpoint cache, and before accessing the second cache region, the method further includes: loading the checkpoint cache to a target thread, wherein the target thread is a thread corresponding to the target data source, and the target thread is used for acquiring metadata to be output of the target data source.
In an exemplary embodiment of the present disclosure, the matching the target metadata with the cached metadata corresponding to the original metadata, determining metadata to be output according to a matching result, includes: matching the target metadata with cache metadata corresponding to the original metadata, and acquiring changed metadata according to a matching result; and determining the changed metadata as the metadata to be output.
In an exemplary embodiment of the present disclosure, the number of the target data sources is a plurality, and the task configuration information includes an asynchronous thread number; the method further comprises the steps of: creating a plurality of metadata extraction tasks having the asynchronous thread number; and extracting metadata from each target data source in real time and asynchronously based on the metadata extraction tasks to obtain original metadata corresponding to each target data source.
The program product may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided. Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 1100 according to such an embodiment of the present disclosure is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is merely an example and should not be construed as limiting the functionality and scope of use of the disclosed embodiments.
As shown in fig. 11, the electronic device 1100 is embodied in the form of a general purpose computing device. Components of electronic device 1100 may include, but are not limited to: the at least one processing unit 1110, the at least one memory unit 1120, a bus 1130 connecting the different system components (including the memory unit 1120 and the processing unit 1110), and a display unit 1140.
Wherein the storage unit stores program code that is executable by the processing unit 1110 such that the processing unit 1110 performs steps according to various exemplary embodiments of the present disclosure described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 1110 may perform the steps as follows: acquiring task configuration information, starting a corresponding task flow according to the task configuration information, and determining a target data source of the task flow; when the task flow is executed, acquiring original metadata in the target data source, and converting the original metadata into target metadata in a preset format; and matching the target metadata with the cache metadata corresponding to the original metadata, and determining metadata to be output according to a matching result.
In an exemplary embodiment of the present disclosure, the acquiring task configuration information includes:
responding to configuration operation aiming at a preset task configuration file, and generating task configuration information according to the task configuration file and configuration information of the configuration operation; wherein the task configuration information is used to indicate that original metadata is extracted from the target data source.
In an exemplary embodiment of the present disclosure, the task profile includes a data transmission configuration of a plurality of data sources, and the determining the target data source of the task flow includes: determining the target data source from the plurality of data sources according to source data type information in configuration information of the configuration operation; and calling the data transmission configuration of the target data source based on the connection configuration information in the configuration information of the configuration operation so as to establish data transmission connection with the target data source.
In one exemplary embodiment of the present disclosure, the task configuration information includes a timer configuration; when the task flow is executed, the original metadata in the target data source is obtained, and the original metadata is converted into target metadata in a preset format, which comprises the following steps: when executing the task flow, triggering metadata extraction tasks based on the timer configuration; extracting metadata from the target data source according to an information extraction mode corresponding to the metadata extraction task to obtain the original metadata; and formatting the original metadata to obtain target metadata with the preset format.
In an exemplary embodiment of the present disclosure, the formatting the original metadata to obtain the target metadata having the predetermined format includes: acquiring a data template; and analyzing the original metadata, and filling the analysis result into the data template to obtain the target metadata with the preset format.
In an exemplary embodiment of the present disclosure, after the converting the original metadata into target metadata in a predetermined format, the target metadata is stored in a first cache region; the matching the target metadata with the cache metadata corresponding to the original metadata, and determining metadata to be output according to a matching result includes: acquiring the target metadata from the first cache region, wherein the target metadata comprises metadata identification; accessing a second cache region, and acquiring the cache metadata corresponding to the metadata identification from the second cache region, wherein the second cache region stores the history metadata of the target data source; and matching the target metadata with the cache metadata to determine the metadata to be output according to a matching result.
In an exemplary embodiment of the present disclosure, after determining the metadata to be output, the method further includes: updating the history metadata of the second cache region based on the metadata to be output.
In an exemplary embodiment of the present disclosure, the second cache region is a checkpoint cache, and before accessing the second cache region, the method further includes: loading the checkpoint cache to a target thread, wherein the target thread is a thread corresponding to the target data source, and the target thread is used for acquiring metadata to be output of the target data source.
In an exemplary embodiment of the present disclosure, the matching the target metadata with the cached metadata corresponding to the original metadata, determining metadata to be output according to a matching result, includes: matching the target metadata with cache metadata corresponding to the original metadata, and acquiring changed metadata according to a matching result; and determining the changed metadata as the metadata to be output.
In an exemplary embodiment of the present disclosure, the number of the target data sources is a plurality, and the task configuration information includes an asynchronous thread number; the method further comprises the steps of: creating a plurality of metadata extraction tasks having the asynchronous thread number; and extracting metadata from each target data source in real time and asynchronously based on the metadata extraction tasks to obtain original metadata corresponding to each target data source.
The storage unit 1120 may include a readable medium in the form of a volatile storage unit, such as a Random Access Memory (RAM) 1121 and/or a cache memory 1122, and may further include a Read Only Memory (ROM) 1123.
Storage unit 1120 may also include a program/utility 1124 having a set (at least one) of program modules 1125, such program modules 1125 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus 1130 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1100 may also communicate with one or more external devices 1200 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 1100, and/or any devices (e.g., routers, modems, etc.) that enable the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1150. Also, electronic device 1100 can communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 1160. As shown, network adapter 1160 communicates with other modules of electronic device 1100 via bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 1100, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are only schematic illustrations of processes included in the method according to the exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (13)

1. A method of data processing, comprising:
acquiring task configuration information, starting a corresponding task flow according to the task configuration information, and determining a target data source of the task flow;
when the task flow is executed, acquiring original metadata in the target data source, and converting the original metadata into target metadata in a preset format;
and matching the target metadata with the cache metadata corresponding to the original metadata, and determining metadata to be output according to a matching result.
2. The method of claim 1, wherein the obtaining task configuration information comprises:
responding to configuration operation aiming at a preset task configuration file, and generating task configuration information according to the task configuration file and configuration information of the configuration operation;
wherein the task configuration information is used to indicate that original metadata is extracted from the target data source.
3. The method of claim 2, wherein the task profile includes data transmission configurations for a plurality of data sources, and wherein the determining the target data source for the task flow comprises:
determining the target data source from the plurality of data sources according to source data type information in configuration information of the configuration operation;
and calling the data transmission configuration of the target data source based on the connection configuration information in the configuration information of the configuration operation so as to establish data transmission connection with the target data source.
4. The method of claim 1, wherein the task configuration information comprises a timer configuration; when the task flow is executed, the original metadata in the target data source is obtained, and the original metadata is converted into target metadata in a preset format, which comprises the following steps:
when executing the task flow, triggering metadata extraction tasks based on the timer configuration;
extracting metadata from the target data source according to an information extraction mode corresponding to the metadata extraction task to obtain the original metadata;
and formatting the original metadata to obtain target metadata with the preset format.
5. The method of claim 4, wherein formatting the original metadata to obtain target metadata having the predetermined format comprises:
acquiring a data template;
and analyzing the original metadata, and filling the analysis result into the data template to obtain the target metadata with the preset format.
6. The method of claim 1, wherein after the converting the original metadata into target metadata in a predetermined format, storing the target metadata in a first cache region;
the matching the target metadata with the cache metadata corresponding to the original metadata, and determining metadata to be output according to a matching result includes:
acquiring the target metadata from the first cache region, wherein the target metadata comprises metadata identification;
accessing a second cache region, and acquiring the cache metadata corresponding to the metadata identification from the second cache region, wherein the second cache region stores the history metadata of the target data source;
and matching the target metadata with the cache metadata to determine the metadata to be output according to a matching result.
7. The method of claim 6, wherein after determining the metadata to be output, the method further comprises:
updating the history metadata of the second cache region based on the metadata to be output.
8. The method of claim 6, wherein the second cache region is a checkpoint cache, the method further comprising, prior to the accessing the second cache region:
loading the checkpoint cache to a target thread, wherein the target thread is a thread corresponding to the target data source, and the target thread is used for acquiring metadata to be output of the target data source.
9. The method according to claim 1, wherein the matching the target metadata with the cached metadata corresponding to the original metadata, and determining metadata to be output according to the matching result, includes:
matching the target metadata with cache metadata corresponding to the original metadata, and acquiring changed metadata according to a matching result;
and determining the changed metadata as the metadata to be output.
10. The method according to any one of claims 1 to 9, wherein the number of target data sources is a plurality, and the task configuration information includes an asynchronous thread number;
the method further comprises the steps of:
creating a plurality of metadata extraction tasks having the asynchronous thread number;
and extracting metadata from each target data source in real time and asynchronously based on the metadata extraction tasks to obtain original metadata corresponding to each target data source.
11. A data processing apparatus, comprising:
the configuration module is used for acquiring task configuration information, starting a corresponding task flow according to the task configuration information, and determining a target data source of the task flow;
the metadata extraction module is used for acquiring original metadata in the target data source and converting the original metadata into target metadata in a preset format when the task stream is executed;
and the metadata processing module is used for matching the target metadata with the cache metadata corresponding to the original metadata, and determining metadata to be output according to a matching result.
12. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 10.
13. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 10 via execution of the executable instructions.
CN202311197119.8A 2023-09-15 2023-09-15 Data processing method and device, computer readable storage medium and electronic equipment Pending CN117271584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311197119.8A CN117271584A (en) 2023-09-15 2023-09-15 Data processing method and device, computer readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117271584A 2023-12-22

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591625A (en) * 2024-01-18 2024-02-23 之江实验室 Data storage method and device, storage medium and electronic equipment
CN117591625B (en) * 2024-01-18 2024-04-12 之江实验室 Data storage method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination