CN113779017A - Method and apparatus for data asset management - Google Patents


Publication number
CN113779017A
Authority
CN
China
Prior art keywords
data
mart
asset
information
production
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010752400.3A
Other languages
Chinese (zh)
Inventor
周奇博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202010752400.3A
Publication of CN113779017A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and an apparatus for data asset management, and relates to the technical field of computers. One embodiment of the method comprises: acquiring all production tasks in a data mart, and determining a first association relationship among the data tables in the data mart according to the production tasks; acquiring a second association relationship between each data table and each application model in the data mart; and constructing a data asset three-dimensional relationship network of the data mart according to the first association relationship and the second association relationship. This embodiment builds the data asset three-dimensional relationship network of the data mart around the data tables, production tasks, and application models. It can determine how these three kinds of objects influence and depend on one another, and it displays how data is generated and how it is used externally. When data is abnormal, the problem and its range of influence can therefore be located quickly, which makes it easy to check the interdependencies of abnormal data and to issue notifications and alarms.

Description

Method and apparatus for data asset management
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for data asset management.
Background
As is well known, data management and data quality are two core functions of a data center. The data mart that serves the data center's business acts as a department-level data warehouse and obtains its data from the group's enterprise-level data warehouse. Upstream and downstream data are currently coordinated mainly as follows: upstream teams announce data changes manually, and downstream teams either work out the range of influence manually or estimate the approximate impact from simple key information. Departments accumulate team knowledge and share information by collecting the personal experience or notes of technicians.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
it is not known how data is generated, or what mutual influence and dependencies exist among data tables, production tasks, and important applications; the data input from the upstream warehouse and the data output to downstream marts cannot be understood intuitively; unified query and classified management are not possible; and inferior or junk assets among the data assets cannot be identified and effectively removed in real time.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for data asset management that construct a data asset three-dimensional relationship network of a data mart around the data tables, production tasks, and application models. The network makes it possible to determine how data tables, production tasks, and application models influence and depend on one another, and to display how data is generated and how it is used externally. When data is abnormal, the problem and its range of influence can therefore be located quickly, which makes it easy to check the interdependencies of abnormal data and to issue notifications and alarms.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of data asset management, including:
acquiring all production tasks in a data mart, and determining a first association relationship among the data tables in the data mart according to the production tasks;
acquiring a second association relationship between each data table and each application model in the data mart;
and constructing a data asset three-dimensional relationship network of the data mart according to the first association relationship and the second association relationship.
Optionally, determining a first association relationship between the data tables in the data mart according to the production task includes:
acquiring and analyzing the production task to determine a first data table serving as a source table of the production task and a second data table serving as a target table of the production task;
and forming a data table parent-child recursive relationship network by taking one of the first data table and the second data table as a child and the other one of the first data table and the second data table as a parent to obtain a first association relationship among the data tables in the data mart.
Optionally, parsing the production task includes:
performing word segmentation on the execution code of the production task, and establishing unique index information for each word to obtain a unique index sequence of the production task;
and extracting a source table and a target table of the production task from the unique index sequence according to regular expression rules.
Optionally, before performing word segmentation processing on the execution code of the production task, the method further includes:
and standardizing the execution codes of the production tasks.
Optionally, the method further comprises: acquiring asset information of all data tables, production tasks and application models in the data mart, and adding the asset information to the data asset three-dimensional relationship network; the asset information includes at least one of: asset file, storage location, file format, storage path, storage amount, usage status information, asset quality information, update information.
Optionally, the method further comprises: screening, from all data tables of the data mart, the input data tables that are input to the data mart to form an input data table list, and compiling contact information and/or usage information for each input data table in the input data table list; the contact information includes at least one of: input source, input party contact details, input form, and input caliber; the usage information includes at least one of: usage status information, data asset quality status.
Optionally, the method further comprises at least one of:
configuring a life cycle for each data table in the data mart, and deleting the data table when its life cycle expires;
monitoring each data table in the data mart, and sending prompt information when the data table is abnormal;
monitoring the resource consumption of each production task in the data mart in real time, and sending prompt information or triggering a preset coping strategy when the resource consumption of the production task meets a first preset condition;
and merging a plurality of data tables meeting a second preset condition in the data mart.
According to a second aspect of embodiments of the present invention, there is provided an apparatus for data asset management, comprising:
a first acquisition module, configured to acquire all production tasks in the data mart and determine a first association relationship among the data tables in the data mart according to the production tasks;
a second acquisition module, configured to acquire a second association relationship between each data table and each application model in the data mart;
and a network construction module, configured to construct the data asset three-dimensional relationship network of the data mart according to the first association relationship and the second association relationship.
Optionally, the determining, by the first obtaining module, a first association relationship between the data tables in the data mart according to the production task includes:
acquiring and analyzing the production task to determine a first data table serving as a source table of the production task and a second data table serving as a target table of the production task;
and forming a data table parent-child recursive relationship network by taking one of the first data table and the second data table as a child and the other one of the first data table and the second data table as a parent to obtain a first association relationship among the data tables in the data mart.
Optionally, the analyzing the production task by the first obtaining module includes:
performing word segmentation on the execution code of the production task, and establishing unique index information for each word to obtain a unique index sequence of the production task;
and extracting a source table and a target table of the production task from the unique index sequence according to regular expression rules.
Optionally, the first obtaining module is further configured to: and before performing word segmentation processing on the execution code of the production task, performing standardization processing on the execution code of the production task.
Optionally, the network construction module is further configured to: acquire asset information of all data tables, production tasks and application models in the data mart, and add the asset information to the data asset three-dimensional relationship network; the asset information includes at least one of: asset file, storage location, file format, storage path, storage amount, usage status information, asset quality information, update information.
Optionally, the network construction module is further configured to: screen, from all data tables of the data mart, the input data tables that are input to the data mart to form an input data table list, and compile contact information and/or usage information for each input data table in the input data table list; the contact information includes at least one of: input source, input party contact details, input form, and input caliber; the usage information includes at least one of: usage status information, data asset quality status.
Optionally, the apparatus further comprises an asset remediation module for at least one of:
configuring a life cycle for each data table in the data mart, and deleting the data table when its life cycle expires;
monitoring each data table in the data mart, and sending prompt information when the data table is abnormal;
monitoring the resource consumption of each production task in the data mart in real time, and sending prompt information or triggering a preset coping strategy when the resource consumption of the production task meets a first preset condition;
and merging a plurality of data tables meeting a second preset condition in the data mart.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device for data asset management, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method provided by the first aspect of the embodiments of the present invention.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method provided by the first aspect of embodiments of the present invention.
One embodiment of the above invention has the following advantages or benefits: the data asset three-dimensional relationship network of the data mart is constructed around the data tables, production tasks, and application models, so the mutual influence and dependencies among them can be determined and the generation process and external use of the data can be displayed. When data is abnormal, the problem and its range of influence can be located quickly, which makes it easy to check the interdependencies of abnormal data and to issue notifications and alarms.
Further effects of the optional implementations described above are explained below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic illustration of the main flow of a method of data asset management of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data asset three-dimensional relationship network in an alternative embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method of data asset management in an alternative embodiment of the present invention;
FIG. 4 is a schematic diagram of the major modules of an apparatus for data asset management of an embodiment of the present invention;
FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
According to one aspect of an embodiment of the present invention, a method of data asset management is provided.
FIG. 1 is a schematic diagram of the main flow of a method of data asset management according to an embodiment of the present invention. As shown in FIG. 1, the method of data asset management includes steps S101, S102, and S103.
In step S101, all production tasks in the data mart are obtained, and a first association relationship between data tables in the data mart is determined according to the production tasks.
Data tables, production tasks, and application models are the data objects in a data mart. The data table is the most basic object and stores the various data in the data mart. A production task is a collection of code that encapsulates data tables or application models. An application model is abstracted as a data service that the mart exposes externally and is used to output the data in the data tables. For example, the scheduling task with ID 123 is a collection of N code segments, N scripts, or N methods that call one another, and it generates the data of data table A; an application model uses the data of data table A, or of N data tables, as the basis for serving results such as the output of a data mining model.
The first association relationship refers to an association relationship between data tables in the data mart. Production tasks typically use certain data as source data to generate new data through production processes. The data table storing the source data is called a source table of the production task, and the data table storing new data generated by the production process is called a target table of the production task. A first associative relationship between the various data tables in the data mart may be determined based on the source tables and the generated target tables on which the various production tasks in the data mart depend.
The data in one data table may be processed by multiple production tasks, and one production task may process the data in multiple data tables, so step S101 yields a three-dimensional network relationship between the production tasks and the data tables in the data mart. Sorting out the first association relationship among the data tables makes it possible to determine how data tables and production tasks influence and depend on one another and to display how data is generated. When data is abnormal, the problem and its range of influence can then be located quickly, which makes it easy to check the interdependencies of abnormal data and to issue notifications and alarms. For example, when a data table is found to be defective, a risk reminder is promptly sent for the production tasks associated with that data table so that those tasks can be optimized and remediated.
Optionally, determining a first association relationship between the data tables in the data mart according to the production task includes: acquiring and analyzing a production task to determine a first data table serving as a source table of the production task and a second data table serving as a target table of the production task; and forming a data table parent-child recursive relationship network by taking one of the first data table and the second data table as a child and the other one of the first data table and the second data table as a parent to obtain a first association relationship among the data tables in the data mart.
Taking the determination of a production task's source table as an example: a Hadoop cluster (Hadoop is a distributed system infrastructure developed by the Apache Foundation) allocates computing resources to the execution code of every submitted production task through the YARN (Yet Another Resource Negotiator) resource manager, which also records the channel through which each production task was submitted. The HQL (Hive Query Language) code of the production tasks submitted to the YARN resource manager, across all queues and all channels, is captured in real time through the Hive REST API; the captured code includes real-time code information for Spark (a computing engine) and the corresponding production task information. (REST stands for Representational State Transfer and describes a standard way of building HTTP APIs; API stands for Application Programming Interface, a set of predefined functions or conventions for linking the different components of a software system.) For example, when task ID 111 submits a piece of execution code, the YARN resource manager automatically records that the submitting channel is 'scheduling platform', the task ID is 111, and the code is "select * from A". By capturing all of this information and analyzing the code, it can be determined that the source table of production task 111 is data table A.
The embodiment may also learn the association relationship between the data table and the production task, for example, the association relationship between the data table a and the production task 111 is as follows: data table a is a source table for production task 111.
Forming a data table parent-child recursive relationship network by taking one of the source table and the target table as the child and the other as the parent (for example, the source table as the child and the target table as the parent) displays the association relationships among the data tables in the data mart more intuitively. When data is abnormal, the problem and its range of influence can then be located quickly, which makes it easy to check the interdependencies of abnormal data and to issue notifications and alarms.
Optionally, parsing the production task comprises: performing word segmentation on the execution code of the production task, and establishing unique index information for each word to obtain a unique index sequence of the production task; and extracting the source table and the target table of the production task from the unique index sequence according to regular expression rules.
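The parent-child recursion described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the triple format `(task_id, source_table, target_table)`, the sample data in `TASK_EDGES`, and the function names are all assumptions made for the example. The source table is treated as the child and the target table as the parent.

```python
from collections import defaultdict

# Hypothetical parsed production tasks: (task_id, source_table, target_table).
TASK_EDGES = [
    ("111", "A", "B"),
    ("112", "B", "C"),
    ("113", "A", "D"),
]


def build_parent_links(edges):
    """Map each child (source) table to the set of its parent (target) tables."""
    parents = defaultdict(set)
    for _task_id, source, target in edges:
        parents[source].add(target)
    return parents


def all_parents(table, parents, seen=None):
    """Recursively collect every table that is derived from `table`."""
    seen = set() if seen is None else seen
    for parent in parents.get(table, ()):
        if parent not in seen:  # guards against cycles in the lineage
            seen.add(parent)
            all_parents(parent, parents, seen)
    return seen


links = build_parent_links(TASK_EDGES)
affected_by_a = all_parents("A", links)
```

Walking the links upward from a defective table gives the set of tables whose data may be affected, which is the kind of influence range the text refers to.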
Illustratively, the code is segmented into words using spaces and the key syntax of DML (Data Manipulation Language) statements, that is, the keywords used to add, delete, update, and query data table records and to check data integrity (commonly used keywords include insert, delete, update, select, and join), and unique index information is established for each word. Under the basic grammar of DML statements, keywords can only be separated by spaces, and table names, keywords, and other elements cannot themselves contain spaces. The segmentation can therefore proceed as follows: first, the index of each word in the code statement is determined using a single space as the separator; finally, starting from the indexes of the join and from keywords, 1 is added to each index, and the word at the resulting index position is taken as the table name of a source table. In this way the source tables of the production task are extracted from the unique index sequence.
The method for extracting the source table and the target table of the production task is simple, fast and accurate.
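A minimal sketch of the index-based extraction, assuming the code has already been reduced to a single statement. The `from`/`join` rule follows the description; the handling of target tables (the word after `insert`, skipping `into`, `overwrite`, and `table`) and the `TABLE_NAME` pattern are assumptions added for illustration, since the text only spells out the source-table case.

```python
import re

# Assumed shape of a table name: letters, digits, underscores, optional db prefix.
TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")
SOURCE_KEYWORDS = {"from", "join"}
TARGET_FILLERS = {"into", "overwrite", "table"}  # assumption, not from the text


def index_words(code):
    """Split the code on whitespace and pair every word with a unique index."""
    return list(enumerate(code.split()))


def extract_tables(code):
    """Return (source_tables, target_tables) found in one statement."""
    indexed = index_words(code)
    words = [word for _, word in indexed]
    sources, targets = [], []
    for index, word in indexed:
        keyword = word.lower()
        if keyword in SOURCE_KEYWORDS:
            candidate = index + 1  # the word right after from/join
            bucket = sources
        elif keyword == "insert":
            candidate = index + 1
            while candidate < len(words) and words[candidate].lower() in TARGET_FILLERS:
                candidate += 1
            bucket = targets
        else:
            continue
        # The regular expression rejects non-names such as "(" before a subquery.
        if candidate < len(words) and TABLE_NAME.match(words[candidate]):
            if words[candidate] not in bucket:
                bucket.append(words[candidate])
    return sources, targets


STATEMENT = "insert overwrite table B select a.id from A a join C c on a.id = c.id"
sources, targets = extract_tables(STATEMENT)
```

A token-position rule like this is fast, but it does not understand nested queries or aliases that shadow table names; a full SQL parser would be the more robust choice where that matters.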
Optionally, before performing word segmentation on the execution code of the production task, the method further includes standardizing the execution code of the production task. Illustratively, information such as Chinese characters, comments, special characters, and punctuation marks is removed from the execution code, and only the information needed for data table naming, such as digits, letters, and underscores, is retained. Standardization improves the efficiency and accuracy of the subsequent extraction of the source table and the target table from the unique index sequence.
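The standardization step might look like the sketch below. It assumes SQL-style comments (`--` and `/* */`) and, as a further assumption beyond the text, keeps the dot so that qualified names such as `db.A` survive; the function name is hypothetical.

```python
import re


def normalize_code(code):
    """Strip comments and any character not needed for table naming."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)  # block comments
    code = re.sub(r"--[^\n]*", " ", code)               # line comments
    # Keep letters, digits, underscores, dots and whitespace only; this also
    # drops Chinese characters, punctuation and other special characters.
    code = re.sub(r"[^A-Za-z0-9_.\s]", " ", code)
    return " ".join(code.split())  # collapse runs of whitespace to one space


raw = "select id, name -- 用户名\nfrom db.A /* tmp */ where dt='2020';"
clean = normalize_code(raw)
```

Collapsing whitespace to single spaces matters here because the later indexing step relies on a single space as the separator.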
In step S102, a second association relationship between each data table and each application model in the data mart is obtained.
For example, if application model B outputs the data in data table B, the second association relationship between data table B and application model B can be determined to be an output relationship, in which application model B is the outputting party and data table B is the table being output.
The data in one data table may be output by a plurality of application models, and one application model may output the data in a plurality of data tables, so that the stereoscopic network relationship between each application model and each data table in the data mart may be obtained through step S102.
Sorting out the second association relationships between the data tables and the application models in the data mart makes it possible to determine how data tables and application models influence and depend on one another and to display how the data is used externally. When data is abnormal, the problem and its range of influence can then be located quickly, which makes it easy to check the relationships among abnormal data and to issue notifications and alarms. For example, when a data table is found to be defective, a risk reminder is promptly sent for the application models associated with that data table so that those models can be optimized and remediated.
In step S103, a data asset three-dimensional relationship network of the data mart is constructed according to the first association relationship and the second association relationship.
The method for constructing the data asset three-dimensional relationship network from the first and second association relationships can be chosen according to actual conditions. For example, the first and second association relationships may be spliced together; as another example, the data tables, production tasks, and application models may be associated pairwise through key values. FIG. 2 is a schematic diagram of the data asset three-dimensional relationship network in an alternative embodiment of the invention. Note that FIG. 2 shows only a schematic network among data tables, production tasks, and application models; in practice a data mart usually contains a large number of data tables, production tasks, and application models, so the data asset network obtained in step S103 usually has a three-dimensional mesh structure.
The data asset three-dimensional relationship network is built from the three-dimensional network relationship between the production tasks and the data tables and the three-dimensional network relationship between the application models and the data tables. It makes it possible to determine how data tables, production tasks, and application models influence and depend on one another, and it displays how data is generated and how it is used externally. When data is abnormal, the problem and its range of influence can therefore be located quickly, which makes it easy to check the interdependencies of abnormal data and to issue notifications and alarms.
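One way to picture the splicing of the two association relationships is a directed graph whose nodes are typed as table, task, or model. The sketch below is an assumption-laden illustration: the tuple formats, the node encoding `(kind, name)`, and the names `build_network` and `impact_range` are invented for the example.

```python
from collections import defaultdict, deque


def build_network(first_relations, second_relations):
    """Splice table/task lineage and table/model output links into one graph.

    first_relations:  iterable of (task_id, source_table, target_table)
    second_relations: iterable of (table, application_model)
    """
    graph = defaultdict(set)
    for task, source, target in first_relations:
        graph[("table", source)].add(("task", task))
        graph[("task", task)].add(("table", target))
    for table, model in second_relations:
        graph[("table", table)].add(("model", model))
    return graph


def impact_range(graph, start):
    """Breadth-first walk collecting every asset downstream of `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


network = build_network(
    [("111", "A", "B"), ("112", "B", "C")],
    [("B", "model_B"), ("C", "model_C")],
)
impacted = impact_range(network, ("table", "B"))
```

With the graph in place, the influence range of an abnormal table is simply the set of nodes reachable from it, which covers downstream production tasks, tables, and application models in one pass.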
In some optional embodiments, the method further comprises: acquiring asset information of all data tables, production tasks and application models in the data mart, and adding the asset information to the data asset three-dimensional relationship network. Asset information is information about an asset and may include at least one of: asset file, storage location, file format, storage path, storage amount, usage status information, asset quality information, update information. An asset file is a file contained in a data asset. The storage location indicates where the data asset is stored, and the storage path is a path that leads to that location. The storage amount is the storage space occupied by the data asset. The usage status information reflects how the data asset is being used; for example, data table C is used as the source table of production task C, and the usage status may be completed, executing, failed, on time, delayed, and so on. The asset quality information reflects the quality condition of the data asset, such as asset quality and asset anomalies. The update information indicates how the data asset has been updated, such as the version information of past updates, update times, and update content.
HDFS is the distributed file system of the Hadoop cluster and serves as the base path for storing data assets. When a new data asset is created, for example when a new table is created, an HDFS asset path is added automatically through the HDFS file system. In practice, therefore, the asset information of each data asset can be obtained by a full scan of the HDFS asset paths.
By adding the asset information to the data asset three-dimensional relationship network, the track of each data asset can be displayed comprehensively and visually.
Data assets in data marts and data warehouses are typically not all self-produced, and there are input data assets. Thus, in further alternative embodiments, the method further comprises: and screening input data tables input to the data mart from all the data tables of the data mart to form an input data table list, and counting the contact information and/or the statistical information of each input data table in the order of the input data table.
The contact information includes at least one of: input source, input party contact, input form and input aperture. The input source is used for identifying an input party for inputting the data sheet, the contact way of the input party is recorded in the contact way of the input party, the input form is used for reflecting the form of the input data sheet to be input into the data mart, and the input aperture is used for reflecting the way of the input data sheet to be input into the data mart. The usage information includes at least one of: using status information, data asset quality status. The usage status information is used to reflect the usage status of the data asset, for example, the data table C is used as the source table of the production task C, and the usage status may be completed, in execution, failed, on time, delayed, etc. The asset quality information is used to reflect instructions on the data assets, such as asset quality, asset anomalies, and the like. The statistical information can be displayed in a form similar to a report form, and operation and maintenance and managers can conveniently and visually check the statistical information. The statistical period can be selectively set according to actual conditions, such as once per day, once per week, and the like.
By forming the input data table list and compiling the contact information and/or statistical information of each input data table in the list, the condition of the data input to the data mart, such as the data itself and its quality and changes, can be displayed conveniently, so that problems with data assets can be checked, identified and solved easily.
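The embodiment does not state how input data tables are screened. The following minimal sketch assumes one possible rule: a table counts as an input table if no production task in the data mart writes to it. The `InputTableEntry` record, its field names, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class InputTableEntry:
    table: str
    # Contact information
    input_source: Optional[str] = None
    input_contact: Optional[str] = None
    input_form: Optional[str] = None
    input_caliber: Optional[str] = None
    # Usage information
    usage_status: Optional[str] = None
    quality_status: Optional[str] = None


def build_input_table_list(
    all_tables: Iterable[str],
    task_edges: Iterable[Tuple[str, str]],
    known_info: Dict[str, Dict[str, str]],
) -> List[InputTableEntry]:
    """Build the input data table list.

    task_edges holds (source_table, target_table) pairs derived from the
    production tasks. Assumption: a table that is never a target table is
    treated as input to the data mart.
    """
    produced = {target for _source, target in task_edges}
    entries = []
    for table in sorted(all_tables):
        if table in produced:
            continue  # produced inside the mart, so not an input table
        entries.append(InputTableEntry(table=table, **known_info.get(table, {})))
    return entries
```

Each entry can then be rendered as one row of a daily or weekly report, which matches the report-like display described above.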
In an optional embodiment illustrated in FIG. 3, the method further comprises remediating data assets in the data mart based on the constructed data asset three-dimensional relationship network. Remediating data assets in the data mart may include at least one of:
(1) configuring a life cycle for each data table in the data mart, and deleting the data table when it expires. Illustratively, the life cycle of a formal data table may be set to permanently valid so that it is stored automatically, and the life cycles of temporary data tables and invalid empty data tables may be set to smaller values so that they are automatically deleted and cleaned up when they expire.
(2) monitoring each data table in the data mart, and sending prompt information when a data table is abnormal. Illustratively, when a defect is found in a data table, a risk reminder is promptly sent for the application models associated with the data table so that the application models can be optimized. As another example, when a defect is found in a data table, a risk reminder is promptly sent for the production tasks associated with the data table so that the production tasks can be optimized.
(3) monitoring the resource consumption of each production task in the data mart in real time, and sending prompt information or triggering a preset coping strategy when the resource consumption of a production task meets a first preset condition. The first preset condition may be set according to actual conditions, for example, the resource consumption of the production task is greater than a set resource consumption threshold, or the resource consumption as a percentage of the cluster's computing resources exceeds a set percentage threshold. The preset coping strategy may likewise be set according to actual conditions, for example, suspending or deleting the production task that meets the first preset condition. Illustratively, the computing resource information of the YARN resource manager is acquired in full, the running state and fluctuation of the computing resources are monitored in real time, and production tasks that seriously affect the operating efficiency of the cluster trigger real-time alarms and are automatically found, killed and remediated.
(4) merging a plurality of data tables in the data mart that meet a second preset condition. The second preset condition may be set according to actual conditions, for example, the storage amount of the data asset is less than a set storage amount threshold, or the access frequency of the data asset is less than a set frequency threshold. Illustratively, small files below 128 MB that are identified in the data assets and that can affect data access efficiency are merged automatically.
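The two preset conditions in items (3) and (4) can be sketched as simple threshold checks. This is a minimal sketch under stated assumptions: memory in megabytes stands in for "resource consumption", the thresholds are caller-supplied, and `TaskUsage`, `tasks_meeting_first_condition` and `small_file_merge_candidates` are hypothetical names. Collecting the metrics from the cluster and actually suspending tasks or merging files are out of scope here.

```python
from dataclasses import dataclass
from typing import Dict, List

MB = 1024 * 1024
SMALL_FILE_BYTES = 128 * MB  # the 128 MB figure is the one given in the text


@dataclass
class TaskUsage:
    task_id: str
    memory_mb: int  # hypothetical metric; any resource measure would do


def tasks_meeting_first_condition(
    usages: List[TaskUsage],
    cluster_memory_mb: int,
    absolute_limit_mb: int,
    share_limit: float,
) -> List[str]:
    """Return tasks whose usage exceeds an absolute threshold or whose
    share of the cluster's resources exceeds a percentage threshold."""
    flagged = []
    for usage in usages:
        share = usage.memory_mb / cluster_memory_mb
        if usage.memory_mb > absolute_limit_mb or share > share_limit:
            flagged.append(usage.task_id)
    return flagged


def small_file_merge_candidates(
    file_sizes_by_table: Dict[str, List[int]], min_files: int = 2
) -> Dict[str, List[int]]:
    """Return, per table, the files below 128 MB when there are at least
    `min_files` of them, since merging a single file gains nothing."""
    candidates = {}
    for table, sizes in file_sizes_by_table.items():
        small = [size for size in sizes if size < SMALL_FILE_BYTES]
        if len(small) >= min_files:
            candidates[table] = small
    return candidates
```

A flagged task would then receive prompt information or the preset coping strategy, and each candidate table would be handed to whatever merge job the cluster provides.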
By combining the data table assets with the constructed data asset three-dimensional relationship network, the embodiment of the invention automatically manages the life cycle and storage of assets, automatically cleans up and remediates garbage assets, and automatically optimizes and remediates garbage assets or behaviors that affect the operating efficiency of the cluster, which reduces or even avoids manual intervention and improves the efficiency of data use.
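Life cycle management of this kind can be illustrated with a small expiry check. The table kinds and retention periods below are hypothetical values chosen for the example; the embodiment only says that formal tables may be permanent and that temporary and empty tables get smaller values.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

# Hypothetical policy: days to keep each kind of table; None means permanent.
LIFECYCLE_DAYS: Dict[str, Optional[int]] = {
    "formal": None,
    "temporary": 7,
    "empty": 1,
}


def expired_tables(
    tables: Dict[str, Tuple[str, date]], today: date
) -> List[str]:
    """Return the names of tables whose life cycle has run out.

    `tables` maps a table name to (kind, creation date).
    """
    expired = []
    for name, (kind, created) in sorted(tables.items()):
        ttl_days = LIFECYCLE_DAYS[kind]
        if ttl_days is not None and created + timedelta(days=ttl_days) <= today:
            expired.append(name)
    return expired
```

The returned names would be the candidates for automatic deletion; before dropping a table, a real system would presumably also consult the relationship network to check whether anything still depends on it.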
The embodiment of the invention constructs the data asset three-dimensional relationship network of the data mart with the data table, the production task and the application model as its core. The network makes it possible to determine the mutual influence and dependency relationships among data tables, production tasks and application models, and shows how data is generated and how it is used externally. When data is abnormal, the location of the problem and its scope of influence can therefore be found quickly, which makes it convenient to check the interdependencies among abnormal data and to send notifications and alarms. The data asset three-dimensional relationship network of the data mart can also be displayed through a visualization scheme, forming an agile asset management and analysis mode that covers asset influence analysis, full-link data relationship track analysis, relationship map query, rapid locating of the influence of upstream changes, and rapid release of downstream processing asset changes.
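One way to picture the relationship network and its use for influence analysis is a directed graph whose nodes are data tables, production tasks and application models. The sketch below is an assumption about representation, not the embodiment's data structure: `AssetNetwork` and its methods are hypothetical, edges point from upstream to downstream, and the influence scope of a table is found by breadth-first traversal.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Node = Tuple[str, str]  # (kind, name), kind is "table", "task" or "model"


class AssetNetwork:
    def __init__(self) -> None:
        self._downstream: Dict[Node, Set[Node]] = defaultdict(set)

    def add_production_task(
        self, task: str, sources: Iterable[str], targets: Iterable[str]
    ) -> None:
        """First association: source tables feed a task, which writes targets."""
        for source in sources:
            self._downstream[("table", source)].add(("task", task))
        for target in targets:
            self._downstream[("task", task)].add(("table", target))

    def add_model_usage(self, table: str, model: str) -> None:
        """Second association: an application model uses a data table."""
        self._downstream[("table", table)].add(("model", model))

    def impact_of(self, table: str) -> Set[Node]:
        """Everything downstream of `table`, found by breadth-first search."""
        start: Node = ("table", table)
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self._downstream.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)
        return seen
```

When a table is found to be abnormal, `impact_of` yields the tasks, tables and models that may need a notification or alarm. The `seen` set also keeps the traversal finite if the task graph contains a cycle.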
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for implementing the above method.
FIG. 4 is a schematic diagram of the main modules of an apparatus for data asset management according to an embodiment of the present invention. As shown in FIG. 4, the apparatus 400 for data asset management includes:
a first obtaining module 401, configured to obtain all production tasks in the data mart and determine a first association relationship between data tables in the data mart according to the production tasks;
a second obtaining module 402, configured to obtain a second association relationship between each data table and each application model in the data mart;
a network construction module 403, configured to construct a data asset three-dimensional relationship network of the data mart according to the first association relationship and the second association relationship.
Optionally, the determining, by the first obtaining module, a first association relationship between the data tables in the data mart according to the production task includes:
acquiring and analyzing the production task to determine a first data table serving as a source table of the production task and a second data table serving as a target table of the production task;
determining a second data table serving as a target table of the production task to obtain an association relation between the second data table and the production task;
and forming a data table parent-child recursive relationship network by taking one of the first data table and the second data table as a child and the other one of the first data table and the second data table as a parent to obtain a first association relationship among the data tables in the data mart.
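The parent-child recursive relationship can be sketched as a map from each child table to its parent tables, walked recursively to recover a table's full upstream lineage. The text leaves open which table is the parent; this sketch assumes the source table is the parent and the target table is the child. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def build_parent_map(
    task_tables: Dict[str, Tuple[Iterable[str], Iterable[str]]]
) -> Dict[str, Set[str]]:
    """Map each target (child) table to its source (parent) tables.

    `task_tables` maps a production task to (source tables, target tables).
    """
    parents: Dict[str, Set[str]] = defaultdict(set)
    for sources, targets in task_tables.values():
        source_set = set(sources)
        for target in targets:
            parents[target] |= source_set
    return parents


def ancestors(
    table: str, parents: Dict[str, Set[str]], _seen: Optional[Set[str]] = None
) -> Set[str]:
    """Recursively collect every table upstream of `table`."""
    seen = set() if _seen is None else _seen
    for parent in parents.get(table, ()):
        if parent not in seen:  # also guards against cyclic dependencies
            seen.add(parent)
            ancestors(parent, parents, seen)
    return seen
```

For very deep lineage chains an explicit stack would avoid Python's recursion limit; recursion is used here only because it mirrors the wording of the text.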
Optionally, the analyzing the production task by the first obtaining module includes:
performing word segmentation on the execution code of the production task, and establishing unique index information for each word segment to obtain a unique index sequence of the production task;
and extracting a source table and a target table of the production task from the unique index sequence according to regular expression rules.
Optionally, the first obtaining module is further configured to: and before performing word segmentation processing on the execution code of the production task, performing standardization processing on the execution code of the production task.
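The standardization, word segmentation, indexing and extraction steps can be sketched for SQL-like execution code as follows. This is one possible reading, not the embodiment's actual rules: standardization is taken to mean stripping comments, collapsing whitespace and lowercasing; the "unique index" is simply each token's position; and the extraction rules (a name after `from` or `join` is a source table, a name after `into` or `overwrite [table]` is a target table) are assumptions that ignore cases such as common table expressions.

```python
import re
from typing import List, Tuple

TOKEN = re.compile(r"[a-z_][\w.]*|\d+|\S")
TABLE_NAME = re.compile(r"[a-z_][\w.]*")


def normalize(code: str) -> str:
    """Strip comments, collapse whitespace, lowercase (naive: string
    literals are not protected)."""
    code = re.sub(r"--[^\n]*", " ", code)
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)
    return re.sub(r"\s+", " ", code).strip().lower()


def tokenize(code: str) -> List[Tuple[int, str]]:
    """Split normalized code into tokens, each paired with a unique index."""
    return list(enumerate(TOKEN.findall(code)))


def extract_tables(indexed: List[Tuple[int, str]]) -> Tuple[List[str], List[str]]:
    """Return (source tables, target tables) found in the index sequence."""
    tokens = [token for _index, token in indexed]
    sources: List[str] = []
    targets: List[str] = []
    for index, token in indexed:
        if token in ("from", "join"):
            bucket, nxt = sources, index + 1
        elif token in ("into", "overwrite"):
            bucket, nxt = targets, index + 1
            if nxt < len(tokens) and tokens[nxt] == "table":
                nxt += 1  # skip the optional TABLE keyword
        else:
            continue
        if nxt < len(tokens) and TABLE_NAME.fullmatch(tokens[nxt]):
            if tokens[nxt] not in bucket:
                bucket.append(tokens[nxt])
    return sources, targets
```

A call such as `extract_tables(tokenize(normalize(code)))` yields the source and target tables of one production task. A production-grade implementation would more likely rely on a real SQL parser than on keyword rules like these.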
Optionally, the network construction module is further configured to: acquire asset information of all data tables, production tasks and application models in the data mart, and add the asset information into the data asset three-dimensional relationship network; the asset information includes at least one of: asset file, storage location, file format, storage path, storage amount, usage status information, asset quality information, update information.
Optionally, the network construction module is further configured to: screen input data tables input to the data mart from all data tables of the data mart to form an input data table list, and compile contact information and/or statistical information of each input data table in the input data table list; the contact information includes at least one of: input source, input party contact, input form and input caliber; the usage information includes at least one of: usage status information, data asset quality status.
Optionally, the apparatus further comprises an asset remediation module for at least one of:
configuring a life cycle for each data table in the data mart, and deleting the data table when it expires;
monitoring each data table in the data mart, and sending prompt information when the data table is abnormal;
monitoring the resource consumption of each production task in the data mart in real time, and sending prompt information or triggering a preset coping strategy when the resource consumption of the production task meets a first preset condition;
and merging a plurality of data tables meeting a second preset condition in the data mart.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device for data asset management, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method provided by the first aspect of the embodiments of the present invention.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method provided by the first aspect of embodiments of the present invention.
Fig. 5 illustrates an exemplary system architecture 500 of a method of data asset management or an apparatus of data asset management to which embodiments of the present invention may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 505 may be a server providing various services, such as a backend management server (for example only) providing support for shopping websites browsed by users using the terminal devices 501, 502, 503. The backend management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example, target push information or product information; this is only an example) to the terminal device.
It should be noted that the method for managing data assets provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the apparatus for managing data assets is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising: a first obtaining module, configured to obtain all production tasks in the data mart and determine a first association relationship among the data tables in the data mart according to the production tasks; a second obtaining module, configured to obtain a second association relationship between each data table and each application model in the data mart; and a network construction module, configured to construct the data asset three-dimensional relationship network of the data mart according to the first association relationship and the second association relationship. In some cases the names of these modules do not limit the modules themselves; for example, the second obtaining module may also be described as a "module for obtaining a second association relationship between each data table and each application model in the data mart".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: acquiring all production tasks in a data mart, and determining a first association relationship among data tables in the data mart according to the production tasks; acquiring a second association relationship between each data table and each application model in the data mart; and constructing a data asset three-dimensional relationship network of the data mart according to the first association relationship and the second association relationship.
According to the technical scheme of the embodiment of the invention, the data table, the production task and the application model are taken as the core to construct the data asset three-dimensional relationship network of the data mart, the mutual correlation influence and dependency relationship among the data table, the production task and the application model can be determined, the generation process and the external use condition of the data are displayed, the problem location and influence range can be rapidly positioned when the data are abnormal, and the functions of checking the mutual dependence among abnormal data and informing and alarming are conveniently achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of data asset management, comprising:
acquiring all production tasks in a data mart, and determining a first association relation among data tables in the data mart according to the production tasks;
acquiring a second association relationship between each data table and each application model in the data mart;
and constructing a data asset three-dimensional relationship network of the data mart according to the first association relationship and the second association relationship.
2. The method of claim 1, wherein determining a first associative relationship between respective data tables in the data mart based on the production task comprises:
acquiring and analyzing the production task to determine a first data table serving as a source table of the production task and a second data table serving as a target table of the production task;
and forming a data table parent-child recursive relationship network by taking one of the first data table and the second data table as a child and the other one of the first data table and the second data table as a parent to obtain a first association relationship among the data tables in the data mart.
3. The method of claim 2, wherein resolving the production task comprises:
performing word segmentation on the execution code of the production task, and establishing unique index information for each word segment to obtain a unique index sequence of the production task;
and extracting a source table and a target table of the production task from the unique index sequence according to regular expression rules.
4. The method of claim 3, wherein before performing word segmentation on the execution code of the production task, the method further comprises:
and standardizing the execution codes of the production tasks.
5. The method of any of claims 1-4, further comprising: acquiring asset information of all data tables, production tasks and application models in the data mart, and adding the asset information into the data asset three-dimensional relationship network; the asset information includes at least one of: asset file, storage location, file format, storage path, storage amount, usage status information, asset quality information, update information.
6. The method of any of claims 1-4, further comprising: screening input data tables input to the data mart from all data tables of the data mart to form an input data table list, and compiling contact information and/or statistical information of each input data table in the input data table list; the contact information includes at least one of: input source, input party contact, input form and input caliber; the usage information includes at least one of: usage status information, data asset quality status.
7. The method of any of claims 1-4, further comprising at least one of:
configuring a life cycle for each data table in the data mart, and deleting the data table when it expires;
monitoring each data table in the data mart, and sending prompt information when the data table is abnormal;
monitoring the resource consumption of each production task in the data mart in real time, and sending prompt information or triggering a preset coping strategy when the resource consumption of the production task meets a first preset condition;
and merging a plurality of data tables meeting a second preset condition in the data mart.
8. An apparatus for data asset management, comprising:
a first obtaining module, configured to obtain all production tasks in the data mart and determine a first association relationship among data tables in the data mart according to the production tasks;
a second obtaining module, configured to obtain a second association relationship between each data table and each application model in the data mart;
and a network construction module, configured to construct the data asset three-dimensional relationship network of the data mart according to the first association relationship and the second association relationship.
9. An electronic device for data asset management, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010752400.3A CN113779017A (en) 2020-07-30 2020-07-30 Method and apparatus for data asset management

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010752400.3A CN113779017A (en) 2020-07-30 2020-07-30 Method and apparatus for data asset management

Publications (1)

Publication Number Publication Date
CN113779017A true CN113779017A (en) 2021-12-10

Family

ID=78835130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010752400.3A Pending CN113779017A (en) 2020-07-30 2020-07-30 Method and apparatus for data asset management

Country Status (1)

Country Link
CN (1) CN113779017A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446274A (en) * 2017-08-31 2019-03-08 北京京东尚科信息技术有限公司 The method and apparatus of big data platform BI metadata management
US20190347596A1 (en) * 2018-05-08 2019-11-14 Bank Of America Corporation System for decommissioning information technology assets using solution data modelling
CN110362585A (en) * 2019-06-19 2019-10-22 东软集团股份有限公司 Data analysing method, device, storage medium and electronic equipment
CN110807036A (en) * 2019-11-06 2020-02-18 北京顶象技术有限公司 Associated data network construction method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076387A (en) * 2023-08-22 2023-11-17 北京天华星航科技有限公司 Quick gear restoration system for mass small files based on magnetic tape
CN117076387B (en) * 2023-08-22 2024-03-01 北京天华星航科技有限公司 Quick gear restoration system for mass small files based on magnetic tape

Similar Documents

Publication Publication Date Title
CN107809331B (en) Method and device for identifying abnormal flow
US10430111B2 (en) Optimization for real-time, parallel execution of models for extracting high-value information from data streams
CN109446274B (en) Method and device for managing BI metadata of big data platform
WO2014145092A2 (en) Hierarchical, parallel models for extracting in real time high-value information from data streams and system and method for creation of same
CN111190888A (en) Method and device for managing graph database cluster
CN110795315A (en) Method and device for monitoring service
CN113760677A (en) Abnormal link analysis method, device, equipment and storage medium
CN109977139B (en) Data processing method and device based on class structured query statement
CN113779017A (en) Method and apparatus for data asset management
CN112433757A (en) Method and device for determining interface calling relationship
CN115422202A (en) Service model generation method, service data query method, device and equipment
CN110688355A (en) Method and device for changing container state
CN115033574A (en) Information generation method, information generation device, electronic device, and storage medium
CN114997414A (en) Data processing method and device, electronic equipment and storage medium
CN112783615B (en) Data processing task cleaning method and device
CN113760568A (en) Data processing method and device
EP3380906A1 (en) Optimization for real-time, parallel execution of models for extracting high-value information from data streams
CN112579673A (en) Multi-source data processing method and device
CN111831534A (en) Method and device for verifying accuracy of datagram table
CN111767185A (en) Data point burying method and device
CN112749204A (en) Method and device for reading data
CN116450622B (en) Method, apparatus, device and computer readable medium for data warehouse entry
CN116915870B (en) Task creation request processing method, device, electronic equipment and readable medium
CN115309612B (en) Method and device for monitoring data
CN111813765B (en) Method, device, electronic equipment and computer readable medium for processing abnormal data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination