CN118096387A - Data asset management method, device, equipment and medium - Google Patents

Data asset management method, device, equipment and medium

Info

Publication number
CN118096387A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410206475.XA
Other languages
Chinese (zh)
Inventor
余阳
徐博
马一骁
段先明
张毅
肖俊
罗斌
夏天翔
王立仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Chutianyun Co ltd
Original Assignee
Hubei Chutianyun Co ltd

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a data asset management method, device, equipment and medium. The method includes: continuously collecting component data of a third-party component; for each item of component data, distributing it to a corresponding memory queue according to its category; obtaining the data asset corresponding to each item of component data from a preset data asset catalog based on the preset asset ID carried in that component data, wherein the catalog is populated by synchronizing the metadata of a data source in real time, identifying the semantic information of the metadata based on an NLP algorithm to obtain the names, information items and summaries of the data assets, vectorizing these to obtain data asset vectors, classifying and labeling the data asset vectors based on a preset classification algorithm, and storing each data asset under the corresponding data asset category in the preset data asset catalog; and, for each item of component data in each memory queue, analyzing the component data together with the corresponding data asset to obtain related data index results of the data asset, and displaying those results.

Description

Data asset management method, device, equipment and medium
Technical Field
The application relates to a data asset management method, a device, equipment and a medium.
Background
In the digital age, data has become one of the core assets of an enterprise. Effective data asset management is important for protecting an enterprise's intellectual property, ensuring data security and improving operational efficiency. Existing methods manage data assets in a decentralized manner: the collection, circulation and disposal of data assets within an organization are governed by separate modules covering collection management, storage management, quality management, sharing and exchange management, data service management and so on, with each module responsible for carrying out its own link in the process.
Disclosure of Invention
In order to better realize management of data assets, the embodiment of the application provides a data asset management method, a device, equipment and a medium.
In a first aspect, an embodiment of the present application provides a data asset management method, including:
continuously collecting component data of a third party component; the component data comprises data quality component data, data security component data and data application component data;
for each item of component data, distributing the component data to a corresponding memory queue according to its category; the memory queues comprise a data quality memory queue, a data security memory queue and a data application memory queue;
obtaining the data asset corresponding to each item of component data from a preset data asset catalog based on the preset asset ID in each item of component data; wherein the data assets are acquired and stored by category in the preset data asset catalog through the following steps:
synchronizing metadata of a data source in real time, and identifying semantic information of the metadata based on an NLP algorithm to obtain names, information items and summaries of data assets;
vectorizing the names, information items and summaries of the data assets to obtain data asset vectors;
classifying and labeling the data asset vectors based on a preset classification algorithm, and storing each data asset under the corresponding data asset category in the preset data asset catalog;
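for each item of component data in each memory queue, analyzing the component data and the corresponding data asset to obtain related data index results of the data asset, and displaying the related data index results.
In an optional implementation manner of the embodiment of the present application, said analyzing, for each item of component data in each memory queue, the component data and the corresponding data asset to obtain related data index results of the data asset, and displaying the related data index results, includes: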
for each data quality component data in the data quality memory queue, analyzing and obtaining the quality score, quality problem and change trend of the data asset based on the data quality component data and the corresponding data asset, and displaying the quality score, quality problem and change trend of the data asset;
for each data security component data in the data security memory queue, based on the data security component data and the corresponding data asset, analyzing to obtain the security classification, the sensitivity level and the security policy of the data asset, and displaying the security classification, the sensitivity level and the security policy of the data asset;
and for each data application component data in the data application memory queue, carrying out aggregation statistics on the data application component data based on the corresponding preset asset ID to obtain and analyze the data processing information of each data asset in each time period, and displaying the obtained analysis result.
In an optional implementation manner of the embodiment of the present application, after analyzing and obtaining a quality score, a quality problem and a variation trend of the data asset based on the data quality component data and the corresponding data asset for each data quality component data in the data quality memory queue, the method further includes:
Determining an influence factor of the quality problem of the data asset based on a preset decision tree according to the quality problem of the data asset;
based on the attribution analysis method, the root cause of the quality problem of the data asset is determined according to the influencing factors of the quality problem of the data asset.
In an optional implementation manner of the embodiment of the present application, after analyzing and obtaining, for each data security component data in the data security memory queue, a security classification, a sensitivity level and a security policy of the data asset based on the data security component data and the corresponding data asset, the method further includes:
for each data asset, vectorizing the data asset name field and the data field to obtain a data asset name field vector and a data field vector;
determining, by a trained classifier, whether the data asset name field vector and the data field vector correspond to sensitive information;
if so, labeling the sensitive information of the data asset according to a preset security management policy, and notifying an administrator by in-site message.
In an optional implementation manner of the embodiment of the present application, for each data application component data in the data application memory queue, aggregate statistics is performed on the data application component data based on the corresponding preset asset ID, so as to obtain and analyze data processing information of each data asset in each time period, and an obtained analysis result is displayed, where the method includes:
for each data application component data in the data application memory queue, carrying out aggregation statistics on the data application component data based on the corresponding preset asset ID to obtain data application information and data use information of each data asset in each time period;
Analyzing the supply condition and the demand condition of each data asset according to the data application information to obtain an application information analysis result;
According to the data use information, analyzing the utilization rate and the application effect of each data asset in the use process to obtain a use information analysis result;
and displaying the application information analysis result and the use information analysis result.
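The aggregation by preset asset ID and time period described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the record fields `asset_id`, `timestamp` and `event`, the event labels `"application"` and `"use"`, the hourly bucket and the function name `aggregate_by_asset_and_hour` are all hypothetical and are not specified in the application.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List, Tuple


def aggregate_by_asset_and_hour(records: List[dict]) -> Dict[Tuple[str, str], Counter]:
    """Count events per (asset ID, hour bucket). The hourly bucket is an assumption."""
    stats: Dict[Tuple[str, str], Counter] = {}
    for r in records:
        # Truncate the ISO timestamp to the hour to form the time-period key.
        hour = datetime.fromisoformat(r["timestamp"]).strftime("%Y-%m-%d %H:00")
        key = (r["asset_id"], hour)
        stats.setdefault(key, Counter())[r["event"]] += 1
    return stats


# Hypothetical data application component records.
records = [
    {"asset_id": "A-001", "timestamp": "2024-01-01T09:05:00", "event": "application"},
    {"asset_id": "A-001", "timestamp": "2024-01-01T09:40:00", "event": "use"},
    {"asset_id": "A-001", "timestamp": "2024-01-01T10:10:00", "event": "use"},
    {"asset_id": "A-002", "timestamp": "2024-01-01T09:15:00", "event": "application"},
]
stats = aggregate_by_asset_and_hour(records)
for (asset_id, hour), counts in sorted(stats.items()):
    print(asset_id, hour, dict(counts))
```

The per-period counts of the two event types could then feed the supply and demand analysis and the utilization analysis respectively.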
In an optional implementation manner of the embodiment of the present application, the continuously collecting component data of the third party component includes:
collecting component data of a third party component by regularly calling a standardized interface provided by the third party component;
or,
collecting component data of the third-party component in real time by capturing the metadata database log, based on the access method of the third-party component's metadata database and its metadata dictionary.
In an optional implementation manner of the embodiment of the present application, the data assets may also be acquired and stored by category in the preset data asset catalog as follows:
obtaining data assets entered in a preset format, and storing the data assets by category in the preset data asset catalog.
In a second aspect, an embodiment of the present application provides a data asset management device, the device including:
a collection module, configured to continuously collect component data of a third-party component; the component data comprises data quality component data, data security component data and data application component data;
a distribution module, configured to distribute each item of component data to a corresponding memory queue according to its category; the memory queues comprise a data quality memory queue, a data security memory queue and a data application memory queue;
an obtaining module, configured to obtain the data asset corresponding to each item of component data from a preset data asset catalog based on the preset asset ID in each item of component data; wherein the data assets are acquired and stored by category in the preset data asset catalog through the following steps: synchronizing metadata of a data source in real time, and identifying semantic information of the metadata based on an NLP algorithm to obtain names, information items and summaries of data assets; vectorizing the names, information items and summaries of the data assets to obtain data asset vectors; classifying and labeling the data asset vectors based on a preset classification algorithm, and storing each data asset under the corresponding data asset category in the preset data asset catalog;
an analysis and display module, configured to analyze, for each item of component data in each memory queue, the component data and the corresponding data asset to obtain related data index results of the data asset, and to display the related data index results.
In a third aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a data asset management method as described above.
In a fourth aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements a data asset management method as described above when executing the computer program.
In a fifth aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on a computer device, cause the computer device to perform a data asset management method as described above.
In a sixth aspect, embodiments of the present application provide a chip comprising a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to execute a computer program or instructions to implement a data asset management method as described above.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
according to the data asset management method provided by the embodiment of the application, the metadata of the data source is synchronized in real time, and the semantic information of the metadata is identified based on the NLP algorithm, so that the automation of data asset identification is realized, the labor cost in the data asset identification process is greatly reduced, and a powerful support is provided for intelligent retrieval of the data asset; by displaying the related data index results of the data asset, the data asset manager can quickly learn the bottleneck and problem in the data quality, safety and application process, so that corresponding countermeasures can be quickly taken, and the data application capability and effect can be improved; by summarizing and visually displaying the related data indexes of each component, the business process of data asset management is simplified, and the workload of data asset management personnel for switching among different systems is reduced.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the application is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and together with the embodiments of the application, serve to explain the application. In the drawings:
FIG. 1 is a flow chart of a method for managing data assets according to an embodiment of the present application;
FIG. 2 is a flow chart for implementing data asset acquisition and classified storage in a preset data asset directory according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating a method for analyzing and displaying data metrics associated with a data asset according to an embodiment of the present application;
FIG. 4 is a flow chart diagram of an implementation of determining a root cause of a quality problem for a data asset provided by an embodiment of the present application;
FIG. 5 is a schematic flow chart for implementing corresponding labeling of sensitive information of a data asset according to an embodiment of the present application;
FIG. 6 is a schematic flow chart for implementing analysis and display of data processing information of a data asset in various time periods according to an embodiment of the present application;
FIG. 7 is a functional architecture diagram of a data asset management system according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a technical architecture of a data asset management system according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a data asset management device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
It should be understood that the sequence numbers of the steps in the following embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present application.
In order to illustrate the technical scheme of the application, the following description is made by specific examples.
The inventors found that, in the prior art, data asset management is decentralized: the collection, circulation and disposal of data assets within an organization are governed by separate modules covering collection management, storage management, quality management, sharing and exchange management, data service management and so on, and each management perspective has its own execution means responsible for carrying out the corresponding link. Specifically:
Data acquisition: provides ETL capability for the underlying data, extracting and transforming external data and then loading it into the platform storage system. It supports access to various external data sources, such as relational databases and files, and provides rich cleaning functions to ensure that high-quality data enters the warehouse. It also scales horizontally and vertically, supports rapid development to meet customers' access requirements, and uses distributed deployment to ensure high-performance processing. ETL stands for extraction, transformation and loading: the process of extracting data from a data source, applying various transformations, and loading the result into a target data warehouse or database.
Data storage: provides rich data storage capability based on HDFS plus a relational database (PG/MySQL), and supports storing data in an in-memory database (Redis), relational databases, a distributed file system (HDFS), HBase, a full-text retrieval system and a distributed graph database system. It offers high data reliability, so that a lost copy of the data can be recovered promptly, which safeguards the data, and it has strong horizontal scalability.
Data quality management: data quality management is a critical process aimed at ensuring that the data used and analyzed in an organization is accurate, complete, consistent and trusted. It involves a number of steps, including identifying and correcting erroneous data, eliminating duplicate records, implementing and maintaining data standards and guidelines, and continuously monitoring data quality. Through these steps, data quality management helps ensure the validity and reliability of data in analysis, decision-making and business operations, supporting the overall performance and goals of the organization.
Data development and calculation: data development and computation is the process of building and optimizing data warehouse structures to support complex data analysis and business intelligence requirements. It includes defining the architecture of the data warehouse (for example a star schema, snowflake schema or other model), determining the data sources, and designing the extraction, transformation and loading (ETL) process for the data.
Data service: data services provide data management-related services that support an organization's data access, analysis and application requirements. They typically include storage, backup, recovery and security management of data, as well as data integration and cleansing services. They may also include data APIs, data warehouses, and reporting and data visualization tools that enable users to access and understand data easily.
However, the decentralized data asset management method described above falls short in automatic identification of data assets, data security assessment, visual analysis and integrated management, as follows:
High cost of data asset identification: the existing decentralized management method relies on manual entry for data asset identification. Even where the metadata of a business system can be synchronized automatically, it is difficult to perform intelligent semantic recognition on data resources in newly added or modified business databases. Data asset identification therefore incurs substantial labor cost and takes longer.
Data application effects are not intuitive: the existing decentralized management method can hardly provide intuitive, clear visual analysis of supply-demand matching and data enablement from the perspective of data application, or present to users the problems and bottlenecks in current data applications. As a result, data asset managers work inefficiently and with great difficulty when applying and operating data assets.
Complex management process: in the existing decentralized management method, different systems are independently responsible for their own data processing links across the full life cycle of the data assets, and a business-oriented integrated management system is lacking. Data asset managers have to work with multiple systems, adapt to the usage conventions of each, and deal with data interoperability problems between systems, which increases the complexity and difficulty of data asset management.
On this basis, the inventors carried out further research and arrived at the present application, which provides a data asset management method, device, equipment and medium.
Example 1
The embodiment of the application provides a data asset management method, referring to FIG. 1, the method comprises the following steps S101 to S104:
S101: continuously collecting component data of a third party component; the component data includes data quality component data, data security component data, and data application component data.
S102: distributing the component data to a corresponding memory queue according to the category of the component data for each component data; the memory queues include a data quality memory queue, a data security memory queue, and a data application memory queue.
S103: acquiring the data asset corresponding to each component data from a preset data asset catalog based on the preset asset ID in each component data.
S104: for each component data in each memory queue, analyzing the component data and the corresponding data asset to obtain a related data index result of the data asset, and displaying the related data index result.
As shown in fig. 2, the data assets in the preset data asset directory in the step S103 may be acquired and stored in a classified manner through the following steps S201 to S203:
S201: metadata of a data source are synchronized in real time, semantic information of the metadata is identified based on an NLP algorithm, and names, information items and abstracts of data assets are obtained.
S202: vectorizing the names, the information items and the abstracts of the data assets to obtain data asset vectors.
S203: classifying and labeling the data asset vectors based on a preset classification algorithm, and storing each data asset under the corresponding data asset category in the preset data asset catalog.
According to the data asset management method provided by the embodiment of the application, the metadata of the data source is synchronized in real time, and the semantic information of the metadata is identified based on the NLP algorithm, so that the automation of data asset identification is realized, the labor cost in the data asset identification process is greatly reduced, and a powerful support is provided for intelligent retrieval of the data asset; by displaying the related data index results of the data asset, the data asset manager can quickly learn the bottleneck and problem in the data quality, safety and application process, so that corresponding countermeasures can be quickly taken, and the data application capability and effect can be improved; by summarizing and visually displaying the related data indexes of each component, the business process of data asset management is simplified, and the workload of data asset management personnel for switching among different systems is reduced.
In the embodiment of the application, runtime data, task execution results and other data of the third-party components (mainly data governance components) are collected through a protocol agreed in advance with the third-party components. The third-party data governance components from which data is collected include, but are not limited to: a data catalog component, a data quality component, a data application component, a data standard component, a data storage component, a data calculation component, a data governance component (including quality auditing and cleaning and transformation), a data security component (including data classification and data security management), a data service component (including data service attachment, data sharing and exchange, and data usage statistics), and the like. The collected data includes, but is not limited to, data quality component data, data security component data and data application component data, which are used as the main examples in the embodiment of the application.
In the step S101, when component data of the third party component is continuously collected, there are two ways:
the first way is: collecting component data of a third party component by regularly calling a standardized interface provided by the third party component;
The second way is: collecting component data of the third-party component in real time by capturing the metadata database log, based on the access method of the third-party component's metadata database and its metadata dictionary. The access method and the metadata dictionary are provided by the third-party component, so that component data can be collected in real time through database log capture.
The component data obtained during collection is ultimately written to a preset component runtime database.
In the embodiment of the application, continuously collecting the component data of the third-party component in the two ways above makes collection closer to real time, and the collected component data more complete, accurate and timely.
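The first collection way, periodically calling an interface provided by the third-party component, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the record shape, the function names `poll_component` and `fake_quality_interface`, and the polling interval are hypothetical, and the stand-in function replaces a real standardized interface whose form the application does not specify.

```python
import time
from typing import Callable, Dict, Iterator, List

# Hypothetical record shape: category, preset asset ID and a payload.
Record = Dict[str, object]


def poll_component(fetch: Callable[[], List[Record]],
                   interval_s: float,
                   max_polls: int) -> Iterator[Record]:
    """Call the component interface `max_polls` times, `interval_s` seconds apart."""
    for i in range(max_polls):
        for record in fetch():
            yield record
        if i < max_polls - 1:
            time.sleep(interval_s)


def fake_quality_interface() -> List[Record]:
    """Stand-in for a third-party data quality component interface (assumption)."""
    return [{"category": "data_quality", "asset_id": "A-001", "payload": {"score": 92}}]


collected = list(poll_component(fake_quality_interface, interval_s=0.01, max_polls=3))
print(len(collected))
```

A production collector would typically run without a fixed poll limit and write each record to the component runtime database; the limit here only keeps the example finite.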
In step S102, the component data continuously collected in step S101 is distributed to the corresponding memory queues according to a preset distribution policy. For example, the data may be distributed according to the type of component (or of component data) to a data quality memory queue, a data security memory queue or a data application memory queue; if the data collected from a third-party data governance component is data quality component data, it is distributed to the data quality memory queue. Routing the component data of different components held in the preset component runtime database to the corresponding memory queues, where it is processed asynchronously by the corresponding processing threads (such as the background thread for the data quality memory queue), improves the efficiency of processing large volumes of heterogeneous component data.
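The routing of component data into per-category memory queues, each drained asynchronously by its own background thread, can be sketched in Python with the standard `queue` and `threading` modules. This is a minimal sketch: the category keys, the record fields and the `None` stop sentinel are assumptions made for the example, and appending to a list stands in for the real analysis.

```python
import queue
import threading
from typing import Dict, List

# Hypothetical category keys for the three memory queues.
CATEGORIES = ("data_quality", "data_security", "data_application")

queues: Dict[str, queue.Queue] = {c: queue.Queue() for c in CATEGORIES}
processed: Dict[str, List[dict]] = {c: [] for c in CATEGORIES}


def dispatch(record: dict) -> None:
    """Route one record to the memory queue matching its category."""
    queues[record["category"]].put(record)


def worker(category: str) -> None:
    """Background thread: drain one queue until a None sentinel arrives."""
    q = queues[category]
    while True:
        record = q.get()
        if record is None:
            break
        processed[category].append(record)  # stand-in for the real analysis


threads = [threading.Thread(target=worker, args=(c,)) for c in CATEGORIES]
for t in threads:
    t.start()

for rec in [
    {"category": "data_quality", "asset_id": "A-001"},
    {"category": "data_security", "asset_id": "A-002"},
    {"category": "data_quality", "asset_id": "A-003"},
]:
    dispatch(rec)

for c in CATEGORIES:
    queues[c].put(None)  # tell each worker to stop
for t in threads:
    t.join()

print({c: len(v) for c, v in processed.items()})
```

Giving each queue a dedicated thread keeps a slow analysis in one category from blocking the others, which is the efficiency gain the paragraph above describes.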
The preset data asset catalog in step S103 is a catalog established by data asset managers, who determine the data asset categories according to the business characteristics of the organization and establish a unified data asset classification standard.
The data asset identification implemented in step S201 refers to the process by which an organization specifies, identifies and manages all of its data assets. Data asset identification is critical to data governance, compliance, security and efficient data management. Based on the NLP algorithm, the semantic information of the metadata in the data source (such as table names, table comments, field names, field comments, field contents and data rows) is identified to obtain information about the data assets (including data asset names, information items, summaries and other attributes). Because the metadata of the business system is synchronized automatically and intelligent semantic recognition is performed on the data resources in newly added or modified business databases, data asset identification is automated and its labor cost is greatly reduced.
Through step S202 and step S203, each data asset in the semantic information is vectorized, that is, the name, the information item, the abstract and other attributes of the data asset acquired in step S201 are vectorized, so as to obtain each data asset vector.
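The source does not say which vectorization technique is used. As one possible sketch, the name, information items, and abstract of a data asset can be turned into a fixed-length vector with a hashed bag-of-words; the dimension `DIM` and the function names are assumptions made for illustration.

```python
import hashlib
import math
import re

DIM = 64  # hypothetical vector dimension


def _bucket(token):
    """Map a token to a stable bucket index (hashlib is deterministic across runs)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % DIM


def vectorize_asset(name, info_items, summary):
    """Return an L2-normalised hashed bag-of-words vector for one data asset."""
    text = " ".join([name, *info_items, summary]).lower()
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text):
        vec[_bucket(token)] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

A production system would more likely use a learned text embedding; the hashed representation is chosen here only because it needs no external model.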
In one embodiment, for each data asset vector, the data asset vector may be classified and labeled (tagged) by a preset classification algorithm, thereby determining a class and tag of the data asset, and storing the data asset under a corresponding preset asset class in a preset data asset directory according to the class and tag. The preset classification algorithm may select a classification algorithm such as a decision tree, a support vector machine, or a conditional random field, which is not specifically limited herein.
In another embodiment, for each data asset vector, the similarity between the data asset vector and each preset data asset class vector may be calculated (e.g., from the cosine distance). When the similarity between the data asset vector and a certain preset asset class vector is greater than a preset similarity threshold, the data asset is considered to fall within the scope of that class and can be attached under the preset asset class.
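The similarity-based embodiment can be sketched as below. The threshold value of 0.8, the choice of the single best-matching class, and the names `cosine_similarity` and `match_category` are assumptions; the source only requires that the similarity exceed a preset threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_category(asset_vector, category_vectors, threshold=0.8):
    """Return (category, similarity) for the best class above the threshold, else (None, best)."""
    best_name, best_sim = None, -1.0
    for name, vec in category_vectors.items():
        sim = cosine_similarity(asset_vector, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim > threshold:
        return best_name, best_sim
    return None, best_sim
```

An asset that matches no class above the threshold is returned with `None`, so that it can be left for manual classification instead of being attached to a poorly fitting class.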
In the embodiment of the application, the collected component data carries the preset asset ID of the corresponding data asset, and the data asset corresponding to each component data can be obtained from the classified storage of the preset data asset catalogue through the preset asset ID.
Through steps S201 to S203, a unified data asset classification standard is established, classified data assets are managed in a unified way, and data security is safeguarded.
In a specific embodiment, in step S103, the data assets may also be stored by category in the preset data asset catalog as follows:
Data assets entered in a preset format are acquired and stored by category in the preset data asset catalog. The data assets of the organization can be entered manually.
Manual entry supports data assets provided in a preset format in files such as Excel, TXT, and XML, which are then stored by category in the preset data asset catalog. A data asset is any data that is valuable to an organization; it may encompass structured data (e.g., tables in a database), semi-structured data (e.g., XML files), unstructured data (e.g., text files, images, audio or video files), and so forth.
In the step S104, for each component data in each memory queue, based on the component data and the corresponding data asset, a relevant data index result of each dimension of the data asset may be obtained by analysis, and the relevant data index result may be displayed through a preset visual graphical interface, so as to implement full life cycle management of the data asset.
In one embodiment, as shown in fig. 3, the data quality component data index result, the data security component data index result, and the data application component data index result are taken as examples, and the step S104 specifically includes the following steps S1041 to S1043:
S1041: for each data quality component data in the data quality memory queue, analyze the quality score, quality problems, and change trend of the data asset based on the data quality component data and the corresponding data asset, and display the quality score, quality problems, and change trend of the data asset.
S1042: for each data security component data in the data security memory queue, analyze the security classification, sensitivity level, and security policy of the data asset based on the data security component data and the corresponding data asset, and display the security classification, sensitivity level, and security policy of the data asset.
S1043: for each data application component data in the data application memory queue, perform aggregate statistics on the data application component data based on the corresponding preset asset ID to obtain the data processing information of each data asset in each time period, analyze the data processing information, and display the analysis result.
In the embodiment of the present application, the steps S1041, S1042 and S1043 may be sequentially executed or may be executed in parallel, and the execution sequence thereof may not be specifically limited.
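Since steps S1041 to S1043 have no fixed order, a caller can choose between the two execution modes at run time. The sketch below assumes each step is wrapped as a zero-argument callable; `run_analyses` and the dictionary keys are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_analyses(steps, parallel=True):
    """Run the analysis callables (e.g. S1041 to S1043) and return results keyed by step name."""
    if not parallel:
        return {name: fn() for name, fn in steps.items()}
    # One worker per step; max(1, ...) avoids an invalid pool size for an empty mapping.
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        futures = {name: pool.submit(fn) for name, fn in steps.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Both modes return the same mapping, so the display logic that follows does not need to know which mode was used.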
In step S1041, for each data quality component data in the data quality memory queue, the quality score, quality problems, and change trend of the data asset are obtained by analysis based on the data quality component data and the corresponding data asset. They are then displayed by category using pie charts, graphs, bar charts, and tables.
In a specific embodiment, as shown in fig. 4, after analyzing the quality score, quality problem and variation trend of the data asset based on the data quality component data and the corresponding data asset for each data quality component data in the data quality memory queue, the following steps S401 to S402 are further included:
S401: according to the quality problem of the data asset, the influencing factors of the quality problem are determined based on a preset decision tree.
S402: based on the attribution analysis method, the root cause of the quality problem of the data asset is determined according to the influencing factors of the quality problem of the data asset.
In the embodiment of the present application, after the quality problem (i.e., the data quality problem) of the data asset is obtained by the analysis in the step S1041, the attribution analysis is performed on the quality problem of the data asset in the steps S401 and S402, so that the data quality problem in the data management is comprehensively and deeply understood and solved. The attribution analysis can use a preset decision tree, and judges the reason for causing the data quality problem according to the type of the data quality problem (such as inconsistency, inaccuracy, non-standardization, poor timeliness and the like) and the data quantity of the quality problem.
The specific process of finding the cause of a data quality problem in data governance through the preset decision tree and attribution analysis is as follows:
First, in the data preparation phase, data related to data governance, such as data input, processing, output results, and user feedback, needs to be collected and cleaned.
Then, when the preset decision tree is used for preliminary analysis, a preset decision tree model is constructed, influencing factors of quality problems of the data asset are identified through feature selection, and the preset decision tree structure is analyzed to know the mode of the data quality problems.
Subsequently, in the attribution analysis stage, the influencing factors found by the preset decision tree analysis are taken as input, attribution analysis techniques are used to examine in depth how these factors interact to cause data quality problems, and the root causes of the quality problems of the data assets are identified.
Finally, in the result interpretation and application stage, the decision tree and the result of attribution analysis are comprehensively interpreted, specific improvement suggestions are provided, and monitoring is implemented to ensure that the data quality is continuously improved.
By combining the structured decision path of the preset decision tree with the in-depth exploration of attribution analysis, the above process makes it possible to understand and resolve data quality problems in data governance comprehensively and in depth.
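The two-stage analysis above can be illustrated with a small hand-written tree followed by a simple weighted attribution. The source fixes only the inputs (problem type and the amount of affected data); the threshold `BULK_THRESHOLD`, the factor names, and the share-of-affected-rows ranking used here are assumptions standing in for the unspecified preset decision tree and attribution method.

```python
from collections import Counter

# Hypothetical threshold and factor names; the source does not specify them.
BULK_THRESHOLD = 1000


def influencing_factor(problem_type, affected_rows):
    """Walk a small decision tree: split on problem type first, then on data volume."""
    bulk = affected_rows >= BULK_THRESHOLD
    if problem_type == "inconsistency":
        return "source system mismatch" if bulk else "manual entry error"
    if problem_type == "inaccuracy":
        return "faulty transformation rule" if bulk else "manual entry error"
    if problem_type == "non-standard":
        return "missing data standard"
    if problem_type == "poor timeliness":
        return "delayed synchronization job"
    return "unknown"


def root_cause(problems):
    """Rank influencing factors by their share of all affected rows (largest first)."""
    weights = Counter()
    for problem_type, affected_rows in problems:
        weights[influencing_factor(problem_type, affected_rows)] += affected_rows
    total = sum(weights.values())
    if not total:
        return []
    return [(factor, rows / total) for factor, rows in weights.most_common()]
```

The first entry of the ranking is the candidate root cause; the remaining entries show how much of the problem volume the other factors explain.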
In step S1042, for each data security component data in the data security memory queue, the security classification, sensitivity level, and security policy of the data asset are obtained based on the data security component data and the corresponding data asset. They are then displayed by category using pie charts, graphs, bar charts, and tables.
In the embodiment of the present application, after analyzing the security classification, the sensitivity level and the security policy of the data asset based on the data security component data and the corresponding data asset for each data security component data in the data security memory queue, as shown in fig. 5, the method further includes the following steps S501 to S504:
S501: for each data asset, vectorizing the data asset name field and the data field to obtain a data asset name field vector and a data field vector.
S502: judge, using the trained classifier, whether the data asset name field vector and the data field vector represent sensitive information; if so, execute S503, and if not, execute S504.
S503: mark the sensitive information of the data asset according to a preset security management policy, and notify an administrator through an in-site message.
S504: the data asset need not be annotated.
In the embodiment of the application, intelligent identification of data security risks is realized as follows: based on the metadata of the data asset (such as table name, table annotation, field name, field annotation, and field content), the data asset (data asset name field) and the data fields are vectorized through NLP technology to obtain a data asset name field vector and a data field vector. A pre-trained classifier then judges whether the data asset and the data fields are sensitive information; data assets that are sensitive but have not been assigned a corresponding security policy are marked, and the data asset manager is alerted through an in-site message that a data security risk has been found.
The specific implementation mode of the intelligent identification of the data security risk is as follows:
First, a classifier is trained. A classifier is selected, for example a conditional random field (CRF); as a probabilistic model, the CRF is particularly suited to sequence data and can improve the accuracy of sensitive information identification by taking context into account. Next, training samples are prepared: various types of text data are collected, and professionals label whether each is sensitive. The text data is then processed with NLP techniques to extract key features such as keywords, phrases, and syntactic structures. Finally, the features and their labels are used as training data, and the CRF classifier learns the relationship between the features and sensitive information, yielding a trained CRF model.
Second, the data assets and data fields are classified by the trained CRF model to judge whether they are sensitive information, and the identified sensitive information is handled according to a predefined security management policy, for example by encryption, access control, or logging.
The whole process of intelligent identification of data security risks requires the first and second steps to be executed repeatedly, with the model adjusted to adapt to new data types and potential threats, so that the identification and protection of sensitive information remain in an optimal state.
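The control flow of steps S501 to S504 can be sketched as follows. To keep the example self-contained, a keyword matcher is swapped in for the trained CRF classifier; it is not a CRF and is not the classifier of the application. The keyword set, the asset dictionary layout, and the `notify` callback (standing in for the in-site message) are all assumptions.

```python
import re

# Hypothetical keywords standing in for what a trained classifier would learn.
SENSITIVE_TOKENS = {"id_card", "phone", "passport", "salary", "password", "bank"}


def features(name_field, data_fields):
    """S501 stand-in: turn the asset name field and data fields into a token set."""
    text = " ".join([name_field, *data_fields]).lower()
    return set(re.findall(r"[a-z]+(?:_[a-z]+)*", text))


def is_sensitive(tokens):
    """S502 stand-in: flag the asset when any token contains a sensitive keyword."""
    return any(key in token for token in tokens for key in SENSITIVE_TOKENS)


def review_asset(asset, notify):
    """S503/S504: mark sensitive assets that lack a security policy and notify the administrator."""
    tokens = features(asset["name"], asset["fields"])
    if is_sensitive(tokens) and not asset.get("security_policy"):
        asset["sensitive"] = True
        notify(f"Data security risk found in asset {asset['id']}")
        return True
    return False  # S504: nothing to annotate
```

Replacing `is_sensitive` with a call to a trained model leaves the rest of the flow unchanged, which is the point of separating the feature step from the decision step.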
The data security risk intelligent identification process solves the problems of high data security assessment cost and high risk in the prior art. The inventor finds that in the prior art, when data security evaluation is performed, intelligent data security level identification cannot be performed according to a preset data classification hierarchy, automatic security policy matching cannot be achieved, professional knowledge of data security specialists and deep understanding of data business meanings are seriously relied on, and once related human resources are limited or human errors occur, serious potential safety hazards can be caused to data assets of an organization. In this regard, the inventor realizes the intellectualization of the data asset security assessment through the intelligent identification of the data security risk, and the system automatically recommends corresponding security strategies according to different data security requirements, so that compared with the mode of the prior art, the labor cost of the data security assessment and the risk brought by human errors are reduced.
In an embodiment, as shown in fig. 6, in the step S1043, for each data application component data in the data application memory queue, aggregate statistics is performed on the data application component data based on a corresponding preset asset ID, so as to obtain and analyze data processing information of each data asset in each time period, and the obtained analysis result is displayed, which specifically includes the following steps S10431 to S10434:
S10431: for each data application component data in the data application memory queue, perform aggregate statistics on the data application component data based on the corresponding preset asset ID to obtain the data application information and data use information of each data asset in each time period.
S10432: analyze the supply and demand of each data asset according to the data application information to obtain an application information analysis result.
S10433: analyze the utilization rate and application effect of each data asset in use according to the data use information to obtain a use information analysis result.
S10434: display the application information analysis result and the use information analysis result.
In the embodiment of the application, for each data application component data (including data use, shared exchange and the like) in the data application memory queue, the visual data display and analysis are realized through steps S10431 to S10434 from the view point of supply and demand docking and data enabling.
Through the visual data display and analysis processes of the steps S10431 to S10434, a data resource manager can be helped to grasp the profile of the data assets in the organization in a visual manner, and meanwhile, the problems and bottlenecks of the data resources in the management and application processes can be rapidly analyzed and mined.
In a specific embodiment, the above step S1043 may also be implemented directly in the actual implementation process by the following steps:
First, data application component data (including data use and application data of a third party sharing exchange system, a data interface service system, a file downloading system and the like) are aggregated and counted according to preset asset IDs, and data application information and data use information of each data asset in each time period are obtained. The data application information comprises an application resource ID, an application applicant, application time, application content, an approval state, whether approval is passed, an approval passing reason, an approval rejection reason and the like; the data usage information includes the data asset ID used, the data user, the data usage time, the library table exchange amount, the interface call amount, the number of file downloads, and the like.
Then, based on the data application information and the data use information, a front-end data visualization stack (Vue + ECharts) is used to develop a large visualization screen for data supply-demand docking and data enabling; the screen supports data display, chart display, map display, data drill-down, multi-region data linkage, and the like. From the data application information, the supply and demand of data can be analyzed, including identifying the users who supply the most data resources, the data resources that are applied for most often, and the data resource demands that remain unmet. From the data use information, the utilization rate and application effect of data in use can be analyzed, including identifying the data assets used at high frequency, the data assets that were applied for but not used, the users who use data assets most frequently, and the correlation between the quality scores and utilization rates of data assets.
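The aggregation in the first of the two steps above can be sketched as follows. The record field names (`asset_id`, `used_at`, `table_exchange`, `interface_calls`, `file_downloads`) and the monthly period are assumptions; the source lists the kinds of usage information but not their schema or the length of a time period.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_usage(records, period_format="%Y-%m"):
    """Sum usage counters per (asset_id, period); the default period is a calendar month."""
    totals = defaultdict(
        lambda: {"table_exchange": 0, "interface_calls": 0, "file_downloads": 0}
    )
    for rec in records:
        # 'used_at' is assumed to be an ISO 8601 timestamp string.
        period = datetime.fromisoformat(rec["used_at"]).strftime(period_format)
        bucket = totals[(rec["asset_id"], period)]
        for counter in bucket:
            bucket[counter] += rec.get(counter, 0)
    return dict(totals)
```

Changing `period_format` (for example to `"%Y-%m-%d"`) switches the time period without touching the rest of the function.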
Through the visual data display and analysis process, the visual mining technology based on the data can help data resource management personnel grasp the outline of the data assets in the organization in a visual mode, and meanwhile, the problems and bottlenecks of the data resources in the management and application process can be rapidly analyzed and mined.
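Some of the analyses named above (most-applied data resources, assets applied for but not used, and the correlation between quality score and utilization) reduce to short computations over the aggregated data. The sketch below assumes simplified inputs: each application is a dictionary with hypothetical `asset_id` and `approved` keys, and usage is a mapping from asset ID to a usage count. Pearson correlation is used as one possible correlation measure; the source does not name one.

```python
import math
from collections import Counter


def most_applied(applications, top=3):
    """Return the data assets with the most applications as (asset_id, count) pairs."""
    return Counter(a["asset_id"] for a in applications).most_common(top)


def applied_but_unused(applications, usage_totals):
    """Return asset IDs whose application was approved but whose usage count is zero or missing."""
    approved = {a["asset_id"] for a in applications if a["approved"]}
    used = {asset_id for asset_id, count in usage_totals.items() if count > 0}
    return sorted(approved - used)


def pearson(xs, ys):
    """Pearson correlation of two equal-length samples (0.0 if either has no variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson` with the quality scores and utilization rates of the same assets, in the same order, gives a value between -1 and 1 that can be shown alongside the charts.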
The data asset management method provided by the embodiment of the application follows the national standard Data Management Capability Maturity Assessment Model (DCMM). From a functional point of view, the method can be implemented as a data asset management system comprising a data component management module, a data asset catalog management module, a data quality management module, a data security management module, and a data application management module. The functional architecture shown in fig. 7 is divided into a component layer, a control layer, an asset layer, a management and control layer, and an application layer.
The component layer comprises various types of third-party components (a data storage component, a data calculation component, a data management component, a data security component, and a data service component); the control layer comprises the data component management module; the asset layer comprises the data asset catalog management module; the management and control layer comprises the data quality management module and the data security management module; and the application layer comprises the data application management module. The specific roles of the functional modules are as follows:
Data component management module: manages third-party big data components; it can collect the metadata, runtime data, and task execution result data of third-party components, and can also issue instructions to them.
Data asset catalog management module: manages the preset data asset catalog within the organization and supports multiple ways of entering catalog data, such as manual entry and intelligent identification based on data sources. Data assets within an organization include library tables, interfaces, files, and the like.
Data quality management module: based on the collected component data and the quality information of the data assets, displays the quality scores, quality problems, and change trends of the data assets by category, and provides attribution analysis of data quality problems.
Data security management module: based on the collected component data and the security information of the data assets, displays the security classification, sensitivity level, and security policy of the data assets by category, and provides intelligent identification of data security risks.
Data application management module: provides visual data display and analysis for data supply-demand docking, data enabling, and related aspects of the data application process.
Through their interaction, the above functional modules implement the data asset management process of steps S101 to S104; repeated details are not described again here.
In one embodiment, the data asset management system formed by the above functional architecture and functional modules can be developed with the Spring Boot + Vue & Element + MyBatis-Plus framework stack, based on the B/S mode, the MVC pattern, and a three-tier architecture; the technical architecture is shown in fig. 8. By analyzing the internal structure of the platform and assigning different functions to different layers, the richness and usability of the platform's functions are improved. Layering decomposes the software structure into multiple layers according to their logical relationships; each layer has its own function, and the layers combine to form the complete software. The layers are also largely independent of one another: when software functions need to be completed or the software needs to be upgraded or modified, only the relevant layers need to be changed, with almost no effect on the other layers, which simplifies upgrade and modification work and improves efficiency. In general, a layered design defines the interfaces between functional layers. This design pattern greatly improves software reusability and benefits in-house software development and design. After development, each layer can make full use of standard interfaces, so that the layers can be connected automatically.
Example two
Based on the same inventive concept, an embodiment of the present application further provides a data asset management device, referring to fig. 9, the device includes:
the acquisition module 101 is used for continuously acquiring the component data of the third party component; the component data comprises data quality component data, data security component data and data application component data;
The distributing module 102 is configured to distribute, for each component data, the component data to a corresponding memory queue according to a class of the component data; the memory queues comprise a data quality memory queue, a data security memory queue and a data application memory queue;
An obtaining module 103, configured to obtain, from a preset data asset catalog, a data asset corresponding to each component data based on a preset asset ID in each component data; wherein the data assets are obtained and stored by category in the preset data asset catalog through the following steps: metadata of a data source is synchronized in real time, semantic information of the metadata is identified based on an NLP algorithm, and names, information items and abstracts of data assets are obtained; the names, information items and abstracts of the data assets are vectorized to obtain data asset vectors; the data asset vectors are classified and labeled based on a preset classification algorithm, and the data assets are stored by corresponding category in the preset data asset catalog;
The analysis and display module 104 is configured to, for each component data in each memory queue, analyze and obtain a related data index result of the data asset based on the component data and the corresponding data asset, and display the related data index result.
The acquisition module 101 may perform the process of step S101 in embodiment one, the distributing module 102 may perform the process of step S102, the obtaining module 103 may perform the process of step S103, and the analysis and display module 104 may perform the process of step S104; repeated details are not described again here.
Example three
Based on the same inventive concept, an embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data asset management method as described in the above embodiment one.
Example four
Based on the same inventive concept, an embodiment of the present application also provides a computer device, including a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data asset management method as described in the above embodiment one when executing the computer program.
Example five
Based on the same inventive concept, embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer device, cause the computer device to perform the data asset management method as described in the above embodiment one.
Example six
Based on the same inventive concept, an embodiment of the present application further provides a chip, the chip including a processor and a communication interface, the communication interface and the processor being coupled, the processor being configured to execute a computer program or instructions to implement the data asset management method as described in the above embodiment one.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method of data asset management comprising:
continuously collecting component data of a third party component; the component data comprises data quality component data, data security component data and data application component data;
distributing, for each component data, the component data to a corresponding memory queue according to the category of the component data; the memory queues comprise a data quality memory queue, a data security memory queue and a data application memory queue;
acquiring data assets corresponding to each component data from a preset data asset catalog based on preset asset IDs in each component data; wherein the data assets are obtained and stored by category in the preset data asset catalog through the following steps:
synchronizing metadata of a data source in real time, identifying semantic information of the metadata based on an NLP algorithm, and obtaining names, information items and abstracts of data assets;
vectorizing the names, the information items and the abstracts of the data assets to obtain data asset vectors;
classifying and labeling the data asset vectors based on a preset classification algorithm, and storing the data assets by corresponding category in the preset data asset catalog;
and for each component data in each memory queue, analyzing and obtaining a related data index result of the data asset based on the component data and the corresponding data asset, and displaying the related data index result.
2. The method of claim 1, wherein for each of the component data in each of the memory queues, analyzing to obtain a relevant data index result for the data asset based on the component data and the corresponding data asset, and displaying the relevant data index result, comprises:
for each data quality component data in the data quality memory queue, analyzing and obtaining the quality score, quality problem and change trend of the data asset based on the data quality component data and the corresponding data asset, and displaying the quality score, quality problem and change trend of the data asset;
for each data security component data in the data security memory queue, based on the data security component data and the corresponding data asset, analyzing to obtain the security classification, the sensitivity level and the security policy of the data asset, and displaying the security classification, the sensitivity level and the security policy of the data asset;
and for each data application component data in the data application memory queue, carrying out aggregation statistics on the data application component data based on the corresponding preset asset ID to obtain and analyze the data processing information of each data asset in each time period, and displaying the obtained analysis result.
3. The method of claim 2, further comprising, for each data quality component data in the data quality memory queue, after analyzing the quality scores, quality problems, and trends of the data assets based on the data quality component data and corresponding data assets:
Determining an influence factor of the quality problem of the data asset based on a preset decision tree according to the quality problem of the data asset;
based on the attribution analysis method, the root cause of the quality problem of the data asset is determined according to the influencing factors of the quality problem of the data asset.
4. The method of claim 2, further comprising, after the security classification, sensitivity level and security policy of the data asset are obtained by analysis, for each data security component data in the data security memory queue, based on the data security component data and the corresponding data asset:
for each data asset, vectorizing the data asset name field and the data field to obtain a data asset name field vector and a data field vector;
determining, using a trained classifier, whether the data asset name field vector and the data field vector correspond to sensitive information;
if so, marking the sensitive information of the data asset according to a preset security management policy, and notifying an administrator via an in-site message.
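The sensitive-information check of claim 4 might look like the following sketch. It assumes a tiny fixed vocabulary for the name field vector, a single digit-ratio feature for the data field vector, and a nearest-centroid classifier standing in for the "trained classifier"; the training samples and the `inbox` list used as an in-site message box are likewise hypothetical.

```python
import math

VOCAB = ("id", "card", "phone", "amount", "date")  # hypothetical vocabulary

def vectorize(name, value):
    """Concatenate a name field vector and a data field vector."""
    tokens = set(name.lower().split("_"))
    name_vec = [1.0 if word in tokens else 0.0 for word in VOCAB]
    digits = sum(ch.isdigit() for ch in value)
    value_vec = [digits / len(value) if value else 0.0]
    return name_vec + value_vec

def train_centroids(samples):
    """'Train' by averaging the vectors of each label (True = sensitive)."""
    sums, counts = {}, {}
    for name, value, label in samples:
        vec = vectorize(name, value)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}

def is_sensitive(centroids, name, value):
    """Return the label of the nearest centroid."""
    vec = vectorize(name, value)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))

def mark_and_notify(centroids, fields, inbox):
    """Mark sensitive fields and queue one notification per marked field."""
    marked = []
    for name, value in fields:
        if is_sensitive(centroids, name, value):
            marked.append(name)
            inbox.append(f"Sensitive field detected: {name}")
    return marked

# Hypothetical labeled samples: (field name, sample value, is_sensitive).
TRAINING = [
    ("id_card", "420106199001011234", True),
    ("phone", "13800000000", True),
    ("order_amount", "19.90", False),
    ("create_date", "2024-02-26", False),
]
```

In practice the classifier would be trained offline on far more data; only the vectorize, classify, mark, notify sequence mirrors the claim.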
5. The method of claim 2, wherein, for each data application component data in the data application memory queue, performing aggregate statistics on the data application component data based on the corresponding preset asset ID to obtain and analyze data processing information of each data asset in each time period, and displaying the obtained analysis result, comprises:
for each data application component data in the data application memory queue, carrying out aggregation statistics on the data application component data based on the corresponding preset asset ID to obtain data application information and data use information of each data asset in each time period;
analyzing the supply and demand of each data asset according to the data application information to obtain an application information analysis result;
analyzing, according to the data use information, the utilization rate and application effect of each data asset during use to obtain a use information analysis result;
and displaying the application information analysis result and the use information analysis result.
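The aggregation step of claim 5 can be illustrated with a short sketch. The record layout (`asset_id`, ISO-formatted `time`, and a `kind` of `"applications"` or `"uses"`), the one-day time period, and the utilization formula (uses divided by applications) are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime

def aggregate(records):
    """Count applications and uses per (asset ID, day)."""
    stats = defaultdict(lambda: {"applications": 0, "uses": 0})
    for record in records:
        day = datetime.fromisoformat(record["time"]).date().isoformat()
        stats[(record["asset_id"], day)][record["kind"]] += 1
    return dict(stats)

def utilization(stats):
    """Assumed metric: uses per application, where applications exist."""
    return {key: counts["uses"] / counts["applications"]
            for key, counts in stats.items() if counts["applications"]}
```

Grouping by the pair (asset ID, period) keeps the result easy to pivot either by asset or by time when it is displayed.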
6. The method of claim 1, wherein continuously collecting component data for a third party component comprises:
collecting component data of the third-party component by periodically calling a standardized interface provided by the third-party component;
or,
collecting component data of the third-party component in real time by capturing the metadata database log, based on the access mode of the metadata database of the third-party component and the metadata dictionary.
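The first collection option in claim 6, periodic calls to a standardized interface, could be sketched as below. The `fetch` callable stands in for whatever interface the third-party component exposes, and the bounded `rounds` parameter is an assumption added so the example terminates; continuous collection would loop indefinitely.

```python
import time

def collect_periodically(fetch, interval_s, rounds, sleep=time.sleep):
    """Call fetch() once per round and yield every record it returns.

    fetch:      hypothetical callable wrapping the component's interface;
                it returns an iterable of records.
    interval_s: pause between consecutive calls, in seconds.
    sleep:      injectable for testing; defaults to time.sleep.
    """
    for i in range(rounds):
        for record in fetch():
            yield record
        if i + 1 < rounds:
            sleep(interval_s)
```

Because the function is a generator, downstream distribution can start as soon as the first record arrives instead of waiting for a full batch.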
7. The method of claim 1, further comprising acquiring data assets and storing the data assets by category in the preset data asset catalog by:
acquiring data assets entered in a preset format, and storing the data assets by category in the preset data asset catalog.
8. A data asset management device, comprising:
The collection module is used for continuously collecting the component data of the third-party component; the component data comprises data quality component data, data security component data and data application component data;
The distribution module is used for distributing the component data to the corresponding memory queues according to the types of the component data; the memory queues comprise a data quality memory queue, a data security memory queue and a data application memory queue;
The retrieval module is used for acquiring the data asset corresponding to each component data from a preset data asset catalog based on the preset asset ID in each component data; wherein the data assets are acquired and stored by category in the preset data asset catalog as follows: metadata of a data source is synchronized in real time, semantic information of the metadata is identified based on an NLP algorithm, and the names, information items and abstracts of the data assets are obtained; the names, information items and abstracts of the data assets are vectorized to obtain data asset vectors; the data asset vectors are classified and labeled based on a preset classification algorithm, and the corresponding data asset categories are stored in the preset data asset catalog;
The analysis display module is used for analyzing and obtaining related data index results of the data assets based on the component data and the corresponding data assets for each component data in each memory queue, and displaying the related data index results.
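The distribution and retrieval modules of claim 8 can be pictured with a small sketch. The category labels, the `category` and `asset_id` field names, and the dictionary-based catalog are assumptions for illustration; `queue.Queue` from the Python standard library is used as a stand-in for a memory queue.

```python
import queue

CATEGORIES = ("quality", "security", "application")  # assumed category labels

def make_queues():
    """Create one in-memory queue per component data category."""
    return {category: queue.Queue() for category in CATEGORIES}

def distribute(record, queues):
    """Route a record to the queue matching its category.

    Returns False when the category is unknown so the caller can log it.
    """
    target = queues.get(record.get("category"))
    if target is None:
        return False
    target.put(record)
    return True

def lookup_asset(record, catalog):
    """Fetch the data asset for a record by its preset asset ID."""
    return catalog.get(record["asset_id"])
```

Keeping the three categories in separate queues lets each analysis path run at its own pace without blocking the others.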
9. A computer readable storage medium having stored therein a computer program which, when executed by a processor, causes the processor to perform the data asset management method of any of claims 1-7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data asset management method of any of claims 1-7 when the computer program is executed.
CN202410206475.XA 2023-12-28 2024-02-26 Data asset management method, device, equipment and medium Pending CN118096387A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202311844876X 2023-12-28
CN202311844876 2023-12-28

Publications (1)

Publication Number Publication Date
CN118096387A (en) 2024-05-28

Family

ID=91152657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410206475.XA Pending CN118096387A (en) 2023-12-28 2024-02-26 Data asset management method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN118096387A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination