CN116911671A - Data asset operation efficiency evaluation method and system - Google Patents


Info

Publication number
CN116911671A
CN116911671A
Authority
CN
China
Prior art keywords
data
metadata
index
information
fields
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310802066.1A
Other languages
Chinese (zh)
Inventor
齐宁
周云松
王治平
茅天天
王子青
华伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu United Credit Reference Co ltd
Original Assignee
Jiangsu United Credit Reference Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu United Credit Reference Co ltd filed Critical Jiangsu United Credit Reference Co ltd
Priority to CN202310802066.1A
Publication of CN116911671A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Educational Administration (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data asset operation efficiency evaluation method. A data source is collected and imported into a database; the warehoused data are extracted and stored as metadata, and flagged metadata update records are formally registered after correction or audit. During metadata generation and registration, the metadata and lineage management module sends metadata registration messages to the global event message queue, and the data governance module, the data application module, and the data asset operation efficiency analysis reporting module each receive the notification and execute corresponding actions or update their own data. Metadata information, lineage information, and various indices of the global data are tracked and computed. Quantification and automation reduce the resource investment of data asset operation activities while improving the accuracy and pertinence of asset operation strategies, greatly raising the overall efficiency of the work.

Description

Data asset operation efficiency evaluation method and system
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a data asset operation efficiency evaluation method and system.
Background
How to fully leverage data assets and expert experience to carry out effective, efficient asset operation activities is a major challenge for many data-centric enterprises. In actual production practice, data asset operation is usually performed by experts in the data management department and involves a large amount of up-front data analysis, storage model design, system construction, effect evaluation, system correction, and so on. Depending on the overall complexity, coverage, and richness of the data assets and the accumulated expert experience in the relevant field, the working cycle of data asset operation varies from months to much longer; tracking asset operation activities lacks efficient means or tool support for effect assessment, and the industry has no relatively standard practical reference in this field. As a result, evaluating data asset operation efficiency has always consumed huge resources for a relatively low return, and lacks the support of quantitative means.
While fully developing their business by means of digital technology, financial institutions (hereinafter, institutions) are rapidly expanding their ability to collect and store data, accumulating a large amount of raw data resources and realizing business datamation in the process. However, merely holding a large amount of raw data does not by itself put an institution in an unassailable position in market competition. Statistics indicate that about 40% of the data is never utilized; this portion of the data creates no actual value while occupying the institution's valuable computing and storage resources.
Disclosure of Invention
In order to achieve the above purpose, the technical scheme of the invention is as follows: a data asset operation efficiency evaluation method comprising the following steps.
The acquired data source is imported into a database, the warehoused data are extracted and stored as metadata, and flagged metadata update records are formally registered after correction or audit;
during metadata generation and registration, the metadata and lineage management module sends metadata registration messages to the global event message queue, and the data governance module, the data application module, and the data asset operation efficiency analysis reporting module each receive the notification and execute corresponding actions or update their own data;
metadata information, lineage information, and various indices of the global data are tracked and computed; based on these, the data asset operation efficiency analysis reporting module provides analysis results at different granularities and integrates and outputs an evaluation conclusion.
Based on this scheme, data acquisition/import works as follows: the data acquisition module acquires and imports data from various data sources, such as files (txt/csv/custom, etc.), SQL scripts, databases, and APIs, into the database (the database is referred to generically here and not described in detail). When data are imported, the data acquisition module sends a data ingestion message (DataIngestionMessage) to the global event message queue; the metadata and lineage management module and the data governance module receive and consume this message and each carry out follow-up actions.
Metadata extraction: after data are warehoused (generating a data ingestion message), the metadata and lineage management module is triggered automatically (started by receiving the message queue notification) and executes a database scanning and analysis task (traversing the database and extracting related metadata); it describes the warehoused data, i.e., extracts and stores metadata (in the MetaStore database), and at the same time sends a metadata generation message (MetaGenerationMessage) to the global event message queue, which this module receives in order to execute subsequent actions. The metadata information in the MetaStore is updated in near real time according to notifications from the global event message queue; after each update the corresponding update record is flagged (i.e., marked dirty), and flagged metadata update records can only be formally registered after correction or audit, the registration being performed automatically by the system. The audit action may be completed manually or automatically by the system, typically manually the first time and automatically by the system thereafter. The extracted metadata information includes, but is not limited to: library, table, and field names, descriptions, owner information, format types, value ranges (value domains), etc.
Correction and registration: metadata information marked dirty in the MetaStore is manually corrected and audited, then finally registered in the MetaStore, while a metadata registration message (MetaRegistrationMessage) is sent to the global event message queue. Once metadata are registered, the initial lineage (genesis lineage) of libraries, tables, and fields has in effect been established. This initial lineage is the premise and basis for the metadata and lineage management module's subsequent analysis of the global data.
This process fully covers the business datamation phase, the data asset formation phase, and the asset productization phase. During metadata generation and registration, the metadata and lineage management module sends metadata registration messages (MetaRegistrationMessage) to the global event message queue, and the data governance module, the data application module, and the data asset operation efficiency analysis reporting module each receive the notification and execute corresponding actions or update their own data.
Preferably, in the data governance module, data are processed by integration operations (e.g., migration, splitting, merging), processing operations (e.g., cleaning, conversion, interception), and analysis operations (mathematical statistics); in the data application module, data are packaged as APIs and SDKs to provide services externally. As data flow between the different computing and storage hierarchies and into final applications, the steps comprise:
establishing related tuples (pairs of fields with a derivation relation) between all fields, and calculating the information loss rate index MR_IL according to field types or business requirements;
tracking the upstream and downstream fields of all recorded fields (which can be implemented with techniques such as doubly linked lists), concatenating them into the processing chains of all fields, and calculating the field chain complexity index M_CC, where the final chain complexity of a field takes the maximum of the chain complexities of each path it passes through;
tracking and recording the node liveness index M_A for the processing actions on all fields, i.e., counting the number of processing passes a specific field of a library table undergoes within an observation period or unit time of a given length (processing of a field generically covers reading, cleaning, conversion, migration, and so on);
calculating the lineage difference activity ratio index MR_DAL from the field-level liveness indices and the lineage data obtained from the processing chain: MR_DAL = Max{M_A,1..n} / N_Acc, where n is the number of nodes other than the terminal node, M_A is the node liveness index, and N_Acc is the number of processing or access operations on the terminal node;
performing periodic static analysis of the database tables generated in all the above processes (the same observation period as the other tracking computations may be used) and calculating all the quality indices.
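The lineage difference activity ratio defined above can be sketched in a few lines. This is a minimal illustration, not the patented implementation; the function and argument names are assumptions.

```python
def lineage_difference_activity_ratio(node_activity, terminal_access_count):
    """MR_DAL = Max{M_A,1..n} / N_Acc: the highest liveness index among the
    non-terminal lineage nodes, divided by the terminal node's processing or
    access count, per the formula in the text."""
    if terminal_access_count == 0:
        raise ValueError("terminal node has no recorded accesses")
    return max(node_activity) / terminal_access_count

# Example: three upstream nodes were processed 6, 9, and 3 times in the
# observation window; the terminal node was accessed 3 times.
ratio = lineage_difference_activity_ratio([6, 9, 3], 3)
```

A high ratio flags lineage paths whose upstream nodes churn far more than the terminal asset is actually consumed.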
Preferably, when analyzing adjacent upstream and downstream fields, SQL-type operations are identified by SQL syntax analysis and field extraction techniques, and non-SQL operations (typically operations on a system interface) are identified from system operation records; finally, combined with metadata store (MetaStore) matching, adjacent processing chains are generated and concatenated into complete processing chains. All processing chains are traversed, and the fan-in index M_FI and fan-out index M_FO of all fields are calculated.
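Given processing-chain edges, the fan-in and fan-out indices reduce to counting distinct neighbors per field. A minimal sketch (field naming `table.field` is an assumption):

```python
from collections import defaultdict

def fan_in_out(edges):
    """Compute fan-in (M_FI) and fan-out (M_FO) per field from directed
    processing-chain edges (source_field, target_field). Sets deduplicate
    repeated edges, matching the note that multiple uses of a field do not
    change its fan-out."""
    fi, fo = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        fi[dst].add(src)   # distinct upstream fields feeding dst
        fo[src].add(dst)   # distinct downstream fields formed from src
    return ({f: len(s) for f, s in fi.items()},
            {f: len(s) for f, s in fo.items()})

# Example from the text: t_1_1.f_2 and t_1_1.f_3 merge into t_2_1.f_2
# (fan-in 2); t_0_2.f_1 splits into t_1_2.f_1 and t_1_2.f_2 (fan-out 2).
edges = [("t_1_1.f_2", "t_2_1.f_2"), ("t_1_1.f_3", "t_2_1.f_2"),
         ("t_0_2.f_1", "t_1_2.f_1"), ("t_0_2.f_1", "t_1_2.f_2")]
m_fi, m_fo = fan_in_out(edges)
```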
Preferably, the inter-domain asset coverage index MR_CDAC is calculated from the processing chains and the metadata information of each table (obtained from the metadata and lineage management module). To compute MR_CDAC, the different computing and storage hierarchies are first called different "domains"; the domain a piece of data belongs to is marked with tags, which also describe the data's information.
Preferably, the quality-class indices comprise the null rate index MR_N, the error rate index MR_WR, the repetition rate index MR_Dup, the timeliness satisfaction index MR_Chr-S, and the timeliness overflow index MR_Chr-O.
Preferably, for a field, the null rate index is calculated within the table as MR_N1 = C_NF / C_A, where C_NF is the number of null values of the field in the table and C_A is the total number of records in the table; for a table, the table null rate is MR_N2 = Sum(C_NF) / (C_A × N_F), where Sum(C_NF) sums the null counts of all fields, N_F is the number of fields in the table, and C_A is the total number of records in the table.
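The two null-rate formulas map directly to code. A minimal sketch, assuming rows are dictionaries and null is represented as `None`:

```python
def field_null_rate(column):
    """MR_N1 = C_NF / C_A: null occurrences of one field over total records."""
    if not column:
        return 0.0
    return sum(1 for v in column if v is None) / len(column)

def table_null_rate(rows, field_names):
    """MR_N2 = Sum(C_NF) / (C_A * N_F): total nulls across all fields over
    (record count x field count)."""
    total_cells = len(rows) * len(field_names)
    if total_cells == 0:
        return 0.0
    nulls = sum(1 for row in rows for name in field_names
                if row.get(name) is None)
    return nulls / total_cells

rows = [{"name": "a", "city": None},
        {"name": None, "city": None},
        {"name": "c", "city": "NJ"}]
r1 = field_null_rate([row["city"] for row in rows])  # 2 nulls / 3 records
r2 = table_null_rate(rows, ["name", "city"])         # 3 nulls / 6 cells
```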
Preferably, the error rate index MR_WR covers two kinds of error values: generic error values, i.e., data anomalies that can be recognized without related business knowledge, such as garbled text, meaningless symbols, and data format errors; and business error values, i.e., data that violate business rules preset manually or by the system. The field error rate index is MR_WR1 = C_NF / C_A, where C_NF is the number of error values of the field in the table and C_A is the total number of records in the table; the table error rate is MR_WR2 = Sum(C_NF) / (C_A × N_F), where Sum(C_NF) sums the error-value counts of all fields, N_F is the number of fields in the table, and C_A is the total number of records in the table.
Preferably, for the repetition rate index MR_Dup, all data records are compared on the selected field/attribute set, keyed by the record primary key/unique key. The field repetition rate index is MR_Dup1 = Sum(N_Dup1) / N1, where N1 is the number of records identified by the record primary key, N_Dup1 is the number of records carrying a given repeated value of the field, and Sum(N_Dup1) is the sum over all repeated values. For a table, a digest is computed over all fields of a record other than the deduplication key; comparing the digests of all records gives the table repetition rate index MR_Dup2 = N_Dup2 / N2, where N_Dup2 is the number of repeats (i.e., the number of times an already-seen digest value occurs again) and N2 is the total number of table records.
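The digest-based table repetition rate can be sketched as follows; SHA-256 and the `|`-joined payload format are illustrative choices, not prescribed by the text:

```python
import hashlib

def table_repetition_rate(rows, key_field):
    """MR_Dup2 = N_Dup2 / N2: hash every field except the dedup key for each
    record; any record whose digest was already seen counts as a repeat."""
    seen, repeats = set(), 0
    for row in rows:
        payload = "|".join(f"{k}={row[k]}" for k in sorted(row)
                           if k != key_field)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        if digest in seen:
            repeats += 1
        else:
            seen.add(digest)
    return repeats / len(rows) if rows else 0.0

rows = [{"id": 1, "name": "a", "city": "NJ"},
        {"id": 2, "name": "a", "city": "NJ"},   # repeat of record 1's payload
        {"id": 3, "name": "b", "city": "SH"},
        {"id": 4, "name": "a", "city": "NJ"}]   # repeat of record 1's payload
rate = table_repetition_rate(rows, "id")        # 2 repeats / 4 records
```

Hashing a canonical serialization avoids pairwise comparison of full records, which matters once N2 is large.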
Preferably, for a particular data set S, the timeliness satisfaction index MR_Chr-S describes the proportion of data in S that satisfies the business requirements on data freshness/time range. With the record primary key/unique key as the matching key, suppose the data set S contains C_S records, of which C_NS records are contained in a reference data set S_Ref with C_Ref records; then the timeliness satisfaction index is MR_Chr-S = C_NS / C_Ref, where 0 <= C_NS <= C_Ref.
Preferably, the timeliness overflow rate is MR_Chr-O = (C_S - C_NS) / C_Ref; the timeliness overflow index MR_Chr-O is calculated together with the timeliness satisfaction index MR_Chr-S, and the pair provides a reference for assessing the data asset operation situation.
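Both timeliness indices follow from one set intersection on the matching keys. A minimal sketch with hypothetical key values:

```python
def timeliness_indices(dataset_keys, reference_keys):
    """Return (MR_Chr-S, MR_Chr-O).
    MR_Chr-S = C_NS / C_Ref, where C_NS counts dataset records whose primary/
    unique key also appears in the reference set; MR_Chr-O = (C_S - C_NS) /
    C_Ref counts records falling outside the reference time range."""
    c_s = len(dataset_keys)
    c_ref = len(reference_keys)
    c_ns = len(set(dataset_keys) & set(reference_keys))
    return c_ns / c_ref, (c_s - c_ns) / c_ref

# C_S = 4 records in S, C_Ref = 5 in the reference set, C_NS = 2 overlap.
sat, over = timeliness_indices({"k1", "k2", "k3", "k4"},
                               {"k1", "k2", "k5", "k6", "k7"})
```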
Preferably, data acquisition: interfaces with internal and external data sources and forms data resources through active and passive internal/external acquisition (including outsourcing, cooperation, and other forms), business generation, sedimentation, and so on, corresponding mainly to the business datamation process.
Data governance: performs cleaning, conversion, splitting, aggregation, migration, statistical calculation, and other operations on the data so that data can flow effectively and efficiently between the different computing and storage hierarchies; oriented to existing or potential business needs, it forms data assets with potential market value and is a key link of data assetization.
Data application: through product and service packaging, outputs in various forms such as APIs, SDKs, research and analysis reports (referring generically to data analysis results presented in various file formats), and packaged solutions; it is a key link of asset productization.
Global event message queue: the unified monitoring and message distribution channel for global system events; after a message consumer completes registration and subscribes to the corresponding topic of the message queue, it can receive message notifications and take the handling measures appropriate to that consumer.
Metadata and lineage management: the core module of data asset operation efficiency evaluation (including management of the global metadata store, MetaStore). It tracks the whole process of data flowing between the different computing and storage hierarchies, constructs the lineage of the global data, and dynamically tracks and computes efficiency evaluation indices along multiple dimensions such as data quality, data circulation, and data application.
Data asset operation efficiency analysis reporting: outputs data asset operation efficiency analysis reports in various formats according to the computation results of the metadata and lineage management module, to assist data asset operation.
Compared with the prior art, the invention has the following beneficial effects: in this lineage-based data asset operation efficiency evaluation method and system, automation technology is fully used to track the lineage relations formed as data are processed and circulated between the computing and storage hierarchies, together with the data governance and application situation. Finally, combining the various analysis indices, the system automatically produces an analysis report of data asset operation efficiency, helping institutions evaluate that efficiency and indirectly reflecting the capability and maturity of their data asset operation management. Quantification and automation reduce the resource investment of data asset operation activities while improving the accuracy and pertinence of asset operation strategies, greatly raising overall working efficiency.
Drawings
FIG. 1 is a schematic diagram of the value combination of the data record count C_S in this embodiment;
FIG. 2 is a schematic diagram of the value combination of the data record count C_NS in this embodiment;
FIG. 3 is a schematic diagram of the value combination of the total reference record count C_Ref in this embodiment;
FIG. 4 is a schematic diagram of lineage path connections in this embodiment;
FIG. 5 is a diagram showing the overall architecture of the evaluation system according to the present embodiment;
fig. 6 is a schematic diagram of data flow in the evaluation system according to the present embodiment.
Description of the embodiments
The present invention is further illustrated in the following drawings and detailed description, which are to be understood as being merely illustrative of the invention and not limiting the scope of the invention.
Example: this embodiment is a lineage-based data asset operation efficiency evaluation method and system.
The overall architecture of the system is shown in FIG. 5 and comprises six major modules: data acquisition, data governance, data application, the global event message queue, metadata and lineage management, and data asset operation efficiency analysis reporting.
The responsibilities of each module are described as follows:
and (3) data acquisition: the method is in butt joint with an internal and external data source, and forms a data resource by means of active and passive internal and external acquisition (including outsourcing, cooperation and other forms), service generation, sedimentation and the like, and is mainly corresponding to a service datamation process.
Data management: the data is subjected to various operations such as cleaning processing, conversion, splitting, aggregation, migration, statistical calculation and the like, so that the data can flow effectively and efficiently between different layers of architecture for calculation and storage, and the data is oriented to the development needs of existing or potential business, so that a data asset with potential market value is formed, and the data asset is an important link of data asset.
Data application: the method is characterized in that the method is output in various forms such as API (application programming interface), SDK (software development kit), research analysis report (which is generally referred to herein as including data analysis results presented in various file formats), package solution and the like through product and service packaging, and is an important link of asset productization.
Global event message queues: the unified monitoring and information distribution channel of the system global event can accept information notification and take corresponding disposal measures suitable for the message consumer after the message consumer completes registration and subscribes to the corresponding subject information of the message queue.
Metadata and affinity management: a core module for data asset operational performance evaluation (including managing the global metadata store MetaStore). And carrying out overall process tracking on the data flowing among different computing and storage hierarchy structures, constructing an affinity lineage of global data, and dynamically tracking and computing from multiple dimensions such as data quality, data circulation, data application and the like to obtain performance evaluation indexes.
Data asset operation efficacy analysis report: and outputting data asset operation efficiency analysis reports in various formats according to the calculation results of the metadata and the affinity management module to assist data asset operation.
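The global event message queue that ties these modules together can be sketched as a minimal in-process publish/subscribe channel. The class, topic, and module names here are illustrative assumptions, not the patent's implementation:

```python
from collections import defaultdict

class GlobalEventQueue:
    """Minimal in-process stand-in for the global event message queue:
    a consumer registers a handler for a topic and is notified on publish."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the message to every registered consumer of the topic.
        for handler in self._subscribers[topic]:
            handler(message)

queue = GlobalEventQueue()
received = []
# Hypothetical consumers standing in for the metadata/lineage and data
# governance modules reacting to a data ingestion event.
queue.subscribe("DataIngestionMessage",
                lambda m: received.append(("lineage", m)))
queue.subscribe("DataIngestionMessage",
                lambda m: received.append(("governance", m)))
queue.publish("DataIngestionMessage", {"table": "t_0_1"})
```

In production this role would be played by a durable message broker; the sketch only shows the subscribe-then-notify contract the modules rely on.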
The system operation evaluation method comprises the following steps:
metadata generation and registration: mainly corresponds to the service datamation stage. The method comprises the sub-processes of data acquisition/import, metadata extraction, correction, registration and the like.
Data acquisition/import: the data acquisition module acquires and imports data from various data sources, such as files (txt/csv/custom, etc.), SQL scripts, databases, and APIs, into the database (the database is referred to generically here and not described in detail). When data are imported, the data acquisition module sends a data ingestion message (DataIngestionMessage) to the global event message queue; the metadata and lineage management module and the data governance module receive and consume this message and each carry out follow-up actions.
Metadata extraction: after data are warehoused (generating a data ingestion message), the metadata and lineage management module is triggered automatically (started by receiving the message queue notification) and executes a database scanning and analysis task (traversing the database and extracting related metadata); it describes the warehoused data, i.e., extracts and stores metadata (in the MetaStore database), and at the same time sends a metadata generation message (MetaGenerationMessage) to the global event message queue, which this module receives in order to execute subsequent actions. The metadata information in the MetaStore is updated in near real time according to notifications from the global event message queue; after each update the corresponding update record is flagged (i.e., marked dirty), and flagged metadata update records can only be formally registered after correction or audit, the registration being performed automatically by the system. The audit action may be completed manually or automatically by the system, typically manually the first time and automatically by the system thereafter. The extracted metadata information includes, but is not limited to, the following:
library, table, and field names, descriptions, owner information, format types, value ranges (value domains), etc.
Correction and registration: metadata information marked dirty in the MetaStore is manually corrected and audited, then finally registered in the MetaStore, while a metadata registration message (MetaRegistrationMessage) is sent to the global event message queue. Once metadata are registered, the initial lineage (genesis lineage) of libraries, tables, and fields has in effect been established. This initial lineage is the premise and basis for the metadata and lineage management module's subsequent analysis of the global data.
Lineage tracking and analysis (Lineage Trace & Analysis):
This process fully covers the business datamation phase, the data asset formation phase, and the asset productization phase.
During metadata generation and registration, the metadata and lineage management module sends metadata registration messages (MetaRegistrationMessage) to the global event message queue, and the data governance module, the data application module, and the data asset operation efficiency analysis reporting module each receive the notification and execute corresponding actions or update their own data.
In the data governance module, data are processed by integration operations (e.g., migration, splitting, merging), processing operations (e.g., cleaning, conversion, interception), and analysis operations (mathematical statistics).
In the data application module, data are packaged as APIs, SDKs, and the like to provide services externally.
The data flow process is illustrated in FIG. 6. When data flow between the different storage and computing hierarchies and are finally applied:
in the category of t_1_1:f_2->the form t_2_1:f_2' establishes the related binary group (refers to the fields with related relations) among all fields, and calculates the information loss rate index MR according to the field types or service requirements IL . For example, a particular field of a particular table is of the datetime type, after processingIn the process, for business requirements such as information simplification, accuracy reduction processing may be performed on the field, so that the datetime field is processed into a date type field; in this case, since the "time" part of the information is discarded by the "date and time information", the loss rate of the conversion process information of the two fields can be defined as 50% (i.e., half the amount of information is lost) according to the default policy, or the loss rate ratio can be defined by itself according to the actual service requirement. As a default policy, the loss of information for the conversion between different types of fields may be referred to in table 4. The field types in table 4 are generic and need to be adapted to a specific storage type for a database of a specific selection.
TABLE 4 loss of information for different types of field transformations (partial reference)
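The default-policy lookup with a business override, as described above, can be sketched as follows. Only the datetime -> date = 50% entry comes from the text; the other table entries are illustrative placeholders:

```python
# Hypothetical default-policy table: loss rate for converting a source field
# type to a target type. Only ("datetime", "date") -> 0.5 is stated in the
# text; the remaining entries are placeholders for illustration.
DEFAULT_LOSS = {
    ("datetime", "date"): 0.5,
    ("datetime", "datetime"): 0.0,
    ("float", "int"): 0.5,
}

def information_loss_rate(src_type, dst_type, override=None):
    """MR_IL for one related field pair: a business-specific override wins;
    otherwise fall back to the default policy table (0.0 if unlisted)."""
    if override is not None:
        return override
    return DEFAULT_LOSS.get((src_type, dst_type), 0.0)

loss = information_loss_rate("datetime", "date")         # default policy
custom = information_loss_rate("datetime", "date", 0.3)  # business override
```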
Track the upstream and downstream fields of all recorded fields (which can be implemented with techniques such as doubly linked lists), concatenate them into the processing chains of all fields, and calculate the field chain complexity index M_CC. For example, for field f_1 in a particular table of some original database:
If field f_1 passes through a path p1 containing n chained processing operations, and along p1 the field only changes in aspects such as format and value rules, the changes are recorded and the chain complexity of f_1 is Mcc:f_1:p1 = n.
If field f_1 passes through several processing paths in the data governance system, e.g., splitting, merging, cleaning, and migration, the final chain complexity of f_1 takes the maximum of the chain complexities of each path it passes through, i.e., Mcc:f_1:p = Max{Mcc:f_1:p1..n}. Taking figure x as an example, field f_1 of table t_0_2 passes through 3 processing paths: -> t_1_2:f_1 -> t_3_2:f_1 (Mcc:f_1:p1 = 2), -> t_1_2:f_2 -> t_2_1:f_3 (Mcc:f_1:p2 = 2), and -> t_1_2:f_2 -> t_2_2:f_2 -> t_3_2:f_2 (Mcc:f_1:p3 = 3); then the final Mcc:f_1:p = Max{Mcc:f_1:p1..3} = 3.
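The chain complexity rule above is a maximum over path lengths. A minimal sketch using the three example paths (field naming is an assumption):

```python
def chain_complexity(paths):
    """M_CC for a field: each path is the ordered list of processing hops the
    field passes through; a path's complexity is its hop count, and the
    field's final complexity is the maximum over all of its paths."""
    return max(len(path) for path in paths)

# The three paths of t_0_2.f_1 from the example: two paths of 2 hops and
# one of 3 hops, so Mcc = max{2, 2, 3} = 3.
paths = [["t_1_2.f_1", "t_3_2.f_1"],
         ["t_1_2.f_2", "t_2_1.f_3"],
         ["t_1_2.f_2", "t_2_2.f_2", "t_3_2.f_2"]]
mcc = chain_complexity(paths)
```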
When analyzing adjacent upstream and downstream fields, SQL-type operations are identified through SQL syntax analysis and field extraction techniques, and non-SQL operations (typically operations on a system interactive interface) are identified from system operation records; finally, combined with metadata store (MetaStore) matching, adjacent processing chains are generated and concatenated into complete processing chains. All processing chains are then traversed and the fan-in index M_FI and fan-out index M_FO of all fields are calculated. For example, if fields f_2 and f_3 of table t_1_1 are processed to form field f_2 of table t_2_1, the fan-in of field f_2 of table t_2_1 is said to be 2. If field f_1 of table t_0_2 is split to form fields f_1 and f_2 of table t_1_2, the fan-out of field f_1 of table t_0_2 is said to be 2. Note that multiple uses of a field do not change its fan-out.
Calculate the inter-domain asset coverage index MR_CDAC from the processing chains and the metadata information of each table (obtained from the metadata and lineage management module). The calculation of MR_CDAC is described as follows: first, different computing and storage hierarchies are called different "domains", and the domain a piece of data belongs to can be marked with tags; tags can also describe information such as the data's use, time and source. In the following case, the tags mainly describe tables. As shown in fig. 4, table t_2_1 carries the tags "dws", "all", "pers"; table t_2_2 carries "dws", "all", "corp", "cref"; table t_3_1 carries "app", "corp-app"; table t_3_2 carries "app", "corp-app". Inter-domain asset coverage can be calculated at different granularities. For example, based on the tags above, the coverage of table t_2_2 by table t_3_2 can be calculated: since field f_2 of table t_2_2 is processed to form field f_2 of table t_3_2, the inter-domain asset coverage of table t_2_2 by table t_3_2 is 1/3 = 33.33%. Similarly, the assets in the domain tagged "dws" comprise tables t_2_1 and t_2_2, which together have 6 fields, so the coverage of the "dws" domain assets by table t_3_2 is 1/6 = 16.67%. Further, the assets in the domain tagged "app" comprise tables t_3_1 and t_3_2, which together draw on 3 fields from tables t_2_1 and t_2_2 in the "dws" domain; since t_2_1 and t_2_2 together have 6 fields, the coverage of the "dws" domain assets by the "app" domain assets is 3/6 = 50%.
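A sketch of the tag-based coverage calculation, assuming tag-to-table metadata and a precomputed set of source-domain fields that feed the target asset (both structures are assumptions; the tags and field counts reproduce the fig. 4 example). Coverage is the share of the source domain's fields consumed by the target.

```python
# table -> (tags, number of fields), following the fig. 4 labels
tables = {
    "t_2_1": ({"dws", "all", "pers"}, 3),
    "t_2_2": ({"dws", "all", "corp", "cref"}, 3),
}

# fields of the "dws" domain that are processed into table t_3_2
used_by_t_3_2 = {"t_2_2:f_2"}

def domain_coverage(used_fields, source_tag):
    """Share of the source domain's fields that feed the target asset."""
    total = sum(n for tags, n in tables.values() if source_tag in tags)
    return len(used_fields) / total

print(round(domain_coverage(used_by_t_3_2, "dws"), 4))  # -> 0.1667
```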
Track and record the node activity index M_A from the processing actions on all fields. Within a specific observation period or unit of time (denoted T), count the number of processing operations (processing of a field covers reading, cleaning, converting, migrating, etc.) experienced by a specific field of a given table; this count is the activity index M_A.
Calculate the lineage difference activity ratio index MR_DAL from the field-level activity index and the lineage data obtained from the processing chains. For a particular field and a complete, specific data flow path containing that field, all nodes (here, fields) on the path are said to be related by lineage, and the path is called a lineage path. A terminating node f may have multiple lineage paths. As shown in fig. 4, field f_2 of table t_3_1 experiences two lineage paths, i.e. t_0_1:f_2 -> t_1_1:f_2 -> t_2_1:f_2 -> t_3_1:f_2 and t_0_1:f_3 -> t_1_1:f_3 -> t_2_1:f_2 -> t_3_1:f_2. For terminating node f, the activity indexes M_A of all nodes on all related paths (assume there are n nodes besides f) are tracked and calculated in the same observation period or time T; if the number of processing or access operations on node f is N_Acc, the lineage difference activity ratio of the terminating node f is the maximum activity of the non-terminating nodes of all lineage paths divided by the access count of the terminating node, i.e. MR_DAL = Max{M_A,1~n} / N_Acc.
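The formula MR_DAL = Max{M_A,1~n} / N_Acc can be sketched directly; the activity numbers below are illustrative only (not taken from fig. 4), and the guard against a never-accessed terminating node is an added assumption.

```python
# Lineage difference activity ratio: maximum activity index among all
# non-terminating nodes on the field's lineage paths, divided by the
# terminating field's own processing/access count in the same period T.

def mr_dal(upstream_activity, n_acc):
    """upstream_activity: M_A values of the n non-terminating nodes."""
    if n_acc == 0:
        raise ValueError("terminating node was never accessed in period T")
    return max(upstream_activity) / n_acc

# e.g. upstream nodes processed 8, 5 and 12 times; terminator accessed 4 times
print(mr_dal([8, 5, 12], 4))  # -> 3.0
```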
Perform periodic static analysis on the database tables generated in all of the above processes (the same observation period as the other tracking calculations may be used) and calculate all quality indexes:
Null rate MR_N: for a field F, count the number of times F is null in table T (contains no value, i.e. null/Nil, or a meaningless null, such as a meaningless blank string or meaningless default fill of zeros), denoted C_NF; the total number of records in table T is denoted C_A; then MR_N1 = C_NF / C_A.
For table T, sum the null counts of all fields, denoted Sum(C_NF); the number of fields of table T is N_F; then the table-level null rate is MR_N2 = Sum(C_NF) / (C_A * N_F).
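A sketch of both null-rate formulas, MR_N1 = C_NF / C_A and MR_N2 = Sum(C_NF) / (C_A * N_F). Which sentinel values count as "meaningless nulls" is a per-deployment decision; the set used below is an assumption for illustration.

```python
# Assumed sentinels for "null or meaningless null" (blank string, default 0)
NULLISH = {None, "", 0}

def null_rates(rows, fields):
    """rows: list of dict records; fields: the N_F field names of table T."""
    c_a, n_f = len(rows), len(fields)
    c_nf = {f: sum(1 for r in rows if r.get(f) in NULLISH) for f in fields}
    mr_n1 = {f: c_nf[f] / c_a for f in fields}  # per-field null rate
    mr_n2 = sum(c_nf.values()) / (c_a * n_f)    # whole-table null rate
    return mr_n1, mr_n2

rows = [{"f_1": "a", "f_2": None}, {"f_1": "", "f_2": "b"}]
mr_n1, mr_n2 = null_rates(rows, ["f_1", "f_2"])
print(mr_n1["f_1"], mr_n2)  # -> 0.5 0.5
```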
Error rate MR_WR: there are two types of error values, general errors and business errors. A general error is a data anomaly that can be recognized without business knowledge, such as garbled text, meaningless symbols or data format errors; a business error is data that violates business rules preset by a person or the system, such as value domain requirements (value set/range).
For a field F, count the number of times F has an error value in table T, denoted C_NF; the total number of records in table T is denoted C_A; then MR_WR1 = C_NF / C_A.
For table T, sum the error-value counts of all fields, denoted Sum(C_NF); the number of fields of table T is N_F; then the table-level error rate is MR_WR2 = Sum(C_NF) / (C_A * N_F).
Repetition rate MR_Dup: for a selected set of fields/attributes, all data records are compared using the record's primary/unique key as the judgment key.
For a field F (not the judgment key), the number of records identified by the primary key is N1; the number of records carrying a given repeated value of F is N_Dup1 (counted when the number of repeated records is greater than 1); the sum over all repeated values is Sum(N_Dup1); the repetition rate of field F is MR_Dup1 = Sum(N_Dup1) / N1.
For table T, compute a digest over all non-key fields of each record (ordered by the natural storage order or by the dictionary order of field names), compare the digest information of all records, and denote the number of repetitions N_Dup2; the total number of records in T is N2; the table repetition rate is MR_Dup2 = N_Dup2 / N2.
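A sketch of the digest-based table repetition rate MR_Dup2 = N_Dup2 / N2: non-key fields are serialized in a fixed order, hashed, and records whose digest has already been seen count as repetitions. The field names, the SHA-256 choice, and the `"|"` separator are assumptions for illustration.

```python
import hashlib

def table_repetition_rate(rows, nonkey_fields):
    """rows: list of dict records; nonkey_fields: fields entering the digest."""
    seen, n_dup = set(), 0
    for r in rows:
        # fixed (sorted) field order so equal record contents hash equally
        payload = "|".join(str(r[f]) for f in sorted(nonkey_fields))
        digest = hashlib.sha256(payload.encode()).hexdigest()
        if digest in seen:
            n_dup += 1  # repeated content under a different primary key
        else:
            seen.add(digest)
    return n_dup / len(rows)

rows = [
    {"id": 1, "name": "a", "city": "x"},
    {"id": 2, "name": "a", "city": "x"},  # duplicate content, new key
    {"id": 3, "name": "b", "city": "y"},
]
print(round(table_repetition_rate(rows, ["name", "city"]), 4))  # -> 0.3333
```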
Timeliness satisfaction rate MR_Chr-S: for a particular data set S, the timeliness satisfaction rate describes what proportion of the data in S meets the business requirements on data freshness/time coordinate range, i.e. the timeliness requirements. Note that, unlike the null rate, error rate and repetition rate, the timeliness satisfaction index serves only as a reference in business scenarios that have specific timeliness requirements on the data.
With the primary/unique key of the data records as the judgment key, suppose the data set S contains C_S records, of which C_NS records are contained in a reference data set S_Ref with C_Ref records. The data set S_Ref is typically an entity directory subject to a timeliness constraint (note that it may be a logical directory, i.e. not physically materialized); for example, an entity directory updated over a fixed or dynamic time range/window according to business needs, typically obtained by filtering a "master data" directory with superimposed timeliness constraints.
Then the timeliness satisfaction rate is MR_Chr-S = C_NS / C_Ref, where 0 <= C_NS <= C_Ref. Correspondingly, the timeliness overflow rate is MR_Chr-O = (C_S - C_NS) / C_Ref. In general, when C_S and C_NS are unequal, the data asset operation condition is evaluated comprehensively by combining the timeliness satisfaction and timeliness overflow indexes.
The combinations of the three values C_S, C_NS and C_Ref and their interpretations are shown in figs. 1, 2 and 3. In fig. 1 (case 1), set S is contained in set S_Ref, or the two overlap: 1. when C_S = C_NS < C_Ref, set S meets the timeliness requirement on a best-effort basis, and its timeliness satisfaction can still be improved through measures such as data completion and system optimization; 2. when C_S = C_NS = C_Ref (i.e. S and S_Ref coincide), set S fully meets the current timeliness requirement. In fig. 2 (case 2), set S and set S_Ref partially intersect, or set S contains set S_Ref; in general, when the data entities in S are produced by multi-source fusion/processing, part of the data in S falls outside S_Ref. Here, the timeliness satisfaction and timeliness overflow indexes can be combined to comprehensively evaluate the "missing" and timeliness anomalies of the data asset operation. Specifically: 1. when C_S > C_NS and C_NS < C_Ref, set S partially meets the timeliness requirement, but the entity data sources in S are not fully covered by S_Ref; some links of data fusion/processing are not monitored and managed by asset operation, and a complete investigation of missing/risk points across the whole data governance process is required; 2. when C_S > C_NS and C_NS = C_Ref (i.e. set S contains set S_Ref), similarly, some links of the data governance process are not regulated by data asset operation; combined with business practice, this situation means that the upstream and downstream stages of data governance are seriously disconnected, and the asset operation strategies for raw data (or data in the front part of the computing and storage hierarchy), master data/reference data, etc., must be adjusted with priority to ensure tracking and management of the full data life cycle. In fig. 3 (case 3), set S and set S_Ref are disjoint: 1. when C_NS = 0, i.e. S and S_Ref do not intersect, the two entity data sources are usually completely independent/fragmented, there are serious supervision gaps in the data governance process, and full life-cycle management of the data cannot be achieved. In this case, the data asset operation strategy needs to be re-analyzed and an effective mechanism established for tracking data circulation among the computing and storage hierarchies.
Timeliness overflow rate MR_Chr-O: see the description of the timeliness satisfaction index above. It is typically calculated and provided together with the timeliness satisfaction index as a reference for evaluating data asset operation.
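A sketch of the paired timeliness indexes MR_Chr-S = C_NS / C_Ref and MR_Chr-O = (C_S - C_NS) / C_Ref, matching records of S against the timeliness-constrained reference directory S_Ref by primary/unique key; the key sets below are illustrative.

```python
# Timeliness satisfaction and overflow, computed from primary/unique keys.

def timeliness_indexes(s_keys, ref_keys):
    c_s, c_ref = len(s_keys), len(ref_keys)
    c_ns = len(set(s_keys) & set(ref_keys))  # records of S contained in S_Ref
    satisfaction = c_ns / c_ref              # MR_Chr-S
    overflow = (c_s - c_ns) / c_ref          # MR_Chr-O
    return satisfaction, overflow

# illustrating case 2 of the description: S and S_Ref partially intersect
sat, over = timeliness_indexes({"k1", "k2", "k3"}, {"k2", "k3", "k4", "k5"})
print(sat, over)  # -> 0.5 0.25
```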
Following the above description, all indexes have been calculated and stored into the MetaStore for subsequent processing.
Note that where the above processes involve an observation period, it is not mandatory that all processes use the same observation period; only the durations of the observation periods need to be the same.
Measurement and reporting (Measurement & Report):
in the above processes, the metadata information, lineage information, and quality, circulation and application indexes of the global data are tracked and calculated; based on them, the data asset operation efficiency analysis and reporting module can provide corresponding analysis results at different granularities such as database, table and field, and integrate and output an evaluation conclusion.
A typical data asset operation efficiency analysis report may be presented as automatically generated charts, text, etc., including but not limited to the following:
basic analysis of each database and table (including the number of tables/fields/records, type distribution, etc.);
data quality indexes, data circulation indexes and data application indexes of each database, table and field.
Results of deep analysis of tables that require special business attention can also be provided; for example, field index information that does not participate in calculation can be eliminated, and other indexes can be interpreted in combination with the business scenario (for example, rules preset manually and executed automatically by the system, producing interpretation results matching the rules).
Based on the index distribution, the system gives evaluation conclusions and suggestions, and data asset operators, combining the analysis results automatically provided by the system, carry out auditing and follow-up handling.
It should be noted that the foregoing merely illustrates the technical idea of the present invention and is not intended to limit its scope of protection; a person skilled in the art may make several improvements and modifications without departing from the principles of the present invention, and such improvements and modifications fall within the scope of the claims of the present invention.

Claims (10)

1. A method for evaluating the operational effectiveness of a data asset, the method comprising the steps of:
the acquired data source is imported into a database, the warehoused data is extracted and stored, and the marked metadata update records are formally registered after correction or auditing;
during metadata generation and registration, the metadata and lineage management module sends metadata registration information to the global event message queue, and the data management module, the data application module and the data asset operation efficiency analysis and reporting module each receive the information notification and execute corresponding actions or update their own data;
the metadata information, lineage information and various indexes of the global data are tracked and calculated; the data asset operation efficiency analysis and reporting module provides corresponding analysis results at different granularities according to them, and integrates and outputs an evaluation conclusion.
2. The method of claim 1, wherein in the data management module the data is processed by integration, processing and analysis operations; in the data application module the data is packaged in the form of APIs and SDKs and then served externally; and the step of the data flowing between different storage and computing hierarchies to the final application comprises:
establishing association 2-tuples among all fields, and calculating the information loss rate index MR_IL according to field types or business requirements;
tracking the upstream and downstream fields of all recorded fields, concatenating the processing chains forming all fields, and calculating therefrom the field chain complexity index M_CC, the final chain complexity of a field taking the maximum of the chain complexities of the paths it experiences;
tracking and recording the node activity index M_A from the processing actions on all fields, counting the number of processing operations experienced by a specific field of a table within an observation period of specific duration or per unit time;
calculating the lineage difference activity ratio index MR_DAL according to the field-level activity index and the lineage data acquired from the processing chains, where MR_DAL = Max{M_A,1~n} / N_Acc, n is the number of nodes other than the terminating node, M_A is the node activity index, and N_Acc is the number of processing or access operations of the terminating node;
and performing periodic static analysis on the database tables generated in all the processes, and calculating all quality indexes.
3. The method for evaluating the operation efficiency of a data asset according to claim 2, wherein, when analyzing adjacent upstream and downstream fields, SQL-like operations are identified by SQL-like syntax parsing and field extraction techniques, non-SQL operations are identified by system operation records, and finally, combined with metadata store matching, adjacent processing chains are generated and concatenated into complete processing chains; all processing chains are traversed, and the fan-in index M_FI and fan-out index M_FO of all fields are calculated.
4. A method of evaluating performance of a data asset operation as claimed in claim 3, wherein the inter-domain asset coverage index MR_CDAC is calculated based on the processing chains and the metadata information of each table; the calculation of MR_CDAC comprises first referring to different computing and storage hierarchies as different domains, the domain to which data belongs being marked with a tagging technique, the tags also describing the data information.
5. The method of claim 2, wherein the quality indexes comprise a null rate index MR_N; for a field, the field null rate index MR_N1 = C_NF / C_A, where C_NF is the number of times the field has a null value in the table and C_A is the total number of records in the table; for the table, the table null rate MR_N2 = Sum(C_NF) / (C_A * N_F), where Sum(C_NF) is the sum of the null counts of all fields and N_F is the number of fields of the table.
6. The method of claim 5, wherein the quality indexes comprise an error rate index MR_WR; the field error rate index MR_WR1 = C_NF / C_A, where C_NF is the number of times the field has an error value in the table and C_A is the total number of records in the table; the table error rate MR_WR2 = Sum(C_NF) / (C_A * N_F), where Sum(C_NF) is the sum of the error-value counts of all fields and N_F is the number of fields of the table.
7. The method of claim 6, wherein the quality indexes comprise a repetition rate index MR_Dup; for a selected set of fields/attributes, all data records are compared according to the record primary/unique key; the field repetition rate index MR_Dup1 = Sum(N_Dup1) / N1, where N1 is the number of records identified by the record primary key, N_Dup1 is the number of records carrying a given repeated value of the field, and Sum(N_Dup1) is the sum over all repeated values; for the table, digests of all non-key fields of each record are calculated and the digest information of all records is compared, giving the table repetition rate index MR_Dup2 = N_Dup2 / N2, where N_Dup2 is the number of repetitions and N2 is the total number of records in the table.
8. The method of claim 7, wherein the quality indexes comprise a timeliness satisfaction rate index MR_Chr-S; for a particular data set S, the timeliness satisfaction rate describes the proportion of the data in S that meets the business requirements on data freshness/time coordinate range; with the primary/unique key of the data records as the judgment key, the data set S contains C_S records, of which C_NS records are contained in a reference data set S_Ref with C_Ref records, and the timeliness satisfaction rate index MR_Chr-S = C_NS / C_Ref, where 0 <= C_NS <= C_Ref.
9. The method of claim 8, wherein the quality indexes comprise a timeliness overflow rate index MR_Chr-O, where MR_Chr-O = (C_S - C_NS) / C_Ref, and the timeliness overflow rate index MR_Chr-O is calculated and provided together with the timeliness satisfaction rate index MR_Chr-S as a reference for evaluating the data asset operation condition.
10. A data asset operation efficiency evaluation system based on the method of any one of claims 1-9, wherein the system comprises:
a data acquisition module, interfacing with internal and external data sources, and forming data resources corresponding to the business digitization process through active and passive internal/external acquisition or through business generation and accumulation;
a data management module, configured to perform cleaning, conversion, splitting, aggregation, migration and statistical calculation operations on the data;
a data application module, configured to output, through product and service packaging and facing the market, APIs, SDKs, research and analysis reports, and packaged solutions;
a global event message queue, serving as the unified monitoring and information distribution channel for global events of the system, wherein a message consumer, after completing registration and subscribing to the corresponding topics of the message queue, receives information notifications and takes the corresponding handling measures appropriate to it;
a metadata and lineage management module, the core module of data asset operation efficiency evaluation, configured to track the whole process of data circulation among different computing and storage hierarchies, construct the lineage of the global data, and dynamically track and calculate along the data quality, data circulation and data application dimensions to obtain efficiency evaluation indexes;
and a data asset operation efficiency analysis and reporting module, which outputs data asset operation efficiency analysis reports in various formats according to the calculation results of the metadata and lineage management module, to assist data asset operation.
CN202310802066.1A 2023-07-03 2023-07-03 Data asset operation efficiency evaluation method and system Pending CN116911671A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310802066.1A CN116911671A (en) 2023-07-03 2023-07-03 Data asset operation efficiency evaluation method and system


Publications (1)

Publication Number Publication Date
CN116911671A true CN116911671A (en) 2023-10-20

Family

ID=88357334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310802066.1A Pending CN116911671A (en) 2023-07-03 2023-07-03 Data asset operation efficiency evaluation method and system

Country Status (1)

Country Link
CN (1) CN116911671A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216592A (en) * 2023-11-07 2023-12-12 青岛港国际股份有限公司 Idle analysis system and analysis method for assets
CN117910850A (en) * 2023-12-18 2024-04-19 北京宇信科技集团股份有限公司 Index data analysis engine, index data calculation device and calculation method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination