CN116911671A - Data asset operation efficiency evaluation method and system - Google Patents


Info

Publication number
CN116911671A
CN116911671A
Authority
CN
China
Prior art keywords
data
metadata
index
information
fields
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310802066.1A
Other languages
Chinese (zh)
Inventor
齐宁
周云松
王治平
茅天天
王子青
华伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu United Credit Reference Co ltd
Original Assignee
Jiangsu United Credit Reference Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu United Credit Reference Co ltd filed Critical Jiangsu United Credit Reference Co ltd
Priority to CN202310802066.1A
Publication of CN116911671A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Educational Administration (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data asset operation efficiency evaluation method. A data source is collected and imported into a database; the warehoused data are extracted and stored as metadata, and flagged metadata update records are formally registered after correction or audit. During metadata generation and registration, the metadata and lineage management module sends metadata registration messages to the global event message queue, and the data governance module, the data application module, and the data asset operation efficiency analysis reporting module each receive the notification and execute corresponding actions or update their own data. Metadata information, lineage information, and various indices of the global data are tracked and computed. Quantification and automation reduce the resource investment of data asset operation activities while improving the accuracy and pertinence of asset operation strategies, greatly raising the overall efficiency of the work.

Description

Data asset operation efficiency evaluation method and system
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a data asset operation efficiency evaluation method and system.
Background
How to fully leverage data assets and expert experience to carry out effective, efficient asset operation activities is a major challenge for many data-centric enterprises. In actual production practice, data asset operation is usually performed by experts in the data management department and involves a large amount of up-front data analysis, storage model design, system construction, effect evaluation, system correction, and so on. Depending on the overall complexity, coverage, and richness of the data assets and the accumulated expert experience in the relevant field, the working cycle of data asset operation varies from months to much longer; tracking asset operation activities lacks efficient means or tool support for effect assessment, and the industry has no relatively standard practical reference in this field. As a result, evaluating data asset operation efficiency has always consumed huge resources for a relatively low return, and lacks the support of quantitative means.
While fully developing their business by means of digital technology, financial institutions (hereinafter, institutions) are rapidly expanding their ability to collect and store data, accumulating a large amount of raw data resources and realizing business datamation in the process. However, merely holding a large amount of raw data does not by itself put an institution in an unassailable position in market competition. Statistics indicate that about 40% of the data is never utilized; this portion of the data creates no actual value while occupying the institution's valuable computing and storage resources.
Disclosure of Invention
In order to achieve the above purpose, the technical scheme of the invention is as follows: a data asset operation efficiency evaluation method comprising the following steps.
The acquired data source is imported into a database, the warehoused data are extracted and stored as metadata, and flagged metadata update records are formally registered after correction or audit;
during metadata generation and registration, the metadata and lineage management module sends metadata registration messages to the global event message queue, and the data governance module, the data application module, and the data asset operation efficiency analysis reporting module each receive the notification and execute corresponding actions or update their own data;
metadata information, lineage information, and various indices of the global data are tracked and computed; based on these, the data asset operation efficiency analysis reporting module provides analysis results at different granularities and integrates and outputs an evaluation conclusion.
Based on this scheme, data acquisition/import works as follows: the data acquisition module acquires and imports data from various data sources, such as files (txt/csv/custom, etc.), SQL scripts, databases, and APIs, into the database (the database is referred to generically here and not described in detail). When data are imported, the data acquisition module sends a data ingestion message (DataIngestionMessage) to the global event message queue; the metadata and lineage management module and the data governance module receive and consume this message and each carry out follow-up actions.
Metadata extraction: after data are warehoused (generating a data ingestion message), the metadata and lineage management module is triggered automatically (started by receiving the message queue notification) and executes a database scanning and analysis task (traversing the database and extracting related metadata); it describes the warehoused data, i.e., extracts and stores metadata (in the MetaStore database), and at the same time sends a metadata generation message (MetaGenerationMessage) to the global event message queue, which this module receives in order to execute subsequent actions. The metadata information in the MetaStore is updated in near real time according to notifications from the global event message queue; after each update the corresponding update record is flagged (i.e., marked dirty), and flagged metadata update records can only be formally registered after correction or audit, the registration being performed automatically by the system. The audit action may be completed manually or automatically by the system, typically manually the first time and automatically by the system thereafter. The extracted metadata information includes, but is not limited to: library, table, and field names, descriptions, owner information, format types, value ranges (value domains), etc.
Correction and registration: metadata information marked dirty in the MetaStore is manually corrected and audited, then finally registered in the MetaStore, while a metadata registration message (MetaRegistrationMessage) is sent to the global event message queue. Once metadata are registered, the initial lineage (genesis lineage) of libraries, tables, and fields has in effect been established. This initial lineage is the premise and basis for the metadata and lineage management module's subsequent analysis of the global data.
This process fully covers the business datamation phase, the data asset formation phase, and the asset productization phase. During metadata generation and registration, the metadata and lineage management module sends metadata registration messages (MetaRegistrationMessage) to the global event message queue, and the data governance module, the data application module, and the data asset operation efficiency analysis reporting module each receive the notification and execute corresponding actions or update their own data.
Preferably, in the data governance module, data are processed by integration operations (e.g., migration, splitting, merging), processing operations (e.g., cleaning, conversion, interception), and analysis operations (mathematical statistics); in the data application module, data are packaged as APIs and SDKs to provide services externally. As data flow between the different computing and storage hierarchies and into final applications, the steps comprise:
establishing related tuples (pairs of fields with a derivation relation) between all fields, and calculating the information loss rate index MR_IL according to field types or business requirements;
tracking the upstream and downstream fields of all recorded fields (which can be implemented with techniques such as doubly linked lists), concatenating them into the processing chains of all fields, and calculating the field chain complexity index M_CC, where the final chain complexity of a field takes the maximum of the chain complexities of each path it passes through;
tracking and recording the node liveness index M_A for the processing actions on all fields, i.e., counting the number of processing passes a specific field of a library table undergoes within an observation period or unit time of a given length (processing of a field generically covers reading, cleaning, conversion, migration, and so on);
calculating the lineage difference activity ratio index MR_DAL from the field-level liveness indices and the lineage data obtained from the processing chain: MR_DAL = Max{M_A,1..n} / N_Acc, where n is the number of nodes other than the terminal node, M_A is the node liveness index, and N_Acc is the number of processing or access operations on the terminal node;
performing periodic static analysis of the database tables generated in all the above processes (the same observation period as the other tracking computations may be used) and calculating all the quality indices.
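The lineage difference activity ratio defined above can be sketched in a few lines. This is a minimal illustration, not the patented implementation; the function and argument names are assumptions.

```python
def lineage_difference_activity_ratio(node_activity, terminal_access_count):
    """MR_DAL = Max{M_A,1..n} / N_Acc: the highest liveness index among the
    non-terminal lineage nodes, divided by the terminal node's processing or
    access count, per the formula in the text."""
    if terminal_access_count == 0:
        raise ValueError("terminal node has no recorded accesses")
    return max(node_activity) / terminal_access_count

# Example: three upstream nodes were processed 6, 9, and 3 times in the
# observation window; the terminal node was accessed 3 times.
ratio = lineage_difference_activity_ratio([6, 9, 3], 3)
```

A high ratio flags lineage paths whose upstream nodes churn far more than the terminal asset is actually consumed.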
Preferably, when analyzing adjacent upstream and downstream fields, SQL-type operations are identified by SQL syntax analysis and field extraction techniques, and non-SQL operations (typically operations on a system interface) are identified from system operation records; finally, combined with metadata store (MetaStore) matching, adjacent processing chains are generated and concatenated into complete processing chains. All processing chains are traversed, and the fan-in index M_FI and fan-out index M_FO of all fields are calculated.
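Given processing-chain edges, the fan-in and fan-out indices reduce to counting distinct neighbors per field. A minimal sketch (field naming `table.field` is an assumption):

```python
from collections import defaultdict

def fan_in_out(edges):
    """Compute fan-in (M_FI) and fan-out (M_FO) per field from directed
    processing-chain edges (source_field, target_field). Sets deduplicate
    repeated edges, matching the note that multiple uses of a field do not
    change its fan-out."""
    fi, fo = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        fi[dst].add(src)   # distinct upstream fields feeding dst
        fo[src].add(dst)   # distinct downstream fields formed from src
    return ({f: len(s) for f, s in fi.items()},
            {f: len(s) for f, s in fo.items()})

# Example from the text: t_1_1.f_2 and t_1_1.f_3 merge into t_2_1.f_2
# (fan-in 2); t_0_2.f_1 splits into t_1_2.f_1 and t_1_2.f_2 (fan-out 2).
edges = [("t_1_1.f_2", "t_2_1.f_2"), ("t_1_1.f_3", "t_2_1.f_2"),
         ("t_0_2.f_1", "t_1_2.f_1"), ("t_0_2.f_1", "t_1_2.f_2")]
m_fi, m_fo = fan_in_out(edges)
```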
Preferably, the inter-domain asset coverage index MR_CDAC is calculated from the processing chains and the metadata information of each table (obtained from the metadata and lineage management module). To compute MR_CDAC, the different computing and storage hierarchies are first called different "domains"; the domain a piece of data belongs to is marked with tags, which also describe the data's information.
Preferably, the quality-class indices comprise the null rate index MR_N, the error rate index MR_WR, the repetition rate index MR_Dup, the timeliness satisfaction index MR_Chr-S, and the timeliness overflow index MR_Chr-O.
Preferably, for a field, the null rate index is calculated within the table as MR_N1 = C_NF / C_A, where C_NF is the number of null values of the field in the table and C_A is the total number of records in the table; for a table, the table null rate is MR_N2 = Sum(C_NF) / (C_A × N_F), where Sum(C_NF) sums the null counts of all fields, N_F is the number of fields in the table, and C_A is the total number of records in the table.
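The two null-rate formulas map directly to code. A minimal sketch, assuming rows are dictionaries and null is represented as `None`:

```python
def field_null_rate(column):
    """MR_N1 = C_NF / C_A: null occurrences of one field over total records."""
    if not column:
        return 0.0
    return sum(1 for v in column if v is None) / len(column)

def table_null_rate(rows, field_names):
    """MR_N2 = Sum(C_NF) / (C_A * N_F): total nulls across all fields over
    (record count x field count)."""
    total_cells = len(rows) * len(field_names)
    if total_cells == 0:
        return 0.0
    nulls = sum(1 for row in rows for name in field_names
                if row.get(name) is None)
    return nulls / total_cells

rows = [{"name": "a", "city": None},
        {"name": None, "city": None},
        {"name": "c", "city": "NJ"}]
r1 = field_null_rate([row["city"] for row in rows])  # 2 nulls / 3 records
r2 = table_null_rate(rows, ["name", "city"])         # 3 nulls / 6 cells
```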
Preferably, the error rate index MR_WR covers two kinds of error values: generic error values, i.e., data anomalies that can be recognized without related business knowledge, such as garbled text, meaningless symbols, and data format errors; and business error values, i.e., data that violate business rules preset manually or by the system. The field error rate index is MR_WR1 = C_NF / C_A, where C_NF is the number of error values of the field in the table and C_A is the total number of records in the table; the table error rate is MR_WR2 = Sum(C_NF) / (C_A × N_F), where Sum(C_NF) sums the error-value counts of all fields, N_F is the number of fields in the table, and C_A is the total number of records in the table.
Preferably, for the repetition rate index MR_Dup, all data records are compared on the selected field/attribute set, keyed by the record primary key/unique key. The field repetition rate index is MR_Dup1 = Sum(N_Dup1) / N1, where N1 is the number of records identified by the record primary key, N_Dup1 is the number of records carrying a given repeated value of the field, and Sum(N_Dup1) is the sum over all repeated values. For a table, a digest is computed over all fields of a record other than the deduplication key; comparing the digests of all records gives the table repetition rate index MR_Dup2 = N_Dup2 / N2, where N_Dup2 is the number of repeats (i.e., the number of times an already-seen digest value occurs again) and N2 is the total number of table records.
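The digest-based table repetition rate can be sketched as follows; SHA-256 and the `|`-joined payload format are illustrative choices, not prescribed by the text:

```python
import hashlib

def table_repetition_rate(rows, key_field):
    """MR_Dup2 = N_Dup2 / N2: hash every field except the dedup key for each
    record; any record whose digest was already seen counts as a repeat."""
    seen, repeats = set(), 0
    for row in rows:
        payload = "|".join(f"{k}={row[k]}" for k in sorted(row)
                           if k != key_field)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        if digest in seen:
            repeats += 1
        else:
            seen.add(digest)
    return repeats / len(rows) if rows else 0.0

rows = [{"id": 1, "name": "a", "city": "NJ"},
        {"id": 2, "name": "a", "city": "NJ"},   # repeat of record 1's payload
        {"id": 3, "name": "b", "city": "SH"},
        {"id": 4, "name": "a", "city": "NJ"}]   # repeat of record 1's payload
rate = table_repetition_rate(rows, "id")        # 2 repeats / 4 records
```

Hashing a canonical serialization avoids pairwise comparison of full records, which matters once N2 is large.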
Preferably, for a particular data set S, the timeliness satisfaction index MR_Chr-S describes the proportion of data in S that satisfies the business requirements on data freshness/time range. With the record primary key/unique key as the matching key, suppose the data set S contains C_S records, of which C_NS records are contained in a reference data set S_Ref with C_Ref records; then the timeliness satisfaction index is MR_Chr-S = C_NS / C_Ref, where 0 <= C_NS <= C_Ref.
Preferably, the timeliness overflow rate is MR_Chr-O = (C_S - C_NS) / C_Ref; the timeliness overflow index MR_Chr-O is calculated together with the timeliness satisfaction index MR_Chr-S, and the pair provides a reference for assessing the data asset operation situation.
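Both timeliness indices follow from one set intersection on the matching keys. A minimal sketch with hypothetical key values:

```python
def timeliness_indices(dataset_keys, reference_keys):
    """Return (MR_Chr-S, MR_Chr-O).
    MR_Chr-S = C_NS / C_Ref, where C_NS counts dataset records whose primary/
    unique key also appears in the reference set; MR_Chr-O = (C_S - C_NS) /
    C_Ref counts records falling outside the reference time range."""
    c_s = len(dataset_keys)
    c_ref = len(reference_keys)
    c_ns = len(set(dataset_keys) & set(reference_keys))
    return c_ns / c_ref, (c_s - c_ns) / c_ref

# C_S = 4 records in S, C_Ref = 5 in the reference set, C_NS = 2 overlap.
sat, over = timeliness_indices({"k1", "k2", "k3", "k4"},
                               {"k1", "k2", "k5", "k6", "k7"})
```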
Preferably, data acquisition: interfaces with internal and external data sources and forms data resources through active and passive internal/external acquisition (including outsourcing, cooperation, and other forms), business generation, sedimentation, and so on, corresponding mainly to the business datamation process.
Data governance: performs cleaning, conversion, splitting, aggregation, migration, statistical calculation, and other operations on the data so that data can flow effectively and efficiently between the different computing and storage hierarchies; oriented to existing or potential business needs, it forms data assets with potential market value and is a key link of data assetization.
Data application: through product and service packaging, outputs in various forms such as APIs, SDKs, research and analysis reports (referring generically to data analysis results presented in various file formats), and packaged solutions; it is a key link of asset productization.
Global event message queue: the unified monitoring and message distribution channel for global system events; after a message consumer completes registration and subscribes to the corresponding topic of the message queue, it can receive message notifications and take the handling measures appropriate to that consumer.
Metadata and lineage management: the core module of data asset operation efficiency evaluation (including management of the global metadata store, MetaStore). It tracks the whole process of data flowing between the different computing and storage hierarchies, constructs the lineage of the global data, and dynamically tracks and computes efficiency evaluation indices along multiple dimensions such as data quality, data circulation, and data application.
Data asset operation efficiency analysis reporting: outputs data asset operation efficiency analysis reports in various formats according to the computation results of the metadata and lineage management module, to assist data asset operation.
Compared with the prior art, the invention has the following beneficial effects: in this lineage-based data asset operation efficiency evaluation method and system, automation technology is fully used to track the lineage relations formed as data are processed and circulated between the computing and storage hierarchies, together with the data governance and application situation. Finally, combining the various analysis indices, the system automatically produces an analysis report of data asset operation efficiency, helping institutions evaluate that efficiency and indirectly reflecting the capability and maturity of their data asset operation management. Quantification and automation reduce the resource investment of data asset operation activities while improving the accuracy and pertinence of asset operation strategies, greatly raising overall working efficiency.
Drawings
FIG. 1 is a schematic diagram of the value combination of the data record count C_S in this embodiment;
FIG. 2 is a schematic diagram of the value combination of the data record count C_NS in this embodiment;
FIG. 3 is a schematic diagram of the value combination of the total reference record count C_Ref in this embodiment;
FIG. 4 is a schematic diagram of lineage path connections in this embodiment;
FIG. 5 is a diagram showing the overall architecture of the evaluation system according to the present embodiment;
fig. 6 is a schematic diagram of data flow in the evaluation system according to the present embodiment.
Description of the embodiments
The present invention is further illustrated in the following drawings and detailed description, which are to be understood as being merely illustrative of the invention and not limiting the scope of the invention.
Example: this embodiment is a lineage-based data asset operation efficiency evaluation method and system.
The overall architecture of the system is shown in FIG. 5 and comprises six major modules: data acquisition, data governance, data application, the global event message queue, metadata and lineage management, and data asset operation efficiency analysis reporting.
The responsibilities of each module are described as follows:
and (3) data acquisition: the method is in butt joint with an internal and external data source, and forms a data resource by means of active and passive internal and external acquisition (including outsourcing, cooperation and other forms), service generation, sedimentation and the like, and is mainly corresponding to a service datamation process.
Data management: the data is subjected to various operations such as cleaning processing, conversion, splitting, aggregation, migration, statistical calculation and the like, so that the data can flow effectively and efficiently between different layers of architecture for calculation and storage, and the data is oriented to the development needs of existing or potential business, so that a data asset with potential market value is formed, and the data asset is an important link of data asset.
Data application: the method is characterized in that the method is output in various forms such as API (application programming interface), SDK (software development kit), research analysis report (which is generally referred to herein as including data analysis results presented in various file formats), package solution and the like through product and service packaging, and is an important link of asset productization.
Global event message queues: the unified monitoring and information distribution channel of the system global event can accept information notification and take corresponding disposal measures suitable for the message consumer after the message consumer completes registration and subscribes to the corresponding subject information of the message queue.
Metadata and affinity management: a core module for data asset operational performance evaluation (including managing the global metadata store MetaStore). And carrying out overall process tracking on the data flowing among different computing and storage hierarchy structures, constructing an affinity lineage of global data, and dynamically tracking and computing from multiple dimensions such as data quality, data circulation, data application and the like to obtain performance evaluation indexes.
Data asset operation efficacy analysis report: and outputting data asset operation efficiency analysis reports in various formats according to the calculation results of the metadata and the affinity management module to assist data asset operation.
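The global event message queue that ties these modules together can be sketched as a minimal in-process publish/subscribe channel. The class, topic, and module names here are illustrative assumptions, not the patent's implementation:

```python
from collections import defaultdict

class GlobalEventQueue:
    """Minimal in-process stand-in for the global event message queue:
    a consumer registers a handler for a topic and is notified on publish."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the message to every registered consumer of the topic.
        for handler in self._subscribers[topic]:
            handler(message)

queue = GlobalEventQueue()
received = []
# Hypothetical consumers standing in for the metadata/lineage and data
# governance modules reacting to a data ingestion event.
queue.subscribe("DataIngestionMessage",
                lambda m: received.append(("lineage", m)))
queue.subscribe("DataIngestionMessage",
                lambda m: received.append(("governance", m)))
queue.publish("DataIngestionMessage", {"table": "t_0_1"})
```

In production this role would be played by a durable message broker; the sketch only shows the subscribe-then-notify contract the modules rely on.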
The system operation evaluation method comprises the following steps:
metadata generation and registration: mainly corresponds to the service datamation stage. The method comprises the sub-processes of data acquisition/import, metadata extraction, correction, registration and the like.
Data acquisition/import: the data acquisition module acquires and imports data from various data sources, such as files (txt/csv/custom, etc.), SQL scripts, databases, and APIs, into the database (the database is referred to generically here and not described in detail). When data are imported, the data acquisition module sends a data ingestion message (DataIngestionMessage) to the global event message queue; the metadata and lineage management module and the data governance module receive and consume this message and each carry out follow-up actions.
Metadata extraction: after data are warehoused (generating a data ingestion message), the metadata and lineage management module is triggered automatically (started by receiving the message queue notification) and executes a database scanning and analysis task (traversing the database and extracting related metadata); it describes the warehoused data, i.e., extracts and stores metadata (in the MetaStore database), and at the same time sends a metadata generation message (MetaGenerationMessage) to the global event message queue, which this module receives in order to execute subsequent actions. The metadata information in the MetaStore is updated in near real time according to notifications from the global event message queue; after each update the corresponding update record is flagged (i.e., marked dirty), and flagged metadata update records can only be formally registered after correction or audit, the registration being performed automatically by the system. The audit action may be completed manually or automatically by the system, typically manually the first time and automatically by the system thereafter. The extracted metadata information includes, but is not limited to, the following:
library, table, and field names, descriptions, owner information, format types, value ranges (value domains), etc.
Correction and registration: metadata information marked dirty in the MetaStore is manually corrected and audited, then finally registered in the MetaStore, while a metadata registration message (MetaRegistrationMessage) is sent to the global event message queue. Once metadata are registered, the initial lineage (genesis lineage) of libraries, tables, and fields has in effect been established. This initial lineage is the premise and basis for the metadata and lineage management module's subsequent analysis of the global data.
Lineage tracking and analysis (Lineage Trace & Analysis):
This process fully covers the business datamation phase, the data asset formation phase, and the asset productization phase.
During metadata generation and registration, the metadata and lineage management module sends metadata registration messages (MetaRegistrationMessage) to the global event message queue, and the data governance module, the data application module, and the data asset operation efficiency analysis reporting module each receive the notification and execute corresponding actions or update their own data.
In the data governance module, data are processed by integration operations (e.g., migration, splitting, merging), processing operations (e.g., cleaning, conversion, interception), and analysis operations (mathematical statistics).
In the data application module, data are packaged as APIs, SDKs, and the like to provide services externally.
The data flow process is illustrated in FIG. 6. When data flow between the different storage and computing hierarchies and are finally applied:
in the category of t_1_1:f_2->the form t_2_1:f_2' establishes the related binary group (refers to the fields with related relations) among all fields, and calculates the information loss rate index MR according to the field types or service requirements IL . For example, a particular field of a particular table is of the datetime type, after processingIn the process, for business requirements such as information simplification, accuracy reduction processing may be performed on the field, so that the datetime field is processed into a date type field; in this case, since the "time" part of the information is discarded by the "date and time information", the loss rate of the conversion process information of the two fields can be defined as 50% (i.e., half the amount of information is lost) according to the default policy, or the loss rate ratio can be defined by itself according to the actual service requirement. As a default policy, the loss of information for the conversion between different types of fields may be referred to in table 4. The field types in table 4 are generic and need to be adapted to a specific storage type for a database of a specific selection.
TABLE 4 loss of information for different types of field transformations (partial reference)
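The default-policy lookup with a business override, as described above, can be sketched as follows. Only the datetime -> date = 50% entry comes from the text; the other table entries are illustrative placeholders:

```python
# Hypothetical default-policy table: loss rate for converting a source field
# type to a target type. Only ("datetime", "date") -> 0.5 is stated in the
# text; the remaining entries are placeholders for illustration.
DEFAULT_LOSS = {
    ("datetime", "date"): 0.5,
    ("datetime", "datetime"): 0.0,
    ("float", "int"): 0.5,
}

def information_loss_rate(src_type, dst_type, override=None):
    """MR_IL for one related field pair: a business-specific override wins;
    otherwise fall back to the default policy table (0.0 if unlisted)."""
    if override is not None:
        return override
    return DEFAULT_LOSS.get((src_type, dst_type), 0.0)

loss = information_loss_rate("datetime", "date")         # default policy
custom = information_loss_rate("datetime", "date", 0.3)  # business override
```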
Track the upstream and downstream fields of all recorded fields (which can be implemented with techniques such as doubly linked lists), concatenate them into the processing chains of all fields, and calculate the field chain complexity index M_CC. For example, for field f_1 in a particular table of some original database:
If field f_1 passes through a path p1 containing n chained processing operations, and along p1 the field only changes in aspects such as format and value rules, the changes are recorded and the chain complexity of f_1 is Mcc:f_1:p1 = n.
If field f_1 passes through several processing paths in the data governance system, e.g., splitting, merging, cleaning, and migration, the final chain complexity of f_1 takes the maximum of the chain complexities of each path it passes through, i.e., Mcc:f_1:p = Max{Mcc:f_1:p1..n}. Taking figure x as an example, field f_1 of table t_0_2 passes through 3 processing paths: -> t_1_2:f_1 -> t_3_2:f_1 (Mcc:f_1:p1 = 2), -> t_1_2:f_2 -> t_2_1:f_3 (Mcc:f_1:p2 = 2), and -> t_1_2:f_2 -> t_2_2:f_2 -> t_3_2:f_2 (Mcc:f_1:p3 = 3); then the final Mcc:f_1:p = Max{Mcc:f_1:p1..3} = 3.
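The chain complexity rule above is a maximum over path lengths. A minimal sketch using the three example paths (field naming is an assumption):

```python
def chain_complexity(paths):
    """M_CC for a field: each path is the ordered list of processing hops the
    field passes through; a path's complexity is its hop count, and the
    field's final complexity is the maximum over all of its paths."""
    return max(len(path) for path in paths)

# The three paths of t_0_2.f_1 from the example: two paths of 2 hops and
# one of 3 hops, so Mcc = max{2, 2, 3} = 3.
paths = [["t_1_2.f_1", "t_3_2.f_1"],
         ["t_1_2.f_2", "t_2_1.f_3"],
         ["t_1_2.f_2", "t_2_2.f_2", "t_3_2.f_2"]]
mcc = chain_complexity(paths)
```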
When analyzing adjacent upstream and downstream fields, SQL-type operations are identified through SQL syntax analysis and field extraction techniques, and non-SQL operations (typically operations on a system interactive interface) are identified from system operation records; finally, combined with metadata store (MetaStore) matching, adjacent processing chains are generated and concatenated into complete processing chains. All processing chains are then traversed and the fan-in index M_FI and fan-out index M_FO of all fields are calculated. For example, if fields f_2 and f_3 of table t_1_1 are processed to form field f_2 of table t_2_1, the fan-in of field f_2 of table t_2_1 is said to be 2. If field f_1 of table t_0_2 is split to form fields f_1 and f_2 of table t_1_2, the fan-out of field f_1 of table t_0_2 is said to be 2. Note that multiple uses of a field do not change its fan-out.
Calculate the inter-domain asset coverage index MR_CDAC from the processing chains and the metadata information of each table (obtained from the metadata and lineage management module). The calculation of MR_CDAC is described as follows: first, different computing and storage hierarchies are called different "domains", and the domain a piece of data belongs to can be marked with tags; tags can also describe information such as the data's use, time and source. In the following case, the tags mainly describe tables. As shown in fig. 4, table t_2_1 carries the tags "dws", "all", "pers"; table t_2_2 carries "dws", "all", "corp", "cref"; table t_3_1 carries "app", "corp-app"; table t_3_2 carries "app", "corp-app". Inter-domain asset coverage can be calculated at different granularities. For example, based on the tags above, the coverage of table t_2_2 by table t_3_2 can be calculated: since field f_2 of table t_2_2 is processed to form field f_2 of table t_3_2, the inter-domain asset coverage of table t_2_2 by table t_3_2 is 1/3 = 33.33%. Similarly, the assets in the domain tagged "dws" comprise tables t_2_1 and t_2_2, which together have 6 fields, so the coverage of the "dws" domain assets by table t_3_2 is 1/6 = 16.67%. Further, the assets in the domain tagged "app" comprise tables t_3_1 and t_3_2, which together draw on 3 fields from tables t_2_1 and t_2_2 in the "dws" domain; since t_2_1 and t_2_2 together have 6 fields, the coverage of the "dws" domain assets by the "app" domain assets is 3/6 = 50%.
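A sketch of the tag-based coverage calculation, assuming tag-to-table metadata and a precomputed set of source-domain fields that feed the target asset (both structures are assumptions; the tags and field counts reproduce the fig. 4 example). Coverage is the share of the source domain's fields consumed by the target.

```python
# table -> (tags, number of fields), following the fig. 4 labels
tables = {
    "t_2_1": ({"dws", "all", "pers"}, 3),
    "t_2_2": ({"dws", "all", "corp", "cref"}, 3),
}

# fields of the "dws" domain that are processed into table t_3_2
used_by_t_3_2 = {"t_2_2:f_2"}

def domain_coverage(used_fields, source_tag):
    """Share of the source domain's fields that feed the target asset."""
    total = sum(n for tags, n in tables.values() if source_tag in tags)
    return len(used_fields) / total

print(round(domain_coverage(used_by_t_3_2, "dws"), 4))  # -> 0.1667
```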
Track and record the node activity index M_A from the processing actions on all fields. Within a specific observation period or unit of time (denoted T), count the number of processing operations (processing of a field covers reading, cleaning, converting, migrating, etc.) experienced by a specific field of a given table; this count is the activity index M_A.
Calculate the lineage difference activity ratio index MR_DAL from the field-level activity index and the lineage data obtained from the processing chains. For a particular field and a complete, specific data flow path containing that field, all nodes (here, fields) on the path are said to be related by lineage, and the path is called a lineage path. A terminating node f may have multiple lineage paths. As shown in fig. 4, field f_2 of table t_3_1 experiences two lineage paths, i.e. t_0_1:f_2 -> t_1_1:f_2 -> t_2_1:f_2 -> t_3_1:f_2 and t_0_1:f_3 -> t_1_1:f_3 -> t_2_1:f_2 -> t_3_1:f_2. For terminating node f, the activity indexes M_A of all nodes on all related paths (assume there are n nodes besides f) are tracked and calculated in the same observation period or time T; if the number of processing or access operations on node f is N_Acc, the lineage difference activity ratio of the terminating node f is the maximum activity of the non-terminating nodes of all lineage paths divided by the access count of the terminating node, i.e. MR_DAL = Max{M_A,1~n} / N_Acc.
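The formula MR_DAL = Max{M_A,1~n} / N_Acc can be sketched directly; the activity numbers below are illustrative only (not taken from fig. 4), and the guard against a never-accessed terminating node is an added assumption.

```python
# Lineage difference activity ratio: maximum activity index among all
# non-terminating nodes on the field's lineage paths, divided by the
# terminating field's own processing/access count in the same period T.

def mr_dal(upstream_activity, n_acc):
    """upstream_activity: M_A values of the n non-terminating nodes."""
    if n_acc == 0:
        raise ValueError("terminating node was never accessed in period T")
    return max(upstream_activity) / n_acc

# e.g. upstream nodes processed 8, 5 and 12 times; terminator accessed 4 times
print(mr_dal([8, 5, 12], 4))  # -> 3.0
```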
Perform periodic static analysis on the database tables generated in all of the above processes (the same observation period as the other tracking calculations may be used) and calculate all quality indexes:
Null rate MR_N: for a field F, count the number of times F is null in table T (contains no value, i.e. null/Nil, or a meaningless null, such as a meaningless blank string or meaningless default fill of zeros), denoted C_NF; the total number of records in table T is denoted C_A; then MR_N1 = C_NF / C_A.
For table T, sum the null counts of all fields, denoted Sum(C_NF); the number of fields of table T is N_F; then the table-level null rate is MR_N2 = Sum(C_NF) / (C_A * N_F).
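A sketch of both null-rate formulas, MR_N1 = C_NF / C_A and MR_N2 = Sum(C_NF) / (C_A * N_F). Which sentinel values count as "meaningless nulls" is a per-deployment decision; the set used below is an assumption for illustration.

```python
# Assumed sentinels for "null or meaningless null" (blank string, default 0)
NULLISH = {None, "", 0}

def null_rates(rows, fields):
    """rows: list of dict records; fields: the N_F field names of table T."""
    c_a, n_f = len(rows), len(fields)
    c_nf = {f: sum(1 for r in rows if r.get(f) in NULLISH) for f in fields}
    mr_n1 = {f: c_nf[f] / c_a for f in fields}  # per-field null rate
    mr_n2 = sum(c_nf.values()) / (c_a * n_f)    # whole-table null rate
    return mr_n1, mr_n2

rows = [{"f_1": "a", "f_2": None}, {"f_1": "", "f_2": "b"}]
mr_n1, mr_n2 = null_rates(rows, ["f_1", "f_2"])
print(mr_n1["f_1"], mr_n2)  # -> 0.5 0.5
```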
Error rate MR_WR: there are two types of error values, general errors and business errors. A general error is a data anomaly that can be recognized without business knowledge, such as garbled text, meaningless symbols or data format errors; a business error is data that violates business rules preset by a person or the system, such as value domain requirements (value set/range).
For a field F, count the number of times F has an error value in table T, denoted C_NF; the total number of records in table T is denoted C_A; then MR_WR1 = C_NF / C_A.
For table T, sum the error-value counts of all fields, denoted Sum(C_NF); the number of fields of table T is N_F; then the table-level error rate is MR_WR2 = Sum(C_NF) / (C_A * N_F).
Repetition rate MR_Dup: for a selected set of fields/attributes, all data records are compared using the record's primary/unique key as the judgment key.
For a field F (not the judgment key), the number of records identified by the primary key is N1; the number of records carrying a given repeated value of F is N_Dup1 (counted when the number of repeated records is greater than 1); the sum over all repeated values is Sum(N_Dup1); the repetition rate of field F is MR_Dup1 = Sum(N_Dup1) / N1.
For table T, compute a digest over all non-key fields of each record (ordered by the natural storage order or by the dictionary order of field names), compare the digest information of all records, and denote the number of repetitions N_Dup2; the total number of records in T is N2; the table repetition rate is MR_Dup2 = N_Dup2 / N2.
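A sketch of the digest-based table repetition rate MR_Dup2 = N_Dup2 / N2: non-key fields are serialized in a fixed order, hashed, and records whose digest has already been seen count as repetitions. The field names, the SHA-256 choice, and the `"|"` separator are assumptions for illustration.

```python
import hashlib

def table_repetition_rate(rows, nonkey_fields):
    """rows: list of dict records; nonkey_fields: fields entering the digest."""
    seen, n_dup = set(), 0
    for r in rows:
        # fixed (sorted) field order so equal record contents hash equally
        payload = "|".join(str(r[f]) for f in sorted(nonkey_fields))
        digest = hashlib.sha256(payload.encode()).hexdigest()
        if digest in seen:
            n_dup += 1  # repeated content under a different primary key
        else:
            seen.add(digest)
    return n_dup / len(rows)

rows = [
    {"id": 1, "name": "a", "city": "x"},
    {"id": 2, "name": "a", "city": "x"},  # duplicate content, new key
    {"id": 3, "name": "b", "city": "y"},
]
print(round(table_repetition_rate(rows, ["name", "city"]), 4))  # -> 0.3333
```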
Timeliness satisfaction rate MR_Chr-S: for a particular data set S, the timeliness satisfaction rate describes what proportion of the data in S meets the business requirements on data freshness/time coordinate range, i.e. the timeliness requirements. Note that, unlike the null rate, error rate and repetition rate, the timeliness satisfaction index serves only as a reference in business scenarios that have specific timeliness requirements on the data.
With the primary/unique key of the data records as the judgment key, suppose the data set S contains C_S records, of which C_NS records are contained in a reference data set S_Ref with C_Ref records. The data set S_Ref is typically an entity directory subject to a timeliness constraint (note that it may be a logical directory, i.e. not physically materialized); for example, an entity directory updated over a fixed or dynamic time range/window according to business needs, typically obtained by filtering a "master data" directory with superimposed timeliness constraints.
Then the timeliness satisfaction rate is MR_Chr-S = C_NS / C_Ref, where 0 <= C_NS <= C_Ref. Correspondingly, the timeliness overflow rate is MR_Chr-O = (C_S - C_NS) / C_Ref. In general, when C_S and C_NS are unequal, the data asset operation condition is evaluated comprehensively by combining the timeliness satisfaction and timeliness overflow indexes.
The combinations of the three values C_S, C_NS and C_Ref and their interpretations are shown in figs. 1, 2 and 3. In fig. 1 (case 1), set S is contained in set S_Ref, or the two overlap: 1. when C_S = C_NS < C_Ref, set S meets the timeliness requirement on a best-effort basis, and its timeliness satisfaction can still be improved through measures such as data completion and system optimization; 2. when C_S = C_NS = C_Ref (i.e. S and S_Ref coincide), set S fully meets the current timeliness requirement. In fig. 2 (case 2), set S and set S_Ref partially intersect, or set S contains set S_Ref; in general, when the data entities in S are produced by multi-source fusion/processing, part of the data in S falls outside S_Ref. Here, the timeliness satisfaction and timeliness overflow indexes can be combined to comprehensively evaluate the "missing" and timeliness anomalies of the data asset operation. Specifically: 1. when C_S > C_NS and C_NS < C_Ref, set S partially meets the timeliness requirement, but the entity data sources in S are not fully covered by S_Ref; some links of data fusion/processing are not monitored and managed by asset operation, and a complete investigation of missing/risk points across the whole data governance process is required; 2. when C_S > C_NS and C_NS = C_Ref (i.e. set S contains set S_Ref), similarly, some links of the data governance process are not regulated by data asset operation; combined with business practice, this situation means that the upstream and downstream stages of data governance are seriously disconnected, and the asset operation strategies for raw data (or data in the front part of the computing and storage hierarchy), master data/reference data, etc., must be adjusted with priority to ensure tracking and management of the full data life cycle. In fig. 3 (case 3), set S and set S_Ref are disjoint: 1. when C_NS = 0, i.e. S and S_Ref do not intersect, the two entity data sources are usually completely independent/fragmented, there are serious supervision gaps in the data governance process, and full life-cycle management of the data cannot be achieved. In this case, the data asset operation strategy needs to be re-analyzed and an effective mechanism established for tracking data circulation among the computing and storage hierarchies.
Timeliness overflow rate MR_Chr-O: see the description of the timeliness satisfaction index above. It is typically calculated and provided together with the timeliness satisfaction index as a reference for evaluating data asset operation.
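A sketch of the paired timeliness indexes MR_Chr-S = C_NS / C_Ref and MR_Chr-O = (C_S - C_NS) / C_Ref, matching records of S against the timeliness-constrained reference directory S_Ref by primary/unique key; the key sets below are illustrative.

```python
# Timeliness satisfaction and overflow, computed from primary/unique keys.

def timeliness_indexes(s_keys, ref_keys):
    c_s, c_ref = len(s_keys), len(ref_keys)
    c_ns = len(set(s_keys) & set(ref_keys))  # records of S contained in S_Ref
    satisfaction = c_ns / c_ref              # MR_Chr-S
    overflow = (c_s - c_ns) / c_ref          # MR_Chr-O
    return satisfaction, overflow

# illustrating case 2 of the description: S and S_Ref partially intersect
sat, over = timeliness_indexes({"k1", "k2", "k3"}, {"k2", "k3", "k4", "k5"})
print(sat, over)  # -> 0.5 0.25
```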
Following the above description, all indexes have been calculated and stored into the MetaStore for subsequent processing.
Note that where the above processes involve an observation period, it is not mandatory that all processes use the same observation period; only the durations of the observation periods need to be the same.
Measurement and reporting (Measurement & Report):
in the above processes, the metadata information, lineage information, and quality, circulation and application indexes of the global data are tracked and calculated; based on them, the data asset operation efficiency analysis and reporting module can provide corresponding analysis results at different granularities such as database, table and field, and integrate and output an evaluation conclusion.
A typical data asset operation efficiency analysis report may be presented as automatically generated charts, text, etc., including but not limited to the following:
basic analysis of each database and table (including the number of tables/fields/records, type distribution, etc.);
data quality indexes, data circulation indexes and data application indexes of each database, table and field.
Results of deep analysis of tables that require special business attention can also be provided; for example, field index information that does not participate in calculation can be eliminated, and other indexes can be interpreted in combination with the business scenario (for example, rules preset manually and executed automatically by the system, producing interpretation results matching the rules).
Based on the index distribution, the system gives evaluation conclusions and suggestions, and data asset operators, combining the analysis results automatically provided by the system, carry out auditing and follow-up handling.
It should be noted that the foregoing merely illustrates the technical idea of the present invention and is not intended to limit its scope of protection; a person skilled in the art may make several improvements and modifications without departing from the principles of the present invention, and such improvements and modifications fall within the scope of the claims of the present invention.

Claims (10)

1. A method for evaluating the operational effectiveness of a data asset, the method comprising the steps of:
the acquired data source is imported into a database, the warehoused data is extracted and stored, and the marked metadata update records are formally registered after correction or auditing;
during metadata generation and registration, the metadata and lineage management module sends metadata registration information to the global event message queue, and the data management module, the data application module and the data asset operation efficiency analysis and reporting module each receive the information notification and execute corresponding actions or update their own data;
the metadata information, lineage information and various indexes of the global data are tracked and calculated; the data asset operation efficiency analysis and reporting module provides corresponding analysis results at different granularities according to them, and integrates and outputs an evaluation conclusion.
2. The method of claim 1, wherein in the data management module the data is processed by integration, processing and analysis operations; in the data application module the data is packaged in the form of APIs and SDKs and then served externally; and the step of the data flowing between different storage and computing hierarchies to the final application comprises:
establishing association 2-tuples among all fields, and calculating the information loss rate index MR_IL according to field types or business requirements;
tracking the upstream and downstream fields of all recorded fields, concatenating the processing chains forming all fields, and calculating therefrom the field chain complexity index M_CC, the final chain complexity of a field taking the maximum of the chain complexities of the paths it experiences;
tracking and recording the node activity index M_A from the processing actions on all fields, counting the number of processing operations experienced by a specific field of a table within an observation period of specific duration or per unit time;
calculating the lineage difference activity ratio index MR_DAL according to the field-level activity index and the lineage data acquired from the processing chains, where MR_DAL = Max{M_A,1~n} / N_Acc, n is the number of nodes other than the terminating node, M_A is the node activity index, and N_Acc is the number of processing or access operations of the terminating node;
and performing periodic static analysis on the database tables generated in all the processes, and calculating all quality indexes.
3. The method for evaluating the operation efficiency of a data asset according to claim 2, wherein, when analyzing adjacent upstream and downstream fields, SQL-like operations are identified by SQL-like syntax parsing and field extraction techniques, non-SQL operations are identified by system operation records, and finally, combined with metadata store matching, adjacent processing chains are generated and concatenated into complete processing chains; all processing chains are traversed, and the fan-in index M_FI and fan-out index M_FO of all fields are calculated.
4. A method of evaluating performance of a data asset operation as claimed in claim 3, wherein the inter-domain asset coverage index MR_CDAC is calculated based on the processing chains and the metadata information of each table; the calculation of MR_CDAC comprises first referring to different computing and storage hierarchies as different domains, the domain to which data belongs being marked with a tagging technique, the tags also describing the data information.
5. The method of claim 2, wherein the quality indexes comprise a null rate index MR_N; for a field, the field null rate index MR_N1 = C_NF / C_A, where C_NF is the number of times the field has a null value in the table and C_A is the total number of records in the table; for the table, the table null rate MR_N2 = Sum(C_NF) / (C_A * N_F), where Sum(C_NF) is the sum of the null counts of all fields and N_F is the number of fields of the table.
6. The method of claim 5, wherein the quality indexes comprise an error rate index MR_WR; the field error rate index MR_WR1 = C_NF / C_A, where C_NF is the number of times the field has an error value in the table and C_A is the total number of records in the table; the table error rate MR_WR2 = Sum(C_NF) / (C_A * N_F), where Sum(C_NF) is the sum of the error-value counts of all fields and N_F is the number of fields of the table.
7. The method of claim 6, wherein the quality indexes comprise a repetition rate index MR_Dup; for a selected set of fields/attributes, all data records are compared according to the record primary/unique key; the field repetition rate index MR_Dup1 = Sum(N_Dup1) / N1, where N1 is the number of records identified by the record primary key, N_Dup1 is the number of records carrying a given repeated value of the field, and Sum(N_Dup1) is the sum over all repeated values; for the table, digests of all non-key fields of each record are calculated and the digest information of all records is compared, giving the table repetition rate index MR_Dup2 = N_Dup2 / N2, where N_Dup2 is the number of repetitions and N2 is the total number of records in the table.
8. The method of claim 7, wherein the quality indexes comprise a timeliness satisfaction rate index MR_Chr-S; for a particular data set S, the timeliness satisfaction rate describes the proportion of the data in S that meets the business requirements on data freshness/time coordinate range; with the primary/unique key of the data records as the judgment key, the data set S contains C_S records, of which C_NS records are contained in a reference data set S_Ref with C_Ref records, and the timeliness satisfaction rate index MR_Chr-S = C_NS / C_Ref, where 0 <= C_NS <= C_Ref.
9. The method of claim 8, wherein the quality indexes comprise a timeliness overflow rate index MR_Chr-O, where MR_Chr-O = (C_S - C_NS) / C_Ref, and the timeliness overflow rate index MR_Chr-O is calculated and provided together with the timeliness satisfaction rate index MR_Chr-S as a reference for evaluating the data asset operation condition.
10. A data asset operation efficiency evaluation system based on the method of any one of claims 1-9, wherein the system comprises:
a data acquisition module, interfacing with internal and external data sources, and forming data resources corresponding to the business digitization process through active and passive internal/external acquisition or through business generation and accumulation;
a data management module, configured to perform cleaning, conversion, splitting, aggregation, migration and statistical calculation operations on the data;
a data application module, configured to output, through product and service packaging and facing the market, APIs, SDKs, research and analysis reports, and packaged solutions;
a global event message queue, serving as the unified monitoring and information distribution channel for global events of the system, wherein a message consumer, after completing registration and subscribing to the corresponding topics of the message queue, receives information notifications and takes the corresponding handling measures appropriate to it;
a metadata and lineage management module, the core module of data asset operation efficiency evaluation, configured to track the whole process of data circulation among different computing and storage hierarchies, construct the lineage of the global data, and dynamically track and calculate along the data quality, data circulation and data application dimensions to obtain efficiency evaluation indexes;
and a data asset operation efficiency analysis and reporting module, which outputs data asset operation efficiency analysis reports in various formats according to the calculation results of the metadata and lineage management module, to assist data asset operation.
CN202310802066.1A 2023-07-03 2023-07-03 Data asset operation efficiency evaluation method and system Pending CN116911671A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310802066.1A CN116911671A (en) 2023-07-03 2023-07-03 Data asset operation efficiency evaluation method and system


Publications (1)

Publication Number Publication Date
CN116911671A true CN116911671A (en) 2023-10-20

Family

ID=88357334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310802066.1A Pending CN116911671A (en) 2023-07-03 2023-07-03 Data asset operation efficiency evaluation method and system

Country Status (1)

Country Link
CN (1) CN116911671A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216592A (en) * 2023-11-07 2023-12-12 青岛港国际股份有限公司 Idle analysis system and analysis method for assets
CN117910850A (en) * 2023-12-18 2024-04-19 北京宇信科技集团股份有限公司 Index data analysis engine, index data calculation device and calculation method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination