CN115293243A - Method, device and equipment for realizing intelligent matching of data assets - Google Patents

Info

Publication number
CN115293243A
Authority
CN
China
Prior art keywords
matching
data
evaluation value
repetition rate
rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210828078.7A
Other languages
Chinese (zh)
Inventor
王宁
张延生
朱拥军
牟岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoneng Wangxin Technology Beijing Co ltd
Original Assignee
Guoneng Wangxin Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, and provides a method, a device and equipment for realizing intelligent matching of data assets. The method for realizing intelligent matching of the data assets comprises the following steps: matching the name and the field of the data asset in a data asset library to obtain a matching result; calculating a first matching evaluation value corresponding to the data asset according to the name similarity rate and the field repetition rate of the data asset and the matching result; if the normalized first matching evaluation value is not the end point of the range, recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate; mapping the second matching evaluation value to an end point of the range. The embodiment of the invention also provides a corresponding device and equipment. The implementation method provided by the invention can improve the efficiency of data duplication checking and data screening.

Description

Method, device and equipment for realizing intelligent matching of data assets
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for realizing intelligent matching of data assets, equipment and a computer-readable storage medium.
Background
Data assets are data resources, recorded physically or electronically, that are owned or controlled by an individual or an enterprise and can bring future economic benefits to the enterprise. Data assets are considered one of the most important forms of assets in the digital age. Metadata is the basic information from which data assets are constructed. During metadata collection, newly collected metadata may duplicate or closely resemble existing metadata, and such duplication or similarity greatly affects the uniqueness and authority of the data assets. To address this problem, a scheme for intelligently identifying and matching data assets is provided: duplicate and similar data assets are identified by an algorithm, the result is submitted for manual judgment, and the corresponding parameters of the algorithm are automatically corrected according to the manual judgment so as to gradually improve the accuracy of intelligent matching. In this way, the uniqueness and authority of the data assets are intelligently identified.
Intelligent identification of data assets is essentially the identification of similar data assets and data by an algorithm. In the prior art, data assets are mainly checked manually, which is inefficient. No relevant design or implementation for the intelligent identification of data assets is currently known.
Disclosure of Invention
The embodiment of the invention aims to provide a method, a device and equipment for realizing intelligent matching of data assets, which improve the uniqueness and authority of the data assets in the process of acquiring and generating the data assets by using a mathematical algorithm.
In order to achieve the above object, a first aspect of the present invention provides a method for implementing intelligent matching of data assets, the method including:
matching the names and the fields of the data assets in a data asset library to obtain a matching result; calculating a first matching evaluation value corresponding to the data asset according to the name similarity rate and the field repetition rate of the data asset and the matching result; if the normalized first matching evaluation value is not the end point of the range, recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate; mapping the second matching evaluation value to an end point of the range.
Preferably, the name similarity rate is calculated by the following step: obtaining the name similarity rate according to the number of consecutive identical characters in the name of the data asset and the name of the matching result, and the total number of characters in the name of the matching result.
Preferably, the field repetition rate is calculated by: acquiring the field sets of the data asset and of the matching result as a first list and a second list, respectively; acquiring the number of identical fields in the first list and the second list; and obtaining the field repetition rate according to the number of identical fields and the number of fields in the second list.
Preferably, recalculating to obtain the second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate includes: determining the initial weights and the weight adjustment step of the name similarity rate and the field repetition rate; adjusting the initial weights stepwise according to the weight adjustment step, and calculating a matching evaluation value from the adjusted weights each time; and taking the maximum of the resulting matching evaluation values as the second matching evaluation value.
Preferably, the method further comprises: taking the weights corresponding to the second matching evaluation value as the optimal weights; for other data assets having an association relationship with the data asset, the step of recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate is replaced by: recalculating to obtain the second matching evaluation value according to the optimal weights of the name similarity rate and the field repetition rate.
Preferably, the method further comprises: in addition to the name similarity rate and the field repetition rate, adding at least one of the following labels and setting a corresponding weight for each added label: data type, whether the data is time series data, data magnitude, user information, and data source.
Preferably, the method further comprises: if the normalized first matching evaluation value is the Boolean value representing 'matching', acquiring the time attributes of the data asset and of the matching result, and determining the action to be performed on the data asset according to the time attributes.
In a second aspect of the present invention, there is also provided an apparatus for implementing intelligent matching of data assets, the apparatus including:
the data matching module is used for matching the names and the fields of the data assets in the data asset library to obtain a matching result; the first evaluation calculation module is used for calculating a first matching evaluation value corresponding to the data asset according to the name similarity rate and the field repetition rate of the data asset and the matching result; the second evaluation calculation module is used for recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate if the normalized first matching evaluation value is not the end point of the range; and a matching result module for mapping the second matching evaluation value to an end point of the range.
In a third aspect of the present invention, there is also provided an apparatus for implementing intelligent matching of data assets, including a memory, a processor and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the foregoing method for implementing intelligent matching of data assets.
In a fourth aspect of the present invention, there is also provided a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform the steps of the aforementioned implementation method for intelligent matching of data assets.
A fifth aspect of the invention provides a computer program product comprising a computer program which, when executed by a processor, implements the aforementioned method of implementing intelligent matching of data assets.
The technical scheme at least has the following beneficial effects:
According to this scheme, multiple algorithms can be fused in a computer program to find highly similar items in a large amount of data, and the fused algorithm is continuously optimized using the manual judgment results and the auxiliary labels, thereby improving the accuracy with which the uniqueness and authority of data assets are determined. The scheme can also be readily ported to other data duplicate-checking scenarios.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 schematically shows an implementation diagram of a method for intelligent matching of data assets according to an embodiment of the invention;
FIG. 2 schematically shows a data asset management flow diagram according to an embodiment of the invention;
FIG. 3 schematically shows the steps for performing an authoritative check according to an embodiment of the invention;
FIG. 4 schematically shows the structure of an implementation device for intelligent matching of data assets according to an embodiment of the invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
Fig. 1 schematically shows an implementation diagram of an implementation method of intelligent matching of data assets according to an embodiment of the invention. As shown in fig. 1, the method includes:
S01, matching the name and the fields of the data asset in a data asset library to obtain a matching result;
S02, calculating a first matching evaluation value corresponding to the data asset according to the name similarity rate and the field repetition rate of the data asset and the matching result;
S03, if the normalized first matching evaluation value is not an end point of the range, recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate;
S04, mapping the second matching evaluation value to an end point of the range.
In the present embodiment, a metric for the matching evaluation value is defined: if the two items are completely the same, the matching evaluation value is 100%, and if they are completely different, it is 0%. When the matching evaluation value is normalized, the two end points are 0 and 1, where 0 represents completely different and 1 represents completely the same. In most cases, however, the obtained matching evaluation value lies between 0 and 1, meaning that the items are partially the same, so a further matching determination step is required. Another matching evaluation value, namely the second matching evaluation value, is then calculated, and a Boolean value indicating whether the data assets are similar is obtained from it. The data asset library may also be called a data governance platform or a data management platform.
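The branching between steps S03 and S04 can be sketched roughly as follows. This is only an illustrative sketch: the function names and the fixed `cutoff` used to map a value to an end point are assumptions, since the text does not state the mapping rule.

```python
def needs_second_evaluation(first_value: float) -> bool:
    """Return True when the normalized value lies strictly between the end points."""
    if not 0.0 <= first_value <= 1.0:
        raise ValueError("normalized value must lie in [0, 1]")
    # 0 means completely different, 1 means completely the same.
    return first_value not in (0.0, 1.0)


def map_to_end_point(second_value: float, cutoff: float = 0.6) -> int:
    """Map a second evaluation value to an end point: 1 (same) or 0 (different).

    ASSUMPTION: a fixed cutoff is used here purely for illustration; the
    source does not specify how the mapping is performed.
    """
    return 1 if second_value > cutoff else 0
```

A value of exactly 0 or 1 needs no further processing, while anything in between is passed to the second evaluation and then mapped to an end point.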
FIG. 2 schematically illustrates a data asset management flow diagram according to an embodiment of the present invention. As shown in fig. 2, the data asset management mainly includes several links, namely, metadata collection, data object storage, data asset inventory, data asset distribution, and data service creation and provision. To ensure that data assets are unique and authoritative, certain processing of data objects is required before they are stored (binned).
Through this implementation, the similarity between metadata not yet collected and metadata already collected can be judged intelligently, which improves the precision of intelligent matching of data assets as well as the efficiency and precision of collection.
In some embodiments provided herein, the name similarity rate is calculated as follows: the name similarity rate is obtained from the number of consecutive identical characters in the name of the data asset and the name of the matching result, and the total number of characters in the name of the matching result. For example: the Chinese or English names of the only 3 fields in table A are identical (case-insensitively) to those of the only 3 fields in table B. If the range of table B is greater than that of table A, the two cannot be judged to be exactly the same. When the two are partially the same, the name similarity rate is calculated by the following formulas:
name similarity rate (English) = number of consecutive identical characters / total number of characters of the English table name in the data asset library × 100%;
name similarity rate (Chinese) = number of consecutive identical characters / total number of characters of the Chinese table name in the data asset library × 100%.
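As an illustration of the formulas above, the following minimal sketch computes a name similarity rate as a fraction in [0, 1]. It assumes that "consecutive identical characters" means the longest common run of characters shared by the two names and that the comparison ignores case; the function names are hypothetical.

```python
def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters shared by a and b."""
    a, b = a.lower(), b.lower()  # ASSUMPTION: case-insensitive comparison
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def name_similarity_rate(new_name: str, library_name: str) -> float:
    """Longest common run divided by the length of the library (matching result) name."""
    if not library_name:
        return 0.0
    return longest_common_run(new_name, library_name) / len(library_name)
```

For instance, comparing a hypothetical new table name `user_info` with a library table name `USER_INFO_BAK` gives 9 shared consecutive characters out of 13, i.e. a rate of 9/13.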
In some embodiments of the present invention, the field repetition rate is calculated by: acquiring the field sets of the data asset and of the matching result as a first list and a second list, respectively; acquiring the number of identical fields in the first list and the second list; and obtaining the field repetition rate according to the number of identical fields and the number of fields in the second list. Specifically, the repetition rate of the Chinese and English fields is: field repetition rate = number of identical fields / total number of fields contained in the collected data asset × 100%. Further, the first matching evaluation value is obtained by: first matching evaluation value = field repetition rate × field weight + table name similarity rate × table name weight, where table name weight + field weight = 1. To facilitate understanding and implementation by those skilled in the art, one embodiment is specifically exemplified as follows:
The pseudo code of the algorithm for determining that data assets are partially the same is as follows:
calculate the name similarity rate of the uncollected table name A and the collected table name B;
calculate the field repetition rate of the uncollected table A and the collected table B;
calculate the matching evaluation value = field repetition rate × field weight + table name similarity rate × table name weight;
if (overall repetition rate > threshold 60%) { recommend the pair {A, B} to the user for selection; record the calculation result; }
else { record the calculation result; do not recommend; } // below the recommendation threshold
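The pseudo code above could be realized along the following lines. This is a sketch under stated assumptions: field names are compared case-insensitively and treated as unique within a table, the denominator is the field count of the already collected table (the second list), and all function names are hypothetical.

```python
def field_repetition_rate(new_fields, library_fields) -> float:
    """Share of the library table's fields that also appear in the new table."""
    new_set = {f.lower() for f in new_fields}
    lib_set = {f.lower() for f in library_fields}
    if not lib_set:
        return 0.0
    return len(new_set & lib_set) / len(lib_set)


def matching_evaluation(name_rate: float, field_rate: float,
                        name_weight: float = 0.5) -> float:
    """Weighted sum; the field weight is 1 - name_weight so the weights sum to 1."""
    if not 0.0 <= name_weight <= 1.0:
        raise ValueError("name_weight must lie in [0, 1]")
    field_weight = 1.0 - name_weight
    return field_rate * field_weight + name_rate * name_weight


def should_recommend(value: float, threshold: float = 0.6) -> bool:
    """Recommend the pair {A, B} to the user only above the 60% threshold."""
    return value > threshold
```

With a name similarity rate of 0.8, a field repetition rate of 0.5 and equal weights, the evaluation value is about 0.65, so the pair would be recommended to the user.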
In some embodiments of the present invention, recalculating to obtain the second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate includes: determining the initial weights and the weight adjustment step of the name similarity rate and the field repetition rate; adjusting the initial weights stepwise according to the weight adjustment step, and calculating a matching evaluation value from the adjusted weights each time; and taking the maximum of the resulting matching evaluation values as the second matching evaluation value. Specifically, the weight x = 0.5 is taken as the initial value, and x is increased or decreased iteratively with a step size of 0.01; the matching evaluation value Vsim is recalculated in a loop until Vsim between the two data assets reaches its maximum. If Vsim is found to become smaller and smaller during the calculation, the step direction is wrong and is switched, e.g. from increasing by 0.01 to decreasing by 0.01.
The comparison objects may be determined as follows: suppose the currently collected asset A has table name Tna and fields Cna1, Cna2, …, Cna5, with initial weight x = 0.5.
According to the calculation, the collected assets B1, B2, …, B6 each have a similarity to A of more than 60%, namely 0.95, 0.83, 0.73, 0.89, 0.92 and 0.77, respectively. The user labels B2 (0.83) as the final result, and the other 5 options are rejected.
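The stepwise weight adjustment described above can be sketched as a simple one-dimensional search. The sketch assumes that x is the table name weight, that the field weight is 1 - x, and that the search is run for a single pair of assets; `search_weight` is a hypothetical name.

```python
def search_weight(name_rate: float, field_rate: float,
                  x0: float = 0.5, step: float = 0.01):
    """Step x from x0 in the direction that increases Vsim; return (x, Vsim)."""

    def vsim(x: float) -> float:
        # ASSUMPTION: x weights the name rate, 1 - x weights the field rate.
        return name_rate * x + field_rate * (1.0 - x)

    x, best = x0, vsim(x0)
    direction = 1  # start by increasing the weight
    # If the first increasing step lowers Vsim, the direction is wrong: switch it.
    if vsim(min(1.0, x + step)) < best:
        direction = -1
    while True:
        nxt = round(x + direction * step, 10)  # rounding avoids float drift
        if nxt < 0.0 or nxt > 1.0:
            break
        value = vsim(nxt)
        if value <= best:
            break
        x, best = nxt, value
    return x, best
```

Because this two-term score is linear in x, the search for a single pair ends at an end point of [0, 1] whenever the two rates differ; how the source combines results across several pairs or user labels is not spelled out, so that part is left out of the sketch.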
In some embodiments provided herein, the method further comprises: taking the weights corresponding to the second matching evaluation value as the optimal weights; for other data assets having an association relationship with the data asset, the step of recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate is replaced by: recalculating to obtain the second matching evaluation value using the optimal weights of the name similarity rate and the field repetition rate.
After the optimal x has been calculated for the currently collected data asset A, calculation starts for a batch of not-yet-collected assets, for example assets of the same kind A1, A2, …, A10 and their collected similar assets, with x as the weight:
For the currently collected data asset A1 and its corresponding set of collected similar assets STa1 {collected assets Ta11, Ta12, …, Ta1n}, assume the user labels Ta18 in STa1 as the duplicate asset.
The intermediate steps are analogous and are not repeated here.
For the currently collected asset A10 and its corresponding set of collected similar assets STa10 {collected assets Ta101, Ta102, …, Ta10k}, assume the user labels Ta103 in STa10 as the duplicate asset.
Therefore, there is a one-to-one correspondence between the currently collected data asset set {A1, …, A10} and the user-labeled set {Ta18, …, Ta103}.
For a large number of data assets, for example 1,000 or 10,000, the optimal value of x can be calculated in the same way. The larger the number of data assets, the more stable the value of x; once it has converged to a certain degree, the calculation can be stopped and the value used as the system's default formula weight. The calculation can be performed again when the system changes substantially, which reduces the processing overhead of the system.
In some embodiments provided herein, the method further comprises: in addition to the name similarity rate and the field repetition rate, adding at least one of the following labels and setting a corresponding weight for each added label: data type, whether the data is time series data, data magnitude, user information, and data source. Specifically, in this embodiment, the instances on which the first-stage method failed are marked by adding classification labels. After all data has been processed, the instances that were not successfully judged are run again with the added labels, and this process can be repeated to improve the subsequent identification rate. That is, data that was not successfully identified is labeled manually, adding extra judgment information so that the identification rate improves in subsequent iterations. For example, labels of the following types are added: data category (product, material, personnel, operation, etc.), time series data or non-time series data, whether the data volumes are of the same order of magnitude, user department or user, and source unit or department. These labels are added to the weight formula in a manner similar to that described in the previous section, and the optimal matching degree is then calculated. For simplicity, the repetition rate of each of the above categories is 1 if the two are identical and 0 otherwise.
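One way to fold such labels into the weight formula is sketched below. The sketch assumes that the weights of all terms, including the added labels, sum to 1 (extending the two-term constraint stated earlier); the label keys and function names are hypothetical.

```python
def label_rate(value_a, value_b) -> float:
    """Repetition rate of a label: 1 if the two values are identical, else 0."""
    return 1.0 if value_a == value_b else 0.0


def extended_evaluation(rates: dict, weights: dict) -> float:
    """Weighted sum over the name rate, the field rate and any added labels."""
    if set(rates) != set(weights):
        raise ValueError("rates and weights must use the same keys")
    # ASSUMPTION: all weights together sum to 1.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[key] * rates[key] for key in weights)


rates = {
    "name": 0.8,
    "field": 0.5,
    "data_category": label_rate("product", "product"),
    "time_series": label_rate(True, False),
}
weights = {"name": 0.4, "field": 0.4, "data_category": 0.1, "time_series": 0.1}
score = extended_evaluation(rates, weights)
```

In this illustrative run the two assets share a data category but differ on the time series label, which yields a score of about 0.62.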
Fig. 3 schematically shows the steps for performing an authoritative check according to an embodiment of the invention. As shown in fig. 3, in some embodiments provided herein, the method further comprises: if the normalized first matching evaluation value is the Boolean value representing 'matching', acquiring the time attributes of the data asset and of the matching result, and determining the action to be performed on the data asset according to the time attributes. This step may also be referred to as the authoritative check: if the collected data asset is (manually) determined to duplicate an existing data asset, the authoritative check is performed to ensure that the source of the data asset is authoritative. Whether data is authoritative is judged from the data creation time in the database table corresponding to the data asset. In principle, the data created earlier is the authoritative data asset. For example, B is the data asset collected this time and has been judged (labeled) to duplicate the existing asset A. The data corresponding to data asset B and to data asset A are retrieved and their creation times compared; the data asset whose data was created earlier is judged to be the authoritative data asset. If the existing A is the authoritative data asset, the existing data asset is labeled accordingly; if B is judged to be the authoritative data asset, asset A is deleted or set invalid, and data asset B is collected as the authoritative data asset. The pseudo code of the algorithm is as follows:
[Figures BDA0003744764220000081 and BDA0003744764220000091: pseudo code of the authoritative check algorithm, provided as images]
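Since the pseudo code is only available as images, the following is a minimal sketch of the authoritative check as described in the text, not a transcription of the figures. The `DataAsset` type, the action strings, and the handling of equal creation times (the existing asset is kept) are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DataAsset:
    name: str
    created_at: datetime  # creation time of the data in the underlying table


def authoritative_action(existing: DataAsset, incoming: DataAsset) -> str:
    """Decide what to do with a newly collected asset judged to be a duplicate."""
    if incoming.created_at < existing.created_at:
        # The new asset B is older: invalidate or delete A and collect B.
        return "replace_existing"
    # The existing asset A is older (or equally old): label A as authoritative.
    return "keep_existing"


a = DataAsset("A", datetime(2021, 3, 1))
b = DataAsset("B", datetime(2020, 11, 15))
action = authoritative_action(existing=a, incoming=b)
```

Here B's data was created earlier than A's, so B would be collected as the authoritative asset and A deleted or set invalid.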
Based on the same inventive concept, the embodiment of the invention also provides a device for realizing intelligent matching of data assets. Fig. 4 is a schematic structural diagram of an implementation device for intelligent matching of data assets according to an embodiment of the invention. As shown in fig. 4, the device includes: a data matching module for matching the name and the fields of the data asset in the data asset library to obtain a matching result; a first evaluation calculation module for calculating a first matching evaluation value corresponding to the data asset according to the name similarity rate and the field repetition rate of the data asset and the matching result; a second evaluation calculation module for recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate if the normalized first matching evaluation value is not an end point of the range; and a matching result module for mapping the second matching evaluation value to an end point of the range.
In some alternative embodiments, the name similarity rate is calculated by: obtaining the name similarity rate according to the number of consecutive identical characters in the name of the data asset and the name of the matching result, and the total number of characters in the name of the matching result.
In some alternative embodiments, the field repetition rate is calculated by: acquiring the field sets of the data asset and of the matching result as a first list and a second list, respectively; acquiring the number of identical fields in the first list and the second list; and obtaining the field repetition rate according to the number of identical fields and the number of fields in the second list.
In some optional embodiments, recalculating to obtain the second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate includes: determining the initial weights and the weight adjustment step of the name similarity rate and the field repetition rate; adjusting the initial weights stepwise according to the weight adjustment step, and calculating a matching evaluation value from the adjusted weights each time; and taking the maximum of the resulting matching evaluation values as the second matching evaluation value.
In some optional embodiments, the method further includes taking the weights corresponding to the second matching evaluation value as the optimal weights; for other data assets having an association relationship with the data asset, the step of recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate is replaced by: recalculating to obtain the second matching evaluation value according to the optimal weights of the name similarity rate and the field repetition rate.
In some optional embodiments, the method further comprises: in addition to the name similarity rate and the field repetition rate, adding at least one of the following labels and setting a corresponding weight for each added label: data type, whether the data is time series data, data magnitude, user information, and data source.
In some optional embodiments, the method further comprises: if the normalized first matching evaluation value is the Boolean value representing 'matching', acquiring the time attributes of the data asset and of the matching result, and determining the action to be performed on the data asset according to the time attributes.
The specific limitations of each functional module in the above apparatus for implementing intelligent matching of data assets can be referred to the limitations of the above method for implementing intelligent matching of data assets, and are not described herein again. The various modules in the above-described apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In some embodiments provided by the present invention, an implementation apparatus for intelligent matching of data assets is further provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the foregoing method for intelligent matching of data assets when executing the computer program. The processor here has numerical calculation and logical operation functions, and includes at least a central processing unit (CPU) with data processing capability, a random access memory (RAM), a read-only memory (ROM), various I/O ports, an interrupt system, and the like. The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels can be provided, and the method is realized by adjusting kernel parameters. The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and includes at least one memory chip.
In an embodiment of the present invention, there is also provided a computer-readable storage medium, which has instructions stored therein, and when the instructions are executed by a processor on a computer, the instructions cause the processor to be configured to execute the method for implementing intelligent matching of data assets described above.
In one embodiment, the invention provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method for implementing intelligent matching of data assets.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus comprising the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for realizing intelligent matching of data assets, characterized by comprising the following steps:
matching the name and the field of the data asset in a data asset library to obtain a matching result;
calculating a first matching evaluation value corresponding to the data asset according to the name similarity rate and the field repetition rate of the data asset and the matching result;
if the normalized first matching evaluation value is not an end point of the range, recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate; and
mapping the second matching evaluation value to an end point of the range.
2. The method of claim 1, wherein the name similarity rate is calculated by:
obtaining the name similarity rate according to the number of consecutive identical characters shared by the name of the data asset and the name of the matching result, and the total number of characters in the name of the matching result.
3. The method of claim 1, wherein the field repetition rate is calculated by:
respectively acquiring field sets of the data assets and the matching results as a first list and a second list;
acquiring the number of the same fields in the first list and the second list;
and obtaining the field repetition rate according to the number of the same fields and the number of the fields in the second list.
4. The method according to claim 1, wherein recalculating to obtain the second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate comprises:
determining an initial weight and a weight adjustment step for the name similarity rate and the field repetition rate;
adjusting the initial weight stepwise according to the weight adjustment step, and calculating a matching evaluation value with the weight obtained after each adjustment;
and taking the maximum value of the obtained plurality of matching evaluation values as the second matching evaluation value.
5. The method of claim 4, further comprising:
taking the weight corresponding to the second matching evaluation value as an optimal weight;
for other data assets having an association relationship with the data asset, the step of recalculating the second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate is replaced by:
recalculating to obtain the second matching evaluation value using the optimal weight of the name similarity rate and the field repetition rate.
6. The method of claim 1, further comprising: in addition to the name similarity rate and the field repetition rate, at least one of the following tags is added, and a corresponding weight is set for the added tag:
data type, whether the data is time-series data, data magnitude, user information, and data source.
7. The method of claim 1, further comprising:
if the normalized first matching evaluation value is the Boolean value representing "match",
obtaining a time attribute of the data asset and of the matching result, and
determining an action to perform on the data asset based on the time attribute.
8. An apparatus for implementing intelligent matching of data assets, the apparatus comprising:
the data matching module is used for matching the names and the fields of the data assets in the data asset library to obtain a matching result;
the first evaluation calculation module is used for calculating a first matching evaluation value corresponding to the data asset according to the name similarity rate and the field repetition rate of the data asset and the matching result;
the second evaluation calculation module is used for recalculating to obtain a second matching evaluation value by adjusting the corresponding weights of the name similarity rate and the field repetition rate if the normalized first matching evaluation value is not an end point of the range; and
a matching result module for mapping the second matching evaluation value to an end point of the range.
9. An apparatus for implementing intelligent matching of data assets, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method for implementing intelligent matching of data assets of any one of claims 1 to 7.
10. A computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the steps of a method for implementing intelligent matching of data assets of any one of claims 1 to 7.
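The calculations recited in claims 2 to 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It rests on assumptions that the claims do not state: "consecutive identical characters" is read as the longest common contiguous substring; field lists are de-duplicated before counting; the matching evaluation value is a weighted sum whose two weights add up to 1; the weight is swept upward from its initial value to 1.0; and the end-point mapping uses a threshold of 0.5. All function names, parameter names, and default values are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity_rate(asset_name: str, match_name: str) -> float:
    """Longest run of consecutive identical characters, divided by the
    number of characters in the matching result's name (claim 2)."""
    if not match_name:
        return 0.0
    matcher = SequenceMatcher(None, asset_name, match_name, autojunk=False)
    longest = matcher.find_longest_match(0, len(asset_name), 0, len(match_name))
    return longest.size / len(match_name)


def field_repetition_rate(asset_fields, match_fields) -> float:
    """Number of fields shared by both lists, divided by the number of
    fields in the second list (claim 3). Assumes unique field names."""
    first, second = set(asset_fields), set(match_fields)
    if not second:
        return 0.0
    return len(first & second) / len(second)


def evaluation_value(name_rate: float, field_rate: float, name_weight: float) -> float:
    """Assumed form of the matching evaluation value: a weighted sum
    in which the field weight is 1 - name_weight."""
    return name_weight * name_rate + (1.0 - name_weight) * field_rate


def best_evaluation(name_rate, field_rate, initial_weight=0.5, step=0.1):
    """Adjust the weight stepwise, evaluate after each adjustment, and keep
    the maximum (claim 4). Returns (second_value, optimal_weight)."""
    best_value, best_weight = -1.0, initial_weight
    n_steps = int(round((1.0 - initial_weight) / step))
    for i in range(n_steps + 1):
        weight = min(1.0, initial_weight + i * step)
        value = evaluation_value(name_rate, field_rate, weight)
        if value > best_value:
            best_value, best_weight = value, weight
    return best_value, best_weight


def map_to_endpoint(value: float, threshold: float = 0.5) -> int:
    """Map an evaluation value in [0, 1] to an end point of the range;
    the threshold is a hypothetical choice."""
    return 1 if value >= threshold else 0


name_rate = name_similarity_rate("customer_order", "customer_orders")
field_rate = field_repetition_rate(["id", "name", "date"],
                                   ["id", "name", "amount", "date"])
second_value, optimal_weight = best_evaluation(name_rate, field_rate)
print(name_rate, field_rate, second_value, optimal_weight,
      map_to_endpoint(second_value))
```

Under the weighted-sum assumption the evaluation value is linear in the weight, so the maximum always falls at one end of the swept interval. The returned weight corresponds to what claim 5 calls the optimal weight and could be reused for associated data assets.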
CN202210828078.7A 2022-07-13 2022-07-13 Method, device and equipment for realizing intelligent matching of data assets Pending CN115293243A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210828078.7A CN115293243A (en) 2022-07-13 2022-07-13 Method, device and equipment for realizing intelligent matching of data assets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210828078.7A CN115293243A (en) 2022-07-13 2022-07-13 Method, device and equipment for realizing intelligent matching of data assets

Publications (1)

Publication Number Publication Date
CN115293243A true CN115293243A (en) 2022-11-04

Family

ID=83821709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210828078.7A Pending CN115293243A (en) 2022-07-13 2022-07-13 Method, device and equipment for realizing intelligent matching of data assets

Country Status (1)

Country Link
CN (1) CN115293243A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361283A (en) * 2022-12-01 2023-06-30 北京码牛科技股份有限公司 Method, system, terminal and storage medium for identifying association relationship of mass data
CN116361283B (en) * 2022-12-01 2023-09-26 北京码牛科技股份有限公司 Method, system, terminal and storage medium for identifying association relationship of mass data

Similar Documents

Publication Publication Date Title
CN111522989B (en) Method, computing device, and computer storage medium for image retrieval
CN108932257B (en) Multi-dimensional data query method and device
CN106649346B (en) Data repeatability checking method and device
CN109002443B (en) Text information classification method and device
CN106897342B (en) Data verification method and equipment
CN109634682B (en) Configuration file updating method and device for application program
CN107391532B (en) Data filtering method and device
CN112307004B (en) Data management method, device, equipment and storage medium
CN115293243A (en) Method, device and equipment for realizing intelligent matching of data assets
CN117492670A (en) Log printing sequence determining method and device and electronic equipment
CN111427871B (en) Data processing method, device and equipment
CN108241622B (en) Query script generation method and device
CN110928941A (en) Data fragment extraction method and device
CN109947933B (en) Method and device for classifying logs
CN111191007A (en) Article keyword filtering method and device based on block chain and medium
CN111125087A (en) Data storage method and device
CN114817209A (en) Monitoring rule processing method and device, processor and electronic equipment
CN111666347B (en) Data processing method, device and equipment
CN109299125B (en) Database updating method and device
CN111125165A (en) Set merging method, device, processor and machine-readable storage medium
CN107025615B (en) Learning condition statistical method based on learning tracking model
CN108062329B (en) Data import method and device
CN110609926A (en) Data tag storage management method and device
CN110362595A (en) A kind of SQL statement dynamic analysis method
CN110969019A (en) Method and device for disambiguating name

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination