CN114547231A - Data tracing method and system

Info

Publication number: CN114547231A
Application number: CN202011328158.3A
Authority: CN
Inventors: 高灵超; 皮志贤; 黄佩卓; 刘洋; 陈相舟; 王家凯
Original assignee: Big Data Center Of State Grid Corp Of China
Current assignee: Big Data Center Of State Grid Corp Of China
Priority date: 2020-11-24
Filing date: 2020-11-24
Publication date: 2022-05-27

Abstract

The invention provides a data tracing method and system. The method comprises: acquiring a function point list, the display text after the function point is operated, the data table call records related to the function point operation, and all data table lists and data tables; based on these inputs, obtaining the association coefficient between each data table and the function point under each of a plurality of matching methods; and determining the data tables associated with the function point based on those association coefficients. By considering the matching relationship between function points and data tables in multiple dimensions, the invention can accurately find the association between a function point and its corresponding data tables, automatically associates the front-end service functions of a system with the back-end database tables, improves data asset management efficiency, reduces manual workload, enhances software application development capability, and reduces later maintenance cost.

Description

Data tracing method and system

Technical Field

The invention belongs to the technical field of databases, and particularly relates to a data tracing method and system.

Background

With the rapid development of computing, the mobile internet, and information storage capacity, the volume of information of all kinds is growing explosively. With the advent of the cloud computing and big data era, people have become increasingly aware of the importance of data, but data is voluminous and complex, which inevitably brings data-related problems such as data loss, data inconsistency, and unreliable data. When people obtain data, they often need to consider whether it is true and reliable, since unreliable data can lead to wrong decisions. Such data can generally be classified into two types: the original entry data, and data derived from it. Users are usually exposed mostly to derived data, that is, data that has been processed in some way and is often stored after complicated conversion or editing. Because the conversion process is unknown, people are often skeptical of such result data. In fact, the result data sometimes bears no relation to the original data, so attention must be paid to how the result data was generated and where it came from.

Traceability techniques are widely applied in many fields, such as archaeology, physics, astronomy, and archival science. In recent years, data tracing has also developed in the computer field, mainly in research directions such as databases, scientific experiments, and workflows, but research in the field of big data is relatively scarce. As big data platforms are applied more and more widely in enterprises, data of all kinds is processed through a series of big data models to obtain results, and decision makers analyze and decide using the result data. If the result data is inaccurate or its source is unreliable, decision errors can result, and may even bring inestimable losses to the enterprise. Data tracing on big data platforms is therefore increasingly important. Users often need historical information such as data sources and processing procedures to determine whether data is reliable; data tracing can describe the data sources and processing procedures and can support auditing, error location, debugging of processing procedures, and the like.

Existing data tracing methods perform matching analysis between the displayed content and the full contents of the database using string matching or data analysis matching. Because data within one data table may be repeated, data in different tables may be similar, and each piece of data may comprise several parts, a probability model is applied after the content matching analysis to estimate the probability that two pieces of data match, and the source information of the data is obtained from that analysis. However, because data items may be identical, the results from many data sources may be similar, and much data undergoes additional processing before being displayed, the displayed data can differ considerably from the original data. This often causes errors and omissions in matching, so data tracing is not very accurate. How to improve the accuracy of data tracing is therefore a technical problem urgently awaiting solution by those skilled in the art.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a data tracing method, which comprises the following steps:

acquiring a function point list corresponding to a function point for generating data, a display text after the function point is operated, a data table call record related to the function point operation, and all data table lists and data tables;

based on the function point list, the display text after the function point operation, the data table call records related to the function point operation, all the data table lists and the data tables, obtaining the association coefficient of each data table and the function point under each matching method by adopting a plurality of matching methods;

and determining the data table associated with the function point based on the association coefficient of each data table and the function point under each matching method.

Preferably, the matching method comprises: semantic understanding matching, data analysis matching and query tracking matching methods.

Preferably, obtaining the association coefficients of the data tables and the function points under the semantic understanding matching method by adopting a semantic understanding matching method comprises:

determining a function point list corresponding to the function point based on the function point;

respectively determining a data table list corresponding to each data table based on each data table;

respectively carrying out semantic comprehension annotation on the texts in the function point list and each data table list based on a natural language processing method;

based on the understood and annotated function point list and each data table list, vectors corresponding to the function point list and vectors corresponding to each data table list are sequentially constructed;

calculating the cosine similarity of the vector corresponding to each data table list and the vector corresponding to the function point list in sequence;

and taking the cosine similarity as a correlation coefficient of each data table and the function point under a semantic understanding matching method.
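
For reference, the cosine similarity used in the last two steps can be written out as follows. The symbols are assumptions introduced here for illustration and do not appear in the original text: f is the vector built from the function point list, d_t is the vector built from the list of data table t, and n is the shared vector dimension.

```latex
\[
r_{\mathrm{sem}}(t)
  = \cos(\mathbf{f}, \mathbf{d}_t)
  = \frac{\mathbf{f} \cdot \mathbf{d}_t}{\lVert \mathbf{f} \rVert \, \lVert \mathbf{d}_t \rVert}
  = \frac{\sum_{i=1}^{n} f_i \, d_{t,i}}
         {\sqrt{\sum_{i=1}^{n} f_i^{2}} \; \sqrt{\sum_{i=1}^{n} d_{t,i}^{2}}}
\]
```

When the vector components are non-negative (as with word-frequency weights), the value lies between 0 and 1, so it can be used directly as an association coefficient.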

Preferably, the obtaining of the correlation coefficient between each data table and the function point under the data analysis matching method by using the data analysis matching method includes:

determining a display text after the function point is operated based on the function point;

respectively determining the content in each data table based on each data table;

based on the display text after the operation of the function point and the content in each data table, sequentially constructing a vector corresponding to the display text after the operation of the function point and a vector corresponding to the content in each data table;

calculating cosine similarity of the vector corresponding to the content in each data table and the vector corresponding to the displayed text after the function point is operated in sequence;

and taking the cosine similarity as a correlation coefficient between each data table and the function point under a data analysis matching method.

Preferably, the obtaining of the correlation coefficient between each data table and the function point under the query tracking matching method by using the query tracking matching method includes:

acquiring a data table call record related to the operation of the function point;

determining a data table associated with the function point and a data table not associated based on a data table call record related to the function point operation;

and respectively assigning preset values to the data tables associated with the function point and to the data tables not associated with it, and using the preset values as the association coefficients of the data tables and the function point under the query tracking matching method.

Preferably, the determining the data table associated with the function point based on the association coefficient between each data table and the function point under each method includes:

calculating a comprehensive association value of each data table and the function point based on the association coefficient of each data table and the function point obtained under the semantic understanding matching, data analysis matching and query tracking matching methods and the preset association coefficient weight corresponding to each method;

and arranging the comprehensive association values of the data tables and the function point in descending order, and setting the top-ranked data tables as the data tables associated with the function point.
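
One plausible form of the comprehensive association value, assuming it is a normalized weighted sum, is sketched below. The symbols are assumptions introduced here for illustration: r_sem(t), r_data(t), and r_query(t) are the association coefficients of data table t under the three matching methods, and w_sem, w_data, and w_query are the corresponding preset weights. The text does not give the weight values.

```latex
\[
R(t) = \frac{w_{\mathrm{sem}} \, r_{\mathrm{sem}}(t)
           + w_{\mathrm{data}} \, r_{\mathrm{data}}(t)
           + w_{\mathrm{query}} \, r_{\mathrm{query}}(t)}
            {w_{\mathrm{sem}} + w_{\mathrm{data}} + w_{\mathrm{query}}}
\]
```

The data tables are then ranked by R(t) in descending order and the top-ranked ones are selected.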

Preferably, the data table associated with the function point is determined and then rechecked to obtain the data table finally associated with the function point.

Based on the same conception, the invention also provides a data tracing system, which comprises:

the data acquisition module is used for acquiring a function point list corresponding to a function point for generating data, a display text after the function point is operated, a data table call record related to the function point operation, and all data table lists and data tables;

the correlation coefficient calculation module is used for obtaining the correlation coefficient between each data table and the function point under each matching method by adopting a plurality of matching methods based on the function point list, the display text after the function point operation, the data table call record related to the function point operation, all the data table lists and all the data tables;

and the result output module is used for determining the data table related to the function point based on the correlation coefficient between each data table and the function point under each matching method.

Preferably, the result output module includes:

the comprehensive correlation value calculating unit is used for calculating the comprehensive correlation value of each data table and the function point based on the correlation coefficient of each data table and the function point obtained by the semantic understanding matching method, the data analysis matching method and the query tracking matching method and the preset correlation coefficient weight corresponding to each method;

and the screening unit is used for arranging the comprehensive association values of the data tables and the function point in descending order, and setting the top-ranked data tables as the data tables associated with the function point.

Compared with the closest prior art, the invention has the following beneficial effects:

The invention provides a data tracing method and system, comprising: acquiring a function point list corresponding to a function point that generates data, the display text after the function point is operated, the data table call records related to the function point operation, and all data table lists and data tables; and, based on these inputs, obtaining the association coefficient between each data table and the function point under each of a plurality of matching methods. The invention considers the matching relationship between function points and data tables in multiple dimensions, can accurately find the association between a function point and its corresponding data tables, automatically associates the front-end service functions of the system with the back-end database tables, improves data asset management efficiency, reduces manual workload, enhances software application development capability, and reduces later maintenance cost.

Drawings

FIG. 1 is a schematic diagram of a data tracing method according to the present invention;

FIG. 2 is a schematic diagram of a data tracing system according to the present invention;

FIG. 3 is a flowchart of data tracing provided in an embodiment of the present invention;

FIG. 4 is a program framework diagram of data tracing provided in an embodiment of the present invention.

Detailed Description

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

Example 1:

As shown in FIG. 1, a data tracing method according to an embodiment of the present invention includes:

S1, acquiring a function point list corresponding to the function point generating data, a display text after the function point is operated, a data table call record related to the function point operation, and all data table lists and data tables;

S2, based on the function point list, the display text after the function point operation, the data table call records related to the function point operation, and all the data table lists and data tables, obtaining the association coefficient between each data table and the function point under each matching method by adopting a plurality of matching methods;

S3, determining the data table associated with the function point based on the association coefficient between each data table and the function point under each matching method.

Specifically, the data tracing flowchart is shown in FIG. 3 and includes three parts: input, operation processing, and output.

The input part is used for the user to import data. Data such as the function point list and the data table list can be imported by manual entry, by importing an Excel spreadsheet, or in similar forms, and after import they are stored in the server database for unified management. When the stored database data is later used, it can be evaluated and searched through the underlying call path, or through data similarity; where other information exists (database table information provided by an administrator, descriptions of each function, and the like), that information can be used to assist the judgment.

The operation processing part uses the acquired function point list, the display text after the function point operation, the data table call records related to the function point operation, and all the data table lists and data tables to obtain the association coefficient between each data table and the function point under each of a plurality of matching methods, specifically through the following three steps, which run concurrently:

S2-1 semantic understanding matching: semantic understanding and annotation are performed on the contents of the function point list and the data table lists using a natural language processing tool, and text similarity is then measured between the annotated function point list and each annotated data table list to obtain the association coefficient between each data table and the function point;

Text similarity measurement treats a text as a collection of words, analyzes the number of times each word appears in the text and the number of times it appears in the whole text set, models the text as a vector using this word frequency information, and calculates the similarity between texts using the cosine distance between the vectors, neural network sentence embedding methods, and the like. Text similarity measures are widely used in many fields, for example information retrieval, text classification, automatic text summarization, and duplicate detection. In the existing TF-IDF (Term Frequency-Inverse Document Frequency) method, a text is modeled as a term frequency vector, and cosine similarity is then used to calculate the similarity between two texts. The specific steps are as follows:

S2-1-1, determining a function point list corresponding to the function point based on the function point;

S2-1-2, respectively determining a data table list corresponding to each data table based on each data table;

S2-1-3, respectively carrying out semantic understanding and annotation on the texts in the function point list and each data table list based on a natural language processing method;

S2-1-4, based on the annotated function point list and each annotated data table list, sequentially constructing a vector corresponding to the function point list and a vector corresponding to each data table list;

S2-1-5, calculating in sequence the cosine similarity between the vector corresponding to each data table list and the vector corresponding to the function point list;

S2-1-6, using the cosine similarity as the association coefficient between each data table and the function point under the semantic understanding matching method.
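
The semantic understanding matching steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: whitespace tokenization stands in for the natural language annotation step, the IDF term uses a common smoothed form, and the function names (`tokenize`, `tfidf_vectors`, `semantic_coefficients`) and the sample table names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization replaces the NLP annotation step.
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc_tokens in tokenized for term in set(doc_tokens))
    vectors = []
    for doc_tokens in tokenized:
        counts = Counter(doc_tokens)
        total = len(doc_tokens) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_coefficients(function_point_text, table_descriptions):
    """Map each table name to its cosine similarity with the function point text."""
    names = list(table_descriptions)
    vectors = tfidf_vectors([function_point_text] + [table_descriptions[n] for n in names])
    fp_vector = vectors[0]
    return {name: cosine_similarity(fp_vector, vec) for name, vec in zip(names, vectors[1:])}


# Hypothetical function point text and data table descriptions.
coefficients = semantic_coefficients(
    "monthly electricity billing summary",
    {
        "billing_record": "electricity billing amount per customer per month monthly",
        "device_inventory": "transformer device serial number location",
    },
)
```

In this sketch a table whose description shares no words with the function point text receives a coefficient of 0, which is one reason the method combines this score with the other two matching methods rather than relying on it alone.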

S2-2 data analysis matching: the display text after the function point is operated is compared with the specific content of each data table, and the degree of data correlation is mined using an algorithm based on vector space matching to obtain the association coefficient between each data table and the function point. The specific steps are as follows:

S2-2-1, determining the display text after the function point is operated based on the function point;

S2-2-2, respectively determining the content in each data table based on each data table;

S2-2-3, based on the display text after the operation of the function point and the content in each data table, sequentially constructing a vector corresponding to the display text and a vector corresponding to the content in each data table;

S2-2-4, calculating in sequence the cosine similarity between the vector corresponding to the content in each data table and the vector corresponding to the display text after the function point is operated;

S2-2-5, using the cosine similarity as the association coefficient between each data table and the function point under the data analysis matching method.
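
A minimal sketch of data analysis matching is shown below, assuming that each data table is available as a list of rows of string cells and that raw token counts are an acceptable vector representation. The text does not specify how the vectors are built, so the function names, the tokenization rule, and the sample data are all hypothetical.

```python
import math
import re
from collections import Counter


def count_vector(text):
    # Assumption: tokens are runs of word characters; counts form the vector.
    return Counter(re.findall(r"\w+", text.lower()))


def table_to_text(rows):
    # Flatten every cell of every row into one string.
    return " ".join(str(cell) for row in rows for cell in row)


def cosine(u, v):
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def data_analysis_coefficients(display_text, tables):
    """Map each table name to the cosine similarity between its content and the display text."""
    shown = count_vector(display_text)
    return {name: cosine(shown, count_vector(table_to_text(rows))) for name, rows in tables.items()}


# Hypothetical display text and table contents.
display_text = "Customer Zhang San Usage 320 kWh Amount 128.50"
tables = {
    "customer_bill": [("Zhang San", "320", "128.50"), ("Li Si", "210", "84.00")],
    "substation": [("North Station", "110kV"), ("East Station", "35kV")],
}
result = data_analysis_coefficients(display_text, tables)
```

Because displayed values are often reformatted before they reach the screen, a practical version would likely need normalization of numbers and dates before counting; that step is omitted here.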

S2-3 query tracking matching: the user's operations on each function point are matched against the call records of specific data tables to find the data tables used when each function point is invoked, and thereby obtain the association coefficient between each data table and the function point. The specific steps are as follows:

S2-3-1, acquiring the data table call records related to the operation of the function point;

S2-3-2, determining the data tables associated with the function point and the data tables not associated with it, based on the data table call records related to the function point operation;

S2-3-3, setting the association coefficient to 1 for each data table associated with the function point and to 0 for each data table not associated with it, and using these values as the association coefficients between the data tables and the function point.
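
The query tracking steps can be sketched as follows, assuming the call records are SQL statements captured while the function point runs. The text does not state the record format, so the regular expression, the function names, and the sample statements are hypothetical, and the pattern is deliberately naive (it does not handle subqueries or quoted identifiers).

```python
import re

# Assumption: a table name follows FROM, JOIN, INTO, or UPDATE in each statement.
TABLE_PATTERN = re.compile(
    r"\b(?:from|join|into|update)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
)


def called_tables(call_records):
    """Collect the lower-cased names of all tables referenced in the call records."""
    found = set()
    for statement in call_records:
        found.update(name.lower() for name in TABLE_PATTERN.findall(statement))
    return found


def query_tracking_coefficients(call_records, all_tables):
    """Assign 1.0 to tables seen in the call records and 0.0 to all others."""
    used = called_tables(call_records)
    return {table: 1.0 if table.lower() in used else 0.0 for table in all_tables}


# Hypothetical call records for one function point.
records = [
    "SELECT name, amount FROM billing_record WHERE month = '2020-10'",
    "SELECT c.name FROM customer c JOIN billing_record b ON b.customer_id = c.id",
]
tracking = query_tracking_coefficients(
    records, ["billing_record", "customer", "device_inventory"]
)
```

The values 1 and 0 follow this embodiment; the summary of the invention only requires preset values, so other constants could be substituted.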

The output part computes a weighted average of the association coefficients between each data table and the function point obtained under the three matching methods, yielding a comprehensive association value for each data table and the function point. The comprehensive association values are sorted in descending order, and the N highest-ranked data tables are taken as the data tables associated with the function point. The final matching result is obtained after manual recheck and is stored in the system so that it can be conveniently exported.
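
The weighted averaging and top-N selection of the output part can be sketched as follows. The weights, the coefficient values, and the function names are illustrative assumptions; the text says only that the weights are preset and does not give their values.

```python
def composite_values(coefficients_by_method, weights):
    """Weighted average of per-method coefficients for every table."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    tables = set().union(*(coeffs.keys() for coeffs in coefficients_by_method.values()))
    return {
        table: sum(
            weights[method] * coefficients_by_method[method].get(table, 0.0)
            for method in weights
        ) / total_weight
        for table in tables
    }


def top_n_tables(composite, n):
    """Sort by composite value in descending order (ties broken by name) and keep n."""
    ranked = sorted(composite.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:n]]


# Hypothetical coefficients from the three matching methods.
coefficients_by_method = {
    "semantic": {"billing_record": 0.6, "customer": 0.3, "device_inventory": 0.0},
    "data_analysis": {"billing_record": 0.5, "customer": 0.4, "device_inventory": 0.1},
    "query_tracking": {"billing_record": 1.0, "customer": 1.0, "device_inventory": 0.0},
}
# Hypothetical preset weights.
weights = {"semantic": 0.3, "data_analysis": 0.3, "query_tracking": 0.4}

composite = composite_values(coefficients_by_method, weights)
candidates = top_n_tables(composite, 2)
```

The resulting candidate list is what would be handed to the manual recheck step; the sketch does not model that review.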

Fig. 4 shows a program architecture for implementing the above data tracing, which includes: a front-end I/O layer, an operation processing layer and a data management layer;

The front-end I/O layer is the user interaction interface and comprises: interface UI components, a message processing module, an exception handling module, and front-end and back-end interfaces;

the UI components acquire information by receiving user input in various forms, such as clicks and keyboard input;

the message processing module and the exception handling module respectively process the normal information and the exception information acquired by the UI components;

the front-end and back-end interfaces pass the processed information to the operation processing layer and the data management layer.

The operation processing layer comprises a semantic understanding module, a data mining module, a query tracking module, a cloud service module, a session control module, and a data interaction interface;

in the semantic understanding module, the named entity recognition part analyzes the meaning of each word, the word embedding part converts each word into a corresponding vector, and the semantic measurement component then performs the similarity calculation;

the data mining module uses the target information collection component to extract the result obtained after the user operates each function point, uses the resource information collection component to collect the specific content of each data table, and then mines the degree of data correlation in the data mining component using an algorithm based on vector space matching, so as to obtain the matching degree;

the query tracking module captures the user's input and click operations through the action tracking module, obtains the call records of specific data tables through the call tracking module, and matches the two in the tracking information mining component, so as to obtain the corresponding data tables;

the cloud service module, the session control module, and the data interaction interface enable the operation processing layer to store information, interact with the outside, and acquire the corresponding data content.

The data management layer comprises: the system comprises a data storage module, a data updating module, a backup and guarantee module and a data interaction interface;

the data management layer is used for storing data, the data storage module is used for storing the data, the data updating module is used for maintaining and updating the data, the backup and guarantee module is used for importing, exporting and regularly backing up the data, and the data interaction interface is used for interacting with the operation processing layer.

Data tracing is performed based on the three matching methods, and the data matching analysis takes into account the similarity between the displayed data and the database data, the database call records, manual knowledge, and other content. The association between function points and their corresponding data tables can thus be found accurately, even when the content of a data table is not directly displayed in the user window. The method automatically generates recommended associations between the front-end service functions of the system and the back-end database tables, raises the degree of automation and intelligence of manual inventory work, effectively improves inventory efficiency and results, and reduces manual workload.

Example 2:

the embodiment of the invention discloses a data tracing system, as shown in fig. 2, comprising:

The matching method comprises the following steps: semantic understanding matching, data analysis matching and query tracking matching methods.

Further, the correlation coefficient calculation module includes: the system comprises a semantic understanding matching calculation unit, a data analysis matching calculation unit and a query tracking matching calculation unit;

the semantic understanding matching calculation unit is used for obtaining the association coefficient of each data table and the function point under the semantic understanding matching method by adopting a semantic understanding matching method;

the data analysis matching calculation unit is used for obtaining the correlation coefficient between each data table and the function point under the data analysis matching method by adopting a data analysis matching method;

and the query tracking matching calculation unit is used for obtaining the association coefficient of each data table and the function point under the query tracking matching method by adopting a query tracking matching method.

Further, the semantic understanding matching calculation unit comprises:

a function point list determining subunit, configured to determine, based on the function point, a function point list corresponding to the function point;

the data table list determining subunit is used for determining a data table list corresponding to each data table respectively based on each data table;

the natural language processing subunit is used for respectively carrying out semantic comprehension annotation on the texts in the function point list and each data table list based on a natural language processing method;

the vector construction subunit 1 is configured to sequentially construct, based on the annotated function point list and each annotated data table list, a vector corresponding to the function point list and a vector corresponding to each data table list;

the cosine similarity calculation subunit 1 is used for calculating in sequence the cosine similarity between the vector corresponding to each data table list and the vector corresponding to the function point list;

and the correlation coefficient calculation subunit 1 is used for taking the cosine similarity as a correlation coefficient between each data table and the function point under a semantic understanding matching method.

Further, the data analysis matching calculation unit comprises:

a function point display text determining subunit, configured to determine, based on the function point, a display text after the function point operation;

the data table content determining subunit is used for determining the content in each data table respectively based on each data table;

a vector construction subunit 2, configured to sequentially construct, based on the display text after the operation of the function point and the content in each data table, a vector corresponding to the display text after the operation of the function point and a vector corresponding to the content in each data table;

the cosine similarity calculation subunit 2 is used for calculating in sequence the cosine similarity between the vector corresponding to the content in each data table and the vector corresponding to the display text after the function point is operated;

and the correlation coefficient calculating subunit 2 is configured to use the cosine similarity as a correlation coefficient between each data table and the function point under a data analysis matching method.

Further, the query tracking matching calculation unit includes:

the call record determining subunit is used for acquiring a data table call record related to the operation of the function point;

the calling result determining subunit is used for determining a data table associated with the function point and an unassociated data table based on the data table calling record related to the function point operation;

and the correlation coefficient calculation subunit 3 is used for respectively assigning preset values to the data tables associated with the function point and to the data tables not associated with it, and using the preset values as the correlation coefficients of the data tables and the function point under the query tracking matching method.

Further, the result output module includes:

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should be noted that the above-mentioned embodiments are only used for illustrating the technical solutions of the present application and not for limiting the scope of protection thereof, and although the present application is described in detail with reference to the above-mentioned embodiments, those skilled in the art should understand that after reading the present application, they can make various changes, modifications or equivalents to the specific embodiments of the application, but those changes, modifications or equivalents are within the scope of the claims of the application.

Claims

1. A data tracing method, characterized by comprising the following steps:

based on the function point list, the display text after the function point operation, the data table call record related to the function point operation, all the data table lists and the data tables, obtaining the association coefficient of each data table and the function point under each matching method by adopting a plurality of matching methods;

2. The method of claim 1, wherein the matching method comprises: semantic understanding matching, data analysis matching and query tracking matching methods.

3. The method according to claim 2, wherein obtaining the correlation coefficient between each data table and the function point under the semantic understanding matching method by using a semantic understanding matching method comprises:

based on the understood and annotated function point list and each data table list, sequentially constructing a vector corresponding to the function point list and a vector corresponding to each data table list;

4. The method of claim 2, wherein obtaining the correlation coefficient between each data table and the function point under the data analysis matching method by using the data analysis matching method comprises:

5. The method of claim 2, wherein the obtaining of the correlation coefficient between each data table and the function point under the query tracking matching method by using the query tracking matching method comprises:

and respectively assigning preset values to the data tables associated with the function point and to the data tables not associated with it, and using the preset values as the association coefficients of the data tables and the function point under the query tracking matching method.

6. The method of claim 2, wherein the determining the data table associated with the function point based on the association coefficient of each data table with the function point under each method comprises:

7. The method of claim 1, wherein determining the data table associated with the function point further comprises reviewing the data table associated with the function point to obtain the data table ultimately associated with the function point.

8. A system for data tracing, comprising:

9. The system of claim 8, wherein the matching method comprises: semantic understanding matching, data analysis matching and query tracking matching methods.

10. The system of claim 9, wherein the result output module comprises: