CN116450710B - Data analysis tracing method and system based on big data - Google Patents

Data analysis tracing method and system based on big data

Info

Publication number
CN116450710B
Authority
CN
China
Prior art keywords
data
data table
database
traced
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310708251.4A
Other languages
Chinese (zh)
Other versions
CN116450710A (en
Inventor
张蓉
黄礼成
邢文元
刘杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Halu Information Technology Co ltd
Original Assignee
Nanjing Halu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Halu Information Technology Co ltd
Priority to CN202310708251.4A
Publication of CN116450710A
Application granted
Publication of CN116450710B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data analysis tracing method and system based on big data, belonging to the technical field of big data tracing. The method comprises the following steps: acquiring the data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced; establishing a traceability database formed by the data tables in the original big database that are of the same type as the data table to be traced; screening the traceability database to select the data tables similar to the data table to be traced; and matching the data table to be traced against the screened data tables using a matching strategy, judging the real source of the data table to be traced, and completing the data authenticity verification and copyright authentication of the data table to be traced. By establishing the traceability database, screening it, and applying the matching strategy, the method reduces the amount of data to be matched and the matching time in a big database under the scenario of uncontrollable illegal database leakage, and improves the efficiency and accuracy of data tracing.

Description

Data analysis tracing method and system based on big data
Technical Field
The invention belongs to the technical field of big data tracing, and particularly relates to a data analysis tracing method and system based on big data.
Background
With the rapid development of the mobile internet, the volume of information has grown exponentially. This huge volume of data brings greater security risks, because the data is easy to tamper with and steal. Security problems such as illegal data leakage, data theft, and data loss occur frequently and seriously affect the information security of individuals, enterprises, and countries. To address these problems, many researchers have devoted themselves to the study of data tracing technology, which follows the trail of illegally leaked data to find its real source and thereby protects database copyright. Existing data tracing technology is applied to scenarios in which illegal database leakage is controllable, and there is little research on data tracing for scenarios in which the leakage is uncontrollable. Data tracing under the uncontrollable illegal leakage scenario is therefore needed and is of great significance to data security research.
For example, the Chinese patent with granted publication number CN109657110B discloses a data tracing method and a corresponding data tracing device. The method includes: adding unique identification information to each piece of source data and establishing an original data set; performing a target data operation on the original data set to obtain a target result set matching the operation, wherein each result record contains the identification information of the source data that matches it; and integrating the tuple number of the result record, the identification information contained in the result record, and the target data operation to obtain the tracing information corresponding to each result record, so that the data can be traced according to the tracing information. During tracing, the source and evolution of a result record are traced according to the data operation in the tracing information and the identification information of the source data, which improves the reliability and credibility of the analysis of result record sources and also improves data tracing efficiency.
As another example, the Chinese patent with publication number CN110674360B discloses a method and system for tracing data, including: obtaining file information in response to a file operation occurring on a target machine; screening the file information to obtain structured data and the unstructured data corresponding to it, wherein the structured data serves as a fixed key variable group and the unstructured data comprises a plurality of variable key variable groups; in response to the information of the fixed key variable group not existing in the association graph, uniquely identifying the file information corresponding to the fixed key variable group and storing it in the association graph; and, in response to the information of the fixed key variable group existing in the association graph, performing correlation verification between the variable key variable groups corresponding to the fixed key variable group and the variable key variable groups already in the association graph, and, if they are associated, assigning them to the unique identification of the file information corresponding to the variable key variable groups already in the association graph. That method is said to greatly improve working efficiency.
The above patents suffer from a large amount of data to be matched, low matching efficiency, and low matching accuracy, and they cannot solve the problem of data tracing under the scenario of uncontrollable illegal database leakage.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a data analysis tracing method and a data analysis tracing system based on big data.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A data analysis tracing method based on big data comprises the following specific steps:
Step S1: acquiring data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced;
Step S2: establishing a traceability database, the traceability database being formed by the data tables in the original big database that are of the same type as the data table to be traced;
Step S3: screening the traceability database to select the data tables similar to the data table to be traced;
Step S4: matching the traceability database, namely matching the data table to be traced against the screened data tables using a matching strategy, judging the real source of the data table to be traced, and completing the data authenticity verification and copyright authentication of the data table to be traced.
Specifically, the specific method of step S2 includes:
Step S201: given the original big database DB and the data table to be traced, clustering the original big database DB with a clustering algorithm, and dividing DB into K cluster spaces according to the compactness of the distribution of the objects in DB, the K cluster spaces being denoted as ...;
Step S202: extracting the attribute column features of the traceability data table, namely extracting feature values from each attribute column and forming a feature vector from the feature values;
Step S203: treating the original big database DB as the set of its n data tables, the n-th element being the n-th data table in the original big database; DB includes an r-th data table having a t-th attribute column and a j-th data table having an m-th attribute column; the distance between the feature vector of the t-th attribute column of the r-th data table and the feature vector of the m-th attribute column of the j-th data table is calculated using the cosine distance, which is applied to numeric and string data; special type data is identified using regular expressions according to a type judgment formula;
Step S204: selecting a cluster space with a small distance, and taking all data tables in that cluster space as the traceability database.
Specifically, the specific method of step S3 is as follows:
Step S301: preprocessing the data tables in the traceability database and converting them into a uniform format; extracting the attribute column feature vectors of the data tables in the traceability database and of the data table to be traced, and obtaining the feature matrix of the attribute columns of each data table;
Step S302: feeding the attribute column feature matrices into a trained convolutional neural network model, outputting the matching result between the data table to be traced and the data tables in the traceability database, and screening out the data tables in the traceability database that are similar to the data table to be traced.
Specifically, step S4 includes a matching strategy, whose specific steps are as follows:
Step S401: denoting the data tables in the traceability database that are similar to the data table to be traced as a set of q data tables, where q is the number of such similar data tables and the p-th element is the p-th data table similar to the data table to be traced; each similar data table is represented by its attribute values, an entry being the w-th value of the z-th attribute;
Step S402: calculating the specific gravity (proportion) of each attribute column, namely the specific gravity of the i-th attribute column of the j-th data table, from the values of the attribute columns of the data table; then calculating the entropy of the attribute columns in the table, which gives the entropy value of the i-th attribute column in the data table;
calculating the weight of each attribute column by the weight formula, where the value range is [0,1];
Step S403: comparing the attribute column weight values of the data table to be traced and the matched similar data table; selecting the attribute column with the largest weight value as the key and sorting in descending order; judging whether the attribute values of the largest-weight attribute column are equal in the two matched data tables; if they are equal, selecting the attribute column with the next largest weight value for sorting; and then comparing whether the data quantities of the attribute columns are equal;
Step S404: acquiring the data quantity of the attribute columns of the two matched data tables, where L represents the data quantity of the i-th attribute column in data table p;
Step S405: repeating steps S403 to S404 until the last attribute column has been compared, and judging whether the two matched data tables are similar data tables;
Step S406: matching the numeric data and the special type data of the same attribute column in the similar data tables, where D represents the similarity of the numeric data and the special type data of the same attribute column in the similar data tables; the similarity is computed with a comparison function applied to the corresponding values of the same attribute column, the function taking one value when the corresponding values are equal and another value when they are not equal;
Step S407: repeating step S406 until the last attribute column of the similar data tables has been compared, and comparing the similarity values of the similar data tables; when the similarity value satisfies the set condition, the data in the data table to be traced is determined to derive from the data in the similar data table.
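The value-by-value comparison of steps S406 and S407 can be sketched as follows. This is only an illustration under stated assumptions, not the patented formula, which is not reproduced in this text: the comparison function is assumed to return 1 for equal corresponding values and 0 otherwise, D is assumed to be the fraction of equal values, and the decision threshold is supplied by the caller. The names `column_similarity`, `table_similarity`, and `derives_from` are hypothetical.

```python
def column_similarity(traced_values, source_values):
    """Fraction of positions whose values are equal (assumed form of D)."""
    n = max(len(traced_values), len(source_values))
    if n == 0:
        return 0.0
    # Assumed comparison function: 1 when the corresponding values are equal, else 0.
    matches = sum(1 for x, y in zip(traced_values, source_values) if x == y)
    return matches / n


def table_similarity(traced, source):
    """Average column similarity over the attribute columns both tables share.

    Each table is a dict mapping an attribute column name to its list of values.
    """
    shared = [name for name in traced if name in source]
    if not shared:
        return 0.0
    return sum(column_similarity(traced[c], source[c]) for c in shared) / len(shared)


def derives_from(traced, source, threshold):
    """Decide the source; the threshold is a caller-chosen assumption."""
    return table_similarity(traced, source) >= threshold


traced_table = {"id": [1, 2], "mail": ["x", "y"]}
source_table = {"id": [1, 2], "mail": ["x", "z"]}
similarity = table_similarity(traced_table, source_table)
```

Averaging over columns is one possible way to combine the per-column values; a weighted combination using the attribute column weights would be an equally plausible reading.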
Specifically, the feature values of the attribute column include: maximum, minimum, mean, variance, standard deviation, median, and range.
Specifically, the special type data includes: date-time type data and mailbox type data.
Specifically, the neural network model is a trained neural network model combining BIGRU with an attention mechanism.
Specifically, the number of the similar data tables is 2 or more.
Specifically, a data analysis traceability system based on big data includes:
the traceability database establishing module, used for judging the type of the data table to be traced and selecting the data tables in the original big database that are of the same type as the data table to be traced to establish the traceability database, thereby narrowing the matching range of the data tables;
the traceability database screening module, used for screening out the data tables similar to the data table to be traced from the traceability database;
the traceability database matching module, used for matching the data tables screened by the traceability database screening module against the data table to be traced and finding the real source of the data in the data table to be traced;
and the data integration module, used for logically or physically integrating data of different sources, formats, and characteristics to provide comprehensive data sharing.
Specifically, an electronic device comprises a memory and a processor, wherein the memory stores a computer program and the processor implements the steps of the data analysis tracing method based on big data when executing the computer program.
Specifically, a computer readable storage medium has computer instructions stored thereon which, when executed, perform the steps of the data analysis tracing method based on big data.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention optimizes and improves the architecture, operation steps, and flow of existing big-data-based data analysis tracing systems; the system has a simple flow, low investment and operation costs, and low production and working costs, and it improves tracing efficiency over the original tracing system.
2. The invention provides a data analysis tracing method based on big data. A data table to be traced is established and its type is determined; the data tables in the original big database that are of the same type as the data table to be traced form a traceability database; the traceability database is screened to select the data tables similar to the data table to be traced; and the data table to be traced is matched against the screened data tables using a matching strategy to judge its real source and complete its data authenticity verification and copyright authentication. This reduces the amount of data to be matched and the matching time under the scenario of uncontrollable illegal database leakage, thereby improving the efficiency and accuracy of data tracing.
Drawings
FIG. 1 is a flow chart of the data analysis tracing method based on big data of the present invention;
FIG. 2 is a matching flow chart of the matching strategy of the present invention;
FIG. 3 is an architecture diagram of the data analysis traceability system based on big data of the present invention;
FIG. 4 is a diagram of the electronic device of the present invention.
Detailed Description
In order that the technical means, inventive features, objectives, and effects of the present invention may be easily understood, the invention is further described below in conjunction with the detailed description. It should be noted that, in the description of the present invention, terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention. Furthermore, the terms "a", "an", and "the" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Example 1
Referring to fig. 1 and 2, an embodiment of the present invention is provided: a data analysis tracing method based on big data comprises the following specific steps:
Step S1: acquiring data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced;
Step S2: establishing a traceability database, the traceability database being formed by the data tables in the original big database that are of the same type as the data table to be traced;
Step S3: screening the traceability database to select the data tables similar to the data table to be traced;
Step S4: matching the traceability database, namely matching the data table to be traced against the screened data tables using a matching strategy, judging the real source of the data table to be traced, and completing the data authenticity verification and copyright authentication of the data table to be traced.
The specific method of step S2 comprises the following steps:
Step S201: given the original big database DB and the data table to be traced, clustering the original big database DB with a clustering algorithm, and dividing DB into K cluster spaces according to the compactness of the distribution of the objects in DB, the K cluster spaces being denoted as ...;
Step S202: extracting the attribute column features of the traceability data table, the extracted feature values of each attribute column forming a feature vector;
The coefficient of variation of the attribute values is the ratio of the standard deviation $\sigma$ to the mean $\mu$ of the column, $CV = \sigma / \mu$, and is used to remove the influence of an excessively large gap between data values. The variation ratio is $V = 1 - f/N$, where f represents the frequency of the mode and N represents the total number of values; it reflects the dispersion of the data. The skewness Sk measures the direction and degree of deviation of the data distribution in the attribute column, and the kurtosis describes how steep the data distribution in the attribute column is.
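A minimal sketch of the attribute column feature extraction for a numeric column is given below. It assumes textbook definitions, since the exact formulas are not reproduced in this text: population variance, coefficient of variation as standard deviation over mean, variation ratio as one minus the share of the modal value, and moment-based skewness and excess kurtosis. The function name `column_features` and the dictionary keys are hypothetical.

```python
import math
import statistics
from collections import Counter


def column_features(values):
    """Return the statistical features of one non-empty numeric attribute column."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)  # population variance (assumption)
    std = math.sqrt(variance)
    # Coefficient of variation: removes the effect of very different value scales.
    cv = std / mean if mean != 0 else 0.0
    # Variation ratio: share of values that do not belong to the modal value.
    mode_frequency = Counter(values).most_common(1)[0][1]
    variation_ratio = 1 - mode_frequency / n
    if std > 0:
        skewness = sum((x - mean) ** 3 for x in values) / (n * std ** 3)
        kurtosis = sum((x - mean) ** 4 for x in values) / (n * std ** 4) - 3
    else:
        skewness = kurtosis = 0.0  # constant column: no shape information
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean,
        "variance": variance,
        "std": std,
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "cv": cv,
        "variation_ratio": variation_ratio,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


features = column_features([1, 2, 2, 3, 4])
feature_vector = list(features.values())  # fixed key order gives a stable vector
```

Keeping the features in a fixed key order means that the vectors of different attribute columns can be compared position by position.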
Step S203: treating the original big database DB as the set of its n data tables, the n-th element being the n-th data table in the original big database; DB includes an r-th data table having a t-th attribute column and a j-th data table having an m-th attribute column; the distance between the feature vector of the t-th attribute column of the r-th data table and the feature vector of the m-th attribute column of the j-th data table is calculated using the cosine distance, which is applied to numeric and string data; special type data is identified using regular expressions according to a type judgment formula;
the date regular expression is:
the mailbox regular expression is:
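The regular-expression type judgment can be illustrated with the sketch below. The two patterns are hypothetical stand-ins, because the patent's own date and mailbox expressions are not reproduced in this text; the accepted date layout (`YYYY-MM-DD` with an optional time), the fallback order, and the majority vote in `classify_column` are likewise assumptions.

```python
import re
from collections import Counter

# Hypothetical patterns, not the expressions from the patent.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}( \d{2}:\d{2}:\d{2})?$")
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def classify_value(value):
    """Label a single cell as datetime, mailbox, numeric, or string."""
    text = str(value).strip()
    if DATE_RE.match(text):
        return "datetime"
    if EMAIL_RE.match(text):
        return "mailbox"
    try:
        float(text)
        return "numeric"
    except ValueError:
        return "string"


def classify_column(values):
    """Label an attribute column by the most frequent cell type (assumed rule)."""
    counts = Counter(classify_value(v) for v in values)
    return counts.most_common(1)[0][0]


column_type = classify_column(["a.b@example.com", "c@example.org", "n/a"])
```

A majority vote tolerates a few dirty cells in an otherwise well-typed column; a stricter rule that requires every cell to match would also be reasonable.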
Step S204: selecting a cluster space with a small distance, and taking all data tables in that cluster space as the traceability database.
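The selection in steps S203 and S204 can be sketched as follows, assuming the clustering has already been done. The sketch takes cosine distance to be one minus cosine similarity and picks the cluster space whose tables have the smallest mean distance to the table to be traced; both choices are assumptions, as are the names `cosine_distance` and `select_traceability_database` and the sample data.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat a zero vector as maximally distant (assumption)
    return 1.0 - dot / (norm_u * norm_v)


def select_traceability_database(traced_vector, cluster_spaces):
    """Pick the cluster space nearest to the traced table.

    cluster_spaces maps a cluster id to a dict of {table name: feature vector}.
    Returns the chosen cluster id and the names of its tables.
    """
    def mean_distance(tables):
        if not tables:
            return float("inf")
        total = sum(cosine_distance(traced_vector, vec) for vec in tables.values())
        return total / len(tables)

    best = min(cluster_spaces, key=lambda cid: mean_distance(cluster_spaces[cid]))
    return best, list(cluster_spaces[best])


clusters = {
    "C1": {"orders": [1.0, 0.1, 2.1], "orders_bak": [2.0, 0.0, 4.0]},
    "C2": {"users": [0.0, 5.0, 0.1]},
}
chosen, table_names = select_traceability_database([1.0, 0.0, 2.0], clusters)
```

Only the tables in the chosen cluster space go on to screening, which is how the method narrows the matching range.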
The specific method of step S3 is as follows:
Step S301: preprocessing the data tables in the traceability database and converting them into a uniform format; extracting the attribute column feature vectors of the data tables in the traceability database and of the data table to be traced, and obtaining the feature matrix of the attribute columns of each data table;
Preprocessing: because the data tables differ in size and are very numerous, they cannot be input into the network directly; in preprocessing, the feature matrices of the data tables are processed to a consistent size, and if a data table lacks an attribute column, the feature values of that attribute column are set to 0.
Step S302: feeding the attribute column feature matrices into a trained convolutional neural network model, outputting the matching result between the data table to be traced and the data tables in the traceability database, and screening out the data tables in the traceability database that are similar to the data table to be traced.
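The size normalization in the preprocessing of step S301 might look like the sketch below, which zero-fills missing attribute columns as the text describes. Truncating tables that have more columns or features than the target size is an added assumption, and the function name `pad_feature_matrix` and its parameters are hypothetical.

```python
def pad_feature_matrix(column_features, n_columns, n_features):
    """Return an n_columns x n_features matrix for one data table.

    column_features is a list with one feature list per attribute column.
    Missing columns and missing features are filled with 0; extra ones are
    dropped (truncation is an assumption, not stated in the source).
    """
    matrix = []
    for i in range(n_columns):
        if i < len(column_features):
            row = list(column_features[i])[:n_features]
            row += [0.0] * (n_features - len(row))
        else:
            row = [0.0] * n_features  # the table has no such attribute column
        matrix.append(row)
    return matrix


padded = pad_feature_matrix([[1.0, 2.0], [3.0]], n_columns=3, n_features=2)
```

With every table mapped to the same shape, the matrices can be stacked into a batch for the trained model of step S302.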
Step S4 comprises a matching strategy, whose specific steps are as follows:
Step S401: denoting the data tables in the traceability database that are similar to the data table to be traced as a set of q data tables, where q is the number of such similar data tables and the p-th element is the p-th data table similar to the data table to be traced; each similar data table is represented by its attribute values, an entry being the w-th value of the z-th attribute;
Step S402: calculating the specific gravity (proportion) of each attribute column, namely the specific gravity of the i-th attribute column of the j-th data table, from the values of the attribute columns of the data table; then calculating the entropy of the attribute columns in the table, which gives the entropy value of the i-th attribute column in the data table;
calculating the weight of each attribute column by the weight formula, where the value range is [0,1];
Explanation: the larger the entropy value of an attribute column, the smaller its degree of discrimination; this means that the values in the attribute column are distributed more uniformly, the discriminating ability of the attribute column is weaker, and its importance is lower. Conversely, the smaller the entropy value of an attribute column, the larger its degree of discrimination; this means that the values in the attribute column are distributed less uniformly, the discriminating ability of the attribute column is stronger, and its importance is higher.
Step S403: comparing the attribute column weight values of the data table to be traced and the matched similar data table; selecting the attribute column with the largest weight value as the key and sorting in descending order; judging whether the attribute values of the largest-weight attribute column are equal in the two matched data tables; if they are equal, selecting the attribute column with the next largest weight value for sorting; and then comparing whether the data quantities of the attribute columns are equal;
Step S404: acquiring the data quantity of the attribute columns of the two matched data tables, where L represents the data quantity of the i-th attribute column in data table p;
Step S405: repeating steps S403 to S404 until the last attribute column has been compared, and judging whether the two matched data tables are similar data tables;
Step S406: matching the numeric data and the special type data of the same attribute column in the similar data tables, where D represents the similarity of the numeric data and the special type data of the same attribute column in the similar data tables; the similarity is computed with a comparison function applied to the corresponding values of the same attribute column, the function taking one value when the corresponding values are equal and another value when they are not equal;
Step S407: repeating step S406 until the last attribute column of the similar data tables has been compared, and comparing the similarity values of the similar data tables; when the similarity value satisfies the set condition, the data in the data table to be traced is determined to derive from the data in the similar data table.
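One possible reading of the weight-ordered comparison in steps S403 to S405 is sketched below. The steps leave the decision rule open, so the sketch makes its own assumptions: columns are visited in descending weight order, each column is sorted before its values are compared, the data quantity is the number of values in the column, and two tables pass as similar when every column has the same data quantity. The name `compare_in_weight_order` is hypothetical.

```python
def compare_in_weight_order(traced, candidate, weights):
    """Compare two tables column by column, heaviest attribute column first.

    traced and candidate map column names to lists of values; weights maps
    column names to attribute column weights.
    """
    order = sorted(weights, key=weights.get, reverse=True)
    report = []
    for name in order:
        a = sorted(traced.get(name, []), key=str)
        b = sorted(candidate.get(name, []), key=str)
        report.append({
            "column": name,
            "values_equal": a == b,            # S403: attribute values equal?
            "same_quantity": len(a) == len(b)  # S404: data quantity equal?
        })
    # S405 (assumed rule): similar when all data quantities agree.
    similar = all(entry["same_quantity"] for entry in report)
    return order, report, similar


traced_table = {"id": [3, 1, 2], "city": ["a", "b", "a"]}
candidate_table = {"id": [1, 2, 3], "city": ["a", "a", "c"]}
column_weights = {"id": 0.7, "city": 0.3}
order, report, similar = compare_in_weight_order(
    traced_table, candidate_table, column_weights
)
```

The per-column report keeps the equality result separate from the quantity result, so a stricter rule that also requires equal key values can be applied without changing the loop.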
The feature values of the attribute column include: maximum, minimum, mean, variance, standard deviation, median, and range.
The special type data includes: date-time type data and mailbox type data.
The neural network model is a trained neural network model of BIGRU combined with an attention mechanism.
The number of similar data tables is 2 or more.
Example 2
Referring to fig. 3, another embodiment of the present invention is provided: a big data based data analysis traceability system comprising:
the traceability database establishing module, used for judging the type of the data table to be traced and selecting the data tables in the original big database that are of the same type as the data table to be traced to establish the traceability database, thereby narrowing the matching range of the data tables;
the traceability database screening module, used for screening out the data tables similar to the data table to be traced from the traceability database;
the traceability database matching module, used for matching the data tables screened by the traceability database screening module against the data table to be traced and finding the real source of the data in the data table to be traced;
and the data integration module, used for logically or physically integrating data of different sources, formats, and characteristics to provide comprehensive data sharing.
The databases are: MySQL, Oracle, SQL Server, and domestic databases.
Example 3
Referring to fig. 4, in this embodiment, an electronic device is further provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the data analysis tracing method based on big data when executing the computer program.
Electronic device details: the memory may be a computer readable signal medium or a non-transitory computer readable storage medium or any combination of the two. The non-transitory computer readable memory may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples of the non-transitory computer readable memory may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable storage medium having stored thereon computer instructions which, when executed, perform the steps of a data analysis tracing method based on big data.
The foregoing has shown and described the basic principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention; various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined by the appended claims and equivalents thereof.

Claims (9)

1. The data analysis tracing method based on big data is characterized by comprising the following specific steps:
step S1: acquiring data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced;
step S2: establishing a traceability database, the traceability database being formed from the data tables in the original big database that are of the same type as the data table to be traced;
step S3: screening the traceability database, and screening a data table similar to the data table to be traced from the traceability database;
step S4: matching the traceability database, matching the data table to be traced with the screened data table by utilizing a matching strategy, judging the real source of the data table to be traced, and finishing the data authenticity verification and copyright authentication of the data table to be traced;
the specific method of the step S2 comprises the following steps:
step S201: denoting the original big database as DB, clustering DB with a clustering algorithm according to the compactness of the distribution of the objects in DB, and dividing DB into K cluster spaces [symbols for the data table to be traced and for the K cluster spaces not reproduced];
step S202: extracting attribute column features of the data table to be traced, and forming feature vectors from the feature values extracted from the attribute columns;
step S203: the original big database DB is a set of data tables [set notation not reproduced], whose n-th element is the n-th data table in the original big database; DB includes the r-th data table, with its t-th attribute column, and the j-th data table, with its m-th attribute column; the distance between the feature vectors of attribute columns is calculated using the cosine distance, which for numerical and character-string data is computed as [formula not reproduced], where the two arguments are the feature vector of the t-th attribute column of the r-th data table and the feature vector of the m-th attribute column of the j-th data table; special-type data is identified using a regular expression, with the type judgment computed as [formula not reproduced];
step S204: selecting the cluster space with a small distance, and taking all data tables in that cluster space as the traceability database;
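Steps S202 and S203 can be illustrated with a minimal sketch. It assumes numeric columns, uses the seven statistics listed in claim 2 as the feature vector, and takes the cosine distance to be one minus the cosine similarity; the exact formula of the claim is not legible in this text, so that form, the population variance, and the function names `column_features` and `cosine_distance` are assumptions made for illustration only.

```python
import math
import statistics


def column_features(values):
    """Summarise a numeric attribute column with the seven statistics of claim 2."""
    lo, hi = min(values), max(values)
    return [
        hi,                            # maximum
        lo,                            # minimum
        statistics.fmean(values),      # mean
        statistics.pvariance(values),  # variance (population form is an assumption)
        statistics.pstdev(values),     # standard deviation
        statistics.median(values),     # median
        hi - lo,                       # range
    ]


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical columns: one from the table to be traced, two from DB.
traced = column_features([10, 12, 11, 13])
candidate = column_features([10, 12, 11, 13])
other = column_features([1000, 5, 70, 3])
near = cosine_distance(traced, candidate)
far = cosine_distance(traced, other)
```

Under these assumptions, a column with the same value distribution yields a distance close to zero, while a column with a very different spread yields a larger distance, which is the property step S204 relies on when picking a cluster space.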
the specific method of the step S3 is as follows:
step S301: preprocessing the data tables in the traceability database, converting them into a uniform format, extracting the feature vectors of the attribute columns of the data tables in the traceability database and of the data table to be traced, and obtaining the feature matrix of the data table attribute columns;
step S302: putting the feature matrix of the attribute column of the data table into a trained convolutional neural network model, outputting a matching result of the data table to be traced and the data table in the tracing database, and screening out the data table similar to the data table to be traced in the tracing database;
the step S4 includes a matching policy, where the specific steps of the matching policy are as follows:
step S401: denoting the data tables in the traceability database that are similar to the data table to be traced as a set [set notation not reproduced], wherein q represents the number of data tables in the traceability database that are similar to the data table to be traced, and the p-th element of the set represents the p-th such similar data table; each similar data table is written in terms of its attribute values [formula not reproduced], wherein the element indexed by z and w represents the w-th value of the z-th attribute;
step S402: calculating the specific gravity of the attribute columns, wherein the calculation formula is [formula not reproduced], giving the specific gravity of the i-th attribute column of the j-th data table from the values of the attribute columns of the data table; calculating the entropy of the attribute columns in the table, wherein the calculation formula is [formula not reproduced], giving the entropy value of the i-th attribute column in the data table;
calculating the weight of the attribute columns, wherein the calculation formula is [formula not reproduced], and the value range of the weight is [0,1] [further condition not reproduced];
step S403: comparing the attribute column weight values of the data table to be traced and of the matched data table; selecting the attribute column with the largest weight value as the key and sorting by it from large to small; judging whether the attribute values of the attribute column with the largest weight value are equal in the two matched data tables; if they are equal, selecting the attribute column with the next-largest weight value for sorting, and then comparing whether the data quantities of the attribute columns are equal;
step S404: acquiring the data quantity of the attribute columns of the two matched data tables, wherein the acquisition formula is [formula not reproduced], L represents the data quantity of the i-th attribute column in the data table p, and the argument of the formula represents the i-th attribute column in the data table p;
step S405: repeating the steps S403-S404 until the last attribute column has been compared, and judging whether the two matched data tables are similar data tables;
step S406: matching the numerical data and the special type data of the same attribute columns in the similar data tables, wherein the calculation formula is [formula not reproduced], D represents the similarity of the numerical data and the special type data of the same attribute column in the similar data tables, and the comparison function is applied to the corresponding values of the same attribute column in the similar data tables, taking one value when those values are equal and another value when they are not equal [values not reproduced];
step S407: repeating step S406 until the last attribute column of the similar data tables has been compared, and comparing the similarity values of the similar data tables; when [condition not reproduced] is satisfied, the data in the data table to be traced is derived from the data in the similar data table.
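The weighting in step S402 reads like the standard entropy weight method, although the claim's formulas are not legible in this text. The sketch below is therefore written under that assumption: each value's proportion within its column is used as the "specific gravity", the Shannon entropy is normalised by ln n, and the weight is the normalised complement of the entropy. The function name `entropy_weights` and the sample columns are hypothetical.

```python
import math


def entropy_weights(columns):
    """columns: dict of column name -> list of non-negative numbers (length n >= 2)."""
    divergence = {}
    for name, values in columns.items():
        n = len(values)
        total = sum(values)
        # Proportion ("specific gravity") of each value within its column.
        props = [v / total for v in values] if total > 0 else [1.0 / n] * n
        # Shannon entropy normalised to [0, 1]; terms with p == 0 contribute 0.
        entropy = -sum(p * math.log(p) for p in props if p > 0) / math.log(n)
        # Clamp to guard against tiny negative values from rounding.
        divergence[name] = max(0.0, 1.0 - entropy)
    total_div = sum(divergence.values())
    if total_div == 0:
        # All columns are equally uninformative: fall back to equal weights.
        return {name: 1.0 / len(columns) for name in columns}
    return {name: d / total_div for name, d in divergence.items()}


weights = entropy_weights({"flat": [5, 5, 5, 5], "skewed": [97, 1, 1, 1]})
# Step S403 then walks the columns from the largest weight to the smallest.
ranked = sorted(weights, key=weights.get, reverse=True)
```

With this formulation the weights lie in [0, 1] and sum to 1, which is consistent with the value range stated in the claim; a column whose values are evenly spread receives a weight near zero and is compared last.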
2. The data analysis tracing method based on big data according to claim 1, wherein the feature values of the attribute columns comprise: maximum, minimum, mean, variance, standard deviation, median, and range.
3. The data analysis tracing method based on big data according to claim 2, wherein said special type data comprises: date-time type data and mailbox type data.
4. The data analysis tracing method based on big data according to claim 3, wherein said neural network model is a trained BiGRU neural network model combined with an attention mechanism.
5. The data analysis tracing method based on big data as claimed in claim 4, wherein the number of the similar data tables is greater than or equal to 2.
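Claim 3 names date-time data and mailbox data as the special types that step S203 identifies with regular expressions. The patterns themselves are not given, so the following is only a sketch: the two expressions, the 0.9 share threshold and the function name `column_type` are assumptions, and real data would likely need broader date formats.

```python
import re

# Hypothetical patterns: ISO-like dates with an optional time, and simple e-mail addresses.
DATETIME_RE = re.compile(r"\d{4}-\d{2}-\d{2}(?:[ T]\d{2}:\d{2}(?::\d{2})?)?")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def column_type(values, threshold=0.9):
    """Label a column of strings as 'datetime', 'email' or 'other'."""
    if not values:
        return "other"
    for label, pattern in (("datetime", DATETIME_RE), ("email", EMAIL_RE)):
        # Count the values that match the pattern in full.
        hits = sum(1 for v in values if pattern.fullmatch(v.strip()))
        if hits / len(values) >= threshold:
            return label
    return "other"


date_label = column_type(["2023-06-15", "2023-07-18 10:30:00"])
mail_label = column_type(["a.b@example.com", "x@y.org"])
text_label = column_type(["hello", "42"])
```

Deciding the type from the share of matching values, rather than requiring every value to match, is a design choice of this sketch that tolerates a few malformed entries in a column.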
6. A big data based data analysis traceability system, comprising:
the traceability database establishing module is used for judging the type of the data table to be traced, and selecting the data tables in the original big database that are of the same type as the data table to be traced to establish the traceability database;
the tracing database screening module is used for screening out a data table similar to the data table to be traced from the tracing database;
the tracing database matching module is used for matching the data table screened by the tracing database screening module with the data table to be traced, and finding out the real source of the data in the data table to be traced;
the data integration module is used for organically gathering data with different sources, formats and characteristic properties logically or physically;
the matching strategy comprises the following specific steps:
step S401: denoting the data tables in the traceability database that are similar to the data table to be traced as a set [set notation not reproduced], wherein q represents the number of data tables in the traceability database that are similar to the data table to be traced, and the p-th element of the set represents the p-th such similar data table; each similar data table is written in terms of its attribute values [formula not reproduced], wherein the element indexed by z and w represents the w-th value of the z-th attribute;
step S402: calculating the specific gravity of the attribute columns, wherein the calculation formula is [formula not reproduced], giving the specific gravity of the i-th attribute column of the j-th data table from the values of the attribute columns of the data table; calculating the entropy of the attribute columns in the table, wherein the calculation formula is [formula not reproduced], giving the entropy value of the i-th attribute column in the data table;
calculating the weight of the attribute columns, wherein the calculation formula is [formula not reproduced], and the value range of the weight is [0,1] [further condition not reproduced];
step S403: comparing the attribute column weight values of the data table to be traced and of the matched data table; selecting the attribute column with the largest weight value as the key and sorting by it from large to small; judging whether the attribute values of the attribute column with the largest weight value are equal in the two matched data tables; if they are equal, selecting the attribute column with the next-largest weight value for sorting, and then comparing whether the data quantities of the attribute columns are equal;
step S404: acquiring the data quantity of the attribute columns of the two matched data tables, wherein the acquisition formula is [formula not reproduced], L represents the data quantity of the i-th attribute column in the data table p, and the argument of the formula represents the i-th attribute column in the data table p;
step S405: repeating the steps S403-S404 until the last attribute column has been compared, and judging whether the two matched data tables are similar data tables;
step S406: matching the numerical data and the special type data of the same attribute columns in the similar data tables, wherein the calculation formula is [formula not reproduced], D represents the similarity of the numerical data and the special type data of the same attribute column in the similar data tables, and the comparison function is applied to the corresponding values of the same attribute column in the similar data tables, taking one value when those values are equal and another value when they are not equal [values not reproduced];
step S407: repeating step S406 until the last attribute column of the similar data tables has been compared, and comparing the similarity values of the similar data tables; when [condition not reproduced] is satisfied, the data in the data table to be traced is derived from the data in the similar data table.
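Steps S404 to S407 compare column sizes, score value-by-value equality, and then apply a decision condition to the similarity D. Neither the formula for D nor the condition is legible in this text, so the sketch below makes its own assumptions: the comparison function is an indicator (1 when two corresponding values are equal, 0 otherwise), D for a column is the share of equal values, the table score is the mean over shared columns, and `THRESHOLD` is a hypothetical cut-off.

```python
def column_similarity(col_a, col_b):
    """Share of aligned positions with equal values; 0.0 if the sizes differ (cf. S404)."""
    if len(col_a) != len(col_b) or not col_a:
        return 0.0
    matches = sum(1 for a, b in zip(col_a, col_b) if a == b)
    return matches / len(col_a)


def table_similarity(table_a, table_b):
    """Mean per-column similarity over the attribute columns both tables share."""
    shared = [name for name in table_a if name in table_b]
    if not shared:
        return 0.0
    return sum(column_similarity(table_a[c], table_b[c]) for c in shared) / len(shared)


THRESHOLD = 0.9  # hypothetical; the condition in step S407 is not reproduced

# Hypothetical tables, already sorted on the highest-weight column.
traced = {"id": [1, 2, 3, 4], "mail": ["a@x.io", "b@x.io", "c@x.io", "d@x.io"]}
source = {"id": [1, 2, 3, 4], "mail": ["a@x.io", "b@x.io", "c@x.io", "z@x.io"]}
score = table_similarity(traced, source)
derived = score >= THRESHOLD
```

In this example one of eight compared values differs, so the score falls just below the assumed cut-off and the candidate is not reported as the source; a different threshold or aggregation would change that outcome.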
7. The big data based data analysis tracing system of claim 6, wherein said database is: MySQL, Oracle, SQL Server and domestic databases.
8. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the data analysis tracing method based on big data of any one of claims 1-5.
9. A computer readable storage medium having stored thereon computer instructions which, when executed, perform the steps of the data analysis tracing method based on big data of any one of claims 1-5.
CN202310708251.4A 2023-06-15 2023-06-15 Data analysis tracing method and system based on big data Active CN116450710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310708251.4A CN116450710B (en) 2023-06-15 2023-06-15 Data analysis tracing method and system based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310708251.4A CN116450710B (en) 2023-06-15 2023-06-15 Data analysis tracing method and system based on big data

Publications (2)

Publication Number Publication Date
CN116450710A CN116450710A (en) 2023-07-18
CN116450710B true CN116450710B (en) 2023-09-26

Family

ID=87128781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310708251.4A Active CN116450710B (en) 2023-06-15 2023-06-15 Data analysis tracing method and system based on big data

Country Status (1)

Country Link
CN (1) CN116450710B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056740A (en) * 2023-08-07 2023-11-14 北京东方金信科技股份有限公司 Method, system and readable medium for calculating table similarity in data asset management

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709024A (en) * 2016-12-28 2017-05-24 深圳市华傲数据技术有限公司 Data table source-tracing method and device based on consanguinity analysis
CN114138784A (en) * 2021-11-30 2022-03-04 中国平安财产保险股份有限公司 Information tracing method and device based on storage library, electronic equipment and medium
CN114547231A (en) * 2020-11-24 2022-05-27 国家电网有限公司大数据中心 Data tracing method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10268568B2 (en) * 2016-03-29 2019-04-23 Infosys Limited System and method for data element tracing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709024A (en) * 2016-12-28 2017-05-24 深圳市华傲数据技术有限公司 Data table source-tracing method and device based on consanguinity analysis
CN114547231A (en) * 2020-11-24 2022-05-27 国家电网有限公司大数据中心 Data tracing method and system
CN114138784A (en) * 2021-11-30 2022-03-04 中国平安财产保险股份有限公司 Information tracing method and device based on storage library, electronic equipment and medium

Also Published As

Publication number Publication date
CN116450710A (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN107392121B (en) Self-adaptive equipment identification method and system based on fingerprint identification
CN111291070B (en) Abnormal SQL detection method, equipment and medium
CN116450710B (en) Data analysis tracing method and system based on big data
Yue et al. Hashing based fast palmprint identification for large-scale databases
CN110134719B (en) Identification and classification method for sensitive attribute of structured data
CN110851461A (en) Method and device for auditing non-relational database and storage medium
CN107220325A (en) A kind of similar icon search methods of APP based on convolutional neural networks and system
CN111275599B (en) Big data integration algorithm-based group rental house early warning method and device, storage medium and terminal
CN113743496A (en) K-anonymous data processing method and system based on cluster mapping
CN112069269B (en) Big data and multidimensional feature-based data tracing method and big data cloud server
CN117216109A (en) Data query method, device and storage medium for multi-type mixed data
CN112084293A (en) Data authentication system and data authentication method for public security field
CN116582309A (en) GAN-CNN-BiLSTM-based network intrusion detection method
CN112967759B (en) DNA material evidence identification STR typing comparison method based on memory stack technology
CN111507878B (en) Network crime suspects investigation method and system based on user portrait
CN111178455B (en) Image clustering method, system, device and medium
US20170024358A1 (en) Method of processing statistical data
CN112100670A (en) Big data based privacy data grading protection method
CN111813987B (en) Portrait comparison method based on police big data
CN117786732B (en) Intelligent institution data storage system based on big data information desensitization method
CN117113321B (en) Image searching method and system for searching face by face
CN112559823B (en) Data standardized data acquisition method
CN113656405B (en) Method and device for sharing on-chain radar map co-construction based on block chain
CN116702024B (en) Method, device, computer equipment and storage medium for identifying type of stream data
CN117633605B (en) Data security classification capability maturity assessment method, system and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant