CN116450710B - Data analysis tracing method and system based on big data - Google Patents
- Publication number
- CN116450710B CN202310708251.4A
- Authority
- CN
- China
- Prior art keywords
- data
- data table
- database
- traced
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a data analysis tracing method and system based on big data, belonging to the technical field of big data tracing. The method comprises the following steps: acquiring the data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced; establishing a traceability database formed from the data tables in the original big database that are of the same type as the data table to be traced; screening the traceability database for data tables similar to the data table to be traced; and matching the data table to be traced against the screened data tables using a matching strategy to determine the real source of the data table to be traced, completing its data authenticity verification and copyright authentication. By establishing the traceability database, screening it, and applying the matching strategy, the method reduces the amount of data to match and the matching time in a big database under the uncontrollable illegal leakage scenario, improving the efficiency and accuracy of data tracing.
Description
Technical Field
The invention belongs to the technical field of big data tracing, and particularly relates to a data analysis tracing method and system based on big data.
Background
With the rapid development of the mobile internet, the volume of information of all kinds has grown explosively. This huge data volume brings greater security risks, because data is easy to tamper with and steal. Security problems such as illegal data leakage, data theft, and data loss occur frequently and seriously affect the information security of individuals, enterprises, and countries. To address these problems, many researchers have devoted themselves to data tracing technology, which follows the trail of illegally leaked data, finds its real source, and thereby protects database copyright. Existing data tracing technology is applied mainly to the controllable illegal leakage scenario of a database, and there is little research on the uncontrollable illegal leakage scenario. Data tracing under the uncontrollable illegal leakage scenario is therefore needed and is of great significance to data security research.
For example, the Chinese patent with granted publication number CN109657110B discloses a data tracing method and a corresponding data tracing device. The method includes: adding unique identification information to each piece of source data and establishing an original data set; performing a target data operation on the original data set to obtain a matching target result set, in which each result record contains the identification information of its matching source data; and combining the tuple number of each result record, the identification information it contains, and the target data operation into tracing information for that record, so that data can be traced from the tracing information. During tracing, the source and evolution of a result record are traced from the data operation and the source-data identification information in the tracing information, which improves the reliability and credibility of the source analysis and also improves tracing efficiency.
As another example, the Chinese patent with publication number CN110674360B discloses a method and system for tracing data. The method includes: obtaining file information in response to a file operation on a target machine; screening the file information to obtain structured data and the corresponding unstructured data, where the structured data serves as a fixed key variable group and the unstructured data comprises several variable key variable groups; if the association graph contains no information for the fixed key variable group, uniquely identifying the corresponding file information and storing it in the association graph; and if the association graph already contains information for the fixed key variable group, checking the correlation between the variable key variable group corresponding to that fixed group and the variable key variable groups already in the association graph and, if they are associated, assigning it the unique identifier of the file information of the associated group already in the graph. This method is said to greatly improve working efficiency.
The above patents suffer from a large volume of data to match, low matching efficiency, and low matching accuracy, and they cannot solve the problem of data tracing under the uncontrollable illegal leakage scenario of a database.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a data analysis tracing method and a data analysis tracing system based on big data.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a data analysis tracing method based on big data comprises the following specific steps:
step S1: acquiring data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced;
step S2: a traceability database is established, and the traceability database is formed by the data tables of the same type as the data tables to be traced in the original large database;
step S3: screening the traceability database, and screening a data table similar to the data table to be traced from the traceability database;
step S4: and matching the traceability database, matching the data table to be traced with the screened data table by utilizing a matching strategy, judging the real source of the data table to be traced, and finishing the data authenticity verification and copyright authentication of the data table to be traced.
Specifically, the specific method of step S2 includes:
Step S201: denote the original big database as DB and the data table to be traced by its own symbol [symbol not preserved]. Cluster DB with a clustering algorithm: according to how compactly the objects in DB are distributed, divide DB into K cluster spaces [symbols not preserved];
Step S202: extract the attribute-column features of the data table to be traced: compute feature values for each attribute column and assemble them into a feature vector;
Step S203: let the original big database DB consist of n data tables, and consider two data tables in DB, the r-th and the j-th, together with the t-th attribute column of the r-th table and the m-th attribute column of the j-th table. The distance between the feature vectors of attribute columns is calculated with the cosine distance; for numeric and string data it is computed from the feature vector of the t-th attribute column of the r-th table and the feature vector of the m-th attribute column of the j-th table [formula not preserved]. Special-type data is identified with regular expressions [type-decision formula not preserved];
Step S204: select the cluster space at the smallest distance and take all data tables in that cluster space as the traceability database.
Specifically, the specific method in step S3 is as follows:
Step S301: preprocess the data tables in the traceability database into a uniform format, then extract the attribute-column feature vectors of those data tables and of the data table to be traced to obtain the attribute-column feature matrix of each data table;
step S302: and putting the feature matrix of the attribute column of the data table into a trained convolutional neural network model, outputting a matching result of the data table to be traced and the data table in the tracing database, and screening out the data table similar to the data table to be traced in the tracing database.
Specifically, the step S4 includes a matching policy, where the specific steps of the matching policy are as follows:
Step S401: denote the data tables in the traceability database that are similar to the data table to be traced as a set of q tables, where q is the number of such tables and the p-th element is the p-th similar data table. Each table is represented by its attribute values, with one entry for the w-th value of the z-th attribute [symbols not preserved];
Step S402: calculate the proportion (specific gravity) of each value within its attribute column, giving the proportion for the i-th attribute column of the j-th data table from the values of that column [formula not preserved]. Then calculate the entropy of each attribute column in the table, giving the entropy value of the i-th attribute column [formula not preserved];
calculate the weight of each attribute column from its entropy; each weight lies in [0, 1] [weight formula and accompanying constraint not preserved];
Step S403: compare the attribute-column weights of the two data tables being matched (the data table to be traced and a screened data table). Sort the attribute columns by weight from largest to smallest and take the column with the largest weight as the sort key. Check whether the two tables' values in that column are equal; if they are, sort by the column with the next-largest weight, and then compare whether the data amounts of the attribute columns are equal;
Step S404: obtain the data amount of each attribute column of the two data tables being matched, where L denotes the data amount of the i-th attribute column in data table p [formula not preserved];
Step S405: repeat steps S403-S404 until the last attribute column has been compared, and judge whether the two data tables being matched are similar data tables;
Step S406: match the numeric data and the special-type data of the same attribute column in the similar data tables and calculate their similarity D. The calculation uses a comparison function over corresponding values of the same attribute column, which takes one value when the corresponding values are equal and another when they are not [formula and the two function values not preserved];
Step S407: repeat step S406 until the last attribute column of the similar data tables has been compared, then compare the similarity values of the similar data tables. When D satisfies the threshold condition [condition not preserved], the data in the data table to be traced is judged to derive from the data in the similar data table.
Specifically, the feature values of the attribute column include: maximum, minimum, mean, variance, standard deviation, median, and range.
Specifically, the special type data includes: date-time type data and mailbox type data.
Specifically, the neural network model is a trained neural network model that combines a BiGRU with an attention mechanism.
Specifically, the number of the similar data tables is 2 or more.
Specifically, a data analysis traceability system based on big data includes:
the traceability database establishing module is used for judging the type of the data table to be traced, selecting the data table which is the same as the data table to be traced in the original large database to establish the traceability database, and reducing the matching range of the data table;
the tracing database screening module is used for screening out a data table similar to the data table to be traced from the tracing database;
the tracing database matching module is used for matching the data table screened by the tracing database screening module with the data table to be traced, and finding out the real source of the data in the data table to be traced;
and the data integration module is used for organically integrating data with different sources, formats and characteristics logically or physically to provide comprehensive data sharing.
The electronic equipment comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of a data analysis tracing method based on big data when executing the computer program.
Specifically, a computer readable storage medium has stored thereon computer instructions which, when executed, perform the steps of a data analysis tracing method based on big data.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention optimizes and improves the architecture, operating steps, and workflow of existing big-data-based data analysis tracing systems. The system has a simple workflow and low investment, operating, production, and working costs, and it improves tracing efficiency over the original tracing system.
2. The invention provides a data analysis tracing method based on big data. A data table to be traced is established and its type determined; the data tables in the original big database that are of the same type as the data table to be traced form a traceability database; the traceability database is screened for data tables similar to the data table to be traced; and a matching strategy matches the data table to be traced against the screened data tables to determine its real source, completing data authenticity verification and copyright authentication. In the uncontrollable illegal leakage scenario of a database, this reduces the amount of data to match and the matching time, improving the efficiency and accuracy of data tracing.
Drawings
FIG. 1 is a flow chart of a data analysis tracing method based on big data;
FIG. 2 is a matching flow chart of the matching strategy of the present invention;
FIG. 3 is a diagram of a data analysis traceability system architecture based on big data according to the present invention;
FIG. 4 is a diagram of an electronic device of the present invention.
Detailed Description
The following notes are given so that the technical means, creative features, objects, and effects of the present invention may be easily understood. In the description of the present invention, terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience and to simplify the description; they do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "a", "an", "the" and "the" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The invention is further described below in conjunction with the detailed description.
Example 1
Referring to fig. 1 and 2, an embodiment of the present invention is provided: a data analysis tracing method based on big data comprises the following specific steps:
step S1: acquiring data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced;
step S2: a traceability database is established, and the traceability database is formed by the data tables of the same type as the data tables to be traced in the original large database;
step S3: screening the traceability database, and screening a data table similar to the data table to be traced from the traceability database;
step S4: and matching the traceability database, matching the data table to be traced with the screened data table by utilizing a matching strategy, judging the real source of the data table to be traced, and finishing the data authenticity verification and copyright authentication of the data table to be traced.
The specific method of the step S2 comprises the following steps:
Step S201: denote the original big database as DB and the data table to be traced by its own symbol [symbol not preserved]. Cluster DB with a clustering algorithm: according to how compactly the objects in DB are distributed, divide DB into K cluster spaces [symbols not preserved];
Step S202: extract the attribute-column features of the data table to be traced: compute feature values for each attribute column and assemble them into a feature vector;
The coefficient of variation of the attribute values is computed as [formula not preserved]; it removes the influence of large differences in the magnitude of the data values. The variation ratio is computed as [formula not preserved], where f is the frequency of the mode and N is the total number of values; it reflects the dispersion of the data. The skewness Sk measures the direction and degree of asymmetry of the data distribution in the attribute column, and the kurtosis describes how sharply peaked that distribution is.
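The feature extraction of step S202 can be sketched in Python as follows. The formulas are not preserved in this text, so the sketch assumes the standard definitions: coefficient of variation as standard deviation divided by mean, variation ratio as 1 - f/N, and moment-based skewness and excess kurtosis. The function name `column_features` and the ordering of the vector are illustrative assumptions, not taken from the patent.

```python
import math
from collections import Counter
from statistics import mean, median, pvariance


def column_features(values):
    """Build a feature vector for one numeric attribute column (hypothetical helper)."""
    n = len(values)
    mu = mean(values)
    var = pvariance(values, mu)          # population variance
    std = math.sqrt(var)
    # Coefficient of variation (assumed form): dispersion relative to the mean.
    cv = std / mu if mu != 0 else 0.0
    # Variation ratio (assumed form): share of values that are not the mode.
    f = Counter(values).most_common(1)[0][1]
    variation_ratio = 1 - f / n
    # Moment-based skewness and excess kurtosis (assumed definitions).
    if std > 0:
        skew = sum((x - mu) ** 3 for x in values) / (n * std ** 3)
        kurt = sum((x - mu) ** 4 for x in values) / (n * std ** 4) - 3
    else:
        skew = kurt = 0.0
    return [max(values), min(values), mu, var, std, median(values),
            max(values) - min(values), cv, variation_ratio, skew, kurt]
```

The first seven entries are the feature values the description lists (maximum, minimum, mean, variance, standard deviation, median, range); the last four are the additional statistics described above.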
Step S203: let the original big database DB consist of n data tables, and consider two data tables in DB, the r-th and the j-th, together with the t-th attribute column of the r-th table and the m-th attribute column of the j-th table. The distance between the feature vectors of attribute columns is calculated with the cosine distance; for numeric and string data it is computed from the feature vector of the t-th attribute column of the r-th table and the feature vector of the m-th attribute column of the j-th table [formula not preserved]. Special-type data is identified with regular expressions [type-decision formula not preserved];
the date regular expression is: [expression not preserved];
the mailbox regular expression is: [expression not preserved];
Step S204: select the cluster space at the smallest distance and take all data tables in that cluster space as the traceability database.
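Steps S203-S204 can be illustrated with a short Python sketch. Because the patent's formula is not preserved here, the sketch assumes the common definition of cosine distance as one minus the cosine similarity, and it assumes that each cluster space is summarized by a single representative feature vector; both `cosine_distance` and `nearest_cluster` are hypothetical names.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def nearest_cluster(query_vec, cluster_vectors):
    """Index of the cluster whose representative vector is closest to query_vec."""
    return min(range(len(cluster_vectors)),
               key=lambda k: cosine_distance(query_vec, cluster_vectors[k]))
```

All data tables in the cluster returned by `nearest_cluster` would then form the traceability database, which is what shrinks the matching range before the more expensive screening step.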
The specific method of the step S3 is as follows:
Step S301: preprocess the data tables in the traceability database into a uniform format, then extract the attribute-column feature vectors of those data tables and of the data table to be traced to obtain the attribute-column feature matrix of each data table;
Preprocessing: because the data tables differ in size and are very numerous, they cannot be fed directly into the network. Preprocessing brings the feature matrices of the data tables to a uniform size; where a data table lacks an attribute column, the feature values of that column are set to 0.
Step S302: and putting the feature matrix of the attribute column of the data table into a trained convolutional neural network model, outputting a matching result of the data table to be traced and the data table in the tracing database, and screening out the data table similar to the data table to be traced in the tracing database.
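The size normalization described in the preprocessing note can be sketched as zero-padding. The function name `pad_feature_matrix`, the target dimensions, and the truncation of oversized inputs are assumptions for illustration; the source only states that missing attribute columns receive feature values of 0.

```python
def pad_feature_matrix(matrix, n_columns, n_features):
    """Return a n_columns x n_features matrix, zero-filling missing entries.

    `matrix` holds one feature vector per attribute column. Rows or entries
    beyond the target size are dropped (an assumption, not from the source).
    """
    padded = []
    for row in matrix[:n_columns]:
        row = list(row[:n_features])
        row += [0.0] * (n_features - len(row))   # fill missing feature values
        padded.append(row)
    while len(padded) < n_columns:               # fill missing attribute columns
        padded.append([0.0] * n_features)
    return padded
```

With every table mapped to the same shape, the feature matrices can be batched and passed to the trained model of step S302.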
The step S4 comprises a matching strategy, and the specific steps of the matching strategy are as follows:
Step S401: denote the data tables in the traceability database that are similar to the data table to be traced as a set of q tables, where q is the number of such tables and the p-th element is the p-th similar data table. Each table is represented by its attribute values, with one entry for the w-th value of the z-th attribute [symbols not preserved];
Step S402: calculate the proportion (specific gravity) of each value within its attribute column, giving the proportion for the i-th attribute column of the j-th data table from the values of that column [formula not preserved]. Then calculate the entropy of each attribute column in the table, giving the entropy value of the i-th attribute column [formula not preserved];
calculate the weight of each attribute column from its entropy; each weight lies in [0, 1] [weight formula and accompanying constraint not preserved];
Note: the larger the entropy of an attribute column, the lower its discrimination. A large entropy means the values in the column are more uniformly distributed, so the column distinguishes records less well and is less important. The smaller the entropy, the higher the discrimination: the values in the column are less uniformly distributed, so the column distinguishes records better and is more important.
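The weighting in step S402 reads like the widely used entropy weight method, so the sketch below assumes that method: proportions p = x / sum(x) within a column, normalized Shannon entropy e = -(1/ln n) * sum(p ln p), and weights proportional to 1 - e. The patent's exact formulas are not preserved in this text, and the function name `entropy_weights` is hypothetical. The input is assumed to be a list of rows of non-negative numbers.

```python
import math


def entropy_weights(rows):
    """Return one weight per column; weights are in [0, 1] and sum to 1."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    k = 1.0 / math.log(n_rows) if n_rows > 1 else 0.0
    diversities = []
    for i in range(n_cols):
        column = [row[i] for row in rows]
        total = sum(column)
        if total > 0:
            entropy = 0.0
            for x in column:
                p = x / total                 # proportion of this value in its column
                if p > 0:
                    entropy -= k * p * math.log(p)
        else:
            entropy = 1.0                     # all-zero column: treated as fully uniform
        diversities.append(max(0.0, 1.0 - entropy))
    total_diversity = sum(diversities)
    if total_diversity == 0:
        return [1.0 / n_cols] * n_cols        # every column uniform: equal weights
    return [d / total_diversity for d in diversities]
```

Consistent with the note above, a column whose values are all equal has entropy 1 and receives a weight near 0, while a column with unevenly distributed values receives a larger weight.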
Step S403: compare the attribute-column weights of the two data tables being matched (the data table to be traced and a screened data table). Sort the attribute columns by weight from largest to smallest and take the column with the largest weight as the sort key. Check whether the two tables' values in that column are equal; if they are, sort by the column with the next-largest weight, and then compare whether the data amounts of the attribute columns are equal;
Step S404: obtain the data amount of each attribute column of the two data tables being matched, where L denotes the data amount of the i-th attribute column in data table p [formula not preserved];
Step S405: repeat steps S403-S404 until the last attribute column has been compared, and judge whether the two data tables being matched are similar data tables;
Step S406: match the numeric data and the special-type data of the same attribute column in the similar data tables and calculate their similarity D. The calculation uses a comparison function over corresponding values of the same attribute column, which takes one value when the corresponding values are equal and another when they are not [formula and the two function values not preserved];
Step S407: repeat step S406 until the last attribute column of the similar data tables has been compared, then compare the similarity values of the similar data tables. When D satisfies the threshold condition [condition not preserved], the data in the data table to be traced is judged to derive from the data in the similar data table.
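A minimal sketch of the matching strategy in steps S403-S407 follows. It makes several assumptions that the preserved text does not confirm: the comparison function returns 1 for equal values and 0 otherwise, D is the fraction of equal cells after both tables are sorted by the weight-ordered key columns, and tables with different row counts score 0. The names `sort_rows` and `table_similarity` are hypothetical.

```python
def sort_rows(rows, column_order):
    """Sort rows by the given columns, highest-weight column first."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in column_order))


def table_similarity(rows_a, rows_b, column_order):
    """Fraction of equal cells after aligning both tables by the key columns."""
    if not rows_a or len(rows_a) != len(rows_b):
        return 0.0                            # data amounts differ: no match
    a = sort_rows(rows_a, column_order)
    b = sort_rows(rows_b, column_order)
    equal = sum(1 for ra, rb in zip(a, b)
                for c in column_order if ra[c] == rb[c])
    return equal / (len(a) * len(column_order))
```

For instance, `table_similarity([(2, "x"), (1, "y")], [(1, "y"), (2, "z")], [0, 1])` evaluates to 0.75, since three of the four aligned cells agree. A caller would compare this value against a chosen threshold to decide whether the table to be traced derives from the candidate.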
The feature values of the attribute column include: maximum, minimum, mean, variance, standard deviation, median, and range.
The special-type data include: date-time data and mailbox (e-mail address) data.
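Type detection with regular expressions can be sketched as follows. The patent's own date and mailbox expressions are not preserved in this text, so the two patterns below are simplified stand-ins chosen for illustration, and `column_type` is a hypothetical helper.

```python
import re

# Illustrative patterns only; the patent's actual expressions are not preserved.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}( \d{2}:\d{2}:\d{2})?$")
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def column_type(values):
    """Classify a column as 'datetime', 'email', 'numeric', or 'string'."""
    texts = [str(v) for v in values]
    if not texts:
        return "string"
    if all(DATE_RE.match(t) for t in texts):
        return "datetime"
    if all(EMAIL_RE.match(t) for t in texts):
        return "email"
    try:
        for t in texts:
            float(t)                          # raises ValueError for non-numeric text
        return "numeric"
    except ValueError:
        return "string"
```

A column is assigned a special type only when every value matches the pattern, which keeps a few date-like strings in a free-text column from changing its type.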
The neural network model is a trained neural network model that combines a BiGRU with an attention mechanism.
The number of similar data tables is 2 or more.
Example 2
Referring to fig. 3, another embodiment of the present invention is provided: a big data based data analysis traceability system comprising:
the traceability database establishing module is used for judging the type of the data table to be traced, selecting the data table which is the same as the data table to be traced in the original large database to establish the traceability database, and reducing the matching range of the data table;
the tracing database screening module is used for screening out a data table similar to the data table to be traced from the tracing database;
the tracing database matching module is used for matching the data table screened by the tracing database screening module with the data table to be traced, and finding out the real source of the data in the data table to be traced;
and the data integration module is used for organically integrating data with different sources, formats and characteristics logically or physically to provide comprehensive data sharing.
The database is one of the following: MySQL, Oracle, SQL Server and domestic databases.
Example 3
Referring to fig. 4, in this embodiment, an electronic device is further provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the data analysis tracing method based on big data when executing the computer program.
Electronic device details: the memory may be a computer readable signal medium or a non-transitory computer readable storage medium or any combination of the two. The non-transitory computer readable memory may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples of the non-transitory computer readable memory may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable storage medium having stored thereon computer instructions which, when executed, perform the steps of a data analysis tracing method based on big data.
The foregoing has shown and described the basic principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which merely illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (9)
1. The data analysis tracing method based on big data is characterized by comprising the following specific steps:
step S1: acquiring data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced;
step S2: a traceability database is established, and the traceability database is formed by the data tables of the same type as the data tables to be traced in the original large database;
step S3: screening the traceability database, and screening a data table similar to the data table to be traced from the traceability database;
step S4: matching the traceability database, matching the data table to be traced with the screened data table by utilizing a matching strategy, judging the real source of the data table to be traced, and finishing the data authenticity verification and copyright authentication of the data table to be traced;
the specific method of the step S2 comprises the following steps:
step S201: setting the original big database as DB and the data table to be traced as [symbol missing]; clustering the original big database DB by using a clustering algorithm, and, according to the compactness of the distribution of the objects in the original big database DB, dividing DB into K cluster spaces, denoted as [symbols missing];
step S202: extracting the features of the attribute columns of the data table to be traced, and forming a feature vector from the feature values extracted from each attribute column;
step S203: setting the original big database as [formula missing], wherein [symbol missing] is the n-th data table in the original big database; the original big database DB comprises a data table [symbol missing] and a data table [symbol missing], wherein [symbol missing] represents the t-th attribute column of the r-th data table in the database and [symbol missing] represents the m-th attribute column of the j-th data table in the database; the distance between the feature vectors of the attribute columns is calculated by using the cosine distance, and the cosine distance calculation formula for numerical data and character string data is as follows: [formula missing], wherein [symbol missing] represents the feature vector of the t-th attribute column of the r-th data table and [symbol missing] represents the feature vector of the m-th attribute column of the j-th data table; the special type data is identified by using a regular expression, and the type judgment calculation formula is as follows: [formula missing];
step S204: selecting the cluster space with a small distance, and taking all data tables in that cluster space as the traceability database;
the specific method of the step S3 comprises the following steps:
step S301: preprocessing the data tables in the traceability database [symbol missing] to convert them into a uniform format, and extracting the feature vectors of the attribute columns of the data tables in the traceability database and of the data table to be traced, to obtain the feature matrix of the attribute columns of the data tables;
step S302: putting the feature matrix of the attribute column of the data table into a trained convolutional neural network model, outputting a matching result of the data table to be traced and the data table in the tracing database, and screening out the data table similar to the data table to be traced in the tracing database;
the step S4 includes a matching policy, where the specific steps of the matching policy are as follows:
step S401: setting the data tables in the traceability database that are similar to the data table to be traced as [symbol missing], denoted as [formula missing], wherein q represents the number of data tables in the traceability database that are similar to the data table to be traced, and [symbol missing] represents the p-th such data table, [formula missing], wherein [symbol missing] represents the w-th value of the z-th attribute;
step S402: calculating the specific gravity of each attribute column, wherein the calculation formula is as follows: [formula missing], wherein [symbol missing] is the specific gravity of the i-th attribute column of the j-th data table and [symbol missing] represents the values of the attribute columns of the data table; calculating the entropy of the attribute columns in the [symbol missing] table, wherein the calculation formula is: [formula missing], wherein [symbol missing] represents the entropy value of the i-th attribute column in the data table;
calculating the weight of each attribute column, wherein the calculation formula is as follows: [formula missing], wherein the value range of [symbol missing] is [0, 1], and [formula missing];
step S403: comparing the attribute column weight values of the data table to be traced with those of the matched [symbol missing] table; selecting the attribute column with the largest weight value as the keyword and sorting by it in descending order; judging whether the attribute values of the largest-weight attribute columns of the two matched data tables are equal; if they are equal, selecting the attribute with the next largest weight for sorting, and then comparing whether the data amounts of the attribute columns are equal;
step S404: acquiring the data amounts of the attribute columns of the two matched data tables, wherein the acquisition formula is as follows: [formula missing], wherein L represents the data amount of the i-th attribute column in data table p, and [symbol missing] represents the i-th attribute column in data table p;
step S405: repeating steps S403-S404 until the last attribute column has been compared, and judging whether the two matched data tables are similar data tables;
step S406: matching the numerical data and the special type data of the same attribute column in the similar data tables, wherein the calculation formula is as follows: [formula missing], wherein d represents the similarity of the numerical data and the special type data of the same attribute column in the similar data tables, and [symbol missing] represents the comparison function applied to the corresponding values of the same attribute column in the similar data tables; when the values of the same attribute column are equal, [value missing]; when the values are not equal, [value missing]; [symbol missing] represents the corresponding values of the same attribute column in the similar data tables;
step S407: repeating step S406 until the last attribute column of the similar data tables has been compared, and comparing the similarity values of the similar data tables; when [condition missing], the data in the data table to be traced is derived from the data in the similar data table.
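Steps S203 and S204 of claim 1 compare attribute-column feature vectors by cosine distance and keep the cluster space at a small distance. The sketch below illustrates that idea under assumptions that are not in the text: cosine distance is taken as one minus cosine similarity, each cluster space is represented by a single centroid feature vector, and the closest cluster is the one selected. The names `cosine_distance` and `nearest_cluster` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, taken here as 1 - cosine similarity (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def nearest_cluster(query, centroids):
    """Index of the centroid closest to the query feature vector."""
    return min(
        range(len(centroids)),
        key=lambda k: cosine_distance(query, centroids[k]),
    )
```

All data tables belonging to the selected cluster would then form the traceability database that is screened in step S3.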
2. The method for tracing data analysis based on big data according to claim 1, wherein the feature values of the attribute column comprise: maximum, minimum, mean, variance, standard deviation, median, and range.
3. The data analysis tracing method based on big data according to claim 2, wherein said special type data comprises: date-time type data and mailbox type data.
4. A data analysis tracing method based on big data according to claim 3, wherein said neural network model is a trained BiGRU neural network model combined with an attention mechanism.
5. The data analysis tracing method based on big data as claimed in claim 4, wherein the number of the similar data tables is greater than or equal to 2.
6. A big data based data analysis traceability system, comprising:
the traceability database establishing module is used for judging the type of the data table to be traced, and selecting the data tables of the same type as the data table to be traced in the original large database to establish the traceability database;
the tracing database screening module is used for screening out a data table similar to the data table to be traced from the tracing database;
the tracing database matching module is used for matching the data table screened by the tracing database screening module with the data table to be traced, and finding out the real source of the data in the data table to be traced;
the data integration module is used for organically gathering data with different sources, formats and characteristic properties logically or physically;
the matching strategy comprises the following specific steps:
step S401: setting the data tables in the traceability database that are similar to the data table to be traced as [symbol missing], denoted as [formula missing], wherein q represents the number of data tables in the traceability database that are similar to the data table to be traced, and [symbol missing] represents the p-th such data table, [formula missing], wherein [symbol missing] represents the w-th value of the z-th attribute;
step S402: calculating the specific gravity of each attribute column, wherein the calculation formula is as follows: [formula missing], wherein [symbol missing] is the specific gravity of the i-th attribute column of the j-th data table and [symbol missing] represents the values of the attribute columns of the data table; calculating the entropy of the attribute columns in the [symbol missing] table, wherein the calculation formula is: [formula missing], wherein [symbol missing] represents the entropy value of the i-th attribute column in the data table;
calculating the weight of each attribute column, wherein the calculation formula is as follows: [formula missing], wherein the value range of [symbol missing] is [0, 1], and [formula missing];
step S403: comparing the attribute column weight values of the data table to be traced with those of the matched [symbol missing] table; selecting the attribute column with the largest weight value as the keyword and sorting by it in descending order; judging whether the attribute values of the largest-weight attribute columns of the two matched data tables are equal; if they are equal, selecting the attribute with the next largest weight for sorting, and then comparing whether the data amounts of the attribute columns are equal;
step S404: acquiring the data amounts of the attribute columns of the two matched data tables, wherein the acquisition formula is as follows: [formula missing], wherein L represents the data amount of the i-th attribute column in data table p, and [symbol missing] represents the i-th attribute column in data table p;
step S405: repeating steps S403-S404 until the last attribute column has been compared, and judging whether the two matched data tables are similar data tables;
step S406: matching the numerical data and the special type data of the same attribute column in the similar data tables, wherein the calculation formula is as follows: [formula missing], wherein d represents the similarity of the numerical data and the special type data of the same attribute column in the similar data tables, and [symbol missing] represents the comparison function applied to the corresponding values of the same attribute column in the similar data tables; when the values of the same attribute column are equal, [value missing]; when the values are not equal, [value missing]; [symbol missing] represents the corresponding values of the same attribute column in the similar data tables;
step S407: repeating step S406 until the last attribute column of the similar data tables has been compared, and comparing the similarity values of the similar data tables; when [condition missing], the data in the data table to be traced is derived from the data in the similar data table.
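Step S402 of the matching strategy derives a specific gravity, an entropy and a weight for each attribute column, but the formulas are not preserved in this text. The sketch below uses the conventional entropy weight method as a stand-in, which is an assumption: the specific gravity is each value divided by the column total, the entropy is normalised by the logarithm of the number of values, and the weights are the normalised values of one minus the entropy, so that they lie in [0, 1] and sum to 1. The function name `entropy_weights` is hypothetical.

```python
import math


def entropy_weights(columns):
    """Entropy-based weights for numeric columns.

    columns: dict mapping a column name to a list of non-negative numbers.
    Uses the conventional entropy weight method (assumed, not from the patent).
    """
    divergence = {}
    for name, values in columns.items():
        total = sum(values)
        n = len(values)
        if total <= 0 or n <= 1:
            entropy = 1.0  # an empty or all-zero column carries no information
        else:
            entropy = 0.0
            for x in values:
                p = x / total  # specific gravity of this value
                if p > 0:
                    entropy -= p * math.log(p)
            entropy /= math.log(n)
        divergence[name] = max(0.0, 1.0 - entropy)
    total_divergence = sum(divergence.values())
    if total_divergence == 0:
        return {name: 1.0 / len(columns) for name in columns}
    return {name: d / total_divergence for name, d in divergence.items()}
```

Sorting the column names by weight in descending order, for example with `sorted(weights, key=weights.get, reverse=True)`, gives the order in which step S403 would pick its sort keys.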
7. The big data based data analysis and tracing system of claim 6, wherein said database is one of: MySQL, Oracle, SQL Server and domestic databases.
8. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of a big data based data analysis tracing method of any one of claims 1-5.
9. A computer readable storage medium having stored thereon computer instructions which when executed perform the steps of a big data based data analysis tracing method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310708251.4A CN116450710B (en) | 2023-06-15 | 2023-06-15 | Data analysis tracing method and system based on big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116450710A (en) | 2023-07-18 |
CN116450710B (en) | 2023-09-26 |
Family
ID=87128781
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310708251.4A Active CN116450710B (en) | 2023-06-15 | 2023-06-15 | Data analysis tracing method and system based on big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116450710B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117056740A (en) * | 2023-08-07 | 2023-11-14 | 北京东方金信科技股份有限公司 | Method, system and readable medium for calculating table similarity in data asset management |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709024A (en) * | 2016-12-28 | 2017-05-24 | 深圳市华傲数据技术有限公司 | Data table source-tracing method and device based on consanguinity analysis |
CN114138784A (en) * | 2021-11-30 | 2022-03-04 | 中国平安财产保险股份有限公司 | Information tracing method and device based on storage library, electronic equipment and medium |
CN114547231A (en) * | 2020-11-24 | 2022-05-27 | 国家电网有限公司大数据中心 | Data tracing method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10268568B2 (en) * | 2016-03-29 | 2019-04-23 | Infosys Limited | System and method for data element tracing |
Also Published As
Publication number | Publication date |
---|---|
CN116450710A (en) | 2023-07-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||