CN116450710B - Data analysis tracing method and system based on big data

Info

Publication number: CN116450710B
Application number: CN202310708251.4A
Authority: CN
Inventors: 张蓉; 黄礼成; 邢文元; 刘杰
Original assignee: Nanjing Halu Information Technology Co ltd
Current assignee: Nanjing Halu Information Technology Co ltd
Priority date: 2023-06-15
Filing date: 2023-06-15
Publication date: 2023-09-26
Anticipated expiration: 2043-06-15
Also published as: CN116450710A

Abstract

The invention discloses a data analysis tracing method and system based on big data, belonging to the technical field of big data tracing. The method comprises the following steps: acquiring data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced; establishing a traceability database formed from the data tables in the original big database that are of the same type as the data table to be traced; screening the traceability database to select the data tables similar to the data table to be traced; and matching the data table to be traced against the screened data tables using a matching strategy, thereby judging the real source of the data table to be traced and completing its data authenticity verification and copyright authentication. By establishing the traceability database, screening it, and applying the matching strategy, the method reduces the amount of data to be matched and the matching time in a big database under the uncontrollable illegal leakage scenario, and improves the efficiency and accuracy of data tracing.

Description

Data analysis tracing method and system based on big data

Technical Field

The invention belongs to the technical field of big data tracing, and particularly relates to a data analysis tracing method and system based on big data.

Background

With the rapid development of the mobile internet, the volume of information has grown exponentially. This huge data volume brings greater security risks, since data are easy to tamper with and steal. Security problems such as illegal data leakage, data theft, and data loss occur frequently and seriously affect the information security of individuals, enterprises, and countries. To address these problems, many researchers have turned to data tracing technology, which follows the traces of illegally leaked data, finds the real source of the leaked data, and thereby protects the copyright of the database. Existing data tracing technology is applied mainly to the controllable illegal leakage scenario of a database, and there is little research on data tracing for the uncontrollable illegal leakage scenario. Data tracing under the uncontrollable illegal leakage scenario is therefore needed and is of great significance to data security research.

For example, the Chinese patent with granted publication number CN109657110B discloses a data tracing method and a corresponding data tracing device. The data tracing method includes: adding unique identification information to each piece of source data and establishing an original data set; performing a target data operation on the original data set to obtain a target result set matching the target data operation, wherein each result record contains the identification information of the source data matching that result record; and integrating the tuple number of the result record, the identification information contained in the result record, and the target data operation to obtain the tracing information corresponding to each result record, so that data can be traced according to the tracing information. In the data tracing process, the source and evolution of the result records are traced according to the data operation in the tracing information and the identification information of the source data, which improves the reliability and credibility of the analysis of the result record source and also effectively improves data tracing efficiency.

As another example, the Chinese patent with publication number CN110674360B discloses a method and system for tracing data, including: obtaining file information in response to a file operation occurring on a target machine; screening the file information to obtain structured data and unstructured data corresponding to the structured data, wherein the structured data serve as a fixed key variable group and the unstructured data comprise a plurality of variable key variable groups; in response to information of the fixed key variable group not existing in the association graph, uniquely identifying the file information corresponding to the fixed key variable group and storing it in the association graph; and in response to information of the fixed key variable group existing in the association graph, performing correlation verification between the variable key variable group corresponding to the fixed key variable group and the variable key variable groups already in the association graph, and, if they are associated, incorporating the variable key variable group into the unique identification of the file information corresponding to the variable key variable group already in the association graph. According to that patent, the method can greatly improve working efficiency.

The above patents suffer from a large volume of data to be matched, low matching efficiency, and low matching accuracy, and they cannot solve the problem of data tracing under the uncontrollable illegal leakage scenario of a database.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a data analysis tracing method and a data analysis tracing system based on big data.

In order to achieve the above purpose, the present invention provides the following technical solutions:

a data analysis tracing method based on big data comprises the following specific steps:

Step S1: acquiring data to be traced, establishing a data table to be traced, and determining the type of the data table to be traced;

Step S2: establishing a traceability database, the traceability database being formed from the data tables in the original big database that are of the same type as the data table to be traced;

Step S3: screening the traceability database to select the data tables similar to the data table to be traced;

Step S4: matching against the traceability database, that is, matching the data table to be traced with the screened data tables using a matching strategy, judging the real source of the data table to be traced, and completing the data authenticity verification and copyright authentication of the data table to be traced.

Specifically, the specific method of step S2 includes:

Step S201: given the original big database and the data table to be traced, clustering the original big database with a clustering algorithm according to the compactness of the distribution of the objects in it, thereby dividing the original big database into K clustering spaces;

Step S202: extracting attribute-column features of the data table to be traced, the feature values extracted from each attribute column forming a feature vector;

Step S203: letting the original big database DB consist of n data tables and include an r-th data table and a j-th data table; for the t-th attribute column of the r-th data table and the m-th attribute column of the j-th data table, calculating the distance between the attribute-column feature vectors using the cosine distance, the cosine distance formula for numerical and character-string data being: [formula missing in source]; special-type data are identified using regular expressions, the type-judgment formula being: [formula missing in source];

Step S204: selecting the clustering space with the smallest distance, and taking all data tables in that clustering space as the traceability database.
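Steps S203 and S204 can be illustrated with a minimal Python sketch. Because the cosine distance formula is not reproduced in this text, the sketch assumes the common definition of cosine distance as one minus cosine similarity, and it assumes that a clustering space is scored by its mean distance to the feature vector of the data table to be traced. The names `cosine_distance` and `nearest_cluster` are hypothetical and do not come from the patent.

```python
import math

def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def nearest_cluster(query_vec, clusters):
    """Return the name of the clustering space closest to query_vec.

    clusters maps a (hypothetical) cluster name to a list of
    attribute-column feature vectors of the tables in that cluster.
    """
    def mean_distance(vectors):
        if not vectors:
            return float("inf")  # an empty cluster can never be selected
        return sum(cosine_distance(query_vec, v) for v in vectors) / len(vectors)

    return min(clusters, key=lambda name: mean_distance(clusters[name]))
```

Under these assumptions, all data tables in the returned clustering space would then serve as the traceability database.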

Specifically, the specific method in step S3 is as follows:

Step S301: preprocessing the data tables in the traceability database and converting them into a uniform format; extracting the attribute-column feature vectors of the data tables in the traceability database and of the data table to be traced; and obtaining the feature matrix of the attribute columns of each data table;

step S302: and putting the feature matrix of the attribute column of the data table into a trained convolutional neural network model, outputting a matching result of the data table to be traced and the data table in the tracing database, and screening out the data table similar to the data table to be traced in the tracing database.

Specifically, the step S4 includes a matching policy, where the specific steps of the matching policy are as follows:

Step S401: letting the data tables in the traceability database that are similar to the data table to be traced form a set of q data tables, where q denotes the number of such similar data tables and the p-th member of the set is the p-th data table similar to the data table to be traced; each entry of a data table is the w-th value of the z-th attribute;

Step S402: calculating the proportion (specific gravity) of each attribute column, the calculation formula being: [formula missing in source], which gives the proportion of the i-th attribute column of the j-th data table from the values of the attribute columns of the data table; then calculating the entropy of each attribute column in the table, the calculation formula being: [formula missing in source], which gives the entropy value of the i-th attribute column in the data table;

calculating the weight of each attribute column, the calculation formula being: [formula missing in source], where the weight takes values in the range [0, 1];

Step S403: comparing the attribute-column weight values of the data table to be traced and of the matched similar data table; selecting the attribute column with the largest weight value as the keyword and ordering the attribute columns by weight from largest to smallest; judging whether the attribute values of the largest-weight attribute column are equal in the two matched data tables; if they are equal, selecting the attribute with the next-largest weight for sorting; and then comparing whether the data quantities of the attribute columns are equal;

Step S404: acquiring the data quantity of the attribute columns of the two matched data tables, the acquisition formula being: [formula missing in source], where L denotes the data quantity of the i-th attribute column in data table p;

Step S405: repeating steps S403 to S404 until the last attribute column has been compared, and judging whether the two matched data tables are similar data tables;

Step S406: matching the numerical data and the special-type data of the same attribute column in the similar data tables, the calculation formula being: [formula missing in source], where D denotes the similarity of the numerical data and special-type data of the same attribute column in the similar data tables; the formula uses a comparison function over the corresponding values of the same attribute column, which takes one value when the corresponding values are equal and another value when they are not equal [values missing in source];

Step S407: repeating step S406 until the last attribute column of the similar data tables has been compared, and comparing the similarity values of the similar data tables; when the similarity condition is met [condition missing in source], the data in the data table to be traced are judged to derive from the data in that similar data table.
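Steps S406 and S407 can be sketched in Python under explicit assumptions, since the similarity formula, the values of the comparison function, and the decision condition are not reproduced in this text. The sketch assumes that the comparison function returns 1 for equal values and 0 otherwise, that D is the fraction of matching values averaged over shared attribute columns, and that the source is accepted when D reaches a threshold. The function names and the default threshold of 0.9 are hypothetical.

```python
def column_similarity(col_a, col_b):
    """Fraction of positions whose values are equal (assumed 1/0 comparison)."""
    n = max(len(col_a), len(col_b))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(col_a, col_b) if a == b)
    return matches / n

def table_similarity(table_a, table_b):
    """Average column similarity over attribute columns present in both tables.

    Tables are dicts mapping an attribute-column name to a list of values,
    assumed to be already sorted as in steps S403 to S405.
    """
    shared = [name for name in table_a if name in table_b]
    if not shared:
        return 0.0
    total = sum(column_similarity(table_a[c], table_b[c]) for c in shared)
    return total / len(shared)

def trace_source(target, candidates, threshold=0.9):
    """Return the name of the most similar candidate, or None below threshold."""
    best_name, best_score = None, 0.0
    for name, table in candidates.items():
        score = table_similarity(target, table)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

In this sketch a result of `None` means no candidate table is similar enough to be treated as the source; the actual criterion used by the method may differ.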

Specifically, the feature values of the attribute column include: maximum, minimum, mean, variance, standard deviation, median, and range.
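The seven feature values listed above can be assembled into a feature vector for a numerical attribute column, as in the following minimal sketch. It uses only the Python standard library. The use of population variance and population standard deviation is an assumption, since the text does not say whether population or sample statistics are intended, and the name `column_features` is hypothetical.

```python
import statistics

def column_features(values):
    """Return [max, min, mean, variance, std, median, range] for a column.

    Assumes a non-empty column of numeric values; variance and standard
    deviation are population statistics (an assumption).
    """
    vals = [float(v) for v in values]
    return [
        max(vals),
        min(vals),
        statistics.fmean(vals),
        statistics.pvariance(vals),
        statistics.pstdev(vals),
        statistics.median(vals),
        max(vals) - min(vals),
    ]
```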

Specifically, the special type data includes: date-time type data and mailbox type data.

Specifically, the neural network model is a trained neural network model combining BiGRU with an attention mechanism.

Specifically, the number of the similar data tables is 2 or more.

Specifically, a data analysis traceability system based on big data includes:

the traceability database establishing module, used for judging the type of the data table to be traced and selecting the data tables in the original big database that are of the same type as the data table to be traced to establish the traceability database, thereby narrowing the matching range of the data tables;

the tracing database screening module is used for screening out a data table similar to the data table to be traced from the tracing database;

the tracing database matching module is used for matching the data table screened by the tracing database screening module with the data table to be traced, and finding out the real source of the data in the data table to be traced;

and the data integration module is used for organically integrating data with different sources, formats and characteristics logically or physically to provide comprehensive data sharing.
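The way the establishing, screening, and matching modules hand data to one another can be sketched as a small pipeline. This is only an illustrative arrangement under stated assumptions: the class name `TracingPipeline`, the table representation, and the three callable signatures are hypothetical, and the data integration module is left out of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Table = Dict[str, list]        # attribute-column name -> values (assumed layout)
Database = Dict[str, Table]    # table name -> table (assumed layout)

@dataclass
class TracingPipeline:
    """Chains the three modules; each module is supplied as a callable."""
    build: Callable[[Table, Database], Database]     # establishing module
    screen: Callable[[Table, Database], Database]    # screening module
    match: Callable[[Table, Database], Optional[str]]  # matching module

    def trace(self, target: Table, original_db: Database) -> Optional[str]:
        tracing_db = self.build(target, original_db)  # narrow to same-type tables
        similar = self.screen(target, tracing_db)     # keep only similar tables
        return self.match(target, similar)            # name of the judged source
```

Each stage receives a smaller candidate set than the one before it, which is the reduction in matching range that the module descriptions refer to.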

The electronic equipment comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of a data analysis tracing method based on big data when executing the computer program.

Specifically, a computer readable storage medium has stored thereon computer instructions which, when executed, perform the steps of a data analysis tracing method based on big data.

Compared with the prior art, the invention has the beneficial effects that:

1. the invention optimizes and improves the architecture, the operation steps and the flow of the existing data analysis tracing system based on big data, and the system has the advantages of simple flow, low investment and operation cost and low production and working cost, and improves the tracing efficiency on the basis of the original tracing system.

2. The invention provides a data analysis tracing method based on big data. The method establishes a data table to be traced and determines its type; forms a traceability database from the data tables in the original big database that are of the same type as the data table to be traced; screens the traceability database for data tables similar to the data table to be traced; and matches the data table to be traced against the screened data tables using a matching strategy, thereby judging the real source of the data table to be traced and completing its data authenticity verification and copyright authentication. This reduces the amount of data to be matched and the matching time under the uncontrollable illegal leakage scenario of a database, and thus improves the efficiency and accuracy of data tracing.

Drawings

FIG. 1 is a flow chart of a data analysis tracing method based on big data;

FIG. 2 is a matching flow chart of the matching strategy of the present invention;

FIG. 3 is a diagram of a data analysis traceability system architecture based on big data according to the present invention;

FIG. 4 is a diagram of an electronic device of the present invention.

Detailed Description

To make the technical means, inventive features, objectives, and effects of the present invention easy to understand, it should be noted that, in the description of the present invention, terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience and simplicity of description and do not indicate or imply that the apparatus or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The invention is further described below in conjunction with the detailed description.

Example 1

Referring to fig. 1 and 2, an embodiment of the present invention is provided: a data analysis tracing method based on big data comprises the following specific steps:

The specific method of the step S2 comprises the following steps:

Step S202: extracting attribute-column features of the data table to be traced, the feature values extracted from each attribute column forming a feature vector;

The coefficient of variation of the attribute values is calculated as: [formula missing in source]; the coefficient of variation removes the influence of excessively large gaps between data values. The variation ratio is calculated as: [formula missing in source], where f denotes the frequency of the mode and N denotes the total number of values; the variation ratio reflects the dispersion of the data. The skewness Sk measures the direction and degree of asymmetry of the data distribution in the attribute column, and the kurtosis describes the steepness of the data distribution in the attribute column.
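The four additional statistics named above can be sketched in Python using their textbook definitions, which are assumptions here because the patent's own formulas are not reproduced in this text: coefficient of variation as standard deviation divided by mean, variation ratio as 1 - f/N, skewness as the standardized third central moment, and kurtosis as the standardized fourth central moment minus 3 (excess kurtosis). The name `dispersion_features` is hypothetical.

```python
import statistics
from collections import Counter

def dispersion_features(values):
    """Return coefficient of variation, variation ratio, skewness, kurtosis.

    Assumes a non-empty numeric column and population moments.
    """
    vals = [float(v) for v in values]
    n = len(vals)
    mean = statistics.fmean(vals)
    std = statistics.pstdev(vals)
    cv = std / mean if mean != 0 else 0.0
    mode_freq = Counter(vals).most_common(1)[0][1]  # f: frequency of the mode
    variation_ratio = 1.0 - mode_freq / n           # N: total number of values
    if std == 0:
        skewness = kurtosis = 0.0  # constant column: no asymmetry or peakedness
    else:
        skewness = sum((x - mean) ** 3 for x in vals) / (n * std ** 3)
        kurtosis = sum((x - mean) ** 4 for x in vals) / (n * std ** 4) - 3.0
    return {
        "cv": cv,
        "variation_ratio": variation_ratio,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }
```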

the date regular expression is: [expression missing in source];

the mailbox regular expression is: [expression missing in source];
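Because the two regular expressions are not reproduced in this text, the following Python sketch uses stand-in patterns to show how special-type data might be identified. Both patterns are assumptions for illustration only (a `YYYY-MM-DD` date with optional time, and a simplified e-mail address), and the names `DATE_RE`, `EMAIL_RE`, `special_type`, and `column_special_type` are hypothetical.

```python
import re

# Assumed stand-in patterns; the patent's own expressions are not available.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}( \d{2}:\d{2}(:\d{2})?)?$")
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def special_type(value):
    """Classify one value as 'datetime', 'mailbox', or None."""
    text = str(value).strip()
    if DATE_RE.match(text):
        return "datetime"
    if EMAIL_RE.match(text):
        return "mailbox"
    return None

def column_special_type(values):
    """Return the special type shared by every value in a column, else None."""
    types = {special_type(v) for v in values}
    return types.pop() if len(types) == 1 else None
```

A column is labelled as a special type only when every value matches the same pattern, which keeps mixed columns out of the special-type comparison.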

The specific method of the step S3 is as follows:

Step S301: preprocessing the data tables in the traceability database and converting them into a uniform format; extracting the attribute-column feature vectors of the data tables in the traceability database and of the data table to be traced; and obtaining the feature matrix of the attribute columns of each data table;

Preprocessing: because the data tables differ in size and are very numerous, they cannot be input into the network directly. During preprocessing, the feature matrices of the data tables are processed to a consistent size, and if a data table lacks an attribute column, the feature values of that attribute column are set to 0.
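The size normalization described above amounts to zero padding. A minimal sketch follows, assuming that each row of the feature matrix is the feature vector of one attribute column and that the target size is chosen in advance; the name `pad_feature_matrix` and its parameters are hypothetical.

```python
def pad_feature_matrix(matrix, n_columns, n_features, fill=0.0):
    """Pad or truncate a feature matrix to n_columns x n_features.

    Missing attribute columns and missing feature values are filled
    with zeros, matching the rule that an absent column has value 0.
    """
    padded = []
    for row in matrix[:n_columns]:
        row = list(row)[:n_features]
        padded.append(row + [fill] * (n_features - len(row)))
    while len(padded) < n_columns:
        padded.append([fill] * n_features)
    return padded
```

Truncating oversized matrices is an additional assumption of this sketch; the text only states that the matrices are brought to a consistent size.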

The step S4 comprises a matching strategy, and the specific steps of the matching strategy are as follows:

The entropy value of an attribute column is interpreted as follows: the larger the entropy value, the lower the discrimination, which means the values in the attribute column are more uniformly distributed, the distinguishing capability of the attribute column is weaker, and the importance of the attribute column is lower; the smaller the entropy value, the higher the discrimination, which means the values in the attribute column are less uniformly distributed, the distinguishing capability of the attribute column is stronger, and the importance of the attribute column is higher.
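The relationship between entropy and importance described above matches the widely used entropy weight method, which the following Python sketch assumes as a stand-in because the patent's proportion, entropy, and weight formulas are not reproduced in this text. In this assumed form, each value's proportion is its share of the column total, the entropy is normalized by the logarithm of the column length, and the weight is proportional to one minus the entropy. The name `entropy_weights` is hypothetical.

```python
import math

def entropy_weights(columns):
    """Return a weight in [0, 1] per attribute column; weights sum to 1.

    columns maps a column name to a list of non-negative numbers.
    Lower entropy (less uniform values) yields a higher weight.
    """
    diversity = {}
    for name, vals in columns.items():
        total = sum(vals)
        n = len(vals)
        if total == 0 or n < 2:
            entropy = 1.0  # no usable information: treat as fully uniform
        else:
            props = [v / total for v in vals]
            h = -sum(p * math.log(p) for p in props if p > 0)
            entropy = h / math.log(n)  # normalize entropy into [0, 1]
        diversity[name] = max(0.0, 1.0 - entropy)
    denom = sum(diversity.values())
    if denom == 0:
        # every column is uniform: fall back to equal weights
        return {name: 1.0 / len(columns) for name in columns}
    return {name: d / denom for name, d in diversity.items()}
```

With this construction a perfectly uniform column receives weight 0 and a column dominated by a single value receives the largest weight, which agrees with the interpretation given above.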

Step S403: comparing the attribute-column weight values of the data table to be traced and of the matched similar data table; selecting the attribute column with the largest weight value as the keyword and ordering the attribute columns by weight from largest to smallest; judging whether the attribute values of the largest-weight attribute column are equal in the two matched data tables; if they are equal, selecting the attribute with the next-largest weight for sorting; and then comparing whether the data quantities of the attribute columns are equal;
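The ordering logic of step S403 and the data-quantity check of step S404 can be sketched as a multi-key sort in Python. The sketch assumes that a table is a list of row dictionaries with complete, mutually comparable values, that rows are sorted in ascending order within each key, and that the data quantity of an attribute column is its count of non-missing values. The function names are hypothetical.

```python
def sort_rows_by_weight(rows, weights):
    """Sort rows using attribute columns as keys in descending weight order.

    Ties on the heaviest column are broken by the next-heaviest column,
    mirroring the fallback to the next-largest weight in step S403.
    """
    keys = sorted(weights, key=weights.get, reverse=True)
    ordered = sorted(rows, key=lambda row: tuple(row[k] for k in keys))
    return ordered, keys

def same_column_counts(rows_a, rows_b, keys):
    """Check that each attribute column holds equally many values in both tables."""
    for k in keys:
        count_a = sum(1 for row in rows_a if row.get(k) is not None)
        count_b = sum(1 for row in rows_b if row.get(k) is not None)
        if count_a != count_b:
            return False
    return True
```

Sorting both tables with the same key order puts corresponding rows at the same positions, so that later value-by-value comparison of an attribute column is meaningful.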

The feature values of the attribute column include: maximum, minimum, mean, variance, standard deviation, median, and range.

The special-type data include: date-time type data and mailbox type data.

The neural network model is a trained neural network model combining BiGRU with an attention mechanism.

The number of similar data tables is 2 or more.

Example 2

Referring to fig. 3, another embodiment of the present invention is provided: a big data based data analysis traceability system comprising:

The database is one of: MySQL, Oracle, SQL Server, and domestic databases.

Example 3

Referring to fig. 4, in this embodiment, an electronic device is further provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the data analysis tracing method based on big data when executing the computer program.

Electronic device details: the memory may be a computer readable signal medium or a non-transitory computer readable storage medium or any combination of the two. The non-transitory computer readable memory may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples of the non-transitory computer readable memory may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable storage medium having stored thereon computer instructions which, when executed, perform the steps of a data analysis tracing method based on big data.

The foregoing has shown and described the basic principles and main features of the present invention and the advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. The data analysis tracing method based on big data is characterized by comprising the following specific steps:

step S4: matching the traceability database, matching the data table to be traced with the screened data table by utilizing a matching strategy, judging the real source of the data table to be traced, and finishing the data authenticity verification and copyright authentication of the data table to be traced;

the specific method of the step S2 comprises the following steps:

step S201: given the original big database and the data table to be traced, clustering the original big database with a clustering algorithm according to the compactness of the distribution of the objects in it, thereby dividing the original big database into K cluster spaces;

Step S202: extracting attribute column characteristics of the data table to be traced, and forming characteristic vectors by characteristic values extracted by the attribute columns;

step S203: letting the original big database DB consist of n data tables and include an r-th data table and a j-th data table; for the t-th attribute column of the r-th data table and the m-th attribute column of the j-th data table, calculating the distance between the attribute-column feature vectors using the cosine distance, the cosine distance formula for numerical and character-string data being: [formula missing in source]; special-type data are judged using a regular expression, the type-judgment formula being: [formula missing in source];

Step S204: selecting the clustering space with the smallest distance, and taking all data tables in that clustering space as the tracing database;

The specific method of the step S3 is as follows:

step S302: putting the feature matrix of the attribute column of the data table into a trained convolutional neural network model, outputting a matching result of the data table to be traced and the data table in the tracing database, and screening out the data table similar to the data table to be traced in the tracing database;

the step S4 includes a matching policy, where the specific steps of the matching policy are as follows:

step S401: letting the data tables in the tracing database that are similar to the data table to be traced form a set of q data tables, where q represents the number of such similar data tables and the p-th member of the set is the p-th data table similar to the data table to be traced; each entry of a data table is the w-th value of the z-th attribute;

step S402: calculating the specific gravity of the attribute column, the calculation formula being: [formula missing in source], which gives the specific gravity of the i-th attribute column of the j-th data table from the values of the attribute columns of the data table; calculating the entropy of the attribute columns in the table, the calculation formula being: [formula missing in source], which gives the entropy value of the i-th attribute column in the data table;

calculating the weight of the attribute column, the calculation formula being: [formula missing in source], where the weight takes values in the range [0, 1];

Step S403: comparing the attribute-column weight values of the data table to be traced and of the matched similar data table; selecting the attribute column with the largest weight value as the keyword and ordering the attribute columns by weight from largest to smallest; judging whether the attribute values of the largest-weight attribute column are equal in the two matched data tables; if they are equal, selecting the attribute with the next-largest weight for sorting; and then comparing whether the data quantities of the attribute columns are equal;

step S404: acquiring the data quantity of the attribute columns of the two matched data tables, the acquisition formula being: [formula missing in source], where L represents the data quantity of the i-th attribute column in data table p;

step S406: matching the numerical data and the special-type data of the same attribute column in the similar data tables, the calculation formula being: [formula missing in source], where D represents the similarity of the numerical data and special-type data of the same attribute column in the similar data tables; the formula uses a comparison function over the corresponding values of the same attribute column, which takes one value when the corresponding values are equal and another value when they are not equal [values missing in source];

2. The method for tracing data analysis based on big data according to claim 1, wherein the feature values of the attribute column comprise: maximum, minimum, mean, variance, standard deviation, median, and range.

3. The data analysis tracing method based on big data according to claim 2, wherein said special type data comprises: date-time type data and mailbox type data.

4. The data analysis tracing method based on big data according to claim 3, wherein said neural network model is a trained neural network model combining BiGRU with an attention mechanism.

5. The data analysis tracing method based on big data as claimed in claim 4, wherein the number of the similar data tables is greater than or equal to 2.

6. A big data based data analysis traceability system, comprising:

the traceability database establishing module is used for judging the type of the data table to be traced, and selecting the data tables in the original big database that are of the same type as the data table to be traced to establish the traceability database;

the data integration module is used for organically gathering data with different sources, formats and characteristic properties logically or physically;

the matching strategy comprises the following specific steps:

Step S403: comparing the attribute-column weight values of the data table to be traced and of the matched similar data table; selecting the attribute column with the largest weight value as the keyword and ordering the attribute columns by weight from largest to smallest; judging whether the attribute values of the largest-weight attribute column are equal in the two matched data tables; if they are equal, selecting the attribute with the next-largest weight for sorting; and then comparing whether the data quantities of the attribute columns are equal;

step S406: matching the numerical data and the special-type data of the same attribute column in the similar data tables, the calculation formula being: [formula missing in source], where D represents the similarity of the numerical data and special-type data of the same attribute column in the similar data tables; the formula uses a comparison function over the corresponding values of the same attribute column, which takes one value when the corresponding values are equal and another value when they are not equal [values missing in source];

7. The big data based data analysis and tracing system of claim 6, wherein said database is one of: MySQL, Oracle, SQL Server, and domestic databases.

8. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of a big data based data analysis tracing method of any one of claims 1-5.

9. A computer readable storage medium having stored thereon computer instructions which when executed perform the steps of a big data based data analysis tracing method of any one of claims 1-5.