CN117155576A

CN117155576A - Data asset tracing method based on multi-scale pooling hash data fingerprint

Info

Publication number: CN117155576A
Application number: CN202311017090.0A
Authority: CN
Inventors: 邵国林
Original assignee: Nanchang University
Current assignee: Nanchang University
Priority date: 2023-08-14
Filing date: 2023-08-14
Publication date: 2023-12-01

Abstract

The application discloses a data asset tracing method based on multi-scale pooling hash (MPH) data fingerprints. When a client generates an MPH data fingerprint, substrings are first constructed over the data byte sequence by a k-gram method to extract low-order sequence information of the data stream; multi-scale, multi-type pooling is then applied to the substring hash results to extract high-order structural information of the data stream; finally, the key information extracted by the multi-scale pooling operation is converted by a locality-sensitive hashing method into a highly robust data fingerprint, which is reported to a server for tracing. When tracing, the server first extracts the data fingerprint by the same method and then traces the data asset based on fingerprint similarity. The multi-scale pooling hash method improves the robustness of data fingerprints in scenarios where data assets change form, so that the continuity of tracing information can be maintained in an adversarial environment of data content tampering and data form change.

Description

Data asset tracing method based on multi-scale pooling hash data fingerprint

Technical Field

The application relates to the technical field of data tracing and leakage prevention, in particular to a data asset tracing method based on multi-scale pooling hash data fingerprints.

Background

Enterprises and organizations accumulate large amounts of important, sensitive data in the course of their operations; these are data assets of great importance to the enterprise. Once stolen, such data assets can cause serious property damage and loss of benefit to the enterprise and can disrupt normal industry and economic order. Data leakage has now overtaken data destruction as the greatest data security risk, and data tracing is a key technology for achieving data security and privacy protection.

Traditional data asset tracing has the following drawback: it depends on a stable data form. Once the data form changes, tracing risks being interrupted, so traditional techniques struggle to cope with the flexible and changeable tracing requirements of real scenarios, such as data tampering, transfer, and interception.

Disclosure of Invention

In view of the above problems, the application aims to provide a data asset tracing method based on multi-scale pooled hash data fingerprints, which generates highly robust data fingerprints through multi-scale information extraction and pooled hashing, and traces data assets based on the similarity of the resulting fingerprints. The multi-scale pooling hash method improves the robustness of data fingerprints in scenarios where data assets change form, so that the continuity of tracing information can be maintained in an adversarial environment of data content tampering and data form change.

In order to achieve the above purpose, the application adopts the following technical scheme:

A data asset tracing method based on multi-scale pooled hash data fingerprints comprises two parts: (1) multi-scale pooled hash (MPH) data fingerprint generation; (2) data asset tracing based on MPH fingerprint similarity.

The multi-scale pooled hash (MPH) data fingerprint generation process includes the steps of:

Step 11: the client monitors new data generation; specifically, the data fingerprint generation client monitors key data operations on the terminal, including file creation, modification, movement, copying, outgoing transfer, deletion, and other data processing scenarios, and enters the data fingerprint generation stage if a new data file is found to have been generated;

Step 12: extracting low-order sequence information of the data stream; specifically, the data asset is converted into a byte sequence, and a byte-by-byte sliding window of size k is applied to the byte sequence by a k-gram method, the k bytes in the window being concatenated into one unit substring at each position, so that an original byte sequence of length n is converted into a sequence of n-k+1 substrings, each of length k; optionally, the data may first be preprocessed, that is, the data asset is processed by a specific algorithm to extract key information representing the data content;

Step 13: hashing each substring to convert it into a hash value of fixed length m, so that the substring sequence is converted into a hash sequence of length L = n-k+1;

Step 14: multi-scale extraction of the high-order structural information of the data stream; specifically, information of different granularities and different types is extracted from the hash sequence of length L by a pyramid pooling method. The calculation process is as follows:

Step 141: pooling is performed successively with windows of 2, 4, 8, …, 2^i, where i is the number of pooling levels;

step 142: respectively calculating the maximum hash and the minimum hash of each block of data, and extracting 2 values from each block of data;

Step 143: concatenating these values to form a new hash sequence of length K = 2(L/2 + L/4 + … + L/2^i);

Step 15: generating the MPH data fingerprint by a locality-sensitive hashing (LSH) method; specifically, the hash sequence of length K is converted by the LSH method into a hash value of fixed length, which is the final data fingerprint of the data asset;

Step 16: writing the generated data fingerprint into a local log and reporting it to the data tracing server to provide the data basis for data tracing. Optionally, to facilitate hash similarity calculation and fast fuzzy matching, the fingerprint is reported as a binary string cut into N segments.

The data asset tracing process based on MPH fingerprint similarity comprises the following steps:

Step 21: generating the MPH data fingerprint of the leaked file: the MPH data fingerprint of the leaked file to be analyzed is extracted according to steps 12-15, converting the data asset into a data fingerprint represented as a fixed-length binary hash string;

Step 22: MPH data fingerprint similarity comparison: the binary hash string is cut into N segments and compared with the records in the data fingerprint database reported to the cloud; following the pigeonhole principle, each segment is matched for equality, and similarity is calculated only for records that share an identical value on at least one segment;

Step 23: tracing the data asset: similarity is measured by Hamming distance; if the similarity exceeds a threshold, the two fingerprints are considered to belong to the same data asset, and the data operation history is traced back on that basis.

The beneficial effects of the application are as follows:

The multi-scale pooling hash method improves the robustness of data fingerprints in scenarios where data assets change form, so that the continuity of tracing information can be maintained in an adversarial environment of data content tampering and data form change.

Drawings

FIG. 1 is a flow chart of an embodiment of the present application.

Detailed Description

In order to make the technical solution of the present application better understood by those skilled in the art, the technical solution of the present embodiment will be clearly and completely described in the following description with reference to the accompanying drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, shall fall within the scope of the application. The following examples are only for more clearly illustrating the technical aspects of the present application, and are not intended to limit the scope of the present application.

In this embodiment, as shown in FIG. 1, when a client generates an MPH data fingerprint, substrings are first constructed over the data byte sequence by a k-gram method to extract low-order sequence information of the data stream; multi-scale, multi-type pooling is then applied to the substring hash results to extract high-order structural information of the data stream; finally, the key information extracted by the multi-scale pooling operation is converted by a locality-sensitive hashing method into a highly robust data fingerprint, which is reported to a server for tracing. When tracing, the server first extracts the data fingerprint by the same method and then traces the data asset based on fingerprint similarity. The application is further described below with reference to the accompanying drawings.

Step 11: the client monitors new data generation. Specifically, the data fingerprint generation client monitors key data operations on the terminal by registering system kernel functions through kernel programming; the monitored operations include file creation, modification, movement, copying, outgoing transfer, deletion, and other data processing scenarios. If a new data file is found to have been generated, the client enters the data fingerprint generation stage. Local files can be checked by their MD5 hash value: if a fingerprint generation log for a file with the same MD5 already exists locally, the file is not processed; otherwise it enters the fingerprint generation stage. Optionally, interception may be selective according to a configured monitoring policy, for example intercepting only file handling actions for specific file suffixes, file sizes, file types, or file operations.
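The MD5-based local check in step 11 can be illustrated with a minimal Python sketch. The function name `needs_fingerprint`, the in-memory set `logged_md5s` standing in for the local fingerprint generation log, and the `allowed_suffixes` policy parameter are all assumptions made for illustration; the kernel-level monitoring itself is not modeled here.

```python
import hashlib


def needs_fingerprint(file_name, content, logged_md5s, allowed_suffixes=None):
    """Decide whether a newly observed file should enter fingerprint generation.

    file_name        -- name of the file seen by the monitor
    content          -- file content as bytes
    logged_md5s      -- set of hex MD5 digests already present in the local log
                        (hypothetical stand-in for the fingerprint generation log)
    allowed_suffixes -- optional monitoring policy: only these suffixes are handled
    """
    # Optional policy filter, e.g. only intercept specific file suffixes.
    if allowed_suffixes is not None:
        if not file_name.lower().endswith(tuple(allowed_suffixes)):
            return False
    # Skip files whose MD5 already has a fingerprint generation log entry.
    return hashlib.md5(content).hexdigest() not in logged_md5s
```

A file whose content was already logged is skipped, while any file with new content (and an allowed suffix, if a policy is configured) proceeds to step 12.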

Step 12: extracting low-order sequence information of the data stream. The original data asset is processed by a k-gram method, which preserves the context information of the data asset and enriches the granularity of the information retained in the fingerprint. Specifically, the data asset is converted into a byte sequence, and a byte-by-byte sliding window of size k is applied to it; at each position the k bytes in the window are concatenated into one unit substring, so that an original byte sequence of length n is converted into a sequence of n-k+1 substrings, each of length k. For example, if the byte sequence of the original data asset is ABCDEF (length 6) and the sliding window size is 3, the k-gram method converts it to ABC BCD CDE DEF, that is, 6-3+1=4 substrings, each of length 3. Optionally, the data may be preprocessed before the k-gram processing: the data asset is processed by a specific algorithm to extract key information representing the data content, for example by removing stop words from a text sequence, removing redundant information from a file by other statistical methods, or sampling the data of the data asset to some extent;
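The sliding-window construction in step 12 can be sketched in a few lines of Python. The function name `k_grams` is chosen here for illustration and does not come from the application.

```python
def k_grams(data: bytes, k: int) -> list[bytes]:
    """Slide a window of size k over the byte sequence one byte at a time.

    A sequence of length n yields n - k + 1 substrings of length k
    (an empty list if the sequence is shorter than k).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [data[i:i + k] for i in range(len(data) - k + 1)]


# The example from the text: ABCDEF with k = 3 gives ABC, BCD, CDE, DEF.
substrings = k_grams(b"ABCDEF", 3)
```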

Step 13: each substring is hashed and converted into a hash value of fixed length m, so that the substring sequence is converted into a hash sequence of length L = n-k+1; for example, each substring in ABC BCD CDE DEF is converted into a hash value of the same length by the MD5 hash method:

ABC：902FBDD2B1DF0C4F70B4A5D23525E932

BCD：8539EF1FBA74A70F5A77FCC3F25C1659

CDE：F8E054E3416DE72E874492E25C38B3EC

DEF：822DD494B3E14A82AA76BD455E6B6F4B
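Digests of this kind can be computed with Python's standard `hashlib` module, as in the sketch below. The function name `hash_substrings` and its `bits` parameter are assumptions of this sketch: `bits` keeps only the low-order bits of each digest, a space-saving truncation, with 8 bits used as the default for illustration.

```python
import hashlib


def hash_substrings(substrings, bits=8):
    """Hash each substring with MD5 and keep only the low `bits` bits.

    The 128-bit digest is read as a big-endian integer, so the low 8 bits
    correspond to the last two hex digits of the printed digest.
    """
    mask = (1 << bits) - 1
    hashes = []
    for substring in substrings:
        digest = hashlib.md5(substring).digest()
        hashes.append(int.from_bytes(digest, "big") & mask)
    return hashes


# One truncated hash value per substring, so the output length is L = n - k + 1.
hash_sequence = hash_substrings([b"ABC", b"BCD", b"CDE", b"DEF"])
```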

to save space, only the low-order bits of each hash value need be kept as the hash result; for convenience of description, only the low 8 bits are kept here as the final hash value:

ABC：0x32 (50)

BCD：0x59 (89)

CDE：0xEC (236)

DEF：0x4B (75)

thereby converting the byte sequence into a hash sequence of length 4: 50 89 236 75.

Step 14: multi-scale extraction of the high-order structural information of the data stream; specifically, the hash sequence is partitioned, and information of different granularities and different types is extracted from it, by a multi-level, multi-granularity pooling method. The calculation process is as follows:

(1) Pooling is performed successively with windows of 2, 4, …, 2^i, where i is the number of pooling levels; for example, the hash sequence of length 4 is partitioned with windows 2 and 4;

level 1: the window is 2, giving 2 blocks, (50, 89) and (236, 75);

level 2: the window is 4, giving 1 block, (50, 89, 236, 75);

(2) The maximum hash and the minimum hash of each block are calculated, so that 2 values are extracted from each block;

level 1: the window is 2, giving 2 blocks, (50, 89) and (236, 75);

the maximum hash extracts (89) and (236) respectively;

the minimum hash extracts (50) and (75) respectively;

level 2: the window is 4, giving 1 block, (50, 89, 236, 75);

the maximum hash extracts (236);

the minimum hash extracts (50);

(3) These values are concatenated to form a new hash sequence of length K = 2(L/2 + L/4 + … + L/2^i). In this example, multi-level, multi-granularity extraction and concatenation yield the hash value sequence (89), (236), (50), (75), (236), (50), of length 6;

Step 15: generating the MPH data fingerprint by a locality-sensitive hashing (LSH) method; specifically, the hash sequence of length K is converted by the LSH method into a hash value of fixed length, which is the final data fingerprint of the data asset. The LSH method may be a locality-sensitive hash scheme such as SimHash or MinHash. Taking SimHash as an example:

first, the hash value sequence is converted into binary form;

(89)：01011001

(236)：11101100

(50)：00110010

(75)：01001011

(236)：11101100

(50)：00110010

second, the hash values are aligned bit by bit and summed position by position with a given weight (a bit of 1 contributes +1 and a bit of 0 contributes -1); assuming all weights are equal to 1, the result is -2, 2, 2, 0, 2, -2, 0, -2. Each position whose sum is greater than or equal to 0 is set to 1 and every other position is set to 0, giving the hash fingerprint 01111010.
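Steps 14 and 15 can be sketched together in Python as follows. This is an illustrative sketch under stated assumptions, not the application's reference implementation: the function names `pyramid_pool` and `simhash` are chosen here, all SimHash weights are taken as 1, and a trailing block shorter than the window is pooled as-is when the sequence length is not a multiple of the window. The input uses the low bytes of the four digests listed earlier (0x32, 0x59, 0xEC, 0x4B).

```python
def pyramid_pool(hashes, levels):
    """Max/min pooling with windows 2, 4, ..., 2**levels.

    For each level the block maxima are emitted first, then the block minima,
    and the results of all levels are concatenated in order.
    """
    pooled = []
    for level in range(1, levels + 1):
        window = 2 ** level
        blocks = [hashes[i:i + window] for i in range(0, len(hashes), window)]
        pooled.extend(max(block) for block in blocks)
        pooled.extend(min(block) for block in blocks)
    return pooled


def simhash(values, bits=8):
    """SimHash with unit weights over `bits`-bit values, most significant bit first."""
    totals = [0] * bits
    for value in values:
        for pos in range(bits):
            bit = (value >> (bits - 1 - pos)) & 1
            totals[pos] += 1 if bit else -1  # 1 contributes +1, 0 contributes -1
    # A position whose sum is >= 0 becomes 1, otherwise 0.
    return "".join("1" if total >= 0 else "0" for total in totals)


pooled = pyramid_pool([0x32, 0x59, 0xEC, 0x4B], levels=2)  # 6 values
fingerprint = simhash(pooled)  # "01111010"
```

With L = 4 and two levels, the pooled sequence has 2(4/2 + 4/4) = 6 values, and the resulting 8-bit fingerprint is 01111010, the value used in the steps that follow.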

Step 16: the generated MPH data fingerprint is written into a local log and reported to the data tracing server to provide the data basis for data tracing. The data fingerprint log uploaded to the cloud may include information such as the operation timestamp, terminal information (IP, system, and the like), user identity (ID), file name, file attributes, file ID, and the data fingerprint. Optionally, to facilitate hash similarity calculation and fast fuzzy matching, the fingerprint is reported as a binary string cut into N segments. Assuming that the data fingerprint is 01111010 and the number of segments N is 4, the data fingerprint is stored as N1=01, N2=11, N3=10, N4=10.
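The segmented storage form can be sketched as below. The helper names `split_segments` and `build_log_record`, and the dictionary keys of the log record, are hypothetical; the application lists the kinds of information a log may contain but does not define a record format.

```python
import time


def split_segments(fingerprint: str, n: int) -> list[str]:
    """Cut a binary fingerprint string into n equal-length segments."""
    if n <= 0 or len(fingerprint) % n != 0:
        raise ValueError("fingerprint length must be a positive multiple of n")
    size = len(fingerprint) // n
    return [fingerprint[i:i + size] for i in range(0, len(fingerprint), size)]


def build_log_record(fingerprint, n, user_id, file_name, terminal_ip):
    """Assemble one fingerprint log entry (field names are illustrative only)."""
    return {
        "timestamp": time.time(),
        "terminal_ip": terminal_ip,
        "user_id": user_id,
        "file_name": file_name,
        "fingerprint": fingerprint,
        "segments": split_segments(fingerprint, n),
    }


# The example from the text: 01111010 with N = 4 gives 01, 11, 10, 10.
record = build_log_record("01111010", 4, "user-1", "report.docx", "10.0.0.1")
```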

Step 22: MPH data fingerprint similarity comparison: the binary hash string is cut into N segments; assuming that the extracted data fingerprint is 01101010, then N1=01, N2=10, N3=10, N4=10. Similarity matching is performed against the records in the data fingerprint database reported to the cloud; to speed up the calculation and avoid meaningless similarity computations, the N1, N2, N3 and N4 segments can each be matched for equality according to the pigeonhole principle;
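The pigeonhole argument is that two fingerprints differing in fewer than N bits must agree exactly on at least one of their N segments, so records that share no segment with the query can be skipped. A minimal sketch of such a candidate lookup follows; the in-memory index keyed by (segment position, segment value) and the names `build_index` and `candidates` are assumptions for illustration, whereas a real deployment would presumably query a fingerprint database.

```python
from collections import defaultdict


def split_segments(fingerprint, n):
    """Cut a binary fingerprint string into n equal-length segments."""
    size = len(fingerprint) // n
    return [fingerprint[i:i + size] for i in range(0, len(fingerprint), size)]


def build_index(records, n):
    """Map (segment position, segment value) to the ids of records containing it."""
    index = defaultdict(set)
    for record_id, fingerprint in records.items():
        for pos, segment in enumerate(split_segments(fingerprint, n)):
            index[(pos, segment)].add(record_id)
    return index


def candidates(query, index, n):
    """Return ids of records that match the query exactly on at least one segment."""
    found = set()
    for pos, segment in enumerate(split_segments(query, n)):
        found |= index.get((pos, segment), set())
    return found


records = {"asset-1": "01111010", "asset-2": "10000101"}  # hypothetical database
index = build_index(records, 4)
matches = candidates("01101010", index, 4)  # {"asset-1"}
```

Here the query 01101010 agrees with 01111010 on segments N1, N3 and N4, so only that record is passed on to the similarity calculation.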

Step 23: tracing the data asset: similarity can be measured by Hamming distance; if the similarity exceeds a certain threshold, the two fingerprints are determined to belong to the same data asset, and the data operation history is traced back on that basis. Assuming that the record with fingerprint 01111010 above is selected by the N1, N2, N3, N4 segment matching, its similarity to the leaked file fingerprint (01101010) is then calculated as 7/8 = 87.5%.
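The similarity figure in this example can be checked with a short Python sketch. Defining similarity as one minus the normalized Hamming distance is an assumption consistent with the 7/8 result above; the function name `hamming_similarity` and the 0.8 threshold are chosen for illustration only.

```python
def hamming_similarity(a: str, b: str) -> float:
    """Similarity of two equal-length binary strings: 1 - HammingDistance / length."""
    if len(a) != len(b) or not a:
        raise ValueError("fingerprints must be non-empty and of equal length")
    distance = sum(x != y for x, y in zip(a, b))
    return 1 - distance / len(a)


similarity = hamming_similarity("01111010", "01101010")  # 0.875
same_asset = similarity >= 0.8  # hypothetical threshold
```

The two fingerprints differ in a single bit out of eight, which gives the 87.5% similarity stated in the text.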

The foregoing description of the preferred embodiments of the present application has been presented only in terms of those specific and detailed descriptions, and is not, therefore, to be construed as limiting the scope of the application. It should be noted that modifications, improvements and substitutions can be made by those skilled in the art without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A data asset tracing method based on multi-scale pooled hash data fingerprint is characterized by comprising the following two parts: generating MPH data fingerprints of the multi-scale pooling hash; tracing data assets based on MPH fingerprint similarity;

the multi-scale pooled hash MPH data fingerprint generation process comprises the following steps:

step 11: the client monitors new data generation;

step 12: extracting low-order sequence information of a data stream, converting a data asset into a byte sequence, carrying out byte-by-byte sliding window processing on the byte sequence by a k-gram method, wherein the sliding window size is k, splicing k byte data in the sliding window into a unit data substring each time, and finally converting an original byte sequence with the length of n into n-k+1 substring sequences with the unit length of k;

step 14: carrying out multi-scale extraction on the high-order structural information of the data stream, and carrying out information extraction of different scales and different types on the hash sequence with the length L by a pyramid pooling method;

step 15: generating MPH data fingerprints based on a locality-sensitive hash method;

step 16: writing the generated MPH data fingerprint into a local log, and reporting to a data tracing server side to provide a data foundation for data tracing;

2. The data asset tracing system based on data fingerprint similarity according to claim 1, wherein the step 11 is specifically: the data fingerprint generation client monitors key data operation behaviors in the terminal, including file creation, modification, movement, copying, outgoing and deleting data processing scenes, and if a new data file is found to be generated, the data fingerprint generation client enters a data fingerprint generation link.

3. The data asset tracing technology based on multi-scale pooled hash data fingerprint according to claim 1, wherein the calculation process of step 14 is:

step 143: these values are concatenated to form a new hash sequence of length K = 2(L/2 + L/4 + … + L/2^i).

4. A data asset tracing technique based on multi-scale pooled hash data fingerprints as recited in claim 3, wherein the step 15 specifically comprises: the hash sequence of length K is converted into a hash value of fixed length by a locality-sensitive hash method, and the hash value represents the final data fingerprint of the data asset.