CN113449341A

CN113449341A - File data tracing method, device, equipment and storage medium

Info

Publication number: CN113449341A
Application number: CN202110791648.5A
Authority: CN
Inventors: 孙亚东; 谢福进; 王志海; 喻波; 魏力
Original assignee: Beijing Wondersoft Technology Co Ltd
Current assignee: Beijing Wondersoft Technology Co Ltd
Priority date: 2021-07-13
Filing date: 2021-07-13
Publication date: 2021-09-28
Anticipated expiration: 2041-07-13
Also published as: CN113449341B

Abstract

The application provides a file data tracing method, apparatus, device, and storage medium, and relates to the technical field of data security. With it, file data leaked to the Internet can be found quickly and accurately. The method comprises the following steps: dividing a target file and a candidate file associated with the target file according to the same scale to obtain a multi-level division result for each of the target file and the candidate file, where each level's division result comprises a plurality of file blocks of that level, the file blocks of each level together form the complete file, and each next-level division is performed on the division result of the previous level; acquiring a file block to be compared that belongs to a target level from the target file, and acquiring a plurality of reference file blocks that belong to the target level from the candidate file; and when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold, determining that the candidate file is a leakage file associated with the target file.

Description

File data tracing method, device, equipment and storage medium

Technical Field

The present application relates to the field of data security technologies, and in particular, to a method, an apparatus, a device, and a storage medium for tracing file data.

Background

With the deepening development of informatization, governments, enterprises, and public institutions have accumulated large amounts of data. This data is an asset of the organization, objectively reflects the actual state of its operations, and contains a great deal of the organization's sensitive information.

File data containing a large amount of sensitive information may be leaked to the internet through improper operation by personnel within the organization or by cooperating vendors; alternatively, such personnel or vendors may slightly modify the file data and then leak it to the internet. Either case can adversely affect the operation of the organization, personal privacy, and even national security. However, the prior art cannot quickly and accurately identify leaked files among the huge number of files on the internet.

Disclosure of Invention

The embodiments of the application provide a file data tracing method, apparatus, device, and storage medium, with which file data leaked to the Internet can be found quickly and accurately.

A first aspect of the embodiments of the present application provides a file data tracing method, where the method includes:

respectively carrying out file division on a target file and a candidate file associated with the target file according to the same scale to obtain respective multi-level division results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to each level; a plurality of file blocks of each level form a complete file, and the file division of the next level is carried out on the division result of the previous level;

acquiring file blocks to be compared, which belong to a target level, from the target file, and acquiring a plurality of reference file blocks, which belong to the target level, from the candidate file;

and when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining the candidate file as a leakage file associated with the target file.

Optionally, the method further comprises:

determining the current hierarchy level in the target file and the candidate file as the target hierarchy level according to the forming sequence of each hierarchy dividing result in the target file or the candidate file;

and when the similarity between the file block to be compared and all the reference file blocks in the plurality of reference file blocks is not more than a first preset threshold, determining the next level of the current level as the target level, returning to the step of acquiring the file block to be compared belonging to the target level from the target file and acquiring the plurality of reference file blocks belonging to the target level from the candidate file.

Optionally, the method further comprises:

calculating a first hash value of the file blocks to be compared by adopting a SIMHASH algorithm, and calculating a second hash value of each of a plurality of reference file blocks;

sequentially comparing the similarity of the first hash value and each second hash value;

and when a target second hash value with the similarity to the first hash value larger than a second preset threshold exists, determining that the similarity between the reference file block corresponding to the target second hash value and the file block to be compared is larger than the first preset threshold.

Optionally, after determining that the candidate file is a leakage file associated with the target file, the method further includes:

determining the position of the leakage content associated with the target file in the candidate file and the position of the leakage content in the target file according to the position of the target level in the target file or the candidate file and the position, in the target level, of the reference file block whose similarity to the file block to be compared is greater than the first preset threshold;

generating and outputting first prompt information for indicating the leakage file;

after determining the location of the leaked content associated with the target file in the candidate file and the location of the leaked content in the target file, the method further comprises at least one of:

generating and outputting second prompt information for indicating the leaked content;

generating and outputting third prompt information for indicating the position of the leaked content in the candidate file;

and generating and outputting fourth prompt information for indicating the position of the leaked content in the target file.

Optionally, when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold, determining that the candidate file is a leakage file associated with the target file includes:

when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining the overlapping degree of the candidate file and the target file according to the position of the target level in the target file or the candidate file; wherein the degree of overlap includes paragraph level overlap, sentence level overlap, and phrase level overlap.

A second aspect of the embodiments of the present application provides a file data tracing apparatus, where the apparatus includes:

the dividing module is used for respectively dividing a target file and a candidate file associated with the target file according to the same scale to obtain respective multi-level dividing results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to each level; a plurality of file blocks of each level form a complete file, and the file division of the next level is carried out on the division result of the previous level;

the acquisition module is used for acquiring file blocks to be compared, which belong to a target level, from the target file and acquiring a plurality of reference file blocks, which belong to the target level, from the candidate file;

the first determining module is configured to determine that the candidate file is a leakage file associated with the target file when the similarity between the file block to be compared and any reference file block in the multiple reference file blocks is greater than a first preset threshold.

Optionally, the apparatus further comprises:

a second determining module, configured to determine, according to a formation order of each hierarchical division result in the target file or the candidate file, a current hierarchical level in the target file and the candidate file as the target hierarchical level;

and a returning module, configured to determine, when the similarity between the file block to be compared and all of the reference file blocks in the multiple reference file blocks is not greater than a first preset threshold, a next level of the current level as the target level, and return to the step of acquiring the file block to be compared belonging to the target level from the target file and acquiring the multiple reference file blocks belonging to the target level from the candidate file.

Optionally, the apparatus further comprises:

the calculating module is used for calculating a first hash value of the file blocks to be compared by adopting a SIMHASH algorithm and calculating a second hash value of each of a plurality of reference file blocks;

the comparison module is used for sequentially comparing the similarity between the first hash value and each second hash value;

and the third determining module is used for determining that the similarity between the reference file block corresponding to the target second hash value and the file block to be compared is greater than the first preset threshold value when the target second hash value with the similarity greater than the second preset threshold value exists.

Optionally, the apparatus further comprises:

a fourth determining module, configured to determine, according to a position of the target level in the target file or the candidate file, and a position of a reference file block, in the target level, where a similarity between the reference file block and the file block to be compared is greater than a first preset threshold, a position of the leakage content associated with the target file in the candidate file, and a position of the leakage content in the target file.

Optionally, the apparatus further comprises:

the first output module is used for generating and outputting first prompt information used for indicating the leakage file;

the second output module is used for generating and outputting second prompt information used for indicating the leaked content;

a third output module, configured to generate and output third prompt information indicating a location of the leaked content in the candidate file;

and the fourth output module is used for generating and outputting fourth prompt information used for indicating the position of the leaked content in the target file.

Optionally, the first determining module includes:

the determining submodule is used for determining the overlapping degree of the candidate file and the target file according to the position of the target level in the target file or the candidate file when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is larger than a first preset threshold; wherein the degree of overlap includes paragraph level overlap, sentence level overlap, and phrase level overlap.

A third aspect of embodiments of the present application provides a readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps in the method according to the first aspect of the present application.

A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method according to the first aspect of the present application.

In the present application, the target file and the candidate file are divided according to the same scale: the target file is divided into a plurality of target file blocks for each level, and the candidate file is divided into a plurality of candidate file blocks for each level. The target file blocks and the candidate file blocks of the same level are then compared one by one, so that whether the contents of the target file and the candidate file are the same is determined block by block. Each comparison involves less content and is therefore faster than full-text character comparison, and the approach is more comprehensive than comparison based on a full-text hash value. Moreover, because the file blocks of each level form the complete file, file blocks of different levels represent different file structures; comparing file blocks at different levels therefore allows leaked content of different extents to be found. For a file that was modified before being leaked to the internet, traversing and comparing the file blocks of each level allows the modified leaked content to be accurately identified.

Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of method one for querying a leaked file;

FIG. 2 is a flow chart of method two for querying a leaked file;

FIG. 3 is a flowchart illustrating steps of a file data tracing method according to an embodiment of the present application;

FIG. 4 is a diagram illustrating the results of multi-level division of a target file and a candidate file in an example of the present application;

FIG. 5 is a diagram illustrating a comparison of similarity between a file block to be compared and a plurality of reference file blocks according to an embodiment of the present application;

FIG. 6 is a schematic diagram of comparing similarity between a target file and a candidate file according to an embodiment of the present application;

FIG. 7 is a block diagram of a file data tracing system in an example of the present application;

FIG. 8 is a flowchart illustrating a method for a terminal to perform file data tracing based on a file data tracing system according to an embodiment of the present disclosure;

FIG. 9 is a first flowchart comparing hash values according to an embodiment of the present application;

FIG. 10 is a second flowchart of comparing hash values according to an embodiment of the present application;

FIG. 11 is a functional block diagram of a file data tracing apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For file data leaked to the internet, the file data needs to be found and deleted from massive data of the internet, so that the tracing and the control of the leaked file data are completed, and the leaked file is prevented from being continuously transmitted and read. The prior art generally adopts the following method for tracing leaked files:

Method one: FIG. 1 is a flow chart of method one for querying a leaked file. As shown in FIG. 1, related files are searched for in the internet data resource pool according to keywords of the leaked file; for example, files are queried by keyword on platforms such as Baidu Netdisk and the tao-gaoba platform and downloaded to the local terminal. The terminal then compares the downloaded file with the leaked file character by character. If the characters of the two files are completely the same, the downloaded file is determined to be a related file of the leaked file; otherwise, the downloaded file is determined not to be a related file of the leaked file.

However, full-text character comparison is slow, and a change to any character or punctuation mark in the file changes the comparison result. If, before the leaked file is uploaded to the internet, characters or paragraphs are simply replaced or the positions of sentences or phrases are changed, method one, which queries the related files of a leaked file through full-text character comparison, cannot determine the leaked file or the leaked content on the internet, and its success rate in identifying leaked files among massive internet data is low.

Method two: FIG. 2 is a flow chart of method two for querying a leaked file. As shown in FIG. 2, files in the internet data resource pool are downloaded to the local terminal, the hash value of the leaked file is calculated with the SM3 algorithm, and the hash value of each downloaded file is calculated in turn; if the hash value of any downloaded file is the same as the hash value of the leaked file, that downloaded file is determined to be a related file of the leaked file.

However, computing a full-text hash with the SM3 algorithm yields a string of characters that carries no information about the content. Any change to the content of the file, such as replacing a paragraph, changes the hash value calculated from the file. For example, after a character is added or deleted, or sentences or phrases are exchanged, in the leaked file, the hash value that the SM3 algorithm calculates for the modified file differs from the hash value it calculates for the original leaked file. Therefore, method two, which queries the related files of a leaked file by calculating full-text hash values, cannot determine the leaked file or the leaked content on the internet once the file has been modified, and its success rate in identifying leaked files among massive internet data is low.
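The weakness of full-text hashing can be illustrated with a short sketch. SM3 is not guaranteed to be available through Python's standard `hashlib`, so SHA-256 is used here as a stand-in (an assumption of this sketch); both are cryptographic hashes whose digests are unrelated for different inputs. The sample strings and the helper name `full_text_digest` are hypothetical.

```python
import hashlib


def full_text_digest(text: str) -> str:
    """Hex digest of the whole text (SHA-256 as a stand-in for SM3)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = "Quarterly revenue grew by ten percent. Costs were flat."
# Same content with the two sentences exchanged, as a leaker might do.
modified = "Costs were flat. Quarterly revenue grew by ten percent."

# An unmodified copy is found, but the reordered copy is missed.
same_file_matches = full_text_digest(original) == full_text_digest(original)
modified_file_matches = full_text_digest(original) == full_text_digest(modified)
print(same_file_matches, modified_file_matches)
```

Because the digest comparison is all-or-nothing, it gives no indication of how close the modified file is to the original, which is the gap the block-level comparison below is meant to close.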

In view of the above problems, the embodiments of the present application provide a file data tracing method, which can quickly and accurately identify the leaked files and the contents of the modified leaked files in the mass internet data.

Fig. 3 is a flowchart of steps of a file data tracing method provided in an embodiment of the present application, and as shown in fig. 3, the steps are as follows:

Step S31: dividing a target file and a candidate file associated with the target file according to the same scale to obtain respective multi-level division results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to that level; the file blocks of each level form a complete file, and the file division of the next level is performed on the division result of the previous level.

The target file is the determined leaked file, and the candidate file associated with the target file is a file which is obtained from the internet data resource pool and is compared with the target file at the current time in the process of tracing the target file. In an example of the application, after receiving an instruction of selecting a target file by a user, the terminal starts to acquire the file in the internet data resource pool as a candidate file associated with the target file.

The division scale of the target file and of the candidate file associated with the target file may be determined according to an instruction input by a user. In an example of the application, the terminal provides an input entry for the user, and the division scale is determined according to keywords input by the user.

In an example of the present application, the target file and the candidate file are divided at a scale of 3, and the resulting multi-level division results are shown in FIG. 4, which is a schematic diagram of the multi-level division results of the target file and the candidate file in this example. Specifically, a binary-tree division is applied to the target file and the candidate file three times each, and each division is performed on the result of the previous division; that is, the file blocks in the previous division result are divided further. The multi-level division result comprises a first-level division result, a second-level division result, and a third-level division result. Taking the target file as an example, the first-level division result is target file block 1 and target file block 2; the second-level division result is target file block 1.1, target file block 1.2, target file block 2.1 and target file block 2.2; and the third-level division result is target file block 1.1.1, target file block 1.1.2, target file block 1.2.1, target file block 1.2.2, target file block 2.1.1, target file block 2.1.2, target file block 2.2.1 and target file block 2.2.2.

The target file block 1 and the target file block 2 jointly form a complete target file, the target file block 1.1, the target file block 1.2, the target file block 2.1 and the target file block 2.2 jointly form a complete target file, and the target file block 1.1.1, the target file block 1.1.2, the target file block 1.2.1, the target file block 1.2.2, the target file block 2.1.1, the target file block 2.1.2, the target file block 2.2.1 and the target file block 2.2.2 jointly form a complete target file.
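The binary-tree division described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `divide_levels` is hypothetical, and splitting each block at its character midpoint is an assumption, since the application does not state where a block is cut.

```python
from typing import Dict, List


def divide_levels(text: str, scale: int) -> Dict[int, List[str]]:
    """Halve every block `scale` times; level k then holds 2**k blocks."""
    levels: Dict[int, List[str]] = {}
    blocks = [text]
    for level in range(1, scale + 1):
        next_blocks: List[str] = []
        for block in blocks:
            mid = len(block) // 2  # assumption: cut at the character midpoint
            next_blocks.extend([block[:mid], block[mid:]])
        levels[level] = next_blocks
        blocks = next_blocks  # the next division works on this result
    return levels


sample = "abcdefghijklmnop"
levels = divide_levels(sample, scale=3)
for level, level_blocks in levels.items():
    # The blocks of every level concatenate back to the complete file.
    print(level, level_blocks, "".join(level_blocks) == sample)
```

Applying the same function with the same `scale` to the target file and the candidate file yields division results whose levels correspond one to one, which is what the later per-level comparison relies on.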

Step S32: acquiring a file block to be compared, which belongs to a target level, from the target file, and acquiring a plurality of reference file blocks, which belong to the target level, from the candidate file.

The target level is one of the plurality of levels. The level to which the file block to be compared belongs in the multi-level division result of the target file is the same as the level to which the plurality of reference file blocks belong in the multi-level division result of the candidate file.

Step S33: when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining the candidate file as a leakage file associated with the target file.

Another embodiment of the present application details a method for accurately identifying leaked content by traversing and comparing file blocks representing different file structures at various levels.

When the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than the first preset threshold and the candidate file is determined to be a leakage file associated with the target file, the degree of overlap between the candidate file and the target file may be determined according to the position of the target level in the target file or the candidate file; the degree of overlap includes paragraph-level overlap, sentence-level overlap, and phrase-level overlap.

In addition to comparing file blocks based on the multi-level division results, the target file and the candidate file may be compared directly. When the similarity between the target file and the candidate file is greater than the first preset threshold, the target file and the candidate file overlap at the file level, and the candidate file leaks the entire content of the target file.

Paragraph level overlap, sentence level overlap, and phrase level overlap have different definitions at different scale divisions.

With continued reference to FIG. 4, in an example of the present application the division scale is 3. When the target level is the first level, suppose the similarity between target file block 1 and candidate file block 2 is greater than the first preset threshold. Candidate file block 2 is a part of the candidate file obtained by dividing it once, and target file block 1 is a part of the target file obtained by dividing it once; each such part consists of several paragraphs. Before this match is found, the similarity between target file block 1 and candidate file block 1 has also been calculated. Therefore, even if the person concerned moved the opening paragraphs of the target file to the end before leaking it, the embodiment of the present application can identify the moved paragraphs.

Similarly, suppose the similarity between target file block 1.2 and candidate file block 2.1 is greater than the first preset threshold. Candidate file block 2.1 is a part of the candidate file obtained by dividing it twice, and target file block 1.2 is a part of the target file obtained by dividing it twice. Before this match is found, the similarities between target file block 1.1 and each of candidate file blocks 1.1, 1.2, 2.1 and 2.2, and between target file block 1.2 and each of candidate file blocks 1.1 and 1.2, have also been calculated. Therefore, even if the person concerned moved several sentences from the opening paragraphs of the target file into other paragraphs of the file before leaking it, the moved sentences can be identified.

The embodiment of the application not only can identify partial content in the leaked file, but also can identify the positions of the leaked content in the target file and the candidate file.

With continued reference to FIG. 4, in one example of the present application, assume that the target level is the third level and that the reference file block whose similarity to the file block to be compared is greater than the first preset threshold is candidate file block 1.1.2. It can then be determined that the leakage content associated with the target file is the second of the eight parts of the candidate file. If the file block to be compared that matched this reference file block is target file block 2.1.1, it can further be determined that the leaked content is the fifth of the eight parts of the target file; that is, before uploading, the person who leaked the file moved the fifth part of the target file to the second of the eight parts of the same file or of another file.
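The mapping from a block label to its position in the file can be sketched as follows. The helper `block_position` is hypothetical and assumes the dotted binary labels used in FIG. 4, where each component is 1 (first half) or 2 (second half).

```python
from typing import Tuple


def block_position(label: str) -> Tuple[int, int]:
    """Return (ordinal, total) for a binary-division label such as "2.1.1"."""
    digits = [int(part) for part in label.split(".")]
    if any(d not in (1, 2) for d in digits):
        raise ValueError(f"not a binary-division label: {label!r}")
    ordinal = 0
    for d in digits:
        # Each level doubles the number of blocks; 2 selects the second half.
        ordinal = ordinal * 2 + (d - 1)
    return ordinal + 1, 2 ** len(digits)


candidate_pos = block_position("1.1.2")  # matched reference file block
target_pos = block_position("2.1.1")     # matched file block to be compared
print(candidate_pos, target_pos)  # (2, 8) (5, 8)
```

The first element is the 1-based ordinal of the block within its level and the second is the number of blocks at that level, so the pair locates the leaked content in each file.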

Another embodiment of the present application provides a specific method for obtaining the target level. The target level is the level at which file blocks are currently being compared. Therefore, the current level in the target file and the candidate file is first determined as the target level according to the formation order of the level division results in the target file or the candidate file.

According to the formation order of the level division results, the level of the file blocks obtained after the first division is determined as the first level, and the plurality of file blocks of the first level are determined as the first-level division result; the level of the file blocks obtained after the second division is determined as the second level, and the plurality of file blocks of the second level are determined as the second-level division result; and so on.

With continued reference to FIG. 4, since the division scale is 3, the formation order of the level division results of the target file is first level, second level, third level, and the formation order of the level division results of the candidate file is likewise first level, second level, third level. The first level is therefore determined as the target level first. File blocks to be compared are obtained from the first level of the target file; specifically, each target file block in the first level of the target file is taken in turn as the file block to be compared. Candidate file block 1 and candidate file block 2 are then obtained from the first level of the candidate file as the reference file blocks. When neither candidate file block 1 nor candidate file block 2 has a similarity to target file block 1 or to target file block 2 greater than the first preset threshold, the second level is taken as the target level and the above operations are repeated.

By acquiring the target level in this way, the contents of the target file can be checked in sequence from a large range to a small range. First, the target file is compared with the candidate file as a whole to identify whether the candidate file discloses the entire content of the target file. After it is confirmed that the candidate file does not leak the target file in its entirety, the target file blocks obtained by the first division of the target file are compared with the candidate file blocks obtained by the first division of the candidate file to identify whether some paragraphs of the candidate file disclose part of the target file. After it is confirmed that the candidate file does not leak whole paragraphs of the target file, the target file blocks obtained by the second division are compared with the candidate file blocks obtained by the second division to identify whether some sentences of the candidate file disclose part of the target file. By dividing the target file and the candidate file at a larger scale, for example N times, and checking the leakage range level by level in this way, leaked content at the phrase or word level can also be identified. Even if the person who leaked the file split its content into scattered fragments and mixed them into different files before leaking, the content can still be accurately identified in the manner of the embodiment of the present application.
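The level-by-level traversal can be sketched as follows. This is an illustration under stated assumptions: `difflib.SequenceMatcher` is used as a stand-in similarity measure (the application uses SIMHASH), the threshold of 0.9 and the name `find_leak` are hypothetical, and the sample division results are hand-written rather than produced from real files. Indices in the result are 0-based.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the SIMHASH comparison itself."""
    return SequenceMatcher(None, a, b).ratio()


def find_leak(
    target_levels: Dict[int, List[str]],
    candidate_levels: Dict[int, List[str]],
    threshold: float = 0.9,
) -> Optional[Tuple[int, int, int]]:
    """Walk levels in formation order; return (level, target index,
    candidate index) of the first pair above the threshold, else None."""
    for level in sorted(target_levels):  # first level, then second, ...
        for t_idx, to_compare in enumerate(target_levels[level]):
            for c_idx, reference in enumerate(candidate_levels[level]):
                if similarity(to_compare, reference) > threshold:
                    return level, t_idx, c_idx
        # No reference block matched at this level: move to the next level.
    return None


target_levels = {
    1: ["alpha beta ", "gamma delta"],
    2: ["alpha ", "beta ", "gamma ", "delta"],
}
candidate_levels = {
    1: ["xxxx gamma ", "yyyy zzzz"],
    2: ["xxxx ", "gamma ", "yyyy ", "zzzz"],
}
result = find_leak(target_levels, candidate_levels)
print(result)  # (2, 2, 1)
```

In this sample no first-level pair exceeds the threshold, so the search falls through to the second level, where the third target block matches the second candidate block.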

Another embodiment of the present application provides a specific method for comparing the similarity between a file block to be compared and a plurality of reference file blocks. Fig. 5 is a schematic diagram for comparing similarity between a file block to be compared and a plurality of reference file blocks according to an embodiment of the present application, and as shown in fig. 5, a SIMHASH algorithm is adopted to calculate a first hash value of the file block to be compared and a second hash value of each of the plurality of reference file blocks. And sequentially comparing the similarity of the first hash value and each second hash value. And when a target second hash value with the similarity to the first hash value larger than a second preset threshold exists, determining that the similarity between the reference file block corresponding to the target second hash value and the file block to be compared is larger than the first preset threshold.

In an example of the present application, after the first hash value of the file block to be compared is calculated, a correspondence between the first hash value and the file block to be compared may be established and stored. Likewise, after the second hash values of the plurality of reference file blocks are calculated, a one-to-one correspondence between the plurality of reference file blocks and the plurality of second hash values may be established and stored. Then, when a target second hash value whose similarity to the first hash value is greater than the second preset threshold is identified, the reference file block whose similarity to the file block to be compared is greater than the first preset threshold is determined through the pre-stored correspondence.

Fig. 6 is a schematic diagram illustrating comparison of similarity between a target file and a candidate file according to an embodiment of the present application, and as shown in fig. 6, in addition to calculating a first hash value of a file block to be compared and calculating a second hash value of a reference file block, the embodiment of the present application further calculates hash values of the target file and the candidate file by using a SIMHASH algorithm, so as to compare the overall similarity between the target file and the candidate file.

The hash value of each file block is generated using the SIMHASH algorithm. If a file block has not been changed, its hash value is unchanged and identification succeeds when the file block hash values are the same. If a file block has been changed, the characteristics of the SIMHASH algorithm mean that its hash value does not change completely, and identification succeeds if the similarity of the hash values reaches 90% or more. In addition, the SIMHASH algorithm compresses the content to be compared, which improves identification speed.

In order to take control measures promptly after a leaked file is found, the embodiment of the application provides the following method, so that after identifying the leaked file, the leaked content, or the position of the leaked content, the terminal prompts the user by sound or icon information to promptly delete the leaked file or content from the internet.

After determining that the candidate file is a leaked file associated with the target file, the method further comprises:

In an example of the present application, all modules of the file data tracing apparatus may also be integrated and arranged at corresponding positions to construct, at a terminal, the file data tracing system shown in fig. 7. The system runs each module to execute the corresponding program, thereby implementing the file data tracing method provided in other embodiments of the present application. Fig. 7 is a structural diagram of the file data tracing system in an example of the present application. Fig. 8 is a flowchart of an exemplary terminal of the present application executing the file data tracing method based on the file data tracing system. As shown in fig. 7 and 8, when the terminal runs the file data tracing system, the splitting layer number unit receives a scale input by a user, and the file splitting unit performs the first step: it splits the target file into target file blocks by using a binary tree, and splits the candidate file into candidate file blocks. A binary tree is an important type of tree structure: an ordered tree in which each node has at most two child nodes.
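The binary-tree splitting can be illustrated with the sketch below, which treats the user-supplied scale as the number of splitting layers. Cutting each block into two halves by character count is an assumption made for illustration only, since the application does not state where the cut points lie, and the function name `split_levels` is hypothetical.

```python
from typing import List


def split_levels(content: str, scale: int) -> List[List[str]]:
    """Return one list of blocks per level; level 0 is the whole file.

    Each block is cut into two halves to form the next level, like the two
    children of a binary-tree node, so the blocks of every level
    concatenate back to the complete file.
    """
    levels = [[content]]
    for _ in range(scale):
        next_level = []
        for block in levels[-1]:
            if len(block) < 2:
                next_level.append(block)  # too short to split further
                continue
            mid = len(block) // 2
            next_level.extend([block[:mid], block[mid:]])
        levels.append(next_level)
    return levels
```

Applying the same function with the same scale to both the target file and the candidate file gives both files the same number of levels, which is what allows blocks to be compared level against level.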

With continuing reference to fig. 7 and 8, the file data tracing system includes an original file unit for storing the leaked target file, a candidate file unit for storing a candidate file obtained from the internet data resource pool, an original file block unit, and a candidate file block unit; the original file block unit is used for storing a target file block obtained by splitting a target file, and the candidate file block unit is used for storing a candidate file block obtained by splitting a candidate file.

With continued reference to fig. 7 and 8, the terminal runs the file data tracing system, the hash value generating unit performs the second step of calculating the file hash value of the target file stored in the original file unit by using the SIMHASH algorithm, and calculating the file hash value of the candidate file stored in the candidate file unit. The SIMHASH algorithm is also used to calculate the file block hash values of all target file blocks stored in the original file block unit and the file block hash values of all candidate file blocks stored in the candidate file block unit.

With continuing reference to fig. 7 and 8, the file data tracing system further includes a target file and target file block corresponding list unit, a candidate file and candidate file block corresponding list unit, a target file and file hash value corresponding list unit, a candidate file and file hash value corresponding list unit, a target file block and file block hash value corresponding list unit, and a candidate file block and file block hash value corresponding list unit. The target file and target file block corresponding list unit is used for storing the corresponding relation between a target file and a target file block, the candidate file and candidate file block corresponding list unit is used for storing the corresponding relation between a candidate file and a candidate file block, the target file and file hash value corresponding list unit is used for storing the file hash value of the target file, the candidate file and file hash value corresponding list unit is used for storing the hash value of the candidate file, the target file block and file block hash value corresponding list unit is used for storing the corresponding relation between the target file block and the file block hash value, and the candidate file block and file block hash value corresponding list unit is used for storing the corresponding relation between the candidate file block and the file block hash value.

With continuing reference to fig. 7 and 8, the hash value similarity setting unit is configured to set a second preset threshold and a first preset threshold. And the hash value comparison unit compares the hash values in a mode of comparing layer by layer and traversing in layers.

Fig. 9 is a first flowchart comparing hash values according to an embodiment of the present application, and fig. 10 is a second flowchart comparing hash values according to an embodiment of the present application. Referring to fig. 9, the file block hash value of the target file block of the first hierarchy is compared with the file block hash values of the candidate file blocks of the first hierarchy, and when the file block hash value comparison is performed at the first hierarchy, for any target file block of the first hierarchy, it is compared with each candidate file block of the first hierarchy one by one.

Assuming that A represents the target file, B represents the candidate file, X represents the hierarchy number, n represents the number of file blocks at hierarchy X, and Y represents the hash value comparison result, the similarity comparison between a target file block and a candidate file block can be expressed by equation (1):

$$Y = F\left(A_X^{n},\ B_X^{1\text{-}n}\right) \tag{1}$$

In the comparison process, when the similarity of two hash values is greater than the second preset threshold, the target file block corresponding to the hash value is obtained from the target file block and file block hash value corresponding list unit, and the candidate file block corresponding to the hash value is obtained from the candidate file block and file block hash value corresponding list unit.
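The layer-by-layer comparison and the lookup in the list units can be sketched as follows. Here each "list unit" is modelled as a dictionary from a file block hash value to its file block, one dictionary per level. The dictionary layout, the function names, the 64-bit width, and the bitwise similarity measure are assumptions for illustration. The sketch collects matches at every level instead of stopping at the first one, which is a design choice rather than something the application prescribes.

```python
from typing import Dict, List, Tuple


def bit_similarity(h1: int, h2: int, bits: int = 64) -> float:
    """Fraction of bit positions on which two hash values agree."""
    return 1 - bin(h1 ^ h2).count("1") / bits


def compare_levels(
    target_levels: List[Dict[int, str]],
    candidate_levels: List[Dict[int, str]],
    second_threshold: float = 0.9,
) -> List[Tuple[int, str, str]]:
    """Compare every target block with every candidate block, level by level.

    target_levels[x] and candidate_levels[x] map hash value -> file block
    at level x. Returns (level, target block, candidate block) per match.
    """
    hits = []
    for level, (t_index, c_index) in enumerate(zip(target_levels, candidate_levels)):
        for t_hash, t_block in t_index.items():
            for c_hash, c_block in c_index.items():
                if bit_similarity(t_hash, c_hash) > second_threshold:
                    # The blocks are recovered from the stored correspondences.
                    hits.append((level, t_block, c_block))
    return hits
```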

With continued reference to fig. 7, the file data tracing system further includes a file leakage warning unit configured to output first prompt information indicating the leaked file, second prompt information indicating the leaked content, third prompt information indicating a location of the leaked content in the candidate file, and fourth prompt information indicating a location of the leaked content in the target file.
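As an illustration of the four kinds of prompt information, the sketch below builds them from a matched block. The class and function names, the message wording, and the use of character offsets found with `str.find` as the "location" are all hypothetical. The application does not define how positions are represented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LeakFinding:
    candidate_name: str    # the leaked file
    content: str           # the leaked content
    candidate_offset: int  # position of the content in the candidate file
    target_offset: int     # position of the content in the target file


def make_finding(candidate_name: str, content: str,
                 target_text: str, candidate_text: str) -> LeakFinding:
    """Locate the leaked content in both files (offset is -1 if not found)."""
    return LeakFinding(
        candidate_name=candidate_name,
        content=content,
        candidate_offset=candidate_text.find(content),
        target_offset=target_text.find(content),
    )


def build_prompts(finding: LeakFinding) -> List[str]:
    """Return the first to fourth prompt information, in order."""
    return [
        f"Leaked file: {finding.candidate_name}",
        f"Leaked content: {finding.content!r}",
        f"Position in candidate file: {finding.candidate_offset}",
        f"Position in target file: {finding.target_offset}",
    ]
```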

Based on the same inventive concept, the embodiment of the application provides a file data tracing device. Fig. 11 is a functional block diagram of a file data tracing apparatus according to an embodiment of the present application. As shown in fig. 11, the apparatus includes:

the dividing module 111 is configured to perform file division on a target file and a candidate file associated with the target file according to the same scale, so as to obtain respective multi-level division results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to each level; a plurality of file blocks of each level form a complete file, and the file division of the next level is carried out on the division result of the previous level;

an obtaining module 112, configured to obtain a file block to be compared belonging to a target level from the target file, and obtain a plurality of reference file blocks belonging to the target level from the candidate file;

a first determining module 113, configured to determine that the candidate file is a leakage file associated with the target file when a similarity between the file block to be compared and any reference file block in the multiple reference file blocks is greater than a first preset threshold.

Optionally, the apparatus further comprises:

Optionally, the first determining module includes:

Based on the same inventive concept, another embodiment of the present application provides a readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the steps in the file data tracing method according to any of the above embodiments of the present application.

Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the method for tracing file data according to any of the above embodiments of the present application is implemented.

Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.

The embodiments in the present specification are described in a progressive or descriptive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The method, the device, the equipment and the storage medium for tracing the file data provided by the application are introduced in detail, and the description of the embodiment is only used for helping to understand the method and the core idea of the application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A file data tracing method is characterized by comprising the following steps:

2. The method of claim 1, further comprising:

3. The method of claim 1, further comprising:

4. The method of claim 1, wherein after determining that the candidate file is a leakage file associated with the target file, the method further comprises:

5. The method of claim 4, wherein after determining that the candidate file is a leakage file associated with the target file, the method further comprises:

6. The method according to claim 1, wherein when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold, determining the candidate file as a leakage file associated with the target file comprises:

7. A file data tracing apparatus, characterized in that the apparatus comprises:

8. The apparatus of claim 1, further comprising:

9. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any of claims 1-6.