CN113449341A - File data tracing method, device, equipment and storage medium - Google Patents

Info

Publication number
CN113449341A
Authority
CN
China
Prior art keywords
file
target
level
candidate
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110791648.5A
Other languages
Chinese (zh)
Other versions
CN113449341B (en)
Inventor
孙亚东
谢福进
王志海
喻波
魏力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wondersoft Technology Co Ltd
Original Assignee
Beijing Wondersoft Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wondersoft Technology Co Ltd
Priority to CN202110791648.5A
Publication of CN113449341A
Application granted
Publication of CN113449341B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a file data tracing method, device, equipment and storage medium, and relates to the technical field of data security. File data leaked to the Internet can be queried quickly and accurately. The method comprises the following steps: dividing a target file and a candidate file associated with the target file according to the same scale, to obtain a multi-level division result for each of the target file and the candidate file, wherein each level division result comprises a plurality of file blocks corresponding to that level, the file blocks of each level together form a complete file, and the division of each next level is performed on the division result of the previous level; acquiring a file block to be compared that belongs to a target level from the target file, and acquiring a plurality of reference file blocks that belong to the target level from the candidate file; and, when the similarity between the file block to be compared and any one of the plurality of reference file blocks is greater than a first preset threshold, determining the candidate file to be a leakage file associated with the target file.

Description

File data tracing method, device, equipment and storage medium
Technical Field
The present application relates to the field of data security technologies, and in particular, to a method, an apparatus, a device, and a storage medium for tracing file data.
Background
With the continued development of informatization, governments, enterprises and public institutions have accumulated a great deal of data. This data is an asset of the organization, objectively reflects how the organization actually operates, and contains a large amount of the organization's sensitive information.
File data containing a large amount of sensitive information may be leaked to the Internet through improper operation by personnel inside the organization or by cooperating vendors, or such personnel or vendors may slightly modify the file data before leaking it to the Internet. Either case can harm the operation of the organization, personal privacy, and even national security. However, the prior art cannot quickly and accurately identify leaked files among the huge number of files on the Internet.
Disclosure of Invention
The embodiments of the present application provide a file data tracing method, device, equipment and storage medium, with which file data leaked to the Internet can be queried quickly and accurately.
A first aspect of the embodiments of the present application provides a file data tracing method, where the method includes:
respectively carrying out file division on a target file and a candidate file associated with the target file according to the same scale to obtain respective multi-level division results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to each level; a plurality of file blocks of each level form a complete file, and the file division of the next level is carried out on the division result of the previous level;
acquiring file blocks to be compared, which belong to a target level, from the target file, and acquiring a plurality of reference file blocks, which belong to the target level, from the candidate file;
and when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining the candidate file as a leakage file associated with the target file.
Optionally, the method further comprises:
determining the current hierarchy level in the target file and the candidate file as the target hierarchy level according to the forming sequence of each hierarchy dividing result in the target file or the candidate file;
and when the similarity between the file block to be compared and all the reference file blocks in the plurality of reference file blocks is not more than a first preset threshold, determining the next level of the current level as the target level, returning to the step of acquiring the file block to be compared belonging to the target level from the target file and acquiring the plurality of reference file blocks belonging to the target level from the candidate file.
Optionally, the method further comprises:
calculating a first hash value of the file blocks to be compared by adopting a SIMHASH algorithm, and calculating a second hash value of each of a plurality of reference file blocks;
sequentially comparing the similarity of the first hash value and each second hash value;
and when a target second hash value with the similarity to the first hash value larger than a second preset threshold exists, determining that the similarity between the reference file block corresponding to the target second hash value and the file block to be compared is larger than the first preset threshold.
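The hash-based comparison described in the three steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the 64-bit fingerprint width, whitespace tokenization, the use of MD5 as the per-token hash, and the function names `simhash` and `hash_similarity` are all assumptions made for the example. The similarity is taken here as one minus the normalized Hamming distance between two fingerprints, which is one common choice.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint of `text` (bits must be a multiple of 8, at most 128)."""
    weights = [0] * bits
    for token in text.lower().split():  # assumption: whitespace tokens, equal weight
        digest = hashlib.md5(token.encode("utf-8")).digest()[: bits // 8]
        token_hash = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hash_similarity(first: int, second: int, bits: int = 64) -> float:
    """Similarity in [0, 1]: 1 minus the normalized Hamming distance."""
    distance = bin(first ^ second).count("1")
    return 1.0 - distance / bits


block_to_compare = "the contract value and the customer list are confidential"
first_hash = simhash(block_to_compare)
second_hash = simhash(block_to_compare)
print(hash_similarity(first_hash, second_hash))
```

Comparing the resulting similarity against a second preset threshold (for example 0.9, a hypothetical value) would then stand in for the direct block-to-block similarity test.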
Optionally, after determining that the candidate file is a leakage file associated with the target file, the method further includes:
and determining the position of the leakage content associated with the target file in the candidate file and the position of the leakage content in the target file according to the position of the target level in the target file or the candidate file and the position of a reference file block with the similarity to the file block to be compared being greater than a first preset threshold value in the target level.
Optionally, after determining that the candidate file is a leakage file associated with the target file, the method further includes:
generating and outputting first prompt information for indicating the leakage file;
after determining the location of the leaked content associated with the target file in the candidate file and the location of the leaked content in the target file, the method further comprises at least one of:
generating and outputting second prompt information for indicating the leaked content;
generating and outputting third prompt information for indicating the position of the leaked content in the candidate file;
and generating and outputting fourth prompt information for indicating the position of the leaked content in the target file.
Optionally, when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold, determining that the candidate file is a leakage file associated with the target file includes:
when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining the overlapping degree of the candidate file and the target file according to the position of the target level in the target file or the candidate file; wherein the degree of overlap includes paragraph level overlap, sentence level overlap, and phrase level overlap.
A second aspect of the embodiments of the present application provides a file data tracing apparatus, where the apparatus includes:
the dividing module is used for respectively dividing a target file and a candidate file associated with the target file according to the same scale to obtain respective multi-level dividing results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to each level; a plurality of file blocks of each level form a complete file, and the file division of the next level is carried out on the division result of the previous level;
the acquisition module is used for acquiring file blocks to be compared, which belong to a target level, from the target file and acquiring a plurality of reference file blocks, which belong to the target level, from the candidate file;
the first determining module is configured to determine that the candidate file is a leakage file associated with the target file when the similarity between the file block to be compared and any reference file block in the multiple reference file blocks is greater than a first preset threshold.
Optionally, the apparatus further comprises:
a second determining module, configured to determine, according to a formation order of each hierarchical division result in the target file or the candidate file, a current hierarchical level in the target file and the candidate file as the target hierarchical level;
and a returning module, configured to determine, when the similarity between the file block to be compared and all of the reference file blocks in the multiple reference file blocks is not greater than a first preset threshold, a next level of the current level as the target level, and return to the step of acquiring the file block to be compared belonging to the target level from the target file and acquiring the multiple reference file blocks belonging to the target level from the candidate file.
Optionally, the apparatus further comprises:
the calculating module is used for calculating a first hash value of the file blocks to be compared by adopting a SIMHASH algorithm and calculating a second hash value of each of a plurality of reference file blocks;
the comparison module is used for sequentially comparing the similarity between the first hash value and each second hash value;
and the third determining module is used for determining that the similarity between the reference file block corresponding to the target second hash value and the file block to be compared is greater than the first preset threshold value when the target second hash value with the similarity greater than the second preset threshold value exists.
Optionally, the apparatus further comprises:
a fourth determining module, configured to determine, according to a position of the target level in the target file or the candidate file, and a position of a reference file block, in the target level, where a similarity between the reference file block and the file block to be compared is greater than a first preset threshold, a position of the leakage content associated with the target file in the candidate file, and a position of the leakage content in the target file.
Optionally, the apparatus further comprises:
the first output module is used for generating and outputting first prompt information used for indicating the leakage file;
the second output module is used for generating and outputting second prompt information used for indicating the leaked content;
a third output module, configured to generate and output third prompt information indicating a location of the leaked content in the candidate file;
and the fourth output module is used for generating and outputting fourth prompt information used for indicating the position of the leaked content in the target file.
Optionally, the first determining module includes:
the determining submodule is used for determining the overlapping degree of the candidate file and the target file according to the position of the target level in the target file or the candidate file when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is larger than a first preset threshold; wherein the degree of overlap includes paragraph level overlap, sentence level overlap, and phrase level overlap.
A third aspect of embodiments of the present application provides a readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps in the method according to the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method according to the first aspect of the present application.
According to the embodiments of the present application, the target file is divided, at the same scale, into a plurality of target file blocks for each level, and the candidate file is divided into a plurality of candidate file blocks for each level. The target file blocks and the candidate file blocks of the same level are then compared one by one, so that the contents of the target file and the candidate file are compared block by block. Because each comparison covers less content, it is faster than full-text character comparison, and it is more comprehensive than comparison based on a full-text hash value. Moreover, since the file blocks of each level together form a complete file, and the file blocks of different levels represent different file structures, comparing file blocks at different levels reveals leaked content of different degrees. For a file that was modified before being leaked to the Internet, traversing and comparing the file blocks of every level allows the modified leaked content to be identified accurately.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of method one for querying a leaked file;
FIG. 2 is a flowchart of method two for querying a leaked file;
FIG. 3 is a flowchart of the steps of a file data tracing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the multi-level division results of a target file and a candidate file in an example of the present application;
FIG. 5 is a schematic diagram of comparing the similarity between a file block to be compared and a plurality of reference file blocks according to an embodiment of the present application;
FIG. 6 is a schematic diagram of comparing the similarity between a target file and a candidate file according to an embodiment of the present application;
FIG. 7 is a block diagram of a file data tracing system in an example of the present application;
FIG. 8 is a flowchart of a method by which a terminal performs file data tracing based on the file data tracing system according to an embodiment of the present application;
FIG. 9 is a first flowchart of comparing hash values according to an embodiment of the present application;
FIG. 10 is a second flowchart of comparing hash values according to an embodiment of the present application;
FIG. 11 is a functional block diagram of a file data tracing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For file data leaked to the internet, the file data needs to be found and deleted from massive data of the internet, so that the tracing and the control of the leaked file data are completed, and the leaked file is prevented from being continuously transmitted and read. The prior art generally adopts the following method for tracing leaked files:
Method one: FIG. 1 is a flowchart of method one for querying a leaked file. As shown in FIG. 1, related files are searched for in an Internet data resource pool according to keywords of the leaked file; for example, files are queried by keyword on platforms such as Baidu Netdisk and the tao-gaoba platform and downloaded to the local terminal. The terminal then compares the characters of each downloaded file with those of the leaked file. If the characters of the downloaded file and the leaked file are completely identical, the downloaded file is determined to be a related file of the leaked file; if they are not identical, the downloaded file is determined not to be a related file of the leaked file.
However, full-text character comparison is slow, and changing any single character or punctuation mark in the file changes the comparison result. If simple character replacement, paragraph replacement, or reordering of sentences or phrases is applied to the leaked file before it is uploaded to the Internet, method one, which queries related files by full-text character comparison, can determine neither the leaked file nor the leaked content on the Internet, so its success rate in identifying leaked files among massive Internet data is low.
Method two: FIG. 2 is a flowchart of method two for querying a leaked file. As shown in FIG. 2, files in the Internet data resource pool are downloaded to the local terminal, the hash value of the leaked file is calculated with the SM3 algorithm, and the hash value of each downloaded file is calculated in turn. If the hash value of any downloaded file is the same as the hash value of the leaked file, that downloaded file is determined to be a related file of the leaked file.
However, the hash value that the SM3 algorithm computes over the full text is a string of characters with no semantic meaning, and any change to the content of the file, such as replacing a paragraph, changes the hash value calculated from the file. For example, after a character is added to or deleted from the leaked file, or a sentence or phrase in it is moved, the SM3 hash value of the modified leaked file differs from the SM3 hash value of the original leaked file. Therefore method two, which queries related files by calculating a full-text hash value, can determine neither the leaked file nor the leaked content on the Internet, and its success rate in identifying leaked files among massive Internet data is low.
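The sensitivity of a full-text hash to small edits can be shown with a short sketch. SHA-256 from Python's standard library is used here as a stand-in for SM3 (an assumption made only because SM3 is not guaranteed to be available in every Python build); both are cryptographic hashes, so the behavior illustrated is the same in kind. The sample strings and the function name `full_text_digest` are hypothetical.

```python
import hashlib


def full_text_digest(text: str) -> str:
    # Assumption: SHA-256 substitutes for SM3; any cryptographic hash
    # maps near-identical inputs to unrelated digests.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = "The quarterly report lists all customer accounts."
modified = "The quarterly report lists all customer accounts!"  # one character changed

digests_match = full_text_digest(original) == full_text_digest(modified)
print(digests_match)  # the two digests differ, so the modified leak is missed
```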
In view of the above problems, the embodiments of the present application provide a file data tracing method, which can quickly and accurately identify the leaked files and the contents of the modified leaked files in the mass internet data.
Fig. 3 is a flowchart of steps of a file data tracing method provided in an embodiment of the present application, and as shown in fig. 3, the steps are as follows:
step S31: respectively carrying out file division on a target file and a candidate file associated with the target file according to the same scale to obtain respective multi-level division results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to each level; the file blocks of each level form a complete file, and the file division of the next level is performed on the division result of the previous level.
The target file is the determined leaked file, and the candidate file associated with the target file is a file which is obtained from the internet data resource pool and is compared with the target file at the current time in the process of tracing the target file. In an example of the application, after receiving an instruction of selecting a target file by a user, the terminal starts to acquire the file in the internet data resource pool as a candidate file associated with the target file.
The division scale of the target file and of the candidate file associated with the target file may be determined according to an instruction input by a user. In an example of the application, the terminal provides an input entry for the user, and the division scale is determined according to the keywords input by the user.
In an example of the present application, the target file and the candidate file are divided at a scale of 3, and the resulting multi-level division result is shown in FIG. 4, which is a schematic diagram of the multi-level division results of the target file and the candidate file in this example. Specifically, binary-tree partitioning is used to divide the target file and the candidate file 3 times each, and each division is performed on the result of the previous division; that is, the file blocks in the previous division result are divided further. The multi-level division result comprises a first-layer division result, a second-layer division result, and a third-layer division result. Taking the target file as an example, the first-layer division result is target file block 1 and target file block 2; the second-layer division result is target file block 1.1, target file block 1.2, target file block 2.1 and target file block 2.2; and the third-layer division result is target file block 1.1.1, target file block 1.1.2, target file block 1.2.1, target file block 1.2.2, target file block 2.1.1, target file block 2.1.2, target file block 2.2.1 and target file block 2.2.2.
The target file block 1 and the target file block 2 jointly form a complete target file, the target file block 1.1, the target file block 1.2, the target file block 2.1 and the target file block 2.2 jointly form a complete target file, and the target file block 1.1.1, the target file block 1.1.2, the target file block 1.2.1, the target file block 1.2.2, the target file block 2.1.1, the target file block 2.1.2, the target file block 2.2.1 and the target file block 2.2.2 jointly form a complete target file.
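The binary-tree division at scale 3 can be sketched as below. This is an illustrative sketch only: the text does not state where each block is cut, so splitting every block at its character midpoint is an assumption, as are the function names `split_in_two` and `multi_level_division`.

```python
from typing import Dict, List


def split_in_two(block: str) -> List[str]:
    # Assumption: each block is cut at its character midpoint.
    middle = len(block) // 2
    return [block[:middle], block[middle:]]


def multi_level_division(text: str, scale: int) -> Dict[int, List[str]]:
    """Return {level: blocks}; level k holds 2**k blocks that rebuild `text`."""
    levels: Dict[int, List[str]] = {}
    current = [text]
    for level in range(1, scale + 1):
        # Each level is produced by dividing the blocks of the previous level.
        current = [half for block in current for half in split_in_two(block)]
        levels[level] = current
    return levels


target_file = "abcdefghijklmnop"
division = multi_level_division(target_file, scale=3)
for level, blocks in division.items():
    print(level, blocks)
```

In this sketch the blocks of level 1 correspond to blocks 1 and 2, those of level 2 to blocks 1.1 through 2.2, and those of level 3 to blocks 1.1.1 through 2.2.2; joining the blocks of any one level reproduces the complete file.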
Step S32: and acquiring a file block to be compared, which belongs to a target level, from the target file, and acquiring a plurality of reference file blocks, which belong to the target level, from the candidate file.
The target level is one of the multiple levels. The multi-level division result of the target file, from which the file block to be compared is taken, has the same number of levels as the multi-level division result of the candidate file, from which the plurality of reference file blocks are taken.
Step S33: and when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining the candidate file as a leakage file associated with the target file.
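The decision in step S33 can be sketched as follows. The similarity measure is not fixed at this point in the text, so `difflib.SequenceMatcher.ratio()` from Python's standard library is used purely as a placeholder; the threshold value 0.8 and the function name `is_leakage_file` are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def is_leakage_file(block_to_compare: str,
                    reference_blocks: List[str],
                    first_threshold: float = 0.8) -> bool:
    """True if ANY reference block is more similar than the threshold."""
    return any(
        SequenceMatcher(None, block_to_compare, reference).ratio() > first_threshold
        for reference in reference_blocks
    )


references = ["unrelated text", "secret plan alpha"]
print(is_leakage_file("secret plan alpha", references))
```

A single reference block exceeding the threshold is enough to flag the candidate file, which is why the check uses `any` rather than `all`.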
According to the embodiments of the present application, the target file is divided, at the same scale, into a plurality of target file blocks for each level, and the candidate file is divided into a plurality of candidate file blocks for each level. The target file blocks and the candidate file blocks of the same level are then compared one by one, so that the contents of the target file and the candidate file are compared block by block. Because each comparison covers less content, it is faster than full-text character comparison, and it is more comprehensive than comparison based on a full-text hash value. Moreover, since the file blocks of each level together form a complete file, and the file blocks of different levels represent different file structures, comparing file blocks at different levels reveals leaked content of different degrees. For a file that was modified before being leaked to the Internet, traversing and comparing the file blocks of every level allows the modified leaked content to be identified accurately.
Another embodiment of the present application details a method for accurately identifying leaked content by traversing and comparing file blocks representing different file structures at various levels.
When the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining that the candidate file is a leakage file associated with the target file includes:
When the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining the overlapping degree of the candidate file and the target file according to the position of the target level in the target file or the candidate file; wherein the degree of overlap includes paragraph level overlap, sentence level overlap, and phrase level overlap.
In addition to comparing file blocks based on the multi-level division results, the target file and the candidate file may be compared directly. If the similarity between the target file and the candidate file is greater than the first preset threshold, the target file and the candidate file overlap at the file level, and the candidate file leaks the entire content of the target file.
Paragraph-level overlap, sentence-level overlap, and phrase-level overlap are defined differently under different division scales.
With continued reference to FIG. 4, in an example of the present application the division scale is 3. When the target level is the first level, suppose the similarity between target file block 1 and candidate file block 2 is greater than the first preset threshold. Candidate file block 2 is part of the content obtained by dividing the candidate file once, and target file block 1 is part of the content obtained by dividing the target file once; each of these parts consists of several paragraphs. Before the similarity between target file block 1 and candidate file block 2 is found to exceed the first preset threshold, the similarity between target file block 1 and candidate file block 1 has also been calculated. Therefore, even if the person concerned moved the opening paragraphs of the target file to the end before leaking it, the embodiment of the present application can identify the moved paragraphs.
When the target level is the second level, suppose the similarity between target file block 1.2 and candidate file block 2.1 is greater than the first preset threshold. Candidate file block 2.1 is part of the content obtained by dividing the candidate file twice, and target file block 1.2 is part of the content obtained by dividing the target file twice. Before this similarity is found to exceed the first preset threshold, the similarities between target file block 1.1 and each of candidate file blocks 1.1, 1.2, 2.1 and 2.2, and between target file block 1.2 and each of candidate file blocks 1.1 and 1.2, have also been calculated. Therefore, even if the person concerned moved several sentences from the opening paragraphs of the target file into other paragraphs of the file before leaking it, the moved sentences can be identified.
The embodiment of the application not only can identify partial content in the leaked file, but also can identify the positions of the leaked content in the target file and the candidate file.
Specifically, the position of the leakage content associated with the target file in the candidate file, and the position of the leakage content in the target file, are determined according to the position of the target level in the target file or the candidate file and the position, within the target level, of the reference file block whose similarity to the file block to be compared is greater than the first preset threshold value.
With continued reference to FIG. 4, in one example of the present application, assume that the target level is the third level and that the reference file block whose similarity to the file block to be compared is greater than the first preset threshold is candidate file block 1.1.2. It can then be determined that the leakage content associated with the target file is the second of the eight parts of the candidate file. If the file block to be compared that matched this reference file block is target file block 2.1.1, it can further be determined that the leaked content is the fifth of the eight parts of the target file, and that before uploading, the leaker moved the fifth part of the target file to the second of the eight parts of the same file or of another file.
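The arithmetic in this example, mapping a block label to its ordinal position within its level, can be sketched as follows. The dotted label format follows FIG. 4 as described in the text; the function name `block_position` is hypothetical, and the sketch assumes binary-tree division in which each label component is 1 or 2.

```python
from typing import Tuple


def block_position(label: str) -> Tuple[int, int]:
    """Map a binary-tree block label such as '2.1.1' to (ordinal, total blocks)."""
    components = [int(part) for part in label.split(".")]
    index = 0
    for component in components:
        # Each level doubles the number of blocks; component 1 is the
        # left half and component 2 the right half of the parent block.
        index = index * 2 + (component - 1)
    return index + 1, 2 ** len(components)


print(block_position("1.1.2"))  # position of the matching block in the candidate file
print(block_position("2.1.1"))  # position of the leaked block in the target file
```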
Another embodiment of the present application provides a specific method for obtaining the target level. The target level is the level at which file block comparison is currently performed. Therefore, following the order in which the level division results of the target file or the candidate file are formed, the current level in the target file and the candidate file is first determined as the target level.
When the similarity between the file block to be compared and every one of the plurality of reference file blocks is not greater than the first preset threshold, the next level after the current level is determined as the target level, and the method returns to the step of acquiring the file block to be compared belonging to the target level from the target file and acquiring the plurality of reference file blocks belonging to the target level from the candidate file.
In the order in which the level division results are formed, the level of the file blocks obtained by the first division is determined as the first level, and the plurality of file blocks of the first level are determined as the first-level division result; the level of the file blocks obtained by the second division is determined as the second level, and the plurality of file blocks of the second level are determined as the second-level division result; and so on.
With continued reference to fig. 4, since the division scale is 3, the level division results of the target file are formed in the order first level, second level, third level, and the level division results of the candidate file are formed in the same order. Thus, the first level is first determined as the target level. The file blocks to be compared are obtained from the first level of the target file; specifically, each target file block in the first level of the target file is used in turn as a file block to be compared. Candidate file block 1 and candidate file block 2 are then acquired from the first level of the candidate file as reference file blocks. When neither candidate file block 1 nor candidate file block 2 has a similarity to target file block 1 or to target file block 2 greater than the first preset threshold, the second level is taken as the target level and the above operations are repeated.
With this method of obtaining the target level, the content of the target file can be checked in sequence from a large range to a small range. First, the target file is compared with the candidate file as a whole to identify whether the entire content of the target file is disclosed in the candidate file. After it is determined that the target file has not been leaked in its entirety, the target file blocks obtained by the first division of the target file are compared with the candidate file blocks obtained by the first division of the candidate file, to identify whether some paragraphs of the candidate file disclose part of the target file. After it is determined that no whole paragraph of the target file has been leaked, the target file blocks obtained by the second division are compared with the candidate file blocks obtained by the second division, to identify whether some sentences of the candidate file disclose part of the target file. In this way the target file and the candidate file are divided at different scales, for example N times, and the extent to which the candidate file leaks the target file is checked level by level, so that leaked content at the phrase or word level can also be identified. Even if the person leaking the file splits its content into scattered pieces and mixes them into different files before leaking it, the embodiment of the present application can still identify the leaked content accurately.
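The coarse-to-fine search described above can be sketched as follows. This is only an illustration under stated assumptions: the levels are passed in as already-divided lists of blocks, the similarity measure is supplied by the caller, and the name `find_leak_level` and the exact-match stand-in used in the demonstration are hypothetical rather than part of the application.

```python
from typing import Callable, Optional, Sequence


def find_leak_level(
    target_levels: Sequence[Sequence[str]],
    candidate_levels: Sequence[Sequence[str]],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> Optional[tuple[int, int, int]]:
    """Search level by level, from the coarsest division to the finest.

    Returns (level, target_index, candidate_index), all 1-based, for the
    first pair whose similarity exceeds the threshold, or None if no level
    contains such a pair.
    """
    for level, (t_blocks, c_blocks) in enumerate(
        zip(target_levels, candidate_levels), start=1
    ):
        for ti, t_block in enumerate(t_blocks, start=1):
            for ci, c_block in enumerate(c_blocks, start=1):
                if similarity(t_block, c_block) > threshold:
                    return level, ti, ci
        # No pair matched here, so the next (finer) level becomes the target.
    return None


# Demonstration with an exact-match stand-in for the similarity measure.
def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


target = [["ab", "cd"], ["a", "b", "c", "d"]]
candidate = [["xb", "cy"], ["x", "b", "c", "y"]]
print(find_leak_level(target, candidate, exact, 0.9))  # (2, 2, 2)
```

In the demonstration no first-level block matches, so the search descends to the second level, where the second target block equals the second candidate block.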
Another embodiment of the present application provides a specific method for comparing the similarity between a file block to be compared and a plurality of reference file blocks. Fig. 5 is a schematic diagram for comparing similarity between a file block to be compared and a plurality of reference file blocks according to an embodiment of the present application, and as shown in fig. 5, a SIMHASH algorithm is adopted to calculate a first hash value of the file block to be compared and a second hash value of each of the plurality of reference file blocks. And sequentially comparing the similarity of the first hash value and each second hash value. And when a target second hash value with the similarity to the first hash value larger than a second preset threshold exists, determining that the similarity between the reference file block corresponding to the target second hash value and the file block to be compared is larger than the first preset threshold.
In an example of the present application, after the first hash value of the file block to be compared is calculated, the correspondence between the first hash value and the file block to be compared may be established and stored; after the second hash values of the plurality of reference file blocks are calculated, the one-to-one correspondence between the reference file blocks and the second hash values may likewise be established and stored. Then, when a target second hash value whose similarity to the first hash value is greater than the second preset threshold is identified, the reference file block whose similarity to the file block to be compared is greater than the first preset threshold is determined through the pre-stored correspondence.
Fig. 6 is a schematic diagram illustrating comparison of similarity between a target file and a candidate file according to an embodiment of the present application, and as shown in fig. 6, in addition to calculating a first hash value of a file block to be compared and calculating a second hash value of a reference file block, the embodiment of the present application further calculates hash values of the target file and the candidate file by using a SIMHASH algorithm, so as to compare the overall similarity between the target file and the candidate file.
The hash value of each file block is generated with the SIMHASH algorithm. If a file block has not been changed, the file block hash values are identical and identification succeeds. If a file block has been changed, the nature of the SIMHASH algorithm means that the hash value does not change completely; identification succeeds if the similarity of the hash values is above 90%. In addition, because the SIMHASH algorithm compresses the content to be compared, it also increases identification speed.
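A minimal sketch of a SimHash-style fingerprint and of the 90% criterion is given below. The application does not specify tokenization, feature weights, the fingerprint width, or how hash similarity is measured; the word tokens, equal weights, 64-bit width, SHA-256 token hash, and bit-agreement similarity used here are all assumptions made for illustration, and the function names are hypothetical.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash-style fingerprint over word tokens (equal weights).

    Assumption: bits is a multiple of 8 and at most 256, because the token
    hash is taken from the leading bytes of a SHA-256 digest.
    """
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the votes for it are positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hash_similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of bit positions on which two fingerprints agree."""
    return 1.0 - bin(a ^ b).count("1") / bits


def is_match(block_a: str, block_b: str, threshold: float = 0.9) -> bool:
    """Apply the 'similarity above 90%' criterion to two file blocks."""
    return hash_similarity(simhash(block_a), simhash(block_b)) > threshold
```

An unchanged block always yields an identical fingerprint and therefore a similarity of 1.0; for a modified block the similarity depends on how many tokens changed, so any threshold would need to be tuned on real data.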
In order to take control measures promptly after a leaked file is found, the embodiment of the present application provides the following method, in which, after identifying the leaked file, the leaked content, or the position of the leaked content, the terminal uses sound or icon information to prompt the user to delete the leaked file or content from the internet in time.
After determining that the candidate file is a leaked file associated with the target file, the method further comprises:
generating and outputting first prompt information for indicating the leakage file;
after determining the location of the leaked content associated with the target file in the candidate file and the location of the leaked content in the target file, the method further comprises at least one of:
generating and outputting second prompt information for indicating the leaked content;
generating and outputting third prompt information for indicating the position of the leaked content in the candidate file;
and generating and outputting fourth prompt information for indicating the position of the leaked content in the target file.
In an example of the present application, all modules of the file data tracing apparatus may also be integrated and arranged at corresponding positions to construct, at a terminal, a file data tracing system as shown in fig. 7; the system runs each module to execute a corresponding program, so as to implement the file data tracing method provided in other embodiments of the present application. Fig. 7 is a structural diagram of the file data tracing system in an example of the present application. Fig. 8 is a flowchart of a terminal executing the file data tracing method based on the file data tracing system in an example of the present application. As shown in fig. 7 and 8, when the terminal runs the file data tracing system, the splitting layer number unit receives the scale input by the user, and the file splitting unit performs the first step: using a binary tree, it splits the target file into target file blocks and the candidate file into candidate file blocks. A binary tree is an important type of tree structure: an ordered tree in which no node has more than two children.
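The binary-tree split can be sketched as below. The application does not state where the split points fall (paragraph, sentence, or character boundaries); halving by character count is an assumption made only to keep the example short, and the names `BlockNode`, `build_tree`, and `level_blocks` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockNode:
    label: str  # e.g. "1.2"; the root (the whole file) has an empty label
    content: str
    left: Optional["BlockNode"] = None
    right: Optional["BlockNode"] = None


def build_tree(content: str, scale: int, label: str = "") -> BlockNode:
    """Split content in half recursively, `scale` times (assumed split rule)."""
    node = BlockNode(label, content)
    if scale > 0:
        mid = len(content) // 2
        prefix = label + "." if label else ""
        node.left = build_tree(content[:mid], scale - 1, prefix + "1")
        node.right = build_tree(content[mid:], scale - 1, prefix + "2")
    return node


def level_blocks(root: BlockNode, level: int) -> list[BlockNode]:
    """Return the blocks of one level, left to right (level 0 is the root)."""
    nodes = [root]
    for _ in range(level):
        nodes = [
            child
            for n in nodes
            for child in (n.left, n.right)
            if child is not None
        ]
    return nodes


tree = build_tree("abcdefgh", scale=3)
print([n.label for n in level_blocks(tree, 2)])    # ['1.1', '1.2', '2.1', '2.2']
print([n.content for n in level_blocks(tree, 2)])  # ['ab', 'cd', 'ef', 'gh']
```

With a scale of 3 the tree has three levels below the root, holding 2, 4, and 8 blocks, and the blocks of any one level concatenate back to the complete file.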
With continuing reference to fig. 7 and 8, the file data tracing system includes an original file unit for storing the leaked target file, a candidate file unit for storing a candidate file obtained from the internet data resource pool, an original file block unit, and a candidate file block unit; the original file block unit is used for storing a target file block obtained by splitting a target file, and the candidate file block unit is used for storing a candidate file block obtained by splitting a candidate file.
With continued reference to fig. 7 and 8, when the terminal runs the file data tracing system, the hash value generating unit performs the second step: using the SIMHASH algorithm, it calculates the file hash value of the target file stored in the original file unit and the file hash value of the candidate file stored in the candidate file unit. The SIMHASH algorithm is also used to calculate the file block hash values of all target file blocks stored in the original file block unit and the file block hash values of all candidate file blocks stored in the candidate file block unit.
With continuing reference to fig. 7 and 8, the file data tracing system further includes a target file and target file block corresponding list unit, a candidate file and candidate file block corresponding list unit, a target file and file hash value corresponding list unit, a candidate file and file hash value corresponding list unit, a target file block and file block hash value corresponding list unit, and a candidate file block and file block hash value corresponding list unit. The target file and target file block corresponding list unit is used for storing the corresponding relation between a target file and a target file block, the candidate file and candidate file block corresponding list unit is used for storing the corresponding relation between a candidate file and a candidate file block, the target file and file hash value corresponding list unit is used for storing the file hash value of the target file, the candidate file and file hash value corresponding list unit is used for storing the hash value of the candidate file, the target file block and file block hash value corresponding list unit is used for storing the corresponding relation between the target file block and the file block hash value, and the candidate file block and file block hash value corresponding list unit is used for storing the corresponding relation between the candidate file block and the file block hash value.
With continuing reference to fig. 7 and 8, the hash value similarity setting unit is configured to set the second preset threshold and the first preset threshold. The hash value comparison unit compares the hash values layer by layer, traversing the file blocks of each layer in turn.
Fig. 9 is a first flowchart comparing hash values according to an embodiment of the present application, and fig. 10 is a second flowchart comparing hash values according to an embodiment of the present application. Referring to fig. 9, the file block hash value of the target file block of the first hierarchy is compared with the file block hash values of the candidate file blocks of the first hierarchy, and when the file block hash value comparison is performed at the first hierarchy, for any target file block of the first hierarchy, it is compared with each candidate file block of the first hierarchy one by one.
Assuming that A represents the target file, B represents the candidate file, x represents the level number, n represents the number of file blocks at level x, and Y represents the hash value comparison result, the similarity comparison between a target file block and the candidate file blocks can be expressed by equation (1):
$$Y = F(A_{x,n},\, B_{x,1 \ldots n}) \tag{1}$$
In the comparison process, when the similarity of the hash values is greater than the second preset threshold, the target file block corresponding to the hash value is obtained from the target file block and file block hash value corresponding list unit, and the candidate file block corresponding to the hash value is obtained from the candidate file block and file block hash value corresponding list unit.
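The comparison of equation (1) at a single level, followed by the lookup of the matching blocks, can be sketched as follows. The dictionaries stand in for the file block and file block hash value corresponding lists; taking the comparison function F to be the fraction of agreeing fingerprint bits is an assumption, and the name `compare_level` and the hand-picked 8-bit fingerprints are hypothetical.

```python
def compare_level(
    target_hashes: dict[str, int],
    candidate_hashes: dict[str, int],
    threshold: float,
    bits: int = 64,
) -> list[tuple[str, str, float]]:
    """Compare every target block hash with every candidate block hash.

    The dicts map block labels to fingerprints. Returns the
    (target_label, candidate_label, similarity) triples whose similarity
    exceeds the threshold, in traversal order.
    """
    matches = []
    for t_label, t_hash in target_hashes.items():
        for c_label, c_hash in candidate_hashes.items():
            # Assumed F: share of bit positions on which the hashes agree.
            y = 1.0 - bin(t_hash ^ c_hash).count("1") / bits
            if y > threshold:
                matches.append((t_label, c_label, y))
    return matches


# Toy 8-bit fingerprints, chosen by hand for illustration only.
target = {"1": 0b11110000, "2": 0b10101010}
candidate = {"1": 0b00001111, "2": 0b11110001}
print(compare_level(target, candidate, threshold=0.8, bits=8))
# [('1', '2', 0.875)]
```

Returning labels rather than block contents keeps the comparison step independent of how the blocks themselves are stored; the labels can then be resolved through whatever correspondence list the system maintains.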
With continued reference to fig. 7, the file data tracing system further includes a file leakage warning unit configured to output first prompt information indicating the leaked file, second prompt information indicating the leaked content, third prompt information indicating a location of the leaked content in the candidate file, and fourth prompt information indicating a location of the leaked content in the target file.
Based on the same inventive concept, the embodiment of the application provides a file data tracing device. Fig. 11 is a functional block diagram of a file data tracing apparatus according to an embodiment of the present application. As shown in fig. 11, the apparatus includes:
the dividing module 111 is configured to perform file division on a target file and a candidate file associated with the target file according to the same scale, so as to obtain respective multi-level division results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to each level; a plurality of file blocks of each level form a complete file, and the file division of the next level is carried out on the division result of the previous level;
an obtaining module 112, configured to obtain a file block to be compared belonging to a target level from the target file, and obtain a plurality of reference file blocks belonging to the target level from the candidate file;
a first determining module 113, configured to determine that the candidate file is a leakage file associated with the target file when a similarity between the file block to be compared and any reference file block in the multiple reference file blocks is greater than a first preset threshold.
Optionally, the apparatus further comprises:
a second determining module, configured to determine, according to a formation order of each hierarchical division result in the target file or the candidate file, a current hierarchical level in the target file and the candidate file as the target hierarchical level;
and a returning module, configured to determine, when the similarity between the file block to be compared and all of the reference file blocks in the multiple reference file blocks is not greater than a first preset threshold, a next level of the current level as the target level, and return to the step of acquiring the file block to be compared belonging to the target level from the target file and acquiring the multiple reference file blocks belonging to the target level from the candidate file.
Optionally, the apparatus further comprises:
the calculating module is used for calculating a first hash value of the file blocks to be compared by adopting a SIMHASH algorithm and calculating a second hash value of each of a plurality of reference file blocks;
the comparison module is used for sequentially comparing the similarity between the first hash value and each second hash value;
and the third determining module is used for determining that the similarity between the reference file block corresponding to the target second hash value and the file block to be compared is greater than the first preset threshold value when the target second hash value with the similarity greater than the second preset threshold value exists.
Optionally, the apparatus further comprises:
a fourth determining module, configured to determine, according to a position of the target level in the target file or the candidate file, and a position of a reference file block, in the target level, where a similarity between the reference file block and the file block to be compared is greater than a first preset threshold, a position of the leakage content associated with the target file in the candidate file, and a position of the leakage content in the target file.
Optionally, the apparatus further comprises:
the first output module is used for generating and outputting first prompt information used for indicating the leakage file;
the second output module is used for generating and outputting second prompt information used for indicating the leaked content;
a third output module, configured to generate and output third prompt information indicating a location of the leaked content in the candidate file;
and the fourth output module is used for generating and outputting fourth prompt information used for indicating the position of the leaked content in the target file.
Optionally, the first determining module includes:
the determining submodule is used for determining the overlapping degree of the candidate file and the target file according to the position of the target level in the target file or the candidate file when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is larger than a first preset threshold; wherein the degree of overlap includes paragraph level overlap, sentence level overlap, and phrase level overlap.
Based on the same inventive concept, another embodiment of the present application provides a readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the steps in the file data tracing method according to any of the above embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the method for tracing file data according to any of the above embodiments of the present application is implemented.
Since the device embodiment is basically similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive or descriptive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The file data tracing method, device, equipment and storage medium provided by the present application have been described in detail above. The description of the embodiments is only intended to help in understanding the method and the core idea of the present application. For a person skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the present application; in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A file data tracing method is characterized by comprising the following steps:
respectively carrying out file division on a target file and a candidate file associated with the target file according to the same scale to obtain respective multi-level division results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to each level; a plurality of file blocks of each level form a complete file, and the file division of the next level is carried out on the division result of the previous level;
acquiring file blocks to be compared, which belong to a target level, from the target file, and acquiring a plurality of reference file blocks, which belong to the target level, from the candidate file;
and when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining the candidate file as a leakage file associated with the target file.
2. The method of claim 1, further comprising:
determining the current hierarchy level in the target file and the candidate file as the target hierarchy level according to the forming sequence of each hierarchy dividing result in the target file or the candidate file;
and when the similarity between the file block to be compared and all the reference file blocks in the plurality of reference file blocks is not more than a first preset threshold, determining the next level of the current level as the target level, returning to the step of acquiring the file block to be compared belonging to the target level from the target file and acquiring the plurality of reference file blocks belonging to the target level from the candidate file.
3. The method of claim 1, further comprising:
calculating a first hash value of the file blocks to be compared by adopting a SIMHASH algorithm, and calculating a second hash value of each of a plurality of reference file blocks;
sequentially comparing the similarity of the first hash value and each second hash value;
and when a target second hash value with the similarity to the first hash value larger than a second preset threshold exists, determining that the similarity between the reference file block corresponding to the target second hash value and the file block to be compared is larger than the first preset threshold.
4. The method of claim 1, wherein after determining that the candidate file is a leakage file associated with the target file, the method further comprises:
and determining the position of the leakage content associated with the target file in the candidate file and the position of the leakage content in the target file according to the position of the target level in the target file or the candidate file and the position of a reference file block with the similarity to the file block to be compared being greater than a first preset threshold value in the target level.
5. The method of claim 4, wherein after determining that the candidate file is a leakage file associated with the target file, the method further comprises:
generating and outputting first prompt information for indicating the leakage file;
after determining the location of the leaked content associated with the target file in the candidate file and the location of the leaked content in the target file, the method further comprises at least one of:
generating and outputting second prompt information for indicating the leaked content;
generating and outputting third prompt information for indicating the position of the leaked content in the candidate file;
and generating and outputting fourth prompt information for indicating the position of the leaked content in the target file.
6. The method according to claim 1, wherein when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold, determining the candidate file as a leakage file associated with the target file comprises:
when the similarity between the file block to be compared and any reference file block in the plurality of reference file blocks is greater than a first preset threshold value, determining the overlapping degree of the candidate file and the target file according to the position of the target level in the target file or the candidate file; wherein the degree of overlap includes paragraph level overlap, sentence level overlap, and phrase level overlap.
7. A file data tracing apparatus, characterized in that the apparatus comprises:
the dividing module is used for respectively dividing a target file and a candidate file associated with the target file according to the same scale to obtain respective multi-level dividing results of the target file and the candidate file; each level division result comprises a plurality of file blocks corresponding to each level; a plurality of file blocks of each level form a complete file, and the file division of the next level is carried out on the division result of the previous level;
the acquisition module is used for acquiring file blocks to be compared, which belong to a target level, from the target file and acquiring a plurality of reference file blocks, which belong to the target level, from the candidate file;
the first determining module is configured to determine that the candidate file is a leakage file associated with the target file when the similarity between the file block to be compared and any reference file block in the multiple reference file blocks is greater than a first preset threshold.
8. The apparatus of claim 7, further comprising:
a second determining module, configured to determine, according to a formation order of each hierarchical division result in the target file or the candidate file, a current hierarchical level in the target file and the candidate file as the target hierarchical level;
and a returning module, configured to determine, when the similarity between the file block to be compared and all of the reference file blocks in the multiple reference file blocks is not greater than a first preset threshold, a next level of the current level as the target level, and return to the step of acquiring the file block to be compared belonging to the target level from the target file and acquiring the multiple reference file blocks belonging to the target level from the candidate file.
9. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1-6.
CN202110791648.5A 2021-07-13 2021-07-13 File data tracing method, device, equipment and storage medium Active CN113449341B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110791648.5A CN113449341B (en) 2021-07-13 2021-07-13 File data tracing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113449341A true CN113449341A (en) 2021-09-28
CN113449341B CN113449341B (en) 2024-07-12

Family

ID=77816083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110791648.5A Active CN113449341B (en) 2021-07-13 2021-07-13 File data tracing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113449341B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130101641A (en) * 2012-02-21 2013-09-16 윤경한 Methods for data protecting
JP2016212879A (en) * 2015-05-12 2016-12-15 株式会社リコー Information processing method and information processing apparatus
CN109101572A (en) * 2018-07-17 2018-12-28 何晓行 Blockchain-based evidence storage method, apparatus, server and storage medium
CN112182604A (en) * 2020-09-23 2021-01-05 恒安嘉新(北京)科技股份公司 File detection system and method
CN112632952A (en) * 2020-12-08 2021-04-09 中国建设银行股份有限公司 Method and device for comparing files

Also Published As

Publication number Publication date
CN113449341B (en) 2024-07-12

Similar Documents

Publication Publication Date Title
US20200257543A1 (en) Aggregate Features For Machine Learning
US9542477B2 (en) Method of automated discovery of topics relatedness
CN105989040B (en) Intelligent question and answer method, device and system
US8645298B2 (en) Topic models
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN107391682B (en) Knowledge verification method, knowledge verification apparatus, and storage medium
CN110019668A (en) A kind of text searching method and device
CN109471889B (en) Report accelerating method, system, computer equipment and storage medium
CN113515589A (en) Data recommendation method, device, equipment and medium
CN110990523A (en) Legal document determining method and system
CN117251879A (en) Secure storage and query method and system based on trust extension and computer storage medium
CN113901783B (en) Domain-oriented document duplication checking method and system
Amreen et al. A methodology for measuring floss ecosystems
CN117077679B (en) Named entity recognition method and device
Peeperkorn et al. Conformance checking using activity and trace embeddings
CN113449341B (en) File data tracing method, device, equipment and storage medium
CN116028626A (en) Text matching method and device, storage medium and electronic equipment
Ledel et al. Broccoli: Bug localization with the help of text search engines
CN111949783A (en) Question and answer result generation method and device in knowledge base
Lindawati et al. Automated parameter tuning framework for heterogeneous and large instances: Case study in quadratic assignment problem
CN105824871A (en) Picture detecting method and equipment
CN111143582A (en) Multimedia resource recommendation method and device for updating associative words in real time through double indexes
CN116225770B (en) Patch matching method, device, equipment and storage medium
CN116992111B (en) Data processing method, device, electronic equipment and computer storage medium
Asthana et al. ML Model Change Detection and Versioning Service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant