CN111858486A - File classification method and device - Google Patents

File classification method and device

Info

Publication number
CN111858486A
Authority
CN
China
Prior art keywords
file
fingerprint
label
local
meta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010631285.4A
Other languages
Chinese (zh)
Inventor
陈少涵
胡立中
李仕毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Skyguard Network Security Technology Co ltd
Original Assignee
Beijing Skyguard Network Security Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Skyguard Network Security Technology Co ltd
Priority to CN202010631285.4A
Publication of CN111858486A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/16 Program or content traceability, e.g. by watermarking

Abstract

The invention discloses a file classification method and device, and relates to the technical field of computers. One embodiment of the method comprises: in response to a tag query operation on a target file, acquiring the file fingerprint of the target file and determining, in a local fingerprint library, similar file fingerprints whose similarity to the file fingerprint exceeds a preset similarity threshold; acquiring the meta-information corresponding to the similar file fingerprints and determining labels according to the label identifiers in the meta-information to obtain a first label set; transmitting the file fingerprint to a server for a label query and receiving a second label set returned by the server; and taking the union of the first label set and the second label set to obtain the labeled set of the target file, and determining the category of the target file according to the labels in the labeled set. Because the file fingerprint depends only on the file content, the method overcomes the limitation that existing labeling works only for specific file types; and because the target file is marked with the labels of related files, file classification accuracy is improved.

Description

File classification method and device
Technical Field
The invention relates to the technical field of computers, in particular to a file classification method and device.
Background
In recent years, the computer security industry has gradually shifted from its early focus on network security to data security. One direction of data security is data classification: data is divided into classes with different security levels, and a different security policy is applied to each level. On this basis, data classification tools have emerged, such as non-user-driven machine learning (classification and clustering algorithms) and user-driven file labels/marks.
This scheme mainly concerns user-driven file labels/marks: files are managed according to the labels they carry, and the current ways of operating on a file's labels include adding, deleting and updating labels.
In the process of implementing the invention, the inventor finds that the prior art has at least the following problems:
1. the application range is limited: label management operations can be performed only on files of specific types (such as doc, docx, pdf, jpg and mp4);
2. labels are added or deleted manually, and the error rate is high. For example, a user marks file a with tag 01 to identify it as a financial file, which is a generally sensitive file; for a file b that is similar to file a, however, file b is not shown as carrying tag 01, so marking file b with tag 01 requires a further manual operation, which is cumbersome.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for classifying documents, which can at least solve the problems in the prior art that the types of marked documents are limited and manual marking is relied on.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a file classifying method including: responding to the operation of a query tag of a target file, acquiring a file fingerprint of the target file, and determining a similar file fingerprint of which the similarity with the file fingerprint exceeds a preset similarity threshold in a local fingerprint library; acquiring meta-information corresponding to the similar file fingerprint, and determining labels according to label identifications in the meta-information to obtain a first label set; transmitting the file fingerprint to a server for label query so as to receive a second label set returned by the server; and acquiring a union set of the first label set and the second label set to obtain a labeled set of the target file, and determining the category of the target file according to the labels in the labeled set.
Optionally, the local fingerprint library includes a first fingerprint library and a second fingerprint library; the acquiring of the file fingerprint of the target file includes: generating a file fingerprint according to the file content of the target file, wherein the file fingerprint comprises a first fingerprint and a second fingerprint, and the first fingerprint is obtained by processing the file content with a message digest algorithm; the determining of similar file fingerprints in the local fingerprint library whose similarity to the file fingerprint exceeds a preset similarity threshold includes: calculating the similarity between the first fingerprint and the fingerprints in the first fingerprint library, and determining a first similar fingerprint whose similarity exceeds a first preset similarity threshold; calculating the similarity between the second fingerprint and the fingerprints in the second fingerprint library, and determining a second similar fingerprint whose similarity exceeds a second preset similarity threshold; the acquiring of the meta-information corresponding to the similar file fingerprint includes: acquiring first meta-information corresponding to the first similar fingerprint and second meta-information corresponding to the second similar fingerprint.
Optionally, the second fingerprint includes sub-fingerprints, each generated by processing a word segmented from the file content with a message digest algorithm; the calculating of the similarity between the second fingerprint and the fingerprints in the second fingerprint library comprises: calculating the similarity between each sub-fingerprint in the second fingerprint and each sub-fingerprint of one fingerprint in the second fingerprint library, and summing the similarities to obtain the similarity between the second fingerprint and that fingerprint.
Optionally, the obtaining first meta-information corresponding to the first similar fingerprint and second meta-information corresponding to the second similar fingerprint further includes: determining the file volume of the target file, and acquiring first meta-information corresponding to the first similar fingerprint and the file volume; and determining a file suffix and a file type of the target file, and acquiring second meta-information corresponding to the second similar fingerprint, the file suffix and the file type.
Optionally, the method further includes: acquiring the fully qualified domain name (FQDN) of the client, and querying meta-information from a local file information library in combination with the file path and file type of the target file, so as to determine labels according to the label identifiers in the queried meta-information and obtain a third label set; the taking of the union of the first label set and the second label set comprises: taking the union of the first label set, the second label set and the third label set.
Optionally, the method further includes: sending an authentication request to a server to authenticate the user name in the authentication request through the server to obtain a fourth tag set of the user name with operation authority; the obtaining of the labeled set of the target file further comprises: and taking intersection of the labeled set and the fourth label set returned by the server side to obtain a fifth label set of the user name having operation authority on the target file.
Optionally, the method further includes: and marking the labels in the fourth label set, wherein the labels in the fifth label set are marked, the rest labels are used as the non-marked labels, and the processed fourth label set is displayed.
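The permission filtering and display marking described in the two preceding paragraphs can be illustrated with plain set operations. The following is a minimal sketch rather than the patented implementation; the function name `filter_and_flag` and the sample labels are hypothetical.

```python
def filter_and_flag(labeled_set, permitted_tags):
    """Return (fifth_set, display_rows) for a target file.

    labeled_set    -- labels already found for the file (the labeled set)
    permitted_tags -- labels the user may operate on (the fourth tag set)
    """
    # Fifth tag set: labels on the file that the user is allowed to operate on.
    fifth_set = labeled_set & permitted_tags
    # Display every permitted label, flagged as marked (True) or unmarked (False).
    display_rows = [(tag, tag in fifth_set) for tag in sorted(permitted_tags)]
    return fifth_set, display_rows


# Hypothetical sample data.
fifth, rows = filter_and_flag({"finance", "hr"}, {"finance", "legal"})
print(fifth)  # {'finance'}
print(rows)   # [('finance', True), ('legal', False)]
```

Labels outside the user's permission (here "hr") are neither displayed nor returned, which matches taking the intersection before display.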
Optionally, after displaying the processed fourth tag set, the method includes: in response to a marking operation on one unmarked label, generating an operation log for marking the target file with the label, storing the operation log in a local log library, and storing the file information corresponding to the target file and the marked label in a local file information library; and/or in response to a marking removal operation on one marked file, generating an operation log for removing the marked label from the target file, storing the operation log in a local log library, and deleting the file information corresponding to the target file and the removed label from the local file information library.
Optionally, the method further includes: and responding to the marking operation of an untagged file or the marking removing operation of a tagged file, generating a file fingerprint according to the file content in the target file, and storing the file fingerprint in a local fingerprint library together with the meta information of the target file.
Optionally, the method further includes: downloading the latest operation log corresponding to the user name from a server side, and acquiring a hash value in the latest operation log; wherein, the hash value is obtained by processing the operation log; determining an operation log in a local log library according to the identification of the file in the latest operation log, and acquiring a last hash value in the operation log; the last hash value is the hash value of the last operation log positioned in the operation log; comparing whether the hash value is consistent with the last hash value or not, if so, uploading the file fingerprint and the operation log corresponding to the identifier in the local fingerprint library and the local log library to the server; and if not, updating the local fingerprint library and the local log library based on the file fingerprint and the operation log which are pulled from the server and correspond to the identification.
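The upload-or-pull decision in this synchronization step can be sketched as a single comparison. The sketch below rests on two assumptions that the text does not settle: "last hash value" is read as the hash of the preceding log entry, stored in each local entry under a hypothetical `prev_hash` field, and a file with no local log entries is treated as inconsistent. All names are illustrative.

```python
def choose_sync_action(server_latest_log, local_logs):
    """Decide whether the client should upload to or pull from the server.

    server_latest_log -- dict with "file_id" and "hash" (latest log on the server)
    local_logs        -- list of dicts with "file_id", "hash" and "prev_hash",
                         ordered from oldest to newest
    """
    file_id = server_latest_log["file_id"]
    local = [log for log in local_logs if log["file_id"] == file_id]
    if not local:
        # Assumption: nothing local for this file, so take the server's state.
        return "pull"
    # Consistent: the newest local entry follows directly on the server's latest one.
    if server_latest_log["hash"] == local[-1]["prev_hash"]:
        return "upload"
    return "pull"


local_logs = [{"file_id": "f1", "hash": "h2", "prev_hash": "h1"}]
print(choose_sync_action({"file_id": "f1", "hash": "h1"}, local_logs))  # upload
print(choose_sync_action({"file_id": "f1", "hash": "h9"}, local_logs))  # pull
```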
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a file sorting apparatus including: the fingerprint searching module is used for responding to the operation of an inquiry label of a target file, acquiring the file fingerprint of the target file, and determining the similar file fingerprint of which the similarity with the file fingerprint in a local fingerprint database exceeds a preset similarity threshold; the first tag module is used for acquiring meta-information corresponding to the similar file fingerprint and determining tags according to tag identifications in the meta-information to obtain a first tag set; the second label module is used for transmitting the file fingerprint to a server side for label query so as to receive a second label set returned by the server side; and the label processing module is used for acquiring a union set of the first label set and the second label set to obtain a labeled set of the target file, and determining the category of the target file according to the labels in the labeled set.
Optionally, the local fingerprint library includes a first fingerprint library and a second fingerprint library; the apparatus further comprises a fingerprint generation module for: generating a file fingerprint according to the file content of the target file, wherein the file fingerprint comprises a first fingerprint and a second fingerprint, and the first fingerprint is obtained by processing the file content with a message digest algorithm; the fingerprint searching module is used for: calculating the similarity between the first fingerprint and the fingerprints in the first fingerprint library, and determining a first similar fingerprint whose similarity exceeds a first preset similarity threshold; calculating the similarity between the second fingerprint and the fingerprints in the second fingerprint library, and determining a second similar fingerprint whose similarity exceeds a second preset similarity threshold; the first tag module is configured to: acquire first meta-information corresponding to the first similar fingerprint and second meta-information corresponding to the second similar fingerprint.
Optionally, the second fingerprint includes sub-fingerprints, each generated by processing a word segmented from the file content with a message digest algorithm; the fingerprint searching module is used for: calculating the similarity between each sub-fingerprint in the second fingerprint and each sub-fingerprint of one fingerprint in the second fingerprint library, and summing the similarities to obtain the similarity between the second fingerprint and that fingerprint.
Optionally, the first tag module is further configured to: determining the file volume of the target file, and acquiring first meta-information corresponding to the first similar fingerprint and the file volume; and determining a file suffix and a file type of the target file, and acquiring second meta-information corresponding to the second similar fingerprint, the file suffix and the file type.
Optionally, the apparatus further comprises a third tag module, configured to: acquire the fully qualified domain name (FQDN) of the client, and query meta-information from a local file information library in combination with the file path and file type of the target file, so as to determine labels according to the label identifiers in the queried meta-information and obtain a third label set; the tag processing module is configured to: take the union of the first label set, the second label set and the third label set.
Optionally, the system further comprises a tag filtering module, configured to: sending an authentication request to a server to authenticate the user name in the authentication request through the server to obtain a fourth tag set of the user name with operation authority; and taking intersection of the labeled set and the fourth label set returned by the server side to obtain a fifth label set of the user name having operation authority on the target file.
Optionally, the apparatus further includes a tag display module, configured to: and marking the labels in the fourth label set, wherein the labels in the fifth label set are marked, the rest labels are used as the non-marked labels, and the processed fourth label set is displayed.
Optionally, a marking/deleting module is included for: responding to the marking operation of one unmarked label, generating an operation log for marking the label of the target file, storing the operation log into a local log library, and storing the file information corresponding to the target file and the marking label into a local file information library; and/or responding to the marking removal operation of one marked file, generating an operation log for removing the marked label from the target file, storing the operation log into a local log library, and deleting file information corresponding to the target file and the removed marked label from a local file information library.
Optionally, the marking/deleting module is further configured to: and responding to the marking operation of an untagged file or the marking removing operation of a tagged file, generating a file fingerprint according to the file content in the target file, and storing the file fingerprint in a local fingerprint library together with the meta information of the target file.
Optionally, the system further includes an information synchronization module, configured to: downloading the latest operation log corresponding to the user name from a server side, and acquiring a hash value in the latest operation log; wherein, the hash value is obtained by processing the operation log; determining an operation log in a local log library according to the identification of the file in the latest operation log, and acquiring a last hash value in the operation log; the last hash value is the hash value of the last operation log positioned in the operation log; comparing whether the hash value is consistent with the last hash value or not, if so, uploading the file fingerprint and the operation log corresponding to the identifier in the local fingerprint library and the local log library to the server; and if not, updating the local fingerprint library and the local log library based on the file fingerprint and the operation log which are pulled from the server and correspond to the identification.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided an electronic device for file classification.
The electronic device of the embodiment of the invention comprises: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the file classification methods described above.
To achieve the above object, according to a further aspect of the embodiments of the present invention, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements any of the file classification methods described above.
According to the scheme provided by the invention, one embodiment of the invention has the following advantages or beneficial effects: a file fingerprint generated from the file content is independent of file format and type, so only the relationship between file contents needs to be considered; the original file is not modified; file tracking and management are facilitated; and marking accuracy is improved.
Further effects of the above optional implementations will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic main flow chart of a document classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an alternative document classification method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart diagram of an alternative document classification method according to an embodiment of the invention;
FIG. 4 is a schematic flow chart diagram illustrating an alternative document classification method according to an embodiment of the present invention;
FIG. 5A is a schematic diagram of a specific operation that triggers a tag query;
FIG. 5B is a schematic diagram of a specific display of the marked and unmarked labels of a target file;
FIG. 6 is a flowchart illustrating an alternative document classification method according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating an alternative document classification method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the main modules of a document sorting apparatus according to an embodiment of the present invention;
FIG. 9 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 10 is a schematic block diagram of a computer system suitable for use with a mobile device or server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The label in the scheme means the category to which the file belongs, for example, if the content of the file is a financial related file, the corresponding label is "financial"; it is understood that a file may belong to multiple categories, that is, have multiple tags, for example, a file belongs to both a development document and a product requirement document, and then has corresponding tags "tag 01" and "tag 02", where "tag 01" and "tag 02" refer to a "development class document" and a "product requirement class document", respectively.
The label management agent in the scheme is a process running on the client and used for managing operations (such as addition, deletion, modification and query) of a user on a file label on the client, and synchronizing and uploading a file fingerprint label. The label management server is a process running on the server and used for managing label operations of all users on all clients, such as a label of which the user has operation authority, a fingerprint label of a file, file information of a marked label, and an operation log.
Referring to fig. 1, a main flowchart of a file classification method provided in an embodiment of the present invention is shown, including the following steps:
S101: responding to the operation of a query tag of a target file, acquiring a file fingerprint of the target file, and determining a similar file fingerprint of which the similarity with the file fingerprint exceeds a preset similarity threshold in a local fingerprint library;
S102: acquiring meta-information corresponding to the similar file fingerprint, and determining labels according to label identifications in the meta-information to obtain a first label set;
S103: transmitting the file fingerprint to a server for label query so as to receive a second label set returned by the server;
S104: acquiring a union set of the first label set and the second label set to obtain a labeled set of the target file, and determining the category of the target file according to the labels in the labeled set.
In the above embodiment, for step S101, the scheme relies on the user's tag query operation on the target file to trigger the tag query. Referring to FIG. 5A below, the user selects the file "stars", right-clicks and selects the fingerprint tag entry, thereby triggering a tag query operation on the file. Other triggering modes may also be used, provided the client can interact with the user, such as a Word plug-in or other user interaction; the scheme does not limit the specific triggering mode.
The file fingerprint of the target file is generated from the file content, so file attributes such as the file type and file suffix need not be considered. The local fingerprint library stores the file fingerprints of a number of files, and the similar file fingerprints, that is, those with high similarity to the file fingerprint of the target file, are determined by calculating the similarity between fingerprints.
For step S102, the searched file corresponding to the similar file fingerprint generally has a certain relationship with the target file, such as a variant file or a copy file, where the variant file is a new file obtained by modifying the file content of the original file, such as adding new content, deleting part of the original content, and the like; the copied file has the same file content as the original file, but the file name may be different.
Besides the file fingerprints, the local fingerprint library also stores file meta-information bound to each file fingerprint, including the file size, access time, modification time, file path, file name, file MD5, label IDs and the like. Note, however, that a label ID is usually added only after the file has been given the label: for example, the meta-information of the original file includes the file size, access time, modification time, file path, file name and file MD5, and the label ID is added when the file is stored in the local fingerprint library.
Therefore, after similar file fingerprints are acquired, the meta-information of the similar file fingerprints can be determined based on the binding relationship between the file fingerprints and the meta-information. And then, determining the tags according to the tag IDs in the meta-information to obtain a local query result, namely a first tag set.
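The lookup from similar fingerprints to labels can be sketched as two dictionary lookups. This is a minimal illustration under assumed data structures: the `FileMeta` record holds only a subset of the meta-information fields listed above, and the dictionaries standing in for the local fingerprint library and the label table are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    """Subset of the meta-information bound to one file fingerprint."""
    file_path: str
    file_name: str
    file_size: int
    file_md5: str
    tag_ids: list = field(default_factory=list)  # added once the file is labeled


def first_tag_set(similar_fingerprints, meta_by_fingerprint, tag_names):
    """Resolve similar fingerprints to labels via their bound meta-information."""
    tags = set()
    for fingerprint in similar_fingerprints:
        meta = meta_by_fingerprint.get(fingerprint)
        if meta is None:
            continue  # no meta-information bound to this fingerprint
        for tag_id in meta.tag_ids:
            if tag_id in tag_names:
                tags.add(tag_names[tag_id])
    return tags


# Hypothetical sample data.
library = {"fp-a": FileMeta("/docs/a.docx", "a.docx", 1024, "md5-a", ["01"])}
names = {"01": "finance"}
print(first_tag_set(["fp-a", "fp-x"], library, names))  # {'finance'}
```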
For steps S103 and S104, a remote query is performed in addition to the local query. The remote query is executed by the server and can be switched on or off, preferably through a setting on the client. The label management agent transmits the file fingerprint of the target file to the server, and the server performs the similar fingerprint query, the meta-information query and the label query to obtain a second label set.
In addition, after the file fingerprint of the target file is acquired, the file fingerprint can be sent to the server for query after being queried locally, and can also be directly sent to the server, so that the local and the server can synchronously perform label query operation.
After receiving the second label set returned by the server, the label management agent merges it with the locally queried first label set and removes duplicates (that is, takes the union) to obtain the labeled set of the target file; the labels in the labeled set can be displayed as a list. For example, target file b is formed by adding new content to file a, so the file fingerprint of file b is similar to that of file a; a label query based on the meta-information of file a then yields the label set of file a, and if file a carries the financial label, the labeled set of target file b also contains the financial label.
According to the method provided by this embodiment, the file fingerprint depends only on the file content, which overcomes the limitation that existing marking works only for specific file types and leaves the original file unaffected; and because labels are queried by file fingerprint, the target file is marked with the labels of related files, which improves marking accuracy.
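The merge of the local and remote query results, including the client-side switch for the remote query, can be sketched as follows. This is an illustration only: `query_local` and `query_server` are hypothetical callables standing in for the local fingerprint library lookup and the server request, and the sample labels are invented.

```python
def labeled_set_for(file_fingerprint, query_local, query_server, remote_enabled=True):
    """Union of the locally and remotely queried label sets for one fingerprint."""
    first = set(query_local(file_fingerprint))
    # The remote query is optional and controlled by a client-side switch.
    second = set(query_server(file_fingerprint)) if remote_enabled else set()
    # Taking the union merges both results and removes duplicates.
    return first | second


# Hypothetical sample data: file b is a variant of file a, which is labeled "finance".
local_index = {"fp-b": ["finance"]}
server_index = {"fp-b": ["finance", "confidential"]}

result = labeled_set_for(
    "fp-b",
    lambda fp: local_index.get(fp, []),
    lambda fp: server_index.get(fp, []),
)
print(sorted(result))  # ['confidential', 'finance']
```

Sorting the set before display gives a stable list order, which suits the list presentation mentioned above.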
Referring to fig. 2, a main flowchart of an alternative file classification method provided in an embodiment of the present invention is shown, including the following steps:
S201: in response to a tag query operation on a target file, generating a file fingerprint according to the file content of the target file; the file fingerprint comprises a first fingerprint and a second fingerprint, the first fingerprint is obtained by processing the file content with a message digest algorithm, the second fingerprint comprises sub-fingerprints, and each sub-fingerprint is generated by processing a word segmented from the file content with the message digest algorithm;
S202: calculating the similarity between the first fingerprint and the fingerprints in a first fingerprint library, and determining a first similar fingerprint whose similarity exceeds a first preset similarity threshold; the local fingerprint library comprises the first fingerprint library and a second fingerprint library;
S203: calculating the similarity between each sub-fingerprint in the second fingerprint and each sub-fingerprint of a fingerprint in the second fingerprint library, summing the similarities to obtain the similarity between the second fingerprint and that fingerprint, and determining a second similar fingerprint whose similarity exceeds a second preset similarity threshold;
S204: acquiring first meta-information corresponding to the first similar fingerprint, and determining a label according to a label identifier in the first meta-information;
S205: acquiring second meta-information corresponding to the second similar fingerprint, and determining a label according to a label identifier in the second meta-information;
S206: obtaining a first label set;
S207: transmitting the file fingerprint to a server for a label query so as to receive a second label set returned by the server;
S208: taking the union of the first label set and the second label set to obtain a labeled set of the target file, and determining the category of the target file according to the labels in the labeled set.
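The accumulation of sub-fingerprint similarities in the similarity-fingerprint comparison step can be sketched as a double loop. The steps above do not define how two sub-fingerprints are compared, so the sketch assumes the simplest rule, 1.0 for identical digests and 0.0 otherwise; the function names and sample values are hypothetical.

```python
def pair_similarity(sub_a, sub_b):
    """Assumed rule: identical sub-fingerprints score 1.0, all others 0.0."""
    return 1.0 if sub_a == sub_b else 0.0


def fingerprint_similarity(target_subs, stored_subs):
    """Sum the pairwise similarities between two sets of sub-fingerprints."""
    return sum(pair_similarity(t, s) for t in target_subs for s in stored_subs)


def find_similar(target_subs, library, threshold):
    """Return the ids of stored fingerprints whose similarity exceeds the threshold."""
    return [
        file_id
        for file_id, stored_subs in library.items()
        if fingerprint_similarity(target_subs, stored_subs) > threshold
    ]


# Hypothetical sample data: short strings stand in for real digests.
target = ["a", "b", "c"]
library = {"file_a": ["a", "b", "d"], "file_z": ["x", "y"]}
print(fingerprint_similarity(target, library["file_a"]))  # 2.0
print(find_similar(target, library, threshold=1.0))       # ['file_a']
```

With this rule the accumulated value is a raw count of matching pairs, so the threshold is a count rather than a ratio; a normalized score would need a different threshold.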
In the above embodiment, for steps S207 and S208, reference may be made to the descriptions of steps S103 and S104 shown in fig. 1, and details are not repeated here.
In the above embodiment, for step S201, the file fingerprint of the target file is generated based on the file content thereof, and includes the precise fingerprint (i.e. the first fingerprint) and the similarity fingerprint (i.e. the second fingerprint), where:
1) the precise fingerprint is generated from the entire file content, typically with MD5 (Message-Digest Algorithm 5), and the result is a fixed-length character string (an MD5 digest is 128 bits, usually written as 32 hexadecimal characters);
2) the similarity fingerprint is obtained by first performing word segmentation on the file content and then hashing each word segment, for example with MD5, so that the result is a set of sub-fingerprints. For example, if the file content yields 100 word segments, each segment is processed with MD5 to obtain 100 sub-fingerprints, and the set of these 100 sub-fingerprints is the similarity fingerprint.
Alternatively, after the word segments are obtained, they can first be concatenated, and the concatenated result is then hashed to obtain the similarity fingerprint. Before concatenation, the word segments may also be preprocessed, for example by calculating a weight for each word segment from its number of occurrences and filtering out the segments with smaller weights.
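The two fingerprint types described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and a simple regular-expression word splitter stands in for whatever word segmenter a real deployment would use.

```python
import hashlib
import re


def precise_fingerprint(content: bytes) -> str:
    """First fingerprint: a single MD5 digest over the whole file content."""
    return hashlib.md5(content).hexdigest()


def segment(text: str) -> list:
    # Assumption: splitting on word characters stands in for real word segmentation.
    return re.findall(r"\w+", text.lower())


def similarity_fingerprint(text: str) -> list:
    """Second fingerprint: one MD5 sub-fingerprint per word segment."""
    return [hashlib.md5(word.encode("utf-8")).hexdigest() for word in segment(text)]


content = "quarterly sales report for the sales team"
first = precise_fingerprint(content.encode("utf-8"))
second = similarity_fingerprint(content)
```

Here `first` is one 32-character hexadecimal string, while `second` holds one sub-fingerprint per word segment, so a small edit to the file changes `first` completely but leaves most entries of `second` intact.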
For steps S202 to S206, both the local fingerprint library and the remote fingerprint library store fingerprints separately by fingerprint type: for example, the first fingerprint library stores precise fingerprints and the second fingerprint library stores similarity fingerprints. The precise fingerprint is more sensitive than the similarity fingerprint: once the file content changes, the stored precise fingerprint becomes invalid and must be regenerated. The following describes fingerprint similarity calculation using the local fingerprint library as an example; the remote fingerprint library is handled similarly.
1) calculating the similarity between the precise fingerprint and each fingerprint in the first fingerprint library, and extracting the first similar fingerprint whose similarity reaches the first predetermined similarity threshold. Because the precise fingerprint places high demands on the query result, the first predetermined similarity threshold is usually set to 100%, so in a preferred embodiment at most one first similar fingerprint is found;
2) calculating the similarity between the similarity fingerprint and each fingerprint in the second fingerprint library, and extracting a second similar fingerprint whose similarity is greater than or equal to the second predetermined similarity threshold. Because the similarity fingerprint is a set of sub-fingerprints, the similarities between the sub-fingerprints of the target file and the sub-fingerprints of a fingerprint set in the library are calculated first, and these sub-fingerprint similarities are then summed to obtain the similarity between the two fingerprint sets.
Assuming that the target file similarity fingerprint comprises n sub-fingerprints, a fingerprint in the second fingerprint library comprises m sub-fingerprints, and the similarity between the two is (where R represents similarity):
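$$R = \sum_{i=1}^{n}\sum_{j=1}^{m} R_{ij}$$
(The original formula is an image, Figure BDA0002568881080000111, that is not reproduced in this text. The expression above follows the accumulation described in step S203, where $R_{ij}$ denotes the similarity between the $i$-th sub-fingerprint of the target file and the $j$-th sub-fingerprint of the library fingerprint; any normalization used in the original figure cannot be recovered here.)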
It should be noted that the precise fingerprint is small, fast to query, and can pin down a file quickly, but its range of application is narrow; the similarity fingerprint is larger and slower to query, but applies more widely. Using the two together improves query efficiency, query range, and query accuracy.
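The accumulation of sub-fingerprint similarities can be sketched as below. Two points are assumptions made only for illustration: each pair of sub-fingerprints is scored 1.0 when identical and 0.0 otherwise, and the accumulated sum is divided by the larger sub-fingerprint count so that it can be compared with a percentage-style threshold. The function names are hypothetical.

```python
def sub_similarity(a: str, b: str) -> float:
    # Assumption: two sub-fingerprints either match exactly (1.0) or not at all (0.0).
    return 1.0 if a == b else 0.0


def fingerprint_similarity(target: list, candidate: list) -> float:
    """Accumulate the n x m pairwise sub-fingerprint similarities."""
    total = sum(sub_similarity(a, b) for a in target for b in candidate)
    # Assumption: normalise by the larger count; the source does not state a normalisation.
    denom = max(len(target), len(candidate))
    return total / denom if denom else 0.0


def find_similar(target: list, library: dict, threshold: float) -> list:
    """Return the ids of library fingerprints whose similarity reaches the threshold."""
    return [fid for fid, subs in library.items()
            if fingerprint_similarity(target, subs) >= threshold]


target = ["a", "b", "c", "d"]
library = {"f1": ["a", "b", "c", "x"], "f2": ["y", "z"]}
matches = find_similar(target, library, threshold=0.7)
```

With these inputs, three of the four sub-fingerprints of `f1` match the target, so only `f1` passes a threshold of 0.7.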
To further narrow the query range and reduce the amount of computation, the obtained meta-information can be filtered based on the attributes of the target file, so as to identify the file selected by the user. For example:
1) a first similar fingerprint is found using the precise fingerprint, and if the size of the file corresponding to the first similar fingerprint equals the size of the target file, that file and the target file are regarded as the same file;
2) a second similar fingerprint is found using the similarity fingerprint, and if the file type and file suffix of the corresponding file are the same as those of the target file, that file and the target file are regarded as the same file.
It should be noted that the file type in this solution is the actual file type of the target file. For example, if the file 1.doc is renamed to 1.txt, it appears to be of type txt, but its true type is still doc.
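The attribute-based filtering can be illustrated with a short sketch. The `Meta` record and its field names are hypothetical; the source only states that file size is checked for precise-fingerprint hits and that file type and suffix are checked for similarity-fingerprint hits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    """Hypothetical meta-information record stored next to a fingerprint."""
    file_size: int
    suffix: str
    true_type: str   # actual file type, independent of the visible suffix
    tag_ids: tuple


def keep_precise_hit(hit: Meta, target: Meta) -> bool:
    # Precise-fingerprint hit: keep it only if the file sizes are equal.
    return hit.file_size == target.file_size


def keep_similar_hit(hit: Meta, target: Meta) -> bool:
    # Similarity-fingerprint hit: keep it only if true type and suffix both match.
    return hit.true_type == target.true_type and hit.suffix == target.suffix


target = Meta(file_size=2048, suffix=".txt", true_type="doc", tag_ids=())
hit_a = Meta(file_size=2048, suffix=".txt", true_type="doc", tag_ids=("01",))
hit_b = Meta(file_size=2048, suffix=".txt", true_type="txt", tag_ids=("02",))
kept = [h for h in (hit_a, hit_b) if keep_similar_hit(h, target)]
```

In this example `hit_b` is dropped because its true type is txt, even though its suffix matches the renamed target file.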
The method provided by the embodiment describes the types of the file fingerprints in detail, performs corresponding similar fingerprint calculation and meta-information determination based on different types of fingerprints, and filters the meta-information by combining the attribute information of the target file, so as to reduce the calculation workload.
Referring to fig. 3, a main flowchart of another alternative file classification method provided in the embodiment of the present invention is shown, which includes the following steps:
S301: in response to a tag query operation on a target file, acquiring a file fingerprint of the target file, and determining, in a local fingerprint library, a similar file fingerprint whose similarity to the file fingerprint exceeds a predetermined similarity threshold;
S302: acquiring meta-information corresponding to the similar file fingerprint, and determining labels according to the label identifiers in the meta-information to obtain a first label set;
S303: transmitting the file fingerprint to a server for a label query, and receiving a second label set returned by the server;
S304: acquiring the fully qualified domain name of the client, and querying meta-information from a local file information base using it together with the file path and file type of the target file, so as to determine labels according to the label identifiers in the queried meta-information and obtain a third label set;
S305: taking the union of the first label set, the second label set and the third label set to obtain the labeled set of the target file, and determining the category of the target file according to the labels in the labeled set.
In the above embodiment, the descriptions of steps S101 to S103 shown in fig. 1 can be referred to for steps S301 to S303, and are not repeated herein.
In the above embodiment, steps S304 and S305 address the fact that fingerprints may fail in actual operation. For this case, the solution further provides a local file information base at the client, which records the information of all files that the user has marked. Besides the meta-information of each file, other information is recorded, such as the tag operator, tag information, and device information.
The tag management agent queries matching meta-information using the following fields from the meta-information of the target file:
endpoint_fqdn=skyguard-PC.WORKGROUP
file_path=C:\Users\skyguard\Desktop\fs.txt
true_type=2
Here, endpoint_fqdn is the fully qualified domain name of the client, a character string formed by joining the host name and the domain name; file_path is the file path of the target file, including the file suffix; and true_type indicates the true type of the file. For example, if content is added to a file a to form a target file b, the meta-information of file a can be found using the endpoint_fqdn, file_path and true_type in the meta-information of target file b.
After the matching meta-information is found, tags are determined from the tag IDs it contains to obtain a third tag set; for example, the corresponding tags are determined from the tag IDs in the meta-information of file a. The first, second and third tag sets are then merged, deduplicated and filtered to obtain the tagged set of the target file.
The method provided by the above embodiment establishes the local file information base so that the tags marked on the target file can still be tracked when a fingerprint fails, for example when large-scale modification of the file content invalidates the fingerprint in the fingerprint library. It can be regarded as a supplement to, and a fault-tolerance mechanism for, fingerprint-based tags.
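The lookup by endpoint_fqdn, file_path and true_type can be sketched as a keyed query. The in-memory dictionary, the tag IDs and the tag catalogue below are hypothetical stand-ins for the local file information base; the source does not specify its storage format.

```python
# Hypothetical stand-in for the local file information base.
# Key: (endpoint_fqdn, file_path, true_type); value: tag IDs recorded for that file.
local_file_info = {
    ("skyguard-PC.WORKGROUP", r"C:\Users\skyguard\Desktop\fs.txt", 2): ["01", "02"],
}

# Hypothetical catalogue that resolves tag IDs to tag names.
tag_catalogue = {"01": "fingerprint tag 01", "02": "fingerprint tag 02"}


def third_tag_set(endpoint_fqdn: str, file_path: str, true_type: int) -> set:
    """Find meta-information by the three fields and resolve its tag IDs to tags."""
    tag_ids = local_file_info.get((endpoint_fqdn, file_path, true_type), [])
    return {tag_catalogue[t] for t in tag_ids if t in tag_catalogue}


tags = third_tag_set("skyguard-PC.WORKGROUP", r"C:\Users\skyguard\Desktop\fs.txt", 2)
```

Because the key does not involve the file content, the lookup still succeeds after the content has changed enough to invalidate the stored fingerprints.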
Referring to fig. 4, a main flowchart of another alternative file classification method provided in the embodiment of the present invention is shown, which includes the following steps:
S401: sending an authentication request to a server, so that the server authenticates the user name in the authentication request and a fourth label set, containing the labels that the user name has authority to operate, is obtained;
S402: in response to a tag query operation on a target file, acquiring a file fingerprint of the target file, and determining, in a local fingerprint library, a similar file fingerprint whose similarity to the file fingerprint exceeds a predetermined similarity threshold;
S403: acquiring meta-information corresponding to the similar file fingerprint, and determining labels according to the label identifiers in the meta-information to obtain a first label set;
S404: transmitting the file fingerprint to a server for a label query, and receiving a second label set returned by the server;
S405: acquiring the fully qualified domain name of the client, and querying meta-information from a local file information base using it together with the file path and file type of the target file, so as to determine labels according to the label identifiers in the queried meta-information and obtain a third label set;
S406: taking the union of the first label set, the second label set and the third label set, and taking the intersection of this union with the fourth label set returned by the server, to obtain a fifth label set containing the labels on the target file that the user name has authority to operate;
S407: determining the category of the target file according to the labels in the fifth label set.
In the above embodiment, for steps S402 to S404 and S407, reference may be made to the description of steps S101 to S104 shown in fig. 1, and for step S405, reference may be made to the description of step S304 shown in fig. 3; details are not repeated here.
In the above embodiment, for step S401, tags are classified, and different users have different operation authority over different tags. For example, an enterprise management department may operate tag 01, tag 02 and tag 03, while a general employee may only operate tag 01 and tag 02; tags tied to a department attribute may also be set, such as a logistics tag for the logistics department.
The tag management agent usually sends an authentication request to the server at the time of starting, and the request carries a user name. When the data classification scheme is deployed inside an enterprise, duplication of usernames for authentication does not typically occur. And the server side authenticates the user name, determines the label of the user name with the operation authority, and obtains a fourth label set.
In step S406, the tags queried by the server for the file fingerprint of the target file may include tags already tagged by other users for the target file, in addition to the tags already tagged by the current user. Therefore, the tag set obtained by integrating the first tag set, the second tag set and the third tag set may include tags already tagged to the target file by others, and some of the tags may not have the operation right of the current user.
Therefore, after the fourth tag set returned by the server is obtained, the intersection of the combined tag set and the fourth tag set can be taken, and only the tag of which the user name has the operation authority on the target file, that is, the fifth tag set, is determined. For example, the tag union is { tag 01, tag 02 and tag 06}, and the fourth tag set is { tag 01, tag 02, tag 03, tag 04 and tag 05}, and the intersection is taken to obtain the fifth tag set { tag 01, tag 02 }.
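The union and intersection in step S406 map directly onto set operations. The sketch below reuses the tag names from the example above; the split of the union across the first, second and third tag sets is an assumption made for illustration, and the function name is hypothetical.

```python
def fifth_tag_set(first: set, second: set, third: set, fourth: set) -> set:
    """Union of the three queried tag sets, restricted to tags the user may operate."""
    tag_union = first | second | third
    return tag_union & fourth


first = {"tag 01"}
second = {"tag 02", "tag 06"}   # may include tags marked by other users
third = {"tag 01"}
fourth = {"tag 01", "tag 02", "tag 03", "tag 04", "tag 05"}

fifth = fifth_tag_set(first, second, third, fourth)
no_authority = (first | second | third) - fourth
```

Here `fifth` contains tag 01 and tag 02, and `no_authority` isolates tag 06, which is the tag that would be hidden, grayed out or otherwise set apart.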
In actual operation, instead of filtering out the tags for which the user has no operation authority, other handling may be adopted, for example graying out tag 06 and disabling its marking option, or adding a mark such as a pushpin after tag 06 so that it is displayed differently from the other tags.
In addition, because the user may subsequently perform tag operations (adding and deleting) on the target file, the other tags that the user has authority to operate may be displayed together with the fifth tag set, with the tags in the fifth tag set shown as marked. Referring to fig. 5A, after the user right-clicks the file "star" and selects the fingerprint tag menu, fig. 5B shows all tags that the user has authority to operate, where the marked "fingerprint tag 01" indicates that the file has already been marked with that tag.
The method provided by the above embodiment filters the tagged set of the target file, determines and processes the tags of which the current user does not have the operation authority, and can display the tags of which the user has the operation authority and the tagged set of the target file together in consideration of the subsequent tag operation.
Referring to fig. 6, a schematic flow chart of another optional file classification method according to an embodiment of the present invention is shown, including the following steps:
S601: in response to a marking operation with an unmarked label or a mark-removal operation on a marked label, generating a file fingerprint from the file content of the target file, and storing the file fingerprint in a local fingerprint library together with the meta-information of the target file;
S602: generating an operation log for marking the label on the target file, and storing the operation log in a local log library;
S603: storing the file information corresponding to the target file and the marked label in a local file information base;
S604: generating an operation log for removing the marked label from the target file, and storing the operation log in the local log library;
S605: deleting the file information corresponding to the target file and the removed label from the local file information base.
In the above embodiment, for steps S601 to S605, the user may click an unmarked tag to mark the selected file, which triggers the marking action of the tag management agent, or select an already marked tag to remove it, which triggers the tag removal action of the tag management agent. Both can be triggered from the menu shown when the user right-clicks the file.
Whenever a tag is added to or removed from the target file, the file fingerprint is regenerated from the file content of the target file and stored in the local fingerprint library together with the meta-information of the target file. To speed up fingerprint queries, an inverted index of the fingerprints may also be built.
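One way to picture the inverted index is a mapping from each sub-fingerprint to the identifiers of the stored fingerprints that contain it, so that a query only scores candidates sharing at least one sub-fingerprint. The source does not describe the index layout; the class below is a hypothetical sketch.

```python
from collections import defaultdict


class FingerprintIndex:
    """Hypothetical inverted index: sub-fingerprint -> ids of fingerprints containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, fingerprint_id: str, sub_fingerprints: list) -> None:
        for sub in sub_fingerprints:
            self._postings[sub].add(fingerprint_id)

    def remove(self, fingerprint_id: str, sub_fingerprints: list) -> None:
        for sub in sub_fingerprints:
            self._postings[sub].discard(fingerprint_id)
            if not self._postings[sub]:
                del self._postings[sub]   # drop empty posting lists

    def candidates(self, sub_fingerprints: list) -> set:
        """Ids of stored fingerprints sharing at least one sub-fingerprint with the query."""
        found = set()
        for sub in sub_fingerprints:
            found |= self._postings.get(sub, set())
        return found


index = FingerprintIndex()
index.add("f1", ["a", "b", "c"])
index.add("f2", ["c", "d"])
hits = index.candidates(["c", "z"])
```

Only the candidates returned by the index need the full pairwise similarity computation, which is where the speed-up would come from.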
As can be seen from the foregoing description, the finally displayed tag set includes tags that are marked on the target file and tags that are not. In addition to the tag processing and file fingerprint generation described above, the method further includes:
1) in response to a marking operation by the user on any unmarked tag, generating file information corresponding to the target file and the marked tag and storing it in the local file information base, and generating an operation log corresponding to the target file and the marked tag and storing it in the local log library;
2) in response to a mark-removal operation by the user on any marked tag, generating an operation log corresponding to the target file and the removed tag and storing it in the local log library, and deleting the file information corresponding to the target file and the tag from the local file information base.
The method provided by the embodiment needs to adaptively modify the information in the local file information base and the local log base when the label of the target file is deleted or marked, so as to ensure the synchronization of the operation information.
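The bookkeeping for marking and mark removal can be sketched with two in-memory structures. The class, field names and record layout are hypothetical; the source only requires that each operation writes an operation log and that the file information base gains or loses the corresponding entry.

```python
import time


class LocalStores:
    """Hypothetical stand-ins for the local file information base and local log library."""

    def __init__(self):
        self.file_info = set()   # (file_path, tag_id) pairs currently marked
        self.op_logs = []        # append-only list of operation records

    def _log(self, action: str, file_path: str, tag_id: str, user: str) -> None:
        self.op_logs.append({"action": action, "file": file_path,
                             "tag": tag_id, "user": user, "time": time.time()})

    def mark(self, file_path: str, tag_id: str, user: str) -> None:
        # Marking: write an operation log, then record the file/tag pair.
        self._log("mark", file_path, tag_id, user)
        self.file_info.add((file_path, tag_id))

    def unmark(self, file_path: str, tag_id: str, user: str) -> None:
        # Mark removal: write an operation log, then delete the file/tag pair.
        self._log("unmark", file_path, tag_id, user)
        self.file_info.discard((file_path, tag_id))


stores = LocalStores()
stores.mark("fs.txt", "01", "skyguard")
stores.unmark("fs.txt", "01", "skyguard")
```

After both calls the file information base is empty again, while the log library keeps both records so that the history remains available for synchronization.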
Referring to fig. 7, a schematic flow chart of another optional file classification method according to an embodiment of the present invention is shown, including the following steps:
S701: downloading the latest operation log corresponding to the user name from a server, and acquiring the hash value in the latest operation log, where the hash value is obtained by processing the operation log;
S702: determining an operation log in a local log library according to the identifier of the file in the latest operation log, and acquiring the last hash value in that operation log, where the last hash value is the hash value of the operation log that precedes it;
S703: comparing whether the hash value is consistent with the last hash value;
S704: if they are consistent, uploading the file fingerprint and the operation log corresponding to the identifier in the local fingerprint library and the local log library to the server;
S705: if they are not consistent, updating the local fingerprint library and the local log library based on the file fingerprint and the operation log corresponding to the identifier that are pulled from the server.
In the above embodiment, in steps S701 to S705, the operation log in this embodiment represents that a user operates a certain label on a certain device for a certain file.
The operation log stores the operation information for the target file and the tag, together with a hash value obtained by processing that operation information. An operation log at the server side stores only its own hash value, whereas a locally stored operation log includes, in addition to its own hash value, the hash value of the previous operation log, i.e. the refer hash, which identifies the object being replaced, updated or referenced.
And the label management agent synchronizes the operation log from the server side according to the user name at regular intervals, and performs conflict resolution and combination with the operation log in the local log library. Specifically, for the same file, comparing whether the hash value of the latest operation log returned by the server is the same as the last hash value in the local operation log:
1) if they are the same, the latest operation log at the server is the previous operation log of the local file, and the local current operation log is an update based on it. The locally stored file fingerprint and operation log corresponding to the file are then uploaded to the server;
2) if they are different, the latest operation log at the server is not the previous operation log of the local file, and the local current operation log is invalid. The locally stored file fingerprint and file information corresponding to the file are deleted, and the file fingerprint and operation log of the file pulled from the server are stored, so that the valid information at the server is synchronized to the local side.
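The comparison between the server's latest hash and the local refer hash can be sketched as follows. The record layout, the use of MD5 over a JSON serialization, and the function names are assumptions for illustration; the source only says that each operation log carries a hash of its operation information and, locally, the hash of the previous log.

```python
import hashlib
import json


def entry_hash(operation: dict) -> str:
    # Assumption: the hash is an MD5 digest of the serialized operation information.
    payload = json.dumps(operation, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def make_entry(operation: dict, previous: dict = None) -> dict:
    """Local operation log: its own hash plus the refer hash of the previous log."""
    return {"operation": operation,
            "hash": entry_hash(operation),
            "refer_hash": previous["hash"] if previous else None}


def sync_action(server_latest_hash: str, local_log: list) -> str:
    """Return 'upload' if the local current log directly follows the server's latest log."""
    last_hash = local_log[-1]["refer_hash"] if local_log else None
    return "upload" if server_latest_hash == last_hash else "pull"


e1 = make_entry({"file": "fs.txt", "tag": "01", "action": "mark"})
e2 = make_entry({"file": "fs.txt", "tag": "01", "action": "unmark"}, previous=e1)
decision = sync_action(e1["hash"], [e1, e2])
```

In this example the server still holds `e1` as its latest log and the local `e2` refers to it, so the local state is uploaded; any other server hash would lead to a pull from the server instead.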
It should be noted that, at present, a server can only delete a marked file, that is, delete all tags on a certain file, and the operations of adding tags, changing tags, and querying tags on a certain file still need to be performed by a client, so that the tags of the file are mainly managed by a tag management agent.
Specifically:
1) for a file that has been deleted at the server but still exists locally, the file fingerprint in the local fingerprint library is deleted and an operation log is generated. The corresponding fingerprint inverted index entry may also be deleted;
2) for a file that exists at the server but not locally, the file fingerprint and operation log of the file downloaded from the server are stored in the local fingerprint library and the local log library, respectively;
3) for a conflicting file, the operation log at the server prevails. The file fingerprint corresponding to the conflicting file is deleted from the local fingerprint library, downloaded again from the server and stored in the local fingerprint library, the inverted index is rebuilt, and the corresponding operation log is updated;
4) for a file added or deleted locally, if the replaced object is the previous operation log and the hash of that previous operation log is the same as the hash of the latest operation log at the server, the operation log and file fingerprint of the local file are uploaded to the server.
In the method provided in the foregoing embodiment, the last hash value in the local operation log is compared with the hash value of the operation log at the service end, so as to determine whether the last update operation of the local operation log is the same as that recorded at the service end, and thus determine the way of synchronizing the file fingerprint and the operation log.
Compared with the prior art, the method provided by the embodiment of the invention has at least the following beneficial effects:
1) the file fingerprint generated from the file content is not limited by file format or type and considers only the relationship between file contents (such as identity, similarity or consistency); the original file is not damaged, file tracking and management are convenient, and marking accuracy is improved;
2) the file fingerprint comprises a precise fingerprint and a similarity fingerprint, the similarity is calculated differently for each fingerprint type, and the meta-information is screened using the attributes of the target file, so that the query range is reduced and the query accuracy is improved;
3) considering the possible failure condition of the file fingerprint, establishing a local file information base for tracking the marked label of the target file under the condition that the fingerprint fails; aiming at marking or deleting labels of files, locally stored information needs to be synchronously changed;
4) filtering the labeled set of the target file, determining and processing the labels of which the current user does not have the operation authority, and displaying the labels of which the user has the operation authority and the labeled set of the target file together by considering the subsequent label operation;
5) by comparing the last hash value in the local operation log with the hash value of the operation log at the server, it is determined whether the last update recorded locally is the same as that recorded at the server, and thus how the file fingerprint and operation log of the file are to be synchronized.
Referring to fig. 8, a schematic diagram of main modules of a file sorting apparatus 800 according to an embodiment of the present invention is shown, including:
a fingerprint searching module 801, configured to, in response to a tag query operation on a target file, obtain a file fingerprint of the target file, and determine, in a local fingerprint library, a similar file fingerprint whose similarity to the file fingerprint exceeds a predetermined similarity threshold;
a first tag module 802, configured to obtain meta information corresponding to the similar file fingerprint, and determine a tag according to a tag identifier in the meta information to obtain a first tag set;
a second tag module 803, configured to transmit the file fingerprint to a server for tag query, so as to receive a second tag set returned by the server;
and the tag processing module 804 is configured to obtain a union set of the first tag set and the second tag set to obtain a tagged set of the target file, and determine a category to which the target file belongs according to tags in the tagged set.
In the implementation device of the invention, the local fingerprint database comprises a first fingerprint database and a second fingerprint database; further included is a fingerprint generation module 805 (not shown) for: generating a file fingerprint according to the file content in the target file; wherein the file fingerprint comprises a first fingerprint and a second fingerprint, and the first fingerprint is obtained by processing the file content by using an information summarization algorithm; the fingerprint searching module 801 is configured to: calculating the similarity between the first fingerprint and fingerprints in the first fingerprint library, and determining a first similar fingerprint with the similarity exceeding a first preset similarity threshold; calculating the similarity between the second fingerprint and the fingerprints in the second fingerprint library, and determining a second similar fingerprint with the similarity exceeding a second preset similarity threshold; the first tag module 802 is configured to: and acquiring first meta-information corresponding to the first similar fingerprint and second meta-information corresponding to the second similar fingerprint.
In the implementation device of the invention, the second fingerprint comprises a sub-fingerprint, and the sub-fingerprint is generated by processing word segmentation of the file content by using an information abstract algorithm; the fingerprint searching module 801 is configured to: and calculating the similarity between each sub-fingerprint in the second fingerprint and each sub-fingerprint of one fingerprint in the second fingerprint library, and accumulating the sum of the similarities to obtain the similarity between the second fingerprint and the one fingerprint.
In the implementation apparatus of the present invention, the first tag module 802 is further configured to: determining the file volume of the target file, and acquiring first meta-information corresponding to the first similar fingerprint and the file volume; and determining a file suffix and a file type of the target file, and acquiring second meta-information corresponding to the second similar fingerprint, the file suffix and the file type.
In the device of the present invention, the device further includes a third label module 806 (not shown in the figure), configured to: acquire the fully qualified domain name of the client, and query meta-information from a local file information base using it together with the file path and file type of the target file, so as to determine labels according to the label identifiers in the queried meta-information and obtain a third label set; the tag processing module 804 is configured to: take the union of the first label set, the second label set and the third label set.
The apparatus further comprises a label filtering module 807 (not shown) for: sending an authentication request to a server to authenticate the user name in the authentication request through the server to obtain a fourth tag set of the user name with operation authority; and taking intersection of the labeled set and the fourth label set returned by the server side to obtain a fifth label set of the user name having operation authority on the target file.
In the device of the present invention, the device further includes a label display module 808 (not shown in the figure) for: and marking the labels in the fourth label set, wherein the labels in the fifth label set are marked, the rest labels are used as the non-marked labels, and the processed fourth label set is displayed.
The device further comprises a marking/deleting module 809 (not shown) for: in response to a marking operation with an unmarked label, generating an operation log for marking the label on the target file, storing the operation log in a local log library, and storing the file information corresponding to the target file and the marked label in a local file information base; and/or, in response to a mark-removal operation on a marked label, generating an operation log for removing the marked label from the target file, storing the operation log in the local log library, and deleting the file information corresponding to the target file and the removed label from the local file information base.
In the implementation apparatus of the present invention, the marking/deleting module 809 is further configured to: and responding to the marking operation of an untagged file or the marking removing operation of a tagged file, generating a file fingerprint according to the file content in the target file, and storing the file fingerprint in a local fingerprint library together with the meta information of the target file.
The apparatus further includes an information synchronization module 810 (not shown) for: downloading the latest operation log corresponding to the user name from a server side, and acquiring a hash value in the latest operation log; wherein, the hash value is obtained by processing the operation log; determining an operation log in a local log library according to the identification of the file in the latest operation log, and acquiring a last hash value in the operation log; the last hash value is the hash value of the last operation log positioned in the operation log; comparing whether the hash value is consistent with the last hash value or not, if so, uploading the file fingerprint and the operation log corresponding to the identifier in the local fingerprint library and the local log library to the server; and if not, updating the local fingerprint library and the local log library based on the file fingerprint and the operation log which are pulled from the server and correspond to the identification.
FIG. 9 illustrates an exemplary system architecture 900 to which embodiments of the invention may be applied.
As shown in fig. 9, the system architecture 900 may include terminal devices 901, 902, 903, a network 904, and a server 905 (by way of example only). The network 904 is the medium that provides communication links between the terminal devices 901, 902, 903 and the server 905. The network 904 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
A user may use the terminal devices 901, 902, 903 to interact with a server 905 over a network 904 to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 901, 902, 903.
The terminal devices 901, 902, 903 may be various electronic devices having display screens and supporting web browsing, and the server 905 may be a server providing various services.
It should be noted that the method provided by the embodiment of the present invention is generally executed by the server 905, and accordingly, the apparatus is generally disposed in the server 905.
It should be understood that the number of terminal devices, networks, and servers in fig. 9 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 10, a block diagram of a computer system 1000 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in FIG. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The RAM 1003 also stores various programs and data necessary for the operation of the system 1000. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as necessary, so that a computer program read from it can be installed into the storage section 1008 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1009 and/or installed from the removable medium 1011. When executed by the Central Processing Unit (CPU) 1001, the computer program performs the above-described functions defined in the system of the present invention.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor including a fingerprint lookup module, a first label module, a second label module, and a label processing module. The names of these modules do not, in some cases, limit the modules themselves; for example, the label processing module may also be described as a "label union module".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: in response to a label query operation on a target file, acquiring a file fingerprint of the target file, and determining, in a local fingerprint library, a similar file fingerprint whose similarity with the file fingerprint exceeds a preset similarity threshold; acquiring meta-information corresponding to the similar file fingerprint, and determining labels according to label identifiers in the meta-information to obtain a first label set; transmitting the file fingerprint to a server for label query, and receiving a second label set returned by the server; and taking the union of the first label set and the second label set to obtain a labeled set of the target file, and determining the category of the target file according to the labels in the labeled set.
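The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names, the dictionary layout of the local fingerprint library (stored fingerprint mapped to meta-information with a `label_ids` list), the toy position-wise similarity measure, and the `query_server` callable standing in for the server round trip are all hypothetical.

```python
from typing import Callable, Dict, List, Set


def position_similarity(a: str, b: str) -> float:
    """Toy similarity (an assumption): fraction of positions at which two
    equal-length strings agree."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def query_labeled_set(
    fingerprint: str,
    local_library: Dict[str, Dict[str, List[int]]],
    label_names: Dict[int, str],
    query_server: Callable[[str], Set[str]],
    threshold: float = 0.7,
) -> Set[str]:
    """Return the labeled set of a target file from its fingerprint."""
    # Step 1: find fingerprints in the local library whose similarity
    # exceeds the preset threshold.
    similar = [fp for fp in local_library
               if position_similarity(fingerprint, fp) > threshold]
    # Step 2: resolve the label identifiers in the meta-information to
    # labels, giving the first label set.
    first_set = {label_names[i]
                 for fp in similar
                 for i in local_library[fp]["label_ids"]
                 if i in label_names}
    # Step 3: ask the server for the second label set.
    second_set = set(query_server(fingerprint))
    # Step 4: the labeled set is the union of both sets.
    return first_set | second_set
```

For example, with a library holding fingerprint `"abcd"` tagged with label 1 (`"finance"`), querying `"abce"` matches locally at similarity 0.75, and a server reply of `{"confidential"}` yields the labeled set `{"finance", "confidential"}`.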
According to the technical scheme of the embodiment of the invention, compared with the prior art, the method has the following beneficial effects:
1) The file fingerprint is generated from the file content, so it is not limited by file format or type; only the relation among file contents (such as being similar, close, or identical) is considered. The original file is not damaged, files are easier to track and manage, and labeling accuracy is improved;
2) the file fingerprint comprises an exact fingerprint and a similarity fingerprint; the similarity is calculated in different ways for the different fingerprints, and the meta-information is screened in combination with the attributes of the target file, which narrows the query range and improves query accuracy;
3) to handle the case where the file fingerprint fails, a local file information library is established so that the labels marked on the target file can still be tracked; when labels are marked on or removed from a file, the locally stored information is changed synchronously;
4) the labeled set of the target file is filtered to identify and process the labels for which the current user has no operation authority; with subsequent label operations in mind, the labels for which the user has operation authority are displayed together with the labeled set of the target file;
5) the last hash value in the local operation log is compared with the hash value of the operation log on the server to judge whether the last update operation recorded locally is the same as that recorded on the server, which determines how the file fingerprint and operation log of the file are synchronized.
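As an illustration of the two kinds of fingerprint mentioned in item 2), the following Python sketch generates an exact fingerprint from the whole file content and a similarity fingerprint made of per-word sub-fingerprints, then accumulates sub-fingerprint similarities into a score. The specific choices here are assumptions for the example only: SHA-256 as the digest algorithm, a simple regular expression as the word segmenter, a truncated digest as the sub-fingerprint, and equality (1 or 0) as the similarity between two sub-fingerprints.

```python
import hashlib
import re
from typing import List


def exact_fingerprint(content: bytes) -> str:
    """Exact fingerprint: a digest of the whole file content (SHA-256 assumed)."""
    return hashlib.sha256(content).hexdigest()


def similarity_fingerprint(text: str) -> List[str]:
    """Similarity fingerprint: one sub-fingerprint per distinct word.

    Splitting on word characters is a stand-in for a real word segmenter.
    """
    tokens = sorted(set(re.findall(r"\w+", text.lower())))
    return [hashlib.sha256(t.encode("utf-8")).hexdigest()[:16] for t in tokens]


def accumulated_similarity(fp_a: List[str], fp_b: List[str]) -> int:
    """Compare every sub-fingerprint of fp_a with every sub-fingerprint of
    fp_b and accumulate the per-pair similarities (1 if equal, else 0)."""
    return sum(1 for a in fp_a for b in fp_b if a == b)
```

With these assumptions the accumulated score equals the number of words the two texts share, so `"quarterly finance report"` and `"Finance report draft"` score 2. A real system would likely normalize the score before comparing it with a threshold; the text does not specify this.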
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A file classification method, comprising:
in response to a label query operation on a target file, acquiring a file fingerprint of the target file, and determining, in a local fingerprint library, a similar file fingerprint whose similarity with the file fingerprint exceeds a preset similarity threshold;
acquiring meta-information corresponding to the similar file fingerprint, and determining labels according to label identifiers in the meta-information to obtain a first label set;
transmitting the file fingerprint to a server for label query, and receiving a second label set returned by the server;
and acquiring the union of the first label set and the second label set to obtain a labeled set of the target file, and determining the category of the target file according to the labels in the labeled set.
2. The method of claim 1, wherein the local fingerprint library comprises a first fingerprint library and a second fingerprint library;
the acquiring of the file fingerprint of the target file comprises:
generating a file fingerprint according to the file content in the target file, wherein the file fingerprint comprises a first fingerprint and a second fingerprint, and the first fingerprint is obtained by processing the file content using a message digest algorithm;
the determining of a similar file fingerprint in the local fingerprint library whose similarity with the file fingerprint exceeds a preset similarity threshold comprises:
calculating the similarity between the first fingerprint and fingerprints in the first fingerprint library, and determining a first similar fingerprint whose similarity exceeds a first preset similarity threshold; and
calculating the similarity between the second fingerprint and fingerprints in the second fingerprint library, and determining a second similar fingerprint whose similarity exceeds a second preset similarity threshold;
the acquiring of the meta-information corresponding to the similar file fingerprint comprises: acquiring first meta-information corresponding to the first similar fingerprint and second meta-information corresponding to the second similar fingerprint.
3. The method of claim 2, wherein the second fingerprint comprises sub-fingerprints, and each sub-fingerprint is generated by processing a word segment of the file content using a message digest algorithm;
the calculating of the similarity between the second fingerprint and fingerprints in the second fingerprint library comprises:
calculating the similarity between each sub-fingerprint in the second fingerprint and each sub-fingerprint of one fingerprint in the second fingerprint library, and accumulating the similarities to obtain the similarity between the second fingerprint and the one fingerprint.
4. The method of claim 2 or 3, wherein the obtaining first meta-information corresponding to the first similar fingerprint and second meta-information corresponding to the second similar fingerprint further comprises:
determining the file volume of the target file, and acquiring first meta-information corresponding to the first similar fingerprint and the file volume; and
determining a file suffix and a file type of the target file, and acquiring second meta-information corresponding to the second similar fingerprint, the file suffix and the file type.
5. The method of claim 1, further comprising:
acquiring a fully qualified domain name (FQDN) of a client, and querying meta-information from a local file information library in combination with a file path and a file type of the target file, so as to determine labels according to label identifiers in the queried meta-information and obtain a third label set;
the acquiring of the union of the first label set and the second label set comprises:
taking the union of the first label set, the second label set and the third label set.
6. The method of claim 1 or 5, further comprising:
sending an authentication request to a server, so that the server authenticates the user name in the authentication request, to obtain a fourth label set for which the user name has operation authority;
the obtaining of the labeled set of the target file further comprises:
taking the intersection of the labeled set and the fourth label set returned by the server to obtain a fifth label set for which the user name has operation authority on the target file.
7. The method of claim 6, further comprising:
processing the labels in the fourth label set so that the labels in the fifth label set are treated as marked labels and the remaining labels as unmarked labels, and displaying the processed fourth label set.
8. The method of claim 7, further comprising, after the displaying of the processed fourth label set:
responding to the marking operation of one unmarked label, generating an operation log for marking the label of the target file, storing the operation log into a local log library, and storing the file information corresponding to the target file and the marking label into a local file information library; and/or
responding to the marking removal operation of one marked file, generating an operation log for removing the marked label from the target file, storing the operation log into a local log library, and deleting file information corresponding to the target file and the removed marked label from a local file information library.
9. The method of claim 8, further comprising:
responding to the marking operation of an untagged file or the marking removing operation of a tagged file, generating a file fingerprint according to the file content in the target file, and storing the file fingerprint in a local fingerprint library together with the meta-information of the target file.
10. The method of claim 8, further comprising:
downloading the latest operation log corresponding to the user name from the server, and acquiring the hash value in the latest operation log, wherein the hash value is obtained by processing the operation log;
determining the operation logs in a local log library according to the identifier of the file in the latest operation log, and acquiring the last hash value among them, the last hash value being the hash value of the most recent of those operation logs;
comparing the hash value with the last hash value, and if they are consistent, uploading the file fingerprint and the operation log corresponding to the identifier in the local fingerprint library and the local log library to the server;
and if they are not consistent, updating the local fingerprint library and the local log library based on the file fingerprint and the operation log corresponding to the identifier that are pulled from the server.
11. A file classification apparatus, comprising:
a fingerprint lookup module, configured to, in response to a label query operation on a target file, acquire a file fingerprint of the target file, and determine, in a local fingerprint library, a similar file fingerprint whose similarity with the file fingerprint exceeds a preset similarity threshold;
a first label module, configured to acquire meta-information corresponding to the similar file fingerprint, and determine labels according to label identifiers in the meta-information to obtain a first label set;
a second label module, configured to transmit the file fingerprint to a server for label query, and receive a second label set returned by the server;
and a label processing module, configured to acquire the union of the first label set and the second label set to obtain a labeled set of the target file, and determine the category of the target file according to the labels in the labeled set.
12. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
13. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-10.
CN202010631285.4A 2020-07-03 2020-07-03 File classification method and device Pending CN111858486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010631285.4A CN111858486A (en) 2020-07-03 2020-07-03 File classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010631285.4A CN111858486A (en) 2020-07-03 2020-07-03 File classification method and device

Publications (1)

Publication Number Publication Date
CN111858486A true CN111858486A (en) 2020-10-30

Family

ID=73153418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010631285.4A Pending CN111858486A (en) 2020-07-03 2020-07-03 File classification method and device

Country Status (1)

Country Link
CN (1) CN111858486A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999637A (en) * 2012-12-29 2013-03-27 珠海金山办公软件有限公司 Method and system for automatically adding file tab to file according to file feature code
CN103281325A (en) * 2013-06-04 2013-09-04 北京奇虎科技有限公司 Method and device for processing file based on cloud security
CN105354318A (en) * 2015-11-13 2016-02-24 北京金山安全软件有限公司 File searching method and device
CN105653984A (en) * 2015-12-25 2016-06-08 北京奇虎科技有限公司 File fingerprint check method and apparatus
CN106844143A (en) * 2016-12-27 2017-06-13 微梦创科网络科技(中国)有限公司 A kind of daily record duplicate removal treatment method and device
CN107798082A (en) * 2017-10-16 2018-03-13 广东欧珀移动通信有限公司 A kind of processing method and processing device of file label
CN108255915A (en) * 2017-09-07 2018-07-06 新华三技术有限公司 File management method and device and machine-readable storage medium
CN109766320A (en) * 2018-12-04 2019-05-17 深圳供电局有限公司 The method and system of label display are shared in a kind of network file
CN109800775A (en) * 2017-11-17 2019-05-24 腾讯科技(深圳)有限公司 Document clustering method, apparatus, equipment and readable medium
CN110519654A (en) * 2019-09-11 2019-11-29 广州荔支网络技术有限公司 A kind of label determines method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113760834A (en) * 2021-09-22 2021-12-07 北京字跳网络技术有限公司 File classification method, device, equipment and medium
CN113760834B (en) * 2021-09-22 2024-04-09 北京字跳网络技术有限公司 File classification method, device, equipment and medium
CN113901001A (en) * 2021-12-09 2022-01-07 武汉华工安鼎信息技术有限责任公司 File identification processing method and device
CN113901001B (en) * 2021-12-09 2022-03-01 武汉华工安鼎信息技术有限责任公司 File identification processing method and device
CN114003963A (en) * 2021-12-30 2022-02-01 天津联想协同科技有限公司 Method, system, network disk and storage medium for file authorization under enterprise network disk
CN114003963B (en) * 2021-12-30 2022-05-06 天津联想协同科技有限公司 Method, system, network disk and storage medium for file authorization under enterprise network disk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination