CN113407495A

CN113407495A - SIMHASH-based file similarity determination method and system

Info

Publication number: CN113407495A
Application number: CN202110731030.XA
Authority: CN
Inventors: 代俊朴; 王升平
Original assignee: BEIJING TIP TECHNOLOGY CO LTD
Current assignee: BEIJING TIP TECHNOLOGY CO LTD
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-17

Abstract

The embodiments of the application disclose a SIMHASH-based file similarity determination method and system. The method comprises the following steps: extracting a file message of a file to be judged, and writing the file message into Kafka; performing, by a similarity judgment processing engine, Tika parsing on the file message to obtain a Tika parsing result; performing, by a Simhash engine, a Simhash calculation on the Tika parsing result to obtain a Simhash value of the file to be judged; comparing the Simhash value of the file to be judged with the Simhash value of a sample in a sample library to obtain a Hamming distance; and judging whether the Hamming distance is larger than 0, and if so, judging whether the file to be judged is similar based on a set similarity, and if so, writing the result into an HBase database and updating the sample state in the sample database, wherein the sample state includes a secret-related state and a non-secret-related state. Because file similarity is judged accurately on the basis of SIMHASH, the cost of manual judgment is greatly reduced and the efficiency of confidentiality handling is improved.

Description

SIMHASH-based file similarity determination method and system

Technical Field

The embodiment of the application relates to the technical field of artificial intelligence, in particular to a file similarity determination method and system based on SIMHASH.

Background

Security inspection and security-level judgment of files transmitted between internet access ports are an important part of a security monitoring system. Monitoring the files transmitted over a network that carries huge traffic is a challenge for security personnel.

In this context, a need exists for a technique to quickly and accurately assist security personnel in automating the inspection of documents.

Disclosure of Invention

Therefore, the embodiments of the application provide a SIMHASH-based file similarity determination method and system that judge file similarity accurately on the basis of SIMHASH, which greatly reduces the cost of manual judgment and improves the efficiency of confidentiality handling.

In order to achieve the above object, the embodiments of the present application provide the following technical solutions:

according to a first aspect of embodiments of the present application, there is provided a SIMHASH-based file similarity determination method, including:

extracting a file message of a file to be judged, and writing the file message into Kafka;

performing, by a similarity judgment processing engine, Tika parsing on the file message to obtain a Tika parsing result;

performing, by a Simhash engine, a Simhash calculation on the Tika parsing result to obtain a Simhash value of the file to be judged;

comparing the Simhash value of the file to be judged with the Simhash value of a sample in a sample library to obtain a Hamming distance;

judging whether the Hamming distance is larger than 0; if so, judging whether the file to be judged is similar based on a set similarity; and if so, writing the result into an HBase database and updating the sample state in the sample database, wherein the sample state includes a secret-related state and a non-secret-related state.

Optionally, the set similarity is that the hamming distance is less than or equal to three.

Optionally, performing, by the Simhash engine, the Simhash calculation on the Tika parsing result to obtain the Simhash value of the file to be judged includes:

performing, by the Simhash engine, the Simhash calculation on the plain text in the Tika parsing result to obtain the Simhash value of the file to be judged, wherein the Simhash value is a 64-bit string of 0s and 1s.

Optionally, the method further comprises:

and performing Simhash value comparison calculation on all sample files in the sample library once every set period, and disposing the text files meeting the set similarity.

According to a second aspect of the embodiments of the present application, there is provided a SIMHASH-based file similarity determination system, including:

a file extraction module, configured to extract a file message of a file to be judged and write the file message into Kafka;

a similarity judgment processing engine, configured to perform Tika parsing on the file message to obtain a Tika parsing result;

a Simhash engine, configured to perform a Simhash calculation on the Tika parsing result to obtain a Simhash value of the file to be judged;

a Hamming distance calculation module, configured to compare the Simhash value of the file to be judged with the Simhash value of a sample in a sample library to obtain a Hamming distance;

and a similarity judging module, configured to judge whether the Hamming distance is greater than 0; if so, judge whether the file to be judged is similar based on a set similarity; and if so, write the result into an HBase database and update the sample state in the sample database, wherein the sample state includes a secret-related state and a non-secret-related state.

Optionally, the Simhash engine is specifically configured to perform the Simhash calculation on the plain text in the Tika parsing result to obtain the Simhash value of the file to be judged, wherein the Simhash value is a 64-bit string of 0s and 1s.

Optionally, the system further comprises:

and the regular disposal module is used for carrying out simhash value comparison calculation on all sample files in the sample library once every set period and disposing the text files meeting the set similarity.

According to a third aspect of embodiments herein, there is provided an apparatus comprising: the device comprises a data acquisition device, a processor and a memory; the data acquisition device is used for acquiring data; the memory is to store one or more program instructions; the processor is configured to execute one or more program instructions to perform the method of any of the first aspect.

According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium having one or more program instructions embodied therein for performing the method of any of the first aspects.

To sum up, the embodiments of the present application provide a SIMHASH-based file similarity determination method and system, in which: a file message of a file to be judged is extracted and written into Kafka; a similarity judgment processing engine performs Tika parsing on the file message to obtain a Tika parsing result; a Simhash engine performs a Simhash calculation on the Tika parsing result to obtain a Simhash value of the file to be judged; the Simhash value of the file to be judged is compared with the Simhash value of a sample in a sample library to obtain a Hamming distance; and it is judged whether the Hamming distance is larger than 0, and if so, whether the file to be judged is similar based on a set similarity, and if so, the result is written into an HBase database and the sample state in the sample database is updated, wherein the sample state includes a secret-related state and a non-secret-related state. Because file similarity is judged accurately on the basis of SIMHASH, the cost of manual judgment is greatly reduced and the efficiency of confidentiality handling is improved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only exemplary, and a person skilled in the art can derive other implementation drawings from the drawings provided without inventive effort.

The structures, ratios, sizes, and the like shown in this specification are only intended to accompany the content disclosed in the specification so that those skilled in the art can understand and read it. They do not limit the conditions under which the present invention can be implemented and therefore have no substantive technical significance. Any structural modification, change in ratio, or adjustment of size that does not affect the functions and purposes of the present invention still falls within the scope of the present invention.

Fig. 1 is a schematic flow chart of a file similarity determination method based on SIMHASH according to an embodiment of the present application;

fig. 2 is a block diagram of a file similarity determination system based on SIMHASH according to an embodiment of the present application.

Detailed Description

The present invention is described below by way of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from this disclosure. The described embodiments are merely exemplary and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.

The system provided by the embodiments of the application supports functions such as establishing a secret-related file library and automatically matching files against it. Establishing a confidential document library means storing handled confidential documents as samples; if a similar document is found in network traffic, it is captured immediately and an alarm is generated. Establishing a non-confidential document library means automatically filtering out similar non-confidential documents to reduce the number of documents that security personnel must handle, so that they can focus their effort where it matters most.

As the number of samples in the confidential and non-confidential document libraries grows, the document-handling workload of security personnel is reduced to a certain extent, handling efficiency is greatly improved, and better confidentiality supervision results are obtained.

The file sample library can be established with an Elasticsearch engine method, a regular-expression matching method, or a SIMHASH judgment method.

Sample-library-based confidentiality judgment imitates how a person would judge: the files in the sample library are matched for similarity against the file to be judged, and a business process then completes the comparison work of confidential-file judgment. Judging file similarity by regular-expression matching is conceptually simple but inefficient in practice, because it has to match two long character strings against each other. When a large number of samples and a large number of network-stream files require Cartesian-product comparison, regular-expression matching can hardly complete the comparisons in limited time, and the parameters that control text similarity cannot be adjusted flexibly. The Elasticsearch engine method is relatively efficient, but its match scoring is computed inside the engine and cannot be controlled separately, so the degree and specifics of file similarity are hard to grasp, and the method is difficult to develop and maintain in practice.

To complete file comparison tasks of Cartesian-product magnitude quickly and accurately, the embodiments of the application use a SIMHASH calculation based on file content to automatically judge the confidentiality of files against the sample library. The SIMHASH value can be calculated when the file is generated, before any comparison takes place, and this asynchronous separation of calculation and comparison greatly improves computational efficiency.

The file message is captured and generated by the front end and pushed into the Kafka message storage engine, where it is then consumed by the file analysis program. Consuming a message mainly comprises: acquiring the file, parsing the file content with Tika, calculating the SIMHASH value of the file, and saving the calculation result.
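The consumption steps just listed can be sketched as a small loop. This is only an illustration under stated assumptions, not the applicant's implementation: a plain Python iterable stands in for the Kafka topic, and `fetch`, `parse`, `fingerprint`, and `store` are hypothetical callables standing in for file retrieval, Tika text extraction, the SIMHASH engine, and the result store.

```python
from typing import Callable, Dict, Iterable


def consume_file_messages(
    messages: Iterable[Dict[str, str]],
    fetch: Callable[[str], bytes],
    parse: Callable[[bytes], str],
    fingerprint: Callable[[str], str],
    store: Callable[[str, str, str], None],
) -> int:
    """Process file messages in order: fetch, parse, fingerprint, save.

    All names are hypothetical. `messages` stands in for records read
    from Kafka, `parse` for Tika text extraction, and `store` for the
    write of (file id, plain text, SIMHASH value) to the database.
    """
    processed = 0
    for message in messages:
        file_id = message["file_id"]
        raw = fetch(message["path"])   # acquire the file
        text = parse(raw)              # extract the plain text
        value = fingerprint(text)      # calculate the SIMHASH value
        store(file_id, text, value)    # save the calculation result
        processed += 1
    return processed
```

Passing the stages in as callables keeps the loop independent of any particular Kafka, Tika, or HBase client, which the text does not specify.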

File analysis is performed before file judgment: the SIMHASH calculation is performed when a file is acquired from network traffic, so the calculation cost is spread out and efficiency is high. Likewise, because the Hamming distance is calculated from the stored SIMHASH values (and is also calculated when the sample library changes), that cost is equally spread out and efficiency is high.

Fig. 1 illustrates a file similarity determination method based on SIMHASH according to an embodiment of the present application, where the method includes:

Step 101: extracting a file message of a file to be judged, and writing the file message into Kafka;

Step 102: performing, by the similarity judgment processing engine, Tika parsing on the file message to obtain a Tika parsing result;

Step 103: performing, by the Simhash engine, a Simhash calculation on the Tika parsing result to obtain a Simhash value of the file to be judged;

Step 104: comparing the Simhash value of the file to be judged with the Simhash value of a sample in the sample library to obtain a Hamming distance;

Step 105: judging whether the Hamming distance is larger than 0; if so, judging whether the file to be judged is similar based on the set similarity; and if so, writing the result into an HBase database and updating the sample state in the sample database, wherein the sample state includes a secret-related state and a non-secret-related state.

In a possible embodiment, the set similarity is a Hamming distance of three or less. This threshold can be widened or narrowed slightly according to actual conditions.
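The two checks of step 105 can be written as a single predicate. The sketch below is a minimal illustration: `is_similar` and `max_distance` are hypothetical names, and the default of 3 is the value named in the text. The text does not spell out how a distance of exactly 0 is handled, so the sketch follows the stated check literally and returns `False` for it.

```python
def is_similar(hamming_distance: int, max_distance: int = 3) -> bool:
    """Apply the two checks of step 105 to one Hamming distance.

    The distance must be larger than 0 and must not exceed the set
    similarity. `max_distance` is a hypothetical parameter name for
    the adjustable threshold (3 in the described embodiment).
    """
    if hamming_distance <= 0:
        # Fails the "larger than 0" check; this also rejects the -1
        # marker used for distances that are not legitimate.
        return False
    return hamming_distance <= max_distance
```

Because the threshold is a parameter, widening or narrowing the set similarity only requires passing a different `max_distance`.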

In a possible implementation, performing the Simhash calculation on the Tika parsing result in step 103 includes: performing, by the Simhash engine, the Simhash calculation on the plain text in the Tika parsing result to obtain the Simhash value of the file to be judged, wherein the Simhash value is a 64-bit string of 0s and 1s.
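A 64-bit value of this kind can be produced as sketched below. The text does not specify tokenization, the per-token hash, or the weighting, so everything here is an assumption made for illustration: whitespace tokens, token frequency as the weight, and the first 64 bits of an MD5 digest as the token hash. `simhash64` is a hypothetical name.

```python
import hashlib
from collections import Counter


def simhash64(text: str) -> str:
    """Return a 64-character string of '0'/'1' for the given plain text.

    Assumptions (not from the source): whitespace tokenization, token
    frequency as weight, first 8 bytes of MD5 as the 64-bit token hash.
    """
    weights = Counter(text.split())  # token -> frequency used as weight
    totals = [0] * 64                # one signed accumulator per bit
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            # Add the weight where the token hash has a 1, subtract it
            # where the token hash has a 0.
            if (token_hash >> (63 - bit)) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    # A result bit is 1 when its accumulator is positive, otherwise 0.
    return "".join("1" if total > 0 else "0" for total in totals)
```

Since each token contributes to every bit position in proportion to its weight, texts that share most of their tokens tend to produce values that differ in only a few positions, which is what makes a small Hamming distance usable as a similarity signal.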

In one possible embodiment, the method further comprises: and performing Simhash value comparison calculation on all sample files in the sample library once every set period, and disposing the text files meeting the set similarity.

In one possible implementation, the simhash value is obtained in one of two ways: either it was calculated previously and is read directly, or it has not yet been calculated, in which case it is calculated and the result is saved. The sample library holds multiple simhash values. The file to be judged and the sample library are in a one-to-many relationship: comparing the file to be judged with each sample yields one Hamming distance (that is, the number of bit positions at which the two simhash values differ).
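The two ways of obtaining the value amount to a read-through cache. A minimal sketch, assuming a dictionary in place of the real store and a hypothetical `compute` callable in place of the Simhash engine:

```python
from typing import Callable, Dict


def get_simhash(
    file_id: str,
    text: str,
    cache: Dict[str, str],
    compute: Callable[[str], str],
) -> str:
    """Return the stored SIMHASH value, computing and saving it if absent.

    `cache` is a hypothetical stand-in for the persisted results and
    `compute` for the Simhash engine.
    """
    value = cache.get(file_id)
    if value is None:
        value = compute(text)   # not calculated before: calculate now
        cache[file_id] = value  # save the result for later comparisons
    return value
```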

In one possible embodiment, if an abnormal simhash value is encountered while calculating the Hamming distance, the calculated Hamming distance is set to -1 to mark that it is not a legitimate Hamming distance.
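The one-to-many comparison and the -1 marker can be sketched together. The text does not define what counts as an abnormal value, so the validity rule here is an assumption: anything other than a 64-character string of '0' and '1' is treated as abnormal. All function names are hypothetical.

```python
from typing import Dict

INVALID_DISTANCE = -1  # marker for "not a legitimate Hamming distance"


def is_valid_simhash(value) -> bool:
    """Assumed rule: a 64-character string containing only '0' and '1'."""
    return (
        isinstance(value, str)
        and len(value) == 64
        and set(value) <= {"0", "1"}
    )


def hamming_distance(a, b) -> int:
    """Count differing bit positions, or return -1 for abnormal input."""
    if not (is_valid_simhash(a) and is_valid_simhash(b)):
        return INVALID_DISTANCE
    return sum(x != y for x, y in zip(a, b))


def distances_to_samples(candidate: str, samples: Dict[str, str]) -> Dict[str, int]:
    """Compare one file with every sample: one distance per sample id."""
    return {
        sample_id: hamming_distance(candidate, value)
        for sample_id, value in samples.items()
    }
```

Returning -1 rather than raising lets a batch comparison continue past a damaged record, and the later "larger than 0" check filters such entries out.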

The system architecture to which the method applies combines Kafka, HBase, a Tika file recognition engine, and a file similarity judgment engine to implement the similarity calculation scheme. File data messages captured from network traffic are written to Kafka and processed by the message processing engine. The processing results, namely the plain text parsed from the file and the SIMHASH value calculated from that plain text, are stored in the HBase database for querying. Finally, the Hamming distance is compared whenever a similarity comparison is needed.

Message generation module: an ETL tool extracts file messages from network traffic and writes them into Kafka.

Message processing module: the similarity judgment processing engine performs Tika parsing on the file according to the file message, and the SimHash engine then performs the SimHash calculation on the Tika parsing result to obtain a 64-bit string of 0s and 1s (the SimHash value).

Judgment and disposal layer: in the first case, all undisposed records in the system are screened and compared at regular intervals, and the files that meet the similarity condition are disposed of. For more timely automatic disposal, the second case is triggered when the sample library is updated: all undisposed records are compared against the newly added sample files, and the matching items are disposed of. The disposal results are written into the HBase database and displayed by the front end. What is displayed is the result of the most recent disposal, that is, the state of the file: secret or non-secret.
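The two disposal triggers can share one comparison routine. The sketch below is an assumption-laden illustration: dictionaries stand in for the undisposed records and the sample library, all SIMHASH values are assumed to be valid 64-character strings, and a pending file simply takes the state of the first matching sample, a tie-breaking rule the text does not specify. All names are hypothetical.

```python
from typing import Dict, Tuple

Sample = Tuple[str, str]  # (64-bit SIMHASH string, state)


def dispose_pending(
    pending: Dict[str, str],
    samples: Dict[str, Sample],
    max_distance: int = 3,
) -> Dict[str, str]:
    """Return {file id: state} for pending files that match a sample.

    A pending file takes the state of the first sample whose Hamming
    distance to it is larger than 0 and at most `max_distance`.
    """
    results: Dict[str, str] = {}
    for file_id, value in pending.items():
        for sample_value, state in samples.values():
            distance = sum(x != y for x, y in zip(value, sample_value))
            if 0 < distance <= max_distance:
                results[file_id] = state
                break
    return results


def on_schedule(pending: Dict[str, str], samples: Dict[str, Sample]) -> Dict[str, str]:
    """First case: periodic screening against the whole sample library."""
    return dispose_pending(pending, samples)


def on_sample_added(
    pending: Dict[str, str], sample_id: str, sample: Sample
) -> Dict[str, str]:
    """Second case: compare pending files with the newly added sample only."""
    return dispose_pending(pending, {sample_id: sample})
```

Restricting the second case to the new sample keeps the triggered run proportional to the number of undisposed records rather than to the full Cartesian product.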

The similarity condition used for each comparison is set by the user, who writes it to the database through the foreground page; the value is then read from the database.

To sum up, the embodiments of the present application provide a SIMHASH-based file similarity determination method, in which: a file message of a file to be judged is extracted and written into Kafka; the similarity judgment processing engine performs Tika parsing on the file message to obtain a Tika parsing result; the Simhash engine performs a Simhash calculation on the Tika parsing result to obtain a Simhash value of the file to be judged; the Simhash value of the file to be judged is compared with the Simhash value of a sample in the sample library to obtain a Hamming distance; and it is judged whether the Hamming distance is larger than 0, and if so, whether the file to be judged is similar based on the set similarity, and if so, the result is written into an HBase database and the sample state in the sample database is updated, wherein the sample state includes a secret-related state and a non-secret-related state. Because file similarity is judged accurately on the basis of SIMHASH, the cost of manual judgment is greatly reduced and the efficiency of confidentiality handling is improved.

Based on the same technical concept, an embodiment of the present application further provides a file similarity determination system based on SIMHASH, as shown in fig. 2, the system includes:

a file extraction module 201, configured to extract a file message of a file to be judged and write the file message into Kafka;

a similarity judgment processing engine 202, configured to perform Tika parsing on the file message to obtain a Tika parsing result;

a Simhash engine 203, configured to perform a Simhash calculation on the Tika parsing result to obtain a Simhash value of the file to be judged;

a Hamming distance calculation module 204, configured to compare the Simhash value of the file to be judged with the Simhash value of a sample in the sample library to obtain a Hamming distance;

and a similarity determination module 205, configured to determine whether the Hamming distance is greater than 0; if so, determine whether the file to be judged is similar based on the set similarity; and if so, write the result into the HBase database and update the sample state in the sample database, wherein the sample state includes a secret-related state and a non-secret-related state.

In a possible embodiment, the set similarity is a hamming distance of three or less.

In a possible implementation, the Simhash engine 203 is specifically configured to perform the Simhash calculation on the plain text in the Tika parsing result to obtain the Simhash value of the file to be judged, wherein the Simhash value is a 64-bit string of 0s and 1s.

In one possible embodiment, the system further comprises: and the regular disposal module is used for carrying out simhash value comparison calculation on all sample files in the sample library once every set period and disposing the text files meeting the set similarity.

Based on the same technical concept, an embodiment of the present application further provides an apparatus, including: the device comprises a data acquisition device, a processor and a memory; the data acquisition device is used for acquiring data; the memory is to store one or more program instructions; the processor is configured to execute one or more program instructions to perform the method.

Based on the same technical concept, the embodiment of the present application also provides a computer-readable storage medium, wherein the computer-readable storage medium contains one or more program instructions, and the one or more program instructions are used for executing the method.

In the present specification, each embodiment of the method is described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. Reference is made to the description of the method embodiments.

It is noted that while the operations of the methods of the present invention are depicted in the drawings in a particular order, this is not a requirement or suggestion that the operations must be performed in this particular order or that all of the illustrated operations must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

Although the present application provides method steps as in embodiments or flowcharts, additional or fewer steps may be included based on conventional or non-inventive approaches. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an apparatus or client product in practice executes, it may execute sequentially or in parallel (e.g., in a parallel processor or multithreaded processing environment, or even in a distributed data processing environment) according to the embodiments or methods shown in the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded.

The units, devices, modules, etc. set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, in implementing the present application, the functions of each module may be implemented in one or more software and/or hardware, or a module implementing the same function may be implemented by a combination of a plurality of sub-modules or sub-units, and the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The above-mentioned embodiments are further described in detail for the purpose of illustrating the invention, and it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A file similarity determination method based on SIMHASH, comprising:

2. The method of claim 1, wherein the set similarity is a hamming distance of three or less.

3. The method as claimed in claim 1, wherein the Simhash engine performs Simhash calculation on the tika parsing result to obtain a Simhash value of the text to be determined, including:

4. The method of claim 1, wherein the method further comprises:

5. A file similarity determination system based on SIMHASH, the system comprising:

6. The system of claim 5, wherein the set similarity is a hamming distance of three or less.

7. The system of claim 5, wherein the Simhash engine is specifically configured to perform the Simhash calculation on the plain text in the Tika parsing result to obtain the Simhash value of the file to be judged, wherein the Simhash value is a 64-bit string of 0s and 1s.

8. The system of claim 5, wherein the system further comprises:

9. An apparatus, characterized in that the apparatus comprises: the device comprises a data acquisition device, a processor and a memory;

the data acquisition device is used for acquiring data; the memory is to store one or more program instructions; the processor, configured to execute one or more program instructions to perform the method of any of claims 1-4.

10. A computer-readable storage medium having one or more program instructions embodied therein for performing the method of any of claims 1-4.