CN111666928A

CN111666928A - Computer file similarity recognition system and method based on image analysis

Info

Publication number: CN111666928A
Application number: CN202010689843.2A
Authority: CN
Inventors: 宋国训; 魏磊; 仲伟付; 杨秀红; 刘曌
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-07-17
Filing date: 2020-07-17
Publication date: 2020-09-15

Abstract

The invention belongs to the technical field of computers, and particularly relates to a computer file similarity recognition system and method based on image analysis. The system comprises: a file attribute data extraction unit configured to extract basic attributes of two target files to be compared, namely a first target file and a second target file, the basic attributes comprising at least: file name, file type, file size, file location, file creation time and file modification time; and a file content extraction unit configured to open the first target file and the second target file, extract the contents of the two files and temporarily store the extracted file contents. By analyzing the attribute data of the files and converting the file contents into image contents for similarity matching analysis, the invention can accurately identify the similarity of files, with high identification accuracy and high efficiency.

Description

Computer file similarity recognition system and method based on image analysis

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a computer file similarity recognition system and method based on image analysis.

Background

The document similarity calculation method is a method for analyzing and calculating the similarity of documents using information (document contents and connection information) of the documents themselves. With the progress of the times, the file similarity calculation method has been widely applied to various fields (e.g., related fields such as information retrieval, collaborative recommendation systems, library classification systems, etc.).

The existing method for detecting the similarity of the files generally comprises the following steps:

(1) After basic simplification processing is carried out on each file in the submitted file set, each file is divided into consecutive marked blocks; a certain number of representative blocks are kept from among these blocks; the representative blocks are made into a unique representative fingerprint, and different files are signed with different representative fingerprints.

(2) Whether the signature fingerprints of the two files are the same is judged; if so, the two files are associated and regarded as similar files; otherwise, the two files are not associated and are not regarded as similar files.
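The two prior-art steps above can be illustrated with a minimal Python sketch. The block size, the number of representative blocks, the rule for choosing them (smallest hash values) and the use of SHA-256 are all assumptions made for illustration; the background text does not specify them.

```python
import hashlib


def fingerprint(text: str, block_size: int = 8, keep: int = 4) -> str:
    """Sign a document with a fingerprint built from a few representative blocks."""
    # Basic simplification (assumed): lowercase and drop all whitespace.
    simplified = "".join(text.lower().split())
    # Divide the simplified text into consecutive fixed-size blocks.
    blocks = [simplified[i:i + block_size]
              for i in range(0, len(simplified), block_size)]
    # Keep the `keep` blocks with the smallest hashes as representatives
    # (an assumed selection rule, not taken from the source).
    digests = sorted(hashlib.sha256(b.encode("utf-8")).hexdigest() for b in blocks)
    representative = digests[:keep]
    # Combine the representative blocks into one fingerprint.
    return hashlib.sha256("".join(representative).encode("utf-8")).hexdigest()


def same_fingerprint(a: str, b: str) -> bool:
    """Step (2): two files are treated as similar only if the fingerprints are equal."""
    return fingerprint(a) == fingerprint(b)
```

Because the decision is an exact fingerprint comparison, documents that differ only in case or whitespace are associated here, while any change that alters a representative block breaks the match.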

Patent No. CN201410421951.6A discloses a deduplication method and system based on file similarity. The method comprises the following steps: extracting plain text from the files to be compared; normalizing the plain text to generate normalized character units; encoding the normalized character units, generating a fixed-length irreversible representative code through an encoding algorithm; extracting keywords from the representative codes of the files to be compared to generate keyword sequences; calculating the word-form similarity and word-order similarity of the sentences to be compared according to their keyword sequences; calculating the similarity of the sentences to be compared from the word-form similarity and the word-order similarity; and calculating the similarity of the files to be compared from the sentence similarity. The method is said to be suitable for Chinese characters, convenient for domestic users, and more accurate when comparing similar files.

Patent No. CN2007101058353A discloses a system and method for detecting file similarity: the method comprises the following steps: extracting pure character parts from the files to be detected respectively; splitting the extracted pure character part into character units; coding the split character unit; and comparing the coded character unit in one file with the coded character unit of at least one other file to determine the similarity of the two.

It can be seen that the existing method for detecting file similarity has the following defects when in use: generally, the whole content of the file needs to be matched and identified, so that the efficiency is low; there is a lack of effective means of identification for many image files.

Disclosure of Invention

In view of the above, the present invention is directed to a computer file similarity recognition system and method based on image analysis, which can accurately recognize the similarity of files by analyzing the attribute data of the files and performing similarity matching analysis by converting the file content into the image content, and has the advantages of high recognition accuracy and high efficiency.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

A computer file similarity identification system based on image analysis, the system comprising: a file attribute data extraction unit configured to extract basic attributes of two target files to be compared, namely a first target file and a second target file, the basic attributes comprising at least: file name, file type, file size, file location, file creation time and file modification time; a file content extraction unit configured to open the first target file and the second target file, extract the contents of the two files and temporarily store the extracted file contents; a file content conversion unit configured to convert the extracted file contents into corresponding image contents; and a file similarity identification unit comprising a first similarity identification unit, a second similarity identification unit and a result generation unit; wherein the first similarity identification unit is configured to judge the similarity of the two files according to their basic attributes to obtain a first judgment result; the second similarity identification unit is configured to judge the similarity of the two files according to the image contents, based on a preset neural network model, to obtain a second judgment result; and the result generation unit is configured to calculate a final similarity identification result by weighting the first judgment result and the second judgment result with preset weight values.

Further, the method for judging the similarity of the two files by the first similarity identification unit according to the basic attributes of the two files to obtain the first judgment result executes the following steps: matching and identifying items to be matched, such as file names, file types, file sizes, file positions, file creating time and file modifying time of the first target file and the second target file respectively; the matching identification method comprises the following steps: acquiring a keyword to which the character belongs and an index position of the character in the keyword to which the character belongs according to a keyword set for each character in a to-be-matched item; judging whether the character is a first character of the keyword according to the index position of the character in the keyword to which the character belongs; if the character is the first character of the affiliated keyword, recording the affiliated keyword of the character in a matching information set, and marking the first character of the keyword in the record to exist in the item to be matched; if the character is not the first character of the corresponding keyword and the record of the keyword to which the character belongs exists in the matching information set, acquiring the record of the keyword to which the character belongs, and marking the character in the keyword existing in the item to be matched in the record; when all characters in a keyword are marked and exist in the item to be matched, judging that the keyword is hit by the item to be matched; if the items to be matched of the two files hit the same keyword, judging that the items to be matched are matched; and if the number of the matched items to be matched is greater than that of the unmatched items to be matched, the first judgment result is that the two files are matched and is marked as 1.
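The character-by-character keyword matching described above can be sketched in Python as follows. The function names, the dictionary-based "matching information set" and the return value 0 for the non-matching case are assumptions for illustration; the text only states that a match is marked as 1.

```python
from collections import defaultdict


def build_index(keywords):
    """Map each character to the (keyword, index position) pairs where it occurs."""
    index = defaultdict(list)
    for kw in keywords:
        for pos, ch in enumerate(kw):
            index[ch].append((kw, pos))
    return index


def hit_keywords(item, index):
    """Return the keywords whose characters are all marked as present in `item`."""
    records = {}  # matching information set: keyword -> marked positions
    hits = set()
    for ch in item:
        for kw, pos in index.get(ch, ()):
            if pos == 0:
                # First character of the keyword: create the record and mark it.
                records.setdefault(kw, set()).add(0)
            elif kw in records:
                # Later character: mark it only if a record already exists.
                records[kw].add(pos)
            else:
                continue
            if len(records[kw]) == len(kw):
                hits.add(kw)  # every character of the keyword is marked
    return hits


def first_judgment(attrs_a, attrs_b, keywords):
    """Return 1 if more attribute items match than do not, else 0 (0 is assumed)."""
    index = build_index(keywords)
    matched = 0
    for name in attrs_a:
        hits_a = hit_keywords(str(attrs_a[name]), index)
        hits_b = hit_keywords(str(attrs_b.get(name, "")), index)
        if hits_a & hits_b:  # both items hit at least one common keyword
            matched += 1
    unmatched = len(attrs_a) - matched
    return 1 if matched > unmatched else 0
```

As in the description, a keyword counts as hit once each of its characters has been marked after its first character was seen; the sketch does not additionally require the characters to be contiguous or in order.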

Further, the second similarity identification unit includes: a local probability model estimation subunit configured to calculate a probability for each local region of the image content using the following formula:

[Formula image not reproduced in the source text]

wherein i is the index of each local region, n is the number of local regions, σ(x_i) denotes the probability of local region x_i, each local region x_i is in the form of a matrix, the formula takes the transpose of a matrix, w_i is a preset template matrix, b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10, and m is a probability adjustment value, with a value range of 0.2 to 0.6; a local region weight calculation subunit that calculates a weight value of each local region in the image, as the local region weight value, according to the probability of the local region; an image segmentation subunit configured to segment the image content of the second target file into unit domains; a unit domain feature quantity extracting subunit that extracts, from the divided unit domains, a feature quantity of each unit domain as a unit domain feature quantity of the image content of the first target file; a unit domain similarity calculation subunit configured to compare the unit domain feature quantity of the image content of the first target file, which is prepared in advance, with the unit domain feature quantity of the image content of the second target file, and to calculate the similarity of the feature quantities of each unit domain as the unit domain similarity; and an image similarity calculation subunit that calculates the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarity with a unit-domain-based weight value obtained from the local region weight value.

Further, the method by which the result generating unit calculates the final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values executes the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity value is less than 0.8, the final similarity identification result is that the two files are not similar; if the final similarity value is not less than 0.8, the final similarity identification result is that the two files are similar; wherein A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.
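A minimal Python sketch of this weighting step follows. The 0.8 threshold and the constraint that the two weights sum to 1 come from the text; the function name and the default weight of 0.4 for the first judgment result are assumptions chosen only for illustration.

```python
from typing import Tuple

SIMILARITY_THRESHOLD = 0.8  # threshold stated in the text


def final_similarity(first_result: float, second_result: float,
                     weight_a: float = 0.4) -> Tuple[float, bool]:
    """Combine both judgment results; weight_a is A, and B is derived as 1 - A."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    weight_b = 1.0 - weight_a  # enforces A + B = 1
    score = first_result * weight_a + second_result * weight_b
    # The files are judged similar when the score is not less than the threshold.
    return score, score >= SIMILARITY_THRESHOLD
```

For example, with A = 0.4 a first judgment result of 1 and a second judgment result of 0.9 give 0.4 + 0.54 = 0.94, so the files are judged similar; with a first judgment result of 0 the score drops to 0.54 and they are not.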

Further, the file content conversion unit, the method for converting the extracted file content into the corresponding image content, performs the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, corresponding each value in the matrix to an RGB value, and then regarding the matrix as the image content.
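The conversion can be sketched in Python as below. Treating every three consecutive bytes as one (R, G, B) pixel, padding the last row with zero bytes, and the fixed matrix width are assumptions; the text only states that the binary string is filled into a matrix whose values correspond to RGB values.

```python
import math
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def bytes_to_rgb_matrix(data: bytes, width: int = 4) -> List[List[Pixel]]:
    """Fill file bytes row by row into a matrix of RGB pixels, `width` pixels per row."""
    if width <= 0:
        raise ValueError("width must be positive")
    bytes_per_row = 3 * width  # one byte per colour channel
    rows = max(1, math.ceil(len(data) / bytes_per_row))
    # Pad with zero bytes so the last row is complete (assumed padding rule).
    padded = data.ljust(rows * bytes_per_row, b"\x00")
    matrix = []
    for r in range(rows):
        row_bytes = padded[r * bytes_per_row:(r + 1) * bytes_per_row]
        matrix.append([tuple(row_bytes[c:c + 3])
                       for c in range(0, bytes_per_row, 3)])
    return matrix
```

In practice the input would be the raw bytes read from a file opened in binary mode, and the resulting matrix could be handed to any image library that accepts RGB pixel data.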

A computer file similarity identification method based on image analysis, the method comprising the following steps: Step 1: extracting basic attributes of two target files to be compared, namely a first target file and a second target file, the basic attributes comprising at least: file name, file type, file size, file location, file creation time and file modification time; Step 2: opening the first target file and the second target file, extracting the contents of the two files, and temporarily storing the extracted file contents; Step 3: converting the extracted file contents into corresponding image contents; Step 4: calculating a final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values.

Further, the step 4: the method for judging the similarity of the two files according to the basic attributes of the two files to obtain a first judgment result executes the following steps: matching and identifying items to be matched, such as file names, file types, file sizes, file positions, file creating time and file modifying time of the first target file and the second target file respectively; the matching identification method comprises the following steps: acquiring a keyword to which the character belongs and an index position of the character in the keyword to which the character belongs according to a keyword set for each character in a to-be-matched item; judging whether the character is a first character of the keyword according to the index position of the character in the keyword to which the character belongs; if the character is the first character of the affiliated keyword, recording the affiliated keyword of the character in a matching information set, and marking the first character of the keyword in the record to exist in the item to be matched; if the character is not the first character of the corresponding keyword and the record of the keyword to which the character belongs exists in the matching information set, acquiring the record of the keyword to which the character belongs, and marking the character in the keyword existing in the item to be matched in the record; when all characters in a keyword are marked and exist in the item to be matched, judging that the keyword is hit by the item to be matched; if the items to be matched of the two files hit the same keyword, judging that the items to be matched are matched; and if the number of the matched items to be matched is greater than that of the unmatched items to be matched, the first judgment result is that the two files are matched and is marked as 1.

Further, in step 4, the method for judging the similarity of the two files according to the image content to obtain a second judgment result comprises the following steps: calculating a probability for each local region of the image content using the following formula:

[Formula image not reproduced in the source text]

wherein i is the index of each local region, n is the number of local regions, σ(x_i) denotes the probability of local region x_i, each local region x_i is in the form of a matrix, the formula takes the transpose of a matrix, w_i is a preset template matrix, b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10, and m is a probability adjustment value, with a value range of 0.2 to 0.6; calculating a weight value of each local region in the image, as the local region weight value, according to the probability of the local region; dividing the image content of the second target file into unit domains; extracting, from the divided unit domains, a feature quantity of each unit domain as a unit domain feature quantity of the image content of the first target file; comparing the unit domain feature quantity of the image content of the first target file, which is prepared in advance, with the unit domain feature quantity of the image content of the second target file; calculating the similarity of the feature quantities of each unit domain as the unit domain similarity; and calculating the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarity with a unit-domain-based weight value derived from the local region weight value.

Further, in step 4, the method for calculating the final similarity recognition result by weighting based on the preset weight values executes the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity value is less than 0.8, the final similarity identification result is that the two files are not similar; if the final similarity value is not less than 0.8, the final similarity identification result is that the two files are similar; wherein A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.

Further, in step 3, the method for converting the extracted file content into the corresponding image content performs the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then regarding the matrix as the image content.

The computer file similarity recognition system and method based on image analysis have the following beneficial effects: by analyzing the attribute data of the files and converting the file contents into image contents for similarity matching analysis, the similarity of files can be identified accurately, with high identification accuracy and high efficiency. This is mainly achieved as follows:

1. In the prior art, file similarity is identified directly from the file content. The invention instead brings the file attributes into the identification: the attributes, as important information characterizing a file, are included in the similarity identification, so that obviously dissimilar files can be screened out quickly and need not be compared further, which improves identification efficiency.

2. In the process of matching and identifying the file attributes, the keyword to which each character belongs and the index position of the character within that keyword are obtained from the keyword set, so that the full content of the file attributes does not have to be identified, which further improves identification efficiency.

3. During image identification, the unit domain similarity is weighted by the unit-domain-based weight value obtained from the local region weight value. Compared with identifying the similarity of the whole image, this has a significant advantage for files with many blank regions: because identification is performed per unit domain, blank regions of the image need not be processed with the same algorithmic complexity, and only unit domains with content need to be identified, which improves identification efficiency while maintaining accuracy.

4. Obtaining the final identification result by weighting the two identification results effectively avoids the low accuracy of a single identification method and improves identification accuracy.

Drawings

Fig. 1 is a schematic system structure diagram of a computer file similarity recognition system based on image analysis according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the file similarity identification unit of the computer file similarity recognition system based on image analysis according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of a computer file similarity recognition method based on image analysis according to an embodiment of the present invention;

Fig. 4 is a schematic diagram illustrating the principle by which the second similarity recognition unit divides the image content of the second target file into unit domains, in the computer file similarity recognition system and method based on image analysis according to an embodiment of the present invention;

Fig. 5 is a schematic diagram illustrating the principle by which the second similarity recognition unit compares the similarity between unit domains of the first target file and the second target file, in the computer file similarity recognition system and method based on image analysis according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of an experimental curve of the recognition efficiency of the computer file similarity recognition system and method based on image analysis according to an embodiment of the present invention as a function of the number of experiments, compared with the prior art;

Fig. 7 is a schematic diagram of an experimental curve of the recognition accuracy of the computer file similarity recognition system and method based on image analysis according to an embodiment of the present invention as a function of the number of experiments, compared with the prior art.

Detailed Description

The method of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments of the invention.

Example 1

As shown in fig. 1 and 3, a computer file similarity recognition system based on image analysis, the system comprising: a file attribute data extraction unit configured to extract basic attributes of two target files to be compared, namely a first target file and a second target file, the basic attributes comprising at least: file name, file type, file size, file location, file creation time and file modification time; a file content extraction unit configured to open the first target file and the second target file, extract the contents of the two files and temporarily store the extracted file contents; a file content conversion unit configured to convert the extracted file contents into corresponding image contents; and a file similarity identification unit comprising a first similarity identification unit, a second similarity identification unit and a result generation unit; wherein the first similarity identification unit is configured to judge the similarity of the two files according to their basic attributes to obtain a first judgment result; the second similarity identification unit is configured to judge the similarity of the two files according to the image contents, based on a preset neural network model, to obtain a second judgment result; and the result generation unit is configured to calculate a final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values.

With this technical scheme, the similarity of files can be identified accurately by analyzing the file attributes and by converting the file contents into image contents for similarity matching analysis, with high identification accuracy and high efficiency. This is mainly achieved as follows: identification is carried out by two different means. The analysis and identification of the file attributes is a "rough identification" process whose purpose is to screen out obviously dissimilar files, so that such files need not be compared further and identification efficiency is not reduced. In addition, for two files that pass the rough identification, the file content is converted into image content for identification; if a file is originally an image, it is likewise converted and the resulting image is identified directly, which safeguards identification efficiency. Moreover, image identification treats the file as a whole instead of identifying each character or word, which significantly improves identification efficiency.

Example 2

On the basis of the previous embodiment, the method for judging the similarity of the two files by the first similarity identification unit according to the basic attributes of the two files to obtain the first judgment result executes the following steps: matching and identifying items to be matched, such as file names, file types, file sizes, file positions, file creating time and file modifying time of the first target file and the second target file respectively; the matching identification method comprises the following steps: acquiring a keyword to which the character belongs and an index position of the character in the keyword to which the character belongs according to a keyword set for each character in a to-be-matched item; judging whether the character is a first character of the keyword according to the index position of the character in the keyword to which the character belongs; if the character is the first character of the affiliated keyword, recording the affiliated keyword of the character in a matching information set, and marking the first character of the keyword in the record to exist in the item to be matched; if the character is not the first character of the corresponding keyword and the record of the keyword to which the character belongs exists in the matching information set, acquiring the record of the keyword to which the character belongs, and marking the character in the keyword existing in the item to be matched in the record; when all characters in a keyword are marked and exist in the item to be matched, judging that the keyword is hit by the item to be matched; if the items to be matched of the two files hit the same keyword, judging that the items to be matched are matched; and if the number of the matched items to be matched is greater than that of the unmatched items to be matched, the first judgment result is that the two files are matched and is marked as 1.

Specifically, the file format (or file type) refers to a special encoding method for information used by a computer to store information, and is used for identifying data stored inside. For example, some store pictures, some store programs, and some store text messages. Each type of information may be stored in one or more file formats in computer storage. Each file format typically has one or more extensions that can be identified, but may not have extensions. The extension may help the application identify the file format.

File attributes define particular properties of a file and are used to classify files into different types for storage and transmission. Common file attributes are the system, hidden, read-only and archive attributes.

Attributes are descriptive information that can help you find and organize files. They are not contained in the actual content of the file, but provide information about the file. In addition to the tag property (a custom property that may contain any text you choose), a file includes many other properties, such as date modified, author and rating.
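As a rough illustration of how the basic attributes used for comparison (file name, type, size, location, creation time and modification time) might be collected, the following Python sketch uses the standard library. The function name and dictionary keys are hypothetical, and using the file extension as the "file type" is an assumption.

```python
from pathlib import Path


def basic_attributes(path):
    """Collect the basic attributes of one file into a dictionary."""
    p = Path(path)
    st = p.stat()
    # st_birthtime is only available on some platforms; elsewhere fall back to
    # st_ctime, which is the creation time on Windows but the time of the last
    # metadata change on most Unix systems.
    created = getattr(st, "st_birthtime", st.st_ctime)
    return {
        "name": p.name,
        "type": p.suffix.lstrip(".").lower(),  # extension taken as file type
        "size": st.st_size,                    # size in bytes
        "location": str(p.resolve().parent),   # directory containing the file
        "created": created,                    # seconds since the epoch
        "modified": st.st_mtime,               # seconds since the epoch
    }
```

Calling the function once per target file yields two dictionaries whose corresponding entries can then be matched item by item.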

Example 3

Referring to fig. 4 and 5, on the basis of the previous embodiment, the second similarity identification unit includes: a local probability model estimation subunit configured to calculate a probability for each local region of the image content using the following formula:

[Formula image not reproduced in the source text]

Specifically, refer to fig. 4 and 5. The image division subunit first divides the image content into unit domains, each of which has many feature quantities of the image content. The feature quantities in the image content unit domains are compared, and the similarity of the feature quantities of each unit domain is calculated as the unit domain similarity. And then, the unit domain similarity is weighted by using the unit domain-based weight value obtained from the local region weight value, so that the image similarity between the image content of the second target file and the image content of the first target file is calculated.
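The weighting of unit domain similarities can be sketched in Python as follows. This is only an illustration under stated assumptions: images are grayscale matrices of equal size, the feature quantity of a unit domain is its mean intensity, and the weight of a unit domain is the fraction of non-blank (non-zero) pixels it contains. That weight is a stand-in for the local region weight value, whose probability formula is not available here.

```python
from typing import List, Sequence

Image = Sequence[Sequence[int]]  # grayscale values in 0..255


def split_units(img: Image, size: int) -> List[List[int]]:
    """Divide an image into size x size unit domains, each as a flat pixel list."""
    units = []
    for top in range(0, len(img), size):
        for left in range(0, len(img[0]), size):
            units.append([v for row in img[top:top + size]
                          for v in row[left:left + size]])
    return units


def image_similarity(first: Image, second: Image, size: int = 2) -> float:
    """Weighted mean of unit domain similarities between two equal-sized images."""
    if not first or len(first) != len(second) or len(first[0]) != len(second[0]):
        raise ValueError("images must be non-empty and have the same dimensions")
    units_a, units_b = split_units(first, size), split_units(second, size)
    sims, weights = [], []
    for ua, ub in zip(units_a, units_b):
        # Assumed feature quantity: mean intensity of the unit domain.
        fa, fb = sum(ua) / len(ua), sum(ub) / len(ub)
        sims.append(1.0 - abs(fa - fb) / 255.0)
        # Stand-in weight: share of non-blank pixels across both unit domains.
        content = sum(1 for v in ua if v) + sum(1 for v in ub if v)
        weights.append(content / (len(ua) + len(ub)))
    total = sum(weights)
    if total == 0:
        return 1.0  # both images are entirely blank
    return sum(w * s for w, s in zip(weights, sims)) / total
```

With these stand-in weights, blank unit domains contribute nothing to the result, so a mostly empty page is compared only on the regions that carry content.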

Example 4

On the basis of the previous embodiment, the method by which the result generating unit calculates the final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values performs the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity value is less than 0.8, the final similarity identification result is that the two files are not similar; if the final similarity value is not less than 0.8, the final similarity identification result is that the two files are similar; wherein A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.

Example 5

On the basis of the above embodiment, the file content conversion unit, the method for converting the extracted file content into the corresponding image content, performs the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, corresponding each value in the matrix to an RGB value, and then regarding the matrix as the image content.

Example 6

As shown in fig. 2, a computer file similarity recognition method based on image analysis, the method performing the following steps: Step 1: extracting basic attributes of two target files to be compared, namely a first target file and a second target file, the basic attributes comprising at least: file name, file type, file size, file location, file creation time and file modification time; Step 2: opening the first target file and the second target file, extracting the contents of the two files, and temporarily storing the extracted file contents; Step 3: converting the extracted file contents into corresponding image contents; Step 4: calculating a final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values.

Example 7

On the basis of the above embodiment, the step 4: the method for judging the similarity of the two files according to the basic attributes of the two files to obtain a first judgment result executes the following steps: matching and identifying items to be matched, such as file names, file types, file sizes, file positions, file creating time and file modifying time of the first target file and the second target file respectively; the matching identification method comprises the following steps: acquiring a keyword to which the character belongs and an index position of the character in the keyword to which the character belongs according to a keyword set for each character in a to-be-matched item; judging whether the character is a first character of the keyword according to the index position of the character in the keyword to which the character belongs; if the character is the first character of the affiliated keyword, recording the affiliated keyword of the character in a matching information set, and marking the first character of the keyword in the record to exist in the item to be matched; if the character is not the first character of the corresponding keyword and the record of the keyword to which the character belongs exists in the matching information set, acquiring the record of the keyword to which the character belongs, and marking the character in the keyword existing in the item to be matched in the record; when all characters in a keyword are marked and exist in the item to be matched, judging that the keyword is hit by the item to be matched; if the items to be matched of the two files hit the same keyword, judging that the items to be matched are matched; and if the number of the matched items to be matched is greater than that of the unmatched items to be matched, the first judgment result is that the two files are matched and is marked as 1.

Example 8

On the basis of the above embodiment, in step 4, the method for judging the similarity of the two files according to the image content to obtain the second judgment result comprises the following steps: calculating a probability for each local region of the image content using the following formula:

wherein […] is the transpose of a matrix, w_i is a predetermined template matrix, and b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10; m is a probability adjustment value with a value range of 0.2 to 0.6; calculating, from the probability of each local region, a weight value for that local region in the image as the local region weight value; dividing the image content of the second target file into unit domains; extracting a feature quantity from each of the divided unit domains as a unit domain feature quantity of the image content of the second target file; comparing the unit domain feature quantities of the image content of the first target file, which are prepared in advance, with the unit domain feature quantities of the image content of the second target file; calculating the similarity of the feature quantities of each unit domain as the unit domain similarity; and calculating the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarities with unit-domain-based weight values derived from the local region weight values.
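The unit domain comparison described above might be sketched as follows. Because the probability formula is not reproduced in this text, the sketch takes the unit-domain weight values as an input instead of computing them. The names `split_into_unit_domains`, `feature` and `image_similarity` are hypothetical, and the use of the mean pixel value as the feature quantity, an absolute-difference similarity, and a grayscale image with values from 0 to 255 are all assumptions made only for illustration.

```python
def split_into_unit_domains(image, block):
    """Divide a 2-D grayscale image (a list of rows) into block x block unit domains."""
    rows, cols = len(image), len(image[0])
    domains = []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            domains.append([row[c:c + block] for row in image[r:r + block]])
    return domains


def feature(domain):
    """Assumed feature quantity: the mean pixel value of the unit domain."""
    values = [v for row in domain for v in row]
    return sum(values) / len(values)


def image_similarity(first_features, second_image, weights, block):
    """Weight per-unit-domain similarities into a single image similarity in [0, 1]."""
    second_features = [feature(d) for d in split_into_unit_domains(second_image, block)]
    if not (len(first_features) == len(second_features) == len(weights)):
        raise ValueError("feature lists and weights must have the same length")
    # Assumed unit domain similarity: 1 minus the normalized absolute difference.
    sims = [1.0 - abs(a - b) / 255.0 for a, b in zip(first_features, second_features)]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)
```

Here `first_features` plays the role of the unit domain feature quantities prepared in advance for the first target file, while the second target file is divided and measured at comparison time.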

Referring to fig. 6, the file attributes are analyzed and identified first; this is a "rough identification" process intended to screen out obviously dissimilar files, so that obviously dissimilar files are not compared in detail and the recognition efficiency is not reduced. For two files that pass the rough identification, the file content is converted into image content for recognition; if a file is originally an image, it is not converted and the image is recognized directly, which preserves the recognition efficiency. In addition, during image recognition the whole file is recognized directly, rather than each character or word, so the recognition efficiency is significantly improved.

Example 9

On the basis of the above embodiment, in step 4, the method for calculating the final similarity recognition result by weighting, based on the preset weight values of the judgment results, performs the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity is less than 0.8, the final similarity recognition result is that the two files are not similar; if the final similarity is not less than 0.8, the final similarity recognition result is that the two files are similar; wherein A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.

Referring to fig. 7, the final result of the present invention is obtained by weighting, and the values of A and B can be adjusted according to the actual recognition results, so that the recognition accuracy can be tuned and gradually improved as the number of experiments increases.
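The weighting of this embodiment might be sketched as follows. The function name `final_similarity` and the default value of `weight_a` are hypothetical; only the formula, the constraint A + B = 1 and the threshold 0.8 are taken from the description above.

```python
def final_similarity(first_result, second_result, weight_a=0.5, threshold=0.8):
    """Return (final similarity, is_similar) for two judgment results.

    weight_a is A; B is derived as 1 - A so that A + B = 1 always holds.
    The default of 0.5 is an illustrative assumption, not a value from the text.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie between 0 and 1")
    weight_b = 1.0 - weight_a
    score = first_result * weight_a + second_result * weight_b
    # "Not less than 0.8" means the two files are judged to be similar.
    return score, score >= threshold
```

For example, with A = B = 0.5, a first judgment result of 1 and a second judgment result of 0.5 give a final similarity of 0.75, so the two files are judged not to be similar.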

Example 10

On the basis of the above embodiment, in step 3, the method for converting the extracted file content into the corresponding image content performs the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then treating the matrix as the image content.
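The conversion of step 3 might be sketched as follows. The embodiment does not specify the matrix width, how a partially filled last row is handled, or how a matrix value maps to an RGB value; the sketch therefore assumes a configurable width, zero padding, and a mapping of 0 to black and 1 to white. The function name `content_to_image` is hypothetical.

```python
def content_to_image(content: bytes, width: int = 8):
    """Turn file bytes into a matrix of RGB tuples, one pixel per bit (illustrative only)."""
    # Binary character string: eight '0'/'1' characters per byte.
    bits = "".join(f"{byte:08b}" for byte in content)
    # Assumption: pad the last row with zeros so that the matrix is rectangular.
    if len(bits) % width:
        bits += "0" * (width - len(bits) % width)
    # Assumption: 0 maps to black and 1 maps to white.
    palette = {"0": (0, 0, 0), "1": (255, 255, 255)}
    return [
        [palette[b] for b in bits[i:i + width]]
        for i in range(0, len(bits), width)
    ]
```

The resulting list of rows can be handed to any image routine that accepts RGB triples, and two files converted with the same width yield matrices that can be compared region by region.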

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.

It should be noted that, the system provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Those of skill in the art would appreciate that the various illustrative modules, method steps, and modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims

1. A computer file similarity identification system based on image analysis, the system comprising: a file attribute data extraction unit configured to extract the basic attributes of two target files to be compared, the target files being a first target file and a second target file, and the basic attributes comprising at least: file name, file type, file size, file position, file creation time and file modification time; a file content extraction unit configured to open the first target file and the second target file, extract the contents of the two files and temporarily store the extracted file contents; a file content conversion unit configured to convert the extracted file contents into corresponding image content; and a file similarity identification unit comprising: a first similarity identification unit, a second similarity identification unit and a result generation unit; the first similarity identification unit is configured to judge the similarity of the two files according to the basic attributes of the two files to obtain a first judgment result; the second similarity identification unit is configured to judge the similarity of the two files according to the image content, based on a preset neural network model, to obtain a second judgment result; and the result generation unit is configured to calculate, according to the first judgment result and the second judgment result, a final similarity recognition result by weighting based on preset weight values of the judgment results.

2. The system according to claim 1, wherein the method by which the first similarity identification unit judges the similarity of the two files according to their basic attributes to obtain the first judgment result performs the following steps: performing matching identification on each item to be matched of the first target file and the second target file, the items to be matched being the file name, file type, file size, file position, file creation time and file modification time; the matching identification method comprises the following steps: for each character in an item to be matched, acquiring, according to a preset keyword set, the keyword to which the character belongs and the index position of the character within that keyword; judging, from the index position, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword in a matching information set and marking, in that record, the first character of the keyword as present in the item to be matched; if the character is not the first character of the keyword and a record of the keyword already exists in the matching information set, acquiring that record and marking, in it, the character as present in the item to be matched; when all characters of a keyword are marked as present in the item to be matched, judging that the keyword is hit by the item to be matched; if the corresponding items to be matched of the two files hit the same keyword, judging that the items match; and if the number of matched items is greater than the number of unmatched items, the first judgment result is that the two files match, recorded as 1.

3. The system of claim 2, wherein the second similarity identification unit comprises: a local probability model estimation subunit configured to calculate a probability for each local region of the image content using the following formula:

wherein i is the index of each local region, n is the number of local regions, σ(x_i) represents the probability of local region x_i, and each local region x_i is in the form of a matrix,

4. The system of claim 3, wherein the method by which the result generation unit calculates the final similarity recognition result by weighting, according to the first judgment result and the second judgment result and based on the preset weight values of the judgment results, performs the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity is less than 0.8, the final similarity recognition result is that the two files are not similar; if the final similarity is not less than 0.8, the final similarity recognition result is that the two files are similar; wherein A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.

5. The system of claim 4, wherein the method by which the file content conversion unit converts the extracted file content into the corresponding image content performs the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then treating the matrix as the image content.

6. A computer file similarity recognition method based on image analysis, based on the system according to any one of claims 1 to 5, characterized in that the method performs the following steps: step 1: extracting the basic attributes of two target files to be compared, the target files being a first target file and a second target file, and the basic attributes comprising at least: file name, file type, file size, file position, file creation time and file modification time; step 2: opening the first target file and the second target file, extracting the contents of the two files, and temporarily storing the extracted file contents; step 3: converting the extracted file contents into corresponding image content; step 4: judging the similarity of the two files according to their basic attributes to obtain a first judgment result, judging the similarity of the two files according to the image content to obtain a second judgment result, and, according to the first judgment result and the second judgment result, calculating a final similarity recognition result by weighting based on preset weight values of the judgment results.

7. The method of claim 6, wherein, in step 4, the method for judging the similarity of the two files according to their basic attributes to obtain the first judgment result performs the following steps: performing matching identification on each item to be matched of the first target file and the second target file, the items to be matched being the file name, file type, file size, file position, file creation time and file modification time; the matching identification method comprises the following steps: for each character in an item to be matched, acquiring, according to a preset keyword set, the keyword to which the character belongs and the index position of the character within that keyword; judging, from the index position, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword in a matching information set and marking, in that record, the first character of the keyword as present in the item to be matched; if the character is not the first character of the keyword and a record of the keyword already exists in the matching information set, acquiring that record and marking, in it, the character as present in the item to be matched; when all characters of a keyword are marked as present in the item to be matched, judging that the keyword is hit by the item to be matched; if the corresponding items to be matched of the two files hit the same keyword, judging that the items match; and if the number of matched items is greater than the number of unmatched items, the first judgment result is that the two files match, recorded as 1.

8. The method of claim 7, wherein, in step 4, the method for judging the similarity of the two files according to the image content to obtain the second judgment result comprises the following steps: calculating a probability for each local region of the image content using the following formula:

wherein […] is the transpose of a matrix, w_i is a predetermined template matrix, and b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10; m is a probability adjustment value with a value range of 0.2 to 0.6; calculating, from the probability of each local region, a weight value for that local region in the image as the local region weight value; dividing the image content of the second target file into unit domains; extracting a feature quantity from each of the divided unit domains as a unit domain feature quantity of the image content of the second target file; comparing the unit domain feature quantities of the image content of the first target file, which are prepared in advance, with the unit domain feature quantities of the image content of the second target file; calculating the similarity of the feature quantities of each unit domain as the unit domain similarity; and calculating the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarities with unit-domain-based weight values derived from the local region weight values.

9. The method of claim 8, wherein, in step 4, the method for calculating the final similarity recognition result by weighting, based on the preset weight values of the judgment results, performs the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity is less than 0.8, the final similarity recognition result is that the two files are not similar; if the final similarity is not less than 0.8, the final similarity recognition result is that the two files are similar; wherein A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.

10. The method of claim 9, wherein, in step 3, the method for converting the extracted file content into the corresponding image content performs the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then treating the matrix as the image content.