CN111666928A - Computer file similarity recognition system and method based on image analysis - Google Patents

Computer file similarity recognition system and method based on image analysis

Info

Publication number
CN111666928A
Authority
CN
China
Prior art keywords
file
similarity
character
keyword
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010689843.2A
Other languages
Chinese (zh)
Inventor
宋国训
魏磊
仲伟付
杨秀红
刘曌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202010689843.2A
Publication of CN111666928A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/418 Document matching, e.g. of document images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computers, and particularly relates to a computer file similarity recognition system and method based on image analysis. The system comprises: a file attribute data extraction unit configured to extract the basic attributes of two target files to be compared, namely a first target file and a second target file, the basic attributes comprising at least: file name, file type, file size, file position, file creation time and file modification time; and a file content extraction unit configured to open the first target file and the second target file, extract the contents of the two files and temporarily store the extracted contents. By analyzing the attribute data of the files and converting the file content into image content for similarity matching analysis, the invention can accurately identify the similarity of files, with high recognition accuracy and high efficiency.

Description

Computer file similarity recognition system and method based on image analysis
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a computer file similarity recognition system and method based on image analysis.
Background
File similarity calculation analyzes and computes the similarity of files using information from the files themselves (file contents and connection information). It has been widely applied in fields such as information retrieval, collaborative recommendation systems and library classification systems.
Existing methods for detecting file similarity generally comprise the following steps:
(1) After basic simplification of each file in the submitted file set, each file is divided into consecutive token blocks; a certain number of representative token blocks are retained; the representative token blocks are made into a unique representative fingerprint, and each file is signed with its own representative fingerprint.
(2) Whether the signature fingerprints of two files are the same is judged; if so, the two files are deemed related and similar; otherwise, they are deemed unrelated and not similar.
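The fingerprint approach in steps (1) and (2) can be sketched as follows. This is a minimal illustration under assumptions that are not in the source: the "basic simplification" is taken to be lowercasing and whitespace removal, blocks have a fixed length, the representative blocks are the ones with the smallest SHA-1 hashes, and the names `fingerprint`, `same_fingerprint`, `block_size` and `keep` are hypothetical.

```python
import hashlib


def fingerprint(text, block_size=8, keep=4):
    """Return a small set of representative block hashes for `text`."""
    # Assumed simplification: lowercase and drop all whitespace.
    simplified = "".join(text.lower().split())
    # Split into consecutive, non-overlapping blocks.
    blocks = [simplified[i:i + block_size]
              for i in range(0, len(simplified), block_size)]
    hashes = sorted({hashlib.sha1(b.encode("utf-8")).hexdigest() for b in blocks})
    # Assumed selection rule: keep the `keep` smallest hashes.
    return frozenset(hashes[:keep])


def same_fingerprint(a, b):
    """Step (2): two files are deemed similar only if their fingerprints are equal."""
    return fingerprint(a) == fingerprint(b)
```

Because the decision is an equality test on fingerprints, a change that alters a retained block flips the result from "similar" to "not similar", which is one reason such a method gives no graded similarity score.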
Patent No. CN201410421951.6A discloses a deduplication method and system based on file similarity. The method comprises: extracting plain text from the files to be compared; normalizing the plain text to generate normalized character units; encoding the normalized character units and generating fixed-length irreversible representative codes through an encoding algorithm; extracting keywords from the representative codes of the files to be compared to generate keyword sequences; calculating the word-form similarity and word-order similarity of the sentences to be compared from their keyword sequences; calculating the similarity of the sentences from the word-form similarity and word-order similarity; and calculating the similarity of the files from the similarity of the sentences. That method is suited to Chinese text, which is convenient for domestic users, and achieves relatively high accuracy when comparing similar files.
Patent No. CN2007101058353A discloses a system and method for detecting file similarity. The method comprises: extracting the plain-text part from each file to be examined; splitting the extracted plain text into character units; encoding the character units; and comparing the encoded character units of one file with those of at least one other file to determine the similarity of the files.
It can be seen that existing methods for detecting file similarity have the following drawbacks: they generally need to match and identify the entire content of a file, so efficiency is low; and they lack an effective means of identification for many image files.
Disclosure of Invention
In view of the above, the present invention is directed to a computer file similarity recognition system and method based on image analysis, which can accurately recognize the similarity of files by analyzing the attribute data of the files and performing similarity matching analysis by converting the file content into the image content, and has the advantages of high recognition accuracy and high efficiency.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
A computer file similarity recognition system based on image analysis, the system comprising: a file attribute data extraction unit configured to extract the basic attributes of two target files to be compared, namely a first target file and a second target file, the basic attributes comprising at least: file name, file type, file size, file position, file creation time and file modification time; a file content extraction unit configured to open the first target file and the second target file, extract the contents of the two files and temporarily store the extracted contents; a file content conversion unit configured to convert the extracted file content into corresponding image content; and a file similarity identification unit comprising a first similarity identification unit, a second similarity identification unit and a result generation unit; wherein the first similarity identification unit is configured to judge the similarity of the two files according to their basic attributes to obtain a first judgment result; the second similarity identification unit is configured to judge the similarity of the two files according to the image content, based on a preset neural network model, to obtain a second judgment result; and the result generation unit is configured to compute a final similarity recognition result as a weighted combination of the first judgment result and the second judgment result using preset weight values.
Further, the first similarity identification unit judges the similarity of the two files according to their basic attributes, obtaining the first judgment result, by performing the following steps: the items to be matched (file name, file type, file size, file position, file creation time and file modification time) of the first target file and the second target file are each subjected to matching identification. The matching identification proceeds as follows: for each character in an item to be matched, the keyword to which the character belongs and the index position of the character within that keyword are obtained from a keyword set; whether the character is the first character of the keyword is judged from its index position; if the character is the first character of its keyword, the keyword is recorded in a matching information set and, in that record, the first character of the keyword is marked as present in the item to be matched; if the character is not the first character of its keyword and a record of that keyword already exists in the matching information set, the record is retrieved and the character is marked in it as present in the item to be matched; when all characters of a keyword have been marked as present in the item to be matched, the keyword is judged to be hit by the item; if the corresponding items to be matched of the two files hit the same keyword, those items are judged to match; and if the number of matching items is greater than the number of non-matching items, the first judgment result is that the two files match, recorded as 1.
Further, the second similarity identification unit comprises: a local probability model estimation subunit configured to calculate a probability for each local region of the image content using the following formula:
[Formula image BDA0002588947470000031; the formula is not reproduced in this text]
where i is the index of a local region, n is the number of local regions, σ(x_i) is the term for local region x_i, each local region x_i is in matrix form, the symbol shown in
[inline image BDA0002588947470000032]
denotes a matrix transpose, w_i is a preset template matrix, b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10, and m is a probability adjustment value, with a value range of 0.2 to 0.6; a local region weight calculation subunit configured to calculate, from the probability of each local region, a weight value for that region as the local region weight value; an image segmentation subunit configured to divide the image content of the second target file into unit domains; a unit domain feature quantity extraction subunit configured to extract, from the divided unit domains, a feature quantity of each unit domain as the unit domain feature quantity of the image content of the second target file; a unit domain similarity calculation subunit configured to compare the unit domain feature quantities of the image content of the second target file with those of the image content of the first target file, the latter being prepared in advance, and to calculate the similarity of the feature quantities of each unit domain as the unit domain similarity; and an image similarity calculation subunit configured to calculate the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarities with unit-domain-based weight values derived from the local region weight values.
Further, the result generation unit computes the final similarity recognition result from the first judgment result, the second judgment result and the preset weight values by performing the following steps: the final similarity is calculated using the formula: final similarity = first judgment result × A + second judgment result × B, where A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1; if the final similarity is less than 0.8, the final similarity recognition result is that the two files are not similar; if the final similarity is not less than 0.8, the final similarity recognition result is that the two files are similar.
Further, the file content conversion unit converts the extracted file content into corresponding image content by performing the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then treating the matrix as the image content.
A computer file similarity recognition method based on image analysis, the method comprising the following steps: step 1: extracting the basic attributes of two target files to be compared, namely a first target file and a second target file, the basic attributes comprising at least: file name, file type, file size, file position, file creation time and file modification time; step 2: opening the first target file and the second target file, extracting the contents of the two files and temporarily storing the extracted contents; step 3: converting the extracted file content into corresponding image content; step 4: judging the similarity of the two files according to their basic attributes to obtain a first judgment result, judging the similarity of the two files according to the image content to obtain a second judgment result, and computing a final similarity recognition result as a weighted combination of the first judgment result and the second judgment result using preset weight values.
Further, in step 4, judging the similarity of the two files according to their basic attributes to obtain the first judgment result comprises the following steps: the items to be matched (file name, file type, file size, file position, file creation time and file modification time) of the first target file and the second target file are each subjected to matching identification. The matching identification proceeds as follows: for each character in an item to be matched, the keyword to which the character belongs and the index position of the character within that keyword are obtained from a keyword set; whether the character is the first character of the keyword is judged from its index position; if the character is the first character of its keyword, the keyword is recorded in a matching information set and, in that record, the first character of the keyword is marked as present in the item to be matched; if the character is not the first character of its keyword and a record of that keyword already exists in the matching information set, the record is retrieved and the character is marked in it as present in the item to be matched; when all characters of a keyword have been marked as present in the item to be matched, the keyword is judged to be hit by the item; if the corresponding items to be matched of the two files hit the same keyword, those items are judged to match; and if the number of matching items is greater than the number of non-matching items, the first judgment result is that the two files match, recorded as 1.
Further, in step 4, judging the similarity of the two files according to the image content to obtain the second judgment result comprises the following steps: calculating a probability for each local region of the image content using the following formula:
[Formula image BDA0002588947470000051; the formula is not reproduced in this text]
where i is the index of a local region, n is the number of local regions, σ(x_i) is the term for local region x_i, each local region x_i is in matrix form, the symbol shown in
[inline image BDA0002588947470000052]
denotes a matrix transpose, w_i is a preset template matrix, b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10, and m is a probability adjustment value, with a value range of 0.2 to 0.6; calculating, from the probability of each local region, a weight value for that region as the local region weight value; dividing the image content of the second target file into unit domains; extracting, from the divided unit domains, a feature quantity of each unit domain as the unit domain feature quantity of the image content of the second target file; comparing the unit domain feature quantities of the image content of the second target file with those of the image content of the first target file, the latter being prepared in advance; calculating the similarity of the feature quantities of each unit domain as the unit domain similarity; and calculating the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarities with unit-domain-based weight values derived from the local region weight values.
Further, in step 4, computing the final similarity recognition result by weighting with the preset weight values comprises the following steps: the final similarity is calculated using the formula: final similarity = first judgment result × A + second judgment result × B, where A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1; if the final similarity is less than 0.8, the final similarity recognition result is that the two files are not similar; if the final similarity is not less than 0.8, the final similarity recognition result is that the two files are similar.
Further, in step 3, converting the extracted file content into corresponding image content comprises the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then treating the matrix as the image content.
The computer file similarity recognition system and method based on image analysis have the following beneficial effects: by analyzing the attribute data of the files and converting the file content into image content for similarity matching analysis, the similarity of files can be identified accurately, with high recognition accuracy and high efficiency. This is achieved mainly as follows: 1. the prior art identifies file similarity directly from file content, whereas the invention also uses file attributes, which are important information characterizing a file; adding them to similarity recognition allows obviously dissimilar files to be screened out quickly, so that they need not be compared further, which improves recognition efficiency; 2. during matching identification of the file attributes, the keyword to which each character belongs and the index position of the character within that keyword are obtained from the keyword set, which avoids identifying the entire content of the file attributes and further improves recognition efficiency; 3. during image recognition, the unit domain similarities are weighted with unit-domain-based weight values derived from the local region weight values; compared with similarity recognition over the whole image, this has a marked advantage for files with many blank regions, because recognition at the level of unit domains avoids spending the same algorithmic effort on blank regions of the image, and only unit domains with content need to be identified, which improves recognition efficiency while maintaining accuracy; 4. obtaining the final recognition result by weighting the two recognition results effectively avoids the low accuracy of a single recognition method and improves recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of the system structure of a computer file similarity recognition system based on image analysis according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of the file similarity identification unit of the computer file similarity recognition system based on image analysis according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a computer file similarity recognition method based on image analysis according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the principle by which the second similarity identification unit divides the image content of the second target file into unit domains, in the computer file similarity recognition system and method based on image analysis according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the principle by which the second similarity identification unit compares the similarity between the unit domains of the first target file and the second target file, in the computer file similarity recognition system and method based on image analysis according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an experimental curve of recognition efficiency versus number of experiments for the computer file similarity recognition system and method based on image analysis according to an embodiment of the present invention, together with the result of a comparative experiment on the prior art;
FIG. 7 is a schematic diagram of an experimental curve of recognition accuracy versus number of experiments for the computer file similarity recognition system and method based on image analysis according to an embodiment of the present invention, together with the result of a comparative experiment on the prior art.
Detailed Description
The method of the present invention will be described in further detail below with reference to the accompanying drawings and embodiments of the invention.
Example 1
As shown in FIG. 1 and FIG. 2, a computer file similarity recognition system based on image analysis, the system comprising: a file attribute data extraction unit configured to extract the basic attributes of two target files to be compared, namely a first target file and a second target file, the basic attributes comprising at least: file name, file type, file size, file position, file creation time and file modification time; a file content extraction unit configured to open the first target file and the second target file, extract the contents of the two files and temporarily store the extracted contents; a file content conversion unit configured to convert the extracted file content into corresponding image content; and a file similarity identification unit comprising a first similarity identification unit, a second similarity identification unit and a result generation unit; wherein the first similarity identification unit is configured to judge the similarity of the two files according to their basic attributes to obtain a first judgment result; the second similarity identification unit is configured to judge the similarity of the two files according to the image content, based on a preset neural network model, to obtain a second judgment result; and the result generation unit is configured to compute a final similarity recognition result as a weighted combination of the first judgment result and the second judgment result using preset weight values.
With this technical scheme, the similarity of files can be identified accurately by analyzing the file attributes and converting the file content into image content for similarity matching analysis, with high recognition accuracy and high efficiency. This is achieved mainly as follows. Recognition is performed by two different means. Analysis of the file attributes is a "rough recognition" stage whose purpose is to screen out obviously dissimilar files, so that they are not compared further at the cost of recognition efficiency. For two files that pass the rough recognition, the file content is converted into image content for recognition; if a file is already an image, no conversion is needed and the image is recognized directly, which preserves recognition efficiency. In addition, image recognition treats the file as a whole rather than recognizing each character or word, which markedly improves recognition efficiency.
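The extraction of the six basic attributes listed in this embodiment could look like the following. This is only a sketch under assumptions: the function name `basic_attributes` and the dictionary keys are hypothetical, "file type" is approximated by the file extension, and "file position" by the containing directory. Creation time is platform dependent, so the code falls back to `st_ctime` where `st_birthtime` is unavailable.

```python
from pathlib import Path


def basic_attributes(path):
    """Collect the six basic attributes of one target file (illustrative keys)."""
    p = Path(path)
    st = p.stat()
    # st_birthtime exists only on some platforms (e.g. macOS). st_ctime is the
    # creation time on Windows but the metadata-change time on Linux.
    created = getattr(st, "st_birthtime", st.st_ctime)
    return {
        "name": p.stem,                      # file name without extension
        "type": p.suffix.lower(),            # assumed proxy for file type
        "size": st.st_size,                  # size in bytes
        "position": str(p.resolve().parent), # assumed proxy for file position
        "created": created,
        "modified": st.st_mtime,
    }
```

Calling `basic_attributes` on each of the two target files yields two dictionaries with the same keys, which is a convenient shape for item-by-item comparison.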
Example 2
On the basis of the previous embodiment, the first similarity identification unit judges the similarity of the two files according to their basic attributes, obtaining the first judgment result, by performing the following steps: the items to be matched (file name, file type, file size, file position, file creation time and file modification time) of the first target file and the second target file are each subjected to matching identification. The matching identification proceeds as follows: for each character in an item to be matched, the keyword to which the character belongs and the index position of the character within that keyword are obtained from a keyword set; whether the character is the first character of the keyword is judged from its index position; if the character is the first character of its keyword, the keyword is recorded in a matching information set and, in that record, the first character of the keyword is marked as present in the item to be matched; if the character is not the first character of its keyword and a record of that keyword already exists in the matching information set, the record is retrieved and the character is marked in it as present in the item to be matched; when all characters of a keyword have been marked as present in the item to be matched, the keyword is judged to be hit by the item; if the corresponding items to be matched of the two files hit the same keyword, those items are judged to match; and if the number of matching items is greater than the number of non-matching items, the first judgment result is that the two files match, recorded as 1.
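The character-marking scheme of this embodiment can be sketched as follows. The function names, the dictionary-based "matching information set" and the example keyword list are illustrative assumptions. The text does not say what the first judgment result is when the files do not match, so returning 0 in that case is also an assumption.

```python
from collections import defaultdict


def build_char_index(keywords):
    """Map each character to the (keyword, index) pairs where it occurs."""
    index = defaultdict(list)
    for kw in keywords:
        for pos, ch in enumerate(kw):
            index[ch].append((kw, pos))
    return index


def hit_keywords(item, keywords):
    """Return the keywords hit by `item` under the marking scheme."""
    index = build_char_index(keywords)
    records = {}  # matching information set: keyword -> per-character marks
    for ch in item:
        for kw, pos in index.get(ch, ()):
            if pos == 0:
                # First character: create the record if needed, mark position 0.
                records.setdefault(kw, [False] * len(kw))[0] = True
            elif kw in records:
                # Later character: mark it only if the keyword has a record.
                records[kw][pos] = True
    return {kw for kw, marks in records.items() if all(marks)}


def first_judgment(attrs_a, attrs_b, keywords):
    """Return 1 if more items match than not; 0 otherwise (assumed)."""
    matched = 0
    for key in attrs_a:
        hits_a = hit_keywords(str(attrs_a[key]), keywords)
        hits_b = hit_keywords(str(attrs_b.get(key, "")), keywords)
        if hits_a & hits_b:  # both items hit at least one common keyword
            matched += 1
    unmatched = len(attrs_a) - matched
    return 1 if matched > unmatched else 0
```

As described, the scheme only requires that a keyword's first character appear before its other characters are marked; it does not require the characters to be contiguous, so it is a looser test than substring search.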
Specifically, a file format (or file type) is a particular way of encoding information for storage on a computer and identifies the kind of data stored inside: some files store pictures, some store programs, and some store text. Each kind of information may be stored in one or more file formats. Each file format typically has one or more recognizable extensions, although a file may have no extension. The extension helps applications identify the file format.
File attributes are properties defined on a file that classify it for storage and transmission. Common file attributes are the system, hidden, read-only and archive attributes.
Attributes are descriptive information that can be used to find and sort files. They are not part of the actual content of a file but provide information about it. In addition to the tag attribute (a custom attribute that may contain any chosen text), a file has many other attributes, such as modification date, author and rating.
Example 3
Referring to FIG. 4 and FIG. 5, on the basis of the previous embodiment, the second similarity identification unit comprises: a local probability model estimation subunit configured to calculate a probability for each local region of the image content using the following formula:
[Formula image BDA0002588947470000101; the formula is not reproduced in this text]
where i is the index of a local region, n is the number of local regions, σ(x_i) is the term for local region x_i, each local region x_i is in matrix form, the symbol shown in
[inline image BDA0002588947470000102]
denotes a matrix transpose, w_i is a preset template matrix, b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10, and m is a probability adjustment value, with a value range of 0.2 to 0.6; a local region weight calculation subunit configured to calculate, from the probability of each local region, a weight value for that region as the local region weight value; an image segmentation subunit configured to divide the image content of the second target file into unit domains; a unit domain feature quantity extraction subunit configured to extract, from the divided unit domains, a feature quantity of each unit domain as the unit domain feature quantity of the image content of the second target file; a unit domain similarity calculation subunit configured to compare the unit domain feature quantities of the image content of the second target file with those of the image content of the first target file, the latter being prepared in advance, and to calculate the similarity of the feature quantities of each unit domain as the unit domain similarity; and an image similarity calculation subunit configured to calculate the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarities with unit-domain-based weight values derived from the local region weight values.
Specifically, referring to figs. 4 and 5, the image segmentation subunit first divides the image content into unit domains, each of which carries multiple feature quantities of the image content. The feature quantities of the unit domains are then compared, and the similarity of the feature quantities of each unit domain is calculated as the unit domain similarity. Finally, the unit domain similarities are weighted with the unit-domain-based weight values derived from the local region weight values, which yields the image similarity between the image content of the second target file and that of the first target file.
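The weighted combination of unit domain similarities can be illustrated with a minimal Python sketch. It is not the patented implementation: the local region probability formula is only available as a figure, so the weights are simply passed in. The function names, the grayscale grid input, the mean-value feature and the difference-based similarity are all assumptions chosen for illustration.

```python
from typing import List, Sequence

Matrix = List[List[int]]


def split_into_unit_domains(image: Matrix, size: int) -> List[Matrix]:
    """Cut a 2-D grid into size x size blocks, scanning row by row."""
    blocks = []
    for top in range(0, len(image), size):
        for left in range(0, len(image[0]), size):
            blocks.append([row[left:left + size] for row in image[top:top + size]])
    return blocks


def feature(block: Matrix) -> float:
    """Hypothetical feature quantity: the mean value of the block."""
    values = [v for row in block for v in row]
    return sum(values) / len(values)


def unit_similarity(f1: float, f2: float, max_value: float = 255.0) -> float:
    """Hypothetical similarity in [0, 1]: one minus the normalised difference."""
    return 1.0 - abs(f1 - f2) / max_value


def image_similarity(img1: Matrix, img2: Matrix,
                     weights: Sequence[float], size: int) -> float:
    """Weighted average of the per-unit-domain similarities."""
    blocks1 = split_into_unit_domains(img1, size)
    blocks2 = split_into_unit_domains(img2, size)
    if not (len(blocks1) == len(blocks2) == len(weights)):
        raise ValueError("images and weights must cover the same unit domains")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    sims = [unit_similarity(feature(a), feature(b))
            for a, b in zip(blocks1, blocks2)]
    return sum(w * s for w, s in zip(weights, sims)) / total
```

With two unit domains where only the first one agrees, weights of 3 and 1 give a similarity of 0.75, while equal weights give 0.5. This shows how a heavily weighted region dominates the result.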
Example 4
On the basis of the previous embodiment, the result generating unit calculates the final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values, performing the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity is less than 0.8, the final similarity recognition result is that the two files are not similar; if the final similarity is not less than 0.8, the final similarity recognition result is that the two files are similar; where A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.
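The weighting rule above reads most naturally as a weighted sum with A + B = 1 and a decision threshold of 0.8. A minimal Python sketch under that reading follows; the function and parameter names are illustrative and do not come from the patent.

```python
def final_similarity(first_result: float, second_result: float,
                     weight_a: float) -> float:
    """Weighted sum of the two judgment results, with B fixed to 1 - A."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    weight_b = 1.0 - weight_a
    return first_result * weight_a + second_result * weight_b


def files_are_similar(first_result: float, second_result: float,
                      weight_a: float, threshold: float = 0.8) -> bool:
    """Similar when the final similarity is not less than the threshold."""
    return final_similarity(first_result, second_result, weight_a) >= threshold
```

For example, a first judgment result of 1 and a second judgment result of 0.9 with A = 0.5 give a final similarity of about 0.95, so the files are judged similar.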
Example 5
On the basis of the above embodiment, the file content conversion unit converts the extracted file content into the corresponding image content by performing the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then treating the matrix as the image content.
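The conversion steps can be sketched in Python as follows. The text does not say how matrix values map to RGB values, so this sketch assumes that every 8 bits form one channel value and every three channel values form one pixel, with zero padding to fill the last row. The function names and the fixed `width` parameter are also assumptions.

```python
from typing import List, Tuple

Pixel = Tuple[int, ...]


def to_binary_string(data: bytes) -> str:
    """Render the file content as a string of '0' and '1' characters."""
    return "".join(f"{byte:08b}" for byte in data)


def binary_string_to_image(bits: str, width: int) -> List[List[Pixel]]:
    """Fill the bit string into a matrix of (R, G, B) pixels, `width` per row."""
    if width <= 0:
        raise ValueError("width must be positive")
    bits_per_row = 24 * width  # 3 channels x 8 bits per pixel
    rows = -(-len(bits) // bits_per_row)  # ceiling division
    bits = bits.ljust(rows * bits_per_row, "0")  # assumed zero padding
    values = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
    pixels = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]
```

Calling `binary_string_to_image(to_binary_string(b"ABCD"), width=2)` produces a single row of two pixels, `(65, 66, 67)` and `(68, 0, 0)`, where the trailing zeros are padding.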
Example 6
As shown in fig. 2, a computer file similarity recognition method based on image analysis performs the following steps: step 1: extracting basic attributes of two target files for comparison, the target files being a first target file and a second target file, and the basic attributes comprising at least: file name, file type, file size, file position, file creation time and file modification time; step 2: opening the first target file and the second target file, extracting the contents of the two files, and temporarily storing the extracted file contents; step 3: converting the extracted file content into corresponding image content; step 4: calculating a final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values.
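Step 1 can be sketched with the Python standard library. The dictionary keys and the function name are illustrative assumptions. Note that `st_ctime` is the creation time on Windows but the last metadata change time on most Unix systems, so it is only an approximation of "file creation time" there.

```python
from datetime import datetime
from pathlib import Path


def extract_basic_attributes(path: str) -> dict:
    """Collect the six basic attributes named in step 1 for one file."""
    p = Path(path)
    info = p.stat()
    return {
        "file_name": p.name,
        "file_type": p.suffix.lower(),            # extension used as the type
        "file_size": info.st_size,                # size in bytes
        "file_position": str(p.resolve().parent),  # containing directory
        # Creation time on Windows, metadata change time on most Unix systems.
        "creation_time": datetime.fromtimestamp(info.st_ctime),
        "modification_time": datetime.fromtimestamp(info.st_mtime),
    }
```

Running the function on both target files yields two dictionaries whose entries can then be compared item by item.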
Example 7
On the basis of the above embodiment, in step 4, judging the similarity of the two files according to their basic attributes to obtain the first judgment result comprises the following steps: performing matching identification on the items to be matched of the first target file and the second target file, such as file name, file type, file size, file position, file creation time and file modification time; the matching identification method comprises: for each character in an item to be matched, acquiring, according to a keyword set, the keyword to which the character belongs and the index position of the character within that keyword; judging, from the index position, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword in a matching information set and marking in that record that the first character of the keyword is present in the item to be matched; if the character is not the first character of the keyword and a record of the keyword already exists in the matching information set, retrieving that record and marking in it that the character is present in the item to be matched; when all characters of a keyword have been marked as present in the item to be matched, judging that the item to be matched hits the keyword; if the items to be matched of the two files hit the same keyword, judging that the items match; and if the number of matched items is greater than the number of unmatched items, the first judgment result is that the two files match, which is recorded as 1.
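The keyword matching procedure can be sketched in Python as below. It follows the steps literally, so characters of a keyword need not be contiguous in the item, only preceded by the keyword's first character. The function names, the conversion of every attribute to a string, and the value 0 for the non-matching case are assumptions, since the text only specifies the value 1.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_char_index(keywords: Iterable[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each character to every (keyword, index position) where it occurs."""
    index: Dict[str, List[Tuple[str, int]]] = {}
    for keyword in keywords:
        for pos, ch in enumerate(keyword):
            index.setdefault(ch, []).append((keyword, pos))
    return index


def hit_keywords(item: str, keywords: Iterable[str]) -> Set[str]:
    """Return the keywords whose characters are all marked as present in item."""
    index = build_char_index(keywords)
    records: Dict[str, Set[int]] = {}  # the "matching information set"
    for ch in item:
        for keyword, pos in index.get(ch, []):
            if pos == 0:
                # First character: create the record and mark position 0.
                records.setdefault(keyword, set()).add(0)
            elif keyword in records:
                # Later character: mark it only if a record already exists.
                records[keyword].add(pos)
    return {kw for kw, marked in records.items() if len(marked) == len(kw)}


def first_judgment(attrs1: dict, attrs2: dict, keywords: Iterable[str]) -> int:
    """1 if more items match than not; 0 otherwise (the 0 is an assumption)."""
    keyword_list = list(keywords)
    matched = unmatched = 0
    for name in attrs1:
        hits1 = hit_keywords(str(attrs1[name]), keyword_list)
        hits2 = hit_keywords(str(attrs2.get(name, "")), keyword_list)
        if hits1 & hits2:
            matched += 1
        else:
            unmatched += 1
    return 1 if matched > unmatched else 0
```

For instance, with the keyword set `["report", "2020"]`, the item `"annual_report_2020.docx"` hits both keywords, while `"eport"` hits none because the first character of `"report"` never precedes the others.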
Example 8
On the basis of the above embodiment, in step 4, judging the similarity of the two files according to the image content to obtain the second judgment result comprises the following steps: calculating a probability for each local region of the image content using the following formula:
Figure BDA0002588947470000121
where i is the index of each local region, n is the number of local regions, σ(x_i) denotes the value associated with local region x_i, each local region x_i is in matrix form,
Figure BDA0002588947470000122
is the transpose of the matrix, w_i is a preset template matrix, b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10, and m is a probability adjustment value, with a value range of 0.2 to 0.6; calculating, from the probability of each local region, a weight value for that local region in the image as the local region weight value; dividing the image content of the second target file into unit domains; extracting, from the divided unit domains, the feature quantity of each unit domain as a unit domain feature quantity of the image content of the first target file; comparing the unit domain feature quantities of the image content of the first target file, which are prepared in advance, with the unit domain feature quantities of the image content of the second target file; calculating the similarity of the feature quantities of each unit domain as the unit domain similarity; and calculating the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarities with unit-domain-based weight values derived from the local region weight values.
Referring to fig. 6, the file attributes are analyzed and identified first. This is a "rough identification" pass that screens out obviously dissimilar files, so that they need not be compared further, which would otherwise reduce identification efficiency. For two files that pass the rough identification, the file content is converted into image content for recognition; if a file is already an image, no conversion is needed and the image is recognized directly, which preserves recognition efficiency. Moreover, image recognition processes the file as a whole instead of recognizing each character or word, which significantly improves recognition efficiency.
Example 9
On the basis of the above embodiment, in step 4, calculating the final similarity recognition result by weighting based on the preset weight values of the judgment results comprises the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity is less than 0.8, the final similarity recognition result is that the two files are not similar; if the final similarity is not less than 0.8, the final similarity recognition result is that the two files are similar; where A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.
Referring to fig. 7, the final result of the present invention is obtained by weighting. The values of A and B can be adjusted according to actual recognition performance in order to tune recognition accuracy, so that accuracy improves gradually as more experiments are run.
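The text does not say how A and B are adjusted. One simple possibility, shown here purely as an assumption, is a grid search over A on file pairs whose true similarity is already known, keeping B = 1 - A and the 0.8 threshold. The sample format and function names are hypothetical.

```python
from typing import Sequence, Tuple

# (first judgment result, second judgment result, files are truly similar)
Sample = Tuple[float, float, bool]


def accuracy(samples: Sequence[Sample], weight_a: float,
             threshold: float = 0.8) -> float:
    """Fraction of labelled pairs that the weighted rule classifies correctly."""
    correct = 0
    for first, second, label in samples:
        score = first * weight_a + second * (1.0 - weight_a)
        if (score >= threshold) == label:
            correct += 1
    return correct / len(samples)


def best_weight_a(samples: Sequence[Sample], steps: int = 10) -> float:
    """Try A = 0, 1/steps, ..., 1 and keep the first value with top accuracy."""
    candidates = [k / steps for k in range(steps + 1)]
    return max(candidates, key=lambda a: accuracy(samples, a))
```

As more labelled pairs are collected, rerunning the search lets the weights follow the observed recognition results.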
Example 10
On the basis of the above embodiment, in step 3, converting the extracted file content into the corresponding image content comprises the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then treating the matrix as the image content.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that the system provided in the foregoing embodiments is illustrated only by the described division into functional modules. In practical applications, the functions may be assigned to different functional modules as needed; that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined. For example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention serve only to distinguish the modules or steps, and are not to be construed as unduly limiting the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A computer file similarity identification system based on image analysis, the system comprising: a file attribute data extraction unit configured to extract basic attributes of two target files for comparison, the target files being a first target file and a second target file, and the basic attributes comprising at least: file name, file type, file size, file position, file creation time and file modification time; a file content extraction unit configured to open the first target file and the second target file, extract the contents of the two files and temporarily store the extracted file contents; a file content conversion unit configured to convert the extracted file content into corresponding image content; and a file similarity identification unit comprising: a first similarity identification unit, a second similarity identification unit and a result generating unit; the first similarity identification unit is configured to judge the similarity of the two files according to the basic attributes of the two files to obtain a first judgment result; the second similarity identification unit is configured to judge the similarity of the two files according to the image content based on a preset neural network model to obtain a second judgment result; and the result generating unit is configured to calculate a final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values.
2. The system according to claim 1, wherein the first similarity identification unit judges the similarity of the two files according to their basic attributes to obtain the first judgment result by performing the following steps: performing matching identification on the items to be matched of the first target file and the second target file, such as file name, file type, file size, file position, file creation time and file modification time; the matching identification method comprises: for each character in an item to be matched, acquiring, according to a keyword set, the keyword to which the character belongs and the index position of the character within that keyword; judging, from the index position, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword in a matching information set and marking in that record that the first character of the keyword is present in the item to be matched; if the character is not the first character of the keyword and a record of the keyword already exists in the matching information set, retrieving that record and marking in it that the character is present in the item to be matched; when all characters of a keyword have been marked as present in the item to be matched, judging that the item to be matched hits the keyword; if the items to be matched of the two files hit the same keyword, judging that the items match; and if the number of matched items is greater than the number of unmatched items, the first judgment result is that the two files match, which is recorded as 1.
3. The system of claim 2, wherein the second similarity identification unit comprises: a local probability model estimation subunit configured to calculate a probability for each local region of the image content using the following formula:
Figure FDA0002588947460000021
where i is the index of each local region, n is the number of local regions, σ(x_i) denotes the value associated with local region x_i, each local region x_i is in matrix form,
Figure FDA0002588947460000022
is the transpose of the matrix, w_i is a preset template matrix, b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10, and m is a probability adjustment value, with a value range of 0.2 to 0.6; a local region weight calculation subunit configured to calculate, from the probability of each local region, a weight value for that local region in the image as the local region weight value; an image segmentation subunit configured to segment the image content of the second target file into unit domains; a unit domain feature quantity extraction subunit configured to extract, from the segmented unit domains, the feature quantity of each unit domain as a unit domain feature quantity of the image content of the first target file; a unit domain similarity calculation subunit configured to compare the unit domain feature quantities of the image content of the first target file, which are prepared in advance, with the unit domain feature quantities of the image content of the second target file, and to calculate the similarity of the feature quantities of each unit domain as the unit domain similarity; and an image similarity calculation subunit configured to calculate the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarities with unit-domain-based weight values derived from the local region weight values.
4. The system of claim 3, wherein the result generating unit calculates the final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values, performing the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity is less than 0.8, the final similarity recognition result is that the two files are not similar; if the final similarity is not less than 0.8, the final similarity recognition result is that the two files are similar; where A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.
5. The system of claim 4, wherein the file content conversion unit converts the extracted file content into the corresponding image content by performing the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then treating the matrix as the image content.
6. A computer file similarity recognition method based on image analysis, based on the system according to one of claims 1 to 5, characterized in that it performs the following steps: step 1: extracting basic attributes of two target files for comparison, the target files being a first target file and a second target file, and the basic attributes comprising at least: file name, file type, file size, file position, file creation time and file modification time; step 2: opening the first target file and the second target file, extracting the contents of the two files, and temporarily storing the extracted file contents; step 3: converting the extracted file content into corresponding image content; step 4: calculating a final similarity recognition result by weighting the first judgment result and the second judgment result with preset weight values.
7. The method of claim 6, wherein in step 4, judging the similarity of the two files according to their basic attributes to obtain the first judgment result comprises the following steps: performing matching identification on the items to be matched of the first target file and the second target file, such as file name, file type, file size, file position, file creation time and file modification time; the matching identification method comprises: for each character in an item to be matched, acquiring, according to a keyword set, the keyword to which the character belongs and the index position of the character within that keyword; judging, from the index position, whether the character is the first character of the keyword; if the character is the first character of the keyword, recording the keyword in a matching information set and marking in that record that the first character of the keyword is present in the item to be matched; if the character is not the first character of the keyword and a record of the keyword already exists in the matching information set, retrieving that record and marking in it that the character is present in the item to be matched; when all characters of a keyword have been marked as present in the item to be matched, judging that the item to be matched hits the keyword; if the items to be matched of the two files hit the same keyword, judging that the items match; and if the number of matched items is greater than the number of unmatched items, the first judgment result is that the two files match, which is recorded as 1.
8. The method of claim 7, wherein in step 4, judging the similarity of the two files according to the image content to obtain the second judgment result comprises the following steps: calculating a probability for each local region of the image content using the following formula:
Figure FDA0002588947460000041
where i is the index of each local region, n is the number of local regions, σ(x_i) denotes the value associated with local region x_i, each local region x_i is in matrix form,
Figure FDA0002588947460000042
is the transpose of the matrix, w_i is a preset template matrix, b_i is the adjustment value corresponding to the matrix, with a value range of 5 to 10, and m is a probability adjustment value, with a value range of 0.2 to 0.6; calculating, from the probability of each local region, a weight value for that local region in the image as the local region weight value; dividing the image content of the second target file into unit domains; extracting, from the divided unit domains, the feature quantity of each unit domain as a unit domain feature quantity of the image content of the first target file; comparing the unit domain feature quantities of the image content of the first target file, which are prepared in advance, with the unit domain feature quantities of the image content of the second target file; calculating the similarity of the feature quantities of each unit domain as the unit domain similarity; and calculating the image similarity between the image content of the second target file and the image content of the first target file by weighting the unit domain similarities with unit-domain-based weight values derived from the local region weight values.
9. The method of claim 8, wherein in step 4, calculating the final similarity recognition result by weighting based on the preset weight values of the judgment results comprises the following steps: the final similarity is calculated using the following formula: final similarity = first judgment result × A + second judgment result × B; if the final similarity is less than 0.8, the final similarity recognition result is that the two files are not similar; if the final similarity is not less than 0.8, the final similarity recognition result is that the two files are similar; where A is the weight value of the first judgment result, B is the weight value of the second judgment result, and A + B = 1.
10. The method of claim 9, wherein in step 3, converting the extracted file content into the corresponding image content comprises the following steps: converting the file content into a binary character string, filling the binary character string into a matrix, mapping each value in the matrix to an RGB value, and then treating the matrix as the image content.
CN202010689843.2A 2020-07-17 2020-07-17 Computer file similarity recognition system and method based on image analysis Withdrawn CN111666928A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010689843.2A CN111666928A (en) 2020-07-17 2020-07-17 Computer file similarity recognition system and method based on image analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010689843.2A CN111666928A (en) 2020-07-17 2020-07-17 Computer file similarity recognition system and method based on image analysis

Publications (1)

Publication Number Publication Date
CN111666928A true CN111666928A (en) 2020-09-15

Family

ID=72392947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010689843.2A Withdrawn CN111666928A (en) 2020-07-17 2020-07-17 Computer file similarity recognition system and method based on image analysis

Country Status (1)

Country Link
CN (1) CN111666928A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943285A (en) * 2022-05-20 2022-08-26 深圳市创意智慧港科技有限责任公司 Intelligent auditing system for internet news content data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943285A (en) * 2022-05-20 2022-08-26 深圳市创意智慧港科技有限责任公司 Intelligent auditing system for internet news content data
CN114943285B (en) * 2022-05-20 2023-04-07 深圳市创意智慧港科技有限责任公司 Intelligent auditing system for internet news content data

Similar Documents

Publication Publication Date Title
US7783581B2 (en) Data learning system for identifying, learning apparatus, identifying apparatus and learning method
CN111460827B (en) Text information processing method, system, equipment and computer readable storage medium
CN111343203B (en) Sample recognition model training method, malicious sample extraction method and device
JP2001167131A (en) Automatic classifying method for document using document signature
CN110287784B (en) Annual report text structure identification method
CN110647505A (en) Computer-assisted secret point marking method based on fingerprint characteristics
CN108304502A (en) Quick hot spot detecting method and system based on magnanimity news data
CN109033212A (en) A kind of file classification method based on similarity mode
CN109583438A (en) The recognition methods of the text of electronic image and image processing apparatus
Gordo et al. Document classification and page stream segmentation for digital mailroom applications
CN110837568A (en) Entity alignment method and device, electronic equipment and storage medium
CN114090736A (en) Enterprise industry identification system and method based on text similarity
CN111782595A (en) Mass file management method and device, computer equipment and readable storage medium
CN114495139A (en) Operation duplicate checking system and method based on image
CN109344276B (en) Image fingerprint generation method, image similarity comparison method and storage medium
CN112733140B (en) Detection method and system for model inclination attack
CN111666928A (en) Computer file similarity recognition system and method based on image analysis
CN116663549B (en) Digitized management method, system and storage medium based on enterprise files
Sari et al. A search engine for Arabic documents
CN105975643B (en) A kind of realtime graphic search method based on text index
CN117076455A (en) Intelligent identification-based policy structured storage method, medium and system
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN115186138A (en) Comparison method and terminal for power distribution network data
CN115203474A (en) Automatic database classification and extraction technology
Lu et al. A search engine for imaged documents in PDF files

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200915