CN108733843B - File detection method based on Hash algorithm and sample Hash library generation method - Google Patents

Publication number
CN108733843B
Authority
CN
China
Prior art keywords
file
hash
sample
library
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810534536.XA
Other languages
Chinese (zh)
Other versions
CN108733843A (en)
Inventor
江汉祥
赵世强
陈云
张金灵
黄勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-05-29
Filing date
2018-05-29
Publication date
2021-01-12

Abstract

The invention discloses a file detection method based on a hash algorithm, comprising the following steps. S1: detect the size of a file. S2: compare the size of the file with the file sizes recorded in the hash library; if no recorded size matches, end the process; if a recorded size matches, perform the following steps. S3: when the size of the file is smaller than a threshold, calculate the hash value of the whole file and compare it with the sample hash values in the hash library. S4: when the size of the file is larger than the threshold, select a part of the file, determine the hash type corresponding to that part, calculate the hash value of that part, and compare it with the sample hash values of the same hash type in the hash library. The invention further comprises a sample hash library generation method.

Description

File detection method based on Hash algorithm and sample Hash library generation method
Technical Field
The invention belongs to the field of file detection methods, and particularly relates to a file detection method based on a hash algorithm and a sample hash library generation method.
Background
In recent years, the spread of certain files over networks, especially sensitive audio and video files, has seriously affected social security and has even become a source of harm to China's political stability and national unity. Network transmission here mainly means distributing, watching, or listening to such files through self-built websites, cloud network disks, websites of illegal overseas organizations, sharing websites, social platforms, electronic books, and the like, using mobile phones, televisions, computers, multimedia cards, USB flash drives, and similar devices. Detecting these files has therefore become a key part of governance. At present, file detection is mainly implemented with a hash (MD5) algorithm. During transmission, however, a file's format may be changed manually or automatically by an application program, so that the hash value changes even though the content does not, and detection fails. A distributor may also trim a file in time or insert it into another file, which likewise changes the hash value and defeats detection.
The similar detection tools currently on the market are all based on a file hash algorithm, and most sample hash libraries are collected manually or semi-manually. They cannot meet practical needs because of the following defects: hashing the complete file takes too long for large files, which reduces efficiency and fails to meet operational requirements; variant files produced by a change of format or compression ratio cannot be identified, nor can trimmed files or files inserted into other files; suspicious samples cannot be identified to improve sample identification efficiency; and the sample hash library cannot be collected automatically, nor can the front-end sample library be updated automatically. An accurate, comprehensive, and systematic file detection method based on a hash algorithm, together with a sample hash library generation method, is therefore urgently needed.
Disclosure of Invention
In view of the above, the present invention provides a file detection method based on a hash algorithm and a sample hash library generation method. They mainly use the hash algorithm to improve efficiency, use content identification to discover variants, and use cloud-based scanning to aggregate and distribute samples, thereby improving detection capability.
To achieve the above object, the present invention provides a file detection method based on a hash algorithm, comprising the following steps. S1: detect the size of a file. S2: compare the size of the file with the file sizes recorded in the hash library; if no recorded size matches, end the process; if a recorded size matches, perform the following steps. S3: when the size of the file is smaller than a threshold, calculate the hash value of the whole file and compare it with the sample hash values in the hash library. S4: when the size of the file is larger than the threshold, select a part of the file, determine the hash type corresponding to that part, calculate the hash value of that part, and compare it with the sample hash values of the same hash type in the hash library. The method applies a hash of the complete file to small files; for large files, combining the file-size match with a hash of only part of the file improves detection efficiency.
Preferably, the type of the partial file is one of three forms: the file header, the file tail, or the file header plus the file tail. Different types of partial files can thus be detected.
Preferably, the threshold value is 50-200 MB.
Preferably, the number of bytes of the file header or the file tail is 1 KB-20 KB.
The invention also provides a sample hash library generation method, comprising the following steps. S51: perform hash detection on the file using the method described above. S52: for files not hit by the hash detection, detect variant samples using an electronic fingerprint detection technique. S53: calculate the hash of each variant sample detected by the electronic fingerprint detection technique and record the hash value in the sample hash library. S54: for files that the electronic fingerprint detection technique did not identify as variant samples, judge whether the file is a suspected sample using a neural network model. S55: identify the suspected samples found by the neural network model; once a sample file is confirmed, calculate its file hash and record the hash value in the sample hash library. Because this method aggregates the sample hash library and collects variant samples, the library grows without any modification to the front-end detection equipment, which saves users considerable cost.
Preferably, S52 comprises the following step: using the specific scene features of specific key frames in the file, compute an electronic fingerprint record for each key frame with an electronic fingerprint algorithm, and compare it with the records in an electronic fingerprint library to find variant samples. The electronic fingerprint algorithm greatly improves the efficiency of finding variant samples.
Preferably, the electronic fingerprint library is formed by the following steps. S520: after collected samples are identified, add them to the sample hash library. S521: mark each sample with the starting positions of its key frames and the related content. S522: perform the electronic fingerprint operation on the key frames to generate electronic fingerprint records. S523: add the generated records to a library, thereby forming the electronic fingerprint library. The electronic fingerprint library extends detection to variant files and suspected samples, can be enriched continuously, and improves detection capability.
Preferably, S54 comprises the following steps: train on key frames from the original samples using a neural network to form specific algorithm models, assemble the models into a model library, and apply the model library to the file to judge whether it is a suspected sample. The neural network approach allows suspected samples to be detected, which improves detection capability and enriches the sample library.
The invention also provides a method for detecting files using a hash library. The method detects files using a front-end sample hash library and the hash detection method, and further comprises: periodically updating the front-end sample hash library, using the sample hash library generated by the method described above as the back-end sample hash library. Updating the front-end library from the back-end library widens the detection range and continuously improves detection capability.
Preferably, the front end is a website end and the back end is a cloud end. The front end can obtain variant file samples and suspected samples from many sources, which effectively enriches the samples and the sample hash library.
The invention also proposes a computer-readable medium, which stores a program that, when executed by a computer processor, implements the method described above.
Drawings
FIG. 1 is a schematic diagram of a hash algorithm based file detection method of the present invention;
FIG. 2 is a schematic diagram of a sample hash library generation method of the present invention;
FIG. 3 is a schematic diagram of the steps of forming the electronic fingerprint library in the sample hash library generation method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are illustrative; they are intended to explain the invention and are not to be construed as limiting it.
FIG. 1 shows a schematic diagram of the file detection method based on a hash algorithm. The method comprises the following steps. S1: detect the size of a file. S2: compare the size of the file with the file sizes recorded in the hash library; if no recorded size matches, end the process; if a recorded size matches, perform the following steps. S3: when the size of the file is smaller than a threshold, calculate the hash value of the whole file and compare it with the sample hash values in the hash library. S4: when the size of the file is larger than the threshold, select a part of the file, determine the hash type corresponding to that part, calculate the hash value of that part, and compare it with the sample hash values of the same hash type in the hash library. The method applies a hash of the complete file to small files; for large files, combining the file-size match with a hash of only part of the file improves detection efficiency. Note that the selected part may be any part of the file.
In this method, the type of the partial file may be one of three forms: the file header, the file tail, or the file header plus the file tail, so that different types of partial files can be detected. The threshold is 50-200 MB. The file header or file tail is 1 KB-20 KB long. Note that the types of partial file include, but are not limited to, these three forms; other specified parts of the file may also be used.
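Steps S1 to S4 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the library layout (a dict mapping file size to a set of `(hash_type, digest)` records), the type labels `"full"`, `"head"`, `"tail"` and `"head+tail"`, the function names, the 100 MB threshold and the 4 KB part length are all hypothetical choices made within the ranges given above. MD5 is used because the Background names it. A file whose size equals the threshold is treated as large here, since the text leaves that case open.

```python
import hashlib
import os

THRESHOLD = 100 * 1024 * 1024  # assumed value inside the stated 50-200 MB range
PART_BYTES = 4 * 1024          # assumed value inside the stated 1 KB-20 KB range


def read_part(path, size, hash_type, part_bytes=PART_BYTES):
    """Return the bytes covered by a partial type: 'head', 'tail' or 'head+tail'."""
    n = min(part_bytes, size)
    head = tail = b""
    with open(path, "rb") as f:
        if "head" in hash_type:
            head = f.read(n)
        if "tail" in hash_type:
            f.seek(size - n)
            tail = f.read(n)
    return head + tail


def detect(path, library, threshold=THRESHOLD):
    """Return True if the file matches a sample record in the library.

    library maps file size -> set of (hash_type, hex_digest) records
    (hypothetical layout chosen for this sketch).
    """
    size = os.path.getsize(path)              # S1: detect the file size
    records = library.get(size)               # S2: size pre-filter
    if not records:
        return False                          # no sample of this size: stop
    if size < threshold:                      # S3: small file, hash everything
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        return ("full", digest) in records
    # S4: large file, hash only the part named by each candidate hash type
    for hash_type in {t for t, _ in records if t != "full"}:
        digest = hashlib.md5(read_part(path, size, hash_type)).hexdigest()
        if (hash_type, digest) in records:
            return True
    return False
```

The size comparison in S2 costs one metadata lookup, so most non-matching files are rejected before any bytes are read; only files that pass it reach the hashing in S3 or S4.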
FIG. 2 shows a schematic diagram of the sample hash library generation method. The method comprises the following steps. S51: perform hash detection on the file using the method described above. S52: for files not hit by the hash detection, detect variant samples using an electronic fingerprint detection technique. S53: calculate the hash of each variant sample detected by the electronic fingerprint detection technique and record the hash value in the sample hash library. S54: for files that the electronic fingerprint detection technique did not identify as variant samples, judge whether the file is a suspected sample using a neural network model. S55: identify the suspected samples found by the neural network model; once a sample file is confirmed, calculate its file hash and record the hash value in the sample hash library. The electronic fingerprint detection technique works as follows: after a picture, audio, or video is decoded, features are computed over its sequence using image and audio processing algorithms or spatial transformation model algorithms, and the extracted sequence of feature values forms the electronic fingerprint of the file. The resulting fingerprint is highly robust and remains unchanged under transformations such as audio/video transcoding and trimming.
In practice, electronic fingerprints are first computed for known specific audio and video and stored in a library; fingerprints are then extracted from each sample to be identified and compared one by one with those in the fingerprint library. If the comparison result exceeds a set threshold, the sample is regarded as a hit. Because this method aggregates the sample hash library and collects variant samples, the library grows without any modification to the front-end detection equipment, which saves users considerable cost.
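The control flow of S51 to S55 can be sketched as a single function. This is an illustration under assumptions rather than the patent's implementation: `process_file`, the returned labels, and the three callables `is_variant`, `is_suspected` and `confirm` are hypothetical stand-ins for fingerprint matching, the neural network model and the identification step. For brevity, S51 is reduced here to a whole-file MD5 lookup instead of the size-and-partial-hash method.

```python
import hashlib


def process_file(data, hash_library, is_variant, is_suspected, confirm):
    """Run one file through S51-S55 and return the stage that decided it.

    hash_library is a set of hex digests. is_variant, is_suspected and
    confirm are hypothetical callables taking the file bytes and returning
    a bool.
    """
    digest = hashlib.md5(data).hexdigest()
    if digest in hash_library:                # S51: hash detection hit
        return "known"
    if is_variant(data):                      # S52: fingerprint finds a variant
        hash_library.add(digest)              # S53: record the variant's hash
        return "variant"
    if is_suspected(data) and confirm(data):  # S54, then S55 confirmation
        hash_library.add(digest)              # S55: record the confirmed hash
        return "confirmed"
    return "clean"
```

Once a variant or confirmed sample has been recorded, the next copy of the same file is caught at S51 by the cheap hash lookup, which is how the library growth described above feeds back into detection.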
S52 of the sample hash library generation method comprises the following step: using the specific scene features of specific key frames in the file, compute an electronic fingerprint record for each key frame with an electronic fingerprint algorithm, and compare it with the records in an electronic fingerprint library to find variant samples. FIG. 3 shows a schematic diagram of the steps of forming the electronic fingerprint library. The library is formed by the following steps. S520: after collected samples are identified, add them to the sample hash library. S521: mark each sample with the starting positions of its key frames and the related content. S522: perform the electronic fingerprint operation on the key frames to generate electronic fingerprint records. S523: add the generated records to a library, thereby forming the electronic fingerprint library. The electronic fingerprint algorithm greatly improves the efficiency of finding variant samples. The electronic fingerprint library extends detection to variant files and suspected samples, can be enriched continuously, and improves detection capability.
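The library formation (S521 to S523) and the lookup in S52 can be illustrated with a toy fingerprint. The patent does not specify the fingerprint algorithm, so the sketch below substitutes a simple average hash (one bit per pixel, set when the pixel is brighter than the frame mean) purely as a stand-in. The record fields, function names, the 0.9 threshold and the representation of a key frame as a small grid of grayscale values are all assumptions.

```python
def fingerprint(frame):
    """Average-hash stand-in: one bit per pixel, 1 if above the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)


def similarity(a, b):
    """Fraction of matching bits between two equal-length fingerprints."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def add_sample(library, sample_id, key_frames):
    """S521-S523: key_frames is a list of (start_position, frame) pairs."""
    for start, frame in key_frames:
        library.append(
            {"sample": sample_id, "start": start, "fp": fingerprint(frame)}
        )


def find_variant(library, frame, threshold=0.9):
    """Return the sample id of the best match above the threshold, else None."""
    fp = fingerprint(frame)
    best = max(library, key=lambda r: similarity(r["fp"], fp), default=None)
    if best is not None and similarity(best["fp"], fp) > threshold:
        return best["sample"]
    return None
```

Because the bits depend on each pixel relative to the frame mean, a uniformly brightened copy of a key frame yields the same fingerprint while its file hash differs, which is the kind of variant the hash check alone misses.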
S54 of the sample hash library generation method comprises the following steps: train on key frames from the original samples using a neural network to form specific algorithm models, assemble the models into a model library, and apply the model library to the file to judge whether it is a suspected sample. The neural network approach allows suspected samples to be detected, which improves detection capability and enriches the sample library. The approach relies mainly on scene features in the audio and video, for example, in a terrorist scene, a held gun, a mask, a terrorist flag, the sound of an explosion, or a speaker's voice. These features have distinctive picture texture or spectral distribution characteristics, so a recognition model can be formed by training on samples with a neural network. Units in the neural network respond to surrounding units within a limited coverage range, and such networks perform well in image processing and audio recognition, so a corresponding discriminant model can be obtained. For example, given an input picture containing a gun, the network can recognize the shape distribution in the picture, so the "gun" response for this sample is stronger and the scene class label "gun" can be output directly.
The invention also provides a method for detecting files using a hash library. The method detects files using a front-end sample hash library and the hash detection method, and further comprises: periodically updating the front-end sample hash library, using the sample hash library generated by the method described above as the back-end sample hash library. The front end is a website end and the back end is a cloud end. Updating the front-end library from the back-end library widens the detection range and continuously improves detection capability. The front end can obtain variant file samples and suspected samples from many sources, which effectively enriches the samples and the sample hash library. With this technique, a back-end data center can be built to aggregate the sample library, train on it, form the sample hash library, interact with the front end, and continuously update the front-end sample hash library, improving detection capability.
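The remark that units respond only within a limited coverage range reads like a description of local receptive fields, as used in convolutional networks; that reading is an interpretation, not something the text states. The sketch below illustrates only that local-response idea with hand-written kernels. It is not a trained model, and the function names, kernels and labels are hypothetical.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation: each output sees one kernel-sized window."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def strongest_response(image, kernels):
    """Return the label whose kernel produces the largest single response."""
    scores = {
        label: max(max(row) for row in conv2d(image, kernel))
        for label, kernel in kernels.items()
    }
    return max(scores, key=scores.get)
```

In a trained network the kernels would be learned from the key frames of the original samples and stacked in layers; the label with the strongest response plays the role of the output scene class in the "gun" example above.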
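One update cycle of the front-end library from the back-end library can be sketched as a merge. The layout (a dict mapping file size to a set of `(hash_type, digest)` records) and the name `sync_front_end` are assumptions for illustration; the text does not describe the transfer protocol or the update interval, so neither is modelled here.

```python
def sync_front_end(front_end, back_end):
    """Copy records missing from the front-end library; return how many were added.

    Both libraries map file size -> set of (hash_type, hex_digest) records
    (hypothetical layout chosen for this sketch).
    """
    added = 0
    for size, records in back_end.items():
        known = front_end.setdefault(size, set())
        new = records - known   # records the front end has not seen yet
        known |= new
        added += len(new)
    return added
```

A scheduler of any kind could call this function at each update; a second call with an unchanged back end adds nothing, so repeating the update is harmless.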
The invention also proposes a computer-readable medium, which stores a program that, when executed by a computer processor, implements the method described above. The computer readable medium may be a hard disk, an optical disk, a floppy disk, a flash disk, an SD card, a TF card, etc.
The invention also deploys content detection equipment in areas with many sources of specific samples, to obtain variant file samples and push suspected samples, which effectively enriches the samples and the sample hash library.
In summary, the file detection method based on a hash algorithm and the sample hash library generation method use several algorithms for data matching and verification, so that variant samples and suspected samples can be identified.
It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present invention without departing from the spirit and scope of the invention. In this way, if these modifications and changes are within the scope of the claims of the present invention and their equivalents, the present invention is also intended to cover these modifications and changes. The word "comprising" does not exclude the presence of other elements or steps than those listed in a claim. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. Any reference signs in the claims shall not be construed as limiting the scope.

Claims (9)

1. A sample hash library generation method, comprising the following steps:
S51: performing hash detection on a file;
S52: for a file not hit by the hash detection, computing an electronic fingerprint record of a key frame from the specific scene features of a specific key frame in the file using an electronic fingerprint algorithm, and comparing the record with the electronic fingerprint records in an electronic fingerprint library to find variant samples;
S53: calculating the hash of each variant sample detected by the electronic fingerprint detection technique, and recording the calculated hash value in the sample hash library;
S54: for a file that the electronic fingerprint detection technique did not identify as a variant sample, judging whether the file is a suspected sample using a neural network model;
S55: identifying the suspected sample found by the neural network model, calculating the file hash once the sample file is confirmed, and recording the calculated hash value in the sample hash library;
wherein the hash detection of the file in step S51 comprises:
S1: detecting the size of the file;
S2: comparing the size of the file with the file sizes recorded in the hash library; if no recorded size matches, ending the process; if a recorded size matches, performing the following steps;
S3: when the size of the file is smaller than a threshold, calculating the hash value of the whole file and comparing it with the sample hash values in the hash library; and
S4: when the size of the file is larger than the threshold, selecting a part of the file, determining the hash type corresponding to that part, calculating the hash value of that part, and comparing it with the sample hash values of the same hash type in the hash library.
2. The method of claim 1, wherein the type of the partial file is one of three forms: a file header, a file tail, or a file header plus a file tail.
3. The method of claim 1, wherein the threshold is between about 50 MB and about 200 MB.
4. The method of claim 2, wherein the number of bytes of the file header or the file tail is 1 KB-20 KB.
5. The method according to claim 1, wherein the electronic fingerprint library is formed by the following steps:
S520: after collected samples are identified, adding them to the sample hash library;
S521: marking each sample with the starting positions of its key frames and the related content;
S522: performing the electronic fingerprint operation on the key frames to generate electronic fingerprint records;
S523: adding the generated electronic fingerprint records to a library, thereby forming the electronic fingerprint library.
6. The method according to claim 1, wherein S54 comprises the following steps: training on key frames from the original samples using a neural network to form specific algorithm models, assembling the models into a model library, and applying the model library to the file to judge whether it is a suspected sample.
7. A method for detecting a file by utilizing a hash library comprises the steps of detecting the file by utilizing a front-end sample hash library and a hash detection method, and is characterized by further comprising the following steps: periodically updating the front-end sample hash library with a sample hash library generated by the method of one of claims 1 to 6 as a back-end sample hash library.
8. The method of claim 7, wherein the front end is a web site end and the back end is a cloud end.
9. A computer-readable medium storing a program which, when executed by a computer processor, implements the method of one of claims 1 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810534536.XA CN108733843B (en) 2018-05-29 2018-05-29 File detection method based on Hash algorithm and sample Hash library generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810534536.XA CN108733843B (en) 2018-05-29 2018-05-29 File detection method based on Hash algorithm and sample Hash library generation method

Publications (2)

Publication Number Publication Date
CN108733843A CN108733843A (en) 2018-11-02
CN108733843B CN108733843B (en) 2021-01-12

Family

ID=63936656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810534536.XA Active CN108733843B (en) 2018-05-29 2018-05-29 File detection method based on Hash algorithm and sample Hash library generation method

Country Status (1)

Country Link
CN (1) CN108733843B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990897A (en) * 2019-12-16 2020-04-10 北京无忧创想信息技术有限公司 File fingerprint generation method and device
CA3118234A1 (en) * 2020-05-13 2021-11-13 Magnet Forensics Inc. System and method for identifying files based on hash values
US11768937B1 (en) * 2020-11-30 2023-09-26 Amazon Technologies, Inc. Hash based flexible threat scanning engine

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593259A (en) * 2009-06-29 2009-12-02 北京航空航天大学 software integrity verification method and system
CN102915325A (en) * 2012-08-11 2013-02-06 深圳市极限网络科技有限公司 Md5 Hash list-based file decomposing and combining technique
CN102970294A (en) * 2012-11-21 2013-03-13 网神信息技术(北京)股份有限公司 Method and device for detecting virus of security gateway
CN104700033A (en) * 2015-03-30 2015-06-10 北京瑞星信息技术有限公司 Virus detection method and virus detection device
CN107992599A (en) * 2017-12-13 2018-05-04 厦门市美亚柏科信息股份有限公司 File comparison method and system

Also Published As

Publication number Publication date
CN108733843A (en) 2018-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant