CN108460155A - File identification method, apparatus, device and storage medium - Google Patents

Publication number
CN108460155A
Authority
CN
China
Prior art keywords
file
text
binary
result
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810265755.2A
Other languages
Chinese (zh)
Inventor
黄伟佳
吴楚伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN201810265755.2A
Publication of CN108460155A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Abstract

This application discloses a file identification method, apparatus, device and storage medium. The method includes: determining the file category of a target file; if the file category of the target file is binary file, searching for a feature string corresponding to the binary file, and determining the file identification result of the binary file according to the lookup result; if the file category of the target file is text file, searching for keywords and/or regular expressions corresponding to the text file, and determining the file identification result of the text file according to the search result. The application thus discloses a technical solution that identifies files without relying on the file extension, so that files with no extension, or files whose extension has been modified, can still be identified, which improves the file recognition rate.

Description

File identification method, apparatus, device and storage medium
Technical field
The present invention relates to the field of computer technology, and in particular to a file identification method, apparatus, device and storage medium.
Background
At present, the common way to identify the file format or file type of a file is to inspect its file extension. This approach is convenient under normal conditions and widely applicable. However, a file extension can be changed, and once it has been changed, whether intentionally or unintentionally, this approach can no longer identify the file correctly. In addition, some files have no extension at all, in which case the approach cannot be applied to them.
In summary, how to improve the format recognition rate of files is a problem that remains to be solved.
Summary of the invention
In view of this, the purpose of the present invention is to provide a file identification method, apparatus, device and storage medium that can effectively improve the file recognition rate. The specific solution is as follows:
In a first aspect, the invention discloses a file identification method, including:
Determining the file category of a target file;
If the file category of the target file is binary file, searching for a feature string corresponding to the binary file, and determining the file identification result of the binary file according to the lookup result;
If the file category of the target file is text file, searching for keywords and/or regular expressions corresponding to the text file, and determining the file identification result of the text file according to the search result.
Optionally, the binary file includes a PE file, a compound document or a compressed file.
Optionally, the step of searching for a feature string corresponding to the binary file and determining the file identification result of the binary file according to the lookup result includes:
Searching for the file header features of the binary file, and judging whether the binary file is a PE file according to the current lookup result;
If not, using a preset mapping table between the feature strings and offsets of compound documents to search for a feature string corresponding to the binary file, and judging whether the binary file is a compound document according to the current lookup result;
If not, using a preset mapping table between the feature strings and offsets of compressed files to search for a feature string corresponding to the binary file, and judging whether the binary file is a compressed file according to the current lookup result.
Optionally, the file header features include a DOS feature and an NT feature.
Optionally, the text file includes a programming file or a script file.
Optionally, the step of searching for keywords and/or regular expressions corresponding to the text file and determining the file identification result of the text file according to the search result includes:
Using the keywords and/or regular expressions of a preset programming file to search for the keywords and regular expressions corresponding to the text file;
Determining a corresponding first actual file recognition rate according to the current search result;
Judging whether the first actual file recognition rate is greater than a first preset threshold, and if so, determining that the text file is a programming file.
Optionally, the step of searching for keywords and/or regular expressions corresponding to the text file and determining the file identification result of the text file according to the search result includes:
Using the keywords and/or regular expressions of a preset script file to search for the keywords and regular expressions corresponding to the text file;
Determining a corresponding second actual file recognition rate according to the current search result;
Judging whether the second actual file recognition rate is greater than a second preset threshold, and if so, determining that the text file is a script file.
Optionally, before the step of determining the file category of the target file, the method further includes:
Determining whether the target file has a file extension;
If so, determining the file identification result of the target file directly according to its file extension.
In a second aspect, the invention discloses a file identification apparatus, including:
A file category determining module, configured to determine the file category of a target file;
A binary file identification module, configured to, when the file category of the target file is binary file, search for a feature string corresponding to the binary file and determine the file identification result of the binary file according to the lookup result;
A text file identification module, configured to, when the file category of the target file is text file, search for keywords and/or regular expressions corresponding to the text file and determine the file identification result of the text file according to the search result.
Optionally, the binary file includes a PE file, a compound document or a compressed file.
Optionally, the binary file identification module includes:
A first judging unit, configured to search for the file header features of the binary file and judge whether the binary file is a PE file according to the current lookup result;
A second judging unit, configured to, when the judgment result of the first judging unit is negative, use a preset mapping table between the feature strings and offsets of compound documents to search for a feature string corresponding to the binary file, and judge whether the binary file is a compound document according to the current lookup result;
A third judging unit, configured to, when the judgment result of the second judging unit is negative, use a preset mapping table between the feature strings and offsets of compressed files to search for a feature string corresponding to the binary file, and judge whether the binary file is a compressed file according to the current lookup result.
Optionally, the file header features include a DOS feature and an NT feature.
Optionally, the text file includes a programming file or a script file.
Optionally, the text file identification module includes:
A first search unit, configured to use the keywords and/or regular expressions of a preset programming file to search for the keywords and regular expressions corresponding to the text file;
A first determination unit, configured to determine a corresponding first actual file recognition rate according to the current search result;
A fourth judging unit, configured to judge whether the first actual file recognition rate is greater than a first preset threshold, and if so, determine that the text file is a programming file.
Optionally, the text file identification module includes:
A first search unit, configured to use the keywords and/or regular expressions of a preset script file to search for the keywords and regular expressions corresponding to the text file;
A first determination unit, configured to determine a corresponding second actual file recognition rate according to the current search result;
A fifth judging unit, configured to judge whether the second actual file recognition rate is greater than a second preset threshold, and if so, determine that the text file is a script file.
Optionally, the apparatus further includes:
A direct file identification module, configured to, before the file category determining module determines the file category of the target file, determine whether the target file has a file extension, and if so, determine the file identification result of the target file directly according to its file extension.
In a third aspect, the invention discloses a file identification device, including a processor and a memory, wherein the processor implements the file identification method disclosed above when executing a computer program stored in the memory.
In a fourth aspect, the invention discloses a computer-readable storage medium for storing a computer program, wherein the computer program implements the file identification method disclosed above when executed by a processor.
It can be seen that the present invention first determines the file category of a file. When the file category is binary file, the file identification result is determined according to the feature string of the binary file; when the file category is text file, the file identification result is determined according to the keywords and/or regular expressions corresponding to the text file. The invention thus discloses a technical solution that identifies files without relying on the file extension, so that files with no extension, or files whose extension has been modified, can still be identified, which improves the file recognition rate.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a file identification method disclosed in an embodiment of the present invention;
Fig. 2 is a flow chart of a specific file identification method disclosed in an embodiment of the present invention;
Fig. 3 is a sub-flow chart of a file identification method disclosed in an embodiment of the present invention;
Fig. 4 is a sub-flow chart of a file identification method disclosed in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a file identification apparatus disclosed in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, an embodiment of the present invention discloses a file identification method, which includes:
Step S11: Determine the file category of the target file.
It should be pointed out that the target file in this embodiment may be a file with an extension or a file without one. For files that have no extension or whose extension has been tampered with, the file identification method of this embodiment is used preferentially.
In addition, determining the file category of the target file in this embodiment may specifically include: reading file content from the target file, and then determining the file category of the target file from that content. The file content may be only part of the content of the target file, for example content of a preset length selected at random from the target file.
Further, the file category of the target file in this embodiment is one of two kinds: binary file or text file. This embodiment can determine from the file content of the target file whether its file category is binary file or text file.
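The embodiment does not specify how the sampled content is turned into a binary/text decision. The following Python sketch shows one common heuristic purely as an illustration: the function name `classify_file_category`, the NUL-byte rule and the 30% control-byte threshold are all assumptions, not part of the disclosed method.

```python
def classify_file_category(sample: bytes) -> str:
    """Return "binary" or "text" for a sample of file content.

    Hypothetical heuristic: a NUL byte, or a high share of control
    bytes, is taken as a sign of binary content.
    """
    if not sample:
        return "text"  # assumption: an empty sample is treated as text
    if b"\x00" in sample:
        return "binary"
    # Tab, line feed, form feed and carriage return are normal in text.
    allowed_controls = {9, 10, 12, 13}
    control_count = sum(1 for b in sample if b < 32 and b not in allowed_controls)
    return "binary" if control_count / len(sample) > 0.30 else "text"
```

A heuristic of this kind would misclassify UTF-16 encoded text as binary, so a real implementation would likely need additional encoding checks.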
In this embodiment, the binary file may specifically be a PE (Portable Executable) file, a compound document or a compressed file; that is, these are the file types of binary files in this embodiment. The text file may specifically be a programming file or a script file; that is, these are the file types of text files in this embodiment.
To speed up identification when the file extension has not been tampered with, this embodiment may further include, before the step of determining the file category of the target file: determining whether the target file has a file extension; if so, determining the file identification result of the target file directly according to its file extension.
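The extension shortcut described above can be sketched as follows. This is a minimal illustration, assuming the identification result is simply the lower-cased extension; the helper name `identify_by_extension` is hypothetical.

```python
import os


def identify_by_extension(path: str):
    """Return the lower-cased extension without the dot, or None if absent."""
    _, ext = os.path.splitext(path)
    return ext[1:].lower() if ext else None
```

When this helper returns None, a caller would fall back to the content-based identification of steps S11 to S13.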
Step S12: If the file category of the target file is binary file, search for a feature string corresponding to the binary file, and determine the file identification result of the binary file according to the lookup result.
That is, when the target file is a binary file, this embodiment searches for the feature string corresponding to the binary file and then determines the file identification result of the binary file according to the feature string found. It can be understood that the file identification result of the binary file may specifically include a file format identification result and/or a file type identification result.
Step S13: If the file category of the target file is text file, search for keywords and/or regular expressions corresponding to the text file, and determine the file identification result of the text file according to the search result.
That is, when the target file is a text file, this embodiment searches for the keywords and/or regular expressions corresponding to the text file and then determines the identification result of the text file according to the keywords and/or regular expression matches found. It can be understood that the file identification result of the text file may specifically include a file format identification result and/or a file type identification result.
It can be seen that this embodiment first determines the file category of a file. When the file category is binary file, the file identification result is determined according to the feature string of the binary file; when the file category is text file, the file identification result is determined according to the keywords and/or regular expressions corresponding to the text file. The embodiment thus discloses a technical solution that identifies files without relying on the file extension, so that files with no extension, or files whose extension has been modified, can still be identified, which improves the file recognition rate.
On the basis of the previous embodiment, an embodiment of the present invention discloses a specific file identification method. Referring to Fig. 2, the method includes:
Step S21: Determine the file category of the target file.
For the detailed process of step S21, refer to the corresponding content disclosed in the previous embodiment, which is not repeated here.
Step S22: If the file category of the target file is binary file, search for the file header features of the binary file, and judge whether the binary file is a PE file according to the current lookup result.
The file header features may specifically include a DOS feature and an NT feature. Based on the DOS feature and the NT feature, it is possible to identify whether the binary file is a PE file and, if so, which type of PE file it is. The file types of PE files may specifically include, but are not limited to, command files, DLL files, SYS files, EXE files, LE files and NE files.
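The text does not spell out what the DOS and NT features are. A minimal Python sketch follows, assuming they correspond to the standard PE layout: the "MZ" magic at the start of the DOS header, and the "PE\0\0" signature of the NT headers located through the 4-byte `e_lfanew` field at offset 0x3C. The function name `is_pe_file` is hypothetical, and the sketch does not distinguish DLL, SYS or EXE files, nor the older LE and NE formats.

```python
import struct


def is_pe_file(data: bytes) -> bool:
    """Check the DOS magic and the NT signature of a candidate PE file."""
    # DOS feature (assumed): the file starts with the "MZ" magic.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew at offset 0x3C holds the offset of the NT headers.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    # NT feature (assumed): the "PE\0\0" signature at that offset.
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"
```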
Step S23: If not, use a preset mapping table between the feature strings and offsets of compound documents to search for a feature string corresponding to the binary file, and judge whether the binary file is a compound document according to the current lookup result.
That is, when the binary file is not a PE file, this embodiment can further use the preset mapping table between the feature strings and offsets of compound documents to search for a feature string corresponding to the binary file. The file types of compound documents may specifically include, but are not limited to, WPS documents, Visio documents, CHM documents, CAJ documents and PDF documents.
For example, for WPS documents, the corresponding feature strings include "WordDocument" and "WPS Office". The characteristic value corresponding to the feature string "WordDocument" is: 57 00 6F 00 72 00 64 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74; the characteristic value corresponding to the feature string "WPS Office" is: 57 00 50 00 53 00 20 00 4F 00 66 00 66 00 69 00 63 00 65 00. Using the mapping relations between the feature strings of WPS documents and their offsets, it can be determined whether the target file contains these feature strings; if so, the target file can be determined to be a WPS document.
For Visio documents, the corresponding feature string is "VisioDocument", and the characteristic value corresponding to this feature string is: 56 00 69 00 73 00 69 00 6F 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00. Using the mapping relations between the feature string of Visio documents and its offset, it can be determined whether the target file contains this feature string; if so, the target file can be determined to be a Visio document.
For CHM documents, the corresponding feature strings include "ITSF", "ITSP" and "PMGL". The characteristic value corresponding to "ITSF" is: 49 54 53 46; the characteristic value corresponding to "ITSP" is: 49 54 53 50; the characteristic value corresponding to "PMGL" is: 50 4D 47 4C. Using the mapping relations between the feature strings of CHM documents and their offsets, it can be determined whether the target file contains these feature strings; if so, the target file can be determined to be a CHM document.
For CAJ documents, the corresponding feature string is "CAJ", and the characteristic value corresponding to this feature string is: 43 41 4A. Using the mapping relations between the feature string of CAJ documents and its offset, it can be determined whether the target file contains this feature string; if so, the target file can be determined to be a CAJ document.
For PDF documents, the corresponding feature string is "%PDF-1.", and the characteristic value corresponding to this feature string is: 25 50 44 46 2D 31 2E. Using the mapping relations between the feature string of PDF documents and its offset, it can be determined whether the target file contains this feature string; if so, the target file can be determined to be a PDF document.
As can be seen from the above, based on the feature string found in step S23, this embodiment can identify whether the binary file is a compound document and, if so, which type of compound document it is.
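The lookup in step S23 can be illustrated with a short Python sketch. Since the text gives no concrete offsets, this sketch searches the whole buffer rather than fixed positions; that simplification, the table name `COMPOUND_SIGNATURES`, the function name `identify_compound_document` and the rule that all feature strings of a type must be present are assumptions. The byte values are derived by encoding the feature strings, using UTF-16LE for the stream names of WPS and Visio documents.

```python
# Hypothetical signature table: file type -> feature strings as bytes.
COMPOUND_SIGNATURES = {
    "WPS": ["WordDocument".encode("utf-16-le"), "WPS Office".encode("utf-16-le")],
    "Visio": ["VisioDocument".encode("utf-16-le")],
    "CHM": [b"ITSF", b"ITSP", b"PMGL"],
    "CAJ": [b"CAJ"],
    "PDF": [b"%PDF-1."],
}


def identify_compound_document(data: bytes):
    """Return the first type whose feature strings all occur in data, else None."""
    for file_type, signatures in COMPOUND_SIGNATURES.items():
        if all(sig in data for sig in signatures):
            return file_type
    return None
```

Short signatures such as "CAJ" can occur by chance anywhere in a file, which is presumably why the disclosed method pairs each feature string with an offset; a fuller implementation would check each signature only at its mapped position.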
Step S24: If not, use a preset mapping table between the feature strings and offsets of compressed files to search for a feature string corresponding to the binary file, and judge whether the binary file is a compressed file according to the current lookup result.
That is, when the binary file is neither a PE file nor a compound document, this embodiment can further use the preset mapping table between the feature strings and offsets of compressed files to search for a feature string corresponding to the binary file. The file types of compressed files may specifically include, but are not limited to, zip files, wim files, 7z files, tar files and Rar files.
For example, for zip files, the corresponding feature string is "PK", and the characteristic value corresponding to this feature string is 50 4B. Using the mapping relations between the feature string of zip files and its offset, it can be determined whether the target file contains this feature string; if so, the target file can be determined to be a zip file.
For wim files, the corresponding feature string is "MSWIM", and the characteristic value corresponding to this feature string is 4D 53 57 49 4D. Using the mapping relations between the feature string of wim files and its offset, it can be determined whether the target file contains this feature string; if so, the target file can be determined to be a wim file.
For 7z files, the corresponding feature string is "7z..'", and the characteristic value corresponding to this feature string is 37 7A BC AF 27. Using the mapping relations between the feature string of 7z files and its offset, it can be determined whether the target file contains this feature string; if so, the target file can be determined to be a 7z file.
For tar files, the corresponding feature string is ".ustar.00", and the characteristic value corresponding to this feature string is 00 75 73 74 61 72 00 30 30. Using the mapping relations between the feature string of tar files and its offset, it can be determined whether the target file contains this feature string; if so, the target file can be determined to be a tar file.
For Rar files, the corresponding feature string is "Rar!... ..s.....", and the characteristic value corresponding to this feature string is 52 61 72 21 1A 07 00 CF 90 73 00 0D. Using the mapping relations between the feature string of Rar files and its offset, it can be determined whether the target file contains this feature string; if so, the target file can be determined to be a Rar file.
As can be seen from the above, based on the feature string found in step S24, this embodiment can identify whether the binary file is a compressed file and, if so, which type of compressed file it is.
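A mapping table between feature strings and offsets, as used in step S24, might look like the following Python sketch. The offsets are not given in the text and are assumptions based on the usual layout of these formats (magic bytes at offset 0, and the tar "ustar" field preceded by a NUL byte at offset 256). Only the leading seven bytes of the Rar value are used here. The names `COMPRESSED_SIGNATURES` and `identify_compressed_file` are hypothetical.

```python
# Hypothetical table of (file type, assumed offset, feature bytes).
COMPRESSED_SIGNATURES = [
    ("zip", 0, bytes.fromhex("50 4B")),
    ("wim", 0, bytes.fromhex("4D 53 57 49 4D")),
    ("7z", 0, bytes.fromhex("37 7A BC AF 27")),
    ("rar", 0, bytes.fromhex("52 61 72 21 1A 07 00")),
    ("tar", 256, bytes.fromhex("00 75 73 74 61 72 00 30 30")),
]


def identify_compressed_file(data: bytes):
    """Return the first type whose feature bytes sit at the mapped offset."""
    for file_type, offset, signature in COMPRESSED_SIGNATURES:
        if data[offset:offset + len(signature)] == signature:
            return file_type
    return None
```

Checking a signature only at its mapped offset keeps the lookup cheap and avoids false matches on short strings such as "PK" that may appear elsewhere in unrelated data.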
As the content of steps S22 to S24 shows, when this embodiment performs file identification on a binary file, it first checks for PE files, then for compound documents, and only then for compressed files. It should be pointed out that this identification process for binary files is only one specific implementation; the order in which the binary file types are checked can be adjusted flexibly according to the actual application. For example, this embodiment may first check for compressed files, then for PE files, and finally for compound documents; or it may first check for compound documents, then for compressed files, and finally for PE files.
Step S25: If the file category of the target file is text file, use the keywords and/or regular expressions of a preset programming file, and/or the keywords and/or regular expressions of a preset script file, to search for the keywords and/or regular expressions corresponding to the text file, and determine the file identification result of the text file according to the search result.
In this embodiment, the file types of text files may specifically include programming files and script files. Programming files may specifically include, but are not limited to, Java programming files and C/C++ programming files. Script files may specifically include, but are not limited to, PHP files, JSP files, ASPX files and ASP files.
For Java programming files, the keywords include "package", "import", "public class", "extends" and "implements", and the regular expressions include: import[\s]*\bjava[(\w+)*.(\w+)+]*. By searching whether the target file contains the above keywords and matches the regular expression, it can be determined whether the target file is a Java programming file.
For C/C++ programming files, the keywords include "#include", "#define", "public", "private", "struct" and "class". By searching whether the target file contains the above keywords, it can be determined whether the target file is a C/C++ programming file.
For PHP files, the keywords include "<", "<php", "$", "function", "array", "isset", "eval" and ">", and the regular expressions include: var\s+\$(\w+)+(\s)=(s)(\w)+. By searching whether the target file contains the above keywords and matches the regular expression, it can be determined whether the target file is a PHP file.
For JSP files, the keywords include "<script", "javascript", "function", "var", "document.", "</script>" and "jsp", and the regular expressions include: <%@page [(w+) (s)] += and bString(\s)+(\w+)\b. By searching whether the target file contains the above keywords and matches the regular expressions, it can be determined whether the target file is a JSP file.
For ASPX files, the keywords include "<%@", "namespace", "system.", "<asp:", "Response", "@", "@renderPage" and "</asp:". By searching whether the target file contains the above keywords, it can be determined whether the target file is an ASPX file.
For ASP files, the keywords include "<%", "vbscript", "option", "explicit", "dim", "Sub", "end" and "response". By searching whether the target file contains the above keywords, it can be determined whether the target file is an ASP file.
Referring to Fig. 3, the step of using the keywords and/or regular expressions of a preset programming file to search for the keywords and/or regular expressions corresponding to the text file, and determining the file identification result of the text file according to the search result, may specifically include:
Step S31: Use the keywords and/or regular expressions of the preset programming file to search for the keywords and regular expressions corresponding to the text file.
It can be understood that the preset programming file in step S31 may specifically be a Java programming file or a C/C++ programming file.
Step S32: Determine the corresponding first actual file recognition rate according to the current search result.
Determining the first actual file recognition rate according to the current search result may specifically include: counting the actual number of keywords and regular expressions of the preset programming file that are hit in the current search result, and dividing that number by the total number of keywords and regular expressions of the preset programming file. The result is the hit rate of the keywords and regular expressions of the preset programming file in the current search result, and this embodiment uses the hit rate as the first actual file recognition rate.
Step S33: Judge whether the first actual file recognition rate is greater than the first preset threshold, and if so, determine that the text file is a programming file.
That is, when the first actual file recognition rate obtained in step S32 is greater than the first preset threshold, the text file can be determined to be the preset programming file of step S31. For example, suppose the keywords of the preset C/C++ programming file are "#include", "#define", "public", "private", "struct" and "class", and the corresponding first preset threshold is 80%. If "#include", "public", "private", "struct" and "class" are found in a file, the hit rate of the C/C++ keywords for that file is 5/6, or about 83.33%. Since this hit rate is greater than the first preset threshold of 80%, the file can be determined to be a C/C++ programming file.
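The hit-rate calculation of steps S31 to S33 can be sketched in Python as below. The keyword list and the 80% threshold come from the example above; the function names `hit_rate` and `is_programming_file`, the plain substring matching and the optional `patterns` argument are assumptions made for illustration.

```python
import re

C_CPP_KEYWORDS = ["#include", "#define", "public", "private", "struct", "class"]


def hit_rate(text, keywords, patterns=()):
    """Share of keywords and regular expressions that occur in text."""
    hits = sum(1 for kw in keywords if kw in text)
    hits += sum(1 for pattern in patterns if re.search(pattern, text))
    total = len(keywords) + len(patterns)
    return hits / total if total else 0.0


def is_programming_file(text, keywords, patterns=(), threshold=0.80):
    """Apply the first preset threshold to the hit rate."""
    return hit_rate(text, keywords, patterns) > threshold
```

For a source file that contains every listed keyword except "#define", `hit_rate` returns 5/6, which exceeds the 0.80 threshold, so `is_programming_file` returns True.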
Referring to Figure 4, the step of using the keywords and/or regular expressions of a preset script file to search for the keywords and/or regular expressions corresponding to the text file, and determining the file identification result of the text file according to the search result, may specifically include:
Step S41: Using the keywords and/or regular expressions of a preset script file, search the text file for the corresponding keywords and regular expressions.
It can be understood that the preset script file in step S41 may specifically be a PHP file, a JSP file, an ASPX file, or an ASP file.
Step S42: Determine the corresponding second actual file recognition rate according to the current search result.
Specifically, the step of determining the second actual file recognition rate according to the current search result may include: counting the actual number of keywords and regular expressions of the preset script file that are hit in the current search result, and dividing that number by the total number of keywords and regular expressions of the preset script file. The quotient is the hit rate of the preset script file's keywords and regular expressions in the current search result, and this embodiment uses that hit rate as the second actual file recognition rate.
Step S43: Determine whether the second actual file recognition rate is greater than a second preset threshold; if so, determine that the text file is a script file.
That is, when the second actual file recognition rate obtained in step S42 is greater than the second preset threshold, the text file can be determined to be the preset script file of step S41.
It can be understood that the first preset threshold and the second preset threshold may be set according to actual application needs and are not specifically limited here.
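Steps S41 to S43 can be sketched in the same way with regular expressions. The pattern lists, the 0.7 threshold, and the choice to test several script types and keep the best match are all illustrative assumptions; the text above only requires that the hit rate for a preset script file exceed the second preset threshold.

```python
import re
from typing import Optional

# Hypothetical regular-expression features per preset script type.
SCRIPT_FEATURES = {
    "PHP": [r"<\?php", r"\$\w+\s*=", r"\becho\b", r"\bfunction\s+\w+\s*\("],
    "JSP": [r"<%@\s*page\b", r"<%=", r"<jsp:\w+", r"\bout\.print(ln)?\("],
}
SECOND_THRESHOLD = 0.7  # assumed value; the text leaves it application-specific


def pattern_hit_rate(text: str, patterns: list) -> float:
    """Step S42: fraction of preset patterns that match somewhere in text."""
    hits = sum(1 for pattern in patterns if re.search(pattern, text))
    return hits / len(patterns)


def identify_script(text: str, threshold: float = SECOND_THRESHOLD) -> Optional[str]:
    """Step S43: return the best script type whose hit rate exceeds the threshold."""
    best_type, best_rate = None, 0.0
    for script_type, patterns in SCRIPT_FEATURES.items():
        rate = pattern_hit_rate(text, patterns)
        if rate > threshold and rate > best_rate:
            best_type, best_rate = script_type, rate
    return best_type


php_sample = '<?php\n$name = "world";\necho "hello " . $name;\n?>'
print(identify_script(php_sample))
print(identify_script("plain notes with no script features"))
```

Raising the threshold makes identification stricter at the cost of leaving more files unidentified, which is the trade-off the text leaves to the application.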
In this embodiment, if the file format or file type of a file still cannot be identified after the above file identification process, the file may be classified as an unknown file. The unknown file may subsequently be sent to a preset unknown-file collection unit, through which a file administrator can view all unknown files and perform file operations on them, such as manual identification.
Correspondingly, an embodiment of the present invention further discloses a file identification device. Referring to Figure 5, the device includes:
File class determining module 11, configured to determine the file class of a target file;
Binary file identification module 12, configured to, when the file class of the target file is binary file, look up the feature string corresponding to the binary file and determine the file identification result of the binary file according to the lookup result;
Text file identification module 13, configured to, when the file class of the target file is text file, search for the keywords and/or regular expressions corresponding to the text file and determine the file identification result of the text file according to the search result.
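The way the three modules cooperate can be sketched as a small dispatcher. How module 11 decides between binary and text is not described in this passage, so the NUL-byte heuristic below is purely an assumption, and the two identification functions are simplified stand-ins for modules 12 and 13.

```python
from typing import Optional


def determine_file_class(data: bytes, sample_size: int = 8192) -> str:
    """Stand-in for module 11. Assumption: a NUL byte near the start means binary."""
    return "binary" if b"\x00" in data[:sample_size] else "text"


def identify_binary(data: bytes) -> Optional[str]:
    """Simplified stand-in for module 12 (feature-string lookup)."""
    return "PE file" if data.startswith(b"MZ") else None


def identify_text(text: str) -> Optional[str]:
    """Simplified stand-in for module 13 (keyword / regular-expression search)."""
    return "script file" if "<?php" in text else None


def identify_file(data: bytes) -> str:
    """Route the target file to the matching module; fall back to 'unknown'."""
    if determine_file_class(data) == "binary":
        result = identify_binary(data)
    else:
        result = identify_text(data.decode("utf-8", errors="replace"))
    return result or "unknown"


print(identify_file(b"MZ\x90\x00\x03\x00"))
print(identify_file(b"<?php echo 1; ?>"))
print(identify_file(b"just some notes"))
```

A file that neither module recognizes comes back as "unknown", matching the unknown-file handling described earlier in this embodiment.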
Specifically, the binary file includes, but is not limited to, a PE file, a compound document, or a compressed file.
In this embodiment, the binary file identification module may include:
First judging unit, configured to look up the file header features of the binary file and determine, according to the current lookup result, whether the binary file is a PE file;
Second judging unit, configured to, when the judgment result of the first judging unit is negative, use a preset mapping table between feature strings and offsets of compound documents to look up the feature string corresponding to the binary file, and determine, according to the current lookup result, whether the binary file is a compound document;
Third judging unit, configured to, when the judgment result of the second judging unit is negative, use a preset mapping table between feature strings and offsets of compressed files to look up the feature string corresponding to the binary file, and determine, according to the current lookup result, whether the binary file is a compressed file.
Further, the file header features include, but are not limited to, a DOS feature and an NT feature.
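The three judging units form a cascade that can be sketched as below. The signature values are widely documented magic numbers chosen for illustration ("MZ" and "PE\0\0" for the DOS and NT headers, the OLE compound file signature, and the ZIP, gzip, RAR, and 7z signatures); they are assumptions and not the actual tables of the patent.

```python
import struct
from typing import Optional

# Feature string -> offset tables (illustrative entries only).
COMPOUND_DOC_TABLE = {b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1": 0}
COMPRESSED_TABLE = {
    b"PK\x03\x04": 0,          # ZIP
    b"\x1F\x8B": 0,            # gzip
    b"Rar!\x1A\x07": 0,        # RAR
    b"7z\xBC\xAF\x27\x1C": 0,  # 7z
}


def is_pe(data: bytes) -> bool:
    """First judging unit: check the DOS feature, then the NT feature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 4-byte little-endian field at 0x3C holds the NT header offset.
    (nt_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[nt_offset:nt_offset + 4] == b"PE\x00\x00"


def matches_table(data: bytes, table: dict) -> bool:
    """Return True if any feature string is found at its mapped offset."""
    return any(data.startswith(sig, offset) for sig, offset in table.items())


def identify_binary(data: bytes) -> Optional[str]:
    """Run the PE, compound document, and compressed file checks in order."""
    if is_pe(data):
        return "PE file"
    if matches_table(data, COMPOUND_DOC_TABLE):
        return "compound document"
    if matches_table(data, COMPRESSED_TABLE):
        return "compressed file"
    return None


# Minimal synthetic PE header: "MZ", padding, NT offset 0x40, then "PE\0\0".
fake_pe = b"MZ" + b"\x00" * 0x3A + struct.pack("<I", 0x40) + b"PE\x00\x00"
print(identify_binary(fake_pe))
print(identify_binary(b"PK\x03\x04" + b"\x00" * 8))
```

Keeping each table as a mapping from feature string to offset means new formats can be supported by adding entries rather than new code paths.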
In this embodiment, the text file includes, but is not limited to, a program file or a script file.
In one specific embodiment, the text file identification module may include:
First search unit, configured to use the keywords and/or regular expressions of a preset program file to search for the keywords and regular expressions corresponding to the text file;
First determination unit, configured to determine the corresponding first actual file recognition rate according to the current search result;
Fourth judging unit, configured to determine whether the first actual file recognition rate is greater than the first preset threshold and, if so, to determine that the text file is a program file.
In another specific embodiment, the text file identification module may include:
First search unit, configured to use the keywords and/or regular expressions of a preset script file to search for the keywords and regular expressions corresponding to the text file;
First determination unit, configured to determine the corresponding second actual file recognition rate according to the current search result;
Fifth judging unit, configured to determine whether the second actual file recognition rate is greater than the second preset threshold and, if so, to determine that the text file is a script file.
Further, the file identification device also includes:
File direct identification module, configured to determine, before the file class determining module determines the file class of the target file, whether the target file has a file extension and, if so, to determine the file identification result of the target file directly from that file extension.
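The direct identification step can be sketched as an extension lookup that runs before any content analysis. The extension table and function names are hypothetical, and falling back to content-based identification when the extension is missing or not in the table is an assumption of this sketch.

```python
import os
from typing import Callable, Optional

# Hypothetical extension table; a real deployment would define its own.
EXTENSION_TABLE = {
    ".php": "PHP script file",
    ".jsp": "JSP script file",
    ".c": "C/C++ program file",
    ".cpp": "C/C++ program file",
    ".java": "Java program file",
    ".exe": "PE file",
    ".zip": "compressed file",
}


def identify_by_suffix(path: str) -> Optional[str]:
    """Return a result from the file extension, or None if there is none or it is unlisted."""
    _, ext = os.path.splitext(path)
    if not ext:
        return None
    return EXTENSION_TABLE.get(ext.lower())


def identify(path: str, identify_by_content: Callable[[str], str]) -> str:
    """Try the extension first; otherwise defer to content-based identification."""
    result = identify_by_suffix(path)
    if result is not None:
        return result
    return identify_by_content(path)


print(identify("index.PHP", lambda p: "unknown"))
print(identify("README", lambda p: "identified from content"))
```

Because an extension can be changed or removed, this shortcut complements rather than replaces the content-based identification described above.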
Correspondingly, the present invention further discloses a file identification apparatus, including a processor and a memory, wherein the processor implements the file identification method disclosed in the foregoing embodiments when executing the computer program stored in the memory. For the specific steps of the file identification method, refer to the corresponding content disclosed in the foregoing embodiments; they are not repeated here.
Further, the present invention also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the file identification method disclosed in the foregoing embodiments. For the specific steps of the file identification method, refer to the corresponding content disclosed in the foregoing embodiments; they are not repeated here.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Finally, it should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
The file identification method, device, apparatus, and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (18)

1. A file identification method, characterized by comprising:
determining the file class of a target file;
if the file class of the target file is binary file, looking up the feature string corresponding to the binary file, and determining the file identification result of the binary file according to the lookup result;
if the file class of the target file is text file, searching for the keywords and/or regular expressions corresponding to the text file, and determining the file identification result of the text file according to the search result.
2. The file identification method according to claim 1, characterized in that the binary file includes a PE file, a compound document, or a compressed file.
3. The file identification method according to claim 2, characterized in that the step of looking up the feature string corresponding to the binary file and determining the file identification result of the binary file according to the lookup result comprises:
looking up the file header features of the binary file, and determining, according to the current lookup result, whether the binary file is a PE file;
if not, using a preset mapping table between feature strings and offsets of compound documents to look up the feature string corresponding to the binary file, and determining, according to the current lookup result, whether the binary file is a compound document;
if not, using a preset mapping table between feature strings and offsets of compressed files to look up the feature string corresponding to the binary file, and determining, according to the current lookup result, whether the binary file is a compressed file.
4. The file identification method according to claim 3, characterized in that the file header features include a DOS feature and an NT feature.
5. The file identification method according to claim 1, characterized in that the text file includes a program file or a script file.
6. The file identification method according to claim 5, characterized in that the step of searching for the keywords and/or regular expressions corresponding to the text file and determining the file identification result of the text file according to the search result comprises:
using the keywords and/or regular expressions of a preset program file to search for the keywords and regular expressions corresponding to the text file;
determining the corresponding first actual file recognition rate according to the current search result;
determining whether the first actual file recognition rate is greater than a first preset threshold and, if so, determining that the text file is a program file.
7. The file identification method according to claim 5, characterized in that the step of searching for the keywords and/or regular expressions corresponding to the text file and determining the file identification result of the text file according to the search result comprises:
using the keywords and/or regular expressions of a preset script file to search for the keywords and regular expressions corresponding to the text file;
determining the corresponding second actual file recognition rate according to the current search result;
determining whether the second actual file recognition rate is greater than a second preset threshold and, if so, determining that the text file is a script file.
8. The file identification method according to any one of claims 1 to 7, characterized in that, before the step of determining the file class of the target file, the method further comprises:
determining whether the target file has a file extension;
if so, determining the file identification result of the target file directly from the file extension of the target file.
9. A file identification device, characterized by comprising:
a file class determining module, configured to determine the file class of a target file;
a binary file identification module, configured to, when the file class of the target file is binary file, look up the feature string corresponding to the binary file and determine the file identification result of the binary file according to the lookup result;
a text file identification module, configured to, when the file class of the target file is text file, search for the keywords and/or regular expressions corresponding to the text file and determine the file identification result of the text file according to the search result.
10. The file identification device according to claim 9, characterized in that the binary file includes a PE file, a compound document, or a compressed file.
11. The file identification device according to claim 10, characterized in that the binary file identification module comprises:
a first judging unit, configured to look up the file header features of the binary file and determine, according to the current lookup result, whether the binary file is a PE file;
a second judging unit, configured to, when the judgment result of the first judging unit is negative, use a preset mapping table between feature strings and offsets of compound documents to look up the feature string corresponding to the binary file, and determine, according to the current lookup result, whether the binary file is a compound document;
a third judging unit, configured to, when the judgment result of the second judging unit is negative, use a preset mapping table between feature strings and offsets of compressed files to look up the feature string corresponding to the binary file, and determine, according to the current lookup result, whether the binary file is a compressed file.
12. The file identification device according to claim 11, characterized in that the file header features include a DOS feature and an NT feature.
13. The file identification device according to claim 9, characterized in that the text file includes a program file or a script file.
14. The file identification device according to claim 13, characterized in that the text file identification module comprises:
a first search unit, configured to use the keywords and/or regular expressions of a preset program file to search for the keywords and regular expressions corresponding to the text file;
a first determination unit, configured to determine the corresponding first actual file recognition rate according to the current search result;
a fourth judging unit, configured to determine whether the first actual file recognition rate is greater than a first preset threshold and, if so, to determine that the text file is a program file.
15. The file identification device according to claim 13, characterized in that the text file identification module comprises:
a first search unit, configured to use the keywords and/or regular expressions of a preset script file to search for the keywords and regular expressions corresponding to the text file;
a first determination unit, configured to determine the corresponding second actual file recognition rate according to the current search result;
a fifth judging unit, configured to determine whether the second actual file recognition rate is greater than a second preset threshold and, if so, to determine that the text file is a script file.
16. The file identification device according to any one of claims 9 to 14, characterized by further comprising:
a file direct identification module, configured to determine, before the file class determining module determines the file class of the target file, whether the target file has a file extension and, if so, to determine the file identification result of the target file directly from the file extension of the target file.
17. A file identification apparatus, characterized by comprising a processor and a memory, wherein the processor implements the file identification method according to any one of claims 1 to 8 when executing the computer program stored in the memory.
18. A computer-readable storage medium, characterized in that it is used for storing a computer program, wherein the computer program, when executed by a processor, implements the file identification method according to any one of claims 1 to 8.
CN201810265755.2A 2018-03-28 2018-03-28 A kind of file identification method, device, equipment and storage medium Pending CN108460155A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810265755.2A CN108460155A (en) 2018-03-28 2018-03-28 A kind of file identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN108460155A true CN108460155A (en) 2018-08-28

Family

ID=63238082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810265755.2A Pending CN108460155A (en) 2018-03-28 2018-03-28 A kind of file identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108460155A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680850B2 (en) * 2006-03-31 2010-03-16 Fujitsu Limited Computer-readable recording medium storing information search program, information search method, and information search system
CN102902768A (en) * 2012-09-24 2013-01-30 广东威创视讯科技股份有限公司 Method and system for searching and displaying file content
CN103701821A (en) * 2013-12-31 2014-04-02 北京网康科技有限公司 File type recognition method and device
CN104679871A (en) * 2015-03-06 2015-06-03 北京语言大学 Chinese text searching method and Chinese text searching device
CN105975575A (en) * 2016-05-04 2016-09-28 电子科技大学 Automatic data type recognition method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134644A (en) * 2019-05-17 2019-08-16 成都卫士通信息产业股份有限公司 File type identification method, device, electronic equipment and readable storage medium storing program for executing
CN110502486A (en) * 2019-08-21 2019-11-26 中国工商银行股份有限公司 Log processing method, device, electronic equipment and computer readable storage medium
CN110502486B (en) * 2019-08-21 2022-01-11 中国工商银行股份有限公司 Log processing method and device, electronic equipment and computer readable storage medium
CN110825701A (en) * 2019-11-07 2020-02-21 深信服科技股份有限公司 File type determination method and device, electronic equipment and readable storage medium
CN111159709A (en) * 2019-12-27 2020-05-15 深信服科技股份有限公司 File type identification method, device, equipment and storage medium
CN113111147A (en) * 2020-01-13 2021-07-13 深信服科技股份有限公司 Text type identification method and device, electronic equipment and storage medium
CN111352907A (en) * 2020-03-30 2020-06-30 见知数据科技(上海)有限公司 Method and device for analyzing pipeline file, computer equipment and storage medium
CN113742002A (en) * 2021-09-10 2021-12-03 上海达梦数据库有限公司 Method, device, equipment and storage medium for acquiring dependency relationship of dynamic library

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180828