WO2013091435A1 - File type identification method and file type identification device - Google Patents

File type identification method and file type identification device

Info

Publication number
WO2013091435A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
identified
type
file type
header
Prior art date
Application number
PCT/CN2012/083169
Other languages
English (en)
French (fr)
Inventor
阮玲宏
蒋武
李世光
王振辉
Original Assignee
华为数字技术(成都)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为数字技术(成都)有限公司
Priority to EP12860856.9A (EP2733892A4)
Publication of WO2013091435A1
Priority to US14/198,326 (US20140189879A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/64: Protecting data integrity, e.g. using checksums, certificates or signatures
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/02: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227: Filtering policies
    • H04L63/0245: Filtering by information in the payload
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441: Countermeasures against malicious traffic
    • H04L63/145: Countermeasures against malicious traffic, the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/06: Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]

Definitions

  • The present invention relates to the field of computer and communication technologies, and in particular to a file type identification method and a file type identification device.
  • Computer networks greatly facilitate people's lives, enabling people in different locations to transfer data seamlessly over networked computers, but this also poses a challenge to information security.
  • For enterprises, how to keep confidential information secure without disrupting normal work and business has become a pressing issue.
  • For example, when a user sends an email with an attachment to another user connected to the network, the enterprise often needs, for security and auditing purposes (for example, to prevent confidential information from being sent to the wrong recipient), to identify and detect the type of the transmitted file and to decide from the result whether the email needs to be filtered.
  • Early file type identification technology determines the file type from the file name suffix.
  • The principle is as follows:
  • A detecting device placed between the sender and the receiver performs protocol analysis on the transmitted data packets and, if it determines that a file is being transmitted, extracts the suffix.
  • It then determines the type of the file from the correspondence between suffixes and file types; for example, the suffix "doc" indicates a Word file and the suffix "txt" indicates a text file.
  • This scheme can only identify the type of a file that has a suffix. If the sender manually removes the suffix and the receiver restores the real suffix after the transmission is complete, the filtering device cannot identify and filter the file effectively.
  • To solve this problem, the prior art proposes a file type identification method based on the "magic number". A "magic number" is the content of a field in a file header that reflects the characteristics of a particular file type.
  • The principle is that the detecting device analyzes the file header of the transmitted file. If the file header contains a pre-stored magic number corresponding to a known file type, the device determines that the transmitted file is of the file type corresponding to that magic number.
  • However, the sender can deliberately modify several bytes in the file header so that the content of the header, especially the field holding the magic number, changes; the receiver then restores the real file header after the transmission is complete. This also evades identification and filtering.
  • In this case, the existing detecting device cannot determine which type of file is being transmitted. The prior art therefore cannot effectively identify the type of a file transmitted over the network and so cannot ensure the security of confidential information.
  • An embodiment of the present invention provides a file type identification method to solve the problem that the prior art cannot effectively identify the file type when the sender tampers with the transmitted file.
  • Correspondingly, an embodiment of the present invention further provides a file type identification device.
  • A file type identification method includes:
  • obtaining the file header of a file to be identified from a transmitted data packet, and determining whether a magic number of the file to be identified can be obtained from the file header;
  • if the magic number can be obtained, looking up, in a first correspondence between file types and magic numbers, the file type corresponding to the magic number in the file header;
  • determining whether the data of the file to be identified conforms to the data structure features of that file type; if yes, determining that the file type of the file to be identified is the file type corresponding to the magic number in the file header; if not, determining that the file type of the file to be identified is an abnormal type, where the abnormal type indicates that the file to be identified is a file whose type has been tampered with.
  • A file type identification device includes: a first test unit, configured to obtain the file header of the file to be identified from the transmitted data packet, and to test whether a magic number of the file to be identified can be obtained from the file header;
  • a first searching unit, configured to: if the first test unit can obtain the magic number of the file to be identified, look up, in the first correspondence between file types and magic numbers, the file type corresponding to the magic number in the file header;
  • a first judging unit, configured to determine whether the data of the file to be identified conforms to the data structure features of that file type; and
  • a first determining unit, configured to: if the first judging unit finds that the data conforms, determine that the file type of the file to be identified is the file type corresponding to the magic number in the file header; if the data does not conform, determine that the file type of the file to be identified is an abnormal type, where the abnormal type indicates that the file to be identified is a file whose type has been tampered with.
  • After determining the type of the file to be identified from the magic number in the file header, the embodiment of the present invention further checks whether the file structure features reflected by the data in the file conform to the file structure features of the file type determined from the magic number. Only if they conform is the file type of the file to be identified finally confirmed.
  • With this scheme, the detecting device can effectively identify files whose type has been tampered with, protecting confidential information from malicious leakage.
  • FIG. 1 is a schematic flowchart of a file identification method according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart of a file identification method according to Embodiment 2 of the present invention.
  • FIG. 3 is a schematic diagram of a file identification example according to Embodiment 2 of the present invention.
  • FIG. 4 is a flowchart of a file identification method according to Embodiment 3 of the present invention.
  • FIG. 5 is a schematic diagram of the structural features of a Portable Document Format (PDF) file in Embodiment 3 of the present invention.
  • FIG. 6 is a schematic diagram of a first structure of a file type identification device according to Embodiment 4 of the present invention.
  • FIG. 7 is a second schematic structural diagram of a file type identification device according to Embodiment 4 of the present invention.
  • FIG. 8 is a schematic structural diagram of a first determining unit in a file type identifying apparatus according to an embodiment of the present disclosure
  • The detecting device is disposed between the sender and the receiver of the data packets, and data packets sent by the sender must pass through the detecting device to reach the receiver.
  • The detecting device may be a firewall device, an Intrusion Prevention System (IPS) device, or the like, or may be integrated as a stand-alone module into a device such as a router or an IPS.
  • The detecting device may also be a software module in a host's browser, instant messaging (IM) client, or other application software.
  • The detecting device inspects the data packets transmitted between the sender and the receiver and identifies the file type of the file carried in them. Further, based on the identified file type and a pre-configured filtering policy, the detecting device may filter data packets that carry the file types named in the policy, to keep confidential information secure.
  • Step 10: The detecting device obtains the file header of the file to be identified from the transmitted data packets and determines whether a magic number of the file to be identified can be obtained from the file header; if yes, go to step 20.
  • The detecting device performs layer-by-layer protocol analysis on the data packets flowing through it. For the packet parsing method, refer to existing Deep Packet Inspection (DPI) devices; it is not described in detail here.
  • After receiving a transmitted data packet, the detecting device obtains the payload of the data packet through deep protocol parsing and determines whether the payload contains a feature field of file transmission. If it does, the detecting device determines that the data packet carries a file.
  • Judging from a feature field whether a data packet carries a file is prior art; see the standards for existing application-layer protocols that can carry files, such as RFC 2616 for the Hypertext Transfer Protocol (HTTP), RFC 959 for the File Transfer Protocol (FTP), and RFC 783 for the Trivial File Transfer Protocol (TFTP). It is not described further here.
  • After determining that the content carried by the data packet is a file, the detecting device caches the file data in the payload of the data packet according to the file start address indicated by the start address field in the file header, and determines whether the cached file data has reached a predetermined size. If it has, the cached file data is used as the file header of the file to be identified; otherwise, the device continues to cache file data from the payloads of subsequent data packets in the same data stream.
  • After the cached file data reaches the predetermined size, the detecting device compares the cached data in turn with the magic numbers of the identifiable file types. If a magic number matches, that magic number is taken as the magic number in the file header of the file to be identified; otherwise, the device determines that the magic number of the file to be identified cannot be obtained.
  • The predetermined size is determined from empirical data, such as the lengths of the magic numbers of the dozens of identifiable file types currently known.
  • The magic number is the content of a field in the file header that can be used to identify the file type. Note that the magic number is an important means of identifying the file type: as long as the file type of a file is identifiable, the magic number corresponding to that file type can be extracted from its file header. The length, value, and characteristics of the magic number differ between file types. Some file types have a 2-byte magic number, others a 20-byte or 22-byte one; they cannot all be listed here. The length of a magic number usually falls in the range of 2 to 32 bytes, so the size of the cached data can be set within 2 to 32 bytes; within this range the cache does not occupy too much space and good recognition results can be achieved.
  • Step 20: If the magic number of the file to be identified is obtained, look up, in the first correspondence between file types and magic numbers, the file type corresponding to the magic number in the file header.
  • The first correspondence between file types and magic numbers is pre-stored in the detecting device; through it, the file type can be determined from the magic number extracted from the file.
  • A specific example: the file is a compressed file of the rar (Roshal ARchive) type; the sender tampers with the magic number in the file header, changing it to the magic number corresponding to the PDF file type, and sends the tampered file to the receiver.
  • After obtaining the magic number, the detecting device looks up the corresponding file type in the first correspondence and determines that the file to be identified is a PDF file.
  • Step 30: Determine whether the data of the file to be identified conforms to the data structure features of the file type corresponding to the magic number. If yes, go to step 40; otherwise, go to step 50.
  • The data structure features of a file reflect how the data in the file is organized.
  • The data structure features are fixed when the file format is designed, and all files of that type follow this form of data organization.
  • File structure features include characteristic characters or strings, the data structure formats used when data is stored, the relationships between objects of the various data structures, cross-reference tables, and so on.
  • A suitable file parser can be designed for each file type, and the file data can be fed into the parser for the file type in question. If the parser yields correct file content rather than garbled output, the file data conforms to the data structure features of that file type.
  • Step 40: If the data conforms to the structural features of the file type corresponding to the magic number, determine that the file type of the file to be identified is the file type corresponding to the magic number in the file header.
  • Step 50: If the data does not conform to the structural features of the file type corresponding to the magic number, determine that the file type of the file to be identified is an abnormal type, where the abnormal type indicates that the file to be identified is a file whose type has been tampered with.
  • In the example above, the file type determined from the magic number is PDF, while the file structure features extracted from the file to be identified are those of a rar file. The two differ, which indicates that the file to be identified has been tampered with.
  • The data stream containing the data packet may be allowed to pass while identification is in progress, but once the file type of the file to be identified is determined to be the abnormal type, the data stream is blocked.
  • After determining the type of the file to be identified from the magic number in the file header, the embodiment of the present invention further checks whether the file structure features reflected by the data in the file conform to the file structure features of the file type determined from the magic number. Only if they conform is the file type finally confirmed. In this way, even if the sender tries to evade detection by tampering with the magic number in the header of the file to be identified, the structural features of the file still correspond to the type of the original magic number and not to the type of the tampered magic number, so the detecting device can recognize files whose type has been tampered with.
  • The file type identification method can therefore improve the accuracy of file type identification and strengthen the protection of confidential information.
  • In practice, the sender may not know exactly what the magic number in the file header is. In that case the sender often arbitrarily modifies the contents of some fields in the file header, so that the modified header contains no magic number of any identifiable file type.
  • Steps 10 to 50 are similar to those of the first embodiment and are not repeated in detail here.
  • Step 10: The detecting device obtains the file header of the file to be identified from the transmitted data packets and determines whether a magic number of the file to be identified can be obtained from the file header; if yes, go to step 20; otherwise, go to step 60.
  • A specific example: the original file is a rar file; the sender tampers with the content of the magic number field in the file header so that the tampered data is not the magic number of any identifiable file type, and sends the tampered file to the receiver.
  • The detecting device then cannot obtain the magic number of the file to be identified in the manner described in step 10 of the first embodiment.
  • Step 20: If the magic number of the file to be identified is obtained, look up, in the first correspondence between file types and magic numbers, the file type corresponding to the magic number in the file header.
  • Step 30: Determine whether the data of the file to be identified conforms to the structural features of the file type corresponding to the magic number. If yes, go to step 40; otherwise, go to step 50.
  • Step 40: If the data conforms to the structural features of the file type corresponding to the magic number, determine that the file type of the file to be identified is the file type corresponding to the magic number in the file header.
  • Step 50: If the data does not conform to the structural features of the file type corresponding to the magic number, determine that the file type of the file to be identified is an abnormal type, where the abnormal type indicates that the file to be identified is a file whose type has been tampered with.
  • Step 60: If the magic number of the file to be identified cannot be obtained, determine whether the suffix of the file to be identified can be extracted from the data packet. If yes, go to step 70; otherwise, go to step 80.
  • The file name is obtained by deep protocol parsing of the data packet. A predetermined suffix acquisition policy determines whether the file name includes a suffix and, if so, extracts it.
  • Step 70: If the suffix can be extracted, look up, in a second correspondence between suffixes and file types, the file type corresponding to the suffix of the file to be identified, and go to step 90.
  • In the example, the detecting device uses the suffix "rar" to find the corresponding file type, compressed file, in the second correspondence.
  • Step 80: If the suffix cannot be extracted, determine that the type of the file to be identified is an unrecognized file type.
  • Step 90: Determine whether the file type found in the second correspondence exists in the first correspondence, where the file types in the first correspondence are the identifiable file types. If yes, go to step 100; otherwise, go to step 110.
  • Step 100: If the file type found in the second correspondence exists in the first correspondence, determine that the file type of the file to be identified is an abnormal type, where the abnormal type indicates that the file to be identified is a file whose type has been tampered with.
  • Step 110: If the file type found in the second correspondence does not exist in the first correspondence, determine that the type of the file to be identified is an unrecognized file type.
  • Optionally, step 40 further includes: Step 401: Determine whether the suffix of the file to be identified can be extracted from the data packet; if yes, go to step 402.
  • Otherwise, determine that the file type of the file to be identified is the file type corresponding to the magic number in the file header.
  • Step 402: Look up, in the stored second correspondence between suffixes and file types, the file type corresponding to the suffix of the file to be identified.
  • Step 403: Compare the file type corresponding to the suffix of the file to be identified with the file type corresponding to the magic number in the file header, and check whether the two are consistent. If they are, go to step 404; otherwise, go to step 405.
  • Step 404: Determine that the file type of the file to be identified is the file type corresponding to the magic number in the file header.
  • Step 405: Determine that the file type of the file to be identified is an abnormal type.
  • Building on the first embodiment, the file type identification method provided by this embodiment also handles the case where the sender arbitrarily modifies the magic number of the original file; it refines the file identification process and widens the scope of application.
  • In this embodiment, an Office file and a PDF file are taken as an example to describe the file type identification method provided in the first and second embodiments.
  • The original file is an Office file.
  • To evade detection, the sender changes the magic number in the file header to the magic number of the PDF file type.
  • FIG. 4 is a flowchart of a file type identification method according to an embodiment of the present invention. The steps are similar to those in FIG. 2; only the steps performed in this example are described in detail here, and the steps that are not performed are not repeated.
  • Step 310: The detecting device obtains the file header of the file to be identified from the transmitted data packets and determines whether a magic number of the file to be identified can be obtained from the file header; if yes, go to step 320.
  • Based on the format defined by the protocol used to transmit the file, and after confirming from the feature field contained in the data packet that the packet is transmitting a file, the detecting device extracts file information from the data packet. The file information includes the file name, the file start address, the packet size, and so on.
  • The payloads of the data packets in the data stream are buffered until 32 bytes have been buffered, and the cached data is used as the file header.
  • The detecting device obtains the magic number "%PDF-xx%" in the header of the file to be identified from the cached data, where xx is the version identifier.
  • Step 320: If the magic number of the file to be identified is obtained, look up, in the first correspondence between file types and magic numbers, the file type corresponding to the magic number in the file header.
  • The detecting device finds in the first correspondence that the file type corresponding to the magic number "%PDF-xx%" is a PDF file.
  • Step 330: Determine whether the data of the file to be identified conforms to the structural features of the file type corresponding to the magic number; if not, go to step 350.
  • The header of a PDF file begins with "%PDF-xx%".
  • The content portion of the PDF file follows the line containing the file header.
  • The content portion consists of objects (identified by obj); for the specific format of an object, refer to the relevant standard definition.
  • The cross-reference table (identified by xref) holds information about the preceding objects, such as the storage offset of each object's data. A combination of several objects and a cross-reference table may be repeated multiple times.
  • The file ends with the file trailer (identified by trailer), the storage offset of the cross-reference table (identified by startxref), and the PDF end-of-file marker (identified by %%EOF).
  • The file trailer is used for quick indexing to cross-reference tables and special objects.
  • The detecting device determines whether the cached data contains a character string identified by obj. If not, the data of the file to be identified does not conform to the structural features of the PDF file type. Because the original file is an Office file, the magic number is followed by an OLE2 structure rather than a string starting with the obj identifier, so the data of the file to be identified does not conform to the structural features of the PDF file type.
  • Step 350: If the data does not conform to the structural features of the file type corresponding to the magic number, determine that the file type of the file to be identified is an abnormal type, where the abnormal type indicates that the file to be identified is a file whose type has been tampered with.
  • Because the data of the file to be identified does not conform to the structural features of the PDF file type, the detecting device outputs the abnormal type as the file type of the file to be identified.
  • Correspondingly, an embodiment of the present invention further provides a file type identification device.
  • The device includes a first test unit 601, a first searching unit 602, a first judging unit 603, and a first determining unit 604, as follows:
  • The first test unit 601 is configured to obtain the file header of the file to be identified from the transmitted data packet, and to test whether a magic number of the file to be identified can be obtained from the file header.
  • The first searching unit 602 is configured to: if the first test unit 601 can obtain the magic number of the file to be identified, look up, in the first correspondence between file types and magic numbers, the file type corresponding to the magic number in the file header.
  • The first judging unit 603 is configured to determine whether the data of the file to be identified conforms to the data structure features of the file type found by the first searching unit 602.
  • The first determining unit 604 is configured to: if the first judging unit 603 finds that the data conforms, determine that the file type of the file to be identified is the file type corresponding to the magic number in the file header; if the data does not conform, determine that the file type of the file to be identified is an abnormal type, where the abnormal type indicates that the file to be identified is a file whose type has been tampered with.
  • The device shown in FIG. 6 further includes:
  • a second test unit 605, configured to: if the first test unit 601 cannot obtain the magic number of the file to be identified, test whether the suffix of the file to be identified can be extracted from the data packet by protocol parsing;
  • a second searching unit 606, configured to: if the second test unit 605 can extract the suffix, look up, in the second correspondence between suffixes and file types, the file type corresponding to the suffix of the file to be identified;
  • a second judging unit 607, configured to determine whether the file type found by the second searching unit 606 in the second correspondence exists in the first correspondence, where the file types in the first correspondence are the identifiable file types;
  • a second determining unit 608, configured to: if the second judging unit 607 finds that the file type exists in the first correspondence, determine that the file type of the file to be identified is an abnormal type; and
  • a third determining unit 609, configured to: if the second test unit 605 cannot extract the suffix, or if the file type found in the second correspondence does not exist in the first correspondence, determine that the type of the file to be identified is an unrecognized file type.
  • The first determining unit 604 includes:
  • a test subunit 801, configured to: when the first judging unit 603 finds that the data conforms, test whether the suffix of the file to be identified can be extracted from the data packet;
  • a searching subunit 802, configured to: if the test subunit 801 can extract the suffix of the file to be identified, look up, in the stored second correspondence between suffixes and file types, the file type corresponding to the suffix of the file to be identified;
  • a comparing subunit 803, configured to compare the file type corresponding to the suffix of the file to be identified, as found by the searching subunit 802, with the file type corresponding to the magic number in the file header; and
  • a determining subunit 804, configured to: if the comparison result of the comparing subunit 803 is consistent, determine that the file type of the file to be identified is the file type corresponding to the magic number in the file header; if the comparison result is inconsistent, determine that the file type of the file to be identified is an abnormal type.
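
The decision flow of the second embodiment (steps 10 to 110, plus the optional steps 401 to 405) can be sketched as a single function. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `identify_type`, the table contents, and the type labels are hypothetical, and the handling of a suffix that is missing from the second correspondence is an assumption, since the text does not cover that case.

```python
from typing import Callable, Optional

ABNORMAL = "abnormal"          # type has been tampered with
UNRECOGNIZED = "unrecognized"  # type cannot be identified

# Hypothetical second correspondence (suffix -> file type).
SUFFIX_TO_TYPE = {"pdf": "pdf", "rar": "rar", "doc": "office"}
# Hypothetical set of types present in the first correspondence.
IDENTIFIABLE_TYPES = {"pdf", "rar", "office"}


def identify_type(
    magic_type: Optional[str],
    conforms: Callable[[str], bool],
    suffix: Optional[str],
) -> str:
    """Combine the magic-number result, the structure check and the suffix.

    magic_type: type found via the magic number (steps 10/20), or None.
    conforms:   structure check for a given type (step 30).
    suffix:     suffix extracted from the packet, or None.
    """
    if magic_type is not None:
        if not conforms(magic_type):
            return ABNORMAL                      # step 50
        if suffix is None:
            return magic_type                    # step 401, no suffix
        suffix_type = SUFFIX_TO_TYPE.get(suffix)  # step 402
        if suffix_type is None:
            return magic_type                    # assumption: unknown suffix ignored
        # steps 403 to 405: the two types must agree
        return magic_type if suffix_type == magic_type else ABNORMAL

    if suffix is None:
        return UNRECOGNIZED                      # step 80
    suffix_type = SUFFIX_TO_TYPE.get(suffix)     # step 70
    if suffix_type in IDENTIFIABLE_TYPES:
        return ABNORMAL                          # step 100
    return UNRECOGNIZED                          # step 110
```

Passing the structure check in as a callable keeps the decision logic separate from any particular file format, so a per-type parser can be swapped in without touching the flow.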

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Virology (AREA)
  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

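The present invention discloses a file type identification method and a file type identification device, which address the problem that the prior art cannot effectively identify a file type when the sender tampers with the transmitted file. The method includes: obtaining the file header of a file to be identified from a transmitted data packet, and determining whether a magic number of the file to be identified can be obtained from the file header; if the magic number can be obtained, looking up, in a first correspondence between file types and magic numbers, the file type corresponding to the magic number in the file header; determining whether the data of the file to be identified conforms to the data structure features of that file type; if so, determining that the file type of the file to be identified is the file type corresponding to the magic number in the file header; if not, determining that the file type of the file to be identified is an abnormal type, where the abnormal type indicates that the file to be identified is a file whose type has been tampered with.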

Description

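File type identification method and file type identification device

This application claims priority to Chinese patent application No. 201110439351.9, filed with the Chinese Patent Office on December 24, 2011 and entitled "File type identification method and file type identification device", which is incorporated herein by reference in its entirety.

Technical field

The present invention relates to the field of computer and communication technologies, and in particular to a file type identification method and a file type identification device.

Background

Computer networks greatly facilitate people's lives, enabling people in different locations to transfer data seamlessly over networked computers, but this also poses a challenge to information security. For enterprises, how to keep confidential information secure without disrupting normal work and business has become a pressing issue. For example, when a user sends an email with an attachment to another user connected to the network, the enterprise often needs, for security and auditing reasons (for example, to prevent confidential information from being sent to the wrong recipient), to identify and detect the type of the transmitted file and to decide from the result whether the email needs to be filtered.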
早期的文件类型识别技术根据文件后缀名来确定文件类型, 其原理为: 设 置于发送方和接收方之间的检测设备对传输的数据包进行协议分析, 如果判断 出正在传输文件, 则提取后缀名, 根据后缀名与文件类型的对应关系, 确定该 文件的类型, 例如若后缀名为" doc", 则为 word文件, 若后缀名为" txt", 则为 文本文件。 但是该方案只能识别出带有后缀名的文件的类型, 如果发送方人为 地去掉文件的后缀名, 接收方在传输完成后再添加真实的后缀名, 则过滤设备 无法进行有效的识别和过滤。
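The suffix-based scheme described above can be sketched as follows. This is a minimal illustration only: the table contents and the names `SUFFIX_TABLE` and `type_from_suffix` are assumptions made for the example, not part of the patent. It also shows the weakness noted in the text: once the suffix is removed, the lookup yields nothing.

```python
import os
from typing import Optional

# Hypothetical suffix-to-type correspondence, using the two examples in the text.
SUFFIX_TABLE = {"doc": "word", "txt": "text"}


def type_from_suffix(file_name: str) -> Optional[str]:
    """Return the file type for the name's suffix, or None if there is none."""
    _, ext = os.path.splitext(file_name)
    return SUFFIX_TABLE.get(ext.lstrip(".").lower())


print(type_from_suffix("report.doc"))  # suffix present
print(type_from_suffix("report"))      # suffix stripped by the sender
```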
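To solve the above problem, the prior art proposes a file type identification method based on "magic numbers". A magic number is the content of a field in the file header that reflects the characteristics of a particular file type. The principle is as follows: the detection device analyzes the file header of the transmitted file; if the file header contains a pre-stored magic number corresponding to a known file type, the device determines that the type of the transmitted file is the file type corresponding to that magic number.
In the course of implementing the present invention, the inventors found that the prior art has at least the following defect:
The sender can deliberately modify a few bytes in the file header so that the content of the header, in particular the field holding the magic number, changes, and the receiver can restore the real file header after the transfer completes. This also evades identification and filtering. In this case, an existing detection device cannot determine which type of file is being transmitted. The prior art therefore cannot effectively identify the type of a file transmitted over the network and cannot ensure the security of confidential information.

Summary
Embodiments of the present invention provide a file type identification method to solve the problem that the prior art cannot effectively identify a file type when the sender tampers with the transmitted file.
Correspondingly, embodiments of the present invention further provide a file type identification apparatus.
The technical solutions provided by the embodiments of the present invention are as follows: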
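A file type identification method includes:
obtaining the file header of a to-be-identified file from a transmitted data packet, and determining whether a magic number of the to-be-identified file can be obtained from the file header;
if the magic number of the to-be-identified file can be obtained, looking up the file type corresponding to the magic number in the file header in a first correspondence between file types and magic numbers;
determining whether the data of the to-be-identified file conforms to the data structure features of that file type;
if it conforms, determining that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header; and if it does not conform, determining that the file type of the to-be-identified file is an abnormal type, where the abnormal type indicates that the to-be-identified file is a file whose type has been tampered with.
A file type identification apparatus includes:
a first test unit, configured to obtain the file header of a to-be-identified file from a transmitted data packet and to test whether a magic number of the to-be-identified file can be obtained from the file header;
a first search unit, configured to, if the first test unit can obtain the magic number of the to-be-identified file, look up the file type corresponding to the magic number in the file header in a first correspondence between file types and magic numbers;
a first judging unit, configured to determine whether the data of the to-be-identified file conforms to the data structure features of that file type;
a first determining unit, configured to determine, if the result of the first judging unit is conformance, that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header, and to determine, if the result is non-conformance, that the file type of the to-be-identified file is an abnormal type, where the abnormal type indicates that the to-be-identified file is a file whose type has been tampered with.
In the embodiments of the present invention, after the type of the to-be-identified file is determined from the magic number in the file header, it is further checked whether the file structure features reflected by the data in the file conform to the file structure features of the file type determined from the magic number. Only if they conform is the file type of the to-be-identified file finally confirmed. This solution enables a detection device to effectively identify files whose type has been tampered with and protects confidential information from malicious leakage.

Brief Description of the Drawings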
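To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the principle of the file identification method according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart of the file identification method according to Embodiment 2 of the present invention;
FIG. 3 is a schematic diagram of a file identification example according to Embodiment 2 of the present invention;
FIG. 4 is a flowchart of the file identification method according to Embodiment 3 of the present invention;
FIG. 5 is a schematic diagram of the structural features of a Portable Document Format (PDF) file in Embodiment 3 of the present invention;
FIG. 6 is a first schematic structural diagram of the file type identification apparatus in Embodiment 4 of the present invention;
FIG. 7 is a second schematic structural diagram of the file type identification apparatus in Embodiment 4 of the present invention;
FIG. 8 is a schematic structural diagram of the first determining unit in the file type identification apparatus according to an embodiment of the present invention.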
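Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
In this embodiment, a detection device is placed between the sender and the receiver of data packets, and packets sent by the sender must pass through the detection device before reaching the receiver. Where the sender is a user inside a local area network built by an enterprise and the receiver is a user outside that network, the detection device may be a protection device deployed at the network boundary, such as a firewall or an intrusion prevention system (IPS) device, or may be integrated as an independent module in a device such as a router or an IPS. For individual users, the detection device may also be a software module in a host browser, an instant messaging (IM) client, or other application software.
The detection device inspects the data packets transmitted between the sender and the receiver and identifies the file type of the file carried in them. Further, based on the identified file type and a preconfigured filtering policy, the detection device may filter data packets that carry files of the types restricted by the policy, so as to keep confidential information secure.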
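As shown in FIG. 1, the principle flow of the file type identification method provided by this embodiment is as follows:
Step 10: The detection device obtains the file header of the to-be-identified file from the transmitted data packet and determines whether a magic number of the to-be-identified file can be obtained from the file header. If so, go to step 20.
The detection device parses the protocols of each data packet passing through it layer by layer. For the parsing method, refer to existing deep packet inspection (DPI) devices; details are not described here.
After receiving a transmitted data packet, the detection device obtains the payload of the packet through deep protocol parsing and determines whether the payload contains a feature field of file transfer. If it does, the detection device determines that the data packet carries a file. Determining from a feature field whether a data packet carries a file is prior art; refer to the standard documents of the application-layer protocols that can be used to transfer files, such as RFC 2616 for the HyperText Transfer Protocol (HTTP), RFC 959 for the File Transfer Protocol (FTP), and RFC 783 for the Trivial File Transfer Protocol (TFTP). Details are not described here.
If the feature field is present, the detection device determines that the content carried by the data packet is a file and, starting from the file start address indicated by the start-address field in the file header, buffers the file data in the payload of the data packet. It then determines whether the buffered file data has reached a predetermined size. If it has, the buffered file data is taken as the file header of the to-be-identified file; otherwise, the device continues to buffer file data from the payloads of subsequent data packets in the same data flow.
Once the buffered file data reaches the predetermined size, the detection device compares the buffered data in turn with the magic number of each identifiable file type. If a magic number matches, it is taken as the magic number in the header of the to-be-identified file; otherwise, the device determines that the magic number of the to-be-identified file cannot be obtained.
The predetermined size is determined from empirical data such as the magic number lengths of the several dozen currently known identifiable file types. A magic number is the content of a field in the file header that can be used to identify the file type. Note that the magic number is an important means of identifying a file type: as long as the file type of a file is identifiable, the magic number corresponding to that type can be extracted from the file header. Magic numbers of different file types differ in length, value, and characteristics. Some are 2 bytes long and others 20 or 22 bytes; they cannot all be listed here, but the length of a magic number is generally in the range of 2 to 32 bytes. The size of the buffered data can therefore be set to 2 to 32 bytes, which avoids occupying excessive buffer space while still achieving a good identification result.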
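The buffering and comparison in step 10 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the names `HEADER_SIZE`, `MAGIC_TABLE`, `buffer_header`, and `find_magic` are hypothetical, the table lists only three well-known signatures, and the payloads are assumed to be already extracted by protocol parsing and to start at the file start address.

```python
from typing import Iterable, Optional

HEADER_SIZE = 32  # upper end of the 2 to 32 byte range given in the text

# Hypothetical table of magic numbers for a few identifiable file types.
MAGIC_TABLE = {
    b"%PDF-": "pdf",
    b"Rar!\x1a\x07": "rar",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",  # legacy Office container
}


def buffer_header(payloads: Iterable[bytes], size: int = HEADER_SIZE) -> bytes:
    """Accumulate payload bytes from one data flow until `size` bytes are held."""
    buf = bytearray()
    for payload in payloads:
        buf += payload
        if len(buf) >= size:
            break  # predetermined size reached; stop buffering
    return bytes(buf[:size])


def find_magic(header: bytes) -> Optional[bytes]:
    """Compare the buffered header with each known magic number in turn."""
    for magic in MAGIC_TABLE:
        if header.startswith(magic):
            return magic
    return None  # the magic number cannot be obtained


header = buffer_header([b"%PD", b"F-1.4\n", b"x" * 40])
print(len(header), find_magic(header))
```

Capping the buffer at 32 bytes reflects the trade-off stated in the text: enough data to hold any of the listed magic numbers without holding whole packets.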
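Step 20: If the magic number of the to-be-identified file can be obtained, look up the file type corresponding to the magic number in the file header in the first correspondence between file types and magic numbers.
The detection device stores the first correspondence between file types and magic numbers in advance. Through the first correspondence, the file type can be determined from the magic number extracted from the file.
In a specific example, the original file is a compressed file of the rar (Roshal ARchive) type. The sender tampers with the magic number in the file header, changing it to the magic number of the PDF file type, and sends the tampered file to the receiver. After obtaining the magic number, the detection device looks up the corresponding file type in the first correspondence and determines that the to-be-identified file is a PDF file.
Step 30: Determine whether the data of the to-be-identified file conforms to the data structure features of the file type corresponding to the magic number. If it conforms, go to step 40; otherwise, go to step 50.
The data structure features of a file reflect how its data is organized. They are fixed when the file format is designed, and all files of a given type follow that organization. File structure features include characteristic characters or strings, the data structure formats used when the data is stored, the relationships between objects of the various data structures, cross-reference tables, and so on. A file parser can be designed to suit the data structure features of a given file type. If the file data is fed into the parser for a file type and the parser produces correct file content rather than garbage, the file data conforms to the data structure features of that file type. A detailed example is given later.
In the example, the structural features extracted from the to-be-identified file are still those of a rar file.
Step 40: If the data conforms to the structural features of the file type corresponding to the magic number, determine that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header.
Step 50: If the data does not conform to the structural features of the file type corresponding to the magic number, determine that the file type of the to-be-identified file is an abnormal type, where the abnormal type indicates that the to-be-identified file is a file whose type has been tampered with.
In the above example, the file type determined from the magic number is PDF, whereas the file structure features extracted from the to-be-identified file are those of a rar file. The two differ, which shows that the to-be-identified file has been tampered with.
Optionally, in this embodiment, the data flow to which the data packets belong may be allowed to pass before the file type of the to-be-identified file is determined to be abnormal, and is blocked once the file type is determined to be abnormal. The advantage is that the detection device does not need to buffer a large number of data packets, while the receiver, missing the data that was blocked, cannot reconstruct the to-be-identified file, so the data remains protected.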
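The optional pass-then-block behaviour can be sketched as a small per-flow gate. This is a minimal sketch under assumptions: `FlowGate` and its `inspect` callback are hypothetical names, and the callback stands in for whatever incremental identification logic produces the "abnormal" verdict. The gate keeps only a one-bit verdict per flow and no packet data, which mirrors the stated advantage of not buffering many packets.

```python
from typing import Callable, Optional


class FlowGate:
    """Forward packets of one data flow until the flow is judged abnormal."""

    def __init__(self, inspect: Callable[[bytes], Optional[str]]) -> None:
        self._inspect = inspect  # hypothetical callback: payload -> verdict or None
        self._blocked = False

    def forward(self, payload: bytes) -> bool:
        """Return True if the packet may pass, False if it is dropped."""
        if self._blocked:
            return False
        if self._inspect(payload) == "abnormal":
            self._blocked = True  # block this and all later packets of the flow
            return False
        return True


# Usage: the verdict arrives on the third packet; later packets are dropped.
verdicts = iter([None, None, "abnormal"])
gate = FlowGate(lambda payload: next(verdicts))
print([gate.forward(b"data") for _ in range(5)])
```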
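In this embodiment, after the type of the to-be-identified file is determined from the magic number in the file header, it is further checked whether the file structure features reflected by the data in the file conform to the file structure features of the file type determined from the magic number. Only if they conform is the file type finally confirmed. Thus, even if the sender attempts to evade detection by tampering with the magic number in the file header, the structural features of the file still correspond to the type of the original magic number and not to the type of the tampered magic number, so the detection device can identify the file as one whose type has been tampered with.
Compared with tampering with the magic number, evading detection by tampering with the file structure features is much harder for the sender, because once part of the data in the file content is modified, the receiver is likely to be unable to recover the original file. The file type identification method provided by this embodiment therefore improves the accuracy of file type identification and strengthens the security of confidential information.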
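Steps 10 to 50 of this embodiment can be sketched end to end as follows. This is a sketch under stated assumptions rather than the patent's implementation: the names `identify`, `FIRST_CORRESPONDENCE`, and `STRUCTURE_CHECKS` are hypothetical, only the PDF type is registered, and the structure check is a crude keyword heuristic standing in for a real format parser.

```python
from typing import Callable, Dict, Optional

ABNORMAL = "abnormal"

# Hypothetical first correspondence: magic number -> file type.
FIRST_CORRESPONDENCE: Dict[bytes, str] = {b"%PDF-": "pdf"}

# Hypothetical structure checks; a real device would use a parser per type.
STRUCTURE_CHECKS: Dict[str, Callable[[bytes], bool]] = {
    "pdf": lambda data: b"obj" in data and b"xref" in data,
}


def identify(data: bytes, header_size: int = 32) -> Optional[str]:
    header = data[:header_size]                     # step 10: take the file header
    for magic, file_type in FIRST_CORRESPONDENCE.items():
        if header.startswith(magic):                # step 20: look up the type
            if STRUCTURE_CHECKS[file_type](data):   # step 30: check the structure
                return file_type                    # step 40: type confirmed
            return ABNORMAL                         # step 50: type tampered with
    return None  # no magic number found; not covered by this embodiment


genuine = (b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
           b"xref\n0 1\ntrailer\n<< >>\nstartxref\n9\n%%EOF\n")
# A rar-like body whose first bytes were overwritten with a PDF magic number.
tampered = b"%PDF-1." + b"\xcf\x90\x73\x00\x00\x0d" + b"\x00" * 32
print(identify(genuine), identify(tampered))
```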
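Embodiment 2
When a sender attempts to evade detection by tampering with the magic number in the header of the to-be-identified file, the sender may change the magic number of one file type into that of another. Alternatively, the sender may not know exactly where the magic number field lies in the file header, or what the magic numbers of other file types are. In that case the sender often modifies part of the file header arbitrarily, and the modified header does not contain the magic number of any identifiable file type.
To handle this situation, this embodiment improves on Embodiment 1. The flowchart of the improved file type identification method is shown in FIG. 2. Steps 10 to 50 are similar to those in Embodiment 1 and are not described again in detail.
Step 10: The detection device obtains the file header of the to-be-identified file from the transmitted data packet and determines whether a magic number of the to-be-identified file can be obtained from the file header. If so, go to step 20; otherwise, go to step 60.
In a specific example, the original file is of the rar type. The sender tampers with the content of the magic number field in the file header so that the tampered data is not the magic number of any identifiable file type, and sends the tampered file to the receiver.
Using the approach of step 10 in Embodiment 1, the detection device cannot obtain the magic number of the to-be-identified file.
Step 20: If the magic number of the to-be-identified file can be obtained, look up the file type corresponding to the magic number in the file header in the first correspondence between file types and magic numbers.
Step 30: Determine whether the data of the to-be-identified file conforms to the structural features of the file type corresponding to the magic number. If it conforms, go to step 40; otherwise, go to step 50.
Step 40: If the data conforms to the structural features of the file type corresponding to the magic number, determine that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header.
Step 50: If the data does not conform to the structural features of the file type corresponding to the magic number, determine that the file type of the to-be-identified file is an abnormal type, where the abnormal type indicates that the to-be-identified file is a file whose type has been tampered with.
Step 60: If the magic number of the to-be-identified file cannot be obtained, determine whether the suffix of the to-be-identified file can be extracted from the data packet. If so, go to step 70; otherwise, go to step 80.
The file name is obtained by deep protocol parsing of the data packet. According to a predetermined suffix acquisition policy, it can be determined whether the file name contains a suffix, and the suffix can be obtained.
Step 70: If the suffix can be extracted, look up the file type corresponding to the suffix of the to-be-identified file in the second correspondence between suffixes and file types, and go to step 90.
In the above example, the detection device finds in the second correspondence that the file type corresponding to the suffix "rar" is a compressed file.
Step 80: If the suffix cannot be extracted, determine that the type of the to-be-identified file is an unrecognized file type.
Step 90: Determine whether the file type found in the second correspondence exists in the first correspondence, where the file types in the first correspondence are the identifiable file types. If so, go to step 100; otherwise, go to step 110.
Step 100: If the file type found in the second correspondence exists in the first correspondence, determine that the file type of the to-be-identified file is an abnormal type, where the abnormal type indicates that the to-be-identified file is a file whose type has been tampered with.
In the above example, the compressed file type corresponding to the suffix "rar" exists in the first correspondence, yet step 10 did not obtain the magic number of the compressed file type, that is, it did not obtain the magic number of any identifiable file type. This shows that the magic number in the header of the to-be-identified file has been tampered with.
Step 110: If the file type found in the second correspondence does not exist in the first correspondence, determine that the type of the to-be-identified file is an unrecognized file type.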
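The fallback path of steps 60 to 110, taken only when no magic number was obtained in step 10, can be sketched as follows. The names `classify_without_magic`, `FIRST`, and `SECOND` and the table contents are assumptions for illustration; in particular the "xyz" entry is an invented suffix used to exercise step 110, and the file name is assumed to have been recovered by protocol parsing.

```python
import os
from typing import Optional

ABNORMAL = "abnormal"
UNRECOGNIZED = "unrecognized"

# Hypothetical first correspondence: magic number -> identifiable file type.
FIRST = {b"Rar!\x1a\x07": "rar", b"%PDF-": "pdf"}
# Hypothetical second correspondence: suffix -> file type.
SECOND = {"rar": "rar", "pdf": "pdf", "xyz": "xyz-data"}


def classify_without_magic(file_name: Optional[str]) -> str:
    """Steps 60 to 110: decide the verdict when the header held no known magic."""
    suffix = ""
    if file_name:
        suffix = os.path.splitext(file_name)[1].lstrip(".").lower()
    if not suffix:
        return UNRECOGNIZED                 # step 80: no suffix to work with
    suffix_type = SECOND.get(suffix)        # step 70: look up type by suffix
    if suffix_type in set(FIRST.values()):  # step 90: is the type identifiable?
        return ABNORMAL                     # step 100: magic should have matched
    return UNRECOGNIZED                     # step 110


print(classify_without_magic("secret.rar"))
print(classify_without_magic("data.xyz"))
```

The reasoning behind step 100 is that an identifiable type always has a magic number in its header, so a suffix naming such a type combined with a missing magic number implies the header was altered.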
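The above implementation can accurately determine the type of the to-be-identified file. Optionally, to also detect the case where the sender merely changes the suffix, and to further improve the reliability and accuracy of tamper detection, step 40 is refined. As shown in FIG. 3, step 40 further includes:
Step 401: Determine whether the suffix of the to-be-identified file can be extracted from the data packet. If so, go to step 402.
Optionally, if no suffix is extracted, determine that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header.
Step 402: Look up the file type corresponding to the suffix of the to-be-identified file in the stored second correspondence between suffixes and file types.
Step 403: Compare the file type found for the suffix of the to-be-identified file with the file type corresponding to the magic number in the file header, and check whether the two are consistent. If they are consistent, go to step 404; otherwise, go to step 405.
Step 404: Determine that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header.
Step 405: Determine that the file type of the to-be-identified file is an abnormal type.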
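Steps 401 to 405 can be sketched as a final cross-check between the type derived from the magic number and the type derived from the suffix. The names `confirm_with_suffix` and `SECOND` are hypothetical. One point is an assumption of this sketch, since the text does not address it: a suffix that is absent from the second correspondence is treated as inconsistent.

```python
import os
from typing import Optional

ABNORMAL = "abnormal"

# Hypothetical second correspondence: suffix -> file type.
SECOND = {"pdf": "pdf", "rar": "rar", "txt": "text"}


def confirm_with_suffix(magic_type: str, file_name: Optional[str]) -> str:
    """Refine step 40 once magic number and structure already agree."""
    suffix = ""
    if file_name:                                   # step 401: try to get a suffix
        suffix = os.path.splitext(file_name)[1].lstrip(".").lower()
    if not suffix:
        return magic_type                           # no suffix: keep the magic type
    suffix_type = SECOND.get(suffix)                # step 402: look up by suffix
    if suffix_type == magic_type:                   # step 403: compare the two
        return magic_type                           # step 404
    return ABNORMAL                                 # step 405 (unknown suffix too)


print(confirm_with_suffix("pdf", "paper.pdf"))
print(confirm_with_suffix("pdf", "paper.txt"))
```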
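Building on Embodiment 1, the file type identification method provided by this embodiment also handles the case where the magic number of the original file is modified arbitrarily by the sender. It completes the file identification flow and widens the scope of application.
Embodiment 3
This embodiment uses Office files and PDF files as examples to illustrate the file type identification methods provided in Embodiment 1 and Embodiment 2. Here the original file is an Office file, and to evade detection the sender changes the magic number in the file header to that of the PDF file type.
FIG. 4 is a flowchart of the file type identification method provided by this embodiment. The steps are similar to those in FIG. 2. Only the steps actually executed in this example are described in detail; steps that are not executed are not repeated.
Step 310: The detection device obtains the file header of the to-be-identified file from the transmitted data packet and determines whether a magic number of the to-be-identified file can be obtained from the file header. If so, go to step 320.
Following the format definitions of the various protocols used to transfer files, and after confirming from the feature field in the data packet that the packet is transferring a file, the detection device extracts file information from the data packet. The file information includes the file name, the file start address, the packet size, and so on.
Starting from the file start address, the device buffers the payloads of the data packets that carry the file in the data flow until 32 bytes have been buffered, and takes the buffered data as the file header.
From the buffered data, the detection device obtains the magic number "%PDF-xx%" in the header of the to-be-identified file, where xx is the version identifier.
Step 320: If the magic number of the to-be-identified file can be obtained, look up the file type corresponding to the magic number in the file header in the first correspondence between file types and magic numbers.
The detection device finds in the first correspondence that the file type corresponding to the magic number "%PDF-xx%" is a PDF file.
Step 330: Determine whether the data of the to-be-identified file conforms to the structural features of the file type corresponding to the magic number. If it does not conform, go to step 350.
The structural features of a PDF file are shown in FIG. 5.
The header of a PDF file begins with "%PDF-xx%". After the line containing the file header comes the body of the PDF file. The body consists of objects (marked obj); for the exact format of an object, refer to the relevant standard. Several objects are followed by a cross-reference table (marked xref), which stores information about the preceding objects, such as the offset at which each object's data is stored. The combination of several objects and a cross-reference table may repeat multiple times. The file ends with the trailer (marked trailer), the storage offset of each cross-reference table (marked startxref), and the PDF end-of-file marker (marked %%EOF). The trailer is used to quickly locate the cross-reference table and special objects.
The detection device determines whether the buffered data contains a string that begins with the obj marker. If not, the data of the to-be-identified file does not conform to the structural features of the PDF file type. Because the original file is an Office file, the magic number is followed by an OLE2 structure rather than a string beginning with the obj marker, so the data of the to-be-identified file does not conform to the structural features of the PDF file type.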
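The PDF structure check in step 330 can be sketched as follows. This is a rough heuristic for illustration, not a PDF parser: the name `has_pdf_object` and the regular expression are assumptions, and the check only looks for an object header of the form "number number obj" after a "%PDF-" prefix. The tampered sample is an invented byte string that imitates an OLE2 header whose first eight bytes were overwritten.

```python
import re

# An indirect object begins with "<object number> <generation number> obj".
_OBJ_HEADER = re.compile(rb"\d+\s+\d+\s+obj\b")


def has_pdf_object(data: bytes) -> bool:
    """Return True if data starts like a PDF and contains an object header."""
    if not data.startswith(b"%PDF-"):
        return False
    return _OBJ_HEADER.search(data) is not None


genuine = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
# An Office (OLE2) header whose 8-byte magic was replaced with a PDF magic.
tampered = b"%PDF-1.4" + b"\x00" * 16 + b"\x3e\x00\x03\x00\xfe\xff\x09\x00"
print(has_pdf_object(genuine), has_pdf_object(tampered))
```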
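Step 350: If the data does not conform to the structural features of the file type corresponding to the magic number, determine that the file type of the to-be-identified file is an abnormal type, where the abnormal type indicates that the to-be-identified file is a file whose type has been tampered with.
In this example, because the data of the to-be-identified file does not conform to the structural features of the PDF file type, the detection device outputs the abnormal type as the file type of the to-be-identified file.
Embodiment 4
Correspondingly, an embodiment of the present invention further provides a file type identification apparatus. As shown in FIG. 6, the apparatus includes a first test unit 601, a first search unit 602, a first judging unit 603, and a first determining unit 604, as follows:
the first test unit 601 is configured to obtain the file header of a to-be-identified file from a transmitted data packet and to test whether a magic number of the to-be-identified file can be obtained from the file header;
the first search unit 602 is configured to, if the first test unit 601 can obtain the magic number of the to-be-identified file, look up the file type corresponding to the magic number in the file header in the first correspondence between file types and magic numbers;
the first judging unit 603 is configured to determine whether the data of the to-be-identified file conforms to the data structure features of the file type found by the first search unit 602;
the first determining unit 604 is configured to determine, if the result of the first judging unit 603 is conformance, that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header, and to determine, if the result is non-conformance, that the file type of the to-be-identified file is an abnormal type, where the abnormal type indicates that the to-be-identified file is a file whose type has been tampered with.
Further, as shown in FIG. 7, the apparatus of FIG. 6 also includes:
a second test unit 605, configured to test, if the first test unit 601 cannot obtain the magic number of the to-be-identified file, whether the suffix of the to-be-identified file can be extracted from the data packet through protocol parsing;
a second search unit 606, configured to, if the second test unit 605 can extract the suffix, look up the file type corresponding to the suffix of the to-be-identified file in the second correspondence between suffixes and file types;
a second judging unit 607, configured to determine whether the file type found by the second search unit 606 in the second correspondence exists in the first correspondence, where the file types in the first correspondence are the identifiable file types;
a second determining unit 608, configured to determine, if the result of the second judging unit 607 is that the file type exists, that the file type of the to-be-identified file is an abnormal type;
a third determining unit 609, configured to determine, if the second test unit 605 cannot extract the suffix or if the file type found in the second correspondence does not exist in the first correspondence, that the type of the to-be-identified file is an unrecognized file type.
Optionally, referring to FIG. 8, the first determining unit 604 includes:
a test subunit 801, configured to test, when the result of the first judging unit 603 is conformance, whether the suffix of the to-be-identified file can be extracted from the data packet;
a search subunit 802, configured to, if the test subunit 801 can extract the suffix of the to-be-identified file, look up the file type corresponding to the suffix in the stored second correspondence between suffixes and file types;
a comparing subunit 803, configured to compare the file type corresponding to the suffix, as found by the search subunit 802, with the file type corresponding to the magic number in the file header;
a determining subunit 804, configured to determine, if the comparison result of the comparing subunit 803 is consistent, that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header, and to determine, if the comparison result is inconsistent, that the file type of the to-be-identified file is an abnormal type.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disc.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions in other embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.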

Claims

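1. A file type identification method, comprising:
obtaining the file header of a to-be-identified file from a transmitted data packet, and determining whether a magic number of the to-be-identified file can be obtained from the file header;
if the magic number of the to-be-identified file can be obtained, looking up the file type corresponding to the magic number in the file header in a first correspondence between file types and magic numbers;
determining whether the data of the to-be-identified file conforms to the data structure features of the file type;
if it conforms, determining that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header; and if it does not conform, determining that the file type of the to-be-identified file is an abnormal type, wherein the abnormal type indicates that the to-be-identified file is a file whose type has been tampered with.
2. The method according to claim 1, wherein after the determining whether a magic number of the to-be-identified file can be obtained from the file header, the method further comprises:
if the magic number of the to-be-identified file cannot be obtained, determining whether the suffix of the to-be-identified file can be extracted from the data packet through protocol parsing;
if the suffix can be extracted, looking up the file type corresponding to the suffix of the to-be-identified file in a second correspondence between suffixes and file types; determining whether the file type found in the second correspondence exists in the first correspondence, wherein the file types in the first correspondence are identifiable file types; and if it exists, determining that the file type of the to-be-identified file is an abnormal type;
if the suffix cannot be extracted, or if the file type found in the second correspondence does not exist in the first correspondence, determining that the type of the to-be-identified file is an unrecognized file type.
3. The method according to claim 1, wherein the determining, if it conforms, that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header comprises:
if the data of the to-be-identified file conforms to the data structure features of the file type, determining whether the suffix of the to-be-identified file can be extracted from the data packet;
if the suffix of the to-be-identified file can be extracted, looking up the file type corresponding to the suffix in the stored second correspondence between suffixes and file types;
comparing the file type found for the suffix of the to-be-identified file with the file type corresponding to the magic number in the file header;
if the comparison result is consistent, determining that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header.
4. The method according to any one of claims 1 to 3, wherein the obtaining the file header of a to-be-identified file from a transmitted data packet comprises:
after receiving the transmitted data packet, obtaining the payload of the data packet through protocol parsing, and determining whether the payload contains a file header identifier;
if the payload contains a file header identifier, determining that the content carried by the data packet is a file, and buffering the file data in the payload of the data packet according to the file start address indicated by the file header identifier;
determining whether the buffered file data has reached a predetermined size; if so, taking the buffered file data as the file header of the to-be-identified file; otherwise, continuing to buffer file data from the payloads of subsequent data packets in the same data flow.
5. The method according to claim 4, wherein the determining whether a magic number of the to-be-identified file can be obtained from the file header comprises:
comparing the buffered data in turn with the magic number of each identifiable file type;
if a magic number matches, taking the matching magic number as the magic number in the header of the to-be-identified file; otherwise, determining that the magic number of the to-be-identified file cannot be obtained.
6. The method according to claim 4, wherein the predetermined size is 2 bytes to 32 bytes.
7. The method according to claim 1, 2, 3, 5, or 6, wherein before the determining that the file type of the to-be-identified file is an abnormal type, the method further comprises:
allowing the data flow to which the data packet belongs to pass;
and after the determining that the file type of the to-be-identified file is an abnormal type, the method further comprises:
blocking the data flow to which the data packet belongs.
8. A file type identification apparatus, comprising:
a first test unit, configured to obtain the file header of a to-be-identified file from a transmitted data packet and to test whether a magic number of the to-be-identified file can be obtained from the file header;
a first search unit, configured to, if the first test unit can obtain the magic number of the to-be-identified file, look up the file type corresponding to the magic number in the file header in a first correspondence between file types and magic numbers;
a first judging unit, configured to determine whether the data of the to-be-identified file conforms to the data structure features of the file type;
a first determining unit, configured to determine, if the result of the first judging unit is conformance, that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header, and to determine, if the result is non-conformance, that the file type of the to-be-identified file is an abnormal type, wherein the abnormal type indicates that the to-be-identified file is a file whose type has been tampered with.
9. The apparatus according to claim 8, further comprising:
a second test unit, configured to test, if the first test unit cannot obtain the magic number of the to-be-identified file, whether the suffix of the to-be-identified file can be extracted from the data packet through protocol parsing;
a second search unit, configured to, if the second test unit can extract the suffix, look up the file type corresponding to the suffix of the to-be-identified file in a second correspondence between suffixes and file types;
a second judging unit, configured to determine whether the file type found in the second correspondence exists in the first correspondence, wherein the file types in the first correspondence are identifiable file types;
a second determining unit, configured to determine, if the result of the second judging unit is that the file type exists, that the file type of the to-be-identified file is an abnormal type;
a third determining unit, configured to determine, if the second test unit cannot extract the suffix or if the file type found in the second correspondence does not exist in the first correspondence, that the type of the to-be-identified file is an unrecognized file type.
10. The apparatus according to claim 8 or 9, wherein the first determining unit comprises:
a test subunit, configured to test, when the result of the first judging unit is conformance, whether the suffix of the to-be-identified file can be extracted from the data packet;
a search subunit, configured to, if the test subunit can extract the suffix of the to-be-identified file, look up the file type corresponding to the suffix in the stored second correspondence between suffixes and file types;
a comparing subunit, configured to compare the file type corresponding to the suffix, as found by the search subunit, with the file type corresponding to the magic number in the file header;
a determining subunit, configured to determine, if the comparison result is consistent, that the file type of the to-be-identified file is the file type corresponding to the magic number in the file header, and to determine, if the comparison result is inconsistent, that the file type of the to-be-identified file is an abnormal type.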

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP12860856.9A EP2733892A4 (en) 2011-12-24 2012-10-19 FILE TYPE IDENTIFICATION METHOD AND FILE TYPE IDENTIFICATION DEVICE
US14/198,326 US20140189879A1 (en) 2011-12-24 2014-03-05 Method for identifying file type and apparatus for identifying file type

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110439351.9 2011-12-24
CN2011104393519A CN102571767A (zh) 2011-12-24 2011-12-24 File type identification method and file type identification apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/198,326 Continuation US20140189879A1 (en) 2011-12-24 2014-03-05 Method for identifying file type and apparatus for identifying file type

Publications (1)

Publication Number Publication Date
WO2013091435A1 true WO2013091435A1 (zh) 2013-06-27

Family

ID=46416243

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/083169 WO2013091435A1 (zh) 2011-12-24 2012-10-19 File type identification method and file type identification apparatus

Country Status (4)

Country Link
US (1) US20140189879A1 (zh)
EP (1) EP2733892A4 (zh)
CN (1) CN102571767A (zh)
WO (1) WO2013091435A1 (zh)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770470A (zh) * 2008-12-31 2010-07-07 中国银联股份有限公司 File type identification and analysis method and system
CN102571767A (zh) * 2011-12-24 2012-07-11 成都市华为赛门铁克科技有限公司 File type identification method and file type identification apparatus

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090013408A1 (en) * 2007-07-06 2009-01-08 Messagelabs Limited Detection of exploits in files
GB0822619D0 (en) * 2008-12-11 2009-01-21 Scansafe Ltd Malware detection
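JP4993323B2 (ja) * 2010-04-12 2012-08-08 キヤノンマーケティングジャパン株式会社 Information processing apparatus, information processing method, and program
CN102143010A (zh) * 2010-08-24 2011-08-03 华为软件技术有限公司 Method for detecting that a message has been modified, sender device, and receiver device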

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAO, DING ET AL.: "Improved of Content-based File Type Identification Algorithm", COMPUTER ENGINEERING AND DESIGN, V, vol. 32, no. 12, 16 December 2011 (2011-12-16), pages 4246 - 4250, XP008171234 *
ZHANG: "Runfeng Recognizing and Matching of File Type based on Identifiers", COMPUTER SECURITY, 30 June 2011 (2011-06-30), pages 40 - 42, XP008171235 *

Also Published As

Publication number Publication date
EP2733892A4 (en) 2014-11-12
CN102571767A (zh) 2012-07-11
EP2733892A1 (en) 2014-05-21
US20140189879A1 (en) 2014-07-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12860856

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2012860856

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE