WO2012091488A1

WO2012091488A1 - System and method for detecting malicious content in non-pe file

Info

Publication number: WO2012091488A1
Application number: PCT/KR2011/010309
Authority: WO
Inventors: Sun Young Sim
Original assignee: AhnLab, Inc.
Priority date: 2010-12-31
Filing date: 2011-12-29
Publication date: 2012-07-05
Also published as: KR20120078030A; KR101228900B1

Abstract

There is provided a method for detecting whether malicious content is included in a non-PE (Portable Executable) file. The method includes extracting information from a portion within the non-PE file in which the malicious content can be inserted and determining whether the malicious content is included in the non-PE file on the basis of the extracted information.

Description

SYSTEM AND METHOD FOR DETECTING MALICIOUS CONTENT IN NON-PE FILE

The present invention relates to a system and method for detecting malicious content in a non-PE (Portable Executable) file, and more particularly to a system and method for determining whether a non-PE file includes malicious content using information about a portion within the non-PE file in which the malicious content can be inserted.

Malicious content included in files has been used to disturb or obstruct program execution, file utilization, computer operation and so on. Exploits of vulnerabilities, malware, computer viruses, and the like are examples of such malicious content. Since malicious content can cause undesired operations, technologies have been developed for detecting, deactivating and deleting malicious content before it is executed.

Meanwhile, PE (Portable Executable) refers to a file format that can be executed on any computer system capable of running Win32 executables, regardless of platform. In other words, PE files are programs executed on the computer system. Malicious content can also be included in such PE files. When a PE file including malicious content is executed on a computer system, the malicious content is executed along with it, so that the computer system is adversely affected. For this reason, there has been plenty of research on how to detect malicious content included in PE files, and a variety of detection technologies have been developed.

On the other hand, comparatively little research has been done on non-PE files. Recently, non-PE files such as documents, images and videos have been transmitted and distributed quite often owing to the development of networks. In this circumstance, if malicious content is included in such a non-PE file, it may be difficult to detect it without analyzing the configuration of the non-PE file. Moreover, objects including malicious content are often hidden, which makes detection even more difficult.

In addition, according to a statistical report by Symantec Corporation covering April to June 2010, attacks using PDF files (which are non-PE files) that include malicious content, i.e. attacks with malicious PDF files, are rapidly increasing, and in particular the proportion of attacks with malicious PDF files including Flash content is rising. Therefore, it is necessary to develop methods and systems capable of detecting malicious content in non-PE files, particularly in PDF files.

In view of the foregoing, the present invention provides a method and system capable of detecting malicious content included in a non-PE file on the basis of the configuration of the non-PE file, and of identifying the malicious content.

In accordance with one aspect of the present invention, there is provided a method for detecting whether malicious content is included in a non-PE file. The method includes extracting information from a portion within the non-PE file in which the malicious content can be inserted; and determining whether the malicious content is included in the non-PE file on the basis of the extracted information.

In accordance with another aspect of the present invention, there is provided a system for detecting whether malicious content is included in a non-PE file. The system includes an information extraction unit for extracting information from a portion within the non-PE file in which the malicious content can be inserted; and a determination unit for determining whether the malicious content is included in the non-PE file on the basis of the extracted information.

The above and other objects and features of the present invention will become apparent from the following description of embodiments given in conjunction with the accompanying drawings, in which:

Figs. 1a and 1b are views showing examples of the configuration of a non-PE file in which malicious contents can be included;

Fig. 2 is a view showing an example of a stream object within a PDF file;

Fig. 3 is a view showing an example of a stream object in which malicious content is included;

Fig. 4 is a flow chart illustrating a malicious content detection method in accordance with an embodiment of the present invention;

Fig. 5 is a flow chart illustrating a malicious content detection method in accordance with another embodiment of the present invention; and

Fig. 6 is a block diagram showing a malicious content detection system in accordance with an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that they can be readily implemented by those skilled in the art.

Figs. 1a and 1b are views showing examples of the configuration of a non-PE file in which malicious content can be inserted. Generally, the non-PE file includes a header 10 containing information about the file, such as the kind of the file, and a body 20 representing the content of the file. A variety of contents can be included in the body 20. Further, the contents may be included in the body 20 in the form of objects.

In some examples, the body 20 may include a flash object 22 and PE objects 24, as can be seen in Fig. 1a. Both the flash object 22 and the PE objects 24 may include malicious content. Alternatively, the body 20 may include a TTF object 26 representing font information of the file and a PE object 28, as can be seen in Fig. 1b. The TTF object 26 and the PE object 28 may also include malicious content. As such, whether malicious content is included in the non-PE file can be detected by inspecting each object within the non-PE file. However, inspecting every object may increase the computing load on the system, which reduces the efficiency of malicious content detection. Further, in some examples, an object includes encoded content. In this case, when the object including the encoded content is decoded for inspection, the memory resources of the system can be excessively consumed, which also reduces the efficiency of malicious content detection.

In view of the foregoing, the present disclosure provides methods and systems capable of efficiently detecting the malicious contents by selectively inspecting objects in which malicious contents can be inserted without using such a decoding process. As an example of the non-PE file, a PDF file will be exemplified in the description, but the scope of the present invention is not limited to this. In other words, the scope of the present invention includes every non-PE file to which the principle of the present invention is applied.

Fig. 2 shows an example of a stream object included in a PDF file. The PDF file stores information as a stream object, which is a sequence of bytes. The stream object within the PDF file includes a label 110 identifying the object, and a keyword "stream" 130 indicating that the object is a stream object. The keyword "stream" 130 indicates the start of a stream and forms a pair with a keyword "endstream" 140 indicating the end of the stream. A sequence of bytes 150 is arranged between the keyword "stream" 130 and the keyword "endstream" 140. Further, the stream object includes a dictionary 120 representing encoding information, size information, content information of the object and so on. The start and end of the dictionary 120 are indicated by the double angle brackets "<<" and ">>". Further, a keyword "endobj" 160 indicating the end of the object is placed at the end of the object.

In general, the term "stream object" refers to the portion delimited by the keywords "stream" 130 and "endstream" 140. However, in the present disclosure, unless otherwise noted, the term "stream object" refers to the entire object, which includes not only the portion delimited by the keywords "stream" 130 and "endstream" 140, but also the dictionary 120. Further, to distinguish it from the term "stream object", the term "object" refers to the portion delimited by the keywords "obj" and "endobj".
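The layout described for Fig. 2 can be sketched in Python as follows. This is a minimal illustration only: the sample bytes, the regular expression and the name `split_stream_object` are assumptions made for this sketch and are not taken from the figure, and a real PDF parser has to handle many cases (filters, indirect lengths, comments) that are ignored here.

```python
import re

# Hypothetical stream object laid out as described for Fig. 2:
# label, dictionary, "stream" ... "endstream", then "endobj".
SAMPLE_OBJECT = (
    b"12 0 obj\n"
    b"<< /Length 11 /DL 4096 >>\n"
    b"stream\n"
    b"hello world\n"
    b"endstream\n"
    b"endobj\n"
)

OBJECT_RE = re.compile(
    rb"(?P<label>\d+\s+\d+)\s+obj\s*"              # label (110)
    rb"(?P<dictionary><<.*?>>)\s*"                 # dictionary (120)
    rb"stream\r?\n(?P<data>.*?)\r?\nendstream\s*"  # keywords 130/140 around bytes 150
    rb"endobj",                                    # keyword 160
    re.DOTALL,
)


def split_stream_object(raw: bytes):
    """Split one stream object into label, dictionary and raw stream bytes.

    Returns None when the input does not look like a stream object.
    The stream bytes are returned as-is; nothing is decoded.
    """
    match = OBJECT_RE.search(raw)
    if match is None:
        return None
    return {
        "label": match.group("label").decode("ascii"),
        "dictionary": match.group("dictionary"),
        "data": match.group("data"),
    }
```

Calling `split_stream_object(SAMPLE_OBJECT)` yields the three parts separately, so that later checks can look at the dictionary alone without touching the stream bytes.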

Since such a stream object may include a long sequence of bytes, malicious contents can be inserted in the stream object. Therefore, some embodiments of the present invention suggest methods and systems capable of determining whether malicious content is included in a PDF file on the basis of information which is extracted from the stream object within the PDF file.

Meanwhile, the dictionary 120 of the stream object may include a variety of information which may be identified by respective entries.

By way of example, but not limitation, the dictionary 120 includes a DL item. The DL item is identified by a keyword "/DL". Further, as can be seen in Fig. 2, the keyword "/DL" is followed by a number representing the original length of the included stream in bytes (i.e., the length of the stream before encoding). When a long stream is included in the PDF file, and a large number therefore follows the keyword, a system processing the PDF file can reserve enough memory resources in advance on the basis of the DL item. As such, if the DL item is included in the dictionary 120, it may mean that the PDF file includes a long stream and therefore has a high possibility of including malicious content. Therefore, the possibility that malicious content exists can be determined according to whether the DL item is included in the dictionary 120.

Further, the dictionary 120 may include an EF item and a keyword "EmbeddedFiles", each of which indicates that another file is embedded in the PDF file. The EF item is identified by a keyword "/EF". If another file is embedded in the PDF file, the name of that file follows the keyword "/EF". As such, the EF item included in the object (i.e., in the dictionary 120) indicates that another file is embedded in the PDF file. In this case, it can be determined on the basis of the existence of the EF item that the possibility that malicious content exists is high.

Meanwhile, the keyword "EmbeddedFiles" is one of the keywords which may be included in a Type item. The Type item represents the kind of an object and is identified by a keyword "/Type". If the object is a file, the keyword "EmbeddedFiles" follows the keyword "/Type". As such, when the keyword "EmbeddedFiles" is included, it may mean that the possibility that malicious content exists is high. Alternatively, an EmbeddedFiles item identified by the keyword "/EmbeddedFiles" can be included in the dictionary 120, and an identifier representing another file can follow the EmbeddedFiles item. In this case, it can be determined on the basis of the existence of the EmbeddedFiles item that the possibility that malicious content exists is high.

Furthermore, the dictionary 120 may include a Params item representing information about another file (i.e., an embedded file) when the embedded file is included in the PDF file. The Params item is identified by a keyword "/Params", and detailed information may follow it. In some embodiments, the embedded file may be included in the object in uncompressed form. In this case, the embedded file can be included in the object as a stream, and a Checksum item, which includes a checksum value of the stream, can follow the keyword "/Params". The previously calculated checksum value included in the Checksum item can be used for detecting the malicious content. As such, the computing load caused by calculating the checksum value can be avoided by using the previously calculated checksum value.

Further, a Size item following the keyword "/Params" represents the size of the included stream. Besides the Size item, a CreationDate item and/or a ModDate item can follow the keyword "/Params". The CreationDate item represents the date and time when the embedded file was created, and the ModDate item represents the date and time when the embedded file was modified. On the basis of the information in the Params item, the substance of the embedded file can be identified, and thus the malicious content can be detected.

When the embedded file is included in the PDF file, a Subtype item representing the kind of the embedded file can also be included in the dictionary 120. The Subtype item is identified by a keyword "/Subtype". Here, the kind of the embedded file can be identified from the file kind that follows the keyword "/Subtype", and the identified kind of the embedded file can be used to determine whether the PDF file includes the malicious content.
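As a rough illustration of the DL, EF and EmbeddedFiles indicators described so far, the following Python sketch scans the raw bytes of a dictionary and lists which indicators are present. The function name `embedding_indicators`, the regular expressions and the wording of the returned reasons are assumptions made for this sketch; the text does not prescribe any particular matching technique.

```python
import re


def embedding_indicators(dictionary: bytes) -> list:
    """List the indicators (DL, EF, EmbeddedFiles) found in a dictionary.

    The dictionary is inspected as raw bytes; the stream itself is never
    decoded. An empty list means none of the three indicators was found.
    """
    reasons = []

    # "/DL" followed by a number: declared length of the stream before encoding.
    dl = re.search(rb"/DL\s+(\d+)", dictionary)
    if dl:
        reasons.append("DL item present, declared length %d bytes" % int(dl.group(1)))

    # "/EF": another file is embedded in the PDF file.
    if re.search(rb"/EF\b", dictionary):
        reasons.append("EF item present")

    # "EmbeddedFiles" either as the value of /Type or as an item of its own.
    if re.search(rb"/EmbeddedFiles\b", dictionary):
        reasons.append("EmbeddedFiles keyword present")

    return reasons
```

For example, `embedding_indicators(b"<< /Type /EmbeddedFiles /DL 52000 /EF << /F 7 0 R >> >>")` reports all three indicators, while a dictionary containing only `/Length` reports none.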
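The Params item and its sub-items can be read out along the following lines. This is a sketch under stated assumptions: the sample dictionary is invented, the checksum is assumed to be written as a hexadecimal string in angle brackets and the dates as parenthesised strings, and the names `parse_params` and `_field` are hypothetical. The keyword is matched case-insensitively because the exact capitalisation of the Checksum keyword in a given file is not known here.

```python
import re

# Hypothetical dictionary containing a flat Params sub-dictionary.
SAMPLE_DICTIONARY = (
    b"<< /Type /EmbeddedFiles /Params << /Size 1234 "
    b"/CheckSum <0123456789abcdef0123456789abcdef> "
    b"/ModDate (D:20100402120000) >> >>"
)

PARAMS_RE = re.compile(rb"/Params\s*<<(?P<body>.*?)>>", re.DOTALL)


def _field(body: bytes, pattern: bytes):
    """Return the first captured group as text, or None if absent."""
    found = re.search(pattern, body, re.IGNORECASE)
    return found.group(1).decode("latin-1") if found else None


def parse_params(dictionary: bytes):
    """Extract Checksum, Size, CreationDate and ModDate from a Params item.

    Returns None when the dictionary has no Params item. Sub-items that are
    missing are reported as None rather than omitted.
    """
    match = PARAMS_RE.search(dictionary)
    if match is None:
        return None
    body = match.group("body")
    size = _field(body, rb"/Size\s+(\d+)")
    return {
        "Checksum": _field(body, rb"/Checksum\s*<([0-9a-f]+)>"),
        "Size": int(size) if size is not None else None,
        "CreationDate": _field(body, rb"/CreationDate\s*\(([^)]*)\)"),
        "ModDate": _field(body, rb"/ModDate\s*\(([^)]*)\)"),
    }
```

Because the checksum is read directly from the dictionary, nothing has to be recomputed over the stream bytes, which is the saving the description points out.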

Fig. 3 is a view showing an example of a stream object within a non-PE file which includes malicious content. As shown in Fig. 3, malicious content is included in the non-PE file in the form of a stream 250, and the stream 250 is identified by a keyword "stream". A checksum value 270 is included in a Checksum item. Here, the Checksum item is included in a Params item which is identified by a keyword "/Params". Further, a Subtype item 280 is included in the stream 250. Thus, by considering the configuration of the stream 250, it can be determined that flash content is included in the non-PE file as an embedded file.

In this way, the above-mentioned items can be used to detect malicious content. As such, the existence of malicious content can be easily detected.

Furthermore, an object ID (110 in Fig. 2) for identifying the object can be used to detect malicious content. If the same malicious content is inserted into several files, the malicious content may be inserted in the same objects within the several files. In accordance therewith, the malicious content can be detected by comparing an object ID or a characteristic of the object within a target file with an object ID or a characteristic previously derived from a file which had been determined to have malicious content.
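A very small Python sketch of this comparison is given below. The signature store, its single entry and the function name `seen_in_malicious_file` are invented for illustration, and the sketch requires both the object ID and the characteristic to match, which is only one possible reading of "an object ID or a characteristic".

```python
# Hypothetical signature store: (object ID, characteristic) pairs derived
# from files that were previously determined to contain malicious content.
KNOWN_MALICIOUS_OBJECTS = {
    ("12 0", "application/x-shockwave-flash"),
}


def seen_in_malicious_file(object_id, characteristic, known=KNOWN_MALICIOUS_OBJECTS):
    """Return True when the same object ID and characteristic were seen before."""
    return (object_id, characteristic) in known
```

Matching on the pair rather than on the object ID alone is a design choice of this sketch, made because object IDs by themselves repeat across unrelated files.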

A malicious content detection method in accordance with an embodiment of the present invention will now be described with reference to Fig. 4.

First, at a step 300, information about predetermined items may be extracted from a portion within a non-PE file into which malicious content can be inserted. In some embodiments, information about the items can be extracted from a dictionary of an object within the PDF file. Alternatively, a stream object within the PDF file is identified and then information included in a dictionary of the identified stream object can be extracted. Further, the items for which information is extracted include at least one of a DL item, an EF item, a Type item, a Params item and a Subtype item which are included in an object within the non-PE file. In some embodiments, information about all the items may be extracted. In this case, if some of the items are not included in the object, their non-existence can be explicitly indicated in the extracted information. As such, all the items of the object can be considered in the detection of malicious content. Therefore, the number of items used for detecting the existence of the malicious content may increase, thereby enhancing the accuracy of the determination.

Subsequently, the existence of items may be determined at a step 310. In some embodiments, by inspecting the extracted information, the existence of at least one of the DL item, the EF item, the Params item and the Subtype item can be determined. By way of example, but not limitation, if the extracted information about the DL item exists, it can be determined that a large stream is included in the non-PE file, and thus that the possibility of including malicious content is high. Further, if the extracted information about the EF item exists, it can be determined that an embedded file exists in the non-PE file, and thus that the possibility of including malicious content is high. Furthermore, if the extracted information about the Params item or the Subtype item exists, it can be determined that an embedded file exists in the non-PE file.

Thereafter, values of the existing items may be compared to those relating to malicious content at a step 320. In some embodiments, the checksum value of a Checksum item within the existing Params item can be compared to a checksum value relating to malicious content. The checksum value of the malicious content can be calculated in advance and stored in a database. As such, the malicious content detection method in accordance with at least some embodiments described herein can use the checksum value received from the database. If the checksum value included in the object is not the same as that of the malicious content, a similarity between the two checksum values can be calculated and compared to a reference similarity. When the calculated similarity is higher than the reference similarity, the non-PE file can be determined to include the malicious content. The similarity comparison can be performed using a well-known method. In this way, variants of malicious content can also be detected.
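The extraction at step 300, including the explicit marking of items that are not present, might look like the following Python sketch. The item list follows the text, but the function name `extract_items`, the use of `None` as the non-existence marker, the regular expression and the sample dictionary are assumptions for illustration; only flat (non-nested) sub-dictionaries are handled.

```python
import re

PREDETERMINED_ITEMS = ("DL", "EF", "Type", "Params", "Subtype")

# A value is taken to be one of: a flat sub-dictionary, a name, a number,
# or a parenthesised string (assumption made for this sketch).
_VALUE = rb"\s*(<<.*?>>|/[^\s/<>\[\]()]+|\d+|\([^)]*\))"


def extract_items(dictionary: bytes, names=PREDETERMINED_ITEMS) -> dict:
    """Return one entry per predetermined item.

    Items that are not in the dictionary are kept in the result with the
    value None, so that their non-existence is recorded explicitly.
    """
    record = {}
    for name in names:
        pattern = rb"/" + re.escape(name.encode("ascii")) + rb"(?![A-Za-z0-9])" + _VALUE
        match = re.search(pattern, dictionary, re.DOTALL)
        record[name] = match.group(1).decode("latin-1") if match else None
    return record


# Hypothetical dictionary of an embedded flash file.
SAMPLE = (
    b"<< /Type /EmbeddedFiles /Subtype /application#2Fx-shockwave-flash "
    b"/DL 52000 /Params << /Size 52000 >> >>"
)
```

With this sample, `extract_items(SAMPLE)` contains values for DL, Type, Params and Subtype, and `None` for EF, so the later steps always see the same set of keys.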
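The checksum comparison at step 320 can be sketched as below. The text does not name the similarity method, so this sketch substitutes `difflib.SequenceMatcher` from the Python standard library as a stand-in; the threshold `REFERENCE_SIMILARITY` and the function name `checksum_matches` are likewise assumptions. Note that a character-level similarity is only meaningful if similar inputs produce similar checksum strings, which is not the case for cryptographic hashes.

```python
from difflib import SequenceMatcher

REFERENCE_SIMILARITY = 0.9  # hypothetical reference similarity


def checksum_matches(candidate, known_checksums, reference=REFERENCE_SIMILARITY):
    """Compare a checksum to known malicious checksums.

    Returns True on an exact match, or when the similarity to any known
    checksum is higher than the reference similarity.
    """
    candidate = candidate.lower()
    for known in known_checksums:
        known = known.lower()
        if candidate == known:
            return True
        # Stand-in similarity measure; ratio() is in the range 0.0 to 1.0.
        if SequenceMatcher(None, candidate, known).ratio() > reference:
            return True
    return False
```

In practice the list of known checksums would come from the database mentioned above; here it is simply passed in as an argument.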

Alternatively, the Subtype item can be compared to types of well-known malicious contents. By way of example, if specific malicious content is well known as a flash type and the Subtype item indicates a flash type, it can be determined that the non-PE file includes the malicious content.

In another different manner, the value of a Type item can be used to determine whether "EmbeddedFiles" exists. As such, it can be determined whether the non-PE file includes an embedded file on the basis of the value of the Type item.

Finally, at a step 330, whether the non-PE file includes malicious content can be determined based on the result of the step 310 regarding the existence of items and the result of the step 320 regarding the values of the items. In this manner, both the existence of a variety of items and the values of those items are used for determining whether malicious content is included. As such, the accuracy of malicious content detection can be increased. Also, whether the non-PE file includes malicious content can be determined easily, without decoding the stream.

Meanwhile, the step 320 of comparing the item values to those relating to malicious content needs to be performed only when the result of the step 310 indicates that the predetermined items exist. To take advantage of this, a malicious content detection method in accordance with another embodiment of the present invention is proposed.

The malicious content detection method in accordance with another embodiment of the present invention will now be described with reference to Fig. 5.

As shown in Fig. 5, in the malicious content detection method, information about predetermined items may be extracted from a non-PE file at a step 400. The existence of the predetermined items may be determined by inspecting the extracted information at a step 410. Then, whether at least one of a DL item, an EF item, a Params item and a Subtype item exists within an object of the non-PE file may be determined at a step 412. If the result of the step 412 indicates that none of the above-mentioned items exists, a step 414, instead of a step 420, is performed to determine that no malicious content is included in the non-PE file. This is because neither a long stream nor an embedded file, either of which could indicate the presence of malicious content, exists.

On the other hand, when the result of the step 412 indicates that at least one of the above-mentioned items exists, i.e., that a long stream or an embedded file exists, the values of the existing items are compared to those relating to malicious content at a step 422. By way of example, but not limitation, at the step 422, the checksum value of a Checksum item within the existing Params item is compared to a checksum value relating to malicious content so as to determine whether the two checksum values are similar to each other. If the two checksum values are not similar to each other, the step 414 may be performed to determine that no malicious content is included in the non-PE file.

Conversely, when the two checksum values are similar to each other, a step 424 is performed for comparing the type of an embedded file included in the non-PE file to each type of malicious content. By way of example, but not limitation, the value of the Subtype item is compared to the types of well-known malicious content. If the Subtype value corresponds to one of the types of well-known malicious content, the non-PE file is determined to include malicious content at a step 430. If, on the other hand, the Subtype value does not correspond to any of the types of well-known malicious content, it may mean that the non-PE file includes new malicious content, that an error occurred in the comparison of the checksum values, or the like. Therefore, a step 426 is performed to notify a user or an external device so that an additional procedure can be carried out, e.g. analyzing the configuration of the new malicious content.

The malicious content detection method of Fig. 5 can reduce the number of steps that are actually executed, thereby providing a higher efficiency than that of Fig. 4. However, the method of Fig. 5 is provided only as an example, and the scope of the present invention is not limited to it. In other words, it will be readily understood that the aspects of the present invention, as generally described herein, can be modified or altered by combining, arranging, substituting, separating and designing in a wide variety of different configurations. For example, the comparison of the Subtype values can be performed before, or in parallel with, the comparison of the checksum values.
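The branching of steps 412 to 430 can be summarised in a short Python sketch. The `Verdict` enumeration, the function name `detect`, the shape of the `items` mapping (absent items are `None`, Params is a mapping with a `"Checksum"` key) and the default equality-based similarity test are all assumptions of this sketch. In particular, treating a missing checksum as "not similar" is a choice made here, since the text gives the checksum comparison only as an example.

```python
import operator
from enum import Enum


class Verdict(Enum):
    CLEAN = "clean"                    # step 414
    MALICIOUS = "malicious"            # step 430
    NEEDS_ANALYSIS = "needs_analysis"  # step 426


def detect(items, known_checksums, known_types, is_similar=operator.eq):
    """Walk through the decision flow described for Fig. 5.

    items: mapping of item name to extracted value, None when absent.
    is_similar: callable deciding whether two checksums are similar.
    """
    # Step 412: does at least one of the predetermined items exist?
    if all(items.get(name) is None for name in ("DL", "EF", "Params", "Subtype")):
        return Verdict.CLEAN

    # Step 422: compare the checksum inside Params to known checksums.
    checksum = (items.get("Params") or {}).get("Checksum")
    if checksum is None or not any(is_similar(checksum, k) for k in known_checksums):
        return Verdict.CLEAN

    # Step 424: compare the embedded file type to known malicious types.
    if items.get("Subtype") in known_types:
        return Verdict.MALICIOUS

    # Step 426: similar checksum but unknown type, hand over for analysis.
    return Verdict.NEEDS_ANALYSIS
```

Because each early return skips the remaining comparisons, the sketch reflects the efficiency argument made for Fig. 5: the more expensive comparisons run only for objects that already show one of the indicators.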

Fig. 6 is a block diagram showing a malicious content detection system in accordance with an embodiment of the present invention. The malicious content detection system includes an information extraction unit 510 and a determination unit 520. The determination unit 520 may include an existence determinator 522 and a comparator 524.

The information extraction unit 510 may extract information about at least one of predetermined items, such as a DL item, a Params item, an EF item, an EmbeddedFiles item and a Subtype item, from a portion within a non-PE file in which malicious content can be inserted. The information extracted by the information extraction unit 510 may be transmitted to the determination unit 520.

The determination unit 520 may inspect the extracted information in order not only to determine the existence of the predetermined items but also to identify their values. By way of example, the existence determinator 522 can determine, by inspecting the extracted information, the existence of each of the DL item, the Params item, the EF item and the Subtype item which are included in an object within the non-PE file. Further, on the basis of the determined existence of each of these items, the existence determinator 522 can determine whether a stream or a file which may be malicious content exists in the object within the non-PE file. Further, the comparator 524 can determine whether the malicious content is included in the non-PE file by comparing the values of the Checksum item and the Subtype item to those relating to malicious content. As such, the determination unit 520 can determine whether the malicious content is included in the non-PE file on the basis of the results from the existence determinator 522 and the comparator 524.
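One way to picture the units of Fig. 6 is as small Python classes, as in the sketch below. The class and method names are hypothetical, the extracted information is assumed to arrive from the information extraction unit 510 as a plain mapping with `None` for absent items, and combining the two results with a logical AND is an assumption, since the text says only that both results are used.

```python
from dataclasses import dataclass


@dataclass
class ExistenceDeterminator:  # corresponds to 522
    names: tuple = ("DL", "Params", "EF", "Subtype")

    def existing(self, info: dict) -> list:
        """Return the names of the predetermined items that are present."""
        return [name for name in self.names if info.get(name) is not None]


@dataclass
class Comparator:  # corresponds to 524
    known_checksums: frozenset = frozenset()
    known_types: frozenset = frozenset()

    def matches(self, info: dict) -> bool:
        """Compare the Checksum and Subtype values to known malicious ones."""
        return (
            info.get("Checksum") in self.known_checksums
            and info.get("Subtype") in self.known_types
        )


@dataclass
class DeterminationUnit:  # corresponds to 520
    existence: ExistenceDeterminator
    comparator: Comparator

    def is_malicious(self, info: dict) -> bool:
        """Combine both results (logical AND is an assumption of this sketch)."""
        return bool(self.existence.existing(info)) and self.comparator.matches(info)
```

A determination unit would be built once, for example `DeterminationUnit(ExistenceDeterminator(), Comparator(frozenset({"abc"}), frozenset({"flash"})))`, and then applied to the information extracted from each object.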

The determination unit 520 can use a communication unit 530 in order to obtain information about kinds of malicious contents and checksum values for the malicious contents. Further, the communication unit 530 can be used for transmitting the determination results for the existence of malicious content to a user.

As described above, the malicious content detection method and system in accordance with embodiments of the present invention can determine whether the non-PE file includes malicious content on the basis of information from a portion within the non-PE file in which the malicious content can be inserted. As such, the malicious content can be accurately and efficiently detected. Particularly, since configuration of the non-PE file is considered for detecting the malicious content, attacks with non-PE files can be efficiently prevented. Moreover, the substance of malicious content can be easily identified because a variety of information included in the non-PE file is used for detecting the malicious content.

While the invention has been shown and described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims.

Claims

1. A method for detecting whether malicious content is included in a non-PE (Portable Executable) file, the method comprising:

extracting information from a portion within the non-PE file in which the malicious content can be inserted; and

determining whether the malicious content is included in the non-PE file on the basis of the extracted information.

2. The method of claim 1, wherein the determining includes receiving information about malicious content from a database and comparing the extracted information to the received information.

3. The method of claim 1, wherein the non-PE file is a PDF (Portable Document Format) file, and

wherein the portion corresponds to a stream object within the PDF file.

4. The method of claim 1, wherein the extracted information includes information about at least one of an object ID, a DL item, a Params item, an EF item, a Type item, a SubType item and an EmbeddedFiles item, which are included in an object within the non-PE file.

5. The method of claim 1, wherein the extracted information includes a Checksum value within a Params item which is included in an object within the non-PE file.

6. The method of claim 1, wherein the extracting extracts information about at least two items, and

wherein the determining determines on the basis of the extracted information about at least two items.

7. The method of claim 1, wherein the extracting includes extracting information about a predetermined item from a portion within the non-PE file and indicating, if information about the predetermined item is not included in the non-PE file, the non-existence of the information about the predetermined item.

8. The method of claim 1, further comprises using the extracted information to obtain information about the malicious content when the determining determines the malicious content to be included in the non-PE file.

9. A system for detecting whether malicious content is included in a non-PE (Portable Executable) file, the system comprising:

an information extraction unit for extracting information from a portion within the non-PE file in which the malicious content can be inserted; and

a determination unit for determining whether the malicious content is included in the non-PE file on the basis of the extracted information.

10. The system of claim 9, wherein the determination unit includes a comparator for comparing the extracted information to information about malicious content which is received from a database.

11. The system of claim 9, wherein the non-PE file is a PDF (Portable Document Format) file, and

wherein the portion corresponds to a stream object within the PDF file.

12. The system of claim 9, wherein the extracted information includes information about at least one of an object ID, a DL item, a Params item, an EF item, a Type item, a SubType item and an EmbeddedFiles item, which are included in an object within the non-PE file.

13. The system of claim 9, wherein the extracted information includes a Checksum value within a Params item which is included in an object within the non-PE file.

14. The system of claim 9, wherein the information extraction unit extracts information about at least two items, and

wherein the determination unit determines on the basis of the extracted information about at least two items.

15. The system of claim 9, wherein the information extraction unit extracts information about a predetermined item from a portion within the non-PE file and indicates, if information about the predetermined item is not included in the non-PE file, the non-existence of the information about the predetermined item.

16. The system of claim 9, further comprises a unit for using the extracted information to obtain information about the malicious content when the determination unit determines the malicious content to be included in the non-PE file.

17. A computer-readable storage medium storing therein a program which includes computer-executable instructions causing a processor to execute the method of claim 1.