JP5982597B1 - Information processing apparatus, information processing method, program, and computer-readable recording medium recording the program - Google Patents

Information processing apparatus, information processing method, program, and computer-readable recording medium recording the program

Info

Publication number
JP5982597B1
JP5982597B1 (application number JP2016046762A)
Authority
JP
Japan
Prior art keywords
predetermined
file
malware
point
icon image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2016046762A
Other languages
Japanese (ja)
Other versions
JP2017162244A (en)
Inventor
友輔 岡野
Original Assignee
株式会社Ffri
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社Ffri
Priority to JP2016046762A priority Critical patent/JP5982597B1/en
Application granted granted Critical
Publication of JP5982597B1 publication Critical patent/JP5982597B1/en
Publication of JP2017162244A publication Critical patent/JP2017162244A/en
Application status: Active


Abstract

An object is to make it possible to easily determine a fake icon as malware even when the icon image has changed only slightly. An information processing apparatus according to an embodiment of the present invention may include a feature information extraction unit that extracts the binary of an icon image from a resource of a predetermined file, a feature vector generation unit that generates a feature vector from the extracted icon image binary, and a determination unit that determines, by machine learning using the feature vector, whether the predetermined file is malware. A program according to an embodiment of the present invention may cause a computer to extract the binary of an icon image from a resource of a predetermined file, generate a feature vector from the extracted icon image binary, and determine, by machine learning using the feature vector, whether the predetermined file is malware. [Selected drawing] FIG. 1

Description

  The present invention relates to an information processing apparatus, an information processing method, a program, and a computer-readable recording medium on which the program is recorded. The present invention particularly relates to an information processing apparatus that detects malware, an information processing method, a program, and a computer-readable recording medium that records the program.

  In recent years, there has been a problem in which malicious software or malicious code, created with the intention of performing illegal and harmful operations, impersonates the icon of other software or of a file that looks harmless at first glance, so that users run it without realizing it. Hereinafter, such malicious software or malicious code created with the intention of performing illegal and harmful operations is referred to as malware (malicious software).

  A computer infected with malware typically performs illegal or harmful actions against other computers connected to the network; for example, it is used as a tool for malicious activities such as denial-of-service attacks by mass sending of junk mail or by massive illegal access to servers. The threat of malware is not limited to attacks against the outside; it also includes operations that extract personal information such as credit card numbers and address books from the infected computer and send them to an external computer. To prevent such damage caused by malware, techniques are required for detecting the malware body itself or the communication that transmits and receives it.

  Among such malware, there is malware designed to make the user misidentify it as a document file, an image file, a folder, or the like by using an icon associated with document files such as Word documents, or an icon that closely resembles such an icon, thereby inducing the user to execute it. Such icons are hereinafter called "fake icons" (camouflaged icons).

  Conventionally, as a method of detecting malware having a camouflaged icon, there is a method, shown in FIG. 12, in which a hash value of the icon image is extracted from the resource of an executable file and matched against a previously held list of hash values of camouflaged icons; if they match, the file is determined to hold a camouflaged icon. With this method, however, although camouflaged icons held in advance can be handled, even a slight change in the icon image produces a different hash value, so the icon can no longer be determined to be a fake icon. The icon images in FIG. 13 differ only in the upper-left pixel, yet their hash values differ.
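
For reference, the conventional flow of FIG. 12 amounts to exact matching of a hash of the icon binary against a pre-collected list, as in the following minimal Python sketch (the SHA-256 choice and all names are illustrative assumptions, not taken from the prior art documents):

```python
# Illustrative sketch of the conventional hash-matching approach (FIG. 12).
# The hash algorithm (SHA-256) and the names are assumptions for illustration only.
import hashlib

def is_known_fake_icon(icon_bytes: bytes, known_fake_hashes: set) -> bool:
    digest = hashlib.sha256(icon_bytes).hexdigest()
    # Exact match only: an icon changed by even one pixel yields a different
    # digest and is therefore missed, which is the problem described above.
    return digest in known_fake_hashes
```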

  On the other hand, in the technique disclosed in Japanese Patent Application Laid-Open No. 2004-228561, the similarity between the feature information of the icon image of a target file subject to risk determination and the feature information of a basic icon image, which is a well-known icon image of a non-executable file such as a document file, is calculated, and the risk of the target file is determined based on the calculated similarity.

JP2015-191458A

  However, the technique disclosed in Patent Document 1 merely calculates the similarity with the feature information of a basic icon image, so even a file that is originally normal may be erroneously determined to have a fake icon.

  The present invention is intended to solve the above problems of the prior art, and an object thereof is to make it possible to easily determine a fake icon as malware even when the icon image has changed only slightly.

  Another object of the present invention is to suppress false detections in which a file that is actually normal is erroneously determined to have a fake icon.

  According to an embodiment of the present invention, there is provided an information processing apparatus comprising: a feature information extraction unit that extracts the binary of an icon image from a resource of a predetermined file; a feature vector generation unit that generates a feature vector from the extracted icon image binary; and a determination unit that determines, by machine learning using the feature vector, whether the predetermined file is malware.

  The feature information extraction unit may further extract a numerical value from the icon image, and the information processing apparatus may further comprise a numerical value storage unit that stores numerical values obtained by digitizing camouflaged icon images, and a general determination unit that determines whether the predetermined file is malware based on the numerical value extracted by the feature information extraction unit and the numerical values stored in the numerical value storage unit.

  The information processing apparatus may further comprise: an initial point setting unit that sets a predetermined malware initial point when at least one of the determination unit and the general determination unit determines that the predetermined file is malware; a point threshold storage unit that stores a point threshold serving as an index for determining whether the predetermined file is malware; a point addition/subtraction unit that adds or subtracts a predetermined point to or from the malware initial point when the predetermined file satisfies a predetermined condition; and a threshold determination unit that determines whether the point calculated by the addition or subtraction exceeds the point threshold stored in the point threshold storage unit and, when the point threshold is exceeded, determines that the predetermined file is malware.

  The information processing apparatus may further comprise a digitizing unit that digitizes the icon image of a predetermined file determined to be malware, and the numerical value storage unit may store the numerical value digitized by the digitizing unit.

  According to an embodiment of the present invention, there is provided an information processing method in which a computer extracts the binary of an icon image from a resource of a predetermined file, generates a feature vector from the extracted icon image binary, and determines, by machine learning using the feature vector, whether the predetermined file is malware.

  According to an embodiment of the present invention, there is provided a program that causes a computer to extract the binary of an icon image from a resource of a predetermined file, generate a feature vector from the extracted icon image binary, and determine, by machine learning using the feature vector, whether the predetermined file is malware.

  In extracting the binary, the computer may be caused to perform the extraction at a predetermined interval.

  The computer may further be caused to extract a numerical value from the icon image, store numerical values obtained by digitizing fake icon images, and determine whether the predetermined file is malware based on the extracted numerical value and the stored numerical values.

  The computer may further be caused to: set a predetermined malware initial point when at least one of the determination by machine learning using the feature vector and the determination based on the extracted numerical value and the stored numerical values determines that the predetermined file is malware; store a point threshold serving as an index for determining whether the predetermined file is malware; add or subtract a predetermined point to or from the malware initial point when the predetermined file satisfies a predetermined condition; determine whether the point calculated by the addition or subtraction exceeds the stored point threshold; and, when the point threshold is exceeded, determine that the predetermined file is malware.

  The computer may also be caused to digitize the icon image of a predetermined file determined to be malware and to store the digitized numerical value.

  The computer may be caused to determine whether the icon image of the predetermined file is a regular icon image and, when it is determined that the icon image of the predetermined file is not a regular icon image, to add the predetermined point to the malware initial point.

  The computer may be caused to extract the number of icons held in the resource of the predetermined file and, when the number of extracted icons is less than a predetermined number, to add the predetermined point to the malware initial point.

  The computer may be caused to extract version information from the predetermined file and, when the icon image of the predetermined file does not correspond to the version information, to add the predetermined point to the malware initial point.

  The computer may be caused to extract programming language information from the predetermined file and, when the extracted programming language information is not the programming language information corresponding to the icon image held in the predetermined file, to add the predetermined point to the malware initial point.

  The computer may be caused to extract compiler information from the predetermined file and, when the extracted compiler information is not the compiler information corresponding to the icon image held in the predetermined file, to add the predetermined point to the malware initial point.

  The computer may be caused to extract information about a packer from the predetermined file and, when information about a packer is extracted from the predetermined file, to add the predetermined point to the malware initial point.

  The computer may be caused to extract self-extracting archive information from the predetermined file and, when self-extracting archive information is extracted from the predetermined file, to add the predetermined point to the malware initial point.

  The computer may be caused to extract a file name from the predetermined file and, when the number of characters of the extracted file name exceeds a predetermined number of characters, to add the predetermined point to the malware initial point.

  The computer may be caused to extract a file name from the predetermined file and, when a Unicode control character is included in the extracted file name, to add the predetermined point to the malware initial point.

  The computer may be caused to extract a file name from the predetermined file and, when the extracted file name includes a plurality of extensions, to add the predetermined point to the malware initial point.

  The computer may be caused to extract a file name from the predetermined file and, when a 2-byte character is included in the extracted file name, to add the predetermined point to the malware initial point.

  According to an embodiment of the present invention, a computer-readable recording medium that records the program may be provided.

  According to the present invention, a camouflaged icon can easily be determined to be malware even when the icon image has changed only slightly.

FIG. 1 is a conceptual diagram of an information processing apparatus according to one embodiment of the present invention.
FIG. 2 is a diagram for explaining how the information processing apparatus according to one embodiment of the present invention generates a feature vector.
FIG. 3 is a diagram for explaining the flow of generating the determination unit (model) of the information processing apparatus according to one embodiment of the present invention.
FIG. 4 is a conceptual diagram for explaining porting of the model generated in FIG. 3.
FIG. 5 is a diagram for explaining the flow in which the information processing apparatus according to one embodiment of the present invention determines malware.
FIG. 6 is a conceptual diagram of an information processing apparatus according to another embodiment of the present invention.
FIG. 7 is a diagram for explaining the flow in which an information processing apparatus according to another embodiment of the present invention determines malware.
FIG. 8 is a diagram for explaining the flow in which an information processing apparatus according to another embodiment of the present invention determines malware.
FIG. 9 is a conceptual diagram of an information processing apparatus according to another embodiment of the present invention.
FIG. 10 is a diagram for explaining the flow in which an information processing apparatus according to another embodiment of the present invention determines malware.
FIG. 11 is a conceptual diagram of an information processing apparatus according to another embodiment of the present invention.
FIG. 12 is a diagram for explaining the flow of detecting malware having a camouflaged icon according to the prior art.
FIG. 13 is a diagram showing icon images and their corresponding hash values for understanding the problem of the prior art.

  Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The following embodiments are examples of the present invention, and the present invention is not limited to them. In the drawings referred to in the embodiments, the same part or a part having a similar function is denoted by the same reference symbol or a similar reference symbol (a reference symbol with A, B, or the like appended after the number), and repeated description thereof may be omitted. In addition, for convenience of explanation, dimensional ratios in the drawings may differ from the actual ratios, and part of the configuration may be omitted from the drawings.

<First Embodiment>
[Configuration of information processing device]
The information processing apparatus 1 will be described with reference to FIG. 1. FIG. 1 is a conceptual diagram of an information processing apparatus according to one embodiment of the present invention.

  The information processing apparatus 1 is connected to user terminals 30a and 30b and to a server 33 via a network 27. When it is not necessary to distinguish between the user terminals, they are referred to as "user terminal 30".

  Here, the network 27 is, for example, a LAN (local area network), the Internet, or the like, and any network environment in which the user terminal 30 can connect to the information processing apparatus 1, regardless of whether the line is wireless, wired, or dedicated, is applicable.

  The user terminal 30 includes mobile communication terminal devices such as multi-function mobile phones, mobile phones, and PDAs (Personal Digital Assistants), and information processing terminal devices having communication and computing functions, such as personal computers. The user terminal 30 has a browser as a display control function for displaying screens, and includes a CPU, a memory, and a communication control unit that controls communication with the information processing apparatus 1. It may further include an operation input device such as a mouse, keyboard, or touch panel, and a display device.

  The information processing apparatus 1 includes a feature information extraction unit 10, a feature vector generation unit 11, and a determination unit 12.

  The feature information extraction unit 10 extracts the binary of an icon image from the resource of a predetermined file (an executable file or the like) serving as a specimen. Processing such as reduction or normalization may be applied to the extracted icon image.

  In this example, the binary of the icon image is extracted from all pixels. Of course, the method of extracting the icon image binary is not limited to this, and the binary may be extracted at a predetermined interval. For example, assuming that the upper-left pixel is (1, 1), the binary of odd-numbered pixels such as (1, 3), (1, 5), ..., (2, 1), (2, 3), ... may be extracted, or the binary of even-numbered pixels may be extracted. Instead of odd- or even-numbered pixels, the binary may be extracted from pixels at an even larger interval, as sketched below.
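
A minimal sketch of such interval sampling is shown below; it assumes the icon has already been decoded into an RGB image with Pillow, and the function name and step value are illustrative only:

```python
# Illustrative sketch of extracting pixel values at a predetermined interval.
# Pillow is assumed for decoding; the names and default step are not from the patent.
from PIL import Image

def sample_icon_pixels(icon_path: str, step: int = 2):
    """Return RGB values of every `step`-th pixel (step=1 means all pixels)."""
    img = Image.open(icon_path).convert("RGB")
    width, height = img.size
    sampled = []
    for y in range(0, height, step):
        for x in range(0, width, step):
            sampled.append(img.getpixel((x, y)))
    return sampled
```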

  The feature vector generation unit 11 generates a feature vector from the icon image binary extracted by the feature information extraction unit 10. The generation of the feature vector will be described with reference to FIG. 2. FIG. 2 is a diagram for explaining how the information processing apparatus according to one embodiment of the present invention generates a feature vector.

  The feature vector generation unit 11 uses the RGB value of each pixel of the icon image as a vector element. In this example, the upper-left pixel of the icon image in FIG. 2 has R: 0xD4, G: 0xD0, B: 0xC8, i.e., 0x00D4D0C8. The hexadecimal number D4D0C8 is 13947080 in decimal. Each pixel of the icon image in FIG. 2 is likewise converted to a decimal number and used as a vector element. In this example, the feature vector = {13947080, 1394080, ..., 893013, ...}.
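
A possible implementation of this per-pixel conversion is sketched below (again assuming Pillow; the function name is hypothetical):

```python
# Illustrative sketch: each pixel's RGB value becomes one decimal vector element,
# as in the 0xD4D0C8 -> 13947080 example above. Pillow is an assumption.
from PIL import Image

def icon_to_feature_vector(icon_path: str):
    img = Image.open(icon_path).convert("RGB")
    vector = []
    for r, g, b in img.getdata():                 # row-major, upper-left pixel first
        vector.append((r << 16) | (g << 8) | b)   # 0xRRGGBB packed as one integer
    return vector

# Example: R=0xD4, G=0xD0, B=0xC8 -> 0xD4D0C8 -> 13947080.
```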

  The determination unit 12 determines, by machine learning using the feature vector, whether the predetermined file serving as the specimen is malware. The determination unit (model) 12 is generated in advance. The generation of the determination unit (model) 12 will be described with reference to FIGS. 3 and 4. FIG. 3 is a diagram for explaining the flow of generating the determination unit (model) of the information processing apparatus according to one embodiment of the present invention. FIG. 4 is a conceptual diagram for explaining that the model generated in FIG. 3 is ported.

  A plurality of icon images extracted from malware having fake icons and icon images extracted from normal software are prepared and used as a learning data set (teacher data). This learning data is input to the model (S101). In this example, the weights wi of the model into which the learning data is input are determined arbitrarily (randomly). However, the values of the weights wi may instead be determined by a known method such as pre-training.

  When the learning data is input to the model, learning is performed according to a deep learning algorithm (S102). As a result of this learning, the internal structure of the model changes; specifically, the values of the model weights wi change, while the activation function is fixed. The deep learning is, for example, a CNN (Convolutional Neural Network). In this example, learning is performed by deep learning, but the present invention is not limited to this, and other algorithms such as backpropagation, a Boltzmann machine, TWDRLS, or a cognitron may be used.

  As a result of the learning, it is determined whether the model, whose internal structure has changed, has reached a certain classification accuracy (S103). If the classification accuracy has been reached (Yes in S103), the model generation is complete (S104). On the other hand, if the classification accuracy is below the target (No in S103), S101 to S103 are repeated. When the model generation is complete, the generated model is ported to the malware detection engine; this ported model is the determination unit 12. A minimal sketch of this loop is shown below.
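
The following is a minimal sketch of the S101-S104 loop. `build_model`, `train_one_epoch`, and `evaluate` are hypothetical wrappers around whatever deep-learning framework (e.g., a CNN implementation) is used, and the target accuracy of 0.95 is an arbitrary placeholder:

```python
# Minimal sketch of the model-generation loop (S101-S104); not the actual engine.
# build_model / train_one_epoch / evaluate are hypothetical framework wrappers.
def generate_determination_model(train_set, valid_set,
                                 target_accuracy=0.95, max_epochs=1000):
    model = build_model()                          # weights w_i initialised randomly (S101)
    for _ in range(max_epochs):
        train_one_epoch(model, train_set)          # learning changes the weights w_i (S102)
        accuracy = evaluate(model, valid_set)      # check classification accuracy (S103)
        if accuracy >= target_accuracy:
            return model                           # model generation complete (S104)
    raise RuntimeError("target classification accuracy not reached")
```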

[Malware judgment flow]
The flow in which the information processing apparatus determines whether a specimen file is malware will be described with reference to FIG. 5. FIG. 5 is a diagram for explaining the flow in which the information processing apparatus according to one embodiment of the present invention determines malware.

  First, the feature information extraction unit 10 extracts the binary of the icon image from the resource of the specimen file (an executable file or the like) (S201). Next, the feature vector generation unit 11 generates a feature vector from the extracted icon image binary (S202). Then, the determination unit 12 determines, by machine learning using the feature vector, whether the specimen file is malware (S203). When the determination unit 12 determines that the file is malware (S204a) or that it is normal (S204b), the flow ends.

  In the present embodiment, the determination unit 12 is obtained by porting a model whose internal structure has been changed from the initial model by deep learning and which has been determined to have a certain classification accuracy. The determination unit 12 determines whether the specimen file is malware by machine learning. Therefore, a specimen file can be determined to be malware with higher accuracy than with the conventional technology.

<Second Embodiment>
As in the first embodiment, the method of determining whether a specimen file is malware by machine learning is effective in that the specimen file can be determined to be malware with higher accuracy than with the conventional technology. However, with this method, a specimen file may in rare cases be erroneously determined to be malware even though it is actually normal software. Such a determination is called a false detection (erroneous determination). Recognizing the need for a method of suppressing such false detections, the present inventor conducted intensive studies and arrived at a method using a statistical approach that focuses on the frequency of colors and the density histogram of the icon image of the specimen file.

[Configuration of information processing device]
The information processing apparatus 2 will be described with reference to FIG. 6. FIG. 6 is a conceptual diagram of an information processing apparatus according to another embodiment of the present invention. The information processing apparatus 2 according to the present embodiment includes a general determination unit 13 and a numerical value storage unit 21 in addition to the configuration of the first embodiment. Here, the differences from the first embodiment will be described in detail.

  The feature information extraction unit 10 extracts the binary of an icon image from the resource of the specimen file. In addition, the feature information extraction unit 10 extracts a numerical value from the icon image.

  In this example, the numerical value storage unit 21 stores in advance numerical values obtained by digitizing camouflaged icon images. Here, a camouflaged icon image is an icon image that has already been determined to be a camouflaged icon. A large number of such images are digitized, and the resulting numerical values are stored and held.

  The general determination unit 13 determines whether the specimen file is malware based on the numerical value that the feature information extraction unit 10 extracts from the icon image in the resource of the specimen file and the numerical values stored in the numerical value storage unit 21. One example of the general determination unit 13 uses Average Hash. In the case of Average Hash, the feature information extraction unit 10 reduces the icon image (for example, to 8 × 8 pixels) and converts its colors to grayscale. Then, the average color value over all pixels of the image is calculated, the color density of each pixel is checked, and a bit is set to "1" when the pixel is darker than the average value and to "0" otherwise. A 64-bit (8 × 8) bit string is thus created. The numerical value extracted by the feature information extraction unit 10 is this bit string.
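
The Average Hash step described above can be sketched as follows (Pillow is assumed for the image handling; following the text, a bit is 1 when the pixel is darker than the average):

```python
# Illustrative Average Hash sketch (8x8 grayscale, 64-bit string). Pillow is an
# assumption; per the description above, "darker than average" maps to bit 1.
from PIL import Image

def average_hash(icon_path: str) -> int:
    img = Image.open(icon_path).convert("L").resize((8, 8))   # grayscale, 8x8 pixels
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)                            # average density
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p < avg else 0)             # lower value = darker
    return bits                                                # 64-bit value
```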

  Similarly, bit strings are generated in advance for a large number of icon images already known to be fake icons, and these bit strings (numerical values) are stored in the numerical value storage unit 21. The general determination unit 13 then compares, one bit at a time, the bit string (numerical value) extracted from the specimen file with each bit string (numerical value) of an icon image stored in the numerical value storage unit 21 and already known to be a fake icon, and calculates the similarity. When a predetermined similarity is exceeded, the general determination unit 13 determines that the specimen file is malware.
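
The bit-by-bit comparison can be sketched as a Hamming similarity against the stored bit strings; the 0.9 threshold below is an arbitrary placeholder, not a value given in this description:

```python
# Illustrative sketch of the general determination by bit-string similarity.
# The similarity threshold is a placeholder assumption.
def matches_known_fake_icon(sample_hash: int, stored_hashes, threshold: float = 0.9) -> bool:
    for known in stored_hashes:
        differing_bits = bin(sample_hash ^ known).count("1")   # positions that differ
        similarity = (64 - differing_bits) / 64                # fraction of matching bits
        if similarity >= threshold:
            return True                                        # judged to be malware
    return False
```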

  The general determination unit 13 is not limited to Average Hash; a Jaccard coefficient, TF/IDF, Fuzzy Hash, SAD (Sum of Absolute Difference), or the like may also be used. With any of these, the numerical value extracted from the specimen file is compared with the numerical values of a large number of icon images already known to be camouflaged icons, and the similarity is calculated. When a predetermined similarity is exceeded, the general determination unit 13 determines that the specimen file is malware. Furthermore, the general determination unit 13 may combine a plurality of determinations using Average Hash, the Jaccard coefficient, TF/IDF, Fuzzy Hash, SAD, or the like, because these determinations are independent of one another.

[Malware judgment flow]
The flow in which the information processing apparatus determines whether a specimen file is malware will be described with reference to FIG. 7. FIG. 7 is a diagram for explaining the flow in which an information processing apparatus according to another embodiment of the present invention determines malware. S301, S302, and S303 of the present embodiment correspond to S201, S202, and S203 of the first embodiment. Here, the steps from S303 onward will be described in detail.

  If, as a result of the determination by machine learning, the specimen file is determined to be normal software and not malware (No in S303), the specimen file is determined to be normal software (S306). On the other hand, if the specimen file is determined to be malware as a result of the machine learning (Yes in S303), the general determination unit 13 further determines whether the specimen file is malware (S304). The specific determination method is as described above.

  When the general determination unit 13 determines that the specimen file is not malware but normal software (No in S304), the specimen file is determined to be normal software (S306) and the flow ends. On the other hand, when the general determination unit 13 determines that the specimen file is malware (Yes in S304), the specimen file is determined to be malware (S305) and the flow ends.

  In the present embodiment as well, as in the first embodiment, the determination unit 12 determines whether the specimen file is malware by machine learning. Therefore, a specimen file can be determined to be malware with higher accuracy than with the conventional technology.

  In the present embodiment, the general determination unit 13 further examines a file that the determination unit 12 has determined to be malware. In rare cases, the determination unit 12 may erroneously determine that a specimen file is malware even though it is actually normal software; such a determination is called a false detection (erroneous determination). The general determination unit 13 has the effect of suppressing false detections by the determination unit 12.

<Third Embodiment>
[Malware judgment flow]
The flow in which the information processing apparatus determines whether a specimen file is malware will be described with reference to FIG. 8. FIG. 8 is a diagram for explaining the flow in which an information processing apparatus according to another embodiment of the present invention determines malware.

  The third embodiment differs from the second embodiment in the order of some steps of the flow. In the second embodiment, the determination by the general determination unit 13 is performed after the determination by the determination unit 12; in the third embodiment, the determination by the determination unit 12 is performed after the determination by the general determination unit 13.

  This embodiment also has the same effect as the second embodiment.

<Fourth Embodiment>
As in the second and third embodiments, by performing the general determination in addition to the method of determining whether a specimen file is malware by machine learning, the specimen file can be determined to be malware with higher accuracy. However, these methods do not prevent all false detections. Therefore, as a result of intensive studies on methods for suppressing false detections, the present inventor arrived at a method that suppresses false detections by additionally taking the characteristics of malware into consideration, on top of the second and third embodiments. Eleven examples of characteristics of malware will be described below, but the present embodiment is not limited to these.

[Configuration of information processing device]
The information processing apparatus 3 will be described with reference to FIG. 9. FIG. 9 is a conceptual diagram of an information processing apparatus according to another embodiment of the present invention. The information processing apparatus 3 according to the present embodiment includes an initial point setting unit 14, a point addition/subtraction unit 15, a threshold determination unit 16, a falsified icon determination unit 17, and a point threshold storage unit 22 in addition to the configuration of the second embodiment. Here, the differences from the second embodiment will be described in detail.

  The initial point setting unit 14 sets a predetermined malware initial point when at least one of the determination unit 12 and the general determination unit 13 determines that the predetermined file serving as the specimen is malware. That is, the determination unit 12 alone, the general determination unit 13 alone, or both the determination unit 12 and the general determination unit 13 may have determined that the specimen file is malware; in these cases, the initial point setting unit 14 sets the malware initial point. In this example, the malware initial point is described as 90 points. The malware initial point may be stored in the storage unit 20 in advance.

  The point threshold storage unit 22 stores a point threshold that serves as an index for determining whether a predetermined file is malware. In this example, the point threshold is described as 95 points.

  The point addition/subtraction unit 15 adds or subtracts a predetermined point to or from the malware initial point set by the initial point setting unit 14 when the specimen file satisfies a predetermined condition. The predetermined point can be set as appropriate. Eleven examples of the "predetermined condition" will be described later. The predetermined point may be a different value for each item, and the point for each item may be stored in the storage unit 20 in advance.

  The threshold determination unit 16 determines whether the point calculated by addition or subtraction by the point addition/subtraction unit 15 exceeds the point threshold stored in the point threshold storage unit 22, and when the point threshold is exceeded, determines that the specimen file is malware. For example, a malware initial point of 90 points is set for a file determined to be malware by the determination unit 12 and the general determination unit 13. When this file satisfies a predetermined condition, the point addition/subtraction unit 15 adds or subtracts a predetermined point to or from the 90 points. When the resulting point exceeds 95 points, the specimen file is determined to be malware.
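
The point-based decision can be sketched as follows; the values 90 and 95 are the example values used in this description, while the per-condition point of 3 is an illustrative assumption:

```python
# Illustrative sketch of the initial point / point addition / threshold decision.
# initial_point and threshold follow the example in the text; point_per_condition
# is an assumption (the description allows a different point per condition).
def judge_by_points(satisfied_conditions, initial_point: int = 90,
                    threshold: int = 95, point_per_condition: int = 3) -> bool:
    points = initial_point                      # initial point setting unit
    for satisfied in satisfied_conditions:      # one bool per "specific feature" check
        if satisfied:
            points += point_per_condition       # point addition/subtraction unit
    return points > threshold                   # threshold determination unit
```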

  Eleven specific features of malware will now be described. When the specimen file matches a specific feature of malware, the point addition/subtraction unit 15 adds a predetermined point to the malware initial point set by the initial point setting unit 14. The specific features of malware are extracted by the feature information extraction unit 10.

(Specific feature 1)
The falsified icon determination unit 17 determines whether the icon image of the specimen file is a regular icon image. If it is a regular icon image, the same icon may be held inside the operating system; in addition, generally known regular icon images can be stored in the storage unit 20 and used for the determination. On the other hand, if a legitimate icon image has been tampered with, the file is highly likely to be malware. Therefore, when the falsified icon determination unit 17 determines that the icon image of the specimen file is not a regular icon image, the point addition/subtraction unit 15 adds a predetermined point to the malware initial point. Conversely, when the falsified icon determination unit 17 determines that the icon image of the specimen file is a regular icon image, the point addition/subtraction unit 15 subtracts a predetermined point from the malware initial point.

(Specific feature 2)
Legitimate applications tend to have a large number of icon resources, whereas malware tends to have extremely few. Therefore, the feature information extraction unit 10 extracts the number of icons held in the resource of the specimen file. When the number of extracted icons falls below a predetermined threshold, the point addition/subtraction unit 15 adds a predetermined point to the malware initial point. On the other hand, when the number of extracted icons is equal to or greater than the predetermined threshold, the point addition/subtraction unit 15 subtracts a predetermined point from the malware initial point. The point used for addition and the point used for subtraction may be different, and the predetermined threshold may be stored in the storage unit 20 in advance.

(Specific feature 3)
For example, an application having a Microsoft (registered trademark) icon, such as Microsoft Word, is basically an application made by Microsoft (registered trademark). Therefore, the feature information extraction unit 10 extracts version information from the specimen file. When the icon image of the specimen file does not correspond to the version information, the point addition/subtraction unit 15 adds a predetermined point to the malware initial point. For example, if the specimen file has a Microsoft (registered trademark) icon but the extracted version information is not that of Microsoft (registered trademark), the point addition/subtraction unit 15 adds a predetermined point to the malware initial point.

(Specific feature 4)
For example, if the specimen file has a Microsoft (registered trademark) icon such as that of Microsoft Word and the icon is genuine, the application holding the icon is basically a native application written in C++ or the like. Conversely, if the binary was created with .NET or Visual Basic (registered trademark), it is highly likely to be malware. Specifically, the feature information extraction unit 10 extracts programming language information from the PE header of the file. If the specimen file has a Microsoft (registered trademark) icon such as that of Microsoft Word and the extracted programming language information is not C++ or the like, the point addition/subtraction unit 15 adds a predetermined point to the malware initial point.

(Specific feature 5)
For example, when the specimen file has a Microsoft (registered trademark) icon such as that of Microsoft Word and the icon is genuine, the compiler is basically Visual Studio (registered trademark). When there is evidence that another compiler was used, for example a difference in the APIs (Application Programming Interfaces) used, the specimen file is likely to be malware, so a predetermined point is added. Specifically, the feature information extraction unit 10 extracts compiler information from the specimen file, and the point addition/subtraction unit 15 adds a predetermined point to the malware initial point when the extracted compiler information is not the compiler information corresponding to the icon image held by the file.

(Specific feature 6)
When a camouflaged icon is held and a packer such as UPX (Ultimate Packer for Executables) is used, the specimen file is likely to be malware, so a predetermined point is added. Specifically, the feature information extraction unit 10 extracts information about the packer from the section information of the file, and the point addition/subtraction unit 15 adds a predetermined point to the malware initial point when the feature information extraction unit 10 extracts information about a packer from the predetermined file.

(Specific feature 7)
Self-extracting archives originally have their own specific icons. If a camouflaged icon is held in place of such an icon, it is highly likely that execution of the file is being induced, so a point is added in this case. Specifically, when the feature information extraction unit 10 extracts self-extracting archive information from the specimen file, a predetermined point is added to the malware initial point.

(Specific feature 8)
In the case of legitimate applications, file names tend to consist only of alphanumeric characters and to be short. Therefore, if the specimen file has an extremely long file name and holds a camouflaged icon, it is considered to be passing itself off as a document file, so a predetermined point is added. A threshold for the number of characters in the file name is set here; this threshold can be set as appropriate and may be stored in the storage unit 20 in advance. Specifically, the feature information extraction unit 10 extracts the file name from the header information of the specimen file, and when the number of characters in the extracted file name exceeds the predetermined number of characters (threshold), the point addition/subtraction unit 15 adds a predetermined point to the malware initial point.

(Specific feature 9)
A file name may be reversed from some position so that its extension is misread, and this is sometimes combined with a camouflaged icon. Therefore, when the file name includes a Unicode control character such as "\x202e" (RLO, Right-to-Left Override) and the file holds a camouflaged icon, a point is added. Specifically, the feature information extraction unit 10 extracts the file name from the header information of the specimen file, and the point addition/subtraction unit 15 adds a predetermined point to the malware initial point when the extracted file name includes a Unicode control character.

(Specific feature 10)
By giving a file name a double extension such as ".doc.exe", the actual file format may be misrepresented, and this is sometimes combined with a camouflaged icon. Therefore, when the file name includes a double extension and the file holds a fake icon, a point is added. Specifically, the feature information extraction unit 10 extracts the file name from the header information of the specimen file, and when the extracted file name includes a plurality of extensions, the point addition/subtraction unit 15 adds a predetermined point to the malware initial point.

(Specific feature 11)
When disguising malware as a document file, attackers tend to use file names that attract interest; in particular, Japanese file names tend to be used in attacks conducted in Japan. Therefore, when the file name includes 2-byte (double-byte) characters and the file holds a camouflaged icon, a point is added. Specifically, the feature information extraction unit 10 extracts the file name from the header information of the specimen file, and the point addition/subtraction unit 15 adds a predetermined point to the malware initial point when the extracted file name includes a 2-byte character.

  Eleven specific features of malware have been described above. The features extracted by the feature information extraction unit 10 may be all eleven items or a combination of only some of them, and the point threshold may be set as appropriate depending on the number of items used. The file-name checks of specific features 8 to 11 are sketched below.
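
A consolidated sketch of the file-name checks (specific features 8 to 11) might look like the following; the length limit of 32 characters, the extension pattern, and the per-check point are illustrative assumptions:

```python
# Illustrative sketch of the file-name checks in specific features 8-11.
# The length limit, extension pattern, and point value are assumptions.
import re

def filename_points(file_name: str, point: int = 3, max_length: int = 32) -> int:
    added = 0
    if len(file_name) > max_length:                    # feature 8: unusually long name
        added += point
    if "\u202e" in file_name:                          # feature 9: RLO control character
        added += point
    if len(re.findall(r"\.[A-Za-z0-9]{1,4}", file_name)) >= 2:  # feature 10: double extension
        added += point
    if any(ord(c) > 0x7F for c in file_name):          # feature 11: double-byte characters
        added += point
    return added
```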

[Malware judgment flow]
The flow in which the information processing apparatus determines whether a specimen file is malware will be described with reference to FIG. 10. FIG. 10 is a diagram for explaining the flow in which an information processing apparatus according to another embodiment of the present invention determines malware. S501, S502, S503, and S504 in the present embodiment correspond to S301, S302, S303, and S304 in the second embodiment. Here, the steps from S504 onward will be described in detail.

  When, as a result of the determination by the general determination unit 13, the specimen file is determined to be malware (Yes in S504), the initial point setting unit 14 sets a malware initial point (S505). Here, the malware initial point is assumed to be 90 points.

  Next, it is determined whether the specimen file satisfies a predetermined condition (S506), that is, whether the specimen file has any of the specific features of malware. When a predetermined condition is satisfied (Yes in S506), the point addition/subtraction unit 15 adds a predetermined point to the 90 points of the malware initial point (S507). On the other hand, if no predetermined condition is satisfied (No in S506), no point is added, so the score remains at the malware initial point of 90 points. In this example, the point threshold is assumed to be larger than the malware initial point; therefore, when no point is added, the score falls below the point threshold and the specimen file is determined to be normal software (S510).

  When the result of adding points to the malware initial point exceeds the point threshold (95 points in this case) (Yes in S508), the specimen file is determined to be malware (S509) and the flow ends. On the other hand, when the result of adding points to the malware initial point is equal to or less than the point threshold (No in S508), the specimen file is determined to be normal software (S510) and the flow ends.

  This embodiment also has the same effects as those of the first to third embodiments.

  A file determined to be malware by both the determination unit 12 and the general determination unit 13 is rarely a file that is actually normal software erroneously determined to be malware. Nevertheless, a malware initial point is set for a file determined to be malware by the determination unit 12 and the general determination unit 13, each piece of feature information is extracted from the file, and a point is added to the malware initial point when the file has a specific feature of malware. Only when the point calculated by the addition exceeds the point threshold is the specimen file determined to be malware. This embodiment therefore has the effect of suppressing false detections by the determination unit 12 and the general determination unit 13.

<Fifth Embodiment>
[Configuration of information processing device]
The information processing apparatus 4 will be described with reference to FIG. 11. FIG. 11 is a conceptual diagram of an information processing apparatus according to another embodiment of the present invention. The information processing apparatus 4 according to the present embodiment includes a digitizing unit 18 in addition to the configuration of the third embodiment. Here, the differences from the third embodiment will be described in detail.

  The digitizing unit 18 digitizes the icon image of a predetermined file determined to be malware. In this example, the digitizing unit 18 digitizes the icon image of a file determined to be malware by the threshold determination unit 16, and the numerical value storage unit 21 stores the numerical value produced by the digitizing unit 18. However, the present invention is not limited to this, and the digitizing unit 18 may digitize the icon image of a file determined to be malware by the determination unit 12 or the general determination unit 13.

  In other words, a numerical value obtained by digitizing the icon image of an unknown file determined to be malware is also stored, so the icon image of an unknown file determined to be malware also becomes comparison data for the numerical value extracted from a specimen file. However, there is a slight possibility that an unknown file determined to be malware is actually normal software. Therefore, in order to ensure its suitability as comparison data, an extremely high point threshold may be applied to files whose icon images are to be digitized by the digitizing unit 18.

  This embodiment also has the same effect as the first to fourth embodiments.

  In the present embodiment, the icon image of an unknown file determined to be malware also becomes comparison data for the numerical value extracted from a specimen file. Therefore, the number of numerical values stored in the numerical value storage unit 21 increases, and the determination can be made against a larger set of comparison data.

  The methods according to the above embodiments may be realized in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks (registered trademark), and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of the program instructions include not only machine language code such as that generated by a compiler but also high-level language code executed by a computer using an interpreter or the like.

  Note that the present invention is not limited to the above-described embodiment, and can be modified as appropriate without departing from the spirit of the present invention.

1, 2, 3, 4: information processing apparatus, 10: feature information extraction unit, 11: feature vector generation unit, 12: determination unit, 13: general determination unit, 14: initial point setting unit, 15: point addition/subtraction unit, 16: threshold determination unit, 17: falsified icon determination unit, 18: digitizing unit, 20: storage unit, 21: numerical value storage unit, 22: point threshold storage unit, 27: network, 30: user terminal, 33: server

Claims (16)

  1. A feature information extraction unit that extracts a binary of an icon image from within a resource of a predetermined file;
    A feature vector generation unit that generates a feature vector from the binary of the extracted icon image;
    A determination unit that determines whether the predetermined file is malware by machine learning using the feature vector;
    An initial point setting unit for setting a predetermined malware initial point when the determination unit determines that the predetermined file is malware;
    A point threshold value storage unit for storing a point threshold value as an index for determining whether the predetermined file is malware;
    A point addition/subtraction unit that adds or subtracts a predetermined point to or from the malware initial point when the predetermined file satisfies a predetermined condition;
    A threshold determination unit that determines whether the point calculated by addition or subtraction by the point addition/subtraction unit exceeds the point threshold stored in the point threshold storage unit and, when the point threshold is exceeded, determines that the predetermined file is malware;
    An information processing apparatus comprising:
  2. Computer
    Extract the icon image binary from the resource of the given file,
    Generating a feature vector from the extracted icon image binary;
    Determine whether the predetermined file is malware by machine learning using the feature vector,
    When the predetermined file is determined to be malware, a predetermined malware initial point is set,
    Storing a point threshold value as an index for determining whether the predetermined file is malware;
    Adding or subtracting a predetermined point to the malware initial point when the predetermined file satisfies a predetermined condition,
    An information processing method, wherein it is determined whether the point calculated by the addition or subtraction exceeds the stored point threshold and, when the point threshold is exceeded, the predetermined file is determined to be malware.
  3. On the computer,
    Extract the icon image binary from the resource of the given file,
    Generating a feature vector from the extracted icon image binary;
    Determine whether the predetermined file is malware by machine learning using the feature vector,
    When the predetermined file is determined to be malware, a predetermined malware initial point is set,
    Storing a point threshold value as an index for determining whether the predetermined file is malware;
    Adding or subtracting a predetermined point to the malware initial point when the predetermined file satisfies a predetermined condition,
    A program for causing the computer to determine whether the point calculated by the addition or subtraction exceeds the stored point threshold and to determine that the predetermined file is malware when the point threshold is exceeded.
  4. In the computer,
    The program according to claim 3, wherein the computer is caused to extract the binary at a predetermined interval.
  5. In the computer,
    the icon image of the predetermined file is compared with icon images registered in advance as regular icon images to determine whether it is a regular icon image,
    The program according to claim 3 or claim 4, for causing a predetermined point to be added to the malware initial point when the icon image of the predetermined file is determined not to be a regular icon image.
  6. Extracting the number of icons held in the resource of the given file;
    The program according to claim 3 or 4 for causing a predetermined point to be added to the malware initial point when the number of extracted icons is less than a predetermined number.
  7. Extracting version information from the predetermined file;
    The program according to claim 3 or 4 for causing a predetermined point to be added to the malware initial point when an icon image of the predetermined file does not correspond to the version information.
  8. Extracting programming language information from the predetermined file;
    The program according to claim 3 or claim 4, for causing a predetermined point to be added to the malware initial point when the extracted programming language information is not the programming language information corresponding to the icon image held in the predetermined file.
  9. Extract compiler information from the given file,
    The program according to claim 3 or claim 4, for causing a predetermined point to be added to the malware initial point when the extracted compiler information is not the compiler information corresponding to the icon image held in the predetermined file.
  10. Extracting information about the packer from the given file,
    The program according to claim 3 or 4 for causing a predetermined point to be added to the initial malware point when information on a packer is extracted from the predetermined file.
  11. Extract self-extracting archive information from the given file,
    The program according to claim 3 or claim 4, for causing a predetermined point to be added to the malware initial point when the self-extracting archive information is extracted from the predetermined file.
  12. Extract the file name from the given file,
    The program according to claim 3 or 4 for causing a predetermined point to be added to the malware initial point when the number of characters of the extracted file name exceeds a predetermined number of characters.
  13. Extract the file name from the given file,
    The program according to claim 3 or 4 for causing a predetermined point to be added to the malware initial point when a Unicode control character is included in the extracted file name.
  14. Extract the file name from the given file,
    The program according to claim 3 or 4 for causing a predetermined point to be added to the malware initial point when a plurality of extensions are included in the extracted file name.
  15. Extract the file name from the given file,
    The program according to claim 3 or 4 for causing a predetermined point to be added to the malware initial point when a two-byte character is included in the extracted file name.
  16. A computer-readable recording medium on which the program according to any one of claims 3 to 15 is recorded.
JP2016046762A 2016-03-10 2016-03-10 Information processing apparatus, information processing method, program, and computer-readable recording medium recording the program Active JP5982597B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2016046762A JP5982597B1 (en) 2016-03-10 2016-03-10 Information processing apparatus, information processing method, program, and computer-readable recording medium recording the program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2016046762A JP5982597B1 (en) 2016-03-10 2016-03-10 Information processing apparatus, information processing method, program, and computer-readable recording medium recording the program

Publications (2)

Publication Number Publication Date
JP5982597B1 true JP5982597B1 (en) 2016-08-31
JP2017162244A JP2017162244A (en) 2017-09-14

Family

ID=56819981

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2016046762A Active JP5982597B1 (en) 2016-03-10 2016-03-10 Information processing apparatus, information processing method, program, and computer-readable recording medium recording the program

Country Status (1)

Country Link
JP (1) JP5982597B1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008545177A (en) * 2005-05-05 2008-12-11 シスコ アイアンポート システムズ エルエルシー Identification of threats in an electronic message
JP2012501028A (en) * 2008-08-28 2012-01-12 エーブイジー テクノロジーズ シーゼット、エス.アール.オー. Heuristics for code analysis
JP2014504399A (en) * 2010-12-01 2014-02-20 ソースファイア インコーポレイテッドSourcefire,Inc. How to detect malicious software using contextual probabilities, generic signatures, and machine learning methods
US20150213365A1 (en) * 2014-01-30 2015-07-30 Shine Security Ltd. Methods and systems for classification of software applications
JP2015191458A (en) * 2014-03-28 2015-11-02 エヌ・ティ・ティ・ソフトウェア株式会社 File risk determination device, file risk determination method, and program
WO2015190446A1 (en) * 2014-06-11 2015-12-17 日本電信電話株式会社 Malware determination device, malware determination system, malware determination method, and program
JP2016507115A (en) * 2013-02-10 2016-03-07 サイバー アクティブ セキュリティー エルティーディー. Methods and products that provide predictive security products and evaluate existing security products

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008545177A (en) * 2005-05-05 2008-12-11 シスコ アイアンポート システムズ エルエルシー Identification of threats in an electronic message
JP2012501028A (en) * 2008-08-28 2012-01-12 エーブイジー テクノロジーズ シーゼット、エス.アール.オー. Heuristics for code analysis
JP2014504399A (en) * 2010-12-01 2014-02-20 ソースファイア インコーポレイテッドSourcefire,Inc. How to detect malicious software using contextual probabilities, generic signatures, and machine learning methods
JP2016507115A (en) * 2013-02-10 2016-03-07 サイバー アクティブ セキュリティー エルティーディー. Methods and products that provide predictive security products and evaluate existing security products
US20150213365A1 (en) * 2014-01-30 2015-07-30 Shine Security Ltd. Methods and systems for classification of software applications
JP2015191458A (en) * 2014-03-28 2015-11-02 エヌ・ティ・ティ・ソフトウェア株式会社 File risk determination device, file risk determination method, and program
WO2015190446A1 (en) * 2014-06-11 2015-12-17 日本電信電話株式会社 Malware determination device, malware determination system, malware determination method, and program

Also Published As

Publication number Publication date
JP2017162244A (en) 2017-09-14


Legal Events

Date Code Title Description
A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20160610

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20160705

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20160801

R150 Certificate of patent or registration of utility model

Ref document number: 5982597

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250