WO2020108760A1

WO2020108760A1 - Apparatus and method for malware detection

Info

Publication number: WO2020108760A1
Application number: PCT/EP2018/083014
Authority: WO
Inventors: Olga KOGAN; Elad TZOREFF; Dmitry MEYTIN
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2018-11-29
Filing date: 2018-11-29
Publication date: 2020-06-04
Also published as: CN113015972A

Abstract

The disclosure relates to an apparatus and a method for malware detection. The method for malware detection comprises: generating an image from a file; generating a signature of the image, wherein the signature of the image indicates local feature descriptors of the image; comparing the signature of the image to at least one pre-determined signature in a malware signature repository; and determining, based on the comparison result, if the file is malicious. Local feature descriptors used in the embodiments of the invention as signature for detecting malware in a file are more robust in detecting the same feature in the image independent of scaling, shifting and noise with high accuracy and repeatability. The robustness of the method for malware detection is thus improved.

Description

APPARATUS AND METHOD FOR MALWARE DETECTION

TECHNICAL FIELD

The disclosure relates to an apparatus and a method for malware detection. Furthermore, the disclosure also relates to an apparatus and a method for generating a malware signature repository, and corresponding computer programs and computer-readable storage mediums thereof.

BACKGROUND

Malware is a portmanteau of "malicious software" and refers to any software designed to infiltrate or damage a computer system or a computer network without the owner's informed consent. Malware may include computer viruses, worms, Trojan horses, rootkits, and spyware. In order to prevent problems associated with malware infections, many end users make use of endpoint protection software to detect and possibly remove malware.

Endpoint protection is a solution deployed on endpoint devices such as servers and personal devices, including personal computers, laptops, tablets, and other devices, in order to prevent malware attacks or other malicious activities. Endpoint security attempts to ensure that such devices follow a definite level of compliance to standards. In this field, signature-based techniques are commonly used by endpoint protection tools to detect attacks by known malware. According to statistical analysis, scanning a file prior to execution prevents infection, assuming a signature exists for that threat. Such scanning is quick and has low false-positive rates (FPRs).

Most common methods for signature generation apply different functions (for example, checksums, cryptographic hash functions, fuzzy hash functions, etc.) on the sequence of bytes of the malicious file (e.g., malicious software, or malware). The problem is that these methods are sensitive to even slight changes in the malware binaries. This opens an opportunity for the attackers to evade signature-based detection by creating malware mutations and causing a response lag until the new signatures are generated.

There is a need to build more robust malware signatures that allow quick and accurate identification of malware mutations that have not been seen yet, even if a portion of their content differs from the known malware mutations. In addition, in order to minimize the detection time and storage space, it is advantageous to minimize the number of malware signatures that are used when scanning a new file by the endpoint protection tools.

SUMMARY

An objective of the embodiments of the disclosure is to provide a solution which mitigates or solves the drawbacks and problems of conventional solutions.

The above and further objectives are solved by the subject matter of the independent claims. Further advantageous embodiments can be found in the dependent claims.

The disclosure aims at providing a solution that makes malware detection robust, so that unknown malware mutations are identified quickly and accurately even if a portion of their content differs from the known malware mutations.

According to a first aspect of the disclosure, the above mentioned and other objectives are achieved with a method for malware detection. The method comprises the following steps: generating an image from a file; generating a signature of the image, wherein the signature of the image indicates local feature descriptors of the image; comparing the signature of the image to at least one pre-determined signature in a malware signature repository; and determining, based on the comparison result, if the file is malicious.

It shall be noted, that the above disclosure concentrates on a scenario for detecting a malware in the binary content of an input file. After converting the binary content of the input file into an image, a signature of the image can be generated. This signature of the image indicates local feature descriptors of the image. A comparison is performed between the signature of the image and pre-determined signatures in a malware signature repository. The input file is determined as malicious based on the comparison result.

In this disclosure, the term “file” may be interpreted as the binary content of the file, which is suspected of containing one or more malware mutations.

In this disclosure, the term “an image is generated from a file” may be interpreted as an image which is converted from the binary content of an input file. There are a plurality of ways of generating an image from the binary content of the file, for example, reading the byte stream of the binary content of the file and converting each byte value (0-255) to a pixel of a corresponding grey level. The width of the image is set based on the size of the binary content of the file, and then the height of the image is filled in depending on the content of the binary.

In this disclosure, the term “a signature of the image” may be interpreted as one or more characteristics of the image. These characteristics of the image may comprise global features of the image and/or local features of the image. Alternatively, the characteristics of the image may comprise global feature descriptors of the image and/or local feature descriptors of the image. In the embodiments of the invention, the signature of the image indicates local feature descriptors of the image, for example, local feature descriptors of the image comprise a plurality of key point descriptors.

In this disclosure, the term “global feature(s) descriptor” describes an image as a whole to generalize the entire object. Global feature descriptors may include, for example, contour representations, shape descriptors, and texture features. Shape Matrices, Moment Invariants, Histogram of Oriented Gradients (HOG) and Co-occurrence Histogram of Oriented Gradients (Co-HOG) are some examples of global feature descriptors.

In this disclosure, the term “local feature(s) descriptor” describes image patches (for example, descriptors of key points in the image) of an object. Scale Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), Local Binary Patterns (LBP), Binary Robust Invariant Scalable Key-points (BRISK), Maximally Stable Extremal Regions (MSER) and Fast Retina Key-point (FREAK) are some examples of local feature descriptors. Local feature descriptors are more robust (e.g. independent of scaling, shifting, etc.) than global feature descriptors in describing the characteristics of an image. Taking “key point descriptor” as an example of a local feature descriptor, each key point descriptor comprises descriptions of location, scale and orientation of the key point.

In this disclosure, the term “malware signature repository” may be interpreted to relate to a repository containing signatures of a plurality of malware mutations. In the embodiments of the present disclosure, the signatures in the malware signature repository may be comprised of local feature descriptors. The signature of a suspicious file is compared with the signatures in the malware signature repository to find whether this file is malicious or not.

An advantage of the method according to the first aspect is that local feature descriptors are used as the signature of the image to detect malware in a file corresponding to the image. The local feature descriptors are more robust in detecting the same feature(s) in the image independent of scaling, shifting and noise and with high accuracy and repeatability. Therefore, the robustness of the method for malware detection is improved.

In an implementation form according to the first aspect, the method further comprises: applying at least one filter on the generated image to reduce noise before generating the signature of the image. In this implementation, a plurality of filters (e.g. Gabor filters) may be used to filter the image, and a superposition of the filter results may then be calculated to reduce the noise in the image.
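The filtering step can be illustrated with a minimal sketch, assuming a small bank of Gabor kernels at evenly spaced orientations whose responses are summed. The kernel size, sigma, wavelength, aspect ratio and number of orientations are hypothetical values chosen for illustration; the disclosure does not specify them, and how much noise is actually suppressed depends on these choices.

```python
import math
from typing import List

Image = List[List[float]]


def gabor_kernel(size: int, sigma: float, theta: float,
                 wavelength: float, gamma: float = 0.5) -> Image:
    """Real part of a Gabor kernel with orientation theta (radians)."""
    if size % 2 == 0:
        raise ValueError("kernel size must be odd")
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates into the filter's orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def filter2d(image: Image, kernel: Image) -> Image:
    """Apply the kernel at every pixel; borders are handled by edge replication."""
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for kr, kernel_row in enumerate(kernel):
                rr = min(max(r + kr - half, 0), height - 1)
                for kc, weight in enumerate(kernel_row):
                    cc = min(max(c + kc - half, 0), width - 1)
                    acc += weight * image[rr][cc]
            out[r][c] = acc
    return out


def gabor_superposition(image: Image, orientations: int = 4, size: int = 5,
                        sigma: float = 2.0, wavelength: float = 4.0) -> Image:
    """Sum the responses of Gabor filters at evenly spaced orientations."""
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    for k in range(orientations):
        theta = k * math.pi / orientations
        response = filter2d(image, gabor_kernel(size, sigma, theta, wavelength))
        for r in range(height):
            for c in range(width):
                total[r][c] += response[r][c]
    return total
```

The output has the same dimensions as the input, so it can be passed to the signature generation step in place of the unfiltered image.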

An advantage of this implementation form is that the noise in the image is reduced, and the accuracy for generating the signature of the image is thus increased.

In an implementation form of the method according to the first aspect, the step of comparing the signature of the image to at least one pre-determined signature in the malware signature repository comprises: calculating a correspondence between the signature of the image and each of the at least one pre-determined signature in the malware signature repository.

In this implementation, the signature of the image indicates local feature descriptors (e.g., key point descriptors) of the image, and each of the at least one pre-determined signature in the malware signature repository also comprises local feature descriptors of the image corresponding to a malware sample file. A correspondence calculation algorithm is performed to determine the correspondence between the signature of the image and each of the at least one pre-determined signature in the malware signature repository.

An advantage with this implementation form is that an easy way is provided to compare the signature of the image with at least one pre-determined signature in the malware signature repository.

In an implementation form of the method according to the first aspect the method further comprises: comparing the calculated correspondence to a pre-defined threshold. This comparing can be done before or after determining if the file is malicious.

In this implementation, since the correspondence between the signature of the image and each of the at least one pre-determined signature in the malware signature repository is calculated, a possible way is to compare the calculated correspondence with a pre-determined threshold. Just as an example, the more correspondences there are between the signature of the image and a pre-determined signature in the malware signature repository, the higher the likelihood that the file corresponding to the image is malware.

An advantage of this implementation form is that a more practical way is provided to determine whether the file is malicious or not. Thus the applicability for malware detection is improved.

In an implementation form of the method according to the first aspect, the step of generating a signature of the image comprises: detecting a first set of key points in the image; generating a first set of descriptors based on the first set of key points, wherein each descriptor corresponds to a key point in the first set of the key points; and setting the first set of descriptors as the signature of the image.

In this implementation, key point descriptors of the image are used as an example of the local feature descriptors. The key points in the image are detected by using a key-point detection algorithm, for example, the Harris detection algorithm, Scale Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), or Oriented “Features from accelerated segment test (FAST)” and rotated “Binary Robust Independent Elementary Features (BRIEF)” (ORB), etc.
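As an illustration of key point detection, the following is a simplified sketch of the FAST segment test that ORB builds on. It is not the ORB detector itself: it omits the scale pyramid, corner scoring, non-maximum suppression and orientation assignment, and the threshold and arc length are assumed values.

```python
from typing import List, Tuple

# Offsets (dx, dy) of the 16 pixels on a radius-3 circle, in circular order.
CIRCLE = [
    (0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
    (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3),
]


def is_fast_corner(image: List[List[int]], r: int, c: int,
                   threshold: int = 20, arc: int = 9) -> bool:
    """Segment test: `arc` contiguous circle pixels all brighter or all darker."""
    centre = image[r][c]
    states = []
    for dx, dy in CIRCLE:
        value = image[r + dy][c + dx]
        if value > centre + threshold:
            states.append(1)      # clearly brighter than the centre
        elif value < centre - threshold:
            states.append(-1)     # clearly darker than the centre
        else:
            states.append(0)      # similar to the centre
    doubled = states + states     # lets a run wrap around the circle
    for target in (1, -1):
        run = 0
        for state in doubled:
            run = run + 1 if state == target else 0
            if run >= arc:
                return True
    return False


def detect_keypoints(image: List[List[int]], threshold: int = 20) -> List[Tuple[int, int]]:
    """Return (row, column) positions that pass the segment test."""
    height, width = len(image), len(image[0])
    keypoints = []
    for r in range(3, height - 3):        # keep the circle inside the image
        for c in range(3, width - 3):
            if is_fast_corner(image, r, c, threshold):
                keypoints.append((r, c))
    return keypoints
```

Each detected position would then be passed to a descriptor algorithm to obtain the corresponding entry of the first set of descriptors.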

An advantage with this implementation form is that an easier way is provided to determine the local feature descriptors.

In an implementation form of the method according to the first aspect, each of the pre-determined signatures in the malware signature repository comprises a second set of descriptors, and each descriptor in the second set of descriptors corresponds to a key point.

In this implementation, the malware signature repository comprises at least one pre-determined signature, and a pre-determined signature comprises a set of key point descriptors. Each key point descriptor corresponds to a key point in a malware sample file.

In an implementation form of the method according to the first aspect, wherein the step of comparing the signature of the image to at least one pre-determined signature in the malware signature repository comprises: for each second set of descriptors, detecting correspondences between the first set of descriptors and the second set of descriptors; and calculating a distance between the first set of descriptors and the second set of descriptors based on the detected correspondences.

In this implementation, the signature of the image is compared with the pre-determined signatures in the malware signature repository, for example in a bulk operation, and correspondences are calculated between the signature of the image and the at least one pre-determined signature in the malware signature repository. In other words, when performing the bulk operation, the algorithm takes all the available signatures and builds an index tree that allows the comparison to be done against all available signatures at once. For example, a calculation of a distance between the signature of the image (i.e. the first set of descriptors) and a pre-determined signature (i.e. a second set of descriptors) in the malware signature repository is performed.

In an implementation form of the method according to the first aspect, wherein the step of determining based on the comparison result, if the file is malicious comprises: determining, based on the calculated distance, if the file is malicious. In this implementation, the calculated distance may be compared with a pre-determined threshold to determine whether the file is malicious. This provides an easy way to determine if the file is malicious.

In an implementation form of the method according to the first aspect, wherein the distance between the first set of descriptors and the second set of descriptors is a result of an inverse correlation function of the detected correspondences between the first set of descriptors and the second set of descriptors.

In this implementation, the distance between the first set of descriptors and the second set of descriptors is used to specify the correspondence between the signature of the image and a signature in the malware signature repository. The higher the correspondence between the signature of the image and a signature in the malware signature repository, the smaller is the distance between the descriptors of the signature of the image and the descriptors of the signature in the malware signature repository.

According to a second aspect of the disclosure, the above mentioned and other objectives are achieved with a method for generating a malware signature repository. The method for generating a malware signature repository comprises: loading at least two malware sample files; generating an image from each of the at least two malware sample files; generating a signature for each image, wherein the signature of each image indicates local feature descriptors of the image; generating at least one cluster of signatures based on the signatures of images corresponding to the at least two malware sample files; and selecting at least one signature from each cluster of the at least one cluster to generate the malware signature repository.

An advantage of the method according to the second aspect is that local feature descriptors are used as signature of the image to describe characteristics of malware sample file corresponding to the image, and the signatures are clustered, and at least one signature from each cluster is selected to form a malware signature repository. The local feature descriptors are more robust in describing the same feature in the image independent of scaling, shifting and noise and with high accuracy and repeatability. At least one representative signature from each cluster is chosen to generate the malware signature repository. Therefore, the quantity of signatures for the malware sample files is reduced, and the storage space for the malware signature repository is thus saved.

In an implementation form of the method according to the second aspect, generating at least one cluster of signatures based on the signatures of images corresponding to the at least two malware sample files comprises: determining a distance matrix, wherein the distance matrix comprises at least one distance element, and each distance element is a distance between a pair of signatures of two images corresponding to two malware sample files; and generating at least one cluster of signatures based on the distance matrix according to a clustering algorithm.

In this implementation, the distance matrix is obtained by a feature matching algorithm, for example, Fast Library for Approximate Nearest Neighbours (FLANN) algorithm. The distance matrix comprises at least one distance element and the distance element specifies a distance between a pair of signatures of two images corresponding to two malware sample files.

In an implementation form of the method according to the second aspect, wherein the distance between the pair of signatures is a result of an inverse correlation function of detected correspondences between the pair of descriptors.

In an implementation form of the method according to the second aspect, local feature descriptors of the image comprise a plurality of key point descriptors.

According to a third aspect of the disclosure, the above mentioned and other objectives are achieved with an apparatus comprising processing circuitry for carrying out the method according to any of the first aspect or the second aspect.

The disclosure also relates to a computer program, characterized in program code, which, when run by at least one processor causes said at least one processor to execute any method according to the first aspect or the second aspect of the disclosure.

The disclosure also relates to a computer readable storage medium comprising computer program code instructions, being executable by a computer, for performing a method according to any of the first aspect or the second aspect when the computer program code instructions run on a computer.

Further, the disclosure also relates to a computer program product comprising a computer readable medium and said mentioned computer program, wherein said computer program is included in the computer readable medium, and comprises one or more from the group: ROM (Read-Only Memory), PROM (Programmable ROM), EAROM (Electrically alterable ROM), EPROM (Erasable PROM), Flash memory, EEPROM (Electrically EPROM), hard disk drive and 3D XPoint.

Further applications and advantages of the embodiments of the disclosure will be apparent from the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

The appended drawings are intended to clarify and explain different embodiments of the disclosure, in which:

Fig. 1 illustrates schematically a computer system according to an embodiment of the disclosure.

Fig. 2 shows a flowchart of a method of generating a malware signature repository according to an embodiment of the disclosure.

Fig. 3 shows a flowchart of a method of malware detection according to an embodiment of the disclosure.

Fig. 4 shows an implementation of the method for malware detection according to an embodiment of the disclosure.

Fig. 5 shows an implementation of the method for malware detection according to an embodiment of the disclosure.

DETAILED DESCRIPTION

Illustrative embodiments of methods, apparatuses, and program products for malware detection are described with reference to the accompanying figures. Although this description provides a detailed example of possible implementations, it should be noted that the details are intended to be exemplary and in no way limit the scope of the application.

Moreover, an embodiment/example may refer to other embodiments/examples. For example, any description including but not limited to terminology, element, process, explanation and/or technical advantage mentioned in one embodiment/example is applicable to the other embodiments/examples.

Fig. 1 shows schematically a computer system 1000 according to an embodiment of the disclosure. In this embodiment, the computer system 1000 comprises an endpoint protection client terminal 1100, an endpoint protection server 1200, and a network 1300 implemented by, e.g. the internet, a Local Area Network (LAN) or a wireless LAN (WLAN), which connects the endpoint protection client terminal 1100 and the endpoint protection server 1200. The network 1300 between the endpoint protection client terminal 1100 and the endpoint protection server 1200 may be a wired network or a wireless network or a combination of a wired network and a wireless network.

In Fig. 1, only one endpoint protection client terminal 1100 is shown as an illustrative example. It is known to the person skilled in the art that there may be a plurality of client terminals during the implementation.

The endpoint protection client terminal 1100 comprises a processor 1110, a memory 1120, and/or a transmitter/receiver (transceiver) 1130. The memory 1120 comprises a storage unit 1123 and in addition, an endpoint protection program that is stored in the memory 1120 and executed by the processor 1110. The storage unit 1123 is configured to store binary content of the suspicious file. The endpoint protection program comprises two software components: an image generating unit 1121 and a signature generating unit 1122. The image generating unit 1121 is configured to receive the binary content of an input file, convert the binary content of the input file into an image, and transmit the image to the signature generating unit 1122. The signature generating unit 1122 is configured to receive the image from the image generating unit 1121 and generate a signature of the image.

During the implementation of the embodiments of the invention, the image generating unit 1121 may be implemented with a malware detection agent in the client terminal 1100. The signature generating unit 1122 may be, e.g. implemented with a signature agent plug-in deployed in the endpoint protection client terminal 1100.

The signature of the image consists of local feature descriptors of the image. Just as an example, the local feature descriptors are descriptors of the key points of the image corresponding to the binary content of the file.

The endpoint protection server 1200 comprises a processor 1210, a memory 1220, a database 1230, and/or a transmitter/receiver (transceiver) 1240. The memory 1220 comprises a storage unit 1222 and, additionally, an endpoint protection program that is stored in the memory 1220 and executed by the processor 1210. The endpoint protection program comprises two software components: a learning unit 1221 and a comparing unit 1222. The learning unit 1221 is configured to generate a malware signature repository and load the generated or pre-defined malware signature repository in the database 1230. The malware signature repository comprises a plurality of malware signatures which represent different types of malware. The comparing unit 1222 is configured to compare the signature of the image to the at least one pre-determined signature in the malware signature repository, and determine whether the signature of the image matches a signature in the malware signature repository or of a malware mutation. The database 1230 comprises the malware signature repository including at least one pre-determined malware signature. Each malware signature corresponds to a malware or a mutation of a malware.

In Fig. 1, the endpoint protection client terminal 1100 and the endpoint protection server 1200 are shown as two separate apparatuses. However, it is known to the person skilled in the art that the endpoint protection client terminal 1100 and the endpoint protection server 1200 can be integrated in a single apparatus.

In another possible implementation, the database 1230 and the comparing unit 1222 are deployed in the endpoint protection client terminal 1100, and the learning unit 1221 is deployed in the endpoint protection server 1200 to provide an offline comparison, and speed up the performance of matching the signature of the image to a signature in the malware signature repository and/or of a malware mutation.

The endpoint protection client terminal 1100 may be denoted as a user device, a user equipment (UE), a mobile station, internet of things (IoT) device, a sensor device, a wireless terminal and/or a mobile terminal, a virtual machine (VM) or a container in a physical machine (PM). The UEs may further be referred to as mobile telephones, cellular telephones, computer tablets or laptops with wireless capability. The UEs in this context may be, for example, portable, pocket-storable, hand-held, computer comprised, or vehicle-mounted mobile devices, enabled to communicate voice and/or data, via the radio access network, with another entity, such as another receiver or a server. The UE can be a Station (STA), which is any device that contains an IEEE 802.11-conformant Media Access Control (MAC) and Physical Layer (PHY) interface to the Wireless Medium (WM). The UE may also be configured for communication in 3GPP related LTE (4G) and LTE-Advanced, in WiMAX and its evolution, and in fifth generation (5G) wireless technologies, such as New Radio.

The endpoint protection server 1200 herein may also be denoted as a server, a radio device, an access device, an access point, or a base station, e.g. a Radio Base Station (RBS), which in some networks may be referred to as transmitter, “gNB”, “gNodeB”, “eNB”, “eNodeB”, “NodeB” or “B node”, depending on the technology and terminology used. The radio devices may be of different classes such as e.g. macro eNodeB, home eNodeB or pico base station, based on transmission power and/or also cell size. The radio device can be a station (STA), which is any device that contains an IEEE 802.11-conformant Media Access Control (MAC) and Physical Layer (PHY) interface to the Wireless Medium (WM). The radio device may also be a base station corresponding to the fifth generation (5G) wireless systems.

Fig. 2 shows a flowchart of a method for generating a malware signature repository according to an embodiment of the disclosure. The method may be performed by the endpoint protection server 1200.

In step 210, at least two malware sample files are loaded. Just as an example, the at least two malware sample files are downloaded from some malware resources, for example, VirusTotal.com.

In step 220, for each of the at least two malware sample files, an image is generated from the binary content of the malware sample file. There are a plurality of ways of generating an image from the binary content of a file. Just as an example, the byte stream of the binary content of the file is read and each byte value (e.g. 0-255) is converted to a pixel of a corresponding colour level, e.g. grey level. The width of the image is set based on the size of the binary content of the file, and then the height of the image is filled in depending on the content of the binary.
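The byte-to-pixel conversion of step 220 can be sketched as follows. The size-to-width table is a hypothetical assumption, since the disclosure only states that the width is set based on the file size; zero padding of the last row is likewise an assumption.

```python
import math
from typing import List

# Hypothetical (upper size bound in bytes, image width) pairs; not from the disclosure.
WIDTH_BY_SIZE = [
    (10 * 1024, 32),
    (30 * 1024, 64),
    (60 * 1024, 128),
    (100 * 1024, 256),
    (200 * 1024, 384),
    (500 * 1024, 512),
    (1000 * 1024, 768),
]
DEFAULT_WIDTH = 1024


def choose_width(size_in_bytes: int) -> int:
    """Pick the image width from the size of the binary content."""
    for upper_bound, width in WIDTH_BY_SIZE:
        if size_in_bytes < upper_bound:
            return width
    return DEFAULT_WIDTH


def bytes_to_image(data: bytes, pad_value: int = 0) -> List[List[int]]:
    """Map each byte (0-255) to one grey-level pixel, filling the image row by row."""
    if not data:
        return []
    width = choose_width(len(data))
    height = math.ceil(len(data) / width)   # the height follows from the content length
    padded = data + bytes([pad_value]) * (width * height - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]
```

For instance, reading a file with `open(path, "rb").read()` and passing the result to `bytes_to_image` yields a list of rows, each holding grey levels between 0 and 255.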

In step 230, a signature is generated for each image. To improve the robustness of a same feature independent of scaling, shifting and noise, local feature descriptors are used to describe the signature of the image. Just as an example, key point descriptors are used in the embodiments of the disclosure. During the implementation, an Oriented “Features from accelerated segment test (FAST)” and rotated “Binary Robust Independent Elementary Features (BRIEF)” (ORB) feature detection algorithm is used to detect the key points as signature of the image. The Scale Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) feature detection algorithms can also be used as options for determining local feature descriptors.

In step 240, at least one cluster of signatures is generated based on the signatures of the images. During the implementation of step 240, for each image corresponding to a malware sample file, a signature is generated. A distance matrix is determined by using a Fast Library for Approximate Nearest Neighbours (FLANN) feature matching algorithm to compare the generated signatures of the images corresponding to the at least two malware sample files. The distance matrix comprises at least one distance element, and each distance element is a distance between a pair of signatures of two images corresponding to two malware sample files. After the distance matrix is determined, at least one cluster of signatures is generated based on the distance matrix according to a clustering algorithm. For example, based on the distance matrix, a spectral clustering algorithm may be used to segment the signatures of images into at least one cluster.
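Step 240 can be sketched as follows, under two stated assumptions. First, the pairwise distance is supplied by the caller as a function, standing in for the FLANN-based matching. Second, the spectral clustering named in the text is replaced here by a much simpler threshold-based grouping (connected components of the graph that links signatures whose distance does not exceed `max_distance`), which is a swapped-in technique chosen only to keep the sketch dependency-free.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")


def distance_matrix(signatures: Sequence[S],
                    distance: Callable[[S, S], float]) -> List[List[float]]:
    """Symmetric matrix whose element (i, j) is the distance between signatures i and j."""
    n = len(signatures)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(signatures[i], signatures[j])
    return matrix


def threshold_clusters(matrix: List[List[float]], max_distance: float) -> List[int]:
    """Assign a cluster label to each signature via connected components."""
    n = len(matrix)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue                      # already assigned to a cluster
        labels[start] = next_label
        stack = [start]
        while stack:                      # depth-first walk over close neighbours
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] <= max_distance:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels
```

With a library such as scikit-learn available, the distance matrix could instead be converted to an affinity matrix and passed to a spectral clustering implementation, which is closer to what the embodiment describes.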

In step 250, at least one signature from each cluster is selected to generate the malware signature repository. During the implementation of step 250, one or more signatures are chosen from each cluster of signatures to form the malware signature repository.

According to the embodiment shown in Fig. 2, a malware signature repository can be generated. By clustering the signatures of images corresponding to the loaded malware sample files, a plurality of signatures which have some same or similar characteristics are assigned to one cluster, and the plurality of signatures in one cluster may be mutations of one kind of malware.

By selecting at least one signature from each cluster, the malware signature repository can be formed, the quantity of signatures in the malware signature repository can be reduced and the storage space for the malware signature repository is thus saved.
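The disclosure does not state how the representative signature of a cluster is chosen. One possible choice, shown here purely as an assumption, is the medoid: the cluster member with the smallest total distance to the other members of the same cluster.

```python
from typing import Dict, List


def select_representatives(matrix: List[List[float]],
                           labels: List[int]) -> Dict[int, int]:
    """Map each cluster label to the index of its medoid signature."""
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    representatives = {}
    for label, members in clusters.items():
        # The medoid minimises the summed distance to all members of its cluster.
        representatives[label] = min(
            members, key=lambda i: sum(matrix[i][j] for j in members)
        )
    return representatives
```

Only the signatures at the returned indices would be stored, so the repository holds one entry per cluster instead of one entry per malware sample file.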

Fig. 3 shows a flowchart of a method of malware detection according to an embodiment of the disclosure. The method may be performed by the endpoint protection client terminal 1100 and the endpoint protection server 1200 individually or collaboratively.

In step 310, when receiving a suspicious file, the endpoint protection client terminal 1100 generates an image from the binary content of the received suspicious file. During the implementation, the endpoint protection client terminal 1100 converts the binary content of the file into an image. For example, the endpoint protection client terminal 1100 reads the byte stream of the binary content of the file and converts each byte value to a pixel of a corresponding grey level. The width of the image is defined based on the size of the binary. Then the height is filled in depending on the content of the binary.

In step 320, the endpoint protection client terminal 1100 generates a signature of the image. The signature of the image indicates local feature descriptors of the image.

During the implementation of step 320, various algorithms can be used to detect the local feature descriptors, for example, a Scale Invariant Feature Transform (SIFT) feature detection algorithm, a Speeded-Up Robust Features (SURF) feature detection algorithm, or an Oriented “Features from accelerated segment test (FAST)” and rotated “Binary Robust Independent Elementary Features (BRIEF)” (ORB) feature detection algorithm. Just as an example, the signature of the image comprises a set of key point descriptors of the image. Each descriptor is a 32-byte vector that describes a key point in the image.
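To show how a 32-byte binary descriptor can arise, the following is a simplified BRIEF-style sketch: 256 pairs of pixels around the key point are compared, and each comparison contributes one bit (256 bits = 32 bytes). The patch radius, the seeded random sampling pattern and the edge clamping are assumptions for illustration; ORB additionally smooths the patch, rotates the pattern by the key point orientation and uses a learned pattern.

```python
import random
from typing import List, Tuple

PATCH_RADIUS = 8     # hypothetical half-size of the patch around a key point
NUM_BITS = 256       # 256 binary tests give a 32-byte descriptor


def make_sampling_pattern(seed: int = 0) -> List[Tuple[int, int, int, int]]:
    """Fixed list of (dr1, dc1, dr2, dc2) offset pairs shared by all key points."""
    rng = random.Random(seed)
    return [
        (rng.randint(-PATCH_RADIUS, PATCH_RADIUS), rng.randint(-PATCH_RADIUS, PATCH_RADIUS),
         rng.randint(-PATCH_RADIUS, PATCH_RADIUS), rng.randint(-PATCH_RADIUS, PATCH_RADIUS))
        for _ in range(NUM_BITS)
    ]


PATTERN = make_sampling_pattern()


def pixel(image: List[List[int]], r: int, c: int) -> int:
    """Read a pixel, clamping coordinates to the image border."""
    r = min(max(r, 0), len(image) - 1)
    c = min(max(c, 0), len(image[0]) - 1)
    return image[r][c]


def brief_descriptor(image: List[List[int]], keypoint: Tuple[int, int]) -> bytes:
    """Build a 32-byte descriptor from pairwise intensity comparisons."""
    r, c = keypoint
    bits = 0
    for dr1, dc1, dr2, dc2 in PATTERN:
        bit = 1 if pixel(image, r + dr1, c + dc1) < pixel(image, r + dr2, c + dc2) else 0
        bits = (bits << 1) | bit
    return bits.to_bytes(NUM_BITS // 8, "big")
```

Because each bit depends only on which of two pixels is brighter, adding a constant to every pixel leaves the descriptor unchanged, which is one reason such descriptors tolerate small modifications of the underlying bytes.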

In step 330, the signature of the image is compared to the at least one pre-determined signature in a malware signature repository. During the implementation of step 330, the endpoint protection client terminal 1100 transmits the signature of the image to the endpoint protection server 1200. The endpoint protection server 1200 compares the signature of the image to at least one pre-determined signature in the malware signature repository (e.g. the database 1230). The comparison can be performed based on a calculated distance between the key point descriptors of the image and the key point descriptors of a malware sample in the malware signature repository. For example, by using a Fast Library for Approximate Nearest Neighbours (FLANN) algorithm, two sets of descriptors are compared and the matches for each descriptor are determined. The distance between the key point descriptors of the image and the key point descriptors of a malware sample in the malware signature repository is a result of an inverse correlation function of detected correspondences between the first set of descriptors and the second set of descriptors. Just as an example, if there are N matched descriptors (N > 1, and N is an integer) for two signatures, the distance between the two signatures may be calculated as 1/N. That is, the more matched descriptors there are, the smaller the distance between the two signatures.
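The 1/N distance described above can be illustrated with a short sketch. It is not the FLANN algorithm: an exhaustive nearest-neighbour search over Hamming distances is swapped in for simplicity, which is a common choice for binary descriptors but an assumption here. The names `hamming`, `count_matches`, `signature_distance` and the match cut-off `max_bits` are likewise hypothetical, as is returning infinity when no descriptor matches.

```python
from typing import List


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length descriptors."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def count_matches(first: List[bytes], second: List[bytes], max_bits: int = 40) -> int:
    """Count descriptors in `first` whose nearest neighbour in `second`
    differs by at most `max_bits` bits (assumed match criterion)."""
    if not second:
        return 0
    return sum(1 for d in first if min(hamming(d, s) for s in second) <= max_bits)


def signature_distance(first: List[bytes], second: List[bytes], max_bits: int = 40) -> float:
    """Distance 1/N for N matched descriptors; infinity if nothing matches."""
    n = count_matches(first, second, max_bits)
    return 1.0 / n if n > 0 else float("inf")
```

In this sketch, two signatures that share two matching 32-byte descriptors would have a distance of 0.5, and more matches would bring the distance closer to zero.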

In step 340 it is determined, based on the comparison result, if the file is malicious.

Step 340 may be performed by the endpoint protection client terminal 1100 or the endpoint protection server 1200. For example, if the comparison is based on a calculated distance, the calculated distance may be compared with a pre-determined threshold value to determine if the file is malicious.

An advantage of the embodiment of the method shown in Fig. 3 is that local feature descriptors are used as signatures to detect malware in a file. Local feature descriptors detect the same feature in the image robustly, independent of scaling, shifting and noise, and with high accuracy and repeatability. Therefore, the robustness of the method for malware detection is improved.

Fig. 4 shows an implementation of the method for malware detection according to an embodiment of the disclosure. The method may be performed by the endpoint protection client terminal 1100 and the endpoint protection server 1200 individually or collaboratively. The method may also be performed by a single device comprising the endpoint protection client terminal 1100 and the endpoint protection server 1200.

In step 410 an image is generated from a file.

During the implementation of step 410, the binary content of a suspicious file is converted into an image.

In step 420, at least one filter is applied on the generated image. It is also possible to apply no filter. During the implementation of step 420, the at least one filter is applied on the generated image to reduce noise in the image. For example, Gabor filters may be applied to the image, and a superposition of the filter results is calculated to reduce noise in the image.
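A minimal sketch of such a filter bank is given below, assuming that the superposition is a pixel-wise sum of the responses of Gabor kernels at several orientations. The kernel size, sigma, wavelength, aspect ratio `gamma`, number of orientations and the zero padding at the image border are all assumed values that are not taken from the disclosure.

```python
import math
from typing import List

Image = List[List[float]]


def gabor_kernel(size: int, sigma: float, theta: float, wavelength: float,
                 gamma: float = 0.5) -> Image:
    """Real-valued Gabor kernel (odd `size`), oriented at angle `theta`."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def filter_image(image: Image, kernel: Image) -> Image:
    """Cross-correlate image and kernel, treating pixels outside the image as zero."""
    h, w, half = len(image), len(image[0]), len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for kr, krow in enumerate(kernel):
                for kc, kval in enumerate(krow):
                    rr, cc = r + kr - half, c + kc - half
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kval * image[rr][cc]
            out[r][c] = acc
    return out


def gabor_superposition(image: Image, orientations: int = 4, size: int = 5,
                        sigma: float = 2.0, wavelength: float = 4.0) -> Image:
    """Sum the responses of Gabor filters at evenly spaced orientations."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for k in range(orientations):
        theta = k * math.pi / orientations
        response = filter_image(image, gabor_kernel(size, sigma, theta, wavelength))
        for r in range(h):
            for c in range(w):
                total[r][c] += response[r][c]
    return total
```

The output has the same dimensions as the input image, so it could be passed to the signature generation of step 430 in place of the unfiltered image.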

In step 430, a signature of the image is generated, and the signature of the image indicates local feature descriptors of the image.

During the implementation of step 430, the signature of the image is generated. For example, the signature of the image comprises local feature descriptors of the image, such as key point descriptors of the image. The key point descriptor may be a 32-byte vector that describes a location, scale, or orientation of a key point in the image. It is also possible that the key point descriptor is a 16-, 64- or 128-byte vector.

In step 440 a correspondence is calculated between the signature of the image and each of the at least one pre-determined signature in the malware signature repository.

During the implementation of step 440, a correspondence between the signature of the image and each of the at least one pre-determined signature in the malware signature repository is calculated. Just as an example, this correspondence may be determined by calculating a distance between two signatures. This correspondence can be realized by comparing two lists of descriptors using a Fast Library for Approximate Nearest Neighbours (FLANN) algorithm to find the best match in the malware signature repository.

In step 450, the calculated correspondence is compared to a pre-defined threshold.

During the implementation of step 450, the calculated correspondence may be a calculated distance between two signatures. For example, this calculated distance may be a result of an inverse correlation function of detected correspondences between the key point descriptors in the image and the key point descriptors in the malware signature repository. The calculated distance may be inversely proportional to the number of matches between the key point descriptors of the image and the key point descriptors of a malware sample in the malware signature repository. For example, if there are 400 detected matches between the key point descriptors in the image and a set of key point descriptors in the malware signature repository, the calculated distance is 1/400 = 0.0025. During the implementation, a pre-defined threshold may be set. The threshold may be in the range of 0 to 0.1, preferably in the range of 0 to 0.05, more preferably in the range of 0 to 0.01.

In step 460, it is determined, based on the comparison result, if the file is malicious. During the implementation of step 460, if the calculated distance (for example, 0.0025) is smaller than the pre-defined threshold (for example, 0.01), the file corresponding to the image may be determined to be a malicious file.
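Steps 440 to 460 can be summarised in a short sketch that takes the number of matched descriptors per repository entry, finds the best match and applies the threshold. The function name `classify`, the dictionary of match counts and the sample names are hypothetical; the default threshold of 0.01 is the example value given above.

```python
from typing import Dict, Optional, Tuple


def classify(match_counts: Dict[str, int],
             threshold: float = 0.01) -> Tuple[bool, Optional[str], float]:
    """Return (is_malicious, best matching sample, smallest distance).

    `match_counts` maps a repository sample name to the number of
    matched key point descriptors; the distance is 1/N for N matches.
    """
    best_name: Optional[str] = None
    best_distance = float("inf")
    for name, n in match_counts.items():
        # Assumption: no matches means an infinite distance.
        distance = 1.0 / n if n > 0 else float("inf")
        if distance < best_distance:
            best_name, best_distance = name, distance
    # The file is flagged only if the distance is smaller than the threshold.
    return best_distance < threshold, best_name, best_distance
```

With 400 matches against one sample the distance is 0.0025, which is below 0.01, so the file would be flagged; with only 20 matches the distance is 0.05 and the file would not be flagged.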

Fig. 5 shows an implementation of the method for malware detection and training according to an embodiment of the disclosure.

Fig. 5 shows schematically a malware sample file repository 510, an endpoint protection server 520, a client terminal (e.g. a virtual machine, a container, or a mobile terminal) 530 and a computer emergency response team (CERT) 540.

The malware sample file repository 510 includes a plurality of malware sample files, and the malware sample files are used as training samples. The endpoint protection server 520 comprises at least a signature learning unit 521, a signature inference unit 522 and a malware signature repository 523. The client terminal 530 comprises at least a signature agent plug-in 531 and an endpoint protection agent 532. The signature agent plug-in 531 further comprises a signature generator 5311.

A plurality of sample files are downloaded from the malware sample file repository 510 into the signature learning unit 521 to perform the training process. Thereby, malware signatures corresponding to the malware sample files are generated to form the malware signature repository 523. The endpoint protection agent 532 detects a suspicious file and sends the suspicious file to the signature generator 5311. In the signature generator 5311, an image is generated from the suspicious file and then a signature of the image is generated. The signature agent plug-in 531 sends the signature of the image to the signature inference unit 522 in the endpoint protection server 520. The signature inference unit 522 compares the signature of the image with the pre-determined signatures from the malware signature repository 523. The signature inference unit 522 feeds back the comparison result to the signature agent plug-in 531. If the comparison result indicates a malicious file, the endpoint protection agent 532 communicates with the computer emergency response team (CERT) 540 to determine whether the suspicious file is malware. If the CERT 540 determines that the suspicious file is malware, the suspicious file is added to the signature learning unit 521 for the training of the malware signature repository 523.

Furthermore, any method according to embodiments of the disclosure may be implemented in a computer program, having code means, which when run by processing means causes the processing means to execute the steps of the method. The computer program is included in a computer readable medium of a computer program product. The computer readable medium may comprise essentially any memory, such as a ROM (Read-Only Memory), a PROM (Programmable Read-Only Memory), an EAROM (Electrically Alterable ROM), an EPROM (Erasable PROM), a Flash memory, an EEPROM (Electrically Erasable PROM), a hard disk drive, or 3D XPoint, or the computer program could even have been streamed from any connection and temporarily stored in RAM. Moreover, the skilled person will appreciate that embodiments of the endpoint protection client terminal 1100 or the endpoint protection server 1200 comprise the necessary communication capabilities in the form of e.g., functions, means, units, elements, etc., for performing the solution. Examples of other such means, units, elements and functions are: processors, memory, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selecting units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiver units, transmitter units, Digital Signal Processors (DSPs), Trellis-Coded Modulation (TCM) encoders, TCM decoders, power supply units, power feeders, communication interfaces, communication protocols, etc., which are suitably arranged together for performing the solution. In particular, the processor(s) of the endpoint protection client terminal 1100 or the endpoint protection server 1200 may comprise, e.g., one or more instances of a Central Processing Unit (CPU), a processing unit, a processing circuit, a processor, an Application Specific Integrated Circuit (ASIC), a microprocessor, or other processing logic that may interpret and execute instructions.
The expression “processor” may thus represent processing circuitry comprising a plurality of processing circuits, such as, e.g., any, some or all of the ones mentioned above. The processing circuitry may further perform data processing functions for inputting, outputting, and processing of data, comprising data buffering and device control functions, such as call processing control, user interface control, or the like.

Finally, it should be understood that the disclosure is not limited to the embodiments described above, but also relates to and incorporates all embodiments within the scope of the appended independent claims.

Claims

1. A method for malware detection, comprising: generating an image from a file; generating a signature of the image, wherein the signature of the image indicates local feature descriptors of the image; comparing the signature of the image to at least one pre-determined signature in a malware signature repository; and determining, based on the comparison result, if the file is malicious.

2. The method according to claim 1, wherein the method further comprises: applying at least one filter on the generated image to reduce noise before generating the signature of the image.

3. The method according to any of preceding claims, wherein the step of comparing the signature of the image to at least one pre-determined signature in the malware signature repository comprises: calculating a correspondence between the signature of the image and each of the at least one pre-determined signature in the malware signature repository.

4. The method according to claim 3, wherein before determining if the file is malicious, the method further comprises: comparing the calculated correspondence to a pre-defined threshold.

5. The method according to any of preceding claims, wherein the step of generating a signature of the image comprises: detecting a first set of key points in the image; generating a first set of descriptors based on the first set of key points, wherein each descriptor corresponds to a key point in the first set of the key points; and setting the first set of descriptors as the signature of the image.

6. The method according to claim 5, wherein each of the pre-determined signatures in the malware signature repository comprises a second set of descriptors, and each descriptor in the second set of descriptors corresponds to a key point.

7. The method according to claim 6, wherein the step of comparing the signature of the image to at least one pre-determined signature in the malware signature repository comprises: for each second set of descriptors, detecting correspondences between the first set of descriptors and the second set of descriptors; and calculating a distance between the first set of descriptors and the second set of descriptors based on the detected correspondences.

8. The method according to claim 7, wherein the step of determining based on the comparison result, if the file is malicious comprises: determining, based on the calculated distance, if the file is malicious.

9. The method according to claim 7 or 8, wherein the distance between the first set of descriptors and the second set of descriptors is a result of an inverse correlation function of detected correspondences between the first set of descriptors and the second set of descriptors.

10. A method for generating a malware signature repository, comprising: loading at least two malware sample files; generating an image from each of the at least two malware sample files; generating a signature for each image, wherein the signature of each image indicates local feature descriptors of the image; generating at least one cluster of signatures based on the signatures of images corresponding to the at least two malware sample files; and selecting at least one signature from each cluster of the at least one cluster to generate the malware signature repository.

11. The method according to claim 10, wherein the generating at least one cluster of signatures based on the signatures of images corresponding to the at least two malware sample files comprises: determining a distance matrix, wherein the distance matrix comprises at least one distance element, and each distance element is a distance between a pair of signatures of two images corresponding to two malware sample files; and generating at least one cluster of signatures based on the distance matrix according to a clustering algorithm.

12. The method according to claim 11, wherein the distance between the pair of signatures is a result of an inverse correlation function of detected correspondences between the pair of signatures.

13. The method according to any of preceding claims, wherein local feature descriptors of the image comprise a plurality of key point descriptors.

14. An apparatus comprising processing circuitry for carrying out the method according to any of claims 1 to 9 or 10 to 13.

15. A computer program with a program code for performing a method according to any of claims 1 to 9 or 10 to 13 when the computer program runs on a computer.

16. A computer readable storage medium comprising computer program code instructions, being executable by a computer, for performing a method according to any of claims 1 to 9 or 10 to 13 when the computer program code instructions run on a computer.