EP4004827A1 - A computer-implemented method, a system and a computer program for identifying a malicious file - Google Patents

A computer-implemented method, a system and a computer program for identifying a malicious file

Info

Publication number
EP4004827A1
EP4004827A1 (Application EP20744065.2A)
Authority
EP
European Patent Office
Prior art keywords
features
file
malicious file
computer
fuzzy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20744065.2A
Other languages
German (de)
French (fr)
Inventor
Daniel SOLÍS AGEA
Gerard CERVELLÓ GARCÍA
Àngel PUIGVENTÓS GRÀCIA
Daniel GIBERT LLAURADÓ
Jordi PLANES CID
Maria Teresa ALSINET BERNADÓ
Carlos MATEU PIÑOL
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Leap In Value SL
Universitat de Lleida
Original Assignee
Leap In Value SL
Universitat de Lleida
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Leap In Value SL, Universitat de Lleida
Publication of EP4004827A1
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/043 Architecture, e.g. interconnection topology based on fuzzy logic, fuzzy membership or fuzzy inference, e.g. adaptive neuro-fuzzy inference systems [ANFIS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/048 Fuzzy inferencing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present invention relates to a computer-implemented method, system and computer program for identifying a malicious file that combines different types of analysis, processes and procedures that allow detecting and classifying malicious files.
  • a method for identifying malware file using multiple classifiers is known by US patent application, US2010192222A1.
  • such method uses multiple classifiers including static and dynamic classifiers, and thus is unable to identify malware based only on static analysis.
  • the patent EP2882159 discloses a computer implemented method of profiling cyber threats detected in a target environment, that comprises receiving, from a Security Information and Event Manager (SIEM) monitoring the target environment, alerts triggered by a detected potential cyber threat, and, for each alert: retrieving captured packet data related to the alert; extracting data pertaining to a set of attributes from captured packet data triggering the alert; applying fuzzy logic to data pertaining to one or more of the attributes to determine values for one or more output variables indicative of a level of an aspect of risk attributable to the cyber threat.
  • the present invention relates, in accordance with a first aspect, to a computer-implemented method for identifying a malicious file.
  • the method comprises performing a further static machine learning classification process, using as inputs several or all of the above mentioned sets of features, to obtain a corresponding further preliminary classification output; and performing said fuzzy inference procedure based on possibilistic logic using as an input variable also said further preliminary classification output.
  • the above mentioned fuzzy inference procedure comprises a fuzzification process that converts the input variables into fuzzy variables.
  • the fuzzification process comprises deriving membership functions relating the input variables with output variables through membership degrees of values of the input variables in predefined fuzzy sets, and representing said membership functions with linguistic variables, said linguistic variables being said fuzzy variables.
  • the fuzzy inference procedure further comprises an inference decision-making process comprising firing fuzzy possibilistic rules with values of said linguistic variables for said input variables, to generate a fuzzy output that identifies the degree of belief that the potentially malicious file has to be a malicious file or a benign file.
  • the method of the first aspect of the present invention further comprises selecting which fuzzy possibilistic rules to fire in said inference decision-making process, based on at least said values of the linguistic variables for the input variables.
  • the fuzzy inference procedure based on possibilistic logic further comprises a defuzzification process that converts the above mentioned fuzzy possibilistic output into a crisp output, wherein said crisp output constitutes the above mentioned enhanced classification output.
  • the above mentioned set or sets of features may comprise:
  • API (Application Programming Interface) function calls; the representation of an executable file as a stream of entropy values, where each value describes the amount of entropy over a small chunk of code in a specific location of the potentially malicious file; the sequence of assembly language instructions executed by a software program constituting the potentially malicious file, in particular, the operational codes of the machine language instructions;
  • the fuzzy inference procedure based on possibilistic logic is based on a PGL+ algorithm.
  • the proof method for PGL+ is complete and involves a semantical unification model of disjunctive fuzzy constants and three other inference patterns together with a deductive mechanism based on a modus ponens style.
  • the PGL+ algorithm can comprise applying three algorithms sequentially: a first algorithm that extends the fuzzy possibilistic rules by means of implementing a first set of rules; a second algorithm that translates the fuzzy possibilistic rules into a semantically equivalent set of 1-weighted clauses by means of implementing a second set of rules; and a third algorithm that computes a maximum degree of possibilistic entailment of a goal from the equivalent set of 1-weighted clauses.
  • the fuzzy inference procedure based on possibilistic logic operates on formulas of the form (A, c), where A is a Horn clause (fact or rule) with disjunctive fuzzy constants and c is a degree in the unit interval [0, 1] which denotes a lower bound on the belief on A in terms of necessity measures. Every fact and rule is attached with a degree of belief or weight in the real interval [0, 1] that denotes a lower bound on the belief on the fact and rule in terms of necessity measures. So, facts and rules that are demonstrated to be key for the decision system have a higher weight, and facts and rules not so useful in the decision system have a lower weight.
  • the rules created by the system can have a higher degree of belief than the rules created by a human, or vice versa.
  • the system may create rules of the following form:
  • the facts can have different degrees of belief depending on the source of the information.
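  • Purely as a non-limiting illustration of how such weighted formulas (A, c) and their degrees of belief could be held in software (all names below are hypothetical and not part of the invention), a minimal Python sketch is:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WeightedClause:
    """A Horn clause (fact or rule) with a degree of belief c in [0, 1]."""
    head: str                                       # e.g. "encrypted(file)"
    body: List[str] = field(default_factory=list)   # empty body means a fact
    belief: float = 1.0                             # lower bound in terms of necessity

# Hypothetical knowledge base mixing a fact and a rule with different weights.
kb = [
    WeightedClause(head="entropy(file, around_20)", belief=0.9),                        # fact
    WeightedClause(head="encrypted(file)", body=["entropy(file, high)"], belief=0.7),   # rule
]

for clause in kb:
    kind = "fact" if not clause.body else "rule"
    print(kind, clause.head, clause.body, clause.belief)
```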
  • file management API functions e.g. CopyFile, CreateFile, EncryptFile, etc.
  • networking APIs, e.g. HttpCreateServerSession, DnsAcquireContextHandle, RpcStringBindingCompose, etc.
  • the machine learning models can be enhanced by further using Reinforcement Learning methods.
  • Reinforcement Learning (RL) is a set of techniques that allows solving problems in highly uncertain or almost unknown domains.
  • the method can use machine learning to select the most relevant features, using RL-guided methods to derive the future reward (i.e. accuracy) of using such a feature.
  • the machine learning technique will be able to use a Q-Table (rewards table) of the RL method to accurately predict which feature and split set to use for prediction, thus creating a quasi-optimal decision tree (DT) from which to derive the rules for the system. This last module makes the system keep learning from new threats, a key aspect when it comes to cybersecurity.
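  • As a purely illustrative, non-limiting sketch of this idea (the feature names, the reward function and the epsilon-greedy schedule below are hypothetical, not the claimed training procedure), a Q-Table indexed by candidate features can be updated with the observed accuracy as reward and queried to pick the next split feature:

```python
import random

# Hypothetical Q-Table: expected reward (validation accuracy) per candidate feature.
q_table = {"entropy_mean": 0.0, "num_api_calls": 0.0, "header_size": 0.0}
alpha = 0.1                                   # learning rate

def observed_accuracy(feature):
    """Placeholder for the accuracy measured after splitting on `feature`."""
    return random.uniform(0.5, 1.0)

for episode in range(100):
    if random.random() < 0.2:                 # explore
        feature = random.choice(list(q_table))
    else:                                     # exploit the current best estimate
        feature = max(q_table, key=q_table.get)
    reward = observed_accuracy(feature)       # future reward = accuracy
    q_table[feature] += alpha * (reward - q_table[feature])

print("feature selected for the next split:", max(q_table, key=q_table.get))
```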
  • the present invention also relates to a system for identifying a malicious file, the system comprising one or more computing entities adapted to perform the steps of the method of the first aspect of the invention for all its embodiments, said one or more computing entities including at least the following modules operatively connected to each other:
  • a pre-processing computing module configured and arranged to perform a static analysis of a potentially malicious file to obtain a set of features that provide an abstract view of the malicious file;
  • a machine learning module configured and arranged to perform a static machine learning classification process using as inputs said set of features, to obtain a preliminary classification output;
  • a fuzzy inference module configured and arranged to perform a fuzzy inference procedure based on possibilistic logic using as input variables said set of features and said preliminary classification output, to generate an enhanced possibilistic classification output that identifies the potentially malicious file as a malicious file or a benign file.
  • a computer program product is one embodiment that has a computer-readable medium including computer program instructions encoded thereon that, when executed on at least one processor in a computer system, cause the processor to perform the operations indicated herein as embodiments of the invention.
  • the limitations mentioned above associated to the prior art methods are addressed by aggregating and combining multiple static features and the output of preferably multiple static classifiers to infer the maliciousness of a file based on a set of fuzzy rules. These rules might be inferred using the knowledge of cyber security experts or using any machine learning technique.
  • the user has access to all the decisions taken in order to decide if a file is malicious. Additionally, an expert user can create additional rules, or modify the ones created by the method, system or computer program of the present invention.
Brief Description of the Figures
  • Fig. 1 schematically shows the system of the second aspect of the invention, for an embodiment, depicting its main modules.
  • Fig. 2 is an Entropy versus Chunk diagram showing an example of a static analysis of the method of the first aspect of the invention to provide a set of features of an abstract view of an executable file in the form of a stream of entropy values of a structural entropy, computed using Shannon's formula, of the executable file, according to an embodiment, by means of the pre-processing computing module of the system of the second aspect of the invention.
  • Fig. 3 shows gray scale images constituting sets of features obtained by respective static analyses of the method of the first aspect of the invention, representing abstract views of different malware files (Rammnit, Lollipop, Kelihos_ver3), according to corresponding embodiments, by means of the pre-processing computing module of the system of the second aspect of the invention.
  • Fig. 4 schematically shows an overview of a preprocessing module of the system of the second aspect of the present invention, decomposed into five components for performing five corresponding static analyses of an executable file, including those associated with the embodiments of Figures 2 and 3.
  • Fig. 5 schematically shows the system of the second aspect of the invention, for an embodiment for which the machine learning module includes one submodule, or static classifier, per each set of features provided by a respective static analyser of the pre-processing module.
  • Fig. 6 schematically shows the system of the second aspect of the invention, for an embodiment for which the machine learning module includes only one static classifier that takes as inputs all the sets of features provided by all the static analysers of the pre-processing module.
  • Fig. 7 schematically shows the system of the second aspect of the invention, for an embodiment that differs from that of Figure 5 in that the machine learning module comprises, in addition, a further submodule that takes as inputs all the sets of features provided by all the static analyzers of the pre-processing module.
  • Fig. 8 schematically shows the system of the second aspect of the present invention, for an embodiment, including the preprocessing module, the machine learning module, and a fuzzy inference module decomposed in several functional blocks.
  • Fig. 9 is a diagram that shows the membership function of some fuzzy subsets of sets of features obtained with a static analyzer, particularly of entropy values, for an embodiment of the fuzzification process performed according to the method of the first aspect of the invention, by means of the fuzzy inference module of the system of the second aspect of the invention.
  • Fig. 10 graphically shows the membership function of some fuzzy subsets associated to scores obtained from a machine learning process applied on the sets of features of Figure 9, for an embodiment of the fuzzification process performed according to the method of the first aspect of the invention, by means of the fuzzy inference module of the system of the second aspect of the invention.
  • Fig. 11 is a diagram that shows membership functions of scores obtained at the fuzzification process, as part of a defuzzification process to obtain crisp values, according to an embodiment of the method and system of the present invention.
Detailed Description of Preferred Embodiments
  • Fig. 1 shows an embodiment of the system of the second aspect of the present invention.
  • the proposed system includes three components: a preprocessing module 110; a machine learning module 120 and a fuzzy inference module 130.
  • the preprocessing module 110 is responsible for the extraction of features/characteristics 111 of a given software program 100 (also termed file or executable).
  • the machine learning module 120, which can be composed of one or more machine learning submodules 121, given one or more of said extracted features/characteristics 111, can output a score 123 (i.e. a preliminary classification output) indicating the maliciousness of the software program 100 with respect to the input features 111.
  • the fuzzy inference module 130 is responsible for performing inference upon fuzzy rules and given facts, i.e. characteristics of the software program 100 and the output scores 123 of the machine learning methods implemented by the machine learning submodules 121, to derive a reasonable output or conclusion 140 (i.e. an enhanced classification output), that is whether a file 100 is malicious or not. Notice that the invention might be applied to classifying malware into families without needing to make any significant modification.
  • the terms "given facts" refer herein to the facts, data and input information of the fuzzy inference module 130. These data are the features extracted by the pre-processing 110 and machine learning 120 modules.
  • the preprocessing 110 and machine learning 120 modules are independent modules or are comprised in a common feature extraction module.
  • a file 100 is received at a client or server computer, and then a static type of analysis of the file 100, i.e. without executing the file, is initiated.
  • This static analysis is performed by the preprocessing module 110, which processes the file 100 and generates an abstract view thereof.
  • This abstract view might be represented by sets of features 111.
  • each set of features 111 is used as input to one or more static classifiers 122, each implemented in one of the cited machine learning submodules 121.
  • the output 123 of each machine learning classifier 122 is a value in the range [0, 1].
  • a value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to a specific group of features 111; otherwise, values close to 1 indicate maliciousness.
  • Any machine learning method can be used as classifier. For instance, neural networks, support vector machines or decision trees.
  • the fuzzy inference module 130 receives as input at least one or more features 111 extracted by the preprocessing module 110 and the output 123 of one or more static classifiers 122, and performs the inference procedure upon the rules and given facts to derive a reasonable output or conclusion 140, that is whether a file is malicious or not.
  • Preprocessing module description
  • The preprocessing module 110 is responsible for the feature extraction process. It analyses the software program 100 with static techniques (i.e. the program 100 is not executed). It extracts various characteristics from the program's 100 syntax and semantics.
  • the software program 100 can take varying formats including, but not limited to, Portable Executable (PE), Disk Operating System (DOS) executable files, New Executable (NE) files, Linear Executable (LE) files, Executable and Linkable Format (ELF) files, JAVA Archive (JAR) files, and SHOCKWAVE/FLASH (SWF) files.
  • the preprocessing module 110 extracts at least one (but not limited to one) of the following sets or subsets (groups) of features:
  • API (Application Programming Interface) function calls
  • API functions and system calls are related with services provided by operating systems. They support various key operations such as networking, security, system services, file management, and so on. In addition, they include various functions for utilizing system resources, such as memory, file system, network or graphics.
  • API function calls can provide key information to represent the behavior of the software 100.
  • every API function and system call has an associated feature.
  • the feature range is [0, 1]; 0 (or False) if the API function or system call hasn't been called by the program; 1 (or True) otherwise.
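  • A minimal, non-limiting sketch of such a binary feature vector (the list of monitored APIs and the helper name are chosen here only for illustration):

```python
# Hypothetical subset of monitored API functions and system calls.
MONITORED_APIS = ["CopyFile", "CreateFile", "EncryptFile",
                  "HttpCreateServerSession", "RpcStringBindingCompose"]

def api_feature_vector(called_apis):
    """Return 1 for every monitored API found in the program, 0 otherwise."""
    called = set(called_apis)
    return [1 if api in called else 0 for api in MONITORED_APIS]

# API calls extracted statically from a disassembled program (example values).
print(api_feature_vector(["CreateFile", "EncryptFile"]))   # [0, 1, 1, 0, 0]
```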
  • An executable file 100 is represented as a stream of entropy values, where each value describes the amount of entropy over a small chunk of code in a specific location of the file 100. For each chunk of code, the entropy is computed using Shannon's formula. There exists empirical evidence that the entropy time series from a given family are similar and distinct from those belonging to a different family. This is the result of reusing the code to create new malware variants. In consequence, the structural entropy of an executable 100 can be used to detect whether it is benign or malware and to classify it into its corresponding family.
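  • A minimal sketch of how such an entropy stream could be computed with Shannon's formula over fixed-size chunks (the chunk size of 256 bytes and the stand-in input data are illustrative assumptions):

```python
import math
from collections import Counter

def shannon_entropy(chunk: bytes) -> float:
    """Entropy in bits per byte (range [0, 8]) of one chunk of the file."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chunk).values())

def entropy_stream(data: bytes, chunk_size: int = 256):
    """Represent a file as a stream of entropy values, one value per chunk."""
    return [shannon_entropy(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

data = bytes(range(256)) * 16          # stand-in for the bytes of an executable
print(entropy_stream(data)[:4])        # [8.0, 8.0, 8.0, 8.0] for this uniform data
```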
  • FIG. 2 shows an example of the above mentioned computed entropy versus chunk, for an embodiment.
  • a software program 100 is disassembled (IDA Pro, Radare2, Capstone, etc.) and its sequence of assembly language instructions is extracted for further analysis.
  • the operational codes of the machine language instructions were extracted.
  • every byte has to be interpreted as one pixel in an image. Then, the resulting array has to be organized as a 2-D array and visualized as a gray scale image, as shown in Fig. 3.
  • the main benefit of visualizing a malicious executable 100 as an image is that the different sections of a binary can be easily differentiated.
  • malware authors typically change only a small part of the code to produce new variants. Thus, if old malware is re-used to create new binaries, the resulting ones will be very similar. Additionally, by representing malware as an image it is possible to detect the small changes while retaining the global structure of the samples.
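  • A minimal NumPy sketch of the byte-to-pixel transformation described above (the image width of 256 pixels and the stand-in data are illustrative choices):

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Interpret every byte as one pixel and reshape into a 2-D gray-scale image."""
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = len(pixels) // width                 # drop any trailing partial row
    return pixels[:height * width].reshape(height, width)

data = bytes(range(256)) * 64                     # stand-in for an executable's bytes
image = bytes_to_grayscale(data)
print(image.shape, image.dtype)                   # (64, 256) uint8
```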
  • This group of features comprises hand-crafted features defined by cyber security experts. For instance, the size in bytes and the entropy of the sections of the Portable Executable file, the frequency of use of the registers, the frequency of a set of keywords from an executable, the attributes of the headers of the Portable Executable, among others.
  • Fig. 4 presents an overview of the preprocessing module 110 decomposed into the five aforementioned components.
  • Machine learning module description
  • The use of machine learning algorithms to address the problem of malicious software detection and classification has increased during the last decade. Instead of directly dealing with raw malware, machine learning solutions first have to extract features that provide an abstract view of the software. Then the extracted features can be used to feed at least one machine learning method.
  • the system of the second aspect of the invention comprises and uses multiple machine learning submodules 121, each receiving as inputs the set of features provided by a respective static analyser of the preprocessing module 110.
  • the system receives a file 100 (such as an executable file) at a client or server computer.
  • the preprocessing module 110 is responsible for extracting sets of features 111 from the file 100, by means of the static analysers. These features 111 are used as input to the machine learning submodules 121.
  • the system has at least as many machine learning submodules 121 as groups of features.
  • the output of each machine learning submodule 121 is a value in the range [0, 1].
  • a value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to a specific group of features 111; otherwise, the value will be close to 1.
  • Any machine learning method can be used as static classifier. For instance, neural networks, support vector machines or decision trees.
  • a feed-forward neural network with at least three layers: (1) an input layer, (2) one fully-connected layer and (3) an output layer can be used.
  • the input layer has size equal to the length of the feature vector.
  • the output layer has only one neuron and outputs the probability of an executable of being malicious or not. Additionally, a dropout after every fully-connected layer can be added.
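  • A minimal PyTorch sketch of such a feed-forward classifier (the hidden size, dropout rate and number of input features are illustrative assumptions, not values prescribed by the invention):

```python
import torch
import torch.nn as nn

class FeedForwardClassifier(nn.Module):
    """(1) input layer -> (2) fully-connected layer with dropout -> (3) output neuron."""
    def __init__(self, num_features: int, hidden: int = 128, p_drop: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),            # probability that the file is malicious
        )

    def forward(self, x):
        return self.net(x)

model = FeedForwardClassifier(num_features=300)    # e.g. 300 monitored API calls
score = model(torch.rand(1, 300))                  # maliciousness score in [0, 1]
print(float(score))
```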
  • convolutional neural networks have achieved great success in image and time series related classification tasks.
  • Convolutional neural networks consist of a sequence of convolutional layers, the output of which is connected only to local regions in the input. This structure allows learning filters able to recognize specific patterns in the input data.
  • the convolutional network can be composed of 5 or more layers: (1) the input layer, (2) one convolutional layer, (3) one pooling layer, (4) one fully-connected layer and (5) the output layer.
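  • A minimal PyTorch sketch of such a five-layer convolutional classifier over a one-dimensional feature stream (kernel size, channel counts and stream length are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    """(1) input -> (2) convolutional layer -> (3) pooling -> (4) fully-connected -> (5) output."""
    def __init__(self, stream_length: int = 1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2),    # convolutional layer
            nn.ReLU(),
            nn.MaxPool1d(4),                               # pooling layer
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * (stream_length // 4), 64),      # fully-connected layer
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),                                  # output layer: score in [0, 1]
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallConvNet()
print(float(model(torch.rand(1, 1, 1024))))                # maliciousness score
```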
  • Static classifier embodiment 1 API function calls.
  • the behavior of an executable file can be modelled by their use of the API functions.
  • the executable file is disassembled to analyze and extract the API function calls it performs.
  • every API function and system call has an associated feature.
  • the feature range is [0,1]; 0 (or False) if the API function or system call hasn’t been called by the program; 1 (or True) otherwise.
  • only a subset of the available API function calls a program can execute is considered, because the number of API function calls a program can execute is huge and some of them are irrelevant to model the program's behavior. To select which are the most informative API function calls to record, any feature selection technique might be used.
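  • One possible (non-limiting) feature selection technique is a chi-squared filter over the binary API-call features; the sketch below assumes scikit-learn and random stand-in training data:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in training data: rows = programs, columns = binary API-call features.
X = np.random.randint(0, 2, size=(1000, 5000))    # 5000 candidate API calls
y = np.random.randint(0, 2, size=1000)            # 0 = benign, 1 = malicious

selector = SelectKBest(chi2, k=300)               # keep the 300 most informative calls
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape, selector.get_support(indices=True)[:10])
```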
  • a feed-forward network can be utilized to analyze the API functions invoked by a computer program.
  • the feed-forward network may have one or more hidden layers followed by an output layer, which generates a classification for the file (e.g. malicious or benign).
  • the classification of the file can be provided at an output of the feed-forward network.
  • Static classifier embodiment 2 Structural entropy.
  • an executable file can be represented as a stream of entropy values, where each value describes the amount of entropy over a small chunk of code in a specific location of the file. For each chunk of code, the entropy is computed using Shannon's formula.
  • a convolutional neural network can be utilized to analyze the stream of entropy values by applying a plurality of kernels to detect certain patterns in the variation between entropy values of adjacent chunks.
  • the convolutional network can detect malicious executables by providing a classification of the disassembled binary file (maliciousness score: [0,1]).
  • the convolutional neural network may include a convolutional layer, a pooling layer, a fully connected layer and an output layer.
  • the convolutional neural network can be configured to process streams variable in length. As such, one or more techniques can be applied to generate fixed length representations of the entropy values.
  • the first convolutional layer can be configured to process the stream of entropy values by applying a plurality of kernels K1,1, K1,2, ..., K1,x to the entropy values.
  • Each kernel applied to the first convolutional layer can be configured to detect changes between entropy values of adjacent chunks in a file. According to some implementations, each kernel applied to the first convolutional layer can be adapted to detect a specific sequence of entropy values, having w values.
  • Although the convolutional neural network has been indicated as comprising three convolutional layers, it should be appreciated that the convolutional neural network can include fewer or more convolutional layers.
  • the pooling layer can be configured to further process the output from a preceding convolutional layer by compressing (e.g. subsampling or down sampling) the output from the preceding convolution layer.
  • the pooling layer can compress the output by applying one or more pooling functions, including for example a maximum pooling function.
  • the output of the pooling layer can be further processed by the one or more fully connected layers and the output layer in order to generate a classification for the file (e.g. malicious or benign).
  • the classification of the file can be provided at an output of the convolutional neural network.
  • Static classifier embodiment 3 Assembly language instructions.
  • a binary file can be disassembled thereby forming a discernible sequence of instructions having one or more identifying features (e.g. instruction mnemonics).
  • a convolutional neural network (CNN) can be utilized to analyze the disassembled binary file by applying a plurality of kernels (filters) adapted to detect certain sequences of instructions in the disassembled file.
  • the convolutional network can detect malicious executables by providing a classification of the disassembled binary file (maliciousness score: [0,1 ]).
  • the convolutional neural network may include a convolutional layer, a pooling layer, a fully connected layer and an output layer.
  • the convolutional neural network can be configured to process a sequence of instructions that are variable in length.
  • one or more techniques can be applied to generate fixed length representations of the instructions.
  • the fixed-length representations of the instructions can be encoded in a way that allows the network to understand their meaning.
  • mnemonics are encoded using one-hot vector representations.
  • each one-hot vector is represented as a word embedding, that is a vector of real numbers.
  • This vector representation of the opcodes can be generated during the training phase of the convolutional network or using any other approach such as neural probabilistic language models, i.e. SkipGram model, Word2Vec model, Recurrent Neural Network models, etc.
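  • A minimal PyTorch sketch of this encoding step (the opcode vocabulary, embedding dimension and instruction sequence are illustrative): each mnemonic is mapped to an index, optionally one-hot encoded, and projected to a dense word embedding that the convolutional layers can consume.

```python
import torch
import torch.nn as nn

# Hypothetical opcode vocabulary extracted from the disassembly.
vocab = {"mov": 0, "cmp": 1, "jne": 2, "dec": 3, "jmp": 4, "push": 5}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

sequence = ["cmp", "jne", "dec", "mov", "jmp"]
indices = torch.tensor([vocab[m] for m in sequence])              # index encoding
one_hot = nn.functional.one_hot(indices, num_classes=len(vocab))  # one-hot vectors
dense = embedding(indices)                                        # learned word embeddings

print(one_hot.shape, dense.shape)     # torch.Size([5, 6]) torch.Size([5, 8])
```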
  • the first convolutional layer can be configured to process the encoded fixed mnemonics representations by applying a plurality of kernels K1,1, K1,2, ..., K1,x to the encoded fixed mnemonics representations.
  • Each kernel applied at the first convolutional layer can be configured to detect a specific sequence of instructions.
  • each kernel applied to the first convolutional layer can be adapted to detect a sequence having a number of instructions. That is, kernels K can be adapted to detect instances where a number of instructions appear in a certain order.
  • kernel K1 ,1 can be adapted to detect the instruction sequence [cmp, jne, dec] while kernel K1 ,2 can be adapted to detect the instruction set [dec, mov, jmp].
  • the size of each kernel corresponds to the window size of the first convolutional layer.
  • the convolutional layer may have kernels of different sizes. For instance, one kernel may be adapted to detect the instruction sequence [dec, mov, jmp] while another kernel may be adapted to detect the instruction sequence [dec, mov, jmp, push, sub].
  • Although the convolutional neural network is shown to include one convolutional layer, it should be appreciated that the convolutional neural network can include a different number of convolutional layers; for instance, it can include two or more convolutional layers.
  • the kernels K2,1, K2,2, ..., K2,x applied to the second convolutional layer can be adapted to detect specific sequences of two or more of the sequences of instructions detected at the first convolutional layer. Consequently, the second convolutional layer would generate increasingly abstract representations of the sequence of instructions from the disassembled binary file.
  • the pooling layer can be configured to further process the output from a preceding convolutional layer by compressing (e.g. subsampling or down sampling) the output from the preceding convolution layer.
  • the pooling layer can compress the output by applying one or more pooling functions, including for example a maximum pooling function.
  • the output of the pooling layer can be further processed by the one or more fully connected layers and the output layer in order to generate a classification for the disassembled binary file (e.g. malicious or benign).
  • the classification of the disassembled binary file can be provided at an output of the convolutional neural network.
  • Static classifier embodiment 4 Image-based representation of malware’s hexadecimal content.
  • a software program can be visualized as an image, where every byte is interpreted as one pixel in the image. Then, the resulting array is organized as a 2-D array and visualized as a gray scale image.
  • Approaches such as convolutional neural networks can yield classifiers that can learn to extract features that are at least as effective as human-engineered features.
  • a convolutional neural network implementation to extract features can advantageously make use of the connectivity structure between feature maps to extract local and invariant features from an image.
  • a convolutional neural network (CNN) can be utilized to analyze the file by applying a plurality of kernels (filters) adapted to detect certain local and invariant patterns in the pixels of the representation of the software program as a gray-scale image.
  • the convolutional network can detect malicious executables by providing a classification of the disassembled binary file (maliciousness score: [0,1]).
  • the convolutional neural network may include at least a convolutional layer, a pooling layer, a fully connected layer and an output layer. In some implementations, it may include more than one convolutional, pooling and fully connected layer. According to some implementations, each kernel applied to the first convolutional layer can be adapted to detect a pattern in the pixels of the image having w x h size, where w is the width and h is the height of the kernel. Subsequent convolutional layers detect increasingly abstract features.
  • the pooling layer can be configured to further process the output from a preceding convolutional layer by compressing (e.g. subsampling or down sampling) the output from the preceding convolution layer.
  • the pooling layer can compress the output by applying one or more pooling functions, including for example the maximum pooling function.
  • the output of the pooling layer can be further processed by the one or more fully connected layers and the output layer in order to generate a classification for the file (e.g. malicious or benign).
  • the classification of the file can be provided at an output of the convolutional neural network.
  • Static classifier embodiment 5 Miscellaneous features.
  • the so-called "miscellaneous" features include other applicable software characteristics. These characteristics at least include the keywords occurring in the software program and the fields of the header of a file in any format. Other types of features may also be used.
  • The next table illustrates the fields of the header of a file in Portable Executable format.
  • these fields are: MajorLinkerVersion, MinorLinkerVersion, SizeOfCode, SizeOfInitializedData, etc. Shown is relevant information that contains suitable characteristics to use as features. These characteristics are specific to information of a Portable Executable file header, but other file types will have other relevant header information and characteristics.
  • the preprocessing module 110 is responsible for extracting a set of informative features 111 from the file 100. These features 111 are then aggregated and fed as input to a common static classifier 122, which will determine whether the file 100 is malicious or not.
  • the input of the static classifier 122 is the features 111 from the distinct groups extracted by the preprocessing module 110.
  • the output 123 of the static classifier 122 is a value in the range [0, 1].
  • a value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to the features 111; otherwise, the value will be close to 1.
  • Any machine learning method can be used as classifier. For instance, neural networks, support vector machines or decision trees.
  • the preprocessing module 110 is responsible for extracting a set of informative features 111 from the file 100. These features 111 are used as input to static classifiers.
  • the system has as many static classifiers as sets of features and, in contrast to the embodiment of Fig. 5, a further static classifier that aggregates and uses the features of all groups as input.
  • the output 123 of each machine learning classifier 122 is a value in the range [0, 1].
  • a value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to a specific group of features 111; otherwise, the value will be close to 1.
  • Any machine learning method can be used as classifier. For instance, neural networks, support vector machines or decision trees.
  • the last component of the malware detection system is the fuzzy inference engine 130. Its aim is to decide, based on a set of fuzzy rules, whether an executable is malicious, given the output of the machine learning methods and the features extracted by the preprocessing module.
  • This component 130 performs the following steps:
  • the fuzzy inference module 130 can be decomposed into functional blocks, as depicted in Fig. 8, and described below in detail.
  • Fuzzification 131 involves two processes: deriving the membership functions for input and output variables, and representing them with linguistic variables. (Given two inputs, x1 and y1, determine the degree to which the input variables belong to each of the appropriate fuzzy sets.)
  • the input values are two-fold: a feature vector of program characteristics, named F, whose size equals the number of extracted features, where Fi ∈ F corresponds to the value of the i-th feature of the program 100; this feature vector is extracted by the preprocessing module 110; and a score vector, named S, containing the output scores 123 of the machine learning algorithms, whose size is equal to the number of distinct algorithms that have been applied to predict the maliciousness of the program based on distinct groups of features.
  • the entropy of a bytes sequence refers to the amount of disorder (uncertainty) or its statistical variation.
  • the entropy value ranges from 0 to 8. If occurrences of all values are the same, the entropy will be largest. On the contrary, if certain byte values occur with high probabilities, the entropy value will be smaller. According to studies, the entropy of plain text, native executables, packed executables and encrypted executables tend to differ greatly. In consequence, the [0,8] range can be further divided into at least six sub-ranges or subsets, which are:
  • a trapezoidal waveform is utilized for this type of membership function. For instance, 4.0 entropy will belong to "very low" to 0.6 degree and to "low" to 0.4 degree.
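  • A minimal sketch of a trapezoidal membership function and of the fuzzification of an entropy value; the break-points below are illustrative and are chosen only so that an entropy of 4.0 reproduces the degrees of the example above (0.6 for "very_low", 0.4 for "low"):

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on the plateau [b, c]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)    # rising edge
    return (d - x) / (d - c)        # falling edge

# Hypothetical fuzzy sets over the [0, 8] entropy range.
entropy_sets = {
    "very_low": (0.0, 0.0, 3.4, 4.9),
    "low":      (3.4, 4.9, 5.3, 6.0),
    "high":     (5.5, 6.5, 8.0, 8.0),
}

value = 4.0
print({name: round(trapezoid(value, *p), 2) for name, p in entropy_sets.items()})
# {'very_low': 0.6, 'low': 0.4, 'high': 0.0}
```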
  • the score 123 of a given machine learning classifier 122 is a value in the range [0, 1].
  • a value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to a specific group of features 111, and it is a low threat; otherwise, the value will be close to 1.
  • This score 123 can be further divided into at least three sub-ranges or subsets which are:
  • 0.4 score belongs to "LOW" to 0.38 degree and to "MEDIUM" to 1.0 degree.
  • the fuzzy sets corresponding to all machine learning classifiers 122 are defined using the same membership functions for simplicity purposes. However, this is not a constraint and they might be defined with different membership functions and fuzzy sets.
  • the rule base and the database of the invention are jointly referred to as the knowledge base 132.
  • the knowledge base 132 comprises:
  • IF-THEN rules lead to what action or actions should be taken in terms of the currently observed information.
  • a fuzzy rule associates a condition described using linguistic variables and fuzzy sets to an output or a conclusion.
  • the IF part is mainly used to capture knowledge and the THEN part can be utilized to give the conclusion or output in linguistic variable form.
  • IF-THEN rules are widely used by the inference engine to compute the degree to which the input data matches the condition of a rule.
  • Fuzzy sets are sets whose elements have degrees of membership. Fuzzy set theory permits the gradual assessment of the membership of elements in a set; this is described with the aid of a membership function valued in the real unit interval [0,1].
  • the membership function represents the degree of truth.
  • the system has associated one fuzzy set to every input feature. See the membership functions of features "entropy" and "machine learning score" previously presented.
  • the IF-THEN rules and the membership functions of the fuzzy sets might be defined by experts in the field or by exploiting approximation techniques from neural networks.
  • experts extract comprehensible rules from their vast knowledge of the field. These rules are fine-tuned using the available input-output data.
  • neural network techniques are used to automatically derive rules from the data.
  • Every rule is attached with a degree of belief or weight in the real interval (0, 1] that denotes a lower bound on the belief on the rule in terms of necessity measures. So, rules that are demonstrated to be key for the decision system have a higher weight, and rules not so useful in the decision system have a lower weight.
  • the rules created by the system may have a higher degree of belief than the rules created by a human, or vice versa.
  • the system may create rules of the following form:
  • the decision-making unit (Inference Engine) 135 performs the inference procedure upon the fuzzy rules and given facts to derive a reasonable output or conclusion 140.
  • the inference engine is based on the PGL+ reasoning system, for reasoning under possibilistic uncertainty and disjunctive vague knowledge.
  • PGL+ is a possibilistic logic programming framework with fuzzy constants based on the Horn-rule fragment of Gödel infinitely-valued logic, with an efficient proof algorithm based on a complete calculus and oriented to goals (conclusions). Fuzzy constants are interpreted as disjunctive imprecise knowledge and the partial matching between them is computed by means of a fuzzy unification mechanism based on a necessity-like measure.
  • the output of the Inference Engine 135 is a conclusion involving fuzzy constants together with the degree on the belief on the conclusion.
  • the belief degree to classify the file 100 as malware is used, and fuzzy constants are transformed into crisp values using membership functions analogous to the ones used by the fuzzifier 131.
  • the invention may use, but not limited to, one of the following defuzzification 136 methods:
  • the output fuzzy set might be decomposed into at least three sub-ranges or subsets, which are represented as membership functions in Fig. 11:
  • the fuzzy output is converted to a crisp output using, but not limited to, any of the aforementioned defuzzification methods 136.
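  • A minimal sketch of one such method, centre-of-gravity (centroid) defuzzification over the output sub-ranges (the three trapezoidal output sets and the firing strengths below are illustrative only):

```python
import numpy as np

def centroid_defuzzify(fired_sets, universe=np.linspace(0.0, 1.0, 201)):
    """Clip each output fuzzy set at its firing strength, aggregate by maximum,
    and return the centre of gravity of the aggregated membership function."""
    aggregated = np.zeros_like(universe)
    for (a, b, c, d), strength in fired_sets:
        mu = np.interp(universe, [a, b, c, d], [0.0, 1.0, 1.0, 0.0])
        aggregated = np.maximum(aggregated, np.minimum(mu, strength))
    return float((universe * aggregated).sum() / aggregated.sum())

# Hypothetical output sets ("benign", "suspicious", "malicious") with the strengths
# to which the fired rules support each of them.
fired_sets = [((0.00, 0.05, 0.20, 0.40), 0.1),
              ((0.30, 0.45, 0.55, 0.70), 0.2),
              ((0.60, 0.80, 0.95, 1.00), 0.7)]
print(round(centroid_defuzzify(fired_sets), 3))    # crisp maliciousness score
```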
  • An unseen executable (XXXXXXXXXXX.exe) 100 is passed as input to the system.
  • the preprocessing module 110 extracts a subset of features 111 that provides an abstract view of the program. In particular, the preprocessing module 110 extracts at least the following features 111:
  • the aforementioned data is passed as input to some machine learning methods to calculate a maliciousness score 123 based on a particular feature or subset of features 111.
  • Machine learning model 1 outputs a maliciousness score 123 equal to 0.65 with respect to the structural entropy of the executable 100. (A machine learning model is defined as the output generated when a machine learning algorithm is trained with the training data.)
  • Machine learning model 2 outputs a maliciousness score 123 equal to 0.15 with respect to the sequence of instructions of the executable 100.
  • Machine learning model 3 outputs a maliciousness score 123 equal to
  • Rule 2: IF entropy(file) is "very_high" AND ML_score(ENTROPY) is "high" THEN file 100 is encrypted, with a degree of belief of at least 0.9
  • Rule 3: IF has_section(UPX0) OR has_section(UPX1) OR has_section("X") THEN file 100 is compressed, with a degree of belief of at least 0.9
  • Rule 4: IF file 100 is encrypted AND ML_score(API) is "low" AND ML_score(Opcodes) is "low" THEN file 100 is benign, with a degree of belief of at least 0.7
  • Rule 5: IF file 100 is encrypted AND ML_score(API) is "medium" AND ML_score(Opcodes) is "medium" THEN file 100 is suspicious, with a degree of belief of at least 0.8
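  • Purely as an illustration of how such rules could be evaluated (this is a deliberate simplification for readability, not the PGL+ calculus itself), the degree of an AND-condition can be taken as the minimum of the antecedent memberships, capped by the rule's degree of belief:

```python
def fire_rule(antecedent_degrees, rule_belief):
    """Simplified firing: the derived belief is bounded by the weakest antecedent
    membership and by the rule's own degree of belief."""
    return min(min(antecedent_degrees), rule_belief)

# Memberships computed by the fuzzifier for the running example (illustrative values).
memberships = {
    ("entropy", "very_high"): 0.8,
    ("ML_score(ENTROPY)", "high"): 0.65,
    ("ML_score(API)", "low"): 0.9,
    ("ML_score(Opcodes)", "low"): 0.85,
}

# Rule 2: IF entropy is "very_high" AND ML_score(ENTROPY) is "high" THEN encrypted (0.9).
encrypted = fire_rule([memberships[("entropy", "very_high")],
                       memberships[("ML_score(ENTROPY)", "high")]], 0.9)

# Rule 4: IF encrypted AND ML_score(API) is "low" AND ML_score(Opcodes) is "low"
#         THEN benign (0.7).
benign = fire_rule([encrypted,
                    memberships[("ML_score(API)", "low")],
                    memberships[("ML_score(Opcodes)", "low")]], 0.7)
print(encrypted, benign)    # 0.65 0.65
```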
  • the PGL+ proof method involves a semantical unification model of disjunctive fuzzy constants and three other inference patterns together with a deductive mechanism based on a modus ponens style.
  • the PGL+ system allows expressing both ill-defined properties and the weights with which properties and patterns can be attached. For instance, suppose that the problem observation corresponds to the following statement: "it is almost sure that the entropy of the file is around_20". This statement can be represented in the proposed system with the formula:
  • entropy(.) is a classical predicate expressing the entropy property of the problem domain
  • around_20 is a fuzzy constant
  • the degree 0.9 expresses how much the formula entropy(around_20) is believed, in terms of a necessity measure.
  • the PGL+ system computes the degree of belief of the crisp property encrypted by conveniently combining the degrees of belief 0.9 and 0.7 together with the degree of partial matching between both fuzzy constants high and around_20.
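  • A minimal numerical sketch of such a combination, using the standard necessity-based matching N(high | around_20) = inf_x max(1 - mu_around_20(x), mu_high(x)) over a discretised domain and a simple minimum to combine the degrees (the membership shapes and the min-combination are illustrative simplifications, not the full PGL+ computation):

```python
import numpy as np

x = np.linspace(0.0, 40.0, 401)                     # hypothetical entropy domain

# Illustrative membership functions of the two fuzzy constants.
mu_around_20 = np.clip(1.0 - np.abs(x - 20.0) / 5.0, 0.0, 1.0)   # triangular around 20
mu_high = np.clip((x - 10.0) / 10.0, 0.0, 1.0)                   # ramp up to 20

# Necessity of "high" given the imprecise observation "around_20".
matching = float(np.min(np.maximum(1.0 - mu_around_20, mu_high)))

# Combine the fact weight (0.9), the rule weight (0.7) and the matching degree.
belief_encrypted = min(0.9, 0.7, matching)
print(round(matching, 2), round(belief_encrypted, 2))
```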
  • the inference procedure based on the PGL+ reasoning system is divided in three algorithms which are applied sequentially.
  • a completion algorithm which extends the set of rules and facts with all valid clauses by means of the following Generalized Resolution and Fusion inference rules:
  • the completion algorithm first computes the set of valid clauses that can be derived by applying the Generalized resolution rule (i.e. by chaining clauses). Then, from this new set of valid clauses, the algorithm computes all valid clauses that can be derived by applying the Fusion rule (i.e. by fusing clauses). As the Fusion rule stretches the body of rules and the Generalized resolution rule modifies the body or the head of rules, the chaining and fusion steps have to be performed while new valid clauses are derived. As the chaining and fusion steps cannot produce infinite loops and each valid clause is either an original clause or can be derived from at least two clauses, in the worst case each combination of clauses derives a different valid clause. Hence, as there is a finite set N of facts and rules, in the worst case the number of valid clauses is finite.
  • c1, c2 and c3 can derive a new valid clause if c1 and c2, c1 and c3, or c2 and c3 derive a valid clause different from c1, c2 and c3.
  • each clause can be replaced by
  • D q can be computed from this finite set of facts by applying the UN and IN rules.
  • the above mechanism can be recursively applied for determining
  • Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
  • Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors, or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.
  • All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks.
  • Such communications may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of a scheduling system into the hardware platform(s) of a computing environment or other system implementing a computing environment or similar functionalities in connection with image processing.
  • another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
  • the physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software.
  • terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
  • a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium.
  • Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s), or the like, which may be used to implement the system or any of its components shown in the drawings.
  • Volatile storage media may include dynamic memory, such as a main memory of such a computer platform.
  • Tangible transmission media may include coaxial cables; copper wire and fiber optics, including the wires that form a bus within a computer system.
  • Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
  • Computer-readable media may include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Virology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Biomedical Technology (AREA)
  • Fuzzy Systems (AREA)
  • Automation & Control Theory (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A computer-implemented method, a system and computer programs for identifying a malicious file are disclosed. The method comprises performing a static analysis of a potentially malicious file to obtain a set of features that provide an abstract view of the file; performing a static machine learning classification process using as inputs said set of features, to obtain a preliminary classification output; and performing a fuzzy inference procedure based on possibilistic logic using as input variables said set of features and said preliminary classification output, to generate an enhanced classification output that identifies the potentially malicious file as a malicious file or a benign file.

Description

A computer-implemented method, a system and a computer program for identifying a malicious file
Field of the Invention
The present invention relates to a computer-implemented method, system and computer program for identifying a malicious file that combines different types of analysis, processes and procedures that allow detecting and classifying malicious files.
Background of the Invention
In the last few years, machine learning has been applied successfully to the task of malware detection and classification. One of the most used techniques is neural networks. Algorithms based on neural networks have recently achieved state-of-the-art results in a wide range of tasks. Examples of proposals based on neural networks can be found in the following patents: US9690938B1, US9705904B1 and US9495633B2.
However, the main limitation of neural networks is that it is difficult to obtain a rational explanation about the decisions they make.
A method for identifying malware file using multiple classifiers is known by US patent application, US2010192222A1. However, such method uses multiple classifiers including static and dynamic classifiers, and thus is unable to identify malware based only on static analysis.
Besides the above solutions, the patent EP2882159 discloses a computer implemented method of profiling cyber threats detected in a target environment, that comprises receiving, from a Security Information and Event Manager (SIEM) monitoring the target environment, alerts triggered by a detected potential cyber threat, and, for each alert: retrieving captured packet data related to the alert; extracting data pertaining to a set of attributes from captured packet data triggering the alert; applying fuzzy logic to data pertaining to one or more of the attributes to determine values for one or more output variables indicative of a level of an aspect of risk attributable to the cyber threat.
Apart from that, following the General Data Protection Regulation (GDPR) (Regulation (EU) 2016/679), automated individual decision-making, including profiling (Article 22) is contestable, similarly to the Data Protection Directive (Article 15). Citizens have rights to question and fight significant decisions that affect them that have been made on a solely algorithmic basis. In traditional deep learning (neural networks) systems, this right cannot be given.
New improved methods, systems and computer programs for identifying a malicious file are therefore needed.
Brief Description of the Invention
To that end, the present invention relates, in accordance with a first aspect, to a computer-implemented method for identifying a malicious file. The method comprises:
- performing a static analysis of a potentially malicious file to obtain a set of features that provide an abstract view of the malicious file (i.e. a view that reflects the obtained features from different points of view);
- performing a static machine learning classification process using as inputs said set of features, to obtain a preliminary classification output (i.e. a score); and
- performing a fuzzy inference procedure based on possibilistic logic, for reasoning under possibilistic uncertainty and disjunctive vague knowledge, for example originating from rules created either by the system or by experts, using as input variables said set of features and said preliminary classification output, to generate an enhanced classification output that identifies the potentially malicious file as a malicious file or a benign file.
For an embodiment, the method comprises:
- performing several static analyses of different types of said potentially malicious file to obtain corresponding sets of features that provide abstract views of the malicious file;
- performing said static machine learning classification process using as inputs said sets of features, to obtain said preliminary classification output; and
- performing said fuzzy inference procedure based on possibilistic logic using as input variables said sets of features and the preliminary classification output.
For an alternative embodiment, the method comprises:
- performing several static analyses of different types of said potentially malicious file to obtain corresponding sets of features that provide abstract views of the malicious file;
- performing several static machine learning classification processes, each using as inputs at least one respective of said sets of features, to obtain corresponding several preliminary classification outputs (i.e. scores); and
- performing said fuzzy inference procedure based on possibilistic logic using as input variables said sets of features and said preliminary classification outputs.
According to an implementation of said alternative embodiment, the method comprises:
- performing a further static machine learning classification process, using as inputs several or all of the above mentioned sets of features, to obtain a corresponding further preliminary classification output; and
- performing said fuzzy inference procedure based on possibilistic logic using as an input variable also said further preliminary classification output.
For an embodiment, the above mentioned fuzzy inference procedure comprises a fuzzification process that converts the input variables into fuzzy variables.
For an implementation of said embodiment, the fuzzification process comprises deriving membership functions relating the input variables with output variables through membership degrees of values of the input variables in predefined fuzzy sets, and representing said membership functions with linguistic variables, said linguistic variables being said fuzzy variables.
In addition to the fuzzification process, according to an embodiment, the fuzzy inference procedure further comprises an inference decision-making process comprising firing fuzzy possibilistic rules with values of said linguistic variables for said input variables, to generate a fuzzy output that identifies the degree of belief that the potentially malicious file is a malicious file or a benign file.
Additionally, for an embodiment, the method of the first aspect of the present invention further comprises selecting which fuzzy possibilistic rules to fire in said inference decision-making process, based on at least said values of the linguistic variables for the input variables.
According to an embodiment, in addition to the fuzzification process and the inference decision-making process, the fuzzy inference procedure based on possibilistic logic further comprises a defuzzification process that converts the above mentioned fuzzy possibilistic output into a crisp output, wherein said crisp output constitutes the above mentioned enhanced classification output.
Depending on the embodiment, the above mentioned set or sets of features may comprise:
- the frequency of use of Application Programming Interfaces (API) and their function calls;
- the representation of an executable file as a stream of entropy values, where each value describes the amount of entropy over a small chunk of code in a specific location of the potentially malicious file;
- the sequence of assembly language instructions executed by a software program constituting the potentially malicious file, in particular the operational codes of the machine language instructions;
- the representation of an executable file, constituting the potentially malicious file, as an image, where every byte is interpreted as one pixel in the image, wherein the resulting array is organized as a 2-D array and visualized as a gray scale image; and/or
- applicable program characteristics, at least including alphanumeric strings occurring in the body of the software program constituting the potentially malicious file and the fields from the header of the potentially malicious file.
For implementations of the above described embodiments of the method of the first aspect of the invention in which the method comprises performing several static analyses, the above identified corresponding sets of features comprise two or more of the above indicated sets of features.
In an embodiment, the fuzzy inference procedure based on possibilistic logic is based on a PGL+ algorithm. The proof method for PGL+ is complete and involves a semantical unification model of disjunctive fuzzy constants and three other inference patterns together with a deductive mechanism based on a modus ponens style.
In a particular embodiment, the PGL+ algorithm can comprise applying three algorithms sequentially: a first algorithm that extends the fuzzy possibilistic rules by means of implementing a first set of rules; a second algorithm that translates the fuzzy possibilistic rules into a semantically equivalent set of 1-weighted clauses by means of implementing a second set of rules; and a third algorithm that computes a maximum degree of possibilistic entailment of a goal from the equivalent set of 1-weighted clauses.
In an embodiment, the fuzzy inference procedure based on possibilistic logic comprises evaluating formulas of the form (A, c), where A is a Horn clause (fact or rule) with disjunctive fuzzy constants and c is a degree in the unit interval [0,1] which denotes a lower bound on the belief on A in terms of necessity measures. Every fact and rule is attached a degree of belief or weight in the real interval [0,1] that denotes a lower bound on the belief on that fact or rule in terms of necessity measures. So, those facts and rules that are demonstrated to be key for the decision system have a higher weight, and facts and rules not so useful in the decision system have a lower weight. The rules created by the system can have a higher degree of belief than the rules created by a human, or vice versa. For example, the system may create rules of the following form:
• Rule 1: IF (entropy(.text) is “high” OR entropy(.text) is “very_high” OR entropy(.text) is “extreme”) AND (call("CryptAcquireContext") OR call("CryptEncrypt") OR call("CryptReleaseContext")) THEN file 100 is encrypted with a degree of belief of 1.0
• Rule 2: IF entropy(file) is “very_high” AND ML_score(ENTROPY) is “high” THEN file 100 is encrypted with a degree of belief of at least 0.9.
Additionally, the facts can have different degrees of belief depending on the source of the information. For instance, file management API functions (e.g. CopyFile, CreateFile, EncryptFile, etc.) can have a higher belief degree than networking APIs (e.g. HttpCreateServerSession, DnsAcquireContextHandle, RpcStringBindingCompose, etc.).
In some embodiments, the machine learning models can be enhanced by further using Reinforcement Learning methods. Reinforcement Learning (RL) is a set of techniques that allow solving problems in highly uncertain or almost unknown domains. The method can use machine learning to select the most relevant features, using RL-guided methods to derive the future reward (i.e. accuracy) of using such a feature. After several iterations (training process), the machine learning technique will be able to use a Q-Table (rewards table) of the RL method to accurately predict which feature and split set to use for prediction, thus creating a quasi-optimal decision tree from which to derive the rules for the system. This last module makes the system keep learning from new threats, a key aspect when it comes to cybersecurity.
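By way of illustration only, the following Python sketch shows a simplified, bandit-style rewards table over feature groups, in the spirit of the Q-Table mentioned above: the reward is the accuracy gain obtained when a feature group is added to the selection. The evaluate() function, the group names and all numeric values are hypothetical stand-ins and not the training procedure of the invention.

```python
# Simplified rewards-table sketch (illustrative only, not the patented training loop).
import random

FEATURE_GROUPS = ["api_calls", "entropy", "opcodes", "image", "misc"]

def evaluate(selected):
    # Hypothetical stand-in: in practice this would be the validation accuracy of a
    # classifier trained on the selected feature groups.
    base = {"api_calls": 0.30, "entropy": 0.20, "opcodes": 0.25, "image": 0.10, "misc": 0.05}
    return min(1.0, sum(base[g] for g in selected))

q_table = {g: 0.0 for g in FEATURE_GROUPS}   # expected reward of adding each feature group
alpha, episodes, budget = 0.1, 200, 3

for _ in range(episodes):
    selected, accuracy = [], 0.0
    for _ in range(budget):
        candidates = [g for g in FEATURE_GROUPS if g not in selected]
        # epsilon-greedy: explore sometimes, otherwise pick the group with the best estimate
        group = random.choice(candidates) if random.random() < 0.2 else max(candidates, key=q_table.get)
        selected.append(group)
        gain = evaluate(selected) - accuracy  # reward = accuracy improvement brought by the group
        accuracy += gain
        q_table[group] += alpha * (gain - q_table[group])

print(sorted(q_table.items(), key=lambda kv: -kv[1]))  # most valuable feature groups first
```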
In a second aspect, the present invention also relates to a system for identifying a malicious file, the system comprising one or more computing entities adapted to perform the steps of the method of the first aspect of the invention for all its embodiments, said one or more computing entities including at least the following modules operatively connected to each other:
- a preprocessing computing module configured and arranged to perform a static analysis of a potentially malicious file to obtain a set of features that provide an abstract view of the malicious file;
- a machine learning module configured and arranged to perform a static machine learning classification process using as inputs said set of features, to obtain a preliminary classification output; and
- a fuzzy inference module configured and arranged to perform a fuzzy inference procedure based on possibilistic logic using as input variables said set of features and said preliminary classification output, to generate an enhanced possibilistic classification output that identifies the potentially malicious file as a malicious file or a benign file.
Other embodiments of the invention that are disclosed herein also include software programs to perform the method embodiment steps and operations summarized above and disclosed in detail below. More particularly, a computer program product is one embodiment that has a computer-readable medium including computer program instructions encoded thereon that when executed on at least one processor in a computer system causes the processor to perform the operations indicated herein as embodiments of the invention.
With the present invention, according to its three aspects, the limitations mentioned above associated with the prior art methods are addressed by aggregating and combining multiple static features and the output of preferably multiple static classifiers to infer the maliciousness of a file based on a set of fuzzy rules. These rules might be inferred using the knowledge of cyber security experts or using any machine learning technique.
With the present invention, the user has access to all the decisions taken in order to decide if a file is malicious. Additionally, an expert user can create additional rules, or modify the ones created by the method, system or computer program of the present invention.
Brief Description of the Figures
The previous and other advantages and features will be more fully understood from the following detailed description of embodiments, with reference to the attached figures, which must be considered in an illustrative and non-limiting manner, in which:
Fig. 1 schematically shows the system of the second aspect of the invention, for an embodiment, depicting its main modules.
Fig. 2 is an Entropy versus Chunk diagram showing an example of a static analysis of the method of the first aspect of the invention to provide a set of features of an abstract view of an executable file in the form of a stream of entropy values of a structural entropy, computed using Shannon's formula, of the executable file, according to an embodiment, by means of the pre-processing computing module of the system of the second aspect of the invention.
Fig. 3 shows gray scale images constituting sets of features obtained by respective static analyses of the method of the first aspect of the invention, representing abstract views of different malware files (Ramnit, Lollipop, Kelihos_ver3), according to corresponding embodiments, by means of the pre-processing computing module of the system of the second aspect of the invention.
Fig. 4 schematically shows an overview of a preprocessing module of the system of the second aspect of the present invention, decomposed into five components for performing five corresponding static analyses, including those associated with the embodiments of Figures 2 and 3, applied to an executable file.
Fig. 5 schematically shows the system of the second aspect of the invention, for an embodiment for which the machine learning module includes one submodule, or static classifier, per each set of features provided by a respective static analyser of the pre-processing module.
Fig. 6 schematically shows the system of the second aspect of the invention, for an embodiment for which the machine learning module includes only one static classifier that takes as inputs all the sets of features provided by all the static analysers of the pre-processing module.
Fig. 7 schematically shows the system of the second aspect of the invention, for an embodiment that differs from that of Figure 5 in that the machine learning module comprises, in addition, a further submodule that takes as inputs all the sets of features provided by all the static analysers of the pre-processing module.
Fig. 8 schematically shows the system of the second aspect of the present invention, for an embodiment, including the preprocessing module, the machine learning module, and a fuzzy inference module decomposed into several functional blocks.
Fig. 9 is a diagram that shows the membership function of some fuzzy subsets of sets of features obtained with a static analyzer, particularly of entropy values, for an embodiment of the fuzzification process performed according to the method of the first aspect of the invention, by means of the fuzzy inference module of the system of the second aspect of the invention.
Fig. 10 graphically shows the membership function of some fuzzy subsets associated to scores obtained from a machine learning process applied on the sets of features of Figure 9, for an embodiment of the fuzzification process performed according to the method of the first aspect of the invention, by means of the fuzzy inference module of the system of the second aspect of the invention.
Fig. 11 is a diagram that shows membership functions of scores obtained at the fuzzification process, as part of a defuzzification process to obtain crisp values, according to an embodiment of the method and system of the present invention.
Detailed Description of Preferred Embodiments
Fig. 1 shows an embodiment of the system of the second aspect of the present invention. As seen in the figure, the proposed system includes three components: a preprocessing module 110, a machine learning module 120 and a fuzzy inference module 130.
The preprocessing module 110 is responsible for the extraction of features/characteristics 111 of a given software program 100 (also termed file or executable). The machine learning module 120, which can be composed of one or more machine learning submodules 121, given one or more of said extracted features/characteristics 111, can output a score 123 (i.e. a preliminary classification output) indicating the maliciousness of the software program 100 with respect to the input features 111. The fuzzy inference module 130 is responsible for performing inference upon fuzzy rules and given facts, i.e. characteristics of the software program 100 and the output scores 123 of the machine learning methods implemented by the machine learning submodules 121, to derive a reasonable output or conclusion 140 (i.e. an enhanced classification output), that is, whether a file 100 is malicious or not. Notice that the invention might be applied to classifying malware into families without needing any significant modification.
The term “given facts” refers herein to the facts, data and input information of the fuzzy inference module 130. These data are the features extracted by the preprocessing 110 and machine learning 120 modules.
Depending on the embodiment, the preprocessing 110 and machine learning 120 modules are independent modules or are comprised in a common feature extraction module.
In an embodiment of the proposed method, a file 100 is received at a client or server computer, and then a static type of analysis of the file 100, i.e. without executing the file, is initiated. This static analysis is performed by the preprocessing module 110, which processes the file 100 and generates an abstract view thereof. This abstract view might be represented by sets of features 111.
These features 111 are used as input to one or more static classifiers 122, each implemented in one of the cited machine learning submodules 121. The output 123 of each machine learning classifier 122 is a value in the range [0, 1]. A value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to a specific group of features 111, whereas a value close to 1 indicates maliciousness. Any machine learning method can be used as a classifier, for instance neural networks, support vector machines or decision trees.
The fuzzy inference module 130 receives as input at least one or more features 111 extracted by the preprocessing module 110 and the output 123 of one or more static classifiers 122, and performs the inference procedure upon the rules and given facts to derive a reasonable output or conclusion 140, that is, whether a file is malicious or not.
Preprocessing module description
The preprocessing module 110 is responsible for the feature extraction process. It analyses the software program 100 with static techniques (i.e. the program 100 is not executed). It extracts various characteristics from the program's 100 syntax and semantics.
The software program 100 can take varying formats including, but not limited to, Portable Executable (PE), Disk Operating System (DOS) executable files, New Executable (NE) files, Linear Executable (LE) files, Executable and Linkable Format (ELF) files, JAVA Archive (JAR) files, and SHOCKWAVE/FLASH (SWF) files.
While the present embodiments describe the application of the present invention to Portable Executable (PE) format files, it will be appreciated that the methodologies described herein can be applied to other types of structured files, such as the ones previously mentioned.
Given a software program 100, the preprocessing module 110 extracts at least one of the following sets or subsets (groups) of features, but is not limited to them:
1. API function calls,
2. Assembly language instructions,
3. Structural entropy,
4. Image representation of the binary program,
5. Miscellaneous features.
The use of every set of features is explained in the following sections.
1. API function calls
The frequency of use of Application Programming Interfaces (API) and their function calls are regarded as very important features. The literature has shown that API calls can be explored to model the program behavior.
API functions and system calls are related to services provided by operating systems. They support various key operations such as networking, security, system services, file management, and so on. In addition, they include various functions for utilizing system resources, such as memory, the file system, the network or graphics.
Software has no way to access system resources that are managed by the operating system other than through API functions or system calls; thus, API function calls can provide key information to represent the behavior of the software 100. In consequence, every API function and system call has an associated feature. The feature range is [0, 1]: 0 (or False) if the API function or system call has not been called by the program, 1 (or True) otherwise. Alternatively, one can count how many times every API function has been called by the program. Because many malware programs are packed, leaving only the stub of the import table or perhaps even no import table at all, the malware classifier will search for the name of the dynamic link library or function in the body of the suspected malware (by disassembling the executable 100).
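A minimal Python sketch of this kind of feature construction is shown below, assuming the pefile library and a small, purely illustrative list of monitored API names; the fallback search of names in the raw file body mirrors the packed-sample case mentioned above.

```python
import pefile

MONITORED_APIS = ["CreateFileA", "CopyFileA", "CryptEncrypt", "HttpOpenRequestA"]  # illustrative subset

def api_call_features(path):
    # Binary feature vector: 1 if the monitored API appears in the import table
    # or as a plain string in the file body (packed samples), 0 otherwise.
    imported = set()
    pe = pefile.PE(path)
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        for imp in entry.imports:
            if imp.name:
                imported.add(imp.name.decode(errors="ignore"))
    body = open(path, "rb").read()
    return [1 if api in imported or api.encode() in body else 0 for api in MONITORED_APIS]
```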
2. Structural entropy
An executable file 100 is represented as a stream of entropy values, where each value describes the amount of entropy over a small chunk of code in a specific location of the file 100. For each chunk of code, the entropy is computed using Shannon's formula. There exists empirical evidence that the entropy time series from a given family are similar and distinct from those belonging to a different family. This is the result of reusing code to create new malware variants. In consequence, the structural entropy of an executable 100 can be used to detect whether it is benign or malware and to classify it into its corresponding family.
The diagram of Fig. 2 shows an example of the above mentioned computed entropy versus chunk, for an embodiment.
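A minimal sketch of this entropy stream computation is given below, assuming non-overlapping chunks and an illustrative chunk size of 256 bytes (the invention does not fix these values).

```python
import math
from collections import Counter

def shannon_entropy(chunk):
    # Shannon entropy in bits per byte, in the range [0, 8]
    counts = Counter(chunk)
    total = len(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_stream(path, chunk_size=256):
    data = open(path, "rb").read()
    return [shannon_entropy(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]
```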
3. Assembly language instructions
A software program 100 is disassembled (IDA Pro, Radare2, Capstone, etc.) and its sequence of assembly language instructions is extracted for further analysis. In particular, the operational codes of the machine language instructions are extracted. An operational code (opcode) is the portion of a machine language instruction that specifies the operation to be performed: arithmetic or data manipulation, logical operation or program control. Opcodes reveal significant statistical differences between malware and legitimate software. Thus, a sequence of opcodes can be extracted and then used to detect whether a file 100 is benign or malware (opcodes = [mov, pop, push, add, sub, mul, etc.]).
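As an illustration, the following sketch uses the Capstone disassembler (one of the tools named above) to obtain the opcode sequence from a code buffer; locating the code section inside the executable (e.g. via the PE headers) is assumed to have been done beforehand, and the base address is an arbitrary example.

```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

def opcode_sequence(code_bytes, base_address=0x401000):
    md = Cs(CS_ARCH_X86, CS_MODE_32)          # 32-bit x86; other modes can be selected as needed
    return [insn.mnemonic for insn in md.disasm(code_bytes, base_address)]

# Example: opcode_sequence(b"\x55\x89\xe5\x5d\xc3") -> ['push', 'mov', 'pop', 'ret']
```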
4. Gray scale image-based visualization of the program hexadecimal content
To visualize a software program 100 as an image, every byte has to be interpreted as one pixel in an image. Then, the resulting array has to be organized as a 2-D array and visualized as a gray scale image, as shown in Fig. 3.
The main benefit of visualizing a malicious executable 100 as an image is that the different sections of a binary can be easily differentiated. In addition, malware authors usually change only a small part of the code to produce new variants. Thus, if old malware is re-used to create new binaries, the resulting ones will be very similar. Additionally, by representing malware as an image it is possible to detect the small changes while retaining the global structure of samples.
This technique for malware visualization was first presented in the work of Nataraj et al. named “Malware Images: Visualization and Automatic Classification”.
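A minimal sketch of the byte-to-pixel mapping described above, using NumPy, is shown below; the fixed image width of 256 pixels is an illustrative choice rather than a value prescribed by the invention.

```python
import numpy as np

def to_grayscale_image(path, width=256):
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)  # one byte per pixel
    rows = len(data) // width
    return data[:rows * width].reshape(rows, width)                # 2-D gray scale image array
```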
5. Miscellaneous features
This group of features comprises hand-crafted features defined by cyber security experts. For instance, the size in bytes and the entropy of the sections of the Portable Executable file, the frequency of use of the registers, the frequency of a set of keywords from an executable, the attributes of the headers of the Portable Executable, among others.
Fig. 4 presents an overview of the preprocessing module 110 decomposed into the five aforementioned components.
Machine learning module description
The use of machine learning algorithms to address the problem of malicious software detection and classification has increased during the last decade. Instead of directly dealing with raw malware, machine learning solutions first have to extract features that provide an abstract view of the software. Then the extracted features can be used to feed at least one machine learning method.
In one embodiment, shown in Fig. 5, the system of the second aspect of the invention comprises and uses multiple machine learning submodules 121, each receiving as inputs the set of features provided by a respective static analyser of the preprocessing module 110.
The system receives a file 100 (such as an executable file) at a client or server computer. The preprocessing module 110 is responsible for extracting sets of features 111 from the file 100 by means of its static analysers. These features 111 are used as input to the machine learning submodules 121. The system has at least as many machine learning submodules 121 as groups of features.
The output of each machine learning submodule 121 is a value in the range [0,1]. A value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to a specific group of features 111; otherwise, the value will be close to 1. Any machine learning method can be used as a static classifier, for instance neural networks, support vector machines or decision trees.
In an embodiment, a feed-forward neural network with at least three layers can be used: (1) an input layer, (2) one fully-connected layer and (3) an output layer. The input layer has a size equal to the length of the feature vector. The output layer has only one neuron and outputs the probability of an executable being malicious. Additionally, a dropout after every fully-connected layer can be added.
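A minimal PyTorch sketch of such a feed-forward classifier is given below; the hidden-layer size and dropout rate are illustrative assumptions.

```python
import torch.nn as nn

class FeedForwardClassifier(nn.Module):
    def __init__(self, n_features, hidden=128, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),  # fully-connected layer
            nn.ReLU(),
            nn.Dropout(dropout),            # dropout after the fully-connected layer
            nn.Linear(hidden, 1),           # single output neuron
            nn.Sigmoid(),                   # probability that the executable is malicious
        )

    def forward(self, x):
        return self.net(x)
```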
Alternatively, depending on the nature of the data, i.e. images and time series, it might be useful to use a convolutional or a recurrent neural network as a classifier. In particular, convolutional neural networks have achieved great success in image and time series related classification tasks. Convolutional neural networks consist of a sequence of convolutional layers, the output of which is connected only to local regions in the input. This structure allows learning filters able to recognize specific patterns in the input data. The convolutional network can be composed of 5 or more layers: (1) the input layer, (2) one convolutional layer, (3) one pooling layer, (4) one fully-connected layer and (5) the output layer.
In particular, the following embodiments present concrete implementations of static classifiers for each group of features.
1. Static classifier embodiment 1 : API function calls.
In some implementations, the behavior of an executable file can be modelled by its use of the API functions. In those implementations, the executable file is disassembled to analyze and extract the API function calls it performs. In some implementations, every API function and system call has an associated feature. The feature range is [0,1]: 0 (or False) if the API function or system call has not been called by the program, 1 (or True) otherwise. Alternatively, one can count how many times every API function has been called by the program. In other implementations, only a subset of the available API function calls a program can execute is considered, because the number of API function calls a program can execute is huge and some of them are irrelevant to model the program's behavior. To select which are the most informative API function calls to record, any feature selection technique might be considered.
A feed-forward network can be utilized to analyze the API functions invoked by a computer program. The feed-forward network may have one or more hidden layers followed by an output layer, which generates a classification for the file (e.g. malicious or benign). The classification of the file can be provided at an output of the feed-forward network.
2. Static classifier embodiment 2: Structural entropy.
In some implementations, an executable file can be represented as a stream of entropy values, where each value describes the amount of entropy over a small chunk of code in a specific location of the file. For each chunk of code, the entropy is computed using Shannon's formula. A convolutional neural network can be utilized to analyze the stream of entropy values by applying a plurality of kernels to detect certain patterns in the variation between entropy values of adjacent chunks.
The convolutional network can detect malicious executables by providing a classification of the file (maliciousness score: [0,1]). The convolutional neural network may include a convolutional layer, a pooling layer, a fully connected layer and an output layer. The convolutional neural network can be configured to process streams of variable length. As such, one or more techniques can be applied to generate fixed-length representations of the entropy values. In some implementations, the first convolutional layer can be configured to process the stream of entropy values by applying a plurality of kernels K1,1, K1,2, ..., K1,x to the entropy values. Each kernel applied to the first convolutional layer can be configured to detect changes between entropy values of adjacent chunks in a file. According to some implementations, each kernel applied to the first convolutional layer can be adapted to detect a specific sequence of entropy values, having w values.
Although the convolutional neural network has been indicated as comprising 3 convolutional layers, it should be appreciated that the convolutional neural network can include fewer or more convolutional layers.
In some implementations, the pooling layer can be configured to further process the output from a preceding convolutional layer by compressing (e.g. subsampling or down-sampling) the output from the preceding convolutional layer. The pooling layer can compress the output by applying one or more pooling functions, including for example a maximum pooling function.
In some implementations, the output of the pooling layer can be further processed by the one or more fully connected layers and the output layer in order to generate a classification for the file (e.g. malicious or benign). The classification of the file can be provided at an output of the convolutional neural network.
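A minimal PyTorch sketch of such a convolutional classifier for the entropy stream is shown below; the number of kernels, the kernel width w and the use of adaptive pooling to cope with variable-length streams are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EntropyCNN(nn.Module):
    def __init__(self, n_kernels=32, w=8):
        super().__init__()
        self.conv = nn.Conv1d(1, n_kernels, kernel_size=w)  # kernels over adjacent entropy chunks
        self.pool = nn.AdaptiveMaxPool1d(1)                  # copes with variable-length streams
        self.fc = nn.Linear(n_kernels, 1)

    def forward(self, entropy_stream):                       # shape: (batch, 1, n_chunks)
        x = torch.relu(self.conv(entropy_stream))
        x = self.pool(x).squeeze(-1)
        return torch.sigmoid(self.fc(x))                     # maliciousness score in [0, 1]
```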
3. Static classifier embodiment 3: Assembly language instructions.
In some implementations, a binary file can be disassembled, thereby forming a discernible sequence of instructions having one or more identifying features (e.g. instruction mnemonics). A convolutional neural network (CNN) can be utilized to analyze the disassembled binary file by applying a plurality of kernels (filters) adapted to detect certain sequences of instructions in the disassembled file. The convolutional network can detect malicious executables by providing a classification of the disassembled binary file (maliciousness score: [0,1]). The convolutional neural network may include a convolutional layer, a pooling layer, a fully connected layer and an output layer. The convolutional neural network can be configured to process a sequence of instructions that is variable in length. As such, one or more techniques can be applied to generate fixed-length representations of the instructions. Moreover, the fixed-length instruction representations have to be encoded in a way that lets the network capture their meaning. Note that neural networks cannot deal with non-numerical features. Thus, mnemonics are encoded using one-hot vector representations. Afterwards, each one-hot vector is represented as a word embedding, that is, a vector of real numbers. This vector representation of the opcodes can be generated during the training phase of the convolutional network or using any other approach such as neural probabilistic language models, e.g. the Skip-Gram model, the Word2Vec model, recurrent neural network models, etc.
In some implementations, the first convolutional layer can be configured to process the encoded fixed mnemonic representations by applying a plurality of kernels K1,1, K1,2, ..., K1,x to the encoded fixed mnemonic representations. Each kernel applied at the first convolutional layer can be configured to detect a specific sequence of instructions. According to some implementations, each kernel applied to the first convolutional layer can be adapted to detect a sequence having a number of instructions. That is, kernels K can be adapted to detect instances where a number of instructions appear in a certain order. For example, kernel K1,1 can be adapted to detect the instruction sequence [cmp, jne, dec] while kernel K1,2 can be adapted to detect the instruction sequence [dec, mov, jmp]. The size of each kernel (w, the number of instructions) corresponds to the window size of the first convolutional layer.
In some implementations, the convolutional layer may have kernels of different size. For instance, one kernel may be adapted to detect the instruction sequence [dec, mov, jmp] while another kernel may be adapted to detect the instruction set [dec, mov, jmp, pull, sub].
Although the convolutional neural network is shown to include one convolutional layer, it should be appreciated that the convolutional neural network can include a different number of convolutional layers; for instance, it can include more convolutional layers, such as two.
Thus, in some implementations, the kernels K2,1, K2,2, ..., K2,x applied to the second convolutional layer can be adapted to detect specific sequences of two or more of the sequences of instructions detected at the first convolutional layer. Consequently, the second convolutional layer would generate increasingly abstract representations of the sequence of instructions from the disassembled binary file. In some implementations, the pooling layer can be configured to further process the output from a preceding convolutional layer by compressing (e.g. subsampling or down-sampling) the output from the preceding convolutional layer. The pooling layer can compress the output by applying one or more pooling functions, including for example a maximum pooling function.
In some implementations, the output of the pooling layer can be further processed by the one or more fully connected layers and the output layer in order to generate a classification for the disassembled binary file (e.g. malicious or benign). The classification of the disassembled binary file can be provided at an output of the convolutional neural network.
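The following PyTorch sketch illustrates this kind of opcode classifier, with an embedding layer standing in for the one-hot plus word-embedding step and 1-D kernels of width w that fire on specific instruction sequences; vocabulary size, embedding dimension and kernel parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OpcodeCNN(nn.Module):
    def __init__(self, vocab_size=512, embed_dim=16, n_kernels=64, w=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # opcode id -> dense vector
        self.conv = nn.Conv1d(embed_dim, n_kernels, kernel_size=w)   # e.g. a kernel firing on [cmp, jne, dec]
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(n_kernels, 1)

    def forward(self, opcode_ids):                                   # shape: (batch, sequence_length)
        x = self.embed(opcode_ids).transpose(1, 2)                   # (batch, embed_dim, sequence_length)
        x = torch.relu(self.conv(x))
        x = self.pool(x).squeeze(-1)
        return torch.sigmoid(self.fc(x))                             # maliciousness score in [0, 1]
```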
4. Static classifier embodiment 4: Image-based representation of malware’s hexadecimal content.
In some implementations, a software program can be visualized as an image, where every byte is interpreted as one pixel in the image. Then, the resulting array is organized as a 2-D array and visualized as a gray scale image. Approaches such as convolutional neural networks can yield classifiers that learn to extract features that are at least as effective as human-engineered features. A convolutional neural network implementation to extract features can advantageously make use of the connectivity structure between feature maps to extract local and invariant features from an image. A convolutional neural network (CNN) can be utilized to analyze the file by applying a plurality of kernels (filters) adapted to detect certain local and invariant patterns in the pixels of the representation of the software program as a gray-scale image. The convolutional network can detect malicious executables by providing a classification of the file (maliciousness score: [0,1]).
The convolutional neural network may include at least a convolutional layer, a pooling layer, a fully connected layer and an output layer. In some implementations, it may include more than one convolutional, pooling and fully connected layer. According to some implementations, each kernel applied to the first convolutional layer can be adapted to detect a pattern in the pixels of the image having w x h size, where w is the width and h is the height of the kernel. Subsequent convolutional layers detect increasingly abstract features.
In some implementations, the pooling layer can be configured to further process the output from a preceding convolutional layer by compressing (e.g. subsampling or down sampling) the output from the preceding convolution layer. The pooling layer can compress the output by applying one or more pooling functions, including for example the maximum pooling function. In some implementations, the output of the pooling layer can be further processed by the one or more fully connected layers and the output layer in order to generate a classification for the file (e.g. malicious or benign). The classification of the file can be provided at an output of the convolutional neural network.
5. Static classifier embodiment 5: Miscellaneous features.
In any embodiment of the invention, the so-called “miscellaneous” features include applicable software characteristics. These characteristics at least include the keywords occurring in the body of the software program and the fields of the header of a file in any format. Other types of features may also be used.
The next table illustrates the fields of the header of a file in Portable Executable format. For example, these fields are: MajorLinkerVersion, MinorLinkerVersion, SizeOfCode, SizeOfInitializedData, etc. Shown is relevant information that contains suitable characteristics to use as features. These characteristics are specific to the information of a Portable Executable file header, but other file types will have other relevant header information and characteristics.
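By way of illustration, a few such header fields can be read with the pefile library as sketched below; the particular selection of fields is an assumption for the example and not a prescribed feature set.

```python
import pefile

def header_features(path):
    # Read a handful of Portable Executable header fields and use them as numeric features.
    pe = pefile.PE(path)
    return {
        "MajorLinkerVersion": pe.OPTIONAL_HEADER.MajorLinkerVersion,
        "MinorLinkerVersion": pe.OPTIONAL_HEADER.MinorLinkerVersion,
        "SizeOfCode": pe.OPTIONAL_HEADER.SizeOfCode,
        "SizeOfInitializedData": pe.OPTIONAL_HEADER.SizeOfInitializedData,
        "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
    }
```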
In another embodiment, shown in Fig. 6, the preprocessing module 110 is responsible for extracting a set of informative features 111 from the file 100. These features 111 are then aggregated and fed as input to a common static classifier 122, which will determine whether the file 100 is malicious or not. The input of the static classifier 122 is the features 111 from the distinct groups extracted by the preprocessing module 110. The output 123 of the static classifier 122 is a value in the range [0,1]. A value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to the features 111; otherwise, the value will be close to 1. Any machine learning method can be used as a classifier, for instance neural networks, support vector machines or decision trees.
In another embodiment, shown in Fig. 7, the preprocessing module 110 is responsible for extracting sets of informative features 111 from the file 100. These features 111 are used as input to static classifiers. The system has as many static classifiers as sets of features and, in contrast to the embodiment of Fig. 5, a further static classifier that aggregates and uses the features of all groups as input. The output 123 of each machine learning classifier 122 is a value in the range [0, 1]. A value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to a specific group of features 111; otherwise, the value will be close to 1. Any machine learning method can be used as a classifier, for instance neural networks, support vector machines or decision trees.
Fuzzy inference module description
The last component of the malware detection system is the fuzzy inference engine 130. Its aim is to apply a set of fuzzy rules to decide whether an executable is malicious, based on the output of the machine learning methods and the features extracted by the preprocessing module.
This component 130 performs the following steps:
• receives one or more input values and generates an array of values each representing a membership degree of a respective input value in a predefined fuzzy set (fuzzification);
• combines the membership values on the premise part to get firing strength (degree of fulfilment) of each rule;
• generates the qualified consequent part (either fuzzy or crisp) of each rule depending on the firing strength;
• aggregates the qualified consequent part to produce a crisp output (defuzzification).
The fuzzy inference module 130 can be decomposed into functional blocks, as depicted in Fig. 8, and described below in detail.
A/ Fuzzification:
First, the input values of the system have to be converted to fuzzy variables. This process is named fuzzification 131. Fuzzification 131 involves two processes: deriving the membership functions for input and output variables, and representing them with linguistic variables. (Given two inputs, x1 and y1, determine the degree to which the input variables belong to each of the appropriate fuzzy sets.)
The input values are two-fold: a feature vector of program characteristics named F, of size |F|, where Fi ∈ F corresponds to the value of the i-th feature of the program 100; this feature vector is extracted by the preprocessing module 110. And a score vector containing the output scores 123 of the machine learning algorithms, named S, of size |S|, where |S| is equal to the number of distinct algorithms that have been applied to predict the maliciousness of the program based on distinct groups of features; this score vector is generated by the machine learning module 120.
To illustrate the process of fuzzification 131 , a vector of only two features containing the entropy of the .text section of a Portable Executable 100 and the score 123 generated by a machine learning algorithm will be considered as input.
The entropy of a byte sequence refers to its amount of disorder (uncertainty) or its statistical variation. The entropy value ranges from 0 to 8. If the occurrences of all values are the same, the entropy will be largest. On the contrary, if certain byte values occur with high probabilities, the entropy value will be smaller. According to studies, the entropy of plain text, native executables, packed executables and encrypted executables tends to differ greatly. In consequence, the [0,8] range can be further divided into at least six sub-ranges or subsets, which are:
• VERY LOW: From 0 to 4.328 entropy
• LOW: From 4.066 to 5.030 entropy
• MEDIUM: From 4.629 to 6.369 entropy
• HIGH: From 6.219 to 7.267 entropy
• VERY HIGH: From 6.838 to 7.312
• EXTREME: From 7.215 to 8.0
The membership function of these subsets is shown in Fig. 9. To keep things simple, a trapezoidal waveform is utilized for this type of membership function. For instance, an entropy of 4.0 will belong to “very low” to a degree of 0.6 and to “low” to a degree of 0.4.
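A minimal sketch of trapezoidal fuzzification for the entropy variable is given below; the corner points of each trapezoid are illustrative assumptions consistent with the overlapping sub-ranges listed above, and the exact slopes used by the invention (and hence the exact membership degrees of the 4.0 example) may differ.

```python
def trapezoid(x, a, b, c, d):
    # Membership is 0 outside [a, d], 1 on the plateau [b, c], linear on the two slopes.
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Illustrative corner points for two of the six entropy subsets listed above.
ENTROPY_SETS = {
    "very_low": (-0.1, 0.0, 4.066, 4.328),
    "low":      (4.066, 4.328, 4.629, 5.030),
}

def fuzzify_entropy(value):
    return {name: trapezoid(value, *corners) for name, corners in ENTROPY_SETS.items()}

print(fuzzify_entropy(4.2))  # partial membership in both "very_low" and "low"
```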
The score 123 of a given machine learning classifier 122 is a value in the range [0, 1]. A value close to 0 means that the executable 100 does not contain suspicious/malicious indicators with regard to a specific group of features 111 and represents a low threat; otherwise, the value will be close to 1. This score 123 can be further divided into at least three sub-ranges or subsets, which are:
• LOW: From 0.0 to 0.5
• MEDIUM: From 0.2 to 0.75
• HIGH: From 0.5 to 1.0
The membership function of these subsets is shown in Fig. 10. For example, a score of 0.4 belongs to “LOW” to a degree of 0.38 and to “MEDIUM” to a degree of 1.0.
In the implementation of the current subject matter, the fuzzy sets corresponding to all machine learning classifiers 122 are defined using the same membership functions for simplicity purposes. However, this is not a constraint and they might be defined with different membership functions and fuzzy sets.
B/ Knowledge Base:
The rule base and the database of the invention are jointly referred to as the knowledge base 132. The knowledge base 132 comprises:
• a rules base 133 containing a number of fuzzy IF-THEN rules. These IF-THEN rules determine what action or actions should be taken in terms of the currently observed information. A fuzzy rule associates a condition described using linguistic variables and fuzzy sets with an output or a conclusion. The IF part is mainly used to capture knowledge and the THEN part can be utilized to give the conclusion or output in linguistic variable form. IF-THEN rules are widely used by the inference engine to compute the degree to which the input data matches the condition of a rule.
• a database 134 which defines the membership functions of the fuzzy sets. Fuzzy sets are sets whose elements have degrees of membership. Fuzzy set theory permits the gradual assessment of the membership of elements in a set; this is described with the aid of a membership function valued in the real unit interval [0,1]. The membership function represents the degree of truth. The system has one fuzzy set associated with every input feature. See the membership functions of the features “entropy” and “machine learning score” previously presented.
The IF-THEN rules and the membership functions of the fuzzy sets might be defined by experts in the field or by exploiting approximation techniques from neural networks. On the one hand, experts extract comprehensible rules from their vast knowledge of the field. These rules are fine-tuned using the available input-output data. On the other hand, neural network techniques are used to automatically derive rules from the data.
Every rule is attached a degree of belief or weight in the real interval (0, 1] that denotes a lower bound on the belief on the rule in terms of necessity measures. So, rules that are demonstrated to be key for the decision system have a higher weight, and rules not so useful in the decision system have a lower weight. The rules created by the system may have a higher degree of belief than the rules created by a human, or vice versa.
For example, the system may create rules of the following form:
• Rule 1: IF (entropy(.text) is “high” OR entropy(.text) is “very_high” OR entropy(.text) is “extreme”) AND (call("CryptAcquireContext") OR call("CryptEncrypt") OR call("CryptReleaseContext")) THEN file 100 is encrypted with a degree of belief of 1.0
• Rule 2: IF entropy(file) is “very_high” AND ML_score(ENTROPY) is “high” THEN file 100 is encrypted with a degree of belief of at least 0.9
• Rule 3: IF has_section(UPX0) OR has_section(UPX1) OR has_section(“X”) THEN file 100 is compressed with a degree of belief of at least 0.9
• Rule 4: IF file 100 is encrypted AND ML_score(API) is “low” AND ML_score(Opcodes) is “low” THEN file 100 is benign with a degree of belief of at least 0.7
• Rule 5: IF file 100 is encrypted AND ML_score(API) is “medium” AND ML_score(Opcodes) is “medium” THEN file 100 is suspicious with a degree of belief of at least 0.8
• Rule 6: IF file 100 is encrypted AND (ML_score(API) is “high” OR ML_score(Opcodes) is “high”) THEN file 100 is malicious with a degree of belief of at least 0.9
• Rule 7: IF file 100 is compressed AND ML_score(ENT) is “high” AND (ML_score(API) is “medium” OR ML_score(Opcodes) is “medium”) THEN file 100 is malicious with a degree of belief of at least 0.8
• Rule 8: IF file 100 is compressed AND (ML_score(ENT) is “medium” OR ML_score(ENT) is “low”) AND ML_score(API) is “low” AND ML_score(Opcodes) is “low” THEN file 100 is benign with a degree of belief of at least 0.8
Due to the complexity and number of rules, in this implementation of the system only a few rules related to very few fuzzy sets (entropy and ML scores) are presented. Notice that some of the conditions of the rules are crisp values. For instance, call("CryptAcquireContext") is TRUE if the executable calls “CryptAcquireContext” and FALSE otherwise.
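By way of illustration, the weighted rules above can be represented programmatically as sketched below. Antecedents combine membership degrees with min (AND) and max (OR), and a fired rule contributes a belief bounded by min(premise degree, rule weight); beliefs for the same conclusion are aggregated with the minimum, following the worked example given later for the PGL+ aggregation. This is a simplified sketch, not the complete PGL+ proof method, and all fact values are illustrative.

```python
AND, OR = min, max

def rule_1(facts):   # file is encrypted: high-ish .text entropy and a crypto API call
    return AND(OR(facts["entropy_text_high"], facts["entropy_text_very_high"], facts["entropy_text_extreme"]),
               OR(facts["call_CryptAcquireContext"], facts["call_CryptEncrypt"], facts["call_CryptReleaseContext"]))

def rule_2(facts):   # file is encrypted: very high file entropy and high entropy ML score
    return AND(facts["entropy_file_very_high"], facts["ml_entropy_high"])

RULES = [(rule_1, "encrypted", 1.0), (rule_2, "encrypted", 0.9)]

def infer(facts):
    conclusions = {}
    for antecedent, conclusion, weight in RULES:
        fulfilment = antecedent(facts)
        if fulfilment > 0:                                   # the rule is fired
            belief = min(fulfilment, weight)
            prev = conclusions.get(conclusion)
            conclusions[conclusion] = belief if prev is None else min(prev, belief)
    return conclusions

# Illustrative membership degrees and crisp facts (1.0 = TRUE):
facts = {"entropy_text_high": 0.0, "entropy_text_very_high": 1.0, "entropy_text_extreme": 0.3,
         "call_CryptAcquireContext": 1.0, "call_CryptEncrypt": 1.0, "call_CryptReleaseContext": 0.0,
         "entropy_file_very_high": 1.0, "ml_entropy_high": 1.0}
print(infer(facts))   # {'encrypted': 0.9}
```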
C/ Inference Engine
The decision-making unit (Inference Engine) 135 performs the inference procedure upon the fuzzy rules and given facts to derive a reasonable output or conclusion 140. Even though any fuzzy inference system could be used, e.g. Mamdani fuzzy models, Sugeno fuzzy models, Tsukamoto fuzzy models, etc., in the current embodiment the inference engine is based on the PGL+ reasoning system, for reasoning under possibilistic uncertainty and disjunctive vague knowledge. PGL+ is a possibilistic logic programming framework with fuzzy constants based on the Horn-rule fragment of Gödel infinitely-valued logic, with an efficient proof algorithm based on a complete calculus and oriented to goals (conclusions). Fuzzy constants are interpreted as disjunctive imprecise knowledge and the partial matching between them is computed by means of a fuzzy unification mechanism based on a necessity-like measure.
For instance, if the entropy of the “.text” section is 7.2, the score returned by a given machine learning model trained on the structural entropy of the executable is 0.65 and the executable calls the functions “CryptAcquireContext” and “CryptEncrypt”, then rules 1 and 2 are fired.
Rules fired:
• Rule 1: IF (entropy(.text) is “high” OR entropy(.text) is “very_high” OR entropy(.text) is “extreme”) AND (call("CryptAcquireContext") OR call("CryptEncrypt") OR call("CryptReleaseContext")) THEN file 100 is encrypted with a degree of belief of 1.0
• Rule 2: IF entropy(file) is “very_high” AND ML_score(ENTROPY) is “high” THEN file 100 is encrypted with a degree of belief of at least 0.9
To aggregate the output, the minimum is used, as defined by the PGL+ reasoning system.
• File 100 is encrypted with a degree of belief >= min(degree of belief of rule 1, degree of belief of rule 2) == file 100 is encrypted with a degree of belief >= min(1.0, 0.9) == file 100 is encrypted with a degree of belief >= 0.9
Considering that ML_score(API) = 0.21 and ML_score(Opcodes) = 0.15, then, since rule 1 or rule 2 is activated, rules 4, 5 and 6 are consequently fired, but only rule 4 is satisfied.
• Rule 4: IF file 100 is encrypted AND ML_score(API) is “low” AND ML_score(Opcodes) is “low” THEN file 100 is benign with a degree of belief of at least 0.7
As a result, the output of the inference engine 135 is: file 100 is benign with a degree of belief >= min(0.7, min(degree of belief of rule 1, degree of belief of rule 2)) -> file 100 is benign with a degree of belief >= min(0.7, min(1.0, 0.9)) -> “file 100 is benign with a degree of belief >= 0.7”
D/ Defuzzification:
The output of the Inference Engine 135 is a conclusion involving fuzzy constants together with the degree of belief on the conclusion. The belief degree to classify the file 100 as malware is used, and fuzzy constants are transformed into crisp values using membership functions analogous to the ones used by the fuzzifier 131. The invention may use, but is not limited to, one of the following defuzzification 136 methods:
1. Centroid of Area (COA)
2. Bisector of Area (BOA)
3. Mean of Maximum (MOM)
4. Smallest of Maximum (SOM)
5. Largest of Maximum (LOM)
In the current embodiment, the output fuzzy set might be decomposed into at least three sub-ranges or subsets, which are represented as membership functions in Fig. 11:
• BENIGN: from 0 to 0.4
• SUSPICIOUS: from 0.2 to 0.8
• MALICIOUS: from 0.5 to 1.0
Given the degree of fulfilment and the degree of belief of the consequent part of each fired rule, the fuzzy output is converted to a crisp output using, but not limited to, any of the aforementioned defuzzification methods 136. For instance, following the example presented in C/, if the output of the fuzzy inference engine is “file 100 is benign with a degree of belief >= 0.7” and the defuzzification method 136 is the “Mean of Maximum (MOM)”, then the crisp value of the fuzzy set using MOM is y* = (a + b) / 2, where a is the minimum highest value of the membership function “benign”, i.e. 0, and b is the maximum highest value of the membership function, i.e. 0.2. In consequence y* = (0 + 0.2) / 2 = 0.1.
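A minimal sketch of the Mean-of-Maximum step is shown below; each output label is associated with the plateau where its membership function is maximal, and the plateau bounds are illustrative assumptions consistent with the sub-ranges listed above (e.g. “benign” is maximal on [0, 0.2], matching the worked example).

```python
# Illustrative plateaus (regions of maximal membership) for the three output subsets.
OUTPUT_PLATEAUS = {
    "benign":     (0.0, 0.2),
    "suspicious": (0.4, 0.6),
    "malicious":  (0.8, 1.0),
}

def mean_of_maximum(label):
    a, b = OUTPUT_PLATEAUS[label]
    return (a + b) / 2

print(mean_of_maximum("benign"))  # 0.1, as in the worked example above
```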
Use Case
The steps for predicting the maliciousness of a previously unseen executable 100 in a concrete implementation of the system of the second aspect of the present invention are described below. Due to its complexity, in this implementation of the system only a reduced subset of features 111 is extracted. Therefore, the number of machine learning methods, fuzzy rules and fuzzy sets has been reduced accordingly to fit the needs of this concrete implementation.
Steps:
1. An unseen executable (XXXXXXXXXXXX.exe) 100 is passed as input to the system. The preprocessing module 110 extracts a subset of features 111 that provides an abstract view of the program. In particular, the preprocessing module 110 extracts at least the following features 111:
a. File entropy: 7.2
b. Windows API function calls: {“CryptAcquireContext”: True, “CryptEncrypt”: True, “CreateFile”: True, “CopyFile”: False, ...}
c. Sequence of assembly language instructions: [“inc eax”, “call Clrsc”, “jmp L1”, “add ebx, eax”, ...]
d. Structural entropy
2. The aforementioned data is passed as input to some machine learning methods to calculate a maliciousness score 123 based on a particular feature or subset of features 111.
a. Machine learning model 1 outputs a maliciousness score 123 equal to 0.65 with respect to the structural entropy of the executable 100. (A machine learning model is defined as the output generated when a machine learning algorithm is trained with training data.)
b. Machine learning model 2 outputs a maliciousness score 123 equal to 0.15 with respect to the sequence of instructions of the executable 100.
c. Machine learning model 3 outputs a maliciousness score 123 equal to 0.21 with respect to the imported Windows API functions.
3. Next, the features 111 and ML scores 123 are passed as input to the fuzzy inference module 130 to calculate the final maliciousness score 140 of the executable 100. Considering a rule base consisting of the following rules:
a. Rule 1: IF (entropy(.text) is “high” OR entropy(.text) is “very_high” OR entropy(.text) is “extreme”) AND (call("CryptAcquireContext") OR call("CryptEncrypt") OR call("CryptReleaseContext")) THEN file 100 is encrypted with a degree of belief of 1.0
b. Rule 2: IF entropy(file) is “very_high” AND ML_score(ENTROPY) is “high” THEN file 100 is encrypted with a degree of belief of at least 0.9
c. Rule 3: IF has_section(UPX0) OR has_section(UPX1) OR has_section(“X”) THEN file 100 is compressed with a degree of belief of at least 0.9
d. Rule 4: IF file 100 is encrypted AND ML_score(API) is “low” AND ML_score(Opcodes) is “low” THEN file 100 is benign with a degree of belief of at least 0.7
e. Rule 5: IF file 100 is encrypted AND ML_score(API) is “medium” AND ML_score(Opcodes) is “medium” THEN file 100 is suspicious with a degree of belief of at least 0.8
f. Rule 6: IF file 100 is encrypted AND (ML_score(API) is “high” OR ML_score(Opcodes) is “high”) THEN file 100 is malicious with a degree of belief of at least 0.9
g. Rule 7: IF file 100 is compressed AND ML_score(ENT) is “high” AND (ML_score(API) is “medium” OR ML_score(Opcodes) is “medium”) THEN file 100 is malicious with a degree of belief of at least 0.8
h. Rule 8: IF file 100 is compressed AND (ML_score(ENT) is “medium” OR ML_score(ENT) is “low”) AND ML_score(API) is “low” AND ML_score(Opcodes) is “low” THEN file 100 is benign with a degree of belief of at least 0.8
If the entropy of the “.text” section is 7.2, the score 123 returned by a given machine learning module 121 trained on the structural entropy of executables is 0.65 and the executable invokes the functions “CryptAcquireContext” and “CryptEncrypt”, then rules 1 and 2 are fired.
Rules fired:
• Rule 1: IF (entropy(.text) is “high” OR entropy(.text) is “very_high” OR entropy(.text) is “extreme”) AND (call("CryptAcquireContext") OR call("CryptEncrypt") OR call("CryptReleaseContext")) THEN file 100 is encrypted with a degree of belief of 1.0
• Rule 2: IF entropy(file) is “very_high” AND ML_score(ENTROPY) is “high” THEN file 100 is encrypted with a degree of belief of at least 0.9
To aggregate the output 140, the minimum is used, as defined by the PGL+ reasoning system.
• file 100 is encrypted with a degree of belief >= min(degree of belief of rule 1 , degree of belief of rule 2) == file 100 is encrypted with a degree of belief >= min(1.0, 0.9) == file 100 is encrypted with a degree of belief >= 0.9
Considering that ML_score(API) = 0.21 and ML_score(Opcodes) = 0.15, then, since rule 1 or rule 2 is activated, rules 4, 5 and 6 are consequently fired, but only rule 4 is satisfied.
• Rule 4: IF file 100 is encrypted AND ML_score(API) is “low” AND ML_score(Opcodes) is “low” THEN file 100 is benign with a degree of belief of at least 0.7
As a result, the output of the inference engine 135 is: file 100 is benign with a degree of belief >= min(0.7, min(degree of belief of rule 1, degree of belief of rule 2)) -> file 100 is benign with a degree of belief >= min(0.7, min(1.0, 0.9)) -> “file 100 is benign with a degree of belief >= 0.7”.
Afterwards, the output of the fuzzy inference engine 135 (“file 100 is benign with a degree of belief >= 0.7”) is defuzzified 136 using the “Mean of Maximum (MOM)” defuzzification method. In consequence, the resulting crisp value returned by the system is calculated using the formula y* = (a + b) / 2, where a is the minimum highest value of the membership function “benign”, i.e. 0, and b is the maximum highest value of the membership function, i.e. 0.2. In consequence, y* = (0 + 0.2) / 2 = 0.1.
In an embodiment, PGL+ involves a semantical unification model of disjunctive fuzzy constants and three other inference patterns, together with a deductive mechanism based on a modus ponens style. The PGL+ system allows expressing both ill-defined properties and the weights with which properties and patterns can be attached. For instance, suppose that the problem observation corresponds to the following statement: “it is almost sure that the file entropy is around_20”. This statement can be represented in the proposed system with the formula:
(entropy(around_20), 0.9),
where entropy(.) is a classical predicate expressing the entropy property of the problem domain; around_20 is a fuzzy constant; and the degree 0.9 expresses how strongly the formula entropy(around_20) is believed, in terms of a necessity measure.
In case around_20 denotes a crisp interval of entropy values, the formula (entropy(around_20), 0.9) is interpreted as the sentence "there exists x in around_20 such that entropy(x)" being certain with a necessity of at least 0.9. So, fuzzy constants can be seen as (flexible) restrictions on an existential quantifier. Moreover, suppose the fuzzy pattern "we are more or less sure that the file is encrypted when its entropy is high" is considered. This pattern can be represented in the proposed system with the formula:
(entropy(high) -> encrypted, 0.7),
where high is a fuzzy constant and the degree 0.7 expresses how strongly it is believed that the file is encrypted given that its entropy is high.
From the knowledge {(entropy(around_20), 0.9), (entropy(high) -> encrypted, 0.7)}, the PGL+ system computes the degree of belief of the crisp property encrypted by suitably combining the degrees of belief 0.9 and 0.7 with the degree of partial matching between the fuzzy constants high and around_20.
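A minimal numerical sketch of this combination is given below. It assumes trapezoidal membership functions for around_20 and high, a discretised entropy domain, and a min-based combination of the two weights with the matching degree; all of these choices are assumptions of the sketch and not part of the disclosed knowledge base.

# Illustrative sketch only: degree of partial matching between fuzzy constants
# and its combination with the degrees of belief 0.9 and 0.7.

def godel_reciprocal(x, y):
    # Reciprocal of Goedel's many-valued implication: 1 if x <= y, else 1 - x.
    return 1.0 if x <= y else 1.0 - x

def partial_matching(target, source, domain):
    # Degree of partial matching N(target | source): the infimum over the domain
    # of source(u) => target(u), with => the reciprocal Goedel implication.
    return min(godel_reciprocal(source(u), target(u)) for u in domain)

def trapezoid(a, b, c, d):
    # Assumed trapezoidal membership function with support (a, d) and core [b, c].
    def mu(u):
        if u <= a or u >= d:
            return 0.0
        if b <= u <= c:
            return 1.0
        return (u - a) / (b - a) if u < b else (d - u) / (d - c)
    return mu

around_20 = trapezoid(18.0, 19.5, 20.5, 22.0)   # assumed shape of "around_20"
high = trapezoid(15.0, 18.0, 40.0, 41.0)        # assumed shape of "high"
domain = [u / 10.0 for u in range(0, 500)]      # assumed discretised entropy domain

matching = partial_matching(high, around_20, domain)
belief_encrypted = min(0.9, 0.7, matching)      # one plausible min-based combination
print(matching, belief_encrypted)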
In another embodiment, the inference procedure based on the PGL+ reasoning system is divided into three algorithms which are applied sequentially. First, a completion algorithm, which extends the set of rules and facts with all valid clauses by means of the following Generalized resolution and Fusion inference rules:
Generalized resolution:
Fusion:
Second, a translation algorithm, which translates the completed set of rules and facts into a semantically equivalent set of 1-weighted clauses by means of the following inference rules:
Intersection:
Resolving uncertainty:
Semantical unification:
where => is the reciprocal of Gödel's many-valued implication, defined as x => y = 1 if x <= y, and x => y = 1 - x otherwise.
Modus ponens:
And, finally, a deduction algorithm, based on the Semantical unification rule, which computes the maximum degree of possibilistic entailment of a goal from the equivalent set of 1-weighted facts.
The completion algorithm first computes the set of valid clauses that can be derived by applying the Generalized resolution rule (i.e. by chaining clauses). Then, from this new set of valid clauses, the algorithm computes all valid clauses that can be derived by applying the Fusion rule (i.e. by fusing clauses). As the Fusion rule stretches the body of rules and the Generalized resolution rule modifies the body or the head of rules, the chaining and fusion steps have to be repeated for as long as new valid clauses are derived. As the chaining and fusion steps cannot produce infinite loops and each valid clause is either an original clause or can be derived from at least two clauses, in the worst case each combination of clauses derives a different valid clause. Hence, given a finite set of N facts and rules, in the worst case the number of valid clauses is bounded by the number of possible combinations of the original clauses.
However, in general, only a reduced set of clauses can be combined to derive new valid clauses. Indeed, c1, c2 and c3 can derive a new valid clause if c1 and c2, c1 and c3, or c2 and c3 derive a valid clause different from c1, c2 and c3.
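Purely as an illustration of the fixpoint character of the completion step, the chaining and fusion loop could be organised as in the following Python sketch. Here combine_pair stands in for the Generalized resolution and Fusion rules, which are not reproduced here, and is therefore an assumption of the sketch.

# Illustrative sketch of the completion step as a fixpoint closure over pairs of clauses.

from itertools import combinations

def complete(clauses, combine_pair):
    # combine_pair(c1, c2) must return the (possibly empty) collection of valid
    # clauses derivable from the pair by Generalized resolution or Fusion.
    closed = set(clauses)
    changed = True
    while changed:                                  # repeat while new valid clauses appear
        changed = False
        for c1, c2 in combinations(tuple(closed), 2):
            for new_clause in combine_pair(c1, c2):
                if new_clause not in closed:
                    closed.add(new_clause)
                    changed = True
    return closed

The loop terminates because, as noted above, chaining and fusion cannot produce infinite loops, so the set of valid clauses reaches a fixpoint.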
The algorithm for translating a set of facts and rules into a set of 1-weighted clauses is based on the following result: the maximum degree of satisfiability of a goal q(C) can be computed from a single 1-weighted clause (q(Dq); 1) instead of considering the entire original knowledge base P, where P denotes the set of facts and rules. Then, as Dq can be determined just from the clauses of P+ whose heads are q or on whose heads q depends, and each such rule can be replaced by a fact by applying the Semantical unification and Modus ponens rules, Dq can be computed from this finite set of facts by applying the UN and IN rules. As only non-recursive programs are considered, the above mechanism can be recursively applied for determining ||p||_P for each predicate p such that q depends on p in P, and thus the time complexity of the translation algorithm is linear in the total number of occurrences of predicate symbols in P+.
Finally, if (q(Dq); 1) is the 1-weighted clause computed by the translation algorithm for a propositional variable q, we have that the maximum degree of satisfiability of q(C) is N(C | Dq), and thus, after applying the completion and translation algorithms, the maximum degree of satisfiability of a goal q(C) corresponds with the maximum degree of deduction of q(C) from P and can be computed with constant time complexity, in the sense that it amounts to computing the partial matching between two fuzzy constants: N(C | Dq) = inf_x (Dq(x) => C(x)), where => is the reciprocal of Gödel's many-valued implication.
One important feature of the inference procedure based on the PGL+ reasoning system is that, when the knowledge is extended with new facts, only the set of 1-weighted clauses must be computed again; the set of hidden clauses, which from a computational point of view is the hard counterpart of dealing with fuzzy constants, must be computed again only if new rules are added to the model.
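This incremental behaviour could be reflected in an implementation by caching the completed (hidden) clause set and invalidating it only when rules change. The class and method names below are assumptions of this sketch, not part of the disclosed system.

# Illustrative sketch: completion is cached and redone only when rules change;
# adding facts only triggers the cheaper recomputation of the 1-weighted clauses.

class PGLPlusKnowledgeBase:
    def __init__(self, rules, facts, complete_fn, translate_fn):
        self.rules = list(rules)
        self.facts = list(facts)
        self._complete_fn = complete_fn
        self._translate_fn = translate_fn
        self._hidden_clauses = None          # cached result of the completion algorithm

    def add_fact(self, fact):
        self.facts.append(fact)              # completion cache stays valid

    def add_rule(self, rule):
        self.rules.append(rule)
        self._hidden_clauses = None          # hidden clauses must be rebuilt

    def one_weighted_clauses(self):
        if self._hidden_clauses is None:
            self._hidden_clauses = self._complete_fn(self.rules)
        return self._translate_fn(self._hidden_clauses, self.facts)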
Various aspects of the proposed method may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors, or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.
All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of a scheduling system into the hardware platform(s) of a computing environment or other system implementing a computing environment or similar functionalities in connection with image processing. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as those used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible "storage" media, terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.
A machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s), or the like, which may be used to implement the system or any of its components shown in the drawings. Volatile storage media may include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media may include coaxial cables, copper wire and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media may include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.
Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described herein may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g. an installation on an existing server.
The present disclosure and/or some other examples have been described above. In view of the descriptions above, various alterations may be made. The subject matter of the present disclosure may be implemented in various forms and embodiments, and the present disclosure may further be applied in a variety of application programs. All applications, modifications and alterations required to be protected in the claims may be within the protection scope of the present disclosure.
The scope of the present invention is defined in the following set of claims.

Claims
1. A computer-implemented method for identifying a malicious file, the method comprising:
- performing a static analysis of a potentially malicious file, obtaining a set of features that provide an abstract view of the malicious file;
- performing a static machine learning classification process using as inputs said set of features, obtaining a preliminary classification output; and
- performing a fuzzy inference procedure based on possibilistic logic, for reasoning under possibilistic uncertainty and disjunctive vague knowledge, using as input variables said set of features and said preliminary classification output, generating an enhanced classification output that identifies the potentially malicious file as a malicious file or as a benign file.
2. A computer-implemented method according to claim 1, comprising:
- performing several static analyses of different types of the potentially malicious file, obtaining corresponding sets of features that provide abstract views of the malicious file;
- performing the static machine learning classification process using as inputs said sets of features, obtaining the preliminary classification output; and
- performing the fuzzy inference procedure based on possibilistic logic using as input variables the sets of features and the preliminary classification output.
3. A computer-implemented method according to claim 1, comprising:
- performing several static analyses of different types of the potentially malicious file, obtaining corresponding sets of features that provide abstract views of the malicious file;
- performing several static machine learning classification processes, each using as inputs at least a respective one of the sets of features, obtaining several corresponding preliminary classification outputs; and
- performing the fuzzy inference procedure based on possibilistic logic using as input variables the sets of features and the preliminary classification outputs.
4. A computer-implemented method according to claim 3, comprising performing a further static machine learning classification process, using as inputs several or all of the sets of features, obtaining a corresponding further preliminary classification output; and
- performing the fuzzy inference procedure based on possibilistic logic using as input variable also the further preliminary classification output.
5. A computer-implemented method according to any one of the previous claims, wherein the fuzzy inference procedure comprises a fuzzification process that converts the input variables into fuzzy variables.
6. A computer-implemented method according to claim 5, wherein the fuzzification process comprises deriving membership functions relating the input variables with output variables through membership degrees of values of the input variables in predefined fuzzy sets, and representing the membership functions with linguistic variables, the linguistic variables being the fuzzy variables.
7. A computer-implemented method according to claim 5 or 6, wherein the fuzzy inference procedure further comprises an inference decision-making process comprising firing fuzzy possibilistic rules with values of the linguistic variables for the input variables, generating a fuzzy output that identifies a degree of belief that the potentially malicious file has to be a malicious file or a benign file.
8. A computer-implemented method according to claim 7, further comprising selecting which fuzzy possibilistic rules to fire in the inference decision-making process, based on at least the values of the linguistic variables for the input variables.
9. A computer-implemented method according to claim 7 or 8, wherein the fuzzy inference procedure further comprises a defuzzification process that converts said fuzzy possibilistic output into a crisp output, wherein said crisp output constitutes the enhanced classification output.
10. A computer-implemented method according to any of the previous claims, wherein the set or sets of features comprise at least one of the following:
- the frequency of use of Application Programming Interfaces (API) and their function calls;
- the representation of an executable file as a stream of entropy values, where each value describes the amount of entropy over a small chunk of code in a specific location of the potentially malicious file;
- the sequence of assembly language instructions executed by a software program constituting the potentially malicious file, in particular, the operational codes of the machine language instructions;
- the representation of an executable file, constituting the potentially malicious file, as an image, where every byte is interpreted as one pixel in the image, wherein the resulting array is organized as a 2-D array and visualized as a gray scale image;
- applicable program characteristics, at least including alphanumeric strings occurring in the body of the software program constituting the potentially malicious file and the fields from the header of the potentially malicious file.
11. A computer-implemented method according to claim 10, wherein the sets of features comprise at least two of the feature sets.
12. A computer-implemented method according to any of the previous claims, wherein the fuzzy inference procedure based on possibilistic logic is based on a PGL+ algorithm.
13. A computer-implemented method according to claim 12, wherein the PGL+ algorithm comprises applying three algorithms sequentially: a first algorithm that extends the fuzzy possibilistic rules by means of implementing a first set of rules; a second algorithm that translates the fuzzy possibilistic rules into a semantically equivalent set of 1-weighted clauses by means of implementing a second set of rules; and a third algorithm that computes a maximum degree of possibilistic entailment of a goal from the equivalent set of 1-weighted clauses.
14. A computing system for identifying a malicious file, comprising:
- a preprocessing computing module (110), configured and arranged to perform a static analysis of a potentially malicious file (100) to obtain a set of features that provide an abstract view of the malicious file;
- a machine learning module (120), configured and arranged to perform a static machine learning classification process using as inputs said set of features, to obtain a preliminary classification output; and
- a fuzzy inference module (130), configured and arranged to perform a fuzzy inference procedure based on possibilistic logic using as input variables said set of features and said preliminary classification output, to generate an enhanced possibilistic classification output (140) that identifies the potentially malicious file (100) as a malicious file or as a benign file.
15. A system according to claim 14, wherein:
- the preprocessing computing module (110) is further configured and arranged to perform several static analyses of different types of the potentially malicious file (100) to obtain corresponding sets of features that provide abstract views of the malicious file;
- the machine learning module (120) is further configured and arranged to perform the static machine learning classification process using as inputs the sets of features, to obtain the preliminary classification output; and
- the fuzzy inference module (130) is further configured and arranged to perform the fuzzy inference procedure based on possibilistic logic using as input variables the sets of features and the preliminary classification output.
16. A system according to claim 14, wherein:
- the preprocessing computing module (110) is further configured and arranged to perform several static analyses of different types of the potentially malicious file (100) to obtain corresponding sets of features that provide abstract views of the malicious file;
- the machine learning module (120) is further configured and arranged to perform several static machine learning classification processes, each using as inputs at least a respective one of the sets of features, to obtain several corresponding preliminary classification outputs; and
- the fuzzy inference module (130) is further configured and arranged to perform the fuzzy inference procedure based on possibilistic logic using as input variables the sets of features and the preliminary classification outputs.
17. A non-transitory computer program product comprising computer executable software stored on a computer readable medium, the software being adapted to run on a computer or other processing means, characterized in that, when said computer executable software is loaded and read by said computer or other processing means, said computer or other processing means is able to perform the steps of the method according to any of claims 1-13.
EP20744065.2A 2019-07-30 2020-07-29 A computer-implemented method, a system and a computer program for identifying a malicious file Pending EP4004827A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP19382656 2019-07-30
PCT/EP2020/071334 WO2021018929A1 (en) 2019-07-30 2020-07-29 A computer-implemented method, a system and a computer program for identifying a malicious file

Publications (1)

Publication Number Publication Date
EP4004827A1 true EP4004827A1 (en) 2022-06-01

Family

ID=67514512

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20744065.2A Pending EP4004827A1 (en) 2019-07-30 2020-07-29 A computer-implemented method, a system and a computer program for identifying a malicious file

Country Status (2)

Country Link
EP (1) EP4004827A1 (en)
WO (1) WO2021018929A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11755728B2 (en) * 2020-12-08 2023-09-12 Mcafee, Llc Systems, methods, and media for analyzing structured files for malicious content
CN114036521B (en) * 2021-11-29 2024-05-03 北京航空航天大学 Method for generating countermeasure sample of Windows malicious software

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100192222A1 (en) 2009-01-23 2010-07-29 Microsoft Corporation Malware detection using multiple classifiers
GB2520987B (en) 2013-12-06 2016-06-01 Cyberlytic Ltd Using fuzzy logic to assign a risk level profile to a potential cyber threat
US10789367B2 (en) * 2014-04-18 2020-09-29 Micro Focus Llc Pre-cognitive security information and event management
US9495633B2 (en) 2015-04-16 2016-11-15 Cylance, Inc. Recurrent neural networks for malware analysis
US9690938B1 (en) 2015-08-05 2017-06-27 Invincea, Inc. Methods and apparatus for machine learning based malware detection
US9721097B1 (en) 2016-07-21 2017-08-01 Cylance Inc. Neural attention mechanisms for malware analysis

Also Published As

Publication number Publication date
WO2021018929A1 (en) 2021-02-04

Similar Documents

Publication Publication Date Title
Gibert et al. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges
Aslan et al. A new malware classification framework based on deep learning algorithms
Tran et al. NLP-based approaches for malware classification from API sequences
Kumar et al. Malicious code detection based on image processing using deep learning
Wang et al. Review of android malware detection based on deep learning
Mahdavifar et al. DeNNeS: deep embedded neural network expert system for detecting cyber attacks
Lu Malware detection with lstm using opcode language
Amer et al. A multi-perspective malware detection approach through behavioral fusion of api call sequence
Li et al. I-mad: Interpretable malware detector using galaxy transformer
Yan et al. A survey of adversarial attack and defense methods for malware classification in cyber security
Eke et al. The use of machine learning algorithms for detecting advanced persistent threats
Sovilj et al. A comparative evaluation of unsupervised deep architectures for intrusion detection in sequential data streams
Ring et al. Malware detection on windows audit logs using LSTMs
WO2021018929A1 (en) A computer-implemented method, a system and a computer program for identifying a malicious file
Bhaskara et al. Emulating malware authors for proactive protection using GANs over a distributed image visualization of dynamic file behavior
Chan et al. Robustness analysis of classical and fuzzy decision trees under adversarial evasion attack
Silivery et al. A model for multi-attack classification to improve intrusion detection performance using deep learning approaches
Van Ouytsel et al. Analysis of machine learning approaches to packing detection
Gayathri et al. Adversarial training for robust insider threat detection
Nofal et al. SQL injection attacks detection and prevention based on neuro-fuzzy technique
Nofal et al. SQL injection attacks detection and prevention based on neuro—fuzzy technique
CN114969734B (en) Lesovirus variant detection method based on API call sequence
Otsubo et al. Compiler provenance recovery for multi-cpu architectures using a centrifuge mechanism
Hamad et al. BERTDeep-Ware: A Cross-architecture Malware Detection Solution for IoT Systems
Zhang Clement: Machine learning methods for malware recognition based on semantic behaviours

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220204

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)