GB2590916A - Steganographic malware detection - Google Patents

Steganographic malware detection

Info

Publication number
GB2590916A
Authority
GB
United Kingdom
Prior art keywords
image
malware
data
classifier
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2000083.2A
Other versions
GB202000083D0 (en)
Inventor
Kallos George
El-Moussa Fadi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Telecommunications PLC
Original Assignee
British Telecommunications PLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Telecommunications PLC filed Critical British Telecommunications PLC
Priority to GB2000083.2A priority Critical patent/GB2590916A/en
Publication of GB202000083D0 publication Critical patent/GB202000083D0/en
Publication of GB2590916A publication Critical patent/GB2590916A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/0021Image watermarking
    • G06T1/0028Adaptive watermarking, e.g. Human Visual System [HVS]-based watermarking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N1/00Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
    • H04N1/32Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
    • H04N1/32101Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
    • H04N1/32144Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title embedded in the image data, i.e. enclosed or integrated in the image, e.g. watermark, super-imposed logo or stamp
    • H04N1/32352Controlling detectability or arrangements to facilitate detection or retrieval of the embedded information, e.g. using markers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2201/00General purpose image data processing
    • G06T2201/005Image watermarking
    • G06T2201/0201Image watermarking whereby only tamper or origin are detected and no embedding takes place

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

A computer implemented method of detecting data stored within an image for or by malware, the method comprising: for each of a plurality of input images, storing a portion of malware data within the input image by a steganographic process to create a respective second image; training a classifier to classify each second image to its respective input image; receiving a third image; and detecting malware data within the third image by executing the classifier based on the third image.

Description

Steganographic Malware Detection

The present invention relates to the detection of malicious software concealed using steganography.
Steganography is the concealment of information within other information, such as concealing a file, message, image, or video within another file, message, image, or video. Steganographic techniques are increasingly employed for embedding data within digital images where the data is hidden in the content of the image. Malicious software (malware) uses steganography to store executable code, command and control instructions and/or parameters, and/or stolen information within the content of images such as images that are communicated via websites. The use of such techniques for the malicious communication of information introduces additional challenges for the detection, mitigation and remediation of malware in computer systems and networks.
Accordingly, it is beneficial to provide improvements in the detection of malware.
According to a first aspect of the present invention, there is provided a computer implemented method of detecting data stored within an image for or by malware, the method comprising: for each of a plurality of input images, storing a portion of malware data within the input image by a steganographic process to create a respective second image; training a classifier to classify each second image to its respective input image; receiving a third image; and detecting malware data within the third image by executing the classifier based on the third image.
Preferably, the malware is detected within the third image based on a degree of confidence of classification of the third image by the classifier.
Preferably, the malware data includes one or more of: executable malware code; and malware command and/or control instructions.
Preferably, the steganographic process is one of: an image domain steganographic process in which the portion of malware data is stored in the input image by adjusting an intensity of pixels in the input image; and a transform domain steganographic process in which the portion of malware data is stored in the input image by transforming the input image and then storing the malware data in the input image.
Preferably, the classifier is trained using backpropagation.
According to a second aspect of the present invention, there is provided a computer system including a processor and memory storing computer program code for performing the steps of the method set out above.
According to a third aspect of the present invention, there is provided a computer program element comprising computer program code to, when loaded into a computer system and executed thereon, cause the computer to perform the steps of the method set out above.
Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 is a block diagram of a computer system suitable for the operation of embodiments of the present invention;
Figure 2 is a component diagram of an arrangement for detecting data stored within an image for or by malware in accordance with an embodiment of the present invention;
Figure 3 is a flowchart of a method for detecting data stored within an image for or by malware in accordance with an embodiment of the present invention;
Figure 4 is a component diagram of an arrangement for detecting data stored within an image for or by malware in accordance with an embodiment of the present invention; and
Figure 5 is a flowchart of a method for detecting data stored within an image for or by malware in accordance with an embodiment of the present invention.
Figure 1 is a block diagram of a computer system suitable for the operation of embodiments of the present invention. A central processor unit (CPU) 102 is communicatively connected to a storage 104 and an input/output (I/O) interface 106 via a data bus 108. The storage 104 can be any read/write storage device such as a random-access memory (RAM) or a non-volatile storage device. An example of a non-volatile storage device includes a disk or tape storage device. The I/O interface 106 is an interface to devices for the input or output of data, or for both input and output of data. Examples of I/O devices connectable to I/O interface 106 include a keyboard, a mouse, a display (such as a monitor) and a network connection.
Figure 2 is a component diagram of an arrangement for detecting data stored within an image for or by malware in accordance with an embodiment of the present invention.
A classifier 200 is provided as a machine learning component suitable for generating a classification as an output based on an input set of parameters. For example, the classifier 200 can be implemented as a neural network, autoencoder or support vector machine, though other suitable classifiers exist and may become available. The classifier is configured to accept, as an input data set, a data structure corresponding to image data, such as a vector or matrix representation of an image including, for example, image pixel data encoded using a plurality of colour values. Such data is preferably normalised; for example, a colour value occurring in the range 0 to 255 can be mapped to a normalised range of 0 to 1 using well-known techniques for numeric normalisation. The classifier 200 is trained to generate an output classification indicative of a correspondence between an input data set and an output data set of the classifier; for example, an output of the classifier can be a vector or matrix representation. In accordance with embodiments of the present invention, the classifier 200 is arranged to take image data as an input and classify such image data to other image data as an output. In particular, the classifier can be provided as a feedforward neural network trained using a supervised back-propagation algorithm. Such training can be provided by way of a trainer (not illustrated), such as a hardware, software, firmware or combination component arranged to provide classifier training functionality based on training data provided as a plurality of training examples.
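The normalisation described above can be sketched as follows. This is an illustrative fragment only, not part of the claimed method; the function name is chosen for this example.

```python
def normalise_pixels(pixels, max_value=255):
    """Scale integer colour values in 0..max_value into the range [0, 1]
    so they are suitable as classifier input features."""
    return [p / max_value for p in pixels]

# A row of 8-bit grey values becomes a normalised feature vector.
print(normalise_pixels([0, 51, 255]))
```

The same scaling applies per colour channel when pixels carry a plurality of colour values.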
In accordance with embodiments of the present invention, the classifier 200 is trained based on a plurality of input images 202 such as bitmapped, raster or other suitable images.
Each input image used for training the classifier is subjected to a steganographic process 206 by which at least a portion of malware data 204 is stored in the input image to create a second image. Any suitable steganographic process can be employed such as any of the steganographic processes described in "An Overview of Image Steganography" (Morkel, T., Eloff, J.H. and Olivier, M.S. (2005), Information Security South Africa, Johannesburg, 29 June-1 July 2005, 1-11). Most preferably, the steganographic process 206 employed for generating the second image for training the classifier 200 is a steganographic process known to be utilised by a malware for communicating malware data within images.
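For illustration, one widely known image-domain technique of the kind surveyed by Morkel et al. is least-significant-bit (LSB) embedding, in which payload bits replace the low-order bit of successive pixel values. The sketch below is a minimal example of such a process under that assumption, not the specific process of any particular malware; all names are chosen for this example.

```python
def embed_lsb(pixels, payload):
    """Store the payload in the least significant bit of successive
    pixel intensity values, most significant bit of each byte first."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("image too small for payload")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # each pixel changes by at most 1
    return stego

def extract_lsb(pixels, n_bytes):
    """Recover n_bytes of payload from the pixel least significant bits."""
    bits = [p & 1 for p in pixels[: n_bytes * 8]]
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[j : j + 8]))
        for j in range(0, len(bits), 8)
    )
```

Because each pixel intensity changes by at most one, the modification is imperceptible to a viewer, which is what makes such images useful training examples for the classifier 200.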
The malware data 204 stored in the input image to generate the second image can be any malware data including: executable malware code; malware script; malware command instructions; and malware control instructions.
The classifier 200 is trained using a combination of an input image (unamended by steganography) and the corresponding second image (the input image amended by the steganographic process 206), so as to train the classifier 200 to classify the second image to the input image. Thus, training the classifier 200 with multiple (and potentially many) such input-image and second-image pairs results in the classifier 200 being adapted to classify any image containing data stored using the steganographic process 206 to an original of such image with a greater degree of confidence than an image devoid of steganographically stored content. That is to say that a new image, such as third image 208, containing data stored therein using the steganographic process 206 will be classified by the classifier 200 to an output vector with a high degree of confidence, the output vector corresponding to an original version of such third image 208 prior to the application of the steganographic process 206. Furthermore, should the third image 208 not contain data stored therein using the steganographic process 206, then the classifier 200 will classify the third image 208 with a lower degree of confidence (or not at all), indicative of an absence of data stored therein using the steganographic process 206. Accordingly, the trained classifier 200 can be considered to encode the effect of the steganographic process 206 on an input image by confidently classifying such an image, and thus constitutes an effective measure for detecting the application of the steganographic process 206 in any image.
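The confidence-based detection decision can be sketched as a simple threshold on the classifier's most confident output. The softmax conversion and the 0.9 threshold below are illustrative assumptions, not values prescribed by the embodiment.

```python
import math

def softmax(scores):
    """Convert raw classifier scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def detect_stego(scores, threshold=0.9):
    """Flag an image as steganographically modified when the classifier
    maps it to some original image with confidence above the threshold."""
    return max(softmax(scores)) >= threshold
```

A confidently classified image (one score dominating) is flagged; an image the classifier cannot map to any original with confidence is not.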
Responsive to the classifier 200 classifying a new image 208 as including data stored therein using the steganographic process 206, a responder component 210, as a hardware, software, firmware or combination component, can be configured to provide a responsive action to such classification. For example, the responder component 210 can implement, trigger or provide responsive action(s) such as: isolating, quarantining or deleting the image 208; triggering further scanning of the image 208; alerting a user to the existence of the image 208; dispatching, sending or otherwise communicating the image 208 to a malware reporting, scanning or protection component; utilising the image 208 as input to train a further, additional or downstream malware detection component; adding the image 208 to a register of detected malware; and other responsive measures as will be apparent to those skilled in the art.
Figure 3 is a flowchart of a method for detecting data stored within an image for or by malware in accordance with an embodiment of the present invention. Initially, at step 302, the method loops through each of a plurality of input images 202. At step 304, for a current input image 202, the method stores a portion of malware data 204 in the image using a steganographic process 206 to create a second image. The method loops through all input images at step 306. At step 307 the method trains a classifier 200 to classify each second image to each respective input image. At step 308 a third image 208 is received and the classifier is executed at step 310 to determine if the third image 208 can be confidently classified to indicate that the third image 208 includes data stored therein using the steganographic process 206. Where such storage of data is detected at step 312, the method triggers responsive actions at step 314.
Figure 4 is a component diagram of an arrangement for detecting data stored within an image for or by malware in accordance with an embodiment of the present invention. Many of the elements of Figure 4 are identical to those described above with respect to Figure 2 and these will not be repeated here. The arrangement of Figure 4 reflects the potential for particular steganographic processes to indicate a particular type of malware such that a particular malware may consistently utilise one or more particular steganographic techniques to store malware data within images. Figure 4 is arranged to provide the malware detection of Figure 2 with additional malware identification based on a plurality of steganographic processes 406a and 406b. Thus, multiple classifiers are provided 400a, 400b each being trained based on a plurality of input images 202 in which malware data 204 is stored using different steganographic processes 406a, 406b. Thus, classifier 400a is trained based on input images 202 in which malware data 204 is stored using a first steganographic process 406a, so as to generate first images as training data for first classifier 400a. Similarly, classifier 400b is trained based on input images 202 in which malware data 204 is stored using a second steganographic process 406b, so as to generate second images as training data for second classifier 400b.
Subsequently, in use, a third image such as image 408 is classified by each of the first and second classifiers 400a, 400b so as to detect the presence of malware data stored in the third image 408 and, additionally, to determine which of the steganographic processes 406a, 406b was used in storing such malware data in the third image 408. The determination of the steganographic process used can be made based on a confidence of classification by each of the first and second classifiers 400a, 400b, such that a more confident classification by a classifier indicates the steganographic process used to train that classifier. Malware being stored in images using a particular steganographic process is indicative of the type of malware, and the determination of the steganographic process used based on the first and second classifiers 400a, 400b thus serves to identify the malware, type of malware or category of malware used in the third image 408.
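The selection among classifiers described above amounts to taking the most confident classifier, subject to a minimum confidence. The process names, the confidence mapping and the threshold in this sketch are assumptions for illustration only.

```python
def identify_process(confidence_by_process, threshold=0.9):
    """Given each classifier's confidence keyed by the steganographic
    process it was trained on, return the most likely process, or None
    when no classifier is confident enough (no embedding detected)."""
    name, conf = max(confidence_by_process.items(), key=lambda kv: kv[1])
    return name if conf >= threshold else None
```

For example, a high confidence from the classifier trained on an LSB-style process, alongside a low confidence from a transform-domain classifier, identifies the former and hence the associated malware type.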
While a common set of input images 202 and malware data 204 is depicted in Figure 4, it will be apparent to those skilled in the art that a separate set, or intersecting sets, of input images and malware data may be used for training each of the first and second classifiers 400a, 400b. Notably, while two classifiers are depicted in Figure 4 and described herein, any number of classifiers may be used, each corresponding to any number of steganographic processes, so as to distinguish potentially multiple malwares or malware types. In one embodiment, particular malware or malware types are not known when training the classifiers and, in such embodiments, the application of multiple classifiers serves to categorise malware stored in images 408 into groups of malware that can be considered similar or related, even where the particular malware is unknown (such as during a zero-day attack). Furthermore, the output of multiple classifiers for multiple images 408 can be clustered using, for example, k-means clustering techniques, to group unknown or partially known malwares for similar handling, such as for similar responsive actions by the responder 410.
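The clustering of classifier outputs mentioned above can be sketched with a plain k-means implementation over per-image confidence vectors. This standalone version (fixed iteration count, seeded random initial centroids) is illustrative only and stands in for any suitable clustering library.

```python
import math
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Cluster classifier confidence vectors so that images carrying
    similar steganographic signatures are grouped for common handling."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # k distinct starting centroids
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        groups = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            groups[i].append(v)
        # Move each centroid to the mean of its assigned vectors.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups
```

Each vector here might hold one confidence per classifier, so images embedded by the same (possibly unknown) process fall into the same group and receive the same responsive action.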
Figure 5 is a flowchart of a method for detecting data stored within an image for or by malware in accordance with an embodiment of the present invention. Initially, at step 502, the method loops through each of a plurality of input images 202 (noting that, in some embodiments, different input images may be used for different classifiers). At step 504, for a current input image 202, the method stores a portion of malware data 204 in the image using a first steganographic process 406a to create a first image. At step 506, for the current input image 202, the method stores a portion of malware data 204 in the image using a second steganographic process 406b to create a second image. The method loops through all input images at step 508. At step 510 the method trains a first classifier 400a to classify each first image to each respective input image. At step 512 the method trains a second classifier 400b to classify each second image to each respective input image. At step 514 a third image 408 is received and the first and second classifiers 400a, 400b are executed at step 516 to determine if the third image 408 can be confidently classified to indicate that the third image 408 includes data stored therein using one of the steganographic processes 406a, 406b. The malware or a type of the malware is identified or categorised based on a degree of confidence of classification by each of the classifiers 400a, 400b. Where such storage of data is detected at step 518, the method triggers responsive actions at step 520 where the responsive actions are based on the identified malware type.
Insofar as embodiments of the invention described are implementable, at least in part, using a software-controlled programmable processing device, such as a microprocessor, digital signal processor or other processing device, data processing apparatus or system, it will be appreciated that a computer program for configuring a programmable device, apparatus or system to implement the foregoing described methods is envisaged as an aspect of the present invention. The computer program may be embodied as source code or undergo compilation for implementation on a processing device, apparatus or system or may be embodied as object code, for example.
Suitably, the computer program is stored on a carrier medium in machine or device readable form, for example in solid-state memory, magnetic memory such as disk or tape, optically or magneto-optically readable memory such as compact disk or digital versatile disk etc., and the processing device utilises the program or a part thereof to configure it for operation. The computer program may be supplied from a remote source embodied in a communications medium such as an electronic signal, radio frequency carrier wave or optical carrier wave. Such carrier media are also envisaged as aspects of the present invention.
It will be understood by those skilled in the art that, although the present invention has been described in relation to the above described example embodiments, the invention is not limited thereto and that there are many possible variations and modifications which fall within the scope of the invention.
The scope of the present invention includes any novel features or combination of features disclosed herein. The applicant hereby gives notice that new claims may be formulated to such features or combination of features during prosecution of this application or of any such further applications derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the claims.

Claims (7)

  1. A computer implemented method of detecting data stored within an image for or by malware, the method comprising: for each of a plurality of input images, storing a portion of malware data within the input image by a steganographic process to create a respective second image; training a classifier to classify each second image to its respective input image; receiving a third image; and detecting malware data within the third image by executing the classifier based on the third image.
  2. The method of claim 1 wherein the malware is detected within the third image based on a degree of confidence of classification of the third image by the classifier.
  3. The method of any preceding claim wherein the malware data includes one or more of: executable malware code; and malware command and/or control instructions.
  4. The method of any preceding claim wherein the steganographic process is one of: an image domain steganographic process in which the portion of malware data is stored in the input image by adjusting an intensity of pixels in the input image; and a transform domain steganographic process in which the portion of malware data is stored in the input image by transforming the input image and then storing the malware data in the input image.
  5. The method of any preceding claim wherein the classifier is trained using backpropagation.
  6. A computer system including a processor and memory storing computer program code for performing the steps of the method of any preceding claim.
  7. A computer program element comprising computer program code to, when loaded into a computer system and executed thereon, cause the computer to perform the steps of a method as claimed in any of claims 1 to 5.
GB2000083.2A 2020-01-05 2020-01-05 Steganographic malware detection Pending GB2590916A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2000083.2A GB2590916A (en) 2020-01-05 2020-01-05 Steganographic malware detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2000083.2A GB2590916A (en) 2020-01-05 2020-01-05 Steganographic malware detection

Publications (2)

Publication Number Publication Date
GB202000083D0 (en) 2020-02-19
GB2590916A true GB2590916A (en) 2021-07-14

Family

ID=69527812

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2000083.2A Pending GB2590916A (en) 2020-01-05 2020-01-05 Steganographic malware detection

Country Status (1)

Country Link
GB (1) GB2590916A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487428B (en) * 2020-11-26 2022-03-11 南方电网数字电网研究院有限公司 Dormant combined computer virus discovery method based on block chain

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509775A (en) * 2018-02-08 2018-09-07 暨南大学 A kind of malice PNG image-recognizing methods based on machine learning
KR20190057726A (en) * 2017-11-20 2019-05-29 경일대학교산학협력단 Apparatus for detecting and extracting image having hidden data using artificial neural network, method thereof and computer recordable medium storing program to perform the method
US20190182268A1 (en) * 2017-12-07 2019-06-13 Mcafee, Llc Methods, systems and apparatus to mitigate steganography-based malware attacks


Also Published As

Publication number Publication date
GB202000083D0 (en) 2020-02-19
