US20210218755A1 - Facet Blacklisting in Anomaly Detection - Google Patents

Facet Blacklisting in Anomaly Detection

Info

Publication number
US20210218755A1
Authority
US
United States
Prior art keywords
subhash
file
subhashes
hash
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/219,492
Inventor
Douglas Stuart Swanson
Mina Yousseif
Jon-Paul Lussier, JR.
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Malwarebytes Corporate Holdco Inc
Original Assignee
Malwarebytes Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Malwarebytes Inc filed Critical Malwarebytes Inc
Priority to US17/219,492 priority Critical patent/US20210218755A1/en
Assigned to MALWAREBYTES INC. reassignment MALWAREBYTES INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SWANSON, DOUGLAS STUART, LUSSIER, JON-PAUL, JR., YOUSSEIF, MINA
Publication of US20210218755A1 publication Critical patent/US20210218755A1/en
Assigned to COMPUTERSHARE TRUST COMPANY, N.A., AS ADMINISTRATIVE AGENT reassignment COMPUTERSHARE TRUST COMPANY, N.A., AS ADMINISTRATIVE AGENT INTELLECTUAL PROPERTY SECURITY AGREEMENT Assignors: MALWAREBYTES INC.
Assigned to COMPUTERSHARE TRUST COMPANY, N.A., AS ADMINISTRATIVE AGENT reassignment COMPUTERSHARE TRUST COMPANY, N.A., AS ADMINISTRATIVE AGENT INTELLECTUAL PROPERTY SECURITY AGREEMENT Assignors: MALWAREBYTES CORPORATE HOLDCO INC.
Assigned to MALWAREBYTES CORPORATE HOLDCO INC. reassignment MALWAREBYTES CORPORATE HOLDCO INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MALWAREBYTES INC.
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1441 Countermeasures against malicious traffic
    • H04L 63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/13 File access structures, e.g. distributed indices
    • G06F 16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • G06F 16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55 Detecting local intrusion or implementing counter-measures
    • G06F 21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L 63/101 Access control lists [ACL]

Definitions

  • the present disclosure generally relates to preventing malware and more specifically to reducing false positives in malware detection.
  • a method determines whether files are blacklisted.
  • a first full hash and a first set of subhashes are received from a client.
  • the first full hash is a hash of a first file and each subhash in the first set of subhashes is a hash of a facet of the first file.
  • a facet is a non-code portion of a file. It is determined whether the first full hash is blacklisted. Responsive to determining the first full hash is blacklisted, for each subhash in the first set of subhashes, an associated malicious count is updated. Each malicious count tracks a historic number of blacklisted files with which the subhash is associated. Responsive to a first malicious count of the malicious counts exceeding a threshold malicious count, the subhash associated with the first malicious count is added to a subhash blacklist.
  • a second full hash and a second set of subhashes are received from the client.
  • the second full hash is a hash of a second file and each subhash in the second set of subhashes is a hash of a facet of the second file. It is determined whether the second full hash is blacklisted. Responsive to determining the second full hash is not blacklisted, it is determined whether at least one subhash in the second set of subhashes is included in the subhash blacklist. Responsive to determining that at least one subhash in the second set of subhashes is included in the subhash blacklist, the second file is determined to be blacklisted. It is reported to the client that the second file is blacklisted.
  • a non-transitory computer-readable storage medium stores instructions that when executed by a processor causes the processor to execute the above-described method.
  • a computer system includes a processor and a non-transitory computer-readable storage medium that stores instructions for executing the above-described method.
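The claimed method can be sketched in Python as follows. The class, method names, and in-memory data structures (`SubhashBlacklistServer`, `check_file`, sets and counters) are illustrative assumptions for exposition, not the patent's implementation.

```python
from collections import defaultdict


class SubhashBlacklistServer:
    """Sketch: learn a subhash blacklist from full-hash blacklist hits,
    then use it to classify files whose full hash is unknown."""

    def __init__(self, full_hash_blacklist, threshold):
        self.full_hash_blacklist = set(full_hash_blacklist)
        self.threshold = threshold                 # threshold malicious count
        self.malicious_counts = defaultdict(int)   # per-subhash historic count
        self.subhash_blacklist = set()

    def check_file(self, full_hash, subhashes):
        """Return True if the file is determined to be blacklisted."""
        if full_hash in self.full_hash_blacklist:
            # Known-bad file: update each facet's malicious count, and
            # promote facets seen in too many blacklisted files.
            for sh in subhashes:
                self.malicious_counts[sh] += 1
                if self.malicious_counts[sh] > self.threshold:
                    self.subhash_blacklist.add(sh)
            return True
        # Unknown full hash: blacklist the file if any of its facet
        # subhashes is on the learned subhash blacklist.
        return any(sh in self.subhash_blacklist for sh in subhashes)
```

For example, after two blacklisted files share a facet subhash, a third file carrying only that facet is reported as blacklisted even though its full hash was never seen before.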
  • FIG. 1 is a system diagram illustrating an example embodiment of an environment in which a protection application and a security server execute.
  • FIG. 2 is a diagram characterizing files that illustrates a process for generating anomaly scores, according to one embodiment.
  • FIG. 3 is a block diagram illustrating an example embodiment of a security server.
  • FIG. 4 is a flowchart illustrating an embodiment of a process for classifying files based on facets.
  • FIG. 5 is a flowchart illustrating an embodiment of a process for classifying files based on subhashes.
  • a protection application classifies files on a client and remediates files classified as malware.
  • the protection application may send a full hash of one or more files to a security server to check whether the files are included in a file whitelist. If the full hash of a file is included in the whitelist, the security server informs the protection application that the file is clean. If the full hash is not included in the whitelist, the security server checks whether one or more hashes of facets of the file (“subhashes”) are included in a subhash whitelist. If so, the security server reports the file as clean to the protection application. Otherwise, if the file and its facets are unknown to the security server, the protection application may determine how to classify the file based on a file classification model.
  • the subhash whitelist may beneficially be automatically updated over time using a learning technique that learns which facets are representative of clean files and may be added to the subhash whitelist, and which facets may be associated with malware and should be removed from the subhash whitelist.
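The whitelist lookup order described above can be sketched as follows; the function name, the "clean" label, and the `score_locally` fallback callable are illustrative assumptions.

```python
def classify_with_whitelists(full_hash, subhashes, file_whitelist,
                             subhash_whitelist, score_locally):
    """Sketch of the server-side whitelist check: full hash first, then
    facet subhashes, then fall back to the client's classification model."""
    if full_hash in file_whitelist:
        return "clean"                       # known clean file
    if any(sh in subhash_whitelist for sh in subhashes):
        return "clean"                       # a facet matches a clean facet
    return score_locally()                   # unknown to the server
```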
  • FIG. 1 is a high-level block diagram illustrating a system environment 100 for a protection application and a security server.
  • the system environment 100 comprises a security server 105 , a network 110 , and a client 120 (also referred to as a client device 120 ).
  • the system environment 100 may include different or additional entities.
  • the security server 105 is a computer system configured to store, receive, and transmit data to clients 120 or to other servers via the network 110 .
  • the security server 105 may include a singular computing system, such as a single computer, or a network of computing systems, such as a data center or a distributed computing system.
  • the security server 105 may receive requests for data from clients 120 and respond by transmitting the requested data to the clients 120 .
  • the security server 105 includes a database of information about known malware (e.g., a blacklist), clean files (e.g., a whitelist), or both. Further, the security server 105 may lookup files in whitelists or blacklists of the database and provide results of the lookup to clients 120 .
  • the security server 105 is described in further detail below with reference to FIG. 3 .
  • the network 110 represents the communication pathways between the security server 105 and clients 120 .
  • the network 110 is the Internet.
  • the network 110 can also utilize dedicated or private communications links that are not necessarily part of the Internet.
  • the network 110 uses standard communications technologies and/or protocols.
  • the network 110 can include links using technologies such as Ethernet, Wi-Fi (802.11), integrated services digital network (ISDN), digital subscriber line (DSL), asynchronous transfer mode (ATM), etc.
  • the networking protocols used on the network 110 can include multiprotocol label switching (MPLS), the transmission control protocol/Internet protocol (TCP/IP), the hypertext transport protocol (HTTP), the simple mail transfer protocol (SMTP), the file transfer protocol (FTP), etc.
  • the links use mobile networking technologies, including general packet radio service (GPRS), enhanced data GSM environment (EDGE), long term evolution (LTE), code division multiple access 2000 (CDMA2000), and/or wide-band CDMA (WCDMA).
  • the data exchanged over the network 110 can be represented using technologies and/or formats including the hypertext markup language (HTML), the extensible markup language (XML), the wireless access protocol (WAP), the short message service (SMS) etc.
  • all or some of the links can be encrypted using conventional encryption technologies such as the secure sockets layer (SSL), Secure HTTP and/or virtual private networks (VPNs).
  • the entities can use custom and/or dedicated data communications technologies instead of, or in addition to, the ones described above.
  • Each client 120 comprises one or more computing devices capable of processing data as well as transmitting and receiving data via a network 110 .
  • a client 120 may be a desktop computer, a laptop computer, a mobile phone, a tablet computing device, an Internet of Things (IoT) device, or any other device having computing and data communication capabilities.
  • Each client 120 includes a processor 125 for manipulating and processing data, and a storage medium 130 for storing data and program instructions associated with various applications.
  • the storage medium 130 may include both volatile memory (e.g., random access memory) and non-volatile storage memory such as hard disks, flash memory, and external memory storage devices.
  • the storage medium 130 stores files 140 , as well as various data associated with operation of the operating system 134 , protection application 136 , and other user applications 132 .
  • the storage medium 130 comprises a non-transitory computer-readable storage medium.
  • Various executable programs (e.g., the operating system 134 , protection application 136 , and user applications 132 ) are each embodied as computer-executable instructions stored to the non-transitory computer-readable storage medium. The instructions, when executed by the processor 125 , cause the client 120 to perform the functions attributed to the programs described herein.
  • the operating system 134 is a specialized program that manages computer hardware resources of the client 120 and provides common services to the user applications 132 .
  • a computer's operating system 134 may manage the processor 125 , storage medium 130 , or other components not illustrated such as, for example, a graphics adapter, an audio adapter, network connections, disc drives, and USB slots.
  • a mobile phone's operating system 134 may manage the processor 125 , storage medium 130 , display screen, keypad, dialer, wireless network connections and the like. Because many programs and executing processes compete for the limited resources provided by the processor 125 , the operating system 134 may manage the processor bandwidth and timing to each requesting process. Examples of operating systems 134 include WINDOWS, MAC OS, IOS, LINUX, UBUNTU, UNIX, and ANDROID.
  • the user applications 132 may include applications for performing a particular set of functions, tasks, or activities for the user. Examples of user applications 132 may include a word processor, a spreadsheet application, and a web browser. In some cases, a user application 132 can be a source of malware and may be associated with one or more of the files 140 stored on the client 120 . The malware may be executed or installed on the client 120 when the user application 132 is executed or installed, or when an associated malicious file is accessed.
  • the protection application 136 detects and remediates potentially malicious files installed or otherwise stored on the client 120 . To determine whether a given file is potentially malicious, the protection application 136 generates an anomaly score for the given file that represents a measure of dissimilarity between the given file and known clean files. Files that are highly anomalous relative to the clean files (e.g., have an anomaly score exceeding a predefined threshold) are identified as being potentially malicious.
  • the protection application 136 may also access the security server 105 via the network 110 to perform a check of a file against one or more whitelists of known clean files and/or blacklists of known malware prior to classifying the file as being malicious or clean and taking appropriate remedial action, if necessary.
  • the protection application 136 includes a file selection module 142 , a file classifier 144 , a model store 148 , a remediation module 150 , and a facet manager 152 .
  • Alternative embodiments may include different or additional modules or omit one or more of the illustrated modules.
  • the file selection module 142 selects files for classification by the protection application 136 .
  • the file selection module 142 may execute, for example, during a scheduled scan of the storage medium 130 or upon downloading files to the storage medium 130 to determine whether or not to further analyze particular files for potential malware.
  • the file selection module 142 obtains metadata associated with a given file from the files 140 on the client 120 .
  • the metadata includes a set of information that describes the given file.
  • the metadata may include file header information indicating a file format (e.g., a file having a portable executable (PE) format, portable document format (PDF), image format, another type of executable format, etc.), file size, file location, file source, or other parameters.
  • the file selection module 142 stores content of the given file into a buffer and obtains the metadata by parsing the contents from the buffer.
  • the file selection module 142 may use the obtained metadata to determine a subclass of the given file.
  • the subclass is a label with which the file selection module 142 may assign or tag the given file based on its metadata.
  • Example subclasses of file types include portable executables (e.g., files with the .exe extension, dynamic-link libraries (DLL), and drivers), documents (e.g., files with extensions such as .doc, .docx, .txt, etc.), PDFs, images, scripts (e.g., JavaScript (.js), Visual Basic Scripts (.vbs), WINDOWS® script files (.wsf), etc.), among other types of files.
  • the protection application 136 may use the assigned subclass for other classification steps further described below.
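The subclass assignment step can be sketched as below. The mapping is built from the example extensions named above; treating drivers as `.sys` files and keying on extension alone are assumptions, since the text indicates a real implementation parses file header metadata rather than trusting extensions.

```python
# Illustrative extension-to-subclass mapping (assumed; header parsing
# would be used in practice).
SUBCLASS_BY_EXTENSION = {
    ".exe": "portable_executable", ".dll": "portable_executable",
    ".sys": "portable_executable",
    ".doc": "document", ".docx": "document", ".txt": "document",
    ".pdf": "pdf",
    ".js": "script", ".vbs": "script", ".wsf": "script",
}


def assign_subclass(filename):
    """Tag a file with a subclass label used to pick filters and models."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return SUBCLASS_BY_EXTENSION.get(ext, "unknown")
```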
  • the file selection module 142 may apply one or more filters to filter out files that can be deemed harmless without further processing and to select files for further processing.
  • different filters may be applied to different files depending on the determined file subclass.
  • the file selection module 142 may include different filters each associated with different file subclasses and may use the obtained subclass (or unprocessed metadata) to select a filter to apply to each given file.
  • a plurality of different filters may be applied to all input files, with each filter designed to filter the files in different ways. For instance, a first filter for executable-type files may filter files according to different filtering criteria than a second filter for non-executable-type files.
  • the filtering criteria for each filter may be based on a local whitelist of known clean files stored by the protection application 136 .
  • each filter passes files that do not match any of the known clean files on the respective whitelist associated with the filter.
  • a filter may filter out (or pass) files based on criteria such as whether or not a file is digitally-signed, has file size in a target range, includes structured exception handling information, or was previously classified as clean by the protection application 136 . Only the files that pass through the filter are further processed for potential classification as being malicious, as further described below. Filtering the files 140 may be advantageous because, by reducing the number of files passed down the pipeline for further processing, the protection application 136 may reduce the amount of computational resources required by the client 120 for classification. In an embodiment, only a relatively small percentage of files pass through the filter.
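A single filter might be sketched as below. The field names and the particular size range are assumptions for illustration; the criteria themselves (local whitelist match, digital signature, target size range) come from the description above.

```python
def passes_filter(file_info, local_whitelist):
    """Return True only for files that need further malware analysis.
    `file_info` is an assumed dict of metadata for one file."""
    if file_info["full_hash"] in local_whitelist:
        return False              # matches a known clean file: filter out
    if file_info.get("digitally_signed"):
        return False              # signed files deemed harmless in this sketch
    if not (1024 <= file_info["size"] <= 50 * 2**20):
        return False              # outside an (illustrative) target size range
    return True                   # pass downstream for classification
```

Only files returning `True` continue down the pipeline, which is what keeps the classification workload on the client small.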
  • the facet manager 152 identifies and extracts facets from files, e.g. files selected by the file selection module 142 , and hashes the extracted facets to generate respective subhashes.
  • a facet comprises a portion of a file that represents particular characteristics of the file that may be indicative of whether or not the file is malware.
  • a facet comprises a non-code portion of a file.
  • a facet may include an author string, a product name string, a list of application programming interfaces (API's) used by the file, a description of the file, copyright information, details from a file header of the file, or a combination thereof.
  • a subhash of a facet is a hash computed on the facet.
  • the facet manager 152 analyzes files to identify one or more facets. For example, based on the file type, the facet manager 152 may identify a header schema corresponding to the file type, and use the header schema to identify different sections of the header of the file for extraction as facets. The facet manager 152 may generate an index that stores the subhashes computed from the facets of a file in association with a full hash of the file, such that the subhashes may be retrieved in response to a query with the full hash. Depending upon the embodiment, either the file selection module 142 , the facet manager 152 , or the file classifier 144 hashes the file into the full hash.
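The facet hashing and indexing steps can be sketched as follows. The choice of SHA-256 is an assumption (the text does not fix a hash function), and the facet dictionary stands in for the header-schema extraction step.

```python
import hashlib


def compute_subhashes(facets):
    """Hash each extracted facet (e.g., author string, product name)
    into a subhash. SHA-256 is assumed here."""
    return {name: hashlib.sha256(value.encode()).hexdigest()
            for name, value in facets.items()}


def index_file(index, file_bytes, facets):
    """Store the subhashes under the file's full hash so they can be
    retrieved later by querying the index with the full hash."""
    full_hash = hashlib.sha256(file_bytes).hexdigest()
    index[full_hash] = compute_subhashes(facets)
    return full_hash
```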
  • the model store 148 stores a plurality of anomaly score models used by the file classifier 144 to classify files as malicious or clean.
  • Each anomaly score model comprises a function that generates an anomaly score based on a set of input features (e.g., an input feature vector) derived from an input file and a set of model parameters.
  • the features are measurable properties of files that characterize the files in a way that enables similarities or dissimilarities between files to be measured.
  • Features may be properties represented by a numerical scale such as a checksum value of a file, or binary properties such as whether the checksum value is valid.
  • features for a PE file can include a number of writeable or executable non-header sections of the PE file, a number of unknown or uncommon sections, section characteristics, or an amount of resources allocated to certain sections.
  • the features may also be based on heuristics such as whether the PE checksum is valid or whether a rich string is valid.
  • the rich string is a particular portion of a PE file header that may be ignored by an operating system 134 and, as a result, may be used by malware to store custom data such as decryption keys.
  • all of the features may be derived without executing the files, but instead by performing a static analysis of the files.
  • the model parameters for each model may be derived from reference features (e.g., reference feature vectors) associated with a set of reference files comprising known clean files.
  • the model parameters may include, for example, a mean feature vector μ representing average values for each feature in the set of reference files, and a covariance matrix Σ representing the variance of each feature and the covariance between all feature pairs.
  • the covariance matrix Σ represents or describes the spread of the data in the feature space.
  • an anomaly score model may specify the following function to determine an anomaly score p(x) for a target file having a feature vector x, where n is the number of features employed by the model:

    p(x) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))

  • In this function, distances are determined between each feature of the input feature vector x and the corresponding mean features of the mean feature vector μ, and the distances are combined to generate the anomaly score p(x).
  • the selected model generates the anomaly score based on the differences and the variances for the features of the file, so that the anomaly score may be normalized based on the variances, which may vary from file-to-file.
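A minimal sketch of such a model, assuming the μ/Σ parameters described above are estimated from feature vectors of known clean files, and assuming the multivariate Gaussian density is used directly (low density corresponds to an anomalous file):

```python
import numpy as np


def fit_model(reference_features):
    """Estimate the mean vector mu and covariance matrix sigma
    from feature vectors of known clean reference files."""
    X = np.asarray(reference_features, dtype=float)
    return X.mean(axis=0), np.cov(X, rowvar=False)


def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density p(x). The exponent contains the
    squared Mahalanobis distance, so each feature's deviation from the
    mean is normalized by the learned variances."""
    x = np.asarray(x, dtype=float)
    n = x.size
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)
```

A target file far from the clean-file mean receives a much lower density than one near it, which is the signal the classifier thresholds.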
  • Each of the different models in the model store 148 may specify a different function, different parameters, or different features sets to which the function is applied.
  • Each of the different models may be associated with a different subclass of files and may be configured specifically to detect anomalies within that file subclass.
  • the different models may be trained with reference files within a particular subclass so that the model produces an anomaly score relative to a particular subclass of files.
  • the model store 148 may receive new anomaly score models or periodic updates to existing anomaly score models from the security server 105 .
  • An example of a training module for training different models in the model store 148 is described in further detail below with respect to FIG. 3 .
  • the file classifier 144 uses the obtained subclass of an input file to select one of the multiple anomaly score models suitable to score the input file. For example, the file classifier 144 selects a model associated with the subclass that corresponds to the assigned subclass of the input file to be scored. In one embodiment, the file classifier 144 generates features for the input file, and applies the selected model to the input file to generate the anomaly score, for example, by applying the function of the selected model to the features using the parameters (e.g., expected value and variances) of the selected model.
  • the file classifier 144 compares the anomaly scores against one or more threshold scores to classify the files. In one embodiment, the file classifier 144 classifies a file as malicious based on determining that an anomaly score for the file is greater than a threshold score, and clean otherwise. In order to reduce false positives associated with a classification based on the anomaly score alone, the file classifier 144 retrieves the full hash of the file and its associated subhashes from the index in the facet manager 152 and sends the full hash and subhashes to the security server 105 , where they are checked against one or more whitelists and/or blacklists. The file classifier 144 then classifies the file based at least in part on results of the checks received from the security server 105 .
  • the checks involve checking the full hash and/or the subhashes against one or more whitelists, one or more blacklists, or both. For example, if the anomaly score is above a threshold score, the full hash may be compared against a whitelist of full hashes. If no match is found, subhashes associated with facets of the file may be compared against a whitelist of subhashes. If a match is found on either whitelist, the file is determined to be clean despite the anomaly score. Otherwise, if neither the file hashes nor the facet subhashes match the respective whitelists, the file is classified as malicious.
  • the file classifier 144 may reduce the number of false positives (i.e., clean files erroneously classified as malicious) because, in some exceptions, clean files may not closely resemble other typical clean files and may have high anomaly scores despite being clean.
  • the file classifier may classify the file as clean without checking the whitelists at the security server 105 .
  • the file classifier may query a blacklist of the security server 105 in response to the anomaly score being below the threshold score.
  • the full hash is compared against a blacklist of full hashes and the file is classified as malicious if the full hash matches an entry in the blacklist and may otherwise be classified as clean.
  • a blacklist of subhashes associated with facets of the file may also be queried if no matches are found on the blacklist of full hashes, and the file may be determined to be malicious responsive to a match with a subhash in the blacklist of subhashes, and is otherwise classified as clean.
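The blacklist-query embodiment just described can be sketched as follows; the function name and string labels are illustrative assumptions.

```python
def check_blacklists(full_hash, subhashes, full_hash_blacklist,
                     subhash_blacklist):
    """Sketch of the server-side blacklist lookup order: the full hash is
    checked first, and facet subhashes only if the full hash is unknown."""
    if full_hash in full_hash_blacklist:
        return "malicious"
    if any(sh in subhash_blacklist for sh in subhashes):
        return "malicious"
    return "clean"
```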
  • multiple thresholds may be used (e.g., three thresholds including a lower threshold, a center threshold, and an upper threshold).
  • the file classifier 144 classifies a file as malicious responsive to determining that an anomaly score for the file is greater than (or equal to) an upper threshold score without querying whitelists or blacklists on the security server 105 .
  • the file classifier 144 classifies the file as clean responsive to determining that the anomaly score is less than a lower threshold score without querying whitelists or blacklists on the security server 105 .
  • Responsive to determining that the anomaly score is less than the upper threshold score and greater than (or equal to) the center threshold score, the file classifier 144 provides the file to the security server 105 for comparison against one or more whitelists of known clean files and/or facets associated with clean files as described above.
  • the file classifier 144 may also provide the file to the security server 105 for comparison against one or more blacklists of known malware and/or facets associated with malware as described above.
  • the lower and upper threshold scores may be between one to three standard deviations below and above the center threshold score, respectively.
  • the file classifier 144 may use the lower, center, or upper threshold scores to reduce load on the security server 105 by decreasing the amount of files that are provided to be checked by the security server 105 .
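The three-threshold scheme can be sketched as below. For simplicity this sketch collapses the two middle bands (which the text distinguishes as whitelist versus blacklist queries) into a single `query_server` callable; that simplification is an assumption.

```python
def triage(score, lower, upper, query_server):
    """Decide locally at the extremes; consult the security server only
    for anomaly scores in the uncertain middle band, reducing server load."""
    if score >= upper:
        return "malicious"        # confidently anomalous: no server query
    if score < lower:
        return "clean"            # confidently normal: no server query
    return query_server()         # uncertain: check whitelists/blacklists
```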
  • the remediation module 150 remediates files that are classified as malicious by the file classifier 144 .
  • the remediation module 150 may perform remediation by removing a malicious file from the client 120 , quarantining the malicious file on the client 120 , or providing a notification to a user of the client 120 indicating that the malicious file is suspected to be associated with malware.
  • the notification may also include information about the malicious file such as a file source or risk severity level proportional to the anomaly score of the malicious file.
  • the remediation module 150 provides a user of the client 120 with an option to remove or quarantine a suspected malicious file. Responsive to the user selecting to retain rather than remove the suspected malicious file, the remediation module 150 may determine that the classification is a false positive and provide this feedback to the security server 105 to re-train an anomaly score model.
  • FIG. 2 is a diagram 200 characterizing files that illustrates a process for generating anomaly scores, according to one embodiment.
  • the diagram 200 includes a graph of points representing a sample of known clean files and a target file to be scored by an anomaly score model.
  • features of the sample are represented by the two axes of the graph.
  • the x-axis and y-axis represent feature 1 and feature 2 , respectively, though in other embodiments, anomaly score models use many more features, e.g., hundreds of features.
  • a multi-dimensional feature score is represented by a point on the graph.
  • the point 220 corresponds to a file of the sample having a feature score of “x” for feature 1 and a feature score of “y” for feature 2 .
  • the points of the clean files of the sample are within dotted lines of the contour 210 , illustrating that the clean files are generally similar (non-anomalous) to each other based on the characterized features.
  • the contour 210 may represent the multivariate Gaussian distribution of the points of the sample, which is determined by the anomaly score model based on the feature scores.
  • the anomaly score may be represented in the graph as a distance 250 between the point 230 representing the target file and the mean 240 of the multivariate Gaussian distribution, or the “peak of normal.”
  • the mean 240 may be an average of one or more feature scores of the sample.
  • the distance 250 may be computed as, for example, a Mahalanobis distance or a Euclidean distance.
  • as the distance 250 increases, the point 230 is a greater number of standard deviations away from the mean 240 , thus indicating that the target file is more dissimilar to the sample.
  • a threshold of distance 250 may be the threshold at which a file is classified as anomalous.
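The distance-based scoring described above can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation; the sample data, function names, and the threshold value are all hypothetical.

```python
import numpy as np

def mahalanobis_distance(x, mean, cov):
    """Number of generalized standard deviations between feature vector x and the sample mean."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Toy two-feature sample of known clean files (feature 1, feature 2).
clean = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.0]])
mu = clean.mean(axis=0)                # the "peak of normal" (mean 240)
sigma = np.cov(clean, rowvar=False)    # covariance of the sample

target = np.array([3.0, 0.5])          # feature scores of the target file (point 230)
score = mahalanobis_distance(target, mu, sigma)
is_anomalous = score > 3.0             # illustrative distance threshold
```

A file whose point sits at the mean yields a distance of zero; points far outside the contour of the clean sample yield large distances and may be classified as anomalous.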
  • the diagram 200 is used to illustrate conceptually how a suspected malicious file may be distinguished from a sample of clean files. That is, the anomaly score models in the model store 148 do not necessarily use a graph having two axes, each representing a different feature, to determine anomaly scores. Rather, the anomaly score models may be implemented using known types of machine learning techniques or models such as decision trees, support vector machines (SVMs), neural networks (e.g., autoencoders), boosted/bagged ensemble models, isolation forests, and the like. Additionally, the anomaly score models may characterize any number of features of the sample, e.g., hundreds of different features.
  • FIG. 3 is a block diagram illustrating an example embodiment of a security server 105 .
  • the security server 105 includes a processor 300 for manipulating and processing data, and a storage medium 310 for storing data and program instructions associated with various modules.
  • the storage medium 310 includes a lookup module 312 , a file database 314 , and a file learning module 316 , which collectively form a file analysis system 320 .
  • the storage medium 310 additionally includes a facet lookup module 332 , a facet database 334 , and a facet learning module 336 , which collectively form a facet analysis system 330 .
  • Alternative embodiments may include different or additional modules or omit one or more of the illustrated modules.
  • the lookup module 312 checks full hashes against a whitelist and/or blacklist stored to the file database 314 based on information received from the file classifier 144 of the protection application 136 running on the client 120 .
  • the file database 314 stores full hashes of files and the lookup module 312 compares a received full hash against the full hashes in the file database 314 to determine if they match.
  • the file database 314 may store full files and the lookup module 312 may compare the full files against the files in the file database 314 to identify a match.
  • for each file, the lookup module 312 performs a lookup in the file database 314 to determine if the full hash of the file is associated with information about known malicious files (e.g., related to malware) or clean files.
  • the lookup module 312 provides a result of the lookup to the file classifier 144 , e.g., via the network 110 .
  • the result may indicate that the full hash is associated with a full hash of a known malicious file on a blacklist, the full hash is associated with a full hash of a known clean file on a whitelist, or the full hash is not associated with full hashes of files on either the cloud blacklist or whitelist.
  • the whitelist and/or blacklist in the file database 314 may include a more extensive database of full hashes of files than the previously described local whitelist and/or blacklist of the file selection module 142 on the client 120 .
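The three-way lookup result described above can be sketched as follows; the set names and result strings are hypothetical, chosen only to mirror the whitelist/blacklist/unknown outcomes.

```python
def lookup_full_hash(full_hash, whitelist, blacklist):
    """Return the lookup result the server reports back to the file classifier."""
    if full_hash in blacklist:
        return "blacklisted"   # matches a full hash of a known malicious file
    if full_hash in whitelist:
        return "whitelisted"   # matches a full hash of a known clean file
    return "unknown"           # not associated with either list

# Toy databases of full hashes.
whitelist = {"aa11", "bb22"}
blacklist = {"cc33"}
```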
  • the file learning module 316 may establish training sets of clean files and learn model parameters for a plurality of different models each corresponding to a different file subclass.
  • the file learning module 316 may obtain metadata for clean files and group the clean files into training sets of separate subclasses based on the metadata as described above.
  • the file learning module 316 generates features for the files in each training set. The type or number of features may be different in each training set corresponding to a different class.
  • the file learning module 316 trains a separate anomaly score model to learn model parameters using the features derived from the clean files in the training set for the class.
  • each model may be configured to generate an anomaly score for an input file of a different file subclass, based on clean files of the same subclass as the input file.
  • a first model may be trained using a sample of files downloaded from an online server file source (and thus assigned to a first subclass), while a second model is trained using another sample of files obtained from a local disk file source on clients 120 (and thus assigned to a second subclass different than the first subclass).
  • the first model may generate more accurate anomaly scores for files downloaded from the online server than for files obtained from the local disk, and vice-versa for the second model, because the features of each model are customized for different file sources.
  • the file learning module 316 uses the following equations to determine the model parameters, including an expected value (i.e., mean) μ and covariance matrix Σ:

    μ = (1/m) ∑_{i=1}^{m} x^(i)

    Σ = (1/m) ∑_{i=1}^{m} (x^(i) − μ)(x^(i) − μ)^T
  • x (i) is a vector representing the set of features for a sample clean file i in the training set of m files and has a dimension equal to the number of features.
  • the mean feature vector μ represents average feature scores for each of the features across the training set.
  • the covariance matrix Σ represents the variance of each feature and the covariance between all feature pairs (i.e., extending across multiple dimensions), capturing how pairs of features vary together.
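The parameter estimation above can be sketched with the standard maximum-likelihood estimators for a multivariate Gaussian; the patent does not publish reference code, so the function name and data are illustrative.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate mu and Sigma from m feature vectors x^(i), one per row of X."""
    m = X.shape[0]
    mu = X.mean(axis=0)                  # average feature scores across the training set
    centered = X - mu
    sigma = (centered.T @ centered) / m  # (1/m) * sum of outer products (x^(i) - mu)(x^(i) - mu)^T
    return mu, sigma

# Two-file toy training set with an obvious mean and covariance.
mu, sigma = fit_gaussian(np.array([[0.0, 0.0], [2.0, 2.0]]))
```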
  • the file learning module 316 may optimize the model parameters in each model by applying the model to one or more test files known to be malicious or clean.
  • the performance of the model can be measured based on correct classifications of malicious test files and a number of false positives (e.g., classification of clean test files as malicious).
  • Model parameters, the selected function, or the feature sets used in each model may then be adjusted to improve performance.
  • the file learning module 316 may distribute the trained models to protection applications 136 of clients 120 , as well as periodically update the distributed models. Beneficially, since clean files generally experience slower rates of change than do malicious files, the anomaly score models do not need to be frequently re-trained with updated samples of clean files.
  • the facet analysis system 330 checks whether a file is clean and/or malicious based on its facets by checking the subhashes of the facets against a subhash whitelist and/or subhash blacklist. In one embodiment, the facet analysis system 330 solely checks whether the file is clean based on its facets by checking subhashes of the facets against a whitelist of subhashes stored at the facet database 334 and does not employ a blacklist associated with facets. Alternatively, the facet analysis system 330 may additionally or alternatively check whether the file is malicious based on its facets by checking subhashes of the facets against a blacklist of subhashes stored at the facet database 334 .
  • the facet analysis system 330 additionally tracks the presence of subhashes in whitelisted and/or blacklisted files to learn which subhashes are associated with clean files and which subhashes are associated with malware.
  • the subhash whitelist and/or subhash blacklist may be updated dynamically as new associations are learned.
  • the facet analysis system 330 thus provides an additional way of whitelisting clean files in order to reduce false positives that might otherwise go undetected by the file analysis system 320 .
  • the security server 105 can enable file classification that beneficially reduces the rate of false positives.
  • a facet blacklist if employed, may reduce rates of false negatives.
  • the facet lookup module 332 checks subhashes against the facet database 334 based on information received from the file classifier 144 of the protection application 136 running on the client 120 .
  • the received information includes one or more subhashes, e.g., subhashes of facets identified by the facet manager 152 of the protection application 136 and included with the full hash received by the security server 105 .
  • the facet lookup module 332 performs a lookup in the facet database 334 to determine if the subhash is associated with information about known clean files and therefore matches a subhash on a subhash whitelist.
  • the facet lookup module 332 additionally or alternatively performs a lookup in the facet database 334 to determine if the subhash is associated with information about known malicious files and therefore matches a subhash on a facet blacklist.
  • the facet database 334 includes a subhash whitelist including subhashes of facets associated with files known to be clean.
  • the facet database 334 may additionally or alternatively include a subhash blacklist including subhashes associated with files known to be malicious. If a subhash is included in the subhash whitelist, then the file associated with the subhash is determined to be clean. In an embodiment, if a subhash is included in the subhash blacklist, then the file is considered to be malicious. If no subhash of a file is included in the subhash whitelist, the security server 105 indicates that the subhashes were not in the whitelist to the protection application 136 , which may then classify the file using other information, such as the anomaly score alone.
  • the facet learning module 336 determines which subhashes are included in the subhash whitelist, and in some embodiments, the subhash blacklist.
  • the facet analysis system 330 includes solely a subhash whitelist or a subhash blacklist, in which case the analysis corresponding to the other is not performed.
  • based on the results of the lookups for the one or more subhashes, the facet lookup module 332 provides an indication to the file classifier 144 , e.g., via the network 110 , of whether at least one subhash was included in a subhash whitelist, and in some embodiments, whether at least one subhash was included in a subhash blacklist. In an embodiment, if one subhash is in the subhash whitelist, the facet lookup module 332 indicates to the file classifier 144 that the file is whitelisted. Alternatively, there may be a threshold number of subhashes that need to be clean (e.g., found in the subhash whitelist) in order for the file to be classified as clean by the facet lookup module 332 .
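The lookup-and-threshold logic just described can be sketched as follows; a threshold of one corresponds to the single-subhash embodiment, and all names are illustrative.

```python
def classify_by_subhashes(subhashes, subhash_whitelist, clean_threshold=1):
    """Report 'whitelisted' if enough subhashes match the whitelist, else 'unknown'."""
    clean_hits = sum(1 for s in subhashes if s in subhash_whitelist)
    return "whitelisted" if clean_hits >= clean_threshold else "unknown"

wl = {"f1", "f2"}  # toy subhash whitelist
```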
  • the facet learning module 336 dynamically updates the subhash whitelist and/or subhash blacklist based on file classifications of files having facets associated with the subhashes.
  • the facet database 334 maintains for each subhash of a facet, a clean count and a malicious count.
  • the clean count of a subhash is a count indicating a number of known clean files that have been observed to include the facet associated with the subhash.
  • the malicious count is a count indicating a number of known malicious files that have been observed to include the facet associated with the subhash.
  • the facet learning module 336 updates the clean count and malicious count of each subhash as the security server 105 receives subhashes for facets and associated file hashes for a file, and determines whether the file is malware or clean. For example, each time the security server 105 determines that a file is clean, the clean count of each subhash of a facet included in the file is incremented by one by the facet learning module 336 . Similarly, if the security server 105 determines that a file is malicious, the security server 105 increments by one the malicious count of each subhash of each facet in the file. In an embodiment, the security server 105 updates clean counts and/or malicious counts based on information received from a third party server (e.g., a third party security analysis system) that provides a determination of whether a file is clean or malicious.
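The per-subhash bookkeeping above can be sketched like this; the data structures and function name are hypothetical.

```python
from collections import defaultdict

clean_counts = defaultdict(int)      # subhash -> number of known clean files observed with it
malicious_counts = defaultdict(int)  # subhash -> number of known malicious files observed with it

def record_file_verdict(subhashes, is_clean):
    """Increment the appropriate count for every subhash of the file's facets."""
    counts = clean_counts if is_clean else malicious_counts
    for subhash in subhashes:
        counts[subhash] += 1

# A clean file containing facets s1 and s2, then a malicious file containing s2.
record_file_verdict(["s1", "s2"], is_clean=True)
record_file_verdict(["s2"], is_clean=False)
```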
  • the facet learning module 336 updates the subhash whitelist and, in some embodiments, the subhash blacklist, based on the clean counts and the malicious counts. If the malicious count of a subhash is nonzero, the facet learning module 336 may add the subhash to the subhash blacklist, and remove it from the subhash whitelist if it was included therein. If the clean count of a subhash is at least a threshold clean count value and the malicious count is zero, the facet learning module 336 adds the subhash to the subhash whitelist.
  • the threshold clean count value may vary depending upon the embodiment. For example, the threshold clean count value may be three, or five, or another value.
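The update rule above (any malicious sighting moves a subhash to the blacklist; a clean count at or above the threshold with a zero malicious count whitelists it) can be sketched as follows, with hypothetical names throughout.

```python
def update_lists(subhash, clean_count, malicious_count,
                 whitelist, blacklist, clean_threshold=3):
    """Apply the whitelist/blacklist update rule for a single subhash."""
    if malicious_count > 0:
        whitelist.discard(subhash)  # remove from the whitelist if it was included therein
        blacklist.add(subhash)
    elif clean_count >= clean_threshold:
        whitelist.add(subhash)

wl, bl = {"s1"}, set()
update_lists("s1", clean_count=10, malicious_count=1, whitelist=wl, blacklist=bl)
update_lists("s2", clean_count=3, malicious_count=0, whitelist=wl, blacklist=bl)
```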
  • the facet learning module 336 enables the security server 105 to maintain up-to-date subhash whitelists and, in some embodiments, subhash blacklists, that accurately reflect which subhashes are likely clean and/or malicious. As the validity of classifications made by the security server 105 is correlated with the validity of the whitelists and blacklists maintained by the security server 105 , the facet learning module 336 enables the security server 105 to better guarantee that a classification of a file is correct.
  • FIG. 4 is a flowchart illustrating an embodiment of a process 400 by a security server 105 for automatically updating a subhash whitelist associated with facets of files.
  • the security server 105 receives 402 a first full hash of a file and a first set of subhashes of facets of the file from a protection application 136 at a client 120 .
  • the security server 105 determines 404 whether the first full hash is whitelisted, and if so, the security server 105 updates 406 , for each subhash in the first set of subhashes received in association with the full hash for the file, a clean count associated with the subhash.
  • the security server 105 adds 408 a subhash to a subhash whitelist responsive to the clean count associated with the subhash exceeding a threshold clean count. If the security server 105 determines the file is malware, the security server 105 removes the subhash from the subhash whitelist if present, regardless of its clean count. In an embodiment, the security server 105 maintains a malicious count for each subhash. The security server 105 increments the malicious count of each subhash associated with the file responsive to determining that the file is malware, and removes a subhash from the whitelist responsive to the malicious count being nonzero. The security server 105 may then report the results of the determination to the client 120 .
  • FIG. 5 is a flowchart illustrating an embodiment of a process 500 for classifying files based on subhashes.
  • the security server 105 receives 510 a second full hash of a second file and a second set of subhashes of facets of the second file from a client 120 .
  • the security server 105 determines 512 whether the second full hash is whitelisted. If the second full hash is whitelisted, the security server 105 sends an indication to the protection application 136 that the file is whitelisted. Responsive to determining that the second full hash is not whitelisted, the security server 105 determines 514 whether at least one subhash in the second set of subhashes is included in the subhash whitelist.
  • responsive to determining that at least one subhash in the second set of subhashes is included in the subhash whitelist, the security server 105 determines 516 that the second file is whitelisted and reports 518 that the second file is whitelisted to the client 120 . Otherwise, the security server 105 reports that the second file does not match any whitelists (i.e., the file is unknown to the security server 105 ).
  • the security server 105 checks whether at least one of the subhashes in the second set of subhashes is included in a subhash blacklist. If the security server 105 determines that at least one of the subhashes in the second set of subhashes is included in the subhash blacklist, the security server 105 classifies the second file as malicious, regardless of how many of the subhashes in the second set of subhashes are in the subhash whitelist.
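Putting the lookups of process 500 together, the order of checks can be sketched as follows; as described above, a subhash blacklist hit takes precedence over any whitelist hits. Names and result strings are illustrative.

```python
def classify_file(full_hash, subhashes,
                  file_whitelist, subhash_whitelist, subhash_blacklist):
    """Order of checks in process 500: full hash first, then subhash blacklist, then whitelist."""
    if full_hash in file_whitelist:
        return "whitelisted"
    if any(s in subhash_blacklist for s in subhashes):
        return "malicious"   # blacklist wins regardless of whitelist matches
    if any(s in subhash_whitelist for s in subhashes):
        return "whitelisted"
    return "unknown"

fw, swl, sbl = {"h1"}, {"w1"}, {"b1"}  # toy lists
```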
  • the security server 105 determines that the first full hash is malicious, and increments, for each subhash in the first set of subhashes, an associated malicious count.
  • the security server 105 may add subhashes with nonzero malicious counts to the subhash blacklist, if there is a subhash blacklist, and remove such subhashes from the subhash whitelist.
  • the security server 105 can classify files with a reduced rate of false positives.
  • the security server 105 maintains whitelists and/or blacklists of full files that are not hashed, and matches full files received from the client 120 against the whitelists and/or blacklists.
  • the present disclosure refers to embodiments using subhashes and subhash whitelists and/or blacklists, though the techniques described herein additionally apply to facets that are not hashed and whitelists and/or blacklists of facets that are not hashed.
  • the security server 105 maintains whitelists and/or blacklists of facets that are not hashed, and matches facets received from the client 120 against the whitelists and/or blacklists.
  • a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
  • Embodiments of the invention may also relate to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored in a nontransitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus.
  • any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
  • Embodiments of the invention may also relate to a product that is produced by a computing process described herein.
  • a product may comprise information resulting from a computing process, where the information is stored on a nontransitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
  • the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion.
  • a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
  • “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

Abstract

A security server receives a full hash and a set of subhashes from a client. The security server determines that the full hash is blacklisted. The security server updates, for each subhash in the set of subhashes, an associated malicious count. The security server adds a subhash to a subhash blacklist responsive to an associated malicious count exceeding a threshold. The security server receives a second set of subhashes. The security server determines whether at least one of the subhashes in the second set of subhashes is included in the subhash blacklist. The security server reports to the client based on the determination.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application is a continuation of co-pending U.S. patent application Ser. No. 16/292,307, filed Mar. 4, 2019, the entirety of which is incorporated by reference herein.
  • FIELD OF ART
  • The present disclosure generally relates to preventing malware and more specifically to reducing false positives in malware detection.
  • BACKGROUND
  • It is traditionally difficult to automatically detect malware due to the constantly evolving nature of different types of malware. Conventional detection techniques often suffer from a high rate of false positives in which clean files are erroneously detected as malware. False positives are undesirable because they may prompt an anti-malware application to unnecessarily quarantine or delete important system or user files.
  • SUMMARY
  • A method determines whether files are blacklisted. A first full hash and a first set of subhashes are received from a client. The first full hash is a hash of a first file and each subhash in the first set of subhashes is a hash of a facet of the first file. A facet is a non-code portion of a file. It is determined whether the first full hash is blacklisted. Responsive to determining the first full hash is blacklisted, for each subhash in the first set of subhashes, an associated malicious count is updated. Each malicious count tracks a historic number of blacklisted files with which the subhash is associated. Responsive to a first malicious count of the malicious counts exceeding a threshold malicious count, the subhash associated with the first malicious count is added to a subhash blacklist.
  • A second full hash and a second set of subhashes are received from the client. The second full hash is a hash of a second file and each subhash in the second set of subhashes is a hash of a facet of the second file. It is determined whether the second full hash is blacklisted. Responsive to determining the second full hash is not blacklisted, it is determined whether at least one subhash in the second set of subhashes is included in the subhash blacklist. Responsive to determining that at least one subhash in the second set of subhashes is included in the subhash blacklist, the second file is determined to be blacklisted. It is reported to the client that the second file is blacklisted.
  • In another embodiment, a non-transitory computer-readable storage medium stores instructions that when executed by a processor causes the processor to execute the above-described method.
  • In yet another embodiment, a computer system includes a processor and a non-transitory computer-readable storage medium that stores instructions for executing the above-described method.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The figures (FIGS.) and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.
  • FIG. 1 is a system diagram illustrating an example embodiment of an environment in which a protection application and a security server execute.
  • FIG. 2 is a diagram characterizing files that illustrates a process for generating anomaly scores, according to one embodiment.
  • FIG. 3 is a block diagram illustrating an example embodiment of a security server.
  • FIG. 4 is a flowchart illustrating an embodiment of a process for automatically updating a subhash whitelist associated with facets of files.
  • FIG. 5 is a flowchart illustrating an embodiment of a process for classifying files based on subhashes.
  • DETAILED DESCRIPTION
  • Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the disclosed system (or method) for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.
  • A protection application classifies files on a client and remediates files classified as malware. As part of the classification, the protection application may send a full hash of one or more files to a security server to check whether the files are included in a file whitelist. If the full hash of a file is included in the whitelist, the security server informs the protection application that the file is clean. If the full hash is not included in the whitelist, the security server checks whether one or more hashes of facets of the file (“subhashes”) are included in a subhash whitelist. If so, the security server reports the file as clean to the protection application. Otherwise, if the file and its facets are unknown to the security server, the protection application may determine how to classify the file based on a file classification model. The subhash whitelist may beneficially be automatically updated over time using a learning technique that learns which facets are representative of clean files and may be added to the subhash whitelist, and which facets may be associated with malware and should be removed from the subhash whitelist.
  • FIG. 1 is a high-level block diagram illustrating a system environment 100 for a protection application and a security server. The system environment 100 comprises a security server 105, a network 110, and a client 120 (also referred to as a client device 120). For simplicity and clarity, only one security server 105 and one client 120 are shown; however, other embodiments may include different numbers of security servers 105 and clients 120. Furthermore, the system environment 100 may include different or additional entities.
  • The security server 105 is a computer system configured to store, receive, and transmit data to clients 120 or to other servers via the network 110. The security server 105 may include a singular computing system, such as a single computer, or a network of computing systems, such as a data center or a distributed computing system. The security server 105 may receive requests for data from clients 120 and respond by transmitting the requested data to the clients 120. The security server 105 includes a database of information about known malware (e.g., a blacklist), clean files (e.g., a whitelist), or both. Further, the security server 105 may lookup files in whitelists or blacklists of the database and provide results of the lookup to clients 120. The security server 105 is described in further detail below with reference to FIG. 3.
  • The network 110 represents the communication pathways between the security server 105 and clients 120. In one embodiment, the network 110 is the Internet. The network 110 can also utilize dedicated or private communications links that are not necessarily part of the Internet. In one embodiment, the network 110 uses standard communications technologies and/or protocols. Thus, the network 110 can include links using technologies such as Ethernet, Wi-Fi (802.11), integrated services digital network (ISDN), digital subscriber line (DSL), asynchronous transfer mode (ATM), etc. Similarly, the networking protocols used on the network 110 can include multiprotocol label switching (MPLS), the transmission control protocol/Internet protocol (TCP/IP), the hypertext transport protocol (HTTP), the simple mail transfer protocol (SMTP), the file transfer protocol (FTP), etc. In one embodiment, at least some of the links use mobile networking technologies, including general packet radio service (GPRS), enhanced data GSM environment (EDGE), long term evolution (LTE), code division multiple access 2000 (CDMA2000), and/or wide-band CDMA (WCDMA). The data exchanged over the network 110 can be represented using technologies and/or formats including the hypertext markup language (HTML), the extensible markup language (XML), the wireless access protocol (WAP), the short message service (SMS) etc. In addition, all or some of the links can be encrypted using conventional encryption technologies such as the secure sockets layer (SSL), Secure HTTP and/or virtual private networks (VPNs). In another embodiment, the entities can use custom and/or dedicated data communications technologies instead of, or in addition to, the ones described above.
  • Each client 120 comprises one or more computing devices capable of processing data as well as transmitting and receiving data via a network 110. For example, a client 120 may be a desktop computer, a laptop computer, a mobile phone, a tablet computing device, an Internet of Things (IoT) device, or any other device having computing and data communication capabilities. Each client 120 includes a processor 125 for manipulating and processing data, and a storage medium 130 for storing data and program instructions associated with various applications. The storage medium 130 may include both volatile memory (e.g., random access memory) and non-volatile storage memory such as hard disks, flash memory, and external memory storage devices. In addition to storing program instructions, the storage medium 130 stores files 140, as well as various data associated with operation of the operating system 134, protection application 136, and other user applications 132.
  • In one embodiment, the storage medium 130 comprises a non-transitory computer-readable storage medium. Various executable programs (e.g., operating system 134, protection application 136, and user applications 132) are each embodied as computer-executable instructions stored to the non-transitory computer-readable storage medium. The instructions, when executed by the processor 125, cause the client 120 to perform the functions attributed to the programs described herein.
  • The operating system 134 is a specialized program that manages computer hardware resources of the client 120 and provides common services to the user applications 132. For example, a computer's operating system 134 may manage the processor 125, storage medium 130, or other components not illustrated such as, for example, a graphics adapter, an audio adapter, network connections, disc drives, and USB slots. A mobile phone's operating system 134 may manage the processor 125, storage medium 130, display screen, keypad, dialer, wireless network connections and the like. Because many programs and executing processes compete for the limited resources provided by the processor 125, the operating system 134 may manage the processor bandwidth and timing to each requesting process. Examples of operating systems 134 include WINDOWS, MAC OS, IOS, LINUX, UBUNTU, UNIX, and ANDROID.
  • The user applications 132 may include applications for performing a particular set of functions, tasks, or activities for the user. Examples of user applications 132 may include a word processor, a spreadsheet application, and a web browser. In some cases, a user application 132 can be a source of malware and may be associated with one or more of the files 140 stored on the client 120. The malware may be executed or installed on the client 120 when the user application 132 is executed or installed, or when an associated malicious file is accessed.
  • The protection application 136 detects and remediates potentially malicious files installed or otherwise stored on the client 120. To determine whether a given file is potentially malicious, the protection application 136 generates an anomaly score for the given file that represents a measure of dissimilarity between the given file and known clean files. Files that are highly anomalous relative to the clean files (e.g., have an anomaly score exceeding a predefined threshold) are identified as being potentially malicious. The protection application 136 may also access the security server 105 via the network 110 to perform a check of a file against one or more whitelists of known clean files and/or blacklists of known malware prior to classifying the file as being malicious or clean and taking appropriate remedial action, if necessary.
  • The protection application 136 includes a file selection module 142, a file classifier 144, a model store 148, a remediation module 150, and a facet manager 152. Alternative embodiments may include different or additional modules or omit one or more of the illustrated modules.
  • The file selection module 142 selects files for classification by the protection application 136. The file selection module 142 may execute, for example, during a scheduled scan of the storage medium 130 or upon downloading files to the storage medium 130 to determine whether or not to further analyze particular files for potential malware. The file selection module 142 obtains metadata associated with a given file from the files 140 on the client 120. The metadata includes a set of information that describes the given file. For example, the metadata may include file header information indicating a file format (e.g., a file having a portable executable (PE) format, portable document format (PDF), image format, another type of executable format, etc.), file size, file location, file source, or other parameters. In some embodiments, the file selection module 142 stores content of the given file into a buffer and obtains the metadata by parsing the contents from the buffer. The file selection module 142 may use the obtained metadata to determine a subclass of the given file. The subclass is a label that the file selection module 142 may assign to the given file based on its metadata. Example subclasses of file types include portable executables (e.g., files with the .exe extension, dynamic-link libraries (DLL), and drivers), documents (e.g., files with extensions such as .doc, .docx, .txt, etc.), PDFs, images, scripts (e.g., JavaScript (.js), Visual Basic Scripts (.vbs), WINDOWS® script files (.wsf), etc.), among other types of files. The protection application 136 may use the assigned subclass for other classification steps further described below.
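The metadata-based subclass tagging described above can be sketched as follows. This is a minimal illustration; the extension-to-subclass table and the subclass label strings are assumptions, since the disclosure does not specify the exact mapping used by the file selection module 142:

```python
# Hypothetical extension-to-subclass table; the actual mapping used by the
# file selection module is not specified in the disclosure.
SUBCLASS_BY_EXTENSION = {
    ".exe": "portable_executable",
    ".dll": "portable_executable",
    ".sys": "portable_executable",
    ".doc": "document",
    ".docx": "document",
    ".txt": "document",
    ".pdf": "pdf",
    ".js": "script",
    ".vbs": "script",
    ".wsf": "script",
}

def assign_subclass(metadata: dict) -> str:
    """Tag a file with a subclass label based on its metadata."""
    # Prefer an explicit format parsed from the file header when available.
    if metadata.get("format") == "PE":
        return "portable_executable"
    ext = metadata.get("extension", "").lower()
    return SUBCLASS_BY_EXTENSION.get(ext, "unknown")
```

The returned label would then select which filter and which anomaly score model to apply to the file.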
  • Additionally, the file selection module 142 may apply one or more filters to filter out files that can be deemed harmless without further processing and to select files for further processing. In an embodiment, different filters may be applied to different files depending on the determined file subclass. For example, the file selection module 142 may include different filters each associated with different file subclasses and may use the obtained subclass (or unprocessed metadata) to select a filter to apply to each given file. Alternatively, a plurality of different filters may be applied to all input files, with each filter designed to filter the files in different ways. For instance, a first filter for executable-type files may filter files according to different filtering criteria than a second filter for non-executable-type files. The filtering criteria for each filter may be based on a local whitelist of known clean files stored by the protection application 136. Here, each filter passes files that do not match any of the known clean files on the respective whitelist associated with the filter. In other embodiments, a filter may filter out (or pass) files based on criteria such as whether or not a file is digitally-signed, has a file size in a target range, includes structured exception handling information, or was previously classified as clean by the protection application 136. Only the files that pass through the filter are further processed for potential classification as being malicious, as further described below. Filtering the files 140 may be advantageous because, by reducing the number of files passed down the pipeline for further processing, the protection application 136 may reduce the amount of computational resources required by the client 120 for classification. In an embodiment, only a relatively small percentage of files pass through the filter.
  • The facet manager 152 identifies and extracts facets from files, e.g., files selected by the file selection module 142, and hashes the extracted facets to generate respective subhashes. Here, a facet comprises a portion of a file that represents particular characteristics of the file that may be indicative of whether or not the file is malware. In an embodiment, a facet comprises a non-code portion of a file. For example, a facet may include an author string, a product name string, a list of application programming interfaces (APIs) used by the file, a description of the file, copyright information, details from a file header of the file, or a combination thereof. A subhash of a facet is a hash computed on the facet.
  • The facet manager 152 analyzes files to identify one or more facets. For example, based on the file type, the facet manager 152 may identify a header schema corresponding to the file type, and use the header schema to identify different sections of the header of the file for extraction as facets. The facet manager 152 may generate an index that stores the subhashes computed from the facets of a file in association with a full hash of the file, such that the subhashes may be retrieved in response to a query with the full hash. Depending upon the embodiment, either the file selection module 142, the facet manager 152, or the file classifier 144 hashes the file into the full hash.
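The facet-to-subhash indexing described above can be sketched as follows. SHA-256 is an assumption, since the disclosure does not name a particular hash algorithm, and the `index_file` helper is hypothetical:

```python
import hashlib

def subhash(facet: bytes) -> str:
    """Hash a single extracted facet (the hash algorithm is an assumption)."""
    return hashlib.sha256(facet).hexdigest()

def index_file(index: dict, file_bytes: bytes, facets: list) -> str:
    """Store the subhashes of a file's facets under the file's full hash,
    so they can be retrieved later by querying with the full hash."""
    full_hash = hashlib.sha256(file_bytes).hexdigest()
    index[full_hash] = [subhash(f) for f in facets]
    return full_hash
```

In use, the protection application could later look up `index[full_hash]` to obtain the subhashes to send to the security server alongside the full hash.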
  • The model store 148 stores a plurality of anomaly score models used by the file classifier 144 to classify files as malicious or clean. Each anomaly score model comprises a function that generates an anomaly score based on a set of input features (e.g., an input feature vector) derived from an input file and a set of model parameters. The features are measurable properties of files that characterize the files in a way that enables similarities or dissimilarities between files to be measured. Features may be properties represented by a numerical scale such as a checksum value of a file, or binary properties such as whether the checksum value is valid. In one embodiment, features for a PE file can include a number of writeable or executable non-header sections of the PE file, a number of unknown or uncommon sections, section characteristics, or an amount of resources allocated to certain sections. The features may also be based on heuristics such as whether the PE checksum is valid or whether a rich string is valid. In an embodiment, the rich string is a particular portion of a PE file header that may be ignored by an operating system 134 and, as a result, may be used by malware to store custom data such as decryption keys. In some embodiments, all of the features may be derived without executing the files, but instead by performing a static analysis of the files.
  • The model parameters for each model may be derived from reference features (e.g., reference feature vectors) associated with a set of reference files comprising known clean files. The model parameters may include, for example, a mean feature vector μ representing average values for each feature in the set of reference files, and a covariance matrix Σ representing the variance of each feature and the covariance between all feature pairs. In other words, the covariance matrix Σ represents or describes the spread of the data in the feature space.
  • The function computes an anomaly score that provides a measure of how anomalous (e.g., how dissimilar) the input file is from the set of known clean files based on their respective features. For instance, an anomaly score model may specify the following function to determine an anomaly score p(x) for a target file having a feature vector x, where n is the number of features employed by the model:
  • p(x) = (1 / ((2π)^(n/2) |Σ|^(1/2))) · exp(−½ (x − μ)^T Σ^(−1) (x − μ))
  • In this function, distances are determined between each feature of the input feature vector x and corresponding mean features of the mean feature vector and the distances are combined to generate the anomaly score p(x). The selected model generates the anomaly score based on the differences and the variances for the features of the file, so that the anomaly score may be normalized based on the variances, which may vary from file-to-file.
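The density p(x) above can be computed as in the following sketch, which makes the simplifying assumption of a diagonal covariance matrix (independent features) so that |Σ| and Σ⁻¹ reduce to products and elementwise divisions; the model in the disclosure permits a full covariance matrix. Note that p(x) decreases as the file becomes more anomalous, so a practical anomaly score may be taken as, e.g., −log p(x) so that higher scores indicate greater anomaly:

```python
import math

def gaussian_density(x, mu, var):
    """Multivariate Gaussian density p(x) under a diagonal-covariance
    assumption: `var` holds the per-feature variances (the diagonal of
    the covariance matrix)."""
    n = len(x)
    # Normalizing constant: (2*pi)^(n/2) * |Sigma|^(1/2)
    norm = (2 * math.pi) ** (n / 2) * math.sqrt(math.prod(var))
    # Squared Mahalanobis distance: (x - mu)^T Sigma^-1 (x - mu)
    mahal_sq = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    return math.exp(-0.5 * mahal_sq) / norm
```

A file whose feature vector sits at the mean attains the peak density; the density falls off as the Mahalanobis distance from the mean grows.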
  • Each of the different models in the model store 148 may specify a different function, different parameters, or different features sets to which the function is applied. Each of the different models may be associated with a different subclass of files and may be configured specifically to detect anomalies within that file subclass. For example, the different models may be trained with reference files within a particular subclass so that the model produces an anomaly score relative to a particular subclass of files. In an embodiment, the model store 148 may receive new anomaly score models or periodic updates to existing anomaly score models from the security server 105. An example of a training module for training different models in the model store 148 is described in further detail below with respect to FIG. 3.
  • The file classifier 144 uses the obtained subclass of an input file to select one of the multiple anomaly score models suitable to score the input file. For example, the file classifier 144 selects a model associated with the subclass that corresponds to the assigned subclass of the input file to be scored. In one embodiment, the file classifier 144 generates features for the input file, and applies the selected model to the input file to generate the anomaly score, for example, by applying the function of the selected model to the features using the parameters (e.g., expected value and variances) of the selected model.
  • The file classifier 144 compares the anomaly scores against one or more threshold scores to classify the files. In one embodiment, the file classifier 144 classifies a file as malicious based on determining that an anomaly score for the file is greater than a threshold score, and clean otherwise. In order to reduce false positives associated with a classification based on the anomaly score alone, the file classifier 144 retrieves the full hash of the file and its associated subhashes from the index in the facet manager 152 and sends the full hash and subhashes to the security server 105, where they are checked against one or more whitelists and/or blacklists. The file classifier 144 then classifies the file based at least in part on results of the checks received from the security server 105. Depending upon the embodiment, the checks involve checking the full hash and/or the subhashes against one or more whitelists, one or more blacklists, or both. For example, if the anomaly score is above a threshold score, the full hash may be compared against a whitelist of full hashes. If no match is found, subhashes associated with facets of the file may be compared against a whitelist of subhashes. If a match is found on either whitelist, the file is determined to be clean despite the anomaly score. Otherwise, if neither the file hashes nor the facet subhashes match the respective whitelists, the file is classified as malicious. By checking against the cloud whitelist, the file classifier 144 may reduce the number of false positives (i.e., clean files erroneously classified as malicious) because, in some exceptions, clean files may not closely resemble other typical clean files and may have high anomaly scores despite being clean.
  • In an embodiment, when the anomaly score is below the threshold score, the file classifier may classify the file as clean without checking the whitelists at the security server 105. In an alternative embodiment, the file classifier may query a blacklist of the security server 105 in response to the anomaly score being below the threshold score. Here, the full hash is compared against a blacklist of full hashes and the file is classified as malicious if the full hash matches an entry in the blacklist and may otherwise be classified as clean. Optionally, a blacklist of subhashes associated with facets of the file may also be queried if no matches are found on the blacklist of full hashes, and the file may be determined to be malicious responsive to a match with a subhash in the blacklist of subhashes, and is otherwise classified as clean.
  • In another embodiment, multiple thresholds may be used (e.g., three thresholds including a lower threshold, a center threshold, and an upper threshold). Here, the file classifier 144 classifies a file as malicious responsive to determining that an anomaly score for the file is greater than (or equal to) an upper threshold score without querying whitelists or blacklists on the security server 105. On the other hand, the file classifier 144 classifies the file as clean responsive to determining that the anomaly score is less than a lower threshold score without querying whitelists or blacklists on the security server 105. Responsive to determining that the anomaly score is less than the upper threshold score and greater than (or equal to) a center threshold score, the file classifier 144 provides the file to the security server 105 for comparison against one or more whitelists of known clean files and/or facets associated with clean files as described above. Optionally, the file classifier 144 may also provide the file to the security server 105 for comparison against one or more blacklists of known malware and/or facets associated with malware as described above.
  • In some use cases, the lower and upper threshold scores may be between one to three standard deviations below and above the center threshold score, respectively. The file classifier 144 may use the lower, center, or upper threshold scores to reduce load on the security server 105 by decreasing the number of files that are provided to be checked by the security server 105.
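The three-threshold triage above can be sketched as follows. The `query_server` callable is hypothetical (standing in for the whitelist/blacklist lookup on the security server 105), and the handling of scores between the lower and center thresholds, which the disclosure leaves open, is assumed here to classify locally as clean:

```python
def classify(score, lower, center, upper, query_server):
    """Three-threshold triage: clearly malicious and clearly clean files
    are classified locally; borderline-anomalous files are checked against
    the server. `query_server` returns "clean", "malicious", or "unknown"."""
    if score >= upper:
        return "malicious"   # anomalous enough to skip the server lookup
    if score < lower:
        return "clean"       # normal enough to skip the server lookup
    if score >= center:
        verdict = query_server()
        if verdict != "unknown":
            return verdict
        return "malicious"   # anomalous and unmatched on any whitelist
    # Between lower and center: assumed clean (not specified in the text).
    return "clean"
```

This keeps server traffic limited to the band of scores where the local model alone is least decisive.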
  • The remediation module 150 remediates files that are classified as malicious by the file classifier 144. In particular, the remediation module 150 may perform remediation by removing a malicious file from the client 120, quarantining the malicious file on the client 120, or providing a notification to a user of the client 120 indicating that the malicious file is suspected to be associated with malware. The notification may also include information about the malicious file such as a file source or risk severity level proportional to the anomaly score of the malicious file. In one embodiment, the remediation module 150 provides a user of the client 120 with an option to remove or quarantine a suspected malicious file. Responsive to the user selecting to retain rather than remove the suspected malicious file, the remediation module 150 may determine that the classification is a false positive and provide this feedback to the security server 105 to re-train an anomaly score model.
  • FIG. 2 is a diagram 200 characterizing files that illustrates a process for generating anomaly scores, according to one embodiment. In the embodiment shown in FIG. 2, the diagram 200 includes a graph of points representing a sample of known clean files and a target file to be scored by an anomaly score model. To characterize the sample, features of the sample are represented by the two axes of the graph. In particular, the x-axis and y-axis represent feature 1 and feature 2, respectively, though in other embodiments, anomaly score models use many more features, e.g., hundreds of features. For each file of the sample, a multi-dimensional feature score is represented by a point on the graph. As an example, the point 220 corresponds to a file of the sample having a feature score of “x” for feature 1 and a feature score of “y” for feature 2. The points of the clean files of the sample are within dotted lines of the contour 210, illustrating that the clean files are generally similar (non-anomalous) to each other based on the characterized features. The contour 210 may represent the multivariate Gaussian distribution of the points of the sample, which is determined by the anomaly score model based on the feature scores.
  • The anomaly score may be represented in the graph as a distance 250 between the point 230 representing the target file and the mean 240 of the multivariate Gaussian distribution, or the “peak of normal.” The mean 240 may be an average of one or more feature scores of the sample. As the distance 250 (a Mahalanobis distance, which reduces to a Euclidean distance when the features are uncorrelated and have unit variance) increases, the point 230 is a greater number of standard deviations away from the mean 240, thus indicating that the target file is more dissimilar to the sample. In this visualization of the anomaly score, a threshold of distance 250 may be the threshold at which a file is classified as anomalous.
  • The diagram 200 is used to illustrate conceptually how a suspected malicious file may be distinguished from a sample of clean files. That is, the anomaly score models in the model store 148 do not necessarily use a graph having two axes, each representing a different feature, to determine anomaly scores. Rather, the anomaly score models may be implemented using known types of machine learning techniques or models such as decision trees, support vector machines (SVMs), neural networks (e.g., autoencoders), boosted/bagged ensemble models, isolation forests, and the like. Additionally, the anomaly score models may characterize any number of features of the sample, e.g., hundreds of different features.
  • FIG. 3 is a block diagram illustrating an example embodiment of a security server 105. The security server 105 includes a processor 300 for manipulating and processing data, and a storage medium 310 for storing data and program instructions associated with various modules. The storage medium 310 includes a lookup module 312, a file database 314, and a file learning module 316, which collectively form a file analysis system 320. The storage medium 310 additionally includes a facet lookup module 332, a facet database 334, and a facet learning module 336, which collectively form a facet analysis system 330. Alternative embodiments may include different or additional modules or omit one or more of the illustrated modules.
  • The lookup module 312 checks full hashes against a whitelist and/or blacklist stored to the file database 314 based on information received from the file classifier 144 of the protection application 136 running on the client 120. In one embodiment, the file database 314 stores full hashes of files and the lookup module 312 compares a received full hash against the full hashes in the file database 314 to determine if they match. Alternatively, the file database 314 may store full files and the lookup module 312 may compare the full files against the files in the file database 314 to identify a match. For each file, the lookup module 312 performs a lookup in the file database 314 to determine if the full hash of the file is associated with information about known malicious files (e.g., related to malware) or clean files. The lookup module 312 provides a result of the lookup to the file classifier 144, e.g., via the network 110. The result may indicate that the full hash is associated with a full hash of a known malicious file on a blacklist, the full hash is associated with a full hash of a known clean file on a whitelist, or the full hash is not associated with full hashes of files on either the cloud blacklist or whitelist. The whitelist and/or blacklist in the file database 314 may include a more extensive database of full hashes of files than the previously described local whitelist and/or blacklist of the file selection module 142 on the client 120.
  • The file learning module 316 may establish training sets of clean files and learn model parameters for a plurality of different models each corresponding to a different file subclass. For example, the file learning module 316 may obtain metadata for clean files and group the clean files into training sets for separate subclasses based on the metadata as described above. The file learning module 316 generates features for the files in each training set. The type or number of features may be different in each training set corresponding to a different subclass. For each subclass, the file learning module 316 trains a separate anomaly score model to learn model parameters using the features derived from the clean files in the training set for the subclass. Thus, each model may be configured to generate an anomaly score for an input file of a given file subclass, based on clean files of the same subclass as the input file.
  • In an example using subclasses that correspond to the file sources of files, a first model may be trained using a sample of files downloaded from an online server file source (and thus assigned to a first subclass), while a second model is trained using another sample of files obtained from a local disk file source on clients 120 (and thus assigned to a second subclass different than the first subclass). Thus, the first model may generate more accurate anomaly scores for files downloaded from the online server than for files obtained from the local disk, and vice-versa for the second model, because the features of each model are customized for different file sources.
  • In an example, the file learning module 316 uses the following equations to determine the model parameters including an expected value (i.e., mean) μ and covariance matrix Σ:
  • μ = (1/m) ∑_{i=1}^{m} x^(i)    Σ = (1/m) ∑_{i=1}^{m} (x^(i) − μ)(x^(i) − μ)^T
  • where x(i) is a vector representing the set of features for a sample clean file i in the training set of m files and has a dimension equal to the number of features. The mean feature vector μ represents average feature scores for each of the features across the training set. Further, the covariance matrix Σ represents the variance of each feature and the covariance between all feature pairs (i.e., extending across multiple dimensions), and may also capture how pairs of features vary together.
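The parameter estimation above can be sketched directly from the two equations, here in pure Python over lists of feature vectors (a training library such as NumPy would normally be used instead):

```python
def fit_parameters(samples):
    """Estimate the mean vector mu and covariance matrix Sigma from the
    feature vectors of a training set of m known clean files."""
    m = len(samples)
    n = len(samples[0])
    # mu[j] is the average of feature j across the training set.
    mu = [sum(x[j] for x in samples) / m for j in range(n)]
    # Sigma[j][k] is the covariance between features j and k.
    sigma = [[sum((x[j] - mu[j]) * (x[k] - mu[k]) for x in samples) / m
              for k in range(n)]
             for j in range(n)]
    return mu, sigma
```

The diagonal of the returned matrix holds the per-feature variances; the off-diagonal entries capture how pairs of features vary together.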
  • The file learning module 316 may optimize the model parameters in each model by applying the model to one or more test files known to be malicious or clean files. The performance of the model can be measured based on correct classifications of malicious test files and a number of false positives (e.g., classification of clean test files as malicious). Model parameters, the selected function, or the feature sets used in each model may then be adjusted to improve performance.
  • The file learning module 316 may distribute the trained models to protection applications 136 of clients 120, as well as periodically update the distributed models. Beneficially, since clean files generally experience slower rates of change than do malicious files, the anomaly score models do not need to be frequently re-trained with updated samples of clean files.
  • The facet analysis system 330 checks whether a file is clean and/or malicious based on its facets by checking the subhashes of the facets against a subhash whitelist and/or subhash blacklist. In one embodiment, the facet analysis system 330 solely checks whether the file is clean based on its facets by checking subhashes of the facets against a whitelist of subhashes stored at the facet database 334 and does not employ a blacklist associated with facets. Alternatively, the facet analysis system 330 may additionally or alternatively check whether the file is malicious based on its facets by checking subhashes of the facets against a blacklist of subhashes stored at the facet database 334.
  • The facet analysis system 330 additionally tracks the presence of subhashes in whitelisted and/or blacklisted files to learn which subhashes are associated with clean files and which subhashes are associated with malware. The subhash whitelist and/or subhash blacklist may be updated dynamically as new associations are learned.
  • The facet analysis system 330 thus provides an additional way of whitelisting clean files in order to reduce false positives that may be undetected by the file analysis system 320. By using the file analysis system 320 and the facet analysis system 330 in combination, the security server 105 can enable file classification that beneficially reduces the rate of false positives. Similarly, a facet blacklist, if employed, may reduce rates of false negatives.
  • The facet lookup module 332 checks subhashes against the facet database 334 based on information received from the file classifier 144 of the protection application 136 running on the client 120. The received information includes one or more subhashes, e.g., subhashes of facets identified by the facet manager 152 of the protection application 136 and included with the full hash received by the security server 105. For each subhash, the facet lookup module 332 performs a lookup in the facet database 334 to determine if the subhash is associated with information about known clean files and therefore matches a subhash on a subhash whitelist. In an embodiment, the facet lookup module 332 additionally or alternatively performs a lookup in the facet database 334 to determine if the subhash is associated with information about known malicious files and therefore matches a subhash on a facet blacklist.
  • The facet database 334 includes a subhash whitelist including subhashes of facets associated with files known to be clean. The facet database 334 may additionally or alternatively include a subhash blacklist including subhashes associated with files known to be malicious. If a subhash is included in the subhash whitelist, then the file associated with the subhash is determined to be clean. In an embodiment, if a subhash is included in the subhash blacklist, then the file is considered to be malicious. If no subhash of a file is included in the subhash whitelist, the security server 105 indicates that the subhashes were not in the whitelist to the protection application 136, which may then classify the file using other information, such as the anomaly score alone. As described below, the facet learning module 336 determines whether a subhash is included in the subhash whitelist, and in some embodiments, the subhash blacklist. In some embodiments, the facet analysis system 330 includes solely a subhash whitelist or a subhash blacklist, in which case the analysis corresponding to the other is not performed.
  • Based on the results of the lookups for the one or more subhashes, the facet lookup module 332 provides an indication to the file classifier 144, e.g., via the network 110, of whether at least one subhash was included in a subhash whitelist, and in some embodiments, whether at least one subhash was included in a subhash blacklist. In an embodiment, if one subhash is in the subhash whitelist, the facet lookup module 332 indicates to the file classifier 144 that the file is whitelisted. Alternatively, there may be a threshold number of subhashes that need to be clean (e.g., found in the subhash whitelist) in order for the file to be classified as clean by the facet lookup module 332.
  • The facet learning module 336 dynamically updates the subhash whitelist and/or subhash blacklist based on file classifications of files having facets associated with the subhashes. The facet database 334 maintains, for each subhash of a facet, a clean count and a malicious count. The clean count of a subhash is a count indicating a number of known clean files that have been observed to include the facet associated with the subhash. The malicious count is a count indicating a number of known malicious files that have been observed to include the facet associated with the subhash. The facet learning module 336 updates the clean count and malicious count of each subhash as the security server 105 receives subhashes for facets and associated file hashes for a file, and determines whether the file is malware or clean. For example, each time the security server 105 determines that a file is clean, the clean count of each subhash of a facet included in the file is incremented by one by the facet learning module 336. Similarly, if the security server 105 determines that a file is malicious, the security server 105 increments by one the malicious count of each subhash of each facet in the file. In an embodiment, the security server 105 updates clean counts and/or malicious counts based on information received from a third party server (e.g., a third party security analysis system) that provides a determination of whether a file is clean or malicious.
  • The facet learning module 336 updates the subhash whitelist and, in some embodiments, the subhash blacklist, based on the clean counts and the malicious counts. If the malicious count of a subhash is nonzero, the facet learning module 336 may add the subhash to the subhash blacklist, and remove it from the subhash whitelist if it was included therein. If the clean count of a subhash is at least a threshold clean count value and the malicious count is zero, the facet learning module 336 adds the subhash to the subhash whitelist. The threshold clean count value may vary depending upon the embodiment. For example, the threshold clean count value may be three, or five, or another value.
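The count-based update rules above can be sketched as a small class. The threshold clean count of three is one example value from the text, and the data structures (in-memory dictionaries and sets) are an illustrative simplification of the facet database 334:

```python
class FacetLearner:
    """Tracks clean/malicious counts per subhash and maintains the subhash
    whitelist and blacklist from them."""

    def __init__(self, clean_threshold=3):
        self.clean_threshold = clean_threshold
        self.clean_count = {}
        self.malicious_count = {}
        self.whitelist = set()
        self.blacklist = set()

    def observe(self, subhashes, is_clean):
        """Record one classified file's facet subhashes."""
        for sh in subhashes:
            if is_clean:
                self.clean_count[sh] = self.clean_count.get(sh, 0) + 1
            else:
                self.malicious_count[sh] = self.malicious_count.get(sh, 0) + 1
            self._update_lists(sh)

    def _update_lists(self, sh):
        if self.malicious_count.get(sh, 0) > 0:
            # A nonzero malicious count blacklists the subhash and evicts
            # it from the whitelist if it was previously included.
            self.blacklist.add(sh)
            self.whitelist.discard(sh)
        elif self.clean_count.get(sh, 0) >= self.clean_threshold:
            self.whitelist.add(sh)
```

Driving `observe` from each server-side file verdict keeps both lists current as new clean and malicious files are seen.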
  • In this manner, the facet learning module 336 enables the security server 105 to maintain up-to-date subhash whitelists and, in some embodiments, subhash blacklists, that accurately reflect which subhashes are likely clean and/or malicious. As the validity of classifications made by the security server 105 is correlated with the validity of the whitelists and blacklists maintained by the security server 105, the facet learning module 336 enables the security server 105 to better guarantee that a classification of a file is correct.
  • FIG. 4 is a flowchart illustrating an embodiment of a process 400 by a security server 105 for automatically updating a subhash whitelist associated with facets of files. The security server 105 receives 402 a first full hash of a file and a first set of subhashes of facets of the file from a protection application 136 at a client 120. The security server 105 determines 404 whether the first full hash is whitelisted, and if so, the security server 105 updates 406, for each subhash in the first set of subhashes received in association with the full hash for the file, a clean count associated with the subhash. The security server 105 adds 408 a subhash to a subhash whitelist responsive to the clean count associated with the subhash exceeding a threshold clean count. If the security server 105 determines that the file is malware, the security server 105 removes the subhash from the subhash whitelist if present, regardless of its clean count. In an embodiment, the security server 105 maintains a malicious count for each subhash. The security server 105 increments the malicious count of each subhash associated with the file responsive to determining that the file is malware, and removes a subhash from the whitelist responsive to its malicious count being nonzero. The security server 105 may then report the results of the determination to the client 120.
  • FIG. 5 is a flowchart illustrating an embodiment of a process 500 for classifying files based on subhashes. The security server 105 receives 510 a second full hash of a second file and a second set of subhashes of facets of the second file from a client 120. The security server 105 determines 512 whether the second full hash is whitelisted. If the second full hash is whitelisted, the security server 105 sends an indication to the protection application 136 that the file is whitelisted. Responsive to determining that the second full hash is not whitelisted, the security server 105 determines 514 whether at least one subhash in the second set of subhashes is included in the subhash whitelist. Responsive to determining 514 that at least one subhash in the second set of subhashes is included in the subhash whitelist, the security server 105 determines 516 that the second file is whitelisted and reports 518 that the second file is whitelisted to the client 120. Otherwise, the security server 105 reports that the second file does not match any whitelists (i.e., the file is unknown to the security server 105).
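The lookup sequence of process 500 can be sketched as follows, with the whitelists modeled as simple sets for illustration:

```python
def check_file(full_hash, subhashes, hash_whitelist, subhash_whitelist):
    """Process 500 sketch: whitelist the file if its full hash matches,
    or, failing that, if at least one facet subhash is whitelisted;
    otherwise report the file as unknown to the server."""
    if full_hash in hash_whitelist:
        return "whitelisted"
    if any(sh in subhash_whitelist for sh in subhashes):
        return "whitelisted"
    return "unknown"
```

On an "unknown" result, the protection application would fall back to other information, such as the anomaly score alone, to classify the file.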
  • In an embodiment, the security server 105 checks whether at least one of the subhashes in the second set of subhashes is included in a subhash blacklist. If the security server 105 determines that at least one of the subhashes in the second set of subhashes is included in the subhash blacklist, the security server 105 classifies the second file as malicious, regardless of how many of the subhashes in the second set of subhashes are in the subhash whitelist.
  • In an embodiment, the security server 105 determines that the first full hash is malicious, and increments, for each subhash in the first set of subhashes, an associated malicious count. The security server 105 may add subhashes with nonzero malicious counts to the subhash blacklist, if there is a subhash blacklist, and remove such subhashes from the subhash whitelist.
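Under the embodiments above, the subhash blacklist takes precedence over the subhash whitelist. A minimal sketch, assuming set-based lists and an illustrative function name:

```python
def classify_with_blacklist(subhashes, subhash_whitelist, subhash_blacklist):
    # A single blacklisted subhash marks the file malicious, regardless of
    # how many other subhashes appear in the whitelist.
    if any(sh in subhash_blacklist for sh in subhashes):
        return "malicious"
    if any(sh in subhash_whitelist for sh in subhashes):
        return "whitelisted"
    return "unknown"
```

Checking the blacklist first encodes the asymmetry of the classification: a whitelist match is evidence, while a blacklist match is dispositive.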
  • The above-described system and processes beneficially enable reliable detection and remediation of malware. By relying on facets based on non-code portions of files to whitelist the files, the security server 105 can classify files with a reduced rate of false positives.
  • The present disclosure refers to embodiments using full hashes, though the techniques described herein additionally apply to full files that are not hashed and whitelists and/or blacklists of full files that are not hashed. In such alternative embodiments, the security server 105 maintains whitelists and/or blacklists of full files that are not hashed, and matches full files received from the client 120 against the whitelists and/or blacklists.
  • The present disclosure refers to embodiments using subhashes and subhash whitelists and/or blacklists, though the techniques described herein additionally apply to facets that are not hashed and whitelists and/or blacklists of facets that are not hashed. In such alternative embodiments, the security server 105 maintains whitelists and/or blacklists of facets that are not hashed, and matches facets received from the client 120 against the whitelists and/or blacklists.
  • ADDITIONAL CONSIDERATIONS
  • The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
• Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
  • Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
  • Embodiments of the invention may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a nontransitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
  • Embodiments of the invention may also relate to a product that is produced by a computing process described herein. Such a product may comprise information resulting from a computing process, where the information is stored on a nontransitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
  • As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
  • Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

Claims (20)

1. A method comprising:
receiving a first full hash and a first plurality of subhashes from a client, wherein the first full hash is a hash of a first file and each subhash in the first plurality of subhashes is a hash of a facet of the first file;
determining whether the first full hash is blacklisted;
responsive to determining the first full hash is blacklisted, updating, for each subhash in the first plurality of subhashes, an associated malicious count, wherein the malicious count tracks a historic number of blacklisted files with which the subhash is associated;
responsive to a first malicious count of the malicious counts exceeding a threshold malicious count, adding the subhash associated with the first malicious count to a subhash blacklist;
receiving a second full hash and a second plurality of subhashes from the client, wherein the second full hash is a hash of a second file and each subhash in the second plurality of subhashes is a hash of a facet of the second file;
determining whether the second full hash is blacklisted;
responsive to determining the second full hash is not blacklisted, determining whether a subhash in the second plurality of subhashes is included in the subhash blacklist;
responsive to determining a subhash in the second plurality of subhashes is included in the subhash blacklist, determining the second file is blacklisted; and
reporting that the second file is blacklisted to the client.
2. The method of claim 1, further comprising:
receiving a third plurality of subhashes from the client, wherein each subhash in the third plurality of subhashes is a hash of a facet of a third file;
determining that the third file is malicious; and
removing a subhash in the third plurality of subhashes from a subhash whitelist.
3. The method of claim 2, wherein removing the subhash in the third plurality of subhashes from the subhash whitelist is responsive to a malicious count associated with the subhash in the third plurality of subhashes comprising a nonzero value.
4. The method of claim 1, further comprising:
receiving a third plurality of subhashes of facets of a third file from a second client;
determining whether at least one subhash in the third plurality of subhashes is included in the subhash blacklist, comprising:
determining whether a subhash in the third plurality of subhashes is the subhash associated with the first malicious count; and
reporting a result of determining whether at least one subhash in the third plurality of subhashes is included in the subhash blacklist to the second client.
5. The method of claim 1, further comprising:
receiving a third plurality of subhashes from the client, wherein each subhash in the third plurality of subhashes is a hash of a facet of a third file;
determining whether at least one subhash in the third plurality of subhashes is included in a subhash whitelist; and
responsive to determining at least one subhash included in the third plurality of subhashes is included in the subhash whitelist, reporting that the third file is clean to the client.
6. The method of claim 1, wherein reporting that the second file is blacklisted to the client comprises reporting that the second file is blacklisted to a protection application at the client.
7. The method of claim 1, wherein the threshold malicious count is one.
8. The method of claim 1, wherein reporting that the second file is blacklisted to the client comprises sending instructions to the client to perform at least one of removing the second file, quarantining the second file, and providing a notification to a user of the client.
9. A non-transitory computer-readable storage medium storing computer program instructions executable by a processor to perform operations comprising:
receiving a first full hash and a first plurality of subhashes from a client, wherein the first full hash is a hash of a first file and each subhash in the first plurality of subhashes is a hash of a facet of the first file;
determining whether the first full hash is blacklisted;
responsive to determining the first full hash is blacklisted, updating, for each subhash in the first plurality of subhashes, an associated malicious count, wherein the malicious count tracks a historic number of blacklisted files with which the subhash is associated;
responsive to a first malicious count of the malicious counts exceeding a threshold malicious count, adding the subhash associated with the first malicious count to a subhash blacklist;
receiving a second full hash and a second plurality of subhashes from the client, wherein the second full hash is a hash of a second file and each subhash in the second plurality of subhashes is a hash of a facet of the second file;
determining whether the second full hash is blacklisted;
responsive to determining the second full hash is not blacklisted, determining whether a subhash in the second plurality of subhashes is included in the subhash blacklist;
responsive to determining a subhash in the second plurality of subhashes is included in the subhash blacklist, determining the second file is blacklisted; and
reporting that the second file is blacklisted to the client.
10. The non-transitory computer-readable storage medium of claim 9, the operations further comprising:
receiving a third plurality of subhashes from the client, wherein each subhash in the third plurality of subhashes is a hash of a facet of a third file;
determining that the third file is malicious; and
removing a subhash in the third plurality of subhashes from a subhash whitelist.
11. The non-transitory computer-readable storage medium of claim 10, wherein removing the subhash in the third plurality of subhashes from the subhash whitelist is responsive to a malicious count associated with the subhash in the third plurality of subhashes comprising a nonzero value.
12. The non-transitory computer-readable storage medium of claim 9, the operations further comprising:
receiving a third plurality of subhashes of facets of a third file from a second client;
determining whether at least one subhash in the third plurality of subhashes is included in the subhash blacklist, comprising:
determining whether a subhash in the third plurality of subhashes is the subhash associated with the first malicious count; and
reporting a result of determining whether at least one subhash in the third plurality of subhashes is included in the subhash blacklist to the second client.
13. The non-transitory computer-readable storage medium of claim 9, the operations further comprising:
receiving a third plurality of subhashes from the client, wherein each subhash in the third plurality of subhashes is a hash of a facet of a third file;
determining whether at least one subhash in the third plurality of subhashes is included in a subhash whitelist; and
responsive to determining at least one subhash included in the third plurality of subhashes is included in the subhash whitelist, reporting that the third file is clean to the client.
14. The non-transitory computer-readable storage medium of claim 9, wherein reporting that the second file is blacklisted to the client comprises reporting that the second file is blacklisted to a protection application at the client.
15. The non-transitory computer-readable storage medium of claim 9, wherein the threshold malicious count is one.
16. A method, comprising:
receiving a full hash and a plurality of subhashes, wherein the full hash is a hash of a file and each subhash in the plurality of subhashes is a hash of a facet of the file;
determining whether the full hash is whitelisted;
responsive to determining the full hash is not whitelisted, determining whether a subhash in the plurality of subhashes is included in a subhash whitelist, wherein a subhash is included in the subhash whitelist responsive to a clean count associated with the subhash exceeding a threshold value, the clean count indicating a number of previous instances of a respective full hash being whitelisted;
responsive to determining a subhash in the plurality of subhashes is included in the subhash whitelist, determining the file is whitelisted; and
reporting that the file is whitelisted.
17. The method of claim 16, wherein at least one facet is an author string, a product name string, a list of application programming interfaces, a file description, copyright information, or a portion of a file header.
18. The method of claim 16, further comprising:
receiving a plurality of files; and
applying the plurality of files to a file filter to produce a subset of files comprising a subset of the plurality of files to be checked against the subhash whitelist;
wherein the file is a file in the subset of files.
19. The method of claim 18, wherein the file filter produces the subset of files by performing steps comprising:
generating, for each file in the plurality of files, an anomaly score;
evaluating each anomaly score against an anomaly score threshold; and
producing the subset of files by adding, to the subset of files, files of the plurality of files whose anomaly scores exceed the anomaly score threshold.
20. The method of claim 16, further comprising:
receiving the file;
analyzing the file to generate a plurality of facets;
hashing the file to produce the full hash; and
hashing each facet in the plurality of facets to generate the plurality of subhashes.
US17/219,492 2019-03-04 2021-03-31 Facet Blacklisting in Anomaly Detection Pending US20210218755A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/219,492 US20210218755A1 (en) 2019-03-04 2021-03-31 Facet Blacklisting in Anomaly Detection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16/292,307 US10992703B2 (en) 2019-03-04 2019-03-04 Facet whitelisting in anomaly detection
US17/219,492 US20210218755A1 (en) 2019-03-04 2021-03-31 Facet Blacklisting in Anomaly Detection

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/292,307 Continuation US10992703B2 (en) 2019-03-04 2019-03-04 Facet whitelisting in anomaly detection

Publications (1)

Publication Number Publication Date
US20210218755A1 true US20210218755A1 (en) 2021-07-15

Family

ID=72335551

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/292,307 Active 2039-03-09 US10992703B2 (en) 2019-03-04 2019-03-04 Facet whitelisting in anomaly detection
US17/219,492 Pending US20210218755A1 (en) 2019-03-04 2021-03-31 Facet Blacklisting in Anomaly Detection

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US16/292,307 Active 2039-03-09 US10992703B2 (en) 2019-03-04 2019-03-04 Facet whitelisting in anomaly detection

Country Status (1)

Country Link
US (2) US10992703B2 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019125914A (en) * 2018-01-16 2019-07-25 アラクサラネットワークス株式会社 Communication device and program
US11277425B2 (en) * 2019-04-16 2022-03-15 International Business Machines Corporation Anomaly and mode inference from time series data
US11182400B2 (en) * 2019-05-23 2021-11-23 International Business Machines Corporation Anomaly comparison across multiple assets and time-scales
US11271957B2 (en) 2019-07-30 2022-03-08 International Business Machines Corporation Contextual anomaly detection across assets
US20220292201A1 (en) * 2019-08-27 2022-09-15 Nec Corporation Backdoor inspection apparatus, backdoor inspection method, and non-transitory computer readable medium
US11288401B2 (en) * 2019-09-11 2022-03-29 AO Kaspersky Lab System and method of reducing a number of false positives in classification of files
US11474983B2 (en) * 2020-07-13 2022-10-18 International Business Machines Corporation Entity resolution of master data using qualified relationship score
US20220166778A1 (en) * 2020-11-24 2022-05-26 Saudi Arabian Oil Company Application whitelisting based on file handling history
CN112464295B (en) * 2020-12-14 2023-06-30 国网辽宁省电力有限公司抚顺供电公司 Maintenance communication safety device based on electric power edge gateway equipment
CN113704764A (en) * 2021-09-09 2021-11-26 安全邦(北京)信息技术有限公司 Intelligent detection equipment and method for industrial control system safety

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100216434A1 (en) * 2009-02-25 2010-08-26 Chris Marcellino Managing Notification Messages
US20120017275A1 (en) * 2010-07-13 2012-01-19 F-Secure Oyj Identifying polymorphic malware
US20120030293A1 (en) * 2010-07-27 2012-02-02 At&T Intellectual Property I, L.P. Employing report ratios for intelligent mobile messaging classification and anti-spam defense
US20120090025A1 (en) * 2010-10-06 2012-04-12 Steve Bradford Milner Systems and methods for detection of malicious software packages
US20120210423A1 (en) * 2010-12-01 2012-08-16 Oliver Friedrichs Method and apparatus for detecting malicious software through contextual convictions, generic signatures and machine learning techniques
US9223961B1 (en) * 2012-04-04 2015-12-29 Symantec Corporation Systems and methods for performing security analyses of applications configured for cloud-based platforms
US9244818B1 (en) * 2011-03-29 2016-01-26 Amazon Technologies, Inc. Automated selection of quality control tests to run on a software application
US9830453B1 (en) * 2015-10-30 2017-11-28 tCell.io, Inc. Detection of code modification
US20180253545A1 (en) * 2016-05-24 2018-09-06 Tencent Technology (Shenzhen) Company Limited File authentication method and apparatus
US20200241769A1 (en) * 2019-01-25 2020-07-30 International Business Machines Corporation Methods and systems for encryption based on cognitive data classification
US11184378B2 (en) * 2019-01-30 2021-11-23 Palo Alto Networks (Israel Analytics) Ltd. Scanner probe detection

Family Cites Families (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5440723A (en) 1993-01-19 1995-08-08 International Business Machines Corporation Automatic immune system for computers and computer networks
JP2501771B2 (en) * 1993-01-19 1996-05-29 インターナショナル・ビジネス・マシーンズ・コーポレイション Method and apparatus for obtaining multiple valid signatures of an unwanted software entity
US6263348B1 (en) * 1998-07-01 2001-07-17 Serena Software International, Inc. Method and apparatus for identifying the existence of differences between two files
US20050108562A1 (en) 2003-06-18 2005-05-19 Khazan Roger I. Technique for detecting executable malicious code using a combination of static and dynamic analyses
US7814545B2 (en) * 2003-07-22 2010-10-12 Sonicwall, Inc. Message classification using classifiers
US7873996B1 (en) * 2003-11-22 2011-01-18 Radix Holdings, Llc Messaging enhancements and anti-spam
US7966658B2 (en) * 2004-04-08 2011-06-21 The Regents Of The University Of California Detecting public network attacks using signatures and fast content analysis
US7660865B2 (en) * 2004-08-12 2010-02-09 Microsoft Corporation Spam filtering with probabilistic secure hashes
US8108929B2 (en) 2004-10-19 2012-01-31 Reflex Systems, LLC Method and system for detecting intrusive anomalous use of a software system using multiple detection algorithms
US7334005B2 (en) 2005-04-13 2008-02-19 Symantec Corporation Controllable deployment of software updates
US9055093B2 (en) 2005-10-21 2015-06-09 Kevin R. Borders Method, system and computer program product for detecting at least one of security threats and undesirable computer files
WO2007050667A2 (en) 2005-10-25 2007-05-03 The Trustees Of Columbia University In The City Of New York Methods, media and systems for detecting anomalous program executions
US7873833B2 (en) * 2006-06-29 2011-01-18 Cisco Technology, Inc. Detection of frequent and dispersed invariants
GB2444514A (en) * 2006-12-04 2008-06-11 Glasswall Electronic file re-generation
US9729513B2 (en) * 2007-11-08 2017-08-08 Glasswall (Ip) Limited Using multiple layers of policy management to manage risk
US8607066B1 (en) * 2008-08-04 2013-12-10 Zscaler, Inc. Content inspection using partial content signatures
US9292689B1 (en) * 2008-10-14 2016-03-22 Trend Micro Incorporated Interactive malicious code detection over a computer network
US8353037B2 (en) * 2009-12-03 2013-01-08 International Business Machines Corporation Mitigating malicious file propagation with progressive identifiers
US9104872B2 (en) * 2010-01-28 2015-08-11 Bank Of America Corporation Memory whitelisting
DE102010008538A1 (en) * 2010-02-18 2011-08-18 zynamics GmbH, 44787 Method and system for detecting malicious software
US8966623B2 (en) * 2010-03-08 2015-02-24 Vmware, Inc. Managing execution of a running-page in a virtual machine
RU2454714C1 (en) 2010-12-30 2012-06-27 Закрытое акционерное общество "Лаборатория Касперского" System and method of increasing efficiency of detecting unknown harmful objects
US9129110B1 (en) 2011-01-14 2015-09-08 The United States Of America As Represented By The Secretary Of The Air Force Classifying computer files as malware or whiteware
US8584235B2 (en) * 2011-11-02 2013-11-12 Bitdefender IPR Management Ltd. Fuzzy whitelisting anti-malware systems and methods
US8635700B2 (en) 2011-12-06 2014-01-21 Raytheon Company Detecting malware using stored patterns
CN103369555B (en) 2012-04-01 2017-03-01 西门子公司 A kind of method and apparatus for detecting mobile phone viruses
US20140157405A1 (en) * 2012-12-04 2014-06-05 Bill Joll Cyber Behavior Analysis and Detection Method, System and Architecture
US9430646B1 (en) 2013-03-14 2016-08-30 Fireeye, Inc. Distributed systems and methods for automatically detecting unknown bots and botnets
US9311480B2 (en) 2013-03-15 2016-04-12 Mcafee, Inc. Server-assisted anti-malware client
US9319423B2 (en) 2013-11-04 2016-04-19 At&T Intellectual Property I, L.P. Malware and anomaly detection via activity recognition based on sensor data
US9262635B2 (en) 2014-02-05 2016-02-16 Fireeye, Inc. Detection efficacy of virtual machine-based analysis with application specific events
US9805115B1 (en) 2014-03-13 2017-10-31 Symantec Corporation Systems and methods for updating generic file-classification definitions
US9762603B2 (en) 2014-05-10 2017-09-12 Informatica Llc Assessment type-variable enterprise security impact analysis
US10043009B2 (en) * 2014-09-24 2018-08-07 Intel Corporation Technologies for software basic block similarity analysis
CN105095755A (en) 2015-06-15 2015-11-25 安一恒通(北京)科技有限公司 File recognition method and apparatus
US10129291B2 (en) 2015-06-27 2018-11-13 Mcafee, Llc Anomaly detection to identify malware
US9935972B2 (en) 2015-06-29 2018-04-03 Fortinet, Inc. Emulator-based malware learning and detection
RU2614557C2 (en) 2015-06-30 2017-03-28 Закрытое акционерное общество "Лаборатория Касперского" System and method for detecting malicious files on mobile devices
US10200391B2 (en) 2015-09-23 2019-02-05 AVAST Software s.r.o. Detection of malware in derived pattern space
US10162967B1 (en) * 2016-08-17 2018-12-25 Trend Micro Incorporated Methods and systems for identifying legitimate computer files
US10657182B2 (en) * 2016-09-20 2020-05-19 International Business Machines Corporation Similar email spam detection


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Afroz, S., & Greenstadt, R. (2009). PhishZoo: An Automated Web Phishing Detection Approach Based on Profiling and Fuzzy Matching. *
ElMouatez Billah Karbab, Mourad Debbabi, Djedjiga Mouheb, Fingerprinting Android packaging: Generating DNAs for malware detection, Digital Investigation, Volume 18, Supplement, 2016, Pages S33-S45, ISSN 1742-2876, https://doi.org/10.1016/j.diin.2016.04.013. *

Also Published As

Publication number Publication date
US10992703B2 (en) 2021-04-27
US20200287914A1 (en) 2020-09-10

Similar Documents

Publication Publication Date Title
US20210218755A1 (en) Facet Blacklisting in Anomaly Detection
US10860720B2 (en) Static anomaly-based detection of malware files
US10902117B1 (en) Framework for classifying an object as malicious with machine learning for deploying updated predictive models
AU2018217323B2 (en) Methods and systems for identifying potential enterprise software threats based on visual and non-visual data
US10229269B1 (en) Detecting ransomware based on file comparisons
US10218740B1 (en) Fuzzy hash of behavioral results
US10430586B1 (en) Methods of identifying heap spray attacks using memory anomaly detection
US11570211B1 (en) Detection of phishing attacks using similarity analysis
US11122061B2 (en) Method and server for determining malicious files in network traffic
US8955121B2 (en) System, method, and computer program product for dynamically adjusting a level of security applied to a system
US8775333B1 (en) Systems and methods for generating a threat classifier to determine a malicious process
US20170083703A1 (en) Leveraging behavior-based rules for malware family classification
US20150033341A1 (en) System and method to detect threats to computer based devices and systems
US9614866B2 (en) System, method and computer program product for sending information extracted from a potentially unwanted data sample to generate a signature
US8370942B1 (en) Proactively analyzing binary files from suspicious sources
US10601847B2 (en) Detecting user behavior activities of interest in a network
US11108790B1 (en) Attack signature generation
US20200334353A1 (en) Method and system for detecting and classifying malware based on families
US9825994B1 (en) Detection and removal of unwanted applications
US20210342651A1 (en) Data classification device, data classification method, and data classification program
US11792212B2 (en) IOC management infrastructure

Legal Events

Date Code Title Description
AS Assignment

Owner name: MALWAREBYTES INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SWANSON, DOUGLAS STUART;YOUSSEIF, MINA;LUSSIER, JON-PAUL, JR.;SIGNING DATES FROM 20190227 TO 20190302;REEL/FRAME:055790/0575

STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

AS Assignment

Owner name: COMPUTERSHARE TRUST COMPANY, N.A., AS ADMINISTRATIVE AGENT, MARYLAND

Free format text: INTELLECTUAL PROPERTY SECURITY AGREEMENT;ASSIGNOR:MALWAREBYTES INC.;REEL/FRAME:062599/0069

Effective date: 20230131

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

AS Assignment

Owner name: COMPUTERSHARE TRUST COMPANY, N.A., AS ADMINISTRATIVE AGENT, MARYLAND

Free format text: INTELLECTUAL PROPERTY SECURITY AGREEMENT;ASSIGNOR:MALWAREBYTES CORPORATE HOLDCO INC.;REEL/FRAME:066373/0912

Effective date: 20240124

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

AS Assignment

Owner name: MALWAREBYTES CORPORATE HOLDCO INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MALWAREBYTES INC.;REEL/FRAME:066900/0386

Effective date: 20240201