WO2022182919A1 - Signatureless detection of malicious ms office documents - Google Patents

Signatureless detection of malicious ms office documents

Info

Publication number
WO2022182919A1
Authority
WO
WIPO (PCT)
Prior art keywords
malicious
document
documents
ole
files
Prior art date
Application number
PCT/US2022/017778
Other languages
English (en)
French (fr)
Inventor
Benjamin Chang
Ghanashyam SATPATHY
Original Assignee
Netskope, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/184,478 external-priority patent/US11222112B1/en
Priority claimed from US17/184,502 external-priority patent/US11349865B1/en
Application filed by Netskope, Inc. filed Critical Netskope, Inc.
Priority to JP2023551122A priority Critical patent/JP7493108B2/ja
Publication of WO2022182919A1 publication Critical patent/WO2022182919A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the technology disclosed relates to cybersecurity attacks and cloud-based security, and more specifically a system and method for preventing malware attacks where Microsoft Office Documents act as the primary vector (a way) for delivering malicious code in the form of macros and OLE objects.
  • the technology disclosed relates to the detection of documents that include malicious macros and/or malicious OLE objects that do not contain known signatures.
  • signatureless refers to detecting malicious macros and malicious OLE objects that do not have previously established signatures.
  • the technology disclosed uses machine learning and feature engineering to predict the presence of malicious macros and malicious OLE objects in MS Office documents and other document types, without requiring the malicious code to have been previously known.
  • network devices may predict the presence of malware in document files without using a known signature for the unknown malware. Additionally, the detection of malicious office files may happen in near real-time, greatly improving network security while having reduced negative impact on system throughput, by reducing latencies in network security processing time.
  • FIG. 1 illustrates an architectural level schematic of a system for detecting malicious content in MS Office macros and MS Office embedded OLE object files.
  • the disclosed system uses machine learning and feature engineering to develop a supervised training model to detect malicious content in signatureless malicious data.
  • FIG. 2 illustrates malware detection aspects of an office classifier for detecting malware included in macros in MS Office documents and OLE objects operating within a Netskope network security system particularly showing the placement of ML based office classifier inside the network security system.
  • FIG. 3 illustrates an example workflow for training a supervised machine learning model according to an aspect of the present technology.
  • FIG. 4 is a flow diagram of the office classifier, illustrating how potentially malicious files are classified and post-processed.
  • FIG. 5 is a flowchart illustrating the steps (actions) in detecting embedded macros.
  • FIG. 6 is a flowchart illustrating the steps (actions) in detecting embedded malicious OLE objects.
  • FIG. 7 is a simplified block diagram of a computer system that can be used to detect malicious macros embedded in MS Office documents and MS Office documents having embedded OLE content.
  • the technology disclosed relates to a feature engineering approach for machine learning based classification of Microsoft Office documents, which will significantly improve malicious file detection efficiency for documents that include macros and OLE objects.
  • the technology disclosed relates to cybersecurity attacks and cloud-based security.
  • the technology disclosed is a method and apparatus for detecting documents with embedded threats in the form of malicious macros and malicious OLE objects.
  • the technology disclosed detects obfuscated malicious code using a trained machine learning model to predict documents having malicious code without a known signature, called signatureless.
  • the technology disclosed can thus predict which documents include signatureless malicious code.
  • Feature engineering is used to define a set of features for detecting malicious macros and malicious OLE objects, based on features selected from a list of known characteristics and attributes possessed by files that have historically indicated malicious content.
  • the characteristics and attributes of macro malware and OLE malware are determined by analysis of obfuscated malware code and are stored in a heuristic database. Features from the database are selected and used to train a supervised machine learning model.
  • an office classifier receives incoming documents over a network, and is configured for parsing and parses those documents, and applies the machine learning algorithm to classify the documents as to threat level — as safe, suspicious, or malicious. Safe documents are allowed into the network. Suspicious documents are subjected to additional processing, including quarantining or sandboxing methods. Malicious documents are rejected or blacklisted from the network.
  • the disclosed technology combines machine learning with other network security methods, serially or in tandem, to further increase the capability of a network security system to detect malicious macros and malicious OLE files.
  • In signature-based detection methods, a malware or virus has (includes) a unique code pattern that can be used to detect and identify the presence of that specific malware or virus.
  • the antivirus software scans file signatures and compares them to a database of known malicious codes. If they match, the file is flagged and treated as a threat.
  • A limitation of signature-based detection is that it can flag only already known malware, that is, malware including a code pattern that is known to be associated with that particular malware. This makes signature-based detection less effective, and in some circumstances completely useless, against new malware or zero-day attacks.
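The limitation described above can be illustrated with a minimal sketch. It assumes the simplest form of signature, a SHA-256 hash of the whole file, rather than the code-pattern signatures that commercial scanners also use; the sample macro text and the function names are hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def is_known_malware(data: bytes, signatures: set) -> bool:
    """Signature lookup: flag the file only if its hash is already known."""
    return sha256_hex(data) in signatures

# Hypothetical known-bad macro and a trivially modified variant of it.
original = b'Sub Document_Open()\nShell "cmd.exe /c calc.exe"\nEnd Sub\n'
variant = original + b"' padding comment\n"  # same behavior, different bytes

# Hypothetical signature database containing only the original sample.
signatures = {sha256_hex(original)}

print(is_known_malware(original, signatures))  # known sample is flagged
print(is_known_malware(variant, signatures))   # variant slips through
```

Because appending a single comment line changes the hash, the variant is not flagged even though its behavior is unchanged, which is the gap a signatureless classifier is meant to close.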
  • the method and apparatus of the present invention is advantageously used at least with the following file formats for MS Office documents: Word 97-2003 (.doc, .dot); Word 2007+ (.docx, .docm, .dotm); Word 2003 XML (.xml); Excel 97-2003 (.xls); Excel 2007+ (.xlsx, .xlsm, .xlsb); and PowerPoint 2007+ (.pptx, .pptm, .ppsm).
  • MS Office documents can contain embedded code such as VBA (Visual Basic for Applications), DDE (Dynamic Data Exchange) and other files (jpg, mpeg, exe/pe files, etc.).
  • VBA Visual Basic for Applications
  • DDE Dynamic Data Exchange
  • the overall objective of the disclosed technology is to scan the embedded content of the MS Office files to detect any malicious code within the files, based on the functionality of the code.
  • Macros are a powerful way to automate common tasks in Microsoft Office and can make users more productive.
  • macro malware uses this functionality to infect users’ endpoint devices.
  • Macro malware is often disguised inside Microsoft Office files and delivered via email attachments or ZIP files, or downloaded from cloud-based sources.
  • Previously, macro malware was common because macros ran automatically whenever a document was opened. In more recent versions of Microsoft Office, macros are disabled by default. Now, malware authors need to lure users into turning on macros so that their malware can run. These files use names that are intended to entice or “scare” users into opening them. Some files are disguised to look like official documents such as invoices, receipts, and legal documents. Other files often show fake warnings when a malicious document is opened to lure a user into accessing the malicious content.
  • Macros are programs that are embedded in MS Office documents. All types of MS Office formats (documents, spreadsheets, presentations, etc.) have an ability to include these macros. Macros written in VBA (Visual Basic for Applications) enable the user to build user-defined functions (UDFs), automating processes and accessing Windows API and other low-level functionality through dynamic-link libraries (DLLs). Malware authors utilize this functionality to carry out malicious activity on a user's computer. Macros are stored in a file folder. VBA components, in turn, are stored in a sub-folder. The VBA components can be considered streams, including a VBA project, directory, and reference to the document.
  • Visual basic can be used to launch embedded JavaScript (JScript) code.
  • JScript JavaScript
  • cscript.exe at the command line and wscript.exe running in the GUI are the main means of implementation of installed active script languages.
  • a Windows script file (.wsf) is an XML file that can contain more than one script in more than one language, in addition to other elements, and is executed by the Windows script host.
  • VBA is a very powerful language, beyond its native syntax, because it can invoke a component object model library, a .net library or any Windows interface. By invoking external program modules, VBA can realize the full capabilities of any windows programming language.
  • the following sample xlsm file and macro demonstrate creating a JScript macro in VBA inside an Office document. This feature can be used to construct executable VBA during runtime, which is difficult to scan for maliciousness. A core snippet would appear as follows:
  • the JScript is specified in a variable.
  • the Script Control is instantiated and set to JScript.
  • VBScript is a language which allows embedding of JScript in VBScript code. This is how JScript code can be constructed and executed by VBScript code. The resulting JScript is outside the capability of any current Microsoft malicious code detection and mitigation mechanisms.
  • VBA API Document Open
  • VBA does not provide the ability to run code stored in a string, in contrast to JavaScript with eval() and VBScript with Execute.
  • obfuscation methods may be classified into four types, which are described below:
  • Each obfuscation type has a different syntactic structure and different uses of functions and operators.
  • Feature extraction from the VBA macro is directed to the four types of obfuscation.
  • Features that characterize obfuscation in these four categories build on the following analysis.
  • the basic purpose of using these obfuscation techniques is to slow down analysis, which in turn delays the countermeasures after detection.
  • Although each obfuscation method is quite simple, when used in combination they render the code visually indecipherable.
  • malware authors use obfuscation tools to create many variants of malware with different hash values, which can serve as a digital footprint for files: a file is processed through a cryptographic algorithm, yielding a unique numerical value for that file.
  • Random Obfuscation makes VBA code unreadable by using nonsense or misleading token names. This random obfuscation can be characterized by features that use Shannon Entropy measures of the VBA code.
  • Split obfuscation is used to piece together strings, such as filename strings or URLs, that are different than they initially appear.
  • Obfuscation Using Built-in Function Replace()/Split() Uses VBA Built-in Functionality Such as Replace/Split in Obfuscating the Data.
  • Encoding obfuscation operates on parameters to produce malicious code that is much different than it initially appears. Examples of functions, by category, that produce encoding obfuscation include:
  • Logic obfuscation refers to using long code sequences or comments to obscure discovery of one or two lines of code with malicious operations.
  • Features that can be generated to characterize logic obfuscation include length of VBA code except comments and length or size of comments in VBA code.
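The features mentioned above for random, split, built-in function, and logic obfuscation can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's feature extractor: the function names and dictionary keys are hypothetical, comments are split naively at the first apostrophe on each line (apostrophes inside string literals and `Rem` comments are ignored), and every `&` is counted as a concatenation.

```python
import math
import re
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy of the text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())

def obfuscation_features(vba_source: str) -> dict:
    """Compute a few simple obfuscation indicators from VBA source text."""
    code_lines, comment_lines = [], []
    for line in vba_source.splitlines():
        # Naive split: everything after the first apostrophe is a comment.
        code, _, comment = line.partition("'")
        code_lines.append(code.strip())
        comment_lines.append(comment.strip())
    code = "\n".join(code_lines)
    return {
        "entropy": shannon_entropy(code),                    # random obfuscation
        "concat_count": code.count("&"),                     # split obfuscation
        "replace_split_count": len(                          # built-in functions
            re.findall(r"\b(?:Replace|Split)\s*\(", code, flags=re.IGNORECASE)
        ),
        "code_length": sum(len(c) for c in code_lines),      # logic obfuscation
        "comment_length": sum(len(c) for c in comment_lines),
    }
```

Randomly generated token names tend to raise the entropy value, while long comments relative to code raise `comment_length`; neither is conclusive alone, so the values are intended as inputs to a classifier rather than as verdicts.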
  • Although MS Office files can be encrypted, requiring a password to be decrypted and opened, VBA macros are not encrypted with the rest of the document. Therefore, malicious macros cannot be obscured by encryption.
  • a VBA project can be protected with a password. However, MS Office has enforced this as a logical protection.
  • the VBA code is not encrypted in the file, and so can be extracted in clear text using tools such as OLETOOLS.
  • An architectural diagram of the system 100 is shown in FIG. 1, which is intentionally simplified to improve clarity in the description.
  • FIG. 1 shows the interconnection of the various major elements. The use of these elements will be described in greater detail further on in connection with the discussion of the structure and use of those elements.
  • FIG. 1 includes the system 100 including the endpoints 142.
  • User endpoints 142 may include devices such as computers 144, smart phones 146, and computer tablets 148, which provide access and interact with data stored on a cloud-based store 136 and cloud-based services 138.
  • An inline proxy 132 is interposed between the user endpoints 142 and the cloud-based services 138 through the network 140 and particularly through a network security system including a network administrator 122, network policies 124, an evaluation engine 126, an office classifier 127, a threat scan subsystem 128, a sandbox 130, and a metadata store 134, which will be described in more detail.
  • the In-line proxy 132 may be accessible through the network 140, or it may be resident as part of the network security system 120.
  • the in-line proxy 132 provides traffic monitoring and control between the user endpoints 142, the cloud-based store 136 and other cloud-based services 138.
  • the in-line proxy 132 monitors the network traffic between user endpoints 142 and cloud-based services 138, particularly to enforce network security policies including data loss prevention (DLP) policies and protocols.
  • DLP data loss prevention
  • the network 140 couples the computers 144, smart phones 146, the computer tablets 148, the metadata store 134, and the in-line proxy 132 with each other.
  • the communication path can be point-to-point over public and/or private networks.
  • the communication can occur over a variety of networks, including private networks, VPN, MPLS circuit, or Internet, and can use appropriate application programming interfaces (APIs) and data interchange formats such as Representational State Transfer (REST), JavaScript Object Notation (JSON), Extensible Markup Language (XML), Simple Object Access Protocol (SOAP), Java Message Service (JMS), and/or Java Platform Module System.
  • REST Representational State Transfer
  • JSON JavaScript Object Notation
  • XML Extensible Markup Language
  • SOAP Simple Object Access Protocol
  • JMS Java Message Service
  • Communications may be encrypted.
  • the communication is generally over a network such as the LAN (local area network), WAN (wide area network), telephone network (Public Switched Telephone Network (PSTN)), Session Initiation Protocol (SIP), wireless network, point-to-point network, star network, token ring network, hub network, Internet, inclusive of the mobile Internet, via protocols such as EDGE, 3G, 4G LTE, Wi-Fi and WiMAX.
  • PSTN Public Switched Telephone Network
  • SIP Session Initiation Protocol
  • the engines or system components of FIG. 1 are implemented by software running on varying types of computing devices, for example a workstation, a server, a computer cluster, a blade server, or a server farm. Additionally, a variety of authorization and authentication techniques, such as username/password, Open Authorization (OAuth), Kerberos, SecureID, digital certificates and more, can be used to secure the communications.
  • OAuth Open Authorization
  • the cloud-based services 138 provide functionality to users that is implemented in the cloud or on the Internet.
  • the cloud-based services 138 can include Internet hosted services such as news web sites, blogs, video streaming web sites, social media web sites, hosted services, cloud applications, cloud stores, cloud collaboration and messaging platforms, and/or cloud customer relationship management (CRM) platforms.
  • Cloud-based services 138 can be accessed using a browser (via a URL) or a native application (a sync client).
  • Categories of cloud-based services 138 include software-as-a-service (SaaS) offerings, platform-as-a-service (PaaS) offerings, and infrastructure-as-a-service (IaaS) offerings. Examples of common web services today include YouTube™, Facebook™, Twitter™, Google™, LinkedIn™, Wikipedia™, Yahoo™, Baidu™, Amazon™, MSN™, Pinterest™, Taobao™, Instagram™, Tumblr™, eBay™, Hotmail™, Reddit™, IMDb™, Netflix™, and others.
  • the cloud-based services 138 provide functionality to the users of the organization that is implementing security policies.
  • the inline proxy 132 intercepts the request message.
  • the inline proxy 132, by accessing a database, seeks to identify the cloud-based service 138 being accessed.
  • the inline proxy accumulates the metadata in request messages from the user to the metadata store 134 to identify cloud based services 138 being accessed.
  • the office classifier 127 is shown in more detail in FIG. 2.
  • the office classifier framework 202 receives office documents in the wild 204, which may or may not include malicious code.
  • the term “in the wild” 204 generally refers to malicious programs already circulating in the public, doing various kinds of damage.
  • the office classifier 127 is an integral part of the network security system 120 shown in FIG. 1 and operates in cooperation with other elements of the network security system 120.
  • the office classifier framework 202 uses supervised training of a machine learning model 208 to predict malicious content in document files.
  • the machine learning model is trained using labeled training data and a machine learning algorithm, which will be described in connection with FIG. 3.
  • the features used for training the machine learning model are discussed further on. The determination of these features is based on an analysis of methods used by malware authors to embed malicious macros and malicious OLE object using obfuscation and other known techniques to create difficulty in locating malicious content.
  • the office classifier 127 can use a heuristic feature generator 206 to generate a list of features derived from properties and attributes of malicious macros and malicious OLE Objects, which can be used to train an advanced boosted-tree based machine learning algorithm 208. The office classifier 127 can then classify the new Office documents as to threat level, based on those features.
  • the heuristic feature generator 206 can correlate the features of malicious macros and malicious OLE objects with a set of keywords, defining a unique set of features extracted out of Office documents, providing a variance to a machine learning algorithm, resulting in a very accurate detection result.
  • the office classifier 127 uses the features extracted out of following embedded artifacts from MS Office Documents. These features are used to train Machine Learning (ML) model 208.
  • ML Machine Learning
  • the office classifier 127 collects those features that help to detect the amount of obfuscation the macro code uses, along with other artifacts such as metadata (document size, pages, paragraphs, etc.), code entropy, etc.
  • malware authors often use these obfuscation types to lengthen the time of analysis, which in turn delays the countermeasures after detection.
  • Although each obfuscation method is quite simple, when used in combination they render the code visually indecipherable.
  • malware authors use obfuscation tools to create many variants of malware with different hash values.
  • Malware authors seem to use such obfuscation technique in many of their malware campaigns, including the recent one from EMOTET. As explained above, the obfuscation is primarily achieved using VBA language features such as string operators, functions, etc.
  • the office classifier 127 collects such unique indicators from the VBA code for successful classification through Machine Learning and a Heuristic engine.
  • the disclosed technology targets detecting both OLE2 and OXML type documents of MS Office covering Word, Excel and PPT.
  • Feature extraction activity is basically extraction and parsing of required items like VBA code, DDE and Embedded items from the Office Documents, also referred to herein as documents, files or document files. No single feature is an absolute marker. Rather, the group of extracted features contributes to the classification.
  • potential false positives 212 and new threats 214 may require an analyst review 216 to improve the level of malware detection, which may require some adjustments within the office classifier framework 202.
  • other threat detection engines 218 may operate in tandem with the machine learning and heuristic engine 208 to improve overall performance of the malware detection, leading to the significantly improved final detection result 210.
  • Document files may include multiple macros and embedded files. For the purpose of feature extraction, all macros are considered and combined as one single entity. Macro code and embedded file information are extracted from documents in all formats, including both CFBF (Compound File Binary Format) and OpenXML. Features from macros and embedded OLE objects are extracted for processing by a machine learning algorithm that detects malicious code in Microsoft Office documents. The list of features associated with the construction of feature vectors is described further on.
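As a rough illustration of handling both container formats, the sketch below distinguishes CFBF files from ZIP-based OpenXML packages by their leading bytes and lists the parts of an OpenXML package that typically hold VBA projects and embedded objects. The function names and return keys are hypothetical, a leading ZIP signature does not by itself prove that a file is a valid OpenXML document, and parsing the streams inside a CFBF file or inside `vbaProject.bin` is out of scope here.

```python
import io
import zipfile

CFBF_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Compound File Binary Format header
ZIP_MAGIC = b"PK\x03\x04"                       # OpenXML packages are ZIP archives

def container_type(data: bytes) -> str:
    """Classify a document by its leading bytes."""
    if data.startswith(CFBF_MAGIC):
        return "cfbf"
    if data.startswith(ZIP_MAGIC):
        return "openxml-zip"
    return "unknown"

def scan_openxml(data: bytes) -> dict:
    """List package parts that look like VBA projects or embedded objects."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
    return {
        # Macro-enabled packages store VBA in a part named vbaProject.bin.
        "macro_parts": [n for n in names if n.lower().endswith("vbaproject.bin")],
        # Embedded OLE objects are commonly stored under an embeddings folder.
        "embedded_parts": [n for n in names if "/embeddings/" in n.lower()],
    }
```

In a fuller extractor, the macro source recovered from every listed part would then be concatenated so that all macros in a document are scored as one entity, as described above.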
  • FIG. 3 illustrates the training of a supervised machine learning model 312.
  • the training uses a suitable machine learning algorithm 310 such as Random Forest, Decision Tree, Linear Regression or the like.
  • the machine learning algorithm could use a convolutional neural network, a CNN, including a deep learning structure such as Inception.
  • training data includes classification labels 316.
  • Training text, documents, images 314 are used to extract features. Ideally this sampling should be large, on the order of a million samples, to be extracted and kept in a .csv file for further machine learning (ML) processing by a data scientist.
  • ML machine learning
  • the sample collection should avoid duplicates and must be leveled.
  • the sample collection ideally contains a combination of labeled malicious files and clean document files, including documents which have been shown to be false positives (FP) inside a network security environment, such as Netskope.
  • FP false positives
  • Once the feature vectors 318 are identified and labeled, they are combined by the machine learning algorithm 310 to create the predictive model 312.
  • New unlabeled data in the form of new text, documents, images, etc. 320 are classified through the selected feature vector 322 and input into the predictive model 312.
  • the predictive model 312 processes the new data 320 and provides the expected label 324 as an end result.
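The train-then-predict workflow of FIG. 3 can be sketched with a deliberately tiny stand-in for the algorithms named above: a one-level decision tree (a decision stump) written in plain Python. This is not the boosted-tree model the disclosure describes; the feature layout, the sample values, and the function names are all made up for illustration.

```python
def train_stump(vectors, labels):
    """Pick the (feature index, threshold) pair with the fewest training errors.

    The resulting model predicts 1 (malicious) when vector[index] > threshold.
    """
    best = None
    for index in range(len(vectors[0])):
        for threshold in sorted({v[index] for v in vectors}):
            predictions = [1 if v[index] > threshold else 0 for v in vectors]
            errors = sum(p != y for p, y in zip(predictions, labels))
            if best is None or errors < best[0]:
                best = (errors, index, threshold)
    _, index, threshold = best
    return index, threshold

def predict(model, vector):
    """Apply a trained stump to one unlabeled feature vector."""
    index, threshold = model
    return 1 if vector[index] > threshold else 0

# Made-up labeled feature vectors: [code entropy, auto-exec count].
training_vectors = [[3.1, 0], [3.4, 0], [5.2, 1], [5.6, 1]]
training_labels = [0, 0, 1, 1]  # 0 = clean, 1 = malicious

model = train_stump(training_vectors, training_labels)
print(predict(model, [5.0, 0]))  # expected label for new, unlabeled data
```

A production model would be trained on a large, balanced sample set and would combine many such splits, but the division into labeled training data, a fitted model, and a prediction step on new feature vectors is the same.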
  • In FIG. 4, a flow diagram of the office classifier 127 is shown, illustrating how potentially malicious files are processed and classified according to the disclosed technology.
  • the office classifier 127 receives an incoming document 400 over a network, and is configured for parsing and parses that document, extracting features, and applies the machine learning algorithm to classify the document as to threat level, as safe 480, suspicious 500, or malicious 490.
  • Safe documents 480 are allowed into the network.
  • Suspicious documents 500 are subjected to additional processing, including quarantining or sandboxing methods 510.
  • Malicious documents 490 raise an alert and are ultimately rejected or blacklisted from the network.
  • Detected malicious files 490 may be quarantined for further analysis.
  • the document may undergo in-depth threat scanning by the security administrator, which may also include isolation in the sandbox 130, where any executable embedded code is run in an isolated environment to determine if any embedded links cause malicious activity.
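The three-way classification and the handling of each class can be sketched as a mapping from a model score to a threat level and then to an action. The score range, both thresholds, and the action names are assumptions chosen for illustration; the disclosure does not specify them.

```python
from enum import Enum

class ThreatLevel(Enum):
    SAFE = "safe"
    SUSPICIOUS = "suspicious"
    MALICIOUS = "malicious"

# Hypothetical thresholds on a model score in the range 0.0 to 1.0.
SUSPICIOUS_THRESHOLD = 0.3
MALICIOUS_THRESHOLD = 0.8

# Hypothetical action names for each threat level.
ACTIONS = {
    ThreatLevel.SAFE: "allow",
    ThreatLevel.SUSPICIOUS: "sandbox",
    ThreatLevel.MALICIOUS: "block",
}

def classify(score: float) -> ThreatLevel:
    """Map a maliciousness score to one of the three threat levels."""
    if score >= MALICIOUS_THRESHOLD:
        return ThreatLevel.MALICIOUS
    if score >= SUSPICIOUS_THRESHOLD:
        return ThreatLevel.SUSPICIOUS
    return ThreatLevel.SAFE

def disposition(score: float) -> str:
    """Return the action to take for a document with the given score."""
    return ACTIONS[classify(score)]
```

Keeping a middle band between the two thresholds is what allows uncertain documents to be sent for sandboxing instead of being either admitted or rejected outright.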
  • the disclosed technology combines machine learning with other network security methods 410, 420 to further increase the capability of a network security system to detect malicious macros and malicious OLE documents.
  • the technology disclosed uses a machine learning algorithm.
  • the technology is configured for deriving, and a list of features can be derived from prior malware attacks, with or without heuristics to assist in derivation.
  • the features are used for training the machine learning algorithm.
  • the technology disclosed also detects malicious embedded OLE objects inside Office Documents. These extracted features from leveled samples are used to train a supervised model using a boosted-tree algorithm. Features combined from different categories provide a very good variance to a machine learning algorithm. Heuristics can be used to help derive a set of feature vectors for training the machine learning algorithm.
  • a combination of these two approaches - machine learning for zero-day and repeated malicious patterns and heuristics for detection of recognized malicious patterns - provides superior results in detecting malicious document files in the form of embedded macros and OLE objects.
  • the technology disclosed uses feature engineering from the heuristic engine to train a machine learning algorithm.
  • FIG. 5 is a flowchart illustrating the steps (actions) in detecting malicious embedded macros.
  • In step (action) 500, a document file is received into a network security system.
  • the network security system is configured for parsing, and the document file is parsed to separate metadata from malicious payload data.
  • In step (action) 510, a heuristic engine within the office classifier 127 uses data indicative of past instances of malware embedded in macros using known obfuscation methods.
  • a feature set can be derived from the data provided by heuristics. The feature set is used, in part, for training a machine learning algorithm model using machine learning methods to predict the likelihood that a document file includes a malicious macro code.
  • In step (action) 530, the trained machine learning model is used to predict the likelihood that an input document may contain a malicious macro.
  • In step (action) 540, heuristic rules derived from instances of malicious macros are applied to increase the success rate of detecting malicious macros in the document file.
  • In step (action) 550, the office classifier 127 classifies a resulting document file as safe, suspicious, or malicious.
  • In step (action) 560, safe documents are accepted into the network system, malicious documents are blocked, and suspicious documents are isolated for further threat analysis including sandboxing.
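One way to picture heuristic rules being applied after the model's prediction is as rules that can raise, but never lower, the model's score before the final classification. The sketch below assumes that design; the rule contents, the score floors, and the function name are hypothetical, and only the feature names are taken from the feature list in this document.

```python
def apply_heuristics(score: float, features: dict, rules) -> float:
    """Raise the model score to the floor of every heuristic rule that matches."""
    for predicate, floor in rules:
        if predicate(features):
            score = max(score, floor)
    return score

# Hypothetical rules: each pairs a predicate over extracted features
# with the minimum score a matching document should receive.
RULES = [
    (lambda f: f.get("MACRO_AUTOEXEC") and f.get("MACRO_EXECUTE_POWERSHELL"), 0.9),
    (lambda f: f.get("MACRO_HAS_HEX_STR") and f.get("MACRO_WRITE"), 0.5),
]

features = {"MACRO_AUTOEXEC": 1, "MACRO_EXECUTE_POWERSHELL": 1}
print(apply_heuristics(0.2, features, RULES))  # low model score raised by a rule
```

Under this design a recognized malicious pattern is caught even when the model scores it low, while documents that match no rule keep the model's own score.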
  • FIG. 6 is a flowchart illustrating the steps (actions) in detecting malicious embedded OLE Objects.
  • In step (action) 600, a document file is received into a network security system.
  • the network security system is configured for parsing, and the document file is parsed to separate metadata from malicious payload data.
  • a heuristic engine within the office classifier 127 uses data indicative of past instances of malicious embedded OLE Objects using known obfuscation methods.
  • the heuristic engine is configured for deriving a feature set and a feature set can be derived from the data provided by heuristics. The data is used, in part, for training a machine learning algorithm model using machine learning methods to predict the likelihood that a document file includes a malicious embedded OLE Object code.
  • the trained machine learning model is used to predict the likelihood that an input document may contain a malicious embedded OLE Object.
  • In step (action) 640, heuristic rules derived from instances of malicious embedded OLE Objects are applied to increase the success rate of detecting malicious embedded OLE Objects in the document file.
  • In step (action) 650, the office classifier 127 classifies a resulting document file as safe, suspicious, or malicious.
  • In step (action) 560, safe documents are accepted into the network system, malicious documents are blocked, and suspicious documents are isolated for further threat analysis including sandboxing.
  • VBA code profile like count of code line, comment, variables, functions, loop, event, hex string, entropy, etc.
  • DDE Comprising features from Dynamic Data Exchange Code/Strings if available.
  • DDE has usage of trusted windows utilities like cmd.exe, powershell, wmi, wscript, and cscript utility.
  • Embed objects having suspicious files like exe, dll, 7z, dmg, deb, rar, etc.
  • Embed objects having external hyperlink to a URL are described in detail below.
  • MACRO_IS_PRESENT - This feature indicates whether a macro is present in the document being scanned. This can either be a count or a Boolean.
  • MACRO_AUTOEXEC This feature indicates whether a macro is automatically executed upon opening of the document being scanned. This can be a Boolean. Functions similar to AutoExec, that trigger macros based on events or states of operation, include Document Open and Document Close. This group of functions can be collectively counted. A single instance of event triggered macro execution can be suggestive of malicious coding. Multiple instances of event triggered macro execution can be suggestive of legitimate code. ML is good at making such distinctions, for this feature and others.
  • MACRO_EXECUTE This feature indicates whether the macro launches code, including external code. This can either be a count or a Boolean.
  • MACRO_EXECUTE_POWERSHELL This feature indicates whether the macro causes launching of Windows PowerShell, which executes scripts. This can either be a count or a Boolean.
  • MACRO_WRITE This feature indicates whether the macro sends data to the disk of the same computer or to a network location. Many programs legitimately write log files using this function. One consideration that can go into this feature is whether the file written is a binary file or a textual log file.
  • MACRO_HAS_REGISTRY_ACCESS Writing or editing the registry can be indicative of malicious intent.
  • MACRO_HAS_HEX_STR Use of Hex encoding is unusual in legitimate VBA macro code.
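As an illustration of how several of the features above might be computed from already-extracted VBA source text, the following sketch counts keyword matches with regular expressions. The keyword lists, the hex-literal length cut-off, and the function name `macro_keyword_features` are assumptions made for this example; the source does not enumerate the exact patterns used.

```python
import re
from typing import Dict

# Assumed keyword patterns for illustration only; not taken from the source.
AUTOEXEC_RE = re.compile(
    r"\b(?:AutoExec|AutoOpen|Auto_Open|Document_Open|Document_Close|Workbook_Open)\b",
    re.I)
EXECUTE_RE = re.compile(r"\b(?:Shell|ShellExecute|CreateObject)\b", re.I)
POWERSHELL_RE = re.compile(r"powershell", re.I)
REGISTRY_RE = re.compile(r"\b(?:RegWrite|RegRead|RegDelete)\b|HKEY_|HKLM\\|HKCU\\", re.I)
# A quoted literal made of at least 8 hex byte pairs (assumed cut-off).
HEX_STR_RE = re.compile(r'"(?:[0-9A-Fa-f]{2}){8,}"')


def _count(pattern: "re.Pattern[str]", text: str) -> int:
    """Number of non-overlapping matches of pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


def macro_keyword_features(vba_source: str) -> Dict[str, int]:
    """Derive count-style features from VBA macro source code."""
    return {
        "MACRO_IS_PRESENT": int(bool(vba_source.strip())),
        "MACRO_AUTOEXEC": _count(AUTOEXEC_RE, vba_source),
        "MACRO_EXECUTE": _count(EXECUTE_RE, vba_source),
        "MACRO_EXECUTE_POWERSHELL": _count(POWERSHELL_RE, vba_source),
        "MACRO_HAS_REGISTRY_ACCESS": _count(REGISTRY_RE, vba_source),
        "MACRO_HAS_HEX_STR": _count(HEX_STR_RE, vba_source),
    }


sample = '''
Private Sub Document_Open()
    Dim cmd As String
    cmd = "powershell -enc " & "4D5A90000300000004000000"
    Shell cmd, vbHide
End Sub
'''
features = macro_keyword_features(sample)
```

Counts are returned rather than Booleans because the text notes that a single event-triggered macro and multiple event-triggered macros can point in different directions; a Boolean can always be derived from a count afterwards.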
  • Several synthetic features are constructed from other extracted features. Three examples are given here, which are based on engineering judgment.
  • MACRO_OLESTREAM_COUNT Malicious macros have very few OLE streams. In contrast, numerous OLE streams often appear in legitimate documents that are repeatedly updated.
  • MACRO_OLE_PASSCODE - # is OLE/VBA pass-coded project stream Project Protection State.
  • MACRO_DETECT_SANDBOX This feature indicates whether the macro attempts to detect that a sandbox is running, such as detecting Anubis, Sandboxie, Norman, CW, Winjail or any other type of sandbox. Non-malicious applications have no reason to detect whether they're running in a sandbox. Macros are thoroughly parsed and therefore do not need to run in a sandbox.
  • MACRO_DETECT_VIRTUALIZATION - Detecting virtualization or debug mode is also more common. Virtualization is a more general feature than looking for a specific sandbox.
  • MACRO_RUN_SHELLCODEINMEMORY - VBA macros can run shellcode in memory. At present, this is not widely exploited, but it could be exploited.
  • MACRO_SELF_MODIFICATION - macro may attempt to modify the VBA code (self-modification), including constructing parameters of code executable in the language other than VBA. Split and encoding obfuscation are prominent means of self modification.
  • MACRO_NUM_TEXTFUNC is a count of text functions including: Asc(), Chr(), Mid(), Join(), InStr(), Replace(), Right(), StrConv(), etc.
  • MACRO_NUM_ARITHFUNC is a count of arithmetic functions including: Abs(), Atn(), Cos(), Exp(), Log(), Randomize(), Round(), Tan(), Sqr(), etc.
  • MACRO_NUM_TYPECONVFUNC is a count of type conversion functions, including: CBool(), CByte(), CChar(), CStr(), CDec(), CUInt(), CShort(), etc.
  • MACRO_NUM_FINCFUNC is a count of financial functions, including: DDB(), FV(), IPmt(), PV(),Pmt(), Rate(), SLN(), SYD(), etc.
  • MACRO_SHANNON_ENTROPY is a Shannon Entropy score for the VBA macro code.
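The count-based and entropy-based features above lend themselves to a short sketch. The function lists below are taken from the examples in the text (which end in "etc.", so they are incomplete); the helper names `shannon_entropy` and `count_calls`, and the choice to count only names followed by an opening parenthesis, are assumptions for illustration. Computing entropy over characters rather than bytes or tokens is likewise an assumption.

```python
import math
import re
from collections import Counter
from typing import Sequence

# Partial lists copied from the examples in the text.
TEXT_FUNCS = ("Asc", "Chr", "Mid", "Join", "InStr", "Replace", "Right", "StrConv")
ARITH_FUNCS = ("Abs", "Atn", "Cos", "Exp", "Log", "Randomize", "Round", "Tan", "Sqr")


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character: H = -sum(p_i * log2(p_i))."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def count_calls(source: str, names: Sequence[str]) -> int:
    """Count occurrences of the named functions followed by '(' (case-insensitive).

    An optional '$' suffix is allowed so that string variants such as Chr$( match.
    """
    pattern = re.compile(r"\b(?:%s)\$?\s*\(" % "|".join(names), re.I)
    return sum(1 for _ in pattern.finditer(source))


code = 'x = Chr(65) & Chr$(66) & Mid(s, 1, 2)'
macro_num_textfunc = count_calls(code, TEXT_FUNCS)
macro_shannon_entropy = shannon_entropy(code)
```

Heavily encoded or packed macro bodies tend to produce higher entropy than ordinary hand-written code, which is the usual motivation for an entropy feature of this kind.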
  • MS Office doc files will be scanned for any embedded files in it.
  • the following OLE streams (\x01Ole10Native, \x01CompObj and ObjectPool) will be decoded to collect the embedded file information.
  • EMBED_HAS_SUSPICIOUS_BIN - The file is from any of the files / extn listed in suspicious bins.
  • EMBED_HAS_NORMAL_FILE - The file is from any of the files / extn listed in normal files
  • EMBED_HAS_OTHER_FILE - The file is not from any of the above category and not compressed files.
  • EMBED_HAS_COMPRESSED_FILE - The file is from any of the files / extn listed in compressed files.
  • EMBED RULE 1 This is EMBED RULE1.
  • Unknown extensions (E) file name not in set A, B, C, D, or F. This means it has an extension, but it is not one defined in the other sets.
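A minimal sketch of how the EMBED_HAS_* flags might be set from the names of embedded files follows. The extension sets are assumptions: only a few example extensions (exe, dll, dmg, deb, 7z, rar) appear in the text, and the full contents of the suspicious, normal, and compressed sets are not given in this section. The function name `embedded_file_features` is hypothetical.

```python
import os
from typing import Dict, Iterable

# Assumed extension sets; the source only gives a few example extensions.
SUSPICIOUS_BINS = {"exe", "dll", "dmg", "deb"}
COMPRESSED_FILES = {"7z", "rar", "zip"}
NORMAL_FILES = {"txt", "png", "jpg", "pdf"}


def embedded_file_features(names: Iterable[str]) -> Dict[str, int]:
    """Set one Boolean flag per category for the embedded file names given."""
    feats = {
        "EMBED_HAS_SUSPICIOUS_BIN": 0,
        "EMBED_HAS_NORMAL_FILE": 0,
        "EMBED_HAS_COMPRESSED_FILE": 0,
        "EMBED_HAS_OTHER_FILE": 0,
    }
    for name in names:
        # Lower-cased extension without the leading dot; "" if there is none.
        ext = os.path.splitext(name)[1].lower().lstrip(".")
        if ext in SUSPICIOUS_BINS:
            feats["EMBED_HAS_SUSPICIOUS_BIN"] = 1
        elif ext in COMPRESSED_FILES:
            feats["EMBED_HAS_COMPRESSED_FILE"] = 1
        elif ext in NORMAL_FILES:
            feats["EMBED_HAS_NORMAL_FILE"] = 1
        else:
            # Unknown or missing extension falls into the "other" category here.
            feats["EMBED_HAS_OTHER_FILE"] = 1
    return feats
```

For example, `embedded_file_features(["invoice.EXE", "notes.txt"])` sets both the suspicious-binary and normal-file flags, because extension matching is case-insensitive in this sketch.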
  • Hash-Based detection The disclosed technology is more generic and proactive. Hash-based detection covers only one sample.
  • Antivirus Signatures The disclosed technology is more generic and proactive. This is not a simple/string pattern match.
  • Fuzzy Hash The disclosed technology is much more generic. Fuzzy Hash is also just a fuzzy byte/ string pattern match.
  • Sandbox-based detection The disclosed technology uses static detection, and it is much less expensive. Also, we would be able to achieve a much lower false-positive (FP) rate.
  • FP false-positive
  • Computer system 700 includes at least one central processing unit (CPU) 704 that communicates with a number of peripheral devices via bus subsystem 726, and network security system 120 for providing network security services described herein.
  • peripheral devices can include a storage subsystem 708 including, for example, memory devices 722, 724 and a file storage subsystem 712, user interface input devices 714, user interface output devices 716, and a network interface subsystem 718.
  • the input and output devices allow user interaction with computer system 700.
  • Network interface subsystem 718 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
  • network security system 120 of FIG. 1 is communicably linked to the storage subsystem 708 and the user interface input devices 714.
  • User interface input devices 714 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
  • input device is intended to include all possible types of devices and ways to input information into computer system 700.
  • User interface output devices 716 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
  • the display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
  • the display subsystem can also provide a non-visual display such as audio output devices.
  • output device is intended to include all possible types of devices and ways to output information from computer system 700 to the user or to another machine or computer system.
  • Storage subsystem 708 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein.
  • Additional subsystems 720 can be graphics processing units (GPUs) or field-programmable gate arrays (FPGAs).
  • Memory subsystem 710 used in the storage subsystem 708 can include a number of memories including a main random access memory (RAM) 722 for storage of instructions and data during program execution and a read only memory (ROM) 724 in which fixed instructions are stored.
  • the file storage subsystem 712 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
  • the modules implementing the functionality of certain implementations can be stored by file storage subsystem 712 in the storage subsystem 708, or in other machines accessible by the processor 704.
  • Bus subsystem 726 provides a mechanism for letting the various components and subsystems of computer system 700 communicate with each other as intended. Although bus subsystem 726 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
  • Computer system 700 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 700 depicted in FIG. 7 is intended only as a specific example for purposes of illustrating the preferred embodiments of the present invention. Many other configurations of computer system 700 are possible having more components or fewer components than the computer system 700 depicted in FIG. 7.

Particular Implementations
  • the technology disclosed can be practiced as a system, method, device, product, computer readable media, or article of manufacture.
  • One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable.
  • One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections. These recitations are hereby incorporated forward by reference into each of the following implementations.
  • the technology disclosed relates to cybersecurity attacks and cloud-based security.
  • the technology disclosed is a method and apparatus for detecting documents with embedded threats in the form of malicious content, also referred to herein as malicious code and/or malware, such malicious content (malware) including malicious macros and/or malicious OLE objects.
  • Macros are stored within macro data, and are also referred to herein as macro data.
  • OLE objects are stored within OLE object data, and are also referred to herein as OLE object data, or as embedded OLE object data.
  • the technology disclosed is a method for classifying input documents in a networked system to determine if at least one of said documents may include a macro and/or may include an embedded OLE object that includes malicious code (malware).
  • the method includes repeatedly receiving a document file into a network security system. Each document file is parsed to separate the macro data and/or the embedded OLE data from the document payload data.
  • the method is configured for generating, and next generates for the document file at least obfuscation scoring features indicative of past instances of malware embedded using known obfuscation methods.
  • the process next inputs the obfuscation scoring features into a trained machine learning model and applies the trained machine learning model to process the document file to determine the likelihood that the document file contains malicious content (malware), in the form of malicious macro data and/or malicious OLE data.
  • the resultant document is classified as safe, suspicious, or malicious.
  • the safe document file is accepted into the network system.
  • a malicious document is blocked as malicious.
  • a suspicious document is isolated to undergo threat analysis.
  • the obfuscation scoring features include one or more, and in at least some embodiments, at least five of the following features describing embedded VBA macro characteristics: macro_is_present; macro_autoexec; macro_execute; macro_execute_powershell; macro_write; macro_has_internet_download; macro_has_registry_access; macro_comment_lines; macro_code_lines; macro_has_hex_str; macro_olestream_count; macro_ole_passcode; macro_detect_sandbox; macro_detect_virtualization; macro_run_shellcodeinmemory; macro_disable_security; macro_self_modification; macro_num_stringops; macro_num_textfunc; macro_num_arithfunc; macro_num_typeconvfunc; macro_num_fincfunc; and macro_shannon_entropy.
  • the document file is an MS Office document.
  • an MS Office document may be a Word document, being a word-processing document, an Excel document, being a spreadsheet document, or a PowerPoint document, being a graphical drawing document and which is also referred to herein as a presentation document.
  • the disclosed technology, in the obfuscation scoring step (action), scores macro-related and/or OLE object related features including one or more, and in some embodiments, at least two of the VBA macro features: createobject; shell; filesystem; urldownloadtofile; callbyname; and detect sandbox.
  • OLE Object-related features including at least two of the following VBA OLE Object features: createobject; shell; filesystem; urldownloadtofile; callbyname; and detect sandbox.
  • the step (action) of inputting to the trained machine learning model one or more, and in some embodiments, at least two features derived from the following document features: document size; author information; type of document (Word/Excel/PPT); creation or modification time and revision numbers; number of pages; number of paragraphs; number of lines; and number of characters.
  • the disclosed technology, in another aspect of the method, includes a secondary malware detection engine operating in tandem to increase the accuracy of the malware detection and eliminate false positives.
  • the step (action) of inputting to the trained machine learning model one or more features, and in some embodiments, at least two features derived from the following static document features: doc_num_pages; doc_num_words; doc_num_lines; doc_num_chars; doc_num_paragraph; doc_lastmod_time; doc_author_info; doc_revision_number; doc_lastprint_time; doc_link_is_dirty; doc_language; doc_size
  • doc_num_pages is also referred to herein as a number of pages
  • doc_num_words is also referred to herein as a number of words
  • doc_num_lines is also referred to herein as a number of lines
  • doc_num_chars is also referred to herein as a number of characters
  • doc_num_paragraph is also referred to herein as a number of paragraphs
  • doc_lastmod_time is also referred to herein as a last modification time or as a modification time
  • doc_author_info is also referred to herein as author information
  • doc_revision_number is also referred to herein as document revision number or as a revision number
  • doc_lastprint_time is also referred to herein as a document last printing time or as a last printing time
  • doc_link_is_dirty is also referred to herein as a document link that is dirty
  • doc_language is also referred to herein as a document language or as a language of a document
  • doc_size is also referred to
  • the document file may be isolated in a sandbox testing environment for testing one or more macros and/or one or more embedded OLE objects, in the suspicious document file.
  • the present invention has the capability of detecting malicious macros and/or embedded OLE objects, that do not include known malicious data signatures.
  • the machine learning model is a supervised machine learning model trained by machine learning algorithms through feature engineering. A selected set of features are derived, from deriving this selected set of features from a large sampling of document files.
  • Some sampled document files include one or more malicious macros and/or embedded OLE objects, and some sampled document files include at least one non-malicious macro and/or at least one non-malicious embedded OLE object.
  • a sampling of the document files includes MS Office files.
  • a network analyst reviews the step (action) of classifying to increase the accuracy of threat analysis.
  • the disclosed technology is a system for detecting document files containing malicious macros and/or malicious OLE objects.
  • the system includes a heuristic engine, which stores unique code patterns and data attributes, also referred to herein as data that is indicative of past (known) malicious code, malicious content, malicious software and/or malware, found in past-analyzed malicious macros and/or OLE objects.
  • a feature set is derived of malware attributes which is used to train a machine learning model for detecting document files including macros and/or OLE objects that include those malware attributes (features).
  • the heuristic engine is configured for deriving, and derives features indicative of macro and/or embedded OLE object malware to train a machine learning engine to create a data model, using the malware attributes of past-analyzed malicious macros and/or OLE objects.
  • the heuristic engine stores indicators derived from malicious macros and/or OLE objects, based on the code and behavior of the past malicious macros and/or OLE objects.
  • the present technology uses one or more, and in some embodiments, at least five features from a list of known attributes for training the machine learning model. The list of these one or more features is extracted from the listing of attributes which indicate macro malware, especially obfuscated malware.
  • the system detects document files containing malicious macros and/or embedded OLE objects, and includes a heuristic engine that stores data and attributes from past-analyzed malicious macros and/or OLE objects, and a machine learning engine including a trained malicious macro and/or OLE object detection model.
  • the model is trained, using the supervised machine learning method with labeled data.
  • the training data includes documents, files and other data which are labeled as malicious or not malicious.
  • the machine learning engine includes a supervised machine learning model trained by features derived from characteristics of malicious macros and/or malicious OLE objects and non-malicious macros and/or non-malicious OLE objects.
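Supervised training on labeled feature vectors, as described above, can be illustrated with a deliberately small example. The source does not name a specific learning algorithm in this section, so plain logistic regression trained by stochastic gradient descent is swapped in here purely for illustration; the three-feature layout, the toy labeled samples, and the hyperparameters are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated likelihood that feature vector x belongs to the malicious class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(samples: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   epochs: int = 500,
                   lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit logistic regression by per-sample gradient descent on log loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict(w, b, x) - y          # gradient of log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Toy labeled data (assumed). Columns: MACRO_AUTOEXEC, MACRO_HAS_HEX_STR,
# MACRO_EXECUTE_POWERSHELL. Label 1 = malicious, 0 = not malicious.
X = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 0, 1, 0]

weights, bias = train_logistic(X, y)
```

After training, `predict(weights, bias, features)` yields a likelihood that can be thresholded into the safe, suspicious, and malicious classes; a production system would use far more samples and features than this toy set.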
  • the aforementioned malicious OLE objects and/or malicious macros are also collectively referred to herein as malicious code, malicious content, malicious software and/or malware.
  • the system disclosed including the heuristic engine operates in tandem with the trained machine learning model.
  • the likelihood of detecting documents that contain malicious code is increased. It becomes more likely that the office classifier will more accurately classify the input documents as to threat level, increasing the likelihood of detecting macros and/or OLE objects in documents that contain malicious code.
  • a system for classifying input documents in the network system to determine if at least one of said documents may include a macro and/or OLE object having malicious code.
  • the system includes a network, and a network interface in operable communication to the network.
  • the disclosed technology includes an office classifier in operable communication with the network security system.
  • the office classifier has an input means for receiving and processing document files, particularly document files which are MS Office document files.
  • the disclosed system further includes a heuristic feature generation engine.
  • the heuristic feature generation engine uses a list of malicious macro code attributes and/or malicious OLE object code attributes, selected for predicting the presence of malicious macros and/or malicious OLE objects.
  • the heuristic engine is configured for deriving, and derives a feature list which is used to train a supervised machine learning model to predict the presence of malicious macros and/or malicious OLE objects.
  • the office classifier applies the machine learning model to each input document to determine the probability that the input document includes malicious macro and/or malicious OLE object code. Based on this analysis, each input document is classified as safe, malicious, or suspicious. Document files classified as safe are admitted into the network; document files classified as malicious are permanently blocked; and document files classified as suspicious undergo threat analysis. The threat analysis of suspicious files may include quarantining and transferring into a virtual environment such as a sandbox, where the malicious code may be safely analyzed.
  • In one aspect, the technology disclosed detects obfuscated malicious code using a trained machine learning model to predict documents having embedded malicious code without a known signature. In another aspect, the technology disclosed can be combined with signatureless-based analysis of malicious macros and/or malicious OLE objects.
  • a method for classifying input documents into a network system to determine if at least one of said documents includes a macro and/or an OLE object having malicious code.
  • a method for classifying input documents into a network system to determine if at least one of said documents includes a malicious macro and/or a malicious OLE object. Macros and/or OLE objects are not usually observable by the user, and they make an attractive vehicle for infecting documents, including MS Office documents.
  • the disclosed technology may be used for classifying input documents in a network system, including the step (action) of receiving an office document into the network security system of an attached enterprise network. The document file is parsed in order to separate the metadata from the malicious payload data so that it may be analyzed.
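One small part of the parsing action can be sketched with the Python standard library. Macro-enabled Office Open XML files (such as .docm or .xlsm) are zip containers that keep their VBA project in a part named `vbaProject.bin`, so locating that part is one way to separate macro data from the rest of the document. The function name `find_vba_parts` is hypothetical, and this sketch does not cover legacy binary formats such as .doc, which are OLE compound files and would need a dedicated OLE parser.

```python
import io
import zipfile
from typing import List


def find_vba_parts(data: bytes) -> List[str]:
    """Return the names of vbaProject.bin parts inside an OOXML (zip) container.

    Returns an empty list when the bytes are not a zip archive at all.
    """
    stream = io.BytesIO(data)
    if not zipfile.is_zipfile(stream):
        return []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [name for name in zf.namelist()
                if name.lower().endswith("vbaproject.bin")]


def macro_is_present(data: bytes) -> int:
    """Boolean-style MACRO_IS_PRESENT feature for OOXML input."""
    return int(bool(find_vba_parts(data)))
```

The bytes of each part found this way could then be handed to a VBA decoder so that the macro source can be scored, while the remaining parts are treated as document payload.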
  • Feature engineering is used to define a set of features for detecting malicious macros and/or malicious OLE objects, based on features selected from a list of known characteristics and attributes possessed by files that have historically indicated malicious content.
  • the selected features are used to train a supervised machine learning model, a model based on labeled data.
  • an office classifier receives incoming documents over a network, deconstructs those documents, and applies the machine learning algorithm to classify the documents as to threat level, as safe, suspicious, or malicious. Safe documents are allowed into the network. Suspicious documents are subjected to additional processing, including quarantining or sandboxing methods. Malicious documents are blocked from the network.
  • the disclosed technology combines machine learning with other network security methods to further increase the capability of a network security system to detect malicious macros and/or malicious OLE files.
  • a method for classifying input documents in a networked system to determine if at least one of said documents may include a macro having malicious code, comprising the actions of: repeatedly receiving a document file into a network security system; parsing the document file to separate macro data from document payload data; generating for the document file at least obfuscation features indicative of past instances of malware embedded in macros using known obfuscation methods; inputting the obfuscation features to a trained machine learning model and applying the trained machine learning model to process the document file to predict the presence of a malicious macro; using a secondary malware detection engine operating in tandem to increase the accuracy of the malware detection and eliminate false positives; classifying a resulting document as safe, suspicious, or malicious; and based on the action of classifying, accepting a safe document into the networked system, blocking a malicious document as malicious, and isolating a suspicious document for threat analysis.
  • the obfuscation features include at least five of the following features describing embedded VBA macro characteristics: macro_is_present; macro_autoexec; macro_execute; macro_execute_powershell; macro_write; macro_has_internet_download; macro_has_registry_access; macro_comment_lines; macro_code_lines; macro_has_hex_str; macro_olestream_count; macro_ole_passcode; macro_detect_sandbox; macro_detect_virtualization; macro_run_shellcodeinmemory; macro_disable_security; macro_self_modification; macro_num_stringops; macro_num_textfunc; macro_num_arithfunc; macro_num_typeconvfunc; macro_num_fincfunc; and macro_shannon_entropy.
  • obfuscation features are macro-related features including at least two of the following VBA macro features: createobject; shell; filesystem; urldownloadtofile; callbyname; and detect sandbox.
  • the machine learning model is a supervised machine learning model trained by machine learning through feature engineering, and wherein selected features are derived from a large sampling of document files, wherein some sampled document files include one or more malicious macros and some sampled document files include at least one non-malicious macro.
  • a system for detecting document files including malicious macros comprising: a heuristic engine, wherein data is stored that is indicative of known malicious macros, and a machine learning engine including a supervised machine learning model trained by features derived from characteristics of malicious macros and non-malicious macros.
  • a system for classifying input documents in a networked system to determine if at least one of said documents may include a macro having malicious code, comprising: a network; a network interface coupled to the network; a network security system in operable communication with the network; an office classifier in operable communication with the network security system; the office classifier comprising, an input means for receiving and processing an MS Office document; a heuristic feature generation engine; a supervised machine learning model trained by machine learning methods with features selected for predicting the presence of malicious macros.
  • a method for classifying input documents in a networked system to determine if at least one of said documents may include an Object Linking & Embedding (OLE) object having malicious code, comprising the actions of: receiving a document file into a network security system; parsing the document file to separate metadata from malicious payload data in input documents; using a heuristic engine to provide data indicative of past instances of malware that have been embedded in OLE objects using known obfuscation methods; deriving a features set from the data provided by the heuristic engine for training a machine learning algorithm model by using deep learning (DL) methods to predict whether the document file includes malicious OLE objects; using the trained machine learning model, processing the document file to determine likelihood that the document file contains a malicious OLE object; classifying a resulting document as safe, suspicious, or malicious; and based on the action of classifying, accepting a safe document into the networked system, blocking a malicious document as malicious, or isolating a suspicious document for threat analysis.
  • OLE Object
  • OLE Object-related features including at least one of the following Visual Basic For Applications (VBA) OLE Object features: createobject; shell; filesystem; urldownloadtofile; callbyname; and detect sandbox.
  • VBA Visual Basic For Applications
  • the machine learning model is a supervised machine learning model trained by machine learning through feature engineering, and wherein the features set is derived from a large sampling of document files, including some document files having one or more malicious OLE Objects and some document files having one or more non-malicious OLE Objects.
  • a network security system including: a network interface coupled to a network; and a plurality of system components that are configured for the actions of: receiving a document file including an Object Linking and Embedding (OLE) object, via said network interface; parsing said document file to separate embedded data included within said OLE object from other payload data included within said document file; processing said document file to determine a likelihood of whether said document file includes malicious code being obfuscated within said OLE object; and wherein said processing includes extracting from said document file, obfuscation scoring features that are indicative of instances of malicious code being obfuscated within said OLE object; and wherein said processing further includes inputting said obfuscation scoring features into a trained machine learning model to determine said likelihood of whether said document file includes malicious code being obfuscated within said OLE object.
  • OLE Object Linking and Embedding
  • a system for classifying input documents in a networked system to determine if at least one of said documents may include an Object Linking & Embedding (OLE) Object having malicious code, comprising: a network; a network interface coupled to the network; a network security system in operable communication with the network; and an office classifier in operable communication with the network security system; the office classifier comprising, an input means for receiving and processing word-processing documents, spreadsheet documents, and presentation documents; a heuristic feature generation engine; and a supervised machine learning model trained by deep learning methods using features selected for detecting input documents having obfuscated malicious OLE Objects.
  • OLE Object Linking & Embedding
  • a system for providing detection of a presence of malicious code embedded within a document file including: a plurality of system components that are configured for performing actions of: receiving a document file including one or more embedded items; parsing said document file to extract at least one of said embedded items; processing said document file to determine a likelihood of whether said at least one of said embedded items includes malicious code; and wherein said processing includes extracting from said at least one of said embedded items, obfuscation scoring features that are indicative of known instances of malicious code being obfuscated within said at least one of said embedded items; and wherein said processing further includes inputting said obfuscation scoring features into a trained machine learning model to determine said likelihood of whether said at least one of said embedded items, within said document file, includes malicious code.
  • said embedded items include at least one of one or more macros and/or one or more OLE objects.
  • said macros include VBA macros.
  • obfuscation scoring features include macro related features, said features including at least one use of CreateObject, Shell, FileSystem, URLDownloadToFile, CallByName, or Detect Sandbox.
  • said obfuscation scoring features include object related features, said object related features including at least one use of the following Visual Basic For Applications (VBA) Object Linking and Embedding (OLE) features, said features including at least one use of Shell, FileSystem, URLDownloadToFile, CallByName, or Detect Sandbox.
  • VBA Visual Basic For Applications
  • OLE Object Linking and Embedding
  • a method for providing detection of a presence of malicious code embedded within a document file including the actions of: receiving a document file including one or more embedded items; parsing said document file to extract at least one of said embedded items; processing said document file to determine a likelihood of whether said at least one of said embedded items includes malicious code; and wherein said processing includes extracting from said at least one of said embedded items, obfuscation scoring features that are indicative of known instances of malicious code being obfuscated within said at least one of said embedded items; and wherein said processing further includes inputting said obfuscation scoring features into a trained machine learning model to determine said likelihood of whether said at least one of said embedded items, within said document file, includes malicious code.
  • said document file is one of a word processing document, a spreadsheet document or a presentation document.
  • said embedded items include at least one of one or more macros and/or one or more OLE objects.
  • said obfuscation scoring features include macro-related features, said macro-related features including at least one use of CreateObject, Shell, FileSystem, URLDownloadToFile, CallByName, or Detect Sandbox.
  • said obfuscation scoring features include object-related features, said object-related features including at least one use of the following Visual Basic for Applications (VBA) Object Linking and Embedding (OLE) features: Shell, FileSystem, URLDownloadToFile, CallByName, or Detect Sandbox.
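
The macro-related feature extraction described in the items above can be illustrated with a minimal sketch. It assumes the VBA macro source has already been extracted from the document file as text. The function names, the whole-identifier matching rule, and the use of raw counts are illustrative assumptions, not the claimed implementation. "Detect Sandbox" is left out because the text does not say how that feature is measured.

```python
import re
from typing import Dict, List

# Keywords taken from the feature list above. "Detect Sandbox" is omitted
# because it is not a VBA identifier and its measurement is not specified.
KEYWORDS = ("CreateObject", "Shell", "FileSystem", "URLDownloadToFile", "CallByName")


def extract_keyword_features(macro_source: str) -> Dict[str, int]:
    """Count case-insensitive, whole-identifier uses of each keyword.

    Whole-identifier matching is an assumption of this sketch: "WScript.Shell"
    counts as a use of Shell, while "ShellExecute" does not.
    """
    features = {}
    for keyword in KEYWORDS:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        features[keyword] = len(pattern.findall(macro_source))
    return features


def to_feature_vector(features: Dict[str, int]) -> List[int]:
    """Order the counts consistently so they can be passed to a classifier."""
    return [features[keyword] for keyword in KEYWORDS]


if __name__ == "__main__":
    # Hypothetical macro text used only to exercise the functions.
    sample = (
        'Sub AutoOpen()\n'
        '    Set sh = CreateObject("WScript.Shell")\n'
        '    Shell "notepad.exe"\n'
        'End Sub\n'
    )
    print(to_feature_vector(extract_keyword_features(sample)))
```

In the described method, a vector of this kind would be one input to the trained machine learning model that outputs the likelihood of malicious code. The model itself is not sketched here because the text above does not specify it.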

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Virology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)
PCT/US2022/017778 2021-02-24 2022-02-24 Signatureless detection of malicious ms office documents WO2022182919A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023551122A JP7493108B2 (ja) 2021-02-24 2022-02-24 Signatureless detection of malicious MS Office documents

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US17/184,502 2021-02-24
US17/184,478 2021-02-24
US17/184,478 US11222112B1 (en) 2021-02-24 2021-02-24 Signatureless detection of malicious MS office documents containing advanced threats in macros
US17/184,502 US11349865B1 (en) 2021-02-24 2021-02-24 Signatureless detection of malicious MS Office documents containing embedded OLE objects
US17/572,548 2022-01-10
US17/572,548 US20220269782A1 (en) 2021-02-24 2022-01-10 Detection of malicious code that is obfuscated within a document file

Publications (1)

Publication Number Publication Date
WO2022182919A1 (en) 2022-09-01

Family

ID=83049470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/017778 WO2022182919A1 (en) 2021-02-24 2022-02-24 Signatureless detection of malicious ms office documents

Country Status (2)

Country Link
JP (1) JP7493108B2 (ja)
WO (1) WO2022182919A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150261955A1 (en) * 2014-03-17 2015-09-17 Proofpoint, Inc. Behavior profiling for malware detection
US20190222591A1 (en) * 2018-01-17 2019-07-18 Group IB TDS, Ltd Method and server for determining malicious files in network traffic
US20190236273A1 (en) * 2018-01-26 2019-08-01 Sophos Limited Methods and apparatus for detection of malicious documents using machine learning
US20200233962A1 (en) * 2019-01-22 2020-07-23 Sophos Limited Detecting obfuscated malware variants
US20200250309A1 (en) * 2019-01-31 2020-08-06 Sophos Limited Methods and apparatus for using machine learning to detect potentially malicious obfuscated scripts
US11222112B1 (en) * 2021-02-24 2022-01-11 Netskope, Inc. Signatureless detection of malicious MS office documents containing advanced threats in macros

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8789172B2 (en) 2006-09-18 2014-07-22 The Trustees Of Columbia University In The City Of New York Methods, media, and systems for detecting attack on a digital processing device
KR101296716B1 (ko) 2011-12-14 2013-08-20 한국인터넷진흥원 System and method for detecting PDF document-type malicious code
BR112019012654B1 (pt) 2016-12-19 2023-12-19 Telefonica Cybersecurity & Cloud Tech S.L.U Method and system for detecting a malicious program in an electronic document, and computer program
JP6855866B2 (ja) 2017-03-23 2021-04-07 三菱ケミカル株式会社 Macromonomer copolymer and molding material

Also Published As

Publication number Publication date
JP7493108B2 (ja) 2024-05-30
JP2024507893A (ja) 2024-02-21

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22760439; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: 2023551122; Country of ref document: JP
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 22760439; Country of ref document: EP; Kind code of ref document: A1