WO2023102117A1 - Computer security systems and methods using machine learning models - Google Patents

Computer security systems and methods using machine learning models

Info

Publication number
WO2023102117A1
Authority
WO
WIPO (PCT)
Prior art keywords
security
file
models
sample
maliciousness
Prior art date
Application number
PCT/US2022/051538
Other languages
French (fr)
Inventor
Ehud SHAMIR
Pedro Hugo MARQUES VILACA
Joseph Landry
Sameet MEHTA
Original Assignee
Threatoptix Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Threatoptix Inc. filed Critical Threatoptix Inc.
Publication of WO2023102117A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection

Abstract

Methods, systems, and computer readable media for detecting and identifying computer security threats. In some examples, a system includes a feature extractor configured for extracting a feature vector of features from a sample file based on a file type of the sample file. The system includes a security system configured for: selecting a maliciousness model from a plurality of maliciousness models based on the file type of the sample file, wherein the maliciousness model was trained on a plurality of feature vectors from training sample files of the same file type, each of the training sample files being labelled as malicious or benign; and determining that the sample file is malicious or benign based on evaluating the feature vector of the sample file with the maliciousness model.

Description

COMPUTER SECURITY SYSTEMS AND METHODS USING MACHINE LEARNING MODELS
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Patent Application Serial No. 63/285,064, filed December 1, 2021, the disclosure of which is incorporated herein by reference in its entirety. This application claims the benefit of U.S. Provisional Patent Application Serial No. 63/313,010, filed February 23, 2022, the disclosure of which is incorporated herein by reference in its entirety. This application claims the benefit of U.S. Provisional Patent Application Serial No. 63/316,193, filed March 3, 2022, the disclosure of which is incorporated herein by reference in its entirety. This application claims the benefit of U.S. Provisional Patent Application Serial No. 63/323,105, filed March 24, 2022, the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The subject matter described herein relates to computer security. More particularly, the subject matter described herein relates to methods, systems, and computer readable media for detecting and identifying computer security threats using machine learning models.
BACKGROUND
Computer security is the protection of computer systems and networks from information theft and damage to hardware, software, or electronic data, as well as from the disruption or misdirection of computer services. Computer security is continuously changing due to the expanding reliance on computer systems, the Internet, cloud infrastructure, high-speed mobile networks (5G, etc.), and wireless network standards such as Bluetooth and Wi-Fi, and due to the growth of "smart" devices, including smartphones, televisions, and the various devices that constitute the Internet of Things (IoT).
SUMMARY
Methods, systems, and computer readable media for detecting and identifying computer security threats. In some examples, a system includes a feature extractor configured for extracting a feature vector of features from a sample file based on a file type of the sample file. The system includes a security system configured for: selecting a maliciousness model from a plurality of maliciousness models based on the file type of the sample file, wherein the maliciousness model was trained on a plurality of feature vectors from training sample files of the same file type, each of the training sample files being labelled as malicious or benign; and determining that the sample file is malicious or benign based on evaluating the feature vector of the sample file with the maliciousness model.
In some examples, a system includes a number of security systems, each executing on a computing system comprising one or more processors and operating on the Linux operating system, wherein each security system is configured for using one or more machine-learning models to detect and identify security threats and for reporting security events. The system includes a backend server configured for receiving the security events from the plurality of security systems and using the security events to produce updated security intelligence.
In some examples, a system includes a number of embedded system devices, each comprising at least one processor and software configured to perform a dedicated function. The system includes a number of security systems, each executing on a respective embedded system device, wherein each security system is configured for using one or more machine-learning models to detect and identify security threats and for reporting security events. The system includes a backend server configured for receiving the security events from the security systems and using the security events to produce updated security intelligence.
The subject matter described herein can be implemented in software in combination with hardware and/or firmware. For example, the subject matter described herein can be implemented in software executed by a processor. In one example implementation, the subject matter described herein may be implemented using a computer readable medium having stored thereon computer executable instructions that when executed by the processor of a computer control the computer to perform steps.
Example computer readable media suitable for implementing the subject matter described herein include non-transitory devices, such as disk memory devices, chip memory devices, programmable logic devices, and application specific integrated circuits. In addition, a computer readable medium that implements the subject matter described herein may be located on a single device or computing platform or may be distributed across multiple devices or computing platforms.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1A is a block diagram illustrating an example computer security ecosystem;
Figure 1B shows an example network where the security system can be deployed;
Figure 1C is a block diagram illustrating an example computer security ecosystem;
Figure 1D is a block diagram illustrating an example computer security ecosystem;
Figure 2 is a block diagram of an example backend server;
Figure 3 is a block diagram of an example target server;
Figure 4 is a block diagram of the security system;
Figure 5 is a block diagram of the machine learning detection engine;
Figure 6 is a block diagram illustrating a vehicle and one or more computing systems for the vehicle;
Figure 7 is a block diagram of an example networking environment;
Figure 8 is a block diagram of an example networking device;
Figure 9 is a block diagram of an example industrial system;
Figure 10 is a block diagram of an example backend server;
Figure 11 is a block diagram of an example target system executing the security system;
Figure 12 is a flow diagram illustrating the security system processing an example sample file in operation; and
Figure 13 is a block diagram of an example feature extractor.
DETAILED DESCRIPTION
The subject matter described herein relates to methods, systems, and computer readable media for detecting and identifying computer security threats. The security systems described in this document offer one or more of the following advantages over many conventional systems:
• Existing vendors depend on a cloud infrastructure/network connectivity to detect advanced threats. The security systems described in this document can interact with a backend server but, by virtue of localized threat detection, do not need to in order to detect and identify advanced computer security threats. The system provides "stand-alone" protection for the computer systems it protects.
• The security systems described in this document can operate in a self-contained/stand-alone mode, without accessing a cloud infrastructure, because the security system has all the information that it needs to detect threats, and the models' detection capability is long lasting, requiring infrequent updates.
• Existing vendors can only detect Linux threats with limited capabilities using limited static signatures or very limited machine learning features reliant on a cloud infrastructure. The security systems described in this document can use dynamic signatures and operate natively on Linux systems without relying on a cloud infrastructure.
• Existing vendors offer slow detection, in part due to the "roundtrips" required to communicate with outside resources for further analysis (e.g., a cloud infrastructure), whereas the security systems described in this document can operate at real-time or near real-time due to native execution on Linux systems and operation without reliance on a cloud infrastructure.
• The security systems described in this document can operate at the machine code level to enable the deepest level of protection for the systems they protect.
The sensor can operate at a deep level, for example, using assembly code to communicate directly with the kernel when performing self-mitigation and threat inspection. This method protects the sensor from malicious code hooking and code manipulation in real time.
Figure 1A is a block diagram illustrating an example computer security ecosystem 100. Figure 1A shows a backend computer security server 102, a target server 104, and a security system 106 executing on the target server 104. A security operations center (SoC) 108 supports the target server 104. The systems communicate over a data communications network 110, e.g., the Internet.
The security system 106 comprises software configured for detecting and identifying computer security threats. The security system 106 can be supported by backend server 102, for example, by receiving software updates and updated security signatures from the backend server 102. Nonetheless, the security system 106 is configured to operate in a standalone mode to detect and identify threats without accessing the backend server 102.
The security system 106 provides one or more of the following features:
• Advanced machine learning protection
• Detecting and preventing Linux-directed attacks, as well as Windows, Mac, and mobile attacks, in real-time or near real-time
• Real-time code similarity, capable of identifying malicious code genomes to detect who is behind an attack, i.e., attribute an attack to a particular attacker or group of attackers (an industry term known as "Attribution")
• Detecting fileless attacks using advanced machine learning and threat memory acquisition in real-time or near real-time
• Reduction in dependency on SoC infrastructure, lowering cost
• A threat detection sensor that can operate without any cloud connectivity
• Support for various processor architectures: x86, x86_64, ARM, ARM64, MIPS, RISC, and the like
• Portability to various devices and systems such as vehicle control systems and IoT devices
In some examples, the security system 106 is inline between execution and storage of files, so it controls access to those files. The security system 106 receives a notification from the kernel and dispatches the target file to the machine learning models, which return a result in, e.g., milliseconds; the security system 106 then acts according to the result, either authorizing the execution/storage of the file or denying it.
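The inline allow/deny flow just described can be sketched as follows. The function names and the toy verdict logic are illustrative assumptions; a real system would receive the notification through a kernel interface and score extracted features with the selected model.

```python
# Hedged sketch of the inline flow: kernel notifies on a file
# execution/storage request, the model returns a verdict in
# milliseconds, and the request is authorized or denied.
ALLOW, DENY = "allow", "deny"

def evaluate(path):
    # Stand-in for feature extraction plus model evaluation; the
    # substring check below is purely illustrative.
    return "malicious" if "evil" in path else "benign"

def on_kernel_notification(path):
    """Called when the kernel reports an execution/storage request."""
    verdict = evaluate(path)
    # Act according to the result: authorize or deny the access.
    return DENY if verdict == "malicious" else ALLOW
```

On Linux, such a hook could plausibly be built on a kernel facility that supports permission events on file access, though the disclosure does not name a specific mechanism.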
The security system 106 can be configured to identify new, unknown threats by virtue of machine learning. The security system 106 can classify threats, identify bad actors, and does not require a cloud infrastructure. The security system 106 can be optimized for Linux. The security system 106 can detect threats within, for example, executable code, PowerShell and PHP scripts, Microsoft Office documents, EML mail documents, and other kinds of scripts and documents.
The security system 106 can run on any appropriate type of server, for example, a physical server or a virtual server. The security system 106 can execute on bare metal/virtual instance/Blade/embedded systems. The security system 106 can be run on an internal cloud or an external cloud. The security system 106 can scan files written to and read from cloud storage. The security system 106 can detect adversaries spreading using computing resources, i.e., lateral movement.
Figure 1A shows first and second attackers 112 and 114 that attempt to attack the target server 104. Figure 1A also shows legitimate users attempting to access the target server 104, for example, a first user on a laptop 116 communicating with a WiFi access point 118 and a second user on a mobile phone 120 communicating with a radio access station 122. The security system 106 is configured to detect attacks from the attackers 112 and 114 and then attribute attacks to either the first attacker 112, the second attacker 114, or some other source.
The security system 106 is configured to execute on Linux systems, and the security system 106 is typically configured to be light-weight to allow for portability to various systems operating Linux. For example, the security system 106 can be ported to an IoT device 124, to the WiFi access point 118, to a computer system on the radio access station 122 or other cellular network nodes, and generally to routers, servers, gateways, and other devices in the ecosystem 100. Multiple paths, devices, and resources can be protected and can contribute to the learning of potential threats/attack vectors. Any number "N" of systems can be protected.
For example, the security system 106 can be installed on enterprise gateway/router/firewall 126 to provide protection to an enterprise network 128. In another example, the security system 106 can be installed on a supervisory control and data acquisition (SCADA) system 130 or parts of the SCADA system 130. The SCADA system can be configured to control industrial processes locally or at remote locations, e.g., to monitor, gather, and process real-time data; interact with devices such as sensors, valves, pumps, motors, and the like; and record events into a log file. SCADA systems can be important to protect for industrial organizations since they help to maintain efficiency, process data for smarter decisions, and communicate system issues to help mitigate downtime.
Since the security system 106 is configured to execute on Linux systems, it can be ported, in general, to N other devices 132 in endpoints or infrastructure or both, e.g., routers, switches, gateways, and other embedded systems. Some of these devices may have significant available/spare processing power, e.g., mobile routers, which can be useful in executing the security system 106 on those devices.
Each of the installations can communicate with the backend server 102. The backend server 102 can use machine learning to detect and update security signatures, and the machine learning can be based on a very robust training dataset by virtue of receiving security data from the various other installations of the security system 106. Moreover, the training dataset used at the backend server 102 will be continuously updated as security data continuously flows in from the various devices in the ecosystem 100 executing the security system 106.
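As one hedged sketch of how an installation might report security data back to the backend server for aggregation into the training dataset, consider the event payload below. The field names and batch format are assumptions for illustration; the disclosure does not specify a wire format.

```python
# Sketch of a security-event report an installation could send to the
# backend so aggregated data can refresh the training set. Field names
# are illustrative assumptions, not taken from the disclosure.
import json
import time

def build_security_event(host_id, sample_hash, verdict, model_version):
    return {
        "host": host_id,
        "sha256": sample_hash,
        "verdict": verdict,            # "malicious" or "benign"
        "model_version": model_version,
        "timestamp": int(time.time()),
    }

def serialize_batch(events):
    # Batching keeps reporting lightweight on constrained devices.
    return json.dumps({"events": events})
```

The backend could merge such reports from many installations into its continuously updated training dataset.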
Figure 1B shows an example network 150 where the security system 106 can be deployed. The network 150 includes an air-gapped local area network (LAN) 152. Air-gapping is a network security measure employed on one or more computers to ensure that a secure computer network is physically isolated from unsecured networks, such as the Internet or an unsecured local area network. In some examples, air-gapped LAN 152 has no network interface controllers connected to other networks.
The network 150 can include various target servers 154 and 156, each executing the security system 106. Since the security system 106 can be configured to operate in a stand-alone/self-contained mode, without accessing the backend server 102, the target servers 154 and 156 can be protected even though air-gapped LAN 152 is not connected to the backend server 102. The network 150 can include, e.g., a SCADA system 158 or an industrial control system (ICS), and the security system 106 can protect these systems even though they may be isolated from other networks.
Figure 1C is a block diagram illustrating an example computer security ecosystem 160. Figure 1C shows a backend computer security server 102, an embedded system device 134, and a security system 106 executing on the device 134. The systems communicate over a data communications network 110, e.g., the Internet. The device 134 executes an operating system 136, for example, Linux or QNX.
Cybersecurity for embedded systems, in general, can involve adopting end-to-end security protocols and attack surface reduction. End-to-end security protocols include applying security across some or all of at least the following channels:
1. Authentication by a device to a server. For example, the board inside a device can be reviewed for known attack vectors, e.g., passwords hardcoded in the device. In another example, the data connection between the device and a remote server can be secured to maintain the integrity of the data and avoid leaks and hacks.
2. Server authentication to the device, a common attack vector in poorly secured systems. In general, the connection between the device and the server should be secured in both directions.
3. Establishing secure session keys on the device. The public keys, and the exchange of these keys between servers and devices, keep the data safe and secure and help to avoid the reputation and financial costs of hacks.
4. Maintaining both the integrity and confidentiality of data stored on the device. This security can provide value to end users in various ways.
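As a concrete, deliberately simplified instance of the two-way authentication in channels 1 and 2 above, a pre-shared-key challenge-response can be sketched with Python's standard library. This is a sketch only: a production embedded system would more likely use a certificate-based handshake such as mutual TLS, and the key would be provisioned securely per device, never hardcoded.

```python
# Minimal two-way (device <-> server) challenge-response sketch using a
# pre-shared key. Illustrative only; not the disclosure's own protocol.
import hashlib
import hmac
import secrets

PSK = secrets.token_bytes(32)  # provisioned per device, never hardcoded

def respond(challenge, key=None):
    """Prove knowledge of the key without revealing it."""
    return hmac.new(key or PSK, challenge, hashlib.sha256).digest()

def verify(challenge, response, key=None):
    expected = hmac.new(key or PSK, challenge, hashlib.sha256).digest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, response)

# Each side challenges the other, so authentication runs in both
# directions (server-to-device and device-to-server).
server_challenge = secrets.token_bytes(16)   # server -> device
device_challenge = secrets.token_bytes(16)   # device -> server
assert verify(server_challenge, respond(server_challenge))
assert verify(device_challenge, respond(device_challenge))
```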
Attack surface reduction can be a multi-faceted security element of an embedded system. In some examples, an attack surface analysis can be performed by probing the device for vulnerabilities and assessing the extent of its exposure to risk. Based on this analysis, a full risk assessment can be developed.
In some examples, a security system for an embedded system can be customized for the embedded system using one or more of the following features:
1. Reducing the amount of code
2. Diminishing the number of device and server entry points
3. Eliminating any service not used or under-used by end-users.
The security system 106, as shown in Figure 1C, comprises software configured for detecting and identifying computer security threats. The security system 106 can be supported by backend server 102, for example, by receiving software updates and updated security signatures from the backend server 102. The security system 106 can be configured to operate in a standalone mode to detect and identify threats without accessing the backend server 102. In some examples, the security system 106 is inline between execution and storage of files to control access to those files. The security system 106 receives a notification from the kernel and dispatches the target file to the machine learning models, which return a result in, e.g., milliseconds; the security system 106 then acts according to the result, either authorizing the execution/storage of the file or denying it.
The security system 106 can be configured to identify new, unknown threats by virtue of machine learning. The security system 106 can classify threats, identify bad actors, and does not require a cloud infrastructure. The security system 106 can be optimized for Linux. The security system 106 can detect threats within, for example, executable code, PowerShell and PHP scripts, Microsoft Office documents, EML mail documents, and other kinds of scripts and documents.
The security system 106 can run on any appropriate type of embedded system, for example, a physical device or a virtual device. The security system 106 can execute on bare metal/virtual instance/Blade/embedded systems. The security system 106 can be run on an internal cloud or an external cloud. The security system 106 can scan files written to and read from cloud storage. The security system 106 can detect adversaries spreading using computing resources, i.e., lateral movement.
Figure 1C shows first and second attackers 112 and 114 that attempt to attack the device 134. Figure 1C also shows legitimate users attempting to access the device 134, for example, a first user on a laptop 116 communicating with a WiFi access point 118 and a second user on a mobile phone 120 communicating with a radio access station 122. The security system 106 is configured to detect attacks from the attackers 112 and 114 and then attribute attacks to either the first attacker 112, the second attacker 114, or some other source.
The security system 106 is configured to execute on Linux systems, and the security system 106 is typically configured to be light-weight to allow for portability to various systems operating Linux. For example, the security system 106 can be ported to an IoT device 124, to the WiFi access point 118, to a computer system on the radio access station 122 or other cellular network nodes, and generally to routers, servers, gateways, and other devices in the ecosystem 160. Multiple paths, devices, and resources can be protected and can contribute to the learning of potential threats/attack vectors. Any number "N" of systems can be protected.
For example, the security system 106 can be installed on enterprise gateway/router/firewall 126 to provide protection to an enterprise network 128. In another example, the security system 106 can be installed on a supervisory control and data acquisition (SCADA) system 130 or parts of the SCADA system 130.
Since the security system 106 is configured to execute on Linux systems, it can be ported, in general, to N other devices 132 in endpoints or infrastructure or both, e.g., routers, switches, gateways, and other embedded systems. Some of these devices may have significant available/spare processing power, e.g., mobile routers, which can be useful in executing the security system 106 on those devices.
Each of the installations can communicate with the backend server 102. The backend server 102 can use machine learning to detect and update security signatures, and the machine learning can be based on a training dataset by virtue of receiving security data from the various other installations of the security system 106. Moreover, the training dataset used at the backend server 102 will be continuously updated as security data continuously flows in from the various devices in the ecosystem 160 executing the security system 106.
In general, the device 134 can be any appropriate type of computing device for an embedded system. The following are examples of embedded systems:
• Automotive systems (e.g., ABS, engine management systems, center lock, ...)
• Industrial embedded systems (e.g., robotic arm control unit, machine monitoring systems....)
• Networking equipment (e.g., firewalls, routers (aggregation & edge), smart switches, LTE infrastructure, base band units, ...)

In some embedded systems, the device 134 comprises two layers:
• Application layer (e.g., bare metal)
• Hardware layer (e.g., a microcontroller)
In some applications, the device drivers are the bare metal code and also the only application that can be executed on the microcontroller.
In some other systems, for example, safety-critical applications, the device 134 can include an operating system layer, such that the device 134 comprises three layers:
• Application layer
• OS layer (e.g., Linux, OSEK, FreeRTOS)
• Hardware layer
In this type of system, the operating system (e.g., Linux) is used as an intermediate layer to provide more complex functionalities like multitasking (e.g., data exchange between tasks, task scheduling, task synchronization).
Figure 1D is a block diagram illustrating an example computer security ecosystem 162. Figure 1D shows a backend computer security server 102, a target system 104, and a security system 106 executing on the target system 104. The systems communicate over a data communications network 110, e.g., the Internet. The target system 104 can be any appropriate type of computing system that is exposed to security threats. For example, the target system 104 can be a web server running Linux. In other examples, the target system 104 can be an IoT device, a WiFi access point, a computer system on a radio access station or other cellular network node, or a networking device.
The security system 106 comprises software configured for detecting and identifying computer security threats. The security system 106 can be supported by the backend server 102, for example, by receiving software updates and updated machine learning models from the backend server 102. The backend server 102 can receive labelled training data from a training data source server 108. The labelled training data can include, for example, various files labelled as malicious or benign. Crowdsourced malware corpus/databases, for example, can provide a service that supplies various files labelled as malicious or benign based on scanning performed by a number of different security scanners.
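As a hedged sketch of how a maliciousness model might be trained from such a labelled corpus, the following uses a nearest-centroid classifier as a stand-in; the disclosure does not specify the model family, so the algorithm, function names, and feature vectors here are assumptions for illustration.

```python
# Train a per-file-type maliciousness model from labelled samples.
# Nearest-centroid is a stand-in for whatever model is actually used.

def train(samples):
    """samples: list of (feature_vector, label), label in {'malicious', 'benign'}."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    # The model is one centroid (mean feature vector) per label.
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(centroids, vec):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Classify by the nearest labelled centroid.
    return min(centroids, key=lambda label: dist(centroids[label], vec))
```

A production pipeline would train one such model per file type, using only feature vectors extracted from training sample files of that type.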
In some examples, the security system 106 is inline between execution and storage of files to control access to those files. The security system 106 receives a notification from the kernel and dispatches the target file to the machine learning models, which return a result in, e.g., milliseconds; the security system 106 then acts according to the result, either authorizing the execution/storage of the file or denying it.
The security system 106 can be configured to identify new, unknown threats by virtue of machine learning. The security system 106 can classify threats, identify bad actors, and does not require a cloud infrastructure. The security system 106 can be optimized for Linux. The security system 106 can detect threats within, for example, executable code, PowerShell and PHP scripts, Microsoft Office documents, EML mail documents, and other kinds of scripts and documents.
The security system 106 can run on any appropriate type of embedded system, for example, a physical device or a virtual device. The security system 106 can execute on bare metal/virtual instance/Blade/embedded systems. The security system 106 can be run on an internal cloud or an external cloud. The security system 106 can scan files written to and read from cloud storage. The security system 106 can detect adversaries spreading using computing resources, i.e., lateral movement. A system administrator 116 can access the target system 104 or the backend server 102 or both.
Figure 1D shows first and second attackers 112 and 114 that attempt to attack the target system 104. The security system 106 is configured to detect attacks from the attackers 112 and 114 and then attribute attacks to either the first attacker 112, the second attacker 114, or some other source. The security system 106 attributes an attack to one of the attackers 112 and 114 by virtue of machine learning models trained on training data labelled with the attackers.
The security system 106 can be configured to execute on Linux systems, and the security system 106 is typically configured to be light-weight to allow for portability to various systems operating Linux. Multiple paths, devices, and resources can be protected and contribute to the learning of potential threats/attack vectors.
Figure 2 is a block diagram of an example backend server 102. The backend server 102 is configured for receiving security data from each of various installations of the security system 106 of Figure 1A. The backend server 102 can produce additional intelligence, e.g., security signatures, based on the received security data.
The backend server 102 is implemented on one or more processors 202 and memory 204 storing executable instructions for the processors 202. The backend server 102 includes a threat monitor 206 for receiving security data from executing instances of the security system 106 and updating security information based on the received security data. The threat monitor 206 can be implemented as a logging engine configured for receiving events from security sensors.
The backend server 102 includes a signature generator 208 for producing security signatures for dissemination back to executing instances of the security system 106. The backend server 102 includes a security systems updater 210 for updating executing instances of the security system 106, e.g., with improved security intelligence based on the received security data.
Figure 3 is a block diagram of an example target server 104. The target server 104 is configured for providing computing services to legitimate users. The target server 104 is implemented on one or more processors 302 and memory 304 storing executable instructions for the processors 302. The target server 104 executes an operating system 306, which can be Linux. The target server 104 executes the security system 106, which can be configured to run natively on Linux.
Figure 4 is a block diagram of the security system 106. The security system includes a machine learning detection engine 402, an events engine 404, a forensic engine 406, and an advanced incident response engine 408. The events engine 404 is configured for event-driven detection, e.g., by continuously collecting events from the operating system and monitoring these events to automatically trigger detection when a certain threshold is crossed. The detection can be based on various types of events, e.g., file system events, operating system built-in security events, networking connections, process monitoring, and the like. The events engine 404 can collect events and send them to the backend server 102, and the events can be stored for later analysis during an incident response or SoC monitoring.
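The threshold-triggered, event-driven detection performed by the events engine can be sketched as follows. The class name, event categories, and threshold values are illustrative assumptions, not details from the disclosure.

```python
# Sketch of event-driven detection: events are collected continuously,
# retained for later forensics, and detection fires automatically when
# a per-category count crosses its configured threshold.
from collections import Counter

class EventsEngine:
    def __init__(self, thresholds):
        self.thresholds = thresholds      # e.g. {"failed_exec": 3}
        self.counts = Counter()
        self.log = []                     # events kept for later analysis

    def collect(self, event_type, detail):
        """Record one event; return True when detection should trigger."""
        self.log.append((event_type, detail))
        self.counts[event_type] += 1
        # Categories without a configured threshold never trigger.
        limit = self.thresholds.get(event_type, float("inf"))
        return self.counts[event_type] >= limit
```

In the described system, the retained event log would also be forwarded to the backend server for incident response or SoC monitoring.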
The forensic engine 406 can be used for incident reconstruction and threat navigation, e.g., by a SoC operator. The forensic engine 406 can be configured to allow the SoC operator to understand the chain of events and focus on a core aspect of a threat. In some examples, the forensic engine 406 generates a story line graph, or the story line graph can be generated on the backend server 102.
The advanced incident response engine 408 can include, e.g., a memory investigator and a network investigator. The memory investigator can perform memory introspection in real-time or near real-time. When a threat is triggered, the data can be sent for further analysis by security researchers. In some examples, diskless threats can be detected based on machine learning models that will be applied during the memory introspection process. The network investigator can perform network monitoring (sniffing) for, e.g., a predefined amount of time. During this time, the monitoring will collect network information and can send the information to the backend server 102 for further analysis.
Figure 5 is a block diagram of the machine learning detection engine 402. The machine learning detection engine 402 includes a maliciousness model 502, threat models 504, a process tree model 506, a file model 508, and a network model 510.
The maliciousness model 502 is used to determine whether an object is malicious or benign using a machine learning confidence level. If an object is malicious, then the threat models 504 are used to classify the threat. In some examples, the threat models each represent a different vector of a threat actor, and detection is based on feature similarities to previously known threats affiliated with a specific threat actor. The machine learning detection engine 402 can be configured for performing code similarity analysis in real time.
The process tree model 506 is a machine learning based model that analyzes the process tree and identifies malicious processes from that information alone. The network model 510 is a machine learning based model that analyzes the collected network information and identifies malicious processes and/or network traffic from that information. The file model 508 is a machine learning based model that analyzes file accesses and identifies malicious processes based on that information.
Figure 6 is a block diagram illustrating a vehicle 600 and one or more computing systems 602 for the vehicle 600. The vehicle 600 can be any appropriate type of vehicle, e.g., car, truck, boat, or plane.
A modern vehicle may have more than a hundred microprocessors providing complex functionality for safety, infotainment, and powertrains. Linux can be used, for example, for infotainment and entertainment systems in cars and other vehicles. For example, consider the following systems in cars:
• Adaptive AUTOSAR. AUTOSAR stands for Automotive Open System Architecture, a global architecture used for developing automotive systems between automotive OEMs (e.g., BMW, Daimler, GM) and suppliers (e.g., Valeo, Vector, Mentor Graphics). Adaptive AUTOSAR is a newer AUTOSAR platform based on Linux (POSIX) that has been developed especially for applications like autonomous driving, connected vehicles, and over-the-air (OTA) updates.
• Harman Automotive. Harman produces infotainment systems for some premium brands like BMW, Audi, Mercedes, and many more. These infotainment systems can be Linux-based embedded systems.
• Automotive Grade Linux (AGL). AGL is an open-source project used to standardize connected vehicle technology between automotive OEMs and suppliers; it is also based on the Linux OS.
• Android Auto. A system used to connect smartphones to infotainment systems in cars. Google is the developer of Android Auto, which is based on the Linux OS.
• Qt Automotive. Provides Linux-based software components and tools that are used for developing HMI applications in the automotive industry, e.g., the automotive instrument clusters such as those in electric vehicles.
As shown in Figure 6, the computing systems 602 include at least one networked system 604 (e.g., an infotainment system) and at least one off-network system 606 (e.g., a control or safety system). Each of the systems 604 and 606 has a security system installed. The security system 106a for the networked systems 604 can receive updates from the backend server 102 and provide feedback to the backend server 102 regarding any detected threats. The security system 106b for the off-network systems 606 can be configured to operate in a standalone mode to detect and identify threats without accessing the backend server 102.
In some examples, the security system 106 can be installed as a service API. The computing systems 602 can then request the security system 106 to scan selected data on demand. Instead of being in the execution/storage path, the security system 106 can be configured as an on-demand malware scanning service. This type of installation can also be used for, e.g., Linux-based systems that lack certain features used in the native security system.
Figure 7 is a block diagram of an example networking environment 700. The networking environment 700 of Figure 7 includes a firewall 702 and a router 704. A security system 106c is installed on the firewall 702, and a security system 106d is installed on the router 704. In general, the security system 106 of Figure 1A can be installed on any appropriate type of networking device. The following are examples of Linux-based systems for networking equipment:
• Nvidia Cumulus Linux. A Linux-based network operating system for embedded networking applications and devices; the system has features such as network protection and a flexible open architecture.
• Axxcelera. Networking and embedded devices, especially wireless technologies, that use a Linux-based system in the product itself or to develop the product.
• HMD Global. Linux-based smart solutions for Android.
• Cisco devices. Cisco offers Linux-based solutions like Open NX-OS (based on Wind River Linux 5).
Figure 8 is a block diagram of an example networking device 800. The networking device 800 includes hardware 802, for example, one or more processors, memory, connection ports, power circuits, and a cooling system. The networking device includes switching circuits 804 configured for processing packets.
The networking device 800 executes a Linux kernel 806, which can include, for example, system calls 808, state information 810, and packet handling 812 software. The networking device 800 includes a user space 814 for executing applications that run on the Linux kernel 806.
The security system 106 can be installed on the networking device at the level of the Linux kernel 806. The security system 106 can be configured within the execution/storage path to maintain the security of the networking device 800.
Figure 9 is a block diagram of an example industrial system 900. The system 900 includes a first robotic system 902 controlled by a computer system running robot operating system (ROS) 904. The system 900 includes a second robotic system 906 controlled by a computer system running Linux 908. A first security system 106e is installed on the first robotic system 902, and a second security system 106f is installed on the second robotic system 906.
The following are examples of Linux-based embedded systems in industrial systems:
• ROS Industrial systems. Used for robotic and industrial application development.
• KUKA Robots. Industrial robots, production machines, and production systems using Linux.
• ABB. Robotics and PLCs automation solutions using Linux.
• Siemens. Provides solutions based on Linux like SIMATIC industrial operating system.
• Schneider Electric. Industrial products like PowerChute (smart UPS).
Figure 10 is block diagram of an example backend server 102. The backend server 102 includes one or more processors 1002 and memory storing instructions for the processors 1002.
The backend server 102 stores labelled training samples 1004. In general, the labelled training samples 1004 can be acquired from any appropriate source. For example, the backend server 102 can receive some or all of the labelled training samples 1004 from the training data source server 108 of Figure 1D. In some examples, sample files from the training data source server 108 are supplemented with other sample files from other sources, e.g., as identified by the system administrator 116 of Figure 1D. Each sample file in the labelled training samples 1004 is labelled as benign or malicious. Some of the sample files are also labelled with a threat actor identifier.
The backend server 102 includes a feature extractor 1006. For each sample file within the labelled training samples 1004, the feature extractor 1006 identifies the file type of the sample file and then, based on the file type, extracts a feature vector comprising a number of different feature values. The feature extractor 1006 uses features that are derived from the sample file. For example, the features can be derived from the disassembly of machine instructions inside the sample file. The features can be derived from string literals, or human readable text, that appear in the sample file. String literals can appear in the source code of the sample file. The features can be derived from imported function names. These functions can be implemented in shared objects or dynamically linked libraries and are stitched into the sample's process at run time. These imported functions are referenced by the source code used to generate the sample.
In some examples, the feature extractor 1006 uses features from the “negative space” of sample files. Negative space can be bytes that are used for file structural alignment, or just uninitialized regions of a sample. This can help machine learning systems to recognize techniques used by specific compilers. This can also help identify samples that have been compiled to optimize for smaller file sizes, because there is less negative space.
The feature extractor 1006 produces a feature vector for each sample file in the labelled training samples 1004, which results in a number of training data matrices, one for each file type and one for each threat actor. In a training data matrix, each row corresponds to a sample file from the labelled training samples 1004, and each row contains the extracted feature vector for the corresponding sample file.
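The per-file-type training matrix layout described above can be sketched in Python as follows; the function and sample names are illustrative assumptions, not part of the patent.

```python
from collections import defaultdict

def build_training_matrices(samples):
    """Group labelled samples by file type, one training matrix per type.

    Each row is a sample's feature vector plus its label
    (1 = malicious, 0 = benign), matching the layout described above."""
    matrices = defaultdict(list)
    for file_type, features, label in samples:
        matrices[file_type].append(list(features) + [label])
    return dict(matrices)

samples = [
    ("elf", [0.1, 0.9], 1),
    ("elf", [0.2, 0.1], 0),
    ("sh", [0.7, 0.3], 1),
]
matrices = build_training_matrices(samples)
```

A separate set of matrices, keyed by threat actor identifier, would be built the same way for the threat actor models.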
The backend server 102 includes a machine learning model builder 1008. The machine learning model builder 1008 can use any appropriate type of machine learning software to develop models for threat detection and identification. For example, the machine learning model builder 1008 can use gradient boosting tools such as XGBoost or LightGBM, and gradient-boosted trees can be used for training models.
The machine learning model builder 1008 builds two types of models: maliciousness models 1010 and threat actor models 1012. The machine learning model builder 1008 builds a maliciousness model for each file type, and the machine learning model builder 1008 builds a threat actor model for each of a number of threat actors included in the labelled training samples 1004. To build one of the maliciousness models 1010 for a particular file type, the machine learning model builder 1008 receives a training data matrix for the particular file type as the predictor variable for the machine learning tool. The training data matrix includes, for each sample file, the extracted feature vector and the label indicating whether the sample file is malicious or benign. The machine learning model builder 1008 then builds a maliciousness model for classifying sample files of the particular file type as malicious or benign.
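As a rough illustration of building one maliciousness model per file type, the sketch below substitutes a trivial nearest-centroid scorer for the gradient-boosted trees (XGBoost/LightGBM) named above, so it runs without external dependencies; all names here are hypothetical, not the patent's.

```python
def train_model(matrix):
    """Train on rows of [feature..., label] (1 = malicious, 0 = benign)."""
    def centroid(label):
        rows = [r[:-1] for r in matrix if r[-1] == label]
        return [sum(col) / len(rows) for col in zip(*rows)] if rows else None

    mal, ben = centroid(1), centroid(0)

    def score(fv):
        # Confidence in [0, 1]: closer to the malicious centroid -> higher.
        dm = sum((a - b) ** 2 for a, b in zip(fv, mal)) ** 0.5
        db = sum((a - b) ** 2 for a, b in zip(fv, ben)) ** 0.5
        return db / (dm + db) if (dm + db) else 0.5

    return score

def build_maliciousness_models(matrices):
    # One model per file type, as described above.
    return {file_type: train_model(m) for file_type, m in matrices.items()}

models = build_maliciousness_models({
    "elf": [[0.9, 0.8, 1], [0.1, 0.2, 0], [0.8, 0.9, 1], [0.2, 0.1, 0]],
})
```

In a real deployment, `train_model` would instead fit a gradient-boosted classifier on the same matrix split into features and labels.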
To build one of the threat actor models 1012 for a particular threat actor, the machine learning model builder 1008 receives a training data matrix for the threat actor as the predictor variable for the machine learning tool. The training data matrix includes, for each sample file, the extracted feature vector and the label for the threat actor identifier. The machine learning model builder 1008 then builds a threat actor model for classifying sample files as belonging to the threat actor or not belonging to the threat actor.
The training matrices, in some cases where a large number of training samples are used, can be very large. In some examples, the feature vectors are compressed before the training matrices are supplied to the machine learning model builder 1008. In some examples, the training matrices are loaded into memory using memory maps that allow the kernel to treat the file as though it has been fully loaded into memory even though some portions of the file are stored on disk.
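The memory-mapping approach described above can be sketched with the standard library's mmap module, which lets the kernel page matrix rows in on demand rather than loading the whole file; the binary layout here (16 bytes per row, two float64 values) is an assumption for illustration only.

```python
import mmap
import os
import struct
import tempfile

# Write a small binary matrix of float64 values (3 rows x 2 columns).
rows = [(0.1, 0.9), (0.5, 0.5), (0.8, 0.2)]
path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
with open(path, "wb") as f:
    for r in rows:
        f.write(struct.pack("<2d", *r))

# Map the file read-only; pages are faulted in lazily by the kernel,
# so the file behaves as if fully loaded even when parts stay on disk.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    row1 = struct.unpack_from("<2d", mm, 1 * 16)  # second row, 16 bytes/row
    mm.close()
```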
The maliciousness models 1010 and threat actor models 1012 are originally produced in a format specific to the machine learning tool being used by the machine learning model builder 1008. The models can then be parsed and stored in a binary product, optionally compressed, that can be deployed to live security systems.
In some examples, the backend server 102 includes a model evaluator 1014. The model evaluator 1014 can evaluate the maliciousness models 1010 and the threat actor models 1012 and automatically identify potential improvements to the models. In some cases, the labelled training samples 1004 are split into training samples and testing samples, e.g., 80% for training and 20% for testing. The model evaluator 1014 evaluates the samples reserved for testing against the produced models and then compares the determinations made by the models against the labels.
In some examples, the model evaluator 1014 can flag certain samples from the labelled training samples 1004 for further inspection. For example, the model evaluator 1014 can evaluate all of the samples of a particular file type with the corresponding maliciousness model for the file type, producing, for each sample, a confidence value ranging between 0 (benign) and 1 (malicious). Then, the model evaluator 1014 identifies samples where the confidence value is close to 0.5, e.g., by sorting by confidence value. The model evaluator 1014 can choose samples within the range of 0.45 - 0.55, for example, or any other suitable range.
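The boundary-flagging step above can be sketched as follows, assuming scored samples are available as (identifier, confidence) pairs; the 0.45-0.55 band matches the example range in the text, and the sort surfaces the most ambiguous samples first.

```python
def flag_ambiguous(scored, low=0.45, high=0.55):
    """Return samples near the 0.5 decision boundary, most ambiguous first.

    scored: list of (sample_id, confidence) pairs, confidence in [0, 1]."""
    return sorted((s for s in scored if low <= s[1] <= high),
                  key=lambda s: abs(s[1] - 0.5))

flagged = flag_ambiguous([("a", 0.02), ("b", 0.48), ("c", 0.97), ("d", 0.54)])
```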
The selected samples are then further evaluated to determine whether or not the labels for those samples should be changed. For example, the model evaluator 1014 can send a message to a system administrator 116 to evaluate the samples. If the labels for some or all of those samples are changed, the machine learning model builder 1008 can rebuild the corresponding models, and the resulting models may be reduced in size or may produce more accurate results.
Figure 11 is a block diagram of an example target system 104 executing the security system 106. The target system 104 includes one or more processors 302 and memory storing instructions for the processors 302. The target system 104 includes an operating system 306, e.g., the Linux operating system.
The target system 104 executes the security system 106. The security system 106 includes the feature extractor 1006, which is configured to extract feature vectors for sample files on the target system 104, e.g., files that are received, stored, or written into memory or on a disk drive on the target system 104. The feature extractor 1006 extracts a feature vector based on the type of file for each file that is scanned by the security system 106.
The security system 106 uses the maliciousness models 1010 and the threat actor models 1012 that are determined by the backend server 102. For example, the backend server 102 can transmit the maliciousness models 1010 and threat actor models 1012 to the target system 104. The backend server 102 can transmit the models (or updates to the models), e.g., at regular time intervals, or every time the models are updated, e.g., in response to receiving new training data, re-labelling the training data, or software updates to the machine learning model builder 1008.
In operation, the security system 106 loads a sample file into memory, uses the feature extractor 1006 to extract a feature vector based on the file type of the sample file, and then supplies the feature vector to the corresponding maliciousness model for the file type from the maliciousness models 1010, which results in a confidence value ranging from 0 (benign) to 1 (malicious). If the confidence value is below a threshold (e.g., 0.5), then the sample file is deemed benign and the security system 106 can respond appropriately, e.g., by allowing access to the sample file.
If the confidence value is above or equal to the threshold, then the sample file is deemed malicious and the security system 106 can respond appropriately, e.g., by denying access to the sample file. The security system 106 can then supply the feature vector to each of the threat actor models 1012, resulting in a confidence value for each threat actor. If the confidence value for a given threat actor meets or exceeds a threshold (e.g., 0.5), then the security system 106 can attribute the malicious sample file to that threat actor. The security system 106 can report the threat actor, e.g., by sending a notification to the system administrator 116.
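The overall scan flow above (per-type maliciousness check, threshold comparison, then threat actor attribution) might be sketched like this; the models are stand-in callables and the actor names are hypothetical.

```python
def scan(file_type, fv, maliciousness_models, threat_actor_models, thr=0.5):
    """Return a verdict dict for one feature vector, per the flow above."""
    conf = maliciousness_models[file_type](fv)
    if conf < thr:
        return {"verdict": "benign", "confidence": conf, "actor": None}
    # Malicious: try to attribute to the highest-scoring threat actor.
    actor, best = None, thr
    for name, model in threat_actor_models.items():
        c = model(fv)
        if c >= best:
            actor, best = name, c
    return {"verdict": "malicious", "confidence": conf, "actor": actor}

result = scan("elf", [0.9],
              {"elf": lambda fv: fv[0]},
              {"APT-X": lambda fv: 0.8, "APT-Y": lambda fv: 0.2})
benign = scan("elf", [0.2], {"elf": lambda fv: fv[0]}, {})
```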
Figure 12 is a flow diagram illustrating the security system 106 processing an example sample file in operation. Figure 12 illustrates a kernel space 1202, e.g., for the Linux operating system, and a user space 1204.
A file system event occurs (1206). For example, reading a file, writing to a file, opening a file, or closing a file can trigger an event notification. In some examples, the security system 106 can use an interface at the Linux kernel that provides file system events.
In some examples, the security system 106 can use eBPF to interface with the kernel. eBPF is a technology with origins in the Linux kernel that can run sandboxed programs in an operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without requiring changes to kernel source code or loading kernel modules. By allowing execution of sandboxed programs within the operating system, application developers can run eBPF programs to add capabilities to the operating system at runtime. The operating system then guarantees safety and execution efficiency, as if natively compiled, with the aid of a Just-In-Time (JIT) compiler and verification engine.
The security system 106 is notified of the file system event (1208). In some examples, the security system 106 caches event notifications. The security system 106 can check if the event corresponds to a file that was previously scanned and has not been altered, e.g., by checking if results for the file are already stored in the cache. If the results are already stored, then the security system 106 can allow or deny access to the file based on the results. If the file is opened and then altered, then the cache entry for the file can be deleted, i.e., to indicate that a new scan of the file should be performed.
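A minimal sketch of the scan-result cache described above, with invalidation when a file is altered; the class and method names are illustrative, not from the patent.

```python
class ScanCache:
    """Cache of scan verdicts keyed by file path, per the scheme above."""

    def __init__(self):
        self._results = {}  # path -> "allow" / "deny"

    def lookup(self, path):
        # Returns the cached verdict, or None if a new scan is needed.
        return self._results.get(path)

    def store(self, path, verdict):
        self._results[path] = verdict

    def invalidate(self, path):
        # Called when the file is altered, forcing a rescan on next access.
        self._results.pop(path, None)

cache = ScanCache()
cache.store("/usr/bin/tool", "allow")
hit = cache.lookup("/usr/bin/tool")
cache.invalidate("/usr/bin/tool")
miss = cache.lookup("/usr/bin/tool")
```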
In some examples, the security system 106 determines whether the file can be excluded from a security scan based on other criteria. For example, the security system 106 can determine whether the file is in a protected file directory or originates from a protected source and therefore determine not to perform a security scan of the file.
A feature vector is extracted for the file (1210). The feature vector is extracted based on the file type of the file. The feature vector can include an ordered list of values, e.g., floating point values, extracted from various features internal to the file such as, e.g., header structures, decompressed data, and disassembled code.
The feature vector for the file is evaluated by a maliciousness model corresponding to the file type for the file (1212). The result of the evaluation is a confidence value, e.g., a value ranging from 0 (benign) to 1 (malicious). Based on the confidence level, the security system 106 determines to allow or deny access to the file.
The kernel receives a notification to allow or deny access to the file (1214). In some examples, if the file is deemed to be malicious, the security system 106 also evaluates the file with a number of threat actor models to determine whether the file can be attributed to a specific threat actor. If so, the security system 106 can report the threat actor, e.g., to the kernel or to a remote system or system administrator.
Figure 13 is a block diagram of an example feature extractor 1006. The feature extractor 1006 receives a sample file 1302 and produces a feature vector 1312.
In general, the feature extractor 1006 uses the internal structure of the sample file 1302, e.g., headers, records used by a loader to load it into memory, and records for dynamic linking with other objects. The feature extractor 1006 can use information from headers, loading information, linking information, or other file information, e.g., header contents, the number of entries in a header, memory permissions used by a loader, functions used by a dynamic linker, and the like.
The feature extractor 1006 includes a file type identifier 1304 for identifying the file type of the sample file 1302. The feature extractor 1006 can operate on a wide variety of file types, e.g., a portable executable (PE) file, an L file used by Lex, or a Mach-O file. The file type identifier 1304 examines the sample file 1302 and determines its file type, e.g., by looking at the file extension or other information internal to the sample file 1302. The feature extractor 1006 extracts different features from the sample file 1302 depending on its file type.
The feature extractor 1006 includes a disassembler 1306 for disassembling code within the sample file 1302. The disassembler 1306 may be used, for example, when the sample file 1302 includes object files or other compiled code. The feature extractor 1006 can use the disassembler 1306 to analyze machine instructions, for example, by counting the number of instructions, creating a histogram of instructions, or creating a fingerprint of the code that’s in the sample file 1302, e.g., by hashing some of the code within the sample file 1302.
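The instruction-histogram and code-fingerprint features described above might look like the following sketch; a real disassembler (e.g., Capstone) would supply the mnemonics, so a precomputed mnemonic list stands in for the disassembly here, and the tracked instruction set is an assumption.

```python
import hashlib
from collections import Counter

def code_features(mnemonics, tracked=("mov", "call", "jmp", "xor")):
    """Turn a disassembled instruction stream into feature values."""
    # Histogram slice: relative frequency of a few tracked instructions.
    hist = Counter(mnemonics)
    total = len(mnemonics) or 1
    features = [hist[m] / total for m in tracked]
    # Cheap fingerprint: one byte of a hash over the instruction stream.
    digest = hashlib.sha256(" ".join(mnemonics).encode()).digest()
    features.append(digest[0] / 255.0)
    return features

fv = code_features(["mov", "mov", "call", "xor", "jmp", "mov"])
```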
The feature extractor 1006 includes a tokenizer 1308 for scanning/tokenizing the sample file 1302. The tokenizer 1308 may be used, for example, when the sample file 1302 comprises a script file. The tokenizer 1308 can identify tokens encountered in the script and include the tokens in the feature vector 1312, e.g., by hashing one or some of the tokens. The feature extractor 1006 includes a decompressor 1310 for decompressing the sample file 1302 or compressed portions of the sample file 1302. For example, if the sample file 1302 contains a zipped file, the decompressor 1310 can unzip the file or otherwise analyze the zip file. The feature extractor 1006 can include, e.g., the compression ratio in the feature vector 1312. If the sample file 1302 includes portions that have been packed, the decompressor 1310 can unpack those portions or characterize those portions for inclusion in the feature vector 1312.
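The token-hashing and compression-ratio features described above can be sketched as follows; the bucket count and hash choice are illustrative assumptions, not the patent's.

```python
import hashlib
import zlib

def token_feature(token, buckets=1024):
    # Hash a script token into a fixed bucket, normalized to [0, 1).
    h = int.from_bytes(hashlib.sha256(token.encode()).digest()[:4], "little")
    return (h % buckets) / buckets

def compression_ratio(data):
    # Highly compressible data yields a low ratio; already-packed or
    # encrypted content compresses poorly and yields a ratio near 1.0.
    return len(zlib.compress(data)) / max(len(data), 1)

f = token_feature("curl")
ratio = compression_ratio(b"A" * 1000)
```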
The resulting feature vector 1312 can be, e.g., an array of floating point values, with each position in the array corresponding to an extracted feature and the value at that position specifying the extracted feature or, e.g., a hash value of the feature. The feature vector 1312 can be compressed, e.g., using any appropriate compression technology.
The scope of the present disclosure includes any feature or combination of features disclosed in this specification (either explicitly or implicitly), or any generalization of features disclosed, whether or not such features or generalizations mitigate any or all of the problems described in this specification. Accordingly, new claims may be formulated during prosecution of this application (or an application claiming priority to this application) to any such combination of features.
In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the appended claims.

Claims

What is claimed is:
1. A system for detecting computer security threats, the system comprising: a feature extractor configured for extracting a feature vector of a plurality of features from a sample file based on a file type of the sample file; and a security system configured for: selecting a maliciousness model from a plurality of maliciousness models based on the file type of the sample file, wherein the maliciousness model was trained on a plurality of feature vectors from training sample files of the same file type, each of the training sample files being labelled as malicious or benign; and determining that the sample file is malicious or benign based on evaluating the feature vector of the sample file with the maliciousness model.
2. The system of claim 1, wherein the security system is configured for determining that the sample file is malicious and then further evaluating the feature vector of the sample file with a plurality of threat actor models, wherein each of the threat actor models was trained on a plurality of feature vectors from training sample files labelled with a respective threat actor identifier.
3. The system of claim 2, wherein evaluating the feature vector of the sample file with a plurality of threat actor models comprises determining that a confidence value for one of the threat actor models exceeds a threshold and, in response, attributing the sample file to a threat actor associated with the one of the threat actor models.
4. The system of claim 1, wherein the security system is configured for receiving updates to the maliciousness models from a backend server configured for training the maliciousness models.
5. The system of claim 1, wherein the security system is configured for operating in a standalone mode without connecting to a backend server.
6. The system of claim 1, wherein the security system is configured for receiving file notification events from a Linux operating system and allowing or denying access to the sample file based on determining that the sample file is malicious or benign.
7. The system of claim 1, wherein the security system is configured for determining that the sample file is malicious or benign by determining whether a confidence value produced by evaluating the feature vector with the maliciousness model exceeds a threshold value.
8. The system of claim 1, wherein extracting the feature vector comprises one or more of: disassembling at least a portion of executable code within the sample file; tokenizing at least a portion of a script within the sample file; and decompressing at least a portion of compressed data within the sample file.
9. The system of claim 1, wherein extracting the feature vector comprises counting consecutive identical data values stored within the sample file.
10. The system of claim 1, wherein extracting the feature vector comprises hashing one or more portions of the sample file and storing the resulting value in the feature vector.
11. A method for detecting computer security threats, the method comprising: extracting a feature vector of a plurality of features from a sample file based on a file type of the sample file; selecting a maliciousness model from a plurality of maliciousness models based on the file type of the sample file, wherein the maliciousness model was trained on a plurality of feature vectors from training sample files of the same file type, each of the training sample files being labelled as malicious or benign; and determining that the sample file is malicious or benign based on evaluating the feature vector of the sample file with the maliciousness model.
12. The method of claim 11, comprising determining that the sample file is malicious and then further evaluating the feature vector of the sample file with a plurality of threat actor models, wherein each of the threat actor models was trained on a plurality of feature vectors from training sample files labelled with a respective threat actor identifier.
13. The method of claim 12, wherein evaluating the feature vector of the sample file with a plurality of threat actor models comprises determining that a confidence value for one of the threat actor models exceeds a threshold and, in response, attributing the sample file to a threat actor associated with the one of the threat actor models.
14. The method of claim 11, comprising receiving updates to the maliciousness models from a backend server configured for training the maliciousness models.
15. The method of claim 11, comprising operating in a standalone mode without connecting to a backend server.
16. The method of claim 11, comprising receiving file notification events from a Linux operating system and allowing or denying access to the sample file based on determining that the sample file is malicious or benign.
17. The method of claim 11, comprising determining that the sample file is malicious or benign by determining whether a confidence value produced by evaluating the feature vector with the maliciousness model exceeds a threshold value.
18. The method of claim 11, wherein extracting the feature vector comprises one or more of: disassembling at least a portion of executable code within the sample file; tokenizing at least a portion of a script within the sample file; and decompressing at least a portion of compressed data within the sample file.
19. The method of claim 11, wherein extracting the feature vector comprises counting consecutive identical data values stored within the sample file.
20. The method of claim 11, wherein extracting the feature vector comprises hashing one or more portions of the sample file and storing the resulting value in the feature vector.
21. A backend server for building and distributing machine learning models for detecting computer security threats, the backend server comprising: one or more processors and memory storing executable instructions for the processors; a feature extractor, implemented on the one or more processors, and configured for extracting, from each sample file of a plurality of training sample files being labelled as malicious or benign, a feature vector of a plurality of features from the sample file based on a file type of the sample file; and a machine learning model builder, implemented on the one or more processors, and configured for building a plurality of maliciousness models, one for each file type of a plurality of file types, including training the feature vectors for each file type against the malicious or benign labels.
22. The backend server of claim 21, wherein the backend server is configured for distributing the maliciousness models to a plurality of deployed security systems configured for analyzing live sample files using the maliciousness models.
23. The backend server of claim 21, wherein at least some of the training sample files are labelled as belonging to a threat actor from a plurality of threat actors, and wherein the machine learning model builder is configured for building a plurality of threat actor models, one for each threat actor of the plurality of threat actors, including training the threat actor models with feature vectors for sample files labelled with threat actors.
24. The backend server of claim 21, comprising a model evaluator configured for evaluating the maliciousness models by testing sample files against the maliciousness models and identifying samples for evaluation as those samples having a confidence value within a specified range of confidence values.
25. The backend server of claim 21, wherein the machine learning model builder is implemented using gradient boosted trees.
26. A method for building and distributing machine learning models for detecting computer security threats, the method comprising: extracting, from each sample file of a plurality of training sample files being labelled as malicious or benign, a feature vector of a plurality of features from the sample file based on a file type of the sample file; and building a plurality of maliciousness models, one for each file type of a plurality of file types, including training the feature vectors for each file type against the malicious or benign labels.
27. The method of claim 26, comprising distributing the maliciousness models to a plurality of deployed security systems configured for analyzing live sample files using the maliciousness models.
28. The method of claim 26, wherein at least some of the training sample files are labelled as belonging to a threat actor from a plurality of threat actors, and wherein the method comprises building a plurality of threat actor models, one for each threat actor of a plurality of threat actors, including training the threat actor models with feature vectors for sample files labelled with threat actors.
29. The method of claim 26, comprising evaluating the maliciousness models by testing sample files against the maliciousness models and identifying samples for evaluation as those samples having a confidence value within a specified range of confidence values.
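Claims 24 and 29 describe triaging samples whose confidence value falls within a specified range, i.e. samples where the model is least certain and human or further automated evaluation adds the most value. A hedged sketch of that selection step, where the `score` method and the band boundaries are hypothetical names, not an API from the patent:

```python
def select_for_review(samples, model, low=0.4, high=0.6):
    """Return the samples whose maliciousness confidence lies in an
    uncertain band [low, high], flagging them for further evaluation.

    `samples` is an iterable of (sample_id, feature_vector) pairs;
    `model` is assumed to expose a score(feature_vector) method
    returning a confidence in [0, 1]. Both assumptions are illustrative.
    """
    flagged = []
    for sample_id, feature_vector in samples:
        score = model.score(feature_vector)  # hypothetical scoring API
        if low <= score <= high:
            flagged.append((sample_id, score))
    return flagged
```

Samples scoring near 0 or 1 are confidently benign or malicious and need no review; concentrating evaluation on the ambiguous band is a common active-learning-style use of model confidence.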
30. The method of claim 26, wherein building the maliciousness models comprises using gradient boosted trees.
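Claims 26 and 30 describe grouping labelled feature vectors by file type and training one maliciousness model per type, with gradient boosted trees named as the learner. The sketch below shows the per-file-type grouping logic; the trivial threshold "model" stands in for a gradient-boosted-tree ensemble (e.g. scikit-learn's `GradientBoostingClassifier` or XGBoost, which are assumptions, not named in the patent), and it assumes each file type has at least one sample of each class.

```python
from collections import defaultdict
from statistics import mean

def build_models(training_samples):
    """Group labelled feature vectors by file type and fit one model
    per type, mirroring the per-file-type training in claim 26.

    `training_samples` is an iterable of (file_type, feature_vector,
    label) tuples with label 1 = malicious, 0 = benign.
    """
    by_type = defaultdict(list)
    for file_type, fv, label in training_samples:
        by_type[file_type].append((fv, label))

    models = {}
    for file_type, rows in by_type.items():
        # Placeholder learner: midpoint threshold between the class means.
        # In practice this would be a gradient-boosted-tree fit per type.
        mal = [mean(fv) for fv, y in rows if y == 1]
        ben = [mean(fv) for fv, y in rows if y == 0]
        threshold = (mean(mal) + mean(ben)) / 2
        models[file_type] = lambda fv, t=threshold: int(mean(fv) >= t)
    return models
```

Training separate models per file type lets each model learn type-specific structure (an ELF header layout versus a script's token statistics) instead of forcing one model to straddle all formats.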
31. A system for detecting and identifying computer security threats on Linux and its derivatives, the system comprising: a plurality of security systems, each executing on a computing system comprising one or more processors and operating on the Linux operating system, wherein each security system is configured for using one or more machine-learning models to detect and identify security threats and for reporting security events; and
a backend server configured for receiving the security events from the plurality of security systems and using the security events to produce updated security intelligence.
32. The system of claim 31, wherein each security system is configured to execute on bare metal or a virtual instance or a blade.

33. The system of claim 31, wherein each security system is configured to operate in a stand-alone/self-contained mode to detect security threats without access to the backend server.

34. The system of claim 31, wherein the backend server is configured to disseminate the updated security intelligence to the plurality of security systems.

35. The system of claim 31, wherein at least one of the security systems is executing on a server.

36. The system of claim 31, wherein at least one of the security systems is executing on a router or a switch.

37. The system of claim 31, wherein at least one of the security systems comprises a maliciousness model and is configured to use the maliciousness model to inspect an object and classify the object as benign or malicious.

38. The system of claim 31, wherein the backend server is configured for generating security signatures based on the security events and disseminating the security signatures to the plurality of security systems.
39. A system for detecting and identifying computer security threats on Linux and its derivatives, the system comprising: a plurality of embedded system devices, each comprising at least one processor and software configured to perform a dedicated function; a plurality of security systems, each executing on a respective embedded system device, wherein each security system is configured for using one or more machine-learning models to detect and identify security threats and for reporting security events; and
a backend server configured for receiving the security events from the plurality of security systems and using the security events to produce updated security intelligence.

40. The system of claim 39, wherein at least one of the embedded system devices comprises a vehicle computing system.

41. The system of claim 40, wherein the vehicle computing system comprises a first security system for one or more networked systems and a second security system for one or more off-network systems.

42. The system of claim 41, wherein the first security system is configured for reporting security events to the backend server and receiving updated intelligence from the backend server.

43. The system of claim 41, wherein the second security system is configured for operating in a standalone mode without connecting to the backend server.

44. The system of claim 39, wherein at least one of the embedded system devices comprises a networking device.

45. The system of claim 44, wherein the networking device executes the Linux operating system and the security system for the networking device is configured to operate in the execution/storage path of the Linux operating system.

46. The system of claim 39, wherein at least one of the embedded system devices comprises an industrial system.

47. The system of claim 46, wherein the industrial system comprises at least a first robotic system controlled by a computer system running Linux.

48. The system of claim 46, wherein the industrial system comprises a first security system for one or more networked systems and a second security system for one or more off-network systems.
PCT/US2022/051538 2021-12-01 2022-12-01 Computer security systems and methods using machine learning models WO2023102117A1 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US202163285064P 2021-12-01 2021-12-01
US63/285,064 2021-12-01
US202263313010P 2022-02-23 2022-02-23
US63/313,010 2022-02-23
US202263316193P 2022-03-03 2022-03-03
US63/316,193 2022-03-03
US202263323105P 2022-03-24 2022-03-24
US63/323,105 2022-03-24

Publications (1)

Publication Number Publication Date
WO2023102117A1 true WO2023102117A1 (en) 2023-06-08

Family

ID=86612964

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/051538 WO2023102117A1 (en) 2021-12-01 2022-12-01 Computer security systems and methods using machine learning models

Country Status (1)

Country Link
WO (1) WO2023102117A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9197665B1 (en) * 2014-10-31 2015-11-24 Cyberpoint International Llc Similarity search and malware prioritization
US20160307094A1 (en) * 2015-04-16 2016-10-20 Cylance Inc. Recurrent Neural Networks for Malware Analysis
WO2016186902A1 (en) * 2015-05-20 2016-11-24 Alibaba Group Holding Limited Detecting malicious files
US20170353481A1 (en) * 2016-06-06 2017-12-07 Samsung Electronics Co., Ltd. Malware detection by exploiting malware re-composition variations using feature evolutions and confusions
WO2019071126A1 (en) * 2017-10-06 2019-04-11 Stealthpath, Inc. Methods for internet communication security
WO2020199163A1 (en) * 2019-04-03 2020-10-08 Citrix Systems, Inc. Systems and methods for protecting remotely hosted application from malicious attacks


Similar Documents

Publication Publication Date Title
Huda et al. Securing the operations in SCADA-IoT platform based industrial control system using ensemble of deep belief networks
US10169585B1 (en) System and methods for advanced malware detection through placement of transition events
US9734332B2 (en) Behavior profiling for malware detection
US11100214B2 (en) Security enhancement method and electronic device therefor
US8782792B1 (en) Systems and methods for detecting malware on mobile platforms
EP3117361B1 (en) Behavioral analysis for securing peripheral devices
US9438613B1 (en) Dynamic content activation for automated analysis of embedded objects
CN106133741B (en) For scanning the system and method for being packaged program in response to detection suspicious actions
US10505960B2 (en) Malware detection by exploiting malware re-composition variations using feature evolutions and confusions
CN106133743B (en) System and method for optimizing the scanning of pre-installation application program
US10133865B1 (en) Systems and methods for detecting malware
US20220279012A1 (en) Methods and apparatus to identify and report cloud-based security vulnerabilities
US9690598B2 (en) Remotely establishing device platform integrity
US8474040B2 (en) Environmental imaging
US9444829B1 (en) Systems and methods for protecting computing resources based on logical data models
US20200125728A1 (en) Data-driven identification of malicious files using machine learning and an ensemble of malware detection procedures
US8869284B1 (en) Systems and methods for evaluating application trustworthiness
CN113961919B (en) Malicious software detection method and device
US9652615B1 (en) Systems and methods for analyzing suspected malware
CN104732145A (en) Parasitic course detection method and device in virtual machine
US11042637B1 (en) Measuring code sharing of software modules based on fingerprinting of assembly code
US10678917B1 (en) Systems and methods for evaluating unfamiliar executables
US11609985B1 (en) Analyzing scripts to create and enforce security policies in dynamic development pipelines
US10536471B1 (en) Malware detection in virtual machines
Uchenna et al. Malware threat analysis techniques and approaches for IoT applications: A review

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22902179

Country of ref document: EP

Kind code of ref document: A1