JP2022541250A

JP2022541250A - Inline malware detection

Info

Publication number: JP2022541250A
Application number: JP2022502913A
Authority: JP
Inventors: ヒューレット，ウィリアム，レディントン; デン，スイシャン; ヤン，シェン; ラム，ホ，ユ
Original assignee: Palo Alto Networks Inc
Current assignee: Palo Alto Networks Inc
Priority date: 2019-07-19
Filing date: 2020-07-06
Publication date: 2022-09-22
Anticipated expiration: 2040-07-06
Also published as: CN114072798A; WO2021015941A1; JP7411775B2; JP2024023875A; KR20220053549A; EP3999985A4; EP3999985A1

Abstract

Detection of malicious files is disclosed. A set comprising one or more sample classification models is stored on a network device. N-gram analysis is performed on a sequence of received packets associated with a received file. Performing the n-gram analysis includes using at least one stored sample classification model. A determination is made that the received file is malicious based at least in part on the n-gram analysis of the sequence of received packets. In response to determining that the file is malicious, propagation of the received file is prevented.

Description

Malware is a general term referring to malicious software (e.g., including a variety of hostile, intrusive, and/or otherwise unwanted software). Malware can be in the form of code, scripts, active content, and/or other software. Example uses of malware include disrupting computer and/or network operations, stealing proprietary information (e.g., confidential information, such as identity, financial, and/or intellectual property related information), and/or gaining access to private/proprietary computer systems and/or computer networks. Unfortunately, as techniques are developed to help detect and mitigate malware, malware authors find ways to circumvent such efforts. Accordingly, there is an ongoing need for improved techniques for identifying and mitigating malware.

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
FIG. 1 illustrates an example of an environment in which malicious applications are detected and prevented from causing harm. FIG. 2A illustrates an embodiment of a data appliance. FIG. 2B is a functional diagram of the logical components of an embodiment of a data appliance. FIG. 3 illustrates an example of logical components that can be included in a system for analyzing samples. FIG. 4 illustrates portions of an example embodiment of a threat engine. FIG. 5 illustrates an example of a portion of a tree. FIG. 6 illustrates an example of a process for performing inline malware detection on a data appliance. FIG. 7A illustrates an example hash table for a file. FIG. 7B illustrates an example threat signature for a sample. FIG. 8A illustrates an example of a process for performing feature extraction. FIG. 8B illustrates an example of a process for generating a model.

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, are referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or as a specific component that is manufactured to perform the task. As used herein, the term "processor" refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

I. Overview

A firewall generally protects networks from unauthorized access while permitting authorized communications to pass through the firewall. A firewall is typically a device, a set of devices, or software executed on a device that provides a firewall function for network access. For example, a firewall can be integrated into the operating system of a device (e.g., a computer, smart phone, or other type of network communication capable device). A firewall can also be integrated into or executed as a software application on various types of devices or security devices, such as computer servers, gateways, network/routing devices (e.g., network routers), or data appliances (e.g., security appliances or other types of special purpose devices), and in some implementations, certain operations can be implemented in special purpose hardware, such as an ASIC or FPGA.

Firewalls typically deny or permit network transmission based on a set of rules. These sets of rules are often referred to as policies (e.g., network policies or network security policies). For example, a firewall can filter inbound traffic by applying a set of rules or policies to prevent unwanted outside traffic from reaching protected devices. A firewall can also filter outbound traffic by applying a set of rules or policies. Firewall rules or policies can specify actions such as allow, block, monitor, notify, or log, and/or other actions, which can be triggered based on various criteria, such as are described herein. A firewall can also filter local network (e.g., intranet) traffic by similarly applying a set of rules or policies.

Security devices (e.g., security appliances, security gateways, security services, and/or other security devices) can perform various security operations (e.g., firewall, anti-malware, intrusion prevention/detection, proxy, and/or other security functions), networking functions (e.g., routing, Quality of Service (QoS), workload balancing of network related resources, and/or other networking functions), and/or other security and/or networking related operations. For example, routing can be performed based on source information (e.g., IP address and port), destination information (e.g., IP address and port), and protocol information.

A basic packet filtering firewall filters network communication traffic by inspecting individual packets transmitted over a network (e.g., packet filtering firewalls or first generation firewalls, which are stateless packet filtering firewalls). Stateless packet filtering firewalls typically inspect the individual packets themselves and apply rules based on the inspected packets (e.g., using a combination of a packet's source and destination address information, protocol information, and a port number).
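The stateless filtering described above can be illustrated with a minimal Python sketch. It is not taken from the specification: the `Packet` and `Rule` types, the first-match-wins ordering, and the default-block behavior are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Packet:
    """Hypothetical header fields a stateless filter would inspect."""
    src_ip: str
    dst_ip: str
    protocol: str
    dst_port: int


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule; None for protocol or port means 'match any'."""
    action: str  # "allow" or "block"
    src_net: str = "0.0.0.0/0"
    dst_net: str = "0.0.0.0/0"
    protocol: Optional[str] = None
    dst_port: Optional[int] = None

    def matches(self, pkt: Packet) -> bool:
        return (
            ip_address(pkt.src_ip) in ip_network(self.src_net)
            and ip_address(pkt.dst_ip) in ip_network(self.dst_net)
            and self.protocol in (None, pkt.protocol)
            and self.dst_port in (None, pkt.dst_port)
        )


def filter_packet(pkt: Packet, rules: list, default: str = "block") -> str:
    """Return the action of the first matching rule (assumed ordering)."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action
    return default


rules = [Rule("allow", dst_net="10.0.0.0/8", protocol="tcp", dst_port=443)]
https_packet = Packet("198.51.100.7", "10.1.2.3", "tcp", 443)
telnet_packet = Packet("198.51.100.7", "10.1.2.3", "tcp", 23)
```

Each packet is judged only on its own header fields; no memory of earlier packets is kept, which is what distinguishes this approach from the stateful inspection discussed later.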

Application firewalls can also perform application layer filtering (e.g., using application layer filtering firewalls or second generation firewalls, which work on the application level of the TCP/IP stack). Application layer filtering firewalls or application firewalls can generally identify certain applications and protocols (e.g., web browsing using HyperText Transfer Protocol (HTTP), a Domain Name System (DNS) request, a file transfer using File Transfer Protocol (FTP), and various other types of applications and other protocols, such as Telnet, DHCP, TCP, UDP, and TFTP (GSS)). For example, application firewalls can block unauthorized protocols that attempt to communicate over a standard port (e.g., an unauthorized/out-of-policy protocol that attempts to sneak through by using a non-standard port for that protocol can generally be identified using application firewalls).

Stateful firewalls can also perform stateful packet inspection, in which each packet is examined within the context of the series of packets associated with that network transmission's packet flow. This firewall technique is generally referred to as stateful packet inspection because it maintains records of all connections passing through the firewall and is able to determine whether a packet is the start of a new connection, a part of an existing connection, or an invalid packet. For example, the state of a connection can itself be one of the criteria that triggers a rule within a policy.
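The three-way decision described above (new connection, existing connection, or invalid packet) can be sketched as follows. This is a simplified illustration, not the disclosed implementation: the `ConnectionTracker` class, the use of a SYN flag as the sole marker of a new connection, and the absence of timeouts or teardown handling are assumptions.

```python
from enum import Enum


class Verdict(Enum):
    NEW = "new"
    ESTABLISHED = "established"
    INVALID = "invalid"


class ConnectionTracker:
    """Hypothetical record of connections seen by the firewall."""

    def __init__(self):
        self._connections = set()

    @staticmethod
    def _key(src, sport, dst, dport):
        # Direction-independent key so replies match the original connection.
        return frozenset({(src, sport), (dst, dport)})

    def classify(self, src, sport, dst, dport, syn=False):
        key = self._key(src, sport, dst, dport)
        if key in self._connections:
            return Verdict.ESTABLISHED
        if syn:
            # Assumption: only a SYN may open a connection.
            self._connections.add(key)
            return Verdict.NEW
        return Verdict.INVALID


tracker = ConnectionTracker()
first = tracker.classify("10.0.0.5", 40000, "203.0.113.9", 443, syn=True)
reply = tracker.classify("203.0.113.9", 443, "10.0.0.5", 40000)
stray = tracker.classify("192.0.2.1", 1234, "10.0.0.5", 22)
```

The returned verdict is the kind of connection state that a policy rule could use as one of its trigger criteria.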

Advanced or next generation firewalls can perform stateless and stateful packet filtering and application layer filtering as discussed above. Next generation firewalls can also perform additional firewall techniques. For example, certain newer firewalls, sometimes referred to as advanced or next generation firewalls, can also identify users and content. In particular, certain next generation firewalls are expanding the list of applications that these firewalls can automatically identify to thousands of applications. Examples of such next generation firewalls are commercially available from Palo Alto Networks, Inc. (e.g., Palo Alto Networks' PA Series firewalls). For example, Palo Alto Networks' next generation firewalls enable enterprises and service providers to identify and control applications, users, and content (not just ports, IP addresses, and packets) using various identification technologies, such as the following: App-ID for accurate application identification, User-ID for user identification, and Content-ID for real-time content scanning (e.g., controlling web surfing and limiting data and file transfers). These identification technologies allow enterprises to securely enable application usage using business-relevant concepts, instead of following the traditional approach offered by traditional port-blocking firewalls. Also, special purpose hardware for next generation firewalls (implemented, for example, as dedicated appliances) generally provides higher performance levels for application inspection than software executed on general purpose hardware (e.g., security appliances provided by Palo Alto Networks, Inc., which use dedicated, function-specific processing that is tightly integrated with a single-pass software engine to maximize network throughput while minimizing latency for Palo Alto Networks' PA Series next generation firewalls).

Advanced or next generation firewalls can also be implemented using virtualized firewalls. Examples of such next generation firewalls are commercially available from Palo Alto Networks, Inc. (e.g., Palo Alto Networks' firewalls, which support various commercial virtualized environments, including VMware® ESXi™ and NSX™, Citrix® Netscaler SDX™, KVM/OpenStack (Centos/RHEL, Ubuntu®), and Amazon Web Services (AWS)). For example, virtualized firewalls can support similar or the exact same next generation firewall and advanced threat prevention features available in physical form factor appliances, allowing enterprises to safely enable applications flowing into and across their private, public, and hybrid cloud computing environments. Automation features such as VM monitoring, dynamic address groups, and a REST-based API allow enterprises to proactively monitor VM changes, dynamically feeding that context into security policies and thereby eliminating the policy lag that may occur when VMs change.

II. Example Environment

FIG. 1 illustrates an example of an environment in which malicious applications ("malware") are detected and prevented from causing harm. As will be described in more detail below, malware classifications (e.g., as made by security platform 122) can be variously shared and/or refined among the various entities included in the environment shown in FIG. 1, and, using the techniques described herein, devices such as endpoint client devices 104-110 can be protected from such malware.

The term "application" is used throughout this specification to collectively refer to programs, bundles of programs, manifests, packages, etc., irrespective of form/platform. An "application" (also referred to herein as a "sample") can be a standalone file (e.g., a calculator application having the filename "calculator.apk" or "calculator.exe") and can also be an independent component of another application (e.g., a mobile advertisement SDK or library embedded within the calculator application).

"Malware" as used herein refers to an application that engages in behaviors, whether clandestinely or not (and whether illegal or not), of which a user does not approve/would not approve if fully informed. Examples of malware include Trojans, viruses, rootkits, spyware, hacking tools, keyloggers, etc. One example of malware is a desktop application that collects and reports to a remote server the end user's location (but does not provide the user with location-based services, such as a mapping service). Another example of malware is a malicious Android Application Package .apk (APK) file that appears to an end user to be a free game, but stealthily sends SMS premium messages (e.g., costing $10 each), running up the end user's phone bill. Another example of malware is an Apple iOS flashlight application that stealthily collects the user's contacts and sends those contacts to a spammer. Other forms of malware (e.g., ransomware) can also be detected/thwarted using the techniques described herein. Further, while n-grams/feature vectors/output accumulation variables are described herein as being generated for malicious applications, the techniques described herein can also be used in various embodiments to generate profiles for other kinds of applications (e.g., adware profiles, goodware profiles, etc.).

The techniques described herein can be used in conjunction with a variety of platforms (e.g., desktops, mobile devices, gaming platforms, embedded systems, etc.) and/or a variety of types of applications (e.g., Android .apk files, iOS applications, Windows PE files, Adobe Acrobat PDF files, etc.). In the example environment shown in FIG. 1, client devices 104-108 are a laptop computer, a desktop computer, and a tablet (respectively) present in an enterprise network 140. Client device 110 is a laptop computer present outside of enterprise network 140.

Data appliance 102 is configured to enforce policies regarding communications between client devices, such as client devices 104 and 106, and nodes outside of enterprise network 140 (e.g., reachable via external network 118). Examples of such policies include ones governing traffic shaping, quality of service, and routing of traffic. Other examples of policies include security policies, such as ones requiring the scanning for threats in incoming (and/or outgoing) email attachments, website content, files exchanged through instant messaging programs, and/or other file transfers. In some embodiments, data appliance 102 is also configured to enforce policies with respect to traffic that stays within enterprise network 140.

An embodiment of a data appliance is shown in FIG. 2A. The example shown is a representation of physical components that are included in data appliance 102, in various embodiments. Specifically, data appliance 102 includes a high performance multi-core central processing unit (CPU) 202 and random access memory (RAM) 204. Data appliance 102 also includes storage 210 (such as one or more hard disks or solid state storage units). In various embodiments, data appliance 102 stores (whether in RAM 204, storage 210, and/or other appropriate locations) information used in monitoring enterprise network 140 and implementing the disclosed techniques. Examples of such information include application identifiers, content identifiers, user identifiers, requested URLs, IP address mappings, policy and other configuration information, signatures, hostname/URL categorization information, malware profiles, and machine learning models. Data appliance 102 can also include one or more optional hardware accelerators. For example, data appliance 102 can include a cryptographic engine 206 configured to perform encryption and decryption operations, and one or more field programmable gate arrays (FPGAs) 208 configured to perform matching, act as network processors, and/or perform other tasks.

Functionality described herein as being performed by data appliance 102 can be provided/implemented in a variety of ways. For example, data appliance 102 can be a dedicated device or set of devices. The functionality provided by data appliance 102 can also be integrated into or executed as software on a general purpose computer, a computer server, a gateway, and/or a network/routing device. In some embodiments, at least some services described as being provided by data appliance 102 are instead (or in addition) provided to a client device (e.g., client device 104 or client device 110) by software executing on the client device.

Whenever data appliance 102 is described as performing a task, a single component, a subset of components, or all components of data appliance 102 may cooperate to perform the task. Similarly, whenever a component of data appliance 102 is described as performing a task, a subcomponent may perform the task and/or the component may perform the task in conjunction with other components. In various embodiments, portions of data appliance 102 are provided by one or more third parties. Depending on factors such as the amount of computing resources available to data appliance 102, various logical components and/or features of data appliance 102 may be omitted, and the techniques described herein adapted accordingly. Similarly, additional logical components/features can be included in embodiments of data appliance 102 as applicable. One example of a component included in data appliance 102 in various embodiments is an application identification engine which is configured to identify an application (e.g., using various application signatures for identifying applications based on packet flow analysis). For example, the application identification engine can determine what type of traffic a session involves, such as Web Browsing - Social Networking; Web Browsing - News; SSH; and so on.

FIG. 2B is a functional diagram of the logical components of an embodiment of a data appliance. The example shown is a representation of logical components that can be included in data appliance 102 in various embodiments. Unless otherwise specified, various logical components of data appliance 102 are generally implementable in a variety of ways, including as a set of one or more scripts (e.g., written in Java, Python, etc., as applicable).

As shown, data appliance 102 comprises a firewall, and includes a management plane 232 and a data plane 234. The management plane is responsible for managing user interactions, such as by providing a user interface for configuring policies and viewing log data. The data plane is responsible for managing data, such as by performing packet processing and session handling.

Network processor 236 is configured to receive packets from client devices, such as client device 108, and provide them to data plane 234 for processing. Whenever flow module 238 identifies packets as being part of a new session, it creates a new session flow. Subsequent packets are identified as belonging to the session based on a flow lookup. If applicable, SSL decryption is applied by SSL decryption engine 240. Otherwise, processing by SSL decryption engine 240 is omitted. Decryption engine 240 can help data appliance 102 inspect and control SSL/TLS and SSH encrypted traffic, and thus help to stop threats that might otherwise remain hidden in encrypted traffic. Decryption engine 240 can also help prevent sensitive content from leaving enterprise network 140. Decryption can be controlled (e.g., enabled or disabled) selectively based on parameters such as URL category, traffic source, traffic destination, user, user group, and port. In addition to decryption policies (e.g., that specify which sessions to decrypt), decryption profiles can be assigned to control various options for sessions controlled by the policy. For example, the use of specific cipher suites and encryption protocol versions can be required.
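The flow lookup described above can be pictured as a table keyed by a packet's addressing fields. The following Python sketch is illustrative only: the `FlowModule` class name, the 5-tuple key, and the integer session identifiers are assumptions, and the specification does not state how flow module 238 stores its sessions.

```python
from itertools import count


class FlowModule:
    """Hypothetical flow table mapping a 5-tuple to a session identifier."""

    def __init__(self):
        self._sessions = {}
        self._next_id = count(1)

    def lookup_or_create(self, five_tuple):
        """Return (session_id, created) for the packet's 5-tuple."""
        session_id = self._sessions.get(five_tuple)
        created = session_id is None
        if created:
            # First packet of a session: create a new session flow.
            session_id = next(self._next_id)
            self._sessions[five_tuple] = session_id
        return session_id, created


flow_module = FlowModule()
flow_a = ("10.0.0.8", 51000, "203.0.113.9", 443, "tcp")
flow_b = ("10.0.0.8", 51001, "203.0.113.9", 80, "tcp")
```

A real data plane would also normalize direction and expire idle sessions; both are omitted here to keep the lookup itself visible.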

Application identification (APP-ID) engine 242 is configured to determine what type of traffic a session involves. As one example, application identification engine 242 can recognize a GET request in received data and conclude that the session requires an HTTP decoder. In some cases, e.g., a web browsing session, the identified application can change, and such changes will be noted by data appliance 102. For example, a user may initially browse to a corporate Wiki (classified based on the URL visited as "Web Browsing - Productivity") and then subsequently browse to a social networking site (classified based on the URL visited as "Web Browsing - Social Networking"). Each type of protocol has a corresponding decoder.
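As a rough illustration of the GET-request example above, the following Python sketch picks a decoder from the first bytes of a session's payload. The `select_decoder` function and the decoder names are hypothetical, and matching only on a leading HTTP method or SSH identification string is a deliberate simplification of signature-based application identification.

```python
# Request-line prefixes for common HTTP methods (illustrative subset).
HTTP_METHODS = (b"GET ", b"POST ", b"HEAD ", b"PUT ", b"DELETE ", b"OPTIONS ")


def select_decoder(payload: bytes) -> str:
    """Guess which decoder a session needs from its initial payload."""
    if payload.startswith(HTTP_METHODS):
        return "http"
    if payload.startswith(b"SSH-"):
        # SSH peers open with an identification string such as "SSH-2.0-...".
        return "ssh"
    return "unknown"


http_guess = select_decoder(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n")
ssh_guess = select_decoder(b"SSH-2.0-OpenSSH_8.9\r\n")
other_guess = select_decoder(b"\x16\x03\x01\x02\x00")
```

Because the identified application can change during a session, a fuller version would re-run identification as more data arrives rather than deciding once.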

Based on the determination made by application identification engine 242, threat engine 244 sends the packets to an appropriate decoder configured to assemble packets (which may be received out of order) into the correct order, perform tokenization, and extract information. Threat engine 244 also performs signature matching to determine what should happen to the packet. As needed, SSL encryption engine 246 can re-encrypt decrypted data. Packets are forwarded (e.g., to a destination) using forwarding module 248.

As also shown in FIG. 2B, policies 252 are received and stored in management plane 232. Policies can include one or more rules, which can be specified using domain names and/or host/server names, and rules can apply one or more signatures or other matching criteria or heuristics, such as for security policy enforcement for subscriber/IP flows, based on various parameters/information extracted from monitored session traffic flows. An interface (I/F) communicator 250 is provided for management communications (e.g., via (REST) APIs, messages, network protocol communications, or other communication mechanisms).

III. Security Platform

Returning to FIG. 1, suppose a malicious individual (using system 120) has created malware 130. The malicious individual hopes that a client device, such as client device 104, will execute a copy of malware 130, compromising the client device and, for example, causing it to become a bot in a botnet. The compromised client device can then be instructed to perform tasks (e.g., mining cryptocurrency or participating in denial of service attacks), to report information to an external entity, such as command and control (C&C) server 150, and to receive instructions from C&C server 150, as applicable.

Suppose data appliance 102 has intercepted an email sent to a user, "Alice," who operates client device 104. A copy of malware 130 has been attached to the message by system 120. As an alternate but similar scenario, data appliance 102 could intercept an attempted download of malware 130 by client device 104 (e.g., from a website). In either scenario, data appliance 102 determines whether a signature for the file (e.g., the email attachment or the website download of malware 130) is present on data appliance 102. A signature, if present, can indicate that the file is known to be safe (e.g., is whitelisted), and can also indicate that the file is known to be malicious (e.g., is blacklisted).

In various embodiments, data appliance 102 is configured to work in cooperation with security platform 122. As one example, security platform 122 can provide data appliance 102 with a set of signatures of known-malicious files (e.g., as part of a subscription). If a signature for malware 130 (e.g., an MD5 hash of malware 130) is included in the set, data appliance 102 can prevent the transmission of malware 130 to client device 104 accordingly (e.g., by detecting that the MD5 hash of the email attachment being sent to client device 104 matches the MD5 hash of malware 130). Security platform 122 can also provide data appliance 102 with a list of known malicious domains and/or IP addresses, allowing data appliance 102 to block traffic between enterprise network 140 and C&C server 150 (e.g., where C&C server 150 is known to be malicious). The list of malicious domains (and/or IP addresses) can also help data appliance 102 determine when one of its nodes has been compromised. For example, if client device 104 attempts to contact C&C server 150, such an attempt is a strong indicator that client device 104 has been compromised by malware (and that remedial action should accordingly be taken, such as quarantining client device 104 from communicating with other nodes within enterprise network 140). As will be described in more detail below, security platform 122 can also provide other types of information to data appliance 102 (e.g., as part of a subscription), such as a set of machine learning models usable by data appliance 102 to perform inline analysis of files.
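The hash-based signature check described above can be sketched as follows. The set contents and the function name `lookup_signature` are hypothetical placeholders; in the description, the signature set would be supplied by security platform 122 rather than computed locally.

```python
import hashlib

# Hypothetical signature sets; placeholder byte strings stand in for real files.
KNOWN_MALICIOUS_MD5 = {hashlib.md5(b"contents of malware 130").hexdigest()}
KNOWN_SAFE_MD5 = {hashlib.md5(b"contents of a known-good file").hexdigest()}


def lookup_signature(file_bytes: bytes) -> str:
    """Classify a file by comparing its MD5 hash against the stored sets."""
    digest = hashlib.md5(file_bytes).hexdigest()
    if digest in KNOWN_MALICIOUS_MD5:
        return "malicious"  # blacklisted: transmission can be prevented
    if digest in KNOWN_SAFE_MD5:
        return "safe"       # whitelisted
    return "unknown"        # no signature present on the appliance
```

An exact-hash lookup only recognizes byte-identical files, which is why a subtly modified copy of known malware yields "unknown" and motivates the additional analysis discussed later in the description.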

In various embodiments, if no signature for an attachment is found, data appliance 102 can take a variety of actions. As a first example, data appliance 102 can fail safe, by blocking transmission of any attachment that is not whitelisted as benign (e.g., that does not match a signature of a known-good file). A drawback of this approach is that many legitimate attachments may be unnecessarily blocked as potential malware when they are in fact benign. As a second example, data appliance 102 can fail danger, by allowing transmission of any attachment that is not blacklisted as malicious (e.g., that does not match a signature of a known-malicious file). A drawback of this approach is that newly created malware (previously unseen by platform 122) will not be prevented from causing harm.

As a third example, data appliance 102 can be configured to provide the file (e.g., malware 130) to security platform 122 for static/dynamic analysis, to determine whether it is malicious and/or to otherwise classify it. While security platform 122 analyzes the attachment (for which a signature does not already exist), data appliance 102 can take a variety of actions. As a first example, data appliance 102 can prevent the email (and attachment) from being delivered to Alice until a response is received from security platform 122. Assuming platform 122 takes approximately 15 minutes to thoroughly analyze a sample, this means that the incoming message to Alice will be delayed by 15 minutes. Since, in this example, the attachment is malicious, such a delay will not impact Alice negatively. In an alternate example, suppose someone has sent Alice a time-sensitive message with a benign attachment for which a signature also does not exist. Delaying delivery of that message to Alice by 15 minutes will likely be viewed (e.g., by Alice) as unacceptable. As will be described in more detail below, an alternate approach is to perform at least some real-time analysis of the attachment on data appliance 102 (e.g., while awaiting a verdict from platform 122). If data appliance 102 can independently determine whether the attachment is malicious or benign, it can take an initial action (e.g., block or allow delivery to Alice), and can adjust or take additional actions, as applicable, once a verdict is received from security platform 122.

Security platform 122 stores copies of received samples in storage 142, and analysis is commenced (or scheduled, as applicable). One example of storage 142 is an Apache Hadoop cluster. Results of the analysis (and additional information pertaining to the applications) are stored in database 146. In the event an application is determined to be malicious, data appliances can be configured to automatically block the file download based on the analysis result. Further, a signature can be generated for the malware and distributed (e.g., to data appliances such as data appliances 102, 136, and 148) to automatically block future file transfer requests to download the file determined to be malicious.

In various embodiments, security platform 122 comprises one or more dedicated, commercially available hardware servers (e.g., having multi-core processor(s), 32+ gigabytes of RAM, gigabit network interface adapter(s), and hard drive(s)) running typical server-class operating systems (e.g., Linux). Security platform 122 can be implemented across a scalable infrastructure comprising multiple such servers, solid state drives, and/or other applicable high-performance hardware. Security platform 122 can comprise several distributed components, including components provided by one or more third parties. For example, portions or all of security platform 122 can be implemented using Amazon Elastic Compute Cloud (EC2) and/or Amazon Simple Storage Service (S3). Further, as with data appliance 102, whenever security platform 122 is referred to as performing a task, such as storing data or processing data, it is to be understood that a sub-component or multiple sub-components of security platform 122 (whether individually or in cooperation with third-party components) may cooperate to perform that task. As one example, security platform 122 can optionally perform static/dynamic analysis in cooperation with one or more virtual machine (VM) servers, such as VM server 124.

An example of a virtual machine server is a physical machine comprising commercially available server-class hardware (e.g., a multi-core processor, 32+ gigabytes of RAM, and one or more gigabit network interface adapters) that runs commercially available virtualization software, such as VMware ESXi, Citrix XenServer, or Microsoft Hyper-V. In some embodiments, the virtual machine server is omitted. Further, a virtual machine server may be under the control of the same entity that administers security platform 122, but may also be provided by a third party. As one example, the virtual machine server can rely on EC2, with the remainder of security platform 122 provided by dedicated hardware owned by and under the control of the operator of security platform 122. VM server 124 is configured to provide one or more virtual machines 126-128 for emulating client devices. The virtual machines can execute a variety of operating systems and/or versions thereof. Observed behaviors resulting from executing applications in the virtual machines are logged and analyzed (e.g., for indications that the application is malicious). In some embodiments, log analysis is performed by the VM server (e.g., VM server 124). In other embodiments, analysis is performed at least in part by other components of security platform 122, such as coordinator 144.

In various embodiments, security platform 122 makes the results of its analysis of samples available to data appliance 102, as part of a subscription, via a list of signatures (and/or other identifiers). For example, security platform 122 can periodically send a content package that identifies malware applications (e.g., daily, hourly, or at some other interval, and/or based on an event configured by one or more policies). An example content package includes a listing of identified malware applications, with information such as a package name, a hash value for uniquely identifying the application, and a malware name (and/or malware family name) for each identified malware application. The subscription can cover the analysis of just those files intercepted by data appliance 102 and sent to security platform 122 by data appliance 102, and can also cover signatures of all malware known to security platform 122 (or subsets thereof, such as just mobile malware but not other forms of malware (e.g., PDF malware)). As will be described in more detail below, platform 122 can also make available other types of information, such as machine learning models that can help data appliance 102 detect malware.

In various embodiments, security platform 122 is configured to provide security services to a variety of entities in addition to (or, as applicable, instead of) an operator of data appliance 102. For example, other enterprises, having their own respective enterprise networks 114 and 116, and their own respective data appliances 136 and 148, can contract with the operator of security platform 122. Other types of entities can also make use of the services of security platform 122. For example, an Internet Service Provider (ISP) providing Internet service to client device 110 can contract with security platform 122 to analyze applications which client device 110 attempts to download. As another example, the owner of client device 110 can install software on client device 110 that communicates with security platform 122 (e.g., to receive content packages from security platform 122, use the received content packages to check attachments in accordance with the techniques described herein, and transmit applications to security platform 122 for analysis).

IV. Analyzing Samples Using Static/Dynamic Analysis

FIG. 3 illustrates an example of logical components that can be included in a system for analyzing samples. Analysis system 300 can be implemented using a single device. For example, the functionality of analysis system 300 can be implemented in a malware analysis module 112 incorporated into data appliance 102. Analysis system 300 can also be implemented, collectively, across multiple distinct devices. For example, the functionality of analysis system 300 can be provided by security platform 122.

In various embodiments, analysis system 300 makes use of lists, databases, or other collections of known safe content and/or known bad content (collectively shown in FIG. 3 as collection 314). Collection 314 can be obtained in a variety of ways, including via a subscription service (e.g., provided by a third party) and/or as a result of other processing (e.g., performed by data appliance 102 and/or security platform 122). Examples of information included in collection 314 are: URLs, domain names, and/or IP addresses of known malicious servers; URLs, domain names, and/or IP addresses of known safe servers; URLs, domain names, and/or IP addresses of known command and control (C&C) domains; signatures, hashes, and/or other identifiers of known malicious applications; signatures, hashes, and/or other identifiers of known safe applications; signatures, hashes, and/or other identifiers of known malicious files (e.g., Android exploit files); signatures, hashes, and/or other identifiers of known safe libraries; and signatures, hashes, and/or other identifiers of known malicious libraries.

A. Ingestion

In various embodiments, when a new sample is received for analysis (e.g., no existing features associated with the sample are present in analysis system 300), it is added to queue 302. As shown in FIG. 3, application 130 is received by system 300 and added to queue 302.

B. Static Analysis

Coordinator 304 monitors queue 302, and as resources (e.g., a static analysis worker) become available, coordinator 304 fetches a sample from queue 302 for processing (e.g., fetches a copy of malware 130). In particular, coordinator 304 first provides the sample to static analysis engine 306 for static analysis. In some embodiments, one or more static analysis engines are included within analysis system 300, where analysis system 300 is a single device. In other embodiments, static analysis is performed by a separate static analysis server that includes a plurality of workers (i.e., a plurality of instances of static analysis engine 306).

The static analysis engine obtains general information about the sample and includes it (along with heuristic and other information, as applicable) in a static analysis report 308. The report can be created by the static analysis engine, or by coordinator 304 (or by another appropriate component), which can be configured to receive the information from static analysis engine 306. In some embodiments, the collected information is stored in a database record for the sample (e.g., in database 316), instead of or in addition to a separate static analysis report 308 being created (i.e., portions of the database record form the report 308). In some embodiments, the static analysis engine also forms a verdict with respect to the application (e.g., "safe," "suspicious," or "malicious"). As one example, the verdict can be "malicious" if even one "malicious" static feature is present in the application (e.g., the application includes a hard link to a known malicious domain). As another example, points can be assigned to each of the features (e.g., based on severity if found, based on how reliable the feature is for predicting malice, etc.), and a verdict can be assigned by static analysis engine 306 (or coordinator 304, if applicable) based on the number of points associated with the static analysis results.
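The two verdict strategies described above (a single decisive feature, or accumulated points) can be combined in a short sketch. The feature names, point values, and thresholds below are invented for illustration; the description does not give any.

```python
from typing import Iterable

# Hypothetical per-feature points (severity and reliability folded together).
FEATURE_POINTS = {
    "hard_link_to_known_malicious_domain": 10,
    "packed_executable": 3,
    "obfuscated_strings": 2,
}
# Hypothetical features that are decisive on their own.
DECISIVE_FEATURES = {"hard_link_to_known_malicious_domain"}
SUSPICIOUS_THRESHOLD = 3   # assumed cut-off
MALICIOUS_THRESHOLD = 8    # assumed cut-off


def static_verdict(features: Iterable[str]) -> str:
    """Form a verdict from the static features found in a sample."""
    found = set(features)
    if found & DECISIVE_FEATURES:
        return "malicious"  # even one "malicious" feature is enough
    score = sum(FEATURE_POINTS.get(name, 0) for name in found)
    if score >= MALICIOUS_THRESHOLD:
        return "malicious"
    if score >= SUSPICIOUS_THRESHOLD:
        return "suspicious"
    return "safe"
```

Keeping the decisive-feature rule separate from the point totals makes it straightforward to tune thresholds without weakening the strongest indicators.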

C. Dynamic Analysis

Once static analysis is completed, coordinator 304 assigns an available dynamic analysis engine 310 to perform dynamic analysis on the application. As with static analysis engine 306, analysis system 300 can include one or more dynamic analysis engines directly. In other embodiments, dynamic analysis is performed by a separate dynamic analysis server that includes a plurality of workers (i.e., a plurality of instances of dynamic analysis engine 310).

Each dynamic analysis worker manages a virtual machine instance. In some embodiments, results of static analysis (e.g., performed by static analysis engine 306), whether in report form (308) and/or as stored in database 316, or otherwise stored, are provided as input to dynamic analysis engine 310. For example, the static report information can be used to help select/customize the virtual machine instance used by dynamic analysis engine 310 (e.g., Microsoft Windows 7 SP 2 vs. Microsoft Windows 10 Enterprise, or iOS 11.0 vs. iOS 12.0). Where multiple virtual machine instances are executed at the same time, a single dynamic analysis engine can manage all of the instances, or multiple dynamic analysis engines can be used (e.g., with each managing its own virtual machine instance), as applicable. As will be explained in more detail below, during the dynamic portion of the analysis, actions taken by the application (including network activity) are analyzed.

In various embodiments, static analysis of a sample is omitted or is performed by a separate entity, as applicable. As one example, traditional static and/or dynamic analysis may be performed on files by a first entity. Once it is determined (e.g., by the first entity) that a given file is malicious, the file can be provided to a second entity (e.g., the operator of security platform 122) specifically for additional analysis with respect to the malware's use of network activity (e.g., by dynamic analysis engine 310).

The environment used by analysis system 300 is instrumented/hooked such that behaviors observed while the application is executing are logged as they occur (e.g., using a customized kernel that supports hooking and logcat). Network traffic associated with the emulator is also captured (e.g., using pcap). The log/network data can be stored as a temporary file on analysis system 300, and can also be stored more permanently (e.g., using HDFS or another appropriate storage technology or combination of technologies, such as MongoDB). The dynamic analysis engine (or another appropriate component) can compare the connections made by the sample to lists of domains, IP addresses, etc. (314) and determine whether the sample has communicated (or attempted to communicate) with malicious entities.
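The comparison of a sample's connections against collection 314 might look like the following sketch. The function names and the normalization steps are assumptions; the example destinations use reserved documentation names and addresses rather than real indicators.

```python
from typing import Iterable, List, Set


def normalize_host(host: str) -> str:
    """Lower-case a destination and drop whitespace and any trailing dot."""
    return host.strip().lower().rstrip(".")


def malicious_contacts(
    observed: Iterable[str],
    bad_domains: Set[str],
    bad_ips: Set[str],
) -> List[str]:
    """Return the observed destinations that appear on a known-bad list."""
    bad_domains_norm = {normalize_host(d) for d in bad_domains}
    hits = []
    for destination in observed:
        key = normalize_host(destination)
        if key in bad_ips or key in bad_domains_norm:
            hits.append(destination)
    return hits
```

For instance, `malicious_contacts(["Example.COM.", "intranet.local"], {"example.com"}, set())` reports only the first destination, since normalization makes the two spellings of the domain compare equal.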

As with the static analysis engine, the dynamic analysis engine stores the results of its analysis in database 316 in the record associated with the application being tested (and/or includes the results in report 312, as applicable). In some embodiments, the dynamic analysis engine also forms a verdict with respect to the application (e.g., "safe," "suspicious," or "malicious"). As one example, the verdict can be "malicious" if even one "malicious" action is taken by the application (e.g., an attempt to contact a known malicious domain is made, or an attempt to exfiltrate sensitive information is observed). As another example, points can be assigned to actions taken (e.g., based on severity if found, based on how reliable the action is for predicting malice, etc.), and a verdict can be assigned by dynamic analysis engine 310 (or coordinator 304, if applicable) based on the number of points associated with the dynamic analysis results. In some embodiments, a final verdict associated with the sample is made (e.g., by coordinator 304) based on a combination of report 308 and report 312.

V. Inline Malware Detection

Returning to the environment of FIG. 1, millions of new malware samples may be generated each month (e.g., by nefarious individuals such as the operator of system 120, whether by making subtle changes to existing malware or by authoring new malware). Accordingly, there will exist many malware samples for which security platform 122 (at least initially) has no signature. Further, even where security platform 122 has generated signatures for newly created malware, resource constraints prevent data appliances, such as data appliance 102, from having or using a list of all known signatures (e.g., as stored on platform 122) at any given time.

Sometimes malware, such as malware 130, will successfully penetrate network 140. One reason for this is where data appliance 102 operates on a "first-time allow" principle. Suppose that when data appliance 102 does not have a signature for a sample (e.g., sample 130) and submits it to security platform 122 for analysis, it takes security platform 122 approximately five minutes to return a verdict (e.g., "benign," "malicious," "unknown," etc.). Instead of blocking communications between system 120 and client device 104 during that five-minute period, under a first-time allow principle, the communication is allowed. When a verdict is returned (e.g., five minutes later), data appliance 102 can use the verdict (e.g., "malicious") to block subsequent transmissions of malware 130 to network 140, can block communications between system 120 and network 140, etc. In various embodiments, if a second copy of sample 130 arrives at data appliance 102 during the period in which data appliance 102 is awaiting a verdict from security platform 122, the second copy (and any subsequent copies) of sample 130 will be held by system 120 pending a response from security platform 122.
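The first-time allow behavior described above can be captured in a small state sketch. The class name `FirstTimeAllowGate`, the action strings, and the use of a file hash as the sample identifier are assumptions for illustration; the description does not prescribe a data structure, and the sketch does not model which component actually holds later copies.

```python
from typing import Dict, Set


class FirstTimeAllowGate:
    """Tracks samples awaiting a verdict under a first-time allow principle."""

    def __init__(self) -> None:
        self._verdicts: Dict[str, str] = {}  # sample id -> returned verdict
        self._pending: Set[str] = set()      # submitted, verdict not yet back

    def on_file(self, sample_id: str) -> str:
        """Return the action for a file: 'allow', 'hold', or 'block'."""
        verdict = self._verdicts.get(sample_id)
        if verdict is not None:
            return "block" if verdict == "malicious" else "allow"
        if sample_id in self._pending:
            # A second or later copy arrives while analysis is under way.
            return "hold"
        # First sighting: submit for analysis and let the transfer proceed.
        self._pending.add(sample_id)
        return "allow"

    def on_verdict(self, sample_id: str, verdict: str) -> None:
        """Record the verdict returned for a previously submitted sample."""
        self._pending.discard(sample_id)
        self._verdicts[sample_id] = verdict
```

The sketch makes the weakness of the approach visible: the very first copy is always allowed, whatever the eventual verdict turns out to be.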

Unfortunately, during the five minutes that data appliance 102 waits for a verdict from security platform 122, a user of client device 104 could have executed malware 130, potentially compromising client device 104 or other nodes in network 140. As mentioned above, in various embodiments, data appliance 102 includes malware analysis module 112. One task that malware analysis module 112 can perform is inline malware detection. In particular, and as described in more detail below, as a file (such as sample 130) passes through data appliance 102, machine learning techniques can be applied to perform efficient analysis of the file on data appliance 102 (e.g., in parallel with other processing performed on the file by data appliance 102), and an initial maliciousness verdict can be determined by data appliance 102 (e.g., while awaiting a verdict from security platform 122).

Various difficulties can arise in performing such analysis on a resource-constrained appliance such as data appliance 102. One critical resource on appliance 102 is session memory. A session is a network transfer of information, including the files that appliance 102 analyzes in accordance with the techniques described herein. A single appliance may have millions of concurrent sessions, and the memory that can persist during a given session is extremely limited. A first difficulty in performing inline analysis on a data appliance such as data appliance 102 is that, due to such memory constraints, data appliance 102 will typically be unable to process an entire file at once, and instead receives a sequence of packets that it must process packet by packet. Accordingly, the machine learning approach used by data appliance 102 needs, in various embodiments, to accommodate packet streams. A second difficulty is that, in some cases, data appliance 102 will be unable to determine where the end of a given file being processed occurs (e.g., the end of sample 130 in a stream). The machine learning approach used by data appliance 102 therefore needs, in various embodiments, to be able to render a verdict on a given file potentially midstream (e.g., partway through receipt/processing of sample 130, or otherwise before the actual end of the file).

A. Machine Learning Models

As described in more detail below, in various embodiments, security platform 122 provides a set of machine learning models to data appliance 102 for use in conjunction with inline malware detection. The models incorporate features (e.g., n-grams or other features), determined by security platform 122, that correspond to malicious files. Two example types of such models are linear classification models and non-linear classification models. Examples of linear classification models that can be used by data appliance 102 include logistic regression and linear support vector machines. One example of a non-linear classification model that can be used by data appliance 102 is a gradient boosting tree (e.g., eXtreme Gradient Boosting (XGBoost)). Non-linear models are more accurate (and are better able to detect obfuscated or disguised malware), whereas linear models use considerably fewer resources on appliance 102 (and are better suited to efficiently analyzing JavaScript or similar files).

As described in more detail below, the type of classification model used for a given file being analyzed can be based on the file type associated with that file (which can be determined, e.g., by a magic number).

1. Additional Detail on the Threat Engine

In various embodiments, data appliance 102 includes threat engine 244. The threat engine incorporates both protocol decoding and threat signature matching, performed during a decoder stage and a pattern match stage, respectively. The results of the two stages are merged by a detector stage.

When data appliance 102 receives a packet, data appliance 102 performs a session match to determine which session the packet belongs to (which allows data appliance 102 to support concurrent sessions). Each session has a session state that implicates a particular protocol decoder (e.g., a web browsing decoder, an FTP decoder, or an SMTP decoder). When a file is transmitted as part of a session, the applicable protocol decoder can make use of an appropriate file-specific decoder (e.g., a PE file decoder, a JavaScript decoder, or a PDF decoder).

A portion of one example embodiment of threat engine 244 is shown in FIG. 4. For a given session, decoder 402 walks the traffic byte stream, following the corresponding protocol and marking contexts. One example of a context is an end-of-file context (e.g., encountering </script> while processing a JavaScript file). Decoder 402 can mark the end-of-file context in the packet, which can then be used to trigger execution of the appropriate model using the file's observed features. In some cases (e.g., FTP traffic), there may be no explicit protocol-level tag that lets decoder 402 identify/mark a context. As described in more detail below, in various embodiments, decoder 402 can use other information (e.g., the file size reported in a header) to determine when feature extraction for a file should end (e.g., where an overlay section begins) and when execution using the appropriate model should begin.

Decoder 402 comprises two parts. The first part of decoder 402 is a virtual machine portion (404), which can be implemented as a state machine using a state machine language. The second part of decoder 402 is a set of tokens 406 for triggering state machine transitions and actions when matched against traffic. Threat engine 244 also includes threat pattern matcher 408 (e.g., using regular expressions), which performs pattern matching (e.g., against threat patterns). As one example, threat pattern matcher 408 can be provided (e.g., by security platform 122) with a table of strings to match (either exact strings or wildcard strings) and the corresponding actions to take when a string match is found. Detector 410 processes the outputs provided by decoder 402 and threat pattern matcher 408 and takes various actions.

2. N-Grams

The data in a session can be broken into a sequence of n-grams, i.e., a series of byte strings. As one example, suppose a portion of the hexadecimal data in a session is "1023ae42f6f28762aab". The 2-grams in the sequence are then all pairs of adjacent characters (here, bytes, each written as two hexadecimal digits), such as "1023", "23ae", "ae42", "42f6", and so on. In various embodiments, threat engine 244 is configured to analyze files using 8-grams. Other n-grams, such as 7-grams or 4-grams, can also be used. In the example string above, "1023ae42f6f28762" is an 8-gram, "23ae42f6f28762aa" is an 8-gram, and so on. The total number of distinct 8-grams possible in a byte sequence is 2 to the 64th power (18,446,744,073,709,551,616). Searching for all possible 8-grams in a byte sequence would easily exceed the resources of data appliance 102. Instead, and as described in more detail below, a significantly reduced set of 8-grams is provided by security platform 122 to data appliance 102 for use by threat engine 244.
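The sliding-window decomposition described above can be sketched in a few lines of Python. This is an illustration only, not the appliance's implementation: the helper name `byte_ngrams` is hypothetical, and the trailing half-byte of the example string is dropped so that the input consists of whole bytes.

```python
def byte_ngrams(data: bytes, n: int):
    """Yield every n-byte window of data, sliding forward one byte at a time."""
    for start in range(len(data) - n + 1):
        yield data[start:start + n]


# The example string in the text has an odd number of hex digits, so the
# final half-byte ("b") is dropped here to form whole bytes (an assumption).
session_bytes = bytes.fromhex("1023ae42f6f28762aa")

two_grams = [g.hex() for g in byte_ngrams(session_bytes, 2)]
eight_grams = [g.hex() for g in byte_ngrams(session_bytes, 8)]

# Each byte takes one of 256 values, so there are 256**8 == 2**64 possible 8-grams.
possible_eight_grams = 256 ** 8
```

The size of `possible_eight_grams` is why the appliance searches only for a reduced, pre-selected set of 8-grams rather than enumerating all of them.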

When session packets corresponding to a file are received by threat engine 244, threat pattern matcher 408 parses the packets for matches against the strings in its table (e.g., by performing regular expression and/or exact string matching). A list of matches (e.g., with each instance of a match identified by the corresponding pattern ID), together with the offset at which each match occurred, is generated. Actions on those matches are taken in offset order (e.g., from lowest to highest). For a given match (i.e., corresponding to a particular pattern ID), a set of one or more actions to be taken is specified (e.g., via an action table that maps actions to pattern IDs).

The set of 8-grams provided by security platform 122 can be added (e.g., as exact string matches) to the table of matches that threat pattern matcher 408 already performs (e.g., heuristic matches that look for particular indicators of malware, such as where a JavaScript file accesses password storage or where a PE file calls the Local Security Authority Subsystem Service (LSASS) API). One advantage of this approach is that the 8-grams can be searched for in parallel with the other searches performed by threat pattern matcher 408, instead of making multiple passes over a packet (e.g., first evaluating the heuristic matches and then evaluating the 8-gram matches).

As described in more detail below, 8-gram matches are used, in various embodiments, by both linear and non-linear classification models. Examples of actions that can be specified for an n-gram match include incrementing a weighted counter (e.g., for a linear classifier) and saving the match in a feature vector (e.g., for a non-linear classifier). Which action is taken can be specified based on the file type associated with the packet (which determines which type of model is used).
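The flow described above (collect matches with their offsets, then apply a per-pattern action that depends on the model type) can be sketched as follows. This is a simplified illustration under stated assumptions: the pattern table, pattern IDs, weights, and the names `SessionState`, `find_matches`, and `apply_actions` are all hypothetical, and plain substring search stands in for the appliance's pattern matcher.

```python
from dataclasses import dataclass, field

# Hypothetical pattern table: pattern ID -> exact byte string to match.
PATTERNS = {1: bytes.fromhex("1023ae42f6f28762"), 2: b"eval("}
# Hypothetical per-pattern weights used only by the linear action.
WEIGHTS = {1: 0.75, 2: 0.25}


@dataclass
class SessionState:
    model_kind: str                               # "linear" or "nonlinear"
    score: float = 0.0                            # weighted counter (linear)
    features: dict = field(default_factory=dict)  # match counts (non-linear)


def find_matches(packet: bytes):
    """Return (offset, pattern_id) pairs sorted from lowest to highest offset."""
    matches = []
    for pattern_id, pattern in PATTERNS.items():
        start = packet.find(pattern)
        while start != -1:
            matches.append((start, pattern_id))
            start = packet.find(pattern, start + 1)
    return sorted(matches)


def apply_actions(state: SessionState, matches) -> None:
    """Apply the action mapped to each match, chosen by the session's model type."""
    for _offset, pattern_id in matches:
        if state.model_kind == "linear":
            state.score += WEIGHTS[pattern_id]    # increment weighted counter
        else:
            state.features[pattern_id] = state.features.get(pattern_id, 0) + 1
```

Because both heuristic strings and 8-grams live in the same table, a single call to `find_matches` covers both kinds of pattern in one pass over the packet.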

3. Selecting a Model

In some cases, a particular file type is specified in the header of the file (e.g., as a magic number appearing within the first seven bytes of the file itself). In such scenarios, threat engine 244 can select the appropriate model corresponding to the specified file type (e.g., based on a table, provided by security platform 122, that lists file types and corresponding models). In other cases, such as JavaScript, a magic number or other file type identifier (if present in the header) is not probative of which classification model should be used. As one example, JavaScript would have a file type of "textfile". To identify file types such as JavaScript, decoder 402 can be used to perform deterministic finite state automaton (DFA) pattern matching and to apply heuristics (e.g., identifying <script> and other indicators that the file is JavaScript). The determined file type and/or the selected classification model are stored in the session state. The file type associated with a session can be updated as the session progresses. For example, in a text stream, when a <script> tag is encountered, the JavaScript file type can be assigned to the session. When the corresponding </script> is encountered, the file type can be changed (e.g., back to plain text).
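A minimal sketch of this selection logic is shown below. The magic numbers for PE ("MZ") and PDF ("%PDF") files are the standard ones, but the type-to-model mapping, the session dictionary layout, and the function names are assumptions made for illustration. The text states only that linear models suit JavaScript; the assignments for PE and PDF are hypothetical. A simple tag search stands in for the DFA-based matching performed by decoder 402.

```python
# Well-known magic numbers; the mapping to model types below is hypothetical.
MAGIC_TO_TYPE = {b"MZ": "pe", b"%PDF": "pdf"}
TYPE_TO_MODEL = {"pe": "nonlinear", "pdf": "nonlinear", "javascript": "linear"}


def update_file_type(session: dict, chunk: bytes) -> None:
    """Update session['file_type'] from a header magic number or <script> tags."""
    current = session.get("file_type")
    if current is None:
        for magic, file_type in MAGIC_TO_TYPE.items():
            if chunk.startswith(magic):
                session["file_type"] = file_type
                return
    elif current not in ("plaintext", "javascript"):
        return  # a header-declared binary type is kept as-is in this sketch

    lowered = chunk.lower()
    opened = lowered.rfind(b"<script")
    closed = lowered.rfind(b"</script>")
    if opened > closed:
        session["file_type"] = "javascript"   # inside a script element
    elif closed != -1:
        session["file_type"] = "plaintext"    # script element has ended


def select_model(session: dict):
    """Return the model type for the session's file type, or None if unmapped."""
    return TYPE_TO_MODEL.get(session.get("file_type"))
```

Storing the result in the session state means later packets of the same file do not need to repeat the file type analysis unless a tag changes it.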

4. Linear Classification Models

One way of representing a linear model is with the following linear inequality:

$$\sum_{i=1}^{P} \beta_i x_i < C$$

where P is the total number of features, x_i is the i-th feature, β_i is the coefficient (weight) of feature x_i, and C is a threshold constant. In this example, C is the threshold for a maliciousness verdict: if the sum for a given file is less than C, the file is assigned a benign verdict, and if the sum is greater than or equal to C, the file is assigned a malicious verdict.

One approach for using a linear classification model on data appliance 102 is as follows. A single float (d) is used to track the score of the incoming file, and a hash table is used to store the n-grams and their corresponding coefficients (i.e., x_i and β_i). For each incoming packet, each of the n-gram features (e.g., as provided by security platform 122) is checked. Whenever a match is found for a feature (x_i) in the hash table, the coefficient (β_i) stored for that feature in the hash table is added to the single float (d). When the end of the file is reached, the single float (d) is compared against the threshold (C) to determine the verdict for the file.

For n-gram counts, feature x_i equals the number of times the i-th n-gram is observed. Suppose that the i-th n-gram is observed four times for a particular file. 4 * β_i can be rewritten as β_i + β_i + β_i + β_i. Instead of counting how many times the i-th n-gram is observed (i.e., four) and then multiplying by β_i, an alternative approach is to add β_i each time the i-th n-gram is observed. Suppose further that the j-th n-gram is observed three times for the file. 3 * β_j can similarly be written as β_j + β_j + β_j, so that instead of counting how many times the j-th n-gram is observed and multiplying at the end, β_j is added each time that n-gram is observed.

To find Σ(β_i x_i), each of β_i x_i, β_j x_j, ... is added together (where ... corresponds to all of the other features/weights). This can be rewritten as β_i + β_i + β_i + β_i + β_j + β_j + β_j + .... Because addition is commutative, the values can be added in any order (e.g., β_i + β_j + β_i + β_j + β_i + β_i + β_j, etc.) and accumulated into a single float. Suppose that the float (d) starts at 0.0. Each time feature x_i is observed, β_i can be added to float d, and each time x_j is observed, β_j can be added to float d. This approach allows a single 4-byte float to serve as the entire per-session memory, in contrast to an approach in which per-session memory is proportional to the number of features because the whole feature vector is kept in memory so that it can be multiplied by the weight vector. Using an example of 1,000 features, 4 bytes * 1,000 features would require 4 KB of storage (compared with a single 4-byte float), which is 1,000 times more expensive.
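The equivalence between adding a weight on every observation and counting first, then multiplying, can be checked with a short Python sketch. The weights, threshold, and function names here are hypothetical, and the weights were chosen to be exactly representable in binary floating point so that both methods give identical results. Python's own `float` is 8 bytes; `struct.calcsize("f")` is used only to show the 4-byte size the text refers to.

```python
import struct

# Hypothetical coefficients (beta) for two n-gram features, and a threshold C.
WEIGHTS = {"i": 0.25, "j": -0.125}
C = 0.5


def streaming_score(observations) -> float:
    """Add the feature's weight to a single accumulator on every observation."""
    d = 0.0
    for feature in observations:
        d += WEIGHTS[feature]
    return d


def counted_score(observations) -> float:
    """Count each feature first, then compute the weighted sum at the end."""
    counts = {}
    for feature in observations:
        counts[feature] = counts.get(feature, 0) + 1
    return sum(WEIGHTS[f] * x for f, x in counts.items())


# Feature i is observed four times and feature j three times, interleaved.
observed = ["i", "j", "i", "j", "i", "i", "j"]
d = streaming_score(observed)
verdict = "malicious" if d >= C else "benign"

# Per-session memory: one 4-byte float versus a 1,000-entry float vector.
FLOAT_BYTES = struct.calcsize("f")
per_session_vector_bytes = FLOAT_BYTES * 1000
```

With general weights, the two methods can differ by floating-point rounding, since the order of additions affects rounding even though the mathematical sum is the same.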

5. Non-Linear Classification Models

A variety of non-linear classification approaches can be used in conjunction with the techniques described herein. One example of a non-linear classification model is a gradient boosting tree. In this example, a feature vector is initialized to an all-zero vector. Unfortunately, with a non-linear model (unlike a linear model), the entire set of features whose presence is being detected (e.g., 1,000 features) must persist for the full duration of the session. This is less efficient than the linear approach, but some efficiency can still be gained by downsampling each feature to a single byte (0-255) instead of a full 4-byte float (which can be used on devices that are not memory constrained).

As data appliance 102 scans through the file, each time a feature is observed, the value of that feature in the feature vector is incremented by one. Once the end of the file is reached (or feature observation otherwise ends), the constructed feature vector is fed into the gradient boosting tree model (e.g., as received from security platform 122). As described in more detail below, the non-linear classification model can be built using both n-gram (e.g., 8-gram) and non-n-gram features. One example of a non-n-gram feature is the purported size of the file (which can be read as a value from the packet that contains the file's header). Any file data that appears after the purported end of file (e.g., based on the file size specified in the header) is referred to as an overlay. In addition to serving as a feature, the purported file length can be used as a proxy for how long the file is expected to be. The non-linear classifier can be run against the file's packet stream until the purported file length is reached, and a verdict can then be formed for the file irrespective of whether the actual end of the file has been reached. Whether a given file includes an overlay is also an example of a feature that can be used as part of a non-linear classification model. In various embodiments, the overlay portion of a file is not analyzed, which again allows analysis to complete before the actual end of the file. In other embodiments, feature extraction continues, and a maliciousness verdict is not formed, until the actual end of the file is reached.
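The use of the purported file length as a stopping point can be sketched as follows. The function name, its arguments, and the `extract` callback are hypothetical; the sketch only shows how feature extraction can stop at the purported end of file so that a verdict can be formed without waiting for the rest of the stream.

```python
def analyze_until_purported_end(packets, purported_size: int, extract):
    """Feed file bytes to extract() until the purported file size is reached.

    Returns (ready_for_verdict, overlay_seen). Overlay bytes are never passed
    to extract(); overlay_seen only reflects data that arrived in the same
    packet as the purported end of file.
    """
    seen = 0
    for packet in packets:
        remaining = purported_size - seen
        extract(packet[:remaining])        # ignore bytes past the purported end
        seen += len(packet)
        if seen >= purported_size:
            return True, seen > purported_size
    return False, False                    # stream ended before the purported size


# Example: a header claims 6 bytes, but the stream keeps going.
collected = bytearray()
stream = iter([b"abcd", b"efgh", b"ijkl"])
ready, overlay = analyze_until_purported_end(stream, 6, collected.extend)
```

In this example the third packet is never read by the function, which mirrors the point that a verdict need not wait for the actual end of the file.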

In one example embodiment, the tree model comprises 5,000 binary trees. Every node on each tree contains a feature and a corresponding threshold. An example of a portion of a tree is shown in FIG. 5. In the example shown in FIG. 5, if the value of a feature (e.g., feature F4) is less than its threshold (e.g., 30), the left branch is taken (502). If the value of the feature is greater than or equal to the threshold, the right branch is taken (504). The tree is traversed until a leaf node (e.g., node 506), which has an associated value (e.g., 0.7), is reached. The values of the leaves reached (one for each tree) are summed (rather than multiplied) to obtain a final score for computing the verdict. If the score is below a threshold, the file is considered benign, and if it is greater than or equal to the threshold, the file is considered malicious. The absence of multiplication in obtaining the final score helps the model to be used more efficiently in the resource-constrained environment of data appliance 102.
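The traversal and summation described above can be sketched with two toy trees. Only feature F4, its threshold of 30, and the leaf value 0.7 come from the FIG. 5 example in the text. The other tree, the verdict threshold, and the class and function names are hypothetical, and a real model would hold thousands of trees. The feature vector is a `bytearray` to mirror the one-byte-per-feature downsampling, with counts saturating at 255 (the saturation behavior is an assumption).

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    value: float


@dataclass
class Node:
    feature: int          # index into the feature vector
    threshold: int
    left: "Node | Leaf"   # taken when features[feature] < threshold
    right: "Node | Leaf"  # taken when features[feature] >= threshold


def leaf_value(tree, features) -> float:
    """Walk one tree from the root to a leaf and return the leaf's value."""
    node = tree
    while isinstance(node, Node):
        node = node.left if features[node.feature] < node.threshold else node.right
    return node.value


def score(trees, features) -> float:
    """Sum (not multiply) the leaf values reached in every tree."""
    return sum(leaf_value(tree, features) for tree in trees)


def bump(features: bytearray, index: int) -> None:
    """Increment a one-byte feature count, saturating at 255."""
    if features[index] < 255:
        features[index] += 1


# Two toy trees; the first mirrors the F4 / threshold 30 / leaf 0.7 example.
TREES = [
    Node(feature=4, threshold=30, left=Leaf(-0.5), right=Leaf(0.7)),
    Node(feature=1, threshold=1, left=Leaf(-0.25), right=Leaf(0.5)),
]
VERDICT_THRESHOLD = 0.4   # hypothetical

features = bytearray(8)   # all-zero feature vector, one byte per feature
for _ in range(40):
    bump(features, 4)     # feature 4 observed 40 times

total = score(TREES, features)
verdict = "malicious" if total >= VERDICT_THRESHOLD else "benign"
```

The trees are read-only during scoring, so they can be shared across sessions; only `features` is per-session state.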

In various embodiments, the trees themselves are fixed on data appliance 102 (until an updated model is received) and can be stored in shared memory that can be accessed by multiple sessions at the same time. The per-session cost is the cost of storing the session's feature vector, which can be zeroed out once analysis of the session is complete.

6. Example Process

FIG. 6 illustrates one example of a process for performing inline malware detection on a data appliance. In various embodiments, process 600 is performed by data appliance 102, and in particular by threat engine 244. Threat engine 244 can be implemented using a script (or set of scripts) written in an appropriate scripting language (e.g., Python). Process 600 can also be performed on an endpoint, such as client device 110 (e.g., by an endpoint protection application executing on client device 110).

Process 600 begins at 602 when an indication that a file is being transmitted as part of a session is received by appliance 102. As one example of the processing performed at 602, for a given session, the associated protocol decoder can call or otherwise make use of an appropriate file-specific decoder when the start of a file is detected by the protocol decoder. As explained above, the file type is determined (e.g., by decoder 402) and associated with the session (e.g., so that subsequent file type analysis need not be performed until the file type changes or file packets are no longer being transmitted).

At 604, n-gram analysis is performed on a sequence of received packets. As explained above, the n-gram analysis can be performed inline with other analysis being performed on the session by appliance 102. For example, while appliance 102 is performing analysis on a particular packet (e.g., to check for the presence of particular heuristics), it can also determine whether any 8-grams in the packet match the 8-grams provided by security platform 122. During the processing performed at 604, when an n-gram match is found, the corresponding pattern ID is used to map the match to an action based on the file type. The action either increments a weighted counter (e.g., where the file type is associated with a linear classifier) or updates a feature vector to account for the match (e.g., where the file type is associated with a non-linear classifier).

The n-gram analysis continues, packet by packet, until either an end-of-file condition or a checkpoint is reached. At that point (606), the appropriate model is used to determine a verdict for the file (i.e., by comparing the final value obtained using the model against a maliciousness threshold). As mentioned above, the models incorporate n-gram features and can also incorporate other features (e.g., in the case of a non-linear classifier).

Finally, at 608, an action is taken in response to the determination made at 606. One example of a responsive action is terminating the session. Another example of a responsive action is allowing the session to continue but preventing the file from being transmitted (and instead placing it in a quarantine area). In various embodiments, appliance 102 is configured to share its verdicts (whether benign verdicts, malicious verdicts, or both) with security platform 122. When security platform 122 completes its independent analysis of the file, it can use the verdict reported by appliance 102 for a variety of purposes, including assessing the performance of the model that formed the verdict.

An example of a threat signature for a sample is shown in FIG. 7B. In particular, for a sample having a SHA-256 hash of "4d73f42438fb5a8579219cdfa9cbbb4ce3f771ffed93af81b052831e4813f8", the first value in each pair corresponds to a feature and the second value corresponds to a count. In the example shown in FIG. 7B, features consisting of a number (e.g., feature "3905") correspond to n-gram features, and features consisting of "J" and a number (e.g., feature "J18") correspond to non-n-gram features.

In one example embodiment, security platform 122 is configured to target a particular false positive rate (e.g., 0.001) when generating models for use by appliances such as data appliance 102. Accordingly, in some cases (e.g., one file in 1,000), data appliance 102 may incorrectly determine that a benign file is malicious when performing inline analysis using a model in accordance with the techniques described herein. In such a scenario, if security platform 122 subsequently determines that the file is in fact benign, the file can be added to a whitelist so that it is not later flagged as malicious (e.g., by another appliance).

One approach to whitelisting is for security platform 122 to have the file added to a whitelist stored on appliance 102. Another approach is for security platform 122 to notify whitelist system 154 of false positives, and for whitelist system 154, in turn, to keep appliances such as appliance 102 up to date with false positive information. As mentioned above, one issue with appliances such as appliance 102 is that they are resource constrained. One approach to minimizing the resources used to maintain a whitelist on an appliance is to maintain the whitelist using a Least Recently Used (LRU) cache. The whitelist can contain file hashes, and can also be based on other elements, such as feature vectors or hashes of feature vectors.
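A bounded whitelist of this kind can be sketched with Python's `collections.OrderedDict`. The class name `LruWhitelist`, its capacity, and the sample file contents are hypothetical; the sketch keys entries by SHA-256 file hash, though the text notes that feature vectors or their hashes could be used instead.

```python
import hashlib
from collections import OrderedDict


class LruWhitelist:
    """A fixed-capacity whitelist that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()

    def add(self, file_hash: str) -> None:
        self._entries[file_hash] = True
        self._entries.move_to_end(file_hash)      # mark as most recently used
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)     # evict least recently used

    def __contains__(self, file_hash: str) -> bool:
        if file_hash in self._entries:
            self._entries.move_to_end(file_hash)  # a lookup counts as a use
            return True
        return False


# Hypothetical file contents, hashed the way a whitelist entry might be keyed.
hash_a = hashlib.sha256(b"benign file A").hexdigest()
hash_b = hashlib.sha256(b"benign file B").hexdigest()
hash_c = hashlib.sha256(b"benign file C").hexdigest()

whitelist = LruWhitelist(capacity=2)
whitelist.add(hash_a)
whitelist.add(hash_b)
a_is_listed = hash_a in whitelist   # refreshes A, so B becomes the oldest entry
whitelist.add(hash_c)               # exceeds capacity; B is evicted
```

The fixed capacity bounds the memory the whitelist can consume, at the cost of occasionally forgetting entries that have not been seen recently.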

VI. Model building

Returning to the environment shown in FIG. 1, as previously explained, security platform 122 is configured to perform static and dynamic analysis on the samples it receives. Security platform 122 can receive samples for analysis from a variety of sources. As mentioned above, one example type of sample source is a data appliance (e.g., data appliances 102, 136, and 148). Other sources (e.g., one or more third-party providers of samples, such as other security appliance vendors, security researchers, etc.) can also be used as applicable. As described in more detail below, security platform 122 can use the corpus of samples it receives to build models (e.g., models that can then be used by appliance 102 in accordance with embodiments of the techniques described herein).

In various embodiments, static analysis engine 306 is configured to perform feature extraction on the samples it receives (e.g., while also performing other static analysis functions as described above). One example process for performing feature extraction (e.g., by security platform 122) is shown in FIG. 8A. Process 800 begins at 802 when static analysis of a sample is initiated. During feature extraction (804), all 8-grams (or other applicable n-grams in embodiments where 8-grams are not used) are extracted from the sample being processed (e.g., sample 130 of FIG. 3). Specifically, a histogram of the 8-grams in the sample being analyzed is extracted (e.g., into a hash table), indicating the number of times each given 8-gram was observed in the sample. One benefit of extracting 8-grams during feature analysis by static analysis engine 306 is that it can mitigate potential privacy and contractual issues in using samples obtained from third parties (e.g., when building models), because the original file cannot be reconstructed from the resulting histogram. The extracted histogram is stored at 806.
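The histogram extraction at 804 can be illustrated with a short sketch. It assumes that an 8-gram is a sequence of eight consecutive bytes and that overlapping windows are taken with a stride of one byte; the description does not spell out either detail, and the function name is hypothetical.

```python
from collections import Counter


def ngram_histogram(data: bytes, n: int = 8) -> Counter:
    """Count every overlapping n-byte window in `data`.
    Assumes byte-level n-grams with stride 1 (an illustrative choice)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # range() is empty when the sample is shorter than n bytes.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Usage: a 16-byte sample yields 9 windows; the first and last are identical.
histogram = ngram_histogram(b"ABCDEFGHABCDEFGH")
count = histogram[b"ABCDEFGH"]  # 2
```

Because the histogram records only which windows occurred and how often, not where they occurred, it cannot be used to rebuild the original file, which is the privacy property noted above.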

In various embodiments, static analysis engine 306 stores the histogram extracted for a given sample (e.g., represented using a hash table) in storage 142 (e.g., a Hadoop cluster), together with the histograms extracted from other samples. The data in Hadoop is compressed, and when operations are performed on the Hadoop data, the required data is decompressed on the fly. An example hash table (represented in JSON) for a file is shown in FIG. 7A. Line 702 indicates the SHA-256 hash of the file. Line 704 indicates the UNIX time at which sample 130 arrived at security platform 122. Line 706 indicates the counts of n-grams in the overlay portion (e.g., 'd00fbf4e08bc366':1 indicates that one instance of 'd00fbf4e08bc366' was found in the overlay section). Line 708 indicates the count of each 8-gram present in the file. Line 710 indicates that the file has an overlay. Line 712 indicates that the file type of the file is ".exe". Line 714 indicates the UNIX time at which security platform 122 finished processing sample 130. Line 716 indicates the count of each non-8-gram feature hit by the file. Finally, line 718 indicates that the file was determined (e.g., by security platform 122) to be malicious.

In one exemplary embodiment, the set of 8-gram histograms stored in the Hadoop cluster grows by approximately three terabytes of 8-gram histogram data per day. The histograms correspond to both malicious and benign samples (labeled as such, e.g., based on the results of other static and dynamic analysis performed by security platform 122 as described above).

The 8-gram histogram extracted from a sample being analyzed is approximately 10% larger than the file itself, and a typical sample has a histogram containing approximately one million distinct 8-grams. The total number of distinct possible 8-grams is 2 to the 64th power (2⁶⁴). In contrast, as mentioned above, a classification model sent by security platform 122 to a device such as data appliance 102 (e.g., as part of a subscription) includes, in various embodiments, only a few thousand features (e.g., 1,000 features). One example way to reduce the set of potentially up to 2⁶⁴ features to the 1,000 most important features for use in a model is to use a mutual information technique. Other approaches (e.g., a chi-squared score) can also be used. The four required parameters are the number of malicious samples having a given feature, the number of benign samples having the given feature, the total number of malicious samples, and the total number of benign samples. One benefit of mutual information is that it can be used efficiently on very large data sets. In Hadoop, the mutual information approach can be performed in a single pass (i.e., through all of the 8-gram histograms stored in the Hadoop cluster data set for a given file type) by distributing the task across multiple mappers, each of which is responsible for handling particular features. The features with the highest mutual information can be selected as the set of features most indicative of maliciousness and/or most indicative of benignness, as applicable. The resulting 1,000 features can then be used to build models (e.g., a linear classification model and a non-linear classification model), as applicable. For example, to build a linear classification model, model builder 152 (implemented using a set of open source tools and/or scripts authored in an appropriate language, such as Python) saves the top 1,000 features, along with applicable weights, as the set of n-gram features for appliance 102 to check (e.g., as described above in Section V.A.4).
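The mutual information score can be computed from exactly the four parameters listed above. The following is a minimal single-machine sketch, assuming the standard definition of mutual information between a binary "feature present" variable and the malicious/benign label; the function names and the dictionary layout are illustrative assumptions, and the distributed Hadoop mapper logic is not reproduced here.

```python
import heapq
import math


def mutual_information(mal_with, ben_with, mal_total, ben_total):
    """Mutual information (in bits) between feature presence and label."""
    n = mal_total + ben_total
    with_feature = mal_with + ben_with
    without_feature = n - with_feature
    # Each cell: (joint count, feature-side marginal, label-side marginal).
    cells = [
        (mal_with, with_feature, mal_total),
        (ben_with, with_feature, ben_total),
        (mal_total - mal_with, without_feature, mal_total),
        (ben_total - ben_with, without_feature, ben_total),
    ]
    score = 0.0
    for joint, feature_margin, label_margin in cells:
        if joint > 0:  # empty cells contribute nothing (0 * log 0 := 0)
            score += (joint / n) * math.log2(
                joint * n / (feature_margin * label_margin)
            )
    return score


def select_top_features(counts, mal_total, ben_total, k):
    """Return the k features with the highest mutual information.
    `counts` maps feature -> (malicious samples with it, benign samples with it)."""
    return heapq.nlargest(
        k,
        counts,
        key=lambda f: mutual_information(*counts[f], mal_total, ben_total),
    )


# Usage with toy counts over 50 malicious and 50 benign samples.
toy_counts = {"f1": (50, 0), "f2": (25, 25), "f3": (40, 10)}
top_two = select_top_features(toy_counts, 50, 50, k=2)
```

A feature seen in every malicious sample and no benign sample scores 1 bit on a balanced corpus, while a feature seen equally often in both classes scores 0, so ranking by this score keeps the n-grams that best separate the two classes.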

In some embodiments, the non-linear classification model is also built by model builder 152 using the top 1,000 features (or another desired number). In other embodiments, the non-linear classification model is built primarily using the top n-gram features (e.g., 950), but also incorporates other, non-n-gram features (e.g., 50 such features) that can be detected during per-packet feature extraction and analysis. Some examples of non-n-gram features that can be incorporated into the non-linear classification model include: (1) the size of the header, (2) the presence or absence of a checksum in the file, (3) the number of sections in the file, (4) the intended length of the file (as indicated in the header of the PE file), (5) whether the file includes an overlay portion, and (6) whether the file requires the Windows EFI subsystem in order to execute the PE.

In some embodiments, rather than using mutual information to select the top 1,000 features directly, a larger set of features (an over-generated feature set) is determined. As one example, the top 5,000 features can first be selected using mutual information. That set of 5,000 features can then be used as input to a conventional feature selection technique (e.g., bagging), which does not scale well to very large data sets (e.g., the entire Hadoop data set) but is more effective on a reduced set (e.g., 5,000 features). The conventional feature selection technique can be used to select the final 1,000 features from the set of 5,000 features identified using mutual information.

Once the final 1,000 features are selected, one example way to build the non-linear model is to use an open source tool such as scikit-learn or XGBoost. Where applicable, parameter tuning can be performed, such as by using cross-validation.

An example process for generating models is shown in FIG. 8B. In various embodiments, process 850 is performed by security platform 122. Process 850 begins at 852 when a set of extracted features (e.g., including n-gram features) is received. One example way the set of features can be received is by reading the features stored as a result of process 800. At 854, a reduced set of features is determined from the features received at 852. As mentioned above, one example way to determine the reduced set of features is by using mutual information. Other approaches (e.g., a chi-squared score) can also be used. Further, as also mentioned above, a combination of techniques can be used at 852/854, such as selecting an initial set of features using mutual information and then refining the initial set using bagging or another appropriate technique. Finally, as described above, once the features are selected (e.g., at 854), an appropriate model is built at 856 (e.g., using open source or other tools and, where applicable, performing parameter tuning). The models (e.g., those generated by model builder 152 using process 850) can be sent (e.g., as part of a subscription service) to data appliance 102 and other applicable recipients (e.g., data appliances 136 and 148).

In various embodiments, model builder 152 generates models (e.g., linear and non-linear classification models) on a daily (or other applicable) basis. By performing process 850, or otherwise periodically generating models, security platform 122 can help ensure that the models used by appliances such as appliance 102 detect the most current types of malware threats (e.g., those most recently deployed by malicious individuals).

Whenever a newly generated model is determined to be better than an existing model (e.g., as determined based on a set of quality evaluation metrics exceeding a threshold), the updated model can be sent to data appliances such as data appliance 102. In some cases, such an update adjusts the weights assigned to features. Such updates are easily deployed to appliances and adopted by them (e.g., as real-time updates). In other cases, such an update adjusts the features themselves. Such updates can be more complicated to deploy because they may require patches to components of the appliance, such as the decoder. One benefit of using overtraining during model generation is that the model can take into account whether or not the decoder is able to detect particular features.

In various embodiments, appliances are required (e.g., by security platform 122) to deploy updates to models as they are received. In other embodiments, appliances are allowed to deploy updates selectively (at least for a period of time). As one example, when a new model is received by appliance 102, the existing model and the new model can both be run in parallel on appliance 102 for a period of time (e.g., with the existing model used in production and the new model reporting on the actions it would have taken, without actually taking them). An administrator of the appliance can indicate whether the existing model or the new model should be used to process traffic on the appliance (e.g., based on which model performs better). In various embodiments, appliance 102 sends telemetry back to security platform 122 indicating information such as which model is running on appliance 102 and how effective that model is (e.g., false positive statistics).
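Running the existing and new models side by side can be sketched as below. The class, its counters, and the callable model interface (a function from payload bytes to a boolean verdict) are hypothetical stand-ins chosen for illustration; the description does not define the appliance's interfaces or the telemetry format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


def _new_stats() -> Dict[str, int]:
    return {"total": 0, "disagreements": 0, "candidate_would_block": 0}


@dataclass
class ShadowRunner:
    """Enforce the production model's verdict while only recording what the
    candidate model would have done (hypothetical sketch)."""
    production: Callable[[bytes], bool]  # returns True for "malicious"
    candidate: Callable[[bytes], bool]
    stats: Dict[str, int] = field(default_factory=_new_stats)

    def handle(self, payload: bytes) -> bool:
        enforced = self.production(payload)
        shadow = self.candidate(payload)  # observed only, never enforced
        self.stats["total"] += 1
        if shadow:
            self.stats["candidate_would_block"] += 1
        if shadow != enforced:
            self.stats["disagreements"] += 1
        return enforced


# Usage with toy stand-in models.
runner = ShadowRunner(
    production=lambda p: b"evil" in p,
    candidate=lambda p: b"evil" in p or b"bad" in p,
)
blocked = runner.handle(b"bad file")  # False: only the candidate flags it
```

Counters like these are one way an administrator could compare the two models before switching, and they could also feed the telemetry sent back to the security platform.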

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims

1. A system, comprising:
a processor configured to:
store, on a network device, a set comprising one or more sample classification models;
perform n-gram analysis on a sequence of received packets associated with a received file, wherein performing the n-gram analysis includes using at least one stored sample classification model;
determine, based at least in part on the n-gram analysis of the sequence of received packets, that the received file is malicious; and
in response to determining that the file is malicious, prevent propagation of the received file; and
a memory coupled to the processor and configured to provide the processor with instructions.

2. The system of claim 1, wherein the processor is configured to perform the n-gram analysis at least in part by comparing n-grams in the received packets against a predefined list of n-grams.

3. The system of claim 2, wherein the predefined list of n-grams was generated using a plurality of previously collected malware samples.

4. The system of claim 1, wherein the processor is further configured to determine a file type associated with the file.

5. The system of claim 4, wherein the processor is configured to select a linear classification model from the set of one or more sample classification models based on the determined file type associated with the file.

6. The system of claim 5, wherein performing the n-gram analysis includes accumulating a set of weights corresponding to observed n-grams.

7. The system of claim 6, wherein the weights are accumulated into a single float value.

8. The system of claim 4, wherein the processor is configured to select a non-linear classification model from the set of one or more sample classification models based on the determined file type associated with the file.

9. The system of claim 8, wherein the non-linear classification model includes n-gram features and non-n-gram features.

10. The system of claim 9, wherein at least one non-n-gram feature is associated with file size.

11. The system of claim 9, wherein at least one non-n-gram feature is associated with the presence of an overlay.

12. The system of claim 8, wherein performing the n-gram analysis includes updating a value for a feature in a feature vector whenever the feature is matched.

13. The system of claim 1, wherein using the at least one stored sample classification model includes running a non-linear classifier on the packet stream until an intended file length is reached.

14. The system of claim 13, wherein the intended file length is not the actual file length, and a verdict is determined before the actual end of the file is reached.

15. The system of claim 1, wherein the processor is further configured to receive at least one updated classification model.

16. The system of claim 1, wherein the n-gram analysis is performed inline with other packet analysis, as a single-pass analysis of the traffic stream.

17. The system of claim 1, wherein the processor is further configured to use a whitelisted set of n-grams when performing the n-gram analysis.

18. The system of claim 1, wherein the processor is further configured to send a copy of the received file to a security platform and to perform the n-gram analysis while awaiting a verdict from the security platform.

19. A method, comprising:
storing, on a network device, a set comprising one or more sample classification models;
performing n-gram analysis on a sequence of received packets associated with a received file, wherein performing the n-gram analysis includes using at least one stored sample classification model;
determining, based at least in part on the n-gram analysis of the sequence of received packets, that the received file is malicious; and
in response to determining that the file is malicious, preventing propagation of the received file.

20. A computer program stored on a tangible computer-readable storage medium and comprising a plurality of computer instructions that, when executed, cause a computer to:
store, on a network device, a set comprising one or more sample classification models;
perform n-gram analysis on a sequence of received packets associated with a received file, wherein performing the n-gram analysis includes using at least one stored sample classification model;
determine, based at least in part on the n-gram analysis of the sequence of received packets, that the received file is malicious; and
in response to determining that the file is malicious, prevent propagation of the received file.

21. A system, comprising:
a processor configured to:
receive a set of features, including a plurality of n-grams, extracted from a set of files;
determine a reduced set of features that includes at least some of the plurality of n-grams; and
use the reduced set of features to generate a model usable by a data appliance to perform inline malware analysis; and
a memory coupled to the processor and configured to provide the processor with instructions.

22. The system of claim 21, wherein the set of features includes features extracted from a set of known malicious files.

23. The system of claim 21, wherein the set of features includes features extracted from a set of known benign files.

24. The system of claim 21, wherein the reduced set of features is determined using mutual information.

25. The system of claim 21, wherein the reduced set of features is determined using a chi-squared score.

26. The system of claim 21, wherein the generated model includes n-gram features.

27. The system of claim 26, wherein the generated model further includes non-n-gram features.

28. The system of claim 27, wherein at least one non-n-gram feature is associated with file size.

29. The system of claim 27, wherein at least one non-n-gram feature is associated with a header size.

30. The system of claim 27, wherein at least one non-n-gram feature is associated with the presence or absence of a checksum in the file.

31. The system of claim 27, wherein at least one non-n-gram feature is associated with the number of sections in the file.

32. The system of claim 27, wherein at least one non-n-gram feature is associated with the intended length of the file.

33. The system of claim 27, wherein at least one non-n-gram feature is associated with whether the file includes an overlay.

34. The system of claim 21, wherein the model is a linear model.

35. The system of claim 21, wherein the model is a non-linear model.

36. The system of claim 21, wherein the plurality of n-grams are extracted during static analysis of the set of files.

37. The system of claim 21, wherein the model is transmitted to a first data appliance.

38. The system of claim 37, wherein, in response to a false positive result reported by a second data appliance, the processor is configured to generate an updated model and transmit the updated model to the first data appliance.

39. A method, comprising:
receiving a set of features, including a plurality of n-grams, extracted from a set of files;
determining a reduced set of features that includes at least some of the plurality of n-grams; and
using the reduced set of features to generate a model usable by a data appliance to perform inline malware analysis.

40. A computer program stored on a tangible computer-readable storage medium and comprising a plurality of computer instructions that, when executed, cause a computer to:
receive a set of features, including a plurality of n-grams, extracted from a set of files;
determine a reduced set of features that includes at least some of the plurality of n-grams; and
use the reduced set of features to generate a model usable by a data appliance to perform inline malware analysis.