RU2825975C1

RU2825975C1 - Method of combining large language model and security agent

Info

Publication number: RU2825975C1
Application number: RU2023135731A
Authority: RU
Inventors: Виктор Геннадьевич Сергеев; Владислав Николаевич Тушканов
Original assignee: Акционерное общество "Лаборатория Касперского"
Filing date: 2023-12-28
Publication date: 2024-09-02

Abstract

FIELD: information technology.

SUBSTANCE: invention relates to a method for automatic compilation of requests and responses to a large language model on issues related to computer security. Method includes steps performed on a host, which is a user end device, where: a) receiving input data from a security agent located on said host; b) selecting one or more phrases from the input data; c) leaving at least one text request from selected phrases; d) sending each text request to a large language model located on said host; e) receiving a response from a large language model; f) extracting key phrases from the received response from the large language model; if key phrases at step f) have not been determined, the context is stored in the form of requests and responses from steps c) and f) and this context is transmitted to another large language model located outside the host.

EFFECT: high efficiency of detecting security incidents by using local and external large language models.

6 cl, 5 dwg, 2 tbl

Description

Область техникиField of technology

Изобретение относится к области информационных технологий, а более конкретно к системам и способам использования больших языковых моделей совместно с агентами безопасности на устройствах пользователей.The invention relates to the field of information technology, and more specifically to systems and methods for using large language models together with security agents on user devices.

Уровень техникиState of the art

В настоящий момент все больше компаний используют центры мониторинга информационной безопасности (англ. Security Operations Center, SOC), что связано с растущим количеством информационных угроз, усложнением инфраструктуры компании, а также требованиями со стороны властей и регуляторов по защите данных пользователей от кражи или незаконного использования. В последние годы также вырос и средний размер ущерба от инцидента информационной безопасности (далее - инцидент), связанного с проникновением в инфраструктуру компании, который превысил 5 млн. долларов на 2022 год.Currently, more and more companies are using information security monitoring centers (Security Operations Center, SOC), which is associated with the growing number of information threats, the complexity of the company's infrastructure, as well as the requirements of authorities and regulators to protect user data from theft or illegal use. In recent years, the average amount of damage from an information security incident (hereinafter referred to as an incident) associated with penetration into the company's infrastructure has also increased, exceeding $ 5 million in 2022.

Специалисты SOC обрабатывают данные (так называемые артефакты, т.е. связанные с инцидентом документы и иные файлы), собранную с рабочих мест сотрудников, сетевых устройств и других объектов компьютерной инфраструктуры, чтобы как можно раньше обнаружить и остановить развитие возможного инцидента в атаку. Для мониторинга и сбора данных специалисты могут использовать SIEM (от англ. Security information and event management) решения и EDR (от англ. Endpoint detection and response) продукты, например продукт Kaspersky Unified Monitoring and Analysis Platform, а также сетевые экраны и систем обнаружения вторжений (англ. Intrusion Prevention System, IPS), системы обнаружения аномального поведения пользователей (англ. User and Entity Behavior Analytics, UEBA).SOC specialists process data (the so-called artifacts, i.e. incident-related documents and other files) collected from employee workstations, network devices and other computer infrastructure objects in order to detect and stop a potential incident from developing into an attack as early as possible. For monitoring and collecting data, specialists can use SIEM (Security information and event management) solutions and EDR (Endpoint detection and response) products, such as Kaspersky Unified Monitoring and Analysis Platform, as well as firewalls and intrusion detection systems (IPS), and user and entity behavior analytics (UEBA) systems.

Получив данные, которые могут указывать на возможный инцидент, специалисты SOC определяют, есть ли угроза, и если есть - каковы ее параметры, такие как тип (например, нежелательный контент, несанкционированный доступ, утечка информации) или источник (например, на каком компьютере была обнаружена шпионская программа). При обнаружении возможного инцидента принимаются меры по его устранению и минимизации ущерба.Having received data that may indicate a possible incident, SOC specialists determine whether there is a threat and, if so, what its parameters are, such as the type (e.g., unwanted content, unauthorized access, information leakage) or source (e.g., on which computer the spyware was detected). If a possible incident is detected, measures are taken to eliminate it and minimize damage.

Несмотря на то, что множество функций различных SIEM решений автоматизировано, для корректного определения инцидента все равно используется экспертиза специалиста в области компьютерной безопасности, что может приводить к пропуску возможных инцидентов, так как человек не всегда может иметь доступ к актуальной информации относительно самых последних инцидентов и связанных с ними артефактов. Увеличение количества источников информации, связанных с инцидентами, также не всегда помогает в решении проблемы, потому что требуется время на их добавление в SOC. Кроме того, самим специалистам может не всегда хватать времени и экспертизы для выделения и обработки важных артефактов, так как самих артефактов могут быть тысячи. Данная проблема особенно актуальна в случае APT-атаки (англ. Advanced Persistent Threat), когда паттерны инцидента могут быть неизвестны как специалистам SOC, так и SIEM решениям.Despite the fact that many functions of various SIEM solutions are automated, the expertise of a computer security specialist is still used to correctly identify an incident, which can lead to missing possible incidents, since a person may not always have access to up-to-date information regarding the most recent incidents and related artifacts. Increasing the number of sources of information related to incidents also does not always help in solving the problem, because it takes time to add them to the SOC. In addition, the specialists themselves may not always have enough time and expertise to select and process important artifacts, since there may be thousands of artifacts themselves. This problem is especially relevant in the case of an APT attack (Advanced Persistent Threat), when incident patterns may be unknown to both SOC specialists and SIEM solutions.

Существуют различные автоматические системы для обработки информации по компьютерным инцидентам. Например, предлагаемая в патенте US 8776241 система служит для расследования инцидентов информационной безопасности. Ядро данной системы позволяет определять причины возникновения инцидента и применять соответствующие меры для исправления последствий и предотвращения появления данного инцидента в корпоративной сети. Эффективность данной системы также достигается благодаря возможности интерпретировать системные события и записи системных журналов в доступную и удобную форму. Система позволяет строить связи системных событий и хранить только необходимую информацию для определения лиц, виновных в нарушении политики безопасности, и выявления уязвимостей в инфраструктуре корпоративной сети, из-за которых инцидент безопасности мог произойти. Ключевой особенностью технологии является построение цепочек событий. Описание патента содержит несколько примеров построения данных цепочек. Среди них отслеживание событий в хронологическом порядке, когда события, зафиксированные в корпоративной сети, сортируются по времени и анализируются в заданном порядке. Другой пример составления цепочек основан на связях компьютеров и системных событий. Два компьютера считаются связанными, в том случае если на них зафиксированы события с одним объектом компьютерной системы, например с файлом или ссылкой.There are various automatic systems for processing information on computer incidents. For example, the system proposed in patent US 8776241 is used to investigate information security incidents. The core of this system allows you to determine the causes of an incident and apply appropriate measures to correct the consequences and prevent this incident from occurring in the corporate network. The effectiveness of this system is also achieved due to the ability to interpret system events and system log entries in an accessible and convenient form. The system allows you to build connections between system events and store only the information necessary to determine the persons guilty of violating the security policy and identify vulnerabilities in the corporate network infrastructure due to which a security incident could have occurred. The key feature of the technology is the construction of event chains. The patent description contains several examples of building these chains. Among them is tracking events in chronological order, when events recorded in the corporate network are sorted by time and analyzed in a given order. Another example of creating chains is based on the connections between computers and system events. Two computers are considered linked if they have events recorded with the same computer system object, such as a file or link.

Однако указанная система не может решить описанную выше проблему, связанную с недостатком времени и экспертизы со стороны как специалистов, так и информационных систем. Исследователи угроз часто публикуют результаты своих расследований (тактики, методы и процедуры злоумышленников) в сети Интернет в виде отчетов, презентаций, статей в блогах, твитов и другого контента. Проходит достаточно много времени, прежде чем эти знания систематизируются и применяются в рамках SOC-команд и SIEM решений.However, the system cannot solve the problem described above, which is related to the lack of time and expertise on the part of both specialists and information systems. Threat researchers often publish the results of their investigations (attackers' tactics, methods, and procedures) on the Internet in the form of reports, presentations, blog articles, tweets, and other content. It takes a long time before this knowledge is systematized and applied within SOC teams and SIEM solutions.

Еще одной проблемой использования SIEM решений является ограниченность передаваемых данных с конечных устройств (конечных точек или хостов) пользователей на серверы SIEM решения. На конечном устройстве пользователя происходит большое количество событий: создание новых потоков, создание/открытие файлов (в том числе временных), использование объектов синхронизации, работа с ключами реестра. Установленные на конечных точках агенты безопасности (как правило, это EDR продукты) фильтруют большую часть подобных событий при передаче на сторону SIEM решения. Но в случае сложных случаев (например, APT-атаки или при утечке информации со стороны инсайдера) отфильтрованная информация может быть критически важной. Возникает необходимость использования SOC экспертизы прямо на конечной точке (конечном устройстве пользователя).Another problem with using SIEM solutions is the limited amount of data transferred from endpoints (or hosts) of users to SIEM solution servers. A large number of events occur on the user's endpoint: creating new streams, creating/opening files (including temporary ones), using synchronization objects, working with registry keys. Security agents installed on endpoints (usually EDR products) filter most of these events when transmitting them to the SIEM solution. But in complex cases (for example, APT attacks or information leakage by an insider), the filtered information can be critical. There is a need to use SOC expertise directly on the endpoint (the user's endpoint).

Таким образом, известные технологии имеют недостатки, препятствующие полноценному решению заявленной технической проблемы, поэтому возникает необходимость в заявленном изобретении.Thus, known technologies have shortcomings that prevent a full solution to the stated technical problem, which is why the need for the stated invention arises.

Раскрытие сущности изобретенияDisclosure of the essence of the invention

Технический результат заключается в уменьшении времени сбора и повышении уровня обработки информации (артефактов) во время проведения расследования инцидента информационной безопасности.The technical result consists in reducing the time of collection and increasing the level of processing of information (artifacts) during the investigation of an information security incident.

В варианте реализации используется способ автоматического составления запросов и ответов к большой языковой модели по вопросам, связанным с компьютерной безопасностью, включающий этапы, на которых: получают входные данные от агента безопасности на хосте; выделяют одну или несколько фраз из входных данных; составляют по меньшей мере один текстовый запрос из выделенных фраз; отправляют каждый текстовый запрос к большой языковой модели; получают ответ от большой языковой модели; выделяют ключевые фразы из полученного ответа от большой языковой модели; сохраняют контекст в виде запросов и ответов для передачи к другой большой языковой модели, находящейся вне хоста.In an embodiment, a method is used for automatically composing queries and responses to a large language model on issues related to computer security, which includes the following steps: receiving input data from a security agent on a host; selecting one or more phrases from the input data; composing at least one text query from the selected phrases; sending each text query to a large language model; receiving a response from a large language model; selecting key phrases from the received response from the large language model; storing the context in the form of queries and responses for transmission to another large language model located outside the host.

В еще одном варианте реализации составляют текстовый запрос к большой языковой модели на естественном языке.In another implementation, a text query is generated to a large language model in natural language.

В другом варианте реализации выделяют ключевые фразы из ответа от большой языковой модели с использованием библиотек обработки естественного языка.Another implementation involves extracting key phrases from a response from a large language model using natural language processing libraries.

В еще одном варианте реализации выделяют ключевые фразы из ответа от большой языковой модели с использованием набора регулярных выражений.Another implementation option extracts key phrases from a response from a large language model using a set of regular expressions.

В другом варианте реализации входные данные от элементов центра мониторинга информационной безопасности имеют формат JSON.In another implementation, the input data from the elements of the information security monitoring center is in JSON format.

В еще одном варианте реализации входные данные от элементов центра мониторинга информационной безопасности имеют формат XML.In another embodiment, the input data from the elements of the information security monitoring center is in XML format.

Краткое описание чертежейBrief description of the drawings

Дополнительные цели, признаки и преимущества настоящего изобретения будут очевидными из прочтения последующего описания осуществления изобретения со ссылкой на прилагаемые чертежи, на которых:Additional objects, features and advantages of the present invention will be apparent from reading the following description of the embodiment of the invention with reference to the accompanying drawings, in which:

На фиг. 1A представлена схема работы настоящего изобретения. Fig. 1A shows a diagram of the operation of the present invention.

Фиг. 1Б показывает схему работы настоящего изобретения при использовании большой языковой модели на конечном устройстве пользователя. Fig. 1B shows a diagram of the operation of the present invention when using a large language model on a user's end device.

Фиг. 2А отображает способ работы настоящего изобретения. Fig. 2A shows the mode of operation of the present invention.

Фиг. 2Б отображает способ работы настоящего изобретения при работе большой языковой модели на хосте. Fig. 2B illustrates a method of operation of the present invention when running a large language model on a host.

Фиг. 3 представляет пример компьютерной системы, с помощью которой может быть реализовано настоящее изобретение. Fig. 3 shows an example of a computer system with which the present invention can be implemented.

Осуществление изобретенияImplementation of the invention

Объекты и признаки настоящего изобретения, способы для достижения этих объектов и признаков станут очевидными посредством отсылки к примерным вариантам осуществления. Однако настоящее изобретение не ограничивается примерными вариантами осуществления, раскрытыми ниже, оно может воплощаться в различных видах. Приведенное описание предназначено для помощи специалисту в области техники для исчерпывающего понимания изобретения, которое определяется только в объеме приложенной формулы.The objects and features of the present invention, the methods for achieving these objects and features will become apparent by reference to exemplary embodiments. However, the present invention is not limited to the exemplary embodiments disclosed below, it can be embodied in various forms. The description provided is intended to help a person skilled in the art to fully understand the invention, which is defined only within the scope of the appended claims.

ГлоссарийGlossary

Центр мониторинга информационной безопасности (англ. Security Operations Center, SOC) - структурное подразделение организации, отвечающее за оперативный мониторинг IT-среды и предотвращение инцидентов информационной безопасности (далее - инцидентов). Специалисты SOC собирают и анализируют данные с различных объектов инфраструктуры организации и при обнаружении подозрительной активности принимают меры для предотвращения атаки. Для мониторинга и сбора данных используются такие элементы, как: SIEM-решения, EDR-продукты, NDR и аналогичные им (далее - элементы SOC). Кроме того, в рамках SOC используются также сервисы типа Threat Intelligence Platform.Security Operations Center (SOC) is a structural unit of an organization responsible for operational monitoring of the IT environment and prevention of information security incidents (hereinafter referred to as incidents). SOC specialists collect and analyze data from various infrastructure objects of the organization and, upon detection of suspicious activity, take measures to prevent an attack. The following elements are used for monitoring and data collection: SIEM solutions, EDR products, NDR and similar ones (hereinafter referred to as SOC elements). In addition, SOC also uses services such as the Threat Intelligence Platform.

SIEM (Security Information and Event Management) - класс программных продуктов, предназначенных для сбора и анализа информации о событиях безопасности. В задачи SIEM решения входит:SIEM (Security Information and Event Management) is a class of software products designed to collect and analyze information about security events. The tasks of a SIEM solution include:

отслеживать в режиме реального времени сигналы тревоги, поступающие от сетевых устройств и приложений;monitor real-time alarms from network devices and applications;

обрабатывать полученные данные и находить взаимосвязи между ними;process the received data and find relationships between them;

выявлять отклонения от нормального поведения контролируемых систем;identify deviations from the normal behavior of controlled systems;

оповещать специалистов SOC об обнаруженных инцидентах.Notify SOC specialists about detected incidents.

EDR (Endpoint Detection & Response) - класс решений для обнаружения и изучения вредоносной активности на конечных точках: подключенных к сети рабочих станциях, серверах, устройствах Интернета вещей (англ. Internet of Things, IoT) и так далее.EDR (Endpoint Detection & Response) is a class of solutions for detecting and studying malicious activity on endpoints: network-connected workstations, servers, Internet of Things (IoT) devices, and so on.

Специалист SOC (ИБ-специалист, специалист информационной безопасности) - сотрудник, ответственный за сбор и анализ данных, связанных с инцидентами. Как правило, ИБ-специалисты работают в рамках SOC с продуктами уровня SIEM, EDR и иных средств сбора данных и отслеживания инцидентов.SOC specialist (information security specialist) is an employee responsible for collecting and analyzing data related to incidents. As a rule, information security specialists work within the SOC with SIEM, EDR and other data collection and incident tracking products.

Индикатор компрометации (англ. Indicator of Compromise, IoC) - в сфере компьютерной безопасности наблюдаемый в сети или на конкретном устройстве объект или активность, который с большой долей вероятности указывает на несанкционированный доступ к системе, что указывает и на компрометацию объекта или активности.Indicator of Compromise (IoC) - in the field of computer security, an object or activity observed on a network or a specific device that, with a high degree of probability, indicates unauthorized access to the system, which also indicates a compromise of the object or activity.

В качестве индикатора компрометации, например, выступают:For example, the following serve as indicators of compromise:

необычные DNS-запросы;unusual DNS queries;

подозрительные файлы, приложения и процессы;suspicious files, applications and processes;

IP-адреса и домены, принадлежащие ботнетам или командным серверам вредоносного ПО;IP addresses and domains belonging to botnets or malware command servers;

значительное количество обращений к учетной записи (попытка перебора паролей);a significant number of attempts to access an account (attempts to brute force passwords);

подозрительная активность в учетных записях администраторов или привилегированных пользователей;suspicious activity in administrator or privileged user accounts;

неожиданное обновление ПО;unexpected software update;

установка неразрешенного (нелегитимного) ПО;installation of unauthorized (illegitimate) software;

передача данных через редко используемые порты;data transfer through rarely used ports;

нетипичное для человека поведение на веб-сайте;atypical human behavior on a website;

сигнатура или хеш-сумма вредоносной программы;signature or hash of the malware;

необычный размер HTML-ответов;unusual size of HTML responses;

несанкционированное изменение конфигурационных файлов, реестров или настроек устройства;unauthorized modification of configuration files, registries or device settings;

большое количество неудачных попыток входа в систему.a large number of unsuccessful login attempts.

Большая языковая модель (англ. LLM, large language model) - это языковая модель, состоящая из нейронной сети со множеством параметров (обычно миллиарды весовых коэффициентов и более), обученной на большом количестве текстов.A large language model (LLM) is a language model consisting of a neural network with many parameters (usually billions of weights or more) trained on a large amount of text.

На фиг. 1А представлена схема работы настоящего изобретения. ИБ-специалисты 101, элементы SOC 102 взаимодействуют с обученной большой языковой моделью (далее - LLM) 105 через обработчик запросов 104. Элементами SOC 102 являются SIEM решения, EDR-продукты и аналогичные им. В одном из вариантов реализации ИБ-специалист 101 взаимодействует с LLM 105 через веб-интерфейс обработчика 104. Элементы SOC 102 собирают данные с конечных точек 107 инфраструктуры предприятия 106 для сбора и анализа информации о событиях безопасности. Примером конечных точек 107 являются компьютеры (хосты) и такие устройства как смартфоны, роутеры, IoT-устройства. В предпочтительном варианте реализации на конечных точках 107 установлены EDR-решения. Fig. 1A shows the operation diagram of the present invention. Information security specialists 101 , SOC elements 102 interact with the trained large language model (hereinafter referred to as LLM) 105 via the request handler 104. The SOC elements 102 are SIEM solutions, EDR products and similar ones. In one embodiment, the information security specialist 101 interacts with the LLM 105 via the web interface of the handler 104. The SOC elements 102 collect data from the endpoints 107 of the enterprise infrastructure 106 to collect and analyze information about security events. An example of endpoints 107 are computers (hosts) and devices such as smartphones, routers, IoT devices. In a preferred embodiment, EDR solutions are installed on the endpoints 107 .

Инфраструктура предприятия 106 может также содержать и другие устройства, не отмеченные на фиг. 1А, а именно системы СКУД (системы контроля и управления доступом, англ. access control), SCADA (Supervisory Control And Data Acquisition, диспетчерское управление и сбор данных), блоки PLC (Programmable Logic Controller, программируемый контроллер) и другие программно-аппаратные системы современного предприятия.The infrastructure of the enterprise 106 may also contain other devices not shown in Fig. 1A , namely, ACS (access control and management systems), SCADA (Supervisory Control And Data Acquisition), PLC (Programmable Logic Controller) units and other software and hardware systems of a modern enterprise.

В одном из вариантов реализации LLM 105 осуществляет работу на основании запросов на естественном языке. В этом случае обработчик 104 выполняет функции перевода запросов ввода-вывода от элементов SOC в необходимый для LLM формат ввода и наоборот (например, JSON/XML).In one embodiment, LLM 105 operates based on natural language queries. In this case, handler 104 performs the functions of translating input/output queries from SOC elements into the input format required by LLM and vice versa (e.g., JSON/XML).

В еще одном из вариантов реализации LLM 105 дополнительно обучена на текстах, взятых из различных блогов, статей и публикаций, связанных с инцидентами информационной безопасности (например, DFIR Report) и исследованиями угроз (например, securelist.com).In another implementation, LLM 105 is additionally trained on texts taken from various blogs, articles, and publications related to information security incidents (e.g., DFIR Report) and threat research (e.g., securelist.com).

Обработчик 104 также используется для хранения контекста, который используется при взаимодействии с LLM 105 в виде запросов-ответов для того, чтобы эффективность взаимодействия с LLM 105 была максимальной. Контекст включает все ранее сделанные запросы к LLM 105, а также полученные ответы.The handler 104 is also used to store the context that is used when interacting with the LLM 105 in the form of requests and responses in order to maximize the efficiency of interaction with the LLM 105. The context includes all previously made requests to the LLM 105 as well as the responses received.

В одном из вариантов реализации обработчик 104 реализован с использованием библиотек обработки естественного языка, такими как SpaCy, NTLK и другими. В еще одном варианте реализации обработчик 104 использует набор регулярных выражений для выделений элементов текста из полученных ответов от LLM 105. В качестве элемента текста используется токен (англ. token), N-грамма или лексема.In one embodiment, the handler 104 is implemented using natural language processing libraries such as SpaCy, NTLK, and others. In another embodiment, the handler 104 uses a set of regular expressions to extract text elements from the received responses from the LLM 105. A token, N-gram, or lexeme is used as a text element.

ИБ-специалисты 101 и элементы SOC 102 передают в обработчик 104 собранные с одного или нескольких устройств 107 данные о:Information security specialists 101 and SOC elements 102 transmit to the processor 104 data collected from one or more devices 107 about:

модулях, сконфигурированных для автоматического выполнения (автозапуск, англ. autorun);modules configured for automatic execution (autorun);

запущенных процессах и их командных строках;running processes and their command lines;

событиях из журналов операционной системы (далее - ОС): установки служб, запуски процессов, выполнения скриптов PowerShell и т.д.;events from operating system (hereinafter referred to as OS) logs: installation of services, process launches, execution of PowerShell scripts, etc.;

событиях из журналов приложений (антивируса, браузера или средства удаленного управления системой);events from application logs (antivirus, browser or remote system management tool);

метаинформации по файлам (как процессам, там и файлам системы): имя, путь, хеш, метаинформацию из манифеста (оригинальное имя, описание, имя разработчика), информацию по цифровой подписи, строки из файла, отладочную информацию, имена секций, дизассемблированный/декомпилированный код точек входа в исполняемый файл (или экспортируемом DLL библиотекой функций), часть текстового-файла скрипта;meta-information on files (both processes and system files): name, path, hash, meta-information from the manifest (original name, description, developer name), information on the digital signature, lines from the file, debug information, section names, disassembled/decompiled code of entry points into the executable file (or exported DLL function library), part of the text file of the script;

информации по исполнению файла (например, в эмуляторе или в виртуальной машине), которая включает журнал вызовов системных API-функций и их аргументы, а также дополнительную информацию, например, дескрипторы открытых файлов.information about the execution of a file (for example, in an emulator or in a virtual machine), which includes a log of calls to system API functions and their arguments, as well as additional information such as open file descriptors.

LLM 105 не может корректно принимать на вход подобные данные, поэтому обработчик 104 преобразует эти данные в запросы такого вида:LLM 105 cannot correctly accept such data as input, so handler 104 converts this data into requests of this type:

Является ли следующее имя службы Windows ‘$ServiceName’ со следующей строкой запуска ‘$Servicecmd’ индикатором компрометации?Is the following Windows service name '$ServiceName' with the following startup string '$Servicecmd' an indicator of compromise?

Является какая-нибудь из следующих служб Windows c соответствующими командными строками индикатором компрометации?Are any of the following Windows services and their corresponding command lines indicators of compromise?

[[

{{

“ServiceName”: “ServiceName1”,"ServiceName": "ServiceName1",

“Servicecmd”: “Servicecmd1”“Servicecmd”: “Servicecmd1”

},},

{{

“ServiceName”: “ServiceName2”,"ServiceName": "ServiceName2",

“Servicecmd”: “Servicecmd2”“Servicecmd”: “Servicecmd2”

},},

]]

Ответ от LLM 105 включает текстовое описание, которое может быть легко интерпретировано со стороны ИБ-специалиста 101. Пример ответа:The answer from LLM 105 includes a text description that can be easily interpreted by the IS Specialist 101. Sample answer:

“Да, указанная комбинация запущенного процесса и командной строки может быть индикатором компрометации”.“ Yes, the specified combination of running process and command line may be an indicator of compromise .”

Также LLM 105 предоставляет информацию к ответу, такую как, например:LLM 105 also provides information to support the answer, such as:

“Метаданные были индикаторами компрометации, например, так: «командная строка пытается загрузить файл с внешнего сервера» или «она использует флаг -ep bypass, который приказывает PowerShell обойти стандартные проверки безопасности»”.“ The metadata was indicators of compromise, such as “the command line is trying to download a file from an external server” or “it is using the -ep bypass flag, which tells PowerShell to bypass standard security checks.”

Обработчик 104 интерпретирует ответ от LLM 105, выделяя ключевые фразы. Примером такой фразы является ответ, включающий слова “да”, “yes”, который дают утвердительный ответ на изначальные запросы. Кроме того, для упрощения интерпретации ответа от LLM 105, обработчик 104 вставляет дополнительно следующий текст к запросу:The handler 104 interprets the response from the LLM 105 by extracting key phrases. An example of such a phrase is a response including the words “yes”, which gives an affirmative answer to the initial requests. In addition, to simplify the interpretation of the response from the LLM 105 , the handler 104 additionally inserts the following text to the request:

Твой ответ должен быть json объектом со следующей структурой:Your response should be a json object with the following structure:

{short_answer:”yes/no”,{short_answer:”yes/no”,

long_answer_with description:full_text,long_answer_with description:full_text,

other_data_required_to_investigate_incident:full_textother_data_required_to_investigate_incident:full_text

}}

При обнаружении подобных ключевых фраз обработчик 104:When such key phrases are detected, the 104 handler:

передает сигнал тревоги (alert) в элементы SOC 102, а также ИБ-специалисту 101;transmits an alarm signal (alert) to SOC elements 102 , as well as to the information security specialist 101 ;

отправляет запрос на сбор дополнительных данных с одного из устройств 107, на которых были собраны первоначальные данные для запроса;sends a request to collect additional data from one of the devices 107 on which the initial data for the request was collected;

запускает (послужит триггером) для мер по реагированию (например, блокирует один или несколько хостов 107).triggers (serves as a trigger) for response measures (e.g. blocking one or more hosts 107 ).

Таким образом, обработчик 104 работает с LLM 105 по шаблону “вопрос-ответ”, где в рамках запросов к LLM 105 использует ключевые фразы для составления вопросов, а при анализе ответов от LLM 105 опирается на поиск ключевых фраз для поиска утвердительного ответа на изначальный вопрос или определение необходимости дополнительных вопросов.Thus, the handler 104 works with the LLM 105 using a question-answer pattern, where, within the framework of requests to the LLM 105 , it uses key phrases to formulate questions, and when analyzing the responses from the LLM 105, it relies on the search for key phrases to find an affirmative answer to the original question or to determine the need for additional questions.

На одном или нескольких хостов 107 помимо агента безопасности 110 установлена агентская LLM 112 и обработчик запросов 111. В одном из вариантов реализации в качестве агента безопасности 110 выступает EDR решение, например, Kaspersky Endpoint Security for Windows / Linux.On one or more hosts 107, in addition to the security agent 110, an agent LLM 112 and a request handler 111 are installed. In one embodiment, an EDR solution, such as Kaspersky Endpoint Security for Windows/Linux, acts as the security agent 110 .

От агента 110 можно получить большое количество информации по событиям, которые происходят в ОС на хосте 107. Список событий доступен в виде журнала событий или через API и включает по меньшей мере одной из:From agent 110, a large amount of information can be obtained on events that occur in the OS on host 107 . The list of events is available as an event log or via an API and includes at least one of:

работа с файловой системой - прежде всего, открытие файлов для записи;working with the file system - primarily opening files for writing;

работа с реестром;working with the registry;

запуск потоков в контролируемых процессах;launching threads in controlled processes;

удаленный вызов процедур (remote procedure call);remote procedure call;

вызов инструментария управления Windows (WMI);Windows Management Instrumentation (WMI) call;

открытие сетевых соединений.opening network connections.

Огромный поток подобных событий обрабатывается с помощью агента 110 только на стороне хоста 107, лишь часть этих событий по определенным фильтрам передается на сторону элементов SOC 102.A huge flow of such events is processed using agent 110 only on the host side 107 , only a part of these events is transmitted to the side of SOC elements 102 according to certain filters.

Обработчик запросов 111 аналогичен обработчику запросов 104. В одном из вариантов реализации агентская LLM 112 отличается от LLM 105 по ряду параметров в виду ограниченности ресурсов хоста 107.Request handler 111 is similar to request handler 104 . In one embodiment, agent LLM 112 differs from LLM 105 in a number of parameters due to the limited resources of host 107 .

Создание таких моделей как LLM включает два этапа - сначала большая языковая модель обучается раз в несколько дней/недель/месяцев на этапе pretraining. Данный этап очень ресурсоемок - может использоваться несколько тысяч специализированных ускорителей расчетов (или GPU), процесс занимает от нескольких дней до недель. На выходе получается базовая модель (англ. base model), которая имеет настроенный набор из нескольких миллиардов весов.The creation of such models as LLM involves two stages - first, a large language model is trained every few days/weeks/months at the pretraining stage. This stage is very resource-intensive - several thousand specialized calculation accelerators (or GPUs) can be used, the process takes from several days to weeks. The output is a base model, which has a configured set of several billion weights.

На втором этапе, который может происходить раз в несколько дней, делают тонкую настройку (англ. finetuning), когда с помощью экспертов модель обрабатывает обратную связь по запросам к базовой модели. На выходе получают модель-ассистент (англ. assistant model). В качестве агентской LLM 112 используется как вторая модель-ассистент, так и базовая модель (предпочтительно).At the second stage, which can happen every few days, fine tuning is done, when the model processes feedback on requests to the base model with the help of experts. The output is an assistant model. Both the second assistant model and the base model (preferably) are used as an agent LLM 112 .

Для работы в области информационной безопасности может применяться доменная адаптация с помощью технологий P-Tuning, LoRA, QLoRA и других. Над такой адаптацией работают эксперты по информационной безопасности, в частности, SOC-эксперты 101.Domain adaptation can be used for work in the field of information security using P-Tuning, LoRA, QLoRA and other technologies. Information security experts, in particular SOC experts 101 , work on such adaptation.

Агентская LLM 112 может быть создана из открытых моделей: Llama 2, MPT-7B, Falcon и другие, в том числе и производные модели (например, Meta LLaMA 2-chat). Данные модели представляют готовую базовую модель, которую требуется дообучить на этапе тонкой настройки. Кроме того, они имеют различные параметры, ключевым из которых является размер модели - он имеет размер в несколько миллиардов параметров (весов) и может достигать десятков миллиардов.Agent LLM 112 can be created from open models: Llama 2, MPT-7B, Falcon and others, including derivative models (for example, Meta LLaMA 2-chat). These models represent a ready-made basic model that needs to be trained at the fine-tuning stage. In addition, they have various parameters, the key one being the size of the model - it has a size of several billion parameters (weights) and can reach tens of billions.

Ввиду ограниченности ресурсов на хосте 107 подбирается та модель агентской LLM 112, которая не будет сильно загружать текущие ресурсы (прежде всего - память / видеопамять). Тонкая настройка агентской LLM 112 включает работы лишь с той информацией, которую можно получить от агента 110 - прежде всего, это данные по файловым операциям, работой с реестром и т.д.Due to limited resources on host 107 , the agent LLM 112 model is selected that will not heavily load current resources (primarily memory/video memory). Fine tuning of the agent LLM 112 includes work only with the information that can be obtained from agent 110 - first of all, this is data on file operations, work with the registry, etc.

В одном из вариантов реализации перед установкой агентской LLM 112 на хосте 107 проверяется наличие специализированного ускорителя и объема памяти, который может быть использован. Например, как представлено в таблице 1.In one embodiment, before installing the agent LLM 112 on the host 107 , the presence of a specialized accelerator and the amount of memory that can be used are checked. For example, as shown in Table 1.

Таблица 1Table 1 Объем видеопамятиVideo memory capacity Размер параметров моделиSize of model parameters < 48 Гб< 48 GB 7 млрд7 billion 49 - 72 Гб49 - 72 GB 13 млрд13 billion 73 - 128 Гб73 - 128 GB 70 млрд 70 billion

После квантизации сети объем потребления памяти существенно снижается. Например, для метода квантизации Q5_K_S представлено в таблице 2.After network quantization, the memory consumption is significantly reduced. For example, for the Q5_K_S quantization method, it is presented in Table 2.

Таблица 2Table 2 Тип моделиModel type Объем памятиMemory capacity Llama-2-7BLlama-2-7B 7.15 GB7.15 GB Llama-2-13BLlama-2-13B 11.47 GB11.47 GB Llama-2-70BLlama-2-70B 33.07 GB33.07 GB

В еще одном варианте реализации используется специализированный ускоритель расчетов для LLM (например, Instinct MI300X, Intel Gaudi2).Another implementation option uses a dedicated LLM accelerator (e.g. Instinct MI300X, Intel Gaudi2).

В других вариантах реализации агентская LLM 112 может быть развернута вне хоста 107.In other embodiments, the agent LLM 112 may be deployed outside of the host 107 .

Другие варианты тонкой настройки агентской LLM 112 зависят от роли хоста 107 и используемого агента 110. Для почтового сервера тонкая настройка будет использовать обработку данных, связанных с почтовыми сообщениями (для определения фишинга и спама).Other options for fine-tuning the agent LLM 112 depend on the host role 107 and the agent 110 used. For the mail server, fine-tuning will involve processing data associated with mail messages (to detect phishing and spam).

В одном из вариантов реализации агент 110 имеет уже настроенную агентскую LLM 112 для установки на целевой хост 107. В других вариантах реализации агентская LLM 112 устанавливается на хост 107 отдельной и запускается в виде отдельного процесса. Работа с агентской LLM 112 происходит через обработчик запросов 111, который может быть связан через API с агентом 110.In one embodiment, the agent 110 has an already configured agent LLM 112 for installation on the target host 107 . In other embodiments, the agent LLM 112 is installed on the host 107 separately and runs as a separate process. Work with the agent LLM 112 occurs through the request handler 111 , which can be connected via an API to the agent 110 .

Агентская LLM 112 имеет возможность использовать ранее сохраненный контекст от агента 110 в виде таких инструментов как RAG (англ. retrieval-augmented generation) - генерация с расширенным поиском. В таком случае агентская LLM 112 получает доступ к сохраненным журналам (логам) от агента 110 для получения наиболее релевантной информации по событиям на хосте 107. Например, при запросе через обработчик запросов 111 в виде:The agent LLM 112 has the ability to use previously saved context from the agent 110 in the form of such tools as RAG (retrieval-augmented generation) - generation with advanced search. In this case, the agent LLM 112 gets access to saved logs from the agent 110 to obtain the most relevant information on events on the host 107. For example, when requesting through the request handler 111 in the form:

Is the combination of process “C:\Users\User01\AppData\Local\Temp\dOlnolh.exe” and command line “ - EgUxRCdQ” is an indicator of compromise?Is the combination of process “C:\Users\User01\AppData\Local\Temp\dOlnolh.exe” and command line “ - EgUxRCdQ” is an indicator of compromise?

Дополнительно загружается информация обо всех ранее запущенных процессах (службах), ранее отмеченные флаги компрометации на системе, которые происходили в течение длительного времени. Это позволяет обрабатывать сложные варианты атак, которую могут происходит в течение недель или даже месяцев.Additionally, information about all previously launched processes (services), previously marked flags of compromise on the system that occurred over a long period of time is loaded. This allows processing complex variants of attacks that can occur over weeks or even months.

Инструменты, подобные llama_index, позволяют подгружать данные от агента 110 даже в виде машиночитаемых данных, не используя обработчик запросов 111 напрямую. В качестве данных используются дампы баз данных, CSV, JSON, XML, Parquet форматы данных и другие.Tools like llama_index allow loading data from agent 110 even in the form of machine-readable data, without using the 111 query handler directly. Database dumps, CSV, JSON, XML, Parquet data formats and others are used as data.

В еще одном варианте реализации агентская LLM 112 делает обращение к LLM 105 через обработчик запросов 104. В еще одном варианте реализации LLM 105 имеет доступ к контексту агентской LLM 112 при работе с агентом 110.In another embodiment, the agent LLM 112 makes a call to the LLM 105 through the request handler 104. In another embodiment, the LLM 105 has access to the context of the agent LLM 112 when working with the agent 110 .

Дополнительно агентская LLM 112 может расширять функционал взаимодействия агента 110 с элементами SOC 102. Например, агент 110 имеет реализацию в виде EDR решения, которое отправляет ограниченный набор логов в SIEM систему, что не позволяет собрать всей полноценной картины возможного инцидента. В таком случае агентская LLM 112 используется для сбора и обработки всех журналов событий, которые использует EDR решение. Это позволит автоматизировать ряд действий, которые обычно выполняет специалист 101. Примерами могут быть:Additionally, the agent LLM 112 can expand the functionality of the interaction of agent 110 with SOC 102 elements. For example, agent 110 has an implementation in the form of an EDR solution that sends a limited set of logs to the SIEM system, which does not allow collecting the entire full picture of a possible incident. In this case, the agent LLM 112 is used to collect and process all event logs that the EDR solution uses. This will automate a number of actions that are usually performed by specialist 101. Examples may be:

Если при анализе логов агентская LLM 112 обнаружила подозрительное поведение исполняемого файла, который был обнаружен с помощью агента 110, агентский LLM 112 может запросить инструмент, получить файл с хоста 107 и запросить его анализ в песочнице, чтобы потом предоставить больше контекста для анализа инцидента.If, while analyzing logs, agent LLM 112 detects suspicious behavior of an executable that was discovered by agent 110 , agent LLM 112 can query the tool, retrieve the file from host 107 , and request its analysis in the sandbox to then provide more context for incident analysis.

Агентская LLM 112 может делать запросы в сервис Kaspersky Threat Intelligence Portal (или другие аналогичные решения, такие как MISP, AlienVault OTX) для получения расширенной статистики по файлу или URL адресу.Agent LLM 112 can make requests to the Kaspersky Threat Intelligence Portal service (or other similar solutions such as MISP, AlienVault OTX) to obtain advanced statistics on a file or URL.

Агентская LLM 112 делает дополнительный запрос в Kaspersky Threat Intelligence Portal по имеющейся информации о криминальных группах, которые использовали вредоносные инструменты (файлы и URL ссылки, найденные уязвимости и т.д.), которые были обнаружены с помощью агента 110. Это также позволит сэкономить время для специалистов 101.Agent LLM 112 makes an additional request to the Kaspersky Threat Intelligence Portal based on available information about criminal groups that used malicious tools (files and URL links, vulnerabilities found, etc.) that were detected using agent 110 . This will also save time for specialists 101 .

Агентская LLM 112 может выполнить функции агента 110 для закрытия уязвимостей с помощью системы Patch Management, если таковая не используется с помощью агента 110 напрямую.The agent LLM 112 may perform the functions of the agent 110 to close vulnerabilities using the Patch Management system if it is not used directly by the agent 110 .

При обнаружении подозрительного файла на хосте 107 с помощью агента 110 агентская LLM 112 возьмет на себя функции аналитика 101 для сбора информации о способе получения файла. Например, если файл появился на диске после сохранения из вложения в электронном письме, то агентская LLM 112 может сделать запрос на почтовый сервер для сбора информации по данной рассылке и передать как часть инцидента на сторону SOC 102.When a suspicious file is detected on host 107 using agent 110, agent LLM 112 will take on the functions of analyst 101 to collect information about the method of obtaining the file. For example, if the file appeared on the disk after being saved from an email attachment, then agent LLM 112 can make a request to the mail server to collect information about this mailing and transfer it as part of the incident to the SOC side 102 .

На этапе 201 происходит получение входных данных в центр мониторинга информационной безопасности (SOC). Как правило, элементы SOC 102 собирают входные данные (примеры собираемых данных описаны выше) с конечных точек 107 инфраструктуры предприятия 106.At step 201, input data is received by the security information monitoring center (SOC). Typically, SOC elements 102 collect input data (examples of collected data are described above) from endpoints 107 of the enterprise infrastructure 106 .

На этапе 202 происходит формирование запроса в LLM 105. Так как запросы к LLM 105 формируются на естественном языке, то обработчик 104 добавляет следующие ключевые фразы к входным данным:At step 202, a query is generated in LLM 105. Since queries to LLM 105 are generated in natural language, the handler 104 adds the following key phrases to the input data:

“Is the combination of: ” - задается перечисление необходимых данных,“ Is the combination of: ” - specifies the list of required data,

“is an indicator of compromise”, “is a threat”, “is an attack”, “is a cyber attack”, “is malicious action/file” - вопрос по угрозе,“ is an indicator of compromise ”, “ is a threat ”, “ is an attack ”, “ is a cyber attack ”, “ is malicious action/file ” - a question about the threat,

“think it step by step” - требование дать ответ с разбивкой по шагам,“ think it step by step ” - a requirement to give an answer broken down into steps,

“is presence file on disk” - запрос по определенному файлу.“ is presence file on disk ” - a request for a specific file.

Указанные фразы хранятся как в самом обработчике 104 (например, в виде списка), так и в отдельной базе данных (не отображена на Фиг. 1А).The specified phrases are stored both in the handler 104 itself (for example, in the form of a list) and in a separate database (not shown in Fig. 1A ).

Запросы, например, могут иметь вид:Queries, for example, may look like:

Is the combination of process “C:\Users\User01\AppData\Local\Temp\dOlnolh.exe” and command line “ -EgUxRCdQ” is an indicator of compromise?Is the combination of process “C:\Users\User01\AppData\Local\Temp\dOlnolh.exe” and command line “ -EgUxRCdQ” is an indicator of compromise?

Согласно представленному запросу входными данными являются:According to the submitted request, the input data are:

Тип объектаObject type ОбъектObject Процесс (process)Process C:\Users\User01\AppData\Local\Temp\dOlnolh.exeC:\Users\User01\AppData\Local\Temp\dOlnolh.exe Командная строка (command line)Command line -EgUxRCdQ-EgUxRCdQ

Данные о процессе и командной строке его запуска были получены от EDR, установленного на одном из хостов 107. Обработчик 104 перечисляет указанные объекты в запросе и добавляет указанные выше ключевые фразы для формирования запроса на естественном языке.The process and its launch command line data were obtained from EDR installed on one of the hosts 107 . Handler 104 lists the specified objects in the request and adds the above key phrases to form a natural language request.

На этапе 203 обработчик 104 получает ответ от LLM 105 и анализирует его. Так как LLM 105 выдает ответ в виде текста на естественном языке, то обработчик 104 использует подключенные библиотеки обработки естественного языка, такие как SpaCy или NTLK, для выделения нужных терминов и фраз. Анализ включает выделение заранее известных фраз или элементов текста. В качестве элемента текста используется токен (англ. token), N-грамма или лексема. В одном из вариантов реализации выделение известных фраз реализуется с помощью регулярных выражений.In step 203, the processor 104 receives the response from the LLM 105 and analyzes it. Since the LLM 105 produces a response in the form of natural language text, the processor 104 uses connected natural language processing libraries, such as SpaCy or NTLK, to extract the desired terms and phrases. The analysis includes the extraction of previously known phrases or text elements. A token, N-gram, or lexeme is used as a text element. In one embodiment, the extraction of known phrases is implemented using regular expressions.

На этапе 204 определяют, была ли выделена угроза по результатам ответа от LLM 105. Критерием выделения угрозы является наличие в ответе определенных фраз. Один из вариантов реализации основан на поиске в ответе слов «yes», «да» и других утвердительных вариантов ответа. Другие ключевые фразы: «This could be an indicator of compromise» (это может быть индикатором компрометации), «This is suspicious behaviour» (подозрительное поведение). Если угроза выделена, то на этапе 205 происходит сбор необходимых артефактов с помощью элементов SOC 102 и дальнейший разбор инцидента / угрозы в рамках SOC с привлечением специалистов 101.At step 204 it is determined whether a threat has been identified based on the response from LLM 105 . The criterion for identifying a threat is the presence of certain phrases in the response. One implementation option is based on searching for the words " yes " and other affirmative response options in the response. Other key phrases are: " This could be an indicator of compromise ", " This is suspicious behaviour". If a threat has been identified, then at step 205 the necessary artifacts are collected using SOC elements 102 and the incident/threat is further analyzed within the SOC with the involvement of specialists 101 .

Если определенные фразы на этапе 204 не были определены, то переходят к этапу 206.If certain phrases were not determined at step 204 , then proceed to step 206 .

На этапе 206 определяют, нужны ли еще данные для LLM 105. Ключевыми фразами в таком случае являются: «should be further investigated» (продолжить исследование), «additional data required» (необходимы дополнительные данные), «necessary to analyze further» (необходимо провести дополнительный анализ) и другие аналогичные фразы. Указанные фразы хранятся как в самом обработчике 104 (например, в виде списка), так и в отдельной базе данных (не отображена на фиг. 1А). Дополнительно анализируется, какого типа информация требуется в зависимости от наличия следующих слов:At step 206 it is determined whether further data is needed for LLM 105 . The key phrases in this case are: "should be further investigated ", "additional data required ", " necessary to analyze further " and other similar phrases. These phrases are stored both in the processor 104 itself (e.g., as a list) and in a separate database (not shown in Fig. 1A ). In addition, the type of information required is analyzed depending on the presence of the following words:

Endpoint, host, local - означает необходимость сбора информации с хостов 107. Endpoint, host, local - means the need to collect information from hosts 107 .

Registry - требуется информация из реестра. Registry - information from the registry is required.

File - требуется информация о файловой активности. File - file activity information is required.

Network - необходима информация о сетевой активности. Network - information about network activity is required.

Process - требуется информация о процессах. Process - information about processes is required.

Autorun - требуется дополнительная информация о ключах автозапуска. Autorun - additional information about autorun keys is required.

PowerShell - дополнительно требуется анализ скриптов PowerShell. PowerShell - additional analysis of PowerShell scripts is required.

Кроме того, анализируется контекст, связанный с данными словами. Например, для слова «process» может быть определено уточнение «running process», что означает только запущенные процессы.In addition, the context associated with these words is analyzed. For example, the word " process " may be qualified with the qualification " running process ", which means only running processes.

В том случае, если обработчик 104 не смог выделить, какие дополнительные данные могут быть необходимы, то работа способа завершается на этапе 207.In the event that the handler 104 was unable to identify what additional data may be needed, then the method terminates at step 207 .

После определения необходимой дополнительной информации переходят к этапу 208.After determining the necessary additional information, proceed to step 208 .

На этапе 208 собирают на одном из элементов SOC 102 дополнительную информацию. Например, для сбора информации с одного из хостов 107 используется EDR-решение. Затем работа способа повторяется с этапа 202 уже с новыми собранными данными, пока способ не закончит свою работу либо на этапе 205, либо на этапе 207.At step 208, additional information is collected from one of the SOC elements 102. For example, an EDR solution is used to collect information from one of the hosts 107. The method then repeats from step 202 with the new collected data until the method finishes its work either at step 205 or at step 207 .

LLM 105 на фиг. 1А можно дополнительно дообучать. На одном или нескольких целевых хостах 107 (это могут быть в том числе и виртуальные машины и объекта типа sandbox) происходит целенаправленное заражение известными примерами (сэмплами) вредоносных программ. Например, можно установить фреймворк PowerShell Empire и запустить несколько типовых вредоносных процедур. После этого выполняется способ по Фиг. 2А для выделения нужных запросов и ответов со стороны LLM 105.LLM 105 in Fig. 1A can be further trained. On one or several target hosts 107 (these can be virtual machines and sandbox-type objects) a targeted infection with known examples (samples) of malware occurs. For example, the PowerShell Empire framework can be installed and several typical malicious procedures can be launched. After this, the method in Fig. 2A is performed to select the necessary requests and responses from LLM 105 .

Например, в ответ на запрос:For example, in response to the request:

Is the following is indicator of compromise:Is the following is indicator of compromise:

Launch String: “C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe” -NonI -W hidden -c “IEX ([Text.Encoding]::UNICODE.GetString([Convert]::FromBase64String((gp HKCU:\Software\Microsoft\Windows\CurrentVersion debug).debug)))”Launch String: “C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe” -NonI -W hidden -c “IEX ([Text.Encoding]::UNICODE.GetString([Convert]::FromBase64String(( gp HKCU:\Software\Microsoft\Windows\CurrentVersion debug).debug)))”

LLM 105 выдает следующий ответ:LLM 105 gives the following answer:

Yes, this could be an indicator of compromise. It is possible that the attacker has created a malicious task in the registry to launch PowerShell with a hidden flag and encoded base64 string. This code could potentially execute malicious commands on the system, such as installing malware or stealing data. In order to confirm if this indeed an indicator of compromise, it would be necessary to analyze script further and look for other clues that indicate malicious activity.Yes, this could be an indicator of compromise. It is possible that the attacker has created a malicious task in the registry to launch PowerShell with a hidden flag and encoded base64 string. This code could potentially execute malicious commands on the system, such as installing malware or stealing data. In order to confirm if this indeed is an indicator of compromise, it would be necessary to analyze the script further and look for other clues that indicate malicious activity.

Данный ответ означает:This answer means:

“Да, это может быть индикатором компрометации. Возможно, злоумышленник создал вредоносную задачу в реестре для запуска PowerShell со скрытым флагом и закодированной строкой base64. Этот код потенциально может выполнять вредоносные команды в системе, такие как установка вредоносного ПО или кража данных. Для того чтобы подтвердить, действительно ли это является индикатором компрометации, необходимо проанализировать сценарий дополнительно и осуществить поиск других признаков, указывающих на вредоносную активность.”“Yes, this could be an indicator of compromise. It is possible that the attacker has created a malicious task in the registry to launch PowerShell with a hidden flag and a base64 encoded string. This code could potentially execute malicious commands on the system, such as installing malware or stealing data. In order to confirm whether this is indeed an indicator of compromise, it is necessary to analyze the script further and look for other signs that indicate malicious activity.”

Выполнив несколько последовательных запросов к LLM 105 с различными параметрами, которые касаются вредоносной активности на хостах 107 и получив необходимые ответы, можно дополнить контекст, который будет использоваться при переобучении / дообучения LLM 105.By running several consecutive queries to LLM 105 with different parameters that relate to malicious activity on hosts 107 and receiving the necessary responses, it is possible to supplement the context that will be used when retraining / additionally training LLM 105 .

На этапе 211 происходит получение входных данных от агента безопасности 110. Входными данными может быть журнал событий или вызов API функции, который предоставляется со стороны агента 110.At step 211, input data is received from the security agent 110 . The input data may be an event log or a call to an API function that is provided by the agent 110 .

На этапе 212 происходит формирование запроса в агентскую LLM 112. Так как запросы к агентской LLM 112 формируются на естественном языке, то обработчик 111 добавляет следующие ключевые фразы к входным данным:At step 212 , a request is generated for the agent LLM 112. Since requests for the agent LLM 112 are generated in natural language, the handler 111 adds the following key phrases to the input data:

Указанные фразы хранятся как в самом обработчике 111 (например, в виде списка), так и в отдельной базе данных (не отображена на фиг. 1Б).The specified phrases are stored both in the handler 111 itself (for example, in the form of a list) and in a separate database (not shown in Fig. 1B ).

В еще одном варианте реализации инструменты, подобные llama_index, позволяют подгружать данные от агента 110 даже в виде машиночитаемых данных, не используя обработчик 111 напрямую. В качестве данных используются дампы баз данных, CSV, JSON, XML, Parquet форматы данных и другие.In another implementation option, tools like llama_index allow loading data from agent 110 even in the form of machine-readable data, without using handler 111 directly. Database dumps, CSV, JSON, XML, Parquet data formats and others are used as data.

На этапе 213 обработчик 111 получает ответ от агентской LLM 112 и анализирует его. Так как агентская LLM 112 выдает ответ в виде текста на естественном языке, то обработчик 111 использует подключенные библиотеки обработки естественного языка, такие как SpaCy или NTLK, для выделения нужных терминов и фраз. Анализ включает выделение заранее известных фраз или элементов текста. В качестве элемента текста используется токен (англ. token), N-грамма или лексема. В одном из вариантов реализации выделение известных фраз реализуется с помощью регулярных выражений.At step 213, the handler 111 receives the response from the agent LLM 112 and analyzes it. Since the agent LLM 112 produces a response in the form of natural language text, the handler 111 uses connected natural language processing libraries, such as SpaCy or NTLK, to extract the desired terms and phrases. The analysis includes the extraction of previously known phrases or text elements. A token, N-gram, or lexeme is used as a text element. In one embodiment, the extraction of known phrases is implemented using regular expressions.

На этапе 214 определяют, была ли выделена угроза по результатам ответа от агентской LLM 112. Критерием выделения угрозы является наличие в ответе определенных фраз. Один из вариантов реализации основан на поиске в ответе слов «yes», «да» и других утвердительных вариантов ответа. Другие ключевые фразы: «This could be an indicator of compromise» (это может быть индикатором компрометации), «This is suspicious behaviour» (подозрительное поведение). Если угроза выделена, то на этапе 215 происходит сбор необходимых артефактов с помощью агент 110 и с привлечением специалистов 101.At step 214 it is determined whether a threat has been identified based on the response from the agent LLM 112 . The criterion for identifying a threat is the presence of certain phrases in the response. One implementation option is based on searching for the words " yes " and other affirmative response options in the response. Other key phrases are: " This could be an indicator of compromise ", " This is suspicious behaviour ". If a threat has been identified, then at step 215 the necessary artifacts are collected using agent 110 and with the involvement of specialists 101 .

Если определенные фразы на этапе 214 не были определены, то переходят к этапу 216. Ключевыми фразами в таком случае являются: «should be further investigated» (продолжить исследование), «additional data required» (необходимы дополнительные данные), «necessary to analyze further» (необходимо провести дополнительный анализ) и другие аналогичные фразы. Указанные фразы хранятся как в самом обработчике 111 (например, в виде списка), так и в отдельной базе данных (не отображена на фиг. 1Б). Дополнительно анализируется, какого типа информация требуется в зависимости от наличия следующих слов:If certain phrases were not determined at step 214 , then step 216 is reached. The key phrases in this case are: " should be further investigated", "additional data required ", " necessary to analyze further " and other similar phrases. These phrases are stored both in the processor 111 itself (e.g., as a list) and in a separate database (not shown in Fig. 1B ). Additionally, the type of information required is analyzed depending on the presence of the following words:

На этапе 216 происходит обращение к LLM 105, что подробно описано на фиг. 2А. В данном случае использование агентской LLM 112 на хосте 107 позволяет снизить нагрузку на LLM 105, выявить больше событий и предоставить больше данных для последующего выявления инцидентов. Для обращения к LLM 105 будет использован весь контекст взаимодействия агента и агентской LLM 112 на хосте. Контекст включает набор запросов и ответов от агентской LLM 112 при взаимодействии с агентом.At step 216, the LLM 105 is contacted, as described in detail in Fig. 2A . In this case, the use of the agent LLM 112 on the host 107 allows for a reduction in the load on the LLM 105 , the detection of more events, and the provision of more data for subsequent incident detection. The entire context of the interaction between the agent and the agent LLM 112 on the host will be used to contact the LLM 105. The context includes the set of requests and responses from the agent LLM 112 when interacting with the agent.

На фиг. 3 представлена компьютерная система, на которой могут быть реализованы различные варианты систем и способов, раскрытых в настоящем документе. Компьютерная система 20 может представлять собой систему, сконфигурированную для реализации настоящего изобретения и может быть представлена в виде одного вычислительного устройства или в виде нескольких вычислительных устройств, например, настольного компьютера, портативного компьютера, ноутбука, сервера, мейнфрейма, встраиваемого устройства и других форм вычислительных устройств. Fig. 3 shows a computer system on which various embodiments of the systems and methods disclosed in this document can be implemented. The computer system 20 can be a system configured to implement the present invention and can be represented as a single computing device or as multiple computing devices, such as a desktop computer, a portable computer, a notebook, a server, a mainframe, an embedded device, and other forms of computing devices.

Как показано на фиг. 3, компьютерная система 20 включает в себя: центральный процессор 21, системную память 22 и системную шину 23, которая связывает разные системные компоненты, в том числе память, связанную с центральным процессором 21. Системная шина 23 реализована, как любая известная из уровня техники шинная структура, содержащая в свою очередь память шины или контроллер памяти шины, периферийную шину и локальную шину, способную взаимодействовать с любой другой шинной архитектурой. Примерами шин являются: PCI, ISA, PCI-Express, HyperTransport™, InfiniBand™, Serial ATA, I2C и другие подходящие соединения между компонентами компьютерной системы 20. Центральный процессор 21 содержит один или несколько процессоров, имеющих одно или несколько ядер. Центральный процессор 21 исполняет один или несколько наборов машиночитаемых инструкций, реализующих способы, представленные в настоящем документе. Системная память 22 может быть любой памятью для хранения данных и/или компьютерных программ, исполняемых центральным процессором 21. Системная память может содержать как постоянное запоминающее устройство (ПЗУ) 24, так и память с произвольным доступом (ОЗУ) 25. Основная система ввода/вывода (BIOS) 26, содержит основные процедуры, которые обеспечивают передачу информации между элементами компьютерной системы 20, например, в момент загрузки операционной системы с использованием ПЗУ 24.As shown in Fig. 3 , the computer system 20 includes: a central processor 21 , a system memory 22 and a system bus 23 that connects various system components, including memory associated with the central processor 21 . The system bus 23 is implemented as any bus structure known in the art, which in turn contains a bus memory or a bus memory controller, a peripheral bus and a local bus capable of interacting with any other bus architecture. Examples of buses are: PCI, ISA, PCI-Express, HyperTransport™, InfiniBand™, Serial ATA, I2C and other suitable connections between components of the computer system 20 . The central processor 21 contains one or more processors having one or more cores. The central processor 21 executes one or more sets of machine-readable instructions that implement the methods presented in this document. The system memory 22 may be any memory for storing data and/or computer programs executed by the central processor 21 . The system memory may contain both a read-only memory (ROM) 24 and a random access memory (RAM) 25 . The main input/output system (BIOS) 26 contains the main procedures that ensure the transfer of information between elements of the computer system 20 , for example, at the time of loading the operating system using the ROM 24 .

Компьютерная система 20 включает в себя одно или несколько устройств хранения данных, таких как одно или несколько извлекаемых запоминающих устройств 27, одно или несколько неизвлекаемых запоминающих устройств 28, или комбинации извлекаемых и неизвлекаемых устройств. Одно или несколько извлекаемых запоминающих устройств 27 и/или неизвлекаемых запоминающих устройств 28 подключены к системной шине 23 через интерфейс 32. В одном из вариантов реализации извлекаемые запоминающие устройства 27 и соответствующие машиночитаемые носители информации представляют собой энергонезависимые модули для хранения компьютерных инструкций, структур данных, программных модулей и других данных компьютерной системы 20. Системная память 22, извлекаемые запоминающие устройства 27 и неизвлекаемые запоминающие устройства 28 могут использовать различные машиночитаемые носители информации. Примеры машиночитаемых носителей информации включают в себя машинную память, такую как кэш-память, SRAM, DRAM, ОЗУ не требующую конденсатора (Z-RAM), тиристорную память (T-RAM), eDRAM, EDO RAM, DDR RAM, EEPROM, NRAM, RRAM, SONOS, PRAM; флеш-память или другие технологии памяти, такие как твердотельные накопители (SSD) или флеш-накопители; магнитные кассеты, магнитные ленты и магнитные диски, такие как жесткие диски или дискеты; оптические носители, такие как компакт-диски (CD-ROM) или цифровые универсальные диски (DVD); и любые другие носители, которые могут быть использованы для хранения нужных данных и к которым может получить доступ компьютерная система 20.The computer system 20 includes one or more data storage devices, such as one or more removable storage devices 27 , one or more non-removable storage devices 28 , or combinations of removable and non-removable devices. One or more removable storage devices 27 and/or non-removable storage devices 28 are connected to the system bus 23 via the interface 32. In one embodiment, the removable storage devices 27 and the corresponding computer-readable storage media are non-volatile modules for storing computer instructions, data structures, program modules and other data of the computer system 20. The system memory 22 , the removable storage devices 27 and the non-removable storage devices 28 can use various computer-readable storage media. Examples of computer-readable storage media include computer memory such as cache memory, SRAM, DRAM, Z-RAM, T-RAM, eDRAM, EDO RAM, DDR RAM, EEPROM, NRAM, RRAM, SONOS, PRAM; flash memory or other memory technologies such as solid-state drives (SSD) or flash drives; magnetic cassettes, magnetic tapes, and magnetic disks such as hard disks or floppy disks; optical media such as compact discs (CD-ROM) or digital versatile discs (DVD); and any other media that can be used to store desired data and that can be accessed by a computer system 20 .

Системная память 22, извлекаемые запоминающие устройства 27 и неизвлекаемые запоминающие устройства 28, содержащиеся в компьютерной системе 20 используются для хранения операционной системы 35, приложений 37, других программных модулей 38 и программных данных 39. Компьютерная система 20 включает в себя периферийный интерфейс 46 для передачи данных от устройств ввода 40, таких как клавиатура, мышь, стилус, игровой контроллер, устройство голосового ввода, устройство сенсорного ввода, или других периферийных устройств, таких как принтер или сканер через один или несколько портов ввода/вывода, таких как последовательный порт, параллельный порт, универсальная последовательная шина (USB) или другой периферийный интерфейс. Устройство отображения 47, такое как один или несколько мониторов, проекторов или встроенных дисплеев, также подключено к системной шине 23 через выходной интерфейс 48, такой как видеоадаптер. Помимо устройств отображения 47, компьютерная система 20 оснащена другими периферийными устройствами вывода (на фиг. 3 не показаны), такими как динамики и другие аудиовизуальные устройства.The system memory 22 , the removable storage devices 27 and the non-removable storage devices 28 contained in the computer system 20 are used to store the operating system 35 , applications 37 , other program modules 38 and program data 39. The computer system 20 includes a peripheral interface 46 for transmitting data from input devices 40 , such as a keyboard, mouse, stylus, game controller, voice input device, touch input device, or other peripheral devices, such as a printer or scanner through one or more input/output ports, such as a serial port, parallel port, universal serial bus (USB) or other peripheral interface. A display device 47 , such as one or more monitors, projectors or built-in displays, is also connected to the system bus 23 through an output interface 48 , such as a video adapter. In addition to the display devices 47 , the computer system 20 is equipped with other peripheral output devices (not shown in Fig. 3 ), such as speakers and other audiovisual devices.

Компьютерная система 20 может работать в сетевом окружении, используя сетевое соединение с одним или несколькими удаленными компьютерами 49. Удаленный компьютер (или компьютеры) 49 является рабочим персональным компьютером или сервером, который содержит большинство или все упомянутые компоненты, отмеченные ранее при описании сущности компьютерной системы 20, представленной на Фиг. 3. В сетевом окружении также могут присутствовать и другие устройства, например, маршрутизаторы, сетевые станции или другие сетевые узлы. Компьютерная система 20 может включать один или несколько сетевых интерфейсов 51 или сетевых адаптеров для связи с удаленными компьютерами 49 через одну или несколько сетей, таких как локальная компьютерная сеть (LAN) 50, глобальная компьютерная сеть (WAN), интранет и Интернет. Примерами сетевого интерфейса 51 являются интерфейс Ethernet, интерфейс Frame Relay, интерфейс SONET и беспроводные интерфейсы.The computer system 20 may operate in a network environment using a network connection with one or more remote computers 49. The remote computer (or computers) 49 is a working personal computer or a server that contains most or all of the components mentioned earlier in the description of the essence of the computer system 20 shown in Fig. 3. Other devices may also be present in the network environment, such as routers, network stations or other network nodes. The computer system 20 may include one or more network interfaces 51 or network adapters for communicating with the remote computers 49 via one or more networks, such as a local area network (LAN) 50 , a wide area network (WAN), an intranet and the Internet. Examples of a network interface 51 are an Ethernet interface, a Frame Relay interface, a SONET interface and wireless interfaces.

Варианты раскрытия настоящего изобретения могут представлять собой систему, способ, или машиночитаемый носитель (или носитель) информации.Embodiments of the present invention may be a system, a method, or a computer-readable storage medium (or carrier).

Машиночитаемый носитель информации является осязаемым устройством, которое сохраняет и хранит программный код в форме машиночитаемых инструкций или структур данных, к которым имеет доступ центральный процессор 21 компьютерной системы 20. Машиночитаемый носитель может быть электронным, магнитным, оптическим, электромагнитным, полупроводниковым запоминающим устройством или любой подходящей их комбинацией. В качестве примера, такой машиночитаемый носитель информации может включать в себя память с произвольным доступом (RAM), память только для чтения (ROM), EEPROM, портативный компакт-диск с памятью только для чтения (CD-ROM), цифровой универсальный диск (DVD), флеш-память, жесткий диск, портативную компьютерную дискету, карту памяти, дискету или даже механически закодированное устройство, такое как перфокарты или рельефные структуры с записанными на них инструкциями.A computer-readable storage medium is a tangible device that stores and stores program code in the form of computer-readable instructions or data structures that are accessible to the central processor 21 of the computer system 20. The computer-readable storage medium may be an electronic, magnetic, optical, electromagnetic, semiconductor memory device, or any suitable combination thereof. As an example, such a computer-readable storage medium may include random access memory (RAM), read-only memory (ROM), EEPROM, a portable compact disc with read-only memory (CD-ROM), a digital versatile disc (DVD), flash memory, a hard disk, a portable computer diskette, a memory card, a floppy disk, or even a mechanically encoded device such as punched cards or relief structures with instructions recorded thereon.

Система и способ настоящего изобретения, могут быть рассмотрены в терминах средств. Термин «средство», используемый в настоящем документе, относится к реальному устройству, компоненту или группе компонентов, реализованных с помощью аппаратного обеспечения, например, с помощью интегральной схемы, специфичной для конкретного приложения (ASIC) или FPGA, или в виде комбинации аппаратного и программного обеспечения, например, с помощью микропроцессорной системы и набора машиночитаемых инструкций для реализации функциональности средства, которые (в процессе выполнения) превращают микропроцессорную систему в устройство специального назначения. Средство также может быть реализовано в виде комбинации этих двух компонентов, при этом некоторые функции могут быть реализованы только аппаратным обеспечением, а другие функции - комбинацией аппаратного и программного обеспечения. В некоторых вариантах реализации, по крайней мере, часть, а в некоторых случаях и все средство может быть выполнено на центральном процессоре 21 компьютерной системы 20. Соответственно, каждое средство может быть реализовано в различных подходящих конфигурациях и не должно ограничиваться каким-либо конкретным вариантом реализации, приведенным в настоящем документе.The system and method of the present invention can be considered in terms of means. The term "means" as used herein refers to an actual device, component or group of components implemented by hardware, such as an application-specific integrated circuit (ASIC) or FPGA, or as a combination of hardware and software, such as a microprocessor system and a set of machine-readable instructions for implementing the functionality of the means, which (when executed) transform the microprocessor system into a special-purpose device. The means can also be implemented as a combination of these two components, whereby some functions can be implemented only by hardware, and other functions - by a combination of hardware and software. In some embodiments, at least part, and in some cases all, of the means can be executed on the central processor 21 of the computer system 20. Accordingly, each means can be implemented in various suitable configurations and should not be limited to any particular embodiment given herein.

В заключение следует отметить, что приведенные в описании сведения являются примерами, которые не ограничивают объем настоящего изобретения, определенного формулой. Специалисту в данной области становится понятным, что при разработке любого реального варианта осуществления настоящего изобретения необходимо принять множество решений, специфических для конкретного варианта осуществления, для достижения конкретных целей, и эти конкретные цели будут разными для разных вариантов осуществления. Понятно, что такие усилия по разработке могут быть сложными и трудоемкими, но, тем не менее, они будут обычной инженерной задачей для тех, кто обладает обычными навыками в данной области, пользуясь настоящим раскрытием изобретения.Finally, it should be noted that the information provided in the description is examples that do not limit the scope of the present invention defined by the claims. It will be clear to one skilled in the art that in developing any actual embodiment of the present invention, many decisions specific to a particular embodiment must be made in order to achieve specific goals, and these specific goals will be different for different embodiments. It is understood that such development efforts can be complex and time-consuming, but, nevertheless, they will be a routine engineering task for those of ordinary skill in the art, using the present disclosure.

Claims

1. A computer-implemented method for automatically generating queries and responses to a large language model on issues related to computer security, comprising steps performed on a host representing a user's end device, in which:

a) receive input data from the security agent located on the above-mentioned host;

b) select one or more phrases from the input data;

c) make up at least one text query from the selected phrases;

d) send each text query to a large language model located on the above-mentioned host;

d) receive a response from a large language model;

e) extract key phrases from the response received from a large language model;

g) if the key phrases in step e) were not determined, save the context in the form of requests and responses from steps c) and e) and pass this context to another large language model located outside the host.

2. The method according to item 1, in which a text query is composed for a large language model in natural language.

3. The method according to claim 1, wherein key phrases are extracted from the response from a large language model using natural language processing libraries.

4. The method according to claim 1, wherein key phrases are extracted from the response from a large language model using a set of regular expressions.

5. The method according to item 1, wherein the input data from the elements of the information security monitoring center are in JSON format.

6. The method according to claim 1, wherein the input data from the elements of the information security monitoring center are in XML format.