RU2741742C1

RU2741742C1 - Method for obtaining low-dimensional numeric representations of sequences of events

Info

Publication number: RU2741742C1
Application number: RU2020107035A
Authority: RU
Inventors: Дмитрий Леонидович Бабаев; Никита Павлович Овсов; Иван Александрович Киреев
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Priority date: 2020-02-14
Filing date: 2020-02-14
Publication date: 2021-01-28
Also published as: EA202092230A1

Abstract

FIELD: physics.

SUBSTANCE: the invention relates to the field of information technology, in particular to a method for obtaining low-dimensional numerical representations of sequences of events. Disclosed is a computer-implemented method for obtaining low-dimensional numerical representations of sequences of events, comprising the steps of: obtaining a set of input data characterizing events that are aggregated into a sequence and associated with at least one information entity; pre-processing said set of input data, in which positive and negative pairs of sequences of transactional events are generated; and using a transactional event encoder to generate a vector representation of each transactional event from its set of attributes, which includes normalizing numerical variables, processing timestamps, concatenating the obtained vector representations of categorical variables with the normalized numerical variables, and forming a single event vector from the resulting concatenation.

EFFECT: higher efficiency of generating features for machine learning models by generating low-dimensional numerical representations of sequences of events.

7 cl, 8 dwg, 8 tbl

Description

FIELD OF TECHNOLOGY

[0001] The claimed invention relates to the field of information technology, in particular to a method for obtaining low-dimensional numerical representations of sequences of events.

BACKGROUND ART

[0002] Generating semantically meaningful numerical representations from huge amounts of unlabeled lifestream event data is a challenging machine learning task. Such pretrained representations distill complex information from the raw data into low-dimensional, fixed-length numerical vectors and can be readily applied in various downstream machine learning tasks, either as features or fine-tuned for a specific target variable.

[0003] Traditionally, the metric learning approach requires pairs of objects labeled as similar, but such pairs are often unavailable for lifestream event data. Event sequence data is generated in many business applications, for example credit card transactions and website visit data, and event sequence analysis is a very common machine learning problem [1]-[4]. A lifestream is a sequence of events that is attributed to a person and captures his or her regular and routine actions of a certain type, such as transactions, search queries, phone calls and messages.

[0004] The metric learning approach underlying the claimed MeLES method is widely used in various domains, including computer vision, NLP and audio. In particular, metric learning for face recognition was originally proposed in [5], where a contrastive loss was used to train a function mapping input data to low-dimensional representations, using some prior knowledge of similarity relations between training samples or manual labeling. In addition, in [6] the authors introduced FaceNet, a method that learns a mapping from face images to 128-dimensional representations using a triplet loss based on large margin nearest neighbor (LMNN) classification [7]. For FaceNet, the authors also introduced an online triplet selection method and hard-positive and hard-negative mining techniques for the training procedure.

[0005] In addition, metric learning has been used for the voice recognition task [8], where the contrastive loss is defined through the closeness of the numerical representation of each utterance to the centroid of the representations of all utterances of the same speaker (positive pair) and its distance from the centroids of the representations of utterances of other speakers, chosen as the closest among all other speakers (hard negative pair).

[0006] Finally, in [9] the authors proposed fine-tuning the BERT model [10] using metric learning in the form of Siamese and triplet neural networks to train numerical representations of sentences for semantic textual similarity tasks, using semantic proximity annotations of sentence pairs.

[0007] Although metric learning has been used in all of these domains, it has not been applied to the analysis of lifestream events involving transactional data, clickstream data and other types of lifestream event data, which is the subject of this work.

[0008] It is important to note that in the prior literature metric learning was applied in its respective domains as supervised learning, whereas the claimed MeLES method applies metric learning ideas in a completely new way, as unsupervised learning in the domain of event sequences.

[0009] Another idea for applying unsupervised learning to sequential data was previously proposed in the Contrastive Predictive Coding (CPC) method [11], where meaningful representations are extracted by predicting the future in a latent space using autoregressive methods. CPC representations have demonstrated strong performance in four different domains: audio, computer vision, natural language and reinforcement learning.

[0010] In the field of computer vision there are many other approaches to self-supervised learning, which are well summarized in [12]. There are several ways to define a self-supervised task for an image (analogous to the task of predicting the next word in a text). One option is to alter the image and then try to reconstruct the original. Examples of this approach are super-resolution, image colorization and restoration of a corrupted image. Another option is to predict contextual information from local features, for example to predict the position of an image patch within an image that has several patches missing.

[0011] Moreover, almost any unsupervised learning approach can be used to obtain numerical representations of the raw data in the form of embeddings. There are several examples of applying a single set of such representations to several downstream tasks [13], [14].

[0012] One common approach to unsupervised representation learning is either a traditional autoencoder [15] or a variational autoencoder [16]. It is widely used for images, text and audio, or for aggregated lifestream event data [17]. Although autoencoders have been used successfully in several of the domains listed above, they have not been applied to raw lifestream data in the form of event sequences, mainly because of the difficulty of defining a distance between an input sequence and its reconstruction by the autoencoder.

SUMMARY OF THE INVENTION

[0013] The present solution proposes a new method, metric learning for event sequences (MeLES), used to obtain representations of lifestream data in a latent space.

[0014] The present solution embodies a new method, metric learning on event sequences (MeLES), for obtaining low-dimensional numerical representations of sequences of events, which can work well with specific properties of lifestreams, such as their discrete nature.

[0015] Broadly, the MeLES method adapts the metric learning approach [18]-[19]. Metric learning is often posed as a supervised learning task of mapping high-dimensional objects into a space of low-dimensional numerical representations. The goal of metric learning is to place semantically similar objects (images, video, audio, etc.) closer to each other and dissimilar ones farther apart. Most metric learning approaches are used in applications such as speech recognition [8], computer vision [20]-[21] and text analysis [9].

[0016] In these domains, metric learning is successfully applied as a supervised learning task to datasets in which pairs of high-dimensional instances are labeled as the same object or as different objects. Unlike all previous metric learning methods, MeLES is fully unsupervised and requires no labels. It is based on the observation that lifestream data exhibits periodicity and repeatability of events in a sequence. Therefore, some subsequences of the same lifestream can be regarded as high-dimensional representations of the same person. The idea of MeLES is that, in the latent low-dimensional space, the numerical representations of such subsequences should be closer to each other.

[0017] Unsupervised learning makes it possible to train models using the internal structure of large unlabeled or partially labeled training datasets. It has demonstrated effectiveness in various areas of machine learning, such as natural language processing (e.g., ELMo, BERT) and computer vision.

[0018] A MeLES model trained without supervision can be used in two ways. The representations produced by the model can be used directly as a fixed feature vector in a downstream supervised machine learning task (for example, a classification task), similarly to the solution in [22]. Alternatively, the trained model can be fine-tuned [10] for a specific downstream supervised machine learning task.

[0019] Experiments conducted on two public datasets of banking transactions made it possible to evaluate the effectiveness of the claimed method for downstream machine learning tasks. When MeLES representations are used directly as features, the method achieves high performance comparable to the baseline methods.

[0020] Representations fine-tuned for a specific supervised learning task achieve the highest quality scores, significantly outperforming several other supervised learning methods and methods with unsupervised pretraining. These materials further demonstrate the superiority of MeLES representations over supervised learning methods when applied to partially labeled data, where there are too few labels to train a sufficiently complex model from scratch.

[0021] An existing technical problem is that the generation of numerical representations of event data is an irreversible transformation, so the exact sequence of events cannot be recovered from its representation. Consequently, using representations provides greater privacy and data security for end users than working directly with raw event data, and this is achieved without loss of modeling quality.

[0022] The technical result is an increase in the efficiency of generating features for machine learning models by generating low-dimensional numerical representations of sequences of events.

[0023] The claimed technical result is achieved by a computer-implemented method for obtaining low-dimensional numerical representations of sequences of events, comprising the steps of:

- obtaining a set of input data characterizing events that are aggregated into a sequence and associated with at least one information entity, said data containing a set of attributes that includes categorical variables, numerical variables and a timestamp;

wherein said set of input data is preprocessed by:

forming positive pairs of sequences of transactional events, which are subsequences belonging to the sequence of transactional events of a single information entity;

forming negative pairs of subsequences of transactional events, which are subsequences belonging to the sequences of transactional events of different information entities;

- generating, by means of a transactional event encoder, a vector representation of each transactional event from said set of attributes, wherein the encoder contains an initial set of parameters and performs the steps of:

encoding categorical variables as vector representations;

normalizing numerical variables;

processing timestamps to build a time-ordered sequence of transactional events;

concatenating the obtained vector representations of categorical variables and the normalized numerical variables;

forming a single numerical vector for one transactional event from the result of the concatenation;

- generating, by means of a subsequence encoder, a vector representation of a subsequence of transactional events from the set of numerical vectors of transactional events obtained by the transactional event encoder, wherein the subsequence encoder contains an initial set of parameters;

- filtering negative pairs of vectors of subsequences of transactional events for which the vector distance between the members of the pair does not exceed a predetermined threshold value;

- adjusting the initial parameters of said transactional event encoder and subsequence encoder by applying a loss function of the margin loss or contrastive loss type; and

- forming low-dimensional numerical representations of the sequences of events associated with one information entity on the basis of the performed adjustment.
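The filtering and parameter-adjustment steps above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the Euclidean distance, the default margin and the specific contrastive form (squared distance for positive pairs, squared margin violation for negative pairs) are assumptions chosen for illustration, and the filtering step is read here as retaining the close ("hard") negative pairs, which is only one possible interpretation of the text.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def filter_negatives(neg_pairs, threshold):
    """Keep negative pairs whose distance does not exceed the threshold.

    Assumption: "filtering" is read as retaining close (hard) negatives.
    """
    return [(u, v) for u, v in neg_pairs if euclidean(u, v) <= threshold]


def contrastive_loss(pos_pairs, neg_pairs, margin=0.5):
    """Contrastive loss over pairs of subsequence vectors.

    Positive pairs are penalised by their squared distance; negative pairs
    are penalised only when they are closer than the margin.
    """
    pos_term = sum(euclidean(u, v) ** 2 for u, v in pos_pairs)
    neg_term = sum(max(0.0, margin - euclidean(u, v)) ** 2 for u, v in neg_pairs)
    return pos_term + neg_term


# Hypothetical 2-dimensional subsequence vectors, for illustration only.
positives = [([0.0, 0.0], [0.3, 0.4])]             # distance 0.5
negatives = [([0.0, 0.0], [0.3, 0.0]),             # distance 0.3 (hard)
             ([0.0, 0.0], [3.0, 4.0])]             # distance 5.0 (easy)
hard_negatives = filter_negatives(negatives, threshold=1.0)
loss = contrastive_loss(positives, hard_negatives, margin=1.0)
```

In this sketch the distant negative pair is dropped before the loss is computed; with a margin equal to the threshold it would have contributed zero anyway, so the filtering mainly saves computation. In a real training loop the loss would be differentiated with respect to the parameters of both encoders.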

[0024] In one particular embodiment of the method, the information entity is transactional data of an individual or a legal entity.

[0025] In another particular embodiment of the method, positive pairs are created using an algorithm for generating disconnected subsequences.

[0026] In another particular embodiment of the method, positive pairs are created using an algorithm for generating random slices of the sequence.

[0027] In another particular embodiment of the method, the generated subsequences do not overlap with each other.

[0028] In another particular embodiment of the method, the generated subsequences are non-overlapping and/or overlapping with each other.
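The two subsequence-generation embodiments above can be sketched as follows. The sketch is illustrative and rests on assumptions: the function names and parameters are hypothetical, "random slices" is read as contiguous windows of random length and position (which may overlap), and "disconnected subsequences" is read as non-overlapping subsequences whose events need not be adjacent in the original sequence.

```python
import random


def random_slices(events, n_slices, min_len, max_len, rng=None):
    """Draw n_slices contiguous slices of random length; slices may overlap."""
    rng = rng or random.Random()
    max_len = min(max_len, len(events))
    if min_len > max_len:
        return []  # sequence too short to produce a valid slice
    slices = []
    for _ in range(n_slices):
        length = rng.randint(min_len, max_len)          # inclusive bounds
        start = rng.randint(0, len(events) - length)
        slices.append(events[start:start + length])
    return slices


def disconnected_subsequences(events, n_parts, rng=None):
    """Randomly assign each event to one of n_parts subsequences.

    The parts do not overlap, and each part keeps the original time order.
    """
    rng = rng or random.Random()
    parts = [[] for _ in range(n_parts)]
    for event in events:
        parts[rng.randrange(n_parts)].append(event)
    return [part for part in parts if part]


# Hypothetical usage: integers stand in for time-ordered events of one entity.
events = list(range(20))
slices = random_slices(events, n_slices=5, min_len=3, max_len=8, rng=random.Random(0))
parts = disconnected_subsequences(events, n_parts=3, rng=random.Random(0))
```

Any two subsequences produced from the same entity's sequence by either function can serve as a positive pair.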

[0029] In another particular embodiment of the method, the subsequence encoder is a recurrent neural network (RNN).

BRIEF DESCRIPTION OF DRAWINGS

[0030] FIG. 1 illustrates a conceptual diagram of the claimed solution.

[0031] FIG. 2 and FIG. 3 illustrate graphs of the dependence on vector dimensionality in prediction tasks.

[0032] FIG. 4 illustrates the distribution of vectors in the age group prediction task.

[0033] FIG. 5 to FIG. 8 illustrate examples of prediction on various datasets.

IMPLEMENTATION OF THE INVENTION

[0034] The claimed method is designed specifically for lifestream event data. Such data consists of individual events of an information entity, for example a person or a legal entity, in continuous time, such as behavior on websites, execution of transactions, and so on.

[0035] Taking transactions as an example, such as credit card transactions, each transaction has a set of attributes, categorical or numerical, including the timestamp of the transaction. An example of a sequence of three transactions with their attributes is shown in Table 1. The merchant type field represents the category of the merchant, such as "airline", "hotel", "restaurant", and so on.

[0036] Another example of lifestream data is a clickstream, a log of web page visits. An example of a web page visit log for one user is shown in Table 2.

[0037] FIG. 1 shows the general principle of the claimed method. Given a number of discrete events in a given observation interval [1, T], the ultimate goal is to obtain a numerical representation c_t of the sequence for the timestamp T in the latent space R^d. To train the sequence encoder to generate a meaningful numerical representation c_t from these events, the metric learning approach is applied so that the distance between representations of the same information entity is small, while the distance between representations of different entities (negative pairs) is large.

[0038] One of the difficulties in applying the metric learning approach to lifestream data is that the notions of semantic similarity and dissimilarity require domain knowledge, and labeling positive and negative examples is labor-intensive. A key property of the lifestream domain is the periodicity and repeatability of events in a sequence, which makes it possible to reformulate the metric learning task as an unsupervised learning task. MeLES learns low-dimensional representations from sequential data of a selected information entity, for example a person, by sampling positive pairs as subsequences of the same person's sequence and negative pairs as subsequences taken from the sequences of different people. The corresponding pairs are formed by processing the input data with encoders that generate vector representations of transactional events, as described in more detail below.
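The pair selection described above requires no labels other than the identity of the entity that each subsequence was drawn from. A minimal Python sketch, assuming a batch given as a list of (entity identifier, subsequence) tuples; the function name and the batch layout are hypothetical:

```python
from itertools import combinations


def make_pairs(batch):
    """Split all index pairs of a batch into positive and negative pairs.

    batch: list of (entity_id, subsequence) tuples.
    A pair is positive when both subsequences come from the same entity
    and negative when they come from different entities.
    """
    positive, negative = [], []
    for (i, (entity_i, _)), (j, (entity_j, _)) in combinations(enumerate(batch), 2):
        if entity_i == entity_j:
            positive.append((i, j))
        else:
            negative.append((i, j))
    return positive, negative


# Hypothetical batch: two subsequences of entity "A" and one of entity "B".
batch = [("A", [1, 2]), ("A", [3, 4]), ("B", [5, 6])]
positive, negative = make_pairs(batch)
# positive == [(0, 1)], negative == [(0, 2), (1, 2)]
```

The returned index pairs can then be used to look up the encoded subsequence vectors when the loss is computed.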

[0039] The sequence representation c_t obtained through metric learning is then used as a feature vector in various machine learning tasks. In addition, one possible way to improve quality in a task that uses numerical representations of event data is to embed the pretrained c_t (for example, the output vector of the last layer of a recurrent neural network, RNN) into a classification task with a specific target variable and then train jointly, that is, fine-tune the weights of both the encoder network and the classifier.

[0040] To construct a representation of a sequence of events as a fixed-size vector c_t ∈ R^d, an approach similar to the E.T.-RNN card transaction encoder described in the authors' work [15] is used. The whole encoder network consists of two conceptual parts: an event encoder subnetwork and an event sequence encoder subnetwork.

[0041] The event encoder e takes as input the set of attributes x_t of a single event and outputs its representation in the latent space Z ⊆ R^m: z_t = e(x_t). The sequence encoder s takes the latent representations of the sequence of events, z_{1:T} = z_1, z_2, ..., z_T, and outputs the representation c_t of the whole sequence at time step t: c_t = s(z_{1:t}).

[0042] The event encoder network consists of several embedding layers and a batch normalization layer [16]. A separate embedding layer encodes each categorical attribute of the event. Batch normalization is applied to the numerical attributes of the event. Finally, the outputs of all embedding layers and of the batch normalization layer are concatenated to produce the representation z_t of a single event in the latent space.
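The event encoder described above can be illustrated with a minimal sketch in plain Python. It is not the claimed implementation: the embedding tables are randomly initialized rather than trained, the normalization omits the learned scale and shift of a real batch normalization layer, and the names `make_embedding`, `batch_normalize`, `encode_events` and the attribute keys are hypothetical.

```python
import random


def make_embedding(num_categories, dim, rng):
    """Hypothetical embedding table: one small random vector per category."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_categories)]


def batch_normalize(column, eps=1e-5):
    """Normalize one numerical attribute over the batch (no learned scale/shift)."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return [(x - mean) / (var + eps) ** 0.5 for x in column]


def encode_events(events, embeddings, numeric_keys):
    """Return z_t for every event: embeddings and normalized numbers, concatenated."""
    normalized = {key: batch_normalize([e[key] for e in events])
                  for key in numeric_keys}
    encoded = []
    for i, event in enumerate(events):
        z = []
        for key, table in embeddings.items():
            z.extend(table[event[key]])      # look up the categorical attribute
        for key in numeric_keys:
            z.append(normalized[key][i])     # append the normalized numerical attribute
        encoded.append(z)
    return encoded


rng = random.Random(0)
tables = {"mcc": make_embedding(5, 3, rng), "type": make_embedding(2, 2, rng)}
batch = [{"mcc": 1, "type": 0, "amount": 10.0},
         {"mcc": 4, "type": 1, "amount": 30.0}]
z = encode_events(batch, tables, ["amount"])
print(len(z), len(z[0]))
```

In this sketch each event vector has 3 + 2 + 1 = 6 components: two embeddings followed by one normalized numerical attribute.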

[0043] The sequence of latent event representations z_{1:t} is passed to the sequence encoder s to obtain a fixed-size vector c_t. Several approaches can be used to encode the sequence. One possible approach is to use a recurrent neural network (RNN), as in [17]. Another is to use the encoder part of the Transformer architecture presented in [18]. In both cases, the vector of the last event can be used to represent the whole sequence of events. In the case of an RNN, the last output h_t is the representation of the sequence of events.

[0044] An encoder based on an RNN-type architecture, such as GRU [18], makes it possible to compute the representation c_{t+k} by updating the representation c_t instead of computing c_{t+k} from the entire sequence of past events z_{1:t+k}: c_{t+k} = rnn(c_t, z_{t+1:t+k}). This reduces inference time when already computed client representations need to be updated with new events that occurred after the computation. It is possible because of the recurrent nature of RNN-like networks.
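The incremental update can be shown with a toy recurrent cell. This is a sketch under stated assumptions: `rnn_step` is a hypothetical element-wise tanh cell with fixed weights, standing in for a trained GRU, and is used only to show that continuing from a stored state gives the same result as re-encoding the whole sequence.

```python
import math


def rnn_step(state, z, w_state=0.5, w_input=0.3):
    """Toy recurrent cell (assumption: fixed scalar weights, not a real GRU)."""
    return [math.tanh(w_state * s + w_input * x) for s, x in zip(state, z)]


def encode_sequence(events, state=None):
    """Run the cell over the events, optionally continuing from a stored state."""
    if state is None:
        state = [0.0] * len(events[0])
    for z in events:
        state = rnn_step(state, z)
    return state


old_events = [[0.2, -1.0], [0.7, 0.4], [-0.3, 0.9]]   # z_1 .. z_t
new_events = [[1.1, 0.0], [0.5, -0.6]]                # z_{t+1} .. z_{t+k}

c_t = encode_sequence(old_events)                     # stored client representation
c_updated = encode_sequence(new_events, state=c_t)    # cheap update with new events only
c_full = encode_sequence(old_events + new_events)     # full recomputation

print(all(abs(a - b) < 1e-12 for a, b in zip(c_updated, c_full)))
```

The update touches only the k new events, so its cost does not depend on the length of the stored history.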

[0045] Metric learning loss functions change the numerical representations so that the distance between representations of the same class decreases and the distance between representations of different classes increases. Several metric learning loss functions were considered: contrastive loss [19], binomial deviance loss [20], triplet loss [21], histogram loss [22] and margin loss [23].

[0046] All of the above loss functions address the following problem of the metric learning approach: using all pairs of samples is inefficient. For example, the distance between the representations of some negative pairs is already large enough, so these pairs are not useful for training ([24]-[25]).

[0047] Two loss functions are considered next that are conceptually simple yet showed high effectiveness when validated in experiments with the claimed method, namely the contrastive loss and the margin loss.

[0048] The contrastive loss function has a contrastive term for a negative pair of representations that penalizes the model only if the negative pair is not far enough apart, that is, if the distance between the representations is less than the margin m:

$$L = \sum_{i=1}^{P} \left[ (1 - Y^i)\,\tfrac{1}{2}\,(D_W^i)^2 + Y^i\,\tfrac{1}{2}\,\bigl(\max(0,\; m - D_W^i)\bigr)^2 \right]$$

where P is the number of all pairs in the batch, D_W^i is the distance between the i-th labeled pair of representations X_1 and X_2, Y is the binary label assigned to the pair (Y = 0 denotes a positive pair, Y = 1 denotes a negative pair), and m > 0 is the margin. As suggested in [26], the Euclidean distance is used as the distance function:

$$D_W^i = D(A, B) = \sqrt{\sum_j (A_j - B_j)^2}$$
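As an illustration, the contrastive loss can be sketched in plain Python. The sketch assumes the standard form of the loss from [26] (half the squared distance for a positive pair, half the squared margin violation for a negative pair) with Euclidean distance; the function names and the sample vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(pairs, margin=0.5):
    """Sum the loss over (x1, x2, y) pairs; y = 0 is positive, y = 1 is negative."""
    total = 0.0
    for x1, x2, y in pairs:
        d = euclidean(x1, x2)
        if y == 0:
            total += 0.5 * d ** 2                         # pull positive pairs together
        else:
            total += 0.5 * max(0.0, margin - d) ** 2      # push close negatives apart
    return total


pairs = [
    ([0.0, 0.0], [0.0, 0.0], 0),   # identical positive pair: no penalty
    ([0.0, 0.0], [3.0, 4.0], 1),   # negative pair already far apart: no penalty
    ([0.0, 0.0], [0.0, 0.0], 1),   # negative pair at distance 0: full penalty
]
print(contrastive_loss(pairs, margin=1.0))
```

Only the third pair contributes here, which mirrors the property stated above: negative pairs farther apart than the margin carry no gradient.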

[0049] The margin loss function is similar to the contrastive loss; the main difference is that in the margin loss there is no penalty for positive pairs that are closer than a threshold:

$$L = \sum_{i=1}^{P} \left[ (1 - Y^i)\,\max(0,\; D_W^i - b + m) + Y^i\,\max(0,\; b - D_W^i + m) \right]$$

where P is the number of all pairs in the batch, D_W^i is the distance between the i-th labeled pair of representations X_1 and X_2, Y is the binary label assigned to the pair (Y = 0 denotes a positive pair, Y = 1 denotes a negative pair), and m > 0 and b > 0 define the bounds of the margin.
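A minimal sketch of the margin loss follows. It assumes the commonly used form in which a positive pair is penalized only when its distance exceeds b - m and a negative pair only when its distance is below b + m; the function name and the parameter values are hypothetical, and precomputed distances are passed in for brevity.

```python
def margin_loss(labeled_distances, b=1.0, m=0.2):
    """Sum the loss over (distance, y) items; y = 0 is positive, y = 1 is negative."""
    total = 0.0
    for d, y in labeled_distances:
        if y == 0:
            total += max(0.0, d - b + m)   # no penalty while d <= b - m
        else:
            total += max(0.0, b - d + m)   # no penalty while d >= b + m
    return total


examples = [
    (0.5, 0),   # positive pair well inside the threshold: no penalty
    (1.0, 0),   # positive pair too far apart: penalty of m
    (1.5, 1),   # negative pair far enough: no penalty
    (1.0, 1),   # negative pair too close: penalty of m
]
print(margin_loss(examples))
```

Unlike the contrastive loss, the first example contributes nothing even though its distance is not zero.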

[0050] Negative pair sampling is another way to address the problem that some negative pairs are already far enough apart and are therefore not useful for training ([24]-[26]). Accordingly, only a subset of the possible negative pairs is taken into account when computing the loss function, and only pairs within the current batch are considered. There are several possible strategies for selecting the negative pairs most useful for training:

• random sampling of negative pairs;

• hard negative mining: generate the k hardest negative pairs for each positive pair;

• distance-weighted sampling, where negative examples for a given example are drawn uniformly according to their relative distance from that example [27];

• semi-hard sampling, which selects the negative example nearest to the given example from the set of all negative examples that are farther from the given example than its positive example ([28]).
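Two of the strategies listed above, hard negative mining and semi-hard sampling, can be sketched as follows. This is an illustrative simplification: hardness is measured as the distance from the anchor of a positive pair, `owners` is a hypothetical list that records which source sequence each representation in the batch came from, and the function names are not from the source.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_negatives(embeddings, owners, anchor, k):
    """Indices of the k negatives closest to the anchor (the hardest ones)."""
    candidates = [j for j in range(len(embeddings)) if owners[j] != owners[anchor]]
    candidates.sort(key=lambda j: euclidean(embeddings[anchor], embeddings[j]))
    return candidates[:k]


def semi_hard_negative(embeddings, owners, anchor, positive):
    """Closest negative that is still farther from the anchor than its positive."""
    d_pos = euclidean(embeddings[anchor], embeddings[positive])
    farther = [j for j in range(len(embeddings))
               if owners[j] != owners[anchor]
               and euclidean(embeddings[anchor], embeddings[j]) > d_pos]
    if not farther:
        return None
    return min(farther, key=lambda j: euclidean(embeddings[anchor], embeddings[j]))


embeddings = [[0.0], [0.1], [0.05], [0.3], [2.0]]
owners = ["a", "a", "b", "b", "c"]          # subsequences 0 and 1 share a source
print(hard_negatives(embeddings, owners, anchor=0, k=2))
print(semi_hard_negative(embeddings, owners, anchor=0, positive=1))
```

With these toy values, hard mining picks the negative at distance 0.05 first, while semi-hard sampling skips it because it lies closer to the anchor than the positive example does.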

[0051] To select negative pairs, the pairwise distances between all possible pairs of representation vectors in the batch must be computed. To make this procedure more computationally efficient, the representation vectors are normalized, that is, projected onto a hypersphere of unit radius. Since

$$D(A, B) = \sqrt{\sum_j (A_j - B_j)^2} = \sqrt{A \cdot A + B \cdot B - 2\,A \cdot B}$$

and

$$\lVert A \rVert = \lVert B \rVert = 1,$$

computing the Euclidean distance only requires computing:

$$\sqrt{2 - 2\,(A \cdot B)}$$

[0052] To compute the dot product between all pairs in a batch, the matrix of all representation vectors of the batch is multiplied by its own transpose, which is a highly optimized operation in most modern deep learning frameworks. Hence, the computational complexity of negative pair selection is O(n²h), where h is the size of the representation and n is the batch size.
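The distance computation via dot products can be sketched as follows. The sketch uses plain Python lists in place of an optimized matrix multiplication; the helper names are hypothetical, and the clamp to zero is an added guard against small negative values caused by floating-point rounding.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pairwise_distances(batch):
    """All pairwise Euclidean distances between normalized vectors, O(n^2 h)."""
    unit = [normalize(v) for v in batch]
    n = len(unit)
    # Gram matrix: the product of the batch matrix with its own transpose.
    gram = [[sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n)]
            for i in range(n)]
    # For unit vectors, D(A, B) = sqrt(2 - 2 * (A . B)).
    return [[math.sqrt(max(0.0, 2.0 - 2.0 * gram[i][j])) for j in range(n)]
            for i in range(n)]


distances = pairwise_distances([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]])
print(distances[0][1], distances[0][2])
```

Orthogonal unit vectors end up at distance sqrt(2) and opposite ones at distance 2, which is the largest distance possible on the unit hypersphere.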

[0053] The positive pair generation procedure is used to create a batch for training MeLES. N initial sequences are taken to generate the batch. Then K subsequences are produced from each initial sequence.

[0054] Pairs of subsequences produced from the same sequence are treated as positive samples, and pairs from different sequences are treated as negative samples. Hence, after positive pair generation, each batch contains N × K subsequences used as training samples. In a batch there are K - 1 positive pairs and (N - 1) × K negative pairs per sample.
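The pair counts can be checked by direct enumeration. The following sketch labels every subsequence in a batch with the index of its source sequence and counts the positive and negative partners of one sample; `pair_counts` is a hypothetical helper, and the values N = 64 and K = 5 are the batch settings reported for the age prediction dataset.

```python
def pair_counts(n_sequences, k_subsequences):
    """Return (batch size, positive pairs per sample, negative pairs per sample)."""
    # owners[j] is the index of the source sequence of subsequence j.
    owners = [seq for seq in range(n_sequences) for _ in range(k_subsequences)]
    sample = 0
    positives = sum(1 for j in range(len(owners))
                    if j != sample and owners[j] == owners[sample])
    negatives = sum(1 for j in range(len(owners)) if owners[j] != owners[sample])
    return len(owners), positives, negatives


print(pair_counts(64, 5))
```

The enumeration agrees with the closed-form counts N × K, K - 1 and (N - 1) × K.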

[0055] There are several possible strategies for generating subsequences. The simplest is random sampling without replacement. Another strategy is to produce subsequences by randomly splitting the sequence into several subsequences that do not overlap (see Algorithm 1). The third option is to use randomly selected slices of events, which may overlap (see Algorithm 2). The order of events in the generated subsequences is always preserved.

[0056] Algorithm 1: Strategy for generating disjoint subsequences.

hyperparameters: k - the number of subsequences to generate.

input: sequence S of length l

output: S_1, ..., S_k - subsequences generated from S.

Generate a vector inds of length l with random integers from [1, k].

for i ← 1 to k do:

S_i = S[inds = i]

end.
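Algorithm 1 translates almost line by line into Python. In this sketch the function name and the optional `rng` argument (used to make runs reproducible) are additions for illustration; note that with a short input some of the k subsequences may turn out empty.

```python
import random


def disjoint_subsequences(sequence, k, rng=None):
    """Randomly split a sequence into k non-overlapping, order-preserving parts."""
    rng = rng or random.Random()
    # inds[j] is the number (1..k) of the subsequence that receives event j.
    inds = [rng.randint(1, k) for _ in sequence]
    return [[event for event, ind in zip(sequence, inds) if ind == i]
            for i in range(1, k + 1)]


events = list(range(20))
parts = disjoint_subsequences(events, k=3, rng=random.Random(0))
for part in parts:
    print(part)
```

Because every event is assigned to exactly one part and events are collected in their original order, the parts are disjoint, cover the whole sequence, and preserve the order of events.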

[0057] Algorithm 2: Strategy for generating random sequence slices.

hyperparameters: m, M - the minimum and maximum possible subsequence length; k - the number of subsequences to produce.

for i ← 1 to k do:

Generate a random length l_i between m and M

Generate a random start position s such that the slice S[s : s + l_i] lies within S

S_i = S[s : s + l_i]

end.
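Algorithm 2 can be sketched in Python as follows. The exact sampling ranges are not recoverable from the text, so the sketch assumes that the length is drawn uniformly from [m, M] (capped at the sequence length) and that the start position is drawn uniformly among the positions where the slice still fits; the function name and the `rng` argument are hypothetical.

```python
import random


def random_slices(sequence, k, min_len, max_len, rng=None):
    """Draw k contiguous, possibly overlapping slices of random length."""
    rng = rng or random.Random()
    upper = min(max_len, len(sequence))       # a slice cannot exceed the sequence
    if min_len > upper:
        raise ValueError("sequence is shorter than the minimum slice length")
    slices = []
    for _ in range(k):
        length = rng.randint(min_len, upper)                 # assumed range [m, M]
        start = rng.randint(0, len(sequence) - length)       # slice must fit in S
        slices.append(sequence[start:start + length])
    return slices


events = list(range(100))
for piece in random_slices(events, k=3, min_len=5, max_len=10, rng=random.Random(0)):
    print(piece)
```

Each slice is a contiguous run of events, so the order of events is preserved, and different slices of the same sequence may share events.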

[0058] Datasets:

(1) Client age group prediction competition. The task is to predict the age group of a client, with 4 classes as the target variable; accuracy is used as the quality metric. The dataset consists of 44 million anonymized transactions representing 50 thousand clients, with the target variable labeled for only 30 thousand of them (27 million of the 44 million transactions); for the remaining 20 thousand clients (17 million of the 44 million transactions) the label is unknown. Each transaction includes the date, the type (for example, grocery store, clothing, gas station, children's goods, etc.) and the amount. All available 44 million transactions are used for metric learning, except for 10% held out as the test part of the dataset and 5% used for metric learning validation.

(2) Client gender prediction competition. The task is a binary classification problem of predicting the gender of a client; the ROC-AUC metric is used. The dataset consists of 6.8 million anonymized transactions representing 15 thousand clients, of which only 8.4 thousand are labeled. Each transaction is characterized by the date, the type (for example, "ATM cash deposit"), the amount and the merchant category code (also known as MCC).

[0059] For each dataset, 10% of the clients from the labeled part of the data are set aside as a test set, on which the quality of the different models was compared. The presented experiments use the contrastive loss function and the random slices strategy for generating subsequences. For all methods, hyperparameters were selected by random search with 5-fold cross-validation on the training set, based on quality on the held-out fold.

[0060] The results of hyperparameter tuning for MeLES are shown in Table 3.

[0061] To evaluate the unsupervised learning methods (including MeLES), all transactions except the test set were used, including unlabeled data, since these methods are suitable for partially labeled datasets or require no labels at all.

[0062] The neural network architecture suitable for implementing the claimed method was trained on a single Tesla P-100 GPU. When training the MeLES neural network, one training batch is processed in 142 milliseconds. For the age prediction dataset, one training batch contains 64 unique clients with 5 subsamples per client, that is, 320 training samples in total; the average number of transactions per sample is 90, so each batch contains about 28,800 transactions.

[0063] The claimed method was compared with the following two baseline models. The first is the Gradient Boosting Machine (GBM) method on hand-crafted features. GBM can be regarded as a strong baseline for tabular data with heterogeneous features. In particular, GBM-based approaches achieve state-of-the-art results in a variety of practical tasks, including web search, weather forecasting, fraud detection and many others.

[0064] The second is the recently proposed Contrastive Predictive Coding (CPC) method, an unsupervised learning method that has shown high quality on sequential data in such traditional domains as audio, computer vision, natural language and reinforcement learning. The GBM-based model requires a large number of aggregate features prepared by hand from the raw transaction data. An example of an aggregate feature is the average amount spent in certain merchant categories, such as hotels, computed over the entire transaction history. The LightGBM implementation of the GBM algorithm was used with almost 1 thousand features hand-crafted for this task.

[0065] In addition to the baseline models mentioned, the claimed method was compared with a supervised learning approach in which the encoder subnetwork and the classifier subnetwork are trained jointly on the target variable of the given task. In this case, no aggregate features are prepared in advance.

[0066] Tables 4, 5, 6 and 7 present the results of experiments with different variants of the claimed method.

[0067] As shown in Table 4, the different encoder architectures show comparable quality on the given tasks. The contrastive loss function, which can be regarded as the basic variant of a metric learning loss, yields strong results when the representations are used in machine learning tasks (see Table 5). This reflects the fact that an improvement in model quality on the metric learning task does not always lead to an improvement in quality when the representations are used in machine learning tasks.

[0068] Hard negative mining leads to a significant improvement in quality when the representations are used in machine learning tasks, compared with random negative sampling (see Table 7). Another observation is that a more complex subsequence generation strategy (for example, random slices) shows slightly lower quality when the representations are used in machine learning tasks than random sampling of events (see Table 6).

[0069] FIG. 2 shows that, when the representations are used in machine learning tasks, quality increases with the dimension of the representation. The best quality is achieved at a representation dimension of 800; a further increase in dimension reduces quality. The results can be interpreted as a bias-variance trade-off. When the dimension of the representation is too small, too much information may be discarded (high bias). On the other hand, when the dimension is too large, too much noise is added (high variance).

[0070] FIG. 3 shows a similar dependence, with a plateau between dimensions 256 and 2048 where quality on the tasks does not increase. In all experiments except those shown in the plot, a vector (embedding) size of 256 was used.

[0071] Increasing the dimension of the representation also linearly increases the training time and the amount of GPU memory used.

[0072] To visualize MeLES representations in two-dimensional space, the tSNE transformation was applied. tSNE maps a high-dimensional space to a low-dimensional one based on local relationships between points, so representation vectors that are neighbors in the high-dimensional representation space remain close in the 2-dimensional space.

[0073] The representations were learned in a fully unsupervised manner from raw user transactions, without any information about the target variable. A sequence of transactions reflects the user's behavior, so the MeLES model captures behavioral patterns and places the representations of users with similar patterns close to each other. The tSNE vectors for the age prediction dataset are shown in FIG. 4. Four clusters can be observed in FIG. 4: the clusters for groups '1' and '2' are on opposite sides of the cloud, and the clusters for groups '2' and '3' are in the middle.

[0074] Comparison with baseline methods. As shown in Table 8, the claimed method generates representations of life-stream data sequences that provide high quality, comparable to hand-crafted features, when used in downstream tasks. Moreover, representations obtained with the claimed method and then fine-tuned to the target variable achieve the highest quality on both bank transaction datasets, significantly outperforming all commonly used learning methods.

[0075] In addition, using sequence representations together with hand-crafted aggregate features gives better quality than using aggregate features alone or sequence representations alone; that is, the different approaches can be combined to obtain an even better model.

[0076] To evaluate the claimed method when the amount of labeled data is limited, only a portion of the available labels is used for the experiment with unsupervised learning. As in the supervised learning setting, the proposed method is compared with LightGBM on manually prepared aggregate features and with the contrastive predictive coding (CPC) method. For both representation generation methods (MeLES and CPC), the quality of LightGBM is assessed both on the representations and on the representations fine-tuned for the target variable. In addition to these experiments, the claimed method is compared with supervised learning on the labeled part of the dataset.

[0077] FIG. 5 and FIG. 6 compare the quality of manually prepared aggregate features and of the representations, with LightGBM applied on top of them. In addition, FIG. 7 and FIG. 8 compare the individual models on the tasks considered in the article. As the figures show, when the amount of labeled data is limited, MeLES significantly outperforms supervised learning and the other approaches. MeLES also consistently outperforms CPC across different amounts of labeled data.

[0078] In the present method, a metric learning approach was applied to the analysis of lifestream data in a new, unsupervised manner. To this end, the Metric Learning for Sequences (MeLES) method, based on unsupervised learning, was developed. In particular, MeLES can be used to create representations of event sequences with a complex structure, and these representations can be used effectively in various downstream machine learning tasks. In addition, the claimed method can be used for feature preprocessing in an unsupervised learning setting. Empirical experiments demonstrate the effectiveness of the claimed method: it achieves high quality on several tasks, significantly outperforming both classical baseline machine learning models built on manually created features and approaches based on neural networks.
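The metric learning objective mentioned above can be illustrated with a minimal sketch of a pairwise contrastive loss in the form given by Hadsell et al. (source 19). It assumes Euclidean distance, a margin of 1.0, and the hypothetical helper names `euclidean`, `contrastive_loss`, and `keep_hard_negatives`; the last helper shows one possible reading of the negative-pair filtering step in claim 1 (keeping pairs whose distance does not exceed a threshold). None of these specifics are confirmed by the description.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(pairs, margin=1.0):
    """Mean contrastive loss over (u, v, is_positive) triples.

    Positive pairs are pulled together (d^2); negative pairs are pushed
    apart until they are at least `margin` away (max(0, margin - d)^2).
    The margin value is an assumption made for illustration.
    """
    total = 0.0
    for u, v, is_positive in pairs:
        d = euclidean(u, v)
        if is_positive:
            total += d ** 2
        else:
            total += max(0.0, margin - d) ** 2
    return total / len(pairs)


def keep_hard_negatives(negatives, threshold):
    """Keep negative pairs whose distance does not exceed the threshold.

    Assumption: pairs farther apart than the threshold contribute
    nothing to the loss and can be dropped before the update.
    """
    return [(u, v) for u, v in negatives if euclidean(u, v) <= threshold]
```

With this form, a negative pair that is already farther apart than the margin yields zero loss, which is why discarding such pairs before the parameter update does not change the objective.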

[0079] When labeled data is limited, the claimed method shows even stronger results in comparison with methods based on supervised learning. The proposed representation generation method is convenient for production use, because obtaining compact representations from complex event streams requires almost no feature preprocessing. Precomputed representations can easily be reused for various downstream tasks without complex and time-consuming computation of aggregate features from raw event data. For some encoder architectures, it becomes possible to update already computed representations incrementally as new lifestream event data arrives. Another advantage of using representations of event sequences instead of the explicit event data is that the exact input sequence cannot be reconstructed from its representation. Hence, using representations provides better privacy and data security for end users than working directly with raw event data, and this is achieved without loss of information when the representations are used in downstream machine learning tasks.
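The incremental update property mentioned for some encoder architectures can be sketched with a toy recurrent cell. This is an illustration only: the elementwise `tanh` cell, its fixed weights, and the names `rnn_step` and `encode` are assumptions, not the encoder described in the patent. The point is that a recurrent encoder can resume from a cached state instead of reprocessing the whole history.

```python
import math


def rnn_step(state, event, w_state=0.5, w_event=0.5):
    """One step of a toy elementwise recurrent cell.

    h' = tanh(w_state * h + w_event * x); the weights are fixed
    illustrative constants, not trained parameters.
    """
    return [math.tanh(w_state * h + w_event * x) for h, x in zip(state, event)]


def encode(events, state):
    """Fold a list of event vectors into a final state, starting from `state`."""
    for event in events:
        state = rnn_step(state, event)
    return state


# Hypothetical event vectors for one entity.
history = [[1.0, 0.0], [0.5, -0.5]]
new_events = [[0.2, 0.3]]
initial = [0.0, 0.0]

# Full recomputation over the whole sequence.
full = encode(history + new_events, initial)

# Incremental update: reuse the cached representation of the history.
cached = encode(history, initial)
incremental = encode(new_events, cached)
```

In this sketch `full` and `incremental` coincide, because the final state depends on earlier events only through the cached state.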

Sources of information:

1. Srivatsan Laxman, Vikram Tankasali, and Ryen W. White. 2008. Stream prediction using a generative model based on frequent episodes in event sequences. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 453-461.

2. Bénard Wiese and Christian Omlin. 2009. Credit card transactions, fraud detection, and machine learning: Modelling time with LSTM recurrent neural networks. In Innovations in neural information paradigms and applications. Springer, 231-268.

3. Yishen Zhang, Dong Wang, Yuehui Chen, Huijie Shang, and Qi Tian. 2017. Credit risk assessment based on long short-term memory model. In International conference on intelligent computing. Springer, 700-712.

4. Luca Bigon, Giovanni Cassani, Ciro Greco, Lucas Lacasa, Mattia Pavoni, Andrea Polonioli, and Jacopo Tagliabue. 2019. Prediction is very hard, especially about conversion. Predicting user purchases from clickstream data in fashion e-commerce. arXiv preprint arXiv:1907.00400 (2019).

5. Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1. IEEE, 539-546.

6. Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), 815-823.

7. Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. 2006. Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems. 1473-1480.

8. Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. 2017. Generalized End-to-End Loss for Speaker Verification. (2017). arXiv:eess.AS/1710.10467

9. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. http://arxiv.org/abs/1908.10084

10. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.

11. Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. CoRR abs/1807.03748 (2018). arXiv:1807.03748 http://arxiv.org/abs/1807.03748

12. Longlong Jing and Yingli Tian. 2019. Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. (2019). arXiv:cs.CV/1902.06162.

13. Yang Song, Yuan Li, Bo Wu, Chao-Yeh Chen, Xiao Zhang, and Hartwig Adam. 2017. Learning Unified Embedding for Apparel Recognition. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW) (2017), 2243-2246.

14. Andrew Zhai, Hao-Yu Wu, Eric Tzeng, Dong Huk Park, and Charles Rosenberg. 2019. Learning a Unified Embedding for Visual Search at Pinterest. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19). ACM, New York, NY, USA, 2412-2420. https://doi.org/10.1145/3292500.3330739.

15. David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1985. Learning internal representations by error propagation. Technical Report. California Univ San Diego La Jolla Inst for Cognitive Science.

16. Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).

17. Rogelio A Mancisidor, Michael Kampffmeyer, Kjersti Aas, and Robert Jenssen. 2019. Learning Latent Representations of Bank Customers With The Variational Autoencoder. (2019). arXiv:stat.ML/1903.06580.

18. Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. 2003. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems. 521-528.

19. Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (CVPR '06). IEEE Computer Society, Washington, DC, USA, 1735-1742. https://doi.org/10.1109/CVPR.2006.100.

20. Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), 815-823.

21. Chengzhi Mao, Ziyuan Zhong, Junfeng Yang, Carl Vondrick, and Baishakhi Ray. 2019. Metric learning for adversarial robustness. In Advances in Neural Information Processing Systems (2019), 478-489.

22. Tomas Mikolov, G. S. Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. 1-12.

Claims

1. A computer-implemented method for obtaining low-dimensional numerical representations of sequences of events, comprising the steps of:

- receiving a set of input data characterizing events that are aggregated into a sequence and associated with at least one information entity, said data containing a set of attributes including categorical variables, numeric variables, and a timestamp;

wherein said set of input data is preprocessed, in which:

- using a transactional event encoder, a vector representation of each transactional event is formed from said set of attributes, the encoder containing a primary set of parameters and performing the steps of:

encoding the categorical variables as vector representations;

normalizing the numeric variables;

- using a subsequence encoder, a vector representation of a subsequence of transactional events is formed from the set of numeric vectors of transactional events obtained with the transactional event encoder, the subsequence encoder containing a primary set of parameters;

- filtering negative pairs of vectors of subsequences of transactional events for which the distance between the vectors does not exceed a predetermined threshold value;

- adjusting the primary parameters of said transactional event encoder and subsequence encoder using a loss function in the form of a margin loss or a contrastive loss; and

- forming low-dimensional numerical representations of the sequences of events associated with one information entity, based on the adjustment performed.

2. The method according to claim 1, characterized in that the information entity is the transactional data of an individual or legal entity.

3. The method according to claim 1, characterized in that the creation of positive pairs is carried out using an algorithm for the formation of disconnected subsequences.

4. The method according to claim 1, characterized in that the creation of positive pairs is carried out using an algorithm for generating random slices of the sequence.

5. The method according to claim 3, characterized in that the generated subsequences do not intersect with each other.

6. The method according to claim 4, characterized in that the generated subsequences do not intersect and/or intersect with each other.

7. The method according to claim 1, characterized in that the subsequence encoder is a recurrent neural network (RNN).
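The two ways of creating positive pairs named in claims 3 to 6 can be illustrated with the following minimal sketch. It is one possible reading only: the function names `random_slices` and `disjoint_subsequences`, their parameters, and the random assignment of events to parts are assumptions made for illustration and are not specified in the claims. Any two subsequences drawn from the same entity's sequence can then serve as a positive pair.

```python
import random


def random_slices(events, n_slices, min_len, max_len, rng):
    """Draw contiguous random slices of a sequence; slices may overlap.

    Illustrative reading of claims 4 and 6; slice lengths are chosen
    uniformly between min_len and max_len (an assumption).
    """
    max_len = min(max_len, len(events))
    if min_len > max_len:
        raise ValueError("sequence is shorter than min_len")
    slices = []
    for _ in range(n_slices):
        length = rng.randint(min_len, max_len)
        start = rng.randint(0, len(events) - length)
        slices.append(events[start:start + length])
    return slices


def disjoint_subsequences(events, n_parts, rng):
    """Split a sequence into non-intersecting subsequences.

    Illustrative reading of claims 3 and 5: each event is assigned to
    exactly one part at random, and the original order is preserved.
    """
    parts = [[] for _ in range(n_parts)]
    for event in events:
        parts[rng.randrange(n_parts)].append(event)
    return parts


# Hypothetical usage: integers stand in for transactional events.
rng = random.Random(0)
events = list(range(20))
slices = random_slices(events, n_slices=3, min_len=5, max_len=10, rng=rng)
parts = disjoint_subsequences(events, n_parts=2, rng=rng)
```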