RU2469389C1

RU2469389C1 - Method of integrating user profiles of online social networks

Info

Publication number: RU2469389C1
Application number: RU2011145077/08A
Authority: RU
Inventors: Сергей Олегович Бартунов; Антон Викторович Коршунов; Денис Юрьевич Турдаков; Николай Николаевич Кузюрин; Сеунг-Таек ПАРК; Вонхо РЫУ; Хыунгдонг ЛИ
Original assignee: Учреждение Российской академии наук Институт системного программирования РАН; Корпорация "САМСУНГ ЭЛЕКТРОНИКС Ко., Лтд."
Priority date: 2011-11-08
Filing date: 2011-11-08
Publication date: 2012-12-10

Abstract

FIELD: information technology.

SUBSTANCE: method comprises steps of entering all possible pairs of profiles; constructing a Conditional Random Field model for all profiles and connections between them; for each pair of profiles, calculating the similarity value of their attributes using a string or graph similarity metric; constructing a feature vector from the obtained similarity metric values, which is sent to a machine learning algorithm which calculates unary energy or binary energy for each pair of profiles, wherein the profiles belong to different social graphs; calculating profile similarity; checking whether the obtained profile similarity value exceeds a given threshold value; if so, the pair of profiles is entered into a list of candidates; a priori true projections are selected from the obtained list of candidates; the Conditional Random Field model is broken into independent components; for each component of the model, the optimum configuration of projections is sought; the lists of the found projections are merged for all components of the model.

EFFECT: high efficiency of integrating user profiles of online social networks.

6 dwg

Description

The invention relates to the processing of user data obtained from the graphs of online social networks, with the aim of integrating data from different profiles that belong to the same user. It can be used to build databases of user information drawn from various sources, in particular to build an extended social graph containing user data obtained from several different social graphs. Such an extended social graph can improve the quality of results in a number of tasks, such as web search, online advertising of goods and services, and recommending goods and services to users.

The basic concepts needed to understand the invention are as follows:

1. Data integration involves combining data held in different sources and presenting it to users in a unified form. The invention provides data integration at the logical level, so that data contained in different sources can be accessed in terms of a single global schema that describes their joint representation with structural properties taken into account.

2. A social graph is a digital representation of the relationships between users of online social networks, explicitly defined by various types of relations between users (for example, friendship or following). User data in a social graph is represented as a profile: a set of attributes (mostly strings), stored on a physical medium or in computer memory, in the form of name-value pairs that contain various information about the user (for example, name, gender, address, or phone number). The invention is intended to merge different social graphs by comparing the attributes of user profiles and making intensive use of the information hidden in the links between profiles.

3. Conditional Random Fields are a probabilistic graphical model in which dependencies between random variables are represented as an undirected graph. The nodes of the graph are divided into two disjoint sets: observed variables, which are given as input, and hidden variables. The edges of the graph correspond to probabilistic relationships between nodes. Nodes and edges can be assigned numerical values called energies. Computing the values of the hidden variables is called inference.

4. Machine learning is a branch of artificial intelligence that studies methods for constructing algorithms capable of learning. The invention uses the variety of machine learning known as supervised learning: the algorithm produces a function that maps input data to output data in a particular way (a classification task). Examples of input data paired with the corresponding output serve as training data. Machine learning algorithms rely heavily on the notion of a feature. Features are individual measurable properties of the observed phenomenon that are used to build its numerical representation (for example, the values of string similarity metrics for pairs of profile attributes).

5. String similarity metrics return a numerical similarity value for a pair of strings based on the order of their constituent characters (for example, the Jaro-Winkler distance [17]).
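As an illustration of such a metric, the following is a minimal Python sketch of the Jaro-Winkler measure in its commonly published form. The function names, the prefix scale `p=0.1`, and the prefix cap of 4 characters are conventional defaults assumed here, not values taken from the patent. Note that despite the word "distance", the value grows with similarity (1.0 for identical strings).

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)
```

For the classic pair "MARTHA" and "MARHTA", this sketch gives a Jaro value of about 0.944 and a Jaro-Winkler value of about 0.961.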

6. Graph similarity metrics return a numerical similarity value for a pair of graph nodes based on the structure of the links between them (for example, the Dice coefficient [18]).
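A minimal sketch of one such metric, the Dice coefficient over neighbor sets, is given below. Representing the graph as a dictionary that maps each node to the set of its neighbors, and returning 0.0 when both nodes are isolated, are assumptions made for this example.

```python
from typing import Dict, Hashable, Set


def dice_coefficient(graph: Dict[Hashable, Set[Hashable]], a: Hashable, b: Hashable) -> float:
    """Dice coefficient of two nodes: 2 * |N(a) & N(b)| / (|N(a)| + |N(b)|)."""
    neighbors_a = graph.get(a, set())
    neighbors_b = graph.get(b, set())
    if not neighbors_a and not neighbors_b:
        return 0.0  # assumed convention for two isolated nodes
    shared = len(neighbors_a & neighbors_b)
    return 2 * shared / (len(neighbors_a) + len(neighbors_b))
```

Two nodes with two neighbors each, one of them shared, score 2 * 1 / (2 + 2) = 0.5.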

The most extensive study of the integration of social network user profiles is the thesis by Veldman [1]. It proposes numerous heuristics that use both user profile data and the existing links between profiles. Results of similar studies are presented by Motoyama et al [2], Gae-won et al [3], Raad et al [4] and Vozecky et al [5]. In [2], the authors compare MySpace and Facebook user profiles. In [3], the authors do the same with profiles from Twitter and EntityCube. The authors of [4] generate synthetic user profiles and apply various complex heuristics to them, trying to exploit every potentially useful source of data in a social network. In [5], Facebook and StudiVZ user profiles are represented as n-dimensional vectors, which are then compared using various techniques, including fuzzy matching. The authors also study how different profile attributes affect the accuracy of the comparison.

Also of interest are the Foaf-o-matic [6] and Okkam [7] projects, which aim to integrate social profiles using the formal semantics of FOAF (Friend-of-a-friend). The Stanford Entity Resolution Framework [8] is likewise designed for problems of this kind. In addition to the framework's source code, many papers are available on theoretical aspects of data integration, such as scalability and quality evaluation.

Despite the progress made by the authors of the works listed above, these works use an overly simple profile comparison model, based mainly on pairwise comparison of individual attributes by string similarity. Moreover, the existing links between profiles are taken into account insufficiently or not at all, and the same is true of the specific features of the social graphs being compared.

The closest prior art to the invention is the method proposed by Singla et al [9] for detecting duplicates in a citation network of scientific papers using a Conditional Random Fields model. The authors formulate the problem in terms of Markov logic and build a model from facts and statements about the objects being compared, after which they compute their probabilities. To speed up the algorithm, the authors partition the model into overlapping components. However, this approach has the following drawbacks, which prevent its use for the problem considered here:

- the model allows for only one data source (the graph of the citation network of scientific papers);

- the nodes of the model are facts and statements about the objects being compared rather than the objects themselves. In particular, two types of nodes are defined: record nodes and attribute nodes. The first type holds the question "Is this object identical to another object?", whereas the second type holds information about the similarity of the objects' attributes;

- objects are compared using only string similarity metrics over their attributes; graph similarity metrics are not used.

The present invention has the following advantages over previously proposed approaches:

- it simplifies the representation of the social graph in computer memory, because the Conditional Random Fields model is built on the basis of one of the graphs being compared, with the profiles themselves as the nodes of the model and the links between them as its edges. The energies of the model's links are then computed from data on the string and graph similarity of profiles, after which an optimal solution is sought in the form of a configuration of projections of the profiles of one social network onto the other;

- the Conditional Random Fields model takes into account all available information about the links between profiles, so the latent information hidden in these links can be used to refine the results. The method can therefore integrate profiles whose attributes contain only a small amount of useful information, a situation in which the conventional approach based on string similarity of attributes is considerably harder to apply;

- the adequacy and effectiveness of the various string and graph similarity metrics, as well as the relative importance of attributes, are assessed with machine learning techniques on a previously compiled set of real experimental data, which makes it possible to account for the specific features of the chosen social graphs and to tune the parameters of the algorithm for better results.

The technical result of the invention is that it makes it possible, in a previously unknown way, to obtain a list of pairs of online social network user profiles in which both profiles of each pair contain information about the same user, relying only on the information contained in the profile attributes and in the links between profiles. A universal approach is also proposed that accounts for the specific features of any pair of social graphs and tunes the parameters of the method for better results.

For a better understanding of the claimed invention, a detailed description with the corresponding drawings follows.

Figure 1 is a flowchart of the algorithm of the invention.

Figure 2 is a diagram of the calculation of profile attribute similarity values and the construction of the feature vector.

Figure 3 shows the Conditional Random Fields model.

Consider two social graphs A and B. For a vertex ν∈A, the corresponding profile from graph B is called the projection pr(ν) of the vertex ν∈A onto graph B. If no projection from B is defined for a vertex ν∈A, it is assigned the neutral projection: pr(ν)=N.

The task of integrating user profiles is to determine as many correct projections (ν, u), ν∈A, u∈B, as possible. In terms of the Conditional Random Fields model, the projections pr(ν) for each ν∈A are hidden variables whose values must be determined.

The invention is based on the following main propositions:

- the problem of determining the correct projection for a node of graph A is related to the problems of determining the correct projections for all of its adjacent nodes in graph A;

- if two nodes in graph A are connected, their projections in graph B should have as high a graph similarity value as possible.

The proposed method for integrating user profiles of online social networks comprises an algorithm that includes the input of all possible pairs of profiles, 19 main steps, and the output of the result as a list of pairs of profiles in which each pair contains information about the same user and the profiles that make up a projection belong to different social graphs.

The steps of the algorithm are considered in more detail below (Figure 1).

Step 100. Input of all possible pairs of profiles.

All possible pairs of profiles take part in the comparison, because the inference algorithm used by the invention for the Conditional Random Fields model gives better results when information is available about the energies of all possible pairwise combinations of nodes. The set of profiles from graphs A and B, together with all known links between them, forms the Conditional Random Fields model.

Step 101. Selection of the next pair of profiles.

Step 102. Calculation of profile attribute similarity values and construction of the feature vector.

The attributes of profiles from graphs A and B are compared using a correspondence scheme, which specifies the order in which attributes are compared and the similarity metrics to apply (examples of correspondence schemes for calculating attribute similarity between nodes from different graphs and between two nodes of the same graph are given in Table 1 and Table 2, respectively).

Figure 2 shows an example of comparing profiles from Twitter and Facebook. T_i and F_i denote the two attributes being compared from profiles T and F, where i is the index of the entry in the correspondence scheme that contains these attributes. A similarity metric sim_i is applied to each pair of attributes. The similarity metric values for all attribute pairs make up the feature vector, which is passed to the next step.
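The construction of the feature vector can be sketched in Python as follows. The attribute names in `SCHEME` and the two simple metrics are hypothetical placeholders chosen for illustration; the actual correspondence schemes are those of Table 1 and Table 2, and a real system would plug in metrics such as Jaro-Winkler.

```python
from typing import Callable, Dict, List, Tuple

Metric = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """1.0 when both values are present and equal ignoring case, else 0.0."""
    if not a or not b:
        return 0.0
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical correspondence scheme: entry i is (T_i, F_i, sim_i).
SCHEME: List[Tuple[str, str, Metric]] = [
    ("screen_name", "username", exact_match),
    ("name", "full_name", token_jaccard),
    ("location", "hometown", token_jaccard),
]


def feature_vector(
    profile_t: Dict[str, str],
    profile_f: Dict[str, str],
    scheme: List[Tuple[str, str, Metric]] = SCHEME,
) -> List[float]:
    """Apply sim_i to (T_i, F_i) for every scheme entry, in scheme order."""
    return [
        metric(profile_t.get(attr_t, ""), profile_f.get(attr_f, ""))
        for attr_t, attr_f, metric in scheme
    ]
```

Missing attributes are scored as 0.0 in this sketch, which is an assumed convention; the vector always has one component per scheme entry, so it can be passed directly to a learned model.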

Step 103. Calculation of the energy of a pair of profiles using a machine learning algorithm.

The energies are defined below, after the necessary variables have been introduced.

Graph A is used to build the Conditional Random Fields model. Let ν and u be users from graph A, and let pr(ν) and pr(u) be their projections from graph B.

Then the energy of a node (unary energy) is defined as

Φ(pr(ν) | ν) = α(ν) · (1 - profile_similarity(ν, pr(ν))),

the energy of a link (binary energy) is defined as

ψ(pr(ν), pr(u) | ν, u) = β(pr(ν), pr(u)) · (1 - network_similarity(pr(ν), pr(u))),

and the total energy is defined as

α(ν) and β(pr(ν), pr(u)) are coefficients that control the influence of each type of energy on the final model and on the results. The functions profile_similarity and network_similarity are profile similarity functions that are normalized and increase as the similarity of the compared profiles increases.
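For reference, the unary and binary energies can be written in LaTeX as below. The explicit expression for the total energy is not reproduced in this text; the third line is only a sketch that assumes the usual pairwise form, namely the sum of the unary energies over the nodes of A plus the sum of the binary energies over the links of A. The symbols E (total energy) and E_A (the set of links of graph A) are notation introduced here, not taken from the source.

```latex
\begin{aligned}
\Phi\bigl(pr(\nu) \mid \nu\bigr) &= \alpha(\nu)\,\bigl(1 - \mathrm{profile\_similarity}(\nu, pr(\nu))\bigr),\\
\psi\bigl(pr(\nu), pr(u) \mid \nu, u\bigr) &= \beta\bigl(pr(\nu), pr(u)\bigr)\,\bigl(1 - \mathrm{network\_similarity}(pr(\nu), pr(u))\bigr),\\
E &= \sum_{\nu \in A} \Phi\bigl(pr(\nu) \mid \nu\bigr) + \sum_{(\nu, u) \in E_A} \psi\bigl(pr(\nu), pr(u) \mid \nu, u\bigr).
\end{aligned}
```

Because both similarity functions are normalized and increasing, each term lies between 0 and its coefficient, and a configuration with more similar projections has lower energy.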

To compute unary and binary energies so as to maximize the accuracy of the results by analyzing real data, a machine learning technique is used that takes as input the set of similarity metric values for a given pair of profiles, in the form of a feature vector, and returns the energy value for that pair of profiles. One of the following methods may be used as the machine learning technique:

- A weighted linear combination of features, where the weights are fitted by linear regression [15] on the basis that the unary energy should equal 0 for correct projections and 1 for incorrect ones;

- The MultiBoostAB machine learning algorithm [13] over C4.5 decision trees [14];

- The LogitBoost machine learning algorithm over M5P decision trees [16].

Before use, the listed algorithms undergo training: they receive as input a set of feature vectors obtained by computing similarity metrics for the attributes of profiles from a data set in which the correct projections have been specified manually in advance. A feature vector used for training includes all similarity metric values for the pair of profiles being compared plus an additional dimension holding a Boolean value that indicates whether the two profiles contain information about the same user. Such a data set can be compiled for any pair of social graphs.
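The first of the listed options, a weighted linear combination fitted by regression, can be sketched in pure Python as follows. The least-squares fit is done here by plain gradient descent, and the bias term, learning rate, epoch count, and final clamping of the energy to [0, 1] are all assumptions of this sketch rather than details given in the patent.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], bool]  # (feature vector, same-user flag)


def train_linear_energy(
    samples: List[Sample], epochs: int = 2000, lr: float = 0.1
) -> Tuple[List[float], float]:
    """Fit weights so that the prediction is 0 for correct pairs and 1 for incorrect ones."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, same_user in samples:
            target = 0.0 if same_user else 1.0
            prediction = bias + sum(w * x for w, x in zip(weights, features))
            error = prediction - target
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        # One batch gradient step on the mean squared error.
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def unary_energy(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Weighted linear combination of features, clamped to [0, 1] (assumed)."""
    raw = bias + sum(w * x for w, x in zip(weights, features))
    return min(1.0, max(0.0, raw))
```

Trained on a handful of labeled pairs, the model should assign a low energy to a pair with high attribute similarities and a high energy to a dissimilar pair.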

Step 104. Is this pair of profiles the last one? If NO, go to step 101; if YES, go to step 105.

Step 105. Selection of the next pair of profiles.

All possible pairs of profiles in which one profile belongs to graph A and the other to graph B are examined in sequence.

Step 106. Calculation of the similarity of the pair of profiles.

For the selected pair of profiles, the attribute most likely to identify a profile uniquely is compared using a string similarity metric.

Step 107. Does the similarity value of the pair of profiles exceed the given threshold? If YES, go to step 108; if NO, go to step 109.

Step 108. Add the pair of profiles to the list of candidates.

Step 109. Is this pair of profiles the last one? If NO, go to step 105; if YES, go to step 110.
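Steps 105 to 109 amount to a thresholded scan over all cross-graph pairs, which might look like the sketch below. The identifying attribute `username`, the threshold of 0.85, and the use of `difflib.SequenceMatcher` from the Python standard library as a stand-in string similarity are assumptions made for illustration; the patent leaves the attribute, the metric, and the threshold to the implementer.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Dict, List, Tuple

Profiles = Dict[str, Dict[str, str]]  # profile id -> attributes


def candidate_pairs(
    profiles_a: Profiles,
    profiles_b: Profiles,
    key: str = "username",
    threshold: float = 0.85,
) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, similarity) for pairs whose similarity exceeds the threshold."""
    candidates = []
    # Steps 105 and 109: iterate over every pair with one profile from each graph.
    for (id_a, attrs_a), (id_b, attrs_b) in product(profiles_a.items(), profiles_b.items()):
        value_a = attrs_a.get(key, "").lower()
        value_b = attrs_b.get(key, "").lower()
        if not value_a or not value_b:
            continue  # nothing to compare
        # Step 106: similarity of the identifying attribute.
        similarity = SequenceMatcher(None, value_a, value_b).ratio()
        # Steps 107 and 108: keep the pair only above the threshold.
        if similarity > threshold:
            candidates.append((id_a, id_b, similarity))
    return candidates
```

The scan is quadratic in the number of profiles, which is one reason the later steps reduce the problem before running inference.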

Step 110. Selection of the best pairs of profiles from the list of candidates and compilation of the list of a priori correct projections.

To select the best pairs of profiles, the Kuhn-Munkres algorithm [19] is applied to the list of candidates. It iterates over all pairs in the list and returns a subset of them chosen so that each profile occurs in only one pair and the profiles that make up a pair mutually maximize their similarity to each other. The result is a set of a priori correct projections (pairs of profiles), which are entered into the model to improve the quality of the results.
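The selection criterion can be illustrated with a small exhaustive search. This is deliberately not the Kuhn-Munkres algorithm: it is a brute-force stand-in that finds the one-to-one subset of candidates with the highest total similarity, which is one reading of the criterion described above, and it is practical only for tiny candidate lists. For real data a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` would be the natural choice. The function name and the input format are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, float]  # (profile in A, profile in B, similarity)


def best_one_to_one(candidates: List[Candidate]) -> List[Tuple[str, str]]:
    """Pick pairs so that each profile is used at most once and total similarity is maximal."""
    by_a: Dict[str, List[Tuple[str, float]]] = {}
    for a, b, similarity in candidates:
        by_a.setdefault(a, []).append((b, similarity))
    a_nodes = sorted(by_a)
    best = {"score": -1.0, "pairs": []}

    def search(i: int, used_b: set, score: float, chosen: list) -> None:
        if i == len(a_nodes):
            if score > best["score"]:
                best["score"] = score
                best["pairs"] = list(chosen)
            return
        a = a_nodes[i]
        search(i + 1, used_b, score, chosen)  # option: leave a unmatched
        for b, similarity in by_a[a]:
            if b not in used_b:
                used_b.add(b)
                chosen.append((a, b))
                search(i + 1, used_b, score + similarity, chosen)
                chosen.pop()
                used_b.remove(b)

    search(0, set(), 0.0, [])
    return best["pairs"]
```

The example in which "a1" resembles "b2" slightly more than "b1", while "a2" resembles only "b2", shows why a global choice matters: picking the single best pair first would leave "a2" unmatched.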

Step 111. Split the model into independent components by removing the a priori correct projections.

To reduce the computational complexity of the algorithm, the original problem is divided into subproblems by splitting the original model. This is achieved by removing the a priori correct projections that have been found.
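The splitting in step 111 can be pictured as a connected-components search on the graph that remains after the anchored nodes are removed. The sketch below is a minimal illustration under that assumption; the function name `split_into_components` and the toy graph are hypothetical and do not reproduce the edges of Fig. 3.

```python
from collections import deque

def split_into_components(edges, anchored):
    """Remove anchored nodes, then return the connected components that remain."""
    # Build an undirected adjacency map that skips every anchored node.
    adjacency = {}
    for a, b in edges:
        for node in (a, b):
            if node not in anchored:
                adjacency.setdefault(node, set())
        if a not in anchored and b not in anchored:
            adjacency[a].add(b)
            adjacency[b].add(a)

    components, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects one component.
        queue, component = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components

# Hypothetical toy graph, not the edges of Fig. 3.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "e"), ("e", "f")]
toy_anchored = {"b"}  # nodes that already have an a priori correct projection
components = split_into_components(toy_edges, toy_anchored)
```

With node `b` removed, the toy graph falls apart into three pieces that can be processed independently in steps 112-114.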

Step 112. Select the next component of the model.

Step 113. Find the optimal configuration of projections for the selected component.

To solve this problem, the total energy of the model must be minimized. For this purpose, inference is performed on the constructed Conditional Random Field model. Quadratic relaxation [10] is first applied to the inference problem; the problem is then approximated using the Power Iteration [11] and Singular Value Decomposition [12] methods and finally solved as a quadratic programming problem. The result of this step is a model configuration (a set of projections) that minimizes the total energy of the model and contains the maximum number of correct profile projections.

Fig. 3 schematically shows the model with a set of different projections. The nodes of graph A are drawn with a solid line, and the nodes of graph B with a dash-dotted line. Square nodes correspond to a priori correct projections, triangular nodes belong to projections found by inference, and no suitable projection was found for any of the remaining nodes. The two nodes that form a projection are connected by a dashed line.

Step 114. Is this component of the model the last one? If NO, go to step 112; if YES, go to step 115.

Step 115. Merge the results for all components of the model.

The lists of projections found for all components of the model are merged.

Step 116. Select the next projection.

Step 117. Construct a feature vector for the selected projection and pass it to the classifier as input.

For the selected projection, a vector consisting of the following features is constructed:

- the unary energy of the vertex;

- the mean binary energy of the links to a priori correct projections;

- the proportion of a priori correct projections among the vertices linked to this one;

- the quality of the set of a priori correct projections.

The quality of the set of a priori correct projections is computed as the sum of the N largest weights assigned to the a priori correct projections linked to the vertex under consideration, where the weight of a vertex is computed as the mean binary energy of the links between that vertex and the other weighted vertices.
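One possible reading of the quality feature is sketched below. The function name `anchor_set_quality`, the energy table and the convention that a lone anchor gets weight 0.0 are assumptions made for illustration; the description does not fix them.

```python
def anchor_set_quality(anchors, binary_energy, top_n):
    """Sum of the top_n largest anchor weights.

    The weight of an anchor is the mean binary energy of its links to the
    other anchors in the same set (assumed to be 0.0 when it is the only one).
    """
    weights = []
    for a in anchors:
        others = [b for b in anchors if b != a]
        if others:
            weights.append(sum(binary_energy(a, b) for b in others) / len(others))
        else:
            weights.append(0.0)
    return sum(sorted(weights, reverse=True)[:top_n])

# Hypothetical symmetric binary energies between three anchored vertices.
energies = {
    frozenset(("p", "q")): 0.2,
    frozenset(("p", "r")): 0.4,
    frozenset(("q", "r")): 0.6,
}

def lookup(a, b):
    """Return the binary energy of the link between two anchors."""
    return energies[frozenset((a, b))]

quality = anchor_set_quality(["p", "q", "r"], lookup, top_n=2)
```

Here the weights are roughly 0.3, 0.4 and 0.5 for `p`, `q` and `r`, so with N = 2 the quality is the sum of the two largest.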

Step 118. Is the selected projection correct?

To refine the results, a machine learning classification algorithm is used (MultiBoostAB boosting [13] over C4.5 decision trees [14]). The classifier takes the constructed feature vector as input and returns a decision on whether the given vertex is correctly projected. If the projection is incorrect (the classifier's decision does not match the decision of the algorithm), go to step 119; otherwise, go to step 120.

Step 119. Remove the selected projection from the results.

Step 120. Is this projection the last one? If NO, go to step 116; if YES, go to step 121.

Step 121. Output the list of projections, in which each projection contains information about one and the same user, while the profiles forming the projection belong to different social graphs.

Example of operation

Fig. 3 shows models of two social graphs, in which profiles are represented by nodes and the connections between them by edges. Assume that nodes 1-8 belong to graph A (Twitter) and nodes 9-14 belong to graph B (Facebook). All connections between nodes within each graph are given initially, whereas connections between nodes of different graphs (the pairing of nodes) are the result of the method. The Twitter graph, with its full set of nodes and edges, is used to construct the Conditional Random Field model, and the nodes of the Facebook graph are treated as hidden variables of the model. A connection in the Twitter graph is a mutual following relation: each profile of the pair follows the other, i.e., receives notifications when new information appears in that profile, and each following relation is established unilaterally when the owner of the following profile activates notifications. A connection in the Facebook graph is a friendship relation: each profile of the pair is a friend of the other, and the relation is established when the owner of one profile sends a request and receives explicit confirmation of it.
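The mutual following relation can be derived from directed "follows" records by keeping only the pairs that occur in both directions. The sketch below illustrates this; the function name `mutual_follow_edges` and the sample records are hypothetical and are not taken from Fig. 3.

```python
def mutual_follow_edges(follows):
    """Return undirected edges for pairs of profiles that follow each other."""
    follow_set = set(follows)
    return sorted({
        (min(a, b), max(a, b))
        for a, b in follow_set
        if a != b and (b, a) in follow_set
    })

# Hypothetical directed (follower, followed) pairs, not taken from Fig. 3.
toy_follows = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (5, 1)]
edges = mutual_follow_edges(toy_follows)
```

Only the pairs 1-2 and 3-4 survive, because 2-3 and 5-1 are followed in one direction only.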

At step 102, the similarity values of the profile attributes are calculated and a feature vector is constructed. The attributes of profiles from graphs A and B are compared using a correspondence scheme, which specifies the order in which attributes are compared and the similarity metrics applied (examples of correspondence schemes for calculating attribute similarity between Twitter and Facebook profiles and between Facebook profiles are given in Table 1 and Table 2, respectively).

Table 1. Correspondence scheme for calculating the similarity of profile attributes between nodes 2 and 11

| Twitter profile attribute (node 2) | Twitter attribute value | Facebook profile attribute (node 11) | Facebook attribute value | Similarity metric | Metric value |
| Name | John Smith | Name | J Smith | VMN | 0.66 |
| User place | New York, US | Current city | New York | Jaro | 0.45 |
| URL | www.my.site | Website | www.no.site | URL measure | 0 |

Table 2. Correspondence scheme for calculating the similarity of profile attributes between nodes 11 and 14

| Facebook profile attribute | Attribute value of the first Facebook profile (node 11) | Attribute value of the second Facebook profile (node 14) | Similarity metric | Metric value |
| Contact list | 9, 10, 12, 13, 14 | 11, 12, 13 | Bidirectional Contact Score | 1 |
| Contact list | 9, 10, 12, 13, 14 | 11, 12, 13 | Weighted Dice | 0.3 |

At step 103, the values of the unary and binary energies are calculated.

To calculate the unary and binary energies, a machine learning technique is used that takes a feature vector as input and returns the energy of the link for the given pair of profiles. For example, for the profiles in Table 1 the feature vector is as follows:

[0.66; 0.45; 0].

The machine learning algorithm returns a unary energy of 0.52 for this pair of profiles.
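As a minimal sketch, the feature vector can be assembled by reading the metric values in the order fixed by the correspondence scheme. The tuple layout and the name `feature_vector` are assumptions; the metric values are copied from Table 1 rather than recomputed, since the definitions of the VMN and URL measures are not given in this section.

```python
# Rows of Table 1: (Twitter attribute, Facebook attribute, metric name, metric value).
# The metric values are copied from the table, not recomputed here.
table_1 = [
    ("Name", "Name", "VMN", 0.66),
    ("User place", "Current city", "Jaro", 0.45),
    ("URL", "Website", "URL measure", 0.0),
]

def feature_vector(scheme_rows):
    """Collect the metric values in the order fixed by the correspondence scheme."""
    return [value for _twitter_attr, _facebook_attr, _metric, value in scheme_rows]

vector = feature_vector(table_1)
```

Keeping the scheme as ordered data means every profile pair yields a vector with the same feature positions, which is what the learning algorithm expects.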

At step 106, the "Name" attribute is compared pairwise across all Twitter-Facebook profile pairs using the VMN string similarity metric. The result is a set of triples of the form "profile number - profile number - similarity value". The similarity threshold is taken to be 0.51.

At step 107, the value of the string similarity metric is compared with the specified threshold.

At step 108, all pairs for which the value of the string similarity metric exceeds the specified threshold are entered into the list of candidates.

For example, suppose the initial list of pairs is:

| Twitter profile | Facebook profile | Similarity |
| 1 | 9 | 0 |
| 1 | 10 | 0.65 |
| 1 | 11 | 0.55 |
| 2 | 9 | 0.5 |
| 2 | 10 | 0.6 |
| 2 | 11 | 0.66 |
| 3 | 9 | 0.35 |
| 3 | 10 | 0.7 |
| 3 | 11 | 0 |

Then the list of candidate pairs is:

| Twitter profile | Facebook profile | Similarity |
| 1 | 10 | 0.65 |
| 1 | 11 | 0.55 |
| 2 | 10 | 0.6 |
| 2 | 11 | 0.66 |
| 3 | 10 | 0.7 |
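Steps 107 and 108 amount to a threshold filter over the similarity triples. The sketch below reproduces the example with its threshold of 0.51; the names `select_candidates` and `initial_pairs` are chosen here for illustration.

```python
THRESHOLD = 0.51  # similarity threshold from the example

# (Twitter profile, Facebook profile, VMN similarity of the "Name" attribute)
initial_pairs = [
    (1, 9, 0.0), (1, 10, 0.65), (1, 11, 0.55),
    (2, 9, 0.5), (2, 10, 0.6), (2, 11, 0.66),
    (3, 9, 0.35), (3, 10, 0.7), (3, 11, 0.0),
]

def select_candidates(pairs, threshold):
    """Keep only the pairs whose similarity strictly exceeds the threshold."""
    return [pair for pair in pairs if pair[2] > threshold]

candidates = select_candidates(initial_pairs, THRESHOLD)
```

The pair 2-9 with similarity 0.5 falls just below the threshold and is dropped, which leaves the five candidate pairs listed in the example.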

At step 110, the Kuhn-Munkres algorithm is applied to the list of candidates to select the best profile projections, those whose profiles mutually maximize their similarity to each other. The result is a set of a priori correct projections, which are entered into the model:

| Twitter profile | Facebook profile |
| 2 | 11 |
| 3 | 10 |
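The selection in step 110 can be checked on the example with a brute-force search for the one-to-one set of pairs with the largest total similarity. This is a stand-in and not the Kuhn-Munkres algorithm itself: it assumes that "mutually maximizing similarity" means maximizing the total similarity of the chosen pairs, and it is practical only for very small candidate lists. The name `best_matching` is hypothetical.

```python
def best_matching(candidates):
    """Exhaustively find the one-to-one set of pairs with maximum total similarity."""
    best_total, best_pairs = 0.0, []

    def extend(index, used_left, used_right, total, chosen):
        nonlocal best_total, best_pairs
        if index == len(candidates):
            if total > best_total:
                best_total, best_pairs = total, list(chosen)
            return
        left, right, score = candidates[index]
        # Option 1: skip this candidate pair.
        extend(index + 1, used_left, used_right, total, chosen)
        # Option 2: take it, provided both profiles are still unmatched.
        if left not in used_left and right not in used_right:
            chosen.append((left, right))
            extend(index + 1, used_left | {left}, used_right | {right},
                   total + score, chosen)
            chosen.pop()

    extend(0, frozenset(), frozenset(), 0.0, [])
    return sorted(best_pairs)

# Candidate pairs from the example: (Twitter profile, Facebook profile, similarity).
candidate_pairs = [(1, 10, 0.65), (1, 11, 0.55), (2, 10, 0.6), (2, 11, 0.66), (3, 10, 0.7)]
a_priori = best_matching(candidate_pairs)
```

On this input the best total is 0.66 + 0.7 = 1.36, reached by the pairs 2-11 and 3-10, which agrees with the a priori correct projections given in the example.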

In Fig. 3, the a priori correct projections correspond to the pairs of square nodes connected by a dashed line (pairs 2-11, 3-10 and 5-13). Each such projection adds useful information to the model and thus not only reduces the amount of computation needed to obtain the optimal configuration, but also lowers the probability of incorrect answers in the results.

At step 111, the original model is split into independent components by removing the a priori correct projections. In the example under consideration, after nodes 2, 3 and 5 are removed, three independent subgraphs remain: (1, 7), (8) and (4). The corresponding nodes 10, 11 and 13 of the Facebook graph are likewise excluded from further consideration and take no part in the selection of projections for the nodes remaining in the model.

At step 113, inference is performed for each of the resulting independent components of the model by applying an iterative algorithm that enumerates the possible projection configurations of the model and minimizes its total energy.
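To make the objective of step 113 concrete, the sketch below minimizes the total energy (the sum of the unary energies plus the binary energies over the edges of the component) by exhaustive enumeration. This is not the quadratic relaxation, Power Iteration and Singular Value Decomposition procedure described for step 113; it is a brute-force illustration that is feasible only for tiny components. All node names and energy values are hypothetical.

```python
from itertools import permutations

def min_energy_configuration(nodes, labels, edges, unary, binary):
    """Try every one-to-one assignment of labels to nodes; return the lowest-energy one."""
    best_energy, best_assignment = float("inf"), None
    for chosen in permutations(labels, len(nodes)):
        assignment = dict(zip(nodes, chosen))
        # Unary term: cost of projecting each node onto its assigned label.
        energy = sum(unary[(v, assignment[v])] for v in nodes)
        # Binary term: cost of each pair of labels assigned to linked nodes.
        energy += sum(binary[frozenset((assignment[u], assignment[v]))]
                      for u, v in edges)
        if energy < best_energy:
            best_energy, best_assignment = energy, assignment
    return best_assignment, best_energy

# Hypothetical component: two linked nodes of graph A, three candidates in graph B.
nodes = ["a1", "a2"]
labels = ["b1", "b2", "b3"]
edges = [("a1", "a2")]
unary = {
    ("a1", "b1"): 0.2, ("a1", "b2"): 0.7, ("a1", "b3"): 0.9,
    ("a2", "b1"): 0.8, ("a2", "b2"): 0.3, ("a2", "b3"): 0.4,
}
binary = {
    frozenset(("b1", "b2")): 0.5,
    frozenset(("b1", "b3")): 0.1,
    frozenset(("b2", "b3")): 0.6,
}
assignment, energy = min_energy_configuration(nodes, labels, edges, unary, binary)
```

In this toy case the unary energies alone would favour projecting `a2` onto `b2`, but the low binary energy between `b1` and `b3` makes the configuration `a1` to `b1`, `a2` to `b3` the overall minimum, which shows how the links between profiles influence the result.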

At step 115, the results for all components of the model are merged.

An example of the result of this step is the set of projections 1-9, 1-12 and 6-14.

The following steps refine the results.

At step 117, a feature vector is compiled for each projection found at step 115 and fed to the input of the classifier.

At step 118, the classifier decides whether each of the projections is correct. Let the vector for projection 1-9 be [0.85; 0.5; 0.4; 0.5] and the vector for projection 1-12 be [0.7; 0.6; 0.6; 0.35]. Then the classifier answers "true" for projection 1-9 and "false" for projection 1-12.

At step 119, projection 1-12 is excluded from the results, since the classifier's answer for this projection does not match the decision of the algorithm.
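The refinement loop of steps 116-120 can be sketched as a filter driven by the classifier's verdicts. The lookup table below stands in for a trained classifier: the verdicts for 1-9 and 1-12 come from the example, and 6-14 is taken as accepted because it appears in the final result reported for step 121. The names `refine` and `verdicts` are hypothetical.

```python
def refine(projections, is_correct):
    """Keep only the projections that the classifier confirms (steps 116-120)."""
    return [p for p in projections if is_correct(p)]

# Stand-in for a trained classifier: a real system would evaluate the feature
# vector of each projection instead of consulting this lookup table.
verdicts = {(1, 9): True, (1, 12): False, (6, 14): True}

found = [(1, 9), (1, 12), (6, 14)]  # projections merged at step 115
final = refine(found, lambda projection: verdicts[projection])
```

Passing the classifier as a callable keeps the loop independent of the particular learning algorithm used.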

At step 121, a list of projections is output, in which each projection contains information about one and the same user, while the profiles forming the projection belong to different social graphs. The final result of the invention for the example under consideration is shown in Fig. 3 and contains two projections: 1-9 and 6-14. This means that vertices 1 and 9 belong to different social graphs but contain information about one and the same user. The same applies to vertices 6 and 14.

Claims

A method of integrating user profiles of online social networks, characterized in that all possible pairs of profiles are entered; a Conditional Random Field model is constructed from all profiles and the connections between them; for each pair of profiles, the similarity values of their attributes are calculated using string and graph similarity metrics on the basis of correspondence schemes for profiles from different social graphs; a feature vector is constructed from the obtained similarity values and passed to a machine learning algorithm, which calculates the unary energy by the formula
$\Phi(pr(\nu) \mid \nu) = \alpha(\nu) \cdot (1 - \mathrm{profile\_similarity}(\nu, pr(\nu)))$
for a pair of profiles $\nu$ and $pr(\nu)$ belonging to different social graphs, or the binary energy by the formula
$\Psi(pr(\nu), pr(u) \mid \nu, u) = \beta(pr(\nu), pr(u)) \cdot (1 - \mathrm{network\_similarity}(pr(\nu), pr(u)))$
for a pair of profiles $pr(\nu)$ and $pr(u)$ belonging to the same social graph; for each pair of profiles in which the profiles belong to different social graphs, the profile similarity is calculated by applying a string similarity metric to the attribute that is most likely to identify a profile uniquely; it is checked whether the obtained profile similarity value exceeds a specified threshold and, if so, the pair of profiles is entered into a list of candidates; a priori correct projections are selected from the list of candidates by applying to it an algorithm that iterates over all pairs in the list and returns a subset of them, chosen so that each profile appears in only one pair and the profiles in each pair mutually maximize their similarity to each other; the Conditional Random Field model is split into independent components by removing the a priori correct projections; for each component of the model, the optimal configuration of projections is sought by applying an iterative algorithm that minimizes the total energy of the model in order to obtain the maximum number of correct profile projections; the lists of projections found for all components of the model are merged; a feature vector is constructed for each projection found and passed to a classifier, which determines whether the projection is correct and, if it is not, the projection is excluded from the results; and a list of projections (pairs of profiles) is output, each of which contains information about one and the same user, while the profiles forming each projection belong to different social graphs.