WO2016106944A1 - Method for creating virtual human on mapreduce platform - Google Patents

Info

Publication number
WO2016106944A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
key
rho
value
neighbor
Prior art date
Application number
PCT/CN2015/072486
Other languages
French (fr)
Chinese (zh)
Inventor
蔡立宇
张观成
喻勇
杨航
范亚博
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2016106944A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for creating a virtual human on a MapReduce platform, comprising: step 1, extracting accounts from behavior logs, together with the login time and login terminal information corresponding to each account (1); step 2, calculating the similarity between accounts according to how the accounts co-occur, and constructing a connected graph in which nodes represent accounts and the length of the edge between two nodes represents the similarity between the corresponding accounts, where a shorter edge indicates a higher similarity (2); and step 3, clustering the nodes of the connected graph on the MapReduce platform and creating virtual humans according to the clustering result (3). The method creates virtual humans from behavior logs, has low complexity and high accuracy, and is suitable for processing big data. By applying the MapReduce distributed computing model, clustering based on local density is carried out on a cluster of machines, which relaxes the limits imposed by the resources of a single machine, allows massive data to be processed, and lets the clustering complete faster.

Description

Method for establishing a virtual human on a MapReduce platform
Technical field
The present invention relates to the field of data processing, and in particular to a method for establishing a virtual human on a MapReduce platform.
Background
At present, network services such as instant messaging, e-mail, online games, P2P download software, online forums, online recruitment, e-commerce and online booking of flights and hotels bring great convenience to network users. A network service generally assigns each user an account, which is associated with the user's registration information and is used to record and identify the user, for example an instant messaging number (such as a QQ account), an e-mail address, an online game account, a forum login account or a P2P software account.
Every network user holds accounts of many types, and the large number of network users produces an enormous volume of account data, so that managing network user information effectively has become a difficult task for the departments concerned. To manage this information effectively, the ownership of network accounts must be analyzed, that is, it must be determined which accounts belong to the same person (a virtual human). This has become a problem in urgent need of a solution.
When constructing virtual humans, the prior art mostly relies on attribute matching, which works roughly as follows:
A) Specify the rules for matching network account attributes: which attributes are compared in which case, and how a successful match is decided. For example, when matching a QQ account against a Taobao account, the two accounts are considered matched if the edit distances of both the "name" field and the "contact" field are less than 3.
B) From the attribute matching results, derive the degree to which accounts belong to the same person (the similarity), and finally use the similarity to decide which accounts belong to the same person. In the example above, two accounts are considered to belong to the same person as soon as they match.
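The matching rule in example A) can be sketched as follows. This is only an illustration of the prior-art approach under stated assumptions: the function names, the dictionary representation of an account and the field names `name` and `contact` are hypothetical, and the edit distance is taken to be the standard Levenshtein distance.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete a character of s
                            curr[j - 1] + 1,      # insert a character of t
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def accounts_match(a: dict, b: dict, fields=("name", "contact"), limit: int = 3) -> bool:
    """Example rule: every compared field must have an edit distance below `limit`."""
    return all(edit_distance(a[f], b[f]) < limit for f in fields)
```

Note that `accounts_match` raises a `KeyError` when a field is absent, which is one concrete form of the missing-attribute problem that attribute matching runs into.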
In practice, however, the following situations arise:
1. Attributes are often missing from account data, for example when only some attribute values were filled in at registration.
2. Account data of different types share few attributes, and not all of the shared attributes can be used for matching.
3. Account data of different types represent semantically identical attributes differently, so the attributes must be aligned, which adds further difficulty. For example, in type A accounts the name is held in a single "name" field, whereas in type B accounts it is split into a "surname" field and a "given name" field.
4. Attribute values in real account data are not very reliable. For example, without real-name verification an ID number may be false.
5. Comparison must be carried out at the attribute level, which is costly.
These situations make attribute matching complicated and computationally expensive, and its results unsatisfactory; accuracy is particularly low when large amounts of data are processed.
MapReduce, on the other hand, is a distributed parallel computing framework proposed by Google for parallel operations on large data sets, which it processes mainly through two steps, "Map" and "Reduce". In a computation on a MapReduce platform, the input data is first split across different computers of the cluster, and other computers of the cluster are assigned to run Map jobs or Reduce jobs. A Map job extracts key-value pairs <Key, Value> from the input data, and each pair is passed as an argument to the map function. The intermediate key-value pairs produced by the map function are buffered in memory and periodically written to local disk. They are divided into R partitions, where R is defined by the user, and each partition later corresponds to one Reduce job. Key-value pairs with the same Key are handled by the same Reduce job. The Reduce job reads the intermediate key-value pairs and, for each unique key, passes the key and its associated values to the reduce function; the output of the reduce function is appended to the output file of that partition.
A Map/Reduce job differs from a map/reduce function as follows. A Map job processes one split of the input data and may call the map function many times, once for each input key-value pair. A Reduce job processes the intermediate key-value pairs of one partition, calling the reduce function once for each distinct key, and ultimately corresponds to one output file. Throughout the process, the input data comes from the underlying distributed file system, the intermediate data is kept on the local file system, and the final output is written to the underlying distributed file system. To process massive data and overcome the limits imposed by the resources of a single machine, there is an urgent need to establish virtual humans on a MapReduce platform.
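The flow just described (map, partition into R partitions, group by key, reduce) can be imitated in a single process. The sketch below is not the API of Hadoop or of any real MapReduce framework; `run_mapreduce`, `map_login` and `reduce_count` are hypothetical names, and counting log records per account is merely a convenient toy job.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_partitions=2):
    """Single-process imitation of the Map -> partition -> group -> Reduce flow."""
    # One dict per partition; each dict groups intermediate values by key.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in map_fn(record):        # map function: called once per record
            index = hash(key) % num_partitions   # equal keys land in the same partition
            partitions[index][key].append(value)
    output = []
    for partition in partitions:                 # stands in for one Reduce job
        for key, values in partition.items():    # reduce function: once per distinct key
            output.extend(reduce_fn(key, values))
    return output


def map_login(record):
    """Emit <account, 1> for a (time, terminal, account) log record."""
    _, _, account = record
    yield account, 1


def reduce_count(account, ones):
    """Sum the values that share one key."""
    yield account, sum(ones)
```

Because the partition index depends on `hash(key)`, the order of the output list is not fixed; sort the result if a stable order is needed.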
Summary of the invention
The object of the present invention is therefore to provide a method for establishing a virtual human on a MapReduce platform, which solves the problems of complex construction and low accuracy caused by, among other things, the variety of account types, and which processes massive data while overcoming the limits imposed by the resources of a single machine.
To achieve this object, the present invention provides a method for establishing a virtual human on a MapReduce platform, comprising:
Step 1: extract from behavior logs the accounts and, for each account, the corresponding login time and login terminal information;
Step 2: calculate the similarity between accounts according to how the accounts co-occur, and construct a connected graph in which nodes represent accounts and the length of the edge between two nodes represents the similarity between the corresponding accounts; the shorter the edge, the higher the similarity;
Step 3: cluster the nodes of the connected graph on the MapReduce platform, and establish virtual humans according to the clustering result.
Step 3 comprises:
Step 20: taking the node and edge information of the connected graph as input data, generate through a Map job key-value pairs containing each node and its adjacent-edge information, and generate through a Reduce job an output containing each node, the node's local density Rho and the information of all of the node's adjacent edges, where Rho is defined as the number of edges incident to the node whose length is less than a predefined value Dc;
Step 30: from the output of the Reduce job of step 20, generate through a Map job key-value pairs containing the node, the node's Rho, the neighbor node's Rho and the adjacent-edge information; then, for each node, traverse through a Reduce job the node's Rho, the Rho of all its neighbor nodes and all its adjacent-edge information to obtain the node's dispersion Delta, where Delta is defined as the length of the shortest of the node's edges that connect to a neighbor node with a higher Rho or, if no such neighbor node exists, the length of the node's longest adjacent edge; then assign class identifiers according to a predetermined rule;
Step 40: the nodes of the same class together constitute one virtual human.
The predetermined rule may comprise: if the Rho and the Delta of a node are higher than the thresholds R_T and D_T respectively, which are input parameters, the node is the center of a class and takes its own class identifier; otherwise, the node takes the class identifier of the nearest neighbor node that has a higher Rho;
the class identifier of an isolated node is its own class identifier.
Alternatively, the predetermined rule may comprise: ranges of Rho values and corresponding ranges of Delta values are defined in advance; if the Rho of a node falls within one of the Rho ranges and its Delta falls within the corresponding Delta range, the node is the center of a class and takes its own class identifier; otherwise, the node takes the class identifier of the nearest neighbor node that has a higher Rho;
the class identifier of an isolated node is its own class identifier.
Factors other than the co-occurrence of accounts may also be introduced when calculating the similarity between accounts.
The method may further comprise merging all virtual humans and their corresponding accounts into a virtual human database.
The output of the Reduce job of step 20 may be stored in a relational database or a key-value database.
In the Map job of step 30, the Rho values of neighbor nodes may be traversed by taking the Cartesian product of the output of the Reduce job of step 20.
Step 20 comprises:
Step 21: the node and edge information of the connected graph is fed as input data to a Map job, which generates key-value pairs in which the key contains a field identifying the node and the value contains a field identifying a neighbor node and a field giving the length of the edge between the node and that neighbor;
Step 22: partition the key-value pairs by the node contained in the key, so that pairs whose keys contain the same node are assigned to the same partition;
Step 23: within each partition, group the key-value pairs by the node contained in the key, so that pairs whose keys contain the same node are assigned to the same group;
Step 25: through a Reduce job, traverse all adjacent edges of a node by iterating over the values of the key-value pairs in its group, and generate an output containing the node, the node's local density Rho and the information of all of the node's adjacent edges.
Step 20 may further comprise:
in step 21, the key additionally contains a field giving the length of the edge between the node and the neighbor node;
Step 24: sort the key-value pairs of each group by the edge length contained in the key.
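Steps 21 to 25 can be imitated in a single process as below. This is a sketch under stated assumptions, not the patented implementation: the input is assumed to be a list of undirected edges `(a, b, length)`, the map function is assumed to emit each edge once per endpoint, the partitioning and grouping of steps 22 and 23 are collapsed into one in-memory dictionary, and the sort of step 24 is done inside the reduce function rather than by a framework-level secondary sort. All function names are hypothetical.

```python
from collections import defaultdict


def map_edges(edge):
    """Step 21: emit <node, (neighbor, edge length)> for both endpoints of an edge."""
    a, b, length = edge
    yield a, (b, length)
    yield b, (a, length)


def reduce_rho(node, neighbor_edges, dc):
    """Step 25: iterate over one node's adjacent edges and count those shorter than Dc."""
    edges = sorted(neighbor_edges, key=lambda e: e[1])   # stands in for step 24
    rho = sum(1 for _, length in edges if length < dc)
    return node, rho, edges


def step20(edges, dc):
    """Run map, group by node (steps 22-23), then reduce."""
    groups = defaultdict(list)
    for edge in edges:
        for key, value in map_edges(edge):
            groups[key].append(value)
    return [reduce_rho(node, values, dc) for node, values in sorted(groups.items())]
```

Each output record has the form `(node, Rho, [(neighbor, edge length), ...])`, which matches the content that step 25 requires.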
Step 30 comprises:
Step 31: from the output of the Reduce job of step 20, a Map job generates key-value pairs in which the key contains a field identifying the node and the value contains a field identifying a neighbor node, a field giving the length of the edge between the node and that neighbor, a field giving the neighbor node's Rho and a field giving the node's own Rho;
Step 32: partition the key-value pairs by the node contained in the key, so that pairs whose keys contain the same node are assigned to the same partition;
Step 33: within each partition, group the key-value pairs by the node contained in the key, so that pairs whose keys contain the same node are assigned to the same group;
Step 35: through a Reduce job, for each node, traverse the node's Rho, the Rho of all its neighbor nodes and all its adjacent-edge information by iterating over the values of the key-value pairs in its group, obtain the node's dispersion Delta, and then assign class identifiers according to the predetermined rule.
Step 30 may further comprise:
in step 31, the key additionally contains a field giving the neighbor node's Rho;
Step 34: sort the key-value pairs of each group by the neighbor node's Rho contained in the key.
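Steps 31 to 35 can be imitated in the same single-process style. The sketch assumes that the output of step 20 is a list of records `(node, Rho, [(neighbor, edge length), ...])`; the dictionary `rho_of` stands in for the lookup of neighbor Rho values, however that lookup is actually realized (stored output, Cartesian product or otherwise). The threshold rule with R_T and D_T is used, the sort of step 34 is omitted because the reduce function scans all values anyway, and all function names are hypothetical.

```python
from collections import defaultdict


def map_with_neighbor_rho(record, rho_of):
    """Step 31: emit <node, (neighbor, edge length, neighbor Rho, node Rho)>."""
    node, rho, edges = record
    for neighbor, length in edges:
        yield node, (neighbor, length, rho_of[neighbor], rho)


def reduce_delta(node, values, r_t, d_t):
    """Step 35: compute Delta and decide whether the node is a class center."""
    values = list(values)
    rho = values[0][3]
    higher = [(length, nb) for nb, length, nb_rho, _ in values if nb_rho > rho]
    if higher:
        delta, nearest_higher = min(higher)      # shortest edge to a higher-Rho neighbor
    else:
        delta, nearest_higher = max(v[1] for v in values), None   # longest adjacent edge
    is_center = rho > r_t and delta > d_t
    return node, rho, delta, is_center, nearest_higher


def step30(step20_output, r_t, d_t):
    """Run map, group by node (steps 32-33), then reduce."""
    rho_of = {node: rho for node, rho, _ in step20_output}
    groups = defaultdict(list)
    for record in step20_output:
        for key, value in map_with_neighbor_rho(record, rho_of):
            groups[key].append(value)
    return [reduce_delta(n, v, r_t, d_t) for n, v in sorted(groups.items())]
```

The last field of each output record names the nearest neighbor with a higher Rho, which is the node whose class identifier a non-center node would take. Isolated nodes never appear here because they have no edges; they keep their own class identifier.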
In summary, the method of the present invention establishes virtual humans from behavior logs, has low complexity and high accuracy, and is suitable for processing big data. By applying the MapReduce distributed computing model, it implements clustering based on local density on a cluster of machines, which relaxes the limits imposed by the resources of a single machine, allows massive data to be processed, and lets the clustering complete faster.
Description of the drawings
In the drawings:
FIG. 1 is a flowchart of a preferred embodiment of the method for establishing a virtual human on a MapReduce platform according to the present invention;
FIG. 2 is a logic diagram of a preferred embodiment of the method for establishing a virtual human on a MapReduce platform according to the present invention;
FIG. 3 is a diagram of the distribution of Rho and Delta values in a preferred embodiment of the method for establishing a virtual human on a MapReduce platform according to the present invention.
Detailed description
The technical solutions of the present invention and their beneficial effects will become apparent from the following detailed description of specific embodiments, given with reference to the drawings.
FIG. 1 is a flowchart of a preferred embodiment of the method for establishing a virtual human on a MapReduce platform according to the present invention. The main steps of the invention are:
Step 1: extract from behavior logs the accounts and, for each account, the corresponding login time and login terminal information;
Step 2: calculate the similarity between accounts according to how the accounts co-occur, and construct a connected graph in which nodes represent accounts and the length of the edge between two nodes represents the similarity between the corresponding accounts; the shorter the edge, the higher the similarity;
Step 3: cluster the nodes of the connected graph on the MapReduce platform, and establish virtual humans according to the clustering result.
The invention may also include a step of merging all virtual humans and their corresponding accounts into a virtual human database.
To deal with practical problems such as the complex construction and low accuracy of virtual humans caused by the variety of account types, the present invention establishes virtual humans by analyzing behavior logs. A behavior log records how network users use network services and can be collected from servers, user terminals and so on. The method rests on the following observations about real-world usage:
1. Accounts that are active on the same terminal within a period of time may belong to the same person. When several accounts have all been active on the same terminal within a given period, we say that these accounts co-occur.
2. The more closely the accounts co-occur, for example the more often they co-occur, the more likely it is that they belong to the same person (this likelihood is called the similarity).
3. Among the accounts owned by a single user, some are always used more frequently than others.
4. Accounts of different users may occasionally co-occur, but they do not co-occur as closely as the accounts of a single user do with one another.
FIG. 2 is a logic diagram of a preferred embodiment of the method for establishing a virtual human on a MapReduce platform according to the present invention.
The key steps of this preferred embodiment are as follows.
Each record of the behavior log is abstracted as [time, terminal, account], giving data that contains a timestamp, an account ID and a terminal ID and that shows which account was active on which terminal at what time. By counting, for each account, how often it was active on the same terminal as other accounts within a period of time, the number of co-occurrences between accounts is obtained.
The count is only one way of measuring co-occurrence and is used in this embodiment to keep the explanation simple. Information such as the time of day can also be included as a weight. For example, co-occurrences outside working hours can be weighted slightly more heavily than those during working hours, because computer terminals are more likely to be shared during working hours.
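Counting co-occurrences from [time, terminal, account] records might look like the sketch below. The source does not fix what "within a period of time" means, so the sketch assumes fixed, non-overlapping time windows of `window_seconds` and counts a pair once per terminal and window; the function names and the window length are hypothetical, and the time-of-day weighting mentioned above is left out.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_occurrences(records, window_seconds=3600):
    """Count, per account pair, the (terminal, window) slots in which both were active."""
    active = defaultdict(set)                      # (terminal, window index) -> accounts
    for timestamp, terminal, account in records:
        active[(terminal, timestamp // window_seconds)].add(account)
    counts = Counter()
    for accounts in active.values():
        for a, b in combinations(sorted(accounts), 2):   # each unordered pair once
            counts[(a, b)] += 1
    return counts


def edge_lengths(counts):
    """Edge length as the reciprocal of the co-occurrence count."""
    return {pair: 1.0 / c for pair, c in counts.items()}
```

Pairs that never co-occur receive no entry, so they have no edge in the resulting graph.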
Based on the above observations about account co-occurrence, the similarity between accounts is calculated. When the data is abstracted into a connected graph, the nodes of the graph represent accounts and the length of an edge represents the similarity between the two accounts it connects. In general, the higher the similarity, the shorter the edge.
If other models are available, such as attribute matching, their matching results can likewise be used as a factor affecting the edge length.
Once the graph has been obtained, the following calculation determines which accounts belong to the same person:
For each node, compute its local density Rho. Rho is defined as the number of the node's edges whose length is less than a predefined value Dc.
For each node, compute its dispersion Delta. Delta is defined as the length of the shortest of the node's edges that connect to a neighbor node with a higher Rho; if no such neighbor node exists, it is the length of the node's longest adjacent edge.
Nodes whose Rho and Delta are higher than the thresholds R_T and D_T respectively are marked as class centers. Each such node represents a class, that is, a virtual human.
Every other, non-center node is assigned to the class of the center node that is nearest to it and has a higher Rho than itself.
Nodes of the same class belong to the same virtual human. A virtual human is established for each class.
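The whole calculation can be written as a single-machine reference sketch, which is useful for checking a distributed version on small inputs. Several points are assumptions rather than statements of the source: the graph is held as a dictionary `{node: {neighbor: edge length}}`; a non-center node takes the class of its nearest neighbor with a higher Rho, and that neighbor's class is resolved by following the same rule upward; and a node that is neither a center nor has any higher-Rho neighbor, a case the source does not cover, is given its own class. All function names are hypothetical.

```python
from collections import defaultdict


def build_graph(edges, isolated=()):
    """Undirected weighted graph as {node: {neighbor: edge_length}}."""
    graph = {node: {} for node in isolated}
    for a, b, length in edges:
        graph.setdefault(a, {})[b] = length
        graph.setdefault(b, {})[a] = length
    return graph


def cluster(graph, dc, r_t, d_t):
    """Return {node: class identifier} using the Rho/Delta threshold rule."""
    # Rho: number of adjacent edges shorter than Dc.
    rho = {n: sum(1 for d in nbrs.values() if d < dc) for n, nbrs in graph.items()}
    labels, parent = {}, {}
    for n, nbrs in graph.items():
        if not nbrs:                       # isolated node: its own class
            labels[n] = n
            continue
        higher = [(d, m) for m, d in nbrs.items() if rho[m] > rho[n]]
        # Delta: shortest edge to a higher-Rho neighbor, else the longest edge.
        delta = min(higher)[0] if higher else max(nbrs.values())
        if rho[n] > r_t and delta > d_t:
            labels[n] = n                  # class center
        elif higher:
            parent[n] = min(higher)[1]     # nearest neighbor with a higher Rho
        else:
            labels[n] = n                  # ASSUMPTION: case not covered by the source

    def find(n):
        # Rho strictly increases along parent links, so the walk terminates.
        while n in parent:
            n = parent[n]
        return labels[n]

    return {n: find(n) for n in graph}


def virtual_humans(labels):
    """Group accounts by class identifier; each group is one virtual human."""
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return dict(groups)
```

With two tight groups joined by one long edge, for example, the long edge does not count toward Rho when it exceeds Dc, and each group collapses onto its own densest node.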
As the calculation above shows, the present invention analyzes behavior logs with a clustering algorithm based on local density, which lowers the analysis complexity of the whole system compared with clustering methods such as K-Means and hierarchical clustering. In addition, Delta and Rho, two distributional features derived from the data itself, provide an objective reference for choosing the number of clusters.
The class-center rule described above requires the Rho and the Delta of a node to both exceed their respective thresholds. In practice, other rules based on Rho or Delta may be used, for example: if Rho is above 3, Delta must lie between 4 and 5; if Rho is above 5, Delta must lie between 5 and 6.
The meaning of the quantities used in the method is explained below with a simple example.
Edge length: a measure of the likelihood (similarity) that two nodes belong to the same person.
Rho: the importance of the current node to its adjacent nodes.
Delta: how distinguishable the current node would be from other class centers if it were taken as a class center.
For example:
The edge length can be defined as the reciprocal 1/c(a,b) of the number of co-occurrences c(a,b) of two accounts in the behavior log, that is, the reciprocal of the number of times the two accounts were both active on the same terminal within a given period.
Rho can be defined as the number of the current node's adjacent edges whose length is less than the parameter Dc.
Delta is defined as the length of the shortest of the node's edges that connect to a neighbor node with a higher Rho; if no such neighbor node exists, it is the length of the node's longest adjacent edge.
Under the example definitions above, the corresponding formulas are as follows.
Let c(a,b) be the number of co-occurrences of accounts a and b counted from the behavior log. Then:
1. The length of the edge between a and b:
d(a,b)=1/c(a,b)  [Equation 1].
2. For all N neighbor nodes bn of a, n=1…N (N a natural number), the Rho value of a:
$$\mathrm{Rho}(a)=\sum_{n=1}^{N} X\bigl(d(a,b_n)-Dc\bigr)$$
where X(x) is defined as: X(x)=1 if x<0, and X(x)=0 otherwise.
3. The Delta value of a:
Let the neighbor nodes of a be b1…bN. Delta(a) can then be defined as:
1) If there is a neighbor node bx satisfying Rho(bx)>Rho(a):
Delta(a)=min{d(a,bn) | n=1..N and Rho(bn)>Rho(a)}.
2) Otherwise:
Delta(a)=max{d(a,bn) | n=1..N}.
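As a worked illustration of these definitions, suppose node a has three neighbors with co-occurrence counts c(a,b1)=4, c(a,b2)=2 and c(a,b3)=1, and take Dc=0.6. All of these numbers are invented for the example and do not come from the source.

```latex
\begin{aligned}
d(a,b_1) &= 1/4 = 0.25, \qquad d(a,b_2) = 1/2 = 0.5, \qquad d(a,b_3) = 1/1 = 1,\\
\mathrm{Rho}(a) &= X(0.25-0.6) + X(0.5-0.6) + X(1-0.6) = 1 + 1 + 0 = 2.
\end{aligned}
```

If, in addition, Rho(b1)=3, Rho(b2)=1 and Rho(b3)=5 (again assumed values), the neighbors with a higher Rho than a are b1 and b3, so Delta(a)=min{0.25, 1}=0.25. If instead no neighbor had a Rho above 2, Delta(a) would be max{0.25, 0.5, 1}=1.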
In particular, a node without any adjacent edge can be labeled directly with its own class identifier, so that it forms a virtual human on its own.
The Delta value depends on the Rho value, and Rho itself may be defined in other ways, for example by a common centrality measure.
In practice the value of Dc depends on the data at hand, and it is usually chosen after the connected graph has been obtained. Like the parameters of other common clustering methods, it is an input parameter. It differs, however, from the K of K-Means: the choice of K directly fixes the number of classes, whereas the effect of Dc is mediated by the Rho and Delta values and by the choice of R_T and D_T, which weakens the influence of subjective factors because choosing these parameters takes the characteristics of the data itself into account.
One way of choosing R_T and D_T is as follows. FIG. 3 shows the distribution of Rho and Delta values in a preferred embodiment of the method for establishing a virtual human on a MapReduce platform according to the present invention, where each point represents a node. First plot the Rho and Delta values of all points, then examine the distribution of the Delta values (Rho values) and find the value at which the distribution changes abruptly; that value is taken as D_T (R_T). In FIG. 3 the distribution of the Delta values (Rho values) breaks at d' (r'), so D_T (R_T) is set to d' (r'). If there are many data points, a sample can be drawn and the distribution of the sample points used as the reference.
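The source describes locating the break by inspecting the plot. A simple automated stand-in, which is not the method of the source, is to sort the distinct values and place the threshold in the middle of the widest gap between consecutive values. The function name and the midpoint choice are assumptions of this sketch.

```python
def largest_gap_threshold(values):
    """Return the midpoint of the widest gap between consecutive distinct values."""
    ordered = sorted(set(values))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct values")
    # (gap width, lower value, upper value) for each pair of consecutive values.
    gaps = [(hi - lo, lo, hi) for lo, hi in zip(ordered, ordered[1:])]
    _, lo, hi = max(gaps)
    return (lo + hi) / 2
```

Applied to the Delta values of all nodes (or of a sample), the result is a candidate for D_T; applied to the Rho values, a candidate for R_T. The heuristic is fragile when the distribution has several comparable gaps, so the plot remains worth checking.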
Other models, such as attribute matching, can be introduced so that their matching results also influence the edge length. In other words, factors other than the number of co-occurrences between accounts are taken into account when computing the similarity between accounts.
Taking attribute matching as an example, in mathematical notation the attribute-matching result becomes a parameter of the edge-length computation. Let match(a,b) be the similarity of accounts a and b obtained by attribute matching; the edge length can then be defined as:
d(a,b)=f(c(a,b),match(a,b)).
Taking [Equation 1] as an example, it may be specifically defined as:
[Figure PCTCN2015072486-appb-000002: edge length after introducing the attribute-matching model]
As the above description shows, the present invention creates virtual humans from behavior logs using a clustering process based on local density, which has low complexity and high accuracy and is suitable for processing big data. Furthermore, to handle massive data and overcome the limits imposed by the finite resources of a single machine, the invention implements the local-density-based clustering process on the MapReduce platform, so that clustering of massive data completes faster.
The following illustrates how step 3 of the preferred embodiment may be implemented on the MapReduce platform. Step 3 may specifically include:
Step 20: Taking the node and edge information of the connected graph as input data, generate key-value pairs containing nodes and their adjacent-edge information through a Map job, and through a Reduce job generate output containing each node, the node's local density Rho, and all adjacent-edge information of the node. Rho is defined as the number of adjacent edges of the node whose length is less than the predefined value Dc.
Step 20 may specifically include:
Step 21: The node and edge information of the connected graph is fed as input data to a Map job, which generates key-value pairs. The key includes a field identifying a node; the value includes a field identifying a neighbor node and a field identifying the length of the adjacent edge between the node and that neighbor. Adjacent-edge information thus consists of the neighbor node and the length of the adjacent edge. As an optimization, the key in step 21 may additionally include a field identifying the length of the adjacent edge between the node and the neighbor node.
In an application, each line of the input data may correspond to the edge information between one pair of nodes. For convenience, the input may therefore be a triple consisting, in order, of the node with the smaller identifier a, the node with the larger identifier b, and the edge length len(a,b): [a, b, len(a,b)].
Because Rho must be computed for every node, the Map job emits two <Key, Value> outputs for each edge of the connected graph. Each Key and each Value consists of two fields, left and right, in that order. Specifically, the first Key may be K1=<a, len(a,b)> (here left=a and right=len(a,b)) with Value V1=<b, len(a,b)>, and the second Key may be K2=<b, len(a,b)> with Value V2=<a, len(a,b)>.
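The two emissions per edge can be sketched as a plain Python generator. This is an illustration only: it assumes a comma-separated text encoding of the triple [a, b, len(a,b)], and the name `map_edge` is hypothetical; a real job would use the platform's own Mapper interface and writable types.

```python
def map_edge(line):
    """Emit two (key, value) pairs for one input triple 'a,b,len(a,b)'.

    Assumption: the triple is stored as one comma-separated text line.
    """
    a, b, length = line.strip().split(",")
    length = float(length)
    # Key = <node, edge length>; Value = <neighbor, edge length>.
    yield (a, length), (b, length)
    yield (b, length), (a, length)


pairs = list(map_edge("u1,u2,0.5"))
```

Emitting the edge once under each endpoint is what lets a later Reduce call see every adjacent edge of a node.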
Step 22: Partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition. In this embodiment, the partition index of each record depends only on the first field of the Map output Key. For example, the partition index may be the hash of the Key's left field modulo the known total number of partitions, in pseudocode:
K.left.hashCode() % numPartitions
Here numPartitions is the total number of partitions. This guarantees that the edge information of all records whose Key has the same left field, that is, the same node, is stored in the same partition.
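The partitioning rule can be illustrated as follows. This is a sketch under assumptions: CRC32 from the standard library stands in for the hashCode() of the pseudocode (Python's built-in `hash()` for strings is salted per process and would not be stable across workers), and the function name `partition` is hypothetical.

```python
import zlib


def partition(key, num_partitions):
    """Choose a partition from the left field (the node) of the key only."""
    left, _right = key
    # CRC32 is a stable, non-negative stand-in for hashCode() (assumption).
    return zlib.crc32(str(left).encode("utf-8")) % num_partitions


# Two records of node "u1" with different edge lengths share a partition.
p1 = partition(("u1", 0.5), 4)
p2 = partition(("u1", 2.0), 4)
```

Because the right field (edge length) is ignored, all edges of one node reach the same reducer regardless of their length.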
Step 23: Group the key-value pairs within each partition by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group.
The result of grouping (GroupCompare) depends only on the comparison of the first fields of the Keys being compared. For example, for two Keys k1 and k2, the comparison result is:
k1.left.compare(k2.left)
This guarantees that the information on all edges of a node (the Values, that is, neighbor nodes and edge lengths) is processed in the same Reduce call.
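Grouping on the left field alone can be imitated in memory with `itertools.groupby`. This is a minimal sketch, not the platform's grouping comparator; the name `group_by_node` and the sample records are illustrative assumptions.

```python
from itertools import groupby


def group_by_node(records):
    """Group (key, value) records by the left field of the key."""
    # groupby only merges adjacent items, so sort by the node first.
    ordered = sorted(records, key=lambda kv: kv[0][0])
    for node, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield node, [value for _key, value in group]


records = [
    (("a", 0.5), ("b", 0.5)),
    (("b", 0.5), ("a", 0.5)),
    (("a", 0.2), ("c", 0.2)),
]
groups = dict(group_by_node(records))
```

Each resulting group corresponds to the Values that one Reduce call would iterate over.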
Step 24: Sort the key-value pairs belonging to the same group by the length of the adjacent edge contained in the key. The sort in step 24 may be in ascending order. Step 24 is an optional optimization that may be called in-group sorting (SortComparator, SC); it can be defined as the result of comparing the two fields left and right in that order. In pseudocode:
[Figure PCTCN2015072486-appb-000003: pseudocode of the sort comparator]
Because the right field of every Key is an edge length, this guarantees that, when the Reduce call iterates, the edge information is returned in ascending order of edge length. Note: the Key in step 21 is composed of the two fields node identifier and edge length precisely to enable this optimization; without it, the Key in step 21 need only contain the node identifier.
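The comparator pseudocode is available only as a figure, so the following is a reconstruction from the prose alone: compare the left field first, then the right field, both ascending. The function name `sort_compare` is hypothetical.

```python
from functools import cmp_to_key


def sort_compare(k1, k2):
    """Compare two keys by left field (node), then by right field (edge length)."""
    if k1[0] != k2[0]:
        return -1 if k1[0] < k2[0] else 1
    if k1[1] != k2[1]:
        return -1 if k1[1] < k2[1] else 1
    return 0


keys = [("b", 0.5), ("a", 0.7), ("a", 0.2)]
ordered = sorted(keys, key=cmp_to_key(sort_compare))
```

With this ordering, the keys of node "a" arrive with the shortest edge first, which is what the early termination of the Rho count relies on.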
Step 25: Through the Reduce job, traverse all adjacent edges of a node by iterating over the values of the key-value pairs belonging to the same group, and generate output containing the node, the node's local density Rho, and all adjacent-edge information of the node.
The output of the Reduce job in step 25 is a key-value pair in which the key includes a field identifying the node, and the value includes a field identifying the node, a field holding the node's Rho, and a field holding all adjacent-edge information of the node.
After the above steps, each Reduce call can traverse all edges of one node by iterating over the Values. Each Reduce call outputs three pieces of information: the identifier of the current node n, the Rho value of n, and all adjacent-edge information of n sorted by edge length.
When the SC optimization described above is used, counting for Rho can stop as soon as the iterated edge length exceeds the predefined value Dc. In addition, because the adjacent edges have already been sorted in ascending order by the SC, the adjacent-edge information can simply be concatenated in iteration order. Without this optimization, counting for Rho can only finish once the last edge has been iterated, and the adjacent-edge information must be sorted before it becomes part of the Value.
As an example, the output may be a key-value pair of the form:
[K=n, V=<n, Rho(n), n1:len(n,n1), n2:len(n,n2), …, nN:len(n,nN)>].
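The Reduce call of step 25 can be sketched as below, assuming the Values arrive already sorted by ascending edge length (the SC optimization). The name `reduce_rho` and the tuple layout of the output are illustrative assumptions; the count uses a strict "less than Dc" test, following the definition of Rho in step 20.

```python
def reduce_rho(node, sorted_values, dc):
    """Count edges shorter than dc and collect all adjacent-edge information.

    sorted_values: (neighbor, length) pairs in ascending order of length.
    """
    rho = 0
    counting = True
    neighbors = []
    for neighbor, length in sorted_values:
        if counting and length < dc:
            rho += 1
        else:
            # Ascending order: no later edge can be shorter than dc.
            counting = False
        neighbors.append((neighbor, length))
    return node, (node, rho, neighbors)


out = reduce_rho("n", [("n1", 0.1), ("n2", 0.3), ("n3", 0.8)], dc=0.5)
```

The iteration still visits every edge because the full neighbor list is part of the output; only the counting stops early.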
In this preferred embodiment, the first MapReduce job described above mainly computes the Rho values and sorts the neighbor nodes in ascending order of distance. The second MapReduce job, described next, mainly computes the Delta values and identifies the cluster centers.
Step 30: From the output of the Reduce job in step 20, generate through a Map job key-value pairs containing the node, the node's Rho, the neighbor node's Rho and the adjacent-edge information; then, for each node, traverse through a Reduce job the node's Rho, the Rho of all neighbor nodes and all adjacent-edge information to obtain the node's dispersion Delta, and assign cluster labels according to a predetermined rule.
In this preferred embodiment the predetermined rule is as follows. If a node's Rho and Delta are higher than the thresholds R_T and D_T, respectively, which are input parameters, the node is the center of a cluster and takes its own cluster label; otherwise, the node takes the cluster label of the nearest neighbor node whose Rho is higher than its own.
An isolated node (a node with no neighbors) takes its own cluster label. This rule is similar to the one used in Chinese patent application CN 201410814330.4, "Method and Device for Creating a Virtual Human": it rigidly requires that the Rho and Delta values each exceed their corresponding thresholds.
This is only one way of deciding whether a node can be identified as a cluster center. Fundamentally, the decision is made from the node's Rho and Delta values, and there are various other ways of judging that use factors including Rho and Delta. In the method for creating a virtual human on the MapReduce platform according to the present invention, the way cluster centers are confirmed can also be relaxed, which allows clustering to complete faster. For example, the predetermined rule may include: dividing in advance the admissible interval of Rho values and the corresponding admissible interval of Delta values; if a node's Rho value falls within the admissible Rho interval and its Delta value falls within the corresponding admissible Delta interval, the node is the center of a cluster and takes its own cluster label; otherwise, the node takes the cluster label of the nearest neighbor node whose Rho is higher than its own; an isolated node takes its own cluster label. For instance, if a node's Rho value lies in the range [10, 20] and its Delta value lies in [0.9*10, 0.8*20] (that is, the Delta value lies within a range that varies with the Rho range, so that the Delta interval corresponds to the Rho interval), the node may also be identified as a cluster center.
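The two center rules can be put side by side in a short sketch. The function names are hypothetical, and treating the intervals as closed on both ends is an assumption, since the text does not say whether the endpoints are included.

```python
def is_center_threshold(rho, delta, r_t, d_t):
    """Rigid rule: both Rho and Delta must exceed their thresholds."""
    return rho > r_t and delta > d_t


def is_center_interval(rho, delta, rho_range, delta_range):
    """Relaxed rule: Rho and Delta must fall in paired intervals.

    Assumption: both intervals are closed.
    """
    rho_lo, rho_hi = rho_range
    delta_lo, delta_hi = delta_range
    return rho_lo <= rho <= rho_hi and delta_lo <= delta <= delta_hi


# Intervals from the example in the text: Rho in [10, 20], Delta in [0.9*10, 0.8*20].
rho_range = (10, 20)
delta_range = (0.9 * 10, 0.8 * 20)
inside = is_center_interval(15, 12.0, rho_range, delta_range)
outside = is_center_interval(15, 5.0, rho_range, delta_range)
```

Several such interval pairs could be checked in turn, so that different density levels admit different Delta ranges.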
In this way, the cluster label of every node is finally obtained. Nodes with the same cluster label belong to the same cluster, that is, to the same virtual human, which completes step 40: the nodes of the same cluster together constitute one virtual human.
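The text does not spell out how a non-center node obtains the label of its higher-Rho neighbor when that neighbor is itself not a center. One possible reading, sketched below as an assumption, is to follow the chain of "nearest higher-Rho neighbor" pointers until a node that labels itself is reached. The names `resolve_labels`, `virtual_humans` and the `parent` mapping are hypothetical.

```python
from collections import defaultdict


def resolve_labels(parent):
    """Follow parent pointers until a self-labelled node is reached.

    parent maps each node to its nearest higher-Rho neighbor, or to itself
    for cluster centers and isolated nodes. Rho strictly increases along
    each chain, so the chains cannot form cycles.
    """
    labels = {}
    for node in parent:
        path = []
        current = node
        while current not in labels and parent[current] != current:
            path.append(current)
            current = parent[current]
        label = labels.get(current, current)
        labels[current] = label
        for visited in path:
            labels[visited] = label
    return labels


def virtual_humans(labels):
    """Group nodes (accounts) that share a cluster label."""
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return dict(groups)


parent = {"a": "a", "b": "a", "c": "b", "d": "d"}
humans = virtual_humans(resolve_labels(parent))
```

Each set in the result is the group of accounts that would make up one virtual human under this reading.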
To compute the Delta value of a node, the Rho values of its neighbor nodes are needed. Given the output of the Reduce job in step 20, the Rho values of neighbor nodes can be traversed using the general technique for computing a Cartesian product on MapReduce, that is, by implementing a full join through a custom InputFormat. This traversal serves the subsequent computation of Delta. For a related example, see [MapReduce Design Patterns, O'Reilly, Dec. 2012, pp. 128-138].
Step 30 may specifically include:
Step 31: From the output of the Reduce job in step 20, generate key-value pairs through a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field identifying the length of the adjacent edge between the node and the neighbor node, a field holding the neighbor node's Rho, and a field holding the node's Rho.
The Map job takes the output of the Reduce job in step 20 and outputs the information of the current node together with that of the neighbor nodes obtained through the join. An optimized example output format is:
[K=<a,Rho(b)>, V=<Rho(b),Rho(a),b,len(a,b)>].
In step 31, the key may optionally also include a field holding the neighbor node's Rho. The optimization consists in placing Rho(b) in the Key as well, which facilitates the sort in the subsequent step 34.
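The Map output of step 31 can be sketched as follows. The dictionary `rho_of` is a stand-in for whatever mechanism supplies the neighbors' Rho values (the Cartesian-product join or a lookup store); it and the name `map_with_neighbor_rho` are illustrative assumptions.

```python
def map_with_neighbor_rho(node, neighbors, rho_of):
    """Emit K=<a, Rho(b)>, V=<Rho(b), Rho(a), b, len(a,b)> for each neighbor b.

    neighbors: (neighbor, length) pairs from the step-20 output.
    rho_of: mapping node -> Rho, standing in for the join (assumption).
    """
    rho_a = rho_of[node]
    for b, length in neighbors:
        rho_b = rho_of[b]
        yield (node, rho_b), (rho_b, rho_a, b, length)


rho_of = {"a": 2, "b": 3, "c": 1}
emitted = list(map_with_neighbor_rho("a", [("b", 0.2), ("c", 0.4)], rho_of))
```

Carrying Rho(b) in the key as well as in the value is what allows the framework to sort neighbors by Rho before the Reduce call sees them.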
Step 32: Partition the key-value pairs by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same partition. For details, see step 22.
Step 33: Group the key-value pairs within each partition by the node contained in the key, so that key-value pairs whose keys contain the same node are assigned to the same group. For details, see step 23.
Step 34: Sort the key-value pairs belonging to the same group by the neighbor-node Rho contained in the key. As an optional optimization, the first field of the Key is used to determine whether two Keys belong to the same node; if they do, they are sorted in descending order of the second field. This ordering guarantees that, within one Reduce call, neighbor nodes with high Rho values are iterated first.
Step 35: Through the Reduce job, for each node, traverse the node's Rho, the Rho of all neighbor nodes and all adjacent-edge information by iterating over the values of the key-value pairs belonging to the same group, obtain the node's dispersion Delta, and assign cluster labels according to the predetermined rule.
After the above steps, each Reduce call can traverse the information of a node and all of its adjacent edges by iterating over the Values. Combined with the thresholds R_T and D_T supplied as input parameters, this yields the information needed for cluster labeling.
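The Reduce call of step 35 can be sketched using the Delta definition given in claim 2: the shortest adjacent edge leading to a neighbor with higher Rho, or the longest adjacent edge if no such neighbor exists. The name `reduce_delta` and the returned triple are hypothetical. One case is not settled by the text and is handled here by assumption: a node that has no higher-Rho neighbor but fails the threshold test keeps its own label.

```python
def reduce_delta(node, values, r_t, d_t):
    """Return (node, delta, label_source) for one node.

    values: non-empty iterable of (rho_b, rho_a, b, length) tuples.
    label_source is the node itself for a center, else its nearest
    higher-Rho neighbor.
    """
    values = list(values)
    rho_a = values[0][1]
    higher = [(length, b) for rho_b, _rho_a, b, length in values if rho_b > rho_a]
    if higher:
        # Shortest edge to a neighbor with higher Rho.
        delta, nearest = min(higher)
    else:
        # No higher-Rho neighbor: use the longest adjacent edge.
        delta = max(length for _rho_b, _rho_a, _b, length in values)
        nearest = None
    is_center = rho_a > r_t and delta > d_t
    # Assumption: without a higher-Rho neighbor the node labels itself.
    label_source = node if is_center or nearest is None else nearest
    return node, delta, label_source


values = [(3, 2, "b", 0.2), (1, 2, "c", 0.4), (5, 2, "d", 0.3)]
as_center = reduce_delta("a", values, r_t=1, d_t=0.1)
as_member = reduce_delta("a", values, r_t=5, d_t=0.1)
```

This version scans all values and so does not depend on the descending-Rho sort of step 34; with that sort in place, the scan could stop at the first neighbor whose Rho is not higher.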
In this preferred embodiment, the Map process of step 30 is implemented on the native MapReduce scheme, but in practice it can be accelerated with common database technology. For example, when the Reduce job in step 20 produces its output, the Rho value of each node can be stored in a relational database or a key-value database. The Map stage of step 30 then only needs to query the Rho values of the neighbor nodes and does not need a custom InputFormat; in other words, the Cartesian product is no longer required, because the Map stage can read the neighbor nodes' Rho values directly from the database.
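The lookup variant can be illustrated with an in-memory SQLite database from the Python standard library. The table name `rho`, its columns and the helper `lookup_rho` are assumptions made for this sketch; a production job would more likely use a shared database or key-value store reachable from every Map task.

```python
import sqlite3

# Step 20 side: store each node's Rho (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rho (node TEXT PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO rho VALUES (?, ?)", [("a", 2), ("b", 3)])
conn.commit()


def lookup_rho(connection, node):
    """Step 30 Map side: fetch the Rho of one neighbor node, or None if absent."""
    row = connection.execute(
        "SELECT value FROM rho WHERE node = ?", (node,)
    ).fetchone()
    return None if row is None else row[0]


rho_b = lookup_rho(conn, "b")
```

The point of the design is that one indexed lookup per neighbor replaces the full join over the step-20 output.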
In the method for creating a virtual human on the MapReduce platform according to the present invention, what the analysis of behavior logs actually yields is "which accounts are operated by the same person". In real system requirements, the actual user is often more meaningful than the account owner, and this approach also reduces deviations in account attribution caused by false key values such as ID card numbers. Analyzing behavior logs broadens the applicability of the whole system: only account identifiers are required, and specific account attributes are not necessarily needed. Owing to the characteristics derived from behavior logs and the reduced complexity described above, the invention is better suited to environments with broader scope, longer time spans and larger data volumes. In fact, the broader the scope, the longer the period and the larger the volume of the data collected, the higher the actual accuracy of the system. Starting from the account attribution obtained by clustering after the behavior-log analysis described above, the invention can combine additional data such as account attributes to further describe attribute information of the virtual human, such as name and address.
In summary, the method for creating a virtual human on the MapReduce platform according to the present invention creates virtual humans from behavior logs, has low complexity and high accuracy, and is suitable for processing big data. By means of the popular MapReduce distributed computing model, it realizes local-density-based clustering on a cluster, weakens the limits imposed during processing by the finite resources of a single machine, can process massive data, and completes clustering faster.
Based on the above, those of ordinary skill in the art can make various other corresponding changes and modifications in accordance with the technical solutions and technical concepts of the present invention, and all such changes and modifications shall fall within the scope of protection of the claims appended to the present invention.

Claims (10)

  1. A method for creating a virtual human on a MapReduce platform, characterized by comprising:
    Step 1: extracting, from behavior logs, accounts and the login time and login terminal information corresponding to each account;
    Step 2: calculating the similarity between accounts according to their co-occurrence, and constructing a connected graph in which nodes represent accounts and the length of the edge between two nodes represents the similarity between the corresponding accounts, a shorter edge indicating a higher similarity;
    Step 3: clustering the nodes of the connected graph on the MapReduce platform, and creating virtual humans according to the clustering result.
  2. The method for creating a virtual human on a MapReduce platform according to claim 1, characterized in that step 3 comprises:
    Step 20: taking the node and edge information of the connected graph as input data, generating through a Map job key-value pairs containing nodes and adjacent-edge information, and generating through a Reduce job output containing each node, the node's local density Rho and all adjacent-edge information of the node, Rho being defined as the number of adjacent edges of the node whose length is less than a predefined value Dc;
    Step 30: from the output of the Reduce job in step 20, generating through a Map job key-value pairs containing the node, the node's Rho, the neighbor node's Rho and the adjacent-edge information, and, for each node, traversing through a Reduce job the node's Rho, the Rho of all neighbor nodes and all adjacent-edge information to obtain the node's dispersion Delta, Delta being defined as the length of the shortest edge among all adjacent edges of the node that connect to neighbor nodes with a higher Rho value or, if no such neighbor node exists, the length of the node's longest adjacent edge; and then assigning cluster labels according to a predetermined rule;
    Step 40: the nodes of the same cluster together constitute one virtual human.
  3. The method for creating a virtual human on a MapReduce platform according to claim 2, characterized in that the predetermined rule comprises:
    if a node's Rho and Delta are higher than a threshold R_T and a threshold D_T, respectively, which are input parameters, the node is the center of a cluster and takes its own cluster label; otherwise, the node takes the cluster label of the nearest neighbor node whose Rho is higher;
    an isolated node takes its own cluster label.
  4. The method for creating a virtual human on a MapReduce platform according to claim 2, characterized in that the predetermined rule comprises:
    dividing in advance an admissible interval of Rho values and a corresponding admissible interval of Delta values; if a node's Rho value falls within the admissible Rho interval and its Delta value falls within the corresponding admissible Delta interval, the node is the center of a cluster and takes its own cluster label; otherwise, the node takes the cluster label of the nearest neighbor node whose Rho is higher;
    an isolated node takes its own cluster label.
  5. The method for creating a virtual human on a MapReduce platform according to claim 2, characterized in that the output of the Reduce job in step 20 is stored in a relational database or a key-value database.
  6. The method for creating a virtual human on a MapReduce platform according to claim 2, characterized in that, in the Map job of step 30, the Rho values of neighbor nodes are traversed by computing a Cartesian product over the output of the Reduce job in step 20.
  7. The method for creating a virtual human on a MapReduce platform according to claim 2, characterized in that step 20 comprises:
    Step 21: feeding the node and edge information of the connected graph as input data to a Map job to generate key-value pairs, wherein the key includes a field identifying a node, and the value includes a field identifying a neighbor node and a field identifying the length of the adjacent edge between the node and the neighbor node;
    Step 22: partitioning the key-value pairs by the node contained in the key, key-value pairs whose keys contain the same node being assigned to the same partition;
    Step 23: grouping the key-value pairs within each partition by the node contained in the key, key-value pairs whose keys contain the same node being assigned to the same group;
    Step 25: through the Reduce job, traversing all adjacent edges of a node by iterating over the values of the key-value pairs belonging to the same group, and generating output containing the node, the node's local density Rho and all adjacent-edge information of the node.
  8. The method for creating a virtual human on a MapReduce platform according to claim 7, characterized in that step 20 further comprises:
    in step 21, the key further includes a field identifying the length of the adjacent edge between the node and the neighbor node;
    Step 24: sorting the key-value pairs belonging to the same group by the length of the adjacent edge contained in the key.
  9. The method for creating a virtual human on a MapReduce platform according to claim 2, characterized in that step 30 comprises:
    Step 31: from the output of the Reduce job in step 20, generating key-value pairs through a Map job, wherein the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field identifying the length of the adjacent edge between the node and the neighbor node, a field holding the neighbor node's Rho and a field holding the node's Rho;
    Step 32: partitioning the key-value pairs by the node contained in the key, key-value pairs whose keys contain the same node being assigned to the same partition;
    Step 33: grouping the key-value pairs within each partition by the node contained in the key, key-value pairs whose keys contain the same node being assigned to the same group;
    Step 35: through the Reduce job, for each node, traversing the node's Rho, the Rho of all neighbor nodes and all adjacent-edge information by iterating over the values of the key-value pairs belonging to the same group, obtaining the node's dispersion Delta, and assigning cluster labels according to the predetermined rule.
  10. The method for creating a virtual human on a MapReduce platform according to claim 9, characterized in that step 30 further comprises:
    in step 31, the key further includes a field holding the neighbor node's Rho;
    Step 34: sorting the key-value pairs belonging to the same group by the neighbor-node Rho contained in the key.
PCT/CN2015/072486 2014-12-31 2015-02-09 Method for creating virtual human on mapreduce platform WO2016106944A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410855499.4A CN104965846B (en) 2014-12-31 2014-12-31 Visual human's method for building up in MapReduce platform
CN201410855499.4 2014-12-31

Publications (1)

Publication Number Publication Date
WO2016106944A1 true WO2016106944A1 (en) 2016-07-07

Family

ID=54219882

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/072486 WO2016106944A1 (en) 2014-12-31 2015-02-09 Method for creating virtual human on mapreduce platform

Country Status (2)

Country Link
CN (1) CN104965846B (en)
WO (1) WO2016106944A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978382A (en) * 2014-12-31 2015-10-14 深圳市华傲数据技术有限公司 Clustering method based on local density on MapReduce platform
CN109816029B (en) * 2019-01-30 2023-12-19 重庆邮电大学 High-order clustering division algorithm based on military operation chain
CN112487250B (en) * 2019-09-11 2022-06-21 武汉斗鱼网络科技有限公司 Method and device for identifying service account group
CN110728317A (en) * 2019-09-30 2020-01-24 腾讯科技(深圳)有限公司 Training method and system of decision tree model, storage medium and prediction method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1804882A (en) * 2006-01-25 2006-07-19 深圳市中科新业信息科技发展有限公司 Virtual population system and its building method
US20080320040A1 (en) * 2007-06-19 2008-12-25 Marina Zhurakhinskaya Methods and systems for use of a virtual persona emulating activities of a person in a social network
CN103902709A (en) * 2014-03-31 2014-07-02 安徽新华博信息技术股份有限公司 Association analyzing method
CN104504264A (en) * 2014-12-08 2015-04-08 深圳市华傲数据技术有限公司 Virtual person building method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103339624A (en) * 2010-12-14 2013-10-02 加利福尼亚大学董事会 High efficiency prefix search algorithm supporting interactive, fuzzy search on geographical structured data
CN103544289A (en) * 2013-10-28 2014-01-29 公安部第三研究所 Feature extraction achieving method based on deploy and control data mining

Also Published As

Publication number Publication date
CN104965846B (en) 2018-10-02
CN104965846A (en) 2015-10-07

Similar Documents

Publication Publication Date Title
WO2016090748A1 (en) Virtual human creating method and apparatus
US10484413B2 (en) System and a method for detecting anomalous activities in a blockchain network
WO2017211051A1 (en) Mining method and server for social network account of target subject, and storage medium
CN104077723B (en) A kind of social networks commending system and method
CN109284626A (en) Random forests algorithm towards difference secret protection
CN108897842A (en) Computer readable storage medium and computer system
Gong et al. Novel heuristic density-based method for community detection in networks
Qu et al. Efficient online summarization of large-scale dynamic networks
WO2016106944A1 (en) Method for creating virtual human on mapreduce platform
Liu et al. A new clustering algorithm based on data field in complex networks
Strotmann et al. Author name disambiguation for collaboration network analysis and visualization
CN110968802B (en) Analysis method and analysis device for user characteristics and readable storage medium
CN107679209A (en) Expression formula generation method of classifying and device
Bhat et al. OCMiner: a density-based overlapping community detection method for social networks
Jeyasudha et al. An intelligent centrality measures for influential node detection in COVID-19 environment
Yuan et al. Efficient processing of streaming graphs for evolution-aware clustering
Olech et al. Hierarchical gaussian mixture model with objects attached to terminal and non-terminal dendrogram nodes
CN115329078B (en) Text data processing method, device, equipment and storage medium
JP5734118B2 (en) Method and program for extracting, naming and visualizing small groups from social networks
Sina et al. Solving the missing node problem using structure and attribute information
WO2016107297A1 (en) Clustering method based on local density on mapreduce platform
Negara et al. Analysis of Indonesian Motorcycle Gang with Social Network Approach
Panagopoulos et al. Scientometrics for success and influence in the microsoft academic graph
CN113988878A (en) Graph database technology-based anti-fraud method and system
Bhat et al. A density-based approach for mining overlapping communities from social network interactions

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15874637; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 15874637; Country of ref document: EP; Kind code of ref document: A1)