WO2016090748A1 - Method and apparatus for establishing a virtual person - Google Patents

Method and apparatus for establishing a virtual person

Info

Publication number
WO2016090748A1
Authority
WO
WIPO (PCT)
Prior art keywords
accounts
node
nodes
virtual person
virtual
Prior art date
Application number
PCT/CN2015/072487
Other languages
English (en)
French (fr)
Inventor
蔡立宇
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2016090748A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Z: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS, NOT OTHERWISE PROVIDED FOR
    • G16Z99/00: Subject matter not provided for in other main groups of this subclass

Definitions

  • the present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for establishing a virtual person based on a behavior log.
  • Each network service generally assigns an account to each user; the account is associated with the user's registration information and is used to record and identify the user, for example an instant messaging number (such as a QQ account) or an e-mail address of the network user.
  • A) Specify the rules for matching network account attributes: which attributes are matched in which situation, and how a successful match is determined. For example, when a QQ account is matched against a Taobao account, the two accounts are considered successfully matched if the edit distances of both the "name" and "contact" fields are less than 3.
  • an object of the present invention is to provide a method for establishing a virtual person based on a behavior log, which solves the problem that virtual persons are complicated to construct and are constructed with low accuracy because of the variety of account types.
  • Another object of the present invention is to provide an apparatus for establishing a virtual person based on a behavior log, which solves the same problem of complicated construction and low accuracy caused by the variety of account types.
  • the present invention provides a virtual human establishment method, including the following steps:
  • the nodes in the connected graph are clustered, and a virtual person is established according to the clustering result.
  • the process of clustering nodes in the connected graph includes the following steps:
  • Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc;
  • Delta is defined as the length of the shortest of the node's edges that connect to neighbor nodes with a higher Rho value; if no such neighbor node exists, Delta is the length of the node's longest incident edge.
  • Each node of the same class together constitutes a virtual person, that is, belongs to the same virtual person.
  • the nodes in the connected graph are clustered by using a K-Means method or a hierarchical clustering method.
  • the method further includes merging all the virtual persons and the account corresponding to the virtual person to become a virtual person database.
  • the invention also provides a virtual person establishing device, comprising:
  • An information extracting unit configured to extract an account and a login time and a login terminal information corresponding to the account from the behavior log;
  • a connected graph construction unit configured to compute the similarity between accounts from their co-occurrence and to construct a connected graph in which nodes represent accounts and the length of the edge between two nodes represents the similarity between the accounts, a shorter edge between nodes indicating a higher similarity between the accounts they represent;
  • a virtual person establishing unit configured to cluster the nodes in the connected graph and establish a virtual person according to the clustering result.
  • the apparatus further includes an external model introducing unit, configured to introduce factors other than the co-occurrence of accounts when computing the similarity between the accounts.
  • the process of clustering nodes in the connected graph includes the following steps:
  • Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc;
  • Delta is defined as the length of the shortest of the node's edges that connect to neighbor nodes with a higher Rho value; if no such neighbor node exists, Delta is the length of the node's longest incident edge.
  • Each node of the same class together constitutes a virtual person.
  • the nodes in the connected graph are clustered by using a K-Means method or a hierarchical clustering method.
  • the virtual person merging unit is further configured to merge all virtual persons and accounts corresponding to the virtual person into a virtual person database.
  • the virtual human establishment method and apparatus of the present invention establish a virtual person based on the behavior log, has low complexity, high accuracy, and is suitable for processing big data.
  • FIG. 1 is a flow chart of a preferred embodiment of a virtual human establishment method according to the present invention.
  • FIG. 2 is a logic diagram of a preferred embodiment of a virtual human establishment method according to the present invention.
  • FIG. 3 is a schematic diagram of a Rho value-Delta value distribution in a preferred embodiment of a virtual human establishment method according to the present invention
  • FIG. 4 is a schematic structural diagram of a virtual human entity establishing apparatus according to a preferred embodiment of the present invention.
  • FIG. 1 is a flowchart of a preferred embodiment of a virtual human establishment method according to the present invention.
  • the main steps of the invention include:
  • the nodes in the connected graph are clustered, and a virtual person is established according to the clustering result.
  • the invention may also include the step of merging all virtual persons and accounts corresponding to the virtual person into a virtual person database.
  • the present invention proposes an analysis method based on behavior log.
  • the behavior log records how network users use network services, and can be collected from the server side, user terminals, and the like.
  • the method is based on the following observations of reality:
  • accounts that are active on the same terminal within a period of time may belong to the same person. When multiple accounts have all been active on the same terminal within a given period, we call this a co-occurrence of these accounts.
  • FIG. 2 is a logic diagram of a preferred embodiment of a virtual human establishment method according to the present invention.
  • Step 1 Abstract each record in the behavior log as [time, terminal, account] to obtain data containing a timestamp, an account ID, and a terminal ID, so that it is known which account was active on which terminal and when.
  • For each account, count how many times within a period it was active on the same terminal as each other account; this yields the number of co-occurrences between accounts.
  • "Number of times" is one way of measuring the co-occurrence "situation", and this embodiment speaks of the number of times only to simplify the explanation. In practice, information such as the time of day can be added as a weight: for example, co-occurrences outside working hours can be weighted slightly more heavily than those during working hours, because computer terminals are more likely to be shared during working hours.
  • Step 2 Based on the observed co-occurrence of accounts, compute the similarity between accounts. Abstracted as a connected graph, the nodes represent accounts and the length of an edge represents the similarity between the accounts it joins. Usually, the higher the similarity, the shorter the edge.
  • Step 3 If other models are available, such as attribute matching, their matching results can also be used as a factor affecting edge length.
  • Step 4 After obtaining the above figure, you can perform the following calculations to determine which accounts belong to the same person:
  • Step 4.1 Find the local density Rho for each node.
  • Rho is defined as the number of the node's edges whose length is below a predefined value Dc.
  • Step 4.2 For each node, find its dispersion Delta. Delta is defined as the length of the shortest of the node's edges that connect to neighbor nodes with a higher Rho value; if no such neighbor node exists, Delta is the length of the node's longest incident edge.
  • Step 4.3 Identify the nodes whose Rho and Delta values are higher than the specific thresholds R_T and D_T, respectively, as class center nodes.
  • Each such node represents a class, that is, a virtual person.
  • Step 4.4 Assign each remaining non-center node to the class of the center node that is closest to it and has a Rho value higher than its own.
  • Step 4.5 Nodes of the same class belong to the same virtual person.
  • A corresponding virtual person is established for each class.
  • Instead of the clustering method shown in key step 4, other common clustering methods such as K-Means and hierarchical clustering can also be used; they achieve similar results and differ only in complexity or effectiveness.
  • the behavioral log is analyzed in combination with the clustering algorithm in the preferred embodiment, and the analysis complexity of the whole system is reduced compared with other clustering methods such as K-Means and hierarchical clustering.
  • the two distribution values derived from the data itself, Delta and Rho provide an objective reference for the selection of the number of clusters.
  • In key step 4.3, the class center identification method shown requires the node's Rho value and Delta value to be simultaneously higher than their corresponding thresholds.
  • Other methods based on Rho or Delta values can be adopted in practice, for example requiring the Delta value to lie between 4 and 5 when the Rho value is higher than 3, and between 5 and 6 when the Rho value is higher than 5.
  • Rho represents the importance of the current node to its adjacent nodes.
  • the edge length can be defined as the reciprocal 1/c(a,b) of the number of co-occurrences c(a,b) of the two accounts in the behavior log, that is, the reciprocal of the number of times the two accounts have both been active on the same terminal within a certain period.
  • Rho can be defined as: the number of edges in the neighboring edge of the current node that are less than the parameter value Dc.
  • Delta is defined as the side length of the shortest side of the neighbors of all the neighbor nodes connected to the higher Rho value of the node; if there is no such neighbor node, the side length of the longest neighbor of the node is taken.
  • Delta(a) can be defined as:
  • computing the Delta value depends on the Rho value, and the Rho value can also be defined in other ways, such as common centrality measures.
  • in practice the value of Dc depends on the specific data; usually Dc is determined after the connected graph has been obtained. As in other common clustering methods, it is an input parameter. Unlike the choice of K in K-Means, which directly fixes the number of classes, the effect of Dc passes through the Rho and Delta values and the choice of R_T and D_T, which weakens the influence of subjective factors, because choosing these parameters brings in an objective consideration of the characteristics of the data itself.
  • One method for selecting R_T and D_T is as follows.
  • FIG. 3 is a schematic diagram of the Rho value-Delta value distribution in a preferred embodiment of the virtual human establishment method of the present invention, where each point represents a node.
  • In FIG. 3, the distribution of the Delta value (Rho value) is discontinuous or changes abruptly at d' (r'), so D_T (R_T) takes the value d' (r'). If there are many data points, they can be sampled and the distribution plot of the sample points used as a reference for the values.
  • the matching result of the corresponding model can also be used as a factor affecting the length of the edge. That is to say, factors other than the number of times of cooperation between the accounts are introduced to calculate the similarity between the accounts.
  • the result of the attribute matching is used as a parameter for calculating the edge length. That is, if match(a,b) is the account similarity of a and b obtained by attribute matching, the edge length can be defined as follows:
  • FIG. 4 is a schematic structural diagram of a virtual person establishing apparatus according to a preferred embodiment of the present invention.
  • the virtual person establishing apparatus of the preferred embodiment includes an information extracting unit 1, a connected graph construction unit 2, an external model introducing unit 3, a virtual person establishing unit 4, and a virtual person merging unit 5.
  • the information extracting unit 1 is configured to extract an account and a login time and login terminal information corresponding to the account from the behavior log;
  • the connectivity graph construction unit 2 is configured to calculate the similarity between the account accounts according to the co-occurrence between the account accounts, construct a connected graph that represents the account by the node, and represent the similarity between the accounts by the length of the edge between the nodes. The shorter the edge between nodes, the higher the similarity between the accounts represented by the nodes;
  • the external model introduction unit 3 is configured to calculate a similarity between the account accounts by introducing factors other than the case where the accounts are co-occurring;
  • a virtual person establishing unit 4 configured to cluster nodes in the connected graph, and establish a virtual person according to the clustering result
  • the virtual person merging unit 5 is configured to merge all virtual people and accounts corresponding to the virtual person into a virtual person database.
  • in the present invention, analyzing the behavior log actually yields "which accounts are operated by the same person". In real system requirements the actual user is often more meaningful than the account owner, and this also reduces the deviation in account ownership results caused by untrue key values such as ID card numbers.
  • using behavior logs for analysis increases the applicability of the whole system: only account identifiers are required, and specific account attributes are not necessarily needed. Owing to the characteristics of behavior logs and the reduced complexity described above, the present invention is better suited to environments with a wider scope, a longer time range, and a larger data volume. In fact, the wider the scope of data collection, the longer the time span, and the larger the data volume, the higher the actual accuracy of the system.
  • the present invention can further describe the attribute information such as the name and address of the virtual person by combining additional data such as the account attribute.
  • the virtual human establishment method and apparatus of the present invention establish a virtual person based on the behavior log, has low complexity, high accuracy, and is suitable for processing big data.

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

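A method and apparatus for establishing a virtual person based on behavior logs. The method includes: extracting, from a behavior log, accounts together with the login time and login terminal information corresponding to each account; computing the similarity between accounts from their co-occurrence, and constructing a connected graph in which nodes represent accounts and the length of the edge between two nodes represents the similarity between the corresponding accounts, a shorter edge indicating a higher similarity; and clustering the nodes of the connected graph and establishing virtual persons from the clustering result. The method and apparatus establish virtual persons from behavior logs with low complexity and high accuracy, and are suited to processing big data.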

Description

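Method and apparatus for establishing a virtual person
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a method and an apparatus for establishing a virtual person based on behavior logs.
Background
At present, network services such as instant messaging, e-mail, online games, P2P software downloads, online forums, online recruitment, e-commerce transactions, and online booking of flights and hotels bring great convenience to the lives of network users. Each network service generally assigns an account to each user; the account is associated with the user's registration information and is used to record and identify the user. Examples include a network user's instant messaging number (such as a QQ account) or e-mail address, online game account, forum login account, and P2P software account.
Every network user holds accounts of many types, and the large number of network users produces an enormous volume of account data. For the departments concerned, managing network user information effectively has become a formidable task. To manage this information effectively, analyzing account ownership relationships, that is, determining which accounts belong to the same person (a virtual person), has become a problem that urgently needs solving.
When building virtual persons, the prior art mostly relies on attribute matching. An attribute matching scheme is roughly as follows:
A) Specify the rules for matching network account attributes: which attributes are matched in which situation, and how a successful match is determined. For example, when matching a QQ account against a Taobao account, the two accounts are considered matched if the edit distances of both the "name" field and the "contact" field are less than 3.
B) From the attribute matching results, derive the degree to which accounts belong to the same person (the similarity), and finally use the similarity to decide which accounts belong to the same person. In the example above, a successful match alone is taken to mean that the accounts belong to the same person.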
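The prior-art rule in A) and B) can be sketched as follows. This is a minimal illustration, assuming the standard Levenshtein edit distance and hypothetical field names (`name`, `contact`) and sample records; the source does not specify which edit-distance variant or data layout is used.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def accounts_match(a: dict, b: dict, fields=("name", "contact"), limit=3) -> bool:
    """Rule A: every compared field must have an edit distance below `limit`."""
    return all(edit_distance(a[f], b[f]) < limit for f in fields)


# Hypothetical sample accounts, for illustration only.
qq = {"name": "Zhang San", "contact": "13800000000"}
taobao = {"name": "Zhang Sam", "contact": "13800000001"}
other = {"name": "Li Si", "contact": "13900001111"}

same_person = accounts_match(qq, taobao)   # rule B: a match means same person
different = accounts_match(qq, other)
```

The sketch also makes the weaknesses listed next visible: it fails outright when a field is missing, and it trusts whatever values were typed at registration.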
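In practice, however, the following situations arise:
1. Attributes are often missing from account data, for example when only some attribute values were filled in at registration.
2. Account data of different types share few attributes, and not all of the shared attributes can be used for attribute matching.
3. Account data of different types represent attributes with the same meaning differently, so the attributes must be aligned, which adds further difficulty. For example, in type A accounts the name is stored in a single "name" field, whereas in type B accounts it is stored in two fields, "family name" and "given name".
4. In real account data the attribute values are not very trustworthy. For example, without real-name verification, ID card numbers may be false.
5. Comparison must be performed at the attribute level, which is computationally complex.
These situations make attribute matching complicated, computationally expensive, and unsatisfactory in practice; accuracy is low, particularly when processing large amounts of data.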
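Summary of the Invention
Accordingly, an object of the present invention is to provide a method for establishing a virtual person based on behavior logs, which solves the problem that virtual persons are complicated to construct and are constructed with low accuracy because of the variety of account types.
Another object of the present invention is to provide an apparatus for establishing a virtual person based on behavior logs, which solves the same problem of complicated construction and low accuracy caused by the variety of account types.
To achieve the above objects, the present invention provides a method for establishing a virtual person, comprising the following steps:
extracting, from a behavior log, accounts and the login time and login terminal information corresponding to each account;
computing the similarity between accounts according to the co-occurrence of the accounts, and constructing a connected graph in which nodes represent accounts and the length of the edge between nodes represents the similarity between accounts, the shorter the edge between nodes, the higher the similarity between the accounts the nodes represent;
clustering the nodes in the connected graph, and establishing a virtual person according to the clustering result.
Factors other than the co-occurrence of accounts may also be introduced when computing the similarity between the accounts.
The process of clustering the nodes in the connected graph comprises the following steps:
computing the local density Rho of each node, where Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc;
computing the dispersion Delta of each node, where Delta is defined as the length of the shortest of the node's edges that connect to neighbor nodes with a higher Rho value; if no such neighbor node exists, Delta is the length of the node's longest incident edge;
identifying the nodes whose Rho and Delta values are higher than the preset thresholds R_T and D_T, respectively, as class center nodes;
assigning each non-center node to the class of the center node that is at the shortest distance from it and has a Rho value higher than its own;
the nodes of one class together constitute one virtual person, that is, they belong to the same virtual person.
Alternatively, the nodes in the connected graph are clustered using the K-Means method or a hierarchical clustering method.
The method further comprises merging all virtual persons and the accounts corresponding to them into a virtual person database.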
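The present invention also provides an apparatus for establishing a virtual person, comprising:
an information extracting unit, configured to extract, from a behavior log, accounts and the login time and login terminal information corresponding to each account;
a connected graph construction unit, configured to compute the similarity between accounts according to the co-occurrence of the accounts, and to construct a connected graph in which nodes represent accounts and the length of the edge between nodes represents the similarity between accounts, the shorter the edge between nodes, the higher the similarity between the accounts the nodes represent;
a virtual person establishing unit, configured to cluster the nodes in the connected graph and establish a virtual person according to the clustering result.
The apparatus further comprises an external model introducing unit, configured to introduce factors other than the co-occurrence of accounts when computing the similarity between the accounts.
The process of clustering the nodes in the connected graph comprises the following steps:
computing the local density Rho of each node, where Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc;
computing the dispersion Delta of each node, where Delta is defined as the length of the shortest of the node's edges that connect to neighbor nodes with a higher Rho value; if no such neighbor node exists, Delta is the length of the node's longest incident edge;
identifying the nodes whose Rho and Delta values are higher than the preset thresholds R_T and D_T, respectively, as class center nodes;
assigning each non-center node to the class of the center node that is at the shortest distance from it and has a Rho value higher than its own;
the nodes of one class together constitute one virtual person.
Alternatively, the nodes in the connected graph are clustered using the K-Means method or a hierarchical clustering method.
The apparatus further comprises a virtual person merging unit, configured to merge all virtual persons and the accounts corresponding to them into a virtual person database.
In summary, the method and apparatus of the present invention establish virtual persons based on behavior logs, with low complexity and high accuracy, and are suited to processing big data.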
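Brief Description of the Drawings
In the drawings:
FIG. 1 is a flowchart of a preferred embodiment of the virtual person establishing method of the present invention;
FIG. 2 is a logic diagram of a preferred embodiment of the virtual person establishing method of the present invention;
FIG. 3 is a schematic diagram of the Rho value-Delta value distribution in a preferred embodiment of the virtual person establishing method of the present invention;
FIG. 4 is a schematic structural diagram of a preferred embodiment of the virtual person establishing apparatus of the present invention.
Detailed Description
The technical solution of the present invention and its beneficial effects will become apparent from the following detailed description of specific embodiments, taken in conjunction with the drawings.
FIG. 1 is a flowchart of a preferred embodiment of the virtual person establishing method of the present invention. The main steps of the invention include:
extracting, from a behavior log, accounts and the login time and login terminal information corresponding to each account;
computing the similarity between accounts according to the co-occurrence of the accounts, and constructing a connected graph in which nodes represent accounts and the length of the edge between nodes represents the similarity between accounts, the shorter the edge between nodes, the higher the similarity between the accounts the nodes represent;
clustering the nodes in the connected graph, and establishing a virtual person according to the clustering result.
The invention may also include the step of merging all virtual persons and the accounts corresponding to them into a virtual person database.
To address practical problems such as the complicated construction and low accuracy of virtual persons caused by the variety of account types, the present invention proposes an analysis method based on behavior logs. A behavior log records how network users use network services and can be collected from the server side, user terminals, and the like. The method is based on the following observations of reality:
1. Accounts that are active on the same terminal within a period of time may belong to the same person. When multiple accounts have all been active on the same terminal within a given period, we call this a co-occurrence of these accounts.
2. The more alike the co-occurrence behavior of multiple accounts is, for example the more often they co-occur, the more likely these accounts belong to the same person (this likelihood is called the similarity).
3. Among the multiple accounts held by a single user, some are always used more frequently than others.
4. Between accounts of different users, even if they occasionally co-occur, the co-occurrence is never closer than that among a single user's own accounts.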
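FIG. 2 is a logic diagram of a preferred embodiment of the virtual person establishing method of the present invention.
The key steps in this preferred embodiment include:
Step 1. Abstract each record in the behavior log as [time, terminal, account] to obtain data containing a timestamp, an account ID, and a terminal ID, so that it is known which account was active on which terminal and when. For each account, count how many times within a period it was active on the same terminal as each other account; this yields the number of co-occurrences between accounts.
"Number of times" is one way of measuring the co-occurrence "situation", and this embodiment speaks of the number of times only to simplify the explanation. In practice, information such as the time of day can be added as a weight. For example, co-occurrences outside working hours can be weighted slightly more heavily than those during working hours, because computer terminals are more likely to be shared during working hours.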
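Step 1 can be sketched as below. The source does not define exactly how "a period of time" is delimited, so this sketch assumes fixed-length time windows (one day by default) and counts, for each pair of accounts, the number of (terminal, window) buckets in which both were active. The function name, record layout, and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(records, window_seconds=86400):
    """Count co-occurrences per unordered account pair.

    records: iterable of (timestamp_seconds, terminal_id, account_id).
    Two accounts co-occur once for every (terminal, time window) bucket
    in which both were active (an assumed reading of "a period of time").
    """
    buckets = defaultdict(set)  # (terminal, window index) -> accounts seen
    for timestamp, terminal, account in records:
        buckets[(terminal, int(timestamp // window_seconds))].add(account)

    counts = defaultdict(int)
    for accounts in buckets.values():
        for a, b in combinations(sorted(accounts), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical log: qq:1 and mail:x share pc1 on two different days,
# qq:1 and game:z share pc2 once.
records = [
    (0, "pc1", "qq:1"), (100, "pc1", "mail:x"),
    (90000, "pc1", "qq:1"), (90500, "pc1", "mail:x"),
    (200, "pc2", "qq:1"), (300, "pc2", "game:z"),
]
counts = cooccurrence_counts(records)
```

A time-of-day weight, as suggested above, could replace the `+= 1` with a per-window weight; that variant is left out to keep the sketch short.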
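Step 2. Based on the observed co-occurrence of accounts, compute the similarity between accounts. Abstracted as a connected graph, the nodes represent accounts and the length of an edge represents the similarity between the accounts it joins. Usually, the higher the similarity, the shorter the edge.
Step 3. If other models are available, such as attribute matching, their matching results can also be used as a factor affecting edge length.
Step 4. Once the graph has been obtained, the following computation determines which accounts belong to the same person:
Step 4.1 For each node, compute its local density Rho. Rho is defined as the number of the node's edges whose length is below a predefined value Dc.
Step 4.2 For each node, compute its dispersion Delta. Delta is defined as the length of the shortest of the node's edges that connect to neighbor nodes with a higher Rho value; if no such neighbor node exists, Delta is the length of the node's longest incident edge.
Step 4.3 Identify the nodes whose Rho and Delta values are higher than the specific thresholds R_T and D_T, respectively, as class center nodes. Each such node represents a class, that is, a virtual person.
Step 4.4 Assign each remaining non-center node to the class of the center node that is closest to it and has a Rho value higher than its own.
Step 4.5 Nodes of the same class belong to the same virtual person. A corresponding virtual person is established for each class.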
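Steps 4.3 to 4.5 can be sketched as follows, taking the Rho and Delta values as already computed. Two points are assumptions rather than statements of the source: "distance" to a center is read as shortest-path length in the graph (computed here with Dijkstra's algorithm), and a node with no eligible center is labeled as its own class. The graph, Rho, and Delta values are hypothetical toy data.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra over an adjacency dict {node: {neighbor: edge_length}}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def assign_classes(graph, rho, delta, r_t, d_t):
    """Return {node: center} following steps 4.3 to 4.5."""
    # Step 4.3: centers have Rho above R_T and Delta above D_T.
    centers = [n for n in graph if rho[n] > r_t and delta[n] > d_t]
    labels = {c: c for c in centers}
    # Step 4.4: nearest center with a higher Rho than the node itself.
    for node in graph:
        if node in labels:
            continue
        dist = shortest_paths(graph, node)
        eligible = [c for c in centers if c in dist and rho[c] > rho[node]]
        labels[node] = min(eligible, key=dist.__getitem__) if eligible else node
    return labels  # Step 4.5: equal labels mean the same virtual person


graph = {
    "a": {"b": 0.2, "c": 0.2, "d": 1.0},
    "b": {"a": 0.2}, "c": {"a": 0.2},
    "d": {"a": 1.0, "e": 0.25, "f": 0.25},
    "e": {"d": 0.25}, "f": {"d": 0.25},
}
rho = {"a": 2, "b": 1, "c": 1, "d": 2, "e": 1, "f": 1}
delta = {"a": 1.0, "b": 0.2, "c": 0.2, "d": 1.0, "e": 0.25, "f": 0.25}
labels = assign_classes(graph, rho, delta, r_t=1, d_t=0.5)
```

In this toy graph the two hubs `a` and `d` become centers, and each remaining account joins the hub it shares a short edge with.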
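In place of the clustering method shown in key step 4, other common clustering methods such as K-Means or hierarchical clustering may also be used; they achieve similar results and differ only in complexity or effectiveness. Analyzing the behavior log with the clustering algorithm of this preferred embodiment reduces the analysis complexity of the whole system compared with clustering methods such as K-Means and hierarchical clustering. In addition, the Delta and Rho values, two distribution features derived from the data itself, provide an objective reference for choosing the number of clusters.
In key step 4.3, the class center identification method shown requires the node's Rho value and Delta value to be simultaneously higher than their corresponding thresholds. Other methods based on Rho or Delta values can be adopted in practice, for example requiring the Delta value to lie between 4 and 5 when the Rho value is higher than 3, and between 5 and 6 when the Rho value is higher than 5.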
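The meaning of each value in the virtual person establishing method of the present invention is explained below with a simple example.
Edge length represents: a measure of the likelihood (similarity) that two nodes belong to the same person.
Rho represents: the importance of the current node to its adjacent nodes.
Delta represents: how distinguishable the current node would be from other class centers if it were taken as a class center.
For example:
The edge length may be defined as the reciprocal 1/c(a,b) of the number of co-occurrences c(a,b) of two accounts in the behavior log, that is, the reciprocal of the number of times the two accounts were both active on the same terminal within a given period.
Rho may be defined as the number of the current node's incident edges whose length is less than the parameter value Dc.
Delta is defined as the length of the shortest of the node's edges that connect to neighbor nodes with a higher Rho value; if no such neighbor node exists, Delta is the length of the node's longest incident edge.
Under these example definitions, the corresponding formulas are:
Let c(a,b) be the number of co-occurrences of accounts a and b counted from the behavior log. Then:
1. The edge length between a and b:
d(a,b)=1/c(a,b)  [Equation 1].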
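[Equation 1] translates directly into graph construction. The sketch below assumes the co-occurrence counts are held in a dictionary keyed by account pairs and stores the graph as a symmetric adjacency dictionary; both layouts and the sample counts are illustrative choices, not part of the source.

```python
def build_graph(counts):
    """Build {node: {neighbor: edge_length}} with d(a,b) = 1 / c(a,b)."""
    graph = {}
    for (a, b), c in counts.items():
        if c <= 0:
            continue  # accounts that never co-occur get no edge
        length = 1.0 / c
        graph.setdefault(a, {})[b] = length
        graph.setdefault(b, {})[a] = length  # the graph is undirected
    return graph


# Hypothetical counts: 4 co-occurrences give a short edge, 1 gives a long one.
counts = {("qq:1", "mail:x"): 4, ("qq:1", "game:z"): 1}
graph = build_graph(counts)
```

More frequent co-occurrence therefore yields a shorter edge, matching the convention that shorter edges mean higher similarity.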
2. For all N neighbor nodes bn of a, n=1…N (N a natural number), the Rho value of a:
Rho(a) = \sum_{n=1}^{N} X(d(a,bn) - Dc)
where X(x) is defined as: X(x)=1 if x<0, otherwise X(x)=0.
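The Rho definition can be sketched as a sum of the indicator X over a node's incident edges. The adjacency-dictionary layout and the toy graph are assumptions made for illustration.

```python
def indicator(x):
    """X(x) = 1 if x < 0, otherwise 0."""
    return 1 if x < 0 else 0


def rho(graph, node, dc):
    """Number of edges at `node` whose length is strictly below dc."""
    return sum(indicator(length - dc) for length in graph.get(node, {}).values())


# Hypothetical graph: node -> {neighbor: edge length}.
graph = {
    "a": {"b": 0.2, "c": 0.2, "d": 1.0},
    "b": {"a": 0.2},
    "c": {"a": 0.2},
    "d": {"a": 1.0},
}
rho_values = {node: rho(graph, node, dc=0.5) for node in graph}
```

Because X uses a strict inequality, an edge whose length equals Dc exactly is not counted.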
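3. The Delta value of a:
Let the neighbor nodes of node a be b1…bN in order. Delta(a) can then be defined as:
1) If there is a neighbor bx with Rho(bx)>Rho(a), then:
Delta(a)=min{d(a,bn) | n=1..N and Rho(bn)>Rho(a)}.
2) Otherwise:
Delta(a)=max{d(a,bn), n=1..N}.
In particular, a node with no incident edges can be labeled directly with itself as its class, that is, it forms a virtual person on its own.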
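The two cases of Delta(a) can be sketched as below. Returning `None` for a node without edges is an assumption of this sketch, chosen so the caller can label such a node as its own class; the toy graph and Rho values are hypothetical.

```python
def delta(graph, rho_values, node):
    """Delta per the two cases above; None for a node with no incident edges."""
    edges = graph.get(node, {})
    if not edges:
        return None  # isolated node: forms a virtual person on its own
    higher = [length for nb, length in edges.items()
              if rho_values[nb] > rho_values[node]]
    # Case 1: shortest edge to a higher-Rho neighbor; case 2: longest edge.
    return min(higher) if higher else max(edges.values())


graph = {
    "a": {"b": 0.2, "c": 0.2, "d": 1.0},
    "b": {"a": 0.2}, "c": {"a": 0.2},
    "d": {"a": 1.0, "e": 0.25},
    "e": {"d": 0.25},
    "g": {},
}
rho_values = {"a": 2, "b": 1, "c": 1, "d": 1, "e": 1, "g": 0}
delta_values = {node: delta(graph, rho_values, node) for node in graph}
```

Node `a` has no higher-Rho neighbor, so it takes its longest edge (1.0); `e` has no higher-Rho neighbor either and takes its only edge.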
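Computing the Delta value depends on the Rho value, and the Rho value can also be defined in other ways, such as common centrality measures.
In practice the value of Dc depends on the specific data; usually Dc is determined after the connected graph has been obtained. In other words, as in other common clustering methods, it is an input parameter. Unlike the choice of K in K-Means, however, which directly fixes the number of classes, the effect of Dc passes through the Rho and Delta values and the choice of R_T and D_T, which weakens the influence of subjective factors, because choosing these parameters brings in an objective consideration of the characteristics of the data itself.
One method for selecting R_T and D_T is as follows. FIG. 3 is a schematic diagram of the Rho value-Delta value distribution in a preferred embodiment of the method, in which each point represents a node. First plot the Rho value-Delta value distribution of all points, then observe the distribution of the Delta values (Rho values) and find the value at which the distribution changes abruptly; take that value as D_T (R_T). In FIG. 3, the distribution of the Delta values is discontinuous or changes abruptly at d' (r'), so D_T (R_T) takes the value d' (r'). If there are many data points, they can be sampled and the distribution plot of the sample used as a reference for choosing the values.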
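The source selects R_T and D_T by visually inspecting the distribution plot. As one possible automation, and purely as an assumption of this sketch, the "abrupt change" can be approximated by the largest gap between consecutive distinct values, with the threshold placed at the midpoint of that gap.

```python
def gap_threshold(values):
    """Midpoint of the widest gap between consecutive distinct sorted values.

    A heuristic stand-in for eyeballing the discontinuity in the
    Rho value-Delta value plot; not prescribed by the source.
    """
    ordered = sorted(set(values))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct values")
    gaps = [(hi - lo, lo, hi) for lo, hi in zip(ordered, ordered[1:])]
    _, lo, hi = max(gaps)  # widest gap; ties resolved by the larger values
    return (lo + hi) / 2


# Hypothetical Delta and Rho values for six nodes.
d_t = gap_threshold([0.2, 0.2, 0.25, 0.25, 1.0, 1.0])
r_t = gap_threshold([1, 1, 2, 2, 1, 1])
```

On large data sets the same function could be applied to a random sample of nodes, in line with the sampling suggestion above.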
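By introducing other models, such as attribute matching, the matching result of the corresponding model can also serve as a factor affecting edge length. In other words, factors other than the number of co-occurrences between accounts are introduced in computing the similarity between the accounts.
Taking attribute matching as an example, in mathematical notation the attribute matching result becomes a parameter in the edge length computation. That is, let match(a,b) be the account similarity of a and b obtained by attribute matching; the edge length can then be defined as:
d(a,b)=f(c(a,b),match(a,b)).
Taking [Equation 1] as an example, it may be defined specifically as:
Edge length after introducing the attribute matching model:
Figure PCTCN2015072487-appb-000002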
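The general form d(a,b)=f(c(a,b),match(a,b)) leaves f open. The sketch below uses one hypothetical choice of f, a weighted sum inside the reciprocal, solely to illustrate how an attribute-matching score could shorten an edge; it is not the formula given in the figure above, and the `weight` parameter is an invented knob.

```python
import math


def edge_length(c, match, weight=1.0):
    """Hypothetical f: d = 1 / (c + weight * match).

    c:      number of co-occurrences of the two accounts (>= 0)
    match:  attribute-matching similarity, assumed to lie in [0, 1]
    Reduces to [Equation 1] when match is 0; returns infinity (no edge)
    when there is neither co-occurrence nor attribute similarity.
    """
    score = c + weight * match
    return 1.0 / score if score > 0 else math.inf


behavior_only = edge_length(4, 0.0)   # same as 1 / c(a,b)
with_match = edge_length(4, 1.0)      # a perfect attribute match shortens the edge
no_evidence = edge_length(0, 0.0)
```

Any f that decreases as either the co-occurrence count or the match score grows would preserve the convention that shorter edges mean higher similarity.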
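FIG. 4 is a schematic structural diagram of a preferred embodiment of the virtual person establishing apparatus of the present invention. The apparatus of this preferred embodiment includes an information extracting unit 1, a connected graph construction unit 2, an external model introducing unit 3, a virtual person establishing unit 4, and a virtual person merging unit 5.
The information extracting unit 1 is configured to extract, from a behavior log, accounts and the login time and login terminal information corresponding to each account;
the connected graph construction unit 2 is configured to compute the similarity between accounts according to the co-occurrence of the accounts, and to construct a connected graph in which nodes represent accounts and the length of the edge between nodes represents the similarity between accounts, the shorter the edge between nodes, the higher the similarity between the accounts the nodes represent;
the external model introducing unit 3 is configured to introduce factors other than the co-occurrence of accounts when computing the similarity between the accounts;
the virtual person establishing unit 4 is configured to cluster the nodes in the connected graph and establish a virtual person according to the clustering result;
the virtual person merging unit 5 is configured to merge all virtual persons and the accounts corresponding to them into a virtual person database.
For the way the virtual person establishing unit 4 clusters the nodes in the connected graph, refer to the description of the virtual person establishing method given above.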
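The role of the virtual person merging unit 5 can be sketched as grouping accounts by their cluster label. The source does not describe the database schema, so the dictionary layout and the `vp-N` identifiers below are hypothetical.

```python
from collections import defaultdict


def build_virtual_person_db(labels):
    """Group accounts by class label into {virtual_person_id: [accounts]}.

    labels: {account: class center}, for example the output of a
    clustering step. IDs of the form "vp-N" are an invented convention.
    """
    groups = defaultdict(list)
    for account, center in labels.items():
        groups[center].append(account)
    return {
        f"vp-{i}": sorted(members)
        for i, (_, members) in enumerate(sorted(groups.items()), start=1)
    }


# Hypothetical clustering result: two classes centered on qq:1 and qq:2.
labels = {"qq:1": "qq:1", "mail:x": "qq:1", "qq:2": "qq:2", "game:z": "qq:2"}
database = build_virtual_person_db(labels)
```

Account attributes such as name or address could later be attached to each entry, as the closing paragraphs of the description suggest.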
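In the virtual person establishing method and apparatus of the present invention, analyzing the behavior log actually yields "which accounts are operated by the same person". In real system requirements the actual user is often more meaningful than the account owner, and this also reduces the deviation in account ownership results caused by untrue key values such as ID card numbers. Using behavior logs for analysis increases the applicability of the whole system: only account identifiers are required, and specific account attributes are not necessarily needed. Owing to the characteristics of behavior logs and the reduced complexity described above, the present invention is better suited to environments with a wider scope, a longer time range, and a larger data volume. In fact, the wider the scope of data collection, the longer the time span, and the larger the data volume, the higher the actual accuracy of the system. Starting from the account ownership relationships obtained by clustering after the behavior log analysis described above, the present invention can combine additional data such as account attributes to further describe attribute information of the virtual person, such as name and address.
In summary, the method and apparatus of the present invention establish virtual persons based on behavior logs, with low complexity and high accuracy, and are suited to processing big data.
Based on the above, those of ordinary skill in the art can make various other corresponding changes and modifications according to the technical solution and technical concept of the present invention, and all such changes and modifications shall fall within the protection scope of the claims appended to the present invention.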

Claims (10)

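  1. A method for establishing a virtual person, characterized by comprising the following steps:
    extracting, from a behavior log, accounts and the login time and login terminal information corresponding to each account;
    computing the similarity between accounts according to the co-occurrence of the accounts, and constructing a connected graph in which nodes represent accounts and the length of the edge between nodes represents the similarity between accounts, the shorter the edge between nodes, the higher the similarity between the accounts the nodes represent;
    clustering the nodes in the connected graph, and establishing a virtual person according to the clustering result.
  2. The method for establishing a virtual person according to claim 1, characterized in that factors other than the co-occurrence of accounts may also be introduced to compute the similarity between the accounts.
  3. The method for establishing a virtual person according to claim 1, characterized in that the process of clustering the nodes in the connected graph comprises the following steps:
    computing the local density Rho of each node, where Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc;
    computing the dispersion Delta of each node, where Delta is defined as the length of the shortest of the node's edges that connect to neighbor nodes with a higher Rho value; if no such neighbor node exists, Delta is the length of the node's longest incident edge;
    identifying the nodes whose Rho and Delta values are higher than the preset thresholds R_T and D_T, respectively, as class center nodes;
    assigning each non-center node to the class of the center node that is at the shortest distance from it and has a Rho value higher than its own;
    the nodes of one class together constituting one virtual person.
  4. The method for establishing a virtual person according to claim 1, characterized in that the nodes in the connected graph are clustered using the K-Means method or a hierarchical clustering method.
  5. The method for establishing a virtual person according to claim 1, characterized by further comprising merging all virtual persons and the accounts corresponding to them into a virtual person database.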
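  6. An apparatus for establishing a virtual person, characterized by comprising:
    an information extracting unit, configured to extract, from a behavior log, accounts and the login time and login terminal information corresponding to each account;
    a connected graph construction unit, configured to compute the similarity between accounts according to the co-occurrence of the accounts, and to construct a connected graph in which nodes represent accounts and the length of the edge between nodes represents the similarity between accounts, the shorter the edge between nodes, the higher the similarity between the accounts the nodes represent;
    a virtual person establishing unit, configured to cluster the nodes in the connected graph and establish a virtual person according to the clustering result.
  7. The apparatus for establishing a virtual person according to claim 6, characterized by further comprising an external model introducing unit, configured to introduce factors other than the co-occurrence of accounts when computing the similarity between the accounts.
  8. The apparatus for establishing a virtual person according to claim 6, characterized in that the process of clustering the nodes in the connected graph comprises the following steps:
    computing the local density Rho of each node, where Rho is defined as the number of edges incident to the node whose length is below a predefined value Dc;
    computing the dispersion Delta of each node, where Delta is defined as the length of the shortest of the node's edges that connect to neighbor nodes with a higher Rho value; if no such neighbor node exists, Delta is the length of the node's longest incident edge;
    identifying the nodes whose Rho and Delta values are higher than the preset thresholds R_T and D_T, respectively, as class center nodes;
    assigning each non-center node to the class of the center node that is at the shortest distance from it and has a Rho value higher than its own;
    the nodes of one class together constituting one virtual person.
  9. The apparatus for establishing a virtual person according to claim 6, characterized in that the nodes in the connected graph are clustered using the K-Means method or a hierarchical clustering method.
  10. The apparatus for establishing a virtual person according to claim 6, characterized by further comprising a virtual person merging unit, configured to merge all virtual persons and the accounts corresponding to them into a virtual person database.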
PCT/CN2015/072487 2014-12-08 2015-02-09 Method and apparatus for establishing a virtual person WO2016090748A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201410741334.4 2014-12-08
CN201410741334 2014-12-08
CN201410814330.4 2014-12-23
CN201410814330.4A CN104504264B (zh) 2014-12-08 2014-12-23 Method and apparatus for establishing a virtual person

Publications (1)

Publication Number Publication Date
WO2016090748A1 (zh) 2016-06-16

Family

ID=52945661

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/072487 WO2016090748A1 (zh) 2014-12-08 2015-02-09 Method and apparatus for establishing a virtual person

Country Status (2)

Country Link
CN (1) CN104504264B (zh)
WO (1) WO2016090748A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
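CN103368917A (zh) * 2012-04-01 2013-10-23 阿里巴巴集团控股有限公司 Risk control method and system for network virtual users
CN103544289A (zh) * 2013-10-28 2014-01-29 公安部第三研究所 Method for feature extraction based on mining of surveillance and control data
CN103927307A (zh) * 2013-01-11 2014-07-16 阿里巴巴集团控股有限公司 Method and apparatus for identifying website users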

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090293121A1 (en) * 2008-05-21 2009-11-26 Bigus Joseph P Deviation detection of usage patterns of computer resources
CN103970752B (zh) * 2013-01-25 2017-12-05 秒针信息技术有限公司 Method and system for estimating the number of unique visitors

Also Published As

Publication number Publication date
CN104504264B (zh) 2017-09-01
CN104504264A (zh) 2015-04-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15867287

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15867287

Country of ref document: EP

Kind code of ref document: A1