WO2016119275A1 - Network account identification and matching method - Google Patents

Network account identification and matching method

Info

Publication number
WO2016119275A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
network account
record
same
matching method
Prior art date
Application number
PCT/CN2015/072489
Other languages
English (en)
French (fr)
Inventor
王明兴
吴颖徽
马帅
汤南
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2016119275A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Definitions

  • The present invention relates to the field of data processing technologies, and in particular to a network account identification and matching method.
  • The difficulty of network account identification is that the volume of account data is very large, structures differ greatly between account types, and accounts are continually updated and growing. This matches the 3V characteristics of big data: Volume (amount of data), Variety (types of data), and Velocity (processing speed). Identifying the network accounts that belong to the same person among massive, heterogeneous, dynamic accounts is the main technical difficulty.
  • The object of the present invention is to provide a network account identification and matching method that can be used for large-scale network account identification and matching.
  • To achieve this object, the present invention provides a network account identification and matching method, including:
  • Step 10: Organize the network accounts according to the attributes required by the predefined matching rules, and use a unique record id as the identifier of each network account;
  • Step 20: For each matching rule, if a network account has all the attributes required by that rule, concatenate the contents of those attributes into an attribute string and form a correspondence between the attribute string and the record id of the network account;
  • Step 30: Merge the record ids that correspond to the same attribute string; the merged record ids represent the same entity (a real person) and serve as the identifier of that entity;
  • Step 40: For the record ids in each entity's identifier, broadcast the entity they belong to, forming a correspondence between each record id and the identifiers of the entities it belongs to; merge the entity identifiers that correspond to the same record id, and apply transitive closure processing to the merged identifiers to obtain new entity identifiers;
  • Step 50: Repeat step 40 until the entities no longer change.
  • Step 10 includes:
  • Step 101: Determine the required attributes from the matching rules;
  • Step 102: Generate a unique record id for each network account record;
  • Step 103: Extract the values of the required attributes from the network account and add the record id to produce a new row of data; if the network account lacks an attribute, or the attribute exists but its content is empty or invalid, the content of that attribute is left empty.
  • In step 20, the contents are concatenated with a specific symbol to form the attribute string.
  • Step 40 includes:
  • Step 401: For the record ids in each entity's identifier, broadcast the entity they belong to, generating key-value pairs that contain a record id and the identifier of an entity it belongs to. Recording the correspondence as key-value pairs simplifies the subsequent merge and also makes it easier to port the method to the Hadoop platform;
  • Step 402: Collect the entities each record id belongs to. If a record id belongs to only one entity, mark that entity's state as "retained"; otherwise merge the record ids in the identifiers of all those entities, remove duplicates, generate a new entity identifier and mark the new entity's state as "new", and mark the state of each old entity as "deleted";
  • Step 403: Combine the state information of each entity. If the states include "new", the entity is retained; otherwise, if the states include "deleted", the entity is deleted; otherwise the entity is retained;
  • Step 404: Output all entities that are to be retained.
  • In step 50, one condition for judging that the entities have not changed is that the number of entities remains the same.
  • In step 50, another condition for judging that the entities have not changed is that no entity in the "deleted" state appears.
  • The required attributes are the ID card number, mobile phone number, email address, or QQ number.
  • The matching rules include: same ID card number, same mobile phone number, same email address, or same QQ number.
  • In step 20, a key-value pair containing the attribute string and the record id of the network account is generated.
  • In summary, the network account identification and matching method of the present invention can identify, among massive heterogeneous accounts, which accounts most likely belong to the same entity, and can be used for large-scale network account identification and matching.
  • FIG. 1 is a flow chart of a preferred embodiment of a network account identification matching method according to the present invention.
  • Referring to FIG. 1, which is a flowchart of a preferred embodiment of the network account identification and matching method of the present invention, the preferred embodiment mainly includes:
  • Step 10: Organize the network accounts according to the attributes required by the predefined matching rules, and use a unique record id as the identifier of each network account;
  • Step 20: For each matching rule, if a network account has all the attributes required by that rule, concatenate the contents of those attributes into an attribute string and form a correspondence between the attribute string and the record id of the network account; for example, a key-value pair containing the attribute string and the record id of the network account can be generated;
  • Step 30: Merge the record ids that correspond to the same attribute string; the merged record ids represent the same entity and serve as the identifier of that entity;
  • Step 40: For the record ids in each entity's identifier, broadcast the entity they belong to, forming a correspondence between each record id and the identifiers of the entities it belongs to; merge the entity identifiers that correspond to the same record id, and apply transitive closure processing to the merged identifiers to obtain new entity identifiers; for example, key-value pairs of a record id and the identifier of an entity it belongs to can be formed, and the pairs with the same record id merged;
  • Step 50: Repeat step 40 until the entities no longer change.
  • Every network account system holds some common information about entities. This information is sensitive and very important, and it is the key to network account identification.
  • The first step in identifying accounts is to surface this common information. Analysis shows that network account systems usually require the registrant to provide a valid email address and mobile phone number for verification, so accounts with the same email address or mobile phone number usually belong to the same registrant.
  • In addition, some accounts require the registrant's ID card number, name, and other information for real-name authentication.
  • The ID card number is an important piece of identifying information. In the Internet age, online communication is very common, and QQ is a representative example, so the QQ number is also an important means of contact between people. Combining this information, the following matching rules can be predefined to identify the same entity:
  • 1. the ID card number is the same;
  • 2. the email address is the same;
  • 3. the mobile phone number is the same;
  • 4. the QQ number is the same.
  • For other specific business data, further effective rules can be extracted to identify the same entity. For example, an entity registers network account A with mailbox x1 and phone number p1, and registers network account B with mailbox x2 and no phone number, but completes real-name verification on both accounts with a real, valid ID card number. The same entity registers network account C with mailbox x2 and phone number p2. From the shared ID card number, accounts A and B are known to be the same entity; from the shared mailbox, accounts B and C are known to be the same entity; taken together, accounts A, B, and C are the same entity.
  • Through predefined matching rules, the present invention specifies how network account attributes are matched: which attributes are used for matching in which case, and how a successful match is determined.
  • Because account structures differ greatly between account types and cannot be compared or matched directly, the first step is to organize the data. Step 10 may specifically include:
  • Step 101: Determine the required attributes from the matching rules, such as the ID card number, mobile phone number, email address, and QQ number;
  • Step 102: Generate a unique record id for each network account record, for example by numbering the records of each account type sequentially and adding a type prefix, giving ids of the form x1, x2, ..., a1, a2, ...;
  • Step 103: Extract the values of the required attributes from the network account and add the record id to produce a new row of data; if the network account lacks an attribute, or the attribute exists but its content is empty or invalid, the content of that attribute is left empty. For example, if a mailbox system does not perform real-name verification and therefore holds no ID card number, the "ID card number" field is simply left empty on extraction.
  • In step 20, the attributes corresponding to the matching rules are extracted. For each rule, if the contents of all attributes the rule defines are non-empty, the contents are concatenated with a specific symbol to form an attribute string, and a key-value pair is generated from the attribute string and the record id, for example 360622199001011111/x1.
  • In this preferred embodiment the attribute string is the key and the record id is the value.
  • Because the correspondences are generated as key-value pairs, massive data can be processed on distributed parallel computing platforms such as MapReduce, which makes large-scale network account identification and matching possible.
  • In step 30 the present invention merges on the rule attributes to make a preliminary identification of the same entity. Specifically, all identical attribute strings are merged, and the record ids grouped under one attribute string represent the same entity (registrant), for example 360622199001011111/x1,a1.
  • The results of the above steps are computed independently for each rule, so an entity may appear more than once and an account may belong to several entities. The remedy is called transitive closure.
  • In step 40 the present invention applies transitive closure processing to the data, which resolves the duplication and transitivity problems.
  • Step 40 may specifically include the following:
  • Step 401: For the record ids in each entity's identifier, broadcast the entity they belong to, generating key-value pairs that contain a record id and the identifier of an entity it belongs to.
  • For each entity, one key-value pair containing a record id and the entity's identifier is generated for every record id in the identifier. For example, the pairs for record id x1 are x1/x1,a1; x1/x1,y1; x1/x1; x1/x1.
  • Step 402: Collect the entities each record id belongs to. If a record id belongs to only one entity, mark that entity's state as "retained"; otherwise merge the record ids in the identifiers of all those entities, remove duplicates, generate a new entity identifier and mark the new entity's state as "new", and mark the state of each old entity as "deleted".
  • For example, record id x1 corresponds to 4 entities, "x1,a1", "x1,y1", "x1", and "x1"; merging them and removing duplicates gives the new entity "x1,a1,y1" with state "new",
  • while the 4 entities "x1,a1", "x1,y1", "x1", and "x1" get the state "deleted".
  • As another example, record id y1 corresponds to only one entity, "x1,y1", so its state is output as "retained".
  • Step 403: Combine the state information of each entity. If the states include "new", the entity is retained; otherwise, if the states include "deleted", the entity is deleted; otherwise the entity is retained.
  • For example, "x1,y1" has two states, "deleted" (computed from x1) and "retained" (computed from y1), so the final result is that entity "x1,y1" is deleted.
  • Step 404: Output all entities that are to be retained.
  • These steps resolve all duplication problems and part of the transitivity problems. Step 50 is still needed because links between entities may be transitive over several hops, so transitive closure processing has to be applied more than once. For example, suppose the entities "x1,a1", "a1,b1", and "b1,c1" are identified initially. One closure pass gives the entities "x1,a1,b1" and "a1,b1,c1"; only a second pass gives the correct final result "x1,a1,b1,c1".
  • The closure process stops when the entities no longer change (for example, when the number of entities in the result stays the same, or when no "deleted" state appears).
  • In summary, the present invention can identify, from a large amount of data, the accounts that belong to the same entity, and can be used for large-scale network account identification and matching. Its beneficial effects fall mainly into three areas: data, economic, and social benefits.
  • Data benefit: the value of data is 1+1>>2. Linking data that was isolated but highly related yields a value far greater than the sum of the individual values.
  • By linking an entity's accounts, originally scattered data can be aggregated and the entity's attributes and activity information obtained in full. This lays the foundation for later analysis of the entity and for applications built on the analysis results.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

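A network account identification and matching method. The method includes: step 10, organizing network accounts according to the attributes required by predefined matching rules (10); step 20, for each matching rule, if a network account has all the attributes required by that rule, concatenating the contents of those attributes into an attribute string and forming a correspondence between the attribute string and the record id of the network account (20); step 30, merging the record ids that correspond to the same attribute string (30); step 40, for the record ids in each entity's identifier, broadcasting the entity (real person) they belong to, forming a correspondence between each record id and the identifiers of the entities it belongs to, merging the entity identifiers that correspond to the same record id, and applying transitive closure processing to the merged identifiers to obtain new entity identifiers (40); step 50, repeating step 40 until the entities no longer change (50). The method can be used for large-scale network account identification and matching.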

Description

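Network account identification and matching method
Technical field
The present invention relates to the field of data processing technologies, and in particular to a network account identification and matching method.
Background
With the development of Internet technology, the number of accounts that users register on websites and applications has grown rapidly. For mainstream applications such as QQ, Taobao, 163 Mail, Zhaopin (智联招聘), and Qunar (去哪儿网), nearly everyone has an account. The profile and activity information of these accounts holds a great deal of information about real persons (entities); it is, so to speak, a data oil field. However, for one and the same entity the data of different account types is separated, and so is the data of accounts of the same type (for example, several QQ numbers). This hinders data extraction and analysis. If it were possible to identify which accounts belong to the same entity, the data would gain considerably in value.
The difficulty of network account identification is that the volume of account data is very large, structures differ greatly between account types, and accounts are continually updated and growing. This matches the 3V characteristics of big data: Volume (amount of data), Variety (types of data), and Velocity (processing speed). Identifying the network accounts that belong to the same person among massive, heterogeneous, dynamic accounts is the main technical difficulty.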
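Summary of the invention
The object of the present invention is to provide a network account identification and matching method that can be used for large-scale network account identification and matching.
To achieve this object, the present invention provides a network account identification and matching method, including:
Step 10: organize the network accounts according to the attributes required by the predefined matching rules, and use a unique record id as the identifier of each network account;
Step 20: for each matching rule, if a network account has all the attributes required by that rule, concatenate the contents of those attributes into an attribute string and form a correspondence between the attribute string and the record id of the network account;
Step 30: merge the record ids that correspond to the same attribute string; the merged record ids represent the same entity and serve as the identifier of that entity;
Step 40: for the record ids in each entity's identifier, broadcast the entity they belong to, forming a correspondence between each record id and the identifiers of the entities it belongs to; merge the entity identifiers that correspond to the same record id, and apply transitive closure processing to the merged identifiers to obtain new entity identifiers;
Step 50: repeat step 40 until the entities no longer change.
Step 10 includes:
Step 101: determine the required attributes from the matching rules;
Step 102: generate a unique record id for each network account record;
Step 103: extract the values of the required attributes from the network account and add the record id to produce a new row of data; if the network account lacks an attribute, or the attribute exists but its content is empty or invalid, the content of that attribute is left empty.
In step 20, the contents are concatenated with a specific symbol to form the attribute string.
Step 40 includes:
Step 401: for the record ids in each entity's identifier, broadcast the entity they belong to, generating key-value pairs that contain a record id and the identifier of an entity it belongs to. Recording the correspondence as key-value pairs simplifies the subsequent merge and also makes it easier to port the method to the Hadoop platform;
Step 402: collect the entities each record id belongs to. If a record id belongs to only one entity, mark that entity's state as "retained"; otherwise merge the record ids in the identifiers of all those entities, remove duplicates, generate a new entity identifier and mark the new entity's state as "new", and mark the state of each old entity as "deleted";
Step 403: combine the state information of each entity. If the states include "new", the entity is retained; otherwise, if the states include "deleted", the entity is deleted; otherwise the entity is retained;
Step 404: output all entities that are to be retained.
In step 50, one condition for judging that the entities have not changed is that the number of entities remains the same.
In step 50, another condition for judging that the entities have not changed is that no entity in the "deleted" state appears.
The required attributes are the ID card number, mobile phone number, email address, or QQ number.
The matching rules include: same ID card number, same mobile phone number, same email address, or same QQ number.
In step 20, a key-value pair containing the attribute string and the record id of the network account is generated. Recording the correspondence as key-value pairs simplifies the subsequent merge and also makes it easier to port the method to the Hadoop platform.
In summary, the network account identification and matching method of the present invention can identify, among massive heterogeneous accounts, which accounts most likely belong to the same entity, and can be used for large-scale network account identification and matching.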
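Brief description of the drawings
FIG. 1 is a flowchart of a preferred embodiment of the network account identification and matching method of the present invention.
Detailed description
The technical solution of the present invention and its beneficial effects will become apparent from the following detailed description of specific embodiments, read together with the drawings.
Referring to FIG. 1, which is a flowchart of a preferred embodiment of the network account identification and matching method of the present invention, the preferred embodiment mainly includes:
Step 10: organize the network accounts according to the attributes required by the predefined matching rules, and use a unique record id as the identifier of each network account;
Step 20: for each matching rule, if a network account has all the attributes required by that rule, concatenate the contents of those attributes into an attribute string and form a correspondence between the attribute string and the record id of the network account; for example, a key-value pair containing the attribute string and the record id of the network account can be generated;
Step 30: merge the record ids that correspond to the same attribute string; the merged record ids represent the same entity and serve as the identifier of that entity;
Step 40: for the record ids in each entity's identifier, broadcast the entity they belong to, forming a correspondence between each record id and the identifiers of the entities it belongs to; merge the entity identifiers that correspond to the same record id, and apply transitive closure processing to the merged identifiers to obtain new entity identifiers; for example, key-value pairs of a record id and the identifier of an entity it belongs to can be formed, and the pairs with the same record id merged;
Step 50: repeat step 40 until the entities no longer change.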
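Every network account system holds some common information about entities. This information is sensitive and very important, and it is the key to network account identification; the first step in identifying accounts is to surface this common information. Analysis shows that network account systems usually require the registrant to provide a valid email address and mobile phone number for verification, so accounts with the same email address or mobile phone number usually belong to the same registrant. In addition, some accounts require the registrant's ID card number, name, and other information for real-name authentication, and the ID card number is an important piece of identifying information. In the Internet age, online communication is very common, and QQ is a representative example, so the QQ number is also an important means of contact between people. Combining this information, the following matching rules can be predefined to identify the same entity:
1. the ID card number is the same;
2. the email address is the same;
3. the mobile phone number is the same;
4. the QQ number is the same.
For other specific business data, further effective rules can be extracted to identify the same entity. For example, an entity registers network account A with mailbox x1 and phone number p1, and registers network account B with mailbox x2 and no phone number, but completes real-name verification on both accounts with a real, valid ID card number. The same entity registers network account C with mailbox x2 and phone number p2. From the shared ID card number, accounts A and B are known to be the same entity; from the shared mailbox, accounts B and C are known to be the same entity; taken together, accounts A, B, and C are the same entity.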
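The A/B/C example amounts to finding groups of accounts that are linked, directly or through intermediate accounts, by a shared attribute value. As a single-machine cross-check, the sketch below uses a union-find (disjoint-set) structure. This is a swapped-in technique for illustration, not the key-value procedure the patent describes, and the attribute names and the placeholder value `id0` are assumptions.

```python
def find(parent, x):
    """Return the representative of x, shortening the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def same_entity_groups(accounts):
    """Group account names that share any non-empty (attribute, value) pair."""
    parent = {name: name for name in accounts}
    first_seen = {}  # (attribute, value) -> first account that carried it
    for name, attrs in accounts.items():
        for key in attrs.items():
            if key[1] is None:          # empty attributes never match
                continue
            if key in first_seen:
                parent[find(parent, name)] = find(parent, first_seen[key])
            else:
                first_seen[key] = name
    groups = {}
    for name in accounts:
        groups.setdefault(find(parent, name), set()).add(name)
    return list(groups.values())

# Accounts A, B, C from the example; "id0" stands in for the shared ID card number.
accounts = {
    "A": {"email": "x1", "phone": "p1", "id_number": "id0"},
    "B": {"email": "x2", "phone": None, "id_number": "id0"},
    "C": {"email": "x2", "phone": "p2", "id_number": None},
}
groups = same_entity_groups(accounts)
```

Here A and B are joined through the ID card number and B and C through mailbox x2, so all three end up in one group even though A and C share no attribute directly.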
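Through predefined matching rules, the present invention specifies how network account attributes are matched: which attributes are used for matching in which case, and how a successful match is determined.
Because account structures differ greatly between account types and cannot be compared or matched directly, the first step is to organize the data. Step 10 may specifically include:
Step 101: determine the required attributes from the matching rules, such as the ID card number, mobile phone number, email address, and QQ number;
Step 102: generate a unique record id for each network account record, for example by numbering the records of each account type sequentially and adding a type prefix, giving ids of the form x1, x2, ..., a1, a2, ...;
Step 103: extract the values of the required attributes from the network account and add the record id to produce a new row of data; if the network account lacks an attribute, or the attribute exists but its content is empty or invalid, the content of that attribute is left empty. For example, if a mailbox system does not perform real-name verification and therefore holds no ID card number, the "ID card number" field is simply left empty on extraction.
This yields data in a unified format that can be used for matching, for example: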
id  ID card number      Mobile number  Email              QQ number
x1  360622199001011111  13812345678    vip@audaque.com    12345678
a1  360622199001011111  (empty)        (empty)            23456789
a2  (empty)             (empty)        (empty)            34567890
y1  (empty)             13812345678    (empty)            (empty)
y2  360622199001012222  (empty)        guest@audaque.com  34567890
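A minimal sketch of steps 101 to 103, assuming each source system delivers one dict per account. The English field names, the validation patterns, and the helper names `normalize` and `assign_ids` are hypothetical; the source only states that missing, empty, or invalid content ends up empty.

```python
import re

# Unified schema assumed from the example data; the field names are hypothetical.
FIELDS = ("id_number", "phone", "email", "qq")

# Hypothetical validity checks; the source does not define what counts as invalid.
VALIDATORS = {
    "id_number": re.compile(r"\d{17}[\dXx]"),
    "phone": re.compile(r"1\d{10}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "qq": re.compile(r"[1-9]\d{4,11}"),
}

def normalize(record_id, raw):
    """Step 103: build one unified row; missing, empty, or invalid values become None."""
    row = {"id": record_id}
    for field in FIELDS:
        value = (raw.get(field) or "").strip()
        row[field] = value if VALIDATORS[field].fullmatch(value) else None
    return row

def assign_ids(accounts_by_type):
    """Step 102: number records sequentially within each account type (x1, x2, ..., y1, ...)."""
    rows = []
    for prefix, accounts in accounts_by_type.items():
        for n, raw in enumerate(accounts, start=1):
            rows.append(normalize(f"{prefix}{n}", raw))
    return rows

rows = assign_ids({
    "x": [{"id_number": "360622199001011111", "phone": "13812345678",
           "email": "vip@audaque.com", "qq": "12345678"}],
    "y": [{"phone": "13812345678", "email": ""},               # empty email becomes None
          {"id_number": "not-a-number", "qq": "34567890"}],    # invalid ID becomes None
})
```

Keeping every row on the same four fields, with `None` for anything unusable, means the later matching steps never need to know which system a record came from.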
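In step 20, the attributes corresponding to the matching rules are extracted. For each rule, if the contents of all attributes the rule defines are non-empty, the contents are concatenated with a specific symbol to form an attribute string, and a key-value pair is generated from the attribute string and the record id, for example: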
360622199001011111/x1
13812345678/x1
vip@audaque.com/x1
12345678/x1
360622199001011111/a1
23456789/a1
34567890/a2
13812345678/y1
360622199001012222/y2
guest@audaque.com/y2
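34567890/y2.
In this preferred embodiment the attribute string is the key and the record id is the value. Because the correspondences are generated as key-value pairs, massive data can be processed on distributed parallel computing platforms such as MapReduce, which makes large-scale network account identification and matching possible.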
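Step 20 can be sketched as a generator that plays the role of a map function. This is an illustration under assumptions: the field names, the rule tuples, and the `|` separator are hypothetical (the text only speaks of "a specific symbol"), and everything runs in memory rather than on a MapReduce platform.

```python
# Matching rules as tuples of required attributes; the four rules in the text
# each use a single attribute, but a tuple also allows multi-attribute rules.
RULES = [("id_number",), ("email",), ("phone",), ("qq",)]
SEP = "|"  # assumed separator for multi-attribute rules

def emit_pairs(rows, rules=RULES):
    """Step 20: yield (attribute_string, record_id) for each rule a row fully satisfies."""
    for row in rows:
        for rule in rules:
            values = [row.get(attr) for attr in rule]
            if all(values):  # every attribute the rule requires must be non-empty
                yield SEP.join(values), row["id"]

rows = [
    {"id": "x1", "id_number": "360622199001011111", "phone": "13812345678",
     "email": "vip@audaque.com", "qq": "12345678"},
    {"id": "a1", "id_number": "360622199001011111", "qq": "23456789"},
    {"id": "a2", "qq": "34567890"},
    {"id": "y1", "phone": "13812345678"},
    {"id": "y2", "id_number": "360622199001012222", "email": "guest@audaque.com",
     "qq": "34567890"},
]
pairs = list(emit_pairs(rows))
```

The source keys on the bare attribute string. Prefixing the key with the rule name would keep values from different rules from colliding (a QQ number that happens to equal some other field, say), but that refinement is not stated in the source.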
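In step 30 the present invention merges on the rule attributes to make a preliminary identification of the same entity. Specifically:
All identical attribute strings are merged, and the record ids grouped under one attribute string represent the same entity (registrant), for example: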
360622199001011111/x1,a1
13812345678/x1,y1
vip@audaque.com/x1
12345678/x1
23456789/a1
34567890/a2,y2
360622199001012222/y2
guest@audaque.com/y2.
Ignoring the attribute strings gives the following preliminary list of entities:
x1,a1
x1,y1
x1
x1
a1
a2,y2
y2
y2.
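The grouping in step 30 corresponds to the shuffle-and-reduce half of a key-value job. A minimal in-memory sketch, assuming the pairs from the example above; representing an entity identifier as a `frozenset` of record ids is a choice made for this illustration.

```python
from collections import defaultdict

def group_by_attribute(pairs):
    """Step 30: merge record ids that share an attribute string into one candidate entity."""
    groups = defaultdict(set)
    for attr_string, record_id in pairs:
        groups[attr_string].add(record_id)
    # An entity's identifier is the set of its record ids; frozenset makes it hashable.
    return [frozenset(ids) for ids in groups.values()]

pairs = [
    ("360622199001011111", "x1"), ("13812345678", "x1"), ("vip@audaque.com", "x1"),
    ("12345678", "x1"), ("360622199001011111", "a1"), ("23456789", "a1"),
    ("34567890", "a2"), ("13812345678", "y1"), ("360622199001012222", "y2"),
    ("guest@audaque.com", "y2"), ("34567890", "y2"),
]
entities = group_by_attribute(pairs)
```

The result has one entry per distinct attribute string, so the same set of record ids can appear more than once, exactly as in the preliminary list.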
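The results of the above steps are computed independently for each rule, so an entity may appear more than once and an account may belong to several entities. The remedy is called transitive closure. In step 40 the present invention applies transitive closure processing to the data, which resolves the duplication and transitivity problems.
Step 40 may specifically include the following:
Step 401: for the record ids in each entity's identifier, broadcast the entity they belong to, generating key-value pairs that contain a record id and the identifier of an entity it belongs to;
For each entity, one key-value pair containing a record id and the entity's identifier is generated for every record id in the identifier. For example, the pairs for record id x1 are: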
x1/x1,a1
x1/x1,y1
x1/x1
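x1/x1.
Step 402: collect the entities each record id belongs to. If a record id belongs to only one entity, mark that entity's state as "retained"; otherwise merge the record ids in the identifiers of all those entities, remove duplicates, generate a new entity identifier and mark the new entity's state as "new", and mark the state of each old entity as "deleted".
For example, record id x1 corresponds to 4 entities, "x1,a1", "x1,y1", "x1", and "x1"; merging them and removing duplicates gives the new entity "x1,a1,y1" with state "new", while the 4 entities "x1,a1", "x1,y1", "x1", and "x1" get the state "deleted". As another example, record id y1 corresponds to only one entity, "x1,y1", so its state is output as "retained".
Step 403: combine the state information of each entity. If the states include "new", the entity is retained; otherwise, if the states include "deleted", the entity is deleted; otherwise the entity is retained.
For example, "x1,y1" has two states, "deleted" (computed from x1) and "retained" (computed from y1), so the final result is that entity "x1,y1" is deleted.
Step 404: output all entities that are to be retained.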
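The state bookkeeping of steps 401 to 404 can be sketched in memory as follows. The function names and the state labels `keep`, `new`, and `delete` are choices made for this illustration, and because entities are stored as sets, the two identical "x1" entities of the example collapse into one.

```python
from collections import defaultdict

def mark_states(entities):
    """Steps 401-402: broadcast membership, then mark each entity keep/new/delete."""
    owners = defaultdict(set)            # record id -> entities containing it (step 401)
    for entity in entities:
        for record_id in entity:
            owners[record_id].add(entity)
    states = defaultdict(set)            # entity -> set of states it received (step 402)
    for owning in owners.values():
        if len(owning) == 1:
            states[next(iter(owning))].add("keep")
        else:
            states[frozenset().union(*owning)].add("new")
            for old in owning:
                states[old].add("delete")
    return states

def retained(states):
    """Steps 403-404: "new" wins; otherwise any "delete" removes the entity."""
    return {e for e, s in states.items() if "new" in s or "delete" not in s}

E = frozenset
entities = [E({"x1", "a1"}), E({"x1", "y1"}), E({"x1"}), E({"x1"}),
            E({"a1"}), E({"a2", "y2"}), E({"y2"}), E({"y2"})]
states = mark_states(entities)
result = retained(states)
```

In this run the entity "x1,y1" receives both `delete` (via x1) and `keep` (via y1) and is dropped, matching the example in the text. The entity "x1,a1" survives this pass because record id a1 re-creates it as a merge result, which is one reason a single pass is not always enough.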
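These steps resolve all duplication problems and part of the transitivity problems. Step 50 is still needed because links between entities may be transitive over several hops, so transitive closure processing has to be applied more than once. For example, suppose the entities "x1,a1", "a1,b1", and "b1,c1" are identified initially. One closure pass gives the entities "x1,a1,b1" and "a1,b1,c1"; only a second pass gives the correct final result "x1,a1,b1,c1". The closure process stops when the entities no longer change (for example, when the number of entities in the result stays the same, or when no "deleted" state appears).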
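The repetition in step 50 can be sketched as a loop around a compact version of the closure pass. This is an in-memory illustration under assumptions: `closure_pass`, `resolve`, and the `max_passes` cap are hypothetical names and additions, not part of the source.

```python
from collections import defaultdict

def closure_pass(entities):
    """One pass of step 40. Returns (retained entities, True if any entity was deleted)."""
    owners = defaultdict(set)              # record id -> entities that contain it
    for entity in entities:
        for record_id in entity:
            owners[record_id].add(entity)
    new, deleted = set(), set()
    for owning in owners.values():
        if len(owning) > 1:                # shared record id: merge its entities
            new.add(frozenset().union(*owning))
            deleted |= owning
    # "new" takes precedence over "deleted", as in step 403.
    return (set(entities) - deleted) | new, bool(deleted)

def resolve(entities, max_passes=100):
    """Step 50: repeat the pass until no entity is marked deleted.

    max_passes is a safety cap added for this sketch, not part of the source.
    """
    current = set(entities)
    for _ in range(max_passes):
        current, changed = closure_pass(current)
        if not changed:
            return current
    raise RuntimeError("closure did not converge within max_passes")

E = frozenset
chain = [E({"x1", "a1"}), E({"a1", "b1"}), E({"b1", "c1"})]
after_one, _ = closure_pass(chain)   # the text's intermediate result
final = resolve(chain)               # the text's final result
```

The loop here stops on the second condition the text names (no "deleted" state). Under this reading of steps 401 to 404, comparing only the number of entities looks less safe: for a hypothetical four-link chain ab, bc, cd, de the count is 3 after both the first and the second pass although the entities still change.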
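In summary, the present invention can identify, from a large amount of data, the accounts that belong to the same entity, and can be used for large-scale network account identification and matching. Its beneficial effects are mainly the following three:
1. Data benefit. As is well known, the value of data is 1+1>>2: linking data that was isolated but highly related yields a value far greater than the sum of the individual values. By linking an entity's accounts, originally scattered data can be aggregated and the entity's attributes and activity information obtained in full. This lays the foundation for later analysis of the entity and for applications built on the analysis results.
2. Economic benefit. Once the attributes and activity information of an entity's various accounts are in hand, they form a huge data oil field. The data itself has economic value, and so do data-based applications such as precision marketing.
3. Social benefit. When government departments have access to the public's online data and behavior, they can deepen their understanding of the public, make policies that fit reality more closely, and increase social benefit. At the same time, public security departments can obtain leads for solving cases by monitoring network data and thereby maintain social stability.
The above is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.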

Claims (9)

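  1. A network account identification and matching method, characterized by comprising:
    Step 10: organizing the network accounts according to the attributes required by the predefined matching rules, and using a unique record id as the identifier of each network account;
    Step 20: for each matching rule, if a network account has all the attributes required by that rule, concatenating the contents of those attributes into an attribute string and forming a correspondence between the attribute string and the record id of the network account;
    Step 30: merging the record ids that correspond to the same attribute string, the merged record ids representing the same entity and serving as the identifier of that entity;
    Step 40: for the record ids in each entity's identifier, broadcasting the entity they belong to, forming a correspondence between each record id and the identifiers of the entities it belongs to, merging the entity identifiers that correspond to the same record id, and applying transitive closure processing to the merged identifiers to obtain new entity identifiers;
    Step 50: repeating step 40 until the entities no longer change.
  2. The network account identification and matching method according to claim 1, characterized in that step 10 comprises:
    Step 101: determining the required attributes from the matching rules;
    Step 102: generating a unique record id for each network account record;
    Step 103: extracting the values of the required attributes from the network account and adding the record id to produce a new row of data; if the network account lacks an attribute, or the attribute exists but its content is empty or invalid, the content of that attribute is left empty.
  3. The network account identification and matching method according to claim 1, characterized in that, in step 20, the contents are concatenated with a specific symbol to form the attribute string.
  4. The network account identification and matching method according to claim 1, characterized in that step 40 comprises:
    Step 401: for the record ids in each entity's identifier, broadcasting the entity they belong to, generating key-value pairs that contain a record id and the identifier of an entity it belongs to;
    Step 402: collecting the entities each record id belongs to; if a record id belongs to only one entity, marking that entity's state as "retained"; otherwise merging the record ids in the identifiers of all those entities, removing duplicates, generating a new entity identifier and marking the new entity's state as "new", and marking the state of each old entity as "deleted";
    Step 403: combining the state information of each entity; if the states include "new", the entity is retained; otherwise, if the states include "deleted", the entity is deleted; otherwise the entity is retained;
    Step 404: outputting all entities that are to be retained.
  5. The network account identification and matching method according to claim 1, characterized in that, in step 50, the condition for judging that the entities have not changed is that the number of entities remains the same.
  6. The network account identification and matching method according to claim 4, characterized in that, in step 50, the condition for judging that the entities have not changed is that no entity in the "deleted" state appears.
  7. The network account identification and matching method according to claim 1, characterized in that the required attributes are the ID card number, mobile phone number, email address, or QQ number.
  8. The network account identification and matching method according to claim 1, characterized in that the matching rules include: same ID card number, same mobile phone number, same email address, or same QQ number.
  9. The network account identification and matching method according to claim 1, characterized in that, in step 20, a key-value pair containing the attribute string and the record id of the network account is generated.
PCT/CN2015/072489 2015-01-30 2015-02-09 Network account identification and matching method WO2016119275A1 (zh)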

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510047747.7A CN104573094B (zh) 2015-01-30 2015-01-30 Network account identification and matching method
CN201510047747.7 2015-01-30

Publications (1)

Publication Number Publication Date
WO2016119275A1 (zh) 2016-08-04

Family

ID=53089156

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/072489 WO2016119275A1 (zh) 2015-01-30 2015-02-09 Network account identification and matching method

Country Status (2)

Country Link
CN (1) CN104573094B (zh)
WO (1) WO2016119275A1 (zh)

Also Published As

Publication number Publication date
CN104573094B (zh) 2018-05-29
CN104573094A (zh) 2015-04-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15879483

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15879483

Country of ref document: EP

Kind code of ref document: A1