WO2016106944A1

WO2016106944A1 - Method for creating virtual human on mapreduce platform

Info

Publication number: WO2016106944A1
Application number: PCT/CN2015/072486
Authority: WO
Inventors: 蔡立宇; 张观成; 喻勇; 杨航; 范亚博; 贾西贝
Original assignee: 深圳市华傲数据技术有限公司
Priority date: 2014-12-31
Filing date: 2015-02-09
Publication date: 2016-07-07
Also published as: CN104965846B; CN104965846A

Abstract

A method for creating a virtual human on a MapReduce platform. The method comprises: step 1, extracting, from behavior logs, accounts together with the login time and login terminal information corresponding to the accounts (1); step 2, calculating the similarity between the accounts according to the co-occurrence of the accounts, and constructing a connected graph in which the accounts are represented by nodes and the similarity between accounts is represented by the length of the edge between the corresponding nodes, wherein the shorter the edge, the higher the similarity (2); and step 3, clustering the nodes in the connected graph on the MapReduce platform, and creating virtual humans according to the clustering result (3). Because the virtual human is created from behavior logs, the method has low complexity and high accuracy and is suitable for processing big data. By applying the MapReduce distributed computing model, clustering based on local density is carried out on a cluster of machines, which relaxes the restriction imposed by the limited resources of a single machine, allows massive data to be processed, and allows clustering operations to be completed more quickly.

Description

Virtual person establishment method on MapReduce platform

Technical field

The present invention relates to the field of data processing technologies, and in particular, to a method for establishing a virtual human on a MapReduce platform.

Background art

Currently, network services such as instant messaging, e-mail, online games, P2P software downloads, online forums, online recruitment, e-commerce transactions, and online booking of airline tickets bring great convenience to the lives of network users. Each network service generally assigns each user an account, which is associated with the user's registration information and is used to record and identify the user; examples include an instant messaging number (such as a QQ account), an e-mail address, an online game account, a forum login account, and a P2P software account.

Each network user has a variety of accounts, and a large number of network users produces a huge amount of account data. For the relevant departments, managing network user information effectively has become an arduous task. To manage network user information effectively, analyzing the affiliation of network accounts, that is, determining which accounts belong to the same person (virtual person), has become an urgent problem to be solved.

In the prior art, most approaches to constructing a virtual person are attribute matching methods. An attribute matching scheme is roughly as follows:

A) Specify the rules for matching network account attributes, that is, which attributes are used for matching and how a successful match is determined. For example, when a QQ account and a Taobao account are matched, the two accounts are considered successfully matched if the edit distances of both the "name" field and the "contact" field are less than 3.

B) Derive the degree (similarity) to which accounts belong to the same person from the attribute matching results, and finally decide from the similarity which accounts belong to the same person. In the example above, two accounts are considered to belong to the same person as soon as they match successfully.
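The matching rule in A) and B) can be illustrated with a minimal Python sketch. It is not taken from any cited prior-art system: the helper names `edit_distance` and `accounts_match` are hypothetical, the field names and the limit of 3 come from the example above, and the sketch assumes that both accounts actually carry both fields.

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute
        prev = cur
    return prev[-1]


def accounts_match(acc_a, acc_b, fields=("name", "contact"), limit=3):
    """Rule A: every listed field must have an edit distance below `limit`."""
    return all(edit_distance(acc_a[f], acc_b[f]) < limit for f in fields)


qq = {"name": "Zhang San", "contact": "13800000000"}
taobao = {"name": "Zhang Sam", "contact": "13800000001"}
same_person = accounts_match(qq, taobao)   # rule B: a match means the same person
```

A missing field raises a `KeyError` in this sketch, which mirrors the first difficulty listed below: the rule has nothing to compare when an attribute is absent.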

However, the following situations exist in real life:

1. Attributes are often missing from account data. For example, only some attribute values are filled in when the account is registered.

2. Different types of account data share only a small number of attributes, and not all of the shared attributes can be used for attribute matching.

3. Different types of account data use different attributes for the same semantics, and these attributes must be aligned, which further increases the difficulty. For example, in a class A account the name is held in a single "name" field, whereas in a class B account the name is represented by two fields, "last name" and "name".

4. In actual account data, the credibility of attribute values is not very high. For example, because real-name authentication is lacking, an ID number may not be genuine.

5. Comparison has to be carried out at the attribute level, so the complexity is high.

These conditions make attribute matching complicated and computationally intensive, and the actual results are not ideal; for big data processing in particular, the accuracy is low.

MapReduce, on the other hand, is a distributed parallel computing framework proposed by Google for the parallel computation of large-scale data sets. It processes a large data set in parallel mainly through two steps, "Map" and "Reduce". In a computation on the MapReduce platform, the input data is first split and distributed to different computers in the cluster, and computers in the cluster are assigned to execute Map jobs or Reduce jobs. A Map job extracts key-value pairs <Key, Value> from the input data, and each key-value pair is passed as a parameter to the map function. The intermediate key-value pairs generated by the map function are cached in memory and periodically written to the local disk, where they are divided into R partitions; the value of R is defined by the user, and each partition later corresponds to one Reduce job. Key-value pairs with the same Key are processed by the same Reduce job. The Reduce job reads these intermediate key-value pairs; for each unique key, the key and its associated values are passed to the reduce function, and the output generated by the reduce function is appended to the output file of the partition. Map/Reduce jobs differ from map/reduce functions as follows: a Map job processes one split of the input data and may need to call the map function many times, once for each input key-value pair; a Reduce job processes the intermediate key-value pairs of one partition, calls the reduce function once for each distinct key, and finally corresponds to one output file. Throughout the process, the input data comes from the underlying distributed file system, the intermediate data is placed on the local file system, and the final output data is written to the underlying distributed file system.

In order to process massive data and overcome the restrictions imposed by the limited resources of a single machine, a way of establishing virtual persons on the MapReduce platform is urgently needed.
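The Map, partition, and Reduce flow described above can be imitated on a single machine. The following is a minimal sketch in Python, not Hadoop code: the names `run_mapreduce`, `count_logins_map`, and `count_logins_reduce` are hypothetical, the (account, terminal) record layout is an assumption made for illustration, and all data stays in memory instead of on a distributed file system.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_partitions=2):
    # Map phase: each input record may emit several intermediate (key, value) pairs.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in map_fn(record):
            # Partition by key so that equal keys always reach the same Reduce job.
            partitions[hash(key) % num_partitions][key].append(value)
    # Reduce phase: one call of the reduce function per distinct key in a partition.
    output = []
    for part in partitions:
        for key in sorted(part):
            output.extend(reduce_fn(key, part[key]))
    return output


def count_logins_map(record):
    account, _terminal = record
    yield account, 1


def count_logins_reduce(account, ones):
    yield account, sum(ones)


log = [("a", "t1"), ("b", "t1"), ("a", "t2")]
logins = dict(run_mapreduce(log, count_logins_map, count_logins_reduce))
```

The parameter `num_partitions` plays the role of the user-defined R: changing it changes which Reduce job sees a key, but never splits one key across two Reduce jobs.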

Summary of the invention

Therefore, the object of the present invention is to provide a virtual person establishment method on the MapReduce platform that solves the problem that constructing a virtual person is complicated and inaccurate because of the wide variety of account types, that processes massive data, and that overcomes the restrictions imposed by the limited resources of a single machine.

To achieve the above objective, the present invention provides a virtual human establishment method on a MapReduce platform, including:

Step 1: Extract the accounts, together with the login time and login terminal information corresponding to each account, from the behavior log;

Step 2: Calculate the similarity between the accounts according to the co-occurrence of the accounts, and construct a connected graph in which nodes represent the accounts and the length of the edge between two nodes represents the similarity between the corresponding accounts; the shorter the edge between two nodes, the higher the similarity between the accounts they represent;

Step 3: Cluster the nodes in the connected graph on the MapReduce platform, and establish virtual persons according to the clustering result.

Wherein step 3 includes:

Step 20: Taking the information of the nodes and edges in the connected graph as input data, generate, through a Map job, key-value pairs containing a node and its neighbor information, and generate, through a Reduce job, an output containing the node, the local density Rho of the node, and all neighbor information of the node; Rho is defined as the number of the node's adjacent edges whose length is lower than the predefined value Dc;

Step 30: For the output of the Reduce job in step 20, generate, through a Map job, key-value pairs containing a node, the Rho of the node, the Rho of a neighbor node, and the neighbor information; through a Reduce job, traverse, for each node, the Rho of the node, the Rho of all its neighbor nodes, and all its neighbor information, obtain the dispersion Delta of each node, and perform class identification according to a predetermined rule; Delta is defined as the length of the shortest adjacent edge among the edges connecting the node to neighbor nodes with a higher Rho value, or, if there is no such neighbor node, the length of the longest adjacent edge of the node;

Step 40: The nodes of the same class together constitute one virtual person.

The predetermined rule includes: if the Rho and the Delta of a node are higher than the threshold R_T and the threshold D_T, respectively, which are given as input parameters, the node is the center of a class and takes its own identifier as its class identifier; otherwise, the node takes the class identifier of the nearest neighbor node that has a higher Rho;

The class identifier of an isolated node is its own identifier.

Alternatively, the predetermined rule includes: dividing in advance a value interval for Rho and a corresponding value interval for Delta; if the Rho value of a node falls within the Rho value interval and the Delta value of the node falls within the corresponding Delta value interval, the node is the center of a class and takes its own identifier as its class identifier; otherwise, the node takes the class identifier of the nearest neighbor node that has a higher Rho;

The class identifier of an isolated node is its own identifier.

Factors other than the co-occurrence of the accounts may also be introduced to calculate the similarity between the accounts.

The method further includes merging all the virtual persons and the accounts corresponding to each virtual person into a virtual person database.

The output of the Reduce job in step 20 is stored in a relational database or a key-value database.

In the Map job in step 30, the traversal of the Rho of the neighbor nodes is implemented by performing a Cartesian product on the output of the Reduce job in step 20.

Wherein step 20 includes:

Step 21: Taking the information of the nodes and edges in the connected graph as input data, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying a neighbor node and a field identifying the length of the adjacent edge between the node and the neighbor node;

Step 22: Partition the key-value pairs according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same partition;

Step 23: Group the key-value pairs in the same partition according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same group;

Step 25: Via the Reduce job, traverse all neighbors of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate an output including the node, the local density Rho of the node, and all neighbor information of the node.

Wherein, step 20 further includes:

In step 21, the key further includes a field identifying the length of the adjacent edge between the node and the neighbor node;

Step 24: Sort the key-value pairs belonging to the same group according to the length of the adjacent edge included in the key.

Wherein step 30 includes:

Step 31: For the output of the Reduce job in step 20, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying a neighbor node, a field identifying the length of the adjacent edge between the node and the neighbor node, a field identifying the Rho of the neighbor node, and a field identifying the Rho of the node;

Step 32: Partition the key-value pairs according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same partition;

Step 33: Group the key-value pairs in the same partition according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same group;

Step 35: Via the Reduce job, for each node, traverse the Rho of the node, the Rho of all its neighbor nodes, and all its neighbor information by iterating over the values of the key-value pairs belonging to the same group, obtain the dispersion Delta of each node, and perform class identification according to the predetermined rule.

Wherein, step 30 further includes:

In step 31, the key further includes a field identifying the Rho of the neighbor node;

Step 34: Sort the key-value pairs belonging to the same group according to the Rho of the neighbor node included in the key.

In summary, the virtual person establishment method on the MapReduce platform of the present invention establishes virtual persons from the behavior log, has low complexity and high accuracy, and is suitable for processing big data. By applying the MapReduce distributed computing model, clustering based on local density is carried out on a cluster of machines, which relaxes the restrictions imposed by the limited resources of a single machine, allows massive data to be processed, and allows the clustering operation to be completed more quickly.

Brief description of the drawings

In the drawings,

FIG. 1 is a flowchart of a preferred embodiment of the virtual person establishment method on the MapReduce platform of the present invention;

FIG. 2 is a schematic diagram of a preferred embodiment of the virtual person establishment method on the MapReduce platform of the present invention;

FIG. 3 is a schematic diagram of the Rho value-Delta value distribution in a preferred embodiment of the virtual person establishment method on the MapReduce platform of the present invention.

Detailed description

The technical solutions of the present invention and the beneficial effects thereof will be apparent from the following detailed description of the embodiments of the invention.

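FIG. 1 is a flowchart of a preferred embodiment of the virtual person establishment method on the MapReduce platform of the present invention. The main steps of the invention include:

Step 1: Extract the accounts, together with the login time and login terminal information corresponding to each account, from the behavior log;

Step 2: Calculate the similarity between the accounts according to the co-occurrence of the accounts, and construct a connected graph in which nodes represent the accounts and the length of the edge between two nodes represents the similarity between the corresponding accounts; the shorter the edge, the higher the similarity;

Step 3: Cluster the nodes in the connected graph on the MapReduce platform, and establish virtual persons according to the clustering result.

The invention may also include the step of merging all virtual persons and the accounts corresponding to each virtual person into a virtual person database.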

To cope with practical problems such as the complicated construction and low accuracy of virtual persons caused by the wide variety of account types, the present invention establishes virtual persons by analyzing behavior logs. A behavior log records the use of network services by network users and can be collected from the server side, the user terminal, and the like. The method is based on the following observations of reality:

1. Accounts that are active on the same terminal within a period of time may belong to the same person. The situation in which multiple accounts have been active on the same terminal within a certain period of time is called the co-occurrence of these accounts.

2. The stronger the co-occurrence of multiple accounts, for example the more often they co-occur, the greater the likelihood (called similarity) that these accounts belong to the same person.

3. Of the multiple accounts owned by a single user, some accounts are always used more frequently than others.

4. Accounts of different users may occasionally co-occur, but their co-occurrence will not be stronger than the co-occurrence among the accounts of a single user.

FIG. 2 is a schematic diagram of a preferred embodiment of the virtual person establishment method on the MapReduce platform of the present invention.

The key steps of the preferred embodiment include:

Abstract the records in the behavior log as [time, terminal, account] to obtain data consisting of a timestamp, an account ID, and a terminal ID, from which it is known when which account was active on which terminal. For each account, count the number of times within a period of time that the account was active on the same terminal as another account; this gives the number of co-occurrences between the accounts.

"Number of times" is one way of measuring the co-occurrence, and the term is used in this embodiment only to simplify the explanation. In practice, time and other information can also be added as weights to measure the co-occurrence; for example, a co-occurrence outside working hours can be weighted slightly more heavily than one during working hours, because computer terminals are more likely to be shared during working hours.

The similarity between the accounts is calculated from the co-occurrence of the accounts observed above. When the result is abstracted into a connected graph, the nodes of the graph represent accounts and the length of an edge represents the similarity between the two accounts it connects. Usually, the higher the similarity, the shorter the edge.
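The counting of co-occurrences from [time, terminal, account] records, which precedes the construction of the graph, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `co_occurrence_counts` is hypothetical, timestamps are assumed to be integer seconds, and "a period of time" is modeled as a fixed window of one day per terminal, which is only one possible reading of the text.

```python
from collections import defaultdict
from itertools import combinations


def co_occurrence_counts(records, window_seconds=86400):
    """Count, per account pair, the (terminal, window) buckets they share."""
    active = defaultdict(set)
    for timestamp, terminal, account in records:
        # Accounts seen on the same terminal within the same window co-occur.
        active[(terminal, timestamp // window_seconds)].add(account)
    counts = defaultdict(int)
    for accounts in active.values():
        for a, b in combinations(sorted(accounts), 2):
            counts[(a, b)] += 1
    return dict(counts)


records = [
    (10, "pc1", "qq:1"), (20, "pc1", "mail:x"),        # day 0 on pc1
    (86405, "pc1", "qq:1"), (86409, "pc1", "mail:x"),  # day 1 on pc1
    (30, "pc2", "qq:1"), (40, "pc2", "game:z"),        # day 0 on pc2
]
counts = co_occurrence_counts(records)
```

Here "qq:1" and "mail:x" co-occur twice and "qq:1" and "game:z" once, so the first pair would receive the shorter edge in the connected graph.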

If there are other models, such as attribute matching, the matching result of the corresponding model can also be used as a factor affecting the length of the edge.

After the graph has been obtained, the following calculations determine which accounts belong to the same person:

For each node, find its local density Rho. Rho is defined as the number of the node's adjacent edges whose length is lower than a predefined value Dc.

For each node, find its dispersion Delta. Delta is defined as the length of the shortest adjacent edge among the edges connecting the node to neighbor nodes with a higher Rho value; if there is no such neighbor node, Delta is the length of the longest adjacent edge of the node.

Nodes whose Rho value and Delta value are higher than the specific thresholds R_T and D_T, respectively, are identified as class center nodes. Each such node represents a class, that is, a virtual person.

Assign each remaining non-center node to the class of the nearest node that has a higher Rho value than its own.

Nodes of the same class belong to the same virtual person. A corresponding virtual person is established for each class.

As the above process shows, the present invention uses a local density-based clustering algorithm to analyze the behavior log, which reduces the analytical complexity of the whole system compared with other clustering methods such as K-Means and hierarchical clustering. At the same time, the distributions of the two values derived from the data itself, Delta and Rho, provide an objective reference for choosing the number of clusters.

The class center identification method shown above requires both the Rho value and the Delta value of a node to be higher than the corresponding thresholds. In practice, other methods based on the Rho or Delta values can be adopted, for example: the Rho value is higher than 3 and the Delta value is between 4 and 5, or the Rho value is higher than 5 and the Delta value is between 5 and 6.

The meanings of the various values used in the virtual person establishment method on the MapReduce platform of the present invention are briefly described below.

Edge length: a measure of the likelihood (similarity) that two nodes belong to the same person.

Rho: the importance of the current node to its neighbors.

Delta: how distinguishable the current node would be from other class centers if it were itself a class center.

For example:

The edge length can be defined as the reciprocal of the number of co-occurrences c(a,b) of the two accounts in the behavior log, that is, the reciprocal of the number of times the two accounts have been active on the same terminal within a certain period of time.

Rho can be defined as the number of adjacent edges of the current node whose length is less than the parameter value Dc.

Delta is defined as the length of the shortest adjacent edge among the edges connecting the node to neighbor nodes with a higher Rho value; if there is no such neighbor node, Delta is the length of the longest adjacent edge of the node.

The corresponding formula in the above definition example is expressed as:

Let c(a,b) be the number of co-occurrences of accounts a and b counted from the behavior log, then:

1. The length of the edge between a and b:

d(a,b)=1/c(a,b) [Equation 1].

2. For all N neighbor nodes bn of a, n=1..N (N is a natural number), the Rho value of a:

Rho(a)=sum{X(d(a,bn)-Dc) | n=1..N},

where X(x) is defined as: if x < 0, then X(x) = 1; otherwise X(x) = 0.

3. The Delta value of a:

Let the neighbor nodes of node a be b1..bN in turn; then Delta(a) can be defined as:

1) If there is a neighbor node bx that satisfies Rho(bx)>Rho(a), then:

Delta(a)=min{d(a,bn) | n=1..N and Rho(bn)>Rho(a)}.

2) Otherwise:

Delta(a)=max{d(a,bn) | n=1..N}.

In particular, for a node without any neighboring edges, when marking its class identifier, it can be directly identified as itself, that is, a virtual person is formed independently.
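The definitions of d(a,b), Rho(a), and Delta(a) above can be checked on a small example. The following Python sketch runs on a single machine and is only an illustration: the helper names `build_graph`, `rho`, and `delta`, the four-node example, and the value Dc = 0.6 are assumptions chosen here, not values from the embodiment.

```python
def build_graph(counts):
    """counts maps an account pair (a, b) to its co-occurrence count c(a,b)."""
    graph = {}
    for (a, b), c in counts.items():
        length = 1.0 / c                      # Equation 1: d(a,b) = 1/c(a,b)
        graph.setdefault(a, {})[b] = length
        graph.setdefault(b, {})[a] = length
    return graph


def rho(graph, a, dc):
    # Number of adjacent edges of a whose length is lower than Dc.
    return sum(1 for d in graph[a].values() if d < dc)


def delta(graph, rhos, a):
    neighbors = graph[a]
    higher = [d for b, d in neighbors.items() if rhos[b] > rhos[a]]
    if higher:
        return min(higher)             # shortest edge to a denser neighbor
    return max(neighbors.values())     # no denser neighbor: longest adjacent edge


counts = {("a", "b"): 4, ("a", "c"): 2, ("b", "c"): 1, ("c", "d"): 1}
graph = build_graph(counts)
rhos = {n: rho(graph, n, dc=0.6) for n in graph}
deltas = {n: delta(graph, rhos, n) for n in graph}
```

In this example node a has the highest Rho (2) and, having no denser neighbor, takes its longest adjacent edge (0.5) as Delta, which makes it the natural candidate for a class center. Isolated nodes never enter `graph` here because they have no edges, consistent with labeling them as their own class.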

The Delta value depends on the Rho value, and the Rho value can also be defined in other ways, for example by a common centrality measure.

The value of Dc depends on the specific data in practice. Usually the value of Dc is determined after the connected graph has been obtained; that is, as in other common clustering methods, it is an input parameter. However, unlike the K value in K-Means, whose selection directly determines the number of classes, the influence of subjective factors in choosing Dc is weakened by the Rho and Delta values and by the values of R_T and D_T, because the selection of these parameters takes the characteristics of the data itself into account objectively.

One method of selecting R_T and D_T is as follows. FIG. 3 is a schematic diagram of the Rho value-Delta value distribution in a preferred embodiment of the virtual person establishment method on the MapReduce platform of the present invention, in which each point represents a node. First, plot the Rho-Delta distribution of all points, then observe the distribution of the Delta values (Rho values); the value at which the distribution changes abruptly is taken as D_T (R_T). As shown in FIG. 3, the distribution of the Delta values (Rho values) is discontinuous at d' (r'), so D_T (R_T) takes the value d' (r'). If there are many data points, they can be sampled, and the distribution map of the sampled points can be used as a reference for choosing the values.
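The text leaves the detection of the abrupt change to visual inspection of the distribution map. As one possible automation, the following Python sketch places the threshold in the middle of the largest gap between consecutive sorted values. This heuristic and the name `threshold_at_largest_gap` are assumptions introduced here for illustration; they are not part of the described method.

```python
def threshold_at_largest_gap(values):
    """Return the midpoint of the largest gap between consecutive sorted values."""
    ordered = sorted(set(values))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct values")
    # Each entry is (gap size, lower value, upper value).
    gaps = [(hi - lo, lo, hi) for lo, hi in zip(ordered, ordered[1:])]
    _size, lo, hi = max(gaps)
    return (lo + hi) / 2


delta_values = [0.1, 0.12, 0.15, 0.2, 0.9, 1.0]
d_t = threshold_at_largest_gap(delta_values)
```

With these sample values the largest gap lies between 0.2 and 0.9, so nodes with a Delta of 0.9 or 1.0 would pass the D_T test. The same function can be applied to the Rho values to obtain a candidate R_T.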

By introducing other models, such as attribute matching, the matching result of the corresponding model can also be used as a factor affecting the length of the edge. That is to say, factors other than the number of times of cooperation between the accounts are introduced to calculate the similarity between the accounts.

Taking attribute matching as an example, in mathematical notation the result of the attribute matching is used as a parameter for calculating the edge length. That is, if match(a,b) is the similarity of accounts a and b obtained by attribute matching, the edge length can be defined as follows:

d(a,b)=f(c(a,b), match(a,b)).

Taking [Equation 1] as an example, you can choose to define it as:

[Formula image: edge length after introducing the attribute matching model]

It can be understood from the above description that the present invention uses a local density-based clustering process to establish virtual persons from the behavior log, which has low complexity and high accuracy and is suitable for processing big data. Further, in order to process massive data and overcome the restrictions imposed by the limited resources of a single machine, the present invention implements the local density-based clustering process on the MapReduce platform, so that the clustering operation on massive data is completed more quickly.

The specific implementation manner of step 3 of the preferred embodiment on the MapReduce platform is exemplified below. Step 3 may specifically include:

Step 20: Taking the information of the nodes and edges in the connected graph as input data, generate, through a Map job, key-value pairs containing a node and its neighbor information, and generate, through a Reduce job, an output containing the node, the local density Rho of the node, and all neighbor information of the node. Rho is defined as the number of the node's adjacent edges whose length is lower than the predefined value Dc.

Step 20 may specifically include:

Step 21: Taking the information of the nodes and edges in the connected graph as input data, generate key-value pairs via a Map job, where the key includes a field identifying the node, and the value includes a field identifying a neighbor node and a field identifying the length of the adjacent edge between the node and the neighbor node. The neighbor information consists of the neighbor node and the length of the corresponding adjacent edge. As an optimization, in step 21 the key may further include a field identifying the length of the adjacent edge between the node and the neighbor node.

In application, each row of the input data can correspond to the information of one edge between a pair of nodes. For convenience, the input data can therefore be set to a triple consisting of the node a with the smaller identifier value, the node b with the larger identifier value, and the edge length len(a,b): [a, b, len(a,b)].

Because the Rho value must be calculated for every node, the Map job produces two <Key, Value> outputs for each edge in the connected graph. Each Key or Value is composed of two fields, left and right. Specifically, the first Key may be K1=<a, len(a,b)> (here, left=a, right=len(a,b)) with the Value V1=<b, len(a,b)>, and the second Key may be K2=<b, len(a,b)> with the Value V2=<a, len(a,b)>.
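The two outputs per edge can be written as a small generator. This is a minimal Python sketch of the map function of step 21 in its optimized form, with tuples standing in for the two-field Key and Value; the name `map_edge` is hypothetical.

```python
def map_edge(record):
    """Emit one <Key, Value> pair per endpoint of the edge [a, b, len(a,b)]."""
    a, b, length = record
    yield (a, length), (b, length)   # K1=<a, len(a,b)>, V1=<b, len(a,b)>
    yield (b, length), (a, length)   # K2=<b, len(a,b)>, V2=<a, len(a,b)>


pairs = list(map_edge(("a", "b", 0.25)))
```

Emitting the edge once under each endpoint is what later lets a single Reduce call see every adjacent edge of a node.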

Step 22: Partition the key-value pairs according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same partition. In this embodiment, the index of the partition to which each record belongs depends only on the first field of the Key output by the Map job. For example, the partition index can be the remainder of the hash code of the Key's left field divided by the known total number of partitions, expressed in pseudocode:

K.left.hashCode() % (total number of partitions).

This guarantees that the edge information of all records whose Keys have the same node in the left field is allocated to the same partition for storage.

Step 23: Group the key-value pairs in the same partition according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same group. The result of the grouping comparison (GroupCompare) depends only on the comparison of the first field of the compared Keys. For example, for two Keys k1 and k2, the corresponding compare result is:

k1.left.compare(k2.left).

This guarantees that the information (the Values, that is, the neighbor nodes and edge lengths) of all edges of each node is processed in the same call of the Reduce procedure.

Step 24: Sort the key-value pairs belonging to the same group according to the length of the adjacent edge included in the key. The sorting in step 24 can be in ascending order. As an optional optimization, step 24 can be implemented by a SortComparator (SC), which can be set to return the result of comparing the two fields in the order left, then right. Expressed in pseudocode:

k1.left.compare(k2.left) != 0 ? k1.left.compare(k2.left) : k1.right.compare(k2.right).

Since the right field of the Key holds the edge length, this guarantees that the edge information is returned in ascending order of edge length during the iteration in the Reduce procedure. Note: the Key in step 21 is composed of the two fields node identifier and edge length precisely in order to enable this optimization; without it, the Key in step 21 can consist of the node identifier alone.
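The combined effect of the partitioning, grouping, and sorting in steps 22 to 24 can be imitated in memory. The following Python sketch is an illustration only: `partition_of` and `shuffle` are hypothetical names, Python's built-in `hash` stands in for `hashCode()`, and tuple ordering stands in for the SortComparator.

```python
from itertools import groupby


def partition_of(key, num_partitions):
    node, _length = key
    return hash(node) % num_partitions        # step 22: left field only


def shuffle(pairs, num_partitions=2):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, num_partitions)].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])       # step 24: order by left, then right
        # Step 23: group on the left field only, so one node forms one group.
        for node, group in groupby(part, key=lambda kv: kv[0][0]):
            yield node, [value for _key, value in group]


pairs = [
    (("a", 0.5), ("c", 0.5)), (("b", 0.25), ("a", 0.25)),
    (("a", 0.25), ("b", 0.25)), (("c", 0.5), ("a", 0.5)),
]
groups = dict(shuffle(pairs))
```

Each node's values arrive in ascending order of edge length, which is the property the Reduce procedure of step 25 relies on for early termination.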

Step 25: Via the Reduce job, traverse all neighbors of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate an output including the node, the local density Rho of the node, and all neighbor information of the node. The output of the Reduce job in step 25 is a key-value pair, where the key includes a field identifying the node, and the value includes a field identifying the node, a field identifying the Rho of the node, and a field identifying all neighbor information of the node.

After the above steps, each Reduce call can traverse all edges of the same node by iterating over the Values. Each call of the Reduce procedure outputs the following three pieces of information: the identifier of the current node n, the Rho value of n, and all neighbor information of n sorted by edge length.

When the SC optimization described above is used, the counting of the Rho value can stop as soon as the edge length reached in the iteration is no longer lower than the predefined value Dc. At the same time, since the adjacent edges have already been sorted in ascending order by the SC, the neighbor information can be concatenated in iteration order. Without the optimization, the counting of the Rho value has to iterate to the last edge, and the neighbor information has to be sorted before it is used as part of the Value.

As an example, the output format can be a key-value pair:

[K=n, V=<n, Rho(n), n1: len(n, n1), n2: len(n, n2) ... nN: len(n, nN)>].
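The Reduce procedure that produces this output, including the early end of the Rho count, can be sketched in Python as follows. The name `reduce_rho` and the tuple layout of the result are illustrative assumptions; the input is assumed to arrive already sorted in ascending order of edge length.

```python
def reduce_rho(node, sorted_neighbors, dc):
    """sorted_neighbors: (neighbor, length) pairs in ascending order of length."""
    rho = 0
    for _neighbor, length in sorted_neighbors:
        if length >= dc:
            break            # ascending order: no later edge can be below Dc
        rho += 1
    # Key is the node; the value repeats the node, then Rho, then all neighbors.
    return node, (node, rho, list(sorted_neighbors))


record = reduce_rho("a", [("b", 0.25), ("c", 0.5), ("d", 1.0)], dc=0.6)
```

Only the count stops early; the full neighbor list is still passed on, because the second MapReduce task needs every adjacent edge to compute Delta.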

The preferred embodiment uses the first MapReduce task described above to calculate the Rho values and to sort the neighbor nodes in ascending order of distance. The second MapReduce task, described next, mainly calculates the Delta values and identifies the class centers.

Step 30: For the output of the Reduce job in step 20, generate, through a Map job, key-value pairs containing a node, the Rho of the node, the Rho of a neighbor node, and the neighbor information. Through a Reduce job, traverse, for each node, the Rho of the node, the Rho of all its neighbor nodes, and all its neighbor information, obtain the dispersion Delta of each node, and perform class identification according to the predetermined rule.

In the preferred embodiment, the predetermined rule is: if the Rho and the Delta of a node are higher than the threshold R_T and the threshold D_T, respectively, which are given as input parameters, the node is the center of a class and takes its own identifier as its class identifier; otherwise, the node takes the class identifier of the nearest neighbor node that has a higher Rho;

The class identifier of an isolated node (a node with no neighbors) is its own identifier. This predetermined rule is similar to the rule adopted in Chinese patent application CN 201410814330.4, "Virtual Person Establishment Method and Apparatus": it rigidly requires that the Rho value and the Delta value each be higher than a corresponding threshold.

This is just one way of identifying a node as a class center. Basically, whether a node can serve as a class center node is decided from the node's Rho value and Delta value, and other methods that make the judgment from factors including the Rho and Delta values also exist. In the virtual person establishment method on the MapReduce platform of the present invention, the way of confirming class centers can also be relaxed so that the clustering operation completes more quickly. For example, the predetermined rule may include: dividing in advance a value interval for Rho and a corresponding value interval for Delta; if the Rho value of a node falls within the Rho value interval and the Delta value of the node falls within the corresponding Delta value interval, the node is the center of a class and takes its own identifier as its class identifier; otherwise, the node takes the class identifier of the nearest neighbor node that has a higher Rho; the class identifier of an isolated node is its own identifier. For example, if the Rho value of a node is in the range [10, 20] and its Delta value is in [0.9*10, 0.8*20] (that is, the Delta value lies within a range derived from the range of the Rho value, so that the Delta value range corresponds to the Rho value range), the node can also be identified as a class center.
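Both forms of the predetermined rule, and the way non-center nodes inherit a class identifier, can be sketched on a single machine. The Python below is an illustration under stated assumptions: all function names are hypothetical, the class identifier is resolved by recursion instead of by MapReduce jobs, and a non-center node that has no neighbor with a higher Rho is given its own identifier, a case the text does not spell out.

```python
def is_center_threshold(rho, delta, r_t, d_t):
    return rho > r_t and delta > d_t


def is_center_interval(rho, delta, intervals):
    # intervals: list of ((rho_lo, rho_hi), (delta_lo, delta_hi)) pairs
    return any(r_lo <= rho <= r_hi and d_lo <= delta <= d_hi
               for (r_lo, r_hi), (d_lo, d_hi) in intervals)


def assign_classes(graph, rhos, deltas, is_center):
    labels = {}

    def label_of(node):
        if node not in labels:
            neighbors = graph[node]
            higher = [(d, b) for b, d in neighbors.items() if rhos[b] > rhos[node]]
            if not neighbors or not higher or is_center(rhos[node], deltas[node]):
                labels[node] = node                      # isolated node or center
            else:
                labels[node] = label_of(min(higher)[1])  # nearest denser neighbor
        return labels[node]

    for node in graph:
        label_of(node)
    return labels


graph = {"a": {"b": 0.25, "c": 0.5}, "b": {"a": 0.25, "c": 1.0},
         "c": {"a": 0.5, "b": 1.0, "d": 1.0}, "d": {"c": 1.0}, "e": {}}
rhos = {"a": 2, "b": 1, "c": 1, "d": 0, "e": 0}
deltas = {"a": 0.5, "b": 0.25, "c": 0.5, "d": 1.0, "e": None}
labels = assign_classes(graph, rhos, deltas,
                        lambda r, d: is_center_threshold(r, d, r_t=1, d_t=0.4))
```

The recursion terminates because every step moves to a node with a strictly higher Rho. In the example, nodes a to d form one virtual person centered on a, and the isolated node e forms a virtual person of its own.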

In this way, the class identifier of every node is finally obtained. Nodes with the same class identifier belong to the same class, that is, to the same virtual person. Step 40 is thereby completed: the nodes of the same class together constitute a virtual person.

To compute the Delta value of a node, the Rho values of its neighbors are needed. On the output of the Reduce job in step 20, the general Cartesian product pattern on MapReduce can be used to traverse the Rho values of the neighbor nodes, with the full join implemented through a custom InputFormat. The purpose of this traversal is to compute the Delta value afterwards. Related cases can be found in [MapReduce Design Patterns, O'Reilly, Dec. 2012, pp. 128-138].
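One way to picture how the class identifiers settle is as pointer following: each class center points to itself, and every other node points to the neighbor whose class identifier it inherits. The sketch below is an assumption about how that resolution could be done in memory; the text does not describe this step, and `pointer` and `resolve_classes` are hypothetical names. It relies on each pointer leading to a node with strictly higher Rho, so chains cannot loop.

```python
def resolve_classes(pointer):
    """Follow each node's pointer until a self-pointing node (a class
    center or isolated node) is reached; return {node: class_id}.

    `pointer` maps every node to itself or to the neighbor whose class
    identifier it inherits. Assumes the chains are acyclic.
    """
    resolved = {}
    for node in pointer:
        path = []
        cur = node
        while cur not in resolved and pointer[cur] != cur:
            path.append(cur)
            cur = pointer[cur]
        root = resolved.get(cur, cur)
        resolved[cur] = root
        for n in path:
            resolved[n] = root
    return resolved


def virtual_persons(class_of):
    """Group nodes (accounts) by class identifier: one group per virtual person."""
    persons = {}
    for node, class_id in class_of.items():
        persons.setdefault(class_id, set()).add(node)
    return persons
```

For example, with `{"a": "b", "b": "b", "c": "a", "d": "d"}` the accounts a, b, and c form one virtual person and d forms another.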

Step 30 specifically includes:

Step 31: Generate key-value pairs from the output of the Reduce job in step 20 via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field identifying the length of the adjacent edge between the node and the neighbor node, a field identifying the Rho of the neighbor node, and a field identifying the Rho of the node.

From the output of the Reduce job in step 20, the Map job outputs the information of the current node and of each neighbor node connected to it. An optimized sample output format is:

[K=<a, Rho(b)>, V=<Rho(b), Rho(a), b, len(a,b)>].

In step 31, the key may optionally further include a field identifying the Rho of the neighbor node. This optimization incorporates Rho(b) into the key to facilitate the sorting in step 34.
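The optimized output format above can be sketched as a generator that emits one key-value pair per neighbor. This is an illustrative sketch in plain Python rather than Hadoop code; `map_step31` and `rho_of` are hypothetical names, and `rho_of` simply stands in for whatever mechanism supplies the neighbor Rho values (the Cartesian product or a database lookup).

```python
def map_step31(node, rho_node, neighbors, rho_of):
    """Emit K=<a, Rho(b)>, V=<Rho(b), Rho(a), b, len(a,b)> for each neighbor b.

    `neighbors` is a list of (neighbor_id, edge_length) pairs taken from
    the step 20 Reduce output; `rho_of` maps a node id to its Rho.
    """
    for b, length in neighbors:
        rho_b = rho_of[b]
        yield (node, rho_b), (rho_b, rho_node, b, length)
```

Placing Rho(b) in the key is what lets the framework sort neighbors by density before the Reduce call, instead of sorting inside the reducer.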

Step 32: Partition the key-value pairs according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same partition. For details, see step 22.

Step 33: Group the key-value pairs in the same partition according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same group. For details, see step 23.

Step 34: Sort the key-value pairs belonging to the same group according to the neighbor node Rho included in the key. As an optional optimization, keys are first compared on their first field to determine whether they belong to the same node; if they do, they are sorted in descending order of the second field. This sorting ensures that neighbor nodes with high Rho values are visited first during iteration within the same Reduce call.
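Steps 32 to 34 together amount to a secondary sort. The following sketch simulates them in memory, assuming keys of the form (node, Rho(b)) with string node identifiers; `partition_of` and `shuffle` are hypothetical names, and the CRC32-based partitioning is only a stand-in for whatever partitioner a real job would configure.

```python
import zlib
from itertools import groupby


def partition_of(key, num_partitions):
    """Step 32: partition on the node field only, ignoring Rho(b)."""
    node, _rho_b = key
    return zlib.crc32(str(node).encode("utf-8")) % num_partitions


def shuffle(pairs, num_partitions):
    """Steps 32-34: return {partition: [(node, values), ...]} where the
    values of each node are ordered by neighbor Rho, highest first."""
    partitions = {}
    for key, value in pairs:
        partitions.setdefault(partition_of(key, num_partitions), []).append((key, value))
    result = {}
    for p, items in partitions.items():
        # Step 34: compare on the node first, then on Rho(b) descending.
        items.sort(key=lambda kv: (kv[0][0], -kv[0][1]))
        # Step 33: group on the node field only.
        result[p] = [
            (node, [value for _key, value in group])
            for node, group in groupby(items, key=lambda kv: kv[0][0])
        ]
    return result
```

Because partitioning and grouping look only at the node field, all pairs for one node reach the same Reduce call even though their full keys differ.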

After the above steps, each Reduce call can traverse the information of a node and all its neighbors by iterating over the values. At this point, the Delta of the node can be obtained and, combined with the threshold R_T and the threshold D_T supplied as input parameters, the information required for class identification can be generated.
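The per-node Reduce logic can be sketched as follows, using the threshold form of the predetermined rule and the definition of Delta given in the claims (the shortest edge to a neighbor with higher Rho, or the longest adjacent edge if no such neighbor exists). The name `reduce_step35` and the returned tuple layout are hypothetical. One case is not covered by the text: a node that is not a center and has no higher-Rho neighbor; this sketch assumes it keeps its own identifier.

```python
def reduce_step35(node, values, r_t, d_t):
    """Return (node, rho, delta, pointer) for one node.

    `values` holds (rho_b, rho_a, b, length) tuples for every neighbor b.
    `pointer` is the node itself for a class center, otherwise the nearest
    neighbor with higher Rho, whose class identifier the node inherits.
    Isolated nodes have no values and never reach this function.
    """
    values = list(values)
    rho_a = values[0][1]
    higher = [(length, b) for rho_b, _rho_a, b, length in values if rho_b > rho_a]
    if higher:
        # Shortest edge among neighbors with a higher Rho.
        delta, nearest = min(higher)
    else:
        # No denser neighbor: Delta is the longest adjacent edge.
        delta = max(length for _rho_b, _rho_a, _b, length in values)
        nearest = None
    is_center = rho_a > r_t and delta > d_t
    if is_center or nearest is None:  # the second case is an assumption
        return node, rho_a, delta, node
    return node, rho_a, delta, nearest
```

The sketch scans all values, so it does not depend on the optional descending sort of step 34; with that sort in place, a reducer could stop scanning for denser neighbors at the first value whose Rho is not higher than the node's own.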

In the preferred embodiment, the Map process of step 30 is implemented on a native MapReduce scheme, but in practice the process can be accelerated with common database techniques. For example, when the Reduce job in step 20 produces its output, the Rho value of each node is stored in a relational database or a key-value (K-V) database. The Map job of step 30 then only needs to query the Rho value of each neighbor node and no longer needs the custom InputFormat; that is, the Cartesian product operation is no longer required, and the Rho value of a neighbor node can be obtained by accessing the database directly in the Map stage.
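The database shortcut can be illustrated with Python's built-in `sqlite3` module standing in for the relational or K-V store. The table name `node_rho`, the column names, and the helper functions are assumptions made for this sketch; a production job would use whatever shared store the cluster provides.

```python
import sqlite3


def store_rho(conn, rho_by_node):
    """Persist the Rho of each node, as the step 20 Reduce job would."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_rho (node TEXT PRIMARY KEY, rho INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO node_rho (node, rho) VALUES (?, ?)",
        list(rho_by_node.items()),
    )
    conn.commit()


def lookup_rho(conn, node):
    """Fetch the Rho of a neighbor node during the step 30 Map stage."""
    row = conn.execute(
        "SELECT rho FROM node_rho WHERE node = ?", (node,)
    ).fetchone()
    return None if row is None else row[0]
```

Each lookup replaces one row of the Cartesian product, trading a full join for a point query per neighbor.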

In the virtual person establishment method on the MapReduce platform of the present invention, analysis of the behavior log yields the practical result of "which accounts are operated by the same person". In real system requirements, the actual user of an account is often more meaningful than its registered owner, and this approach also reduces deviations in the account attribution result caused by untrue values of key fields such as the identity number. Using behavior logs for analysis increases the applicability of the entire system: only account identification is required, and specific account attributes are not necessarily needed. Given the characteristics of the behavior log and the reduced complexity described above, the present invention can be better applied to environments with a wider scope, a longer time span, and a larger data volume. In fact, the wider the scope of data collection, the longer the time span, and the greater the amount of data, the higher the actual accuracy of the system. On the basis of the foregoing analysis of the behavior log, the present invention can further describe attribute information of the virtual person, such as name and address, by combining additional data such as account attributes.

Various other changes and modifications can be made on the basis of the above in accordance with the technical solutions and technical concept of the present invention, and all such changes and modifications shall fall within the scope of protection of the appended claims.

Claims

A method for establishing a virtual person on a MapReduce platform, characterized by comprising:

Step 1: Extract, from the behavior log, the accounts and the login time and login terminal information corresponding to each account;

Step 2: Calculate the similarity between the accounts according to the co-occurrence between the accounts, construct a connected graph in which each account is represented by a node, and represent the similarity between accounts by the length of the edge between the corresponding nodes, wherein the shorter the edge between two nodes, the higher the similarity between the accounts they represent;

Step 3: Cluster the nodes in the connected graph based on the MapReduce platform, and establish a virtual person according to the clustering result.
The method for establishing a virtual person on the MapReduce platform according to claim 1, wherein the step 3 comprises:

Step 20: Using the information of the nodes and edges in the connected graph as input data, generate key-value pairs including the node and neighbor information through a Map job, and generate, through a Reduce job, an output including the node, the local density Rho of the node, and all neighbor information of the node, where Rho is defined as the number of adjacent edges whose length is lower than a predefined value Dc;

Step 30: From the output of the Reduce job in step 20, generate key-value pairs including the node, the node Rho, the neighbor node Rho, and neighbor information through a Map job; for each node, traverse the node Rho, all neighbor node Rho values, and all neighbor information through a Reduce job, and obtain the dispersion Delta of each node, where Delta is defined as the length of the shortest adjacent edge among those connecting the node to neighbor nodes with a higher Rho value, or, if no such neighbor node exists, the length of the longest adjacent edge of the node; class identification is then performed in combination with a predetermined rule;

Step 40: Each node of the same class together constitutes a virtual person.
The method for establishing a virtual person on a MapReduce platform according to claim 2, wherein the predetermined rule comprises:

If the Rho and the Delta of the node are respectively higher than the threshold R_T and the threshold D_T supplied as input parameters, the node is the center of a class and the class identifier of the node is its own identifier; otherwise, the node takes the class identifier of the nearest neighbor node that has a higher Rho;

The class ID of an isolated node is its own class identifier.
The method for establishing a virtual person on a MapReduce platform according to claim 2, wherein the predetermined rule comprises:

Pre-dividing possible value intervals for the Rho value and corresponding possible value intervals for the Delta value; if the Rho value of the node belongs to a Rho value interval and the Delta value of the node belongs to the corresponding Delta value interval, the node is the center of a class and the class identifier of the node is its own identifier; otherwise, the node takes the class identifier of the nearest neighbor node that has a higher Rho;

The class ID of an isolated node is its own class identifier.
The method for establishing a virtual person on the MapReduce platform according to claim 2, wherein the output of the Reduce job in step 20 is stored in a relational database or a key value database.
The method for establishing a virtual person on the MapReduce platform according to claim 2, wherein in the Map job in step 30, the traversal of the neighbor node Rho is implemented by performing a Cartesian product on the output of the Reduce job in step 20.
The method for establishing a virtual person on the MapReduce platform according to claim 2, wherein the step 20 includes:

Step 21: Using the information of the nodes and edges in the connected graph as input data, generate key-value pairs via a Map job, wherein the key includes a field identifying the node, and the value includes a field identifying the neighbor node and a field identifying the length of the adjacent edge between the node and the neighbor node;

Step 22: Partition the key-value pairs according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same partition;

Step 23: Group the key-value pairs in the same partition according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same group;

Step 25: Via the Reduce job, traverse all the neighbors of the same node by iterating over the values of the key-value pairs belonging to the same group, and generate an output including the node, the local density Rho of the node, and all neighbor information of the node.
The method for establishing a virtual person on the MapReduce platform according to claim 7, wherein the step 20 further comprises:

In step 21, the key further includes a field identifying the length of the adjacent edge between the node and the neighbor node;

Step 24: Sort the key-value pairs belonging to the same group according to the lengths of the adjacent edges included in the key.
The method for establishing a virtual person on the MapReduce platform of claim 2, wherein the step 30 comprises:

Step 31: Generate key-value pairs from the output of the Reduce job in step 20 via a Map job, where the key includes a field identifying the node, and the value includes a field identifying the neighbor node, a field identifying the length of the adjacent edge between the node and the neighbor node, a field identifying the neighbor node Rho, and a field identifying the node Rho;

Step 32: Partition the key-value pairs according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same partition;

Step 33: Group the key-value pairs in the same partition according to the node included in the key, so that key-value pairs whose keys include the same node are allocated to the same group;

Step 35: Via the Reduce job, for each node, traverse the node Rho, all neighbor node Rho values, and all neighbor information by iterating over the values of the key-value pairs belonging to the same group, obtain the dispersion Delta of each node, and perform class identification in combination with the predetermined rule.
The method for establishing a virtual person on the MapReduce platform according to claim 9, wherein the step 30 further comprises:

In step 31, the key further includes a field identifying the neighbor node Rho;

Step 34: Sort the key-value pairs belonging to the same group according to the neighbor node Rho included in the key.