CN107330020A

CN107330020A - User entity resolution method based on structure and attribute similarity

Info

Publication number: CN107330020A
Application number: CN201710470266.6A
Authority: CN
Inventors: 徐杰; 刘震; 卢思变; 陈文龙
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-06-20
Filing date: 2017-06-20
Publication date: 2017-11-07
Anticipated expiration: 2037-06-20
Also published as: CN107330020B

Abstract

The invention discloses a user entity resolution method based on structure and attribute similarity. By analyzing and modeling social networks, the method combines the friend relationships and the personal profile data of users in a social network, i.e. structural information and attribute information, to resolve user entities across social platforms. A dynamic threshold is introduced into the entity resolution process: different thresholds are used at different stages of the iteration to adapt to the characteristics of the data at that stage and to regulate the relative weight of attributes and structure, so as to obtain more accurate results.

Description

User entity resolution method based on structure and attribute similarity

Technical field

The invention belongs to the technical field of entity resolution and, more specifically, relates to a user entity resolution method based on structure and attribute similarity.

Background art

In a data set, the real-world object that a piece of data refers to is commonly called an entity. The same entity may appear in several different representations or descriptions, across different data sets or even within a single one. When data sets from multiple sources are merged for analysis and processing, the descriptions of the same entity become mixed together, causing a certain degree of duplication. Entity resolution is the process of identifying and linking the different descriptions in the data and determining which of them map to the same real-world entity. It is an important step in data preprocessing, used mainly to solve data quality problems such as duplication and redundancy.

With the rapid development of social networks, the application of entity resolution to social networks has gradually attracted attention. Most social network users do not use just one social network but use several at once, according to their interests and needs. Because the information on different social platforms is isolated and not shared, there is no direct way to identify the virtual identities of the same user on different platforms. The cross-platform entity resolution problem for social networks is to match and identify the accounts on different social platforms that belong to the same user entity, also known as user identification or account matching. Matching account identities makes it possible to offer personalized services to users and also helps to solve some security problems of social networks.

The concept of entity resolution was first put forward in 1959, in an article published in Science by Newcombe et al., who regarded entity resolution as a statistical problem and described it from the perspective of probability. Ten years later, in 1969, Fellegi and Sunter gave the first standardized, formal treatment of the problem. They treated it as a classification problem of the kind studied in machine learning, fixed a series of notations and definitions for entity resolution in their article, and established the classic Fellegi-Sunter model. In subsequent research many researchers have improved and supplemented the Fellegi-Sunter model, chiefly Jaro, Winkler, Belin and Rubin, Ravikumar, Larsen, and Sadinle. Among them, Winkler did a great deal of work, using Bayesian statistical models to make a series of improvements to the parameter calculation and matching rules of the Fellegi-Sunter model.

Research on entity resolution for social networks has mainly developed in recent years. Most researchers focus on three aspects of a social network: attributes, structure, and social content. Attributes are the personal information of a user, such as avatar, username, gender, birthday, education background, and location. Structure is the friend relationships between accounts in the social network. Social content is the information, such as text and pictures, that users produce in their social activities, for example blog posts, comments, and geographic locations.

Attribute-based algorithms mainly use the personal profile information of users in a social network. Each item of descriptive information is treated as an attribute of the user, which turns the problem into one of matching attribute fields. For example, Zafarani and Liu proposed a user entity resolution algorithm that uses usernames and the URLs of users' personal home pages, and Goga et al. proposed an algorithm suited to large-scale identification. Structure-based algorithms mainly use the friend relationship information of a social network: the network is abstracted as a graph, and user entity resolution is carried out using information about the graph structure. Narayanan and Shmatikov [13] and Bartunov et al. studied algorithms of this kind. Content-based algorithms resolve user entities by analyzing writing style together with information such as time and geographic location. For example, Almishari and Tsudik proposed a method that identifies users on different social platforms by analyzing the author's writing style, and Goga et al. proposed identifying users by combining the geolocation and timestamp of posted content with its writing style.

For algorithms that focus on attribute information, the personal profiles of accounts on social platforms are to some degree missing or inaccurate. Such abnormal data degrades algorithm performance, and because the effect originates in the data itself it is very hard to remove. Algorithms that start from structure avoid inaccurate information, but within a small group the accounts may be almost fully connected to one another, and distinguishing between those accounts then becomes a very difficult problem. Structure-based algorithms therefore struggle to work well when friend relationships are very dense. For content-based methods, the relevant data are hard to obtain and hard to process, so the methods are inconvenient to use. The method proposed by the invention combines the two kinds of information, attributes and structure, and avoids the defects of each method as far as possible.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a user entity resolution method based on structure and attribute similarity, which combines attribute information and structural information to solve the cross-platform user entity resolution problem for social networks.

To achieve the above object, the user entity resolution method based on structure and attribute similarity of the invention is characterized in that it comprises the following steps:

(1) Establish the attribute similarity matrix and the adjacency matrices

According to the attribute similarity between each account on social platform A and each account on social platform B, build the attribute similarity matrix $S^{m \times n}$, where $m$ and $n$ are the numbers of accounts on platforms A and B respectively, and each element of $S^{m \times n}$ is the attribute similarity between the two corresponding accounts;

According to whether each pair of accounts on social platform A, and each pair of accounts on social platform B, are friends, establish an adjacency matrix for each platform. Each row and each column of an adjacency matrix represents one account on that platform, and each element indicates whether the two corresponding accounts on that platform are friends: the element is 1 if they are friends and 0 if they are not;

(2) Establish the incidence matrices

According to the adjacency matrices and the prior matching pairs, establish for social platform A and social platform B the incidence matrices between unidentified accounts and identified accounts, $R_A^{(m-\tau) \times \tau}$ and $R_B^{(n-\tau) \times \tau}$, where $\tau$ is the number of prior matching pairs. Each row of an incidence matrix represents an unidentified account and each column represents an identified account. Each element indicates whether the unidentified account and the identified account are friends: the element is 1 if they are friends and 0 if they are not;

(3) Establish the common friend matrix

According to the incidence matrices $R_A^{(m-\tau) \times \tau}$ and $R_B^{(n-\tau) \times \tau}$ and the prior matching pairs, establish the common friend matrix of the unidentified accounts on social platform A and social platform B:

$$F^{(m-\tau) \times (n-\tau)} = R_A^{(m-\tau) \times \tau} \times \left(R_B^{(n-\tau) \times \tau}\right)^{T}$$

where $(\cdot)^T$ denotes transposition, each row of the common friend matrix represents an unidentified account $v_i^A$ on social platform A, each column represents an unidentified account $v_j^B$ on social platform B, and the element $f_{ij}$ is the number of common friends that $v_i^A$ and $v_j^B$ have among the prior matching pairs;

(4) From the common friend matrix, select the account pairs, each made up of two unidentified accounts, that correspond to the largest non-zero element, and store them in the account pair set $Q$, $Q = \{(i, j) \mid f_{ij} = \max(F^{(m-\tau) \times (n-\tau)})\}$;

(5) From the attribute similarity matrix $S^{m \times n}$, take the attribute similarity of every account pair in the set $Q$ and store it in the similarity set $S^*$, $S^* = \{s_{ij} \mid s_{ij} \in S^{m \times n}, (i, j) \in Q\}$;

(6) According to a preset initial threshold, delete the elements of the similarity set $S^*$ that are smaller than the initial threshold, and delete the corresponding elements from the account pair set $Q$;

(7) Judge whether the account pair set $Q$ is empty. If it is empty, set the largest non-zero element of the common friend matrix to 0 and return to step (4); if it is not empty, go to step (8);

(8) Take the largest element $\max(S^*)$ of the similarity set $S^*$ and select the account pair $(i, j)$ in $Q$ that corresponds to it; mark the corresponding accounts $v_i^A$ and $v_j^B$ as successfully matched and add the pair to the result set $M$ of the current round of iteration;

(9) Delete from $Q$ the account pair $(i, j)$ that was added to $M$, together with every account pair that shares an account with $(i, j)$, and delete the corresponding elements from $S^*$;

(10) Judge whether any elements remain in $Q$. If so, return to step (8); if not, output the result set $M$;

(11) Add the account pairs in the result set $M$ to the prior matching pairs and return to step (2) for the next iteration of the current round. The current round ends when an iteration adds no new matching pair to $M$;

(12) Change the initial threshold and return to step (2) for the next round of iteration. The iteration ends, and user entity resolution is complete, when no new matching pair is added to the result set $M$ even after the threshold has been changed.
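Steps (2) to (11) can be illustrated with a minimal sketch, assuming NumPy arrays for the similarity and adjacency matrices and 0-based account indices. The function names `one_pass` and `run_round` and the toy data are hypothetical and are not part of the disclosed method. The sketch covers a single round at a fixed threshold and leaves out the threshold change of step (12).

```python
import numpy as np

def one_pass(sim, adj_a, adj_b, seeds, threshold):
    """One iteration of steps (2)-(10); returns new matches as (i, j) pairs."""
    seed_a = [i for i, _ in seeds]
    seed_b = [j for _, j in seeds]
    rest_a = [i for i in range(adj_a.shape[0]) if i not in seed_a]
    rest_b = [j for j in range(adj_b.shape[0]) if j not in seed_b]
    if not rest_a or not rest_b:
        return []
    # Step (2): incidence matrices (unidentified x identified). Columns follow
    # the order of the seed pairs, so column k refers to the same entity in both.
    r_a = adj_a[np.ix_(rest_a, seed_a)]
    r_b = adj_b[np.ix_(rest_b, seed_b)]
    # Step (3): common friend counts among the seed pairs.
    f = r_a @ r_b.T
    while f.max() > 0:
        # Step (4): candidate pairs holding the largest non-zero count.
        best = f.max()
        q = [(int(r), int(c)) for r, c in zip(*np.nonzero(f == best))]
        # Steps (5)-(6): look up similarities, drop those below the threshold.
        s = {(r, c): sim[rest_a[r], rest_b[c]] for r, c in q}
        s = {p: v for p, v in s.items() if v >= threshold}
        # Step (7): nothing survived, so zero the maximum and try again.
        if not s:
            f[f == best] = 0
            continue
        # Steps (8)-(10): greedily accept the most similar pair, then remove
        # every remaining pair that shares an account with it.
        matches = []
        while s:
            r, c = max(s, key=s.get)
            matches.append((rest_a[r], rest_b[c]))
            s = {p: v for p, v in s.items() if p[0] != r and p[1] != c}
        return matches
    return []

def run_round(sim, adj_a, adj_b, seeds, threshold):
    """Step (11): feed matches back as seeds until an iteration finds nothing."""
    seeds, found = list(seeds), []
    while True:
        new = one_pass(sim, adj_a, adj_b, seeds, threshold)
        if not new:
            return found
        found.extend(new)
        seeds.extend(new)

# Toy data (hypothetical): two platforms with identical friend structure.
adj = np.zeros((4, 4), dtype=int)
for a, b in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    adj[a, b] = adj[b, a] = 1
sim = np.full((4, 4), 0.1)
np.fill_diagonal(sim, 0.9)
found = run_round(sim, adj, adj.copy(), seeds=[(0, 0)], threshold=0.8)
print(sorted(found))
```

In this toy run the first iteration can only reach accounts 1 and 2, which are friends of the seed; account 3 becomes reachable in the second iteration, once (2, 2) has been fed back as a seed.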

The object of the invention is achieved as follows:

The user entity resolution method based on structure and attribute similarity of the invention, by analyzing and modeling social networks, combines the friend relationships and the personal profile data of users in a social network, i.e. structural information and attribute information, to resolve user entities across social platforms. A dynamic threshold is introduced into the entity resolution process: different thresholds are used at different stages of the iteration to adapt to the characteristics of the data at that stage and to regulate the relative weight of attributes and structure, so as to obtain more accurate results.

In addition, the user entity resolution method based on structure and attribute similarity of the invention has the following beneficial effects:

(1) It combines the two kinds of information, attributes and structure, and so avoids the adverse effect that the defects of any single kind of information have on the accuracy of the results, such as the effect of missing attributes and the effect of dense friend relationships.

(2) It introduces a dynamic threshold. During the iteration the attribute similarity threshold is not constant; it changes gradually within a certain range as more results are produced, adapting to the characteristics of the different stages of the iteration. The threshold starts at its upper bound, which yields the most accurate results, and then gradually decreases so as to admit more true matching pairs whose attribute similarity is not as high.

(3) It uses the prior matching pairs as the starting point of the iteration, and every iteration feeds its results back into the known conditions. It does not need a large quantity of known conditions or training data to build a model; a small number of true matching pairs is enough to carry it out, which avoids the problem of insufficient known conditions.

Brief description of the drawings

Fig. 1 is a flow chart of the user entity resolution method based on structure and attribute similarity of the invention;

Fig. 2 shows the friend relationship structure of the two social platforms in the example.

Detailed description

Specific embodiments of the invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note in particular that, in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the main content of the invention.

Example

Fig. 1 is a flow chart of the user entity resolution method based on structure and attribute similarity of the invention.

In this example, some terms are defined first:

A social platform is modeled as an undirected graph in which accounts correspond to nodes and friend relationships between accounts correspond to edges between nodes, i.e. $G = \{V, E\}$, where $G$ is the social platform, $V$ is the set of accounts on the platform, and $E$ is the set of friend relationships on the platform. Friend relationships on social platforms are of two types, one-way and two-way. In theory a platform with one-way relationships should be abstracted as a directed graph. In this algorithm, however, friend relationships are a very important basis for user entity resolution, and accounts linked only by a one-way follow are not close enough to reflect the real friendships of the users who own them. On a platform with one-way connections, therefore, the algorithm considers only accounts that follow each other, treats that relationship as equivalent to a two-way friend relationship, and still models the platform as an undirected graph.
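The reduction of a one-way (follow) platform to an undirected graph can be sketched as follows. This is an illustrative sketch only: the function name `mutual_adjacency`, the 0-based indices, and the sample follow list are assumptions, not part of the original text.

```python
import numpy as np

def mutual_adjacency(num_accounts, follows):
    """Build a symmetric 0/1 adjacency matrix that keeps only mutual follows.

    `follows` is an iterable of directed (follower, followee) index pairs.
    """
    directed = np.zeros((num_accounts, num_accounts), dtype=int)
    for src, dst in follows:
        if src != dst:  # ignore self-follows
            directed[src, dst] = 1
    # An undirected edge exists only where both directions are present.
    return directed * directed.T

# Hypothetical data: 0 and 1 follow each other, 2 and 3 follow each other,
# and 1 follows 2 without being followed back.
follows = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]
adj = mutual_adjacency(4, follows)
print(adj)
```

The one-way follow from account 1 to account 2 produces no edge, so the resulting matrix is symmetric and can be used directly as the adjacency matrix of the undirected graph.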

The items of personal information that each account on a social platform has are collectively called the attributes of the node. Each attribute is one item of the account's profile data, such as username, gender, or age. $C$ denotes the set of attributes, $C = (C_1, C_2, C_3, \ldots)$, where $C_i$ is the name of an attribute.
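The text does not specify how attribute similarity is computed. Purely as an illustration, the sketch below assumes a simple measure: the fraction of attributes, among those filled in on both profiles, whose values agree after case and whitespace normalization. The function names, the dictionary representation of a profile, and the sample profiles are all hypothetical.

```python
def attribute_similarity(profile_a, profile_b, attributes):
    """Share of attributes present on both profiles whose values agree.

    Assumed measure for illustration only; missing values (None or absent)
    are skipped rather than counted as disagreement.
    """
    compared = 0
    agreed = 0
    for name in attributes:
        a, b = profile_a.get(name), profile_b.get(name)
        if a is None or b is None:
            continue
        compared += 1
        if str(a).strip().lower() == str(b).strip().lower():
            agreed += 1
    return agreed / compared if compared else 0.0

def similarity_matrix(profiles_a, profiles_b, attributes):
    """m x n matrix of similarities between every account on A and on B."""
    return [[attribute_similarity(pa, pb, attributes) for pb in profiles_b]
            for pa in profiles_a]

C = ("username", "gender", "age")  # attribute names
profiles_a = [{"username": "Alice", "gender": "F", "age": 30}]
profiles_b = [{"username": "alice", "gender": "F", "age": None},
              {"username": "bob", "gender": "M", "age": 30}]
S = similarity_matrix(profiles_a, profiles_b, C)
print(S)
```

Skipping missing values is one way to limit the effect of incomplete profiles; any other per-attribute comparison could be substituted without changing how the matrix is used later.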

The real-world owner of an account on a social networking site, i.e. the person who uses the account, is defined as a user entity. The set of user entities is denoted $U$, $U = (u_1, u_2, u_3, \ldots)$.

Suppose the two social platforms are A and B, and $v_i^A$ and $v_j^B$ are two accounts belonging to platform A and platform B respectively. If they refer to the same entity, i.e. they are owned by the same person in the real world, then $v_i^A$ and $v_j^B$ are said to match, written $M_{A,B}(i, j)$; if they do not match, this is written $UM_{A,B}(i, j)$. If the user entity to which the matched accounts correspond in the real world is $u_k$, the match can be expressed as:

Before accounts are matched, a number of correctly matched account pairs generally need to be known in advance. These are commonly called prior account matching pairs, or seed matching pairs. In actual user entity resolution, obtaining prior account matching pairs is a relatively difficult problem. Apart from manually labeling some prior matching pairs, the main approach is to find a unique identifier that determines the user entity, such as a bound e-mail account or an IP address, but information of this kind is generally hard to obtain, so an alternative method needs to be considered.

Where no information directly determines the prior matching pairs, the attribute similarity of accounts can be used to obtain them. The algorithm here does not require many prior matching pairs. Attribute similarity can therefore be used first to select the account pairs with the highest similarity, and among these the account pairs with the most friends are taken as the prior matching pairs, which ensures that they occupy important positions in the network; the algorithm is then executed. Matching pairs selected in this way cannot be guaranteed to refer to the same entity, but they go some way toward solving the problem that prior matching pairs are hard to obtain.
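The seed selection described above can be sketched as follows, under stated assumptions. The text does not say how many candidate pairs to keep, how the friend count of a pair is measured, or whether an account may appear in more than one seed; the sketch assumes a fixed candidate count, ranks pairs by the sum of the two accounts' friend counts, and uses each account at most once. The function name `select_seed_pairs` and the sample data are hypothetical.

```python
import numpy as np

def select_seed_pairs(sim, adj_a, adj_b, num_candidates, num_seeds):
    """Pick seed pairs: most similar pairs first, then the best connected."""
    sim = np.asarray(sim, dtype=float)
    deg_a = np.asarray(adj_a).sum(axis=1)  # friend count per account on A
    deg_b = np.asarray(adj_b).sum(axis=1)  # friend count per account on B
    # Candidate pairs: the `num_candidates` cells with the highest similarity.
    flat_order = np.argsort(sim, axis=None)[::-1][:num_candidates]
    candidates = [tuple(int(x) for x in np.unravel_index(k, sim.shape))
                  for k in flat_order]
    # Assumption: rank candidates by the combined friend count of the pair.
    candidates.sort(key=lambda p: deg_a[p[0]] + deg_b[p[1]], reverse=True)
    seeds, used_a, used_b = [], set(), set()
    for i, j in candidates:
        if i in used_a or j in used_b:
            continue  # assumption: each account appears in at most one seed
        seeds.append((i, j))
        used_a.add(i)
        used_b.add(j)
        if len(seeds) == num_seeds:
            break
    return seeds

sim = [[0.9, 0.1, 0.2],
       [0.1, 0.8, 0.3],
       [0.2, 0.3, 0.7]]
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])  # account 1 has two friends, the others one
seeds = select_seed_pairs(sim, path, path, num_candidates=3, num_seeds=2)
print(seeds)
```

In the sample data the pair (1, 1) is chosen first even though (0, 0) is more similar, because both of its accounts have the most friends among the candidates.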

The user entity resolution method based on structure and attribute similarity of the invention is described in detail below with reference to Fig. 1. It comprises the following steps:

S1. Establish the attribute similarity matrix and the adjacency matrices

First, according to the attribute similarity between each account on social platform A and each account on social platform B, build the attribute similarity matrix $S^{m \times n}$; in this example $m$ and $n$ are both 7. The part of the attribute similarity matrix that remains after the prior matching pairs are removed is given directly here in tabular form, as shown in Table 1.

Table 1 is the main part of the attribute similarity matrix.

Table 1

Next, according to the structural relationships of the two platforms shown in Fig. 2(a) and Fig. 2(b), establish the adjacency matrices of platforms A and B respectively:

S2. Establish the incidence matrices

According to the adjacency matrices and the prior matching pairs, establish for social platform A and social platform B the incidence matrices between unidentified accounts and identified accounts. In this example there are two prior matching pairs, (1, 1) and (2, 2), shown as solid nodes in Fig. 2. The incidence matrices are then, respectively:

S3. Establish the common friend matrix

According to the incidence matrices and the prior matching pairs, establish the common friend matrix of the unidentified accounts on social platform A and social platform B;

S4. From the common friend matrix, select the account pairs, each made up of two unidentified accounts, that correspond to the largest non-zero element, and store them in the account pair set $Q$, $Q = \{(i, j) \mid f_{ij} = \max(F^{(m-\tau) \times (n-\tau)})\}$;

S5. From the attribute similarity matrix $S^{m \times n}$, take the attribute similarity of every account pair in the set $Q$ and store it in the similarity set $S^*$, $S^* = \{s_{ij} \mid s_{ij} \in S^{m \times n}, (i, j) \in Q\}$;

S6. Set the initial threshold. In this example the upper and lower bounds of the threshold are set to 0.8 and 0.2 respectively, so the initial threshold is 0.8. Delete the elements of the similarity set $S^*$ that are smaller than the initial threshold and delete the corresponding elements from the account pair set $Q$; now $Q = \{(3, 3), (4, 4)\}$ and $S^* = \{0.85, 1\}$;

S7. Judge whether the account pair set $Q$ is empty. If it is empty, set the largest non-zero element of the common friend matrix to 0 and return to step S4; if it is not empty, go to step S8;

S8. Take the largest element $\max(S^*)$ of the similarity set $S^*$, i.e. 1, and select the account pair (4, 4) in $Q$ that corresponds to it; mark the corresponding accounts $v_4^A$ and $v_4^B$ as successfully matched and add the pair to the result set $M$ of the current round of iteration;

S9. Delete from $Q$ the account pair (4, 4) that was added to $M$, together with every account pair that shares an account with (4, 4), and delete the corresponding elements from $S^*$; now $Q = \{(3, 3)\}$ and $S^* = \{0.85\}$;

S10. Judge whether any elements remain in $Q$. If so, return to step S8 and repeat; if not, the current iteration ends with (3, 3) and (4, 4) added to the result set $M$, and $M$ is output;

S11. Add the account pairs in the result set $M$ to the prior matching pairs, return to step S2, rebuild the incidence matrices and the common friend matrix, and carry out the next iteration of the current round by repeating the steps above. When no new result is produced, the current round ends; at this point three account pairs, (3, 3), (4, 4), and (5, 5), have been added to the result set $M$;

S12. Change the initial threshold and return to step S2 for the next round of iteration;

The threshold is changed according to the formula:

$$th = th_u - \frac{|M_c|}{\min(N_A, N_B) - \tau} \times (th_u - th_l)$$

where $th$ is the changed threshold, $th_u$ and $th_l$ are the upper and lower bounds of the initial threshold, $|M_c|$ is the number of matching pairs in the current result set $M$, $\min(N_A, N_B)$ is the smaller of the account counts of social platforms A and B, and $\tau$ is the number of prior matching pairs.

According to the formula, the threshold used in the next round of iteration is:

$$th = 0.8 - \frac{3}{7 - 2} \times (0.8 - 0.2) = 0.44$$

The three account pairs in $M$ are added to the prior matching pairs, and the method returns to step S2 and performs the subsequent steps with the new threshold.

The second round of iteration adds the account pair (6, 6) to the result set, and a further iteration produces no result. The threshold is then changed to 0.32 and a new round of iteration is performed, which still produces no result. The iteration then ends, and the final account matching results are the four pairs (3, 3), (4, 4), (5, 5), and (6, 6).
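The threshold schedule of this example can be checked with a short calculation that applies the update formula of claim 2 to the values given above (bounds 0.8 and 0.2, 7 accounts per platform, 2 prior matching pairs). The function and parameter names are illustrative, and the sketch assumes that $\tau$ stays at the original number of prior pairs while $|M_c|$ counts all matches found so far, which is the reading consistent with the 0.32 quoted in the example.

```python
def next_threshold(th_upper, th_lower, num_matched, n_a, n_b, num_seeds):
    """th = th_u - |M_c| / (min(N_A, N_B) - tau) * (th_u - th_l)."""
    return th_upper - num_matched / (min(n_a, n_b) - num_seeds) * (th_upper - th_lower)

# After round 1, three pairs have been matched: (3,3), (4,4), (5,5).
th_round2 = next_threshold(0.8, 0.2, num_matched=3, n_a=7, n_b=7, num_seeds=2)
# After round 2, (6,6) has been added, giving four matched pairs.
th_round3 = next_threshold(0.8, 0.2, num_matched=4, n_a=7, n_b=7, num_seeds=2)
print(round(th_round2, 2), round(th_round3, 2))
```

When every remaining account has been matched, $|M_c| = \min(N_A, N_B) - \tau$ and the threshold reaches its lower bound, so the schedule moves from $th_u$ to $th_l$ as matches accumulate.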

Although illustrative embodiments of the invention have been described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, variations are obvious as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims

1. A user entity resolution method based on structure and attribute similarity, characterized in that it comprises the following steps:

(1) Establish the attribute similarity matrix and the adjacency matrices

According to the attribute similarity between each account on social platform A and each account on social platform B, build the attribute similarity matrix $S^{m \times n}$, where $m$ and $n$ are the numbers of accounts on platforms A and B respectively, and each element of $S^{m \times n}$ is the attribute similarity between the two corresponding accounts;

According to whether each pair of accounts on social platform A, and each pair of accounts on social platform B, are friends, establish an adjacency matrix for each platform, wherein each row and each column of an adjacency matrix represents one account on that platform, and each element indicates whether the two corresponding accounts on that platform are friends: the element is 1 if they are friends and 0 if they are not;

(2) Establish the incidence matrices

According to the adjacency matrices and the prior matching pairs, establish for social platform A and social platform B the incidence matrices between unidentified accounts and identified accounts, $R_A^{(m-\tau) \times \tau}$ and $R_B^{(n-\tau) \times \tau}$, where $\tau$ is the number of prior matching pairs, each row of an incidence matrix represents an unidentified account, each column represents an identified account, and each element indicates whether the unidentified account and the identified account are friends: the element is 1 if they are friends and 0 if they are not;

(3) Establish the common friend matrix

$$F^{(m-\tau) \times (n-\tau)} = R_A^{(m-\tau) \times \tau} \times \left(R_B^{(n-\tau) \times \tau}\right)^{T}$$

where $(\cdot)^T$ denotes transposition, each row of the common friend matrix represents an unidentified account $v_i^A$ on social platform A, each column represents an unidentified account $v_j^B$ on social platform B, and the element $f_{ij}$ is the number of common friends that $v_i^A$ and $v_j^B$ have among the prior matching pairs;

(4) From the common friend matrix, select the account pairs, each made up of two unidentified accounts, that correspond to the largest non-zero element, and store them in the account pair set $Q$, $Q = \{(i, j) \mid f_{ij} = \max(F^{(m-\tau) \times (n-\tau)})\}$;

(5) From the attribute similarity matrix $S^{m \times n}$, take the attribute similarity of every account pair in the set $Q$ and store it in the similarity set $S^*$, $S^* = \{s_{ij} \mid s_{ij} \in S^{m \times n}, (i, j) \in Q\}$;

(6) According to a preset initial threshold, delete the elements of the similarity set $S^*$ that are smaller than the initial threshold, and delete the corresponding elements from the account pair set $Q$;

(7) Judge whether the account pair set $Q$ is empty; if it is empty, set the largest non-zero element of the common friend matrix to 0 and return to step (4); if it is not empty, go to step (8);

(8) Take the largest element $\max(S^*)$ of the similarity set $S^*$ and select the account pair $(i, j)$ in $Q$ that corresponds to it; mark the corresponding accounts $v_i^A$ and $v_j^B$ as successfully matched and add the pair to the result set $M$ of the current round of iteration;

(9) Delete from $Q$ the account pair $(i, j)$ that was added to $M$, together with every account pair that shares an account with $(i, j)$, and delete the corresponding elements from $S^*$;

(10) Judge whether any elements remain in $Q$; if so, return to step (8); if not, output the result set $M$;

(11) Add the account pairs in the result set $M$ to the prior matching pairs and return to step (2) for the next iteration of the current round; the current round ends when an iteration adds no new matching pair to $M$;

(12) Change the initial threshold and return to step (2) for the next round of iteration; the iteration ends, and user entity resolution is complete, when no new matching pair is added to the result set $M$ even after the threshold has been changed.

2. The user entity resolution method based on structure and attribute similarity according to claim 1, characterized in that the initial threshold is changed according to:

$$th = th_u - \frac{|M_c|}{\min(N_A, N_B) - \tau} \times (th_u - th_l)$$

where $th$ is the changed threshold, $th_u$ and $th_l$ are the upper and lower bounds of the initial threshold, $|M_c|$ is the number of matching pairs in the current result set $M$, $\min(N_A, N_B)$ is the smaller of the account counts of social platforms A and B, and $\tau$ is the number of prior matching pairs.