CN114118250B

CN114118250B - Cross-platform user identity recognition method based on activity similarity

Info

Publication number: CN114118250B
Application number: CN202111389814.5A
Authority: CN
Inventors: 李勇军; 黄丽蓉; 颜兆洁; 张银银
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2024-04-12
Anticipated expiration: 2041-11-22
Also published as: CN114118250A

Abstract

The invention discloses a cross-platform user identity recognition method based on activity similarity. First, time and semantic information in an activity track are combined to extract the user's activity pattern. Second, similarity scores between user activity patterns are calculated; to distinguish the importance of different point-of-interest types, each type is assigned a weight using the concept of inverse document frequency. Third, a point-of-interest embedding layer, analogous to word embedding in natural language processing, generates an embedded representation for each point of interest, and a vector representation of each user's activity pattern is then generated from the activity pattern and the point-of-interest embeddings. Finally, user activity similarity is calculated from these representations, and the most similar users are taken to share the same natural-person identity. The invention calculates user similarity in the semantic space, embeds users' activity habits into a low-dimensional space, and can efficiently find the best-matching user for any given user.

Description

Cross-platform user identity recognition method based on activity similarity

Technical Field

The invention relates to track-based methods for linking the identities of cross-platform users, and in particular to a method based on activity similarity.

Background

Cross-platform user linking plays a vital role in numerous applications, such as user interest recommendation and location prediction. Many studies approach this topic through user attributes and social networks, but these attributes and social network characteristics are inconsistent across service platforms. At the same time, some sensitive information cannot be obtained or used for identity analysis because of privacy concerns. Unlike platform-specific information, a user's spatiotemporal track can provide stable and consistent identity information. Because global positioning system tracking is so widespread, a mobile device records its spatiotemporal trajectory regardless of which service the user accesses. User identities can therefore be linked by analyzing the spatiotemporal locations of user activity.

Existing methods that link user identities through spatiotemporal tracks mainly identify users by matching their spatiotemporal records according to statistical characteristics. By introducing different weight allocation policies, they use the frequency of user visits, the distinct locations visited, and the popularity of "encounter" events to calculate the similarity between users. However, these methods cannot capture the dynamic patterns and semantic information of the track.

In the conventional track-based identity linking problem, the longitude and latitude of GPS fixes are generally used to represent track points, but these coordinates carry no semantic information about the track. Much effort has gone into modeling human movement, but most of it models spatiotemporal regularities. Human movement is typically modeled as a random process around a fixed point. The biggest drawback of this approach is that activity information is ignored: what a person is doing at a particular place and time is the purpose of the movement, and it is hidden behind the track. In recent years, some studies have added semantic information about track points to spatiotemporal tracks and thereby improved the accuracy of user identity linking. However, these methods use semantic information only as auxiliary information and do not fuse spatiotemporal and semantic information. Unlike the prior art, the present invention provides a new method that combines the time and semantic information of the user's track and learns the user's hidden activity habits.

Disclosure of Invention

In order to fully mine the semantic information in a user track, the invention provides a system based on representation learning that converts a point-of-interest track into a distinctive activity habit while preserving time and semantic information. First, time and semantic information in the activity track are combined to extract the user's activity pattern. Second, a similarity score between user activity patterns is calculated. To distinguish the importance of different point-of-interest types, the invention uses the concept of inverse document frequency to assign different weights to different types. Third, analogous to word embedding in natural language processing, a point-of-interest embedding layer is introduced to generate an embedded representation for each point of interest. Then, a vector representation of the user's activity pattern is generated from the activity pattern and the point-of-interest embeddings. Finally, user activity similarity is calculated from the generated representations, and the most similar users are taken to have the same natural-person identity.

(1) Extracting activity patterns of a user

The invention expresses the point-of-interest track of a user as $T(u) = \{p_1, p_2, \ldots, p_t\}$, where $p_t$ is the point-of-interest type of the address visited by the user at time point $t$ and $u$ denotes the user. Because user activity patterns are strongly periodic and predictable, the user's daily activities need to be analyzed, so the invention divides the user's point-of-interest track into sub-tracks $T_{sub}(u)$, each one day long. To better analyze the user's daily activity habits, the method divides the day into $m$ time partitions and counts the points of interest the user visits frequently in each partition. Each count records that user $u_i$ visited point of interest $p_t$ in the $j$-th time period $n_t$ times. The per-partition counts together form the user's daily activity pattern, denoted $L$.
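The extraction step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `extract_daily_patterns`, the input layout (a list of timestamped POI types), and the default values of `m` and `k` are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def extract_daily_patterns(track, m=48, k=3):
    """Split one user's POI track into daily activity patterns.

    track: iterable of (datetime, poi_type) visits (assumed input layout).
    Returns {date: [top-k (poi_type, count) pairs for each of the m partitions]}.
    """
    minutes_per_partition = 24 * 60 / m
    # One Counter per time partition, for every day seen in the track.
    counts = defaultdict(lambda: [Counter() for _ in range(m)])
    for ts, poi in track:
        j = int((ts.hour * 60 + ts.minute) // minutes_per_partition)
        counts[ts.date()][j][poi] += 1
    return {day: [c.most_common(k) for c in parts] for day, parts in counts.items()}


track = [
    (datetime(2009, 5, 1, 8, 5), "restaurant"),
    (datetime(2009, 5, 1, 8, 20), "restaurant"),
    (datetime(2009, 5, 1, 8, 25), "bank"),
    (datetime(2009, 5, 1, 13, 40), "office"),
]
patterns = extract_daily_patterns(track)
```

With 30-minute partitions, the three morning visits fall into partition 16 and the afternoon visit into partition 27; partitions with no visits stay empty.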

(2) Analyzing and calculating similarity scores of user activity patterns

The invention introduces a new index to measure the similarity of activity patterns between users in the original space. The intuition behind the similarity score is that similar users tend to appear in similar types of places at similar times. The invention therefore counts the co-occurrences of points of interest within each time period. For two users $u_A$ and $u_B$ on platform A and platform B, each with a daily activity pattern, the invention defines the temporal activity similarity of the users as follows:

where the statistics for user $u_A$ are the frequent point-of-interest counts in the $j$-th time period. User linking can thus be carried out by computing the semantic similarity between users. For user $u_A$, the user $u_i'$ on platform B with the maximum temporal activity similarity score is found, and $u_A$ and $u_i'$ are linked because they share the most similar activity pattern.
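The linking rule can be illustrated with a small sketch. The exact form of formula (1) is not reproduced in this text, so the score below is an assumption: for each time partition it adds up the visit counts that two users share for the same point-of-interest type. The names `cooccurrence_similarity` and `link_user` are hypothetical.

```python
def cooccurrence_similarity(pattern_a, pattern_b):
    """Assumed co-occurrence score, not the patent's formula (1).

    Each pattern is a list with one dict per time partition,
    mapping POI type -> visit count.
    """
    score = 0
    for part_a, part_b in zip(pattern_a, pattern_b):
        for poi, n_a in part_a.items():
            # Count only visits both users made to the same POI type
            # in the same time partition.
            score += min(n_a, part_b.get(poi, 0))
    return score


def link_user(pattern_a, candidates_b):
    """Return the platform-B user id with the highest score against pattern_a."""
    return max(
        candidates_b,
        key=lambda uid: cooccurrence_similarity(pattern_a, candidates_b[uid]),
    )


u_a = [{"restaurant": 2, "bank": 1}, {"office": 3}]
platform_b = {
    "b1": [{"restaurant": 1}, {"office": 2}],
    "b2": [{"park": 4}, {"office": 1}],
}
best = link_user(u_a, platform_b)
```

Here `b1` shares a restaurant visit and two office visits with the query user, so it scores 3 against 1 for `b2` and is selected.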

To improve $S(u_A, u_B)$, the invention refines the similarity function by introducing the idea of TF-IDF (term frequency-inverse document frequency) to distinguish the importance of different points of interest. TF-IDF is a common weighting technique in information retrieval and data mining that reflects how important a word is to a document within a corpus. Inspired by TF-IDF, the method calculates the term frequency and inverse document frequency of each point of interest:

where the term frequency is based on the raw count of a point of interest in a track, e.g. the number of times point of interest $p_t$ occurs in the track.

where $N = |T|$ is the number of all tracks in the dataset and $|\{t \in T : p_t \in t\}|$ is the number of tracks that contain point of interest $p_t$. The TF-IDF weight of the point of interest is then calculated as follows:

$\mathrm{tfidf}(p_t, t, T) = \mathrm{tf}(p_t, t) \cdot \mathrm{idf}(p_t, T)$ (7)
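As a worked illustration of formula (7), the sketch below computes TF-IDF weights for points of interest. It assumes the textbook definitions: term frequency as the raw count of a point of interest in a track, and inverse document frequency as the logarithm of $N$ divided by the number of tracks containing it. The patent's own formulas for these two terms are not reproduced in this text, so any normalization they apply is not modeled here.

```python
import math
from collections import Counter


def tf(poi, track):
    """Raw count of a POI type in one track (assumed form of term frequency)."""
    return Counter(track)[poi]


def idf(poi, tracks):
    """log(N / number of tracks containing the POI); 0.0 if it never occurs."""
    containing = sum(1 for t in tracks if poi in t)
    return math.log(len(tracks) / containing) if containing else 0.0


def tfidf(poi, track, tracks):
    return tf(poi, track) * idf(poi, tracks)


tracks = [
    ["home", "office", "restaurant"],
    ["home", "bank", "bank"],
    ["home", "office"],
    ["home", "park"],
]
bank_weight = tfidf("bank", tracks[1], tracks)
home_weight = tfidf("home", tracks[0], tracks)
```

A type such as "home" that appears in every track receives weight zero, while a rarer type such as "bank" is weighted up, which is the distinction the weighting is meant to capture.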

With the TF-IDF value of each point of interest used as its weight, the invention designs an improved temporal activity similarity score $S^{\#}(u_A, u_B)$, whose co-occurrence function is defined as follows:

(3) Representation learning of track points of interest

Although the user's daily temporal activity record $L$ provides statistics of the user's activity pattern, this statistical feature is still insufficient for analysis. First, it cannot capture how different two points of interest are: for example, the difference between a Bank of Beijing branch and a China Construction Bank branch is significantly smaller than the difference between a Bank of Beijing branch and a Beijing restaurant. Second, the activity pattern similarity calculated this way is feature-based and cannot be used to further link user identities. The invention therefore proposes a representation-learning method that learns an embedded representation of the user's activity pattern, so that activity similarity between users can be calculated easily with classical distance functions.

The distribution of the interest points of the user in the track is very similar to the word frequency distribution in the natural language, so that the word embedding method in the natural language processing can be used for solving the embedding problem of the interest points. Inspired by the word2vec model, the invention designs a POI2vec model for learning the low-dimensional embedding of interest points.

Specifically, as in the continuous bag-of-words model, the target point of interest $p_t$ can be predicted from its context points of interest, i.e. by maximizing the probability function $P(p_t \mid Context(p_t))$. The conditional probability $P(p_t \mid Context(p_t))$ is defined by a normalized exponential (softmax) function:

where $V$ is the set of all points of interest in the dataset, $v(p_t) \in \mathbb{R}^d$ (where $d$ is the dimension of the low-dimensional space) is the vector representation of point of interest $p_t$, and $v_{Context}$ is the sum of the vectors of the context points of interest $Context(p_t)$. Finally, the training goal of POI2vec is to maximize the average of the exponential probabilities over all points of interest:
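As an illustration of the conditional probability described above, the following sketch evaluates a softmax over all points of interest given the summed context vector. It is a simplified assumption-laden example: it uses one shared embedding table for target and context vectors and hand-picked two-dimensional embeddings, and the function names are hypothetical. Training (updating the embeddings to raise these probabilities) is not shown.

```python
import math


def context_vector(context, emb):
    """Sum of the embeddings of the context POIs (v_Context)."""
    dim = len(next(iter(emb.values())))
    return [sum(emb[p][i] for p in context) for i in range(dim)]


def conditional_probability(target, context, emb):
    """Softmax probability of the target POI given its context POIs."""
    v_ctx = context_vector(context, emb)
    scores = {p: sum(a * b for a, b in zip(v, v_ctx)) for p, v in emb.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {p: math.exp(s - top) for p, s in scores.items()}
    return exps[target] / sum(exps.values())


# Toy embeddings chosen by hand for the example.
emb = {
    "bank": [1.0, 0.0],
    "atm": [0.9, 0.1],
    "restaurant": [0.0, 1.0],
}
probs = {p: conditional_probability(p, ["bank", "atm"], emb) for p in emb}
```

The probabilities over the vocabulary sum to one, and points of interest whose vectors align with the context vector receive the larger share.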

(4) Representation learning of trace activity patterns

Based on the point-of-interest embeddings obtained in the above step, the user's temporal activity embedding can be obtained. In the user's temporal activity statistics $L$, the invention counts the $k$ points of interest (top-$k$) that the user visits most frequently in each time partition of the day. On the basis of $L$ and the point-of-interest embeddings $v(p_t)$ obtained in the previous step, the user's activity habit is expressed as a vector of dimension $m \times dim$, where $m$ is the number of time partitions and $dim$ is the dimension of the point-of-interest embedding. If the user has POI records in a time period, the embedding of that period is built from the embeddings of its frequent POIs, weighted by the number of occurrences of each POI and its TF-IDF weight, as follows:

where concat denotes vector concatenation and $p_{jl}$ is the $l$-th frequent point of interest of the user in the $j$-th time partition, with access frequency $n_{jl}$. As in the definition of the temporal activity similarity score, the invention incorporates the TF-IDF weights into the representation of the user's temporal activity pattern.

If the user has no point-of-interest record within a time partition, the invention proposes three strategies to replace the missing value: 1) replace it with a zero vector; 2) replace it with the most frequent point of interest in the other time partitions; 3) replace it with a weighted average of the points of interest in all other time partitions.
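The construction of the user embedding and the three replacement strategies can be sketched as below. The exact weighting formula is not reproduced in this text, so the sketch assumes each partition vector is the sum of POI embeddings weighted by visit count times TF-IDF weight, and that strategy 3 averages POI embeddings by visit count. The names `partition_vector` and `embed_user` and all numbers are illustrative.

```python
def partition_vector(pois, emb, weight):
    """Weighted sum of POI embeddings; pois maps POI type -> visit count."""
    dim = len(next(iter(emb.values())))
    vec = [0.0] * dim
    for poi, n in pois.items():
        w = n * weight.get(poi, 1.0)  # assumed weighting: count x TF-IDF
        for i in range(dim):
            vec[i] += w * emb[poi][i]
    return vec


def embed_user(pattern, emb, weight, strategy=1):
    """Concatenate per-partition vectors, filling empty partitions by strategy."""
    dim = len(next(iter(emb.values())))
    totals = {}  # visit counts over all partitions that have records
    for pois in pattern:
        for poi, n in pois.items():
            totals[poi] = totals.get(poi, 0) + n
    if strategy == 1 or not totals:
        fill = [0.0] * dim                             # 1) zero vector
    elif strategy == 2:
        fill = list(emb[max(totals, key=totals.get)])  # 2) most frequent POI
    else:
        norm = sum(totals.values())                    # 3) count-weighted average
        fill = [x / norm for x in partition_vector(totals, emb, {})]
    out = []
    for pois in pattern:
        out.extend(partition_vector(pois, emb, weight) if pois else fill)
    return out


emb = {"home": [1.0, 0.0], "office": [0.0, 1.0]}
weight = {"home": 0.5, "office": 2.0}
pattern = [{"home": 2}, {}, {"office": 1, "home": 2}]
v1 = embed_user(pattern, emb, weight, strategy=1)
v2 = embed_user(pattern, emb, weight, strategy=2)
v3 = embed_user(pattern, emb, weight, strategy=3)
```

Only the middle partition is empty in this example, so the three outputs differ only in the two entries that belong to it.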

(5) User identity linking

Through the above steps, an embedded representation of each user's temporal activity is obtained. Cosine similarity is commonly used to calculate the similarity between two vectors, and the invention defines the similarity between the activity habits of two users as follows:

$D(v_1, v_2) = \dfrac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$

where $v_1$ and $v_2$ are the representations of the activity habits of the two users. Thus, given a user on one platform, the method can find the user with the most similar activity habits on another platform in the dataset and link the two users, i.e. treat them as having the same user identity.
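A minimal sketch of this linking step, using the standard cosine similarity between two embedding vectors. The helper names and the toy vectors are assumptions for illustration; returning 0.0 for an all-zero vector is a design choice of the sketch, not something the text specifies.

```python
import math


def cosine_similarity(v1, v2):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)


def most_similar_user(v_query, candidates):
    """Return the candidate user id whose embedding is closest to v_query."""
    return max(candidates, key=lambda uid: cosine_similarity(v_query, candidates[uid]))


query = [1.0, 0.0, 1.0, 2.0]
platform_b = {
    "b1": [2.0, 0.0, 2.0, 4.0],  # same direction as the query
    "b2": [0.0, 1.0, 0.0, 0.0],  # orthogonal to the query
}
match = most_similar_user(query, platform_b)
```

Because cosine similarity ignores vector length, a candidate whose activity vector is a scaled copy of the query still scores 1.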

The method of the invention can calculate the similarity of the user in the semantic space because the unique activity mode of the user is extracted by combining the time information and the semantic information. Meanwhile, the method adopts a model based on representation learning, the activity habit of the user is embedded into a low-dimensional space, and for any user, the user which is most matched with the user can be efficiently found.

Drawings

FIG. 1 is a flow chart of a cross-platform user identity recognition method based on activity similarity.

Detailed Description

To illustrate an embodiment of the method, the GeoLife dataset is taken as an example to describe the processing steps. GeoLife is a GPS dataset collected by Microsoft Research Asia that records the activity trajectories of 182 users over a period of more than three years (April 2007 to August 2012). To obtain fine-grained activity information, the geocoder API provided by AMap (Gaode Map) is used to obtain detailed point-of-interest information from latitude and longitude.

As shown in FIG. 1, the algorithm that uses activity similarity to identify the same identity across platforms is described as follows:

Input: the trajectory data of a user $u_i$ on platform A, and the set of user trajectory data on platform B

Output: the user $u'_j$ on platform B whose activity habits are most similar to those of $u_i$

Step 1: preprocess the user's activity track and extract the user's activity pattern;

Step 2: calculate the activity similarity of users in the original space according to formulas (1)-(8);

Step 3: obtain the embedded representation $v(p)$ of each point of interest from the POI2vec model;

Step 4: generate the embedded representation $v$ of the user's activity pattern from $v(p)$ and the missing-value replacement strategy;

Step 5: calculate the activity similarity $D(v_i, v_j)$ and link the identity of the user with the highest similarity.

The invention uses acc@K and mean rank as the indexes for measuring model performance. acc@K is defined as follows:

$\mathrm{acc@}K = \dfrac{\#\,\text{correctly identified users@}K}{\#\,\text{users}}$

where # correctly identified users@K denotes the number of users whose true match is correctly predicted among the first K candidates, and # users denotes the total number of users to be identified. For mean rank, all candidate users are ranked by similarity, with higher similarity ranked first, and for every user the rank of the candidate with the same identity is recorded. The average of these ranks is reported. A lower mean rank indicates a better method.
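The two metrics can be computed as in the sketch below. The similarity scores and user ids are made up for the example, and the function names are hypothetical; ranks are 1-based, with rank 1 meaning the true match was the most similar candidate.

```python
def rank_of_true_match(scores, true_id):
    """1-based rank of true_id when candidates are sorted by descending similarity."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_id) + 1


def acc_at_k(ranks, k):
    """Fraction of users whose true match appears among the first k candidates."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_rank(ranks):
    return sum(ranks) / len(ranks)


# Illustrative data: query user -> (candidate similarity scores, true match id).
results = {
    "a1": ({"b1": 0.9, "b2": 0.4, "b3": 0.1}, "b1"),
    "a2": ({"b1": 0.5, "b2": 0.3, "b3": 0.7}, "b2"),
    "a3": ({"b1": 0.2, "b2": 0.6, "b3": 0.8}, "b3"),
}
ranks = [rank_of_true_match(s, t) for s, t in results.values()]
```

In this toy data two of the three users are matched at rank 1 and one at rank 3, so acc@1 is two thirds and the mean rank is five thirds.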

After analyzing the trajectories in the dataset, the invention obtains each user's daily track of frequent points of interest. Using the similarity scores defined in formulas (1)-(8), the initial user linking result shown in Table 1 is obtained:

Table 1. Dataset analysis with original activity similarity. (Ratio: the proportion of users whose top-1 most similar user is themselves, calculated with formula (1) without TF-IDF weights. Ratio#: the same proportion calculated with TF-IDF weights.)

Dataset	Ratio	Ratio#
GeoLife	94.5%	98.9%

It can be seen that the cross-platform user identity can be identified to a great extent according to the activity similarity calculation formula defined by the invention.

Based on the POI2vec model provided by the invention, we can obtain low-dimensional embedding of different interest points in the moving track. For parameters of the model, the invention sets the embedding dimension of the interest point equal to 80, and the window size of the context equal to 3 (1.5 hours). Obviously, larger embedding dimensions can better preserve the original semantic information, but as dimensions increase, gains in performance are gradually reduced. For the window size, if the value is too small, the association relation between the interest points may not be captured well, but if the window is too large, errors are introduced, and the performance of the model is reduced. Therefore, we recommend that the embedding dimension of the interest point is equal to 80 and the window size of the context is equal to 3.

Based on the POI2vec model, the invention uses the POI embedding dictionary to convert the user activity pattern $L$ into an embedded representation $v$. More specifically, one day is divided into 48 equal-length slices (30 minutes per slice). The value 48 is chosen because it provides the finest granularity that does not suffer from data sparseness. Time slices lacking point-of-interest information are marked as the "missing" type in the dataset. The frequent points of interest in each time partition are represented by a weighted sum of their vectors, and the embedded vectors of the 48 time slices are finally concatenated into an embedded representation of the user's activity habits.

In this experiment, the invention investigates how different replacement strategies for the "missing" type affect the similarity ranking of user temporal activity embeddings. Strategy 1 replaces the missing value with a zero vector, strategy 2 replaces the "missing" type with the most common POI in the other time periods, and strategy 3 replaces the "missing" type with the weighted sum of the vectors of all POIs in the other time periods. In the POI-type tracks obtained, 36.39% of the time slices in the GeoLife dataset are marked as missing. Table 2 shows the impact of the three strategies on the similarity ranking of user embeddings. Strategy 3 works best, because it takes into account the user's behavior habits during all recorded periods of the day.

Table 2. Mean rank of user temporal activity embedding similarity

Dataset	MR@strategy 1	MR@strategy 2	MR@strategy 3
GeoLife	2.2197	13.3021	1.9175

Claims

1. A method for identifying the identity of a cross-platform user based on activity similarity is characterized in that,

firstly, extracting the activity pattern of a user by combining time and semantic information in an activity track; secondly, calculating similarity scores between user activity patterns and, in order to distinguish the importance of different point-of-interest types, assigning different weights to different point-of-interest types using the concept of inverse document frequency; thirdly, introducing a point-of-interest embedding layer, analogous to word embedding in natural language processing, to generate an embedded representation for each point of interest, then generating a vector representation of the user's activity pattern from the activity pattern and the point-of-interest embeddings; and finally, calculating user activity similarity from the generated representations of the user activity patterns, wherein the most similar users have the same natural-person identity;

the method specifically comprises the following steps:

(1) Extracting activity patterns of a user

Representing the point-of-interest track of the user as $T(u) = \{p_1, p_2, \ldots, p_t\}$, where $p_t$ is the point-of-interest type of the address visited by the user at time point $t$ and $u$ denotes the user; because user activity patterns are strongly periodic and predictable, the user's daily activities need to be analyzed, so the point-of-interest track is divided into sub-tracks $T_{sub}(u)$, each one day long; to better analyze daily activity habits, a day is divided into $m$ time partitions, and the points of interest frequently visited by the user are counted in each time partition, each count recording that user $u_i$ visited point of interest $p_t$ in the $j$-th time period $n_t$ times; the per-partition counts together define the user's daily activity pattern, denoted $L$;

(2) Analyzing and calculating similarity scores of user activity patterns

A new index is introduced to measure the similarity of activity patterns between users in the original space; the intuition behind the similarity score is that similar users tend to appear in similar types of places at similar times, so the number of co-occurrences of points of interest within each time period is counted; for two users $u_A$ and $u_B$ on platform A and platform B, each with a daily activity pattern, the temporal activity similarity of the users is defined as follows:

where the statistics for user $u_A$ are the frequent point-of-interest counts in the $j$-th time period; user linking is thus carried out by computing the semantic similarity between users: for user $u_A$, the user $u_i'$ on platform B with the maximum temporal activity similarity score is found, and $u_A$ and $u_i'$ are linked together as sharing the most similar activity pattern;

the TF-IDF value of each point of interest is calculated and used as its weight, giving the improved temporal activity similarity score $S^{\#}(u_A, u_B)$, whose co-occurrence function is defined as follows:

(3) Representation learning of track points of interest

Although a statistics of the user's daily time activity record L is obtained to represent the user's activity pattern, this statistical feature is still insufficient for analysis, firstly, it cannot distinguish between different points of interest, secondly, the user's activity pattern similarity is calculated, which is feature-based and cannot be used to further link the user's identity, so a learning-based approach is proposed to learn an embedded representation of the user's activity pattern, the user's activity similarity can be easily calculated by classical distance functions;

the distribution of the interest points of the user in the track is very similar to the word frequency distribution in the natural language, so that the word embedding method in the natural language processing can be used for solving the problem of embedding the interest points, inspired by a word2vec model, and a POI2vec model is designed for learning the low-dimensional embedding of the interest points;

specifically, as in the continuous bag-of-words model, the target point of interest $p_t$ can be predicted from its context points of interest, i.e. by maximizing the probability function $P(p_t \mid Context(p_t))$; the conditional probability $P(p_t \mid Context(p_t))$ is defined by a normalized exponential (softmax) function:

where $V$ is the set of all points of interest in the dataset, $v(p_t) \in \mathbb{R}^d$ (where $d$ is the dimension of the low-dimensional space) is the vector representation of point of interest $p_t$, and $v_{Context}$ is the sum of the vectors of the context points of interest $Context(p_t)$; finally, the training goal of POI2vec is to maximize the average of the exponential probabilities over all points of interest:

(4) Representation learning of user activity patterns

Based on the point-of-interest embeddings obtained in the above step, the user's temporal activity embedding is further obtained; in the activity pattern $L$ of the user, the $k$ points of interest (top-$k$) most frequently visited by the user in each time partition of the day are counted; on the basis of the activity pattern $L$ and the point-of-interest embeddings $v(p_t)$ obtained in the previous step, the embedded vector of the user's activity pattern is expressed as a vector of dimension $m \times dim$, where $m$ is the number of time partitions and $dim$ is the dimension of the point-of-interest embedding; if the user has POI records in a time period, the embedding of that period is built from the embeddings of its frequent POIs, and the embedded vector of the user in that time period is expressed as follows according to the number of occurrences of each POI and its TF-IDF weight:

where concat denotes vector concatenation and $p_{jl}$ is the $l$-th frequent point of interest of the user in the $j$-th time partition, with access frequency $n_{jl}$; as in the definition of the temporal activity similarity score, the TF-IDF weights are incorporated into the representation of the user activity pattern;

if the user has no point-of-interest record within a time partition, three strategies are proposed to replace the missing value: 1) replace it with a zero vector; 2) replace it with the most frequent point of interest in the other time partitions; 3) replace it with a weighted average of the points of interest in all other time partitions;

(5) User identity linking

Through the above steps, an embedded representation of each user's temporal activity is obtained; cosine similarity is commonly used to calculate the similarity between two vectors, and the similarity between the activity habits of two users is defined as follows:

$D(v_1, v_2) = \dfrac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$

where $v_1$ and $v_2$ are the representations of the activity habits of the two users; thus, given a user on one platform, the user with the most similar activity habits is found on the other platform in the dataset, and the two users are linked, i.e. treated as having the same user identity.

2. The activity similarity-based cross-platform user identity recognition method of claim 1, wherein in step (2), to improve $S(u_A, u_B)$, the similarity function is refined by introducing the idea of TF-IDF (term frequency-inverse document frequency) to distinguish the importance of different points of interest; TF-IDF is a common weighting technique in information retrieval and data mining that reflects how important a word is to a document within a corpus; inspired by TF-IDF, the term frequency and inverse document frequency of each point of interest are calculated:

where the term frequency is based on the raw count of a point of interest in a track, e.g. the number of times point of interest $p_t$ occurs in the track;

where $N = |T|$ is the number of all tracks in the dataset and $|\{t \in T : p_t \in t\}|$ is the number of tracks that contain point of interest $p_t$; the TF-IDF weight of the point of interest is then calculated as follows:

$\mathrm{tfidf}(p_t, t, T) = \mathrm{tf}(p_t, t) \cdot \mathrm{idf}(p_t, T)$ (7).

3. the cross-platform user identity recognition method based on activity similarity according to claim 1, wherein the limitation of physical distance is broken through by using external semantic information of the user activity track, and even for tracks with a larger physical distance, hidden user fixed activity patterns in the tracks can be captured.

4. The activity similarity-based cross-platform user identity recognition method of claim 1, wherein the context information of the user activity track is used to calculate the vector representation of the activity track points by maximizing the probability function $P(p_t \mid Context(p_t))$, and the training goal of POI2vec is to maximize the average of the exponential probabilities over all points of interest:

by maximizing the average exponential probability over the entire dataset, unsupervised representation learning of the activity track points is achieved.

5. The cross-platform user identity recognition method based on activity similarity according to claim 1, wherein the top-k frequent activity places in a user activity pattern are fully analyzed, and three strategies are provided to replace missing values in order to solve the problem of data sparseness: 1) replace the missing value with a zero vector; 2) replace it with the most frequent point of interest in the other time partitions; 3) replace it with a weighted average of the points of interest in all other time partitions.