CN114118250A - Cross-platform user identity identification method based on activity similarity

Info

Publication number: CN114118250A
Application number: CN202111389814.5A
Authority: CN
Inventors: 李勇军; 黄丽蓉; 颜兆洁; 张银银
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2022-03-01
Anticipated expiration: 2041-11-22
Also published as: CN114118250B

Abstract

The invention discloses a cross-platform user identity identification method based on activity similarity. First, the activity patterns of users are extracted by combining the time and semantic information in their activity trajectories. Second, similarity scores between the users' activity patterns are calculated; to distinguish the importance of different point-of-interest types, each type is assigned a different weight using the concept of inverse document frequency. Third, a point-of-interest embedding layer, analogous to word embedding in natural language processing, is introduced to generate an embedded representation of each point of interest. Fourth, a vector representation of each user's activity pattern is generated from the activity pattern and the point-of-interest embeddings. Finally, the activity similarity between users is calculated from these representations, and the most similar users are taken to share the same natural-person identity. The invention calculates user similarity in the semantic space, embeds users' activity habits into a low-dimensional space, and can efficiently find the best-matching user for any given user.

Description

Cross-platform user identity identification method based on activity similarity

Technical Field

The invention relates to a trajectory-based cross-platform user identity linking method, and in particular to a method based on activity similarity.

Background

Cross-platform user linking plays a crucial role in numerous applications, such as user interest recommendation and location prediction. Many studies have approached the problem through user attributes and social networks, but the attributes and social-network features of a user are inconsistent across service platforms. Moreover, because of privacy concerns, some sensitive information cannot be acquired and used for identity analysis. Unlike platform-specific information, a user's spatiotemporal trajectory can provide stable and consistent identity information. Owing to the popularity of Global Positioning System tracking technology, mobile devices record spatiotemporal trajectories no matter which service the user accesses. User identities can therefore be linked by analyzing the spatiotemporal locations of user activities.

Existing methods that link user identities through spatiotemporal trajectories mainly identify users by matching their spatiotemporal records according to statistical features. By introducing different weight-assignment strategies, they use user visit frequency, different locations and the popularity of "meeting" events to calculate the similarity between users. However, these methods cannot capture the dynamic patterns and semantic information of a trajectory.

In the traditional trajectory-based identity linking problem, the longitude and latitude of GPS positions are generally used to represent trajectory points, but longitude and latitude carry no semantic information about the trajectory. There have been many efforts to model human movement, but most of them model only spatiotemporal regularities, typically treating human movement as a random process around a fixed point. The biggest defect of this kind of modeling is that it ignores activity information: the purpose for which a person appears at a particular location at a particular time, that is, the movement purpose implicit in the trajectory. In recent years, research has also been devoted to adding the semantic information of trajectory points to spatiotemporal trajectories in order to improve the accuracy of user identity linking. However, these methods use semantic information only as auxiliary information and cannot integrate spatiotemporal information with semantic information. Unlike the prior methods, the invention provides a new method that combines the time and semantic information of user trajectories and learns the activity habits hidden in them.

Disclosure of Invention

To fully mine the semantic information in user trajectories, the invention provides a representation-learning-based system that converts point-of-interest trajectories into distinctive activity habits while preserving time and semantic information. First, the activity pattern of each user is extracted by combining the time and semantic information in the activity trajectory. Second, a similarity score between user activity patterns is calculated; to distinguish the importance of different point-of-interest types, the invention uses the concept of inverse document frequency to assign different weights to different types. Third, analogous to word embedding in natural language processing, a point-of-interest embedding layer is introduced to generate an embedded representation of each point of interest. Fourth, a vector representation of the user's activity pattern is generated from the activity pattern and the point-of-interest embeddings. Finally, the similarity of user activities is calculated from the generated representations, and the most similar users are taken to have the same natural-person identity.

(1) Extracting activity patterns of a user

The invention represents the point-of-interest trajectory of a user as $T(u) = p_1, p_2, \ldots, p_t$, where $p_t$ is the point-of-interest type of the address that the user visits at time point $t$ and $u$ denotes the user. Because user activity patterns are strongly periodic and predictable, the user's activities must be analyzed day by day; the invention therefore divides the point-of-interest trajectory into sub-trajectories $T_{sub}(u)$, each one day long. To better analyze the user's daily activity habits, the invention divides a day into $m$ time partitions and counts the points of interest the user frequently visits in each partition. Each entry of this statistic records that user $u_i$ visited point of interest $p_t$ a total of $n_t$ times in the $j$-th time period. The invention defines the user's daily activity pattern, denoted $L$ below, as the collection of these statistics over the $m$ time partitions.
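The extraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_daily_patterns`, the input format (timestamped POI-type records) and the default values of `m` and `k` are hypothetical and are not taken from the patent text.

```python
from collections import Counter, defaultdict
from datetime import datetime


def extract_daily_patterns(visits, m=48, k=3):
    """Build a per-day activity pattern from (timestamp, poi_type) records.

    Returns {date: [slot_0, ..., slot_{m-1}]}, where each slot is a list of
    (poi_type, visit_count) pairs for the k most frequent POI types.
    The record format and the defaults for m and k are assumptions.
    """
    counts = defaultdict(lambda: [Counter() for _ in range(m)])
    for ts, poi in visits:
        minute_of_day = ts.hour * 60 + ts.minute
        slot = minute_of_day * m // (24 * 60)  # index of the time partition
        counts[ts.date()][slot][poi] += 1
    return {
        day: [slot_counter.most_common(k) for slot_counter in slots]
        for day, slots in counts.items()
    }


visits = [
    (datetime(2008, 5, 1, 9, 5), "bank"),
    (datetime(2008, 5, 1, 9, 20), "bank"),
    (datetime(2008, 5, 1, 9, 25), "cafe"),
    (datetime(2008, 5, 1, 12, 40), "restaurant"),
]
patterns = extract_daily_patterns(visits)
```

With 48 partitions, each slot covers 30 minutes, so the three morning records fall into slot 18 and the lunchtime record into slot 25; slots without records stay empty.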

(2) Analyzing and calculating similarity scores of user activity patterns

The invention introduces a new index to measure the similarity of activity patterns between users in the original space. The intuition behind the similarity score is that similar users tend to appear in similar types of places at similar times. The invention therefore counts the co-occurrences of points of interest within each specific time period. For two users, $u_A$ on platform A and $u_B$ on platform B, each has a daily activity pattern, and the invention defines their time activity similarity $S(u_A, u_B)$ over the frequent point-of-interest statistics of the two users in each $j$-th time period. User linking can thus be carried out according to the calculated semantic similarity between users. For a user $u_A$, the most similar user $u_i'$ on platform B is the one with the maximum time activity similarity score; $u_A$ and $u_i'$, which share the most similar activity pattern, are then linked together.
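As a rough illustration of linking by time activity similarity, the sketch below scores two activity patterns by their co-occurrences and picks the best candidate on platform B. The exact similarity formula is not reproduced in this text, so the scoring rule used here (summing, per time partition, the smaller of the two visit counts for each shared POI type) is an assumption, as are the function names and the pattern format (one list of `(poi_type, count)` pairs per partition).

```python
def cooccurrence_similarity(pattern_a, pattern_b):
    """Assumed co-occurrence score: for each time partition, add the
    overlapping visit count of every POI type both users visited."""
    score = 0
    for slot_a, slot_b in zip(pattern_a, pattern_b):
        counts_b = dict(slot_b)
        for poi, n_a in slot_a:
            score += min(n_a, counts_b.get(poi, 0))
    return score


def link_user(pattern_a, candidates_b):
    """Return the platform-B user id whose pattern scores highest."""
    return max(
        candidates_b,
        key=lambda uid: cooccurrence_similarity(pattern_a, candidates_b[uid]),
    )


u_a = [[("bank", 2)], [("cafe", 1)]]
platform_b = {
    "b1": [[("bank", 1)], [("cafe", 3)]],
    "b2": [[("gym", 5)], []],
}
best = link_user(u_a, platform_b)
```

Here `b1` shares a bank visit in the first partition and a cafe visit in the second, while `b2` shares nothing, so `b1` is selected.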

To improve the performance of $S(u_A, u_B)$, the invention refines the similarity function by introducing the idea of TF-IDF (term frequency-inverse document frequency) to distinguish the importance of different points of interest. TF-IDF is a weighting technique commonly used in information retrieval and data mining that reflects how important a word is to a document in a corpus. Inspired by TF-IDF, the method calculates a term frequency and an inverse document frequency for each point of interest. The term frequency $tf(p_t, t)$ is based on the raw statistics of the point of interest in the trajectory, for example the number of occurrences of point of interest $p_t$ in trajectory $t$. For the inverse document frequency, $N = |T|$ is the number of all trajectories in the dataset and $|\{t \in T : p_t \in t\}|$ is the number of trajectories that contain point of interest $p_t$. The TF-IDF value of the point of interest is then calculated as follows:

$tfidf(p_t, t, T) = tf(p_t, t) \cdot idf(p_t, T)$ (7)

Taking the calculated TF-IDF value as the weight of each point of interest, the invention designs an improved time activity similarity score $S^{\#}(u_A, u_B)$, whose co-occurrence function applies these weights.
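The weighting idea can be sketched as follows. The sketch assumes the textbook forms of the two factors, a raw occurrence count for the term frequency and $\log(N / |\{t \in T : p_t \in t\}|)$ for the inverse document frequency; the patent's own formulas for these factors are not reproduced in this text, and the function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(trajectories):
    """Return one {poi_type: weight} dict per trajectory.

    Assumes tf = raw occurrence count and idf = log(N / document frequency).
    """
    n = len(trajectories)
    doc_freq = Counter()
    for traj in trajectories:
        doc_freq.update(set(traj))  # count each POI type once per trajectory
    weights = []
    for traj in trajectories:
        tf = Counter(traj)
        weights.append({p: tf[p] * math.log(n / doc_freq[p]) for p in tf})
    return weights


trajectories = [
    ["home", "bank", "home"],
    ["home", "gym"],
    ["home", "bank"],
]
w = tfidf_weights(trajectories)
```

A POI type that appears in every trajectory (here "home") receives weight zero, while a rare one (here "gym") receives the largest weight, which is the distinction the weighting is meant to draw.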

(3) Representation learning of trajectory points of interest

Although the user's daily time activity record L provides statistics of the user's activity pattern, this statistical feature is still insufficient for analysis. First, it cannot capture how different two points of interest are: for example, Bank of Beijing and China Construction Bank are semantically far closer to each other than Bank of Beijing is to a Beijing restaurant, yet the statistics treat all three simply as distinct points of interest. Second, the activity pattern similarity calculated from it is feature-based and cannot be used directly to link user identities further. Accordingly, the invention proposes a representation-learning-based method for learning an embedded representation of a user's activity pattern, so that the activity similarity of users can be easily calculated with a classical distance function.

The distribution of the user's interest points in the trajectory is very similar to the word frequency distribution in natural language, so that the word embedding method in natural language processing can be used to solve the problem of embedding the interest points. Inspired by the word2vec model, the invention designs the POI2vec model to learn the low-dimensional embedding of the interest points.

Specifically, as in the bag-of-words model, the target point of interest $p_t$ can be predicted from its context points of interest, that is, by maximizing the probability $P(p_t \mid Context(p_t))$. This conditional probability is defined by a normalized exponential (softmax) function:

$P(p_t \mid Context(p_t)) = \frac{\exp\left(v(p_t)^{\top} v_{Context}\right)}{\sum_{p \in V} \exp\left(v(p)^{\top} v_{Context}\right)}$

where $V$ is the set of all points of interest in the dataset, $v(p_t) \in \mathbb{R}^d$ (where $d$ is the dimension of the low-dimensional space) is the embedded representation of point of interest $p_t$, and $v_{Context}$ is the sum of the vectors of the context points of interest $Context(p_t)$. Finally, the training goal of POI2vec is to maximize the average exponential probability over the whole dataset.
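To make the normalized exponential function concrete, the following sketch computes the probability of every point of interest given a context, using the sum of the context vectors as described above. It is a simplified illustration: the function name and the toy two-dimensional embeddings are assumptions, and a single embedding table is used for both target and context, which may differ from the actual POI2vec parameterization.

```python
import math


def cbow_probabilities(embeddings, context_pois):
    """Softmax over all POIs of the dot product v(p) . v_Context.

    embeddings: {poi_type: list of floats}; context_pois: list of poi types.
    """
    dim = len(next(iter(embeddings.values())))
    # v_Context is the sum of the context POI vectors.
    v_context = [sum(embeddings[c][i] for c in context_pois) for i in range(dim)]
    scores = {
        p: sum(a * b for a, b in zip(vec, v_context))
        for p, vec in embeddings.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {p: math.exp(s - top) for p, s in scores.items()}
    total = sum(exp_scores.values())
    return {p: e / total for p, e in exp_scores.items()}


embeddings = {
    "bank": [1.0, 0.0],
    "atm": [0.9, 0.1],
    "park": [-1.0, 0.5],
}
probs = cbow_probabilities(embeddings, ["bank", "atm"])
```

Training would adjust the vectors so that the probability assigned to the observed target is high across the dataset; that optimization loop is omitted here.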

(4) Representation learning of trajectory activity patterns

Based on the point-of-interest embeddings obtained in the previous step, the time activity embedding of a user can be obtained. In the user's time activity statistics $L$, the invention counts the $k$ points of interest (top-$k$) that the user visits most frequently in each time partition of the day. On the basis of the time activity statistics $L$ and the point-of-interest embeddings $v(p_t)$ obtained in the previous step, the user's activity habits are expressed as a vector of dimension $m \times dim$, where $m$ is the number of time partitions and $dim$ is the dimension of the point-of-interest embedding. If the user has point-of-interest (POI) records in a time period, the embedding for that period is built from the embeddings of its frequent POIs, weighted by the occurrence frequency and TF-IDF weight of each POI; the per-period embeddings are then joined, where concat denotes the concatenation of vectors, $p_{jl}$ is the $l$-th frequent point of interest of the user in the $j$-th time partition, and $n_{jl}$ is its visit frequency. As in the definition of the time activity similarity score, the invention incorporates the TF-IDF weights into the representation of the user's temporal activity pattern.

If the user has no point-of-interest record in a certain time partition, the invention proposes three strategies to replace the missing value: 1) replace it with a zero vector; 2) replace it with the most frequent point of interest in the other time partitions; 3) replace it with a weighted average of the points of interest in all other time partitions.
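The construction of the user embedding and the three missing-value strategies can be sketched as below. The function name `embed_pattern`, the strategy labels and the exact weighting (visit count multiplied by a per-POI TF-IDF weight, with a default weight of 1.0) are assumptions made for illustration; the patent's own embedding formula is not reproduced in this text.

```python
def embed_pattern(pattern, poi_vecs, weights, strategy="average"):
    """Concatenate one vector per time partition into a user embedding.

    pattern: list of slots, each a list of (poi_type, count) pairs.
    strategy for empty slots: "zero", "most_frequent" or "average".
    """
    dim = len(next(iter(poi_vecs.values())))

    def weighted_sum(pairs):
        vec = [0.0] * dim
        for poi, n in pairs:
            w = n * weights.get(poi, 1.0)  # frequency times TF-IDF weight
            vec = [x + w * y for x, y in zip(vec, poi_vecs[poi])]
        return vec

    # Records from all non-empty partitions, used to fill the empty ones.
    all_pairs = [pair for slot in pattern for pair in slot]
    if strategy == "zero" or not all_pairs:
        fill = [0.0] * dim
    elif strategy == "most_frequent":
        totals = {}
        for poi, n in all_pairs:
            totals[poi] = totals.get(poi, 0) + n
        fill = list(poi_vecs[max(totals, key=totals.get)])
    else:  # weighted average over the other partitions
        total_w = sum(n * weights.get(poi, 1.0) for poi, n in all_pairs)
        summed = weighted_sum(all_pairs)
        fill = [x / total_w for x in summed] if total_w else [0.0] * dim

    embedding = []
    for slot in pattern:
        embedding.extend(weighted_sum(slot) if slot else fill)
    return embedding


poi_vecs = {"bank": [1.0, 0.0], "cafe": [0.0, 1.0]}
weights = {"bank": 1.0, "cafe": 2.0}
pattern = [[("bank", 2)], [], [("cafe", 1)]]
v_zero = embed_pattern(pattern, poi_vecs, weights, "zero")
v_freq = embed_pattern(pattern, poi_vecs, weights, "most_frequent")
v_avg = embed_pattern(pattern, poi_vecs, weights, "average")
```

In this toy case the middle partition is empty, so the three strategies differ only in the two middle components of the resulting six-dimensional vector.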

(5) User identity linking

Through the above steps, an embedded representation of each user's temporal activity is obtained. Cosine similarity is commonly used to measure the similarity between two vectors, and the invention defines the similarity between the activity habits of two users as follows:

$D(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$

where $v_1$ and $v_2$ are the representations of the activity habits of the two users. Thus, given a user on one platform, the method can find the user with the most similar activity habits on another platform in the dataset and link the two users, that is, treat them as having the same user identity.
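The linking step can be illustrated with a short cosine-similarity sketch. The function names and the toy vectors are hypothetical; returning 0.0 for a zero-length vector is a convention chosen here, not something stated in the patent.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0  # convention for all-zero vectors


def most_similar_user(query_vec, candidates):
    """Return the candidate id with the highest cosine similarity."""
    return max(
        candidates,
        key=lambda uid: cosine_similarity(query_vec, candidates[uid]),
    )


platform_b = {"b1": [0.9, 0.1], "b2": [-1.0, 0.0]}
match = most_similar_user([1.0, 0.2], platform_b)
```

Because cosine similarity ignores vector length, two users with the same mix of activities but different amounts of recorded data can still be matched.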

The invention extracts each user's distinctive activity pattern by combining time information with semantic information, so that user similarity can be calculated in the semantic space. The method also adopts a representation-learning-based model that embeds users' activity habits into a low-dimensional space, so that the best-matching user can be found efficiently for any given user.

Drawings

FIG. 1 is a flowchart of a cross-platform user identity recognition method based on activity similarity according to the present invention.

Detailed Description

To illustrate the implementation of the method, its processing steps are shown using the GeoLife dataset as an example. GeoLife is a GPS dataset collected by Microsoft Research Asia that records the activity trajectories of 182 users over a period of more than three years (April 2007 to August 2012). To obtain fine-grained activity information, we use the geocoding API provided by Gaode Map (AMap) to obtain detailed point-of-interest information from longitude and latitude.

As shown in FIG. 1, the algorithm for cross-platform user identity identification using activity similarity is described as follows:

Input: a user $u_i$ on platform A and the user trajectory dataset on platform B

Output: the user $u'_j$ on platform B whose activity habits are most similar to those of $u_i$

Step 1: preprocess the user's activity trajectory and extract the user's activity pattern;

Step 2: calculate the activity similarity of users in the original space according to formulas (1) to (8);

Step 3: obtain the embedded representation v(p) of each point of interest from the POI2vec model;

Step 4: generate the embedded representation v of the user's activity pattern from v(p) and the missing-value replacement strategy;

Step 5: calculate the activity similarity $D(v_i, v_j)$ between users and link the identity of the user with the highest similarity.

The invention uses Acc@K and mean rank as the indexes for measuring model performance. Acc@K is defined as follows:

$Acc@K = \frac{\#\,\text{correctly identified users}@K}{\#\,\text{users}}$

where # correctly identified users@K is the number of users whose true match appears among the first K candidates, and # users is the total number of users to be identified. Mean rank is computed by ranking all candidate users by similarity, with higher similarity ranked closer to the top, and recording for every user the rank of the candidate that has the same identity. The average of these ranks is reported here. A lower mean rank indicates a better method.
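Both metrics follow directly from the rank of each user's true match, as the sketch below shows. The helper names and the toy scores are hypothetical; ties between candidates are broken by the order in which they appear, which is an arbitrary choice made for this example.

```python
def rank_of_true_match(scores, true_id):
    """1-based rank of the true match when candidates are sorted by
    descending similarity. scores: {candidate_id: similarity}."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_id) + 1


def acc_at_k(ranks, k):
    """Fraction of users whose true match is within the first k candidates."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_rank(ranks):
    """Average rank of the true match; lower is better."""
    return sum(ranks) / len(ranks)


example_rank = rank_of_true_match({"b1": 0.9, "b2": 0.4, "b3": 0.7}, "b3")
ranks = [1, 2, 5, 1]
```

For the four toy ranks above, half of the users are matched at the first position, three quarters within the first three, and the mean rank is 2.25.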

After the trajectories in the dataset are analyzed, the invention obtains each user's daily frequent point-of-interest trajectories. Based on the similarity scores defined in equations (1) to (8), the initial user linking results are obtained, as shown in Table 1 below:

Table 1. Dataset analysis using the original activity similarity. (Ratio: proportion of users whose most similar user is themselves, calculated using equation (1) without TF-IDF weights. Ratio#: the same proportion calculated with TF-IDF weights.)

Dataset	Ratio	Ratio#
GeoLife	94.5%	98.9%

It can be seen that the activity similarity calculation formula defined by the invention can identify the cross-platform user identity to a great extent.

Based on the POI2vec model proposed by the invention, low-dimensional embeddings of the different points of interest in the activity trajectories can be obtained. For the model parameters, the invention sets the point-of-interest embedding dimension to 80 and the context window size to 3 (1.5 hours). A larger embedding dimension retains more of the original semantic information, but the performance gain diminishes as the dimension increases. If the window size is too small, the correlation between points of interest may not be captured well; if it is too large, errors may be introduced that reduce model performance. We therefore recommend an embedding dimension of 80 and a context window size of 3.

Based on the POI2vec model, the invention uses a POI embedding dictionary to convert a user activity pattern L into an embedded representation v of that pattern. More specifically, we divide a day into 48 slices of equal length (30 minutes per slice). The value 48 is chosen because it provides the best granularity without suffering from data sparsity. The invention marks time slices that lack point-of-interest information in the dataset as the "missing" type. Frequent points of interest within a time partition are represented by a weighted sum of their vectors, and the embedded vectors of the 48 time slices are finally concatenated into an embedded representation of the user's activity habits.

In this experiment, the invention investigates the effect of different replacement strategies for the "missing" type on the similarity ranking of user temporal activity embeddings. Strategy 1 replaces missing values with zero vectors, strategy 2 replaces the "missing" type with the POI that is most common in the other time periods, and strategy 3 replaces the "missing" type with a weighted sum of the vectors of all POIs in the other time periods. In the POI-type trajectories obtained, 36.39% of the time slices in the GeoLife dataset are marked as missing. Table 2 shows the effect of the three strategies on the user embedding similarity ranking. Strategy 3 works best; it takes into account the user's behavior habits over the whole recorded day.

Table 2. Mean rank (MR) of user temporal activity embedding similarity

Dataset	MR@strategy 1	MR@strategy 2	MR@strategy 3
GeoLife	2.2197	13.3021	1.9175

Claims

1. A cross-platform user identity identification method based on activity similarity, characterized in that:

first, the activity patterns of users are extracted by combining the time and semantic information in their activity trajectories; second, similarity scores between the users' activity patterns are calculated, and, to distinguish the importance of different point-of-interest types, different weights are assigned to different point-of-interest types using the concept of inverse document frequency; third, a point-of-interest embedding layer, analogous to word embedding in natural language processing, is introduced to generate an embedded representation of each point of interest; fourth, a vector representation of each user's activity pattern is generated from the user's activity pattern and the point-of-interest embeddings; and finally, the activity similarity between users is calculated from the generated representations of their activity patterns, the most similar users having the same natural-person identity;

the method specifically comprises the following steps:

(1) extracting activity patterns of a user

The point-of-interest trajectory of a user is represented as $T(u) = p_1, p_2, \ldots, p_t$, where $p_t$ is the point-of-interest type of the address visited by the user at time point $t$ and $u$ denotes the user; because user activity patterns are strongly periodic and predictable, the user's activities must be analyzed day by day, so the point-of-interest trajectory is divided into sub-trajectories $T_{sub}(u)$, each one day long; to better analyze the user's daily activity habits, a day is divided into $m$ time partitions and the points of interest frequently visited by the user are counted in each partition, each entry of this statistic recording that user $u_i$ visited point of interest $p_t$ a total of $n_t$ times in the $j$-th time period; the user's daily activity pattern, denoted $L$, is defined as the collection of these statistics over the $m$ time partitions;

(2) Analyzing and calculating similarity scores of user activity patterns

A new index is introduced to measure the similarity of activity patterns between users in the original space; the intuition behind the similarity score is that similar users tend to appear in similar types of places at similar times, so the co-occurrences of points of interest within each specific time period are counted; for two users, $u_A$ on platform A and $u_B$ on platform B, each has a daily activity pattern, and their time activity similarity $S(u_A, u_B)$ is defined over the frequent point-of-interest statistics of the two users in each $j$-th time period; user linking is thus carried out according to the calculated semantic similarity between users: for a user $u_A$, the most similar user $u_i'$ on platform B is the one with the maximum time activity similarity score, and $u_A$ and $u_i'$, which share the most similar activity pattern, are linked together;

the TF-IDF value is calculated as the weight of each point of interest, giving an improved time activity similarity score $S^{\#}(u_A, u_B)$ whose co-occurrence function applies these weights;

(3) Representation learning of trajectory points of interest

Although the user's daily time activity record $L$ provides statistics of the user's activity pattern, this statistical feature is still insufficient for analysis: first, it cannot capture the differences between distinct points of interest; second, the activity pattern similarity calculated from it is feature-based and cannot be used directly to link user identities further; therefore, a representation-learning-based method is proposed to learn an embedded representation of the user's activity pattern, so that the activity similarity of users can be easily calculated with a classical distance function;

the distribution of a user's points of interest in a trajectory is very similar to the word-frequency distribution in natural language, so word embedding methods from natural language processing can be used to solve the point-of-interest embedding problem, and, inspired by the word2vec model, a POI2vec model is designed to learn low-dimensional embeddings of the points of interest; the conditional probability of a target point of interest $p_t$ given its context points of interest $Context(p_t)$ is defined by a normalized exponential (softmax) function:

$P(p_t \mid Context(p_t)) = \frac{\exp\left(v(p_t)^{\top} v_{Context}\right)}{\sum_{p \in V} \exp\left(v(p)^{\top} v_{Context}\right)}$

where $V$ is the set of all points of interest in the dataset, $v(p_t) \in \mathbb{R}^d$ (where $d$ is the dimension of the low-dimensional space) is the embedded representation of point of interest $p_t$, and $v_{Context}$ is the sum of the vectors of the context points of interest $Context(p_t)$; finally, the training goal of POI2vec is to maximize the average exponential probability over the whole dataset;

(4) Representation learning of user activity patterns

Based on the point-of-interest embeddings obtained in the previous step, the time activity embedding of the user is obtained; in the user's activity pattern $L$, the $k$ points of interest (top-$k$) most frequently visited by the user in each time partition of the day are counted; on the basis of the user's activity pattern $L$ and the point-of-interest embeddings $v(p_t)$ obtained in the previous step, the embedded vector of the user's activity pattern is expressed as a vector of dimension $m \times dim$, where $m$ is the number of time partitions and $dim$ is the dimension of the point-of-interest embedding; if the user has POI records in a time period, the embedding for that period is built from the embeddings of its frequent POIs, weighted by the occurrence frequency and TF-IDF weight of each POI, and the per-period embeddings are joined, where concat denotes the concatenation of vectors, $p_{jl}$ is the $l$-th frequent point of interest of the user in the $j$-th time partition, and $n_{jl}$ is its visit frequency; as in the definition of the time activity similarity score, the TF-IDF weights are incorporated into the representation of the user activity pattern;

if the user has no point-of-interest record in a certain time partition, three strategies are proposed to replace the missing value: 1) replace it with a zero vector; 2) replace it with the most frequent point of interest in the other time partitions; 3) replace it with a weighted average of the points of interest in all other time partitions;

(5) User identity linking

Through the above steps, an embedded representation of each user's time activity is obtained; cosine similarity is commonly used to measure the similarity between two vectors, and the similarity between the activity habits of two users is defined as follows:

$D(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$

where $v_1$ and $v_2$ are the representations of the activity habits of the two users; thus, given a user on one platform, the user with the most similar activity habits is found on another platform in the dataset and the two users are linked, that is, treated as having the same user identity.

2. The cross-platform user identity identification method based on activity similarity according to claim 1, characterized in that, in step (2), to improve the performance of the similarity function $S(u_A, u_B)$, the idea of TF-IDF (term frequency-inverse document frequency) is introduced to distinguish the importance of different points of interest; TF-IDF is a weighting technique commonly used in information retrieval and data mining that reflects how important a word is to a document in a corpus; inspired by TF-IDF, a term frequency and an inverse document frequency are calculated for each point of interest: the term frequency $tf(p_t, t)$ is based on the raw statistics of the point of interest in the trajectory, for example the number of occurrences of point of interest $p_t$ in trajectory $t$; for the inverse document frequency, $N = |T|$ is the number of all trajectories in the dataset and $|\{t \in T : p_t \in t\}|$ is the number of trajectories that contain point of interest $p_t$; the TF-IDF value of the point of interest is then calculated as follows:

$tfidf(p_t, t, T) = tf(p_t, t) \cdot idf(p_t, T)$ (7).

3. The cross-platform user identity identification method based on activity similarity according to claim 1, characterized in that the external semantic information of the user activity trajectory is used to overcome the limitation of physical distance, so that the fixed activity pattern hidden in a user's trajectories can be captured even for trajectories that are physically far apart.

4. The cross-platform user identity identification method based on activity similarity according to claim 1, characterized in that the context information of the user activity trajectory is used to calculate the vector representation of the activity trajectory points by maximizing the probability function $P(p_t \mid Context(p_t))$; the training goal of POI2vec is to maximize the average exponential probability over all trajectory points, so that unsupervised representation learning of the activity trajectory points is achieved by maximizing the average exponential probability over the entire dataset.

5. The cross-platform user identity identification method based on activity similarity according to claim 1, characterized in that the top-k frequent activity places in the users' activity patterns are fully analyzed and, to solve the problem of data sparsity, three strategies are proposed to replace missing values: 1) replace them with zero vectors; 2) replace them with the most frequent point of interest in the other time partitions; 3) replace them with a weighted average of the points of interest in all other time partitions.