CN114118250A - Cross-platform user identity identification method based on activity similarity - Google Patents

Info

Publication number
CN114118250A
Authority
CN
China
Prior art keywords
user
activity
interest
users
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111389814.5A
Other languages
Chinese (zh)
Other versions
CN114118250B (en)
Inventor
李勇军
黄丽蓉
颜兆洁
张银银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202111389814.5A
Publication of CN114118250A
Application granted
Publication of CN114118250B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Human Resources & Organizations (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-platform user identity identification method based on activity similarity. First, the activity patterns of users are extracted by combining the time and semantic information in their activity trajectories. Second, similarity scores between the users' activity patterns are calculated; to distinguish the importance of different point-of-interest types, different weights are assigned to them using the concept of inverse document frequency. Third, a point-of-interest embedding layer, analogous to word embedding in natural language processing, is introduced to generate an embedded representation for each point of interest. Fourth, a vector representation of each user's activity pattern is generated from the user's activity pattern and the point-of-interest embeddings. Finally, the activity similarity of users is calculated from the generated representations, and the most similar users are regarded as having the same natural-person identity. The invention calculates user similarity in semantic space and embeds users' activity habits into a low-dimensional space, so that the best-matching user can be found efficiently for any given user.

Description

Cross-platform user identity identification method based on activity similarity
Technical Field
The invention relates to a trajectory-based cross-platform user identity linking method, and in particular to a method based on activity similarity.
Background
Cross-platform user linking plays a crucial role in numerous applications, such as user interest recommendation and location prediction. Many studies have approached the problem using user attributes and social networks. However, users' attributes and social-network features are inconsistent across service platforms, and, because of privacy concerns, some sensitive information cannot be obtained and used for identity analysis. Unlike platform-specific information, a user's spatiotemporal trajectory can provide stable and consistent identity information. Owing to the popularity of global positioning system tracking, mobile devices record users' spatiotemporal trajectories regardless of which service is being accessed. User identities can therefore be linked by analyzing the spatiotemporal location of user activities.
Existing methods that link user identities through spatiotemporal trajectories mainly match users' spatiotemporal records according to statistical characteristics. By introducing different weight-assignment strategies, they use user visit frequency, distinct locations, and the popularity of "meet" events to calculate the similarity between users. However, these methods cannot capture the dynamic patterns and semantic information of trajectories.
In the traditional trajectory-based user identity linking problem, the longitude and latitude of GPS fixes are generally used to represent trajectory points, but such coordinates carry no semantic information about the trajectory. There have been many efforts to model human movement, but most of them model spatiotemporal regularities; human movement is typically modeled as a random process around a fixed point. The biggest defect of this kind of modeling is that it ignores activity information, that is, the purpose for which people appear at a particular location at a particular time, which is the movement purpose hidden behind the trajectory. In recent years, some research has added semantic information about trajectory points to spatiotemporal trajectories to improve the accuracy of user identity linking. However, these methods use semantic information only as auxiliary information and cannot integrate spatiotemporal information with semantic information. Unlike prior methods, the invention provides a new method that combines the time and semantic information of user trajectories and learns the activity habits hidden in them.
Disclosure of Invention
In order to fully mine the semantic information in user trajectories, the invention provides a representation-learning-based system that converts point-of-interest trajectories into unique activity habits while preserving time and semantic information. First, the activity pattern of the user is extracted by combining the time and semantic information in the activity trajectory. Second, a similarity score between user activity patterns is calculated; to distinguish the importance of different point-of-interest types, the invention uses the concept of inverse document frequency to assign different weights to them. Third, analogous to word embedding in natural language processing, a point-of-interest embedding layer is introduced to generate an embedded representation for each point of interest. Fourth, a vector representation of the user's activity pattern is generated from the user's activity pattern and the point-of-interest embeddings. Finally, the similarity of user activities is calculated from the generated representations, and the most similar users are regarded as having the same natural-person identity.
(1) Extracting activity patterns of a user
The invention represents the point-of-interest trajectory of a user as $T(u) = \{p_1, p_2, \ldots, p_t\}$, where $p_t$ is the point-of-interest type of the address that the user visits at time point $t$, and $u$ denotes the user. Because a user's activity pattern is strongly periodic and predictable, the user's activities need to be analyzed day by day; the invention therefore divides the user's point-of-interest trajectory into sub-trajectories $T_{sub}(u)$, each one day long. To better analyze the user's daily activity habits, the invention divides a day into $m$ time partitions and counts, for each time partition, the points of interest that the user visits frequently
Figure BDA0003368194670000021
where
Figure BDA0003368194670000022
indicates that user $u_i$ visited the point of interest $p_t$ in the $j$-th time period, with $n_t$ visits. The invention defines the user's daily activity pattern as
Figure BDA0003368194670000023
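The extraction step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_daily_patterns`, the `(timestamp, poi_type)` input format, and the `top_k` cut-off are assumptions introduced here, and the exact form of the pattern in the referenced figures is not reproduced in this text.

```python
from collections import Counter, defaultdict
from datetime import datetime


def extract_daily_patterns(visits, m=48, top_k=3):
    """Build per-day activity patterns from (timestamp, poi_type) visits.

    Returns {date: [partition_0, ..., partition_{m-1}]}, where each partition
    is a list of up to top_k (poi_type, count) pairs, most frequent first.
    The data layout is a hypothetical choice made for this sketch.
    """
    minutes_per_partition = 24 * 60 / m
    counts = defaultdict(lambda: [Counter() for _ in range(m)])
    for ts, poi_type in visits:
        # Map the time of day to the index j of its time partition.
        minute_of_day = ts.hour * 60 + ts.minute
        j = int(minute_of_day // minutes_per_partition)
        # One sub-trajectory per calendar day.
        counts[ts.date()][j][poi_type] += 1
    return {
        day: [c.most_common(top_k) for c in partitions]
        for day, partitions in counts.items()
    }


visits = [
    (datetime(2009, 5, 1, 8, 10), "restaurant"),
    (datetime(2009, 5, 1, 8, 20), "restaurant"),
    (datetime(2009, 5, 1, 8, 25), "bank"),
    (datetime(2009, 5, 1, 13, 40), "office"),
    (datetime(2009, 5, 2, 8, 5), "restaurant"),
]
patterns = extract_daily_patterns(visits, m=48, top_k=2)
```

With `m=48`, each partition covers 30 minutes, so the three morning visits on the first day fall into partition 16.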
(2) Analyzing and calculating similarity scores of user activity patterns
The invention introduces a new index to measure the similarity of activity patterns between users in the original space. The intuition behind the similarity score is that similar users tend to appear in similar types of places at similar times. The invention therefore calculates the number of co-occurrences of points of interest within a specific period. For two users, one on platform A and one on platform B, with
Figure BDA0003368194670000024
and
Figure BDA0003368194670000025
the invention defines the time activity similarity of the users as follows:
Figure BDA0003368194670000026
Figure BDA0003368194670000027
Figure BDA0003368194670000028
Figure BDA0003368194670000029
where
Figure BDA00033681946700000210
represents the frequent point-of-interest statistics of user $u_A$ in the $j$-th time period. User linking can therefore be performed according to the computed semantic similarity between users. For user $u_A$, the most similar user $u_i'$ on platform B is the one with the largest time activity similarity score
Figure BDA0003368194670000031
and $u_A$ and $u_i'$ are linked together as sharing the most similar activity pattern.
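The linking rule above (pick the platform-B user with the largest similarity score) can be sketched as follows. The similarity equations themselves are given only as figures that are not reproduced in this text, so the scoring function below is an assumed stand-in: it sums, over time partitions, the overlap `min(n_a, n_b)` of visit counts for point-of-interest types that both users share. All names are hypothetical.

```python
def cooccurrence_score(pattern_a, pattern_b):
    """Sum, over time partitions, the overlap of POI-type visit counts.

    Each pattern is a list of m dicts mapping poi_type -> visit count.
    The overlap min(n_a, n_b) is an assumed stand-in for the patent's
    similarity formula, which is not reproduced in the text.
    """
    score = 0
    for part_a, part_b in zip(pattern_a, pattern_b):
        for poi_type in part_a.keys() & part_b.keys():
            score += min(part_a[poi_type], part_b[poi_type])
    return score


def link_user(pattern_a, candidates_b):
    """Return the platform-B user id with the largest similarity score."""
    return max(
        candidates_b,
        key=lambda uid: cooccurrence_score(pattern_a, candidates_b[uid]),
    )


u_a = [{"restaurant": 2}, {"office": 3}, {}]
platform_b = {
    "b1": [{"bank": 1}, {"office": 1}, {"gym": 2}],
    "b2": [{"restaurant": 1}, {"office": 4}, {}],
}
best = link_user(u_a, platform_b)
```

Any other score with the same "same place type at the same time of day" intuition could be substituted for `cooccurrence_score` without changing the argmax step.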
To improve the performance of $S(u_A, u_B)$, the invention refines the similarity function by introducing the idea of TF-IDF (term frequency-inverse document frequency) to distinguish the importance of different points of interest. TF-IDF is a weighting technique commonly used in information retrieval and data mining to reflect the importance of different words in documents and corpora. Inspired by TF-IDF, the method calculates the term frequency and the inverse document frequency of different points of interest:
Figure BDA0003368194670000032
where
Figure BDA0003368194670000033
represents the raw count of the point of interest in the trajectory, e.g. the number of occurrences of the point of interest $p_t$ in the trajectory.
Figure BDA0003368194670000034
where $N = |T|$ is the number of all trajectories in the data set, and $|\{t \in T : p_t \in t\}|$ is the number of trajectories that contain the point of interest $p_t$. Then, the TF-IDF value of the point of interest is calculated as follows:
$\mathrm{tfidf}(p_t, t, T) = \mathrm{tf}(p_t, t) \cdot \mathrm{idf}(p_t, T)$ (7)
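The weighting in equation (7) can be illustrated with a short sketch. The term-frequency and inverse-document-frequency equations are given only as figures, so the code assumes the standard definitions: tf is the relative frequency of a point-of-interest type within a trajectory, and idf is the natural logarithm of $N$ divided by the number of trajectories containing that type. The function name and input format are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(trajectories):
    """Return one {poi_type: tf-idf} dict per trajectory.

    trajectories: list of lists of POI types (one list per trajectory).
    Assumed definitions: tf = count / trajectory length,
    idf = log(N / number of trajectories containing the POI type).
    """
    n = len(trajectories)
    doc_freq = Counter()
    for traj in trajectories:
        doc_freq.update(set(traj))  # count each type once per trajectory
    weights = []
    for traj in trajectories:
        counts = Counter(traj)
        total = len(traj)
        weights.append({
            p: (c / total) * math.log(n / doc_freq[p])
            for p, c in counts.items()
        })
    return weights


trajectories = [
    ["home", "office", "home", "gym"],
    ["home", "office", "restaurant"],
    ["home", "bank"],
]
weights = tf_idf_weights(trajectories)
```

In this toy data, "home" appears in every trajectory and therefore receives weight zero, while the rarer "gym" outweighs "office" in the first trajectory; this is the intended effect of down-weighting point-of-interest types that everyone visits.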
Taking the TF-IDF value as the weight of each point of interest, the invention designs an improved time activity similarity score $S^{\#}(u_A, u_B)$, whose co-occurrence function is defined as follows:
Figure BDA0003368194670000035
(3) representation learning of track points of interest
Although the user's daily time activity record $L$ captures statistics of the user's activity pattern, this statistical feature is still insufficient for analysis. First, it cannot express how different two points of interest are; for example, the difference between Bank of Beijing and China Construction Bank is significantly smaller than the difference between Bank of Beijing and a Beijing restaurant. Second, the activity-pattern similarity computed above is feature-based and cannot be used directly for further user identity linking. Accordingly, the invention proposes a representation-learning-based method for learning an embedded representation of a user's activity pattern, so that the activity similarity of users can be easily calculated with a classical distance function.
The distribution of the user's interest points in the trajectory is very similar to the word frequency distribution in natural language, so that the word embedding method in natural language processing can be used to solve the problem of embedding the interest points. Inspired by the word2vec model, the invention designs the POI2vec model to learn the low-dimensional embedding of the interest points.
Specifically, similar to the continuous bag-of-words model, the target point of interest $p_t$ can be predicted from its context points of interest, i.e. it is computed by maximizing the probability function
Figure BDA0003368194670000041
The conditional probability
Figure BDA0003368194670000042
is defined by a normalized exponential (softmax) function:
Figure BDA0003368194670000043
where $V$ is the set of all points of interest in the data set,
Figure BDA0003368194670000044
(where $d$ is the dimension of the low-dimensional space) is the vector representation of the point of interest $p_t$, and $v_{Context}$ is the sum vector of the context points of interest $Context(p_t)$. Finally, the training goal of POI2vec is to maximize the average of the exponential probabilities over the data set:
Figure BDA0003368194670000045
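The conditional probability described above can be illustrated with a small sketch. It assumes a single embedding table shared by target and context points of interest and scores each candidate by its dot product with the summed context vector before applying the normalized exponential function; the exact equations are figures not reproduced in this text, and the toy embeddings and function names are hypothetical.

```python
import math


def context_vector(embeddings, context):
    """Sum the embedding vectors of the context POIs (v_Context)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for poi in context:
        for i, x in enumerate(embeddings[poi]):
            total[i] += x
    return total


def conditional_probability(embeddings, target, context):
    """Softmax probability of `target` given the summed context vector."""
    v_ctx = context_vector(embeddings, context)
    scores = {
        poi: sum(a * b for a, b in zip(vec, v_ctx))
        for poi, vec in embeddings.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {poi: math.exp(s - max_score) for poi, s in scores.items()}
    return exp_scores[target] / sum(exp_scores.values())


# Toy 2-dimensional embeddings over a vocabulary V of three POI types.
embeddings = {
    "bank": [1.0, 0.0],
    "atm": [0.9, 0.1],
    "restaurant": [0.0, 1.0],
}
p_atm = conditional_probability(embeddings, "atm", ["bank"])
p_rest = conditional_probability(embeddings, "restaurant", ["bank"])
```

Training would adjust the vectors so that observed targets receive high probability given their contexts; an existing word2vec implementation could presumably be reused for that step by treating each daily sub-trajectory as a sentence.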
(4) representation learning of trace activity patterns
Based on the point-of-interest embeddings obtained in the above step, the time activity embedding of the user can be further obtained. In the user's time activity statistics $L$, the invention counts the $k$ (top-$k$) points of interest most frequently visited by the user in each time partition of the day. On the basis of the time activity statistics $L$ and the point-of-interest embedding $v(p_t)$ obtained in the previous step
Figure BDA0003368194670000046
the user's activity habits are expressed as
Figure BDA0003368194670000047
where $m$ is the number of time partitions and $dim$ is the dimension of the point-of-interest embedding. If the user has POI records in a time period, the embedding of that period is computed from the embeddings of its frequent POIs; according to the occurrence frequency and tf-idf weight of each POI, the embedding of the user in that period is expressed as:
Figure BDA0003368194670000048
where concat denotes vector concatenation, and $p_{jl}$ is the $l$-th frequent point of interest of the user in the $j$-th time partition, with access frequency $n_{jl}$. As in the definition of the time activity similarity score, the invention incorporates TF-IDF weights into the representation of the user's temporal activity pattern.
If the user has no point-of-interest record in a certain time partition, the invention proposes three strategies to replace the missing value: 1) replace it with a zero vector; 2) replace it with the most frequent point of interest in the other time partitions; 3) replace it with a weighted average of the points of interest in all other time partitions.
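The construction of the activity-pattern vector, including the three missing-value strategies, can be sketched as follows. The exact per-partition formula is a figure not reproduced in this text, so the sketch assumes an unnormalized sum of embeddings weighted by visit count times tf-idf weight, and a visit-count-weighted average for strategy 3. All function names and the toy data are hypothetical.

```python
from collections import Counter


def partition_vector(pois, embeddings, weights):
    """Weighted sum of POI embeddings; pois is a list of (poi_type, count)."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for poi, count in pois:
        w = count * weights.get(poi, 1.0)  # frequency times tf-idf weight
        for i, x in enumerate(embeddings[poi]):
            vec[i] += w * x
    return vec


def fill_vector(pattern, embeddings, strategy):
    """Vector used for partitions without any POI record."""
    dim = len(next(iter(embeddings.values())))
    totals = Counter()
    for part in pattern:
        for poi, count in part:
            totals[poi] += count
    if not totals or strategy == 1:
        return [0.0] * dim  # strategy 1: zero vector
    if strategy == 2:
        # strategy 2: embedding of the most frequent POI elsewhere
        return list(embeddings[totals.most_common(1)[0][0]])
    # strategy 3: count-weighted average over all other partitions
    n = sum(totals.values())
    fill = [0.0] * dim
    for poi, count in totals.items():
        for i, x in enumerate(embeddings[poi]):
            fill[i] += count / n * x
    return fill


def embed_pattern(pattern, embeddings, weights, strategy=1):
    """Concatenate the m per-partition vectors into one vector."""
    fill = fill_vector(pattern, embeddings, strategy)
    out = []
    for part in pattern:
        out.extend(partition_vector(part, embeddings, weights) if part else fill)
    return out


embeddings = {"home": [1.0, 0.0], "office": [0.0, 1.0]}
weights = {"home": 0.5, "office": 2.0}
pattern = [[("home", 2)], [], [("office", 1)]]
v1 = embed_pattern(pattern, embeddings, weights, strategy=1)
v2 = embed_pattern(pattern, embeddings, weights, strategy=2)
v3 = embed_pattern(pattern, embeddings, weights, strategy=3)
```

The resulting vector has length $m \times dim$ (here 3 x 2), and only the middle, empty partition differs between the three strategies.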
(5) User identity linking
Through the above steps, an embedded representation of each user's temporal activity may be obtained. Cosine similarity is often used to calculate the similarity between two vectors, and the similarity between the activity habits of two users is defined by the present invention as follows:
Figure BDA0003368194670000049
where $v_1$ and $v_2$ are the representations of the activity habits of the two users. Thus, given a user on one platform, the method of the invention can find the user with the most similar activity habits on another platform in the data set and link the two users, i.e. treat them as having the same user identity.
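The linking step can be sketched with a plain cosine similarity and a ranking of platform-B candidates. The function names, the toy vectors, and the convention of returning 0 for an all-zero vector are assumptions made for this illustration.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm1 * norm2)


def rank_candidates(query_vec, candidates):
    """Return candidate ids sorted by decreasing cosine similarity."""
    return sorted(
        candidates,
        key=lambda uid: cosine_similarity(query_vec, candidates[uid]),
        reverse=True,
    )


query = [1.0, 0.0, 0.0, 2.0]
platform_b = {
    "b1": [0.0, 1.0, 1.0, 0.0],
    "b2": [2.0, 0.0, 0.0, 4.0],
    "b3": [1.0, 1.0, 0.0, 1.0],
}
ranking = rank_candidates(query, platform_b)
linked = ranking[0]  # the most similar platform-B user
```

Because cosine similarity ignores vector length, `b2` (a scaled copy of the query) ranks first even though its visit counts are twice as large.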
The invention extracts each user's unique activity pattern by combining time information and semantic information, so that user similarity can be calculated in semantic space. In addition, the method adopts a representation-learning-based model that embeds users' activity habits into a low-dimensional space, so that the best-matching user can be found efficiently for any given user.
Drawings
FIG. 1 is a flowchart of a cross-platform user identity recognition method based on activity similarity according to the present invention.
Detailed Description
To illustrate the implementation of the method, its processing steps are shown using the GeoLife dataset as an example. GeoLife is a GPS dataset collected by Microsoft Research Asia that records the activity trajectories of 182 users over a period of more than five years (April 2007 to August 2012). To obtain fine-grained activity information, we use the geocoding API provided by Amap (Gaode Map) to obtain detailed point-of-interest information from longitude and latitude.
As shown in Fig. 1, the algorithm that uses activity similarity to identify the same identity of cross-platform users is described as follows:
Input: a user $u_i$ on platform A and the user trajectory data set on platform B
Output: the user $u'_j$ on platform B whose activity habits are most similar to those of $u_i$
Step 1: preprocess the user's activity trajectory and extract the user's activity pattern;
Step 2: calculate the activity similarity of users in the original space according to formulas (1) to (8);
Step 3: obtain the embedded representation $v(p)$ of each point of interest from the POI2vec model;
Step 4: generate the embedded representation $v$ of the user's activity pattern from $v(p)$ and the missing-value replacement strategy;
Step 5: calculate the activity similarity $D(v_i, v_j)$ between users and link the identities of the users with the highest similarity.
The invention uses Acc@K and mean rank as indexes for measuring model performance. Acc@K is defined as follows:
Figure BDA0003368194670000051
where #correctly identified users@K is the number of users whose true counterpart appears among the first K candidates, and #users is the total number of users to be identified. Mean rank is obtained by ranking all candidate users by similarity, with higher similarity ranked closer to the top, and computing, for every user, the rank of the candidate with the same identity; the average of these ranks is reported. A lower mean rank indicates a better method.
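The two metrics can be computed as in the following sketch, which assumes that each query user comes with a ranked candidate list and a known true counterpart. The function names and the toy rankings are hypothetical.

```python
def true_match_rank(ranking, true_id):
    """1-based position of the true counterpart in a ranked candidate list."""
    return ranking.index(true_id) + 1


def acc_at_k(ranks, k):
    """Fraction of users whose true match is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_rank(ranks):
    """Average rank of the true match over all users (lower is better)."""
    return sum(ranks) / len(ranks)


# query user -> (candidates ranked by decreasing similarity, true counterpart)
rankings = {
    "a1": (["b1", "b2", "b3"], "b1"),
    "a2": (["b3", "b2", "b1"], "b2"),
    "a3": (["b1", "b3", "b2"], "b2"),
    "a4": (["b4", "b1", "b2"], "b4"),
}
ranks = [true_match_rank(r, t) for r, t in rankings.values()]
```

Here the true matches sit at ranks 1, 2, 3 and 1, so Acc@1 is 0.5, Acc@2 is 0.75, and the mean rank is 1.75.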
After the trajectories in the data set are analyzed, the invention obtains each user's daily frequent point-of-interest trajectories. Based on the similarity scores defined in equations (1)-(8), the initial user linking results are obtained, as shown in Table 1 below:
Table 1. Data set analysis with original activity similarity. (Ratio: proportion of users whose most similar user is themselves, computed with equation (1) without tf-idf weights; Ratio#: the same proportion computed with tf-idf weights.)
Dataset    Ratio    Ratio#
GeoLife    94.5%    98.9%
It can be seen that the activity similarity calculation formula defined by the invention can identify the cross-platform user identity to a great extent.
Based on the POI2vec model provided by the invention, low-dimensional embeddings of the different points of interest in the activity trajectories can be obtained. For the model parameters, the invention sets the point-of-interest embedding dimension to 80 and the context window size to 3 (1.5 hours). A larger embedding dimension retains more of the original semantic information, but the performance gain diminishes as the dimension increases. As for the window size, a value that is too small may fail to capture the correlation between points of interest, whereas a window that is too large may introduce errors and reduce model performance. We therefore recommend an embedding dimension of 80 and a context window size of 3.
Based on the POI2vec model, the invention uses a POI embedding dictionary to convert a user activity pattern $L$ into an embedded representation $v$ of that pattern. More specifically, we divide a day into 48 equal-length slices (30 minutes per slice). The parameter is set to 48 because it provides the best granularity without suffering from data sparsity. The invention marks time slices that lack point-of-interest information as the "missing" type. We represent the frequent points of interest within a time partition by a weighted sum of their vectors, and finally concatenate the embedded vectors of the 48 time slices into an embedded representation of the user's activity habits.
In this experiment, the invention investigates the effect of different replacement strategies for the "missing" type on the similarity ranking of user temporal activity embeddings. Strategy 1 replaces missing values with zero vectors, strategy 2 replaces the "missing" type with the POI that is most common in other time periods, and strategy 3 replaces the "missing" type with the weighted sum of the vectors of all POIs in other time periods. In the POI-type trajectories obtained, 36.39% of the time slices in the GeoLife dataset are marked as missing. Table 2 shows the effect of the three strategies on the similarity ranking of user embeddings. Strategy 3 works best, since it takes into account the user's behavior habits throughout the recorded periods of the day.
Table 2. Mean rank (MR) of user temporal activity embedding similarity
Dataset    MR@strategy 1    MR@strategy 2    MR@strategy 3
GeoLife    2.2197           13.3021          1.9175

Claims (5)

1. A method for identifying the same identity of cross-platform users based on activity similarity, characterized in that:
first, the activity patterns of users are extracted by combining the time and semantic information in their activity trajectories; second, similarity scores between the users' activity patterns are calculated, and, in order to distinguish the importance of different point-of-interest types, different weights are assigned to different point-of-interest types using the concept of inverse document frequency; third, a point-of-interest embedding layer, analogous to word embedding in natural language processing, is introduced to generate an embedded representation for each point of interest; fourth, a vector representation of each user's activity pattern is generated from the user's activity pattern and the point-of-interest embeddings; and finally, the activity similarity of users is calculated from the generated representations of their activity patterns, the most similar users being regarded as having the same natural-person identity;
the method specifically comprises the following steps:
(1) extracting activity patterns of a user
The point-of-interest trajectory of a user is represented as $T(u) = \{p_1, p_2, \ldots, p_t\}$, where $p_t$ is the point-of-interest type of the address visited by the user at time point $t$ and $u$ denotes the user; because the user's activity pattern is strongly periodic and predictable, the user's activities need to be analyzed day by day, so the user's point-of-interest trajectory is divided into sub-trajectories $T_{sub}(u)$, each one day long; to better analyze the user's daily activity habits, a day is divided into $m$ time partitions, and the points of interest frequently visited by the user are counted for each time partition
Figure FDA0003368194660000011
where
Figure FDA0003368194660000012
indicates that user $u_i$ visited the point of interest $p_t$ in the $j$-th time period, with $n_t$ visits; the daily activity pattern of the user is defined as
Figure FDA0003368194660000013
(2) Analyzing and calculating similarity scores of user activity patterns
A new index is introduced to measure the similarity of activity patterns between users in the original space; the intuition behind the similarity score is that similar users tend to appear in similar types of places at similar times; the number of co-occurrences of points of interest within a specific period is therefore calculated; for two users, one on platform A and one on platform B, with
Figure FDA0003368194660000014
and
Figure FDA0003368194660000015
the time activity similarity of the users is defined as follows:
Figure FDA0003368194660000016
Figure FDA0003368194660000017
Figure FDA0003368194660000018
Figure FDA0003368194660000019
where
Figure FDA0003368194660000021
represents the frequent point-of-interest statistics of user $u_A$ in the $j$-th time period; user linking is therefore performed according to the computed semantic similarity between users: for user $u_A$, the most similar user $u_i'$ on platform B is the one with the largest time activity similarity score
Figure FDA0003368194660000022
and $u_A$ and $u_i'$ are linked together as sharing the most similar activity pattern;
the TF-IDF value is calculated as the weight of each point of interest, giving an improved time activity similarity score $S^{\#}(u_A, u_B)$ whose co-occurrence function is defined as follows:
Figure FDA0003368194660000023
(3) representation learning of track points of interest
Although the user's daily time activity record $L$ captures statistics of the user's activity pattern, this statistical feature is still insufficient for analysis: first, it cannot distinguish the differences between different points of interest; second, the activity-pattern similarity calculated above is feature-based and cannot be used for further user identity linking; a representation-learning-based method is therefore provided for learning an embedded representation of the user's activity pattern, so that the activity similarity of users can be easily calculated with a classical distance function;
the distribution of a user's points of interest in a trajectory is very similar to the word-frequency distribution in natural language, so word embedding methods from natural language processing can be used to solve the point-of-interest embedding problem; inspired by the word2vec model, a POI2vec model is designed to learn low-dimensional embeddings of the points of interest;
specifically, similar to the bag-of-words model, the target point of interest $p_t$ can be predicted from its context points of interest, i.e. it is computed by maximizing the probability function
Figure FDA0003368194660000024
where the conditional probability
Figure FDA0003368194660000025
is defined by a normalized exponential function:
Figure FDA0003368194660000026
where $V$ is the set of all points of interest in the data set,
Figure FDA0003368194660000027
(where $d$ is the dimension of the low-dimensional space) is the vector representation of the point of interest $p_t$, and $v_{Context}$ is the sum vector of the context points of interest $Context(p_t)$; finally, the training goal of POI2vec is to maximize the average of the exponential probabilities over the data set:
Figure FDA0003368194660000028
(4) representation learning of user activity patterns
Based on the point-of-interest embeddings obtained in the above step, the time activity embedding of the user is further obtained; in the user's activity pattern $L$, the $k$ (top-$k$) points of interest most frequently visited by the user in each time partition of the day are counted; on the basis of the user's activity pattern $L$ and the point-of-interest embedding obtained in the previous step
Figure FDA0003368194660000036
the embedded vector of the user's activity pattern is expressed as
Figure FDA0003368194660000032
where $m$ is the number of time partitions and $dim$ is the dimension of the point-of-interest embedding; if the user has POI records in a time period, the embedding of that period is computed from the embeddings of its frequent POIs, and, according to the occurrence frequency and tf-idf weight of each POI, the embedded vector of the user in that period is expressed as:
Figure FDA0003368194660000033
where concat denotes vector concatenation, and $p_{jl}$ is the $l$-th frequent point of interest of the user in the $j$-th time partition, with access frequency $n_{jl}$; as in the definition of the time activity similarity score, TF-IDF weights are incorporated into the representation of the user's activity pattern;
if the user has no point-of-interest record in a certain time partition, three strategies are proposed to replace the missing value: 1) replace it with a zero vector; 2) replace it with the most frequent point of interest in the other time partitions; 3) replace it with a weighted average of the points of interest in all other time partitions;
(5) user identity linking
Through the above steps, an embedded representation of each user's time activity is obtained; cosine similarity is commonly used to calculate the similarity between two vectors, and the similarity between the activity habits of two users is defined as follows:
Figure FDA0003368194660000034
where $v_1$ and $v_2$ are the representations of the activity habits of the two users; thus, given a user on one platform, the user with the most similar activity habits is found on the other platform in the data set, and the two users are linked, i.e. regarded as having the same user identity.
2. The method for identifying the same identity of cross-platform users based on activity similarity according to claim 1, characterized in that, in step (2), in order to improve the performance of $S(u_A, u_B)$, the similarity function is improved by introducing the idea of TF-IDF (term frequency-inverse document frequency) to distinguish the importance of different points of interest; TF-IDF is a weighting technique commonly used in information retrieval and data mining to reflect the importance of different words in documents and corpora; inspired by TF-IDF, the term frequency and the inverse document frequency of different points of interest are calculated:
Figure FDA0003368194660000035
wherein
Figure FDA0003368194660000044
represents the raw count of the point of interest in the trajectory, e.g., the number of occurrences of the point of interest p_t in the trajectory;
Figure FDA0003368194660000041
where N = |T| is the number of all trajectories in the data set, and |{t ∈ T : p_t ∈ t}| is the number of trajectories that contain the point of interest p_t; then, the TF-IDF weight of the point of interest is calculated as follows:
tfidf(p_t, t, T) = tf(p_t, t) · idf(p_t, T) (7).
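A small worked example may make the weighting in this claim concrete. The sketch below assumes the common form idf = log(N / number of trajectories containing the point of interest), since the patent's idf formula appears only as a figure; the function name poi_tfidf and the toy trajectories are hypothetical.

```python
import math
from collections import Counter


def poi_tfidf(trajectories):
    """Compute a TF-IDF weight for every POI in every trajectory.

    trajectories: list of trajectories, each a list of POI ids.
    tf is the raw count of the POI in the trajectory; idf is assumed to be
    log(N / df), where df is the number of trajectories containing the POI.
    Returns one {poi_id: weight} dict per trajectory.
    """
    n = len(trajectories)
    df = Counter(p for t in trajectories for p in set(t))
    weights = []
    for t in trajectories:
        tf = Counter(t)
        weights.append({p: tf[p] * math.log(n / df[p]) for p in tf})
    return weights


# Toy usage: "home" occurs in every trajectory, so its weight drops to zero.
trajs = [["home", "cafe", "cafe"], ["home", "gym"]]
for w in poi_tfidf(trajs):
    print(w)
```

With this form, a point of interest visited in every trajectory carries no weight, while one that is rare across trajectories but frequent within one trajectory is emphasized.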
3. The cross-platform user identity identification method based on activity similarity according to claim 1, characterized in that external semantic information of the user activity trajectories is used to overcome the limitation of physical distance, so that the hidden, fixed activity patterns of users can be captured even for trajectories that are physically far apart.
4. The cross-platform user identity identification method based on activity similarity according to claim 1, characterized in that the context information of the user activity trajectories is utilized and, by maximizing the probability function
Figure FDA0003368194660000042
the vector representation of the activity trajectory points is calculated; finally, the training goal of POI2vec is to maximize the average of the exponential probabilities over all trajectory points:
Figure FDA0003368194660000043
unsupervised representation learning of the activity trajectory points is thus achieved by maximizing the average exponential probability over the entire data set.
5. The cross-platform user identity identification method based on activity similarity according to claim 1, characterized in that the top-k frequent activity places in the activity patterns of the users are analyzed and, in order to solve the problem of data sparsity, three strategies are proposed to replace the missing values: 1) replace the missing values with zero vectors; 2) replace them with the most frequent points of interest in other time partitions; 3) replace them with a weighted average of the points of interest in all other time partitions.
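As a rough illustration of a context-based objective of this kind, the sketch below evaluates the standard skip-gram (word2vec-style) average log-probability over toy trajectories. This is an assumption: the patent's probability function and objective are given only as figures, and the published POI2vec model differs from a plain softmax. The names context_log_prob, average_log_prob, in_vecs, out_vecs and window are hypothetical.

```python
import math


def context_log_prob(center, context, in_vecs, out_vecs):
    """Log softmax probability of observing `context` near `center`.

    in_vecs / out_vecs: dicts mapping POI id -> embedding (list of floats).
    """
    scores = {p: sum(a * b for a, b in zip(in_vecs[center], out_vecs[p]))
              for p in out_vecs}
    top = max(scores.values())  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return scores[context] - log_z


def average_log_prob(trajectories, in_vecs, out_vecs, window=1):
    """Average log-probability of all (center, context) pairs within `window`."""
    total, pairs = 0.0, 0
    for traj in trajectories:
        for i, center in enumerate(traj):
            lo, hi = max(0, i - window), min(len(traj), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    total += context_log_prob(center, traj[j], in_vecs, out_vecs)
                    pairs += 1
    return total / pairs if pairs else 0.0


# Toy usage: with all-zero embeddings every POI is equally likely.
pois = ["home", "cafe", "gym"]
zeros = {p: [0.0, 0.0] for p in pois}
print(average_log_prob([["home", "cafe", "gym"]], zeros, zeros))
```

Training would adjust the embeddings to increase this value, for example by gradient ascent; the sketch only evaluates the objective and does not implement the optimization.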
CN202111389814.5A 2021-11-22 2021-11-22 Cross-platform user identity recognition method based on activity similarity Active CN114118250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111389814.5A CN114118250B (en) 2021-11-22 2021-11-22 Cross-platform user identity recognition method based on activity similarity

Publications (2)

Publication Number Publication Date
CN114118250A (en) 2022-03-01
CN114118250B (en) 2024-04-12

Family

ID=80439634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111389814.5A Active CN114118250B (en) 2021-11-22 2021-11-22 Cross-platform user identity recognition method based on activity similarity

Country Status (1)

Country Link
CN (1) CN114118250B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013107669A1 (en) * 2012-01-20 2013-07-25 Telefónica, S.A. A method for the automatic detection and labelling of user point of interest
CN104268171A (en) * 2014-09-11 2015-01-07 Northeastern University Activity similarity and social trust based social networking website friend recommendation system and method
CN107194434A (en) * 2017-06-16 2017-09-22 China University of Mining and Technology Mobile object similarity calculation method and system based on spatio-temporal data
CN109726336A (en) * 2018-12-21 2019-05-07 Chang'an University POI recommendation method combining travel interest and social preference
US20190149626A1 (en) * 2017-11-15 2019-05-16 Target Brands, Inc. Similarity learning-based device attribution
US20200402019A1 (en) * 2019-06-18 2020-12-24 Capital One Services, Llc Techniques to apply machine learning to schedule events of interest

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
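张莹; 李智; 张省: "User trajectory similarity algorithm for location-based social networks", Journal of Sichuan University (Engineering Science Edition), no. 2, 1 July 2013 (2013-07-01) *
胡德敏; 杨晨: "A point-of-interest recommendation model based on multiple types of contextual information", Application Research of Computers, no. 06, 14 June 2017 (2017-06-14) *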

Also Published As

Publication number Publication date
CN114118250B (en) 2024-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant