CN111159543B - Personalized tourist place recommendation method based on multi-level visual similarity - Google Patents

Personalized tourist place recommendation method based on multi-level visual similarity

Info

Publication number: CN111159543B (granted patent; application publication CN111159543A)
Application number: CN201911311868.2A
Authority: CN (China)
Legal status: Active
Inventors: 陈岭, 吕丹丹
Original assignee: Zhejiang University (ZJU)

Classifications

    • G06F16/9535 — Search customisation based on user profiles and personalisation
    • G06F16/535 — Filtering of still image data based on additional data, e.g. user or group profiles
    • G06F16/55 — Clustering; classification of still image data
    • G06F16/583 — Retrieval of still image data using metadata automatically derived from the content
    • G06Q50/14 — ICT specially adapted for travel agencies

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a personalized tourist place recommendation method based on the multi-level visual similarity of geotagged photos, comprising the following steps: 1) preprocess the geotagged photo set, cluster it to obtain travel places, and extract the number of times each user visits each place; 2) obtain the visual features of each photo with a VGG16 model; 3) compute weights for different photos with a self-attention mechanism to obtain visual representations of users and travel places; 4) sample hidden vectors of users and travel places from these visual representations, and predict the number of times a user visits a place from the hidden vectors; 5) train the model with an overall loss composed of a quintuple loss, an accuracy loss and regularization terms to obtain parameter-optimized models; 6) given a query, recommend to the querying user travel places in the query city that may interest them. The method mines user travel preferences from the geotagged photo set and recommends travel places the user may be interested in.

Description

Personalized tourist place recommendation method based on multi-level visual similarity
Technical Field
The invention relates to the technical field of information recommendation, and in particular to a personalized tourist place recommendation method based on the multi-level visual similarity of geotagged photos.
Background
In recent years, with the rapid development of the mobile internet, smartphones and photo-sharing websites (such as Flickr, Panoramio and Instagram), a large number of geotagged photos have appeared on the internet, and the number of geotagged photos contributed by users continues to grow rapidly. Based on these geotagged photos (hereinafter, photos), tourist locations (hereinafter, locations) in a city can be mined and the travel preferences (hereinafter, preferences) of tourists can be analyzed, so as to provide personalized location recommendation services for users.
Early photo-mining-based place recommendation methods usually compute the similarity between users directly from the number of times users visit places, and then recommend places using user-based collaborative filtering. To improve recommendation performance, place recommendation methods that introduce various kinds of additional information have appeared. With the development of deep neural networks, the visual content of photos is receiving more and more attention. Existing visual-content-based methods typically first extract features from the visual content of photos, and then train a recommendation model using these features as priors combined with user histories. These methods fail to extract visual features suited to place recommendation, because the feature extraction is guided mainly by computer vision tasks unrelated to recommendation.
To address this problem, prior work proposed a visual-content-enhanced point-of-interest (POI) recommendation method that extracts features from the visual content of photos, classifies photos by photographer and place, and factorizes the user-POI check-in matrix for personalized recommendation. However, given a photo, this approach uses the user information and the location information independently to divide other photos into visually similar or dissimilar groups, and thus cannot cross the user and location information of photos to define multiple levels of similarity. Furthermore, it does not consider how important different photos are to a user or a location.
Disclosure of Invention
The technical problem to be solved by the invention is how to fully utilize the visual differences between photos taken by different users at different places to obtain user preferences and place characteristics, so as to provide personalized place recommendation services for users.
In order to solve the technical problem, the personalized tourist site recommendation method based on the multilevel visual similarity of the geotagged photos provided by the invention comprises the following steps:
(1) preprocessing the geotagged photo set, clustering to obtain a travel location set, and extracting the user set and the number of times each user visits each travel location;
(2) obtaining visual characteristics of the photo by using a VGG16 model;
(3) calculating weight values for different photos by adopting a self-attention mechanism to obtain visual representations of the user and the place, and obtaining hidden vectors of the user and the place according to the visual representations of the user and the place;
(4) predicting the number of times of the user accessing the location according to the user hidden vector and the location hidden vector;
(5) constructing quintuple loss of the photo according to visual features of the photo, constructing a user regular term according to a user hidden vector, constructing a place regular term according to a place hidden vector, constructing accuracy loss according to access times, calculating total loss according to the quintuple loss, the user regular term, the place regular term and the accuracy loss, and iteratively optimizing model parameters of a VGG16 model and a weight coefficient of an attention mechanism by using the total loss;
(6) for a query task comprising a query user and a query city, retrieving all candidate places in the query city, and computing the query user's preference values for the candidate places according to the query-user hidden vector and the candidate-place hidden vectors obtained in step (3), thereby realizing personalized tourist place recommendation.
Compared with the prior art, the invention has at least the following advantages:
1) By crossing the user and location information of photos, multi-level visual similarity is defined and a corresponding quintuple loss is introduced to learn visual representations of photos, fully exploiting the visual differences between photos taken by different users at different places.
2) A self-attention network is used to infer the weight of each photo when characterizing a user or a location, capturing the importance of different photos.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a personalized travel location recommendation method based on multi-level visual similarity of geotagged photos according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of a personalized travel location recommendation method based on multi-level visual similarity of geotagged photos according to an embodiment of the present invention. Referring to fig. 1, the personalized tourist site recommendation method includes the steps of:
step 1, inputting a photo set P, clustering photos by using a density-based clustering method, and extracting a location set L; and simultaneously extracting a user set U.
Users typically take pictures at locations where they are of more interest, and if a large number of users take pictures at a location, the location may be considered a location. Clustering the photos according to the longitude and latitude position information corresponding to the photos by using a density-based clustering method (such as P-DBSCAN), wherein each obtained cluster represents a place, and the clustering center is the position of the place. Through the process, a site set L ═ L is excavated1,l2,…,l|L|And f, wherein l is (c, g), c is a city where the location l is located, and g is latitude and longitude information of l. Further, a user set U ═ { U } is extracted from the photographer information of the photograph1,u2,…,u|U|}。
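The patent names P-DBSCAN; as a hedged illustration, the following is a minimal plain-DBSCAN sketch over 2-D photo coordinates (treated as Euclidean, which a real latitude/longitude dataset would not be — it would need a haversine metric and a spatial index). All names here are illustrative, not from the patent.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: label each point with a cluster id; -1 = noise."""
    n = len(points)
    labels = np.full(n, -1)
    # Pairwise Euclidean distances (fine for a tiny demo only).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue
        # Grow a new cluster from core point i by breadth-first expansion.
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])
        cluster += 1
    return labels

# Two dense photo groups -> two mined "locations";
# each cluster center is the location's position.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = dbscan(pts, eps=0.5, min_pts=2)
centers = [pts[labels == c].mean(axis=0) for c in range(labels.max() + 1)]
```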
Step 2: input the photo set P, the user set U and the location set L, and extract the user visit history V.

For each user-location pair (u_i ∈ U, l_j ∈ L), first sort the photos taken by the i-th user u_i at the j-th location l_j by capture time. Since a user may take several photos during one visit, if the time intervals between several consecutive photos taken by u_i at l_j are all less than a given time threshold Δt, these photos are regarded as belonging to the same visit, and the mean of their capture times is taken as the visit time t; the visit is then expressed as (u_i, l_j, t). Processing the photos of all user-location pairs in this way yields the user visit history V = {(u_i, l_j, t)}, where (u_i, l_j, t) denotes that user u_i visited location l_j at time t.
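The Δt session-splitting rule above can be sketched as follows; the function name and the use of plain numeric timestamps are illustrative assumptions, not from the patent.

```python
from statistics import mean

def split_visits(timestamps, delta_t):
    """Group one user's photo timestamps at one location into visits:
    consecutive photos closer than delta_t belong to the same visit,
    and each visit is summarised by the mean capture time."""
    ts = sorted(timestamps)
    visits, current = [], [ts[0]]
    for t in ts[1:]:
        if t - current[-1] < delta_t:
            current.append(t)             # same visit continues
        else:
            visits.append(mean(current))  # close the previous visit
            current = [t]
    visits.append(mean(current))
    return visits

# Photos at t = 0, 10, 20 and t = 500, 510 with delta_t = 60 -> two visits.
times = split_visits([0, 10, 20, 500, 510], delta_t=60)
```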
Step 3: input the user visit history V and extract the user visit counts M.

From V, count the number of times each user visited each location, obtaining the visit counts M = {c_ij | 1 ≤ i ≤ |U|, 1 ≤ j ≤ |L|}, where c_ij is the number of times user u_i visited location l_j.
Step 4: divide the user set U and the location set L into N batches, and batch the photo set P and the visit counts M according to the users and locations of each batch.

U and L are divided according to a manually chosen total number of batches N, forming {U_1, U_2, ..., U_N} and {L_1, L_2, ..., L_N}. For each batch of users U_m and locations L_m (1 ≤ m ≤ N), find from the photo set P all photos taken by users in U_m and all photos taken at locations in L_m to form the batch photo set P_m; likewise, find from M the counts of visits by users in U_m to locations in L_m, forming the batch visit counts M_m.

Step 5: take a batch of training samples U_m, L_m, P_m and M_m with index m (1 ≤ m ≤ N).
Step 6: input each photo p_k ∈ P_m into the VGG16 model to obtain its visual feature v_k.

VGG16 is a classic deep learning model for image classification, comprising 16 weight layers (13 convolutional layers and 3 fully connected layers). The method uses the first 14 layers of VGG16 (removing the last 2 fully connected layers) to extract the visual feature v_k of photo p_k.
Step 7: for each user u_i ∈ U_m, perform steps 8-9.

Step 8: fuse the visual features of the photos taken by user u_i with a self-attention mechanism to obtain the visual representation u_i of the user.

First, stack the visual features of u_i's photos in order of capture time into a matrix UP_i, each row of which is the visual feature of one photo. The self-attention fusion is computed as:

ua_i = softmax(w_U tanh(V_U UP_i^T))
u_i = ua_i UP_i

where w_U and V_U are learnable parameters of the self-attention mechanism, and ua_i is the weight vector over the photos; the softmax function ensures that all computed weights sum to 1. Using the weights in ua_i, the rows of UP_i are summed to obtain the visual representation u_i of user u_i.
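The self-attention pooling formulas above can be sketched in numpy; the demo dimensions (feature size 8, attention size 4, 5 photos) are arbitrary assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(UP, w, V):
    """Fuse a stack of photo features UP (n_photos x d) into one visual
    representation: ua = softmax(w tanh(V UP^T)), pooled = ua UP."""
    ua = softmax(w @ np.tanh(V @ UP.T))   # (n_photos,) photo weights
    return ua, ua @ UP                    # weights and pooled representation

rng = np.random.default_rng(0)
d, da, n = 8, 4, 5                        # demo sizes (assumed)
UP = rng.normal(size=(n, d))              # one feature row per photo
w = rng.normal(size=da)
V = rng.normal(size=(da, d))
ua, u_i = attention_pool(UP, w, V)
```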
Step 9: sample the user hidden vector U_i from a Gaussian distribution with mean u_i and variance σ_U² I_U.

Considering that user preferences depend primarily on visual information but may also be influenced by other factors, the user hidden vector U_i is assumed to be drawn from a Gaussian distribution with mean u_i (the visual information) and variance σ_U² I_U (the other factors), where I_U is an all-ones vector of the same length as u_i.
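A minimal sketch of this sampling step, assuming an isotropic Gaussian with scalar σ (the toy mean vector and σ value are illustrative):

```python
import numpy as np

def sample_hidden(mean_vec, sigma, rng):
    """Draw a hidden vector from N(mean, sigma^2 * I): the visual
    representation supplies the mean, sigma^2 accounts for the
    non-visual 'other factors'."""
    return mean_vec + sigma * rng.normal(size=mean_vec.shape)

rng = np.random.default_rng(42)
u_vis = np.array([0.5, -1.0, 2.0])       # toy visual representation
U_i = sample_hidden(u_vis, sigma=0.1, rng=rng)
```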
Step 10: for each location l_j ∈ L_m, perform steps 11-12.

Step 11: fuse the visual features of the photos taken at location l_j with a self-attention mechanism to obtain the visual representation l_j of the location.

First, stack the visual features of the photos taken at l_j in order of capture time into a matrix LP_j, each row of which is the visual feature of one photo. The self-attention fusion is computed as:

la_j = softmax(w_L tanh(V_L LP_j^T))
l_j = la_j LP_j

where w_L and V_L are learnable parameters of the self-attention mechanism, and la_j is the weight vector over the photos. Using the weights in la_j, the rows of LP_j are summed to obtain the visual representation l_j of location l_j.

Step 12: sample the location hidden vector L_j from a Gaussian distribution with mean l_j and variance σ_L² I_L.

Considering that location characteristics depend primarily on visual information but may also be affected by other factors, the location hidden vector L_j is assumed to be drawn from a Gaussian distribution with mean l_j (the visual information) and variance σ_L² I_L (the other factors), where I_L is an all-ones vector of the same length as l_j.
Step 13: for each visit count c_ij ∈ M_m, model the observed count as a sample from a Gaussian distribution with mean U_i·L_j and variance σ².

Considering that the number of visits depends primarily on user preference and location characteristics but may also be affected by noise, the number of times user u_i visits location l_j is assumed to be drawn from a Gaussian distribution with mean U_i·L_j (the inner product of the hidden vectors) and variance σ² (the noise).
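A toy sketch of the visit-count model above (hidden vectors and σ are made-up values):

```python
import numpy as np

rng = np.random.default_rng(1)
U_i = np.array([1.0, 0.5, -0.2])    # toy user hidden vector
L_j = np.array([2.0, 1.0, 0.5])     # toy location hidden vector
sigma = 0.05

mean_count = U_i @ L_j                       # preference = inner product
c_hat = mean_count + sigma * rng.normal()    # count model with Gaussian noise
```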
Step 14: mine the quintuple set Q_m for training from the batch photos P_m.

Two photos p_o and p_s taken by the same user at the same location, a photo p_u taken by another user at the same location, a photo p_l taken by the same user at another location, and a photo p_n taken by another user at another location constitute a quintuple (p_o, p_s, p_u, p_l, p_n). After training, a quintuple should satisfy the multi-level visual similarity constraints, formalized as:

‖v_o − v_s‖² + m_1 ≤ ‖v_o − v_u‖²
‖v_o − v_s‖² + m_2 ≤ ‖v_o − v_l‖²
‖v_o − v_s‖² + m_3 ≤ ‖v_o − v_n‖²
‖v_o − v_u‖² + m_6 ≤ ‖v_o − v_l‖²
‖v_o − v_u‖² + m_4 ≤ ‖v_o − v_n‖²
‖v_o − v_l‖² + m_5 ≤ ‖v_o − v_n‖²

where v_o, v_s, v_u, v_l and v_n are the visual features of p_o, p_s, p_u, p_l and p_n, and m_1, ..., m_6 are the minimum visual distance margins that must hold between the corresponding photo pairs, satisfying m_1 < m_2 < m_3 and m_4 < m_5.

Quintuples that already satisfy the multi-level visual similarity constraints contribute nothing to training and slow down convergence. To ensure fast convergence, for each p_o ∈ P_m, every other photo taken by the same user at the same location is selected as p_s, and the photos in P_m that still violate the constraints, i.e. that satisfy the following inequalities, are selected as p_u, p_l and p_n:

‖v_o − v_u‖² < ‖v_o − v_s‖² + m_1
‖v_o − v_l‖² < ‖v_o − v_s‖² + m_2
‖v_o − v_n‖² < ‖v_o − v_s‖² + m_3
‖v_o − v_l‖² < ‖v_o − v_u‖² + m_6
‖v_o − v_n‖² < ‖v_o − v_u‖² + m_4
‖v_o − v_n‖² < ‖v_o − v_l‖² + m_5

In this way, every photo in P_m yields training quintuples, forming the quintuple set Q_m = {(p_o, p_s, p_u, p_l, p_n)}.
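The hard-quintuple mining above can be sketched in plain Python. For brevity this sketch filters each negative only against the anchor-positive distance plus its margin (the m_1-m_3 conditions); the function name, the photo-dictionary fields and the margin values are illustrative assumptions, not from the patent.

```python
import numpy as np

def sq_dist(a, b):
    return float(np.sum((a - b) ** 2))

def mine_quintuples(photos, margins):
    """Hard-quintuple mining sketch. `photos` is a list of dicts with
    'user', 'loc' and 'feat'. For each anchor p_o, pair it with every
    same-user/same-location p_s, and keep only negatives still within
    their margin (already-satisfied quintuples give zero loss)."""
    m1, m2, m3, m4, m5, m6 = margins
    out = []
    for o in photos:
        for s in photos:
            if s is o or s['user'] != o['user'] or s['loc'] != o['loc']:
                continue
            d_os = sq_dist(o['feat'], s['feat'])
            us = [p for p in photos if p['loc'] == o['loc'] and p['user'] != o['user']
                  and sq_dist(o['feat'], p['feat']) < d_os + m1]
            ls = [p for p in photos if p['user'] == o['user'] and p['loc'] != o['loc']
                  and sq_dist(o['feat'], p['feat']) < d_os + m2]
            ns = [p for p in photos if p['user'] != o['user'] and p['loc'] != o['loc']
                  and sq_dist(o['feat'], p['feat']) < d_os + m3]
            for u in us:
                for l in ls:
                    for n in ns:
                        out.append((o, s, u, l, n))
    return out

# Tiny demo: two users ('a', 'b'), two locations ('X', 'Y').
photos = [
    {'user': 'a', 'loc': 'X', 'feat': np.array([0.0, 0.0])},
    {'user': 'a', 'loc': 'X', 'feat': np.array([0.1, 0.0])},
    {'user': 'b', 'loc': 'X', 'feat': np.array([0.2, 0.0])},
    {'user': 'a', 'loc': 'Y', 'feat': np.array([0.3, 0.0])},
    {'user': 'b', 'loc': 'Y', 'feat': np.array([0.4, 0.0])},
]
quints = mine_quintuples(photos, margins=(1, 1, 1, 1, 1, 1))
```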
Step 15: for each mined quintuple (p_o, p_s, p_u, p_l, p_n), compute the corresponding quintuple loss L_Q.

Each inequality of the multi-level visual similarity in the previous step corresponds to a triplet loss:

L_1 = [‖v_o − v_s‖² − ‖v_o − v_u‖² + m_1]_+
L_2 = [‖v_o − v_s‖² − ‖v_o − v_l‖² + m_2]_+
L_3 = [‖v_o − v_s‖² − ‖v_o − v_n‖² + m_3]_+
L_4 = [‖v_o − v_u‖² − ‖v_o − v_n‖² + m_4]_+
L_5 = [‖v_o − v_l‖² − ‖v_o − v_n‖² + m_5]_+
L_6 = [‖v_o − v_u‖² − ‖v_o − v_l‖² + m_6]_+

where [x]_+ takes the value x when x is positive and 0 otherwise. The triplet losses are summed to obtain the final quintuple loss:

L_Q = L_1 + L_2 + L_3 + L_4 + L_5 + L_6
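A numpy sketch of the quintuple loss as reconstructed above (the margin values and 1-D toy features are made up for illustration):

```python
import numpy as np

def hinge(x):
    return max(x, 0.0)

def quintuple_loss(v_o, v_s, v_u, v_l, v_n, margins):
    """Sum of the six triplet hinge losses of the multi-level
    visual similarity constraints."""
    m1, m2, m3, m4, m5, m6 = margins
    d = lambda a, b: float(np.sum((a - b) ** 2))
    L1 = hinge(d(v_o, v_s) - d(v_o, v_u) + m1)
    L2 = hinge(d(v_o, v_s) - d(v_o, v_l) + m2)
    L3 = hinge(d(v_o, v_s) - d(v_o, v_n) + m3)
    L4 = hinge(d(v_o, v_u) - d(v_o, v_n) + m4)
    L5 = hinge(d(v_o, v_l) - d(v_o, v_n) + m5)
    L6 = hinge(d(v_o, v_u) - d(v_o, v_l) + m6)
    return L1 + L2 + L3 + L4 + L5 + L6

v_o, v_s = np.array([0.0]), np.array([0.0])
v_u, v_l, v_n = np.array([1.0]), np.array([2.0]), np.array([3.0])
margins = (0.5, 1.0, 1.5, 0.5, 1.0, 0.5)

well_separated = quintuple_loss(v_o, v_s, v_u, v_l, v_n, margins)
# Moving v_u too close to the anchor violates the first margin only.
violating = quintuple_loss(v_o, v_s, np.array([0.1]), v_l, v_n, margins)
```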
Step 16: for each visit count c_ij ∈ M_m, compute the accuracy loss L_H.

The accuracy loss is the squared error between the predicted number of visits and the true visit count c_ij:

L_H = (c_ij − U_i·L_j)²
step 17, for each user u in the batchi∈UmCalculating the user regularization term LU
Calculating the distance between the user hidden vector and the user visual representation to obtain a user regular term LUThe specific calculation method is as follows:
Figure BDA0002324754700000091
wherein
Figure BDA0002324754700000092
The Frobenius norm of the matrix is represented.
Step 18: for each location l_j ∈ L_m, compute the location regularization term L_L.

The location regularization term is the distance between the location hidden vector and the location visual representation:

L_L = ‖L_j − l_j‖_F²
Step 19: compute the total loss L over all samples in the batch, and adjust the network parameters of the whole model.

The total loss L over all samples in the batch is computed as:

L = Σ (L_Q + L_H + λ_U L_U + λ_L L_L) + λ_n ‖Θ‖_F²

where L_Q, L_H, L_U and L_L are the quintuple loss, accuracy loss, user regularization term and location regularization term of a single sample; Θ denotes the parameters of the VGG16 model together with the weights of the self-attention mechanism; and λ_U, λ_L and λ_n are the weights of the user regularization term, the location regularization term and the parameter regularization term, respectively. The network parameters of the whole model are then adjusted according to the loss L.
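A toy sketch of combining the loss terms for one sample, following the total-loss form above (all numeric values and λ weights are made up):

```python
import numpy as np

def total_loss(LQ, LH, U_i, u_vis, L_j, l_vis, theta, lam_U, lam_L, lam_n):
    """Single-sample loss: quintuple loss + accuracy loss + weighted
    user/location regularisers + parameter regulariser over theta."""
    LU = float(np.sum((U_i - u_vis) ** 2))        # ||U_i - u_i||_F^2
    LL = float(np.sum((L_j - l_vis) ** 2))        # ||L_j - l_j||_F^2
    reg = float(sum(np.sum(p ** 2) for p in theta))
    return LQ + LH + lam_U * LU + lam_L * LL + lam_n * reg

LQ, LH = 0.5, 0.25
U_i, u_vis = np.array([1.0, 0.0]), np.zeros(2)    # LU = 1
L_j, l_vis = np.array([0.0, 1.0]), np.zeros(2)    # LL = 1
theta = [np.array([1.0]), np.array([2.0])]        # reg = 5
total = total_loss(LQ, LH, U_i, u_vis, L_j, l_vis, theta, 0.1, 0.1, 0.1)
```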
Step 20: repeat steps 6-19 until all batches of the training data set have participated in model training.

Step 21: repeat steps 5-20 until the specified number of iterations is reached.
Step 22: given a query q = (u, c), find all candidate locations {l_c} in the query city c.

Step 23: compute the preference value of query user u for each candidate location, and return the top-K ranked locations as the recommendation result.

Look up the hidden vector U_u corresponding to query user u and, for each candidate location l_c, its hidden vector L_c. The preference value of u for l_c is computed as the inner product:

r(u, l_c) = U_u·L_c

The computed preference values are sorted in descending order, and the top-K locations are returned as the recommendation result.
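The top-K recommendation step can be sketched as follows; candidate ids and vectors are toy values.

```python
import numpy as np

def recommend(U_u, cand_hidden, cand_ids, k):
    """Rank candidate locations in the query city by inner-product
    preference value and return the top-k location ids."""
    scores = cand_hidden @ U_u          # one preference value per candidate
    order = np.argsort(-scores)         # descending order
    return [cand_ids[i] for i in order[:k]]

U_u = np.array([1.0, 0.0])                              # query-user hidden vector
cands = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]])  # candidate hidden vectors
top = recommend(U_u, cands, ['l1', 'l2', 'l3'], k=2)
```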
The above embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit the invention; any modifications, additions or equivalents made within the scope of the principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A personalized tourist place recommendation method based on multi-level visual similarity of geotagged photos, comprising the following steps:
(1) preprocessing the geotagged photo set, clustering to obtain a travel location set, and extracting the user set and the number of times each user visits each travel location;
(2) obtaining visual characteristics of the photo by using a VGG16 model;
(3) calculating weight values for different photos by adopting a self-attention mechanism to obtain visual representations of the user and the place, and obtaining hidden vectors of the user and the place according to the visual representations of the user and the place;
(4) predicting the number of times of the user accessing the location according to the user hidden vector and the location hidden vector;
(5) constructing quintuple loss of the photo according to visual features of the photo, constructing a user regular term according to a user hidden vector, constructing a place regular term according to a place hidden vector, constructing accuracy loss according to access times, calculating total loss according to the quintuple loss, the user regular term, the place regular term and the accuracy loss, and iteratively optimizing model parameters of a VGG16 model and a weight coefficient of an attention mechanism by using the total loss;
(6) for a query task comprising a query user and a query city, retrieving all candidate places in the query city, and computing the query user's preference values for the candidate places according to the query-user hidden vector and the candidate-place hidden vectors obtained in step (3), thereby realizing personalized tourist place recommendation.
2. The personalized tourist place recommendation method based on multi-level visual similarity of geotagged photos according to claim 1, wherein in step (1), the photos are clustered with a density-based clustering method according to their latitude and longitude, each resulting cluster represents a location, and the cluster center is the location's position; this process mines the location set L = {l_1, l_2, ..., l_|L|}, where l = (c, g), c is the city in which location l lies, and g is the latitude and longitude of l;

the user set U = {u_1, u_2, ..., u_|U|} is extracted from the photographer information of the photos.
3. The personalized tourist place recommendation method based on multi-level visual similarity of geotagged photos according to claim 1, wherein in step (1), for each user-location pair (u_i ∈ U, l_j ∈ L), the photos taken by the i-th user u_i at the j-th location l_j are first sorted by capture time;

since a user may take several photos during one visit, if the time intervals between several consecutive photos taken by u_i at l_j are all less than a given time threshold Δt, these photos are regarded as belonging to the same visit, and the mean of their capture times is taken as the visit time t; the visit is expressed as (u_i, l_j, t); the user visit history V = {(u_i, l_j, t)} is thus obtained, where (u_i, l_j, t) denotes that user u_i visited location l_j at time t;

from the user visit history V, the number of times each user visited each location is counted, obtaining the visit counts M = {c_ij | 1 ≤ i ≤ |U|, 1 ≤ j ≤ |L|}, where c_ij is the number of times user u_i visited location l_j.
4. The personalized tourist place recommendation method based on multi-level visual similarity of geotagged photos according to claim 1, wherein in step (3),

first, the visual features of the photos taken by user u_i are stacked in order of capture time into a matrix UP_i, each row of which is the visual feature of one photo; the visual features are fused with a self-attention mechanism as follows:

ua_i = softmax(w_U tanh(V_U UP_i^T))
u_i = ua_i UP_i

where w_U and V_U are learnable parameters of the self-attention mechanism, ua_i is the weight vector over the photos, and the softmax function ensures that all computed weights sum to 1;

then, the user hidden vector U_i is sampled from a Gaussian distribution with mean u_i and variance σ_U² I_U, where I_U is an all-ones vector of the same length as u_i.
5. The personalized tourist place recommendation method based on multi-level visual similarity of geotagged photos according to claim 1, wherein in step (3), the visual features of the photos taken at location l_j are first stacked in order of capture time into a matrix LP_j, each row of which is the visual feature of one photo; the visual features are fused with a self-attention mechanism as follows:

la_j = softmax(w_L tanh(V_L LP_j^T))
l_j = la_j LP_j

where w_L and V_L are learnable parameters of the self-attention mechanism and la_j is the weight vector over the photos; using the weights in la_j, the rows of LP_j are summed to obtain the visual representation l_j of location l_j;

then, the location hidden vector L_j is sampled from a Gaussian distribution with mean l_j and variance σ_L² I_L, where I_L is an all-ones vector of the same length as l_j.
6. The personalized tourist place recommendation method based on multi-level visual similarity of geotagged photos according to claim 1, wherein in step (4), for each visit count c_ij ∈ M_m, the number of times user u_i visits location l_j is sampled from a Gaussian distribution with mean U_i·L_j and variance σ².
7. The personalized tourist location recommendation method based on multi-level visual similarity of geo-tagged photos according to claim 1, wherein in step (5), all photos taken by the users in batch U_m and all photos taken at the locations in batch L_m are found in the photo set P to form the batch photo set P_m. For a photo p_o in P_m, a second photo p1 taken by the same user at the same location, a photo p2 taken by another user at the same location, a photo p3 taken by the same user at another location, and a photo p4 taken by another user at another location together constitute a quintuple (p_o, p1, p2, p3, p4).

After training, a quintuple (p_o, p1, p2, p3, p4) should satisfy the multi-level visual similarity, formalized as follows:

d(v_o, v1) + m1 ≤ d(v_o, v2)
d(v_o, v1) + m2 ≤ d(v_o, v3)
d(v_o, v1) + m3 ≤ d(v_o, v4)
d(v_o, v2) + m4 ≤ d(v_o, v3)
d(v_o, v2) + m5 ≤ d(v_o, v4)
d(v_o, v3) + m6 ≤ d(v_o, v4)

where v_o, v1, v2, v3 and v4 are the visual features of p_o, p1, p2, p3 and p4 respectively, d(·,·) is the visual distance between two photos, and m1, m2, m3, m4, m5 and m6 are the minimum margins that the visual distances of the corresponding photo pairs must satisfy, with m1 < m2 < m3 and m4 < m5.

To ensure fast convergence, for any p_o ∈ P_m, all other photos taken by the same user at the same location are selected as p1, and the photos in P_m that still violate the above constraints, i.e. that satisfy the following inequalities, are selected as p2, p3 and p4:

d(v_o, v2) < d(v_o, v1) + m1
d(v_o, v3) < d(v_o, v1) + m2
d(v_o, v4) < d(v_o, v1) + m3
d(v_o, v3) < d(v_o, v2) + m4
d(v_o, v4) < d(v_o, v2) + m5
d(v_o, v4) < d(v_o, v3) + m6

In this way, every photo in P_m yields a set of quintuples (p_o, p1, p2, p3, p4) for training.

The triplet loss for each inequality of the multi-level visual similarity is calculated as follows:

L1 = [d(v_o, v1) + m1 - d(v_o, v2)]+
L2 = [d(v_o, v1) + m2 - d(v_o, v3)]+
L3 = [d(v_o, v1) + m3 - d(v_o, v4)]+
L4 = [d(v_o, v2) + m4 - d(v_o, v3)]+
L5 = [d(v_o, v2) + m5 - d(v_o, v4)]+
L6 = [d(v_o, v3) + m6 - d(v_o, v4)]+

where [·]+ evaluates to the enclosed value when it is positive and to 0 otherwise.

The triplet losses are summed to obtain the final quintuple loss:

L_Q = L1 + L2 + L3 + L4 + L5 + L6
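A minimal numpy sketch of the quintuple loss: six hinge (triplet) terms enforce that the anchor photo is closest to a photo by the same user at the same location and farthest from a photo by another user at another location. The Euclidean distance and the margin values are illustrative assumptions; the claim leaves the visual distance metric and margins to the implementation.

```python
import numpy as np

def quintuple_loss(v_o, v1, v2, v3, v4, m=(0.1, 0.2, 0.3, 0.1, 0.2, 0.1)):
    """Sum of six hinge losses enforcing the multi-level distance ordering.

    v1..v4: features of photos by (same user, same loc), (other user, same loc),
    (same user, other loc), (other user, other loc). m: margins m1..m6.
    """
    d = lambda a, b: float(np.linalg.norm(a - b))  # visual distance (Euclidean here)
    d1, d2, d3, d4 = d(v_o, v1), d(v_o, v2), d(v_o, v3), d(v_o, v4)
    m1, m2, m3, m4, m5, m6 = m
    # each term is [d_near + margin - d_far]+, zero when the ordering holds
    terms = [(d1, m1, d2), (d1, m2, d3), (d1, m3, d4),
             (d2, m4, d3), (d2, m5, d4), (d3, m6, d4)]
    return sum(max(da + mk - db, 0.0) for da, mk, db in terms)

# toy features whose distances already respect the ordering -> zero loss
v_o = np.zeros(2)
v1 = np.array([0.1, 0.0])
v2 = np.array([0.5, 0.0])
v3 = np.array([1.0, 0.0])
v4 = np.array([2.0, 0.0])
```

With the toy features above the ordering d1 < d2 < d3 < d4 holds with room for every margin, so the loss is zero; swapping v1 and v2 violates the first inequality and produces a positive penalty.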
8. The personalized tourist location recommendation method based on multi-level visual similarity of geo-tagged photos according to claim 1, wherein in step (5), the distance between the user latent vector U_i and the user visual representation u_i is computed to obtain the user regularization term L_U:

L_U = ||U_i - u_i||_F²

The distance between the location latent vector L_j and the location visual representation l_j is computed to obtain the location regularization term L_L:

L_L = ||L_j - l_j||_F²

The squared error between the sampled number of visits of user u_i to location l_j and the true number of visits c_ij is computed to obtain the accuracy loss L_H:

L_H = (c_ij - U_i L_j)²

where ||·||_F denotes the Frobenius norm of a matrix, i is the user index, j is the location index, and c_ij denotes the number of times user u_i visits location l_j.
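The three per-sample terms above can be sketched directly in numpy; the concrete vectors and count below are toy values chosen only to make the arithmetic checkable.

```python
import numpy as np

def sample_losses(U_i, u_i, L_j, l_j, c_ij):
    """User/location regularizers and accuracy loss for one (user, location) sample."""
    L_U = float(np.sum((U_i - u_i) ** 2))  # ||U_i - u_i||_F^2
    L_L = float(np.sum((L_j - l_j) ** 2))  # ||L_j - l_j||_F^2
    L_H = float((c_ij - U_i @ L_j) ** 2)   # (c_ij - U_i . L_j)^2
    return L_U, L_L, L_H

# toy sample: latent vectors drift from the visual representations, count is 3
L_U, L_L, L_H = sample_losses(
    U_i=np.array([1.0, 0.0]), u_i=np.array([0.0, 0.0]),
    L_j=np.array([0.0, 2.0]), l_j=np.array([0.0, 0.0]),
    c_ij=3.0)
```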
9. The personalized tourist location recommendation method based on multi-level visual similarity of geo-tagged photos according to claim 1, wherein in step (5), the total loss L is calculated as follows:

L = Σ (L_Q + L_H + λ_U L_U + λ_L L_L) + λ_n ||Θ||_F²

where L_Q, L_H, L_U and L_L are respectively the quintuple loss, accuracy loss, user regularization term and location regularization term of a single sample, Θ represents the parameters of the VGG16 model together with the weight and bias terms of the self-attention mechanism, and λ_U, λ_L and λ_n respectively denote the weights of the user regularization term, the location regularization term and the parameter regularization term.
10. The personalized tourist location recommendation method based on multi-level visual similarity of geo-tagged photos according to claim 1, wherein in step (6), the latent vector U corresponding to the query user u and, for each candidate location l_c in the candidate location set L_C, the corresponding latent vector L_c are found, and the preference value of the query user u for each location l_c is calculated as follows:

r_{u,c} = U L_c

The calculated preference values are sorted in descending order, and the top-K locations are returned as the recommendation result.
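The ranking step can be sketched in a few lines of numpy: score each candidate by the inner product with the user latent vector, sort descending, and return the top K. The location names and vector values are invented for illustration.

```python
import numpy as np

def recommend(U, L_cand, names, K=3):
    """Score candidates by inner product with the user latent vector, return top-K names."""
    scores = L_cand @ U          # preference value U . L_c per candidate row
    order = np.argsort(-scores)  # indices in descending score order
    return [names[i] for i in order[:K]]

U = np.array([1.0, 0.0, 0.5])                 # query user's latent vector (toy)
L_cand = np.array([[0.2, 0.9, 0.1],           # one row per candidate location
                   [0.8, 0.1, 0.3],
                   [0.1, 0.2, 0.9],
                   [0.5, 0.5, 0.5]])
names = ["temple", "beach", "museum", "park"]
top = recommend(U, L_cand, names, K=2)
```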
CN201911311868.2A 2019-12-18 2019-12-18 Personalized tourist place recommendation method based on multi-level visual similarity Active CN111159543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911311868.2A CN111159543B (en) 2019-12-18 2019-12-18 Personalized tourist place recommendation method based on multi-level visual similarity

Publications (2)

Publication Number Publication Date
CN111159543A CN111159543A (en) 2020-05-15
CN111159543B true CN111159543B (en) 2022-04-05

Family

ID=70557245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911311868.2A Active CN111159543B (en) 2019-12-18 2019-12-18 Personalized tourist place recommendation method based on multi-level visual similarity

Country Status (1)

Country Link
CN (1) CN111159543B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117390290B (en) * 2023-12-08 2024-03-15 安徽省立医院(中国科学技术大学附属第一医院) Method for learning dynamic user interests based on language model of content enhancement

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064924A (en) * 2012-12-17 2013-04-24 浙江鸿程计算机系统有限公司 Travel destination situation recommendation method based on geotagged photo excavation
CN110134885A (en) * 2019-05-22 2019-08-16 广东工业大学 A kind of point of interest recommended method, device, equipment and computer storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Weighted multi-information constrained matrix factorization for personalized travel location recommendation based on geo-tagged photos;Dandan Lyu等;《Springer Science+Business Media》;20191024;第1-15页 *

Similar Documents

Publication Publication Date Title
CN110162593B (en) Search result processing and similarity model training method and device
CN110598130B (en) Movie recommendation method integrating heterogeneous information network and deep learning
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
Xing et al. Points-of-interest recommendation based on convolution matrix factorization
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN109947987B (en) Cross collaborative filtering recommendation method
CN109471982B (en) Web service recommendation method based on QoS (quality of service) perception of user and service clustering
CN108897791B (en) Image retrieval method based on depth convolution characteristics and semantic similarity measurement
CN108897750B (en) Personalized place recommendation method and device integrating multiple contextual information
CN113268669B (en) Relation mining-oriented interest point recommendation method based on joint neural network
CN113255714A (en) Image clustering method and device, electronic equipment and computer readable storage medium
Li et al. Where you instagram? associating your instagram photos with points of interest
CN111382283A (en) Resource category label labeling method and device, computer equipment and storage medium
Zhuang et al. Anaba: An obscure sightseeing spots discovering system
CN115408618B (en) Point-of-interest recommendation method based on social relation fusion position dynamic popularity and geographic features
CN115712780A (en) Information pushing method and device based on cloud computing and big data
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN115422441A (en) Continuous interest point recommendation method based on social space-time information and user preference
CN113821657A (en) Artificial intelligence-based image processing model training method and image processing method
CN110598126B (en) Cross-social network user identity recognition method based on behavior habits
CN117312681A (en) Meta universe oriented user preference product recommendation method and system
CN115600017A (en) Feature coding model training method and device and media object recommendation method and device
CN111538916A (en) Interest point recommendation method based on neural network and geographic influence
CN111159543B (en) Personalized tourist place recommendation method based on multi-level visual similarity
CN110543601B (en) Method and system for recommending context-aware interest points based on intelligent set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant