CN114254837A - Travel route customizing method and system based on deep reinforcement learning - Google Patents

Travel route customizing method and system based on deep reinforcement learning

Info

Publication number
CN114254837A
CN114254837A
Authority
CN
China
Prior art keywords
tourist
route
poi
information
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111635694.2A
Other languages
Chinese (zh)
Inventor
赵玺
刘佳璠
王乐
李雨航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202111635694.2A priority Critical patent/CN114254837A/en
Publication of CN114254837A publication Critical patent/CN114254837A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047 Optimisation of routes or paths, e.g. travelling salesman problem
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/14 Travel agencies

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Human Resources & Organizations (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Biophysics (AREA)
  • General Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a travel route customization method and system based on deep reinforcement learning. The method mines tourists' historical preference scores from hotel, scenic spot and traffic data; builds a route optimization framework based on a deep reinforcement learning algorithm; obtains tourist requirements and generates an intelligent, customized route; and dynamically updates the route as the tourist's real-time scene changes. The method can quickly produce an intelligent, customized route covering both hotels and scenic spots, provides more diverse and convenient services, and saves tourists the time of choosing hotels and scenic spots and planning routes. The environment is taken to be the real tourism environment of the tourist, comprising POI information and tourist input information; the route is generated according to the tourist's historical preferences and requirements, meeting personalized, customized design needs. The route is planned dynamically and intelligently according to the tourist's real travel route, and the optimization model learns further from it, improving tourist satisfaction and experience.

Description

Travel route customizing method and system based on deep reinforcement learning
Technical Field
The invention belongs to the field of travel route customization, and particularly relates to a travel route customization method and system based on deep reinforcement learning.
Background
The tourism industry, an important component of the modern service industry, has gradually become one of the most important economic driving forces in the world. As travel has become more popular, tourist behavior has changed greatly: tourists increasingly prefer "customized tours" and "self-driving tours" over pre-organized routes or standard travel packages.
The customized travel route problem is known as the "travel route design problem": its object is to design a travel route that maximizes the tourist's total preference score subject to the tourist's constraints. Existing customized tour routing mechanisms pay relatively little attention to hotel selection, yet hotels are an important component of a multi-day tour route. In practice, a tourist typically selects a hotel at the end of each day and continues traveling from that hotel the next day. Solving the travel design problem with hotel selection is therefore a key and complex problem.
The current mainstream approaches are heuristics, which focus primarily on selecting and ranking POIs but neglect optimizing the sightseeing time spent at each POI, as well as real traffic conditions and tourists' real preferences for POIs.
Disclosure of Invention
The invention aims to provide a tour route customization method and system based on deep reinforcement learning that can quickly produce intelligent, customized routes covering hotels and scenic spots, meet tourists' personalized and customized design requirements, and improve tourist satisfaction and experience.
In order to achieve this purpose, the invention adopts the following technical scheme: a travel route customization method based on deep reinforcement learning comprises the following steps:
the method comprises the steps of obtaining tourist demands, and generating a customized route based on the tourist demands and a route optimization model;
the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and dynamically updating the route based on the route optimization model according to the real-time scene change of the tourist.
The route optimization model training comprises the following steps:
collecting attribute information and tourist review information for the hotels and scenic spots of the travel destination, where hotels and scenic spots are collectively called POIs (points of interest), and simultaneously collecting POI-related data and traffic information; the travel destination may be the series of destinations of a traditional classic route, a single city, or a particular scenic area;
constructing a tourist portrait by analyzing the tourist review information, and mining tourists' preference scores for scenic spots;
based on the tourist preference scores, the scenic spot information and the traffic information, constructing a route optimization model under the deep reinforcement learning framework, and training it with the actor-reviewer algorithm to obtain the optimized model.
The deep reinforcement learning framework comprises states, actions, rewards and strategies;
the state: the state defines the sequence of POIs selected so far; the state is the output of the environment and the input of the agent; in travel route design, state elements are divided into static elements and dynamic elements according to the travel context information;
the action: given the current state, the next POI to select is the action; after a POI is selected, the state is updated to a new state, and selecting different actions (POIs) leads to different state updates;
the reward: the reward defines the effect an action taken in the current state has on the environment; the user's total preference value for a POI sequence is taken as the reward, which guides the agent to select the POI sequence maximizing the objective function; the specific calculation formula is:

U_k = Σ_{t=1}^{T} u_k(a_t)

where k ∈ {1, 2, ..., K} represents the tourist type; U represents the total preference score; U_k represents the total preference score of a tourist of type k; a_t represents the POI selected at step t; and u_k(a_t) represents the preference score brought by the POI selected at step t for a tourist of type k;
the strategy: the agent selects an action based on the policy, which takes the current state as input and outputs a probability distribution over actions, mapping the current state to the next optimal control action to select; if a sequence has a higher total reward, the parameters are updated to favor that sequence; the action probability is calculated as:

P(a_t | s_t, G) = π_θ(s_t, a_t)

where G represents the POI network distribution map; π_θ represents the policy network with parameter θ; s_t represents the state at stage t; and a_t represents the action generated at stage t;
then, the generation probability of the POI sequence is calculated as:

p(O | G) = Π_{t=1}^{T} P(o_t | O_{t-1}, G)

where O represents a POI sequence; G represents the POI network distribution map; and P(o_t | O_{t-1}, G) represents the action probability, i.e., the probability of selecting the next POI at step t given the POI sequence already selected;
the purpose of training the DRL model is to update the policy parameters to maximize the total reward value, i.e., to train the model so that it can generate the route maximizing the total score of the user's preferences; the expected reward for a given parameter setting and map is calculated by summing over all travel routes:

J(θ | G) = E_{O ~ π_θ(·|G)} [ R(O | G) ]

where π_θ represents the policy network with parameter θ, and R(O | G) represents the reward obtained by POI sequence O given that the POI network distribution map is G;
and finally, performing model training by adopting an actor-reviewer algorithm to obtain an optimized DRL model.
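As an illustration of the state-action-reward loop defined above, the following Python sketch samples a short POI sequence from a softmax policy and scores it with the total preference reward U_k. It is a minimal toy under stated assumptions, not the patent's implementation: the preference table u_k, the random logits standing in for the policy network's output, and the helper names total_preference and sample_next_poi are all invented for illustration.

import numpy as np

def total_preference(u_k, poi_sequence):
    # Reward U_k: the sum of per-step preference scores u_k(a_t) over the route.
    return sum(u_k[a] for a in poi_sequence)

def sample_next_poi(logits, visited, rng):
    # Sample a_t from the softmax policy pi_theta(s_t, .), masking visited POIs.
    masked = logits.copy()
    masked[list(visited)] = -np.inf          # a selected POI cannot be chosen again
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
u_k = np.array([0.9, 0.4, 0.7, 0.2, 0.6])    # hypothetical preferences of type k per POI
logits = rng.normal(size=5)                  # stand-in for the policy network's output
visited, route = {0}, [0]                    # start from the tourist's given starting point
for _ in range(3):
    a = sample_next_poi(logits, visited, rng)
    visited.add(a)
    route.append(a)
print(route, total_preference(u_k, route))

Masking already-visited POIs mirrors the sequential selection described above: once a POI enters the route, the policy can no longer choose it.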
The policy network is trained with a policy-gradient-based actor-reviewer algorithm; in the actor-reviewer framework, the actor is responsible for learning the policy via policy gradients, i.e., the policy network generates actions through interaction with the environment, while the reviewer estimates the expected cumulative reward, which is used to evaluate the actor's performance and guide the actor's action at the next stage;
by giving the POI network distribution map G and setting the parameters, the training goal of the policy network is to maximize the expected reward; to maximize the expected reward, the policy is updated with a policy gradient algorithm; the policy gradient ∇_θ J(θ | G) is expressed as:

∇_θ J(θ | G) = E_{O ~ π_θ} [ (R(O | G) - b(G)) ∇_θ log π_θ(O) ]

where G represents the POI network distribution map; O_n represents the randomly generated n-th POI sequence; R(O_n | G) represents the reward of POI sequence O_n; b(G) represents a baseline on the expected cumulative reward, used to reduce training variance; and π_θ(O) represents the policy network with parameter θ;
the reviewer network is a feed-forward neural network whose input is the weighted sum of the embedded feature vectors of the sights, followed by two hidden layers (a ReLU layer and a dense layer) and another linear layer with a single output that returns the estimated reward; the mean squared error is used as the reviewer's loss to train the reviewer network parameters, specifically:

L(β) = (1/N) Σ_{n=1}^{N} ( b_β(G) - R(O_n | G) )²

where N represents the batch size; O_n represents the randomly generated n-th POI sequence; G represents the POI network distribution map; R(O_n | G) represents the reward of POI sequence O_n; and b_β(G) represents the expected cumulative reward estimated with parameter β;
the expected value is replaced by the average over the Monte Carlo samples in a batch of size N:

∇_θ J(θ | G) ≈ (1/N) Σ_{n=1}^{N} (R(O_n | G) - b_β(G)) ∇_θ log π_θ(O_n)

where N is the batch size; the gradient is then used to adjust the policy parameters:

θ ← θ + α ∇_θ J(θ | G)

where α is the learning rate, controlling the update rate of the policy parameters;
in training, the actor network and the reviewer network are trained simultaneously, and the actor network updates the parameters of the policy function according to the direction suggested by the reviewer.
The POI attribute information comprises the POI name, geographic position, visit duration and business hours; the tourist review information comprises the tourist ID, travel type, review time, rating and review text; the traffic information mainly comprises the travel time between two points, including time spent self-driving and on public transport; the tourist demand information comprises the number of tourists, tourist composition, starting point, end point and touring time.
The specific process of mining tourists' historical preferences is as follows: tourist preference scores are mined from three aspects, namely contextual factors, review text and ratings, with the tourist type taken into account as the contextual factor; meanwhile, since review text contains comprehensive and detailed descriptions by tourists, a natural language processing method is used to further mine tourist preference scores.
And when the route is dynamically updated, based on the trained route optimization model, mobile-phone GPS positioning is used to add real-time geographic position information to the input information, and an updated POI sequence is output.
On the other hand, the invention also provides a travel route customizing system based on deep reinforcement learning, which comprises a demand obtaining module, a route generating module and a route updating module;
the demand acquisition module is used for acquiring demands of tourists;
the route generation module is used for generating a customized route according to the tourist demand and a route optimization model; the route generation module calls the route optimization model to generate the customized route, where the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and the route updating module dynamically updates the route based on the route optimization model according to the real-time scene change of the tourist.
The invention also provides a computer device comprising one or more processors and a memory, the memory storing a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when doing so can implement the deep-reinforcement-learning-based travel route customization method of the invention.
Meanwhile, a computer-readable storage medium is provided, in which a computer program is stored; when the computer program is executed by a processor, the deep-reinforcement-learning-based travel route customization method of the invention can be implemented.
Compared with the prior art, the invention has the following beneficial technical effects:
the intelligent and customized route containing the hotel and the scenic spot can be quickly obtained, more diversified and convenient services are provided for the tourists, and the time for selecting the hotel, the scenic spot and the route planning by the tourists is saved; the route is generated according to the historical preference and the requirement of the tourist, so that the personalized and customized design requirement of the tourist can be met; according to the invention, the route is dynamically and intelligently planned according to the real tourist route of the tourist, and the optimization model is further learned, so that the satisfaction degree and experience of the tourist can be improved.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a diagram of a deep reinforcement learning framework according to the present invention;
FIG. 3 is a diagram of a part of a sample travel route generation in example 1 of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
referring to fig. 1, the present invention provides a travel route customization method based on deep reinforcement learning, which comprises the following steps:
step 1, collecting attribute information and tourist comment information of hotels and scenic spots of tourist destinations, wherein the hotels and the scenic spots are collectively called POIs (point of interest), and meanwhile collecting data related to the POIs and traffic information; the POI attribute information comprises a POI name, a geographic position, a tour duration and business hours; the tourist comment information comprises a tourist ID, a play type, a comment time, a score and a comment text; the traffic information includes traffic time between two points, such as time spent on self-driving and public transportation; the travel destination comprises a series of destinations of a traditional classical route or a single city or a certain scenic spot;
and 2, constructing the tourist portrait by analyzing the comment information of the tourist, and mining the preference score of the tourist. Specifically, the tourist preference score is mined mainly from three aspects of context factors, comment texts and scores. Considering that the contextual factor of guest type causes guests to comment differently on attractions, the present invention takes this contextual factor of guest type into account in order to more fully capture the guest preference score. Because the comment text contains the comprehensive and detailed description of the tourist, the invention further mines the preference score of the tourist by adopting a natural language processing method.
Step 3, designing a route optimization framework based on a deep reinforcement learning algorithm:
first, the travel route planning problem is defined as a Markov decision process, i.e., POI information is generated sequentially in a sequence. The tourists give a starting point start, an end, playing days d and playing duration t every day, the scenic spots are sequentially selected according to the starting point start given by the tourists, and as an example, the tourists select the scenic spot a2,a3,a4When a is selected4Then, when the time is night, selecting a hotel next step; the next day, from the hotel, the process repeats until the end of play, selecting the guest's given endpoint.
Then, referring to fig. 2, a deep reinforcement learning (DRL) framework is designed. The DRL framework comprises an environment and an agent; the environment can be seen as the real tourism environment of the tourist, containing POI information and tourist input information, and the invention learns the environment representation with a deep learning algorithm. The agent is the core of the algorithm: given the environment information as input, it outputs the next POI to select. Finally, an optimized model is obtained through actor-reviewer (Actor-Critic) training.
The deep reinforcement learning framework includes states, actions, rewards, and strategies.
The state: the state defines the sequence of POIs selected before step t. The state is the output of the environment and the input to the agent. In travel route design, state elements can be divided into static and dynamic elements according to the travel context information; for example, a POI's position information and business hours are both static elements, while the time budget, a factor that changes gradually, is a dynamic element.
The action: given the current state, the next POI to select is the action; after a POI is selected, the state is updated to a new state, and selecting different actions (POIs) leads to different state updates.
The reward: the reward defines whether an action is good or bad for the environment in the current state. The designed travel route is expected to give tourists more satisfaction and better match user preferences, i.e., the higher the POI score value, the better. The user's total preference value for a POI sequence is therefore taken as the reward, which guides the agent to select the POI sequence maximizing the objective function; the specific calculation formula is:

U_k = Σ_{t=1}^{T} u_k(a_t)

where k ∈ {1, 2, ..., K} represents the tourist type; U represents the total utility; U_k represents the total preference score of a tourist of type k; a_t represents the POI selected at step t; and u_k(a_t) represents the preference score of the POI selected at step t for a tourist of type k.

The strategy: the agent selects an action based on the policy, which takes the current state as input and outputs a probability distribution over actions, mapping the current state to the next optimal control action. If a sequence has a higher total reward, the parameters are updated to favor that sequence. The action probability is calculated as:
P(a_t | s_t, G) = π_θ(s_t, a_t)
where G represents the POI network distribution map; π_θ represents the policy network with parameter θ; s_t represents the state at stage t; and a_t represents the action generated at stage t.
Then, the generation probability of the POI sequence is calculated as:

p(O | G) = Π_{t=1}^{T} P(o_t | O_{t-1}, G)

where O represents a POI sequence; G represents the POI network distribution map; and P(o_t | O_{t-1}, G) represents the action probability, i.e., the probability of selecting the next POI at step t given the POI sequence already selected.
The purpose of training the DRL model is to update the policy parameters to maximize the total reward. Travel routes are stochastic, so the expected reward for a given parameter setting and map is calculated by summing over all travel routes:

J(θ | G) = E_{O ~ π_θ(·|G)} [ R(O | G) ]

where π_θ represents the policy network with parameter θ, and R(O | G) represents the reward obtained by POI sequence O given that the POI network distribution map is G.
Finally, model training is performed. Specifically, the invention trains the policy network with a policy-gradient-based actor-reviewer (Actor-Critic) algorithm. In the actor-reviewer framework, the actor is responsible for learning the policy via policy gradients, i.e., the policy network generates actions through interaction with the environment. The reviewer estimates the expected cumulative reward, which is used to evaluate the actor's performance and guide the actor's action at the next stage.
Given the POI network distribution map G and the parameter settings, the training goal of the network is to maximize the expected reward. To maximize the expected reward, the policy is updated with a policy gradient algorithm (policy gradient method), expressed as:

∇_θ J(θ | G) = E_{O ~ π_θ} [ (R(O | G) - b(G)) ∇_θ log π_θ(O) ]

where G represents the POI network distribution map; O_n represents the randomly generated n-th POI sequence; R(O_n | G) represents the reward of POI sequence O_n; b(G) represents a baseline on the expected cumulative reward, used to reduce training variance; and π_θ(O) represents the policy network with parameter θ.
The reviewer network is a feed-forward neural network whose input is the weighted sum of the embedded feature vectors of the sights, followed by two hidden layers (a ReLU layer and a dense layer) and another linear layer (with a single output returning the estimated reward). The mean squared error serves as the reviewer's loss, used to train the reviewer network parameters, specifically:

L(β) = (1/N) Σ_{n=1}^{N} ( b_β(G) - R(O_n | G) )²

where N represents the batch size; O_n represents the randomly generated n-th POI sequence; G represents the POI network distribution map; R(O_n | G) represents the reward of POI sequence O_n; and b_β(G) represents the expected cumulative reward estimated with parameter β.
In practice, the expected value is replaced by the average over the Monte Carlo samples in a batch of size N:

∇_θ J(θ | G) ≈ (1/N) Σ_{n=1}^{N} (R(O_n | G) - b_β(G)) ∇_θ log π_θ(O_n)

where N is the batch size. The gradient is then used to adjust the policy parameters:

θ ← θ + α ∇_θ J(θ | G)

where α is the learning rate, controlling the update rate of the policy parameters.
During training, the actor network and the reviewer network are trained simultaneously, and the actor network updates the parameters of the policy function in the direction suggested by the reviewer; a partial sample of generated travel routes is shown in FIG. 3.
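Under stated assumptions, the actor-reviewer update just described might look as follows in PyTorch. The two-layer feed-forward actor and reviewer, the toy reward function and the tensor shapes are illustrative stand-ins rather than the patent's networks; only the structure (a Monte Carlo policy gradient with a learned baseline b_β(G) trained by mean squared error) follows the text above.

import torch
import torch.nn as nn

n_poi, emb_dim, batch = 20, 128, 256
actor = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, n_poi))
critic = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-6)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-6)

def train_step(graph_embedding, reward_fn, horizon=8):
    # graph_embedding: [1, emb_dim] summary of the POI network distribution map G.
    g = graph_embedding.expand(batch, emb_dim)
    log_probs, actions = [], []
    for _ in range(horizon):                 # roll out a batch of POI sequences O_n
        dist = torch.distributions.Categorical(logits=actor(g))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        actions.append(a)
    log_pi = torch.stack(log_probs).sum(0)   # log pi_theta(O_n)
    reward = reward_fn(torch.stack(actions, dim=1))  # R(O_n | G), shape [batch]
    baseline = critic(g).squeeze(-1)                 # b_beta(G)
    actor_loss = -((reward - baseline).detach() * log_pi).mean()  # policy gradient
    critic_loss = ((baseline - reward) ** 2).mean()               # reviewer MSE loss
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    return actor_loss.item(), critic_loss.item()

scores = torch.rand(n_poi)                   # toy per-POI preference scores
train_step(torch.randn(1, emb_dim), lambda seqs: scores[seqs].sum(dim=1))

As in the description, the two networks are trained simultaneously: the reward-minus-baseline term plays the role of the reviewer's suggested direction for the actor.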
Step 4, based on the trained intelligent customized model, once the tourist inputs the starting point, end point and play-time factors, a customized route can be generated within seconds; the tourist can also input a starting point, end point, play time and food requirements; or, taking the context into account, the tourist can input a starting point, end point, play time, ethnic restaurant needs and a featured hotel;
and step 5, considering that the tourists can actually select scenic spots which do not appear in the customized route according to self tendency when actually playing, so that the route is dynamically updated based on the scene where the tourists are located in real time. The deep reinforcement learning adopts a deep learning algorithm to represent the environment variable, and the representation can be automatically updated according to the input position, so that the real-time geographical position information is added in the input information based on the GPS positioning of the mobile phone, and the updated path can be output.
The present invention provides a travel route customization system, comprising a demand acquisition module, a route generation module and a route updating module;
the demand acquisition module is used for acquiring demands of tourists;
the route generation module is used for generating a customized route according to the tourist demand and a route optimization model; the route generation module calls the route optimization model to generate the customized route, where the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and the route updating module dynamically updates the route based on the route optimization model according to the real-time scene change of the tourist.
Optionally, the invention further provides a device for deep-reinforcement-learning-based travel route customization, comprising a processor and a memory, the memory storing a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when doing so can implement part or all of the steps of the deep-reinforcement-learning-based travel route customization method of the invention.
The device for customizing the tour route based on the deep reinforcement learning can be a notebook computer, a tablet computer, a desktop computer, a mobile phone or a workstation.
The processor may be a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA).
The memory may be an internal storage unit of a notebook computer, tablet computer, desktop computer, mobile phone or workstation, such as memory or a hard disk; external storage units such as removable hard disks or flash memory cards may also be used.
Computer-readable storage media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. The computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The random access memory may include a resistive random access memory (ReRAM).
The specific embodiment of the invention is as follows:
the method realizes the travel route customization based on the deep reinforcement learning, and the specific process is as follows:
step 1, crawling data related to POI attribute information, tourist comment information and traffic information of 64 hotels and 64 scenic spots in Beijing city in 12 months in 2018 from TripAdvisor.com, Ctip.com, Meituan.com, Baike.baidu.com and Amap.com by using a crawler technology. The POI attribute information comprises a POI name, a geographic position, a tour duration and business hours; the tourist comment information comprises a tourist ID, a play type, a comment time, a score and a comment text; the traffic information mainly comprises self-driving time between two points and bus time.
And 2, constructing the tourist portrait by analyzing tourist review information and mining tourist preference scores. Preference scores are mined from three aspects, namely contextual factors, review text and ratings, with the tourist type taken into account as the contextual factor. Because review text contains tourists' comprehensive and detailed descriptions, the invention further mines preference scores with natural language processing techniques. As an example, the invention divides tourists into four types, expressed as:

k ∈ {1, 2, 3, 4} = {family trip, friends trip, couple trip, solo trip}
Tourists' satisfaction with POIs is expressed numerically through sentiment analysis of online reviews and rating processing with Python's SnowNLP library. The numerical rating is based on a five-point scale, with 5 the most satisfactory and 1 the least. To simplify the calculation, the ratings are weighted; the weighted average rating R_{k,i} of travel type k for POI(i) is expressed as:

R_{k,i} = (1/N) Σ_{q=1}^{5} q · N^q_{k,i}
where q ∈ {1, 2, 3, 4, 5} is a specific value on the five-point scale, N^q_{k,i} is the number of reviews in which tourist type k rated POI(i) as q, and N is the total number of such ratings. In addition, the reviews of tourist type k on POI(i) are stored as a text file T_{k,i}.
After the online review text is preprocessed, e.g., tokenization and elimination of stop words, the sentiment value E_{k,i} of tourist type k for POI(i) is calculated with Python's SnowNLP sentiment analysis tool. Finally, the satisfaction value Sat_{k,i} of tourist type k for POI(i) is expressed as:

Sat_{k,i} = v_1 · R_{k,i} + v_2 · E_{k,i}
where v_1, v_2 (v_1 + v_2 = 1) are weights that can be flexibly assigned according to personal wishes and preferences. In this example v_1 = v_2 = 0.5.
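A minimal sketch of this satisfaction computation follows. SnowNLP is the library the embodiment names; the tiny rating list and review string are invented for illustration, and rescaling the five-point average to [0, 1] before mixing is an added assumption so that the two terms share a range.

from snownlp import SnowNLP

ratings = [5, 4, 5, 3]                        # hypothetical five-point ratings of type k for POI(i)
reviews = "景色很美，值得一去。排队时间有点长。"    # stand-in for the review text file T_{k,i}

R_ki = sum(ratings) / len(ratings)            # weighted average rating (uniform weights here)
E_ki = SnowNLP(reviews).sentiments            # sentiment value in [0, 1]

v1 = v2 = 0.5                                 # weights from the example, v1 + v2 = 1
Sat_ki = v1 * (R_ki / 5.0) + v2 * E_ki        # rescaling the rating to [0, 1] is an assumption
print(round(Sat_ki, 3))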
Tourists' preference scores for POIs are expressed numerically mainly by analyzing travel notes with TF-IDF techniques. The notes of tourist type k are stored as a text file log_k; after preprocessing, Python's scikit-learn tools are used to calculate the preference score P_{k,i} of tourist type k for POI(i). After the Sat_{k,i} and P_{k,i} calculated above are normalized, the final score Score_{k,i} of tourist type k for POI(i) can be calculated, specifically:

Score_{k,i} = θ_1 · Sat'_{k,i} + θ_2 · P'_{k,i}
where the primes denote the variables after normalization, and θ_1, θ_2 (θ_1 + θ_2 = 1) are weights that can be flexibly assigned. In this example θ_1 = θ_2 = 0.5.
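The following sketch illustrates the final combination. Treating the TF-IDF weight of one proxy token per POI as the note-based preference score P_{k,i} is a simplification invented for illustration, and min-max scaling is one plausible choice for the unspecified normalization step; only θ_1 = θ_2 = 0.5 and the overall structure follow the example.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

notes_k = [                                   # hypothetical travel notes of tourist type k
    "great wall sunrise great wall hike",
    "forbidden city museum visit",
]
vectorizer = TfidfVectorizer(vocabulary=["wall", "city"])  # one proxy token per POI
tfidf = vectorizer.fit_transform(notes_k).toarray()
P_ki = tfidf.max(axis=0)                      # note-based preference score per POI

Sat_ki = np.array([0.81, 0.66])               # satisfaction values from the previous step

def minmax(x):                                # one plausible normalization
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

theta1 = theta2 = 0.5                         # weights from the example
Score_ki = theta1 * minmax(Sat_ki) + theta2 * minmax(P_ki)
print(Score_ki.round(3))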
And 3, firstly, the travel route planning problem is defined as a Markov decision process, i.e., POI information is generated sequentially. Suppose the tourist gives a starting point start, an end point end, a number of play days d, and a daily play duration t. Scenic spots are selected in sequence from the given starting point; for example, scenic spots a_2, a_3, a_4 are selected, and after a_4 is selected and it is evening, a hotel is selected next; the next day starts from the hotel, and the process repeats until play ends, when the tourist's given end point is selected.
Then, referring to fig. 2, a deep reinforcement learning (DRL) framework is designed. The environment is the real tourism environment of the tourist, comprising POI information, tourist input information and so on. The agent is the core of the algorithm, outputting the next POI to select given the environment information as input. Finally, an optimized model is obtained through Actor-Critic training. In this example, the specific parameters of the DRL model are: the travel route map is embedded into a 128-dimensional vector with an encoder; the number of GRU hidden layer units is 128; the GRU's dropout layer parameter is set to 0.1; the number of epochs, the mini-batch size and the number of training instances are set to 1000, 256 and 1,000,000 respectively; all parameters are initialized with a Xavier initializer, and the L2 norm of the gradient is clipped to 2.0; the model is optimized with the Adam algorithm and the learning rate is set to 10^-6. Partial sample travel routes generated can be seen in fig. 3.
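Collected into code, these hyperparameters might be wired up as follows. The PointerEncoder class and the POI feature dimension are hypothetical; only the numbers (128-dimensional embedding, 128 GRU hidden units, dropout 0.1, Xavier initialization, mini-batches of 256, Adam with learning rate 10^-6) come from this embodiment.

import torch
import torch.nn as nn

class PointerEncoder(nn.Module):
    # Hypothetical wiring for the embodiment's hyperparameters.
    def __init__(self, n_poi_features, emb_dim=128, hidden=128, dropout=0.1):
        super().__init__()
        self.embed = nn.Linear(n_poi_features, emb_dim)        # 128-d map embedding
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)   # 128 hidden units
        self.drop = nn.Dropout(dropout)                        # dropout 0.1

    def forward(self, poi_features):           # poi_features: [batch, seq_len, features]
        hidden_states, _ = self.gru(self.embed(poi_features))
        return self.drop(hidden_states)

encoder = PointerEncoder(n_poi_features=8)     # the feature dimension is an assumption
for p in encoder.parameters():                 # Xavier initialization for weight matrices
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-6)
print(encoder(torch.randn(256, 10, 8)).shape)  # a mini-batch of 256 routes of length 10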
And 4, based on the trained customized model, once the tourist inputs factors such as the starting point, end point and play time, a customized route can be generated within seconds. In this example, under the same constraints (starting point S_3, 5 play days, 10 hours of play per day), the routes selected for the four tourist types are as follows:
family swimming:
66-69-67-38-65-76-72-32-97-74-48-75-100-80-2-88-85-90-18;
friend coming and going:
66-65-119-61-69-75-104-23-67-84-88-57-72-121-76-47-85-107-51;
the couple swims in two persons:
66-65-72-32-69-64-110-3-67-75-94-57-85-100-29-76-74-90-18;
single person swimming:
66-69-65-46-67-85-9-92-88-77-15-75-109-110-3-72-106-94-95-29。
and 5, when the tourist selects the scenic spots which do not appear in the customized route according to the tendency of the tourist, dynamically updating the route based on the scene where the tourist is located in real time, and automatically updating the representation according to the input position, so that the real-time geographic position is input based on the mobile phone GPS positioning system, and the updated path can be output.
The invention can quickly produce an intelligent, customized route covering hotels and scenic spots, provides more diverse and convenient services for tourists, and saves tourists the time of choosing hotels and scenic spots and planning routes; the route is generated according to the tourist's historical preferences and requirements, meeting personalized and customized design needs; the route is planned dynamically and intelligently according to the tourist's real travel route, and the optimization model learns further, improving tourist satisfaction and experience.

Claims (10)

1. A travel route customizing method based on deep reinforcement learning is characterized by comprising the following steps:
the method comprises the steps of obtaining tourist demands, and generating a customized route based on the tourist demands and a route optimization model;
the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and dynamically updating the route based on the route optimization model according to the real-time scene change of the tourist.
2. The travel route customization method based on deep reinforcement learning of claim 1, wherein the route optimization model training comprises the following steps:
collecting attribute information and tourist review information for the hotels and scenic spots of the travel destination, where hotels and scenic spots are collectively called POIs (points of interest), and simultaneously collecting POI-related data and traffic information; the travel destination may be the series of destinations of a traditional classic route, a single city, or a particular scenic area;
constructing a tourist portrait by analyzing the tourist review information, and mining tourists' preference scores for scenic spots;
based on the tourist preference scores, the scenic spot information and the traffic information, constructing a route optimization model under the deep reinforcement learning framework, and training it with the actor-reviewer algorithm to obtain the optimized model.
3. The deep reinforcement learning-based travel route customization method according to claim 1, wherein the deep reinforcement learning framework comprises status, action, reward and strategy;
the state: the state defines the sequence of POIs selected so far; the state is the output of the environment and the input of the agent; in travel route design, state elements are divided into static elements and dynamic elements according to the travel context information;
the action: given the current state, the next POI to select is the action; after a POI is selected, the state is updated to a new state, and selecting different actions (POIs) leads to different state updates;
the reward: the reward defines the effect an action taken in the current state has on the environment; the user's total preference value for a POI sequence is taken as the reward, which guides the agent to select the POI sequence maximizing the objective function; the specific calculation formula is:

U_k = Σ_{t=1}^{T} u_k(a_t)

where k ∈ {1, 2, ..., K} represents the tourist type; U represents the total preference score; U_k represents the total preference score of a tourist of type k; a_t represents the POI selected at step t; and u_k(a_t) represents the preference score brought by the POI selected at step t for a tourist of type k;
the strategy: the agent selects an action based on the policy, which takes the current state as input and outputs a probability distribution over actions, mapping the current state to the next optimal control action to select; if a sequence has a higher total reward, the parameters are updated to favor that sequence; the action probability is calculated as:

P(a_t | s_t, G) = π_θ(s_t, a_t)

where G represents the POI network distribution map; π_θ represents the policy network with parameter θ; s_t represents the state at stage t; and a_t represents the action generated at stage t;
then, the generation probability of the POI sequence is calculated as:

p(O | G) = Π_{t=1}^{T} P(o_t | O_{t-1}, G)

where O represents a POI sequence; G represents the POI network distribution map; and P(o_t | O_{t-1}, G) represents the action probability, i.e., the probability of selecting the next POI at step t given the POI sequence already selected;
the purpose of training the DRL model is to update the policy parameters to maximize the total reward value, i.e., to train the model so that it can generate the route maximizing the total score of the user's preferences; the expected reward for a given parameter setting and map is calculated by summing over all travel routes:

J(θ | G) = E_{O ~ π_θ(·|G)} [ R(O | G) ]

where π_θ represents the policy network with parameter θ, and R(O | G) represents the reward obtained by POI sequence O given that the POI network distribution map is G;
and finally, performing model training by adopting an actor-reviewer algorithm to obtain an optimized DRL model.
4. The method of claim 3, wherein the policy network is trained with a policy-gradient-based actor-reviewer algorithm; in the actor-reviewer framework, the actor is responsible for learning the policy via policy gradients, i.e., the policy network generates actions through interaction with the environment, while the reviewer estimates the expected cumulative reward, which is used to evaluate the actor's performance and guide the actor's action at the next stage;
by giving the POI network distribution map G and setting the parameters, the training goal of the policy network is to maximize the expected reward; to maximize the expected reward, the policy is updated with a policy gradient algorithm; the policy gradient ∇_θ J(θ | G) is expressed as:

∇_θ J(θ | G) = E_{O ~ π_θ} [ (R(O | G) - b(G)) ∇_θ log π_θ(O) ]

where G represents the POI network distribution map; O_n represents the randomly generated n-th POI sequence; R(O_n | G) represents the reward of POI sequence O_n; b(G) represents a baseline on the expected cumulative reward, used to reduce training variance; and π_θ(O) represents the policy network with parameter θ;
the reviewer network is a feed-forward neural network whose input is the weighted sum of the embedded feature vectors of the sights, followed by two hidden layers (a ReLU layer and a dense layer) and another linear layer with a single output that returns the estimated reward; the mean squared error is used as the reviewer's loss to train the reviewer network parameters, specifically:

L(β) = (1/N) Σ_{n=1}^{N} ( b_β(G) - R(O_n | G) )²

where N represents the batch size; O_n represents the randomly generated n-th POI sequence; G represents the POI network distribution map; R(O_n | G) represents the reward of POI sequence O_n; and b_β(G) represents the expected cumulative reward estimated with parameter β;
the expected value is replaced by the average over the Monte Carlo samples in a batch of size N:

∇_θ J(θ | G) ≈ (1/N) Σ_{n=1}^{N} (R(O_n | G) - b_β(G)) ∇_θ log π_θ(O_n)

where N is the batch size; the gradient is then used to adjust the policy parameters:

θ ← θ + α ∇_θ J(θ | G)

where α is the learning rate, controlling the update rate of the policy parameters;
in training, the actor network and the reviewer network are trained simultaneously, and the actor network updates the parameters of the policy function according to the direction suggested by the reviewer.
5. The method for customizing a travel route based on deep reinforcement learning of claim 1, wherein the POI attribute information comprises the POI name, geographic location, visit duration and business hours; the tourist review information comprises the tourist ID, travel type, review time, rating and review text; the traffic information mainly comprises the travel time between two points, including time spent self-driving and on public transport; and the tourist demand information comprises the number of tourists, tourist composition, starting point, end point and touring time.
6. The method for customizing a travel route based on deep reinforcement learning of claim 5, wherein the specific process of mining tourists' historical preferences is as follows: mining tourist preference scores from three aspects, namely contextual factors, review text and ratings, with the tourist type taken into account as the contextual factor; and meanwhile, based on the comprehensive and detailed descriptions given by tourists in review text, further mining tourist preference scores with a natural language processing method.
7. The method for customizing a travel route based on deep reinforcement learning of claim 1, wherein when the route is dynamically updated based on the tourist's real-time scene changes, real-time geographic position information obtained from mobile-phone GPS positioning is added to the input information on the basis of the trained route optimization model, and an updated POI sequence is output.
8. A travel route customizing system based on deep reinforcement learning is characterized by comprising a demand obtaining module, a route generating module and a route updating module;
the demand acquisition module is used for acquiring demands of tourists;
the route generation module is used for generating a customized route according to the tourist demand and a route optimization model; the route generation module calls the route optimization model to generate the customized route, where the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and the route updating module dynamically updates the route based on the route optimization model according to the real-time scene change of the tourist.
9. A computer device, comprising one or more processors and a memory, the memory storing a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when executing part or all of the computer-executable program can implement the deep-reinforcement-learning-based travel route customization method of any one of claims 1-7.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium; when the computer program is executed by a processor, the deep-reinforcement-learning-based travel route customization method of any one of claims 1-7 is implemented.
CN202111635694.2A 2021-12-28 2021-12-28 Travel route customizing method and system based on deep reinforcement learning Pending CN114254837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111635694.2A CN114254837A (en) 2021-12-28 2021-12-28 Travel route customizing method and system based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111635694.2A CN114254837A (en) 2021-12-28 2021-12-28 Travel route customizing method and system based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114254837A true CN114254837A (en) 2022-03-29

Family

ID=80795504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111635694.2A Pending CN114254837A (en) 2021-12-28 2021-12-28 Travel route customizing method and system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114254837A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115371684A (en) * 2022-10-24 2022-11-22 四川师范大学 Scenic spot playing path planning method and system
CN116579514B (en) * 2023-07-12 2023-09-26 湖南师范大学 Self-driving tour planning method and system based on role collaborative recommendation
CN117592239A (en) * 2024-01-17 2024-02-23 北京邮电大学 Multi-objective optimization optical cable network route intelligent planning method and system
KR102644622B1 (en) * 2022-12-13 2024-03-07 한지은 Tour guide service system and method for fan of artist

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130006521A1 (en) * 2011-06-29 2013-01-03 Needham Bradford H Customized Travel Route System
KR20140046792A (en) * 2012-10-11 2014-04-21 황규원 Travel scheduling system and travel scheduling method using the system
CN108829852A (en) * 2018-06-21 2018-11-16 桂林电子科技大学 A kind of individualized travel route recommended method
US20180364054A1 (en) * 2017-06-15 2018-12-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for building an itinerary-planning model and planning a traveling itinerary
CN109447312A (en) * 2018-09-14 2019-03-08 北京三快在线科技有限公司 Route planning method, device, electronic equipment and readable storage medium storing program for executing
CN113158086A (en) * 2021-04-06 2021-07-23 浙江贝迩熊科技有限公司 Personalized customer recommendation system and method based on deep reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130006521A1 (en) * 2011-06-29 2013-01-03 Needham Bradford H Customized Travel Route System
KR20140046792A (en) * 2012-10-11 2014-04-21 황규원 Travel scheduling system and travel scheduling method using the system
US20180364054A1 (en) * 2017-06-15 2018-12-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for building an itinerary-planning model and planning a traveling itinerary
CN108829852A (en) * 2018-06-21 2018-11-16 桂林电子科技大学 A kind of individualized travel route recommended method
CN109447312A (en) * 2018-09-14 2019-03-08 北京三快在线科技有限公司 Route planning method, device, electronic equipment and readable storage medium storing program for executing
CN113158086A (en) * 2021-04-06 2021-07-23 浙江贝迩熊科技有限公司 Personalized customer recommendation system and method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU, Qingxia et al.: "Personalized travel route recommendation based on user interests and POI popularity", Journal of Computer Applications (计算机应用), vol. 36, no. 06, 10 June 2016 (2016-06-10), pages 1762-1766 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115371684A (en) * 2022-10-24 2022-11-22 四川师范大学 Scenic spot playing path planning method and system
CN115371684B (en) * 2022-10-24 2023-02-03 四川师范大学 Scenic spot playing path planning method and system
KR102644622B1 (en) * 2022-12-13 2024-03-07 한지은 Tour guide service system and method for fan of artist
CN116579514B (en) * 2023-07-12 2023-09-26 湖南师范大学 Self-driving tour planning method and system based on role collaborative recommendation
CN117592239A (en) * 2024-01-17 2024-02-23 北京邮电大学 Multi-objective optimization optical cable network route intelligent planning method and system
CN117592239B (en) * 2024-01-17 2024-04-26 北京邮电大学 Multi-objective optimization optical cable network route intelligent planning method and system

Similar Documents

Publication Publication Date Title
CN114254837A (en) Travel route customizing method and system based on deep reinforcement learning
CN107633317B (en) Method and device for establishing journey planning model and planning journey
JP5932030B2 (en) Custom travel route system
Gavalas et al. The eCOMPASS multimodal tourist tour planner
CN111143680B (en) Route recommendation method, system, electronic equipment and computer storage medium
CN107423837A (en) The Intelligent planning method and system of tourism route
CN109870164A (en) Navigation terminal and its route preferences prediction technique
CN106599092A (en) Method and device for recommending tourist attractions
CN106095973A (en) The tourism route of a kind of combination short term traffic forecasting recommends method
CN113935547A (en) Optimal-utility customized scenic spot route planning system
Kabaya et al. Investigating future ecosystem services through participatory scenario building and spatial ecological–economic modelling
Barhorst-Cates et al. Effects of home environment structure on navigation preference and performance: A comparison in Veneto, Italy and Utah, USA
CN112632379A (en) Route recommendation method and device, electronic equipment and storage medium
Gagnon et al. Stepping into a map: Initial heading direction influences spatial memory flexibility
KR102042919B1 (en) A System of Providing Theme Travel AI Curation Based on AR through Customization Learning of Virtual Character
CN110309438A (en) Recommended method, device, computer storage medium and the electronic equipment of planning driving path
CN110532464B (en) Tourism recommendation method based on multi-tourism context modeling
CN115238169A (en) Mu course interpretable recommendation method, terminal device and storage medium
Cao An optimal round-trip route planning method for tourism based on improved genetic algorithm
Ding et al. Two-stage travel itinerary recommendation optimization model considering stochastic traffic time
CN110489837B (en) City landscape satisfaction calculation method, computer equipment and storage medium
Carlsson Environmental design, systems thinking, and human agency: McHarg’s ecological method and Steinitz and Rogers’s interdisciplinary education experiment
Klein-Hewett Design as an indicator of tourist destination change: The concept renewal cycle at Watkins Glen state park
EP4047447A1 (en) Route recommendation method and apparatus, electronic device, and storage medium
JP7199067B2 (en) Reasoning device, reasoning method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination