CN114254837A - Travel route customizing method and system based on deep reinforcement learning - Google Patents

Travel route customizing method and system based on deep reinforcement learning

Info

Publication number
CN114254837A
CN114254837A
Authority
CN
China
Prior art keywords
tourist
route
poi
information
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111635694.2A
Other languages
Chinese (zh)
Inventor
赵玺
刘佳璠
王乐
李雨航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202111635694.2A priority Critical patent/CN114254837A/en
Publication of CN114254837A publication Critical patent/CN114254837A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047 Optimisation of routes or paths, e.g. travelling salesman problem
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/14 Travel agencies

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Human Resources & Organizations (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Biophysics (AREA)
  • General Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a travel route customization method and system based on deep reinforcement learning. The method mines tourists' historical preference scores from hotel, scenic spot and traffic data; builds a route optimization framework based on a deep reinforcement learning algorithm; obtains tourist requirements and generates an intelligent, customized route; and dynamically updates the route as the tourist's real-time scene changes. The method can quickly produce an intelligent, customized route covering both hotels and scenic spots, provides more diverse and convenient services, and saves tourists the time of choosing hotels and scenic spots and planning routes. The environment is taken to be the real tourism environment of the tourist, comprising POI information and tourist input information; the route is generated according to the tourist's historical preferences and requirements, meeting personalized, customized design needs. The route is planned dynamically and intelligently according to the tourist's real travel route, and the optimization model learns further from it, improving tourist satisfaction and experience.

Description

Travel route customizing method and system based on deep reinforcement learning
Technical Field
The invention belongs to the field of travel route customization, and particularly relates to a travel route customization method and system based on deep reinforcement learning.
Background
The tourism industry, an important component of the modern service industry, has gradually become one of the most important economic driving forces in the world. As travel has become more popular, tourist behavior has changed greatly: tourists increasingly prefer "customized tours" and "self-driving tours" over pre-organized routes or standard travel packages.
The customized travel route problem is known as the "travel route design problem": its object is to design a travel route that maximizes the tourist's total preference score subject to the tourist's constraints. Existing customized tour routing mechanisms pay relatively little attention to hotel selection, yet hotels are an important component of a multi-day tour route. In practice, a tourist typically selects a hotel at the end of each day and continues traveling from that hotel the next day. Solving the travel design problem with hotel selection is therefore a key and complex problem.
The current mainstream approaches are heuristics, which focus primarily on selecting and ranking POIs but neglect optimizing the sightseeing time spent at each POI, as well as real traffic conditions and tourists' real preferences for POIs.
Disclosure of Invention
The invention aims to provide a tour route customization method and system based on deep reinforcement learning that can quickly produce intelligent, customized routes covering hotels and scenic spots, meet tourists' personalized and customized design requirements, and improve tourist satisfaction and experience.
In order to achieve this purpose, the invention adopts the following technical scheme: a travel route customization method based on deep reinforcement learning comprises the following steps:
the method comprises the steps of obtaining tourist demands, and generating a customized route based on the tourist demands and a route optimization model;
the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and dynamically updating the route based on the route optimization model according to the real-time scene change of the tourist.
The route optimization model training comprises the following steps:
collecting attribute information and tourist review information for the hotels and scenic spots of the travel destination, where hotels and scenic spots are collectively called POIs (points of interest), and simultaneously collecting POI-related data and traffic information; the travel destination may be the series of destinations of a traditional classic route, a single city, or a particular scenic area;
constructing a tourist portrait by analyzing the tourist review information, and mining tourists' preference scores for scenic spots;
based on the tourist preference scores, the scenic spot information and the traffic information, constructing a route optimization model under the deep reinforcement learning framework, and training it with the actor-reviewer algorithm to obtain the optimized model.
The deep reinforcement learning framework comprises states, actions, rewards and strategies;
the state: the state defines the sequence of POIs selected so far; the state is the output of the environment and the input of the agent; in travel route design, state elements are divided into static elements and dynamic elements according to the travel context information;
the action: given the current state, the next POI to select is the action; after a POI is selected, the state is updated to a new state, and selecting different actions (POIs) leads to different state updates;
the reward: the reward defines the effect an action taken in the current state has on the environment; the user's total preference value for a POI sequence is taken as the reward, which guides the agent to select the POI sequence maximizing the objective function; the specific calculation formula is:

U_k = Σ_{t=1}^{T} u_k(a_t)

where k ∈ {1, 2, ..., K} represents the tourist type; U represents the total preference score; U_k represents the total preference score of a tourist of type k; a_t represents the POI selected at step t; and u_k(a_t) represents the preference score brought by the POI selected at step t for a tourist of type k;
the strategy: the agent selects an action based on the policy, which takes the current state as input and outputs a probability distribution over actions, mapping the current state to the next optimal control action to select; if a sequence has a higher total reward, the parameters are updated to favor that sequence; the action probability is calculated as:

P(a_t | s_t, G) = π_θ(s_t, a_t)

where G represents the POI network distribution map; π_θ represents the policy network with parameter θ; s_t represents the state at stage t; and a_t represents the action generated at stage t;
then, the generation probability of the POI sequence is calculated as:

p(O | G) = Π_{t=1}^{T} P(o_t | O_{t-1}, G)

where O represents a POI sequence; G represents the POI network distribution map; and P(o_t | O_{t-1}, G) represents the action probability, i.e., the probability of selecting the next POI at step t given the POI sequence already selected;
the purpose of training the DRL model is to update the policy parameters to maximize the total reward value, i.e., to train the model so that it can generate the route maximizing the total score of the user's preferences; the expected reward for a given parameter setting and map is calculated by summing over all travel routes:

J(θ | G) = E_{O ~ π_θ(·|G)} [ R(O | G) ]

where π_θ represents the policy network with parameter θ, and R(O | G) represents the reward obtained by POI sequence O given that the POI network distribution map is G;
and finally, performing model training by adopting an actor-reviewer algorithm to obtain an optimized DRL model.
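As an illustration of the state-action-reward loop defined above, the following Python sketch samples a short POI sequence from a softmax policy and scores it with the total preference reward U_k. It is a minimal toy under stated assumptions, not the patent's implementation: the preference table u_k, the random logits standing in for the policy network's output, and the helper names total_preference and sample_next_poi are all invented for illustration.

import numpy as np

def total_preference(u_k, poi_sequence):
    # Reward U_k: the sum of per-step preference scores u_k(a_t) over the route.
    return sum(u_k[a] for a in poi_sequence)

def sample_next_poi(logits, visited, rng):
    # Sample a_t from the softmax policy pi_theta(s_t, .), masking visited POIs.
    masked = logits.copy()
    masked[list(visited)] = -np.inf          # a selected POI cannot be chosen again
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
u_k = np.array([0.9, 0.4, 0.7, 0.2, 0.6])    # hypothetical preferences of type k per POI
logits = rng.normal(size=5)                  # stand-in for the policy network's output
visited, route = {0}, [0]                    # start from the tourist's given starting point
for _ in range(3):
    a = sample_next_poi(logits, visited, rng)
    visited.add(a)
    route.append(a)
print(route, total_preference(u_k, route))

Masking already-visited POIs mirrors the sequential selection described above: once a POI enters the route, the policy can no longer choose it.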
The policy network is trained with a policy-gradient-based actor-reviewer algorithm; in the actor-reviewer framework, the actor is responsible for learning the policy via policy gradients, i.e., the policy network generates actions through interaction with the environment, while the reviewer estimates the expected cumulative reward, which is used to evaluate the actor's performance and guide the actor's action at the next stage;
by giving the POI network distribution map G and setting the parameters, the training goal of the policy network is to maximize the expected reward; to maximize the expected reward, the policy is updated with a policy gradient algorithm; the policy gradient ∇_θ J(θ | G) is expressed as:

∇_θ J(θ | G) = E_{O ~ π_θ} [ (R(O | G) - b(G)) ∇_θ log π_θ(O) ]

where G represents the POI network distribution map; O_n represents the randomly generated n-th POI sequence; R(O_n | G) represents the reward of POI sequence O_n; b(G) represents a baseline on the expected cumulative reward, used to reduce training variance; and π_θ(O) represents the policy network with parameter θ;
the reviewer network is a feed-forward neural network whose input is the weighted sum of the embedded feature vectors of the sights, followed by two hidden layers (a ReLU layer and a dense layer) and another linear layer with a single output that returns the estimated reward; the mean squared error is used as the reviewer's loss to train the reviewer network parameters, specifically:

L(β) = (1/N) Σ_{n=1}^{N} ( b_β(G) - R(O_n | G) )²

where N represents the batch size; O_n represents the randomly generated n-th POI sequence; G represents the POI network distribution map; R(O_n | G) represents the reward of POI sequence O_n; and b_β(G) represents the expected cumulative reward estimated with parameter β;
the expected value is replaced by the average over the Monte Carlo samples in a batch of size N:

∇_θ J(θ | G) ≈ (1/N) Σ_{n=1}^{N} (R(O_n | G) - b_β(G)) ∇_θ log π_θ(O_n)

where N is the batch size; the gradient is then used to adjust the policy parameters:

θ ← θ + α ∇_θ J(θ | G)

where α is the learning rate, controlling the update rate of the policy parameters;
in training, the actor network and the reviewer network are trained simultaneously, and the actor network updates the parameters of the policy function according to the direction suggested by the reviewer.
The POI attribute information comprises the POI name, geographic position, visit duration and business hours; the tourist review information comprises the tourist ID, travel type, review time, rating and review text; the traffic information mainly comprises the travel time between two points, including time spent self-driving and on public transport; the tourist demand information comprises the number of tourists, tourist composition, starting point, end point and touring time.
The specific process of mining tourists' historical preferences is as follows: tourist preference scores are mined from three aspects, namely contextual factors, review text and ratings, with the tourist type taken into account as the contextual factor; meanwhile, since review text contains comprehensive and detailed descriptions by tourists, a natural language processing method is used to further mine tourist preference scores.
And when the route is dynamically updated, based on the trained route optimization model, mobile-phone GPS positioning is used to add real-time geographic position information to the input information, and an updated POI sequence is output.
On the other hand, the invention also provides a travel route customizing system based on deep reinforcement learning, which comprises a demand obtaining module, a route generating module and a route updating module;
the demand acquisition module is used for acquiring demands of tourists;
the route generation module is used for generating a customized route according to the tourist demand and a route optimization model; the route generation module calls the route optimization model to generate the customized route, where the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and the route updating module dynamically updates the route based on the route optimization model according to the real-time scene change of the tourist.
The invention also provides a computer device comprising one or more processors and a memory, the memory storing a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when doing so can implement the deep-reinforcement-learning-based travel route customization method of the invention.
Meanwhile, a computer-readable storage medium is provided, in which a computer program is stored; when the computer program is executed by a processor, the deep-reinforcement-learning-based travel route customization method of the invention can be implemented.
Compared with the prior art, the invention has the following beneficial technical effects:
the intelligent and customized route containing the hotel and the scenic spot can be quickly obtained, more diversified and convenient services are provided for the tourists, and the time for selecting the hotel, the scenic spot and the route planning by the tourists is saved; the route is generated according to the historical preference and the requirement of the tourist, so that the personalized and customized design requirement of the tourist can be met; according to the invention, the route is dynamically and intelligently planned according to the real tourist route of the tourist, and the optimization model is further learned, so that the satisfaction degree and experience of the tourist can be improved.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a diagram of a deep reinforcement learning framework according to the present invention;
FIG. 3 is a diagram of a part of a sample travel route generation in example 1 of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
referring to fig. 1, the present invention provides a travel route customization method based on deep reinforcement learning, which comprises the following steps:
step 1, collecting attribute information and tourist comment information of hotels and scenic spots of tourist destinations, wherein the hotels and the scenic spots are collectively called POIs (point of interest), and meanwhile collecting data related to the POIs and traffic information; the POI attribute information comprises a POI name, a geographic position, a tour duration and business hours; the tourist comment information comprises a tourist ID, a play type, a comment time, a score and a comment text; the traffic information includes traffic time between two points, such as time spent on self-driving and public transportation; the travel destination comprises a series of destinations of a traditional classical route or a single city or a certain scenic spot;
and 2, constructing the tourist portrait by analyzing the comment information of the tourist, and mining the preference score of the tourist. Specifically, the tourist preference score is mined mainly from three aspects of context factors, comment texts and scores. Considering that the contextual factor of guest type causes guests to comment differently on attractions, the present invention takes this contextual factor of guest type into account in order to more fully capture the guest preference score. Because the comment text contains the comprehensive and detailed description of the tourist, the invention further mines the preference score of the tourist by adopting a natural language processing method.
Step 3, designing a route optimization framework based on a deep reinforcement learning algorithm:
first, the travel route planning problem is defined as a Markov decision process, i.e., POI information is generated sequentially in a sequence. The tourists give a starting point start, an end, playing days d and playing duration t every day, the scenic spots are sequentially selected according to the starting point start given by the tourists, and as an example, the tourists select the scenic spot a2,a3,a4When a is selected4Then, when the time is night, selecting a hotel next step; the next day, from the hotel, the process repeats until the end of play, selecting the guest's given endpoint.
Then, referring to fig. 2, a deep reinforcement learning (DRL) framework is designed. The DRL framework comprises an environment and an agent; the environment can be seen as the real tourism environment of the tourist, containing POI information and tourist input information, and the invention learns the environment representation with a deep learning algorithm. The agent is the core of the algorithm: given the environment information as input, it outputs the next POI to select. Finally, an optimized model is obtained through actor-reviewer (Actor-Critic) training.
The deep reinforcement learning framework includes states, actions, rewards, and strategies.
The state: the state defines the sequence of POIs selected before step t. The state is the output of the environment and the input to the agent. In travel route design, state elements can be divided into static and dynamic elements according to the travel context information; for example, a POI's position information and business hours are both static elements, while the time budget, a factor that changes gradually, is a dynamic element.
The action: given the current state, the next POI to select is the action; after a POI is selected, the state is updated to a new state, and selecting different actions (POIs) leads to different state updates.
The reward: the reward defines whether an action is good or bad for the environment in the current state. The designed travel route is expected to give tourists more satisfaction and better match user preferences, i.e., the higher the POI score value, the better. The user's total preference value for a POI sequence is therefore taken as the reward, which guides the agent to select the POI sequence maximizing the objective function; the specific calculation formula is:

U_k = Σ_{t=1}^{T} u_k(a_t)

where k ∈ {1, 2, ..., K} represents the tourist type; U represents the total utility; U_k represents the total preference score of a tourist of type k; a_t represents the POI selected at step t; and u_k(a_t) represents the preference score of the POI selected at step t for a tourist of type k.

The strategy: the agent selects an action based on the policy, which takes the current state as input and outputs a probability distribution over actions, mapping the current state to the next optimal control action. If a sequence has a higher total reward, the parameters are updated to favor that sequence. The action probability is calculated as:
P(a_t | s_t, G) = π_θ(s_t, a_t)
where G represents the POI network distribution map; π_θ represents the policy network with parameter θ; s_t represents the state at stage t; and a_t represents the action generated at stage t.
Then, the generation probability of the POI sequence is calculated as:

p(O | G) = Π_{t=1}^{T} P(o_t | O_{t-1}, G)

where O represents a POI sequence; G represents the POI network distribution map; and P(o_t | O_{t-1}, G) represents the action probability, i.e., the probability of selecting the next POI at step t given the POI sequence already selected.
The purpose of training the DRL model is to update the policy parameters to maximize the total reward. Travel routes are stochastic, so the expected reward for a given parameter setting and map is calculated by summing over all travel routes:

J(θ | G) = E_{O ~ π_θ(·|G)} [ R(O | G) ]

where π_θ represents the policy network with parameter θ, and R(O | G) represents the reward obtained by POI sequence O given that the POI network distribution map is G.
Finally, model training is performed. Specifically, the invention trains the policy network with a policy-gradient-based actor-reviewer (Actor-Critic) algorithm. In the actor-reviewer framework, the actor is responsible for learning the policy via policy gradients, i.e., the policy network generates actions through interaction with the environment. The reviewer estimates the expected cumulative reward, which is used to evaluate the actor's performance and guide the actor's action at the next stage.
Given the POI network distribution map G and the parameter settings, the training goal of the network is to maximize the expected reward. To maximize the expected reward, the policy is updated with a policy gradient algorithm (policy gradient method), expressed as:

∇_θ J(θ | G) = E_{O ~ π_θ} [ (R(O | G) - b(G)) ∇_θ log π_θ(O) ]

where G represents the POI network distribution map; O_n represents the randomly generated n-th POI sequence; R(O_n | G) represents the reward of POI sequence O_n; b(G) represents a baseline on the expected cumulative reward, used to reduce training variance; and π_θ(O) represents the policy network with parameter θ.
The reviewer network is a feed-forward neural network whose input is the weighted sum of the embedded feature vectors of the sights, followed by two hidden layers (a ReLU layer and a dense layer) and another linear layer (with a single output returning the estimated reward). The mean squared error serves as the reviewer's loss, used to train the reviewer network parameters, specifically:

L(β) = (1/N) Σ_{n=1}^{N} ( b_β(G) - R(O_n | G) )²

where N represents the batch size; O_n represents the randomly generated n-th POI sequence; G represents the POI network distribution map; R(O_n | G) represents the reward of POI sequence O_n; and b_β(G) represents the expected cumulative reward estimated with parameter β.
In practice, the expected value is replaced by the average over the Monte Carlo samples in a batch of size N:

∇_θ J(θ | G) ≈ (1/N) Σ_{n=1}^{N} (R(O_n | G) - b_β(G)) ∇_θ log π_θ(O_n)

where N is the batch size. The gradient is then used to adjust the policy parameters:

θ ← θ + α ∇_θ J(θ | G)

where α is the learning rate, controlling the update rate of the policy parameters.
During training, the actor network and the reviewer network are trained simultaneously, and the actor network updates the parameters of the policy function in the direction suggested by the reviewer; a partial sample of generated travel routes is shown in FIG. 3.
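Under stated assumptions, the actor-reviewer update just described might look as follows in PyTorch. The two-layer feed-forward actor and reviewer, the toy reward function and the tensor shapes are illustrative stand-ins rather than the patent's networks; only the structure (a Monte Carlo policy gradient with a learned baseline b_β(G) trained by mean squared error) follows the text above.

import torch
import torch.nn as nn

n_poi, emb_dim, batch = 20, 128, 256
actor = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, n_poi))
critic = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-6)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-6)

def train_step(graph_embedding, reward_fn, horizon=8):
    # graph_embedding: [1, emb_dim] summary of the POI network distribution map G.
    g = graph_embedding.expand(batch, emb_dim)
    log_probs, actions = [], []
    for _ in range(horizon):                 # roll out a batch of POI sequences O_n
        dist = torch.distributions.Categorical(logits=actor(g))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        actions.append(a)
    log_pi = torch.stack(log_probs).sum(0)   # log pi_theta(O_n)
    reward = reward_fn(torch.stack(actions, dim=1))  # R(O_n | G), shape [batch]
    baseline = critic(g).squeeze(-1)                 # b_beta(G)
    actor_loss = -((reward - baseline).detach() * log_pi).mean()  # policy gradient
    critic_loss = ((baseline - reward) ** 2).mean()               # reviewer MSE loss
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    return actor_loss.item(), critic_loss.item()

scores = torch.rand(n_poi)                   # toy per-POI preference scores
train_step(torch.randn(1, emb_dim), lambda seqs: scores[seqs].sum(dim=1))

As in the description, the two networks are trained simultaneously: the reward-minus-baseline term plays the role of the reviewer's suggested direction for the actor.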
Step 4, based on the trained intelligent customized model, once the tourist inputs the starting point, end point and play-time factors, a customized route can be generated within seconds; the tourist can also input a starting point, end point, play time and food requirements; or, taking the context into account, the tourist can input a starting point, end point, play time, ethnic restaurant needs and a featured hotel;
and step 5, considering that the tourists can actually select scenic spots which do not appear in the customized route according to self tendency when actually playing, so that the route is dynamically updated based on the scene where the tourists are located in real time. The deep reinforcement learning adopts a deep learning algorithm to represent the environment variable, and the representation can be automatically updated according to the input position, so that the real-time geographical position information is added in the input information based on the GPS positioning of the mobile phone, and the updated path can be output.
The present invention provides a travel route customization system, comprising a demand acquisition module, a route generation module and a route updating module;
the demand acquisition module is used for acquiring demands of tourists;
the route generation module is used for generating a customized route according to the tourist demand and a route optimization model; the route generation module calls the route optimization model to generate the customized route, where the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and the route updating module dynamically updates the route based on the route optimization model according to the real-time scene change of the tourist.
Optionally, the invention further provides a device for deep-reinforcement-learning-based travel route customization, comprising a processor and a memory, the memory storing a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when doing so can implement part or all of the steps of the deep-reinforcement-learning-based travel route customization method of the invention.
The device for customizing the tour route based on the deep reinforcement learning can be a notebook computer, a tablet computer, a desktop computer, a mobile phone or a workstation.
The processor may be a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA).
The memory may be an internal storage unit of a notebook computer, tablet computer, desktop computer, mobile phone or workstation, such as memory or a hard disk; external storage units such as removable hard disks or flash memory cards may also be used.
Computer-readable storage media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. The computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The random access memory may include a resistive random access memory (ReRAM).
The specific embodiment of the invention is as follows:
the method realizes the travel route customization based on the deep reinforcement learning, and the specific process is as follows:
step 1, crawling data related to POI attribute information, tourist comment information and traffic information of 64 hotels and 64 scenic spots in Beijing city in 12 months in 2018 from TripAdvisor.com, Ctip.com, Meituan.com, Baike.baidu.com and Amap.com by using a crawler technology. The POI attribute information comprises a POI name, a geographic position, a tour duration and business hours; the tourist comment information comprises a tourist ID, a play type, a comment time, a score and a comment text; the traffic information mainly comprises self-driving time between two points and bus time.
And 2, constructing the tourist portrait by analyzing tourist review information and mining tourist preference scores. Preference scores are mined from three aspects, namely contextual factors, review text and ratings, with the tourist type taken into account as the contextual factor. Because review text contains tourists' comprehensive and detailed descriptions, the invention further mines preference scores with natural language processing techniques. As an example, the invention divides tourists into four types, expressed as:

k ∈ {1, 2, 3, 4} = {family trip, friends trip, couple trip, solo trip}
Tourists' satisfaction with POIs is expressed numerically through sentiment analysis of online reviews and rating processing with Python's SnowNLP library. The numerical rating is based on a five-point scale, with 5 the most satisfactory and 1 the least. To simplify the calculation, the ratings are weighted; the weighted average rating R_{k,i} of travel type k for POI(i) is expressed as:

R_{k,i} = (1/N) Σ_{q=1}^{5} q · N^q_{k,i}
where q ∈ {1, 2, 3, 4, 5} is a specific value on the five-point scale, N^q_{k,i} is the number of reviews in which tourist type k rated POI(i) as q, and N is the total number of such ratings. In addition, the reviews of tourist type k on POI(i) are stored as a text file T_{k,i}.
After the online review text is preprocessed, e.g., tokenization and elimination of stop words, the sentiment value E_{k,i} of tourist type k for POI(i) is calculated with Python's SnowNLP sentiment analysis tool. Finally, the satisfaction value Sat_{k,i} of tourist type k for POI(i) is expressed as:

Sat_{k,i} = v_1 · R_{k,i} + v_2 · E_{k,i}
where v_1, v_2 (v_1 + v_2 = 1) are weights that can be flexibly assigned according to personal wishes and preferences. In this example v_1 = v_2 = 0.5.
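A minimal sketch of this satisfaction computation follows. SnowNLP is the library the embodiment names; the tiny rating list and review string are invented for illustration, and rescaling the five-point average to [0, 1] before mixing is an added assumption so that the two terms share a range.

from snownlp import SnowNLP

ratings = [5, 4, 5, 3]                        # hypothetical five-point ratings of type k for POI(i)
reviews = "景色很美，值得一去。排队时间有点长。"    # stand-in for the review text file T_{k,i}

R_ki = sum(ratings) / len(ratings)            # weighted average rating (uniform weights here)
E_ki = SnowNLP(reviews).sentiments            # sentiment value in [0, 1]

v1 = v2 = 0.5                                 # weights from the example, v1 + v2 = 1
Sat_ki = v1 * (R_ki / 5.0) + v2 * E_ki        # rescaling the rating to [0, 1] is an assumption
print(round(Sat_ki, 3))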
Tourists' preference scores for POIs are expressed numerically mainly by analyzing travel notes with TF-IDF techniques. The notes of tourist type k are stored as a text file log_k; after preprocessing, Python's scikit-learn tools are used to calculate the preference score P_{k,i} of tourist type k for POI(i). After the Sat_{k,i} and P_{k,i} calculated above are normalized, the final score Score_{k,i} of tourist type k for POI(i) can be calculated, specifically:

Score_{k,i} = θ_1 · Sat'_{k,i} + θ_2 · P'_{k,i}
where the primes denote the variables after normalization, and θ_1, θ_2 (θ_1 + θ_2 = 1) are weights that can be flexibly assigned. In this example θ_1 = θ_2 = 0.5.
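The following sketch illustrates the final combination. Treating the TF-IDF weight of one proxy token per POI as the note-based preference score P_{k,i} is a simplification invented for illustration, and min-max scaling is one plausible choice for the unspecified normalization step; only θ_1 = θ_2 = 0.5 and the overall structure follow the example.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

notes_k = [                                   # hypothetical travel notes of tourist type k
    "great wall sunrise great wall hike",
    "forbidden city museum visit",
]
vectorizer = TfidfVectorizer(vocabulary=["wall", "city"])  # one proxy token per POI
tfidf = vectorizer.fit_transform(notes_k).toarray()
P_ki = tfidf.max(axis=0)                      # note-based preference score per POI

Sat_ki = np.array([0.81, 0.66])               # satisfaction values from the previous step

def minmax(x):                                # one plausible normalization
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

theta1 = theta2 = 0.5                         # weights from the example
Score_ki = theta1 * minmax(Sat_ki) + theta2 * minmax(P_ki)
print(Score_ki.round(3))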
And 3, firstly, the travel route planning problem is defined as a Markov decision process, i.e., POI information is generated sequentially. Suppose the tourist gives a starting point start, an end point end, a number of play days d, and a daily play duration t. Scenic spots are selected in sequence from the given starting point; for example, scenic spots a_2, a_3, a_4 are selected, and after a_4 is selected and it is evening, a hotel is selected next; the next day starts from the hotel, and the process repeats until play ends, when the tourist's given end point is selected.
Then, referring to fig. 2, a deep reinforcement learning (DRL) framework is designed. The environment is the real tourism environment of the tourist, comprising POI information, tourist input information and so on. The agent is the core of the algorithm, outputting the next POI to select given the environment information as input. Finally, an optimized model is obtained through Actor-Critic training. In this example, the specific parameters of the DRL model are: the travel route map is embedded into a 128-dimensional vector with an encoder; the number of GRU hidden layer units is 128; the GRU's dropout layer parameter is set to 0.1; the number of epochs, the mini-batch size and the number of training instances are set to 1000, 256 and 1,000,000 respectively; all parameters are initialized with a Xavier initializer, and the L2 norm of the gradient is clipped to 2.0; the model is optimized with the Adam algorithm and the learning rate is set to 10^-6. Partial sample travel routes generated can be seen in fig. 3.
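Collected into code, these hyperparameters might be wired up as follows. The PointerEncoder class and the POI feature dimension are hypothetical; only the numbers (128-dimensional embedding, 128 GRU hidden units, dropout 0.1, Xavier initialization, mini-batches of 256, Adam with learning rate 10^-6) come from this embodiment.

import torch
import torch.nn as nn

class PointerEncoder(nn.Module):
    # Hypothetical wiring for the embodiment's hyperparameters.
    def __init__(self, n_poi_features, emb_dim=128, hidden=128, dropout=0.1):
        super().__init__()
        self.embed = nn.Linear(n_poi_features, emb_dim)        # 128-d map embedding
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)   # 128 hidden units
        self.drop = nn.Dropout(dropout)                        # dropout 0.1

    def forward(self, poi_features):           # poi_features: [batch, seq_len, features]
        hidden_states, _ = self.gru(self.embed(poi_features))
        return self.drop(hidden_states)

encoder = PointerEncoder(n_poi_features=8)     # the feature dimension is an assumption
for p in encoder.parameters():                 # Xavier initialization for weight matrices
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-6)
print(encoder(torch.randn(256, 10, 8)).shape)  # a mini-batch of 256 routes of length 10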
And 4, based on the trained customized model, once the tourist inputs factors such as the starting point, end point and play time, a customized route can be generated within seconds. In this example, under the same constraints (starting point S_3, 5 play days, 10 hours of play per day), the routes selected for the four tourist types are as follows:
family swimming:
66-69-67-38-65-76-72-32-97-74-48-75-100-80-2-88-85-90-18;
friend coming and going:
66-65-119-61-69-75-104-23-67-84-88-57-72-121-76-47-85-107-51;
the couple swims in two persons:
66-65-72-32-69-64-110-3-67-75-94-57-85-100-29-76-74-90-18;
single person swimming:
66-69-65-46-67-85-9-92-88-77-15-75-109-110-3-72-106-94-95-29。
and 5, when the tourist selects the scenic spots which do not appear in the customized route according to the tendency of the tourist, dynamically updating the route based on the scene where the tourist is located in real time, and automatically updating the representation according to the input position, so that the real-time geographic position is input based on the mobile phone GPS positioning system, and the updated path can be output.
The invention can quickly produce an intelligent, customized route covering hotels and scenic spots, provides more diverse and convenient services for tourists, and saves tourists the time of choosing hotels and scenic spots and planning routes; the route is generated according to the tourist's historical preferences and requirements, meeting personalized and customized design needs; the route is planned dynamically and intelligently according to the tourist's real travel route, and the optimization model learns further, improving tourist satisfaction and experience.

Claims (10)

1. A travel route customizing method based on deep reinforcement learning is characterized by comprising the following steps:
the method comprises the steps of obtaining tourist demands, and generating a customized route based on the tourist demands and a route optimization model;
the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and dynamically updating the route based on the route optimization model according to the real-time scene change of the tourist.
2. The travel route customization method based on deep reinforcement learning of claim 1, wherein the route optimization model training comprises the following steps:
collecting attribute information and tourist review information for the hotels and scenic spots of the travel destination, where hotels and scenic spots are collectively called POIs (points of interest), and simultaneously collecting POI-related data and traffic information; the travel destination may be the series of destinations of a traditional classic route, a single city, or a particular scenic area;
constructing a tourist portrait by analyzing the tourist review information, and mining tourists' preference scores for scenic spots;
based on the tourist preference scores, the scenic spot information and the traffic information, constructing a route optimization model under the deep reinforcement learning framework, and training it with the actor-reviewer algorithm to obtain the optimized model.
3. The deep reinforcement learning-based travel route customization method according to claim 1, wherein the deep reinforcement learning framework comprises status, action, reward and strategy;
the state: the state defines the sequence of POIs selected so far; the state is the output of the environment and the input of the agent; in travel route design, state elements are divided into static elements and dynamic elements according to the travel context information;
the action: given the current state, the next POI to select is the action; after a POI is selected, the state is updated to a new state, and selecting different actions (POIs) leads to different state updates;
the reward: the reward defines the effect an action taken in the current state has on the environment; the user's total preference value for a POI sequence is taken as the reward, which guides the agent to select the POI sequence maximizing the objective function; the specific calculation formula is:

U_k = Σ_{t=1}^{T} u_k(a_t)

where k ∈ {1, 2, ..., K} represents the tourist type; U represents the total preference score; U_k represents the total preference score of a tourist of type k; a_t represents the POI selected at step t; and u_k(a_t) represents the preference score brought by the POI selected at step t for a tourist of type k;
the strategy: the agent selects an action based on the policy, which takes the current state as input and outputs a probability distribution over actions, mapping the current state to the next optimal control action to select; if a sequence has a higher total reward, the parameters are updated to favor that sequence; the action probability is calculated as:

P(a_t | s_t, G) = π_θ(s_t, a_t)

where G represents the POI network distribution map; π_θ represents the policy network with parameter θ; s_t represents the state at stage t; and a_t represents the action generated at stage t;
then, the generation probability of the POI sequence is calculated as:

p(O | G) = Π_{t=1}^{T} P(o_t | O_{t-1}, G)

where O represents a POI sequence; G represents the POI network distribution map; and P(o_t | O_{t-1}, G) represents the action probability, i.e., the probability of selecting the next POI at step t given the POI sequence already selected;
the purpose of training the DRL model is to update the policy parameters to maximize the total reward value, i.e., to train the model so that it can generate the route maximizing the total score of the user's preferences; the expected reward for a given parameter setting and map is calculated by summing over all travel routes:

J(θ | G) = E_{O ~ π_θ(·|G)} [ R(O | G) ]

where π_θ represents the policy network with parameter θ, and R(O | G) represents the reward obtained by POI sequence O given that the POI network distribution map is G;
and finally, performing model training by adopting an actor-reviewer algorithm to obtain an optimized DRL model.
4. The method of claim 3, wherein the policy network is trained with a policy-gradient-based actor-reviewer algorithm; in the actor-reviewer framework, the actor is responsible for learning the policy via policy gradients, i.e., the policy network generates actions through interaction with the environment, while the reviewer estimates the expected cumulative reward, which is used to evaluate the actor's performance and guide the actor's action at the next stage;
by giving the POI network distribution map G and setting the parameters, the training goal of the policy network is to maximize the expected reward; to maximize the expected reward, the policy is updated with a policy gradient algorithm; the policy gradient ∇_θ J(θ | G) is expressed as:

∇_θ J(θ | G) = E_{O ~ π_θ} [ (R(O | G) - b(G)) ∇_θ log π_θ(O) ]

where G represents the POI network distribution map; O_n represents the randomly generated n-th POI sequence; R(O_n | G) represents the reward of POI sequence O_n; b(G) represents a baseline on the expected cumulative reward, used to reduce training variance; and π_θ(O) represents the policy network with parameter θ;
the reviewer network is a feed-forward neural network whose input is the weighted sum of the embedded feature vectors of the sights, followed by two hidden layers (a ReLU layer and a dense layer) and another linear layer with a single output that returns the estimated reward; the mean squared error is used as the reviewer's loss to train the reviewer network parameters, specifically:

L(β) = (1/N) Σ_{n=1}^{N} ( b_β(G) - R(O_n | G) )²

where N represents the batch size; O_n represents the randomly generated n-th POI sequence; G represents the POI network distribution map; R(O_n | G) represents the reward of POI sequence O_n; and b_β(G) represents the expected cumulative reward estimated with parameter β;
the expected value is replaced by the average over the Monte Carlo samples in a batch of size N:

∇_θ J(θ | G) ≈ (1/N) Σ_{n=1}^{N} (R(O_n | G) - b_β(G)) ∇_θ log π_θ(O_n)

where N is the batch size; the gradient is then used to adjust the policy parameters:

θ ← θ + α ∇_θ J(θ | G)

where α is the learning rate, controlling the update rate of the policy parameters;
in training, the actor network and the reviewer network are trained simultaneously, and the actor network updates the parameters of the policy function according to the direction suggested by the reviewer.
5. The method for customizing a travel route based on deep reinforcement learning of claim 1, wherein the POI attribute information comprises the POI name, geographic location, visit duration and business hours; the tourist review information comprises the tourist ID, travel type, review time, rating and review text; the traffic information mainly comprises the travel time between two points, including time spent self-driving and on public transport; and the tourist demand information comprises the number of tourists, tourist composition, starting point, end point and touring time.
6. The method for customizing a travel route based on deep reinforcement learning of claim 5, wherein the specific process of mining tourists' historical preferences is as follows: mining tourist preference scores from three aspects, namely contextual factors, review text and ratings, with the tourist type taken into account as the contextual factor; and meanwhile, based on the comprehensive and detailed descriptions given by tourists in review text, further mining tourist preference scores with a natural language processing method.
7. The method for customizing a travel route based on deep reinforcement learning of claim 1, wherein when the route is dynamically updated based on the tourist's real-time scene changes, real-time geographic position information obtained from mobile-phone GPS positioning is added to the input information on the basis of the trained route optimization model, and an updated POI sequence is output.
8. A travel route customizing system based on deep reinforcement learning is characterized by comprising a demand obtaining module, a route generating module and a route updating module;
the demand acquisition module is used for acquiring demands of tourists;
the route generation module is used for generating a customized route according to the tourist demand and a route optimization model; the route generation module calls the route optimization model to generate the customized route, where the route optimization model, based on a deep reinforcement learning framework, defines the travel route planning problem as a Markov decision process, i.e., POI information is generated sequentially in time order; the tourist gives the starting point, the end point, the number of travel days and the daily touring duration; scenic spots are selected in sequence from the given starting point, and a hotel is selected once each day's touring ends; the next day starts from that hotel, and the process repeats until the tour is finished, when the tourist's given end point is selected; the deep reinforcement learning framework comprises an environment and an agent, the environment being the real tourism environment of the tourist, containing POI information and tourist-input information variables, and a deep learning algorithm is used to learn the environment representation; given the environment information as input, the agent outputs the next POI to select; the route optimization model is obtained by training with an actor-reviewer algorithm;
and the route updating module dynamically updates the route based on the route optimization model according to the real-time scene change of the tourist.
9. A computer device, comprising one or more processors and a memory, the memory storing a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when executing part or all of the computer-executable program can implement the deep-reinforcement-learning-based travel route customization method of any one of claims 1-7.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium; when the computer program is executed by a processor, the deep-reinforcement-learning-based travel route customization method of any one of claims 1-7 is implemented.
CN202111635694.2A 2021-12-28 2021-12-28 Travel route customizing method and system based on deep reinforcement learning Pending CN114254837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111635694.2A CN114254837A (en) 2021-12-28 2021-12-28 Travel route customizing method and system based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111635694.2A CN114254837A (en) 2021-12-28 2021-12-28 Travel route customizing method and system based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114254837A true CN114254837A (en) 2022-03-29

Family

ID=80795504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111635694.2A Pending CN114254837A (en) 2021-12-28 2021-12-28 Travel route customizing method and system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114254837A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115371684A (en) * 2022-10-24 2022-11-22 四川师范大学 Scenic spot playing path planning method and system
CN116579514B (en) * 2023-07-12 2023-09-26 湖南师范大学 Self-driving tour planning method and system based on role collaborative recommendation
CN117592239A (en) * 2024-01-17 2024-02-23 北京邮电大学 Multi-objective optimization optical cable network route intelligent planning method and system
KR102644622B1 (en) * 2022-12-13 2024-03-07 한지은 Tour guide service system and method for fan of artist

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130006521A1 (en) * 2011-06-29 2013-01-03 Needham Bradford H Customized Travel Route System
KR20140046792A (en) * 2012-10-11 2014-04-21 황규원 Travel scheduling system and travel scheduling method using the system
CN108829852A (en) * 2018-06-21 2018-11-16 桂林电子科技大学 A kind of individualized travel route recommended method
US20180364054A1 (en) * 2017-06-15 2018-12-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for building an itinerary-planning model and planning a traveling itinerary
CN109447312A (en) * 2018-09-14 2019-03-08 北京三快在线科技有限公司 Route planning method, device, electronic equipment and readable storage medium storing program for executing
CN113158086A (en) * 2021-04-06 2021-07-23 浙江贝迩熊科技有限公司 Personalized customer recommendation system and method based on deep reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130006521A1 (en) * 2011-06-29 2013-01-03 Needham Bradford H Customized Travel Route System
KR20140046792A (en) * 2012-10-11 2014-04-21 황규원 Travel scheduling system and travel scheduling method using the system
US20180364054A1 (en) * 2017-06-15 2018-12-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for building an itinerary-planning model and planning a traveling itinerary
CN108829852A (en) * 2018-06-21 2018-11-16 桂林电子科技大学 A kind of individualized travel route recommended method
CN109447312A (en) * 2018-09-14 2019-03-08 北京三快在线科技有限公司 Route planning method, device, electronic equipment and readable storage medium storing program for executing
CN113158086A (en) * 2021-04-06 2021-07-23 浙江贝迩熊科技有限公司 Personalized customer recommendation system and method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU, Qingxia et al.: "Personalized travel route recommendation based on user interests and POI popularity", Journal of Computer Applications (计算机应用), vol. 36, no. 06, 10 June 2016 (2016-06-10), pages 1762-1766 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115371684A (en) * 2022-10-24 2022-11-22 四川师范大学 Scenic spot playing path planning method and system
CN115371684B (en) * 2022-10-24 2023-02-03 四川师范大学 Scenic spot playing path planning method and system
KR102644622B1 (en) * 2022-12-13 2024-03-07 한지은 Tour guide service system and method for fan of artist
CN116579514B (en) * 2023-07-12 2023-09-26 湖南师范大学 Self-driving tour planning method and system based on role collaborative recommendation
CN117592239A (en) * 2024-01-17 2024-02-23 北京邮电大学 Multi-objective optimization optical cable network route intelligent planning method and system
CN117592239B (en) * 2024-01-17 2024-04-26 北京邮电大学 Multi-objective optimization optical cable network route intelligent planning method and system

Similar Documents

Publication Publication Date Title
CN114254837A (en) Travel route customizing method and system based on deep reinforcement learning
CN107633317B (en) Method and device for establishing journey planning model and planning journey
JP5932030B2 (en) Custom travel route system
Gavalas et al. The eCOMPASS multimodal tourist tour planner
CN111143680B (en) Route recommendation method, system, electronic equipment and computer storage medium
CN107423837A (en) The Intelligent planning method and system of tourism route
CN109870164A (en) Navigation terminal and its route preferences prediction technique
CN106599092A (en) Method and device for recommending tourist attractions
CN106095973A (en) The tourism route of a kind of combination short term traffic forecasting recommends method
CN113935547A (en) Optimal-utility customized scenic spot route planning system
Kabaya et al. Investigating future ecosystem services through participatory scenario building and spatial ecological–economic modelling
Barhorst-Cates et al. Effects of home environment structure on navigation preference and performance: A comparison in Veneto, Italy and Utah, USA
CN112632379A (en) Route recommendation method and device, electronic equipment and storage medium
Gagnon et al. Stepping into a map: Initial heading direction influences spatial memory flexibility
KR102042919B1 (en) A System of Providing Theme Travel AI Curation Based on AR through Customization Learning of Virtual Character
CN110309438A (en) Recommended method, device, computer storage medium and the electronic equipment of planning driving path
CN110532464B (en) Tourism recommendation method based on multi-tourism context modeling
CN115238169A (en) Mu course interpretable recommendation method, terminal device and storage medium
Cao An optimal round-trip route planning method for tourism based on improved genetic algorithm
Ding et al. Two-stage travel itinerary recommendation optimization model considering stochastic traffic time
CN110489837B (en) City landscape satisfaction calculation method, computer equipment and storage medium
Carlsson Environmental design, systems thinking, and human agency: McHarg’s ecological method and Steinitz and Rogers’s interdisciplinary education experiment
Klein-Hewett Design as an indicator of tourist destination change: The concept renewal cycle at Watkins Glen state park
EP4047447A1 (en) Route recommendation method and apparatus, electronic device, and storage medium
JP7199067B2 (en) Reasoning device, reasoning method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination