CN111415198A - Visitor behavior preference modeling method based on reverse reinforcement learning - Google Patents

Visitor behavior preference modeling method based on reverse reinforcement learning

Info

Publication number
CN111415198A
CN111415198A (application CN202010195068.5A)
Authority
CN
China
Prior art keywords
data
tourist
preference
behavior
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010195068.5A
Other languages
Chinese (zh)
Other versions
CN111415198B (en)
Inventor
常亮
宣闻
宾辰忠
陈源鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202010195068.5A priority Critical patent/CN111415198B/en
Publication of CN111415198A publication Critical patent/CN111415198A/en
Application granted granted Critical
Publication of CN111415198B publication Critical patent/CN111415198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0203Market surveys; Market polls
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/14Travel agencies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/80Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a tourist behavior preference modeling method based on reverse reinforcement learning. Exhibits are located with iBeacon, and the tourists' touring behavior data are uploaded and stored by combining the number of photographing broadcasts received by a smart phone with the iBeacon position identifiers. The five elements of a Markov decision process are obtained and a Markov decision process model is constructed; a return function is built by a function approximation method, and the normalized photograph count and dwell time are added to it. The touring data are converted into expert example data, a policy is calculated with the Boltzmann distribution, a log-likelihood estimation function is obtained, and the weight vector is updated from its derivative. Preference learning ends when the set condition is met, so accurate tourist preferences can be learned from limited touring data.

Description

Visitor behavior preference modeling method based on reverse reinforcement learning
Technical Field
The invention relates to the technical field of position perception and machine learning, in particular to a visitor behavior preference modeling method based on reverse reinforcement learning.
Background
Travel recommendation is one of the research hotspots of the current smart tourism field; it provides personalized services to users and improves recommendation performance and tourist satisfaction. In travel recommendation it is important to understand tourists' behavior patterns and learn their preferences. Current travel recommendation technology mainly uses data such as tourists' ratings of attractions or exhibits, check-in data and visiting frequency as the basis for evaluating preference. However, inside a specific scenic spot such as a museum or a theme park, explicit rating data of tourists for attractions or exhibits are generally unavailable, so fine-grained preference learning cannot be performed and touring recommendations inside the scenic spot cannot be produced. Many recommendation algorithms also need a large amount of tourists' historical data for training before preferences can be learned and recommendations made, but data on tourists inside an exhibition hall are scarce and incomplete, so accurate preferences cannot be learned from such limited data.
Disclosure of Invention
The invention aims to provide a visitor behavior preference modeling method based on reverse reinforcement learning, which can learn accurate visitor preferences according to limited visitor visiting data.
In order to achieve the above object, the present invention provides a visitor behavior preference modeling method based on reverse reinforcement learning, which includes:
based on the combination of iBeacon and a smart phone, acquiring and storing touring behavior data of the tourists;
performing Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring and adding the photographing times and the residence time into the return function, and converting the tour data into expert example data;
and (4) learning the preference of the tourist tour track by utilizing a maximum likelihood reverse reinforcement learning algorithm.
Wherein acquiring and storing the touring behavior data of the tourists based on the combination of iBeacon and the smart phone includes:
the method comprises the steps of obtaining and grouping iBeacon equipment in an indoor exhibition hall, positioning exhibits by combining a Minor and a Major in iBeacon protocol data, receiving iBeacon equipment broadcast signals by an application program in a smart phone, reading sensor data and monitoring shooting broadcasts, and uploading collected data to a system server through a wireless network.
Wherein acquiring and storing the touring behavior data of the tourists based on the combination of iBeacon and the smart phone further includes:
according to the number of times of receiving photographing broadcasts and the position identification of the iBeacon, the system server counts the photographing times of the tourists on the target exhibit and stores the collected touring behavior data of the tourists through files.
Modeling a Markov decision process according to the tour behavior data and constructing a return function, wherein the modeling comprises the following steps:
the method comprises the steps of obtaining S, A, P, r and gamma five elements in a Markov decision process, constructing a Markov decision process model, and obtaining an interaction sequence of a tourist by combining a set strategy, wherein S represents a state space of a record of the tourist currently browsing an exhibit, A represents an action space of the exhibit to be browsed by the tourist next in a corresponding state, P represents a state transition probability, r represents a return function, and gamma represents a discount factor.
Wherein, modeling the Markov decision process according to the tour behavior data and constructing a return function, further comprises:
and acquiring the feature basis functions, the number and weight vectors of the feature bases and the feature vectors of each state, and constructing a return function by using a function approximation method.
Wherein, acquiring and adding the shooting times and the stay time into the return function, and converting the tour data into expert example data, comprises:
the method comprises the steps of obtaining the photographing times and the residence time when any exhibit is browsed, respectively carrying out normalization processing, adding the normalized photographing times and the residence time to instantaneous return data in a corresponding state to obtain a return function value in the corresponding state, and simultaneously converting the obtained tour behavior data into expert example data in a sequence format.
Wherein, the learning of the preference of the tourist tour track by using the maximum likelihood reverse reinforcement learning algorithm comprises the following steps:
and calculating a strategy by adopting a Boltzmann distribution based on the accumulated return expectation of the action made in any state obtained by the expert sample data, thereby obtaining a log-likelihood estimation function based on the existing expert sample data.
Wherein, the learning of the preference of the tourist tour track by using the maximum likelihood reverse reinforcement learning algorithm further comprises:
and after the logarithm likelihood estimation function is subjected to derivation to obtain a gradient, updating the weight vector according to the current weight vector plus 0.01 time of the gradient until the absolute value of the difference of the next weight vector minus the current weight vector is less than or equal to 0.01, ending learning, and outputting a weight vector value, and if the absolute value is greater than 0.01, acquiring the accumulated return expectation again until the absolute value is less than or equal to 0.01.
In the tourist behavior preference modeling method based on reverse reinforcement learning of the invention, exhibits are located with iBeacon; the number of photographing broadcasts received by the smart phone is combined with the iBeacon position identifiers, uploaded to the system server, and the touring behavior data are stored. The five elements S, A, P, r and γ of the Markov decision process are obtained and a Markov decision process model is built; a return function is defined, the normalized photograph count and dwell time are added to it, and the return function is approximated by a function approximation method. The touring data are converted into expert example data in "state-action-behavior feature" sequence format; the accumulated return expectation of the action taken in any state is obtained from the expert example data, the policy is calculated with the Boltzmann distribution, and the log-likelihood estimation function based on the existing expert example data is obtained. The log-likelihood estimation function is then differentiated and the weight vector is updated; preference learning ends when the set condition is met, so accurate tourist preferences can be learned from limited touring data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic step diagram of the visitor behavior preference modeling method based on reverse reinforcement learning provided by the invention.
Fig. 2 is an overall flow chart of learning fine-grained tourist preferences provided by the invention.
Fig. 3 is a flow chart of data acquisition and processing provided by the invention.
Fig. 4 is a flow chart of constructing the Markov decision process model provided by the invention.
Fig. 5 is a schematic diagram of the Markov decision interaction process provided by the invention.
Fig. 6 is an overall flow chart of the reverse reinforcement learning method provided by the invention.
Fig. 7 is a flow chart of the maximum likelihood reverse reinforcement learning algorithm provided by the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Referring to fig. 1 and fig. 2, the present invention provides a method for modeling guest behavior preference based on reverse reinforcement learning, including:
s101, based on the combination of the iBeacon and the smart phone, tourism behavior data of the tourists are obtained and stored.
Specifically, the scene is first arranged in an indoor exhibition hall. As shown in the data acquisition and processing flow chart of Fig. 3, a tour-guide APP is installed on the tourist's smart phone, and an iBeacon (a highly precise micro-positioning technology based on Bluetooth Low Energy) is arranged at the entrance of the exhibition hall and at each exhibit inside it to obtain the tourist's position information. The iBeacon protocol data contain two identifiers, Minor and Major. In the application scene the iBeacon devices are grouped: Major identifies which group an iBeacon device belongs to, and Minor distinguishes different iBeacon devices within the same group; that is, Minor is set to the ID of an exhibit inside the exhibition hall and Major to the partition the exhibit belongs to, so the exhibit the tourist is currently viewing can be located by combining the Minor and Major identifiers. The tour-guide APP on the tourist's smart phone receives the signals broadcast by the iBeacon devices and, through the phone camera and the acceleration sensor, collects several kinds of touring behavior data (for example, photographing and dwell time); the application in the smart phone receives the iBeacon broadcast signals, the smart phone reads the sensor data and listens for photographing broadcasts, and finally the collected data are uploaded to the system server through a wireless network. When a tourist takes a photo, the application in the smart phone immediately detects the photographing action and sends a broadcast to the system server. According to the number of photographing broadcasts received and the iBeacon position identifiers, the system server counts the number of photographs the tourist takes of the target exhibit, the browsing time, and so on, and stores the collected tourist behavior data in files. The data stored in a file include the timestamp sequence of the interaction between the tourist and the iBeacon, the user's three-axis (X, Y, Z) acceleration data, and the identifiers of the browsed exhibits. Collecting data by combining iBeacon with the smart phone is convenient and fast. The data set used is the real behavior data generated while tourists visit attractions inside the scenic spot and also contains their browsing behavior, so the data are richer and more realistic.
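As an illustration only, the following Python sketch shows one way the server-side counting described above could be organized; the record fields (major, minor, event, seconds) are hypothetical placeholders, not the actual upload format used by the system server in the invention.

```python
from collections import defaultdict

# Hypothetical uploaded records: each carries the iBeacon identifiers and an
# event type, either a "photo" broadcast or a dwell interval in seconds.
records = [
    {"major": 1, "minor": 3, "event": "photo"},
    {"major": 1, "minor": 3, "event": "dwell", "seconds": 42.0},
    {"major": 1, "minor": 7, "event": "photo"},
]

photo_counts = defaultdict(int)      # (zone, exhibit) -> number of photographs
dwell_seconds = defaultdict(float)   # (zone, exhibit) -> accumulated dwell time

for rec in records:
    exhibit = (rec["major"], rec["minor"])   # Major = partition, Minor = exhibit ID
    if rec["event"] == "photo":
        photo_counts[exhibit] += 1
    elif rec["event"] == "dwell":
        dwell_seconds[exhibit] += rec["seconds"]

print(dict(photo_counts), dict(dwell_seconds))
```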
And S102, carrying out Markov decision process modeling according to the tour behavior data and constructing a return function.
Specifically, the five elements S, A, P, r and γ of the Markov decision process are obtained, the Markov decision process model is constructed, and the interaction sequence of the tourist is obtained in combination with a set strategy, as shown in the flow chart of constructing the Markov decision process model in Fig. 4. The five elements of the Markov decision process are defined as follows. A state s represents the tourist's current exhibit-browsing record, and the state space is S. An action a represents the exhibit the tourist browses next in state s, and the action space is A. The state transition probability P(s_{t+1} | s_t, a_t) represents the probability of moving from state s_t to state s_{t+1} by taking action a_t, where s_t ∈ S and a_t ∈ A. For example, after browsing record s_1 a tourist may want to browse exhibit a_2 or exhibit a_3; the state transition probabilities can then be defined as P(s_2 | s_1, a_2) = 0.5 and P(s_3 | s_1, a_3) = 0.5. The return function r(s_t, a_t) expresses the reward obtained by browsing exhibit a_t under the current browsing record s_t, where s_t ∈ S and a_t ∈ A. This return value is proportional to the tourist's preference: the higher the tourist's preference for exhibit a_t, the higher the return value. For convenience of calculation we define r(s_t, a_t) ≤ 1. γ ∈ [0, 1] is a discount factor used to calculate the cumulative return.
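For readers who prefer code to notation, a minimal sketch of the five-element Markov decision process model is given below, assuming Python; the class and field names are illustrative and not part of the invention.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int    # index of a browsing record s in the state space S
Action = int   # index of the exhibit a browsed next, from the action space A

@dataclass
class TourMDP:
    states: List[State]
    actions: List[Action]
    # transition[(s, a)] maps each successor state s' to P(s' | s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    reward: Callable[[State, Action], float]  # return function r(s, a), bounded by 1 here
    gamma: float = 0.9                        # discount factor gamma in [0, 1]
```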
The interaction between the tourist and the exhibits in the exhibition hall can be regarded as a Markov decision process. This interaction is described below, as illustrated by the Markov decision interaction process shown in Fig. 5:
the tourists browse the default s of the record from entering the exhibition hall0. When browsing exhibit a1In time, corresponding photographing times and residence time exist; the number of times of photographing and the length of stay are taken asAdding the characteristic value into the return function to calculate the return value r1And updates the visitor's browsing history s1(ii) a Then the tourist browses the next exhibit a2The return value r is calculated in the same manner2The visitor browses the record to change to s2The interaction continues, so that the interaction sequence when the tourists browse is shown as (1), wherein
s0,s1,s2,......,st-1,st∈S;
s0,a1,r1,s1,a2,r2,......,st-1,at,rt,st(1)
In this context, the Markov property means that the exhibit-browsing record s_{t+1} at the next moment depends only on the current browsing record s_t and the exhibit a_t being browsed; all other historical browsing records can be discarded, as shown in formula (2), where P(s_{t+1} | s_t, a_t) is the transition probability of the tourist browsing exhibits:

P(s_{t+1} | s_t, a_t, ......, s_1, a_1) = P(s_{t+1} | s_t, a_t)    (2)
Which action a_t is selected in each state is determined by the policy π. A policy is defined as π: S → A, a mapping from the state space of the tourist's exhibit-browsing records to the exhibit browsed next. As formula (3) shows, a policy π is a conditional probability distribution over the action set given a state s, i.e. the policy π specifies the probability of each action in every state s; in other words, the policy π can determine the exhibit a recommended to the tourist next according to the tourist's browsing record s:

π(a | s) = P(A_t = a | S_t = s)    (3)
For example, suppose a tourist's exhibit-browsing policy is π(a_2 | s_1) = 0.3 and π(a_3 | s_1) = 0.7. This means that, given browsing record s_1, the probability that the next browsed exhibit is a_2 is 0.3 and the probability that it is a_3 is 0.7, so the tourist is clearly more likely to browse exhibit a_3;
Based on the given policy π and the Markov decision process model, the tourist's interaction sequence τ over the exhibits can be determined:

τ = s_0, a_1, r_1, s_1, a_2, r_2, s_2, ......, s_{t-1}, a_t, r_t, s_t    (4)
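A small sketch of how an interaction sequence τ could be rolled out from a given policy π and the transition probabilities, again assuming Python and a TourMDP-style object as sketched above; it is illustrative, not the invention's implementation.

```python
import random

def sample_trajectory(mdp, policy, s0, horizon):
    """Roll out tau = s0, a1, r1, s1, a2, r2, ... under a policy pi.

    policy[s] is a dict {action: probability}, i.e. pi(a | s) as in formula (3).
    """
    tau, s = [s0], s0
    for _ in range(horizon):
        actions = list(policy[s].keys())
        a = random.choices(actions, weights=list(policy[s].values()))[0]  # a ~ pi(. | s)
        r = mdp.reward(s, a)
        successors = mdp.transition[(s, a)]
        s = random.choices(list(successors), weights=list(successors.values()))[0]
        tau += [a, r, s]
    return tau
```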
Because the tourist's preferences are unknown, i.e. the return function r(s_t, a_t) is unknown, the feature basis functions, the number of feature bases, the weight vector and the feature vector of each state are obtained, and a return function is constructed by parameter approximation using a function approximation method; the approximate form is given by formula (5):

r_θ(s, a) = θ^T φ(s, a) = Σ_{i=1}^{d} θ_i φ_i(s, a)    (5)

In the formula above, φ = (φ_1, φ_2, ......, φ_d)^T, φ: S × A → R^d is a finite, fixed and bounded set of feature basis functions, where d is the number of feature bases and φ_i is the feature vector of each state. θ = (θ_1, θ_2, ......, θ_d) is the weight vector between the feature bases. With this linear representation, the return function value can be changed by adjusting the weights.
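A one-function sketch of the linear return approximation in formula (5), assuming Python and NumPy; the feature map phi shown here is a hypothetical placeholder for however the feature basis functions would actually be represented.

```python
import numpy as np

def linear_reward(theta, phi, s, a):
    """Formula (5): r_theta(s, a) = theta^T phi(s, a), a weighted sum of the d feature bases."""
    return float(np.dot(theta, phi(s, a)))

# Example with d = 3 purely illustrative feature bases:
theta = np.array([0.2, 0.5, 0.3])
phi = lambda s, a: np.array([1.0, s / 15.0, a / 15.0])  # placeholder feature map
print(linear_reward(theta, phi, 1, 2))
```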
S103, acquiring and adding the photographing times and the staying time into the return function, and converting the tour data into expert example data.
Specifically, the number of photographs and the dwell time when any exhibit is browsed are obtained and normalized separately, then added to the instantaneous return data of the corresponding state to obtain the return function value for that state; at the same time, the collected touring behavior data are converted into expert example data in sequence format, where the sequence format of the expert example data is "state-action-behavior feature". Because the tourist's preference is unknown, the reward obtained by browsing the next exhibit under the current browsing state s can be considered unknown; that is, the reward R(s, a) obtained when the tourist selects action a in state s is usually unknown. It is therefore necessary to learn the underlying return function from expert examples (existing trajectory data of relevant tourists browsing the exhibits). During learning, the two tourist behavior features, photograph count and dwell time, are added to the return function for training; finally the return function R_θ(s, a) is learned by the reverse reinforcement learning algorithm. The overall flow of reverse reinforcement learning shown in Fig. 6 comprises the following detailed steps:
In the scenario of our application there are 15 exhibits in total. For the current state s we count two tourist behavior features for a given exhibit: the photograph count img_s and the dwell time stay_s (in seconds). We therefore define the return function as the sum of the instantaneous return generated by browsing the exhibit and the return generated by the photograph count and dwell time while the tourist browses the exhibit in that state. For ease of calculation, the photograph-count and dwell-time returns are normalized by formula (6), where x* is the value of the photograph count or dwell time in the current state, and min and max are the minimum and maximum of the photograph count or dwell time over all states:

x = (x* − min) / (max − min)    (6)
The return function in the current state can then be represented by formula (7), where img_s and stay_s are the normalized photograph count and dwell time:

R_θ(s, a) = r_θ(s, a) + img_s + stay_s    (7)
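The normalization of formula (6) and the combined return of formula (7) can be sketched as follows, assuming Python and NumPy; the function names are illustrative only.

```python
import numpy as np

def min_max(x, values):
    """Formula (6): scale a photograph count or dwell time into [0, 1] over all states."""
    lo, hi = min(values), max(values)
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def combined_reward(theta, phi, s, a, img, stay, all_img, all_stay):
    """Formula (7): instantaneous return plus the normalized photograph count and dwell time."""
    instant = float(np.dot(theta, phi(s, a)))
    return instant + min_max(img, all_img) + min_max(stay, all_stay)
```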
the existing tourist browsing tracks are then processed into a sequence of "state-action-behavior features" as expert example data. Suppose that N guest trajectory data D ═ { ζ ═ is set1,......,ζNEach timeThe length of the bar track data is H, then a set of track data sequences can be represented as:
ζ1=((s1,a1,img1,stay1),(s2,a2,img2,stay2),......(sH,aH,imgH,stayH))
where s_H ∈ S and a_H ∈ A. In the present invention we set each trajectory length H to 15. For example, if the browsing track of tourist u is

ζ_u = ((s_1, a_2, img_1, stay_1), (s_2, a_4, img_2, stay_2), (s_3, a_3, img_3, stay_3), ......, (s_15, a_15, img_15, stay_15))

then it means that tourist u browsed exhibit a_2 in state s_1, taking img_1 photographs with dwell time stay_1, and then browsed exhibit a_4, taking img_2 photographs with dwell time stay_2.
And S104, learning the preference of the tourist tour track by utilizing a maximum likelihood reverse reinforcement learning algorithm.
Specifically, a policy is calculated with the Boltzmann distribution from the accumulated return expectation of the action taken in any state, which is obtained from the expert example data; from this, the log-likelihood estimation function based on the existing expert example data is obtained. Because maximum likelihood reverse reinforcement learning integrates the characteristics of other reverse reinforcement learning models, the return function can be estimated even with few expert trajectories: the maximum likelihood model is found from the expert trajectories, the initial return function is continuously adjusted, and the policy π is continuously optimized along the gradient. The overall algorithm flow follows the maximum likelihood reverse reinforcement learning flow chart shown in Fig. 7. The specific steps are as follows:
First, the expert example data are used to obtain the cumulative return expectation Q of the tourist taking action a in state s, which can be expressed by formula (8):

Q(s, a) = E[ Σ_t γ^t · r(s_t, a_t) | s_0 = s, a_0 = a ]    (8)
In the MDP an action is defined as the next exhibit to be browsed, so the action space is not large; we therefore take the Boltzmann distribution as the policy π, which can be expressed by formula (9):

π_θ(a | s) = e^{β·Q(s, a)} / Σ_{a'} e^{β·Q(s, a')}    (9)
Under this policy, the log-likelihood estimation function based on the existing demonstration data of the tourists' exhibit-browsing tracks can be represented by formula (10):

L(D | θ) = Σ_{ζ∈D} Σ_{(s,a)∈ζ} log π_θ(a | s)    (10)
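Formulas (9) and (10) can be sketched as below, assuming Python/NumPy, a table Q[(s, a)] of cumulative return expectations already computed from the expert example data, and expert demonstrations stored in the dictionary format of the earlier sketch; the small epsilon added inside the logarithm is a numerical safeguard, not part of the invention.

```python
import numpy as np

def boltzmann_policy(Q, s, actions, beta=1.0):
    """Formula (9): pi_theta(a | s) = exp(beta * Q(s, a)) / sum_a' exp(beta * Q(s, a'))."""
    logits = np.array([beta * Q[(s, a)] for a in actions])
    logits -= logits.max()                      # subtract the max for numerical stability
    probs = np.exp(logits)
    return dict(zip(actions, probs / probs.sum()))

def log_likelihood(Q, demos, actions, beta=1.0):
    """Formula (10): sum of log pi_theta(a | s) over all expert state-action pairs in D."""
    total = 0.0
    for zeta in demos:
        for step in zeta:                       # step holds the expert's (s, a) pair
            pi = boltzmann_policy(Q, step["s"], actions, beta)
            total += np.log(pi[step["a"]] + 1e-12)
    return total
```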
the logarithm likelihood estimation function is subjected to derivation to obtain a gradient
Figure BDA0002417308300000083
The weight vector is then updated according to the current weight vector plus 0.01 times the gradient, i.e. the current weight vector is updated with the gradient
Figure BDA0002417308300000084
Until the absolute value of the difference between the next weight vector and the current weight vector is less than or equal to 0.01, i.e., | θt-1tIf | ≦ 0.01, learning is finished, and the weight vector value θ is output as argmaxθL (D | theta), if the absolute value is greater than 0.01, i.e. | thetat-1tIf > 0.01, the cumulative reward expectation is retrieved until the absolute value is less than or equal to 0.01. Based on the collection of the real tourism behavior data of the tourists, the tourism behavior of the tourists is combined with reverse reinforcement learning, a reverse reinforcement learning algorithm is designed according to the collected behavior data, and fine-grained deviation is carried out on the basis of the obtained real dataThe learning is good.
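A minimal sketch of the weight update loop described above, assuming Python/NumPy. Instead of the analytic derivative used in the invention, the gradient of L(D | θ) is approximated here by finite differences, so ll_of_theta stands for any caller-supplied function that rebuilds the return function, the Q values and the log-likelihood for a given θ.

```python
import numpy as np

def mlirl_fit(theta0, ll_of_theta, lr=0.01, tol=0.01, max_iters=200, eps=1e-4):
    """Update theta <- theta + 0.01 * grad L(D | theta) until |theta_next - theta| <= 0.01."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        grad = np.zeros_like(theta)
        for i in range(len(theta)):             # central finite-difference gradient estimate
            step = np.zeros_like(theta)
            step[i] = eps
            grad[i] = (ll_of_theta(theta + step) - ll_of_theta(theta - step)) / (2 * eps)
        theta_next = theta + lr * grad
        if np.max(np.abs(theta_next - theta)) <= tol:
            return theta_next                   # set condition met: learning ends
        theta = theta_next
    return theta
```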
The whole process follows the overall flow chart for learning fine-grained tourist preferences shown in Fig. 2: the tourists' touring behavior data are collected based on the combination of iBeacon and the smart phone and stored in a text file; the five elements of the Markov decision process are obtained and defined and the Markov decision process model is constructed; the return function is constructed, and the two features of normalized photograph count and dwell time are added to it; the tourists' browsing track data are taken as expert example data; and finally the tourists' preferences are learned with the maximum likelihood reverse reinforcement learning algorithm, so that accurate tourist preferences can be learned from limited touring data.
In the tourist behavior preference modeling method based on reverse reinforcement learning of the invention, exhibits are located with iBeacon, the number of photographing broadcasts received by the smart phone is combined with the iBeacon position identifiers and uploaded to the system server, and the touring behavior data are stored. The five elements S, A, P, r and γ of the Markov decision process are obtained, a Markov decision process model is constructed, a return function is built by a function approximation method, and the normalized photograph count and dwell time are added to it. The touring data are converted into expert example data in "state-action-behavior feature" sequence format; the accumulated return expectation of the action taken in any state is obtained from the expert example data, the policy is calculated with the Boltzmann distribution, and the log-likelihood estimation function based on the existing expert example data is obtained. The log-likelihood estimation function is then differentiated and the weight vector is updated; preference learning ends when the set condition is met, so accurate tourist preferences can be learned from limited touring data.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A visitor behavior preference modeling method based on reverse reinforcement learning is characterized by comprising the following steps:
based on the combination of iBeacon and a smart phone, acquiring and storing touring behavior data of the tourists;
performing Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring and adding the photographing times and the residence time into the return function, and converting the tour data into expert example data;
and (4) learning the preference of the tourist tour track by utilizing a maximum likelihood reverse reinforcement learning algorithm.
2. The method for modeling visitor behavior preference based on reverse reinforcement learning as claimed in claim 1, wherein the obtaining and storing visitor behavior data of the visitor based on the combination of iBeacon and smart phone comprises:
the method comprises the steps of obtaining and grouping iBeacon equipment in an indoor exhibition hall, positioning exhibits by combining a Minor and a Major in iBeacon protocol data, receiving iBeacon equipment broadcast signals by an application program in a smart phone, reading sensor data and monitoring shooting broadcasts, and uploading collected data to a system server through a wireless network.
3. The method for modeling tourist behavior preference based on reverse reinforcement learning as claimed in claim 2, wherein acquiring and storing the touring behavior data of the tourists based on the combination of iBeacon and the smart phone further comprises:
according to the number of times of receiving photographing broadcasts and the position identification of the iBeacon, the system server counts the photographing times of the tourists on the target exhibit and stores the collected touring behavior data of the tourists through files.
4. The method as claimed in claim 3, wherein the modeling of Markov decision process and the construction of the reward function according to the tour behavior data comprises:
the method comprises the steps of obtaining S, A, P, r and gamma five elements in a Markov decision process, constructing a Markov decision process model, and obtaining an interaction sequence of a tourist by combining a set strategy, wherein S represents a state space of a record of the tourist currently browsing an exhibit, A represents an action space of the exhibit to be browsed by the tourist next in a corresponding state, P represents a state transition probability, r represents a return function, and gamma represents a discount factor.
5. The method as claimed in claim 4, wherein the method for modeling behavior preference of tourists based on reverse reinforcement learning comprises modeling a Markov decision process according to the tourism behavior data and constructing a return function, and further comprises:
and acquiring the feature basis functions, the number and weight vectors of the feature bases and the feature vectors of each state, and constructing a return function by using a function approximation method.
6. The method as claimed in claim 5, wherein the steps of obtaining and adding the number of photos taken and the stay time to the reward function, and converting the tour data into expert example data comprise:
the method comprises the steps of obtaining the photographing times and the residence time when any exhibit is browsed, respectively carrying out normalization processing, adding the normalized photographing times and the residence time to instantaneous return data in a corresponding state to obtain a return function value in the corresponding state, and simultaneously converting the obtained tour behavior data into expert example data in a sequence format.
7. The reverse reinforcement learning-based visitor behavior preference modeling method as claimed in claim 6, wherein the learning of the visitor travel track preference by using the maximum likelihood reverse reinforcement learning algorithm comprises:
and calculating a strategy by adopting a Boltzmann distribution based on the accumulated return expectation of the action made in any state obtained by the expert sample data, thereby obtaining a log-likelihood estimation function based on the existing expert sample data.
8. The reverse reinforcement learning-based visitor behavior preference modeling method as claimed in claim 7, wherein the learning of the visitor's tour trajectory preference using the maximum likelihood reverse reinforcement learning algorithm further comprises:
and after the logarithm likelihood estimation function is subjected to derivation to obtain a gradient, updating the weight vector according to the current weight vector plus 0.01 time of the gradient until the absolute value of the difference of the next weight vector minus the current weight vector is less than or equal to 0.01, ending learning, and outputting a weight vector value, and if the absolute value is greater than 0.01, acquiring the accumulated return expectation again until the absolute value is less than or equal to 0.01.
CN202010195068.5A 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning Active CN111415198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010195068.5A CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010195068.5A CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Publications (2)

Publication Number Publication Date
CN111415198A true CN111415198A (en) 2020-07-14
CN111415198B CN111415198B (en) 2023-04-28

Family

ID=71494548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010195068.5A Active CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Country Status (1)

Country Link
CN (1) CN111415198B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160210602A1 (en) * 2008-03-21 2016-07-21 Dressbot, Inc. System and method for collaborative shopping, business and entertainment
WO2010048146A1 (en) * 2008-10-20 2010-04-29 Carnegie Mellon University System, method and device for predicting navigational decision-making behavior
CN107358471A (en) * 2017-07-17 2017-11-17 桂林电子科技大学 A kind of tourist resources based on visit behavior recommends method and system
WO2019145952A1 (en) * 2018-01-25 2019-08-01 Splitty Travel Ltd. Systems, methods and computer program products for optimization of travel technology target functions, including when communicating with travel technology suppliers under technological constraints
CN108875005A (en) * 2018-06-15 2018-11-23 桂林电子科技大学 A kind of tourist's preferential learning system and method based on visit behavior
CN108819948A (en) * 2018-06-25 2018-11-16 大连大学 Driving behavior modeling method based on reverse intensified learning
CN110288436A (en) * 2019-06-19 2019-09-27 桂林电子科技大学 A kind of personalized recommending scenery spot method based on the modeling of tourist's preference

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
刘建伟; 高峰; 罗雄麟: "A survey of deep reinforcement learning based on value functions and policy gradients" *
孙磊 et al.: "A tourist preference learning method based on touring behavior" *
宣闻: "Research on fine-grained tourist behavior preferences based on inverse reinforcement learning" *
范长杰: "Research on planning problems based on Markov decision theory" *
陈希亮; 曹雷; 何明; 李晨溪; 徐志雄: "A survey of deep inverse reinforcement learning" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158086A (en) * 2021-04-06 2021-07-23 浙江贝迩熊科技有限公司 Personalized customer recommendation system and method based on deep reinforcement learning
CN114355786A (en) * 2022-01-17 2022-04-15 北京三月雨文化传播有限责任公司 Big data-based regulation cloud system of multimedia digital exhibition hall
CN117033800A (en) * 2023-10-08 2023-11-10 法琛堂(昆明)医疗科技有限公司 Intelligent interaction method and system for visual cloud exhibition system

Also Published As

Publication number Publication date
CN111415198B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111415198A (en) Visitor behavior preference modeling method based on reverse reinforcement learning
CN110297848B (en) Recommendation model training method, terminal and storage medium based on federal learning
JP6431231B1 (en) Imaging system, learning apparatus, and imaging apparatus
CN107680010B (en) Scenic spot route recommendation method and system based on touring behavior
JP5570079B2 (en) Data processing apparatus and data processing method
JP4902270B2 (en) How to assemble a collection of digital images
JP6229655B2 (en) Information processing apparatus, information processing method, and program
Baltrunas et al. Context relevance assessment and exploitation in mobile recommender systems
US8577962B2 (en) Server apparatus, client apparatus, content recommendation method, and program
CN101010560B (en) Information processing device and method, program, and information processing system
JP5847070B2 (en) Server apparatus and photographing apparatus
CN103914559A (en) Network user screening method and network user screening device
CN110309434B (en) Track data processing method and device and related equipment
CN107018333A (en) Shoot template and recommend method, device and capture apparatus
CN110798718B (en) Video recommendation method and device
CN111306700A (en) Air conditioner control method and device and computer storage medium
WO2008129374A2 (en) Motion and image quality monitor
US20090189992A1 (en) Apparatus and method for learning photographing profiles of digital imaging device for recording personal life history
CN107121661B (en) Positioning method, device and system and server
CN114930319A (en) Music recommendation method and device
CN116643494A (en) Scene recommendation method, device and system and electronic equipment
CN116503209A (en) Digital twin system based on artificial intelligence and data driving
KR100880001B1 (en) Mobile device for managing personal life and method for searching information using the mobile device
CN113158086B (en) Personalized customer recommendation system and method based on deep reinforcement learning
KR102045475B1 (en) Tour album providing system for providing a tour album by predicting a user's preference according to a tour location and operating method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant