CN111415198A - Visitor behavior preference modeling method based on reverse reinforcement learning - Google Patents

Visitor behavior preference modeling method based on reverse reinforcement learning

Info

Publication number
CN111415198A
CN111415198A (application CN202010195068.5A)
Authority
CN
China
Prior art keywords
data
tourist
preference
behavior
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010195068.5A
Other languages
Chinese (zh)
Other versions
CN111415198B (en)
Inventor
常亮
宣闻
宾辰忠
陈源鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202010195068.5A priority Critical patent/CN111415198B/en
Publication of CN111415198A publication Critical patent/CN111415198A/en
Application granted granted Critical
Publication of CN111415198B publication Critical patent/CN111415198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0203Market surveys; Market polls
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/14Travel agencies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/80Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a tourist behavior preference modeling method based on reverse reinforcement learning. Exhibits are located with iBeacon, and the tourists' touring behavior data are uploaded and stored by combining the number of photographing broadcasts received by a smart phone with the iBeacon position identifiers. The five elements of a Markov decision process are obtained and a Markov decision process model is constructed; a return function is built by a function approximation method, and the normalized photograph count and dwell time are added to it. The touring data are converted into expert example data, a policy is calculated with the Boltzmann distribution, a log-likelihood estimation function is obtained, and the weight vector is updated from its derivative. Preference learning ends when the set condition is met, so accurate tourist preferences can be learned from limited touring data.

Description

Visitor behavior preference modeling method based on reverse reinforcement learning
Technical Field
The invention relates to the technical field of position perception and machine learning, in particular to a visitor behavior preference modeling method based on reverse reinforcement learning.
Background
Travel recommendation is one of the research hotspots of the current smart tourism field; it provides personalized services to users and improves recommendation performance and tourist satisfaction. In travel recommendation it is important to understand tourists' behavior patterns and learn their preferences. Current travel recommendation technology mainly uses data such as tourists' ratings of attractions or exhibits, check-in data and visiting frequency as the basis for evaluating preference. However, inside a specific scenic spot such as a museum or a theme park, explicit rating data of tourists for attractions or exhibits are generally unavailable, so fine-grained preference learning cannot be performed and touring recommendations inside the scenic spot cannot be produced. Many recommendation algorithms also need a large amount of tourists' historical data for training before preferences can be learned and recommendations made, but data on tourists inside an exhibition hall are scarce and incomplete, so accurate preferences cannot be learned from such limited data.
Disclosure of Invention
The invention aims to provide a visitor behavior preference modeling method based on reverse reinforcement learning, which can learn accurate visitor preferences according to limited visitor visiting data.
In order to achieve the above object, the present invention provides a visitor behavior preference modeling method based on reverse reinforcement learning, which includes:
based on the combination of iBeacon and a smart phone, acquiring and storing touring behavior data of the tourists;
performing Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring and adding the photographing times and the residence time into the return function, and converting the tour data into expert example data;
and (4) learning the preference of the tourist tour track by utilizing a maximum likelihood reverse reinforcement learning algorithm.
Wherein acquiring and storing the touring behavior data of the tourists based on the combination of iBeacon and the smart phone includes:
the method comprises the steps of obtaining and grouping iBeacon equipment in an indoor exhibition hall, positioning exhibits by combining a Minor and a Major in iBeacon protocol data, receiving iBeacon equipment broadcast signals by an application program in a smart phone, reading sensor data and monitoring shooting broadcasts, and uploading collected data to a system server through a wireless network.
Wherein acquiring and storing the touring behavior data of the tourists based on the combination of iBeacon and the smart phone further includes:
according to the number of times of receiving photographing broadcasts and the position identification of the iBeacon, the system server counts the photographing times of the tourists on the target exhibit and stores the collected touring behavior data of the tourists through files.
Modeling a Markov decision process according to the tour behavior data and constructing a return function, wherein the modeling comprises the following steps:
the method comprises the steps of obtaining S, A, P, r and gamma five elements in a Markov decision process, constructing a Markov decision process model, and obtaining an interaction sequence of a tourist by combining a set strategy, wherein S represents a state space of a record of the tourist currently browsing an exhibit, A represents an action space of the exhibit to be browsed by the tourist next in a corresponding state, P represents a state transition probability, r represents a return function, and gamma represents a discount factor.
Wherein, modeling the Markov decision process according to the tour behavior data and constructing a return function, further comprises:
and acquiring the feature basis functions, the number and weight vectors of the feature bases and the feature vectors of each state, and constructing a return function by using a function approximation method.
Wherein, acquiring and adding the shooting times and the stay time into the return function, and converting the tour data into expert example data, comprises:
the method comprises the steps of obtaining the photographing times and the residence time when any exhibit is browsed, respectively carrying out normalization processing, adding the normalized photographing times and the residence time to instantaneous return data in a corresponding state to obtain a return function value in the corresponding state, and simultaneously converting the obtained tour behavior data into expert example data in a sequence format.
Wherein, the learning of the preference of the tourist tour track by using the maximum likelihood reverse reinforcement learning algorithm comprises the following steps:
and calculating a strategy by adopting a Boltzmann distribution based on the accumulated return expectation of the action made in any state obtained by the expert sample data, thereby obtaining a log-likelihood estimation function based on the existing expert sample data.
Wherein, the learning of the preference of the tourist tour track by using the maximum likelihood reverse reinforcement learning algorithm further comprises:
and after the logarithm likelihood estimation function is subjected to derivation to obtain a gradient, updating the weight vector according to the current weight vector plus 0.01 time of the gradient until the absolute value of the difference of the next weight vector minus the current weight vector is less than or equal to 0.01, ending learning, and outputting a weight vector value, and if the absolute value is greater than 0.01, acquiring the accumulated return expectation again until the absolute value is less than or equal to 0.01.
In the tourist behavior preference modeling method based on reverse reinforcement learning of the invention, exhibits are located with iBeacon; the number of photographing broadcasts received by the smart phone is combined with the iBeacon position identifiers, uploaded to the system server, and the touring behavior data are stored. The five elements S, A, P, r and γ of the Markov decision process are obtained and a Markov decision process model is built; a return function is defined, the normalized photograph count and dwell time are added to it, and the return function is approximated by a function approximation method. The touring data are converted into expert example data in "state-action-behavior feature" sequence format; the accumulated return expectation of the action taken in any state is obtained from the expert example data, the policy is calculated with the Boltzmann distribution, and the log-likelihood estimation function based on the existing expert example data is obtained. The log-likelihood estimation function is then differentiated and the weight vector is updated; preference learning ends when the set condition is met, so accurate tourist preferences can be learned from limited touring data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic step diagram of the visitor behavior preference modeling method based on reverse reinforcement learning provided by the invention.
Fig. 2 is an overall flow chart of learning fine-grained tourist preferences provided by the invention.
Fig. 3 is a flow chart of data acquisition and processing provided by the invention.
Fig. 4 is a flow chart of constructing the Markov decision process model provided by the invention.
Fig. 5 is a schematic diagram of the Markov decision interaction process provided by the invention.
Fig. 6 is an overall flow chart of the reverse reinforcement learning method provided by the invention.
Fig. 7 is a flow chart of the maximum likelihood reverse reinforcement learning algorithm provided by the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Referring to fig. 1 and fig. 2, the present invention provides a method for modeling guest behavior preference based on reverse reinforcement learning, including:
s101, based on the combination of the iBeacon and the smart phone, tourism behavior data of the tourists are obtained and stored.
Specifically, the scene is first arranged in an indoor exhibition hall. As shown in the data acquisition and processing flow chart of Fig. 3, a tour-guide APP is installed on the tourist's smart phone, and an iBeacon (a highly precise micro-positioning technology based on Bluetooth Low Energy) is arranged at the entrance of the exhibition hall and at each exhibit inside it to obtain the tourist's position information. The iBeacon protocol data contain two identifiers, Minor and Major. In the application scene the iBeacon devices are grouped: Major identifies which group an iBeacon device belongs to, and Minor distinguishes different iBeacon devices within the same group; that is, Minor is set to the ID of an exhibit inside the exhibition hall and Major to the partition the exhibit belongs to, so the exhibit the tourist is currently viewing can be located by combining the Minor and Major identifiers. The tour-guide APP on the tourist's smart phone receives the signals broadcast by the iBeacon devices and, through the phone camera and the acceleration sensor, collects several kinds of touring behavior data (for example, photographing and dwell time); the application in the smart phone receives the iBeacon broadcast signals, the smart phone reads the sensor data and listens for photographing broadcasts, and finally the collected data are uploaded to the system server through a wireless network. When a tourist takes a photo, the application in the smart phone immediately detects the photographing action and sends a broadcast to the system server. According to the number of photographing broadcasts received and the iBeacon position identifiers, the system server counts the number of photographs the tourist takes of the target exhibit, the browsing time, and so on, and stores the collected tourist behavior data in files. The data stored in a file include the timestamp sequence of the interaction between the tourist and the iBeacon, the user's three-axis (X, Y, Z) acceleration data, and the identifiers of the browsed exhibits. Collecting data by combining iBeacon with the smart phone is convenient and fast. The data set used is the real behavior data generated while tourists visit attractions inside the scenic spot and also contains their browsing behavior, so the data are richer and more realistic.
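As an illustration only, the following Python sketch shows one way the server-side counting described above could be organized; the record fields (major, minor, event, seconds) are hypothetical placeholders, not the actual upload format used by the system server in the invention.

```python
from collections import defaultdict

# Hypothetical uploaded records: each carries the iBeacon identifiers and an
# event type, either a "photo" broadcast or a dwell interval in seconds.
records = [
    {"major": 1, "minor": 3, "event": "photo"},
    {"major": 1, "minor": 3, "event": "dwell", "seconds": 42.0},
    {"major": 1, "minor": 7, "event": "photo"},
]

photo_counts = defaultdict(int)      # (zone, exhibit) -> number of photographs
dwell_seconds = defaultdict(float)   # (zone, exhibit) -> accumulated dwell time

for rec in records:
    exhibit = (rec["major"], rec["minor"])   # Major = partition, Minor = exhibit ID
    if rec["event"] == "photo":
        photo_counts[exhibit] += 1
    elif rec["event"] == "dwell":
        dwell_seconds[exhibit] += rec["seconds"]

print(dict(photo_counts), dict(dwell_seconds))
```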
And S102, carrying out Markov decision process modeling according to the tour behavior data and constructing a return function.
Specifically, the five elements S, A, P, r and γ of the Markov decision process are obtained, the Markov decision process model is constructed, and the interaction sequence of the tourist is obtained in combination with a set strategy, as shown in the flow chart of constructing the Markov decision process model in Fig. 4. The five elements of the Markov decision process are defined as follows. A state s represents the tourist's current exhibit-browsing record, and the state space is S. An action a represents the exhibit the tourist browses next in state s, and the action space is A. The state transition probability P(s_{t+1} | s_t, a_t) represents the probability of moving from state s_t to state s_{t+1} by taking action a_t, where s_t ∈ S and a_t ∈ A. For example, after browsing record s_1 a tourist may want to browse exhibit a_2 or exhibit a_3; the state transition probabilities can then be defined as P(s_2 | s_1, a_2) = 0.5 and P(s_3 | s_1, a_3) = 0.5. The return function r(s_t, a_t) expresses the reward obtained by browsing exhibit a_t under the current browsing record s_t, where s_t ∈ S and a_t ∈ A. This return value is proportional to the tourist's preference: the higher the tourist's preference for exhibit a_t, the higher the return value. For convenience of calculation we define r(s_t, a_t) ≤ 1. γ ∈ [0, 1] is a discount factor used to calculate the cumulative return.
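For readers who prefer code to notation, a minimal sketch of the five-element Markov decision process model is given below, assuming Python; the class and field names are illustrative and not part of the invention.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int    # index of a browsing record s in the state space S
Action = int   # index of the exhibit a browsed next, from the action space A

@dataclass
class TourMDP:
    states: List[State]
    actions: List[Action]
    # transition[(s, a)] maps each successor state s' to P(s' | s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    reward: Callable[[State, Action], float]  # return function r(s, a), bounded by 1 here
    gamma: float = 0.9                        # discount factor gamma in [0, 1]
```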
The interaction between the tourist and the exhibits in the exhibition hall can be regarded as a Markov decision process. This interaction is described below, as illustrated by the Markov decision interaction process shown in Fig. 5:
the tourists browse the default s of the record from entering the exhibition hall0. When browsing exhibit a1In time, corresponding photographing times and residence time exist; the number of times of photographing and the length of stay are taken asAdding the characteristic value into the return function to calculate the return value r1And updates the visitor's browsing history s1(ii) a Then the tourist browses the next exhibit a2The return value r is calculated in the same manner2The visitor browses the record to change to s2The interaction continues, so that the interaction sequence when the tourists browse is shown as (1), wherein
s0,s1,s2,......,st-1,st∈S;
s0,a1,r1,s1,a2,r2,......,st-1,at,rt,st(1)
In this context, the Markov property means that the exhibit-browsing record s_{t+1} at the next moment depends only on the current browsing record s_t and the exhibit a_t being browsed; all other historical browsing records can be discarded, as shown in formula (2), where P(s_{t+1} | s_t, a_t) is the transition probability of the tourist browsing exhibits:

P(s_{t+1} | s_t, a_t, ......, s_1, a_1) = P(s_{t+1} | s_t, a_t)    (2)
Which action a_t is selected in each state is determined by the policy π. A policy is defined as π: S → A, a mapping from the state space of the tourist's exhibit-browsing records to the exhibit browsed next. As formula (3) shows, a policy π is a conditional probability distribution over the action set given a state s, i.e. the policy π specifies the probability of each action in every state s; in other words, the policy π can determine the exhibit a recommended to the tourist next according to the tourist's browsing record s:

π(a | s) = P(A_t = a | S_t = s)    (3)
For example, suppose a tourist's exhibit-browsing policy is π(a_2 | s_1) = 0.3 and π(a_3 | s_1) = 0.7. This means that, given browsing record s_1, the probability that the next browsed exhibit is a_2 is 0.3 and the probability that it is a_3 is 0.7, so the tourist is clearly more likely to browse exhibit a_3;
Based on the given policy π and the Markov decision process model, the tourist's interaction sequence τ over the exhibits can be determined:

τ = s_0, a_1, r_1, s_1, a_2, r_2, s_2, ......, s_{t-1}, a_t, r_t, s_t    (4)
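A small sketch of how an interaction sequence τ could be rolled out from a given policy π and the transition probabilities, again assuming Python and a TourMDP-style object as sketched above; it is illustrative, not the invention's implementation.

```python
import random

def sample_trajectory(mdp, policy, s0, horizon):
    """Roll out tau = s0, a1, r1, s1, a2, r2, ... under a policy pi.

    policy[s] is a dict {action: probability}, i.e. pi(a | s) as in formula (3).
    """
    tau, s = [s0], s0
    for _ in range(horizon):
        actions = list(policy[s].keys())
        a = random.choices(actions, weights=list(policy[s].values()))[0]  # a ~ pi(. | s)
        r = mdp.reward(s, a)
        successors = mdp.transition[(s, a)]
        s = random.choices(list(successors), weights=list(successors.values()))[0]
        tau += [a, r, s]
    return tau
```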
Because the tourist's preferences are unknown, i.e. the return function r(s_t, a_t) is unknown, the feature basis functions, the number of feature bases, the weight vector and the feature vector of each state are obtained, and a return function is constructed by parameter approximation using a function approximation method; the approximate form is given by formula (5):

r_θ(s, a) = θ^T φ(s, a) = Σ_{i=1}^{d} θ_i φ_i(s, a)    (5)

In the formula above, φ = (φ_1, φ_2, ......, φ_d)^T, φ: S × A → R^d is a finite, fixed and bounded set of feature basis functions, where d is the number of feature bases and φ_i is the feature vector of each state. θ = (θ_1, θ_2, ......, θ_d) is the weight vector between the feature bases. With this linear representation, the return function value can be changed by adjusting the weights.
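A one-function sketch of the linear return approximation in formula (5), assuming Python and NumPy; the feature map phi shown here is a hypothetical placeholder for however the feature basis functions would actually be represented.

```python
import numpy as np

def linear_reward(theta, phi, s, a):
    """Formula (5): r_theta(s, a) = theta^T phi(s, a), a weighted sum of the d feature bases."""
    return float(np.dot(theta, phi(s, a)))

# Example with d = 3 purely illustrative feature bases:
theta = np.array([0.2, 0.5, 0.3])
phi = lambda s, a: np.array([1.0, s / 15.0, a / 15.0])  # placeholder feature map
print(linear_reward(theta, phi, 1, 2))
```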
S103, acquiring and adding the photographing times and the staying time into the return function, and converting the tour data into expert example data.
Specifically, the number of photographs and the dwell time when any exhibit is browsed are obtained and normalized separately, then added to the instantaneous return data of the corresponding state to obtain the return function value for that state; at the same time, the collected touring behavior data are converted into expert example data in sequence format, where the sequence format of the expert example data is "state-action-behavior feature". Because the tourist's preference is unknown, the reward obtained by browsing the next exhibit under the current browsing state s can be considered unknown; that is, the reward R(s, a) obtained when the tourist selects action a in state s is usually unknown. It is therefore necessary to learn the underlying return function from expert examples (existing trajectory data of relevant tourists browsing the exhibits). During learning, the two tourist behavior features, photograph count and dwell time, are added to the return function for training; finally the return function R_θ(s, a) is learned by the reverse reinforcement learning algorithm. The overall flow of reverse reinforcement learning shown in Fig. 6 comprises the following detailed steps:
In the scenario of our application there are 15 exhibits in total. For the current state s we count two tourist behavior features for a given exhibit: the photograph count img_s and the dwell time stay_s (in seconds). We therefore define the return function as the sum of the instantaneous return generated by browsing the exhibit and the return generated by the photograph count and dwell time while the tourist browses the exhibit in that state. For ease of calculation, the photograph-count and dwell-time returns are normalized by formula (6), where x* is the value of the photograph count or dwell time in the current state, and min and max are the minimum and maximum of the photograph count or dwell time over all states:

x = (x* − min) / (max − min)    (6)
The return function in the current state can then be represented by formula (7), where img_s and stay_s are the normalized photograph count and dwell time:

R_θ(s, a) = r_θ(s, a) + img_s + stay_s    (7)
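The normalization of formula (6) and the combined return of formula (7) can be sketched as follows, assuming Python and NumPy; the function names are illustrative only.

```python
import numpy as np

def min_max(x, values):
    """Formula (6): scale a photograph count or dwell time into [0, 1] over all states."""
    lo, hi = min(values), max(values)
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def combined_reward(theta, phi, s, a, img, stay, all_img, all_stay):
    """Formula (7): instantaneous return plus the normalized photograph count and dwell time."""
    instant = float(np.dot(theta, phi(s, a)))
    return instant + min_max(img, all_img) + min_max(stay, all_stay)
```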
the existing tourist browsing tracks are then processed into a sequence of "state-action-behavior features" as expert example data. Suppose that N guest trajectory data D ═ { ζ ═ is set1,......,ζNEach timeThe length of the bar track data is H, then a set of track data sequences can be represented as:
ζ1=((s1,a1,img1,stay1),(s2,a2,img2,stay2),......(sH,aH,imgH,stayH))
where s_H ∈ S and a_H ∈ A. In the present invention we set each trajectory length H to 15. For example, if the browsing track of tourist u is

ζ_u = ((s_1, a_2, img_1, stay_1), (s_2, a_4, img_2, stay_2), (s_3, a_3, img_3, stay_3), ......, (s_15, a_15, img_15, stay_15))

then it means that tourist u browsed exhibit a_2 in state s_1, taking img_1 photographs with dwell time stay_1, and then browsed exhibit a_4, taking img_2 photographs with dwell time stay_2.
And S104, learning the preference of the tourist tour track by utilizing a maximum likelihood reverse reinforcement learning algorithm.
Specifically, a policy is calculated with the Boltzmann distribution from the accumulated return expectation of the action taken in any state, which is obtained from the expert example data; from this, the log-likelihood estimation function based on the existing expert example data is obtained. Because maximum likelihood reverse reinforcement learning integrates the characteristics of other reverse reinforcement learning models, the return function can be estimated even with few expert trajectories: the maximum likelihood model is found from the expert trajectories, the initial return function is continuously adjusted, and the policy π is continuously optimized along the gradient. The overall algorithm flow follows the maximum likelihood reverse reinforcement learning flow chart shown in Fig. 7. The specific steps are as follows:
First, the expert example data are used to obtain the cumulative return expectation Q of the tourist taking action a in state s, which can be expressed by formula (8):

Q(s, a) = E[ Σ_t γ^t · r(s_t, a_t) | s_0 = s, a_0 = a ]    (8)
In the MDP an action is defined as the next exhibit to be browsed, so the action space is not large; we therefore take the Boltzmann distribution as the policy π, which can be expressed by formula (9):

π_θ(a | s) = e^{β·Q(s, a)} / Σ_{a'} e^{β·Q(s, a')}    (9)
Under this policy, the log-likelihood estimation function based on the existing demonstration data of the tourists' exhibit-browsing tracks can be represented by formula (10):

L(D | θ) = Σ_{ζ∈D} Σ_{(s,a)∈ζ} log π_θ(a | s)    (10)
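Formulas (9) and (10) can be sketched as below, assuming Python/NumPy, a table Q[(s, a)] of cumulative return expectations already computed from the expert example data, and expert demonstrations stored in the dictionary format of the earlier sketch; the small epsilon added inside the logarithm is a numerical safeguard, not part of the invention.

```python
import numpy as np

def boltzmann_policy(Q, s, actions, beta=1.0):
    """Formula (9): pi_theta(a | s) = exp(beta * Q(s, a)) / sum_a' exp(beta * Q(s, a'))."""
    logits = np.array([beta * Q[(s, a)] for a in actions])
    logits -= logits.max()                      # subtract the max for numerical stability
    probs = np.exp(logits)
    return dict(zip(actions, probs / probs.sum()))

def log_likelihood(Q, demos, actions, beta=1.0):
    """Formula (10): sum of log pi_theta(a | s) over all expert state-action pairs in D."""
    total = 0.0
    for zeta in demos:
        for step in zeta:                       # step holds the expert's (s, a) pair
            pi = boltzmann_policy(Q, step["s"], actions, beta)
            total += np.log(pi[step["a"]] + 1e-12)
    return total
```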
the logarithm likelihood estimation function is subjected to derivation to obtain a gradient
Figure BDA0002417308300000083
The weight vector is then updated according to the current weight vector plus 0.01 times the gradient, i.e. the current weight vector is updated with the gradient
Figure BDA0002417308300000084
Until the absolute value of the difference between the next weight vector and the current weight vector is less than or equal to 0.01, i.e., | θt-1tIf | ≦ 0.01, learning is finished, and the weight vector value θ is output as argmaxθL (D | theta), if the absolute value is greater than 0.01, i.e. | thetat-1tIf > 0.01, the cumulative reward expectation is retrieved until the absolute value is less than or equal to 0.01. Based on the collection of the real tourism behavior data of the tourists, the tourism behavior of the tourists is combined with reverse reinforcement learning, a reverse reinforcement learning algorithm is designed according to the collected behavior data, and fine-grained deviation is carried out on the basis of the obtained real dataThe learning is good.
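A minimal sketch of the weight update loop described above, assuming Python/NumPy. Instead of the analytic derivative used in the invention, the gradient of L(D | θ) is approximated here by finite differences, so ll_of_theta stands for any caller-supplied function that rebuilds the return function, the Q values and the log-likelihood for a given θ.

```python
import numpy as np

def mlirl_fit(theta0, ll_of_theta, lr=0.01, tol=0.01, max_iters=200, eps=1e-4):
    """Update theta <- theta + 0.01 * grad L(D | theta) until |theta_next - theta| <= 0.01."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        grad = np.zeros_like(theta)
        for i in range(len(theta)):             # central finite-difference gradient estimate
            step = np.zeros_like(theta)
            step[i] = eps
            grad[i] = (ll_of_theta(theta + step) - ll_of_theta(theta - step)) / (2 * eps)
        theta_next = theta + lr * grad
        if np.max(np.abs(theta_next - theta)) <= tol:
            return theta_next                   # set condition met: learning ends
        theta = theta_next
    return theta
```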
The whole process follows the overall flow chart for learning fine-grained tourist preferences shown in Fig. 2: the tourists' touring behavior data are collected based on the combination of iBeacon and the smart phone and stored in a text file; the five elements of the Markov decision process are obtained and defined and the Markov decision process model is constructed; the return function is constructed, and the two features of normalized photograph count and dwell time are added to it; the tourists' browsing track data are taken as expert example data; and finally the tourists' preferences are learned with the maximum likelihood reverse reinforcement learning algorithm, so that accurate tourist preferences can be learned from limited touring data.
In the tourist behavior preference modeling method based on reverse reinforcement learning of the invention, exhibits are located with iBeacon, the number of photographing broadcasts received by the smart phone is combined with the iBeacon position identifiers and uploaded to the system server, and the touring behavior data are stored. The five elements S, A, P, r and γ of the Markov decision process are obtained, a Markov decision process model is constructed, a return function is built by a function approximation method, and the normalized photograph count and dwell time are added to it. The touring data are converted into expert example data in "state-action-behavior feature" sequence format; the accumulated return expectation of the action taken in any state is obtained from the expert example data, the policy is calculated with the Boltzmann distribution, and the log-likelihood estimation function based on the existing expert example data is obtained. The log-likelihood estimation function is then differentiated and the weight vector is updated; preference learning ends when the set condition is met, so accurate tourist preferences can be learned from limited touring data.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A visitor behavior preference modeling method based on reverse reinforcement learning is characterized by comprising the following steps:
based on the combination of iBeacon and a smart phone, acquiring and storing touring behavior data of the tourists;
performing Markov decision process modeling according to the tour behavior data and constructing a return function;
acquiring and adding the photographing times and the residence time into the return function, and converting the tour data into expert example data;
and (4) learning the preference of the tourist tour track by utilizing a maximum likelihood reverse reinforcement learning algorithm.
2. The method for modeling visitor behavior preference based on reverse reinforcement learning as claimed in claim 1, wherein the obtaining and storing visitor behavior data of the visitor based on the combination of iBeacon and smart phone comprises:
the method comprises the steps of obtaining and grouping iBeacon equipment in an indoor exhibition hall, positioning exhibits by combining a Minor and a Major in iBeacon protocol data, receiving iBeacon equipment broadcast signals by an application program in a smart phone, reading sensor data and monitoring shooting broadcasts, and uploading collected data to a system server through a wireless network.
3. The method for modeling tourist behavior preference based on reverse reinforcement learning as claimed in claim 2, wherein acquiring and storing the touring behavior data of the tourists based on the combination of iBeacon and the smart phone further comprises:
according to the number of times of receiving photographing broadcasts and the position identification of the iBeacon, the system server counts the photographing times of the tourists on the target exhibit and stores the collected touring behavior data of the tourists through files.
4. The method as claimed in claim 3, wherein the modeling of Markov decision process and the construction of the reward function according to the tour behavior data comprises:
the method comprises the steps of obtaining S, A, P, r and gamma five elements in a Markov decision process, constructing a Markov decision process model, and obtaining an interaction sequence of a tourist by combining a set strategy, wherein S represents a state space of a record of the tourist currently browsing an exhibit, A represents an action space of the exhibit to be browsed by the tourist next in a corresponding state, P represents a state transition probability, r represents a return function, and gamma represents a discount factor.
5. The method as claimed in claim 4, wherein the method for modeling behavior preference of tourists based on reverse reinforcement learning comprises modeling a Markov decision process according to the tourism behavior data and constructing a return function, and further comprises:
and acquiring the feature basis functions, the number and weight vectors of the feature bases and the feature vectors of each state, and constructing a return function by using a function approximation method.
6. The method as claimed in claim 5, wherein the steps of obtaining and adding the number of photos taken and the stay time to the reward function, and converting the tour data into expert example data comprise:
the method comprises the steps of obtaining the photographing times and the residence time when any exhibit is browsed, respectively carrying out normalization processing, adding the normalized photographing times and the residence time to instantaneous return data in a corresponding state to obtain a return function value in the corresponding state, and simultaneously converting the obtained tour behavior data into expert example data in a sequence format.
7. The reverse reinforcement learning-based visitor behavior preference modeling method as claimed in claim 6, wherein the learning of the visitor travel track preference by using the maximum likelihood reverse reinforcement learning algorithm comprises:
and calculating a strategy by adopting a Boltzmann distribution based on the accumulated return expectation of the action made in any state obtained by the expert sample data, thereby obtaining a log-likelihood estimation function based on the existing expert sample data.
8. The reverse reinforcement learning-based visitor behavior preference modeling method as claimed in claim 7, wherein the learning of the visitor's tour trajectory preference using the maximum likelihood reverse reinforcement learning algorithm further comprises:
and after the logarithm likelihood estimation function is subjected to derivation to obtain a gradient, updating the weight vector according to the current weight vector plus 0.01 time of the gradient until the absolute value of the difference of the next weight vector minus the current weight vector is less than or equal to 0.01, ending learning, and outputting a weight vector value, and if the absolute value is greater than 0.01, acquiring the accumulated return expectation again until the absolute value is less than or equal to 0.01.
CN202010195068.5A 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning Active CN111415198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010195068.5A CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010195068.5A CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Publications (2)

Publication Number Publication Date
CN111415198A true CN111415198A (en) 2020-07-14
CN111415198B CN111415198B (en) 2023-04-28

Family

ID=71494548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010195068.5A Active CN111415198B (en) 2020-03-19 2020-03-19 Tourist behavior preference modeling method based on reverse reinforcement learning

Country Status (1)

Country Link
CN (1) CN111415198B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160210602A1 (en) * 2008-03-21 2016-07-21 Dressbot, Inc. System and method for collaborative shopping, business and entertainment
WO2010048146A1 (en) * 2008-10-20 2010-04-29 Carnegie Mellon University System, method and device for predicting navigational decision-making behavior
CN107358471A (en) * 2017-07-17 2017-11-17 桂林电子科技大学 A kind of tourist resources based on visit behavior recommends method and system
WO2019145952A1 (en) * 2018-01-25 2019-08-01 Splitty Travel Ltd. Systems, methods and computer program products for optimization of travel technology target functions, including when communicating with travel technology suppliers under technological constraints
CN108875005A (en) * 2018-06-15 2018-11-23 桂林电子科技大学 A kind of tourist's preferential learning system and method based on visit behavior
CN108819948A (en) * 2018-06-25 2018-11-16 大连大学 Driving behavior modeling method based on reverse intensified learning
CN110288436A (en) * 2019-06-19 2019-09-27 桂林电子科技大学 A kind of personalized recommending scenery spot method based on the modeling of tourist's preference

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
刘建伟; 高峰; 罗雄麟: "A survey of deep reinforcement learning based on value functions and policy gradients" *
孙磊 et al.: "A tourist preference learning method based on touring behavior" *
宣闻: "Research on fine-grained tourist behavior preferences based on inverse reinforcement learning" *
范长杰: "Research on planning problems based on Markov decision theory" *
陈希亮; 曹雷; 何明; 李晨溪; 徐志雄: "A survey of deep inverse reinforcement learning" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158086A (en) * 2021-04-06 2021-07-23 浙江贝迩熊科技有限公司 Personalized customer recommendation system and method based on deep reinforcement learning
CN114355786A (en) * 2022-01-17 2022-04-15 北京三月雨文化传播有限责任公司 Big data-based regulation cloud system of multimedia digital exhibition hall
CN117033800A (en) * 2023-10-08 2023-11-10 法琛堂(昆明)医疗科技有限公司 Intelligent interaction method and system for visual cloud exhibition system

Also Published As

Publication number Publication date
CN111415198B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111415198A (en) Visitor behavior preference modeling method based on reverse reinforcement learning
CN110297848B (en) Recommendation model training method, terminal and storage medium based on federal learning
JP6431231B1 (en) Imaging system, learning apparatus, and imaging apparatus
CN107680010B (en) Scenic spot route recommendation method and system based on touring behavior
JP5570079B2 (en) Data processing apparatus and data processing method
JP4902270B2 (en) How to assemble a collection of digital images
JP6229655B2 (en) Information processing apparatus, information processing method, and program
Baltrunas et al. Context relevance assessment and exploitation in mobile recommender systems
US8577962B2 (en) Server apparatus, client apparatus, content recommendation method, and program
CN101010560B (en) Information processing device and method, program, and information processing system
JP5847070B2 (en) Server apparatus and photographing apparatus
CN103914559A (en) Network user screening method and network user screening device
CN110309434B (en) Track data processing method and device and related equipment
CN107018333A (en) Shoot template and recommend method, device and capture apparatus
CN110798718B (en) Video recommendation method and device
CN111306700A (en) Air conditioner control method and device and computer storage medium
WO2008129374A2 (en) Motion and image quality monitor
US20090189992A1 (en) Apparatus and method for learning photographing profiles of digital imaging device for recording personal life history
CN107121661B (en) Positioning method, device and system and server
CN114930319A (en) Music recommendation method and device
CN116643494A (en) Scene recommendation method, device and system and electronic equipment
CN116503209A (en) Digital twin system based on artificial intelligence and data driving
KR100880001B1 (en) Mobile device for managing personal life and method for searching information using the mobile device
CN113158086B (en) Personalized customer recommendation system and method based on deep reinforcement learning
KR102045475B1 (en) Tour album providing system for providing a tour album by predicting a user's preference according to a tour location and operating method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant