WO2020024170A1 - Nash equilibrium strategy on a continuous action space and social network public opinion evolution model - Google Patents

Nash equilibrium strategy on a continuous action space and social network public opinion evolution model Download PDF

Info

Publication number
WO2020024170A1
WO2020024170A1 (PCT/CN2018/098101)
Authority
WO
WIPO (PCT)
Prior art keywords
media
agent
gossiper
strategy
action
Prior art date
Application number
PCT/CN2018/098101
Other languages
English (en)
French (fr)
Inventor
侯韩旭
郝建业
张程伟
Original Assignee
东莞理工学院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东莞理工学院 filed Critical 东莞理工学院
Priority to PCT/CN2018/098101 priority Critical patent/WO2020024170A1/zh
Priority to CN201880001570.9A priority patent/CN109496305B/zh
Publication of WO2020024170A1 publication Critical patent/WO2020024170A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Definitions

  • the invention relates to a Nash equilibrium strategy, in particular to a Nash equilibrium strategy in a continuous action space, and also relates to a social network public opinion evolution model based on the Nash equilibrium strategy in a continuous action space.
  • the agent's action space can be a discrete finite set or a continuous set. Because the essence of reinforcement learning is to find the optimum through continual trial and error, a continuous action space has infinitely many action choices, and the multi-agent environment further increases the dimension of the action space, which makes it difficult for general reinforcement learning algorithms to learn the global optimum (or equilibrium).
  • This type of algorithm maintains a discrete action set, uses a traditional discrete algorithm to select the optimal action in the set, and finally updates the action set through a resampling mechanism so as to gradually learn the optimum.
  • This type of algorithm can be easily combined with traditional discrete algorithms.
  • the disadvantage is that the algorithm requires a long convergence time. All the above algorithms are designed with the goal of calculating the optimal strategy in a single-agent environment, and cannot be directly applied in the learning of a multi-agent environment.
  • the present invention provides a Nash equilibrium strategy in a continuous action space.
  • the present invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy in a continuous action space.
  • the invention includes the following steps:
  • the invention is further improved. Given a positive number σ_L and a positive number K, the Nash equilibrium strategy on the continuous action space of two agents can eventually converge to the Nash equilibrium, where σ_L is the lower bound of the variance σ.
  • the invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy in the continuous action space.
  • the social network public opinion evolution model includes two types of agents: Gossiper-type agents, which simulate the general public in the social network, and Media-type agents, which simulate the media or public figures in the social network whose aim is to attract the general public. The Media-type agents use the Nash equilibrium strategy on the continuous action space to compute the opinion that maximizes their return, update their opinion and broadcast it on the social network.
  • the invention is further improved and includes the following steps:
  • each agent adjusts its concept according to the following strategy, until each agent no longer changes the concept;
  • step S21 the operation method of the Gossiper-type agent is:
  • A2 Idea update: When the difference between the idea of the agent and the selected agent is less than the set threshold, update the idea of the agent;
  • A3 The agent compares the difference between itself and other Media concepts, and selects a Media to follow according to probability.
  • In step A2, if the currently selected neighbor is Gossiper j and |x_j^τ − x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ − x_i^τ); when the selected neighbor is a Media, the analogous condition and update with threshold d_m and learning rate α_m apply.
  • In step A3, the agent follows Media k with probability P_ik (the probability expression is rendered as an image in the original filing).
  • In step S23, the current return r_j of Media j is defined as the proportion of Gossipers in G′ who choose to follow j among the total number of Gossipers in G′.
  • P ij represents the probability that Gossiper i follows Media j.
  • the present invention is further improved.
  • the existence of a Media will accelerate the convergence of public opinion of each Gossiper agent.
  • the dynamic change of the concept of each Gossiper agent is a weighted average affected by each Media.
  • the present invention has the beneficial effect that, in the environment of continuous action space, the agent can maximize its own interests in the process of interacting with other agents, and finally learn the Nash equilibrium.
  • Figure 3 is a schematic diagram of the evolution of public opinion when the Gossiper-Media model has no Media in a fully connected network;
  • Figure 4 is a schematic diagram of the evolution of public opinion of the Gossiper-Media model when there is no Media in the small world network;
  • Figure 5 is a schematic diagram of the evolution of public opinion when the Gossiper-Media model has one Media in a fully connected network;
  • Figure 6 is a schematic diagram of the evolution of public opinion of each network when the Gossiper-Media model has a Media in the small world network;
  • FIG. 7 is a schematic diagram of the evolution of public opinion of each network when the Gossiper-Media model has two competing media in a fully connected network;
  • FIG. 8 is a schematic diagram of the evolution of public opinion of each network when the Gossiper-Media model has two competing media in the small world network.
  • the Nash equilibrium strategy in the continuous action space of the present invention is extended from the single-agent reinforcement learning algorithm CALA [7] (Continuous Action Learning Automata) by introducing a WoLS (Win or Learn Slow) learning mechanism.
  • the WoLS learning mechanism enables the algorithm to deal effectively with learning problems in a multi-agent environment; the Nash equilibrium strategy of the present invention is therefore abbreviated WoLS-CALA (Win or Learn Slow Continuous Action Learning Automaton).
  • the present invention first describes the CALA in detail.
  • Continuous Action Learning Automaton [7] is a strategy gradient reinforcement learning algorithm that solves the learning problem of continuous action space.
  • the agent's strategy is defined as the probability density function following the normal distribution N (u t , ⁇ t ) in the action space.
  • the CALA agent's strategy is updated as follows: At time t, the agent chooses an action x t according to the normal distribution N (u t , ⁇ t ); executes the actions x t and u t , and then obtains the corresponding returns V ( x t ) and V (u t ), which means that the algorithm needs to perform two actions during each interaction with the environment; finally, the mean and variance of the normal distribution N (u t , ⁇ t ) are updated according to the following formula ,
  • ⁇ u and ⁇ ⁇ are learning rates
  • K is a normal number, which is used to control the convergence of the algorithm.
  • the size of K is related to the number of learning times of the algorithm, and is usually set to the order of 1 / N
  • N is the number of iterations of the algorithm
  • ⁇ L is the lower bound of the variance ⁇ .
  • the algorithm continues to update the mean and variance until u is constant and ⁇ t tends to ⁇ L. After the algorithm converges, the mean u will point to an optimal solution of the problem.
  • the size of ⁇ in equation (2) determines the exploration capability of the CALA algorithm: the larger ⁇ t , the more likely CALA is to find a potentially better action.
  • the CALA algorithm is a learning algorithm based on the policy gradient class. This algorithm has been theoretically proven that under the condition that the return function V (x) is sufficiently smooth, the CALA algorithm can find a local optimum [7].
  • De Jong et al. [34] extended CALA to a multi-agent environment by improving the reward function, and experimentally verified that the improved algorithm can converge to Nash equilibrium.
  • the WoLS-CALA proposed by the present invention introduces a "WoLS" mechanism to solve the multi-agent learning problem, and theoretically analyzes and proves that the algorithm can learn Nash equilibrium in a continuous action space.
  • CALA requires the agent to obtain the returns of the sampled action and of the expected action simultaneously in each learning step, which is not feasible in most reinforcement learning environments: generally the agent can execute only one action in each interaction with the environment. To this end, the present invention extends CALA in two respects, Q-value function estimation and a variable learning rate, and proposes the WoLS-CALA algorithm.
  • agents choose one action at a time, and then get rewards from the environment.
  • for exploration based on the normal distribution, a natural approach is to use a Q value to estimate the average return of the expected action u.
  • the expected return of action u_i of agent i in Equation (1) can be estimated with such a Q value.
  • the present invention updates the expected action u in a variable learning rate.
  • the learning rate used to update the expected action u_i is defined by the following formula
  • the WoLS rule can be intuitively interpreted as if the return V (x) of the agent's action x is greater than the expected return V (u) of u, then it should learn faster, otherwise it should be slower. It can be seen that the strategies of WoLS and WoLF (Win or Learn Fast) [35] are just the opposite. The difference is that the goal of the WoLF design is to ensure the convergence of the algorithm, while the WoLS strategy of the present invention is to ensure that the expected return of action u can be correctly estimated while enabling the algorithm to update u in the direction of increasing returns.
  • Theorem 1 On the continuous action space, the learning dynamics of the CALA algorithm using WoLS rules can be approximated as a gradient ascent (GA) strategy.
  • GA gradient ascent
  • N (u, ⁇ u ) is the probability density function of the normal distribution
  • dN(a, b) denotes the derivative with respect to a of the normal density with mean a and variance b².
  • f ′ (u) is the gradient direction of the function f (u) at u. Equation (10) shows that u will change towards the gradient of f (u), that is, the direction where f (u) increases fastest. That is, the dynamic trajectory of u can be approximated as a gradient ascent strategy.
  • with the WoLS rule, the expected action u of the CALA algorithm can converge even when the standard deviation σ is not 0, so the lower bound σ_L can be set to a relatively large value in order to guarantee a sufficient exploration rate σ.
  • by selecting appropriate parameters, the algorithm can learn the global optimum.
  • the present invention combines the PHC (Policy Hill Climbing) strategy [35] to propose an Actor-Critic type independent multi-agent reinforcement learning algorithm, called WoLS-CALA.
  • the main idea of the Actor-Critic architecture is that strategy estimation and strategy updating are learned separately in separate processes.
  • the part that deals with strategy estimation is called Critic, and the part that updates strategy is called Actor.
  • the specific learning process is as follows (Algorithm 1),
  • Algorithm 1 uses two constants α_ub and α_us (α_ub > α_us) in place of the learning rate of u_i. If the return r_i received by agent i after performing action x_i is greater than the current cumulative average return Q_i, the learning rate of u_i is α_ub (winning); otherwise (losing) it is α_us (step 3.3). Because Equations (7) and (4) contain the denominator φ(σ_i^t), a small error has a large effect on the updates of u and σ when the denominator is small; using two fixed step sizes makes the update process easier to control in experiments and easier to implement.
  • in step 4 the algorithm uses the convergence of the cumulative average strategy as the loop termination condition and as the algorithm output, mainly to prevent the algorithm from failing to terminate in a competitive environment, where u_i may exhibit a periodic solution.
  • the cumulative average strategy and u_i have different meanings: the former is the cumulative statistical average of agent i's sampled actions and, in a multi-agent environment, converges to the Nash equilibrium strategy; u_i is the expected mean of agent i's strategy distribution and may oscillate periodically near the equilibrium point in a competitive environment. A detailed explanation is given later in Theorem 2.
  • the Nash equilibrium can be divided into two types: the equilibrium point located on the boundary of the continuous action space (bounded closed set) and the other type is the equilibrium point located inside the continuous action space.
  • since an equilibrium point on the boundary is equivalent to an equilibrium point of a space one dimension lower, this example focuses on the second type of equilibrium point.
  • the dynamic characteristics of an ordinary differential equation depend on the stability of its internal equilibrium points [40], so this example first calculates the equilibrium points in equation (10), and then analyzes the stability of these equilibrium points.
  • Matrix M has eigenvalues with positive real parts, that is, the equilibrium point is unstable.
  • the trajectories around the unstable equilibrium point can be divided into two types: trajectories on the stable manifold and all other trajectories [40].
  • the stable manifold is the subspace spanned by the eigenvectors corresponding to the stable eigenvalues; trajectories inside the stable manifold converge, in theory, to this equilibrium point, but because of randomness and numerical error the probability that the algorithm stays inside this subspace is 0. All trajectories not on the stable manifold gradually move away from the equilibrium point and eventually converge to one of the other types of equilibrium points analyzed above, i.e., to an equilibrium point on the boundary or to an equilibrium point of the first or second type.
  • given a suitable exploration-exploitation trade-off, e.g., σ_L sufficiently large, a large initial value of σ and a small learning rate, the algorithm can converge to a Nash equilibrium point (the global optimum of each agent when the other agents' strategies are fixed).
  • the present invention completes the proof that the algorithm converges to the Nash equilibrium.
  • the invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy in the continuous action space.
  • the social network public opinion evolution model includes two types of agents: Gossiper-type agents, which simulate the general public in a social network, and Media-type agents, which simulate the media or public figures in the social network whose aim is to attract the general public. The social network public opinion evolution model of the present invention is therefore also called the Gossiper-Media model.
  • the Media agent uses the Nash equilibrium strategy in the continuous action space to calculate the concept of optimal return, updates its concept and broadcasts it in social networks.
  • the present invention applies the WoLS-CALA algorithm to the research on the evolution of public opinion in real social networks. By using WoLS-CALA to model the media in the network, it discusses what kind of influence the competitive media will have on social public opinion.
  • the present invention proposes a multi-agent reinforcement learning framework, the Gossiper-Media model, to study the evolution of group public opinion.
  • the Gossiper-Media model includes two types of agents, Gossiper-type agents and Media-type agents. Among them, the Gossiper-type agent is used to simulate the general public in the real network, and its ideas (public opinion) are simultaneously influenced by Media and other Gossiper; and the Media-type agent is used to simulate the media or public figures in the social network to attract the public This type of agent actively chooses its own ideas to maximize its followers.
  • Consider a network of N agents, where the number of Gossipers is |G| and the number of Media is |M| (N = G ∪ M).
  • Gossiper and Media are fully connected, that is, each Gossiper can choose any Media interaction with equal probability.
  • Gossiper does not require full connectivity, that is, each Gossiper can only interact with its neighbors.
  • the network between Gossiper is determined by the social relationship between them.
  • this example defines two Gossiper networks for the simulation experiments: a fully connected network and a small-world network. Denote the opinions of Gossiper i and Media j by x_i and y_j respectively.
  • the interaction process of the agents in the model follows Algorithm 2.
  • each Gossiper and Media opinion is randomly initialized to a value in the action space [0,1] (step 1). Then, in each interaction, every agent adjusts its own opinion according to its strategy until the algorithm converges (the agents no longer change their opinions). Each Gossiper agent first chooses an interaction partner: with probability ξ it randomly selects a Gossiper from its neighbors, or with probability 1-ξ it randomly selects a Media (step 2.1). The Gossiper then updates its opinion according to Algorithm 3 and, based on the difference between its opinion and those of the Media, chooses to follow the Media closest to its own opinion.
  • it is assumed that the Media agents can obtain the opinions of a random subset of Gossipers by sampling; this subset, denoted G′, is broadcast to all Media (step 2.2).
  • the Media agents then use the WoLS-CALA algorithm to play against one another on G′, compute the opinions that maximize their followers, and broadcast the updated opinions to the entire network (step 2.3).
  • each Media can also sample independently, so that they get different G ′. This has little impact on the subsequent learning of the WoLS-CALA algorithm, because the theoretical distribution of G ′ is the same as G.
  • the environmental assumptions of the present invention are mainly for the sake of simplicity, while also reducing possible uncertainties due to random sampling.
  • Each Gossiper's strategy includes two parts: 1) how to update the concept; 2) how to choose the media to follow.
  • the detailed description is as follows (Algorithm 3):
  • the magnitude of the threshold d g (or d m ) represents the degree to which Gossiper accepts new ideas. Intuitively, the larger d, the more susceptible Gossiper is to other agents [41-43].
  • the Gossiper then compares the differences between its own opinion and those of the Media and chooses one Media to follow according to a probability (step 3).
  • the probability P ij ⁇ represents the probability that Gossiper i chooses to follow Media j at time ⁇ , which satisfies the following characteristics:
  • Media j's current return r_j is defined as the proportion of Gossipers in G′ who choose to follow j among the total number of Gossipers in G′.
  • λ_ij is defined as in Algorithm 3, and P_ij denotes the probability that Gossiper i follows Media j.
  • Let {y_j}_{j∈M}, y_j ∈ (0,1), denote the opinion of Media j.
  • the idea distribution of Gossiper can be represented by a continuous distribution density function.
  • p (x, t) is used to represent the probability density function of the idea distribution of Gossiper group at time t.
  • the evolution of Gossiper's public opinion can be expressed as the partial derivative of the probability density function p (x, t) with respect to time.
  • I_1 = {x : |x − y| < (1 − α_m)d_m} and I_2 = {x : d_m ≥ |x − y| ≥ (1 − α_m)d_m}.
  • W_{x+y→x} represents the probability that a Gossiper with opinion equal to x+y changes its opinion to x, and W_{x+y→x} p(x+y) dy represents the fraction of agents whose opinion moves from the interval (x+y, x+y+dy) to x within the time interval (t, t+dt).
  • W_{x→x+y} represents the probability that an agent with opinion x changes its opinion to x+y, and W_{x→x+y} p(x) dy represents the fraction of Gossipers with opinion equal to x that move into the interval (x+y, x+y+dy).
  • the agent Gossiper is influenced by other Gossiper concepts according to probability ⁇ , or is affected by Media concepts according to probability 1- ⁇ , and then makes its own decision.
  • Splitting W_{x+y→x} and W_{x→x+y} into the two parts influenced by other Gossipers' opinions and by the Media opinions, written w^[g] and w^[m] respectively, W_{x→x+y} and W_{x+y→x} can be expressed as,
  • ⁇ g (x, t) represents the rate of change of the probability density function p (x, t) of the agent g concept affected by Gossiper.
  • p (x, t) the probability density function of the agent g concept affected by Gossiper.
  • ⁇ g is a real number between 0 and 0.5.
  • d g is the threshold of Gossiper.
  • ⁇ m (x, t) represents the rate of change of the distribution density function p (x, t) of the idea affected by media.
  • the Dirac delta equation ⁇ (x) [46] is often used to simulate a high and narrow spike function (impulse) and other similar abstract concepts, such as point charge, point mass or electron, which are defined as follows,
  • The transfer rate from x+y to x, w^[m]_{x+y→x}, can be expressed using the Dirac delta function.
  • The term δ(x − [(x+y) + α_m((x+z) − (x+y))]) indicates the event that the opinion x+y moves to x under the influence of the opinion x+z.
  • q (x + z) is the distribution density of the media at the idea x + z.
  • Similarly, w^[m]_{x→x+y} can be expressed analogously; combining the two expressions and integrating gives Equation (23), with I_1 = {x : |x − y| < (1 − α_m)d_m} and I_2 = {x : d_m ≥ |x − y| ≥ (1 − α_m)d_m}.
  • the rate of change of p (x, t) is a weighted average of the formulas ⁇ g (x, t) and ⁇ m (x, t).
  • the former represents the part influenced by the Gossiper network and the latter represents the part influenced by the Media network.
  • the formula ⁇ g (x, t) containing only Gossiper has been studied and analyzed by Weisbuch G's work [45]. An important property that it draws is that from any distribution, the locally optimal point in the distribution density will gradually strengthen, which indicates that the development of public opinion in the pure Gossiper network will gradually tend to be consistent.
  • equation (24) shows that Gossiper's views similar to Media's concept will converge to this Media, so we can draw the following conclusions,
  • f 1 (x, y) and f 2 (x, y) simulate r in Algorithm 4, which represent the return of Media 1 and 2 when the joint action is ⁇ x, y>.
  • This example uses two WoLS-CALA agents to control x and y separately to maximize their respective return functions f 1 (x, y) and f 2 (x, y).
  • Gossiper can be divided into two categories according to different forms of Nash equilibrium:
  • This section shows the simulation results of the Gossiper-Media model.
  • 200 Gossipers and experimental environments with different numbers of Media are considered: (i) no Media; (ii) only one Media; (iii) two competing Media.
  • this example considers two representative Gossiper networks, Fully Connected Network and Small-World Network [47] (Small-World Network).
  • the same parameter settings are used in each experimental environment.
  • the same network was used in the three experimental environments, and the same initial concepts of Gossiper and Media were used.
  • the initial idea of each Gossiper is to randomly sample the interval [0,1] according to a uniform distribution. Media's initial idea was 0.5.
  • the Gossiper-Media threshold d m and the Gossiper-Gossiper threshold d g are set to a small positive number 0.1.
  • Gossiper's learning rates ⁇ g and ⁇ m are set to 0.5.
  • the set G′ is randomly sampled from G and satisfies |G′| = 80% |G|.
  • Figures 3-4 show the evolution of public opinion in the fully connected network and in the small-world network when there is no Media; Figures 5-6 show the evolution when there is one Media; and Figures 7-8 show the evolution when there are two competing Media.
  • From these figures it can first be seen that, in all three Media environments, the number of convergence points is the same for the different Gossiper networks: five with zero Media, four with one Media, and three with two Media. This phenomenon is consistent with the conclusions of Theorem 3 and Corollary 2.
  • the public opinion dynamics of Gossiper have nothing to do with the topology of the Gossiper network, because the public opinion dynamics of Gossiper under different networks can be modeled with the same formula.
  • the present invention proposes an independent learning multi-agent continuous learning space reinforcement learning algorithm WoLS-CALA, which proves that the algorithm can learn Nash equilibrium from two aspects: theoretical proof and experimental verification. Then the algorithm is applied to the research on the evolution of public opinion in the network environment.
  • Individuals in the social network are divided into two categories: Gossiper and Media.
  • Gossiper class represents the general public.
  • Media uses the WoLS-CALA algorithm to model individuals representing social media and other objects that attract public attention.
  • the present invention discusses the impact of competition between different numbers of media on Gossiper public opinion.
  • the theory and experiments show that the competition of Media can accelerate the convergence of public opinion.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Feedback Control In General (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a Nash equilibrium strategy on a continuous action space and a social network public opinion evolution model, belonging to the field of reinforcement learning methods. The strategy of the present invention comprises the following steps: initialize the parameters; with a certain exploration rate, randomly select an action x_i according to the normal distribution N(u_i, σ_i) and execute it, then obtain the return r_i from the environment; if the return r_i received by agent i after executing action x_i is greater than the current cumulative average return Q_i, the learning rate of u_i is α_ub, otherwise α_us; update u_i, the variance σ_i and Q_i according to the selected learning rate, and finally update the cumulative average strategy (I); if the cumulative average strategy (I) converges, output the cumulative average strategy (I) as agent i's final action. The beneficial effect of the present invention is that an agent can maximize its own interests while interacting with other agents and can eventually learn the Nash equilibrium.

Description

Nash equilibrium strategy on a continuous action space and social network public opinion evolution model

Technical Field

The present invention relates to a Nash equilibrium strategy, in particular to a Nash equilibrium strategy on a continuous action space, and further to a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space.
Background Art

In an environment with a continuous action space, on the one hand the agent has infinitely many actions to choose from and traditional Q-based tabular algorithms cannot store estimates of infinitely many returns; on the other hand, in a multi-agent environment a continuous action space further increases the difficulty of the problem.

In the field of multi-agent reinforcement learning, an agent's action space may be a discrete finite set or a continuous set. Because the essence of reinforcement learning is to find the optimum through continual trial and error, a continuous action space offers infinitely many action choices, and a multi-agent environment further increases the dimensionality of the action space, which makes it difficult for general reinforcement learning algorithms to learn the global optimum (or equilibrium).

Most current algorithms handle continuous problems with function-approximation techniques and fall into two classes: value-approximation algorithms [1-5] and policy-approximation algorithms [6-9]. Value-approximation algorithms explore the action space and estimate the corresponding value function from the returns, whereas policy-approximation algorithms define the policy as a probability distribution function over the continuous action space and learn the policy directly. The performance of these algorithms depends on how accurately the value function or the policy is estimated, and they often fall short on complex problems such as nonlinear control. In addition, there are sampling-based algorithms [10, 11], which maintain a discrete action set, use a traditional discrete algorithm to select the best action in the set, and finally update the action set through a resampling mechanism so as to gradually learn the optimum. Such algorithms can easily be combined with traditional discrete algorithms, but their drawback is a long convergence time. All of the above algorithms are designed to compute the optimal policy in a single-agent environment and cannot be applied directly to learning in a multi-agent environment.

In recent years much work has used agent-based simulation to study the evolution of public opinion in social networks [12-14]. Given groups with different opinion distributions, the question is whether, through mutual interaction, the opinions of the group eventually reach consensus, polarize, or remain in disorder [15]. The key to this question is understanding the dynamics of opinion evolution so as to identify the underlying causes that drive opinions toward agreement [15]. For the problem of opinion evolution in social networks, researchers have proposed a variety of multi-agent learning models [16-20]; among them, [21-23] study how factors such as the degree of information sharing or exchange affect opinion evolution. Works such as [14, 24-28] adopt evolutionary game-theoretic models to study how agent behaviors (e.g., defection and cooperation) evolve from peer interaction. These works model agent behavior under the assumption that all agents are identical. In reality, however, individuals play different roles in society (e.g., leaders or followers), which cannot be modeled accurately by the above approaches. To this end, Quattrociocchi et al. [12] divide a social group into media and the general public and model them separately, where the opinions of the public are influenced by the media they follow as well as by other members of the public, and the opinions of the media are influenced by the best-performing media. Subsequently, Zhao et al. [29] proposed a leader-follower type opinion model to explore the formation of public opinion. In both works, agents adjust their opinions by imitating leaders or successful peers; other imitation-based work includes Local Majority [30], Conformity [31] and Imitating Neighbor [32]. In reality, however, the strategies people use when making decisions are far more complex than simple imitation: people usually decide their behavior by continually interacting with an unknown environment and combining the knowledge they have acquired. Moreover, imitation-based strategies cannot guarantee that the algorithm learns the global optimum, since the quality of an agent's strategy depends on the strategy of the leader or of the agent being imitated, and the leader's strategy is not always the best.
Summary of the Invention

To solve the problems in the prior art, the present invention provides a Nash equilibrium strategy on a continuous action space, and further provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space.

The present invention comprises the following steps:

(1) Set constants α_ub and α_us, where α_ub > α_us, and α_Q, α_σ ∈ (0,1) are learning rates;

(2) Initialize the parameters, where the parameters include the mean u_i of agent i's expected action u, the cumulative average strategy (written here as x̄_i; the symbol is rendered as an image in the original filing), the constant C, the variance σ_i and the cumulative average return Q_i;

(3) Repeat the following steps until the cumulative average strategy x̄_i of agent i's sampled actions converges:

(3.1) With a certain exploration rate, randomly select an action x_i according to the normal distribution N(u_i, σ_i);

(3.2) Execute the action x_i, then obtain the return r_i from the environment;

(3.3) If the return r_i received by agent i after executing action x_i is greater than the current cumulative average return Q_i, the learning rate of u_i is α_ub, otherwise α_us; update u_i with the selected learning rate;

(3.4) Update the variance σ_i according to the learned u_i;

(3.5) If the return r_i received by agent i after executing action x_i is greater than the current cumulative average return Q_i, the learning rate is α_ub, otherwise α_us; update Q_i with the selected learning rate;

(3.6) Update x̄_i according to the constant C and the action x_i;

(4) Output the cumulative average strategy x̄_i as agent i's final action.
In a further refinement of the invention, in steps (3.3) and (3.5) the update step of Q is synchronized with the update step of u; in a neighborhood of u_i the mapping from u_i to Q_i can be linearized as Q_i = K·u_i + C, where the slope K is given by the expression rendered as an image in the original filing.

In a further refinement of the invention, given a positive number σ_L and a positive number K, the Nash equilibrium strategies of two agents on the continuous action space eventually converge to a Nash equilibrium, where σ_L is the lower bound of the variance σ.
The present invention further provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model contains two types of agents: Gossiper-type agents, which simulate the general public in a social network, and Media-type agents, which simulate the media or public figures in the social network whose aim is to attract the general public. The Media-type agents use the Nash equilibrium strategy on the continuous action space to compute the opinion that maximizes their return, update their opinion and broadcast it in the social network.

In a further refinement, the invention comprises the following steps:

S1: the opinion of every Gossiper and Media is randomly initialized to a value in the action space [0,1];

S2: in each interaction, every agent adjusts its opinion according to the following strategies until no agent changes its opinion any more;

S21: any Gossiper-type agent selects a neighbor at random in the Gossiper network with a set probability and updates its opinion and the Media it follows according to the BCM (bounded confidence model) strategy;

S22: a subset G′ ⊆ G of the Gossiper network G is sampled at random, and the opinions of the Gossipers in G′ are broadcast to all Media;

S23: every Media uses the Nash equilibrium strategy on the continuous action space to compute the opinion that maximizes its return and broadcasts the updated opinion to the entire social network.
In a further refinement, in step S21 the Gossiper-type agent operates as follows:

A1: opinion initialization: x_i^τ = x_i^{τ-1};

A2: opinion update: when the difference between the agent's opinion and that of the selected agent is smaller than a set threshold, update the agent's opinion;

A3: the agent compares the difference between its own opinion and those of the Media and follows one Media selected according to a probability.

In a further refinement, in step A2, if the currently selected neighbor is Gossiper j and |x_j^τ − x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ − x_i^τ); if the currently selected neighbor is Media k, the analogous condition and update with threshold d_m and learning rate α_m apply (the exact expressions are rendered as images in the original filing). Here d_g and d_m are the opinion thresholds set for the two types of neighbors, and α_g and α_m are the corresponding learning rates.
In a further refinement, in step A3 the agent follows Media k with probability P_ik^τ (the probability expression is rendered as an image in the original filing).

In a further refinement, in step S23 the current return r_j of Media j is defined as the proportion of Gossipers in G′ that choose to follow j among the total number of Gossipers in G′ (the formula is rendered as an image in the original filing), where P_ij denotes the probability that Gossiper i follows Media j.
In a further refinement, the existence of a single Media accelerates the convergence of the Gossiper agents' opinions toward agreement; in an environment with several competing Media, the dynamic change of each Gossiper agent's opinion is a weighted average of the influences of the individual Media.

Compared with the prior art, the beneficial effect of the present invention is that, in an environment with a continuous action space, an agent can maximize its own interests while interacting with other agents and can eventually learn the Nash equilibrium.
Brief Description of the Drawings

Figure 1 shows the two agents of the present invention converging to the Nash equilibrium point for r = 0.7 > 2/3, a = 0.4, b = 0.6;
Figure 2 shows the two agents of the present invention converging to the Nash equilibrium point for r = 0.6 < 2/3, a = 0.4, b = 0.6;
Figure 3 shows the evolution of public opinion of the Gossiper-Media model in a fully connected network with no Media;
Figure 4 shows the evolution of public opinion of the Gossiper-Media model in a small-world network with no Media;
Figure 5 shows the evolution of public opinion of the Gossiper-Media model in a fully connected network with one Media;
Figure 6 shows the evolution of public opinion of the Gossiper-Media model in a small-world network with one Media;
Figure 7 shows the evolution of public opinion of the Gossiper-Media model in a fully connected network with two competing Media;
Figure 8 shows the evolution of public opinion of the Gossiper-Media model in a small-world network with two competing Media.
Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the drawings and embodiments.

The Nash equilibrium strategy on a continuous action space of the present invention is extended from the single-agent reinforcement learning algorithm CALA [7] (Continuous Action Learning Automata). By introducing the WoLS (Win or Learn Slow, i.e., learn fast when winning) learning mechanism, the algorithm can effectively handle learning problems in a multi-agent environment; the Nash equilibrium strategy of the present invention is therefore abbreviated WoLS-CALA (Win or Learn Slow Continuous Action Learning Automaton). The invention first describes CALA in detail.
The Continuous Action Learning Automaton (CALA) [7] is a policy-gradient reinforcement learning algorithm for learning problems with continuous action spaces. The agent's strategy is defined as a probability density function over the action space that follows the normal distribution N(u_t, σ_t).

A CALA agent updates its strategy as follows. At time t, the agent selects an action x_t according to the normal distribution N(u_t, σ_t); it executes both actions x_t and u_t and obtains the corresponding returns V(x_t) and V(u_t) from the environment, which means the algorithm must execute two actions in every interaction with the environment; finally, the mean and variance of N(u_t, σ_t) are updated according to Equations (1) and (2) (rendered as images in the original filing), where α_u and α_σ are learning rates and K is a positive constant used to control the convergence of the algorithm. Specifically, the magnitude of K is related to the number of learning iterations of the algorithm and is usually set to the order of 1/N, where N is the number of iterations, and σ_L is the lower bound of the variance σ. The algorithm keeps updating the mean and variance until u no longer changes and σ_t tends to σ_L; after convergence, the mean u points to an optimal solution of the problem. The magnitude of σ in Equation (2) determines the exploration ability of CALA: the larger σ_t, the more likely CALA is to find a potentially better action.
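As a rough illustration of the CALA update just described, the Python sketch below follows the form in which the CALA update is commonly presented in [7]; because Equations (1) and (2) of the filing are rendered as images, the exact constants and the shape of the update below are assumptions rather than the filing's own formulas.

import numpy as np

# A minimal sketch of one CALA update step, assuming the standard form given in [7];
# Equations (1)-(2) of the filing are images, so the exact update below is an assumption.
def cala_step(u, sigma, V, alpha_u=0.01, alpha_sigma=0.001, K=0.01, sigma_L=1e-4):
    phi = max(sigma, sigma_L)                    # keep the effective deviation above sigma_L
    x = np.random.normal(u, phi)                 # sample the action x_t from N(u_t, sigma_t)
    dv = (V(x) - V(u)) / phi                     # CALA evaluates both x_t and u_t
    u_new = u + alpha_u * dv * (x - u) / phi     # move the mean toward better actions
    sigma_new = (sigma + alpha_sigma * dv * (((x - u) / phi) ** 2 - 1)
                 - alpha_sigma * K * (sigma - sigma_L))   # let sigma decay toward sigma_L
    return u_new, max(sigma_new, sigma_L), x

# Example: a single agent maximizing a smooth return whose optimum is at 0.3.
V = lambda a: -(a - 0.3) ** 2                    # hypothetical return function
u, sigma = 0.8, 0.5
for _ in range(20000):
    u, sigma, _ = cala_step(u, sigma, V)
print(u, sigma)                                  # u approaches 0.3, sigma approaches sigma_L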
By definition, CALA is a policy-gradient learning algorithm. It has been proved theoretically that, provided the return function V(x) is sufficiently smooth, CALA can find a local optimum [7]. De Jong et al. [34] extended CALA to multi-agent environments by modifying the reward function and verified experimentally that their improved algorithm can converge to a Nash equilibrium. The WoLS-CALA proposed by the present invention introduces a "WoLS" mechanism to solve the multi-agent learning problem, and analyzes and proves theoretically that the algorithm can learn a Nash equilibrium in a continuous action space.

Because CALA requires the agent to obtain the returns of the sampled action and of the expected action simultaneously in every learning step, which is infeasible in most reinforcement learning environments, where an agent can usually execute only one action per interaction with the environment, the present invention extends CALA in two respects, Q-value function estimation and a variable learning rate, and proposes the WoLS-CALA algorithm.
1. Q-function estimation
In an independent multi-agent reinforcement learning environment, an agent selects one action at a time and then receives a return from the environment. For exploration with a normal distribution, a natural approach is to use a Q value to estimate the average return of the expected action u. Specifically, the expected return V(u_i) of agent i's action u_i in Equation (1) can be estimated by Equation (3) (rendered as an image in the original filing; in essence, Q_i is moved toward each received return, i.e., Q_i ← (1 − α_Q)·Q_i + α_Q·r_i). Here x_i^t is the action sampled at time t, r_i^t is the return agent i receives when selecting x_i^t, which is determined by the joint action of all agents at time t, and α_Q is agent i's learning rate for Q. The update in Equation (3) is the common way in which reinforcement learning estimates a single-state value function; in essence it estimates V(u_i) by the statistical average of r_i. A further advantage is that Q_i can be updated one sample at a time, and a newly received return always accounts for a fraction α of the Q-value estimate.
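The running-average estimate described above can be sketched as follows; Equation (3) itself is an image in the filing, but the text states that each newly received return always has weight α in the Q-value estimate, which corresponds to the exponential moving average below.

import random

def update_q(q_i, r_i, alpha_q=0.1):
    # Move the estimate of V(u_i) toward the newly received return r_i;
    # the new return always accounts for a fraction alpha_q of the estimate.
    return (1.0 - alpha_q) * q_i + alpha_q * r_i

# Example: Q tracks the statistical average of noisy returns with mean 0.5.
q = 0.0
for _ in range(1000):
    q = update_q(q, random.gauss(0.5, 0.1))      # hypothetical returns
print(round(q, 2))                               # close to 0.5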
Using Equation (3), the update of u (Equation (1)) and the update of σ (Equation (2)) can be rewritten accordingly (the resulting expressions are rendered as images in the original filing). Here x_i^t is the action sampled at time t, r_i^t is the return agent i receives when selecting x_i^t, determined by the joint action of all agents at time t, and α_u^i and α_σ^i are agent i's learning rates for u_i and σ_i.
However, using Q-function estimation directly in a multi-agent environment creates a new problem for the algorithm. In a multi-agent environment an agent's return is affected by the other agents, and changes in the other agents' strategies make the environment non-stationary; the update in Equation (4) does not guarantee that u can adapt to such dynamic changes. Consider a simple example. Suppose that at time t agent i has already learned the currently optimal action and that Q_i^t is an accurate estimate of the corresponding expected return (the precise expressions are rendered as images in the original filing). By definition, at time t the corresponding inequality holds for every x_i. Substituting Equation (3) into Equation (4) gives Equation (5). If the environment remains unchanged, the relation continues to hold; if, however, the environment changes and the learned action is no longer optimal, there exists an action x_i whose corresponding return exceeds the current estimate. In this situation, continuing to update according to Equation (5) drives u_i away from x_i, whereas in theory u_i should move closer to x_i in order to keep the estimate accurate. Because Q is a statistical estimate of r, Q is updated more slowly than r changes, so the old relation keeps holding during subsequent updates, and after many samples u_i remains essentially unchanged near its previous value, when in theory u_i should move in order to find the new optimal action. The cause of these problems is the non-stationarity induced by the multi-agent environment, with which traditional estimation methods (such as Q-learning) cannot cope effectively.
2. The WoLS rule and its analysis

To estimate the expected return of u more accurately in a multi-agent environment, the present invention updates the expected action u with a variable learning rate. Formally, the learning rate used to update the expected action u_i is defined by Equation (6), and the update of u_i can then be written as Equation (7) (both rendered as images in the original filing). The WoLS rule can be interpreted intuitively as follows: if the return V(x) of the agent's action x is greater than the return V(u) of the expected action u, the agent should learn faster, otherwise more slowly. WoLS is thus exactly the opposite of the WoLF (Win or Learn Fast) strategy [35]. The difference is that WoLF is designed to guarantee convergence of the algorithm, whereas the WoLS strategy of the present invention is designed to ensure that the expected return of action u is estimated correctly while still allowing the algorithm to update u in the direction of increasing return. Analyzing the intrinsic dynamics of the WoLS strategy yields the following conclusion.
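The variable learning rate can be sketched as below. Equations (6) and (7) are images in the filing, so only the switching rule described in the text (and used in step 3.3 of Algorithm 1) is shown; the WoLF variant is included only to illustrate the opposite convention of [35].

def wols_learning_rate(r_i, q_i, alpha_ub=0.05, alpha_us=0.005):
    # WoLS: learn fast when "winning" (the sampled return beats the running average Q),
    # learn slow otherwise; alpha_ub > alpha_us.
    return alpha_ub if r_i > q_i else alpha_us

def wolf_learning_rate(r_i, q_i, alpha_win=0.005, alpha_lose=0.05):
    # WoLF [35]: the reverse convention -- learn fast when losing, slow when winning.
    return alpha_lose if r_i <= q_i else alpha_win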
Theorem 1. On a continuous action space, the learning dynamics of the CALA algorithm using the WoLS rule can be approximated by a gradient ascent (GA) strategy.

Proof. By definition (4), x_t is the action the agent selects at time t according to the normal distribution N(u_t, σ_t), and V(x_t) and V(u_t) are the returns corresponding to the actions x_t and u_t respectively. Define f(x) = E[V(x_t) | x_t = x] as the expected return function of action x. Assuming α_u is infinitesimal, the dynamics of u_t in the WoLS-CALA algorithm can be described by the ordinary differential equation in Equation (8) (rendered as an image in the original filing), where N(u, σ_u) is the probability density function of the normal distribution (dN(a, b) denotes the derivative with respect to a of the normal density with mean a and variance b²). Setting x = u + y, expanding f(x) in Equation (8) as a Taylor series around y = 0 and simplifying gives Equation (9), in which the indicated term and σ² are both positive.

The update of the standard deviation σ (Equation (4)) is the same as in the original CALA algorithm, so the conclusion for CALA can be used directly: given a sufficiently large positive number K, σ eventually converges to σ_L. Combining this with Equation (9), the invention draws the following conclusion: for a small positive number σ_L (e.g., 1/10000), after sufficiently long time the ordinary differential equation for u_t can be approximated by Equation (10) (rendered as an image in the original filing), where the indicated coefficient is a small positive constant and f′(u) is the gradient direction of the function f(u) at u. Equation (10) shows that u moves along the gradient of f(u), i.e., in the direction in which f(u) increases fastest; that is, the dynamic trajectory of u can be approximated by a gradient ascent strategy.

When only one agent is present, the dynamics of u eventually converge to an optimum, since when u = u* is an optimum the corresponding gradient terms vanish (the expressions are rendered as images in the original filing).

Theorem 1 shows that the learning dynamics of the expected action of a CALA agent using the WoLS rule resemble the gradient ascent strategy introduced above, i.e., their time derivatives can all be expressed in a gradient-ascent form (the expression is rendered as an image in the original filing). If f(u) has several local optima, whether the algorithm eventually converges to the global optimum depends on how exploration and exploitation are balanced [36], which is an inherent dilemma in reinforcement learning. A common way to reach the global optimum is to give the initial exploration rate σ (i.e., the standard deviation) a large value and the initial learning rate of σ a particularly small value, so that the algorithm samples the entire action space sufficiently often. Moreover, with the WoLS rule the expected action u of the CALA algorithm can converge even when the standard deviation σ is not 0, so the lower bound σ_L of the exploration rate can be set to a relatively large value to guarantee sufficient exploration. With these measures and suitably chosen parameters, the algorithm can learn the global optimum.
Another issue is that a pure gradient ascent strategy may fail to converge in a multi-agent environment. The present invention therefore combines the PHC (Policy Hill Climbing) algorithm [35] and proposes an Actor-Critic type independent multi-agent reinforcement learning algorithm, called WoLS-CALA. The main idea of the Actor-Critic architecture is that strategy estimation and strategy updating are learned separately in independent processes: the part that handles strategy estimation is called the Critic and the part that updates the strategy is called the Actor. The specific learning process is as follows (Algorithm 1).

Algorithm 1: learning strategy of WoLS-CALA agent i (the pseudocode is rendered as an image in the original filing; its steps correspond to steps (1)-(4) of the Summary of the Invention).
For simplicity, Algorithm 1 uses two constants α_ub and α_us (α_ub > α_us) in place of the learning rate of u_i: if the return r_i received by agent i after executing action x_i is greater than the current cumulative average return Q_i, the learning rate of u_i is α_ub (winning); otherwise (losing) it is α_us (step 3.3). Because Equations (7) and (4) contain the denominator φ(σ_i^t), a small error has a large effect on the updates of u and σ when the denominator is small; using two fixed step sizes makes the update process easier to control in experiments and easier to implement. Note also that in step 3.5 the update step of Q is synchronized with that of u, i.e., both are α_ub when r_i > Q_i and both are α_us otherwise. Because α_ub and α_us are very small numbers, in a small neighborhood of u_i the mapping from u_i to Q_i can be linearized as Q_i = K·u_i + C, with a slope K given by the expression rendered as an image in the original filing; if u_i changes, Q_i changes accordingly. The purpose of this is again to estimate the expected return of u more accurately. Finally (step 4), the algorithm uses the convergence of the cumulative average strategy x̄_i as the loop termination condition and as the algorithm output, mainly to prevent the algorithm from failing to terminate in a competitive environment, where u_i may exhibit a periodic solution. Note that x̄_i and u_i have different meanings: x̄_i is the cumulative statistical average of agent i's sampled actions and, in a multi-agent environment, eventually converges to the Nash equilibrium strategy, whereas u_i is the expected mean of agent i's strategy distribution, which in a competitive environment may oscillate periodically around the equilibrium point. A detailed explanation is given in Theorem 2 below.
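A compact sketch of one WoLS-CALA learner is given below, following steps (3.1)-(3.6) of claim 1 and the discussion of Algorithm 1 above. Because Equations (1)-(7) of the filing are rendered as images, the exact update expressions used for u, σ and the cumulative average are assumptions; only the win/lose learning-rate switch, the synchronized Q update and the cumulative-average output follow the text directly. A convergence test on the cumulative average (for example, its change over a window falling below a tolerance) would play the role of step 4.

import numpy as np

class WoLSCALAAgent:
    # Sketch of a WoLS-CALA learner; the u/sigma/average update forms are assumptions,
    # since the filing's Equations (1)-(7) are images.
    def __init__(self, alpha_ub=0.05, alpha_us=0.005, sigma_L=0.02, sigma0=0.3,
                 u0=None, seed=None):
        self.rng = np.random.default_rng(seed)
        self.alpha_ub, self.alpha_us, self.sigma_L = alpha_ub, alpha_us, sigma_L
        self.u = self.rng.uniform(0.0, 1.0) if u0 is None else u0   # expected action mean
        self.sigma = sigma0                                          # exploration rate
        self.q = 0.0                                                 # cumulative average return
        self.x_bar = self.u                                          # cumulative average strategy
        self.t = 0

    def act(self):
        # Step 3.1: sample an action from N(u, sigma), clipped to the action space [0, 1].
        self.x = float(np.clip(self.rng.normal(self.u, self.sigma), 0.0, 1.0))
        return self.x

    def update(self, r):
        # Step 3.3: WoLS -- learn fast when the return beats the running average Q.
        alpha = self.alpha_ub if r > self.q else self.alpha_us
        self.u += alpha * (r - self.q) * (self.x - self.u)           # assumed gradient-style step
        # Step 3.4: shrink the exploration rate toward its lower bound (assumed simple decay).
        self.sigma = max(self.sigma_L, self.sigma - alpha * (self.sigma - self.sigma_L))
        # Step 3.5: update Q with the same step size as u (synchronized, as in the text).
        self.q += alpha * (r - self.q)
        # Step 3.6: update the cumulative average of the sampled actions (the output of step 4).
        self.t += 1
        self.x_bar += (self.x - self.x_bar) / self.t
        return self.x_bar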
Because dynamic trajectories in high-dimensional spaces may exhibit chaotic behavior, it is difficult to analyze qualitatively the dynamics of the algorithm when many agents are present. Dynamic analyses of multi-agent algorithms in the literature are essentially all based on two agents [35, 37-39]; the analysis here therefore focuses on the case of two WoLS-CALA agents.

Theorem 2. Given a positive number σ_L and a sufficiently large positive number K, the strategies of two WoLS-CALA agents eventually converge to a Nash equilibrium.
Proof. According to the location of the equilibrium point, Nash equilibria fall into two classes: equilibrium points on the boundary of the continuous action space (a bounded closed set) and equilibrium points in its interior. Since an equilibrium point on the boundary is equivalent to an interior equilibrium point of a space one dimension lower, the discussion here focuses on the second class. The dynamic behavior of an ordinary differential equation is determined by the stability of its interior equilibrium points [40], so the equilibrium points of Equation (10) are computed first and their stability is then analyzed.

Let x_i^t be the action agent i samples at time t according to the normal distribution N(u_i^t, σ_i^t), and let V(x_i^t) and V(u_i^t) be the expected returns corresponding to x_i^t and u_i^t respectively (the expressions are rendered as images in the original filing). If a point eq is an equilibrium point of Equation (10), the corresponding derivatives vanish there. According to nonlinear dynamics theory [40], the stability of the point eq is determined by the eigenvalues of the matrix M given in Equation (11) (rendered as an image in the original filing), whose off-diagonal entries involve the mixed partial derivatives for i ≠ j. Furthermore, by the definition of a Nash equilibrium, a Nash equilibrium point satisfies the property in Equation (12). Substituting Equation (12) into M shows that the eigenvalues at a Nash equilibrium point fall into one of the following three cases:

(a) All eigenvalues of the matrix M have negative real parts. Such an equilibrium point is asymptotically stable, i.e., all trajectories near eq eventually converge to it.

(b) All eigenvalues of the matrix M have non-positive real parts and there is a pair of purely imaginary eigenvalues. Such an equilibrium point is stable, but the limit sets of nearby trajectories are periodic solutions and are uncountable. It is easy to show, however, that the cumulative averages of the sampled actions still converge to this Nash equilibrium (the relevant expressions are rendered as images in the original filing). Since WoLS-CALA outputs the cumulative average, the algorithm can also handle this type of equilibrium point.

(c) The matrix M has an eigenvalue with positive real part, i.e., the equilibrium point is unstable. For such points, by nonlinear dynamics theory the trajectories around the unstable equilibrium point fall into two types: trajectories on the stable manifold and all other trajectories [40]. The stable manifold is the subspace spanned by the eigenvectors corresponding to the stable eigenvalues; trajectories inside the stable manifold converge, in theory, to this equilibrium point, but because of randomness and numerical error the probability that the algorithm remains inside this subspace is 0. All trajectories not on the stable manifold gradually move away from the equilibrium point and eventually converge to one of the other types of equilibrium points analyzed above, i.e., to an equilibrium point on the boundary or to an equilibrium point of type (a) or (b).

In addition, as in the single-agent case, if there are several equilibrium points then, by the analysis of Theorem 1, with a suitable exploration-exploitation trade-off (e.g., σ_L sufficiently large, a large initial value of σ and a small learning rate) the algorithm converges to a Nash equilibrium point (the global optimum of each agent when the other agents' strategies are fixed). In summary, this completes the proof that the algorithm converges to a Nash equilibrium.
The present invention further provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model contains two types of agents: Gossiper-type agents, which simulate the general public in a social network, and Media-type agents, which simulate the media or public figures in the social network whose aim is to attract the general public; the social network public opinion evolution model of the present invention is therefore also called the Gossiper-Media model. The Media-type agents use the Nash equilibrium strategy on the continuous action space to compute the opinion that maximizes their return, update their opinion and broadcast it in the social network. The present invention applies the WoLS-CALA algorithm to the study of opinion evolution in real social networks: by modeling the media in the network with WoLS-CALA, it examines what kind of influence competing media have on public opinion.

This is described in detail below.
1. The Gossiper-Media model

The present invention proposes a multi-agent reinforcement learning framework, the Gossiper-Media model, to study the evolution of group opinion. The model contains two types of agents, Gossiper-type agents and Media-type agents. Gossiper-type agents simulate the general public in a real network; their opinions (public opinion) are influenced both by the Media and by other Gossipers. Media-type agents simulate media or public figures in the social network whose goal is to attract the public; these agents actively choose their own opinions so as to maximize the number of their followers. Consider a network of N agents, where the number of Gossipers is |G| and the number of Media is |M| (N = G ∪ M). Gossipers and Media are assumed to be fully connected, i.e., each Gossiper can choose to interact with any Media with equal probability. Full connectivity is not required among the Gossipers, i.e., each Gossiper can interact only with its neighbors; the network among the Gossipers is determined by their social relations. In the simulation experiments below, two Gossiper networks are defined: a fully connected network and a small-world network. Denote the opinions of Gossiper i and Media j by x_i and y_j respectively. The interaction of the agents in the model follows Algorithm 2.

Algorithm 2: learning model of opinions in the Gossiper-Media network (the pseudocode is rendered as an image in the original filing).

First, the opinion of every Gossiper and Media is randomly initialized to a value in the action space [0,1] (step 1). Then, in each interaction, every agent adjusts its own opinion according to its strategy until the algorithm converges (no agent changes its opinion any more). Each Gossiper agent first chooses an interaction partner: with probability ξ it randomly selects a Gossiper among its neighbors, or with probability 1-ξ it randomly selects a Media (step 2.1). The Gossiper then updates its opinion according to Algorithm 3 and, based on the difference between its opinion and those of the Media, chooses to follow the Media closest to its own opinion. It is assumed that the Media agents can obtain the opinions of a random subset of the Gossipers by sampling; this subset, denoted G′, is broadcast to all Media (step 2.2). The Media then use the WoLS-CALA algorithm to play against one another, compute the opinions that maximize their followers, and broadcast the updated opinions to the whole network (step 2.3). In principle each Media could also sample independently, obtaining different sets G′; this has little effect on the subsequent learning of WoLS-CALA, since in theory the opinion distribution of G′ is the same as that of G. These environmental assumptions are made mainly for simplicity and also reduce possible uncertainty caused by random sampling.
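The interaction loop of Algorithm 2 can be sketched as follows (the pseudocode itself is an image in the filing). For brevity the Gossiper network is treated as fully mixed and the Media learning of step 2.3 is a simple placeholder; the full Media strategy based on WoLS-CALA is discussed with Algorithm 4 below. Parameter names such as xi, d_g and alpha_g mirror the symbols used in the text.

import random

def simulate(num_gossipers=200, media_opinions=None, xi=0.5,
             d_g=0.1, d_m=0.1, alpha_g=0.5, alpha_m=0.5, rounds=500, seed=0):
    # One run of the Gossiper-Media interaction loop (Algorithm 2); fully mixed
    # Gossiper network for brevity, Media learning step left as a placeholder.
    rng = random.Random(seed)
    x = [rng.random() for _ in range(num_gossipers)]          # step 1: Gossiper opinions in [0,1]
    y = list(media_opinions) if media_opinions else [0.5]     # Media opinions (0.5 as in the experiments)
    for _ in range(rounds):
        for i in range(num_gossipers):
            if rng.random() < xi or not y:                     # step 2.1: pick a Gossiper neighbor...
                j = rng.randrange(num_gossipers)
                if abs(x[j] - x[i]) < d_g:                     # BCM: only nearby opinions influence i
                    x[i] += alpha_g * (x[j] - x[i])
            else:                                              # ...or pick a Media
                k = rng.randrange(len(y))
                if abs(y[k] - x[i]) < d_m:
                    x[i] += alpha_m * (y[k] - x[i])
        # step 2.2: sample 80% of the Gossipers and broadcast their opinions to the Media.
        g_prime = rng.sample(x, int(0.8 * num_gossipers))
        # step 2.3 (placeholder): nudge each Media toward the sampled mean; in the
        # invention this step is the WoLS-CALA best response of Algorithm 4.
        y = [yj + 0.01 * (sum(g_prime) / len(g_prime) - yj) for yj in y]
    return x, y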
1.1 The Gossiper strategy

Each Gossiper's strategy consists of two parts: 1) how to update its opinion; 2) how to choose which Media to follow. The details are as follows (Algorithm 3).

Algorithm 3: strategy of Gossiper i in round τ (the pseudocode is rendered as an image in the original filing).

For Gossiper i, its opinion is first initialized as x_i^τ = x_i^{τ-1} (step 1). It then updates its opinion according to the BCM (bounded confidence model) strategy [12, 33] (step 2). BCM is a common model for describing group opinions; a BCM-based agent's opinion is influenced only by agents whose opinions are close to its own. In Algorithm 3, the Gossiper updates its opinion only when the difference from the opinion of the selected agent is smaller than the threshold d_g (or d_m), where d_g and d_m correspond to the selected agent being a Gossiper or a Media respectively. The magnitude of the threshold d_g (or d_m) represents the degree to which the Gossiper accepts new opinions: intuitively, the larger d, the more easily the Gossiper is influenced by other agents [41-43]. The Gossiper then compares the difference between its own opinion and those of the Media and chooses one Media to follow according to a probability (step 3). The probability P_ij^τ that Gossiper i chooses to follow Media j at time τ satisfies the following properties:

(i) when |x_i − y_j| > d_m, P_ij = 0;
(ii) P_ij > 0 if and only if the opinion y_j of Media j satisfies |x_i − y_j| ≤ d_m;
(iii) P_ij decreases as the distance |x_i − y_j| between the opinions x_i and y_j increases.

Note that if |x_i − y_j| > d_m for every Media j, then P_ij = 0 for all j, which means it is possible for a Gossiper to follow no Media at all. The parameter δ in the expression for λ_ij is a small positive number used to prevent the denominator of the fraction from being 0.
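The choice of which Media to follow (step 3 of Algorithm 3) can be sketched as below. The exact expression for P_ij is rendered as an image in the filing, so the weight 1/(|x_i − y_j| + δ) is an assumption chosen only to satisfy properties (i)-(iii) above and the stated role of δ.

import random

def follow_probabilities(x_i, media_opinions, d_m=0.1, delta=1e-6):
    # Assumed weight consistent with properties (i)-(iii): zero outside the bound d_m,
    # positive inside it, and decreasing with the opinion distance.
    weights = [1.0 / (abs(x_i - y_j) + delta) if abs(x_i - y_j) <= d_m else 0.0
               for y_j in media_opinions]
    total = sum(weights)
    if total == 0.0:                       # no Media within the confidence bound:
        return None                        # the Gossiper follows no Media at all
    return [w / total for w in weights]

def choose_media(x_i, media_opinions, rng=random, d_m=0.1):
    p = follow_probabilities(x_i, media_opinions, d_m)
    if p is None:
        return None
    return rng.choices(range(len(media_opinions)), weights=p)[0]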
1.2 The Media strategy

Given a set of sampled Gossiper opinions, each Media can learn to adjust its own opinion appropriately so as to cater to the Gossipers' preferences and thereby attract more Gossipers to follow it. In a multi-agent system with several Media, a Nash equilibrium is the stable state finally reached by the mutually competing agents; in this state no agent can obtain a higher return by unilaterally changing its own strategy. Since the Media's action space is continuous (an opinion is defined as any point in the interval [0,1]), the behavior of the Media is modeled with the WoLS-CALA algorithm; Algorithm 4 is the Media strategy built on WoLS-CALA.

Algorithm 4: strategy of Media j in round τ (the pseudocode is rendered as an image in the original filing).

The current return r_j of Media j is defined as the proportion of Gossipers in G′ that choose to follow j among the total number of Gossipers in G′ (the formula is rendered as an image in the original filing). Here λ_ij is defined as in Algorithm 3, and P_ij denotes the probability that Gossiper i follows Media j.
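The return of Media j can be sketched as the expected share of the sampled Gossipers G′ that follow j; since the formula and P_ij are images in the filing, the same assumed weight as above is used for the follow probability.

def media_return(j, g_prime, media_opinions, d_m=0.1, delta=1e-6):
    # r_j: expected proportion of G' attracted by Media j; the weight below is the same
    # assumption as in the follow-probability sketch (the filing's formula is an image).
    total_share = 0.0
    for x_i in g_prime:
        w = [1.0 / (abs(x_i - y) + delta) if abs(x_i - y) <= d_m else 0.0
             for y in media_opinions]
        s = sum(w)
        if s > 0.0:
            total_share += w[j] / s        # P_ij: probability that Gossiper i follows j
    return total_share / len(g_prime)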
2. Dynamic analysis of group opinion

Let {y_j}_{j∈M}, y_j ∈ (0,1), be the opinions of the Media. Assuming the Gossiper network is infinitely large, the distribution of Gossiper opinions can be represented by a continuous density function; let p(x, t) denote the probability density function of the Gossiper population's opinion distribution at time t. The evolution of Gossiper public opinion can then be expressed as the partial derivative of p(x, t) with respect to time. First consider the case of a single Media.

Theorem 3. In a Gossiper-Media network containing a single Media, the evolution of the Gossiper opinion distribution obeys Equation (14), with its two components given by Equations (15) and (16) (all rendered as images in the original filing), where I_1 = {x : |x − y| < (1 − α_m)d_m} and I_2 = {x : d_m ≥ |x − y| ≥ (1 − α_m)d_m}.
Proof. Based on MF approximation theory [40] (Mean Field approximations), the partial derivative with respect to t of the probability distribution p(x, t) of BCM-based Gossiper opinions can be written as [12]

∂p(x, t)/∂t = ∫ [ W_{x+y→x} p(x+y) − W_{x→x+y} p(x) ] dy,   (17)

where W_{x+y→x} denotes the probability that a Gossiper whose opinion equals x+y changes its opinion to x, and W_{x+y→x} p(x+y) dy denotes the fraction of agents whose opinion moves from the interval (x+y, x+y+dy) to x within the time interval (t, t+dt); similarly, W_{x→x+y} denotes the probability that an agent with opinion x changes its opinion to x+y, and W_{x→x+y} p(x) dy denotes the fraction of Gossipers with opinion equal to x that move into the interval (x+y, x+y+dy).

By the definition of Algorithm 3, a Gossiper agent is influenced by the opinions of other Gossipers with probability ξ, or by the opinions of the Media with probability 1−ξ, and then makes its own decision. Splitting W_{x+y→x} and W_{x→x+y} into the parts influenced by other Gossipers' opinions and by the Media opinions, written w^[g] and w^[m] respectively, W_{x→x+y} and W_{x+y→x} can be expressed as Equation (18) (rendered as an image in the original filing). Substituting Equation (18) into Equation (17) gives Equation (19). Define Ψ_g(x, t) as the rate of change of the opinion density p(x, t) due to the influence of other Gossipers. Weisbuch G et al. [45] have proved that Ψ_g(x, t) obeys Equation (20) (rendered as an image in the original filing), in which ∂²p/∂x² is the second partial derivative of p with respect to x, α_g is a real number between 0 and 0.5, and d_g is the Gossiper threshold.
The term Ψ_m(x, t) represents the rate of change of the opinion density p(x, t) due to the influence of the Media. Suppose the opinion of Media j is u_j (u_j = x + d_j); then the opinion distribution of the Media can be represented with the Dirac delta function q(x) = δ(x − u_j). The Dirac delta function δ(x) [46] is often used to model a tall, narrow spike (an impulse) and other similar abstract concepts such as a point charge, a point mass or an electron; its definition is rendered as an image in the original filing. The transfer rate from x+y to x, w^[m]_{x+y→x}, can then be expressed as Equation (21), in which the term δ(x − [(x+y) + α_m((x+z) − (x+y))]) indicates the event that the opinion x+y moves to x under the influence of the opinion x+z, and q(x+z) is the density of the Media distribution at the opinion x+z. Similarly, w^[m]_{x→x+y} can be expressed as Equation (22). Combining Equations (21) and (22) and simplifying gives Equation (23), where I_1 = {x : |x − y| < (1 − α_m)d_m} and I_2 = {x : d_m ≥ |x − y| ≥ (1 − α_m)d_m}. Together with Equation (20), this completes the proof.
Equation (14) shows that the rate of change of p(x, t) is a weighted average of Ψ_g(x, t) and Ψ_m(x, t): the former represents the part of the opinion change influenced by the Gossiper network, the latter the part influenced by the Media network. The pure-Gossiper term Ψ_g(x, t) has already been studied and analyzed by Weisbuch G et al. [45]; an important property they derive is that, starting from any distribution, the locally optimal points of the density are gradually reinforced, which indicates that in a pure Gossiper network public opinion gradually tends toward consensus. Moreover, Theorem 3 shows that both Ψ_g(x, t) and Ψ_m(x, t) are independent of the specific Gossiper network, which indicates that when the network is infinitely large the development of public opinion is not affected by the network structure.
Next consider the second part of Equation (14), Ψ_m(x, t) (Equation (23)). Assuming y is constant, analysis of (23) gives Equation (24) (rendered as an image in the original filing). Intuitively, Equation (24) shows that the opinions of Gossipers whose views are similar to the Media's opinion all converge to that Media, which yields the following conclusion.

Corollary 1. The existence of a single Media accelerates the convergence of Gossiper public opinion toward agreement.
Now consider the case of several Media. Define P_j(x) as the probability that a Gossiper with opinion x is influenced by Media j (the expressions are rendered as images in the original filing). Then, in an environment with several competing Media, the dynamic change of a Gossiper's opinion can be expressed as a weighted average of the influences of the individual Media, which gives the following conclusion.

Corollary 2. The dynamics of the distribution function of Gossiper opinions obey Equation (25) (rendered as an image in the original filing), where Ψ_g(x, t) and Ψ_m(x, t) are defined by Equations (20) and (23) respectively.
3. Simulation experiments and analysis

It is first verified that the WoLS-CALA algorithm can learn a Nash equilibrium. A simulation of the Gossiper-Media model is then presented to verify the preceding theoretical analysis.

3.1 Performance test of the WoLS-CALA algorithm
This example considers a simplified version of the Gossiper-Media model to test whether WoLS-CALA can learn a Nash equilibrium strategy. Specifically, the problem of two Media competing for followers is modeled as the following optimization problem:

max (f_1(x, y), f_2(x, y))
s.t. x, y ∈ [0,1]   (26)

(s.t. denotes the constraints, the standard notation for optimization problems), where f_1 and f_2 are given by the expressions rendered as images in the original filing, r ∈ [0,1], and a, b ∈ [0,1] with |a − b| ≥ 0.2 are the Gossiper opinions.

The functions f_1(x, y) and f_2(x, y) simulate r in Algorithm 4 and represent the returns of Media 1 and 2 when the joint action is <x, y>. Two WoLS-CALA agents are used, learning independently to control x and y respectively so as to maximize their own return functions f_1(x, y) and f_2(x, y). In this model, the Gossiper opinion configurations can be divided into two classes according to the form of the Nash equilibrium:

(i) when r > 2/3 the equilibrium point is (a, a), and when r < 1/3 the equilibrium point is (b, b);
(ii) when 1/3 ≤ r ≤ 2/3 the equilibrium is any point of the set |x − a| < 0.1 ∧ |y − b| < 0.1 or |x − b| < 0.1 ∧ |y − a| < 0.1.

In the simulation, one point is taken from each of these two classes, namely r = 0.7 > 2/3 and r = 0.6 < 2/3, and it is observed whether the algorithm learns the Nash equilibrium as expected for different Gossiper opinion distributions. Table 1 lists the parameter settings of WoLS-CALA.
Table 1: parameter settings (the table is rendered as an image in the original filing).
Figures 1 and 2 show the simulation results of the two experiments. It is evident that in both experiments the Media agents converged to the Nash equilibrium after roughly 3000 learning steps, i.e., for r = 0.7 they converged to <0.4, 0.4> and for r = 0.6 to <0.4, 0.57>. As shown in Figure 1, when r = 0.7 > 2/3, a = 0.4, b = 0.6, the two agents converge to the Nash equilibrium point (0.4, 0.4); as shown in Figure 2, when r = 0.6 < 2/3, a = 0.4, b = 0.6, agent 1 converges to x = 0.4 and agent 2 converges to y = 0.57.
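For illustration, the self-play test above could be wired up roughly as follows, reusing the WoLSCALAAgent class sketched after Algorithm 1. The payoff below is a hypothetical Hotelling-style stand-in for the filing's f_1 and f_2 (which are rendered as images): two opinion clusters at a and b with weights r and 1 − r, each following the nearer Media.

# Hypothetical stand-in for f1/f2 (the filing's payoff expressions are images):
# Gossiper opinion mass r at a and 1-r at b, each cluster following the closer Media.
a, b, r = 0.4, 0.6, 0.7

def share(x, y):
    s = 0.0
    for g, w in ((a, r), (b, 1.0 - r)):
        if abs(x - g) < abs(y - g):
            s += w
        elif abs(x - g) == abs(y - g):
            s += w / 2.0
    return s

agent_1 = WoLSCALAAgent(seed=1)      # controls x (class sketched after Algorithm 1)
agent_2 = WoLSCALAAgent(seed=2)      # controls y
for _ in range(30000):
    x, y = agent_1.act(), agent_2.act()
    agent_1.update(share(x, y))      # Media 1's follower share
    agent_2.update(share(y, x))      # Media 2's follower share
print(agent_1.x_bar, agent_2.x_bar)  # cumulative average strategies (the outputs)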
3.2 Experimental simulation of the Gossiper-Media model

This subsection presents the simulation results of the Gossiper-Media model. Experimental environments with 200 Gossipers and different numbers of Media are considered: (i) no Media; (ii) only one Media; (iii) two competing Media. For each environment, two representative Gossiper networks are considered, a fully connected network and a small-world network [47]. Through these comparative experiments, the influence of Media on the evolution of Gossiper opinion is examined.

For fairness, the same parameter settings are used in every experimental environment: the same networks and the same initial opinions of the Gossipers and Media are used in the three environments. The small-world network is generated at random with the Watts-Strogatz construction [47] with connectivity p = 0.2. The initial opinion of each Gossiper is sampled uniformly at random from the interval [0,1]; the initial opinion of the Media is 0.5. Since too large a threshold would interfere with the observations, the Gossiper-Media threshold d_m and the Gossiper-Gossiper threshold d_g are both set to the small positive number 0.1. The Gossiper learning rates α_g and α_m are set to 0.5. The set G′ is sampled at random from G and satisfies |G′| = 80% |G|.
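The network and parameter setup just described can be sketched with networkx as below; the Watts-Strogatz generator also needs the number of nearest neighbours k, which the filing does not specify, so k = 4 is an assumption, while p = 0.2 and the remaining values follow the text.

import random
import networkx as nx

NUM_GOSSIPERS, P_REWIRE, K_NEIGHBOURS = 200, 0.2, 4   # k=4 is an assumed value

rng = random.Random(0)
small_world = nx.watts_strogatz_graph(NUM_GOSSIPERS, K_NEIGHBOURS, P_REWIRE, seed=0)
fully_connected = nx.complete_graph(NUM_GOSSIPERS)

gossiper_opinions = {i: rng.random() for i in small_world.nodes}   # uniform on [0,1]
media_opinions = [0.5]                                             # one Media starting at 0.5
d_g = d_m = 0.1                                                    # confidence thresholds
alpha_g = alpha_m = 0.5                                            # Gossiper learning rates
g_prime_size = int(0.8 * NUM_GOSSIPERS)                            # |G'| = 80% |G|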
Because each environment has two Gossiper network modes, a fully connected network and a small-world network, Figures 3-4 show the evolution of public opinion in the fully connected network and in the small-world network without Media; Figures 5-6 show the evolution with one Media; and Figures 7-8 show the evolution with two competing Media. From these figures it can first be seen that, in all three Media environments, the number of points to which the different Gossiper networks finally converge is the same: five with zero Media, four with one Media, and three with two Media. This phenomenon agrees with the conclusions of Theorem 3 and Corollary 2: the opinion dynamics of the Gossipers are independent of the topology of the Gossiper network, because the dynamics under different networks can be modeled with the same formula.

Second, Figures 3-6 show that when one Media is present the number of points to which the Gossiper opinions finally converge drops from 5 to 4 in both networks, indicating that the presence of a Media accelerates the emergence of consensus among the Gossipers, in agreement with Corollary 1. Likewise, Figures 5-8 show that when the number of Media increases from 1 to 2 the number of final convergence points further drops from 4 to 3 in both networks, indicating that competing Media further accelerate the convergence of Gossiper opinion.

The experimental results also confirm the performance of the WoLS-CALA algorithm. In Figures 5 and 6 the opinion of the Media agent stays around the opinion held by the largest number of Gossipers (N_max = 69 in the fully connected network and N_max = 68 in the small-world network). This matches the design expectation that a WoLS-CALA agent can learn the global optimum. In Figures 7 and 8, when two Media are present, the opinion of one Media stays around the opinion held by the largest number of Gossipers (N_max = 89 in both networks) while the other stays around the opinion held by the second-largest number (N′_max = 70 in the fully connected network and N′_max = 66 in the small-world network). This also matches the expectation of Theorem 2 that two WoLS-CALA agents eventually converge to a Nash equilibrium. In Figures 3-8 the Media opinions keep oscillating slightly around the Gossiper opinions because, in the Gossiper-Media model, the optimal strategy of a Media is not unique (every point within a distance d_m of the Gossiper opinion is optimal for the Media).
4. Conclusion

The present invention proposes WoLS-CALA, an independently learning multi-agent reinforcement learning algorithm for continuous action spaces, and shows both theoretically and experimentally that the algorithm can learn a Nash equilibrium. The algorithm is then applied to the study of opinion evolution in a network environment. The individuals in the social network are divided into two classes, Gossipers and Media, modeled separately: the Gossiper class represents the general public, while the Media, modeled with the WoLS-CALA algorithm, represent social media and other individuals whose purpose is to attract public attention. By modeling the two types of agents separately, the present invention examines the influence of competition among different numbers of Media on Gossiper public opinion. Theory and experiments show that competition among Media can accelerate the convergence of public opinion.

The specific embodiments described above are preferred embodiments of the present invention and do not limit the specific scope of implementation of the present invention; the scope of the present invention includes but is not limited to these specific embodiments, and all equivalent modifications made in accordance with the present invention fall within the protection scope of the present invention.

The references corresponding to the reference numbers used in the present invention are as follows:
[1]Pazis J,Lagoudakis M G.Binary Action Search for Learning Continuous-action Control Policies[C].In Proceedings of the 26th Annual International Conference on Machine Learning,New York,NY,USA,2009:793–800.
[2]Pazis J,Lagoudakis M G.Reinforcement learning in multidimensional continuous action spaces[C].In IEEE Symposiumon Adaptive Dynamic Programming&Reinforcement Learning,2011:97–104.
[3]Sutton R S,Maei H R,Precup D,et al.Fast Gradient-descent Methods for Temporal-difference Learning with Linear Function Approximation[C].In Proceedings of the 26th Annual International Conference on Machine Learning,2009:993–1000.
[4]Pazis J,Parr R.Generalized Value Functions for Large Action Sets[C].In International Conference on Machine Learning,ICML 2011,Bellevue,Washington,USA,2011:1185–1192.
[5]Lillicrap T P,Hunt J J,Pritzel A,et al.Continuous control with deep reinforcement learning[J].Computer Science,2015,8(6):A187.
[6]KONDA V R.Actor-critic algorithms[J].SIAM Journal on Control and Optimization,2003,42(4).
[7]Thathachar M A L,Sastry P S.Networks of Learning Automata:Techniques for Online Stochastic Optimization[J].Kluwer Academic Publishers,2004.
[8]Peters J,Schaal S.2008Special Issue:Reinforcement Learning of Motor Skills with Policy Gradients[J].Neural Netw.,2008,21(4).
[9]van Hasselt H.Reinforcement Learning in Continuous State and Action Spaces[M]. In Reinforcement Learning:State-of-the-Art.Berlin,Heidelberg:Springer Berlin Heidelberg,2012:207–251.
[10]Sallans B,Hinton G E.Reinforcement Learning with Factored States and Actions[J].J.Mach.Learn.Res.,2004,5:1063–1088.
[11]Lazaric A,Restelli M,Bonarini A.Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods[C].In Conference on Neural Information Processing Systems,Vancouver,British Columbia,Canada,2007:833–840.
[12]Quattrociocchi W,Caldarelli G,Scala A.Opinion dynamics on interacting networks:media competition and social influence[J].Scientific Reports,2014,4(21):4938–4938.
[13]Yang H X,Huang L.Opinion percolation in structured population[J].Computer Physics Communications,2015,192(2):124–129.
[14]Chao Y,Tan G,Lv H,et al.Modelling Adaptive Learning Behaviours for Consensus Formation in Human Societies[J].Scientific Reports,2016,6:27626.
[15]De Vylder B.The evolution of conventions in multi-agent systems[J].Unpublished doctoral dissertation,Vrije Universiteit Brussel,Brussels,2007.
[16]Holley R A,Liggett T M.Ergodic Theorems for Weakly Interacting Infinite Systems and the Voter Model[J].Annals of Probability,1975,3(4):643–663.
[17]Nowak A,Szamrej J,Latané B.From private attitude to public opinion:A dynamic theory of social impact.[J].Psychological Review,1990,97(3):362–376.
[18]Tsang A,Larson K.Opinion dynamics of skeptical agents[C].In Proceedings of the 2014international conference on Autonomous agents and multi-agent systems,2014:277–284.
[19]Ghaderi J,Srikant R.Opinion dynamics in social networks with stubborn agents:Equilibrium and convergence rate[J].Automatica,2014,50(12):3209–3215.
[20]Kimura M,Saito K,Ohara K,et al.Learning to Predict Opinion Share in Social Networks.[C].In Twenty-Fourth AAAI Conference on Artificial Intelligence,AAAI 2010,Atlanta,Georgia,Usa,July,2010.
[21]Liakos P,Papakonstantinopoulou K.On the Impact of Social Cost in Opinion Dynamics[C].In Tenth International AAAI Conference on Web and Social Media ICWSM,2016.
[22]Bond R M,Fariss C J,Jones J J,et al.A 61-million-person experiment in social influence and political mobilization[J].Nature,2012,489(7415):295–8.
[23]Szolnoki A,Perc M.Information sharing promotes prosocial behaviour[J].New Journal of Physics,2013,15(15):1–5.
[24]Hofbauer J,Sigmund K.Evolutionary games and population dynamics[M].Cambridge;New York,NY:Cambridge University Press,1998.
[25]Tuyls K,Nowe A,Lenaerts T,et al.An Evolutionary Game Theoretic Perspective on Learning in Multi-Agent Systems[J].Synthese,2004,139(2):297–330.
[26]Szabo B G.Fath G(2007)Evolutionary games on graphs[C].In Physics Reports,2010.
[27]Han T A,Santos F C.The role of intention recognition in the evolution of cooperative behavior[C].In International Joint Conference on Artificial Intelligence,2011:1684–1689.
[28]Santos F P,Santos F C,Pacheco J M.Social Norms of Cooperation in Small-Scale  Societies[J].PLoS computational biology,2016,12(1):e1004709.
[29]Zhao Y,Zhang L,Tang M,et al.Bounded confidence opinion dynamics with opinion leaders and environmental noises[J].Computers and Operations Research,2016,74(C):205–213.
[30]Pujol J M,Delgado J,Sang,et al.The role of clustering on the emergence of efficient social conventions[C].In International Joint Conference on Artificial Intelligence,2005:965–970.
[31]Nori N,Bollegala D,Ishizuka M.Interest Prediction on Multinomial,Time-Evolving Social Graph.[C].In IJCAI 2011,Proceedings of the International Joint Conference on Artificial Intelligence,Barcelona,Catalonia,Spain,July,2011:2507–2512.
[32]Fang H.Trust modeling for opinion evaluation by coping with subjectivity and dishonesty[C].In International Joint Conference on Artificial Intelligence,2013:3211–3212.
[33]Deffuant G,Neau D,Amblard F,et al.Mixing beliefs among interacting agents[J].Advances in Complex Systems,2011,3(1n04):87–98.
[34]De Jong S,Tuyls K,Verbeeck K.Artificial agents learning human fairness[C].In International Joint Conference on Autonomous Agents and Multiagent Systems,2008:863–870.
[35]Bowling M,Veloso M.Multiagent learning using a variable learning rate[J].Artificial Intelligence,2002,136(2):215–250.
[36]Sutton R S,Barto A G.Reinforcement learning:an introduction [M].Cambridge,Mass:MIT Press,1998.
[37]Abdallah S,Lesser V.A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics[J].J.Artif.Int.Res.,2008,33(1):521–549.
[38]Singh S P,Kearns M J,Mansour Y.Nash Convergence of Gradient Dynamics in General-Sum Games[J],2000:541–548.
[39]Zhang C,Lesser V R.Multi-agent learning with policy prediction [J],2010:927–934.
[40]Shilnikov L P,Shilnikov A L,Turaev D,et al.Methods of qualitative theory in nonlinear dynamics/[M].World Scientific,1998.
[41]Dittmer J C.Consensus formation under bounded confidence[J].Nonlinear Analysis Theory Methods and Applications,2001,47(7):4615–4621.
[42]LORENZ J.CONTINUOUS OPINION DYNAMICS UNDER BOUNDED CONFIDENCE:A SURVEY[J].International Journal of Modern Physics C,2007,18(12):2007.
[43]Krawczyk M J,Malarz K,Korff R,et al.Communication and trust in the bounded confidence model[J].Computational Collective Intelligence.Technologies and Applications,2010,6421:90–99.
[44]Lasry J M,Lions P L.Mean field games[J].Japanese Journal of Mathematics,2007,2(1):229–260.
[45]Weisbuch G,Deffuant G,Amblard F,et al.Interacting Agents and Continuous Opinions Dynamics[M].Springer Berlin Heidelberg,2003.
[46]Hassani S.Dirac Delta Function[M].Springer New York,2000.
[47]Watts D J,Strogatz S H.Collective dynamics of 'small-world' networks[J].Nature,1998:440–442.

Claims (10)

  1. A Nash equilibrium strategy on a continuous action space, characterized by comprising the following steps:
    (1) setting constants α_ub and α_us, where α_ub > α_us, and α_Q, α_σ ∈ (0,1) are learning rates;
    (2) initializing parameters, the parameters including the mean u_i of agent i's expected action u, the cumulative average strategy of agent i (rendered as an image in the original filing), the constant C, the variance σ_i and the cumulative average return Q_i;
    (3) repeating the following steps until the cumulative average strategy of agent i's sampled actions converges:
    (3.1) randomly selecting an action x_i according to the normal distribution N(u_i, σ_i) with a certain exploration rate;
    (3.2) executing the action x_i, then obtaining the return r_i from the environment;
    (3.3) if the return r_i received by agent i after executing the action x_i is greater than the current cumulative average return Q_i, the learning rate of u_i is α_ub, otherwise α_us; updating u_i with the selected learning rate;
    (3.4) updating the variance σ_i according to the learned u_i;
    (3.5) if the return r_i received by agent i after executing the action x_i is greater than the current cumulative average return Q_i, the learning rate is α_ub, otherwise α_us; updating Q_i with the selected learning rate;
    (3.6) updating the cumulative average strategy according to the constant C and the action x_i;
    (4) outputting the cumulative average strategy as agent i's final action.
  2. The Nash equilibrium strategy on a continuous action space according to claim 1, characterized in that: in steps (3.3) and (3.5), the update step of Q is synchronized with the update step of u, and in a neighborhood of u_i the mapping from u_i to Q_i can be linearized as Q_i = K·u_i + C, the slope K being given by the expression rendered as an image in the original filing.
  3. The Nash equilibrium strategy on a continuous action space according to claim 2, characterized in that: given a positive number σ_L and a positive number K, the Nash equilibrium strategies of two agents on the continuous action space eventually converge to a Nash equilibrium, where σ_L is the lower bound of the variance σ.
  4. A social network public opinion evolution model based on the Nash equilibrium strategy on a continuous action space according to any one of claims 1-3, characterized in that: the social network public opinion evolution model comprises two types of agents, namely Gossiper-type agents simulating the general public in a social network and Media-type agents simulating the media or public figures in the social network whose aim is to attract the general public, wherein the Media-type agents use the Nash equilibrium strategy on the continuous action space to compute the opinion that maximizes their return, update their opinion and broadcast it in the social network.
  5. The social network public opinion evolution model according to claim 4, characterized by comprising the following steps:
    S1: the opinion of every Gossiper and Media is randomly initialized to a value in the action space [0,1];
    S2: in each interaction, every agent adjusts its opinion according to the following strategies until no agent changes its opinion any more;
    S21: any Gossiper-type agent selects a neighbor at random in the Gossiper network with a set probability and updates its opinion and the Media it follows according to the BCM strategy;
    S22: a subset G′ ⊆ G of the Gossiper network G is sampled at random and the opinions of the Gossipers in G′ are broadcast to all Media;
    S23: every Media uses the Nash equilibrium strategy on the continuous action space to compute the opinion that maximizes its return and broadcasts the updated opinion to the entire social network.
  6. The social network public opinion evolution model according to claim 5, characterized in that in step S21 the Gossiper-type agent operates as follows:
    A1: opinion initialization: x_i^τ = x_i^{τ-1};
    A2: opinion update: when the difference between the agent's opinion and that of the selected agent is smaller than a set threshold, the agent's opinion is updated;
    A3: the agent compares the difference between its own opinion and those of the other Media and follows one Media selected according to a probability.
  7. The social network public opinion evolution model according to claim 6, characterized in that: in step A2, if the currently selected neighbor is Gossiper j and |x_j^τ − x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ − x_i^τ); if the currently selected neighbor is Media k, the analogous condition and update with threshold d_m and learning rate α_m apply (the exact expressions are rendered as images in the original filing); here d_g and d_m are the opinion thresholds set for the two types of neighbors, and α_g and α_m are the corresponding learning rates.
  8. The social network public opinion evolution model according to claim 7, characterized in that: in step A3, Media k is followed with a probability P_ik (the probability expression is rendered as an image in the original filing).
  9. The social network public opinion evolution model according to claim 8, characterized in that: in step S23, the current return r_j of Media j is defined as the proportion of Gossipers in G′ that choose to follow j among the total number of Gossipers in G′ (the formula is rendered as an image in the original filing), where P_ij denotes the probability that Gossiper i follows Media j.
  10. The social network public opinion evolution model according to any one of claims 4-9, characterized in that: the existence of a single Media accelerates the convergence of the Gossiper agents' opinions toward agreement, and in an environment with several competing Media the dynamic change of each Gossiper agent's opinion is a weighted average of the influences of the individual Media.
PCT/CN2018/098101 2018-08-01 2018-08-01 连续动作空间上的纳什均衡策略及社交网络舆论演变模型 WO2020024170A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/098101 WO2020024170A1 (zh) 2018-08-01 2018-08-01 连续动作空间上的纳什均衡策略及社交网络舆论演变模型
CN201880001570.9A CN109496305B (zh) 2018-08-01 2018-08-01 一种社交网络舆论演变方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/098101 WO2020024170A1 (zh) 2018-08-01 2018-08-01 连续动作空间上的纳什均衡策略及社交网络舆论演变模型

Publications (1)

Publication Number Publication Date
WO2020024170A1 true WO2020024170A1 (zh) 2020-02-06

Family

ID=65713809

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/098101 WO2020024170A1 (zh) 2018-08-01 2018-08-01 连续动作空间上的纳什均衡策略及社交网络舆论演变模型

Country Status (2)

Country Link
CN (1) CN109496305B (zh)
WO (1) WO2020024170A1 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801299A (zh) * 2021-01-26 2021-05-14 西安电子科技大学 奖惩机制演化博弈模型构建方法、系统及应用
CN113568954A (zh) * 2021-08-02 2021-10-29 湖北工业大学 网络流量预测数据预处理阶段的参数最优化方法及系统
CN113572548A (zh) * 2021-06-18 2021-10-29 南京理工大学 一种基于多智能体强化学习的无人机网络协同快跳频方法
CN113645589A (zh) * 2021-07-09 2021-11-12 北京邮电大学 一种基于反事实策略梯度的无人机集群路由计算方法
CN113687657A (zh) * 2021-08-26 2021-11-23 鲁东大学 用于多智能体编队动态路径规划的方法和存储介质
CN113778619A (zh) * 2021-08-12 2021-12-10 鹏城实验室 多集群博弈的多智能体状态控制方法、装置及终端
CN114021456A (zh) * 2021-11-05 2022-02-08 沈阳飞机设计研究所扬州协同创新研究院有限公司 一种基于强化学习的智能体无效行为切换抑制方法
CN114845359A (zh) * 2022-03-14 2022-08-02 中国人民解放军军事科学院战争研究院 一种基于Nash Q-Learning的多智能异构网络选择方法
CN115515101A (zh) * 2022-09-23 2022-12-23 西北工业大学 一种用于scma-v2x系统的解耦q学习智能码本选择方法

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362754B (zh) * 2019-06-11 2022-04-29 浙江大学 基于强化学习的线上社交网络信息源头检测的方法
CN111445291B (zh) * 2020-04-01 2022-05-13 电子科技大学 一种为社交网络影响力最大化问题提供动态决策的方法
CN112862175B (zh) * 2021-02-01 2023-04-07 天津天大求实电力新技术股份有限公司 基于p2p电力交易的本地优化控制方法及装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106936855A (zh) * 2017-05-12 2017-07-07 中国人民解放军信息工程大学 基于攻防微分博弈的网络安全防御决策确定方法及其装置
CN107135224A (zh) * 2017-05-12 2017-09-05 中国人民解放军信息工程大学 基于Markov演化博弈的网络防御策略选取方法及其装置
CN108092307A (zh) * 2017-12-15 2018-05-29 三峡大学 基于虚拟狼群策略的分层分布式智能发电控制方法

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8930989B2 (en) * 2007-08-20 2015-01-06 AdsVantage System and method for providing supervised learning to associate profiles in video audiences
CN103490413B (zh) * 2013-09-27 2015-09-02 华南理工大学 一种基于智能体均衡算法的智能发电控制方法
CN106358308A (zh) * 2015-07-14 2017-01-25 北京化工大学 一种超密集网络中的强化学习的资源分配方法
US20180033081A1 (en) * 2016-07-27 2018-02-01 Aristotle P.C. Karas Auction management system and method
CN106899026A (zh) * 2017-03-24 2017-06-27 三峡大学 基于具有时间隧道思想的多智能体强化学习的智能发电控制方法
CN107979540B (zh) * 2017-10-13 2019-12-24 北京邮电大学 一种sdn网络多控制器的负载均衡方法及系统
CN107832882A (zh) * 2017-11-03 2018-03-23 上海交通大学 一种基于马尔科夫决策过程的出租车寻客策略推荐方法
WO2020024172A1 (zh) * 2018-08-01 2020-02-06 东莞理工学院 多状态连续动作空间的合作式方法及系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106936855A (zh) * 2017-05-12 2017-07-07 中国人民解放军信息工程大学 基于攻防微分博弈的网络安全防御决策确定方法及其装置
CN107135224A (zh) * 2017-05-12 2017-09-05 中国人民解放军信息工程大学 基于Markov演化博弈的网络防御策略选取方法及其装置
CN108092307A (zh) * 2017-12-15 2018-05-29 三峡大学 基于虚拟狼群策略的分层分布式智能发电控制方法

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801299B (zh) * 2021-01-26 2023-12-01 西安电子科技大学 奖惩机制演化博弈模型构建方法、系统及应用
CN112801299A (zh) * 2021-01-26 2021-05-14 西安电子科技大学 奖惩机制演化博弈模型构建方法、系统及应用
CN113572548B (zh) * 2021-06-18 2023-07-07 南京理工大学 一种基于多智能体强化学习的无人机网络协同快跳频方法
CN113572548A (zh) * 2021-06-18 2021-10-29 南京理工大学 一种基于多智能体强化学习的无人机网络协同快跳频方法
CN113645589A (zh) * 2021-07-09 2021-11-12 北京邮电大学 一种基于反事实策略梯度的无人机集群路由计算方法
CN113645589B (zh) * 2021-07-09 2024-05-17 北京邮电大学 一种基于反事实策略梯度的无人机集群路由计算方法
CN113568954A (zh) * 2021-08-02 2021-10-29 湖北工业大学 网络流量预测数据预处理阶段的参数最优化方法及系统
CN113568954B (zh) * 2021-08-02 2024-03-19 湖北工业大学 网络流量预测数据预处理阶段的参数最优化方法及系统
CN113778619A (zh) * 2021-08-12 2021-12-10 鹏城实验室 多集群博弈的多智能体状态控制方法、装置及终端
CN113778619B (zh) * 2021-08-12 2024-05-14 鹏城实验室 多集群博弈的多智能体状态控制方法、装置及终端
CN113687657A (zh) * 2021-08-26 2021-11-23 鲁东大学 用于多智能体编队动态路径规划的方法和存储介质
CN113687657B (zh) * 2021-08-26 2023-07-14 鲁东大学 用于多智能体编队动态路径规划的方法和存储介质
CN114021456A (zh) * 2021-11-05 2022-02-08 沈阳飞机设计研究所扬州协同创新研究院有限公司 一种基于强化学习的智能体无效行为切换抑制方法
CN114845359A (zh) * 2022-03-14 2022-08-02 中国人民解放军军事科学院战争研究院 一种基于Nash Q-Learning的多智能异构网络选择方法
CN115515101A (zh) * 2022-09-23 2022-12-23 西北工业大学 一种用于scma-v2x系统的解耦q学习智能码本选择方法

Also Published As

Publication number Publication date
CN109496305A (zh) 2019-03-19
CN109496305B (zh) 2022-05-13

Similar Documents

Publication Publication Date Title
WO2020024170A1 (zh) 连续动作空间上的纳什均衡策略及社交网络舆论演变模型
Vecerik et al. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards
Shankar et al. Learning robot skills with temporal variational inference
Han et al. Intelligent decision-making for 3-dimensional dynamic obstacle avoidance of UAV based on deep reinforcement learning
Hussein et al. Deep reward shaping from demonstrations
CN113919485A (zh) 基于动态层级通信网络的多智能体强化学习方法及系统
Hafez et al. Topological Q-learning with internally guided exploration for mobile robot navigation
Wang et al. Online service migration in mobile edge with incomplete system information: A deep recurrent actor-critic learning approach
Bai et al. Variational dynamic for self-supervised exploration in deep reinforcement learning
Verstaevel et al. Lifelong machine learning with adaptive multi-agent systems
Lale et al. Kcrl: Krasovskii-constrained reinforcement learning with guaranteed stability in nonlinear dynamical systems
Wen et al. Federated Offline Reinforcement Learning With Multimodal Data
Mustafa Towards continuous control for mobile robot navigation: A reinforcement learning and slam based approach
Notsu et al. Online state space generation by a growing self-organizing map and differential learning for reinforcement learning
Brys Reinforcement Learning with Heuristic Information
Shi et al. A sample aggregation approach to experiences replay of Dyna-Q learning
Paassen et al. Gaussian process prediction for time series of structured data.
Li et al. Hyper-parameter tuning of federated learning based on particle swarm optimization
Khalil et al. Machine learning algorithms for multi-agent systems
Duan Meta learning for control
Alpcan Dual control with active learning using Gaussian process regression
Dobre et al. POMCP with human preferences in Settlers of Catan
Thodoroff et al. Recurrent value functions
Marochko et al. Pseudorehearsal in actor-critic agents with neural network function approximation
Qian Evolutionary population curriculum for scaling multi-agent reinforcement learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18928332

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18928332

Country of ref document: EP

Kind code of ref document: A1