CN109496305B - Social network public opinion evolution method - Google Patents


Info

Publication number
CN109496305B
CN109496305B · CN201880001570.9A
Authority
CN
China
Prior art keywords
media
agent
concept
gossip
public opinion
Prior art date
Legal status
Active
Application number
CN201880001570.9A
Other languages
Chinese (zh)
Other versions
CN109496305A (en)
Inventor
侯韩旭
郝建业
张程伟
Current Assignee
Dongguan University of Technology
Original Assignee
Dongguan University of Technology
Priority date
Filing date
Publication date
Application filed by Dongguan University of Technology
Publication of CN109496305A
Application granted
Publication of CN109496305B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Abstract

The invention provides a social network public opinion evolution method and belongs to the field of reinforcement learning methods. The method comprises two types of agents: Gossiper agents, which simulate ordinary people in a social network, and Media agents, which simulate media or public figures that aim to attract ordinary people in the social network. Each Media agent uses a Nash equilibrium strategy on the continuous action space to compute the concept with the best return, updates its concept, and broadcasts it in the social network. The beneficial effects of the invention are that each agent maximizes its own benefit in the process of interacting with the other agents and ultimately learns a Nash equilibrium.

Description

Social network public opinion evolution method
Technical Field
The invention relates to a Nash equilibrium strategy, in particular to a Nash equilibrium strategy on a continuous action space, and also relates to a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space.
Background
In the environment of a continuous action space, on the one hand, the agent has infinitely many actions to choose from, and traditional table-based Q-learning algorithms cannot store an infinite number of return estimates; on the other hand, in a multi-agent environment, the continuous action space further increases the difficulty of the problem.
In the field of multi-agent reinforcement learning, the action space of an agent can be a discrete finite set or a continuous set. Because the nature of reinforcement learning is to find the optimum through continuous trial and error, a continuous action space offers infinitely many action choices, and the multi-agent setting further increases the dimensionality of the action space, which makes it difficult for general reinforcement learning algorithms to learn the global optimum (or equilibrium).
At present, most algorithms solve continuous problems with function approximation techniques and can be divided into two types: value approximation algorithms [1-5] and policy approximation algorithms [6-9]. Value approximation algorithms explore the action space and estimate a corresponding value function from the returns, while policy approximation algorithms define the policy as a probability distribution function over the continuous action space and learn the policy directly. The performance of such algorithms depends on the accuracy of the estimate of the value function or policy, which often cannot be guaranteed when dealing with complex problems such as nonlinear control problems. In addition, there are sampling-based algorithms [10,11] that maintain a discrete action set, use traditional discrete algorithms to select the optimal action within that set, and then update the action set with a resampling mechanism so as to gradually learn the optimal action. Such algorithms can easily be combined with conventional discrete algorithms, but have the disadvantage of requiring a long convergence time. All of the above algorithms are designed to compute an optimal policy in a single-agent environment and cannot be applied directly to learning in a multi-agent environment.
In recent years much work has used agent-based simulation techniques to study public opinion evolution in social networks [12-14]. Given groups with different initial concept distributions, the question is whether the groups eventually reach consensus on their concepts during their interaction, split into two camps, or remain persistently disordered [15]. The key to this problem is understanding the dynamics of public opinion evolution and thereby the underlying reasons for consensus [15]. For the public opinion evolution problem in social networks, researchers have proposed various multi-agent learning models [16-20]; among them, [21-23] study the influence of factors such as different degrees of information sharing or exchange on public opinion evolution. The works [24-28] adopt evolutionary game theory models to study how the behaviour of agents (e.g., defection and cooperation) evolves through interactions with partners. These works model the behaviour of the agents and assume that all agents are identical. In practice, however, individuals may play different roles in society (e.g., leaders or followers), which cannot be modelled accurately by the above methods. To this end, Quattrociocchi et al. [12] divided a social population into two parts, media and ordinary people, where the concept of an ordinary person is influenced by the media it follows and by other ordinary people, and the concept of a media is influenced by the outstanding ones among the media. Subsequently, Zhao et al. [29] proposed a leader-follower consensus model to explore the formation of consensus. In both works, the adjustment strategy of an agent's concept is to imitate a leader or a successful peer. Related imitation-based work also includes Local Majority [30], Compliance [31], and Activating Neighbor [32]. In a real environment, however, the strategies people adopt when making decisions are far more complex than simple imitation: people usually decide their behaviour by constantly interacting with an unknown environment and combining the knowledge they possess. Furthermore, imitation-based strategies cannot guarantee that an algorithm learns the global optimum, because the quality of an agent's strategy depends on the leader's or the imitated strategy, which is not necessarily optimal.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a nash equilibrium strategy on a continuous action space and also provides a social network public opinion evolution model based on the nash equilibrium strategy on the continuous action space.
The invention comprises the following steps:
(1) setting constants α_ub and α_us, where α_ub, α_us, α_Q, α_σ ∈ (0,1) are learning rates;
(2) initializing the parameters, which comprise the mean u_i of the desired action of agent i, the cumulative average strategy x̄_i, a constant C, the variance σ_i, and the cumulative average reward Q_i;
(3) repeating the following steps until the cumulative average strategy x̄_i of the sampled actions of agent i converges:
(3.1) randomly selecting an action x_i from the normal distribution N(u_i, σ_i) at a certain exploration rate;
(3.2) performing the action x_i and obtaining the return r_i from the environment;
(3.3) if the return r_i received after agent i performs action x_i is greater than the current cumulative average reward Q_i, the learning rate of u_i is α_ub, otherwise it is α_us; updating u_i according to the selected learning rate;
(3.4) updating the variance σ_i according to the learning of u_i;
(3.5) if the return r_i received after agent i performs action x_i is greater than the current cumulative average reward Q_i, the learning rate is α_ub, otherwise it is α_us; updating Q_i according to the selected learning rate;
(3.6) updating the cumulative average strategy x̄_i according to the constant C and the action x_i;
(4) outputting the cumulative average strategy x̄_i as the final action of agent i.
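A minimal Python sketch of steps (1)-(4) is given below for a one-dimensional action space and a user-supplied reward function. The patent text fixes only the win-or-learn-slow choice between α_ub and α_us; the concrete update formulas for u_i, σ_i and the running average x̄_i used here are illustrative assumptions, as are all names and parameter values.

```python
import numpy as np

def wols_cala(reward, steps=5000, alpha_ub=0.05, alpha_us=0.01,
              alpha_sigma=0.001, sigma_L=0.02, K=1.0, seed=0):
    """Sketch of a single WoLS-CALA agent (steps (1)-(4) above).

    reward: callable mapping an action x in [0, 1] to a scalar return.
    The u/sigma update forms are illustrative; the text above only fixes
    the win-or-learn-slow (WoLS) choice of learning rate.
    """
    rng = np.random.default_rng(seed)
    u, sigma, Q = 0.5, 0.3, 0.0          # mean, standard deviation, cumulative average reward
    x_bar, count = 0.0, 0                # cumulative average strategy and its sample count
    for t in range(steps):
        x = rng.normal(u, max(sigma, sigma_L))          # (3.1) sample an action
        r = reward(x)                                    # (3.2) obtain the return
        alpha = alpha_ub if r > Q else alpha_us          # (3.3)/(3.5) WoLS learning rate
        u += alpha * (r - Q) * (x - u)                   # (3.3) move u toward better actions
        sigma += alpha_sigma * ((r - Q) * ((x - u) ** 2 / max(sigma, sigma_L) ** 2 - 1.0)
                                - K * (sigma - sigma_L))  # (3.4) variance pulled toward sigma_L
        Q += alpha * (r - Q)                             # (3.5) cumulative average reward
        count += 1
        x_bar += (x - x_bar) / count                     # (3.6) cumulative average strategy
    return x_bar                                         # (4) final action of the agent

# Usage: a single agent maximising a smooth reward peaked at 0.7.
print(wols_cala(lambda x: -(x - 0.7) ** 2))
```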
In a further refinement of the invention, in step (3.3) and step (3.5) the update step of Q and the update step of u are synchronized, and in a neighborhood of u_i, Q_i can be linearized with respect to u_i as Q_i = K·u_i + C, where the slope K is the ratio ΔQ_i/Δu_i of the synchronized increments of Q_i and u_i.
In a further refinement of the invention, given a positive number σ_L and a sufficiently large positive number K, the Nash equilibrium strategies of two agents on the continuous action space eventually converge to a Nash equilibrium, where σ_L is the lower bound of the variance σ.
The invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model comprises two types of agents: Gossiper agents, which simulate ordinary people in a social network, and Media agents, which simulate media or public figures that aim to attract ordinary people in the social network. Each Media agent uses the Nash equilibrium strategy on the continuous action space to compute the concept with the best return, updates its concept, and broadcasts it in the social network.
In a further refinement, the invention comprises the following steps:
S1: the concept of each Gossiper and each Media is initialized randomly to a value in the action space [0,1];
S2: in each interaction, each agent adjusts its concept according to the following strategy until no agent changes its concept any more;
S21: each Gossiper agent randomly selects a neighbor in the Gossiper network according to a set probability, and updates its concept and the Media it follows according to the BCM (bounded confidence model) strategy;
S22: a subset G' ⊆ G of the Gossiper network is randomly sampled, and the Gossiper concepts in the subset G' are broadcast to all Media;
S23: each Media computes the concept with the best return using the Nash equilibrium strategy on the continuous action space, and broadcasts the updated concept to the whole social network.
In a further refinement of the invention, in step S21 the operating method of the Gossiper agent is as follows:
A1: concept initialization: x_i^τ = x_i^(τ-1);
A2: concept update: the agent updates its concept when the difference between its own concept and the concept of the selected agent is less than a set threshold;
A3: the agent compares the differences between its own concept and the concepts of the Media and selects one Media to follow according to a probability.
In a further refinement of the invention, in step A2, if the currently selected neighbor is Gossiper j and |x_j^τ − x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ − x_i^τ); if the currently selected neighbor is Media k and |y_k^τ − x_i^τ| < d_m, then x_i^τ ← x_i^τ + α_m(y_k^τ − x_i^τ), where d_g and d_m are the concept thresholds set for the two types of neighbors and α_g and α_m are the corresponding learning rates.
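A short Python sketch of the Gossiper concept update in steps A1-A2 follows, assuming the BCM rules reconstructed above; the function name and data layout are illustrative only.

```python
def gossiper_update(x_i, neighbor_value, is_media, d_g=0.1, d_m=0.1,
                    alpha_g=0.5, alpha_m=0.5):
    """Bounded-confidence (BCM) concept update of Gossiper i (steps A1-A2).

    neighbor_value: concept x_j of the selected Gossiper j, or y_k of Media k.
    The concept moves toward the neighbor only if their difference is
    below the corresponding threshold d_g or d_m.
    """
    d, alpha = (d_m, alpha_m) if is_media else (d_g, alpha_g)
    if abs(neighbor_value - x_i) < d:
        x_i = x_i + alpha * (neighbor_value - x_i)
    return x_i

# Example: a Gossiper at 0.30 interacting with a Media whose concept is 0.36.
print(gossiper_update(0.30, 0.36, is_media=True))   # -> 0.33
```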
In a further refinement of the invention, in step A3 Gossiper i follows Media k with probability P_ik^τ = λ_ik / Σ_{j∈M} λ_ij, where λ_ij is a weight that is positive only when |x_i − y_j| ≤ d_m and decreases as the distance |x_i − y_j| increases, and where a small positive constant δ keeps the denominator of the weight away from zero.
In a further refinement of the invention, in step S23 the return r_j of Media j is defined as the ratio of the number of Gossipers in G' who choose to follow Media j to the total number of Gossipers in G', i.e. r_j = (1/|G'|)·Σ_{i∈G'} P_ij, where P_ij denotes the probability that Gossiper i follows Media j.
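The sketch below implements the follow probability P_ij and the Media return r_j in Python. The inverse-distance form of λ_ij is an assumption chosen to satisfy the stated properties (zero beyond d_m, decreasing in |x_i − y_j|, δ keeping the denominator positive); the exact formula in the patent's drawings may differ, and all names are illustrative.

```python
import numpy as np

def follow_probabilities(x_i, media_concepts, d_m=0.1, delta=1e-3):
    """Probability P_ij that Gossiper i follows each Media j (step A3).

    lambda_ij is an inverse-distance weight chosen to match the stated
    properties; the exact form in the patent's formula images may differ.
    """
    y = np.asarray(media_concepts, dtype=float)
    lam = np.where(np.abs(x_i - y) <= d_m, 1.0 / (np.abs(x_i - y) + delta), 0.0)
    total = lam.sum()
    return lam / total if total > 0 else lam     # all-zero row: follows no Media

def media_rewards(gossiper_sample, media_concepts, d_m=0.1):
    """Return r_j of Media j: expected share of Gossipers in G' that follow j."""
    P = np.array([follow_probabilities(x, media_concepts, d_m) for x in gossiper_sample])
    return P.mean(axis=0)

# Example: two Media at 0.4 and 0.6 competing for three sampled Gossipers.
print(media_rewards([0.35, 0.45, 0.62], [0.4, 0.6]))
```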
In a further refinement of the invention, the existence of one Media accelerates the convergence of the public opinion of the Gossiper agents toward a common value; in an environment with multiple competing Media, the dynamic change of each Gossiper agent's concept is a weighted average of the influences of the individual Media.
Compared with the prior art, the beneficial effects of the invention are that, in a continuous action space environment, each agent maximizes its own benefit in the process of interacting with the other agents and finally learns a Nash equilibrium.
Drawings
Fig. 1 is a schematic diagram of two agents converging to the Nash equilibrium point according to the present invention, where r = 0.7 > 2/3, a = 0.4, and b = 0.6;
Fig. 2 is a schematic diagram of two agents converging to the Nash equilibrium point, where r = 0.6 < 2/3, a = 0.4, and b = 0.6;
Fig. 3 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a fully connected network without Media;
Fig. 4 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a small-world network without Media;
Fig. 5 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a fully connected network with one Media;
Fig. 6 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a small-world network with one Media;
Fig. 7 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a fully connected network with two competing Media;
Fig. 8 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a small-world network with two competing Media.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The Nash equilibrium strategy on the continuous action space of the invention extends the single-agent reinforcement learning algorithm CALA [7] (Continuous Action Learning Automaton). By introducing the WoLS (Win or Learn Slow) learning mechanism, the algorithm can effectively handle learning problems in a multi-agent environment. The Nash equilibrium strategy of the invention is therefore abbreviated as WoLS-CALA (Win or Learn Slow Continuous Action Learning Automaton). The CALA algorithm is first described in detail.
The Continuous Action Learning Automaton (CALA) [7] is a policy-gradient reinforcement learning algorithm for learning problems with a continuous action space, in which the strategy of the agent is defined as a normal distribution N(u_t, σ_t) over the action space.
The policy update of a CALA agent is as follows: at time t, the agent selects an action x_t according to the normal distribution N(u_t, σ_t); it performs both x_t and u_t and obtains the corresponding returns V(x_t) and V(u_t) from the environment, which means that the algorithm needs to perform two actions in each interaction with the environment; finally, the mean and the variance of the normal distribution N(u_t, σ_t) are updated as

u_(t+1) = u_t + α_u · [(V(x_t) − V(u_t))/φ(σ_t)] · [(x_t − u_t)/φ(σ_t)],    (1)
σ_(t+1) = σ_t + α_σ · [(V(x_t) − V(u_t))/φ(σ_t)] · [((x_t − u_t)/φ(σ_t))² − 1] − α_σ·K·(σ_t − σ_L),    (2)

where φ(σ) = max(σ, σ_L), α_u and α_σ are learning rates, and K is a positive constant used to control the convergence of the algorithm. Specifically, the size of K is related to the number of learning iterations and is usually set on the order of 1/N, where N is the number of iterations of the algorithm, and σ_L is the lower bound of the variance σ. The algorithm keeps updating the mean and the variance until u no longer changes and σ_t tends to σ_L; the mean u after convergence points to an optimal solution of the problem. The size of σ in equation (2) determines the exploration ability of the CALA algorithm: the larger σ_t, the more likely CALA is to find potentially better actions.
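The CALA update of equations (1)-(2) can be summarized by the following Python sketch. It assumes the environment can be queried for both V(x_t) and V(u_t) in every step, which is precisely the requirement that the WoLS-CALA extension below removes; the formulas follow the standard automaton of [7] as reconstructed above, not the patent's original drawings.

```python
import numpy as np

def cala_step(u, sigma, V, alpha_u=0.01, alpha_sigma=0.001, K=1.0,
              sigma_L=0.02, rng=np.random.default_rng()):
    """One CALA update (equations (1)-(2)), needing returns for both x_t and u_t.

    V: reward function; phi = max(sigma, sigma_L) bounds the denominator.
    """
    phi = max(sigma, sigma_L)
    x = rng.normal(u, phi)                       # sample an action
    dV = (V(x) - V(u)) / phi                     # normalised reward difference
    u_new = u + alpha_u * dV * (x - u) / phi
    sigma_new = sigma + alpha_sigma * (dV * ((x - u) ** 2 / phi ** 2 - 1.0)
                                       - K * (sigma - sigma_L))
    return u_new, sigma_new

# Usage: iterate until u stops changing and sigma approaches sigma_L;
# u drifts toward the maximiser of V.
u, s = 0.2, 0.3
for _ in range(2000):
    u, s = cala_step(u, s, lambda a: -(a - 0.7) ** 2)
print(round(u, 2))
```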
By definition, the CALA algorithm is a policy-gradient-type learning algorithm. It has been proved theoretically that CALA can find a local optimum when the reward function V(x) is sufficiently smooth [7]. De Jong et al. [34] extended CALA to a multi-agent environment by improving the reward function and verified experimentally that their improved algorithm can converge to a Nash equilibrium. The WoLS-CALA proposed by the invention introduces the WoLS mechanism to solve the multi-agent learning problem, and it is analysed and proved theoretically that the algorithm can learn a Nash equilibrium in a continuous action space.
CALA requires the agent to obtain, in every learning step, the returns of both a sampled action and the desired action; this is not feasible in most reinforcement learning environments, where the agent can typically perform only one action per interaction with the environment. The invention therefore extends CALA in two respects, Q-value function estimation and a variable learning rate, and proposes the WoLS-CALA algorithm.
1. Q function estimation
In an independent multi-agent reinforcement learning environment, an agent selects one action at a time and then obtains a return from the environment. A natural way to handle the dynamically changing distribution is to use a Q value to estimate the average return of the desired action u. Specifically, the expected return Q_i of the action u_i of agent i in formula (1) can be estimated by

Q_i^(t+1) = (1 − α_Q)·Q_i^t + α_Q·r_i^t,    (3)

where x_i^t is the action sampled at time t, r_i^t is the return received by agent i when selecting action x_i^t, which is determined by the joint action of all agents at time t, and α_Q is the learning rate of agent i for Q. The update in equation (3) is a common way for reinforcement learning to estimate the value function of a single state; its essence is to use r_i to estimate the statistical mean Q_i. A further advantage is that Q_i can be updated one sample at a time, and the weight of the newly received return in the Q-value estimate is always α_Q.
According to equation (3), the update procedure of σ (equation (2)) and the update procedure of u (equation (1)) can be rewritten as

σ_i^(t+1) = σ_i^t + α_σ^i · [(r_i^t − Q_i^t)/φ(σ_i^t)] · [((x_i^t − u_i^t)/φ(σ_i^t))² − 1] − α_σ^i·K·(σ_i^t − σ_L),    (4)
u_i^(t+1) = u_i^t + α_u^i · [(r_i^t − Q_i^t)/φ(σ_i^t)] · [(x_i^t − u_i^t)/φ(σ_i^t)],    (5)

where x_i^t is the action sampled at time t, r_i^t is the return received by agent i when selecting x_i^t, determined by the joint action of all agents at time t, and α_u^i and α_σ^i are the learning rates of agent i for u_i and σ_i.
However, directly using the Q-function estimate in a multi-agent environment introduces a new problem. In a multi-agent environment the return of an agent is affected by the other agents, whose policy changes may make the environment non-stationary, and the update in equation (5) does not guarantee that u can adapt to such dynamic changes. As a simple example, assume that at time t agent i has already learned the currently optimal action, u_i^t = x_i*, and that Q_i^t is an accurate estimate of the expected return of x_i*. By definition, at time t the return of x_i* is at least as large as that of any other action x_i. Substituting equation (3) into equation (5), the increment of u_i is proportional to (r_i^t − Q_i^t)(x_i^t − u_i^t). If the environment stays the same, then r_i^t − Q_i^t ≈ 0 continues to hold and u_i remains at x_i*. If, however, the environment changes, the other agents switch their policies, and x_i* is no longer the optimal action, then there exist actions x_i whose corresponding returns satisfy r_i^t(x_i) > r_i^t(x_i*). In this case, if the update of u_i in equation (5) is simply continued, u_i stays far from such an x_i, whereas theoretically, to keep the estimate accurate, u_i should move close to x_i. Because Q is a statistical estimate of r, Q is updated more slowly than r changes, so that r_i > Q_i keeps holding for many subsequent updates; under repeated sampling u_i therefore remains in the vicinity of x_i* and does not change, although it should change in order to find the new optimal action. The root cause of this problem is the non-stationarity introduced by the multi-agent environment: traditional estimation methods (such as Q-learning) cannot effectively cope with changes of the environment.
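The lag described above can be reproduced in a few lines of Python: a running average updated as in equation (3) tracks a stationary return well but trails it after an abrupt change, which is what motivates the variable learning rate of the WoLS rule in the next subsection. The toy reward and the parameter values are illustrative only.

```python
# Running-average estimate (equation (3)) against a return that jumps at t = 500.
alpha_Q, Q = 0.05, 0.0
lag = []
for t in range(1000):
    r = 1.0 if t < 500 else 0.0          # environment (other agents) changes abruptly
    Q += alpha_Q * (r - Q)
    lag.append(abs(r - Q))
print(f"estimation error just after the change: {lag[510]:.2f}")   # still far from 0
```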
2. The WoLS rule and its analysis
To estimate the expected return of u more accurately in a multi-agent environment, the invention updates the desired action u with a variable learning rate. Formally, the desired action u_i is updated according to equation (6), in which the learning rate applied to the update (5) is larger when the latest return exceeds the current Q estimate and smaller otherwise; the resulting dynamics of u_i can then be represented as equation (7), which differs from equation (5) only through this variable learning rate. Intuitively, the WoLS rule says that if the return V(x) of the agent's sampled action x is greater than the return V(u) of the desired action u, the agent should learn faster, and otherwise more slowly. This is exactly the opposite of the WoLF (Win or Learn Fast) strategy [35]. The difference is that WoLF was designed to ensure the convergence of the algorithm, whereas the WoLS strategy of the invention lets the algorithm update u in the direction of increasing return while ensuring that the expected return of the action u is estimated correctly. By analysing the learning dynamics induced by the WoLS strategy, the following conclusion can be drawn.
theorem 1 the learning dynamics of the CALA algorithm using the WoLS rule can be approximated as a gradient ascent strategy (GA) in the continuous motion space.
Proof: By the definitions above, x_t is the action selected by the agent at time t from the normal distribution N(u_t, σ_t), and V(x_t) and V(u_t) are the returns corresponding to the actions x_t and u_t respectively. Define f(x) = E[V(x_t) | x_t = x] as the expected reward function of action x. Assuming that α_u is infinitesimal, the dynamics of u_t in the WoLS-CALA algorithm can be described by an ordinary differential equation (equation (8)) obtained by averaging the update over the sampling density dN(u, σ) of the normal distribution with mean u and variance σ². Substituting x = u + y, expanding f(x) in equation (8) as a Taylor series at y = 0 and simplifying yields equation (9), whose leading term is proportional to σ²·f'(u); note that in equation (9) the remaining higher-order term and σ² are both bounded.
The update procedure of the standard deviation σ (equation (4)) is the same as in the original CALA algorithm, so the conclusion for CALA can be used directly: given a sufficiently large positive number K, σ eventually converges to σ_L. Combining this with equation (9), the invention draws the following conclusion:
for a small positive number σ_L (e.g., 1/10000), after a sufficiently long time the dynamics of u_t can be approximated by the ordinary differential equation

du/dt = ε·f'(u),    (10)

where ε is a small positive constant and f'(u) is the gradient of the function f(u) at u. Equation (10) states that u changes in the gradient direction of f(u), i.e., the direction in which f(u) increases fastest; the dynamic trajectory of u can therefore be approximated by a gradient ascent strategy.
When only one agent is present, the dynamics of u eventually converge to an optimal point, since at an optimal point u = u* the gradient f'(u*) = 0 and hence du/dt = 0.
Theorem 1 shows that the learning dynamics of the desired action of a CALA agent using the WoLS rule resemble the gradient ascent strategy above, i.e., its derivative with respect to time can be written in the form of equation (10). If f(u) has multiple local optima, whether the algorithm finally converges to the global optimum depends on the trade-off the algorithm makes between exploration and exploitation [36], which is a general problem in the field of reinforcement learning. A common way to explore for the global optimum is to give the initial exploration rate σ (i.e., the standard deviation) a large value and the initial learning rate α_σ for σ a particularly small value, so that the algorithm can sample sufficiently often over the whole action space. Since the desired action u of the CALA algorithm with the WoLS rule can converge even when the standard deviation σ is not 0, the lower bound σ_L of the exploration rate σ can be given a relatively large value. Combining these strategies, the global optimum can be learned by choosing appropriate parameters for the algorithm.
Another problem is that a pure gradient ascent strategy may fail to converge in a multi-agent environment. The invention therefore combines the PHC (Policy Hill Climbing) [35] algorithm and proposes an Actor-Critic type independent multi-agent reinforcement learning algorithm, called WoLS-CALA. The main idea of the Actor-Critic architecture is that the evaluation of the policy and the update of the policy are learned in separate processes: the part that handles policy evaluation is called the Critic, and the part that updates the policy is called the Actor. The specific learning procedure is as follows (Algorithm 1).
Algorithm 1: learning strategy of WoLS-CALA agent i
(The pseudocode of Algorithm 1, given as a drawing in the original, corresponds to steps (1)-(4) listed in the Disclosure of Invention.)
For simplicity, two constants α_ub and α_us (α_ub > α_us) are used in Algorithm 1 in place of the learning rate of u_i defined in equation (6). If the return r_i received after agent i performs action x_i is greater than the current cumulative average reward Q_i, the learning rate of u_i is α_ub ("winning"); otherwise ("losing") it is α_us (step 3.3). Because formulas (7) and (4) contain the denominator φ(σ_i^t), a small error has a large influence on the updates of u and σ when the denominator is small; using two fixed step sizes makes the update process of the algorithm easier to control and to implement in experiments. Furthermore, note that the update step size of Q and that of u are synchronized in step 3.5 of the algorithm, i.e., both are α_ub when r_i > Q_i and both are α_us otherwise. Because α_ub and α_us are both very small numbers, Q_i can be linearized with respect to u_i in a very small neighborhood of u_i as Q_i = K·u_i + C, where the slope K is the ratio ΔQ_i/Δu_i of the synchronized increments: if u_i changes by Δu_i, then Q_i changes by ΔQ_i = K·Δu_i. This, too, serves to estimate the expected return of u more accurately. Finally (step 4), the algorithm uses the convergence of the cumulative average x̄_i as the loop termination condition and outputs x̄_i, mainly to prevent the algorithm from failing to terminate when u_i exhibits periodic solutions in a competitive environment. Note that the variables x̄_i and u_i have different meanings: x̄_i is the cumulative statistical average of the sampled actions of agent i, whose final value converges to the Nash equilibrium strategy in the multi-agent environment, whereas u_i is the expected mean of the strategy distribution of agent i, which may oscillate periodically around the equilibrium point in a competitive environment. A detailed explanation is given in Theorem 2 below.
Because dynamic trajectories in a high-dimensional space may exhibit chaos, it is difficult to analyse the dynamic behaviour of the algorithm qualitatively when there are many agents. Dynamic analyses of related multi-agent algorithms in the field are essentially based on two agents [35, 37-39]; the case of two WoLS-CALA agents is therefore mainly analysed here.
Theorem 2: Given a positive number σ_L and a sufficiently large positive number K, the strategies of two WoLS-CALA agents eventually converge to a Nash equilibrium.
Proof: Nash equilibria can be divided into two categories according to the location of the equilibrium point: equilibrium points on the boundary of the continuous action space (a bounded set) and equilibrium points in the interior of the continuous action space. This example focuses on the second class, since equilibrium points on the boundary can be treated as interior equilibrium points of a space of one dimension lower. The dynamics of an ordinary differential equation depend on the stability properties of its interior equilibrium points [40], so this example first computes the equilibrium points of equation (10) and then analyses their stability.
Let x_i^t denote the action sampled by agent i at time t from the normal distribution N(u_i^t, σ_i^t), and let f_1(u_1, u_2) and f_2(u_1, u_2) denote the expected returns corresponding to the desired actions u_1 and u_2 of the two agents. If eq = (u_1*, u_2*) is an equilibrium point of equation (10), then the gradient of each agent's expected return with respect to its own action vanishes at eq, i.e., ∂f_i/∂u_i = 0 at eq for i = 1, 2.
According to nonlinear dynamics theory [40], the stability of the point eq is determined by the eigenvalues of the Jacobian matrix M of the right-hand side of equation (10) evaluated at eq, whose entries are proportional to the second derivatives ∂²f_i/(∂u_i∂u_j) (equation (11)), the off-diagonal entries being those with i ≠ j.
In addition, by the definition of Nash equilibrium, a Nash equilibrium point eq satisfies f_i(u_i*, u_j*) ≥ f_i(u_i, u_j*) for every unilateral deviation u_i, which at an interior equilibrium implies that the diagonal second derivatives ∂²f_i/∂u_i² are non-positive at eq (equation (12)).
Substituting equation (12) into M shows that the eigenvalues at a Nash equilibrium point belong to one of the following three cases:
(a) All eigenvalues of the matrix M have negative real parts. Such an equilibrium point is asymptotically stable, i.e., all trajectories around eq eventually converge to this equilibrium point.
(b) All eigenvalues of the matrix M have non-positive real parts and include a pair of purely imaginary eigenvalues. Such equilibrium points are stable, but the limit set of nearby trajectories is a periodic solution rather than the point itself. In this case it is easy to show that the time average of the trajectory converges to the Nash equilibrium, i.e., x̄_i eventually converges to the Nash equilibrium. Since WoLS-CALA outputs the running average x̄_i, the algorithm can also handle this type of equilibrium point.
(c) The matrix M has an eigenvalue with positive real part, i.e., the equilibrium point is unstable. For such equilibrium points, according to nonlinear dynamics theory the trajectories around the unstable equilibrium can be divided into two classes: trajectories on the stable manifold and all other trajectories [40]. The stable manifold is the subspace generated by the eigenvectors corresponding to the stable eigenvalues; trajectories in the stable manifold theoretically converge to this equilibrium point, but, taking randomness and computational error into account, the probability that the algorithm remains in this subspace is 0. All trajectories not belonging to the stable manifold gradually move away from the equilibrium point and eventually converge to one of the other types of equilibrium analysed above, i.e., to an equilibrium point on the boundary or to an equilibrium point of type (a) or (b).
Furthermore, similar to the single-agent environment, if there are multiple equilibrium points then, by the analysis of Theorem 1, given an appropriate exploration-exploitation setting (e.g., σ_L sufficiently large, a large initial value of σ and a small learning rate), the algorithm can converge to a Nash equilibrium point (the global optimum for each agent when the other agent's policy is unchanged). In conclusion, this completes the proof that the algorithm converges to a Nash equilibrium.
The invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model comprises two types of agents, Gossiper agents simulating ordinary people in a social network and Media agents simulating media or public figures that aim to attract ordinary people in the social network, and is therefore also called the Gossiper-Media model. Each Media agent uses the Nash equilibrium strategy on the continuous action space to compute the concept with the best return, updates its concept, and broadcasts it in the social network. The invention applies the WoLS-CALA algorithm to the study of public opinion evolution in real social networks: the media in the network are modelled with WoLS-CALA, and the influence of competing media on social public opinion is discussed.
This is explained in detail below:
1. The Gossiper-Media model
The invention proposes a multi-agent reinforcement learning framework, the Gossiper-Media model, to study the evolution of group public opinion. The Gossiper-Media model contains two classes of agents, Gossiper agents and Media agents. A Gossiper agent simulates a member of the general public in a real network, and its concept (opinion) is influenced simultaneously by the Media and by other Gossipers; a Media agent simulates a media outlet or public figure in the social network whose purpose is to attract the public, and such an agent actively chooses its concept so as to maximize the number of its followers. Consider a network with N agents, where the number of Gossipers is |G| and the number of Media is |M| (N = |G| + |M|). It is assumed that there is full connectivity between the Gossipers and the Media, i.e., each Gossiper can select any Media to interact with, with equal probability, whereas the Gossipers themselves are not assumed to be fully connected, i.e., each Gossiper may interact only with its own neighbors. The network among the Gossipers is determined by the social relationships among them. In the following simulation experiments, this example defines two Gossiper networks: fully connected networks and small-world networks. The concepts of Gossiper i and Media j are denoted x_i and y_j respectively. The interaction process of the agents in the model follows Algorithm 2.
Algorithm 2: concept learning model in the Gossiper-Media network
(The pseudocode of Algorithm 2, given as a drawing in the original, corresponds to steps S1-S23 in the Disclosure of Invention.)
First, the concept of each Gossiper and each Media is initialized randomly to a value in the action space [0,1] (step 1). Then, in each interaction, every agent adjusts its concept according to its own strategy until the algorithm converges (no agent changes its concept any more). Each Gossiper agent first chooses the object to interact with: a Gossiper is selected at random from its neighbors with probability ξ, or a Media is selected at random with probability 1 − ξ (step 2.1). The Gossiper then updates its concept according to Algorithm 3 and, based on the differences between its concept and the concepts of the Media, chooses to follow the Media closest to its own concept. It is assumed that the Media can randomly obtain, by sampling, the concepts of a part of the Gossipers, denoted G', and that these are broadcast to all Media (step 2.2). The Media then play a game against each other using the WoLS-CALA algorithm, compute the concepts that maximize their numbers of followers, and broadcast the updated concepts throughout the network (step 2.3). In principle, each Media could also sample independently, so that the sets G' they obtain differ; this has little influence on the learning of the WoLS-CALA algorithm, because the concept distribution of G' is theoretically the same as that of G. This assumption is made mainly for simplicity and to reduce the uncertainty introduced by random sampling.
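The following Python sketch puts Algorithm 2 together on a fully connected Gossiper network. It reuses the BCM update and the WoLS learning-rate choice described above, but the per-round Media update, the follower-share reward and all parameter values are simplified, illustrative assumptions rather than the exact procedure of Algorithms 1-4.

```python
import numpy as np

def simulate(num_gossipers=200, num_media=2, rounds=2000, xi=0.5,
             d_g=0.1, d_m=0.1, alpha_g=0.5, alpha_m=0.5, seed=1):
    """Sketch of Algorithm 2 on a fully connected Gossiper network.

    Gossipers follow the BCM update; each Media performs one WoLS-CALA-style
    step per round on its concept y_j, using a simplified follower share as reward.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, num_gossipers)        # step 1: random initial concepts
    y = np.full(num_media, 0.5)                     # Media start at 0.5
    u, sigma, Q = y.copy(), np.full(num_media, 0.2), np.zeros(num_media)
    for _ in range(rounds):
        for i in range(num_gossipers):              # step 2.1: Gossiper updates
            if rng.random() < xi:                   # interact with another Gossiper
                j = rng.integers(num_gossipers)
                if abs(x[j] - x[i]) < d_g:
                    x[i] += alpha_g * (x[j] - x[i])
            else:                                   # interact with a random Media
                k = rng.integers(num_media)
                if abs(y[k] - x[i]) < d_m:
                    x[i] += alpha_m * (y[k] - x[i])
        G_prime = rng.choice(x, size=int(0.8 * num_gossipers), replace=False)  # step 2.2
        for j in range(num_media):                  # step 2.3: Media learning step
            y[j] = np.clip(rng.normal(u[j], sigma[j]), 0.0, 1.0)
            r = np.mean(np.abs(G_prime - y[j]) <= d_m)      # share of nearby Gossipers
            alpha = 0.05 if r > Q[j] else 0.01              # WoLS learning rate
            u[j] += alpha * (r - Q[j]) * (y[j] - u[j])
            Q[j] += alpha * (r - Q[j])
            sigma[j] = max(0.02, sigma[j] * 0.999)          # decaying exploration
    return x, y

gossiper_concepts, media_concepts = simulate()
print(np.round(media_concepts, 2))
```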
1.1 The Gossiper strategy
The policy of each Gossiper consists of two parts: 1) how to update its concept; 2) how to select a Media to follow. The details are as follows (Algorithm 3).
Algorithm 3: strategy of Gossiper i in round τ
(The pseudocode of Algorithm 3, given as a drawing in the original, corresponds to steps A1-A3 in the Disclosure of Invention.)
For Gossiper i, its concept is first initialized: x_i^τ = x_i^(τ-1) (step 1). The Gossiper then updates its concept following the BCM (bounded confidence model) strategy [12, 33] (step 2). BCM is a widely used model for describing group opinions, in which an agent is influenced only by agents whose concepts are close to its own. In Algorithm 3, a Gossiper updates its concept only when the concept of the selected agent differs from its own by less than the threshold d_g (or d_m), where d_g and d_m correspond to the selected agent being a Gossiper or a Media respectively. The size of the threshold d_g (or d_m) represents how readily the Gossiper accepts new ideas; intuitively, the larger d, the more easily the Gossiper is influenced by other agents [41-43]. The Gossiper then compares the differences between its own concept and the concepts of the Media and selects one Media to follow according to a probability (step 3). The probability that Gossiper i chooses to follow Media j at time τ is denoted P_ij^τ and satisfies the following properties:
(i) when |x_i − y_j| > d_m, P_ij = 0;
(ii) P_ij > 0 if and only if the concept y_j of Media j satisfies |x_i − y_j| ≤ d_m;
(iii) P_ij decreases as the distance |x_i − y_j| between the concepts x_i and y_j increases.
Note that if |x_i − y_j| > d_m for every Media j ∈ M, then Σ_{j∈M} P_ij = 0, which means that a Gossiper may follow no Media at all. In the weights λ_ij that define P_ij, the parameter δ is a small positive number that prevents the denominator of the fraction from being 0.
1.2 The Media strategy
For a given set of Gossiper concepts, each Media can adapt its own concept by learning to cater to the Gossipers' preferences and thereby attract more Gossipers to follow it. In a multi-agent system with several Media, Nash equilibrium is the final stable state reached when the agents compete with one another; in this state no agent can obtain a higher return by unilaterally changing its own policy. Since the action space of a Media is continuous (a concept is defined as any point in the interval [0,1]), the behaviour of a Media is modelled here with the WoLS-CALA algorithm; Algorithm 4 is the Media strategy constructed on the basis of WoLS-CALA.
Algorithm 4: strategy of Media j in round τ
(The pseudocode of Algorithm 4, given as a drawing in the original, applies the WoLS-CALA learning step of Algorithm 1 to the Media concept y_j.)
The current return r_j of Media j is defined as the ratio of the number of Gossipers in G' who choose to follow Media j to the total number of Gossipers in G', i.e. r_j = (1/|G'|)·Σ_{i∈G'} P_ij, where λ_ij is defined as in Algorithm 3 and P_ij denotes the probability that Gossiper i follows Media j.
2. Dynamic analysis of group public opinion
Let {y_j}_{j∈M}, y_j ∈ (0,1), be the concepts of the Media. Assuming that the Gossiper network is infinite, the concept distribution of the Gossipers can be represented by a continuous distribution density function; let p(x, t) denote the probability density function of the concept distribution of the Gossiper population at time t. The public opinion evolution of the Gossipers can then be expressed through the partial derivative of p(x, t) with respect to time. This example first considers the case where only one Media is present.
Theorem 3: In a Gossiper-Media network containing only one Media, the evolution of the Gossiper concept distribution follows

∂p(x,t)/∂t = ξ·Ψ_g(x,t) + (1 − ξ)·Ψ_m(x,t),    (14)

where Ψ_g(x,t) is the rate of change of p(x,t) caused by interactions with other Gossipers (equation (20) below), Ψ_m(x,t) is the rate of change caused by the Media (equation (23) below), and I_1 = {x : |x − y| < (1 − α_m)·d_m} and I_2 = {x : d_m ≥ |x − y| ≥ (1 − α_m)·d_m} are the sets on which Ψ_m(x,t) is supported.
Proof: Based on the MF (mean field) approximation theory [40], the partial derivative with respect to t of the probability distribution of the BCM-based Gossiper concepts can be written as [12]

∂p(x,t)/∂t = ∫ [ W_(x+y→x)·p(x+y,t) − W_(x→x+y)·p(x,t) ] dy,    (17)

where W_(x+y→x) denotes the probability that a Gossiper whose concept equals x+y changes its concept to x, so that W_(x+y→x)·p(x+y)·dy is the fraction of agents whose concepts move from the interval (x+y, x+y+dy) to x within the time interval (t, t+dt); similarly, W_(x→x+y) is the probability that an agent with concept x changes its concept to x+y, and W_(x→x+y)·p(x)·dy is the fraction of Gossipers with concept equal to x that move to the interval (x+y, x+y+dy).
According to the definition of Algorithm 3, a Gossiper is influenced by the concept of another Gossiper with probability ξ or by a Media concept with probability 1 − ξ, and then makes its own decision. Splitting W_(x+y→x) and W_(x→x+y) into the parts caused by other Gossiper concepts and by Media concepts, denoted w^[g] and w^[m] respectively, W_(x→x+y) and W_(x+y→x) can be expressed as

W = ξ·w^[g] + (1 − ξ)·w^[m].    (18)

Substituting equation (18) into equation (17) gives equation (19), in which the Gossiper part and the Media part appear with weights ξ and 1 − ξ respectively.
definition of
Figure GDA0003490221330000153
Figure GDA0003490221330000154
Therein Ψg(x, t) represents the rate of change of the probability density function p (x, t) of the agent g concept as affected by gossip. Weisbuch G [45]Et al have demonstrated Ψg(x, t) obeys the following formula,
Figure GDA0003490221330000155
here, the
Figure GDA0003490221330000156
Is the second order partial derivative of p with respect to x. Alpha is alphagIs a real number between 0 and 0.5. dgIs the gossip threshold.
Ψ_m(x,t) represents the rate of change of the concept distribution density function p(x,t) caused by the Media. Suppose the concept of Media j is u_j (u_j = x + d_j); then the concept distribution of the Media can be represented with the Dirac delta function as q(x) = δ(x − u_j). The Dirac delta function δ(x) [46] is often used to model a narrow, high spike (pulse) and similar abstractions such as a point charge or a point mass; it is zero everywhere except at x = 0 and integrates to 1. The transfer rate w^[m]_(x+y→x) from x+y to x can be expressed as

w^[m]_(x+y→x) = ∫ δ(x − [(x+y) + α_m·((x+z) − (x+y))])·q(x+z) dz,    (21)

where the term δ(x − [(x+y) + α_m·((x+z) − (x+y))]) indicates that the event occurs in which the concept x+y is shifted to x under the influence of the concept x+z, and q(x+z) is the distribution density of the Media at the concept x+z. In the same way, w^[m]_(x→x+y) can be expressed as

w^[m]_(x→x+y) = ∫ δ((x+y) − [x + α_m·((x+z) − x)])·q(x+z) dz.    (22)

Combining equations (21) and (22), calculating and simplifying yields the expression (23) for Ψ_m(x,t), which is supported on the sets I_1 = {x : |x − y| < (1 − α_m)·d_m} and I_2 = {x : d_m ≥ |x − y| ≥ (1 − α_m)·d_m}. Substituting (20) and (23) into (19) completes the proof.
Equation (14) shows that the rate of change of p(x,t) is a weighted average of Ψ_g(x,t) and Ψ_m(x,t): the former represents the part of the public opinion change driven by the Gossiper network, the latter the part driven by the Media. The pure-Gossiper term Ψ_g(x,t) has already been analysed by Weisbuch et al. [45]; an important property is that, starting from any distribution, the local maxima of the distribution density are gradually reinforced, which indicates that in a pure Gossiper network public opinion gradually concentrates. Furthermore, it can be seen from Theorem 3 that neither Ψ_g(x,t) nor Ψ_m(x,t) depends on the concrete Gossiper network, which means that when the network is infinite the development of public opinion is not affected by the network structure.
Next, the second part of equation (14), Ψ_m(x,t) (equation (23)), is analysed. Assuming that y is constant, analysing equation (23) yields equation (24), which shows intuitively that the opinions of the Gossipers whose concepts are similar to the Media concept converge toward this Media. The following conclusion can therefore be drawn.
Corollary 1: The existence of a Media accelerates the convergence of the Gossiper opinions toward consensus.
The example next considers the case where multiple Media exist. Define P_j(x) as the probability that a Gossiper whose concept is x is influenced by Media j, defined analogously to P_ij in Algorithm 3. In an environment with multiple competing Media, the dynamic change of the Gossiper concept distribution can then be expressed as a weighted average of the influences of the individual Media, which gives the following conclusion.
Corollary 2: The dynamic change of the distribution function of the Gossiper concepts obeys

∂p(x,t)/∂t = ξ·Ψ_g(x,t) + (1 − ξ)·Σ_{j∈M} P_j(x)·Ψ_m^(j)(x,t),    (25)

where Ψ_g(x,t) is defined by equation (20) and Ψ_m^(j)(x,t) is the Media term of equation (23) with y replaced by the concept y_j of Media j.
3. Simulation experiment and analysis
First, it is verified that the WoLS-CALA algorithm can learn a Nash equilibrium. An experimental simulation of the Gossiper-Media model is then presented to verify the results of the theoretical analysis above.
3.1 WoLS-CALA algorithm performance test
This example considers a simplified version of the Gossiper-Media model to check whether the WoLS-CALA algorithm can learn the Nash equilibrium strategy. Specifically, the problem of two Media competing for followers is modelled as the following optimization problem:

max (f_1(x, y), f_2(x, y))
s.t. x, y ∈ [0,1]    (26)

(s.t. denotes the constraint conditions, in the standard notation of optimization problems), where f_1(x, y) and f_2(x, y) are the reward functions of the two Media, r ∈ [0,1] is a parameter of the Gossiper concept distribution, and a, b ∈ [0,1] with |a − b| ≥ 0.2 are the concepts of the Gossipers.
The functions f_1(x, y) and f_2(x, y) simulate the return r_j of Algorithm 4 and represent the returns of Media 1 and Media 2 under the joint action <x, y>. This example uses two WoLS-CALA agents that independently control x and y and maximize their respective reward functions f_1(x, y) and f_2(x, y). In this model, the Nash equilibria can be divided into two classes according to the concept distribution of the Gossipers:
(i) when r > 2/3 the equilibrium point is (a, a), and when r < 1/3 the equilibrium point is (b, b);
(ii) when 1/3 ≤ r ≤ 2/3 the equilibrium point is any point in the set |x − a| < 0.1 ∧ |y − b| < 0.1 or |x − b| < 0.1 ∧ |y − a| < 0.1.
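The sketch below reproduces this test in Python. Since the exact forms of f_1 and f_2 are given only in the patent drawings, the follower-share payoff used here (a fraction r of Gossipers at concept a, the rest at b, each group following the closer Media within d_m) is an assumed reconstruction that is consistent with the equilibria listed in (i) and (ii); the two learners use the WoLS rule with illustrative parameter values.

```python
import numpy as np

def follower_share(own, other, r=0.7, a=0.4, b=0.6, d_m=0.1):
    """Reconstructed payoff f_i(x, y): share of Gossipers that follow this Media.

    A fraction r of Gossipers hold concept a and 1-r hold concept b; a group
    follows the closer Media within distance d_m and splits evenly on a tie.
    This form is an assumption consistent with the equilibria listed above.
    """
    share = 0.0
    for concept, weight in ((a, r), (b, 1.0 - r)):
        d_own, d_other = abs(concept - own), abs(concept - other)
        if d_own <= d_m and (d_other > d_m or d_own < d_other):
            share += weight
        elif d_own <= d_m and np.isclose(d_own, d_other):
            share += weight / 2.0
    return share

# Two independent WoLS-CALA-style learners controlling x and y respectively.
rng = np.random.default_rng(3)
u = np.array([0.5, 0.5]); sigma = np.array([0.3, 0.3]); Q = np.zeros(2)
for t in range(3000):
    acts = np.clip(rng.normal(u, np.maximum(sigma, 0.02)), 0.0, 1.0)
    for i in range(2):
        r_i = follower_share(acts[i], acts[1 - i])
        alpha = 0.05 if r_i > Q[i] else 0.01             # WoLS rule
        u[i] += alpha * (r_i - Q[i]) * (acts[i] - u[i])
        Q[i] += alpha * (r_i - Q[i])
    sigma = np.maximum(0.02, sigma * 0.999)
print(np.round(u, 2))   # theory above predicts equilibrium near (a, a) for r = 0.7 > 2/3
```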
In the specific simulation experiment, one point is taken in each of the two classes, namely r = 0.7 > 2/3 and r = 0.6 < 2/3. It is then observed whether the algorithm can learn the Nash equilibrium as expected under the different Gossiper concept distributions. Table 1 shows the parameter settings of WoLS-CALA.
Table 1: parameter settings
(The table of parameter values is provided as a drawing in the original and is not reproduced here.)
Fig. 1 and Fig. 2 show the simulation results of the two experiments. The Media agents in both experiments converge to the Nash equilibrium after about 3000 learning steps, i.e., to <0.4, 0.4> for r = 0.7 and to <0.4, 0.57> for r = 0.6. As shown in Fig. 1, when r = 0.7 > 2/3, a = 0.4 and b = 0.6, the two agents converge to the Nash equilibrium point (0.4, 0.4); as shown in Fig. 2, when r = 0.6 < 2/3, a = 0.4 and b = 0.6, agent 1 converges to x = 0.4 and agent 2 converges to y = 0.57.
3.2 Experimental simulation of the Gossiper-Media model
This subsection shows the simulation results of the Gossiper-Media model. Consider 200 Gossipers and experimental environments with different numbers of Media: (i) no Media; (ii) one Media; (iii) two competing Media. For each environment, this example considers two representative Gossiper networks, a fully connected network and a small-world network [47]. Through these comparative experiments, the influence of the Media on the evolution of Gossiper public opinion is discussed.
For fairness, the same parameter settings are used in every experimental environment: the three environments use the same networks and the same initial concepts of the Gossipers and the Media. The small-world network is generated randomly with the Watts-Strogatz construction [47] with connectivity p = 0.2. The initial concept of each Gossiper is sampled at random from the uniform distribution on the interval [0,1]; the initial concept of each Media is 0.5. Considering that too large a threshold would interfere with the observation of the experiment, the Gossiper-Media threshold d_m and the Gossiper-Gossiper threshold d_g are both set to the small positive number 0.1. The Gossiper learning rates α_g and α_m are set to 0.5. The set G' is sampled randomly from G and satisfies |G'| = 80% · |G|.
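The two Gossiper topologies used in the experiments can be generated with networkx, as in the Python sketch below. The rewiring probability p = 0.2 and the 200 Gossipers follow the text; the mean degree k of the Watts-Strogatz graph is not stated in the text and is an assumed value.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
N_GOSSIPERS = 200

# Fully connected Gossiper network.
full_net = nx.complete_graph(N_GOSSIPERS)

# Small-world Gossiper network via the Watts-Strogatz construction [47];
# rewiring probability p = 0.2 as in the text, mean degree k = 10 is assumed.
small_world = nx.watts_strogatz_graph(N_GOSSIPERS, k=10, p=0.2, seed=0)

# Initial concepts: Gossipers uniform on [0, 1], every Media starts at 0.5.
gossiper_concepts = rng.uniform(0.0, 1.0, N_GOSSIPERS)
media_concepts = np.full(2, 0.5)
print(small_world.number_of_edges(), gossiper_concepts[:3])
```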
Because each environment uses the two Gossiper network modes, fully connected and small-world, Figs. 3-4 show the public opinion evolution without Media under the fully connected network and the small-world network respectively; Figs. 5-6 show the evolution with one Media; and Figs. 7-8 show the evolution with two competing Media. From these figures it can first be seen that the number of points to which the different Gossiper networks finally converge is the same in each of the three Media environments: 5 with no Media, 4 with one Media, and 3 with two Media. This phenomenon is consistent with the conclusions of Theorem 3 and Corollary 2: the public opinion dynamics of the Gossipers are independent of the topology of the Gossiper network, because the dynamics under the different networks can be modelled with the same formula.
Second, it can be observed from Figs. 3-6 that the number of points to which the Gossiper opinions finally converge drops from 5 to 4 in both networks when one Media is present. This indicates that the presence of a Media accelerates the formation of Gossiper consensus, which is the conclusion of Corollary 1. Meanwhile, Figs. 5-8 show that when the number of Media increases from 1 to 2, the number of convergence points further decreases from 4 to 3 in both networks, which suggests that competing Media further accelerate the unification of Gossiper opinions.
In addition, the experimental results also verify the performance of the WoLS-CALA algorithm. In Figs. 5 and 6 the concept of the Media agent always stays around the concept held by the most Gossipers (N_max = 69 in the fully connected network and N_max = 68 in the small-world network). This agrees with the design expectation that a WoLS-CALA agent can learn the global optimum. In Figs. 7 and 8 it can be seen that with two Media, the concept of one Media stays around the concept held by the most Gossipers (N_max = 89 in both networks) while the other Media stays around the concept held by the second-most Gossipers (N'_max = 70 in the fully connected network and N'_max = 66 in the small-world network). This also agrees with the expectation of Theorem 2 that the two WoLS-CALA agents eventually converge to a Nash equilibrium. The Media concepts in Figs. 5-8 keep oscillating slightly around the Gossiper concepts because in the Gossiper-Media model the optimal strategy of a Media is not unique (every point within d_m of the Gossiper concept is optimal for the Media).
4. Summary of the invention
The invention proposes WoLS-CALA, an independently learning multi-agent reinforcement learning algorithm for continuous action spaces, and verifies both theoretically and experimentally that the algorithm can learn Nash equilibria. The algorithm is then applied to the study of public opinion evolution in a network environment. Individuals in the social network are divided into two classes, Gossipers and Media, and modelled separately: the Gossiper class represents the general public, while the Media, modelled with the WoLS-CALA algorithm, represent individuals such as social media that aim to attract public attention. By modelling the two kinds of agents separately, the invention discusses the influence of competition among different numbers of Media on Gossiper public opinion. Finally, theory and experiments show that Media competition accelerates the formation of consensus in public opinion.
The above-described embodiments are intended to be illustrative, and not restrictive, of the invention, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
The references referred to in the present invention correspond to the following references:
[1] Pazis J, Lagoudakis M G. Binary Action Search for Learning Continuous-action Control Policies[C]. In Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 2009:793-800.
[2] Pazis J, Lagoudakis M G. Reinforcement learning in multidimensional continuous action spaces[C]. In IEEE Symposium on Adaptive Dynamic Programming & Reinforcement Learning, 2011:97-104.
[3]Sutton R S,Maei H R,Precup D,et al.Fast Gradient-descent Methods for Temporal-difference Learning with Linear Function Approximation[C].In Proceedings of the 26th Annual International Conference on Machine Learning,2009:993–1000.
[4]Pazis J,Parr R.Generalized Value Functions for Large Action Sets[C].In International Conference on Machine Learning,ICML 2011,Bellevue,Washington,USA,2011:1185–1192.
[5]Lillicrap T P,Hunt J J,Pritzel A,et al.Continuous control with deep reinforcement learning[J].Computer Science,2015,8(6):A187.
[6]KONDA V R.Actor-critic algorithms[J].SIAM Journal on Control and Optimization,2003,42(4).
[7]Thathachar M A L,Sastry P S.Networks of Learning Automata:Techniques for Online Stochastic Optimization[J].Kluwer Academic Publishers,2004.
[8]Peters J,Schaal S.2008Special Issue:Reinforcement Learning of Motor Skills with Policy Gradients[J].Neural Netw.,2008,21(4).
[9]van Hasselt H.Reinforcement Learning in Continuous State and Action Spaces[M].In Reinforcement Learning:State-of-the-Art.Berlin,Heidelberg:Springer Berlin Heidelberg,2012:207–251.
[10]Sallans B,Hinton G E.Reinforcement Learning with Factored States and Actions[J].J.Mach.Learn.Res.,2004,5:1063–1088.
[11]Lazaric A,Restelli M,Bonarini A.Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods[C].In Conference on Neural Information Processing Systems,Vancouver,British Columbia,Canada,2007:833–840.
[12]Quattrociocchi W,Caldarelli G,Scala A.Opinion dynamics on interacting networks:media competition and social influence[J].Scientific Reports,2014,4(21):4938–4938.
[13]Yang H X,Huang L.Opinion percolation in structured population[J].Computer Physics Communications,2015,192(2):124–129.
[14]Chao Y,Tan G,Lv H,et al.Modelling Adaptive Learning Behaviours for Consensus Formation in Human Societies[J].Scientific Reports,2016,6:27626.
[15]De Vylder B.The evolution of conventions in multi-agent systems[J].Unpublished doctoral dissertation,Vrije Universiteit Brussel,Brussels,2007.
[16]Holley R A,Liggett T M.Ergodic Theorems for Weakly Interacting Infinite Systems and the Voter Model[J].Annals of Probability,1975,3(4):643–663.
[17] nowak A, Szamrej J, Latan Mao B.from private attribute to public option A dynamic of social impact [ J ] Psychological Review,1990,97(3): 362-.
[18]Tsang A,Larson K.Opinion dynamics of skeptical agents[C].In Proceedings of the 2014international conference on Autonomous agents and multi-agent systems,2014:277–284.
[19]Ghaderi J,Srikant R.Opinion dynamics in social networks with stubborn agents:Equilibrium and convergence rate[J].Automatica,2014,50(12):3209–3215.
[20]Kimura M,Saito K,Ohara K,et al.Learning to Predict Opinion Share in Social Networks.[C].In Twenty-Fourth AAAI Conference on Artificial Intelligence,AAAI 2010,Atlanta,Georgia,Usa,July,2010.
[21]Liakos P,Papakonstantinopoulou K.On the Impact of Social Cost in Opinion Dynamics[C].In Tenth International AAAI Conference on Web and Social Media ICWSM,2016.
[22]Bond R M,Fariss C J,Jones J J,et al.A 61-million-person experiment in social influence and political mobilization[J].Nature,2012,489(7415):295–8.
[23]Szolnoki A,Perc M.Information sharing promotes prosocial behaviour[J].New Journal of Physics,2013,15(15):1–5.
[24]Hofbauer J,Sigmund K.Evolutionary games and population dynamics[M].Cambridge;New York,NY:Cambridge University Press,1998.
[25]Tuyls K,Nowe A,Lenaerts T,et al.An Evolutionary Game Theoretic Perspective on Learning in Multi-Agent Systems[J].Synthese,2004,139(2):297–330.
[26]Szabo B G.Fath G(2007)Evolutionary games on graphs[C].In Physics Reports,2010.
[27]Han T A,Santos F C.The role of intention recognition in the evolution of cooperative behavior[C].In International Joint Conference on Artificial Intelligence,2011:1684–1689.
[28]Santos F P,Santos F C,Pacheco J M.Social Norms of Cooperation in Small-Scale Societies[J].PLoS computational biology,2016,12(1):e1004709.
[29]Zhao Y,Zhang L,Tang M,et al.Bounded confidence opinion dynamics with opinion leaders and environmental noises[J].Computers and Operations Research,2016,74(C):205–213.
[30]Pujol J M,Delgado J,Sang,et al.The role of clustering on the emergence of efficient social conventions[C].In International Joint Conference on Artificial Intelligence,2005:965–970.
[31]Nori N,Bollegala D,Ishizuka M.Interest Prediction on Multinomial,Time-Evolving Social Graph.[C].In IJCAI 2011,Proceedings of the International Joint Conference on Artificial Intelligence,Barcelona,Catalonia,Spain,July,2011:2507–2512.
[32]Fang H.Trust modeling for opinion evaluation by coping with subjectivity and dishonesty[C].In International Joint Conference on Artificial Intelligence,2013:3211–3212.
[33]Deffuant G,Neau D,Amblard F,et al.Mixing beliefs among interacting agents[J].Advances in Complex Systems,2011,3(1n04):87–98.
[34]De Jong S,Tuyls K,Verbeeck K.Artificial agents learning human fairness[C].In International Joint Conference on Autonomous Agents and Multiagent Systems,2008:863–870.
[35]BowlingM,Veloso.Multiagent learning using a variable learning rate[J].Artificial Intelligence,2002,136(2):215–250.
[36]Sutton R S,Barto A G.Reinforcement learning:an introduction[M].Cambridge,Mass:MIT Press,1998.
[37]Abdallah S,Lesser V.A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics[J].J.Artif.Int.Res.,2008,33(1):521–549.
[38]Singh S P,Kearns M J,Mansour Y.Nash Convergence of Gradient Dynamics in General-Sum Games[J],2000:541–548.
[39]Zhang C,Lesser V R.Multi-agent learning with policy prediction[J],2010:927–934.
[40]Shilnikov L P,Shilnikov A L,Turaev D,et al.Methods of qualitative theory in nonlinear dynamics/[M].World Scientific,1998.
[41]Dittmer J C.Consensus formation under bounded confidence[J].Nonlinear Analysis Theory Methods and Applications,2001,47(7):4615–4621.
[42]LORENZ J.CONTINUOUS OPINION DYNAMICS UNDER BOUNDED CONFIDENCE:A SURVEY[J].International Journal of Modern Physics C,2007,18(12):2007.
[43]Krawczyk M J,Malarz K,Korff R,et al.Communication and trust in the bounded confidence model[J].Computational Collective Intelligence.Technologies and Applications,2010,6421:90–99.
[44]Lasry J M,Lions P L.Mean field games[J].Japanese Journal of Mathematics,2007,2(1):229–260.
[45]WeisbuchG,DeffuantG,AmblardF,etal.Interacting Agents and Continuous Opinions Dynamics[M].Springer Berlin Heidelberg,2003.
[46]Hassani S.Dirac Delta Function[M].Springer New York,2000.
[47]DJ W,SH S.Collectivedynamics of’small-world’networks[C].In Nature,1998:440–442.

Claims (9)

1. A social network public opinion evolution method, characterized in that: the social network public opinion evolution method comprises two types of agents, namely a Gossiper-type agent simulating the general public in a social network and a Media-type agent simulating media or public figures that aim to attract the general public in the social network, wherein the Media-type agent adopts a Nash equilibrium strategy on a continuous action space to calculate the concept with the optimal return, updates the concept and broadcasts it in the social network,
the Nash equilibrium strategy on the continuous action space comprises the following steps:
(1) setting constants α_ub and α_us, wherein α_ub > α_us and α_ub, α_us ∈ (0,1) are learning rates;
(2) initializing parameters, wherein the parameters comprise the mean u_i of the expected action of agent i, the cumulative average strategy, the constant C, the variance σ_i and the cumulative average reward Q_i;
(3) repeating the following steps until the cumulative average strategy of the sampled actions of agent i converges:
(3.1) randomly selecting an action x_i from the normal distribution N(u_i, σ_i) at a certain exploration rate;
(3.2) performing action x_i and then obtaining the reward r_i from the environment;
(3.3) if the reward r_i received by agent i after performing action x_i is greater than the current cumulative average reward Q_i, the learning rate of u_i is α_ub, otherwise the learning rate is α_us; updating u_i according to the selected learning rate;
(3.4) updating the variance σ_i according to the learning of u_i;
(3.5) if the reward r_i received by agent i after performing action x_i is greater than the current cumulative average reward Q_i, the learning rate is α_ub, otherwise the learning rate is α_us; updating Q_i according to the selected learning rate;
(3.6) updating the cumulative average strategy according to the constant C and the action x_i;
(4) outputting the cumulative average strategy as the final action of agent i.
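For readability, the following is a minimal Python sketch of the loop in steps (1)-(4) of claim 1. It is an illustration under stated assumptions, not the claimed implementation: the reward function, the concrete variance update in (3.4) and the averaging rule in (3.6) are placeholders, and only the dual-learning-rate structure is taken from the claim.

import numpy as np

def wols_cala_sketch(reward_fn, steps=5000,
                     alpha_ub=0.1, alpha_us=0.01,   # alpha_ub > alpha_us, both in (0, 1)
                     sigma_min=0.02, C=0.05):
    u = np.random.rand()         # (2) mean u_i of the expected action
    sigma = 0.5                  # (2) exploration variance sigma_i
    Q = 0.0                      # (2) cumulative average reward Q_i
    x_bar = u                    # (2) cumulative average strategy

    for _ in range(steps):       # (3) iterate until the average strategy settles
        x = np.random.normal(u, sigma)            # (3.1) sample action from N(u_i, sigma_i)
        r = reward_fn(x)                          # (3.2) obtain the reward from the environment

        lr = alpha_ub if r > Q else alpha_us      # (3.3)/(3.5) choose the learning rate
        u += lr * (x - u)                         # (3.3) update the action mean
        sigma = max(sigma_min,
                    sigma + lr * (abs(x - u) - sigma))  # (3.4) assumed variance update
        Q += lr * (r - Q)                         # (3.5) update the cumulative average reward
        x_bar += C * (x - x_bar)                  # (3.6) assumed averaging with constant C

    return x_bar                                  # (4) final action of agent i

# Toy usage: a single-peaked reward whose optimum lies at 0.7.
best_action = wols_cala_sketch(lambda a: -(a - 0.7) ** 2)

With the toy reward above, the returned average strategy should settle near the reward's peak, which is the behaviour expected of the WoLS-CALA agent in the Gossiper-Media experiments.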
2. The social network public opinion evolution method of claim 1, wherein: in step (3.3) and step (3.5), the update step size of Q and the update step size of u are synchronized; within a neighborhood of u_i, Q_i can be linearized with respect to u_i as Q_i = K·u_i + C, where K is the slope of the linearization.
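As a reading aid, the linearization in claim 2 can be written out as follows; identifying the slope K with the local gradient of Q_i is an interpretation of the claim, not the formula referenced by the original image:

K \approx \left.\frac{\partial Q_i}{\partial u}\right|_{u = u_i}, \qquad C = Q_i(u_i) - K\,u_i, \qquad Q_i(u) \approx K\,u + C \quad \text{for } u \text{ near } u_i.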
3. The social network public opinion evolution method of claim 2, wherein: given a positive number σ_L and a positive number K, the Nash equilibrium strategies on the continuous action space of the two agents eventually converge to Nash equilibrium, where σ_L is the lower bound of the variance σ.
4. The social network public opinion evolution method of claim 1, characterized in that it comprises the following steps:
S1: the concept of each Gossiper and each Media is randomly initialized to a value in the action space [0,1];
S2: in each interaction, each agent adjusts its concept according to the following strategy until no agent changes its concept any more;
S21: for any Gossiper-type agent, randomly selecting a neighbor in the Gossiper network according to a set probability, and updating the agent's concept and the Media it follows according to the BCM (bounded confidence) strategy;
S22: randomly sampling a subset G' of the Gossiper network G, and broadcasting the Gossiper concepts in the subset G' to all Media;
S23: for each Media, calculating the concept with the best return by using the Nash equilibrium strategy on the continuous action space, and broadcasting the updated concept to the whole social network.
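The interaction loop S1-S23 of claim 4 can be sketched as follows. The fully connected Gossiper network, the probability p_media of interacting with a Media, the nearest-Media interaction rule and the sample-based "best return" search are simplifying assumptions; only the loop structure follows the claim (the real Media agents use the WoLS-CALA strategy of claim 1).

import random

D_G, D_M = 0.2, 0.3   # bounded-confidence thresholds d_g, d_m (assumed values)
A_G, A_M = 0.3, 0.3   # learning rates alpha_g, alpha_m (assumed values)

def bcm_update(x, y, d, a):
    # Claim 6: move concept x toward y only when |y - x| < d.
    return x + a * (y - x) if abs(y - x) < d else x

def simulate(n_gossipers=50, n_media=2, rounds=200, p_media=0.3, sample_size=20):
    gossip = [random.random() for _ in range(n_gossipers)]   # S1: concepts in [0, 1]
    media = [random.random() for _ in range(n_media)]        # S1: Media concepts

    for _ in range(rounds):                                  # S2: repeated interactions
        for i in range(n_gossipers):
            if random.random() < p_media:                    # interact with a Media
                k = min(range(n_media), key=lambda m: abs(media[m] - gossip[i]))
                gossip[i] = bcm_update(gossip[i], media[k], D_M, A_M)
            else:                                            # S21: random Gossiper neighbor
                j = random.randrange(n_gossipers)
                gossip[i] = bcm_update(gossip[i], gossip[j], D_G, A_G)

        sample = random.sample(gossip, sample_size)          # S22: broadcast subset G'

        for m in range(n_media):                             # S23: best-return concept
            # The return is approximated by the number of sampled Gossipers
            # within distance D_M of the candidate concept.
            media[m] = max(sample, key=lambda c: sum(abs(c - g) < D_M for g in sample))

    return gossip, media

gossip_concepts, media_concepts = simulate()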
5. The social network public opinion evolution method of claim 4, wherein: in step S21, the operating method of the Gossiper-type agent is:
A1: concept initialization: x_i^τ = x_i^(τ-1);
A2: concept updating: the agent updates its concept when the difference between its concept and the concept of the selected neighbor is less than a set threshold;
A3: the agent compares the difference between its own concept and the concepts of the Media, and selects one Media to follow according to a probability.
6. The social network public opinion evolution method of claim 5, wherein: in step A2, if the currently selected neighbor is Gossiper j and |x_j^τ − x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ − x_i^τ); if the currently selected neighbor is Media k and |y_k^τ − x_i^τ| < d_m, then x_i^τ ← x_i^τ + α_m(y_k^τ − x_i^τ), wherein d_g and d_m are the concept thresholds set for the different types of neighbors, and α_g and α_m are the learning rates for the different types of neighbors.
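To make the update in claim 6 concrete, consider a worked example with illustrative (assumed) values: let d_g = 0.2, α_g = 0.3, x_i^τ = 0.50 and x_j^τ = 0.62. Since |0.62 − 0.50| = 0.12 < 0.2, the Gossiper moves its concept to x_i^τ ← 0.50 + 0.3 × (0.62 − 0.50) = 0.536. If instead x_j^τ = 0.80, the difference 0.30 exceeds d_g and the concept is left unchanged.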
7. The social network public opinion evolution method of claim 6, wherein: in step A3, the agent follows Media k with a probability P_ik that is determined, through the claimed formula, by the difference between the agent's own concept and the concept of Media k.
8. The social network public opinion evolution method of claim 7, wherein: in step S23, the current reward r_j of Media j is defined as the ratio of the number of Gossipers in G' who choose to follow Media j to the total number of Gossipers in G', where P_ij denotes the probability that Gossiper i follows Media j.
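One way to write the claimed ratio out is given below; whether the claim uses the realized follower count or its expectation through P_ij cannot be recovered from the text, so both readings are shown:

r_j = \frac{1}{|G'|}\sum_{i\in G'} \mathbf{1}\{\text{Gossiper } i \text{ follows Media } j\}
\quad\text{or, in expectation,}\quad
r_j = \frac{1}{|G'|}\sum_{i\in G'} P_{ij}.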
9. The social network public opinion evolution method of any one of claims 1-8, wherein: the existence of one Media accelerates the convergence of the public opinion of the Gossiper agents towards uniformity; in an environment with multiple competing Media, the dynamic change of each Gossiper agent's concept is a weighted average influenced by all of the Media.
CN201880001570.9A 2018-08-01 2018-08-01 Social network public opinion evolution method Active CN109496305B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/098101 WO2020024170A1 (en) 2018-08-01 2018-08-01 Nash equilibrium strategy and social network consensus evolution model in continuous action space

Publications (2)

Publication Number Publication Date
CN109496305A CN109496305A (en) 2019-03-19
CN109496305B true CN109496305B (en) 2022-05-13

Family

ID=65713809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880001570.9A Active CN109496305B (en) 2018-08-01 2018-08-01 Social network public opinion evolution method

Country Status (2)

Country Link
CN (1) CN109496305B (en)
WO (1) WO2020024170A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362754B (en) * 2019-06-11 2022-04-29 浙江大学 Online social network information source detection method based on reinforcement learning
CN111445291B (en) * 2020-04-01 2022-05-13 电子科技大学 Method for providing dynamic decision for social network influence maximization problem
CN112801299B (en) * 2021-01-26 2023-12-01 西安电子科技大学 Method, system and application for constructing game model of evolution of reward and punishment mechanism
CN112862175B (en) * 2021-02-01 2023-04-07 天津天大求实电力新技术股份有限公司 Local optimization control method and device based on P2P power transaction
CN113572548B (en) * 2021-06-18 2023-07-07 南京理工大学 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning
CN113645589A (en) * 2021-07-09 2021-11-12 北京邮电大学 Counter-fact strategy gradient-based unmanned aerial vehicle cluster routing calculation method
CN113568954B (en) * 2021-08-02 2024-03-19 湖北工业大学 Parameter optimization method and system for preprocessing stage of network flow prediction data
CN113687657B (en) * 2021-08-26 2023-07-14 鲁东大学 Method and storage medium for multi-agent formation dynamic path planning
CN114845359A (en) * 2022-03-14 2022-08-02 中国人民解放军军事科学院战争研究院 Multi-intelligent heterogeneous network selection method based on Nash Q-Learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103490413A (en) * 2013-09-27 2014-01-01 华南理工大学 Intelligent electricity generation control method based on intelligent body equalization algorithm
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107135224A (en) * 2017-05-12 2017-09-05 中国人民解放军信息工程大学 Cyber-defence strategy choosing method and its device based on Markov evolutionary Games
CN107832882A (en) * 2017-11-03 2018-03-23 上海交通大学 A kind of taxi based on markov decision process seeks objective policy recommendation method
CN107979540A (en) * 2017-10-13 2018-05-01 北京邮电大学 A kind of load-balancing method and system of SDN network multi-controller
CN109511277A (en) * 2018-08-01 2019-03-22 东莞理工学院 The cooperative method and system of multimode Continuous action space

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8930989B2 (en) * 2007-08-20 2015-01-06 AdsVantage System and method for providing supervised learning to associate profiles in video audiences
US20180033081A1 (en) * 2016-07-27 2018-02-01 Aristotle P.C. Karas Auction management system and method
CN106936855B (en) * 2017-05-12 2020-01-10 中国人民解放军信息工程大学 Network security defense decision-making determination method and device based on attack and defense differential game
CN108092307A (en) * 2017-12-15 2018-05-29 三峡大学 Layered distribution type intelligent power generation control method based on virtual wolf pack strategy

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103490413A (en) * 2013-09-27 2014-01-01 华南理工大学 Intelligent electricity generation control method based on intelligent body equalization algorithm
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107135224A (en) * 2017-05-12 2017-09-05 中国人民解放军信息工程大学 Cyber-defence strategy choosing method and its device based on Markov evolutionary Games
CN107979540A (en) * 2017-10-13 2018-05-01 北京邮电大学 A kind of load-balancing method and system of SDN network multi-controller
CN107832882A (en) * 2017-11-03 2018-03-23 上海交通大学 A kind of taxi based on markov decision process seeks objective policy recommendation method
CN109511277A (en) * 2018-08-01 2019-03-22 东莞理工学院 The cooperative method and system of multimode Continuous action space

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Resource leveling optimization of network planning based on a multi-agent cuckoo algorithm; Song Yujian et al.; Computer Engineering and Applications; 2015-08-01; 56-61 *

Also Published As

Publication number Publication date
WO2020024170A1 (en) 2020-02-06
CN109496305A (en) 2019-03-19

Similar Documents

Publication Publication Date Title
CN109496305B (en) Social network public opinion evolution method
Gronauer et al. Multi-agent deep reinforcement learning: a survey
Vecerik et al. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards
Schulman et al. Trust region policy optimization
Sontakke et al. Causal curiosity: Rl agents discovering self-supervised experiments for causal representation learning
Chen et al. On computation and generalization of generative adversarial imitation learning
Yu et al. Multiagent learning of coordination in loosely coupled multiagent systems
Han et al. Intelligent decision-making for 3-dimensional dynamic obstacle avoidance of UAV based on deep reinforcement learning
Abed-Alguni et al. A comparison study of cooperative Q-learning algorithms for independent learners
CN113919485B (en) Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
Bai et al. Variational dynamic for self-supervised exploration in deep reinforcement learning
Juang et al. A self-generating fuzzy system with ant and particle swarm cooperative optimization
Mustafa Towards continuous control for mobile robot navigation: A reinforcement learning and slam based approach
Subramanian et al. Multi-agent advisor q-learning
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
Notsu et al. Online state space generation by a growing self-organizing map and differential learning for reinforcement learning
Wu et al. Policy reuse for learning and planning in partially observable Markov decision processes
Mishra et al. Model-free reinforcement learning for stochastic stackelberg security games
Dias et al. Quantum-inspired neuro coevolution model applied to coordination problems
Akbulut et al. Reward conditioned neural movement primitives for population-based variational policy optimization
Shi et al. A sample aggregation approach to experiences replay of Dyna-Q learning
Zhou et al. An evolutionary approach toward dynamic self-generated fuzzy inference systems
Wei et al. A bayesian approach to robust inverse reinforcement learning
Zhang et al. Opinion Dynamics in Gossiper-Media Networks Based on Multiagent Reinforcement Learning
Mishra et al. Model-free Reinforcement Learning for Mean Field Games

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant