CN109496305B - Social network public opinion evolution method - Google Patents


Info

Publication number
CN109496305B
CN109496305B · CN201880001570.9A
Authority
CN
China
Prior art keywords
media
agent
concept
gossip
public opinion
Prior art date
Legal status
Active
Application number
CN201880001570.9A
Other languages
Chinese (zh)
Other versions
CN109496305A (en)
Inventor
侯韩旭
郝建业
张程伟
Current Assignee
Dongguan University of Technology
Original Assignee
Dongguan University of Technology
Priority date
Filing date
Publication date
Application filed by Dongguan University of Technology
Publication of CN109496305A
Application granted
Publication of CN109496305B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Abstract

The invention provides a social network public opinion evolution method and belongs to the field of reinforcement learning methods. The method comprises two types of agents: Gossiper agents, which simulate ordinary people in a social network, and Media agents, which simulate media or public figures that aim to attract ordinary people in the social network. Each Media agent uses a Nash equilibrium strategy on the continuous action space to compute the concept with the best return, updates its concept, and broadcasts it in the social network. The beneficial effects of the invention are that each agent maximizes its own benefit in the process of interacting with the other agents and ultimately learns a Nash equilibrium.

Description

Social network public opinion evolution method
Technical Field
The invention relates to a Nash equilibrium strategy, in particular to a Nash equilibrium strategy on a continuous action space, and also relates to a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space.
Background
In the environment of a continuous action space, on the one hand, the agent has infinitely many actions to choose from, and traditional table-based Q-learning algorithms cannot store an infinite number of return estimates; on the other hand, in a multi-agent environment, the continuous action space further increases the difficulty of the problem.
In the field of multi-agent reinforcement learning, the action space of an agent can be a discrete finite set or a continuous set. Because the nature of reinforcement learning is to find the optimum through continuous trial and error, a continuous action space offers infinitely many action choices, and the multi-agent setting further increases the dimensionality of the action space, which makes it difficult for general reinforcement learning algorithms to learn the global optimum (or equilibrium).
At present, most algorithms solve continuous problems with function approximation techniques and can be divided into two types: value approximation algorithms [1-5] and policy approximation algorithms [6-9]. Value approximation algorithms explore the action space and estimate a corresponding value function from the returns, while policy approximation algorithms define the policy as a probability distribution function over the continuous action space and learn the policy directly. The performance of such algorithms depends on the accuracy of the estimate of the value function or policy, which often cannot be guaranteed when dealing with complex problems such as nonlinear control problems. In addition, there are sampling-based algorithms [10,11] that maintain a discrete action set, use traditional discrete algorithms to select the optimal action within that set, and then update the action set with a resampling mechanism so as to gradually learn the optimal action. Such algorithms can easily be combined with conventional discrete algorithms, but have the disadvantage of requiring a long convergence time. All of the above algorithms are designed to compute an optimal policy in a single-agent environment and cannot be applied directly to learning in a multi-agent environment.
In recent years much work has used agent-based simulation techniques to study public opinion evolution in social networks [12-14]. Given groups with different initial concept distributions, the question is whether the groups eventually reach consensus on their concepts during their interaction, split into two camps, or remain persistently disordered [15]. The key to this problem is understanding the dynamics of public opinion evolution and thereby the underlying reasons for consensus [15]. For the public opinion evolution problem in social networks, researchers have proposed various multi-agent learning models [16-20]; among them, [21-23] study the influence of factors such as different degrees of information sharing or exchange on public opinion evolution. The works [24-28] adopt evolutionary game theory models to study how the behaviour of agents (e.g., defection and cooperation) evolves through interactions with partners. These works model the behaviour of the agents and assume that all agents are identical. In practice, however, individuals may play different roles in society (e.g., leaders or followers), which cannot be modelled accurately by the above methods. To this end, Quattrociocchi et al. [12] divided a social population into two parts, media and ordinary people, where the concept of an ordinary person is influenced by the media it follows and by other ordinary people, and the concept of a media is influenced by the outstanding ones among the media. Subsequently, Zhao et al. [29] proposed a leader-follower consensus model to explore the formation of consensus. In both works, the adjustment strategy of an agent's concept is to imitate a leader or a successful peer. Related imitation-based work also includes Local Majority [30], Compliance [31], and Activating Neighbor [32]. In a real environment, however, the strategies people adopt when making decisions are far more complex than simple imitation: people usually decide their behaviour by constantly interacting with an unknown environment and combining the knowledge they possess. Furthermore, imitation-based strategies cannot guarantee that an algorithm learns the global optimum, because the quality of an agent's strategy depends on the leader's or the imitated strategy, which is not necessarily optimal.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a nash equilibrium strategy on a continuous action space and also provides a social network public opinion evolution model based on the nash equilibrium strategy on the continuous action space.
The invention comprises the following steps:
(1) setting constants α_ub and α_us, where α_ub, α_us, α_Q, α_σ ∈ (0,1) are learning rates;
(2) initializing the parameters, which comprise the mean u_i of the desired action of agent i, the cumulative average strategy x̄_i, a constant C, the variance σ_i, and the cumulative average reward Q_i;
(3) repeating the following steps until the cumulative average strategy x̄_i of the sampled actions of agent i converges:
(3.1) randomly selecting an action x_i from the normal distribution N(u_i, σ_i) at a certain exploration rate;
(3.2) performing the action x_i and obtaining the return r_i from the environment;
(3.3) if the return r_i received after agent i performs action x_i is greater than the current cumulative average reward Q_i, the learning rate of u_i is α_ub, otherwise it is α_us; updating u_i according to the selected learning rate;
(3.4) updating the variance σ_i according to the learning of u_i;
(3.5) if the return r_i received after agent i performs action x_i is greater than the current cumulative average reward Q_i, the learning rate is α_ub, otherwise it is α_us; updating Q_i according to the selected learning rate;
(3.6) updating the cumulative average strategy x̄_i according to the constant C and the action x_i;
(4) outputting the cumulative average strategy x̄_i as the final action of agent i.
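A minimal Python sketch of steps (1)-(4) is given below for a one-dimensional action space and a user-supplied reward function. The patent text fixes only the win-or-learn-slow choice between α_ub and α_us; the concrete update formulas for u_i, σ_i and the running average x̄_i used here are illustrative assumptions, as are all names and parameter values.

```python
import numpy as np

def wols_cala(reward, steps=5000, alpha_ub=0.05, alpha_us=0.01,
              alpha_sigma=0.001, sigma_L=0.02, K=1.0, seed=0):
    """Sketch of a single WoLS-CALA agent (steps (1)-(4) above).

    reward: callable mapping an action x in [0, 1] to a scalar return.
    The u/sigma update forms are illustrative; the text above only fixes
    the win-or-learn-slow (WoLS) choice of learning rate.
    """
    rng = np.random.default_rng(seed)
    u, sigma, Q = 0.5, 0.3, 0.0          # mean, standard deviation, cumulative average reward
    x_bar, count = 0.0, 0                # cumulative average strategy and its sample count
    for t in range(steps):
        x = rng.normal(u, max(sigma, sigma_L))          # (3.1) sample an action
        r = reward(x)                                    # (3.2) obtain the return
        alpha = alpha_ub if r > Q else alpha_us          # (3.3)/(3.5) WoLS learning rate
        u += alpha * (r - Q) * (x - u)                   # (3.3) move u toward better actions
        sigma += alpha_sigma * ((r - Q) * ((x - u) ** 2 / max(sigma, sigma_L) ** 2 - 1.0)
                                - K * (sigma - sigma_L))  # (3.4) variance pulled toward sigma_L
        Q += alpha * (r - Q)                             # (3.5) cumulative average reward
        count += 1
        x_bar += (x - x_bar) / count                     # (3.6) cumulative average strategy
    return x_bar                                         # (4) final action of the agent

# Usage: a single agent maximising a smooth reward peaked at 0.7.
print(wols_cala(lambda x: -(x - 0.7) ** 2))
```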
In a further refinement of the invention, in step (3.3) and step (3.5) the update step of Q and the update step of u are synchronized, and in a neighborhood of u_i, Q_i can be linearized with respect to u_i as Q_i = K·u_i + C, where the slope K is the ratio ΔQ_i/Δu_i of the synchronized increments of Q_i and u_i.
In a further refinement of the invention, given a positive number σ_L and a sufficiently large positive number K, the Nash equilibrium strategies of two agents on the continuous action space eventually converge to a Nash equilibrium, where σ_L is the lower bound of the variance σ.
The invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model comprises two types of agents: Gossiper agents, which simulate ordinary people in a social network, and Media agents, which simulate media or public figures that aim to attract ordinary people in the social network. Each Media agent uses the Nash equilibrium strategy on the continuous action space to compute the concept with the best return, updates its concept, and broadcasts it in the social network.
In a further refinement, the invention comprises the following steps:
S1: the concept of each Gossiper and each Media is initialized randomly to a value in the action space [0,1];
S2: in each interaction, each agent adjusts its concept according to the following strategy until no agent changes its concept any more;
S21: each Gossiper agent randomly selects a neighbor in the Gossiper network according to a set probability, and updates its concept and the Media it follows according to the BCM (bounded confidence model) strategy;
S22: a subset G' ⊆ G of the Gossiper network is randomly sampled, and the Gossiper concepts in the subset G' are broadcast to all Media;
S23: each Media computes the concept with the best return using the Nash equilibrium strategy on the continuous action space, and broadcasts the updated concept to the whole social network.
In a further refinement of the invention, in step S21 the operating method of the Gossiper agent is as follows:
A1: concept initialization: x_i^τ = x_i^(τ-1);
A2: concept update: the agent updates its concept when the difference between its own concept and the concept of the selected agent is less than a set threshold;
A3: the agent compares the differences between its own concept and the concepts of the Media and selects one Media to follow according to a probability.
In a further refinement of the invention, in step A2, if the currently selected neighbor is Gossiper j and |x_j^τ − x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ − x_i^τ); if the currently selected neighbor is Media k and |y_k^τ − x_i^τ| < d_m, then x_i^τ ← x_i^τ + α_m(y_k^τ − x_i^τ), where d_g and d_m are the concept thresholds set for the two types of neighbors and α_g and α_m are the corresponding learning rates.
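A short Python sketch of the Gossiper concept update in steps A1-A2 follows, assuming the BCM rules reconstructed above; the function name and data layout are illustrative only.

```python
def gossiper_update(x_i, neighbor_value, is_media, d_g=0.1, d_m=0.1,
                    alpha_g=0.5, alpha_m=0.5):
    """Bounded-confidence (BCM) concept update of Gossiper i (steps A1-A2).

    neighbor_value: concept x_j of the selected Gossiper j, or y_k of Media k.
    The concept moves toward the neighbor only if their difference is
    below the corresponding threshold d_g or d_m.
    """
    d, alpha = (d_m, alpha_m) if is_media else (d_g, alpha_g)
    if abs(neighbor_value - x_i) < d:
        x_i = x_i + alpha * (neighbor_value - x_i)
    return x_i

# Example: a Gossiper at 0.30 interacting with a Media whose concept is 0.36.
print(gossiper_update(0.30, 0.36, is_media=True))   # -> 0.33
```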
In a further refinement of the invention, in step A3 Gossiper i follows Media k with probability P_ik^τ = λ_ik / Σ_{j∈M} λ_ij, where λ_ij is a weight that is positive only when |x_i − y_j| ≤ d_m and decreases as the distance |x_i − y_j| increases, and where a small positive constant δ keeps the denominator of the weight away from zero.
In a further refinement of the invention, in step S23 the return r_j of Media j is defined as the ratio of the number of Gossipers in G' who choose to follow Media j to the total number of Gossipers in G', i.e. r_j = (1/|G'|)·Σ_{i∈G'} P_ij, where P_ij denotes the probability that Gossiper i follows Media j.
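The sketch below implements the follow probability P_ij and the Media return r_j in Python. The inverse-distance form of λ_ij is an assumption chosen to satisfy the stated properties (zero beyond d_m, decreasing in |x_i − y_j|, δ keeping the denominator positive); the exact formula in the patent's drawings may differ, and all names are illustrative.

```python
import numpy as np

def follow_probabilities(x_i, media_concepts, d_m=0.1, delta=1e-3):
    """Probability P_ij that Gossiper i follows each Media j (step A3).

    lambda_ij is an inverse-distance weight chosen to match the stated
    properties; the exact form in the patent's formula images may differ.
    """
    y = np.asarray(media_concepts, dtype=float)
    lam = np.where(np.abs(x_i - y) <= d_m, 1.0 / (np.abs(x_i - y) + delta), 0.0)
    total = lam.sum()
    return lam / total if total > 0 else lam     # all-zero row: follows no Media

def media_rewards(gossiper_sample, media_concepts, d_m=0.1):
    """Return r_j of Media j: expected share of Gossipers in G' that follow j."""
    P = np.array([follow_probabilities(x, media_concepts, d_m) for x in gossiper_sample])
    return P.mean(axis=0)

# Example: two Media at 0.4 and 0.6 competing for three sampled Gossipers.
print(media_rewards([0.35, 0.45, 0.62], [0.4, 0.6]))
```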
In a further refinement of the invention, the existence of one Media accelerates the convergence of the public opinion of the Gossiper agents toward a common value; in an environment with multiple competing Media, the dynamic change of each Gossiper agent's concept is a weighted average of the influences of the individual Media.
Compared with the prior art, the beneficial effects of the invention are that, in a continuous action space environment, each agent maximizes its own benefit in the process of interacting with the other agents and finally learns a Nash equilibrium.
Drawings
Fig. 1 is a schematic diagram of two agents converging to the Nash equilibrium point according to the present invention, where r = 0.7 > 2/3, a = 0.4, and b = 0.6;
Fig. 2 is a schematic diagram of two agents converging to the Nash equilibrium point, where r = 0.6 < 2/3, a = 0.4, and b = 0.6;
Fig. 3 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a fully connected network without Media;
Fig. 4 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a small-world network without Media;
Fig. 5 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a fully connected network with one Media;
Fig. 6 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a small-world network with one Media;
Fig. 7 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a fully connected network with two competing Media;
Fig. 8 is a schematic diagram of the public opinion evolution of the Gossiper-Media model in a small-world network with two competing Media.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The Nash equilibrium strategy on the continuous action space of the invention extends the single-agent reinforcement learning algorithm CALA [7] (Continuous Action Learning Automaton). By introducing the WoLS (Win or Learn Slow) learning mechanism, the algorithm can effectively handle learning problems in a multi-agent environment. The Nash equilibrium strategy of the invention is therefore abbreviated as WoLS-CALA (Win or Learn Slow Continuous Action Learning Automaton). The CALA algorithm is first described in detail.
The Continuous Action Learning Automaton (CALA) [7] is a policy-gradient reinforcement learning algorithm for learning problems with a continuous action space, in which the strategy of the agent is defined as a normal distribution N(u_t, σ_t) over the action space.
The policy update of a CALA agent is as follows: at time t, the agent selects an action x_t according to the normal distribution N(u_t, σ_t); it performs both x_t and u_t and obtains the corresponding returns V(x_t) and V(u_t) from the environment, which means that the algorithm needs to perform two actions in each interaction with the environment; finally, the mean and the variance of the normal distribution N(u_t, σ_t) are updated as

u_(t+1) = u_t + α_u · [(V(x_t) − V(u_t))/φ(σ_t)] · [(x_t − u_t)/φ(σ_t)],    (1)
σ_(t+1) = σ_t + α_σ · [(V(x_t) − V(u_t))/φ(σ_t)] · [((x_t − u_t)/φ(σ_t))² − 1] − α_σ·K·(σ_t − σ_L),    (2)

where φ(σ) = max(σ, σ_L), α_u and α_σ are learning rates, and K is a positive constant used to control the convergence of the algorithm. Specifically, the size of K is related to the number of learning iterations and is usually set on the order of 1/N, where N is the number of iterations of the algorithm, and σ_L is the lower bound of the variance σ. The algorithm keeps updating the mean and the variance until u no longer changes and σ_t tends to σ_L; the mean u after convergence points to an optimal solution of the problem. The size of σ in equation (2) determines the exploration ability of the CALA algorithm: the larger σ_t, the more likely CALA is to find potentially better actions.
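The CALA update of equations (1)-(2) can be summarized by the following Python sketch. It assumes the environment can be queried for both V(x_t) and V(u_t) in every step, which is precisely the requirement that the WoLS-CALA extension below removes; the formulas follow the standard automaton of [7] as reconstructed above, not the patent's original drawings.

```python
import numpy as np

def cala_step(u, sigma, V, alpha_u=0.01, alpha_sigma=0.001, K=1.0,
              sigma_L=0.02, rng=np.random.default_rng()):
    """One CALA update (equations (1)-(2)), needing returns for both x_t and u_t.

    V: reward function; phi = max(sigma, sigma_L) bounds the denominator.
    """
    phi = max(sigma, sigma_L)
    x = rng.normal(u, phi)                       # sample an action
    dV = (V(x) - V(u)) / phi                     # normalised reward difference
    u_new = u + alpha_u * dV * (x - u) / phi
    sigma_new = sigma + alpha_sigma * (dV * ((x - u) ** 2 / phi ** 2 - 1.0)
                                       - K * (sigma - sigma_L))
    return u_new, sigma_new

# Usage: iterate until u stops changing and sigma approaches sigma_L;
# u drifts toward the maximiser of V.
u, s = 0.2, 0.3
for _ in range(2000):
    u, s = cala_step(u, s, lambda a: -(a - 0.7) ** 2)
print(round(u, 2))
```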
By definition, the CALA algorithm is a policy-gradient-type learning algorithm. It has been proved theoretically that CALA can find a local optimum when the reward function V(x) is sufficiently smooth [7]. De Jong et al. [34] extended CALA to a multi-agent environment by improving the reward function and verified experimentally that their improved algorithm can converge to a Nash equilibrium. The WoLS-CALA proposed by the invention introduces the WoLS mechanism to solve the multi-agent learning problem, and it is analysed and proved theoretically that the algorithm can learn a Nash equilibrium in a continuous action space.
CALA requires the agent to obtain, in every learning step, the returns of both a sampled action and the desired action; this is not feasible in most reinforcement learning environments, where the agent can typically perform only one action per interaction with the environment. The invention therefore extends CALA in two respects, Q-value function estimation and a variable learning rate, and proposes the WoLS-CALA algorithm.
1. Q function estimation
In an independent multi-agent reinforcement learning environment, an agent selects one action at a time and then obtains a return from the environment. A natural way to handle the dynamically changing distribution is to use a Q value to estimate the average return of the desired action u. Specifically, the expected return Q_i of the action u_i of agent i in formula (1) can be estimated by

Q_i^(t+1) = (1 − α_Q)·Q_i^t + α_Q·r_i^t,    (3)

where x_i^t is the action sampled at time t, r_i^t is the return received by agent i when selecting action x_i^t, which is determined by the joint action of all agents at time t, and α_Q is the learning rate of agent i for Q. The update in equation (3) is a common way for reinforcement learning to estimate the value function of a single state; its essence is to use r_i to estimate the statistical mean Q_i. A further advantage is that Q_i can be updated one sample at a time, and the weight of the newly received return in the Q-value estimate is always α_Q.
According to equation (3), the update procedure of σ (equation (2)) and the update procedure of u (equation (1)) can be rewritten as

σ_i^(t+1) = σ_i^t + α_σ^i · [(r_i^t − Q_i^t)/φ(σ_i^t)] · [((x_i^t − u_i^t)/φ(σ_i^t))² − 1] − α_σ^i·K·(σ_i^t − σ_L),    (4)
u_i^(t+1) = u_i^t + α_u^i · [(r_i^t − Q_i^t)/φ(σ_i^t)] · [(x_i^t − u_i^t)/φ(σ_i^t)],    (5)

where x_i^t is the action sampled at time t, r_i^t is the return received by agent i when selecting x_i^t, determined by the joint action of all agents at time t, and α_u^i and α_σ^i are the learning rates of agent i for u_i and σ_i.
However, directly using the Q-function estimate in a multi-agent environment introduces a new problem. In a multi-agent environment the return of an agent is affected by the other agents, whose policy changes may make the environment non-stationary, and the update in equation (5) does not guarantee that u can adapt to such dynamic changes. As a simple example, assume that at time t agent i has already learned the currently optimal action, u_i^t = x_i*, and that Q_i^t is an accurate estimate of the expected return of x_i*. By definition, at time t the return of x_i* is at least as large as that of any other action x_i. Substituting equation (3) into equation (5), the increment of u_i is proportional to (r_i^t − Q_i^t)(x_i^t − u_i^t). If the environment stays the same, then r_i^t − Q_i^t ≈ 0 continues to hold and u_i remains at x_i*. If, however, the environment changes, the other agents switch their policies, and x_i* is no longer the optimal action, then there exist actions x_i whose corresponding returns satisfy r_i^t(x_i) > r_i^t(x_i*). In this case, if the update of u_i in equation (5) is simply continued, u_i stays far from such an x_i, whereas theoretically, to keep the estimate accurate, u_i should move close to x_i. Because Q is a statistical estimate of r, Q is updated more slowly than r changes, so that r_i > Q_i keeps holding for many subsequent updates; under repeated sampling u_i therefore remains in the vicinity of x_i* and does not change, although it should change in order to find the new optimal action. The root cause of this problem is the non-stationarity introduced by the multi-agent environment: traditional estimation methods (such as Q-learning) cannot effectively cope with changes of the environment.
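The lag described above can be reproduced in a few lines of Python: a running average updated as in equation (3) tracks a stationary return well but trails it after an abrupt change, which is what motivates the variable learning rate of the WoLS rule in the next subsection. The toy reward and the parameter values are illustrative only.

```python
# Running-average estimate (equation (3)) against a return that jumps at t = 500.
alpha_Q, Q = 0.05, 0.0
lag = []
for t in range(1000):
    r = 1.0 if t < 500 else 0.0          # environment (other agents) changes abruptly
    Q += alpha_Q * (r - Q)
    lag.append(abs(r - Q))
print(f"estimation error just after the change: {lag[510]:.2f}")   # still far from 0
```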
2. The WoLS rule and its analysis
To estimate the expected return of u more accurately in a multi-agent environment, the invention updates the desired action u with a variable learning rate. Formally, the desired action u_i is updated according to equation (6), in which the learning rate applied to the update (5) is larger when the latest return exceeds the current Q estimate and smaller otherwise; the resulting dynamics of u_i can then be represented as equation (7), which differs from equation (5) only through this variable learning rate. Intuitively, the WoLS rule says that if the return V(x) of the agent's sampled action x is greater than the return V(u) of the desired action u, the agent should learn faster, and otherwise more slowly. This is exactly the opposite of the WoLF (Win or Learn Fast) strategy [35]. The difference is that WoLF was designed to ensure the convergence of the algorithm, whereas the WoLS strategy of the invention lets the algorithm update u in the direction of increasing return while ensuring that the expected return of the action u is estimated correctly. By analysing the learning dynamics induced by the WoLS strategy, the following conclusion can be drawn.
theorem 1 the learning dynamics of the CALA algorithm using the WoLS rule can be approximated as a gradient ascent strategy (GA) in the continuous motion space.
Proof: By the definitions above, x_t is the action selected by the agent at time t from the normal distribution N(u_t, σ_t), and V(x_t) and V(u_t) are the returns corresponding to the actions x_t and u_t respectively. Define f(x) = E[V(x_t) | x_t = x] as the expected reward function of action x. Assuming that α_u is infinitesimal, the dynamics of u_t in the WoLS-CALA algorithm can be described by an ordinary differential equation (equation (8)) obtained by averaging the update over the sampling density dN(u, σ) of the normal distribution with mean u and variance σ². Substituting x = u + y, expanding f(x) in equation (8) as a Taylor series at y = 0 and simplifying yields equation (9), whose leading term is proportional to σ²·f'(u); note that in equation (9) the remaining higher-order term and σ² are both bounded.
The update procedure of the standard deviation σ (equation (4)) is the same as in the original CALA algorithm, so the conclusion for CALA can be used directly: given a sufficiently large positive number K, σ eventually converges to σ_L. Combining this with equation (9), the invention draws the following conclusion:
for a small positive number σ_L (e.g., 1/10000), after a sufficiently long time the dynamics of u_t can be approximated by the ordinary differential equation

du/dt = ε·f'(u),    (10)

where ε is a small positive constant and f'(u) is the gradient of the function f(u) at u. Equation (10) states that u changes in the gradient direction of f(u), i.e., the direction in which f(u) increases fastest; the dynamic trajectory of u can therefore be approximated by a gradient ascent strategy.
When only one agent is present, the dynamics of u eventually converge to an optimal point, since at an optimal point u = u* the gradient f'(u*) = 0 and hence du/dt = 0.
Theorem 1 shows that the learning dynamics of the desired action of a CALA agent using the WoLS rule resemble the gradient ascent strategy above, i.e., its derivative with respect to time can be written in the form of equation (10). If f(u) has multiple local optima, whether the algorithm finally converges to the global optimum depends on the trade-off the algorithm makes between exploration and exploitation [36], which is a general problem in the field of reinforcement learning. A common way to explore for the global optimum is to give the initial exploration rate σ (i.e., the standard deviation) a large value and the initial learning rate α_σ for σ a particularly small value, so that the algorithm can sample sufficiently often over the whole action space. Since the desired action u of the CALA algorithm with the WoLS rule can converge even when the standard deviation σ is not 0, the lower bound σ_L of the exploration rate σ can be given a relatively large value. Combining these strategies, the global optimum can be learned by choosing appropriate parameters for the algorithm.
Another problem is that a pure gradient ascent strategy may fail to converge in a multi-agent environment. The invention therefore combines the PHC (Policy Hill Climbing) [35] algorithm and proposes an Actor-Critic type independent multi-agent reinforcement learning algorithm, called WoLS-CALA. The main idea of the Actor-Critic architecture is that the evaluation of the policy and the update of the policy are learned in separate processes: the part that handles policy evaluation is called the Critic, and the part that updates the policy is called the Actor. The specific learning procedure is as follows (Algorithm 1).
Algorithm 1: learning strategy of WoLS-CALA agent i
(The pseudocode of Algorithm 1, given as a drawing in the original, corresponds to steps (1)-(4) listed in the Disclosure of Invention.)
For simplicity, two constants α_ub and α_us (α_ub > α_us) are used in Algorithm 1 in place of the learning rate of u_i defined in equation (6). If the return r_i received after agent i performs action x_i is greater than the current cumulative average reward Q_i, the learning rate of u_i is α_ub ("winning"); otherwise ("losing") it is α_us (step 3.3). Because formulas (7) and (4) contain the denominator φ(σ_i^t), a small error has a large influence on the updates of u and σ when the denominator is small; using two fixed step sizes makes the update process of the algorithm easier to control and to implement in experiments. Furthermore, note that the update step size of Q and that of u are synchronized in step 3.5 of the algorithm, i.e., both are α_ub when r_i > Q_i and both are α_us otherwise. Because α_ub and α_us are both very small numbers, Q_i can be linearized with respect to u_i in a very small neighborhood of u_i as Q_i = K·u_i + C, where the slope K is the ratio ΔQ_i/Δu_i of the synchronized increments: if u_i changes by Δu_i, then Q_i changes by ΔQ_i = K·Δu_i. This, too, serves to estimate the expected return of u more accurately. Finally (step 4), the algorithm uses the convergence of the cumulative average x̄_i as the loop termination condition and outputs x̄_i, mainly to prevent the algorithm from failing to terminate when u_i exhibits periodic solutions in a competitive environment. Note that the variables x̄_i and u_i have different meanings: x̄_i is the cumulative statistical average of the sampled actions of agent i, whose final value converges to the Nash equilibrium strategy in the multi-agent environment, whereas u_i is the expected mean of the strategy distribution of agent i, which may oscillate periodically around the equilibrium point in a competitive environment. A detailed explanation is given in Theorem 2 below.
Because dynamic trajectories in a high-dimensional space may exhibit chaos, it is difficult to analyse the dynamic behaviour of the algorithm qualitatively when there are many agents. Dynamic analyses of related multi-agent algorithms in the field are essentially based on two agents [35, 37-39]; the case of two WoLS-CALA agents is therefore mainly analysed here.
Theorem 2: Given a positive number σ_L and a sufficiently large positive number K, the strategies of two WoLS-CALA agents eventually converge to a Nash equilibrium.
Proof: Nash equilibria can be divided into two categories according to the location of the equilibrium point: equilibrium points on the boundary of the continuous action space (a bounded set) and equilibrium points in the interior of the continuous action space. This example focuses on the second class, since equilibrium points on the boundary can be treated as interior equilibrium points of a space of one dimension lower. The dynamics of an ordinary differential equation depend on the stability properties of its interior equilibrium points [40], so this example first computes the equilibrium points of equation (10) and then analyses their stability.
Let x_i^t denote the action sampled by agent i at time t from the normal distribution N(u_i^t, σ_i^t), and let f_1(u_1, u_2) and f_2(u_1, u_2) denote the expected returns corresponding to the desired actions u_1 and u_2 of the two agents. If eq = (u_1*, u_2*) is an equilibrium point of equation (10), then the gradient of each agent's expected return with respect to its own action vanishes at eq, i.e., ∂f_i/∂u_i = 0 at eq for i = 1, 2.
According to nonlinear dynamics theory [40], the stability of the point eq is determined by the eigenvalues of the Jacobian matrix M of the right-hand side of equation (10) evaluated at eq, whose entries are proportional to the second derivatives ∂²f_i/(∂u_i∂u_j) (equation (11)), the off-diagonal entries being those with i ≠ j.
In addition, by the definition of Nash equilibrium, a Nash equilibrium point eq satisfies f_i(u_i*, u_j*) ≥ f_i(u_i, u_j*) for every unilateral deviation u_i, which at an interior equilibrium implies that the diagonal second derivatives ∂²f_i/∂u_i² are non-positive at eq (equation (12)).
Substituting equation (12) into M shows that the eigenvalues at a Nash equilibrium point belong to one of the following three cases:
(a) All eigenvalues of the matrix M have negative real parts. Such an equilibrium point is asymptotically stable, i.e., all trajectories around eq eventually converge to this equilibrium point.
(b) All eigenvalues of the matrix M have non-positive real parts and include a pair of purely imaginary eigenvalues. Such equilibrium points are stable, but the limit set of nearby trajectories is a periodic solution rather than the point itself. In this case it is easy to show that the time average of the trajectory converges to the Nash equilibrium, i.e., x̄_i eventually converges to the Nash equilibrium. Since WoLS-CALA outputs the running average x̄_i, the algorithm can also handle this type of equilibrium point.
(c) The matrix M has an eigenvalue with positive real part, i.e., the equilibrium point is unstable. For such equilibrium points, according to nonlinear dynamics theory the trajectories around the unstable equilibrium can be divided into two classes: trajectories on the stable manifold and all other trajectories [40]. The stable manifold is the subspace generated by the eigenvectors corresponding to the stable eigenvalues; trajectories in the stable manifold theoretically converge to this equilibrium point, but, taking randomness and computational error into account, the probability that the algorithm remains in this subspace is 0. All trajectories not belonging to the stable manifold gradually move away from the equilibrium point and eventually converge to one of the other types of equilibrium analysed above, i.e., to an equilibrium point on the boundary or to an equilibrium point of type (a) or (b).
Furthermore, similar to the single-agent environment, if there are multiple equilibrium points then, by the analysis of Theorem 1, given an appropriate exploration-exploitation setting (e.g., σ_L sufficiently large, a large initial value of σ and a small learning rate), the algorithm can converge to a Nash equilibrium point (the global optimum for each agent when the other agent's policy is unchanged). In conclusion, this completes the proof that the algorithm converges to a Nash equilibrium.
The invention also provides a social network public opinion evolution model based on the Nash equilibrium strategy on the continuous action space. The model comprises two types of agents, Gossiper agents simulating ordinary people in a social network and Media agents simulating media or public figures that aim to attract ordinary people in the social network, and is therefore also called the Gossiper-Media model. Each Media agent uses the Nash equilibrium strategy on the continuous action space to compute the concept with the best return, updates its concept, and broadcasts it in the social network. The invention applies the WoLS-CALA algorithm to the study of public opinion evolution in real social networks: the media in the network are modelled with WoLS-CALA, and the influence of competing media on social public opinion is discussed.
This is explained in detail below:
1. The Gossiper-Media model
The invention proposes a multi-agent reinforcement learning framework, the Gossiper-Media model, to study the evolution of group public opinion. The Gossiper-Media model contains two classes of agents, Gossiper agents and Media agents. A Gossiper agent simulates a member of the general public in a real network, and its concept (opinion) is influenced simultaneously by the Media and by other Gossipers; a Media agent simulates a media outlet or public figure in the social network whose purpose is to attract the public, and such an agent actively chooses its concept so as to maximize the number of its followers. Consider a network with N agents, where the number of Gossipers is |G| and the number of Media is |M| (N = |G| + |M|). It is assumed that there is full connectivity between the Gossipers and the Media, i.e., each Gossiper can select any Media to interact with, with equal probability, whereas the Gossipers themselves are not assumed to be fully connected, i.e., each Gossiper may interact only with its own neighbors. The network among the Gossipers is determined by the social relationships among them. In the following simulation experiments, this example defines two Gossiper networks: fully connected networks and small-world networks. The concepts of Gossiper i and Media j are denoted x_i and y_j respectively. The interaction process of the agents in the model follows Algorithm 2.
Algorithm 2: concept learning model in the Gossiper-Media network
(The pseudocode of Algorithm 2, given as a drawing in the original, corresponds to steps S1-S23 in the Disclosure of Invention.)
First, the concept of each Gossiper and each Media is initialized randomly to a value in the action space [0,1] (step 1). Then, in each interaction, every agent adjusts its concept according to its own strategy until the algorithm converges (no agent changes its concept any more). Each Gossiper agent first chooses the object to interact with: a Gossiper is selected at random from its neighbors with probability ξ, or a Media is selected at random with probability 1 − ξ (step 2.1). The Gossiper then updates its concept according to Algorithm 3 and, based on the differences between its concept and the concepts of the Media, chooses to follow the Media closest to its own concept. It is assumed that the Media can randomly obtain, by sampling, the concepts of a part of the Gossipers, denoted G', and that these are broadcast to all Media (step 2.2). The Media then play a game against each other using the WoLS-CALA algorithm, compute the concepts that maximize their numbers of followers, and broadcast the updated concepts throughout the network (step 2.3). In principle, each Media could also sample independently, so that the sets G' they obtain differ; this has little influence on the learning of the WoLS-CALA algorithm, because the concept distribution of G' is theoretically the same as that of G. This assumption is made mainly for simplicity and to reduce the uncertainty introduced by random sampling.
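The following Python sketch puts Algorithm 2 together on a fully connected Gossiper network. It reuses the BCM update and the WoLS learning-rate choice described above, but the per-round Media update, the follower-share reward and all parameter values are simplified, illustrative assumptions rather than the exact procedure of Algorithms 1-4.

```python
import numpy as np

def simulate(num_gossipers=200, num_media=2, rounds=2000, xi=0.5,
             d_g=0.1, d_m=0.1, alpha_g=0.5, alpha_m=0.5, seed=1):
    """Sketch of Algorithm 2 on a fully connected Gossiper network.

    Gossipers follow the BCM update; each Media performs one WoLS-CALA-style
    step per round on its concept y_j, using a simplified follower share as reward.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, num_gossipers)        # step 1: random initial concepts
    y = np.full(num_media, 0.5)                     # Media start at 0.5
    u, sigma, Q = y.copy(), np.full(num_media, 0.2), np.zeros(num_media)
    for _ in range(rounds):
        for i in range(num_gossipers):              # step 2.1: Gossiper updates
            if rng.random() < xi:                   # interact with another Gossiper
                j = rng.integers(num_gossipers)
                if abs(x[j] - x[i]) < d_g:
                    x[i] += alpha_g * (x[j] - x[i])
            else:                                   # interact with a random Media
                k = rng.integers(num_media)
                if abs(y[k] - x[i]) < d_m:
                    x[i] += alpha_m * (y[k] - x[i])
        G_prime = rng.choice(x, size=int(0.8 * num_gossipers), replace=False)  # step 2.2
        for j in range(num_media):                  # step 2.3: Media learning step
            y[j] = np.clip(rng.normal(u[j], sigma[j]), 0.0, 1.0)
            r = np.mean(np.abs(G_prime - y[j]) <= d_m)      # share of nearby Gossipers
            alpha = 0.05 if r > Q[j] else 0.01              # WoLS learning rate
            u[j] += alpha * (r - Q[j]) * (y[j] - u[j])
            Q[j] += alpha * (r - Q[j])
            sigma[j] = max(0.02, sigma[j] * 0.999)          # decaying exploration
    return x, y

gossiper_concepts, media_concepts = simulate()
print(np.round(media_concepts, 2))
```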
1.1 The Gossiper strategy
The policy of each Gossiper consists of two parts: 1) how to update its concept; 2) how to select a Media to follow. The details are as follows (Algorithm 3).
Algorithm 3: strategy of Gossiper i in round τ
(The pseudocode of Algorithm 3, given as a drawing in the original, corresponds to steps A1-A3 in the Disclosure of Invention.)
For Gossiper i, its concept is first initialized: x_i^τ = x_i^(τ-1) (step 1). The Gossiper then updates its concept following the BCM (bounded confidence model) strategy [12, 33] (step 2). BCM is a widely used model for describing group opinions, in which an agent is influenced only by agents whose concepts are close to its own. In Algorithm 3, a Gossiper updates its concept only when the concept of the selected agent differs from its own by less than the threshold d_g (or d_m), where d_g and d_m correspond to the selected agent being a Gossiper or a Media respectively. The size of the threshold d_g (or d_m) represents how readily the Gossiper accepts new ideas; intuitively, the larger d, the more easily the Gossiper is influenced by other agents [41-43]. The Gossiper then compares the differences between its own concept and the concepts of the Media and selects one Media to follow according to a probability (step 3). The probability that Gossiper i chooses to follow Media j at time τ is denoted P_ij^τ and satisfies the following properties:
(i) when |x_i − y_j| > d_m, P_ij = 0;
(ii) P_ij > 0 if and only if the concept y_j of Media j satisfies |x_i − y_j| ≤ d_m;
(iii) P_ij decreases as the distance |x_i − y_j| between the concepts x_i and y_j increases.
Note that if |x_i − y_j| > d_m for every Media j ∈ M, then Σ_{j∈M} P_ij = 0, which means that a Gossiper may follow no Media at all. In the weights λ_ij that define P_ij, the parameter δ is a small positive number that prevents the denominator of the fraction from being 0.
1.2 The Media strategy
For a given set of Gossiper concepts, each Media can adapt its own concept by learning to cater to the Gossipers' preferences and thereby attract more Gossipers to follow it. In a multi-agent system with several Media, Nash equilibrium is the final stable state reached when the agents compete with one another; in this state no agent can obtain a higher return by unilaterally changing its own policy. Since the action space of a Media is continuous (a concept is defined as any point in the interval [0,1]), the behaviour of a Media is modelled here with the WoLS-CALA algorithm; Algorithm 4 is the Media strategy constructed on the basis of WoLS-CALA.
Algorithm 4: strategy of Media j in round τ
(The pseudocode of Algorithm 4, given as a drawing in the original, applies the WoLS-CALA learning step of Algorithm 1 to the Media concept y_j.)
The current return r_j of Media j is defined as the ratio of the number of Gossipers in G' who choose to follow Media j to the total number of Gossipers in G', i.e. r_j = (1/|G'|)·Σ_{i∈G'} P_ij, where λ_ij is defined as in Algorithm 3 and P_ij denotes the probability that Gossiper i follows Media j.
2. Dynamic analysis of group public opinion
Let {y_j}_{j∈M}, y_j ∈ (0,1), be the concepts of the Media. Assuming that the Gossiper network is infinite, the concept distribution of the Gossipers can be represented by a continuous distribution density function; let p(x, t) denote the probability density function of the concept distribution of the Gossiper population at time t. The public opinion evolution of the Gossipers can then be expressed through the partial derivative of p(x, t) with respect to time. This example first considers the case where only one Media is present.
Theorem 3: In a Gossiper-Media network containing only one Media, the evolution of the Gossiper concept distribution follows

∂p(x,t)/∂t = ξ·Ψ_g(x,t) + (1 − ξ)·Ψ_m(x,t),    (14)

where Ψ_g(x,t) is the rate of change of p(x,t) caused by interactions with other Gossipers (equation (20) below), Ψ_m(x,t) is the rate of change caused by the Media (equation (23) below), and I_1 = {x : |x − y| < (1 − α_m)·d_m} and I_2 = {x : d_m ≥ |x − y| ≥ (1 − α_m)·d_m} are the sets on which Ψ_m(x,t) is supported.
Proof: Based on the MF (mean field) approximation theory [40], the partial derivative with respect to t of the probability distribution of the BCM-based Gossiper concepts can be written as [12]

∂p(x,t)/∂t = ∫ [ W_(x+y→x)·p(x+y,t) − W_(x→x+y)·p(x,t) ] dy,    (17)

where W_(x+y→x) denotes the probability that a Gossiper whose concept equals x+y changes its concept to x, so that W_(x+y→x)·p(x+y)·dy is the fraction of agents whose concepts move from the interval (x+y, x+y+dy) to x within the time interval (t, t+dt); similarly, W_(x→x+y) is the probability that an agent with concept x changes its concept to x+y, and W_(x→x+y)·p(x)·dy is the fraction of Gossipers with concept equal to x that move to the interval (x+y, x+y+dy).
According to the definition of Algorithm 3, a Gossiper is influenced by the concept of another Gossiper with probability ξ or by a Media concept with probability 1 − ξ, and then makes its own decision. Splitting W_(x+y→x) and W_(x→x+y) into the parts caused by other Gossiper concepts and by Media concepts, denoted w^[g] and w^[m] respectively, W_(x→x+y) and W_(x+y→x) can be expressed as

W = ξ·w^[g] + (1 − ξ)·w^[m].    (18)

Substituting equation (18) into equation (17) gives equation (19), in which the Gossiper part and the Media part appear with weights ξ and 1 − ξ respectively.
definition of
Figure GDA0003490221330000153
Figure GDA0003490221330000154
Therein Ψg(x, t) represents the rate of change of the probability density function p (x, t) of the agent g concept as affected by gossip. Weisbuch G [45]Et al have demonstrated Ψg(x, t) obeys the following formula,
Figure GDA0003490221330000155
here, the
Figure GDA0003490221330000156
Is the second order partial derivative of p with respect to x. Alpha is alphagIs a real number between 0 and 0.5. dgIs the gossip threshold.
Ψ_m(x,t) represents the rate of change of the concept distribution density function p(x,t) caused by the Media. Suppose the concept of Media j is u_j (u_j = x + d_j); then the concept distribution of the Media can be represented with the Dirac delta function as q(x) = δ(x − u_j). The Dirac delta function δ(x) [46] is often used to model a narrow, high spike (pulse) and similar abstractions such as a point charge or a point mass; it is zero everywhere except at x = 0 and integrates to 1. The transfer rate w^[m]_(x+y→x) from x+y to x can be expressed as

w^[m]_(x+y→x) = ∫ δ(x − [(x+y) + α_m·((x+z) − (x+y))])·q(x+z) dz,    (21)

where the term δ(x − [(x+y) + α_m·((x+z) − (x+y))]) indicates that the event occurs in which the concept x+y is shifted to x under the influence of the concept x+z, and q(x+z) is the distribution density of the Media at the concept x+z. In the same way, w^[m]_(x→x+y) can be expressed as

w^[m]_(x→x+y) = ∫ δ((x+y) − [x + α_m·((x+z) − x)])·q(x+z) dz.    (22)

Combining equations (21) and (22), calculating and simplifying yields the expression (23) for Ψ_m(x,t), which is supported on the sets I_1 = {x : |x − y| < (1 − α_m)·d_m} and I_2 = {x : d_m ≥ |x − y| ≥ (1 − α_m)·d_m}. Substituting (20) and (23) into (19) completes the proof.
Equation (14) shows that the rate of change of p(x,t) is a weighted average of Ψ_g(x,t) and Ψ_m(x,t): the former represents the part of the public opinion change driven by the Gossiper network, the latter the part driven by the Media. The pure-Gossiper term Ψ_g(x,t) has already been analysed by Weisbuch et al. [45]; an important property is that, starting from any distribution, the local maxima of the distribution density are gradually reinforced, which indicates that in a pure Gossiper network public opinion gradually concentrates. Furthermore, it can be seen from Theorem 3 that neither Ψ_g(x,t) nor Ψ_m(x,t) depends on the concrete Gossiper network, which means that when the network is infinite the development of public opinion is not affected by the network structure.
Next, the second part of equation (14), Ψ_m(x,t) (equation (23)), is analysed. Assuming that y is constant, analysing equation (23) yields equation (24), which shows intuitively that the opinions of the Gossipers whose concepts are similar to the Media concept converge toward this Media. The following conclusion can therefore be drawn.
Corollary 1: The existence of a Media accelerates the convergence of the Gossiper opinions toward consensus.
The example next considers the case where multiple Media exist. Define P_j(x) as the probability that a Gossiper whose concept is x is influenced by Media j, defined analogously to P_ij in Algorithm 3. In an environment with multiple competing Media, the dynamic change of the Gossiper concept distribution can then be expressed as a weighted average of the influences of the individual Media, which gives the following conclusion.
Corollary 2: The dynamic change of the distribution function of the Gossiper concepts obeys

∂p(x,t)/∂t = ξ·Ψ_g(x,t) + (1 − ξ)·Σ_{j∈M} P_j(x)·Ψ_m^(j)(x,t),    (25)

where Ψ_g(x,t) is defined by equation (20) and Ψ_m^(j)(x,t) is the Media term of equation (23) with y replaced by the concept y_j of Media j.
3. Simulation experiment and analysis
First, it is verified that the WoLS-CALA algorithm can learn a Nash equilibrium. An experimental simulation of the Gossiper-Media model is then presented to verify the results of the theoretical analysis above.
3.1 WoLS-CALA algorithm performance test
This example considers a simplified version of the Gossiper-Media model to check whether the WoLS-CALA algorithm can learn the Nash equilibrium strategy. Specifically, the problem of two Media competing for followers is modelled as the following optimization problem:

max (f_1(x, y), f_2(x, y))
s.t. x, y ∈ [0,1]    (26)

(s.t. denotes the constraint conditions, in the standard notation of optimization problems), where f_1(x, y) and f_2(x, y) are the reward functions of the two Media, r ∈ [0,1] is a parameter of the Gossiper concept distribution, and a, b ∈ [0,1] with |a − b| ≥ 0.2 are the concepts of the Gossipers.
The functions f_1(x, y) and f_2(x, y) simulate the return r_j of Algorithm 4 and represent the returns of Media 1 and Media 2 under the joint action <x, y>. This example uses two WoLS-CALA agents that independently control x and y and maximize their respective reward functions f_1(x, y) and f_2(x, y). In this model, the Nash equilibria can be divided into two classes according to the concept distribution of the Gossipers:
(i) when r > 2/3 the equilibrium point is (a, a), and when r < 1/3 the equilibrium point is (b, b);
(ii) when 1/3 ≤ r ≤ 2/3 the equilibrium point is any point in the set |x − a| < 0.1 ∧ |y − b| < 0.1 or |x − b| < 0.1 ∧ |y − a| < 0.1.
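The sketch below reproduces this test in Python. Since the exact forms of f_1 and f_2 are given only in the patent drawings, the follower-share payoff used here (a fraction r of Gossipers at concept a, the rest at b, each group following the closer Media within d_m) is an assumed reconstruction that is consistent with the equilibria listed in (i) and (ii); the two learners use the WoLS rule with illustrative parameter values.

```python
import numpy as np

def follower_share(own, other, r=0.7, a=0.4, b=0.6, d_m=0.1):
    """Reconstructed payoff f_i(x, y): share of Gossipers that follow this Media.

    A fraction r of Gossipers hold concept a and 1-r hold concept b; a group
    follows the closer Media within distance d_m and splits evenly on a tie.
    This form is an assumption consistent with the equilibria listed above.
    """
    share = 0.0
    for concept, weight in ((a, r), (b, 1.0 - r)):
        d_own, d_other = abs(concept - own), abs(concept - other)
        if d_own <= d_m and (d_other > d_m or d_own < d_other):
            share += weight
        elif d_own <= d_m and np.isclose(d_own, d_other):
            share += weight / 2.0
    return share

# Two independent WoLS-CALA-style learners controlling x and y respectively.
rng = np.random.default_rng(3)
u = np.array([0.5, 0.5]); sigma = np.array([0.3, 0.3]); Q = np.zeros(2)
for t in range(3000):
    acts = np.clip(rng.normal(u, np.maximum(sigma, 0.02)), 0.0, 1.0)
    for i in range(2):
        r_i = follower_share(acts[i], acts[1 - i])
        alpha = 0.05 if r_i > Q[i] else 0.01             # WoLS rule
        u[i] += alpha * (r_i - Q[i]) * (acts[i] - u[i])
        Q[i] += alpha * (r_i - Q[i])
    sigma = np.maximum(0.02, sigma * 0.999)
print(np.round(u, 2))   # theory above predicts equilibrium near (a, a) for r = 0.7 > 2/3
```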
In the specific simulation experiment, one point is taken in each of the two classes, namely r = 0.7 > 2/3 and r = 0.6 < 2/3. It is then observed whether the algorithm can learn the Nash equilibrium as expected under the different Gossiper concept distributions. Table 1 shows the parameter settings of WoLS-CALA.
Table 1: parameter settings
(The table of parameter values is provided as a drawing in the original and is not reproduced here.)
Fig. 1 and Fig. 2 show the simulation results of the two experiments. The Media agents in both experiments converge to the Nash equilibrium after about 3000 learning steps, i.e., to <0.4, 0.4> for r = 0.7 and to <0.4, 0.57> for r = 0.6. As shown in Fig. 1, when r = 0.7 > 2/3, a = 0.4 and b = 0.6, the two agents converge to the Nash equilibrium point (0.4, 0.4); as shown in Fig. 2, when r = 0.6 < 2/3, a = 0.4 and b = 0.6, agent 1 converges to x = 0.4 and agent 2 converges to y = 0.57.
3.2 Experimental simulation of the Gossiper-Media model
This subsection shows the simulation results of the Gossiper-Media model. Consider 200 Gossipers and experimental environments with different numbers of Media: (i) no Media; (ii) one Media; (iii) two competing Media. For each environment, this example considers two representative Gossiper networks, a fully connected network and a small-world network [47]. Through these comparative experiments, the influence of the Media on the evolution of Gossiper public opinion is discussed.
For fairness, the same parameter settings are used in every experimental environment: the three environments use the same networks and the same initial concepts of the Gossipers and the Media. The small-world network is generated randomly with the Watts-Strogatz construction [47] with connectivity p = 0.2. The initial concept of each Gossiper is sampled at random from the uniform distribution on the interval [0,1]; the initial concept of each Media is 0.5. Considering that too large a threshold would interfere with the observation of the experiment, the Gossiper-Media threshold d_m and the Gossiper-Gossiper threshold d_g are both set to the small positive number 0.1. The Gossiper learning rates α_g and α_m are set to 0.5. The set G' is sampled randomly from G and satisfies |G'| = 80% · |G|.
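The two Gossiper topologies used in the experiments can be generated with networkx, as in the Python sketch below. The rewiring probability p = 0.2 and the 200 Gossipers follow the text; the mean degree k of the Watts-Strogatz graph is not stated in the text and is an assumed value.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
N_GOSSIPERS = 200

# Fully connected Gossiper network.
full_net = nx.complete_graph(N_GOSSIPERS)

# Small-world Gossiper network via the Watts-Strogatz construction [47];
# rewiring probability p = 0.2 as in the text, mean degree k = 10 is assumed.
small_world = nx.watts_strogatz_graph(N_GOSSIPERS, k=10, p=0.2, seed=0)

# Initial concepts: Gossipers uniform on [0, 1], every Media starts at 0.5.
gossiper_concepts = rng.uniform(0.0, 1.0, N_GOSSIPERS)
media_concepts = np.full(2, 0.5)
print(small_world.number_of_edges(), gossiper_concepts[:3])
```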
Because each environment uses the two Gossiper network modes, fully connected and small-world, Figs. 3-4 show the public opinion evolution without Media under the fully connected network and the small-world network respectively; Figs. 5-6 show the evolution with one Media; and Figs. 7-8 show the evolution with two competing Media. From these figures it can first be seen that the number of points to which the different Gossiper networks finally converge is the same in each of the three Media environments: 5 with no Media, 4 with one Media, and 3 with two Media. This phenomenon is consistent with the conclusions of Theorem 3 and Corollary 2: the public opinion dynamics of the Gossipers are independent of the topology of the Gossiper network, because the dynamics under the different networks can be modelled with the same formula.
Second, it can be observed from Figs. 3-6 that the number of points to which the Gossiper opinions finally converge drops from 5 to 4 in both networks when one Media is present. This indicates that the presence of a Media accelerates the formation of Gossiper consensus, which is the conclusion of Corollary 1. Meanwhile, Figs. 5-8 show that when the number of Media increases from 1 to 2, the number of convergence points further decreases from 4 to 3 in both networks, which suggests that competing Media further accelerate the unification of Gossiper opinions.
In addition, the experimental results also verify the performance of the WoLS-CALA algorithm. In Figs. 5 and 6 the concept of the Media agent always stays around the concept held by the most Gossipers (N_max = 69 in the fully connected network and N_max = 68 in the small-world network). This agrees with the design expectation that a WoLS-CALA agent can learn the global optimum. In Figs. 7 and 8 it can be seen that with two Media, the concept of one Media stays around the concept held by the most Gossipers (N_max = 89 in both networks) while the other Media stays around the concept held by the second-most Gossipers (N'_max = 70 in the fully connected network and N'_max = 66 in the small-world network). This also agrees with the expectation of Theorem 2 that the two WoLS-CALA agents eventually converge to a Nash equilibrium. The Media concepts in Figs. 5-8 keep oscillating slightly around the Gossiper concepts because in the Gossiper-Media model the optimal strategy of a Media is not unique (every point within d_m of the Gossiper concept is optimal for the Media).
4. Summary of the invention
The invention proposes WoLS-CALA, an independently learning multi-agent reinforcement learning algorithm for continuous action spaces, and verifies both theoretically and experimentally that the algorithm can learn Nash equilibria. The algorithm is then applied to the study of public opinion evolution in a network environment. Individuals in the social network are divided into two classes, Gossipers and Media, and modelled separately: the Gossiper class represents the general public, while the Media, modelled with the WoLS-CALA algorithm, represent individuals such as social media that aim to attract public attention. By modelling the two kinds of agents separately, the invention discusses the influence of competition among different numbers of Media on Gossiper public opinion. Finally, theory and experiments show that Media competition accelerates the formation of consensus in public opinion.
The above-described embodiments are intended to be illustrative, and not restrictive, of the invention, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
The references referred to in the present invention correspond to the following references:
[1] Pazis J, Lagoudakis M G. Binary Action Search for Learning Continuous-action Control Policies[C]. In Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 2009:793-800.
[2] Pazis J, Lagoudakis M G. Reinforcement learning in multidimensional continuous action spaces[C]. In IEEE Symposium on Adaptive Dynamic Programming & Reinforcement Learning, 2011:97-104.
[3]Sutton R S,Maei H R,Precup D,et al.Fast Gradient-descent Methods for Temporal-difference Learning with Linear Function Approximation[C].In Proceedings of the 26th Annual International Conference on Machine Learning,2009:993–1000.
[4]Pazis J,Parr R.Generalized Value Functions for Large Action Sets[C].In International Conference on Machine Learning,ICML 2011,Bellevue,Washington,USA,2011:1185–1192.
[5]Lillicrap T P,Hunt J J,Pritzel A,et al.Continuous control with deep reinforcement learning[J].Computer Science,2015,8(6):A187.
[6]KONDA V R.Actor-critic algorithms[J].SIAM Journal on Control and Optimization,2003,42(4).
[7]Thathachar M A L,Sastry P S.Networks of Learning Automata:Techniques for Online Stochastic Optimization[J].Kluwer Academic Publishers,2004.
[8]Peters J,Schaal S.2008Special Issue:Reinforcement Learning of Motor Skills with Policy Gradients[J].Neural Netw.,2008,21(4).
[9]van Hasselt H.Reinforcement Learning in Continuous State and Action Spaces[M].In Reinforcement Learning:State-of-the-Art.Berlin,Heidelberg:Springer Berlin Heidelberg,2012:207–251.
[10]Sallans B,Hinton G E.Reinforcement Learning with Factored States and Actions[J].J.Mach.Learn.Res.,2004,5:1063–1088.
[11]Lazaric A,Restelli M,Bonarini A.Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods[C].In Conference on Neural Information Processing Systems,Vancouver,British Columbia,Canada,2007:833–840.
[12]Quattrociocchi W,Caldarelli G,Scala A.Opinion dynamics on interacting networks:media competition and social influence[J].Scientific Reports,2014,4(21):4938–4938.
[13]Yang H X,Huang L.Opinion percolation in structured population[J].Computer Physics Communications,2015,192(2):124–129.
[14]Chao Y,Tan G,Lv H,et al.Modelling Adaptive Learning Behaviours for Consensus Formation in Human Societies[J].Scientific Reports,2016,6:27626.
[15]De Vylder B.The evolution of conventions in multi-agent systems[J].Unpublished doctoral dissertation,Vrije Universiteit Brussel,Brussels,2007.
[16]Holley R A,Liggett T M.Ergodic Theorems for Weakly Interacting Infinite Systems and the Voter Model[J].Annals of Probability,1975,3(4):643–663.
[17] nowak A, Szamrej J, Latan Mao B.from private attribute to public option A dynamic of social impact [ J ] Psychological Review,1990,97(3): 362-.
[18]Tsang A,Larson K.Opinion dynamics of skeptical agents[C].In Proceedings of the 2014international conference on Autonomous agents and multi-agent systems,2014:277–284.
[19]Ghaderi J,Srikant R.Opinion dynamics in social networks with stubborn agents:Equilibrium and convergence rate[J].Automatica,2014,50(12):3209–3215.
[20]Kimura M,Saito K,Ohara K,et al.Learning to Predict Opinion Share in Social Networks.[C].In Twenty-Fourth AAAI Conference on Artificial Intelligence,AAAI 2010,Atlanta,Georgia,Usa,July,2010.
[21]Liakos P,Papakonstantinopoulou K.On the Impact of Social Cost in Opinion Dynamics[C].In Tenth International AAAI Conference on Web and Social Media ICWSM,2016.
[22]Bond R M,Fariss C J,Jones J J,et al.A 61-million-person experiment in social influence and political mobilization[J].Nature,2012,489(7415):295–8.
[23]Szolnoki A,Perc M.Information sharing promotes prosocial behaviour[J].New Journal of Physics,2013,15(15):1–5.
[24]Hofbauer J,Sigmund K.Evolutionary games and population dynamics[M].Cambridge;New York,NY:Cambridge University Press,1998.
[25]Tuyls K,Nowe A,Lenaerts T,et al.An Evolutionary Game Theoretic Perspective on Learning in Multi-Agent Systems[J].Synthese,2004,139(2):297–330.
[26]Szabo B G.Fath G(2007)Evolutionary games on graphs[C].In Physics Reports,2010.
[27]Han T A,Santos F C.The role of intention recognition in the evolution of cooperative behavior[C].In International Joint Conference on Artificial Intelligence,2011:1684–1689.
[28]Santos F P,Santos F C,Pacheco J M.Social Norms of Cooperation in Small-Scale Societies[J].PLoS computational biology,2016,12(1):e1004709.
[29]Zhao Y,Zhang L,Tang M,et al.Bounded confidence opinion dynamics with opinion leaders and environmental noises[J].Computers and Operations Research,2016,74(C):205–213.
[30]Pujol J M,Delgado J,Sang,et al.The role of clustering on the emergence of efficient social conventions[C].In International Joint Conference on Artificial Intelligence,2005:965–970.
[31]Nori N,Bollegala D,Ishizuka M.Interest Prediction on Multinomial,Time-Evolving Social Graph.[C].In IJCAI 2011,Proceedings of the International Joint Conference on Artificial Intelligence,Barcelona,Catalonia,Spain,July,2011:2507–2512.
[32]Fang H.Trust modeling for opinion evaluation by coping with subjectivity and dishonesty[C].In International Joint Conference on Artificial Intelligence,2013:3211–3212.
[33]Deffuant G,Neau D,Amblard F,et al.Mixing beliefs among interacting agents[J].Advances in Complex Systems,2011,3(1n04):87–98.
[34]De Jong S,Tuyls K,Verbeeck K.Artificial agents learning human fairness[C].In International Joint Conference on Autonomous Agents and Multiagent Systems,2008:863–870.
[35]BowlingM,Veloso.Multiagent learning using a variable learning rate[J].Artificial Intelligence,2002,136(2):215–250.
[36]Sutton R S,Barto A G.Reinforcement learning:an introduction[M].Cambridge,Mass:MIT Press,1998.
[37]Abdallah S,Lesser V.A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics[J].J.Artif.Int.Res.,2008,33(1):521–549.
[38]Singh S P,Kearns M J,Mansour Y.Nash Convergence of Gradient Dynamics in General-Sum Games[J],2000:541–548.
[39]Zhang C,Lesser V R.Multi-agent learning with policy prediction[J],2010:927–934.
[40]Shilnikov L P,Shilnikov A L,Turaev D,et al.Methods of qualitative theory in nonlinear dynamics/[M].World Scientific,1998.
[41]Dittmer J C.Consensus formation under bounded confidence[J].Nonlinear Analysis Theory Methods and Applications,2001,47(7):4615–4621.
[42]LORENZ J.CONTINUOUS OPINION DYNAMICS UNDER BOUNDED CONFIDENCE:A SURVEY[J].International Journal of Modern Physics C,2007,18(12):2007.
[43]Krawczyk M J,Malarz K,Korff R,et al.Communication and trust in the bounded confidence model[J].Computational Collective Intelligence.Technologies and Applications,2010,6421:90–99.
[44]Lasry J M,Lions P L.Mean field games[J].Japanese Journal of Mathematics,2007,2(1):229–260.
[45]WeisbuchG,DeffuantG,AmblardF,etal.Interacting Agents and Continuous Opinions Dynamics[M].Springer Berlin Heidelberg,2003.
[46]Hassani S.Dirac Delta Function[M].Springer New York,2000.
[47]DJ W,SH S.Collectivedynamics of’small-world’networks[C].In Nature,1998:440–442.

Claims (9)

1. A social network public opinion evolution method, characterized in that: the social network public opinion evolution method comprises two types of agents, namely a Gossiper-type agent simulating the general public in a social network and a Media-type agent simulating media or public figures that aim to attract the general public in the social network, wherein the Media-type agent adopts a Nash equilibrium strategy on a continuous action space to calculate the concept with the optimal return, updates the concept and broadcasts it in the social network,
the Nash equilibrium strategy on the continuous action space comprises the following steps:
(1) setting constants α_ub and α_us, wherein α_ub > α_us and α_ub, α_us ∈ (0,1) are learning rates;
(2) initializing parameters, wherein the parameters comprise the mean u_i of the expected action of agent i, the cumulative average strategy, the constant C, the variance σ_i and the cumulative average reward Q_i;
(3) repeating the following steps until the cumulative average strategy of the sampled actions of agent i converges:
(3.1) randomly selecting an action x_i from the normal distribution N(u_i, σ_i) at a certain exploration rate;
(3.2) performing action x_i and then obtaining the reward r_i from the environment;
(3.3) if the reward r_i received by agent i after performing action x_i is greater than the current cumulative average reward Q_i, the learning rate of u_i is α_ub, otherwise the learning rate is α_us; updating u_i according to the selected learning rate;
(3.4) updating the variance σ_i according to the learning of u_i;
(3.5) if the reward r_i received by agent i after performing action x_i is greater than the current cumulative average reward Q_i, the learning rate is α_ub, otherwise the learning rate is α_us; updating Q_i according to the selected learning rate;
(3.6) updating the cumulative average strategy according to the constant C and the action x_i;
(4) outputting the cumulative average strategy as the final action of agent i.
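For readability, the following is a minimal Python sketch of the loop in steps (1)-(4) of claim 1. It is an illustration under stated assumptions, not the claimed implementation: the reward function, the concrete variance update in (3.4) and the averaging rule in (3.6) are placeholders, and only the dual-learning-rate structure is taken from the claim.

import numpy as np

def wols_cala_sketch(reward_fn, steps=5000,
                     alpha_ub=0.1, alpha_us=0.01,   # alpha_ub > alpha_us, both in (0, 1)
                     sigma_min=0.02, C=0.05):
    u = np.random.rand()         # (2) mean u_i of the expected action
    sigma = 0.5                  # (2) exploration variance sigma_i
    Q = 0.0                      # (2) cumulative average reward Q_i
    x_bar = u                    # (2) cumulative average strategy

    for _ in range(steps):       # (3) iterate until the average strategy settles
        x = np.random.normal(u, sigma)            # (3.1) sample action from N(u_i, sigma_i)
        r = reward_fn(x)                          # (3.2) obtain the reward from the environment

        lr = alpha_ub if r > Q else alpha_us      # (3.3)/(3.5) choose the learning rate
        u += lr * (x - u)                         # (3.3) update the action mean
        sigma = max(sigma_min,
                    sigma + lr * (abs(x - u) - sigma))  # (3.4) assumed variance update
        Q += lr * (r - Q)                         # (3.5) update the cumulative average reward
        x_bar += C * (x - x_bar)                  # (3.6) assumed averaging with constant C

    return x_bar                                  # (4) final action of agent i

# Toy usage: a single-peaked reward whose optimum lies at 0.7.
best_action = wols_cala_sketch(lambda a: -(a - 0.7) ** 2)

With the toy reward above, the returned average strategy should settle near the reward's peak, which is the behaviour expected of the WoLS-CALA agent in the Gossiper-Media experiments.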
2. The social network public opinion evolution method of claim 1, wherein: in step (3.3) and step (3.5), the update step size of Q and the update step size of u are synchronized; within a neighborhood of u_i, Q_i can be linearized with respect to u_i as Q_i = K·u_i + C, where K is the slope of the linearization.
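As a reading aid, the linearization in claim 2 can be written out as follows; identifying the slope K with the local gradient of Q_i is an interpretation of the claim, not the formula referenced by the original image:

K \approx \left.\frac{\partial Q_i}{\partial u}\right|_{u = u_i}, \qquad C = Q_i(u_i) - K\,u_i, \qquad Q_i(u) \approx K\,u + C \quad \text{for } u \text{ near } u_i.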
3. The social network public opinion evolution method of claim 2, wherein: given a positive number σ_L and a positive number K, the Nash equilibrium strategies on the continuous action space of the two agents eventually converge to Nash equilibrium, where σ_L is the lower bound of the variance σ.
4. The social network public opinion evolution method of claim 1, characterized in that it comprises the following steps:
S1: the concept of each Gossiper and each Media is randomly initialized to a value in the action space [0,1];
S2: in each interaction, each agent adjusts its concept according to the following strategy until no agent changes its concept any more;
S21: for any Gossiper-type agent, randomly selecting a neighbor in the Gossiper network according to a set probability, and updating the agent's concept and the Media it follows according to the BCM (bounded confidence) strategy;
S22: randomly sampling a subset G' of the Gossiper network G, and broadcasting the Gossiper concepts in the subset G' to all Media;
S23: for each Media, calculating the concept with the best return by using the Nash equilibrium strategy on the continuous action space, and broadcasting the updated concept to the whole social network.
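The interaction loop S1-S23 of claim 4 can be sketched as follows. The fully connected Gossiper network, the probability p_media of interacting with a Media, the nearest-Media interaction rule and the sample-based "best return" search are simplifying assumptions; only the loop structure follows the claim (the real Media agents use the WoLS-CALA strategy of claim 1).

import random

D_G, D_M = 0.2, 0.3   # bounded-confidence thresholds d_g, d_m (assumed values)
A_G, A_M = 0.3, 0.3   # learning rates alpha_g, alpha_m (assumed values)

def bcm_update(x, y, d, a):
    # Claim 6: move concept x toward y only when |y - x| < d.
    return x + a * (y - x) if abs(y - x) < d else x

def simulate(n_gossipers=50, n_media=2, rounds=200, p_media=0.3, sample_size=20):
    gossip = [random.random() for _ in range(n_gossipers)]   # S1: concepts in [0, 1]
    media = [random.random() for _ in range(n_media)]        # S1: Media concepts

    for _ in range(rounds):                                  # S2: repeated interactions
        for i in range(n_gossipers):
            if random.random() < p_media:                    # interact with a Media
                k = min(range(n_media), key=lambda m: abs(media[m] - gossip[i]))
                gossip[i] = bcm_update(gossip[i], media[k], D_M, A_M)
            else:                                            # S21: random Gossiper neighbor
                j = random.randrange(n_gossipers)
                gossip[i] = bcm_update(gossip[i], gossip[j], D_G, A_G)

        sample = random.sample(gossip, sample_size)          # S22: broadcast subset G'

        for m in range(n_media):                             # S23: best-return concept
            # The return is approximated by the number of sampled Gossipers
            # within distance D_M of the candidate concept.
            media[m] = max(sample, key=lambda c: sum(abs(c - g) < D_M for g in sample))

    return gossip, media

gossip_concepts, media_concepts = simulate()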
5. The social network public opinion evolution method of claim 4, wherein: in step S21, the operating method of the Gossiper-type agent is:
A1: concept initialization: x_i^τ = x_i^(τ-1);
A2: concept updating: the agent updates its concept when the difference between its concept and the concept of the selected neighbor is less than a set threshold;
A3: the agent compares the difference between its own concept and the concepts of the Media, and selects one Media to follow according to a probability.
6. The social network public opinion evolution method of claim 5, wherein: in step A2, if the currently selected neighbor is Gossiper j and |x_j^τ − x_i^τ| < d_g, then x_i^τ ← x_i^τ + α_g(x_j^τ − x_i^τ); if the currently selected neighbor is Media k and |y_k^τ − x_i^τ| < d_m, then x_i^τ ← x_i^τ + α_m(y_k^τ − x_i^τ), wherein d_g and d_m are the concept thresholds set for the different types of neighbors, and α_g and α_m are the learning rates for the different types of neighbors.
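To make the update in claim 6 concrete, consider a worked example with illustrative (assumed) values: let d_g = 0.2, α_g = 0.3, x_i^τ = 0.50 and x_j^τ = 0.62. Since |0.62 − 0.50| = 0.12 < 0.2, the Gossiper moves its concept to x_i^τ ← 0.50 + 0.3 × (0.62 − 0.50) = 0.536. If instead x_j^τ = 0.80, the difference 0.30 exceeds d_g and the concept is left unchanged.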
7. The social network public opinion evolution method of claim 6, wherein: in step A3, the agent follows Media k with a probability P_ik that is determined, through the claimed formula, by the difference between the agent's own concept and the concept of Media k.
8. The social network public opinion evolution method of claim 7, wherein: in step S23, the current reward r_j of Media j is defined as the ratio of the number of Gossipers in G' who choose to follow Media j to the total number of Gossipers in G', where P_ij denotes the probability that Gossiper i follows Media j.
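One way to write the claimed ratio out is given below; whether the claim uses the realized follower count or its expectation through P_ij cannot be recovered from the text, so both readings are shown:

r_j = \frac{1}{|G'|}\sum_{i\in G'} \mathbf{1}\{\text{Gossiper } i \text{ follows Media } j\}
\quad\text{or, in expectation,}\quad
r_j = \frac{1}{|G'|}\sum_{i\in G'} P_{ij}.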
9. The social network public opinion evolution method of any one of claims 1-8, wherein: the existence of one Media accelerates the convergence of the public opinion of the Gossiper agents towards uniformity; in an environment with multiple competing Media, the dynamic change of each Gossiper agent's concept is a weighted average influenced by all of the Media.
CN201880001570.9A 2018-08-01 2018-08-01 Social network public opinion evolution method Active CN109496305B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/098101 WO2020024170A1 (en) 2018-08-01 2018-08-01 Nash equilibrium strategy and social network consensus evolution model in continuous action space

Publications (2)

Publication Number Publication Date
CN109496305A CN109496305A (en) 2019-03-19
CN109496305B true CN109496305B (en) 2022-05-13

Family

ID=65713809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880001570.9A Active CN109496305B (en) 2018-08-01 2018-08-01 Social network public opinion evolution method

Country Status (2)

Country Link
CN (1) CN109496305B (en)
WO (1) WO2020024170A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362754B (en) * 2019-06-11 2022-04-29 浙江大学 Online social network information source detection method based on reinforcement learning
CN111445291B (en) * 2020-04-01 2022-05-13 电子科技大学 Method for providing dynamic decision for social network influence maximization problem
CN112801299B (en) * 2021-01-26 2023-12-01 西安电子科技大学 Method, system and application for constructing game model of evolution of reward and punishment mechanism
CN112862175B (en) * 2021-02-01 2023-04-07 天津天大求实电力新技术股份有限公司 Local optimization control method and device based on P2P power transaction
CN113572548B (en) * 2021-06-18 2023-07-07 南京理工大学 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning
CN113645589A (en) * 2021-07-09 2021-11-12 北京邮电大学 Counter-fact strategy gradient-based unmanned aerial vehicle cluster routing calculation method
CN113568954B (en) * 2021-08-02 2024-03-19 湖北工业大学 Parameter optimization method and system for preprocessing stage of network flow prediction data
CN113687657B (en) * 2021-08-26 2023-07-14 鲁东大学 Method and storage medium for multi-agent formation dynamic path planning
CN114845359A (en) * 2022-03-14 2022-08-02 中国人民解放军军事科学院战争研究院 Multi-intelligent heterogeneous network selection method based on Nash Q-Learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103490413A (en) * 2013-09-27 2014-01-01 华南理工大学 Intelligent electricity generation control method based on intelligent body equalization algorithm
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107135224A (en) * 2017-05-12 2017-09-05 中国人民解放军信息工程大学 Cyber-defence strategy choosing method and its device based on Markov evolutionary Games
CN107832882A (en) * 2017-11-03 2018-03-23 上海交通大学 A kind of taxi based on markov decision process seeks objective policy recommendation method
CN107979540A (en) * 2017-10-13 2018-05-01 北京邮电大学 A kind of load-balancing method and system of SDN network multi-controller
CN109511277A (en) * 2018-08-01 2019-03-22 东莞理工学院 The cooperative method and system of multimode Continuous action space

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8930989B2 (en) * 2007-08-20 2015-01-06 AdsVantage System and method for providing supervised learning to associate profiles in video audiences
US20180033081A1 (en) * 2016-07-27 2018-02-01 Aristotle P.C. Karas Auction management system and method
CN106936855B (en) * 2017-05-12 2020-01-10 中国人民解放军信息工程大学 Network security defense decision-making determination method and device based on attack and defense differential game
CN108092307A (en) * 2017-12-15 2018-05-29 三峡大学 Layered distribution type intelligent power generation control method based on virtual wolf pack strategy

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103490413A (en) * 2013-09-27 2014-01-01 华南理工大学 Intelligent electricity generation control method based on intelligent body equalization algorithm
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107135224A (en) * 2017-05-12 2017-09-05 中国人民解放军信息工程大学 Cyber-defence strategy choosing method and its device based on Markov evolutionary Games
CN107979540A (en) * 2017-10-13 2018-05-01 北京邮电大学 A kind of load-balancing method and system of SDN network multi-controller
CN107832882A (en) * 2017-11-03 2018-03-23 上海交通大学 A kind of taxi based on markov decision process seeks objective policy recommendation method
CN109511277A (en) * 2018-08-01 2019-03-22 东莞理工学院 The cooperative method and system of multimode Continuous action space

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Resource leveling optimization of network planning based on a multi-agent cuckoo algorithm; Song Yujian et al.; Computer Engineering and Applications; 2015-08-01; 56-61 *

Also Published As

Publication number Publication date
WO2020024170A1 (en) 2020-02-06
CN109496305A (en) 2019-03-19

Similar Documents

Publication Publication Date Title
CN109496305B (en) Social network public opinion evolution method
Gronauer et al. Multi-agent deep reinforcement learning: a survey
Vecerik et al. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards
Schulman et al. Trust region policy optimization
Sontakke et al. Causal curiosity: Rl agents discovering self-supervised experiments for causal representation learning
Chen et al. On computation and generalization of generative adversarial imitation learning
Yu et al. Multiagent learning of coordination in loosely coupled multiagent systems
Han et al. Intelligent decision-making for 3-dimensional dynamic obstacle avoidance of UAV based on deep reinforcement learning
Abed-Alguni et al. A comparison study of cooperative Q-learning algorithms for independent learners
CN113919485B (en) Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
Bai et al. Variational dynamic for self-supervised exploration in deep reinforcement learning
Juang et al. A self-generating fuzzy system with ant and particle swarm cooperative optimization
Mustafa Towards continuous control for mobile robot navigation: A reinforcement learning and slam based approach
Subramanian et al. Multi-agent advisor q-learning
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
Notsu et al. Online state space generation by a growing self-organizing map and differential learning for reinforcement learning
Wu et al. Policy reuse for learning and planning in partially observable Markov decision processes
Mishra et al. Model-free reinforcement learning for stochastic stackelberg security games
Dias et al. Quantum-inspired neuro coevolution model applied to coordination problems
Akbulut et al. Reward conditioned neural movement primitives for population-based variational policy optimization
Shi et al. A sample aggregation approach to experiences replay of Dyna-Q learning
Zhou et al. An evolutionary approach toward dynamic self-generated fuzzy inference systems
Wei et al. A bayesian approach to robust inverse reinforcement learning
Zhang et al. Opinion Dynamics in Gossiper-Media Networks Based on Multiagent Reinforcement Learning
Mishra et al. Model-free Reinforcement Learning for Mean Field Games

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant