CN103248693A - Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning

Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning

Info

Publication number
CN103248693A
CN103248693A
Authority
CN
China
Prior art keywords
value
service
web service
state
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013101612388A
Other languages
Chinese (zh)
Inventor
王红兵
王晓珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2013-05-03
Filing date: 2013-05-03
Publication date: 2013-08-14
Application filed by Southeast University filed Critical Southeast University
Priority to CN2013101612388A
Publication of CN103248693A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a self-adaptive composite service optimization method based on multi-agent reinforcement learning. The method combines the concepts of reinforcement learning and software agents: the state set of the reinforcement learning problem is defined by the preconditions and postconditions of the services, and the action set by the Web services themselves. The Q-learning parameters, namely the learning rate, the discount factor and the Q values, are initialized first. Each agent performs one composition-optimization task: it perceives the current state and selects the best action in that state according to an action-selection policy. The Q values are computed and updated according to the Q-learning algorithm. Until the Q values converge, each learning episode is followed by the next, and the optimal policy is finally obtained. Because the method derives the corresponding adaptive behavior policy online from the current changes in the environment, it offers high flexibility, strong self-adaptability and practical value.

Description

Large-scale self-adaptive composite service optimization method based on multi-agent reinforcement learning
Technical field
The invention belongs to the field of artificial intelligence and relates to a computer-implemented method for the adaptive optimization of Web service compositions.
Background art
Facing complex, rapidly changing markets and fierce competition, enterprises urgently need the support of application integration and e-commerce technology to improve their competitiveness and adaptability. The characteristics of Web services make them well suited to the integration of cross-enterprise business applications, and both industry and academia hope to create new service functionality by composing existing Web services. To achieve interoperability and integration of inter-organizational information systems, enterprise application systems can be wrapped as Web services within a service-oriented architecture that exposes Web-accessible interfaces; the application systems of different enterprises are then integrated through Web services, cross-enterprise service composition and cooperation are realized, and business processes are automated by cross-enterprise workflow systems. Web service composition is an important way to reach this goal: following given rules, it discovers multiple Web services and assembles them into a single value-added, more powerful service that satisfies a user's complex requirements. However, because of the inherent complexity and variability of the Internet environment, the component services of a composite service may change dynamically while the composite service is executing, which makes it difficult to fix the component services at design time or compilation time. Dynamic Web service composition is therefore needed to adapt to a dynamically changing business environment. A further problem is service quality, i.e., the QoS attributes. Since many services on the network provide identical functionality, selecting the service with the best QoS attributes is also essential, and the QoS attributes of a Web service may themselves change dynamically at run time; after running for some time, a Web service's QoS may no longer satisfy the client's requirements. Web service composition must therefore also adapt to a dynamically changing business environment, so as to maintain good running behavior and a degree of fault tolerance.
At present, service composition determines the participating Web services in advance and requires developers to assemble and execute the services manually. This process is difficult, time-consuming and error-prone, and it cannot adapt to a dynamic environment. The Markov decision process (MDP) is a quantitative framework for sequential decision problems in stochastic environments. In such problems the decision maker must act at each observation point without knowing the decision information of the next state; in general, the decision must weigh not only the immediate payoff but also the influence of the current decision on the future, so that the operation of the system becomes optimal. Doshi proposed applying MDPs to Web service composition for the generation of dynamic workflow compositions, but that method requires an environment model, i.e., the state transition probabilities and the reward function, which usually cannot be obtained in a real environment.
Summary of the invention
Technical problem: the invention provides a large-scale service composition optimization method based on multi-agent reinforcement learning that, when facing an uncertain and unpredictable environment, can derive the corresponding adaptive behavior policy online from the current changes in the environment.
Technical scheme: the large-scale service composition optimization method based on multi-agent reinforcement learning of the present invention comprises the following steps:
1) Model the environment of the Web service composition as a Web service composition Markov decision process state transition graph of a 6-tuple, WSC-MDP = <S, s_0, s_t, A(s), P:[p_iaj], R:[r_iaj]>, where S is the set of states reachable by executing a series of atomic actions from a specific initial state s_0; s_0 denotes the initial state, the state before any action has taken place, which is also the initial value of the workflow; s_t is the user's goal state and the final state of the workflow; A(s) denotes the set of Web services executable by a Web service composition agent in a state s ∈ S; P:[p_iaj] is the probability that the system, being in a given state and invoking an available Web service of that state, enters the next state; and R:[r_iaj] is the overall-evaluation reward for invoking a service in a given state;
2) Initialize the learning rate, the discount factor and the Q values of the Q-learning algorithm, as well as the public Q value Q_p;
3) Treat the software entity performing the Web service composition optimization as a Web service composition agent that can perceive the environment and run autonomously to satisfy the design objective; the Web service composition agent perceives the state s of the environment;
4) The Web service composition agent selects and executes an action from A(s) according to the action-selection policy, obtains the new state s', and receives the reward r for reaching s';
5) Compute and update the Q value according to the Q-learning algorithm, and pass the updated Q value to the Web service composition supervision agent as the public Q value, finishing this reinforcement learning step; the Web service composition supervision agent is a software entity that guides and synchronizes the learning processes of the individual Web service composition agents;
6) Judge whether the Q values have converged: if so, take the result of the reinforcement learning as the optimal Web service execution workflow; otherwise return to step 3).
In step 2) of the present invention, the agents are trained by reinforcement learning, the learning process being regarded as a process of trial and evaluation: if the reward for selecting a certain Web service is larger than for the other Web services, the agent's tendency to select that service is strengthened; if a behavior policy of the agent leads to a lower reward, the agent's tendency to produce that policy is weakened. Reinforcement learning in the multi-agent setting is thus the agents' learning of a mapping from environment to behavior, so as to maximize the reward.
The action-selection policy in step 4) of the present invention selects an action in one of the following ways: a. select a feasible action at random; b. select the action with the largest current Q value.
Way a is followed with probability ε and way b with probability 1−ε; ε ≈ 0.15 is suitable. When selecting according to way b, the Web service composition supervision agent determines the action with the largest current Q value and informs the Web service composition agent of it. The formula is:
$$p_m(a_i \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|}, & a_i = \arg\max_a Q_p[s, a] \\[6pt] \dfrac{\varepsilon}{|A(s)|}, & \text{otherwise} \end{cases}$$
where p_m(a_i|s) is the probability that the m-th WSCA selects action a_i in state s, and argmax_a Q_p[s,a] is the action with the largest current Q value, which the Web service composition supervision agent announces to the Web service composition agents. The Web service composition supervision agent is the software entity that guides and synchronizes the learning processes of the individual Web service composition agents. Provided enough trials are performed, i.e., every action is executed infinitely often in every state, this method is guaranteed to find the optimal action. An advantage of this selection method is that, as learning time increases, every action is sampled infinitely often, which guarantees the final convergence of the Q values.
The reward r in step 4) of the present invention is computed as follows: if the user considers that a larger value of a quality-of-service attribute offered by the service provider indicates better service quality, the value is normalized according to formula (1), giving the normalized value v',
$$v' = \begin{cases} \dfrac{v - \min}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (1)$$
If the user considers that a smaller value of the quality-of-service attribute indicates better service quality, the value is normalized according to formula (2), giving the normalized value v',
$$v' = \begin{cases} \dfrac{\max - v}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (2)$$
where max and min are the maximum and minimum values of the attribute, and v is the attribute value to be normalized; each time a selected Web service is executed, its attribute value v is obtained;
The normalized values are aggregated into a single reward according to the following formula:
$$r = \sum_{i=1}^{m} w_i v_i'$$
where m is the number of quality-of-service attributes and w_i is the weight of each attribute, chosen according to user preference, with
$$\sum_{i=1}^{m} w_i = 1$$
In step 5) of the present invention, the Q value of Q-learning is computed and updated according to the following formula:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s', a') - Q(s, a) \right]$$
where α is the learning rate, γ is the discount factor, r is the reward received for executing action a in state s, s' is the new state obtained after executing action a in state s, and Q(s,a) is the value that Q-learning assigns to the state-action pair (s,a).
The method for judging in step 6) of the present invention whether the Q values have converged is: compute the difference between the Q values of the k-th iteration and the (k−1)-th iteration (for k=1, the difference between the Q values of the 1st iteration and the initial Q values); if the difference is smaller than the decision value, the Q values are judged to have converged, otherwise not. The decision value is
$$\frac{\gamma^k R}{1 - \gamma}$$
where R is the upper bound of the reward function and γ is the discount factor.
Beneficial effects: compared with the prior art, the present invention has the following advantages:
The present invention applies reinforcement learning and agent technology to a Web service composition optimization system and monitors and manages the entire composition optimization process effectively. Starting from the global and local QoS attributes of the Web services, it selects, at the macro level, the overall composition path and, at the local level, the optimal services that meet the customer's requirements. The invention computes the reward from the QoS attribute values, and the user can assign different weights to different attributes, satisfying personalized requirements. As the running environment of the Web services changes, the QoS attributes and functional attributes of the Web services change with it; reinforcement learning accommodates this environmental change and selects the optimal services online in real time, solving the problem of uncertain and unpredictable environmental change. Experience sharing among the multiple agents adds a coordination mechanism between the agents, helps overcome the slow convergence of single-agent learning algorithms, and greatly improves the intelligence of the system. Compared with static service composition, the agents react to environmental changes in real time, so the method adapts dynamically to changes in the network environment and keeps the composite service in a near-optimal running state. Because reinforcement learning is a model-free learning algorithm, the invention, unlike composition based directly on Markov decision processes, does not need to know the exact state transition function and reward function in advance, which greatly improves the scalability of the system. In view of this, the invention has important theoretical significance and practical application value.
Description of drawings
Fig. 1 is the WSC-MDP graph of a travel itinerary example.
Fig. 2 is the overall structure diagram of the reinforcement learning composite service optimization system.
Fig. 3 is a schematic diagram of Web service workflows.
Fig. 4 is the logical flow chart of the method of the invention.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings and an example.
The specific flow of the large-scale service composition optimization method based on multi-agent reinforcement learning of the present invention, shown in Fig. 4, comprises the following steps:
1) Model the environment of the Web service composition as a Web service composition Markov decision process state transition graph (WSC-MDP), as shown in Fig. 1. It is the 6-tuple WSC-MDP = <S, s_0, s_t, A(s), P, R>, where S is the set of states reachable by executing a series of atomic actions from a specific initial state s_0; s_0 denotes the initial state, the state before any action has taken place, which is also the initial value of the workflow; s_t is the user's goal state and the final state of the workflow; A(s) is the set of Web services a WSCA can execute in a state s ∈ S; P:[p_iaj] is the probability that the system, being in a given state and invoking an available Web service of that state, enters the next state; and R:[r_iaj] is defined as the overall-evaluation reward for invoking a service in a given state. A WSC-MDP can be viewed as a state transition graph with two types of nodes: open circles represent state nodes and filled circles represent service nodes. s_0 is the initial state node and s_t the terminal state node. A state node can be followed by several service nodes, representing the services that may be executed in that state. Except for the terminal state, every state has at least one arrow pointing to a next state node. Each arrow carries a transition probability p_iaj and the reward r_iaj of that state transition (for simplicity these labels are omitted in the figure); the probabilities on the arrows leaving an action node always sum to 1. We assume each service execution has two possible outcomes: if the service executes successfully, the next state, which is the executed service's postcondition, is entered; if the execution fails, the environment remains in the current state. Such a composition provides the user with many alternative itinerary service flows; when executing the composite service, the system can select an optimal workflow. A WSC-MDP is thus a super composite service containing several selectable workflows, each representing a Web service composition formed by a conventional method such as BPEL or OWL-S. A WSC-MDP can be created manually by an engineer or automatically by AI planning methods.
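For concreteness, the 6-tuple can be represented directly as a data structure. The following is a minimal Python sketch (class and field names are ours, not the patent's) that stores the transition probabilities p_iaj and rewards r_iaj sparsely:

    from dataclasses import dataclass, field
    from typing import Dict, List, Set, Tuple

    @dataclass
    class WSCMDP:
        """Sketch of the 6-tuple WSC-MDP = <S, s0, st, A(s), P, R>."""
        states: Set[str]                # S: states reachable from s0
        s0: str                         # initial state (empty workflow)
        st: str                         # user's goal state (workflow finished)
        actions: Dict[str, List[str]]   # A(s): state -> invocable Web services
        P: Dict[Tuple[str, str, str], float] = field(default_factory=dict)  # (s, a, s') -> p_iaj
        R: Dict[Tuple[str, str, str], float] = field(default_factory=dict)  # (s, a, s') -> r_iaj

        def available_services(self, s: str) -> List[str]:
            """Web services executable in state s (empty at the terminal state)."""
            return self.actions.get(s, [])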
Fig. 2 shows the overall structure of the multi-agent reinforcement learning service composition optimization algorithm of the present invention. The software entity performing the Web service composition optimization is abstracted as an agent that can perceive the environment and run autonomously to satisfy the design objective: the Web service composition agent (WSCA). The WSCA interacts with the external environment, perceiving the state s, executing an action from A(s), and obtaining the reward r.
Reinforcement learning is a trial-and-error learning method that continuously adjusts the agent's own behavior from the feedback signals obtained while interacting with the environment. We can therefore train the Web service composition agents with reinforcement learning. The learning process is regarded as a process of trial and evaluation: if the reward for selecting a certain Web service is larger than for the other Web services, the agent's tendency to select that service is strengthened; if a behavior policy leads to a lower reward, the agent's tendency to produce that policy is weakened.
The concrete training method is described in the following steps. Its objective is to use Q-learning to find an optimal Web service execution workflow from the initial state s_0 to the user's goal state s_t, as shown in Fig. 3. Let wf be a subgraph of the WSC-MDP; wf is a service flow if and only if exactly one service is executed in each state of wf. A workflow is thus equivalent to a deterministic state machine. A traditional service composition is usually built on a single workflow; Fig. 3 shows two of them. The learning policy and learning result of the reinforcement learning determine which of these executable Web service composition workflows is chosen.
The Web service composition supervision agent (WSCS) is the software entity that guides and synchronizes the learning processes of the individual Web service composition agents. It maintains a blackboard used to store the global Q values: a local WSCA can read the global Q values from the blackboard and can also update them through the WSCS. The public Q value Q_p is initialized to 0.
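A minimal sketch of the WSCS blackboard follows, under the assumption that the global Q table is a dictionary keyed by (state, service) pairs; the class and method names are illustrative:

    class WSCSupervisor:
        """Blackboard holding the public Q table Q_p shared by all WSCA learners."""

        def __init__(self):
            self.Q_p = {}                       # (state, service) -> global Q value

        def read(self, s, a):
            return self.Q_p.get((s, a), 0.0)    # Q_p is initialized to 0

        def write(self, s, a, q):
            self.Q_p[(s, a)] = q                # a WSCA publishes its updated Q value

        def best_action(self, s, actions):
            """argmax_a Q_p[s, a]: the supervisor tells a WSCA the greedy action."""
            return max(actions, key=lambda a: self.read(s, a))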
2) Initialize the learning rate, the discount factor and the Q values of the Q-learning algorithm. Their values can be set according to the circumstances; typically the learning rate can be set to 0.5, the discount factor to 0.8, and the Q values initialized to 0.
3) The Web service composition agent perceives the state s of the environment;
4) The Web service composition agent selects and executes an action from A(s) according to the action-selection policy, obtains the new state s', and receives the reward r for reaching s';
For the agent's action selection, the simplest rule is to select the action with the largest estimated value. That method always exploits current knowledge to maximize the immediate reward and never tries seemingly inferior actions that may in fact be better. A simple remedy is to choose, most of the time, the action yielding the highest reward, but occasionally, with a small probability ε, to select an action at random, independently of the value estimates. Provided enough trials are performed, i.e., every action is executed infinitely often, this method is guaranteed to find the optimal action. This near-greedy action-selection rule is called ε-greedy. An action is then selected in one of the following ways: a. select a feasible action at random; b. select the action with the largest current Q value. Way a is followed with probability ε and way b with probability 1−ε; ε ≈ 0.15 is suitable. When selecting according to way b, the Web service composition supervision agent determines the action with the largest current Q value and informs the Web service composition agent of it. The formula is:
$$p_m(a_i \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|}, & a_i = \arg\max_a Q_p[s, a] \\[6pt] \dfrac{\varepsilon}{|A(s)|}, & \text{otherwise} \end{cases}$$
where p_m(a_i|s) is the probability that the m-th WSCA selects action a_i in state s, and argmax_a Q_p[s,a] is the action with the largest current Q value, which the Web service composition supervision agent announces to the Web service composition agents. The Web service composition supervision agent is the software entity that guides and synchronizes the learning processes of the individual Web service composition agents. An advantage of this selection method is that, as learning time increases, every action is sampled infinitely often, which guarantees the final convergence of the Q values.
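Under the same assumptions, the ε-greedy rule might be sketched as follows: with probability ε a feasible service is chosen at random (way a), otherwise the greedy choice argmax_a Q_p[s,a] supplied by the supervision agent is taken (way b):

    import random

    def epsilon_greedy(supervisor, s, actions, epsilon=0.15):
        """epsilon-greedy selection over the public Q table Q_p."""
        if random.random() < epsilon:
            return random.choice(actions)            # way a: random feasible action
        return supervisor.best_action(s, actions)    # way b: argmax_a Q_p[s, a]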
The agent's reward r is computed as follows: if the user considers that a larger value of a quality-of-service attribute offered by the service provider indicates better service quality, the value is normalized according to formula (1), giving the normalized value v',
$$v' = \begin{cases} \dfrac{v - \min}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (1)$$
If the user considers that a smaller value of the quality-of-service attribute indicates better service quality, the value is normalized according to formula (2), giving the normalized value v',
$$v' = \begin{cases} \dfrac{\max - v}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (2)$$
where max and min are the maximum and minimum values of the attribute, and v is the attribute value to be normalized; each time a selected Web service is executed, its attribute value v is obtained;
The normalized values are aggregated into a single reward according to the following formula:
$$r = \sum_{i=1}^{m} w_i v_i'$$
where m is the number of quality-of-service attributes and w_i is the weight of each attribute, chosen according to user preference, with
$$\sum_{i=1}^{m} w_i = 1$$
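A sketch of formulas (1), (2) and the weighted aggregation above; the function and argument names are ours, not the patent's:

    def normalize(v, vmin, vmax, larger_is_better=True):
        """Normalize a QoS value by formula (1) or (2); returns 1 when max == min."""
        if vmax == vmin:
            return 1.0
        return (v - vmin) / (vmax - vmin) if larger_is_better else (vmax - v) / (vmax - vmin)

    def reward(values, ranges, weights, larger_is_better):
        """Aggregate m normalized QoS values into one reward r = sum_i w_i * v'_i,
        with user-preference weights that sum to 1."""
        assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
        total = 0.0
        for v, (vmin, vmax), w, up in zip(values, ranges, weights, larger_is_better):
            total += w * normalize(v, vmin, vmax, up)
        return total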
The state transition probability p(s'|s,a) is the probability that state s' can be reached from state s by executing Web service a. That is, the probability of transferring to state s' when action a is taken in state s is:
$$P(s' = j \mid s = i, a) = \begin{cases} q_r(s), & i \neq j \\[2pt] 1 - q_r(s), & i = j \end{cases}$$
where q_r(s) is the reliability of the service,
$$q_r(s) = \frac{N_s}{N_t}$$
with N_s the number of successful executions of the service and N_t the total number of executions.
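A small sketch of the reliability estimate and the resulting stochastic transition (illustrative names): on success the postcondition state is entered, on failure the environment stays where it is:

    import random

    def reliability(n_success, n_total):
        """q_r(s) = N_s / N_t: observed success rate of the service executed at state s."""
        return n_success / n_total if n_total else 1.0

    def sample_transition(s, s_post, q_r, rng=random):
        """With probability q_r(s) the service succeeds and the postcondition state
        s_post is entered; otherwise the environment stays in the current state s."""
        return s_post if rng.random() < q_r else s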
5) Compute and update the Q value of Q-learning, and pass the updated Q value to the Web service composition supervision agent as the public Q value, finishing this reinforcement learning step; the Web service composition supervision agent is the software entity that guides and synchronizes the learning processes of the individual Web service composition agents.
The Q value of Q-learning is computed and updated according to the following formula:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s', a') - Q(s, a) \right]$$
where α is the learning rate, γ is the discount factor, r is the reward received for executing action a in state s, s' is the new state obtained after executing action a in state s, and Q(s,a) is the value that Q-learning assigns to the state-action pair (s,a).
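The Q-value backup of the formula above, sketched against the same dictionary-based Q table (names are ours):

    def q_update(Q, s, a, r, s_prime, next_actions, alpha=0.5, gamma=0.8):
        """One Q-learning backup for the sample (s, a, r, s'):
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_{a' in A(s')} Q(s',a') - Q(s,a))."""
        q_sa = Q.get((s, a), 0.0)
        q_next = max((Q.get((s_prime, a2), 0.0) for a2 in next_actions), default=0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * q_next - q_sa)
        return Q[(s, a)]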
6) Judge whether the Q values have converged: if so, take the result of the reinforcement learning as the optimal Web service execution workflow; otherwise return to step 3).
The method for judging in step 6) whether the Q values have converged is: compute the difference between the Q values of the k-th iteration and the (k−1)-th iteration (for k=1, the difference between the Q values of the 1st iteration and the initial Q values); if the difference is smaller than the decision value, the Q values are judged to have converged, otherwise not, i.e.,
$$\left| Q_k(s, a) - Q_{k-1}(s, a) \right| < \frac{\gamma^k R}{1 - \gamma}$$
The decision value is
$$\frac{\gamma^k R}{1 - \gamma}$$
where R is the upper bound of the reward function, γ is the discount factor, Q_k(s,a) is the value of Q(s,a) in the k-th iteration, and Q_{k−1}(s,a) is the value of Q(s,a) in the (k−1)-th iteration.
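A sketch of the convergence test; note that the decision value γ^k R/(1−γ) follows the reconstruction above, since the source names only the reward upper bound R and the discount factor γ:

    def has_converged(Q_k, Q_prev, k, R_max, gamma=0.8):
        """Convergence test for step 6: the largest change of any Q(s, a) between
        iteration k-1 and iteration k must fall below the decision value
        gamma**k * R_max / (1 - gamma) (threshold reconstructed, see text)."""
        keys = set(Q_k) | set(Q_prev)
        if not keys:
            return False
        diff = max(abs(Q_k.get(key, 0.0) - Q_prev.get(key, 0.0)) for key in keys)
        return diff < (gamma ** k) * R_max / (1.0 - gamma)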
Executing the optimal services is itself also regarded as a learning process: the Q-value table is subsequently updated according to the newly obtained rewards. By combining execution and learning, our method achieves self-adaptation: based on the newly observed rewards, the Web service composition changes as the environment changes. It learns the performance of the Web services by interacting with the environment and therefore does not need a priori QoS attribute values of the composite services.
If an agent has no experience of its environment, it can only rely on trial and error, which is obviously blind. The multi-agent reinforcement learning algorithm based on experience sharing incorporates the idea of cooperation among the agents and shares the Q function, thereby improving the learning efficiency of the whole agent system.
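Putting the pieces together, the following illustrative sketch (ours, not the patent's) shows one learning episode of a WSCA against the shared blackboard, and the outer loop that repeats episodes until the Q values converge; execute_service stands for the actual environment interaction of invoking a Web service and observing the next state and the QoS reward:

    def run_episode(mdp, supervisor, execute_service, epsilon=0.15, alpha=0.5, gamma=0.8):
        """One learning episode of a single WSCA from s0 to the goal state st.
        `execute_service(s, a) -> (s_next, r)` is the environment interaction:
        it invokes Web service a and returns the observed next state and QoS reward."""
        s = mdp.s0
        while s != mdp.st:
            actions = mdp.available_services(s)
            if not actions:
                break                                   # dead end: no service executable here
            a = epsilon_greedy(supervisor, s, actions, epsilon)
            s_next, r = execute_service(s, a)
            q_update(supervisor.Q_p, s, a, r, s_next,
                     mdp.available_services(s_next), alpha, gamma)
            s = s_next

    def train(mdp, supervisor, execute_service, R_max, gamma=0.8, max_episodes=10000):
        """Repeat episodes until the Q values converge (step 6 of the method)."""
        for k in range(1, max_episodes + 1):
            Q_prev = dict(supervisor.Q_p)
            run_episode(mdp, supervisor, execute_service, gamma=gamma)
            if has_converged(supervisor.Q_p, Q_prev, k, R_max, gamma):
                return k
        return max_episodes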

Claims (5)

1. A large-scale service composition optimization method based on multi-agent reinforcement learning, characterized in that the method comprises the following steps:
1) modeling the environment of the Web service composition as a Web service composition Markov decision process state transition graph of a 6-tuple, WSC-MDP = <S, s_0, s_t, A(s), P:[p_iaj], R:[r_iaj]>, where S is the set of states reachable by executing a series of atomic actions from a specific initial state s_0; s_0 denotes the initial state, the state before any action has taken place, which is also the initial value of the workflow; s_t is the user's goal state and the final state of the workflow; A(s) denotes the set of Web services executable by a Web service composition agent in a state s ∈ S; P:[p_iaj] is the probability that the system, being in a given state and invoking an available Web service of that state, enters the next state; and R:[r_iaj] is the overall-evaluation reward for invoking a service in a given state;
2) initializing the learning rate, the discount factor and the Q values of the Q-learning algorithm, as well as the public Q value Q_p;
3) treating the software entity performing the Web service composition optimization as a Web service composition agent that can perceive the environment and run autonomously to satisfy the design objective, said Web service composition agent perceiving the state s of the environment;
4) the Web service composition agent selecting and executing an action from A(s) according to the action-selection policy, obtaining the new state s' and receiving the reward r for reaching s';
5) computing and updating the Q value of Q-learning, and passing the updated Q value to the Web service composition supervision agent as the public Q value, finishing this reinforcement learning step, said Web service composition supervision agent being a software entity that guides and synchronizes the learning processes of the individual Web service composition agents;
6) judging whether the Q values have converged: if so, taking the result of the reinforcement learning as the optimal Web service execution workflow; otherwise setting k = k + 1 and returning to step 3), where k is the number of iterations of returning to step 3).
2. The large-scale service composition optimization method based on multi-agent reinforcement learning according to claim 1, characterized in that the action-selection policy in said step 4) is:
an action is selected in one of the following ways: a. select a feasible action at random; b. select the action with the largest current Q value;
wherein way a is followed with probability ε and way b with probability 1−ε;
when selecting according to way b, the Web service composition supervision agent determines the action with the largest current Q value and informs the Web service composition agent of it.
3. The large-scale service composition optimization method based on multi-agent reinforcement learning according to claim 1, characterized in that the reward r in said step 4) is computed as follows: if the user considers that a larger value of a quality-of-service attribute offered by the service provider indicates better service quality, the value is normalized according to formula (1), giving the normalized value v',
$$v' = \begin{cases} \dfrac{v - \min}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (1)$$
if the user considers that a smaller value of the quality-of-service attribute indicates better service quality, the value is normalized according to formula (2), giving the normalized value v',
$$v' = \begin{cases} \dfrac{\max - v}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (2)$$
where max and min are the maximum and minimum values of the attribute and v is the attribute value to be normalized; each time a selected Web service is executed, its attribute value v is obtained;
the normalized values are aggregated into a single reward according to the following formula:
$$r = \sum_{i=1}^{m} w_i v_i'$$
where m is the number of quality-of-service attributes and w_i is the weight of each attribute, chosen according to user preference, with
$$\sum_{i=1}^{m} w_i = 1$$
4. The large-scale service composition optimization method based on multi-agent reinforcement learning according to claim 1, characterized in that in said step 5) the Q value of Q-learning is computed and updated according to the following formula:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s', a') - Q(s, a) \right]$$
where α is the learning rate, γ is the discount factor, r is the reward received for executing action a in state s, s' is the new state obtained after executing action a in state s, and Q(s,a) is the value that Q-learning assigns to the state-action pair (s,a).
5. The large-scale service composition optimization method based on multi-agent reinforcement learning according to claim 1, characterized in that the method for judging in said step 6) whether the Q values have converged is: compute the difference between the Q values of the k-th iteration and the (k−1)-th iteration (for k=1, the difference between the Q values of the 1st iteration and the initial Q values); if said difference is smaller than the decision value, the Q values are judged to have converged, otherwise not; said decision value is
$$\frac{\gamma^k R}{1 - \gamma}$$
where R is the upper bound of the reward function and γ is the discount factor.
CN2013101612388A 2013-05-03 2013-05-03 Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning Pending CN103248693A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2013-08-14)