CN103248693A - Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning

Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning

Info

Publication number
CN103248693A
CN103248693A
Authority
CN
China
Prior art keywords
value
service
web service
state
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013101612388A
Other languages
Chinese (zh)
Inventor
王红兵
王晓珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2013-05-03
Filing date: 2013-05-03
Publication date: 2013-08-14
Application filed by Southeast University filed Critical Southeast University
Priority to CN2013101612388A
Publication of CN103248693A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a self-adaptive composite service optimization method based on multi-agent reinforcement learning. The method combines the concepts of reinforcement learning and software agents: the state set of the reinforcement learning problem is defined by the preconditions and postconditions of the services, and the action set by the Web services themselves. The Q-learning parameters, namely the learning rate, the discount factor and the Q values, are initialized first. Each agent performs one composition-optimization task: it perceives the current state and selects the best action in that state according to an action-selection policy. The Q values are computed and updated according to the Q-learning algorithm. Until the Q values converge, each learning episode is followed by the next, and the optimal policy is finally obtained. Because the method derives the corresponding adaptive behavior policy online from the current changes in the environment, it offers high flexibility, strong self-adaptability and practical value.

Description

Large-scale self-adaptive composite service optimization method based on multi-agent reinforcement learning
Technical field
The invention belongs to the field of artificial intelligence and relates to a computer-implemented method for the adaptive optimization of Web service compositions.
Background art
Facing complex, rapidly changing markets and fierce competition, enterprises urgently need the support of application integration and e-commerce technology to improve their competitiveness and adaptability. The characteristics of Web services make them well suited to the integration of cross-enterprise business applications, and both industry and academia hope to create new service functionality by composing existing Web services. To achieve interoperability and integration of inter-organizational information systems, enterprise application systems can be wrapped as Web services within a service-oriented architecture that exposes Web-accessible interfaces; the application systems of different enterprises are then integrated through Web services, cross-enterprise service composition and cooperation are realized, and business processes are automated by cross-enterprise workflow systems. Web service composition is an important way to reach this goal: following given rules, it discovers multiple Web services and assembles them into a single value-added, more powerful service that satisfies a user's complex requirements. However, because of the inherent complexity and variability of the Internet environment, the component services of a composite service may change dynamically while the composite service is executing, which makes it difficult to fix the component services at design time or compilation time. Dynamic Web service composition is therefore needed to adapt to a dynamically changing business environment. A further problem is service quality, i.e., the QoS attributes. Since many services on the network provide identical functionality, selecting the service with the best QoS attributes is also essential, and the QoS attributes of a Web service may themselves change dynamically at run time; after running for some time, a Web service's QoS may no longer satisfy the client's requirements. Web service composition must therefore also adapt to a dynamically changing business environment, so as to maintain good running behavior and a degree of fault tolerance.
At present, service composition determines the participating Web services in advance and requires developers to assemble and execute the services manually. This process is difficult, time-consuming and error-prone, and it cannot adapt to a dynamic environment. The Markov decision process (MDP) is a quantitative framework for sequential decision problems in stochastic environments. In such problems the decision maker must act at each observation point without knowing the decision information of the next state; in general, the decision must weigh not only the immediate payoff but also the influence of the current decision on the future, so that the operation of the system becomes optimal. Doshi proposed applying MDPs to Web service composition for the generation of dynamic workflow compositions, but that method requires an environment model, i.e., the state transition probabilities and the reward function, which usually cannot be obtained in a real environment.
Summary of the invention
Technical problem: the invention provides a large-scale service composition optimization method based on multi-agent reinforcement learning that, when facing an uncertain and unpredictable environment, can derive the corresponding adaptive behavior policy online from the current changes in the environment.
Technical scheme: the large-scale service composition optimization method based on multi-agent reinforcement learning of the present invention comprises the following steps:
1) Model the environment of the Web service composition as a Web service composition Markov decision process state transition graph of a 6-tuple, WSC-MDP = <S, s_0, s_t, A(s), P:[p_iaj], R:[r_iaj]>, where S is the set of states reachable by executing a series of atomic actions from a specific initial state s_0; s_0 denotes the initial state, the state before any action has taken place, which is also the initial value of the workflow; s_t is the user's goal state and the final state of the workflow; A(s) denotes the set of Web services executable by a Web service composition agent in a state s ∈ S; P:[p_iaj] is the probability that the system, being in a given state and invoking an available Web service of that state, enters the next state; and R:[r_iaj] is the overall-evaluation reward for invoking a service in a given state;
2) Initialize the learning rate, the discount factor and the Q values of the Q-learning algorithm, as well as the public Q value Q_p;
3) Treat the software entity performing the Web service composition optimization as a Web service composition agent that can perceive the environment and run autonomously to satisfy the design objective; the Web service composition agent perceives the state s of the environment;
4) The Web service composition agent selects and executes an action from A(s) according to the action-selection policy, obtains the new state s', and receives the reward r for reaching s';
5) Compute and update the Q value according to the Q-learning algorithm, and pass the updated Q value to the Web service composition supervision agent as the public Q value, finishing this reinforcement learning step; the Web service composition supervision agent is a software entity that guides and synchronizes the learning processes of the individual Web service composition agents;
6) Judge whether the Q values have converged: if so, take the result of the reinforcement learning as the optimal Web service execution workflow; otherwise return to step 3).
In step 2) of the present invention, the agents are trained by reinforcement learning, the learning process being regarded as a process of trial and evaluation: if the reward for selecting a certain Web service is larger than for the other Web services, the agent's tendency to select that service is strengthened; if a behavior policy of the agent leads to a lower reward, the agent's tendency to produce that policy is weakened. Reinforcement learning in the multi-agent setting is thus the agents' learning of a mapping from environment to behavior, so as to maximize the reward.
The action-selection policy in step 4) of the present invention selects an action in one of the following ways: a. select a feasible action at random; b. select the action with the largest current Q value.
Way a is followed with probability ε and way b with probability 1−ε; ε ≈ 0.15 is suitable. When selecting according to way b, the Web service composition supervision agent determines the action with the largest current Q value and informs the Web service composition agent of it. The formula is:
$$p_m(a_i \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|}, & a_i = \arg\max_a Q_p[s, a] \\[6pt] \dfrac{\varepsilon}{|A(s)|}, & \text{otherwise} \end{cases}$$
where p_m(a_i|s) is the probability that the m-th WSCA selects action a_i in state s, and argmax_a Q_p[s,a] is the action with the largest current Q value, which the Web service composition supervision agent announces to the Web service composition agents. The Web service composition supervision agent is the software entity that guides and synchronizes the learning processes of the individual Web service composition agents. Provided enough trials are performed, i.e., every action is executed infinitely often in every state, this method is guaranteed to find the optimal action. An advantage of this selection method is that, as learning time increases, every action is sampled infinitely often, which guarantees the final convergence of the Q values.
The reward r in step 4) of the present invention is computed as follows: if the user considers that a larger value of a quality-of-service attribute offered by the service provider indicates better service quality, the value is normalized according to formula (1), giving the normalized value v',
$$v' = \begin{cases} \dfrac{v - \min}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (1)$$
If the user considers that a smaller value of the quality-of-service attribute indicates better service quality, the value is normalized according to formula (2), giving the normalized value v',
$$v' = \begin{cases} \dfrac{\max - v}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (2)$$
where max and min are the maximum and minimum values of the attribute, and v is the attribute value to be normalized; each time a selected Web service is executed, its attribute value v is obtained;
The normalized values are aggregated into a single reward according to the following formula:
$$r = \sum_{i=1}^{m} w_i v_i'$$
where m is the number of quality-of-service attributes and w_i is the weight of each attribute, chosen according to user preference, with
$$\sum_{i=1}^{m} w_i = 1$$
In step 5) of the present invention, the Q value of Q-learning is computed and updated according to the following formula:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s', a') - Q(s, a) \right]$$
where α is the learning rate, γ is the discount factor, r is the reward received for executing action a in state s, s' is the new state obtained after executing action a in state s, and Q(s,a) is the value that Q-learning assigns to the state-action pair (s,a).
The method for judging in step 6) of the present invention whether the Q values have converged is: compute the difference between the Q values of the k-th iteration and the (k−1)-th iteration (for k=1, the difference between the Q values of the 1st iteration and the initial Q values); if the difference is smaller than the decision value, the Q values are judged to have converged, otherwise not. The decision value is
$$\frac{\gamma^k R}{1 - \gamma}$$
where R is the upper bound of the reward function and γ is the discount factor.
Beneficial effects: compared with the prior art, the present invention has the following advantages:
The present invention applies reinforcement learning and agent technology to a Web service composition optimization system and monitors and manages the entire composition optimization process effectively. Starting from the global and local QoS attributes of the Web services, it selects, at the macro level, the overall composition path and, at the local level, the optimal services that meet the customer's requirements. The invention computes the reward from the QoS attribute values, and the user can assign different weights to different attributes, satisfying personalized requirements. As the running environment of the Web services changes, the QoS attributes and functional attributes of the Web services change with it; reinforcement learning accommodates this environmental change and selects the optimal services online in real time, solving the problem of uncertain and unpredictable environmental change. Experience sharing among the multiple agents adds a coordination mechanism between the agents, helps overcome the slow convergence of single-agent learning algorithms, and greatly improves the intelligence of the system. Compared with static service composition, the agents react to environmental changes in real time, so the method adapts dynamically to changes in the network environment and keeps the composite service in a near-optimal running state. Because reinforcement learning is a model-free learning algorithm, the invention, unlike composition based directly on Markov decision processes, does not need to know the exact state transition function and reward function in advance, which greatly improves the scalability of the system. In view of this, the invention has important theoretical significance and practical application value.
Description of drawings
Fig. 1 is the WSC-MDP graph of a travel itinerary example.
Fig. 2 is the overall structure diagram of the reinforcement learning composite service optimization system.
Fig. 3 is a schematic diagram of Web service workflows.
Fig. 4 is the logical flow chart of the method of the invention.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings and an example.
The specific flow of the large-scale service composition optimization method based on multi-agent reinforcement learning of the present invention, shown in Fig. 4, comprises the following steps:
1) Model the environment of the Web service composition as a Web service composition Markov decision process state transition graph (WSC-MDP), as shown in Fig. 1. It is the 6-tuple WSC-MDP = <S, s_0, s_t, A(s), P, R>, where S is the set of states reachable by executing a series of atomic actions from a specific initial state s_0; s_0 denotes the initial state, the state before any action has taken place, which is also the initial value of the workflow; s_t is the user's goal state and the final state of the workflow; A(s) is the set of Web services a WSCA can execute in a state s ∈ S; P:[p_iaj] is the probability that the system, being in a given state and invoking an available Web service of that state, enters the next state; and R:[r_iaj] is defined as the overall-evaluation reward for invoking a service in a given state. A WSC-MDP can be viewed as a state transition graph with two types of nodes: open circles represent state nodes and filled circles represent service nodes. s_0 is the initial state node and s_t the terminal state node. A state node can be followed by several service nodes, representing the services that may be executed in that state. Except for the terminal state, every state has at least one arrow pointing to a next state node. Each arrow carries a transition probability p_iaj and the reward r_iaj of that state transition (for simplicity these labels are omitted in the figure); the probabilities on the arrows leaving an action node always sum to 1. We assume each service execution has two possible outcomes: if the service executes successfully, the next state, which is the executed service's postcondition, is entered; if the execution fails, the environment remains in the current state. Such a composition provides the user with many alternative itinerary service flows; when executing the composite service, the system can select an optimal workflow. A WSC-MDP is thus a super composite service containing several selectable workflows, each representing a Web service composition formed by a conventional method such as BPEL or OWL-S. A WSC-MDP can be created manually by an engineer or automatically by AI planning methods.
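For concreteness, the 6-tuple can be represented directly as a data structure. The following is a minimal Python sketch (class and field names are ours, not the patent's) that stores the transition probabilities p_iaj and rewards r_iaj sparsely:

    from dataclasses import dataclass, field
    from typing import Dict, List, Set, Tuple

    @dataclass
    class WSCMDP:
        """Sketch of the 6-tuple WSC-MDP = <S, s0, st, A(s), P, R>."""
        states: Set[str]                # S: states reachable from s0
        s0: str                         # initial state (empty workflow)
        st: str                         # user's goal state (workflow finished)
        actions: Dict[str, List[str]]   # A(s): state -> invocable Web services
        P: Dict[Tuple[str, str, str], float] = field(default_factory=dict)  # (s, a, s') -> p_iaj
        R: Dict[Tuple[str, str, str], float] = field(default_factory=dict)  # (s, a, s') -> r_iaj

        def available_services(self, s: str) -> List[str]:
            """Web services executable in state s (empty at the terminal state)."""
            return self.actions.get(s, [])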
Fig. 2 shows the overall structure of the multi-agent reinforcement learning service composition optimization algorithm of the present invention. The software entity performing the Web service composition optimization is abstracted as an agent that can perceive the environment and run autonomously to satisfy the design objective: the Web service composition agent (WSCA). The WSCA interacts with the external environment, perceiving the state s, executing an action from A(s), and obtaining the reward r.
Reinforcement learning is a trial-and-error learning method that continuously adjusts the agent's own behavior from the feedback signals obtained while interacting with the environment. We can therefore train the Web service composition agents with reinforcement learning. The learning process is regarded as a process of trial and evaluation: if the reward for selecting a certain Web service is larger than for the other Web services, the agent's tendency to select that service is strengthened; if a behavior policy leads to a lower reward, the agent's tendency to produce that policy is weakened.
The concrete training method is described in the following steps. Its objective is to use Q-learning to find an optimal Web service execution workflow from the initial state s_0 to the user's goal state s_t, as shown in Fig. 3. Let wf be a subgraph of the WSC-MDP; wf is a service flow if and only if exactly one service is executed in each state of wf. A workflow is thus equivalent to a deterministic state machine. A traditional service composition is usually built on a single workflow; Fig. 3 shows two of them. The learning policy and learning result of the reinforcement learning determine which of these executable Web service composition workflows is chosen.
The Web service composition supervision agent (WSCS) is the software entity that guides and synchronizes the learning processes of the individual Web service composition agents. It maintains a blackboard used to store the global Q values: a local WSCA can read the global Q values from the blackboard and can also update them through the WSCS. The public Q value Q_p is initialized to 0.
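A minimal sketch of the WSCS blackboard follows, under the assumption that the global Q table is a dictionary keyed by (state, service) pairs; the class and method names are illustrative:

    class WSCSupervisor:
        """Blackboard holding the public Q table Q_p shared by all WSCA learners."""

        def __init__(self):
            self.Q_p = {}                       # (state, service) -> global Q value

        def read(self, s, a):
            return self.Q_p.get((s, a), 0.0)    # Q_p is initialized to 0

        def write(self, s, a, q):
            self.Q_p[(s, a)] = q                # a WSCA publishes its updated Q value

        def best_action(self, s, actions):
            """argmax_a Q_p[s, a]: the supervisor tells a WSCA the greedy action."""
            return max(actions, key=lambda a: self.read(s, a))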
2) Initialize the learning rate, the discount factor and the Q values of the Q-learning algorithm. Their values can be set according to the circumstances; typically the learning rate can be set to 0.5, the discount factor to 0.8, and the Q values initialized to 0.
3) The Web service composition agent perceives the state s of the environment;
4) The Web service composition agent selects and executes an action from A(s) according to the action-selection policy, obtains the new state s', and receives the reward r for reaching s';
For the agent's action selection, the simplest rule is to select the action with the largest estimated value. That method always exploits current knowledge to maximize the immediate reward and never tries seemingly inferior actions that may in fact be better. A simple remedy is to choose, most of the time, the action yielding the highest reward, but occasionally, with a small probability ε, to select an action at random, independently of the value estimates. Provided enough trials are performed, i.e., every action is executed infinitely often, this method is guaranteed to find the optimal action. This near-greedy action-selection rule is called ε-greedy. An action is then selected in one of the following ways: a. select a feasible action at random; b. select the action with the largest current Q value. Way a is followed with probability ε and way b with probability 1−ε; ε ≈ 0.15 is suitable. When selecting according to way b, the Web service composition supervision agent determines the action with the largest current Q value and informs the Web service composition agent of it. The formula is:
$$p_m(a_i \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|}, & a_i = \arg\max_a Q_p[s, a] \\[6pt] \dfrac{\varepsilon}{|A(s)|}, & \text{otherwise} \end{cases}$$
where p_m(a_i|s) is the probability that the m-th WSCA selects action a_i in state s, and argmax_a Q_p[s,a] is the action with the largest current Q value, which the Web service composition supervision agent announces to the Web service composition agents. The Web service composition supervision agent is the software entity that guides and synchronizes the learning processes of the individual Web service composition agents. An advantage of this selection method is that, as learning time increases, every action is sampled infinitely often, which guarantees the final convergence of the Q values.
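Under the same assumptions, the ε-greedy rule might be sketched as follows: with probability ε a feasible service is chosen at random (way a), otherwise the greedy choice argmax_a Q_p[s,a] supplied by the supervision agent is taken (way b):

    import random

    def epsilon_greedy(supervisor, s, actions, epsilon=0.15):
        """epsilon-greedy selection over the public Q table Q_p."""
        if random.random() < epsilon:
            return random.choice(actions)            # way a: random feasible action
        return supervisor.best_action(s, actions)    # way b: argmax_a Q_p[s, a]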
The agent's reward r is computed as follows: if the user considers that a larger value of a quality-of-service attribute offered by the service provider indicates better service quality, the value is normalized according to formula (1), giving the normalized value v',
$$v' = \begin{cases} \dfrac{v - \min}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (1)$$
If the user considers that a smaller value of the quality-of-service attribute indicates better service quality, the value is normalized according to formula (2), giving the normalized value v',
$$v' = \begin{cases} \dfrac{\max - v}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (2)$$
where max and min are the maximum and minimum values of the attribute, and v is the attribute value to be normalized; each time a selected Web service is executed, its attribute value v is obtained;
The normalized values are aggregated into a single reward according to the following formula:
$$r = \sum_{i=1}^{m} w_i v_i'$$
where m is the number of quality-of-service attributes and w_i is the weight of each attribute, chosen according to user preference, with
$$\sum_{i=1}^{m} w_i = 1$$
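A sketch of formulas (1), (2) and the weighted aggregation above; the function and argument names are ours, not the patent's:

    def normalize(v, vmin, vmax, larger_is_better=True):
        """Normalize a QoS value by formula (1) or (2); returns 1 when max == min."""
        if vmax == vmin:
            return 1.0
        return (v - vmin) / (vmax - vmin) if larger_is_better else (vmax - v) / (vmax - vmin)

    def reward(values, ranges, weights, larger_is_better):
        """Aggregate m normalized QoS values into one reward r = sum_i w_i * v'_i,
        with user-preference weights that sum to 1."""
        assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
        total = 0.0
        for v, (vmin, vmax), w, up in zip(values, ranges, weights, larger_is_better):
            total += w * normalize(v, vmin, vmax, up)
        return total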
The state transition probability p(s'|s,a) is the probability that state s' can be reached from state s by executing Web service a. That is, the probability of transferring to state s' when action a is taken in state s is:
$$P(s' = j \mid s = i, a) = \begin{cases} q_r(s), & i \neq j \\[2pt] 1 - q_r(s), & i = j \end{cases}$$
where q_r(s) is the reliability of the service,
$$q_r(s) = \frac{N_s}{N_t}$$
with N_s the number of successful executions of the service and N_t the total number of executions.
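A small sketch of the reliability estimate and the resulting stochastic transition (illustrative names): on success the postcondition state is entered, on failure the environment stays where it is:

    import random

    def reliability(n_success, n_total):
        """q_r(s) = N_s / N_t: observed success rate of the service executed at state s."""
        return n_success / n_total if n_total else 1.0

    def sample_transition(s, s_post, q_r, rng=random):
        """With probability q_r(s) the service succeeds and the postcondition state
        s_post is entered; otherwise the environment stays in the current state s."""
        return s_post if rng.random() < q_r else s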
5) Compute and update the Q value of Q-learning, and pass the updated Q value to the Web service composition supervision agent as the public Q value, finishing this reinforcement learning step; the Web service composition supervision agent is the software entity that guides and synchronizes the learning processes of the individual Web service composition agents.
The Q value of Q-learning is computed and updated according to the following formula:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s', a') - Q(s, a) \right]$$
where α is the learning rate, γ is the discount factor, r is the reward received for executing action a in state s, s' is the new state obtained after executing action a in state s, and Q(s,a) is the value that Q-learning assigns to the state-action pair (s,a).
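The Q-value backup of the formula above, sketched against the same dictionary-based Q table (names are ours):

    def q_update(Q, s, a, r, s_prime, next_actions, alpha=0.5, gamma=0.8):
        """One Q-learning backup for the sample (s, a, r, s'):
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_{a' in A(s')} Q(s',a') - Q(s,a))."""
        q_sa = Q.get((s, a), 0.0)
        q_next = max((Q.get((s_prime, a2), 0.0) for a2 in next_actions), default=0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * q_next - q_sa)
        return Q[(s, a)]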
6) Judge whether the Q values have converged: if so, take the result of the reinforcement learning as the optimal Web service execution workflow; otherwise return to step 3).
The method for judging in step 6) whether the Q values have converged is: compute the difference between the Q values of the k-th iteration and the (k−1)-th iteration (for k=1, the difference between the Q values of the 1st iteration and the initial Q values); if the difference is smaller than the decision value, the Q values are judged to have converged, otherwise not, i.e.,
$$\left| Q_k(s, a) - Q_{k-1}(s, a) \right| < \frac{\gamma^k R}{1 - \gamma}$$
The decision value is
$$\frac{\gamma^k R}{1 - \gamma}$$
where R is the upper bound of the reward function, γ is the discount factor, Q_k(s,a) is the value of Q(s,a) in the k-th iteration, and Q_{k−1}(s,a) is the value of Q(s,a) in the (k−1)-th iteration.
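A sketch of the convergence test; note that the decision value γ^k R/(1−γ) follows the reconstruction above, since the source names only the reward upper bound R and the discount factor γ:

    def has_converged(Q_k, Q_prev, k, R_max, gamma=0.8):
        """Convergence test for step 6: the largest change of any Q(s, a) between
        iteration k-1 and iteration k must fall below the decision value
        gamma**k * R_max / (1 - gamma) (threshold reconstructed, see text)."""
        keys = set(Q_k) | set(Q_prev)
        if not keys:
            return False
        diff = max(abs(Q_k.get(key, 0.0) - Q_prev.get(key, 0.0)) for key in keys)
        return diff < (gamma ** k) * R_max / (1.0 - gamma)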
Executing the optimal services is itself also regarded as a learning process: the Q-value table is subsequently updated according to the newly obtained rewards. By combining execution and learning, our method achieves self-adaptation: based on the newly observed rewards, the Web service composition changes as the environment changes. It learns the performance of the Web services by interacting with the environment and therefore does not need a priori QoS attribute values of the composite services.
If an agent has no experience of its environment, it can only rely on trial and error, which is obviously blind. The multi-agent reinforcement learning algorithm based on experience sharing incorporates the idea of cooperation among the agents and shares the Q function, thereby improving the learning efficiency of the whole agent system.
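Putting the pieces together, the following illustrative sketch (ours, not the patent's) shows one learning episode of a WSCA against the shared blackboard, and the outer loop that repeats episodes until the Q values converge; execute_service stands for the actual environment interaction of invoking a Web service and observing the next state and the QoS reward:

    def run_episode(mdp, supervisor, execute_service, epsilon=0.15, alpha=0.5, gamma=0.8):
        """One learning episode of a single WSCA from s0 to the goal state st.
        `execute_service(s, a) -> (s_next, r)` is the environment interaction:
        it invokes Web service a and returns the observed next state and QoS reward."""
        s = mdp.s0
        while s != mdp.st:
            actions = mdp.available_services(s)
            if not actions:
                break                                   # dead end: no service executable here
            a = epsilon_greedy(supervisor, s, actions, epsilon)
            s_next, r = execute_service(s, a)
            q_update(supervisor.Q_p, s, a, r, s_next,
                     mdp.available_services(s_next), alpha, gamma)
            s = s_next

    def train(mdp, supervisor, execute_service, R_max, gamma=0.8, max_episodes=10000):
        """Repeat episodes until the Q values converge (step 6 of the method)."""
        for k in range(1, max_episodes + 1):
            Q_prev = dict(supervisor.Q_p)
            run_episode(mdp, supervisor, execute_service, gamma=gamma)
            if has_converged(supervisor.Q_p, Q_prev, k, R_max, gamma):
                return k
        return max_episodes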

Claims (5)

1. A large-scale service composition optimization method based on multi-agent reinforcement learning, characterized in that the method comprises the following steps:
1) modeling the environment of the Web service composition as a Web service composition Markov decision process state transition graph of a 6-tuple, WSC-MDP = <S, s_0, s_t, A(s), P:[p_iaj], R:[r_iaj]>, where S is the set of states reachable by executing a series of atomic actions from a specific initial state s_0; s_0 denotes the initial state, the state before any action has taken place, which is also the initial value of the workflow; s_t is the user's goal state and the final state of the workflow; A(s) denotes the set of Web services executable by a Web service composition agent in a state s ∈ S; P:[p_iaj] is the probability that the system, being in a given state and invoking an available Web service of that state, enters the next state; and R:[r_iaj] is the overall-evaluation reward for invoking a service in a given state;
2) initializing the learning rate, the discount factor and the Q values of the Q-learning algorithm, as well as the public Q value Q_p;
3) treating the software entity performing the Web service composition optimization as a Web service composition agent that can perceive the environment and run autonomously to satisfy the design objective, said Web service composition agent perceiving the state s of the environment;
4) the Web service composition agent selecting and executing an action from A(s) according to the action-selection policy, obtaining the new state s' and receiving the reward r for reaching s';
5) computing and updating the Q value of Q-learning, and passing the updated Q value to the Web service composition supervision agent as the public Q value, finishing this reinforcement learning step, said Web service composition supervision agent being a software entity that guides and synchronizes the learning processes of the individual Web service composition agents;
6) judging whether the Q values have converged: if so, taking the result of the reinforcement learning as the optimal Web service execution workflow; otherwise setting k = k + 1 and returning to step 3), where k is the number of iterations of returning to step 3).
2. The large-scale service composition optimization method based on multi-agent reinforcement learning according to claim 1, characterized in that the action-selection policy in said step 4) is:
an action is selected in one of the following ways: a. select a feasible action at random; b. select the action with the largest current Q value;
wherein way a is followed with probability ε and way b with probability 1−ε;
when selecting according to way b, the Web service composition supervision agent determines the action with the largest current Q value and informs the Web service composition agent of it.
3. The large-scale service composition optimization method based on multi-agent reinforcement learning according to claim 1, characterized in that the reward r in said step 4) is computed as follows: if the user considers that a larger value of a quality-of-service attribute offered by the service provider indicates better service quality, the value is normalized according to formula (1), giving the normalized value v',
$$v' = \begin{cases} \dfrac{v - \min}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (1)$$
if the user considers that a smaller value of the quality-of-service attribute indicates better service quality, the value is normalized according to formula (2), giving the normalized value v',
$$v' = \begin{cases} \dfrac{\max - v}{\max - \min}, & \max \neq \min \\[6pt] 1, & \max = \min \end{cases} \qquad (2)$$
where max and min are the maximum and minimum values of the attribute and v is the attribute value to be normalized; each time a selected Web service is executed, its attribute value v is obtained;
the normalized values are aggregated into a single reward according to the following formula:
$$r = \sum_{i=1}^{m} w_i v_i'$$
where m is the number of quality-of-service attributes and w_i is the weight of each attribute, chosen according to user preference, with
$$\sum_{i=1}^{m} w_i = 1$$
4. The large-scale service composition optimization method based on multi-agent reinforcement learning according to claim 1, characterized in that in said step 5) the Q value of Q-learning is computed and updated according to the following formula:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s', a') - Q(s, a) \right]$$
where α is the learning rate, γ is the discount factor, r is the reward received for executing action a in state s, s' is the new state obtained after executing action a in state s, and Q(s,a) is the value that Q-learning assigns to the state-action pair (s,a).
5. The large-scale service composition optimization method based on multi-agent reinforcement learning according to claim 1, characterized in that the method for judging in said step 6) whether the Q values have converged is: compute the difference between the Q values of the k-th iteration and the (k−1)-th iteration (for k=1, the difference between the Q values of the 1st iteration and the initial Q values); if said difference is smaller than the decision value, the Q values are judged to have converged, otherwise not; said decision value is
$$\frac{\gamma^k R}{1 - \gamma}$$
where R is the upper bound of the reward function and γ is the discount factor.
CN2013101612388A 2013-05-03 2013-05-03 Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning Pending CN103248693A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2013-08-14)