CN113485104B - Fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning - Google Patents

Fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning Download PDF

Info

Publication number
CN113485104B
CN113485104B (application CN202110740746.6A)
Authority
CN
China
Prior art keywords
fuzzy
learning
rule
value
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110740746.6A
Other languages
Chinese (zh)
Other versions
CN113485104A (en)
Inventor
邓勇
李宏发
连纪文
郑蔚涛
王栋
陈汉城
刘璐
陈行滨
黄锐
李霄铭
李棋
林旭军
熊军
陈卓琳
余翔
翁晓锋
江秀
潘丹
林栋
许高术
杨启帆
杨劲怀
吴茜
谢景宇
林灵婷
陈豪
丁宁
林嘉
乐艺泽
谢丹鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Yili Information Technology Co ltd
State Grid Fujian Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Original Assignee
Fujian Yili Information Technology Co ltd
State Grid Fujian Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Yili Information Technology Co ltd, State Grid Fujian Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Fujian Electric Power Co Ltd filed Critical Fujian Yili Information Technology Co ltd
Priority to CN202110740746.6A priority Critical patent/CN113485104B/en
Publication of CN113485104A publication Critical patent/CN113485104A/en
Application granted granted Critical
Publication of CN113485104B publication Critical patent/CN113485104B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The invention relates to a fuzzy control 'cloud green making' intelligent algorithm based on reinforcement Learning Q-Learning. Firstly, the environmental information during tea making is fuzzified and used as the antecedent (front part) of a fuzzy rule in the rule base of the green-making fuzzy reinforcement system, and the conclusion of the fuzzy rule, i.e. its consequent (back part), is obtained through the reinforcement learning system; different combinations of the green-making barrel settings, including rotation direction, rotation degree and rotation time, form the action set of the reinforcement learning system, and the consequent of each fuzzy rule is one action from this set. Secondly, the rule base of the green-making fuzzy reinforcement system is established: the antecedent of each fuzzy rule is paired with every action in the action set as a possible consequent, and a Q function is allocated to each action as its evaluation value. By updating the Q value of the possible actions of each fuzzy rule and, after learning, selecting the action with the largest Q value in each fuzzy rule as the conclusion of that rule, the final fuzzy control output is obtained.

Description

Fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning
Technical Field
The invention belongs to the field of machine Learning, and particularly relates to a fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning.
Background
At present, electrified tea making, whether green making or tea baking, is realized as a mechanical flow operation. Production efficiency is greatly improved, but the fine craftsmanship of traditional tea making is lost. Minbei (North Fujian) oolong tea is represented mainly by the Wuyi rock tea of the Wuyi mountains; its making process is complex and delicate, fresh leaves from different raw materials require flexible handling according to their conditions, and ensuring stable quality in mass production is a difficulty. To combine the experience of the tea master with mass-production data from machine processing into a standardized, normalized manufacturing process, big data techniques are urgently needed to build a tea-making process data platform, collect process-flow data and tea-quality data, and build a data model, so as to standardize the tea-making process, improve the quality of machine-made tea and promote the upgrading and development of the tea industry. The preparation of Minbei oolong tea can be divided into five procedures: withering, green making, enzyme deactivation, rolling and baking, among which green making is the key procedure that determines tea quality. Green making is a repeated, alternating operation of shaking the leaves and airing them several times, with the aim of effectively controlling the change of moisture in the green leaves and the change of enzymatic oxidation. Therefore, controlling the changes of the various factors in the green-making process of Minbei oolong tea is essential to guarantee its quality.
With the rapid development of computer hardware and machine learning, big data mining models in various fields have attracted the attention of scholars and experts. The prior art of green making has focused mainly on methods based on control models and methods based on supervised learning. He Jing et al. designed a temperature control system for a Wuyi rock tea comprehensive green-making machine based on the PID control principle, which obtains the temperature error and error change and controls the fan speed to improve tea quality. Liu Jiang likewise analyzed the influence of temperature and humidity on rock tea quality during green making and, based on an expert database on the upper computer, designed a fuzzy controller taking the green-making temperature and humidity deviations as input parameters and the green-shaking (airing) time as output parameter, realizing intelligent control of the green-making process. Wu Wei et al. analyzed the relation between the color change of withered leaves and the green-making degree of Wuyi rock tea based on machine vision, and constructed a neural network model to predict the green-making degree. Cao Chengmao et al. applied fuzzy control to the constant-temperature control of green tea fixation, detecting the fresh-leaf throwing amount and fresh-leaf grade and using fuzzy decisions to adjust the fixation time and temperature of the fixation machine in real time.
For intelligent green making of Minbei oolong tea, both control-model-based methods and supervised-learning-based methods have advantages and limitations. Control-model-based methods exploit prior knowledge and control the output through fuzzy reasoning and rule-based fuzzy control; they need no accurate model of the controlled object and are robust, but they are hard to learn, and fuzzy control is difficult to realize when an expert database is lacking. Supervised-learning-based methods learn quickly, but models such as neural networks require a large amount of input-output data for training, and when the quantity and quality of the training data are not guaranteed, the learning effect degrades greatly.
Disclosure of Invention
The invention aims to overcome the above defects and provides a fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning. A reinforcement Learning Q-Learning method is introduced and a fuzzy control model based on it is established: on one hand, the fuzzy rule base for controlling the green making of Minbei oolong tea is designed through reinforcement learning; on the other hand, because the state space and action space of reinforcement learning change continuously and the search space becomes too large, fuzzy control is used so that parameters such as temperature and humidity in the green-making process are precisely controlled and the green-making barrel is operated in real time (forward rotation, reverse rotation, rotation degree, air inlet quantity, frequency and the like). Empirical analysis shows that the model works well, can effectively control the green-making process of Minbei oolong tea, and improves the quality of the oolong tea.
In order to achieve the above purpose, the technical scheme of the invention is as follows. In the fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning, firstly, the environmental information during tea making is fuzzified and then used as the antecedent of a fuzzy rule in the rule base of the green-making fuzzy reinforcement system; the conclusion of the fuzzy rule, i.e. its consequent, is obtained through the reinforcement learning system, and different combinations of the green-making barrel settings, including rotation direction, rotation degree and rotation time, form the action set of the reinforcement learning system, the consequent of each fuzzy rule being one action from this set. Secondly, the rule base of the green-making fuzzy reinforcement system is established: the antecedent of each fuzzy rule corresponds to every action in the action set as a possible consequent, and a Q function is allocated to each action as its evaluation value. The Q value of each fuzzy rule's possible actions is updated, and after learning the action with the maximum Q value in each fuzzy rule is selected as the conclusion of that rule, so as to obtain the final fuzzy control output.
In one embodiment of the present invention, the reinforcement Learning Q-Learning in the reinforcement learning system requires the following inputs: number of iterations T, state set S, action set A, learning rate α, decay factor γ and exploration rate ε. The state set S is the environmental information during tea making; the action set describes the green-making barrel, including rotation direction, rotation degree and rotation time. The learning rate α defines how much the newly learned Q value overrides the old Q value: a value of 0 means that nothing is learned, and a value of 1 means that the newly discovered information is the only information that matters. The decay factor γ defines the importance of future rewards: a value of 0 means that only short-term rewards are considered, and a value of 1 means that long-term rewards are valued. The reinforcement Learning Q-Learning steps are as follows:
1) Randomly initializing Q values corresponding to all states and actions, and initializing the Q value of the termination state to 0;
2) The learning task is to learn a strategy (policy), where the strategy specifies the choice of the next action based on the current state; an effective learning strategy is one under which the largest cumulative return is obtained. The cumulative return value obtained by any strategy is:

V^π(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{i≥0} γ^i·r_{t+i}

To learn the optimal strategy, the value evaluation function Q(s, a) is defined as:

Q(s_t, a_t) = r_t + γ·V(s_{t+1})
3) Performing local iterative learning on the evaluation function of the current state, and selecting a globally optimal action sequence;
3.1) Initialize s to be the first state of the current state sequence;
3.2) Select an action a in the current state s using the ε-greedy method;
3.3) Execute the current action a in state s to obtain a new state s' and a reward r;
3.4) Update the value evaluation function Q(s, a):

Q(s, a) ← Q(s, a) + α·[r + γ·max_{a'} Q(s', a') - Q(s, a)],  s ← s'
once the value-assessment function reaches convergence by learning, an optimal strategy can be determined, i.e., the action with the highest Q value is selected as the optimal strategy for each state.
In an embodiment of the present invention, after reinforcement Learning Q-Learning is combined with fuzzy control, the reinforcement learning system becomes a reinforcement Learning Q-Learning fuzzy control system. It can represent the mapping from the state set S = {s_i | s_i ∈ S} to the action set A = {a_i | a_i ∈ A}, and can also represent the mapping from a state-action pair (s_t, a_t) to the corresponding value evaluation function Q(s_t, a_t). The fuzzy rule is expressed in the following form:

R_i: If s is F_i Then y is a(i,1) with q(i,1)
     or y is a(i,2) with q(i,2)
     ...
     or y is a(i,j) with q(i,j)

where R_i denotes the i-th fuzzy control rule, s denotes the current state vector used as the input of the fuzzy control, F_i denotes a fuzzy set, and a(i, j) and q(i, j) denote, under F_i, the possible actions of state s and the corresponding evaluation values, respectively. In the reinforcement learning process, an ε-greedy search strategy is adopted to select the consequent action a(i, j*) of the fuzzy rule as the output of the i-th rule.
Using the zero-order Takagi-Sugeno fuzzy inference model, a rule with N inputs has the form:

R_l: If s_1 is F_1^l and s_2 is F_2^l and ... and s_N is F_N^l then f_l = k_l

where F_i^l is the fuzzy set of the i-th input variable in rule R_l, l = 1, 2, ..., and k_l is the crisp consequent of rule R_l; the fuzzy output can be defuzzified into a crisp output using a defuzzification technique.

The action output of the system is the firing-strength-weighted combination of the consequent actions selected in each rule:

a(s) = Σ_i α_i(s)·a(i, j*) / Σ_i α_i(s)

The corresponding Q value is:

Q(s, a) = Σ_i α_i(s)·q(i, j*) / Σ_i α_i(s)

where α_i(s) = min[μ_1(s_1), μ_2(s_2), ...] represents the firing strength of the i-th rule and μ_j(s_j) is the membership degree of the state component s_j.
The evaluation value q(i, j) is updated with a Q-learning algorithm. At time t, under the action a_t(s_t) output by the fuzzy control, the state transitions to s_{t+1} and the reinforcement signal r_t is obtained. The temporal-difference error is calculated by the following formula:

ζ = r_t + γ·max Q_{t+1} - Q_t

The temporal-difference error is used to update the q value of the action. First the gradient of q(i, j) is calculated,

∂Q_t/∂q(i, j) = α_i(s_t) / Σ_k α_k(s_t)   (for the consequent action selected in rule i, 0 otherwise)

and the increment is Δq(i, j) = δ·ζ·e_t(i, j), where δ is the learning rate. An accumulating eligibility trace is used to accelerate reinforcement learning; the eligibility is updated by the following formula, with λ the eligibility-trace decay rate:

e_t(i, j) = γ·λ·e_{t-1}(i, j) + ∂Q_t/∂q(i, j)

Finally, the q value of each possible action of the fuzzy rule is updated by the following formula:

q_t(i, j) = q_{t-1}(i, j) + Δq(i, j).
In an embodiment of the invention, parameter optimization based on self-organizing feature mapping and the gradient descent method is introduced to realize the parameter optimization of the reinforcement Learning Q-Learning fuzzy control, which specifically comprises the following steps:
First, a desired output of the controller is constructed from the reinforcement signal, with ρ a scaling factor on the interval [0, 1]: if r(t+1) > ρ(t+1), the desired output is a better control output than y(t) and should be rewarded; conversely, if r(t+1) < ρ(t+1), the desired output is inferior to y(t) and should be penalized.
then the following steps are executed:
1) Parameter preliminary adjustment based on self-organizing feature mapping
In order to reduce the parameter optimization range of the reinforcement Learning Q-Learning fuzzy control and establish more accurate fuzzy rules, the Kohonen self-organizing feature mapping method is used to preliminarily adjust the means of the fuzzy membership functions. The specific algorithm is as follows:
The desired control output reaches the fifth layer from top to bottom, and the state input reaches the second layer from bottom to top. The mean of each input fuzzy membership function is adjusted according to:

m_i(t+1) = m_i(t) + α·[x(t) - m_i(t)]   if m_i = m_closest
m_i(t+1) = m_i(t)                        if m_i ≠ m_closest

where x(t) is the input training data, α is the learning rate, and k is the number of fuzzy partitions of the input variable; the means of the output fuzzy membership functions are adjusted in the same way.
2) Calculating the certainty:
To find the mapping relation of the fuzzy rules and define their certainty, W_ij (initial value 0) is defined to represent the mapping strength of a fuzzy rule. The specific algorithm is as follows:
Step 1: update the mapping strength of the fuzzy rules. The state input propagates from bottom to top to the interval nodes of the second layer, from which the activation strength of the third-layer rule nodes is obtained; the desired output propagates from top to bottom to the fourth-layer nodes. W_ij is then updated according to the following formula:
Step 2: calculate the certainty. W_ij is normalized according to the following formula to obtain the certainty of each rule:
3) Parameter optimization based on gradient descent method
The mean and variance of the fuzzy membership function are optimized by using a gradient descent method, and the performance index function is taken as follows:
The correction amounts of the mean and the variance are derived using the chain rule.
The mean and variance corrections of the input fuzzy membership functions are:
The mean and variance corrections of the output fuzzy membership functions are:
where β_1, β_2, γ_1 and γ_2 are the learning rates.
In an embodiment of the present invention, the reinforcement Learning Q-Learning system uses heuristic-function-based reinforcement Learning Q-Learning, i.e. a heuristic function is used to inform, or even directly guide, the action selection process for the green-making barrel.
Compared with the prior art, the invention has the following beneficial effects. In the prior art, the green-making process is influenced by natural and non-natural factors at the same time, and the non-natural factors cannot be expressed as data; with fuzzy-control-based modelling, fuzzy control is difficult to realize when an expert database is lacking, and controlling directly with a fuzzy control model from known data alone, without fairly accurate prior knowledge, introduces precision errors. A neural network model built by supervised learning is a method that learns from training data; it needs a large amount of input-output data for training, and when the quantity and quality of the training data are not guaranteed, the overall green-making control produces large errors. The invention first performs learning without supervision through reinforcement learning: when expert knowledge is insufficient, it can learn autonomously from the input environmental state, which makes it an effective fuzzy control design method; at the same time, the system design introduces prior knowledge and adopts heuristic selection strategies based on experience and domain knowledge, accelerating the learning process and improving flexibility and extensibility. Reinforcement learning and fuzzy control are fused, and the universal approximation property of fuzzy control is exploited; as the accumulated reinforcement signal grows and converges, a more optimized control strategy is obtained. To address the problems of an overly large parameter range and slow convergence, the desired control output for a state input is obtained from a simple penalty or reward signal, an optimization method analogous to learning from training data is constructed, and parameter initialization based on self-organizing feature mapping is combined with gradient-based parameter optimization, accelerating the iterative update of the model parameters so that the model reaches a better solution with a faster convergence speed.
Drawings
FIG. 1 is a block diagram of a fuzzy control system based on reinforcement learning according to the present invention.
FIG. 2 is a flow chart of the blurring process for inputting environmental information according to the present invention.
FIG. 3 is a block diagram of a reinforcement learning system based on heuristic strategy selection in accordance with the present invention.
Detailed Description
The technical scheme of the invention is specifically described below with reference to the accompanying drawings.
The invention relates to a fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning. Firstly, the environmental information during tea making is fuzzified and used as the antecedent of a fuzzy rule in the rule base of the green-making fuzzy reinforcement system; the conclusion of the fuzzy rule, i.e. its consequent, is obtained through the reinforcement learning system, and different combinations of the green-making barrel settings, including rotation direction, rotation degree and rotation time, form the action set of the reinforcement learning system, the consequent of each fuzzy rule being one action from this set. Secondly, the rule base of the green-making fuzzy reinforcement system is established: the antecedent of each fuzzy rule corresponds to every action in the action set as a possible consequent, and a Q function is allocated to each action as its evaluation value. The Q value of each fuzzy rule's possible actions is updated, and after learning the action with the maximum Q value in each fuzzy rule is selected as the conclusion of that rule, so as to obtain the final fuzzy control output.
The following is a specific implementation procedure of the present invention.
As shown in fig. 1, in the fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning, the fuzzy control system first obtains the environmental information (temperature, humidity and their deviations) during Minbei oolong tea making through sensors; this information is fuzzified and used as the antecedent in the reinforcement learning system's rule base, and the conclusion part of the fuzzy rules is obtained through the reinforcement learning system. Different combinations of the rotation direction, rotation degree and rotation time of the Minbei oolong tea green-making barrel form the action set of the reinforcement learning system. The consequent of each fuzzy rule is one action from this set.
Secondly, the rule base of the Minbei oolong tea green-making fuzzy reinforcement system is established: the antecedent of each rule corresponds to every action in the action set as a possible consequent, and a Q function is allocated to each action as its evaluation value. The Q value of each fuzzy rule's possible actions is updated by the algorithm, and after learning the action with the maximum Q value in a rule is selected as the conclusion part of that rule, so as to obtain the final fuzzy control output.
The purpose of reinforcement Learning Q-Learning is to determine the conclusion part of the fuzzy rules in the fuzzy control. Limited prior knowledge (expert knowledge) can be introduced into reinforcement learning; this accelerates the reinforcement learning process and gives the system better flexibility and extensibility. Environmental states such as humidity and temperature are used as input variables of the fuzzy control, and the output action of the fuzzy control is determined by two parts: when prior knowledge exists, a heuristic learning mode is adopted, expert knowledge and reinforcement learning are fused, and the combined result determines the output action, i.e. the rotation direction, rotation degree and so on of the Minbei oolong tea green-making barrel; in the absence of expert knowledge, the output action is obtained through reinforcement learning alone.
Reinforcement Learning Q-Learning requires the following inputs: number of iterations T, state set S, action set A, learning rate α, decay factor γ and exploration rate ε. The state set S consists of states such as the temperature and humidity of the green-making barrel, and the action set consists of the rotation direction and rotation degree of the green-making barrel. The learning rate α defines how much the newly learned Q value overrides the old one: a value of 0 means the agent learns nothing (old information is all that matters), and a value of 1 means the newly discovered information is the only information that matters. The discount (decay) factor γ defines the importance of future rewards: a value of 0 means that only short-term rewards are considered, while a value of 1 gives full weight to long-term rewards. The main steps of reinforcement learning are as follows:
1) Randomly initializing Q values corresponding to all states and actions, and initializing the Q value of the termination state to 0;
2) The learning task is to learn a strategy (policy), where the strategy specifies the choice of the next action based on the current state; an effective learning strategy is one under which the largest cumulative return is obtained. The cumulative return value obtained by any strategy is:

V^π(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{i≥0} γ^i·r_{t+i}

To learn the optimal strategy, the value evaluation function Q(s, a) is defined as:

Q(s_t, a_t) = r_t + γ·V(s_{t+1})
3) Performing local iterative learning on the evaluation function of the current state, and selecting a globally optimal action sequence;
3.1) Initialize s to be the first state of the current state sequence;
3.2) Select an action a in the current state s using the ε-greedy method;
3.3) Execute the current action a in state s to obtain a new state s' and a reward r;
3.4) Update the value evaluation function Q(s, a):

Q(s, a) ← Q(s, a) + α·[r + γ·max_{a'} Q(s', a') - Q(s, a)],  s ← s'
once the value-assessment function reaches convergence by learning, an optimal strategy can be determined, i.e., the action with the highest Q value is selected as the optimal strategy for each state.
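For concreteness, a minimal Python sketch of the tabular Q-Learning loop in steps 1)-3) follows. The env object with its actions list and reset()/step() methods is an illustrative assumption, not part of the invention; the sketch only shows the ε-greedy selection and the value update.

import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning following steps 1)-3): epsilon-greedy action choice,
    then Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                    # unseen (state, action) pairs start at 0

    def choose_action(s):
        if random.random() < epsilon:                             # explore
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])          # exploit

    for _ in range(episodes):
        s = env.reset()                        # 3.1) first state of the episode
        done = False
        while not done:
            a = choose_action(s)               # 3.2) epsilon-greedy selection
            s_next, r, done = env.step(a)      # 3.3) execute a, observe s' and reward r
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])      # 3.4) value update
            s = s_next
    return Q       # optimal policy: pick argmax_a Q[(s, a)] in each state

Once Q has converged, taking the action with the highest Q value in each state reproduces the optimal strategy described above.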
After combining reinforcement learning with fuzzy control, the invention exploits the universal approximation property of fuzzy control to handle reinforcement learning problems with continuous states and continuous actions effectively. The reinforcement Learning Q-Learning fuzzy control system can represent the mapping from the state set S = {s_i | s_i ∈ S} to the action set A = {a_i | a_i ∈ A}, and can also represent the mapping from a state-action pair (s_t, a_t) to the corresponding value evaluation function Q(s_t, a_t). The fuzzy rule is expressed in the following form:

R_i: If s is F_i Then y is a(i,1) with q(i,1)
     or y is a(i,2) with q(i,2)
     ...
     or y is a(i,j) with q(i,j)

where R_i denotes the i-th fuzzy control rule, s denotes the current state vector used as the input of the fuzzy control, F_i denotes a fuzzy set, and a(i, j) and q(i, j) denote, under F_i, the possible actions of state s and the corresponding evaluation values, respectively. In the reinforcement learning process, an ε-greedy search strategy is adopted to select the consequent action a(i, j*) of the fuzzy rule as the output of the i-th rule.
Using the zero-order Takagi-Sugeno fuzzy inference model, a rule with N inputs has the form:

R_l: If s_1 is F_1^l and s_2 is F_2^l and ... and s_N is F_N^l then f_l = k_l

where F_i^l is the fuzzy set of the i-th input variable in rule R_l, l = 1, 2, ..., and k_l is the crisp consequent of rule R_l; the fuzzy output can be defuzzified into a crisp output using a defuzzification technique.

The action output of the system is the firing-strength-weighted combination of the consequent actions selected in each rule:

a(s) = Σ_i α_i(s)·a(i, j*) / Σ_i α_i(s)

The corresponding Q value is:

Q(s, a) = Σ_i α_i(s)·q(i, j*) / Σ_i α_i(s)

where α_i(s) = min[μ_1(s_1), μ_2(s_2), ...] represents the firing strength of the i-th rule and μ_j(s_j) is the membership degree of the state component s_j.
The evaluation value q(i, j) is updated with a Q-learning algorithm. At time t, under the action a_t(s_t) output by the fuzzy control, the state transitions to s_{t+1} and the reinforcement signal r_t is obtained. The temporal-difference error is calculated by the following formula:

ζ = r_t + γ·max Q_{t+1} - Q_t

The temporal-difference error is used to update the q value of the action. First the gradient of q(i, j) is calculated,

∂Q_t/∂q(i, j) = α_i(s_t) / Σ_k α_k(s_t)   (for the consequent action selected in rule i, 0 otherwise)

and the increment is Δq(i, j) = δ·ζ·e_t(i, j), where δ is the learning rate. An accumulating eligibility trace is used to accelerate reinforcement learning; the eligibility is updated by the following formula, with λ the eligibility-trace decay rate:

e_t(i, j) = γ·λ·e_{t-1}(i, j) + ∂Q_t/∂q(i, j)

Finally, the q value of each possible action of the fuzzy rule is updated by the following formula:

q_t(i, j) = q_{t-1}(i, j) + Δq(i, j).
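To make the update chain above concrete, the following Python sketch performs one fuzzy Q-Learning step: a per-rule ε-greedy choice of consequent action, the firing-strength-weighted action output and Q value, the temporal-difference error ζ, and the eligibility-trace update of q(i, j). The rule representation (a firing-strength function mu and a list of candidate actions per rule) and the way the next-state value q_next_max is supplied are simplifying assumptions for illustration, not the invention's prescribed data structures.

import numpy as np

def fuzzy_q_step(rules, q, e, state, reward, q_next_max,
                 delta=0.05, gamma=0.9, lam=0.7, epsilon=0.1):
    # rules[i]['mu'](state) -> firing strength alpha_i(s); rules[i]['actions'] -> candidate a(i, j)
    # q[i, j]: evaluation value of action j in rule i; e[i, j]: accumulating eligibility trace
    alphas = np.array([r['mu'](state) for r in rules])
    weights = alphas / (alphas.sum() + 1e-12)

    # per-rule epsilon-greedy choice of the consequent action index j*
    chosen = [np.random.randint(len(r['actions'])) if np.random.rand() < epsilon
              else int(np.argmax(q[i])) for i, r in enumerate(rules)]

    # firing-strength-weighted action output a(s) and corresponding Q value
    a_out = sum(w * rules[i]['actions'][j] for i, (w, j) in enumerate(zip(weights, chosen)))
    Q_t = sum(w * q[i, j] for i, (w, j) in enumerate(zip(weights, chosen)))

    # temporal-difference error  zeta = r_t + gamma * max Q_{t+1} - Q_t
    zeta = reward + gamma * q_next_max - Q_t

    # eligibility traces: decay all, then accumulate the gradient for the chosen actions
    e *= gamma * lam
    for i, j in enumerate(chosen):
        e[i, j] += weights[i]

    q += delta * zeta * e      # q_t(i,j) = q_{t-1}(i,j) + delta*zeta*e(i,j)
    return a_out, Q_t

Calling this once per control cycle reproduces the chain ζ → Δq(i, j) → q_t(i, j) described above.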
The fuzzy rule set is the conclusion part obtained by reinforcement learning and is the core of the fuzzy rules; it is based mainly on the green-making process of Minbei oolong tea, as follows:
In its green-making style, Minbei oolong tea is characterized by heavy sun-withering, light shaking, many shaking rounds and heavy fermentation; that is, its green making involves more green-shaking rounds, shorter shaking durations, a lighter shaking degree and shorter green-cooling times. The green-making technique of Minbei oolong tea is complex and has many influencing factors, including tea tree variety, fresh leaf tenderness, withering degree, climate conditions and so on. The data set therefore involves the variables: variety, season, climate, time period, tenderness, withering mode, withering degree, shaking time, airing time, shaking temperature, relative humidity and the like. In addition, by means of the system's online tea quality evaluation function, sensory evaluation results, biochemical components, leaf characteristics and the like are obtained, enriching the fuzzy control input, assisting reinforcement learning and closing the learning loop.
This scheme adopts a fuzzy control method based on reinforcement learning: taking the temperature and humidity obtained through reinforcement learning as the basis, it collects factors such as the temperature and humidity of the green-making site for analysis, judgment and fuzzy decision-making, adjusts the green-shaking time and the green-airing time and degree, and realizes intelligent control. Part of the fuzzy rule set is listed in Table 1.
The quantities in the basic universe of discourse are crisp quantities, so both the input and output of the fuzzy controller are crisp, but the fuzzy control algorithm requires fuzzy quantities. The crisp input (a digital quantity) therefore needs to be converted into a fuzzy quantity; the specific fuzzification process is shown in fig. 2.
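As an illustration of the fuzzification step shown in fig. 2, the sketch below maps a crisp temperature reading onto membership degrees of illustrative linguistic terms using triangular membership functions; the term names and break points are assumptions chosen only for this example and are not values prescribed by the invention.

def tri(x, a, b, c):
    # triangular membership function with feet a, c and peak b
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# illustrative linguistic terms for the green-making temperature input (degrees C)
TEMP_TERMS = {
    'low':    lambda t: tri(t, 15.0, 18.0, 22.0),
    'medium': lambda t: tri(t, 20.0, 24.0, 28.0),
    'high':   lambda t: tri(t, 26.0, 30.0, 34.0),
}

def fuzzify(value, terms):
    # convert a crisp sensor reading into membership degrees per linguistic term
    return {name: mu(value) for name, mu in terms.items()}

print(fuzzify(23.0, TEMP_TERMS))   # {'low': 0.0, 'medium': 0.75, 'high': 0.0}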
The invention introduces parameter optimization based on self-organizing feature mapping and the gradient descent method; the specific algorithm analysis is as follows.
The purpose of fuzzy control learning is to make the controller gradually optimal by adjusting the parameters of the fuzzy membership functions and the mapping relation of the fuzzy rules, while maximizing the external reinforcement learning signal. Reinforcement learning differs from supervised learning (learning from training data) fundamentally in that learning from training data provides a desired output as a teacher signal, whereas in reinforcement learning only a penalty or reward signal can be obtained as an evaluation of the controlled system's performance. However, if a desired control output for a state input can be obtained from a simple penalty or reward signal, the reinforcement learning problem can be solved by borrowing the ideas and methods of learning from training data. Based on this idea, the learning problem of the reinforcement Learning Q-Learning fuzzy control is solved. First, a desired output of the controller is constructed from the reinforcement signal, with ρ a scaling factor on the interval [0, 1]: if r(t+1) > ρ(t+1), the desired output is a better control output than y(t) and should be rewarded; conversely, if r(t+1) < ρ(t+1), the desired output is inferior to y(t) and should be penalized.
then the following steps are executed:
1) Parameter preliminary adjustment based on self-organizing feature mapping
In order to reduce the parameter optimization range of the reinforcement Learning Q-Learning fuzzy control and establish more accurate fuzzy rules, the Kohonen self-organizing feature mapping method is used to preliminarily adjust the means of the fuzzy membership functions. The specific algorithm is as follows:
The desired control output reaches the fifth layer from top to bottom, and the state input reaches the second layer from bottom to top. The mean of each input fuzzy membership function is adjusted according to:

m_i(t+1) = m_i(t) + α·[x(t) - m_i(t)]   if m_i = m_closest
m_i(t+1) = m_i(t)                        if m_i ≠ m_closest

where x(t) is the input training data, α is the learning rate, and k is the number of fuzzy partitions of the input variable; the means of the output fuzzy membership functions are adjusted in the same way.
2) Calculating the certainty:
To find the mapping relation of the fuzzy rules and define their certainty, W_ij (initial value 0) is defined to represent the mapping strength of a fuzzy rule. The specific algorithm is as follows:
Step 1: update the mapping strength of the fuzzy rules. The state input propagates from bottom to top to the interval nodes of the second layer, from which the activation strength of the third-layer rule nodes is obtained; the desired output propagates from top to bottom to the fourth-layer nodes. W_ij is then updated according to the following formula:
Step 2: calculate the certainty. W_ij is normalized according to the following formula to obtain the certainty of each rule:
3) Parameter optimization based on gradient descent method
The mean and variance of the fuzzy membership function are optimized by using a gradient descent method, and the performance index function is taken as follows:
The correction amounts of the mean and the variance are derived using the chain rule.
The mean and variance corrections of the input fuzzy membership functions are:
The mean and variance corrections of the output fuzzy membership functions are:
where β_1, β_2, γ_1 and γ_2 are the learning rates.
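The two numerical steps above can be sketched compactly as follows: the winner-take-all (Kohonen) adjustment of membership-function means from step 1), and one gradient-descent update of a Gaussian membership function's mean and width for step 3). The Gaussian form and the squared-error performance index E = 0.5*(y_desired - y)^2 are assumptions made only for this sketch.

import numpy as np

def som_adjust_means(means, samples, alpha=0.1):
    # step 1): only the mean closest to the training sample x(t) moves toward it,
    # m_i(t+1) = m_i(t) + alpha*(x(t) - m_i(t)); all other means stay unchanged
    means = np.asarray(means, dtype=float).copy()
    for x in samples:
        winner = int(np.argmin(np.abs(means - x)))
        means[winner] += alpha * (x - means[winner])
    return means

def gradient_step(x, m, sigma, y, y_desired, beta=0.01, gamma_lr=0.01):
    # step 3): gradient descent on E = 0.5*(y_desired - y)^2 for a Gaussian
    # membership mu(x) = exp(-(x - m)^2 / sigma^2); dy/dmu is taken as 1 here
    err = y_desired - y
    mu = np.exp(-((x - m) ** 2) / (sigma ** 2))
    dmu_dm = mu * 2.0 * (x - m) / sigma ** 2
    dmu_dsigma = mu * 2.0 * (x - m) ** 2 / sigma ** 3
    return m + beta * err * dmu_dm, sigma + gamma_lr * err * dmu_dsigma

# illustrative use: three fuzzy partitions of the temperature input
print(som_adjust_means([18.0, 24.0, 30.0], samples=[22.5, 23.0, 29.0]))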
Given the characteristics of Minbei oolong tea making (manual operation and complicated procedures), the experience of the tea master is particularly important in the green-making procedure, and making full use of that experience provides a basis for mass production by machine. Heuristic strategy selection uses a heuristic function to inform, or even directly guide, the machine's action selection process without affecting the convergence of the agent's learning process; at the same time, the heuristic function can adapt during learning. The heuristic function is introduced into reinforcement learning to improve the Q-Learning algorithm.
Heuristic strategy selection mainly studies how to use human prior knowledge or process knowledge in reinforcement learning to design and optimize a heuristic reward function, and how to combine it with the original action selection strategy so as to guide the agent's action selection in conjunction with the Q-Learning algorithm. A reinforcement learning system based on heuristic strategy selection is shown in fig. 3.
Compared with a traditional reinforcement learning system, this system adds a heuristic strategy learning module. The module guides action selection in combination with the action selection strategy and does not act directly on the value function, so the convergence of the system is not affected. The suggested actions produced by the heuristic strategy learning module fall into two types: rule-based and heuristic-function-based. The rule-based type gives suggested actions directly; the heuristic-function-based type integrates a heuristic function H(s, a) with the original action function and guides action selection according to the combined action selection function.
In order to ensure the convergence and stability of the algorithm, the suggested action obtained from the heuristic function is not used directly to dictate the agent's behaviour; instead it is fused into the agent probabilistically, for example as:

a = suggested action,  if rand ≤ β;   a = π_Q(s),  otherwise

where π_Q denotes the greedy strategy over the Q values, rand is a random number generated at each step, and β is the probability of using the suggested strategy.
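A small sketch of this probabilistic fusion follows; the heuristic is modelled as a callable that returns an expert-suggested action (or None), and the β and ε values are placeholders, not parameters fixed by the invention.

import random

def select_action(state, q_table, actions, heuristic, beta=0.3, epsilon=0.1):
    # with probability beta follow the action suggested by expert knowledge,
    # otherwise fall back to the usual epsilon-greedy policy over the Q values (pi_Q)
    suggested = heuristic(state)
    if suggested is not None and random.random() < beta:
        return suggested
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))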
The invention has the following characteristics:
1. A reinforcement Learning Q-Learning method is adopted and prior knowledge is introduced at the same time, jointly building the rule base of the Minbei oolong tea green-making fuzzy reinforcement system. Environmental states of the green-making step, such as humidity and temperature, are used as the input variables of the fuzzy control, and the output action of the fuzzy control is determined by two parts: when prior knowledge exists, the output action, i.e. the rotation direction, rotation degree and so on of the green-making barrel, is determined by expert knowledge; in the absence of expert knowledge, the output action is obtained through reinforcement learning.
2. Reinforcement learning and fuzzy control are fused, and the universal approximation property of fuzzy control is exploited to effectively handle reinforcement learning problems with continuous states and continuous actions. Using the zero-order Takagi-Sugeno fuzzy inference model, the q value of each fuzzy rule's possible actions is finally obtained as:

q_t(i, j) = q_{t-1}(i, j) + Δq(i, j)
3. In the parameter optimization of the reinforcement Learning Q-Learning fuzzy control, the desired control output for a state input is obtained from a simple penalty or reward signal, and an optimization method analogous to learning from training data is constructed: a self-organizing feature mapping method is introduced to narrow the reinforcement-learning parameter optimization range, and the gradient descent method is used to optimize the means and variances of the input and output fuzzy membership functions, so as to establish more accurate Minbei oolong tea green-making fuzzy rules.
4. The rich historical accumulation of Minbei oolong tea making and the experience of old masters are fully utilized: heuristic selection strategies based on experience and domain knowledge are adopted to improve reinforcement Learning Q-Learning, and a heuristic function is used to inform, or even directly guide, the agent's action selection process without affecting the convergence of the agent's learning, while the heuristic function itself can adapt during learning.
The above is a preferred embodiment of the present invention, and all changes made according to the technical solution of the present invention belong to the protection scope of the present invention when the generated functional effects do not exceed the scope of the technical solution of the present invention.

Claims (2)

1. A fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning, characterized in that: firstly, the environmental information during tea making is fuzzified and then used as the antecedent of a fuzzy rule in the rule base of the green-making fuzzy reinforcement system; the conclusion of the fuzzy rule, i.e. its consequent, is obtained through the reinforcement learning system, and different combinations of the green-making barrel settings, including rotation direction, rotation degree and rotation time, form the action set of the reinforcement learning system, the consequent of each fuzzy rule being one action from this set; secondly, the rule base of the green-making fuzzy reinforcement system is established: the antecedent of each fuzzy rule corresponds to every action in the action set as a possible consequent, and a Q function is allocated to each action as its evaluation value; the Q value of each fuzzy rule's possible actions is updated, and after learning the action with the maximum Q value in each fuzzy rule is selected as the conclusion of that rule, so as to obtain the final fuzzy control output; the reinforcement Learning Q-Learning in the reinforcement learning system requires the following inputs: number of iterations T, state set S, action set A, learning rate α, decay factor γ and exploration rate ε, wherein the state set S is the environmental information during tea making and the action set describes the green-making barrel, including rotation direction, rotation degree and rotation time; the learning rate α defines how much the newly learned Q value overrides the old Q value, a value of 0 meaning that nothing is learned and a value of 1 meaning that the newly discovered information is the only information that matters; the decay factor γ defines the importance of future rewards, a value of 0 meaning that only short-term rewards are considered and a value of 1 meaning that long-term rewards are valued; the reinforcement Learning Q-Learning steps are as follows:
1) Randomly initializing Q values corresponding to all states and actions, and initializing the Q value of the termination state to 0;
2) The learning task is to learn a strategy (policy), where the strategy specifies the choice of the next action based on the current state; an effective learning strategy is one under which the largest cumulative return is obtained. The cumulative return value obtained by any strategy is:

V^π(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{i≥0} γ^i·r_{t+i}

To learn the optimal strategy, the value evaluation function Q(s, a) is defined as:

Q(s_t, a_t) = r_t + γ·V(s_{t+1})
3) Performing local iterative learning on the evaluation function of the current state, and selecting a globally optimal action sequence;
3.1) Initialize s to be the first state of the current state sequence;
3.2) Select an action a in the current state s using the ε-greedy method;
3.3) Execute the current action a in state s to obtain a new state s' and a reward r;
3.4) Update the value evaluation function Q(s, a):

Q(s, a) ← Q(s, a) + α·[r + γ·max_{a'} Q(s', a') - Q(s, a)],  s ← s'
once the value evaluation function reaches convergence through learning, an optimal strategy can be determined, i.e., the action with the highest Q value is selected as the optimal strategy for each state;
After reinforcement Learning Q-Learning is combined with the fuzzy control, the reinforcement learning system becomes a reinforcement Learning Q-Learning fuzzy control system, which can represent the mapping from the state set S = {s_i | s_i ∈ S} to the action set A = {a_i | a_i ∈ A} and can also represent the mapping from a state-action pair (s_t, a_t) to the corresponding value evaluation function Q(s_t, a_t); the fuzzy rule is expressed in the following form:

R_i: If s is F_i Then y is a(i,1) with q(i,1)
     or y is a(i,2) with q(i,2)
     ...
     or y is a(i,j) with q(i,j)

where R_i denotes the i-th fuzzy control rule, s denotes the current state vector used as the input of the fuzzy control, F_i denotes a fuzzy set, and a(i, j) and q(i, j) denote, under F_i, the possible actions of state s and the corresponding evaluation values, respectively; in the reinforcement learning process, an ε-greedy search strategy is adopted to select the consequent action a(i, j*) of the fuzzy rule as the output of the i-th rule;
Using the zero-order Takagi-Sugeno fuzzy inference model, a rule with N inputs has the form:

R_l: If s_1 is F_1^l and s_2 is F_2^l and ... and s_N is F_N^l then f_l = k_l

where F_i^l is the fuzzy set of the i-th input variable in rule R_l, l = 1, 2, ..., and k_l is the crisp consequent of rule R_l; the fuzzy output can be defuzzified into a crisp output using a defuzzification technique.

The action output of the system is the firing-strength-weighted combination of the consequent actions selected in each rule:

a(s) = Σ_i α_i(s)·a(i, j*) / Σ_i α_i(s)

The corresponding Q value is:

Q(s, a) = Σ_i α_i(s)·q(i, j*) / Σ_i α_i(s)

where α_i(s) = min[μ_1(s_1), μ_2(s_2), ...] represents the firing strength of the i-th rule and μ_j(s_j) is the membership degree of the state component s_j.
The evaluation value q(i, j) is updated with a Q-learning algorithm. At time t, under the action a_t(s_t) output by the fuzzy control, the state transitions to s_{t+1} and the reinforcement signal r_t is obtained. The temporal-difference error is calculated by the following formula:

ζ = r_t + γ·max Q_{t+1} - Q_t

The temporal-difference error is used to update the q value of the action. First the gradient of q(i, j) is calculated,

∂Q_t/∂q(i, j) = α_i(s_t) / Σ_k α_k(s_t)   (for the consequent action selected in rule i, 0 otherwise)

and the increment is Δq(i, j) = δ·ζ·e_t(i, j), where δ is the learning rate. An accumulating eligibility trace is used to accelerate reinforcement learning; the eligibility is updated by the following formula, with λ the eligibility-trace decay rate:

e_t(i, j) = γ·λ·e_{t-1}(i, j) + ∂Q_t/∂q(i, j)

Finally, the q value of each possible action of the fuzzy rule is updated by the following formula:

q_t(i, j) = q_{t-1}(i, j) + Δq(i, j);
parameter optimization based on self-organizing feature mapping and the gradient descent method is introduced to realize the parameter optimization of the reinforcement Learning Q-Learning fuzzy control, which specifically comprises the following steps:
First, a desired output of the controller is constructed from the reinforcement signal, with ρ a scaling factor on the interval [0, 1]: if r(t+1) > ρ(t+1), the desired output is a better control output than y(t) and should be rewarded; conversely, if r(t+1) < ρ(t+1), the desired output is inferior to y(t) and should be penalized;
then the following steps are executed:
1) Parameter preliminary adjustment based on self-organizing feature mapping
In order to reduce the parameter optimization range of the reinforcement Learning Q-Learning fuzzy control and establish more accurate fuzzy rules, the Kohonen self-organizing feature mapping method is used to preliminarily adjust the means of the fuzzy membership functions. The specific algorithm is as follows:
The desired control output reaches the fifth layer from top to bottom, and the state input reaches the second layer from bottom to top. The mean of each input fuzzy membership function is adjusted according to:

m_i(t+1) = m_i(t) + α·[x(t) - m_i(t)]   if m_i = m_closest
m_i(t+1) = m_i(t)                        if m_i ≠ m_closest

where x(t) is the input training data, α is the learning rate, and k is the number of fuzzy partitions of the input variable; the means of the output fuzzy membership functions are adjusted in the same way.
2) Calculating the certainty:
To find the mapping relation of the fuzzy rules and define their certainty, W_ij (initial value 0) is defined to represent the mapping strength of a fuzzy rule. The specific algorithm is as follows:
Step 1: update the mapping strength of the fuzzy rules. The state input propagates from bottom to top to the interval nodes of the second layer, from which the activation strength of the third-layer rule nodes is obtained; the desired output propagates from top to bottom to the fourth-layer nodes. W_ij is then updated according to the following formula:
Step 2: calculate the certainty. W_ij is normalized according to the following formula to obtain the certainty of each rule:
3) Parameter optimization based on gradient descent method
The mean and variance of the fuzzy membership function are optimized by using a gradient descent method, and the performance index function is taken as follows:
The correction amounts of the mean and the variance are derived using the chain rule.
The mean and variance corrections of the input fuzzy membership functions are:
The mean and variance corrections of the output fuzzy membership functions are:
where β_1, β_2, γ_1 and γ_2 are the learning rates.
2. The fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning according to claim 1, characterized in that the reinforcement Learning Q-Learning is heuristic-function-based reinforcement Learning Q-Learning, i.e. a heuristic function is used to inform, or even directly guide, the action selection process of the green-making barrel.
CN202110740746.6A 2021-06-30 2021-06-30 Fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning Active CN113485104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110740746.6A CN113485104B (en) 2021-06-30 2021-06-30 Fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110740746.6A CN113485104B (en) 2021-06-30 2021-06-30 Fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning

Publications (2)

Publication Number Publication Date
CN113485104A CN113485104A (en) 2021-10-08
CN113485104B true CN113485104B (en) 2023-08-01

Family

ID=77936922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110740746.6A Active CN113485104B (en) 2021-06-30 2021-06-30 Fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning

Country Status (1)

Country Link
CN (1) CN113485104B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5175795A (en) * 1988-07-29 1992-12-29 Hitachi, Ltd. Hybridized frame inference and fuzzy reasoning system and method
EP2381394A1 (en) * 2010-04-20 2011-10-26 Alcatel Lucent A method of reinforcement learning, corresponding computer program product, and data storage device therefor
CN109214309A (en) * 2018-08-15 2019-01-15 南京信息工程大学 A kind of tea hill picking personnel's abnormal behaviour monitoring method based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6950813B2 (en) * 2001-04-23 2005-09-27 Hrl Laboratories, Llc Fuzzy inference network for classification of high-dimensional data
WO2020154542A1 (en) * 2019-01-23 2020-07-30 Google Llc Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5175795A (en) * 1988-07-29 1992-12-29 Hitachi, Ltd. Hybridized frame inference and fuzzy reasoning system and method
EP2381394A1 (en) * 2010-04-20 2011-10-26 Alcatel Lucent A method of reinforcement learning, corresponding computer program product, and data storage device therefor
CN109214309A (en) * 2018-08-15 2019-01-15 南京信息工程大学 A kind of tea hill picking personnel's abnormal behaviour monitoring method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Image recognition of Camellia oleifera diseases based on convolutional neural network and transfer learning; Long Mansheng; Ouyang Chunjuan; Liu Huan; Fu Qing; Transactions of the Chinese Society of Agricultural Engineering (No. 18); full text *
Design of a robot fuzzy control system based on reinforcement learning; Duan Yong; Liu Xinggang; Xu Xinhe; Journal of System Simulation (No. 06); full text *
Intelligent recognition method of tea leaf state based on deep learning; Wang Kun; Liu Damao; Journal of Chongqing University of Technology (Natural Science) (No. 12); full text *

Also Published As

Publication number Publication date
CN113485104A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
Gronauer et al. Multi-agent deep reinforcement learning: a survey
CN108830376B (en) Multivalent value network deep reinforcement learning method for time-sensitive environment
CN111339675A (en) Training method for intelligent marketing strategy based on machine learning simulation environment
He et al. Wd3: Taming the estimation bias in deep reinforcement learning
CN113485104B (en) Fuzzy control 'cloud green making' intelligent method based on reinforcement Learning Q-Learning
CN112163671A (en) New energy scene generation method and system
Mousavi et al. Applying q (λ)-learning in deep reinforcement learning to play atari games
Cini et al. Deep reinforcement learning with weighted Q-Learning
Or et al. DL-DDA-deep learning based dynamic difficulty adjustment with UX and gameplay constraints
Gummadi et al. Mean field analysis of multi-armed bandit games
CN112182485A (en) Online knowledge sharing dynamic rewarding method based on evolutionary game
CN116451737A (en) PG-W-PSO method for improving particle swarm based on reinforcement learning strategy gradient
CN116128028A (en) Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
CN114359349B (en) Lifelong learning method and system for vehicle adaptive path tracking
Rodrigues et al. Optimizing agent training with deep q-learning on a self-driving reinforcement learning environment
CN115293361A (en) Rainbow agent training method based on curiosity mechanism
CN115273502A (en) Traffic signal cooperative control method
CN114842149A (en) Image and segmentation label generation model of tree-shaped structure data and application
CN114757265A (en) Anomaly detection method for MOOC (massively online education) data set
CN112529183A (en) Knowledge distillation-based model self-adaptive updating method
CN113486952A (en) Multi-factor model optimization method of gene regulation and control network
CN112749211A (en) Novel tea yield prediction method based on electric power big data
CN115437258A (en) Active disturbance rejection temperature control method based on novel neural network
Li et al. Realistic Actor-Critic: A framework for balance between value overestimation and underestimation
CN108537324A (en) The double-channel self-adapting correction network optimization system of the extensive layer of feature based

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant