Disclosure of Invention
The invention aims to provide a control method of a liquid-cooled battery thermal management system, which effectively solves the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions.
A control method of a liquid cooling battery thermal management system comprises the following steps:
1) Environment construction: a multi-layer perceptron (MLP) model is built, consisting of an input layer, a hidden layer and an output layer. The input layer receives the data, with the number of nodes matching the number of features of the input data; each neuron in the hidden layer is connected to all nodes of the previous layer and performs a weighted summation through its weights; the output layer uses a linear activation function f(x) = x to output a regression result. The MLP serves as an environment simulator that predicts the future state of the battery from the current battery state (such as SOC, SOH, temperature, voltage, current) and control actions (such as liquid cooling temperature setting, liquid flow pressure); this prediction model is the basis for training the deep reinforcement learning model.
Further, the environment construction includes:
A\ Building a data collection platform that communicates in real time with both the battery management system and the liquid cooling control system. Operating parameters and run-time data of the battery are collected from an energy storage system experiment platform and uploaded to a cloud server database. The training equipment extracts the data from the cloud server and preprocesses them: denoising, filling in missing data, correcting erroneous or out-of-range values, and normalizing the data. A physical simulation model of the battery and liquid cooling system is also built to obtain battery operating information in more states than the real data cover, thereby achieving data augmentation;
B\ Establishing an MLP model from the processed data, taking the current battery state and control action as input and outputting the predicted new state of the battery, thereby modeling the lithium battery and its surrounding environment.
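A minimal sketch of such an MLP environment simulator follows. The feature layout (five state features, two control actions) and the random, untrained weights are illustrative assumptions; in the method described, the weights would be fitted to the preprocessed logged data.

```python
import numpy as np

rng = np.random.default_rng(0)

class MLPEnvModel:
    """MLP surrogate: [battery state, control action] -> predicted next battery state."""

    def __init__(self, n_in, n_hidden, n_out):
        # Random initial weights; in practice these are fitted to the preprocessed data.
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def predict(self, state, action):
        x = np.concatenate([state, action])         # input layer: one node per feature
        h = np.maximum(0.0, x @ self.W1 + self.b1)  # hidden layer: weighted sum + ReLU
        return h @ self.W2 + self.b2                # output layer: linear f(x) = x

# Hypothetical layout: state = [SOC, SOH, temperature, voltage, current],
# action = [coolant temperature setting, liquid flow pressure].
model = MLPEnvModel(n_in=7, n_hidden=16, n_out=5)
next_state = model.predict(np.array([0.8, 0.95, 30.0, 3.7, 50.0]),
                           np.array([18.0, 2.5]))
```

The predicted vector has one entry per state feature, so the model can be iterated as a simulator during reinforcement learning training.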
2) Model training: a deep reinforcement learning training environment is built from the above MLP model, and the adaptive model predictive control Deep Diffusion Soft Actor-Critic (D2SAC) algorithm interacts with this environment to train the deep reinforcement learning model. The policy network (Actor) of D2SAC gradually adds Gaussian noise to the action distribution in the forward process to increase its randomness, gradually removes the noise through learning in the reverse process so as to recover an optimal action distribution, and samples an action a_t = π_θ(s_t) from this distribution, where s_t is the current state and π_θ(s_t) is the action generated by the policy network. The value function of each action is estimated by the double Q network (value network) to reduce overestimation, and is updated using the target value function y_t = r_t + γ·(1 − d_t)·min(Q1, Q2)(s_(t+1), a_(t+1)), where r_t is the reward, γ is the discount factor and d_t is the termination flag. The policy network is updated by maximizing the objective function consisting of the expected reward and an entropy term, J(θ) = E[Qmin(s_t, a_t) − α·log π_θ(a_t|s_t)], where α is the entropy weight and log π_θ(a_t|s_t) is the entropy term of the action;
The operation steps of the adaptive model predictive control Deep Diffusion Soft Actor-Critic (D2SAC) algorithm comprise algorithm initialization, action sampling, experience storage, sampling and target Q value calculation, Critic network updating, Actor network updating, automatic entropy coefficient adjustment, and soft updating of the target Q network; these operations are repeated until the strategy converges or a stopping condition is met. The value network (Critic) estimates the value function of each action; a double critic network (Double Critic Network), i.e. a double Q network, updates the target Q network with the smaller of the two Q values, so as to reduce the possibility of overestimation and improve the stability and performance of the algorithm. The target Q network (target network) evaluates the action value; it is independent of the value network and is updated in a soft manner to avoid rapid fluctuation of the Q value during training and maintain update stability.
Further, the algorithm initialization operation includes:
A\initializing environment, generating a strategy network (Actor network) based on a diffusion model and adopting a double Q network as a value network (Critic network), and then setting neural network parameters, including initializing noise parameters (such as the number T of steps and the noise level sigma) of the diffusion model and randomly initializing the weight of the double Q network by using Gaussian distribution;
Initializing a target Q network, wherein the parameters of the target Q network are generally the same as those of the double Q network;
θtarget←θmain
Where θ target is a parameter of the target Q network, and θ main is a parameter of the dual Q network.
Initializing an experience playback pool for storing experience samples generated by the interaction of an agent with the environment, including but not limited to a state s, an action a, a reward r, a next state s', and a termination flag d;
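The experience playback pool described above can be sketched as a simple bounded buffer with uniform random sampling (a minimal illustration, not the full implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool of (s, a, r, s', d) tuples with uniform random sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples are evicted first

    def store(self, s, a, r, s_next, d):
        self.buffer.append((s, a, r, s_next, d))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

pool = ReplayBuffer(capacity=1000)
for i in range(10):
    pool.store(s=i, a=0.5, r=-1.0, s_next=i + 1, d=False)
batch = pool.sample(4)
```

The bounded `deque` gives the usual replay-pool behavior: once capacity is reached, the oldest experience samples are discarded as new ones are stored.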
The sampling action operation includes:
A\ Given the current state s, a random initial vector a_T ~ N(0, I) is initialized in the diffusion model of the policy network. During the forward process, noise is gradually added to the data:
a_t = √(α_t)·a_(t−1) + √(1 − α_t)·ε_t
where α_t controls the noise intensity and ε_t ~ N(0, I) is random noise.
In each step t, the mean and variance of the denoising distribution are inferred using a deep neural network. From the current state s and the time step t, the network outputs the mean μ_θ(a_t, s, t) and variance σ_θ²(a_t, s, t) of the denoising distribution; a Gaussian distribution N(μ_θ(a_t, s, t), σ_θ²(a_t, s, t)) is generated from this mean and variance, and an action a is randomly sampled from it.
B\inputting the sampled action a into the environment to obtain the next state s', the reward r and a sign d of whether to terminate;
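The reverse denoising loop of step A can be sketched as follows. The `denoise_net` function is a placeholder for the trained network that outputs the denoising mean and variance; its shrink-toward-zero rule is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def denoise_net(a, state, t):
    # Placeholder for the learned network: it simply shrinks the action toward
    # zero; a trained network would condition on (a, state, t).
    mean = 0.9 * a
    std = 0.1 * np.ones_like(a)
    return mean, std

def sample_action(state, T=5, action_dim=2):
    a = rng.normal(size=action_dim)           # a_T: random initial vector ~ N(0, I)
    for t in range(T, 0, -1):
        mean, std = denoise_net(a, state, t)  # infer denoising mean and variance
        a = rng.normal(mean, std)             # a_(t-1) ~ N(mean, std^2)
    return a                                  # a_0 is the sampled action

action = sample_action(state=np.zeros(5))
```

Each pass through the loop replaces the noisy action with a sample from the inferred denoising distribution, so after T steps the pure-noise vector a_T has been refined into a usable action a_0.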
The store experience operation includes:
storing the current state s, the action a, the reward r, the next state s' and the termination mark d into an experience playback pool;
The sampling and calculating the target Q value comprises the following steps:
A\ Randomly sampling a batch of data samples from the experience playback pool, and calculating the target Q value and updating the network parameters;
B\ Calculating the Q value of the next state from the target Q network, where action a′ is generated by the reverse diffusion process:
Qtarget(s′,a′)=Target Q Network(s′,a′)
C\ Selecting the smaller Q value to avoid overestimation:
Qmin(s′,a′)=min(Q1(s′,a′),Q2(s′,a′))
D\ Calculating the target Q value, combining the reward and the discount factor γ according to the objective function of D2SAC:
y=r+γQmin(s′,a′)
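Steps B–D above amount to a few lines of arithmetic. The sketch below also applies a (1 − d) termination mask, an added assumption consistent with the termination flag d defined earlier in the text:

```python
def target_q(r, q1_next, q2_next, gamma=0.99, done=False):
    """y = r + γ·(1 − d)·min(Q1, Q2): clipped double-Q target."""
    q_min = min(q1_next, q2_next)   # C: take the smaller Q to avoid overestimation
    mask = 0.0 if done else 1.0     # do not bootstrap past a terminal state
    return r + gamma * mask * q_min # D: combine reward and discounted value

y = target_q(r=1.0, q1_next=2.0, q2_next=3.0, gamma=0.5)  # -> 1.0 + 0.5 * 2.0 = 2.0
```

When the episode terminates (d = 1), the target reduces to the immediate reward alone.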
The Critic network updating operation comprises the following steps:
A\ Calculating the difference between the Q value output by the Critic double Q network and the target Q value, using a mean square error loss function:
L(θ) = (1/N)·Σ (Q(s, a) − y)²
B\ Minimizing the loss function through the back propagation algorithm to update the parameters of the Critic double Q network:
θ ← θ − η·∇θL(θ)
where η is the learning rate;
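A toy illustration of steps A–B, using a linear critic and one explicit gradient step so the update is fully visible (the real Critic is a deep network updated by back propagation; the sizes and data here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.normal(size=(32, 4))  # batch of (state, action) features
y = rng.normal(size=32)       # target Q values computed in the previous step
w = np.zeros(4)               # critic parameters θ
eta = 0.1                     # learning rate η

q_pred = X @ w
loss_before = np.mean((q_pred - y) ** 2)   # L(θ) = (1/N)·Σ (Q(s,a) − y)²
grad = 2.0 / len(y) * X.T @ (q_pred - y)   # ∇θ L for the linear critic
w = w - eta * grad                         # θ ← θ − η·∇θ L
loss_after = np.mean((X @ w - y) ** 2)
```

A single gradient step with a suitably small η reduces the mean square error, which is exactly what repeated Critic updates rely on.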
the updating the Actor network operation comprises the following steps:
A\ Based on the minimum Q value output by the Critic double Q network, calculating the policy loss of the Actor network:
L(θ) = −Es~ρ,a~diffusion process[Qmin(s, a) + α·H(πθ(a|s))]
where πθ(a|s) represents the probability distribution of actions generated by denoising, H(πθ(a|s)) is the entropy of the policy πθ, Qmin(s, a) is the minimum Q value for the current state and action, and α is the entropy coefficient. The objective of this loss function is to maximize the Q value of the action in the given state, while the entropy regularization term encourages randomness of the policy and avoids falling into local optima;
B\minimizing the strategy loss to update the parameters of the Actor network;
the automatic entropy coefficient adjusting operation comprises the following steps:
a\calculating the loss of the entropy coefficient according to the difference between the actual entropy value of the current strategy and the target entropy value:
L(α)=α·(-logπ(a|s)-Htarget)
where −log π(a|s) is the actual entropy of the current strategy and Htarget is the target entropy value;
B\ Calculating the gradient of the loss function L(α) with respect to the entropy coefficient α by back propagation, and updating the entropy coefficient using the Adam algorithm:
α ← α − η·∇αL(α)
where η is the learning rate;
Dynamically adjusting the entropy coefficient balances the exploration-exploitation (Exploration-Exploitation) relationship, i.e. it enables the policy network both to select behaviors that do not currently appear optimal in order to acquire more information, and to make optimal decisions based on currently known information;
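A minimal sketch of the entropy-coefficient adjustment above, using plain gradient descent in place of the Adam optimizer for brevity (names and values are illustrative):

```python
def alpha_step(alpha, log_pi, h_target, eta=0.1):
    """L(α) = α·(−log π(a|s) − H_target); one step of α ← α − η·∂L/∂α."""
    grad = -log_pi - h_target            # ∂L/∂α: actual entropy minus target entropy
    return max(alpha - eta * grad, 0.0)  # keep the entropy coefficient non-negative

# Actual entropy (1.0) is below the target (2.0), so the gradient is negative
# and α increases, pushing the policy toward more exploration.
alpha = alpha_step(alpha=0.2, log_pi=-1.0, h_target=2.0)
```

When the policy's actual entropy exceeds the target, the sign flips and α shrinks, damping exploration; this is the feedback loop that balances exploration and exploitation.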
The soft update target Q network operation includes:
and A\carrying out soft update on the parameters of the target Q network, wherein the update rule is as follows:
θtarget←τθmain+(1-τ)θtarget
where τ is a small constant (e.g., 0.005) between 0 and 1 for controlling the update rate of the target Q network parameter θ target to the Critic double Q network parameter θ main;
B\ After the repeated training is finished, outputting the final policy network (Actor network), which can generate optimal actions in a given state.
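The soft update rule θtarget ← τ·θmain + (1 − τ)·θtarget can be sketched element-wise as:

```python
def soft_update(theta_target, theta_main, tau=0.005):
    """Move each target parameter a small fraction τ toward the main parameter."""
    return [tau * m + (1.0 - tau) * t for t, m in zip(theta_target, theta_main)]

# With τ = 0.5 (an exaggerated value for illustration) the target parameters
# move halfway toward the main parameters.
updated = soft_update([0.0, 1.0], [1.0, 3.0], tau=0.5)  # -> [0.5, 2.0]
```

With the small τ values typical in practice (e.g. 0.005), the target network trails the main network slowly, which is what keeps the Q targets stable during training.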
3) Offline training is carried out on the deep reinforcement learning model to obtain an optimized liquid cooling system control strategy;
4) Actual regulation and control: a corresponding instruction is sent to the upper computer according to the optimized liquid cooling system control strategy, so as to regulate the liquid cooling system. The method comprises the following steps:
A\obtaining data and preprocessing;
B\ Dividing the data into a training set and a test set, using 80% of the original data for training and the remaining 20% for testing;
C\carrying out dimension transformation operation on the data so as to be input into an MLP model subsequently;
D\training by using a D2SAC algorithm, and sending a corresponding instruction to an upper computer according to a trained model result to control the liquid cooling system to regulate and control.
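The 80/20 split in step B can be sketched as follows (shuffling before the cut is an added assumption; the text does not specify the ordering):

```python
import numpy as np

rng = np.random.default_rng(3)

def split_80_20(data):
    """Shuffle the samples, keep the first 80% for training, the rest for testing."""
    idx = rng.permutation(len(data))
    cut = int(len(data) * 0.8)
    return data[idx[:cut]], data[idx[cut:]]

data = np.arange(100).reshape(50, 2)  # 50 toy samples with 2 features each
train, test = split_80_20(data)
```

Holding out the 20% test split gives an estimate of how well the trained model generalizes to unseen battery operating data.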
Further, the adaptive model predictive control Deep Diffusion Soft Actor-Critic (D2SAC) algorithm uses a double Q network to reduce estimation bias, and adds an entropy term H(π) = Es~ρ,a~π[−log π(a|s)] to increase the randomness of the policy, where H(π) is the entropy of the policy π, s is the state, a is the action, and ρ is the state distribution;
The adaptive model predictive control Deep Diffusion Soft Actor-Critic (D2SAC) algorithm dynamically adjusts the weight coefficient α of the entropy term by monitoring the entropy level of the current policy and minimizing the loss function L(α) = α·(−log π(a|s) − Htarget), where −log π(a|s) is the actual entropy of the current policy and Htarget is the target entropy value;
Through the above operations, the policy network can select actions that do not currently appear optimal in order to acquire more information, while still making optimal decisions based on currently known information, and the liquid cooling strategy can be dynamically adjusted according to the entropy level of the current policy. For example, when the algorithm greedily selects higher rewards and may gradually fall into a locally optimal solution, the entropy term helps it jump out of this dilemma and encourages it to explore more unknown possibilities; conversely, a smaller entropy coefficient keeps the algorithm from being disturbed by excessive exploration, so that the information obtained from existing training is fully utilized. This achieves a balance in the exploration-exploitation (Exploration-Exploitation) relationship.
Furthermore, the model training may instead adopt the Soft Actor-Critic (SAC) algorithm to interact with the training environment and train the deep reinforcement learning model. The SAC algorithm comprises a policy network (Actor), a double Q network (value network), a target value function update and a policy update. The policy network (Actor) uses a diffusion-model-based algorithm: Gaussian noise is gradually added to the action distribution in the forward process to increase its randomness, the noise is gradually removed through learning in the reverse process to recover an optimal Gaussian distribution as the action distribution, and an action is sampled from this distribution. The Q value of each action-state pair is estimated by the double Q network (value network) and updated using the target value function; SAC updates the Actor network by maximizing the sum of the expected reward of the policy and the entropy term. The above flow is repeated until the termination condition is reached.
The training process of the Soft Actor-Critic (SAC) algorithm is as follows:
A\ At the start of training, initializing the parameters of the Actor and double Q networks and the experience playback pool D;
B\ For each training iteration, acquiring an initial state from the environment;
C\ In each time step, the Actor generates the optimal Gaussian distribution as the action distribution according to the current state through the denoising process, samples an action from this distribution, executes the action, and observes the new state and reward;
D\ Updating the parameters of the double Q network using the new state and reward information;
E\ Updating the parameters of the Actor using the outputs of the current double Q network and the entropy regularization term;
F\ Repeating the above process until the training end condition is reached.
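The training loop A–F can be sketched with stand-in stubs for the environment and policy (a toy regulation task toward a setpoint of zero; the real networks and their updates are omitted and marked in comments):

```python
import random

def toy_env_step(state, action):
    # Stand-in for the MLP environment model: the action nudges the state,
    # and the reward penalizes deviation from a setpoint of zero.
    next_state = state + 0.1 * action
    return next_state, -abs(next_state), abs(next_state) < 0.01

def train(episodes=3, steps=20):
    rng = random.Random(0)
    for _ in range(episodes):                    # B: each training iteration
        state = rng.uniform(-1.0, 1.0)           #    initial state from the environment
        for _ in range(steps):                   # C: per time step
            action = -state + rng.gauss(0, 0.1)  #    noisy action from the (stub) Actor
            state, reward, done = toy_env_step(state, action)
            # D: update the double Q network with (s, a, r, s')  (omitted in this stub)
            # E: update the Actor with Q outputs + entropy term  (omitted in this stub)
            if done:                             # F: stop at the end condition
                break
    return state

final_state = train()
```

The skeleton shows only the control flow of steps A–F; in the actual method, the commented-out updates D and E are where the double Q network and Actor parameters are learned.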
Domestic research on liquid cooling control schemes for battery thermal management systems is mainly based on traditional PID control, fuzzy control, model predictive control and similar technologies, and lacks the application of machine learning techniques such as deep learning and reinforcement learning. By introducing deep reinforcement learning into the liquid cooling control technology of the lithium battery thermal management system, the invention provides basic theory and key technical support for the application of deep learning and reinforcement learning in the field of lithium battery thermal management. Meanwhile, the adaptive model predictive control Deep Diffusion Soft Actor-Critic (D2SAC) algorithm and the Soft Actor-Critic (SAC) algorithm provided by the invention can, by exploring different cooling strategies, reducing estimation bias and adaptively adjusting the control strategy, further improve the efficiency and effect of liquid cooling control on the basis of the prior art.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In electric vehicles and energy storage systems, the efficiency and stability of the liquid-cooled battery thermal management system play a critical role in the overall performance and life of the battery pack. Because the battery is affected by various factors such as charge and discharge cycles, ambient temperature and load changes, its internal parameters (e.g., SOC, SOH) and external conditions (e.g., temperature, current, voltage) change constantly during use. It is therefore important to develop a liquid cooling control strategy capable of adapting to these changes; the adaptive model predictive control Deep Diffusion Soft Actor-Critic (D2SAC) algorithm of the present invention is the innovative solution designed for this requirement. The algorithm combines the prediction capability of Model Predictive Control (MPC) with the adaptive learning capability of Deep Reinforcement Learning (DRL), and can automatically follow changes in the parameters of the liquid-cooled battery thermal management system through continuous learning and optimization, thereby realizing dynamic adjustment of the liquid cooling control strategy. The workflow of the D2SAC algorithm comprises the following steps:
1. Data collection and preprocessing
Various parameters of the battery and the liquid cooling system are collected from the energy storage system experiment platform in real time, wherein the parameters comprise key data such as temperature, current and voltage.
Preprocessing the acquired data, including denoising, normalization, missing value processing and the like, so as to ensure the data quality.
And determining the state space and the control action space of the battery and the liquid cooling system, and providing a basis for subsequent modeling and training.
2. Data set partitioning
The preprocessed data is divided into training and validation sets (or test sets), typically using 80% of the data as training sets and the remaining 20% as test sets. This is to evaluate the generalization ability of the algorithm on unseen data.
3. Modeling of multilayer perceptrons (MLP)
The lithium battery and the surrounding environment were modeled using a multilayer perceptron (MLP) as a base model. The MLP is capable of handling nonlinear relationships and is suitable for predicting new states of the battery.
The current state (such as SOC, SOH, temperature, etc.) and control actions (such as liquid cooling temperature, liquid flow pressure, etc.) of the battery are taken as inputs of the MLP, and the predicted new state (such as future temperature, voltage, etc.) of the battery is output.
And carrying out dimension transformation operation on the data to adapt to the input requirement of the MLP model.
4. Setting up deep reinforcement learning training environment
A training environment for Deep Reinforcement Learning (DRL) is built by using an MLP model, wherein the MLP model is used for simulating the interaction between the current state of the battery and the environmental condition.
The D2SAC algorithm is designed to combine reinforcement learning and entropy maximization; it can find a balance between exploration and exploitation, while encouraging higher entropy (i.e., more randomness) of the strategy to enhance the robustness of the algorithm.
During training, the strategy network (actor) generates noisy actions to increase diversity of exploration. The value function of each action is estimated through the double Q network and updated using the target value function to stabilize the training process.
The policy network is updated to maximize the sum of the expected rewards and entropy terms of the policy so that the policy maintains a certain randomness while pursuing a high rewards. And dynamically adjusting the liquid cooling strategy according to the entropy level of the current strategy so as to adapt to the change of system parameters.
5. Real-time control
And deploying the trained model into an actual system, and sending a corresponding instruction to an upper computer according to a result of the model.
After the upper computer receives the instruction, the liquid cooling system is controlled to perform corresponding regulation and control so as to realize effective management of parameters such as battery temperature and the like.
Through the steps, the D2SAC algorithm can realize the self-adaptive control of the liquid-cooled battery thermal management system, and the control strategy is dynamically adjusted according to the change of system parameters, so that the overall performance and the service life of the battery pack are improved.
Example 1
The embodiment aims at providing a control method capable of intelligently adjusting the temperature of an energy storage liquid cooling system in real time. The parameters of the liquid cooling battery thermal management system include battery state of charge (SOC), battery state of health (SOH), temperature, current, voltage, liquid cooling temperature, liquid flow pressure and the like. The intelligent control method of the liquid cooling battery thermal management system comprises the following steps:
1. Collecting various operating parameters of the battery and the liquid cooling system, such as temperature, current and voltage, from the energy storage system experiment platform, preprocessing the data, and determining the corresponding state and control action spaces.
2. Obtaining the preprocessed data and dividing it into a training set and a test set, with 80% of the original data used for training and the remaining 20% for testing.
3. A multi-layer perceptron (MLP) model is built, which receives the current battery state and control action as input and outputs the predicted new state of the battery, thereby modeling the lithium battery and its surrounding environment; the data undergo a dimension transformation operation so that they can be input into the MLP model.
4. A deep reinforcement learning training environment is built using the MLP model, the D2SAC algorithm is designed, and the reinforcement learning model is trained offline by simulating the interaction between the current battery state and the environmental conditions, so as to continuously optimize the control strategy.
The method comprises the steps of generating noisy actions through a strategy network (actor) of a D2SAC algorithm, estimating a value function of each action through a double Q network, updating through a target value function, updating the strategy network through the sum of expected rewards and entropy items of a maximized strategy, and dynamically adjusting the liquid cooling strategy according to the entropy level of the current strategy.
5. And according to the trained model result, sending a corresponding instruction to the upper computer, and controlling the liquid cooling system to regulate and control.
In step 1, the various parameters of the battery and the liquid cooling system may include: water pump running state, unit water outlet temperature, liquid supplementing water pump state, compressor running state, unit backwater temperature, fault alarm code, unit water outlet pressure, electric heating running state, unit backwater pressure, condensing fan running state, ambient temperature outside the unit, total voltage, minimum cell voltage, maximum cell temperature, fan relay, discharge permission control, in-box fire-fighting device state, total current, maximum cell voltage module number, minimum cell temperature module number, operation indication, maximum permitted charging current, battery state of charge (SOC), minimum cell voltage module number, total positive relay, alarm indication, maximum permitted discharging current, SOH, total negative relay, fault indication, maximum permitted charging power, maximum cell voltage, minimum cell temperature, pre-charging relay, charging permission control, and maximum permitted discharging power.
The data preprocessing in the step 1 comprises data denoising processing, supplementing processing of missing data, correction processing of data with errors or beyond a permissible range and normalization processing of data.
In step 3, the multi-layer perceptron Model (MLP) is composed of an input layer, a hidden layer and an output layer, wherein the input layer is responsible for receiving data, the number of nodes is consistent with the characteristic number of the input data, each neuron in the hidden layer is connected with all nodes of the previous layer and performs weighted summation through weights, and the output layer adopts a linear activation function to output a regression result.
In step 4, the deep reinforcement learning model is trained using the D2SAC algorithm. The D2SAC algorithm uses a diffusion-model-based policy network and a double Q network as the value network to reduce estimation bias, and adds an entropy term H(π) = Es~ρ,a~π[−log π(a|s)] to the policy optimization to increase the randomness of the policy, where H(π) is the entropy of the policy π, s is the state, a is the action, and ρ is the state distribution. Meanwhile, D2SAC dynamically adjusts the weight coefficient α of the entropy term by monitoring the entropy level of the current policy and minimizing the loss function L(α) = α·(−log π(a|s) − Htarget), where −log π(a|s) is the actual entropy of the current policy and Htarget is the target entropy value. Through these operations, the policy network can select behaviors that do not currently appear optimal in order to obtain more information, while still making optimal decisions based on currently known information, i.e., it balances the exploration-exploitation (Exploration-Exploitation) relationship.
In the step 4, the operation steps of D2SAC algorithm training comprise algorithm initialization, sampling action, experience storage, sampling and calculation of target Q value, updating of Critic double Q network, updating of Actor network, automatic adjustment of entropy coefficient and soft updating of target Q network;
The algorithm initialization operation includes:
A\initializing environment, generating a strategy network (Actor network) based on a diffusion model and adopting a double Q network as a value network (Critic network), and then setting neural network parameters, including initializing noise parameters (such as the number T of steps and the noise level sigma) of the diffusion model and randomly initializing the weight of the double Q network by using Gaussian distribution;
initializing a target Q network, wherein the parameters of the target Q network are generally the same as those of a Critic double Q network;
θtarget←θmain
wherein, θ target is a parameter of the target Q network, and θ main is a parameter of the Critic double Q network.
Initializing an experience playback pool for storing experience samples generated by the interaction of an agent with the environment, including but not limited to a state s, an action a, a reward r, a next state s', and a termination flag d;
The sampling action operation includes:
A\ Given the current state s, a random initial vector a_T ~ N(0, I) is initialized in the diffusion model of the policy network. During the forward process, noise is gradually added to the data:
a_t = √(α_t)·a_(t−1) + √(1 − α_t)·ε_t
where α_t controls the noise intensity and ε_t ~ N(0, I) is random noise.
In each step t, the mean and variance of the denoising distribution are inferred using a deep neural network. From the current state s and the time step t, the network outputs the mean μ_θ(a_t, s, t) and variance σ_θ²(a_t, s, t) of the denoising distribution; a Gaussian distribution N(μ_θ(a_t, s, t), σ_θ²(a_t, s, t)) is generated from this mean and variance, and an action a is randomly sampled from it.
B\inputting the sampled action a into the environment to obtain the next state s', the reward r and a sign d of whether to terminate;
The store experience operation includes:
Storing the current state s, the action a, the reward r, the next state s' and the termination mark d into an experience playback pool;
The sampling and calculating the target Q value comprises the following steps:
A\ Randomly sampling a batch of data samples from the experience playback pool, and calculating the target Q value and updating the network parameters;
B\ Calculating the Q value of the next state from the target Q network, where action a′ is generated by the reverse diffusion process:
Qtarget(s′,a′)=Target Q Network(s′,a′)
C\ Selecting the smaller Q value to avoid overestimation:
Qmin(s′,a′)=min(Q1(s′,a′),Q2(s′,a′))
D\ Calculating the target Q value, combining the reward and the discount factor γ according to the objective function of D2SAC:
y=r+γQmin(s′,a′)
The updating Critic double Q network operation comprises the following steps:
A\ Calculating the difference between the Q value output by the Critic double Q network and the target Q value, using a mean square error loss function:
L(θ) = (1/N)·Σ (Q(s, a) − y)²
B\ Minimizing the loss function through the back propagation algorithm to update the parameters of the Critic double Q network:
θ ← θ − η·∇θL(θ)
where η is the learning rate;
the updating the Actor network operation comprises the following steps:
A\ Based on the minimum Q value output by the Critic double Q network, calculating the policy loss of the Actor network:
L(θ) = −Es~ρ,a~diffusion process[Qmin(s, a) + α·H(πθ(a|s))]
where πθ(a|s) represents the probability distribution of actions generated by denoising, H(πθ(a|s)) is the entropy of the policy πθ, Qmin(s, a) is the minimum Q value for the current state and action, and α is the entropy coefficient. The objective of this loss function is to maximize the Q value of the action in the given state, while the entropy regularization term encourages randomness of the policy and avoids falling into local optima;
B\minimizing the strategy loss to update the parameters of the Actor network;
the automatic entropy coefficient adjusting operation comprises the following steps:
a\calculating the loss of the entropy coefficient according to the difference between the actual entropy value of the current strategy and the target entropy value:
L(α)=α·(-logπ(a|s)-Htarget)
where −log π(a|s) is the actual entropy of the current strategy and Htarget is the target entropy value;
B\ Calculating the gradient of the loss function L(α) with respect to the entropy coefficient α by back propagation, and updating the entropy coefficient using the Adam algorithm:
α ← α − η·∇αL(α)
where η is the learning rate;
Dynamically adjusting the entropy coefficient balances the exploration-exploitation (Exploration-Exploitation) relationship, i.e. it enables the policy network both to select behaviors that do not currently appear optimal in order to acquire more information, and to make optimal decisions based on currently known information;
The soft update target Q network operation includes:
and A\carrying out soft update on the parameters of the target Q network, wherein the update rule is as follows:
θtarget←τθmain+(1-τ)θtarget
Where τ is a small constant (e.g., 0.005) between 0 and 1 for controlling the update rate of the target Q network parameter θ target to the Critic double Q network parameter θ main.
And B\after the repeated training is finished, outputting a final strategy network (Actor network), wherein the strategy network (Actor network) can generate optimal actions under a given state.
Because the parameters of the liquid cooling battery thermal management system can change along with working conditions and environments, the self-adaptive model predictive control Deep Diffusion Soft Actor-Critic (D2 SAC) algorithm can automatically and continuously optimize the liquid cooling control strategy along with the change of the system parameters.
Alternatively, in the above embodiment, a Soft Actor-Critic (SAC) algorithm may be used to interact in a training environment to train a deep reinforcement learning model, where the SAC algorithm includes a policy network (Actor), a dual Q network (value network), a target value function update, and a policy update, where the policy network (Actor) generates noisy actions, estimates a value function of each action through the dual Q network (value network), and updates the value function with the target value function;
the training process of the Soft Actor-Critic (SAC) algorithm is as follows:
a\initializing the parameters of the Actor network and the double Q network at the start of training;
b\for each training iteration, acquiring an initial state from the environment;
c\at each time step, the Actor generates a noisy action according to the current state, executes the action, and observes the new state and reward;
d\updating the parameters of the double Q network using the new state and reward information;
e\updating the parameters of the Actor using the output of the current double Q network and the entropy regularization term;
f\repeating the above process until the training termination condition is reached.
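The training loop a-f above can be sketched as a skeleton; the stub actor, the toy one-dimensional environment, and the placeholder update functions are illustrative assumptions standing in for real neural networks and the battery/liquid-cooling simulator:

```python
# Skeleton of the SAC training loop (steps a-f). The environment here is
# a toy scalar system; a real implementation would call the MLP simulator.
import random

def actor(state):                      # c) noisy (stochastic) action
    return max(-1.0, min(1.0, 0.1 * state + random.gauss(0.0, 0.2)))

def env_step(state, action):           # toy environment transition
    next_state = 0.9 * state + action
    reward = -abs(next_state)          # e.g. penalize temperature deviation
    return next_state, reward

def update_critic(batch):              # d) placeholder double-Q update
    pass

def update_actor(batch):               # e) placeholder policy update (with entropy)
    pass

random.seed(0)                         # a) initialization
replay = []
for episode in range(3):
    state = random.uniform(-1.0, 1.0)  # b) acquire initial state
    for t in range(20):
        action = actor(state)
        next_state, reward = env_step(state, action)
        replay.append((state, action, reward, next_state))
        update_critic(replay[-32:])
        update_actor(replay[-32:])
        state = next_state             # f) repeat until termination
```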
Example 2
The embodiment aims to provide a control method capable of intelligently adjusting the temperature of an energy storage liquid cooling system in real time. The method adopts a Soft Actor-Critic (SAC) algorithm to train the model and comprises the following steps:
1. Collecting various operating parameters of the battery and the liquid cooling system, such as temperature, current, and voltage, from the energy storage system experiment platform, preprocessing the data, and determining the corresponding state and control action spaces.
2. Taking the preprocessed data and dividing it into a training set and a test set, with 80% of the original data as the training set and the remaining 20% as the test set.
3. Building a multi-layer perceptron (MLP) model that receives the current state and control action of the battery as input and outputs the predicted new state of the battery; modeling the lithium battery and its surrounding environment, performing dimension transformation on the data, and inputting the data into the MLP model.
4. Building a deep reinforcement learning training environment with the multi-layer perceptron (MLP) model, designing a SAC algorithm, and training the reinforcement learning model offline by simulating the interaction between the current state of the battery and the environmental conditions, so as to continuously optimize the control strategy.
The policy network (Actor) of the SAC algorithm generates a noisy action; the value of each action is estimated through the double Q network and updated using the target value function; the policy network is updated by maximizing the sum of the expected reward and the entropy term of the policy; and the liquid cooling strategy is dynamically adjusted according to the entropy level of the current policy.
5. According to the trained model's output, sending corresponding instructions to the upper computer to control and regulate the liquid cooling system.
In step 1, various parameters of the battery and the liquid cooling system may include a water pump running state, a unit water outlet temperature, a liquid supplementing water pump state, a compressor running state, a unit backwater temperature, a fault alarm code, a compressor running state, a unit water outlet pressure, an electric heating running state, a unit backwater pressure, a condensing fan running state, an outside environment temperature of the unit, a total voltage, a minimum monomer voltage, a maximum monomer temperature, a fan relay, a discharge permission control, a fire-fighting device state in a box, a total current, a maximum monomer voltage module number, a minimum monomer temperature module number, an operation indication, a maximum permitted charging current, a fire-fighting device state in a box, a battery state of charge (SOC), a minimum monomer voltage module number, a total positive relay, an alarm indication, a maximum permitted discharging current, an SOH, a maximum monomer temperature, a total negative relay, a fault indication, a maximum permitted charging power, a maximum monomer voltage, a minimum monomer temperature, a pre-charging relay, a charging permission control, and a maximum permitted discharging power.
The data preprocessing in the step 1 comprises data denoising processing, supplementing processing of missing data, correction processing of data with errors or beyond a permissible range and normalization processing of data.
In step 3, the multi-layer perceptron Model (MLP) is composed of an input layer, a hidden layer and an output layer, wherein the input layer is responsible for receiving data, the number of nodes is consistent with the characteristic number of the input data, each neuron in the hidden layer is connected with all nodes of the previous layer and performs weighted summation through weights, and the output layer adopts a linear activation function to output a regression result.
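The MLP structure described here can be sketched in plain Python; the layer sizes, weight values, and the tanh hidden activation are illustrative assumptions (the text specifies only weighted summation in the hidden layer and a linear output f(x)=x):

```python
# Sketch of the MLP environment simulator: an input layer sized to the
# feature count, a hidden layer performing weighted summation over all
# previous-layer nodes (tanh assumed), and a linear output layer f(x)=x.
# Weights here are fixed toy values, not trained parameters.
import math

def mlp_forward(x, W1, b1, W2, b2):
    # hidden layer: weighted sum of all previous-layer nodes, then tanh
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # output layer: linear activation f(x) = x (regression output)
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]

# 3 input features (e.g. SOC, temperature, coolant setting), 2 hidden, 1 output
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0]]
b2 = [0.05]
y = mlp_forward([0.8, 25.0, 18.0], W1, b1, W2, b2)
```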
In step 4, the deep reinforcement learning model is trained with the SAC algorithm. The SAC algorithm uses a double Q network to reduce estimation bias, strengthens exploration through entropy regularization, and automatically adjusts the entropy coefficient to balance exploration and exploitation. Specifically, it adds an entropy term H(π)=Es~ρ,a~π[-logπ(a|s)] to the policy optimization to increase the randomness of the policy, where H(π) is the entropy of the policy π, s is the state, a is the action, and ρ is the state distribution. Meanwhile, SAC dynamically adjusts the weight coefficient α of the entropy term by monitoring the entropy level of the current policy and minimizing the loss function L(α)=α·(-logπ(a|s)-Htarget), where -logπ(a|s) is the actual entropy of the current policy and Htarget is the target entropy value. Through these operations, the policy network can both select behaviors that are not currently optimal to obtain more information and make optimal decisions based on currently known information, i.e., balance the exploration-exploitation (Exploration-Exploitation) relationship.
In step 4, the SAC algorithm training operation comprises the following steps: algorithm initialization, action sampling, experience storage, sampling and calculation of the target Q value, updating the Critic double Q network, updating the Actor network, automatic adjustment of the entropy coefficient, and soft updating of the target Q network;
The algorithm initialization operation includes:
A\initializing the environment and setting the parameters of the Actor network (policy network) and the double Q network (value network);
B\initializing the target Q network, whose parameters are generally the same as those of the Critic double Q network:
θtarget←θmain
where θtarget is the parameter of the target Q network and θmain is the parameter of the Critic double Q network;
C\initializing an experience replay pool for storing experience samples generated by the interaction of the agent with the environment, including but not limited to the state s, the action a, the reward r, the next state s′, and the termination flag d;
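The experience replay pool can be sketched as a bounded buffer with uniform random mini-batch sampling; the capacity and batch size are illustrative assumptions:

```python
# Sketch of the experience replay pool: a bounded buffer of
# (s, a, r, s', d) tuples with uniform random mini-batch sampling.
# When full, the oldest samples are evicted automatically.
import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, d):
        self.buffer.append((s, a, r, s_next, d))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

random.seed(0)
pool = ReplayPool(capacity=100)
for i in range(150):                 # older samples evicted past capacity
    pool.store(s=i, a=0.0, r=-1.0, s_next=i + 1, d=False)
batch = pool.sample(32)              # uniform random mini-batch
```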
The sampling action operation includes:
A\given the current state s, a random initial vector is initialized from the diffusion model of the policy network. During the forward process, noise is gradually added to the data:
xt=√αt·xt-1+√(1-αt)·εt
where αt controls the noise intensity and εt is random noise.
In each reverse step t, the mean and variance of the denoising distribution are inferred using a deep neural network. According to the current state s and the time step t, the mean μθ(xt,s,t) and variance σθ²(xt,s,t) of the denoising distribution are output:
xt-1~N(μθ(xt,s,t),σθ²(xt,s,t))
A Gaussian distribution is generated from this mean and variance, and an action a is randomly sampled from it.
B\inputting the sampled action into the environment to obtain the next state s′, the reward r, and a flag d indicating whether to terminate;
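The sampling operation above (drawing a random initial vector and iteratively denoising it into an action) can be sketched as a toy one-dimensional example; the linear stand-in for the denoising network, the step count, and all numeric values are illustrative assumptions:

```python
# Toy sketch of reverse (denoising) action sampling: starting from a
# random vector, a network would output the mean and variance of the
# denoising distribution at each step t; here a fixed linear rule
# stands in for that network. Purely illustrative.
import math
import random

def denoise_mean_var(x_t, state, t):
    # stand-in for the deep network: shrink toward a state-dependent target
    mean = 0.8 * x_t + 0.2 * state
    var = 0.01 * (t / 10.0)           # variance shrinks at later steps
    return mean, var

random.seed(0)
state = 0.5                           # current battery state (toy scalar)
a = random.gauss(0.0, 1.0)            # random initial vector a_T ~ N(0, 1)
for t in range(10, 0, -1):            # iterate denoising steps T..1
    mean, var = denoise_mean_var(a, state, t)
    a = random.gauss(mean, math.sqrt(var))  # sample from N(mean, var)
# 'a' is the sampled action passed to the environment in step B.
```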
The store experience operation includes:
A\storing the current state s, the action a, the reward r, the next state s′, and the termination flag d in the experience replay pool;
The sampling and calculating the target Q value comprises the following steps:
A\randomly sampling a batch of data samples from the experience replay pool to calculate the target Q value and update the network parameters;
B\calculating the Q value of the next state with the target Q network:
Qtarget(s′,a′)=Target Q Network(s′,a′)
C\selecting the smallest Q value to avoid overestimation:
Qmin(s′,a′)=min(Q1(s′,a′),Q2(s′,a′))
D\calculating the target Q value according to the objective function of SAC, combining the reward and the discount factor γ:
y=r+γQmin(s′,a′)
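Steps A-D above can be sketched as follows; the toy Q functions, the reward value, and γ=0.99 are illustrative assumptions (the source formula y=r+γQmin(s′,a′) is used directly, with the stored termination flag d short-circuiting the bootstrap term):

```python
# Sketch of the target Q value calculation: take the minimum of the two
# Q estimates for the next state-action pair and form y = r + gamma * Qmin.
# The Q "networks" are toy stand-in functions.

def q1(s, a): return -abs(s) + a        # stand-in for target Q head 1
def q2(s, a): return -abs(s) + 0.5 * a  # stand-in for target Q head 2

def target_value(r, s_next, a_next, gamma=0.99, done=False):
    # C) minimum of the two heads avoids overestimation
    q_min = min(q1(s_next, a_next), q2(s_next, a_next))
    # D) bootstrap only if the episode did not terminate
    return r if done else r + gamma * q_min

y = target_value(r=-1.0, s_next=0.2, a_next=0.4)
```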
The updating Critic double Q network operation comprises the following steps:
A\calculating the difference between the Q values output by the Critic double Q network and the target Q value using a mean square error loss function:
L(θi)=E[(Qθi(s,a)-y)²], i=1,2
B\minimizing the loss function through a back propagation algorithm to update the parameters of the Critic double Q network:
θi←θi-η∇θiL(θi)
where η is the learning rate.
The updating the Actor network operation comprises the following steps:
A\based on the minimum Q value output by the Critic double Q network, calculating the policy loss of the Actor network:
L(πθ)=-E[Qmin(s,a)+α·H(πθ(a|s))]
where H(πθ(a|s)) is the entropy of the policy πθ, Qmin(s,a) is the minimum Q value for the current state and action, and α is the entropy coefficient. The goal of this loss function is to maximize the Q value of the action taken in the state while including an entropy regularization term to encourage randomness in the policy and avoid falling into local optima;
B\minimizing the strategy loss to update the parameters of the Actor network;
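The Actor objective of maximizing the minimum Q value plus the entropy bonus can be sketched with a discrete toy policy; the probability vectors, the Q value, and α=0.2 are illustrative assumptions:

```python
# Sketch of the Actor objective: minimize -(Qmin(s,a) + alpha * H(pi)),
# i.e. maximize the minimum Q value plus the entropy bonus.
# Discrete toy policies stand in for the continuous Actor.
import math

def policy_entropy(probs):
    """Entropy H(pi) = -sum p * log p of a discrete toy policy."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def actor_loss(q_min, probs, alpha=0.2):
    return -(q_min + alpha * policy_entropy(probs))

uniform = [0.25, 0.25, 0.25, 0.25]   # maximum-entropy policy over 4 actions
greedy = [0.97, 0.01, 0.01, 0.01]    # near-deterministic policy
# With equal Q values, the entropy bonus gives the more random policy
# a lower loss, which is what discourages premature convergence.
loss_uniform = actor_loss(1.0, uniform)
loss_greedy = actor_loss(1.0, greedy)
```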
the automatic entropy coefficient adjusting operation comprises the following steps:
A\calculating the loss of the entropy coefficient according to the difference between the actual entropy of the current policy and the target entropy:
L(α)=α·(-logπ(a|s)-Htarget)
B\calculating the gradient of the loss function L(α) with respect to the entropy coefficient α by back propagation, and updating the entropy coefficient using the Adam algorithm:
α←α-η∇αL(α)
where η is the learning rate.
The entropy coefficient can thus be dynamically adjusted so that the policy network can both select behaviors that are not currently optimal to obtain more information and make optimal decisions based on currently known information, i.e., balance the exploration-exploitation (Exploration-Exploitation) relationship.
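The entropy-coefficient adjustment above can be sketched as a single gradient step on L(α); a plain gradient update stands in for the Adam optimizer named in the text, and all numeric values are illustrative:

```python
# Sketch of automatic entropy-coefficient adjustment:
# L(alpha) = alpha * (-log pi(a|s) - H_target), whose gradient with
# respect to alpha is (-log pi - H_target). A plain gradient step
# stands in for Adam here.

def update_alpha(alpha, log_pi, h_target, eta=0.01):
    grad = -log_pi - h_target            # dL/dalpha
    return max(alpha - eta * grad, 0.0)  # keep the coefficient non-negative

alpha = 0.2
# Actual entropy (-log_pi = 0.5) below target H_target = 1.0:
# alpha increases, strengthening exploration.
alpha_up = update_alpha(alpha, log_pi=-0.5, h_target=1.0)
# Actual entropy (-log_pi = 1.5) above target: alpha decreases,
# favoring exploitation of known high-reward behavior.
alpha_down = update_alpha(alpha, log_pi=-1.5, h_target=1.0)
```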
The soft update target Q network operation includes:
A\performing a soft update on the parameters of the target Q network, where the update rule is as follows:
θtarget←τθmain+(1-τ)θtarget
where τ is a small constant between 0 and 1 (e.g., 0.005) that controls the rate at which the target Q network parameters θtarget track the Critic double Q network parameters θmain.
B\after the repeated training is complete, outputting the final policy network (Actor network), which can generate optimal actions in a given state.
Embodiments 1 and 2 above both aim to solve the decision problem of the liquid-cooled battery thermal management system in a complex environment by combining deep learning and reinforcement learning methods. Both employ a double Q network to reduce the likelihood of overestimating Q values and incorporate an entropy term in policy optimization to enhance exploration ability. In this way, a liquid cooling battery thermal management strategy capable of coping with complex environmental changes can be trained efficiently, and the strategy can dynamically adjust control actions according to different system states and environmental conditions so as to optimize the overall performance and service life of the battery pack.
The D2SAC algorithm of embodiment 1 employs a diffusion-model-based policy network that may be able to generate more diverse and random policies. In liquid cooling control, this means the algorithm can explore different cooling strategies more widely, including combinations of coolant flow, temperature set points, pump speed, and the like, to find a better cooling solution. The combination of the diffusion model and entropy regularization in embodiment 1 may give the D2SAC algorithm greater adaptivity in liquid cooling control. As system operating conditions change (e.g., load increases, ambient temperature fluctuations, changes in battery pack chemistry over time), the algorithm can automatically adapt to these changes and adjust the cooling strategy to keep the system operating in an optimal state as far as possible.
The SAC algorithm of embodiment 2 enhances exploration ability through entropy regularization with automatically adjusted entropy coefficients. In liquid cooling control, this means the method can dynamically adjust its exploration behavior according to the current system operating conditions and performance feedback. When the Q value tends to stabilize, the entropy term can break away from the current policy to explore unknown regions of the strategy space in search of a potentially better strategy; meanwhile, the model also tends to select behaviors with higher rewards, which is the exploitation of previously learned results, allowing the model to make full use of known information to improve performance.
The foregoing detailed description of the invention has been presented in conjunction with a specific embodiment, and it is not intended that the invention be limited to such detailed description. Several equivalent substitutions or obvious modifications will occur to those skilled in the art to which this invention pertains without departing from the spirit of the invention, and the same should be considered to be within the scope of this invention as defined in the appended claims.