WO2024053566A1

WO2024053566A1 - Information processing device, information processing method, and program

Info

Publication number: WO2024053566A1
Application number: PCT/JP2023/031965
Authority: WO
Inventors: 直輝山田; 広昂岡崎; 昌弘山田
Original assignee: 三菱重工業株式会社
Priority date: 2022-09-07
Filing date: 2023-08-31
Publication date: 2024-03-14
Also published as: JP2024037423A

Abstract

According to the present invention, a control model is generated that, on the basis of information on the state of a vehicle at a certain time point and a control input for the vehicle to stop at a target position in that state, infers information on the state of the vehicle at the next time point by using a probability distribution. An evaluation function for an evaluation value relating to the state of the vehicle is determined, and a motor torque instruction for the state of the vehicle at the next time point is generated by inputting, to a policy function, the policy parameter with which the evaluation value of the evaluation function is improved the most. The control model is updated by using the relationship between the control input and the vehicle state information resulting from trials on the vehicle using the motor torque instruction, and the relationship between the control input and the vehicle state information used for generating the control model.

Description

Information processing device, information processing method, program

The present disclosure relates to an information processing device, an information processing method, and a program.
This application claims priority to Japanese Patent Application No. 2022-142287 filed in Japan on September 7, 2022, the contents of which are incorporated herein.

There is a control device for a vehicle that drives wheels by rotating a motor and travels on a track. This control device controls the motor by outputting a motor torque command value according to the current state of the vehicle. Further, a vehicle equipped with the control device brakes the vehicle using regenerative braking of the motor. In addition to the regenerative brake, the control device may perform braking using an air brake (mechanical brake) provided on the vehicle in parallel.

Patent Document 1 and Patent Document 2 disclose techniques that use the above-mentioned motor control function and regenerative brake for braking a vehicle. More specifically, Patent Document 1 discloses a technique for reducing the occurrence of regeneration failure and improving stopping accuracy. Moreover, Patent Document 2 discloses a technique of predicting regeneration failure and calculating a command value of regenerative braking force to obtain a braking force corresponding to the difference between a target braking force and a mechanical braking force.

Patent Document 1: Japanese Patent Application Publication No. 2001-204102
Patent Document 2: JP2017-99172A

In the technology of braking a vehicle using regenerative braking as described above, there is a need for a technology that can automatically calculate a motor command value for braking a vehicle using regenerative braking of the motor.

Therefore, the purpose of this disclosure is to provide an information processing device, an information processing method, and a program that solve the above-mentioned problems.

According to one aspect of the present disclosure, an information processing device generates a control model that, based on information on the state of a vehicle at a certain time and a control input for the vehicle indicating a motor torque command for the vehicle to stop at a target position in that state, estimates information on the state of the vehicle at the next time using a probability distribution. The control model is generated using the relationship between the state information and the control input in braking control of the vehicle performed in the past, and the state of the vehicle at the next time obtained from that relationship. The information processing device determines an evaluation function of an evaluation value regarding the state of the vehicle, in which the evaluation value worsens as the distance to the target position increases, and inputs the policy parameter that most improves the evaluation value into a policy function to generate a motor torque command for the vehicle in the state of the vehicle at the next time. The information processing device updates the control model using the relationship between the vehicle state information and the control input resulting from a trial based on the generated motor torque command, and the relationship between the vehicle state information and the control input used to generate the control model.

According to one aspect of the present disclosure, an information processing method includes: generating a control model that, based on information on the state of a vehicle at a certain time and a control input for the vehicle indicating a motor torque command for the vehicle to stop at a target position in that state, estimates information on the state of the vehicle at the next time using a probability distribution, the control model being generated using the relationship between the state information and the control input in braking control of the vehicle performed in the past and the state of the vehicle at the next time obtained from that relationship; determining an evaluation function of an evaluation value regarding the state of the vehicle, in which the evaluation value worsens as the distance to the target position increases, and inputting the policy parameter that most improves the evaluation value into a policy function to generate a motor torque command for the vehicle in the state of the vehicle at the next time; and updating the control model using the relationship between the vehicle state information and the control input resulting from a trial based on the generated motor torque command, and the relationship between the vehicle state information and the control input used to generate the control model.

According to one aspect of the present disclosure, a program causes a computer of an information processing device to function as: means for generating a control model that, based on information on the state of a vehicle at a certain time and a control input indicating a motor torque command for the vehicle to stop at a target position in that state, estimates information on the state of the vehicle at the next time using a probability distribution, the control model being generated using the relationship between the state information and the control input in braking control of the vehicle performed in the past and the state of the vehicle at the next time obtained from that relationship; means for determining an evaluation function of an evaluation value regarding the state of the vehicle, in which the evaluation value worsens as the distance to the target position increases, and inputting into a policy function the policy parameter that most improves the evaluation value to generate a motor torque command for the vehicle in the state of the vehicle at the next time; and means for updating the control model using the relationship between the vehicle state information and the control input resulting from a trial based on the motor torque command, and the relationship between the vehicle state information and the control input used to generate the control model.

According to the present disclosure, it is possible to automatically calculate a motor command value for braking a vehicle using regenerative braking of the motor.

FIG. 1 is a schematic diagram of a vehicle control system including a vehicle control device and a server device according to the present embodiment.
FIG. 2 is a block diagram showing a control mechanism including a vehicle control device according to the present embodiment.
FIG. 3 is a functional block diagram of a server device according to the present embodiment.
FIG. 4 is a diagram showing an overview of processing of the server device according to the present embodiment.
FIG. 5 is a diagram showing a processing flow of the server device according to the present embodiment.
FIG. 6 is a hardware configuration diagram of a server device according to the present embodiment.

Hereinafter, a vehicle control device and a server device according to an embodiment of the present disclosure will be described with reference to the drawings.
FIG. 1 is a schematic diagram of a vehicle control system including a vehicle control device and a server device according to this embodiment.
FIG. 2 is a block diagram showing the configuration of a control mechanism including a vehicle control device according to this embodiment.
As shown in this figure, the vehicle 50 includes, as part of an example control mechanism, a vehicle control device 1, an inverter 2, and a motor 3. The vehicle control device 1 outputs a torque command value Tri _(t) according to the state of the vehicle to control the motor 3. The inverter 2 outputs a current to the motor 3 according to the torque command value Tri _(t) . The motor 3 is driven by the current based on the torque command value Tri _(t) . The vehicle control device 1 is communicatively connected to a server device 10, which is one aspect of an information processing device. The vehicle control device 1 acquires state information indicating the state of the vehicle 50, including the self-position px _(t) of the vehicle 50, the speed v _(t) of the vehicle 50, the pressure pa _(t) of the air spring that suppresses shaking between the bogie and the passenger car of the vehicle 50, the motor voltage V _(t) , and the torque output Tro _(t) of the motor 3. During vehicle operation, the vehicle control device 1 uses the state information and the control model to calculate a torque command value Tri _(t) , which is a control input, and outputs it to the inverter 2. Thereby, the vehicle control device 1 controls the vehicle based on the torque command value Tri _(t) according to the state of the vehicle. Further, the vehicle control device 1 outputs the acquired state information and the torque command value Tri _(t) calculated based on that state information to the server device 10. The server device 10 acquires and stores the information.

FIG. 3 is a functional block diagram of the server device.
The server device 10 performs the functions of the learning unit 12, the policy evaluation unit 13, and the policy improvement unit 14 by running a program stored in advance. The server device 10 also includes a storage unit 11 such as a database.
The storage unit 11 stores the relationship between state information indicating the state of the vehicle 50, including the self-position px _(t) of the vehicle 50, the speed v _(t) of the vehicle 50, the pressure pa _(t) of the air spring that suppresses shaking between the bogie and the passenger car of the vehicle 50, the motor voltage V _(t) , and the torque output Tro _(t) of the motor 3, and the torque command value Tri _(t) in the state of the vehicle indicated by that state information. This stored information is transmitted from the vehicle control device 1 of the vehicle 50 and recorded.
Based on initial data indicating information on the state of the vehicle 50 at time t and a control input for the vehicle 50 indicating a motor torque command for the vehicle 50 to stop at the target position in that state, the learning unit 12 generates a control model that estimates information on the state of the vehicle 50 at time t+1 using a probability distribution.
The policy evaluation unit 13 evaluates the policy using the evaluation function J ^π (θ).
The policy improvement unit 14 searches for a parameter θ that makes the evaluation function J ^π (θ) small. The policy is updated by the policy improvement unit 14 updating the value of the parameter θ.

FIG. 4 is a diagram showing an outline of processing of the server device.
The server device 10 is equipped with functions such as PILCO (Probabilistic Inference for Learning Control), which is one type of model-based reinforcement learning, and performs the following processing.
(1) Model learning: Based on initial data indicating information on the state of the vehicle 50 at time t and a control input for the vehicle 50 indicating a motor torque command for the vehicle 50 to stop at the target position in that state, the server device 10 generates a control model that estimates information on the state of the vehicle 50 at the next time t+1 using a probability distribution.

(2) Evaluation, improvement, and trial of the policy: The server device 10 performs optimization calculations using an evaluation function of an evaluation value regarding the state of the vehicle 50, in which the evaluation value worsens as the distance to the target position increases. The server device 10 sets the policy parameter that improves the evaluation value most in the policy function, inputs the state information into the policy function, and generates a motor torque command for the state of the vehicle 50 at the next time. The server device 10 instructs the vehicle control device 1 to perform a trial operation of the vehicle 50 based on the generated motor torque command.

(3) Updating the model using trial results: The server device 10 updates the control model using the relationship between the information on the state of the vehicle 50 and the control input resulting from the trial in the vehicle control device 1, and the relationship between the information on the state of the vehicle 50 and the control input used to generate the control model.

FIG. 5 is a diagram showing the processing flow of the server device.
(Generation of control model)
First, the learning unit 12 generates a control model using a machine learning method (step S101). The server device 10 stores in advance, in large quantities in the storage unit 11 or the like, records that link the value of each state indicated by the state information during control of the vehicle 50, the torque command value Tri _(t) output as the control input on the vehicle 50 side in that state, and the information on the state of the vehicle 50 at the next time when the vehicle 50 is driven under that relationship. This torque command value Tri _(t) is information obtained when the driver of the vehicle 50 or the like performs control so that regeneration failure does not occur when stopping at the target position. The storage unit 11 stores, in association with each other, the relationship among the state information x _(t) , the torque command value Tri _(t) which is the control input, and the state information x _(t+1) obtained when the vehicle 50 is controlled using the torque command value Tri _(t) , and a flag indicating whether or not regeneration failure occurred in that relationship (initial data). The learning unit 12 learns the relationship among the state information x _(t) , the state information x _(t+1) , and the torque command value Tri _(t) , which is the control input, together with the flag indicating whether or not regeneration failure occurred in that relationship, using a method such as Gaussian process regression, and generates a control model. The control model is a learning model that, based on the state information of the vehicle 50 at time t and the torque command value Tri _(t) of the vehicle 50 suitable for stopping at the target position without regeneration failure in each state indicated by the state information, estimates information on the state of the vehicle 50 at the next time t+1 using a probability distribution. The control model is shown in equation (1). The control model is an example of a dynamics model.
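Equations (1) to (3) are not reproduced in this text. A plausible reconstruction from the surrounding definitions, assuming the additive-noise dynamics form usually used with PILCO, is given below. The function symbol f and the ordering of the state vector are assumptions.

```latex
\begin{align}
x_{t+1} &= f(x_t, u_t) + \omega, \qquad \omega \sim \mathcal{N}(0, \Sigma_\omega) \tag{1}\\
x_t &= \begin{bmatrix} px_t & v_t & pa_t & V_t & Tro_t \end{bmatrix}^{\top} \tag{2}\\
u_t &= Tri_t \tag{3}
\end{align}
```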

In equation (1), x _(t) is the state information at time t and u _(t) is the torque command value Tri _(t) , which is the control input at time t; they are given by equation (2) and equation (3), respectively. ω indicates noise, and x _(t+1) is the state information at time t+1. Further, N(0, Σ _ω ) represents a Gaussian distribution with a mean of 0 and a covariance matrix Σ _ω , and the noise ω is determined stochastically according to this Gaussian distribution. As shown in equation (1), the control model allows the distribution of state information at the next time t+1 to be estimated based on the state information x _(t) at the current time t and the control input u _(t) .
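As a rough illustration of how a probabilistic one-step model of this kind can be learned, the following Python sketch fits a Gaussian process regressor that maps (x_t, u_t) to the change in one state component and returns a predictive mean and variance. The class name GPDynamicsModel, the kernel hyperparameters, and the toy data (speed change proportional to torque) are assumptions for illustration and are not taken from this disclosure.

```python
import numpy as np


def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


class GPDynamicsModel:
    """Hypothetical GP model of one component of x_{t+1} - x_t given (x_t, u_t)."""

    def __init__(self, length_scale=1.0, signal_var=1.0, noise_var=1e-4):
        self.length_scale = length_scale
        self.signal_var = signal_var
        self.noise_var = noise_var

    def fit(self, inputs, targets):
        self.inputs = np.asarray(inputs, dtype=float)
        targets = np.asarray(targets, dtype=float)
        gram = rbf_kernel(self.inputs, self.inputs, self.length_scale, self.signal_var)
        gram += self.noise_var * np.eye(len(self.inputs))
        # Cholesky factor is reused for both the mean and the variance.
        self.chol = np.linalg.cholesky(gram)
        self.alpha = np.linalg.solve(self.chol.T, np.linalg.solve(self.chol, targets))
        return self

    def predict(self, query):
        """Return predictive mean and variance for each query row."""
        query = np.atleast_2d(np.asarray(query, dtype=float))
        k_star = rbf_kernel(query, self.inputs, self.length_scale, self.signal_var)
        mean = k_star @ self.alpha
        v = np.linalg.solve(self.chol, k_star.T)
        var = self.signal_var + self.noise_var - np.sum(v**2, axis=0)
        return mean, var


# Toy data (assumed): inputs are (speed, torque); the speed change is 0.5 * torque.
speeds = np.linspace(0.0, 10.0, 6)
torques = np.linspace(-1.0, 0.0, 5)
grid = np.array([[v, u] for v in speeds for u in torques])
delta_speed = 0.5 * grid[:, 1]

model = GPDynamicsModel().fit(grid, delta_speed)
mean_near, var_near = model.predict([4.0, -0.5])   # inside the training data
mean_far, var_far = model.predict([100.0, 0.0])    # far from any training data
```

The predictive variance grows for inputs far from the recorded data, which is the property that lets the policy evaluation account for model uncertainty.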

The learning unit 12 may learn the control model using the relationship between the value of each state indicated by state information acquired in the past under conditions in which regeneration failure is unlikely to occur and the torque command value Tri _(t) output as the control input on the vehicle side in that state. Alternatively, the learning unit 12 may learn the control model using the relationship between the value of each state indicated by state information acquired in the past under conditions in which regeneration failure is likely to occur and the torque command value Tri _(t) . The condition under which regeneration failure is unlikely to occur is the case where the voltage of the power system is lower than a predetermined threshold value above which regeneration failure is likely to occur. The condition under which regeneration failure is likely to occur is the case where the voltage of the power system is higher than that threshold value. In this way, by learning the control model using state information acquired under conditions where regeneration failure is likely to occur or under conditions where it is unlikely to occur, a control model can be generated that outputs a torque command value Tri _(t) enabling accurate control under either condition.

(Policy evaluation)
The policy evaluation unit 13 determines a policy parameter θ that reduces the value of the evaluation function J ^π (θ) (step S102). In this process, the policy evaluation unit 13 arbitrarily sets the initial value of the parameter θ. The evaluation function J ^π (θ) is shown in equation (4).

In equation (4), c(x _(t) ) is expressed by equation (5) and indicates the evaluation value of state information x _(t) at time t. When time t is used as a reference time, H indicates an arbitrarily set timing after that time. E indicates the expected value of the evaluation value c(x _(t) ). Note that in equation (5), σ _c ² indicates the variance of the evaluation value c.

The evaluation value c(x _(t) ) approaches 0 as the position of the vehicle 50 at time t comes closer to the target position x _target on the track, and approaches 1 as the position moves farther away. In equation (4), the state distribution p(x _(t) ) = N(μ _(t) , Σ _(t) ) is a Gaussian distribution with mean μ _(t) and covariance matrix Σ _(t) .
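Equations (4) and (5) are not reproduced in this text. A reconstruction consistent with the surrounding definitions, assuming the saturating cost commonly used with PILCO, is given below. The exact form of c is an assumption.

```latex
\begin{align}
J^{\pi}(\theta) &= \sum_{t=0}^{H} \mathbb{E}_{x_t}\!\left[ c(x_t) \right] \tag{4}\\
c(x_t) &= 1 - \exp\!\left( -\frac{\lVert x_t - x_{\mathrm{target}} \rVert^{2}}{2\sigma_c^{2}} \right) \tag{5}
\end{align}
```

Under this assumed form, c grows with the distance from the target position and stays between 0 and 1, so searching for a small J corresponds to stopping close to the target.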

The policy evaluation unit 13 samples the initial state information x ₍₀₎ according to a normal distribution N (μ ₍₀₎ , Σ ₍₀₎ ). The policy evaluation unit 13 acquires the state information x _(t) at each time t, calculates the expected value E of the evaluation value c(x _(t) ), and calculates the evaluation function J ^π (θ) shown in equation (4) by integrating the expected values E of the evaluation values c(x _(t) ) over the times.
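The calculation above can be sketched for a one-dimensional position as follows. The sketch assumes a saturating cost of the form 1 - exp(-(x - x_target)^2 / (2 σ_c^2)), for which the expected value under a Gaussian state distribution has a closed form. The function names and the sample numbers are hypothetical.

```python
import math


def expected_saturating_cost(mean, var, target, sigma_c):
    """E[1 - exp(-(x - target)^2 / (2 sigma_c^2))] for scalar x ~ N(mean, var)."""
    total_var = sigma_c**2 + var
    scale = sigma_c / math.sqrt(total_var)
    return 1.0 - scale * math.exp(-0.5 * (mean - target) ** 2 / total_var)


def evaluation_function(means, variances, target, sigma_c):
    """Sum of the expected costs over the predicted state distributions."""
    return sum(
        expected_saturating_cost(m, v, target, sigma_c)
        for m, v in zip(means, variances)
    )


# Assumed predicted position distributions while approaching a stop target at 100 m.
means = [60.0, 80.0, 95.0, 100.0]
variances = [0.5, 1.0, 1.5, 2.0]
J = evaluation_function(means, variances, target=100.0, sigma_c=5.0)
```

Because the expectation is available in closed form, the evaluation function can be differentiated with respect to the mean and covariance of each state distribution instead of being estimated by sampling.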

(Policy improvement)
The policy improvement unit 14 has the function of an RBF (Radial Basis Function) controller, for example, and performs the following processing. Note that the RBF controller is a nonlinear controller having a network structure of a neural network with a Gaussian function in the intermediate layer. The policy improvement unit 14 searches for and updates the policy parameter θ that minimizes the evaluation function J ^π (θ) calculated by the policy evaluation unit 13 (step S103). In this process, the policy improvement unit 14 calculates a policy gradient from the evaluation function J ^π (θ), and performs optimization calculation using the policy parameter θ that constitutes the policy as a solution search target based on the policy gradient. The policy gradient of the evaluation function J ^π (θ) can be calculated using equation (7). The policy improvement unit 14 searches for the policy parameter θ using a gradient method, such as backpropagation, in the direction in which the value of the policy gradient becomes the smallest.
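A minimal sketch of an RBF policy function π(x, θ) is shown below, with θ taken to be the basis centers, widths, and weights. The parameter layout, the two-component state (position, speed), and the numeric values are assumptions; the disclosure does not specify them.

```python
import math


def rbf_policy(state, centers, widths, weights):
    """Weighted sum of Gaussian basis functions evaluated at the given state."""
    output = 0.0
    for center, weight in zip(centers, weights):
        sq = sum(((s - c) / w) ** 2 for s, c, w in zip(state, center, widths))
        output += weight * math.exp(-0.5 * sq)
    return output


# Assumed policy parameters theta for a (position [m], speed [m/s]) state.
centers = [(0.0, 10.0), (50.0, 5.0), (100.0, 0.0)]
widths = (20.0, 3.0)
weights = [-0.2, -0.6, -1.0]   # negative values represent braking torque

torque_command = rbf_policy((50.0, 5.0), centers, widths, weights)
```

Each basis function responds most strongly near its center, so the weights determine how much braking torque is requested in each region of the state space.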

The policy improvement unit 14 may calculate, for the expected value E of the evaluation value c(x _(t) ) at each time, the partial derivatives with respect to the mean and the covariance matrix of the state distribution p(x _(t) ) = N(μ _(t) , Σ _(t) ), find the state x _(t) in which the evaluation value c(x _(t) ) is small, and then calculate the policy function π(x _(t) , θ) using an optimization method.

Equation (8) shows the partial derivative of the state distribution p(x _(t) )=N(μ _(t) , Σ _(t) ) with respect to the average μ _(t) . Further, the partial derivative of the state distribution p(x _(t) ) = N (μ _(t) , Σ _(t) ) with respect to the covariance matrix Σ _(t) is shown in equation (9).

Here, formula (10) is satisfied in formula (8) and formula (9). Also, in equation (9), I represents a unit matrix.

Also, T ^-1 is the matrix whose diagonal components are given by equation (11).

(Trial)
The policy improvement unit 14 inputs the state information x _(t) into the policy function π(x _(t) , θ) using the optimized policy parameter θ, and calculates the control input u _(t+1) as shown in equation (12) (step S104).

The policy improvement unit 14 outputs the torque command value Tri _(t+1) indicated by the calculated control input u _(t+1) to the vehicle control device 1 (step S105). The vehicle control device 1 outputs the torque command value Tri _(t+1) to the inverter 2 and thereby performs a trial of controlling the motor 3. As a result, the vehicle 50 operates, and state information x _(t+1) at the next time can be observed. The vehicle control device 1 transmits the relationship among the state information x _(t) and x _(t+1) observed at that time and the control input u _(t) , together with information on whether regeneration failure occurred, to the server device 10, and the server device 10 links this information and records it in the storage unit 11. To determine the presence or absence of regeneration failure, the vehicle control device 1 may calculate the difference between the torque command value Tri _(t) and the torque output Tro _(t) and determine that regeneration failure has occurred if this difference is greater than or equal to a predetermined threshold value, and that it has not occurred if the difference is less than the threshold value. This recorded information is used to update the control model.
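The threshold check on the gap between commanded and delivered torque can be sketched as below. Taking the absolute value of the difference, the record layout, and the numbers are assumptions made for illustration.

```python
def regeneration_failed(torque_command, torque_output, threshold):
    """True when the gap between commanded and delivered torque reaches the threshold."""
    return abs(torque_command - torque_output) >= threshold


def make_record(x_t, u_t, x_next, torque_output, threshold):
    """Bundle one transition with its regeneration-failure flag (hypothetical layout)."""
    return {
        "x_t": x_t,
        "u_t": u_t,
        "x_next": x_next,
        "regen_failed": regeneration_failed(u_t, torque_output, threshold),
    }


# Assumed values: torque in N*m, threshold of 5 N*m.
record = make_record(
    x_t=(40.0, 6.0), u_t=-100.0, x_next=(46.0, 5.0),
    torque_output=-60.0, threshold=5.0,
)
```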

(Control model update)
The learning unit 12 performs iterative learning, using a method such as Gaussian process regression, on the state information x _(t) and control input u _(t) used in the initial control model together with the state information x _(t) and control input u _(t) newly recorded after optimization by the policy improvement unit 14, and updates the control model (step S106). Note that the state information x _(t) and control input u _(t) newly recorded after optimization by the policy improvement unit 14 are also values corresponding to conditions where regeneration failure is unlikely to occur or conditions where regeneration failure is likely to occur. The learning unit 12 may repeatedly learn the state information x _(t) and control input u _(t) under each environment and update the control model by using a method such as Gaussian process regression. The server device 10 uses the updated control model to calculate a torque command value Tri _(t) , which is the control input u _(t) according to the state information x _(t) , and outputs it to the vehicle control device 1. The server device 10 determines whether to terminate the process (step S107), and repeats the processes from step S102 to step S106 until an instruction to terminate is received.
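Steps S101 to S107 can be outlined as the loop below. This is only a structural sketch on a hypothetical one-dimensional plant: a least-squares gain estimate stands in for Gaussian process regression, a grid search stands in for the gradient-based search, and the policy is a constant braking torque rather than an RBF controller. All names and constants are assumptions.

```python
import math
import random

DT, TARGET, SIGMA_C, HORIZON = 1.0, 50.0, 5.0, 20
K_TRUE = 0.5  # hidden plant gain (speed change per unit torque); unknown to the learner


def plant_step(pos, speed, torque):
    """Stand-in for one control period of the real vehicle."""
    return pos + speed * DT, max(speed + K_TRUE * torque * DT, 0.0)


def fit_model(records):
    """S101/S106: least-squares gain from (speed, torque, next_speed) records."""
    usable = [(v, u, v_next) for v, u, v_next in records if v_next > 0.0]
    num = sum(u * (v_next - v) for v, u, v_next in usable)
    den = sum(u * u * DT for v, u, v_next in usable)
    return num / den


def cost(pos):
    """Assumed saturating cost: 0 at the target, close to 1 far from it."""
    return 1.0 - math.exp(-((pos - TARGET) ** 2) / (2.0 * SIGMA_C**2))


def evaluate(theta, gain, pos=0.0, speed=10.0):
    """S102: roll the learned model forward and sum the costs over the horizon."""
    total = 0.0
    for _ in range(HORIZON):
        pos, speed = pos + speed * DT, max(speed - gain * theta * DT, 0.0)
        total += cost(pos)
    return total


def run_trial(theta, pos=0.0, speed=10.0):
    """S104/S105: apply the torque command -theta on the plant and log transitions."""
    records = []
    for _ in range(HORIZON):
        next_pos, next_speed = plant_step(pos, speed, -theta)
        records.append((speed, -theta, next_speed))
        pos, speed = next_pos, next_speed
    return pos, records


# Initial data: a few braking steps with random torque (assumed range).
rng = random.Random(0)
records, speed = [], 10.0
for _ in range(5):
    u = -rng.uniform(1.0, 2.0)
    _, next_speed = plant_step(0.0, speed, u)
    records.append((speed, u, next_speed))
    speed = next_speed

candidates = [0.5 + 0.1 * i for i in range(36)]  # candidate policy parameters
for _ in range(3):  # S107: a fixed iteration count stands in for the stop instruction
    gain = fit_model(records)                                    # S101 / S106
    theta = min(candidates, key=lambda th: evaluate(th, gain))   # S102 / S103
    stop_pos, new_records = run_trial(theta)                     # S104 / S105
    records.extend(new_records)
```

The loop keeps both the initial records and the trial records, so each model update uses all the data gathered so far.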

By repeating the processes of the policy evaluation unit 13 and policy improvement unit 14 and updating the control model of the learning unit 12 as described above, it is possible to optimize the control model. According to such processing, it is possible to automatically calculate appropriate control inputs (run curves) at each position of the vehicle 50 that can stop the vehicle 50 at the target stop position without causing regeneration failure. In addition, since regenerative braking can be used to control the braking of the vehicle 50 using such appropriate control inputs, the use of mechanical brakes is reduced, the wear of the mechanical brakes per unit period is reduced, and the cost related to brake maintenance can be reduced. Further, according to the above-described process, a method for specifying an appropriate control input (run curve) at each position of the vehicle 50 that allows the vehicle 50 to stop at the target stop position without causing regeneration failure can be acquired through a small number of trial runs of the vehicle 50.

Note that in the initial data used to generate the control model, the relationship between control input and information on the state of the vehicle 50 when it is stopped with random acceleration added to a constant deceleration may be used. By using such a variety of initial data, it becomes possible to learn a control model with a small amount of initial data.
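One way such initial data could be produced is sketched below: a stop is simulated with a constant deceleration to which a random acceleration is added at every step. The function name, the use of a commanded acceleration as a stand-in for the torque command, and all numeric values are assumptions.

```python
import random


def generate_initial_data(v0, base_decel, accel_noise_std, dt, seed=0, max_steps=1000):
    """Simulate one stop and return (x_t, u_t, x_next) records."""
    rng = random.Random(seed)
    pos, speed, records = 0.0, v0, []
    for _ in range(max_steps):
        if speed <= 0.0:
            break
        # Constant deceleration plus a random acceleration term.
        accel = -base_decel + rng.gauss(0.0, accel_noise_std)
        next_pos = pos + speed * dt
        next_speed = max(speed + accel * dt, 0.0)
        records.append(
            {"x_t": (pos, speed), "u_t": accel, "x_next": (next_pos, next_speed)}
        )
        pos, speed = next_pos, next_speed
    return records


initial_records = generate_initial_data(
    v0=10.0, base_decel=1.0, accel_noise_std=0.3, dt=1.0
)
```

The random term spreads the recorded control inputs around the nominal deceleration, so a single stop already covers a range of inputs.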

FIG. 6 is a hardware configuration diagram of the server device. As an example, as shown in this figure, the server device 10 may include hardware such as a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, a storage unit 104 such as an HDD or SSD, and a communication module 105.

In the server device 10 described above, each process described above is stored in a computer-readable recording medium in the form of a program, and the above-mentioned processes are performed by reading and executing this program by the computer. Here, the computer-readable recording medium refers to a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, and the like. Alternatively, this computer program may be distributed to a computer via a communication line, and the computer receiving the distribution may execute the program.

The above program may be for realizing some of the functions described above. Furthermore, it may be a so-called difference file (difference program) that can realize the above-mentioned functions in combination with a program already recorded in the computer system.

<Additional notes>
The above embodiment can be understood, for example, as follows.

(1) According to the first aspect, the information processing device (server device 10) generates a control model that, based on information on the state of the vehicle 50 at a certain time and a control input for the vehicle 50 indicating a motor torque command for the vehicle 50 to stop at the target position in that state, estimates information on the state of the vehicle 50 at the next time using a probability distribution, the control model being generated using the relationship between the state information and the control input in braking control of the vehicle 50 performed in the past and the state of the vehicle 50 at the next time obtained from that relationship,
determines an evaluation function of an evaluation value regarding the state of the vehicle, in which the evaluation value worsens as the distance to the target position increases, and inputs the policy parameter that most improves the evaluation value of the evaluation function into a policy function to generate a motor torque command for the vehicle 50 in the state of the vehicle at the next time, and
updates the control model using the relationship between information on the state of the vehicle 50 that is the result of a trial based on the generated motor torque command and the control input, and the relationship between information on the state of the vehicle 50 used to generate the control model and the control input.

According to such processing, it is possible to automatically calculate appropriate control inputs (run curves) at each position of the vehicle 50 that can stop the vehicle 50 at the target stop position without causing regeneration failure. In addition, since regenerative braking can be used to control the braking of the vehicle 50 using such appropriate control inputs, the use of mechanical brakes is reduced, the wear of the mechanical brakes per unit period is reduced, and the cost related to brake maintenance can be reduced. Further, according to such processing, a method for specifying an appropriate control input (run curve) at each position of the vehicle 50 that allows the vehicle 50 to stop at the target stop position without causing regeneration failure can be acquired through a small number of trial runs of the vehicle 50.

(2) According to the second aspect, in the information processing device (server device 10) according to the first aspect, the evaluation function is a function indicating an integral value of the expected values of the evaluation values regarding the state of the vehicle 50 at a plurality of times, from a certain reference time up to a set timing after the reference time.

(3) According to the third aspect, in the information processing device (server device 10) according to the first or second aspect, the information on the state of the vehicle 50 includes information on the position, speed, air spring pressure, motor voltage, and motor torque output of the vehicle 50.

(4) According to the fourth aspect, in the information processing device (server device 10) according to the second aspect, an arbitrary initial state of the vehicle 50 is sampled according to the policy function and a normal distribution, the evaluation function is calculated by integrating the expected values according to the state, and the policy parameter for which the evaluation function is the smallest is searched for.

(5) According to the fifth aspect, in the information processing device (server device 10) according to any one of the first to fourth aspects, the control model is generated using the state information acquired under conditions where regeneration failure is unlikely to occur and the state information acquired under conditions where regeneration failure is likely to occur.

According to such processing, it is possible to generate a control model that can output control inputs for accurate control both under conditions where regeneration failure is unlikely to occur and under conditions where regeneration failure is likely to occur.

(6) According to the sixth aspect, in the information processing device (server device 10) according to any one of the first to fourth aspects, the control model is generated using the relationship between the control input and the state information obtained when the vehicle 50 is driven with random acceleration added to a constant acceleration.

According to such processing, because the initial data is varied, the control model can be learned from a small amount of initial data.
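The data collection of the sixth aspect might look like the sketch below. The Gaussian form of the random acceleration, the function names, the numeric values, and the simple kinematic vehicle used to produce the next state are all assumptions for illustration; in practice the next state would come from the vehicle 50 itself.

```python
import random


def exploratory_commands(base_acceleration, noise_std, n_steps, seed=0):
    """Constant acceleration plus a random term (Gaussian noise is assumed)."""
    rng = random.Random(seed)
    return [base_acceleration + rng.gauss(0.0, noise_std) for _ in range(n_steps)]


def collect_initial_data(commands, position=0.0, speed=15.0, dt=0.1):
    """Record (state, control input, next state) tuples from a toy vehicle."""
    records = []
    for acceleration in commands:
        next_speed = max(0.0, speed + acceleration * dt)  # no reversing
        next_position = position + speed * dt
        records.append(((position, speed), acceleration, (next_position, next_speed)))
        position, speed = next_position, next_speed
    return records


# Hypothetical values: about -0.8 m/s^2 of braking with 0.2 m/s^2 of spread.
commands = exploratory_commands(-0.8, 0.2, 200)
records = collect_initial_data(commands)
```

Because each recorded control input differs slightly, the records cover a band of accelerations around the constant value rather than a single operating point, which is the variety the sixth aspect relies on.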

(7) According to the seventh aspect, the information processing method includes:
generating a control model that estimates, as a probability distribution, information on the state of the vehicle 50 at the next time based on information on the state of the vehicle 50 at a certain time and a control input for the vehicle 50 indicating a motor torque command for the vehicle 50 to stop at the target position in that state, the control model being generated using the relationship between the control input and the information on the state indicated by past braking control of the vehicle 50, and the state of the vehicle 50 at the next time obtained from that relationship;
determining an evaluation function of an evaluation value regarding the state of the vehicle, in which the evaluation value worsens as the distance to the target position increases, and inputting the policy parameter that most improves the evaluation value into a policy function to generate a motor torque command for the vehicle 50 in the state of the vehicle at the next time; and
updating the control model using the relationship between the control input and the information on the state of the vehicle 50 resulting from a trial based on the generated motor torque command, and the relationship between the control input and the information on the state of the vehicle 50 used to generate the control model.
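The three steps of the seventh aspect (generate the model, improve the policy, update the model from trial results together with the original data) can be sketched as a loop. This is a heavily simplified illustration, not the disclosed implementation: a linear-Gaussian fit on speed alone stands in for the probabilistic control model, position is advanced with plain kinematics, the policy is a clipped linear feedback, the search is over a fixed candidate grid, and `simulated_vehicle` replaces trials on a real vehicle. All names and constants are assumptions.

```python
import math
import random

DT, TARGET, HORIZON = 0.1, 0.0, 150  # hypothetical step [s], stop position [m], steps


def policy(position, speed, theta):
    """Hypothetical policy function: clipped linear feedback command."""
    return max(-1.0, min(1.0, -theta[0] * (position - TARGET) - theta[1] * speed))


def fit_control_model(data):
    """Fit next_speed - speed ~ Normal(gain * command, noise_std**2)."""
    gain = sum(u * (v2 - v) for v, u, v2 in data) / sum(u * u for _, u, _ in data)
    var = sum(((v2 - v) - gain * u) ** 2 for v, u, v2 in data) / len(data)
    return gain, math.sqrt(var)


def rollout(theta, speed_step, start=(-20.0, 5.0)):
    """Run the policy; speed_step(speed, command) supplies the next speed."""
    position, speed = start
    records, total_cost = [], 0.0
    for _ in range(HORIZON):
        u = policy(position, speed, theta)
        next_speed = speed_step(speed, u)
        records.append((speed, u, next_speed))
        position, speed = position + speed * DT, next_speed
        total_cost += (position - TARGET) ** 2  # worsens with distance to target
    return records, total_cost


def improve_policy(model, candidates, rng, n_samples=20):
    """Pick the candidate with the lowest average predicted cost under the model."""
    gain, noise_std = model

    def sampled_step(speed, command):
        return speed + gain * command + rng.gauss(0.0, noise_std)

    def predicted_cost(theta):
        return sum(rollout(theta, sampled_step)[1] for _ in range(n_samples)) / n_samples

    return min(candidates, key=predicted_cost)


def learn(initial_data, candidates, true_step, n_trials=3, seed=0):
    rng = random.Random(seed)
    data = list(initial_data)
    model = fit_control_model(data)                     # generate the control model
    theta = candidates[0]
    for _ in range(n_trials):
        theta = improve_policy(model, candidates, rng)  # search the policy parameter
        trial_records, _ = rollout(theta, true_step)    # trial with the commands
        data += trial_records                           # trial data plus original data
        model = fit_control_model(data)                 # update the control model
    return model, theta


_vehicle_rng = random.Random(1)


def simulated_vehicle(speed, command):
    """Hypothetical ground truth used here in place of a real vehicle."""
    return speed + 0.1 * command + _vehicle_rng.gauss(0.0, 0.005)


initial = [(5.0, u, simulated_vehicle(5.0, u)) for u in (-1.0, -0.5, 0.5, 1.0)]
CANDIDATES = [(kp, kd) for kp in (0.05, 0.2, 0.5) for kd in (0.2, 0.8, 1.5)]
model, theta = learn(initial, CANDIDATES, simulated_vehicle)
```

The point the sketch is meant to show is the last line of the loop body: the model is refitted on the accumulated data, so each update uses both the relationships observed in the trial and those already used to generate the model.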

(8) According to the eighth aspect, in the information processing method according to the seventh aspect, the evaluation function is a function indicating the integral value of the expected values of the evaluation value regarding the state of the vehicle 50 at a plurality of times, from a certain reference time to a set timing after the reference time.

(9) According to the ninth aspect, in the information processing method according to the seventh or eighth aspect, the information on the state of the vehicle 50 includes at least information on the position, speed, air spring pressure, motor voltage, and motor torque output of the vehicle 50.

(10) According to the tenth aspect, in the information processing method according to the eighth aspect, an arbitrary initial state of the vehicle 50 is sampled according to the policy function and a normal distribution, the evaluation function is calculated by integrating the expected values according to that state, and the policy parameter for which the evaluation function is the smallest is searched for.

(11) According to the eleventh aspect, in the information processing method according to any one of the seventh to tenth aspects, the control model is generated using the information on the state acquired under conditions where regeneration failure is unlikely to occur and the information on the state acquired under conditions where regeneration failure is likely to occur.

(12) According to the twelfth aspect, in the information processing method according to any one of the seventh to tenth aspects, the control model is generated using the relationship between the control input and information on the state of the vehicle 50 driven with random acceleration added to a constant acceleration.

(13) According to the thirteenth aspect, the program causes the computer of the information processing device to function as:
means for generating a control model that estimates, as a probability distribution, information on the state of the vehicle 50 at the next time based on information on the state of the vehicle 50 at a certain time and a control input for the vehicle 50 indicating a motor torque command for the vehicle 50 to stop at the target position in that state, the control model being generated using the relationship between the control input and the information on the state indicated by past braking control of the vehicle 50, and the state of the vehicle 50 at the next time obtained from that relationship;
means for determining an evaluation function of an evaluation value regarding the state of the vehicle, in which the evaluation value worsens as the distance to the target position increases, and inputting the policy parameter that most improves the value of the evaluation function into a policy function to generate a motor torque command for the vehicle 50 in the state of the vehicle at the next time; and
means for updating the control model using the relationship between the control input and the information on the state of the vehicle 50 resulting from a trial based on the generated motor torque command, and the relationship between the control input and the information on the state of the vehicle 50 used to generate the control model.

1...Vehicle control device 2...Inverter 3...Motor 11...Storage section 12...Learning section 13...Policy evaluation section 14...Policy improvement section

Claims

An information processing device that:
generates a control model that estimates, as a probability distribution, information on the state of a vehicle at the next time based on information on the state of the vehicle at a certain time and information indicating a control input for the vehicle, the control input indicating a motor torque command for the vehicle to stop at a target position in that state, the control model being generated using the relationship between the control input and the information on the state indicated by past braking control of the vehicle, and the state of the vehicle at the next time obtained from that relationship;
determines an evaluation function of an evaluation value regarding the state of the vehicle, in which the evaluation value worsens as the distance to the target position increases, and inputs the policy parameter that most improves the evaluation value into a policy function to generate a motor torque command for the vehicle in the state of the vehicle at the next time; and
updates the control model using the relationship between the control input and the information on the state of the vehicle resulting from a trial using the generated motor torque command, and the relationship between the control input and the information on the state of the vehicle used to generate the control model.
The information processing device according to claim 1, wherein the evaluation function is a function that indicates the integral value of the expected values of the evaluation value regarding the state of the vehicle at a plurality of times, from a certain reference time to a set timing after the reference time.
The information processing device according to claim 1 or 2, wherein the information on the state of the vehicle includes at least information on the position, speed, air spring pressure, motor voltage, and motor torque output of the vehicle.
The information processing device according to claim 2, wherein an arbitrary initial state of the vehicle is sampled according to the policy function and a normal distribution, the evaluation function is calculated by integrating the expected values according to that state, and the policy parameter for which the evaluation function is the smallest is searched for.
The information processing device according to claim 4, wherein the control model is generated using information on the state acquired under conditions where regeneration failure is unlikely to occur and information on the state acquired under conditions where regeneration failure is likely to occur.
The information processing device according to claim 4, wherein the control model is generated using a relationship between the control input and information on the state of the vehicle driven with random acceleration added to a constant acceleration.
An information processing method comprising:
generating a control model that estimates, as a probability distribution, information on the state of a vehicle at the next time based on information on the state of the vehicle at a certain time and information indicating a control input for the vehicle, the control input indicating a motor torque command for the vehicle to stop at a target position in that state, the control model being generated using the relationship between the control input and the information on the state indicated by past braking control of the vehicle, and the state of the vehicle at the next time obtained from that relationship;
determining an evaluation function of an evaluation value regarding the state of the vehicle, in which the evaluation value worsens as the distance to the target position increases, and inputting the policy parameter that most improves the evaluation value into a policy function to generate a motor torque command for the vehicle in the state of the vehicle at the next time; and
updating the control model using the relationship between the control input and the information on the state of the vehicle resulting from a trial using the generated motor torque command, and the relationship between the control input and the information on the state of the vehicle used to generate the control model.
The information processing method according to claim 7, wherein the evaluation function is a function that indicates the integral value of the expected values of the evaluation value regarding the state of the vehicle at a plurality of times, from a certain reference time to a set timing after the reference time.
The information processing method according to claim 7 or 8, wherein the information on the state of the vehicle includes at least information on the position, speed, air spring pressure, motor voltage, and motor torque output of the vehicle.
The information processing method according to claim 8, wherein an arbitrary initial state of the vehicle is sampled according to the policy function and a normal distribution, the evaluation function is calculated by integrating the expected values according to that state, and the policy parameter for which the evaluation function is the smallest is searched for.
The information processing method according to claim 10, wherein the control model is generated using information on the state acquired under conditions where regeneration failure is unlikely to occur and information on the state acquired under conditions where regeneration failure is likely to occur.
The information processing method according to claim 10, wherein the control model is generated using a relationship between the control input and information on the state of the vehicle driven with random acceleration added to a constant acceleration.
A program that causes a computer of an information processing device to function as:
means for generating a control model that estimates, as a probability distribution, information on the state of a vehicle at the next time based on information on the state of the vehicle at a certain time and information indicating a control input for the vehicle, the control input indicating a motor torque command for the vehicle to stop at a target position in that state, the control model being generated using the relationship between the control input and the information on the state indicated by past braking control of the vehicle, and the state of the vehicle at the next time obtained from that relationship;
means for determining an evaluation function of an evaluation value regarding the state of the vehicle, in which the evaluation value worsens as the distance to the target position increases, and inputting the policy parameter that most improves the evaluation value into a policy function to generate a motor torque command for the vehicle in the state of the vehicle at the next time; and
means for updating the control model using the relationship between the control input and the information on the state of the vehicle resulting from a trial using the generated motor torque command, and the relationship between the control input and the information on the state of the vehicle used to generate the control model.