WO2019120174A1 - Action control method and apparatus - Google Patents

Action control method and apparatus

Info

Publication number
WO2019120174A1
Authority
WO
WIPO (PCT)
Prior art keywords
state
dimension
fuzzy
dimensions
fuzzy subset
Prior art date
Application number
PCT/CN2018/121519
Other languages
English (en)
French (fr)
Inventor
钱俊
王新宇
陈晨
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP18890321.5A (EP3719603B1)
Publication of WO2019120174A1
Priority to US16/906,863 (US11449016B2)

Classifications

    • G05D1/0088: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots, characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
    • G06N3/08: Neural networks; learning methods
    • G05B13/0265: Adaptive control systems, electric; the criterion being a learning criterion
    • G05B13/0275: Adaptive control systems, electric; the criterion being a learning criterion using fuzzy logic only
    • G05B13/04: Adaptive control systems, electric, involving the use of models or simulators
    • G06F18/211: Pattern recognition; selection of the most significant subset of features
    • G06F18/2163: Pattern recognition; partitioning the feature space
    • G06F18/251: Pattern recognition; fusion techniques of input or preprocessed data
    • G06N3/043: Neural network architecture based on fuzzy logic, fuzzy membership or fuzzy inference, e.g. adaptive neuro-fuzzy inference systems [ANFIS]
    • G06N3/045: Neural network architecture; combinations of networks
    • G06N7/023: Computing arrangements based on fuzzy logic; learning or tuning the parameters of a fuzzy system

Definitions

  • The present disclosure relates to the field of artificial intelligence, and in particular, to an action control method and apparatus.
  • With the development of artificial intelligence, various artificial intelligence devices such as unmanned vehicles and robots have emerged, bringing great convenience to people's lives.
  • In practical applications, artificial intelligence methods are often used to provide intelligent decision-making for artificial intelligence devices, serving as their machine brain and controlling them to perform corresponding actions, such as controlling unmanned vehicles to travel on a road or controlling robots to move in a warehouse.
  • In the related art, a reinforcement learning algorithm is usually used to control the actions of an artificial intelligence device. Taking an unmanned vehicle as an example, the necessary elements of the reinforcement learning algorithm are first defined according to the actual application scenario: an N-dimensional state space S, an M-dimensional action space A, and a reward function R. Then, based on S, A, and R, a control model is trained by driving in a simulator environment or on a real road, the model outputting an M-dimensional discrete decision in A according to an input N-dimensional state in S. Afterwards, during actual driving, the unmanned vehicle collects the current N-dimensional state, inputs it into the control model, obtains the M-dimensional discrete decision output by the model, and performs the corresponding action based on that decision.
  • Because the output decision is discrete, it is difficult to ensure smooth control of the artificial intelligence device, resulting in poor smoothness of its actions.
  • the embodiments of the present disclosure provide an action control method and apparatus, which can solve the technical problem that it is difficult to perform smooth control on the artificial intelligence device in the related art.
  • the technical solution is as follows:
  • In a first aspect, an action control method is provided, the method comprising: acquiring states of N dimensions of an artificial intelligence device, where N is a positive integer greater than or equal to 1; obtaining a plurality of discrete decisions based on the activated fuzzy subset of the state of each of the N dimensions and a control model, where the activated fuzzy subset of a state refers to a fuzzy subset in which the membership degree of the state is not zero, each fuzzy subset includes a state interval corresponding to the same discrete decision within one dimension, the membership degree is used to indicate the degree to which the state belongs to the fuzzy subset, and the control model is configured to output a corresponding discrete decision according to the input state; weighting and summing the plurality of discrete decisions based on the membership degree between the state of each dimension and the activated fuzzy subset to obtain a continuous decision; and, based on the continuous decision, controlling the artificial intelligence device to perform a corresponding action.
  • In the method provided by this embodiment, a plurality of discrete decisions are weighted and summed based on the membership degree between the state of each dimension and the activated fuzzy subsets to obtain a continuous decision. Since the output decision is a continuous quantity, smooth control of the artificial intelligence device can be guaranteed, ensuring the smoothness of its actions.
  • Moreover, obtaining the continuous decision through membership degrees provides a way to reasonably turn discrete decisions into continuous ones, ensuring that the continuous decision varies with the trend of the state, and therefore that the continuous decision has high accuracy.
  • When the artificial intelligence device is an unmanned vehicle, this ensures smooth driving and improves passenger comfort.
  • In one possible design, the plurality of discrete decisions are weighted and summed based on the membership degrees between the state of each dimension and its activated fuzzy subsets to obtain the continuous decision (a sketch combining these steps is given at the end of this document).
  • In one possible design, obtaining a plurality of discrete decisions based on the activated fuzzy subset of the state of each of the N dimensions and the control model includes: combining the center values of the activated fuzzy subsets of the N dimensions into a plurality of intermediate states, each intermediate state including a center value in each of the N dimensions; and inputting the plurality of intermediate states into the control model respectively, to obtain a plurality of discrete decisions output by the control model.
  • In one possible design, before obtaining the plurality of discrete decisions based on the activated fuzzy subsets of the states of the N dimensions and the control model, the method further includes: taking, for each dimension, every fuzzy subset whose membership degree with the state of the dimension is not zero as an activated fuzzy subset of the dimension; or, selecting, from the plurality of fuzzy subsets of each dimension, the two fuzzy subsets whose center values lie on either side of the state of the dimension as the activated fuzzy subsets of the dimension.
  • In one possible design, before acquiring the states of the N dimensions of the artificial intelligence device, the method further includes: dividing the state space of each dimension into a plurality of state intervals; obtaining, based on the control model, a typical discrete decision of each of the plurality of state intervals; and merging adjacent state intervals corresponding to the same typical discrete decision into one fuzzy subset, to obtain at least one fuzzy subset of each dimension.
  • This design provides a way to automatically divide fuzzy subsets based on the control model obtained by reinforcement learning training, without relying on manual decisions, and is therefore extremely efficient. Further, over-segmentation can be used to divide each state space into a large number of narrow state intervals, so that the boundaries of the resulting fuzzy subsets are very accurate and the accuracy of the fuzzy subsets is high. Further, this way of dividing fuzzy subsets extends conveniently and quickly to high-dimensional state spaces, so it suits the complicated operations of practical applications and has strong utility.
  • In one possible design, obtaining, based on the control model, a typical discrete decision of each of the plurality of state intervals includes: acquiring a plurality of representative states of each state interval, each representative state including the center value of the state interval in its own dimension and an arbitrary state in each of the other dimensions; inputting the plurality of representative states into the control model respectively, to obtain a plurality of discrete decisions output by the control model; and selecting the discrete decision with the most occurrences among the plurality of discrete decisions as the typical discrete decision of the state interval.
  • In one possible design, the method further includes: calculating, for the state of each dimension, the membership function corresponding to each fuzzy subset of the dimension, to obtain the membership degree of the state with respect to each fuzzy subset.
  • In one possible design, before acquiring the states of the N dimensions of the artificial intelligence device, the method further includes: acquiring the membership function of each fuzzy subset according to a preset rule. The preset rule is: the membership function of each fuzzy subset takes the value 1 at the center value of the fuzzy subset, 0.5 at the boundary values of the fuzzy subset, and 0 at the center values of the two fuzzy subsets adjacent to it.
  • A membership function constructed in this way is highly interpretable and effective, and the construction steps are simple, which improves the efficiency of building membership functions.
  • In a second aspect, an action control apparatus is provided, comprising a plurality of functional modules configured to implement the action control method of the first aspect and any of its possible designs.
  • In a third aspect, an artificial intelligence device is provided, comprising a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the action control method of the first aspect and any of its possible designs.
  • In a fourth aspect, a computer-readable storage medium is provided, having stored therein at least one instruction that is loaded and executed by a processor to implement the action control method of the first aspect and any of its possible designs.
  • FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • FIG. 3 is a schematic structural diagram of an artificial intelligence device according to an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of an action control method according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of a membership function provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a membership function provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a membership function provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of an action control apparatus according to an embodiment of the present disclosure.
  • State space refers to the set of all possible states of the artificial intelligence device.
  • The state space can include N dimensions (N is a positive integer). For example, the dimensions of the state space can be a speed dimension, an angle dimension, a distance dimension, etc. The state of the artificial intelligence device at any moment can be represented by an N-dimensional vector in the state space.
  • Action space refers to the set of all actions that the artificial intelligence device can perform.
  • The action space can include M dimensions (M is a positive integer), for example a throttle dimension, a steering angle dimension, a brake dimension, etc. The action of the artificial intelligence device at any moment can be represented by an M-dimensional vector in the action space.
  • Reward function: a function that takes a state as input and a reward value as output. The larger the reward value, the more ideal the corresponding state; a negative reward value indicates that the corresponding state is not ideal.
  • Reinforcement learning: also known as evaluative learning, it refers to learning a mapping from the environment to behavior so as to maximize the reward value. In reinforcement learning, actions are evaluated by the reward value; during training, the control model must learn from its own experience, acquire knowledge in an action-evaluation environment, and continuously improve its actions to adapt to the environment.
  • Fuzzy subset: also called a fuzzy set, a set expressing an ambiguous concept. The membership relation between any state and a fuzzy subset is not an absolute yes or no, but is characterized by the size of the membership degree. In the embodiments of the present disclosure, a fuzzy subset refers to a state interval corresponding to the same discrete decision in one dimension; that is, the discrete decisions corresponding to all states in a fuzzy subset are the same.
  • Membership degree and membership function: if every element x in the universe U has a number A(x) ∈ [0,1] corresponding to it, then A is called a fuzzy set on U, and A(x) is called the membership degree of x to A. As x varies over U, A(x) is a function called the membership function of A. The closer A(x) is to 1, the higher the degree to which x belongs to A; the closer A(x) is to 0, the lower the degree to which x belongs to A. The membership function A(x), taking values in the interval [0,1], characterizes the extent to which x belongs to A. In the embodiments of the present disclosure, the membership degree indicates the degree to which a state belongs to a fuzzy subset: the greater the membership degree, the more strongly the state can be considered to belong to the fuzzy subset; the smaller the membership degree, the more weakly the state belongs to the fuzzy subset.
  • Activated fuzzy subset: when the membership degree between a state and a certain fuzzy subset is not 0, the fuzzy subset can be understood as being activated by the state, and is recorded as an activated fuzzy subset.
  • the implementation environment includes an artificial intelligence device 101 and a plurality of servers 102.
  • The artificial intelligence device 101 is connected to multiple servers 102 through a wireless or wired network. The artificial intelligence device 101 can be an unmanned vehicle or a robot, and each server 102 can be a single server, a server cluster composed of several servers, or a cloud computing service center.
  • During training, the states used by the artificial intelligence device may come from the servers 102, and the artificial intelligence device may perceive the state of each dimension to perform learning. For example, a server can provide the unmanned vehicle with its current location information and information about the road it is currently driving on, so that the unmanned vehicle knows its own position and the road conditions.
  • the state acquired by the artificial intelligence device may also come from the server, and the artificial intelligence device may determine a corresponding decision based on the acquired state and the control model to perform the corresponding action.
  • the server 102 may also have at least one database, such as a road traffic network database, a geographic information database, etc., for providing the state of the various dimensions to the artificial intelligence device 101.
  • the action control method provided by this embodiment can be applied to multiple practical application scenarios.
  • Taking automatic driving as an example, during driving the unmanned vehicle can input states of multiple dimensions, such as the current speed, the distance from the central axis of the road, and the current road traffic, into the control model and output a continuous decision. The continuous decision can be the steering angle, the acceleration, the throttle amount, the brake amount, etc.; continuous decisions ensure the smoothness of the unmanned vehicle's actions, avoid situations such as alternating high and low speed, and improve passenger comfort.
  • Moreover, the output continuous decision may be low-level data, that is, it can be refined to the angle of a left turn, the value of the acceleration, and so on, enabling high-precision control. The unmanned vehicle can thus drive straight along the road axis, change lanes, overtake, follow, and park, accomplishing the task of automatic driving.
  • The embodiments of the present disclosure can also be applied to robots, for example a warehouse robot carrying goods in places such as a warehouse or a construction site, a weeding robot weeding a lawn, a sweeping robot cleaning an office floor, and a harvesting robot working in a field. Taking the warehouse robot as an example, during operation the robot can input states such as the current speed, the current position, the distance from adjacent shelves, and the distance from the shelf on which the goods are placed into the control model, and output a continuous decision; the continuous decision can be picking up goods, raising an arm, placing goods in the storage basket, the steering angle, the acceleration, etc. The output continuous decision may likewise be low-level data, that is, refined to the speed of picking up an item, the angle of a left turn, the value of the acceleration, and so on, enabling high-precision control so that the robot can perform difficult operations, improving the robot's performance.
  • the artificial intelligence device includes: a receiver 301, a transmitter 302, a memory 303, and a processor 304.
  • The receiver 301, the transmitter 302, and the memory 303 are respectively connected to the processor 304.
  • the memory 303 stores program code
  • the processor 304 is configured to call the program code to perform operations performed by the artificial intelligence device in the following embodiments.
  • Also provided is a computer-readable storage medium, such as a memory comprising instructions executable by the processor in an electronic device to perform the action control method in the embodiments described below. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
  • The embodiment of the present disclosure provides an action control method, which mainly includes three parts. The first part is the reinforcement learning process; for details, see steps 401-402 below. The second part is the space fuzzification process; see steps 403-406 below, in which each dimension of the state space is divided into multiple fuzzy subsets. The third part is the defuzzification process; for details, see steps 407-410 below: in the online control process, a continuous decision is calculated based on the fuzzy subsets and membership functions to which the actual state belongs, and the corresponding action is then controlled. In this way, smooth control of the artificial intelligence device can be ensured, guaranteeing the smoothness of its actions.
  • the execution body of the method is an artificial intelligence device. As shown in FIG. 4, the method includes:
  • the artificial intelligence device acquires an N-dimensional state space S, an M-dimensional action space A and a reward function R, where N is a positive integer greater than or equal to 1, and M is also a positive integer greater than or equal to 1.
  • the training model is constructed based on the reinforcement learning algorithm.
  • the artificial intelligence device acquires the state space, the action space, and the reward function in advance.
  • the state space is represented by S
  • the action space is represented by A
  • the reward function is represented by R.
  • the step 401 may specifically include the following steps 1 to 3.
  • Step 1 The artificial intelligence device generates an N-dimensional state space S.
  • the state space of each dimension refers to the set of states of the corresponding dimensions of the artificial intelligence device, and the state space of each dimension can be designed according to the state that the artificial intelligence device can obtain during the running process. That is, it can be designed according to the state that the artificial intelligence device can collect in real time or the state obtained by the processing.
  • the state space of each dimension includes two boundaries, the left boundary can represent the minimum value of the state of the corresponding dimension, and the right boundary can represent the maximum value of the state of the corresponding dimension.
  • the minimum value of the state can be directly designed as the left boundary of the state space, and the maximum value of the state is directly designed as the right boundary of the state space.
  • the minimum and maximum values of the state can be normalized, and the normalized minimum and normalized maximum are designed as the left and right boundaries of the state space, respectively.
  • Taking the unmanned vehicle as an example, the state space S may include a state space of the θ dimension, a state space of the V dimension, a state space of the P dimension, a state space of the V-front dimension, a state space of the V-rear dimension, and so on.
  • the state space may include a state space of a speed dimension, a state space of a left foot dimension, a state space of a right foot dimension, and the like, which are not limited in this embodiment.
  • For the θ dimension, the boundaries can be -1 and 1, with -1 representing -180° and 1 representing 180°. The boundaries of the V-dimension state space can be 0 and 300, with 0 representing the minimum speed and 300 the maximum speed. The boundaries of the P-dimension state space can be -1 and 1, the boundary values being normalized by the road width; when P is greater than 1 or less than -1, the unmanned vehicle has already left the road.
  • The variables can be summarized as follows: θ is the angle between the central axis of the vehicle and the central axis of the road, ranging from -1 to 1; V is the vehicle's own speed, ranging from 0 to 400; P is the distance between the vehicle and the central axis of the road, ranging from -1 to 1.
  • the artificial intelligence device may acquire a state space generation instruction, and generate an N-dimensional state space based on the state space generation instruction.
  • the state space generation instruction indicates the number of dimensions of the state space, and may also indicate the name of the state space of each dimension, the boundary value of the state space of each dimension, and the artificial intelligence device may generate the number of dimensions in the instruction based on the state space, each The name of the state space of the dimension and the boundary value of the state space of each dimension generate an N-dimensional state space.
  • the artificial intelligence device can acquire the state space generation instruction when running the code defining the state space.
  • the code that defines the state space is written by the developer in advance according to actual needs, and is stored by the developer in the artificial intelligence device in advance.
  • Step 2 The artificial intelligence device generates an M-dimensional action space A.
  • the action space of each dimension refers to the set of actions of the corresponding dimension of the artificial intelligence device, and the action space of each dimension can be determined according to the action that the artificial intelligence device can perform during the actual running process.
  • the action space of each dimension includes two boundaries, the left boundary can represent the minimum value of the action of the corresponding dimension, and the right boundary can represent the maximum value of the action of the corresponding dimension.
  • the minimum value of the action can be directly designed as the left boundary of the action space, and the maximum value of the action can be directly designed as the right boundary of the action space.
  • the minimum and maximum values of the action can be normalized, and the normalized minimum and the normalized maximum are designed as the left and right boundaries of the action space, respectively.
  • Taking the unmanned vehicle as an example, the action space A may include any combination of an action space of a steering angle dimension, an action space of a throttle dimension, an action space of a brake dimension, and the like.
  • the action space may include any combination of the action space of the sweeping dimension, the action space of the weeding dimension, the action space of the steering angle dimension, and the action space of the dimension of the carried item.
  • Each dimension corresponds to one type of action
  • the action space of each dimension is a set of actions corresponding to the artificial intelligence device
  • the boundary of the action space of each dimension is determined by the minimum and maximum values of the actions of the corresponding type.
  • the action space of the speed dimension is a set of speeds of the artificial intelligence device traveling
  • the boundary of the action space of the speed dimension is determined by the minimum speed and the maximum speed of the artificial intelligence device, for example, 0 to 400.
  • the artificial intelligence device may acquire an action space generation instruction, and generate an M-dimensional action space A based on the action space generation instruction.
  • the action space generation instruction indicates the number of dimensions of the action space, and may also indicate the name of the action space of each dimension and the boundary value of the action space of each dimension, and the artificial intelligence device may generate the number of dimensions in the instruction based on the action space, each The name of the action space of the dimension and the boundary value of the action space of each dimension generate an M-dimensional action space.
  • the artificial intelligence device can obtain an action space generating instruction when running the code defining the action space, and the code for defining the action space is written by the developer according to actual requirements in advance, and is stored by the developer in the artificial intelligence device in advance.
  • In this embodiment, the action space of each dimension may be discretized at a uniform granularity; alternatively, a correspondence between granularities and dimensions of the action space may be obtained, and the action space of each dimension discretized at the granularity corresponding to that dimension.
  • the specific value of the granularity may be determined according to actual requirements, which is not limited in this embodiment.
  • Taking the steering angle dimension with boundaries -1 and 1 as an example, the action space can be discretized at a granularity of 0.1 to obtain [-1, -0.9, -0.8, ..., 0.8, 0.9, 1], a total of 21 actions, which constitute the discretized action space of the steering angle dimension.
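  • As an illustration of this discretization, here is a minimal NumPy sketch (variable names are illustrative, not from the patent):

```python
import numpy as np

# Discretize the steering-angle action space [-1, 1] at a granularity of 0.1.
# linspace avoids the floating-point drift that arange can introduce.
steering_actions = np.round(np.linspace(-1.0, 1.0, 21), 1)

print(steering_actions)       # [-1.  -0.9 -0.8 ...  0.8  0.9  1. ]
print(len(steering_actions))  # 21 discrete actions
```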
  • It should be noted that the design of the action space directly affects the convergence speed of the control model during subsequent model training: the more actions in the action space, the longer the control model takes to determine a decision in a given state. If the action space were designed to be continuous, the computation for model training would be excessive, it would be difficult to ensure that the control model converges quickly, and practicability would be poor. After the action space is discretized, the exploration space of the reinforcement learning process is reduced and the computation for training the control model decreases, which improves the convergence speed of the control model and allows it to be trained quickly. Further, the computation for determining a decision when the control model is used online is also reduced, which improves the speed of determining decisions.
  • Step 3 The artificial intelligence device obtains a reward function.
  • The reward function can be designed according to the states expected of the artificial intelligence device during actual operation. For example, the reward function can be designed to output a positive value for an ideal state, positively correlated with how ideal the state is; that is, the more ideal the state, the greater the reward value the reward function outputs for it. The reward function can also be designed to output a negative value for an undesirable state, which is equivalent to outputting a penalty value. In the subsequent model training process, the control model's understanding of each state is determined by the reward value the reward function assigns to that state, and the control model determines decisions by maximizing the reward value.
  • If the reward value of an undesirable state is designed to be negative, the control model receives a negative reward whenever that state is triggered during training, which can be understood as being punished; the control model then learns that the state should be avoided when making decisions, so when the control model is actually used to determine decisions, the effect of avoiding undesirable states is achieved. Similarly, if the reward value of an ideal state is designed to be positive and positively correlated with how ideal it is, the control model receives a positive reward whenever the state is triggered, which can be understood as being encouraged; the control model then learns that the state should be approached, so when the control model is actually used to determine decisions, the effect of approaching ideal states is achieved.
  • Taking the unmanned vehicle as an example, since the ideal states of the driving process include no collision, fast driving, and driving along the road, the reward value output by the reward function can be designed to be negatively correlated with collisions, positively correlated with velocity, and negatively correlated with θ. In this way, the control model is punished for collisions during training, urging its output decisions to avoid collisions.
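  • A minimal sketch of such a reward function follows; the weighting coefficients are illustrative assumptions, not values given by the patent:

```python
def reward(collided: bool, velocity: float, theta: float) -> float:
    """Reward shaping for the unmanned-vehicle example: negatively
    correlated with collision, positively correlated with velocity,
    negatively correlated with the angle theta to the road axis."""
    r = 0.0
    if collided:
        r -= 100.0           # punish collisions: undesirable state
    r += 0.1 * velocity      # encourage fast driving: ideal state
    r -= 10.0 * abs(theta)   # encourage driving along the road axis
    return r
```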
  • It should be noted that the order of performing steps 1 to 3 is not limited; the steps may be performed in any order.
  • the artificial intelligence device uses the reinforcement learning algorithm to train the model and obtain the control model.
  • the control model is used to output an M-dimensional discrete decision within A according to the N-dimensional state within the input S.
  • Taking the t-th learning step as an example, the states of the N dimensions may be acquired as S_t, where S_t is an N-dimensional state vector in the N-dimensional state space. The reward value R_t-1 is calculated from S_t using the reward function R; S_t is input into the current control model to obtain the output A_t; A_t is performed, after which S_t+1 and R_t are obtained; S_t, R_t-1, and the related data are added to the experience pool, the control model is updated by modeling the data in the experience pool, and the next learning step then proceeds based on the updated control model.
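  • The loop can be illustrated with the self-contained toy sketch below; the one-dimensional dynamics, the stand-in model, and the hyperparameters are assumptions, and the actual update step depends on the reinforcement learning algorithm chosen:

```python
import random
from collections import deque

# A toy sketch of the learning loop described above. The 1-D "environment",
# the stand-in model, and all hyperparameters are illustrative assumptions.
experience_pool = deque(maxlen=10_000)   # stores (S_t, A_t, R_t, S_t+1)
actions = [-0.1, 0.0, 0.1]

def reward_fn(state):                    # stand-in for the reward function R
    return -abs(state)                   # e.g. reward staying near the road axis

def model_predict(state):                # stand-in for the current control model
    return min(actions, key=lambda a: abs(state + a))

def model_update(batch):                 # stand-in for the RL update step
    pass                                 # e.g. a DQN gradient step on the batch

state = 0.7                              # S_t
for _ in range(1000):
    action = model_predict(state)                      # A_t
    next_state = max(-1.0, min(1.0, state + action))   # S_t+1 after performing A_t
    experience_pool.append((state, action, reward_fn(next_state), next_state))
    if len(experience_pool) >= 32:
        model_update(random.sample(experience_pool, 32))  # learn from replay
    state = next_state
```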
  • It should be noted that the artificial intelligence device can be configured with various components such as a camera, various sensors, and a network communication module, and can sense the state of the external environment and its own state through these components. For example, the artificial intelligence device can capture the image ahead with a front camera and use it to sense the state of the environment in front; it can collect the distance to surrounding objects through ultrasonic radar, thereby sensing the distance between itself and obstacles; it can sense its own acceleration and speed through an acceleration sensor; it can sense its own rotation angle through a gyroscope; and it can acquire road position information through a Global Positioning System (GPS) sensor.
  • Taking the unmanned vehicle as an example, the GPS sensor can collect position information during driving; from the position information the unmanned vehicle can determine its current orientation and the direction of the central axis of the road it is currently traveling on, and the angle θ can be calculated from these two directions. The unmanned vehicle can also calculate the distance P between itself and the central axis of the road from the position information. The acceleration sensor can collect acceleration during driving, from which the unmanned vehicle can calculate its speed V.
  • It should be noted that an exploration strategy can be adopted in the reinforcement learning process: taking the t-th learning step as an example, with a certain probability the action A_t determined from the reward value R_t-1 is performed, and with a certain probability a random action is performed, in order to enhance the generalization of the control model, exercise the control model's ability to explore unknown environments, and ensure that the control model can cope with the complex and varied situations arising in actual use. The exploration strategy may be an ε-greedy strategy, an optimistic initial estimation strategy, a decaying ε-greedy strategy, an uncertainty-first exploration strategy, a probability matching strategy, an information value strategy, etc., which is not limited in this embodiment.
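  • For example, ε-greedy selection might look like the following sketch (the function name and interface are illustrative):

```python
import random

def epsilon_greedy(q_values: list, epsilon: float = 0.1) -> int:
    """With probability epsilon, perform a random action (exploration);
    otherwise perform the action the model currently rates best
    (exploitation). q_values holds one estimated value per discrete action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
```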
  • It should also be noted that, with the action space A discretized into q actions, training proceeds on the discrete action space and each learning step only needs to select one of the q actions as the output discrete decision, so the exploration space is small and convergence is fast.
  • the simulation software can be run by an electronic device to build a simulator environment.
  • If training is performed in a simulator, the control model can be understood as a process or thread running in the simulator environment; by continuously driving in the simulator environment, the control model matures until training is completed. If training is performed in a real environment, the environment is required to be a multi-lane real road with lane lines, with other moving vehicles on the road to create lane-change opportunities for the unmanned vehicle; the arrangement of the other moving vehicles should satisfy a certain randomness, ensuring that the model can be trained on a variety of data sets to enhance its generalization ability.
  • The third point to note is that the number of episodes of model training, the stopping rule, and the specific calculation manner can all be determined according to actual needs. An episode refers to the process of completing a preset action, and the stopping rule refers to the rule that triggers stopping of model training, such as stopping after running one lap or stopping upon reaching the destination.
  • The specific algorithm for model training may include the Deep Q-Network (DQN) algorithm and the Q-learning algorithm; the Q-learning algorithm can be used when the dimensionality is small. This embodiment is not limited in this regard.
  • The fourth point to note is that the control model has low-level control capability. In the related art, the determined steering angle can only reach the level of a high-level steering decision, and low-level control, such as determining the speed of a left turn, the speed of a right turn, the acceleration of a left turn, and the acceleration of a right turn, cannot be achieved. In this embodiment, the action space of each dimension is designed and can be further refined into a plurality of discrete actions, making the actions decided by the control model low-level, clear, and realizable. That is, the discrete decision output by the control model can indicate the specific action of each dimension, such as the value of the acceleration or the value of a lifted foot, which improves the accuracy of decisions and is practical.
  • In the related art, the mapping function is only suitable for controlling the driving of the vehicle that captured the first-view pictures during the learning process; once the vehicle is changed in actual application, the reliability of the steering angle determined by the mapping function becomes poor. Likewise, the mapping function is only applicable to the roads on which the first-view pictures were taken during learning, and the reliability of the steering angle it determines on other roads is also poor. The method therefore has difficulty facing complex and variable road conditions, is not applicable to other vehicles, and its robustness is very poor.
  • In this embodiment, since the control model is trained with low-level states as input, and low-level states have strong generality and weak correlation with the particular subject collecting them and the environment it is in, the decisions determined from a given state can be applied to a variety of subjects and environments. Taking the unmanned vehicle as an example, the vehicle used in actual operation need not be the same as the vehicle used during learning, and the road driven in actual application need not be the same as the road used during learning; the decisions determined in actual use can be applied to a variety of vehicles and roads, ensuring that the method can face complex and varied road conditions and is robust.
  • a control model with the current state as an input and the discrete decision as an output can be obtained.
  • Next, based on the above control model, the state space S is divided into a plurality of fuzzy subsets {S_i} such that, for all states in each fuzzy subset S_i, the control model generates the same discrete decision a_i, and a corresponding membership function is generated for each fuzzy subset.
  • the artificial intelligence device divides the state space of each dimension into a plurality of state intervals.
  • the state space of each dimension may be divided by using an over-segmentation manner, that is, the number of state intervals required to be divided is as large as possible, and each state interval is as narrow as possible to ensure the accuracy of the subsequently obtained fuzzy subset.
  • For example, the state space of the θ dimension can be divided into M state intervals, the i-th of which can be denoted S_i.
  • the artificial intelligence device acquires a typical discrete decision of each state interval in the plurality of state intervals based on the control model, and obtains multiple typical discrete decisions.
  • the typical discrete decision of the state interval refers to the discrete decision that the control model is most likely to output when the center value of the state interval is the input of the control model, that is, the discrete decision with the highest probability of output.
  • The artificial intelligence device obtains a typical discrete decision for each state interval based on the control model, so that fuzzy subsets can later be obtained by merging state intervals whose typical discrete decisions are the same.
  • the process of obtaining a typical discrete decision may specifically include the following steps 1 through 3:
  • Step 1 Acquire, for each of the plurality of state intervals, a plurality of representative states of the state interval, each representative state including a center value of the state interval in the dimension and a state in other dimensions.
  • The representative state of a state interval refers to a state whose value in the dimension of the state interval is the center value of the state interval, and whose values in the other dimensions are arbitrary states within the state spaces of those dimensions. The representative states are used to determine the typical discrete decision of the corresponding state interval.
  • the process of obtaining the representative state may include (1)-(2).
  • (1) The state spaces of the dimensions other than the i-th dimension among the N dimensions are sampled to obtain a plurality of sample values. The sampling mode can be Monte Carlo sampling, in which the sample values are random; alternatively, the sampling mode can be equal-interval sampling, sampling with a certain prior probability, or the like. The number of sample values can be determined according to actual requirements; to ensure accuracy, the number of sample values can be as large as possible, for example 10000. In this embodiment, neither the sampling mode nor the number of sample values is limited.
  • the sample value and the center value of the state interval are combined into an N-dimensional vector as a representative state of the state interval.
  • The value of the N-dimensional vector in the i-th dimension is the center value s_ij, and its values in the dimensions other than the i-th are the sample values of the state spaces of those dimensions. It should be noted that all state intervals of a dimension may share a single round of sampling, with the representative states of each state interval of the dimension determined from the same sample values: when obtaining the representative states of the state intervals of the i-th dimension, the state space of each dimension other than the i-th in the N-dimensional state space may be sampled once to obtain the sample values; the m center values of the m state intervals of the i-th dimension are then each combined with those sample values, yielding the representative states corresponding to the m center values, that is, the representative states of every state interval of the i-th dimension. This guarantees that only one round of sampling is performed for the i-th dimension, instead of m rounds of sampling for m state intervals.
  • Step 2: Input the multiple representative states into the control model respectively, and obtain multiple discrete decisions output by the control model. Each representative state may be input into the control model, which outputs a corresponding discrete decision according to the input representative state; w representative states thus yield w discrete decisions.
  • Step 3 Select the discrete decision with the most occurrences from multiple discrete decisions as a typical discrete decision of the state interval.
  • the number of occurrences of each discrete decision can be counted, and the discrete decision with the most occurrences can be selected as the typical discrete decision of the state interval.
  • The typical discrete decision of the i-th state interval is denoted A_i.
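  • Steps 1 to 3 can be sketched as follows; for simplicity this version resamples per state interval, whereas the shared-sampling optimization described above samples each dimension only once. All names are illustrative:

```python
import random
from collections import Counter

def typical_decision(center_value, dim_index, spaces, control_model, w=10000):
    """Estimate the typical discrete decision of one state interval.
    spaces: per-dimension (low, high) bounds of the state space.
    Builds w representative states that fix the interval's center value in
    dimension dim_index, Monte Carlo samples the other dimensions, then
    majority-votes the control model's outputs."""
    votes = Counter()
    for _ in range(w):
        state = tuple(center_value if d == dim_index else random.uniform(low, high)
                      for d, (low, high) in enumerate(spaces))
        votes[control_model(state)] += 1
    return votes.most_common(1)[0][0]   # the decision with the most occurrences
```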
  • Based on the plurality of typical discrete decisions, the artificial intelligence device merges adjacent state intervals corresponding to the same typical discrete decision into one fuzzy subset, obtaining at least one fuzzy subset of each dimension. Specifically, the variation of the typical discrete decision across the state intervals can be analyzed to find the edge positions where the typical discrete decision changes; a division is made at each edge position found, thereby merging consecutive state intervals whose typical discrete decision does not change into one fuzzy subset, and dividing the state space of the dimension into at least one fuzzy subset. The typical discrete decisions corresponding to the state intervals within each fuzzy subset are the same, and the typical discrete decisions corresponding to adjacent fuzzy subsets are different.
  • For example, suppose that for a robot the state space of the limb-motion dimension includes 100 state intervals: the typical discrete decision corresponding to state intervals 1-10 is squatting, that of state intervals 10-40 is standing, that of state intervals 40-80 is raising a hand, and that of state intervals 80-100 is jumping. Analyzing the changes of the typical discrete decision determines that it changes at state interval 10, state interval 40, and state interval 80, i.e., the edge positions are state interval 10, state interval 40, and state interval 80. Then state intervals 1-10 are merged into one fuzzy subset, state intervals 10-40 are merged into one fuzzy subset, state intervals 40-80 are merged into one fuzzy subset, and state intervals 80-100 are merged into one fuzzy subset.
  • Taking the unmanned vehicle as an example, suppose the state space of the θ dimension has M state intervals with the corresponding typical discrete decision set {A_1, ..., A_M}. Adjacent state intervals whose typical discrete decisions are the same are merged, yielding multiple new state intervals, each recorded as a fuzzy subset; k_θ denotes the number of fuzzy subsets of the θ dimension, thus completing the division of the fuzzy subsets of the θ dimension.
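  • The merging step can be sketched as follows (illustrative names; the state intervals are assumed ordered and contiguous along the dimension):

```python
def merge_into_fuzzy_subsets(intervals, decisions):
    """intervals: ordered list of (left, right) state-interval bounds.
    decisions: the typical discrete decision of each interval.
    Adjacent intervals sharing the same typical discrete decision are
    merged; a new fuzzy subset starts at every edge position where the
    decision changes."""
    subsets = []
    left, current = intervals[0][0], decisions[0]
    for (lo, _), dec in zip(intervals[1:], decisions[1:]):
        if dec != current:                       # edge position found
            subsets.append((left, lo, current))
            left, current = lo, dec
    subsets.append((left, intervals[-1][1], current))
    return subsets                               # (left, right, decision) triples
```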
  • the above describes the process of dividing the state space of a certain dimension into multiple fuzzy subsets.
  • In this embodiment, the state space of each dimension can be divided by this method to obtain the fuzzy subsets of each dimension. For example, the state space of the V dimension can be divided to obtain the fuzzy subsets of the V dimension, and the state space of the P dimension can be divided to obtain the fuzzy subsets of the P dimension, where k_V is the number of fuzzy subsets of the V dimension and k_P is the number of fuzzy subsets of the P dimension.
  • This embodiment provides a method for automatically dividing fuzzy subsets based on the control model obtained by reinforcement learning training, without relying on manual decisions, and is therefore extremely efficient. Further, over-segmentation can be used to divide each state space into a large number of narrow state intervals, so that the boundaries of the resulting fuzzy subsets are very accurate and the accuracy of the fuzzy subsets is high. Further, this way of dividing fuzzy subsets extends conveniently and quickly to high-dimensional state spaces, so it suits the complicated operations of practical applications and has strong utility.
  • The artificial intelligence device acquires the membership function of each fuzzy subset according to a preset rule; the membership function is used to calculate the membership degree with respect to the corresponding fuzzy subset. The preset rule is: the membership function of each fuzzy subset takes the value 1 at the center value of the fuzzy subset, 0.5 at the boundary values of the fuzzy subset, and 0 at the center values of the two fuzzy subsets adjacent to it.
  • As for the way of obtaining the membership function of a fuzzy subset: for each fuzzy subset of each dimension, five points are determined, namely the center value of the fuzzy subset, the left boundary value of the fuzzy subset, the right boundary value of the fuzzy subset, the center value of the previous fuzzy subset, and the center value of the next fuzzy subset. The membership degree at the center value is taken as 1, the membership degrees at the left and right boundary values as 0.5, and the membership degrees at the center values of the previous and next fuzzy subsets as 0; adjacent points among the five are then connected by straight lines to obtain a piecewise linear function as the membership function of the fuzzy subset, which can be as shown in FIG. 5.
  • For the first fuzzy subset of a dimension, which has no previous fuzzy subset, four points can be determined: the left boundary value of the state space of the dimension, the center value of the fuzzy subset, the right boundary value of the fuzzy subset, and the center value of the next fuzzy subset. The membership degrees at the left boundary value of the state space and at the center value of the fuzzy subset are taken as 1, the membership degree at the right boundary value of the fuzzy subset as 0.5, and the membership degree at the center value of the next fuzzy subset as 0; adjacent points among the four are connected by straight lines to obtain a piecewise linear function as the membership function of the first fuzzy subset.
  • Similarly, for the last fuzzy subset of a dimension, four points can be determined: the right boundary value of the state space of the dimension, the center value of the fuzzy subset, the left boundary value of the fuzzy subset, and the center value of the previous fuzzy subset. The membership degrees at the right boundary value of the state space and at the center value of the fuzzy subset are taken as 1, the membership degree at the left boundary value of the fuzzy subset as 0.5, and the membership degree at the center value of the previous fuzzy subset as 0; adjacent points among the four are connected by straight lines to obtain a piecewise linear function as the membership function of the last fuzzy subset.
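  • The five-point construction can be sketched as follows (interior-subset case; the four-point variants for the first and last fuzzy subsets follow the same pattern; the helper names are illustrative):

```python
def make_membership(prev_center, left, center, right, next_center):
    """Piecewise linear membership function per the preset rule:
    1 at the subset's center value, 0.5 at its boundary values,
    0 at the center values of the two adjacent subsets.
    Assumes the five points are strictly increasing."""
    xs = [prev_center, left, center, right, next_center]
    ys = [0.0, 0.5, 1.0, 0.5, 0.0]

    def mu(x):
        if x <= xs[0] or x >= xs[-1]:
            return 0.0
        for i in range(4):                       # find the enclosing segment
            if xs[i] <= x <= xs[i + 1]:
                t = (x - xs[i]) / (xs[i + 1] - xs[i])
                return ys[i] + t * (ys[i + 1] - ys[i])
    return mu

mu = make_membership(-0.5, -0.1, 0.0, 0.1, 0.5)
print(mu(0.0), mu(0.1), mu(0.5))  # 1.0 0.5 0.0
```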
  • The membership function constructed in this way is highly interpretable and effective: when the state of a certain dimension is the center value of a fuzzy subset, people would usually, through subjective perception, judge that the state strongly belongs to the fuzzy subset; when the state is evaluated with the membership function of the fuzzy subset, the resulting membership degree is 1, the maximum, thus accurately representing the extent to which the state strongly belongs to the fuzzy subset.
  • When the state of a dimension is a boundary value of a fuzzy subset, people would usually, through subjective perception, give the fuzzy evaluation that the state may belong to the fuzzy subset and may also belong to the adjacent fuzzy subset; when the state is evaluated with the membership function of the fuzzy subset, the resulting membership degree is 0.5, and its membership degree with respect to the adjacent fuzzy subset is also 0.5, the two being equal, thus accurately characterizing the extent to which the state roughly belongs to the fuzzy subset.
  • When the state of a dimension is the center value of a fuzzy subset adjacent to a given fuzzy subset, subjective perception would usually give the evaluation that the state only weakly belongs to the given fuzzy subset; when the state is evaluated with the membership function of that fuzzy subset, the resulting membership degree is 0, thus accurately characterizing the extent to which the state weakly belongs to the fuzzy subset. That is to say, the membership degrees calculated by the membership function match human evaluation very well, reasonably and accurately characterizing the degree to which a state belongs to a fuzzy subset.
  • In addition, the steps are simple and improve the efficiency of constructing membership functions: the related art usually constructs membership functions by curve fitting, pre-collecting a large number of samples, drawing the approximate curve of the membership function, selecting a matching function from a variety of typical membership functions such as parabolic and trapezoidal distributions, and adjusting its coefficients to fit the samples, thereby obtaining the membership function of the fuzzy subset. That method is cumbersome and inefficient. In this embodiment, only five points are selected for each fuzzy subset, and the membership function is obtained by assigning values and connecting the points, which is faster and improves efficiency.
  • It should be noted that the membership function may also be constructed in other ways, for example by curve fitting with commonly used membership functions such as the parabolic, trapezoidal, and triangular distribution functions according to the fuzzy subset, to obtain the final membership function. The way the membership function is constructed is not limited in this embodiment.
Through the foregoing steps, the state space of each dimension is divided into multiple fuzzy subsets and the membership function of each fuzzy subset is obtained. In the steps below, the artificial intelligence device combines the fuzzy subsets of each dimension, the membership functions, the control model, and the current state to make decisions that control its own actions.
407. The artificial intelligence device acquires the states of N dimensions.
This step is similar to the state-acquisition process used when training the model in step 402 above, and is not repeated here.
408. The artificial intelligence device obtains multiple discrete decisions based on the activated fuzzy subsets of each of the N dimensional states and the control model, where an activated fuzzy subset of a state is a fuzzy subset in which the state's membership degree is not 0, and the control model outputs a corresponding discrete decision for an input state.
For each of the N dimensions, the device uses the state of that dimension to obtain at least one activated fuzzy subset from among the dimension's fuzzy subsets, yielding activated fuzzy subsets for all N dimensions. From these activated fuzzy subsets and the control model, multiple discrete decisions can be obtained. Each of these discrete decisions is a candidate decision for the given N-dimensional state, and they are subsequently weighted and summed to produce a continuous decision.
The specific process of obtaining the activated fuzzy subsets of each dimension can follow either of two possible designs.
Design 1: For each of the N dimensions, the membership function corresponding to each fuzzy subset is applied to the state of the dimension to compute the state's membership degree in that subset, and every fuzzy subset whose membership degree is not 0 is taken as an activated fuzzy subset. That is, among the fuzzy subsets partitioned in each dimension, those with nonzero membership can be obtained as the activated fuzzy subsets. Concretely, after the N-dimensional state S is obtained at run time, the membership degree between each dimension's component of S and every fuzzy subset of that dimension is computed. For example, assuming the θ dimension's state space has five fuzzy subsets and five membership functions, after θ is obtained it can be evaluated with the five membership functions to obtain its membership degree in each of the five subsets, and the subsets with nonzero membership are selected as the activated fuzzy subsets of the θ dimension. Following the membership-function design principle above, no more than two fuzzy subsets can be activated in any one dimension.
Design 2: For each of the N dimensions, the two fuzzy subsets whose center values lie immediately to the left and right of the state of that dimension are selected from the dimension's fuzzy subsets as its activated fuzzy subsets. Each dimension thus yields two activated fuzzy subsets, so the N dimensions yield N pairs of activated fuzzy subsets. For example, for the state S = (θ, V, P), the center values closest to θ on either side, denoted θ_0 and θ_1, are found in the θ dimension's set of center values; V_0 and V_1 are found likewise in the V dimension's set, and P_0 and P_1 in the P dimension's set. Each dimension thereby finds a pair of activated-subset center values, three pairs in all, and the activated fuzzy subsets are determined to be the subsets centered at θ_0, θ_1, V_0, V_1, P_0, and P_1.
After the activated fuzzy subsets of the N dimensions are obtained, multiple discrete decisions are derived from them as follows.
Step 1: Obtain the center values of the activated fuzzy subsets corresponding to the state of each dimension, giving multiple center values.
Step 2: Combine center values across dimensions to obtain multiple intermediate states, each containing one center value per dimension. For the i-th of the N dimensions, one center value can be selected from the at least one activated fuzzy subset of that dimension; after traversing the N dimensions, the N selected center values are combined into an intermediate state of N dimensions, whose value in any dimension is the center value of an activated fuzzy subset of that dimension. When each dimension has two activated subsets, the resulting N-dimensional intermediate states can be written as (s_1^{x_1}, ..., s_n^{x_n}), x_i ∈ {0, 1}, where s_i^{x_i} is the center value of an activated fuzzy subset in the i-th dimension. For example, for the state S = (θ, V, P), combining the activated-subset center values of the three dimensions gives eight 3-dimensional intermediate states (θ_x, V_y, P_z), x, y, z ∈ {0, 1}.
Step 3: Input the intermediate states into the control model one by one to obtain the multiple discrete decisions it outputs. Any N-dimensional intermediate state fed to the control model yields that state's discrete decision; for the eight intermediate states above, this gives eight discrete decisions {A_xyz, x, y, z = 0, 1}.
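To make steps 1 to 3 concrete, the following Python sketch selects the activated subsets under Design 2, combines their center values into intermediate states, and queries the control model. The `control_model` callable and the per-dimension center lists are assumed interfaces of this sketch, not part of the embodiment:

```python
import itertools

def activated_indices(s, centers):
    # Design 2: indices of the two subsets whose center values flank s
    # (a single index at the edges of the state space).
    right = next((i for i, c in enumerate(centers) if c >= s),
                 len(centers) - 1)
    return sorted({max(right - 1, 0), right})

def discrete_decisions(state, centers_per_dim, control_model):
    """Return [(activated-subset indices, discrete decision), ...]."""
    per_dim = []
    for s, centers in zip(state, centers_per_dim):
        per_dim.append([(j, centers[j])
                        for j in activated_indices(s, centers)])
    out = []
    # Step 2: one center value per dimension -> up to 2**N intermediate states.
    for combo in itertools.product(*per_dim):
        idxs = tuple(j for j, _ in combo)
        intermediate = [c for _, c in combo]
        # Step 3: the control model maps an intermediate state to a decision.
        out.append((idxs, control_model(intermediate)))
    return out
```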
409. The artificial intelligence device performs a weighted summation of the multiple discrete decisions based on the membership degrees between each dimension's state and its activated fuzzy subsets, obtaining a continuous decision.
Because the membership functions are continuous, a decision obtained by weighting the discrete decisions with membership degrees computed from them is itself continuous. Controlling the artificial intelligence device with a continuous decision guarantees smooth control of the device and improves the smoothness of its actions.
The weighted summation that turns the multiple discrete decisions into a continuous decision can comprise the following steps 1 to 3.
Step 1: For each discrete decision, obtain the membership degrees of the N activated fuzzy subsets corresponding to that decision, giving N membership degrees. The N activated fuzzy subsets corresponding to a discrete decision are identified as follows: from the earlier process of inputting intermediate states to the control model, find the intermediate state that produced the decision; then, from the earlier process of assembling intermediate states out of activated-subset center values, find the N activated fuzzy subsets that determined that intermediate state, and take them as the N activated fuzzy subsets corresponding to the decision.
If Design 1 was used in step 408, the memberships of these N subsets can be read off from the membership degrees already computed there. If Design 2 was used, then for each fuzzy subset of each of the N dimensions, the membership function corresponding to the subset can be applied to the state of the dimension, giving the N membership degrees. Writing f_ij for the membership function of the j-th activated fuzzy subset of the i-th dimension, the membership degree between the N-dimensional state S and each activated subset can be computed, the degree between S and the j-th activated subset of dimension i being f_ij(s_i). For S = (θ, V, P) with the activated subsets from step 408, the membership function of the subset centered at θ_0 is applied to θ to obtain the membership degree f_θ0(θ); likewise f_θ1(θ) is obtained from the subset centered at θ_1, f_V0(V) and f_V1(V) by applying the corresponding functions to V, and f_P0(P) and f_P1(P) by applying the corresponding functions to P.
Step 2: Calculate the weight of each discrete decision from the N membership degrees. Optionally, the artificial intelligence device can compute the product of the N membership degrees as the weight of the decision; for the discrete decision A_xyz, the weight is w_xyz = f_θx(θ) · f_Vy(V) · f_Pz(P).
Step 3: Based on the weight of each discrete decision, perform the weighted summation over the multiple discrete decisions to obtain the continuous decision; for the eight decisions above, the continuous decision is A = Σ_{x,y,z ∈ {0,1}} w_xyz · A_xyz.
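The weight product and weighted summation of steps 2 and 3 can then be sketched as follows, consuming the (indices, decision) pairs from the previous sketch and the membership functions built earlier; again, this is an illustrative sketch rather than the embodiment's prescribed implementation:

```python
def continuous_decision(state, decisions, mfuncs_per_dim):
    # decisions: [(activated-subset indices, discrete decision), ...]
    # mfuncs_per_dim[i][j]: membership function of subset j in dimension i.
    total = 0.0
    for idxs, action in decisions:
        w = 1.0
        for dim, j in enumerate(idxs):
            w *= mfuncs_per_dim[dim][j](state[dim])  # step 1 memberships
        total += w * action                          # steps 2 and 3
    return total
```

No explicit normalization appears because, with the piecewise-linear functions described above, the weights of the up-to-2^N combinations sum to 1.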
410. The artificial intelligence device controls execution of the corresponding action based on the continuous decision.
The continuous decision can include M dimensions, that is, the dimensions of the M-dimensional action space A in step 401 above. The device can read the value of the continuous decision in each dimension and control itself to perform the action of that dimension.
In a practical scenario with an unmanned vehicle, a continuous decision in the steering-angle dimension makes the vehicle turn by the corresponding angle: a decision of -1 maps to the vehicle's maximum right-turn angle, so the vehicle turns right at that maximum angle, while a decision of 0 maps to 0° and the vehicle goes straight. A continuous decision in the acceleration dimension makes the vehicle accelerate accordingly: a decision of 0.3 mapping to an acceleration of 300 m/s^2 makes the vehicle accelerate at 300 m/s^2. With a robot, a continuous decision in the left-foot dimension controls the left-foot motion: a decision of 5 mapping to raising the left foot 40 cm makes the robot raise its left foot 40 cm.
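As a toy illustration of mapping a normalized continuous decision onto an actuator (the ±30° limit below is invented for the example and is not taken from the embodiment):

```python
def steering_angle_deg(decision, max_angle_deg=30.0):
    # -1 -> maximum right turn, 0 -> straight ahead, 1 -> maximum left turn.
    d = max(-1.0, min(1.0, decision))
    return max_angle_deg * d  # positive = left turn, by convention here
```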
Note that this embodiment takes the artificial intelligence device as the executing entity only by way of example. The action control method provided in this embodiment may also be performed by a server, which may be deployed in the cloud. The server can establish a network connection with the artificial intelligence device and communicate with it in real time over that connection: the device sends the acquired state of each dimension to the server; the server derives a continuous decision from the per-dimension states and the control model and sends it back; and, on receiving the continuous decision, the device controls itself to perform the corresponding action, so that the server remotely controls the device's actions. During this interaction, to ensure safety, the communication delay may be required to be as small as possible and the communication security as high as possible.
Determining the continuous decision from the membership degrees between the state and the activated fuzzy subsets provides a reasonable way of making decisions continuous and ensures the continuous decision's high accuracy. Membership degree reflects the trend of the state within a fuzzy interval: as the device runs and the state changes with some trend, the state's position within the fuzzy subsets changes with that trend, the membership degrees change accordingly, and so does the continuous decision determined from them. The trend of the continuous decision therefore matches the trend of the state, giving high accuracy.
The method provided by this embodiment performs a weighted summation of multiple discrete decisions based on the membership degrees between each dimension's state and the activated fuzzy subsets to obtain a continuous decision. Because the output decision is a continuous quantity, smooth control of the artificial intelligence device is guaranteed and the smoothness of its actions is ensured. Obtaining the continuous decision through membership degrees also provides a reasonable way of making discrete decisions continuous, ensuring that the trend of the continuous decision matches the trend of the state and hence that the continuous decision is highly accurate. Further, when the artificial intelligence device is an unmanned vehicle, the ride smoothness of the vehicle is guaranteed and passenger comfort improves.
FIG. 8 is a schematic structural diagram of an action control apparatus according to an embodiment of the present disclosure. As shown in FIG. 8, the apparatus includes an acquisition module 801, a calculation module 802, and a control module 803.
The acquisition module 801 is configured to perform step 407 above, and is further configured to perform step 408 above. The calculation module 802 is configured to perform step 409 above. The control module 803 is configured to perform step 410 above.
In one possible design, the acquisition module 801 includes: an acquisition submodule configured to perform step 1 of Design 2 in step 408 above; a combination submodule configured to perform step 2 of Design 2 in step 408 above; and an input submodule configured to perform step 3 of Design 2 in step 408 above.
In one possible design, the apparatus further includes a division module configured to perform step 401 above; the acquisition module 801 is further configured to perform step 404 above and step 405 above.
In one possible design, the acquisition module 801 includes: an acquisition submodule configured to perform step one in step 404 above; an input submodule configured to perform step two in step 404 above; and a selection submodule configured to perform step three in step 404 above.
In one possible design, the calculation module 802 is configured to calculate the membership degree of each fuzzy subset. In one possible design, the acquisition module 801 is further configured to perform step 406 above.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented wholly or partly in the form of a computer program product, which includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the instructions may be transmitted from one website, computer, server, or data center to another in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid-state drive).


Abstract

An action control method and apparatus, belonging to the field of artificial intelligence. The method includes: acquiring the states of N dimensions of an artificial intelligence device (407); obtaining multiple discrete decisions based on the activated fuzzy subsets of the state of each of the N dimensions and a control model (408), where an activated fuzzy subset of a state is a fuzzy subset in which the state's membership degree is not 0, each fuzzy subset comprises a state interval within one dimension corresponding to the same discrete decision, the membership degree indicates the degree to which a state belongs to a fuzzy subset, and the control model outputs a corresponding discrete decision for an input state; performing a weighted summation of the multiple discrete decisions based on the membership degrees between the state of each dimension and the activated fuzzy subsets to obtain a continuous decision (409); and controlling the artificial intelligence device to perform a corresponding action based on the continuous decision (410). Since the output decision is a continuous quantity, smooth control of the artificial intelligence device can be guaranteed, ensuring smoothness of the action.

Description

Action control method and apparatus
This application claims priority to Chinese Patent Application No. 201711408965.4, filed on December 22, 2017 and entitled "Action control method and apparatus", which is incorporated herein by reference in its entirety.
Technical field
The present disclosure relates to the field of artificial intelligence, and in particular to an action control method and apparatus.
Background
With the development of the field of artificial intelligence, devices such as unmanned vehicles and robots have emerged one after another, bringing great convenience to people's lives. At present, artificial intelligence methods usually provide intelligent decision-making for such devices, acting as their machine brains and controlling them to perform corresponding actions, for example driving an unmanned vehicle on a road or moving a robot through a warehouse.
Currently, reinforcement learning algorithms are usually used to control the actions of artificial intelligence devices. Taking an unmanned vehicle as an example, the necessary elements of the reinforcement learning algorithm are first defined from the actual application scenario: an N-dimensional state space S, an m-dimensional action space A, and a reward function R. Driving is then simulated in a simulator environment, or performed on real roads, based on S, A, and R, to train a control model that, given an input N-dimensional state in S, outputs an m-dimensional discrete decision in A. During actual driving, the unmanned vehicle collects its current N-dimensional state, inputs it into the control model, obtains the m-dimensional discrete decision output by the model, and controls execution of the corresponding action based on it.
In the course of implementing the present disclosure, the inventors found at least the following problem in the related art: when an artificial intelligence device is controlled by discrete decisions, the output decisions are discrete quantities, which makes smooth control of the device difficult and results in poor smoothness of its actions.
Summary
Embodiments of the present disclosure provide an action control method and apparatus that can solve the technical problem in the related art that smooth control of an artificial intelligence device is difficult. The technical solutions are as follows.
According to a first aspect, an action control method is provided, the method including:
acquiring states of N dimensions of an artificial intelligence device, N being a positive integer greater than or equal to 1;
obtaining multiple discrete decisions based on activated fuzzy subsets of the state of each of the N dimensions and a control model, where an activated fuzzy subset of a state is a fuzzy subset in which the membership degree of the state is not 0, each fuzzy subset comprises a state interval within one dimension corresponding to the same discrete decision, the membership degree indicates the degree to which a state belongs to a fuzzy subset, and the control model outputs a corresponding discrete decision according to an input state;
performing a weighted summation of the multiple discrete decisions based on the membership degrees between the state of each dimension and the activated fuzzy subsets, to obtain a continuous decision; and
controlling, based on the continuous decision, the artificial intelligence device to perform a corresponding action.
In the method provided by this embodiment, multiple discrete decisions are weighted and summed according to the membership degrees between each dimension's state and the activated fuzzy subsets to obtain a continuous decision. Since the output decision is a continuous quantity, smooth control of the artificial intelligence device is guaranteed and the smoothness of its actions is ensured. Meanwhile, obtaining continuous decisions through membership degrees provides a reasonable way of making discrete decisions continuous, ensuring that the trend of the continuous decision matches the trend of the state, so that the continuous decision is highly accurate. Further, when the artificial intelligence device is an unmanned vehicle, ride smoothness is guaranteed and passenger comfort improves.
In one possible design, performing the weighted summation of the multiple discrete decisions based on the membership degrees between each dimension's state and the activated fuzzy subsets to obtain the continuous decision includes: for each of the multiple discrete decisions, obtaining the membership degrees of the N activated fuzzy subsets corresponding to that decision, to obtain N membership degrees; calculating the weight of each decision based on the N membership degrees; and performing the weighted summation of the multiple discrete decisions based on the weight of each decision, to obtain the continuous decision.
In one possible design, obtaining the multiple discrete decisions based on the activated fuzzy subsets of each of the N dimensional states and the control model includes: obtaining the center values of the activated fuzzy subsets of each of the N dimensions, to obtain multiple center values; combining center values of different dimensions to obtain multiple intermediate states, each intermediate state including center values of N dimensions; and inputting the multiple intermediate states into the control model respectively, to obtain the multiple discrete decisions output by the model.
In one possible design, before the multiple discrete decisions are obtained, the method further includes: for each of the N dimensions, when the membership degree between the dimension's state and any fuzzy subset of the dimension is not 0, taking that fuzzy subset as an activated fuzzy subset of the dimension; or, for each of the N dimensions, selecting, from the dimension's multiple fuzzy subsets, the two fuzzy subsets whose center values lie on either side of the dimension's state as the dimension's activated fuzzy subsets.
In one possible design, before the states of the N dimensions of the artificial intelligence device are acquired, the method further includes: for each of the N dimensions, dividing the dimension's state space into multiple state intervals; obtaining, based on the control model, a typical discrete decision of each state interval, to obtain multiple typical discrete decisions; and merging, based on the multiple typical discrete decisions, adjacent state intervals corresponding to the same typical discrete decision into one fuzzy subset, to obtain at least one fuzzy subset of the dimension.
This design provides a way of automatically dividing fuzzy subsets based on a control model trained by reinforcement learning, without relying on manual decision-making, and is highly efficient. Further, over-segmentation can be used to split each state space into a large number of state intervals; when fuzzy subsets are obtained by merging the typical discrete decisions of many intervals, the subset boundaries are very precise, ensuring the accuracy of the fuzzy subsets. The design also suits scenarios where fuzzy subsets are divided for high-dimensional state spaces and extends conveniently and quickly to high dimensions, so it applies to the complex operating conditions of practical applications and is highly practical.
In one possible design, obtaining the typical discrete decision of each state interval based on the control model includes: for each state interval, obtaining multiple representative states of the interval, each representative state including the center value of the interval in its own dimension and any state in each of the other dimensions; inputting the representative states into the control model respectively, to obtain multiple discrete decisions output by the model; and selecting the discrete decision that occurs most frequently among them as the typical discrete decision of the interval.
In one possible design, after the states of the N dimensions are acquired, the method further includes: for each fuzzy subset of each of the N dimensions, applying the membership function corresponding to the subset to the state of the dimension, to obtain the subset's membership degree.
In one possible design, before the states of the N dimensions are acquired, the method further includes: obtaining the membership function of each fuzzy subset according to a preset rule, the membership function being used to compute the membership degree of the corresponding fuzzy subset, and the preset rule being: the membership function takes 1 at the center value of each fuzzy subset, the membership degree takes 0.5 at the boundary values of each fuzzy subset, and the function takes 0 at the center values of the two fuzzy subsets adjacent to each fuzzy subset. Based on this design, the membership function is highly interpretable and effective; at the same time, the steps are simple, which improves the efficiency of constructing membership functions.
According to a second aspect, an action control apparatus is provided, the apparatus including multiple functional modules to implement the action control method of the first aspect and any possible design of the first aspect.
According to a third aspect, an artificial intelligence device is provided, including a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the action control method of the first aspect and any possible design of the first aspect.
According to a fourth aspect, a computer-readable storage medium is provided, storing at least one instruction that is loaded and executed by a processor to implement the action control method of the first aspect and any possible design of the first aspect.
Brief description of the drawings
FIG. 1 is a schematic diagram of an implementation environment according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an application scenario according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an artificial intelligence device according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of an action control method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a membership function according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a membership function according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a membership function according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an action control apparatus according to an embodiment of the present disclosure.
Detailed description
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the implementations of the present disclosure are described in further detail below with reference to the accompanying drawings.
For ease of understanding, the terms used in the embodiments of the present disclosure are explained first.
State space: the set of all possible states of the artificial intelligence device. The state space can include N dimensions (N being a positive integer), such as a speed dimension, an angle dimension, and a distance dimension; the device's state at any moment can be represented by an N-dimensional vector in the state space.
Action space: the set of all actions the artificial intelligence device can perform. The action space can include M dimensions (M being a positive integer), for example a throttle dimension, a steering-angle dimension, and a brake dimension; the device's action at any moment can be represented by an M-dimensional vector in the action space.
Reward function: a function that takes a state as input and outputs a reward value. The larger the reward value, the more desirable the corresponding state; a negative reward value indicates that the corresponding state is undesirable.
Reinforcement learning: also called reward-based or evaluative learning, refers to learning a mapping from environment to behavior with the goal of maximizing reward. In reinforcement learning, the quality of each produced action is evaluated by a reward value; during training the control model must learn from its own experience, acquiring knowledge in an action-evaluation loop and continually improving its actions to adapt to the environment.
Fuzzy subset: also called fuzzy set, a set expressing a fuzzy concept. The membership relation between any state and a fuzzy subset is not an absolute yes or no; rather, the strength of the relation is characterized by the size of the membership degree. In the embodiments of the present disclosure, a fuzzy subset is a state interval within one dimension corresponding to the same discrete decision, i.e., all states in a fuzzy subset correspond to the same discrete decision.
Membership degree and membership function: if every element x of a universe U is associated with a number A(x) ∈ [0, 1], then A is called a fuzzy set on U and A(x) the membership degree of x in A. As x varies over U, A(x) is a function, called the membership function of A. The closer A(x) is to 1, the more strongly x belongs to A; the closer A(x) is to 0, the more weakly x belongs to A; the membership function A(x), taking values in (0, 1), can thus characterize the degree to which x belongs to A. In the embodiments, membership degree indicates the degree to which a state belongs to a fuzzy subset: a larger membership degree means the state strongly belongs to the subset, and a smaller one means it weakly belongs to the subset.
Activated fuzzy subset: when the membership degree between a state and a fuzzy subset is not 0, the subset can be understood as activated and is recorded as an activated fuzzy subset.
FIG. 1 is a schematic diagram of an implementation environment according to an embodiment of the present disclosure. The environment includes an artificial intelligence device 101 and multiple servers 102, connected through wireless or wired networks. The artificial intelligence device 101 may be an unmanned vehicle or a robot, and each server 102 may be one server, a server cluster composed of several servers, or a cloud computing service center.
During reinforcement-learning training of the control model, the states used for training may come from the servers 102, and the artificial intelligence device can perceive states of various dimensions in order to learn. Taking an unmanned vehicle as an example, a server can provide the vehicle with its current position information and information about the road it is driving on, so that the vehicle knows its location and the road. During actual use of the control model, the acquired states may likewise come from a server, and the device can determine the corresponding decision from the acquired states and the control model so as to perform the corresponding action. Optionally, the server 102 may also maintain at least one database, such as a road-traffic network database or a geographic information database, to provide the device 101 with states of various dimensions.
The action control method provided by this embodiment can be applied in many practical scenarios; two exemplary ones are described below.
(1) Unmanned driving. Referring to FIG. 2, during driving an unmanned vehicle can use states of various dimensions, such as its current speed, its distance from the road's center axis, and current road traffic, together with the control model, to output continuous decisions such as steering angle, acceleration, throttle, and brake. Continuous decisions guarantee the smoothness of the vehicle's actions, avoid situations such as the speed fluctuating up and down, and improve passenger comfort. Further, the output continuous decision can be low-level data, refined to the angle of a left turn or a numeric acceleration, enabling high-precision control so that the vehicle can perform demanding maneuvers such as driving straight along the road axis, changing lanes, overtaking, following, and parking, accomplishing the task of autonomous driving.
(2) Robots performing tasks. The embodiments of the present invention can be applied to scenarios such as warehouse robots carrying goods in warehouses and on construction sites, weeding robots mowing lawns, sweeping robots cleaning offices, harvesting robots harvesting crops in fields, and Go-playing robots playing Go. Taking a warehouse robot carrying goods as an example, during operation it can use states such as its current speed, current position, distance to adjacent shelves, and distance to the shelf holding the goods, together with the control model, to output continuous decisions such as picking up an item, raising a hand, placing goods into a storage basket, steering angle, and acceleration. Continuous decisions guarantee the smoothness of the robot's motions and avoid falls caused by large fluctuations between actions. Further, the output continuous decision can be low-level data, refined to the speed of picking up an item, the angle of a left turn, or a numeric acceleration, enabling high-precision control so that the robot can perform demanding actions, improving its performance.
FIG. 3 is a schematic structural diagram of an artificial intelligence device according to an embodiment of the present disclosure. Referring to FIG. 3, the device includes a receiver 301, a transmitter 302, a memory 303, and a processor 304; the receiver 301, the transmitter 302, and the memory 303 are each connected to the processor 304. The memory 303 stores program code, and the processor 304 is configured to call the program code to perform the operations performed by the artificial intelligence device in the embodiments below.
In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory containing instructions executable by a processor of an electronic device to complete the action control method in the embodiments below. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The embodiment of the present disclosure provides an action control method consisting mainly of three parts. The first part is the reinforcement learning process, detailed in steps 401 and 402 below, which yields a control model whose input is a state and whose output is a discrete decision. The second part is the space fuzzification process, detailed in steps 403 to 406 below, which divides each dimension of the state space into multiple fuzzy subsets and derives the membership function of each subset. The third part is the defuzzification process, detailed in steps 407 to 410 below: during online control, a continuous decision is computed from the fuzzy subsets to which the actual state belongs and the membership functions, and the device then controls itself to perform the corresponding action. With the method provided by this embodiment, because the artificial intelligence device is controlled by continuous decisions to perform actions, smooth control of the device and smoothness of its actions are guaranteed.
FIG. 4 is a schematic diagram of an action control method according to an embodiment of the present disclosure. The method is executed by an artificial intelligence device and, as shown in FIG. 4, includes the following steps.
401. The artificial intelligence device obtains an N-dimensional state space S, an M-dimensional action space A, and a reward function R, where N and M are each positive integers greater than or equal to 1.
Since the subsequent process builds a training model based on a reinforcement learning algorithm, with the state space, action space, and reward function as inputs and the various discrete decisions as outputs, in step 401 the device obtains the state space, action space, and reward function in advance so as to define the necessary elements of the algorithm. Denoting the state space S, the action space A, and the reward function R, step 401 can include the following steps one to three.
Step one: The artificial intelligence device generates the N-dimensional state space S.
Concept and design of each dimension's state space: the state space of a dimension is the set of the device's states in that dimension, and it can be designed from the states the device can acquire during operation, i.e., from states collected in real time or obtained by processing. Each dimension's state space has two boundaries: the left boundary represents the minimum of the dimension's state and the right boundary the maximum. The minimum and maximum can be used directly as the left and right boundaries; alternatively, they can be normalized, with the normalized minimum and maximum used as the left and right boundaries respectively.
Illustratively, with an unmanned vehicle, the state space S can include any combination of state spaces for the dimensions θ, V, P, V_front, V_rear, P_front, and P_rear, where θ is the angle between the vehicle's center axis and the road's center axis, V is the vehicle's current speed, P is the distance between the vehicle and the road's center axis, V_front is the speed of the vehicle ahead, V_rear the speed of the vehicle behind, P_front the distance to the vehicle ahead, and P_rear the distance to the vehicle behind. With a robot, the state space can include a speed dimension, a left-foot dimension, a right-foot dimension, and so on; this embodiment does not limit it.
In one possible design, see Table 1, the unmanned vehicle can correspond to an N = 3 dimensional state space comprising the θ, V, and P dimensions. The boundaries of the θ dimension's state space are -1 and 1, where -1 represents -180° and 1 represents 180°. The boundaries of the V dimension's state space are 0 to 300, where 0 is the minimum speed and 300 the maximum. The boundaries of the P dimension's state space are -1 and 1, obtained by normalizing with the road width: P = 0 means the vehicle is exactly on the road's center axis, and P greater than 1 or less than -1 means the vehicle has left the road.
Table 1
Variable | Explanation
θ | Angle between the vehicle's center axis and the road's center axis; range -1 to 1
V | Speed of the vehicle; range 0 to 400
P | Distance between the vehicle and the road's center axis; range -1 to 1
For the specific process of generating the N-dimensional state space S, the device can obtain a state-space generation instruction and generate the N-dimensional state space from it. The instruction indicates the number of dimensions and can also indicate the name and boundary values of each dimension's state space; the device generates the N-dimensional state space from these. The device can obtain the instruction when running code that defines the state space, which is written in advance by developers according to actual needs and stored in the device beforehand.
Step two: The artificial intelligence device generates the M-dimensional action space A.
Concept and design of each dimension's action space: the action space of a dimension is the set of the device's actions in that dimension and can be determined from the actions the device can actually perform during operation. Each dimension's action space has two boundaries: the left boundary represents the minimum of the dimension's action and the right boundary the maximum. The minimum and maximum can be used directly as the boundaries, or normalized first, with the normalized minimum and maximum used as the left and right boundaries.
Illustratively, with an unmanned vehicle, the action space A can include any combination of action spaces for the steering-angle, throttle, and brake dimensions. With a robot, the action space can include any combination of action spaces for sweeping, weeding, steering-angle, and carrying dimensions.
Each dimension corresponds to one type of action, its action space is the set of the device's actions of that type, and its boundaries are determined by the minimum and maximum of that type of action. For example, the speed dimension's action space is the set of the device's travel speeds, with boundaries determined by the device's minimum and maximum speeds, for example 0 to 400.
For the specific process of generating the M-dimensional action space A, the device can obtain an action-space generation instruction and generate the M-dimensional action space from it. The instruction indicates the number of dimensions and can also indicate the name and boundary values of each dimension's action space; the device generates the M-dimensional action space from these. The device can obtain the instruction when running code that defines the action space, written in advance by developers according to actual needs and stored in the device beforehand.
In one possible design, the device can discretize the M-dimensional action space A: for each dimension's action space, q actions {a_i, i = 1, ..., q} are extracted at a certain granularity as the dimension's discretized action space. A uniform granularity can be used for every dimension, or a correspondence between granularities and dimensions can be obtained and each dimension discretized at its corresponding granularity. The specific granularity can be determined by actual needs and is not limited by this embodiment.
Illustratively, see Table 2, the unmanned vehicle can correspond to an M = 1 dimensional action space (the steering-angle dimension) with range [-1, 1], where -1 is the maximum right-turn angle, 1 the maximum left-turn angle, and 0 straight ahead. Discretizing this space at granularity 0.1 gives [-1, -0.9, -0.8, ..., 0.8, 0.9, 1], 21 actions in total, forming the discretized action space of the steering-angle dimension.
Table 2: steering-angle action space, range [-1, 1] (-1 maximum right turn, 1 maximum left turn, 0 straight ahead), discretized at granularity 0.1 into 21 actions.
In this embodiment, discretizing the action space achieves the following technical effect: it raises the convergence speed of the control model and guarantees fast training. The design of the action space directly affects convergence during subsequent model training: the more actions the space contains, the more choices the model faces when determining the decision for a state, the larger the computation, the slower the model converges, and the slower decisions are determined during online use. In the related art, deep deterministic policy gradient (DDPG) and other continuous-decision reinforcement learning schemes design continuous action spaces, so the computation required for training is excessive, fast convergence of the control model is hard to guarantee, and practicality is poor. In this embodiment, discretizing the action space shrinks the exploration space of the reinforcement learning process, reduces the computation of training the control model, raises its convergence speed, and guarantees fast training. It further reduces the computation of online decision-making with the model, raising the speed of determining decisions.
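A minimal sketch of the discretization just described (the use of NumPy is an implementation choice of this sketch, not of the embodiment):

```python
import numpy as np

# Discretize the steering-angle action space [-1, 1] at granularity 0.1,
# giving the 21 candidate actions that form the model's discrete output set.
actions = np.round(np.arange(-1.0, 1.0 + 1e-9, 0.1), 1)
assert len(actions) == 21  # [-1.0, -0.9, ..., 0.9, 1.0]
```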
Step three: The artificial intelligence device obtains the reward function.
The reward function can be designed from expectations about the device's states during actual operation: for desirable states the reward function can be designed to output positive reward values positively correlated with how desirable the state is, i.e., the more desirable a state, the larger the reward output when that state is input. For undesirable states it can be designed to output negative reward values, equivalent to penalties.
Designing the reward function this way achieves the following technical effect: it helps the control model learn the decision corresponding to each state, improving the accuracy of the model's decisions in use. In subsequent training, the model's understanding of every state comes through the reward value the reward function assigns it, and the model determines decisions so as to maximize reward. By designing negative rewards for undesirable states, the model receives a negative reward whenever it triggers such a state, which can be understood as a punishment, so the model learns that later decisions should avoid that state; in actual use of the model, undesirable states are then avoided. By designing positive rewards for desirable states, positively correlated with desirability, the model receives a positive reward whenever it triggers such a state, which can be understood as encouragement, so the model learns that later decisions should move toward that state; in actual use, decisions then tend toward the desirable states.
Taking the unmanned vehicle as an example, since desirable states during driving include not colliding, driving fast, and following the road, the reward output can be designed to be negatively correlated with collision, positively correlated with speed, and negatively correlated with θ. Illustratively, the reward function can be R = V cos θ - V sin θ when no collision occurs, and R = -200 upon collision.
With this reward function, the model is punished when a collision occurs during training, pushing it to output decisions that avoid collisions. Since higher speed brings higher reward, the model is encouraged to output decisions that raise speed, keeping the vehicle driving as fast as possible. Since smaller θ brings higher reward, the model is encouraged to output decisions that reduce θ, keeping the vehicle driving along the road without leaving the lane or drifting.
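The example reward above can be written directly as a sketch; note that θ is taken in radians here, whereas the embodiment's θ dimension is a normalized value, so a real implementation would convert first:

```python
import math

def reward(v, theta, collided):
    # R = V*cos(theta) - V*sin(theta) while driving, -200 on collision.
    if collided:
        return -200.0
    return v * math.cos(theta) - v * math.sin(theta)
```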
It should be noted that this embodiment does not limit the execution order of steps one to three; each step can be performed in any order.
402. Based on the state space, action space, and reward function, the artificial intelligence device performs model training with a reinforcement learning algorithm to obtain the control model.
In this embodiment, the control model outputs an M-dimensional discrete decision in A from an input N-dimensional state in S. As to the specific training process: at the t-th learning step, the device can acquire the N-dimensional state S_t, an N-dimensional state vector in the state space; apply the reward function R to S_t to obtain the reward value R_{t-1}; input S_t into the current control model to obtain the output A_t; after executing A_t, acquire S_{t+1} and R_t; add S_t and R_{t-1} to the experience pool; train on the data in the pool to update the control model; and perform the next learning step based on the updated model.
As to acquiring the states of N dimensions, the device can be fitted with cameras, various sensors, network communication modules, and other components through which it perceives the state of the external environment and its own state. For example, the device can capture images ahead with a front camera to perceive the state of the environment in front; measure distances to surrounding objects with ultrasonic radar to perceive how near obstacles are; perceive its own acceleration and speed with an acceleration sensor; perceive its rotation angle with a gyroscope; and obtain road-traffic network data and vehicle position information with a global positioning system (GPS) sensor to perceive its distance from the road's center axis, the angle between its heading and the road's heading, and so on.
Taking the unmanned vehicle as an example, during driving the GPS sensor can collect position information, from which the vehicle determines its current heading and the direction of the center axis of the road being driven, and from these two directions computes the angle θ between the vehicle's center axis and the road's center axis; from the position information the vehicle can also compute its distance P from the road's center axis. During driving the acceleration sensor can collect acceleration, from which the vehicle computes its speed V.
In one possible design, an exploration strategy can be used during reinforcement learning. Taking the t-th learning step as an example, the action A_t determined by the reward value R_{t-1} is executed with a certain probability, and some action is executed at random with a certain probability, so as to strengthen the model's generalization, exercise its ability to explore unknown environments, and ensure that it can cope with the complex and changing situations of actual use. The exploration strategy can be ε-greedy, optimistic initial estimates, decaying ε-greedy, uncertainty-first exploration, probability matching, value of information, and so on; this embodiment does not limit it.
A first note: with the design of step 401 that discretizes the action space A into q actions, when the model in step 402 trains on the discretized action space, each learning step only needs to select one of the q actions as the output discrete decision, so the exploration space is small and convergence is fast.
A second note: model training can take place in a virtual simulator environment or in a real environment. For an unmanned vehicle, an electronic device can run simulation software to build a simulator environment, with the control model understood as a process or thread running in the simulator environment; through continual simulated driving in the simulator, the control model matures until training completes. Alternatively the vehicle can drive in a real environment, which is then required to be a multi-lane real road with lane markings and with other moving vehicles that create lane-change opportunities for the unmanned vehicle; the placement of the other vehicles should have a certain randomness so the model can train on diverse data sets, strengthening its generalization ability.
A third note: in this embodiment the number of training episodes, the stopping rule, and the specific computation can all be determined by actual needs; an episode is one completion of a preset action, and the stopping rule is the rule that triggers the end of training, for example stopping after completing a lap or on reaching the destination. The specific training algorithm can include the deep reinforcement learning (DQN) algorithm and the Q-learning algorithm; in implementation, DQN can be used when the state space has many dimensions and Q-learning when it has few. This embodiment does not limit any of these.
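For illustration, a minimal tabular Q-learning loop over the discretized action space might look as follows; the `env` interface (`reset`/`step`), hashable discretized states, and all hyper-parameters are assumptions of this sketch, since the embodiment leaves the simulator and algorithm details open:

```python
import random
from collections import defaultdict

def train_q_learning(env, actions, episodes=500,
                     alpha=0.1, gamma=0.99, eps=0.1):
    q = defaultdict(float)  # Q-value table over (state, action) pairs
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration, one of the strategies named above
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q[(s, a_)])
            s2, r, done = env.step(a)  # r comes from the reward function
            best_next = max(q[(s2, a_)] for a_ in actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q
```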
A fourth note: compared with the related-art scheme of training a high-level decision model with a convolutional neural network, training the control model with a reinforcement learning algorithm in this embodiment has the following technical effects.
First, the control model has low-level control capability. The related art collects a large number of first-person images from human driving together with labeled steering angles and uses a convolutional neural network to learn a mapping function from first-person image to steering angle, so that during actual driving the vehicle determines the steering angle from real-time first-person images and the mapping function. However, the steering angle so determined only reaches the level of a high-level decision about steering magnitude; it cannot reach low-level control such as determining the speed of a left turn, the speed of a right turn, or the acceleration of either. This embodiment designs an action space for every dimension and can further refine each dimension's action space into multiple discrete actions, making the actions the control model decides low-level, clear, and concrete. The control model therefore has low-level control capability: its discrete decisions can indicate the concrete action in each dimension, precise to, for example, a numeric acceleration or the height of a foot lift, improving decision precision and practicality.
Second, the control model is robust. In the related art, because first-person images are strongly correlated with the specific vehicle and road, the mapping function generalizes poorly: it only suits driving the vehicle that captured the first-person images of the learning process, and its steering angles become very unreliable when another vehicle is driven in actual use; likewise it only suits the road where the learning-process images were captured, and its steering angles become very unreliable on other roads. That method therefore struggles with complex, changing road conditions, does not apply to other vehicles, and is not robust. In the method provided by this embodiment, because the control model trains with low-level states as inputs, and low-level states are highly universal and weakly correlated with the particular agent that collects them and with its environment, a decision determined for a given state applies to various agents and environments. Taking the unmanned vehicle as an example, the vehicle used in actual operation need not be the one used during learning, still less the road the same as the learning road; decisions determined in actual use apply to various vehicles and roads, so complex and changing road conditions can be handled and robustness is strong.
In summary, through steps 401 and 402 a control model is obtained that takes the current state as input and a discrete decision as output. In steps 403 to 405 below, the state space S is divided, based on this control model, into disjoint fuzzy subsets {S_i} such that all states in each fuzzy subset S_i correspond under the control model to the same discrete decision a_i, and a corresponding membership function is generated for each fuzzy subset.
403. For each of the N dimensions, the artificial intelligence device divides the dimension's state space into multiple state intervals.
The i-th dimension (i = 1, ..., n) of the N-dimensional state space S can be split into m_i state intervals at a certain granularity. Optionally, over-segmentation can be used to divide each dimension's state space, i.e., the number of intervals is required to be as large as possible and each interval as narrow as possible, to ensure the precision of the fuzzy subsets obtained later. Also, since the typical discrete decision of each state interval will later be obtained from the interval's center value, the center value s_ij (j = 1, ..., m_i) of each interval can be recorded during division, giving the set of center values of each dimension's state space.
Illustratively, taking the division of the θ dimension's state space as an example, it can be divided into M state intervals {θ_1, ..., θ_M}, and the center value of the i-th interval can be recorded, giving a set containing M center values. Assuming the θ state space is [-1, 1] and M = 200, the center-value set is {-0.995, -0.985, -0.975, ..., 0.985, 0.995}.
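The over-segmentation of the θ example can be sketched as follows (NumPy assumed):

```python
import numpy as np

# Split the theta state space [-1, 1] into M = 200 equal state intervals
# and record each interval's center value.
M = 200
edges = np.linspace(-1.0, 1.0, M + 1)
centers = (edges[:-1] + edges[1:]) / 2  # {-0.995, -0.985, ..., 0.995}
```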
404. Based on the control model, the artificial intelligence device obtains the typical discrete decision of each of the multiple state intervals, obtaining multiple typical discrete decisions.
The typical discrete decision of a state interval is the discrete decision the control model is most likely to output when the interval's center value is part of the model's input, i.e., the output decision with the highest probability. The device obtains each interval's typical discrete decision from the control model so that intervals with the same typical decision can later be merged to obtain fuzzy subsets. The process of obtaining typical discrete decisions can include the following steps one to three.
Step one: For each of the multiple state intervals, obtain multiple representative states of the interval, each representative state including the interval's center value in the interval's own dimension and states in the other dimensions.
A representative state of a state interval is a state whose value in the interval's dimension is the interval's center value and whose dimensionality equals that of the state space; it is used to determine the interval's typical discrete decision and consists of the interval's center value plus states in the other dimensions, which can be sampled values of those dimensions' state spaces. Taking the j-th state interval of the i-th dimension as an example, obtaining representative states can include (1)-(2):
(1) Sample the state spaces of the dimensions other than the i-th among the N dimensions, obtaining multiple sampled values. The sampling can be Monte Carlo sampling, in which case the values are random samples, or equal-interval sampling, sampling under some prior probability, and so on. The number of samples can be set by actual needs; for precision it can be required to be as large as possible, for example 10000. This embodiment limits neither the sampling method nor the number of samples.
(2) For each of the sampled values, form an N-dimensional vector from the sampled value and the interval's center value as a representative state of the interval, the vector taking the value s_ij in the i-th dimension and the sampled values of the other dimensions' state spaces in the dimensions other than the i-th.
Optionally, one round of sampling can serve all state intervals of a dimension, with the samples determining the representative states of each interval of that dimension: when obtaining the representative states of every interval of the i-th dimension, the state space of every dimension other than the i-th can be sampled once, giving sampled values for each of those dimensions; the m center values in the i-th dimension's center-value set are each combined with those sampled values, giving the m representative-state families of the m center values, i.e., the representative states of every interval of the i-th dimension. A single round of sampling thus suffices for the i-th dimension, rather than m rounds for its m intervals.
Illustratively, to obtain the representative states of each interval of the θ dimension, let w be the number of randomly sampled points and denote by S_O the dimensions of the state space S other than θ. Then w points can be randomly sampled in S_O's state space, giving the samples {S_O1, S_O2, ..., S_Ow}. For each center value θ_i (i = 1, ..., M) in the θ dimension's center-value set, combining θ_i with each of the samples yields w representative states {S_i1, S_i2, ..., S_iw}, where any representative state S_ij = (θ_i, S_Oj) and the dimensionality of S_ij equals that of the state space S.
Step two: Input the multiple representative states into the control model respectively, obtaining the multiple discrete decisions it outputs. For each of the w representative states, inputting it into the control model yields the corresponding discrete decision output by the model, so the w representative states give w discrete decisions.
Step three: From the multiple discrete decisions, select the one occurring most often as the interval's typical discrete decision. While obtaining the discrete decisions, the occurrence count of each decision can be tallied, and the most frequent one selected as the interval's typical discrete decision. Illustratively, for a state interval of the θ dimension, when its w representative states give w discrete decisions, the decision occurring most often among them can be taken as the interval's typical discrete decision, denoted A_i.
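Steps one to three can be illustrated with the following sketch, which uses Monte Carlo sampling and a majority vote; the `control_model` callable, the representation of the other dimensions as (low, high) ranges, and the placement of the center value first in the vector are simplifying assumptions of the sketch:

```python
import random
from collections import Counter

def typical_decision(center, other_dim_ranges, control_model, w=10000):
    votes = Counter()
    for _ in range(w):
        # (1) sample the other dimensions; (2) form a representative state.
        others = [random.uniform(lo, hi) for lo, hi in other_dim_ranges]
        votes[control_model([center] + others)] += 1
    # Step three: the most frequent discrete decision is the typical one.
    return votes.most_common(1)[0][0]
```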
405. Based on the multiple typical discrete decisions, the artificial intelligence device merges adjacent state intervals corresponding to the same typical discrete decision into one fuzzy subset, obtaining at least one fuzzy subset of each dimension.
For each dimension's state space, once the dimension's state intervals and their typical discrete decisions are obtained, the variation of the typical decisions can be analyzed to find the edge positions where the typical decision changes across these intervals; the space is cut once at each determined edge position, so that consecutive intervals over which the typical decision does not change are merged into one fuzzy subset, splitting the dimension's state space into at least one fuzzy subset. Among these, every fuzzy subset corresponds to one and the same discrete decision, and adjacent fuzzy subsets correspond to different typical discrete decisions.
Taking a robot as an example, suppose the state space of the body-movement dimension includes 100 state intervals: intervals 1-10 all have the typical decision squat, intervals 10-40 stand up, intervals 40-80 raise hand, and intervals 80-100 jump. Analyzing the variation, the typical decision changes at intervals 10, 40, and 80, i.e., the edge positions are intervals 10, 40, and 80, so intervals 1-10 merge into one fuzzy subset, intervals 10-40 into another, intervals 40-80 into another, and intervals 80-100 into another.
Taking the unmanned vehicle as an example, suppose the θ dimension's state space yields the intervals {θ_1, ..., θ_M} and the corresponding typical-decision set {A_1, ..., A_M}; merging adjacent intervals corresponding to the same typical discrete decision yields multiple new intervals, recorded as fuzzy subsets, where k_θ is the number of fuzzy subsets of the θ dimension, completing the division of the θ dimension into fuzzy subsets.
Further, the above describes dividing one dimension's state space into multiple fuzzy subsets; in implementation this method can be applied to every dimension's state space to obtain its fuzzy subsets. With the unmanned vehicle, the V dimension's state space can be divided to obtain the V dimension's fuzzy subsets, and the P dimension's state space to obtain the P dimension's fuzzy subsets, where k_v is the number of fuzzy subsets of the V dimension and k_p the number of fuzzy subsets of the P dimension.
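The merging rule itself reduces to grouping runs of equal typical decisions, as in this sketch:

```python
def merge_intervals(typical_decisions):
    """Merge adjacent state intervals sharing the same typical discrete
    decision into fuzzy subsets; returns (first, last) interval indices."""
    subsets, start = [], 0
    for j in range(1, len(typical_decisions)):
        if typical_decisions[j] != typical_decisions[j - 1]:
            subsets.append((start, j - 1))  # an edge position: cut here
            start = j
    subsets.append((start, len(typical_decisions) - 1))
    return subsets
```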
In summary, steps 403 to 405 divide each dimension's state space into multiple fuzzy subsets, and this way of dividing fuzzy subsets achieves the following technical effects.
The related art usually relies on expert experience and completes the division of fuzzy subsets manually: to divide a dimension's state space into multiple fuzzy subsets, several experts must be invited, each manually filling in decisions for various states from personal experience, and states corresponding to the same decision are merged into fuzzy subsets. This is extremely tedious, inefficient, and inaccurate. Further, the manual method can hardly be applied to dividing fuzzy subsets in high-dimensional state spaces: the state vectors of a high-dimensional space are already combinations of states across different dimensions and represent very complex operating situations, so experts can no longer determine decisions from personal experience; the method's practicality is poor and its scope of application narrow.
This embodiment instead provides a way of automatically dividing fuzzy subsets based on a control model trained by reinforcement learning, without relying on manual decisions, and is highly efficient. Further, over-segmentation can be used to split each state space into a large number of state intervals; when fuzzy subsets are obtained by merging many intervals' typical discrete decisions, the subset boundaries are very precise, ensuring high accuracy of the fuzzy subsets. The approach also suits dividing fuzzy subsets for high-dimensional state spaces and can be conveniently and quickly extended to high dimensions, applying to the complex operating conditions of practical use; its practicality is strong.
406. According to a preset rule, the artificial intelligence device obtains the membership function of each fuzzy subset, the membership function being used to compute the membership degree of the corresponding subset, and the preset rule being: the membership function takes 1 at the center value of each fuzzy subset, the membership degree takes 0.5 at the boundary values of each fuzzy subset, and the function takes 0 at the center values of the two fuzzy subsets adjacent to each fuzzy subset.
As to the way of obtaining a fuzzy subset's membership function: for each fuzzy subset of each dimension, five points of the subset are determined: the subset's center value, its left boundary value, its right boundary value, the center value of the preceding subset, and the center value of the following subset. The membership degree is set to 1 at the center value, 0.5 at the left and right boundary values, and 0 at the center values of the preceding and following subsets; adjacent points among these five are connected by straight lines, and the resulting piecewise-linear function is taken as the subset's membership function, as illustrated in FIG. 5.
Specifically, the first and last fuzzy subsets of any dimension each have only one adjacent fuzzy subset, so only four points need to be determined. Referring to FIG. 6, for the first fuzzy subset of a dimension, the four points are the left boundary value of the dimension's state space, the subset's center value, the subset's right boundary value, and the center value of the following subset; the memberships at the state space's left boundary value and the subset's center value are set to 1, at the subset's right boundary value to 0.5, and at the following subset's center value to 0; connecting adjacent points with straight lines gives a piecewise-linear function, taken as the membership function of the first subset. Referring to FIG. 7, for the last fuzzy subset of a dimension, the four points are the right boundary value of the dimension's state space, the subset's center value, the subset's left boundary value, and the center value of the preceding subset; the memberships at the state space's right boundary value and the subset's center value are set to 1, at the subset's left boundary value to 0.5, and at the preceding subset's center value to 0; connecting adjacent points with straight lines gives a piecewise-linear function, taken as the membership function of the last subset.
Constructing membership functions this way achieves the following technical effects.
First, the membership function is highly interpretable and effective. When a dimension's state equals a subset's center value, subjective human perception would usually rate the state as strongly belonging to the subset, and the subset's membership function evaluated at that state yields 1, the maximum membership, accurately characterizing this strong membership. When the state equals a subset's boundary value, human perception would usually give the fuzzy judgment that the state roughly belongs to the subset but may also belong to the adjacent subset, and the membership function yields 0.5 for the subset and 0.5 for the adjacent subset, two equal values accurately characterizing this approximate membership. When the state equals the center value of a subset adjacent to a given subset, human perception would usually rate the state as weakly belonging to the given subset, and the membership function yields 0, accurately characterizing this weak membership. In other words, the membership degrees computed by these functions are guaranteed to match human evaluation closely, reasonably and accurately characterizing the degree to which a state belongs to a fuzzy subset.
Second, the steps are simple, improving the efficiency of constructing membership functions. The related art usually constructs them by curve fitting: a large number of samples is collected in advance, an approximate curve of the membership function is drawn, a membership function matching the curve is determined from typical families such as parabolic and trapezoidal distributions, and its coefficients are adjusted to approximate the samples, yielding the subset's membership function. This approach is cumbersome and inefficient. In this embodiment, only five points per subset need to be selected, assigned values, and connected to obtain the membership function, which is faster and raises efficiency.
It should be noted that the above only illustrates drawing a piecewise-linear function as the way of constructing the membership function; other ways can also be used in implementation, for example curve fitting with common membership functions such as parabolic, trapezoidal, or triangular distribution functions, fitting the function to the fuzzy subset to obtain the final membership function. Of course, any way of determining a membership function from a fuzzy subset can be used; this embodiment does not limit it.
In summary, steps 403 to 406 divide each dimension's state space into multiple fuzzy subsets and obtain each subset's membership function. In step 407 below, the artificial intelligence device combines each dimension's fuzzy subsets, the membership functions, the control model, and the current state to make decisions that control its own actions.
407. The artificial intelligence device acquires the states of N dimensions.
This step is similar to the state-acquisition process used when training the model in step 402 above and is not repeated here.
408. Based on the activated fuzzy subsets of the state of each of the N dimensions and the control model, the artificial intelligence device obtains multiple discrete decisions, where an activated fuzzy subset of a state is a fuzzy subset in which the state's membership degree is not 0, and the control model outputs a corresponding discrete decision for an input state.
For each of the N dimensions, the device obtains, from the dimension's state, at least one activated fuzzy subset among the dimension's fuzzy subsets, thereby obtaining activated fuzzy subsets for the N dimensions. From these activated fuzzy subsets and the control model, multiple discrete decisions can be obtained; each is a candidate decision for the given N-dimensional state, and these discrete decisions are later weighted and summed into a continuous decision.
The specific process of obtaining each dimension's activated fuzzy subsets can follow either of the following two possible designs.
Design 1: For each of the N dimensions, when the membership degree of a fuzzy subset of the dimension is not 0, that fuzzy subset is taken as an activated fuzzy subset of the dimension. For each fuzzy subset of each dimension, the corresponding membership function can be applied to the dimension's state to obtain the subset's membership degree; when the degree is not 0, the subset is taken as an activated fuzzy subset. That is, among the fuzzy subsets divided within each dimension, those with nonzero membership can be obtained as the activated fuzzy subsets. Concretely, after the device obtains the N-dimensional state S = (s_1, ..., s_n) during actual operation, for each dimension's component of S it can use the membership functions of that dimension's fuzzy subsets obtained in step 406 to compute the membership degree between S and each subset, recording subsets with nonzero membership as activated. For example, assuming the θ dimension's state space has five fuzzy subsets and five membership functions, after θ is obtained it can be evaluated with the five membership functions respectively to obtain θ's membership degree in each of the five subsets, and the subsets with nonzero membership are selected from the five as the θ dimension's activated fuzzy subsets. Following the membership-function design principle above, it can be seen that no more than two fuzzy subsets are activated in any one dimension.
Design 2: For each of the N dimensions, the two fuzzy subsets whose center values lie on either side of the dimension's state are selected from the dimension's fuzzy subsets as the dimension's activated fuzzy subsets. After the device obtains the N-dimensional state S = (s_1, ..., s_n) during actual operation, for each dimension's component of S it can find, in the set of center values of the dimension's fuzzy subsets, the center value of the subset to the left of the component, i.e., the center value closest to the component among all center values smaller than it, and take the subset containing that center value as an activated fuzzy subset. Likewise, it finds the center value of the subset to the right of the component, i.e., the center value closest to the component among all center values larger than it, and takes the subset containing that center value as an activated fuzzy subset. In this way each dimension acquires two activated fuzzy subsets, so the N dimensions acquire N pairs of activated fuzzy subsets.
Illustratively, with the obtained N-dimensional state S = (θ, V, P), the center values closest to θ, denoted θ_0 and θ_1, can be found in the θ dimension's center-value set; the center values V_0 and V_1 closest to V in the V dimension's set; and the center values P_0 and P_1 closest to P in the P dimension's set. Each dimension thus finds a pair of activated-subset center values, three pairs in all, and the activated fuzzy subsets are determined to be the subsets centered at θ_0, θ_1, V_0, V_1, P_0, and P_1.
After the activated fuzzy subsets of the N dimensions are obtained, multiple discrete decisions can be obtained from them. The specific process can include the following steps one to three.
Step one: Obtain the center values of the activated fuzzy subsets corresponding to each dimension's state, giving multiple center values.
Step two: Combine center values of different dimensions into multiple intermediate states, each intermediate state including center values of N dimensions. For the i-th of the N dimensions, one center value can be selected from the center values of the at least one activated fuzzy subset of the i-th dimension; after traversing the N dimensions, N center values are selected, and combining these N center values gives one intermediate state, which includes N dimensions and whose value in any dimension is the center value of one of that dimension's activated fuzzy subsets.
Illustratively, when each dimension corresponds to two activated fuzzy subsets, the resulting multiple n-dimensional intermediate states can be written as (s_1^{x_1}, ..., s_n^{x_n}), x_i ∈ {0, 1}, where s_i^{x_i} is the center value of an activated fuzzy subset of the i-th dimension. Illustratively, assuming the N-dimensional state is S = (θ, V, P), combining the three dimensions' activated-subset center values yields eight 3-dimensional states, which can be written (θ_x, V_y, P_z), x, y, z ∈ {0, 1}.
Step three: Input the multiple intermediate states into the control model respectively, obtaining the multiple discrete decisions it outputs. Once the multiple N-dimensional intermediate states are obtained, any N-dimensional intermediate state input to the control model yields that state's discrete decision. Illustratively, with the eight states (θ_x, V_y, P_z) as the control model's inputs, the eight output discrete decisions {A_xyz, x, y, z = 0, 1} can be obtained.
409. Based on the membership degrees between each dimension's state and the activated fuzzy subsets, the artificial intelligence device performs a weighted summation of the multiple discrete decisions to obtain a continuous decision.
Since the membership functions are continuous functions, the decision obtained by weighting and summing the multiple discrete decisions with membership degrees computed from the membership functions is a continuous decision; when the device's actions are controlled by a continuous decision, smooth control of the device is guaranteed and the smoothness of its actions improves.
The process of weighting and summing the multiple discrete decisions into a continuous decision can specifically include the following steps one to three.
Step one: For each of the multiple discrete decisions, obtain the membership degrees of the N activated fuzzy subsets corresponding to the decision, giving N membership degrees.
As to the notion of the N activated fuzzy subsets corresponding to a discrete decision and the process of obtaining them: for each discrete decision, the intermediate state that produced the decision can be found from the earlier process of inputting intermediate states to the control model to obtain output decisions; then, from the earlier process of assembling intermediate states out of the center values of multiple activated fuzzy subsets, the N activated fuzzy subsets that determined that intermediate state are found and taken as the N activated fuzzy subsets corresponding to the discrete decision.
Once the N activated subsets are determined: if the device used Design 1 in step 408, it can obtain the memberships of the N activated subsets from the per-subset membership degrees of each dimension already computed in that process, giving the N membership degrees. If the device used Design 2 in step 408, then for each fuzzy subset of each of the N dimensions, the membership function corresponding to the subset can be applied to the dimension's state to obtain the subset's membership degree, thereby giving the N membership degrees.
Writing f_ij for the membership function of the j-th activated fuzzy subset of the i-th dimension, with the membership functions of every dimension's activated fuzzy subsets being {f_ij, i = 1, ..., n, j = 1, ..., m_i}, the membership degree between the N-dimensional state S and each activated fuzzy subset can be computed, the degree between S and the j-th activated subset of dimension i being f_ij(s_i).
Taking S = (θ, V, P) as an example, with the activated fuzzy subsets centered at θ_0, θ_1, V_0, V_1, P_0, and P_1 obtained in step 408 above: the membership function of the subset centered at θ_0 is applied to θ, giving the membership degree f_θ0(θ) between θ and that subset; the membership function of the subset centered at θ_1 is applied to θ, giving f_θ1(θ); the membership functions of the subsets centered at V_0 and V_1 are applied to V, giving f_V0(V) and f_V1(V); and the membership functions of the subsets centered at P_0 and P_1 are applied to P, giving f_P0(P) and f_P1(P).
Step two: Based on the N membership degrees, calculate the weight of each discrete decision. Optionally, the device can compute the product of the N membership degrees as the weight of the discrete decision. For example, assuming the discrete decision is A_xyz, its weight is w_xyz = f_θx(θ) · f_Vy(V) · f_Pz(P).
Step three: Based on the weight of each discrete decision, perform the weighted summation of the multiple discrete decisions to obtain the continuous decision. Taking S = (θ, V, P) corresponding to eight discrete decisions as an example, once the eight weights are obtained, the eight discrete decisions can be weighted and summed based on each decision's weight, giving the continuous decision A = Σ_{x,y,z ∈ {0,1}} w_xyz · A_xyz.
410. Based on the continuous decision, the artificial intelligence device controls itself to perform the corresponding action.
The continuous decision can include M dimensions, i.e., the dimensions of the M-dimensional action space A in step 401 above. The device can read the continuous decision's value in each dimension and control itself to perform that dimension's action.
In a practical usage scenario, taking the unmanned vehicle as an example: for a continuous decision in the steering-angle dimension, the vehicle turns by the corresponding angle; for example, a steering decision of -1 maps to the vehicle's maximum right-turn angle, so the vehicle turns right at that maximum angle, while a decision of 0 maps to 0° and the vehicle goes straight. For a continuous decision in the acceleration dimension, the vehicle accelerates accordingly; for example, an acceleration decision of 0.3 mapping to an acceleration of 300 m/s^2 makes the vehicle accelerate at 300 m/s^2. Taking a robot as an example, for a continuous decision in the left-foot dimension, the robot controls its left-foot motion accordingly; for example, a decision of 5 mapping to raising the left foot 40 cm makes the robot raise its left foot 40 cm.
A first note: this embodiment only takes the artificial intelligence device as the executing entity by way of example. In implementation, the action control method provided by this embodiment can also be performed by a server, which can be deployed in the cloud. The server can establish a network connection with the device and communicate with it in real time through that connection: the device can send the acquired state of each dimension to the server; the server can obtain a continuous decision from the per-dimension states and the control model and send it to the device; and after receiving the continuous decision the device can control itself to perform the corresponding action, achieving the effect of the server remotely controlling the device's actions. During the interaction between server and device, to ensure safety, the communication delay can be required to be as small as possible and the communication security as high as possible.
A second note: determining the continuous decision through the membership degrees between the state and the activated fuzzy subsets provides a reasonable way of making decisions continuous and ensures the continuous decision's high accuracy. Membership degree can reflect the trend of the state within a fuzzy interval: as the device runs, when the acquired state value changes with some trend, the state's position within the fuzzy subsets changes with that trend, the subsets' membership degrees change with it, and the continuous decision determined from the membership degrees changes with it too. That is, the trend of the continuous decision matches the trend of the state, and accuracy is high.
In the method provided by this embodiment, multiple discrete decisions are weighted and summed based on the membership degrees between each dimension's state and the activated fuzzy subsets to obtain a continuous decision. Since the output decision is a continuous quantity, smooth control of the artificial intelligence device is guaranteed and the smoothness of its actions is ensured. Meanwhile, obtaining continuous decisions through membership degrees provides a reasonable way of making discrete decisions continuous, ensuring the trend of the continuous decision matches the trend of the state and hence the continuous decision's high accuracy. Further, when the device is an unmanned vehicle, the smoothness of controlling the vehicle is guaranteed and passenger comfort improves.
FIG. 8 is a schematic structural diagram of an action control apparatus according to an embodiment of the present disclosure. As shown in FIG. 8, the apparatus includes an acquisition module 801, a calculation module 802, and a control module 803.
The acquisition module 801 is configured to perform step 407 above, and is further configured to perform step 408 above.
The calculation module 802 is configured to perform step 409 above.
The control module 803 is configured to perform step 410 above.
In one possible design, the acquisition module 801 includes: an acquisition submodule configured to perform step one of Design 2 in step 408 above; a combination submodule configured to perform step two of Design 2 in step 408 above; and an input submodule configured to perform step three of Design 2 in step 408 above.
In one possible design, the apparatus further includes: a division module configured to perform step 401 above; the acquisition module 801 is further configured to perform step 404 above, and further configured to perform step 405 above.
In one possible design, the acquisition module 801 includes: an acquisition submodule configured to perform step one in step 404 above; an input submodule configured to perform step two in step 404 above; and a selection submodule configured to perform step three in step 404 above.
In one possible design, the calculation module 802 is configured to calculate the membership degree of each fuzzy subset.
In one possible design, the acquisition module 801 is further configured to perform step 406 above.
The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented wholly or partly in the form of a computer program product, which includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid-state drive).
The above are merely optional embodiments of the present disclosure and are not intended to limit it. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application.

Claims (18)

  1. An action control method, wherein the method comprises:
    acquiring states of N dimensions of an artificial intelligence device, N being a positive integer greater than or equal to 1;
    obtaining a plurality of discrete decisions based on activated fuzzy subsets of the state of each of the N dimensions and a control model, wherein an activated fuzzy subset of a state is a fuzzy subset in which the membership degree of the state is not 0, each fuzzy subset is a state interval within one dimension corresponding to the same discrete decision, the membership degree indicates the degree to which a state belongs to a fuzzy subset, and the control model is configured to output a corresponding discrete decision according to an input state;
    performing a weighted summation of the plurality of discrete decisions based on the membership degrees between the state of each dimension and the activated fuzzy subsets, to obtain a continuous decision; and
    controlling, based on the continuous decision, the artificial intelligence device to perform a corresponding action.
  2. The method according to claim 1, wherein performing the weighted summation of the plurality of discrete decisions based on the membership degrees between the state of each dimension and the activated fuzzy subsets to obtain the continuous decision comprises:
    for each of the plurality of discrete decisions, obtaining the membership degrees of the N activated fuzzy subsets corresponding to the decision, to obtain N membership degrees;
    calculating the weight of each discrete decision based on the N membership degrees; and
    performing the weighted summation of the plurality of discrete decisions based on the weight of each decision, to obtain the continuous decision.
  3. The method according to claim 1, wherein obtaining the plurality of discrete decisions based on the activated fuzzy subsets of the state of each of the N dimensions and the control model comprises:
    obtaining center values of the activated fuzzy subsets of each of the N dimensions, to obtain a plurality of center values;
    combining center values of different dimensions to obtain a plurality of intermediate states, each intermediate state comprising center values of N dimensions; and
    inputting the plurality of intermediate states into the control model respectively, to obtain the plurality of discrete decisions output by the control model.
  4. The method according to claim 1, wherein before the plurality of discrete decisions are obtained, the method further comprises:
    for each of the N dimensions, when the membership degree between the dimension's state and any fuzzy subset of the dimension is not 0, taking that fuzzy subset as an activated fuzzy subset of the dimension; or,
    for each of the N dimensions, selecting, from the plurality of fuzzy subsets of the dimension, the two fuzzy subsets whose center values lie on either side of the dimension's state, as the activated fuzzy subsets of the dimension.
  5. The method according to claim 1, wherein before the states of the N dimensions of the artificial intelligence device are acquired, the method further comprises:
    for each of the N dimensions, dividing the dimension's state space into a plurality of state intervals;
    obtaining, based on the control model, a typical discrete decision of each of the plurality of state intervals, to obtain a plurality of typical discrete decisions; and
    merging, based on the plurality of typical discrete decisions, adjacent state intervals corresponding to the same typical discrete decision into one fuzzy subset, to obtain at least one fuzzy subset of each dimension.
  6. The method according to claim 5, wherein obtaining, based on the control model, the typical discrete decision of each of the plurality of state intervals comprises:
    for each of the plurality of state intervals, obtaining a plurality of representative states of the interval, each representative state comprising the center value of the interval in its own dimension and any state in each of the other dimensions;
    inputting the plurality of representative states into the control model respectively, to obtain a plurality of discrete decisions output by the control model; and
    selecting, from the plurality of discrete decisions, the discrete decision occurring most often, as the typical discrete decision of the interval.
  7. The method according to claim 1, wherein after the states of the N dimensions of the artificial intelligence device are acquired, the method further comprises:
    for each fuzzy subset of each of the N dimensions, applying the membership function corresponding to the fuzzy subset to the state of the dimension, to obtain the membership degree of the fuzzy subset.
  8. The method according to claim 7, wherein before the states of the N dimensions of the artificial intelligence device are acquired, the method further comprises:
    obtaining, according to a preset rule, the membership function of each fuzzy subset, the membership function being used to compute the membership degree of the corresponding fuzzy subset, and the preset rule being: the membership function takes 1 at the center value of each fuzzy subset, the membership degree takes 0.5 at the boundary values of each fuzzy subset, and the function takes 0 at the center values of the two fuzzy subsets adjacent to each fuzzy subset.
  9. An action control apparatus, wherein the apparatus comprises:
    an acquisition module configured to acquire states of N dimensions of an artificial intelligence device, N being a positive integer greater than or equal to 1;
    the acquisition module being further configured to obtain a plurality of discrete decisions based on activated fuzzy subsets of the state of each of the N dimensions and a control model, wherein an activated fuzzy subset of a state is a fuzzy subset in which the membership degree of the state is not 0, each fuzzy subset comprises a state interval within one dimension corresponding to the same discrete decision, the membership degree indicates the degree to which a state belongs to a fuzzy subset, and the control model is configured to output a corresponding discrete decision according to an input state;
    a calculation module configured to perform a weighted summation of the plurality of discrete decisions based on the membership degrees between the state of each dimension and the activated fuzzy subsets, to obtain a continuous decision; and
    a control module configured to control, based on the continuous decision, the artificial intelligence device to perform a corresponding action.
  10. The apparatus according to claim 9, wherein the calculation module comprises:
    an acquisition submodule configured to obtain, for each of the plurality of discrete decisions, the membership degrees of the N activated fuzzy subsets corresponding to the decision, to obtain N membership degrees; and
    a calculation submodule configured to calculate the weight of each discrete decision based on the N membership degrees;
    the calculation submodule being further configured to perform the weighted summation of the plurality of discrete decisions based on the weight of each decision, to obtain the continuous decision.
  11. The apparatus according to claim 9, wherein the acquisition module comprises:
    an acquisition submodule configured to obtain center values of the activated fuzzy subsets of each of the N dimensions, to obtain a plurality of center values;
    a combination submodule configured to combine center values of different dimensions to obtain a plurality of intermediate states, each intermediate state comprising center values of N dimensions; and
    an input submodule configured to input the plurality of intermediate states into the control model respectively, to obtain the plurality of discrete decisions output by the control model.
  12. The apparatus according to claim 9, wherein the acquisition module is further configured to: for each of the N dimensions, when the membership degree between the dimension's state and any fuzzy subset of the dimension is not 0, take that fuzzy subset as an activated fuzzy subset of the dimension; or, for each of the N dimensions, select, from the plurality of fuzzy subsets of the dimension, the two fuzzy subsets whose center values lie on either side of the dimension's state, as the activated fuzzy subsets of the dimension.
  13. The apparatus according to claim 9, wherein the apparatus further comprises:
    a division module configured to divide, for each of the N dimensions, the dimension's state space into a plurality of state intervals;
    the acquisition module being further configured to obtain, based on the control model, a typical discrete decision of each of the plurality of state intervals, to obtain a plurality of typical discrete decisions;
    the acquisition module being further configured to merge, based on the plurality of typical discrete decisions, adjacent state intervals corresponding to the same typical discrete decision into one fuzzy subset, to obtain at least one fuzzy subset of each dimension.
  14. The apparatus according to claim 12, wherein the acquisition module comprises:
    an acquisition submodule configured to obtain, for each of the plurality of state intervals, a plurality of representative states of the interval, each representative state comprising the center value of the interval in its own dimension and any state in each of the other dimensions;
    an input submodule configured to input the plurality of representative states into the control model respectively, to obtain a plurality of discrete decisions output by the control model; and
    a selection submodule configured to select, from the plurality of discrete decisions, the discrete decision occurring most often, as the typical discrete decision of the interval.
  15. The apparatus according to claim 9, wherein the calculation module is further configured to apply, for each fuzzy subset of each of the N dimensions, the membership function corresponding to the fuzzy subset to the state of the dimension, to obtain the membership degree of the fuzzy subset.
  16. The apparatus according to claim 9, wherein the acquisition module is further configured to: obtain, according to a preset rule, the membership function of each fuzzy subset, the membership function being used to compute the membership degree of the fuzzy subset of the corresponding dimension, and the preset rule being: the membership function takes 1 at the center value of each fuzzy subset, the membership degree takes 0.5 at the boundary values of each fuzzy subset, and the function takes 0 at the center values of the two fuzzy subsets adjacent to each fuzzy subset.
  17. An artificial intelligence device, wherein the device comprises a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the operations performed in the action control method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, wherein the storage medium stores at least one instruction that is loaded and executed by a processor to implement the operations performed in the action control method according to any one of claims 1 to 8.
PCT/CN2018/121519 2017-12-22 2018-12-17 Action control method and apparatus WO2019120174A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18890321.5A EP3719603B1 (en) 2017-12-22 2018-12-17 Action control method and apparatus
US16/906,863 US11449016B2 (en) 2017-12-22 2020-06-19 Action control method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711408965.4A 2017-12-22 Action control method and apparatus
CN201711408965.4 2017-12-22

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/906,863 Continuation US11449016B2 (en) 2017-12-22 2020-06-19 Action control method and apparatus

Publications (1)

Publication Number Publication Date
WO2019120174A1 true WO2019120174A1 (zh) 2019-06-27

Family

ID=66993132

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/121519 WO2019120174A1 (zh) 2017-12-22 2018-12-17 动作控制方法及装置

Country Status (4)

Country Link
US (1) US11449016B2 (zh)
EP (1) EP3719603B1 (zh)
CN (1) CN109960246B (zh)
WO (1) WO2019120174A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN110516353A (zh) * 2019-08-27 2019-11-29 Zhejiang University of Science and Technology Method for quickly identifying curve design defects of mountain expressways
  • CN110971677A (zh) * 2019-11-19 2020-04-07 Electric Power Research Institute of State Grid Jilin Electric Power Co., Ltd. Side-channel security monitoring method for electric power Internet-of-Things terminal equipment based on adversarial reinforcement learning
  • CN116339130A (zh) * 2023-05-25 2023-06-27 National University of Defense Technology of the Chinese People's Liberation Army Fuzzy-rule-based flight mission data acquisition method, apparatus, and device

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11410558B2 (en) * 2019-05-21 2022-08-09 International Business Machines Corporation Traffic control with reinforcement learning
  • JP7357537B2 (ja) * 2019-12-24 2023-10-06 Honda Motor Co., Ltd. Control device, control method of control device, program, information processing server, information processing method, and control system
  • CN110929431B (zh) * 2020-02-03 2020-06-09 Beijing Sankuai Online Technology Co., Ltd. Training method and apparatus for a vehicle driving decision model
  • CN114435395A (zh) * 2021-12-31 2022-05-06 Saike Intelligent Technology (Shanghai) Co., Ltd. Autonomous driving method, apparatus, device, medium, and computer program product
  • CN116189380B (zh) * 2022-12-26 2024-08-02 Hubei University of Technology Human-machine safety interaction method, system, apparatus, and medium for mechanical equipment
  • CN116540602B (zh) * 2023-04-28 2024-02-23 Jinling Institute of Technology Unmanned vehicle driving method based on road-segment safety-level DQN

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN101414159A (zh) * 2008-10-08 2009-04-22 Chongqing University of Posts and Telecommunications Fuzzy control method and system based on continuous fuzzy interpolation
  • CN103955930A (zh) * 2014-04-28 2014-07-30 PLA University of Science and Technology Motion parameter estimation method based on gray-scale integral projection cross-correlation function features
  • EP2933069A1 (en) * 2014-04-17 2015-10-21 Aldebaran Robotics Omnidirectional wheeled humanoid robot based on a linear predictive position and velocity controller
  • CN106874668A (zh) * 2017-02-14 2017-06-20 Fudan University Medication analysis method based on a full-memory event sequence mining model

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU3477397A (en) * 1996-06-04 1998-01-05 Paul J. Werbos 3-brain architecture for an intelligent decision and control system
US5796919A (en) * 1996-10-07 1998-08-18 Kubica; Eric Gregory Method of constructing and designing fuzzy controllers
US6917925B2 (en) * 2001-03-30 2005-07-12 Intelligent Inference Systems Corporation Convergent actor critic-based fuzzy reinforcement learning apparatus and method
US7627386B2 (en) * 2004-10-07 2009-12-01 Zonaire Medical Systems, Inc. Ultrasound imaging system parameter optimization via fuzzy logic
  • CN102207928B (zh) 2011-06-02 2013-04-24 Hohai University Changzhou Campus Reinforcement-learning-based multi-agent sewage treatment decision support system
US8311973B1 (en) * 2011-09-24 2012-11-13 Zadeh Lotfi A Methods and systems for applications for Z-numbers
US9461876B2 (en) * 2012-08-29 2016-10-04 Loci System and method for fuzzy concept mapping, voting ontology crowd sourcing, and technology prediction
  • CN103645635A (zh) 2013-11-25 2014-03-19 Dalian Hailian Automatic Control Co., Ltd. Ship motion controller based on a simulated annealing and reinforcement learning algorithm
  • CN104240522A (zh) 2014-09-04 2014-12-24 Sun Yat-sen University Adaptive intersection control technology based on vehicular networks and fuzzy neural networks
  • KR101484249B1 (ko) 2014-09-22 2015-01-16 Hyundai Motor Company Apparatus and method for controlling a driving mode of a vehicle
  • CN105549384B (zh) 2015-09-01 2018-11-06 China University of Mining and Technology Inverted pendulum control method based on neural networks and reinforcement learning
  • CN105501078A (zh) 2015-11-26 2016-04-20 Hunan University Cooperative control method for four-wheel independently driven electric vehicles
  • CN105956968A (zh) 2016-05-26 2016-09-21 Cheng Ouya Artificial intelligence system and method for filling in college entrance examination preferences
  • CN107053179B (zh) 2017-04-21 2019-07-23 Suzhou Kangduo Robot Co., Ltd. Compliant force control method for a robotic arm based on fuzzy reinforcement learning
  • CN107099785B (zh) 2017-04-24 2019-08-23 Inner Mongolia University Preparation method of a nano thin film
US11200448B2 (en) * 2019-05-15 2021-12-14 RELX Inc. Systems and methods for generating a low-dimensional space representing similarities between patents

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN101414159A (zh) * 2008-10-08 2009-04-22 Chongqing University of Posts and Telecommunications Fuzzy control method and system based on continuous fuzzy interpolation
  • EP2933069A1 (en) * 2014-04-17 2015-10-21 Aldebaran Robotics Omnidirectional wheeled humanoid robot based on a linear predictive position and velocity controller
  • CN103955930A (zh) * 2014-04-28 2014-07-30 PLA University of Science and Technology Motion parameter estimation method based on gray-scale integral projection cross-correlation function features
  • CN106874668A (zh) * 2017-02-14 2017-06-20 Fudan University Medication analysis method based on a full-memory event sequence mining model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3719603A4

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN110516353A (zh) * 2019-08-27 2019-11-29 Zhejiang University of Science and Technology Method for quickly identifying curve design defects of mountain expressways
  • CN110516353B (zh) * 2019-08-27 2024-03-26 Zhejiang University of Science and Technology Method for quickly identifying curve design defects of mountain expressways
  • CN110971677A (zh) * 2019-11-19 2020-04-07 Electric Power Research Institute of State Grid Jilin Electric Power Co., Ltd. Side-channel security monitoring method for electric power Internet-of-Things terminal equipment based on adversarial reinforcement learning
  • CN116339130A (zh) * 2023-05-25 2023-06-27 National University of Defense Technology of the Chinese People's Liberation Army Fuzzy-rule-based flight mission data acquisition method, apparatus, and device
  • CN116339130B (zh) * 2023-05-25 2023-09-15 National University of Defense Technology of the Chinese People's Liberation Army Fuzzy-rule-based flight mission data acquisition method, apparatus, and device

Also Published As

Publication number Publication date
EP3719603A4 (en) 2021-02-17
US11449016B2 (en) 2022-09-20
US20200319609A1 (en) 2020-10-08
CN109960246A (zh) 2019-07-02
EP3719603B1 (en) 2023-04-26
EP3719603A1 (en) 2020-10-07
CN109960246B (zh) 2021-03-30

Similar Documents

Publication Publication Date Title
WO2019120174A1 (zh) 动作控制方法及装置 (Action control method and apparatus)
US12051001B2 (en) Multi-task multi-sensor fusion for three-dimensional object detection
JP7086111B2 (ja) Deep-learning-based feature extraction method used for LiDAR localization of autonomous vehicles
KR102292277B1 (ko) LiDAR position estimation that infers solutions using a 3D CNN network in autonomous vehicles
US20230144209A1 (en) Lane line detection method and related device
JP7256758B2 (ja) LiDAR localization with temporal smoothing using RNN and LSTM in autonomous vehicles
Chen et al. Driving maneuvers prediction based autonomous driving control by deep Monte Carlo tree search
Ivanovic et al. Mats: An interpretable trajectory forecasting representation for planning and control
Taniguchi et al. Hippocampal formation-inspired probabilistic generative model
CN115494879B (zh) Rotor UAV obstacle avoidance method, apparatus, and device based on reinforcement learning SAC
Westny et al. MTP-GO: Graph-based probabilistic multi-agent trajectory prediction with neural ODEs
CN113238970B (zh) Training, evaluation, and control methods and apparatus for an autonomous driving model
CN115311860B (zh) Online federated learning method for a traffic flow prediction model
Wang et al. Oracle-guided deep reinforcement learning for large-scale multi-UAVs flocking and navigation
Xue et al. A UAV navigation approach based on deep reinforcement learning in large cluttered 3D environments
Fu et al. Memory-enhanced deep reinforcement learning for UAV navigation in 3D environment
Nguyen et al. Uncertainty-aware visually-attentive navigation using deep neural networks
Zhao et al. Improving autonomous vehicle visual perception by fusing human gaze and machine vision
KR20230024392A (ko) Driving decision-making method and apparatus, and chip
Abbas et al. Statistically correlated multi-task learning for autonomous driving
Agyemang et al. Accelerating trail navigation for unmanned aerial vehicle: A denoising deep-net with 3D-NLGL
Hong et al. Knowledge Distillation-Based Edge-Decision Hierarchies for Interactive Behavior-Aware Planning in Autonomous Driving System
Musić et al. Adaptive fuzzy mediation for multimodal control of mobile robots in navigation-based tasks
CN117036966B (zh) Method, apparatus, device, and storage medium for learning point features in a map
Ten Kathen et al. AquaFeL-PSO: An informative path planning for water resources monitoring using autonomous surface vehicles based on multi-modal PSO and federated learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18890321

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018890321

Country of ref document: EP

Effective date: 20200703