WO2023142316A1 - Flight decision generation method and apparatus, computer device, and storage medium - Google Patents

Flight decision generation method and apparatus, computer device, and storage medium

Info

Publication number
WO2023142316A1
Authority
WO
WIPO (PCT)
Prior art keywords
hyperparameter
population
target
original
flight
Prior art date
Application number
PCT/CN2022/094033
Other languages
English (en)
French (fr)
Inventor
尚可
石渕久生
Original Assignee
南方科技大学
Priority date
Filing date
Publication date
Application filed by 南方科技大学 (Southern University of Science and Technology)
Publication of WO2023142316A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/106 Change initiated in response to external conditions, e.g. avoidance of elevated terrain or of no-fly zones
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the present application relates to the technical field of unmanned aerial vehicles, and in particular to a method and device for generating a flight decision, computer equipment, and a storage medium.
  • in UAV autonomous navigation, the UAV avoids obstacles in the environment through autonomous decision-making and flies safely from the starting point to the end point.
  • UAV autonomous navigation is generally modeled as a flight decision-making process, and an optimal control strategy is trained to complete the UAV autonomous navigation task.
  • more targets need to be considered for autonomous navigation tasks of UAVs.
  • the existing flight decision-making process cannot effectively solve multi-objective autonomous navigation tasks of UAVs, and its flexibility is poor.
  • the main purpose of the embodiments of the present disclosure is to provide a flight decision generation method and device, computer equipment, and storage media, which can improve the flexibility of autonomous navigation tasks of UAVs by generating flight decisions.
  • the first aspect of the embodiments of the present disclosure proposes a flight decision generation method, including:
  • the flight decision-making model includes a plurality of decision tuple data used to characterize the original flight decision of the UAV, each of the decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form an original hyperparameter population;
  • each said decision tuple data includes state data, action data, state transition function, reward function and discount factor
  • the state data includes, at a certain moment, the first distance between the UAV and an obstacle, the maximum detection range corresponding to the sensors of the UAV, the second distance and angle data between the current position of the UAV and the preset end point, the current flight speed and viewing direction of the UAV, and the maximum speed limit of the UAV;
  • the action data includes acceleration and deceleration data and steering data of the drone at a certain moment
  • the state transition function is used to generate the state data of the UAV at the next moment
  • the reward function includes the original hyperparameter group, and the reward function is used to evaluate the preliminary quality of an action performed by the drone in a certain state;
  • the discount factor is used in combination with the reward function to calculate the quality of an action performed by the drone in a certain state.
  • the optimization target includes a flight time target and a flight risk target of the UAV;
  • the first objective function is constructed from the time-of-flight objective
  • the second objective function is constructed from the flight risk objective
  • the updating and optimizing the original hyperparameter population according to the target learning function to obtain a target hyperparameter population includes:
  • the target hyperparameter population is obtained through the multiple original control strategies and the multiple child control strategies.
  • the obtaining a target hyperparameter population through the multiple original control strategies and the multiple child control strategies includes:
  • An environment selection operation is performed on the updated original hyperparameter population according to the first objective function value and the second objective function value to obtain the target hyperparameter population.
  • the environment selection operation is performed on the updated original hyperparameter population according to the first objective function value and the second objective function value to obtain the target hyperparameter population, including:
  • each of the preliminary hyperparameter groups is an original hyperparameter group or a child hyperparameter group;
  • each preliminary hyperparameter population includes a plurality of preliminary hyperparameter groups, and the preliminary hyperparameter groups within each preliminary hyperparameter population do not Pareto-dominate one another;
  • An environment selection operation is performed on each of the preliminary hyperparameter populations according to the environment selection sequence to obtain the target hyperparameter population.
  • the environment selection operation is performed on each of the preliminary hyperparameter populations according to the environment selection sequence to obtain the target hyperparameter population, including:
  • multiple preliminary hyperparameter groups satisfying the individual number threshold are sequentially selected from the multiple preliminary hyperparameter populations to form the target hyperparameter population.
  • the second aspect of the embodiments of the present disclosure provides a flight decision generating device, including:
  • a data acquisition module: used to acquire task requirement data;
  • a model construction module: used to construct a flight decision-making model according to the task requirement data, wherein the flight decision-making model includes a plurality of decision tuple data used to characterize the original flight decision of the unmanned aerial vehicle, each of the decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form an original hyperparameter population;
  • a function construction module: used to construct a corresponding target learning function based on the flight decision model;
  • a population optimization module: used to update and optimize the original hyperparameter population according to the target learning function to obtain a target hyperparameter population;
  • a decision acquisition module: used to obtain the target flight decision of the UAV according to the target hyperparameter population.
  • a third aspect of the embodiments of the present disclosure provides a computer device, the computer device includes a memory and a processor, wherein a program is stored in the memory, and when the program is executed by the processor, the processor is configured to execute the method described in any one of the embodiments of the first aspect of the present application.
  • a fourth aspect of the embodiments of the present disclosure provides a storage medium, the storage medium being a computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions are used to make the computer execute the method according to any one of the embodiments of the first aspect.
  • the flight decision generation method and device, computer equipment, and storage medium proposed by the embodiments of the present disclosure acquire mission requirement data; construct a flight decision model according to the mission requirement data, wherein the flight decision model includes multiple decision tuple data used to characterize the original flight decision of the UAV, each decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form the original hyperparameter population; construct a corresponding target learning function based on the flight decision model; update and optimize the original hyperparameter population according to the target learning function to obtain the target hyperparameter population; and obtain the target flight decision of the UAV according to the target hyperparameter population.
  • the embodiments of the present disclosure define a target learning function for optimization on the basis of establishing the flight decision-making model of the UAV; on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, which improves the flexibility of the autonomous navigation task of the UAV.
  • FIG. 1 is a flowchart of a flight decision generation method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a first state of a drone provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a second state of the drone provided by an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of step S130 in FIG. 1;
  • FIG. 5 is a flowchart of step S140 in FIG. 1;
  • FIG. 6 is a flowchart of step S540 in FIG. 5;
  • FIG. 7 is a flowchart of step S640 in FIG. 6;
  • FIG. 8 is a flowchart of step S730 in FIG. 7;
  • FIG. 9 is a flow chart of a multi-objective deep reinforcement learning algorithm provided by an embodiment of the present disclosure.
  • FIG. 10 is a block diagram of a module structure of a flight decision generating device provided by an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram of a hardware structure of a computer device provided by an embodiment of the present disclosure.
  • Artificial intelligence: a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of human consciousness and thinking; it is also the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Markov Decision Process It is a mathematical model of sequential decision, which is used to simulate the random strategy and reward that the agent can realize in the environment where the system state has Markov properties.
  • MDP is constructed based on a set of interactive objects, that is, agents and environments, and its elements include state, action, strategy and reward.
  • the agent perceives the current system state and acts on the environment according to the strategy, thereby changing the state of the environment and receiving rewards; the accumulation of rewards over time is called the return.
  • Deep reinforcement learning: combining the perception ability of deep learning and the decision-making ability of reinforcement learning, it can control directly from input images and is an artificial intelligence method closer to the human way of thinking. Deep learning has strong perception ability but lacks decision-making ability, while reinforcement learning has decision-making ability but is helpless with perception problems. Combining the two, with complementary advantages, provides a solution to the perception and decision-making problems of complex systems.
  • Asynchronous Advantage Actor-Critic A3C: The basic framework of A3C is the AC framework, but it no longer uses a single thread, but uses multiple threads. Each thread is equivalent to an agent exploring randomly, multiple agents jointly explore, calculate the policy gradient in parallel, and maintain a total update amount.
  • White Gaussian noise: "Gaussian" means that the probability distribution is a normal function, while "white noise" means that its second-order moments are uncorrelated and its first-order moment is constant, which refers to the correlation of successive signals in time.
  • Gaussian white noise is an ideal model for analyzing channel additive noise, and thermal noise, the main noise source in communication, belongs to this kind of noise.
  • Euclidean distance: the "ordinary" (i.e., straight-line) distance between two points in Euclidean space; with this distance, Euclidean space becomes a metric space. It is a commonly used definition of distance, referring to the real distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin); in two- and three-dimensional space, the Euclidean distance is the actual distance between two points.
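  • as a concrete illustration of this definition, a minimal Python snippet computing the Euclidean distance between two points in m-dimensional space is given below:

```python
import math

def euclidean_distance(p, q):
    """The 'ordinary' straight-line distance between two points in m-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# In 3-D space, the Euclidean distance is the actual distance between the two points.
print(euclidean_distance((0.0, 0.0, 0.0), (1.0, 2.0, 2.0)))  # 3.0
```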
  • UAV autonomous navigation avoids obstacles in the environment through autonomous decision-making, and flies safely from the starting point to the end point.
  • UAV autonomous navigation is generally modeled as a flight decision-making process, and an optimal control strategy is trained to complete the UAV autonomous navigation task.
  • more objectives often need to be considered.
  • users hope that on the basis of completing autonomous navigation tasks, the flight time of the drone should be as short as possible, and the flight risk should be as low as possible.
  • Existing technologies cannot effectively solve this complex multi-objective UAV autonomous navigation task, resulting in poor flexibility.
  • an embodiment of the present disclosure provides a flight decision generation method and device, computer equipment, and a storage medium: mission requirement data is acquired; a flight decision model is constructed according to the mission requirement data, wherein the flight decision model includes multiple decision tuple data used to characterize the original flight decision of the UAV, each decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form the original hyperparameter population; based on the flight decision model, a corresponding target learning function is constructed; the original hyperparameter population is updated and optimized according to the target learning function to obtain the target hyperparameter population; and the target flight decision of the UAV is obtained according to the target hyperparameter population.
  • the embodiments of the present disclosure define a target learning function for optimization on the basis of establishing the flight decision-making model of the UAV; on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, which improves the flexibility of the autonomous navigation task of the UAV.
  • Embodiments of the present disclosure provide a flight decision generation method and device, a computer device, and a storage medium, which are specifically described through the following embodiments. First, the flight decision generation method in the embodiment of the present disclosure is described.
  • the embodiments of the present application may acquire and process relevant data based on artificial intelligence (AI) technology.
  • artificial intelligence is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the flight decision generation method provided by the embodiments of the present disclosure relates to the technical field of unmanned aerial vehicles and also to the field of artificial intelligence.
  • the flight decision generation method provided by the embodiments of the present disclosure may be applied to a terminal, may also be applied to a server, and may also be software running on the terminal or the server.
  • the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer or a smart watch, etc.
  • the server can be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms;
  • the software can be an application implementing the flight decision generation method, but is not limited to the above forms.
  • Embodiments of the present disclosure may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
  • This application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • the flight decision generation method includes but is not limited to steps S110 to S150 .
  • Step S110: acquiring task requirement data;
  • Step S120: constructing a flight decision-making model according to the task requirement data;
  • Step S130: constructing a corresponding target learning function based on the flight decision model;
  • Step S140: updating and optimizing the original hyperparameter population according to the target learning function to obtain the target hyperparameter population;
  • Step S150: obtaining the target flight decision of the drone according to the target hyperparameter population.
  • the mission requirement data of the UAV is acquired.
  • the mission requirement data may include the mission scenario of the UAV and the specific flight requirements of the UAV in the mission scenario.
  • the task scenario can be that the UAV flies at a fixed height. The UAV starts from the starting point, avoids all obstacles through autonomous decision-making, and reaches the destination safely.
  • a flight decision-making model is constructed according to the task requirement data, wherein the flight decision-making model is generated according to the Markov decision process, and the goal of the flight decision-making model is to find an optimal preliminary flight control strategy that maximizes the cumulative reward of the UAV autonomous navigation task.
  • the flight decision model includes a plurality of decision-making tuple data used to characterize the original flight decision of the UAV, each decision-making tuple data includes an original hyperparameter group, and each original hyperparameter group is assembled to form an original hyperparameter population.
  • a corresponding target learning function is constructed based on the flight decision model, specifically, the target learning function is set through the flight decision model, and multiple optimization targets for the autonomous navigation task of the UAV are defined through the target learning function.
  • step S140 of some embodiments the original hyperparameter population is updated and optimized according to the target learning function to obtain the target hyperparameter population, so that the UAV can consider minimizing multiple optimization objectives on the basis of safely completing the autonomous navigation task , so as to further optimize the original hyperparameter population.
  • the target flight decision of the drone is obtained according to the target hyperparameter population.
  • a set of optimal flight strategies can be obtained through the target hyperparameter population, and users can select one target flight strategy from the flight strategy set to execute according to mission requirements and preferences, which can improve the flexibility of UAV autonomous navigation tasks.
  • each decision tuple data includes state data, action data, state transition functions, reward functions, and discount factors.
  • the flight decision model of this application, that is, the Markov decision process, can be represented by a decision tuple (S, A, P, r, γ, ρ_0), and the various data in the decision tuple are called decision tuple data, where S represents the state data, A the action data, P the state transition function, r the reward function, γ the discount factor, and ρ_0 the distribution of the initial state s_0, as follows:
  • the state data is the state of the drone at a certain moment, such as time t.
  • the state data can be expressed as a 12-dimensional vector s_t = [d_t^1, …, d_t^8, ζ_t, ξ_t, v_t, φ_t], where d_t^i indicates the first distance from an obstacle detected by the drone's own sensors, d_limit indicates the maximum detection range of the sensors, ζ_t and ξ_t ∈ [−π, π] represent the second distance and angle data (relative to the true north direction) between the current position of the drone and the preset end point at time t, v_t ∈ [0, v_limit] and φ_t ∈ [−π, π] represent the current flight speed and viewing direction of the UAV at time t, that is, the first viewing direction (relative to the true north direction), and v_limit represents the maximum speed limit of the UAV.
  • the UAV state mentioned in the embodiment of the present application can refer to Fig. 2 and Fig. 3.
  • Fig. 2 shows the plan view of the UAV state at a given fixed height, wherein the 8 directions in Fig. 2 represent the 8 sensors on the UAV, which can detect the distances to the surrounding obstacles (i.e., d_t^1, …, d_t^8), so the UAV can detect a total of 8 directions around it: front, back, left, and right.
  • Figure 3 shows the relationship between the UAV's first viewing angle direction and the end point. It can be seen from FIG. 2 and FIG. 3 that the UAV according to the embodiment of the present disclosure can sense obstacles in all directions and obtain a first distance between the sensor and the obstacle.
  • the state transition function is used to generate the state data of the drone at the next moment. Specifically, at time t, the UAV obtains its state s_t and obtains the output action a_t according to the control strategy a ~ π(·|s); the environment then makes the corresponding state transition according to the state transition function to obtain the next state s_{t+1}.
  • the reward function includes an original set of hyperparameters, and the reward function is used to evaluate the initial goodness or badness of the drone performing an action in a certain state.
  • the reward function can be designed through the following situations, for example:
  • the discount factor is used in combination with the reward function to calculate the quality of the actions performed by the drone in a certain state.
  • in practical applications, the discount factor γ ∈ [0, 1].
  • the cumulative reward of the UAV autonomous navigation task is defined as R = Σ_{t=0}^∞ γ^t · r(s_t, a_t). It should be noted that the goal of the above-mentioned Markov decision process is to find an optimal control strategy a ~ π(·|s) that maximizes this cumulative reward.
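  • to make the decision tuple and the cumulative reward concrete, a minimal Python sketch follows; the container and function names are illustrative, not part of the patent:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecisionTuple:
    """Decision tuple data (S, A, P, r, gamma, rho_0) of the flight decision model."""
    transition: Callable   # P: (s_t, a_t) -> s_{t+1}
    reward: Callable       # r: (s_t, a_t) -> float, parameterized by a hyperparameter group
    gamma: float           # discount factor, gamma in [0, 1]
    rho0: Callable         # sampler for the distribution of the initial state s_0

def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Cumulative reward R = sum_t gamma^t * r(s_t, a_t) over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```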
  • the objective learning function includes a first objective function and a second objective function, as shown in FIG. 4 , step S130 specifically includes but not limited to steps S410 to S420.
  • Step S410: obtaining the optimization target according to the flight decision model;
  • Step S420: constructing a first objective function according to the flight time objective, and constructing a second objective function according to the flight risk objective.
  • the optimization target is obtained according to the flight decision model, specifically, the optimization target includes a flight time target and a flight risk target of the UAV.
  • step S140 specifically includes but not limited to step S510 to step S540.
  • Step S510: through the flight decision-making model, performing reinforcement learning training on each original hyperparameter group in the original hyperparameter population to obtain multiple corresponding original control strategies;
  • Step S520: performing a mutation operation on each original hyperparameter group to generate multiple child hyperparameter groups;
  • Step S530: performing reinforcement learning training on each child hyperparameter group through the flight decision model to obtain the corresponding multiple child control strategies;
  • Step S540: obtaining a target hyperparameter population through the multiple original control strategies and the multiple child control strategies.
  • the goal of learning in this disclosed embodiment is to find a set of Pareto-optimal individuals, each individual representing a set of optimal hyperparameters, that is, to realize automatic optimization of the hyperparameters.
  • each original hyperparameter group in the original hyperparameter population is subjected to reinforcement learning training through the flight decision model to obtain multiple corresponding original control strategies.
  • a deep reinforcement learning algorithm is applied to each individual in the population P for training, and the corresponding optimal control strategies {π_1, …, π_n} are obtained, that is, the multiple original control strategies mentioned in the embodiments of the present disclosure.
  • the A3C algorithm can be used to train each individual. Before each offspring individual is trained, it first inherits the strategy of its parent individual and trains on the basis of that strategy, thereby accelerating the training process of offspring individuals. It should be noted that offspring individuals are produced by mutation of parent individuals, and a parent individual generates its corresponding offspring through mutation, so each offspring has a unique corresponding parent individual.
  • a mutation operation is performed on each original hyperparameter set to generate multiple child hyperparameter sets.
  • step S530 of some embodiments: similarly, reinforcement learning training is performed on each child hyperparameter group through the flight decision-making model, and the corresponding multiple child control strategies are obtained, namely {π_1′, …, π_n′}.
  • a target hyperparameter population, i.e., an optimal strategy set, is obtained through the multiple original control strategies and the multiple child control strategies.
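  • a minimal sketch of steps S520 and S530 is given below; it assumes Gaussian-noise mutation and a hypothetical train(...) helper standing in for the A3C training described above:

```python
import copy
import random

def mutate(parent_hparams, sigma=0.1, lo=0.0, hi=1.0):
    # Each child is produced by mutating its unique parent; the Gaussian
    # perturbation and the clipping range [lo, hi] are assumptions here.
    return [min(hi, max(lo, b + random.gauss(0.0, sigma))) for b in parent_hparams]

def train_generation(population, policies, train):
    # `train(hparams, init_policy)` is a hypothetical stand-in for the
    # deep reinforcement learning (e.g. A3C) training of one individual.
    children = [mutate(h) for h in population]
    # Each child inherits its parent's strategy, accelerating its training.
    child_policies = [train(c, init_policy=copy.deepcopy(p))
                      for c, p in zip(children, policies)]
    return children, child_policies
```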
  • step S540 specifically includes but not limited to step S610 to step S640.
  • Step S610: using each original control strategy to control the UAV to interact with the environment, so as to calculate the first objective function value of each original hyperparameter group according to the target learning function;
  • Step S620: using each child control strategy to control the UAV to interact with the environment, so as to calculate the second objective function value of each child hyperparameter group according to the target learning function;
  • Step S630: adding the multiple child hyperparameter groups to the original hyperparameter population to form an updated original hyperparameter population;
  • Step S640: performing an environment selection operation on the updated original hyperparameter population according to the first objective function values and the second objective function values to obtain the target hyperparameter population.
  • each original control strategy is used to control the UAV to interact with the environment, so as to calculate the first objective function value of each original hyperparameter set according to the objective learning function.
  • each original control strategy is used to control the UAV to interact with the environment, and the first objective function value of each individual of the original hyperparameter population P is evaluated by the objective learning function.
  • step S620 of some embodiments the control strategy of each child generation is used to control the UAV to interact with the environment, so as to calculate the second objective function value of each child hyperparameter group according to the objective learning function.
  • each offspring control strategy is used to control the interaction between the UAV and the environment, and the second objective function value of each individual in the offspring hyperparameter population Q formed by the offspring hyperparameter group is evaluated.
  • the evaluation process of the original hyperparameter groups and the child hyperparameter groups by means of the target learning function is as follows: the trained preliminary flight strategy a ~ π(·|s) is used to control the UAV to interact with the environment, and the first objective function value and the second objective function value are calculated from the resulting flights according to the target learning function.
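  • this evaluation step could look like the following sketch; the exact formulas for the flight time objective f_1 and the flight risk objective f_2 are not spelled out in this extract, so mean episode length and mean fraction of risky steps are used here purely as assumed stand-ins:

```python
def evaluate_objectives(policy, env, episodes=10, d_s=1.0):
    # Roll out a trained strategy and estimate (f1, f2); `env` is a
    # hypothetical simulator exposing reset()/step() and sensor distances.
    times, risks = [], []
    for _ in range(episodes):
        state, done, steps, risky = env.reset(), False, 0, 0
        while not done:
            state, done = env.step(policy(state))
            steps += 1
            risky += int(min(state.sensor_distances) < d_s)  # closer than safety distance
        times.append(steps)
        risks.append(risky / steps)
    return sum(times) / episodes, sum(risks) / episodes  # (f1, f2), both minimized
```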
  • step S640 of some embodiments an environment selection operation is performed on the updated original hyperparameter population according to the first objective function value and the second objective function value to obtain the target hyperparameter population.
  • step S640 specifically includes but not limited to step S710 to step S730.
  • Step S710: dividing the updated original hyperparameter population to obtain multiple preliminary hyperparameter populations;
  • Step S720: performing non-dominated sorting on the multiple preliminary hyperparameter populations according to the magnitudes of the first objective function values and the second objective function values to obtain an environment selection order;
  • Step S730: performing an environment selection operation on each preliminary hyperparameter population according to the environment selection order to obtain the target hyperparameter population.
  • the updated original hyperparameter population is divided to obtain multiple preliminary hyperparameter populations, wherein each preliminary hyperparameter group is an original hyperparameter group or a child hyperparameter group, each preliminary hyperparameter population includes multiple preliminary hyperparameter groups, and the preliminary hyperparameter groups within each preliminary hyperparameter population do not Pareto-dominate one another.
  • non-dominated sorting is performed on the multiple preliminary hyperparameter populations to obtain the environment selection order; specifically, the population P containing 2n individuals is divided into l subpopulations {P_1, P_2, …, P_l}; within each subpopulation no individual Pareto-dominates another (that is, there is no individual whose f_1 and f_2 objective function values are both smaller than those of another individual), P_1 Pareto-dominates {P_2, …, P_l}, P_2 Pareto-dominates {P_3, …, P_l}, and so on.
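  • this division into mutually non-dominating subpopulations can be sketched with a standard non-dominated sort (an NSGA-II-style routine is assumed here; the patent does not name a specific sorting algorithm):

```python
def dominates(a, b):
    """a Pareto-dominates b when both objectives (f1, f2) are minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(objs):
    """Split indices into subpopulations P_1, P_2, ...: no individual inside a
    subpopulation dominates another, and earlier subpopulations dominate later ones."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)}
        fronts.append(sorted(front))
        remaining -= front
    return fronts

print(non_dominated_sort([(1, 5), (2, 2), (3, 1), (4, 4)]))
# [[0, 1, 2], [3]] -- (4, 4) is Pareto-dominated by (2, 2)
```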
  • step S730 of some embodiments an environment selection operation is performed on each preliminary hyperparameter population according to the environment selection sequence to obtain a target hyperparameter population.
  • step S730 specifically includes but not limited to step S810 to step S820.
  • Step S810: obtaining a preset individual number threshold;
  • Step S820: sequentially selecting, from the multiple preliminary hyperparameter populations, multiple preliminary hyperparameter groups satisfying the individual number threshold, so as to form the target hyperparameter population.
  • step S810 of some embodiments a preset individual number threshold n is acquired.
  • step S820 of some embodiments: environment selection is performed sequentially according to the environment selection order, that is, in the order of the subpopulations {P_1, P_2, …, P_l} obtained by the division and sorting described above, until n preliminary hyperparameter groups have been selected to form the target hyperparameter population.
  • specifically, the steps for performing an environment selection operation on the subpopulations are as follows:
  • Step 1: start the environment selection from P_1; if the number of individuals in P_1 is greater than n, go to the next step, otherwise continue to select P_2; if the number of individuals in {P_1, P_2} is greater than n, go to the next step, otherwise continue to select P_3; and so on, until some P_k with k ≤ l is selected such that the number of individuals in {P_1, P_2, …, P_k} is greater than n, then enter the next step.
  • Step 2: assuming that the number of individuals in {P_1, P_2, …, P_k} is n′, then n′ − n individuals need to be deleted from P_k. First calculate the Euclidean distances between the individuals of P_k in the objective space, then find the two closest individuals in P_k and randomly delete one of them. This step is repeated n′ − n times, that is, n′ − n individuals are deleted from P_k.
  • a plurality of preliminary hyperparameter groups satisfying the individual number threshold are sequentially selected from the plurality of preliminary hyperparameter populations to form a target hyperparameter population.
  • the specific process of the evolutionary multi-objective deep reinforcement learning algorithm of the embodiment of the present disclosure is as follows:
  • the population is initialized, where the population includes multiple individuals; each individual undergoes mutation, deep reinforcement learning, and objective function value evaluation in turn, and whether to exit is decided according to the result of the environment selection operation. If the exit condition is not met, the individuals of the next-generation population generated by the environment selection operation again undergo mutation, deep reinforcement learning, and objective function value evaluation, until the preset condition is met and the process exits, thus ending the optimization.
  • the steps of the environment selection operation can be as follows: assume that there is a population P consisting of 16 individuals in total; first, the population P is divided into three subpopulations P_1, P_2, and P_3.
  • the environment selection operation needs to select 8 individuals from the 16 individuals as the next-generation population. Selection starts from P_1, which contains 6 individuals, so P_1 is selected as a whole; P_2 is considered next, which also contains 6 individuals, so only 2 individuals can be selected from P_2 and combined with the 6 individuals of P_1 to form the next-generation population.
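  • combining the sketches above, the overall loop of FIG. 9 can be outlined as follows; the generation budget stands in for the unspecified preset exit condition, and train/evaluate are the hypothetical helpers introduced earlier:

```python
def evolve(init_population, train, evaluate, n, generations=50):
    """Outline of the evolutionary multi-objective deep RL loop of FIG. 9."""
    population = list(init_population)
    policies = [train(h, init_policy=None) for h in population]
    for _ in range(generations):
        children, child_policies = train_generation(population, policies, train)
        union = population + children
        union_policies = policies + child_policies
        objs = [evaluate(p) for p in union_policies]           # (f1, f2) per individual
        keep = environment_selection(non_dominated_sort(objs), objs, n)
        population = [union[i] for i in keep]
        policies = [union_policies[i] for i in keep]
    return population, policies  # Pareto-optimal hyperparameter groups and strategies

# With 16 individuals in the union and n = 8, the selection behaves exactly
# like the worked example above: all of P_1, then 2 individuals from P_2.
```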
  • the flight decision generation method proposed by the embodiment of the present disclosure acquires mission requirement data; constructs a flight decision model according to the mission requirement data, wherein the flight decision model includes multiple decision tuple data used to characterize the original flight decision of the UAV, each decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form the original hyperparameter population; constructs a corresponding target learning function based on the flight decision model; updates and optimizes the original hyperparameter population according to the target learning function to obtain the target hyperparameter population; and obtains the target flight decision of the UAV according to the target hyperparameter population.
  • the embodiments of the present disclosure define a target learning function for optimization on the basis of establishing the flight decision-making model of the UAV; on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, which improves the flexibility of the autonomous navigation task of the UAV.
  • An embodiment of the present disclosure also provides a flight decision generating device, as shown in FIG. 10 , which can realize the above-mentioned flight decision generating method, and the device includes: a data acquisition module 1010, a model construction module 1020, a function construction module 1030, and a population optimization module 1040 and decision acquisition module 1050.
  • the data acquisition module 1010 is used to obtain task requirement data; the model construction module 1020 is used to build a flight decision model according to the task requirement data, wherein the flight decision model includes a plurality of decision tuple data for characterizing the original flight decision of the unmanned aerial vehicle, each decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form an original hyperparameter population; the function construction module 1030 is used to construct a corresponding target learning function based on the flight decision model; the population optimization module 1040 is used to update and optimize the original hyperparameter population according to the target learning function to obtain the target hyperparameter population; and the decision acquisition module 1050 is used to obtain the target flight decision of the drone according to the target hyperparameter population.
  • the flight decision generation device in the embodiment of the present disclosure is used to implement the flight decision generation method in the above-mentioned embodiment; its specific processing is the same as that of the flight decision generation method in the above-mentioned embodiment and will not be repeated here.
  • the flight decision generation device obtains task requirement data; constructs a flight decision model according to the task requirement data, wherein the flight decision model includes multiple decision tuple data used to characterize the original flight decision of the UAV, each decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form the original hyperparameter population; constructs a corresponding target learning function based on the flight decision model; updates and optimizes the original hyperparameter population according to the target learning function to obtain the target hyperparameter population; and obtains the target flight decision of the UAV according to the target hyperparameter population.
  • the embodiments of the present disclosure define a target learning function for optimization on the basis of establishing the flight decision-making model of the UAV; on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, which improves the flexibility of the autonomous navigation task of the UAV.
  • An embodiment of the present disclosure also provides a computer device, including:
  • at least one processor; and
  • a memory storing instructions executable by the at least one processor, the instructions, when executed by the at least one processor, causing the at least one processor to implement the method according to any one of the embodiments of the first aspect of the present application.
  • the computer device includes: a processor 1110 , a memory 1120 , an input/output interface 1130 , a communication interface 1140 and a bus 1150 .
  • the processor 1110 can be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to realize the technical solutions provided by the embodiments of the present disclosure;
  • the memory 1120 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 1120 can store operating systems and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1120 and called by the processor 1110 to execute the flight decision generation method of the embodiments of the present disclosure;
  • the input/output interface 1130 is used to realize information input and output;
  • the communication interface 1140 is used to realize the communication interaction between the device and other devices, and the communication can be realized through a wired method (such as USB, network cable, etc.), or can be realized through a wireless method (such as a mobile network, WIFI, Bluetooth, etc.); and
  • bus 1150 to transfer information between various components of the device (eg, processor 1110, memory 1120, input/output interface 1130, and communication interface 1140);
  • the processor 1110 , the memory 1120 , the input/output interface 1130 and the communication interface 1140 are connected to each other within the device through the bus 1150 .
  • the embodiment of the present disclosure also provides a storage medium, the storage medium is a computer-readable storage medium, and the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to cause the computer to execute the flight of the embodiment of the present disclosure. Decision generation method.
  • memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the flight decision generation method and device, computer equipment, and storage medium proposed by the embodiments of the present disclosure acquire mission requirement data; construct a flight decision model according to the mission requirement data, wherein the flight decision model includes multiple decision tuple data used to characterize the original flight decision of the UAV, each decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form the original hyperparameter population; construct a corresponding target learning function based on the flight decision model; update and optimize the original hyperparameter population according to the target learning function to obtain the target hyperparameter population; and obtain the target flight decision of the UAV according to the target hyperparameter population.
  • the embodiments of the present disclosure define a target learning function for optimization on the basis of establishing the flight decision-making model of the UAV; on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, which improves the flexibility of the autonomous navigation task of the UAV.
  • FIG. 1, FIG. 4, FIG. 5, FIG. 6, FIG. 7, and FIG. 8 do not limit the embodiments of the present disclosure, which may include more or fewer steps than shown, combine certain steps, or use different steps.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • "At least one (item)" means one or more, and "multiple" means two or more.
  • "And/or" is used to describe the association relationship of associated objects, indicating that three kinds of relationships can exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • the character "/" generally indicates that the contextual objects are in an "or" relationship.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
  • for example, at least one (item) of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes multiple instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include media that can store programs, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical discs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Traffic Control Systems (AREA)

Abstract

Embodiments of the present disclosure provide a flight decision generation method and apparatus, a computer device, and a storage medium, belonging to the technical field of unmanned aerial vehicles (UAVs). The method includes: acquiring task requirement data; constructing a flight decision model according to the data, wherein the flight decision model includes a plurality of decision tuple data, each decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form an original hyperparameter population; constructing a corresponding target learning function based on the flight decision model; updating and optimizing the original hyperparameter population according to the target learning function to obtain a target hyperparameter population; and obtaining a target flight decision of the UAV according to the target hyperparameter population. On the basis of establishing the flight decision model, the present application defines a target learning function so that, on the basis of completing the autonomous navigation task, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, improving the flexibility of UAV autonomous navigation tasks.

Description

Flight decision generation method and apparatus, computer device, and storage medium
Technical Field
The present application relates to the technical field of unmanned aerial vehicles (UAVs), and in particular to a flight decision generation method and apparatus, a computer device, and a storage medium.
Background
At present, the purpose of UAV autonomous navigation is that the UAV, through autonomous decision-making, avoids obstacles in the environment and flies safely from a starting point to an end point. In general, UAV autonomous navigation is modeled as a flight decision-making process, and an optimal control strategy is trained to complete the UAV autonomous navigation task. However, in practical applications, more objectives often need to be considered for the autonomous navigation task of a UAV; the existing flight decision-making process cannot effectively solve multi-objective UAV autonomous navigation tasks, and its flexibility is poor.
Summary
The main purpose of the embodiments of the present disclosure is to propose a flight decision generation method and apparatus, a computer device, and a storage medium, which can improve the flexibility of UAV autonomous navigation tasks by generating flight decisions.
To achieve the above purpose, a first aspect of the embodiments of the present disclosure proposes a flight decision generation method, including:
acquiring task requirement data;
constructing a flight decision model according to the task requirement data, wherein the flight decision model includes a plurality of decision tuple data used to characterize an original flight decision of a UAV, each of the decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form an original hyperparameter population;
constructing a corresponding target learning function based on the flight decision model;
updating and optimizing the original hyperparameter population according to the target learning function to obtain a target hyperparameter population;
obtaining a target flight decision of the UAV according to the target hyperparameter population.
In some embodiments, each of the decision tuple data includes state data, action data, a state transition function, a reward function, and a discount factor;
the state data includes, at a certain moment, the first distance between the UAV and an obstacle, the maximum detection range corresponding to the sensors of the UAV, the second distance and angle data between the current position of the UAV and a preset end point, the current flight speed and viewing direction of the UAV, and the maximum speed limit of the UAV;
the action data includes acceleration and deceleration data and steering data of the UAV at a certain moment;
the state transition function is used to generate the state data of the UAV at the next moment;
the reward function includes the original hyperparameter group, and the reward function is used to evaluate the preliminary quality of an action performed by the UAV in a certain state;
the discount factor is used in combination with the reward function to calculate the quality of an action performed by the UAV in a certain state.
In some embodiments, the target learning function includes a first objective function and a second objective function, and constructing the corresponding target learning function according to the flight decision model includes:
obtaining optimization objectives according to the flight decision model, wherein the optimization objectives include a flight time objective and a flight risk objective of the UAV;
constructing the first objective function according to the flight time objective, and constructing the second objective function according to the flight risk objective.
In some embodiments, updating and optimizing the original hyperparameter population according to the target learning function to obtain the target hyperparameter population includes:
performing reinforcement learning training on each original hyperparameter group in the original hyperparameter population through the flight decision model to obtain multiple corresponding original control strategies;
performing a mutation operation on each original hyperparameter group to generate multiple child hyperparameter groups;
performing reinforcement learning training on each child hyperparameter group through the flight decision model to obtain multiple corresponding child control strategies;
obtaining the target hyperparameter population through the multiple original control strategies and the multiple child control strategies.
In some embodiments, obtaining the target hyperparameter population through the multiple original control strategies and the multiple child control strategies includes:
using each original control strategy to control the UAV to interact with the environment, so as to calculate a first objective function value of each original hyperparameter group according to the target learning function;
using each child control strategy to control the UAV to interact with the environment, so as to calculate a second objective function value of each child hyperparameter group according to the target learning function;
adding the multiple child hyperparameter groups to the original hyperparameter population to form an updated original hyperparameter population;
performing an environment selection operation on the updated original hyperparameter population according to the first objective function values and the second objective function values to obtain the target hyperparameter population.
In some embodiments, performing the environment selection operation on the updated original hyperparameter population according to the first objective function values and the second objective function values to obtain the target hyperparameter population includes:
dividing the updated original hyperparameter population to obtain multiple preliminary hyperparameter populations, wherein each preliminary hyperparameter group is an original hyperparameter group or a child hyperparameter group, each preliminary hyperparameter population includes multiple preliminary hyperparameter groups, and the preliminary hyperparameter groups within each preliminary hyperparameter population do not Pareto-dominate one another;
performing non-dominated sorting on the multiple preliminary hyperparameter populations according to the magnitudes of the first objective function values and the second objective function values to obtain an environment selection order;
performing the environment selection operation on each preliminary hyperparameter population according to the environment selection order to obtain the target hyperparameter population.
In some embodiments, performing the environment selection operation on each preliminary hyperparameter population according to the environment selection order to obtain the target hyperparameter population includes:
acquiring a preset individual number threshold;
sequentially selecting, according to the environment selection order, multiple preliminary hyperparameter groups satisfying the individual number threshold from the multiple preliminary hyperparameter populations, so as to form the target hyperparameter population.
A second aspect of the embodiments of the present disclosure proposes a flight decision generation apparatus, including:
a data acquisition module: used to acquire task requirement data;
a model construction module: used to construct a flight decision model according to the task requirement data, wherein the flight decision model includes a plurality of decision tuple data used to characterize an original flight decision of a UAV, each of the decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form an original hyperparameter population;
a function construction module: used to construct a corresponding target learning function based on the flight decision model;
a population optimization module: used to update and optimize the original hyperparameter population according to the target learning function to obtain a target hyperparameter population;
a decision acquisition module: used to obtain a target flight decision of the UAV according to the target hyperparameter population.
A third aspect of the embodiments of the present disclosure proposes a computer device, which includes a memory and a processor, wherein the memory stores a program, and when the program is executed by the processor, the processor is configured to execute the method according to any one of the embodiments of the first aspect of the present application.
A fourth aspect of the embodiments of the present disclosure proposes a storage medium, which is a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the method according to any one of the embodiments of the first aspect of the present application.
According to the flight decision generation method and apparatus, the computer device, and the storage medium proposed by the embodiments of the present disclosure, task requirement data is acquired; a flight decision model is constructed according to the task requirement data, wherein the flight decision model includes a plurality of decision tuple data used to characterize an original flight decision of a UAV, each decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form an original hyperparameter population; a corresponding target learning function is constructed based on the flight decision model; the original hyperparameter population is updated and optimized according to the target learning function to obtain a target hyperparameter population; and a target flight decision of the UAV is obtained according to the target hyperparameter population. On the basis of establishing the flight decision model of the UAV, the embodiments of the present disclosure define a target learning function for optimization so that, on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, which improves the flexibility of UAV autonomous navigation tasks.
Brief Description of the Drawings
Fig. 1 is a flowchart of a flight decision generation method provided by an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a first state of a UAV provided by an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of a second state of the UAV provided by an embodiment of the present disclosure;
Fig. 4 is a flowchart of step S130 in Fig. 1;
Fig. 5 is a flowchart of step S140 in Fig. 1;
Fig. 6 is a flowchart of step S540 in Fig. 5;
Fig. 7 is a flowchart of step S640 in Fig. 6;
Fig. 8 is a flowchart of step S730 in Fig. 7;
Fig. 9 is a flowchart of a multi-objective deep reinforcement learning algorithm provided by an embodiment of the present disclosure;
Fig. 10 is a block diagram of the module structure of a flight decision generation apparatus provided by an embodiment of the present disclosure;
Fig. 11 is a schematic diagram of the hardware structure of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the purpose, technical solutions, and advantages of the present invention clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the device schematic diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed with a module division different from that in the devices, or in an order different from that in the flowcharts. The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present application. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.
In addition, the described features, structures, or characteristics may be combined in one or more embodiments in any suitable manner. In the following description, many specific details are provided to give a full understanding of the embodiments of the present disclosure. However, those skilled in the art will realize that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or other methods, components, devices, steps, and the like may be adopted. In other cases, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the present disclosure.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically independent entities; that is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the drawings are merely exemplary illustrations; they do not necessarily include all contents and operations/steps, nor must they be executed in the order described. For example, some operations/steps may be decomposed while others may be merged or partially merged, so the actual execution order may change according to the actual situation.
First, several terms involved in the present application are explained:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of human consciousness and thinking; it is also the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Markov decision process (MDP): a mathematical model of sequential decision-making, used to simulate the stochastic policies and returns achievable by an agent in an environment where the system state has the Markov property. An MDP is constructed from a set of interacting objects, namely the agent and the environment, and its elements include states, actions, policies, and rewards. In an MDP simulation, the agent perceives the current system state and acts on the environment according to its policy, thereby changing the state of the environment and receiving rewards; the accumulation of rewards over time is called the return.
Deep reinforcement learning: combines the perception ability of deep learning with the decision-making ability of reinforcement learning, can control directly from input images, and is an artificial intelligence method closer to the human way of thinking. Deep learning has strong perception ability but lacks decision-making ability, while reinforcement learning has decision-making ability but is helpless with perception problems. Combining the two, with complementary advantages, provides a solution to the perception and decision-making problems of complex systems.
Asynchronous Advantage Actor-Critic (A3C): the basic framework of A3C is the actor-critic (AC) framework, except that it uses multiple threads instead of a single one. Each thread is equivalent to an agent exploring randomly; multiple agents explore jointly, compute policy gradients in parallel, and maintain a total update amount.
White Gaussian noise: "Gaussian" means that the probability distribution is a normal function, while "white noise" means that its second-order moments are uncorrelated and its first-order moment is constant, referring to the correlation of successive signals in time. Gaussian white noise is an ideal model for analyzing additive channel noise; thermal noise, the main noise source in communication, belongs to this kind of noise.
Euclidean distance: the "ordinary" (i.e., straight-line) distance between two points in Euclidean space; with this distance, Euclidean space becomes a metric space. It is a commonly used definition of distance, referring to the real distance between two points in m-dimensional space, or the natural length of a vector (i.e., the distance from the point to the origin); in two- and three-dimensional space, the Euclidean distance is the actual distance between two points.
At present, the purpose of UAV autonomous navigation is that the UAV, through autonomous decision-making, avoids obstacles in the environment and flies safely from the starting point to the end point. In general, UAV autonomous navigation is modeled as a flight decision-making process, and an optimal control strategy is trained to complete the UAV autonomous navigation task. However, in practical applications, more objectives often need to be considered for the autonomous navigation task of a UAV. For example, users hope that, on the basis of completing the autonomous navigation task, the flight time of the UAV is as short as possible and the flight risk is as low as possible. In practice, however, it is often difficult to satisfy all objectives at the same time: pursuing a short flight time too aggressively tends to bring higher flight risk, while pursuing low flight risk too aggressively tends to bring longer flight time. The existing technology cannot effectively solve this complex multi-objective UAV autonomous navigation task, resulting in poor flexibility.
Based on this, embodiments of the present disclosure provide a flight decision generation method and apparatus, a computer device, and a storage medium: task requirement data is acquired; a flight decision model is constructed according to the task requirement data, wherein the flight decision model includes a plurality of decision tuple data used to characterize an original flight decision of a UAV, each decision tuple data includes an original hyperparameter group, and the original hyperparameter groups are assembled to form an original hyperparameter population; a corresponding target learning function is constructed based on the flight decision model; the original hyperparameter population is updated and optimized according to the target learning function to obtain a target hyperparameter population; and a target flight decision of the UAV is obtained according to the target hyperparameter population. On the basis of establishing the flight decision model of the UAV, the embodiments of the present disclosure define a target learning function for optimization so that, on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, improving the flexibility of UAV autonomous navigation tasks.
Embodiments of the present disclosure provide a flight decision generation method and apparatus, a computer device, and a storage medium, which are specifically described through the following embodiments; the flight decision generation method in the embodiments of the present disclosure is described first.
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Basic AI technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include several major directions such as computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
The flight decision generation method provided by the embodiments of the present disclosure relates to the technical field of UAVs and likewise to the field of artificial intelligence. The method may be applied in a terminal, in a server, or in software running on a terminal or server. In some embodiments, the terminal may be a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart watch, or the like; the server may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and AI platforms; the software may be an application implementing the flight decision generation method, but is not limited to the above forms.
The embodiments of the present disclosure can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and the like. The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media including storage devices.
Referring to FIG. 1, the flight decision generation method according to the embodiment of the first aspect of the embodiments of the present disclosure includes, but is not limited to, steps S110 to S150.
Step S110: acquiring task requirement data;
Step S120: constructing a flight decision model according to the task requirement data;
Step S130: constructing a corresponding target learning function based on the flight decision model;
Step S140: updating and optimizing the original hyperparameter population according to the target learning function to obtain a target hyperparameter population;
Step S150: acquiring a target flight decision of the UAV according to the target hyperparameter population.
In step S110 of some embodiments, task requirement data of the UAV is acquired. Specifically, the task requirement data may include the task scenario of the UAV and the specific flight requirements of the UAV in that task scenario. In practical applications, the task scenario may be that the UAV flies at a fixed altitude, departs from a start point, avoids all obstacles through autonomous decision-making, and arrives safely at an end point.
In step S120 of some embodiments, a flight decision model is constructed according to the task requirement data, where the flight decision model is generated according to a Markov decision process; the goal of the flight decision model is to find an optimal preliminary flight control policy that maximizes the cumulative reward of the UAV autonomous navigation task. The flight decision model includes a plurality of decision tuple data used to characterize original flight decisions of the UAV; each piece of decision tuple data includes an original hyperparameter set, and the original hyperparameter sets are assembled to form an original hyperparameter population.
In step S130 of some embodiments, a corresponding target learning function is constructed based on the flight decision model. Specifically, the target learning function is set through the flight decision model, and multiple optimization objectives of the UAV autonomous navigation task are defined through the target learning function.
In step S140 of some embodiments, the original hyperparameter population is updated and optimized according to the target learning function to obtain the target hyperparameter population, so that the UAV, on the basis of safely completing the autonomous navigation task, also considers minimizing the multiple optimization objectives, thereby further optimizing the original hyperparameter population.
In step S150 of some embodiments, the target flight decision of the UAV is acquired according to the target hyperparameter population. Specifically, a set of optimal flight policies can be obtained through the target hyperparameter population, and the user can select one target flight policy from this policy set for execution according to task requirements and preferences, which can improve the flexibility of UAV autonomous navigation tasks.
In some embodiments, each piece of decision tuple data includes state data, action data, a state transition function, a reward function, and a discount factor. Specifically, the flight decision model of the present application, i.e., the Markov decision process, can be represented by a decision tuple (S, A, P, r, γ, ρ_0), where the various data in the decision tuple are called decision tuple data: S denotes the state data, A denotes the action data, P denotes the state transition function, r denotes the reward function, γ denotes the discount factor, and ρ_0 denotes the distribution of the initial state s_0. The details are as follows:
In some embodiments, the state data is the state of the UAV at a certain time, e.g., time t. In the embodiments of the present application, the state data can be expressed as a 12-dimensional vector s_t (rendered as an equation image in the original), which comprises the eight sensor-to-obstacle distances, the distance and angle to the preset end point, and the current flight speed and view direction. Here, d_t^i ∈ [0, d_limit] (i = 1, ..., 8) denotes the first distance to an obstacle detected by the UAV's own sensors, and d_limit denotes the maximum detection range of the sensors; the second distance between the UAV's current position at time t and the preset end point, together with the angle data ξ_t ∈ [-π, π] (relative to due north), locates the end point; v_t ∈ [0, v_limit] and φ_t ∈ [-π, π] denote the UAV's current flight speed and view direction at time t, i.e., the first view direction (relative to due north), where v_limit denotes the UAV's maximum speed limit.
Specifically, for the UAV state mentioned in the embodiments of the present application, reference may be made to FIG. 2 and FIG. 3. FIG. 2 is a plan view of the UAV state at a fixed altitude, in which the eight directions represent the eight sensors on the UAV that can detect the distances d_t^i to the surrounding obstacles, so the UAV can sense a total of eight directions around it. FIG. 3 shows the relationship between the UAV's first view direction and the end point. As can be seen from FIG. 2 and FIG. 3, the UAV of the embodiments of the present disclosure can perceive obstacles in all directions and obtain the first distances between the sensors and the obstacles. However, if the UAV cannot perceive obstacles in some direction, it may collide with an obstacle, resulting in lower safety. It should be noted that those skilled in the art can arrange a different number of sensors on the UAV according to actual requirements, which is not specifically limited in the embodiments of the present disclosure.
In some embodiments, the action data a_t includes the UAV's acceleration/deceleration data ρ_t ∈ [ρ_lb, ρ_ub] and steering data φ_t ∈ [-π, π] at a certain time, e.g., time t. In summary, the action data can be expressed as a_t = [ρ_t, φ_t].
In some embodiments, the state transition function is used to generate the state data of the UAV at the next time. Specifically, at time t, the UAV obtains its state s_t, the output action a_t is obtained according to the UAV's control policy a ~ π(·|s), and the environment performs the corresponding environment state transition according to the state transition function, yielding the next-time state s_{t+1}.
In some embodiments, the reward function includes the original hyperparameter set and is used to evaluate the preliminary quality of an action performed by the UAV in a certain state. Specifically, the reward function can be designed around the following cases, for example:
Case 1: the UAV is rewarded when the second distance to the preset end point decreases, and penalized otherwise; the corresponding term r_d(s_t, a_t) is defined by an expression rendered as an equation image in the original.
Case 2: the UAV receives a corresponding penalty r_o(s_t, a_t) when it comes too close to an obstacle; the expression is rendered as equation images in the original, where d_s denotes the safety distance between the UAV and obstacles.
Case 3: the UAV receives a reward r_s(s_t, a_t) when it is very close to the end point; the expression is rendered as an equation image in the original, where r_d is a preset constant.
Case 4: the UAV receives a constant penalty r_c(s_t, a_t) = -1, which encourages it to reach the end point as soon as possible.
Therefore, the overall reward function can be defined as r(s_t, a_t) = β_d·r_d(s_t, a_t) + β_o·r_o(s_t, a_t) + β_s·r_s(s_t, a_t) + β_c·r_c(s_t, a_t), where β_d, β_o, β_s, β_c are four hyperparameters taking values within a preset range (rendered as an equation image in the original); the four hyperparameters together form one original hyperparameter set.
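For illustration only, the following is a minimal Python sketch of the weighted reward described above. The per-term expressions r_d, r_o, and r_s are rendered as equation images in the original publication, so the term bodies below (the progress term, the safety-margin penalty using d_s, and the near-goal bonus) are plausible stand-ins rather than the disclosed formulas; only the weighting by β_d, β_o, β_s, β_c follows the definition given above.

```python
def reward(dist_to_goal_prev, dist_to_goal, min_obstacle_dist,
           near_goal, beta, d_s=1.0, r_near=10.0):
    """Weighted reward r = beta_d*r_d + beta_o*r_o + beta_s*r_s + beta_c*r_c.

    The individual terms are illustrative stand-ins; the exact
    expressions are equation images in the original publication."""
    beta_d, beta_o, beta_s, beta_c = beta
    r_d = dist_to_goal_prev - dist_to_goal   # reward progress toward the end point (Case 1)
    r_o = min(0.0, min_obstacle_dist - d_s)  # penalize entering the safety margin d_s (Case 2)
    r_s = r_near if near_goal else 0.0       # bonus when very close to the end point (Case 3)
    r_c = -1.0                               # constant step penalty (Case 4)
    return beta_d * r_d + beta_o * r_o + beta_s * r_s + beta_c * r_c
```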
In some embodiments, the discount factor is used in combination with the reward function to compute the quality of an action performed by the UAV in a certain state; in practical applications, the discount factor satisfies γ ∈ [0, 1].
Further, the cumulative reward of the UAV autonomous navigation task is defined as the discounted sum of the rewards over the trajectory (the precise expression is rendered as equation images in the original; it takes the standard form of the expectation of Σ_t γ^t·r(s_t, a_t)).
It should be noted that the goal of the above Markov decision process is to find an optimal control policy a ~ π(·|s) that maximizes the cumulative reward of the UAV autonomous navigation task.
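As an illustrative sketch only, the cumulative reward of one episode can be computed as a discounted sum; since the exact expression is rendered as equation images in the original, the following assumes the standard form Σ_t γ^t·r(s_t, a_t).

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative reward of one episode: sum_t gamma^t * r_t,
    accumulated backwards for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```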
In some embodiments, the target learning function includes a first objective function and a second objective function. As shown in FIG. 4, step S130 specifically includes, but is not limited to, steps S410 to S420.
Step S410: acquiring optimization objectives according to the flight decision model;
Step S420: constructing the first objective function according to the flight time objective, and constructing the second objective function according to the flight risk objective.
In step S410 of some embodiments, the optimization objectives are acquired according to the flight decision model; specifically, the optimization objectives include the UAV's flight time objective and flight risk objective.
In step S420 of some embodiments, the first objective function is constructed according to the flight time objective: letting the flight time objective be f_1, the first objective function can be defined as f_1 = T, where T is the task completion time; the shorter the flight time, the better. The second objective function is constructed according to the flight risk objective: letting the flight risk objective be f_2, the second objective function can be defined as f_2 = 1/d_min, where d_min is the closest distance between the UAV and obstacles during the task, i.e., the minimum of the sensor-to-obstacle distances d_t^i over all times t and sensors i (the expression is rendered as an equation image in the original); the lower the flight risk, the better.
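For illustration, the two objectives defined above (f_1 = T and f_2 = 1/d_min, both to be minimized) can be sketched in Python as follows, assuming the completion time and the minimum obstacle distance have already been recorded for an episode:

```python
def objectives(completion_time, min_obstacle_dist):
    """Return (f1, f2): f1 = T (task completion time),
    f2 = 1/d_min (flight risk), both to be minimized."""
    return completion_time, 1.0 / min_obstacle_dist
```

For example, objectives(T, d_min) yields the pair (f_1, f_2) that the environmental selection described below compares between individuals.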
In some embodiments, as shown in FIG. 5, step S140 specifically includes, but is not limited to, steps S510 to S540.
Step S510: performing reinforcement learning training on each original hyperparameter set in the original hyperparameter population through the flight decision model to obtain a plurality of corresponding original control policies;
Step S520: performing a mutation operation on each original hyperparameter set to generate a plurality of offspring hyperparameter sets;
Step S530: performing reinforcement learning training on each offspring hyperparameter set through the flight decision model to obtain a plurality of corresponding offspring control policies;
Step S540: obtaining the target hyperparameter population through the plurality of original control policies and the plurality of offspring control policies.
For ease of description, before steps S510 to S540, an operation of initializing the original hyperparameter population is first performed, where the original hyperparameter population is set as P = {β_1, ..., β_n}, and each individual in the population is composed of one original hyperparameter set as mentioned above (its components are rendered as equation images in the original). The learning goal of the embodiments of the present disclosure is to find a set of Pareto-optimal individuals, each representing one set of optimal hyperparameters, i.e., to realize the process of automatically optimizing the hyperparameters.
Further, an iteration counter G = 1 also needs to be initialized to count the number of times step S140 has been executed; each time step S140 is executed, G is incremented by 1, until G = G_max, which controls step S140 to be executed a fixed number of times.
In step S510 of some embodiments, reinforcement learning training is performed on each original hyperparameter set in the original hyperparameter population through the flight decision model to obtain the plurality of corresponding original control policies. Specifically, a deep reinforcement learning algorithm is applied to train each individual in the population P, obtaining the corresponding optimal control policies {π_1, ..., π_n}, i.e., the plurality of original control policies mentioned in the embodiments of the present disclosure. In practical applications, the A3C algorithm can be used to train each individual. Before each offspring individual is trained, it first inherits the policy of its parent individual and is trained on the basis of that parent policy, thereby accelerating the offspring's training process. It should be noted that offspring individuals are produced by mutating parent individuals: one parent individual produces its corresponding offspring through mutation, so each offspring has its unique corresponding parent individual.
In step S520 of some embodiments, a mutation operation is performed on each original hyperparameter set to generate the plurality of offspring hyperparameter sets. Specifically, a mutation operation is performed on each individual in the population P to produce the offspring population Q = mutation(P), expressed as Q = {β_1′, ..., β_n′}, where the offspring population is composed of the offspring hyperparameter sets mentioned in the embodiments of the present disclosure. Specifically, a white Gaussian noise mutation operation is performed on each parent individual β to produce an offspring individual β′, i.e., β′ = β + Δβ, where Δβ is white Gaussian noise sampled from the standard normal distribution N(0, I).
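For illustration only, a minimal Python sketch of this mutation operation, assuming each individual is stored as a NumPy array of the four hyperparameters; any range constraint on the hyperparameters (rendered as an equation image in the original) would additionally be enforced here, e.g., by clipping:

```python
import numpy as np

def mutate(population, rng=None):
    """Offspring Q = mutation(P): each parent beta receives white
    Gaussian noise, beta' = beta + delta_beta, delta_beta ~ N(0, I)."""
    rng = rng or np.random.default_rng()
    return [beta + rng.standard_normal(beta.shape) for beta in population]
```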
In step S530 of some embodiments, similarly, reinforcement learning training is performed on each offspring hyperparameter set through the flight decision model to obtain the plurality of corresponding offspring control policies, i.e., {π_1′, ..., π_n′}.
In step S540 of some embodiments, the target hyperparameter population, i.e., the optimal policy set, is obtained through the plurality of original control policies and the plurality of offspring control policies.
In some embodiments, as shown in FIG. 6, step S540 specifically includes, but is not limited to, steps S610 to S640.
Step S610: controlling the UAV to interact with the environment using each original control policy, so as to compute a first objective function value of each original hyperparameter set according to the target learning function;
Step S620: controlling the UAV to interact with the environment using each offspring control policy, so as to compute a second objective function value of each offspring hyperparameter set according to the target learning function;
Step S630: adding the plurality of offspring hyperparameter sets to the original hyperparameter population to form an updated original hyperparameter population;
Step S640: performing an environmental selection operation on the updated original hyperparameter population according to the first objective function values and the second objective function values to obtain the target hyperparameter population.
In step S610 of some embodiments, each original control policy is used to control the UAV to interact with the environment, so as to compute the first objective function value of each original hyperparameter set according to the target learning function. In other words, each original control policy controls the UAV to interact with the environment, and the target learning function evaluates the first objective function value of each individual in the original hyperparameter population P.
In step S620 of some embodiments, similarly, each offspring control policy is used to control the UAV to interact with the environment, so as to compute the second objective function value of each offspring hyperparameter set according to the target learning function. In other words, each offspring control policy controls the UAV to interact with the environment, and the second objective function value of each individual in the offspring hyperparameter population Q formed by the offspring hyperparameter sets is evaluated.
In practical applications, the process of evaluating the original hyperparameter sets and the offspring hyperparameter sets with the target learning function is as follows: the trained preliminary flight policy a ~ π(·|s) controls the UAV to interact with the environment until the UAV completes the task. After the UAV completes the task, its task completion time T and closest distance to obstacles d_min need to be recorded, yielding the first objective function value and the second objective function value (f_1, f_2) corresponding to that preliminary flight policy. A maximum completion time T_max is set; if the UAV cannot reach the end point within T_max, the task is judged to have failed, and the first and second objective function values of that preliminary flight policy are marked as positive infinity, i.e., (+∞, +∞). If the UAV hits an obstacle during the task (i.e., a sensor-to-obstacle distance falls to the collision condition, which is rendered as an equation image in the original), the task is likewise judged to have failed, and the policy's objective function values are marked as (+∞, +∞).
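For illustration only, a minimal Python sketch of this evaluation procedure. The environment interface (reset/step returning a done flag, a collision flag, and the current obstacle distance) is an assumption of the sketch and is not part of the original disclosure; time is counted in steps as a proxy for T.

```python
import math

def evaluate_policy(policy, env, t_max):
    """Roll out one episode and return (f1, f2); timeouts and
    collisions are marked (+inf, +inf) as described above."""
    state = env.reset()
    d_min, t = math.inf, 0
    while t < t_max:
        action = policy(state)
        state, done, collided, obstacle_dist = env.step(action)
        d_min = min(d_min, obstacle_dist)
        t += 1
        if collided:
            return math.inf, math.inf   # hit an obstacle: task failed
        if done:
            return t, 1.0 / d_min       # (f1, f2) for a successful run
    return math.inf, math.inf           # exceeded T_max: task failed
```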
In step S630 of some embodiments, the plurality of offspring hyperparameter sets are added to the original hyperparameter population to form the updated original hyperparameter population, i.e., P = P ∪ Q.
In step S640 of some embodiments, the environmental selection operation is performed on the updated original hyperparameter population according to the first objective function values and the second objective function values to obtain the target hyperparameter population.
In some embodiments, as shown in FIG. 7, step S640 specifically includes, but is not limited to, steps S710 to S730.
Step S710: dividing the updated original hyperparameter population to obtain a plurality of preliminary hyperparameter populations;
Step S720: performing non-dominated sorting on the plurality of preliminary hyperparameter populations according to the magnitudes of the first objective function values and the second objective function values to obtain an environmental selection order;
Step S730: performing the environmental selection operation on each preliminary hyperparameter population according to the environmental selection order to obtain the target hyperparameter population.
In steps S710 and S720 of some embodiments, the updated original hyperparameter population is divided to obtain the plurality of preliminary hyperparameter populations, where each preliminary hyperparameter set is an original hyperparameter set or an offspring hyperparameter set, each preliminary hyperparameter population includes a plurality of preliminary hyperparameter sets, and the preliminary hyperparameter sets within one preliminary hyperparameter population do not Pareto-dominate one another. Non-dominated sorting is performed on the plurality of preliminary hyperparameter populations according to the magnitudes of the first and second objective function values to obtain the environmental selection order. Specifically, the population P containing 2n individuals is divided into l subpopulations {P_1, P_2, ..., P_l}, with P = P_1 ∪ P_2 ∪ ... ∪ P_l (the expression is rendered as an equation image in the original). The individuals within each subpopulation do not Pareto-dominate one another (i.e., there is no individual whose f_1 and f_2 objective function values are both smaller than another individual's), and P_1 Pareto-dominates {P_2, ..., P_l}, P_2 Pareto-dominates {P_3, ..., P_l}, and so on, where "P_1 Pareto-dominates P_2" means that every individual in P_2 is Pareto-dominated by some individual in P_1 (i.e., the dominated individual's f_1 and f_2 objective function values are both larger than those of the individual dominating it).
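For illustration only, a minimal Python sketch of the Pareto dominance test and the division into non-dominated fronts described above, for objective pairs (f_1, f_2) that are both minimized:

```python
def dominates(a, b):
    """a Pareto-dominates b if a is no worse in every objective and
    strictly better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(objs):
    """Split individual indices into fronts P_1, P_2, ..., P_l: within a
    front no individual dominates another, and every individual of a
    later front is dominated by some individual of an earlier front."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts
```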
In step S730 of some embodiments, the environmental selection operation is performed on each preliminary hyperparameter population according to the environmental selection order to obtain the target hyperparameter population.
In some embodiments, as shown in FIG. 8, step S730 specifically includes, but is not limited to, steps S810 to S820.
Step S810: obtaining a preset individual quantity threshold;
Step S820: according to the environmental selection order, sequentially selecting, from the plurality of preliminary hyperparameter populations, a plurality of preliminary hyperparameter sets satisfying the individual quantity threshold, so as to form the target hyperparameter population.
In step S810 of some embodiments, the preset individual quantity threshold n is obtained.
In step S820 of some embodiments, environmental selection is performed sequentially according to the environmental selection order, i.e., the order of the subpopulations {P_1, P_2, ..., P_l} obtained by the division in steps S710 and S720, and, once n preliminary hyperparameter sets would be exceeded, a further selection is performed within the last subpopulation considered, so as to form the target hyperparameter population.
Specifically, in steps S810 to S820, the environmental selection operation is performed on the subpopulations according to the environmental selection order as follows:
Step 1: start environmental selection from P_1. If the number of individuals in P_1 is greater than n, go to the next step; otherwise continue by selecting P_2. If the number of individuals in {P_1, P_2} is greater than n, go to the next step; otherwise continue by selecting P_3. Proceed in this manner until some P_k (k ≤ l) is selected such that the number of individuals in {P_1, P_2, ..., P_k} is greater than n, then go to the next step.
Step 2: suppose the number of individuals in {P_1, P_2, ..., P_k} is n′; then n′ − n individuals need to be deleted from P_k. First, compute the Euclidean distances between the individuals of P_k in the objective space; then find the two closest individuals in P_k and randomly delete one of them. This step is repeated n′ − n times, i.e., n′ − n individuals are deleted from P_k.
Step 3: output P = {P_1, P_2, ..., P_k} as the next-generation population.
In this way, a plurality of preliminary hyperparameter sets satisfying the individual quantity threshold are selected sequentially from the plurality of preliminary hyperparameter populations to form the target hyperparameter population.
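For illustration only, a minimal Python sketch of Steps 1 to 3, reusing the non_dominated_sort output above; objs maps each individual index to its (f_1, f_2) pair, and the closest pair in objective space is found by Euclidean distance:

```python
import itertools, math, random

def environmental_select(fronts, objs, n):
    """Take whole fronts while they fit within n, then truncate the next
    front by repeatedly finding the two individuals closest in objective
    space and randomly deleting one of them (Steps 1-3 above)."""
    selected, k = [], 0
    while k < len(fronts) and len(selected) + len(fronts[k]) <= n:
        selected.extend(fronts[k])
        k += 1
    if len(selected) < n and k < len(fronts):
        last = list(fronts[k])
        while len(selected) + len(last) > n:
            i, j = min(itertools.combinations(range(len(last)), 2),
                       key=lambda p: math.dist(objs[last[p[0]]], objs[last[p[1]]]))
            last.pop(random.choice((i, j)))  # randomly drop one of the closest pair
        selected.extend(last)
    return selected
```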
In some embodiments, as shown in FIG. 9, the specific process of the evolutionary multi-objective deep reinforcement learning algorithm of the embodiments of the present disclosure is as follows:
First, the population is initialized, where the population includes a plurality of individuals. Mutation, deep reinforcement learning, objective function value evaluation, and other operations are performed on each individual in turn, and whether to exit is decided according to the result of the environmental selection operation. If the process does not exit, the same operations of mutation, deep reinforcement learning, and objective function value evaluation are performed in turn on the individuals of the next-generation population generated by the environmental selection operation, until the preset condition is met and the process exits, thereby ending the optimization process.
In practical applications, the environmental selection operation may proceed as follows: suppose there is a population P composed of 16 individuals in total, which is first divided into three subpopulations P_1, P_2, and P_3. The environmental selection operation needs to select 8 of the 16 individuals as the next-generation population. Selection starts from P_1, which contains 6 individuals, so all of P_1 is selected; then P_2 is selected, which also contains 6 individuals, so only 2 individuals can be selected from P_2 and merged with the 6 individuals of P_1 to form the next generation. The Euclidean distances between the individuals of P_2 are computed and one of the two closest individuals is deleted, until 2 individuals remain in P_2; P_1 and P_2 are then merged as the next-generation population, thereby completing the environmental selection operation.
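For illustration only, the overall loop of FIG. 9 can be sketched as follows, reusing the mutate, non_dominated_sort, and environmental_select sketches above. The train and evaluate callables stand in for the A3C training (with parent-policy inheritance) and the episode evaluation described earlier; their interfaces are assumptions of the sketch, not part of the original disclosure.

```python
def evolve(init_population, train, evaluate, n, g_max):
    """Skeleton of the evolutionary multi-objective deep RL loop:
    mutate, train offspring from parent policies, evaluate (f1, f2),
    merge P = P ∪ Q, and environmentally select n survivors, g_max times."""
    population = list(init_population)
    policies = [train(beta, None) for beta in population]   # no parent for the initial generation
    objs = [evaluate(pi) for pi in policies]
    for _ in range(g_max):
        offspring = mutate(population)
        child_policies = [train(b, p) for b, p in zip(offspring, policies)]  # inherit parent policy
        child_objs = [evaluate(pi) for pi in child_policies]
        population += offspring          # P = P ∪ Q
        policies += child_policies
        objs += child_objs
        keep = environmental_select(non_dominated_sort(objs), objs, n)
        population = [population[i] for i in keep]
        policies = [policies[i] for i in keep]
        objs = [objs[i] for i in keep]
    return population, policies, objs
```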
In the flight decision generation method proposed by the embodiments of the present disclosure, task requirement data is acquired; a flight decision model is constructed according to the task requirement data, where the flight decision model includes a plurality of decision tuple data used to characterize original flight decisions of the UAV, each piece of decision tuple data includes an original hyperparameter set, and the original hyperparameter sets are assembled to form an original hyperparameter population; a corresponding target learning function is constructed based on the flight decision model; the original hyperparameter population is updated and optimized according to the target learning function to obtain a target hyperparameter population; and a target flight decision of the UAV is acquired according to the target hyperparameter population. On the basis of establishing the flight decision model of the UAV, the embodiments of the present disclosure define a target learning function for optimization; on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, which improves the flexibility of UAV autonomous navigation tasks.
An embodiment of the present disclosure further provides a flight decision generation apparatus, as shown in FIG. 10, which can implement the above flight decision generation method. The apparatus includes: a data acquisition module 1010, a model construction module 1020, a function construction module 1030, a population optimization module 1040, and a decision acquisition module 1050. The data acquisition module 1010 is configured to acquire task requirement data; the model construction module 1020 is configured to construct a flight decision model according to the task requirement data, where the flight decision model includes a plurality of decision tuple data used to characterize original flight decisions of the UAV, each piece of decision tuple data includes an original hyperparameter set, and the original hyperparameter sets are assembled to form an original hyperparameter population; the function construction module 1030 is configured to construct a corresponding target learning function based on the flight decision model; the population optimization module 1040 is configured to update and optimize the original hyperparameter population according to the target learning function to obtain a target hyperparameter population; and the decision acquisition module 1050 is configured to acquire a target flight decision of the UAV according to the target hyperparameter population.
It should be noted that the flight decision generation apparatus of the embodiments of the present disclosure is configured to perform the flight decision generation method in the above embodiments; its specific processing is the same as that of the flight decision generation method in the above embodiments and is not repeated here.
In the flight decision generation apparatus proposed by the embodiments of the present disclosure, task requirement data is acquired; a flight decision model is constructed according to the task requirement data, where the flight decision model includes a plurality of decision tuple data used to characterize original flight decisions of the UAV, each piece of decision tuple data includes an original hyperparameter set, and the original hyperparameter sets are assembled to form an original hyperparameter population; a corresponding target learning function is constructed based on the flight decision model; the original hyperparameter population is updated and optimized according to the target learning function to obtain a target hyperparameter population; and a target flight decision of the UAV is acquired according to the target hyperparameter population. On the basis of establishing the flight decision model of the UAV, the embodiments of the present disclosure define a target learning function for optimization; on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, which improves the flexibility of UAV autonomous navigation tasks.
An embodiment of the present disclosure further provides a computer device, including:
at least one processor, and
a memory communicatively connected to the at least one processor, where
the memory stores instructions that are executed by the at least one processor, so that when the at least one processor executes the instructions, the method according to any one of the embodiments of the first aspect of the present application is implemented.
The hardware structure of the computer device is described in detail below with reference to FIG. 11. The computer device includes: a processor 1110, a memory 1120, an input/output interface 1130, a communication interface 1140, and a bus 1150.
The processor 1110 may be implemented as a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided by the embodiments of the present disclosure.
The memory 1120 may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1120 may store an operating system and other application programs; when the technical solutions provided by the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1120 and is invoked by the processor 1110 to perform the flight decision generation method of the embodiments of the present disclosure.
The input/output interface 1130 is configured to implement information input and output.
The communication interface 1140 is configured to implement communication interaction between this device and other devices; communication may be implemented in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, WIFI, Bluetooth).
The bus 1150 transfers information between the components of the device (e.g., the processor 1110, the memory 1120, the input/output interface 1130, and the communication interface 1140).
The processor 1110, the memory 1120, the input/output interface 1130, and the communication interface 1140 implement communication connections with one another inside the device through the bus 1150.
An embodiment of the present disclosure further provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to perform the flight decision generation method of the embodiments of the present disclosure.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include a high-speed random access memory and a non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory optionally includes memories remotely located relative to the processor, and these remote memories may be connected to the processor through a network. Examples of the above network include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In the flight decision generation method and apparatus, computer device, and storage medium proposed by the embodiments of the present disclosure, task requirement data is acquired; a flight decision model is constructed according to the task requirement data, where the flight decision model includes a plurality of decision tuple data used to characterize original flight decisions of the UAV, each piece of decision tuple data includes an original hyperparameter set, and the original hyperparameter sets are assembled to form an original hyperparameter population; a corresponding target learning function is constructed based on the flight decision model; the original hyperparameter population is updated and optimized according to the target learning function to obtain a target hyperparameter population; and a target flight decision of the UAV is acquired according to the target hyperparameter population. On the basis of establishing the flight decision model of the UAV, the embodiments of the present disclosure define a target learning function for optimization; on the basis of enabling the UAV to complete the autonomous navigation task through the target flight decision, the original hyperparameter population is further updated and optimized through the optimization objectives of the target learning function to obtain the target flight decision, which improves the flexibility of UAV autonomous navigation tasks.
The embodiments described in the embodiments of the present disclosure are intended to illustrate the technical solutions of the embodiments of the present disclosure more clearly and do not constitute a limitation on the technical solutions provided by the embodiments of the present disclosure; those skilled in the art will know that, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present disclosure are equally applicable to similar technical problems.
Those skilled in the art can understand that the technical solutions shown in FIG. 1, FIG. 4, FIG. 5, FIG. 6, FIG. 7, and FIG. 8 do not constitute a limitation on the embodiments of the present disclosure, and may include more or fewer steps than shown, combine certain steps, or use different steps.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
A person of ordinary skill in the art can understand that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and devices, can be implemented as software, firmware, hardware, and appropriate combinations thereof.
The terms "first", "second", "third", "fourth", and the like (if any) in the specification of the present application and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units not clearly listed or inherent to such a process, method, product, or device.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" is used to describe the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects have an "or" relationship. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical functional division, and there may be other division methods in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The preferred embodiments of the embodiments of the present disclosure have been described above with reference to the accompanying drawings, which does not thereby limit the scope of rights of the embodiments of the present disclosure. Any modification, equivalent replacement, or improvement made by those skilled in the art without departing from the scope and essence of the embodiments of the present disclosure shall fall within the scope of rights of the embodiments of the present disclosure.

Claims (10)

  1. A flight decision generation method for an unmanned aerial vehicle (UAV), comprising:
    acquiring task requirement data;
    constructing a flight decision model according to the task requirement data, wherein the flight decision model comprises a plurality of decision tuple data used to characterize original flight decisions of the UAV, each piece of decision tuple data comprises an original hyperparameter set, and the original hyperparameter sets are assembled to form an original hyperparameter population;
    constructing a corresponding target learning function based on the flight decision model;
    updating and optimizing the original hyperparameter population according to the target learning function to obtain a target hyperparameter population; and
    acquiring a target flight decision of the UAV according to the target hyperparameter population.
  2. The method according to claim 1, wherein each piece of decision tuple data comprises state data, action data, a state transition function, a reward function, and a discount factor;
    the state data comprises, at a certain time, a first distance between the UAV and an obstacle, a maximum detection range corresponding to a sensor of the UAV, a second distance and angle data between the UAV's current position and a preset end point, the UAV's current flight speed and view direction, and the UAV's maximum speed limit;
    the action data comprises acceleration/deceleration data and steering data of the UAV at a certain time;
    the state transition function is used to generate the state data of the UAV at a next time;
    the reward function comprises the original hyperparameter set, and the reward function is used to evaluate a preliminary quality of an action performed by the UAV in a certain state; and
    the discount factor is used in combination with the reward function to compute a quality of an action performed by the UAV in a certain state.
  3. The method according to claim 1, wherein the target learning function comprises a first objective function and a second objective function, and constructing the corresponding target learning function based on the flight decision model comprises:
    acquiring optimization objectives according to the flight decision model, wherein the optimization objectives comprise a flight time objective and a flight risk objective of the UAV; and
    constructing the first objective function according to the flight time objective, and constructing the second objective function according to the flight risk objective.
  4. The method according to any one of claims 1 to 3, wherein updating and optimizing the original hyperparameter population according to the target learning function to obtain the target hyperparameter population comprises:
    performing reinforcement learning training on each original hyperparameter set in the original hyperparameter population through the flight decision model to obtain a plurality of corresponding original control policies;
    performing a mutation operation on each original hyperparameter set to generate a plurality of offspring hyperparameter sets;
    performing reinforcement learning training on each offspring hyperparameter set through the flight decision model to obtain a plurality of corresponding offspring control policies; and
    obtaining the target hyperparameter population through the plurality of original control policies and the plurality of offspring control policies.
  5. The method according to claim 4, wherein obtaining the target hyperparameter population through the plurality of original control policies and the plurality of offspring control policies comprises:
    controlling the UAV to interact with an environment using each original control policy, so as to compute a first objective function value of each original hyperparameter set according to the target learning function;
    controlling the UAV to interact with the environment using each offspring control policy, so as to compute a second objective function value of each offspring hyperparameter set according to the target learning function;
    adding the plurality of offspring hyperparameter sets to the original hyperparameter population to form an updated original hyperparameter population; and
    performing an environmental selection operation on the updated original hyperparameter population according to the first objective function values and the second objective function values to obtain the target hyperparameter population.
  6. The method according to claim 5, wherein performing the environmental selection operation on the updated original hyperparameter population according to the first objective function values and the second objective function values to obtain the target hyperparameter population comprises:
    dividing the updated original hyperparameter population to obtain a plurality of preliminary hyperparameter populations, wherein each preliminary hyperparameter set is the original hyperparameter set or the offspring hyperparameter set, each preliminary hyperparameter population comprises a plurality of preliminary hyperparameter sets, and the preliminary hyperparameter sets within each preliminary hyperparameter population do not Pareto-dominate one another;
    performing non-dominated sorting on the plurality of preliminary hyperparameter populations according to magnitudes of the first objective function values and the second objective function values to obtain an environmental selection order; and
    performing the environmental selection operation on each preliminary hyperparameter population according to the environmental selection order to obtain the target hyperparameter population.
  7. The method according to claim 6, wherein performing the environmental selection operation on each preliminary hyperparameter population according to the environmental selection order to obtain the target hyperparameter population comprises:
    obtaining a preset individual quantity threshold; and
    according to the environmental selection order, sequentially selecting, from the plurality of preliminary hyperparameter populations, a plurality of the preliminary hyperparameter sets satisfying the individual quantity threshold, so as to form the target hyperparameter population.
  8. A flight decision generation apparatus for an unmanned aerial vehicle (UAV), comprising:
    a data acquisition module, configured to acquire task requirement data;
    a model construction module, configured to construct a flight decision model according to the task requirement data, wherein the flight decision model comprises a plurality of decision tuple data used to characterize original flight decisions of the UAV, each piece of decision tuple data comprises an original hyperparameter set, and the original hyperparameter sets are assembled to form an original hyperparameter population;
    a function construction module, configured to construct a corresponding target learning function based on the flight decision model;
    a population optimization module, configured to update and optimize the original hyperparameter population according to the target learning function to obtain a target hyperparameter population; and
    a decision acquisition module, configured to acquire a target flight decision of the UAV according to the target hyperparameter population.
  9. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor is configured to perform: the method according to any one of claims 1 to 7.
  10. A storage medium, the storage medium being a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the computer is configured to perform: the method according to any one of claims 1 to 7.
PCT/CN2022/094033 2022-01-25 2022-05-20 Flight decision generation method and apparatus, computer device, and storage medium WO2023142316A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210084970.9 2022-01-25
CN202210084970.9A CN114492718A (zh) 2022-01-25 2022-01-25 Flight decision generation method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023142316A1 true WO2023142316A1 (zh) 2023-08-03

Family

ID=81473788

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094033 WO2023142316A1 (zh) 2022-01-25 2022-05-20 Flight decision generation method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN114492718A (zh)
WO (1) WO2023142316A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117171893A * 2023-10-25 2023-12-05 Tianzhiyi (Suzhou) Technology Co., Ltd. Artificial-intelligence-based UAV flight shortcoming analysis method and system
CN117371655A * 2023-10-12 2024-01-09 Sun Yat-sen University Evaluation method, system, device, and medium for UAV cooperative decision-making
CN117434968A * 2023-12-19 2024-01-23 Huazhong University of Science and Technology Multi-UAV pursuit-evasion game method and system based on distributed A2C

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492718A (zh) * 2022-01-25 2022-05-13 南方科技大学 飞行决策生成方法和装置、计算机设备、存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210343160A1 (en) * 2020-05-01 2021-11-04 Honeywell International Inc. Systems and methods for flight planning for conducting surveys by autonomous aerial vehicles
CN112198870A (zh) * 2020-06-01 2021-01-08 Northwestern Polytechnical University DDQN-based autonomous guidance maneuver decision method for UAV
CN111708355A (zh) * 2020-06-19 2020-09-25 National University of Defense Technology Multi-UAV action decision method and apparatus based on reinforcement learning
CN112327923A (zh) * 2020-11-19 2021-02-05 China University of Geosciences (Wuhan) Multi-UAV cooperative path planning method
CN112507622A (zh) * 2020-12-16 2021-03-16 National University of Defense Technology Reinforcement-learning-based anti-UAV task allocation method
CN113281999A (zh) * 2021-04-23 2021-08-20 Nanjing University UAV autonomous flight training method based on reinforcement learning and transfer learning
CN114492718A (zh) 2022-01-25 2022-05-13 Flight decision generation method and apparatus, computer device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Master's Thesis", 30 April 2020, CHANGAN UNIVERSITY, CN, article ZHENG, BAOJUAN: "Research on Improved NSGA-II Algorithm for UVA Path Planning", pages: 1 - 68, XP009547782, DOI: 10.26976/d.cnki.gchau.2020.001862 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117371655A (zh) * 2023-10-12 2024-01-09 Sun Yat-sen University Evaluation method, system, device, and medium for UAV cooperative decision-making
CN117171893A (zh) * 2023-10-25 2023-12-05 Tianzhiyi (Suzhou) Technology Co., Ltd. Artificial-intelligence-based UAV flight shortcoming analysis method and system
CN117171893B (zh) * 2023-10-25 2024-01-23 Tianzhiyi (Suzhou) Technology Co., Ltd. Artificial-intelligence-based UAV flight shortcoming analysis method and system
CN117434968A (zh) * 2023-12-19 2024-01-23 Huazhong University of Science and Technology Multi-UAV pursuit-evasion game method and system based on distributed A2C
CN117434968B (zh) * 2023-12-19 2024-03-19 Huazhong University of Science and Technology Multi-UAV pursuit-evasion game method and system based on distributed A2C

Also Published As

Publication number Publication date
CN114492718A (zh) 2022-05-13

Similar Documents

Publication Publication Date Title
WO2023142316A1 (zh) Flight decision generation method and apparatus, computer device, and storage medium
WO2020094060A1 (zh) Recommendation method and apparatus
US10762443B2 (en) Crowdsourcing system with community learning
US9825758B2 (en) Secure computer evaluation of k-nearest neighbor models
CN111784002B (zh) Distributed data processing method and apparatus, computer device, and storage medium
CN108875955B (zh) Parameter-server-based implementation method of gradient boosting decision trees and related device
CN112085041B (zh) Neural network training method, training apparatus, and electronic device
CN113159283B (zh) Model training method based on federated transfer learning and computing node
CN112382099B (zh) Traffic condition prediction method and apparatus, electronic device, and storage medium
CN112235384A (zh) Data transmission method, apparatus, device, and storage medium in a distributed system
CA3148760C (en) Automated image retrieval with graph neural network
KR102293791B1 (ko) Electronic device, method, and computer-readable medium for simulation of semiconductor devices
CN112308006A (zh) Gaze region prediction model generation method and apparatus, storage medium, and electronic device
CN112418302A (zh) Task prediction method and apparatus
KR20230048614A (ko) Systems, methods, and apparatus for image classification with domain-invariant regularization
CN114332550A (zh) Model training method and system, storage medium, and terminal device
Zhou et al. Improving robustness of random forest under label noise
CN115049730B (zh) Part assembly method and apparatus, electronic device, and storage medium
Lahmeri et al. Machine learning for UAV-based networks
Chansuparp et al. A novel augmentative backward reward function with deep reinforcement learning for autonomous UAV navigation
KR20230026104A (ko) Learning processing system, and apparatus and method for determining the number of local parameters
CN113779396B (zh) Question recommendation method and apparatus, electronic device, and storage medium
US11676370B2 (en) Self-supervised cross-video temporal difference learning for unsupervised domain adaptation
Jaafra et al. Meta-reinforcement learning for adaptive autonomous driving
KR102259429B1 (ko) Artificial intelligence server for determining a placement zone of a robot, and method therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923129

Country of ref document: EP

Kind code of ref document: A1