CN110222697B - Planetary surface landform active perception method based on reinforcement learning


Info

Publication number
CN110222697B
CN110222697B (application CN201910343241.9A)
Authority
CN
China
Prior art keywords
feature
landform
camera
reinforcement learning
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910343241.9A
Other languages
Chinese (zh)
Other versions
CN110222697A (en)
Inventor
余萌
李爽
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN201910343241.9A priority Critical patent/CN110222697B/en
Publication of CN110222697A publication Critical patent/CN110222697A/en
Application granted granted Critical
Publication of CN110222697B publication Critical patent/CN110222697B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/29: Graphical models, e.g. Bayesian networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a planetary surface landform active perception method based on reinforcement learning, which comprises the following steps: first, on the basis of modern set theory, planetary landforms are described in real time using local image feature description operators together with a global image saliency method, generating a knowledge base for active perception; on this basis, a reward function based on a finite set of feature description algorithms is designed within a reinforcement learning framework, constructing a learning framework for active perception of target landforms. Considering the limited computing power of the on-board computer, the learning step length is defined within this framework as a finite step length; finally, training and learning are completed against the planetary landform description operator knowledge base, forming the overall active landform perception method. The invention enables autonomous perception of planetary landforms: the rover can autonomously identify landforms of interest, effectively improving the scientific exploration efficiency of planetary-surface missions.

Description

Planetary surface landform active perception method based on reinforcement learning
Technical Field
The invention belongs to the technical field of task planning and pattern recognition, and particularly relates to a planetary surface landform active perception method based on reinforcement learning.
Background
For reliability reasons, the computing and storage capability of a Mars rover's on-board computer is limited (the CPU main frequency is only 200 MHz), so the rover can store and upload only a small portion of the observed scientific material to the ground workstation during each Martian working day (sol). With the rapid development of aerospace technology, the size of rovers performing roving patrol tasks on the surfaces of remote celestial bodies has grown generation by generation: Curiosity, the fourth-generation United States Mars rover, is about 3 meters long and weighs up to 900 kg, 2-5 times the size of earlier Mars rovers. The larger body allows Curiosity to carry more scientific payloads, with a total of 17 sensors carried in the actual mission. For reliability and safety, when the rover encounters complex road conditions, the material collected in the field must be transmitted back to the ground for landform identification and environment understanding, and the follow-up exploration task is executed only after a subsequent instruction is returned from the ground. The long communication delay between the celestial body and the Earth therefore greatly restricts the flexibility of rover tasks and the ability to acquire scientific targets. In recent years, aerospace researchers have been discussing autonomous exploration schemes with higher exploration efficiency. Scientists at the United States National Aeronautics and Space Administration (NASA) have proposed equipping rovers with active sensing devices, for example obtaining rock hardness by touching surfaces with a dexterous hand and autonomously performing operational analysis to improve exploration efficiency; other scientists have proposed using artificial intelligence methods for autonomous landform analysis, such as autonomous extraction of regions of scientific interest, obstacle detection, and the like.
Compared with landform analysis that depends on manual remote control, an autonomous landform perception method has many advantages. First, it gives planetary-surface exploration tasks a higher degree of autonomy: the Mars rover can explore more scientific targets within its limited working time without waiting for command instructions from ground staff, greatly improving the efficiency of rover exploration tasks and yielding scientific returns of higher value. Through on-line autonomous landform perception, the rover can screen scientific material of higher scientific value (such as rocks, cloud layers, sand storms, and other dynamic environments) for ground staff to study. However, for reliability reasons the on-board computer of a planetary rover has limited computing and storage capability, and planetary surface features are generally monotone in color and poor in texture, so some identification methods that have achieved significant results in ground applications may not suit the special environment of planetary landform exploration. At present, there is no systematic scheme for autonomous perception of planetary landforms.
Disclosure of Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a planetary surface landform active perception method based on reinforcement learning, so as to solve the problem that the prior art contains no systematic scheme for active perception of planetary landforms.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the invention discloses a planetary surface landform active perception method based on reinforcement learning, which comprises the following steps:
step 1): extracting SURF local feature descriptors of the images from a series of planet landform image sets, and cataloguing feature descriptor sets corresponding to landforms one by one according to the landform categories, namely cataloguing SURF feature descriptors belonging to the same type of landforms in a set form;
step 2): checking the feature repetition degree of the SURF feature descriptor set, eliminating feature pairs with high similarity and features with undersized feature scales, reserving the rest SURF feature descriptors, and establishing a feature knowledge base;
step 3): describing landform perception in the form of the proportion of observed features present in the feature knowledge base, deriving the joint posterior probability distribution, and establishing a corresponding reward function in the reinforcement learning framework according to the posterior probability;
step 4): setting trigger conditions for active planetary landform perception, and analyzing the local saliency of on-board camera images in real time during the rover's roving patrol; when the local image saliency meets the trigger conditions, performing SURF local feature descriptor extraction and passing the extracted SURF local feature descriptors to the reinforcement learning training system as the observed quantity, the control quantities in the reinforcement learning system being the camera pan-tilt adjustment angle $\theta_c$ and the camera focal length $f_c$;
step 5): changing the policy iteration step in reinforcement learning to a finite-step-length mode, and training the on-board camera recognition action sequence by combining the reinforcement learning reward function established in step 3) with the feature knowledge base established in step 1) to complete the active landform recognition work;
step 6): storing the landform perception result, after which the rover continues its roving patrol task.
Preferably, the feature repetition degree check and the feature knowledge base construction in step 2) are specifically:
21) adopting SURF feature descriptors to extract local features from satellite images of the target patrol area;
22) screening the extracted 64-dimensional SURF feature descriptors for repetition and removing feature pairs with high similarity, where similarity is judged by the dot product of the normalized feature description vectors and feature pairs whose descriptor dot product exceeds 0.9 are removed;
23) culling feature descriptors whose feature scale is smaller than 3 pixels;
24) retaining the feature descriptor subsets that survive the two rounds of screening to build the landform knowledge base.
Preferably, the SURF local feature descriptor extraction performed in step 3) ranges over the local image area within the saliency detection region.
Preferably, the reward function in step 3) is designed as follows:

31) establishing the correlation between the feature observation and the landform feature sets, described with a Bayesian conditional posterior probability model:

$$P(\mathcal{F}_k \mid \mathcal{S}) = \frac{P(\mathcal{S}, \mathcal{F}_k)}{\sum_{j=1}^{K} P(\mathcal{S}, \mathcal{F}_j)}$$

where $\mathcal{F}_k$ is the feature description subset corresponding to the k-th landform in the feature knowledge base, $\mathcal{S}$ is the current observed quantity, $P(\mathcal{S}, \mathcal{F}_k)$ is the joint probability of $\mathcal{S}$ and $\mathcal{F}_k$, and $P(\mathcal{F}_k)$ is the prior probability of the correlation between the k-th landform and the observed quantity, entering through

$$P(\mathcal{S}, \mathcal{F}_k) = P(\mathcal{S} \mid \mathcal{F}_k)\, P(\mathcal{F}_k),$$

where the observed probabilities of the different landform types are uniformly initialized to $P(\mathcal{F}_k) = 1/K$, K being the total number of landforms in the feature knowledge base;

32) after the correlation posterior probability is obtained, the discrete Shannon information entropy is normalized to describe the completeness of the posterior probability distribution:

$$I(\mathcal{S}_k) = -\frac{1}{\log N_m(k)} \sum_{m=1}^{N_m(k)} P(\mathcal{F}_m \mid \mathcal{S}_k)\, \log P(\mathcal{F}_m \mid \mathcal{S}_k)$$

where $N_m(k)$ is the number of landforms in the feature knowledge base whose feature subsets intersect the SURF feature description subset extracted from the observed quantity; $\mathcal{S}_k \cap \mathcal{F}_m$ is the intersection of the observation feature set $\mathcal{S}_k$ at time k with the feature set $\mathcal{F}_m$; and $P(\mathcal{F}_m \mid \mathcal{S}_k)$ describes the degree of likelihood between the current observation feature set and a given landform in the landform feature knowledge base;

33) based on the posterior probability distribution description established in step 32), the reward function is established:

$$R_k(x_k, a_k) = \begin{cases} \Delta I, & \text{for an ordinary parameter adjustment} \\ C_R, & \text{when } x_k \text{ or } x_{k+1} \text{ reaches an extreme value} \\ C_{stop}, & \text{when control is terminated} \end{cases}$$

where $R_k(\cdot)$ is the reward function; $x_k$ is the camera state parameter at time k; $a_k = [\theta_c(k), f_c(k)]^T$ is the camera parameter control quantity; $\Delta I$ is the entropy increment of the posterior probability distribution after the camera parameter adjustment is executed, which can be regarded as a measure of how much the uncertainty of the posterior probability for recognizing a given landform type has been reduced; $C_R > \Delta I$ is a reward constant assigned to terminate the control quantity when the state quantity $x_k$ or $x_{k+1}$ reaches an extreme value (maximum/minimum focal length or pan-tilt rotation angle); and $C_{stop}$ is a constant smaller than $\Delta I$, assigned when the reward gained by stopping control exceeds that of any executed control, in which case the control step is terminated.
Preferably, the trigger conditions for active planetary landform perception set in step 4) are specifically:
41) performing saliency analysis on a single planetary image using the spectral residual method;
42) recording the pixel areas of the detected salient contours,

$$\mathbf{s} = \{s_1, s_2, \ldots, s_N\}$$

where $s_1 \sim s_N$ are the pixel areas of the N salient regions and $\mathbf{s}$ is the area set;
43) selecting from $\mathbf{s}$ the largest contour pixel area $S_{max1}$ and the second-largest contour pixel area $S_{max2}$; when $S_{max1}/S_{max2} > 1.5$, the current frame is considered to contain a landform worth observing, and active landform perception is triggered.
Preferably, in step 5) the policy iteration step of the reinforcement learning method is modified in a targeted way as follows:
51) defining the camera parameter control strategy with the corresponding reward function:

$$\mathcal{A}_f = \{f_c^+, f_c^-\}, \qquad \mathcal{A}_\theta = \{\theta_c^+, \theta_c^-\}$$

where $\mathcal{A}_f$ and $\mathcal{A}_\theta$ are the action spaces for camera focal-length zooming and camera pan-tilt rotation, respectively; $f_c^+$ denotes magnifying the focal length by a factor of 1.2 and $f_c^-$ reducing it by a factor of 0.9 (repeated actions multiply the factors); $\theta_c^+$ denotes rotating the camera pan-tilt 5 degrees to the right and $\theta_c^-$ rotating it 5 degrees to the left;
52) defining the evaluation function in the policy iteration as:

$$v_\pi(x) = E_\pi\!\left[\sum_{i=1}^{H} \gamma^{\,i}\, R(x_i, a_i)\right]$$

where $R(\cdot)$ corresponds to the reward function $R_k(\cdot)$ in step 33); $x$ is the camera state quantity and $v_\pi(x)$ the evaluation function; $E_\pi[\cdot]$ is the expected return obtained after executing camera control policy $\pi$; $H$ is the total length of the control sequence and $h$ the length of each step; $\gamma \in (0,1)$ is a time penalty term intended to attenuate future reward terms; $p(x_i \mid x_{i-1}, a)$ is the state transition probability function, set to $p(x_i \mid x_{i-1}, a) = 0.99$, i.e. a 1% camera parameter adjustment failure rate is assumed;
53) within the reinforcement learning framework, the policy evaluation and policy update iteration is repeated until convergence.
Preferably, the processing capability of the on-board computer is taken into account in the policy update step by setting a finite-step-length iteration policy, i.e. the following termination criteria are introduced into the iteration process:
the maximum number of iteration steps is 20, and the reinforcement learning task is terminated if effective convergence is not achieved within those 20 steps;
the image saliency is examined in real time during iteration, and the reinforcement learning task is terminated once the following conditions are met: the Euclidean distance from the centroid of the largest closed salient region in the current frame to the image center is less than 40 pixels (for a 1024 × 1024 resolution camera), and the ratio of the pixel area of the largest salient region to the area of the camera imaging plane is between 0.25 and 0.5;
the camera parameters have reached a limit value and no further parameter adjustment can be performed.
The invention has the beneficial effects that:
1. the invention constructs the training knowledge base from feature description subsets, ensuring query efficiency while saving storage space;
2. the invention establishes the reward function from the intersection of the observation feature set and the corresponding landform feature subsets in the feature knowledge base, which effectively avoids interference from non-interesting landforms during learning and improves the success rate that a perceived landform is an interesting landform catalogued in the feature knowledge base.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram illustrating a result of a camera action planning based on reinforcement learning;
FIG. 3 is a diagram illustrating the relationship between the number of training times and the number of steps per iteration;
fig. 4 is a schematic diagram of the final landform active re-observation result.
Detailed Description
In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.
Referring to fig. 1, the planetary surface landform active perception method based on reinforcement learning of the invention comprises the following steps:
step 1): extracting SURF local feature descriptors of the images from a series of planet landform image sets, and cataloging feature descriptor sets corresponding to landforms one by one according to the landform types, namely cataloging SURF feature descriptors belonging to the same type of landforms in a set form.
Step 2): checking the feature repetition degree of the SURF feature descriptor set, eliminating feature pairs with high similarity and features with undersized feature scales, reserving the rest SURF feature descriptors, and establishing a feature knowledge base;
the feature repetition degree check and the feature knowledge base establishment are as follows:
21 Adopting SURF feature descriptors to extract local features in the satellite images of the target patrol area;
22 Repeat screening is carried out on the extracted 64-dimensional SURF feature descriptors, and feature pairs with high similarity are removed, wherein the similarity judgment is realized by point multiplication of normalized feature description vectors, and the feature pairs with the product of point multiplication of the descriptors larger than 0.9 are removed;
23 Culling feature descriptors with a feature size of less than 3 pixels;
24 Retaining feature descriptor subsets after two rounds of screening to build a landform knowledge base.
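By way of illustration, the following minimal sketch shows how the two-round screening of steps 21)-24) could be realized for one landform class. It assumes the 64-dimensional SURF descriptors and their feature scales have already been extracted (e.g. with an OpenCV SURF detector, where available); the function name and array layout are illustrative assumptions, not an implementation prescribed by the patent.

```python
import numpy as np

def build_landform_knowledge_base(descriptors, scales,
                                  sim_thresh=0.9, min_scale=3.0):
    """Two-round screening of 64-D SURF descriptors (steps 21-24).

    descriptors : (N, 64) array of SURF descriptors for one landform class
    scales      : (N,) array of feature scales in pixels
    Returns the retained descriptor subset for the feature knowledge base.
    """
    # Round 1: similarity is the dot product of L2-normalized descriptors;
    # for every pair above 0.9, the later member of the pair is culled.
    d = descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True)
                       + 1e-12)
    sim = d @ d.T
    keep = np.ones(len(d), dtype=bool)
    for i in range(len(d)):
        if not keep[i]:
            continue
        dup = sim[i] > sim_thresh
        dup[: i + 1] = False          # keep i itself and earlier survivors
        keep[dup] = False

    # Round 2: cull features whose scale is under 3 pixels.
    keep &= scales >= min_scale
    return descriptors[keep]
```

One such retained subset would be catalogued per landform class, mirroring the per-class cataloguing of step 1).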
Step 3): describing landform perception in the form of the proportion of observed features present in the feature knowledge base, deriving the joint posterior probability distribution, and establishing a corresponding reward function in the reinforcement learning framework according to the posterior probability;
the SURF local feature descriptor extraction range is executed in the step 3) and is a local image area in the significance detection area.
The reward function is designed as follows (an illustrative implementation sketch follows this design):

31) establishing the correlation between the feature observation and the landform feature sets, described with a Bayesian conditional posterior probability model:

$$P(\mathcal{F}_k \mid \mathcal{S}) = \frac{P(\mathcal{S}, \mathcal{F}_k)}{\sum_{j=1}^{K} P(\mathcal{S}, \mathcal{F}_j)}$$

where $\mathcal{F}_k$ is the feature description subset corresponding to the k-th landform in the feature knowledge base, $\mathcal{S}$ is the current observed quantity, $P(\mathcal{S}, \mathcal{F}_k)$ is the joint probability of $\mathcal{S}$ and $\mathcal{F}_k$, and $P(\mathcal{F}_k)$ is the prior probability of the correlation between the k-th landform and the observed quantity, entering through

$$P(\mathcal{S}, \mathcal{F}_k) = P(\mathcal{S} \mid \mathcal{F}_k)\, P(\mathcal{F}_k),$$

where the observed probabilities of the different landform types are uniformly initialized to $P(\mathcal{F}_k) = 1/K$, K being the total number of landforms in the feature knowledge base;

32) after the correlation posterior probability is obtained, the discrete Shannon information entropy is normalized to describe the completeness of the posterior probability distribution:

$$I(\mathcal{S}_k) = -\frac{1}{\log N_m(k)} \sum_{m=1}^{N_m(k)} P(\mathcal{F}_m \mid \mathcal{S}_k)\, \log P(\mathcal{F}_m \mid \mathcal{S}_k)$$

where $N_m(k)$ is the number of landforms in the feature knowledge base whose feature subsets intersect the SURF feature description subset extracted from the observed quantity; $\mathcal{S}_k \cap \mathcal{F}_m$ is the intersection of the observation feature set $\mathcal{S}_k$ at time k with the feature set $\mathcal{F}_m$; and $P(\mathcal{F}_m \mid \mathcal{S}_k)$ describes the degree of likelihood between the current observation feature set and a given landform in the landform feature knowledge base;

33) based on the posterior probability distribution description established in step 32), the reward function is established:

$$R_k(x_k, a_k) = \begin{cases} \Delta I, & \text{for an ordinary parameter adjustment} \\ C_R, & \text{when } x_k \text{ or } x_{k+1} \text{ reaches an extreme value} \\ C_{stop}, & \text{when control is terminated} \end{cases}$$

where $R_k(\cdot)$ is the reward function; $x_k$ is the camera state parameter at time k; $a_k = [\theta_c(k), f_c(k)]^T$ is the camera parameter control quantity; $\Delta I$ is the entropy increment of the posterior probability distribution after the camera parameter adjustment is executed, which can be regarded as a measure of how much the uncertainty of the posterior probability for recognizing a given landform type has been reduced; $C_R > \Delta I$ is a reward constant assigned to terminate the control quantity when the state quantity $x_k$ or $x_{k+1}$ reaches an extreme value (maximum/minimum focal length or pan-tilt rotation angle); and $C_{stop}$ is a constant smaller than $\Delta I$, assigned when the reward gained by stopping control exceeds that of any executed control, in which case the control step is terminated.
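The sketch below gives one concrete reading of steps 31)-33): a posterior over the K landform classes is formed from intersections between the observation feature set and each catalogued subset under a uniform 1/K prior, the posterior's normalized Shannon entropy is computed, and the piecewise reward is returned. Representing the feature sets as Python sets of discrete feature identifiers (e.g. indices of matched knowledge-base descriptors), the intersection-based likelihood, and the constants C_R and C_stop are all illustrative assumptions the patent leaves open.

```python
import numpy as np

def posterior_over_landforms(obs_set, knowledge_base):
    """Posterior P(F_k | S) over landform classes from set intersections
    (step 31), with a uniform 1/K prior over the K catalogued classes."""
    K = len(knowledge_base)
    prior = np.full(K, 1.0 / K)
    # Assumed likelihood: the fraction of observed features that also
    # appear in class k's catalogued descriptor subset.
    lik = np.array([len(obs_set & f_k) / max(len(obs_set), 1)
                    for f_k in knowledge_base])
    joint = lik * prior                     # P(S, F_k) = P(S|F_k) P(F_k)
    return joint / joint.sum() if joint.sum() > 0 else prior

def normalized_entropy(post):
    """Normalized discrete Shannon entropy of the posterior (step 32)."""
    nz = post[post > 0]
    if len(nz) <= 1:
        return 0.0                          # a certain posterior: no entropy
    return float(-(nz * np.log(nz)).sum() / np.log(len(nz)))

def reward(entropy_before, entropy_after, at_limit, stopped,
           C_R=1.5, C_stop=-0.01):
    """Piecewise reward of step 33); the constants are placeholders."""
    if stopped:
        return C_stop                       # stopping beats further control
    if at_limit:
        return C_R                          # a camera parameter hit a limit
    return entropy_before - entropy_after   # Delta I: uncertainty reduced
```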
Step 4): setting trigger conditions for active planetary landform perception, and analyzing the local saliency of on-board camera images in real time during the rover's roving patrol; when the local image saliency meets the trigger conditions, SURF local feature descriptor extraction is executed, the extracted SURF local feature descriptors are passed to the reinforcement learning training system as the observed quantity, and the control quantities in the reinforcement learning system are the camera pan-tilt adjustment angle $\theta_c$ and the camera focal length $f_c$.
The trigger conditions for active planetary landform perception are specifically (an illustrative implementation sketch follows this list):
41) performing saliency analysis on a single planetary image using the spectral residual method;
42) recording the pixel areas of the detected salient contours,

$$\mathbf{s} = \{s_1, s_2, \ldots, s_N\}$$

where $s_1 \sim s_N$ are the pixel areas of the N salient regions and $\mathbf{s}$ is the area set;
43) selecting from $\mathbf{s}$ the largest contour pixel area $S_{max1}$ and the second-largest contour pixel area $S_{max2}$; when $S_{max1}/S_{max2} > 1.5$, the current frame is considered to contain a landform worth observing, and active landform perception is triggered.
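A minimal sketch of the trigger test of steps 41)-43), assuming a grayscale frame as input: the saliency map follows the spectral-residual construction (log-amplitude spectrum minus its local mean, recombined with the phase and inverse-transformed), while the binarization threshold of 0.5 is an assumption, since the patent does not specify one.

```python
import numpy as np
import cv2

def spectral_residual_saliency(gray):
    """Spectral-residual saliency map, as used in step 41)."""
    f = np.fft.fft2(gray.astype(np.float32))
    log_amp = np.log1p(np.abs(f))
    phase = np.angle(f)
    # Spectral residual: log amplitude minus its 3x3 local mean.
    residual = log_amp - cv2.blur(log_amp, (3, 3))
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = cv2.GaussianBlur(sal.astype(np.float32), (9, 9), 2.5)
    return cv2.normalize(sal, None, 0, 1, cv2.NORM_MINMAX)

def perception_triggered(gray, sal_thresh=0.5, ratio=1.5):
    """Steps 42)-43): trigger when the largest salient contour area
    exceeds the second-largest by the factor `ratio` (here 1.5)."""
    sal = spectral_residual_saliency(gray)
    mask = (sal > sal_thresh).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    areas = sorted((cv2.contourArea(c) for c in contours), reverse=True)
    return len(areas) >= 2 and areas[1] > 0 and areas[0] / areas[1] > ratio
```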
Step 5): changing the policy iteration step in reinforcement learning to a finite-step-length mode, and training the on-board camera recognition action sequence by combining the reinforcement learning reward function established in step 3) with the feature knowledge base established in step 1) to complete the active landform recognition work;
The policy iteration step of the reinforcement learning method is modified in a targeted way as follows:
51) defining the camera parameter control strategy with the corresponding reward function:

$$\mathcal{A}_f = \{f_c^+, f_c^-\}, \qquad \mathcal{A}_\theta = \{\theta_c^+, \theta_c^-\}$$

where $\mathcal{A}_f$ and $\mathcal{A}_\theta$ are the action spaces for camera focal-length zooming and camera pan-tilt rotation, respectively; $f_c^+$ denotes magnifying the focal length by a factor of 1.2 and $f_c^-$ reducing it by a factor of 0.9 (repeated actions multiply the factors); $\theta_c^+$ denotes rotating the camera pan-tilt 5 degrees to the right and $\theta_c^-$ rotating it 5 degrees to the left;
52) defining the evaluation function in the policy iteration as:

$$v_\pi(x) = E_\pi\!\left[\sum_{i=1}^{H} \gamma^{\,i}\, R(x_i, a_i)\right]$$

where $R(\cdot)$ corresponds to the reward function $R_k(\cdot)$ in step 33); $x$ is the camera state quantity and $v_\pi(x)$ the evaluation function; $E_\pi[\cdot]$ is the expected return obtained after executing camera control policy $\pi$; $H$ is the total length of the control sequence and $h$ the length of each step; $\gamma \in (0,1)$ is a time penalty term intended to attenuate future reward terms; $p(x_i \mid x_{i-1}, a)$ is the state transition probability function, set to $p(x_i \mid x_{i-1}, a) = 0.99$, i.e. a 1% camera parameter adjustment failure rate is assumed;
53) within the reinforcement learning framework, the policy evaluation and policy update iteration is repeated until convergence.
In the policy update step, the processing capability of the on-board computer is taken into account by setting a finite-step-length iteration policy, i.e. the following termination criteria are introduced into the iteration process (an illustrative sketch of the resulting control loop follows):
the maximum number of iteration steps is 20, and the reinforcement learning task is terminated if effective convergence is not achieved within those 20 steps;
the image saliency is examined in real time during iteration, and the reinforcement learning task is terminated once the following conditions are met: the Euclidean distance from the centroid of the largest closed salient region in the current frame to the image center is less than 40 pixels (for a 1024 × 1024 resolution camera), and the ratio of the pixel area of the largest salient region to the area of the camera imaging plane is between 0.25 and 0.5;
the camera parameters have reached a limit value and no further parameter adjustment can be performed.
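The sketch below shows the shape of the finite-step control loop implied by steps 51)-53) and the termination criteria: a 20-step budget, the saliency-based success test, and termination at parameter limits. For brevity it scores actions by one-step lookahead rather than running full policy-evaluation/policy-improvement sweeps, and the `camera` interface (`target_centered`, `can_apply`, `apply`) together with the value of `gamma` are assumed stand-ins, not parts of the patent's specification.

```python
def active_perception_episode(camera, evaluate_reward,
                              max_steps=20, gamma=0.9):
    """Finite-step control loop reflecting the termination criteria:
    a 20-step cap, a saliency-based success test, and a stop once the
    camera parameters reach their limits."""
    # Action spaces of step 51): focal-length zoom and pan-tilt rotation.
    actions = [("zoom", 1.2), ("zoom", 0.9), ("pan", +5.0), ("pan", -5.0)]
    total_return = 0.0
    for step in range(max_steps):            # hard 20-step budget
        if camera.target_centered():         # centroid/area success test
            break
        # One-step lookahead in place of full policy-iteration sweeps.
        scored = [(evaluate_reward(camera, a), a) for a in actions
                  if camera.can_apply(a)]    # respect parameter limits
        if not scored:                       # all limits reached: terminate
            break
        best_reward, best_action = max(scored)
        if best_reward <= 0.0:               # stopping beats any control
            break
        camera.apply(best_action)
        total_return += (gamma ** step) * best_reward
    return total_return
```

Here `evaluate_reward` would wrap the reward of step 33), and the 0.99 success probability of a parameter adjustment would live inside the camera model.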
Step 6): storing the landform perception result, after which the rover continues its roving patrol task.
FIGS. 2-4 show a simulation example of the method of the present invention, which uses Rhino software to generate a three-dimensional planetary landscape, extracts SURF feature points from rendered planetary images to construct the planetary landform set, and places the planetary rover at random positions in the three-dimensional landscape to perform active landform recognition. FIG. 2 is a schematic diagram of the camera action planning result obtained by reinforcement learning, in which S denotes the camera parameters at the start time and G the camera parameters after reinforcement learning ends; FIG. 3 shows the relationship between the number of policy iterations and the number of camera action steps planned in each iteration, with the reinforcement learning method successfully converging at step 14; FIG. 4 shows the camera's current frame at the beginning and at the end of active landform perception. The result of reinforcement learning is to magnify the camera field of view by 1.4 times and rotate 5 degrees to the left. Compared with the original landform observation, the landform in the image can be distinguished more clearly after the parameter adjustment. In most of the remaining simulation groups, the landform observation results adjusted by the active recognition algorithm improved to varying degrees; at the same time, the effect of active recognition was found to be closely related to the construction quality of the landform set: if the landform currently to be identified is not fully described in the landform set constructed in the earlier stage (for example, the shooting angle is poor, or the distance is too far or too close), the active landform recognition effect is also poor. Therefore, the completeness of the satellite landform set prepared in the earlier work also determines how much the on-line landform perception effect can be improved.
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (6)

1. A planet surface landform active perception method based on reinforcement learning is characterized by comprising the following steps:
step 1): extracting SURF local feature descriptors of the images from a series of planet landform image sets, and cataloguing feature descriptor sets corresponding to landforms one by one according to the landform categories, namely cataloguing SURF feature descriptors belonging to the same type of landforms in a set form;
step 2): checking the feature repetition degree of the SURF feature descriptor set, eliminating feature pairs with high similarity and features with undersized feature scales, reserving the rest SURF feature descriptors, and establishing a feature knowledge base;
step 3): describing landform perception in the form of the proportion of observed features present in the feature knowledge base, deriving the joint posterior probability distribution, and establishing a corresponding reward function in the reinforcement learning framework according to the posterior probability;
step 4): setting trigger conditions for active planetary landform perception, and analyzing the local saliency of on-board camera images in real time during the rover's roving patrol; when the local image saliency meets the trigger conditions, performing SURF local feature descriptor extraction and passing the extracted SURF local feature descriptors to the reinforcement learning training system as the observed quantity, the control quantities in the reinforcement learning training system being the camera pan-tilt adjustment angle $\theta_c$ and the camera focal length $f_c$;
step 5): changing the policy iteration step in reinforcement learning to a finite-step-length mode, and training the on-board camera recognition action sequence by combining the reinforcement learning reward function established in step 3) with the feature knowledge base established in step 1) to complete the active landform recognition work;
step 6): storing the landform perception result, the rover then continuing its roving patrol task;
the reward function in the step 3) is designed as follows:
31 Establishing the correlation between the feature observed quantity and the landform feature set, and describing by adopting a Bayes condition posterior probability model:
Figure FDA0004029012220000011
wherein the content of the first and second substances,
Figure FDA0004029012220000012
for a feature description subset corresponding to the kth feature in the feature knowledge base, based on the feature description value of the kth feature, for>
Figure FDA0004029012220000013
For the current observed quantity
Figure FDA0004029012220000014
And/or>
Figure FDA0004029012220000015
In conjunction with (a), or (b)>
Figure FDA0004029012220000016
Is the prior probability of the correlation of the kth landform with the observed quantity, and is described as:
Figure FDA0004029012220000017
wherein, the first and the second end of the pipe are connected with each other,
Figure FDA0004029012220000018
wherein, the first and the second end of the pipe are connected with each other,
Figure FDA0004029012220000021
uniformly initializing the probability of the observed different types of landforms into 1K, wherein K is the total number of the landforms in the characteristic knowledge base;
32 After obtaining the correlation posterior probability, the entropy of the discrete fragrance concentration information is normalized to describe the completeness of the posterior probability distribution:
Figure FDA0004029012220000022
wherein N is m (k) Is a feature knowledge base
Figure FDA0004029012220000023
The number of landforms which have intersection with the SURF feature description subset extracted from the observed quantity is determined; />
Figure FDA0004029012220000024
Observation feature set for time k>
Figure FDA0004029012220000025
Feature set->
Figure FDA0004029012220000026
The intersection of (a); />
Figure FDA0004029012220000027
The system is used for describing the likelihood degree of a certain landform in the current observation characteristic set and the landform characteristic knowledge base;
33 Based on the posterior probability distribution description established in step 32), a reward function is established:
Figure FDA0004029012220000028
wherein R is k (. H) is a reward function; x is a radical of a fluorine atom k The state parameters of the camera at the moment k are obtained; a is a k =[θ c (k),f c (k)] T Controlling the quantity for the camera parameter;
Figure FDA0004029012220000029
entropy increment of posterior probability distribution after performing camera parameter adjustment; c R > Δ I is a reward constant that,C stop is a constant less than al.
2. The active perception method for planetary surface landforms based on reinforcement learning of claim 1, wherein the feature repetition degree check and the feature knowledge base construction in step 2) are specifically as follows:
21) adopting SURF feature descriptors to extract local features from satellite images of the target patrol area;
22) screening the extracted 64-dimensional SURF feature descriptors for repetition and removing feature pairs with high similarity, wherein similarity is judged by the dot product of the normalized feature description vectors and feature pairs whose descriptor dot product exceeds 0.9 are removed;
23) culling feature descriptors whose feature scale is smaller than 3 pixels;
24) retaining the feature descriptor subsets that survive the two rounds of screening to build the landform knowledge base.
3. The active perception method for planetary surface landforms based on reinforcement learning according to claim 1, wherein the SURF local feature descriptor extraction in step 3) is performed over the local image area within the saliency detection region.
4. The planetary surface topography active perception method based on reinforcement learning according to claim 1, wherein the triggering conditions for planetary topography active perception set in the step 4) are specifically:
41) performing saliency analysis on a single planetary image using the spectral residual method;
42) recording the pixel areas of the detected salient contours,

$$\mathbf{s} = \{s_1, s_2, \ldots, s_N\}$$

wherein $s_1 \sim s_N$ are the pixel areas of the N salient regions and $\mathbf{s}$ is the area set;
43) selecting from $\mathbf{s}$ the largest contour pixel area $S_{max1}$ and the second-largest contour pixel area $S_{max2}$; when $S_{max1}/S_{max2} > 1.5$, the current frame is considered to contain a landform worth observing, and active landform perception is triggered.
5. The active perception method for planetary surface landforms based on reinforcement learning of claim 1, wherein step 5) makes the following targeted modifications to the policy iteration step of the reinforcement learning method:
51) defining the camera parameter control strategy with the corresponding reward function:

$$\mathcal{A}_f = \{f_c^+, f_c^-\}, \qquad \mathcal{A}_\theta = \{\theta_c^+, \theta_c^-\}$$

wherein $\mathcal{A}_f$ and $\mathcal{A}_\theta$ are the action spaces for camera focal-length zooming and camera pan-tilt rotation, respectively; $f_c^+$ denotes magnifying the focal length by a factor of 1.2 and $f_c^-$ reducing it by a factor of 0.9; $\theta_c^+$ denotes rotating the camera pan-tilt 5 degrees to the right and $\theta_c^-$ rotating it 5 degrees to the left;
52) defining the evaluation function in the policy iteration as:

$$v_\pi(x) = E_\pi\!\left[\sum_{i=1}^{H} \gamma^{\,i}\, R(x_i, a_i)\right]$$

wherein $R(\cdot)$ corresponds to the reward function $R_k(\cdot)$ in step 33); $x$ is the camera state quantity and $v_\pi(x)$ the evaluation function; $E_\pi[\cdot]$ is the expected return obtained after executing camera control policy $\pi$; $H$ is the total length of the control sequence and $h$ the step length; $\gamma \in (0,1)$ is a time penalty term intended to attenuate future reward terms; $p(x_i \mid x_{i-1}, a)$ is the state transition probability function, set to $p(x_i \mid x_{i-1}, a) = 0.99$, i.e. a 1% camera parameter adjustment failure rate is assumed;
53) within the reinforcement learning framework, the policy evaluation and policy update iteration is repeated until convergence.
6. The active perception method for planetary surface landforms based on reinforcement learning of claim 5, wherein a finite-step-length iteration policy is set in the policy update step in consideration of the processing capability of the on-board computer, that is, the following termination criteria are introduced into the iteration process:
the maximum number of iteration steps is 20, and the reinforcement learning task is terminated if effective convergence is not achieved within those 20 steps;
the image saliency is examined in real time during iteration, and the reinforcement learning task is terminated once the following conditions are met: the Euclidean distance from the centroid of the largest closed salient region in the current frame to the image center is less than 40 pixels, and the ratio of the pixel area of the largest salient region to the area of the camera imaging plane is between 0.25 and 0.5;
the camera parameters have reached a limit value and no further parameter adjustment can be performed.
CN201910343241.9A 2019-04-26 2019-04-26 Planetary surface landform active perception method based on reinforcement learning Active CN110222697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910343241.9A CN110222697B (en) 2019-04-26 2019-04-26 Planetary surface landform active perception method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910343241.9A CN110222697B (en) 2019-04-26 2019-04-26 Planetary surface landform active perception method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN110222697A CN110222697A (en) 2019-09-10
CN110222697B true CN110222697B (en) 2023-04-18

Family

ID=67820072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910343241.9A Active CN110222697B (en) 2019-04-26 2019-04-26 Planetary surface landform active perception method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN110222697B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103645480B (en) * 2013-12-04 2015-11-18 北京理工大学 Based on the topography and landform character construction method of laser radar and fusing image data
CN107292339B (en) * 2017-06-16 2020-07-21 重庆大学 Unmanned aerial vehicle low-altitude remote sensing image high-resolution landform classification method based on feature fusion
CN108319693A (en) * 2018-02-01 2018-07-24 张文淑 A kind of geomorphic feature clustering method based on three-dimensional Remote Sensing Database

Also Published As

Publication number Publication date
CN110222697A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
Maggio et al. Loc-nerf: Monte carlo localization using neural radiance fields
Chen et al. Parallel planning: A new motion planning framework for autonomous driving
CN110874578B (en) Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning
Scorsoglio et al. Image-based deep reinforcement learning for autonomous lunar landing
CN108921893A (en) A kind of image cloud computing method and system based on online deep learning SLAM
Nubert et al. Self-supervised learning of lidar odometry for robotic applications
CN111240356B (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
CN114625151B (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
CN111950873A (en) Satellite real-time guiding task planning method and system based on deep reinforcement learning
Yang et al. Real-time optimal navigation planning using learned motion costs
Scorsoglio et al. Safe Lunar landing via images: A Reinforcement Meta-Learning application to autonomous hazard avoidance and landing
CN113553943B (en) Target real-time detection method and device, storage medium and electronic device
Liu et al. A hierarchical reinforcement learning algorithm based on attention mechanism for uav autonomous navigation
Prasetyo et al. Spatial Based Deep Learning Autonomous Wheel Robot Using CNN
Ozaki et al. DNN-based self-attitude estimation by learning landscape information
Kulkarni et al. Semantically-enhanced deep collision prediction for autonomous navigation using aerial robots
CN110222697B (en) Planetary surface landform active perception method based on reinforcement learning
Goupilleau et al. Active learning for object detection in high-resolution satellite images
Piccinin et al. Deep reinforcement learning approach for small bodies shape reconstruction enhancement
Short et al. A bio-inspired algorithm in image-based path planning and localization using visual features and maps
Lu et al. Monocular semantic occupancy grid mapping with convolutional variational auto-encoders
Ribeiro et al. 3D monitoring of woody crops using an unmanned ground vehicle
Gao et al. Adaptability preserving domain decomposition for stabilizing sim2real reinforcement learning
Lyu et al. Ttr-based reward for reinforcement learning with implicit model priors
Qin et al. A path planning algorithm based on deep reinforcement learning for mobile robots in unknown environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant