WO2019015544A1 - Machine learning and object-seeking method and device - Google Patents

Machine learning and object-seeking method and device

Info

Publication number
WO2019015544A1
Authority
WO
WIPO (PCT)
Prior art keywords
state
seeking
strategy
robot
action
Prior art date
Application number
PCT/CN2018/095769
Other languages
English (en)
French (fr)
Inventor
孙海鸣
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州海康威视数字技术股份有限公司
Priority to EP18835074.8A (published as EP3657404B1)
Priority to US16/632,510 (published as US11548146B2)
Publication of WO2019015544A1

Classifications

    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06N 20/00: Machine learning
    • G06N 99/00: Subject matter not provided for in other groups of this subclass
    • B25J 11/00: Manipulators not otherwise provided for
    • B25J 11/008: Manipulators for service tasks
    • B25J 9/16: Programme-controlled manipulators; programme controls
    • B25J 9/163: Programme controls characterised by the control loop: learning, adaptive, model-based, rule-based expert control
    • B25J 9/1664: Programme controls characterised by programming, planning systems for manipulators: motion, path, trajectory planning
    • B25J 9/1679: Programme controls characterised by the tasks executed
    • G06F 18/217: Pattern recognition; design or setup of recognition systems; validation, performance evaluation, active pattern learning techniques
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06V 20/20: Scenes; scene-specific elements in augmented reality scenes

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to machine learning and object-seeking methods and devices.
  • robots using machine learning algorithms have also developed rapidly; more and more distinctive robots are being applied in people's daily lives, bringing convenience to them.
  • the purpose of the embodiments of the present application is to provide a machine learning and object-seeking method and device to improve the success rate when searching for objects.
  • the specific technical solutions are as follows:
  • an embodiment of the present application provides a machine learning method, which is applied to a robot, and the method includes:
  • selecting a state from a state set of the target object-seeking scene as a first state, wherein the state set is: a set of states that the robot can be in within the target object-seeking scene;
  • the object-seeking strategy comprises: the respective states that the robot passes through in sequence, starting from the starting state of the object-seeking strategy until the target object is found, and the actions performed by the robot when transitioning from each state to the next state;
  • performing strategy learning with the target optimal object-seeking strategy as the learning target, obtaining an object-seeking strategy for the robot to find the target object in the target object-seeking scene, and adding the obtained object-seeking strategy to an object-seeking strategy pool, wherein the obtained object-seeking strategy is: an object-seeking strategy in which the first state is the starting state and a second state is the terminating state, and the second state is: the state of the robot corresponding to the position of the target object in the target object-seeking scene;
  • performing strategy learning with the target optimal object-seeking strategy as the learning target and obtaining an object-seeking strategy for the robot to find the target object in the target object-seeking scene includes:
  • a target-type object-seeking strategy is: an object-seeking strategy in the object-seeking strategy pool for finding the target object;
  • performing strategy learning based on the reward function and obtaining the object-seeking strategy that maximizes the output value of the value function in the enhanced learning algorithm, as the object-seeking strategy for the robot to find the target object in the target object-seeking scene.
  • the target optimal object-seeking strategy is used as a learning target, and the target-type object-seeking strategy is used to determine a reward function in the enhanced learning algorithm for performing policy learning, including:
  • a reward function R that maximizes the value of the following expression is determined as the reward function in the enhanced learning algorithm for policy learning, where:
  • k represents the number of object-seeking strategies for finding the target object contained in the object-seeking strategy pool;
  • i represents the identifier of each object-seeking strategy for finding the target object in the object-seeking strategy pool;
  • π_i represents the object-seeking strategy identified as i in the object-seeking strategy pool;
  • π_d represents the target optimal object-seeking strategy;
  • S_0 represents the first state;
  • V_π represents the output value of the value function of the enhanced learning algorithm under the object-seeking strategy π;
  • M represents the number of states included in the object-seeking strategy π;
  • m represents the identifier of each state in the object-seeking strategy π;
  • t represents the number of state transitions in the object-seeking strategy π;
  • π(S_m) represents the action performed by the robot in the object-seeking strategy π when transitioning from the state S_m to the next state;
  • γ is a preset coefficient, 0 < γ < 1, and maximise() represents a maximum-taking function (an illustrative sketch follows below).
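  • The expression referenced above is not reproduced in this text; only the symbol definitions remain. Purely as an illustration, the following Python sketch assumes the objective is the summed margin between the discounted return of the target optimal strategy π_d and that of each of the k pooled strategies π_i, all evaluated from the first state S_0 with the coefficient γ; the function names and the candidate-reward search are not taken from the patent.

```python
GAMMA = 0.9  # preset coefficient gamma, 0 < gamma < 1 (value chosen for illustration)

def discounted_return(strategy, reward_fn, gamma=GAMMA):
    """V_pi(S_0): discounted sum of rewards along a strategy's (state, action) sequence."""
    return sum((gamma ** t) * reward_fn(state, action)
               for t, (state, action) in enumerate(strategy))

def margin_objective(reward_fn, optimal_strategy, pooled_strategies):
    """Sum over the pooled strategies of (V_{pi_d} - V_{pi_i}), both evaluated from S_0."""
    v_d = discounted_return(optimal_strategy, reward_fn)
    return sum(v_d - discounted_return(pi_i, reward_fn) for pi_i in pooled_strategies)

def select_reward(candidate_reward_fns, optimal_strategy, pooled_strategies):
    """maximise(): keep the candidate reward function R with the largest margin objective."""
    return max(candidate_reward_fns,
               key=lambda r: margin_objective(r, optimal_strategy, pooled_strategies))
```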
  • performing the policy learning based on the reward function and obtaining the object-seeking strategy that maximizes the output value of the value function in the enhanced learning algorithm includes:
  • learning, in a preset state transition manner, object-seeking strategies in which the first state is the starting state and the second state is the termination state;
  • R_e represents the reward function in the enhanced learning algorithm;
  • the object-seeking strategy corresponding to the maximum output value among the calculated output values is determined as the object-seeking strategy that maximizes the output value of the value function in the enhanced learning algorithm.
  • the next state of each state in the object-seeking strategy, and the action performed by the robot to transition from each state to the next state, are determined by:
  • determining the post-transition state and the action according to pre-counted probabilities of transitioning from the pre-transition state to other states, wherein the action belongs to an action set and the action set is: a set of actions performed by the robot when performing state transitions in the target object-seeking scene.
  • the state in the state set is obtained by:
  • the information sequence is composed of information elements, where the information elements include: a video frame and/or an audio frame;
  • the actions in the action set are obtained by:
  • the embodiment of the present application provides a method for searching for objects, which is applied to a robot, and the method includes:
  • the object-seeking strategy is: a strategy, obtained in advance by performing strategy learning with the optimal object-seeking strategy for finding the target object as the learning target, for finding the target object in the target object-seeking scene; and the object-seeking strategy includes: the respective states that the robot passes through in sequence, starting from the starting state of the object-seeking strategy until the target object is found, and the action performed by the robot when transitioning from each state to the next state;
  • determining, according to the object-seeking strategies for finding the target object currently in the object-seeking strategy pool, the action performed by the robot to transition from the current state to the next state includes:
  • calculating, according to the following expression, the output value of the value function of the preset enhanced learning algorithm under each object-seeking strategy in the strategy pool that contains the current state:
  • V_π represents the output value of the value function of the enhanced learning algorithm under the object-seeking strategy π;
  • M represents the number of states included in the object-seeking strategy π;
  • m represents the identifier of each state in the object-seeking strategy π;
  • n represents the identifier of the current state in the object-seeking strategy π;
  • x represents the number of state transitions from the current state to the strategy termination state in the object-seeking strategy π;
  • π(S_m) represents the action performed by the robot in the object-seeking strategy π when transitioning from the state S_m to the next state;
  • γ is a preset coefficient, 0 < γ < 1;
  • R_e represents the reward function in the enhanced learning algorithm;
  • An action of the robot to transition from a current state to a next state is determined from the target object-seeking strategy.
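  • The value-function expression is likewise missing from the extracted text. Assuming the standard discounted-return form suggested by the symbols above (a sum over the x transitions from the current state S_n to the strategy's termination state, weighted by γ and the reward R_e), a minimal illustrative helper is sketched below; the list-of-pairs layout for a strategy is an assumption:

```python
def value_from_current_state(strategy, current_index, reward_fn, gamma=0.9):
    """Evaluate V_pi from the current state S_n to the termination state of the strategy.

    strategy: list of (state, action) pairs; reward_fn plays the role of R_e."""
    remaining = strategy[current_index:]          # S_n .. termination state
    return sum((gamma ** t) * reward_fn(state, action)
               for t, (state, action) in enumerate(remaining))
```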
  • the obtaining the current state of the robot includes:
  • the information sequence is composed of information elements, where the information elements include: a video frame and/or an audio frame;
  • the state set is: a set of states that the robot can be in within the target object-seeking scene;
  • the state of the set of states that matches the selected information element is determined to be the current state of the robot.
  • an embodiment of the present application provides a machine learning device, which is applied to a robot, where the device includes:
  • a state selection module configured to select a state from a state set of the target object-seeking scene as a first state, wherein the state set is: a set of states of the robot in the target object-seeking scene;
  • a policy obtaining module configured to obtain a target optimal object-seeking strategy for finding a target object by using the first state as the starting state of the object-seeking strategy, wherein the object-seeking strategy includes: the respective states that the robot passes through in sequence, starting from the starting state of the object-seeking strategy until the target object is found, and the actions performed by the robot when transitioning from each state to the next state;
  • a strategy learning module configured to perform strategy learning with the target optimal object-seeking strategy as the learning target, obtain an object-seeking strategy for the robot to find the target object in the target object-seeking scene, and add the obtained object-seeking strategy to the object-seeking strategy pool, wherein the obtained object-seeking strategy is: an object-seeking strategy in which the first state is the starting state and the second state is the terminating state, and the second state is: the state of the robot corresponding to the position of the target object in the target object-seeking scene;
  • a policy comparison module configured to compare whether the obtained object-seeking strategy is consistent with the target optimal object-seeking strategy, and if yes, triggering the learning determination module, if not, triggering the state selection module;
  • the learning determination module is configured to determine that the policy learning in which the first state is the initial state of the object-seeking policy is completed.
  • the policy learning module includes:
  • a reward function determining sub-module configured to use the target optimal object-seeking strategy as the learning target and to use target-type object-seeking strategies to determine the reward function in the enhanced learning algorithm for performing policy learning, where a target-type object-seeking strategy is: an object-seeking strategy in the object-seeking strategy pool for finding the target object;
  • a policy obtaining sub-module configured to perform policy learning based on the reward function and obtain the object-seeking strategy that maximizes the output value of the value function in the enhanced learning algorithm, as the object-seeking strategy for the robot to find the target object in the target object-seeking scene;
  • a policy adding sub-module configured to add the obtained object-seeking strategy to the object-seeking strategy pool.
  • the reward function determining sub-module is specifically configured to determine a reward function R that maximizes the value of the following expression as the reward function in the enhanced learning algorithm for policy learning, where:
  • k represents the number of object-seeking strategies for finding the target object contained in the object-seeking strategy pool;
  • i represents the identifier of each object-seeking strategy for finding the target object in the object-seeking strategy pool;
  • π_i represents the object-seeking strategy identified as i in the object-seeking strategy pool;
  • π_d represents the target optimal object-seeking strategy;
  • S_0 represents the first state;
  • V_π represents the output value of the value function of the enhanced learning algorithm under the object-seeking strategy π;
  • M represents the number of states included in the object-seeking strategy π;
  • m represents the identifier of each state in the object-seeking strategy π;
  • t represents the number of state transitions in the object-seeking strategy π;
  • π(S_m) represents the action performed by the robot in the object-seeking strategy π when transitioning from the state S_m to the next state;
  • γ is a preset coefficient, 0 < γ < 1, and maximise() represents a maximum-taking function.
  • the policy learning submodule includes:
  • a policy learning unit configured to learn, in a preset state transition manner, object-seeking strategies in which the first state is the starting state and the second state is the termination state;
  • an output value calculation unit configured to calculate, according to the following expression, the output value of the value function in the enhanced learning algorithm under each learned object-seeking strategy:
  • R_e represents the reward function in the enhanced learning algorithm;
  • a policy determining unit configured to determine the object-seeking strategy corresponding to the maximum output value among the calculated output values as the object-seeking strategy that maximizes the output value of the value function in the enhanced learning algorithm;
  • a policy adding unit configured to add the obtained object-seeking strategy to the object-seeking policy pool.
  • the next state of each state in the object-seeking strategy, and the action performed by the robot to transition from each state to the next state, are determined according to pre-counted probabilities of transitioning from the pre-transition state to other states;
  • the action performed by the robot when transitioning from each state to the next state is an action belonging to the action set of the target object-seeking scene, wherein the action set is: a set of actions performed by the robot when performing state transitions in the target object-seeking scene.
  • the learning device further includes:
  • a state obtaining module configured to obtain a state in the state set
  • the status obtaining module includes:
  • a first sequence collection sub-module configured to collect an information sequence of the target object-seeking scene, wherein the information sequence is composed of information elements, where the information elements include: a video frame and/or an audio frame;
  • a first element quantity determining submodule configured to determine whether the number of unselected information elements in the information sequence is greater than a preset number, and if yes, triggering a status generating submodule;
  • a state generation sub-module configured to select the preset number of information elements from the unselected information elements in the information sequence, and generate a state in which the robot is in the target object-seeking scene, as Third state
  • a state determining sub-module configured to determine whether the third state exists in the state set, if not, the trigger state adds a sub-module, if yes, triggering the first element quantity determining sub-module;
  • a state adding submodule configured to add the third state to the state set, and trigger the first element number determining submodule.
  • the learning device further includes:
  • An action obtaining module configured to obtain an action in the action set
  • the action obtaining module includes:
  • a second sequence collection sub-module configured to obtain an action sequence corresponding to the information sequence, where the action sequence is composed of action elements, and the action elements in the action sequence correspond one-to-one to the information elements in the information sequence;
  • a second element quantity determining sub-module configured to determine whether the number of unselected action elements in the action sequence is greater than the preset quantity, and if yes, triggering an action generating sub-module;
  • An action generating submodule configured to select the preset number of action elements from the unselected action elements in the action sequence, and generate an action of the robot in the target object search scene, as the first action;
  • the action determining sub-module is configured to determine whether the first action exists in the action set, and if not, the trigger action adds a sub-module, and if yes, triggers the second element quantity determining sub-module;
  • the action adding submodule is configured to add the first action to the action set, and trigger the second element quantity determining submodule.
  • an embodiment of the present application provides a device for searching for a robot, the device comprising:
  • An instruction receiving module configured to receive a seek instruction for finding a target object in a target object search scene
  • a state obtaining module for obtaining a current state of the robot
  • an action determining module configured to determine, according to the object-seeking strategies for finding the target object currently in the object-seeking strategy pool, the action performed by the robot to transition from the current state to the next state, wherein an object-seeking strategy in the object-seeking strategy pool is: a strategy, obtained in advance by performing strategy learning with the optimal object-seeking strategy for finding the target object as the learning target, for finding the target object in the target object-seeking scene;
  • the object-seeking strategy includes: the respective states that the robot passes through in sequence, starting from the starting state of the object-seeking strategy until the target object is found, and the action performed by the robot when transitioning from each state to the next state;
  • the state conversion module is configured to perform the determined action to implement state transition, and determine whether the target object is found, and if not, trigger the state obtaining module.
  • the action determining module includes:
  • an output value calculation sub-module configured to calculate, according to the following expression, the output value of the value function of the preset enhanced learning algorithm under each object-seeking strategy in the strategy pool that contains the current state:
  • V_π represents the output value of the value function of the enhanced learning algorithm under the object-seeking strategy π;
  • M represents the number of states included in the object-seeking strategy π;
  • m represents the identifier of each state in the object-seeking strategy π;
  • n represents the identifier of the current state in the object-seeking strategy π;
  • x represents the number of state transitions from the current state to the strategy termination state in the object-seeking strategy π;
  • π(S_m) represents the action performed by the robot in the object-seeking strategy π when transitioning from the state S_m to the next state;
  • γ is a preset coefficient, 0 < γ < 1;
  • R_e represents the reward function in the enhanced learning algorithm;
  • a policy selection sub-module configured to select a search strategy corresponding to a maximum output value of the calculated output values as a target object-seeking strategy
  • the action determining sub-module is configured to determine, from the target object-seeking strategy, an action to be performed by the robot to transition from a current state to a next state.
  • the status obtaining module includes:
  • a sequence collection sub-module configured to collect an information sequence of the target object-seeking scene, wherein the information sequence is composed of information elements, where the information elements include: a video frame and/or an audio frame;
  • An element selection submodule configured to select a preset number of information elements from the information sequence
  • a state judging sub-module configured to determine whether, in a state set of the target object-seeking scene obtained in advance, there is a state matching the selected information elements, wherein the state set is: a set of states that the robot can be in within the target object-seeking scene, and if so, to trigger a state determining sub-module;
  • the state determining submodule is configured to determine a state in the set of states that matches the selected information element as a current state of the robot.
  • an embodiment of the present application provides a robot, including: a processor and a memory, where
  • a memory for storing a computer program
  • the processor when used to execute a program stored on the memory, implements the method steps described in the foregoing first aspect.
  • the embodiment of the present application provides a computer readable storage medium, where the computer readable storage medium is a computer readable storage medium in a robot, where the computer readable storage medium stores a computer program,
  • the method steps described in the aforementioned first aspect are implemented when the computer program is executed by the processor.
  • an embodiment of the present application provides a robot, including: a processor and a memory, where
  • a memory for storing a computer program
  • the processor when used to execute a program stored on the memory, implements the method steps described in the foregoing second aspect.
  • the embodiment of the present application provides a computer readable storage medium, where the computer readable storage medium is a computer readable storage medium in a robot, where the computer readable storage medium stores a computer program,
  • the method steps described in the aforementioned second aspect are implemented when the computer program is executed by the processor.
  • an embodiment of the present application provides an executable program code for being executed to implement the method steps described in the foregoing first aspect.
  • an embodiment of the present application provides an executable program code for being executed to implement the method steps described in the foregoing second aspect.
  • in the solutions provided by the embodiments of the present application, the robot takes a state in the state set of the target object-seeking scene as the starting state of an object-seeking strategy, obtains the target optimal object-seeking strategy for finding the target object, and performs strategy learning with the target optimal object-seeking strategy as the learning target, thereby obtaining object-seeking strategies for the robot to find the target object in the target object-seeking scene.
  • when searching for the object, the robot can rely on the object-seeking strategies obtained through the foregoing learning without using a positioning device of its own, and is therefore not affected by the object-seeking scene, thereby improving the probability of successfully finding the object.
  • FIG. 1 is a schematic flowchart of a machine learning method according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another machine learning method according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for obtaining a state in a state set according to an embodiment of the present disclosure
  • FIG. 4 is a schematic flowchart of a method for searching for objects according to an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of a machine learning apparatus according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of another machine learning apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an apparatus for obtaining a state in a state set according to an embodiment of the present disclosure
  • FIG. 8 is a schematic structural diagram of an object searching device according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a robot according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another robot according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a machine learning method according to an embodiment of the present disclosure. The method is applied to a robot, and includes:
  • S101 Select a state from a state set of the target object-seeking scene as the first state.
  • Robots can work in different scenes, such as home scenes, factory workshop scenes, and so on. Whatever the scene in which a robot works, it may involve finding an object; in this case, the scene in which the robot works may also be referred to as an object-seeking scene. Taking a home scene as an example, the robot may be asked to find the family's pet dog, to find a child's toy at home, and so on.
  • because scenes differ from one another, the operations required of the robot differ as well, so the states the robot can be in may differ between object-seeking scenes; in addition, when the robot works in the same object-seeking scene it may be at different positions within that scene, so the robot may be in a variety of states within each scene.
  • the state set can be understood as corresponding to the robot's object-seeking scene, that is, the state set can be understood as a set of states of the robot in the target object-seeking scene.
  • the state in which the robot can be in the scene may be related to the location in the home scene.
  • for example, a state may be that the robot is in the central area of the living room of the home scene, in the southeast corner area of the study, and so on.
  • the state that the robot can be in the object-seeking scene may be based on the robot's acquisition in the scene. Video frames and/or audio frames are determined.
  • when selecting a state from the state set of the target object-seeking scene, the state may be selected from the state set randomly, or it may be selected from the state set according to a certain rule.
  • the selection may be made according to the actual situation; this application does not limit the way the state is selected.
  • the process of the robot searching for an object can be understood as the following process:
  • the robot transitions from its current state to the next state, and so on, until it reaches the state corresponding to the position of the target object;
  • when the robot transitions from the current state to the next state, the transition can be realized by performing certain actions;
  • because the actions performed while the robot is in the current state may differ, the state reached after performing an action may also differ.
  • an object-seeking strategy can therefore be understood as: a strategy that starts from the state the robot is currently in within the target object-seeking scene and proceeds until the target object is found.
  • the object-seeking strategy includes: the robot starts from the initial state of the object-seeking strategy to find each state that the object is sequentially experienced, and the action that is performed by the robot from each state to the next state.
  • the action to be performed by the robot from each state to the next state may be different depending on the scene in which it works. Specifically, the above actions may be turning left, turning right, going forward, going backward, etc. Etc., this application does not limit this.
  • the next state of each state in the object-seeking policy and the action performed by the robot from each state to the next state may be determined by:
  • determining, according to pre-counted probabilities of transitioning from the pre-transition state to other states, the post-transition state and the action, belonging to the action set of the target object-seeking scene, that the robot performs to transition from the pre-transition state to the post-transition state.
  • the actions that the robot can perform when making state transitions during the object-seeking process are generally limited; based on this, the above action set is: a set of actions that the robot performs when making state transitions in the target object-seeking scene.
  • the robot After obtaining the state set and the action set of the target object-seeking scene, the robot can be simply considered to have determined the state that the robot can be in the target object-finding scene and the action that can be performed when the state transition is performed.
  • the inventor collects data related to state transition through a large number of random repeated experiments, and then counts the actions to be performed by the robot when the state transitions between the two states, and the two states under the corresponding actions. The probability of achieving a state transition between.
  • for example, the actions performed by the robot may be captured by a binocular camera or a TOF (Time of Flight) camera, thereby acquiring three-dimensional data describing the robot in each state or the actions the robot performs in each state.
  • the probability of transitioning from one state to another can be obtained statistically by the following expression: P(S_i, A_i, S_j) = x / y, where:
  • P(S_i, A_i, S_j) represents the probability of the robot transitioning from state S_i to state S_j by performing action A_i;
  • x represents the number of times the combination (S_i, A_i, S_j) occurs in a large number of random repeated experiments, that is, the number of times the robot transitions from state S_i to state S_j by performing action A_i;
  • y represents the number of times the combination (S_i, A_i) occurs in a large number of random repeated experiments, that is, the number of times action A_i is performed while the robot is in state S_i (see the counting sketch below).
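  • The counting rule above reduces to the relative frequency P(S_i, A_i, S_j) = x / y. A small illustrative sketch of the bookkeeping, together with the greedy choice described in the next paragraph; the data structures are assumptions, not the patent's implementation:

```python
from collections import Counter

triple_counts = Counter()   # x: how often (S_i, A_i, S_j) was observed
pair_counts = Counter()     # y: how often (S_i, A_i) was observed

def record_transition(s_i, a_i, s_j):
    """Log one transition observed during a random repeated experiment."""
    triple_counts[(s_i, a_i, s_j)] += 1
    pair_counts[(s_i, a_i)] += 1

def transition_probability(s_i, a_i, s_j):
    """P(S_i, A_i, S_j) = x / y: empirical probability of reaching S_j from S_i via A_i."""
    y = pair_counts[(s_i, a_i)]
    return triple_counts[(s_i, a_i, s_j)] / y if y else 0.0

def greedy_transition(s_i, action_set, state_set):
    """Pick the (action, post-transition state) pair with the largest counted probability."""
    return max(((a, s_j) for a in action_set for s_j in state_set),
               key=lambda pair: transition_probability(s_i, pair[0], pair[1]))
```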
  • in this way, from the pre-counted probabilities of transitioning from the pre-transition state to other states, the state corresponding to the largest probability may be selected as the post-transition state, and the action corresponding to that largest probability may be taken as the action to be performed when the robot transitions from the pre-transition state to the post-transition state.
  • the process by which a person finds a target object may be considered an optimal process; the target optimal object-seeking strategy may therefore be: a strategy obtained by abstracting the process of a person searching for the target object starting from the first state described above.
  • S103 Perform strategy learning with the target optimal object-seeking strategy as the learning target, obtain an object-seeking strategy for the robot to find the target object in the target object-seeking scene, and add the obtained object-seeking strategy to the object-seeking strategy pool.
  • the obtained object-seeking strategy is: an object-seeking strategy with the first state as the starting state and the second state as the termination state, and the second state is: the state of the robot corresponding to the position of the target object in the target object-seeking scene.
  • in one implementation, the second state may be input to the robot as a parameter at the beginning of implementing the solution provided by the embodiment of the present application; in another implementation, the robot may use its own visual and/or speech functions during the search for the target object to detect whether the target object has been found after each transition to a new state, and if it has, determine the state it is in as the second state described above.
  • the above-mentioned object-seeking strategy pool is used to store a search strategy for finding an object in a target-finding scene.
  • in the first case, the object-seeking strategies stored in the object-seeking strategy pool may be only object-seeking strategies for finding the target object in the target object-seeking scene; in the second case, the object-seeking strategies stored in the object-seeking strategy pool may be the strategies mentioned in the first case together with object-seeking strategies for finding other objects in the target object-seeking scene. This application describes these cases only by way of example and does not limit the object-seeking strategies stored in the object-seeking strategy pool.
  • the initial object-seeking strategy for the object is stored in the object-seeking strategy pool.
  • the initial object-seeking strategy can be randomly set.
  • each learned object-seeking strategy is added to the object-seeking strategy pool, so that through continuous iterative learning the object-seeking strategies in the pool become more and more abundant.
  • the object-seeking strategy for finding a target object in the target object-finding scene by the robot may be obtained based on the enhanced learning algorithm.
  • S104 Compare whether the obtained object-seeking strategy is consistent with the target optimal object-seeking strategy. If they are consistent, execute S105. If they are inconsistent, return to S101.
  • the state selection may still be performed in a random manner, or may be selected according to a certain rule.
  • the state selected again may be the same as or different from the previously selected state, and the present application does not limit this.
  • S105 Determine to complete the policy learning in which the first state is the initial state of the object-seeking strategy.
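  • Read together, S101 to S105 form an iterative loop. The following sketch only illustrates that control flow; every helper passed in stands for a step described above and is not defined by the patent:

```python
def learn_object_seeking_strategies(state_set, strategy_pool,
                                    select_state,
                                    obtain_target_optimal_strategy,
                                    learn_strategy):
    """Iterate S101-S105 until a learned strategy matches the target optimal strategy."""
    while True:
        first_state = select_state(state_set)                          # S101: pick a first state
        optimal = obtain_target_optimal_strategy(first_state)          # target optimal strategy, e.g. abstracted from a person's search
        learned = learn_strategy(first_state, optimal, strategy_pool)  # S103: strategy learning with the optimal strategy as target
        strategy_pool.append(learned)                                  # S103: add the learned strategy to the pool
        if learned == optimal:                                         # S104: compare the two strategies
            return strategy_pool                                       # S105: learning for this first state is complete
        # otherwise return to S101 and select a state again
```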
  • in the solution provided by this embodiment, the robot takes a state in the state set of the target object-seeking scene as the starting state of the object-seeking strategy, obtains the target optimal object-seeking strategy for finding the target object, and performs strategy learning with the target optimal object-seeking strategy as the learning target, obtaining object-seeking strategies for the robot to find the target object in the target object-seeking scene.
  • when searching for the object, the robot can rely on the object-seeking strategies obtained through the foregoing learning without using a positioning device of its own, and is therefore not affected by the object-seeking scene, thereby improving the probability of successfully finding the object.
  • FIG. 2 provides a schematic flowchart of another machine learning method.
  • in this method, performing strategy learning with the target optimal object-seeking strategy as the learning target, obtaining the object-seeking strategy for the robot to find the target object in the target object-seeking scene, and adding the obtained object-seeking strategy to the object-seeking strategy pool (S103) includes:
  • S103A Using the target optimal object-seeking strategy as the learning target, target-type object-seeking strategies are used to determine the reward function in the enhanced learning algorithm for strategy learning.
  • a target-type object-seeking strategy is: an object-seeking strategy in the object-seeking strategy pool for finding the target object.
  • Enhanced learning (reinforcement learning) is a machine learning method: it abstracts the real world in terms of states and actions so as to seek the optimal value return, and finally finds the optimal strategy through certain training and learning methods.
  • the inventors have found through experiments that the use of reinforcement learning can enable robots to improve their performance and make behavior choices through learning, and then make decisions, and implement state changes by selecting and executing certain actions.
  • enhanced learning algorithms generally need to include a reward function of the strategy and a value function of the strategy, wherein the value function of the strategy is a function related to the reward function.
  • for different object-seeking scenes the reward function is generally different, so it needs to be learned in combination with the specific object-seeking scene in order to obtain a reward function adapted to that scene.
  • S103B Based on the reward function, strategy learning is performed and the object-seeking strategy that maximizes the output value of the value function in the enhanced learning algorithm is obtained, as the object-seeking strategy for the robot to find the target object in the target object-seeking scene.
  • the enhanced learning algorithm is introduced when the strategy learning is performed, so that the robot can learn the object-seeking strategy for finding the target object more efficiently.
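  • A hedged sketch of how S103A and S103B could fit together: candidate strategies from the first state to the second state are rolled out with the preset state-transition manner, each candidate is scored with the value function under the learned reward R_e, and the highest-scoring one is kept. The rollout limit and the function signatures are assumptions:

```python
def rollout(first_state, second_state, greedy_transition, action_set, state_set, max_steps=50):
    """Roll out one candidate strategy from the first state towards the second (termination)
    state using the preset state-transition manner (highest counted probability at each step)."""
    strategy, state = [], first_state
    for _ in range(max_steps):
        action, next_state = greedy_transition(state, action_set, state_set)
        strategy.append((state, action))
        if next_state == second_state:
            strategy.append((next_state, None))   # termination state, no further action
            break
        state = next_state
    return strategy

def best_strategy(candidates, reward_fn, gamma=0.9):
    """S103B: return the candidate that maximizes the value function under the reward R_e."""
    def value(strategy):
        return sum((gamma ** t) * reward_fn(s, a)
                   for t, (s, a) in enumerate(strategy) if a is not None)
    return max(candidates, key=value)
```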
  • Embodiment 1:
  • in this embodiment, using the target optimal object-seeking strategy as the learning target and using target-type object-seeking strategies to determine the reward function in the enhanced learning algorithm for policy learning (S103A) includes:
  • a reward function R that maximizes the value of the following expression is determined as the reward function in the enhanced learning algorithm for policy learning, where:
  • k represents the number of object-seeking strategies for finding the target object in the object-seeking strategy pool;
  • i represents the identifier of each object-seeking strategy for finding the target object in the object-seeking strategy pool;
  • π_i represents the object-seeking strategy identified as i in the object-seeking strategy pool;
  • π_d represents the target optimal object-seeking strategy;
  • S_0 represents the first state;
  • V_π represents the output value of the value function of the above enhanced learning algorithm under the object-seeking strategy π;
  • M represents the number of states included in the object-seeking strategy π;
  • m represents the identifier of each state in the object-seeking strategy π;
  • t represents the number of state transitions in the object-seeking strategy π;
  • π(S_m) represents the action performed by the robot in the object-seeking strategy π when transitioning from the state S_m to the next state;
  • γ is a preset coefficient, 0 < γ < 1, and maximise() represents a maximum-taking function.
  • Embodiment 2:
  • the strategy learning is performed based on the reward function, and the object-finding strategy that maximizes the output value of the value function in the enhanced learning algorithm is obtained, including:
  • learning, in the preset state transition manner, object-seeking strategies with the first state as the starting state and the second state as the termination state;
  • R_e represents the reward function in the above enhanced learning algorithm;
  • the object-seeking strategy corresponding to the maximum output value among the calculated output values is determined as the object-seeking strategy that maximizes the output value of the value function in the above-described enhanced learning algorithm.
  • the preset state transition manner may be a manner of performing state transitions according to a pre-agreed conversion relationship between states.
  • specifically, the preset state transition manner may be: realizing the state transition from the pre-transition state to a post-transition state as follows.
  • the probability of transitioning from one state to another can be obtained statistically by the following expression: P(S_i, A_i, S_j) = x / y, where:
  • P(S_i, A_i, S_j) represents the probability of the robot transitioning from state S_i to state S_j by performing action A_i;
  • x represents the number of times the combination (S_i, A_i, S_j) occurs in a large number of random repeated experiments, that is, the number of times the robot transitions from state S_i to state S_j by performing action A_i;
  • y represents the number of times the combination (S_i, A_i) occurs in a large number of random repeated experiments, that is, the number of times action A_i is performed while the robot is in state S_i.
  • from the pre-counted probabilities of transitioning from the pre-transition state to other states, the state corresponding to the largest probability is selected as the post-transition state, and the action corresponding to that largest probability is used as the action to be performed when the robot transitions from the pre-transition state to the post-transition state.
  • in the above solution, the robot takes a state in the state set of the target object-seeking scene as the starting state of the object-seeking strategy and performs strategy learning in combination with the enhanced learning algorithm, obtaining object-seeking strategies for finding the target object in the target object-seeking scene.
  • when searching for the object, the robot can rely on the object-seeking strategies obtained through the foregoing learning without using a positioning device of its own, and is therefore not affected by the object-seeking scene, so the probability of successfully finding the object is improved.
  • the robot can efficiently learn the object-seeking strategy in the process of strategy learning, thereby improving the performance of the robot itself.
  • Embodiment 3:
  • Referring to FIG. 3, a schematic flowchart of a method for obtaining the states in the state set is provided; the method includes:
  • S301 Collect a sequence of information of a target object search scene.
  • the above information sequence is composed of information elements including: a video frame and/or an audio frame.
  • the foregoing information sequence may be collected by the robot during random cruising in the target object-seeking scene.
  • S302 Determine whether the number of unselected information elements in the information sequence is greater than a preset number. If yes, perform S303.
  • the preset number may be set according to a plurality of experimental statistics, or may be set according to information such as the type of the target scene, and is not limited in this application.
  • S303 Select a preset number of information elements from the unselected information elements in the information sequence, and generate a state in which the robot is in the target object search scene as the third state.
  • when generating a state in which the robot is in the target object-seeking scene, a vector may be formed from the selected information elements, and the formed vector represents a state of the robot in the target object-seeking scene.
  • S304 Determine whether there is a third state in the state set. If not, execute S305. If yes, return to S302.
  • specifically, the vector representing the third state may be matched one by one against the vectors representing the states in the state set; if a matching vector exists, the third state is already present in the state set; otherwise, the third state is not present in the state set.
  • whether the third state and each state in the state set are similar may be detected by using a pre-trained network model.
  • the above network model can be trained as follows:
  • the robot collects a sequence of information as a sample information sequence when arbitrarily cruising in the target object-seeking scene;
  • two groups of model input parameters are formed from the sample segments after state labeling and are input into a preset neural network model for training, obtaining a network model for detecting whether two states are similar; this model may also be called a twin network model (an illustrative sketch follows below).
  • the information sequence collected by the above-mentioned robot when arbitrarily cruising in the target object-seeking scene is the environment information of the object-seeking scene.
  • the information sequence may be understood to be composed of information elements including: a video frame and/or an audio frame.
  • Selecting a sample segment from the sample information sequence can be understood as: selecting a plurality of consecutively collected information elements in the sample information sequence.
  • in this application, the collection of the selected information elements is referred to as a sample segment; the number of selected information elements may be equal to or different from the aforementioned preset number, which is not limited in this application.
  • some content in the sample information sequence may not be representative, or the collected sample sequence may contain a large amount of repeated content; in view of this, information elements in the sample information sequence that fall into these situations may be left unselected.
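  • A minimal sketch of such a twin (siamese) network in PyTorch; the layer sizes, the distance threshold and the choice of PyTorch itself are assumptions made for illustration, since the patent only states that a weight-sharing model judges whether two state inputs are similar:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinNetwork(nn.Module):
    """Two weight-sharing encoders; similar states map to nearby embeddings."""
    def __init__(self, state_dim, embed_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, state_a, state_b):
        return self.encoder(state_a), self.encoder(state_b)

def states_similar(model, state_a, state_b, threshold=1.0):
    """Judge whether two state vectors describe the same place in the scene."""
    with torch.no_grad():
        emb_a, emb_b = model(state_a.unsqueeze(0), state_b.unsqueeze(0))
        return F.pairwise_distance(emb_a, emb_b).item() < threshold
```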
  • in the above solution, the robot obtains its various states in the target object-seeking scene by collecting the information sequence of the target object-seeking scene and analyzing it, so that the user does not need to manually set the states of the robot in the target object-seeking scene, which improves the degree of automation of the robot; on the other hand, the robot can adaptively obtain its states for different scenes, which improves the robot's adaptability to different scenes.
  • Embodiment 4:
  • this embodiment provides a method for obtaining the actions in the action set.
  • in this solution, the robot obtains its actions in the target object-seeking scene by collecting the action sequence corresponding to the information sequence of the target object-seeking scene and analyzing that action sequence, so that the user does not need to manually set the actions of the robot in the target object-seeking scene, which improves the degree of automation of the robot; on the other hand, the robot can adaptively obtain its actions for different scenes, which improves the robot's adaptability to different scenes.
  • the embodiment of the present application further provides a method for searching for objects.
  • Figure 4 provides a schematic flow diagram of a method of searching for objects, the method being applied to a robot, including:
  • S401 Receive a search instruction for finding a target object in a target object search scene.
  • S402 Obtain the current state of the robot.
  • S403 Determine, according to the object-seeking strategies for finding the target object currently in the object-seeking strategy pool, the action performed by the robot to transition from the current state to the next state.
  • an object-seeking strategy in the above object-seeking strategy pool may be: a strategy, obtained in advance by performing strategy learning with the optimal object-seeking strategy for finding the target object as the learning target, for finding the target object in the target object-seeking scene.
  • the specific manner of performing the policy learning can be referred to the specific manner provided in the foregoing part of the machine learning method, and details are not described herein again.
  • the object-seeking strategy includes: the robot starts from the initial state of the object-seeking strategy to find each state that the object is sequentially experienced, and the action that is performed by the robot from each state to the next state.
  • S404 Perform the determined action to implement state transition, and determine whether the target is found. If not, return to execute S402 until the target is found.
  • determining, according to the object-seeking strategies for finding the target object currently in the object-seeking strategy pool, the action performed by the robot to transition from the current state to the next state includes:
  • calculating, according to the following expression, the output value of the value function of the preset enhanced learning algorithm under each object-seeking strategy for finding the target object in the current object-seeking strategy pool, where:
  • V_π represents the output value of the value function of the above enhanced learning algorithm under the object-seeking strategy π;
  • M represents the number of states included in the object-seeking strategy π;
  • m represents the identifier of each state in the object-seeking strategy π;
  • n represents the identifier of the current state in the object-seeking strategy π;
  • x represents the number of state transitions from the current state to the strategy termination state in the object-seeking strategy π;
  • π(S_m) represents the action performed by the robot in the object-seeking strategy π when transitioning from the state S_m to the next state;
  • γ is a preset coefficient, 0 < γ < 1;
  • R_e represents the reward function in the above enhanced learning algorithm;
  • the object-seeking strategy corresponding to the maximum output value among the calculated output values is selected as the target object-seeking strategy (a sketch of this action selection follows below).
  • the action to be performed by the robot from the current state to the next state is determined from the target object-seeking strategy.
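  • Tying S402 to S404 together, a sketch of the action selection: among the pooled strategies that contain the current state, score each with the value function from the current state onward and return the next action of the best one. The list-of-pairs layout for a strategy is an assumption:

```python
def next_action(current_state, strategy_pool, reward_fn, gamma=0.9):
    """Return the action the robot performs to leave the current state (S403).

    Each strategy is a list of (state, action) pairs whose last state is the one
    where the target object is located."""
    best_value, best_action = float("-inf"), None
    for strategy in strategy_pool:
        states = [s for s, _ in strategy]
        if current_state not in states:
            continue                                   # only strategies containing the current state
        n = states.index(current_state)
        value = sum((gamma ** t) * reward_fn(s, a)
                    for t, (s, a) in enumerate(strategy[n:]) if a is not None)
        if value > best_value:
            best_value, best_action = value, strategy[n][1]
    return best_action
```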
  • the information sequence of the target object-seeking scene may be collected, and a preset number of information elements are selected from the information sequence, and the state collection of the target object-seeking scene obtained in advance is determined. Whether there is a state that matches the selected information element, and if so, a state in the state set that matches the selected information element is determined as the current state of the robot.
  • the above information sequence is composed of information elements including: a video frame and/or an audio frame.
  • the set of states is a collection of states that the robot can have in the target-finding scene.
  • in the above solution, the robot performs strategy learning with the optimal object-seeking strategy for finding the target object as the learning target and obtains strategies for finding the target object in the target object-seeking scene; when searching for the object, the robot then searches on the basis of the various object-seeking strategies obtained through this learning, so it does not need to use a positioning device of its own to find the object, is not affected by the object-seeking scene, and the probability of successfully finding the object is improved.
  • the embodiment of the present application further provides a machine learning device.
  • FIG. 5 is a machine learning device according to an embodiment of the present application.
  • the device is applied to a robot, and includes:
  • a state selection module 501 configured to select a state from a state set of the target object-seeking scene as a first state, wherein the state set is: a set of states of the robot in the target object-seeking scene;
  • the strategy obtaining module 502 is configured to obtain a target optimal object-seeking strategy for finding a target object by using the first state as the starting state of the object-seeking strategy, wherein the object-seeking strategy includes: the respective states that the robot passes through in sequence, starting from the starting state of the object-seeking strategy until the target object is found, and the actions performed by the robot when transitioning from each state to the next state;
  • the strategy learning module 503 is configured to perform strategy learning with the target optimal object-seeking strategy as the learning target, obtain an object-seeking strategy for the robot to find the target object in the target object-seeking scene, and add the obtained object-seeking strategy to the object-seeking strategy pool, wherein the obtained object-seeking strategy is: an object-seeking strategy in which the first state is the starting state and the second state is the terminating state, and the second state is: the state of the robot corresponding to the position of the target object in the target object-seeking scene;
  • the policy comparison module 504 is configured to compare whether the obtained object-seeking strategy is consistent with the target optimal object-seeking strategy, if yes, the trigger learning determination module 505, if not, triggers the state selection module 501;
  • the learning determination module 505 is configured to determine to complete the policy learning in which the first state is the initial state of the object-seeking policy.
  • the next state of each state in the object-seeking strategy, and the action performed by the robot to transition from each state to the next state, are determined according to pre-counted probabilities of transitioning from the pre-transition state to other states;
  • the action performed by the robot when transitioning from each state to the next state is an action belonging to the action set of the target object-seeking scene, wherein the action set is: a set of actions performed by the robot when performing state transitions in the target object-seeking scene.
  • with the above device, the robot takes a state in the state set of the target object-seeking scene as the starting state of the object-seeking strategy, obtains the target optimal object-seeking strategy for finding the target object, and performs strategy learning with the target optimal object-seeking strategy as the learning target, obtaining object-seeking strategies for the robot to find the target object in the target object-seeking scene.
  • when searching for the object, the robot can rely on the object-seeking strategies obtained through the foregoing learning without using a positioning device of its own, and is therefore not affected by the object-seeking scene, thereby improving the probability of successfully finding the object.
  • the policy learning module 503 includes:
  • a reward function determining sub-module 503A configured to use the target optimal object-seeking strategy as the learning target and to use target-type object-seeking strategies to determine the reward function in the enhanced learning algorithm for performing policy learning, where a target-type object-seeking strategy is: an object-seeking strategy in the object-seeking strategy pool for finding the target object;
  • a policy obtaining sub-module 503B configured to perform policy learning based on the reward function and obtain the object-seeking strategy that maximizes the output value of the value function in the enhanced learning algorithm, as the object-seeking strategy for the robot to find the target object in the target object-seeking scene;
  • a policy adding sub-module 503C configured to add the obtained object-seeking strategy to the object-seeking strategy pool.
  • The reward function determining sub-module 503A is specifically configured to determine, as the reward function of the reinforcement learning algorithm used for strategy learning, the reward function R that maximizes the value of a preset expression defined in terms of the following quantities:
  • k denotes the number of object-seeking strategies for finding the target object contained in the object-seeking strategy pool;
  • i denotes the identifier of each such strategy in the pool;
  • π_i denotes the strategy identified as i in the pool;
  • π_d denotes the target optimal object-seeking strategy;
  • S_0 denotes the first state;
  • V_π denotes the output value of the value function of the reinforcement learning algorithm under object-seeking strategy π;
  • M denotes the number of states contained in strategy π;
  • m denotes the identifier of each state in strategy π;
  • t denotes the number of state transitions in strategy π;
  • π(S_m) denotes the action the robot performs in strategy π to transition from state S_m to the next state;
  • γ is a preset coefficient, 0 < γ < 1, and maximise() denotes the maximum-value function. An illustrative sketch of this selection procedure is given below.
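The exact expression maximized by sub-module 503A is reproduced only as an image in the original filing, so the sketch below merely illustrates, under stated assumptions, the kind of selection the variable definitions above describe: choosing, from a candidate set, the reward function R that maximizes the margin between the value of the demonstrated optimal strategy π_d and the values of the k pool strategies π_i, all evaluated from the first state S_0. The candidate-set formulation, the function names, and the discounted-sum form of the value are assumptions, not the published formula.

```python
def strategy_value(strategy, reward_fn, gamma=0.9):
    # strategy: ordered list of (state, action) pairs, the action being pi(S_m) at each state S_m.
    # Discounted accumulation of rewards along the strategy, an assumed form of V_pi(S_0).
    return sum((gamma ** t) * reward_fn(state, action)
               for t, (state, action) in enumerate(strategy))

def choose_reward_function(candidate_rewards, optimal_strategy, pool_strategies, gamma=0.9):
    # Pick the candidate reward function that maximizes the total margin between the
    # demonstrated optimal strategy and every strategy already in the pool.
    def total_margin(reward_fn):
        v_opt = strategy_value(optimal_strategy, reward_fn, gamma)
        return sum(v_opt - strategy_value(pi_i, reward_fn, gamma)
                   for pi_i in pool_strategies)
    return max(candidate_rewards, key=total_margin)
```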
  • The strategy obtaining sub-module 503B may include:
  • a strategy learning unit, configured to learn, according to a preset state-transition manner, candidate object-seeking strategies whose starting state is the first state and whose terminating state is the second state;
  • an output value calculation unit, configured to calculate, for each learned candidate strategy, the output value of the value function of the reinforcement learning algorithm according to a preset expression, where R_e denotes the reward function of the reinforcement learning algorithm;
  • a strategy determining unit, configured to determine the candidate strategy corresponding to the largest of the calculated output values as the strategy that maximizes the output value of the value function;
  • a strategy adding unit, configured to add the obtained object-seeking strategy to the object-seeking strategy pool. A sketch of this candidate-selection step follows.
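As a sketch of how sub-module 503B could put these pieces together (reusing `strategy_value` from the previous sketch), candidate strategies from the first state to the second state are generated by the preset state-transition manner and the highest-valued one is kept. `rollout_strategy` and the rollout count are illustrative assumptions.

```python
def learn_object_seeking_strategy(first_state, second_state, rollout_strategy,
                                  reward_fn, gamma=0.9, num_rollouts=100):
    # rollout_strategy(first_state, second_state) is assumed to return one candidate
    # strategy, i.e. a list of (state, action) pairs ending in the second (terminating)
    # state, produced by the preset state-transition manner.
    candidates = [rollout_strategy(first_state, second_state) for _ in range(num_rollouts)]
    return max(candidates, key=lambda pi: strategy_value(pi, reward_fn, gamma))
```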
  • In this way, the robot takes a state from the state set of the target object-seeking scene as the starting state of an object-seeking strategy and, combined with the preset reinforcement learning algorithm, learns the various strategies for finding the target object in that scene.
  • When the robot later searches for the target object in the target object-seeking scene, it can use the strategies obtained through this learning without relying on its own positioning device, so it is not affected by the particular scene, which improves the probability of finding the object successfully.
  • In one implementation, the learning device further includes a state obtaining module 506;
  • the state obtaining module 506 is configured to obtain the states in the state set.
  • The state obtaining module 506 includes:
  • a first sequence collection sub-module 506A, configured to collect an information sequence of the target object-seeking scene, where the information sequence is composed of information elements and the information elements include video frames and/or audio frames;
  • a first element quantity determining sub-module 506B, configured to determine whether the number of not-yet-selected information elements in the information sequence is greater than a preset number and, if so, to trigger the state generation sub-module 506C;
  • a state generation sub-module 506C, configured to select the preset number of information elements from the not-yet-selected elements of the information sequence and generate from them a state of the robot in the target object-seeking scene, as a third state;
  • a state judging sub-module 506D, configured to determine whether the third state already exists in the state set; if not, it triggers the state adding sub-module 506E, and if so, it triggers the first element quantity determining sub-module 506B;
  • a state adding sub-module 506E, configured to add the third state to the state set and trigger the first element quantity determining sub-module 506B. A sketch of this state-set construction loop follows.
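A minimal sketch of the construction loop implemented by sub-modules 506A–506E: slide over the collected information sequence in chunks of the preset number of not-yet-selected elements, turn each chunk into a candidate third state, and add it to the state set if no matching state is already present. `make_state` (for example, stacking the selected frames into a vector) and `states_match` (for example, a learned similarity check) are assumptions standing in for the concrete implementations.

```python
def build_state_set(info_sequence, preset_count, make_state, states_match):
    # info_sequence: list of information elements (video frames and/or audio frames).
    state_set = []
    cursor = 0  # everything before the cursor has already been selected
    while len(info_sequence) - cursor > preset_count:    # 506B: enough unselected elements?
        chunk = info_sequence[cursor:cursor + preset_count]
        cursor += preset_count
        third_state = make_state(chunk)                   # 506C: generate a candidate state
        if not any(states_match(third_state, s) for s in state_set):   # 506D: already known?
            state_set.append(third_state)                 # 506E: add the new state
    return state_set
```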
  • The apparatus may further include an action obtaining module configured to obtain the actions in the action set.
  • The action obtaining module includes:
  • a second sequence collection sub-module, configured to obtain the action sequence corresponding to the information sequence, where the action sequence is composed of action elements and each action element corresponds one-to-one with an information element of the information sequence;
  • a second element quantity determining sub-module, configured to determine whether the number of not-yet-selected action elements in the action sequence is greater than the preset number and, if so, to trigger the action generating sub-module;
  • an action generating sub-module, configured to select the preset number of action elements from the not-yet-selected elements of the action sequence and generate from them an action of the robot in the target object-seeking scene, as a first action;
  • an action determining sub-module, configured to determine whether the first action already exists in the action set; if not, it triggers the action adding sub-module, and if so, it triggers the second element quantity determining sub-module;
  • an action adding sub-module, configured to add the first action to the action set and trigger the second element quantity determining sub-module. The action set is built symmetrically to the state set, as sketched below.
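The action set can be built with the same sliding construction applied to the action sequence that is aligned one-to-one with the information sequence; this is a sketch under the same assumptions as `build_state_set`, with `make_action` and `actions_match` as hypothetical stand-ins.

```python
def build_action_set(action_sequence, preset_count, make_action, actions_match):
    # action_sequence: action elements aligned one-to-one with the information sequence.
    action_set = []
    cursor = 0
    while len(action_sequence) - cursor > preset_count:
        chunk = action_sequence[cursor:cursor + preset_count]
        cursor += preset_count
        first_action = make_action(chunk)
        if not any(actions_match(first_action, a) for a in action_set):
            action_set.append(first_action)
    return action_set
```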
  • The embodiment of the present application further provides an object-searching device.
  • FIG. 8 is a schematic structural diagram of an object-searching device according to an embodiment of the present application.
  • The device is applied to a robot and includes:
  • an instruction receiving module 801, configured to receive a seeking instruction for finding a target object in the target object-seeking scene;
  • a state obtaining module 802, configured to obtain the current state of the robot;
  • an action determining module 803, configured to determine, according to the object-seeking strategies in the strategy pool that contain the current state and are used for finding the target object, the action the robot performs to transition from the current state to the next state, where the strategies in the pool are strategies, obtained in advance by strategy learning with the optimal object-seeking strategy for finding the target object as the learning target, by which the robot finds the target object in the target object-seeking scene; each strategy comprises the states the robot passes through in sequence from the strategy's starting state until the target object is found, together with the action performed for each state transition;
  • a state conversion module 804, configured to perform the determined action to realize the state transition and determine whether the target object has been found; if not, it triggers the state obtaining module 802. The overall control flow is sketched below.
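The following sketch mirrors the control flow of modules 801–804: obtain the current state, determine the action from the pool strategies containing that state, execute it, and repeat until the target object is found. The `robot` interface (`current_state`, `execute`, `found_target`) and the step limit are assumptions; `select_next_action` is sketched after the description of module 803 below.

```python
def seek_object(robot, strategy_pool, reward_fn, max_steps=200):
    # 801: a seeking instruction has already been received when this loop is entered.
    for _ in range(max_steps):
        state = robot.current_state()                                  # 802: obtain the current state
        action = select_next_action(state, strategy_pool, reward_fn)   # 803: pick the next action
        robot.execute(action)                                          # 804: perform the state transition
        if robot.found_target():                                       # 804: has the target been found?
            return True
    return False
```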
  • The action determining module 803 may include:
  • an output value calculation sub-module, configured to calculate, for each strategy in the pool that contains the current state, the output value of the value function of the preset reinforcement learning algorithm according to a preset expression, where:
  • V_π denotes the output value of the value function of the reinforcement learning algorithm under object-seeking strategy π;
  • M denotes the number of states contained in strategy π;
  • m denotes the identifier of each state in strategy π;
  • n denotes the identifier of the current state within strategy π;
  • x denotes the number of state transitions from the current state to the strategy's terminating state in strategy π;
  • π(S_m) denotes the action the robot performs in strategy π to transition from state S_m to the next state;
  • γ is a preset coefficient, 0 < γ < 1;
  • R_e denotes the reward function of the reinforcement learning algorithm;
  • a strategy selection sub-module, configured to select the strategy corresponding to the largest of the calculated output values as the target object-seeking strategy;
  • an action determining sub-module, configured to determine, from the target object-seeking strategy, the action the robot is to perform to transition from the current state to the next state. A sketch of this selection follows.
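A sketch of the selection performed by module 803, consistent with the variable definitions above: for every pool strategy containing the current state, accumulate the discounted reward of the remaining segment from the current state to the terminating state, and return the action prescribed by the highest-valued strategy. The list-of-pairs strategy representation and the exact accumulation are assumptions, since the filed expression is reproduced only as an image.

```python
def select_next_action(current_state, strategy_pool, reward_fn, gamma=0.9):
    best_value, best_action = float("-inf"), None
    for strategy in strategy_pool:                  # strategy: list of (state, action) pairs
        states = [s for s, _ in strategy]
        if current_state not in states:
            continue
        n = states.index(current_state)             # identifier of the current state within pi
        remaining = strategy[n:]                    # from the current state to the terminating state
        value = sum((gamma ** x) * reward_fn(s, a)
                    for x, (s, a) in enumerate(remaining))
        if value > best_value:
            best_value, best_action = value, strategy[n][1]
    return best_action
```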
  • The state obtaining module 802 may include:
  • a sequence collection sub-module, configured to collect an information sequence of the target object-seeking scene, where the information sequence is composed of information elements and the information elements include video frames and/or audio frames;
  • an element selection sub-module, configured to select a preset number of information elements from the information sequence;
  • a state judging sub-module, configured to determine whether the previously obtained state set of the target object-seeking scene contains a state matching the selected information elements, where the state set is the set of states the robot can be in within the target object-seeking scene; if such a state exists, it triggers the state determining sub-module;
  • the state determining sub-module is configured to determine the matching state in the state set as the current state of the robot.
  • In the solution above, the robot performs strategy learning with the optimal object-seeking strategy for finding the target object as the learning target and obtains the strategies by which it finds the target object in the target object-seeking scene; when seeking, it searches based on these learned strategies, so it does not need the positioning device configured on the robot itself, is not affected by the particular object-seeking scene, and therefore finds the object with a higher probability of success.
  • The embodiment of the present application further provides a robot.
  • FIG. 9 is a schematic structural diagram of a robot according to an embodiment of the present disclosure, which includes a processor and a memory, wherein
  • the memory is configured to store a computer program;
  • the processor is configured to implement the machine learning method provided by the embodiment of the present application when executing the program stored in the memory.
  • The machine learning method includes:
  • selecting a state from the state set of the target object-seeking scene as a first state, where the state set is the set of states the robot can be in within the target object-seeking scene;
  • taking the first state as the starting state of an object-seeking strategy and obtaining a target optimal object-seeking strategy for finding the target object, where an object-seeking strategy comprises the states the robot passes through in sequence from the strategy's starting state until the target object is found, together with the action the robot performs for each state transition;
  • performing strategy learning with the target optimal object-seeking strategy as the learning target, obtaining an object-seeking strategy by which the robot finds the target object in the target object-seeking scene, and adding the obtained strategy to the object-seeking strategy pool, where the obtained strategy takes the first state as its starting state and a second state as its terminating state, the second state being the state of the robot corresponding to the position of the target object in the target object-seeking scene;
  • comparing whether the obtained strategy is consistent with the target optimal object-seeking strategy; if so, determining that strategy learning with the first state as the starting state is complete; if not, returning to the step of selecting a state from the state set of the target object-seeking scene.
  • The robot may further include at least one of the following devices:
  • an image acquisition device, wheels, mechanical legs, a robotic arm, and so on.
  • With this robot, a state from the state set of the target object-seeking scene serves as the starting state of an object-seeking strategy, the target optimal strategy for finding the object is obtained, and strategy learning is performed with that optimal strategy as the learning target, yielding the strategies by which the robot finds the target object in the scene; when the robot later searches for the target object there, it can use these learned strategies without its own positioning device, so it is not affected by the particular scene and finds the object with a higher probability of success.
  • The embodiment of the present application further provides another robot.
  • FIG. 10 is a schematic structural diagram of another robot according to an embodiment of the present disclosure, which includes a processor and a memory, wherein
  • the memory is configured to store a computer program;
  • the processor is configured to implement the object-searching method described in the embodiments of the present application when executing the program stored in the memory.
  • The object-searching method includes:
  • receiving a seeking instruction for finding a target object in the target object-seeking scene; obtaining the current state of the robot; determining, according to the strategies in the object-seeking strategy pool that contain the current state and are used for finding the target object, the action the robot performs to transition from the current state to the next state, where the strategies in the pool are strategies, obtained in advance by strategy learning with the optimal object-seeking strategy for finding the target object as the learning target, by which the robot finds the target object in the target object-seeking scene, and each strategy comprises the states the robot passes through in sequence from the strategy's starting state until the target object is found, together with the action performed for each state transition; performing the determined action to realize the state transition and determining whether the target object has been found; if not, returning to the step of obtaining the current state of the robot until the target object is found.
  • The robot may further include at least one of the following devices:
  • an image acquisition device, wheels, mechanical legs, a robotic arm, and so on.
  • With this robot, strategy learning is performed with the optimal object-seeking strategy for finding the target object as the learning target, yielding strategies by which the robot finds the target object in the target object-seeking scene; when seeking, the robot searches based on these learned strategies, so it does not need its own positioning device, is not affected by the particular scene, and finds the object with a higher probability of success.
  • The memory in the above two robots may include a random access memory (RAM) and may also include a non-volatile memory (NVM), for example at least one disk storage. Optionally, the memory may also be at least one storage device located away from the aforementioned processor.
  • The processor in the above two robots may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • The embodiment of the present application further provides a computer-readable storage medium.
  • The computer-readable storage medium is a computer-readable storage medium in a robot; a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the machine learning method described in the embodiments of the present application is implemented.
  • The machine learning method includes:
  • selecting a state from the state set of the target object-seeking scene as a first state, where the state set is the set of states the robot can be in within the target object-seeking scene;
  • taking the first state as the starting state of an object-seeking strategy and obtaining a target optimal object-seeking strategy for finding the target object, where an object-seeking strategy comprises the states the robot passes through in sequence from the strategy's starting state until the target object is found, together with the action the robot performs for each state transition;
  • performing strategy learning with the target optimal object-seeking strategy as the learning target, obtaining an object-seeking strategy by which the robot finds the target object in the target object-seeking scene, and adding the obtained strategy to the object-seeking strategy pool, where the obtained strategy takes the first state as its starting state and a second state as its terminating state, the second state being the state of the robot corresponding to the position of the target object in the target object-seeking scene;
  • comparing whether the obtained strategy is consistent with the target optimal object-seeking strategy; if so, determining that strategy learning with the first state as the starting state is complete; if not, returning to the step of selecting a state from the state set of the target object-seeking scene.
  • By executing the computer program stored in the computer-readable storage medium, the robot takes a state from the state set of the target object-seeking scene as the starting state of an object-seeking strategy, obtains the target optimal strategy for finding the target object with the first state as the starting state, and performs strategy learning with that optimal strategy as the learning target, yielding the strategies by which the robot finds the target object in the target object-seeking scene; when the robot later searches for the target object there, it can use these learned strategies without its own positioning device, so it is not affected by the particular scene and finds the object with a higher probability of success.
  • The embodiment of the present application further provides a computer-readable storage medium.
  • The computer-readable storage medium is a computer-readable storage medium in a robot; a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the object-searching method described in the embodiments of the present application is implemented.
  • The object-searching method includes:
  • receiving a seeking instruction for finding a target object in the target object-seeking scene; obtaining the current state of the robot; determining, according to the strategies in the object-seeking strategy pool that contain the current state and are used for finding the target object, the action the robot performs to transition from the current state to the next state, where the strategies in the pool are strategies, obtained in advance by strategy learning with the optimal object-seeking strategy for finding the target object as the learning target, by which the robot finds the target object in the target object-seeking scene, and each strategy comprises the states the robot passes through in sequence from the strategy's starting state until the target object is found, together with the action performed for each state transition; performing the determined action to realize the state transition and determining whether the target object has been found; if not, returning to the step of obtaining the current state of the robot until the target object is found.
  • By executing the computer program stored in the computer-readable storage medium, the robot performs strategy learning with the optimal object-seeking strategy for finding the target object as the learning target and obtains the strategies by which it finds the target object in the target object-seeking scene; when seeking, it searches based on these learned strategies, so it does not need its own positioning device, is not affected by the particular object-seeking scene, and finds the object with a higher probability of success.
  • Embodiments of the present application also provide executable program code to be executed to implement any of the above machine learning method steps applied to a robot.
  • Embodiments of the present application also provide executable program code to be executed to implement any of the above object-searching method steps applied to a robot.


Abstract

Embodiments of the present application provide a machine learning method and apparatus and an object-seeking method and apparatus, relating to the field of artificial intelligence and applied to a robot. The method includes: selecting a state from a state set of a target object-seeking scene as a first state; taking the first state as the starting state of an object-seeking strategy and obtaining a target optimal object-seeking strategy for finding a target object; performing strategy learning with the target optimal object-seeking strategy as the learning target, obtaining an object-seeking strategy by which the robot finds the target object in the target object-seeking scene, and adding the obtained object-seeking strategy to an object-seeking strategy pool; comparing whether the obtained object-seeking strategy is consistent with the target optimal object-seeking strategy; if so, determining that strategy learning with the first state as the starting state of the object-seeking strategy is complete; if not, returning to the step of selecting a state from the state set of the target object-seeking scene. Applying the solutions provided by the embodiments of the present application improves the probability of success in finding objects.

Description

一种机器学习、寻物方法及装置
本申请要求于2017年7月20日提交中国专利局、申请号为201710594689.9发明名称为“一种机器学习、寻物方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,特别是涉及一种机器学习、寻物方法及装置。
背景技术
随着机器学习算法的飞速发展,应用机器学习算法的机器人也得到了快速发展,各具特色的机器人越来越多的应用到人们的日常生活中,为人们的生活带来便利。
以在一定应用场景中具有寻物功能的机器人为例,目前,大多数机器人依赖其自身设置的定位装置和数字地图技术确定寻物路径,进行寻物。虽然大多数情况下应用上述方式能够成功寻物,但是上述机器人自身设置的定位装置在很多应用场景下不够准确,进而应用上述方式进行寻物时成功率低。
发明内容
本申请实施例的目的在于提供一种机器学习、寻物方法及装置,以提高寻物时的成功率。具体技术方案如下:
第一方面,本申请实施例提供了一种机器学习方法,应用于机器人,所述方法包括:
从目标寻物场景的状态集合中选择状态,作为第一状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中的状态的集合;
以所述第一状态为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,其中,所述寻物策略包含:所述机器人从所述寻物策略的起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
以所述目标最优寻物策略为学习目标进行策略学习,获得所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略,并将所获得的寻物策略添 加至寻物策略池,其中,所获得的寻物策略为:以所述第一状态为起始状态、以第二状态为终止状态的寻物策略,所述第二状态为:所述目的物在所述目标寻物场景中的位置对应的所述机器人所处的状态;
比较所获得寻物策略与所述目标最优寻物策略是否一致;
若一致,判定完成以所述第一状态为寻物策略的起始状态的策略学习;
若不一致,返回所述从目标寻物场景的状态集合中选择状态的步骤。
本申请的一种实现方式中,所述以所述目标最优寻物策略为学习目标进行策略学习,获得所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略,包括:
以所述目标最优寻物策略为学习目标,利用目标类型的寻物策略确定用于进行策略学习的增强学习算法中的回报函数,所述目标类型的寻物策略为:寻物策略池中用于寻找所述目标物的寻物策略;
基于所述回报函数进行策略学习,获得使得所述增强学习算法中价值函数的输出值最大的寻物策略,作为所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略。
本申请的一种实现方式中,所述以所述目标最优寻物策略为学习目标,利用目标类型的寻物策略确定用于进行策略学习的增强学习算法中的回报函数,包括:
确定使得以下表达式的取值最大的回报函数R为用于进行策略学习的增强学习算法中的回报函数:
Figure PCTCN2018095769-appb-000001
其中,
Figure PCTCN2018095769-appb-000002
Figure PCTCN2018095769-appb-000003
k表示所述寻物策略池中所包含寻找所述目的物的寻物策略的数量,i表示所述寻物策略池中各个寻找所述目的物的寻物策略的标识,π i表示所述寻物策略池中标识为i的寻找所述目的物的寻物策略,π d表示所述目标最优寻物策略,S 0表示所述第一状态,V π表示寻物策略π下所述增强学习算法的价值函 数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,t表示寻物策略π中状态转换的次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,maximise()表示取最大值函数。
本申请的一种实现方式中,所述基于所述回报函数进行策略学习,获得使得所述增强学习算法中价值函数的输出值最大的寻物策略,包括:
按照预设的状态转换方式,学习得到以所述第一状态为寻物策略起始状态、以所述第二状态为寻物策略终止状态的寻物策略;
按照以下表达式计算学习到的每一寻物策略下所述增强学习算法中价值函数的输出值:
Figure PCTCN2018095769-appb-000004
其中,R e表示所述增强学习算法中的回报函数;
将计算得到的输出值中最大输出值对应的寻物策略确定为使得所述增强学习算法中价值函数的输出值最大的寻物策略。
本申请的一种实现方式中,通过以下方式确定所述寻物策略中每一状态的下一状态、以及从每一状态转换至下一状态所述机器人执行的动作:
根据预先统计的、转换前状态转换至其他状态的概率,确定转换后状态以及从所述转换前状态转换至所述转换后状态所述机器人执行的、属于所述目标寻物场景的动作集合的动作,其中,所述动作集合为:所述机器人在所述目标寻物场景中进行状态转换时执行动作的集合。
本申请的一种实现方式中,所述状态集合中的状态通过以下方式得到:
采集所述目标寻物场景的信息序列,其中,所述信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
判断所述信息序列中未被选中过的信息元素的数量是否大于预设数量;
若为是,从所述信息序列中未被选中过的信息元素中选择所述预设数量个信息元素,生成所述机器人在所述目标寻物场景中所处的一个状态,作为第三状态;
判断所述状态集合中是否存在所述第三状态;
若不存在,将所述第三状态添加至所述状态集合,并返回执行所述判断所述信息序列中未被选中过的信息元素的数量是否大于预设数量的步骤;
若存在,直接返回执行所述判断所述信息序列中未被选中过的信息元素的数量是否大于预设数量的步骤。
本申请的一种实现方式中,所述动作集合中的动作通过以下方式得到:
获得所述信息序列对应的动作序列,其中,所述动作序列由动作元素组成,所述动作序列中的每一动作元素与所述信息序列中的每一信息元素一一对应;
判断所述动作序列中未被选中过的动作元素的数量是否大于所述预设数量;
若为是,从所述动作序列中未被选中过的动作元素中选择所述预设数量个动作元素,生成所述机器人在所述目标寻物场景中的一个动作,作为第一动作;
判断所述动作集合中是否存在所述第一动作;
若不存在,将所述第一动作添加至所述动作集合,并返回执行所述判断所述动作序列中未被选中过的动作元素的数量是否大于所述预设数量的步骤;
若存在,直接返回执行所述判断所述动作序列中未被选中过的动作元素的数量是否大于所述预设数量的步骤。
第二方面,本申请实施例提供了一种寻物方法,应用于机器人,所述方法包括:
接收在目标寻物场景中寻找目的物的寻物指令;
获得所述机器人的当前状态;
根据寻物策略池中包含所述当前状态的、用于寻找目的物的寻物策略,确定所述机器人从当前状态转换至下一状态执行的动作,其中,所述寻物策略池中的寻物策略是:预先以寻找所述目的物的最优寻物策略为学习目标进行策略学习得到的、所述机器人在所述目标寻物场景中寻找所述目的物的策略,寻物策略包含:所述机器人从寻物策略起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
执行所确定的动作实现状态转换,并判断是否寻找到所述目的物;
若为否,返回执行所述获得所述机器人的当前状态的步骤,直至寻找到所述目的物。
本申请的一种实现方式中,所述根据寻物策略池中包含所述当前状态的用于寻找所述目的物的寻物策略,确定所述机器人从当前状态转换至下一状态执行的动作,包括:
按照以下表达式,计算在策略池中包含所述当前状态的寻物策略下预设的增强学习算法的价值函数的输出值:
Figure PCTCN2018095769-appb-000005
其中,V π表示寻物策略π下所述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,n表示所述当前状态在寻物策略π中的标识,x表示寻物策略π中从所述当前状态至策略终止状态的状态转换次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,R e表示所述增强学习算法中的回报函数;
选择计算得到的输出值中最大输出值对应的寻物策略为目标寻物策略;
从所述目标寻物策略中确定所述机器人从当前状态转换至下一状态执行的动作。
本申请的一种实现方式中,所述获得所述机器人的当前状态,包括:
采集所述目标寻物场景的信息序列,其中,所述信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
从所述信息序列中选择预设数量个信息元素;
判断预先获得的所述目标寻物场景的状态集合中是否存在与所选择的信息元素相匹配的状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中所能处的状态的集合;
若存在,将所述状态集合中与所选择的信息元素相匹配的状态确定为所述机器人的当前状态。
第三方面,本申请实施例提供了一种机器学习装置,应用于机器人,所述装置包括:
状态选择模块,用于从目标寻物场景的状态集合中选择状态,作为第一状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中的状态的集合;
策略获得模块,用于以所述第一状态为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,其中,所述寻物策略包含:所述机器人从所述寻物策略的起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
策略学习模块,用于以所述目标最优寻物策略为学习目标进行策略学习,获得所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略,并将所获得的寻物策略添加至寻物策略池,其中,所获得的寻物策略为:以所述第一状态为起始状态、以第二状态为终止状态的寻物策略,所述第二状态为:所述目的物在所述目标寻物场景中的位置对应的所述机器人所处的状态;
策略比较模块,用于比较所获得寻物策略与所述目标最优寻物策略是否一致,若一致,触发学习判定模块,若不一致,触发所述状态选择模块;
所述学习判定模块,用于判定完成以所述第一状态为寻物策略的起始状态的策略学习。
本申请的一种实现方式中,所述策略学习模块,包括:
回报函数确定子模块,用于以所述目标最优寻物策略为学习目标,利用目标类型的寻物策略确定用于进行策略学习的增强学习算法中的回报函数,所述目标类型的寻物策略为:寻物策略池中用于寻找所述目标物的寻物策略;
策略获得子模块,用于基于所述回报函数进行策略学习,获得使得所述增强学习算法中价值函数的输出值最大的寻物策略,作为所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略;
策略添加子模块,用于将所获得的寻物策略添加至寻物策略池。
本申请的一种实现方式中,所述回报函数确定子模块,具体用于确定使得以下表达式的取值最大的回报函数R为用于进行策略学习的增强学习算法中的回报函数:
Figure PCTCN2018095769-appb-000006
其中,
Figure PCTCN2018095769-appb-000007
Figure PCTCN2018095769-appb-000008
k表示所述寻物策略池中所包含寻找所述目的物的寻物策略的数量,i表示所述寻物策略池中各个寻找所述目的物的寻物策略的标识,π i表示所述寻物策略池中标识为i的寻找所述目的物的寻物策略,π d表示所述目标最优寻物策略,S 0表示所述第一状态,V π表示寻物策略π下所述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,t表示寻物策略π中状态转换的次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,maximise()表示取最大值函数。
本申请的一种实现方式中,所述策略学习子模块,包括:
策略学习单元,用于按照预设的状态转换方式,学习得到以所述第一状态为寻物策略起始状态、以所述第二状态为寻物策略终止状态的寻物策略;
输出值计算单元,用于按照以下表达式计算学习到的每一寻物策略下所述增强学习算法中价值函数的输出值:
Figure PCTCN2018095769-appb-000009
其中,R e表示所述增强学习算法中的回报函数;
策略确定单元,用于将计算得到的输出值中最大输出值对应的寻物策略确定为使得所述增强学习算法中价值函数的输出值最大的寻物策略;
策略加入单元,用于将所获得的寻物策略添加至所述寻物策略池。
本申请的一种实现方式中,所述寻物策略中每一状态的下一状态、从每一状态转换至下一状态所述机器人执行的动作,是根据预先统计的、转换前状态转换至其他状态的概率确定的;
从每一状态转换至下一状态所述机器人执行的动作为:属于所述目标寻物场景的动作集合的动作,其中,所述动作集合为:所述机器人在所述目标寻物场景中进行状态转换时执行动作的集合。
本申请的一种实现方式中,所述学习装置还包括:
状态获得模块,用于获得所述状态集合中的状态;
所述状态获得模块,包括:
第一序列采集子模块,用于采集所述目标寻物场景的信息序列,其中,所述信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
第一元素数量判断子模块,用于判断所述信息序列中未被选中过的信息元素的数量是否大于预设数量,若为是,触发状态生成子模块;
状态生成子模块,用于从所述信息序列中未被选中过的信息元素中选择所述预设数量个信息元素,生成所述机器人在所述目标寻物场景中所处的一个状态,作为第三状态;
状态判断子模块,用于判断所述状态集合中是否存在所述第三状态,若不存在,触发状态添加子模块,若存在,触发所述第一元素数量判断子模块;
状态添加子模块,用于将所述第三状态添加至所述状态集合,并触发所述第一元素数量判断子模块。
本申请的一种实现方式中,所述学习装置还包括:
动作获得模块,用于获得所述动作集合中的动作;
所述动作获得模块,包括:
第二序列采集子模块,用于获得所述信息序列对应的动作序列,其中,所述动作序列由动作元素组成,所述动作序列中的每一动作元素与所述信息序列中的每一信息元素一一对应;
第二元素数量判断子模块,用于判断所述动作序列中未被选中过的动作元素的数量是否大于所述预设数量,若为是,触发动作生成子模块;
动作生成子模块,用于从所述动作序列中未被选中过的动作元素中选择所述预设数量个动作元素,生成所述机器人在所述目标寻物场景中的一个动作,作为第一动作;
动作判断子模块,用于判断所述动作集合中是否存在所述第一动作,若不存在,触发动作添加子模块,若存在,触发所述第二元素数量判断子模块;
动作添加子模块,用于将所述第一动作添加至所述动作集合,并触发所述第二元素数量判断子模块。
第四方面,本申请实施例提供了一种寻物装置,应用于机器人,所述装 置包括:
指令接收模块,用于接收在目标寻物场景中寻找目的物的寻物指令;
状态获得模块,用于获得所述机器人的当前状态;
动作确定模块,用于根据寻物策略池中包含所述当前状态的、用于寻找目的物的寻物策略,确定所述机器人从当前状态转换至下一状态执行的动作,其中,所述寻物策略池中的寻物策略是:预先以寻找所述目的物的最优寻物策略为学习目标进行策略学习得到的、所述机器人在所述目标寻物场景中寻找所述目的物的策略,寻物策略包含:所述机器人从寻物策略起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
状态转换模块,用于执行所确定的动作实现状态转换,并判断是否寻找到所述目的物,若为否,触发所述状态获得模块。
本申请的一种实现方式中,所述动作确定模块,包括:
输出值计算子模块,用于按照以下表达式,计算在策略池中包含所述当前状态的寻物策略下预设的增强学习算法的价值函数的输出值:
Figure PCTCN2018095769-appb-000010
其中,V π表示寻物策略π下所述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,n表示所述当前状态在寻物策略π中的标识,x表示寻物策略π中从所述当前状态至策略终止状态的状态转换次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,R e表示所述增强学习算法中的回报函数;
策略选择子模块,用于选择计算得到的输出值中最大输出值对应的寻物策略为目标寻物策略;
动作确定子模块,用于从所述目标寻物策略中确定所述机器人从当前状态转换至下一状态要执行的动作。
本申请的一种实现方式中,所述状态获得模块,包括:
序列采集子模块,用于采集所述目标寻物场景的信息序列,其中,所述 信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
元素选择子模块,用于从所述信息序列中选择预设数量个信息元素;
状态判断子模块,用于判断预先获得的所述目标寻物场景的状态集合中是否存在与所选择的信息元素相匹配的状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中所能处的状态的集合,若存在,触发状态确定子模块;
所述状态确定子模块,用于将所述状态集合中与所选择的信息元素相匹配的状态确定为所述机器人的当前状态。
第五方面,本申请实施例提供了一种机器人,包括:处理器和存储器,其中,
存储器,用于存放计算机程序;
处理器,用于执行存储器上所存放的程序时,实现前述第一方面所述的方法步骤。
第六方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质为机器人中的计算机可读存储介质,所述计算机可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现前述第一方面所述的方法步骤。
第七方面,本申请实施例提供了一种机器人,包括:处理器和存储器,其中,
存储器,用于存放计算机程序;
处理器,用于执行存储器上所存放的程序时,实现前述第二方面所述的方法步骤。
第八方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质为机器人中的计算机可读存储介质,所述计算机可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现前述第二方面所述的方法步骤。
第九方面,本申请实施例提供了一种可执行程序代码,所述可执行程序代码用于被运行以实现前述第一方面所述的方法步骤。
第十方面,本申请实施例提供了一种可执行程序代码,所述可执行程序 代码用于被运行以实现前述第二方面所述的方法步骤。
由以上可见,本申请实施例提供的方案中,机器人以目标寻物场景的状态集合中一个状态作为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,以目标最优寻物策略为学习目标进行策略学习,获得机器人在目标寻物场景中寻找目的物的寻物策略,这样,机器人在目标寻物场景中寻找目的物时,根据前述学习得到的寻物策略即可,而无需使用机器人自身设置的定位装置,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
附图说明
为了更清楚地说明本申请实施例和现有技术的技术方案,下面对实施例和现有技术中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种机器学习方法的流程示意图;
图2为本申请实施例提供的另一种机器学习方法的流程示意图;
图3为本申请实施例提供的一种获得状态集合中状态的方法的流程示意图;
图4为本申请实施例提供的一种寻物方法的流程示意图;
图5为本申请实施例提供的一种机器学习装置的结构示意图;
图6为本申请实施例提供的另一种机器学习装置的结构示意图;
图7为本申请实施例提供的一种获得状态集合中状态的装置的结构示意图;
图8为本申请实施例提供的一种寻物装置的结构示意图;
图9为本申请实施例提供的一种机器人的结构示意图;
图10为本申请实施例提供的另一种机器人的结构示意图。
具体实施方式
为使本申请的目的、技术方案、及优点更加清楚明白,以下参照附图并举实施例,对本申请进一步详细说明。显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本 申请保护的范围。
图1为本申请实施例提供的一种机器学习方法的流程示意图,该方法应用于机器人,包括:
S101:从目标寻物场景的状态集合中选择状态,作为第一状态。
机器人可以工作于不同的场景中,例如,家庭场景、工厂车间场景等等。不管机器人工作于哪种场景下,均可能会涉及到寻找物体的情况,这种情况下,上述机器人工作的场景也可以称之为寻物场景,以家庭场景为例,可能需要机器人寻找家里饲养的宠物狗、寻找家里的小孩玩具等等。
另外,机器人在不同的寻物场景中工作时,因场景之间的差异往往机器人在场景中处的位置不同、需要机器人进行的操作不同,因此,机器人在不同的寻物场景中可能处于不同的状态,再者,机器人在同一寻物场景中工作时,可能会处于该场景的不同位置,为此,机器人在每一寻物场景中可能会处于各种不同的状态。
基于上述情况,状态集合可以理解为是与机器人的寻物场景相对应的,也就是,上述状态集合可以理解为:机器人在目标寻物场景中的状态的集合。
仍然以家庭场景为例,机器人在该场景中所能处的状态可以是与其在家庭场景中所处的位置相关的,例如,上述状态可以是机器人处于家庭场景中客厅中央区域、书房的东南角区域等等。
由于已有的各种机器人中大多数具有视觉和语音功能,基于此,在本申请的一种实现方式中,机器人在寻物场景中所能处的状态可以是根据机器人在该场景中采集的视频帧和/或音频帧确定的。
具体的,从目标寻物场景的状态集合中选择状态时,可以是采用随机选择的方式从上述状态集合中选择状态,另外,还可以是按照一定的规则从上述状态集合中选择状态,本申请仅仅以此为例进行说明,实际应用中并不对状态的选择方式进行限定。
S102:以第一状态为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略。
具体的,机器人寻找目的物的过程可以理解为如下过程:
机器人从当前状态转换至下一状态;
在转换后的状态下确认是否寻找到目的物;
若未寻找到目的物,重复执行上述两个步骤,直至找到目的物为止。
机器人从当前状态转换至下一状态时,可以是通过执行一定的动作实现的。另外,由于机器人处于当前状态时所执行的动作可以不同,所以执行动作后所处于的状态可能是不同的。
基于上述情况,寻物策略可以理解为:从机器人在目标寻物场景中当前所处的状态开始,直至寻找到目的物的策略。具体的,寻物策略包含:机器人从寻物策略的起始状态开始至寻找到目的物依次经历的各个状态、从每一状态转换至下一状态机器人执行的动作。
从每一状态转换至下一状态机器人要执行的动作,可能会因其所工作的场景不同而不同,具体的,上述动作可以是向左转、向右转、向前走、向后走等等,本申请并不对此进行限定。
本申请的一种实现方式中,可以通过以下方式确定寻物策略中每一状态的下一状态、以及从每一状态转换至下一状态所述机器人执行的动作:
根据预先统计的、转换前状态转换至其他状态的概率,确定转换后状态以及从转换前状态转换至转换后状态机器人执行的、属于目标寻物场景的动作集合的动作。
受寻物场景具体特征以及机器人自身特征等因素的限制,机器人在寻物过程中进行状态转换时所能够执行的动作一般是有限的,基于此,上述动作集合为:机器人在目标寻物场景中进行状态转换时执行动作的集合。
获得了机器人在目标寻物场景的状态集合和动作集合之后,可以简单的认为已经确定了机器人在目标寻物场景中能够处于的状态以及进行状态转换时能够执行的动作。鉴于此,本申请的一种实现方式中,发明人经过大量随机重复实验收集与状态转换相关的数据,进而统计两状态之间进行状态转换时机器人要执行的动作,以及相应动作下两状态之间实现状态转换的概率。例如,在随机重复实验中,将机器人执行的动作用双目摄像机或TOF(Time of Flight,飞行时间)摄像机拍下来,获取在每个状态下机器人的三维数据、或每个状态下机器人的状态向量集合等。
具体的,可以通过以下表达式统计得到一个状态转换至其他状态的概率:
P(S i,A i,S j)=x/y
其中,P(S i,A i,S j)表示通过执行动作A i机器人由状态S i转换至状态S j的概率,x表示在大量随机重复实验中(S i,A i,S j)组合发生的次数,也就是,通过执行动作A i机器人由状态S i转换至状态S j这一现象发生的次数,y表示(S i,A i)在大量随机重复实验中(S i,A i)组合发生的次数,也就是,机器人处于状态S i时执行动作A i的次数。
基于上述统计方式,可以从预先统计的、转换前状态转换至其他状态的概率中,选择取值最大的概率对应的状态,作为转换后状态,并将上述取值最大的概率对应的动作作为机器人从转换前状态转换至转换后状态要执行的动作。
在本申请的一种实现方式中,可以认为人演示的寻找目的物的过程是最优的过程,进而上述目标最优寻物策略可以是:对人演示的从上述第一状态开始寻找目的物的过程进行抽象得到的策略。
S103:以目标最优寻物策略为学习目标进行策略学习,获得机器人在目标寻物场景中寻找目的物的寻物策略,并将所获得的寻物策略添加至寻物策略池。
由于不同的寻物场景中,即便是同一目的物也可能会处于不同的位置,另外,即便同一目的物处于同一寻物场景中的相同位置,寻找该目的物的时候也可以采用不同的策略进行寻找,为此需要对寻找目的物的策略进行学习。
其中,所获得的寻物策略为:以第一状态为起始状态、以第二状态为终止状态的寻物策略,第二状态为:目的物在目标寻物场景中的位置对应的机器人所处的状态。
具体的,一种实现方式中,上述第二状态可以是在执行本申请实施例提供的方案之初作为参数输入至机器人的;另一种实现方式中,机器人在寻找目的物的过程中可以借助于其自身具有的视觉和/或语音功能,在每一次转换到新的状态后检测是否寻找到目的物,若寻找到,将此时所处的状态确定为上述第二状态。
本申请仅仅以此为例进行说明,实际应用中并不对机器人确定第二状态的方式进行限定。
上述寻物策略池用于存储在目标寻物场景中寻找物体的寻物策略。具体的,第一种情况:上述寻物策略池所存储的寻物策略可以仅仅是用于在目标寻物场景中寻找目的物的寻物策略;第二种情况:上述寻物策略池所存储的寻物策略可以是上述第一种情况提及的寻物策略和用于在目标寻物场景中寻找其他目的物的寻物策略。本申请仅仅以此为例进行说明,并不对寻物策略池中存储的寻物策略进行限定。
需要说明的是,对于目标寻物场景中的每一目的物而言,为了便于对该目的物的寻物策略进行学习,寻物策略池中会存储针对该目的物的初始寻物策略,这些初始寻物策略可以是随机设置的,随着本步骤中对寻物策略的学习,将学习到的寻物策略加入到寻物策略池中,这样通过不断的迭代学习,寻物策略池中的寻物策略会越来越丰富。
本申请的一种实现方式中,可以基于增强学习算法,获得机器人在目标寻物场景中寻找目的物的寻物策略。
S104:比较所获得寻物策略与目标最优寻物策略是否一致,若一致,执行S105,若不一致,返回执行S101。
从本步骤返回S101后,从目标寻物场景的状态集合中选择状态时,仍然可以是以随机的方式进行状态选择的,或者,也可以是按照一定的规则进行选择的。另外再次选出的状态可以与之前选择出的状态相同,也可以不相同,本申请并不对此进行限定。
S105:判定完成以第一状态为寻物策略的起始状态的策略学习。
由以上可见,本实施例提供的方案中,机器人以目标寻物场景的状态集合中一个状态作为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,以目标最优寻物策略为学习目标进行策略学习,获得机器人在目标寻物场景中寻找目的物的寻物策略,这样,机器人在目标寻物场景中寻找目的物时,根据前述学习得到的寻物策略即可,而无需使用机器人自身设置的定位装置,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
本申请的一种实现方式中,参见图2,提供了另一种机器学习的方法的流程示意图,与图1所示实施例相比,本实施例中,以目标最优寻物策略为学 习目标进行策略学习,获得机器人在目标寻物场景中寻找目的物的寻物策略,并将所获得的寻物策略添加至寻物策略池(S103),包括:
S103A:以目标最优寻物策略为学习目标,利用目标类型的寻物策略确定用于进行策略学习的增强学习算法中的回报函数。
目标类型的寻物策略为:寻物策略池中用于寻找目标物的寻物策略。
增强学习是一类机器学习方法,通过状态和动作对现实世界进行抽象建模,以找到最优的价值回报为目标,并通过一些训练、学习方法,最终找到最优的策略。
发明人经过实验发现,利用增强学习可以使得机器人通过学习改进自身的性能并进行行为选择,进而做出决策,通过选择执行某一动作实现状态改变。
另外,各种增强学习算法中一般需要包括策略的回报函数和策略的价值函数,其中,策略的价值函数是与回报函数相关的函数。具体应用中,由于寻物场景有所差异,所以回报函数一般是不同的,需要结合具体寻物场景进行学习,得到适应于不同寻物场景的回报函数。
S103B:基于回报函数进行策略学习,获得使得上述增强学习算法中价值函数的输出值最大的寻物策略,作为机器人在目标寻物场景中寻找目的物的寻物策略。
S103C:将所获得的寻物策略添加至寻物策略池。
本实施例提供的方案中,进行策略学习时引入增强学习算法,使得机器人能够更加高效的学习到寻找目的物的寻物策略。
下面再通过几个具体实施例对本申请实施例提供的机器学习方法进行进一步说明。
实施例一:
在前述图2所示实施例的基础上,以目标最优寻物策略为学习目标,利用目标类型的寻物策略确定用于进行策略学习的增强学习算法中的回报函数(S103A),包括:
确定使得以下表达式的取值最大的回报函数R为用于进行策略学习的增 强学习算法中的回报函数:
Figure PCTCN2018095769-appb-000011
其中,
Figure PCTCN2018095769-appb-000012
Figure PCTCN2018095769-appb-000013
k表示寻物策略池中所包含寻找目的物的寻物策略的数量,i表示寻物策略池中各个寻找目的物的寻物策略的标识,π i表示寻物策略池中标识为i的寻找目的物的寻物策略,π d表示目标最优寻物策略,S 0表示第一状态,V π表示寻物策略π下上述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,t表示寻物策略π中状态转换的次数,π(S m)表示寻物策略π中机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,maximise()表示取最大值函数。
基于上述描述可以得知,上述
Figure PCTCN2018095769-appb-000014
表示寻物策略π d下上述增强学习算法的价值函数的输出值,上述
Figure PCTCN2018095769-appb-000015
表示寻物策略π i下上述增强学习算法的价值函数的输出值。
实施例二:
在上述实施例一的基础上,基于上述回报函数进行策略学习,获得使得上述增强学习算法中价值函数的输出值最大的寻物策略,包括:
按照预设的状态转换方式,学习得到以第一状态为寻物策略起始状态、以第二状态为寻物策略终止状态的寻物策略;
按照以下表达式计算学习到的每一寻物策略下所述增强学习算法中价值函数的输出值:
Figure PCTCN2018095769-appb-000016
其中,R e表示上述增强学习算法中的回报函数;
将计算得到的输出值中最大输出值对应的寻物策略确定为使得上述增强学习算法中价值函数的输出值最大的寻物策略。
具体的,上述预设的状态转换方式可以是按照预先约定好的状态与状态之间的转换关系进行状态转换的方式。
与上述情况相对应,本申请的一种实现方式中,上述预设的状态转换方式可以为:
根据预先统计的、转换前状态转换至其他状态的概率,确定转换后状态以及从转换前状态转换至转换后状态机器人执行的、属于目标寻物场景的动作集合的动作,然后执行所确定的动作,由转换前状态转换至转换后状态,实现状态转换。
具体的,可以通过以下表达式统计得到一个状态转换至其他状态的概率:
P(S i,A i,S j)=x/y
其中,P(S i,A i,S j)表示通过执行动作A i机器人由状态S i转换至状态S j的概率,x表示在大量随机重复实验中(S i,A i,S j)组合发生的次数,也就是,通过执行动作A i机器人由状态S i转换至状态S j这一现象发生的次数,y表示(S i,A i)在大量随机重复实验中(S i,A i)组合发生的次数,也就是,机器人处于状态S i时执行动作A i的次数。
基于上述统计方式,确定转换后状态以及从转换前状态转换至转换后状态机器人执行的、属于目标寻物场景的动作集合的动作的具体方式时,可以从预先统计的、转换前状态转换至其他状态的概率中,选择取值最大的概率对应的状态,作为转换后状态,并将上述取值最大的概率对应的动作作为机器人从转换前状态转换至转换后状态要执行的动作。
前面所提及的状态集合和动作集合可以是预先生成的,下面通过两个具体实施例进行详细说明。
由以上可见,上述各个实施例提供的方案中,机器人以目标寻物场景的状态集合中一个状态作为寻物策略起始状态,结合增强学习算法对寻物策略进行策略学习,得到在目标寻物场景中寻找目的物的各种寻物策略,这样,机器人在目标寻物场景中寻找目的物时,根据前述学习得到的寻物策略即可,而无需使用机器人自身设置的定位装置,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。另外,由于增强学习算法自身的优势,使得机器人在进行策略学习的过程中,能够高效的学习到寻物策略,进而提高了机器 人自身的性能。
实施例三:
参见图3,提供了一种获得状态集合中状态的方法的流程示意图,该方法包括:
S301:采集目标寻物场景的信息序列。
其中,上述信息序列由信息元素组成,信息元素包括:视频帧和/或音频帧。
具体的,上述信息序列可以是机器人在目标寻物场景中随意巡航的过程中采集到的。
S302:判断信息序列中未被选中过的信息元素的数量是否大于预设数量,若为是,执行S303。
上述预设数量可以是根据多次实验统计结果设定的,或者,还可以是根据目标寻物场景的类型等信息设定的,本申请并不对此进行限定。
S303:从信息序列中未被选中过的信息元素中选择预设数量个信息元素,生成机器人在目标寻物场景中所处的一个状态,作为第三状态。
本申请的一种实现方式中,生成机器人在目标寻物场景中所处的一个状态时,可以由所选择的信息元素形成一个向量,并由形成的该向量表示机器人在目标寻物场景中所处的一个状态。
S304:判断状态集合中是否存在第三状态,若不存在,执行S305,若存在,直接返回执行S302。
具体的,判断状态集合中是否存在上述第三状态时,可以将表示上述第三状态的向量与表示状态集合中各个状态的向量逐一匹配,若存在与表示第三状态的向量相匹配的向量,则说明状态集合中已存在上述第三状态;否则,说明状态集合中不存在上述第三状态。
另外,本申请的一种实现方式中,还可以通过预先训练的网络模型检测上述第三状态与状态集合中的每一状态是否相似。
具体的,上述网络模型可以按照如下方式训练得到:
机器人在目标寻物场景中随意巡航时采集信息序列,作为样本信息序列;
从样本信息序列中选择样本段,并对所选择的样本段进行状态标记;
由进行状态标记后的各个样本段形成两个一组的模型输入参数,输入至预设的神经网络模型中进行模型训练,得到用于检测两个状态的是否相似的网络模型,这一模型也可以称之为孪生网络模型。
上述机器人在目标寻物场景中随意巡航时采集的信息序列为寻物场景的环境信息,具体的,上述信息序列可以理解为由信息元素组成,信息元素包括:视频帧和/或音频帧。
从样本信息序列中选择样本段可以理解为:在样本信息序列中选择若干个连续采集的信息元素,为便于描述本申请中将所选择的信息元素的集合体称之为样本段,所选择信息元素的数量可以与前述预设数量相等,也可以不相等,本申请并不对此进行限定。
另外,由于样本信息序列是机器人在目标寻物场景中巡航时随机采集的,所以样本信息序列中的内容可能不具有代表性,或者所采集的样本序列中存在大量重复的内容等等,鉴于此,从样本信息序列中选择样本段时,为了更好的进行网络模型序列,可以不选择样本信息序列中满足上述情况的信息元素。
S305:将第三状态添加至所述状态集合,并返回执行S302。
本实施例中,机器人通过采集目标寻物场景的信息序列并对信息序列进行分析的方式,获得机器人在目标寻物场景中的各个状态,这样一方面无需用户手动设置机器人在目标寻物场景中的状态,提高了机器人的自动化程度;另一方面,机器人可以针对不同的场景自适应的获得其在不同场景中所处的状态,进而提高了机器人针对不同场景的适应性。
实施例四:
与上述实施例三中提供的获得状态集合中状态的方法类似,本实施例中提供了一种获得动作集合中动作的方法,具体的,该方法包括:
获得上述信息序列对应的动作序列,其中,动作序列由动作元素组成,动作序列中的每一动作元素与上述信息序列中的每一信息元素一一对应;
判断动作序列中未被选中过的动作元素的数量是否大于上述预设数量;
若为是,从动作序列中未被选中过的动作元素中选择预设数量个动作元素,生成机器人在目标寻物场景中的一个动作,作为第一动作;
判断动作集合中是否存在第一动作;
若不存在,将第一动作添加至动作集合,并返回执行上述判断动作序列中未被选中过的动作元素的数量是否大于所述预设数量的步骤;
若存在,直接返回执行判断动作序列中未被选中过的动作元素的数量是否大于所述预设数量的步骤。
获得动作集合中动作的具体方式与前述实施例三中获得状态集合中状态的方式相类似,区别仅仅在于“动作”与“状态”的差别,相关之处可以参见实施例三部分,这里不再赘述。
本实施例中,机器人通过采集目标寻物场景的信息序列对应的动作序列并对动作序列进行分析的方式,获得机器人在目标寻物场景中的各个动作,这样一方面无需用户手动设置机器人在目标寻物场景中的动作,提高了机器人的自动化程度;另一方面,机器人可以针对不同的场景自适应的获得其动作,进而提高了机器人针对不同场景的适应性。
与上述机器学习方法相对应,本申请实施例还提供了一种寻物方法。
图4提供了一种寻物方法的流程示意图,该方法应用于机器人,包括:
S401:接收在目标寻物场景中寻找目的物的寻物指令。
S402:获得机器人的当前状态。
S403:根据寻物策略池中包含当前状态的、用于寻找目的物的寻物策略,确定机器人从当前状态转换至下一状态执行的动作。
其中,上述寻物策略池中的寻物策略可以是:以寻找目的物的最优寻物策略为学习目标进行策略学习得到的、机器人在目标寻物场景中寻找目的物的策略。
具体的,进行策略学习的具体方式可以参见前述机器学习方法实施例部分提供的具体方式,这里不再赘述。
寻物策略包含:机器人从寻物策略起始状态开始至寻找到目的物依次经历的各个状态、从每一状态转换至下一状态机器人执行的动作。
S404:执行所确定的动作实现状态转换,并判断是否寻找到目的物,若为否,返回执行S402,直至寻找到目的物。
本申请的一种实现方式中,根据寻物策略池中包含当前状态的、用于寻找目的物的寻物策略,确定机器人从当前状态转换至下一状态执行的动作(S403),包括:
按照以下表达式,计算在寻物策略池中包含当前状态的、用于寻找目的物的寻物策略下预设的增强学习算法的价值函数的输出值:
Figure PCTCN2018095769-appb-000017
其中,V π表示寻物策略π下上述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,n表示当前状态在寻物策略π中的标识,x表示寻物策略π中从当前状态至策略终止状态的状态转换次数,π(S m)表示寻物策略π中机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,R e表示上述增强学习算法中的回报函数;
选择计算得到的输出值中最大输出值对应的寻物策略为目标寻物策略;
从目标寻物策略中确定机器人从当前状态转换至下一状态要执行的动作。
本申请的一种实现方式中,获得机器人的当前状态时,可以采集目标寻物场景的信息序列,从信息序列中选择预设数量个信息元素,判断预先获得的目标寻物场景的状态集合中是否存在与所选择的信息元素相匹配的状态,若存在,将状态集合中与所选择的信息元素相匹配的状态确定为机器人的当前状态。
其中,上述信息序列由信息元素组成,信息元素包括:视频帧和/或音频帧。状态集合为:机器人在目标寻物场景中所能处的状态的集合。
由以上可见,上述各个实施例提供的方案中,机器人以寻找目的物的最优寻物策略为学习目标进行策略学习,得到机器人在目标寻物场景中寻找目的物的策略,然后,机器人在寻物时基于上述学习得到的各种寻物策略寻找目的物,这样,无需使用机器人自身设置的定位装置寻找目的物,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
与前述机器学习方法相对应,本申请实施例还提供了一种机器学习装置。
图5为本申请实施例提供的一种机器学习装置,该装置应用于机器人,包括:
状态选择模块501,用于从目标寻物场景的状态集合中选择状态,作为第一状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中的状态的集合;
策略获得模块502,用于以所述第一状态为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,其中,所述寻物策略包含:所述机器人从所述寻物策略的起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
策略学习模块503,用于以所述目标最优寻物策略为学习目标进行策略学习,获得所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略,并将所获得的寻物策略添加至寻物策略池,其中,所获得的寻物策略为:以所述第一状态为起始状态、以第二状态为终止状态的寻物策略,所述第二状态为:所述目的物在所述目标寻物场景中的位置对应的所述机器人所处的状态;
策略比较模块504,用于比较所获得寻物策略与所述目标最优寻物策略是否一致,若一致,触发学习判定模块505,若不一致,触发所述状态选择模块501;
所述学习判定模块505,用于判定完成以所述第一状态为寻物策略的起始状态的策略学习。
可选的,所述寻物策略中每一状态的下一状态、从每一状态转换至下一状态所述机器人执行的动作,是根据预先统计的、转换前状态转换至其他状态的概率确定的;
从每一状态转换至下一状态所述机器人执行的动作为:属于所述目标寻物场景的动作集合的动作,其中,所述动作集合为:所述机器人在所述目标寻物场景中进行状态转换时执行动作的集合。
由以上可见,本实施例提供的方案中,机器人以目标寻物场景的状态集合中一个状态作为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,以目标最优寻物策略为学习目标进行策略学习,获得机器人在目标寻物 场景中寻找目的物的寻物策略,这样,机器人在目标寻物场景中寻找目的物时,根据前述学习得到的寻物策略即可,而无需使用机器人自身设置的定位装置,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
本申请的一种实现方式中,参见图6,提供了另一种机器学习的装置的结构示意图,与前述图5所示实施例相比,本实施例中,上述策略学习模块503,包括:
回报函数确定子模块503A,用于以所述目标最优寻物策略为学习目标,利用目标类型的寻物策略确定用于进行策略学习的增强学习算法中的回报函数,所述目标类型的寻物策略为:寻物策略池中用于寻找所述目标物的寻物策略;
策略获得子模块503B,用于基于所述回报函数进行策略学习,获得使得所述增强学习算法中价值函数的输出值最大的寻物策略,作为所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略;
策略添加子模块503C,用于将所获得的寻物策略添加至寻物策略池。
可选的,所述回报函数确定子模块503A,具体用于确定使得以下表达式的取值最大的回报函数R为用于进行策略学习的增强学习算法中的回报函数:
Figure PCTCN2018095769-appb-000018
其中,
Figure PCTCN2018095769-appb-000019
Figure PCTCN2018095769-appb-000020
k表示所述寻物策略池中所包含寻找所述目的物的寻物策略的数量,i表示所述寻物策略池中各个寻找所述目的物的寻物策略的标识,π i表示所述寻物策略池中标识为i的寻找所述目的物的寻物策略,π d表示所述目标最优寻物策略,S 0表示所述第一状态,V π表示寻物策略π下所述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,t表示寻物策略π中状态转换的次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1, maximise()表示取最大值函数。
具体的,所述策略学习子模块503B,可以包括:
策略学习单元,用于按照预设的状态转换方式,学习得到以所述第一状态为寻物策略起始状态、以所述第二状态为寻物策略终止状态的寻物策略;
输出值计算单元,用于按照以下表达式计算学习到的每一寻物策略下所述增强学习算法中价值函数的输出值:
Figure PCTCN2018095769-appb-000021
其中,R e表示所述增强学习算法中的回报函数;
策略确定单元,用于将计算得到的输出值中最大输出值对应的寻物策略确定为使得所述增强学习算法中价值函数的输出值最大的寻物策略;
策略加入单元,用于将所获得的寻物策略添加至所述寻物策略池。
由以上可见,上述各个实施例提供的方案中,机器人以目标寻物场景的状态集合中一个状态作为寻物策略起始状态,结合预设的增强学习算法对寻物策略进行策略学习,得到在目标寻物场景中寻找目的物的各种寻物策略,这样,机器人在目标寻物场景中寻找目的物时,根据前述学习得到的寻物策略即可,而无需使用机器人自身设置的定位装置,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
本申请的一种实现方式中,上述学习装置还包括:状态获得模块506;
状态获得模块506,用于获得所述状态集合中的状态。
具体的,参见图7,提供了一种获得状态集合中状态的装置的结构示意图,也就是上述状态获得模块506的结构示意图,所述状态获得模块506,包括:
第一序列采集子模块506A,用于采集所述目标寻物场景的信息序列,其中,所述信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
第一元素数量判断子模块506B,用于判断所述信息序列中未被选中过的信息元素的数量是否大于预设数量,若为是,触发状态生成子模块506C;
状态生成子模块506C,用于从所述信息序列中未被选中过的信息元素中选择所述预设数量个信息元素,生成所述机器人在所述目标寻物场景中所处 的一个状态,作为第三状态;
状态判断子模块506D,用于判断所述状态集合中是否存在所述第三状态,若不存在,触发状态添加子模块506E,若存在,触发所述第一元素数量判断子模块506B;
状态添加子模块506E,用于将所述第三状态添加至所述状态集合,并触发所述第一元素数量判断子模块506B。
本申请的另一实现方式中,所述装置还可以包括:
动作获得模块,用于获得所述动作集合中的动作;
所述动作获得模块,包括:
第二序列采集子模块,用于获得所述信息序列对应的动作序列,其中,所述动作序列由动作元素组成,所述动作序列中的每一动作元素与所述信息序列中的每一信息元素一一对应;
第二元素数量判断子模块,用于判断所述动作序列中未被选中过的动作元素的数量是否大于所述预设数量,若为是,触发动作生成子模块;
动作生成子模块,用于从所述动作序列中未被选中过的动作元素中选择所述预设数量个动作元素,生成所述机器人在所述目标寻物场景中的一个动作,作为第一动作;
动作判断子模块,用于判断所述动作集合中是否存在所述第一动作,若不存在,触发动作添加子模块,若存在,触发所述第二元素数量判断子模块;
动作添加子模块,用于将所述第一动作添加至所述动作集合,并触发所述第二元素数量判断子模块。
与前述寻物方法相对应,本申请实施例还提供了一种寻物装置。
图8为本申请实施例提供的一种寻物装置的结构示意图,该装置应用于机器人,包括:
指令接收模块801,用于接收在目标寻物场景中寻找目的物的寻物指令;
状态获得模块802,用于获得所述机器人的当前状态;
动作确定模块803,用于根据寻物策略池中包含所述当前状态的、用于寻找目的物的寻物策略,确定所述机器人从当前状态转换至下一状态执行的动 作,其中,所述寻物策略池中的寻物策略是:预先以寻找所述目的物的最优寻物策略为学习目标进行策略学习得到的、所述机器人在所述目标寻物场景中寻找所述目的物的策略,寻物策略包含:所述机器人从寻物策略起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
状态转换模块804,用于执行所确定的动作实现状态转换,并判断是否寻找到所述目的物,若为否,触发所述状态获得模块802。
具体的,所述动作确定模块803可以包括:
输出值计算子模块,用于按照以下表达式,计算在策略池中包含所述当前状态的寻物策略下预设的增强学习算法的价值函数的输出值:
Figure PCTCN2018095769-appb-000022
其中,V π表示寻物策略π下所述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,n表示所述当前状态在寻物策略π中的标识,x表示寻物策略π中从所述当前状态至策略终止状态的状态转换次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,R e表示所述增强学习算法中的回报函数;
策略选择子模块,用于选择计算得到的输出值中最大输出值对应的寻物策略为目标寻物策略;
动作确定子模块,用于从所述目标寻物策略中确定所述机器人从当前状态转换至下一状态要执行的动作。
具体的,所述状态获得模块602可以包括:
序列采集子模块,用于采集所述目标寻物场景的信息序列,其中,所述信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
元素选择子模块,用于从所述信息序列中选择预设数量个信息元素;
状态判断子模块,用于判断预先获得的所述目标寻物场景的状态集合中是否存在与所选择的信息元素相匹配的状态,其中,所述状态集合为:所述 机器人在所述目标寻物场景中所能处的状态的集合,若存在,触发状态确定子模块;
所述状态确定子模块,用于将所述状态集合中与所选择的信息元素相匹配的状态确定为所述机器人的当前状态。
由以上可见,上述各个实施例提供的方案中,机器人以寻找目的物的最优寻物策略为学习目标进行策略学习,得到机器人在目标寻物场景中寻找目的物的策略,然后,机器人在寻物时基于上述学习得到的各种寻物策略寻找目的物,这样,无需使用机器人自身设置的定位装置寻找目的物,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
与前述学习方法、学习装置相对应,本申请实施例还提供了一种机器人。
图9为本申请实施例提供的一种机器人的结构示意图,包括:处理器和存储器,其中,
存储器,用于存放计算机程序;
处理器,用于执行存储器上所存放的程序时,实现本申请实施例提供的机器学习方法。
具体的,上述机器学习方法包括:
从目标寻物场景的状态集合中选择状态,作为第一状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中的状态的集合;
以所述第一状态为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,其中,所述寻物策略包含:所述机器人从所述寻物策略的起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
以所述目标最优寻物策略为学习目标进行策略学习,获得所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略,并将所获得的寻物策略添加至寻物策略池,其中,所获得的寻物策略为:以所述第一状态为起始状态、以第二状态为终止状态的寻物策略,所述第二状态为:所述目的物在所述目 标寻物场景中的位置对应的所述机器人所处的状态;
比较所获得寻物策略与所述目标最优寻物策略是否一致;
若一致,判定完成以所述第一状态为寻物策略的起始状态的策略学习;
若不一致,返回所述从目标寻物场景的状态集合中选择状态的步骤。
需要说明的是,上述处理器执行存储器上所存放的程序而实现的机器学习方法的其他实施例与前述方法实施例部分提及的机器学习方法实施例相同,这里不再赘述。
一种实现方式中,上述机器人还可以包括以下器件中的至少一种:
图像采集器件、轮子、机械腿、机械臂等等。
本实施例提供的方案中,机器人以目标寻物场景的状态集合中一个状态作为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,以目标最优寻物策略为学习目标进行策略学习,获得机器人在目标寻物场景中寻找目的物的寻物策略,这样,机器人在目标寻物场景中寻找目的物时,根据前述学习得到的寻物策略即可,而无需使用机器人自身设置的定位装置,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
与前述寻物方法、寻物装置相对应,本申请实施例还提供了一种机器人。
图10为本申请实施例提供的另一种机器人的结构示意图,包括:处理器和存储器,其中,
存储器,用于存放计算机程序;
处理器,用于执行存储器上所存放的程序时,实现本申请实施例所述的寻物方法。
具体的,上述寻物方法,包括:
接收在目标寻物场景中寻找目的物的寻物指令;
获得所述机器人的当前状态;
根据寻物策略池中包含所述当前状态的、用于寻找目的物的寻物策略,确定所述机器人从当前状态转换至下一状态执行的动作,其中,所述寻物策略池中的寻物策略是:预先以寻找所述目的物的最优寻物策略为学习目标进行策略学习得到的、所述机器人在所述目标寻物场景中寻找所述目的物的策略,寻物策略包含:所述机器人从寻物策略起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
执行所确定的动作实现状态转换,并判断是否寻找到所述目的物;
若为否,返回执行所述获得所述机器人的当前状态的步骤,直至寻找到所述目的物。
需要说明的是,上述处理器执行存储器上所存放的程序而实现的寻物方法的其他实施例与前述方法实施例部分提及的寻物方法实施例相同,这里不再赘述。
一种实现方式中,上述机器人还可以包括以下器件中的至少一种:
图像采集器件、轮子、机械腿、机械臂等等。
由以上可见,本实施例提供的方案中,机器人以寻找目的物的最优寻物策略为学习目标进行策略学习,得到机器人在目标寻物场景中寻找目的物的策略,然后,机器人在寻物时基于上述学习得到的各种寻物策略寻找目的物,这样,无需使用机器人自身设置的定位装置寻找目的物,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
需要说明的是,上述两种机器人中涉及的存储器可以包括随机存取存储器(Random Access Memory,RAM),也可以包括非易失性存储器(Non-Volatile Memory,NVM),例如至少一个磁盘存储器。可选的,存储器还可以是至少一个位于远离前述处理器的存储装置。
上述两种机器人中涉及的处理器可以是通用处理器,包括中央处理器(Central Processing Unit,CPU)、网络处理器(Network Processor,NP)等;还可以是数字信号处理器(Digital Signal Processing,DSP)、专用集成电路 (Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。
与前述学习方法、学习装置相对应,本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质为机器人中的计算机可读存储介质,所述计算机可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现本申请实施例所述的机器学习方法。
具体的,上述学习机器方法包括:
从目标寻物场景的状态集合中选择状态,作为第一状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中的状态的集合;
以所述第一状态为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,其中,所述寻物策略包含:所述机器人从所述寻物策略的起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
以所述目标最优寻物策略为学习目标进行策略学习,获得所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略,并将所获得的寻物策略添加至寻物策略池,其中,所获得的寻物策略为:以所述第一状态为起始状态、以第二状态为终止状态的寻物策略,所述第二状态为:所述目的物在所述目标寻物场景中的位置对应的所述机器人所处的状态;
比较所获得寻物策略与所述目标最优寻物策略是否一致;
若一致,判定完成以所述第一状态为寻物策略的起始状态的策略学习;
若不一致,返回所述从目标寻物场景的状态集合中选择状态的步骤。
需要说明的是,上述处理器执行存储器上所存放的程序而实现的机器学习方法的其他实施例与前述方法实施例部分提及的机器学习方法实施例相同,这里不再赘述。
本实施例提供的方案中,机器人通过执行其计算机可读存储介质内存储 的计算机程序,以目标寻物场景的状态集合中一个状态作为寻物策略的起始状态,并以第一状态为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,以目标最优寻物策略为学习目标进行策略学习,获得机器人在目标寻物场景中寻找目的物的寻物策略,这样,机器人在目标寻物场景中寻找目的物时,根据前述学习得到的寻物策略即可,而无需使用机器人自身设置的定位装置,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
与前述寻物方法、寻物装置相对应,本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质为机器人中的计算机可读存储介质,所述计算机可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现本申请实施例所述的寻物方法。
具体的,上述寻物方法,包括:
接收在目标寻物场景中寻找目的物的寻物指令;
获得所述机器人的当前状态;
根据寻物策略池中包含所述当前状态的、用于寻找目的物的寻物策略,确定所述机器人从当前状态转换至下一状态执行的动作,其中,所述寻物策略池中的寻物策略是:预先以寻找所述目的物的最优寻物策略为学习目标进行策略学习得到的、所述机器人在所述目标寻物场景中寻找所述目的物的策略,寻物策略包含:所述机器人从寻物策略起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
执行所确定的动作实现状态转换,并判断是否寻找到所述目的物;
若为否,返回执行所述获得所述机器人的当前状态的步骤,直至寻找到所述目的物。
需要说明的是,上述处理器执行存储器上所存放的程序而实现的寻物方法的其他实施例与前述方法实施例部分提及的寻物方法实施例相同,这里不再赘述。
由以上可见,本实施例提供的方案中,机器人通过执行其计算机可读存 储介质内存储的计算机程序,以寻找目的物的最优寻物策略为学习目标进行策略学习,得到机器人在目标寻物场景中寻找目的物的策略,然后,机器人在寻物时基于上述学习得到的各种寻物策略寻找目的物,这样,无需使用机器人自身设置的定位装置寻找目的物,因而不会受寻物场景的影响,进而提高了寻物时的成功概率。
本申请实施例还提供了一种可执行程序代码,该可执行程序代码用于被运行以实现上述任一应用于机器人的机器学习方法步骤。
本申请实施例还提供了一种可执行程序代码,该可执行程序代码用于被运行以实现上述任一应用于机器人的寻物的方法步骤。
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置、机器人、计算机可读存储介质、可执行程序代码实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
以上所述仅为本申请的较佳实施例而已,并不用以限制本申请,凡在本申请的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本申请保护的范围之内。

Claims (26)

  1. 一种机器学习方法,其特征在于,应用于机器人,所述方法包括:
    从目标寻物场景的状态集合中选择状态,作为第一状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中的状态的集合;
    以所述第一状态为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,其中,所述寻物策略包含:所述机器人从所述寻物策略的起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
    以所述目标最优寻物策略为学习目标进行策略学习,获得所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略,并将所获得的寻物策略添加至寻物策略池,其中,所获得的寻物策略为:以所述第一状态为起始状态、以第二状态为终止状态的寻物策略,所述第二状态为:所述目的物在所述目标寻物场景中的位置对应的所述机器人所处的状态;
    比较所获得寻物策略与所述目标最优寻物策略是否一致;
    若一致,判定完成以所述第一状态为寻物策略的起始状态的策略学习;
    若不一致,返回所述从目标寻物场景的状态集合中选择状态的步骤。
  2. 根据权利要求1所述的方法,其特征在于,所述以所述目标最优寻物策略为学习目标进行策略学习,获得所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略,包括:
    以所述目标最优寻物策略为学习目标,利用目标类型的寻物策略确定用于进行策略学习的增强学习算法中的回报函数,所述目标类型的寻物策略为:寻物策略池中用于寻找所述目标物的寻物策略;
    基于所述回报函数进行策略学习,获得使得所述增强学习算法中价值函数的输出值最大的寻物策略,作为所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略。
  3. 根据权利要求2所述的方法,其特征在于,所述以所述目标最优寻物策略为学习目标,利用目标类型的寻物策略确定用于进行策略学习的增强学习算法中的回报函数,包括:
    确定使得以下表达式的取值最大的回报函数R为用于进行策略学习的增 强学习算法中的回报函数:
    Figure PCTCN2018095769-appb-100001
    其中,
    Figure PCTCN2018095769-appb-100002
    Figure PCTCN2018095769-appb-100003
    k表示所述寻物策略池中所包含寻找所述目的物的寻物策略的数量,i表示所述寻物策略池中各个寻找所述目的物的寻物策略的标识,π i表示所述寻物策略池中标识为i的寻找所述目的物的寻物策略,π d表示所述目标最优寻物策略,S 0表示所述第一状态,V π表示寻物策略π下所述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,t表示寻物策略π中状态转换的次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,maximise()表示取最大值函数。
  4. 根据要求3所述的方法,其特征在于,所述基于所述回报函数进行策略学习,获得使得所述增强学习算法中价值函数的输出值最大的寻物策略,包括:
    按照预设的状态转换方式,学习得到以所述第一状态为寻物策略起始状态、以所述第二状态为寻物策略终止状态的寻物策略;
    按照以下表达式计算学习到的每一寻物策略下所述增强学习算法中价值函数的输出值:
    Figure PCTCN2018095769-appb-100004
    其中,R e表示所述增强学习算法中的回报函数;
    将计算得到的输出值中最大输出值对应的寻物策略确定为使得所述增强学习算法中价值函数的输出值最大的寻物策略。
  5. 根据权利要求1-4中任一项所述的方法,其特征在于,通过以下方式确定所述寻物策略中每一状态的下一状态、以及从每一状态转换至下一状态 所述机器人执行的动作:
    根据预先统计的、转换前状态转换至其他状态的概率,确定转换后状态以及从所述转换前状态转换至所述转换后状态所述机器人执行的、属于所述目标寻物场景的动作集合的动作,其中,所述动作集合为:所述机器人在所述目标寻物场景中进行状态转换时执行动作的集合。
  6. 根据权利要求5所述的方法,其特征在于,所述状态集合中的状态通过以下方式得到:
    采集所述目标寻物场景的信息序列,其中,所述信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
    判断所述信息序列中未被选中过的信息元素的数量是否大于预设数量;
    若为是,从所述信息序列中未被选中过的信息元素中选择所述预设数量个信息元素,生成所述机器人在所述目标寻物场景中所处的一个状态,作为第三状态;
    判断所述状态集合中是否存在所述第三状态;
    若不存在,将所述第三状态添加至所述状态集合,并返回执行所述判断所述信息序列中未被选中过的信息元素的数量是否大于预设数量的步骤;
    若存在,直接返回执行所述判断所述信息序列中未被选中过的信息元素的数量是否大于预设数量的步骤。
  7. 根据权利要求6所述的方法,其特征在于,所述动作集合中的动作通过以下方式得到:
    获得所述信息序列对应的动作序列,其中,所述动作序列由动作元素组成,所述动作序列中的每一动作元素与所述信息序列中的每一信息元素一一对应;
    判断所述动作序列中未被选中过的动作元素的数量是否大于所述预设数量;
    若为是,从所述动作序列中未被选中过的动作元素中选择所述预设数量个动作元素,生成所述机器人在所述目标寻物场景中的一个动作,作为第一动作;
    判断所述动作集合中是否存在所述第一动作;
    若不存在,将所述第一动作添加至所述动作集合,并返回执行所述判断所述动作序列中未被选中过的动作元素的数量是否大于所述预设数量的步骤;
    若存在,直接返回执行所述判断所述动作序列中未被选中过的动作元素的数量是否大于所述预设数量的步骤。
  8. 一种寻物方法,其特征在于,应用于机器人,所述方法包括:
    接收在目标寻物场景中寻找目的物的寻物指令;
    获得所述机器人的当前状态;
    根据寻物策略池中包含所述当前状态的、用于寻找目的物的寻物策略,确定所述机器人从当前状态转换至下一状态执行的动作,其中,所述寻物策略池中的寻物策略是:预先以寻找所述目的物的最优寻物策略为学习目标进行策略学习得到的、所述机器人在所述目标寻物场景中寻找所述目的物的策略,寻物策略包含:所述机器人从寻物策略起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
    执行所确定的动作实现状态转换,并判断是否寻找到所述目的物;
    若为否,返回执行所述获得所述机器人的当前状态的步骤,直至寻找到所述目的物。
  9. 根据权利要求8所述的方法,其特征在于,所述根据寻物策略池中包含所述当前状态的用于寻找所述目的物的寻物策略,确定所述机器人从当前状态转换至下一状态执行的动作,包括:
    按照以下表达式,计算在策略池中包含所述当前状态的寻物策略下预设的增强学习算法的价值函数的输出值:
    Figure PCTCN2018095769-appb-100005
    其中,V π表示寻物策略π下所述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,n表示所述当前状态在寻物策略π中的标识,x表示寻物策略π中从所述当前状态至策略终止状态的状态转换次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,R e表示所述增强学习算法中的回报函数;
    选择计算得到的输出值中最大输出值对应的寻物策略为目标寻物策略;
    从所述目标寻物策略中确定所述机器人从当前状态转换至下一状态执行的动作。
  10. 根据权利要求8或9中任一项所述的方法,其特征在于,所述获得所述机器人的当前状态,包括:
    采集所述目标寻物场景的信息序列,其中,所述信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
    从所述信息序列中选择预设数量个信息元素;
    判断预先获得的所述目标寻物场景的状态集合中是否存在与所选择的信息元素相匹配的状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中所能处的状态的集合;
    若存在,将所述状态集合中与所选择的信息元素相匹配的状态确定为所述机器人的当前状态。
  11. 一种学习装置,其特征在于,应用于机器人,所述装置包括:
    状态选择模块,用于从目标寻物场景的状态集合中选择状态,作为第一状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中的状态的集合;
    策略获得模块,用于以所述第一状态为寻物策略的起始状态,获得寻找目的物的目标最优寻物策略,其中,所述寻物策略包含:所述机器人从所述寻物策略的起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
    策略学习模块,用于以所述目标最优寻物策略为学习目标进行策略学习,获得所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略,并将所获得的寻物策略添加至寻物策略池,其中,所获得的寻物策略为:以所述第一状态为起始状态、以第二状态为终止状态的寻物策略,所述第二状态为:所述目的物在所述目标寻物场景中的位置对应的所述机器人所处的状态;
    策略比较模块,用于比较所获得寻物策略与所述目标最优寻物策略是否一致,若一致,触发学习判定模块,若不一致,触发所述状态选择模块;
    所述学习判定模块,用于判定完成以所述第一状态为寻物策略的起始状 态的策略学习。
  12. 根据权利要求11所述的装置,其特征在于,所述策略学习模块,包括:
    回报函数确定子模块,用于以所述目标最优寻物策略为学习目标,利用目标类型的寻物策略确定用于进行策略学习的增强学习算法中的回报函数,所述目标类型的寻物策略为:寻物策略池中用于寻找所述目标物的寻物策略;
    策略获得子模块,用于基于所述回报函数进行策略学习,获得使得所述增强学习算法中价值函数的输出值最大的寻物策略,作为所述机器人在所述目标寻物场景中寻找所述目的物的寻物策略;
    策略添加子模块,用于将所获得的寻物策略添加至寻物策略池。
  13. 根据权利要求12所述的装置,其特征在于,所述回报函数确定子模块,具体用于确定使得以下表达式的取值最大的回报函数R为用于进行策略学习的增强学习算法中的回报函数:
    Figure PCTCN2018095769-appb-100006
    其中,
    Figure PCTCN2018095769-appb-100007
    Figure PCTCN2018095769-appb-100008
    k表示所述寻物策略池中所包含寻找所述目的物的寻物策略的数量,i表示所述寻物策略池中各个寻找所述目的物的寻物策略的标识,π i表示所述寻物策略池中标识为i的寻找所述目的物的寻物策略,π d表示所述目标最优寻物策略,S 0表示所述第一状态,V π表示寻物策略π下所述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,t表示寻物策略π中状态转换的次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,maximise()表示取最大值函数。
  14. 根据要求13所述的装置,其特征在于,所述策略学习子模块,包括:
    策略学习单元,用于按照预设的状态转换方式,学习得到以所述第一状态为寻物策略起始状态、以所述第二状态为寻物策略终止状态的寻物策略;
    输出值计算单元,用于按照以下表达式计算学习到的每一寻物策略下所述增强学习算法中价值函数的输出值:
    Figure PCTCN2018095769-appb-100009
    其中,R e表示所述增强学习算法中的回报函数;
    策略确定单元,用于将计算得到的输出值中最大输出值对应的寻物策略确定为使得所述增强学习算法中价值函数的输出值最大的寻物策略;
    策略加入单元,用于将所获得的寻物策略添加至所述寻物策略池。
  15. 根据权利要求11-14中任一项所述的装置,其特征在于,所述寻物策略中每一状态的下一状态、从每一状态转换至下一状态所述机器人执行的动作,是根据预先统计的、转换前状态转换至其他状态的概率确定的;
    从每一状态转换至下一状态所述机器人执行的动作为:属于所述目标寻物场景的动作集合的动作,其中,所述动作集合为:所述机器人在所述目标寻物场景中进行状态转换时执行动作的集合。
  16. 根据权利要求15所述的装置,其特征在于,所述装置还包括:
    状态获得模块,用于获得所述状态集合中的状态;
    所述状态获得模块,包括:
    第一序列采集子模块,用于采集所述目标寻物场景的信息序列,其中,所述信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
    第一元素数量判断子模块,用于判断所述信息序列中未被选中过的信息元素的数量是否大于预设数量,若为是,触发状态生成子模块;
    状态生成子模块,用于从所述信息序列中未被选中过的信息元素中选择所述预设数量个信息元素,生成所述机器人在所述目标寻物场景中所处的一个状态,作为第三状态;
    状态判断子模块,用于判断所述状态集合中是否存在所述第三状态,若不存在,触发状态添加子模块,若存在,触发所述第一元素数量判断子模块;
    状态添加子模块,用于将所述第三状态添加至所述状态集合,并触发所述第一元素数量判断子模块。
  17. 根据权利要求16所述的装置,其特征在于,所述装置还包括:
    动作获得模块,用于获得所述动作集合中的动作;
    所述动作获得模块,包括:
    第二序列采集子模块,用于获得所述信息序列对应的动作序列,其中,所述动作序列由动作元素组成,所述动作序列中的每一动作元素与所述信息序列中的每一信息元素一一对应;
    第二元素数量判断子模块,用于判断所述动作序列中未被选中过的动作元素的数量是否大于所述预设数量,若为是,触发动作生成子模块;
    动作生成子模块,用于从所述动作序列中未被选中过的动作元素中选择所述预设数量个动作元素,生成所述机器人在所述目标寻物场景中的一个动作,作为第一动作;
    动作判断子模块,用于判断所述动作集合中是否存在所述第一动作,若不存在,触发动作添加子模块,若存在,触发所述第二元素数量判断子模块;
    动作添加子模块,用于将所述第一动作添加至所述动作集合,并触发所述第二元素数量判断子模块。
  18. 一种寻物装置,其特征在于,应用于机器人,所述装置包括:
    指令接收模块,用于接收在目标寻物场景中寻找目的物的寻物指令;
    状态获得模块,用于获得所述机器人的当前状态;
    动作确定模块,用于根据寻物策略池中包含所述当前状态的、用于寻找目的物的寻物策略,确定所述机器人从当前状态转换至下一状态执行的动作,其中,所述寻物策略池中的寻物策略是:预先以寻找所述目的物的最优寻物策略为学习目标进行策略学习得到的、所述机器人在所述目标寻物场景中寻找所述目的物的策略,寻物策略包含:所述机器人从寻物策略起始状态开始至寻找到所述目的物依次经历的各个状态、从每一状态转换至下一状态所述机器人执行的动作;
    状态转换模块,用于执行所确定的动作实现状态转换,并判断是否寻找到所述目的物,若为否,触发所述状态获得模块。
  19. 根据权利要求18所述的装置,其特征在于,所述动作确定模块,包括:
    输出值计算子模块,用于按照以下表达式,计算在策略池中包含所述当 前状态的寻物策略下预设的增强学习算法的价值函数的输出值:
    Figure PCTCN2018095769-appb-100010
    其中,V π表示寻物策略π下所述增强学习算法的价值函数的输出值,M表示寻物策略π中所包含状态的数量,m表示寻物策略π中各个状态的标识,n表示所述当前状态在寻物策略π中的标识,x表示寻物策略π中从所述当前状态至策略终止状态的状态转换次数,π(S m)表示寻物策略π中所述机器人从状态S m转换至下一状态执行的动作,γ为预设的系数,0<γ<1,R e表示所述增强学习算法中的回报函数;
    策略选择子模块,用于选择计算得到的输出值中最大输出值对应的寻物策略为目标寻物策略;
    动作确定子模块,用于从所述目标寻物策略中确定所述机器人从当前状态转换至下一状态要执行的动作。
  20. 根据权利要求18或19中任一项所述的装置,其特征在于,所述状态获得模块,包括:
    序列采集子模块,用于采集所述目标寻物场景的信息序列,其中,所述信息序列由信息元素组成,所述信息元素包括:视频帧和/或音频帧;
    元素选择子模块,用于从所述信息序列中选择预设数量个信息元素;
    状态判断子模块,用于判断预先获得的所述目标寻物场景的状态集合中是否存在与所选择的信息元素相匹配的状态,其中,所述状态集合为:所述机器人在所述目标寻物场景中所能处的状态的集合,若存在,触发状态确定子模块;
    所述状态确定子模块,用于将所述状态集合中与所选择的信息元素相匹配的状态确定为所述机器人的当前状态。
  21. 一种机器人,其特征在于,包括:处理器和存储器,其中,
    存储器,用于存放计算机程序;
    处理器,用于执行存储器上所存放的程序时,实现权利要求1-7任一所述的方法步骤。
  22. 一种机器人,其特征在于,包括:处理器和存储器,其中,
    存储器,用于存放计算机程序;
    处理器,用于执行存储器上所存放的程序时,实现权利要求8-10任一所述的方法步骤。
  23. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质为机器人中的计算机可读存储介质,所述计算机可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现权利要求1-7任一所述的方法步骤。
  24. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质为机器人中的计算机可读存储介质,所述计算机可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现权利要求8-10任一所述的方法步骤。
  25. 一种可执行程序代码,其特征在于,所述可执行程序代码用于被运行以实现权利要求1-7任一所述的方法步骤。
  26. 一种可执行程序代码,其特征在于,所述可执行程序代码用于被运行以实现权利要求8-10任一所述的方法步骤。
PCT/CN2018/095769 2017-07-20 2018-07-16 一种机器学习、寻物方法及装置 WO2019015544A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18835074.8A EP3657404B1 (en) 2017-07-20 2018-07-16 Machine learning and object searching method and device
US16/632,510 US11548146B2 (en) 2017-07-20 2018-07-16 Machine learning and object searching method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710594689.9 2017-07-20
CN201710594689.9A CN109284847B (zh) 2017-07-20 2017-07-20 一种机器学习、寻物方法及装置

Publications (1)

Publication Number Publication Date
WO2019015544A1 true WO2019015544A1 (zh) 2019-01-24

Family

ID=65016172

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095769 WO2019015544A1 (zh) 2017-07-20 2018-07-16 一种机器学习、寻物方法及装置

Country Status (4)

Country Link
US (1) US11548146B2 (zh)
EP (1) EP3657404B1 (zh)
CN (1) CN109284847B (zh)
WO (1) WO2019015544A1 (zh)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210237266A1 (en) * 2018-06-15 2021-08-05 Google Llc Deep reinforcement learning for robotic manipulation
CN113379793B (zh) * 2021-05-19 2022-08-12 成都理工大学 Online multi-object tracking method based on a Siamese network structure and attention mechanism
CN113987963B (zh) * 2021-12-23 2022-03-22 北京理工大学 Distributed channel rendezvous strategy generation method and apparatus
CN116128013B (zh) * 2023-04-07 2023-07-04 中国人民解放军国防科技大学 Ad hoc collaboration method, apparatus and computer device based on diverse population training

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003509B2 (en) * 2003-07-21 2006-02-21 Leonid Andreev High-dimensional data clustering with the use of hybrid similarity matrices
CN102609720B (zh) * 2012-01-31 2013-12-18 中国科学院自动化研究所 Pedestrian detection method based on a position correction model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101537618A (zh) * 2008-12-19 2009-09-23 北京理工大学 Vision system for a stadium ball-picking robot
US20140121833A1 (en) * 2012-10-30 2014-05-01 Samsung Techwin Co., Ltd. Apparatus and method for planning path of robot, and the recording media storing the program for performing the method
CN106926247A (zh) * 2017-01-16 2017-07-07 深圳前海勇艺达机器人有限公司 Robot with automatic object finding at home

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3657404A4

Also Published As

Publication number Publication date
US11548146B2 (en) 2023-01-10
US20200206918A1 (en) 2020-07-02
CN109284847B (zh) 2020-12-25
EP3657404A4 (en) 2020-07-22
EP3657404A1 (en) 2020-05-27
CN109284847A (zh) 2019-01-29
EP3657404B1 (en) 2024-05-01

Similar Documents

Publication Publication Date Title
WO2019015544A1 (zh) Machine learning and object searching method and device
US20210019619A1 (en) Machine learnable system with conditional normalizing flow
WO2019082165A1 (en) GENERATION OF NEURAL NETWORKS WITH COMPRESSED REPRESENTATION HAVING A HIGH DEGREE OF PRECISION
CN111985385A (zh) Behavior detection method, apparatus and device
US11861071B2 (en) Local perspective method and device of virtual reality equipment and virtual reality equipment
KR20220058900A (ko) 샘플 생성, 신경망의 트레이닝, 데이터 처리 방법 및 장치
Monzón et al. Observational learning with position uncertainty
CN115860107B (zh) Multi-robot searching method and system based on multi-agent deep reinforcement learning
Liu et al. Indoor navigation for mobile agents: A multimodal vision fusion model
KR20220122735A (ko) 동작 인식 방법, 장치, 컴퓨터 기기 및 저장 매체
Zuo et al. Double DQN method for object detection
Hundt et al. " good robot! now watch this!": Repurposing reinforcement learning for task-to-task transfer
CN109079786A (zh) Robotic arm grasping self-learning method and device
Daswani et al. Feature reinforcement learning: state of the art
US11850752B2 (en) Robot movement apparatus and related methods
CN110962120B (zh) Network model training method and apparatus, and robotic arm motion control method and apparatus
Zholus et al. Factorized world models for learning causal relationships
US20220004806A1 (en) Method and device for creating a machine learning system
CN110874553A (zh) Recognition model training method and apparatus
Ren et al. Learning bifunctional push-grasping synergistic strategy for goal-agnostic and goal-oriented tasks
Zhao et al. Potential driven reinforcement learning for hard exploration tasks
Liu et al. Robotic cognitive behavior control based on biology-inspired episodic memory
CN114529010A (zh) Robot autonomous learning method, apparatus, device and storage medium
Censi et al. Motion planning in observations space with learned diffeomorphism models
Choi et al. Recurrent DETR: Transformer-Based Object Detection for Crowded Scenes

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18835074

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018835074

Country of ref document: EP

Effective date: 20200220