CN113043282A - Method and system for object detection or robot interaction planning

Method and system for object detection or robot interaction planning

Info

Publication number
CN113043282A
Authority
CN
China
Prior art keywords
camera
model
image information
computing system
orientation
Prior art date
Legal status
Granted
Application number
CN202110433348.XA
Other languages
Chinese (zh)
Other versions
CN113043282B (en)
Inventor
叶旭涛
魏璇有
鲁仙·出杏光
Current Assignee
Mujin Technology
Original Assignee
Mujin Technology
Priority date
Filing date
Publication date
Application filed by Mujin Technology
Priority claimed from PCT/US2020/063938 (published as WO2021119083A1)
Publication of CN113043282A
Application granted
Publication of CN113043282B
Legal status: Active

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 - Programme-controlled manipulators
    • B25J 9/16 - Programme controls
    • B25J 9/1612 - Programme controls characterised by the hand, wrist, grip control
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 13/00 - Controls for manipulators
    • B25J 13/08 - Controls for manipulators by means of sensing devices, e.g. viewing or touching devices
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 19/00 - Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J 19/02 - Sensing devices
    • B25J 19/021 - Optical sensing devices
    • B25J 19/023 - Optical sensing devices including video camera means

Abstract

Methods and systems for object detection or robot interaction planning are presented. The system may be configured to: receive first image information representing at least a first portion of an object structure of an object in a field of view of a camera, wherein the first image information is associated with a first camera pose; generate or update sensed structure information representing the structure of the object based on the first image information; identify an object corner associated with the object structure; cause a robotic arm to move the camera to a second camera pose in which the camera is pointed at the object corner; receive second image information associated with the second camera pose; update the sensed structure information based on the second image information; determine an object type associated with the object based on the updated sensed structure information; and determine one or more robot interaction locations based on the object type.

Description

Method and system for object detection or robot interaction planning
The present application is a divisional application of patent application No. 202080005101.1, entitled "Method and computing system for performing object detection or robot interaction planning based on image information generated by a camera," filed on December 9, 2020.
Cross reference to related applications
This application claims the benefit of U.S. provisional application No. 62/946,973, entitled "ROBOTIC SYSTEM WITH GRIPPING MECHANISM," filed in December 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to a method and a computing system for performing planning of object detection or robot interaction based on image information generated by a camera.
Background
As automation becomes more prevalent, robots are being used in more environments, such as warehousing and retail environments. For example, robots may be used to interact with goods or other objects in a warehouse. The movement of the robot may be fixed or may be based on input, such as information generated by sensors in the warehouse.
Disclosure of Invention
One aspect of the present disclosure relates to a method performed by a computing system for performing object detection. In some cases, the computing system may include a non-transitory computer-readable medium having instructions to cause the computing system to perform the method. In an embodiment, the computing system may include a communication interface and at least one processing circuit. The communication interface is configured to communicate with: (i) a robot having a robotic arm and an end effector device disposed at or forming one end of the robotic arm; and (ii) a camera mounted on the robotic arm and having a camera field of view. The at least one processing circuit is configured to perform the following when an object is in the camera field of view: receiving first image information representing at least a first portion of an object structure associated with the object, wherein the first image information is generated by the camera when the camera is in a first camera pose in which the camera is directed at the first portion of the object structure; generating or updating sensed structure information representing the object structure associated with the object based on the first image information; identifying an object corner associated with the object structure based on the sensed structure information; outputting one or more camera placement movement commands that, when executed by the robot, cause the robotic arm to move the camera to a second camera pose in which the camera is pointed at the object corner; receiving second image information representing the object structure, wherein the second image information is generated by the camera when the camera is in the second camera pose; updating the sensed structure information based on the second image information to generate updated sensed structure information; determining an object type associated with the object based on the updated sensed structure information; determining one or more robot interaction locations based on the object type, wherein the one or more robot interaction locations are one or more locations for interaction between the end effector device and the object; and outputting one or more robot interaction movement commands for causing interaction at the one or more robot interaction locations, wherein the one or more robot interaction movement commands are generated based on the one or more robot interaction locations.
Drawings
Fig. 1A-1D illustrate a system for processing image information according to embodiments herein.
Fig. 2A-2D provide block diagrams illustrating computing systems configured to receive and process image information and/or perform object detection according to embodiments herein.
Fig. 3A and 3B illustrate an environment with a robotic arm and end effector apparatus for performing robotic interactions according to embodiments herein.
Fig. 4 shows a flow diagram illustrating an example method for generating a motion plan according to embodiments herein.
Fig. 5A-5C illustrate various aspects of generating image information for representing objects in a camera field of view according to embodiments herein.
Fig. 6 illustrates sensing structure information based on image information according to an embodiment herein.
FIG. 7 illustrates aspects of identifying object corners according to embodiments herein.
Fig. 8A-8C illustrate various aspects of generating image information for representing objects in a camera field of view according to embodiments herein.
Fig. 9 illustrates sensing structure information based on image information according to an embodiment herein.
Fig. 10A and 10B illustrate object recognition templates according to embodiments herein.
Fig. 11A and 11B illustrate a comparison between sensing structure information and a set of object recognition templates according to embodiments herein.
Fig. 12A and 12B illustrate a comparison between sensing structure information and a set of object recognition templates according to embodiments herein.
Fig. 13A-13C illustrate various aspects of a filtering operation on a candidate set of object recognition templates or model-orientation combinations according to embodiments herein.
FIG. 14 illustrates aspects of a pose refinement operation according to embodiments herein.
Fig. 15A-15C illustrate various aspects of determining error values according to embodiments herein.
FIG. 16 illustrates aspects of determining error values according to embodiments herein.
Fig. 17A-17D illustrate various aspects of determining a robot gripping location according to embodiments herein.
Detailed Description
One aspect of the present disclosure relates to performing object detection on objects in a field of view of a camera (also referred to as a camera field of view). For example, an object may be a box, crate, or other container in a warehouse, retail space, or other location. In embodiments, performing object detection may involve determining a characteristic of the object, such as an object type associated with the object. Another aspect of the present disclosure relates to planning robot interactions based on information obtained by performing object detection. A robot interaction may involve, for example, a robot engaging an object in the camera field of view, such as an interaction in which the robot grasps or otherwise picks up the object and moves the object to a destination location (e.g., as part of an unstacking operation).
In embodiments, object detection may be performed based on sets of image information generated by the camera, where the sets of image information may represent multiple views or viewpoints from which the camera senses or otherwise generates image information representing objects in the camera field of view. For example, the sets of image information may include a first set of image information representing a top view of an object and a second set of image information representing a perspective view of the object. In some implementations, the top view of the object may be used to perform a coarse detection (also referred to as rough detection), which may involve obtaining image information with a level of detail sufficient to identify object corners of the object. The camera may then be moved or otherwise positioned to point at an identified object corner, and the second set of image information representing the perspective view may be generated while the camera is pointed at that object corner. In some cases, the second image information may capture the object structure at a higher level of detail relative to the first image information. In such a case, the second image information may be used to refine the estimated description of the object structure. In some implementations, the first image information and the second image information may be used to generate sensed structure information, such as a global point cloud, representative of the object structure.
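The two-stage acquisition described above can be summarized in a short sketch. The Python code below is purely illustrative: the robot.move_camera_to() and camera.capture_point_cloud() interfaces are hypothetical stand-ins for the camera placement movement commands and image acquisition discussed in this disclosure, and the corner-picking heuristic is a simplification rather than the detection procedure actually claimed.

```python
import numpy as np

def pick_top_corner(cloud, z_tol=0.01):
    """Pick a candidate object corner from a top-view point cloud: take the
    highest surface (e.g., a container rim) and return an extreme point of it.
    This is a simple heuristic sketch, not the patent's exact procedure."""
    top_z = cloud[:, 2].max()
    rim = cloud[np.abs(cloud[:, 2] - top_z) < z_tol]   # points near the top surface
    # An extreme point of the rim in the XY plane serves as a corner candidate.
    return rim[np.argmax(rim[:, 0] + rim[:, 1])]

# Assumed (hypothetical) interfaces: robot.move_camera_to(pose) issues camera
# placement movement commands; camera.capture_point_cloud() returns an Nx3 array.
def coarse_to_fine_scan(robot, camera, top_down_pose, corner_view_pose_fn):
    robot.move_camera_to(top_down_pose)                # first camera pose (top view)
    global_cloud = camera.capture_point_cloud()        # first image information
    corner = pick_top_corner(global_cloud)             # identify an object corner
    robot.move_camera_to(corner_view_pose_fn(corner))  # second camera pose, aimed at the corner
    second_cloud = camera.capture_point_cloud()        # second image information
    return np.vstack([global_cloud, second_cloud])     # updated sensed structure information
```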
In an embodiment, performing object detection may involve comparing the sensed structure information to a set of object recognition templates, or more specifically, to a set of corresponding object structure models (e.g., CAD models) described by the set of object recognition templates. The comparison may be used, for example, to select one of the object recognition templates, where the selected object recognition template may be associated with the object type of the object. In some cases, the comparison may take into account different orientations of the object structure models. In such a case, the sensed structure information may be compared to a set of model-orientation combinations, each of which may include an object structure model and an orientation of that object structure model.
In an embodiment, selecting an object recognition template or model-orientation combination may involve calculating a set of error values. Each of the error values may be indicative of a respective degree of deviation between the sensed structure information and the object structure model in one of the object recognition templates or model-orientation combinations. More specifically, each of the error values may indicate how well or poorly the sensed structure information (e.g., the global point cloud) explains or supports the particular object structure model. In some cases, the selected object recognition template may have the lowest error value of the set of error values.
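As one illustration of such an error value, the average nearest-neighbor distance between the sensed global point cloud and each candidate object structure model (represented here as a point cloud) could serve as the degree of deviation, with the lowest-error candidate selected. This is a minimal sketch under that assumption, not the exact metric used by the disclosed system.

```python
import numpy as np
from scipy.spatial import cKDTree

def model_error(sensed_cloud, model_cloud):
    """Error value: mean distance from each sensed point to the nearest
    point of the candidate object structure model (both Nx3 arrays)."""
    tree = cKDTree(model_cloud)
    distances, _ = tree.query(sensed_cloud)   # nearest-neighbor distances
    return float(distances.mean())

def select_template(sensed_cloud, candidates):
    """candidates: dict mapping a template or model-orientation name to an
    Nx3 model point cloud. Returns the candidate with the lowest error value."""
    errors = {name: model_error(sensed_cloud, cloud)
              for name, cloud in candidates.items()}
    best = min(errors, key=errors.get)
    return best, errors
```

In this sketch, select_template would return both the chosen candidate and the full set of error values, so that downstream steps could inspect how close the competing candidates were.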
In embodiments, a filtering operation may be performed to remove certain object recognition templates or model-orientation combinations from consideration as potential matches for the sensed structure information. For example, the sensed structure information may define an estimated region of space occupied by an object in the camera field of view. In such implementations, the filtering operation may involve determining whether any of the object recognition templates or model-orientation combinations has an object structure model that does not substantially fit within the estimated region. If such an object structure model exists, the object recognition template or model-orientation combination associated with that object structure model may be filtered out.
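A minimal sketch of such a filtering step, assuming each candidate model-orientation combination is available as a point cloud and the estimated region is approximated by an axis-aligned bounding box of the sensed points, might look like the following; the tolerance value is an arbitrary illustrative choice.

```python
import numpy as np

def fits_within_region(model_cloud, sensed_cloud, tolerance=0.02):
    """Return True if the model's bounding-box extents fit within the extents of
    the region estimated from the sensed structure information (plus a tolerance)."""
    model_size = model_cloud.max(axis=0) - model_cloud.min(axis=0)
    region_size = sensed_cloud.max(axis=0) - sensed_cloud.min(axis=0)
    return bool(np.all(model_size <= region_size + tolerance))

def filter_candidates(candidates, sensed_cloud, tolerance=0.02):
    """Drop templates / model-orientation combinations whose object structure
    models cannot substantially fit within the estimated region."""
    return {name: cloud for name, cloud in candidates.items()
            if fits_within_region(cloud, sensed_cloud, tolerance)}
```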
In an embodiment, a pose refinement operation may be performed to adjust the object structure model to more closely match the sensed structure information. In some cases, the object structure model may describe various physical features of the object structure, and more particularly, may include pose information describing the pose of these physical features. In such cases, the pose refinement operation may involve adjusting pose information, which may change the orientation and/or position of various physical features described by the object structure model.
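Pose refinement of this kind is commonly implemented as an iterative-closest-point style alignment. The sketch below performs a single rigid alignment step (nearest-neighbor correspondences followed by a Kabsch/SVD fit) that adjusts the pose of a model point cloud toward the sensed point cloud; a practical implementation would iterate, reject outliers, and possibly use a dedicated registration library, and this is not necessarily the refinement used in the disclosed system.

```python
import numpy as np
from scipy.spatial import cKDTree

def refine_pose_once(model_cloud, sensed_cloud):
    """One alignment step: match each model point to its nearest sensed point,
    then solve for the rigid transform (R, t) that best maps model -> sensed."""
    tree = cKDTree(sensed_cloud)
    _, idx = tree.query(model_cloud)
    matched = sensed_cloud[idx]

    # Kabsch: subtract centroids, use the SVD of the covariance to get the rotation.
    mc, sc = model_cloud.mean(axis=0), matched.mean(axis=0)
    H = (model_cloud - mc).T @ (matched - sc)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = sc - R @ mc

    refined = (R @ model_cloud.T).T + t   # model cloud with adjusted pose information
    return refined, R, t
```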
In an embodiment, an object type associated with an object may be used to plan a robotic interaction with the object. For example, an object type may be associated with a particular object design, which may include a physical design and/or a visual design of a type or category of objects. In some implementations, a physical design (such as a physical shape or size of an object structure) may be used to plan the robotic interaction. In an embodiment, if the robot interaction involves the robot gripping an object, planning the robot interaction may involve determining one or more gripping locations on the object at which the robot is to grip the object. In some cases, if determining the object type is based on selecting an object recognition template associated with the object type, the one or more robot gripping locations may be determined based on the object recognition template, or more specifically, based on an object structure model described by the object recognition template.
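As an illustration only, once an object recognition template has been selected, candidate gripping locations might be derived from its object structure model, for example points near opposite rim corners of a container model, and then transformed into the robot's coordinate system using the object's estimated pose. The rim-corner heuristic and data layout below are assumptions, not the patent's prescribed procedure.

```python
import numpy as np

def grip_locations_from_model(model_cloud, object_pose, n_locations=2, z_tol=0.005):
    """Pick gripping locations on the rim of a container model (model frame) and
    express them in the robot/world frame. object_pose is a 4x4 homogeneous
    transform giving the refined pose of the model in the world frame."""
    top_z = model_cloud[:, 2].max()
    rim = model_cloud[np.abs(model_cloud[:, 2] - top_z) < z_tol]

    # Heuristic: grip near two opposite rim corners (extreme XY points).
    picks = [rim[np.argmax(rim[:, 0] + rim[:, 1])],
             rim[np.argmin(rim[:, 0] + rim[:, 1])]][:n_locations]

    # Transform the model-frame locations into the world frame.
    world = []
    for p in picks:
        p_h = np.append(p, 1.0)                 # homogeneous coordinates
        world.append((object_pose @ p_h)[:3])
    return np.array(world)
```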
Fig. 1A shows a system 1000 for performing object detection and/or planning robot interactions based on image information. More specifically, the system 1000 may include a computing system 1100 and a camera 1200. In this example, the camera 1200 may be configured to generate image information that describes or otherwise represents the environment in which the camera 1200 is located, or more specifically the environment in the field of view of the camera 1200 (also referred to as the camera field of view). The environment may be, for example, a warehouse, a manufacturing facility, a retail space, or some other location (the terms "or" and "and/or" are used interchangeably in this disclosure). In such a case, the image information may represent objects located at such a place, such as containers (e.g., boxes) holding various items. The computing system 1100 may be configured to receive and process the image information, such as by performing object detection based on the image information. Object detection may involve, for example, determining the type of object (also referred to as the object type) of an object in the camera field of view. In some cases, the computing system may plan a robot interaction based on the object type. A robot interaction may involve, for example, a robot grasping or otherwise picking up or engaging an object. For example, if the object is a container, the robot interaction may involve the robot picking up the container by gripping or otherwise grasping it and moving it to a destination location. The computing system 1100 and the camera 1200 may be located in the same facility or may be remote from each other. For example, the computing system 1100 may be part of a cloud computing platform hosted in a data center remote from a warehouse or retail space, and may communicate with the camera 1200 via a network connection.
In an embodiment, the camera 1200 may be a 3D camera (also referred to as a spatial structure sensing camera or a spatial structure sensing device) configured to generate 3D image information (also referred to as spatial structure information) about an environment in a field of view of the camera. In an embodiment, the camera 1200 may be a 2D camera configured to generate 2D image information (or more specifically, a 2D image) describing the visual appearance of the environment in the field of view of the camera. In some cases, the camera 1200 may be a combination of a 3D camera and a 2D camera configured to generate 3D image information and 2D image information. The 3D image information may include depth information that describes respective depth values of various locations (such as locations on the surfaces of various objects in the field of view of the camera 1200, or more specifically, locations on the structures of these objects) relative to the camera 1200. In this example, the depth information may be used to estimate how objects are spatially arranged in a three-dimensional (3D) space. In some cases, the 3D image information may include a point cloud describing locations on one or more surfaces of objects in the field of view of the camera. More specifically, the 3D image information may describe various positions on the structure of the object (also referred to as object structure).
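The relationship between a depth map and a point cloud can be made concrete with a standard pinhole-camera deprojection. The parameters fx, fy, cx, cy below are camera intrinsics, which are assumed to be known; they are not specified in this disclosure.

```python
import numpy as np

def depth_map_to_point_cloud(depth, fx, fy, cx, cy):
    """Convert an HxW depth map (depth along the camera's optical axis, in
    meters) into an Nx3 point cloud expressed in the camera coordinate system."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column/row indices
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                 # drop pixels with no depth reading
```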
As described above, the camera 1200 may be a 3D camera and/or a 2D camera. The 2D camera may be configured to generate a 2D image, such as a color image or a grayscale image or other 2D image information. The 3D camera may be, for example, a depth sensing camera, such as a time-of-flight (TOF) camera or a structured light camera, or any other type of 3D camera. In some cases, the 2D camera and/or the 3D camera may each include an image sensor, such as a Charge Coupled Device (CCD) sensor and/or a Complementary Metal Oxide Semiconductor (CMOS) sensor. In embodiments, the 3D camera may include a laser, a lidar device, an infrared device, a light/dark sensor, a motion sensor, a microwave detector, an ultrasonic detector, a radar detector, or any other device configured to capture or otherwise generate 3D image information.
In an embodiment, the system 1000 may be a robot operating system for interacting with various objects in the environment of the camera 1200. For example, fig. 1B illustrates a robot operating system 1000A, which may be an embodiment of the system 1000 of fig. 1A. The robot operating system 1000A may include a computing system 1100, a camera 1200, and a robot 1300. In an embodiment, the robot 1300 may be used to interact with one or more objects in the environment of the camera 1200, such as with a bin, crate, box, case, or other container. For example, the robot 1300 may be configured to pick containers from one location and move them to another location. In some cases, the robot 1300 may be used to perform an unstacking operation in which a stack of containers is unloaded and moved to, for example, a conveyor belt, or may be used to stack containers onto a pallet in preparation for transporting them.
In some cases, the camera 1200 may be separate from the robot 1300. For example, in such a case, the camera 1200 may be a fixed camera mounted on a ceiling or at some other location at a warehouse or other site. In other cases, the camera 1200 may be part of the robot 1300 or otherwise attached to the robot 1300, which may provide the robot 1300 with the ability to move the camera 1200. For example, fig. 1C depicts a system 1000B (which may be an embodiment of system 1000) that may include the computing system 1100, camera 1200, and robot 1300 of fig. 1B, and wherein the robot 1300 includes a robotic arm 1400 and an end effector device 1500. The end effector device 1500 may be attached to, disposed on, or form one end of the robotic arm 1400. In the embodiment of fig. 1C, the end effector device 1500 may be moved via motion of the robotic arm 1400. In this example, the camera 1200 may be mounted on or otherwise attached to the end effector device 1500. If the end effector device 1500 is a robot hand (e.g., a gripper device), the camera 1200 may be referred to as an on-hand camera. By attaching the camera 1200 to the end effector device 1500, the robot 1300 may be able to move the camera 1200 to different poses (also referred to as camera poses) via motion of the robotic arm 1400 and/or the end effector device 1500. For example, as discussed in more detail below, the end effector device 1500 may position the camera 1200 to have various camera poses, and the camera 1200 may generate respective sets of image information at these camera poses. In such examples, the respective sets of image information may represent different viewpoints or perspectives from which to sense the environment of the camera 1200 and/or the robot 1300, and such image information may facilitate accurate object detection and planning of robot interactions.
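Because the camera is mounted on the end effector device in this configuration, each camera pose can be computed from the robot's forward kinematics combined with a fixed camera-to-end-effector (hand-eye) transform. The 4x4 homogeneous-transform convention and the placeholder mounting values below are illustrative assumptions, not values taken from this disclosure.

```python
import numpy as np

# Fixed mounting transform from the end effector frame to the camera frame,
# normally obtained by hand-eye calibration; the values here are placeholders.
T_EE_CAMERA = np.array([[1.0, 0.0, 0.0, 0.00],
                        [0.0, 1.0, 0.0, 0.05],
                        [0.0, 0.0, 1.0, 0.10],
                        [0.0, 0.0, 0.0, 1.00]])

def camera_pose_in_world(T_world_ee):
    """Camera pose = end effector pose composed with the mounting transform.
    T_world_ee is the 4x4 end effector pose from the robot's forward kinematics."""
    return T_world_ee @ T_EE_CAMERA
```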
In an embodiment, the computing system 1100 of fig. 1A-1D may form or be part of a robot control system (also referred to as a robot controller) that is part of the robot operating system 1000A/1000B. The robot control system may be a system configured to generate movement commands or other commands, for example, for robot 1300. In such embodiments, the computing system 1100 may be configured to generate such commands based on, for example, image information generated by the camera 1200.
In embodiments, the computing system 1100 may form or be part of a vision system. The vision system may be a system that generates, for example, visual information describing the environment in which the robot 1300 is located, or more specifically, the environment in which the camera 1200 is located. In some implementations, the visual information may include the image information discussed above. In some implementations, the visual information may describe object types or other characteristics of objects in the environment of the camera 1200 and/or the robot 1300. In such implementations, the computing system 1100 may generate such visual information based on the image information. If the computing system 1100 forms a vision system, the vision system may be part of the robot control system discussed above, or may be separate from the robot control system. If the vision system is separate from the robot control system, the vision system may be configured to output information describing the environment in which the robot 1300 is located, such as information describing object types or other characteristics of objects in the environment of the camera 1200 and/or the robot 1300. The information determined by the vision system may be output to the robot control system, which may receive such information from the vision system and control movement of the robot 1300 based on the information.
In an embodiment, if the computing system 1100 is configured to generate one or more movement commands, the movement commands may include, for example, camera placement movement commands and/or robot interaction movement commands. In this embodiment, a camera placement movement command may be a movement command used to control placement of the camera 1200, and more specifically, to cause the robot 1300 to move or otherwise position the camera 1200 to a particular camera pose, where the camera pose may include a combination of a particular camera position and a particular camera orientation. Robot interaction movement commands (also referred to as object interaction movement commands) may be used to control the interaction between the robot 1300 (or more specifically, its end effector device) and one or more objects, such as a stack of containers in a warehouse. For example, the robot interaction movement commands may cause the robotic arm 1400 of the robot 1300 of fig. 1C to move the end effector device 1500 to approach one of the containers, cause the end effector device 1500 to grasp or otherwise pick up the container, and then cause the robotic arm 1400 to move the container to a designated or calculated destination location. If the end effector device 1500 has gripper members, in some implementations the robot interaction movement commands may include gripper member positioning commands that cause the gripper members to move relative to the remainder of the end effector device in order to place or otherwise position the gripper members where they will grip a portion of the container (e.g., its rim).
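One way to picture the distinction between the two kinds of movement commands is as simple command records: a camera placement command carries a target camera pose, while a robot interaction command carries gripping locations and a gripper action. The structures below are purely illustrative; the disclosure does not prescribe any particular command format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pose = Tuple[float, float, float, float, float, float]  # x, y, z, roll, pitch, yaw

@dataclass
class CameraPlacementCommand:
    """Move the robot so the hand-mounted camera reaches a target camera pose."""
    target_camera_pose: Pose

@dataclass
class RobotInteractionCommand:
    """Approach, grip, and move an object."""
    grip_locations: List[Tuple[float, float, float]]   # where gripper members engage the object
    gripper_action: str = "close"                       # "close" to grip, "open" to release
    destination_pose: Pose = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
```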
In embodiments, the computing system 1100 may communicate with the camera 1200 and/or the robot 1300 via a direct wired connection, such as a connection provided via a dedicated wired communication interface, such as an RS-232 interface, a Universal Serial Bus (USB) interface, and/or via a local computer bus, such as a Peripheral Component Interconnect (PCI) bus. In some implementations, the computing system 1100 may communicate with the camera 1200 and/or with the robot 1300 via a wireless communication interface. In embodiments, the computing system 1100 may communicate with the camera 1200 and/or with the robot 1300 via a network. The network may be any type and/or form of network, such as a Personal Area Network (PAN), a Local Area Network (LAN) (e.g., an intranet), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), or the internet. The network may utilize different technologies and protocol layers or protocol stacks including, for example, ethernet protocol, internet protocol suite (TCP/IP), ATM (asynchronous transfer mode) technology, SONET (synchronous optical network) protocol, or SDH (synchronous digital hierarchy) protocol.
In embodiments, the computing system 1100 may communicate information directly with the camera 1200 and/or with the robot 1300, or may communicate via an intermediate storage device or more generally via an intermediate non-transitory computer-readable medium. For example, fig. 1D depicts a system 1000C (which may be an embodiment of system 1000/1000 a/1000B) that includes an intermediate non-transitory computer-readable medium 1600 for storing information generated by camera 1200, robot 1300, and/or by computing system 1100. Such an intermediate non-transitory computer-readable medium 1600 may be external to the computing system 1100, and may serve as an external buffer or repository for storing image information generated by the camera 1200, for example, storing commands generated by the computing system 1100, and/or other information (e.g., sensor information generated by the robot 1300). For example, if the intermediate non-transitory computer-readable medium 1600 is used to store image information generated by the camera 1200, the computing system 1100 may retrieve or otherwise receive the image information from the intermediate non-transitory computer-readable medium 1600. Examples of the non-transitory computer-readable medium 1600 include an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination thereof. The non-transitory computer readable medium may form, for example, a computer diskette, a Hard Disk Drive (HDD), a Solid State Drive (SSD), a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read only memory (CD-ROM), a Digital Versatile Disc (DVD), and/or a memory stick.
As described above, image information generated by the camera 1200 may be processed by the computing system 1100. In embodiments, the computing system 1100 may include or be configured as a server (e.g., having one or more server blades, processors, etc.), a personal computer (e.g., a desktop computer, a laptop computer, etc.), a smartphone, a tablet computing device, and/or any other computing system. In embodiments, any or all of the functionality of the computing system 1100 may be performed as part of a cloud computing platform. Computing system 1100 may be a single computing device (e.g., a desktop computer), or may include multiple computing devices.
Fig. 2A provides a block diagram that illustrates an embodiment of a computing system 1100. The computing system 1100 includes at least one processing circuit 1110 and non-transitory computer-readable medium(s) 1120. In an embodiment, the processing circuitry 1110 includes one or more processors, one or more processing cores, a programmable logic controller ("PLC"), an application specific integrated circuit ("ASIC"), a programmable gate array ("PGA"), a field programmable gate array ("FPGA"), any combination thereof, or any other processing circuitry. In an embodiment, the non-transitory computer-readable medium 1120 that is part of the computing system 1100 may be an alternative or an addition to the intermediate non-transitory computer-readable medium 1600 discussed above. The non-transitory computer-readable medium 1120 may be a storage device, such as an electronic, magnetic, optical, electromagnetic, semiconductor storage device, or any suitable combination thereof, such as, for example, a computer diskette, a Hard Disk Drive (HDD), a solid state drive (SSD), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, any combination thereof, or any other storage device. In some cases, the non-transitory computer-readable medium 1120 may include a plurality of storage devices. In some cases, the non-transitory computer-readable medium 1120 is configured to store image information generated by the camera 1200. The non-transitory computer-readable medium 1120 may alternatively or additionally store computer-readable program instructions that, when executed by the processing circuit 1110, cause the processing circuit 1110 to perform one or more methods described herein, such as the operations described with respect to fig. 4.
Fig. 2B depicts a computing system 1100A that is an embodiment of computing system 1100 and that includes a communication interface 1130. The communication interface 1130 (also referred to as a communication component or communication device) may be configured to receive, for example, image information generated by the camera 1200 of fig. 1A-1D. The image information may be received via the intermediate non-transitory computer-readable medium 1600 or network discussed above, or via a more direct connection between the camera 1200 and the computing system 1100/1100A. In an embodiment, the communication interface 1130 may be configured to communicate with the robot 1300 of fig. 1B and 1C. If the computing system 1100 is not part of a robot control system, the communication interface 1130 of the computing system 1100 may be configured to provide communication between the computing system 1100 and the robot control system. The communication interface 1130 may include or may be, for example, communication circuitry configured to perform communications via wired or wireless protocols. By way of example, the communication circuit may include an RS-232 port controller, a USB controller, an Ethernet controller, a PCI bus controller, any other communication circuit, or a combination thereof.
In an embodiment, the processing circuit 1110 may be programmed by one or more computer-readable program instructions stored on the non-transitory computer-readable medium 1120. For example, fig. 2C illustrates a computing system 1100B that is an embodiment of the computing system 1100/1100A, in which the processing circuit 1110 is programmed by one or more modules including an object detection module 1121 and a robot interaction planning module 1122.
In an embodiment, the object detection module 1121 may be configured to determine information associated with an object (e.g., a container) that is or has been in the camera field of view of the camera 1200 of fig. 1A-1D. The information may describe characteristics of the object, such as a type or category to which the object belongs (also referred to as an object type associated with the object), a size of the object, a shape of the object (also referred to as an object size and an object shape, respectively), and/or any other characteristic of the object. In some implementations, the object detection module 1121 may be configured to perform object recognition by comparing image information representing the object to an object recognition template, as discussed in more detail below.
In an embodiment, the robot interaction planning module 1122 may be configured to determine how the robot 1300 of fig. 1B and 1C will interact with objects in the environment of the robot 1300 and/or the camera 1200 (or more specifically, objects that are or have been in the camera field of view). The interaction may involve, for example, the robot 1300 grasping or otherwise picking up an object and moving the object to a destination location. In some cases, the robot interaction planning module 1122 may be configured to generate a motion plan to implement or perform the interaction. A motion plan for interacting with an object may be generated based on, for example, information determined by the object detection module 1121, such as the object type associated with the object. In an embodiment, the motion plan may identify one or more gripping locations or gripping portions of the object at which the robot 1300 is to grip the object. The motion plan may also cause at least a portion of the robot 1300 (e.g., the end effector device 1500) to be moved to the one or more gripping locations. In some cases, if the robot 1300 (or more specifically the end effector device 1500) includes one or more grippers, the robot interaction planning module 1122 may be configured to plan the operation of the one or more grippers. More specifically, if the one or more grippers are transitionable from an open state to a closed state to grasp or otherwise engage an object, and from the closed state to the open state to release the object, the robot interaction planning module 1122 may be configured to control or otherwise determine when the one or more grippers transition between the open state and the closed state. In some implementations, the motion plan may include or describe a trajectory that the robot 1300, or a portion thereof (e.g., the end effector device 1500), is to follow after the robot 1300 has grasped or otherwise picked up the object. The trajectory may cause the robot 1300 to move the object to a desired destination location. It is to be understood that the functionality of the modules discussed herein is representative and not limiting.
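A motion plan of the kind described here can be thought of as an ordered sequence of waypoints interleaved with gripper state transitions. The following is a schematic sketch under that assumption, not the module's actual output format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

class GripperState(Enum):
    OPEN = "open"      # gripper fingers apart: object released
    CLOSED = "closed"  # gripper fingers together: object gripped

@dataclass
class MotionStep:
    waypoint: Tuple[float, float, float]   # end effector position along the trajectory
    gripper: GripperState                  # gripper state to hold at this waypoint

@dataclass
class MotionPlan:
    steps: List[MotionStep]

# Example: approach a gripping location open, close to grasp, carry the object
# to a destination location, then open to release it.
plan = MotionPlan(steps=[
    MotionStep((0.50, 0.20, 0.30), GripperState.OPEN),
    MotionStep((0.50, 0.20, 0.12), GripperState.CLOSED),
    MotionStep((0.10, 0.80, 0.40), GripperState.CLOSED),
    MotionStep((0.10, 0.80, 0.40), GripperState.OPEN),
])
```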
In various embodiments, the terms "computer-readable instructions" and "computer-readable program instructions" are used to describe software instructions or computer code that are configured to perform various tasks and operations. In various embodiments, the term "module" broadly refers to a collection of software instructions or code configured to cause the processing circuit 1110 to perform one or more functional tasks. When a processing circuit or other hardware component is executing a module or computer readable instructions, the module and computer readable instructions may be described as performing various operations or tasks.
In an embodiment, as shown in fig. 2D, the non-transitory computer-readable medium 1120 may store or otherwise include object detection information 1126, which may be generated by the computing system 1100C (which may be an embodiment of the computing system 1100/1100A/1100B). In an embodiment, the object detection information 1126 may describe one or more objects in the camera field of view of the camera 1200, or more specifically, in the environment of the camera 1200 and/or the robot 1300. For example, the object detection information 1126 may include sensed structure information and/or object type information. The sensed structure information (also referred to as measured structure information) may be or may include information describing the structure of one or more objects (e.g., a global point cloud), where the structure is also referred to as a physical structure or an object structure. The sensed structure information may be based on depth information or other image information sensed by the camera 1200 or another sensing device. In other words, the sensed structure information may be structure information generated based on values (e.g., depth values) sensed or measured by the camera 1200. In an embodiment, the object type information may describe an object type associated with an object in the environment of the camera 1200 and/or the robot 1300. In some cases, the object type may be associated with an object recognition template, as will be discussed below, and the object type information may include or identify that object recognition template.
In an embodiment, the computing system 1100 may access one or more object recognition templates (also referred to as object templates), which may be stored as part of object recognition template information 1128 on the non-transitory computer-readable medium 1120, as shown in fig. 2D, on the non-transitory computer-readable medium 1600, and/or on another device. In some implementations, the one or more object recognition templates may have been manually generated and may have been received (e.g., downloaded) by the computing system 1100/1100A/1100B/1100C via the communication interface 1130 or in some other manner. In some implementations, the one or more object recognition templates may have been generated as part of an object registration process performed by the computing system 1100/1100A/1100B/1100C or another device. Templates are discussed in more detail in U.S. patent application No. 16/991,466 (Atty Dkt. No. MJ0054-US/0077-0012US1) and in U.S. patent application No. 16/991,510 (Atty Dkt. No. MJ0051-US/0077-0011US1), the entire contents of which are incorporated herein by reference.
In an embodiment, each of the one or more object recognition templates (also referred to as one or more object templates) may describe an object design associated with a type or category of object. The object design may include, for example, a visual design that describes or defines the appearance of the object associated with the type or category of the object (also referred to as the object type), and/or a physical design that describes or defines the structure of the object associated with the object type. For example, if the object design described by the object identification template is more specifically a container design associated with a particular container type, the object identification template may be a container template describing, for example, a visual design and/or a physical design associated with that container type. In some implementations, the object recognition template can include visual description information (also referred to as an object appearance description) describing the visual design, and/or can include an object structure description (also referred to as structure description information) describing the physical design.
In some cases, the visual descriptive information may include or describe one or more feature descriptors, which may represent visual features, visual patterns, or other visual details (e.g., logos or pictures) that form the visual design. In some cases, the object structure description may include descriptions of object sizes, object shapes or contours, and/or some other aspect of the structure of the object associated with a particular object type. For example, the object structure description may include values describing an object size (e.g., a combination of length, width, and/or height) associated with the object type, including a computer-aided design (CAD) file describing the object structure associated with the object type, and/or a point cloud describing a contour of the object structure. More specifically, the point cloud may, for example, include a plurality of coordinates describing a plurality of respective locations on one or more surfaces of the object structure. In some implementations, one or more object identification templates described by object identification template information 1128 may be compared, for example, with the sensing structure information discussed above to determine which object identification template best matches the sensing structure information. Such a comparison may be part of an object recognition operation. As discussed in more detail below, object recognition operations may be used to determine an object type associated with an object in the camera field of view of the camera 1200 of fig. 1A-1D. The computing system 1100/1100a/1100B/1100C or another computing system may be configured to use the object type of the object to plan the robot's interaction with the object.
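The contents of an object recognition template can be pictured as a small record holding both the visual description information and the object structure description. The field names and types below are illustrative assumptions; an actual template might instead reference a CAD file or a registered point cloud.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class ObjectRecognitionTemplate:
    object_type: str                        # type/category of object the template describes
    # Visual description information (object appearance description).
    feature_descriptors: List[np.ndarray]   # descriptors for logos, patterns, other visual details
    # Object structure description (physical design).
    dimensions: Tuple[float, float, float]  # (length, width, height) in meters
    model_point_cloud: Optional[np.ndarray] = None  # Nx3 contour of the object structure
    cad_file: Optional[str] = None                  # path to a CAD model, if available
```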
Fig. 3A and 3B illustrate example environments in which object detection and/or robotic interaction may occur. More specifically, the environment includes a computing system 1100, a camera 3200 (which may be an embodiment of camera 1200 of fig. 1A-1D), and a robot 3300 (which may be an embodiment of robot 1300). In this embodiment, the robot 3300 may include a robotic arm 3400 and an end effector apparatus 3500. In embodiments, the end effector device 3500 may form an end of the robotic arm 3400, or be attached to an end of the robotic arm 3400.
In the example of fig. 3A, the robot 3300 may operate via the robotic arm 3400 to move the end effector device 3500 toward one or more objects, such as a stack of bins, crates, or other containers on a pallet. The end effector device 3500 may also be capable of engaging at least one of the one or more objects and moving the object from the pallet to another location (e.g., as part of an unstacking operation). More specifically, fig. 3A and 3B depict an environment with a stack 3710 of objects (or more specifically, a stack of containers). In some cases, as shown in fig. 3B, some or all of these containers may hold smaller objects (which may also be referred to as smaller items). The stack 3710 in fig. 3A and 3B may include at least objects 3711-3719 and 3731-3733, and the end effector device 3500 may be used to grasp or otherwise pick up one of the objects in the stack 3710 (such as object 3711) and move the object from the stack 3710 to a destination location, such as a location on the conveyor 3800 of fig. 3A. To pick up object 3711, the end effector device 3500 may be moved and tilted to align with object 3711. In the environment shown in fig. 3A and 3B, the objects on the pallet may have a physical structure (also referred to more simply as a structure) in which a 3D pattern is formed on at least one of their outer side surfaces. For example, the 3D pattern may be a pattern of ridges protruding from the outer side surface (also referred to as a ridge pattern). As an example, fig. 3A depicts a ridge pattern 3711A on the outside surface of the object 3711. In some cases, the objects on the pallet may have visual details that form a 2D pattern, such as a logo or other visual pattern, on their outside surfaces. In some cases, if the object is a container, the object may include a container rim. As discussed in more detail below, the ridge pattern and/or the container rim may be used to determine a robot interaction location, such as a location where an end effector device (e.g., 3500) of the robot is to grasp the container.
In an embodiment, the end effector device 3500 may include one or more gripper members. For example, the end effector device 3500 may include a mounting plate or other mounting structure, and a plurality of gripper members mounted or otherwise attached to a surface (e.g., a bottom surface) of the mounting structure. In some implementations, the camera 3200 may be mounted on or otherwise attached to an opposing surface (e.g., a top surface) of the mounting structure. In some cases, the plurality of gripper members may include at least a first gripper member movable (e.g., slidable) along a first axis and a second gripper member movable along a second axis perpendicular to the first axis. The first axis may, for example, be parallel to a first edge of the mounting structure, and the second axis may, for example, be parallel to a second edge of the mounting structure. In some cases, the plurality of gripper members may further comprise a third gripper member arranged at a location where the first and second axes intersect. Such a location may be, for example, near a corner of the mounting structure.
In some implementations, each of the gripper members may have a respective gripper body formed by or attached to a respective gripper finger assembly. The gripper finger assembly may be used to grip an object (e.g., a container) by gripping or pinching a portion of the object, such as a portion of a lip forming an outer edge of the container. In an example, the gripper finger assembly may comprise two components, also referred to as gripper fingers, which may be movable relative to each other. The two gripper fingers may form a chuck or a clamp, wherein the two gripper fingers may be moved toward each other to switch to a closed state in which the gripper fingers grip a portion of the object or tighten a grip on the object. The two gripper fingers may also be moved away from each other to switch to an open state in which the gripper fingers release the object or loosen the grip. End effector devices and gripper members are discussed in more detail in U.S. patent application No. 17/084,272 (Atty Dkt. No. MJ0058-US/0077-0014US1), which is incorporated herein by reference in its entirety.
As described above, one aspect of the present application relates to performing object detection, which may involve determining an object type of an object in a camera field of view. The object type may be used to plan the robot's interaction with the object, such as an interaction in which the robot grasps the object and moves it from a current location to a destination location. Fig. 4 depicts a flow diagram of an example method 4000 for performing object detection and/or planning a robot interaction. The method 4000 may be performed by, for example, the computing system 1100 of fig. 2A-2D or fig. 3A, or more specifically by the at least one processing circuit 1110 of the computing system 1100. In some cases, the at least one processing circuit 1110 may perform the method 4000 by executing instructions stored on a non-transitory computer-readable medium, such as the non-transitory computer-readable medium 1120. For example, the instructions may cause the processing circuit 1110 to execute the object detection module 1121 and the robot interaction planning module 1122, which may perform some or all of the steps of the method 4000. In embodiments, the method 4000 may be performed in an environment in which the computing system 1100 is in communication with a robot and a camera (such as the robot 3300 and camera 3200 of fig. 3A and 3B), or with any other robot discussed in this disclosure. For example, the computing system 1100 may perform the method 4000 when an object is currently in, or has been in, the camera field of view of the camera 3200. In some cases, as shown in fig. 3A, a camera (e.g., 3200) may be mounted on an end effector device (e.g., 3500) of a robot (e.g., 3300). In other cases, the camera may be mounted elsewhere and/or may be stationary.
In an embodiment, the method 4000 of fig. 4 may begin at step 4002, or otherwise include step 4002, in which the computing system 1100 receives (e.g., via the object detection module 1121) first image information (also referred to as a first set of image information) representing at least a first portion of an object structure associated with an object in the field of view of the camera (also referred to as the camera field of view). For example, fig. 5A depicts a case in which a group 3720 of objects 3721, 3722 is in a camera field of view 3202 of the camera 3200. Each of the objects 3721, 3722 may be, for example, a bin, crate, box, case, or other container. The group 3720 of objects 3721, 3722 may be disposed on another object 3728 (e.g., a pallet), which may also be at least partially within the camera field of view 3202. In some cases, the pallet 3728 may be used to stack or otherwise arrange containers or other objects that may have a variety of sizes (e.g., a variety of length, width, and height values) and a variety of stacking or placement configurations.
In embodiments, the first image information received by the computing system 1100 may be generated by a camera (e.g., 3200) when the camera is in or has a first camera pose (such as the camera pose shown in fig. 5A). A camera pose may refer to the position and orientation of a camera (e.g., 3200). In some cases, the camera pose may affect the perspective or viewpoint of the camera (e.g., 3200). For example, the first camera pose depicted in fig. 5A may involve the camera 3200 having a position above the group 3720 of objects 3721, 3722 and having an orientation in which the camera 3200 is directed at a first portion (or more specifically, a top portion, e.g., a top surface) of the objects 3721, 3722. In some cases, the orientation of the first camera pose may provide the camera 3200 with a top view of the objects. For example, the first camera pose may involve the camera 3200 having an orientation in which its image sensor directly faces the tops of the objects 3721, 3722 and/or an orientation in which the focal axis of one or more lenses of the camera 3200 is vertical or substantially vertical. In some cases, the camera 3200 may be arranged directly above the objects 3721, 3722 and may be directed at a first portion (e.g., the top) of the objects 3721, 3722.
In an embodiment, the first image information of step 4002 may be used in a coarse detection stage, in which the computing system 1100 may determine a relatively incomplete or simpler description or estimate of the object structure. For example, the description or estimate of the object structure may be incomplete because the first image information may describe a first portion of the object structure (e.g., a top portion) but may not describe other portions of the object structure (e.g., side portions), or may do so in only a limited manner. In some cases, the coarse detection stage may also involve positioning the camera (e.g., 3200) sufficiently far from the object structure that the entire object structure fits within the camera field of view. In such a case, the first image information generated by the camera may not be as detailed as image information generated when the camera is closer to the object structure. Thus, the estimate or description of the object structure derived from the first image information may be coarser in its level of detail. As discussed in more detail below, this estimate or description may be, for example, a global point cloud or some other sensed structure information. The sensed structure information generated during the coarse detection stage may be used to identify an object corner of the object structure and to receive second image information associated with that object corner. In some implementations, the second image information may be more detailed and/or may supplement the first image information. Thus, the second image information may be used to refine the description or estimate of the object structure.
As discussed in more detail below, the steps of the method 4000 may be performed to facilitate robot interaction with individual objects (e.g., object 3722) in the group 3720 of objects 3721, 3722. In such a case, a specific object that is the target of the robot interaction may be referred to as a target object. In some cases, the steps of method 4000 (e.g., steps 4004-4016) may be performed multiple times or over multiple iterations in order to facilitate robot interaction with multiple target objects.
As described above, the first image information may represent a particular view of the group 3720 of objects 3721, 3722, or more specifically, a particular view of each of the objects 3721, 3722. In the example of fig. 5A, the first image information may represent a top view of the objects 3721, 3722, because the first image information may be generated when the camera 3200 has the first camera pose shown in fig. 5A, in which the camera 3200 is over the objects 3721, 3722 and is directed at a top portion (e.g., a top surface or top side) of each of the objects 3721, 3722.
In the example of fig. 5A, the objects 3721, 3722 may be crates or other open containers having one or more walls that surround the bottom interior surface of the container. The one or more walls may form a rim at a top end of the container. In such an example, the top view of object 3721/3722 may include a view of a surface of the rim of object 3721/3722 (also referred to as a rim surface). For example, fig. 5B shows an example in which the first image information comprises 3D image information 5720 (also referred to as spatial structure information) describing the structure of the group 3720 of objects 3721, 3722. In such embodiments, the camera 3200 that generates the 3D image information 5720 may be a 3D camera. The 3D image information 5720 in this example may describe the object structure of the object 3721 and the object structure of the object 3722, and more particularly may represent a top view of the object structures of the objects 3721, 3722.
In an embodiment, the 3D image information 5720 may include depth information, such as a depth map describing respective depth values of one or more portions of the object structures of the objects 3721, 3722 relative to a reference point, such as a point at which the camera (e.g., 3200) was located when the camera generated the 3D image information 5720 or other image information used in step 4002. More specifically, the depth information may describe respective depth values for a plurality of locations (also referred to as points) on one or more surfaces of the object structure of the object 3721 and/or the object structure of the object 3722. In the example of fig. 5B, the 3D image information 5720 may include image portions 5721, 5722, and 5728 that describe depth values of objects 3721, 3722, and 3728, respectively. More specifically, the image portion 5728 may include respective depth values corresponding to locations 3728₁ to 3728ₙ on a top surface of the object 3728 (e.g., a pallet). Further, in this example, the object 3721 may be a container having a rim and a bottom interior surface. Image portion 5721 may include respective depth values corresponding to locations 3721A₁ to 3721Aₙ on a surface of the rim of the object 3721 (also referred to as a rim surface), and respective depth values corresponding to locations 3721B₁ to 3721Bₙ on the bottom interior surface of the object 3721. Similarly, image portion 5722 may include respective depth values corresponding to locations 3722A₁ to 3722Aₙ on the rim surface of the object 3722, and respective depth values corresponding to locations 3722B₁ to 3722Bₙ on the bottom interior surface of the object 3722.
In some cases, if object 3721/3722 is a container containing one or more other items, such items may also be represented in the 3D image information or other image information. For example, the 3D image information 5720 of fig. 5B may include image portions 5723, 5724 that describe respective depth values for locations on two respective items contained within the object 3722. More specifically, the image portion 5723 may include respective depth values for locations 3723₁ to 3723ₙ on one of the items, and the image portion 5724 may include respective depth values for locations 3724₁ to 3724ₙ on the other item.
In some cases, the first image information may describe the respective depth values in a depth map that may include an array of pixels corresponding to, for example, a grid of locations on the surface of one or more objects in the field of view (e.g., 3202) of the camera. In such a case, some or all of the pixels may each include a respective depth value corresponding to a respective location of the pixel, where the respective location is on one or more object surfaces in the camera field of view.
In some cases, the first image information may describe the respective depth values by a plurality of 3D coordinates that may describe various locations on the surface of one or more objects. For example, the 3D coordinates may describe, in fig. 5B, locations 3728₁ to 3728ₙ, locations 3721A₁ to 3721Aₙ, locations 3721B₁ to 3721Bₙ, locations 3722A₁ to 3722Aₙ, locations 3722B₁ to 3722Bₙ, locations 3723₁ to 3723ₙ, and locations 3724₁ to 3724ₙ. The plurality of 3D coordinates may, for example, form a point cloud, or a portion of a point cloud, that describes at least a portion of an object structure (such as the tops of the object structures of objects 3721, 3722, 3723, 3724, and 3728). The 3D coordinates may be expressed in a camera coordinate system or in some other coordinate system. In some cases, the depth value for a particular location may be represented by, or based on, a component of the 3D coordinate of that location. As an example, if the 3D coordinate of the location is an [X Y Z] coordinate, then the depth value for the location may be equal to, or based on, the Z-component of that coordinate.
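As a minimal illustration of the relationship described above, the sketch below (Python, assuming the coordinates are already expressed in a camera coordinate system whose Z axis runs along the camera's optical axis; the function and array names are hypothetical) shows how per-location depth values could be read off a point cloud such as the one formed by the coordinates above.

import numpy as np

def depth_values_from_point_cloud(points_camera_frame):
    """Return per-point depth values for a point cloud expressed in a
    camera coordinate system whose Z axis points along the optical axis.

    points_camera_frame: (N, 3) array of [X, Y, Z] coordinates, e.g. the
    locations 3721A1..3721An or 3722B1..3722Bn discussed above.
    """
    points = np.asarray(points_camera_frame, dtype=float)
    # The depth value of each location is equal to (or based on) the
    # Z-component of its 3D coordinate, as described in the text.
    return points[:, 2]

# Example: three hypothetical surface locations in the camera frame.
cloud = np.array([[0.10, 0.05, 1.20],
                  [0.11, 0.05, 1.21],
                  [0.10, 0.06, 0.95]])
print(depth_values_from_point_cloud(cloud))  # -> [1.2, 1.21, 0.95]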
In the example of fig. 5B, the first image information may represent a bottom interior surface of the object structure. More specifically, the first image information depicted in fig. 5B includes the 3D image information 5720, which includes respective depth values or coordinates for locations 3721B₁ to 3721Bₙ on the bottom interior surface of the object 3721 and for locations 3722B₁ to 3722Bₙ on the bottom interior surface of the object 3722. In another example, the bottom interior surface of the object structure of the object (e.g., 3721/3722) may not be represented by the first image information, because the bottom interior surface may be completely covered or otherwise obscured from view. In such an example, if the object (e.g., 3721/3722) is a container, the bottom interior surface of the container may be completely covered by the contents of the container (such as materials or items disposed within the container) and/or may be completely covered by a lid, flap, or other means for closing the container. In such an example, the first image information may describe respective depth values or coordinates for locations on one or more surfaces of the material or items within the container, or for locations on the lid or flap.
In an embodiment, the first image information received in step 4002 may describe the visual appearance of the group 3720 of objects 3721, 3722. For example, fig. 5C provides an example in which the first image information includes or forms a 2D image 5730 (e.g., a grayscale or color image), the image 5730 including an image portion 5731 (e.g., a pixel region) describing an appearance of the object 3721 of fig. 5A, an image portion 5732 describing an appearance of the object 3722, and an image portion 5738 describing an appearance of the object 3728. More specifically, the image 5730 may describe the appearance of the objects 3721, 3722 and the object 3728 from the viewpoint of the camera 3200 of fig. 5A, and more specifically may represent a top view of the objects 3721, 3722. As described above, the 2D image 5730 may be generated by the camera 3200 when the camera 3200 has the first camera pose depicted in fig. 5A. More specifically, the 2D image 5730 may represent visual detail(s) on one or more surfaces of object 3721/3722. For example, the image portion 5731 of the 2D image 5730 may more specifically include an image portion 5731A representing a first surface (e.g., a rim surface) of the object 3721, and include an image portion 5731B representing a second surface (e.g., a bottom interior surface) of the object 3721. Similarly, image portion 5732 can include an image portion 5732A representing a first surface (e.g., a rim surface) of object 3722 of fig. 5A and an image portion 5732B representing a second surface (e.g., a bottom interior surface) of object 3722. In another example, if the objects 3721, 3722 are containers filled with items or material, as discussed above, the image portions 5731, 5732 may describe the appearance of the items or material disposed within the containers.
Returning to fig. 4, in an embodiment, method 4000 may include step 4004, where computing system 1100 generates or updates sensed structure information representative of an object structure associated with an object (e.g., 3721/3722) in the camera field of view (e.g., 3202) based on the first image information (e.g., via object detection module 1121). As described above, the sensed structure information (also referred to as measurement structure information) may be or may include information describing or otherwise representing an object structure associated with the object (such as an object structure of object 3721/3722). For example, the sensed structure information may be a global point cloud comprising a plurality of coordinates describing locations on one or more surfaces of object 3721 and/or a plurality of coordinates describing locations on one or more surfaces of object 3722. In some implementations, the computing system 1100 may generate the sensed structure information by incorporating the first image information or image portion(s) thereof into the sensed structure information, such that the sensed structure information includes values from the first image information. As an example, fig. 6 depicts sensed structure information 6720 generated by the computing system. The sensed structure information 6720 may include values (such as depth values or coordinates) in the first image information 5720. More specifically, the sensed structure information 6720 may be a point cloud including coordinates of the locations represented in the image portions 5721, 5722, 5723, and 5724 of the first image information 5720. In other words, the sensed structure information 6720 may directly incorporate the image portions 5721 to 5724 of the first image information 5720. These image portions may describe at least a portion of the object structures of objects 3721, 3722, 3723, and 3724. For example, the image portion 5722 may describe the rim of the object structure of the object 3722 and describe at least a portion of the bottom interior surface of the object structure of the object 3722. As discussed above, the computing system 1100 may store the sensed structure information (e.g., 6720) as part of the object detection information 1126 in the non-transitory computer-readable medium 1120.
In an embodiment, if the sensing structure information already includes values describing a first portion of the object structure at the beginning of or before step 4004, the computing system 1100 may update the sensing structure information based on the values in the first image information (e.g., 5720). For example, the sensing structure information may be generated based on multiple sets of image information, all representing top views of one or more objects in the camera field of view (e.g., 3202). The sets of image information may have been generated by the camera (e.g., 3200) at different respective locations (as the camera is moved laterally), but with the same or similar camera orientations, such as an orientation in which the image sensor of the camera directly faces respective regions on the tops of the one or more objects. In this example, the first image information may be one of the multiple sets of image information. If the sensing structure information already includes coordinates obtained from another of the sets of image information at the beginning of step 4004, the computing system 1100 may update the sensing structure information to incorporate the coordinates obtained from the first image information (e.g., 5720). Accordingly, the computing system 1100 may include the set of coordinates obtained from the first image information as a new part of the sensing structure information. In this way, the sensing structure information can serve as a combined set of image information in which the multiple sets of image information discussed above are combined. In some cases, the computing system 1100 may generate the combined set of image information by merging the sets of image information discussed above, such as where the sets of image information represent overlapping regions of the tops of one or more objects. Such a merging operation may involve, for example, adjusting one or more existing values (e.g., depth values or coordinates) of the sensing structure information based on values in the first image information. In some cases, the merging operation may involve discarding duplicate values (e.g., coordinates or depth values) described in more than one of the sets of image information discussed above.
As described above, the sensing structure information may be generated or updated based on image information representing a particular viewpoint (such as a top view of one or more objects). As discussed in more detail below with respect to step 4012, the sensing structure information may be updated based on image information representing another viewpoint (such as a perspective viewpoint). Because the sensing structure information may be updated to merge or reflect values from multiple sets of image information, it may serve as global structure information, that is, a combined set of image information that merges multiple sets of image information which may be associated with multiple camera poses. Thus, if the sensing structure information is or includes a point cloud or depth map, the point cloud or depth map may be a global point cloud or global depth map that is updated during method 4000 to merge values from the multiple sets of image information.
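A minimal sketch of how such a combined set of image information (e.g., a global point cloud) might be maintained is given below, assuming Python with NumPy; the voxel-based rule for discarding duplicate values is an illustrative assumption rather than something mandated by the method.

import numpy as np

def merge_into_global_cloud(global_cloud, new_coords, voxel_size=0.002):
    """Incorporate coordinates from a new set of image information into the
    sensed structure information (global point cloud).

    Duplicate locations described by more than one set of image information
    are discarded by quantizing coordinates to a voxel grid (assumed rule).
    """
    merged = np.vstack([global_cloud, np.asarray(new_coords, dtype=float)])
    # Quantize to a coarse grid so near-identical coordinates collapse
    # to a single entry, then keep one representative per grid cell.
    keys = np.round(merged / voxel_size).astype(np.int64)
    _, unique_idx = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(unique_idx)]

# Usage: start with coordinates from a first set of image information,
# then merge a second, partially overlapping set.
global_cloud = np.array([[0.0, 0.0, 1.0], [0.0, 0.01, 1.0]])
second_set   = np.array([[0.0, 0.01, 1.0], [0.0, 0.02, 1.0]])
global_cloud = merge_into_global_cloud(global_cloud, second_set)
print(global_cloud.shape)  # (3, 3): the duplicate location was discarded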
Returning to fig. 4, in an embodiment, method 4000 may include step 4006, where computing system 1100 identifies an object corner (also referred to as an object angle) associated with an object structure. For example, the object corner may be an outer corner of the object structure of object 3722 in FIG. 5A. In some implementations, the computing system 1100 can identify an object corner based on the sensed structure information (e.g., 6720). For example, the computing system 1100 may identify a plurality of outer edges described by the sensed structure information, or a plurality of edge regions described by the sensed structure information. In such an example, the computing system 1100 may identify the object corner as a location at or near the intersection of the plurality of outer edges, and/or as a location in an area where the plurality of edge regions intersect.
In one example, the computing system 1100 can identify the edge of the object structure by, for example, identifying a set of outermost locations among the locations described by the sensed structure information, where the set of locations can approximate a portion of the contour of the object structure (e.g., of the object 3722). In some cases, the computing system 1100 may estimate or otherwise identify an edge as a line fitting the set of outermost positions. In some implementations, the computing system 1100 can identify an edge region of the object structure as a location region that includes the set of outermost locations.
As an example, fig. 7 shows the computing system 1100 identifying a first edge region 7001 and a second edge region 7002 described or otherwise represented by the sensing structure information 6720. First edge region 7001 can be, for example, a strip or band of locations that represents a portion of the rim of object 3722, where the rim forms a set of edges of the object structure of object 3722. Similarly, second edge region 7002 can be, for example, another strip or band of locations on another portion of the rim of the object. In this example, computing system 1100 may identify an object corner 3722C₁ of object 3722 based on the intersection of edge regions 7001, 7002. More specifically, the computing system 1100 may determine the object corner 3722C₁ as a location in an intersection region, which may be a region where the edge regions 7001, 7002 overlap or otherwise intersect. In some implementations, each of the edge regions 7001, 7002 may be identified as a respective group of locations that is described in the sensing structure information 6720 and has the same or substantially the same respective depth values. In such implementations, the computing system 1100 may determine each of the edge regions 7001, 7002 as a respective 2D plane that fits a corresponding set of locations having substantially the same depth values, or the same Z-component in their 3D coordinates. In some cases, the computing system 1100 may identify a convex corner of the object structure as the object corner of step 4006. A convex corner may be, for example, a corner at which two orthogonal edges of an object structure meet. Convex corners are discussed in more detail in U.S. patent application No. 16/578,900 (MJ0037-US/0077-0006US1), the entire contents of which are incorporated herein by reference.
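One way an object corner such as 3722C1 could be estimated from two identified edge regions is sketched below, under the assumption that each edge region has already been reduced to a set of 2D locations (e.g., the X and Y components of rim coordinates having substantially the same depth value) and that a straight line is fitted to each region; the function names are illustrative.

import numpy as np

def fit_line(points_2d):
    """Fit a 2D line to an edge region (e.g., a strip of rim locations).
    Returns a point on the line and a unit direction vector."""
    pts = np.asarray(points_2d, dtype=float)
    centroid = pts.mean(axis=0)
    # Principal direction of the strip via SVD of the centered points.
    _, _, vt = np.linalg.svd(pts - centroid)
    return centroid, vt[0]

def corner_from_edge_regions(region_a, region_b):
    """Estimate an object corner (e.g., 3722C1) as the intersection of the
    lines fitted to two edge regions (e.g., 7001 and 7002)."""
    p1, d1 = fit_line(region_a)
    p2, d2 = fit_line(region_b)
    # Solve p1 + t*d1 == p2 + s*d2 for t (a 2x2 linear system).
    a = np.column_stack([d1, -d2])
    t, _ = np.linalg.solve(a, p2 - p1)
    return p1 + t * d1

# Example: one edge region along X, one along Y; the corner is near the origin.
region_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])
region_b = np.array([[0.0, 0.0], [0.0, 0.1], [0.0, 0.2]])
print(corner_from_edge_regions(region_a, region_b))  # approximately [0, 0]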
In an embodiment, the computing system 1100 may, in step 4006, identify a particular object corner based on the accessibility or visibility of that object corner. In such embodiments, the sensed structure information may describe a plurality of object corners of the object structure. For example, the sensed structure information 6720 in fig. 7 may describe a plurality of object corners 3722C₁ to 3722C₄ of the object structure of the object 3722. More specifically, the object corners 3722C₁ to 3722C₄ may be corners of the rim of object 3722. In such a case, computing system 1100 may be configured to select one of the object corners (e.g., 3722C₁) from the plurality of object corners 3722C₁ to 3722C₄. The selection may be based on at least one of: (i) respective accessibility levels for the plurality of object corners 3722C₁ to 3722C₄, or (ii) respective occlusion levels for the plurality of object corners 3722C₁ to 3722C₄.
In an embodiment, the level of accessibility to an object corner may refer to how accessible the object corner is for a robotic interaction with a robotic arm (e.g., 3400) or, more specifically, with an end effector device (e.g., 3500) formed or disposed at one end of the robotic arm. For example, if a robotic interaction involves an end effector device (e.g., 3500) reaching a particular object corner of an object (e.g., 3721/3722) and grasping the object at that object corner, the accessibility level of that object corner may be affected by, for example, whether there are other objects in the robot (e.g., 3300) environment that would physically obstruct the end effector device (e.g., 3500) from reaching that object corner. Such an obstructing object may comprise, for example, another object (e.g., another container) arranged directly above a corner of the object.
In an embodiment, the level of occlusion of an object corner may refer to the extent to which the object corner can be sensed by the camera (e.g., 3200), and more specifically to the level of visibility of the object corner to the camera (e.g., 3200). The visibility level may be affected by whether the line of sight from the object corner to the camera (e.g., 3200) is blocked or otherwise obstructed by another object. Occlusion may occur when the camera (e.g., 3200) is in the first camera pose as discussed above, and/or when the camera is in the second camera pose as will be discussed below. In an embodiment, the computing system 1100 may, in step 4006, select the object corner associated with the highest accessibility level and/or the lowest occlusion level from the plurality of object corners. If multiple object corners are associated with the same or substantially the same accessibility or occlusion levels, the computing system 1100 may randomly select one of those object corners.
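A simple sketch of the corner-selection rule described above is shown below; the dictionary keys 'accessibility' and 'occlusion' are hypothetical placeholders for however the respective levels are actually quantified.

import random

def select_object_corner(corners):
    """Select an object corner from candidates such as 3722C1 through 3722C4.

    Each candidate is assumed to be a dict with hypothetical keys
    'accessibility' (higher is better) and 'occlusion' (lower is better);
    how those levels are computed is outside this sketch.
    """
    best_key = lambda c: (c['accessibility'], -c['occlusion'])
    best = max(best_key(c) for c in corners)
    # Corners tied for the best accessibility/occlusion levels: pick randomly.
    tied = [c for c in corners if best_key(c) == best]
    return random.choice(tied)

corners = [
    {'name': '3722C1', 'accessibility': 0.9, 'occlusion': 0.1},
    {'name': '3722C2', 'accessibility': 0.9, 'occlusion': 0.4},
    {'name': '3722C3', 'accessibility': 0.5, 'occlusion': 0.1},
]
print(select_object_corner(corners)['name'])  # '3722C1'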
Referring back to fig. 4, in an embodiment, method 4000 may include step 4008, where computing system 1100 positions the camera (e.g., via one or more camera placement movement commands) to have a second camera pose in which the camera (e.g., 3200) is pointed at the object corner identified in step 4006. In this embodiment, the second camera pose of step 4008 may be different from the first camera pose associated with step 4002. As described above, a camera pose may be a combination of camera (e.g., 3200) position and orientation. Fig. 8A depicts an example in which the camera 3200 has a second camera pose in which the camera 3200 is pointed at an object corner 3722C₁ of an object 3722. In an embodiment, the second camera pose may be a camera pose in which the camera (e.g., 3200) has a perspective view of an object (e.g., 3722). More specifically, when the camera 3200 has the first camera pose shown in fig. 5A, at least a portion of the object 3721/3722, such as a side portion (e.g., an outside surface), may not be within the line of sight of the camera 3200, or more specifically, may not be within the line of sight of an image sensor within the camera 3200. When the camera 3200 has the second camera pose, that portion (e.g., the side portion) of the object 3721/3722, along with the object corner 3722C₁, may come within the camera field of view 3202 of the camera 3200 and may be within the line of sight of the image sensor of the camera 3200.
In an embodiment, when the camera (e.g., 3200) has the first camera pose associated with the first image information, the camera may have a first distance from an object (e.g., 3722) in the camera field of view (e.g., 3202). For example, as part of step 4002, the computing system 1100 may generate a first set of one or more camera placement movement commands to cause the robotic arm 3400 of fig. 5A to move or otherwise position the camera 3200 to a camera pose in which the camera is disposed directly above the object (e.g., 3722) and has a predetermined first distance from the object (e.g., 3722) or, more particularly, from the top of the object (e.g., a rim surface). In some cases, the first distance may be sufficiently far from the object (e.g., 3722) to allow the camera field of view (e.g., 3202) to encompass the entire top of the object, or more specifically, multiple object corners of the object. As a result, when the camera (e.g., 3200) generates the first image information while having the first camera pose, the first image information may represent the entire top of the object (e.g., 3722), including multiple object corners (e.g., 3722C₁ to 3722C₄) of the object. Such image information may facilitate the ability of the computing system 1100 to identify an object corner in step 4006. However, if the first distance has a large value, the resulting image information may not be as detailed as image information associated with a closer distance. Thus, in one example, the second camera pose associated with the second image information in step 4010 may involve positioning the camera (e.g., 3200) closer to the object (e.g., 3722), or more specifically, to an object corner (e.g., 3722C₁) of the object (e.g., 3722). More specifically, the computing system 1100 may generate a second set of one or more camera placement movement commands to cause the robotic arm (e.g., 3400) to move the camera (e.g., 3200) to the second camera pose. The second set of one or more camera placement movement commands may cause the robotic arm (e.g., 3400) to position the camera (e.g., 3200) so that it has a second distance from the object corner (e.g., 3722C₁), wherein the second distance may be less than the first distance. The smaller distance may allow the second image information to capture or otherwise represent the object structure of the object (e.g., 3722) at a higher level of detail relative to the first image information. Thus, the second image information may be used to refine the description or estimation of the object structure.
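For illustration, the sketch below shows one way a second camera pose could be computed so that the camera sits at a smaller second distance from the selected object corner and is pointed at it; representing the pose as a position plus a rotation matrix, and the particular approach direction, are assumptions made only for this example.

import numpy as np

def look_at_pose(corner_xyz, approach_dir, distance):
    """Compute a camera pose (position, rotation matrix) that places the
    camera at `distance` from an object corner and points it at the corner.

    approach_dir is an assumed unit vector from the corner toward where the
    camera should sit (e.g., up and outward over the corner); the actual
    system may choose this direction differently.
    """
    corner = np.asarray(corner_xyz, dtype=float)
    approach = np.asarray(approach_dir, dtype=float)
    approach = approach / np.linalg.norm(approach)

    position = corner + distance * approach
    z_axis = -approach                      # camera optical axis toward the corner
    ref = np.array([0.0, 0.0, 1.0])
    if abs(np.dot(ref, z_axis)) > 0.99:     # avoid a degenerate cross product
        ref = np.array([0.0, 1.0, 0.0])
    x_axis = np.cross(ref, z_axis)
    x_axis = x_axis / np.linalg.norm(x_axis)
    y_axis = np.cross(z_axis, x_axis)
    rotation = np.column_stack([x_axis, y_axis, z_axis])
    return position, rotation

# Second camera pose: closer to corner 3722C1 than the first pose was.
pos, rot = look_at_pose(corner_xyz=[0.6, 0.4, 0.0],
                        approach_dir=[1.0, 1.0, 1.0],
                        distance=0.5)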
Returning to FIG. 4, in an embodiment, method 4000 may include step 4010, in which computing system 1100 receives second image information (also referred to as a second set of image information) representing an object structure (such as the object structure of object 3722/3721 of FIG. 8A). The second image information may be generated by the camera (e.g., 3200) when the camera has the second camera pose, such as the camera pose shown in fig. 8A. As described above, in some cases, the second camera pose may provide the camera (e.g., 3200) with a perspective view of an object (e.g., 3722). In such a case, the second image information may represent a perspective view of the object.
In some implementations, the second image information may be or may include 3D image information. As an example, fig. 8B depicts 3D image information 8720 that may form or be part of the second image information. The 3D image information 8720 may be or may include, for example, a point cloud generated by the camera 3200 when the camera 3200 has the second camera pose as shown in fig. 8A. Like the first image information, the second image information may include values, such as depth values or coordinates, for locations on one or more surfaces of various object structures. More specifically, the second image information may include image portions 8721, 8722, 8723, and 8724, which may represent respective object structures of the objects 3721, 3722, 3723, and 3724 in fig. 8A. In one example, if the object represented by the second image information has a pattern of ridges, or more generally a plurality of physical ridges protruding from one or more outer side surfaces of the object, the 3D image information 8720 may describe or otherwise represent the plurality of ridges.
In some implementations, the second image information may be or may include 2D image information. For example, fig. 8C shows a 2D image 8730 that may form part of the second image information. In this example, the 2D image 8730 may comprise at least image portions 8731, 8732, which represent respective appearances of objects 3721, 3722 from the perspective or viewpoint of the camera 3200 when the camera 3200 has the second camera pose.
In an embodiment, as described above, the first image information received in step 4002 may be associated with a camera (e.g., 3200) having a first distance from an object (e.g., 3722), and the second image information received in step 4010 may be associated with a camera (e.g., 3200) having a second distance from an object (e.g., 3722), wherein the second distance may be less than the first distance. In this embodiment, a first camera pose may be associated with a first distance between the camera and the object or a portion thereof (e.g., the top of object 3722), while a second camera pose may be associated with a second distance between the camera and the object or a portion thereof (e.g., the corners of object 3722), where the second distance is less than the first distance. As described above, a larger value of the first distance may cause the first image information to have a lower level of detail that may be adequate for performing a coarse detection phase involving identification of object corners, but may be inadequate for determining an object type associated with the object (e.g., 3722). The second image information may provide a higher level of detail because the second image information is associated with a closer distance between the camera and the object (e.g., 3722). Furthermore, as described above, the second image information may represent a portion of the object structure that is not represented or only partially represented in the first image information. Thus, the second image information may enhance the ability of the computing system 1100 to accurately determine the object type of the object (e.g., 3722), as will be discussed below.
Returning to fig. 4, in an embodiment, method 4000 may include step 4012, wherein computing system 1100 updates the sensing structure information based on the second image information. After it is updated, the sensing structure information may be referred to as updated sensing structure information. In some implementations, step 4012 may involve incorporating values (e.g., depth values or coordinates) from the second image information into the sensing structure information (e.g., 6720). If the sensing structure information generated or updated in step 4004 includes values from the first image information, step 4012 may generate updated sensing structure information that combines the first image information and the second image information by including values from both. For example, fig. 6 shows sensed structure information 6720, which may be a global point cloud that incorporates or otherwise includes coordinates described by the first image information 5720. In this example, the sensed structure information 6720 may represent a portion of the object structure of objects 3722 and/or 3721, or more specifically, the rim and bottom interior surface of the object structure of objects 3722 and/or 3721. In this example, the computing system 1100 may update the sensing structure information 6720 by updating the global point cloud to incorporate the coordinates described by the second image information 8720 of fig. 8B. The sensing structure information 6720 may be updated to produce the updated sensing structure information 9720 of fig. 9, which may be, for example, an updated version of the global point cloud including a plurality of coordinates representing an object structure associated with an object (e.g., 3721/3722). The plurality of coordinates of the updated version of the global point cloud may combine or otherwise merge the coordinates described by the first image information 5720 and the coordinates described by the second image information 8720. As described above, the second image information may in some cases represent a perspective view of an object (e.g., 3721/3722). The perspective view may allow the second image information to represent at least the side portion(s) of the object structure of the object (e.g., 3721/3722). Because the updated sensing structure information (e.g., 9720) incorporates the second image information, the updated sensing structure information (e.g., 9720) may also represent the side portion(s) of the object structure. If an object (e.g., 3721/3722) in the camera field of view (e.g., 3202) has a ridge pattern on one or more outside surfaces of the object, the updated sensed structure information (e.g., 9720) may describe or otherwise represent the ridge pattern.
Returning to fig. 4, in an embodiment, method 4000 may include step 4014, in which computing system 1100 determines an object type associated with an object (e.g., 3722) in the camera field of view, where the determination may be based on the updated sensed structure information (e.g., 9720). For example, if objects 3721, 3722 are containers, step 4014 may involve determining a container type associated with object 3721 and/or a container type associated with object 3722. In embodiments, the object type may be associated with a particular object design, which may include a physical design and/or a visual design. In this embodiment, a physical design may refer to, for example, a physical structure (also referred to as an object structure) belonging to or otherwise associated with the object type. The physical structure may be characterized by an object shape, an object size, and/or by physical features (e.g., a ridge pattern) disposed on a surface of an object associated with the object type.
In an embodiment, the object type may be associated with an object identification template (such as a template described by the object identification template information 1128 of fig. 2D). In one example, the object identification template may be a container template describing a container design (or, more specifically, describing a visual design and/or a physical design of a container type). If object identification template information 1128 describes multiple object identification templates, the multiple object identification templates may be associated with different object types, respectively. For example, FIG. 10A depicts object identification template information 9128 (which may be an embodiment of object identification template information 1128) describing object recognition templates 9128A₁, 9128A₂, and 9128A₃. In this example, the object recognition templates 9128A₁, 9128A₂, and 9128A₃ may be associated with three different respective object types, namely container type 1, container type 2, and container type 3, respectively. The object recognition templates 9128A₁, 9128A₂, and 9128A₃ stored or otherwise described by the template information 9128 may be used to populate a candidate set or, more specifically, a template candidate set. The template candidate set may represent a set of candidate object recognition templates, which are candidates for potentially matching an object in the camera field of view (e.g., 3722), or more specifically, for potentially matching the updated sensed structure information. As discussed below, the computing system 1100 may compare the updated sensed structure information (e.g., a global point cloud) to these candidate templates to determine whether any object recognition template matches the updated sensed structure information, and/or to determine which object recognition template provides the best match.
In some implementations, some or all of the object recognition templates (e.g., 9128A₁, 9128A₂, and 9128A₃) may each include a corresponding object structure description (also referred to as structure description information). The object structure description of an object recognition template may describe the physical design, or more specifically the object structure, of the object type associated with the object recognition template. In some cases, the object structure description may include a CAD file that describes the object structure. In some cases, the object structure description may include a point cloud (also referred to as a template point cloud) that describes the contour of the object structure, such as by describing edges, surfaces, ridge patterns, or other physical features that form the object structure. In an embodiment, a set of object recognition templates (e.g., 9128A₁, 9128A₂, and 9128A₃) may describe a set of object structure models, which may describe respective object shapes, physical designs, or general object structures associated with respective container types. For example, if the object structure description in an object recognition template comprises a CAD file, the object structure model associated with that object recognition template may be the CAD model described by the CAD file. FIG. 10A provides an example in which the object recognition templates 9128A₁ to 9128A₃ respectively describe a set of three object structure models in their object structure descriptions.
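One hypothetical way the object identification template information could be organized in memory is sketched below; the field names (container_type, structure_model_points, visual_description) are illustrative and not taken from the text.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectRecognitionTemplate:
    """Illustrative container template, e.g. 9128A1 / 9128A2 / 9128A3."""
    container_type: str                 # e.g. "container type 1"
    # Object structure description: here a template point cloud, assumed to
    # have been sampled from a CAD model of the container structure.
    structure_model_points: List[Tuple[float, float, float]] = field(default_factory=list)
    # Optional visual design description (e.g. surface pattern features).
    visual_description: dict = field(default_factory=dict)

template_candidate_set = [
    ObjectRecognitionTemplate("container type 1"),
    ObjectRecognitionTemplate("container type 2"),
    ObjectRecognitionTemplate("container type 3"),
]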
In embodiments, the object structure description in an object recognition template may include a direct description of a portion of the object structure, while a direct description of the remaining portion of the object structure may be omitted, as the remaining portion may have the same or substantially the same structural details as the portion directly described by the object structure description. For example, FIG. 10B illustrates object identification template information 9128 describing object recognition templates 9128B₁, 9128B₂, and 9128B₃, which may also be associated with container type 1, container type 2, and container type 3, respectively, and may describe the corresponding container structures associated with container types 1 through 3. In the example of FIG. 10B, the object recognition templates 9128B₁, 9128B₂, and 9128B₃ may each have an object structure description that directly describes structural details of two vertical sides of the respective container structure, while omitting a direct description of the two remaining vertical sides of that container structure. A direct description of the two remaining vertical sides may be omitted because their structural details may be the same or substantially the same as those already described by the object structure description. In other words, the object structure description may indirectly describe the two remaining vertical sides of the respective container structure.
As described above, the object recognition templates (e.g., 9128B₁ to 9128B₃) stored in the computing system 1100 or elsewhere may be used to populate a template candidate set, which may be a set of object identification templates, wherein the object identification templates in the set may describe object structures associated with different object types (e.g., container types 1, 2, and 3). In an embodiment, determining the object type associated with the object (e.g., 3722) may involve performing a comparison between the updated sensed structure information (e.g., the global point cloud) of step 4012 and the object recognition templates in the template candidate set. By way of example, fig. 11A illustrates a comparison between the updated sensed structure information 9720 and a template candidate set that includes object recognition templates 9728A₁ to 9728A₃; the updated sensed structure information 9720 may represent the object structure of object 3722 and the object structure of object 3721 of figure 8A. Similarly, fig. 11B illustrates a comparison between the updated sensed structure information 9720 and a template candidate set that includes object recognition templates 9728B₁ to 9728B₃.
In an embodiment, the comparison discussed above may be used to determine, for each object recognition template in the template candidate set (e.g., 9728A₁ to 9728A₃ or 9728B₁ to 9728B₃), a corresponding degree of matching with the updated sensed structure information. The comparison may indicate the extent to which each of the object recognition templates is supported or interpreted by the updated sensed structure information (e.g., the global point cloud). In one example, computing system 1100 may select an object recognition template (e.g., 9728A₃ or 9728B₃) from the template candidate set based on the comparison. The selected object identification template may represent an object type (e.g., container type 3) associated with the object (e.g., 3722). More specifically, the selected object recognition template may be associated with that object type. Thus, in this example, determining the object type of the object in the camera field of view may involve selecting the object recognition template associated with that object type.
In an embodiment, the selection of the object identification template may be based on, for example, which object identification template in the candidate set of templates most closely matches the updated sensed structure information. As discussed in more detail below, the comparison may involve determining error values, each error value describing a respective amount of deviation between the object recognition template and the updated sensing structure information. In such a case, the selection of the object recognition template may be based on an error value, as discussed in more detail below. In an embodiment, the computing system 1100 may be configured to use the object structure description in the selected object recognition template in step 4016 to determine one or more robot interaction locations. If the object structure description includes an object structure model, the computing system may be configured to determine the one or more robot interaction locations using the object structure model of the selected object recognition template.
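A minimal sketch of the selection logic described above is shown below; the error values are assumed to have already been computed (one possible error metric is sketched later in this description).

def select_object_recognition_template(candidates, error_values):
    """Select, from the template candidate set, the template that most
    closely matches the updated sensed structure information, i.e. the one
    with the smallest error value."""
    best_index = min(range(len(candidates)), key=lambda i: error_values[i])
    return candidates[best_index]

# Hypothetical error values for container types 1, 2 and 3.
candidates = ["container type 1", "container type 2", "container type 3"]
print(select_object_recognition_template(candidates, [4.1, 2.7, 0.3]))
# -> "container type 3", mirroring the selection of 9728A3 / 9728B3 above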
As described above, the computing system 1100 may compare the updated sensed structure information to a set of candidate object identification templates, or more specifically, to a set of corresponding object structure descriptions in the object identification templates. For example, if the updated sensed structure information describes a plurality of ridges protruding from the sides of the object structure (e.g., container structure), the computing system 1100 may detect the plurality of ridges based on the updated sensed structure information and/or the second image information, and may compare the detected ridges to the ridges or other physical features described by the object structure description in the object recognition template. In such an example, the object type (e.g., container type) of the object in the camera field of view may be determined based on selecting which object recognition template best matches the detected ridge on the outside surface of the object. Thus, the object type in this example may be determined based on the detected ridges on the outside surface of the container structure. In some cases, the set of object structure descriptions may describe a set of corresponding object structure models. In some cases, the comparison may take into account the orientation of the object structure model. Thus, the computing system 1100 may more specifically compare the updated sensed structure information to the object structure models and the candidate combinations of orientations of those object structure models. In this example, the set of template candidates may be more specifically a set of model-orientation candidates, which may be a set including a combination of model-orientations. Each model-orientation combination in the candidate set may be a combination of: (i) an object structure model that is one of the set of object structure models discussed above, and (ii) an orientation of the object structure model. In such an example, the computing system 1100 may compare the updated sensing structure information to the model-orientation combinations in the model-orientation candidate set.
In an embodiment, if the object structure model represents or describes a plurality of outer side surfaces (also referred to as lateral outer surfaces) of an object structure of a particular object type, the orientation of the object structure model may refer to the respective direction in which each of the plurality of outer side surfaces faces. In an embodiment, the orientation of the object structure model may refer to how the computing system 1100 attempts to align the object structure model with the point cloud or other sensed structure information. In one example, the point cloud may represent, for example, at least a first outside surface and a vertical second outside surface of a container or other object in a field of view (e.g., 3202) of the camera. In this example, the object structure model may also represent or describe at least a first and a second outer side surface of the object type associated with the object structure model, wherein the second outer side surface of the object structure model may be perpendicular to its first outer side surface. The first and second outside surfaces described by the point cloud and/or the object structure model may represent, for example, two vertical sidewalls of the container or container structure.
In some cases, a first orientation of the object structure model may be an orientation for which the computing system 1100 determines the degree to which the first and second exterior surfaces of the object structure model are aligned with the first and second exterior surfaces, respectively, represented by the point cloud. More specifically, when the object structure model has the first orientation, the computing system 1100 may compare physical features (e.g., ridges) or other attributes (e.g., size) of a first exterior surface of the object structure model to physical features or other attributes of the first exterior surface described by the point cloud, and may compare physical features or other attributes of a second exterior surface of the object structure model to physical features of the second exterior surface described by the point cloud. Further, in this example, the second orientation of the object structure model may involve the first and second exterior surfaces of the object structure model being rotated 90 degrees relative to the first orientation. When the object structure model has the second orientation, the computing system 1100 may determine the degree to which the first and second exterior surfaces of the object structure model are aligned with the second and first exterior surfaces described by the point cloud, respectively. More specifically, when the object structure model has the second orientation, the computing system 1100 may compare physical features or other attributes of the first exterior surface of the object structure model to physical features of the second exterior surface described by the point cloud, and may compare physical features or other attributes of the second exterior surface of the object structure model to physical features of the first exterior surface described by the point cloud.
In an embodiment, when the object structure model has one of the first orientation or the second orientation, the alignment between the object structure model and the point cloud may be better relative to when the object structure model has the other of the first orientation or the second orientation. Such an embodiment may occur because the first and second exterior surfaces described by the object structure model may have different physical characteristics, such as different ridge patterns and/or other different properties (e.g., different sizes). As an example, if the first exterior surface of the object structure model corresponds to the first exterior surface sensed by the point cloud, the level of alignment between the physical features (e.g., ridge patterns) described by the object structure model and the physical features (e.g., ridge patterns) described by the point cloud may be better when the object structure model has a first orientation than when the object structure model has a second orientation, as the first orientation may result in comparing the first exterior surface of the object structure model to the first exterior surface of the point cloud.
For example, figs. 12A and 12B illustrate a comparison between the updated sensed structure information 9720 and a model-orientation candidate set that includes model-orientation combinations A-F (as shown in fig. 12B) or U-Z (as shown in fig. 12A). In fig. 12A, each model-orientation combination may be a combination of: (i) an object structure model described by one of the object recognition templates 9128A₁ to 9128A₃, and (ii) an orientation of that object structure model. Similarly, each model-orientation combination in fig. 12B may be a combination of: (i) an object structure model described by one of the object recognition templates 9128B₁ to 9128B₃, and (ii) an orientation of that object structure model. As an example, the model-orientation combination Y in FIG. 12A may be a combination of the object structure model described by object recognition template 9128A₃ and a first orientation of that object structure model, whereas the model-orientation combination Z may be a combination of the same object structure model and a second orientation of the object structure model. In this embodiment, determining the object type of the object in the camera field of view may more particularly involve selecting a particular model-orientation combination, wherein the object structure model of the selected combination is associated with the object type. The computing system 1100 may use the object structure model and the orientation of the selected model-orientation combination to determine the robot interaction locations, as discussed in more detail below. If the selection involves the computing system 1100 determining error values, such error values in this embodiment may be associated with the model-orientation combinations in the model-orientation candidate set.
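The construction of such a model-orientation candidate set could be sketched as follows; representing an orientation as a yaw angle in degrees is an assumption made only for illustration.

from itertools import product

def build_model_orientation_candidates(object_structure_models, orientations):
    """Form the model-orientation candidate set: every combination of an
    object structure model and a candidate orientation of that model.

    `orientations` is assumed here to be a list of yaw angles in degrees
    (e.g. 0 and 90 for the two orientations discussed above); a real system
    could parameterize orientation differently.
    """
    return [{'model': m, 'orientation_deg': o}
            for m, o in product(object_structure_models, orientations)]

# E.g. three models (container types 1-3) x two orientations -> six
# combinations, analogous to combinations A through F or U through Z.
candidates = build_model_orientation_candidates(
    ["model_type1", "model_type2", "model_type3"], [0, 90])
print(len(candidates))  # 6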
In an embodiment, as part of step 4014, computing system 1100 may determine whether to filter out object recognition template(s) from the template candidate set, or whether to filter out model-orientation combination(s) from the model-orientation candidate set. Filtering out templates or combinations removes them from consideration as potential matches with the updated sensed structure information (e.g., 9720). In some cases, if the computing system 1100 determines error values based on a template candidate set or a model-orientation candidate set, these error values may be determined after these candidate sets have been filtered, which may reduce the total number of error values that need to be calculated, and thus save computational resources. In other words, the filtering may generate a filtered candidate set, and the error values may be generated based on the object recognition templates or model-orientation combinations in the filtered candidate set.
In an embodiment, determining whether to filter an object recognition template or a model-orientation combination from a candidate set (e.g., the template candidate set or the model-orientation candidate set) may involve determining whether the corresponding object structure model has at least a portion that falls outside of a region occupied by the updated sensed structure information (e.g., 9720). More specifically, the updated sensed structure information may estimate the spatial region occupied by the object structure of the object (e.g., 3722) in the camera field of view (e.g., 3202). If a particular object recognition template or model-orientation combination includes an object structure model that falls outside of that spatial region, the computing system 1100 may determine that the object structure model is highly unlikely to represent the object, and therefore may not even have to determine an error value for that object recognition template or model-orientation combination. Thus, the computing system 1100 may remove the template or the combination from the candidate set.
As an example, the computing system 1100 may filter the template candidate set of fig. 11A or 11B by identifying one or more object recognition templates that include one or more respective object structure models that do not fit, or do not substantially fit, within the estimated region, and removing the one or more object recognition templates from the template candidate set. In other words, the computing system 1100 may determine whether to filter out a particular object recognition template from the candidate set by determining whether the object structure model described by the object recognition template is sufficiently supported or interpreted by the updated sensed structure information. Such a determination may involve whether the object structure model substantially fits within the estimated spatial region occupied by the object structure. A substantial fit exists when the object structure model fits completely within the estimated region, or when the percentage of the object structure model that falls outside the estimated region is less than a predetermined threshold. If the object structure model does not substantially fit within the estimated region, the computing system 1100 may determine that the object structure model is not sufficiently supported or sufficiently interpreted by the updated sensed structure information (e.g., 9720) associated with the object structure. Thus, the computing system 1100 may filter out the object recognition template by removing it from the template candidate set.
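A simplified sketch of this filtering test is given below; treating the estimated region as the axis-aligned bounding box of the global point cloud (expanded by a small margin) and using a fixed outside-fraction threshold are both assumptions chosen for illustration.

import numpy as np

def substantially_fits(model_points, global_cloud,
                       margin=0.01, max_outside_fraction=0.05):
    """Decide whether an object structure model substantially fits within
    the spatial region estimated from the updated sensed structure
    information.

    For illustration, the estimated region is taken to be the axis-aligned
    bounding box of the global point cloud expanded by a small margin, and
    'substantially fits' means the fraction of model points falling outside
    that box is below a threshold; both choices are assumptions.
    """
    cloud = np.asarray(global_cloud, dtype=float)
    pts = np.asarray(model_points, dtype=float)
    lo = cloud.min(axis=0) - margin
    hi = cloud.max(axis=0) + margin
    outside = np.any((pts < lo) | (pts > hi), axis=1)
    return outside.mean() <= max_outside_fraction

def filter_candidates(candidates, global_cloud):
    """Remove candidates (templates or model-orientation combinations)
    whose model points, in the associated orientation, do not substantially
    fit within the estimated region."""
    return [c for c in candidates
            if substantially_fits(c['points'], global_cloud)]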
In an embodiment, the computing system 1100 may perform the filtering operation by determining, for the object recognition templates in the candidate set of templates, whether an object structure model described by the object recognition template substantially fits the estimated region for at least one orientation of the object structure model. If there is at least one orientation of the object structure model such that the object structure model substantially fits within the estimated region, the computing system may determine not to filter the associated object recognition template from the candidate set of templates.
In an embodiment, the computing system 1100 may filter out candidate orientations, or more specifically, candidate combinations of object structure models and orientations of the object structure models. In such embodiments, the computing system 1100 may more specifically determine whether to filter out model-orientation combinations from the model-orientation candidate set. As an example, the model-orientation candidate set may include model-orientation combinations A to F in fig. 12B. In this example, the computing system 1100 may be configured to perform the filtering operation by determining, for each of the model-orientation combinations in the candidate set, whether to remove the model-orientation combination from the candidate set. More specifically, the computing system may determine whether the object structure model included in or associated with a model-orientation combination substantially fits within the estimated region discussed above when the object structure model has the orientation associated with that model-orientation combination. For example, fig. 13A depicts an example involving determining whether to filter out the model-orientation combination C from the model-orientation candidate set. As shown in FIG. 13A, this model-orientation combination may involve the object structure model described in the object recognition template 9128B₂, and may involve that object structure model having orientation 1. The computing system 1100 may determine that the object structure model included in the model-orientation combination C, when having the orientation indicated in or associated with the model-orientation combination C, does not substantially fit within the estimated region defined by the updated sensed structure information 9720. In response to such a determination, the computing system 1100 may remove the model-orientation combination C from the model-orientation candidate set, or may generate an indication that the model-orientation combination C is to be removed from the candidate set. Fig. 13B depicts the model-orientation candidate set after the model-orientation combination C is removed. The model-orientation candidate set may represent the filtered candidate set.
Fig. 13C depicts another example of determining whether to remove a model-orientation combination (model-orientation combination D) from the model-orientation candidate set. More specifically, the computing system 1100 in this example may determine that the object structure model associated with the model-orientation combination D, when having the orientation (orientation 2) associated with the model-orientation combination D, fits substantially within the estimated region defined by the updated sensed structure information. This object structure model may be the same as the object structure model of the model-orientation combination C, but may have an orientation different from that of the model-orientation combination C. In this example, because the object structure model associated with the model-orientation combination D substantially fits within the estimated region when it has the orientation associated with the model-orientation combination D, the computing system 1100 may determine not to remove the model-orientation combination from the candidate set.
As discussed above, the computing system 1100 may determine a set of error values for the model-orientation combinations in the candidate set of model-orientations after the candidate models have been filtered. For example, the computing system 1100 may determine to filter out the model-orientation combination C of fig. 12B from the candidate set, and determine not to filter out the model-orientation combinations A, B, D, E and F. In this example, the computing system 1100 may determine error values for the model-orientation combinations A, B, D, E and F that remain in the candidate set after the candidate set has been filtered.
In embodiments, the computing system 1100 may perform a refinement operation (e.g., a pose refinement operation) that adjusts the object structure description in an object recognition template, or more particularly, adjusts the pose information associated with the physical features described by the object recognition template, in order to more closely match the object structure description (relative to the match level prior to the adjustment) to the updated sensed structure information. In some cases, the pose refinement operation may be performed with respect to an object structure model associated with the object recognition template, and more particularly with respect to an object structure model associated with a model-orientation combination. The object recognition template, the object structure description, the pose information, and the object structure model, after having been adjusted by the pose refinement operation, may be referred to as a refined object recognition template, a refined object structure description, refined pose information, and a refined object structure model, respectively.
In some implementations, the pose refinement operation discussed below may be performed in parallel with the comparison between the object recognition templates and the sensed structure information. For example, if the filtering operation discussed above is performed, the pose refinement operation may be performed in parallel with the filtering operation and/or in parallel with the calculation of the error values, which will be discussed in more detail below. In some implementations, the pose refinement operation may be performed prior to the comparison between the object recognition templates and the sensed structure information. For example, the pose refinement operation may be performed prior to the filtering operation and/or prior to the calculation of the error values. In some implementations, the pose refinement operation may be performed after the filtering operation and/or before the calculation of the error values. In such implementations, the pose refinement operation may be performed on the object recognition templates remaining in the candidate set after the template candidate set or the model-orientation candidate set has been filtered.
For example, FIG. 14 illustrates a pose refinement operation that involves adjusting the object structure model associated with object recognition template 9128B₂, or more specifically, the object structure model associated with the model-orientation combination D of figs. 12B and 13B, to generate a refined object recognition template, or more specifically a refined object structure model. In some implementations, the adjusted object structure model can describe at least one physical feature (e.g., an edge, corner, ridge, or outer surface) of the object structure that is associated with the object recognition template 9128B₂ and with the object structure model. In the example of FIG. 14, the adjusted object structure model may describe physical features 9128B₂₋₁ and 9128B₂₋₂, each of which may be a respective edge of the object structure associated with or represented by the object recognition template 9128B₂. More specifically, the adjusted object structure model may include pose information describing the physical features 9128B₂₋₁ and 9128B₂₋₂, which may refer to the combination of the position and orientation of the physical features 9128B₂₋₁ and 9128B₂₋₂. As described above, the computing system 1100 may adjust the pose information in the object structure model based on the updated sensed structure information to generate refined pose information and/or a refined object structure model. For example, as shown in FIG. 14, the computing system 1100 may adjust the pose information to indicate an adjustment to the physical feature 9128B₂₋₁ and to the physical feature 9128B₂₋₂. The adjustment may involve, for example, rotating the physical feature 9128B₂₋₁ by, e.g., 0.5 to 1 degree, so as to rotate the physical feature 9128B₂₋₁ toward a set of coordinates that is closer to what is described by the updated sensed structure information 9720. The adjustment may also involve offsetting the physical feature 9128B₂₋₂ by, e.g., 2-5 mm, in order to move the physical feature 9128B₂₋₂ toward another set of coordinates that is closer to what is described by the updated sensed structure information 9720. The adjusted physical features 9128B₂₋₁ and 9128B₂₋₂ that may be described by the refined object structure model generated in FIG. 14 may more closely match the updated sensed structure information 9720.
In some cases, the pose refinement may generate refined pose information, a refined object structure model, and/or a refined object recognition template that provides enhanced accuracy when compared against the sensed structure information (e.g., the updated sensed structure information of step 4012). If the refined object recognition template is used to determine the robot interaction locations, as described in more detail below, the enhanced accuracy of the refined pose information in the refined object recognition template may make those robot interaction locations more optimal. In some cases, the enhanced accuracy of the refined pose information in the object recognition template may facilitate determining the object type, such as by facilitating a comparison between the refined object recognition template and the updated sensed structure information. The comparison may involve determining an error value indicative of a respective degree of deviation between the refined object recognition template and the updated sensed structure information. In such an example, a pose refinement operation may be performed to adjust the object recognition template prior to determining the error value. In some cases, adjusting the object recognition templates through the pose refinement operation may make the error values more reliable or useful for determining which object recognition template most closely matches the updated sensed structure information.
In some implementations, pose refinement may facilitate robust determination of object types in real-world, non-ideal environments, which may be affected by manufacturing tolerances, physical damage, or other sources of deviation between the object structure model and the actual object associated with the model. For example, manufacturing tolerances may cause objects of the same object type to have small structural variations, and thus may cause at least some of these objects to exhibit differences when compared to an object identification template, or more specifically an object structural model, associated with the object type. As another example, some of these objects may suffer small physical damage or some other form of structural change during use due to interaction with the environment. In these examples, pose refinement may be used to account for small structural changes that may naturally exist between an actual object in the physical environment of the camera and the object structural model associated with the object. More specifically, the pose refinement operation may adjust the object structure model to bring the refined object structure model closer to the sensed structure information of the object in order to reduce the bias discussed above.
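The patent does not prescribe a specific pose refinement algorithm. As one hedged illustration, the sketch below applies a point-to-point ICP-style refinement, assuming the object structure model and the updated sensed structure information are both available as N x 3 NumPy point clouds; the function name refine_pose and the SciPy KD-tree lookup are illustrative assumptions rather than the patent's prescribed method.

```python
import numpy as np
from scipy.spatial import cKDTree

def refine_pose(model_points: np.ndarray, sensed_points: np.ndarray, iterations: int = 10):
    """Iteratively rotate/translate the model point cloud toward the sensed
    point cloud (point-to-point ICP). Illustrative sketch only."""
    refined = model_points.copy()
    tree = cKDTree(sensed_points)
    for _ in range(iterations):
        # Pair each model coordinate with its closest sensed coordinate.
        _, idx = tree.query(refined)
        matched = sensed_points[idx]
        # Kabsch/SVD estimate of the rigid transform aligning the pairs.
        mu_m, mu_s = refined.mean(axis=0), matched.mean(axis=0)
        H = (refined - mu_m).T @ (matched - mu_s)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:      # guard against reflections
            Vt[-1, :] *= -1
            R = Vt.T @ U.T
        t = mu_s - R @ mu_m
        refined = refined @ R.T + t   # apply the small rotation/offset
    return refined
```

In this sketch, the small rotations and offsets described above (e.g., fractions of a degree, a few millimeters) emerge from the per-iteration rigid transform that pulls the model coordinates toward the sensed coordinates.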
In an embodiment, the computing system may perform the comparison between the object recognition templates in the template candidate set and the updated sensed structure information by calculating or otherwise determining at least one respective error value for each object recognition template in the candidate set. For example, as discussed above, if the updated sensed structure information includes a global point cloud having a plurality of coordinates, the at least one respective error value may be calculated based on, for example, how closely the coordinates of the global point cloud match the corresponding object recognition template, or more specifically, how closely the coordinates of the global point cloud match one or more physical features (e.g., edges, corners, ridges, surfaces, etc.) described by the object structure description information included in the object recognition template. In some implementations, if the object structure description of the object recognition template includes a point cloud (also referred to as a template point cloud), the error value associated with the object recognition template may be based on respective distances between the coordinates in the global point cloud and the coordinates in the template point cloud. In some implementations, the template point cloud and the object recognition template may be a refined template point cloud or a refined object recognition template generated by the pose refinement operation discussed above. In some implementations, the error value may indicate a degree of deviation between physical features (e.g., ridges, edges, and/or corners) described by the object structure model of the object recognition template and physical features (e.g., ridges, edges, and/or corners) described by the global point cloud.
FIGS. 15A-15C depict the computing system 1100 determining error values between the updated sensed structure information 9720 and object recognition templates 9128B1 through 9128B3 in the template candidate set of FIG. 11B (which may be, for example, refined object recognition templates that have been adjusted by a pose refinement operation). More specifically, FIG. 15A illustrates the computing system 1100 determining at least one error value for object recognition template 9128B1 in the template candidate set, FIG. 15B illustrates the computing system 1100 determining at least one error value for object recognition template 9128B2 in the template candidate set, and FIG. 15C illustrates the computing system 1100 determining at least one error value for object recognition template 9128B3 in the template candidate set. In some implementations, the computing system 1100 may determine a plurality of error values (e.g., two error values) for an object recognition template. The plurality of error values may respectively correspond to a plurality of orientations of the object structure model described by the object recognition template (e.g., a refined object structure model generated by a pose refinement operation). For example, as discussed in more detail below, FIG. 15C may illustrate the computing system 1100 determining one error value (corresponding to one orientation) for object recognition template 9128B3, and FIG. 16 illustrates the computing system 1100 determining another error value (corresponding to another orientation) for object recognition template 9128B3.
Returning to the example in FIG. 15A, this example involves an error value associated with object recognition template 9128B1, which may indicate a respective degree of deviation between the object structure description (e.g., object structure model) in object recognition template 9128B1 and the updated sensed structure information 9720. In some implementations, the computing system 1100 in FIG. 15A can detect or otherwise determine whether the object structure model of object recognition template 9128B1 has any portions that are not sufficiently explained by the updated sensed structure information 9720. As an example, the computing system 1100 may determine whether a distance between a particular portion of the object structure model and a corresponding (e.g., closest) portion of the updated sensed structure information 9720 is greater than a predetermined distance threshold. For example, if the object structure model and the updated sensed structure information 9720 are both point clouds each including a plurality of coordinates (also referred to as points), the computing system may determine whether the object structure model has any coordinates that are separated from the corresponding (e.g., closest) coordinates of the updated sensed structure information 9720 by more than the predetermined distance threshold. If the distance separating two corresponding coordinates is greater than the predetermined distance threshold, the computing system 1100 may determine that the particular coordinate (or, more generally, the particular portion of the object structure model) is not sufficiently explained by the updated sensed structure information 9720. Such a portion may be referred to as an unexplained portion, or more specifically as an unexplained coordinate or unexplained point. FIG. 15A depicts an example in which the computing system 1100 determines portions 14031-1 through 14031-17 of the object structure model having coordinates that are not sufficiently explained by the coordinates of the updated sensed structure information 9720 (e.g., a global point cloud). These unexplained coordinates (also referred to as unexplained points) may form approximately 11% of the total number of coordinates or points in the object structure model, and may have an average distance of approximately 3.05 mm from the corresponding coordinates in the updated sensed structure information 9720. In the example of FIG. 15A, the computing system 1100 can determine that the error value associated with object recognition template 9128B1 is equal to or based on the average distance or the number of unexplained points in FIG. 15A.
As shown in FIGS. 15B and 15C, the computing system 1100 can determine an error value associated with object recognition template 9128B2 and an error value associated with object recognition template 9128B3. In the example of FIG. 15B, the computing system 1100 can determine that the object structure model of object recognition template 9128B2 contains portions 14032-1 through 14032-5 with coordinates that are not sufficiently explained by the updated sensed structure information 9720. As a specific example, these unexplained coordinates may constitute approximately 12.97% of the total number of coordinates of the object structure model of object recognition template 9128B2 and may have an average distance of approximately 3.85 mm from the corresponding coordinates in the updated sensed structure information 9720. In the example of FIG. 15C, the computing system 1100 can determine that the object structure model of object recognition template 9128B3 has a portion 14033-1 that includes coordinates which are not sufficiently explained by the updated sensed structure information 9720. Portion 14033-1 may form approximately 0.09% of the total number of coordinates of the object structure model and may have an average distance of approximately 1.31 mm from the corresponding coordinates in the updated sensed structure information 9720.
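As a rough numerical illustration of the unexplained-point test described above, the sketch below classifies template coordinates whose nearest sensed coordinate lies farther than a distance threshold as unexplained, and reports both the unexplained fraction and the mean deviation, mirroring the percentages and average distances reported for FIGS. 15A-15C. The threshold value, function name, and use of a KD-tree are assumptions for illustration, not the patent's specified computation.

```python
import numpy as np
from scipy.spatial import cKDTree

def template_error(model_points: np.ndarray,
                   sensed_points: np.ndarray,
                   distance_threshold: float = 2.0):
    """Return (unexplained_fraction, mean_unexplained_distance) for one
    object recognition template versus the updated sensed point cloud."""
    # Distance from each template coordinate to its closest sensed coordinate.
    distances, _ = cKDTree(sensed_points).query(model_points)
    unexplained = distances > distance_threshold
    fraction = unexplained.mean()          # e.g. 0.1135 corresponds to 11.35%
    mean_dist = distances[unexplained].mean() if unexplained.any() else 0.0
    return fraction, mean_dist
```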
As described above, in embodiments, the computing system may determine a set of error values associated with respective model-orientation combinations of a model-orientation candidate set (such as the candidate set of FIGS. 12A, 12B, or 13B). As discussed further above, each model-orientation combination of the candidate set may be a combination of an object structure model and an orientation of the object structure model. In this example, each error value of the set of error values may indicate a degree of deviation between: (i) the updated sensed structure information (e.g., 9720) and (ii) the object structure model of the respective model-orientation combination associated with the error value, when the object structure model has the orientation associated with the respective model-orientation combination. For example, FIGS. 15A-15C illustrate the computing system 1100 determining three error values associated with three respective model-orientation combinations, namely model-orientation combination A (as shown in FIG. 15A), model-orientation combination D (as shown in FIG. 15B), and model-orientation combination F (as shown in FIG. 15C). FIG. 16 further illustrates the computing system 1100 determining an error value associated with model-orientation combination E. In the example of FIG. 16, the computing system 1100 can determine that when the object structure model of object recognition template 9128B3 has the orientation (orientation 2) of model-orientation combination E, the object structure model has portions 15033-1 through 15033-4 comprising coordinates that are not sufficiently explained by the updated sensed structure information 9720. Furthermore, the error value in FIG. 15C may indicate a degree of deviation between the updated sensed structure information 9720 and the object structure model of object recognition template 9128B3 when the object structure model has orientation 1 (which is the orientation of model-orientation combination F), while the error value in FIG. 16 may indicate the degree of deviation between the updated sensed structure information 9720 and the same object structure model when the object structure model has orientation 2 (which is the orientation associated with model-orientation combination E).
In an embodiment, the computing system 1100 may be configured to determine the object type by determining the set of error values discussed above based on the updated sensed structure information, and selecting an object recognition template and/or model-orientation combination based on the set of error values, wherein the object structure model of the selected object recognition template and/or model-orientation combination is associated with the object type. In one example, the computing system 1100 can select an object recognition template (e.g., 9128B3) from a set of template candidates (e.g., 9128B1 through 9128B3). The selected object recognition template may have the lowest error value of the set of error values. For example, if the set of error values includes percentage values of 11.35%, 12.97%, and 0.09% associated with object recognition templates 9128B1 through 9128B3 as in FIGS. 15A-15C, then the computing system 1100 may select object recognition template 9128B3, which is associated with the lowest percentage value (i.e., 0.09%) in the set of percentage values.
In one example, the computing system 1100 may select a model-orientation combination from a model-orientation candidate set (such as the candidate set in FIG. 12A, 12B, or 13B). For example, if the candidate set includes at least the model-orientation combinations A, D, F, and E shown in FIGS. 15A-15C and FIG. 16, the set of error values may in one example include the percentage values 11.35%, 12.97%, 0.09%, and 3.74% from these figures. In this example, the computing system 1100 may be configured to select model-orientation combination F, which is associated with the lowest percentage value in the set of error values. As described above, the selected object recognition template and/or the selected model-orientation combination may be associated with an object type of an object (e.g., 3722) in the camera field of view (e.g., 3202). Thus, in an embodiment, the computing system 1100 may determine an object type associated with the object (e.g., 3722) by selecting an object recognition template or model-orientation combination in step 4014.
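A minimal sketch of this selection step follows, assuming each model-orientation candidate has already been assigned an error value such as the unexplained-point fraction computed above; the dictionary layout and field names are illustrative assumptions.

```python
def select_model_orientation(candidates):
    """candidates: iterable of dicts such as
    {"template_id": "9128B3", "orientation": 1, "error": 0.0009}.
    Returns the model-orientation combination with the lowest error value."""
    return min(candidates, key=lambda c: c["error"])

best = select_model_orientation([
    {"template_id": "9128B1", "orientation": 1, "error": 0.1135},
    {"template_id": "9128B2", "orientation": 1, "error": 0.1297},
    {"template_id": "9128B3", "orientation": 1, "error": 0.0009},
    {"template_id": "9128B3", "orientation": 2, "error": 0.0374},
])
# best corresponds to template 9128B3 with orientation 1 (model-orientation combination F)
```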
Returning to fig. 4, in an embodiment, method 4000 may include step 4016, wherein computing system 1100 may determine (e.g., via robot interaction planning module 1122) one or more robot interaction locations based on the object type determined in step 4014. As described above, the object type may be associated with an object design of the object type or category, or more specifically, a physical design (e.g., a physical shape) and/or a visual design of the object type or category. In such a case, the computing system 1100 may determine one or more robot interaction locations based on the physical design associated with the object type determined in step 4014. As an example, the object (e.g., 3722) may be a container, and the one or more robotic interaction locations may be a plurality of gripping locations where the container is to be gripped, picked up, or otherwise engaged by a robot (e.g., 3300) or, more specifically, by an end effector device (e.g., 3500). In such an example, the plurality of gripping locations may be determined based on a physical design (e.g., physical shape) associated with the object type.
In embodiments, the one or more robot interaction locations may be determined based on the selected object recognition template and/or based on the selected model-orientation combination discussed above (such as object recognition template 9128B3 of FIG. 15C and/or model-orientation combination F). More specifically, the one or more robot interaction locations may be determined based on an object structure model included in the object recognition template and/or the model-orientation combination.
In an embodiment, the object structure model may already include or identify one or more robot gripping locations. For example, FIG. 17A shows an object recognition template 16128 that identifies robot gripping locations 16129-1 and 16129-2. In this example, the object structure model may be a container structure model, which may describe a physical structure associated with a container type. More specifically, the container structure model in FIG. 17A may be associated with a container type or category whose containers have physical feature 16128-1 as a container rim. In other words, the container structure model may describe a container rim structure. In this example, the robot gripping locations 16129-1 and 16129-2 may be located along the rim structure of the container. The container structure model in FIG. 17A may also describe other physical features 16128-2, 16128-3, and 16128-4, which may be a first ridge or other protrusion, a second ridge or other protrusion, and a corner, respectively.
In an embodiment, the object structure model of the selected object recognition template or of the selected model-orientation combination may identify regions containing physical features that could interfere with a robot grip. For example, FIG. 17B shows an example in which the object structure model of object recognition template 16128 identifies a first region 16130-1 (e.g., a rectangular region) representing space surrounding the first ridge (16128-2) and a second region 16130-2 representing space surrounding the second ridge (16128-3). In this example, if gripping a container represented by object recognition template 16128 involves moving gripper fingers of an end effector device (e.g., 3500) toward the rim of the container so that the gripper fingers can grip the rim, the first and second ridges may interfere with that movement because they may block the gripper fingers from moving inward. Thus, if a gripping location is near the first ridge or the second ridge, the gripper fingers may not be able to achieve a grip at that location, or may achieve only a shallow grip there. Accordingly, the container structure model in FIG. 17B may identify the first region 16130-1 and the second region 16130-2 surrounding the first ridge and the second ridge, so that the computing system 1100 may avoid determining gripping locations within regions 16130-1 and 16130-2.
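One possible way to encode such keep-out regions is as axis-aligned boxes in the model coordinate frame; this representation, along with the function name, is an assumption for illustration, since the patent does not specify how the regions are stored.

```python
import numpy as np

def outside_exclusion_regions(candidate_xyz, regions):
    """candidate_xyz: (3,) gripping-location coordinate; regions: list of
    (min_xyz, max_xyz) boxes, such as the zones around the ridges
    (16130-1, 16130-2). Returns True if the candidate lies outside every
    exclusion region."""
    p = np.asarray(candidate_xyz)
    for lo, hi in regions:
        if np.all(p >= np.asarray(lo)) and np.all(p <= np.asarray(hi)):
            return False
    return True
```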
In such cases, the computing system 1100 may determine a plurality of gripping locations based on a container rim structure, such as the container rim structure described by object recognition template 16128. In some implementations, this determination may involve determining an overhang distance at different locations along the container rim, where a large overhang distance at a particular location may indicate that a deep or stable grip is achievable at that location, while a small overhang distance at a particular location may indicate that only a shallow grip is possible there. More specifically, the computing system 1100 may determine a plurality of overhang distances associated with a plurality of respective locations along the container rim structure (such as the rim structure described by object recognition template 16128 in FIGS. 17A and 17B). In this example, each of the plurality of overhang distances may be a distance that the robotic arm (e.g., 3400), or more specifically the end effector device (e.g., 3500) or its gripper fingers, is able to extend in an inward direction toward the container structure under the container rim structure. For example, the overhang distance associated with a particular location along the rim structure may indicate how far a lower gripper finger of the end effector device can extend in an inward direction toward the container if the gripper finger is located at that location along the rim structure. In some cases, the determination of the overhang distances may be part of a simulation in which the computing system 1100 simulates the robotic arm (e.g., 3400) sliding the end effector device (e.g., 3500), or a portion thereof, to different positions along the rim structure. In this embodiment, the computing system may select the plurality of gripping locations from the plurality of respective locations along the container rim structure based on the plurality of overhang distances. For example, the plurality of gripping locations may be the locations having some of the greatest overhang distances or, more generally, the highest overhang distances. In some cases, a higher overhang distance at a particular location of the rim structure may indicate that a wider portion of the rim structure can be engaged by the end effector device (e.g., 3500) at that location, which may facilitate a deeper or more stable grip there.
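Continuing the sketch, candidate rim locations could then be ranked by their simulated overhang distances, discarding locations inside keep-out regions first; the callable names and the top-k selection are illustrative assumptions rather than the patent's prescribed procedure.

```python
def select_gripping_locations(rim_locations, overhang_of, num_grips=2, exclusion_check=None):
    """rim_locations: candidate (x, y, z) points sampled along the container rim.
    overhang_of: callable returning the simulated overhang distance at a location.
    exclusion_check: optional callable such as outside_exclusion_regions above.
    Returns the locations with the largest overhang distances, i.e. where the
    deepest / most stable grips are expected."""
    candidates = [loc for loc in rim_locations
                  if exclusion_check is None or exclusion_check(loc)]
    ranked = sorted(candidates, key=overhang_of, reverse=True)
    return ranked[:num_grips]
```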
In an embodiment, the computing system 1100 may determine whether an object (e.g., 3722) in the camera field of view (e.g., 3202) has a container cover, and may determine one or more robot interaction locations (e.g., gripping locations) based on whether the object has a container cover. For example, the computing system 1100 can determine whether the first image information of step 4002, the second image information of step 4010, and/or the updated sensing structure information indicate the presence of a container lid. The computing system 1100 can determine a plurality of gripping locations based on its detection of the presence of a container lid.
In an embodiment, if the selected object recognition template includes a container structure model that describes at least a container lid structure, the computing system 1100 may determine the plurality of gripping locations based on the container lid structure. For example, FIG. 17C illustrates an object recognition template 17128, which may include a container structure model identifying the presence of a lid 17128-1. The container structure model in this example may also identify features of the container lid structure, such as gaps 17128-2 and 17128-3, which may interfere with a grip if a gripping location is near gap 17128-2 or 17128-3. Thus, in embodiments, the computing system 1100 may use object recognition template 17128 to avoid determining gripping locations near gaps 17128-2 and 17128-3. FIG. 17D provides an example in which gaps 17128-2 and 17128-3 of the lid structure are represented by regions 16130-3 and 16130-4. More specifically, regions 16130-3 and 16130-4 may surround gaps 17128-2 and 17128-3, respectively. In this example, the computing system 1100 may determine the gripping locations such that no gripping location falls within region 16130-3 or 16130-4.
In an embodiment, if the selected object recognition template or model-orientation combination includes an object structure model representing a container structure that is rotationally symmetric, or more specifically has 4-fold rotational symmetry, the computing system 1100 may use this symmetry to simplify the determination of multiple gripping locations. For example, the computing system may determine the first gripping location based on the object structure model. Because the container structure is rotationally symmetric, the computing system 1100 may determine the second gripping location based on the first gripping location. For example, the computing system 1100 may determine that the first gripping location has a first distance from a corner of the container structure, wherein the first gripping location is on a first side of the container structure. In this example, the computing system 1100 may determine the second gripping location as a location on the second side of the container structure and having the same distance from a corner of the container structure. In the above examples, 4-fold rotational symmetry or other rotational symmetry of the container structure may refer to rotational symmetry about a vertical axis of rotation passing through the center of the container structure, where the vertical axis of rotation may be an axis perpendicular to the floor or ground.
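As an illustration of exploiting 4-fold rotational symmetry, a first gripping location could be rotated in 90-degree increments about a vertical axis through the container center to obtain equivalent locations on the other sides. This is a hedged sketch; the z-up axis convention and function name are assumptions.

```python
import numpy as np

def symmetric_grip_location(first_grip, container_center, quarter_turns=1):
    """For a container with 4-fold rotational symmetry about a vertical axis
    through container_center, rotate the first gripping location by
    quarter_turns * 90 degrees to obtain an equivalent location on another
    side (e.g., same distance from the nearest corner)."""
    angle = quarter_turns * np.pi / 2.0
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])   # rotation about the vertical (z) axis
    offset = np.asarray(first_grip) - np.asarray(container_center)
    return np.asarray(container_center) + R @ offset
```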
In an embodiment, the method 4000 of FIG. 4 may further include a step in which the computing system 1100 causes one or more robot interactions, which may be based on the gripping locations discussed above or on other robot interaction locations. For example, the computing system 1100 may output one or more robot interaction movement commands for causing interaction at the one or more robot interaction locations, where the one or more robot interaction movement commands may be generated based on the one or more robot interaction locations.
In an embodiment, an object structure that is rotationally symmetric, or more specifically has 4-fold rotational symmetry, may influence how object registration is performed. Object registration may involve, for example, generating a new object recognition template for an object or object structure that does not match any existing object recognition template. For example, when an additional object (e.g., a new container) is in the camera field of view, the computing system 1100 may perform object registration if the additional object does not match any existing object recognition template stored on the non-transitory computer-readable medium 1120 or 1600. Object registration may involve generating an additional object recognition template based on image information of the additional object. If the additional object has a rotationally symmetric object structure, the computing system 1100 may generate the additional object recognition template based on one corner of the object structure of the additional object and not based on the remaining corners of the object structure of the additional object. More specifically, if the object structure of the additional object is rotationally symmetric, or more specifically has 4-fold rotational symmetry, the object structure may include corners having substantially the same structure and/or sides having substantially the same structure. Thus, while the computing system 1100 may determine an additional object recognition template that directly describes one corner and one side of the object structure, the additional object recognition template does not necessarily need to further directly describe the remaining corners or sides of the object structure, because they may have substantially the same structure as the corner or side already described by the additional object recognition template.
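As a hedged sketch of how such symmetry could reduce registration effort, the point cloud captured for one corner or side could be replicated at 90-degree increments about the container's vertical center axis to approximate the full structure. The function name and z-up convention are assumptions for illustration; the patent only states that the template need not directly describe the remaining corners or sides.

```python
import numpy as np

def template_from_one_corner(corner_points, container_center):
    """Replicate a point cloud describing one corner/side of a 4-fold
    rotationally symmetric container at 90-degree increments about the
    vertical axis through container_center, sketching the full structure
    without directly imaging the remaining corners."""
    center = np.asarray(container_center)
    pieces = []
    for k in range(4):
        angle = k * np.pi / 2.0
        c, s = np.cos(angle), np.sin(angle)
        R = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
        pieces.append(center + (np.asarray(corner_points) - center) @ R.T)
    return np.vstack(pieces)
```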
Additional discussion of various embodiments:
embodiment 1 relates to a computing system for performing object detection, or a method that may be performed by a computing system, such as when the computing system executes instructions of a non-transitory computer-readable medium. In this embodiment, a computing system includes a communication interface and at least one processing circuit. The communication interface is configured to communicate with: (i) a robot having a robot arm and an end effector arrangement arranged at or forming one end of the robot arm; and (ii) a camera mounted on the robotic arm and having a camera field of view. The at least one processing circuit is configured to perform the following when the object is in the camera field of view: receiving first image information representing at least a first portion of an object structure associated with an object, wherein the first image information is generated by a camera when the camera is in a first camera pose in which the camera is directed at the first portion of the object structure; generating or updating sensed structure information representing an object structure associated with the object based on the first image information; identifying an object angle associated with the object structure based on the sensed structure information; outputting one or more camera placement movement commands that, when executed by the robot, cause the robotic arm to move the camera to a second camera pose in which the camera is pointed at an object angle; receiving second image information representing a structure of the object, wherein the second image information is generated by the camera when the camera is in a second camera pose; updating the sensing structure information based on the second image information to generate updated sensing structure information; determining an object type associated with the object based on the updated sensing structure information; determining one or more robot interaction locations based on the object type, wherein the one or more robot interaction locations are one or more locations for interaction between the end effector device and the object; and outputting one or more robot interaction movement commands for causing interaction at the one or more robot interaction locations, wherein the one or more robot interaction movement commands are generated based on the one or more robot interaction locations. In some cases, the computing system may omit output of one or more robot-interactive movement commands (which may be executed by another computing system).
Embodiment 2 includes the computing system of embodiment 1. In this embodiment, the at least one processing circuit is configured to determine the object type by: performing a comparison between the updated sensed structure information and a template candidate set, wherein the template candidate set is a set including object recognition templates describing object structures associated with different object types; and, based on the comparison, selecting an object recognition template from the template candidate set such that the object recognition template is the selected object recognition template, wherein the selected object recognition template represents the object type associated with the object. In this embodiment, the at least one processing circuit is configured to determine the one or more robot interaction locations based on the object structure description associated with the selected object recognition template.
Embodiment 3 includes the computing system of embodiment 2. In this embodiment, the at least one processing circuit is configured to perform the comparison between the updated sensing structure information and the candidate set of templates by calculating a set of error values associated with the object recognition templates in the candidate set of templates, wherein each error value of the set of error values is indicative of a respective degree of deviation between: (i) updated sensed structure information, and (ii) an object structure description included in the object recognition template associated with an error value, wherein the selected object recognition template is associated with a lowest error value of a set of error values.
Embodiment 4 includes the computing system of embodiment 3. In this embodiment, the updated sensed structure information is a point cloud comprising a plurality of coordinates for representing an object structure associated with the object, wherein the at least one processing circuit is configured to: for each object recognition template in the candidate set of templates, at least one error value is calculated based on how closely the coordinates from the plurality of coordinates of the point cloud match one or more physical features described by the corresponding object structure description included in the object recognition template.
Embodiment 5 includes the computing system of embodiment 3 or 4. In this embodiment, the object recognition templates in the candidate set of templates each describe a set of object structure models. Further, in this embodiment, the template candidate set is a model-orientation candidate set, the model-orientation candidate set being a set including model-orientation combinations, wherein each model-orientation combination in the model-orientation candidate set is a combination of: (i) an object structure model that is one of a set of object structure models, and (ii) an orientation of the object structure model. Further, in this embodiment, a set of error values is respectively associated with the model-orientation combinations in the candidate set of model-orientations, wherein each error value in the set of error values indicates a respective degree of deviation between: (i) updated sensed structure information, and (ii) a respective model-orientation combined object structure model associated with an error value, wherein the error value is also associated with the object structure model having the orientation of the respective model-orientation combination.
Embodiment 6 includes the computing system of embodiment 5. In this embodiment, the at least one processing circuit is configured to select the object recognition template by selecting a model-orientation combination from a candidate set of model-orientations, the model-orientation combination comprising the object structure model described by the selected object recognition template, wherein the selected model-orientation combination is associated with the lowest error value of a set of error values, and wherein the at least one processing circuit is configured to determine the one or more robot interaction positions based on the object structure model of the selected model-orientation combination and the orientation of the selected model-orientation combination.
Embodiment 7 includes the computing system of embodiment 6. In this embodiment, the updated sensing structure information defines an estimated area occupied by an object structure of an object in the camera field of view, wherein the at least one processing circuit is configured to: filtering the set of model-orientation candidates by, prior to computing a set of error values associated with the set of model-orientation candidates, performing the following for each model-orientation combination in the set of model-orientation candidates: determining whether the object structure model of the model-orientation combination substantially fits within the estimated region when the object structure model has the orientation of the model-orientation combination, and in response to the object structure model of the model-orientation combination not substantially fitting within the estimated region when the object structure model has the orientation of the model-orientation combination, removing the model-orientation combination from the set of model-orientation candidates, wherein a set of error values is calculated based on model-orientation combinations remaining in the set of model-orientation candidates after the set of model-orientation candidates is filtered.
Embodiment 8 includes the computing system of any of embodiments 3-7. In this embodiment, the updated sensed structure information defines an estimated area occupied by the object structure, wherein the object recognition templates in the candidate set of templates respectively describe a set of object structure models. In this embodiment, the at least one processing circuit is configured to: prior to computing a set of error values associated with the object recognition templates in the template candidate set, filter the template candidate set by: identifying one or more object recognition templates comprising one or more respective object structure models that do not substantially fit in the estimated area, and removing the one or more object recognition templates from the candidate set of templates, and wherein the set of error values is calculated based on the object recognition templates remaining in the candidate set of templates after the candidate set of templates is filtered.
Embodiment 9 includes the computing system of any of embodiments 2-8. In this embodiment, the at least one processing circuit is configured to: for at least one object recognition template in the candidate set of templates, a corresponding object structure description comprised in the object recognition template is adjusted based on the updated sensed structure information.
Embodiment 10 includes the computing system of embodiment 9. In this embodiment, the respective object structure description of the at least one object recognition template describes a physical feature of the respective object structure described by the at least one object recognition template, and wherein the respective object structure description further comprises pose information describing a pose of the physical feature, and wherein the at least one processing circuit is configured to adjust the pose information based on the updated sensed structure information to improve the degree to which the physical feature described by the at least one object recognition template matches the updated sensed structure information.
Embodiment 11 includes the computing system of any of embodiments 1-10. In this embodiment, the at least one processing circuit is configured to: when the object is a container and when the object structure is a container structure, determining the one or more robot interaction locations as a plurality of gripping locations associated with gripping the container such that the plurality of gripping locations are determined based on an object type, the object type being a container type associated with the container.
Embodiment 12 includes the computing system of embodiment 11. In this embodiment, the at least one processing circuit is configured to: when the container structure includes a plurality of ridges protruding from a side surface of the container structure, the plurality of ridges are detected based on the second image information or the updated sensing structure information such that the plurality of ridges are detected ridges on the side surface of the container structure, wherein a container type associated with the container is determined based on the detected ridges on the side surface of the container structure.
Embodiment 13 includes the computing system of embodiment 11 or 12. In this embodiment, the at least one processing circuit is configured to determine the container type by: performing a comparison between the updated sensed structure information and a template candidate set, wherein the template candidate set is a set comprising object identification templates describing container structures associated with different container types; based on the comparison, selecting an object recognition template from the template candidate set such that the object recognition template is the selected object recognition template, wherein the selected object recognition template represents a container type associated with the container, wherein the at least one processing circuit is further configured to: when the selected object recognition template includes a container structure model describing at least a container rim structure, a plurality of gripping locations are determined based on the container rim structure.
Embodiment 14 includes the computing system of any of embodiments 11-13. In this embodiment, the at least one processing circuit is configured to determine a plurality of overhang distances associated with a plurality of respective locations along the container rim structure, wherein each of the plurality of overhang distances is a distance that the end effector device is able to extend under the container rim structure in an inward direction toward the container structure if the end effector device is present at a respective location of the plurality of locations, wherein the at least one processing circuit is configured to select the plurality of gripping locations from the plurality of respective locations along the container rim structure based on the plurality of overhang distances.
Embodiment 15 includes the computing system of any of embodiments 11-14. In this embodiment, the at least one processing circuit is configured to determine whether the first image information or the second image information indicates the presence of the container lid, wherein the plurality of gripping locations is further determined based on whether the first image information or the second image information indicates the presence of the container lid.
Embodiment 16 includes the computing system of embodiment 15. In this embodiment, the at least one processing circuit is configured to: when the selected object identification template includes a container structure model describing at least one container lid structure, a plurality of gripping locations are determined based on the container lid structure.
Embodiment 17 includes the computing system of any of embodiments 1-16. In this embodiment, the at least one processing circuit is configured to, when the additional object is in the camera field of view and the additional object is rotationally symmetric: receiving additional image information representing an object structure of an additional object; and generating additional object recognition templates for the set of object recognition templates based on the additional image information, wherein the additional object recognition templates are generated based on one corner of the object structure of the additional object and not based on the remaining corners of the object structure of the additional object.
Embodiment 18 includes the computing system of any of embodiments 1-17. In this embodiment, the first image information is associated with the camera being at a first distance from the object, and the second image information is associated with the camera being at a second distance from the object that is less than the first distance.
Embodiment 19 includes the computing system of any of embodiments 1-18, wherein the sensed structure information based on the first image information describes a plurality of object angles of the object structure, and wherein the at least one processing circuit is configured to identify the object angle by selecting the object angle from the plurality of object angles, the selection based on at least one of: (i) respective levels of accessibility to the plurality of object corners for robotic interaction with the robotic arm, or (ii) respective levels of occlusion for sensing the plurality of object corners by the camera.
It will be apparent to one of ordinary skill in the relevant art that other suitable modifications and adaptations to the methods and applications described herein may be made without departing from the scope of any of the embodiments. The embodiments described above are illustrative examples and should not be construed as limiting the invention to these particular embodiments. It should be understood that the various embodiments disclosed herein may be combined in different combinations than those specifically presented in the description and drawings. It will also be understood that, according to an example, certain acts or events of any process or method described herein can be performed in a different order, may be added, merged, or omitted altogether (e.g., all described acts or events may not be necessary for performing the method or process). Additionally, although certain features of the embodiments herein are described as being performed by a single component, module, or unit for clarity, it should be understood that the features and functions described herein can be performed by any combination of components, units, or modules. Accordingly, various changes and modifications may be effected therein by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

Claims (20)

1. A computing system, comprising:
a communication interface configured to communicate with: (i) a robot having a robotic arm and an end effector arrangement arranged at or forming one end of the robotic arm; and (ii) a camera mounted on the robotic arm and having a camera field of view;
at least one processing circuit configured to perform the following when an object is in the camera field of view:
receiving first image information representing at least a first portion of an object structure associated with the object, wherein the first image information is generated by the camera when the camera is in a first camera pose in which the camera is directed at the first portion of the object structure;
generating or updating sensed structure information representing the object structure associated with the object based on the first image information;
identifying an object angle associated with the object structure based on the sensed structure information;
outputting one or more camera placement movement commands that, when executed by the robot, cause the robotic arm to move the camera to a second camera pose in which the camera is pointed at the object angle;
receiving second image information representing the object structure, wherein the second image information is generated by the camera when the camera is in the second camera pose;
updating the sensing structure information based on the second image information to generate updated sensing structure information;
determining an object type associated with the object based on the updated sensing structure information;
determining one or more robot interaction locations based on the object type, wherein the one or more robot interaction locations are one or more locations for interaction between the end effector device and the object; and
outputting one or more robot interaction movement commands for causing the interaction at the one or more robot interaction locations, wherein the one or more robot interaction movement commands are generated based on the one or more robot interaction locations.
2. The computing system of claim 1, wherein the at least one processing circuit is configured to determine the object type by:
performing a comparison between the updated sensed structure information and a template candidate set, wherein the template candidate set is a set comprising object identification templates describing object structures associated with different object types;
selecting an object recognition template from the template candidate set based on the comparison such that the object recognition template is the selected object recognition template, wherein the selected object recognition template represents the object type associated with the object,
wherein the at least one processing circuit is configured to determine the one or more robot interaction locations based on an object structure description associated with the selected object recognition template.
3. The computing system of claim 2, wherein the at least one processing circuit is configured to perform the comparison between the updated sensed structure information and the candidate set of templates by calculating a set of error values associated with object recognition templates in the candidate set of templates,
wherein each error value of the set of error values is indicative of a respective degree of deviation between: (i) the updated sensed structure information, and (ii) an object structure description included in an object recognition template associated with the error value,
wherein the selected object recognition template is associated with the lowest error value of the set of error values.
4. The computing system of claim 3, wherein the updated sensed structure information is a point cloud comprising a plurality of coordinates representing the object structure associated with the object,
wherein the at least one processing circuit is configured to: for each object recognition template in the candidate set of templates, calculating at least one error value based on how closely the coordinates of the plurality of coordinates from the point cloud match one or more physical features described by the corresponding object structure description included in the object recognition template.
5. The computing system of claim 3, wherein the object recognition templates in the candidate set of templates each describe a set of object structure models,
wherein the set of template candidates is a set of model-orientation candidates, the set of model-orientation candidates being a set comprising model-orientation combinations, wherein each model-orientation combination in the set of model-orientation candidates is a combination of: (i) an object structure model that is one of the set of object structure models, and (ii) an orientation of the object structure model,
wherein the set of error values are associated with model-orientation combinations in the candidate set of model-orientation, respectively,
wherein each error value of the set of error values is indicative of a respective degree of deviation between: (i) the updated sensed structure information, and (ii) a respective model-orientation combined object structure model associated with the error value, wherein the error value is further associated with the object structure model having the orientation of the respective model-orientation combination.
6. The computing system of claim 5, wherein the at least one processing circuit is configured to select the object recognition template by selecting a model-orientation combination from the candidate set of model-orientation combinations, the model-orientation combination comprising an object structure model described by the selected object recognition template, wherein the selected model-orientation combination is associated with the lowest error value of the set of error values, and
wherein the at least one processing circuit is configured to determine the one or more robot interaction locations based on the object structure model of the selected model-orientation combination and the orientation of the selected model-orientation combination.
7. The computing system of claim 6, wherein the updated sensing structure information defines an estimated area occupied by the object structure of the object in the camera field of view,
wherein the at least one processing circuit is configured to: filtering the set of model-orientation candidates by, prior to computing the set of error values associated with the set of model-orientation candidates, performing the following for each model-orientation combination in the set of model-orientation candidates:
determining whether the object structure model of the model-orientation combination substantially fits within the estimation region when the object structure model has the orientation of the model-orientation combination, and
removing the model-orientation combination from said set of model-orientation candidates in response to the object structure model of the model-orientation combination not substantially fitting within said estimation region when the object structure model has the orientation of the model-orientation combination,
wherein the set of error values is calculated based on model-orientation combinations that remain in the model-orientation candidate set after the model-orientation candidate set is filtered.
8. The computing system of claim 3, wherein the updated sensing structure information defines an estimated area occupied by the object structure,
wherein the object recognition templates in the template candidate set respectively describe a set of object structure models,
wherein the at least one processing circuit is configured to: prior to calculating the set of error values associated with the object recognition templates in the template candidate set, filtering the template candidate set by: identifying one or more object recognition templates comprising one or more respective object structure models that do not substantially fit into the estimation region, and removing the one or more object recognition templates from the candidate set of templates, and
wherein the associated set of error values is calculated based on object recognition templates remaining in the template candidate set after the template candidate set is filtered.
9. The computing system of claim 2, wherein the at least one processing circuit is configured to: for at least one object recognition template in the candidate set of templates, adjusting a respective object structure description comprised in the object recognition template based on the updated sensed structure information.
10. The computing system of claim 9, wherein the respective object structure description of the at least one object recognition template describes a physical feature of the respective object structure described by the at least one object recognition template, and wherein the respective object structure description further includes pose information describing a pose of the physical feature, and
wherein the at least one processing circuit is configured to adjust the pose information based on the updated sensed structure information to increase the degree to which the physical features described by the at least one object recognition template match the updated sensed structure information.
11. The computing system of claim 1, wherein the at least one processing circuit is configured to: determining the one or more robot interaction locations as a plurality of gripping locations associated with gripping the container when the object is a container and when the object structure is a container structure, such that the plurality of gripping locations are determined based on the object type, the object type being a container type associated with the container.
12. The computing system of claim 11, wherein the at least one processing circuit is configured to: detecting a plurality of ridges protruding from a side surface of the container structure based on the second image information or the updated sensing structure information when the container structure includes the plurality of ridges such that the plurality of ridges are detected ridges on the side surface of the container structure,
wherein the container type associated with the container is determined based on the detected ridge on the side surface of the container structure.
13. The computing system of claim 11, wherein the at least one processing circuit is configured to determine the container type by:
performing a comparison between the updated sensed structure information and a template candidate set, wherein the template candidate set is a set comprising object identification templates describing container structures associated with different container types;
selecting an object recognition template from the template candidate set based on the comparison such that the object recognition template is the selected object recognition template, wherein the selected object recognition template represents a container type associated with the container,
wherein the at least one processing circuit is further configured to: determining the plurality of gripping positions based on the container rim structure when the selected object recognition template includes a container structure model describing at least a container rim structure.
14. The computing system of claim 13, wherein the at least one processing circuit is configured to determine a plurality of overhang distances associated with a plurality of respective locations along the receptacle bezel structure, wherein each overhang distance of the plurality of overhang distances is a distance that the end effector device can extend in an inward direction toward the receptacle structure under the receptacle bezel structure if the end effector device is present at a respective location of the plurality of locations,
wherein the at least one processing circuit is configured to select a plurality of gripping locations from the plurality of respective locations along the container rim structure based on the plurality of overhang distances.
15. The computing system of claim 11, wherein the at least one processing circuit is configured to determine whether the first image information or the second image information indicates the presence of a container lid, wherein the plurality of gripping locations is further determined based on whether the first image information or the second image information indicates the presence of the container lid.
16. The computing system of claim 15, wherein the at least one processing circuit is configured to: determining the plurality of gripping positions based on the container lid structure when the selected object identification template includes a container structure model describing at least one container lid structure.
17. The computing system of claim 2, wherein the at least one processing circuit is configured to, when an additional object is in the camera field of view and the additional object is rotationally symmetric:
receiving additional image information representing an object structure of the additional object; and
generating additional object recognition templates for the set of object recognition templates based on the additional image information, wherein the additional object recognition templates are generated based on one corner of the object structure of the additional object and not based on the remaining corners of the object structure of the additional object.
18. The computing system of claim 1, wherein the first image information is associated with the camera having a first distance from the object and the second image information is associated with the camera having a second distance from the object that is less than the first distance.
19. A non-transitory computer-readable medium having instructions that, when executed by at least one processing circuit of a computing system, cause the at least one processing circuit to:
receiving, at the computing system, first image information, wherein the computing system is configured to communicate with: (i) a robot having a robotic arm and an end effector device arranged at or forming an end of the robotic arm, and (ii) a camera mounted on the robotic arm and having a camera field of view, wherein the first image information is for representing at least a first portion of an object structure associated with an object in the camera field of view, wherein the first image information is generated by the camera when the camera is in a first camera pose in which the camera is directed at the first portion of the object structure;
generating or updating sensed structure information representing the object structure associated with the object based on the first image information;
identifying an object corner associated with the object structure based on the sensed structure information;
outputting one or more camera placement movement commands that, when executed by the robot, cause the robotic arm to move the camera to a second camera pose in which the camera is pointed at the object corner;
receiving second image information representing the object structure, wherein the second image information is generated by the camera while the camera is in the second camera pose;
updating the sensed structure information based on the second image information to generate updated sensed structure information;
determining an object type associated with the object based on the updated sensed structure information;
determining one or more robot interaction locations based on the object type, wherein the one or more robot interaction locations are one or more locations for interaction between the end effector device and the object; and
outputting one or more robot interaction movement commands for causing the interaction at the one or more robot interaction locations, wherein the one or more robot interaction movement commands are generated based on the one or more robot interaction locations.
20. A method performed by a computing system, the method comprising:
receiving, at the computing system, first image information, wherein the computing system is configured to communicate with: (i) a robot having a robotic arm and an end effector device arranged at or forming an end of the robotic arm, and (ii) a camera mounted on the robotic arm and having a camera field of view, wherein the first image information is for representing at least a first portion of an object structure associated with an object in the camera field of view, wherein the first image information is generated by the camera when the camera is in a first camera pose in which the camera is directed at the first portion of the object structure;
generating or updating sensed structure information representing the object structure associated with the object based on the first image information;
identifying an object corner associated with the object structure based on the sensed structure information;
outputting one or more camera placement movement commands that, when executed by the robot, cause the robotic arm to move the camera to a second camera pose in which the camera is pointed at the object corner;
receiving second image information representing the object structure, wherein the second image information is generated by the camera while the camera is in the second camera pose;
updating the sensed structure information based on the second image information to generate updated sensed structure information;
determining an object type associated with the object based on the updated sensed structure information;
determining one or more robot interaction locations based on the object type, wherein the one or more robot interaction locations are one or more locations for interaction between the end effector device and the object; and
outputting one or more robot interaction movement commands for causing the interaction at the one or more robot interaction locations, wherein the one or more robot interaction movement commands are generated based on the one or more robot interaction locations.
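For readability, the two-pose scanning flow recited in claims 19 and 20 can be summarized as the Python sketch below. This is a minimal illustrative sketch, not the claimed implementation: the camera/robot callables (`capture`, `move_camera_to_corner`) and the helpers (`classify_object_type`, `plan_grips`) are assumed stand-ins supplied by the caller, and the corner heuristic is only one possible way to pick an object corner from sensed points.

```python
# Minimal sketch of the two-pose capture flow of claims 19-20 (illustrative only).
# The camera/robot callables and the classify/grip-planning helpers are assumed
# stand-ins, not APIs disclosed in this publication.
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]

@dataclass
class SensedStructure:
    """Sensed structure information, here a simple merged point list."""
    points: List[Point] = field(default_factory=list)

    def merge(self, new_points: Sequence[Point]) -> None:
        self.points.extend(new_points)

def identify_corner(structure: SensedStructure) -> Point:
    """Crude corner estimate: the point farthest from the centroid."""
    n = len(structure.points)
    cx = sum(p[0] for p in structure.points) / n
    cy = sum(p[1] for p in structure.points) / n
    cz = sum(p[2] for p in structure.points) / n
    return max(structure.points,
               key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2)

def plan_robot_interaction(
    capture: Callable[[str], List[Point]],           # returns image information as 3D points
    move_camera_to_corner: Callable[[Point], None],  # issues camera placement movement commands
    classify_object_type: Callable[[SensedStructure], str],
    plan_grips: Callable[[str, SensedStructure], List[Point]],
) -> List[Point]:
    structure = SensedStructure()
    structure.merge(capture("first_camera_pose"))    # first image information
    corner = identify_corner(structure)              # object corner
    move_camera_to_corner(corner)                    # second camera pose aimed at the corner
    structure.merge(capture("second_camera_pose"))   # second image information
    object_type = classify_object_type(structure)    # e.g., a container type
    return plan_grips(object_type, structure)        # robot interaction locations
```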
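One possible reading of the template comparison in claim 13 is sketched below, with an assumed template format (nominal container dimensions) and a simple bounding-box distance metric; the comparison used by the claimed system is not limited to this.

```python
# Illustrative template selection for claim 13 (assumed template format and metric).
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

@dataclass
class ContainerTemplate:
    container_type: str
    size: Tuple[float, float, float]  # nominal length, width, height

def bounding_box_size(points: Sequence[Point]) -> Tuple[float, float, float]:
    xs, ys, zs = zip(*points)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))

def select_template(points: Sequence[Point],
                    candidates: List[ContainerTemplate]) -> ContainerTemplate:
    """Pick the candidate whose nominal size best matches the sensed extents."""
    measured = bounding_box_size(points)
    return min(candidates,
               key=lambda t: sum((m - s) ** 2 for m, s in zip(measured, t.size)))
```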
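The selection of gripping locations by overhang distance in claim 14 could look like the following sketch, where `overhang_at` is a hypothetical clearance query and `min_clearance`/`count` are illustrative parameters not taken from the publication.

```python
# Sketch for claim 14: rank candidate rim locations by overhang distance.
# `overhang_at`, `min_clearance`, and `count` are illustrative assumptions.
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]

def select_gripping_locations(
    rim_locations: Sequence[Point],
    overhang_at: Callable[[Point], float],  # how far the end effector can reach inward here
    min_clearance: float,
    count: int,
) -> List[Point]:
    scored = [(overhang_at(loc), loc) for loc in rim_locations]
    usable = [(d, loc) for d, loc in scored if d >= min_clearance]
    usable.sort(key=lambda pair: pair[0], reverse=True)
    return [loc for _, loc in usable[:count]]
```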
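For claim 17, one way to exploit rotational symmetry is to scan a single corner and replicate it about the symmetry axis; the four-fold symmetry about the z axis below is an assumption made only for illustration.

```python
# Sketch for claim 17: build a full template from one scanned corner of a
# rotationally symmetric object by rotating the points about the z axis.
import math
from typing import List, Tuple

Point = Tuple[float, float, float]

def template_from_one_corner(corner_points: List[Point], folds: int = 4) -> List[Point]:
    template: List[Point] = []
    for k in range(folds):
        angle = 2.0 * math.pi * k / folds
        c, s = math.cos(angle), math.sin(angle)
        template.extend((c * x - s * y, s * x + c * y, z) for x, y, z in corner_points)
    return template
```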
CN202110433348.XA 2019-12-12 2020-12-09 Method and system for object detection or robot interactive planning Active CN113043282B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962946973P 2019-12-12 2019-12-12
US62/946,973 2019-12-12
PCT/US2020/063938 WO2021119083A1 (en) 2019-12-12 2020-12-09 Method and computing system for performing object detection or robot interaction planning based on image information generated by a camera
CN202080005101.1A CN113365786A (en) 2019-12-12 2020-12-09 Method and computing system for performing object detection or robot interaction planning based on image information generated by a camera

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202080005101.1A Division CN113365786A (en) 2019-12-12 2020-12-09 Method and computing system for performing object detection or robot interaction planning based on image information generated by a camera

Publications (2)

Publication Number Publication Date
CN113043282A (en) 2021-06-29
CN113043282B (en) 2022-03-29

Family

ID=76523531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110433348.XA Active CN113043282B (en) 2019-12-12 2020-12-09 Method and system for object detection or robot interactive planning

Country Status (1)

Country Link
CN (1) CN113043282B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160221187A1 (en) * 2013-03-15 2016-08-04 Industrial Perception, Inc. Object Pickup Strategies for a Robotic Device
WO2015194866A1 (en) * 2014-06-17 2015-12-23 Yujin Robot Co., Ltd. Device and method for recognizing location of mobile robot by means of edge-based readjustment
WO2015194865A1 (en) * 2014-06-17 2015-12-23 Yujin Robot Co., Ltd. Device and method for recognizing location of mobile robot by means of search-based correlation matching
KR20170031252A (en) * 2014-08-29 2017-03-20 X Development LLC Combination of stereo and structured-light processing
US20190033837A1 (en) * 2016-01-14 2019-01-31 Magazino Gmbh Robot to pick up and transport objects and method using such a robot
JPWO2018092834A1 (en) * 2016-11-17 2019-02-28 Dai Nippon Printing Co., Ltd. Lighting device
WO2019097004A1 (en) * 2017-11-17 2019-05-23 Ocado Innovation Limited Control device and method for a robot system for locating objects and calculating appropriate grasp points for each object
US20190248003A1 (en) * 2018-01-04 2019-08-15 X Development Llc Grasping of an object by a robot based on grasp strategy determined using machine learning model(s)
CN111890371A (en) * 2019-03-29 2020-11-06 Mujin Technology Method for verifying and updating calibration information for robot control and control system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHANSOO PARK, JAE-BOK SONG: "Illumination change compensation and extraction of corner feature orientation for upward-looking camera-based SLAM", 2015 12th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI) *
JIANG ZETAO: "Research on Methods of Reconstructing Three-Dimensional Object Shapes from Image Sequences", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117021117A (en) * 2023-10-08 2023-11-10 University of Electronic Science and Technology of China Mobile robot man-machine interaction and positioning method based on mixed reality
CN117021117B (en) * 2023-10-08 2023-12-15 University of Electronic Science and Technology of China Mobile robot man-machine interaction and positioning method based on mixed reality

Also Published As

Publication number Publication date
CN113043282B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
CN113365786A (en) Method and computing system for performing object detection or robot interaction planning based on image information generated by a camera
JP6805465B2 (en) Box positioning, separation, and picking using sensor-guided robots
JP7433609B2 (en) Method and computational system for object identification
JP5558585B2 (en) Work picking device
US11958202B2 (en) Method and computing system for performing container detection and object detection
CN113043282B (en) Method and system for object detection or robot interactive planning
US11900652B2 (en) Method and computing system for generating a safety volume list for object detection
JP7272568B2 (en) Method and computational system for performing robot motion planning and repository detection
CN111191083B (en) Method and computing system for object identification
CN113219900B (en) Method and computing system for performing motion planning based on camera-generated image information
CN113361651B (en) Method and computing system for generating safe space list for object detection
CN112288165B (en) Method and computing system for performing container detection and object detection
CN115703238A (en) System and method for robotic body placement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant