WO2019127306A1

WO2019127306A1 - Template-based image acquisition using a robot

Info

Publication number: WO2019127306A1
Application number: PCT/CN2017/119648
Authority: WO
Inventors: Xinmin Liu
Original assignee: Beijing Airlango Technology Co., Ltd.
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2019-07-04

Abstract

A technology for allowing a robot to automatically capture a picture of a target object that matches a picture template is disclosed. A user can specify a picture template that includes a position template describing a desired picture layout, such as a desired relative size of the target object in the image and its relative position in the image. The robot can capture a picture at its current location and compare the captured image with the position template. If the image does not match the position template, the robot can determine a location change and adjust its location to a new location to take another test image. The adjustment can be performed repeatedly until the test image matches the picture template.

Description

TEMPLATE-BASED IMAGE ACQUISITION USING A ROBOT

BACKGROUND

Robots, such as drones, have been used for aerial photography and videography to reduce cost and/or risk for pilot and crew. Drones have also been used to film sporting events due to their greater freedom of movement than cable-mounted cameras. Taking images and videos using these existing robots, however, requires human operations to move the robot to a certain location and to start the image capturing action.

Requiring human operations greatly limit the applications of the robots in image capturing. For example, a user might want to take a picture or a video of himself but does not want the picture or video to show him holding the robot controller in his hands

The disclosure made herein is presented with respect to these and other consideration.

SUMMARY

Technologies are described herein for template-based image acquisition by a robot without human intervention. A robot can receive an input from a user specifying a picture template that includes a position template describing a desired relative size of a target object in the image and its relative position in the image. The robot can then capture an image showing the target object and compare the image with the picture template. Based on the comparison, the robot can determine whether the image matches the picture template. If the image matches the picture template, the robot can output the image as the final image of the target object. Alternatively, the robot can capture a second image with a higher resolution as the output image.

If the image does not match the picture template, the robot can estimate a target location for the robot so that an image taken at the target location by the robot matches the picture template. The robot can then move to the target location and take a second image of the target object. The robot can then compare the second image with the picture template to determine if there is a match. When there is no match between the second image and the picture template, the robot can repeat the above process; otherwise the second image or a new image taken at the current location by the robot can be used as the output image.

The picture template can further include a gesture template that describes a desired gesture of the target object in the picture. In this configuration, determining a match between a captured image and the picture template can further include determining a match between the captured target object and the gesture template. By utilizing the techniques described herein, the robot can capture pictures of a target object with desired configuration without human intervention.

It should be appreciated that the above-described subject matter can also be implemented as a computer-controlled apparatus, a computer process, a computing system, or as an article of manufacture such as a computer-readable storage medium. These and various other features will be apparent from a reading of the following Detailed Description and a review of the associated drawings.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended that this Summary be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGURE 1 is a system architecture diagram showing aspects of an illustrative operating environment for the technologies disclosed herein for template based image acquisition, according to one configuration disclosed herein;

FIGURE 2A is a diagram showing examples of position template, according to one configuration disclosed herein;

FIGURE 2B is a diagram illustrating a gesture template defined by utilizing a skeleton of a target object, according to one configuration disclosed herein;

FIGURE 3A is a diagram illustrating the matching operation between a position template and a test image of the target object, according to one particular configuration disclosed herein;

FIGURE 3B is a diagram showing aspects of moving the robot according to the position matching results, according to one particular configuration disclosed herein;

FIGURE 3C is a diagram illustrating the matching operation between a gesture template and an object image of a test image, according to one particular configuration disclosed herein;

FIGURE 4 is a diagram showing an illustrative user interface for providing feedback to a user regarding the image acquisition, according to one configuration disclosed herein;

FIGURE 5 is a flow diagram showing a routine that illustrates aspects of a method for template based image acquisition, according to one configuration disclosed herein;

FIGURE 6 is a flow diagram showing a routine that illustrates aspects of a method of adjusting the position of the robot, according to one configuration disclosed herein;

FIGURE 7 is a flow diagram showing a routine that illustrates aspects of a method of determining the robot’s position using a front facing camera, according to one configuration disclosed herein;

FIGURE 8 is a flow diagram showing a routine that illustrates aspects of a method of determining the robot’s position using a downward facing camera, according to one configuration disclosed herein;

FIGURE 9 is a flow diagram showing a routine that illustrates aspects of a method of determining whether an object image matches a gesture template, according to one configuration disclosed herein; and

FIGURE 10 is an architecture diagram showing an illustrative hardware architecture for implementing a robotic device that can be utilized to implement aspects of the various technologies presented herein.

DETAILED DESCRIPTION OF THE EMBODIMENTS

The following detailed description is directed to technologies for capturing an image of a target object by a robot based on a picture template. The robot can have access to a user specified or selected picture template. The picture template can define a position template describing the relative size and position of the target object in the captured image. The robot can capture a first image of the target object at its current location. The first image can then be compared with the picture template to determine whether the first image matches the picture template. If it is determined that the captured image matches the picture template, the robot can take a picture of the target object again at its current location, for example, by using a high-resolution camera to obtain the output image.

If the first image does not match the picture template, the robot can determine a position adjustment based on its current position and a target position where an image that matches the picture template can be captured. The robot can then adjust its position to be close to the target position. Once the adjustment is made, a second image of the target object can be taken, and the second image and the picture template can be compared to determine if there is a match. The above process can be repeated until a desired image is obtained.

In addition to the position template, the picture template can also include a gesture template, which describes the desired gesture of the target object in the obtained image. For instance, if the target object is an individual, the gesture template can be defined using a skeleton structure consisting of multiple joint points on a human body. The gesture template can be defined by the relative position of the joint points. If a user defines or selects a gesture template, the robot can also determine if the gesture of the object in the obtained image matches the gesture template. If not, the robot can provide a feedback to the user to indicate that the object’s gesture does not match the gesture template. Additionally, the robot can present a user interface to the user to indicate the areas that cause the mismatch thereby allowing the user to adjust his/her gesture.

When the gesture of the object in the image matches the gesture template and the image matches the position template, the image can be output as the desired image or a new image can be taken as the desired image. Additional details regarding the various aspects described briefly above will be provided below with regard to FIGURES 1-10.

While the subject matter described herein is presented in the general context of program modules that execute on one or more computing devices, those skilled in the art will recognize that other implementations can be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will also appreciate that aspects of the subject matter described herein can be practiced on or in conjunction with other computer system configurations beyond those described herein, including multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, handheld computers, personal digital assistants, e-readers, mobile telephone devices, tablet computing devices, special-purposed hardware devices, network appliances, and the like.

In the following detailed description, references are made to the accompanying drawings that form a part hereof, and which are shown by way of illustration, specific aspects or examples. Referring now to the drawings, in which like numerals represent like elements throughout the several figures, aspects of a computing system and methodology for subscribing, receiving and processing events will be described.

Turning now to FIGURE 1, details will be provided regarding an illustrative operating environment 100 for the technologies disclosed herein for template based image capturing. As shown in FIGURE 1, a robot 102 can be configured to track a target object 104 and to take a picture of the target object 104. The target object 104 can be any real-world objects, such as a pedestrian, a vehicle, or an animal. The robot 102 can be a machine capable of carrying out a series of actions autonomously. In one example, the robot 102 can be an unmanned aerial vehicle ( “UAV” ) , also referred to as a drone. The UAV is an aircraft without a human pilot aboard. The flight of UAVs may operate with various degrees of autonomy: either under remote control by a human operator, or fully or intermittently autonomously. The robot 102 can include one or more motors 116 configured to move the robot along various directions. By way of example, and not limitation, the movement of the robot 102 can include back and forth, left and right, up and down, and rotation along an axis in the three-dimensional space.

The robot 102 can be equipped with one or more cameras 110. A camera 110 is an optical instrument for recording or capturing images 126 of the target object 104. The images 126 may be individual still photographs or sequences of images constituting videos or movies. The camera 110 can also be configured to provide the depth information of the images 126 it captures or enable the depth information to be derived. For example, the camera 110 can be a stereo camera with two or more lenses with a separate image sensor for each lens. In other examples, the camera 110 can be a depth camera such as a ranging camera, a flash LiDar, a time-of-flight ( “ToF” ) camera, or a RGB-D camera. In addition, the cameras 110 can also include a front facing camera and/or a downward facing camera used to estimate the position of the robot 102.

The captured images 126 can be stored in a storage device 112 of the robot 102. A template matching module 140 can compare the captured image 126 with a picture template 132 that defines a desired layout of the captured image 126. If the template matching module 140 determines that the captured image 126 matches the picture template 132, the robot 102 can save the image 126 as the final output image. Alternatively, the robot 102 can capture another image 126 at its current location to obtain an image with higher quality, such as by using a high-resolution camera 110 or by using the same camera but with a setting for a high quality picture.

According to one configuration, the picture template 132 can include a position template 128 and a gesture template 130. The position template 128 can describe the desired relative position and size of the target object in the image 126, and the gesture template 130 can describe the desired gesture of the target object 104 in the image 126. Correspondingly, the template matching module 140 can include a position matching module 114 to determine whether the images 126 matches the position template 128, and a gesture matching module 124 to determine whether the image of the object in the image 126 has a gesture that matches the gesture template 130. Additional details regarding the picture template 132 and the template matching are provided below with respect to FIGURES 2A, 2B, 3A, 3C, 5 and 9.

If the captured image 126 does not match the picture template 132, the robot 102 might need to move to a new location in order to capture an image that matches the picture template 132. The new location of the robot 102 can be estimated by comparing the captured image 126 with the picture template 132, and more particularly, the position template 128, to determine a position change of the robot 102 so that at the new location the robot 102 can capture an image that matches the gesture template 130. The calculated position change can be utilized to drive the motors 116 of the robot 102 to move to the new location. Additional details regarding calculation of the position change of the robot 102 are provided below with respect to FIGURES 3A, 3B and 6-8.

It should be noted that the target object 104 might move. For example, the target object 104 can be a human being and he/she might walk around when the images 126 are being taken. In such scenarios, a tracking module 142 of the robot 102 can utilize the captured images 126 taken by the same or different cameras to track the target object 104 so that the robot 102 can follow the target object 104 as the target object 104 moves. For example, the object tracking can be performed by the robot 102 capturing a first image showing the target object 104. Object data of the target object 104 can be obtained or otherwise determined by the robot 102. The object data can include an object image that comprises a portion of the first image showing at least a portion of the target object 104, and a feature template of the object image. The feature template of the object image can be calculated and stored at the robot 102. The object data can further include position information including a target distance between the target object and the robot. The robot 102 can be configured to move along with the target object 104 when the target object moves, and to keep the distance between the robot 102 and the target object 104 to be close to the target distance.

The robot can be further configured to capture a second image showing the target object 104. A two-dimensional matching ( “2D matching” ) can be performed by searching in the second image for a best match of the object image. The search can be performed by comparing the object image with multiple test object images obtained by extracting content contained in a search window applied onto the second image. The search window can have the same size as the object image and can be applied at different location of the second image. The match can be measured using the feature template of the object image and a feature vector calculated from each of the test object images. The test object image having the best match with the object image can be determined to be the matched test object image. By comparing the location of the object image in the first image and the location of the matched test object image in the second image, horizontal and vertical movement of the target object can be determined.

The robot 102 can be configured to further determine its current distance to the target object 104 when the second image is taken. The distance can be determined by using images taken by the camera or other distance or depth determination mechanisms equipped with the robot. A depth change ratio can then be calculated based on the distance between the robot and the target object when the first image was taken and the determined current distance. The depth change ratio can be utilized to improve the tracking accuracy. Specifically, a bounding box can be generated by scaling the search window that identifies the matched test object image in the second image according to the depth change ratio. An updated object image can be generated by extracting the content of the second image that are located inside the scaled bounding box. Based on the updated object image, the feature template can also be updated. The robot 102 can move according to the calculated horizontal and vertical movement as well as the distance/depth change of the target object. If further tracking is to be performed, a new image can be taken and the above procedure can be repeated using the updated object image and feature template. Additional details regarding the tracking can be found in PCT Patent Application No. PCT/CN2017/095902, filed on August 3, 2017, and entitled “Object Tracking Using Depth Information” , which is herein incorporated by reference in its entirety.

The robot 102 can be in communication with a user computing device 106 through a network 108. The user computing device 106 can be a PC, a desktop workstation, a laptop or tablet, a notebook, a personal digital assistant ( “PDA” ) , an electronic book reader, a smartphone, a game console, a set-top box, a consumer electronics device, a wearable computing device, a server computer, or any other computing device capable of communicating with the robot 102 through the network 108. The network 108 may include one or more wireless networks, such as a Global System for Mobile Communications ( “GSM” ) network, a Code Division Multiple Access ( “CDMA” ) network, a Long Term Evolution ( “LTE” ) network, or any other type of wireless network.

For example, a user 122 can utilize the user computing device 106 to send a control signal 118 to the robot 102, such as to specify the target object 104, to select or define the gesture template 130, to request the start of the template based image capturing or to request cancelation of the image capturing. The robot 102 can also transmit feedback information 120 through a feedback module 134 to the user computing device 106. The feedback information 120 can include any information related to the template based image capturing of the target object 104, which can include, but is not limited to, indication that the gesture template 130 is not matched, indication of the area where the gesture template 130 is not matched, and/or the final output images 126 that matches the selected picture template 132. Additional details regarding the feedback information sent to the user 122 are provided below with respect to FIGURE 4.

FIGURE 2A illustrates an exemplary position template 128. As shown in FIGURE 2A, the position template 128 can include an object image 204 representing the target object 104. The position template 128 can define the size of the object image 204 by specifying the object image’s width W and height H. In addition, the position template 128 can also describe the relative location of the object image 204 within the position template 128. In the example shown in FIGURE 2A, the relative location is described as the vertical and horizontal distances of the center point O1 of the object image 204 to the upper left corner of the position template 128, denoted as (V, H) . In one configuration, the position template 128 is normalized so that the size of the position template 128 can be represented by its aspect ratio, such as 1: 1.5 as shown in FIGURE 2A. In this configuration, the value of W, L, V and H are measured with respect to the normalized size of the position template 128.

In addition to the relative size and location of the object image 204, the position template 128 can also include an object image type 202 describing the type of the object image 204. FIGURE 2A illustrates several examples of the object image type, such as a full body template 202A, a medium shot template 202B and a close-up template 202C. The template type 202 can be utilized to identify object image from the captured image 126 when determining whether the captured image 126 matches the position template 128.

FIGURE 2B is a diagram illustrating a gesture template 130 defined by utilizing a skeleton of a target object 104, or more specifically, a human being, according to one configuration disclosed herein. As shown in FIGURE 2B, the gesture template 130 can include a skeleton consisting of a set of joint points 206A-206K of a human body. In the following, the joint points 206A-206K may be referred to individually as a joint point 206 or collectively as the joint points 206.

The gesture template 130 can include multiple object parts 208A-208L, each object part 208 being defined by two or more joint points 206 and the portion connected through those joint points 206. For example,

joint points

206B, 206E and 206F form a part 208A that represents the torso of the body. Similarly,

joint points

206C and 206D can form a part 208B representing the upper left arm of the body;

joint points

206D and 206E can form a part 208C representing the lower left arm of the body. Under this definition, the gesture of the target object can be described using the angles formed by various pairs of object parts in the skeleton. For example, a gesture template where the object 104 raises his entire left arm to a horizontal position can be described as an angle of 90 degrees formed by

parts

208A and 208B and an angle of 180 degrees formed by

parts

208B and 208C.

It will be appreciated by one skilled in the art that although FIGURE 2B only shows a gesture template for a human being object, gesture templates 130 for other types of target objects can be defined similarly. For example, for an animal or a plant object, joint points can be identified on the body of the object according to its shape or structure, and parts can be formed by connecting two or more joint points. Various gestures of the object can be defined using a set of angles formed by different object parts of the object body.

FIGURE 3A is a diagram illustrating the matching operation between a position template 128 and a test image 302 of the target object 104, according to one particular configuration disclosed herein. As briefly describe above, the template matching module 140 can include a position matching module 114 to perform the position matching. Inputs to the position matching module 114 can include the position template 128 and a test image 302 that is obtained by the robot 102 using its camera 110 at its current location. According to one configuration, both the position template 128 and the test image 302 are normalized before the matching operation.

The position matching module 114 can extract an object image 304 from the test image 302 using any object detection technique known in the art. The extraction can be based on the object image type. If the object image type is a full body type, the full body of the object can be extracted; if the object image type is a medium shot type, the upper half of the object can be extracted; and if the object image type is a close-up type, then the face portion of the object can be extracted. Once the object image 304 is extracted, the size of the extracted object image 304 and the relative location of the object image 304 in the test image 302 can be determined. In FIGURE 3A, W2 and L2 represent the width and height of the extracted object image 304, respectively; and H2 and V2 represent the horizontal and vertical distances of the center of the extracted object image 304 O2 to the upper left corner of the test image 302, respectively.

By comparing the location of the extracted object image 304 in the test image 302 and the location of the object image 204 in the test image 302, the horizontal and vertical differences ΔH and ΔV can be calculated. The differences ΔH and ΔV can then be compared with a threshold to determine if the extracted object image 304 matches the template image 128 in terms of its location within the test image 302. Likewise, the size of the extracted object image 304 (W2, L2) can be compared with the size of the object image 204 (W, L) to determine if they are close enough.

If the difference between the size of the object image 204 and the extracted object image 304 is smaller than a threshold, and the difference between the relative locations of the object image 204 and the extracted object image 304 is also smaller than a threshold, the position matching module 114 can determine that test image 302 matches the position template 128. Otherwise, the position matching module 114 can determine that the test image 302 does not match the position template 128, and can further estimate a new location of the robot 102 so that an image captured by the robot 102 at the new location matches the location template.

It is known to those skilled in the art that given the size of the test image 302, the difference between the size of the object image 204 and the extracted object image 304, the difference between the relative locations of the object image 204 and the extracted object image 304, the distance of the camera 110 to the target object 104 and the focal length of the camera 110, the movement of the camera 110/robot 102 along horizontal, vertical and depth direction can be determined, which are denoted as (ΔH, ΔV, ΔD) in FIGURE 3A. The movement of the robot 102 (ΔH, ΔV, ΔD) can be used to drive the motors 116 to move the robot 102 to a new location. FIGURE 3B illustrates the robot 102 moving from the current location to a new location. Additional details regarding moving the robot 102 to the estimated new location are provided below in FIGURES 6-8.

It should be understood that the estimation of the new location for the robot 102 and the movement of the robot 102 shown in FIGURES 3A and 3B is based on the assumption that the target object 104 remains at his location. In the event that the target object 104 moves during the process, the tracking module 142 can perform object tracking to calculate additional movement of the robot 102 in order to move the robot 102 to a location where an image that matches the position template 128 can be taken. Object tracking can be performed as described above with additional details in PCT Patent Application No. PCT/CN2017/095902, filed on August 3, 2017, and entitled “Object Tracking Using Depth Information” , which is herein incorporated by reference in its entirety, or any other object tracking methods known in the art.

FIGURE 3C is a diagram illustrating the matching operation between a gesture template 130 and an object image 304 extracted from a test image 302, according to one particular configuration disclosed herein. As shown in FIGURE 3C, the extracted object image 304 and the gesture template 130 can be input to the gesture matching module 124 for gesture matching analysis. The gesture matching module 124 can analyze the extracted object image 304 to extract a skeleton 320 by identifying joint points 306 corresponding to the joint points 206 of the gesture template 130, and determining parts 308 formed by different sets of joint points 306 that correspond to the parts 208 in the gesture template 130.

In addition, the gesture matching module 124 can divide the gesture template 130 into various object regions 312A-312B (which may be referred to herein individually as an object region 312 or collectively as the object regions 312) , each region 312 containing multiple joint points 206 and object parts 208. One joint point 206 or one object part 208 may be included in one or more object regions 312. In FIGURE 3C, region 312A represent the left arm of the object 104 and region 312B represents the right arm of the object 104. To compare the skeleton 320 with the gesture template 130, the gesture matching module 124 can identify corresponding regions 310 in the extracted skeleton 320. In the example shown in FIGURE 3C, the region 310A identified from the skeleton 320 corresponds to the objection region 312A in the gesture template 130 and region 310B corresponds to the objection region 312B. The similarity between the skeleton 320 and the gesture template 130 can be measured by compare each pair of corresponding regions and if the difference between each pair of regions is less than a threshold, the gesture matching module 124 can determine that the skeleton 320 in the extracted object image 304 matches the gesture template 130.

Specifically, in the example shown in FIGURE 3C, in order to compare the pair of

regions

312A and 310A, an angle formed by the

object parts

208B and 208C in the object region 312A can be calculated and compared with the angle formed by corresponding

parts

308B and 308C in the region 310A. In region 312A, the angle has a value of 180 degrees, whereas the angle becomes 270 degrees in region 310A. The difference is thus 90 degrees exceeding a threshold of 15 degrees, for example. Region 312B and region 310B can be analyzed similarly. Because there is at least one region whose angle difference exceeds the threshold. The extracted object image 304 does not match the gesture template 130. The gesture matching module 124 can indicate such a mismatch to the feedback module 120.

It should be understood that for simplicity, FIGURE 3C merely shows two object regions 312. In actual implementation, more object regions 312 can be identified from the gesture template 130. In addition, measuring the similarity between two corresponding regions using angles form by parts of the object 104 is merely for illustration purpose, and should not be construed as limiting. Various other ways can be used to quantify the difference. Furthermore, in addition to indicating mismatch, the gesture matching module 124 can also generate detailed feedback information regarding how the skeleton 320 and the gesture template 130 are different.

For example, the gesture matching module 124 can generate a heat map indicating the area where the mismatch occurs and even a suggestion for the object 104 to change his gesture to conform to the gesture template 130. In FIGURE 3C, the gesture matching module 124 can generate data showing that the mismatch occurs at the

region

310A and 310B, and that the target object 104 should straighten his both arms in order to conform to the gesture template 130. Such feedback information can be sent to the feedback module 120 for further processing and/or presentation. For example, the feedback module 120 can indicate the mismatch by causing a LED light on the robot 102 to flash or change its color. The feedback module 120 can also include the gesture change suggestion in the feedback information 120 sent to the user computing device 106 for presentation. Details regarding presenting the gesture change suggestion are provided below in FIGURE 4.

FIGURE 4 is a diagram showing an illustrative user interface 400 for presenting feedback information 120 to a user 122 regarding the image acquisition, and in particular, the gesture mismatch, according to one configuration disclosed herein. In one implementation, the user interface 400 can be displayed on the user computing device 106 associated with the user 122. It should be understood that the user interface 400 can also be displayed on other device to other users.

As shown in FIGURE 4, the user interface 400 shows side by side a user interface control 402 showing the selected gesture template 130, and a user interface 404 for presenting the user’s current gesture. The regions where the user’s gesture does not match the gesture template 130 are highlighted such as by drawing a circle 412 around the region, changing the color of the region, adding a shade to the area, and so on. The user interface 400 also include a text user interface control 406 providing recommendation for the user to conform his gesture to the gesture template 130. In addition, the user interface 400 can provide the option of changing the template 130 through a user interface control 408, and the option of canceling the image capturing task through a user interface control 410. It should be noted that the user interface 400 shown in FIGURE 4 is for illustration only, and should not be construed as limiting. Various other ways of presenting the feedback information 120 and controlling the image capturing can be utilized.

FIGURE 5 is a flow diagram showing a routine 500 that illustrates aspects of a method for template based image capturing, according to one configuration disclosed herein. Routine 500 can be performed by the template matching module 140 or any other module of the robot 102. It should be appreciated that the logical operations described herein with respect to FIGURE 5, and the other FIGURES, can be implemented (1) as a sequence of computer-implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system.

The implementation of the various components described herein is a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations described herein are referred to variously as operations, structural devices, acts, or modules. These operations, structural devices, acts, and modules can be implemented in software, in firmware, in special-purpose digital logic, and any combination thereof. It should also be appreciated that more or fewer operations can be performed than shown in the FIGURES. and described herein. These operations can also be performed in parallel, or in a different order than those described herein. Some or all of these operations can also be performed by components other than those specifically identified.

The routine 500 begins at operation 502, where the robot 102 can access a picture template 132 selected by the user 122. The picture template 132 can be pre-selected by the user 122 and retrieved by the template-matching module 140 from the storage 112. Alternatively, the picture template 132 can be selected by the user 122 through a user interface presented to the user when the image capturing starts. The picture template 132 can also include a template type to indicate whether the template is for a full body shot, a medium shot, a close-up, or other types. From operation 502, the routine 500 proceeds to operation 504 where the template-matching module 140 can obtain a test image 302 of the target object 104. The test image 302 can be taken by the robot 102 using a low-resolution camera 110. As will be discussed later, when the test image 302 is found to match the picture template 132, the robot 102 can take a picture of the target object 104 using a high-resolution camera 110.

In some implementation, after the test image 302 is taken, the robot 102 can identify the target object 104 in the test image 302. The identification can be performed by receiving a selection from among several automatically detected objects or by receiving an indication of the target object 104 directly from the user 122, such as through the user 122 drawing a bounding box manually around the target object 104.

The routine 500 then proceeds to operation 506, where the test image 302 is compared with the picture template 132. As discussed above, the picture template 132 can include one or more of a position template 128 and a gesture template 130. If the picture template 132 contain both templates, the test image 302 needs to match both the position template 128 and the gesture template 130. From operation 506, the routine 500 proceeds to operation 508, where it is determined whether the test image 302 matches the position template 128. According to one configuration, both the position template 128 and the test image 302 are normalized before the matching operation.

As discussed above, a position-matching module 114 of the template-matching module 140 can perform the position template matching. The position-matching module 114 can extract an object image 304 from the test image 302 using any object detection technique known in the art. The extraction can be based on the object image type, such as a full body type, a medium shot type, and a close-up type. Once the object image 304 is extracted, the size of the extracted object image 304 and the relative location of the object image 304 in the test image 302 can be determined. The size and location of the extracted object image 304 can then be compared with the size and location of the object image 204 in the position template 128. If at lease of the differences between the corresponding size and location of the object images is greater than a threshold, the test image 302 can be determined as not matching the gesture template 130, and the routine 500 proceeds to operation 510 where the position of robot 102 can be adjusted. Details about adjusting the position of the robot 510 are provided below with respect to FIGURE 6.

If none of the differences is greater than a threshold, the position-matching module 114 can declare that the test image 302 matches the position template 128, and the routine 500 proceeds to operation 512, where the template-matching module 140 can determine whether the picture template 132 includes a gesture template 130. If so, it means that the user 122 has selected a gesture template 130 and the routine 500 proceeds to operation 514 where the template matching module 140 can employ the gesture matching module 124 to perform gesture matching based on the gesture template 130. Details regarding the gesture matching are provided below with respect to FIGURE 9.

If the gesture matching module 124 determines that the gesture template 130 is matched, or if it is determined at operation 512 that no gesture template 130 was selected, the routine 500 proceeds to operation 520 where a new image can be captured by the robot at its current location. According to one configuration, the test image 302 can be taken using a low-resolution camera in order to reduce computational complexity thereby speed up the image capturing process. When it is determined that the current test image 302 matches the picture template 132, the new image can be taken by the robot 102 using a high-resolution camera to obtain a high quality picture for the user 122.

If it is determined at operation 516 that the test image 302 does not match the gesture template 130, the template matching module 140 can generate feedback information 120. The feedback information 120 can include, but is not limited to, the indication that the test image 302 does not match the gesture template 130, the area where the mismatch occurred, and/or any suggestion for achieving conformity with the gesture template 130. The routine 500 then proceeds to operation 518 where such feedback information can be sent to the feedback module 134. From operation 518 or operation 510, the routine 500 proceeds to operation 524 where it is determined whether the user 122 cancels the image capturing task. If not, the routine 500 proceeds to operation 522, where the robot 102 capture another test image 302. If necessary, such as when the target object 104 moves, the robot 102 can also perform object tracking to follow the target object 104. Once a new test image 302 is captured, the routine 500 proceeds to operation 506 where the process described above can be repeated for the new test image 302.

If it is determined at operation 524 that the user 122 cancels the task, or from operation 520 where a high-resolution image is captured, the routine 500 proceeds to operation 526, where it ends.

FIGURE 6 is a flow diagram showing a routine 600 that illustrates aspects of a method for adjusting the position of the robot 102, according to one configuration disclosed herein. Routine 600 can be performed by the template-matching module 140 or any other module of the robot 102. Routine 600 begins at operation 602, where the template-matching module 140 can estimate the horizontal, vertical and depth change of the robot 102, denoted as ΔH, ΔV and ΔD, respectively, in order to move the robot 102 to a new location. An image taken by the robot 102 at the new location should match the position template 128.

As discussed above, the movement (ΔH, ΔV, ΔD) of the camera 110 or robot 102 along horizontal, vertical and depth direction can be determined based on the size of the test image 302, the difference between the size of the object image 204 and the extracted object image 304, the difference between the relative locations of the object image 204 and the extracted object image 304, the distance of the camera 110 to the target object 104 and the focal length of the camera. Once the movements of the robot 102 are estimated, the routine 600 proceeds to operation 604, where the template matching module 140 can cause the robot 102 to move by the estimated amount of (ΔH, ΔV, ΔD) along the corresponding direction.

In order to determine whether the robot 102 has arrived at the new location, the robot 102 or the template-matching module 140 can constantly check the location of the robot 102 and compare its location with the new location. In order to determine the current location of the robot 102, the routine 600 proceeds from operation 604 to

operations

606, 608 and 610 where various location determination methods are employed. At operation 606, the robot 102 can determine its position based on images 126 taken by its front facing camera 110. In one configuration, the front facing camera can be a stereo camera. Specifically, the robot 102 can identify a set of feature points in the images captured by the front facing camera 110. The same set of feature points can be identified after the robot moves and be used to determine the real-world coordinate changes of the robot 102, thereby determining the position of the robot. Details regarding determining the robot’s location using front facing camera are provided below with respect to FIGURE 7.

At operation 608, the robot 102 can determine its location based on images 126 taken by its downward facing camera 110. The downward facing camera can be employed to determine the position of the robot 102 along the horizontal and depth directions. The robot 102 can identify a set of feature points in an image captured by the downward facing camera at a certain point of time. Another image taken at the next point of time can then be analyzed to find the corresponding feature points. The location change of these feature points in two images can then be utilized to determine the location change of the robot 102. Details regarding determining the robot’s location using downward facing camera 110 are provided below with respect to FIGURE 8. At operation 610, other methods for determining the robots position can be employed. For example, the robot’s position can be determined using a barometer, a GPS, and/or an inertial measurement unit (IMU) sensor.

It should be appreciated that not all of operations of 608, 609 and 610 are required and more operations may be added in addition to these operations. For example, if the robot 102 is not equipped with a downward facing camera, then operation 608 can be skipped. On the other hand, if the robot 102 has other mechanism of determining the position of the robot 102, additional operation may be added to include the position determining result by such a mechanism.

From

operations

606, 608 and 610, the routine 600 proceeds to operation 612, where the results of various position determination mechanisms can be aggregated to arrive at a final determination of the robot’s position. For example, Kalman filter approach can be used to fuse the position estimates from multiple sources. Firstly, based on IMU input, the position of the robot can be estimated upon the arrival of each IMU sample. Afterwards, the front-facing and/or downward-facing visual position estimation, which may become available periodically for every k IMU samples, can be used to update the estimated position, resulting in a more robust and accurate position estimate.

From operation 612, the routine 600 proceeds to operation 614, where it is determined whether the robot 102 has arrived at the desired location. If not, the routine 600 returns to operation 604, where the robot 102 makes additional position adjustment. If it is determined that the robot 102 has arrived at the desired position, the routine 600 proceeds to operation 616, where it ends.

FIGURE 7 is a flow diagram showing a routine 700 that illustrates aspects of a method for determining the robot’s position using a front facing camera 110 of the robot 102, according to one configuration disclosed herein. The routine 700 begins at operation 702, where the robot 102 can determine whether this is the first time to determine the position of the robot 102 using the front facing camera 110 after the test image 302 is found not matching the position template 128. If so, the routine 700 proceeds to operation 704 where an image taken by the front facing camera 110 before the robot 102 moves can be retrieved.

The routine 700 then proceeds to operation 706, where the robot 102 can identify a set of static feature points in the image. The set of feature points can be identified in the background area where the scene is generally static. The feature points can be identified using any technique known in the art, such as features from accelerated segment test ( “FAST” ) , binary robust independent elementary features ( “BRIEF” ) , oriented FAST and rotated BRIEF ( “ORB” ) , good features to track ( “EigenCorner” ) , scale-invariant feature transform ( “SIFT” ) , speed up robust features ( “SURF” ) , and/or other feature detectors and descriptors. If there are some objects in the image that are moving, these moving objects can be removed before the static feature points are identified.

Once the feature points are identified, the routine 700 proceeds to operation 708, where the real-world coordinates of the feature points are calculated. To calculate the coordinates of the feature points, assume the robot’s original position before the movement is the origin, that is, its coordinate is 0 along horizontal, vertical and depth directions. Under such a coordinate system, the coordinates of the feature points can be determined based on their respective positions in the image, the focal length of the front-facing camera 110, and other factors such as sample data from IMU onboard the robot.

Alternatively, and/or additionally, the position of the feature points and robot’s position, can be jointly estimated by combining factors such as a time series of IMU sensor data, two or more captured video frames corresponding to the time when IMU data are acquired, and/or depth information derived from stereo cameras. For example, feature points inside consecutive image frames can first be extracted, and corresponding feature points in different frames can be matched using a feature matching algorithm such as Kanade-Lucas-Tomasi ( “KLT” ) feature tracker. Afterwards, the matched feature points and IMU data can be used to formulate a constrained optimization problem and least squares solutions can be obtained to minimize the estimation error of feature point locations within different frames.

The feature point location estimations, however, might have drifting errors after the robot 102 moves for a while. When the drifting errors become large, the above described solution can be reset to perform the estimations from the beginning. In scenarios wherein the robot 102 returns to a location where it visited before, i.e. a spatial loop has occurred, the drifting problem can be solved without resetting the estimation. The spatial loop can be detected using algorithms such as bag-of-words by using different images captured at the same location by the robot 102 at different times. To correct the drifting error, the feature point positions and the robot position estimated from the previously captured images can be compared with the corresponding positions estimated from the current images at the same location. If there are large changes from the previously estimated positions to the currently estimated position, it means drifting has occurred. The estimation for the feature point locations and the robot location for current images can be set to the estimated position values from the previous matched images, thereby correcting the drifting errors.

In addition, the position estimation described above can be further improved by utilizing the stereo camera to estimate the coordinates of the features points before the optimization process starts. The estimated coordinates of the feature points can provide accurate initial points for the optimization process or any re-initialization of the optimization. The use of stereo camera to calculate depth information for feature points can improve the accuracy of dimension and scale estimation in the 3D space, and at the same time can substantially reduce the initialization time of the optimization process. Based on the estimated robot position, feature point positions and camera parameters, the attitude of the robot 102, or more precisely, the attitude of the camera that captured the images, can also be estimated. The attitude of robot or camera can be defined as an orientation of the robot or the camera described by three angles including roll, pitch and yaw angles. It should be appreciated that the above-described method of estimating the robot’s position and attitude is efficient enough to run onboard of the robot 102 without utilizing external devices, such as using the IMU sensors, stereo cameras, and computing chipsets that have been installed on the robot 102.

From operation 708, the routine 700 proceeds to operation 710, where a new image captured by the front facing camera 110 at the current location of the robot 102 can be obtained. If it is determined at operation 702 that the current iteration is not the first time to estimate the robot’s position using the front-facing camera since the robot’s initial move, the routine 700 proceeds from operation 702 directly to operation 710. In other words, the identification of the feature points in the initial image is skipped because those feature points have been identified before.

From operation 710, the routine 700 then proceeds to operation 712, where another set of static feature points can be identified in the new image using the similar method used at operation 706. The new set of static feature points can include some feature points that correspond to at least a portion of the feature points identified at operation 706, whereas the remaining feature points in the new set can be new feature points that did not appear in the old image.

From operation 712, the routine 700 proceeds to operation 714 where the current location of the robot 102 can be estimated using the feature points that appear in both the old image and the new image. Because the real-world coordinates of these overlapping feature points has been calculated in operation 708, their relative position in the new image can be used to infer the current real-world coordinate of the robot 102 based on their respective positions in the new image, the focal length of the front-facing camera 110, and other factors. Once the current position of the robot 102 is obtained, the routine 700 proceeds to operation 716, where it ends. It should be noted that at operation 714, after the real-world coordinates of the robot 102 are calculated, the coordinates of the new feature points that did not appear in the old image can also be calculated and be stored for use in the next round of estimation of the robot’s position.

FIGURE 8 is a flow diagram showing a routine 800 that illustrates aspects of a method for determining the robot’s position using a downward facing camera 110 of the robot 102, according to one configuration disclosed herein. The downward facing camera 110 can be employed to determine the position of the robot 102 along the horizontal and depth directions. The routine 800 begins at operation 802, where robot 102 can obtain an image captured by the downward facing camera 110 before the adjustment of the robot’s position at operation 510 in FIGURE 5. The routine 800 then proceeds to operation 804, where feature points are identified in the capture image using similar methods employed at operation 706 in FIGURE 7.

From operation 804, the routine 800 proceeds to operation 806 where a new image captured by the downward facing camera 110 at its current location can be obtained. With this new image, the routine 800 proceeds to operation 808 where a new set of feature points can be identified. It should be noted that there can be some feature points in the new image that correspond to the feature points in the initial image. The routine 800 then proceeds to operation 810, where these corresponding feature points can be utilized to estimate the horizontal and depth movement of the robot 102. This can be achieve based on a relationship that a ratio between the pixel movement of a feature point over the focal length of the camera equals the ratio of the horizontal or depth movement of the robot over the distance of the camera to the ground.

At operation 812, the robot 102 can determine the vertical movement of the robot using other mechanisms, such as a distance sensor. The routine 800 then proceeds to operation 814 where the estimation position of the robot 102 can be output. The routine 800 further proceeds to operation 816, where it ends.

It should be understood that the methods shown in FIGURES 7 and 8 might not generate a valid estimation of the robot’s current position. For example, if there are not enough textures on the ground, the robot 102 might not identify enough feature points in the images captured by the downward facing camera for the position estimate. Similarly, if the background is not static or lack of feature points, the front facing camera 110 can fail to extract enough feature points or extract the wrong feature points, thereby leading to the failure in the estimation. These failures in the position estimation can be taken care of by the aggregation operation at the operation 612 in FIGURE 6. Because the likelihood that all the methods used in

operations

606, 608 and 610 fail is very low, the aggregation can ensure that the estimation of the robot’s position are based on at least one or two valid outputs. In the worse scenario, the position estimation can be based on the IMU sensor output, which is always available, despite its low accuracy.

FIGURE 9 is a flow diagram showing a routine 900 that illustrates aspects of a method for determining whether a test image 302 matches a gesture template 130, according to one configuration disclosed herein. Routine 900 can be performed by the gesture matching module 124 or any other module of the robot 102. The routine 900 begins at operation 902, where the gesture matching module 124 can access the gesture template 130 selected or specified by the user 122. As discussed above, the gesture template 130 can be represented as a skeleton consisting of a set of joint points 206 on the body of the target object 104.

The gesture template 130 can also include multiple object parts 208, each object part 208 being defined by two or more joint points 206 and the portion connected through those joint points 206. Under this definition, the gesture of the target object 104 can be described using the angles formed by various pairs of object parts in the skeleton. For gesture matching purpose, the gesture template 130 can also be divided into multiple object regions 312. Each object region 312 can contain multiple joint points 206 and object parts 208. One joint point 206 or one object part 208 may be included in one or more object regions 312. It should be understood that the object regions 312 of the gesture template 130 can be generated after the user 122 selected the gesture template 130 and stored along with the gesture template 130 in the storage device 112. Alternatively, the object regions 312 may be identified when the gesture matching module 124 performs the gesture matching.

The gesture matching module 124 can also extract object image 304 from the test image 302 to prepare for the gesture matching. Because object identification has been completed when the position matching is performed at operation 508 in FIGURE 5, the object image 304 can be generated by extracting the identified object from the test image 302.

From operation 902, the routine 900 proceeds to operation 904 where the gesture matching module 124 can analyze the extracted object image 304 to extract a skeleton 320 from the object image 304. The extracted skeleton 320 can include joint points 306 that correspond to the joint points 206 on the gesture template 130 and object parts 308 that correspond to the object parts 208 on the gesture template 130. As an example, the extracted object image 304 can be fed into a deep convolutional neural network for detection of joints. The predicted joints can be a set of pixel locations in the object image 304 with confidence value for each joint location. The highest confidence joint locations can be chosen as positions for shoulder, elbow, wrist, etc. Then the human skeleton can be obtained by connecting each pair of adjacent joint locations with a line segment. For example, shoulder joint 306C and elbow joint 306E can form an upper arm segment 308B. According to one configuration, both the gesture template 130 and the skeleton 320 can be normalized before the comparison.

From operation 904, the routine 900 proceeds to operation 906 where object regions 310 that correspond to the object regions 312 of the gesture template 130 can be identified from the skeleton 320. The routine 900 then proceeds to operation 908 where a first region out of the identified objection regions 310 are selected for processing. The routine 900 then proceeds to operation 910, where angles in the selected region 310 are calculated. The angles can include any angle that is formed by two adjacent object parts 308 in the selected object region 312. Likewise, the angles formed by adjacent object parts 208 in the corresponding object region 310 of the gesture template 130 can also be calculated. Alternatively, these angles of the gesture template 130 can be pre-calculated and stored with the gesture template 130 in the storage device 112 before the gesture matching.

From operation 910, the routine 900 proceeds to operation 912 where the differences between the corresponding angles in the gesture template 130 and the extracted object image 304 are calculated and compared to a threshold. If any of the differences is greater than the threshold, the routine 900 proceeds to operation 914, where the feedback information for this particular is generated and the routine 900 further proceeds to operation 918. If the angle differences are lower than the threshold, the routine 900 also proceeds to operation 918, where it is determined whether there are more object regions 312 to be processed. If so, the routine 900 proceeds to operation 916, where next region 312 is selected for analysis, and further proceeds to operation 910 to repeat the above operations.

If there are no more object regions 312 to be processed, the routine 900 proceeds from operation 918 to operation 920, where it is determined whether any feedback information is generated. If so, the routine 900 proceeds to operation 922 where the generated feedback information can be output. If no feedback information is generated, it means that the extracted object image 304 matches the gesture template 130 in every object region 312, and the routine 900 proceeds to operation 924 where the gesture matching module 124 can generate output showing that the gesture template 130 is matched. From either operation 922 or operation 924, the routine 900 proceeds to operation 926, where it ends.

It should be appreciated that the template-based image acquisition method presented herein can achieve the image capturing task with low cost. For example, the methods for estimating the robot’s position using front facing camera and downward facing camera are pure visual methods, and requires no additional hardware, thereby allowing the cost of the robot 102 to remain low. In addition, in most operations of the process, a low-resolution test image 302 can be used for template matching and for adjusting the robot’s position. The low-resolution test image 302 requires fewer resources, such as memory usage and computational resource. In scenarios where the process requires multiple iterations of capturing and processing the test image 302, the overall process can be significantly speed up by the use of the low-resolution test image 302. In the meantime, the high-resolution image taken after the test image 302 is found to match the picture template 132 can ensure the high quality of the output picture.

It should be further appreciated that although a single picture template 132 is described here, similar idea can be used in a video template that includes multiple picture templates. Such a video template can specify a desired movement pattern of the target object in a video clip. The robot 102 can be configured to capture a video when frames in the video match their corresponding picture templates in the video template.

FIGURE 10 shows an example architecture for a computing device 1000 capable of executing program components for implementing the functionality described above. The architecture shown in FIGURE 10 illustrates a drone, a mobile robot, or any other robot that is capable of moving around, and can be utilized to execute any of the software components presented herein.

The computing device 1000 includes on-board compute-and-control module 1002 which include an on-board bus 1004. The on-board bus 1004 is a communication system that transfers data between components inside the computing device 1000. In one illustrative configuration, the computing device 1000 includes one or more heterogeneous processors 1006. The heterogeneous processors 1006 can include one or more central processing units ( “CPUs” ) 1008 which can be standard programmable processors that perform arithmetic and logical operations necessary for the operation of the computing device 1000.

The CPUs 1008 perform operations by transitioning from one discrete, physical state to the next through the manipulation of switching elements that differentiate between and change these states. Switching elements can generally include electronic circuits that maintain one of two binary states, such as flip-flops, and electronic circuits that provide an output state based on the logical combination of the states of one or more other switching elements, such as logic gates. These basic switching elements can be combined to create more complex logic circuits, including registers, adders-subtractors, arithmetic logic units, floating-point units, and the like.

The digital signal processors ( “DSPs” ) 1052 is a special-purpose microprocessor, which can be optimized to perform operations related to digital signal processing, typically involving, but not limited to, tasks such as measuring, filtering, compression, and conversion of digitized analog signals. DSPs typically have higher power efficiency compared to general-purpose microprocessors.

The field-programmable gate array ( “FPGA” ) 1054 is a customizable and reprogrammable integrated circuit, which can be used to achieve highly efficient special purpose computing with a relatively low design and deployment cost. FPGAs are widely used in the fields of telecommunications, digital signal processing, computer vision, speech processing, deep-learning neural networks, etc.

The graphical processing units ( “GPUs” ) 1056 is a highly parallel computing circuit designed for large-scale computing tasks such as video compression, graphic rendering, scientific computing, computer vision, deep-learning neural networks, etc.

The application-specific integrated circuit ( “ASICs” ) 1058 is an integrated circuit customized for a specific application. For a targeted application, an ASIC circuit typically exhibits the highest power efficiency and computation speed and usually no extra circuitry is provided to other forms of general-purpose computing, hence its use outside the targeted application are generally not applicable. ASICs can be designed to perform tasks in wireless communications, digital signal processing, computer vision, deep-learning neural networks, etc.

A microcontroller unit ( “MCU” ) 1060 is a single integrated circuit with one or more CPU cores, non-volatile memory, RAM, and programmable input/output peripherals all packaged within the same system on chip ( “SoC” ) . Compared to using separate chipsets for microprocessor, memory, and peripheral devices, microcontrollers are designed for embedded applications with low cost and low power constraints. Typical applications using microcontrollers include automobile control, biomedical devices and robotics such as unmanned aerial vehicles.

The on-board bus 1004 supports the communications between the heterogeneous processors 1006 and the remainder of the components and devices on the on-board bus 1004. The on-board bus 1004 can support communication with a RAM 1010, used as the main memory in the computing device 1000. The on-board bus 1004 can further support communication with a storage device 1014 that provides non-volatile storage for the computing device 1000. The storage device 1014 can store an operating system 1016, software applications 1018 such as movement control, vision processing, inertial navigation, and others, and a template matching module 140, which has been described in greater detail herein. The storage device 1014 can be connected to the on-board bus 1004 through a storage controller 1012. The storage device 1014 can consist of one or more physical storage units. The storage controller 1012 can interface with the physical storage units through a serial attached SCSI ( “SAS” ) interface, a serial advanced technology attachment ( “SATA” ) interface, a fiber channel ( “FC” ) interface, an embedded MultiMediaCard ( “EMMC” ) interface, a Universal Flash Storage ( “UFS” ) interface, or other type of interface for physically connecting and transferring data between computers and physical storage units.

The computing device 1000 can store data on the storage device 1014 by transforming the physical state of the physical storage units to reflect the information being stored. The specific transformation of physical state can depend on various factors, in different implementations of this description. Examples of such factors can include, but are not limited to, the technology used to implement the physical storage units, whether the storage device 1014 is characterized as primary or secondary storage, and the like.

For example, the computing device 1000 can store information to the storage device 1014 by issuing instructions through the storage controller 1012 to alter the magnetic characteristics of a particular location within a magnetic disk drive unit, the reflective or refractive characteristics of a particular location in an optical storage unit, or the electrical characteristics of a particular capacitor, transistor, or other discrete component in a solid-state storage unit. Other transformations of physical media are possible without departing from the scope and spirit of the present description, with the foregoing examples provided only to facilitate this description. The computing device 1000 can further read information from the storage device 1014 by detecting the physical states or characteristics of one or more particular locations within the physical storage units.

In addition to the storage device 1014 described above, the computing device 1000 can have access to other computer-readable storage media to store and retrieve information, such as program modules, data structures, or other data. It should be appreciated by those skilled in the art that computer-readable storage media is any available media that provides for the non-transitory storage of data and that can be accessed by the computing device 1000.

By way of example, and not limitation, computer-readable storage media can include volatile and non-volatile, removable and non-removable media implemented in any method or technology. Computer-readable storage media includes, but is not limited to, RAM, ROM, erasable programmable ROM ( “EPROM” ) , electrically-erasable programmable ROM ( “EEPROM” ) , flash memory or other solid-state memory technology, compact disc ROM ( “CD-ROM” ) , digital versatile disk ( “DVD” ) , high definition DVD ( “HD-DVD” ) , BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information in a non-transitory fashion.

As mentioned briefly above, the storage device 1014 can store an operating system 1016 utilized to control the operation of the computing device 1000. According to one configuration, the operating system comprises Linux distributions such as Ubuntu, Gentoo, Debian, OpenWRT, etc., and a large collection of embedded real-time operating system ( “RTOS” ) such as VXWorks and Nuttx. UAV flight control can also be run on a micro-processor without operating system. In this case, the storage device will contain only the essential programs that are needed to run flight control algorithms, and these programs will run directly on a micro-processor such as an MCU. These programs are sometimes also called “bare-metal” programs. The storage device 1014 can store other system or application programs and data utilized by the computing device 1000.

In one configuration, the storage device 1014 or other computer-readable storage media is encoded with computer-executable instructions which, when loaded into the computing device 1000, make the computing system into a special-purpose computing device capable of implementing the configurations described herein. These computer-executable instructions transform the computing device 1000 by specifying how the CPUs 1008 transition between states, as described above. According to one configuration, the computing device 1000 has access to computer-readable storage media storing computer-executable instructions which, when executed by the computing device 1000, perform the various processes described above with regard to FIGURES 1-6. The computing device 1000 can also include computer-readable storage media for performing any of the other computer-implemented operations described herein.

The storage expansion interface 1020 can be used to add additional external storage modules in addition to on-board storage. The expansion interface can employ one of multitude of technologies, such as multimedia card ( “MMC” ) interface, secure digital ( “SD” ) interface, secure digital high capacity ( “SDHC” ) , secure digital extended capacity ( “SDXC” ) interface, universal serial bus ( “USB” ) interface, PCI express interface, etc. The expansion storage module is a storage that the compute-and-control module 1002 can communicate to via the expansion interface 1020. The expansion module can employ one of multitude of technologies, such as flash storage or magnetic storage.

The computing device 1000 can operate in a networked environment using logical connections to remote computing devices and computer systems through a network, such as a wireless network. The computing device 1000 can include functionality for providing network connectivity through a wireless communication controller 1028. The wireless communication controller 1028 is capable of connecting the computing device 1000 to other computing devices over the wireless network. The wireless communication module can employ one of a multitude of technologies, such as Wi-Fi, ZigBee, Bluetooth, proprietary point-to-point microwave communications, cellular systems such as 3G, 4G, 5G, WiMax and LTE networks, and custom-protocol small-scale wireless networks.

The computing device 1000 can also include one or more input/output controllers 1032 for receiving and processing input from a number of input devices, such as one or more sensors 1038, a battery subsystem 1034, or other type of input devices. Such sensors may include but not limited to, inertial measurement units, magnetometer, barometer, sonar, LiDar, ToF camera. The battery subsystem may include a battery and associated circuitry to provide information about the level of charge and general condition of the battery, and such information can be communicated to the computing device via a peripheral communication interface such as universal asynchronous receiver/transmitter ( “UART” ) , inter-integrated circuit ( “I2C” ) , and serial peripheral interface ( “SPI” ) protocols.

Similarly, an input/output controller 1032 can provide outputs to output devices or systems such as a motor and ESC subsystem 1040 and a gimbal subsystem 1046. The motor subsystem may include one or more electronic speed controllers ( “ESCs” ) , and/or one or more electric motors. The ESC is a special circuit to convert control signals into electric signals that causes the motor to operate at a desired speed. The electric motor then produces output energy in the form of rotation thrust and torque, and causes the robot to mobilize. The gimbal subsystem is a special electronic module that can be used to stabilize a camera or other objects. A gimbal typically includes one or more electric motors, a sensor for sensing the movement of the camera, a computing circuit that can calculate the attitude of the camera, and one or more ESCs that drives the electric motors and causes the camera to point to a desired direction in spite of the movement by the device or vehicle that holds the gimbal. It will be appreciated that the computing device 1000 might not include all of the components shown in FIGURE 10, can include other components that are not explicitly shown in FIGURE 10, or might utilize an architecture completely different than that shown in FIGURE 10.

The disclosure presented herein can be considered to encompass the subject matter set forth in the following clauses.

Clause 1: A computer implemented method for capturing a picture of a target object without human intervention, the method comprising: capturing, by a robot, a first test image showing a target object; obtaining a picture template, the picture template comprising a position template describing a size of the target object in the image and a relative location of the target object in the image; comparing the picture template with the first test image; determining whether the first test image matches the picture template by determining whether a size and a location of the target object in the first test image match the position template; in response to determining that the first test image matches the picture template, capturing another image as the picture of the target object; and in response to determining that the first test image does not match the picture template, estimating, at least in part based on first test image and template, a new location for the robot so that an image captured by the robot at the new location matches the location template, causing the robot to move toward the new location, calculating a current location of the robot using one or more cameras of the robot, determining that the current location of the robot matches the new location, in response to determining that the current location of the robot matches the new location, capturing, by the robot at the current location, a second test image of the target object, comparing the picture template with the second test image, determining whether the second test image matches the picture template, and in response to determining that the second test image matches the picture template, capturing a new image of the target object as the picture of the target object.

Clause 2: The computer-implemented method of clause 1, wherein the robot comprises a front-facing stereo camera, and wherein the current location of the robot is determined using the front-facing stereo camera.

Clause 3: The computer-implemented method of clauses 1-2, wherein calculating the current location of the robot comprising: obtaining an old image captured using the front-facing stereo camera at a first location; identifying feature points in the old image; capturing a new image using the front-facing stereo camera after the robot moves to a second location; identifying a plurality of feature points in the old image that correspond to two or more of the feature points in the old image; calculating the current location of the robot based on the plurality of feature points and parameters of the stereo camera.

Clause 4: The computer-implemented method of clauses 1-3, wherein the picture template further comprises a gesture template describing a desired gesture of the target object in the picture.

Clause 5: The computer-implemented method of clauses 1-4, further comprising: analyzing the gesture of the target object in the second image, wherein determining whether the first test image matches the picture template further comprising determining whether the gesture of the target object in the second test image matches the desired gesture described in the gesture template; and in response to determining that the gesture of the target object in the second test image does not match the desired gesture, providing an indication that the target object should change gesture.

Clause 6: The computer-implemented method of clauses 1-5, wherein the target object is a human being and the gesture template is represented by a skeleton of a human being.

Clause 7: An apparatus comprising: at least one processor; one or more cameras; at least one motor configured to move the apparatus from one location to another location; and a computer-readable storage medium having computer-executable instructions stored thereupon that, when executed by the at least one processor, cause the at least one processor to at least: cause a first camera to capture a first image showing the target object; obtain an picture template; compare the picture template with the first image; determining whether the first image matches the picture template; in response to determining that the first image matches the picture template, capture another image as the picture of the target object; in response to determining that the first image does not match the picture template, estimate a new location for the apparatus so that an image captured by the apparatus at the new location matches the picture template; cause the apparatus to move toward the new location; calculate a current location of the robot using the one or more cameras of the robot; determine that the current location of the robot matches the new location; in response to determining that the current location of the robot matches the new location, capture, by the apparatus at the current location, a second image of the target object as the picture of the target object.

Clause 8: The apparatus of clause 7, wherein the picture template further comprises a gesture template describing a desired gesture of the target object in the picture.

Clause 9: The apparatus of clauses 7-8, wherein the computer-readable storage medium has further computer-executable instructions that comprise: analyzing the gesture of the target object in the second image, wherein determining whether the first test image matches the picture template further comprising determining whether the gesture of the target object in the second test image matches the desired gesture described in the gesture template; and in response to determining that the gesture of the target object in the second test image does not match the desired gesture, providing an indication that the target object should change gesture.

Clause 10: The apparatus of clauses 7-9, wherein providing an indication that the target object should change gesture comprises causing an indicator on the apparatus to change its status.

Clause 11: The apparatus of clauses 7-10, wherein the one or more cameras comprise a front-facing stereo camera, and wherein the current location of the robot is determined using the front-facing stereo camera.

Clause 12: The apparatus of clauses 7-11, wherein the front-facing stereo camera is different from the first camera.

Clause 13: The apparatus of clauses 7-12, wherein the front-facing stereo camera comprises the first camera.

Clause 14: The apparatus of clauses 7-13, wherein the one or more cameras comprise a downward facing camera, and wherein the current location of the robot is determined using at least in part the downward facing camera.

Clause 15: A non-transitory computer-readable storage medium having computer-executable instructions stored thereupon which, when executed by a processer in a robot, cause the processor to: cause a first camera of the robot to capture a first image showing the target object; obtain a picture template; compare the picture template with the first image; determine whether the first image matches the picture template; in response to determining that the first image matches the picture template, cause another image to be captured as an output picture of the target object; in response to determining that the first image does not match the picture template, estimate a new location for the robot so that an image captured by the robot at the new location matches the picture template; cause the robot to move toward the new location; calculate a current location of the robot using one or more cameras of the robot; determine that the current location of the robot matches the new location; in response to determining that the current location of the robot matches the new location, capture, by the robot at the current location, a second image of the target object as the picture of the target object.

Clause 16: The non-transitory computer-readable storage medium of clause 15, wherein the picture template comprising a position template describing a size of the target object in the image and a relative location of the target object in the image, and determining whether the first image matches the picture template comprises determining whether the first image matches the position template.

Clause 17: The non-transitory computer-readable storage medium of clauses 15-16, wherein the picture template comprising a gesture template describing a desired gesture of the target object in the picture and determining whether the first image matches the picture template comprises determining whether the target object in the first image matches the gesture template.

Clause 18: The non-transitory computer-readable storage medium of clauses 15-17, wherein determining that the gesture of the target object in the first image matches the gesture template comprises: extracting a portion of the first image containing the target object as an object image; dividing the gesture template and the object image into a plurality of regions; determining a difference between the gesture template and the object image for each of the plurality of regions; and determining that the gesture of the target object in the second test image does not match the gesture template if there exists at least one region whose difference exceeds a threshold.

Clause 19: The non-transitory computer-readable storage medium of clauses 15-18, wherein the target object is a human being and the gesture template is represented by a skeleton of a human being, and wherein the difference between the gesture template and the object image for each of the plurality of regions is measured by a difference between an angle formed by two parts of the skeleton and an angle formed by corresponding parts of the object image.

Clause 20: The non-transitory computer-readable storage medium of clauses 15-19, wherein determining that the gesture of the target object in the first image matches the gesture template further comprises generating feedback information for regions whose difference exceeds a threshold.

Based on the foregoing, it should be appreciated that technologies for a robot capturing an image that matches a picture template have been presented herein. Moreover, although the subject matter presented herein has been described in language specific to computer structural features, methodological acts, and computer readable media, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features, acts, or media described herein. Rather, the specific features, acts, and media are disclosed as example forms of implementing the claims.

The subject matter described above is provided by way of illustration only and should not be construed as limiting. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure. Various modifications and changes can be made to the subject matter described herein without following the example configurations and applications illustrated and described, and without departing from the true spirit and scope of the present invention, which is set forth in the following claims.

Claims

A computer implemented method for capturing a picture of a target object without human intervention, the method comprising:

capturing, by a robot, a first test image showing a target object;

obtaining a picture template, the picture template comprising a position template describing a size of the target object in the image and a relative location of the target object in the image;

comparing the picture template with the first test image;

determining whether the first test image matches the picture template by determining whether a size and a location of the target object in the first test image match the position template;

in response to determining that the first test image matches the picture template, capturing another image as the picture of the target object; and

in response to determining that the first test image does not match the picture template,

estimating, at least in part based on first test image and template, a new location for the robot so that an image captured by the robot at the new location matches the location template,

causing the robot to move toward the new location,

calculating a current location of the robot using one or more cameras of the robot,

determining that the current location of the robot matches the new location,

in response to determining that the current location of the robot matches the new location, capturing, by the robot at the current location, a second test image of the target object,

comparing the picture template with the second test image,

determining whether the second test image matches the picture template, and

in response to determining that the second test image matches the picture template, capturing a new image of the target object as the picture of the target object.
The computer-implemented method of claim 1, wherein the robot comprises a front-facing stereo camera, and wherein the current location of the robot is determined using the front-facing stereo camera.
The computer-implemented method of claim 2, wherein calculating the current location of the robot comprising:

obtaining an old image captured using the front-facing stereo camera at a first location;

identifying feature points in the old image;

capturing a new image using the front-facing stereo camera after the robot moves to a second location;

identifying a plurality of feature points in the old image that correspond to two or more of the feature points in the old image;

calculating the current location of the robot based on the plurality of feature points and parameters of the stereo camera.
The computer-implemented method of claim 1, wherein the picture template further comprises a gesture template describing a desired gesture of the target object in the picture.
The computer-implemented method of claim 4, further comprising:

analyzing the gesture of the target object in the second image, wherein determining whether the first test image matches the picture template further comprising determining whether the gesture of the target object in the second test image matches the desired gesture described in the gesture template; and

in response to determining that the gesture of the target object in the second test image does not match the desired gesture, providing an indication that the target object should change gesture.
The computer-implemented method of claim 4, wherein the target object is a human being and the gesture template is represented by a skeleton of a human being.
An apparatus comprising:

at least one processor;

one or more cameras;

at least one motor configured to move the apparatus from one location to another location; and

a computer-readable storage medium having computer-executable instructions stored thereupon that, when executed by the at least one processor, cause the at least one processor to at least:

cause a first camera to capture a first image showing the target object;

obtain an picture template;

compare the picture template with the first image;

determining whether the first image matches the picture template;

in response to determining that the first image matches the picture template, capture another image as the picture of the target object;

in response to determining that the first image does not match the picture template,

estimate a new location for the apparatus so that an image captured by the apparatus at the new location matches the picture template;

cause the apparatus to move toward the new location;

calculate a current location of the robot using the one or more cameras of the robot;

determine that the current location of the robot matches the new location;

in response to determining that the current location of the robot matches the new location, capture, by the apparatus at the current location, a second image of the target object as the picture of the target object.
The apparatus of claim 7, wherein the picture template further comprises a gesture template describing a desired gesture of the target object in the picture.
The apparatus of claim 8, wherein the computer-readable storage medium has further computer-executable instructions that comprise:

analyzing the gesture of the target object in the second image, wherein determining whether the first test image matches the picture template further comprising determining whether the gesture of the target object in the second test image matches the desired gesture described in the gesture template; and

in response to determining that the gesture of the target object in the second test image does not match the desired gesture, providing an indication that the target object should change gesture.
The apparatus of claim 9, wherein providing an indication that the target object should change gesture comprises causing an indicator on the apparatus to change its status.
The apparatus of claim 7, wherein the one or more cameras comprise a front-facing stereo camera, and wherein the current location of the robot is determined using the front-facing stereo camera.
The apparatus of claim 11, wherein the front-facing stereo camera is different from the first camera.
The apparatus of claim 11, wherein the front-facing stereo camera comprises the first camera.
The apparatus of claim 7, wherein the one or more cameras comprise a downward facing camera, and wherein the current location of the robot is determined using at least in part the downward facing camera.
A non-transitory computer-readable storage medium having computer-executable instructions stored thereupon which, when executed by a processer in a robot, cause the processor to:

cause a first camera of the robot to capture a first image showing the target object;

obtain a picture template;

compare the picture template with the first image;

determine whether the first image matches the picture template;

in response to determining that the first image matches the picture template, cause another image to be captured as an output picture of the target object;

in response to determining that the first image does not match the picture template,

estimate a new location for the robot so that an image captured by the robot at the new location matches the picture template;

cause the robot to move toward the new location;

calculate a current location of the robot using one or more cameras of the robot;

determine that the current location of the robot matches the new location;

in response to determining that the current location of the robot matches the new location, capture, by the robot at the current location, a second image of the target object as the picture of the target object.
The non-transitory computer-readable storage medium of claim 15, wherein the picture template comprising a position template describing a size of the target object in the image and a relative location of the target object in the image, and determining whether the first image matches the picture template comprises determining whether the first image matches the position template.
The non-transitory computer-readable storage medium of claim 15, wherein the picture template comprising a gesture template describing a desired gesture of the target object in the picture and determining whether the first image matches the picture template comprises determining whether the target object in the first image matches the gesture template.
The non-transitory computer-readable storage medium of claim 17, wherein determining that the gesture of the target object in the first image matches the gesture template comprises:

extracting a portion of the first image containing the target object as an object image;

dividing the gesture template and the object image into a plurality of regions;

determining a difference between the gesture template and the object image for each of the plurality of regions; and

determining that the gesture of the target object in the second test image does not match the gesture template if there exists at least one region whose difference exceeds a threshold.
The non-transitory computer-readable storage medium of claim 18, wherein the target object is a human being and the gesture template is represented by a skeleton of a human being, and wherein the difference between the gesture template and the object image for each of the plurality of regions is measured by a difference between an angle formed by two parts of the skeleton and an angle formed by corresponding parts of the object image.
The non-transitory computer-readable storage medium of claim 18, wherein determining that the gesture of the target object in the first image matches the gesture template further comprises generating feedback information for regions whose difference exceeds a threshold.