US20230030442A1 - Telepresence robot - Google Patents

Telepresence robot

Info

Publication number
US20230030442A1
Authority
US
United States
Prior art keywords
robot
user
full
face image
display
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US17/390,887
Inventor
Naoki Ogishita
Tsubasa Tsukahara
Fumihiko Iida
Ryuichi Suzuki
Karen Murata
Jun Momose
Kotaro Imamura
Yasushi Okumura
Ramanath Bhat
Daisuke Kawamura
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Interactive Entertainment Inc
Sony Corp
Sony Group Corp
Sony Interactive Entertainment LLC
Original Assignee
Sony Interactive Entertainment Inc
Sony Corp
Sony Group Corp
Sony Interactive Entertainment LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Interactive Entertainment Inc, Sony Corp, Sony Group Corp, Sony Interactive Entertainment LLC filed Critical Sony Interactive Entertainment Inc
Priority to US17/390,887 (published as US20230030442A1)
Assigned to SONY INTERACTIVE ENTERTAINMENT INC., Sony Interactive Entertainment LLC, SONY CORPORATION reassignment SONY INTERACTIVE ENTERTAINMENT INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: TSUKAHARA, TSUBASA, MOMOSE, JUN, SUZUKI, RYUICHI, MURATA, Karen, OKUMURA, YASUSHI, BHAT, RAMANATH, IIDA, FUMIHIKO, IMAMURA, Kotaro, KAWAMURA, DAISUKE, OGISHITA, NAOKI
Assigned to SONY INTERACTIVE ENTERTAINMENT INC. reassignment SONY INTERACTIVE ENTERTAINMENT INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: TSUKAHARA, TSUBASA, MOMOSE, JUN, SUZUKI, RYUICHI, MURATA, Karen, OKUMURA, YASUSHI, BHAT, RAMANATH, IIDA, FUMIHIKO, IMAMURA, Kotaro, KAWAMURA, DAISUKE, OGISHITA, NAOKI
Priority to TW111126652A (published as TW202309832A)
Priority to CN202210867160.0A (published as CN115922657A)
Priority to JP2022118491A (published as JP2023021207A)
Priority to EP22187604.8A (published as EP4124416A1)
Publication of US20230030442A1
Current legal status: Abandoned

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/012Head tracking input arrangements
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00Manipulators not otherwise provided for
    • B25J11/0005Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means
    • B25J11/0015Face robots, animated artificial faces for imitating human expressions
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J13/00Controls for manipulators
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J5/00Manipulators mounted on wheels or on carriages
    • B25J5/007Manipulators mounted on wheels or on carriages mounted on wheels
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • B25J9/161Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • B25J9/163Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed
    • B25J9/1689Teleoperation
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems

Abstract

A robot has a generally cubic body portion on propulsion elements. The robot turns its head in accordance with a remote person turning his head, and also presents, on a display representing the face of the robot, a full-face image representing the person even when the person is being imaged in profile.

Description

    FIELD
  • The application pertains to robots.
  • BACKGROUND
  • Robots are increasingly used not only for performing useful tasks, but also for providing a measure of companionship.
  • SUMMARY
  • A robot includes a lower body portion on propulsion elements. An upper body portion is coupled to the lower body portion and is movable relative to the lower body portion. The upper body portion includes at least one display configured to present an image representing a person remote from the robot, with the image being a full-face image. An avatar may be presented, or an actual image of the person may be presented.
  • In some examples the upper body portion is movable relative to the lower body portion in accordance with motion of the person as indicated by signals received from an imager. The imager can be a webcam, smart phone cam, or other imaging device.
  • The full-face image can be generated from a profile image of the person, if desired using a machine learning (ML) model executed by a processor in the robot and/or by a processor distanced from the robot.
  • In some examples, opposed side surfaces of the upper body portion include respective microphones. Example implementations of the robot can include left and right cameras and at least one processor to send images from the cameras to a companion robot local to and associated with the person. A motorized vehicle may be provided with a recess configured to closely hold the lower body portion to transport the robot. At least one magnet can be disposed in the recess to magnetically couple the robot with the motorized vehicle and to charge at least one battery in the robot. If desired, at least one speaker can be provided on the robot and may be configured to play voice signals received from the person. The top surface of the robot may be implemented by at least one touch sensor to receive touch input for the processor.
  • In another aspect, a device includes at least one computer storage that is not a transitory signal and that in turn includes instructions executable by at least one processor to, for at least a first user, render, from at least one image captured of the first user by at least one imager, a full-face image representing the first user with background and body parts of the first user cropped out of the image representing the first user. The instructions may be executable to provide, to at least a first robot remote from the first user, the full-face image for presentation thereof on a display of the first robot with the full-face image filling the display. The instructions may be further executable to provide, to the first robot, information from the imager regarding motion of the first user such that a head of the first robot turns to mimic the motion of the first user while continuing to present a full-face image representing the first user on the display of the first robot regardless of whether the head of the first user turned away from the imager.
  • In another aspect, a method includes, for at least a first user, rendering, from at least one image captured of the first user, a full-face image representing the first user with background and body parts of the first user cropped out of the image captured of the first user. The method includes presenting, on at least one display of a first robot remote from the first user, the full-face image representing the first user with the full-face image filling the display of the first robot. The method also includes turning a head of the first robot to mimic a head turn of the first user while continuing to present a full-face image representing the first user on the display of the first robot.
  • Additionally, for at least a second user local to the first robot, the method includes rendering, from at least one image captured of the second user, a full-face image representing the second user with background and body parts of the second user cropped out of the image representing the second user. The method includes presenting, on at least one display of a second robot local to the first user, the full-face image representing the second user with the full-face image of the second user filling the display of the second robot. Further, the method includes turning a head of the second robot to mimic a head turn of the second user while continuing to present a full-face image representing the second user on the display of the second robot.
  • The details of the present application, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is an isometric view of the robot consistent with present principles, along with a control device such as a smart phone;
  • FIGS. 2 and 3 are isometric views of the robot with the display face showing different face images;
  • FIG. 4 illustrates the mobile buggy in which the robot of FIG. 1 can be disposed;
  • FIG. 5 is a block diagram of example components of the robot;
  • FIGS. 6-8 illustrate example logic in example flow chart format consistent with present principles;
  • FIG. 9 schematically illustrates two users remote from each other, each “conversing” with a respective robot local to that user, which presents the facial image and mimics the motions of the opposite user;
  • FIG. 10 schematically illustrates additional aspects from FIG. 9 ;
  • FIGS. 11 and 12 illustrate example logic in example flow chart format consistent with present principles; and
  • FIGS. 13-15 illustrate example robot vehicles consistent with present principles.
  • DETAILED DESCRIPTION
  • FIG. 1 shows a robot 10 that includes a lower body portion 12 on propulsion elements 14, which may be established by four micro holonomic drives. The robot may be made of lightweight metal or plastic and may be relatively small, e.g., the robot 10 can be small enough to hold by hand.
  • An upper body or head portion 16 is movably coupled to the lower body portion 12 by one or more coupling shafts 18 that can be motor driven to move the head portion 16 relative to the lower body portion 12. The lower body 12 and head portion 16 can be parallelepiped-shaped as shown and may be cubic in some examples.
  • The head portion 16 can be movable relative to the lower body portion 12 both rotatably and tiltably. For example, as indicated by the arrows 20, the upper body or head portion 16 can be tiltable forward-and-back relative to the lower body portion 12, while as illustrated by the arrows 22 the upper body or head portion 16 can be tiltable left-and-right. Also, as indicated by the arrows 24, the upper body or head portion 16 can rotate about its vertical axis relative to the lower body portion 12.
  • The front surface 26 of the upper body or head portion 16 can be established by a display 28 configured to present demanded images. Opposed side surfaces 30 of the upper body or head portion 16 may include respective microphones 32 at locations corresponding to where the ears of a human would be. The robot 10, e.g., the lower body portion 12 thereof, can also include left and right cameras 34 which may be red-green-blue (RGB) cameras, depth cameras, or combinations thereof. The cameras alternatively may be placed in the head portion 16 where the eyes of a human would be. A speaker 36 may be provided on the robot, e.g., on the head portion 16 near where the mouth of a human would be, and at least one touch sensor 38 can be mounted on the robot 10, e.g., on the top surface of the upper body or head portion 16, to receive touch input for a processor within the robot 10, discussed further below.
  • A control device 40 such as a smart phone may include processors, cameras, network interfaces, and the like for controlling and communicating with the robot 10 as discussed more fully herein.
  • FIGS. 2 and 3 illustrate that the display 28 of the upper body or head portion 16 may present various different demanded images of, e.g., human faces imaged by any of the cameras herein. The images may be presented under control of any of the processors discussed herein and may be received by a network interface in the robot 10. In lieu of an image of the person, the face of an avatar representing the person may be presented to preserve privacy. The avatar may be animated to have the same emotional expressions of the person by face emotion capture (including eyes, eyebrows, mouth, and nose).
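  • As one possible (assumed) realization of such emotion capture, detected expression weights for the eyes, eyebrows, mouth, and nose could drive corresponding avatar parameters. The sketch below is illustrative only; none of its names come from the disclosure.

```python
# Illustrative sketch: map captured facial-expression weights onto an avatar.
# The weight keys and the Avatar interface are assumptions, not part of the disclosure.
EXPRESSION_KEYS = ("eyes_open", "brow_raise", "mouth_smile", "mouth_open", "nose_wrinkle")

def animate_avatar(avatar, expression_weights: dict) -> None:
    """Apply detected expression weights (0.0-1.0) to the avatar's face rig."""
    for key in EXPRESSION_KEYS:
        avatar.set_blendshape(key, expression_weights.get(key, 0.0))
    avatar.render()   # redraw the avatar face with the person's current expression
```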
  • Note that whether the head portion 16 is facing straight ahead as in FIG. 2 or is tilted or rotated to one side as in FIG. 3, the display 28 presents the full-face image (of the person or the avatar), even if the original image is of a human taken from the side of the human's face. Details are discussed further below.
  • FIG. 4 illustrates a motorized vehicle 400 (powered by, e.g., an internal rechargeable battery) with a recess 402 configured to closely hold the lower body portion 12 of the robot 10 to transport the robot 10. At least one magnet 404 can be disposed in the recess 402 to magnetically and electrically couple the robot 10 (which can include a magnet or ferromagnet) with the motorized vehicle 400 and to charge at least one battery in the robot. Advantageously, the propulsion elements 14 of the robot 10 need not be detached to secure the robot 10 in the recess 402. The processor of the robot 10 senses the presence of a processor in the vehicle 400 and controls the processor in the vehicle 400 to move the vehicle 400 in lieu of moving the robot 10 by means of the propulsion elements 14.
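  • A minimal sketch of this docking behavior, assuming hypothetical class and method names (none of which appear in the disclosure), is shown below.

```python
# Illustrative sketch only; the propulsion and vehicle-link interfaces are assumptions.
class DockableRobot:
    def __init__(self, propulsion, vehicle_link=None):
        self.propulsion = propulsion      # drives the robot's own propulsion elements 14
        self.vehicle_link = vehicle_link  # magnetic/electrical coupling to vehicle 400, if docked

    def docked(self) -> bool:
        # The robot senses the presence of a processor in the vehicle via the coupling.
        return self.vehicle_link is not None and self.vehicle_link.vehicle_present()

    def move(self, vx: float, vy: float, omega: float) -> None:
        """When docked, command the vehicle; otherwise drive the robot's own wheels."""
        if self.docked():
            self.vehicle_link.request_charge()            # charge the battery through the magnet 404
            self.vehicle_link.send_velocity(vx, vy, omega)
        else:
            self.propulsion.set_velocity(vx, vy, omega)
```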
  • FIG. 5 illustrates various components of the robot 10, many of which are internal to the robot 10. In addition to the camera(s) 34, microphone(s) 32, display(s) 28, speaker(s) 36, and touch surface(s) 38, the robot 10 may include one or more processors 500 accessing one or more computer storages 502 to program the processor 500 with instructions executable to undertake logic discussed herein. The processor 500 may control the components illustrated in FIG. 5, including a head actuator 504 to move the head portion 16 relative to the body portion 12, a propulsion motor 506 to activate the propulsion elements 14 shown in FIG. 1, and a network interface 508 such as a wireless transceiver to communicate data to components external to the robot 10.
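  • For reference in the sketches below, the FIG. 5 components might be grouped around the processor 500 roughly as follows; the field names are illustrative assumptions, not terms from the disclosure.

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical grouping of the FIG. 5 components; field names are assumptions.
@dataclass
class RobotHardware:
    cameras: List[Any]          # left/right cameras 34
    microphones: List[Any]      # side microphones 32
    display: Any                # front display 28 for the full-face image
    speaker: Any                # speaker 36
    touch_sensor: Any           # touch sensor 38
    head_actuator: Any          # head actuator 504
    propulsion_motor: Any       # propulsion motor 506 driving elements 14
    network: Any                # network interface 508 (wireless transceiver)
    docked_vehicle: Any = None  # optional vehicle 400 when the robot is docked
```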
  • A charge circuit 510 may be provided to charge one or more batteries 512 to provide power to the components of the robot. As discussed above, the charge circuit 510 may receive charge current via one or more magnetic elements 514 from, e.g., the vehicle 400 shown in FIG. 4 .
  • FIGS. 6-8 illustrate example logic in example flow chart format that the processor 500 in FIG. 5 may execute. Commencing at block 600 in FIG. 6, input may be received from the camera(s) 34. Face recognition may be executed on images from the camera, for example, to move the head 16 at block 602 to remain facing a person imaged by the camera. Also, at block 604 the robot may be activated to move on the propulsion elements 14 according to the camera signal, e.g., to turn and “hide” behind a nearby object as if “shy” in the presence of the person being imaged by the camera.
  • Commencing at block 700 in FIG. 7, input may be received from the microphone(s) 32. Voice recognition may be executed on the signals, for example, to move the head 16 at block 702 to cock one side of the head of the robot toward the source of the signals (or toward the face of a person imaged by the cameras) as if listening attentively to the person. Also, at block 704 the robot may be activated to move on the propulsion elements 14 according to the microphone signal, e.g., to turn and approach a person being imaged by the camera.
  • Commencing at block 800 in FIG. 8, input may be received from the touch surface(s) 38. At block 802 the processor may actuate the head 16 to move in response to the touch signal, e.g., to bow the head as if out of respect for the person touching the head. Also, at block 804 the robot may be activated to move on the propulsion elements 14 according to the touch signal.
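  • The three flows of FIGS. 6-8 could be combined into a single behavior loop along the following lines; the detector callables are placeholders and the structure is an assumption rather than the disclosed implementation.

```python
import time

# Hypothetical behavior loop for FIGS. 6-8; face_detector and voice_detector are placeholder callables.
def behavior_loop(hw, face_detector, voice_detector):
    while True:
        frame = hw.cameras[0].read()
        face = face_detector(frame)                # blocks 600-602: keep the head facing the person
        if face is not None:
            hw.head_actuator.point_toward(face.bearing)
        audio = hw.microphones[0].read()
        voice = voice_detector(audio)              # blocks 700-702: cock the head toward the speaker
        if voice is not None:
            hw.head_actuator.tilt_side(toward=voice.bearing)
        if hw.touch_sensor.pressed():              # blocks 800-802: bow the head when touched
            hw.head_actuator.bow()
        time.sleep(0.05)                           # assumed ~20 Hz loop rate
```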
  • FIG. 9 illustrates a use case of the robot 10. A first user 900 may operate a smart phone or tablet computer or other control device 902 to communicate with a first robot 10A. Remote from the first user 900, a second user 904 may operate a smart phone or tablet computer or other control device 906 to communicate with a second robot 10B.
  • As indicated in FIG. 9, the first robot 10A presents on its display 28 a full-face image 908 of the (frowning) second user 904 (equivalently, a frowning avatar face). The second robot 10B presents on its display 28 a full-face image 910 of the (smiling) first user 900 (equivalently, a smiling avatar face). The image 908 on the display of the first robot 10A may represent the second user 904 based on images generated by a camera in the second user control device 906 or a camera in the second robot 10B and communicated over, e.g., a wide area computer network or a telephony network to the first robot 10A. Likewise, the image 910 on the display of the second robot 10B may represent the first user 900 based on images generated by a camera in the first user control device 902 or the first robot 10A and communicated over, e.g., a wide area computer network or a telephony network to the second robot 10B. The face images of the users/avatars may be 2D or 3D and the displays 28 of the robots may be 2D displays or 3D displays.
  • Moreover, the head of the first robot 10A may be controlled by the processor in the first control device 902 and/or the first robot 10A to rotate and tilt in synchronization with the head of the second user 904 as indicated by images from the second control device 906 and/or second robot 10B. Likewise, the head of the second robot 10B may be controlled by the processor in the second control device 906 and/or the second robot 10B to rotate and tilt in synchronization with the head of the first user 900 as indicated by images from the first control device 902 and/or first robot 10A.
  • In both cases, however, the images of the faces on the robots remain full-face images as would be seen from a direction normal (perpendicular) to the display 28 from in front of the display, regardless of the orientation of the head of the respective robot. Any background in the images of the respective user is cropped out of the full-face images, as are body parts of the respective user below the chin that may appear in the images. The full-face images may be generated even as the head of the respective user turns away from the imaging camera, consistent with disclosure herein, so that the front display surfaces of the robots present not profile images as generated by the cameras but full-face images derived, as described herein, from camera images of a turned head, no matter how the robot head is turned or tilted, just as a human face of a turned head remains a full face when viewed from directly in front of the face along a line of sight perpendicular to the face.
  • As below-the-head images of a user indicate movement (such as but not limited to translational movement) of the user, the corresponding (remote) robot may also move in the direction indicated by the images by activating the propulsion motor and, hence, propulsion elements 14 of the robot. In particular, the body portion of the robot below the display may move. Further, speech from the first user 900 as detected by the first control device 902 or first robot 10A may be sent to the second robot 10B for play on the speaker on the second robot, and vice-versa.
  • Thus, the first user 900 may interact with the first robot 10A presenting the face image of the second (remote) user 904 as if the second user 904 were located at the position of the first robot 10A, i.e., local to the first user 900. Likewise, the second user 904 may interact with the second robot 10B presenting the face image of the first (remote) user 900 as if the first user 900 were located at the position of the second robot 10B, i.e., local to the second user 904.
  • FIG. 10 further illustrates the above principles, assuming that user images are employed, it being understood that the same principles apply when avatars expressing the user emotion are used. At 1000 the first user 900 of FIG. 9 operating the first control device 902 is imaged using any of the cameras discussed previously to move the second robot 10B and to send an image of the face of the first user 900 to the display 28 of the second robot 10B for presentation of a full-face image (and only a full-face image) on the second robot 10B. The image may be sent to a network address of the second robot 10B or sent to the second control device 906 shown in FIG. 9 , which relays the image to the second robot 10B via, e.g., Wi-Fi or Bluetooth. No background apart from the face image and no body portions of the first user 900 other than the face are presented on the display 28 of the second robot 10B.
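  • The two delivery paths mentioned above (directly to the robot's network address, or relayed through the local control device) might be selected as in the sketch below; the UDP transport, port, and relay interface are assumptions, not disclosed details.

```python
import socket
from typing import Optional

# Hypothetical delivery of one encoded face frame to the remote robot.
def deliver_face_frame(frame_bytes: bytes, robot_address: Optional[str],
                       control_device=None, port: int = 9000) -> None:
    if robot_address is not None:
        # Direct path: send to the robot's network address (e.g., over Wi-Fi/WAN).
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(frame_bytes, (robot_address, port))
    elif control_device is not None:
        # Relayed path: the local control device forwards the frame via Wi-Fi or Bluetooth.
        control_device.relay(frame_bytes)
    else:
        raise RuntimeError("no route to the remote robot")
```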
  • As shown at 1002, should the first user 900 turn his head to the left, this motion is captured, e.g., by the camera(s) in the first control device 902 and/or first robot 10A and signals such as a stream of images are sent to the second robot 10B as described above to cause the processor 500 of the second robot 10B to activate the head actuator 504 to turn the head 16 of the second robot 10B to the left relative to the body 12 of the second robot 10B, as illustrated in FIG. 10 . However, the display 28 of the second robot 10B, although turned to the left relative to the front of the body 12, does not show a profile view of the head of the first user 900 as currently being imaged by the camera(s) of the first control device 902 or first robot 10A. Instead, as shown in FIG. 10 the display 28 of the second robot 10B continues to show a full-face image, i.e., an image of the face of the first user 900 as would be seen if looking directly at the face from a line of sight perpendicular to the face.
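  • One way to realize the behavior at 1002, with the robot head following the remote user's head pose while the display keeps showing a frontalized face, is sketched below; the pose-estimation and frontalization callables stand in for implementations the disclosure does not specify.

```python
# Hypothetical per-frame update on the receiving robot (FIG. 10).
def on_remote_frame(frame, hw, estimate_head_pose, frontalize):
    yaw, pitch = estimate_head_pose(frame)                  # how far the remote user's head is turned
    hw.head_actuator.set_orientation(yaw=yaw, pitch=pitch)  # turn head 16 relative to body 12
    full_face = frontalize(frame)                           # always a cropped, full-face image
    hw.display.show_fullscreen(full_face)                   # display 28 never shows a profile view
```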
  • FIG. 11 illustrates further principles that may be used in connection with the above description. Commencing at block 1100, an input set of training images is input to a machine learning (ML) model, such as one or more of a convolutional neural network (CNN), recurrent NN (RNN), and combinations thereof. The ML model is trained using the training set at block 1102.
  • The training set of images may include 3D images of human faces from various perspectives, from full frontal view through full side profile views. The training set of images may include ground truth 2D full frontal view representations of each 3D perspective view including non-full frontal 3D perspective views. The ground truth 2D images are face-only, configured to fill an entire display 28 of a robot 10, with background and body portions other than the face cropped out from the corresponding 3D images. The full-frontal view representations show facial features as well as emotional distortions of facial muscles (smiling, frowning, etc.). In this way, the ML model learns how to generate full frontal view 2D images from a series of 3D images of a user's face as the user turns his head toward and away from a camera rendering the 3D images.
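  • A training loop consistent with the description above could pair each perspective view with its ground-truth frontal, face-only crop. The PyTorch-style sketch below is an assumption about one possible setup; the dataset contents, loss, and hyperparameters are illustrative only.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

# Hypothetical dataset: each sample pairs a (possibly non-frontal) face view with its
# ground-truth full-frontal, face-only crop sized to fill the robot display.
class FaceFrontalizationSet(Dataset):
    def __init__(self, views, frontal_ground_truth):
        self.views = views
        self.frontal = frontal_ground_truth

    def __len__(self):
        return len(self.views)

    def __getitem__(self, i):
        return self.views[i], self.frontal[i]

def train(model: nn.Module, dataset: Dataset, epochs: int = 10, lr: float = 1e-4) -> nn.Module:
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()                 # pixel-wise reconstruction loss (an assumed choice)
    for _ in range(epochs):
        for view, frontal in loader:
            opt.zero_grad()
            pred = model(view)            # predicted full-frontal image
            loss = loss_fn(pred, frontal)
            loss.backward()
            opt.step()
    return model
```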
  • Accordingly, present principles may employ machine learning models, including deep learning models. Machine learning models use various algorithms trained in ways that include supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, feature learning, self-learning, and other forms of learning. Examples of such algorithms, which can be implemented by computer circuitry, include one or more neural networks, such as a convolutional neural network (CNN), recurrent neural network (RNN) which may be appropriate to learn information from a series of images, and a type of RNN known as a long short-term memory (LSTM) network. Support vector machines (SVM) and Bayesian networks also may be considered to be examples of machine learning models.
  • As understood herein, performing machine learning involves accessing and then training a model on training data to enable the model to process further data to make predictions. A neural network may include an input layer, an output layer, and multiple hidden layers in between that are configured and weighted to make inferences about an appropriate output.
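  • As a concrete, assumed instance of such a network, a small convolutional encoder-decoder mapping a perspective face view to a frontal face image might look like the following; the layer sizes are arbitrary.

```python
import torch
from torch import nn

# Minimal encoder-decoder for illustration only; sizes are arbitrary assumptions.
class TinyFrontalizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(     # input layer plus hidden layers
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(     # hidden layers plus output layer
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))
```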
  • FIG. 12 illustrates logic attendant to FIGS. 9 and 10 using a ML model as trained in FIG. 11 . It is to be understood that the logic of FIG. 12 may be executed by any of the processors or combinations thereof described herein, including the processor of a server on a wide area computer network communicating with the control devices 902, 906 and/or robots 10A, 10B.
  • Commencing at block 1200, for each user (assume only two users 900, 904 as shown in FIG. 9 for simplicity) images are captured at block 1202 of the user's face, including images showing motion of the face and body of the user. The voice of the user is captured at block 1204 and both the voice signals and image sequence of the user as the user moves and speaks are sent at block 1206 to the other user's local robot.
  • Meanwhile and proceeding to block 1208, the same signals—image sequences of the face and body motions and voice signals of the other user—are received at block 1208. The image of the face of the other user, if not already full face as would be seen looking directly at the other user along a line of sight perpendicular to the front of the face of the other user, is converted at block 1210 to a 2D full-face image using the ML model trained as described, with background and body parts of the other user other than the face being cropped. The full-face 2D image is presented on the display 28 of the local robot, preferably entirely filling the display with the image of the face of the other user. As mentioned above, conversion of a 3D image in profile of a user's face to a full-face 2D image may be effected by any one or more of the processors described herein.
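  • The receive-side processing of blocks 1208-1210 might be organized as below; the decode, detection, and cropping helpers are assumed interfaces rather than disclosed ones.

```python
# Hypothetical receive-side pipeline (blocks 1208-1210 of FIG. 12); helper callables are assumptions.
def present_remote_face(packet, hw, frontalizer, decode_image, is_full_face, crop_face):
    frame = decode_image(packet.image_bytes)   # block 1208: image received from the other user
    if not is_full_face(frame):                # block 1210: convert a turned/profile view...
        frame = frontalizer(frame)             # ...to a 2D full-face image with the trained ML model
    face_only = crop_face(frame)               # drop background and body parts below the chin
    hw.display.show_fullscreen(face_only)      # fill the display 28 entirely with the face
```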
  • In an alternative embodiment, in lieu of using ML models to convert 3D images to full face 2D images, a single 2D full face image of the other user may be obtained and presented for the duration on the local robot. As also discussed, avatars may be used for privacy instead of the image of a person, with the expression of the avatars preferably being animated according to the expression of the person.
  • If desired, the other user's voice may be played at block 1212 on the local robot or the local control device. Also, at block 1214 the head of the local robot may be turned to mimic head motion of the other user as represented by the sequence of images received at block 1208 and as shown at 1002 in FIG. 10 . Moreover, in the event that the other user moves his body by, e.g., walking, that motion is captured and received at block 1208 and input to the processor of the local robot to actuate the propulsion elements 14 (or, if the robot is in a vehicle such as the vehicle 400 shown in FIG. 4 , the vehicle) to translationally move the local robot to mimic the motion of the other user.
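  • Blocks 1212-1214 and the body-motion mimicry could then be dispatched roughly as follows; the motion-estimation and audio helpers are assumptions.

```python
# Hypothetical handling of voice, head motion, and body motion (blocks 1212-1214 of FIG. 12).
def mimic_remote_user(packet, hw, estimate_head_pose, estimate_body_velocity, decode_audio):
    hw.speaker.play(decode_audio(packet.voice_bytes))        # block 1212: play the other user's voice
    yaw, pitch = estimate_head_pose(packet.latest_frame)     # block 1214: mirror the head turn
    hw.head_actuator.set_orientation(yaw=yaw, pitch=pitch)
    vx, vy = estimate_body_velocity(packet.frame_sequence)   # walking or other translational motion
    if hw.docked_vehicle is not None:
        hw.docked_vehicle.send_velocity(vx, vy, 0.0)         # delegate to the vehicle 400 when docked
    else:
        hw.propulsion_motor.set_velocity(vx, vy, 0.0)        # otherwise drive the propulsion elements 14
```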
  • FIGS. 13-15 indicate that in lieu of the motorized vehicle 400 shown in FIG. 4 , the robot 10 may be mounted on other types of moving platforms such as a bicycle 1300, a crab-like tractor 1400, or an airborne drone 1500.
  • Components included in one embodiment can be used in other embodiments in any appropriate combination. For example, any of the various components described herein and/or depicted in the Figures may be combined, interchanged, or excluded from other embodiments.
  • “A system having at least one of A, B, and C” (likewise “a system having at least one of A, B, or C” and “a system having at least one of A, B, C”) includes systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.
  • A processor may be a single- or multi-chip processor that can execute logic by means of various lines such as address lines, data lines, and control lines and registers and shift registers.
  • Network interfaces such as transceivers may be configured for communication over at least one network such as the Internet, a WAN, a LAN, etc. An interface may be, without limitation, a Wi-Fi transceiver, Bluetooth® transceiver, near field communication transceiver, wireless telephony transceiver, etc.
  • Computer storage may be embodied by computer memories such as disk-based or solid-state storage that are not transitory signals.
  • While the particular robot is herein shown and described in detail, it is to be understood that the subject matter which is encompassed by the present invention is limited only by the claims.

Claims (19)

What is claimed is:
1. A robot, comprising:
a lower body portion on propulsion elements;
an upper body portion coupled to the lower body portion and movable relative to the lower body portion, the upper body portion comprising:
at least one display configured to present an image representing a person remote from the robot, the image being a full-face image.
2. The robot of claim 1, wherein the upper body portion is movable relative to the lower body portion in accordance with motion of the person as indicated by signals received from an imager imaging the person.
3. The robot of claim 1, wherein the full-face image is generated from a profile image of the person.
4. The robot of claim 1, wherein the full-face image is generated from a profile image of the person using a machine learning (ML) model.
5. The robot of claim 4, wherein the ML model is executed by a processor distanced from the robot to generate the full-face image.
6. The robot of claim 1, wherein opposed side surfaces of the upper body portion comprise respective microphones.
7. The robot of claim 1, wherein the robot comprises at least one camera and comprises at least one processor to send images from the camera to a companion robot local to and associated with the person.
8. The robot of claim 1, comprising a motorized vehicle with a recess configured to closely hold the lower body portion to transport the robot, at least one magnet being disposed in the recess to magnetically couple the robot with the motorized vehicle and to charge at least one battery in the robot.
9. The robot of claim 1, comprising at least one speaker configured to play voice signals received from the person.
10. The robot of claim 1, comprising at least one touch sensor on a top surface of the upper body portion to receive touch input for the processor.
11. The robot of claim 1, wherein the propulsion elements comprise micro holonomic drives.
12. A device comprising:
at least one computer storage that is not a transitory signal and that comprises instructions executable by at least one processor to:
for at least a first user, render, from at least one image captured of the first user by at least one imager, a full-face image representing the first user with background and body parts of the first user cropped out of the image representing the first user;
provide, to at least a first robot remote from the first user, the full-face image for presentation thereof on a display of the first robot with the full-face image filling the display;
provide, to the first robot, information from the imager regarding motion of the first user such that a head of the first robot turns to mimic the motion of the first user while continuing to present the full-face image on the display of the first robot regardless of whether the head of the first user turned away from the imager.
13. The device of claim 12, comprising the at least one processor embodied in the first robot.
14. The device of claim 12, comprising the at least one processor embodied in a computing device other than the first robot.
15. The device of claim 12, wherein the instructions are executable to:
provide to the first robot voice signals from the first user to enable the first robot to play the voice signals on at least one speaker of the first robot.
16. The device of claim 12, wherein the first robot comprises a body portion movably engaged with the display, and the instructions are executable to:
provide to the first robot signals from the imager representing below-the-head movement of the first user to cause the body portion of the first robot to move according to the below-the-head movement of the first user.
17. The device of claim 12, wherein the instructions are executable to:
for at least a second user who is local to the first robot, render, from at least one image captured of the second user by at least one imaging device, a full-face image representing the second user with background and body parts of the second user cropped out of the image representing the second user;
provide, to at least a second robot remote from the second user and local to the first user, the full-face image representing the second user for presentation thereof on a display of the second robot with the full-face image representing the second user filling the display of the second robot;
provide, to the second robot, information from the imaging device regarding motion of the second user such that a head of the second robot turns to mimic the motion of the second user while continuing to present a full-face image representing the second user on the display of the second robot regardless of whether the head of the second user turned away from the imaging device.
18. A method, comprising:
for at least a first user, rendering, from at least one image captured of the first user, a full-face image representing the first user with background and body parts of the first user cropped out of the full-face image;
presenting, on at least one display of a first robot remote from the first user, the full-face image representing the first user with the full-face image filling the display of the first robot;
turning a head of the first robot to mimic a head turn of the first user while continuing to present a full-face image representing the first user on the display of the first robot;
for at least a second user local to the first robot, rendering, from at least one image captured of the second user, a full-face image representing the second user with background and body parts of the second user cropped out of the image representing the second user;
presenting, on at least one display of a second robot local to the first user, the full-face image representing the second user with the full-face image representing the second user filling the display of the second robot; and
turning a head of the second robot to mimic a head turn of the second user while continuing to present a full-face image representing the second user on the display of the second robot.
19. The method of claim 18, comprising:
presenting audio generated by the first user on the second robot; and
presenting audio generated by the second user on the first robot.
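Claims 3-5, 12, and 18 recite rendering a full-face image of the sender with the background and body parts cropped out, optionally generated from a profile view by a machine learning model that may run on a processor distanced from the robot. The patent does not disclose a particular detector or network; the sketch below is a minimal illustration of that pipeline, assuming OpenCV's stock Haar face detector for the crop and a placeholder frontalize step standing in for the unspecified frontalization model. The function names and the display resolution are hypothetical.

```python
# Hypothetical sketch only: crop the sender's face out of a camera frame and
# produce a display-filling "full-face" image, as recited in claims 3-5, 12,
# and 18. The detector and the frontalization step are placeholders.
from typing import Optional

import cv2
import numpy as np

# Stock Haar cascade shipped with OpenCV (an assumption; the claims do not
# specify how the face region is located).
_DETECTOR = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

DISPLAY_W, DISPLAY_H = 480, 480  # hypothetical resolution of the robot display


def crop_face(frame: np.ndarray) -> Optional[np.ndarray]:
    """Return the largest detected face region, background and body cropped out."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _DETECTOR.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest box
    return frame[y:y + h, x:x + w]


def frontalize(face: np.ndarray) -> np.ndarray:
    """Placeholder for the ML model of claims 3-5 that synthesizes a frontal
    (full-face) image from a profile view. Here it merely resizes the crop to
    fill the display; a real system would run a generative face-frontalization
    network, possibly on a processor distanced from the robot (claim 5)."""
    return cv2.resize(face, (DISPLAY_W, DISPLAY_H))


def render_full_face(frame: np.ndarray) -> Optional[np.ndarray]:
    """Full pipeline: crop, then frontalize; returns None if no face is found."""
    face = crop_face(frame)
    return frontalize(face) if face is not None else None
```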
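Claims 2, 12, and 16-18 further recite turning the robot's head (or upper body) to mimic the remote user's head motion as reported by the imager, while the display continues to show the full-face image. A minimal sketch of one way to track a streamed head-yaw estimate follows; the gain, the joint limit, and the robot.set_head_yaw actuator call are assumptions, not part of the disclosure.

```python
# Hypothetical sketch only: turn the robot head to mimic the remote user's
# head yaw (claims 2, 12, 16-18) while the display keeps showing the
# full-face image. The gain, joint limit, and actuator call are assumptions.
from dataclasses import dataclass


@dataclass
class HeadState:
    yaw_deg: float = 0.0  # current yaw of the robot head


class HeadMimicController:
    """First-order tracking of the user's head yaw reported by the imager."""

    def __init__(self, gain: float = 0.3, max_yaw_deg: float = 60.0):
        self.gain = gain                # fraction of the error corrected per update
        self.max_yaw_deg = max_yaw_deg  # assumed mechanical limit of the head joint
        self.state = HeadState()

    def update(self, user_yaw_deg: float) -> float:
        """Return the next yaw command for the head actuator."""
        target = max(-self.max_yaw_deg, min(self.max_yaw_deg, user_yaw_deg))
        self.state.yaw_deg += self.gain * (target - self.state.yaw_deg)
        return self.state.yaw_deg


if __name__ == "__main__":
    controller = HeadMimicController()
    for measured_yaw in (0.0, 15.0, 40.0, 80.0):  # example head-pose stream, degrees
        command = controller.update(measured_yaw)
        print(f"user yaw {measured_yaw:5.1f} deg -> head command {command:5.1f} deg")
        # robot.set_head_yaw(command)  # hypothetical actuator API
```

A first-order filter of this kind keeps the mimicry smooth when per-frame pose estimates are noisy; the patent itself does not specify any particular control law.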
US17/390,887 2021-07-31 2021-07-31 Telepresence robot Abandoned US20230030442A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US17/390,887 US20230030442A1 (en) 2021-07-31 2021-07-31 Telepresence robot
TW111126652A TW202309832A (en) 2021-07-31 2022-07-15 Telepresence robot
CN202210867160.0A CN115922657A (en) 2021-07-31 2022-07-22 telepresence robot
JP2022118491A JP2023021207A (en) 2021-07-31 2022-07-26 telepresence robot
EP22187604.8A EP4124416A1 (en) 2021-07-31 2022-07-28 Telepresence robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US17/390,887 US20230030442A1 (en) 2021-07-31 2021-07-31 Telepresence robot

Publications (1)

Publication Number Publication Date
US20230030442A1 true US20230030442A1 (en) 2023-02-02

Family

ID=83050082

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/390,887 Abandoned US20230030442A1 (en) 2021-07-31 2021-07-31 Telepresence robot

Country Status (5)

Country Link
US (1) US20230030442A1 (en)
EP (1) EP4124416A1 (en)
JP (1) JP2023021207A (en)
CN (1) CN115922657A (en)
TW (1) TW202309832A (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9079313B2 (en) * 2011-03-15 2015-07-14 Microsoft Technology Licensing, Llc Natural human to robot remote control
EP3538946B1 (en) * 2016-11-11 2023-02-15 Magic Leap, Inc. Periocular and audio synthesis of a full face image
JP7344894B2 (en) * 2018-03-16 2023-09-14 マジック リープ, インコーポレイテッド Facial expressions from eye-tracking cameras
US10946528B2 (en) * 2018-06-01 2021-03-16 Irepa International, LLC Autonomous companion mobile robot and system
KR102090636B1 (en) * 2018-09-14 2020-03-18 엘지전자 주식회사 Robot, robot system and method for operating the same
JP7119896B2 (en) * 2018-10-24 2022-08-17 トヨタ自動車株式会社 Communication robot and communication robot control program

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999067067A1 (en) * 1998-06-23 1999-12-29 Sony Corporation Robot and information processing system
US6529802B1 (en) * 1998-06-23 2003-03-04 Sony Corporation Robot and information processing system
US8077963B2 (en) * 2004-07-13 2011-12-13 Yulun Wang Mobile robot with a head-based movement mapping scheme
US9766624B2 (en) * 2004-07-13 2017-09-19 Intouch Technologies, Inc. Mobile robot with a head-based movement mapping scheme
US20180147728A1 (en) * 2016-11-30 2018-05-31 Universal City Studios Llc Animated character head systems and methods
US20180229372A1 (en) * 2017-02-10 2018-08-16 JIBO, Inc. Maintaining attention and conveying believability via expression and goal-directed behavior with a social robot
US20180304471A1 (en) * 2017-04-19 2018-10-25 Fuji Xerox Co., Ltd. Robot device and non-transitory computer readable medium
US11059179B2 (en) * 2017-04-19 2021-07-13 Fujifilm Business Innovation Corp. Robot device and non-transitory computer readable medium
US20190321985A1 (en) * 2018-04-18 2019-10-24 Korea Institute Of Industrial Technology Method for learning and embodying human facial expression by robot
US11185990B2 (en) * 2018-04-18 2021-11-30 Korea Institute Of Industrial Technology Method for learning and embodying human facial expression by robot
US20200039077A1 (en) * 2018-08-03 2020-02-06 Anki, Inc. Goal-Based Robot Animation

Also Published As

Publication number Publication date
JP2023021207A (en) 2023-02-10
TW202309832A (en) 2023-03-01
CN115922657A (en) 2023-04-07
EP4124416A1 (en) 2023-02-01

Similar Documents

Publication Publication Date Title
CN110900617B (en) Robot and method for operating the same
US11509817B2 (en) Autonomous media capturing
CN110850983B (en) Virtual object control method and device in video live broadcast and storage medium
US11381775B2 (en) Light field display system for video communication including holographic content
KR102242779B1 (en) Robot and method for operating the same
US11065769B2 (en) Robot, method for operating the same, and server connected thereto
US11783531B2 (en) Method, system, and medium for 3D or 2.5D electronic communication
US20220131914A1 (en) A teleconferencing device
JP2018051701A (en) Communication apparatus
US20220053179A1 (en) Information processing apparatus, information processing method, and program
US11810219B2 (en) Multi-user and multi-surrogate virtual encounters
US20230030442A1 (en) Telepresence robot
JP2017164854A (en) Robot and program
WO2023181808A1 (en) Information processing device, information processing method, and recording medium
CN111278611A (en) Information processing apparatus, information processing method, and program
Fujita 17.1 AI x Robotics: Technology Challenges and Opportunities in Sensors, Actuators, and Integrated Circuits
US11710025B2 (en) Holodouble: systems and methods for low-bandwidth and high-quality remote visual communication
US20220301130A1 (en) Image processing device, telepresence system, image processing method, and image processing program
CN116931728A (en) Multimode man-machine interaction control system and method for humanoid robot

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: SONY INTERACTIVE ENTERTAINMENT INC., JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:OGISHITA, NAOKI;TSUKAHARA, TSUBASA;IIDA, FUMIHIKO;AND OTHERS;SIGNING DATES FROM 20220531 TO 20220707;REEL/FRAME:060470/0954

Owner name: SONY CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:OGISHITA, NAOKI;TSUKAHARA, TSUBASA;IIDA, FUMIHIKO;AND OTHERS;SIGNING DATES FROM 20220531 TO 20220707;REEL/FRAME:060471/0049

Owner name: SONY INTERACTIVE ENTERTAINMENT LLC, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:OGISHITA, NAOKI;TSUKAHARA, TSUBASA;IIDA, FUMIHIKO;AND OTHERS;SIGNING DATES FROM 20220531 TO 20220707;REEL/FRAME:060471/0049

Owner name: SONY INTERACTIVE ENTERTAINMENT INC., JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:OGISHITA, NAOKI;TSUKAHARA, TSUBASA;IIDA, FUMIHIKO;AND OTHERS;SIGNING DATES FROM 20220531 TO 20220707;REEL/FRAME:060471/0049

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION