KR20110125524A

KR20110125524A - System for object learning through multi-modal interaction and method thereof

Info

Publication number: KR20110125524A
Application number: KR1020100045098A
Authority: KR
Inventors: 김수환; 박성기; 김동환
Original assignee: 한국과학기술연구원
Priority date: 2010-05-13
Filing date: 2010-05-13
Publication date: 2011-11-21
Also published as: KR101100240B1

Abstract

PURPOSE: An object learning system for a robot using multimodal interaction, and a method thereof, are provided so that a user can teach object information to the robot while communicating with it through multimodal interaction. CONSTITUTION: An object extraction means (122) extracts an object from a video. An interaction means performs interaction with a user. The robot learns the object using the object information collected by the interaction means and the object information extracted by the object extraction means. A gesture recognition means (120) recognizes the user's gesture in the video. The robot learns the object using the object information collected by the gesture recognition means and the interaction means.

Description

System for object learning through multi-modal interaction and method

The present invention relates to a system and method by which a robot learns objects in its environment through multimodal interaction with a user. More specifically, it relates to a system and method in which the robot recognizes the user's gestures using an image input device, finds in the environment the object the user wants to register, learns that object, and stores it in an object database and on an environment map.

Research on robots that learn objects in their environment and store them on an environment map is being actively pursued. Gestures used for robot object learning mainly involve the hands or fingers. Kahn et al. used a gesture of pointing by hand at the object to be registered (Roger E. Kahn, Michael J. Swain, Peter N. Prokopowicz, and R. James Firby, “Gesture Recognition Using the Perseus Architecture,” Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 734-741, 1996.); Roth et al. picked up the object by hand (P. M. Roth, M. Donoser, H. Bischof, “On-line Learning of Unknown Hand Held Objects via Tracking,” Proceedings of International Conference on Computer Vision Systems, 2006.); and Arsenio published a study in which a hand or finger is waved in front of the object (Artur M. Arsenio, “Figure/Ground Segregation from Human Cues,” Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 4, pp. 3244-3250, 2004.).

In addition, Korean Patent Application Publication No. 10-2005-0029161 (published March 24, 2005) discloses an object recognition system using a spatial interaction device and a method thereof.

The disclosed invention comprises: a SIDOR module (Spatial Interaction Device Object Recognition Module) that enables object recognition by acquiring, through user interaction, a plurality of images according to the appearance of an object; an image input unit for inputting the plurality of images acquired from the SIDOR module into a computer; an image stitching unit that joins the input images to form a stitched image and deletes the background of the stitched image; a template image storage unit that stores hue histograms of a number of template images to be compared with the stitched image; and an image comparison unit that performs object recognition by comparing the stitched image with the template images. According to the disclosed invention, the user directly points at the object to be recognized and the images are stitched through the user's interaction, so no time is needed to search the whole image for the object; this reduces the computer's object recognition workload and speeds up recognition.

However, the object learning methods based on user-robot interaction in the studies cited above assume that the robot is always looking at the user, and they are limited in that they are systems for teaching only objects of a particular size, without defining different gestures according to object size.

In addition, the prior art above has the problem that the user must indicate the object of interest in space through a space mouse, which is a separate spatial interaction device.

Accordingly, there is a need for a system and method in which a robot recognizes a person's face and hands and, through the person's gestures, automatically finds and learns an object of interest in space.

Korean Patent Application Publication No. 10-2005-0029161

Roger E. Kahn, Michael J. Swain, Peter N. Prokopowicz, and R. James Firby, “Gesture Recognition Using the Perseus Architecture,” Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 734-741, 1996.
P. M. Roth, M. Donoser, H. Bischof, “On-line Learning of Unknown Hand Held Objects via Tracking,” Proceedings of International Conference on Computer Vision Systems, 2006.
Artur M. Arsenio, “Figure/Ground Segregation from Human Cues,” Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 4, pp. 3244-3250, 2004.

The present invention is intended to solve the problems of the prior art described above. An object of the invention is to provide a system and method in which, so that a robot can learn objects in its environment and store them on a map, a user teaches those objects to the robot while communicating with it through multimodal interaction.

Another object of the invention is to provide a system and method that define a different gesture for each size class of the object to be learned (large/medium/small), so that objects of various sizes can be taught to the robot easily and conveniently.

Yet another object of the invention is to provide a system and method in which the user can tell the robot the size of the object to be learned in advance by voice, and in which the robot can also estimate the size of the object automatically from the user's gesture alone and perform learning accordingly.

As a technical means of achieving the above objects, the invention presents an object learning system for a robot through multimodal interaction with a user, and a method thereof, comprising: a speaker/voice recognition unit that recognizes the direction of the speaker, the user, and voice commands through a sound input device such as a microphone; a voice expression unit that carries on spoken dialogue through a sound output device such as a loudspeaker; a user recognition and tracking unit that recognizes and tracks the user's face and hands through a robot head rotation device such as a pan/tilt unit and an image input device such as a camera; a gesture recognition unit that recognizes the user's gesture and estimates the position of the learning object in space; an object detection unit that segments the object more precisely within the part of the image where the object is estimated to be; an object learning unit that learns the detected object and stores it in an information storage device such as a database and on an environment map; and a multi-modal control unit that controls the above units through dialogue with the user and feeds results back to the user through an image output device such as a monitor.

The invention also presents an object learning system for a robot through multimodal interaction with a user, and a method thereof, in which voice and video signals are used for the multimodal interaction: the user gives the robot an object learning command by voice, the robot interprets and carries out the command, the robot acquires information about the object through dialogue, and the object learning result is fed back through the dialogue.

The invention further presents an object learning system for a robot through multimodal interaction with a user, and a method thereof, in which the user's gesture is recognized from the video signal so that the object the user wants to register is found automatically in the environment; a separate gesture for indicating the object is defined for each size class (large/medium/small) so that objects can be found efficiently in the environment; and the object's size can either be told to the robot directly by the user during the dialogue or be recognized automatically by the robot, which determines from the user's gesture alone which size of object the user is currently trying to teach.

As an alternative to the existing method, in which the user manually photographs each object in the environment, crops or marks only the part corresponding to the object, teaches it to the robot, and stores it on the environment map, the invention has the effect of letting the user easily teach objects in the environment to the robot while communicating with it through multimodal interaction.

FIG. 1 is a schematic block diagram of an embodiment of the object learning system through multimodal interaction between a user and a robot according to the present invention.
FIG. 2 is a schematic diagram of the finite state machine inside the multi-modal control unit, which is a main part of the embodiment of the present invention.
FIG. 3 is a flowchart illustrating an embodiment of the object learning method through multimodal interaction between a user and a robot according to the present invention.
FIG. 4 is an explanatory diagram of the gesture method according to object size (large/medium/small) in the embodiment of FIG. 3.
FIG. 5 is an exemplary view of the gesture method according to object size (large/medium/small) of FIG. 4.
FIG. 6 is a flowchart illustrating the process by which the gesture recognition unit automatically estimates the size (large/medium/small) of an object in an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of the object learning system through multimodal interaction between a user and a robot. As shown in FIG. 1, the invention is applied to a robot system that includes a sound input unit (100) for receiving the user's voice signal, a sound output unit (102) for outputting sound, an image output unit (104) for outputting images, an image capturing unit (106) for capturing external images, a pan and tilt unit (108) for changing the direction of the image capturing unit (106), a storage unit (110) for storing information, and an environment map unit (112) for creating information about the surrounding environment.

In this robot system, the invention comprises: a multi-modal control unit (114); a speaker and voice recognition unit (116) that recognizes the speaker's direction and voice from the input of the sound input unit (100) and passes them to the multi-modal control unit (114); a user recognition and tracking unit (118) that receives a control signal from the multi-modal control unit (114), sends a control signal to the pan and tilt unit (108) to turn the image capturing unit (106) toward the user, recognizes the user's face and hands, and tracks their positions; a gesture recognition unit (120) that receives the image input from the image capturing unit (106) and recognizes the user's gesture based on the face and hand position information received from the user recognition and tracking unit (118); an object detection unit (122) that extracts the object from the input image based on the recognition information of the gesture recognition unit (120); an object learning unit (124) that learns the object from the object image extracted by the object detection unit (122), stores it in the storage unit (110), and creates the environment map based on what was learned and stores it in the environment map unit (112); and a voice expression unit (126) that outputs speech to the sound output unit (102) according to the control signal of the multi-modal control unit (114).

The speaker and voice recognition unit (116) receives the user's voice command from the sound input unit (100), such as a microphone, and recognizes which direction the user is in, which draws the robot's attention and helps it focus on the user; it can further recognize who the user is and what the voice command is. The user's direction, the user information, and the voice command recognized by the speaker and voice recognition unit (116) are passed to the multi-modal control unit (114).

The voice expression unit (126) conveys the dialogue content received from the multi-modal control unit (114) to the user as speech through the sound output unit (102), such as a loudspeaker, and together with the speaker and voice recognition unit (116) makes dialogue-based interaction between the user and the robot possible.

The multi-modal control unit (114) acts as the central controller that controls and monitors each component of the object learning system described above. That is, it passes the user direction received from the speaker and voice recognition unit (116) to the user recognition and tracking unit (118), which controls the pan and tilt unit (108) to rotate the robot's head toward the user; it interprets the voice command received from the user to determine and carry out what the user currently wants; and it uses the voice expression unit (126) to lead a natural conversation with the user. The image output unit (104), such as a monitor, can also be used to present various information to the user as images as well as speech. In addition, the multi-modal control unit (114) issues commands to start and stop the user recognition and tracking unit (118), the gesture recognition unit (120), the object detection unit (122), the object learning unit (124), and so on, and it interprets error messages from these components and takes appropriate action.

The user recognition and tracking unit (118) recognizes the user's face and hands in the image received from the image capturing unit (106) and keeps tracking them whenever the user moves, using the pan and tilt unit (108) to keep rotating the head so that it follows the user.

The gesture recognition unit (120) receives the three-dimensional positions of the user's face and hands from the user recognition and tracking unit (118), interprets the user's gesture to determine the user's intention, and estimates the approximate position in the environment of the object the user wants to teach the robot. The approximate position can be expressed as a region of interest, such as a rectangle, in the image.
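As a rough illustration of the region-of-interest idea, the sketch below builds a rectangular ROI around a pixel location and clips it to the image bounds. The names `Roi` and `roi_around`, the fixed half-size, and the assumption that the indicated position has already been projected into image coordinates are all hypothetical; the description does not specify how the rectangle is computed.

```python
from typing import NamedTuple


class Roi(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (right/bottom exclusive)."""
    left: int
    top: int
    right: int
    bottom: int


def roi_around(cx: int, cy: int, half: int, width: int, height: int) -> Roi:
    """Return a square ROI centered on (cx, cy), clipped to the image.

    `half` is a hypothetical half-size in pixels; a real system would
    presumably choose it from the gesture and the object's size class.
    """
    left = max(0, cx - half)
    top = max(0, cy - half)
    right = min(width, cx + half)
    bottom = min(height, cy + half)
    return Roi(left, top, right, bottom)
```

The object detection unit (122) would then only need to segment the object inside such a rectangle rather than searching the whole frame.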

The object detection unit (122) automatically segments the learning object in finer detail within the image region where the object is estimated to be. The segmented object is passed to the multi-modal control unit (114) and displayed on the image output unit (104) so that the user can confirm that it is the object being indicated in the environment.

The object learning unit (124) extracts features such as texture feature points, color, and shape from the object region detected by the object detection unit (122), learns them, stores the result in the storage unit (110), such as a database, and marks the object on the environment map in the environment map unit (112).
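One way to picture what the object learning unit (124) stores is a record holding the extracted features plus a map position, written both to a database and to the environment map. The sketch below is only an assumption-laden illustration: the class names `LearnedObject` and `ObjectStore`, the field types, and the in-memory dictionaries standing in for the storage unit (110) and the environment map unit (112) are hypothetical and not taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class LearnedObject:
    name: str
    feature_points: list   # e.g. texture keypoint descriptors (format assumed)
    color: tuple           # e.g. a mean (R, G, B) value (format assumed)
    shape: str             # e.g. a coarse shape label (format assumed)
    map_position: tuple    # (x, y) on the environment map (format assumed)


@dataclass
class ObjectStore:
    # Stand-in for the storage unit (110), keyed by object name.
    database: dict = field(default_factory=dict)
    # Stand-in for the environment map unit (112): name -> map position.
    map_markers: dict = field(default_factory=dict)

    def learn(self, obj: LearnedObject) -> None:
        """Store the record and mark the object on the map."""
        self.database[obj.name] = obj
        self.map_markers[obj.name] = obj.map_position
```

For example, `ObjectStore().learn(LearnedObject("mug", [], (200, 30, 30), "cylinder", (1.5, 2.0)))` would register a mug and place a marker for it at map position (1.5, 2.0).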

FIG. 2 illustrates the object learning method of the invention as carried out by the finite state machine contained in the multi-modal control unit (114).

As shown in FIG. 2, in the object learning method of the invention, the state of the multi-modal control unit (114) transitions in response to events sent from the units connected to it.

The waiting step (S100) is the initial state of the multi-modal control unit (114), from which object learning through multimodal interaction between the user and the robot begins.

The face and hand tracking step (S102) is the state entered from the waiting step when the user gives the robot a learning start command and the speaker and voice recognition unit (116) passes a learning start signal to the multi-modal control unit (114). In this state the multi-modal control unit (114) sends a command to start recognition and tracking to the user recognition and tracking unit (118), so that the robot recognizes the user's face and hands and tracks them continuously.

If, in the face and hand tracking step, the user recognition and tracking unit (118) fails to recognize or to track the user's face and hands, a failure signal is passed to the multi-modal control unit (114), which transitions to the error state.

The gesture recognition step (S104) is the state entered from the face and hand tracking step (S102) when, while the robot is attending to the user, the user gives the robot an object registration start command and the speaker and voice recognition unit (116) passes the start signal to the multi-modal control unit (114). In this state the gesture recognition unit (120) recognizes the user's gesture.

If the gesture recognition unit (120) fails to recognize the user's gesture, a failure signal is passed to the multi-modal control unit (114), which transitions to the error state.

The object detection step (S106) is the state entered from the gesture recognition step (S104) when the gesture recognition unit (120) successfully recognizes the user's gesture and succeeds in estimating the approximate position of the learning object in the environment. In this state the object detection unit (122) detects the object by segmenting it from the image.

If the object detection unit (122) fails to segment the learning object in the image, or segments it incorrectly, a failure signal is passed to the multi-modal control unit (114), which transitions to the error state.

The object learning step (S108) is the state entered from the object detection step (S106) when object detection succeeds. In this state the object learning unit (124) performs object learning.

When object learning succeeds in the object learning step (S108), the state returns to the initial waiting step (S100) and remains there until the user commands the learning of another object.

The error state is reached from the steps above when face and hand tracking, gesture recognition, or object detection fails. In this state the failure is reported to the user through the voice expression unit (126), the image output unit (104), and so on, appropriate error handling is performed, and the state then transitions to the initial waiting step (S100).
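The state transitions described above can be sketched as a small lookup table. This is a minimal illustration, not the implementation in the multi-modal control unit (114): the event names (`learn_start`, `register_start`, and so on) are hypothetical labels for the signals described in the text, and ignoring events that have no listed transition is an assumption made here for simplicity.

```python
from enum import Enum, auto


class State(Enum):
    WAITING = auto()     # waiting step (S100)
    TRACKING = auto()    # face and hand tracking step (S102)
    GESTURE = auto()     # gesture recognition step (S104)
    DETECTION = auto()   # object detection step (S106)
    LEARNING = auto()    # object learning step (S108)
    ERROR = auto()       # error state


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.WAITING, "learn_start"): State.TRACKING,
    (State.TRACKING, "register_start"): State.GESTURE,
    (State.GESTURE, "gesture_ok"): State.DETECTION,
    (State.DETECTION, "detect_ok"): State.LEARNING,
    (State.LEARNING, "learn_ok"): State.WAITING,
    # Failures in tracking, gesture recognition, or detection lead to ERROR.
    (State.TRACKING, "failure"): State.ERROR,
    (State.GESTURE, "failure"): State.ERROR,
    (State.DETECTION, "failure"): State.ERROR,
    # After reporting and handling the error, return to waiting.
    (State.ERROR, "error_handled"): State.WAITING,
}


def step(state: State, event: str) -> State:
    """Return the next state; events with no listed transition are ignored."""
    return TRANSITIONS.get((state, event), state)
```

Feeding the events `learn_start`, `register_start`, `gesture_ok`, `detect_ok`, and `learn_ok` in order walks the machine from the waiting step through S102 to S108 and back to waiting.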

FIG. 3 is a flowchart illustrating an example of object learning through multimodal interaction between a user and a robot according to an embodiment of the object learning method of the invention.

When the user commands object learning by saying “Learn a new object.”, the speaker and voice recognition unit (116) estimates the speaker's direction, interprets the command, and passes it to the multi-modal control unit (114) (S200). The multi-modal control unit (114) commands the user recognition and tracking unit (118) to attend to the user (S202). In response, the user recognition and tracking unit (118) recognizes the speaker's face and hands (S204) and then, under the control of the multi-modal control unit (114), starts tracking them (S206). If recognition or tracking of the face and hands fails, the state transitions to the error state.

The robot asks the user “What is the object's name?” through the voice expression unit (126) (S208) and receives the reply “This is XXX.” through the speaker and voice recognition unit (116) (S210). It then asks “How big is the object?” through the voice expression unit (126) (S212) and receives the reply “It is a large (or medium, or small) object.” through the speaker and voice recognition unit (116) (S214). The robot answers “Okay, I'm ready.” through the voice expression unit (126) (S216). The size can be asked of the user directly in this way, but, as described later, the robot can also watch the user's gesture and determine automatically whether the user wants to register a large, medium, or small object.
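The size question in S212 to S214 amounts to mapping a spoken reply onto one of three size classes. The sketch below assumes the reply has already been transcribed to English text; the function name `parse_size` and the simple keyword matching are hypothetical and stand in for whatever the speaker and voice recognition unit (116) actually does. Returning `None` models the case where no size is stated, in which the robot could fall back on estimating the size from the gesture.

```python
from typing import Optional

SIZE_CLASSES = ("large", "medium", "small")


def parse_size(reply: str) -> Optional[str]:
    """Return the size class named in the reply, or None if none is named."""
    # Normalize case and strip simple punctuation before splitting into words.
    words = reply.lower().replace(".", " ").replace(",", " ").split()
    for size in SIZE_CLASSES:
        if size in words:
            return size
    return None
```

With this sketch, `parse_size("It is a large object.")` yields `"large"`, while a reply that names no size, such as `"This is a mug."`, yields `None`.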

While the robot keeps tracking the user's face and hands, it receives the voice command “Register this object.” from the user through the speaker and voice recognition unit (116) (S218). The multi-modal control unit (114) commands the gesture recognition unit (120) to recognize the gesture (S220). If gesture recognition fails, the state transitions to the error state; if it succeeds (S222), the multi-modal control unit (114) commands the object detection unit (122) to extract the learning object from the image (S224). The object detection unit (122) completes the extraction (S226). The extracted object image is shown on the image output unit (104), and the robot asks “Is this the right object?” through the voice expression unit (126) (S228). The user's reply “Yes, that is the right object.” is received through the speaker and voice recognition unit (116) (S230). If the user instead answers “No.”, object extraction is treated as failed and the state transitions to the error state.

The multi-modal control unit 114 commands the object learning unit 124 to learn the object (S232). The object learning unit 124 completes the object learning (S234). When object learning is complete, the system returns to the initial standby state, replies "Learning completed successfully." through the voice expression unit 126, and ends the entire learning process (S236).
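The dialogue flow of steps S208 to S236 can be read as a small state machine. The following is a minimal sketch under stated assumptions: only the standby and error states are named in the description, so every other state name, every event name, and the `step` and `run` helpers are hypothetical and chosen for illustration only.

```python
from enum import Enum, auto


class State(Enum):
    STANDBY = auto()              # initial state named in the description
    AWAIT_NAME = auto()           # hypothetical: name question asked (S208)
    AWAIT_SIZE = auto()           # hypothetical: size question asked (S212)
    READY = auto()                # hypothetical: robot has said it is ready (S216)
    RECOGNIZING_GESTURE = auto()  # hypothetical: gesture recognition commanded (S220)
    EXTRACTING_OBJECT = auto()    # hypothetical: object extraction commanded (S224)
    AWAIT_CONFIRMATION = auto()   # hypothetical: confirmation question asked (S228)
    LEARNING = auto()             # hypothetical: object learning commanded (S232)
    ERROR = auto()                # error state named in the description


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.AWAIT_NAME, "name_received"): State.AWAIT_SIZE,                    # S210
    (State.AWAIT_SIZE, "size_received"): State.READY,                         # S214
    (State.READY, "register_command"): State.RECOGNIZING_GESTURE,             # S218
    (State.RECOGNIZING_GESTURE, "gesture_ok"): State.EXTRACTING_OBJECT,       # S222
    (State.RECOGNIZING_GESTURE, "gesture_failed"): State.ERROR,
    (State.EXTRACTING_OBJECT, "object_extracted"): State.AWAIT_CONFIRMATION,  # S226
    (State.AWAIT_CONFIRMATION, "user_yes"): State.LEARNING,                   # S230
    (State.AWAIT_CONFIRMATION, "user_no"): State.ERROR,
    (State.LEARNING, "learning_done"): State.STANDBY,                         # S234, S236
}


def step(state, event):
    """Return the next state; unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, start=State.AWAIT_NAME):
    """Feed a sequence of events through the state machine and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state
```

In this sketch a successful session ends back in `State.STANDBY`, while a failed gesture recognition or a "No." answer at the confirmation step ends in `State.ERROR`.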

FIG. 4 shows examples of gesture methods according to object size (large/medium/small).

Small objects such as books, pencil sharpeners, desk clocks, mugs, and dolls are small and light, so the user places the object on a hand or picks it up to tell the robot that it is the object to be learned.

Medium objects such as wall clocks, picture frames, TVs, and dishwashers are too heavy to lift or are out of reach, so the user points at the object by hand to indicate it to the robot.

Large objects such as desks, tables, sofas, beds, and refrigerators are too large to fit in a single image frame, so the robot must turn its head to see them. The user therefore touches the left and right ends of the object by hand to show the robot where the object begins and ends.

FIG. 5 shows an example of the gesture method according to object size (large/medium/small) in the embodiment of the present invention described above.

For a small object, the face and the hand appear in the same image frame, so one frame is enough to capture the object.

For a medium object, the face and the hand appear in one image frame, but the object being pointed at may appear in a different frame, so one or two frames are needed to capture the object.

For a large object, the robot must follow the person and capture both ends of the object, so multiple images are needed.

FIG. 6 is a flowchart illustrating the process by which the gesture recognition unit 120 automatically estimates the object size (large/medium/small).

The distance between the face and the hand is compared (S300) using the three-dimensional positions of the user's face and hand received from the user recognition and tracking unit 118. If the distance is smaller than a certain distance, it is determined that the user is holding or picking up a small object to show it to the robot, and small object region estimation (S302) is performed. That is, the area above or below the hand is searched, and the region containing pixels located close to the hand's position in the camera's disparity map is taken as the region where the object is presumed to exist.
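A minimal sketch of the small object region estimation (S302) follows. It assumes that "close to the hand's position in the disparity map" means a disparity value within a tolerance of the hand's disparity, and that the search is limited to a fixed window of rows above and below the hand pixel. The function name, the `window` and `tolerance` parameters, and the bounding-box output are hypothetical and are not specified in the description.

```python
def small_object_region(disparity, hand_row, hand_col, window=2, tolerance=1.0):
    """Return (top, left, bottom, right) of the pixels near the hand whose
    disparity is within `tolerance` of the hand pixel's disparity.

    `disparity` is a 2D list of numbers indexed as disparity[row][col].
    """
    rows, cols = len(disparity), len(disparity[0])
    hand_disp = disparity[hand_row][hand_col]

    # Search a window above and below (and beside) the hand, clipped to the image.
    matches = []
    for r in range(max(0, hand_row - window), min(rows, hand_row + window + 1)):
        for c in range(max(0, hand_col - window), min(cols, hand_col + window + 1)):
            if abs(disparity[r][c] - hand_disp) <= tolerance:
                matches.append((r, c))

    # The hand pixel itself always matches, so `matches` is never empty.
    match_rows = [r for r, _ in matches]
    match_cols = [c for _, c in matches]
    return (min(match_rows), min(match_cols), max(match_rows), max(match_cols))
```

For example, with the hand at row 3, column 2 of a 5x5 map in which a block of high-disparity pixels sits just above the hand, the returned box covers the hand pixel and that block.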

If the comparison of the distance between the face and the hand (S300) shows that the distance is greater than the certain distance, it is assumed that the user has extended a hand to point at the object or to touch one end of it, and the object is judged to be a medium or large object.

Medium and large objects are in turn distinguished by determining whether the user moves after extending the hand (S304). If the user extends the hand and then stays still, the user is taken to be pointing at a medium object, and the three-dimensional vector of the direction in which the hand points is calculated to estimate the region in space where the medium object may exist (S306).
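The description does not state how the three-dimensional pointing vector of S306 is computed. The sketch below assumes, purely for illustration, that the direction runs from the face position through the hand position, and that candidate object locations are sampled along that ray at given distances beyond the hand. Both function names and the sampling approach are hypothetical.

```python
import math


def pointing_direction(face, hand):
    """Unit vector from the face position to the hand position (assumed pointing direction)."""
    delta = [h - f for f, h in zip(face, hand)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm == 0:
        raise ValueError("face and hand positions coincide")
    return tuple(d / norm for d in delta)


def candidate_points(hand, direction, distances):
    """Points along the pointing ray, starting at the hand, at each given distance."""
    return [tuple(h + t * d for h, d in zip(hand, direction)) for t in distances]
```

With the face at the origin and the hand at (3, 0, 4), the direction is approximately (0.6, 0, 0.8), and a candidate point 5 units beyond the hand lies at approximately (6, 0, 8).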

If the user extends a hand to point, then moves and extends a hand to point again from another location, the user is taken to have touched both ends of a large object. The two end points in three-dimensional space are projected onto the floor, and the region defined by the resulting four points is estimated to be the space where the large object is located (S308).
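The decision flow of FIG. 6 (S300, S304) and the large object region of S308 can be sketched as follows. The threshold value, the function names, and the coordinate convention (z axis pointing up, floor at z = 0, units in meters) are assumptions for illustration. The description only says "a certain distance" and does not fix a coordinate frame.

```python
import math


def classify_size(face, hand, moved_after_reaching, reach_threshold=0.4):
    """Classify the object size from the face-hand distance and the user's movement.

    reach_threshold is a hypothetical value for the "certain distance" of S300.
    """
    if math.dist(face, hand) < reach_threshold:
        return "small"  # S300 -> S302: object held close to the body
    # S304: a user who moves after reaching is marking both ends of a large object.
    return "large" if moved_after_reaching else "medium"


def large_object_region(end_a, end_b, floor_z=0.0):
    """Return the four points of S308: the two touched end points and
    their projections onto the floor plane z = floor_z."""
    proj_a = (end_a[0], end_a[1], floor_z)
    proj_b = (end_b[0], end_b[1], floor_z)
    return [end_a, end_b, proj_b, proj_a]
```

Returning the corners in the order end A, end B, projection of B, projection of A keeps them in perimeter order, which is convenient if the quadrilateral is later drawn or tested for containment.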

100: sound input unit
102: sound output unit
114: multi-modal control unit
116: speaker and voice recognition unit
118: user recognition and tracking unit
120: gesture recognition unit
122: object detection unit
124: object learning unit
126: voice expression unit

Claims

In a robot system that receives a voice signal and an image of a speaker and can recognize the received voice and image, an object learning system of a robot using multi-modal interaction, comprising:
object extraction means for extracting an object from the received image; and
interaction means for interacting with the speaker,
wherein the object is learned using object information extracted by the object extraction means and object information collected by the interaction means.

The system according to claim 1,
further comprising gesture recognition means for recognizing a gesture of the speaker in the received image,
wherein the object is learned using the object information extracted by the object extraction means, the object information collected by the interaction means, and object information collected by the gesture recognition means.

The system according to claim 1,
wherein the object information extracted by the object extraction means is shape information of the object, and
the object information collected by the interaction means is name and size information of the object.

The system according to claim 2,
wherein the object information extracted by the object extraction means is shape information of the object,
the object information collected by the interaction means is name information of the object, and
the object information collected by the gesture recognition means is size information of the object.

The system according to claim 1 or 2,
wherein the interaction means is means for conducting a dialogue with the speaker.

The system according to claim 2,
wherein the gesture recognition means is means for recognizing a gesture of the speaker's face and hands.

An object learning method of a robot using multi-modal interaction, the method comprising:
recognizing, by the robot, a speaker's direction and voice;
capturing an image by turning an image capturing unit of the robot toward the speaker according to the recognized speaker direction and voice;
collecting, by the robot, object information through interaction with the speaker;
extracting, by the robot, an object from the image; and
learning, by the robot, the object based on the object information collected through interaction with the speaker and the object information extracted from the image.

An object learning method of a robot using multi-modal interaction, the method comprising:
recognizing, by the robot, a speaker's direction and voice;
capturing an image by turning an image capturing unit of the robot toward the speaker according to the recognized speaker direction and voice;
collecting, by the robot, object information by recognizing a gesture of the speaker;
collecting, by the robot, object information through interaction with the speaker;
extracting, by the robot, an object from the image; and
learning, by the robot, the object based on the object information collected from the gesture, the object information collected through interaction with the speaker, and the object information extracted from the image.

The method according to claim 7,
wherein the object information extracted from the image is shape information of the object, and
the object information collected through interaction with the speaker is name and size information of the object.

The method according to claim 8,
wherein the object information collected from the gesture is size information of the object,
the object information extracted from the image is shape information of the object, and
the object information collected through interaction with the speaker is name information of the object.

The method according to claim 8,
wherein collecting, by the robot, object information by recognizing the gesture of the speaker comprises:
comparing, by the robot, the distance between the speaker's face and hand based on the positions of the face and the hand in three-dimensional space; and
recognizing the object as a small object if the distance between the face and the hand is smaller than a certain distance, and as a medium or large object if the distance is greater than the certain distance.

The method according to claim 8,
wherein collecting, by the robot, object information by recognizing the gesture of the speaker comprises:
determining whether the speaker moves after extending a hand; and
recognizing the object as a medium object if the speaker does not move after extending the hand, and as a large object if the speaker moves.