KR20140132231A

KR20140132231A - method and apparatus for controlling mobile in video conference and recording medium thereof

Info

Publication number: KR20140132231A
Application number: KR1020130051501A
Authority: KR
Inventors: 김용태; 김성기; 김재한; 백윤선
Original assignee: 삼성전자주식회사
Priority date: 2013-05-07
Filing date: 2013-05-07
Publication date: 2014-11-17
Also published as: KR101892268B1

Abstract

Disclosed is a technology for controlling a terminal in a video conference. A speaker is determined based on parameters, generated at the terminal of each user participating in the video conference, from which the user who is currently speaking can be identified. The image of the determined speaker is then output together with a wide-view image of the conference room, which improves concentration on the conference.

Description

Method and apparatus for controlling a terminal during a video conference, and recording medium thereof

The present invention relates to a method and apparatus for transmitting images of conference participants during a video conference.

Advances in image processing technology have made real-time video transmission possible, and the technology is now applied in many fields. For example, meetings that could once be held only in person can now be conducted online using video equipment.

Conventionally, during a video conference, a remote camera photographs the participants to obtain a wide-view image showing the entire conference room. A wide-view image can show the overall appearance of the room, but it cannot accurately depict each individual participant, so it is difficult to recognize who is currently speaking.

This conventional approach makes it hard to focus on the speaker during a video conference, which reduces the participants' engagement.

The present invention relates to a method, an apparatus, and a recording medium that address the reduced engagement in a video conference that results when only an image of the whole conference is captured and images of individual speakers are therefore difficult to obtain.

One embodiment of the present invention is a terminal control method comprising: generating, based on at least one of video and voice of a user acquired at the terminal, a parameter for determining whether the user is the speaker who is currently speaking; transmitting the generated parameter to a host device; receiving a request for an image of the current speaker, which the host device sends to the terminal of the user it has determined to be the current speaker using a plurality of parameters received from a plurality of terminals; and transmitting the image of the determined speaker to the host device, wherein the transmitted image is output by the host device together with a wide-view image acquired by the host device.

In the terminal control method, generating the parameter comprises: detecting the user's face region in the acquired image of the user; and generating the parameter by extracting, from the detected face region, motion information of the mouth region from which it can be determined whether the user is speaking.

In the terminal control method, generating the parameter comprises: acquiring the user's voice signal at the terminal; and generating the parameter based on the acquired voice signal, wherein the user of the terminal that transmitted the parameter generated from the largest voice signal among the plurality of parameters is determined to be the current speaker.

In the terminal control method, generating the parameter further comprises: when there are multiple speakers, separating the plurality of voice signals input to the terminal; and generating a parameter for each of the separated voice signals.

The terminal control method further comprises periodically transmitting the generated parameter to the host device, and when the current speaker is not determined, the image of the most recently determined speaker is output.

One embodiment of the present invention is a terminal control method of a host device, the method comprising: acquiring a wide-view image of users; receiving, from at least one terminal, parameters from which it can be determined whether the user of each terminal is the speaker who is currently speaking; requesting, based on the received parameters, an image of the determined speaker from the terminal of the user determined to be the speaker; and receiving the requested image of the speaker and outputting it together with the wide-view image.

In the terminal control method of the host device, requesting the image of the determined speaker further comprises: extracting the user's mouth region information from the parameter; and determining the current speaker based on the extracted mouth region information.

The terminal control method of the host device further comprises: receiving a plurality of parameters from a plurality of terminals; detecting, among the plurality of parameters, the parameter generated from the largest voice signal; and determining the current speaker based on the detected parameter, wherein the user of the terminal that transmitted the parameter generated from the largest voice signal is determined to be the current speaker.

In the terminal control method of the host device, when a plurality of terminals belong to users determined to be speakers, the requesting step requests the image of the determined speaker from each of the plurality of terminals.

The terminal control method of the host device further comprises periodically receiving the parameters, and when the current speaker is not determined, the image of the most recently determined speaker is requested.

One embodiment of the present invention is a terminal control apparatus comprising: a generation unit that generates, based on at least one of video and voice of a user acquired at the terminal, a parameter for determining whether the user is the speaker who is currently speaking; a parameter transmission unit that transmits the generated parameter to a host device; an image request reception unit that receives a request for an image of the current speaker, which the host device sends to the terminal of the user it has determined to be the current speaker using a plurality of parameters received from a plurality of terminals; and an image transmission unit that transmits the image of the determined speaker to the host device, wherein the transmitted image is output by the host device together with a wide-view image acquired by the host device.

In the terminal control apparatus, the generation unit detects the user's face region in the acquired image of the user and generates the parameter by extracting, from the detected face region, motion information of the mouth region from which it can be determined whether the user is speaking.

In the terminal control apparatus, the generation unit acquires the user's voice signal at the terminal and generates the parameter based on the acquired voice signal, and the user of the terminal that transmitted the parameter generated from the largest voice signal among the plurality of parameters is determined to be the current speaker.

In the terminal control apparatus, when there are multiple speakers, the generation unit separates the plurality of voice signals input to the terminal and generates a parameter for each of the separated voice signals.

The terminal control apparatus periodically transmits the generated parameter to the host device, and when the current speaker is not determined, the image of the most recently determined speaker is output.

One embodiment of the present invention is a host device for controlling a terminal, the host device comprising: a wide-view image acquisition unit that acquires a wide-view image of users; a parameter reception unit that receives, from at least one terminal, parameters from which it can be determined whether the user of each terminal is the speaker who is currently speaking; a request unit that requests, based on the received parameters, an image of the determined speaker from the terminal of the user determined to be the speaker; and an output unit that receives the requested image of the speaker and outputs it together with the wide-view image.

In the host device, the request unit extracts the user's mouth region information from the parameter and determines the current speaker based on the extracted mouth region information.

The host device receives a plurality of parameters from a plurality of terminals, detects among them the parameter generated from the largest voice signal, and determines the current speaker based on the detected parameter, so that the user of the terminal that transmitted the parameter generated from the largest voice signal is determined to be the current speaker.

In the host device, when a plurality of terminals belong to users determined to be speakers, the request unit requests the image of the determined speaker from each of the plurality of terminals.

The host device periodically receives the parameters, and when the current speaker is not determined, it requests the image of the most recently determined speaker.
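The periodic reception and the fallback to the most recently determined speaker can be illustrated with a minimal sketch. The class name `SpeakerTracker`, the `min_level` threshold, and the use of a plain voice-level number as the parameter are assumptions made for illustration; the source does not specify how "not determined" is detected.

```python
from typing import Dict, Optional


class SpeakerTracker:
    """Hypothetical host-side helper: remembers the most recent speaker."""

    def __init__(self, min_level: float = 0.1) -> None:
        # Assumed threshold below which no terminal counts as speaking.
        self.min_level = min_level
        self.last_speaker: Optional[str] = None

    def update(self, levels: Dict[str, float]) -> Optional[str]:
        """Process one periodic batch of parameters (terminal id -> level).

        Returns the terminal whose image should be requested. If no terminal
        reaches the threshold, the previously determined speaker is kept.
        """
        active = {t: v for t, v in levels.items() if v >= self.min_level}
        if active:
            self.last_speaker = max(active, key=lambda t: active[t])
        return self.last_speaker
```

Calling `update` once per reporting period keeps the speaker image on screen during short pauses instead of dropping it.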

One embodiment of the present invention provides a computer-readable recording medium on which a program for causing a computer to execute the method of any one of claims 1 to 10 is recorded.

FIG. 1 is a conceptual diagram of a video conference.
FIG. 2 is a block diagram of a terminal control apparatus according to an embodiment of the present invention.
FIG. 3 is a block diagram of a host device that controls a terminal according to an embodiment of the present invention.
FIG. 4 shows the Facial Definition Parameters used to detect a face image in MPEG-4.
FIG. 5 illustrates a method of generating a parameter based on an acquired voice signal of a speaker according to an embodiment of the present invention.
FIG. 6 is a conceptual diagram of a video conference according to an embodiment of the present invention.
FIG. 7 is a flowchart of a terminal control method according to an embodiment of the present invention.
FIG. 8 is a flowchart of a method of generating a parameter using a captured image of a user according to an embodiment of the present invention.
FIG. 9 is a flowchart of a method of generating a parameter using acquired voice signal information according to an embodiment of the present invention.
FIG. 10 is a flowchart of a terminal control method of a host device according to an embodiment of the present invention.
FIG. 11 is a flowchart describing a terminal control method according to an embodiment of the present invention in detail.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings. The following embodiments are intended only to give concrete form to the present invention and do not restrict or limit its scope. Anything that a person of ordinary skill in the art to which the present invention pertains can readily infer from the detailed description and embodiments is to be construed as falling within the scope of the present invention.

FIG. 1 is a conceptual diagram of a video conference. Referring to FIG. 1, a camera 130 captures a wide view of the video conference participants: participant 1 (100a), participant 2 (100b), participant 3 (100c), and participant 4 (100d). Because the camera 130 captures only the conference room as a whole, the resulting wide-view image 140 is taken from a distance, which makes it difficult to obtain an accurate image of the individual who is currently speaking.

For example, when participant 4 (100d) is the current speaker, the wide-view image 140 cannot provide an accurate image of participant 4 (100d), so engagement with the conference drops.

FIG. 2 is a block diagram of a terminal control apparatus according to an embodiment of the present invention.

Referring to FIG. 2, the terminal control apparatus 200 according to an embodiment of the present invention may include a generation unit 210, a parameter transmission unit 220, an image request reception unit 230, and an image transmission unit 240. The terminal control apparatus 200 may reside inside a terminal.

FIG. 2 shows only the components of the terminal control apparatus 200 that are relevant to this embodiment. A person of ordinary skill in the art related to this embodiment will therefore understand that other general-purpose components may be included in addition to those shown in FIG. 2.

The generation unit 210 may generate, from an image of the user captured by the terminal, a parameter for determining whether the user is the speaker who is currently speaking. Such a parameter may be generated at the terminal of every user attending the video conference, each from the image captured of that terminal's user.

A parameter according to an embodiment of the present invention is generated by detecting a face region in the captured image of the user and using motion information of the mouth region extracted from the detected face region. Specifically, FIG. 4 shows the Facial Definition Parameters of MPEG-4; a desired face region can be detected using the vertex coordinate values of each face region.

To determine the speaker according to an embodiment of the present invention, the mouth-region parameters 420 in particular may be used from among the whole-face parameters 410 of MPEG-4. Motion information is extracted for the detected mouth region to judge whether the user is speaking.
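As a minimal sketch of how mouth-region motion information might be turned into a parameter, the following assumes that each frame supplies the mouth feature points as (x, y) coordinates in a fixed order. The function names and the threshold are hypothetical; the source does not specify the motion measure.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def mouth_motion(frames: Sequence[Sequence[Point]]) -> float:
    """Mean displacement of mouth feature points between consecutive frames."""
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        for (x0, y0), (x1, y1) in zip(prev, curr):
            total += ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
            count += 1
    # Fewer than two frames (or no points) means no measurable motion.
    return total / count if count else 0.0


def is_speaking(frames: Sequence[Sequence[Point]], threshold: float = 1.0) -> bool:
    """Assumed decision rule: speaking if the mean motion exceeds a threshold."""
    return mouth_motion(frames) > threshold
```

A terminal could send the value of `mouth_motion` as its parameter and leave the thresholding to the host device, or send the boolean result directly.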

A parameter according to another embodiment of the present invention is generated from the user's voice signal acquired at the terminal. The voice signal may be acquired through a microphone built into the terminal or through a microphone array outside the terminal. These are only examples, and the method of acquiring the voice signal is not limited to them.

As one example, the magnitude of the voice signal may be used to judge whether a user is speaking. Suppose there are two or more people in the conference room. When a particular user speaks, that user's voice signal is input to the terminals of all of them.

The voice signal of the speaking user is strongest at that user's own terminal. Because the speaking user's terminal is the one closest to that user, it is easy to predict that the voice signal input to it will be the largest. A parameter according to an embodiment of the present invention can therefore be generated based on the magnitude of the acquired voice signal.
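One simple way to express the magnitude of an acquired voice signal as a single number is the root-mean-square (RMS) level of a short block of samples. The sketch below assumes samples are plain numbers; the choice of RMS is an illustrative assumption, since the source only says the parameter is based on signal magnitude.

```python
import math
from typing import Sequence


def voice_level(samples: Sequence[float]) -> float:
    """RMS level of one block of audio samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

Each terminal would compute this value over its own microphone input and report it as its parameter.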

During a video conference, various sounds other than the users' speech may occur, such as loudspeaker output or clapping. In such cases, the apparatus can be configured to extract the voice signal from among these sounds and generate the parameter from the extracted voice signal.

To distinguish the speaker among many participants, both the parameter derived from the voice signal and the parameter derived from the image may be used. This prevents a parameter derived from a non-voice sound from being mistaken for sound produced by a speaker, and it yields a more accurate result than detecting the speaker with only one kind of parameter.
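A minimal sketch of combining the two kinds of parameters follows. It assumes each terminal reports a voice level and a mouth-motion value, and it uses a hypothetical rule: only terminals whose user's mouth is moving are candidates, and the loudest candidate is chosen. The `Params` type and `min_motion` threshold are illustrative, not taken from the source.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Params:
    terminal_id: str
    voice_level: float   # magnitude of the sound picked up by the terminal
    mouth_motion: float  # motion measured in the user's mouth region


def pick_speaker(params: Iterable[Params], min_motion: float = 1.0) -> Optional[str]:
    """Return the loudest terminal among those whose user's mouth is moving."""
    candidates = [p for p in params if p.mouth_motion >= min_motion]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.voice_level).terminal_id
```

With this rule, a terminal that sits next to a loudspeaker and reports a high sound level is not selected unless its own user's mouth is moving.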

The parameter transmission unit 220 transmits the parameter that the generation unit 210 generated from the captured image of the user to the host device 250. Multiple terminals each transmit a parameter generated from the image captured of their own user, so the host device 250 receives multiple parameters. The host device 250 determines the speaker based on the received parameters. The host device 250 is described in detail below with reference to FIG. 3.

The parameter transmission unit 220 of the terminal control apparatus 200 may transmit the generated parameter to the host device 250 over any of a plurality of wireless communication interfaces or over other types of interfaces for wired communication. For example, a plurality of wireless communication interfaces such as an infrared communication interface and a Bluetooth wireless communication interface may be provided.

The image request reception unit 230 receives a request for an image of the current speaker. The host device 250 sends this request to the terminal of the user it has determined to be the current speaker using the parameters received from the multiple terminals. The request may likewise be received over any of a plurality of wireless communication interfaces or over other types of interfaces for wired communication.

The image transmission unit 240 transmits to the host device 250 the image of the user captured by the terminal of the user determined to be the current speaker. The host device 250 outputs the transmitted image together with the wide-view image of the entire conference room.

FIG. 3 is a block diagram of a host device that controls a terminal according to an embodiment of the present invention.

Referring to FIG. 3, the host device 300 that controls a terminal according to an embodiment of the present invention may include a parameter reception unit 310, an image request unit 320, a wide-view image acquisition unit 330, and an output unit 340. The host device 300 corresponds to the host device 250 of FIG. 2.

FIG. 3 shows only the components of the host device 300 that are relevant to this embodiment. A person of ordinary skill in the art related to this embodiment will therefore understand that other general-purpose components may be included in addition to those shown in FIG. 3.

The parameter reception unit 310 receives, from one or more terminals, parameters from which it can be determined whether the user of each terminal is the speaker who is currently speaking. Examples of such parameters are those described above with reference to FIG. 2, generated from the captured image of the user or from the voice signal input by the user.

The parameters received from the one or more terminals include unique information for identifying each terminal. For example, when a user enters a unique ID and password into a program and logs in to join the video conference, unique information for each terminal can be generated together with the parameter using the entered ID and password.

Besides the login ID and password, unique identification information can also be generated using, for example, the serial number of each user's terminal. The per-terminal identification information in the present invention is not limited to these examples.
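The following sketch shows one possible way to attach identification information to a parameter before it is sent. It is an assumption for illustration only: it derives the identifier by hashing the login ID and the terminal serial number, and it serializes the report as JSON. The function names and the message fields are hypothetical.

```python
import hashlib
import json


def terminal_identifier(login_id: str, serial_number: str = "") -> str:
    """Derive a short, stable identifier from the login ID and serial number."""
    raw = f"{login_id}|{serial_number}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def build_report(login_id: str, serial_number: str, voice_level: float) -> str:
    """Bundle the identifier and the parameter into one message for the host."""
    message = {
        "terminal": terminal_identifier(login_id, serial_number),
        "voice_level": voice_level,
    }
    return json.dumps(message, sort_keys=True)
```

The host device can then map each received parameter back to the terminal that sent it and address its image request to that terminal.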

The image request unit 320 determines the current speaker based on the received parameters. As described above with reference to FIG. 2, when the parameters were generated from the face regions in the captured images of the users, the image request unit 320 analyzes the parameters to extract motion information of each user's mouth region and determines the current speaker from the extracted motion information.

When the parameters were generated from voice signals acquired from the users, the user of the terminal that transmitted the parameter generated from the largest voice signal is determined to be the current speaker. The terminal that received the largest voice signal is the one closest to the current speaker, so its user is the current speaker.
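The host-side selection by largest voice signal can be sketched as a simple maximum over the received reports. The report format, a list of (terminal identifier, level) pairs, is an assumption made for illustration.

```python
from typing import Optional, Sequence, Tuple


def select_speaker(reports: Sequence[Tuple[str, float]]) -> Optional[str]:
    """Return the terminal that reported the largest voice level, if any."""
    if not reports:
        return None
    terminal_id, _level = max(reports, key=lambda r: r[1])
    return terminal_id
```

The returned identifier is the terminal to which the image request unit would send its request for the speaker's image.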

The wide-view image acquisition unit 330 acquires a wide-view image capturing all the users in the conference room. The camera that captures them may be located outside or inside the host device 300.

The output unit 340 receives the speaker's image that the image request unit 320 requested from the terminal determined to belong to the current speaker. The output unit 340 then outputs the received speaker image together with the image acquired by the wide-view image acquisition unit 330. Outputting the two images together here means generating and outputting a stream by mixing or muxing them.

To mix the speaker image and the wide-view image, the encoded speaker image is first decoded into raw data and then combined with the wide-view image, which is already in raw form. Combining the two images while both are raw data is called mixing.
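As a minimal illustration of mixing at the raw-data level, the sketch below overlays an already decoded speaker frame onto a raw wide-view frame as a picture-in-picture inset. Frames are modeled as lists of pixel rows, and the function name and inset position are hypothetical; a real system would operate on decoded video buffers.

```python
from typing import List

Frame = List[List[int]]


def mix_frames(wide: Frame, speaker: Frame, top: int = 0, left: int = 0) -> Frame:
    """Overlay the raw speaker frame onto a copy of the raw wide-view frame."""
    out = [row[:] for row in wide]  # leave the input frame untouched
    for r, row in enumerate(speaker):
        for c, pixel in enumerate(row):
            rr, cc = top + r, left + c
            # Pixels that would fall outside the wide-view frame are dropped.
            if 0 <= rr < len(out) and 0 <= cc < len(out[rr]):
                out[rr][cc] = pixel
    return out
```

The mixed frame would then be encoded once and output as a single stream.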

To mux the speaker image and the wide-view image, the wide-view image, which is in raw form, is encoded and then combined with the already encoded speaker image. Combining the two images while both are encoded is called muxing.

Compared with muxing, mixing yields relatively lower image quality and requires a more complex system, but it has the advantage of using less bandwidth. Muxing, by contrast, yields relatively better image quality and a simpler system, but it has the drawback of using more bandwidth. The most suitable of these output methods can be selected in view of system conditions such as the current network state.
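The selection between the two output methods could be sketched as follows. The decision rule (use muxing only when the available bandwidth covers both encoded streams), the function name choose_output_mode, and the bitrate parameters are assumptions for illustration; this description states only that the method is chosen in view of conditions such as the network state.

```python
def choose_output_mode(
    available_kbps: float, speaker_kbps: float, perspective_kbps: float
) -> str:
    """Pick an output method from the bandwidth trade-off described above.

    Muxing carries two encoded streams and so needs more bandwidth; mixing
    produces a single stream and is used as the fallback (assumed rule).
    """
    if available_kbps >= speaker_kbps + perspective_kbps:
        return "muxing"
    return "mixing"
```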

FIG. 5 is a diagram for explaining a method of generating a parameter based on an acquired voice signal of a speaker according to an embodiment of the present invention.

Referring to FIG. 5, terminal 1 (510), terminal 2 (520), and terminal 3 (530) are the terminals used during the video conference by the users attending it. The voice signal produced by the current speaker 550 is input to all of terminal 1 (510), terminal 2 (520), and terminal 3 (530).

Comparing the magnitudes of the voice signals input to the terminals 510, 520, and 530, the signal input to terminal 2 (520), which is the speaker's own terminal and therefore the closest, is the largest. The user of the terminal that transmitted the parameter generated from the largest voice signal can therefore be determined to be the speaker. Determining the speaker from the received parameters is performed by the host device 540.
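The comparison performed by the host device can be sketched in Python as below. The dictionary keyed by terminal identifier, the numeric parameter values, and the name determine_speaker are illustrative assumptions; only the rule of picking the largest voice-based parameter comes from this description.

```python
from typing import Dict, Optional


def determine_speaker(params: Dict[str, float]) -> Optional[str]:
    """Return the terminal whose voice-based parameter is the largest.

    params maps a terminal identifier to the parameter it transmitted.
    Returns None when no parameters have been received.
    """
    if not params:
        return None
    return max(params, key=params.get)
```

For the arrangement of FIG. 5, calling determine_speaker with values in which terminal 2 reports the largest parameter returns the identifier of terminal 2.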

A normalization process may be applied when comparing the parameters generated from each terminal's voice signal. Even terminals from the same manufacturer can differ in the signal their hardware captures, and the result can also be affected by differences in how loudly individual users speak, so a normalization process may be needed to keep the comparison fair despite such external factors.
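One possible form of such normalization is sketched below: each terminal's measured level is divided by a per-terminal baseline. The baseline (for example, a level measured for each terminal and user during a calibration phase), the function name normalize_levels, and the division rule are all assumptions; this description does not specify the normalization method.

```python
from typing import Dict


def normalize_levels(
    levels: Dict[str, float], baselines: Dict[str, float]
) -> Dict[str, float]:
    """Scale each terminal's level by its own (hypothetical) baseline.

    Terminals without a positive baseline are left out, since their levels
    cannot be compared fairly with the others.
    """
    return {
        terminal: level / baselines[terminal]
        for terminal, level in levels.items()
        if baselines.get(terminal, 0.0) > 0.0
    }
```

With this scaling, a sensitive microphone or a loud user no longer dominates the comparison merely because of a higher absolute level.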

Parameter generation from voice signals can also be considered for the case of multiple speakers. In that case, the speakers can be determined by using a parameter based on the voice signal together with a parameter based on the captured image of the user.

According to an embodiment of the present invention, when there are multiple speakers, the voice signal acquired at each terminal is separated by speaker, and parameters are generated from the separated voice signals. The host device can determine that a terminal belongs to a current speaker when the parameter it transmitted was generated from a voice signal whose magnitude is at or above a preset threshold. The host device can request the speaker's image from each terminal that transmitted a parameter at or above the threshold. If several parameters are at or above the threshold, it can be inferred that several speakers are currently speaking. In that case, the images of the several speakers can be received and output together with the perspective image.

Here, the threshold can be adjusted by the user according to the conference room environment, taking into account the distance between a given terminal and its user.
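The threshold test for multiple speakers can be sketched as follows. The function name speakers_above_threshold and the dictionary of parameters keyed by terminal identifier are illustrative assumptions; the rule of treating every terminal at or above the threshold as a current speaker's terminal follows the description above.

```python
from typing import Dict, List


def speakers_above_threshold(params: Dict[str, float], threshold: float) -> List[str]:
    """Return, in sorted order, every terminal whose parameter meets the threshold.

    An empty list means no current speaker; more than one entry suggests
    that several users are speaking at the same time.
    """
    return sorted(t for t, value in params.items() if value >= threshold)
```

The host device would then request a speaker image from each returned terminal and output those images together with the perspective image.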

FIG. 6 is a conceptual diagram illustrating a video conference according to an embodiment of the present invention.

Referring to FIG. 6, user 1 (610a), user 2 (610b), user 3 (610c), and user 4 (610d), who are participating in the video conference, use terminal 1 (620a), terminal 2 (620b), terminal 3 (620c), and terminal 4 (620d), respectively. The perspective image 650 of the conference room is captured by an external camera 640.

The host device 660, which determines the current speaker among the users 610a, 610b, 610c, and 610d, determines the speaker based on the parameter 632 transmitted by the terminal 620d of the speaker 610d, and then sends a signal 634 requesting the speaker's image to the terminal identified as the speaker's. In response to the image request signal 634, the speaker's terminal 620d transmits (636) the speaker's image. The host device outputs (670) the received speaker image together with the perspective image.

During a video conference there may be moments when none of the users produces a voice signal. For example, a pause may occur while the speaker is talking, during which no voice signal is produced for some time.

According to an embodiment of the present invention, each user's terminal periodically transmits to the host device a parameter from which the speaker can be determined. The transmission period can be set by the user. The host device evaluates the parameters it receives periodically, and if no current speaker can be determined, it requests the image of the most recently determined speaker from that speaker's terminal.
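The fallback to the most recently determined speaker can be sketched as a small stateful helper. The class name SpeakerTracker, the use of a threshold to decide that no current speaker can be determined, and the per-period update call are assumptions for illustration.

```python
from typing import Dict, Optional


class SpeakerTracker:
    """Remember the last determined speaker across periodic parameter updates."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold  # assumed criterion for "someone is speaking"
        self.last_speaker: Optional[str] = None

    def update(self, params: Dict[str, float]) -> Optional[str]:
        """Process one period of parameters and return the terminal to query.

        If no parameter reaches the threshold (for example during a pause),
        the most recently determined speaker is returned instead.
        """
        active = {t: v for t, v in params.items() if v >= self.threshold}
        if active:
            self.last_speaker = max(active, key=active.get)
        return self.last_speaker
```

Calling update once per transmission period keeps the output on the previous speaker through short silences rather than switching away.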

FIG. 7 is a flowchart illustrating a terminal control method according to an embodiment of the present invention.

In step 710, the terminal control apparatus generates a parameter from which the speaker in the video conference can be determined, based on the image of the user captured by each user's terminal.

In step 720, the terminal control apparatus transmits the parameter generated for each terminal to the host device.

In step 730, the terminal control apparatus receives a request for the current speaker's image, which the host device sends to the terminal of the user it has determined to be the current speaker using the parameters received from the multiple terminals.

FIG. 8 is a flowchart illustrating a method of generating a parameter using a captured image of a user according to an embodiment of the present invention.

In step 810, the terminal control apparatus detects a face region in the image of the user captured by each terminal.

In step 820, the terminal control apparatus extracts motion information of the mouth region from the detected face region.

In step 830, the terminal control apparatus uses the extracted mouth-region motion information to generate a parameter from which the current speaker in the video conference can be determined.
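Steps 820 and 830 could be approximated as in the following sketch, which takes mouth-region crops that are assumed to have already been located by a face detector (step 810 is not implemented here). The use of the mean absolute difference between consecutive crops as the motion measure, the grayscale list-of-rows format, and the name mouth_motion_parameter are assumptions; this description does not state how the motion information is computed.

```python
from typing import List

# Assumed format: each crop is a list of rows of grayscale pixel values,
# and all crops have the same size.
Crop = List[List[int]]


def mouth_motion_parameter(mouth_crops: List[Crop]) -> float:
    """Return the mean absolute pixel change between consecutive mouth crops.

    A larger value indicates more mouth movement, which is taken here as a
    hint that the user is speaking.
    """
    total = 0
    count = 0
    for prev, curr in zip(mouth_crops, mouth_crops[1:]):
        for prev_row, curr_row in zip(prev, curr):
            for p, c in zip(prev_row, curr_row):
                total += abs(c - p)
                count += 1
    return total / count if count else 0.0
```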

FIG. 9 is a flowchart illustrating a method of generating a parameter using acquired voice signal information according to an embodiment of the present invention.

In step 910, the terminal control apparatus receives the voice signals input to each terminal.

In step 920, the terminal control apparatus separates the received voice signals.

In step 930, the terminal control apparatus generates a parameter based on the largest of the separated voice signals.
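Step 930 can be sketched as below, taking signals that are assumed to have already been separated in step 920 (the separation itself is not implemented here). Measuring signal magnitude as the root mean square of the samples, and the names rms and voice_parameter, are assumptions for illustration.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one signal (0.0 for an empty signal)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def voice_parameter(separated_signals: List[Sequence[float]]) -> float:
    """Return the level of the loudest separated signal as the parameter."""
    return max((rms(signal) for signal in separated_signals), default=0.0)
```

The terminal would transmit this value to the host device, where it is compared with the values from the other terminals.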

FIG. 10 is a flowchart illustrating a terminal control method of a host device according to an embodiment of the present invention.

In step 1010, the terminal control apparatus of the host device receives, from at least one terminal, parameters from which it can be determined whether each terminal's user is the speaker currently speaking.

In step 1020, the terminal control apparatus of the host device determines the speaker based on the received parameters. It also requests a captured image of the speaker from the terminal of the user determined to be the speaker.

In step 1030, the terminal control apparatus of the host device receives the speaker's image transmitted by the terminal of the user determined to be the speaker.

In step 1040, the terminal control apparatus of the host device acquires a perspective image that captures all of the users in the conference room.

In step 1050, the terminal control apparatus of the host device outputs the speaker's image received from the terminal of the user determined to be the speaker together with the perspective image.
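The host-side flow of steps 1010 to 1050 can be summarized in a short sketch. The function name host_round, the request_image callback standing in for the request and response exchange with a terminal, and the use of bytes for images are hypothetical; speaker determination uses the largest-parameter rule described for FIG. 5.

```python
from typing import Callable, Dict, Optional, Tuple


def host_round(
    params: Dict[str, float],
    request_image: Callable[[str], bytes],
    perspective_image: bytes,
) -> Optional[Tuple[str, bytes, bytes]]:
    """Run one pass of the host flow and return what should be output.

    params: parameters received from the terminals (step 1010).
    request_image: hypothetical callback that asks a terminal for its
        user's image and returns it (steps 1020 and 1030).
    perspective_image: the already acquired perspective image (step 1040).
    """
    if not params:
        return None  # nothing received, so no speaker can be determined
    speaker = max(params, key=params.get)   # step 1020: determine the speaker
    speaker_image = request_image(speaker)  # steps 1020/1030: request, receive
    return speaker, speaker_image, perspective_image  # step 1050: output both
```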

FIG. 11 is a flowchart illustrating a terminal control method in detail according to an embodiment of the present invention.

In step 1110, a camera captures a perspective image of the users in the video conference.

In step 1120, the terminal control apparatus of the host device acquires the captured perspective image.

In step 1130, the terminal control apparatus of the host device determines the speaker based on the parameters, received from the users' terminals, from which the current speaker can be determined. Once the speaker has been determined, the host device can request that the terminal of the user determined to be the speaker transmit a captured image of the speaker.

In step 1140, the terminal control apparatus transmits the captured image of the speaker from the terminal of the user determined to be the speaker to the host device.

In step 1150, the terminal control apparatus of the host device receives the captured image of the speaker from the terminal of the user determined to be the speaker.

In step 1160, the terminal control apparatus of the host device mixes the perspective image with the speaker's image.

In step 1170, the image mixed in step 1160 is encoded.

In step 1180, the image encoded in step 1170 is output.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. A computer-readable medium can be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. A computer-readable medium may also include both computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically include computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or another transport mechanism, and include any information delivery medium.

The foregoing description of the present invention is given by way of example, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

200: Terminal control device
210: Generating unit
220: Parameter transmission unit
230: Video request receiving unit
240: Video transmission unit
250: Host device

Claims

1. A terminal control method comprising:
generating a parameter for determining whether a user is the speaker currently speaking, based on at least one of an image and a voice of the user acquired by a terminal;
transmitting the generated parameter to a host device;
receiving a request for an image of the determined current speaker from the host device, which determines the current speaker using a plurality of parameters received from a plurality of terminals; and
transmitting the requested image to the host device,
wherein the transmitted image is output by the host device together with a perspective image acquired by the host device.

2. The method of claim 1, wherein generating the parameter comprises:
detecting a face region of the user from the acquired image of the user; and
generating the parameter by extracting, from the detected face region, motion information of a mouth region from which it can be determined whether the user is speaking.

3. The method of claim 1, wherein generating the parameter comprises:
acquiring a voice signal of the user at the terminal; and
generating the parameter based on the acquired voice signal,
wherein the user of the terminal that transmitted the parameter generated from the largest voice signal among the plurality of parameters is determined to be the current speaker.

4. The method of claim 3, wherein generating the parameter comprises:
separating a plurality of voice signals input to the terminal when there are a plurality of speakers; and
generating a parameter for each of the separated voice signals.

5. The terminal control method of claim 1,
wherein the generated parameter is transmitted to the host device periodically.

6. The terminal control method of claim 1,
wherein, when the current speaker is not determined, the host device outputs the image of the most recently determined speaker.

7. A method by which a host device controls a terminal, the method comprising:
acquiring a perspective image of users;
receiving, from at least one terminal, parameters from which it can be determined whether the user of each terminal is the speaker currently speaking;
requesting an image of the determined speaker from the terminal of the user determined to be the speaker based on the received parameters; and
receiving the requested image of the speaker and outputting the received image together with the perspective image.

8. The method of claim 7, wherein requesting the image of the determined speaker comprises:
extracting mouth region information of the user from the parameter; and
determining the current speaker based on the extracted mouth region information.

9. The method of claim 7, further comprising:
receiving a plurality of parameters from a plurality of terminals;
detecting, among the plurality of parameters, the parameter generated from the largest voice signal; and
determining the current speaker based on the detected parameter,
wherein the user of the terminal that transmitted the parameter generated from the largest voice signal among the plurality of parameters is determined to be the current speaker.

10. The method of claim 7,
wherein, when there are a plurality of terminals of users determined to be speakers, the image of the determined speaker is requested from each of the plurality of terminals.

11. The terminal control method of claim 7,
further comprising periodically receiving the parameters.

12. The terminal control method of claim 7,
further comprising, when the current speaker is not determined, requesting the image of the most recently determined speaker.

13. A terminal control device comprising:
a generating unit for generating a parameter for determining whether a user is the speaker currently speaking, based on at least one of an image and a voice of the user acquired by a terminal;
a parameter transmission unit for transmitting the generated parameter to a host device;
a video request receiving unit for receiving a request for an image of the determined current speaker from the host device, which determines the current speaker using a plurality of parameters received from a plurality of terminals; and
a video transmission unit for transmitting the requested image to the host device, wherein the transmitted image is output by the host device together with a perspective image acquired by the host device.

14. The apparatus of claim 13,
wherein a face region of the user is detected from the acquired image of the user, and the parameter is generated by extracting, from the detected face region, motion information of a mouth region from which it can be determined whether the user is speaking.

15. The apparatus of claim 13,
wherein the voice signal of the user is acquired at the terminal and the parameter is generated based on the acquired voice signal, and wherein the user of the terminal that transmitted the parameter generated from the largest voice signal among the plurality of parameters is determined to be the current speaker.

16. The apparatus of claim 15,
wherein, when there are a plurality of speakers, a plurality of voice signals input to the terminal are separated and a parameter is generated for each of the separated voice signals.

17. The terminal control apparatus of claim 13,
wherein the generated parameter is transmitted to the host device periodically.

18. The terminal control apparatus of claim 13,
wherein, when the current speaker is not determined, the host device outputs the image of the most recently determined speaker.

19. A host device for controlling a terminal, comprising:
a perspective image acquisition unit for acquiring a perspective image of users;
a parameter receiving unit for receiving, from at least one terminal, parameters from which it can be determined whether the user of each terminal is the speaker currently speaking;
a requesting unit for requesting an image of the determined speaker from the terminal of the user determined to be the speaker based on the received parameters; and
an output unit for receiving the requested image of the speaker and outputting the received image together with the perspective image.

20. The host device of claim 19,
wherein mouth region information of the user is extracted from the parameter, and the current speaker is determined based on the extracted mouth region information.

21. The host device of claim 19,
wherein a plurality of parameters are received from a plurality of terminals, the parameter generated from the largest voice signal among the plurality of parameters is detected, and the current speaker is determined based on the detected parameter, and wherein the user of the terminal that transmitted the parameter generated from the largest voice signal is determined to be the current speaker.

22. The host device of claim 19,
wherein, when there are a plurality of terminals of users determined to be speakers, the requesting unit requests the image of the determined speaker from each of the plurality of terminals.

23. The host device of claim 19,
wherein the parameters are received periodically.

24. The host device of claim 19,
wherein, when the current speaker is not determined, the image of the most recently determined speaker is requested.

25. A computer-readable recording medium having recorded thereon a program for causing a computer to execute the method of any one of claims 1 to 12.