WO2017222258A1

WO2017222258A1 - Multilateral video communication system and method using 3d depth camera

Info

Publication number: WO2017222258A1
Application number: PCT/KR2017/006405
Authority: WO
Inventors: 남궁환식
Original assignee: (주)해든브릿지
Priority date: 2016-06-21
Filing date: 2017-06-19
Publication date: 2017-12-28
Also published as: KR101784266B1

Abstract

According to the present invention, a multilateral video communication system using a 3D depth camera comprises: a plurality of user terminals capturing a user's images so as to transmit the images to a remote server and carry out multilateral video communication; and a media processing server synthesizing the images received from the user terminals so as to generate a composite image, and retransmitting the generated composite image to user terminal groups that carry out video communication, wherein a user terminal comprises: a 3D depth camera imaging objects to be imaged so as to convert the same into images, and having distance information on each of the converted image objects; an image processing module for removing, excluding a user object image, a background image from the images captured by the 3D depth camera, and replacing the same with a single-colored virtual background image; a communication module for communicating with the media processing server so as to transmit and receive images; an image overlay module overlaying, on shared content of the user terminal groups, the composite image received from the media processing server so as to generate an overlay image; and a display means for outputting the overlay image on a screen.

Description

Multi-party Video Dialog System and Method Using 3D Depth Camera

The present invention relates to a system and method for implementing a multi-party video conversation using a 3D depth camera, and more particularly, to a multi-party video chat system and method using a 3D depth camera that realize a multi-party video conversation by compositing user images captured with a 3D depth camera onto shared content.

Recently, with the development of display and communication technology, systems for multi-party video conversation have also advanced considerably. Conventionally, video conversations were conducted while viewing the participants on separate screens, and when there were many participants, the screen was divided or toggled to check the video of the other parties.

Multi-party video chats are typically performed in distance learning, video conferencing, and multi-user game environments, and it is very important to increase immersion when remote participants exchange opinions on common topics through video chats.

Meanwhile, Korean Patent No. 10-0275930 discloses a screen composition method for displaying a plurality of pictures on one screen in a video conference system. Based on the hierarchical structure of the low-resolution (QCIF, 176x144) and high-resolution (CIF, 352x288) picture formats supported by the H.261 Recommendation, and on observation of the characteristics of video streams generated in accordance with that Recommendation, the document describes a method in which a separate server synthesizes up to four low-resolution video streams into a single high-resolution stream. In a video conferencing system using a LAN or cable TV network, the video streams of up to four participants are thereby displayed simultaneously on each participant's computer screen so that the meeting can proceed smoothly.
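The relationship between the two picture formats can be checked with a short sketch: a CIF frame (352x288) is exactly two QCIF frames (176x144) wide and two high, which is why at most four low-resolution pictures fit. The sketch below works on decoded pixel rows only and does not attempt the stream-level synthesis the cited document describes; the function names are hypothetical and chosen for illustration.

```python
QCIF = (176, 144)  # (width, height) of the low-resolution format
CIF = (352, 288)   # (width, height) of the high-resolution format


def quadrant_origins(tile=QCIF, canvas=CIF):
    """Return the top-left (x, y) offset of every tile slot in the canvas."""
    tile_w, tile_h = tile
    canvas_w, canvas_h = canvas
    if canvas_w % tile_w or canvas_h % tile_h:
        raise ValueError("canvas is not a whole multiple of the tile size")
    return [(x, y)
            for y in range(0, canvas_h, tile_h)
            for x in range(0, canvas_w, tile_w)]


def tile_frames(frames, tile=QCIF, canvas=CIF, fill=0):
    """Paste tile-sized frames (lists of pixel rows) into one canvas frame.

    Assumption: every frame is exactly tile-sized. Unused slots keep `fill`.
    """
    tile_w, tile_h = tile
    canvas_w, canvas_h = canvas
    origins = quadrant_origins(tile, canvas)
    if len(frames) > len(origins):
        raise ValueError("too many frames for this canvas")
    out = [[fill] * canvas_w for _ in range(canvas_h)]
    for frame, (ox, oy) in zip(frames, origins):
        for y in range(tile_h):
            out[oy + y][ox:ox + tile_w] = frame[y]
    return out
```

With the default sizes, `quadrant_origins()` yields four slots, matching the four-participant limit discussed in the text.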

However, since this approach merely divides the screen to display up to four participants on one screen, it is difficult for participants to intuitively feel that they share a common virtual space, and the immersion of the video conversation is greatly reduced.

An object of an embodiment of the present invention is to provide a multi-party video chat system and method using a 3D depth camera in which a plurality of user terminals connected to a network transmit user images extracted with a 3D depth camera to a media processing server, the media processing server synthesizes the user images and retransmits them to the user terminals, and each user terminal overlays the composite image on shared content for the participants of the group conducting the video conversation, so that the participants appear on the same virtual background and a highly immersive video conversation is achieved.

According to an embodiment of the present invention, a multi-party video chat system using a 3D depth camera includes: a plurality of user terminals for capturing an image of a user and transmitting it to a remote server to perform a multi-party video conversation; and a media processing server for synthesizing the images received from the user terminals to generate a composite image and retransmitting the generated composite image to a group of user terminals conducting a video conversation, wherein the user terminal includes: a 3D depth camera that captures a photographed object, converts it into an image, and has distance information for each converted image object; an image processing module which removes a background image, excluding a user object image, from the image photographed by the 3D depth camera and replaces it with a single-color virtual background image; a communication module communicating with the media processing server to transmit and receive images; an image overlay module for generating an overlay image by overlaying the composite image received from the media processing server on shared content of the user terminal group; and display means for outputting the overlay image on a screen.

In a multi-party video chat system using a 3D depth camera according to another embodiment of the present invention, the 3D depth camera comprises an RGB sensing unit for sensing RGB information of the photographed object and a distance sensing unit for sensing distance information of the photographed object.

In a multi-party video chat system using a 3D depth camera according to another embodiment of the present invention, the image processing module configures the virtual background image in a color different from a color constituting the user object image.

In a multi-party video chat system using a 3D depth camera according to another embodiment of the present invention, the image processing module configures the virtual background image in a color different from a color constituting a boundary of the user object image.

In a multi-party video chat system using a 3D depth camera according to another embodiment of the present invention, the media processing server encodes the composite image and retransmits it to the user terminal, and the image overlay module receives and decodes the composite image, removes the single-color virtual background image from the composite image to make it transparent, and overlays the resulting transparent image on the shared content for rendering.

In a multi-party video chat system using a 3D depth camera according to another embodiment of the present invention, the shared content is a shared document or image selected by a host in distance education or video conferencing.

In a multi-party video chat system using a 3D depth camera according to another embodiment of the present invention, the shared content is a game screen selected by a user in a multi-user connected game environment.

The multi-party video chat system using the 3D depth camera according to another embodiment of the present invention further includes a data relay server for relaying video data between the user terminal and the media processing server.

The multi-party video chat system using the 3D depth camera according to another embodiment of the present invention further includes a content sharing server for providing the shared content.

A multi-party video chat method using a 3D depth camera according to an embodiment of the present invention is a method in which, when a user terminal captures and transmits an image of a user, a media processing server synthesizes the images received from the user terminals to generate a composite image and retransmits the composite image to a group of user terminals conducting a video conversation, the method comprising: (a) capturing a photographed object with a 3D depth camera installed in the user terminal; (b) removing the background image, excluding the user object image, from the captured image and replacing it with a single-color virtual background image; (c) transmitting the image generated in step (b) to the media processing server; (d) generating an overlay image by overlaying the composite image received from the media processing server on shared content of the user terminal group; and (e) outputting the overlay image on a screen.

In the multi-party video chat method using the 3D depth camera according to another embodiment of the present invention, the 3D depth camera comprises an RGB sensing unit for sensing RGB information of the photographed object and a distance sensing unit for sensing distance information of the photographed object.

In the multi-party video chat method using the 3D depth camera according to another embodiment of the present invention, step (b) configures the virtual background image in a color different from the colors constituting the user object image.

In the multi-party video chat method using the 3D depth camera according to another embodiment of the present invention, step (b) configures the virtual background image in a color different from the colors constituting the boundary of the user object image.

In a multi-party video chat method using a 3D depth camera according to another embodiment of the present invention, the media processing server encodes the composite image and retransmits it to the user terminal, and step (d) receives and decodes the composite image, removes the single-color virtual background image from the composite image to make it transparent, and overlays the resulting transparent image on the shared content for rendering.

In a multi-party video chat method using a 3D depth camera according to another embodiment of the present invention, the shared content is a shared document or an image selected by a host in a distance education or video conference.

In a multi-party video chat method using a 3D depth camera according to another embodiment of the present invention, the shared content is a game screen selected by a user in a game environment connected by a multi-user.

In a multi-party video chat method using a 3D depth camera according to another embodiment of the present invention, a data relay server relays video data between the user terminal and the media processing server.

In a multi-party video chat method using a 3D depth camera according to another embodiment of the present invention, a content sharing server provides the shared content.

A multi-party video chat system using a 3D depth camera according to an embodiment of the present invention includes: at least one first user terminal for capturing an image of a user with a 2D camera and transmitting and receiving the captured 2D image with a remote server to perform a multi-party video conversation; at least one second user terminal for capturing an image of a user with a 3D depth camera and transmitting and receiving the captured 3D image with a remote server to perform a multi-party video conversation; and a media processing server that generates a composite image by synthesizing the images transmitted from the first user terminal and the second user terminal, using the 2D image as the base screen of the composite image and superimposing the 3D image on the 2D image after removing the background color of the 3D image, encodes the generated composite image, and retransmits it to the first user terminal and the second user terminal, wherein the first user terminal and the second user terminal render and display the image received from the media processing server.

In a multi-party video chat system using a 3D depth camera according to another embodiment of the present invention, the 2D image provided as the base screen of the composite image is an image of a moderator of distance education or a video conference captured by the 2D camera of the first user terminal, and the 3D depth camera of the second user terminal includes an RGB sensing unit for sensing RGB information of the photographed object and a distance sensing unit for sensing distance information of the photographed object.

According to a multi-party video chat system and method using a 3D depth camera of the present invention, a plurality of user terminals connected to a network transmit user images extracted from a 3D depth camera to a media processing server, and synthesize the user images in the media processing server. By retransmitting to the user terminal, the composite video is overlaid on the shared content and displayed to the participants in the group conducting the video conversation on the user terminal side, so that the participants in the video conversation can be placed on the same virtual background to create a high-immersion video conversation. In addition, there is an effect that can provide a highly immersive user experience (UX) in a variety of video chat systems, such as distance learning, video conferencing, video chat in game.

In addition, the images of 3D camera users are placed into the environment of the user with the 2D camera in the group conducting the video conversation, which gives the impression of having moved into that space and makes the conversation feel realistic.

In addition, the media processing server may increase usability by providing compatibility so that further users with 2D cameras, other than the 2D camera user whose image serves as the base screen, can also participate in such a highly immersive video conversation. That is, because the media processing server performs the synthesis processing, general user terminals using such 2D cameras also remain compatible.

FIG. 1 is a block diagram illustrating a multi-party video chat system according to the present invention;

FIG. 2 is a block diagram illustrating a configuration of a user terminal in the present invention;

FIG. 3 is a diagram illustrating how an overlay image is provided in the present invention; and

FIG. 4 is a view showing another example of providing a composite image in the present invention.

Hereinafter, specific embodiments according to the present invention will be described with reference to the accompanying drawings. However, this is not intended to limit the present invention to the specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes included in the spirit and scope of the present invention.

The same reference numerals are used throughout the specification for parts having similar configurations and operations. The drawings attached to the present invention are provided for convenience of description, and shapes and relative scales may be exaggerated or omitted.

In describing the embodiments in detail, overlapping descriptions or descriptions of obvious technology in the art are omitted. In addition, in the following description, when a portion "includes" another component, it means that the component can be further included in addition to the described component unless otherwise stated.

In addition, terms such as "~ module" described in the specification mean a unit that processes at least one function or operation, which may be implemented by hardware, software, or a combination of hardware and software. In addition, when a part is electrically connected to another part, this includes not only the case where it is directly connected, but also the case where it is connected through another structure in between.

Terms including ordinal numbers such as first and second may be used to describe various components, but the components are not limited by the terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, the second component may be referred to as the first component, and similarly, the first component may also be referred to as the second component.

FIG. 1 is a block diagram illustrating a multi-party video chat system according to the present invention. Referring to FIG. 1, a multi-party video chat system using a 3D depth camera of the present invention includes a plurality of user terminals 100, and a data relay server 200, a media processing server 300, and a content sharing server 400 that are connected to the user terminals 100 through a network 500.

The user terminal 100 refers to a terminal capable of accessing the Internet through the network 500, such as a PC, a laptop, a tablet, or a smartphone. The user terminal 100 according to the present invention is a terminal having a camera for photographing the user for a multi-party video conversation, in particular a 3D depth camera 110 for forming a three-dimensional image.

The data relay server 200 is a server dedicated to relaying image data when the user terminal 100 transmits images to the media processing server 300 or the media processing server 300 retransmits images to the user terminal 100. It handles data transmission when video data traffic increases, such as when many users participate in a video chat or when there are a plurality of video chat groups. If few users participate in the video chat, the data relay server 200 may be omitted.

The media processing server 300 is a server for synthesizing and encoding video data. The media processing server 300 is a server that provides a media service for integrating and processing images transmitted from a plurality of users into one video for each user terminal group for a multi-party video chat service.

The content sharing server 400 is a server that provides content to be shared among the plurality of user terminals 100. For example, when the multi-party video chat system of the present invention is provided as an addition to a distance education service, the content sharing server 400 may be a server that provides education content to a group of user terminals. As another example, when the multi-party video chat system of the present invention is provided in a multi-user connected game environment, the content sharing server 400 may be a server providing a game environment and a matching service to users connected to the same game.

FIG. 2 is a block diagram illustrating a configuration of a user terminal in the present invention.

The user terminal 100 is a terminal that captures an image of a user, transmits the image to a remote media processing server 300, and receives a composite image from the media processing server 300 to perform a multi-party video conversation. As shown in FIG. 2, the user terminal 100 includes a 3D depth camera 110, a key input unit 120, a microphone 130, a speaker 140, a controller 150, a display means 160, an image processing module 170, an image overlay module 180, and a communication module 190. Of course, other peripherals found in a commonly known PC may be further included.

The 3D depth camera 110 is a means for capturing a photographed object and converting it into an image. Unlike a general camera, the 3D depth camera 110 has distance information for each converted image object. For example, the 3D depth camera 110 includes an RGB sensing unit that senses RGB information of an object and a distance sensing unit that senses distance information of an object. The RGB sensing unit, for example a CCD image sensor, senses the objects constituting the photographed scene and expresses their color using R, G, and B values of 0 to 255. The distance sensing unit, for example an infrared sensor or an ultrasonic sensor, senses the distance to the photographed object. A 3D depth camera is typically used to form a three-dimensional image, but in the present invention it is used to extract the user image.

The key input unit 120 is a means for inputting a character when a user exchanges text with a conversation counterpart, and is a means for inputting a key input command. The microphone 130 converts a user's voice signal into an electrical signal, and the speaker 140 converts the electric signal into an audible frequency band and outputs the voice signal. The microphone 130 and the speaker 140 form a handset during video conversation.

The controller 150 is a means for controlling the operation of the user terminal 100 and may include, for example, a motherboard and a CPU. It may also include an operating system (OS) installed in the main memory.

The display means 160 is a means for outputting an image to the user. The display means 160 is, for example, a known display device such as an LCD or an AMOLED, and may include a touch user interface (TUI) supporting a user's touch input.

The image processing module 170 is a means for preprocessing the image photographed by the 3D depth camera 110. As described above, the image photographed by the 3D depth camera 110 includes distance information of the object. The image processing module 170 reads the distance information of the objects constituting the photographed scene and, based on that distance information, removes the background image while keeping the user object image (the user image as illustrated in FIG. 3). For example, background segmentation may be used. The area where the background image existed is then replaced with the single-color virtual background image.
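As a minimal sketch of this preprocessing step, the function below keeps every pixel whose measured distance is within a cutoff and paints all other pixels with the single-color virtual background. The fixed distance cutoff, the use of `None` for missing depth readings, and the name `replace_background` are assumptions made for illustration; the specification only states that distance information is used and that background segmentation may be applied.

```python
def replace_background(rgb, depth, max_user_distance, key_color):
    """Keep pixels no farther than max_user_distance; paint the rest key_color.

    rgb   : rows of (r, g, b) tuples, each channel in 0..255
    depth : rows of distances in the same unit as max_user_distance;
            None marks a pixel without a valid depth reading (assumption)
    """
    out = []
    for rgb_row, depth_row in zip(rgb, depth):
        out_row = []
        for pixel, distance in zip(rgb_row, depth_row):
            is_user = distance is not None and distance <= max_user_distance
            out_row.append(pixel if is_user else key_color)
        out.append(out_row)
    return out
```

A real segmentation would likely also smooth the mask edges; a hard threshold is used here only to keep the idea visible.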

Referring to FIG. 3, the user terminal A is a terminal used by the presenter of a video conference. The user terminal A removes the background image behind the presenter and generates a first terminal image 520 filled with a single-color virtual background image. The user terminals B and C are terminals used by general attendees and, in the same way as the user terminal A, generate a second terminal image 530 and a third terminal image 540 in which the background image behind the participant is removed and filled with a single-color virtual background image.

In this case, the image processing module 170 configures the virtual background image in a color different from the colors constituting the user object image, or different from the colors constituting the boundary of the user object image. By configuring the virtual background image in a different color in this way, the media processing server 300 can easily identify the user object image and perform synthesis.
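One way to pick such a color can be sketched as follows: from a small palette of candidates, choose the one whose closest user pixel is farthest away in RGB space. The candidate palette, the squared Euclidean distance, and the function names are assumptions for illustration; the specification only requires that the background color differ from the user object colors (or from its boundary colors, in which case only boundary pixels would be passed in).

```python
# Hypothetical palette of saturated candidate key colors.
CANDIDATES = [
    (0, 255, 0), (0, 0, 255), (255, 0, 255),
    (0, 255, 255), (255, 255, 0), (255, 0, 0),
]


def color_distance_sq(a, b):
    """Squared Euclidean distance between two (r, g, b) colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick_key_color(user_pixels, candidates=CANDIDATES):
    """Return the candidate whose nearest user pixel is farthest away."""
    user_pixels = list(user_pixels)
    if not user_pixels:
        return candidates[0]

    def margin(candidate):
        return min(color_distance_sq(candidate, p) for p in user_pixels)

    return max(candidates, key=margin)
```

For a user wearing a green shirt, for example, this selection avoids a green background, so the shirt is not mistaken for background during synthesis.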

The controller 150 of each user terminal 100 transmits the images 520, 530, and 540 generated by the image processing module 170 to the media processing server 300 through the communication module 190.

The media processing server 300 generates a composite image 550 by synthesizing the received images as shown in FIG. 3 with respect to the user terminal groups conducting the video conversation. For example, the received images are arranged so that the user object images do not overlap, and the generated composite image is encoded and retransmitted to each user terminal 100.
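The arrangement step can be illustrated with a sketch that places the received frames left to right on one canvas, so user objects cannot overlap. The side-by-side layout, the padding of shorter frames, and the assumption that all terminals use the same background color are illustrative choices, not details given in the specification; encoding is omitted.

```python
def compose_side_by_side(frames, key_color):
    """Place frames left to right on one canvas.

    frames    : list of frames, each a list of rows of (r, g, b) tuples
    key_color : background color used to pad frames shorter than the
                tallest one (assumed to be shared by all terminals)
    """
    if not frames:
        return []
    height = max(len(frame) for frame in frames)
    canvas = [[] for _ in range(height)]
    for frame in frames:
        width = len(frame[0]) if frame else 0
        for y in range(height):
            row = frame[y] if y < len(frame) else [key_color] * width
            canvas[y].extend(row)
    return canvas
```

Because every frame occupies its own column range, the single-color background stays contiguous around each user, which keeps later transparency processing simple.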

The image overlay module 180 of the user terminal 100 overlays the received composite image on the shared content to generate an overlay image. Specifically, the image overlay module 180 receives and decodes the composite image. It then removes the single-color virtual background image from the composite image so that this area becomes transparent. The overlay image is generated by overlaying the resulting transparent image on the shared content and rendering the result.
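The transparency and overlay steps amount to color keying, which can be sketched as below: pixels of the decoded composite that match the background color are skipped, and all others are drawn over the shared content. The per-channel `tolerance` parameter is an assumption added because lossy video encoding can shift colors slightly; the function name is hypothetical.

```python
def overlay_on_content(composite, content, key_color, tolerance=0):
    """Draw non-key pixels of `composite` over a copy of `content`.

    Pixels within `tolerance` of key_color on every channel are treated
    as transparent, so the shared content shows through.
    """
    def is_key(pixel):
        return all(abs(a - b) <= tolerance for a, b in zip(pixel, key_color))

    out = [list(row) for row in content]  # leave the input untouched
    for y, row in enumerate(composite):
        if y >= len(out):
            break
        for x, pixel in enumerate(row):
            if x < len(out[y]) and not is_key(pixel):
                out[y][x] = pixel
    return out
```

With `tolerance=0` only exact matches become transparent; a small positive value is usually needed after decoding a compressed stream.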

Here, the term "shared content" means content shared by the participants who participated in the multi-party video conversation. For example, shared content is a shared document or image selected by a facilitator in a distance education or video conference. As another example, the shared content is a game screen selected by the user in a game environment accessed by multi-users.

In FIG. 3, the presentation material including the graph created by the presenter is the shared content 510. The shared content 510 may be a document or an image provided directly by the user terminal 100, but in order to be provided in real time to the other users participating in the conversation, it is preferably content provided by the content sharing server 400 as shown in the drawing.

As illustrated, the overlay image 560 is generated by overlaying the composite image 550 on the shared content 510, and the overlay image 560 may be displayed not only on the terminals participating in the video conversation but also on a display terminal 600. For example, the user terminals 100 may be the terminals of a plurality of teachers who conduct distance education, and the display terminal 600 may be the terminal of an unspecified student who attends the distance education.

FIG. 4 is a view showing another example of providing a composite image in the present invention. As shown in FIG. 4, some of the user terminals 100 may include a 2D camera 115.

For example, the user terminal A, which is provided with the 2D camera 105 and is selected as the presenter or host, transmits the captured 2D image to the media processing server 300.

General user terminals B and C equipped with a 3D depth camera transmit the captured 3D image to the media processing server 300 as described above with reference to FIG. 3.

When there is a general user terminal D having another 2D camera 115, the 2D image photographed through the 2D camera 115 may be encoded and transmitted to the media processing server 300 as shown in the drawing.

The media processing server 300 generates a composite image 550 by synthesizing the 2D image received from the user terminal A, the 3D images received from the user terminals B and C, the 2D image received from the user terminal D, and the like. In this case, the 2D image of the user terminal A designated as the presenter or host is used as the base screen of the composite image 550. The 3D images of the user terminals B and C have their background color removed by using the depth information as described above, and the composite image may be generated by superimposing them on the base screen. The 2D image of the general user terminal D 250 having the 2D camera is placed by the media processing server at a separate position in the composite image, which ensures compatibility so that this terminal can also participate in the stereoscopic video conference.
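A minimal sketch of this mixed synthesis is given below. It assumes the 3D terminals send frames whose background has already been replaced with a single color (as in the flow of FIG. 3), so the server only has to skip that color; the paste positions, the opaque thumbnail placement for additional 2D terminals, and the function name are hypothetical. Encoding is omitted.

```python
def composite_mixed(base_2d, keyed_users, extra_2d=None):
    """Overlay background-removed user frames on a 2D base frame.

    base_2d     : rows of (r, g, b) tuples from the presenter's 2D camera
    keyed_users : list of (frame, key_color, (x, y)); key-colored pixels
                  are skipped so the base frame shows through
    extra_2d    : optional list of (frame, (x, y)) pasted opaquely, e.g.
                  another 2D terminal shown at a separate position
    """
    out = [list(row) for row in base_2d]  # leave the input untouched
    height = len(out)
    width = len(out[0]) if out else 0

    def put(x, y, pixel):
        # Ignore pixels that would fall outside the base frame.
        if 0 <= y < height and 0 <= x < width:
            out[y][x] = pixel

    for frame, key_color, (ox, oy) in keyed_users:
        for y, row in enumerate(frame):
            for x, pixel in enumerate(row):
                if pixel != key_color:
                    put(ox + x, oy + y, pixel)
    for frame, (ox, oy) in (extra_2d or []):
        for y, row in enumerate(frame):
            for x, pixel in enumerate(row):
                put(ox + x, oy + y, pixel)
    return out
```

Because the 2D frame of terminal D has no removable background, it is pasted as an opaque rectangle, which mirrors the "separate position" treatment described above.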

Here, the background color is removed so that the base image can serve as the background: the background of the video conference organizer is shared with the participants in the video conference, so that the meeting can proceed as if all participants were in an environment having the same background.

The media processing server 300 encodes the generated composite image and retransmits it to each user terminal 100, and each user terminal 100 will render and display the image received from the media processing server 300.

As described above, in the multi-party video chat system and method using the 3D depth camera of the present invention, a plurality of user terminals connected to the network transmit user images extracted with the 3D depth camera to the media processing server, the media processing server synthesizes the user images and retransmits them to the user terminals, and each user terminal overlays the composite image on the shared content and displays it to the participants of the group conducting the video conversation. As described with reference to FIG. 3, the participants in the video chat are thus placed on the same virtual background, which may be a shared document or image, a common game screen, or the like, so that a highly immersive video conversation can be implemented. Accordingly, the present invention can provide a highly immersive user experience (UX) in various video conversation systems such as distance education, video conferencing, multi-user games, cyber model houses, interactive seminar relaying, disaster response broadcasting, home shopping, and the like.

Various modifications are possible in the invention disclosed above without departing from the basic idea. That is, the above embodiments should all be interpreted as illustrative and not restrictive. Therefore, the protection scope of the present invention should be determined according to the appended claims rather than the above-described embodiments, and if the components defined in the appended claims are replaced by equivalents, they should be regarded as belonging to the protection scope of the present invention.

Claims

1. A multi-party video chat system using a 3D depth camera, comprising:

a plurality of user terminals for capturing an image of a user and transmitting the image to a remote server to perform a multi-party video conversation; and

a media processing server for synthesizing the images received from the user terminals to generate a composite image and retransmitting the generated composite image to a group of user terminals conducting a video conversation,

wherein the user terminal comprises:

a 3D depth camera that captures a photographed object, converts it into an image, and has distance information for each converted image object;

an image processing module which removes a background image, excluding a user object image, from the image photographed by the 3D depth camera and replaces it with a single-color virtual background image;

a communication module communicating with the media processing server to transmit and receive images;

an image overlay module for generating an overlay image by overlaying the composite image received from the media processing server on shared content of the user terminal group; and

display means for outputting the overlay image on a screen.
2. The system of claim 1,

wherein the 3D depth camera comprises an RGB sensing unit for sensing RGB information of the photographed object and a distance sensing unit for sensing distance information of the photographed object.
3. The system of claim 1,

wherein the image processing module configures the virtual background image in a color different from the colors constituting the user object image.
4. The system of claim 1,

wherein the image processing module configures the virtual background image in a color different from the colors constituting a boundary of the user object image.
5. The system of claim 1,

wherein the media processing server encodes the composite image and retransmits it to the user terminal, and

the image overlay module receives and decodes the composite image, removes the single-color virtual background image from the composite image to make it transparent, and overlays the resulting transparent image on the shared content for rendering.
6. The system of claim 5,

wherein the shared content is a shared document or image selected by a host in distance learning or video conferencing.
7. The system of claim 5,

wherein the shared content is a game screen selected by a user in a multi-user connected game environment.
8. The system according to any one of claims 1 to 7,

further comprising a data relay server for relaying image data between the user terminal and the media processing server.
9. The system according to any one of claims 1 to 7,

further comprising a content sharing server for providing the shared content.
10. A multi-party video chat method using a 3D depth camera, in which, when a user terminal captures and transmits an image of a user, a media processing server synthesizes the images received from the user terminals to generate a composite image and retransmits the composite image to a group of user terminals conducting a video conversation, the method comprising:

(a) capturing a photographed object with a 3D depth camera installed in the user terminal;

(b) removing the background image, excluding the user object image, from the captured image and replacing it with a single-color virtual background image;

(c) transmitting the image generated in step (b) to the media processing server;

(d) generating an overlay image by overlaying the composite image received from the media processing server on shared content of the user terminal group; and

(e) outputting the overlay image on a screen.
11. The method of claim 10,

wherein the 3D depth camera comprises an RGB sensing unit for sensing RGB information of the photographed object and a distance sensing unit for sensing distance information of the photographed object.
12. The method of claim 10,

wherein step (b) configures the virtual background image in a color different from the colors constituting the user object image.
13. The method of claim 10,

wherein step (b) configures the virtual background image in a color different from the colors constituting the boundary of the user object image.
14. The method of claim 10,

wherein the media processing server encodes the composite image and retransmits it to the user terminal, and

step (d) comprises receiving and decoding the composite image, removing the single-color virtual background image from the composite image to make it transparent, and overlaying the resulting transparent image on the shared content for rendering.
15. The method of claim 14,

wherein the shared content is a shared document or image selected by a host in distance learning or video conferencing.
16. The method of claim 14,

wherein the shared content is a game screen selected by a user in a multi-user connected game environment.
17. The method according to any one of claims 10 to 16,

wherein a data relay server relays image data between the user terminal and the media processing server.
18. The method according to any one of claims 10 to 16,

wherein a content sharing server provides the shared content.
19. A multi-party video chat system using a 3D depth camera, comprising:

at least one first user terminal for capturing an image of a user with a 2D camera and transmitting and receiving the captured 2D image with a remote server to perform a multi-party video conversation;

at least one second user terminal for capturing an image of a user with a 3D depth camera and transmitting and receiving the captured 3D image with a remote server to perform a multi-party video conversation; and

a media processing server that generates a composite image by synthesizing the images transmitted from the first user terminal and the second user terminal, using the 2D image as the base screen of the composite image and superimposing the 3D image on the 2D image after removing the background color of the 3D image, encodes the generated composite image, and retransmits it to the first user terminal and the second user terminal,

wherein the first user terminal and the second user terminal render and display the image received from the media processing server.
20. The system of claim 19,

wherein the 2D image provided as the base screen of the composite image is an image of a host of distance education or a video conference captured by the 2D camera of the first user terminal, and

the 3D depth camera of the second user terminal comprises an RGB sensing unit for sensing RGB information of the photographed object and a distance sensing unit for sensing distance information of the photographed object.
21. The system of claim 19,

wherein the media processing server maintains compatibility when an additional general user terminal using a 2D camera other than the 2D camera of the first user terminal participates.