WO2022259480A1 - Video processing device, video processing method, and video processing program - Google Patents

Video processing device, video processing method, and video processing program

Info

Publication number
WO2022259480A1
WO2022259480A1 (PCT/JP2021/022173)
Authority
WO
WIPO (PCT)
Prior art keywords
video
image
video processing
processing device
unit
Prior art date
Application number
PCT/JP2021/022173
Other languages
French (fr)
Japanese (ja)
Inventor
弘員 柿沼 (Hirokazu Kakinuma)
誉宗 巻口 (Motohiro Makiguchi)
秀信 長田 (Hidenobu Nagata)
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2021/022173 (WO2022259480A1/en)
Priority to JP2023526771A (JPWO2022259480A1/ja)
Publication of WO2022259480A1 publication Critical patent/WO2022259480A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/431: Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/18: Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Definitions

  • The embodiments relate to a video processing device, a video processing method, and a video processing program.
  • Non-Patent Document 1 and Non-Patent Document 2 propose technologies that make the user feel the presence of others by presenting the appearance and remarks of the audience on the screen.
  • The embodiments provide a video processing device, a video processing method, and a video processing program that allow a user, when watching events such as music and theater or watching sports through video at a remote location, to feel as if watching at the actual venue.
  • The video processing device of an embodiment has a receiving unit, a distance estimation unit, a video processing unit, and a transmission unit.
  • The receiving unit receives video.
  • The distance estimation unit estimates a viewing distance under the assumption that a person viewed scenery spanning the same range as the video.
  • The video processing unit processes the video so that the size of a reference object appearing in the video at the viewing distance matches the actual size of the reference object.
  • The transmission unit transmits the processed video to a display.
  • According to the embodiments, a video processing device, a video processing method, and a video processing program are provided that allow the user, when watching events such as music and theater or watching sports through video at a remote location, to feel as if watching at the actual venue.
  • FIG. 1 is a diagram showing a schematic configuration of a video distribution system including a video processing device according to an embodiment.
  • FIG. 2 is a diagram showing an example of the hardware configuration of the video processing device.
  • FIG. 3 is a functional block diagram of the video processing device.
  • FIG. 4 is a flow chart showing the operation of the video processing device.
  • FIG. 5 is a diagram showing the relationship between the height of the performer and the shooting range of the camera in the height direction.
  • FIG. 6 is a diagram showing the distance relationship between the user and the performer when it is assumed that the user is watching the performer at the venue.
  • FIG. 7A is a diagram showing a schematic configuration of a video distribution system using a transmissive wearable display.
  • FIG. 7B is a diagram showing a schematic configuration of a video distribution system using a transmissive wearable display.
  • Embodiments are described below with reference to the drawings. FIG. 1 is a diagram showing a schematic configuration of a video distribution system including a video processing device according to an embodiment.
  • The video distribution system 1 has a camera 10, a video processing device 20, and a display 30.
  • The camera 10 is installed, for example, in the audience seats AS at a venue where an event such as music or theater is held. Specifically, the audience seats AS are arranged to face the stage S, for example, and a space for installing the camera 10 is provided in a part of the audience seats AS.
  • The camera 10 may be configured to be movable or may be fixed in position.
  • The camera 10 shoots the performer P at a predetermined frame rate and generates video data of the performer P.
  • The video data may be resized according to the size of the video that the display 30 can display, or the like. For example, if that size is full HD (High Definition), the video data may be resized to 1920 × 1080 pixels.
  • The camera 10 is communicatively connected to the video processing device 20, and the video data captured by the camera 10 is transmitted to the video processing device 20.
  • The performer P in the embodiments is a general term for a person who performs some activity in an event: a musician in a music event, an actor in a theater event, and so on.
  • The performer P is not limited to a person performing a specific expressive activity.
  • In FIG. 1 the number of cameras 10 is one, but it is not limited to one; for example, cameras 10 may be installed at multiple positions within the venue.
  • The video processing device 20 processes the video transmitted from the camera 10. For example, the video processing device 20 processes the video so that the distance from the user U to a virtual display surface arranged in the three-dimensional space perceived through the display 30, and the height of the performer P on that virtual display surface, become the distance and height the user U would perceive if actually watching the performer P at the venue. The video processing device 20 then transmits the processed video to the display 30.
  • The video processing device 20 may be installed inside or outside the venue. Inside the venue, the video processing device 20 may be included in the camera 10; outside the venue, it may be included in the display 30. Of course, the video processing device 20 may also be separate from both the camera 10 and the display 30. Details of the video processing device 20 are described later.
  • The display 30 is configured to communicate with the video processing device 20 and displays the video transmitted from the video processing device 20.
  • The display 30 is, for example, a non-transmissive glasses-type wearable display worn on the head of the user U, who is at a location remote from the venue.
  • The display 30 is configured for three-dimensional display.
  • For example, the display 30 has a display unit at each eye position and displays, on each display unit, video obtained by synthesizing the video of the camera 10 transmitted from the video processing device 20 with avatar video of the audience. The user U thereby perceives the virtual image Pi of the performer P and the virtual image Ai of the audience avatar at a position at a predetermined viewing distance from the user U.
  • An avatar is an image imitating a spectator and may be, for example, a two-dimensional or three-dimensional illustration representing a spectator, or a CG image based on live footage.
  • The configuration for three-dimensional display by the display 30 is not limited to any specific configuration. The location of the user U may be any place remote from the venue, such as the user U's home or a public viewing venue provided separately from the event venue.
  • In addition to the camera 10, the venue may also have a camera 40 and a large service monitor 50.
  • The camera 40 is communicatively connected to the large service monitor 50.
  • The camera 40 is arranged, for example, on the stage S and is configured to photograph the performer P on the stage S.
  • The number of cameras 40 is not limited to one.
  • A camera 40 may also be installed in the audience seats AS.
  • The large service monitor 50 is installed, for example, on the stage S so as to face the audience seats AS, and displays the video captured by the camera 40 on a large screen.
  • In the video distribution system 1 of FIG. 1, a spectator A1 sitting near the stage S and a spectator A2 sitting farther back can both see the performer P on the stage S with the naked eye, yet they perceive different scenery.
  • Specifically, the performer P appears smaller in the scenery v2 perceived by the spectator A2 than in the scenery v1 perceived by the spectator A1.
  • A spectator A3 sitting in the audience seats AS far from the stage S sees the performer P via the image on the large service monitor 50, and therefore perceives scenery v3 that differs from the scenery perceived by the spectators A1 and A2. In this way, people perceive different scenery depending on the viewing position even when watching the performer P at the same venue. This is one cause of the sense of incongruity the user U feels when viewing the video captured by the camera 10 at a remote location.
  • FIG. 2 is a diagram showing an example of the hardware configuration of the video processing device 20.
  • The video processing device 20 may be configured as a computer.
  • The video processing device 20 need not be a single computer and may be composed of a plurality of computers.
  • The video processing device 20 has a processor 201, a ROM (Read Only Memory) 202, a RAM (Random Access Memory) 203, a storage 204, an input device 205, and a communication module 206.
  • The video processing device 20 may further have a display or the like.
  • The processor 201 is a processing circuit capable of executing various programs and controls the overall operation of the video processing device 20.
  • The processor 201 may be a processor such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or a GPU (Graphics Processing Unit).
  • The processor 201 may also be an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or the like.
  • The processor 201 may be composed of a single CPU or the like, or of a plurality of CPUs or the like.
  • The ROM 202 is a non-volatile semiconductor memory and holds programs and control data for controlling the video processing device 20.
  • The RAM 203 is, for example, a volatile semiconductor memory and is used as a work area for the processor 201.
  • The storage 204 is a non-volatile storage device such as a hard disk drive (HDD) or a solid state drive (SSD).
  • The storage 204 holds a program 2041, variables 2042, and spectator data 2043.
  • The program 2041 is a program for processing the video of the camera 10.
  • The program 2041 causes the processor 201 to execute a process of estimating the shooting range from the video of the camera 10, a process of estimating a virtual viewing distance under the assumption that a person viewed scenery spanning the same range as the estimated shooting range, and a process of processing the video based on the estimated viewing distance.
  • The variables 2042 are various variables used for video processing.
  • In the embodiment, the variables 2042 include the length of the reference object, the vertical size of the image sensor, the focal length, and the number of vertical pixels of the video.
  • The length of the reference object is the actual length of the reference object used to estimate the shooting range.
  • The reference object may be any object of known length that can appear in the video captured by the camera 10.
  • For example, the performer P may serve as the reference object.
  • In that case, the length of the reference object may be the performer P's height.
  • The height of the performer P may be input to the video processing device 20 by the event organizer or the like before the event is held.
  • The vertical size of the image sensor is the vertical size when the retina of the human eye is regarded as the image sensor of a camera. For example, if the retina of the human eye is equated with a full-size image sensor, the vertical size of the image sensor is 24 mm. If the retinal function of the human eye is equated with an APS-C size image sensor, the vertical size of the image sensor is 16.7 mm.
  • The focal length is the value of the focal length when the lens of the human eye is regarded as a camera lens.
  • When the human eye is regarded as a camera consisting of a lens and a full-size image sensor, its focal length is said to be equivalent to 10 to 12 mm. In reality, however, humans do not process all the light entering the eye from such a field of view, but only the light from a partial range said to correspond to a focal length of about 50 mm.
  • In the embodiment, the storage 204 therefore holds 50 mm as the focal length value. If the human eye is regarded as a camera consisting of a lens and an APS-C size image sensor, the focal length is about 35 mm, and in that case the storage 204 holds 35 mm as the focal length value.
  • The number of vertical pixels of the video is the number of vertical pixels after resizing by the camera 10. For example, if the video is resized to full HD (High Definition) size, the number of vertical pixels is 1080. (These parameter choices are sketched in code below.)
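To make the role of these variables concrete, here is a minimal sketch of how the "human eye as camera" parameters and the resize settings might be held. The class and constant names are illustrative assumptions, not anything the patent names, and the 1.70 m performer height is a placeholder value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EyeCameraModel:
    """Parameters of the human eye modeled as a camera (cf. variables 2042)."""
    sensor_height_mm: float  # vertical size of the modeled image sensor
    focal_length_mm: float   # effective focal length of the modeled eye

# Full-size model: 24 mm sensor, ~50 mm effective focal length.
FULL_FRAME_EYE = EyeCameraModel(sensor_height_mm=24.0, focal_length_mm=50.0)
# APS-C model: 16.7 mm sensor, ~35 mm effective focal length.
APS_C_EYE = EyeCameraModel(sensor_height_mm=16.7, focal_length_mm=35.0)

REFERENCE_LENGTH_M = 1.70  # assumed performer height Hp, entered by the organizer
FRAME_HEIGHT_PX = 1080     # vertical pixels h of the resized full-HD video
```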
  • The storage 204 also holds spectator data 2043.
  • The spectator data 2043 includes video data representing spectators.
  • The video data representing spectators is, for example, video data of spectator avatars.
  • The video data representing spectators may also be video data of actual spectators captured in advance.
  • The spectator data 2043 may further include, as metadata, biometric information such as the heartbeats, motion states, and emotions of spectators watching the event at the venue. Such metadata is collected sequentially at the venue while the event is in progress and transmitted to the video processing device 20.
  • The input device 205 is an interface device for the administrator of the video processing device 20 to operate the video processing device 20.
  • The input device 205 can include, for example, a touch panel, a keyboard, a mouse, various operation buttons, and various operation switches.
  • The input device 205 may be used, for example, to input the variables 2042.
  • The communication module 206 is a module that includes circuits used for communication between the video processing device 20 and other devices.
  • The communication module 206 may be, for example, a communication module conforming to the wired LAN standard or to the wireless LAN standard.
  • FIG. 3 is a functional block diagram of the video processing device 20.
  • The video processing device 20 has a receiving unit 2011, a distance estimation unit 2012, an image processing unit 2013, a three-dimensional processing unit 2014, and a transmission unit 2015.
  • By executing the program 2041, the processor 201 of the video processing device 20 can operate as the receiving unit 2011, the distance estimation unit 2012, the image processing unit 2013, the three-dimensional processing unit 2014, and the transmission unit 2015.
  • The receiving unit 2011, the distance estimation unit 2012, the image processing unit 2013, the three-dimensional processing unit 2014, and the transmission unit 2015 may instead be realized by hardware separate from the processor 201.
  • The receiving unit 2011 acquires the video received from the camera 10 via the communication module 206 and decomposes the acquired video into frames. The receiving unit 2011 then sequentially transfers the frame-by-frame video to the distance estimation unit 2012 and the image processing unit 2013. For example, when the frame rate of the video is 60 fps (frames per second), the receiving unit 2011 decomposes each second of video into 60 frames.
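The frame decomposition can be pictured with a short sketch; OpenCV is an assumed implementation choice (the patent names no library), and the function name is illustrative.

```python
import cv2

def frames(video_source: str):
    """Yield the received video frame by frame, e.g. 60 frames per second of 60 fps video."""
    cap = cv2.VideoCapture(video_source)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Each frame would be handed to the distance estimation
            # and image processing stages in turn.
            yield frame
    finally:
        cap.release()
```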
  • The distance estimation unit 2012 uses the video transferred from the receiving unit 2011 to estimate the virtual viewing distance between the user U and the performer P under the assumption that the user U was watching, at the venue, scenery spanning the same range as the video captured by the camera 10. The operation of the distance estimation unit 2012 is described in detail later.
  • The image processing unit 2013 processes the video captured by the camera 10 based on the viewing distance estimated by the distance estimation unit 2012 and the height of the performer P. For example, the image processing unit 2013 enlarges or reduces the region of the performer P in the video so that the height of the performer P in the three-dimensional space perceived by the user U through the three-dimensional display matches the actual height of the performer P.
  • The operation of the image processing unit 2013 is described in detail later.
  • The three-dimensional processing unit 2014 performs rendering processing for three-dimensional display on the display 30.
  • The three-dimensional processing unit 2014 places the image plane of the performer P, in the three-dimensional space perceived by the user U, at the viewing distance estimated by the distance estimation unit 2012.
  • The three-dimensional processing unit 2014 also arranges video representing the audience in front of the image plane of the performer P, based on the viewing distance estimated by the distance estimation unit 2012.
  • The three-dimensional processing unit 2014 renders the three-dimensional video data, taking into account reflections from virtual light sources in the three-dimensional space.
  • The three-dimensional processing unit 2014 then photographs, with a virtual stereo camera placed at the position of the user U, the three-dimensional space containing the image plane of the performer P and the image plane of the audience, obtaining a right-eye image and a left-eye image. The three-dimensional processing unit 2014 transfers three-dimensional video data including the acquired right-eye and left-eye images to the transmission unit 2015.
  • The operation of the three-dimensional processing unit 2014 is described in detail later.
  • The transmission unit 2015 transmits the three-dimensional video data sent from the three-dimensional processing unit 2014 to the display 30 via the communication module 206.
  • FIG. 4 is a flowchart showing the operation of the video processing device 20. The processing of FIG. 4 is performed by the processor 201 at regular intervals, for example from the start of the event until its end. This interval is, for example, the time interval at which video data is transmitted from the camera 10.
  • In step S1, the processor 201 acquires the variables 2042 from the storage 204. As described above, the variables 2042 include the length of the reference object, the focal length, the vertical size of the image sensor, and the number of vertical pixels of the video.
  • In the following example, the length of the reference object is the performer P's height.
  • In step S2, the processor 201 acquires the video data transmitted from the camera 10 and stored, for example, in the RAM 203.
  • The processor 201 then decomposes the video into frames.
  • The processor 201 also acquires the spectator data 2043 from the storage 204.
  • In this example, the video representing the spectators in the spectator data 2043 is spectator avatar video.
  • In step S3, the processor 201 performs the distance estimation processing.
  • The distance estimation processing is described below, using the following variable values as an example:
  • Focal length f: 50 (mm)
  • Vertical size of image sensor S: 24 (mm)
  • Number of vertical pixels of video h: 1080 (pixels)
  • The processor 201 first performs object detection on each frame of the video.
  • Object detection in the embodiment is a process of detecting an object of known length in the video.
  • In this example, the processor 201 detects the performer P.
  • The object detection may be based on object detection algorithms such as Mask R-CNN and YOLO. In the following, it is assumed that object detection estimates the number of vertical pixels hp occupied by the region of the performer P in the video to be 640 (pixels).
  • As shown in FIG. 5, the shooting range H in the height direction at the standing position of the performer P, when shot by the camera 10 having a given focal length at a given distance from the performer P, is calculated from the performer P's height Hp by the following (Equation 1): H = Hp × h / hp.
  • As shown in FIG. 6, the viewing distance D, under the assumption that the user U was looking at scenery in the same range as the shooting range H with an eye modeled as a camera having a lens of focal length f and an image sensor of vertical size S, is calculated by the following (Equation 2): D = f × H / S.
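As a concreteness check, here is a minimal sketch of the step-S3 estimation combining (Equation 1) and (Equation 2). The function and parameter names are illustrative, and the 1.7 m performer height is an assumed value, not one given in the text.

```python
def estimate_viewing_distance(reference_length_m: float,  # Hp: actual height of the reference
                              reference_pixels: int,      # hp: vertical pixels of the reference in the frame
                              frame_pixels: int,          # h: vertical pixels of the frame
                              focal_length_mm: float,     # f: modeled eye focal length
                              sensor_height_mm: float     # S: modeled sensor vertical size
                              ) -> float:
    # (Equation 1): shooting range H at the reference object's position.
    shooting_range_m = reference_length_m * frame_pixels / reference_pixels
    # (Equation 2): viewing distance D from similar triangles, S / f = H / D.
    return focal_length_mm * shooting_range_m / sensor_height_mm

# With the values in the text (f = 50 mm, S = 24 mm, h = 1080 px, hp = 640 px)
# and an assumed performer height of 1.7 m:
print(estimate_viewing_distance(1.7, 640, 1080, 50.0, 24.0))  # ≈ 5.98 m
```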
  • The processor 201 then moves the process to step S4.
  • In step S4, the processor 201 processes the video. That is, the processor 201 enlarges or reduces the video so that the height of the performer P on the virtual display surface perceived by the user U through the three-dimensional display becomes the actual height Hp.
  • In step S5, the processor 201 performs three-dimensional video generation processing. Specifically, the processor 201 places the image plane of the performer P at the viewing distance D from the user U, and places the image plane of the audience avatars between the user U and the image plane of the performer P. The processor 201 then acquires the right-eye image and the left-eye image obtained when this three-dimensional space is photographed from the position of the user U with a virtual stereo camera.
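A hedged sketch of the stereo geometry behind this step: for content placed at depth D, the horizontal disparity between the right-eye and left-eye images follows the standard stereo relation. The constants below are illustrative assumptions, not values from the patent.

```python
def disparity_px(depth_m: float,
                 ipd_m: float = 0.064,      # assumed interpupillary distance of user U
                 focal_px: float = 1000.0   # assumed virtual-camera focal length in pixels
                 ) -> float:
    # disparity = focal length * baseline / depth; placing the performer's
    # image plane farther away (larger D) yields a smaller disparity.
    return focal_px * ipd_m / depth_m
```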
  • The viewing distance D may determine the number and size of the audience avatars in the avatar video superimposed on the video of the performer P. Specifically, the number of audience avatars decreases as the viewing distance D decreases, and becomes 0 when the viewing distance D is equal to or less than a threshold.
  • The threshold for the viewing distance D can be set, for example, to the distance between the front row of audience seats and the stage S in the venue.
  • The size of a spectator avatar is larger at positions closer to the user U in the three-dimensional space perceived through the three-dimensional display, and smaller at positions farther away. This reproduces the fact that at an event in a real venue, closer spectators appear larger. (A sketch of these rules follows.)
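An illustrative sketch of the avatar rules just described: fewer avatars as the viewing distance D shrinks, none at or inside the front row, and perspective scaling so nearer avatars appear larger. The density and scale constants are assumptions.

```python
def avatar_count(viewing_distance_m: float,
                 front_row_distance_m: float,   # threshold: front row to stage S
                 rows_per_meter: float = 0.5,   # assumed seating density
                 avatars_per_row: int = 10) -> int:
    if viewing_distance_m <= front_row_distance_m:
        return 0  # the user is virtually at (or inside) the front row
    depth = viewing_distance_m - front_row_distance_m
    return int(depth * rows_per_meter) * avatars_per_row

def avatar_scale(avatar_distance_m: float) -> float:
    # Simple perspective rule: apparent size falls off with distance.
    return 1.0 / max(avatar_distance_m, 0.1)
```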
  • Also in step S5, the processor 201 uses existing gloss reproduction techniques to calculate the appearance of the performer P and the avatars at the viewing distance D, taking into account reflection and scattering by the virtual light sources that illuminate each of them. Furthermore, the processor 201 may perform processing such as animating the avatars according to the biometric data of the audience.
  • In step S6, the processor 201 transmits the three-dimensional video data including the right-eye image and the left-eye image to the display 30 via the communication module 206.
  • The display 30 displays the right-eye image and the left-eye image appropriately on the binocular display units based on the received three-dimensional video data. The user U thereby perceives the image of the performer P, with height Hp, at the position of the viewing distance D.
  • In step S7, the processor 201 determines whether or not to end the processing. For example, when the event ends and communication with the camera 10 is cut off, the processor 201 determines that the processing should end. If the processor 201 does not determine in step S7 that the processing should end, the process returns to step S2. If it does, the processor 201 ends the processing of FIG. 4.
  • As described above, in the embodiment, the imaging range of the camera 10 at the standing position of the performer P is estimated from the video captured by the camera 10 placed at the venue and from the height of the performer P. Then, based on the estimated imaging range of the camera 10, the viewing distance D is estimated for the case where the user U sees scenery in a range equivalent to that imaging range.
  • The virtual display plane is arranged so that the performer P, with height Hp, is positioned at the viewing distance D in the three-dimensional space perceived by the user U through the three-dimensional display, and the size of the performer P on the virtual display plane is adjusted accordingly.
  • The three-dimensional video displayed in this way allows the user U to feel as if watching the performer P at the venue.
  • As a result, the user U can get the feeling of actually being at the venue.
  • In the embodiment, the display 30 is assumed to be a non-transmissive wearable display.
  • However, the display 30 may be a transmissive wearable display.
  • As the transmissive wearable display, either a video see-through or an optical see-through wearable display may be used.
  • A virtual-image projection method and a retinal projection method are known as methods of projecting video to the user U with a transmissive wearable display.
  • Either a virtual-image projection type or a retinal projection type wearable display may be used.
  • FIGS. 7A and 7B are diagrams showing a schematic configuration of a video distribution system 1 using a transmissive wearable display.
  • FIGS. 7A and 7B show only the configuration at the remote location.
  • The configuration is otherwise similar to that shown in FIG. 1.
  • FIG. 7A shows a first example.
  • In the first example, the user U wears the display 30a, a transmissive wearable display, on the head, and a wall W is in front of the user U.
  • With a transmissive display, the user U perceives the real image obtained from the outside world in addition to the virtual image Pi of the performer P and the virtual image Ai of the audience avatar produced by the binocular display units. Therefore, if the position at the viewing distance D from the user U is farther than the wall W, the user U perceives the virtual image Pi of the performer P and the virtual image Ai of the audience avatar beyond the wall.
  • The user U thus gets the feeling that the wall W is transparent and that the space extends behind it.
  • Compared with perceiving the virtual image Pi of the performer P and the virtual image Ai of the audience avatar at an empty position in midair, this gives the user U a greater sense of reality.
  • When the user U perceives the virtual image Pi of the performer P and the virtual image Ai of the audience avatar behind the wall W, a wall W with pronounced unevenness, patterns, and the like causes the user U to perceive the virtual images together with that unevenness and pattern, which impairs the sense of reality. It is therefore desirable that the wall W be flat and plain.
  • FIG. 7B shows a second example.
  • In the second example, the user U likewise wears the display 30a, a transmissive wearable display, on the head.
  • In front of the user U there is a monitor M whose power is turned off. If the position at the viewing distance D from the user U is farther than the screen of the monitor M, the user U perceives the virtual image Pi of the performer P and the virtual image Ai of the audience avatar beyond the screen of the monitor M. In other words, the user U feels as if the space extends beyond the frame of the monitor M.
  • As in the first example, compared with perceiving the virtual image Pi of the performer P and the virtual image Ai of the audience avatar at an empty position in midair, this gives the user U a greater sense of reality. Further, in the second example, the user U can feel as if video were being displayed on the monitor M even though the monitor M is turned off.
  • In the second example, it is desirable that the monitor M be powered off. This is because, when video is displayed on the monitor M, the user perceives that video together with the virtual image Pi of the performer P and the virtual image Ai of the audience avatar. On the other hand, information related to the venue may be displayed on the monitor M.
  • In the embodiment, processing is performed so that the height of the performer P is reproduced within the three-dimensional space.
  • Alternatively, processing may be performed so that the size of the large service monitor 50 is reproduced within the three-dimensional space.
  • In this case, the processor 201 acquires video from the camera 40.
  • The size of the virtual display surface is determined based on the height of the performer P in the video, as in the embodiment described above.
  • Here, however, the size of the virtual display surface is determined so that it matches the size of the large service monitor 50.
  • The processor 201 then arranges the image plane of the large service monitor 50 at a position at a sufficiently large viewing distance from the user U, and places video representing the audience between the user U and the image plane of the large service monitor 50. This creates a state in which, while the impact and viewability of the large-screen video are maintained, only the distance from the user U is increased and spectators are present between the image plane and the user.
  • In the embodiment, the video processing device 20 performs processing on video captured in real time by the camera 10.
  • However, the processing by the video processing device 20 may also be performed when playing back video recorded on a recording medium in the past.
  • The video recorded on the recording medium is not necessarily limited to video captured by the camera 10.
  • The video recorded on the recording medium may be CG or the like.
  • In this case, the video recorded on the recording medium is first sent to the video processing device 20.
  • The processor 201 of the video processing device 20 performs the processing shown in FIG. 4 and then transmits the three-dimensional video data to the display 30.
  • In this case, the biometric data of the spectators in the spectator data 2043 is data collected at the time the video was recorded.
  • The processor 201 updates the avatar video using the biometric data in synchronization with the playback timing of the video.
  • In the embodiment, the display 30 is assumed to be a display capable of three-dimensional display.
  • However, the display 30 may be a display incapable of three-dimensional display.
  • In this case, the processor 201 need not change the size of the performer P when processing the video.
  • Alternatively, if the screen of the display 30 is sufficiently large, the processor 201 may enlarge or reduce the size of the performer P in the video so that the height of the performer P is reproduced.
  • In this case, the processor 201 also scales the avatars up or down to the size that would be perceived at the viewing distance D from the user U.
  • The reference object does not necessarily have to be a person.
  • For example, in sports viewing, the reference object may be a person such as a player, or an object such as a soccer ball. If the reference object is an object such as a soccer ball, the video may be enlarged or reduced with that object as the reference. Depending on the reference object, its horizontal length may be used instead of its vertical length.
  • In the embodiment, the imaging range of the camera 10 is estimated based on the length of the reference object. This estimation is particularly effective when the camera 10 is movable and the imaging range can change over time. When the camera 10 is fixed, on the other hand, the shooting range of the camera 10 may instead be estimated from information such as the focal length and shooting distance of the camera 10, as sketched below.
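For completeness, a hedged sketch of that fixed-camera alternative under the standard pinhole-camera relation; the patent only notes that such camera information may be used, so the function name and form are assumptions.

```python
def shooting_range_fixed_camera(sensor_height_mm: float,
                                focal_length_mm: float,
                                shooting_distance_m: float) -> float:
    # Vertical extent H of the scene covered at the shooting distance,
    # from the pinhole relation H / distance = sensor height / focal length.
    return sensor_height_mm * shooting_distance_m / focal_length_mm
```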
  • In the embodiment, the sense of distance and the sense of size that would arise if the user U saw scenery in the same range as the shooting range of the camera 10 are reproduced in the three-dimensional video perceived by the user.
  • Alternatively, the viewing distance D may be defined as the distance from a specific audience seat to the performer P, and the video of the shooting range H equivalent to what the user U would see from this viewing distance D may be trimmed from the video captured by the camera 10.
  • A virtual display plane may then be arranged, based on the trimmed video, so that the performer P with height Hp is positioned at the viewing distance D, and the size of the performer P on the virtual display plane may be adjusted.
  • The present invention is not limited to the above-described embodiments and can be variously modified at the implementation stage without departing from the gist of the invention. The embodiments may also be combined as appropriate, in which case combined effects are obtained. Furthermore, the above embodiments include various inventions, and these can be extracted by combinations selected from the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in an embodiment, a configuration with those constituent elements deleted can be extracted as an invention as long as the problem can be solved and the effects can be obtained.

Abstract

A video processing device (20) comprises a reception unit (2011), a distance estimation unit (2012), a video processing unit (2013), and a transmission unit (2015). The reception unit receives a video. The distance estimation unit estimates a viewing distance under the assumption that a person was viewing scenery in a range equivalent to that of the video. The video processing unit processes the video so that the size of a reference object appearing in the video at the viewing distance matches the actual size of the reference object. The transmission unit transmits the processed video to a display device.

Description

VIDEO PROCESSING DEVICE, VIDEO PROCESSING METHOD, AND VIDEO PROCESSING PROGRAM
 The embodiments relate to a video processing device, a video processing method, and a video processing program.
 Users may find it difficult to feel a sense of unity with the venue when watching events such as music and theater, or watching sports, through remote-venue video displayed on a display such as a television. One reason for this is, for example, that the presence of spectators around the user cannot be felt.
 In response, Non-Patent Document 1 and Non-Patent Document 2 propose technologies that make the user feel the presence of others by presenting the appearance and remarks of the audience on the screen.
 Another reason why it is difficult for users to feel a sense of unity with the venue is that video is displayed on a display such as a television with a sense of distance different from reality. As a result, even if the appearance of the audience is presented, the user feels a sense of incongruity and finds it difficult to feel immersed, as if present at the actual venue.
 The embodiments provide a video processing device, a video processing method, and a video processing program that allow a user, when watching events such as music and theater or watching sports through video at a remote location, to feel as if watching at the actual venue.
 The video processing device of an embodiment has a receiving unit, a distance estimation unit, a video processing unit, and a transmission unit. The receiving unit receives video. The distance estimation unit estimates a viewing distance under the assumption that a person viewed scenery spanning the same range as the video. The video processing unit processes the video so that the size of a reference object appearing in the video at the viewing distance matches the actual size of the reference object. The transmission unit transmits the processed video to a display.
 According to the embodiments, a video processing device, a video processing method, and a video processing program are provided that allow a user, when watching events such as music and theater or watching sports through video at a remote location, to feel as if watching at the actual venue.
 FIG. 1 is a diagram showing a schematic configuration of a video distribution system including a video processing device according to an embodiment. FIG. 2 is a diagram showing an example of the hardware configuration of the video processing device. FIG. 3 is a functional block diagram of the video processing device. FIG. 4 is a flowchart showing the operation of the video processing device. FIG. 5 is a diagram showing the relationship between the height of the performer and the shooting range of the camera in the height direction. FIG. 6 is a diagram showing the distance relationship between the user and the performer when it is assumed that the user is watching the performer at the venue. FIG. 7A is a diagram showing a schematic configuration of a video distribution system using a transmissive wearable display. FIG. 7B is a diagram showing a schematic configuration of a video distribution system using a transmissive wearable display.
 Embodiments are described below with reference to the drawings. FIG. 1 is a diagram showing a schematic configuration of a video distribution system including a video processing device according to an embodiment. The video distribution system 1 has a camera 10, a video processing device 20, and a display 30.
 The camera 10 is installed, for example, in the audience seats AS at a venue where an event such as music or theater is held. Specifically, the audience seats AS are arranged to face the stage S, for example, and a space for installing the camera 10 is provided in a part of the audience seats AS. The camera 10 may be configured to be movable or may be fixed in position. The camera 10 shoots the performer P at a predetermined frame rate and generates video data of the performer P. The video data may be resized according to the size of the video that the display 30 can display, or the like; for example, if that size is full HD (High Definition), the video data may be resized to 1920 × 1080 pixels. The camera 10 is also communicatively connected to the video processing device 20, and the video data captured by the camera 10 is transmitted to the video processing device 20.
 Here, the performer P in the embodiments is a general term for a person who performs some activity in an event, such as a musician in a music event or an actor in a theater event. In the embodiments, the performer P is not limited to a person performing a specific expressive activity.
 Also, in FIG. 1 the number of cameras 10 is one. However, the number of cameras 10 is not limited to one. For example, cameras 10 may be installed at multiple positions within the venue.
 The video processing device 20 processes the video transmitted from the camera 10. For example, the video processing device 20 processes the video so that the distance from the user U to a virtual display surface arranged in the three-dimensional space perceived through the display 30, and the height of the performer P on that virtual display surface, become the distance and height the user would perceive if actually watching the performer P at the venue. The video processing device 20 then transmits the processed video to the display 30. The video processing device 20 may be installed inside or outside the venue. Inside the venue, the video processing device 20 may be included in the camera 10; outside the venue, it may be included in the display 30. Of course, the video processing device 20 may also be separate from the camera 10 and the display 30. Details of the video processing device 20 are described later.
 The display 30 is configured to communicate with the video processing device 20 and displays the video transmitted from the video processing device 20. The display 30 is, for example, a non-transmissive glasses-type wearable display worn on the head of the user U, who is at a location remote from the venue. The display 30 is configured for three-dimensional display. For example, the display 30 has a display unit at each eye position and displays, on each display unit, video obtained by synthesizing the video of the camera 10 transmitted from the video processing device 20 with avatar video of the audience, whereby the user U perceives the virtual image Pi of the performer P and the virtual image Ai of the audience avatar at a position at a predetermined viewing distance from the user U. Here, an avatar is an image imitating a spectator and may be, for example, a two-dimensional or three-dimensional illustration representing a spectator, or a CG image based on live footage. The configuration for three-dimensional display by the display 30 is not limited to any specific configuration. The location of the user U may be any place remote from the venue, such as the user U's home or a public viewing venue provided separately from the event venue.
 In addition to the camera 10, a camera 40 and a large service monitor 50 may be arranged at the venue. The camera 40 is communicatively connected to the large service monitor 50. The camera 40 is arranged, for example, on the stage S and is configured to photograph the performer P on the stage S. The number of cameras 40 is not limited to one; for example, a camera 40 may also be installed in the audience seats AS. The large service monitor 50 is installed, for example, on the stage S so as to face the audience seats AS, and displays the video captured by the camera 40 on a large screen.
 In the video distribution system 1 shown in FIG. 1, for example, a spectator A1 sitting in the audience seats AS near the stage S and a spectator A2 sitting in the audience seats AS facing the performer P can see the performer P on the stage S with the naked eye. However, the spectator A1 near the stage S and the spectator A2 far from the stage S perceive different scenery. Specifically, the performer P appears smaller in the scenery v2 perceived by the spectator A2 than in the scenery v1 perceived by the spectator A1. A spectator A3 sitting in the audience seats AS far from the stage S sees the performer P via the image on the large service monitor 50, and therefore perceives scenery v3 different from that of the spectators A1 and A2. In this way, people perceive different scenery depending on the viewing position even when watching the performer P at the same venue. This is one cause of the sense of incongruity the user U feels when viewing the video captured by the camera 10 at a remote location.
 FIG. 2 is a diagram showing an example of the hardware configuration of the video processing device 20. The video processing device 20 may be configured as a computer. The video processing device 20 need not be a single computer and may be composed of a plurality of computers. As shown in FIG. 2, the video processing device 20 has a processor 201, a ROM (Read Only Memory) 202, a RAM (Random Access Memory) 203, a storage 204, an input device 205, and a communication module 206. The video processing device 20 may further have a display or the like.
 The processor 201 is a processing circuit capable of executing various programs and controls the overall operation of the video processing device 20. The processor 201 may be a processor such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or a GPU (Graphics Processing Unit). The processor 201 may also be an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or the like. Furthermore, the processor 201 may be composed of a single CPU or the like, or of a plurality of CPUs or the like.
 The ROM 202 is a non-volatile semiconductor memory and holds programs and control data for controlling the video processing device 20.
 The RAM 203 is, for example, a volatile semiconductor memory and is used as a work area for the processor 201.
 The storage 204 is a non-volatile storage device such as a hard disk drive (HDD) or a solid state drive (SSD). The storage 204 holds a program 2041, variables 2042, and spectator data 2043.
 The program 2041 is a program for processing the video of the camera 10. The program 2041 causes the processor 201 to execute a process of estimating the shooting range from the video of the camera 10, a process of estimating a virtual viewing distance under the assumption that a person viewed scenery spanning the same range as the estimated shooting range, and a process of processing the video based on the estimated viewing distance.
 The variables 2042 are various variables used for video processing. In the embodiment, the variables 2042 include the length of the reference object, the vertical size of the image sensor, the focal length, and the number of vertical pixels of the video.
 The length of the reference object is the actual length of the reference object used to estimate the shooting range. The reference object may be any object of known length that can appear in the video captured by the camera 10. For example, the performer P may serve as the reference object, in which case the length of the reference object may be the performer P's height. The height of the performer P may be input to the video processing device 20 by the event organizer or the like before the event is held.
 The vertical size of the image sensor is the vertical size when the retina of the human eye is regarded as the image sensor of a camera. For example, if the retina of the human eye is equated with a full-size image sensor, the vertical size of the image sensor is 24 mm. If the retinal function of the human eye is equated with an APS-C size image sensor, the vertical size of the image sensor is 16.7 mm.
 The focal length is the value of the focal length when the lens of the human eye is regarded as a camera lens. In general, when the human eye is regarded as a camera consisting of a lens and a full-size image sensor, its focal length is said to be equivalent to 10 to 12 mm. In reality, however, humans cannot process all the light entering the eye from a field of view equivalent to a focal length of 10 to 12 mm; they process only the light entering from a partial range of the field of view, said to correspond to a focal length of about 50 mm. In the embodiment, the storage 204 therefore holds 50 mm as the focal length value. If the human eye is instead regarded as a camera consisting of a lens and an APS-C size image sensor, the focal length is about 35 mm, and in this case the storage 204 holds 35 mm as the focal length value.
 The number of vertical pixels of the video is the number of vertical pixels of the video after resizing by the camera 10. For example, if the video is resized to full HD (High Definition) size, the number of vertical pixels is 1080.
 The storage 204 also holds the spectator data 2043. The spectator data 2043 includes video data representing spectators, for example video data of spectator avatars. The video data representing spectators may also be video data of actual spectators captured in advance. Furthermore, the spectator data 2043 may include, as metadata, biometric information such as the heartbeats, motion states, and emotions of spectators watching the event at the venue. Such metadata is collected sequentially at the venue while the event is in progress and transmitted to the video processing device 20.
 The input device 205 is an interface device for the administrator of the video processing device 20 to operate the video processing device 20. The input device 205 can include, for example, a touch panel, a keyboard, a mouse, various operation buttons, and various operation switches. The input device 205 may be used, for example, to input the variables 2042.
 The communication module 206 is a module including circuits used for communication between the video processing device 20 and other devices. The communication module 206 may be, for example, a communication module conforming to the wired LAN standard, or one conforming to the wireless LAN standard.
 図3は、映像処理装置20の機能ブロック図である。図3に示すように、映像処理装置20は、受信部2011と、距離推定部2012と、映像加工部2013と、3次元処理部2014と、送信部2015とを有している。映像処理装置20のプロセッサ201は、プログラム2041を実行することによって、受信部2011と、距離推定部2012と、映像加工部2013と、3次元処理部2014と、送信部2015として動作し得る。受信部2011と、距離推定部2012と、映像加工部2013と、3次元処理部2014と、送信部2015とは、プロセッサ201とは別のハードウェアによって実現されてもよい。 FIG. 3 is a functional block diagram of the video processing device 20. As shown in FIG. As shown in FIG. 3 , the image processing device 20 has a receiving section 2011 , a distance estimating section 2012 , an image processing section 2013 , a three-dimensional processing section 2014 and a transmitting section 2015 . By executing the program 2041, the processor 201 of the video processing device 20 can operate as a reception unit 2011, a distance estimation unit 2012, an image processing unit 2013, a three-dimensional processing unit 2014, and a transmission unit 2015. The receiving unit 2011 , the distance estimating unit 2012 , the image processing unit 2013 , the three-dimensional processing unit 2014 and the transmitting unit 2015 may be realized by hardware different from the processor 201 .
 受信部2011は、カメラ10から通信モジュール206を介して受信された映像を取得し、取得した映像をフレームの単位に分解する。そして、受信部2011は、フレーム単位の映像を逐次に距離推定部2012と映像加工部2013とに転送する。例えば、映像のフレームレートが60fps(frame per second)であるとき、受信部2011は、映像を60フレームに分解する。 The receiving unit 2011 acquires the video received from the camera 10 via the communication module 206, and decomposes the acquired video into frames. Then, the receiving unit 2011 sequentially transfers the images in units of frames to the distance estimation unit 2012 and the image processing unit 2013 . For example, when the video frame rate is 60 fps (frame per second), the receiving unit 2011 decomposes the video into 60 frames.
 距離推定部2012は、受信部2011から転送されてきた映像から、カメラ10で撮影された映像と同等の範囲の景色をユーザUが会場で見ていたと仮定したときのユーザUとパフォーマーPとの仮想的な視聴距離を推定する。距離推定部2012の動作については後で詳しく説明する。 The distance estimating unit 2012 uses the image transferred from the receiving unit 2011 to determine the distance between the user U and the performer P when it is assumed that the user U was watching the scenery in the same range as the image captured by the camera 10 at the venue. Estimate virtual viewing distance. The operation of distance estimation section 2012 will be described later in detail.
 The image processing unit 2013 processes the video captured by the camera 10 based on the viewing distance estimated by the distance estimation unit 2012 and the height of the performer P. For example, the image processing unit 2013 enlarges or reduces the region of the performer P in the video so that the height of the performer P in the three-dimensional space perceived by the user U through the three-dimensional display matches the actual height of the performer P. The operation of the image processing unit 2013 is described in detail later.
 The three-dimensional processing unit 2014 performs rendering for three-dimensional display on the display 30. For example, the three-dimensional processing unit 2014 places the video plane of the performer P at the viewing distance estimated by the distance estimation unit 2012 in the three-dimensional space perceived by the user U. Based on that viewing distance, it also places video representing spectators in front of the video plane of the performer P. The three-dimensional processing unit 2014 then renders the three-dimensional video data, taking into account reflections and the like from virtual light sources in the three-dimensional space. It acquires the right-eye video and the left-eye video obtained when the three-dimensional space containing the video plane of the performer P and the video plane of the spectators is captured from the position of the user U with a virtual stereo camera, and transfers the three-dimensional video data containing the two videos to the transmission unit 2015. The operation of the three-dimensional processing unit 2014 is described in detail later.
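 For reference, the horizontal offset between the right-eye and left-eye images that makes a point appear at a given depth follows the standard pinhole stereo relation (disparity = focal length × baseline / depth). The sketch below is only an illustration under assumed numbers: the interpupillary distance and the focal length in pixels are not specified by the embodiment.

```python
def stereo_disparity_px(distance_m, ipd_m=0.064, focal_px=1000.0):
    """Pixel disparity between the right-eye and left-eye images for a point
    rendered at distance_m from the virtual stereo camera (pinhole model)."""
    return focal_px * ipd_m / distance_m

# For a viewing distance of 5.98 m (the worked example later in the text),
# the performer's video plane would carry about 10.7 px of disparity
# under these assumed parameters.
print(stereo_disparity_px(5.98))
```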
 The transmission unit 2015 transmits the three-dimensional video data sent from the three-dimensional processing unit 2014 to the display 30 via the communication module 206.
 Next, the operation of the video distribution system 1 according to the embodiment is described. FIG. 4 is a flowchart showing the operation of the video processing device 20. The processing of FIG. 4 is performed by the processor 201 at fixed intervals, for example from the start of an event until its end. This fixed interval is, for example, the time interval at which video data is transmitted from the camera 10.
 In step S1, the processor 201 acquires the variables 2042 from the storage 204. As described above, the variables 2042 include the length of the reference object, the focal length, the vertical size of the image sensor, and the number of vertical pixels of the video. In the following example, the length of the reference object is the height of the performer P.
 In step S2, the processor 201 acquires, for example, video data transmitted from the camera 10 and stored in the RAM 203, and decomposes the video into frames. The processor 201 also acquires the spectator data 2043 from the storage 204. In the following example, the video representing spectators in the spectator data 2043 is spectator avatar video.
 In step S3, the processor 201 performs distance estimation processing, described below. For the following explanation, the values of the variables 2042 are assumed to be:
   Performer P's height Hp = 1.7 (m)
   Focal length f = 50 (mm)
   Vertical size of the image sensor S = 24 (mm)
   Number of vertical pixels of the video h = 1080 (pixels)
 In the distance estimation processing, the processor 201 first performs object detection on each frame of the video. Object detection in the embodiment is processing that detects an object whose length in the video is known; here, the processor 201 detects the performer P. The detection may be based on object detection algorithms such as Mask R-CNN and YOLO. In the following, it is assumed that object detection estimates the number of vertical pixels hp of the region occupied by the performer P in the video to be 640 (pixels).
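 A minimal sketch of this detection step, assuming torchvision's Mask R-CNN implementation (one of the algorithms named above) is available; the confidence threshold and the helper name are illustrative assumptions, not part of the embodiment.

```python
import torch
import torchvision

# Pretrained COCO Mask R-CNN; whether "pretrained" or "weights" is accepted
# depends on the installed torchvision version (an assumption of this sketch).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def performer_pixel_height(frame):
    """Return hp, the vertical pixel extent of the most confident detected person.
    `frame` is a CxHxW float tensor with values in [0, 1]."""
    with torch.no_grad():
        det = model([frame])[0]  # detections are returned sorted by score
    for box, label, score in zip(det["boxes"], det["labels"], det["scores"]):
        if label.item() == 1 and score.item() > 0.8:  # COCO class 1 = person
            x1, y1, x2, y2 = box.tolist()
            return y2 - y1  # e.g. about 640 pixels in the worked example below
    raise RuntimeError("no person detected in this frame")
```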
 After the object detection, the processor 201 estimates the vertical shooting range H (m) of the camera 10 at the standing position of the performer P based on the height Hp of the performer P. FIG. 5 is a diagram showing the relationship between the height Hp of the performer P and the shooting range H. The ratio of the height Hp to the shooting range H equals the ratio of the number of vertical pixels hp of the performer P to the number of vertical pixels h of the video. Therefore, the following relationship holds:
   Hp/H = hp/h                (Equation 1)
Accordingly, H = (h/hp) × Hp = 2.87 (m).
 The shooting range H is the range captured by the camera 10, which has a certain focal length and is placed at a certain distance from the performer P. Therefore, as shown in FIG. 6, when the eye of the user U is modelled as a camera having a lens of focal length f and an image sensor of vertical size S, the viewing distance D at which the user U would see scenery in the same range as the shooting range H is calculated from the following:
   S/f = H/D                  (Equation 2)
Accordingly, D = H × f/S = 5.98 (m). This completes the distance estimation processing, after which the processor 201 moves to step S4.
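 Putting (Equation 1) and (Equation 2) together, the distance estimation reduces to two lines of arithmetic. A sketch with the example values from the text (the function name is illustrative):

```python
def estimate_viewing_distance(Hp, hp, h, f, S):
    """Viewing distance D (m) from (Equation 1) and (Equation 2).
    Hp: reference height (m); hp: its height in pixels; h: frame height
    in pixels; f: focal length (mm); S: sensor vertical size (mm)."""
    H = (h / hp) * Hp  # shooting range at the performer's position, (Equation 1)
    return H * f / S   # viewer's eye modelled as a camera, (Equation 2)

# Example values from the text: H ≈ 2.87 (m), D ≈ 5.98 (m)
print(estimate_viewing_distance(Hp=1.7, hp=640, h=1080, f=50, S=24))
```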
 Returning to FIG. 4: in step S4, the processor 201 processes the video. That is, the processor 201 enlarges or reduces the video so that the height of the performer P on the virtual display plane perceived by the user U through the three-dimensional display becomes the actual height Hp.
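 A sketch of one possible scaling computation, assuming the renderer exposes the current physical height of the virtual display plane in metres; that interface and the function name are assumptions of this sketch, not part of the embodiment.

```python
def height_scale_factor(Hp, hp, h, plane_height_m):
    """Factor by which to enlarge or reduce the video so that the performer's
    rendered height on the virtual display plane equals the real height Hp.
    plane_height_m: current physical height of the virtual display plane (m)."""
    rendered_height_m = (hp / h) * plane_height_m  # performer's height as rendered now
    return Hp / rendered_height_m  # > 1 enlarges, < 1 reduces
```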
 In step S5, the processor 201 generates the three-dimensional video. Specifically, the processor 201 places the video plane of the performer P at the viewing distance D from the user U and places the video plane of the spectator avatars between the user U and the video plane of the performer P. The processor 201 then acquires the right-eye video and the left-eye video obtained when this three-dimensional space is captured from the position of the user U with a virtual stereo camera. Here, the number and size of the spectator avatars in the avatar video superimposed on the video of the performer P may be determined by the viewing distance D. Specifically, the number of spectator avatars decreases as the viewing distance D becomes shorter, reaching zero when the viewing distance D is at or below a threshold. This reproduces the fact that, at an event in a real venue, the closer a seat is to the stage S, the fewer spectators sit in front of it. The threshold of the viewing distance D can therefore be set, for example, to the distance between the front row of seats and the stage S at the venue. Also, a spectator avatar is rendered larger the closer it is to the user U in the three-dimensional space perceived through the three-dimensional display, and smaller the farther away it is, reproducing the fact that nearer spectators look larger at a real venue. Furthermore, in step S5, the processor 201 uses existing gloss reproduction techniques to calculate how the performer P and the avatars appear at the viewing distance D, taking into account reflection, scattering, and the like from the virtual light sources illuminating each of them. The processor 201 may also apply further processing, such as animating the avatars according to the spectators' biometric data.
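 The text specifies only that the avatar count decreases monotonically with D and reaches zero at the threshold; the exact law is not given. The sketch below therefore assumes a simple proportional rule for illustration, with hypothetical parameter names.

```python
def avatar_layout(D, D_front_row, n_max=50):
    """Positions and relative sizes of spectator avatars for viewing distance D.
    D_front_row: stage-to-front-row distance used as the threshold (assumed)."""
    if D <= D_front_row:
        return []  # at or inside the front row: no spectators in front of the user
    count = round(n_max * (D - D_front_row) / D)  # fewer avatars as D shrinks (assumed law)
    layout = []
    for i in range(count):
        z = (i + 1) * D / (count + 1)  # spread avatars between the user and the stage plane
        layout.append((z, 1.0 / z))    # nearer avatars are rendered larger (perspective)
    return layout
```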
 In step S6, the processor 201 transmits the three-dimensional video data containing the right-eye video and the left-eye video to the display 30 via the communication module 206. Based on the received data, the display 30 shows the right-eye video and the left-eye video on the respective display units for the two eyes. The user U thereby perceives the video of the performer P, with height Hp, at the viewing distance D.
 In step S7, the processor 201 determines whether to end the processing. For example, the processing is determined to end when the event is over and communication with the camera 10 has been disconnected. If it is not determined in step S7 that the processing should end, the processing returns to step S2; otherwise, the processor 201 ends the processing of FIG. 4.
 As described above, according to the embodiment, the shooting range of the camera 10 at the standing position of the performer P is estimated from the video captured by the camera 10 placed at the venue and from the height of the performer P. Based on the estimated shooting range, the viewing distance D at which the user U would see scenery in a range equivalent to that shooting range is then estimated. The virtual display plane is arranged so that the performer P, with height Hp, is positioned at the viewing distance D in the three-dimensional space perceived by the user U through the three-dimensional display, and the size of the performer P on the virtual display plane is adjusted accordingly. The three-dimensional video displayed in this way allows the user U to feel as if watching the performer P at the venue.
 Furthermore, by superimposing spectator avatar video on the video of the performer P according to the viewing distance D, the user U can feel even more strongly as if actually present at the venue.
 [Modification 1]
 Modifications of the embodiment are described below. In the embodiment described above, the display 30 is assumed to be a non-transmissive wearable display. However, the display 30 may be a transmissive wearable display, and either a video see-through or an optical see-through type may be used. As methods of projecting video to the user U with a transmissive wearable display, the virtual image projection method and the retinal projection method are known; a wearable display of either type may be used in the embodiment.
 FIGS. 7A and 7B are diagrams showing schematic configurations of the video distribution system 1 using a transmissive wearable display. FIGS. 7A and 7B show only the configuration at the remote location; the rest is the same as in FIG. 1.
 FIG. 7A shows a first example. In the first example, the user U wears the display 30a, a transmissive wearable display, on the head, and there is a wall W in front of the user U. With a transmissive wearable display, the user U perceives, in addition to the virtual image Pi of the performer P and the virtual image Ai of the spectator avatars based on the display of the binocular display units, real images obtained from the outside world. Therefore, if, for example, the position at the viewing distance D from the user U lies beyond the wall W, the user U perceives the virtual image Pi of the performer P and the virtual image Ai of the spectator avatars through the wall. That is, the user U gets the feeling that the wall W is transparent and that space extends beyond it. By perceiving the virtual images Pi and Ai through the wall in this way, the user U can get a stronger sense of reality than by perceiving them at an empty position in mid-air.
 In the first example, the user U perceives the virtual image Pi of the performer P and the virtual image Ai of the spectator avatars beyond the wall W. In this case, if the wall W has pronounced unevenness, patterns, or the like, the user U perceives the virtual images Pi and Ai together with them, which impairs the sense of reality. For this reason, the wall W is desirably flat and plain, and for the same reason, no unnecessary objects are desirably placed in front of it.
 FIG. 7B shows a second example. In the second example, the user U wears the display 30a, a transmissive wearable display, on the head, and in front of the user U there is a monitor M whose power is turned off. Accordingly, if the position at the viewing distance D from the user U lies beyond the screen of the monitor M, the user U perceives the virtual image Pi of the performer P and the virtual image Ai of the spectator avatars through the screen of the monitor M. That is, the user U feels as if space extends beyond the frame of the monitor M. By perceiving the virtual images Pi and Ai through the monitor in this way, the user U can get a stronger sense of reality than by perceiving them at an empty position in mid-air. Moreover, in the second example, even though the monitor M is off, the user U can get the feeling that video is also shown on the monitor M.
 In the second example, the power of the monitor M is turned off. This is because, if video were displayed on the monitor M, the user would perceive that video together with the virtual image Pi of the performer P and the virtual image Ai of the spectator avatars. On the other hand, information related to the venue may be displayed on the monitor M.
 [Modification 2]
 In the embodiment described above, processing is performed so that the height of the performer P is reproduced within the three-dimensional space. Alternatively, processing may be performed so that the size of the large service monitor 50 is reproduced within the three-dimensional space. In this case, the processor 201 acquires video from the camera 40. As in the embodiment described above, the size of the virtual display plane is determined based on the height of the performer P in the video, but in Modification 2 the processor 201 determines the size of the virtual display plane so that it matches the size of the large service monitor 50 in the three-dimensional space. Furthermore, in Modification 2, the processor 201 places the video plane of the large service monitor 50 at a sufficiently distant viewing position from the user U and places video representing spectators between the user U and that video plane. This creates a state in which the power and viewability of the video are preserved while only the distance from the user U is increased, with spectators present between the video plane and the user.
 [Modification 3]
 In the embodiment described above, the video processing device 20 processes video captured in real time by the camera 10. Alternatively, the processing by the video processing device 20 may be performed when video previously recorded on a recording medium is played back. The recorded video is not necessarily limited to video captured by the camera 10; it may be, for example, CG. In Modification 3, the video recorded on the recording medium is first sent to the video processing device 20, whose processor 201 performs the processing shown in FIG. 4 and then transmits the three-dimensional video data to the display 30. Note that the spectators' biometric data used as the spectator data 2043 is data collected when the video was recorded, and the processor 201 updates the avatar video using the biometric data in synchronization with the timing of the video being played back.
 [Modification 4]
 In the embodiment described above, the display 30 is assumed to be capable of three-dimensional display. Alternatively, the display 30 may be a display incapable of three-dimensional display. In this case, the processor 201 need not change the size of the performer P when processing the video, although, if the screen of the display 30 is sufficiently large, it may enlarge or reduce the size of the performer P in the video so that the performer's height is reproduced. The processor 201 does, however, enlarge or reduce the size of each avatar to the size at which it would be perceived at the viewing distance D from the user U.
 [Modification 5]
 The embodiment described above shows application examples for events such as music and theater. The embodiment can also be applied to events such as watching sports. In this case, the reference object need not be a person. For example, at an event such as a soccer match, the reference object may be a person such as a player or an object such as a soccer ball. When the reference object is an object such as a soccer ball, the video may be enlarged or reduced with that object as the reference. Depending on the reference object, its horizontal length may be used instead of its vertical length.
 [Modification 6]
 In the embodiment described above, the shooting range of the camera 10 is estimated based on the length of the reference object. This estimation is particularly effective when the camera 10 is configured to be movable, so that the shooting range can change over time. When the camera 10 is fixed, on the other hand, the shooting range of the camera 10 may be estimated from information such as the focal length and shooting distance of the camera 10.
 [Modification 7]
 In the embodiment described above, the sense of distance and size that the user U would experience when viewing scenery in the same range as the shooting range of the camera 10 is reproduced in the three-dimensional video the user perceives. Alternatively, the viewing distance D may be set to the distance from a specific audience seat to the performer P. A video covering the shooting range H equivalent to what the user U would see from this viewing distance D is then cropped from the video captured by the camera 10, the virtual display plane is arranged based on the cropped video so that the performer P, with height Hp, is positioned at the viewing distance D, and the size of the performer P on the virtual display plane is adjusted. In this case, it is desirable that the camera 10 capture a wide area at high resolution so that the video can be cropped at various regions. This allows the user U to feel as if watching the performer P from a specific seat at the venue.
 The present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from its gist. The embodiments may also be combined as appropriate, in which case combined effects are obtained. Furthermore, the above embodiment includes various inventions, and various inventions can be extracted by combinations selected from the plurality of constituent elements disclosed. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, a configuration from which those constituent elements are deleted can be extracted as an invention as long as the problem can be solved and the effects can be obtained.
 DESCRIPTION OF SYMBOLS
 1... Video distribution system
 10... Camera
 20... Video processing device
 30... Display
 40... Camera
 50... Large service monitor
 201... Processor
 202... ROM
 203... RAM
 204... Storage
 205... Input device
 206... Communication module
 2011... Receiving unit
 2012... Distance estimation unit
 2013... Image processing unit
 2014... Three-dimensional processing unit
 2015... Transmission unit
 2041... Program
 2042... Variables
 2043... Spectator data

Claims (7)

  1.  A video processing device comprising:
      a receiving unit that receives video;
      a distance estimation unit that estimates a viewing distance when it is assumed that a human viewed scenery in the same range as the video;
      an image processing unit that processes the video so that the size of a reference object appearing in the video at the viewing distance matches the actual size of the reference object; and
      a transmission unit that transmits the processed video to a display.
  2.  The video processing device according to claim 1, further comprising a three-dimensional processing unit that processes the video so that a virtual display plane on which a user perceives the video in a three-dimensional space is placed at the viewing distance from the user,
      wherein the image processing unit enlarges or reduces the video so that the size of the reference object on the virtual display plane matches its actual size.
  3.  The video processing device according to claim 2, wherein the three-dimensional processing unit processes the video so that video representing spectators is placed at a distance from the user shorter than the viewing distance.
  4.  The video processing device according to any one of claims 1 to 3, wherein the distance estimation unit estimates the actual size of the range captured in the video from the size of the reference object appearing in the video, and estimates the viewing distance based on the actual size of the captured range and on the focal length of a lens and the size of an image sensor when the human is regarded as a camera having the lens and the image sensor.
  5.  The video processing device according to any one of claims 1 to 4, wherein the reference object is a human, and the size of the reference object is the human's height.
  6.  A video processing method executed by a video processing device, the method comprising:
      receiving video by a receiving unit of the video processing device;
      estimating, by a distance estimation unit of the video processing device, a viewing distance when it is assumed that a human viewed scenery in the same range as the video;
      processing the video, by an image processing unit of the video processing device, so that the size of a reference object appearing in the video at the viewing distance matches the actual size of the reference object; and
      transmitting the processed video to a display by a transmission unit of the video processing device.
  7.  A video processing program for causing a computer to function as the receiving unit, the distance estimation unit, the image processing unit, and the transmission unit of the video processing device according to any one of claims 1 to 5.
PCT/JP2021/022173 2021-06-10 2021-06-10 Video processing device, video processing method, and video processing program WO2022259480A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/022173 WO2022259480A1 (en) 2021-06-10 2021-06-10 Video processing device, video processing method, and video processing program
JP2023526771A JPWO2022259480A1 (en) 2021-06-10 2021-06-10

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/022173 WO2022259480A1 (en) 2021-06-10 2021-06-10 Video processing device, video processing method, and video processing program

Publications (1)

Publication Number Publication Date
WO2022259480A1

Family

ID=84425786

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/022173 WO2022259480A1 (en) 2021-06-10 2021-06-10 Video processing device, video processing method, and video processing program

Country Status (2)

Country Link
JP (1) JPWO2022259480A1 (en)
WO (1) WO2022259480A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009200697A (en) * 2008-02-20 2009-09-03 Sony Corp Image transmitter, field angle control method, image receiver, image display system, and image display method
JP2010171695A (en) * 2009-01-22 2010-08-05 Nippon Telegr & Teleph Corp <Ntt> Television conference device and displaying/imaging method
JP2011248655A (en) * 2010-05-27 2011-12-08 Ntt Comware Corp User viewpoint spatial image provision device, user viewpoint spatial image provision method, and program
JP2021033354A (en) * 2019-08-14 2021-03-01 キヤノン株式会社 Communication device and control method therefor


Also Published As

Publication number Publication date
JPWO2022259480A1 (en) 2022-12-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21945154

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023526771

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE