WO2023248678A1

WO2023248678A1 - Information processing device, information processing method, and information processing system

Info

Publication number: WO2023248678A1
Application number: PCT/JP2023/019085
Authority: WO
Inventors: 俊也浜田
Original assignee: ソニーグループ株式会社
Priority date: 2022-06-24
Filing date: 2023-05-23
Publication date: 2023-12-28

Abstract

An information processing device according to one embodiment of the present technology comprises a generation unit. The generation unit generates subtitle object data, which defines a three-dimensional subtitle object in a three-dimensional space, on the basis of: three-dimensional space description data, which is included in three-dimensional space data used for a rendering process executed to represent the three-dimensional space and which defines the configuration of the three-dimensional space; video object data, which defines a three-dimensional video object in the three-dimensional space; and audio object data, which defines a three-dimensional audio object in the three-dimensional space.

Description

Information processing device, information processing method, and information processing system

The present technology relates to an information processing device, an information processing method, and an information processing system that can be applied to distribution of VR (Virtual Reality) images, etc.

In recent years, omnidirectional videos captured with omnidirectional cameras and the like, which allow viewers to look around in all directions, have been distributed as VR videos. More recently, technology has been under development for distributing 6DoF (Degrees of Freedom) video (also referred to as 6DoF content), in which viewers (users) can look around in all directions (freely select the line of sight) and move freely in three-dimensional space (freely select the viewpoint position).

Patent Document 1 discloses a technology that can more appropriately generate guide audio regarding the distribution of 6DoF content.

Patent Document 1: International Publication No. 2021/241190

The distribution of virtual images such as VR images is expected to become widespread, and there is a need for technology that makes it possible to realize high-quality virtual images.

In view of the above circumstances, the purpose of the present technology is to provide an information processing device, an information processing method, and an information processing system that can realize high-quality virtual images.

In order to achieve the above object, an information processing device according to an embodiment of the present technology includes a generation unit.
The generation unit generates subtitle object data that defines a three-dimensional subtitle object in a three-dimensional space, based on three-dimensional space description data that defines the configuration of the three-dimensional space and is included in three-dimensional space data used in rendering processing performed to express the three-dimensional space, video object data that defines a three-dimensional video object in the three-dimensional space, and audio object data that defines a three-dimensional audio object in the three-dimensional space.

In this information processing device, subtitle object data is generated based on three-dimensional spatial description data, video object data, and audio object data. This makes it possible to realize high-quality virtual images.

The subtitle object data may include a subtitle sentence and attribute information of the three-dimensional subtitle object.

The attribute information may include information on the position and orientation of the three-dimensional subtitle object in the three-dimensional space.

The attribute information may include information on at least one of the size, shape, color, transparency, display surface state, character size, character font, character color, character transparency, and effect of the three-dimensional subtitle object in the three-dimensional space.

The audio object data may include audio information. In this case, the generation unit may generate the subtitle sentence by performing voice recognition on the voice information.

The generation unit may generate the subtitle sentence by performing translation processing on the recognition result of the voice recognition.
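
The two steps above (voice recognition, then optional translation) can be sketched as a small pipeline. This is only an illustration: the recognizer and translator are passed in as plain callables, the names `make_subtitle_sentence`, `fake_recognize`, and `fake_translate` are hypothetical, and the stand-in functions are toy placeholders rather than a real speech-recognition or translation engine.

```python
from typing import Callable, Optional, Sequence


def make_subtitle_sentence(
    waveform: Sequence[float],
    recognize: Callable[[Sequence[float]], str],
    translate: Optional[Callable[[str], str]] = None,
) -> str:
    """Turn audio information into a subtitle sentence.

    recognize: hypothetical speech-recognition callable (waveform -> text).
    translate: optional hypothetical translation callable (text -> text).
    """
    text = recognize(waveform)
    if translate is not None:
        # Translation is applied to the recognition result, not to the audio.
        text = translate(text)
    return text


# Toy stand-ins (assumptions for illustration only).
def fake_recognize(waveform):
    return "ohayou" if len(waveform) > 0 else ""


def fake_translate(text):
    return {"ohayou": "Good morning"}.get(text, text)


sentence = make_subtitle_sentence([0.1, -0.2, 0.05], fake_recognize, fake_translate)
print(sentence)
```

Keeping the translator optional mirrors the text: the subtitle sentence may be the raw recognition result or its translation.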

The three-dimensional space description data may include position information of the three-dimensional audio object in the three-dimensional space and position information of the three-dimensional video object in the three-dimensional space. In this case, the generation unit may determine the three-dimensional video object corresponding to the three-dimensional audio object based on the position information of the three-dimensional video object and the position information of the three-dimensional audio object, and may generate the attribute information based on the determination result.

The generation unit may determine the three-dimensional video object corresponding to the subtitle sentence generated from the audio object data, based on position information of the three-dimensional video object and position information of the three-dimensional audio object, and may generate the attribute information based on the determination result.

The generation unit may determine the three-dimensional video object that has uttered the content of the subtitle sentence as the three-dimensional video object that corresponds to the subtitle sentence.
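
One simple way to make this determination is to pick the video object whose position is closest to the audio object. The text only states that the determination is based on the two sets of position information, so the nearest-distance rule, the function name `find_speaker`, and the object ids below are assumptions made for illustration.

```python
import math
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


def find_speaker(audio_pos: Vec3, video_positions: Dict[str, Vec3]) -> str:
    """Return the id of the video object closest to the audio object.

    Nearest-distance matching is an assumed rule, not one stated in the source.
    """
    return min(
        video_positions,
        key=lambda vid: math.dist(audio_pos, video_positions[vid]),
    )


# Hypothetical positions in the common (world) coordinate system.
video_positions = {
    "person_15a": (0.0, 0.0, 2.0),
    "person_15b": (1.5, 0.0, 2.0),
}
speaker = find_speaker((1.4, 1.6, 2.1), video_positions)
print(speaker)  # person_15b
```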

The generation unit may generate information on the position and orientation of the 3D subtitle object with reference to a 3D bounding box of the 3D video object corresponding to the subtitle sentence.
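
As a rough sketch of using the bounding box as a reference, the subtitle could be placed just above the top face of the speaker's box and turned to face the viewpoint. The placement rule, the `margin` parameter, the axis convention (y up), and the name `initial_subtitle_pose` are all assumptions; the source does not fix any of them here.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def initial_subtitle_pose(
    bbox_min: Vec3, bbox_max: Vec3, viewpoint: Vec3, margin: float = 0.2
):
    """Place the subtitle above the bounding box, facing the viewer.

    Returns (position, yaw_radians). The yaw is the rotation about the
    vertical (y) axis that turns the panel's +z normal toward the viewpoint.
    """
    cx = (bbox_min[0] + bbox_max[0]) / 2.0
    cz = (bbox_min[2] + bbox_max[2]) / 2.0
    position = (cx, bbox_max[1] + margin, cz)
    yaw = math.atan2(viewpoint[0] - cx, viewpoint[2] - cz)
    return position, yaw


position, yaw = initial_subtitle_pose(
    bbox_min=(-0.3, 0.0, 1.8),
    bbox_max=(0.3, 1.7, 2.2),
    viewpoint=(0.0, 1.6, 0.0),
)
print(position, math.degrees(yaw))
```

In this example the viewer stands at the origin and the speaker is two units ahead, so the panel is rotated half a turn to face back toward the viewer.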

The information processing device may further include a rendering unit that generates two-dimensional video data according to the user's visual field by performing rendering processing on the three-dimensional spatial data and the three-dimensional subtitle object based on visual field information regarding the user's visual field. In this case, the generation unit may control the display mode of the subtitle text in the two-dimensional video data by adjusting the attribute information.

The rendering unit may perform pre-rendering processing on the three-dimensional spatial data and the three-dimensional subtitle object. In this case, the generation unit may adjust the attribute information based on the result of the pre-rendering process.

The rendering unit may perform pre-rendering processing to determine the occurrence of occlusion. In this case, the generation unit may adjust the position and orientation of the three-dimensional subtitle object included in the attribute information based on the result of the pre-rendering process.
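
An occlusion check of this kind can be approximated by testing whether the line segment from the viewpoint to the subtitle passes through another object's axis-aligned bounding box. The sketch below uses the standard slab test; treating objects as boxes and the subtitle as a single point is a simplification assumed here, and the scene values are made up.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def segment_hits_box(start: Vec3, end: Vec3, box_min: Vec3, box_max: Vec3) -> bool:
    """Slab test: does the segment start->end pass through the axis-aligned box?"""
    t_near, t_far = 0.0, 1.0
    for axis in range(3):
        delta = end[axis] - start[axis]
        if abs(delta) < 1e-12:
            # Segment is parallel to this slab; reject if it lies outside it.
            if start[axis] < box_min[axis] or start[axis] > box_max[axis]:
                return False
            continue
        t0 = (box_min[axis] - start[axis]) / delta
        t1 = (box_max[axis] - start[axis]) / delta
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True


viewpoint = (0.0, 1.6, 0.0)
subtitle_pos = (0.0, 1.9, 4.0)
pillar = ((-0.5, 0.0, 1.5), (0.5, 3.0, 2.5))  # sits between viewer and subtitle
bench = ((2.0, 0.0, 1.5), (3.0, 0.5, 2.5))    # off to the side

occluded_by_pillar = segment_hits_box(viewpoint, subtitle_pos, *pillar)
occluded_by_bench = segment_hits_box(viewpoint, subtitle_pos, *bench)
print(occluded_by_pillar, occluded_by_bench)
```

When the test reports an occlusion, the position or orientation in the attribute information would be adjusted and the check repeated.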

The rendering unit may perform pre-rendering processing to determine the visibility of the subtitle text. In this case, the generation unit may adjust at least one of the color, transparency, display surface state, character size, character color, and character transparency of the 3D subtitle object included in the attribute information based on the result of the pre-rendering process.
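
The source does not say how visibility is measured. As one possible stand-in, the sketch below uses the WCAG contrast-ratio formula between the character color and the color behind the text, and switches the character color when the ratio is too low. The 4.5 threshold, the two candidate colors, and the name `pick_text_color` are assumptions for illustration.

```python
def relative_luminance(rgb):
    """WCAG relative luminance of an 8-bit sRGB color."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio, from 1 (identical) to 21 (black on white)."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def pick_text_color(background_rgb, min_ratio=4.5):
    """Keep white text if it is readable on the background, else use black."""
    white, black = (255, 255, 255), (0, 0, 0)
    if contrast_ratio(white, background_rgb) >= min_ratio:
        return white
    return black


print(pick_text_color((20, 20, 60)))     # dark background
print(pick_text_color((250, 250, 250)))  # bright background
```

Here `background_rgb` stands for the color sampled behind the subtitle text in the pre-rendered result.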

The audio object data may include audio information. In this case, the generation unit may generate the subtitle object data based on the audio object data when the output volume of the audio information is greater than a predetermined threshold.

The three-dimensional space description data may include information on the predetermined threshold value.
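
A minimal sketch of this gating, assuming the output volume is estimated as the RMS level of the waveform samples and the threshold is read from the scene description. The RMS measure, the key name `subtitle_volume_threshold`, and the default value are hypothetical; the source only states that a threshold exists and may be carried in the three-dimensional space description data.

```python
import math


def rms_level(samples):
    """Root-mean-square level of a block of waveform samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_generate_subtitle(samples, scene_description, default_threshold=0.05):
    # "subtitle_volume_threshold" is a hypothetical key, not a field
    # defined by the source or by any scene description format.
    threshold = scene_description.get("subtitle_volume_threshold", default_threshold)
    return rms_level(samples) > threshold


scene_description = {"subtitle_volume_threshold": 0.1}
speech = [0.5, -0.5, 0.5, -0.5]
murmur = [0.01, -0.01, 0.01, -0.01]

print(should_generate_subtitle(speech, scene_description))
print(should_generate_subtitle(murmur, scene_description))
```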

The generation unit may determine whether to end displaying the subtitle text based on the user's gaze point information.

An information processing method according to one embodiment of the present technology is an information processing method executed by a computer system, and includes generating subtitle object data that defines a three-dimensional subtitle object in a three-dimensional space, based on three-dimensional space description data that defines the configuration of the three-dimensional space, video object data that defines a three-dimensional video object in the three-dimensional space, and audio object data that defines a three-dimensional audio object in the three-dimensional space.

An information processing system according to one embodiment of the present technology includes the generation unit.

FIG. 1 is a schematic diagram showing a basic configuration example of a virtual space providing system.
FIG. 2 is a schematic diagram for explaining rendering processing.
FIG. 3 is a schematic diagram showing a configuration example of a client device for realizing automatic generation of subtitle sentences according to the present technology.
FIG. 4 is a schematic diagram showing an example of a rendered video generated by a client device.
FIG. 5 is a flowchart illustrating an example of display processing of subtitle objects according to utterances.
FIG. 6 is a schematic diagram for explaining an example of setting initial values of the position and orientation of a subtitle object in a virtual space.
FIG. 7 is a schematic diagram for explaining an example of setting initial values of the position and orientation of a subtitle object in a virtual space.
FIG. 8 is a schematic diagram for explaining an example of a method of changing the position and orientation of a subtitle object.
FIG. 9 is a schematic diagram illustrating a processing example of subtitle sentence display determination.
FIG. 10 is a flowchart illustrating an example of a process for determining whether to end displaying a subtitle sentence.
FIG. 11 is a schematic diagram showing another example of subtitle display in a rendered video representing a virtual space.
FIG. 12 is a schematic diagram for explaining a configuration example of a server-side rendering system.
FIG. 13 is a schematic diagram showing a basic configuration example of a remote communication system.
FIG. 14 is a block diagram illustrating an example of a hardware configuration of a computer (information processing device) that can implement a distribution server, a client device, and a rendering server.

Hereinafter, embodiments according to the present technology will be described with reference to the drawings.

[Virtual space providing system]
First, a basic configuration example and a basic operation example of a virtual space providing system to which the present technology can be applied will be described.
The virtual space providing system according to the present embodiment can provide free-viewpoint three-dimensional virtual space content that allows a virtual three-dimensional space (three-dimensional virtual space) to be viewed from a free viewpoint (six degrees of freedom). Such three-dimensional virtual space content is also called 6DoF content.

FIG. 1 is a schematic diagram showing a basic configuration example of a virtual space providing system.
FIG. 2 is a schematic diagram for explaining rendering processing.

The virtual space providing system 1 shown in FIG. 1 corresponds to an embodiment of an information processing system according to the present technology. Further, the virtual space S shown in FIG. 1 corresponds to an embodiment of a virtual three-dimensional space according to the present technology.

As shown in FIG. 1, the virtual space providing system 1 includes a distribution server 2, an HMD (Head Mounted Display) 3, and a client device 4.

The distribution server 2 and client device 4 are communicably connected via a network 5. The network 5 is constructed by, for example, the Internet or a wide area communication network. In addition, any WAN (Wide Area Network), LAN (Local Area Network), etc. may be used, and the protocol for constructing the network 5 is not limited.

The distribution server 2 and the client device 4 have hardware necessary for a computer, such as a processor such as a CPU, GPU, or DSP, memory such as a ROM or RAM, and a storage device such as an HDD (see FIG. 14). The information processing method according to the present technology is executed by the processor loading the program according to the present technology stored in the storage unit or memory into the RAM and executing the program.

For example, the distribution server 2 and the client device 4 can be realized by any computer such as a PC (Personal Computer). Of course, hardware such as FPGA or ASIC may also be used.

The HMD 3 and the client device 4 are connected to be able to communicate with each other. The communication form for communicably connecting both devices is not limited, and any communication technology may be used. For example, wireless network communication such as WiFi, short-range wireless communication such as Bluetooth (registered trademark), etc. can be used. Note that the HMD 3 and the client device 4 may be integrally configured. That is, the functions of the client device 4 may be installed in the HMD 3.

The distribution server 2 distributes three-dimensional spatial data to the client device 4. The three-dimensional space data is used in rendering processing performed to express the virtual space S (three-dimensional space). By performing rendering processing on the three-dimensional spatial data, a virtual image displayed by the HMD 3 is generated. Further, virtual audio is output from the headphones included in the HMD 3. The three-dimensional spatial data will be explained in detail later.

The HMD 3 is a device used to display virtual images of each scene configured in a three-dimensional space to the user 6, and to output virtual audio. The HMD 3 is used by being attached to the head of the user 6. For example, when a VR video is distributed as a virtual video, an immersive HMD 3 configured to cover the visual field of the user 6 is used. When an AR (Augmented Reality) video is distributed as a virtual video, AR glasses or the like are used as the HMD 3.

A device other than the HMD 3 may be used as a device for providing virtual images to the user 6. For example, a virtual image may be displayed on a display included in a television, a smartphone, a tablet terminal, a PC, or the like. Furthermore, the device capable of outputting virtual audio is not limited, and any type of speaker or the like may be used.

In this embodiment, a 6DoF video is provided as a VR video to a user 6 wearing an immersive HMD 3. The user 6 is able to view video in a 360° range of front and rear, left and right, and up and down directions within the virtual space S that is a three-dimensional space.

For example, the user 6 freely moves the position of the viewpoint, the line of sight direction, etc. within the virtual space S, and freely changes his/her field of view (field of view range). The virtual image displayed to the user 6 is switched in accordance with this change in the visual field of the user 6. By performing actions such as changing the direction of the face, tilting the face, and looking back, the user 6 can view the surroundings in the virtual space S with the same feeling as in the real world.

In this way, the virtual space providing system 1 according to the present embodiment makes it possible to distribute photorealistic free-viewpoint video, and to provide a viewing experience from a free viewpoint position.

As shown in FIG. 1, in this embodiment, visual field information is acquired by the HMD 3. The visual field information is information regarding the visual field of the user 6. Specifically, the visual field information includes any information that can specify the visual field of the user 6 within the virtual space S.

For example, the visual field information includes a viewpoint position, a gaze point, a central visual field, a viewing direction, a rotation angle of the viewing direction, and the like. Further, the visual field information includes the position of the head of the user 6, the rotation angle of the head of the user 6, and the like.

The rotation angle of the line of sight can be defined, for example, by a rotation angle whose rotation axis is an axis extending in the line of sight direction. Further, the rotation angle of the user 6's head can be defined by the roll angle, pitch angle, and yaw angle when the three mutually orthogonal axes set for the head are the roll axis, pitch axis, and yaw axis.

For example, let the axis extending in the front direction of the face be the roll axis. When the user 6's face is viewed from the front, an axis extending in the left-right direction is defined as a pitch axis, and an axis extending in the vertical direction is defined as a yaw axis. The roll angle, pitch angle, and yaw angle with respect to these roll, pitch, and yaw axes are calculated as the rotation angle of the head. Note that it is also possible to use the direction of the roll axis as the viewing direction.
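
As a small worked example of using the roll axis as the viewing direction, the sketch below converts yaw and pitch angles into a unit direction vector. The coordinate convention is an assumption: x points to the user's right (pitch axis), y points up (yaw axis), z points forward (roll axis), positive yaw turns toward +x, and positive pitch looks up. Roll is omitted because rotating about the viewing axis does not change its direction.

```python
import math


def viewing_direction(yaw_deg: float, pitch_deg: float):
    """Unit vector of the roll (forward) axis after applying yaw and pitch.

    Assumed convention: x = right, y = up, z = forward at zero rotation.
    """
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)
    return (
        math.sin(yaw) * math.cos(pitch),
        math.sin(pitch),
        math.cos(yaw) * math.cos(pitch),
    )


print(viewing_direction(0.0, 0.0))    # looking straight ahead along +z
print(viewing_direction(90.0, 0.0))   # turned toward +x
print(viewing_direction(0.0, 30.0))   # looking upward by 30 degrees
```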

In addition, any information that can specify the visual field of the user 6 may be used. As the visual field information, one piece of information exemplified above may be used, or a combination of a plurality of pieces of information may be used.

The method of acquiring visual field information is not limited. For example, it is possible to acquire visual field information based on a detection result (sensing result) by a sensor device (including a camera) provided in the HMD 3.

For example, the HMD 3 is provided with a camera or distance measuring sensor whose detection range is around the user 6, an inward camera capable of capturing images of the left and right eyes of the user 6, and the like. Further, the HMD 3 is provided with an IMU (Inertial Measurement Unit) sensor and a GPS. For example, it is possible to use the position information of the HMD 3 acquired by GPS as the viewpoint position of the user 6 or the position of the user 6's head. Of course, the positions of the left and right eyes of the user 6 may be calculated in more detail.

It is also possible to detect the line of sight direction from the captured images of the left and right eyes of the user 6. Furthermore, it is also possible to detect the rotation angle of the line of sight and the rotation angle of the head of the user 6 from the detection results of the IMU.

Furthermore, self-position estimation of the user 6 (HMD 3) may be performed based on the detection result by a sensor device included in the HMD 3. For example, by self-position estimation, it is possible to calculate position information of the HMD 3 and posture information such as which direction the HMD 3 is facing. It is possible to acquire visual field information from the position information and posture information.

The algorithm for estimating the self-position of the HMD 3 is also not limited, and any algorithm such as SLAM (Simultaneous Localization and Mapping) may be used. Further, head tracking that detects the movement of the head of the user 6, or eye tracking that detects the movement of the user's left and right gaze (movement of the gaze point) may be performed.

In addition, any device or any algorithm may be used to acquire visual field information. For example, in a case where a smartphone or the like is used as a device for displaying a virtual image to the user 6, the face (head), etc. of the user 6 may be imaged, and visual field information may be acquired based on the captured image. Alternatively, a device including a camera, an IMU, etc. may be attached to the head or around the eyes of the user 6.

Any machine learning algorithm using, for example, DNN (Deep Neural Network) may be used to generate the visual field information. For example, by using AI (artificial intelligence) that performs deep learning, it is possible to improve the accuracy of generating visual field information. Note that a machine learning algorithm may be applied to any processing within the present disclosure.

The client device 4 receives the three-dimensional spatial data transmitted from the distribution server 2 and the visual field information transmitted from the HMD 3. The client device 4 executes rendering processing on the three-dimensional spatial data based on the visual field information. As a result, two-dimensional video data (rendered video) corresponding to the visual field of the user 6 is generated.

In this embodiment, the client device 4 corresponds to an embodiment of an information processing device according to the present technology. The client device 4 executes an embodiment of the information processing method according to the present technology.

As shown in FIG. 2, the three-dimensional spatial data includes scene description information and three-dimensional object data. The scene description information is also called a scene description.
The scene description information corresponds to three-dimensional space description data that defines the configuration of a three-dimensional space (virtual space S). The scene description information includes various metadata for reproducing each scene of the 6DoF content.

The specific data structure (data format) of the scene description information is not limited, and any data structure may be used. For example, glTF (GL Transmission Format) can be used as the scene description information.
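
For illustration, the fragment below shows how object placement might look in a glTF 2.0 style scene description and how it could be read. Only the node hierarchy is shown; mesh, accessor, and buffer entries are omitted. The node names and the use of the `extras` field to tag a node as video or audio are assumptions made here, not a layout defined by the source.

```python
import json

# A glTF 2.0 style fragment. "asset", "scene", "scenes", "nodes",
# "translation", and "extras" are standard glTF properties; the node
# names and the "kind" tag inside "extras" are hypothetical.
scene_description = json.loads("""
{
  "asset": {"version": "2.0"},
  "scene": 0,
  "scenes": [{"nodes": [0, 1]}],
  "nodes": [
    {"name": "person_15a", "translation": [0.0, 0.0, 2.0], "extras": {"kind": "video"}},
    {"name": "voice_15a", "translation": [0.0, 1.6, 2.0], "extras": {"kind": "audio"}}
  ]
}
""")


def placements(gltf):
    """Map each root node of the default scene to its translation.

    A node without a "translation" entry is placed at the origin,
    which is the glTF default.
    """
    scene = gltf["scenes"][gltf.get("scene", 0)]
    return {
        gltf["nodes"][i]["name"]: tuple(
            gltf["nodes"][i].get("translation", [0.0, 0.0, 0.0])
        )
        for i in scene["nodes"]
    }


print(placements(scene_description))
```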

Three-dimensional object data is data that defines a three-dimensional object in a three-dimensional space. In other words, it is data of each object that constitutes each scene of the 6DoF content. In this embodiment, video object data and audio object data are distributed as three-dimensional object data.

The video object data is data that defines a 3D video object in a 3D space. A three-dimensional video object is composed of mesh (polygon mesh) data composed of geometry information and color information, and texture data pasted onto its surface. Alternatively, it is composed of point cloud data.
Geometry data (positions of meshes and point clouds) is expressed in a local coordinate system unique to that object. Object placement in the three-dimensional virtual space is specified by scene description information.
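
The relationship between the local coordinate system and the placement given by the scene description can be sketched as a simple transform. The example assumes the placement consists of a uniform scale, a rotation about the vertical (y) axis, and a translation; a real scene description may carry a full rotation or a matrix, and the function name `local_to_world` is hypothetical.

```python
import math


def local_to_world(point, translation, yaw_deg=0.0, scale=1.0):
    """Scale, rotate about the vertical (y) axis, then translate a local point."""
    x, y, z = (c * scale for c in point)
    yaw = math.radians(yaw_deg)
    xr = x * math.cos(yaw) + z * math.sin(yaw)
    zr = -x * math.sin(yaw) + z * math.cos(yaw)
    return (xr + translation[0], y + translation[1], zr + translation[2])


# A vertex one unit above the object's local origin, with the object
# placed at (2, 0, 3) in the virtual space.
print(local_to_world((0.0, 1.0, 0.0), translation=(2.0, 0.0, 3.0)))

# The same kind of transform applies to a point cloud: map every point.
cloud = [(0.0, 0.0, 1.0), (0.0, 0.5, 1.0)]
print([local_to_world(p, (0.0, 0.0, 0.0), yaw_deg=90.0) for p in cloud])
```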

For example, the video object data includes data on three-dimensional video objects such as people, animals, buildings, trees, etc. Alternatively, data of three-dimensional image objects such as the sky and the sea forming the background etc. is included. A plurality of types of objects may be collectively configured as one three-dimensional image object.

The audio object data is composed of position information of the sound source and waveform data obtained by sampling audio data for each sound source. The position information of the sound source is the position in the local coordinate system that is used as a reference by the three-dimensional audio object group, and the object arrangement on the three-dimensional virtual space S is specified by the scene description information.

As shown in FIG. 2, the client device 4 reproduces the three-dimensional space by arranging the three-dimensional video object and the three-dimensional audio object in the three-dimensional space based on the scene description information. Then, by cutting out the video seen by the user 6 using the reproduced three-dimensional space as a reference (rendering process), a rendered video that is a two-dimensional video that the user 6 views is generated. Note that the rendered video according to the visual field of the user 6 can also be said to be a video of a viewport (display area) according to the visual field of the user 6.

Further, the client device 4 controls the headphones of the HMD 3 so that the sound represented by the waveform data is output by the rendering process, with the position of the three-dimensional audio object as the sound source position. That is, the client device 4 generates audio information to be output from the headphones and output control information for specifying how the audio information is output.

The audio information is generated based on waveform data included in the three-dimensional audio object, for example. As the output control information, any information that defines the volume, sound localization (localization direction), etc. may be generated. For example, by controlling the localization of sound, it is also possible to realize audio output using stereophonic sound.
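
As an illustration of what output control information could contain, the sketch below derives a gain and a localization direction from the listener pose and the sound source position. The inverse-distance attenuation model, the `ref_distance` parameter, the axis convention (x right, y up, z forward), and the dictionary keys are all assumptions; the source leaves the form of the output control information open.

```python
import math


def output_control(listener_pos, listener_yaw_deg, source_pos, ref_distance=1.0):
    """Return an assumed form of output control info: gain and azimuth.

    gain: inverse-distance attenuation, clamped to 1.0 inside ref_distance.
    azimuth_deg: horizontal direction of the source relative to where the
    listener is facing, in the range [-180, 180), positive to the right.
    """
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    dz = source_pos[2] - listener_pos[2]
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    gain = min(1.0, ref_distance / max(distance, 1e-9))
    azimuth = math.degrees(math.atan2(dx, dz)) - listener_yaw_deg
    azimuth = (azimuth + 180.0) % 360.0 - 180.0
    return {"gain": gain, "azimuth_deg": azimuth}


# A source two units to the listener's right, listener facing +z.
print(output_control((0.0, 0.0, 0.0), 0.0, (2.0, 0.0, 0.0)))
```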

The rendered video, audio information, and output control information generated by the client device 4 are transmitted to the HMD 3. The HMD 3 displays the rendered video and outputs the audio information. This allows the user 6 to view the 6DoF content.

Hereinafter, a three-dimensional video object may be simply referred to as a video object. Similarly, a three-dimensional audio object may be simply referred to as an audio object.

[Study regarding display of subtitles in virtual space S]
As illustrated in FIGS. 1 and 2, in 6DoF video distribution that provides a viewing experience from any viewpoint, everything that appears in the video content is composed of 3D video objects such as meshes and point clouds, so that it can be viewed from any position. The data of each of these 3D video objects is distributed together with scene description information (a Scene Description file) that manages scene information such as where each object is placed in the virtual space S. The user 6 can freely move within the virtual space S and view the content from any desired position.

In such three-dimensional virtual space content, if the content creator does not prepare subtitles, it is difficult to display subtitles within the rendered video representing the virtual space S. By using the characteristics of the three-dimensional virtual space content and the characteristics of the viewing style (visual field information) described above, the present inventor has devised a new system that automatically generates subtitles and automatically determines their position.

[Generation of subtitle object data]
FIG. 3 is a schematic diagram showing a configuration example of the client device 4 for realizing automatic generation of subtitle sentences according to the present technology.
FIG. 4 is a schematic diagram showing an example of rendered video 8 generated by client device 4.

As shown in FIG. 3, the client device 4 includes a file acquisition section 9, a rendering section 10, a visual field information acquisition section 11, and a subtitle object generation section 12.
These functional blocks are realized by a processor such as a CPU executing a program according to the present technology, and the information processing method according to the present embodiment is executed. Note that dedicated hardware such as an IC (integrated circuit) may be used as appropriate to realize each functional block.

The file acquisition unit 9 acquires three-dimensional spatial data (scene description information and three-dimensional object data) distributed from the distribution server 2. The visual field information acquisition unit 11 acquires visual field information from the HMD 3. The acquired visual field information may be recorded in the storage unit 68 (see FIG. 14) or the like. For example, a buffer or the like for recording visual field information may be configured.

The subtitle object generation unit 12 generates subtitle object data that defines a 3D subtitle object in the 3D space, based on the scene description information (3D space description data), video object data, and audio object data included in the 3D space data.
The three-dimensional subtitle object data includes a subtitle sentence consisting of text data and attribute information of the three-dimensional subtitle object, and will be described in detail later. Hereinafter, a three-dimensional subtitle object may be simply referred to as a subtitle object.

In this embodiment, the subtitle object generation unit 12 functions as an embodiment of a generation unit according to the present technology.

The rendering unit 10 performs rendering processing on the three-dimensional spatial data (scene description information and three-dimensional object data) and the subtitle object data generated by the subtitle object generation unit 12.

As illustrated in FIG. 2, the rendering unit 10 arranges video objects and audio objects in a three-dimensional space based on scene description information. Further, in this embodiment, the rendering unit 10 arranges subtitle objects in a three-dimensional space.

By performing rendering processing based on the three-dimensional space in which the video object, audio object, and subtitle object are arranged, a rendered video 8 that corresponds to the visual field of the user 6 is generated. Also, virtual audio is output with the position of the audio object as the sound source position.

As illustrated in FIG. 4, a subtitle object 14 is displayed within the rendered video 8. That is, in this embodiment, a virtual video including subtitle text is displayed on the HMD 3 of the user 6. In the subtitle object 14 (subtitle text), text data whose content corresponds to the speech content (virtual voice) output from the headphones is displayed in accordance with the speech timing. Further, the subtitle object 14 (subtitle text) is displayed in a display manner that allows the video object that has uttered the content of the subtitle text to be recognized.

In the example shown in FIG. 4, subtitle sentences are displayed in response to the following conversation performed by person objects 15 (15a and 15b).
Person object 15a: "Good morning"
Person object 15b: "Good morning"
Person object 15a: "Long time no see!"
Person object 15b: "How have you been?"

In the example shown in FIG. 4, a speech bubble containing a subtitle sentence is displayed as the subtitle object 14. Furthermore, the tail (pointed part) of the speech bubble extends toward the speaker, thereby making it possible to recognize which person object 15 has spoken. The subtitle object 14 can also be called a subtitle panel.

Of course, the specific configuration example of the subtitle object 14 is not limited, and any configuration other than a speech bubble may be adopted. For example, a configuration may be adopted in which two straight lines extend radially from a video object that is a speaker, and a subtitle sentence is displayed between them. Alternatively, only the subtitle sentence may be displayed as the subtitle object 14 near the speaker.

As mentioned above, the video of free viewpoint three-dimensional virtual space content is composed of three-dimensional video objects. A three-dimensional video object is composed of geometry data representing the shape of the object and texture data representing the color of the object's surface. The geometry data is, for example, a polygon mesh or a set of triangles called a mesh. Another data format that constitutes a three-dimensional video object is point cloud data.

Furthermore, the audio of the free viewpoint three-dimensional virtual space content is composed of three-dimensional audio objects. In a three-dimensional audio object, audio is not recorded as a mixture of multiple sound sources called "channels," but the vibration waveform of the sound is sampled for each sound source (object) that emits sound. That is, the audio object includes sound vibration waveform data as audio information generated in the three-dimensional virtual space S.
Furthermore, the three-dimensional position of the sound source in the local coordinate system (on which the audio objects are based) is also recorded. One three-dimensional audio object is composed of sound vibration waveform data, sound source position information in a local coordinate system, and other metadata.

In this embodiment, a new subtitle object is generated by the subtitle object generation unit 12 shown in FIG. 3 based on the scene description information, the video object, and the audio object. The subtitle object includes subtitle text and attribute information of the three-dimensional subtitle object (hereinafter referred to as subtitle attribute).

For example, information on the position and orientation (direction) of a subtitle object in the three-dimensional virtual space S is generated as the subtitle attribute. That is, position information and orientation information of a subtitle object are generated as subtitle attributes.
In addition, as subtitle attributes, information such as the size, shape, color, transparency, display surface condition, text (subtitle text) size, text font, text color, text transparency, and effects of the subtitle object in the virtual space S is generated. All of these exemplified attribute information may be generated, or some of the attribute information may be generated.

The color of the subtitle object includes the color of the display surface on which the subtitle text is displayed, and can also be called the color of the background of the subtitle text (background color). The state of the display surface includes information such as the roughness and reflectance of the display surface. The display surface can also be called a subtitle surface.
The effects include various effects when expressing the subtitle object 14 and subtitle sentences. For example, various animation displays such as continuous changes in size or color or blinking may be set as effects.

In the example shown in FIG. 4, information such as the position, direction, size, shape, color, and transparency of the speech bubble that is the subtitle object 14, the state of its display surface, the size, font, color, and transparency of the characters (subtitle text) within the speech bubble, and effects is generated as subtitle attributes.

The subtitle attribute can also be said to be attribute information regarding the display of subtitle sentences in the rendered video 8. Alternatively, the subtitle attribute can also be said to be information that defines the display mode of the subtitle text within the rendered video 8.
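As an illustration only, the subtitle object data described above (subtitle text plus subtitle attributes) could be held in a structure such as the following. This is a minimal sketch; the class names, field names, units, and default values are all assumptions made for the example and are not defined in this description.

```python
from dataclasses import dataclass, field
from typing import Tuple

Vec3 = Tuple[float, float, float]
RGB = Tuple[float, float, float]  # each component assumed to be in [0, 1]


@dataclass
class SubtitleAttributes:
    """Hypothetical container for the subtitle attributes listed in the text."""
    position: Vec3 = (0.0, 0.0, 0.0)      # position in the virtual space S
    orientation: Vec3 = (0.0, 0.0, 1.0)   # unit normal of the display surface
    size: Tuple[float, float] = (1.0, 0.4)
    shape: str = "speech_bubble"
    color: RGB = (1.0, 1.0, 1.0)          # background color of the display surface
    transparency: float = 0.0
    surface_roughness: float = 0.5        # display surface state
    surface_reflectance: float = 0.1      # display surface state
    text_size: float = 0.1
    text_font: str = "sans-serif"
    text_color: RGB = (0.0, 0.0, 0.0)
    text_transparency: float = 0.0
    effects: Tuple[str, ...] = ()         # e.g. ("blink",)


@dataclass
class SubtitleObject:
    """Subtitle text together with its attributes."""
    text: str
    attributes: SubtitleAttributes = field(default_factory=SubtitleAttributes)
```

Because only some of the attribute information may be generated, every attribute in this sketch has a default so that a subtitle object can be created from the text alone.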

[Specific example of display processing of subtitle text (subtitle object)]
FIG. 5 is a flowchart illustrating an example of display processing of the subtitle object 14 according to utterances.
The process shown in FIG. 5 includes generation of subtitle object data (subtitle text and subtitle attributes) by the subtitle object generation unit 12, and pre-rendering processing and rendering processing by the rendering unit 10.
Steps other than the pre-rendering process (steps 106 and 111) and the rendering process (step 112) are executed by the subtitle object generation unit 12.

For example, when the user 6 selects a subtitle display mode in which subtitles are displayed, the subtitle object display process shown in FIG. 5 is executed. The user 6 can specify the language of the subtitle text to be displayed as a user setting. For example, subtitles can be displayed in any language such as Japanese, English, French, etc.

Based on the user settings, the language of the subtitle text to be displayed in the rendered video 8 is determined (step 101). Next, the video object that made the utterance is specified from the position information of the audio object that includes the content of the utterance (audio information) (step 102).

The scene description information describes position information of a three-dimensional audio object in the three-dimensional virtual space S and position information of a three-dimensional video object in the three-dimensional virtual space S. That is, the scene description information describes the position information of the audio object and the position information of the video object on the global coordinates based on the three-dimensional virtual space S.

In this embodiment, it is possible to determine the video object that corresponds to the audio object based on the position information of the audio object and the position information of the video object described in the scene description information. That is, it is possible to easily determine the correspondence between audio objects and video objects based on the scene description information. As a result, it is possible to easily identify the video object that uttered the audio information included in the audio object.
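One simple way to picture the determination in step 102 is a nearest-neighbor search over the positions described in the scene description information. The following is a minimal sketch under that assumption; the function name find_speaker and the dictionary layout are hypothetical, and the description does not state that a nearest-neighbor rule is the method actually used.

```python
import math
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


def find_speaker(audio_position: Vec3, video_positions: Dict[str, Vec3]) -> str:
    """Return the id of the video object closest to the audio object.

    Both positions are assumed to be given in the same global coordinate
    system of the virtual space S.
    """
    if not video_positions:
        raise ValueError("no video objects in the scene")
    return min(
        video_positions,
        key=lambda object_id: math.dist(audio_position, video_positions[object_id]),
    )
```

For example, with video objects at (0, 0, 0) and (5, 0, 0), an audio object located at (4.5, 0.2, 0) would be associated with the second video object.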

Speech recognition is performed on the audio information included in the audio object, and the utterance content is converted into text (step 103). Furthermore, the text of the utterance is translated into the subtitle language (step 104). In this embodiment, the text of the utterance is translated into the language specified by the user 6 in step 101. As a result, generation of the subtitle sentence (text data) included in the subtitle object data is completed.

As described above, in this embodiment, a subtitle sentence is generated by performing speech recognition on audio information included in an audio object. Furthermore, subtitle sentences are generated by performing translation processing on the recognition results of speech recognition.
Note that, in step 104, the writing style of the subtitle text, etc. may be adjusted based on the attribute information of the scene description information. For example, if attribute information is written indicating that the video object that uttered the audio information is female, the subtitle text may be adjusted to have a female writing style.

Here, as a comparative example, consider a case where subtitles are newly generated and displayed for two-dimensional video content composed of two-dimensional images and channel audio in which a large number of sounds are mixed.
When attempting to identify an object such as a person displayed in an image of such two-dimensional video content, technology for shape recognition and meaning recognition for the object is required to identify the object within the image. Even if such shape recognition and meaning recognition can be performed with high accuracy, it is difficult to reduce the false recognition rate to zero.

Furthermore, it is difficult to separate only the voice of a specific person from channel audio in which a large number of voices are mixed. For example, it has been difficult to associate which person in an image uttered a certain sound because there is no metadata that connects the sound and the video.
In particular, when a certain voice is uttered by a person outside the image, it is extremely difficult to automatically associate the voice with the person. In reality, the correspondence between people in a video and voices relies largely on the viewer's cognitive abilities.

Here, the present inventor focused on the characteristics of the three-dimensional virtual space content described with reference to FIGS. 1 and 2.

The vibration waveform data that is the audio information that makes up the audio object is not mixed with the sounds of other audio objects or surrounding sounds. Although external environmental sounds may be mixed in as noise, the volume level is low because it is not intentionally mixed. Therefore, when the vibration waveform data constituting the audio object is the voice spoken by a person, voice recognition is easy.

Taking advantage of this feature, the vibration waveform data of the speaker's audio object is input to the speech recognizer and converted into text data in the scene that the user 6 is viewing. Since there is no need to perform processing to classify and separate the voices of multiple people in order to extract the voice of a specific person from the channel audio, it is possible to perform voice recognition with a high recognition rate.

Furthermore, the video object and the audio object are individually described in the scene description information, and the position information of each object in the virtual space S is also described. Therefore, compared to two-dimensional video content consisting of two-dimensional images and channel audio mixed with many sounds, the text data generated as a subtitle sentence can be associated with the video object that uttered the contents of the subtitle sentence very easily and with very high precision.

Note that in the example shown in FIG. 5, the correspondence between the audio object and the video object is determined in step 102. Then, in steps 103 and 104, speech recognition and translation processing is performed on the utterance content (audio information) to generate a subtitle sentence.

The present invention is not limited to this, and the correspondence between a subtitle sentence generated from an audio object and a video object may be determined. That is, the video object corresponding to the subtitle sentence generated from the audio object may be determined based on the position information of the audio object and the position information of the video object described in the scene description information. Of course, the 3D video object that uttered the content of the subtitle sentence is determined as the 3D video object corresponding to the subtitle sentence.

In order to identify the video object that emitted the audio information (vibration waveform data) included in the audio object data, the correspondence between the audio object and the video object may be determined, or the correspondence between the subtitle text generated from the audio object and the video object may be determined.

Next, subtitle attributes included in the subtitle object are generated. The subtitle attribute is generated based on the determination result of the video object corresponding to the audio object. Of course, the subtitle attribute may be generated based on the determination result of the video object corresponding to the subtitle text generated from the audio object.

In this embodiment, first, the initial values (default values) of the subtitle attributes are set (step 105). For example, initial values are set for the position, orientation (direction), size, shape, color, transparency, display surface state, text (subtitle text) size, text font, text color, text transparency, and effects.

FIGS. 6 and 7 are schematic diagrams for explaining an example of setting initial values of the position and orientation of the subtitle object (speech bubble) 14 in the virtual space S.

In this embodiment, information on the position and orientation of the subtitle object 14 is generated based on the three-dimensional bounding box (BBox) 19 of the video object 18 corresponding to the audio object 17, that is, the video object 18 corresponding to the subtitle text.

Specifically, as shown in FIG. 6, attention is paid to the perspective projection image of the three-dimensional BBox 19 of the video object 18 corresponding to the subtitle sentence when viewed from the viewpoint position of the user 6. With respect to this perspective projection image, the position and orientation that satisfy the condition that the subtitle surface (the display surface on which the subtitle text is displayed) is adjacent to the outside of the three-dimensional BBox 19 and is perpendicular to (directly facing) the line-of-sight direction of the user 6 are set as initial values. Note that the method for generating the three-dimensional BBox 19 of the video object 18 is not limited, and for example, a well-known technique may be used.

By setting the position and orientation of the subtitle object 14 using the three-dimensional BBox 19 as a reference in this way, it is possible to prevent the video object 18 from being hidden by its own subtitle object 14. Furthermore, since the subtitle text is displayed in a direction directly facing the line of sight, easy-to-read subtitle display is realized.
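The initial-value setting described above can be sketched as follows. This is a simplified illustration, not the method of the embodiment: it assumes an axis-aligned BBox with the Y axis pointing up, always places the subtitle surface directly on top of the BBox, and approximates "directly facing the line of sight" by pointing the surface normal at the viewpoint position. The function name and the panel_height parameter are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def initial_subtitle_pose(bbox_min: Vec3, bbox_max: Vec3, viewpoint: Vec3,
                          panel_height: float = 0.4) -> Tuple[Vec3, Vec3]:
    """Return (position, normal) for a subtitle surface resting on top of the BBox."""
    # Center the panel over the box and keep it just outside the top face.
    center_x = (bbox_min[0] + bbox_max[0]) / 2
    center_z = (bbox_min[2] + bbox_max[2]) / 2
    position = (center_x, bbox_max[1] + panel_height / 2, center_z)

    # Point the surface normal at the viewpoint so the text faces the user.
    to_viewer = tuple(v - p for v, p in zip(viewpoint, position))
    length = math.sqrt(sum(c * c for c in to_viewer))
    if length == 0.0:
        raise ValueError("viewpoint coincides with the subtitle position")
    normal = tuple(c / length for c in to_viewer)
    return position, normal
```

A fuller implementation would work on the perspective projection of the BBox and could choose a side other than the top, for example when a plurality of subtitle objects must surround the same BBox.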

Note that, as shown in FIG. 7, when a plurality of subtitle objects 14 are generated at approximately the same timing, the initial values of the position and orientation of each subtitle object 14 are set so as to surround the three-dimensional BBox 19.

Of course, other methods may be adopted for setting the initial values of the position and orientation of the subtitle object 14. Furthermore, for attribute information other than the position and orientation information of the subtitle object 14, any method may be adopted for setting the initial values.

In this embodiment, the rendering unit 10 executes pre-rendering processing to adjust subtitle attributes. That is, pre-rendering processing is performed on the three-dimensional spatial data and the subtitle object (initial values of subtitle text and subtitle attributes) (step 106). Then, by adjusting the subtitle attributes based on the result of the pre-rendering process, it is possible to control the display mode of the subtitle text in the rendered video 8, which is two-dimensional video data.

Since the subtitle attributes are adjusted, in step 105, temporary subtitle attributes are set as initial values. Of course, subtitle object data including subtitle text and initial values of subtitle attributes is also included in one embodiment of subtitle object data according to the present technology.

In the example shown in FIG. 5, pre-rendering processing is executed to determine the occurrence of occlusion and the visibility of subtitle sentences (see steps 107 and 109). Note that occlusion is a state in which an object in the foreground hides an object in the background based on the viewpoint position.

That is, in this embodiment, pre-rendering processing is performed to determine whether, in the rendered video 8 as viewed from the viewpoint position of the user 6, occlusion occurs between the video object 18 and the subtitle object 14 or between subtitle objects 14.

Furthermore, in this embodiment, pre-rendering processing is executed to determine whether the visibility of the subtitle text is low due to surrounding colors, lighting, etc.

The pre-rendering process executed in step 106 is a pre-rendering process to determine the occurrence of occlusion, and is also a pre-rendering process to determine the visibility of the subtitle text. Note that the pre-rendering process for determining the occurrence of occlusion and the pre-rendering process for determining the visibility of the subtitle text may be performed separately. Then, based on the results of each pre-rendering process, determination of occurrence of occlusion (step 107) and determination of visibility of subtitle text (step 109) may be performed.

Note that since the pre-rendering process is not rendering for final display, for example, a simple rendering process that uses only mesh vertices without using texture data may be performed.

If occlusion has occurred (YES in step 107), the position and orientation of the subtitle object 14 are changed so that the occlusion is resolved (step 108). That is, in this embodiment, the position and orientation of the subtitle object 14 included in the subtitle attributes are adjusted based on the result of pre-rendering processing for determining the occurrence of occlusion.

For example, when the video object 18 is hidden by the subtitle object 14, when the subtitle object 14 is hidden by the video object 18, or when the subtitle object 14 is hidden by another subtitle object 14, the position and orientation of the subtitle object 14 are changed.

FIG. 8 is a schematic diagram for explaining an example of a method for changing the position and orientation of the subtitle object 14.

As shown in FIG. 8A, initial values of the positions and orientations of the subtitle objects 14a and 14b are set based on the three-dimensional BBoxes 19a and 19b of the video objects 18a and 18b, respectively. Assume that occlusion occurs when pre-rendering processing is performed with the subtitle attributes having the initial values. In this case, the position and orientation of each of the subtitle objects 14a and 14b are adjusted to avoid occlusion.

As shown in FIG. 8B, in this embodiment, first, the subtitle objects 14a and 14b are moved so as to satisfy the condition that they are adjacent to the outside of the three-dimensional BBoxes 19a and 19b, and a position where occlusion can be avoided is searched for. If the occlusion is not resolved even after moving the subtitle objects 14a and 14b around the three-dimensional BBoxes 19a and 19b, the subtitle objects 14a and 14b are moved to positions away from the three-dimensional BBoxes 19a and 19b.

For example, it is possible to eliminate occlusion by adjusting subtitle attributes like this. Of course, other methods may be employed to change the positions and orientations of the subtitle objects 14a and 14b to avoid occlusion.
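The two-stage search described with reference to FIG. 8B (first around the BBox, then away from it) can be sketched in screen space as follows. This is only an illustration under several assumptions: objects are represented by axis-aligned rectangles of their projected images, only four candidate slots (above, right, left, below) are tried, and the gap values are arbitrary. All names are hypothetical.

```python
from typing import Iterator, Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def intersects(a: Rect, b: Rect) -> bool:
    """True if two rectangles overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def candidate_rects(bbox: Rect, width: float, height: float, gap: float) -> Iterator[Rect]:
    """Yield subtitle rectangles above, right of, left of, and below the BBox."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    yield (cx - width / 2, y1 + gap, cx + width / 2, y1 + gap + height)   # above
    yield (x1 + gap, cy - height / 2, x1 + gap + width, cy + height / 2)  # right
    yield (x0 - gap - width, cy - height / 2, x0 - gap, cy + height / 2)  # left
    yield (cx - width / 2, y0 - gap - height, cx + width / 2, y0 - gap)   # below


def place_subtitle(bbox: Rect, width: float, height: float,
                   obstacles: Sequence[Rect],
                   gaps: Sequence[float] = (0.0, 0.5, 1.0)) -> Optional[Rect]:
    """Try adjacent positions first (gap 0), then positions farther from the BBox."""
    for gap in gaps:
        for rect in candidate_rects(bbox, width, height, gap):
            if not any(intersects(rect, obstacle) for obstacle in obstacles):
                return rect
    return None  # no occlusion-free position found among the candidates
```

When the function returns None, another adjustment would be needed, such as changing the shape, size, or transparency of the subtitle object.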

Note that depending on the viewpoint position and line-of-sight direction of the user 6, the audio may be heard while the video object 18 is out of the field of view. In this case, the subtitle object 14 cannot be placed adjacent to the three-dimensional BBox 19 of the video object 18. In such a case, the subtitle object 14 is placed at the edge of the field of view that is close to the sound source position outside the field of view. This makes it possible for the user 6 to easily understand that the subtitle object 14 corresponds to the audio emitted from the video object 18 that is outside the field of view.

In step 107, the occurrence of occlusion may be determined based on the three-dimensional BBox 19 of each video object 18. For example, when the three-dimensional BBox 19 is hidden by the subtitle object 14, or when the subtitle object 14 is hidden by the three-dimensional BBox 19, it may be determined that occlusion has occurred, and the position and orientation of the subtitle object 14 may be changed.

Furthermore, the area hidden by other objects in the rendered video 8 may be used to determine whether or not occlusion has occurred. For example, even if occlusion occurs, if the area where objects overlap is smaller than a predetermined threshold, it is determined that occlusion has not occurred (there is no effect), and the subtitle attributes (position and orientation) are not adjusted. Such settings are also possible.
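The area-based determination can be illustrated with the following minimal sketch, which assumes that the pre-rendering result provides an axis-aligned screen-space rectangle for each object. The function names and the rectangle representation are assumptions for the example; an actual implementation might count hidden pixels instead.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0.0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def occlusion_detected(subtitle: Rect, other: Rect, area_threshold: float = 0.0) -> bool:
    """Treat an overlap as occlusion only when its area exceeds the threshold."""
    return overlap_area(subtitle, other) > area_threshold
```

With a threshold of 0.0 any overlap counts as occlusion; raising the threshold lets small overlaps pass without triggering an adjustment of the position and orientation.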

Furthermore, in step 108, attribute information other than the position and orientation of the subtitle object 14 may be changed so that occlusion is resolved. For example, occlusion can be eliminated by adjusting the shape and size of the subtitle object 14. Of course, other attribute information may also be adjusted.

The processes of steps 106 to 108 are repeated until it is determined that occlusion has not occurred. If occlusion has not occurred (No in step 107), it is determined whether the visibility of the subtitle text in the rendered video 8 is poor (step 109).

If the visibility of the subtitle text is poor (Yes in step 109), adjustment of subtitle attributes is performed to improve visibility (step 110). That is, in this embodiment, the subtitle attributes are adjusted to improve visibility based on the results of pre-rendering processing for determining the visibility of subtitle sentences.

In this embodiment, information on the surrounding color and illumination for the position of the subtitle object 14 is acquired from the result of the pre-rendering process. Based on the information obtained from the results of these pre-rendering processes, it is determined whether the visibility of the subtitle text is poor or not, and if the visibility is poor, the subtitle attributes are adjusted so as to improve the visibility. Note that specific criteria for determining whether visibility is poor may be set as appropriate depending on the implementation details.

For example, if the display color of the subtitle text is close to the surrounding color, it is changed to a color that is further away from the surrounding color in the color space, that is, a color with a large color difference. This adjustment is included in the adjustment of the font color of the subtitle attribute.
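The color adjustment described above can be sketched as follows. The sketch assumes RGB components in the range 0 to 1, uses the Euclidean distance in RGB space as the color difference, and falls back to black or white, whichever is farther from the surrounding color. The threshold value and the function names are hypothetical, and a perceptual color space could be substituted for RGB.

```python
import math
from typing import Tuple

RGB = Tuple[float, float, float]  # each component assumed to be in [0, 1]


def color_distance(a: RGB, b: RGB) -> float:
    """Euclidean distance between two colors in RGB space."""
    return math.dist(a, b)


def adjust_text_color(text_color: RGB, surrounding_color: RGB,
                      min_distance: float = 0.5) -> RGB:
    """Keep the text color if it contrasts enough; otherwise pick black or white."""
    if color_distance(text_color, surrounding_color) >= min_distance:
        return text_color
    candidates = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    return max(candidates, key=lambda c: color_distance(c, surrounding_color))
```

For example, light gray text on a white surrounding would be replaced with black, whereas black text on the same surrounding would be left unchanged.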

In addition, if the subtitle object is exposed to strong lighting and the subtitle text is blown out, the roughness and transparency of the subtitle text surface or subtitle surface are changed to prevent the subtitle text from becoming unreadable due to light reflection. This adjustment is included in the adjustment of the color, transparency, and display surface state of the subtitle object 14 and the transparency of the characters as subtitle attributes. In addition, any adjustment processing for improving the visibility of the subtitle text, such as adjusting the size of the characters, may be performed.

Once the subtitle attributes are adjusted in step 110, pre-rendering processing for confirmation is executed (step 111). This pre-rendering process can also be called a pre-rendering process for readjusting subtitle attributes. That is, in step 111, a pre-rendering process similar to step 106 is executed to determine the occurrence of occlusion and to determine the visibility of the subtitle text. Note that the process may return to step 106 from step 111.

The processes of steps 109 to 111 are repeated until it is determined that visibility is not poor. If it is determined that the visibility is not poor (No in step 109), detailed rendering processing for display (not pre-rendering) is executed, and the rendered video 8 to be presented to the user 6 is generated (step 112). Since the rendering process is performed after the subtitle attributes have been adjusted, it becomes possible to display, near the video object 18 that has emitted the sound, an easy-to-read subtitle text that is sufficiently visible to the user 6 and is not obscured by the color or lighting of the surrounding environment.
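The overall adjustment flow of FIG. 5 (steps 106 to 111) can be summarized by the following control-flow sketch. It is an illustration only: the pre-rendering and the two adjustment operations are passed in as hypothetical callables, the confirmation pre-rendering of step 111 is folded into the same loop as step 106 (the variant in which the process returns to step 106), and the iteration cap is an added safeguard that is not part of the description.

```python
from typing import Any, Callable, Dict


def adjust_subtitle_attributes(
    attributes: Any,
    pre_render: Callable[[Any], Dict[str, bool]],
    resolve_occlusion: Callable[[Any], Any],
    improve_visibility: Callable[[Any], Any],
    max_iterations: int = 10,
) -> Any:
    """Repeat pre-rendering and adjustment until no problem is reported."""
    for _ in range(max_iterations):
        result = pre_render(attributes)                  # steps 106 / 111
        if result["occluded"]:                           # step 107
            attributes = resolve_occlusion(attributes)   # step 108
        elif result["poor_visibility"]:                  # step 109
            attributes = improve_visibility(attributes)  # step 110
        else:
            break                                        # ready for step 112
    return attributes
```

The returned attributes would then be passed to the detailed rendering process for display.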

Note that when it is difficult to avoid occlusion, it is also possible to reduce the effect of the video object 18 or the like being hidden by the subtitle object 14 by adjusting the transparency of the subtitle object 14 and the transparency of the subtitle text. For example, if the determination in step 107 shown in FIG. 5 continues to be Yes, the transparency may be adjusted and the process may proceed to the rendering process. It is also possible to adopt such a processing flow.

[Display judgment of subtitle text (subtitle object)]
By generating the subtitle object 14 by the subtitle object generation unit 12, it becomes possible to convert audio into subtitle text and display it in the rendered video 8. The display of this subtitle text (subtitle object 14) is executed in conjunction with the output of audio information included in the audio object 17 to the user 6, for example. That is, when there is a speech from the video object 18, the subtitle object 14 is generated and the subtitle sentence is displayed.

The present invention is not limited to this, and when there is a speech from a person's avatar or the like that is the video object 18, a processing flow may be executed in which it is determined whether or not the content of the speech is converted into a subtitle sentence and displayed. That is, a predetermined condition may be set regarding the display of the subtitle text, and it may be determined whether or not to generate the subtitle object 14, that is, whether to display the subtitle text according to the utterance.

FIG. 9 is a schematic diagram illustrating a processing example of subtitle display determination. The process shown in FIG. 9 can also be said to be a process for determining whether or not to generate the subtitle object 14.

First, a volume threshold is set for determining whether or not to convert audio into subtitle text (step 201). The volume threshold may be specified by the user 6 as a user setting, for example. That is, a threshold value regarding the volume for determination may be set on the user/client side.

In the three-dimensional virtual space S, it is monitored whether or not there is an avatar (video object 18) that has started speaking (step 202). If there is an avatar that has started speaking (Yes in step 202), it is determined whether the output volume is greater than a threshold based on the user position (viewpoint position) in the virtual space S (step 203).

If the volume is larger than the threshold (Yes in step 203), a process of converting the utterance content into a subtitle sentence is executed. That is, the subtitle object generation unit 12 generates a subtitle object 14 based on the audio object 17, and starts displaying a subtitle sentence according to the content of the utterance in the rendered video 8 (step 204).

If the volume is smaller than the threshold (No in step 203), the process returns to step 202. That is, the process of converting the content of the utterance into a subtitle sentence is not executed, and the subtitle text corresponding to the content of the utterance is not displayed in the rendered video 8.

As described above, in the example shown in FIG. 9, subtitle object data is generated based on the audio object data when the output volume of the audio information included in the audio object 17 is larger than a predetermined threshold.

In the virtual space S, the farther the avatar's position is from the user's position (viewpoint position), the more the volume of the voice emitted from the avatar attenuates and becomes smaller. In other words, just like in the real world, it is easier to hear the voices of avatars who are nearby, and it is harder to hear the voices of avatars who are far away.

In the determination process shown in FIG. 9, the volume threshold is set as appropriate. This makes it possible to determine, for example, that if the voice uttered from a distant avatar is at a volume that is almost inaudible based on the user position (viewpoint position), it will not be converted into subtitle text.
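The determination in step 203 can be sketched as follows. The description only states that the volume attenuates with distance, so the inverse-distance attenuation law, the reference distance, and the function names used here are assumptions chosen for illustration.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def perceived_volume(source_volume: float, source_position: Vec3,
                     listener_position: Vec3, reference_distance: float = 1.0) -> float:
    """Volume at the listener, assuming inverse-distance attenuation.

    Within the reference distance the volume is not amplified.
    """
    distance = max(math.dist(source_position, listener_position), reference_distance)
    return source_volume * reference_distance / distance


def should_generate_subtitle(source_volume: float, source_position: Vec3,
                             listener_position: Vec3, threshold: float) -> bool:
    """Step 203: generate a subtitle only if the perceived volume exceeds the threshold."""
    return perceived_volume(source_volume, source_position, listener_position) > threshold
```

Raising the threshold reduces the number of subtitle objects generated, because utterances from more distant avatars fall below it.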

Furthermore, by appropriately setting the volume threshold, it is also possible to adjust the number and frequency of subtitles displayed in the rendered video 8 representing the virtual space S. For example, it becomes possible to convert only the content of utterances at a relatively loud volume into subtitles and display them.

Note that a threshold value related to the volume for determination, an initial value of the threshold value, etc. may be described in the scene description information. That is, the scene description information may include information on a predetermined threshold value. In this case, it is also possible to set a threshold value for determination on the content side. For example, a volume threshold may be set appropriately for each scene.

[Delete displayed subtitle text (subtitle object)]
It is desirable that the subtitle text (subtitle object 14) once displayed be deleted when some condition is met. If the subtitle text is not deleted, the three-dimensional virtual space S will overflow with subtitle objects 14.

Any setting may be adopted as to when to delete the subtitle text (subtitle object 14). For example, the subtitle text may be deleted when a predetermined period of time has passed since the start of display. Alternatively, the subtitle sentence may be deleted based on the timing at which the utterance ends.

FIG. 10 is a flowchart illustrating an example of a process for determining whether to end displaying a subtitle text (subtitle object 14). In the example shown in FIG. 10, it is determined whether to end the display of the subtitle text based on the user's 6 gaze point information. That is, information on the user's gaze point position in the virtual space S is used to determine whether or not to erase the subtitle text (subtitle object 14).

The gaze point information of the user 6 is information included in the user's visual field information, and can be obtained by, for example, eye tracking. Using the gaze point information, it is possible to determine the position to which the user 6 has directed his/her line of sight in the three-dimensional virtual space S. That is, the gaze point corresponds to the position where the user directs his/her line of sight.

First, it is determined whether the user 6 has turned his/her line of sight to the displayed subtitle text (step 301). When the user 6 turns his/her line of sight to the subtitle text (Yes in step 301), it is then monitored whether the user 6 takes his/her line of sight away from the subtitle text (step 302).

If the user 6 takes his/her line of sight away from the subtitle sentence (Yes in step 302), it is determined whether the user 6 moves his/her line of sight to the end of the subtitle sentence and then takes his/her line of sight off (step 303). If the user 6 moves his line of sight to the end of the subtitle sentence and then removes his line of sight (Yes in step 303), it is determined that he has finished reading the subtitle sentence, and the subtitle sentence is deleted (step 304).

If the user 6 has not moved his line of sight to the end of the subtitle sentence and then removed his line of sight (No in step 303), it is determined that he has not finished reading the subtitle sentence, and the process returns to step 301. In other words, the subtitle text is not deleted.

In step 301, if the user 6 is not looking at the subtitle text (No in step 301), it is monitored whether the elapsed time has exceeded the display time of the subtitle text (step 305). The display time (threshold value) serving as the criterion for determination may be set as appropriate by the user 6 or the content.

If the elapsed time exceeds the display time of the subtitle text (Yes in step 305), the subtitle text is deleted even though the subtitle text has not been read yet (step 306). In this way, subtitles that the user 6 does not look at are automatically deleted after a predetermined period of time has elapsed.
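The determination flow of FIG. 10 can be sketched as a small state update that is called once per frame. This is a simplified illustration; the state class, the function name, and the way gaze information is supplied as two booleans are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class SubtitleGazeState:
    """Hypothetical per-subtitle state kept between frames."""
    looked_at: bool = False    # the user has directed the gaze at the subtitle
    reached_end: bool = False  # the gaze has reached the end of the sentence


def should_delete_subtitle(state: SubtitleGazeState, gaze_on_subtitle: bool,
                           gaze_at_end: bool, elapsed: float,
                           display_time: float) -> bool:
    """Return True when the subtitle should be deleted in this frame."""
    if gaze_on_subtitle:                       # Yes in step 301
        state.looked_at = True
        state.reached_end = state.reached_end or gaze_at_end
        return False
    if state.looked_at:                        # Yes in step 302: gaze has left
        if state.reached_end:                  # Yes in step 303
            return True                        # step 304: finished reading
        state.looked_at = False                # No in step 303: back to step 301
        return False
    return elapsed > display_time              # steps 305 and 306
```

In this sketch, a subtitle that was looked at but not read to the end becomes subject to the display-time check again once the gaze has left it.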

Note that the subtitle text (subtitle object 14) may move out of the field of view as the user 6 moves within the three-dimensional virtual space S or moves his or her head. In this case, similar to the case of displaying the subtitle text corresponding to the video object 18 outside the field of view, processing such as continuing to display the subtitle object 14 at the edge of the field of view may be performed.

Furthermore, when the line of sight of the user 6 returns, within a short period of time, to the position of the subtitle text (subtitle object 14) that has been erased, it may be determined that the user 6 wants to reconfirm the subtitle text that he/she was looking at immediately before, and the deleted subtitle text may be redisplayed.

In this manner, in free viewpoint three-dimensional virtual space content, by utilizing visual field information including gaze point information, it is possible to automatically determine the timing for erasing the subtitle object 14 from the scene. As shown in FIG. 10, the subtitle object 14 can be automatically deleted at an appropriate timing without hindering the understanding of the user 6 who is viewing the content.

As described above, in the virtual space providing system 1 according to the present embodiment, subtitle object data is generated by the client device 4 based on the three-dimensional space description data, video object data, and audio object data. This makes it possible to realize high-quality virtual images.

By applying this technology, in three-dimensional virtual space content that allows viewing of the three-dimensional virtual space S from any free viewpoint (6 degrees of freedom), it becomes possible to generate the subtitle object 14 with high accuracy and display the subtitle text even if the content creator has not prepared subtitles. Furthermore, it is possible to display the subtitle text at an appropriate position depending on the field of view (viewpoint position and line-of-sight direction) of the user 6 who is viewing the content from a free viewpoint.

For example, suppose that for content consisting of video and audio, it is possible to generate subtitles by some method even though subtitles have not been prepared in advance. In this case, in three-dimensional virtual space content, it is very difficult to appropriately determine the subtitle display position in advance. This is because the viewing position and viewing direction of the user 6 are not fixed because the three-dimensional virtual space content can be viewed from any free viewpoint.

If subtitles are displayed without considering the viewing position and viewing direction of the user 6, the video object 18 that lies ahead of the line of sight of the user 6 and that the user 6 wants to view may be hidden by the subtitles, or a speech bubble may be displayed near a person who is not the speaker, making the scene difficult to understand. In three-dimensional virtual space content, it is extremely difficult to check for this problem in advance and determine the subtitle position for every viewing position and direction that the user 6 can select.

With this technology, it is possible to fully solve such problems. That is, in the three-dimensional virtual space content, even if the user 6 freely moves his/her viewpoint, the subtitle object 14 can be generated with high accuracy and displayed at an appropriate position without interfering with viewing.

By applying this technology, it becomes possible to automatically generate subtitles and display them at appropriate positions, which is extremely effective in terms of accessibility for free-viewpoint content.

By applying this technology, in free-viewpoint three-dimensional virtual space content for which subtitles are not available, a subtitle string is generated using voice recognition and translated into the language specified by the user 6, and the resulting subtitle text is associated with the video object 18 based on scene description information. This makes it possible to determine a subtitle position that avoids occlusion. As a result, the subtitles can be displayed at an appropriate position and with appropriate attributes when viewed from the viewpoint of the user 6, thereby increasing the understanding of the content by the user 6.
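As a non-limiting illustration, the association between an audio object and a video object based on position information, and the placement of the subtitle relative to the three-dimensional bounding box of the speaker, may be sketched as follows. The nearest-distance rule, the y-up coordinate system, the margin value, and all names (VideoObject, find_speaker, subtitle_position) are assumptions made for this sketch only; voice recognition, translation, and occlusion checking are omitted.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class VideoObject:
    name: str
    position: Vec3   # position taken from the scene description (assumed)
    bbox_min: Vec3   # corners of the three-dimensional bounding box
    bbox_max: Vec3


def find_speaker(audio_position: Vec3, video_objects: List[VideoObject]) -> VideoObject:
    """Pick the video object closest to the position of the audio object."""
    return min(video_objects, key=lambda v: math.dist(v.position, audio_position))


def subtitle_position(speaker: VideoObject, margin: float = 0.2) -> Vec3:
    """Place the subtitle centered above the speaker's bounding box (y is up)."""
    cx = (speaker.bbox_min[0] + speaker.bbox_max[0]) / 2
    cz = (speaker.bbox_min[2] + speaker.bbox_max[2]) / 2
    return (cx, speaker.bbox_max[1] + margin, cz)
```

A position obtained in this way would serve only as an initial candidate; it could then be adjusted when occlusion is detected from the viewpoint of the user 6.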

As shown in FIG. 11, it is assumed that the display area of the display is divided into a content display area 22 and a subtitle display area 23. It is assumed that a rendered video representing the virtual space S is displayed in the content display area 22, and subtitles are displayed in the subtitle display area 23. When subtitles are displayed at a fixed position on the display in this way, the user views a mixture of the three-dimensional virtual space S and two-dimensional subtitles, which makes it difficult to become immersed in the virtual space S.

In this embodiment, as shown in FIG. 4, a subtitle object 14 is arranged as a three-dimensional object in a three-dimensional virtual space. Further, the position, orientation, shape, etc. of the subtitle object 14 are changed as the viewpoint position, line of sight direction, etc. of the user 6 are changed. As a result, it becomes possible to provide subtitles to the user 6 without impairing the depth and stereoscopic effect of the space, and it becomes possible to immerse the user 6 in a three-dimensional space. It also becomes possible to improve the accessibility of content.
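As a non-limiting illustration of changing the orientation of the subtitle object 14 in accordance with the viewpoint position of the user 6, the following sketch computes a rotation about the vertical axis that turns the front of the subtitle toward the viewpoint. The y-up coordinate system, the convention that a yaw of 0 degrees faces the +z direction, and the function name facing_yaw_deg are assumptions made for this sketch.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def facing_yaw_deg(subtitle_pos: Vec3, viewpoint_pos: Vec3) -> float:
    """Yaw about the vertical (y) axis that turns the subtitle toward the viewer.

    Assumes a y-up coordinate system in which yaw 0 faces the +z direction.
    """
    dx = viewpoint_pos[0] - subtitle_pos[0]
    dz = viewpoint_pos[2] - subtitle_pos[2]
    return math.degrees(math.atan2(dx, dz))
```

Recomputing this angle whenever the visual field information is updated keeps the subtitle text readable as the user 6 moves, while the subtitle remains a three-dimensional object placed in the scene.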

<Other embodiments>
The present technology is not limited to the embodiments described above, and various other embodiments can be realized.

The present technology can also be applied to three-dimensional virtual space content in which text data for subtitles has been prepared by the content creator, but the subtitle display position has not been specified. That is, by applying the present technology, it is possible to newly generate a subtitle object that is a three-dimensional object so as to include a prepared subtitle sentence. This makes it possible to determine in real time the position, shape, etc. where the subtitle text can be read, depending on the viewpoint position and line of sight direction of the user 6.

[Client side rendering/server side rendering]
As explained above, in the example shown in FIG. 1, rendering processing is executed by the client device 4, and two-dimensional video data (rendered video 8) is generated according to the visual field of the user 6. That is, in the example shown in FIG. 1, a client-side rendering system configuration is adopted as a 6DoF video distribution system.

The 6DoF video distribution system to which the present technology can be applied is not limited to a client-side rendering system, but can also be applied to other distribution systems such as a server-side rendering system.

FIG. 12 is a schematic diagram for explaining a configuration example of a server-side rendering system.
In the server-side rendering system, a rendering server 30 is constructed on the network 5. The rendering server 30 is communicably connected to the distribution server 2 and the client device 4 via the network 5. For example, the rendering server 30 can be implemented by any computer such as a PC.

As illustrated in FIG. 12, visual field information is transmitted from the client device 4 to the rendering server 30. Furthermore, three-dimensional spatial data is distributed from the distribution server 2 to the rendering server 30.

The rendering server 30 newly generates subtitle object data (subtitle text and subtitle attributes). Furthermore, the rendering server 30 executes rendering processing based on the visual field information of the user 6. As a result, two-dimensional video data (rendered video) corresponding to the visual field of the user 6 is generated. In addition, audio information and output control information are generated. A subtitle object 14 corresponding to an utterance or the like is displayed in the rendered video.

The rendered video, audio information, and output control information generated by the rendering server 30 are encoded and sent to the client device 4. The client device 4 decodes the received rendered video and the like and transmits it to the HMD 3 worn by the user 6. The HMD 3 displays rendered video and outputs audio information. The user 6 can view the virtual space S in which subtitles are displayed.

By adopting the server-side rendering system configuration, the processing load on the client device 4 side can be offloaded to the rendering server 30 side, and even when a client device 4 with low processing capacity is used, the user 6 can experience 6DoF video.

In such a server-side rendering system, it is possible to apply subtitle object generation (subtitle text display) according to the present technology. For example, the functional configuration of the client device 4 described in FIG. 3 is applied to the rendering server 30. As a result, even if the user 6 freely moves his/her viewpoint in the three-dimensional virtual space content, the subtitle object 14 can be generated with high precision and displayed at an appropriate position without interfering with viewing. As a result, it becomes possible to realize high-quality virtual images.

When a server-side rendering system is constructed, the rendering server 30 functions as an embodiment of the information processing device according to the present technology. Then, the rendering server 30 executes an embodiment of the information processing method according to the present technology.

Note that the rendering server 30 may be prepared for each user 6 who uses the present virtual space providing system 1, or may be prepared for a plurality of users 6. Furthermore, the configuration of client side rendering and the configuration of server side rendering may be configured separately for each user 6. That is, in realizing the virtual space providing system 1, both a client-side rendering configuration and a server-side rendering configuration may be employed.

[Remote communication system]
FIG. 13 is a schematic diagram showing a basic configuration example of a remote communication system.
The remote communication system is a system in which a plurality of users 6 (6a to 6c) can share a three-dimensional virtual space S and communicate. Remote communication can also be called volumetric remote communication.

In the remote communication system 31 shown in FIG. 13, user information regarding each user 6 is transmitted from each client device 4 (4a to 4c) to the distribution server 2. For example, as the user information, the user's visual field information, movement information, audio information, etc. are transmitted.

The configuration and method for acquiring the movement information and voice information of the user 6 are not limited, and any configuration and method may be adopted. For example, a camera, a ranging sensor, a microphone, etc. may be arranged around the user 6, and movement information and audio information of the user 6 may be acquired based on the detection results thereof.

Alternatively, various forms of wearable devices such as a glove type may be worn by the user 6. The wearable device is equipped with a motion sensor or the like, and based on the detection result, movement information of the user 6 may be acquired.

The distribution server 2 generates and distributes three-dimensional spatial data based on the user information transmitted from each client device 4 so that the movements, speech, etc. of the user 6 are reflected. In this embodiment, each user's own object (user object) 33 (33a to 33c) and objects of other users (other user objects) 34 are generated and distributed as video objects included in the three-dimensional spatial data. Further, as an audio object included in the three-dimensional spatial data, an audio object including the content of utterances (audio information) from each user 6 is generated and distributed.

For example, when users 6 interact with each other through conversation, dancing, collaborative work, etc., three-dimensional spatial data that reflects the movements and utterances of each user 6 in real time is distributed from the distribution server 2 to each client device 4.
In each client device 4, rendering processing is executed based on the visual field information of the users 6, and two-dimensional video data including the users 6 interacting with each other is generated. Furthermore, audio information and output control information for outputting the utterance content of the users 6 from the sound source positions corresponding to the positions of each user 6 are generated.

By viewing the two-dimensional video displayed on the HMD 3 (3a to 3c) and listening to the audio information output from the headphones, each user 6 can carry out various interactions with other users 6 in the virtual space S. As a result, a remote communication system 31 that allows interaction with other users 6 is realized.

The generation of subtitle objects (display of subtitle text) according to the present technology can be applied to such a remote communication system 31, in which a plurality of users 6 participate in a free-viewpoint three-dimensional virtual space and carry out interactions such as conversations. For example, the functional configuration described in FIG. 3 is applied to each client device 4.

Thereby, in the three-dimensional virtual space S, it becomes possible to appropriately translate the words spoken by each user 6 and display them as subtitle sentences. Further, even if each user 6 freely moves his/her viewpoint, the subtitle object can be displayed at an appropriate position without interfering with viewing. As a result, it becomes possible to realize high-quality virtual images.
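As a non-limiting illustration, the selection of subtitles to be displayed to one participant of the remote communication system 31 may be sketched as follows. The data classes UserInfo and Subtitle, the function subtitles_for_viewer, the rule of skipping the viewer's own utterance, and the fixed height offset are all hypothetical assumptions for this sketch; voice recognition and translation are assumed to have been performed already and are represented only by the utterance string.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class UserInfo:
    user_id: str
    position: Vec3   # position of the user's object in the virtual space
    utterance: str   # recognized (and translated) text; empty when silent


@dataclass
class Subtitle:
    speaker_id: str
    text: str
    position: Vec3


def subtitles_for_viewer(viewer_id: str, users: List[UserInfo],
                         height_offset: float = 0.5) -> List[Subtitle]:
    """Build subtitles for the utterances of every user other than the viewer."""
    subtitles = []
    for user in users:
        if user.user_id == viewer_id or not user.utterance:
            continue  # skip the viewer's own speech and silent users
        x, y, z = user.position
        subtitles.append(Subtitle(user.user_id, user.utterance,
                                  (x, y + height_offset, z)))
    return subtitles
```

Each client device 4 (or the rendering server 30 in a server-side rendering configuration) could run such a selection per viewer before adjusting the subtitle attributes to the viewer's visual field.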

Note that it is also possible to construct the remote communication system 31 as illustrated in FIG. 13 using the server-side rendering configuration as illustrated in FIG. 12. Even when a server-side rendering configuration is adopted, by having the rendering server 30 generate a subtitle object, it is possible to display the utterances of other users 6 as subtitle sentences at appropriate positions.

Further, in a remote communication system, the present technology is also applicable to a form in which the avatar of the user 6, that is, the user object 33, is not displayed.

In the above, an example is given in which a 6DoF video including 360-degree spatial video data is distributed as a virtual image. The present technology is not limited to this, and is also applicable when 3DoF video, 2D video, etc. are distributed. Moreover, instead of VR video, AR video or the like may be distributed as the virtual image. Further, the present technology is also applicable to stereo images (for example, right-eye images, left-eye images, etc.) for viewing 3D images.

FIG. 14 is a block diagram showing an example of the hardware configuration of a computer (information processing device) 60 that can realize the distribution server 2, the client device 4, and the rendering server 30.
The computer 60 includes a CPU 61, a ROM 62, a RAM 63, an input/output interface 65, and a bus 64 that connects these to each other. A display section 66, an input section 67, a storage section 68, a communication section 69, a drive section 70, and the like are connected to the input/output interface 65.
The display section 66 is a display device using, for example, liquid crystal, EL, or the like. The input unit 67 is, for example, a keyboard, pointing device, touch panel, or other operating device. If the input section 67 includes a touch panel, the touch panel can be integrated with the display section 66.
The storage unit 68 is a nonvolatile storage device, such as an HDD, flash memory, or other solid-state memory. The drive section 70 is a device capable of driving a removable recording medium 71, such as an optical recording medium or a magnetic recording tape.
The communication unit 69 is a modem, router, or other communication equipment connectable to a LAN, WAN, etc., for communicating with other devices. The communication unit 69 may communicate using either wired or wireless communication. The communication unit 69 is often used separately from the computer 60.
Information processing by the computer 60 having the above-mentioned hardware configuration is realized by cooperation between software stored in the storage unit 68, ROM 62, etc., and hardware resources of the computer 60. Specifically, the information processing method according to the present technology is realized by loading a program constituting software stored in the ROM 62 or the like into the RAM 63 and executing it.
The program is installed on the computer 60 via the recording medium 71, for example. Alternatively, the program may be installed on the computer 60 via a global network or the like. In addition, any computer-readable non-transitory storage medium may be used.

The information processing method and program according to the present technology may be executed by a plurality of computers communicatively connected via a network or the like, and an information processing device according to the present technology may be constructed.
That is, the information processing method and program according to the present technology can be executed not only in a computer system configured by a single computer but also in a computer system in which multiple computers operate in conjunction with each other.

Note that in the present disclosure, a system means a collection of multiple components (devices, modules (components), etc.), and it does not matter whether all the components are located in the same casing. Therefore, a plurality of devices housed in separate casings and connected via a network and a single device in which a plurality of modules are housed in one casing are both systems.

Execution of the information processing method and program according to the present technology by a computer system includes, for example, both cases where generation of a subtitle object, generation of a subtitle sentence, generation (adjustment) of subtitle attributes, execution of rendering processing, execution of pre-rendering processing, acquisition of user information, determination of the start of display of a subtitle object, determination of the end of display of a subtitle object, and the like are executed by a single computer, and cases where the respective processes are executed by different computers. Furthermore, execution of each process by a predetermined computer includes having another computer execute part or all of the process and acquiring the results.
That is, the information processing method and program according to the present technology can also be applied to a cloud computing configuration in which one function is shared and jointly processed by a plurality of devices via a network.

The configurations of the virtual space providing system, client-side rendering system, server-side rendering system, remote communication system, distribution server, client device, rendering server, HMD, etc., and each processing flow described with reference to the drawings are just one example of implementation. It can be arbitrarily modified without departing from the spirit of the present technology. That is, any other configuration, algorithm, etc. may be adopted for implementing the present technology.

In this disclosure, words such as "substantially," "nearly," and "approximately" are used as appropriate to facilitate understanding of the explanation. However, no clear difference is defined between cases where these words are used and cases where they are not.
That is, in the present disclosure, concepts that define shape, size, positional relationship, state, and the like, such as "center," "uniform," "equal," "same," "orthogonal," "parallel," "symmetrical," "extending," "axial direction," "cylindrical shape," "ring shape," and "annular shape," include "substantially center," "substantially uniform," "substantially equal," "substantially the same," "substantially orthogonal," "substantially parallel," "substantially symmetrical," "substantially extending," "substantially axial direction," "substantially cylindrical shape," "substantially ring shape," "substantially annular shape," and the like.
For example, states that fall within a predetermined range (e.g., a range of ±10%) based on "perfectly center," "perfectly uniform," "perfectly equal," "perfectly the same," "perfectly orthogonal," "perfectly parallel," "perfectly symmetrical," "perfectly extending," "perfectly axial direction," "perfectly cylindrical shape," "perfectly ring shape," "perfectly annular shape," and the like are also included.
Therefore, even when words such as "substantially," "nearly," and "approximately" are not added, concepts that could be expressed by adding these words may be included. Conversely, when a state is expressed with words such as "substantially," "nearly," and "approximately," the complete state is not necessarily excluded.
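As a non-limiting illustration of the predetermined range mentioned above, the following sketch checks whether a measured value falls within ±10% of a reference value. The function name is_substantially and the choice of measuring the range relative to the reference value are assumptions made for this sketch.

```python
def is_substantially(value: float, reference: float, tolerance: float = 0.10) -> bool:
    """Return True if value lies within +/- tolerance (a fraction) of reference."""
    return abs(value - reference) <= tolerance * abs(reference)
```

For example, under these assumptions an angle of 85 degrees would be treated as "orthogonal" with respect to a reference of 90 degrees, whereas an angle of 70 degrees would not.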

In this disclosure, expressions using "than," such as "greater than A" and "less than A," comprehensively include both the concept that includes the case of being equivalent to A and the concept that does not include the case of being equivalent to A. For example, "greater than A" is not limited to the case where being equivalent to A is excluded, and also includes "A or more." Likewise, "less than A" is not limited to the case where being equivalent to A is excluded, and also includes "A or less."
When implementing the present technology, specific settings etc. may be appropriately adopted from the concepts included in "greater than A" and "less than A" so that the effects described above are exhibited.
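As a non-limiting illustration, a comparison such as "the output volume is larger than a predetermined threshold" may be implemented with either interpretation, as sketched below. The function name should_generate_subtitle and the inclusive flag are hypothetical and are introduced only for this sketch.

```python
def should_generate_subtitle(output_volume: float, threshold: float,
                             inclusive: bool = False) -> bool:
    """Compare the output volume with a threshold.

    inclusive=False treats "larger than" strictly (volume > threshold);
    inclusive=True also accepts equality (volume >= threshold).
    """
    if inclusive:
        return output_volume >= threshold
    return output_volume > threshold
```

Which interpretation is adopted is a design choice that may be made so that the effects described above are exhibited.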

It is also possible to combine at least two of the feature parts according to the present technology described above. That is, the various characteristic portions described in each embodiment may be arbitrarily combined without distinction between each embodiment. Further, the various effects described above are merely examples and are not limited, and other effects may also be exhibited.

Note that the present technology can also adopt the following configuration.
(1)
An information processing device comprising a generation unit that generates subtitle object data that defines a three-dimensional subtitle object in a three-dimensional space, based on three-dimensional space description data that defines the configuration of the three-dimensional space, video object data that defines a three-dimensional video object in the three-dimensional space, and audio object data that defines a three-dimensional audio object in the three-dimensional space, each of which is included in three-dimensional space data used in rendering processing executed to express the three-dimensional space.
(2) The information processing device according to (1),
The subtitle object data includes a subtitle sentence and attribute information of the three-dimensional subtitle object.
(3) The information processing device according to (2),
The attribute information includes information on the position and orientation of the three-dimensional subtitle object in the three-dimensional space.
(4) The information processing device according to (3),
The attribute information includes information on at least one of the size, shape, color, transparency, display surface state, character size, character font, character color, character transparency, and effect of the three-dimensional subtitle object in the three-dimensional space.
(5) The information processing device according to any one of (2) to (4),
The audio object data includes audio information,
The generation unit generates the subtitle sentence by performing voice recognition on the audio information.
(6) The information processing device according to (5),
The generation unit generates the subtitle sentence by performing translation processing on the recognition result of the voice recognition.
(7) The information processing device according to any one of (2) to (6),
The three-dimensional space description data includes position information of the three-dimensional audio object in the three-dimensional space and position information of the three-dimensional video object in the three-dimensional space,
The generation unit determines the three-dimensional video object corresponding to the three-dimensional audio object based on the position information of the three-dimensional video object and the position information of the three-dimensional audio object, and generates the attribute information based on the determination result.
(8) The information processing device according to any one of (2) to (7),
The generation unit determines the three-dimensional video object corresponding to the subtitle sentence generated from the audio object data, based on position information of the three-dimensional video object and position information of the three-dimensional audio object, and generates the attribute information based on the determination result.
(9) The information processing device according to (8),
The generation unit determines, as the three-dimensional video object corresponding to the subtitle sentence, the three-dimensional video object that has uttered the content of the subtitle sentence.
(10) The information processing device according to (8) or (9),
The generation unit generates information on the position and orientation of the three-dimensional subtitle object based on a three-dimensional bounding box of the three-dimensional video object corresponding to the subtitle sentence.
(11) The information processing device according to any one of (2) to (10), further comprising:
a rendering unit that generates two-dimensional video data according to the user's visual field by performing rendering processing on the three-dimensional spatial data and the three-dimensional subtitle object based on visual field information regarding the user's visual field,
wherein the generation unit controls a display mode of the subtitle sentence in the two-dimensional video data by adjusting the attribute information.
(12) The information processing device according to (11),
The rendering unit performs pre-rendering processing on the three-dimensional spatial data and the three-dimensional subtitle object,
The generation unit adjusts the attribute information based on the result of the pre-rendering process.
(13) The information processing device according to (12),
The rendering unit executes pre-rendering processing to determine the occurrence of occlusion,
The generation unit adjusts the position and orientation of the three-dimensional subtitle object included in the attribute information based on the result of the pre-rendering process.
(14) The information processing device according to (12) or (13),
The rendering unit executes a pre-rendering process to determine the visibility of the subtitle text,
The generation unit adjusts at least one of the color, transparency, display surface state, font size, font color, and font transparency of the three-dimensional subtitle object included in the attribute information, based on the result of the pre-rendering process.
(15) The information processing device according to any one of (2) to (14),
The audio object data includes audio information,
The generation unit generates the subtitle object data based on the audio object data when the output volume of the audio information is larger than a predetermined threshold.
(16) The information processing device according to (15),
The three-dimensional space description data includes information on the predetermined threshold value.
(17) The information processing device according to any one of (2) to (16),
The generation unit determines whether to end displaying the subtitle text based on user's gaze point information.
(18)
An information processing method in which a computer system generates subtitle object data that defines a three-dimensional subtitle object in a three-dimensional space, based on three-dimensional space description data that defines the configuration of the three-dimensional space, video object data that defines a three-dimensional video object in the three-dimensional space, and audio object data that defines a three-dimensional audio object in the three-dimensional space, each of which is included in three-dimensional space data used in rendering processing executed to express the three-dimensional space.
(19)
An information processing system comprising a generation unit that generates subtitle object data that defines a three-dimensional subtitle object in a three-dimensional space, based on three-dimensional space description data that defines the configuration of the three-dimensional space, video object data that defines a three-dimensional video object in the three-dimensional space, and audio object data that defines a three-dimensional audio object in the three-dimensional space, each of which is included in three-dimensional space data used in rendering processing executed to express the three-dimensional space.

S... Virtual space 1... Virtual space providing system 2... Distribution server 3... HMD
4... Client device 6... User 8... Rendered video 10... Rendering unit 12... Subtitle object generation unit 14... Subtitle object 15... Person object 17... Audio object 18... Video object 19... Three-dimensional bounding box (BBox)
30... Rendering server 31... Remote communication system 60... Computer

Claims

An information processing device comprising a generation unit that generates subtitle object data that defines a three-dimensional subtitle object in a three-dimensional space, based on three-dimensional space description data that defines the configuration of the three-dimensional space, video object data that defines a three-dimensional video object in the three-dimensional space, and audio object data that defines a three-dimensional audio object in the three-dimensional space, each of which is included in three-dimensional space data used in rendering processing executed to express the three-dimensional space.
The information processing device according to claim 1,
The subtitle object data includes a subtitle sentence and attribute information of the three-dimensional subtitle object.
The information processing device according to claim 2,
The attribute information includes information on the position and orientation of the three-dimensional subtitle object in the three-dimensional space.
The information processing device according to claim 3,
The attribute information includes information on at least one of the size, shape, color, transparency, display surface state, character size, character font, character color, character transparency, and effect of the three-dimensional subtitle object in the three-dimensional space.
The information processing device according to claim 2,
The audio object data includes audio information,
The generation unit generates the subtitle sentence by performing voice recognition on the audio information.
The information processing device according to claim 5,
The generation unit generates the subtitle sentence by performing translation processing on the recognition result of the voice recognition.
The information processing device according to claim 2,
The three-dimensional space description data includes position information of the three-dimensional audio object in the three-dimensional space and position information of the three-dimensional video object in the three-dimensional space,
The generation unit determines the three-dimensional video object corresponding to the three-dimensional audio object based on the position information of the three-dimensional video object and the position information of the three-dimensional audio object, and generates the attribute information based on the determination result.
The information processing device according to claim 2,
The generation unit determines the three-dimensional video object corresponding to the subtitle sentence generated from the audio object data, based on position information of the three-dimensional video object and position information of the three-dimensional audio object, and generates the attribute information based on the determination result.
The information processing device according to claim 8,
The generation unit determines, as the three-dimensional video object corresponding to the subtitle sentence, the three-dimensional video object that has uttered the content of the subtitle sentence.
The information processing device according to claim 8,
The generation unit generates information on the position and orientation of the three-dimensional subtitle object based on a three-dimensional bounding box of the three-dimensional video object corresponding to the subtitle sentence.
The information processing device according to claim 2, further comprising:
a rendering unit that generates two-dimensional video data according to the user's visual field by performing rendering processing on the three-dimensional spatial data and the three-dimensional subtitle object based on visual field information regarding the user's visual field,
wherein the generation unit controls a display mode of the subtitle sentence in the two-dimensional video data by adjusting the attribute information.
The information processing device according to claim 11,
The rendering unit performs pre-rendering processing on the three-dimensional spatial data and the three-dimensional subtitle object,
The generation unit adjusts the attribute information based on the result of the pre-rendering process.
The information processing device according to claim 12,
wherein the rendering unit executes the pre-rendering processing to determine whether occlusion occurs, and
the generation unit adjusts the position and orientation of the three-dimensional subtitle object included in the attribute information based on the result of the pre-rendering processing.
The information processing device according to claim 12,
wherein the rendering unit executes the pre-rendering processing to determine the visibility of the subtitle sentence, and
the generation unit adjusts at least one of the color, transparency, display surface state, font size, font color, and font transparency of the three-dimensional subtitle object included in the attribute information, based on the result of the pre-rendering processing.
The information processing device according to claim 2,
wherein the audio object data includes audio information, and
the generation unit generates the subtitle object data based on the audio object data when the output volume of the audio information is larger than a predetermined threshold.
The information processing device according to claim 15,
wherein the three-dimensional space description data includes information on the predetermined threshold.
The information processing device according to claim 2,
wherein the generation unit determines whether to end display of the subtitle sentence based on gaze point information of a user.
An information processing method executed by a computer system, the method comprising generating subtitle object data that defines a three-dimensional subtitle object in a three-dimensional space, based on three-dimensional space description data that defines the configuration of the three-dimensional space, video object data that defines a three-dimensional video object in the three-dimensional space, and audio object data that defines a three-dimensional audio object in the three-dimensional space, each of which is included in three-dimensional space data used in rendering processing executed to represent the three-dimensional space.
An information processing system comprising a generation unit that generates subtitle object data that defines a three-dimensional subtitle object in a three-dimensional space, based on three-dimensional space description data that defines the configuration of the three-dimensional space, video object data that defines a three-dimensional video object in the three-dimensional space, and audio object data that defines a three-dimensional audio object in the three-dimensional space, each of which is included in three-dimensional space data used in rendering processing executed to represent the three-dimensional space.
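For illustration only, the following minimal sketch shows one way the position-based speaker determination, the volume threshold, and the bounding-box-based subtitle placement recited above could fit together. It is not taken from the specification. All names (`VideoObject`, `AudioObject`, `find_speaker`, `make_subtitle`) are hypothetical, the nearest-distance matching rule is an assumption, the subtitle sentence is assumed to have already been obtained from the audio information, and the placement rule (centered above the bounding box by a fixed `margin`) is likewise assumed.

```python
import math
from dataclasses import dataclass


@dataclass
class VideoObject:
    name: str
    position: tuple   # (x, y, z) reference position of the object
    bbox_min: tuple   # minimum corner of the 3D bounding box
    bbox_max: tuple   # maximum corner of the 3D bounding box


@dataclass
class AudioObject:
    position: tuple   # (x, y, z) position of the sound source
    volume: float     # output volume of the audio information
    text: str         # subtitle sentence (assumed already transcribed)


def find_speaker(audio, video_objects):
    """Assumed rule: the video object nearest to the audio object is the speaker."""
    return min(video_objects, key=lambda v: math.dist(v.position, audio.position))


def make_subtitle(audio, video_objects, volume_threshold, margin=0.25):
    """Return subtitle attributes, or None when the volume does not exceed the threshold."""
    if audio.volume <= volume_threshold:
        return None
    speaker = find_speaker(audio, video_objects)
    # Place the subtitle above the top face of the speaker's bounding box,
    # centered in x and z (assumed placement rule).
    cx = (speaker.bbox_min[0] + speaker.bbox_max[0]) / 2
    cz = (speaker.bbox_min[2] + speaker.bbox_max[2]) / 2
    top = speaker.bbox_max[1]
    return {
        "text": audio.text,
        "speaker": speaker.name,
        "position": (cx, top + margin, cz),
    }


objects = [
    VideoObject("alice", (0.0, 1.0, 0.0), (-0.5, 0.0, -0.5), (0.5, 2.0, 0.5)),
    VideoObject("bob", (5.0, 1.0, 0.0), (4.5, 0.0, -0.5), (5.5, 2.0, 0.5)),
]
voice = AudioObject(position=(4.8, 1.5, 0.0), volume=0.8, text="Hello")
print(make_subtitle(voice, objects, volume_threshold=0.5))
```

In this sketch the threshold is passed in as a plain argument; under the claims it could instead be read from the three-dimensional space description data. Orientation, occlusion handling, and visibility adjustments are omitted.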