WO2018047667A1 - Sound processing device and method - Google Patents

Sound processing device and method

Info

Publication number
WO2018047667A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
object sound
importance
information
processing
Prior art date
Application number
PCT/JP2017/030858
Other languages
French (fr)
Japanese (ja)
Inventor
由楽 池宮
光行 畠中
矢ケ崎 陽一
高林 和彦
富三 白石
Original Assignee
Sony Corporation (ソニー株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation (ソニー株式会社)
Publication of WO2018047667A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00: Circuits for transducers, loudspeakers or microphones
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 7/00: Indicating arrangements; control arrangements, e.g. balance control

Definitions

  • The present technology relates to an audio processing apparatus and method, and more particularly to an audio processing apparatus and method capable of reproducing content with a high sense of presence while keeping the amount of computation or transmission small.
  • This can be realized by recording the sound of each sound source present in a free viewpoint video as an object sound source and rendering it appropriately according to the viewing environment.
  • For example, binaural playback is performed when sound is reproduced through headphones, and the sound image can be localized at an appropriate position by processing such as wavefront synthesis when sound is reproduced through speakers.
  • A technique has also been proposed that reduces the amount of calculation by selecting, based on priority information of each audio channel or audio object, whether or not to decode its data (see, for example, Patent Document 1).
  • In such rendering, convolution with a filter corresponding to the direction of and the distance to the sound source is performed, so the amount of calculation of the rendering process increases.
  • the amount of calculation increases in proportion to the number of sound sources.
  • the amount of audio stream transmission increases when the number of sound sources is large.
  • The technique of selecting whether to perform decoding based on priority information can reduce the amount of computation, but since the data of audio channels or audio objects with low priority is not decoded, the audio of those channels or objects is not played at all. This may impair the sense of presence during content playback.
  • the present technology has been made in view of such a situation, and makes it possible to perform highly realistic content reproduction with a small amount of calculation or transmission.
  • An audio processing device includes a process selection unit that selects a process to be performed on the audio data of an object sound source based on one or more importance indices that are indices of the importance of the object sound source.
  • the process selection unit can select a process for reducing a calculation amount or a transmission amount as the process.
  • the process selection unit can select any one of a plurality of rendering processes having different calculation amounts as the process.
  • the process selection unit can select a process for integrating the audio data of a plurality of the object sound sources as the process.
  • the process selection unit can select a process for changing the reproduction bit rate of the audio data of the object sound source as the process.
  • The audio processing device may further include an importance index calculation unit that calculates the importance index based on meta information related to the audio data.
  • The importance index can be calculated based on at least one of the following items of the meta information: position information of the object sound source, position information of the viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of the space, and arrangement information of objects in the space.
  • the importance index calculation unit can calculate the distance between the object sound source and the viewer in space as the importance index.
  • The importance index calculation unit can use the importance information of the object sound source included in the meta information directly as the importance index.
  • the importance index calculation unit can calculate the distance between the two object sound sources in space as the importance index.
  • The importance index calculation unit can calculate, as the importance index, the angle difference between the direction of the object sound source as viewed from the viewer in the space and the direction of another object sound source as viewed from the viewer.
  • The process selection unit can determine, for each of a plurality of the processes, the number of object sound sources on which the process is performed, based on at least one of calculation specification information indicating the calculation processing capability of the processing unit performing the process and transmission rate specification information indicating the maximum transmission rate of the audio data.
  • An audio processing method includes an acquisition step of acquiring one or more importance indices that are indices of the importance of an object sound source, and a process selection step of selecting, based on the one or more importance indices, a process to be performed on the audio data of the object sound source.
  • That is, the process to be performed on the audio data of the object sound source is selected based on one or more importance indices of the object sound source.
  • highly realistic content reproduction can be performed with a small amount of calculation or transmission.
  • The present technology calculates an index of the importance of each object sound source using the meta information of the content and the like, determines from it whether the sound source is one that should be reproduced more strictly, and changes the handling of each object sound source accordingly, so that the content can be reproduced with a high sense of reality while reducing the amount of calculation and transmission.
  • the rendering process refers to the entire process of remapping the sound data of the object sound source to the sound data of the number of channels suitable for the reproduction environment in accordance with the number of channels and the reproduction conditions of the reproduction device.
  • an audio stream including audio data for each channel of each object sound source is input as an audio stream for reproducing the audio of content including video and accompanying audio.
  • the content is, for example, free viewpoint video content.
  • a video object on the content video corresponds to an audio object of content audio, that is, an object sound source.
  • the sound of the object sound source is assumed to be the sound of the video object.
  • rendering processing is performed based on the input audio stream, and an audio stream for reproducing the content audio is generated in the reproduction device, that is, the client device, and the content audio is reproduced on the reproduction device.
  • Hereinafter, the audio stream that is input is also referred to as the input audio stream, and the audio stream output to the playback device is also referred to as the output audio stream.
  • Hereinafter, processing for reducing the amount of calculation or transmission is also referred to as reduction processing. More specifically, in the reduction processing, depending on the object sound source, a selection result indicating that no process for reducing the amount of calculation or transmission is to be performed, that is, that normal processing is selected, may also be obtained.
  • processing TR1 to processing TR3 shown below is performed as the reduction processing for reducing the calculation amount and the transmission amount.
  • For each object sound source, it is determined whether or not it is an object sound source that should be reproduced more strictly, and the rendering process performed on that object sound source is selected according to the determination result.
  • the strict rendering process includes a process with a large amount of calculation such as a convolution process using a filter coefficient, and is a process that can localize a sound image with higher accuracy.
  • A rendering process that cannot localize a sound image with such high accuracy but requires only a small amount of computation will be referred to as a light rendering process.
  • the light rendering processing is VBAP (Vector Base Amplitude Panning) or the like.
  • Here, the description assumes two types of rendering processes with different computational complexity and sound image localization accuracy: the light rendering process and the strict rendering process.
  • three or more rendering processes may be defined as rendering processes having different calculation amounts and sound image localization accuracy.
  • any of light rendering processing, somewhat strict rendering processing, and strict rendering processing is selected for each object sound source as the rendering processing.
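  • As a concrete illustration of the light rendering process, the following is a minimal 2D VBAP sketch under the usual pairwise-panning formulation; the function names and the speaker setup are illustrative assumptions, not taken from the patent.

```python
# Hedged sketch of 2D VBAP (Vector Base Amplitude Panning).
import numpy as np

def vbap_2d(source_deg, spk1_deg, spk2_deg):
    """Gains for one source panned between two loudspeakers."""
    def unit(deg):
        rad = np.radians(deg)
        return np.array([np.cos(rad), np.sin(rad)])

    # Columns of L are unit vectors toward the two speakers; solve L @ g = p.
    L = np.column_stack([unit(spk1_deg), unit(spk2_deg)])
    g = np.linalg.solve(L, unit(source_deg))
    g = np.clip(g, 0.0, None)          # no negative (out-of-arc) gains
    return g / np.linalg.norm(g)       # constant-power normalization

# A source at 20 degrees panned between speakers at 0 and 45 degrees.
print(vbap_2d(20.0, 0.0, 45.0))
```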
  • For each object sound source, it is determined whether or not it is an object sound source that should be reproduced more strictly, and the reproduction bit rate of the audio data of that object sound source is changed according to the determination result.
  • For an object sound source determined to be one that should be reproduced more strictly, the output audio stream is generated using its audio data as is, without changing the reproduction bit rate.
  • a process for changing the playback bit rate is performed on an object sound source that is low in importance and has not been determined to be played more strictly. That is, audio data having a lower reproduction bit rate is generated based on the original audio data, and an output audio stream is generated using the obtained audio data.
  • As a method of generating audio data with a lower reproduction bit rate, there is, for example, a method of down-sampling the original audio data to obtain audio data with a lower sampling frequency.
  • The reproduction bit rate of the audio data of an object sound source is determined by the sampling frequency, the number of channels of the object sound source, and the number of quantization bits. The lower the reproduction bit rate, the smaller the amount of audio data, so not only the transmission bit rate of the output audio stream, that is, the transmission amount, but also the amount of computation when generating the output audio stream can be reduced.
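  • The bit rate relation above can be written directly; the following is a minimal sketch, assuming linear PCM audio (the function and variable names are illustrative assumptions).

```python
# Playback bit rate = sampling frequency x number of channels x quantization bits.
def playback_bitrate(sampling_hz: int, channels: int, quant_bits: int) -> int:
    return sampling_hz * channels * quant_bits

full = playback_bitrate(48_000, 1, 24)     # e.g. a 48 kHz / 24-bit mono object
reduced = playback_bitrate(24_000, 1, 16)  # after down-sampling and requantizing
print(full, reduced)                       # 1152000 384000 (bits per second)
```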
  • For example, for an object sound source close to the viewer, which should be played more strictly, the playback bit rate is not changed, while for an object sound source that is far from the viewer and is likely background sound, the playback bit rate can be lowered. In this way, highly realistic content audio can be obtained while reducing the amount of computation and the amount of transmission.
  • It is assumed here that the playback device side can play back audio data of different playback bit rates by some method.
  • Further, the sound data of two object sound sources can be added at a predetermined volume ratio to form one piece of sound data, whereby the two object sound sources are integrated into one object sound source placed at a predetermined position.
  • With this integration, what previously required rendering for two object sound sources becomes rendering for one object sound source, so the amount of calculation during the rendering process can be reduced. Also, since the audio data of the two object sound sources becomes one piece of audio data, the data amount can be reduced, and as a result the transmission amount of the output audio stream can be reduced.
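  • The following is a minimal illustrative sketch of such an integration, assuming mono object signals of equal length and a weighted-average representative position; the function and parameter names are assumptions, not taken from the patent.

```python
# Hedged sketch: integrating two object sound sources into one.
import numpy as np

def integrate_sources(audio_a, audio_b, pos_a, pos_b, weight_a=0.5):
    """Mix two mono object signals at a volume ratio (weight_a : 1 - weight_a)
    and place the result at a representative position, here the weighted
    average of the two source positions (one of the options described later)."""
    weight_b = 1.0 - weight_a
    mixed = weight_a * np.asarray(audio_a) + weight_b * np.asarray(audio_b)
    position = weight_a * np.asarray(pos_a) + weight_b * np.asarray(pos_b)
    return mixed, position
```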
  • In streaming distribution of free viewpoint video content, the server side must perform the distribution while performing rendering processing in real time according to the viewing environment, such as headphones or speakers, on the viewer side, that is, the client side.
  • Since the amount of computation and transmission can be reduced by the present technology, it becomes possible to perform the computation for generating an output audio stream for each of a large number of clients and to transmit the output audio stream to each of those clients. That is, streaming distribution can be performed for a large number of clients simultaneously.
  • For example, the meta information included in the video stream for reproducing the content video (hereinafter also referred to as video meta information) includes user position information indicating the position in the space of the user who is the viewer, and line-of-sight direction information indicating the user's line-of-sight direction.
  • The meta information included in the input audio stream of the content audio (hereinafter also referred to as audio meta information) includes space information indicating the size of the space and sound source position information indicating the position of each object sound source in the space.
  • the audio meta information may include importance level information indicating the importance level of each object sound source.
  • the viewer and the audio object move from moment to moment, so their positional relationship changes. Therefore, the object sound source important for the viewer changes depending on the positional relationship.
  • object sound sources that are close to the viewer should be played back strictly in order to maintain the localization to the correct position. That is, a strict rendering process should be performed.
  • For an object sound source located sufficiently far from the viewer, only the approximate direction needs to be known, so it may be reproduced with a light rendering process or the like.
  • In addition, calculation specification information indicating the calculation processing capability, that is, the calculation processing performance, of the calculation block that generates the output audio stream, for example the calculation block that performs the rendering process, is used as appropriate.
  • For example, the number of object sound sources to which the strict rendering process is assigned can be limited to 10 or fewer, with the light rendering process performed on the other object sound sources.
  • Further, transmission rate specification information indicating the maximum transmission rate, that is, the fastest possible transmission speed (transmission bit rate), is acquired as appropriate for each of the transmission side and the reception side, and reduction processing for reducing the amount of calculation and transmission can be performed using this transmission rate specification information.
  • For example, when the maximum transmission rate indicated by the transmission rate specification information is low, the process of changing the playback bit rate or the process of integrating object sound sources is performed so that the output audio stream can be transmitted at or below the maximum transmission rate. When the transmission-side and reception-side maximum transmission rates differ, the playback bit rate may be changed or object sound sources may be integrated so that communication is possible at the slower of the two maximum transmission rates.
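  • A minimal sketch of fitting the stream to the slower of the two maximum transmission rates, assuming the down-sampling approach described above (all names are illustrative assumptions):

```python
# Choose the highest candidate sampling frequency whose resulting bit rate
# does not exceed the slower of the transmission-side and reception-side
# maximum transmission rates.
def choose_sampling_hz(channels, quant_bits, max_tx_bps, max_rx_bps,
                       candidates=(48_000, 32_000, 24_000, 16_000)):
    limit = min(max_tx_bps, max_rx_bps)
    for hz in candidates:                      # highest quality first
        if hz * channels * quant_bits <= limit:
            return hz
    return candidates[-1]                      # fall back to the lowest rate
```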
  • Hereinafter, when it is not necessary to distinguish the transmission-side transmission rate specification information from the reception-side transmission rate specification information, both are simply referred to as transmission rate specification information. Further, when it is not necessary to distinguish the calculation specification information from the transmission rate specification information, they are simply referred to as specification information.
  • Priority information, that is, importance information indicating the importance, may be added to each object sound source at the time of content recording or when editing the content after recording (such information is hereinafter also referred to as recording/editing meta information).
  • For example, a high importance is added to the sound of an object that symbolizes the scene (place), that is, to that object sound source, and a low importance is added to the sound of an object that does not.
  • the importance level information of each object sound source may be included in the audio meta information of the input audio stream. Also, the importance level information may be determined for each frame of the content audio, or may be determined in units of a plurality of frames.
  • In the present technology, an importance index indicating the importance of each object sound source is calculated using at least one piece of the information included in the video meta information, the audio meta information, and the recording/editing meta information described above.
  • a reduction process is performed using the importance index of each object sound source, and the amount of calculation when generating the output audio stream and the transmission amount of the output audio stream are reduced.
  • calculation specification information, transmission speed specification information, and the like are also used as necessary.
  • the reduction processing for reducing the calculation amount and the transmission amount may be processing other than the processing described above.
  • FIG. 1 is a diagram illustrating a configuration example of an embodiment of a content reproduction system to which the present technology is applied.
  • the content reproduction system shown in FIG. 1 includes a server 11, a client device 12, and a meta information storage unit 13.
  • the server 11 includes a network server such as a cloud, and is connected to the client device 12 operated by the user via a wired or wireless network.
  • one client device 12 is connected to the server 11, but two or more client devices 12 may be connected to the server 11.
  • The server 11 is a device capable of performing arithmetic processing on analog or digital audio signals (audio data); it generates a suitable output audio stream by switching the rendering process for the input audio stream in real time for each object sound source, and performs streaming distribution of free viewpoint video content. That is, the server 11 performs streaming distribution, to the client device 12, of the audio of free viewpoint video content supplied from the outside or recorded in advance.
  • the server 11 generates an output audio stream based on the input audio stream and the meta information and specification information, and transmits the output audio stream to the client device 12. At that time, the server 11 appropriately acquires recording / editing meta information from the meta information storage unit 13.
  • the client device 12 receives the output audio stream from the server 11 and reproduces the content audio. At this time, the client device 12 reproduces a free-viewpoint video content composed of video and audio by also playing a content video based on a video stream acquired from the server 11 or another server.
  • the server 11 includes an importance index calculation unit 21, a process selection unit 22, a rendering processing unit 23, and an audio stream transmission unit 24.
  • the client device 12 includes an audio stream receiving unit 31 and a free viewpoint video reproduction unit 32.
  • the importance index calculation unit 21 acquires meta information and spec information as necessary, calculates the importance index based on the meta information, and sends the obtained importance index and spec information to the process selection unit 22. Supply.
  • This importance index is an index indicating the importance of each object sound source.
  • For example, the importance index calculation unit 21 acquires (extracts) audio meta information from the input audio stream, acquires recording/editing meta information from the meta information storage unit 13, and acquires video meta information from the free viewpoint video reproduction unit 32 of the client device 12.
  • Further, the importance index calculation unit 21 acquires calculation specification information from the rendering processing unit 23, acquires the transmission-side transmission rate specification information from the audio stream transmission unit 24, and acquires the reception-side transmission rate specification information from the audio stream receiving unit 31.
  • the importance index calculation unit 21 supplies video meta information to the rendering processing unit 23 via the processing selection unit 22 as necessary.
  • the process selection unit 22 acquires the input audio stream and performs a reduction process for reducing the calculation amount and the transmission amount based on the importance index and the specification information supplied from the importance index calculation unit 21. Further, the process selection unit 22 supplies the processing result of the reduction process and the input audio stream to the rendering processing unit 23.
  • By the reduction processing, for example, a selection result (determination result) of whether to perform the light rendering process or the strict rendering process for each object sound source, a result of selecting which object sound sources have their reproduction bit rate changed, a result of selecting which object sound sources are integrated into one, and the like are obtained. That is, as the result of the reduction processing, a selection of what kind of process is performed on each object sound source is obtained.
  • The specification information is used, for example, to determine, for each of a plurality of processes including processes for reducing the amount of computation and transmission, the number of object sound sources on which that process is performed.
  • the rendering processing unit 23 performs rendering processing based on the result of the reduction processing supplied from the processing selection unit 22 and the input audio stream, and supplies the output audio stream obtained as a result to the audio stream transmission unit 24.
  • At this time, the rendering processing unit 23 performs the rendering process using, as appropriate, the video meta information supplied from the importance index calculation unit 21 via the process selection unit 22 or the audio meta information included in the input audio stream supplied from the process selection unit 22.
  • the rendering processing unit 23 appropriately acquires information regarding the reproduction environment of the client device 12 such as how many channels of the speaker system the client device 12 has. Then, the rendering processing unit 23 generates an output audio stream composed of audio data of each channel that can be reproduced by the client device 12 according to the reproduction environment. Furthermore, the rendering processing unit 23 also appropriately performs processing for changing the reproduction bit rate, processing for integrating object sound sources, and the like based on the result of the reduction processing.
  • the audio stream transmission unit 24 transmits the output audio stream supplied from the rendering processing unit 23 to the client device 12 via the network.
  • the audio stream receiving unit 31 of the client device 12 receives the output audio stream transmitted by the audio stream transmitting unit 24 of the server 11 and supplies it to the free viewpoint video reproduction unit 32.
  • the audio stream receiving unit 31 appropriately supplies the transmission rate specification information on the receiving side to the importance index calculating unit 21 in response to a request from the server 11.
  • The free viewpoint video reproduction unit 32 includes, for example, a sound reproduction device such as headphones or a speaker system and a device that drives the sound reproduction device, and reproduces the content audio based on the output audio stream supplied from the audio stream receiving unit 31.
  • the free viewpoint video playback unit 32 also has a display device and the like, and plays back content video based on a video stream acquired from the outside. Furthermore, the free viewpoint video reproduction unit 32 appropriately extracts video meta information from the video stream in response to a request from the server 11 and supplies the video meta information to the importance index calculation unit 21.
  • video meta information including user position information and line-of-sight direction information is extracted from a video stream.
  • However, any method of acquiring the user position information and line-of-sight direction information may be used.
  • the client device 12 may acquire user position information and line-of-sight direction information from another external device and supply the user position information and line-of-sight direction information to the importance index calculation unit 21.
  • the client device 12 may be provided with a gyro sensor that detects the user's head direction, an image sensor that captures the user, and the like so as to obtain user position information and line-of-sight direction information.
  • In such a case, the user's face direction may be specified from the output of the gyro sensor and used as the user's line-of-sight direction, or the user's line-of-sight direction or position in the space may be detected from the image obtained by the image sensor.
  • Further, the importance index calculation unit 21 may use space area information included in the video meta information of the video stream, or may use the position information and importance information of a video object included in the video meta information as the sound source position information or importance information of the object sound source corresponding to that video object.
  • the server 11 and the client device 12 are connected via a network.
  • the importance index calculation unit 21 to the audio stream transmission unit 24, the audio stream reception unit 31, and the free viewpoint video reproduction unit 32 may be provided in one apparatus.
  • Alternatively, a device provided with the importance index calculation unit 21 through the audio stream transmission unit 24 and a device provided with the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be connected by wire such as a cable.
  • a case where a free-viewpoint video content stored in a personal computer at the user's home is played on a head-mounted display can be considered.
  • In that case, the importance index calculation unit 21 through the audio stream transmission unit 24 may be provided in the personal computer, and the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be provided in the head-mounted display connected to the personal computer.
  • Alternatively, a stationary game machine main body may be configured to include the importance index calculation unit 21 through the audio stream transmission unit 24 as well as the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32.
  • Alternatively, the importance index calculation unit 21 through the audio stream transmission unit 24 may be provided in the stationary game machine main body, and the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be provided in an external device connected to the game machine main body by wire or wirelessly.
  • Suppose, for example, that the audio playback specification of the client device 12, that is, the content audio playback environment, is playback using headphones, in other words that the free viewpoint video reproduction unit 32 includes headphones.
  • In such a case, binaural playback processing, which convolves a head-related transfer function (HRTF: Head Related Transfer Function) with the sound data of the object sound source, is performed as the strict rendering process.
  • a head-related transfer function is prepared in advance for each relative positional relationship between the viewer and the object sound source in the space. Also, a head-related transfer function corresponding to the relative positional relationship between the position of the object sound source indicated by the sound source position information and the position of the viewer indicated by the user position information is selected from those head-related transfer functions.
  • Then, a convolution process is performed to convolve the selected head-related transfer function with the sound data of the object sound source, and sound data of the left and right channels in which the sound image of the object sound source is localized at the desired position is generated.
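  • A minimal sketch of this convolution, assuming time-domain head-related impulse responses (HRIRs) corresponding to the selected HRTF; the names and array shapes are illustrative assumptions:

```python
# Hedged sketch: binaural rendering of one object by HRIR convolution.
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Convolve the object's mono signal with the left/right HRIRs selected
    for the viewer-to-source relative position."""
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)
```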
  • On the other hand, a panning process that localizes the sound image by changing the volume ratio of the left and right sounds of the object sound source, based on the position and line-of-sight direction of the viewer in the space and the position of the object sound source, is performed as the light rendering process. That is, by applying the desired volume ratio to the left and right channels, sound data of the left and right channels that localizes the sound image of the object sound source at the desired position is generated.
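  • As one common way to realize such volume-ratio panning, a constant-power stereo panner can be used; the following minimal sketch is an assumption for illustration, not the patent's specific formula:

```python
# Hedged sketch: constant-power left/right panning of a mono object signal.
import numpy as np

def pan_stereo(mono, pan):
    """pan in [-1, 1]: -1 = full left, 0 = center, +1 = full right."""
    theta = (pan + 1.0) * np.pi / 4.0                  # map pan to [0, pi/2]
    return np.cos(theta) * mono, np.sin(theta) * mono  # (left, right)
```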
  • Next, suppose that the audio playback specification of the client device 12 is playback using multi-channel speakers, that is, that the free viewpoint video reproduction unit 32 includes multi-channel speakers. In such a case, processing for generating the sound data of each speaker is performed as the strict rendering process, as follows.
  • a filter coefficient determined from the positional relationship between the position of the object sound source indicated by the sound source position information and the position of the viewer indicated by the user position information is selected for each speaker (channel). Then, a convolution process for convolving the selected filter coefficient and the sound data of the object sound source is performed, and sound data of each channel of the sound of the object sound source in which the sound image is localized at a desired position is generated.
  • Further, when the free viewpoint video reproduction unit 32 is composed of an annular speaker array, for example, a process of generating the audio data of each channel for reproducing the sound of the object sound source by HOA (Higher Order Ambisonics) is performed as the strict rendering process. In HOA, the sound data of each channel is generated by calculation in the spherical harmonic domain.
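  • As one concrete instance of calculation in the spherical harmonic domain, a mono object can be encoded into first-order ambisonics; the following minimal sketch uses the FuMa B-format convention as an illustrative assumption (the patent does not specify an order or convention):

```python
# Hedged sketch: first-order ambisonic (B-format) encoding of a mono object.
import numpy as np

def foa_encode(mono, azimuth_rad, elevation_rad):
    w = mono / np.sqrt(2.0)                                 # omnidirectional
    x = mono * np.cos(azimuth_rad) * np.cos(elevation_rad)  # front-back
    y = mono * np.sin(azimuth_rad) * np.cos(elevation_rad)  # left-right
    z = mono * np.sin(elevation_rad)                        # up-down
    return np.stack([w, x, y, z])
```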
  • a process of generating audio data of each channel for reproducing the sound of the object sound source by VBAP is performed as a light rendering process.
  • In step S11, the importance index calculation unit 21 determines whether or not there is a request for reducing the amount of calculation or transmission.
  • For example, the importance index calculation unit 21 determines that there is a reduction request when the client device 12 requests reduction of the calculation amount or transmission amount of the output audio stream.
  • If it is determined in step S11 that there is no reduction request, the processing of steps S12 to S15 is skipped, and the processing proceeds to step S16.
  • In this case, the process selection unit 22 supplies the input audio stream to the rendering processing unit 23, and also supplies to the rendering processing unit 23 a selection result indicating that the strict rendering process has been selected for all object sound sources. In addition, video meta information and the like are acquired as necessary and supplied from the importance index calculation unit 21 to the rendering processing unit 23 via the process selection unit 22.
  • On the other hand, when it is determined in step S11 that there is a reduction request, in step S12 the importance index calculation unit 21 acquires meta information and specification information.
  • That is, as meta information regarding the free viewpoint video content, that is, the audio data of each object sound source, the importance index calculation unit 21 acquires audio meta information from the input audio stream, recording/editing meta information from the meta information storage unit 13, and video meta information from the free viewpoint video reproduction unit 32 of the client device 12.
  • The importance index calculation unit 21 also acquires calculation specification information from the rendering processing unit 23, the transmission-side transmission rate specification information from the audio stream transmission unit 24, and the reception-side transmission rate specification information from the audio stream receiving unit 31.
  • As the meta information and specification information, information acquired sequentially in real time at a predetermined interval, such as per frame or per several frames, may be used, or information acquired in advance may be used continuously.
  • In step S13, the importance index calculation unit 21 determines whether all necessary meta information and specification information have been acquired.
  • If it is determined in step S13 that the necessary meta information and specification information have not yet been acquired, the process returns to step S12 and the above-described processing is repeated.
  • On the other hand, if it is determined in step S13 that the necessary meta information and specification information have been acquired, the process proceeds to step S14.
  • In step S14, the importance index calculation unit 21 calculates the importance index of each object sound source based on the acquired meta information, and supplies the obtained importance index and the specification information to the process selection unit 22. The importance index calculation unit 21 also supplies video meta information to the rendering processing unit 23 via the process selection unit 22 as necessary.
  • For example, the importance index calculation unit 21 calculates, as the importance index, the distance from the viewer's position to the position of the object sound source from the sound source position information included in the audio meta information and the user position information included in the video meta information, or uses the importance information included in the recording/editing meta information directly as the importance index.
  • In step S15, the process selection unit 22 performs the reduction processing based on the importance index of each object sound source and the specification information supplied from the importance index calculation unit 21, thereby selecting (determining) the process to be performed on each object sound source.
  • That is, the process selection unit 22 acquires the importance index and the specification information from the importance index calculation unit 21 and, for example, selects for each object sound source whether to perform the light rendering process or the strict rendering process, whether to perform the process of changing the reproduction bit rate of the object sound source, and whether to integrate object sound sources.
  • the processing selection unit 22 supplies the selection result and the input audio stream to the rendering processing unit 23.
  • a light rendering process, a process for changing a playback bit rate, and a process for integrating object sound sources are processes for reducing the amount of computation and transmission, and a strict rendering process is a normal process.
  • Note that the specification information is used in the reduction processing as necessary; it is not necessarily always used.
  • When the process has been selected in step S15, or when it is determined in step S11 that there is no reduction request, the process of step S16 is performed.
  • In step S16, the rendering processing unit 23 performs the rendering process based on the process selection result supplied from the process selection unit 22 and the input audio stream, and generates an output audio stream.
  • For example, for an object sound source for which the light rendering process has been selected, the rendering processing unit 23 performs processing such as panning or VBAP based on the sound data of the object sound source included in the input audio stream, according to the playback environment of the free viewpoint video reproduction unit 32.
  • For an object sound source for which the strict rendering process has been selected, the rendering processing unit 23 performs processing such as binaural playback, wavefront synthesis, or HOA based on the sound data of the object sound source included in the input audio stream, according to the playback environment of the free viewpoint video reproduction unit 32.
  • Further, for object sound sources for which the integration process has been selected, the rendering processing unit 23 integrates the sound data of those object sound sources before the rendering process by adding them at a predetermined volume ratio to form one piece of sound data.
  • The volume ratio when adding the audio data, that is, the weight by which each piece of audio data is multiplied, is determined based on, for example, the spatial positional relationship between each of the object sound sources and the viewer.
  • the audio data addition process is performed for each channel.
  • The position of the integrated object sound source may be the coordinate position indicated by the average of the coordinates of the positions of the integrated object sound sources, or the position indicated by a representative value of the positions of those object sound sources. The representative value may be the coordinates of the position of one of the object sound sources to be integrated, or coordinates calculated by weighted addition of the coordinates of the positions of the plurality of object sound sources.
  • Further, for an object sound source for which the process of changing the reproduction bit rate has been selected, the rendering processing unit 23 generates audio data with a changed reproduction bit rate by performing, on the sound data of the object sound source before or after the rendering process, down-sampling that converts the sampling frequency or a conversion process that changes the number of quantization bits.
  • When the sound data of each channel for each object sound source has been obtained by the above processing, the rendering processing unit 23 adds the sound data of the same channel of each object sound source to form one piece of sound data, thereby generating the audio data of each channel for reproducing the content audio. The rendering processing unit 23 generates an output audio stream storing the audio data of each channel of the content audio obtained in this way, and supplies it to the audio stream transmission unit 24. Note that the output audio stream may instead store the audio data of each object sound source.
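  • A minimal sketch of this per-channel mixdown, assuming every object has already been rendered to the same channel layout (the names and array shapes are illustrative assumptions):

```python
# Hedged sketch: sum the same channel of every rendered object sound source.
import numpy as np

def mixdown(rendered_sources):
    """rendered_sources: list of arrays shaped (channels, samples), one per
    object sound source. Returns the content audio, shaped (channels, samples)."""
    return np.sum(np.stack(rendered_sources), axis=0)
```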
  • In step S17, the audio stream transmission unit 24 transmits the output audio stream supplied from the rendering processing unit 23 to the client device 12. Thereafter, the process returns to step S11, and the above-described processing is repeated until the streaming distribution of the content audio ends.
  • the server 11 acquires meta information and specification information as necessary, calculates an importance index, and selects processing to be performed on the object sound source based on the obtained importance index. Then, the server 11 performs a rendering process or the like according to the processing selection result, and generates an output audio stream.
  • When the output audio stream is transmitted from the server 11, the client device 12 performs playback processing and the content audio is reproduced.
  • In step S41, the audio stream receiving unit 31 acquires the output audio stream and supplies it to the free viewpoint video reproduction unit 32.
  • the audio stream receiving unit 31 acquires the output audio stream by receiving the output audio stream transmitted by the audio stream transmitting unit 24 of the server 11 at regular intervals.
  • In step S42, the audio stream receiving unit 31 determines whether or not the output audio stream necessary for reproduction has been acquired. If it is determined in step S42 that the output audio stream has not yet been acquired, the process returns to step S41 and the above-described processing is repeated.
  • On the other hand, when it is determined in step S42 that the output audio stream has been acquired, in step S43 the free viewpoint video reproduction unit 32 reproduces the content audio based on the output audio stream supplied from the audio stream receiving unit 31.
  • the free viewpoint video playback unit 32 plays back the content video based on the video stream acquired from the outside, thereby playing back the free viewpoint video content including the content video and the content audio.
  • In step S44, the free viewpoint video reproduction unit 32 determines whether or not the supply of video meta information has been requested by the importance index calculation unit 21 of the server 11. For example, when the supply of video meta information is requested by the importance index calculation unit 21 in step S12 of the transmission process of FIG. 2, it is determined that the supply of video meta information has been requested.
  • If it is determined in step S44 that the supply of video meta information has not been requested, the process returns to step S41, and the above-described processing is repeated until the reproduction of the free viewpoint video content ends.
  • On the other hand, if it is determined in step S44 that the supply of video meta information has been requested, in step S45 the free viewpoint video reproduction unit 32 extracts the video meta information from the video stream and supplies it to the importance index calculation unit 21.
  • the process returns to step S41, and the above-described process is repeated.
  • the video meta information may be supplied in real time at the time of content audio reproduction, or may be supplied in advance.
  • the client device 12 acquires the output audio stream from the server 11 to reproduce the content sound, and outputs video meta information in response to a request from the server 11. As a result, it is possible to reduce the amount of computation at the time of generating the output audio stream on the server 11 side and the transmission amount of the output audio stream, and it is possible to perform content reproduction with high presence.
  • Example 1 of reduction processing: In the transmission process described with reference to FIG. 2, the importance index of each object sound source is obtained based on the acquired meta information, and the process to be performed on each object sound source is then selected based on the importance index and the specification information.
  • the importance index used does not have to be one for each object sound source, and a plurality of different importance indices may be obtained for each object sound source and used for processing.
  • In step S71, the importance index calculation unit 21 acquires user position information and sound source position information as the information necessary for calculating the importance index.
  • That is, the importance index calculation unit 21 acquires the user position information included in the video meta information by acquiring the video meta information from the free viewpoint video reproduction unit 32, and acquires the sound source position information included in the audio meta information by acquiring the audio meta information from the input audio stream. This processing of step S71 corresponds to the processing of steps S12 and S13 in FIG. 2.
  • In step S72, the importance index calculation unit 21 calculates, as the importance index, the distance in the space between the viewer and the object sound source to be processed.
  • That is, the importance index calculation unit 21 calculates the distance |VO - VL| from the coordinates of the viewer's position in the space indicated by the user position information, that is, the vector VL representing the position of the viewer, and the coordinates of the position of the object sound source in the space indicated by the sound source position information, that is, the vector VO representing the position of the object sound source. Here, |VO - VL| denotes the magnitude of the vector (VO - VL), that is, the distance from the viewer to the object sound source in the space.
  • the importance index calculation unit 21 supplies the calculated importance index to the process selection unit 22.
  • This processing of step S72 corresponds to the processing of step S14 in FIG. 2.
  • In step S73, the process selection unit 22 determines whether or not the distance between the viewer and the object sound source, which is the importance index supplied from the importance index calculation unit 21, is equal to or greater than a predetermined threshold th. That is, the process selection unit 22 determines whether or not the distance |VO - VL| satisfies the following expression (1): |VO - VL| < th ... (1)
  • If it is determined in step S73 that the distance is equal to or greater than the threshold th, that is, if the relationship of expression (1) does not hold, the process proceeds to step S74.
  • In step S74, the process selection unit 22 selects the light rendering process as the process to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • In this way, the amount of calculation at the time of rendering is reduced by performing the light rendering process on a processing-target object sound source of low importance.
  • On the other hand, if it is determined in step S73 that the distance is not equal to or greater than the threshold th, that is, if the relationship of expression (1) holds, the process proceeds to step S75.
  • In step S75, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • In this way, the content audio can be reproduced with a high sense of realism by performing the strict rendering process on a processing-target object sound source of high importance.
  • The processes of steps S73 to S75 described above correspond to the process of step S15 in FIG. 2.
  • As described above, the distance from the viewer to the object sound source is used as the importance index, with a shorter distance meaning higher importance; the strict rendering process is selected for object sound sources within a certain distance from the viewer, while the light rendering process is selected for the other object sound sources, reducing the amount of calculation.
  • the server 11 calculates the distance between the viewer and the object sound source as the importance index, and selects a rendering process performed on the object sound source according to the distance. This makes it possible to obtain highly realistic content audio while reducing the total amount of computation during rendering.
  • what kind of rendering processing is performed for each object sound source is selected based on whether or not the distance to the object sound source is equal to or greater than the threshold th.
  • Alternatively, a predetermined number of object sound sources may be selected in ascending order of distance from the viewer, with the strict rendering process performed on the selected object sound sources and the light rendering process performed on the others.
  • the number of object sound sources that perform strict rendering processing may be determined based on, for example, calculation specification information.
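  • A minimal sketch of both selection strategies described above (the threshold test of expression (1) and the top-N selection by distance); the "strict"/"light" labels and all names are illustrative assumptions:

```python
# Hedged sketch: choose a rendering process per object sound source.
import numpy as np

def select_by_threshold(distances, th):
    # Expression (1): strict rendering while |VO - VL| < th.
    return ["strict" if d < th else "light" for d in distances]

def select_top_n(distances, n_strict):
    order = np.argsort(distances)       # closer = more important
    procs = ["light"] * len(distances)
    for idx in order[:n_strict]:
        procs[idx] = "strict"
    return procs

print(select_by_threshold([1.2, 8.0, 3.5], th=4.0))  # ['strict', 'light', 'strict']
print(select_top_n([1.2, 8.0, 3.5], n_strict=1))     # ['strict', 'light', 'light']
```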
  • As other meta information used in calculating the importance index, it is conceivable to use, for example, space size information, line-of-sight direction information, spread information, acoustic characteristic information indicating acoustic characteristics such as reverberation in the space, and arrangement information indicating the arrangement positions, that is, the positional relationships, of object sound sources and other objects in the space.
  • For example, when line-of-sight direction information is used, the light rendering process may be performed on an object sound source that is determined not to be visible to the viewer from the viewer's gaze direction specified by the line-of-sight direction information, the viewer's position, and the position of the object sound source.
  • That is, an object sound source that is visible to the viewer is treated as more important, and the light rendering process can be selected for an object sound source that is not visible. Accordingly, the content audio can be reproduced with a high sense of reality while giving priority to the object sound sources visible to the viewer and reducing the total amount of calculation.
  • In this case, the importance index calculation unit 21 calculates the distance from the viewer to the object sound source as one importance index, and calculates information indicating whether or not the object sound source is visible to the viewer as another importance index. The process selection unit 22 then selects the process to be performed on the object sound source based on the two importance indices calculated for that object sound source.
  • the audio meta information may include spread information indicating the extent of the sound image of the object sound source, that is, the size of the object in space.
  • the spread information is used, for example, to make the sound image of the object sound source wide when performing VBAP as a rendering process.
  • the importance index calculation unit 21 supplies the spread information extracted from the audio meta information as it is to the process selection unit 22 as one importance index.
  • When acoustic characteristic information indicating the acoustic characteristics of the space is available, for example when it is stored in a reserved area of the audio meta information, the process for each object sound source may be selected based on the degree of reverberation of the space indicated by the acoustic characteristic information.
  • the processing selection unit 22 may determine the number of object sound sources that perform strict rendering processing and the number of object sound sources that perform light rendering processing based on the degree of reverberation indicated by the acoustic characteristic information.
  • In this case, the importance index calculation unit 21 supplies the acoustic characteristic information to the process selection unit 22 together with the importance index, and the process selection unit 22, for example, increases the number of object sound sources on which the light rendering process is performed as the degree of reverberation indicated by the acoustic characteristic information becomes higher.
  • Further, when the video meta information includes arrangement information indicating the arrangement positions of object sound sources and other objects in the space, the process to be performed on each object sound source may be selected based on that arrangement information.
  • For example, the light rendering process may be performed on an object sound source that is hidden behind another object and thus not visible to the viewer.
  • In this case, the importance index calculation unit 21 calculates the distance from the viewer to the object sound source as one importance index and, based on the user position information, the sound source position information, and the arrangement information, calculates information indicating whether or not the object sound source is visible to the viewer as another importance index. The process selection unit 22 then selects the process to be performed on the object sound source based on the two importance indices calculated for that object sound source.
  • In step S102, the distance from the viewer to the object sound source is calculated as the importance index, and the importance index and the specification information are supplied from the importance index calculation unit 21 to the process selection unit 22.
  • In step S103, the process selection unit 22 determines, based on the specification information supplied from the importance index calculation unit 21, the number of object sound sources to be subjected to the strict rendering process. In other words, the number of object sound sources to which the strict rendering process is assigned, that is, for which the strict rendering process is selected (hereinafter referred to as the number of assigned sound sources), and the number of object sound sources to which the light rendering process is assigned are determined.
  • For example, the number of assigned sound sources is determined so that the calculation capacity required for the various processing calculations, such as the rendering processes for all object sound sources, does not exceed the calculation processing capability indicated by the calculation specification information.
  • the number of assigned sound sources may be determined based on the specification information, may be a predetermined number, or may be a number specified from the outside.
  • When the process of changing the playback bit rate or the process of integrating object sound sources is selected as the process for reducing the amount of computation and transmission, the number of object sound sources on which those processes are performed may be determined based on, for example, at least one of the calculation specification information and the transmission rate specification information. In this case, for example, the number of target object sound sources is determined so that the transmission rate necessary for transmitting the resulting output audio stream does not exceed the maximum transmission rate indicated by the transmission rate specification information.
  • In step S104, the process selection unit 22 selects one object sound source as the object sound source to be processed, and determines whether or not its rank is within the number of assigned sound sources determined in step S103.
  • For example, the process selection unit 22 ranks the object sound sources so that an object sound source whose importance index indicates higher importance is ranked higher. Then, when the number of assigned sound sources is AS and the rank of the object sound source to be processed is within the top AS overall, the process selection unit 22 determines that the object sound source to be processed is within the number of assigned sound sources.
  • If it is determined in step S104 that the rank is within the number of assigned sound sources, the process proceeds to step S105.
  • In step S105, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed, and supplies the selection result to the rendering processing unit 23. Thereafter, the process proceeds to step S107.
  • step S104 determines whether it is higher than the number of assigned sound sources, that is, if the rank of the object sound source to be processed is lower than the AS position. If it is determined in step S104 that it is not higher than the number of assigned sound sources, that is, if the rank of the object sound source to be processed is lower than the AS position, the process proceeds to step S106.
  • step S106 the process selection unit 22 selects a light rendering process as the process performed on the object sound source to be processed, supplies the selection result to the rendering process unit 23, and then the process proceeds to step S107. .
  • step S105 or step S106 when a rendering process to be performed on the object sound source to be processed is selected, in step S107, the process selection unit 22 determines whether or not processing has been selected for all object sound sources.
  • step S107 If it is determined in step S107 that processing has not yet been selected for all object sound sources, the processing returns to step S104, and the above-described processing is repeated. That is, an object sound source that has not yet been processed is set as a new object sound source to be processed, and a rendering process to be performed on the object sound source is selected.
  • step S107 if it is determined in step S107 that processing has been selected for all object sound sources, the reduction processing ends.
  • the processes in steps S103 to S107 described above correspond to the process in step S15 in FIG.
  • the server 11 determines the number of assigned sound sources based on the specification information, and is performed for each object sound source so that strict rendering processing is performed on the object sound sources for the number of assigned sound sources. Select a rendering process.
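The ranked assignment of steps S103 to S107 can be illustrated with a short sketch; here the distance to the viewer stands in for the importance index, and the dictionary layout of each source is a hypothetical choice made for the example:

```python
import math

def select_rendering(sources, viewer_pos, num_assigned):
    """Assign the strict rendering process to the top-ranked sources.

    Each source is a dict with an "id" and a "position" (x, y, z) tuple;
    a smaller distance to the viewer is treated as a higher importance.
    """
    ranked = sorted(sources, key=lambda s: math.dist(s["position"], viewer_pos))
    return {
        src["id"]: ("strict" if rank < num_assigned else "light")
        for rank, src in enumerate(ranked)
    }
```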
  • In step S131, the importance index calculation unit 21 acquires importance information as meta information for the object sound source to be processed.
  • For example, the importance index calculation unit 21 acquires the importance information from the audio meta information of the input audio stream, or acquires the importance information from the recording/editing meta information supplied from the meta information storage unit 13.
  • The importance index calculation unit 21 then supplies the acquired importance information as-is to the process selection unit 22 as the importance index.
  • The processing in step S131 corresponds to the processing in steps S12 to S14 in FIG.
  • Note that the importance information may be provided for each object sound source in units of one frame or a plurality of frames, or one piece of importance information may be used in common for all frames.
  • In step S132, the process selection unit 22 determines whether there is importance information for the object sound source to be processed. For example, the process selection unit 22 determines that there is importance information when the importance information of the object sound source to be processed has been supplied as the importance index from the importance index calculation unit 21.
  • If it is determined in step S132 that there is no importance information, the process proceeds to step S135; if it is determined that there is importance information, the process proceeds to step S133.
  • In step S133, the process selection unit 22 determines whether the importance indicated by the importance information of the object sound source to be processed, supplied from the importance index calculation unit 21, is equal to or higher than a predetermined threshold.
  • Note that the threshold used in step S133 may be a predetermined value, or may be a value determined based on the specification information or the like.
  • If it is determined in step S133 that the importance is not equal to or greater than the threshold, the process proceeds to step S135.
  • On the other hand, if it is determined in step S133 that the importance is equal to or greater than the threshold, the process proceeds to step S134.
  • In step S134, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed and supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • An object sound source whose importance is equal to or higher than the threshold is an important object sound source that should be reproduced strictly, and therefore the strict rendering process is selected.
  • If it is determined in step S132 that there is no importance information, or if it is determined in step S133 that the importance is not equal to or greater than the threshold, the process of step S135 is performed.
  • In step S135, the process selection unit 22 selects the light rendering process as the process to be performed on the object sound source to be processed and supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • That is, the light rendering process is selected for an object sound source that has no importance information to be used as the importance index, or whose importance indicated by the importance information is low, and the amount of computation at rendering time is thereby reduced.
  • The processes in steps S132 to S135 described above correspond to the process in step S15 in FIG.
  • In this example, the strict rendering process is performed only on an object sound source that has importance information and whose importance indicated by that importance information is equal to or greater than a certain value.
  • In this way, the server 11 selects the process to be performed on each object sound source using the importance information as the importance index.
  • By using the importance information in this way, the amount of computation can be reduced appropriately, and content can be reproduced with a high sense of presence even with a small amount of computation.
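A minimal sketch of this threshold test (steps S132 to S135) follows; representing missing importance information as None is an assumption made only for the example:

```python
def select_by_importance(importance_info, threshold):
    """Select a rendering process from per-source importance information.

    importance_info maps a source id to its importance value, or to None
    when no importance information exists for that source.
    """
    return {
        src_id: ("strict" if imp is not None and imp >= threshold else "light")
        for src_id, imp in importance_info.items()
    }
```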
  • Example 4 of reduction processing: Furthermore, when a large number of object sound sources exist in the space, if two or more object sound sources can be integrated and handled as one object sound source by a simple process, the amount of computation for rendering and other processing can be reduced. Moreover, integrating object sound sources reduces the data amount of the output audio stream, so the transmission amount can also be reduced.
  • For example, two or more object sound sources may be integrated into one object sound source based on the importance index.
  • The reduction processing performed by the server 11 in such a case will be described with reference to the flowchart of FIG.
  • In step S161, the importance index calculation unit 21 acquires the user position information and the sound source position information of the two object sound sources to be processed.
  • For example, the importance index calculation unit 21 acquires the user position information from the video meta information and acquires the sound source position information of the two object sound sources to be processed from the audio meta information.
  • The processing in step S161 corresponds to the processing in steps S12 and S13 in FIG.
  • In step S162, the importance index calculation unit 21 calculates the distance between the object sound sources in the space based on the sound source position information of the two object sound sources to be processed acquired in step S161.
  • In step S163, the importance index calculation unit 21 calculates, based on the user position information and the sound source position information acquired in step S161, the angle difference between the direction of one object sound source to be processed as viewed from the viewer and the direction of the other object sound source to be processed.
  • For example, the importance index calculation unit 21 calculates, as the angle difference, the angle formed by a vector whose start point is the position of the viewer in the space and whose end point is the position of one of the object sound sources to be processed, and a vector whose start point is the position of the viewer and whose end point is the position of the other object sound source to be processed.
  • That is, the angle difference indicates how close the direction of one object sound source to be processed, as viewed from the viewer in the space, is to the direction of the other object sound source to be processed as viewed from the viewer. This angle difference is obtained as an importance index of the object sound sources to be processed.
  • The importance index calculation unit 21 supplies the distance obtained in step S162 and the angle difference obtained in step S163 to the process selection unit 22 as the importance indexes of the object sound sources to be processed.
  • Note that the processes in step S162 and step S163 correspond to the process in step S14 in FIG.
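The angle difference of step S163 can be sketched with standard vector math; NumPy is used here purely for convenience:

```python
import numpy as np

def angle_difference(viewer_pos, pos_a, pos_b):
    """Angle (in radians) between the two viewer-to-source vectors."""
    v_a = np.asarray(pos_a, dtype=float) - np.asarray(viewer_pos, dtype=float)
    v_b = np.asarray(pos_b, dtype=float) - np.asarray(viewer_pos, dtype=float)
    cos_angle = np.dot(v_a, v_b) / (np.linalg.norm(v_a) * np.linalg.norm(v_b))
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```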
  • In step S164, the process selection unit 22 determines whether the distance between the object sound sources supplied from the importance index calculation unit 21 is equal to or less than a predetermined threshold.
  • If it is determined in step S164 that the distance between the object sound sources is not equal to or less than the threshold, the process proceeds to step S167; if the distance is equal to or less than the threshold, the process proceeds to step S165.
  • In step S165, the process selection unit 22 determines whether the angle difference between the directions of the object sound sources obtained in step S163 is equal to or less than a predetermined threshold.
  • Note that the angle difference used as the importance index may be calculated only when it is determined in step S164 that the distance between the object sound sources is equal to or less than the threshold. Also, the threshold used in step S164 is different from the threshold used in step S165.
  • If it is determined in step S165 that the angle difference is not equal to or less than the threshold, the process proceeds to step S167.
  • On the other hand, if it is determined in step S165 that the angle difference is equal to or less than the threshold, the process proceeds to step S166.
  • In step S166, the process selection unit 22 selects the process of integrating the two object sound sources as the process to be performed on the two object sound sources to be processed and supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • In this case, the two object sound sources to be processed are located within a certain distance of each other and lie in substantially the same direction as viewed from the viewer. Therefore, even if these two object sound sources are integrated into a single object sound source, there is no significant shift in the position of the sound image, and the sense of presence of the content audio is not impaired.
  • The process selection unit 22 therefore integrates such two object sound sources into one object sound source, reducing the amount of computation during the rendering process and the transmission amount of the output audio stream without impairing the sense of presence of the content audio.
  • When the process of step S166 is performed, a process of integrating the two object sound sources into one object sound source is performed in step S16 of FIG. At this time, the position in the space of the integrated object sound source is set to the average of the coordinates indicating the positions of the two object sound sources, that is, the position of the average coordinates.
  • If it is determined in step S164 that the distance between the object sound sources is not equal to or less than the threshold, or if it is determined in step S165 that the angle difference is not equal to or less than the threshold, the process of step S167 is performed.
  • In step S167, the process selection unit 22 selects performing the processing individually on the two object sound sources to be processed. That is, the process selection unit 22 selects not integrating the two object sound sources to be processed and supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • In this case, the two object sound sources to be processed are located at a distance equal to or greater than a certain distance in the space, or, even if they are within that distance, they lie in sufficiently different directions as viewed from the viewer.
  • After step S167, the two object sound sources to be processed are processed individually without being integrated.
  • The processes in steps S164 to S167 described above correspond to the process in step S15 in FIG.
  • In this way, the server 11 calculates the importance indexes based on the user position information and the sound source position information, and selects the process to be performed on the object sound sources to be processed based on the obtained importance indexes.
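Putting steps S164 to S166 together, the following sketch reuses the angle_difference helper above; the dict layout, the equal-length audio arrays, and the plain summation of the two signals are assumptions made for the example (the patent leaves the mixing details open):

```python
import numpy as np

def maybe_integrate(src_a, src_b, viewer_pos, dist_threshold, angle_threshold):
    """Return the integrated source if the two sources qualify, else None."""
    pos_a = np.asarray(src_a["position"], dtype=float)
    pos_b = np.asarray(src_b["position"], dtype=float)
    if np.linalg.norm(pos_a - pos_b) > dist_threshold:                # step S164
        return None                                                  # step S167
    if angle_difference(viewer_pos, pos_a, pos_b) > angle_threshold:  # step S165
        return None                                                  # step S167
    return {                                                          # step S166
        "position": (pos_a + pos_b) / 2.0,  # average of the two coordinates
        "audio": src_a["audio"] + src_b["audio"],
    }
```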
  • Note that the reduction processing is not limited to that described with reference to FIGS. 4 to 7; other reduction processing, such as changing the reproduction bit rate of an object sound source, can also be performed, and any of the reduction processes described with reference to FIGS. 4 to 7 can be performed in combination with other reduction processing.
  • For example, when the reproduction bit rate is changed in order to reduce the amount of computation and transmission, it is sufficient to select, for each object sound source, whether or not to perform the process of changing the reproduction bit rate, instead of selecting whether to perform the strict rendering process or the light rendering process in the reduction processing of FIGS. 4 to 6 described above.
  • That is, in the steps where the light rendering process would be selected, the process of changing the reproduction bit rate is selected, and in the steps where the strict rendering process would be selected, it is selected that the process of changing the reproduction bit rate is not performed.
  • Note that, according to the process selection result of the process selection unit 22, the rendering process and the like may instead be performed in the free viewpoint video reproduction unit 32 or the like of the client device 12.
  • The series of processes described above can be executed by hardware or can be executed by software.
  • When the series of processes is executed by software, a program constituting the software is installed in a computer.
  • Here, the computer includes a computer incorporated in dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions when various programs are installed.
  • FIG. 8 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processes by a program.
  • In the computer, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to one another via a bus 504.
  • An input / output interface 505 is further connected to the bus 504.
  • An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input / output interface 505.
  • the input unit 506 includes a keyboard, a mouse, a microphone, an image sensor, and the like.
  • the output unit 507 includes a display, a speaker, and the like.
  • the recording unit 508 includes a hard disk, a nonvolatile memory, and the like.
  • the communication unit 509 includes a network interface or the like.
  • the drive 510 drives a removable recording medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
  • In the computer configured as described above, the CPU 501 loads the program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes it, whereby the above-described series of processes is performed.
  • The program executed by the computer (CPU 501) can be provided by being recorded on a removable recording medium 511 such as a package medium, for example.
  • The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
  • In the computer, the program can be installed in the recording unit 508 via the input/output interface 505 by mounting the removable recording medium 511 on the drive 510. The program can also be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. In addition, the program can be installed in the ROM 502 or the recording unit 508 in advance.
  • The program executed by the computer may be a program in which the processes are performed in time series in the order described in this specification, or a program in which the processes are performed in parallel or at necessary timing, such as when a call is made.
  • Furthermore, the present technology can take a cloud computing configuration in which one function is shared and processed jointly by a plurality of devices via a network.
  • Each step described in the above flowcharts can be executed by one device or shared among a plurality of devices.
  • Furthermore, when one step includes a plurality of processes, the plurality of processes included in that one step can be executed by one device or shared among a plurality of devices.
  • Furthermore, the present technology can also be configured as follows.
  • (1) An audio processing device including: a process selection unit that selects a process to be performed on audio data of an object sound source based on one or more importance indexes serving as indexes of the importance of the object sound source.
  • (2) The audio processing device according to (1), in which the process selection unit selects, as the process, a process for reducing an amount of computation or transmission.
  • (3) The audio processing device according to (1) or (2), in which the process selection unit selects, as the process, one of a plurality of rendering processes having mutually different amounts of computation.
  • (4) The audio processing device according to any one of (1) to (3), in which the process selection unit selects, as the process, a process of integrating the audio data of a plurality of the object sound sources.
  • (5) The audio processing device according to any one of (1) to (4), in which the process selection unit selects, as the process, a process of changing a reproduction bit rate of the audio data of the object sound source.
  • (6) The audio processing device according to any one of (1) to (5), further including an importance index calculation unit that calculates the importance index based on meta information related to the audio data.
  • (7) The audio processing device according to (6), in which the importance index calculation unit calculates the importance index based on at least one of, as the meta information, position information of the object sound source, position information of a viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of a space, and arrangement information of objects in the space.
  • (8) The audio processing device according to (6) or (7), in which the importance index calculation unit calculates a distance between the object sound source and a viewer in a space as the importance index.
  • (9) The audio processing device according to any one of (6) to (8), in which the importance index calculation unit uses the importance information of the object sound source as the meta information directly as the importance index.
  • (10) The audio processing device according to any one of (6) to (9), in which the importance index calculation unit calculates a distance between two object sound sources in a space as the importance index.
  • (11) The audio processing device according to any one of (6) to (10), in which the importance index calculation unit calculates, as the importance index, an angle difference between a direction of the object sound source as viewed from a viewer in a space and a direction of another object sound source as viewed from the viewer.
  • (12) The audio processing device according to any one of (1) to (11), in which the process selection unit determines, for each of a plurality of the processes, the number of the object sound sources on which the process is performed, based on at least one of computation specification information indicating computation processing capability of a processing unit that performs the process and transmission rate specification information indicating a maximum transmission rate of the audio data.


Abstract

The present technology relates to a sound processing device and method that make it possible to reproduce content with a high sense of presence while requiring only a small amount of computation or transmission. The sound processing device is provided with a process selection unit that selects a process to be performed on the sound data of an object sound source on the basis of one or more importance indexes serving as indexes of the importance of the object sound source. The present technology can be applied to a content reproduction system.

Description

Audio processing apparatus and method
The present technology relates to an audio processing apparatus and method, and more particularly to an audio processing apparatus and method capable of reproducing content with a high sense of presence with a small amount of computation or transmission.
For example, when enjoying VR (Virtual Reality) content, games, or free viewpoint content created by 3D modeling or the like from video shot with an omnidirectional camera, it is desirable that the sound emitted by objects present in the virtual space also be reproduced with a high sense of presence.
To that end, when the viewer moves in the real space, the direction from which a sound output at a specific sound source position is heard, and the distance to that sound source position, must also change appropriately. That is, the sound image position of the sound needs to change appropriately.
This can be realized by recording the sound of each sound source present in the free viewpoint video as an object sound source and rendering it appropriately in accordance with the viewing environment.
Specifically, the sound image can be localized at an appropriate position by performing binaural reproduction when the sound is reproduced through headphones, or by performing processing such as wavefront synthesis when the sound is reproduced through speakers.
In addition, as a technology related to the reproduction of audio accompanying video, a technique has been proposed that reduces the amount of computation by selecting, based on priority information of audio channels and audio objects, whether to decode the data of each audio channel or audio object (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent Application Laid-Open No. 2015-194666
However, it has been difficult for the above-described technologies to reproduce content with a high sense of presence with a small amount of computation or transmission.
For example, in binaural reproduction, wavefront synthesis, and the like, convolution with filters corresponding to the direction of the sound source and the distance to the sound source is performed, so the amount of computation of the rendering process becomes large. In particular, when a plurality of sound sources are handled, the amount of computation increases in proportion to the number of sound sources. Also, when streaming reproduction or the like is performed, the transmission amount of the audio stream becomes large when the number of sound sources is large.
On the other hand, with the technique of selecting whether to perform decoding based on priority information, the amount of computation can be reduced, but the data of low-priority audio channels and audio objects is not decoded, so the sound of those channels and objects is not reproduced at all. This may impair the sense of presence during content reproduction.
The present technology has been made in view of such a situation, and makes it possible to reproduce content with a high sense of presence with a small amount of computation or transmission.
An audio processing device according to one aspect of the present technology includes a process selection unit that selects a process to be performed on audio data of an object sound source based on one or more importance indexes serving as indexes of the importance of the object sound source.
The process selection unit can select, as the process, a process for reducing an amount of computation or transmission.
The process selection unit can select, as the process, one of a plurality of rendering processes having mutually different amounts of computation.
The process selection unit can select, as the process, a process of integrating the audio data of a plurality of the object sound sources.
The process selection unit can select, as the process, a process of changing a reproduction bit rate of the audio data of the object sound source.
The audio processing device can further include an importance index calculation unit that calculates the importance index based on meta information related to the audio data.
The importance index calculation unit can calculate the importance index based on at least one of, as the meta information, position information of the object sound source, position information of a viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of a space, and arrangement information of objects in the space.
The importance index calculation unit can calculate a distance between the object sound source and a viewer in a space as the importance index.
The importance index calculation unit can use the importance information of the object sound source as the meta information directly as the importance index.
The importance index calculation unit can calculate a distance between two object sound sources in a space as the importance index.
The importance index calculation unit can calculate, as the importance index, an angle difference between a direction of the object sound source as viewed from a viewer in a space and a direction of another object sound source as viewed from the viewer.
The process selection unit can determine, for each of a plurality of the processes, the number of the object sound sources on which the process is performed, based on at least one of computation specification information indicating computation processing capability of a processing unit that performs the process and transmission rate specification information indicating a maximum transmission rate of the audio data.
An audio processing method according to one aspect of the present technology includes an acquisition step of acquiring one or more importance indexes serving as indexes of the importance of an object sound source, and a process selection step of selecting a process to be performed on audio data of the object sound source based on the one or more importance indexes.
In one aspect of the present technology, a process to be performed on the audio data of an object sound source is selected based on one or more importance indexes serving as indexes of the importance of the object sound source.
According to one aspect of the present technology, content can be reproduced with a high sense of presence with a small amount of computation or transmission.
Note that the effects described here are not necessarily limited, and any of the effects described in the present disclosure may be obtained.
FIG. 1 is a diagram showing a configuration example of a content reproduction system. FIG. 2 is a flowchart explaining transmission processing. FIG. 3 is a flowchart explaining reproduction processing. FIGS. 4 to 7 are flowcharts explaining reduction processing. FIG. 8 is a diagram showing a configuration example of a computer.
Hereinafter, embodiments to which the present technology is applied will be described with reference to the drawings.
<First Embodiment>
<About this technology>
The present technology calculates, using meta information of the content and the like, an index of the importance of each object sound source for determining whether it is an object sound source that should be reproduced more strictly, and changes how each object sound source is handled at reproduction time, such as in rendering processing, and at transmission time, thereby making it possible to reproduce content with a high sense of presence while reducing the amount of computation and transmission.
Note that reproducing strictly here means reproducing without impairing the localization and sound quality of the sound image of the object sound source. Also, the rendering process here refers to processing in general that remaps the audio data of object sound sources into audio data with a number of channels suited to the reproduction environment, in accordance with the number of channels and the reproduction conditions of the reproduction device.
In a content reproduction system to which the present technology is applied, an audio stream composed of audio data for each channel of each object sound source is input as the audio stream for reproducing the audio of content composed of video and accompanying audio.
In the following, the content is assumed to be, for example, free viewpoint video content. It is also assumed that a video object in the content video corresponds to an audio object of the content audio, that is, an object sound source. In other words, the sound of an object sound source is the sound of a video object.
Also, in the content reproduction system, rendering processing is performed based on the input audio stream, an audio stream for reproducing the content audio on the reproduction device, that is, the client device, is generated, and the content audio is reproduced on the reproduction device. In the following, the audio stream that is input is also referred to as the input audio stream, and the audio stream that is output to the reproduction device is also referred to as the output audio stream.
Furthermore, in the content reproduction system, processing for reducing the amount of computation when generating the output audio stream and processing for reducing the transmission amount of the output audio stream are performed as appropriate, based on various meta information and the specification information of each device.
In the following, selecting the process to be performed on the audio data of each object sound source from among one or more processes including at least a process for reducing the amount of computation or transmission, and thereby reducing the amount of computation or transmission for the audio stream as a whole, is also referred to as reduction processing. More specifically, in the reduction processing, depending on the object sound source, a selection result that no process for reducing the amount of computation or transmission is performed, or that normal processing is selected, may also be obtained.
In the content reproduction system, at least one of the following processes TR1 to TR3, for example, is performed as the reduction processing for reducing the amount of computation or transmission.
Process TR1: A process of changing the rendering process performed on an object sound source
For example, it is determined for each object sound source whether it is an object sound source that should be reproduced more strictly, and the rendering process performed on that object sound source is selected according to the determination result.
That is, for an object sound source that is determined to be of high importance and to require more strict reproduction, a rendering process that requires a large amount of computation but can obtain a more precise sense of localization of the sound image is performed.
In the following, such a rendering process capable of obtaining a more precise sense of localization is also referred to as a strict rendering process. For example, the strict rendering process includes processing with a large amount of computation, such as convolution using filter coefficients, and is a process that can localize the sound image with higher accuracy.
In contrast, for an object sound source that is of low importance and is not determined to require more strict reproduction, a rendering process that cannot localize the sound image with high accuracy but requires only a small amount of computation is performed.
In the following, such a rendering process that cannot localize the sound image with high accuracy but requires only a small amount of computation is also referred to as a light rendering process. For example, the light rendering process is VBAP (Vector Base Amplitude Panning) or the like.
By selectively performing either the strict rendering process or the light rendering process for each object sound source in this way, according to the importance of the object sound source, content audio with a high sense of presence can be obtained while reducing the amount of computation.
Note that, in the following, a case will be described in which there are two types of rendering process with mutually different amounts of computation and sound image localization accuracy: the light rendering process and the strict rendering process. However, three or more rendering processes with mutually different amounts of computation and sound image localization accuracy may be defined.
In such a case, for example, one of a light rendering process, a moderately strict rendering process, and a strict rendering process is selected for each object sound source as the rendering process.
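To make the contrast concrete, here is a minimal sketch of the two cost levels; the FIR filters stand in for direction- and distance-dependent coefficients (e.g. binaural filters), and the simple stereo panner is used only as a stand-in for a low-cost method such as VBAP:

```python
import numpy as np

def strict_render(audio, left_filter, right_filter):
    """Costly rendering: convolve the signal with per-source filter
    coefficients that depend on source direction and distance.
    The two filters are assumed to have equal length."""
    return np.stack([np.convolve(audio, left_filter),
                     np.convolve(audio, right_filter)])

def light_render(audio, pan):
    """Cheap rendering: equal-power amplitude panning, pan in [0, 1]."""
    return np.stack([audio * np.sqrt(1.0 - pan), audio * np.sqrt(pan)])
```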
Process TR2: A process of changing the reproduction bit rate of an object sound source
For example, it is determined for each object sound source whether it is an object sound source that should be reproduced more strictly, and the reproduction bit rate of the audio data of that object sound source is changed according to the determination result.
That is, for an object sound source that is determined to be of high importance and to require more strict reproduction, the reproduction bit rate of the audio data is not changed, and the audio data is used as-is to generate the output audio stream.
In contrast, for an object sound source that is of low importance and is not determined to require more strict reproduction, a process of changing the reproduction bit rate is performed. That is, audio data with a lower reproduction bit rate is generated based on the original audio data, and the obtained audio data is used to generate the output audio stream.
For example, methods of generating audio data with a lower reproduction bit rate include performing downsampling or the like on the original audio data to obtain audio data with a lower sampling frequency, performing conversion processing on the original audio data to generate audio data with a smaller number of quantization bits, and combining both.
Here, the reproduction bit rate of the audio data of an object sound source is determined by the sampling frequency, the number of channels of the object sound source, and the number of quantization bits. Also, the lower the reproduction bit rate, the smaller the data amount of the audio data, so not only can the transmission bit rate of the output audio stream, that is, the transmission amount, be reduced, but the amount of computation when generating the output audio stream and so on can also be reduced.
For example, for an object sound source present near the viewer, the reproduction bit rate is not changed because it should be reproduced more strictly, while for an object sound source that is far from the viewer and may serve as background sound, the reproduction bit rate can be changed to a lower one. In this way, content audio with a high sense of presence can be obtained while reducing the amount of computation and transmission.
Note that it is assumed that the reproduction device side can reproduce audio data of mutually different reproduction bit rates by some method.
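The bit rate reduction described above can be sketched as follows; the naive decimation (a real implementation would low-pass filter first to avoid aliasing) and the float-in-[-1, 1] signal format are simplifying assumptions made for the example:

```python
import numpy as np

def lower_bitrate(audio, factor=2, bits=8):
    """Reduce the reproduction bit rate of a float signal in [-1, 1] by
    lowering the sampling frequency and the number of quantization bits."""
    downsampled = audio[::factor]          # naive decimation
    levels = float(2 ** (bits - 1))
    return np.round(downsampled * levels) / levels
```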
Process TR3: A process of integrating a plurality of object sound sources into one object sound source
For example, suppose there are object sound sources for which, from the viewer's standpoint, the sense of presence is not impaired even if several object sound sources are combined into one, that is, integrated into a single object sound source. In such a case, the amount of computation and transmission can be reduced by integrating those object sound sources.
Specifically, for example, when two object sound sources in the space are at substantially the same position, the audio data of those two object sound sources are added together at a predetermined volume ratio to form one piece of audio data, which is then handled as a single object sound source at a predetermined position.
As a result, what previously required rendering processing for two object sound sources becomes rendering processing for one object sound source, so the amount of computation during rendering can be reduced. Also, since the audio data of the two object sound sources becomes one piece of audio data, the data amount can be reduced, and as a result the transmission amount of the output audio stream can be reduced.
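The addition at a predetermined volume ratio can be sketched in one line; the 50:50 default ratio is an arbitrary choice made for the example:

```python
import numpy as np

def integrate_audio(audio_a, audio_b, ratio=0.5):
    """Mix the audio data of two object sound sources into one signal,
    weighting the first source by ratio and the second by (1 - ratio)."""
    return ratio * np.asarray(audio_a) + (1.0 - ratio) * np.asarray(audio_b)
```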
Note that only one of the processes TR1 to TR3 described above may be performed, or any of these processes may be performed in combination according to the purpose and situation.
As described above, according to the present technology, content audio with a high sense of presence can be obtained while reducing the amount of computation and transmission by, depending on whether an object sound source should be reproduced more strictly, that is, whether its importance is high, performing different rendering processes on the object sound sources, changing the reproduction bit rate, or integrating object sound sources.
For example, if distribution of free viewpoint video content having many audio objects becomes common in the future, it will be necessary to perform streaming distribution while performing rendering processing on the server side in real time, in accordance with the viewing environment on the viewer side, that is, the client side, such as the headphones and speaker arrangement. If the amount of computation and transmission can then be reduced by the present technology, it becomes possible to perform the computation for generating an output audio stream for each of many clients and to transmit an output audio stream to each of those clients. That is, streaming distribution can be performed for many clients simultaneously.
Here, the processing for calculating the importance index, which indicates the importance of each object sound source, and the meta information and specification information used when performing the reduction processing will be described.
First, the meta information included in the video stream for reproducing the content video (hereinafter also referred to as video meta information) includes user position information indicating the position in the space of the user who is the viewer, and gaze direction information indicating the gaze direction of that user.
Also, the meta information included in the input audio stream of the content audio (hereinafter also referred to as audio meta information) includes space size information indicating the size of the space, and sound source position information indicating the position of each object sound source in the space. The audio meta information may also include importance information indicating the importance of each object sound source.
In free viewpoint video content, the viewer and the audio objects (object sound sources) move from moment to moment, so their positional relationship changes. Accordingly, which object sound sources are important to the viewer changes depending on that positional relationship.
For example, an object sound source located near the viewer should be reproduced strictly in order to maintain localization at an accurate position. That is, the strict rendering process should be performed.
In contrast, for an object sound source located sufficiently far from the viewer, only its approximate direction needs to be conveyed, so it may be reproduced using a light rendering process or the like.
Thus, for example, when a rendering process is selected for each object sound source, the positional relationship between the user and each object sound source can be identified using the user position information included in the video meta information and the sound source position information included in the audio meta information.
Also, when the reduction processing for reducing the amount of computation or transmission is performed, computation specification information indicating the computation processing capability, that is, the computation processing performance, of the computation block that generates the output audio stream, for example the computation block that performs the rendering process, is used as appropriate.
For example, when the computation processing capability of the computation block is limited, in order to realize real-time streaming distribution it is necessary to adjust the number of object sound sources to which the light rendering process or the strict rendering process is assigned in accordance with that computation processing capability.
Specifically, for example, in a computation block with low computation processing capability, the number of object sound sources to which the strict rendering process is assigned can be limited to ten or fewer, with the light rendering process performed on the other object sound sources.
Furthermore, the communication speed, that is, the transmission rate (transmission bit rate), may be limited on the transmission side or the reception side of the output audio stream. Therefore, in the content reproduction system, transmission rate specification information indicating the maximum transmission rate, which is the fastest rate the transmission can take, is acquired as appropriate for each of the transmission side and the reception side, and the reduction processing for reducing the amount of computation or transmission can be performed using that transmission rate specification information.
For example, when the maximum transmission rate indicated by the transmission rate specification information is low, the process of changing the reproduction bit rate or the process of integrating object sound sources is performed so that the output audio stream can be transmitted at or below that maximum transmission rate. At this time, when the maximum transmission rate differs between the transmission side and the reception side, the reproduction bit rate may be changed or the object sound sources may be integrated so that communication at the slower of the two maximum transmission rates becomes possible, as in the small check sketched below.
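A trivial sketch of that check, with hypothetical names:

```python
def needs_reduction(stream_rate_bps, sender_max_bps, receiver_max_bps):
    """Reduction (bit rate change or source integration) is needed when the
    output stream exceeds the slower of the two maximum transmission rates."""
    return stream_rate_bps > min(sender_max_bps, receiver_max_bps)
```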
In the following, when there is no particular need to distinguish between the transmission rate specification information on the transmission side and that on the reception side, they are simply referred to as the transmission rate specification information. Also, when there is no particular need to distinguish between the computation specification information and the transmission rate specification information, they are simply referred to as the specification information.
Furthermore, in free viewpoint video content, importance information indicating a priority, that is, an importance, may be added to object sound sources as meta information (hereinafter also referred to as recording/editing meta information) when the content is recorded or when the content is edited after recording.
For example, a high importance is added to the sound of an object that symbolizes the scene (place), that is, to that object sound source, and a low importance is added to the sounds of other objects. In this way, more computation resources can be allocated to reproducing the sounds of object sound sources of higher importance, and content audio with a high sense of presence can be obtained while reducing the amount of computation and transmission even when the computation processing capability of the computation block is limited.
Note that the importance information of each object sound source may be included in the audio meta information of the input audio stream. Also, the importance information may be defined for each frame of the content audio, or may be defined in units of a plurality of frames.
In the content reproduction system, the importance index indicating the importance of each object sound source is calculated using at least one of the pieces of information included in the video meta information, the audio meta information, and the recording/editing meta information described above.
Then, the reduction processing is performed using the importance index of each object sound source, and the amount of computation when generating the output audio stream and the transmission amount of the output audio stream are reduced. Note that, in the reduction processing, the computation specification information, the transmission rate specification information, and the like are also used as necessary.
Meta information and specification information other than those described above may also be used, and the reduction processing for reducing the amount of computation or transmission may be processing other than the processing described above.
<Example configuration of content playback system>
Next, a more specific embodiment of the content reproduction system described above will be described. FIG. 1 is a diagram showing a configuration example of an embodiment of a content reproduction system to which the present technology is applied.
 図1に示すコンテンツ再生システムは、サーバ11、クライアント装置12、およびメタ情報格納部13を有している。 The content reproduction system shown in FIG. 1 includes a server 11, a client device 12, and a meta information storage unit 13.
 この例では、サーバ11は、例えばクラウドなどのネットワークサーバからなり、ユーザが操作するクライアント装置12と有線や無線のネットワークを介して接続されている。なお、ここではサーバ11に1つのクライアント装置12が接続されているが、サーバ11には2以上の複数のクライアント装置12が接続されるようにしてもよい。 In this example, the server 11 includes a network server such as a cloud, and is connected to the client device 12 operated by the user via a wired or wireless network. Here, one client device 12 is connected to the server 11, but two or more client devices 12 may be connected to the server 11.
 サーバ11は、アナログまたはデジタルの音声信号(音声データ)への演算処理が可能な装置であり、入力オーディオストリームに対するレンダリング処理をオブジェクト音源ごとにリアルタイムで切り替えて適切な出力オーディオストリームを生成し、自由視点映像コンテンツのストリーミング配信を行う。すなわち、サーバ11は、外部から供給された、または予め記録している自由視点映像コンテンツの音声のストリーミング配信をクライアント装置12に対して行う。 The server 11 is a device that can perform arithmetic processing on an analog or digital audio signal (audio data), and generates a suitable output audio stream by switching rendering processing for an input audio stream in real time for each object sound source. Perform streaming distribution of viewpoint video content. That is, the server 11 performs streaming distribution of the audio of the free viewpoint video content supplied from the outside or recorded in advance to the client device 12.
 具体的には、サーバ11は入力オーディオストリームとメタ情報やスペック情報とに基づいて出力オーディオストリームを生成し、クライアント装置12へと伝送する。その際、サーバ11は、適宜、メタ情報格納部13から収録/編集時メタ情報を取得する。 Specifically, the server 11 generates an output audio stream based on the input audio stream and the meta information and specification information, and transmits the output audio stream to the client device 12. At that time, the server 11 appropriately acquires recording / editing meta information from the meta information storage unit 13.
 また、クライアント装置12は、サーバ11から出力オーディオストリームを受信して、コンテンツ音声を再生する。このとき、クライアント装置12は、サーバ11または他のサーバ等から取得したビデオストリームに基づいてコンテンツ映像も再生することで、映像と音声からなる自由視点映像コンテンツを再生する。 In addition, the client device 12 receives the output audio stream from the server 11 and reproduces the content audio. At this time, the client device 12 reproduces a free-viewpoint video content composed of video and audio by also playing a content video based on a video stream acquired from the server 11 or another server.
The server 11 includes an importance index calculation unit 21, a process selection unit 22, a rendering processing unit 23, and an audio stream transmission unit 24.
The client device 12 includes an audio stream receiving unit 31 and a free viewpoint video reproduction unit 32.
The importance index calculation unit 21 acquires meta information and specification information as necessary, calculates an importance index based on the meta information, and supplies the obtained importance index and the specification information to the process selection unit 22. The importance index is an index indicating the importance of each object sound source.
For example, the importance index calculation unit 21 acquires (extracts) audio meta information from the input audio stream, acquires recording/editing meta information from the meta information storage unit 13, and acquires video meta information from the free viewpoint video reproduction unit 32 of the client device 12.
Further, for example, the importance index calculation unit 21 acquires computation specification information from the rendering processing unit 23, acquires transmission-side transmission speed specification information from the audio stream transmission unit 24, and acquires reception-side transmission speed specification information from the audio stream receiving unit 31.
Furthermore, the importance index calculation unit 21 supplies video meta information to the rendering processing unit 23 via the process selection unit 22 as necessary.
The process selection unit 22 acquires the input audio stream and performs reduction processing for reducing the amount of computation and transmission based on the importance indices and specification information supplied from the importance index calculation unit 21. The process selection unit 22 also supplies the result of the reduction processing and the input audio stream to the rendering processing unit 23.
Here, the result of the reduction processing for reducing the amount of computation and transmission is, for each object sound source, a selection result (decision result) such as whether light rendering processing or strict rendering processing is to be performed, which object sound sources are to have their reproduction bit rate changed, or which object sound sources are to be integrated into one. That is, the reduction processing yields, for each object sound source, a selection of what kind of processing is to be performed.
The specification information is used, for example, to determine the number of object sound sources on which each of a plurality of processes, including processes for reducing the amount of computation and transmission, is performed.
The rendering processing unit 23 performs rendering processing based on the result of the reduction processing supplied from the process selection unit 22 and the input audio stream, and supplies the resulting output audio stream to the audio stream transmission unit 24.
At this time, the rendering processing unit 23 performs the rendering processing using, as appropriate, the video meta information supplied from the importance index calculation unit 21 via the process selection unit 22 and the audio meta information included in the input audio stream supplied from the process selection unit 22.
The rendering processing unit 23 also acquires from the client device 12, as appropriate, information about the reproduction environment of the client device 12, such as how many channels its speaker system has. The rendering processing unit 23 then generates, according to that reproduction environment, an output audio stream composed of audio data for each channel that the client device 12 can reproduce. Furthermore, based on the result of the reduction processing, the rendering processing unit 23 also performs processing such as changing the reproduction bit rate and integrating object sound sources as appropriate.
The audio stream transmission unit 24 transmits the output audio stream supplied from the rendering processing unit 23 to the client device 12 via the network.
The audio stream receiving unit 31 of the client device 12 receives the output audio stream transmitted by the audio stream transmission unit 24 of the server 11 and supplies it to the free viewpoint video reproduction unit 32. In addition, in response to requests from the server 11, the audio stream receiving unit 31 supplies reception-side transmission speed specification information to the importance index calculation unit 21 as appropriate.
The free viewpoint video reproduction unit 32 includes sound reproduction equipment, such as headphones or a speaker system, and equipment that drives it, and reproduces the content audio based on the output audio stream supplied from the audio stream receiving unit 31.
The free viewpoint video reproduction unit 32 also includes a display device and the like, and reproduces the content video based on a video stream acquired from outside. Furthermore, in response to requests from the server 11, the free viewpoint video reproduction unit 32 extracts video meta information from the video stream as appropriate and supplies it to the importance index calculation unit 21.
Here, an example in which video meta information including user position information and line-of-sight direction information is extracted from the video stream will be described, but the user position information and line-of-sight direction information may be acquired by any method.
For example, the client device 12 may acquire the user position information and line-of-sight direction information from another external device and supply them to the importance index calculation unit 21. Alternatively, the client device 12 may be provided with a gyro sensor that detects the direction of the user's head, an image sensor that captures the user, or the like, and obtain the user position information and line-of-sight direction information from them. In this case, for example, the user's face direction may be identified from the output of the gyro sensor and taken as the user's line-of-sight direction, or the user's line-of-sight direction and position in the space may be detected from an image obtained by the image sensor.
In addition, the importance index calculation unit 21 may use space size information included in the video meta information of the video stream, or the position information and importance information of a video object included in the video meta information may be used as the sound source position information and importance information of the object sound source corresponding to that video object.
Furthermore, an example in which the server 11 and the client device 12 are connected via a network has been described here. However, the importance index calculation unit 21 through the audio stream transmission unit 24, the audio stream receiving unit 31, and the free viewpoint video reproduction unit 32 may all be provided in a single apparatus. Alternatively, an apparatus provided with the importance index calculation unit 21 through the audio stream transmission unit 24 and an apparatus provided with the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be connected by wire, such as a cable.
Specifically, for example, a case where free viewpoint video content stored on a personal computer in the user's home is played back on a head-mounted display is conceivable. In such a case, the importance index calculation unit 21 through the audio stream transmission unit 24 may be provided in the personal computer, and the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be provided in the head-mounted display connected to the personal computer.
Playing back 3D game content with the content reproduction system is also conceivable. In such a case, for example, a stationary game console may be configured to include the importance index calculation unit 21 through the audio stream transmission unit 24, the audio stream receiving unit 31, and the free viewpoint video reproduction unit 32. Alternatively, the importance index calculation unit 21 through the audio stream transmission unit 24 may be provided in the stationary game console, and the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be provided in an external device connected to the console by wire or wirelessly.
Here, specific examples of the light rendering processing and the strict rendering processing performed in the rendering processing unit 23 of the content reproduction system described above will be given.
For example, assume that the audio playback specification of the client device 12, that is, the reproduction environment for the content audio, is playback through headphones. In other words, assume that the free viewpoint video reproduction unit 32 consists of headphones.
In such a case, for example, binaural playback processing that convolves a head-related transfer function (HRTF) with the audio data of an object sound source is performed as the strict rendering processing.
In this case, for example, head-related transfer functions are prepared in advance for each relative positional relationship between the viewer and an object sound source in the space. From among these, the head-related transfer function corresponding to the relative positional relationship between the position of the object sound source indicated by the sound source position information and the position of the viewer indicated by the user position information is selected.
Convolution processing is then performed to convolve the selected head-related transfer function with the audio data of the object sound source, generating audio data for the left and right channels of the object sound source's sound such that the sound image is localized at the desired position.
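As a minimal sketch of this binaural path, the following assumes the HRTF pair for the relevant viewer-to-source direction has already been selected; the function name and the use of FFT-based convolution are illustrative assumptions, not part of the present technology.

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_render(source: np.ndarray,
                    hrtf_left: np.ndarray,
                    hrtf_right: np.ndarray) -> np.ndarray:
    """Convolve a mono object sound source with a selected HRTF pair.

    Returns an (N, 2) array holding the left and right channel audio data.
    """
    left = fftconvolve(source, hrtf_left, mode="full")
    right = fftconvolve(source, hrtf_right, mode="full")
    return np.stack([left, right], axis=-1)
```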
In contrast, as the light rendering processing, a panning process is performed that localizes the sound image by changing the volume ratio of the left and right sounds of the object sound source based on, for example, the viewer's position and line-of-sight direction in the space and the position of the object sound source.
In this case, based on the user position information, sound source position information, and line-of-sight direction information, audio data for the left and right channels of the object sound source's sound is generated such that the volume ratio of the left and right channels, determined according to the positional relationship between the viewer and the object sound source in the space and the viewer's line-of-sight direction, localizes the sound image at the desired position.
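The following is a minimal sketch of such a panning process, assuming the source's azimuth relative to the viewer's line of sight has already been computed from the position and line-of-sight direction information; the constant-power gain law is an illustrative assumption.

```python
import numpy as np

def pan_render(source: np.ndarray, azimuth_rad: float) -> np.ndarray:
    """Pan a mono source between left and right channels.

    azimuth_rad is in [-pi/2, pi/2]; negative values are to the left.
    Returns an (N, 2) array of left/right channel audio data.
    """
    theta = (azimuth_rad + np.pi / 2) / 2          # map to [0, pi/2]
    gain_l, gain_r = np.cos(theta), np.sin(theta)  # constant-power gains
    return np.stack([gain_l * source, gain_r * source], axis=-1)
```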
Next, assume for example that the audio playback specification of the client device 12 is playback through a multi-channel speaker system. In other words, assume that the free viewpoint video reproduction unit 32 consists of multi-channel speakers.
In such a case, for example when the free viewpoint video reproduction unit 32 consists of a linear speaker array, the processing that generates audio data for each speaker, that is, for each channel, to reproduce the sound of the object sound source by wavefront synthesis is performed as the strict rendering processing.
In wavefront synthesis, a filter coefficient determined from the positional relationship between the position of the object sound source indicated by the sound source position information and the position of the viewer indicated by the user position information is selected for each speaker (channel). Convolution processing is then performed to convolve the selected filter coefficients with the audio data of the object sound source, generating audio data for each channel of the object sound source's sound such that the sound image is localized at the desired position.
Also, for example when the free viewpoint video reproduction unit 32 consists of an annular speaker array, processing that generates audio data for each channel to reproduce the sound of the object sound source by HOA (Higher Order Ambisonics) is performed as the strict rendering processing. In HOA, the audio data of each channel is generated by computation in the spherical harmonic domain.
In contrast, for example, processing that generates audio data for each channel to reproduce the sound of the object sound source by VBAP is performed as the light rendering processing.
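As a rough illustration of why VBAP is comparatively light, the following sketches the two-dimensional (speaker-pair) gain computation: the source direction is expressed as a linear combination of the two nearest speakers' unit vectors and the gains are power-normalized. Speaker-pair selection and the three-dimensional triplet case are omitted, and the function name is an assumption.

```python
import numpy as np

def vbap_pair_gains(source_dir: np.ndarray,
                    spk1: np.ndarray,
                    spk2: np.ndarray) -> np.ndarray:
    """Gains (g1, g2) for a 2-D speaker pair; all inputs are unit 2-D vectors."""
    base = np.column_stack([spk1, spk2])        # L = [l1 l2]
    gains = np.linalg.solve(base, source_dir)   # solve p = L @ g for g
    gains = np.clip(gains, 0.0, None)           # source should lie between the pair
    return gains / np.linalg.norm(gains)        # constant-power normalization
```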
<Description of transmission processing and playback processing>
Next, the processing performed by the content reproduction system shown in FIG. 1 will be described.
First, the transmission processing, in which the server 11 generates and outputs an output audio stream, will be described with reference to the flowchart of FIG. 2.
In step S11, the importance index calculation unit 21 determines whether there is a request to reduce the amount of computation or transmission.
For example, the importance index calculation unit 21 determines that there is a reduction request when the client device 12 requests a reduction in the amount of computation or transmission for the output audio stream.
Alternatively, for example, it may be determined that there is a reduction request when a plurality of client devices 12 are connected to the server 11, the number of client devices 12 requesting transmission of an output audio stream is large, and the processing load on the server 11 is therefore high.
If it is determined in step S11 that there is no reduction request, the processing of steps S12 through S15 is skipped, and the processing then proceeds to step S16.
In this case, the process selection unit 22 supplies the input audio stream to the rendering processing unit 23, together with a selection result indicating that strict rendering processing has been selected for all object sound sources. Video meta information and the like are also acquired as necessary and supplied from the importance index calculation unit 21 to the rendering processing unit 23 via the process selection unit 22.
In contrast, if it is determined in step S11 that there is a reduction request, the importance index calculation unit 21 acquires meta information and specification information in step S12.
For example, the importance index calculation unit 21 acquires, as meta information relating to the free viewpoint video content, that is, to the audio data of each object sound source, the audio meta information from the input audio stream, the recording/editing meta information from the meta information storage unit 13, and the video meta information from the free viewpoint video reproduction unit 32 of the client device 12.
Further, for example, the importance index calculation unit 21 acquires computation specification information from the rendering processing unit 23, acquires transmission-side transmission speed specification information from the audio stream transmission unit 24, and acquires reception-side transmission speed specification information from the audio stream receiving unit 31.
Note that the meta information and specification information may be acquired sequentially at predetermined intervals in real time, for example per frame or per group of frames, or information acquired in advance may be used continuously.
In step S13, the importance index calculation unit 21 determines whether all necessary meta information and specification information have been acquired.
If it is determined in step S13 that the necessary meta information and specification information have not yet been acquired, the processing returns to step S12, and the processing described above is repeated.
In contrast, if it is determined in step S13 that the necessary meta information and specification information have been acquired, the processing proceeds to step S14.
In step S14, the importance index calculation unit 21 calculates the importance index of each object sound source based on the acquired meta information, and supplies the obtained importance indices and the specification information to the process selection unit 22. The importance index calculation unit 21 also supplies video meta information to the rendering processing unit 23 via the process selection unit 22 as necessary.
For example, the importance index calculation unit 21 calculates, as the importance index, the distance from the viewer's position to the position of an object sound source from the sound source position information included in the audio meta information and the user position information included in the video meta information, or uses the importance information included in the recording/editing meta information directly as the importance index.
In step S15, the process selection unit 22 performs the reduction processing based on the importance index of each object sound source supplied from the importance index calculation unit 21 and the specification information, thereby selecting (determining) the processing to be performed on each object sound source.
That is, the process selection unit 22 acquires the importance indices and specification information from the importance index calculation unit 21. Then, for example, the process selection unit 22 selects, for each object sound source, whether to perform light rendering processing or strict rendering processing, whether to perform processing to change the reproduction bit rate of the object sound source, and whether to perform processing to integrate object sound sources. The process selection unit 22 supplies these selection results and the input audio stream to the rendering processing unit 23.
For example, light rendering processing, processing to change the reproduction bit rate, and processing to integrate object sound sources are processes for reducing the amount of computation and transmission, whereas strict rendering processing is the ordinary processing. In the reduction processing, the specification information is used as necessary and need not always be used.
When the processing has been selected in step S15, or when it has been determined in step S11 that there is no reduction request, the processing of step S16 is performed.
In step S16, the rendering processing unit 23 performs rendering processing based on the processing selection result supplied from the process selection unit 22 and the input audio stream, and generates an output audio stream.
For example, for an object sound source for which light rendering processing has been selected, the rendering processing unit 23 performs panning processing, VBAP, or the like based on the audio data of the object sound source included in the input audio stream, according to the reproduction environment of the free viewpoint video reproduction unit 32.
For an object sound source for which strict rendering processing has been selected, the rendering processing unit 23 performs binaural playback processing, wavefront synthesis, HOA, or other such processing based on the audio data of the object sound source included in the input audio stream, according to the reproduction environment of the free viewpoint video reproduction unit 32.
Furthermore, for object sound sources for which integration processing has been selected, the rendering processing unit 23 integrates their audio data before the rendering processing by adding the audio data of those object sound sources at a predetermined volume ratio to form a single piece of audio data.
At this time, the volume ratio used when adding the audio data, that is, the weight by which each piece of audio data is multiplied, is determined based on, for example, the spatial positional relationship between each of the object sound sources and the viewer. The addition of audio data is performed per channel.
Furthermore, the position of the integrated object sound source may be the coordinate position given by the average of the coordinates of the positions of the integrated object sound sources, or a position given by a representative value of those coordinates.
For example, the representative value of the coordinates of the object sound source positions may be the coordinates of the position of one of the object sound sources to be integrated, or coordinates calculated from the coordinates of the positions of the object sound sources by weighted addition or the like.
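A minimal sketch of this integration step follows, assuming mono sources of equal length; the inverse-distance weighting rule and the use of the mean position are illustrative choices among those described above.

```python
import numpy as np

def integrate_sources(audios: list[np.ndarray],
                      positions: np.ndarray,
                      viewer_pos: np.ndarray):
    """Merge several object sound sources into one.

    audios: K mono signals of equal length; positions: (K, 3) coordinates.
    Returns the mixed signal and the position of the merged source.
    """
    dists = np.linalg.norm(positions - viewer_pos, axis=1)
    weights = 1.0 / np.maximum(dists, 1e-6)   # nearer sources weigh more
    weights /= weights.sum()                  # volume ratios sum to 1
    mix = sum(w * a for w, a in zip(weights, audios))
    merged_pos = positions.mean(axis=0)       # average of source positions
    return mix, merged_pos
```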
In addition, for an object sound source for which processing to change the reproduction bit rate has been selected, the rendering processing unit 23 performs, before or after the rendering processing, downsampling that converts the sampling frequency of the object sound source's audio data, or conversion processing that changes the number of quantization bits, thereby generating audio data with the changed reproduction bit rate.
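The following is a minimal sketch of such a bit rate reduction, assuming float audio data in [-1, 1]; the target sampling frequency and quantization depth are illustrative assumptions.

```python
import numpy as np
from scipy.signal import resample_poly

def reduce_bit_rate(audio: np.ndarray, fs_in: int = 48_000,
                    fs_out: int = 24_000, bits: int = 16) -> np.ndarray:
    """Downsample and requantize one object sound source's audio data."""
    down = resample_poly(audio, up=fs_out, down=fs_in)  # e.g. 48 kHz -> 24 kHz
    scale = 2 ** (bits - 1) - 1
    return np.round(down * scale) / scale               # quantize to `bits` bits
```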
When the audio data of each channel has been obtained for every object sound source by the above processing, the rendering processing unit 23 adds together the audio data of the same channel across the object sound sources to form a single piece of audio data per channel, thereby generating the audio data of each channel for reproducing the content audio. The rendering processing unit 23 generates an output audio stream storing the per-channel audio data of the content audio obtained in this way, and supplies it to the audio stream transmission unit 24. Note that the output audio stream may instead store audio data for each object sound source.
In step S17, the audio stream transmission unit 24 transmits the output audio stream supplied from the rendering processing unit 23 to the client device 12. The processing then returns to step S11, and the processing described above is repeated until the streaming distribution of the content audio ends.
As described above, the server 11 acquires meta information and specification information as necessary, calculates importance indices, and selects the processing to be performed on each object sound source based on the obtained importance indices. The server 11 then performs rendering processing and the like according to the processing selection results and generates the output audio stream.
By appropriately selecting the processing for each object sound source in this way, the amount of computation when generating the output audio stream and the amount of data transmitted for the output audio stream can be appropriately reduced. As a result, content can be reproduced with a high sense of presence using a small amount of computation or transmission.
When the output audio stream is output from the server 11, the client device 12 performs reproduction processing and reproduces the content audio.
The reproduction processing performed by the client device 12 will be described below with reference to the flowchart of FIG. 3.
In step S41, the audio stream receiving unit 31 acquires the output audio stream and supplies it to the free viewpoint video reproduction unit 32.
That is, the audio stream receiving unit 31 acquires the output audio stream by receiving, at regular intervals, the output audio stream transmitted by the audio stream transmission unit 24 of the server 11.
In step S42, the audio stream receiving unit 31 determines whether the output audio stream necessary for reproduction has been acquired. If it is determined in step S42 that the output audio stream has not yet been acquired, the processing returns to step S41, and the processing described above is repeated.
In contrast, if it is determined in step S42 that the output audio stream has been acquired, then in step S43 the free viewpoint video reproduction unit 32 reproduces the content audio based on the output audio stream supplied from the audio stream receiving unit 31.
At this time, the free viewpoint video reproduction unit 32 also reproduces the content video based on a video stream acquired from outside, thereby reproducing free viewpoint video content composed of the content video and content audio.
In step S44, the free viewpoint video reproduction unit 32 determines whether the importance index calculation unit 21 of the server 11 has requested the supply of video meta information. For example, when the importance index calculation unit 21 requests the supply of video meta information in step S12 of the transmission processing of FIG. 2, it is determined that the supply of video meta information has been requested.
If it is determined in step S44 that the supply of video meta information has not been requested, the processing returns to step S41, and the processing described above is repeated until the reproduction of the free viewpoint video content ends.
In contrast, if it is determined in step S44 that the supply of video meta information has been requested, then in step S45 the free viewpoint video reproduction unit 32 extracts the video meta information from the video stream and supplies it to the importance index calculation unit 21. Once the video meta information has been supplied, the processing returns to step S41, and the processing described above is repeated.
Note that the video meta information may be supplied in real time during the reproduction of the content audio, or may be supplied in advance.
As described above, the client device 12 acquires the output audio stream from the server 11 to reproduce the content audio, and outputs video meta information in response to requests from the server 11. This makes it possible to reduce the amount of computation when generating the output audio stream on the server 11 side and the amount of data transmitted for the output audio stream, while reproducing content with a high sense of presence.
<Example 1 of reduction processing>
In the transmission processing described with reference to FIG. 2, an index of the importance of each object sound source is obtained as the importance index based on the acquired meta information. The processing to be performed on each object sound source is then selected based on that importance index and the specification information.
The importance index used at this time need not be a single index per object sound source; a plurality of different importance indices may be obtained for each object sound source and used in the processing.
Below, with reference to FIGS. 4 through 7, specific examples of the reduction processing, corresponding to the processing of steps S12 through S15 of FIG. 2, will be described, in which an importance index is calculated and the obtained importance index is used to select the processing to be performed on each object sound source.
First, with reference to the flowchart of FIG. 4, an example of selecting the processing to be performed on each object sound source based on the positional relationship between the object sound source and the viewer in the space will be described. That is, the reduction processing performed by the server 11 will be described below with reference to the flowchart of FIG. 4. Note that this reduction processing is performed for each object sound source.
In step S71, the importance index calculation unit 21 acquires user position information and sound source position information as the information necessary for calculating the importance index.
Specifically, for example, the importance index calculation unit 21 acquires the video meta information from the free viewpoint video reproduction unit 32, thereby obtaining the user position information included in that video meta information. The importance index calculation unit 21 also acquires the audio meta information from the input audio stream, thereby obtaining the sound source position information included in that audio meta information. This processing of step S71 corresponds to the processing of steps S12 and S13 of FIG. 2.
In step S72, the importance index calculation unit 21 calculates the distance in the space between the viewer and the object sound source to be processed.
For example, the importance index calculation unit 21 calculates the distance |VO - VL| from the coordinates of the viewer's position in the space indicated by the user position information, that is, the vector VL representing the viewer's position, and the coordinates of the object sound source's position in the space indicated by the sound source position information, that is, the vector VO representing the object sound source's position.
Here, |VO - VL| denotes the magnitude of the vector (VO - VL); in this example, |VO - VL| is the distance from the viewer to the object sound source in the space.
When the distance serving as the importance index has been calculated for the object sound source to be processed, the importance index calculation unit 21 supplies the calculated importance index to the process selection unit 22. This processing of step S72 corresponds to the processing of step S14 of FIG. 2.
In step S73, the process selection unit 22 determines whether the distance between the viewer and the object sound source, serving as the importance index supplied from the importance index calculation unit 21, is greater than or equal to a predetermined threshold th.
That is, the process selection unit 22 determines whether the distance |VO - VL| and the threshold th satisfy the relationship of the following expression (1). In this case, when the relationship of expression (1) does not hold, the distance serving as the importance index is determined to be greater than or equal to the threshold th.
|VO - VL| < th   ... (1)
If it is determined in step S73 that the distance is greater than or equal to the threshold th, that is, if the relationship of expression (1) does not hold, the processing proceeds to step S74.
In step S74, the process selection unit 22 selects light rendering processing as the processing to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
For example, when the relationship of expression (1) does not hold, the object sound source to be processed is far from the viewer (user) in the space, so such an object sound source is unlikely to be important. The process selection unit 22 therefore reduces the amount of computation during rendering by performing light rendering processing on object sound sources of low importance.
In contrast, if it is determined in step S73 that the distance is not greater than or equal to the threshold th, that is, if the relationship of expression (1) holds, the processing proceeds to step S75.
In step S75, the process selection unit 22 selects strict rendering processing as the processing to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
For example, when the relationship of expression (1) holds, the object sound source to be processed is sufficiently close to the viewer in the space, that is, within a certain distance, so such an object sound source is important. By performing strict rendering processing on object sound sources of high importance, the process selection unit 22 makes it possible to reproduce the content audio with a high sense of presence.
The processing of steps S73 through S75 described above corresponds to the processing of step S15 of FIG. 2. In this processing, the distance from the viewer to the object sound source serves as the importance index, with a shorter distance indicating higher importance; strict rendering processing is selected for object sound sources within a certain distance of the viewer, while light rendering processing is selected for object sound sources at or beyond that distance, reducing the amount of computation.
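The distance-based selection of steps S72 through S75 reduces to a few lines; the following sketch assumes 3-D position vectors and an illustrative threshold value.

```python
import numpy as np

TH = 5.0  # threshold th, in the same units as the position coordinates

def select_rendering(source_pos: np.ndarray, viewer_pos: np.ndarray) -> str:
    """Choose per-source rendering by the viewer-to-source distance."""
    distance = np.linalg.norm(source_pos - viewer_pos)  # |VO - VL|
    return "strict" if distance < TH else "light"       # expression (1)
```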
As described above, the server 11 calculates the distance between the viewer and each object sound source as the importance index, and selects the rendering processing to be performed on the object sound source according to that distance. This makes it possible to obtain content audio with a high sense of presence while reducing the overall amount of computation during rendering.
Here, which rendering processing is performed on each object sound source is selected based on whether the distance to the object sound source is greater than or equal to the threshold th. Alternatively, for example, a predetermined number of object sound sources may be selected in ascending order of distance from the viewer, with strict rendering processing performed on the selected object sound sources and light rendering processing performed on the others. In this case, the number of object sound sources on which strict rendering processing is performed may be determined based on, for example, the computation specification information.
An example using the user position information and sound source position information included in the meta information has been described here, but other information may also be used: the processing to be performed on an object sound source may be selected by combining multiple pieces of information, or by multiple conditional branches that also use other information.
In such cases, it is conceivable to use, for example, space size information, line-of-sight direction information, spread information, acoustic characteristic information indicating acoustic characteristics such as the reverberation of the space, or arrangement information indicating the placement positions, that is, the positional relationships, of object sound sources and other objects in the space.
Specifically, for example, when the line-of-sight direction information included in the video meta information is used, light rendering processing may be performed on object sound sources that are not visible to the viewer, as determined from the viewer's line-of-sight direction identified from the line-of-sight direction information, the viewer's position, and the positions of the object sound sources.
This is effective when, for example, the amount of computation needs to be reduced and there are multiple object sound sources with the same importance index, such as the same distance from the viewer.
For example, when there are two object sound sources at the same distance from the viewer, the object sound source visible to the viewer can be treated as more important: strict rendering processing can be selected for it, and light rendering processing for the object sound source not visible to the viewer. This gives priority to object sound sources visible to the viewer, making it possible to reproduce the content audio with a high sense of presence while reducing the overall amount of computation.
In such a case, the importance index calculation unit 21 calculates the distance from the viewer to the object sound source as one importance index, and calculates information indicating whether the object sound source is visible to the viewer as another importance index. The process selection unit 22 then selects the processing to be performed on the object sound source based on the two importance indices calculated for it.
Also, for example, the audio meta information may include spread information indicating the degree of spread of the object sound source's sound image, that is, the size of the object in the space. The spread information is used, for example, to give the sound image of the object sound source a spread when VBAP is performed as the rendering processing.
When the degree of spread of the sound image indicated by such spread information is large, the sound image of the object sound source is spread over a fairly wide region, so there is little need to perform strict rendering processing on such an object sound source. Therefore, for example, light rendering processing may be selected for object sound sources whose degree of sound-image spread indicated by the spread information is greater than or equal to a predetermined threshold, and strict rendering processing for the other object sound sources.
In such a case, the importance index calculation unit 21 supplies the spread information extracted from the audio meta information to the process selection unit 22 as it is, as one importance index.
Furthermore, when acoustic characteristic information indicating the acoustic characteristics of the space is available, for example when acoustic characteristic information is stored in a reserved area of the audio meta information, the processing for an object sound source may be selected based on the degree of reverberation of the space indicated by that acoustic characteristic information.
For example, in a space with many sound reflections and a high degree of reverberation, the sense of presence of the content audio is not impaired even if the number of object sound sources subjected to light rendering processing is increased to some extent. Therefore, for example, the process selection unit 22 may determine the number of object sound sources subjected to strict rendering processing and the number subjected to light rendering processing based on the degree of reverberation indicated by the acoustic characteristic information.
In such a case, the importance index calculation unit 21 supplies the acoustic characteristic information to the process selection unit 22 together with the importance indices, and the process selection unit 22, for example, increases the number of object sound sources subjected to light rendering processing as the degree of reverberation indicated by the acoustic characteristic information increases.
Furthermore, for example, when the video meta information includes arrangement information indicating the placement positions of object sound sources and other objects in the space, the processing to be performed on each object sound source may be selected based on that arrangement information.
Specifically, for example, when an object sound source is hidden from the viewer by an object such as a wall in the space, light rendering processing may be performed on that object sound source.
In such a case, for example, the importance index calculation unit 21 calculates the distance from the viewer to the object sound source as one importance index, and calculates, based on the user position information, sound source position information, and arrangement information, information indicating whether the object sound source is visible to the viewer as another importance index. The process selection unit 22 then selects the processing to be performed on the object sound source based on the two importance indices calculated for it.
<Example 2 of reduction processing>
Next, another example of the reduction processing performed by the server 11 will be described.
This example is particularly effective when there are constraints on the computing resources of the server 11, that is, on the computation processing capability of the rendering processing unit 23 or on the transmission speed of the output audio stream, and the processing to be performed on each object sound source must be selected so as to satisfy those constraints.
The reduction processing performed by the server 11 will be described below with reference to the flowchart of FIG. 5. Note that the reduction processing described with reference to FIG. 5 is not performed per object sound source; it is performed once for all object sound sources together.
The processing of steps S101 and S102 is the same as the processing of steps S12 through S14 of FIG. 2, so its description is omitted. For example, in step S102, the distance from the viewer to each object sound source or the like is calculated as the importance index, and the importance indices and specification information are supplied from the importance index calculation unit 21 to the process selection unit 22.
In step S103, the process selection unit 22 determines, based on the specification information supplied from the importance index calculation unit 21, the number of object sound sources on which strict rendering processing is to be performed. In other words, the number of object sound sources to which strict rendering processing is assigned, that is, for which strict rendering processing is selected, and the number of object sound sources to which light rendering processing is assigned are determined.
For example, the higher the computation processing capability of the rendering processing unit 23 indicated by the computation specification information, the larger the number of object sound sources on which strict rendering processing is performed (hereinafter also referred to as the number of assigned sound sources).
At this time, the number of assigned sound sources is determined so that the computation processing capability required for the various processes, such as the rendering processing, for all object sound sources does not exceed the computation processing capability indicated by the computation specification information.
Note that the number of assigned sound sources may be determined based on the specification information, may be a predetermined number, or may be a number specified externally. Also, when processing to change the reproduction bit rate or processing to integrate object sound sources is selected as processing for reducing the amount of computation and transmission, the number of object sound sources on which each of these processes is performed may be determined based on, for example, at least one of the computation specification information and the transmission speed specification information. In this case, for example, the number of object sound sources on which each process is performed is determined so that the transmission speed required to transmit the resulting output audio stream does not exceed the maximum transmission speed indicated by the transmission speed specification information.
In step S104, the process selection unit 22 selects one object sound source as the object sound source to be processed and determines whether the object sound source to be processed ranks within the number of assigned sound sources determined in step S103.
For example, on the basis of the importance indexes of all the object sound sources, the process selection unit 22 ranks the object sound sources so that an object sound source whose importance index indicates a higher importance receives a higher rank. Then, where AS denotes the number of assigned sound sources, the process selection unit 22 determines that the object sound source to be processed ranks within the number of assigned sound sources if its rank is within the top AS overall.
If it is determined in step S104 that the object sound source ranks within the number of assigned sound sources, the process proceeds to step S105.
In step S105, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed and supplies the selection result to the rendering processing unit 23, after which the process proceeds to step S107.
On the other hand, if it is determined in step S104 that the object sound source does not rank within the number of assigned sound sources, that is, if the rank of the object sound source to be processed is lower than AS, the process proceeds to step S106.
In step S106, the process selection unit 22 selects the light rendering process as the process to be performed on the object sound source to be processed and supplies the selection result to the rendering processing unit 23, after which the process proceeds to step S107.
When the rendering process to be performed on the object sound source to be processed has been selected in step S105 or step S106, the process selection unit 22 determines in step S107 whether a process has been selected for all object sound sources.
If it is determined in step S107 that a process has not yet been selected for all object sound sources, the process returns to step S104 and the above-described processing is repeated. That is, an object sound source that has not yet been processed is set as the new object sound source to be processed, and the rendering process to be performed on it is selected.
On the other hand, if it is determined in step S107 that a process has been selected for all object sound sources, the reduction process ends. The processing in steps S103 to S107 described above corresponds to the processing in step S15 in FIG. 2.
As described above, the server 11 determines the number of assigned sound sources on the basis of the spec information and selects the rendering process to be performed on each object sound source so that the strict rendering process is performed on as many object sound sources as the number of assigned sound sources.
As a result, the strict rendering process is performed on the highly important object sound sources up to the number of assigned sound sources, and the light rendering process is performed on the remaining object sound sources.
Therefore, for example, when the distance from the viewer to each object sound source is calculated as the importance index, the strict rendering process is assigned to as many object sound sources as the number of assigned sound sources in ascending order of distance from the viewer. In this case, since the strict rendering process is assigned to the more important object sound sources, that is, the object sound sources at shorter distances, the computation and transmission amounts can be reduced without impairing the sense of presence of the content audio.
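The ranking and allocation of steps S103 to S107 can be summarized in a few lines. The following is a minimal Python sketch assuming a single numeric importance value per source; all function and variable names here are illustrative and do not appear in the patent.

    def select_rendering_processes(importance, num_assigned):
        """Rank object sound sources by importance and assign the strict
        rendering process to the top-ranked sources (sketch of steps
        S103 to S107; names are illustrative).

        importance: dict mapping a sound-source id to its importance
            index, where a larger value means higher importance (for
            example, the negated viewer-to-source distance).
        num_assigned: the number of assigned sound sources AS, chosen so
            that the total rendering cost stays within the computation spec.
        """
        ranked = sorted(importance, key=lambda s: importance[s], reverse=True)
        selection = {}
        for rank, source_id in enumerate(ranked):
            # Sources ranked within the top AS receive the strict process.
            selection[source_id] = "strict" if rank < num_assigned else "light"
        return selection

    # Example: three sources at 1 m, 4 m and 10 m from the viewer, one slot.
    distances = {"obj1": 1.0, "obj2": 4.0, "obj3": 10.0}
    importance = {k: -d for k, d in distances.items()}  # nearer = more important
    print(select_rendering_processes(importance, num_assigned=1))
    # {'obj1': 'strict', 'obj2': 'light', 'obj3': 'light'}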
<Example 3 of reduction processing>
Alternatively, the importance information included in the audio meta information or the recording/editing meta information may be used as the importance index as it is. Hereinafter, the reduction process performed by the server 11 in such a case will be described with reference to the flowchart of FIG. 6. Note that the reduction process described with reference to FIG. 6 is performed for each object sound source.
In step S131, the importance index calculation unit 21 acquires the importance information as meta information for the object sound source to be processed.
For example, the importance index calculation unit 21 acquires the importance information from the audio meta information of the input audio stream, or acquires it from the recording/editing meta information supplied from the meta information storage unit 13. The importance index calculation unit 21 supplies the acquired importance information as it is to the process selection unit 22 as the importance index. The processing in step S131 corresponds to the processing in steps S12 to S14 in FIG. 2.
Note that the importance information may be used for each object sound source in units of one frame or of several frames, or one piece of importance information may be used in common for all frames.
In step S132, the process selection unit 22 determines whether there is importance information for the object sound source to be processed. For example, the process selection unit 22 determines that there is importance information when the importance information of the object sound source to be processed has been supplied as the importance index from the importance index calculation unit 21.
If it is determined in step S132 that there is no importance information, the process proceeds to step S135.
On the other hand, if it is determined in step S132 that there is importance information, in step S133 the process selection unit 22 determines whether the importance indicated by the importance information of the object sound source to be processed, supplied from the importance index calculation unit 21, is equal to or greater than a predetermined threshold. Note that the threshold used in step S133 may be a predetermined value or a value determined on the basis of the spec information or the like.
If it is determined in step S133 that the importance is not equal to or greater than the threshold, the process proceeds to step S135.
On the other hand, if it is determined in step S133 that the importance is equal to or greater than the threshold, the process proceeds to step S134.
In step S134, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction process ends. An object sound source whose importance is equal to or greater than the threshold is an important object sound source that should be reproduced faithfully, so the strict rendering process is selected.
If it is determined in step S132 that there is no importance information, or if it is determined in step S133 that the importance is not equal to or greater than the threshold, the processing of step S135 is performed.
In step S135, the process selection unit 22 selects the light rendering process as the process to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction process ends.
In this way, the light rendering process is selected for an object sound source that has no importance information to be used as the importance index or whose importance information indicates a low importance, and the amount of computation at rendering time is thereby reduced.
The processing in steps S132 to S135 described above corresponds to the processing in step S15 in FIG. 2. In particular, in this example, the strict rendering process is performed only on object sound sources that have importance information and whose importance information indicates an importance equal to or greater than a certain value.
As described above, the server 11 selects the process to be performed on each object sound source using the importance information as the importance index. By using the importance information in this way, the amount of computation can be reduced appropriately, and content can be reproduced with a high sense of presence even with a small amount of computation.
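This metadata-driven branch (steps S132 to S135) fits in a single function. The sketch below is illustrative only: the names and the numeric scale of the importance value are assumptions, since the patent leaves the encoding of the importance information open.

    def select_by_importance_info(importance_info, threshold):
        """Select a rendering process from optional importance metadata
        (sketch of steps S132 to S135; names are illustrative).

        importance_info: the importance value carried in the audio meta
            information or recording/editing meta information, or None
            when that information is absent.
        threshold: a predetermined value, or one derived from spec
            information.
        """
        if importance_info is not None and importance_info >= threshold:
            return "strict"  # important source: reproduce it faithfully
        return "light"       # no metadata, or low importance: save computation

    print(select_by_importance_info(0.9, threshold=0.5))   # strict
    print(select_by_importance_info(0.2, threshold=0.5))   # light
    print(select_by_importance_info(None, threshold=0.5))  # light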
<Example 4 of reduction processing>
Furthermore, when a large number of object sound sources exist in the space, if two or more object sound sources can be integrated by a simple process and handled as one object sound source, the amount of computation for the rendering process and the like can be reduced. In addition, integrating object sound sources also reduces the data amount of the output audio stream, so the transmission amount can be reduced as well.
Accordingly, two or more object sound sources may be integrated into one object sound source on the basis of importance indexes. Hereinafter, the reduction process performed by the server 11 in such a case will be described with reference to the flowchart of FIG. 7.
Note that the reduction process described with reference to FIG. 7 is performed for each combination of object sound sources. Here, however, to simplify the description, a case will be described in which two arbitrary object sound sources are set as the object sound sources to be processed and the reduction process is performed on them.
In step S161, the importance index calculation unit 21 acquires the user position information and the sound source position information of the two object sound sources to be processed.
That is, the importance index calculation unit 21 acquires the user position information from the video meta information and acquires the sound source position information of the two object sound sources to be processed from the audio meta information. The processing in step S161 corresponds to the processing in steps S12 and S13 in FIG. 2.
In step S162, the importance index calculation unit 21 calculates the distance between the two object sound sources in the space on the basis of their sound source position information acquired in step S161.
In step S163, the importance index calculation unit 21 calculates, on the basis of the user position information and the sound source position information acquired in step S161, the angle difference between the direction of one of the object sound sources to be processed as seen from the viewer and the direction of the other object sound source to be processed.
For example, the importance index calculation unit 21 calculates, as the angle difference, the angle formed by a vector whose start point is the position of the viewer in the space and whose end point is the position of one of the object sound sources to be processed, and a vector whose start point is the position of the viewer and whose end point is the position of the other object sound source to be processed.
Here, focusing on one of the object sound sources to be processed, for example, the angle difference between the direction of that object sound source as seen from the viewer in the space and the direction of the other object sound source as seen from the viewer has thus been obtained as an importance index of the first object sound source.
The importance index calculation unit 21 supplies the distance obtained in step S162 and the angle difference obtained in step S163 to the process selection unit 22 as importance indexes of the object sound sources to be processed. The processing in steps S162 and S163 corresponds to the processing in step S14 in FIG. 2.
In step S164, the process selection unit 22 determines whether the distance between the object sound sources supplied from the importance index calculation unit 21 is equal to or less than a predetermined threshold.
If it is determined in step S164 that the distance between the object sound sources is not equal to or less than the threshold, the process proceeds to step S167.
On the other hand, if it is determined in step S164 that the distance between the object sound sources is equal to or less than the threshold, in step S165 the process selection unit 22 determines whether the angle difference between the directions of the object sound sources obtained in step S163 is equal to or less than a predetermined threshold.
Note that the angle difference used as an importance index may be calculated only when it is determined in step S164 that the distance between the object sound sources is equal to or less than the threshold. Also, the threshold used in step S164 is different from the threshold used in step S165.
If it is determined in step S165 that the angle difference is not equal to or less than the threshold, the process proceeds to step S167.
On the other hand, if it is determined in step S165 that the angle difference is equal to or less than the threshold, the process proceeds to step S166.
In step S166, the process selection unit 22 selects, as the process to be performed on the two object sound sources to be processed, a process of integrating the two object sound sources, supplies the selection result to the rendering processing unit 23, and the reduction process ends.
In this case, the two object sound sources to be processed lie within a certain distance of each other and in substantially the same direction as seen from the viewer. Therefore, even if these two object sound sources are integrated into one object sound source, no large shift occurs in the sound image position or the like, and the sense of presence of the content audio is not impaired.
Accordingly, the process selection unit 22 integrates two such object sound sources into one object sound source, thereby reducing the amount of computation at rendering time and the transmission amount of the output audio stream without impairing the sense of presence of the content audio.
When the processing of step S166 has been performed, a process of integrating the two object sound sources into one object sound source is performed in step S16 of FIG. 2. At that time, the position of the integrated object sound source in the space is set to, for example, the average of the coordinates indicating the positions of the two object sound sources, that is, the position of the mean coordinates.
If it is determined in step S164 that the distance between the object sound sources is not equal to or less than the threshold, or if it is determined in step S165 that the angle difference is not equal to or less than the threshold, the processing of step S167 is performed.
In step S167, the process selection unit 22 causes the two object sound sources to be processed individually. That is, the process selection unit 22 selects not to integrate the two object sound sources to be processed, supplies the selection result to the rendering processing unit 23, and the reduction process ends.
In this case, the two object sound sources to be processed are separated by at least a certain distance in the space, or, even if they are within a certain distance, they lie in sufficiently different directions as seen from the viewer.
Therefore, in such a case, if the two object sound sources to be processed were integrated into one, the localization position of the sound image could change greatly before and after the integration depending on the position of the integrated object sound source, and the sense of presence of the content audio could be impaired. Accordingly, in step S167, the two object sound sources to be processed are processed individually without being integrated.
The processing in steps S164 to S167 described above corresponds to the processing in step S15 in FIG. 2.
As described above, the server 11 calculates importance indexes on the basis of the user position information and the sound source position information and selects the process to be performed on the object sound sources to be processed on the basis of the obtained importance indexes. By integrating object sound sources in accordance with the positional relationship between the two object sound sources to be processed and the viewer in this way, the computation and transmission amounts can be reduced appropriately, and content can be reproduced with a high sense of presence even with small computation and transmission amounts.
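Concretely, the distance test, the vector angle test, and the mean-coordinate merge of steps S161 to S167 can be sketched as follows in Python. The function name, the threshold values, and the return convention are illustrative assumptions; the patent specifies only the tests themselves.

    import math

    def maybe_integrate(viewer, src_a, src_b, dist_thresh, angle_thresh_deg):
        """Decide whether two object sound sources should be integrated
        (sketch of steps S161 to S167; names and thresholds illustrative).

        viewer, src_a, src_b: (x, y, z) positions in the space.
        Returns the averaged position of the merged source, or None when
        the two sources should be processed individually.
        """
        # Step S162: distance between the two object sound sources.
        dist = math.dist(src_a, src_b)

        # Step S163: angle between the viewer-to-source vectors.
        va = [a - v for a, v in zip(src_a, viewer)]
        vb = [b - v for b, v in zip(src_b, viewer)]
        dot = sum(x * y for x, y in zip(va, vb))
        norm = math.hypot(*va) * math.hypot(*vb)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

        # Steps S164/S165: integrate only if both indexes are small enough.
        if dist <= dist_thresh and angle <= angle_thresh_deg:
            # Steps S166/S16: merged source placed at the mean coordinates.
            return tuple((a + b) / 2 for a, b in zip(src_a, src_b))
        return None  # step S167: keep the sources separate

    print(maybe_integrate((0, 0, 0), (5, 0.2, 0), (5, -0.2, 0), 1.0, 10.0))
    # (5.0, 0.0, 0.0): close together and in nearly the same direction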
Note that the reduction processes are not limited to those described with reference to FIGS. 4 to 7; other reduction processes, such as changing the playback bit rate of an object sound source, can also be performed, and any of the reduction processes described with reference to FIGS. 4 to 7 can be combined with other reduction processes.
For example, when the playback bit rate is changed in order to reduce the computation and transmission amounts, it suffices, in the reduction processes of FIGS. 4 to 6 described above, to select whether or not to perform the process of changing the playback bit rate for each object sound source, instead of selecting between the strict rendering process and the light rendering process.
In this case, in the reduction processes of FIGS. 4 to 6, the process of changing the playback bit rate is selected in the steps where the light rendering process would have been selected, and the process of changing the playback bit rate is not performed in the steps where the strict rendering process would have been selected.
Furthermore, although the case where the rendering process is performed in the server 11 has been described above, the rendering process and the like may instead be performed in the free viewpoint video reproduction unit 32 or the like of the client device 12 in accordance with the selection result of the process selection unit 22.
<Example of computer configuration>
The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, a program constituting the software is installed in a computer. Here, the computer includes a computer incorporated in dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions when various programs are installed in it.
FIG. 8 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processes by means of a program.
In the computer, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to one another via a bus 504.
An input/output interface 505 is further connected to the bus 504. An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input/output interface 505.
The input unit 506 includes a keyboard, a mouse, a microphone, an image sensor, and the like. The output unit 507 includes a display, a speaker, and the like. The recording unit 508 includes a hard disk, a nonvolatile memory, and the like. The communication unit 509 includes a network interface and the like. The drive 510 drives a removable recording medium 511 such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory.
In the computer configured as described above, the above-described series of processes is performed, for example, by the CPU 501 loading a program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executing it.
The program executed by the computer (CPU 501) can be provided, for example, recorded on the removable recording medium 511 as packaged media or the like. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
In the computer, the program can be installed in the recording unit 508 via the input/output interface 505 by mounting the removable recording medium 511 on the drive 510. The program can also be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. In addition, the program can be installed in the ROM 502 or the recording unit 508 in advance.
Note that the program executed by the computer may be a program in which the processes are performed in time series in the order described in this specification, or a program in which the processes are performed in parallel or at necessary timings, such as when a call is made.
Embodiments of the present technology are not limited to the above-described embodiments, and various modifications can be made without departing from the gist of the present technology.
For example, the present technology can take a cloud computing configuration in which one function is shared and jointly processed by a plurality of devices via a network.
Each step described in the above flowcharts can be executed by one device or can be shared and executed by a plurality of devices.
Furthermore, when a plurality of processes are included in one step, the plurality of processes included in that one step can be executed by one device or can be shared and executed by a plurality of devices.
Furthermore, the present technology can also be configured as follows.
(1)
An audio processing device including a process selection unit that selects a process to be performed on audio data of an object sound source on the basis of one or more importance indexes serving as indexes of the importance of the object sound source.
(2)
The audio processing device according to (1), in which the process selection unit selects, as the process, a process for reducing a computation amount or a transmission amount.
(3)
The audio processing device according to (1) or (2), in which the process selection unit selects, as the process, one of a plurality of rendering processes having mutually different computation amounts.
(4)
The audio processing device according to any one of (1) to (3), in which the process selection unit selects, as the process, a process of integrating the audio data of a plurality of the object sound sources.
(5)
The audio processing device according to any one of (1) to (4), in which the process selection unit selects, as the process, a process of changing a playback bit rate of the audio data of the object sound source.
(6)
The audio processing device according to any one of (1) to (5), further including an importance index calculation unit that calculates the importance index on the basis of meta information related to the audio data.
(7)
The audio processing device according to (6), in which the importance index calculation unit calculates the importance index on the basis of at least one of, as the meta information, position information of the object sound source, position information of a viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of a space, and arrangement information of objects in the space.
(8)
The audio processing device according to (6) or (7), in which the importance index calculation unit calculates, as the importance index, a distance between the object sound source and a viewer in a space.
(9)
The audio processing device according to any one of (6) to (8), in which the importance index calculation unit uses the importance information of the object sound source, as the meta information, directly as the importance index.
(10)
The audio processing device according to any one of (6) to (9), in which the importance index calculation unit calculates, as the importance index, a distance between two object sound sources in a space.
(11)
The audio processing device according to any one of (6) to (10), in which the importance index calculation unit calculates, as the importance index, an angle difference between the direction of the object sound source as seen from a viewer in a space and the direction of another object sound source as seen from the viewer.
(12)
The audio processing device according to any one of (1) to (11), in which the process selection unit determines, for each of a plurality of the processes, the number of object sound sources on which that process is performed, on the basis of at least one of computation spec information indicating the processing capability of a processing unit that performs the process and transmission rate spec information indicating the maximum transmission rate of the audio data.
(13)
An audio processing method including:
an acquisition step of acquiring one or more importance indexes serving as indexes of the importance of an object sound source; and
a process selection step of selecting a process to be performed on audio data of the object sound source on the basis of the one or more importance indexes.
11 server, 12 client device, 13 meta information storage unit, 21 importance index calculation unit, 22 process selection unit, 23 rendering processing unit, 24 audio stream transmission unit, 31 audio stream reception unit, 32 free viewpoint video reproduction unit

Claims (13)

1. An audio processing device comprising a process selection unit that selects a process to be performed on audio data of an object sound source on the basis of one or more importance indexes serving as indexes of the importance of the object sound source.
2. The audio processing device according to claim 1, wherein the process selection unit selects, as the process, a process for reducing a computation amount or a transmission amount.
3. The audio processing device according to claim 1, wherein the process selection unit selects, as the process, one of a plurality of rendering processes having mutually different computation amounts.
4. The audio processing device according to claim 1, wherein the process selection unit selects, as the process, a process of integrating the audio data of a plurality of the object sound sources.
5. The audio processing device according to claim 1, wherein the process selection unit selects, as the process, a process of changing a playback bit rate of the audio data of the object sound source.
6. The audio processing device according to claim 1, further comprising an importance index calculation unit that calculates the importance index on the basis of meta information related to the audio data.
7. The audio processing device according to claim 6, wherein the importance index calculation unit calculates the importance index on the basis of at least one of, as the meta information, position information of the object sound source, position information of a viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of a space, and arrangement information of objects in the space.
8. The audio processing device according to claim 6, wherein the importance index calculation unit calculates, as the importance index, a distance between the object sound source and a viewer in a space.
9. The audio processing device according to claim 6, wherein the importance index calculation unit uses the importance information of the object sound source, as the meta information, directly as the importance index.
10. The audio processing device according to claim 6, wherein the importance index calculation unit calculates, as the importance index, a distance between two object sound sources in a space.
11. The audio processing device according to claim 6, wherein the importance index calculation unit calculates, as the importance index, an angle difference between the direction of the object sound source as seen from a viewer in a space and the direction of another object sound source as seen from the viewer.
12. The audio processing device according to claim 1, wherein the process selection unit determines, for each of a plurality of the processes, the number of object sound sources on which that process is performed, on the basis of at least one of computation spec information indicating the processing capability of a processing unit that performs the process and transmission rate spec information indicating the maximum transmission rate of the audio data.
13. An audio processing method comprising:
an acquisition step of acquiring one or more importance indexes serving as indexes of the importance of an object sound source; and
a process selection step of selecting a process to be performed on audio data of the object sound source on the basis of the one or more importance indexes.
PCT/JP2017/030858 2016-09-12 2017-08-29 Sound processing device and method WO2018047667A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016-177336 2016-09-12
JP2016177336 2016-09-12

Publications (1)

Publication Number Publication Date
WO2018047667A1 WO2018047667A1 (en) 2018-03-15

Family

ID=61561462

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/030858 WO2018047667A1 (en) 2016-09-12 2017-08-29 Sound processing device and method

Country Status (1)

Country Link
WO (1) WO2018047667A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018180531A1 (en) * 2017-03-28 2018-10-04 ソニー株式会社 Information processing device, information processing method, and program
WO2019116890A1 (en) * 2017-12-12 2019-06-20 ソニー株式会社 Signal processing device and method, and program
WO2020105423A1 (en) * 2018-11-20 2020-05-28 ソニー株式会社 Information processing device and method, and program
WO2020153092A1 (en) * 2019-01-25 2020-07-30 ソニー株式会社 Information processing device, and information processing method
CN111903136A (en) * 2018-03-29 2020-11-06 索尼公司 Information processing apparatus, information processing method, and program
EP3809709A1 (en) * 2019-10-14 2021-04-21 Koninklijke Philips N.V. Apparatus and method for audio encoding
WO2021140959A1 (en) * 2020-01-10 2021-07-15 ソニーグループ株式会社 Encoding device and method, decoding device and method, and program
JP2021136465A (en) * 2020-02-21 2021-09-13 日本放送協会 Receiver, content transfer system, and program
JP2022505964A (en) * 2018-10-26 2022-01-14 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Directional volume map based audio processing
WO2023199778A1 (en) * 2022-04-14 2023-10-19 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Acoustic signal processing method, program, acoustic signal processing device, and acoustic signal processing system
WO2023199673A1 (en) * 2022-04-14 2023-10-19 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Stereophonic sound processing method, stereophonic sound processing device, and program
RU2815621C1 (en) * 2018-08-28 2024-03-19 Конинклейке Филипс Н.В. Audio device and audio processing method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1127800A (en) * 1997-07-03 1999-01-29 Fujitsu Ltd Stereophonic processing system
JP2009278381A (en) * 2008-05-14 2009-11-26 Nippon Hoso Kyokai <Nhk> Acoustic signal multiplex transmission system, manufacturing device, and reproduction device added with sound image localization acoustic meta-information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1127800A (en) * 1997-07-03 1999-01-29 Fujitsu Ltd Stereophonic processing system
JP2009278381A (en) * 2008-05-14 2009-11-26 Nippon Hoso Kyokai <Nhk> Acoustic signal multiplex transmission system, manufacturing device, and reproduction device added with sound image localization acoustic meta-information

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018180531A1 (en) * 2017-03-28 2018-10-04 ソニー株式会社 Information processing device, information processing method, and program
US11074921B2 (en) 2017-03-28 2021-07-27 Sony Corporation Information processing device and information processing method
WO2019116890A1 (en) * 2017-12-12 2019-06-20 ソニー株式会社 Signal processing device and method, and program
US11310619B2 (en) 2017-12-12 2022-04-19 Sony Corporation Signal processing device and method, and program
US11838742B2 (en) 2017-12-12 2023-12-05 Sony Group Corporation Signal processing device and method, and program
CN111903136A (en) * 2018-03-29 2020-11-06 索尼公司 Information processing apparatus, information processing method, and program
RU2815621C1 (en) * 2018-08-28 2024-03-19 Конинклейке Филипс Н.В. Audio device and audio processing method
JP7526173B2 (en) 2018-10-26 2024-07-31 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Directional Loudness Map Based Audio Processing
JP2022505964A (en) * 2018-10-26 2022-01-14 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Directional volume map based audio processing
JP2022177253A (en) * 2018-10-26 2022-11-30 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Directional volume map-based audio processing
JP7468359B2 (en) 2018-11-20 2024-04-16 ソニーグループ株式会社 Information processing device, method, and program
JPWO2020105423A1 (en) * 2018-11-20 2021-10-14 ソニーグループ株式会社 Information processing equipment and methods, and programs
WO2020105423A1 (en) * 2018-11-20 2020-05-28 ソニー株式会社 Information processing device and method, and program
US12073841B2 (en) 2019-01-25 2024-08-27 Sony Group Corporation Information processing device and information processing method
WO2020153092A1 (en) * 2019-01-25 2020-07-30 ソニー株式会社 Information processing device, and information processing method
JPWO2020153092A1 (en) * 2019-01-25 2021-12-02 ソニーグループ株式会社 Information processing equipment and information processing method
US20220122616A1 (en) * 2019-01-25 2022-04-21 Sony Group Corporation Information processing device and information processing method
JP7415954B2 (en) 2019-01-25 2024-01-17 ソニーグループ株式会社 Information processing device and information processing method
WO2021074007A1 (en) * 2019-10-14 2021-04-22 Koninklijke Philips N.V. Apparatus and method for audio encoding
CN114600188A (en) * 2019-10-14 2022-06-07 皇家飞利浦有限公司 Apparatus and method for audio coding
RU2823537C1 (en) * 2019-10-14 2024-07-23 Конинклейке Филипс Н.В. Audio encoding device and method
EP3809709A1 (en) * 2019-10-14 2021-04-21 Koninklijke Philips N.V. Apparatus and method for audio encoding
WO2021140959A1 (en) * 2020-01-10 2021-07-15 ソニーグループ株式会社 Encoding device and method, decoding device and method, and program
JP7457525B2 (en) 2020-02-21 2024-03-28 日本放送協会 Receiving device, content transmission system, and program
JP2021136465A (en) * 2020-02-21 2021-09-13 日本放送協会 Receiver, content transfer system, and program
WO2023199673A1 (en) * 2022-04-14 2023-10-19 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Stereophonic sound processing method, stereophonic sound processing device, and program
WO2023199778A1 (en) * 2022-04-14 2023-10-19 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Acoustic signal processing method, program, acoustic signal processing device, and acoustic signal processing system

Similar Documents

Publication Publication Date Title
WO2018047667A1 (en) Sound processing device and method
US20200260210A1 (en) Audio parallax for virtual reality, augmented reality, and mixed reality
JP7479352B2 (en) Audio device and method for audio processing
CN110447071B (en) Information processing apparatus, information processing method, and removable medium recording program
TW202133625A (en) Selecting audio streams based on motion
Quackenbush et al. MPEG standards for compressed representation of immersive audio
CN103609143B (en) For catching and the method for playback sources from the sound of multiple sound source
CN111492342B (en) Audio scene processing
JP7457525B2 (en) Receiving device, content transmission system, and program
JP7533223B2 (en) AUDIO SYSTEM, AUDIO PLAYBACK DEVICE, SERVER DEVICE, AUDIO PLAYBACK METHOD, AND AUDIO PLAYBACK PROGRAM
KR20240001226A (en) 3D audio signal coding method, device, and encoder
KR20230060502A (en) Signal processing device and method, learning device and method, and program
EP4055840A1 (en) Signalling of audio effect metadata in a bitstream
RU2815621C1 (en) Audio device and audio processing method
RU2823573C1 (en) Audio device and audio processing method
RU2815366C2 (en) Audio device and audio processing method
RU2798414C2 (en) Audio device and audio processing method
WO2022034805A1 (en) Signal processing device and method, and audio playback system
RU2816884C2 (en) Audio device, audio distribution system and method of operation thereof
WO2024084920A1 (en) Sound processing method, sound processing device, and program
CN116866817A (en) Device and method for presenting spatial audio content

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17848608

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 17848608

Country of ref document: EP

Kind code of ref document: A1