WO2018047667A1 - Sound processing device and method - Google Patents

Sound processing device and method

Info

Publication number
WO2018047667A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
object sound
importance
information
processing
Prior art date
Application number
PCT/JP2017/030858
Other languages
French (fr)
Japanese (ja)
Inventor
由楽 池宮
光行 畠中
矢ケ崎 陽一
高林 和彦
富三 白石
Original Assignee
Sony Corporation (ソニー株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation (ソニー株式会社)
Publication of WO2018047667A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00: Circuits for transducers, loudspeakers or microphones
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 7/00: Indicating arrangements; control arrangements, e.g. balance control

Definitions

  • The present technology relates to an audio processing apparatus and method, and more particularly to an audio processing apparatus and method capable of reproducing content with a high sense of presence while keeping the amount of computation or transmission small.
  • This can be realized by recording the sound of each sound source present in a free viewpoint video as an object sound source and rendering it appropriately according to the viewing environment.
  • For example, binaural playback is performed when sound is reproduced through headphones, and the sound image can be localized at an appropriate position by processing such as wavefront synthesis when sound is reproduced through speakers.
  • A technique has also been proposed that reduces the amount of calculation by selecting, based on priority information of each audio channel or audio object, whether or not to decode its data (see, for example, Patent Document 1).
  • In such rendering, convolution with a filter corresponding to the direction of and the distance to the sound source is performed, so the amount of calculation of the rendering process increases.
  • the amount of calculation increases in proportion to the number of sound sources.
  • the amount of audio stream transmission increases when the number of sound sources is large.
  • The technique of selecting whether to perform decoding based on priority information can reduce the amount of computation, but since the data of audio channels or audio objects with low priority is not decoded, the audio of those channels or objects is not played at all. This may impair the sense of presence during content playback.
  • the present technology has been made in view of such a situation, and makes it possible to perform highly realistic content reproduction with a small amount of calculation or transmission.
  • An audio processing device includes a process selection unit that selects a process to be performed on the audio data of an object sound source based on one or more importance indices that are indices of the importance of the object sound source.
  • the process selection unit can select a process for reducing a calculation amount or a transmission amount as the process.
  • the process selection unit can select any one of a plurality of rendering processes having different calculation amounts as the process.
  • the process selection unit can select a process for integrating the audio data of a plurality of the object sound sources as the process.
  • the process selection unit can select a process for changing the reproduction bit rate of the audio data of the object sound source as the process.
  • The audio processing device may further include an importance index calculation unit that calculates the importance index based on meta information related to the audio data.
  • The importance index can be calculated based on at least one of the following items of the meta information: position information of the object sound source, position information of the viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of the space, and arrangement information of objects in the space.
  • the importance index calculation unit can calculate the distance between the object sound source and the viewer in space as the importance index.
  • The importance index calculation unit can use the importance information of the object sound source included in the meta information directly as the importance index.
  • the importance index calculation unit can calculate the distance between the two object sound sources in space as the importance index.
  • The importance index calculation unit can calculate, as the importance index, the angle difference between the direction of the object sound source as viewed from the viewer in the space and the direction of another object sound source as viewed from the viewer.
  • The process selection unit can determine, for each of a plurality of the processes, the number of object sound sources on which the process is performed, based on at least one of calculation specification information indicating the calculation processing capability of the processing unit performing the process and transmission rate specification information indicating the maximum transmission rate of the audio data.
  • An audio processing method includes an acquisition step of acquiring one or more importance indices that are indices of the importance of an object sound source, and a process selection step of selecting, based on the one or more importance indices, a process to be performed on the audio data of the object sound source.
  • That is, the process to be performed on the audio data of the object sound source is selected based on one or more importance indices of the object sound source.
  • highly realistic content reproduction can be performed with a small amount of calculation or transmission.
  • The present technology calculates an index of the importance of each object sound source using the meta information of the content and the like, determines from it whether the sound source is one that should be reproduced more strictly, and changes the handling of each object sound source accordingly, so that the content can be reproduced with a high sense of reality while reducing the amount of calculation and transmission.
  • the rendering process refers to the entire process of remapping the sound data of the object sound source to the sound data of the number of channels suitable for the reproduction environment in accordance with the number of channels and the reproduction conditions of the reproduction device.
  • an audio stream including audio data for each channel of each object sound source is input as an audio stream for reproducing the audio of content including video and accompanying audio.
  • the content is, for example, free viewpoint video content.
  • a video object on the content video corresponds to an audio object of content audio, that is, an object sound source.
  • the sound of the object sound source is assumed to be the sound of the video object.
  • rendering processing is performed based on the input audio stream, and an audio stream for reproducing the content audio is generated in the reproduction device, that is, the client device, and the content audio is reproduced on the reproduction device.
  • Hereinafter, the audio stream that is input is also referred to as the input audio stream, and the audio stream output to the playback device is also referred to as the output audio stream.
  • Hereinafter, processing for reducing the amount of calculation or transmission is also referred to as reduction processing. More specifically, in the reduction processing, depending on the object sound source, a selection result indicating that no process for reducing the amount of calculation or transmission is to be performed, that is, that normal processing is selected, may also be obtained.
  • processing TR1 to processing TR3 shown below is performed as the reduction processing for reducing the calculation amount and the transmission amount.
  • For each object sound source, it is determined whether or not it is an object sound source that should be reproduced more strictly, and the rendering process performed on that object sound source is selected according to the determination result.
  • the strict rendering process includes a process with a large amount of calculation such as a convolution process using a filter coefficient, and is a process that can localize a sound image with higher accuracy.
  • A rendering process that cannot localize a sound image with such high accuracy but requires only a small amount of computation will be referred to as a light rendering process.
  • the light rendering processing is VBAP (Vector Base Amplitude Panning) or the like.
  • Here, the description assumes two types of rendering processes with different computational complexity and sound image localization accuracy: the light rendering process and the strict rendering process.
  • three or more rendering processes may be defined as rendering processes having different calculation amounts and sound image localization accuracy.
  • any of light rendering processing, somewhat strict rendering processing, and strict rendering processing is selected for each object sound source as the rendering processing.
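  • As a concrete illustration of the light rendering process, the following is a minimal 2D VBAP sketch under the usual pairwise-panning formulation; the function names and the speaker setup are illustrative assumptions, not taken from the patent.

```python
# Hedged sketch of 2D VBAP (Vector Base Amplitude Panning).
import numpy as np

def vbap_2d(source_deg, spk1_deg, spk2_deg):
    """Gains for one source panned between two loudspeakers."""
    def unit(deg):
        rad = np.radians(deg)
        return np.array([np.cos(rad), np.sin(rad)])

    # Columns of L are unit vectors toward the two speakers; solve L @ g = p.
    L = np.column_stack([unit(spk1_deg), unit(spk2_deg)])
    g = np.linalg.solve(L, unit(source_deg))
    g = np.clip(g, 0.0, None)          # no negative (out-of-arc) gains
    return g / np.linalg.norm(g)       # constant-power normalization

# A source at 20 degrees panned between speakers at 0 and 45 degrees.
print(vbap_2d(20.0, 0.0, 45.0))
```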
  • For each object sound source, it is determined whether or not it is an object sound source that should be reproduced more strictly, and the reproduction bit rate of the audio data of that object sound source is changed according to the determination result.
  • For an object sound source determined to be one that should be reproduced more strictly, the output audio stream is generated using its audio data as is, without changing the reproduction bit rate.
  • a process for changing the playback bit rate is performed on an object sound source that is low in importance and has not been determined to be played more strictly. That is, audio data having a lower reproduction bit rate is generated based on the original audio data, and an output audio stream is generated using the obtained audio data.
  • As a method of generating audio data with a lower reproduction bit rate, there is, for example, a method of down-sampling the original audio data to obtain audio data with a lower sampling frequency.
  • The reproduction bit rate of the audio data of an object sound source is determined by the sampling frequency, the number of channels of the object sound source, and the number of quantization bits. The lower the reproduction bit rate, the smaller the amount of audio data, so not only the transmission bit rate of the output audio stream, that is, the transmission amount, but also the amount of computation when generating the output audio stream can be reduced.
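  • The bit rate relation above can be written directly; the following is a minimal sketch, assuming linear PCM audio (the function and variable names are illustrative assumptions).

```python
# Playback bit rate = sampling frequency x number of channels x quantization bits.
def playback_bitrate(sampling_hz: int, channels: int, quant_bits: int) -> int:
    return sampling_hz * channels * quant_bits

full = playback_bitrate(48_000, 1, 24)     # e.g. a 48 kHz / 24-bit mono object
reduced = playback_bitrate(24_000, 1, 16)  # after down-sampling and requantizing
print(full, reduced)                       # 1152000 384000 (bits per second)
```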
  • For example, for an object sound source close to the viewer, which should be played more strictly, the playback bit rate is not changed, while for an object sound source that is far from the viewer and is likely background sound, the playback bit rate can be lowered. In this way, highly realistic content audio can be obtained while reducing the amount of computation and the amount of transmission.
  • It is assumed here that the playback device side can play back audio data of different playback bit rates by some method.
  • Further, the sound data of two object sound sources can be added at a predetermined volume ratio to form one piece of sound data, whereby the two object sound sources are integrated into one object sound source placed at a predetermined position.
  • With this integration, what previously required rendering for two object sound sources becomes rendering for one object sound source, so the amount of calculation during the rendering process can be reduced. Also, since the audio data of the two object sound sources becomes one piece of audio data, the data amount can be reduced, and as a result the transmission amount of the output audio stream can be reduced.
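  • The following is a minimal illustrative sketch of such an integration, assuming mono object signals of equal length and a weighted-average representative position; the function and parameter names are assumptions, not taken from the patent.

```python
# Hedged sketch: integrating two object sound sources into one.
import numpy as np

def integrate_sources(audio_a, audio_b, pos_a, pos_b, weight_a=0.5):
    """Mix two mono object signals at a volume ratio (weight_a : 1 - weight_a)
    and place the result at a representative position, here the weighted
    average of the two source positions (one of the options described later)."""
    weight_b = 1.0 - weight_a
    mixed = weight_a * np.asarray(audio_a) + weight_b * np.asarray(audio_b)
    position = weight_a * np.asarray(pos_a) + weight_b * np.asarray(pos_b)
    return mixed, position
```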
  • In streaming distribution of free viewpoint video content, the server side must perform the distribution while performing rendering processing in real time according to the viewing environment, such as headphones or speakers, on the viewer side, that is, the client side.
  • Since the amount of computation and transmission can be reduced by the present technology, it becomes possible to perform the computation for generating an output audio stream for each of a large number of clients and to transmit the output audio stream to each of those clients. That is, streaming distribution can be performed for a large number of clients simultaneously.
  • For example, the meta information included in the video stream for reproducing the content video (hereinafter also referred to as video meta information) includes user position information indicating the position in the space of the user who is the viewer, and line-of-sight direction information indicating the user's line-of-sight direction.
  • The meta information included in the input audio stream of the content audio (hereinafter also referred to as audio meta information) includes space information indicating the size of the space and sound source position information indicating the position of each object sound source in the space.
  • the audio meta information may include importance level information indicating the importance level of each object sound source.
  • the viewer and the audio object move from moment to moment, so their positional relationship changes. Therefore, the object sound source important for the viewer changes depending on the positional relationship.
  • object sound sources that are close to the viewer should be played back strictly in order to maintain the localization to the correct position. That is, a strict rendering process should be performed.
  • For an object sound source located sufficiently far from the viewer, only the approximate direction needs to be known, so it may be reproduced with a light rendering process or the like.
  • In addition, calculation specification information indicating the calculation processing capability, that is, the calculation processing performance, of the calculation block that generates the output audio stream, for example the calculation block that performs the rendering process, is used as appropriate.
  • For example, the number of object sound sources to which the strict rendering process is assigned can be limited to 10 or fewer, with the light rendering process performed on the other object sound sources.
  • Further, transmission rate specification information indicating the maximum transmission rate, that is, the fastest possible transmission speed (transmission bit rate), is acquired as appropriate for each of the transmission side and the reception side, and reduction processing for reducing the amount of calculation and transmission can be performed using this transmission rate specification information.
  • For example, when the maximum transmission rate indicated by the transmission rate specification information is low, the process of changing the playback bit rate or the process of integrating object sound sources is performed so that the output audio stream can be transmitted at or below the maximum transmission rate. When the transmission-side and reception-side maximum transmission rates differ, the playback bit rate may be changed or object sound sources may be integrated so that communication is possible at the slower of the two maximum transmission rates.
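  • A minimal sketch of fitting the stream to the slower of the two maximum transmission rates, assuming the down-sampling approach described above (all names are illustrative assumptions):

```python
# Choose the highest candidate sampling frequency whose resulting bit rate
# does not exceed the slower of the transmission-side and reception-side
# maximum transmission rates.
def choose_sampling_hz(channels, quant_bits, max_tx_bps, max_rx_bps,
                       candidates=(48_000, 32_000, 24_000, 16_000)):
    limit = min(max_tx_bps, max_rx_bps)
    for hz in candidates:                      # highest quality first
        if hz * channels * quant_bits <= limit:
            return hz
    return candidates[-1]                      # fall back to the lowest rate
```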
  • Hereinafter, when it is not necessary to distinguish the transmission-side transmission rate specification information from the reception-side transmission rate specification information, both are simply referred to as transmission rate specification information. Further, when it is not necessary to distinguish the calculation specification information from the transmission rate specification information, they are simply referred to as specification information.
  • Priority information, that is, importance information indicating the importance, may be added to each object sound source at the time of content recording or when editing the content after recording (such information is hereinafter also referred to as recording/editing meta information).
  • For example, a high importance is added to the sound of an object that symbolizes the scene (place), that is, to that object sound source, and a low importance is added to the sound of an object that does not.
  • the importance level information of each object sound source may be included in the audio meta information of the input audio stream. Also, the importance level information may be determined for each frame of the content audio, or may be determined in units of a plurality of frames.
  • In the present technology, an importance index indicating the importance of each object sound source is calculated using at least one piece of the information included in the video meta information, the audio meta information, and the recording/editing meta information described above.
  • a reduction process is performed using the importance index of each object sound source, and the amount of calculation when generating the output audio stream and the transmission amount of the output audio stream are reduced.
  • calculation specification information, transmission speed specification information, and the like are also used as necessary.
  • the reduction processing for reducing the calculation amount and the transmission amount may be processing other than the processing described above.
  • FIG. 1 is a diagram illustrating a configuration example of an embodiment of a content reproduction system to which the present technology is applied.
  • the content reproduction system shown in FIG. 1 includes a server 11, a client device 12, and a meta information storage unit 13.
  • the server 11 includes a network server such as a cloud, and is connected to the client device 12 operated by the user via a wired or wireless network.
  • one client device 12 is connected to the server 11, but two or more client devices 12 may be connected to the server 11.
  • The server 11 is a device capable of performing arithmetic processing on analog or digital audio signals (audio data); it generates a suitable output audio stream by switching the rendering process for the input audio stream in real time for each object sound source, and performs streaming distribution of free viewpoint video content. That is, the server 11 performs streaming distribution, to the client device 12, of the audio of free viewpoint video content supplied from the outside or recorded in advance.
  • the server 11 generates an output audio stream based on the input audio stream and the meta information and specification information, and transmits the output audio stream to the client device 12. At that time, the server 11 appropriately acquires recording / editing meta information from the meta information storage unit 13.
  • the client device 12 receives the output audio stream from the server 11 and reproduces the content audio. At this time, the client device 12 reproduces a free-viewpoint video content composed of video and audio by also playing a content video based on a video stream acquired from the server 11 or another server.
  • the server 11 includes an importance index calculation unit 21, a process selection unit 22, a rendering processing unit 23, and an audio stream transmission unit 24.
  • the client device 12 includes an audio stream receiving unit 31 and a free viewpoint video reproduction unit 32.
  • the importance index calculation unit 21 acquires meta information and spec information as necessary, calculates the importance index based on the meta information, and sends the obtained importance index and spec information to the process selection unit 22. Supply.
  • This importance index is an index indicating the importance of each object sound source.
  • For example, the importance index calculation unit 21 acquires (extracts) audio meta information from the input audio stream, acquires recording/editing meta information from the meta information storage unit 13, and acquires video meta information from the free viewpoint video reproduction unit 32 of the client device 12.
  • Further, the importance index calculation unit 21 acquires calculation specification information from the rendering processing unit 23, acquires the transmission-side transmission rate specification information from the audio stream transmission unit 24, and acquires the reception-side transmission rate specification information from the audio stream receiving unit 31.
  • the importance index calculation unit 21 supplies video meta information to the rendering processing unit 23 via the processing selection unit 22 as necessary.
  • the process selection unit 22 acquires the input audio stream and performs a reduction process for reducing the calculation amount and the transmission amount based on the importance index and the specification information supplied from the importance index calculation unit 21. Further, the process selection unit 22 supplies the processing result of the reduction process and the input audio stream to the rendering processing unit 23.
  • By the reduction processing, for example, a selection result (determination result) of whether to perform the light rendering process or the strict rendering process for each object sound source, a result of selecting which object sound sources have their reproduction bit rate changed, a result of selecting which object sound sources are integrated into one, and the like are obtained. That is, as the result of the reduction processing, a selection of what kind of process is performed on each object sound source is obtained.
  • The specification information is used, for example, to determine, for each of a plurality of processes including processes for reducing the amount of computation and transmission, the number of object sound sources on which that process is performed.
  • the rendering processing unit 23 performs rendering processing based on the result of the reduction processing supplied from the processing selection unit 22 and the input audio stream, and supplies the output audio stream obtained as a result to the audio stream transmission unit 24.
  • At this time, the rendering processing unit 23 performs the rendering process using, as appropriate, the video meta information supplied from the importance index calculation unit 21 via the process selection unit 22 or the audio meta information included in the input audio stream supplied from the process selection unit 22.
  • the rendering processing unit 23 appropriately acquires information regarding the reproduction environment of the client device 12 such as how many channels of the speaker system the client device 12 has. Then, the rendering processing unit 23 generates an output audio stream composed of audio data of each channel that can be reproduced by the client device 12 according to the reproduction environment. Furthermore, the rendering processing unit 23 also appropriately performs processing for changing the reproduction bit rate, processing for integrating object sound sources, and the like based on the result of the reduction processing.
  • the audio stream transmission unit 24 transmits the output audio stream supplied from the rendering processing unit 23 to the client device 12 via the network.
  • the audio stream receiving unit 31 of the client device 12 receives the output audio stream transmitted by the audio stream transmitting unit 24 of the server 11 and supplies it to the free viewpoint video reproduction unit 32.
  • the audio stream receiving unit 31 appropriately supplies the transmission rate specification information on the receiving side to the importance index calculating unit 21 in response to a request from the server 11.
  • The free viewpoint video reproduction unit 32 includes, for example, a sound reproduction device such as headphones or a speaker system and a device that drives the sound reproduction device, and reproduces the content audio based on the output audio stream supplied from the audio stream receiving unit 31.
  • the free viewpoint video playback unit 32 also has a display device and the like, and plays back content video based on a video stream acquired from the outside. Furthermore, the free viewpoint video reproduction unit 32 appropriately extracts video meta information from the video stream in response to a request from the server 11 and supplies the video meta information to the importance index calculation unit 21.
  • video meta information including user position information and line-of-sight direction information is extracted from a video stream.
  • However, any method of acquiring the user position information and line-of-sight direction information may be used.
  • the client device 12 may acquire user position information and line-of-sight direction information from another external device and supply the user position information and line-of-sight direction information to the importance index calculation unit 21.
  • the client device 12 may be provided with a gyro sensor that detects the user's head direction, an image sensor that captures the user, and the like so as to obtain user position information and line-of-sight direction information.
  • In such a case, the user's face direction may be specified from the output of the gyro sensor and used as the user's line-of-sight direction, or the user's line-of-sight direction or position in the space may be detected from the image obtained by the image sensor.
  • Further, the importance index calculation unit 21 may use space area information included in the video meta information of the video stream, or may use the position information and importance information of a video object included in the video meta information as the sound source position information or importance information of the object sound source corresponding to that video object.
  • the server 11 and the client device 12 are connected via a network.
  • the importance index calculation unit 21 to the audio stream transmission unit 24, the audio stream reception unit 31, and the free viewpoint video reproduction unit 32 may be provided in one apparatus.
  • Alternatively, a device provided with the importance index calculation unit 21 through the audio stream transmission unit 24 and a device provided with the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be connected by wire such as a cable.
  • a case where a free-viewpoint video content stored in a personal computer at the user's home is played on a head-mounted display can be considered.
  • In that case, the importance index calculation unit 21 through the audio stream transmission unit 24 may be provided in the personal computer, and the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be provided in the head-mounted display connected to the personal computer.
  • Alternatively, a stationary game machine main body may be configured to include the importance index calculation unit 21 through the audio stream transmission unit 24 as well as the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32.
  • Alternatively, the importance index calculation unit 21 through the audio stream transmission unit 24 may be provided in the stationary game machine main body, and the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be provided in an external device connected to the game machine main body by wire or wirelessly.
  • Suppose, for example, that the audio playback specification of the client device 12, that is, the content audio playback environment, is playback using headphones, in other words that the free viewpoint video reproduction unit 32 includes headphones.
  • In such a case, binaural playback processing, which convolves a head-related transfer function (HRTF: Head Related Transfer Function) with the sound data of the object sound source, is performed as the strict rendering process.
  • a head-related transfer function is prepared in advance for each relative positional relationship between the viewer and the object sound source in the space. Also, a head-related transfer function corresponding to the relative positional relationship between the position of the object sound source indicated by the sound source position information and the position of the viewer indicated by the user position information is selected from those head-related transfer functions.
  • Then, a convolution process is performed to convolve the selected head-related transfer function with the sound data of the object sound source, and sound data of the left and right channels in which the sound image of the object sound source is localized at the desired position is generated.
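  • A minimal sketch of this convolution, assuming time-domain head-related impulse responses (HRIRs) corresponding to the selected HRTF; the names and array shapes are illustrative assumptions:

```python
# Hedged sketch: binaural rendering of one object by HRIR convolution.
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Convolve the object's mono signal with the left/right HRIRs selected
    for the viewer-to-source relative position."""
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)
```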
  • On the other hand, a panning process that localizes the sound image by changing the volume ratio of the left and right sounds of the object sound source, based on the position and line-of-sight direction of the viewer in the space and the position of the object sound source, is performed as the light rendering process. That is, by applying the desired volume ratio to the left and right channels, sound data of the left and right channels that localizes the sound image of the object sound source at the desired position is generated.
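  • As one common way to realize such volume-ratio panning, a constant-power stereo panner can be used; the following minimal sketch is an assumption for illustration, not the patent's specific formula:

```python
# Hedged sketch: constant-power left/right panning of a mono object signal.
import numpy as np

def pan_stereo(mono, pan):
    """pan in [-1, 1]: -1 = full left, 0 = center, +1 = full right."""
    theta = (pan + 1.0) * np.pi / 4.0                  # map pan to [0, pi/2]
    return np.cos(theta) * mono, np.sin(theta) * mono  # (left, right)
```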
  • Next, suppose that the audio playback specification of the client device 12 is playback using multi-channel speakers, that is, that the free viewpoint video reproduction unit 32 includes multi-channel speakers. In such a case, processing for generating the sound data of each speaker is performed as the strict rendering process, as follows.
  • a filter coefficient determined from the positional relationship between the position of the object sound source indicated by the sound source position information and the position of the viewer indicated by the user position information is selected for each speaker (channel). Then, a convolution process for convolving the selected filter coefficient and the sound data of the object sound source is performed, and sound data of each channel of the sound of the object sound source in which the sound image is localized at a desired position is generated.
  • Further, when the free viewpoint video reproduction unit 32 is composed of an annular speaker array, for example, a process of generating the audio data of each channel for reproducing the sound of the object sound source by HOA (Higher Order Ambisonics) is performed as the strict rendering process. In HOA, the sound data of each channel is generated by calculation in the spherical harmonic domain.
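  • As one concrete instance of calculation in the spherical harmonic domain, a mono object can be encoded into first-order ambisonics; the following minimal sketch uses the FuMa B-format convention as an illustrative assumption (the patent does not specify an order or convention):

```python
# Hedged sketch: first-order ambisonic (B-format) encoding of a mono object.
import numpy as np

def foa_encode(mono, azimuth_rad, elevation_rad):
    w = mono / np.sqrt(2.0)                                 # omnidirectional
    x = mono * np.cos(azimuth_rad) * np.cos(elevation_rad)  # front-back
    y = mono * np.sin(azimuth_rad) * np.cos(elevation_rad)  # left-right
    z = mono * np.sin(elevation_rad)                        # up-down
    return np.stack([w, x, y, z])
```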
  • a process of generating audio data of each channel for reproducing the sound of the object sound source by VBAP is performed as a light rendering process.
  • In step S11, the importance index calculation unit 21 determines whether or not there is a request for reducing the amount of calculation or transmission.
  • For example, the importance index calculation unit 21 determines that there is a reduction request when the client device 12 requests reduction of the calculation amount or transmission amount of the output audio stream.
  • If it is determined in step S11 that there is no reduction request, the processing of steps S12 to S15 is skipped, and the processing proceeds to step S16.
  • In this case, the process selection unit 22 supplies the input audio stream to the rendering processing unit 23, and also supplies to the rendering processing unit 23 a selection result indicating that the strict rendering process has been selected for all object sound sources. In addition, video meta information and the like are acquired as necessary and supplied from the importance index calculation unit 21 to the rendering processing unit 23 via the process selection unit 22.
  • On the other hand, when it is determined in step S11 that there is a reduction request, in step S12 the importance index calculation unit 21 acquires meta information and specification information.
  • That is, as meta information regarding the free viewpoint video content, that is, the audio data of each object sound source, the importance index calculation unit 21 acquires audio meta information from the input audio stream, recording/editing meta information from the meta information storage unit 13, and video meta information from the free viewpoint video reproduction unit 32 of the client device 12.
  • The importance index calculation unit 21 also acquires calculation specification information from the rendering processing unit 23, the transmission-side transmission rate specification information from the audio stream transmission unit 24, and the reception-side transmission rate specification information from the audio stream receiving unit 31.
  • As the meta information and specification information, information acquired sequentially in real time at a predetermined interval, such as per frame or per several frames, may be used, or information acquired in advance may be used continuously.
  • In step S13, the importance index calculation unit 21 determines whether all necessary meta information and specification information have been acquired.
  • If it is determined in step S13 that the necessary meta information and specification information have not yet been acquired, the process returns to step S12 and the above-described processing is repeated.
  • On the other hand, if it is determined in step S13 that the necessary meta information and specification information have been acquired, the process proceeds to step S14.
  • In step S14, the importance index calculation unit 21 calculates the importance index of each object sound source based on the acquired meta information, and supplies the obtained importance index and the specification information to the process selection unit 22. The importance index calculation unit 21 also supplies video meta information to the rendering processing unit 23 via the process selection unit 22 as necessary.
  • For example, the importance index calculation unit 21 calculates, as the importance index, the distance from the viewer's position to the position of the object sound source from the sound source position information included in the audio meta information and the user position information included in the video meta information, or uses the importance information included in the recording/editing meta information directly as the importance index.
  • In step S15, the process selection unit 22 performs the reduction processing based on the importance index of each object sound source and the specification information supplied from the importance index calculation unit 21, thereby selecting (determining) the process to be performed on each object sound source.
  • That is, the process selection unit 22 acquires the importance index and the specification information from the importance index calculation unit 21 and, for example, selects for each object sound source whether to perform the light rendering process or the strict rendering process, whether to perform the process of changing the reproduction bit rate of the object sound source, and whether to integrate object sound sources.
  • the processing selection unit 22 supplies the selection result and the input audio stream to the rendering processing unit 23.
  • a light rendering process, a process for changing a playback bit rate, and a process for integrating object sound sources are processes for reducing the amount of computation and transmission, and a strict rendering process is a normal process.
  • Note that the specification information is used in the reduction processing as necessary; it is not necessarily always used.
  • When the process has been selected in step S15, or when it is determined in step S11 that there is no reduction request, the process of step S16 is performed.
  • In step S16, the rendering processing unit 23 performs the rendering process based on the process selection result supplied from the process selection unit 22 and the input audio stream, and generates an output audio stream.
  • For example, for an object sound source for which the light rendering process has been selected, the rendering processing unit 23 performs processing such as panning or VBAP based on the sound data of the object sound source included in the input audio stream, according to the playback environment of the free viewpoint video reproduction unit 32.
  • For an object sound source for which the strict rendering process has been selected, the rendering processing unit 23 performs processing such as binaural playback, wavefront synthesis, or HOA based on the sound data of the object sound source included in the input audio stream, according to the playback environment of the free viewpoint video reproduction unit 32.
  • Further, for object sound sources for which the integration process has been selected, the rendering processing unit 23 integrates the sound data of those object sound sources before the rendering process by adding them at a predetermined volume ratio to form one piece of sound data.
  • The volume ratio when adding the audio data, that is, the weight by which each piece of audio data is multiplied, is determined based on, for example, the spatial positional relationship between each of the object sound sources and the viewer.
  • the audio data addition process is performed for each channel.
  • The position of the integrated object sound source may be the coordinate position indicated by the average of the coordinates of the positions of the integrated object sound sources, or the position indicated by a representative value of the positions of those object sound sources. The representative value may be the coordinates of the position of one of the object sound sources to be integrated, or coordinates calculated by weighted addition of the coordinates of the positions of the plurality of object sound sources.
  • Further, for an object sound source for which the process of changing the reproduction bit rate has been selected, the rendering processing unit 23 generates audio data with a changed reproduction bit rate by performing, on the sound data of the object sound source before or after the rendering process, down-sampling that converts the sampling frequency or a conversion process that changes the number of quantization bits.
  • When the sound data of each channel for each object sound source has been obtained by the above processing, the rendering processing unit 23 adds the sound data of the same channel of each object sound source to form one piece of sound data, thereby generating the audio data of each channel for reproducing the content audio. The rendering processing unit 23 generates an output audio stream storing the audio data of each channel of the content audio obtained in this way, and supplies it to the audio stream transmission unit 24. Note that the output audio stream may instead store the audio data of each object sound source.
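  • A minimal sketch of this per-channel mixdown, assuming every object has already been rendered to the same channel layout (the names and array shapes are illustrative assumptions):

```python
# Hedged sketch: sum the same channel of every rendered object sound source.
import numpy as np

def mixdown(rendered_sources):
    """rendered_sources: list of arrays shaped (channels, samples), one per
    object sound source. Returns the content audio, shaped (channels, samples)."""
    return np.sum(np.stack(rendered_sources), axis=0)
```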
  • In step S17, the audio stream transmission unit 24 transmits the output audio stream supplied from the rendering processing unit 23 to the client device 12. Thereafter, the process returns to step S11, and the above-described processing is repeated until the streaming distribution of the content audio ends.
  • the server 11 acquires meta information and specification information as necessary, calculates an importance index, and selects processing to be performed on the object sound source based on the obtained importance index. Then, the server 11 performs a rendering process or the like according to the processing selection result, and generates an output audio stream.
  • When the output audio stream is transmitted from the server 11, the client device 12 performs playback processing and the content audio is reproduced.
  • In step S41, the audio stream receiving unit 31 acquires the output audio stream and supplies it to the free viewpoint video reproduction unit 32.
  • the audio stream receiving unit 31 acquires the output audio stream by receiving the output audio stream transmitted by the audio stream transmitting unit 24 of the server 11 at regular intervals.
  • In step S42, the audio stream receiving unit 31 determines whether or not the output audio stream necessary for reproduction has been acquired. If it is determined in step S42 that the output audio stream has not yet been acquired, the process returns to step S41 and the above-described processing is repeated.
  • On the other hand, when it is determined in step S42 that the output audio stream has been acquired, in step S43 the free viewpoint video reproduction unit 32 reproduces the content audio based on the output audio stream supplied from the audio stream receiving unit 31.
  • the free viewpoint video playback unit 32 plays back the content video based on the video stream acquired from the outside, thereby playing back the free viewpoint video content including the content video and the content audio.
  • In step S44, the free viewpoint video reproduction unit 32 determines whether or not the supply of video meta information has been requested by the importance index calculation unit 21 of the server 11. For example, when the supply of video meta information is requested by the importance index calculation unit 21 in step S12 of the transmission process of FIG. 2, it is determined that the supply of video meta information has been requested.
  • If it is determined in step S44 that the supply of video meta information has not been requested, the process returns to step S41, and the above-described processing is repeated until the reproduction of the free viewpoint video content ends.
  • On the other hand, if it is determined in step S44 that the supply of video meta information has been requested, in step S45 the free viewpoint video reproduction unit 32 extracts the video meta information from the video stream and supplies it to the importance index calculation unit 21.
  • the process returns to step S41, and the above-described process is repeated.
  • the video meta information may be supplied in real time at the time of content audio reproduction, or may be supplied in advance.
  • the client device 12 acquires the output audio stream from the server 11 to reproduce the content sound, and outputs video meta information in response to a request from the server 11. As a result, it is possible to reduce the amount of computation at the time of generating the output audio stream on the server 11 side and the transmission amount of the output audio stream, and it is possible to perform content reproduction with high presence.
  • Example 1 of reduction processing: In the transmission process described with reference to FIG. 2, the importance index of each object sound source is obtained based on the acquired meta information, and the process to be performed on each object sound source is then selected based on the importance index and the specification information.
  • the importance index used does not have to be one for each object sound source, and a plurality of different importance indices may be obtained for each object sound source and used for processing.
  • In step S71, the importance index calculation unit 21 acquires user position information and sound source position information as the information necessary for calculating the importance index.
  • That is, the importance index calculation unit 21 acquires the user position information included in the video meta information by acquiring the video meta information from the free viewpoint video reproduction unit 32, and acquires the sound source position information included in the audio meta information by acquiring the audio meta information from the input audio stream. This processing of step S71 corresponds to the processing of steps S12 and S13 in FIG. 2.
  • In step S72, the importance index calculation unit 21 calculates, as the importance index, the distance in the space between the viewer and the object sound source to be processed.
  • That is, the importance index calculation unit 21 calculates the distance |VO - VL| from the coordinates of the viewer's position in the space indicated by the user position information, that is, the vector VL representing the position of the viewer, and the coordinates of the position of the object sound source in the space indicated by the sound source position information, that is, the vector VO representing the position of the object sound source. Here, |VO - VL| denotes the magnitude of the vector (VO - VL), that is, the distance from the viewer to the object sound source in the space.
  • the importance index calculation unit 21 supplies the calculated importance index to the process selection unit 22.
  • This processing of step S72 corresponds to the processing of step S14 in FIG. 2.
  • In step S73, the process selection unit 22 determines whether or not the distance between the viewer and the object sound source, which is the importance index supplied from the importance index calculation unit 21, is equal to or greater than a predetermined threshold th. That is, the process selection unit 22 determines whether or not the distance |VO - VL| satisfies the following expression (1): |VO - VL| < th ... (1)
  • If it is determined in step S73 that the distance is equal to or greater than the threshold th, that is, if the relationship of expression (1) does not hold, the process proceeds to step S74.
  • In step S74, the process selection unit 22 selects the light rendering process as the process to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • In this way, the amount of calculation at the time of rendering is reduced by performing the light rendering process on a processing-target object sound source of low importance.
  • On the other hand, if it is determined in step S73 that the distance is not equal to or greater than the threshold th, that is, if the relationship of expression (1) holds, the process proceeds to step S75.
  • In step S75, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • In this way, the content audio can be reproduced with a high sense of realism by performing the strict rendering process on a processing-target object sound source of high importance.
  • The processes of steps S73 to S75 described above correspond to the process of step S15 in FIG. 2.
  • As described above, the distance from the viewer to the object sound source is used as the importance index, with a shorter distance meaning higher importance; the strict rendering process is selected for object sound sources within a certain distance from the viewer, while the light rendering process is selected for the other object sound sources, reducing the amount of calculation.
  • the server 11 calculates the distance between the viewer and the object sound source as the importance index, and selects a rendering process performed on the object sound source according to the distance. This makes it possible to obtain highly realistic content audio while reducing the total amount of computation during rendering.
  • what kind of rendering processing is performed for each object sound source is selected based on whether or not the distance to the object sound source is equal to or greater than the threshold th.
  • Alternatively, a predetermined number of object sound sources may be selected in ascending order of distance from the viewer, with the strict rendering process performed on the selected object sound sources and the light rendering process performed on the others.
  • the number of object sound sources that perform strict rendering processing may be determined based on, for example, calculation specification information.
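  • A minimal sketch of both selection strategies described above (the threshold test of expression (1) and the top-N selection by distance); the "strict"/"light" labels and all names are illustrative assumptions:

```python
# Hedged sketch: choose a rendering process per object sound source.
import numpy as np

def select_by_threshold(distances, th):
    # Expression (1): strict rendering while |VO - VL| < th.
    return ["strict" if d < th else "light" for d in distances]

def select_top_n(distances, n_strict):
    order = np.argsort(distances)       # closer = more important
    procs = ["light"] * len(distances)
    for idx in order[:n_strict]:
        procs[idx] = "strict"
    return procs

print(select_by_threshold([1.2, 8.0, 3.5], th=4.0))  # ['strict', 'light', 'strict']
print(select_top_n([1.2, 8.0, 3.5], n_strict=1))     # ['strict', 'light', 'light']
```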
  • As other meta information used in calculating the importance index, it is conceivable to use, for example, space size information, line-of-sight direction information, spread information, acoustic characteristic information indicating acoustic characteristics such as reverberation in the space, and arrangement information indicating the arrangement positions, that is, the positional relationships, of object sound sources and other objects in the space.
  • For example, when line-of-sight direction information is used, the light rendering process may be performed on an object sound source that is determined not to be visible to the viewer from the viewer's gaze direction specified by the line-of-sight direction information, the viewer's position, and the position of the object sound source.
  • That is, an object sound source that is visible to the viewer is treated as more important, and the light rendering process can be selected for an object sound source that is not visible. Accordingly, the content audio can be reproduced with a high sense of reality while giving priority to the object sound sources visible to the viewer and reducing the total amount of calculation.
  • In this case, the importance index calculation unit 21 calculates the distance from the viewer to the object sound source as one importance index, and calculates information indicating whether or not the object sound source is visible to the viewer as another importance index. The process selection unit 22 then selects the process to be performed on the object sound source based on the two importance indices calculated for that object sound source.
  • the audio meta information may include spread information indicating the extent of the sound image of the object sound source, that is, the size of the object in space.
  • the spread information is used, for example, to make the sound image of the object sound source wide when performing VBAP as a rendering process.
  • the importance index calculation unit 21 supplies the spread information extracted from the audio meta information as it is to the process selection unit 22 as one importance index.
  • When acoustic characteristic information indicating the acoustic characteristics of the space is available, for example when it is stored in a reserved area of the audio meta information, the process for each object sound source may be selected based on the degree of reverberation of the space indicated by the acoustic characteristic information.
  • the processing selection unit 22 may determine the number of object sound sources that perform strict rendering processing and the number of object sound sources that perform light rendering processing based on the degree of reverberation indicated by the acoustic characteristic information.
  • In this case, the importance index calculation unit 21 supplies the acoustic characteristic information to the process selection unit 22 together with the importance index, and the process selection unit 22, for example, increases the number of object sound sources on which the light rendering process is performed as the degree of reverberation indicated by the acoustic characteristic information becomes higher.
  • Further, when the video meta information includes arrangement information indicating the arrangement positions of object sound sources and other objects in the space, the process to be performed on each object sound source may be selected based on that arrangement information.
  • For example, the light rendering process may be performed on an object sound source that is hidden behind another object and thus not visible to the viewer.
  • In this case, the importance index calculation unit 21 calculates the distance from the viewer to the object sound source as one importance index and, based on the user position information, the sound source position information, and the arrangement information, calculates information indicating whether or not the object sound source is visible to the viewer as another importance index. The process selection unit 22 then selects the process to be performed on the object sound source based on the two importance indices calculated for that object sound source.
  • In step S102, the distance from the viewer to the object sound source is calculated as the importance index, and the importance index and the specification information are supplied from the importance index calculation unit 21 to the process selection unit 22.
  • In step S103, the process selection unit 22 determines, based on the specification information supplied from the importance index calculation unit 21, the number of object sound sources to be subjected to the strict rendering process. In other words, the number of object sound sources to which the strict rendering process is assigned, that is, for which the strict rendering process is selected (hereinafter referred to as the number of assigned sound sources), and the number of object sound sources to which the light rendering process is assigned are determined.
  • For example, the number of assigned sound sources is determined so that the calculation capacity required for the various processing calculations, such as the rendering processes for all object sound sources, does not exceed the calculation processing capability indicated by the calculation specification information.
  • the number of assigned sound sources may be determined based on the specification information, may be a predetermined number, or may be a number specified from the outside.
  • When the process of changing the playback bit rate or the process of integrating object sound sources is selected as the process for reducing the amount of computation and transmission, the number of object sound sources on which those processes are performed may be determined based on, for example, at least one of the calculation specification information and the transmission rate specification information. In this case, for example, the number of target object sound sources is determined so that the transmission rate necessary for transmitting the resulting output audio stream does not exceed the maximum transmission rate indicated by the transmission rate specification information.
  • In step S104, the process selection unit 22 selects one object sound source as the object sound source to be processed, and determines whether or not its rank is within the number of assigned sound sources determined in step S103.
  • For example, the process selection unit 22 ranks the object sound sources so that an object sound source whose importance index indicates higher importance is ranked higher. Then, when the number of assigned sound sources is AS and the rank of the object sound source to be processed is within the top AS overall, the process selection unit 22 determines that the object sound source to be processed is within the number of assigned sound sources.
  • If it is determined in step S104 that the rank is within the number of assigned sound sources, the process proceeds to step S105.
  • In step S105, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed, and supplies the selection result to the rendering processing unit 23. Thereafter, the process proceeds to step S107.
  • step S104 determines whether it is higher than the number of assigned sound sources, that is, if the rank of the object sound source to be processed is lower than the AS position. If it is determined in step S104 that it is not higher than the number of assigned sound sources, that is, if the rank of the object sound source to be processed is lower than the AS position, the process proceeds to step S106.
  • step S106 the process selection unit 22 selects a light rendering process as the process performed on the object sound source to be processed, supplies the selection result to the rendering process unit 23, and then the process proceeds to step S107. .
  • step S105 or step S106 when a rendering process to be performed on the object sound source to be processed is selected, in step S107, the process selection unit 22 determines whether or not processing has been selected for all object sound sources.
  • step S107 If it is determined in step S107 that processing has not yet been selected for all object sound sources, the processing returns to step S104, and the above-described processing is repeated. That is, an object sound source that has not yet been processed is set as a new object sound source to be processed, and a rendering process to be performed on the object sound source is selected.
  • step S107 if it is determined in step S107 that processing has been selected for all object sound sources, the reduction processing ends.
  • the processes in steps S103 to S107 described above correspond to the process in step S15 in FIG.
  • the server 11 determines the number of assigned sound sources based on the specification information, and is performed for each object sound source so that strict rendering processing is performed on the object sound sources for the number of assigned sound sources. Select a rendering process.
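The ranked assignment of steps S103 to S107 can be illustrated with a short sketch; here the distance to the viewer stands in for the importance index, and the dictionary layout of each source is a hypothetical choice made for the example:

```python
import math

def select_rendering(sources, viewer_pos, num_assigned):
    """Assign the strict rendering process to the top-ranked sources.

    Each source is a dict with an "id" and a "position" (x, y, z) tuple;
    a smaller distance to the viewer is treated as a higher importance.
    """
    ranked = sorted(sources, key=lambda s: math.dist(s["position"], viewer_pos))
    return {
        src["id"]: ("strict" if rank < num_assigned else "light")
        for rank, src in enumerate(ranked)
    }
```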
  • In step S131, the importance index calculation unit 21 acquires importance information as meta information for the object sound source to be processed.
  • For example, the importance index calculation unit 21 acquires the importance information from the audio meta information of the input audio stream, or acquires the importance information from the recording/editing meta information supplied from the meta information storage unit 13.
  • The importance index calculation unit 21 then supplies the acquired importance information as-is to the process selection unit 22 as the importance index.
  • The processing in step S131 corresponds to the processing in steps S12 to S14 in FIG.
  • Note that the importance information may be provided for each object sound source in units of one frame or a plurality of frames, or one piece of importance information may be used in common for all frames.
  • In step S132, the process selection unit 22 determines whether there is importance information for the object sound source to be processed. For example, the process selection unit 22 determines that there is importance information when the importance information of the object sound source to be processed has been supplied as the importance index from the importance index calculation unit 21.
  • If it is determined in step S132 that there is no importance information, the process proceeds to step S135; if it is determined that there is importance information, the process proceeds to step S133.
  • In step S133, the process selection unit 22 determines whether the importance indicated by the importance information of the object sound source to be processed, supplied from the importance index calculation unit 21, is equal to or higher than a predetermined threshold.
  • Note that the threshold used in step S133 may be a predetermined value, or may be a value determined based on the specification information or the like.
  • If it is determined in step S133 that the importance is not equal to or greater than the threshold, the process proceeds to step S135.
  • On the other hand, if it is determined in step S133 that the importance is equal to or greater than the threshold, the process proceeds to step S134.
  • In step S134, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed and supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • An object sound source whose importance is equal to or higher than the threshold is an important object sound source that should be reproduced strictly, and therefore the strict rendering process is selected.
  • If it is determined in step S132 that there is no importance information, or if it is determined in step S133 that the importance is not equal to or greater than the threshold, the process of step S135 is performed.
  • In step S135, the process selection unit 22 selects the light rendering process as the process to be performed on the object sound source to be processed and supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • That is, the light rendering process is selected for an object sound source that has no importance information to be used as the importance index, or whose importance indicated by the importance information is low, and the amount of computation at rendering time is thereby reduced.
  • The processes in steps S132 to S135 described above correspond to the process in step S15 in FIG.
  • In this example, the strict rendering process is performed only on an object sound source that has importance information and whose importance indicated by that importance information is equal to or greater than a certain value.
  • In this way, the server 11 selects the process to be performed on each object sound source using the importance information as the importance index.
  • By using the importance information in this way, the amount of computation can be reduced appropriately, and content can be reproduced with a high sense of presence even with a small amount of computation.
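A minimal sketch of this threshold test (steps S132 to S135) follows; representing missing importance information as None is an assumption made only for the example:

```python
def select_by_importance(importance_info, threshold):
    """Select a rendering process from per-source importance information.

    importance_info maps a source id to its importance value, or to None
    when no importance information exists for that source.
    """
    return {
        src_id: ("strict" if imp is not None and imp >= threshold else "light")
        for src_id, imp in importance_info.items()
    }
```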
  • Example 4 of reduction processing: Furthermore, when a large number of object sound sources exist in the space, if two or more object sound sources can be integrated and handled as one object sound source by a simple process, the amount of computation for rendering and other processing can be reduced. Moreover, integrating object sound sources reduces the data amount of the output audio stream, so the transmission amount can also be reduced.
  • For example, two or more object sound sources may be integrated into one object sound source based on the importance index.
  • The reduction processing performed by the server 11 in such a case will be described with reference to the flowchart of FIG.
  • In step S161, the importance index calculation unit 21 acquires the user position information and the sound source position information of the two object sound sources to be processed.
  • For example, the importance index calculation unit 21 acquires the user position information from the video meta information and acquires the sound source position information of the two object sound sources to be processed from the audio meta information.
  • The processing in step S161 corresponds to the processing in steps S12 and S13 in FIG.
  • In step S162, the importance index calculation unit 21 calculates the distance between the object sound sources in the space based on the sound source position information of the two object sound sources to be processed acquired in step S161.
  • In step S163, the importance index calculation unit 21 calculates, based on the user position information and the sound source position information acquired in step S161, the angle difference between the direction of one object sound source to be processed as viewed from the viewer and the direction of the other object sound source to be processed.
  • For example, the importance index calculation unit 21 calculates, as the angle difference, the angle formed by a vector whose start point is the position of the viewer in the space and whose end point is the position of one of the object sound sources to be processed, and a vector whose start point is the position of the viewer and whose end point is the position of the other object sound source to be processed.
  • That is, the angle difference indicates how close the direction of one object sound source to be processed, as viewed from the viewer in the space, is to the direction of the other object sound source to be processed as viewed from the viewer. This angle difference is obtained as an importance index of the object sound sources to be processed.
  • The importance index calculation unit 21 supplies the distance obtained in step S162 and the angle difference obtained in step S163 to the process selection unit 22 as the importance indexes of the object sound sources to be processed.
  • Note that the processes in step S162 and step S163 correspond to the process in step S14 in FIG.
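The angle difference of step S163 can be sketched with standard vector math; NumPy is used here purely for convenience:

```python
import numpy as np

def angle_difference(viewer_pos, pos_a, pos_b):
    """Angle (in radians) between the two viewer-to-source vectors."""
    v_a = np.asarray(pos_a, dtype=float) - np.asarray(viewer_pos, dtype=float)
    v_b = np.asarray(pos_b, dtype=float) - np.asarray(viewer_pos, dtype=float)
    cos_angle = np.dot(v_a, v_b) / (np.linalg.norm(v_a) * np.linalg.norm(v_b))
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```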
  • In step S164, the process selection unit 22 determines whether the distance between the object sound sources supplied from the importance index calculation unit 21 is equal to or less than a predetermined threshold.
  • If it is determined in step S164 that the distance between the object sound sources is not equal to or less than the threshold, the process proceeds to step S167; if the distance is equal to or less than the threshold, the process proceeds to step S165.
  • In step S165, the process selection unit 22 determines whether the angle difference between the directions of the object sound sources obtained in step S163 is equal to or less than a predetermined threshold.
  • Note that the angle difference used as the importance index may be calculated only when it is determined in step S164 that the distance between the object sound sources is equal to or less than the threshold. Also, the threshold used in step S164 is different from the threshold used in step S165.
  • If it is determined in step S165 that the angle difference is not equal to or less than the threshold, the process proceeds to step S167.
  • On the other hand, if it is determined in step S165 that the angle difference is equal to or less than the threshold, the process proceeds to step S166.
  • In step S166, the process selection unit 22 selects the process of integrating the two object sound sources as the process to be performed on the two object sound sources to be processed and supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • In this case, the two object sound sources to be processed are located within a certain distance of each other and lie in substantially the same direction as viewed from the viewer. Therefore, even if these two object sound sources are integrated into a single object sound source, there is no significant shift in the position of the sound image, and the sense of presence of the content audio is not impaired.
  • The process selection unit 22 therefore integrates such two object sound sources into one object sound source, reducing the amount of computation during the rendering process and the transmission amount of the output audio stream without impairing the sense of presence of the content audio.
  • When the process of step S166 is performed, a process of integrating the two object sound sources into one object sound source is performed in step S16 of FIG. At this time, the position in the space of the integrated object sound source is set to the average of the coordinates indicating the positions of the two object sound sources, that is, the position of the average coordinates.
  • If it is determined in step S164 that the distance between the object sound sources is not equal to or less than the threshold, or if it is determined in step S165 that the angle difference is not equal to or less than the threshold, the process of step S167 is performed.
  • In step S167, the process selection unit 22 selects performing the processing individually on the two object sound sources to be processed. That is, the process selection unit 22 selects not integrating the two object sound sources to be processed and supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
  • In this case, the two object sound sources to be processed are located at a distance equal to or greater than a certain distance in the space, or, even if they are within that distance, they lie in sufficiently different directions as viewed from the viewer.
  • After step S167, the two object sound sources to be processed are processed individually without being integrated.
  • The processes in steps S164 to S167 described above correspond to the process in step S15 in FIG.
  • In this way, the server 11 calculates the importance indexes based on the user position information and the sound source position information, and selects the process to be performed on the object sound sources to be processed based on the obtained importance indexes.
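Putting steps S164 to S166 together, the following sketch reuses the angle_difference helper above; the dict layout, the equal-length audio arrays, and the plain summation of the two signals are assumptions made for the example (the patent leaves the mixing details open):

```python
import numpy as np

def maybe_integrate(src_a, src_b, viewer_pos, dist_threshold, angle_threshold):
    """Return the integrated source if the two sources qualify, else None."""
    pos_a = np.asarray(src_a["position"], dtype=float)
    pos_b = np.asarray(src_b["position"], dtype=float)
    if np.linalg.norm(pos_a - pos_b) > dist_threshold:                # step S164
        return None                                                  # step S167
    if angle_difference(viewer_pos, pos_a, pos_b) > angle_threshold:  # step S165
        return None                                                  # step S167
    return {                                                          # step S166
        "position": (pos_a + pos_b) / 2.0,  # average of the two coordinates
        "audio": src_a["audio"] + src_b["audio"],
    }
```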
  • Note that the reduction processing is not limited to that described with reference to FIGS. 4 to 7; other reduction processing, such as changing the reproduction bit rate of an object sound source, can also be performed, and any of the reduction processes described with reference to FIGS. 4 to 7 can be performed in combination with other reduction processing.
  • For example, when the reproduction bit rate is changed in order to reduce the amount of computation and transmission, it is sufficient to select, for each object sound source, whether or not to perform the process of changing the reproduction bit rate, instead of selecting whether to perform the strict rendering process or the light rendering process in the reduction processing of FIGS. 4 to 6 described above.
  • That is, in the steps where the light rendering process would be selected, the process of changing the reproduction bit rate is selected, and in the steps where the strict rendering process would be selected, it is selected that the process of changing the reproduction bit rate is not performed.
  • Note that, according to the process selection result of the process selection unit 22, the rendering process and the like may instead be performed in the free viewpoint video reproduction unit 32 or the like of the client device 12.
  • The series of processes described above can be executed by hardware or can be executed by software.
  • When the series of processes is executed by software, a program constituting the software is installed in a computer.
  • Here, the computer includes a computer incorporated in dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions when various programs are installed.
  • FIG. 8 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processes by a program.
  • In the computer, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to one another via a bus 504.
  • An input / output interface 505 is further connected to the bus 504.
  • An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input / output interface 505.
  • the input unit 506 includes a keyboard, a mouse, a microphone, an image sensor, and the like.
  • the output unit 507 includes a display, a speaker, and the like.
  • the recording unit 508 includes a hard disk, a nonvolatile memory, and the like.
  • the communication unit 509 includes a network interface or the like.
  • the drive 510 drives a removable recording medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
  • In the computer configured as described above, the CPU 501 loads the program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes it, whereby the above-described series of processes is performed.
  • The program executed by the computer (CPU 501) can be provided by being recorded on a removable recording medium 511 such as a package medium, for example.
  • The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
  • In the computer, the program can be installed in the recording unit 508 via the input/output interface 505 by mounting the removable recording medium 511 on the drive 510. The program can also be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. In addition, the program can be installed in the ROM 502 or the recording unit 508 in advance.
  • The program executed by the computer may be a program in which the processes are performed in time series in the order described in this specification, or a program in which the processes are performed in parallel or at necessary timing, such as when a call is made.
  • Furthermore, the present technology can take a cloud computing configuration in which one function is shared and processed jointly by a plurality of devices via a network.
  • Each step described in the above flowcharts can be executed by one device or shared among a plurality of devices.
  • Furthermore, when one step includes a plurality of processes, the plurality of processes included in that one step can be executed by one device or shared among a plurality of devices.
  • Furthermore, the present technology can also be configured as follows.
  • (1) An audio processing device including: a process selection unit that selects a process to be performed on audio data of an object sound source based on one or more importance indexes serving as indexes of the importance of the object sound source.
  • (2) The audio processing device according to (1), in which the process selection unit selects, as the process, a process for reducing an amount of computation or transmission.
  • (3) The audio processing device according to (1) or (2), in which the process selection unit selects, as the process, one of a plurality of rendering processes having mutually different amounts of computation.
  • (4) The audio processing device according to any one of (1) to (3), in which the process selection unit selects, as the process, a process of integrating the audio data of a plurality of the object sound sources.
  • (5) The audio processing device according to any one of (1) to (4), in which the process selection unit selects, as the process, a process of changing a reproduction bit rate of the audio data of the object sound source.
  • (6) The audio processing device according to any one of (1) to (5), further including an importance index calculation unit that calculates the importance index based on meta information related to the audio data.
  • (7) The audio processing device according to (6), in which the importance index calculation unit calculates the importance index based on at least one of, as the meta information, position information of the object sound source, position information of a viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of a space, and arrangement information of objects in the space.
  • (8) The audio processing device according to (6) or (7), in which the importance index calculation unit calculates a distance between the object sound source and a viewer in a space as the importance index.
  • (9) The audio processing device according to any one of (6) to (8), in which the importance index calculation unit uses the importance information of the object sound source as the meta information directly as the importance index.
  • (10) The audio processing device according to any one of (6) to (9), in which the importance index calculation unit calculates a distance between two object sound sources in a space as the importance index.
  • (11) The audio processing device according to any one of (6) to (10), in which the importance index calculation unit calculates, as the importance index, an angle difference between a direction of the object sound source as viewed from a viewer in a space and a direction of another object sound source as viewed from the viewer.
  • (12) The audio processing device according to any one of (1) to (11), in which the process selection unit determines, for each of a plurality of the processes, the number of the object sound sources on which the process is performed, based on at least one of computation specification information indicating computation processing capability of a processing unit that performs the process and transmission rate specification information indicating a maximum transmission rate of the audio data.


Abstract

The present technology relates to a sound processing device and method that make it possible to reproduce content with a high sense of presence while requiring only a small amount of computation or transmission. The sound processing device is provided with a process selection unit that selects a process to be performed on the sound data of an object sound source on the basis of one or more importance indexes serving as indexes of the importance of the object sound source. The present technology can be applied to a content reproduction system.

Description

Audio processing apparatus and method
The present technology relates to an audio processing apparatus and method, and more particularly to an audio processing apparatus and method capable of reproducing content with a high sense of presence with a small amount of computation or transmission.
For example, when enjoying VR (Virtual Reality) content, games, or free viewpoint content created by 3D modeling or the like from video shot with an omnidirectional camera, it is desirable that the sound emitted by objects present in the virtual space also be reproduced with a high sense of presence.
To that end, when the viewer moves in the real space, the direction from which a sound output at a specific sound source position is heard, and the distance to that sound source position, must also change appropriately. That is, the sound image position of the sound needs to change appropriately.
This can be realized by recording the sound of each sound source present in the free viewpoint video as an object sound source and rendering it appropriately in accordance with the viewing environment.
Specifically, the sound image can be localized at an appropriate position by performing binaural reproduction when the sound is reproduced through headphones, or by performing processing such as wavefront synthesis when the sound is reproduced through speakers.
In addition, as a technology related to the reproduction of audio accompanying video, a technique has been proposed that reduces the amount of computation by selecting, based on priority information of audio channels and audio objects, whether to decode the data of each audio channel or audio object (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent Application Laid-Open No. 2015-194666
However, it has been difficult for the above-described technologies to reproduce content with a high sense of presence with a small amount of computation or transmission.
For example, in binaural reproduction, wavefront synthesis, and the like, convolution with filters corresponding to the direction of the sound source and the distance to the sound source is performed, so the amount of computation of the rendering process becomes large. In particular, when a plurality of sound sources are handled, the amount of computation increases in proportion to the number of sound sources. Also, when streaming reproduction or the like is performed, the transmission amount of the audio stream becomes large when the number of sound sources is large.
On the other hand, with the technique of selecting whether to perform decoding based on priority information, the amount of computation can be reduced, but the data of low-priority audio channels and audio objects is not decoded, so the sound of those channels and objects is not reproduced at all. This may impair the sense of presence during content reproduction.
The present technology has been made in view of such a situation, and makes it possible to reproduce content with a high sense of presence with a small amount of computation or transmission.
An audio processing device according to one aspect of the present technology includes a process selection unit that selects a process to be performed on audio data of an object sound source based on one or more importance indexes serving as indexes of the importance of the object sound source.
The process selection unit can select, as the process, a process for reducing an amount of computation or transmission.
The process selection unit can select, as the process, one of a plurality of rendering processes having mutually different amounts of computation.
The process selection unit can select, as the process, a process of integrating the audio data of a plurality of the object sound sources.
The process selection unit can select, as the process, a process of changing a reproduction bit rate of the audio data of the object sound source.
The audio processing device can further include an importance index calculation unit that calculates the importance index based on meta information related to the audio data.
The importance index calculation unit can calculate the importance index based on at least one of, as the meta information, position information of the object sound source, position information of a viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of a space, and arrangement information of objects in the space.
The importance index calculation unit can calculate a distance between the object sound source and a viewer in a space as the importance index.
The importance index calculation unit can use the importance information of the object sound source as the meta information directly as the importance index.
The importance index calculation unit can calculate a distance between two object sound sources in a space as the importance index.
The importance index calculation unit can calculate, as the importance index, an angle difference between a direction of the object sound source as viewed from a viewer in a space and a direction of another object sound source as viewed from the viewer.
The process selection unit can determine, for each of a plurality of the processes, the number of the object sound sources on which the process is performed, based on at least one of computation specification information indicating computation processing capability of a processing unit that performs the process and transmission rate specification information indicating a maximum transmission rate of the audio data.
An audio processing method according to one aspect of the present technology includes an acquisition step of acquiring one or more importance indexes serving as indexes of the importance of an object sound source, and a process selection step of selecting a process to be performed on audio data of the object sound source based on the one or more importance indexes.
In one aspect of the present technology, a process to be performed on the audio data of an object sound source is selected based on one or more importance indexes serving as indexes of the importance of the object sound source.
According to one aspect of the present technology, content can be reproduced with a high sense of presence with a small amount of computation or transmission.
Note that the effects described here are not necessarily limited, and any of the effects described in the present disclosure may be obtained.
FIG. 1 is a diagram showing a configuration example of a content reproduction system. FIG. 2 is a flowchart explaining transmission processing. FIG. 3 is a flowchart explaining reproduction processing. FIGS. 4 to 7 are flowcharts explaining reduction processing. FIG. 8 is a diagram showing a configuration example of a computer.
Hereinafter, embodiments to which the present technology is applied will be described with reference to the drawings.
<First Embodiment>
<About this technology>
The present technology calculates, using meta information of the content and the like, an index of the importance of each object sound source for determining whether it is an object sound source that should be reproduced more strictly, and changes how each object sound source is handled at reproduction time, such as in rendering processing, and at transmission time, thereby making it possible to reproduce content with a high sense of presence while reducing the amount of computation and transmission.
Note that reproducing strictly here means reproducing without impairing the localization and sound quality of the sound image of the object sound source. Also, the rendering process here refers to processing in general that remaps the audio data of object sound sources into audio data with a number of channels suited to the reproduction environment, in accordance with the number of channels and the reproduction conditions of the reproduction device.
In a content reproduction system to which the present technology is applied, an audio stream composed of audio data for each channel of each object sound source is input as the audio stream for reproducing the audio of content composed of video and accompanying audio.
In the following, the content is assumed to be, for example, free viewpoint video content. It is also assumed that a video object in the content video corresponds to an audio object of the content audio, that is, an object sound source. In other words, the sound of an object sound source is the sound of a video object.
Also, in the content reproduction system, rendering processing is performed based on the input audio stream, an audio stream for reproducing the content audio on the reproduction device, that is, the client device, is generated, and the content audio is reproduced on the reproduction device. In the following, the audio stream that is input is also referred to as the input audio stream, and the audio stream that is output to the reproduction device is also referred to as the output audio stream.
Furthermore, in the content reproduction system, processing for reducing the amount of computation when generating the output audio stream and processing for reducing the transmission amount of the output audio stream are performed as appropriate, based on various meta information and the specification information of each device.
In the following, selecting the process to be performed on the audio data of each object sound source from among one or more processes including at least a process for reducing the amount of computation or transmission, and thereby reducing the amount of computation or transmission for the audio stream as a whole, is also referred to as reduction processing. More specifically, in the reduction processing, depending on the object sound source, a selection result that no process for reducing the amount of computation or transmission is performed, or that normal processing is selected, may also be obtained.
In the content reproduction system, at least one of the following processes TR1 to TR3, for example, is performed as the reduction processing for reducing the amount of computation or transmission.
Process TR1: A process of changing the rendering process performed on an object sound source
For example, it is determined for each object sound source whether it is an object sound source that should be reproduced more strictly, and the rendering process performed on that object sound source is selected according to the determination result.
That is, for an object sound source that is determined to be of high importance and to require more strict reproduction, a rendering process that requires a large amount of computation but can obtain a more precise sense of localization of the sound image is performed.
In the following, such a rendering process capable of obtaining a more precise sense of localization is also referred to as a strict rendering process. For example, the strict rendering process includes processing with a large amount of computation, such as convolution using filter coefficients, and is a process that can localize the sound image with higher accuracy.
In contrast, for an object sound source that is of low importance and is not determined to require more strict reproduction, a rendering process that cannot localize the sound image with high accuracy but requires only a small amount of computation is performed.
In the following, such a rendering process that cannot localize the sound image with high accuracy but requires only a small amount of computation is also referred to as a light rendering process. For example, the light rendering process is VBAP (Vector Base Amplitude Panning) or the like.
By selectively performing either the strict rendering process or the light rendering process for each object sound source in this way, according to the importance of the object sound source, content audio with a high sense of presence can be obtained while reducing the amount of computation.
Note that, in the following, a case will be described in which there are two types of rendering process with mutually different amounts of computation and sound image localization accuracy: the light rendering process and the strict rendering process. However, three or more rendering processes with mutually different amounts of computation and sound image localization accuracy may be defined.
In such a case, for example, one of a light rendering process, a moderately strict rendering process, and a strict rendering process is selected for each object sound source as the rendering process.
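To make the contrast concrete, here is a minimal sketch of the two cost levels; the FIR filters stand in for direction- and distance-dependent coefficients (e.g. binaural filters), and the simple stereo panner is used only as a stand-in for a low-cost method such as VBAP:

```python
import numpy as np

def strict_render(audio, left_filter, right_filter):
    """Costly rendering: convolve the signal with per-source filter
    coefficients that depend on source direction and distance.
    The two filters are assumed to have equal length."""
    return np.stack([np.convolve(audio, left_filter),
                     np.convolve(audio, right_filter)])

def light_render(audio, pan):
    """Cheap rendering: equal-power amplitude panning, pan in [0, 1]."""
    return np.stack([audio * np.sqrt(1.0 - pan), audio * np.sqrt(pan)])
```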
Process TR2: A process of changing the reproduction bit rate of an object sound source
For example, it is determined for each object sound source whether it is an object sound source that should be reproduced more strictly, and the reproduction bit rate of the audio data of that object sound source is changed according to the determination result.
That is, for an object sound source that is determined to be of high importance and to require more strict reproduction, the reproduction bit rate of the audio data is not changed, and the audio data is used as-is to generate the output audio stream.
In contrast, for an object sound source that is of low importance and is not determined to require more strict reproduction, a process of changing the reproduction bit rate is performed. That is, audio data with a lower reproduction bit rate is generated based on the original audio data, and the obtained audio data is used to generate the output audio stream.
For example, methods of generating audio data with a lower reproduction bit rate include performing downsampling or the like on the original audio data to obtain audio data with a lower sampling frequency, performing conversion processing on the original audio data to generate audio data with a smaller number of quantization bits, and combining both.
Here, the reproduction bit rate of the audio data of an object sound source is determined by the sampling frequency, the number of channels of the object sound source, and the number of quantization bits. Also, the lower the reproduction bit rate, the smaller the data amount of the audio data, so not only can the transmission bit rate of the output audio stream, that is, the transmission amount, be reduced, but the amount of computation when generating the output audio stream and so on can also be reduced.
For example, for an object sound source present near the viewer, the reproduction bit rate is not changed because it should be reproduced more strictly, while for an object sound source that is far from the viewer and may serve as background sound, the reproduction bit rate can be changed to a lower one. In this way, content audio with a high sense of presence can be obtained while reducing the amount of computation and transmission.
Note that it is assumed that the reproduction device side can reproduce audio data of mutually different reproduction bit rates by some method.
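The bit rate reduction described above can be sketched as follows; the naive decimation (a real implementation would low-pass filter first to avoid aliasing) and the float-in-[-1, 1] signal format are simplifying assumptions made for the example:

```python
import numpy as np

def lower_bitrate(audio, factor=2, bits=8):
    """Reduce the reproduction bit rate of a float signal in [-1, 1] by
    lowering the sampling frequency and the number of quantization bits."""
    downsampled = audio[::factor]          # naive decimation
    levels = float(2 ** (bits - 1))
    return np.round(downsampled * levels) / levels
```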
Process TR3: A process of integrating a plurality of object sound sources into one object sound source
For example, suppose there are object sound sources for which, from the viewer's standpoint, the sense of presence is not impaired even if several object sound sources are combined into one, that is, integrated into a single object sound source. In such a case, the amount of computation and transmission can be reduced by integrating those object sound sources.
Specifically, for example, when two object sound sources in the space are at substantially the same position, the audio data of those two object sound sources are added together at a predetermined volume ratio to form one piece of audio data, which is then handled as a single object sound source at a predetermined position.
As a result, what previously required rendering processing for two object sound sources becomes rendering processing for one object sound source, so the amount of computation during rendering can be reduced. Also, since the audio data of the two object sound sources becomes one piece of audio data, the data amount can be reduced, and as a result the transmission amount of the output audio stream can be reduced.
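The addition at a predetermined volume ratio can be sketched in one line; the 50:50 default ratio is an arbitrary choice made for the example:

```python
import numpy as np

def integrate_audio(audio_a, audio_b, ratio=0.5):
    """Mix the audio data of two object sound sources into one signal,
    weighting the first source by ratio and the second by (1 - ratio)."""
    return ratio * np.asarray(audio_a) + (1.0 - ratio) * np.asarray(audio_b)
```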
Note that only one of the processes TR1 to TR3 described above may be performed, or any of these processes may be performed in combination according to the purpose and situation.
As described above, according to the present technology, content audio with a high sense of presence can be obtained while reducing the amount of computation and transmission by, depending on whether an object sound source should be reproduced more strictly, that is, whether its importance is high, performing different rendering processes on the object sound sources, changing the reproduction bit rate, or integrating object sound sources.
For example, if distribution of free viewpoint video content having many audio objects becomes common in the future, it will be necessary to perform streaming distribution while performing rendering processing on the server side in real time, in accordance with the viewing environment on the viewer side, that is, the client side, such as the headphones and speaker arrangement. If the amount of computation and transmission can then be reduced by the present technology, it becomes possible to perform the computation for generating an output audio stream for each of many clients and to transmit an output audio stream to each of those clients. That is, streaming distribution can be performed for many clients simultaneously.
Here, the processing for calculating the importance index, which indicates the importance of each object sound source, and the meta information and specification information used when performing the reduction processing will be described.
First, the meta information included in the video stream for reproducing the content video (hereinafter also referred to as video meta information) includes user position information indicating the position in the space of the user who is the viewer, and gaze direction information indicating the gaze direction of that user.
Also, the meta information included in the input audio stream of the content audio (hereinafter also referred to as audio meta information) includes space size information indicating the size of the space, and sound source position information indicating the position of each object sound source in the space. The audio meta information may also include importance information indicating the importance of each object sound source.
In free viewpoint video content, the viewer and the audio objects (object sound sources) move from moment to moment, so their positional relationship changes. Accordingly, which object sound sources are important to the viewer changes depending on that positional relationship.
For example, an object sound source located near the viewer should be reproduced strictly in order to maintain localization at an accurate position. That is, the strict rendering process should be performed.
In contrast, for an object sound source located sufficiently far from the viewer, only its approximate direction needs to be conveyed, so it may be reproduced using a light rendering process or the like.
Thus, for example, when a rendering process is selected for each object sound source, the positional relationship between the user and each object sound source can be identified using the user position information included in the video meta information and the sound source position information included in the audio meta information.
Also, when the reduction processing for reducing the amount of computation or transmission is performed, computation specification information indicating the computation processing capability, that is, the computation processing performance, of the computation block that generates the output audio stream, for example the computation block that performs the rendering process, is used as appropriate.
For example, when the computation processing capability of the computation block is limited, in order to realize real-time streaming distribution it is necessary to adjust the number of object sound sources to which the light rendering process or the strict rendering process is assigned in accordance with that computation processing capability.
Specifically, for example, in a computation block with low computation processing capability, the number of object sound sources to which the strict rendering process is assigned can be limited to ten or fewer, with the light rendering process performed on the other object sound sources.
Furthermore, the communication speed, that is, the transmission rate (transmission bit rate), may be limited on the transmission side or the reception side of the output audio stream. Therefore, in the content reproduction system, transmission rate specification information indicating the maximum transmission rate, which is the fastest rate the transmission can take, is acquired as appropriate for each of the transmission side and the reception side, and the reduction processing for reducing the amount of computation or transmission can be performed using that transmission rate specification information.
For example, when the maximum transmission rate indicated by the transmission rate specification information is low, the process of changing the reproduction bit rate or the process of integrating object sound sources is performed so that the output audio stream can be transmitted at or below that maximum transmission rate. At this time, when the maximum transmission rate differs between the transmission side and the reception side, the reproduction bit rate may be changed or the object sound sources may be integrated so that communication at the slower of the two maximum transmission rates becomes possible, as in the small check sketched below.
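A trivial sketch of that check, with hypothetical names:

```python
def needs_reduction(stream_rate_bps, sender_max_bps, receiver_max_bps):
    """Reduction (bit rate change or source integration) is needed when the
    output stream exceeds the slower of the two maximum transmission rates."""
    return stream_rate_bps > min(sender_max_bps, receiver_max_bps)
```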
In the following, when there is no particular need to distinguish between the transmission rate specification information on the transmission side and that on the reception side, they are simply referred to as the transmission rate specification information. Also, when there is no particular need to distinguish between the computation specification information and the transmission rate specification information, they are simply referred to as the specification information.
Furthermore, in free viewpoint video content, importance information indicating a priority, that is, an importance, may be added to object sound sources as meta information (hereinafter also referred to as recording/editing meta information) when the content is recorded or when the content is edited after recording.
For example, a high importance is added to the sound of an object that symbolizes the scene (place), that is, to that object sound source, and a low importance is added to the sounds of other objects. In this way, more computation resources can be allocated to reproducing the sounds of object sound sources of higher importance, and content audio with a high sense of presence can be obtained while reducing the amount of computation and transmission even when the computation processing capability of the computation block is limited.
Note that the importance information of each object sound source may be included in the audio meta information of the input audio stream. Also, the importance information may be defined for each frame of the content audio, or may be defined in units of a plurality of frames.
In the content reproduction system, the importance index indicating the importance of each object sound source is calculated using at least one of the pieces of information included in the video meta information, the audio meta information, and the recording/editing meta information described above.
Then, the reduction processing is performed using the importance index of each object sound source, and the amount of computation when generating the output audio stream and the transmission amount of the output audio stream are reduced. Note that, in the reduction processing, the computation specification information, the transmission rate specification information, and the like are also used as necessary.
Meta information and specification information other than those described above may also be used, and the reduction processing for reducing the amount of computation or transmission may be processing other than the processing described above.
<Example configuration of content playback system>
Next, a more specific embodiment of the content reproduction system described above will be described. FIG. 1 is a diagram showing a configuration example of an embodiment of a content reproduction system to which the present technology is applied.
 図1に示すコンテンツ再生システムは、サーバ11、クライアント装置12、およびメタ情報格納部13を有している。 The content reproduction system shown in FIG. 1 includes a server 11, a client device 12, and a meta information storage unit 13.
 この例では、サーバ11は、例えばクラウドなどのネットワークサーバからなり、ユーザが操作するクライアント装置12と有線や無線のネットワークを介して接続されている。なお、ここではサーバ11に1つのクライアント装置12が接続されているが、サーバ11には2以上の複数のクライアント装置12が接続されるようにしてもよい。 In this example, the server 11 includes a network server such as a cloud, and is connected to the client device 12 operated by the user via a wired or wireless network. Here, one client device 12 is connected to the server 11, but two or more client devices 12 may be connected to the server 11.
 サーバ11は、アナログまたはデジタルの音声信号(音声データ)への演算処理が可能な装置であり、入力オーディオストリームに対するレンダリング処理をオブジェクト音源ごとにリアルタイムで切り替えて適切な出力オーディオストリームを生成し、自由視点映像コンテンツのストリーミング配信を行う。すなわち、サーバ11は、外部から供給された、または予め記録している自由視点映像コンテンツの音声のストリーミング配信をクライアント装置12に対して行う。 The server 11 is a device that can perform arithmetic processing on an analog or digital audio signal (audio data), and generates a suitable output audio stream by switching rendering processing for an input audio stream in real time for each object sound source. Perform streaming distribution of viewpoint video content. That is, the server 11 performs streaming distribution of the audio of the free viewpoint video content supplied from the outside or recorded in advance to the client device 12.
 具体的には、サーバ11は入力オーディオストリームとメタ情報やスペック情報とに基づいて出力オーディオストリームを生成し、クライアント装置12へと伝送する。その際、サーバ11は、適宜、メタ情報格納部13から収録/編集時メタ情報を取得する。 Specifically, the server 11 generates an output audio stream based on the input audio stream and the meta information and specification information, and transmits the output audio stream to the client device 12. At that time, the server 11 appropriately acquires recording / editing meta information from the meta information storage unit 13.
 また、クライアント装置12は、サーバ11から出力オーディオストリームを受信して、コンテンツ音声を再生する。このとき、クライアント装置12は、サーバ11または他のサーバ等から取得したビデオストリームに基づいてコンテンツ映像も再生することで、映像と音声からなる自由視点映像コンテンツを再生する。 In addition, the client device 12 receives the output audio stream from the server 11 and reproduces the content audio. At this time, the client device 12 reproduces a free-viewpoint video content composed of video and audio by also playing a content video based on a video stream acquired from the server 11 or another server.
The server 11 includes an importance index calculation unit 21, a process selection unit 22, a rendering processing unit 23, and an audio stream transmission unit 24.
The client device 12 includes an audio stream receiving unit 31 and a free viewpoint video reproduction unit 32.
The importance index calculation unit 21 acquires meta information and specification information as necessary, calculates an importance index based on the meta information, and supplies the obtained importance index and the specification information to the process selection unit 22. The importance index is an index indicating the importance of each object sound source.
For example, the importance index calculation unit 21 acquires (extracts) audio meta information from the input audio stream, acquires recording/editing meta information from the meta information storage unit 13, and acquires video meta information from the free viewpoint video reproduction unit 32 of the client device 12.
Further, for example, the importance index calculation unit 21 acquires computation specification information from the rendering processing unit 23, acquires transmission-side transmission speed specification information from the audio stream transmission unit 24, and acquires reception-side transmission speed specification information from the audio stream receiving unit 31.
Furthermore, the importance index calculation unit 21 supplies video meta information to the rendering processing unit 23 via the process selection unit 22 as necessary.
The process selection unit 22 acquires the input audio stream and performs reduction processing for reducing the amount of computation and transmission based on the importance indices and specification information supplied from the importance index calculation unit 21. The process selection unit 22 also supplies the result of the reduction processing and the input audio stream to the rendering processing unit 23.
Here, the result of the reduction processing for reducing the amount of computation and transmission is, for each object sound source, a selection result (decision result) such as whether light rendering processing or strict rendering processing is to be performed, which object sound sources are to have their reproduction bit rate changed, or which object sound sources are to be integrated into one. That is, the reduction processing yields, for each object sound source, a selection of what kind of processing is to be performed.
The specification information is used, for example, to determine the number of object sound sources on which each of a plurality of processes, including processes for reducing the amount of computation and transmission, is performed.
The rendering processing unit 23 performs rendering processing based on the result of the reduction processing supplied from the process selection unit 22 and the input audio stream, and supplies the resulting output audio stream to the audio stream transmission unit 24.
At this time, the rendering processing unit 23 performs the rendering processing using, as appropriate, the video meta information supplied from the importance index calculation unit 21 via the process selection unit 22 and the audio meta information included in the input audio stream supplied from the process selection unit 22.
The rendering processing unit 23 also acquires from the client device 12, as appropriate, information about the reproduction environment of the client device 12, such as how many channels its speaker system has. The rendering processing unit 23 then generates, according to that reproduction environment, an output audio stream composed of audio data for each channel that the client device 12 can reproduce. Furthermore, based on the result of the reduction processing, the rendering processing unit 23 also performs processing such as changing the reproduction bit rate and integrating object sound sources as appropriate.
The audio stream transmission unit 24 transmits the output audio stream supplied from the rendering processing unit 23 to the client device 12 via the network.
The audio stream receiving unit 31 of the client device 12 receives the output audio stream transmitted by the audio stream transmission unit 24 of the server 11 and supplies it to the free viewpoint video reproduction unit 32. In addition, in response to requests from the server 11, the audio stream receiving unit 31 supplies reception-side transmission speed specification information to the importance index calculation unit 21 as appropriate.
The free viewpoint video reproduction unit 32 includes sound reproduction equipment, such as headphones or a speaker system, and equipment that drives it, and reproduces the content audio based on the output audio stream supplied from the audio stream receiving unit 31.
The free viewpoint video reproduction unit 32 also includes a display device and the like, and reproduces the content video based on a video stream acquired from outside. Furthermore, in response to requests from the server 11, the free viewpoint video reproduction unit 32 extracts video meta information from the video stream as appropriate and supplies it to the importance index calculation unit 21.
Here, an example in which video meta information including user position information and line-of-sight direction information is extracted from the video stream will be described, but the user position information and line-of-sight direction information may be acquired by any method.
For example, the client device 12 may acquire the user position information and line-of-sight direction information from another external device and supply them to the importance index calculation unit 21. Alternatively, the client device 12 may be provided with a gyro sensor that detects the direction of the user's head, an image sensor that captures the user, or the like, and obtain the user position information and line-of-sight direction information from them. In this case, for example, the user's face direction may be identified from the output of the gyro sensor and taken as the user's line-of-sight direction, or the user's line-of-sight direction and position in the space may be detected from an image obtained by the image sensor.
In addition, the importance index calculation unit 21 may use space size information included in the video meta information of the video stream, or the position information and importance information of a video object included in the video meta information may be used as the sound source position information and importance information of the object sound source corresponding to that video object.
Furthermore, an example in which the server 11 and the client device 12 are connected via a network has been described here. However, the importance index calculation unit 21 through the audio stream transmission unit 24, the audio stream receiving unit 31, and the free viewpoint video reproduction unit 32 may all be provided in a single apparatus. Alternatively, an apparatus provided with the importance index calculation unit 21 through the audio stream transmission unit 24 and an apparatus provided with the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be connected by wire, such as a cable.
Specifically, for example, a case where free viewpoint video content stored on a personal computer in the user's home is played back on a head-mounted display is conceivable. In such a case, the importance index calculation unit 21 through the audio stream transmission unit 24 may be provided in the personal computer, and the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be provided in the head-mounted display connected to the personal computer.
Playing back 3D game content with the content reproduction system is also conceivable. In such a case, for example, a stationary game console may be configured to include the importance index calculation unit 21 through the audio stream transmission unit 24, the audio stream receiving unit 31, and the free viewpoint video reproduction unit 32. Alternatively, the importance index calculation unit 21 through the audio stream transmission unit 24 may be provided in the stationary game console, and the audio stream receiving unit 31 and the free viewpoint video reproduction unit 32 may be provided in an external device connected to the console by wire or wirelessly.
Here, specific examples of the light rendering processing and the strict rendering processing performed in the rendering processing unit 23 of the content reproduction system described above will be given.
For example, assume that the audio playback specification of the client device 12, that is, the reproduction environment for the content audio, is playback through headphones. In other words, assume that the free viewpoint video reproduction unit 32 consists of headphones.
In such a case, for example, binaural playback processing that convolves a head-related transfer function (HRTF) with the audio data of an object sound source is performed as the strict rendering processing.
In this case, for example, head-related transfer functions are prepared in advance for each relative positional relationship between the viewer and an object sound source in the space. From among these, the head-related transfer function corresponding to the relative positional relationship between the position of the object sound source indicated by the sound source position information and the position of the viewer indicated by the user position information is selected.
Convolution processing is then performed to convolve the selected head-related transfer function with the audio data of the object sound source, generating audio data for the left and right channels of the object sound source's sound such that the sound image is localized at the desired position.
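As a minimal sketch of this binaural path, the following assumes the HRTF pair for the relevant viewer-to-source direction has already been selected; the function name and the use of FFT-based convolution are illustrative assumptions, not part of the present technology.

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_render(source: np.ndarray,
                    hrtf_left: np.ndarray,
                    hrtf_right: np.ndarray) -> np.ndarray:
    """Convolve a mono object sound source with a selected HRTF pair.

    Returns an (N, 2) array holding the left and right channel audio data.
    """
    left = fftconvolve(source, hrtf_left, mode="full")
    right = fftconvolve(source, hrtf_right, mode="full")
    return np.stack([left, right], axis=-1)
```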
In contrast, as the light rendering processing, a panning process is performed that localizes the sound image by changing the volume ratio of the left and right sounds of the object sound source based on, for example, the viewer's position and line-of-sight direction in the space and the position of the object sound source.
In this case, based on the user position information, sound source position information, and line-of-sight direction information, audio data for the left and right channels of the object sound source's sound is generated such that the volume ratio of the left and right channels, determined according to the positional relationship between the viewer and the object sound source in the space and the viewer's line-of-sight direction, localizes the sound image at the desired position.
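The following is a minimal sketch of such a panning process, assuming the source's azimuth relative to the viewer's line of sight has already been computed from the position and line-of-sight direction information; the constant-power gain law is an illustrative assumption.

```python
import numpy as np

def pan_render(source: np.ndarray, azimuth_rad: float) -> np.ndarray:
    """Pan a mono source between left and right channels.

    azimuth_rad is in [-pi/2, pi/2]; negative values are to the left.
    Returns an (N, 2) array of left/right channel audio data.
    """
    theta = (azimuth_rad + np.pi / 2) / 2          # map to [0, pi/2]
    gain_l, gain_r = np.cos(theta), np.sin(theta)  # constant-power gains
    return np.stack([gain_l * source, gain_r * source], axis=-1)
```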
Next, assume for example that the audio playback specification of the client device 12 is playback through a multi-channel speaker system. In other words, assume that the free viewpoint video reproduction unit 32 consists of multi-channel speakers.
In such a case, for example when the free viewpoint video reproduction unit 32 consists of a linear speaker array, the processing that generates audio data for each speaker, that is, for each channel, to reproduce the sound of the object sound source by wavefront synthesis is performed as the strict rendering processing.
In wavefront synthesis, a filter coefficient determined from the positional relationship between the position of the object sound source indicated by the sound source position information and the position of the viewer indicated by the user position information is selected for each speaker (channel). Convolution processing is then performed to convolve the selected filter coefficients with the audio data of the object sound source, generating audio data for each channel of the object sound source's sound such that the sound image is localized at the desired position.
Also, for example when the free viewpoint video reproduction unit 32 consists of an annular speaker array, processing that generates audio data for each channel to reproduce the sound of the object sound source by HOA (Higher Order Ambisonics) is performed as the strict rendering processing. In HOA, the audio data of each channel is generated by computation in the spherical harmonic domain.
In contrast, for example, processing that generates audio data for each channel to reproduce the sound of the object sound source by VBAP is performed as the light rendering processing.
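As a rough illustration of why VBAP is comparatively light, the following sketches the two-dimensional (speaker-pair) gain computation: the source direction is expressed as a linear combination of the two nearest speakers' unit vectors and the gains are power-normalized. Speaker-pair selection and the three-dimensional triplet case are omitted, and the function name is an assumption.

```python
import numpy as np

def vbap_pair_gains(source_dir: np.ndarray,
                    spk1: np.ndarray,
                    spk2: np.ndarray) -> np.ndarray:
    """Gains (g1, g2) for a 2-D speaker pair; all inputs are unit 2-D vectors."""
    base = np.column_stack([spk1, spk2])        # L = [l1 l2]
    gains = np.linalg.solve(base, source_dir)   # solve p = L @ g for g
    gains = np.clip(gains, 0.0, None)           # source should lie between the pair
    return gains / np.linalg.norm(gains)        # constant-power normalization
```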
<Description of transmission processing and playback processing>
Next, the processing performed by the content reproduction system shown in FIG. 1 will be described.
First, the transmission processing, in which the server 11 generates and outputs an output audio stream, will be described with reference to the flowchart of FIG. 2.
In step S11, the importance index calculation unit 21 determines whether there is a request to reduce the amount of computation or transmission.
For example, the importance index calculation unit 21 determines that there is a reduction request when the client device 12 requests a reduction in the amount of computation or transmission for the output audio stream.
Alternatively, for example, it may be determined that there is a reduction request when a plurality of client devices 12 are connected to the server 11, the number of client devices 12 requesting transmission of an output audio stream is large, and the processing load on the server 11 is therefore high.
If it is determined in step S11 that there is no reduction request, the processing of steps S12 through S15 is skipped, and the processing then proceeds to step S16.
In this case, the process selection unit 22 supplies the input audio stream to the rendering processing unit 23, together with a selection result indicating that strict rendering processing has been selected for all object sound sources. Video meta information and the like are also acquired as necessary and supplied from the importance index calculation unit 21 to the rendering processing unit 23 via the process selection unit 22.
In contrast, if it is determined in step S11 that there is a reduction request, the importance index calculation unit 21 acquires meta information and specification information in step S12.
For example, the importance index calculation unit 21 acquires, as meta information relating to the free viewpoint video content, that is, to the audio data of each object sound source, the audio meta information from the input audio stream, the recording/editing meta information from the meta information storage unit 13, and the video meta information from the free viewpoint video reproduction unit 32 of the client device 12.
Further, for example, the importance index calculation unit 21 acquires computation specification information from the rendering processing unit 23, acquires transmission-side transmission speed specification information from the audio stream transmission unit 24, and acquires reception-side transmission speed specification information from the audio stream receiving unit 31.
Note that the meta information and specification information may be acquired sequentially at predetermined intervals in real time, for example per frame or per group of frames, or information acquired in advance may be used continuously.
In step S13, the importance index calculation unit 21 determines whether all necessary meta information and specification information have been acquired.
If it is determined in step S13 that the necessary meta information and specification information have not yet been acquired, the processing returns to step S12, and the processing described above is repeated.
In contrast, if it is determined in step S13 that the necessary meta information and specification information have been acquired, the processing proceeds to step S14.
In step S14, the importance index calculation unit 21 calculates the importance index of each object sound source based on the acquired meta information, and supplies the obtained importance indices and the specification information to the process selection unit 22. The importance index calculation unit 21 also supplies video meta information to the rendering processing unit 23 via the process selection unit 22 as necessary.
For example, the importance index calculation unit 21 calculates, as the importance index, the distance from the viewer's position to the position of an object sound source from the sound source position information included in the audio meta information and the user position information included in the video meta information, or uses the importance information included in the recording/editing meta information directly as the importance index.
In step S15, the process selection unit 22 performs the reduction processing based on the importance index of each object sound source supplied from the importance index calculation unit 21 and the specification information, thereby selecting (determining) the processing to be performed on each object sound source.
That is, the process selection unit 22 acquires the importance indices and specification information from the importance index calculation unit 21. Then, for example, the process selection unit 22 selects, for each object sound source, whether to perform light rendering processing or strict rendering processing, whether to perform processing to change the reproduction bit rate of the object sound source, and whether to perform processing to integrate object sound sources. The process selection unit 22 supplies these selection results and the input audio stream to the rendering processing unit 23.
For example, light rendering processing, processing to change the reproduction bit rate, and processing to integrate object sound sources are processes for reducing the amount of computation and transmission, whereas strict rendering processing is the ordinary processing. In the reduction processing, the specification information is used as necessary and need not always be used.
When the processing has been selected in step S15, or when it has been determined in step S11 that there is no reduction request, the processing of step S16 is performed.
In step S16, the rendering processing unit 23 performs rendering processing based on the processing selection result supplied from the process selection unit 22 and the input audio stream, and generates an output audio stream.
For example, for an object sound source for which light rendering processing has been selected, the rendering processing unit 23 performs panning processing, VBAP, or the like based on the audio data of the object sound source included in the input audio stream, according to the reproduction environment of the free viewpoint video reproduction unit 32.
For an object sound source for which strict rendering processing has been selected, the rendering processing unit 23 performs binaural playback processing, wavefront synthesis, HOA, or other such processing based on the audio data of the object sound source included in the input audio stream, according to the reproduction environment of the free viewpoint video reproduction unit 32.
Furthermore, for object sound sources for which integration processing has been selected, the rendering processing unit 23 integrates their audio data before the rendering processing by adding the audio data of those object sound sources at a predetermined volume ratio to form a single piece of audio data.
At this time, the volume ratio used when adding the audio data, that is, the weight by which each piece of audio data is multiplied, is determined based on, for example, the spatial positional relationship between each of the object sound sources and the viewer. The addition of audio data is performed per channel.
Furthermore, the position of the integrated object sound source may be the coordinate position given by the average of the coordinates of the positions of the integrated object sound sources, or a position given by a representative value of those coordinates.
For example, the representative value of the coordinates of the object sound source positions may be the coordinates of the position of one of the object sound sources to be integrated, or coordinates calculated from the coordinates of the positions of the object sound sources by weighted addition or the like.
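A minimal sketch of this integration step follows, assuming mono sources of equal length; the inverse-distance weighting rule and the use of the mean position are illustrative choices among those described above.

```python
import numpy as np

def integrate_sources(audios: list[np.ndarray],
                      positions: np.ndarray,
                      viewer_pos: np.ndarray):
    """Merge several object sound sources into one.

    audios: K mono signals of equal length; positions: (K, 3) coordinates.
    Returns the mixed signal and the position of the merged source.
    """
    dists = np.linalg.norm(positions - viewer_pos, axis=1)
    weights = 1.0 / np.maximum(dists, 1e-6)   # nearer sources weigh more
    weights /= weights.sum()                  # volume ratios sum to 1
    mix = sum(w * a for w, a in zip(weights, audios))
    merged_pos = positions.mean(axis=0)       # average of source positions
    return mix, merged_pos
```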
In addition, for an object sound source for which processing to change the reproduction bit rate has been selected, the rendering processing unit 23 performs, before or after the rendering processing, downsampling that converts the sampling frequency of the object sound source's audio data, or conversion processing that changes the number of quantization bits, thereby generating audio data with the changed reproduction bit rate.
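The following is a minimal sketch of such a bit rate reduction, assuming float audio data in [-1, 1]; the target sampling frequency and quantization depth are illustrative assumptions.

```python
import numpy as np
from scipy.signal import resample_poly

def reduce_bit_rate(audio: np.ndarray, fs_in: int = 48_000,
                    fs_out: int = 24_000, bits: int = 16) -> np.ndarray:
    """Downsample and requantize one object sound source's audio data."""
    down = resample_poly(audio, up=fs_out, down=fs_in)  # e.g. 48 kHz -> 24 kHz
    scale = 2 ** (bits - 1) - 1
    return np.round(down * scale) / scale               # quantize to `bits` bits
```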
When the audio data of each channel has been obtained for every object sound source by the above processing, the rendering processing unit 23 adds together the audio data of the same channel across the object sound sources to form a single piece of audio data per channel, thereby generating the audio data of each channel for reproducing the content audio. The rendering processing unit 23 generates an output audio stream storing the per-channel audio data of the content audio obtained in this way, and supplies it to the audio stream transmission unit 24. Note that the output audio stream may instead store audio data for each object sound source.
In step S17, the audio stream transmission unit 24 transmits the output audio stream supplied from the rendering processing unit 23 to the client device 12. The processing then returns to step S11, and the processing described above is repeated until the streaming distribution of the content audio ends.
As described above, the server 11 acquires meta information and specification information as necessary, calculates importance indices, and selects the processing to be performed on each object sound source based on the obtained importance indices. The server 11 then performs rendering processing and the like according to the processing selection results and generates the output audio stream.
By appropriately selecting the processing for each object sound source in this way, the amount of computation when generating the output audio stream and the amount of data transmitted for the output audio stream can be appropriately reduced. As a result, content can be reproduced with a high sense of presence using a small amount of computation or transmission.
When the output audio stream is output from the server 11, the client device 12 performs reproduction processing and reproduces the content audio.
The reproduction processing performed by the client device 12 will be described below with reference to the flowchart of FIG. 3.
In step S41, the audio stream receiving unit 31 acquires the output audio stream and supplies it to the free viewpoint video reproduction unit 32.
That is, the audio stream receiving unit 31 acquires the output audio stream by receiving, at regular intervals, the output audio stream transmitted by the audio stream transmission unit 24 of the server 11.
In step S42, the audio stream receiving unit 31 determines whether the output audio stream necessary for reproduction has been acquired. If it is determined in step S42 that the output audio stream has not yet been acquired, the processing returns to step S41, and the processing described above is repeated.
In contrast, if it is determined in step S42 that the output audio stream has been acquired, then in step S43 the free viewpoint video reproduction unit 32 reproduces the content audio based on the output audio stream supplied from the audio stream receiving unit 31.
At this time, the free viewpoint video reproduction unit 32 also reproduces the content video based on a video stream acquired from outside, thereby reproducing free viewpoint video content composed of the content video and content audio.
In step S44, the free viewpoint video reproduction unit 32 determines whether the importance index calculation unit 21 of the server 11 has requested the supply of video meta information. For example, when the importance index calculation unit 21 requests the supply of video meta information in step S12 of the transmission processing of FIG. 2, it is determined that the supply of video meta information has been requested.
If it is determined in step S44 that the supply of video meta information has not been requested, the processing returns to step S41, and the processing described above is repeated until the reproduction of the free viewpoint video content ends.
In contrast, if it is determined in step S44 that the supply of video meta information has been requested, then in step S45 the free viewpoint video reproduction unit 32 extracts the video meta information from the video stream and supplies it to the importance index calculation unit 21. Once the video meta information has been supplied, the processing returns to step S41, and the processing described above is repeated.
Note that the video meta information may be supplied in real time during the reproduction of the content audio, or may be supplied in advance.
As described above, the client device 12 acquires the output audio stream from the server 11 to reproduce the content audio, and outputs video meta information in response to requests from the server 11. This makes it possible to reduce the amount of computation when generating the output audio stream on the server 11 side and the amount of data transmitted for the output audio stream, while reproducing content with a high sense of presence.
<Example 1 of reduction processing>
In the transmission processing described with reference to FIG. 2, an index of the importance of each object sound source is obtained as the importance index based on the acquired meta information. The processing to be performed on each object sound source is then selected based on that importance index and the specification information.
The importance index used at this time need not be a single index per object sound source; a plurality of different importance indices may be obtained for each object sound source and used in the processing.
Below, with reference to FIGS. 4 through 7, specific examples of the reduction processing, corresponding to the processing of steps S12 through S15 of FIG. 2, will be described, in which an importance index is calculated and the obtained importance index is used to select the processing to be performed on each object sound source.
First, with reference to the flowchart of FIG. 4, an example of selecting the processing to be performed on each object sound source based on the positional relationship between the object sound source and the viewer in the space will be described. That is, the reduction processing performed by the server 11 will be described below with reference to the flowchart of FIG. 4. Note that this reduction processing is performed for each object sound source.
In step S71, the importance index calculation unit 21 acquires user position information and sound source position information as the information necessary for calculating the importance index.
Specifically, for example, the importance index calculation unit 21 acquires the video meta information from the free viewpoint video reproduction unit 32, thereby obtaining the user position information included in that video meta information. The importance index calculation unit 21 also acquires the audio meta information from the input audio stream, thereby obtaining the sound source position information included in that audio meta information. This processing of step S71 corresponds to the processing of steps S12 and S13 of FIG. 2.
In step S72, the importance index calculation unit 21 calculates the distance in the space between the viewer and the object sound source to be processed.
For example, the importance index calculation unit 21 calculates the distance |VO - VL| from the coordinates of the viewer's position in the space indicated by the user position information, that is, the vector VL representing the viewer's position, and the coordinates of the object sound source's position in the space indicated by the sound source position information, that is, the vector VO representing the object sound source's position.
Here, |VO - VL| denotes the magnitude of the vector (VO - VL); in this example, |VO - VL| is the distance from the viewer to the object sound source in the space.
When the distance serving as the importance index has been calculated for the object sound source to be processed, the importance index calculation unit 21 supplies the calculated importance index to the process selection unit 22. This processing of step S72 corresponds to the processing of step S14 of FIG. 2.
In step S73, the process selection unit 22 determines whether the distance between the viewer and the object sound source, serving as the importance index supplied from the importance index calculation unit 21, is greater than or equal to a predetermined threshold th.
That is, the process selection unit 22 determines whether the distance |VO - VL| and the threshold th satisfy the relationship of the following expression (1). In this case, when the relationship of expression (1) does not hold, the distance serving as the importance index is determined to be greater than or equal to the threshold th.
|VO - VL| < th   ... (1)
If it is determined in step S73 that the distance is greater than or equal to the threshold th, that is, if the relationship of expression (1) does not hold, the processing proceeds to step S74.
In step S74, the process selection unit 22 selects light rendering processing as the processing to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
For example, when the relationship of expression (1) does not hold, the object sound source to be processed is far from the viewer (user) in the space, so such an object sound source is unlikely to be important. The process selection unit 22 therefore reduces the amount of computation during rendering by performing light rendering processing on object sound sources of low importance.
In contrast, if it is determined in step S73 that the distance is not greater than or equal to the threshold th, that is, if the relationship of expression (1) holds, the processing proceeds to step S75.
In step S75, the process selection unit 22 selects strict rendering processing as the processing to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction processing ends.
For example, when the relationship of expression (1) holds, the object sound source to be processed is sufficiently close to the viewer in the space, that is, within a certain distance, so such an object sound source is important. By performing strict rendering processing on object sound sources of high importance, the process selection unit 22 makes it possible to reproduce the content audio with a high sense of presence.
The processing of steps S73 through S75 described above corresponds to the processing of step S15 of FIG. 2. In this processing, the distance from the viewer to the object sound source serves as the importance index, with a shorter distance indicating higher importance; strict rendering processing is selected for object sound sources within a certain distance of the viewer, while light rendering processing is selected for object sound sources at or beyond that distance, reducing the amount of computation.
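The distance-based selection of steps S72 through S75 reduces to a few lines; the following sketch assumes 3-D position vectors and an illustrative threshold value.

```python
import numpy as np

TH = 5.0  # threshold th, in the same units as the position coordinates

def select_rendering(source_pos: np.ndarray, viewer_pos: np.ndarray) -> str:
    """Choose per-source rendering by the viewer-to-source distance."""
    distance = np.linalg.norm(source_pos - viewer_pos)  # |VO - VL|
    return "strict" if distance < TH else "light"       # expression (1)
```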
As described above, the server 11 calculates the distance between the viewer and each object sound source as the importance index, and selects the rendering processing to be performed on the object sound source according to that distance. This makes it possible to obtain content audio with a high sense of presence while reducing the overall amount of computation during rendering.
Here, which rendering processing is performed on each object sound source is selected based on whether the distance to the object sound source is greater than or equal to the threshold th. Alternatively, for example, a predetermined number of object sound sources may be selected in ascending order of distance from the viewer, with strict rendering processing performed on the selected object sound sources and light rendering processing performed on the others. In this case, the number of object sound sources on which strict rendering processing is performed may be determined based on, for example, the computation specification information.
An example using the user position information and sound source position information included in the meta information has been described here, but other information may also be used: the processing to be performed on an object sound source may be selected by combining multiple pieces of information, or by multiple conditional branches that also use other information.
In such cases, it is conceivable to use, for example, space size information, line-of-sight direction information, spread information, acoustic characteristic information indicating acoustic characteristics such as the reverberation of the space, or arrangement information indicating the placement positions, that is, the positional relationships, of object sound sources and other objects in the space.
Specifically, for example, when the line-of-sight direction information included in the video meta information is used, light rendering processing may be performed on object sound sources that are not visible to the viewer, as determined from the viewer's line-of-sight direction identified from the line-of-sight direction information, the viewer's position, and the positions of the object sound sources.
This is effective when, for example, the amount of computation needs to be reduced and there are multiple object sound sources with the same importance index, such as the same distance from the viewer.
For example, when there are two object sound sources at the same distance from the viewer, the object sound source visible to the viewer can be treated as more important: strict rendering processing can be selected for it, and light rendering processing for the object sound source not visible to the viewer. This gives priority to object sound sources visible to the viewer, making it possible to reproduce the content audio with a high sense of presence while reducing the overall amount of computation.
In such a case, the importance index calculation unit 21 calculates the distance from the viewer to the object sound source as one importance index, and calculates information indicating whether the object sound source is visible to the viewer as another importance index. The process selection unit 22 then selects the processing to be performed on the object sound source based on the two importance indices calculated for it.
Also, for example, the audio meta information may include spread information indicating the degree of spread of the object sound source's sound image, that is, the size of the object in the space. The spread information is used, for example, to give the sound image of the object sound source a spread when VBAP is performed as the rendering processing.
When the degree of spread of the sound image indicated by such spread information is large, the sound image of the object sound source is spread over a fairly wide region, so there is little need to perform strict rendering processing on such an object sound source. Therefore, for example, light rendering processing may be selected for object sound sources whose degree of sound-image spread indicated by the spread information is greater than or equal to a predetermined threshold, and strict rendering processing for the other object sound sources.
In such a case, the importance index calculation unit 21 supplies the spread information extracted from the audio meta information to the process selection unit 22 as it is, as one importance index.
Furthermore, when acoustic characteristic information indicating the acoustic characteristics of the space is available, for example when acoustic characteristic information is stored in a reserved area of the audio meta information, the processing for an object sound source may be selected based on the degree of reverberation of the space indicated by that acoustic characteristic information.
For example, in a space with many sound reflections and a high degree of reverberation, the sense of presence of the content audio is not impaired even if the number of object sound sources subjected to light rendering processing is increased to some extent. Therefore, for example, the process selection unit 22 may determine the number of object sound sources subjected to strict rendering processing and the number subjected to light rendering processing based on the degree of reverberation indicated by the acoustic characteristic information.
In such a case, the importance index calculation unit 21 supplies the acoustic characteristic information to the process selection unit 22 together with the importance indices, and the process selection unit 22, for example, increases the number of object sound sources subjected to light rendering processing as the degree of reverberation indicated by the acoustic characteristic information increases.
Furthermore, for example, when the video meta information includes arrangement information indicating the placement positions of object sound sources and other objects in the space, the processing to be performed on each object sound source may be selected based on that arrangement information.
Specifically, for example, when an object sound source is hidden from the viewer by an object such as a wall in the space, light rendering processing may be performed on that object sound source.
In such a case, for example, the importance index calculation unit 21 calculates the distance from the viewer to the object sound source as one importance index, and calculates, based on the user position information, sound source position information, and arrangement information, information indicating whether the object sound source is visible to the viewer as another importance index. The process selection unit 22 then selects the processing to be performed on the object sound source based on the two importance indices calculated for it.
<Example 2 of reduction processing>
Next, another example of the reduction processing performed by the server 11 will be described.
This example is particularly effective when there are constraints on the computing resources of the server 11, that is, on the computation processing capability of the rendering processing unit 23 or on the transmission speed of the output audio stream, and the processing to be performed on each object sound source must be selected so as to satisfy those constraints.
The reduction processing performed by the server 11 will be described below with reference to the flowchart of FIG. 5. Note that the reduction processing described with reference to FIG. 5 is not performed per object sound source; it is performed once for all object sound sources together.
The processing of steps S101 and S102 is the same as the processing of steps S12 through S14 of FIG. 2, so its description is omitted. For example, in step S102, the distance from the viewer to each object sound source or the like is calculated as the importance index, and the importance indices and specification information are supplied from the importance index calculation unit 21 to the process selection unit 22.
In step S103, the process selection unit 22 determines, based on the specification information supplied from the importance index calculation unit 21, the number of object sound sources on which strict rendering processing is to be performed. In other words, the number of object sound sources to which strict rendering processing is assigned, that is, for which strict rendering processing is selected, and the number of object sound sources to which light rendering processing is assigned are determined.
For example, the higher the computation processing capability of the rendering processing unit 23 indicated by the computation specification information, the larger the number of object sound sources on which strict rendering processing is performed (hereinafter also referred to as the number of assigned sound sources).
At this time, the number of assigned sound sources is determined so that the computation processing capability required for the various processes, such as the rendering processing, for all object sound sources does not exceed the computation processing capability indicated by the computation specification information.
Note that the number of assigned sound sources may be determined based on the specification information, may be a predetermined number, or may be a number specified externally. Also, when processing to change the reproduction bit rate or processing to integrate object sound sources is selected as processing for reducing the amount of computation and transmission, the number of object sound sources on which each of these processes is performed may be determined based on, for example, at least one of the computation specification information and the transmission speed specification information. In this case, for example, the number of object sound sources on which each process is performed is determined so that the transmission speed required to transmit the resulting output audio stream does not exceed the maximum transmission speed indicated by the transmission speed specification information.
In step S104, the process selection unit 22 selects one object sound source as the object sound source to be processed and determines whether the object sound source to be processed ranks within the number of assigned sound sources determined in step S103.
For example, on the basis of the importance indexes of all the object sound sources, the process selection unit 22 ranks the object sound sources so that an object sound source whose importance index indicates a higher importance receives a higher rank. Then, where AS denotes the number of assigned sound sources, the process selection unit 22 determines that the object sound source to be processed ranks within the number of assigned sound sources if its rank is within the top AS overall.
If it is determined in step S104 that the object sound source ranks within the number of assigned sound sources, the process proceeds to step S105.
In step S105, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed and supplies the selection result to the rendering processing unit 23, after which the process proceeds to step S107.
On the other hand, if it is determined in step S104 that the object sound source does not rank within the number of assigned sound sources, that is, if the rank of the object sound source to be processed is lower than AS, the process proceeds to step S106.
In step S106, the process selection unit 22 selects the light rendering process as the process to be performed on the object sound source to be processed and supplies the selection result to the rendering processing unit 23, after which the process proceeds to step S107.
When the rendering process to be performed on the object sound source to be processed has been selected in step S105 or step S106, the process selection unit 22 determines in step S107 whether a process has been selected for all object sound sources.
If it is determined in step S107 that a process has not yet been selected for all object sound sources, the process returns to step S104 and the above-described processing is repeated. That is, an object sound source that has not yet been processed is set as the new object sound source to be processed, and the rendering process to be performed on it is selected.
On the other hand, if it is determined in step S107 that a process has been selected for all object sound sources, the reduction process ends. The processing in steps S103 to S107 described above corresponds to the processing in step S15 in FIG. 2.
As described above, the server 11 determines the number of assigned sound sources on the basis of the spec information and selects the rendering process to be performed on each object sound source so that the strict rendering process is performed on as many object sound sources as the number of assigned sound sources.
As a result, the strict rendering process is performed on the highly important object sound sources up to the number of assigned sound sources, and the light rendering process is performed on the remaining object sound sources.
Therefore, for example, when the distance from the viewer to each object sound source is calculated as the importance index, the strict rendering process is assigned to as many object sound sources as the number of assigned sound sources in ascending order of distance from the viewer. In this case, since the strict rendering process is assigned to the more important object sound sources, that is, the object sound sources at shorter distances, the computation and transmission amounts can be reduced without impairing the sense of presence of the content audio.
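The ranking and allocation of steps S103 to S107 can be summarized in a few lines. The following is a minimal Python sketch assuming a single numeric importance value per source; all function and variable names here are illustrative and do not appear in the patent.

    def select_rendering_processes(importance, num_assigned):
        """Rank object sound sources by importance and assign the strict
        rendering process to the top-ranked sources (sketch of steps
        S103 to S107; names are illustrative).

        importance: dict mapping a sound-source id to its importance
            index, where a larger value means higher importance (for
            example, the negated viewer-to-source distance).
        num_assigned: the number of assigned sound sources AS, chosen so
            that the total rendering cost stays within the computation spec.
        """
        ranked = sorted(importance, key=lambda s: importance[s], reverse=True)
        selection = {}
        for rank, source_id in enumerate(ranked):
            # Sources ranked within the top AS receive the strict process.
            selection[source_id] = "strict" if rank < num_assigned else "light"
        return selection

    # Example: three sources at 1 m, 4 m and 10 m from the viewer, one slot.
    distances = {"obj1": 1.0, "obj2": 4.0, "obj3": 10.0}
    importance = {k: -d for k, d in distances.items()}  # nearer = more important
    print(select_rendering_processes(importance, num_assigned=1))
    # {'obj1': 'strict', 'obj2': 'light', 'obj3': 'light'}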
<Example 3 of reduction processing>
Alternatively, the importance information included in the audio meta information or the recording/editing meta information may be used as the importance index as it is. Hereinafter, the reduction process performed by the server 11 in such a case will be described with reference to the flowchart of FIG. 6. Note that the reduction process described with reference to FIG. 6 is performed for each object sound source.
In step S131, the importance index calculation unit 21 acquires the importance information as meta information for the object sound source to be processed.
For example, the importance index calculation unit 21 acquires the importance information from the audio meta information of the input audio stream, or acquires it from the recording/editing meta information supplied from the meta information storage unit 13. The importance index calculation unit 21 supplies the acquired importance information as it is to the process selection unit 22 as the importance index. The processing in step S131 corresponds to the processing in steps S12 to S14 in FIG. 2.
Note that the importance information may be used for each object sound source in units of one frame or of several frames, or one piece of importance information may be used in common for all frames.
In step S132, the process selection unit 22 determines whether there is importance information for the object sound source to be processed. For example, the process selection unit 22 determines that there is importance information when the importance information of the object sound source to be processed has been supplied as the importance index from the importance index calculation unit 21.
If it is determined in step S132 that there is no importance information, the process proceeds to step S135.
On the other hand, if it is determined in step S132 that there is importance information, in step S133 the process selection unit 22 determines whether the importance indicated by the importance information of the object sound source to be processed, supplied from the importance index calculation unit 21, is equal to or greater than a predetermined threshold. Note that the threshold used in step S133 may be a predetermined value or a value determined on the basis of the spec information or the like.
If it is determined in step S133 that the importance is not equal to or greater than the threshold, the process proceeds to step S135.
On the other hand, if it is determined in step S133 that the importance is equal to or greater than the threshold, the process proceeds to step S134.
In step S134, the process selection unit 22 selects the strict rendering process as the process to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction process ends. An object sound source whose importance is equal to or greater than the threshold is an important object sound source that should be reproduced faithfully, so the strict rendering process is selected.
If it is determined in step S132 that there is no importance information, or if it is determined in step S133 that the importance is not equal to or greater than the threshold, the processing of step S135 is performed.
In step S135, the process selection unit 22 selects the light rendering process as the process to be performed on the object sound source to be processed, supplies the selection result to the rendering processing unit 23, and the reduction process ends.
In this way, the light rendering process is selected for an object sound source that has no importance information to be used as the importance index or whose importance information indicates a low importance, and the amount of computation at rendering time is thereby reduced.
The processing in steps S132 to S135 described above corresponds to the processing in step S15 in FIG. 2. In particular, in this example, the strict rendering process is performed only on object sound sources that have importance information and whose importance information indicates an importance equal to or greater than a certain value.
As described above, the server 11 selects the process to be performed on each object sound source using the importance information as the importance index. By using the importance information in this way, the amount of computation can be reduced appropriately, and content can be reproduced with a high sense of presence even with a small amount of computation.
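This metadata-driven branch (steps S132 to S135) fits in a single function. The sketch below is illustrative only: the names and the numeric scale of the importance value are assumptions, since the patent leaves the encoding of the importance information open.

    def select_by_importance_info(importance_info, threshold):
        """Select a rendering process from optional importance metadata
        (sketch of steps S132 to S135; names are illustrative).

        importance_info: the importance value carried in the audio meta
            information or recording/editing meta information, or None
            when that information is absent.
        threshold: a predetermined value, or one derived from spec
            information.
        """
        if importance_info is not None and importance_info >= threshold:
            return "strict"  # important source: reproduce it faithfully
        return "light"       # no metadata, or low importance: save computation

    print(select_by_importance_info(0.9, threshold=0.5))   # strict
    print(select_by_importance_info(0.2, threshold=0.5))   # light
    print(select_by_importance_info(None, threshold=0.5))  # light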
<Example 4 of reduction processing>
Furthermore, when a large number of object sound sources exist in the space, if two or more object sound sources can be integrated by a simple process and handled as one object sound source, the amount of computation for the rendering process and the like can be reduced. In addition, integrating object sound sources also reduces the data amount of the output audio stream, so the transmission amount can be reduced as well.
Accordingly, two or more object sound sources may be integrated into one object sound source on the basis of importance indexes. Hereinafter, the reduction process performed by the server 11 in such a case will be described with reference to the flowchart of FIG. 7.
Note that the reduction process described with reference to FIG. 7 is performed for each combination of object sound sources. Here, however, to simplify the description, a case will be described in which two arbitrary object sound sources are set as the object sound sources to be processed and the reduction process is performed on them.
In step S161, the importance index calculation unit 21 acquires the user position information and the sound source position information of the two object sound sources to be processed.
That is, the importance index calculation unit 21 acquires the user position information from the video meta information and acquires the sound source position information of the two object sound sources to be processed from the audio meta information. The processing in step S161 corresponds to the processing in steps S12 and S13 in FIG. 2.
In step S162, the importance index calculation unit 21 calculates the distance between the two object sound sources in the space on the basis of their sound source position information acquired in step S161.
In step S163, the importance index calculation unit 21 calculates, on the basis of the user position information and the sound source position information acquired in step S161, the angle difference between the direction of one of the object sound sources to be processed as seen from the viewer and the direction of the other object sound source to be processed.
For example, the importance index calculation unit 21 calculates, as the angle difference, the angle formed by a vector whose start point is the position of the viewer in the space and whose end point is the position of one of the object sound sources to be processed, and a vector whose start point is the position of the viewer and whose end point is the position of the other object sound source to be processed.
Here, focusing on one of the object sound sources to be processed, for example, the angle difference between the direction of that object sound source as seen from the viewer in the space and the direction of the other object sound source as seen from the viewer has thus been obtained as an importance index of the first object sound source.
The importance index calculation unit 21 supplies the distance obtained in step S162 and the angle difference obtained in step S163 to the process selection unit 22 as importance indexes of the object sound sources to be processed. The processing in steps S162 and S163 corresponds to the processing in step S14 in FIG. 2.
In step S164, the process selection unit 22 determines whether the distance between the object sound sources supplied from the importance index calculation unit 21 is equal to or less than a predetermined threshold.
If it is determined in step S164 that the distance between the object sound sources is not equal to or less than the threshold, the process proceeds to step S167.
On the other hand, if it is determined in step S164 that the distance between the object sound sources is equal to or less than the threshold, in step S165 the process selection unit 22 determines whether the angle difference between the directions of the object sound sources obtained in step S163 is equal to or less than a predetermined threshold.
Note that the angle difference used as an importance index may be calculated only when it is determined in step S164 that the distance between the object sound sources is equal to or less than the threshold. Also, the threshold used in step S164 is different from the threshold used in step S165.
If it is determined in step S165 that the angle difference is not equal to or less than the threshold, the process proceeds to step S167.
On the other hand, if it is determined in step S165 that the angle difference is equal to or less than the threshold, the process proceeds to step S166.
In step S166, the process selection unit 22 selects, as the process to be performed on the two object sound sources to be processed, a process of integrating the two object sound sources, supplies the selection result to the rendering processing unit 23, and the reduction process ends.
In this case, the two object sound sources to be processed lie within a certain distance of each other and in substantially the same direction as seen from the viewer. Therefore, even if these two object sound sources are integrated into one object sound source, no large shift occurs in the sound image position or the like, and the sense of presence of the content audio is not impaired.
Accordingly, the process selection unit 22 integrates two such object sound sources into one object sound source, thereby reducing the amount of computation at rendering time and the transmission amount of the output audio stream without impairing the sense of presence of the content audio.
When the processing of step S166 has been performed, a process of integrating the two object sound sources into one object sound source is performed in step S16 of FIG. 2. At that time, the position of the integrated object sound source in the space is set to, for example, the average of the coordinates indicating the positions of the two object sound sources, that is, the position of the mean coordinates.
If it is determined in step S164 that the distance between the object sound sources is not equal to or less than the threshold, or if it is determined in step S165 that the angle difference is not equal to or less than the threshold, the processing of step S167 is performed.
In step S167, the process selection unit 22 causes the two object sound sources to be processed individually. That is, the process selection unit 22 selects not to integrate the two object sound sources to be processed, supplies the selection result to the rendering processing unit 23, and the reduction process ends.
In this case, the two object sound sources to be processed are separated by at least a certain distance in the space, or, even if they are within a certain distance, they lie in sufficiently different directions as seen from the viewer.
Therefore, in such a case, if the two object sound sources to be processed were integrated into one, the localization position of the sound image could change greatly before and after the integration depending on the position of the integrated object sound source, and the sense of presence of the content audio could be impaired. Accordingly, in step S167, the two object sound sources to be processed are processed individually without being integrated.
The processing in steps S164 to S167 described above corresponds to the processing in step S15 in FIG. 2.
As described above, the server 11 calculates importance indexes on the basis of the user position information and the sound source position information and selects the process to be performed on the object sound sources to be processed on the basis of the obtained importance indexes. By integrating object sound sources in accordance with the positional relationship between the two object sound sources to be processed and the viewer in this way, the computation and transmission amounts can be reduced appropriately, and content can be reproduced with a high sense of presence even with small computation and transmission amounts.
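Concretely, the distance test, the vector angle test, and the mean-coordinate merge of steps S161 to S167 can be sketched as follows in Python. The function name, the threshold values, and the return convention are illustrative assumptions; the patent specifies only the tests themselves.

    import math

    def maybe_integrate(viewer, src_a, src_b, dist_thresh, angle_thresh_deg):
        """Decide whether two object sound sources should be integrated
        (sketch of steps S161 to S167; names and thresholds illustrative).

        viewer, src_a, src_b: (x, y, z) positions in the space.
        Returns the averaged position of the merged source, or None when
        the two sources should be processed individually.
        """
        # Step S162: distance between the two object sound sources.
        dist = math.dist(src_a, src_b)

        # Step S163: angle between the viewer-to-source vectors.
        va = [a - v for a, v in zip(src_a, viewer)]
        vb = [b - v for b, v in zip(src_b, viewer)]
        dot = sum(x * y for x, y in zip(va, vb))
        norm = math.hypot(*va) * math.hypot(*vb)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

        # Steps S164/S165: integrate only if both indexes are small enough.
        if dist <= dist_thresh and angle <= angle_thresh_deg:
            # Steps S166/S16: merged source placed at the mean coordinates.
            return tuple((a + b) / 2 for a, b in zip(src_a, src_b))
        return None  # step S167: keep the sources separate

    print(maybe_integrate((0, 0, 0), (5, 0.2, 0), (5, -0.2, 0), 1.0, 10.0))
    # (5.0, 0.0, 0.0): close together and in nearly the same direction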
Note that the reduction processes are not limited to those described with reference to FIGS. 4 to 7; other reduction processes, such as changing the playback bit rate of an object sound source, can also be performed, and any of the reduction processes described with reference to FIGS. 4 to 7 can be combined with other reduction processes.
For example, when the playback bit rate is changed in order to reduce the computation and transmission amounts, it suffices, in the reduction processes of FIGS. 4 to 6 described above, to select whether or not to perform the process of changing the playback bit rate for each object sound source, instead of selecting between the strict rendering process and the light rendering process.
In this case, in the reduction processes of FIGS. 4 to 6, the process of changing the playback bit rate is selected in the steps where the light rendering process would have been selected, and the process of changing the playback bit rate is not performed in the steps where the strict rendering process would have been selected.
Furthermore, although the case where the rendering process is performed in the server 11 has been described above, the rendering process and the like may instead be performed in the free viewpoint video reproduction unit 32 or the like of the client device 12 in accordance with the selection result of the process selection unit 22.
<Example of computer configuration>
The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, a program constituting the software is installed in a computer. Here, the computer includes a computer incorporated in dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions when various programs are installed in it.
FIG. 8 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processes by means of a program.
In the computer, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to one another via a bus 504.
An input/output interface 505 is further connected to the bus 504. An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input/output interface 505.
The input unit 506 includes a keyboard, a mouse, a microphone, an image sensor, and the like. The output unit 507 includes a display, a speaker, and the like. The recording unit 508 includes a hard disk, a nonvolatile memory, and the like. The communication unit 509 includes a network interface and the like. The drive 510 drives a removable recording medium 511 such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory.
In the computer configured as described above, the above-described series of processes is performed, for example, by the CPU 501 loading a program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executing it.
The program executed by the computer (CPU 501) can be provided, for example, recorded on the removable recording medium 511 as packaged media or the like. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
In the computer, the program can be installed in the recording unit 508 via the input/output interface 505 by mounting the removable recording medium 511 on the drive 510. The program can also be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. In addition, the program can be installed in the ROM 502 or the recording unit 508 in advance.
Note that the program executed by the computer may be a program in which the processes are performed in time series in the order described in this specification, or a program in which the processes are performed in parallel or at necessary timings, such as when a call is made.
Embodiments of the present technology are not limited to the above-described embodiments, and various modifications can be made without departing from the gist of the present technology.
For example, the present technology can take a cloud computing configuration in which one function is shared and jointly processed by a plurality of devices via a network.
Each step described in the above flowcharts can be executed by one device or can be shared and executed by a plurality of devices.
Furthermore, when a plurality of processes are included in one step, the plurality of processes included in that one step can be executed by one device or can be shared and executed by a plurality of devices.
Furthermore, the present technology can also be configured as follows.
(1)
An audio processing device including a process selection unit that selects a process to be performed on audio data of an object sound source on the basis of one or more importance indexes serving as indexes of the importance of the object sound source.
(2)
The audio processing device according to (1), in which the process selection unit selects, as the process, a process for reducing a computation amount or a transmission amount.
(3)
The audio processing device according to (1) or (2), in which the process selection unit selects, as the process, one of a plurality of rendering processes having mutually different computation amounts.
(4)
The audio processing device according to any one of (1) to (3), in which the process selection unit selects, as the process, a process of integrating the audio data of a plurality of the object sound sources.
(5)
The audio processing device according to any one of (1) to (4), in which the process selection unit selects, as the process, a process of changing a playback bit rate of the audio data of the object sound source.
(6)
The audio processing device according to any one of (1) to (5), further including an importance index calculation unit that calculates the importance index on the basis of meta information related to the audio data.
(7)
The audio processing device according to (6), in which the importance index calculation unit calculates the importance index on the basis of at least one of, as the meta information, position information of the object sound source, position information of a viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of a space, and arrangement information of objects in the space.
(8)
The audio processing device according to (6) or (7), in which the importance index calculation unit calculates, as the importance index, a distance between the object sound source and a viewer in a space.
(9)
The audio processing device according to any one of (6) to (8), in which the importance index calculation unit uses the importance information of the object sound source, as the meta information, directly as the importance index.
(10)
The audio processing device according to any one of (6) to (9), in which the importance index calculation unit calculates, as the importance index, a distance between two object sound sources in a space.
(11)
The audio processing device according to any one of (6) to (10), in which the importance index calculation unit calculates, as the importance index, an angle difference between the direction of the object sound source as seen from a viewer in a space and the direction of another object sound source as seen from the viewer.
(12)
The audio processing device according to any one of (1) to (11), in which the process selection unit determines, for each of a plurality of the processes, the number of object sound sources on which that process is performed, on the basis of at least one of computation spec information indicating the processing capability of a processing unit that performs the process and transmission rate spec information indicating the maximum transmission rate of the audio data.
(13)
An audio processing method including:
an acquisition step of acquiring one or more importance indexes serving as indexes of the importance of an object sound source; and
a process selection step of selecting a process to be performed on audio data of the object sound source on the basis of the one or more importance indexes.
11 server, 12 client device, 13 meta information storage unit, 21 importance index calculation unit, 22 process selection unit, 23 rendering processing unit, 24 audio stream transmission unit, 31 audio stream reception unit, 32 free viewpoint video reproduction unit

Claims (13)

1. An audio processing device comprising a process selection unit that selects a process to be performed on audio data of an object sound source on the basis of one or more importance indexes serving as indexes of the importance of the object sound source.
2. The audio processing device according to claim 1, wherein the process selection unit selects, as the process, a process for reducing a computation amount or a transmission amount.
3. The audio processing device according to claim 1, wherein the process selection unit selects, as the process, one of a plurality of rendering processes having mutually different computation amounts.
4. The audio processing device according to claim 1, wherein the process selection unit selects, as the process, a process of integrating the audio data of a plurality of the object sound sources.
5. The audio processing device according to claim 1, wherein the process selection unit selects, as the process, a process of changing a playback bit rate of the audio data of the object sound source.
6. The audio processing device according to claim 1, further comprising an importance index calculation unit that calculates the importance index on the basis of meta information related to the audio data.
7. The audio processing device according to claim 6, wherein the importance index calculation unit calculates the importance index on the basis of at least one of, as the meta information, position information of the object sound source, position information of a viewer, gaze direction information of the viewer, importance information of the object sound source, spread information of the object sound source, acoustic characteristic information of a space, and arrangement information of objects in the space.
8. The audio processing device according to claim 6, wherein the importance index calculation unit calculates, as the importance index, a distance between the object sound source and a viewer in a space.
9. The audio processing device according to claim 6, wherein the importance index calculation unit uses the importance information of the object sound source, as the meta information, directly as the importance index.
10. The audio processing device according to claim 6, wherein the importance index calculation unit calculates, as the importance index, a distance between two object sound sources in a space.
11. The audio processing device according to claim 6, wherein the importance index calculation unit calculates, as the importance index, an angle difference between the direction of the object sound source as seen from a viewer in a space and the direction of another object sound source as seen from the viewer.
12. The audio processing device according to claim 1, wherein the process selection unit determines, for each of a plurality of the processes, the number of object sound sources on which that process is performed, on the basis of at least one of computation spec information indicating the processing capability of a processing unit that performs the process and transmission rate spec information indicating the maximum transmission rate of the audio data.
13. An audio processing method comprising:
an acquisition step of acquiring one or more importance indexes serving as indexes of the importance of an object sound source; and
a process selection step of selecting a process to be performed on audio data of the object sound source on the basis of the one or more importance indexes.
PCT/JP2017/030858 2016-09-12 2017-08-29 Sound processing device and method WO2018047667A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016-177336 2016-09-12
JP2016177336 2016-09-12

Publications (1)

Publication Number Publication Date
WO2018047667A1 WO2018047667A1 (en) 2018-03-15

Family

ID=61561462

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/030858 WO2018047667A1 (en) 2016-09-12 2017-08-29 Sound processing device and method

Country Status (1)

Country Link
WO (1) WO2018047667A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018180531A1 (en) * 2017-03-28 2018-10-04 ソニー株式会社 Information processing device, information processing method, and program
WO2019116890A1 (en) * 2017-12-12 2019-06-20 ソニー株式会社 Signal processing device and method, and program
WO2020105423A1 (en) * 2018-11-20 2020-05-28 ソニー株式会社 Information processing device and method, and program
WO2020153092A1 (en) * 2019-01-25 2020-07-30 ソニー株式会社 Information processing device, and information processing method
CN111903136A (en) * 2018-03-29 2020-11-06 索尼公司 Information processing apparatus, information processing method, and program
EP3809709A1 (en) * 2019-10-14 2021-04-21 Koninklijke Philips N.V. Apparatus and method for audio encoding
WO2021140959A1 (en) * 2020-01-10 2021-07-15 ソニーグループ株式会社 Encoding device and method, decoding device and method, and program
JP2021136465A (en) * 2020-02-21 2021-09-13 日本放送協会 Receiver, content transfer system, and program
JP2022505964A (en) * 2018-10-26 2022-01-14 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Directional volume map based audio processing
WO2023199778A1 (en) * 2022-04-14 2023-10-19 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Acoustic signal processing method, program, acoustic signal processing device, and acoustic signal processing system
WO2023199673A1 (en) * 2022-04-14 2023-10-19 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Stereophonic sound processing method, stereophonic sound processing device, and program
RU2815621C1 (en) * 2018-08-28 2024-03-19 Конинклейке Филипс Н.В. Audio device and audio processing method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1127800A (en) * 1997-07-03 1999-01-29 Fujitsu Ltd Stereophonic processing system
JP2009278381A (en) * 2008-05-14 2009-11-26 Nippon Hoso Kyokai <Nhk> Acoustic signal multiplex transmission system, manufacturing device, and reproduction device added with sound image localization acoustic meta-information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1127800A (en) * 1997-07-03 1999-01-29 Fujitsu Ltd Stereophonic processing system
JP2009278381A (en) * 2008-05-14 2009-11-26 Nippon Hoso Kyokai <Nhk> Acoustic signal multiplex transmission system, manufacturing device, and reproduction device added with sound image localization acoustic meta-information

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018180531A1 (en) * 2017-03-28 2018-10-04 ソニー株式会社 Information processing device, information processing method, and program
US11074921B2 (en) 2017-03-28 2021-07-27 Sony Corporation Information processing device and information processing method
WO2019116890A1 (en) * 2017-12-12 2019-06-20 ソニー株式会社 Signal processing device and method, and program
US11310619B2 (en) 2017-12-12 2022-04-19 Sony Corporation Signal processing device and method, and program
US11838742B2 (en) 2017-12-12 2023-12-05 Sony Group Corporation Signal processing device and method, and program
CN111903136A (en) * 2018-03-29 2020-11-06 索尼公司 Information processing apparatus, information processing method, and program
RU2815621C1 (en) * 2018-08-28 2024-03-19 Конинклейке Филипс Н.В. Audio device and audio processing method
JP7526173B2 (en) 2018-10-26 2024-07-31 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Directional Loudness Map Based Audio Processing
JP2022505964A (en) * 2018-10-26 2022-01-14 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Directional volume map based audio processing
JP2022177253A (en) * 2018-10-26 2022-11-30 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Directional volume map-based audio processing
JP7468359B2 (en) 2018-11-20 2024-04-16 ソニーグループ株式会社 Information processing device, method, and program
JPWO2020105423A1 (en) * 2018-11-20 2021-10-14 ソニーグループ株式会社 Information processing equipment and methods, and programs
WO2020105423A1 (en) * 2018-11-20 2020-05-28 ソニー株式会社 Information processing device and method, and program
US12073841B2 (en) 2019-01-25 2024-08-27 Sony Group Corporation Information processing device and information processing method
WO2020153092A1 (en) * 2019-01-25 2020-07-30 ソニー株式会社 Information processing device, and information processing method
JPWO2020153092A1 (en) * 2019-01-25 2021-12-02 ソニーグループ株式会社 Information processing equipment and information processing method
US20220122616A1 (en) * 2019-01-25 2022-04-21 Sony Group Corporation Information processing device and information processing method
JP7415954B2 (en) 2019-01-25 2024-01-17 ソニーグループ株式会社 Information processing device and information processing method
WO2021074007A1 (en) * 2019-10-14 2021-04-22 Koninklijke Philips N.V. Apparatus and method for audio encoding
CN114600188A (en) * 2019-10-14 2022-06-07 皇家飞利浦有限公司 Apparatus and method for audio coding
RU2823537C1 (en) * 2019-10-14 2024-07-23 Конинклейке Филипс Н.В. Audio encoding device and method
EP3809709A1 (en) * 2019-10-14 2021-04-21 Koninklijke Philips N.V. Apparatus and method for audio encoding
WO2021140959A1 (en) * 2020-01-10 2021-07-15 ソニーグループ株式会社 Encoding device and method, decoding device and method, and program
JP7457525B2 (en) 2020-02-21 2024-03-28 日本放送協会 Receiving device, content transmission system, and program
JP2021136465A (en) * 2020-02-21 2021-09-13 日本放送協会 Receiver, content transfer system, and program
WO2023199673A1 (en) * 2022-04-14 2023-10-19 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Stereophonic sound processing method, stereophonic sound processing device, and program
WO2023199778A1 (en) * 2022-04-14 2023-10-19 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Acoustic signal processing method, program, acoustic signal processing device, and acoustic signal processing system

Similar Documents

Publication Publication Date Title
WO2018047667A1 (en) Sound processing device and method
US20200260210A1 (en) Audio parallax for virtual reality, augmented reality, and mixed reality
JP7479352B2 (en) Audio device and method for audio processing
CN110447071B (en) Information processing apparatus, information processing method, and removable medium recording program
TW202133625A (en) Selecting audio streams based on motion
Quackenbush et al. MPEG standards for compressed representation of immersive audio
CN103609143B (en) For catching and the method for playback sources from the sound of multiple sound source
CN111492342B (en) Audio scene processing
JP7457525B2 (en) Receiving device, content transmission system, and program
JP7533223B2 (en) AUDIO SYSTEM, AUDIO PLAYBACK DEVICE, SERVER DEVICE, AUDIO PLAYBACK METHOD, AND AUDIO PLAYBACK PROGRAM
KR20240001226A (en) 3D audio signal coding method, device, and encoder
KR20230060502A (en) Signal processing device and method, learning device and method, and program
EP4055840A1 (en) Signalling of audio effect metadata in a bitstream
RU2815621C1 (en) Audio device and audio processing method
RU2823573C1 (en) Audio device and audio processing method
RU2815366C2 (en) Audio device and audio processing method
RU2798414C2 (en) Audio device and audio processing method
WO2022034805A1 (en) Signal processing device and method, and audio playback system
RU2816884C2 (en) Audio device, audio distribution system and method of operation thereof
WO2024084920A1 (en) Sound processing method, sound processing device, and program
CN116866817A (en) Device and method for presenting spatial audio content

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17848608

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 17848608

Country of ref document: EP

Kind code of ref document: A1