WO2024084997A1 - Sound processing device and sound processing method - Google Patents

Sound processing device and sound processing method

Info

Publication number
WO2024084997A1
WO2024084997A1 PCT/JP2023/036494 JP2023036494W WO2024084997A1 WO 2024084997 A1 WO2024084997 A1 WO 2024084997A1 JP 2023036494 W JP2023036494 W JP 2023036494W WO 2024084997 A1 WO2024084997 A1 WO 2024084997A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
reflected
information
sounds
volume
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2023/036494
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
康太 中橋
智一 石川
陽 宇佐見
成悟 榎本
宏幸 江原
摩里子 山田
修二 宮阪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Intellectual Property Corp of America
Original Assignee
Panasonic Intellectual Property Corp of America
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panasonic Intellectual Property Corp of America
Priority to KR1020257010711A (publication KR20250091193A)
Priority to EP23879643.7A (publication EP4607965A4)
Priority to JP2024551488A (publication JPWO2024084997A1)
Priority to CN202380071402.8A (publication CN119999236A)
Publication of WO2024084997A1
Anticipated expiration
Ceased (current legal status)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K15/00 Acoustics not otherwise provided for
    • G10K15/08 Arrangements for producing a reverberation or echo sound
    • G10K15/12 Arrangements for producing a reverberation or echo sound using electronic time-delay networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/13 Aspects of volume control, not necessarily automatic, in stereophonic sound systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303 Tracking of listener position or orientation

Definitions

  • This disclosure relates to audio processing devices, etc.
  • Patent Documents 1 to 3 disclose technologies related to the sound processing device and sound processing method of the present disclosure.
  • Patent Document 1: Patent No. 6288100; Patent Document 2: JP 2019-22049 A; Patent Document 3: International Publication No. 2021/180938
  • Patent Document 1 discloses a technology that performs signal processing on object audio signals and presents them to a listener.
  • As ER technology becomes more widespread and services that use ER technology become more diverse, there is a demand for audio processing that accommodates differences in, for example, the acoustic quality required by each service, the signal processing capabilities of the terminal used, and the sound quality that the sound presentation device can provide.
  • Improvements in audio processing technology are needed to provide this.
  • Here, improvements in sound processing technology refer to changes to existing sound processing.
  • For example, improvements in sound processing technology may provide processing that imparts new sound effects, a reduction in the amount of processing required for sound processing, improvement in the quality of sound obtained by sound processing, a reduction in the amount of data required for information used to implement sound processing, or easier acquisition or generation of information used to implement sound processing.
  • Improvements in sound processing technology may also provide a combination of any two or more of these.
  • An audio device includes a circuit and a memory, and the circuit uses the memory to acquire sound space information including information on a sound source in a sound space, information on objects in the sound space, and information on the position of a listener in the sound space, and uses the sound space information to calculate an evaluation value of a reflected sound generated in response to a sound generated from the sound source.
  • One aspect of the present disclosure can provide, for example, processing to impart new acoustic effects, reduction in the amount of acoustic processing, improvement in the sound quality of the audio obtained by acoustic processing, reduction in the amount of data of information used to implement acoustic processing, or simplification of acquisition or generation of information used to implement acoustic processing.
  • One aspect of the present disclosure can also provide any combination of these.
  • One aspect of the present disclosure can provide acoustic processing suited to the listener's usage environment, contributing to an improved acoustic experience for the listener.
  • The above effects can be achieved in devices or services that allow listeners to move freely within a virtual space.
  • The above effects are merely examples of the effects of the various aspects understood from this disclosure.
  • Each of the one or more aspects understood from this disclosure may be an aspect conceived from a perspective different from the above, an aspect that achieves a purpose different from the above, or an aspect that obtains an effect different from the above.
  • FIG. 1 is a diagram showing an example of direct sound and reflected sound generated in a sound space.
  • FIG. 2 is a diagram showing an example of a stereophonic sound reproduction system according to an embodiment.
  • FIG. 3A is a block diagram showing an example of a configuration of an encoding device according to an embodiment.
  • FIG. 3B is a block diagram showing an example of a configuration of a decoding device according to an embodiment.
  • FIG. 3C is a block diagram showing another example of the configuration of the encoding device according to the embodiment.
  • FIG. 3D is a block diagram showing another example of the configuration of a decoding device according to an embodiment.
  • FIG. 4A is a block diagram showing an example of the configuration of a decoder according to an embodiment.
  • FIG. 4B is a block diagram showing another example of the configuration of a decoder according to an embodiment.
  • FIG. 5 is a diagram illustrating an example of a physical configuration of the audio signal processing device according to the embodiment.
  • FIG. 6 is a diagram illustrating an example of a physical configuration of an encoding device according to an embodiment.
  • FIG. 7 is a block diagram illustrating an example of the configuration of a rendering unit according to the embodiment.
  • FIG. 8 is a flowchart showing an example of the operation of the audio signal processing device according to the embodiment.
  • FIG. 9 is a diagram showing the positional relationship between the listener and an obstacle object, which is relatively far away.
  • FIG. 10 is a diagram showing a positional relationship in which the listener and an obstacle object are relatively close to each other.
  • FIG. 11 is a flowchart showing an example of the selection process according to the embodiment.
  • FIG. 12 is a flowchart illustrating an example of the evaluation process according to the embodiment.
  • FIG. 13 is a diagram showing an example of the arrival angles of direct sound and reflected sound.
  • FIG. 14 is a diagram showing an example of a method for setting threshold data based on the temporal masking phenomenon.
  • FIG. 15 is a diagram illustrating an example of threshold data.
  • FIG. 16 is a diagram showing the relationship between the time difference between a direct sound and a reflected sound and the threshold value.
  • FIG. 17 is a diagram illustrating an example of a configuration for a rendering unit to perform pipeline processing.
  • (Findings that form the basis of this disclosure) FIG. 1 is a diagram showing an example of direct sound and reflected sound generated in a sound space.
  • In acoustic processing that expresses the characteristics of a virtual space with sound, it is effective to reproduce not only direct sound but also reflected sound, both to express the size of the space and the material of the walls and to let the listener accurately grasp the position of the sound source (localization of the sound image).
  • The number of sound rays used to express the characteristics of the virtual space with sound, and the transition in the volume of each sound ray, are calculated at the time of rendering. For this reason, it is not easy to reduce the amount of calculation required at the time of rendering.
  • Patent Document 1 discloses a method for detecting the importance of audio objects and not playing sounds caused by audio objects with low importance.
  • The listener grasps the positional relationship between the listener, the sound source, and the objects that reflect the sound based on the direct sound generated from the sound source and the reflected sound generated when the direct sound is reflected by the objects. Therefore, reducing the direct sound and reflected sound caused by a specific sound source may make it difficult to accurately grasp the sound position and space.
  • The present disclosure therefore aims to provide an audio processing device etc. that can reduce the computational load for processing reflected sounds while still enabling sound localization and spatial understanding.
  • the sound processing device includes a circuit and a memory, and the circuit uses the memory to acquire sound space information including information on a sound source in the sound space, information on objects in the sound space, and information on the position of a listener in the sound space, and calculates an evaluation value of a reflected sound generated in response to a sound generated from the sound source using the sound space information.
  • the device of the above aspect can use the sound space information to appropriately calculate the evaluation value of the reflected sound, which depends on the information of the sound source, the object, and the position of the listener. Therefore, it becomes possible to appropriately select the reflected sound to be processed based on the evaluation value of the reflected sound. This makes it possible to reduce the computational load for the reflected sound while making it possible to grasp the position and space of the sound.
  • the sound processing device may be the sound processing device according to the first aspect, in which the circuit controls whether or not to select reflected sounds based on the evaluation value.
  • the device of the above aspect can appropriately select the reflected sound to be processed based on the evaluation value of the reflected sound.
  • the sound processing device may be the sound processing device according to the second aspect, in which the circuit does not perform binaural processing on the reflected sound if the reflected sound is not selected.
  • the device of the above aspect can reduce the computational load for reflected sounds by omitting binaural processing.
  • the sound processing device may be any of the sound processing devices according to the first to third aspects, and the circuit may calculate the volume of the reflected sound, and calculate an evaluation value of the reflected sound when the volume exceeds a predetermined threshold value.
  • the device of the above aspect can omit the calculation of the evaluation value of the reflected sound when the volume of the reflected sound is equal to or lower than a predetermined threshold. Therefore, the device of the above aspect can reduce the computational load for the reflected sound.
  • the sound processing device may be the sound processing device according to the second aspect, in which the circuit calculates the total computation load of one or more selected reflected sounds including the reflected sound when the reflected sound is selected based on the evaluation value, and cancels the selection of the reflected sound when the total computation load exceeds a predetermined upper limit.
  • the device of the above aspect can prevent the total computation load from exceeding a predetermined upper limit. This allows the device of the above aspect to reduce the computation load for reflected sound.
  • The sound processing device may be the sound processing device according to the fifth aspect, in which the total computational load is defined by the number of the one or more selected reflected sounds or by the processing amount of the one or more selected reflected sounds.
  • The device of the above aspect can prevent the number of the one or more selected reflected sounds, or the processing amount of the one or more selected reflected sounds, from exceeding a predetermined upper limit. This allows the device of the above aspect to reduce the computational load for reflected sounds.
  • the sound processing device may be any of the sound processing devices according to the first to sixth aspects, in which the circuit calculates the volume of each of a plurality of reflected sounds generated as reflected sounds in the sound space, and calculates an evaluation value of the reflected sound for each of one or more reflected sounds that have a volume equal to or greater than a predetermined threshold value among the plurality of reflected sounds.
  • the device of the above aspect can omit the calculation of the evaluation value of each of the multiple reflected sounds when the volume of the reflected sound falls below a predetermined threshold. Therefore, the device of the above aspect can reduce the computational load for the reflected sounds.
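The volume pre-filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Reflection` record, the decibel scale, and the `evaluate` callback are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Reflection:
    """Hypothetical record for one reflected sound."""
    name: str
    volume_db: float  # assumed volume measure; the source does not fix a unit


def evaluate_audible(
    reflections: List[Reflection],
    threshold_db: float,
    evaluate: Callable[[Reflection], float],
) -> Dict[str, float]:
    """Evaluate only reflections whose volume is at or above the threshold."""
    values: Dict[str, float] = {}
    for r in reflections:
        if r.volume_db < threshold_db:
            continue  # too quiet: skip the (possibly costly) evaluation
        values[r.name] = evaluate(r)
    return values
```

Reflections below the threshold never reach `evaluate`, which is where the saving in computational load would come from.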
  • the sound processing device may be the sound processing device according to the seventh aspect, in which the circuit calculates a total computation load for one or more reflected sounds, and when the total computation load exceeds a predetermined upper limit, calculates an evaluation value of the reflected sounds for each of the one or more reflected sounds.
  • the device of the above aspect can omit the calculation of the evaluation value of the reflected sound when the total calculation load is equal to or less than a predetermined upper limit. Therefore, the device of the above aspect can suppress the calculation load for the reflected sound.
  • the sound processing device may be any of the sound processing devices according to the first to eighth aspects, in which the circuit calculates an evaluation value of each of a plurality of reflected sounds generated as reflected sounds in a sound space, adds the computational load of each of the plurality of reflected sounds to a total computational load in descending order of evaluation value, compares the total computational load with a predetermined upper limit each time the computational load of the reflected sounds is added to the total computational load, selects the reflected sound if the total computational load obtained by adding the computational loads of the reflected sounds does not exceed the predetermined upper limit, and does not select one or more of the remaining reflected sounds after the reflected sound from among the plurality of reflected sounds if the total computational load obtained by adding the computational loads of the reflected sounds exceeds the predetermined upper limit.
  • the device of the above aspect can exclude the remaining reflected sounds from selection when the total calculation load, which is the sum of the calculation loads in sequence, exceeds a predetermined upper limit. Therefore, the device of the above aspect can limit the reflected sounds to be processed, and suppress the calculation load.
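The budgeted selection in descending order of evaluation value can be sketched as below. The tuple layout `(name, evaluation_value, load)` and the function name are assumptions made for illustration; the source does not specify how loads are measured.

```python
from typing import List, Tuple


def select_by_budget(
    candidates: List[Tuple[str, float, float]],
    upper_limit: float,
) -> Tuple[List[str], float]:
    """Select reflections in descending order of evaluation value.

    Each candidate is (name, evaluation_value, load). Selection stops at the
    first candidate whose load would push the running total past the upper
    limit; that candidate and all remaining ones are left unselected.
    """
    selected: List[str] = []
    total = 0.0
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    for name, _value, load in ordered:
        if total + load > upper_limit:
            break  # this and every lower-ranked reflection are not selected
        total += load
        selected.append(name)
    return selected, total
```

Note that a cheap, low-ranked reflection is not selected once the limit has been hit, even if it would still fit; this follows the wording that the remaining reflected sounds are not selected.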
  • the sound processing device may be any of the sound processing devices according to the first to ninth aspects, and the evaluation value may be a total value of at least one of an index value relating to a volume, a visual index value, an index value relating to an object, and an index value indicating the relationship between a direct sound corresponding to a reflected sound and the reflected sound.
  • the device of the above aspect can calculate the sum of at least one of the index values related to the volume, the visual index value, the index value related to the object, and the index value indicating the relationship between the direct sound and the reflected sound as an evaluation value. Therefore, it becomes possible to appropriately select the reflected sound to be processed based on the index value related to the volume, the visual index value, the index value related to the object, or the index value indicating the relationship between the direct sound and the reflected sound.
  • the sound processing device may be the sound processing device according to the tenth aspect, in which the circuit increases the index value relating to the volume as the volume of the sound generated from the sound source increases.
  • the device of the above aspect can calculate a higher evaluation value the louder the sound generated from the sound source. Therefore, it becomes possible to appropriately select the reflected sound to be processed based on the higher evaluation value the louder the sound generated from the sound source.
  • the sound processing device may be the sound processing device according to the tenth or eleventh aspect, in which the circuitry increases the visual index value when the sound source is within the listener's field of vision compared to when the sound source is not within the listener's field of vision.
  • the device of the above aspect can calculate a higher evaluation value when the sound source is in view than when the sound source is not in view. Therefore, it becomes possible to appropriately select the reflected sound to be processed based on the higher evaluation value when the sound source is in view than when the sound source is not in view.
  • The sound processing device may be any of the sound processing devices according to the tenth to twelfth aspects, in which the circuit increases the visual index value as the moving speed of the sound source decreases.
  • the device of the above aspect can calculate a higher evaluation value the slower the sound source's moving speed is. Therefore, it becomes possible to appropriately select the reflected sound to be processed based on the higher evaluation value the slower the sound source's moving speed is.
  • the sound processing device may be any of the sound processing devices according to the 10th to 13th aspects, in which the index value related to the object is assigned to each object in the sound space and is included in the sound space information.
  • the device of the above aspect can calculate an evaluation value based on the index value assigned to each object. Therefore, it becomes possible to appropriately select the reflected sound to be processed based on the index value assigned to each object.
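One way to combine these index values into a single evaluation value is sketched below. The source only states the direction in which each index changes (louder, in view, or slower gives a higher value); the concrete formulas, field names, and units here are hypothetical choices that satisfy those monotonic relationships.

```python
from dataclasses import dataclass


@dataclass
class ReflectionContext:
    """Hypothetical inputs for evaluating one reflected sound."""
    source_volume: float    # assumed normalized to 0.0..1.0
    source_in_view: bool    # is the sound source within the listener's view?
    source_speed: float     # assumed metres per second, >= 0
    object_index: float     # per-object value carried in the sound space information
    relation_index: float   # direct/reflected relationship index, computed elsewhere


def volume_index(ctx: ReflectionContext) -> float:
    # Louder source -> larger index (identity mapping is an assumption).
    return ctx.source_volume


def visual_index(ctx: ReflectionContext) -> float:
    # Larger when the source is in view, and larger the slower it moves.
    in_view = 1.0 if ctx.source_in_view else 0.5
    return in_view / (1.0 + ctx.source_speed)


def evaluation_value(ctx: ReflectionContext) -> float:
    # Evaluation value as the sum of the individual index values.
    return (
        volume_index(ctx)
        + visual_index(ctx)
        + ctx.object_index
        + ctx.relation_index
    )
```

Any subset of the four terms could be summed instead; the text allows the evaluation value to use at least one of them.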
  • The sound processing device may be any of the sound processing devices according to the tenth to fourteenth aspects, in which the circuit increases the index value indicating the relationship between the direct sound and the reflected sound as the angle between the direction from which the direct sound arrives and the direction from which the reflected sound arrives becomes larger.
  • the device of the above aspect can calculate a higher evaluation value the larger the angle between the direction of arrival of the direct sound and the direction of arrival of the reflected sound. Therefore, it becomes possible to appropriately select the reflected sound to be processed based on the higher evaluation value the larger the angle between the direction of arrival of the direct sound and the direction of arrival of the reflected sound.
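The angle between the two arrival directions can be computed from positions with a dot product, as in this sketch. It assumes the direct sound arrives from the source position and the reflected sound arrives from its last reflection point; the normalization to 0..1 in `angle_index` is a hypothetical choice.

```python
import math
from typing import Sequence


def arrival_angle(
    listener: Sequence[float],
    source: Sequence[float],
    reflection_point: Sequence[float],
) -> float:
    """Angle in radians between the direct and reflected arrival directions."""
    d = [s - l for s, l in zip(source, listener)]
    r = [p - l for p, l in zip(reflection_point, listener)]
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in r))
    if norm == 0.0:
        raise ValueError("listener coincides with source or reflection point")
    cos_angle = sum(a * b for a, b in zip(d, r)) / norm
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def angle_index(
    listener: Sequence[float],
    source: Sequence[float],
    reflection_point: Sequence[float],
) -> float:
    """Index in 0..1 that grows with the arrival angle (assumed scaling)."""
    return arrival_angle(listener, source, reflection_point) / math.pi
```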
  • The sound processing device may be any of the sound processing devices according to the 10th to 15th aspects, in which the circuit increases the index value indicating the relationship between the direct sound and the reflected sound as the difference grows between the distance the direct sound travels from the sound source to the listener and the distance the reflected sound travels from the sound source to the listener after being reflected.
  • the device of the above aspect can calculate a higher evaluation value the greater the difference between the distance of the direct sound and the distance of the reflected sound. Therefore, it becomes possible to appropriately select the reflected sound to be processed based on the higher evaluation value the greater the difference between the distance of the direct sound and the distance of the reflected sound.
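For a single flat wall, the path-length difference can be obtained with the image-source construction, as sketched here. The wall is assumed to be the plane x = `wall_x` with source and listener on the same side; the function names and the 343 m/s speed of sound are assumptions for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]

SPEED_OF_SOUND = 343.0  # metres per second, assumed value for air


def mirror_across_wall(point: Point, wall_x: float) -> Point:
    """Mirror a point across the plane x = wall_x (image source)."""
    x, y, z = point
    return (2.0 * wall_x - x, y, z)


def path_difference(source: Point, listener: Point, wall_x: float) -> float:
    """Reflected path length minus direct path length, in metres."""
    direct = math.dist(source, listener)
    # A first-order reflection travels as far as a straight line from the
    # mirrored source to the listener.
    reflected = math.dist(mirror_across_wall(source, wall_x), listener)
    return reflected - direct


def arrival_delay(source: Point, listener: Point, wall_x: float) -> float:
    """Time by which the reflection lags the direct sound, in seconds."""
    return path_difference(source, listener, wall_x) / SPEED_OF_SOUND
```

With the source at (3, 0, 0), the listener at (3, 8, 0), and the wall at x = 0, the direct path is 8 m and the reflected path is 10 m, so the difference is 2 m.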
  • The sound processing device may be any of the sound processing devices according to the 10th to 16th aspects, in which the circuit increases the index value indicating the relationship between the direct sound and the reflected sound the more the amplitude value of the reflected sound exceeds a temporal masking threshold. The temporal masking threshold is the threshold of the temporal masking phenomenon: a reflected sound whose amplitude value is equal to or less than this threshold is masked by the direct sound.
  • The device of the above aspect can calculate a higher evaluation value the more the amplitude value of the reflected sound exceeds the temporal masking threshold. Therefore, it becomes possible to appropriately select the reflected sound to be processed based on an evaluation value that is higher the more the amplitude value of the reflected sound exceeds the temporal masking threshold.
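The masking-based index can be sketched as the amount by which the reflection exceeds the threshold. The threshold model below (a fixed drop from the direct sound level followed by a linear decay with delay) is purely a placeholder assumption: the actual threshold data is not reproduced in this text, and working in decibel levels rather than raw amplitude values is also an assumption.

```python
def masking_threshold_db(
    direct_level_db: float,
    delay_ms: float,
    drop_db: float = 10.0,
    decay_db_per_ms: float = 0.5,
) -> float:
    """Hypothetical temporal masking threshold for a given delay.

    Real threshold data would normally come from a lookup table; this linear
    model only stands in for such a table.
    """
    return direct_level_db - drop_db - decay_db_per_ms * delay_ms


def masking_index(
    reflected_level_db: float,
    direct_level_db: float,
    delay_ms: float,
) -> float:
    """Index that grows with how far the reflection exceeds the threshold."""
    excess = reflected_level_db - masking_threshold_db(direct_level_db, delay_ms)
    # At or below the threshold the reflection is treated as masked.
    return max(0.0, excess)
```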
  • the sound processing device may be any of the sound processing devices according to the 10th to 17th aspects, in which the circuit reduces an index value for an object related to a selected reflected sound among a plurality of reflected sounds generated as reflected sounds in a sound space, calculates an evaluation value for a reflected sound that has not yet been selected, and repeatedly performs a process of selecting a reflected sound in descending order of evaluation value, and terminates the repeated process when the total computation load of one or more reflected sounds selected from the plurality of reflected sounds exceeds a predetermined upper limit.
  • the device of the above aspect can terminate the process of selecting a new reflected sound when the total computational load of one or more selected reflected sounds exceeds a predetermined upper limit. Therefore, the device of the above aspect can limit the reflected sounds to be processed, and suppress the computational load.
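The repeated selection with a shrinking per-object index can be sketched as below. The tuple layout, the additive evaluation value, and the fixed `penalty` are hypothetical; the sketch also assumes that the candidate which would push the total load past the upper limit is dropped before the loop ends.

```python
from typing import Dict, List, Tuple

# (name, object_id, base_value, load): hypothetical layout for one reflection
Candidate = Tuple[str, str, float, float]


def select_iteratively(
    reflections: List[Candidate],
    object_index: Dict[str, float],
    upper_limit: float,
    penalty: float,
) -> List[str]:
    """Repeatedly pick the best unselected reflection within a load budget."""
    index = dict(object_index)  # work on a copy; the caller's data is kept
    remaining = list(reflections)
    selected: List[str] = []
    total = 0.0
    while remaining:
        # Re-evaluate every unselected reflection with the current indices.
        best = max(remaining, key=lambda r: r[2] + index[r[1]])
        name, obj, _base, load = best
        if total + load > upper_limit:
            break  # budget reached: stop the repeated process
        selected.append(name)
        total += load
        remaining.remove(best)
        # Lower the index of the object that produced the selected reflection
        # so that reflections from other objects become more competitive.
        index[obj] -= penalty
    return selected
```

With a non-zero penalty, the selection tends to spread across different reflecting objects instead of concentrating on one wall.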
  • the sound processing device includes a circuit and a memory, and the circuit uses the memory to acquire information on the volume of the sound output from the sound source, corrects the evaluation value of the reflected sound corresponding to the sound using the volume information, and controls whether or not to select the reflected sound based on the corrected evaluation value.
  • the device of the above aspect can use information on the volume of the sound to appropriately correct the evaluation value of the reflected sound corresponding to the sound, and can appropriately control the selection of the reflected sound.
  • the sound processing device according to the 20th aspect as understood based on the present disclosure may be the sound processing device according to the 19th aspect, in which the volume has a transition.
  • the device of the above aspect can use information on the transitioning volume to appropriately correct the evaluation value of the reflected sound corresponding to the sound, and can appropriately control the selection of the reflected sound.
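The volume-based correction can be sketched as follows. The source does not state how the correction is applied, so the additive form, the `weight` parameter, and the selection threshold are all assumptions; the list of volumes stands in for a volume that changes over time.

```python
from typing import List


def corrected_value(base_value: float, volume: float, weight: float = 1.0) -> float:
    """Correct an evaluation value with the current source volume (assumed additive)."""
    return base_value + weight * volume


def select_over_time(
    base_value: float,
    volumes: List[float],
    threshold: float,
) -> List[bool]:
    """Decide per time step whether the reflection is selected."""
    return [corrected_value(base_value, v) >= threshold for v in volumes]
```

The same reflection can therefore be skipped while the source is quiet and selected once the source becomes loud.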
  • the acoustic processing method includes a step of acquiring sound space information including information on a sound source in the sound space, information on an object in the sound space, and information on the position of a listener in the sound space, and a step of calculating an evaluation value of a reflected sound generated in response to a sound generated from the sound source using the sound space information.
  • the method of the above aspect can achieve the same effect as the sound processing device described in the first aspect.
  • the program according to the 22nd aspect as understood based on this disclosure is a program for causing a computer to execute the acoustic processing method according to the 21st aspect.
  • the program of the above aspect can achieve the same effect as the acoustic processing method of the 21st aspect by using a computer.
  • the sound processing device, encoding device, decoding device, and stereophonic reproduction system according to the present disclosure will be described in detail below with reference to the drawings.
  • the stereophonic reproduction system can also be expressed as an audio signal reproduction system.
  • FIG. 2 is a diagram showing an example of a stereophonic sound reproduction system. Specifically, FIG. 2 shows a stereophonic sound reproduction system 1000, which is an example of a system to which the audio processing or decoding processing of the present disclosure can be applied. Stereophonic sound is also expressed as immersive audio.
  • The stereophonic sound reproduction system 1000 includes an audio signal processing device 1001 and an audio presentation device 1002.
  • The audio signal processing device 1001, also referred to as an acoustic processing device, applies acoustic processing to an audio signal emitted by a virtual sound source to generate a processed audio signal that is presented to a listener.
  • the audio signal is not limited to a voice, but may be any audible sound.
  • Acoustic processing is, for example, signal processing applied to an audio signal in order to reproduce one or more effects that a sound undergoes between the time it is generated by the sound source and the time it reaches the listener.
  • the audio signal processing device 1001 performs acoustic processing based on spatial information that describes the factors that cause the above-mentioned effects.
  • the spatial information includes, for example, information indicating the positions of the sound source, the listener, and surrounding objects, information indicating the shape of the space, and parameters related to sound propagation.
  • the audio signal processing device 1001 is, for example, a PC (Personal Computer), a smartphone, a tablet, or a game console.
  • the signal after acoustic processing is presented to the listener by the audio presentation device 1002.
  • the audio presentation device 1002 is connected to the audio signal processing device 1001 via wireless or wired communication.
  • the audio signal after acoustic processing generated by the audio signal processing device 1001 is transmitted to the audio presentation device 1002 via wireless or wired communication.
  • When the audio presentation device 1002 is composed of multiple devices, such as a device for the right ear and a device for the left ear, the multiple devices present sound in synchronization through communication between the multiple devices or communication between each of the multiple devices and the audio signal processing device 1001.
  • The audio presentation device 1002 is, for example, headphones, earphones, or a head-mounted display worn on the listener's head, or a surround speaker composed of multiple fixed speakers.
  • the stereophonic sound reproduction system 1000 may be used in combination with an image presentation device or a stereoscopic video presentation device that provides a visual ER experience including AR/VR.
  • the space handled by the spatial information is a virtual space, and the positions of the sound source, listener, and object in the space are the virtual positions of the virtual sound source, virtual listener, and virtual object in the virtual space.
  • the space may also be expressed as a sound space.
  • the spatial information may also be expressed as sound space information.
  • Although FIG. 2 shows an example of a system configuration in which the audio signal processing device 1001 and the audio presentation device 1002 are separate devices, the stereophonic sound reproduction system 1000 to which the audio processing method or decoding method of the present disclosure can be applied is not limited to the configuration shown in FIG. 2.
  • the audio signal processing device 1001 may be included in the audio presentation device 1002, which may perform both audio processing and sound presentation.
  • the audio signal processing device 1001 and the audio presentation device 1002 may share the responsibility of performing the acoustic processing described in this disclosure.
  • a server connected to the audio signal processing device 1001 or the audio presentation device 1002 via a network may perform part or all of the acoustic processing described in this disclosure.
  • the audio signal processing device 1001 may also decode a bit stream generated by encoding at least a portion of the data of the audio signal and the spatial information used in the audio processing, and perform the audio processing. Therefore, the audio signal processing device 1001 may be referred to as a decoding device.
  • FIG. 3A is a block diagram showing an example of the configuration of an encoding device. Specifically, FIG. 3A shows the configuration of an encoding device 1100, which is an example of the encoding device of the present disclosure.
  • the input data 1101 is data to be encoded that includes spatial information and/or an audio signal and is input to the encoder 1102. Details of the spatial information will be explained later.
  • the encoder 1102 encodes the input data 1101 to generate encoded data 1103.
  • the encoded data 1103 is, for example, a bit stream generated by the encoding process.
  • Memory 1104 stores the encoded data 1103.
  • Memory 1104 may be, for example, a hard disk or a solid-state drive (SSD), or may be other memory.
  • encoded data 1103 may be data other than a bit stream.
  • encoding device 1100 may store converted data generated by converting a bit stream into a predetermined data format in memory 1104.
  • the converted data may be, for example, a file or multiplexed stream corresponding to one or more bit streams.
  • the file is a file having a file format such as ISOBMFF (ISO Base Media File Format).
  • the encoded data 1103 may also be in the form of multiple packets generated by dividing the bit stream or file.
  • the bit stream generated by the encoder 1102 may be converted into data different from the bit stream.
  • the encoding device 1100 may include a conversion unit (not shown) and perform the conversion process, or the conversion process may be performed by a CPU (Central Processing Unit), which is an example of a processor described below.
  • Fig. 3B is a block diagram showing an example of the configuration of a decoding device. Specifically, Fig. 3B shows the configuration of a decoding device 1110 which is an example of the decoding device of the present disclosure.
  • the memory 1114 stores, for example, the same data as the encoded data 1103 generated by the encoding device 1100.
  • the stored data is read from the memory 1114 and input to the decoder 1112 as input data 1113.
  • the input data 1113 is, for example, a bit stream to be decoded.
  • the memory 1114 may be, for example, a hard disk or SSD, or may be some other memory.
  • the decoding device 1110 may convert the data read from the memory 1114 and input the converted data to the decoder 1112 as the input data 1113 instead of inputting the data directly to the decoder 1112.
  • the data before conversion may be, for example, multiplexed data including one or more bit streams.
  • the multiplexed data may be, for example, a file having a file format such as ISOBMFF.
  • the data before conversion may also be a plurality of packets generated by dividing the bit stream or file. Data different from the bit stream may be read from memory 1114 and converted into a bit stream.
  • the decoding device 1110 may include a conversion unit (not shown) and the conversion process may be performed by the conversion unit, or the conversion process may be performed by a CPU, which is an example of a processor described below.
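  • As an illustration of the packet form mentioned above, the following is a minimal sketch of dividing a bit stream into packets and converting the packets back into a bit stream. The packet layout (a sequence number paired with a payload) and the function names `split_into_packets` and `reassemble` are assumptions made for this example; the disclosure does not specify a packet format.

```python
from typing import List, Tuple

# Hypothetical packet layout: (sequence number, payload bytes).
Packet = Tuple[int, bytes]


def split_into_packets(bitstream: bytes, payload_size: int) -> List[Packet]:
    """Divide a bit stream into numbered packets of at most payload_size bytes."""
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    return [
        (seq, bitstream[start:start + payload_size])
        for seq, start in enumerate(range(0, len(bitstream), payload_size))
    ]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the bit stream from packets, tolerating out-of-order arrival."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(payload for _, payload in ordered)
```

  • In this sketch the sequence number lets the conversion step restore the original byte order before the data is passed to the decoder as input data.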
  • the decoder 1112 decodes the input data 1113 to generate an audio signal 1111 representing the audio to be presented to the listener.
  • FIG. 3C is a block diagram showing another example of the configuration of an encoding device. Specifically, Fig. 3C shows the configuration of an encoding device 1120, which is another example of the encoding device of the present disclosure. In Fig. 3C, the same components as those in Fig. 3A are given the same reference numerals as those in Fig. 3A, and descriptions of these components are omitted.
  • the encoding device 1100 stores encoded data 1103 in a memory 1104.
  • the encoding device 1120 differs from the encoding device 1100 in that it includes a transmission unit 1121 that transmits the encoded data 1103 to the outside.
  • the transmitting unit 1121 transmits a transmission signal 1122 generated based on the encoded data 1103 or data converted from the encoded data 1103 into another data format to another device or server.
  • the data used to generate the transmission signal 1122 is, for example, a bit stream, multiplexed data, a file, or a packet, as described in the encoding device 1100.
  • Fig. 3D is a block diagram showing another example of the configuration of a decoding device. Specifically, Fig. 3D shows the configuration of a decoding device 1130, which is another example of the decoding device of the present disclosure. In Fig. 3D, the same components as those in Fig. 3B are assigned the same reference numerals as those in Fig. 3B, and descriptions of these components are omitted.
  • the decoding device 1110 reads the input data 1113 from the memory 1114.
  • the decoding device 1130 differs from the decoding device 1110 in that it includes a receiving unit 1131 that receives the input data 1113 from outside.
  • the receiving unit 1131 receives the received signal 1132, acquires the received data, and outputs the input data 1113 that is input to the decoder 1112.
  • the received data may be the same as the input data 1113 that is input to the decoder 1112, or may be data in a different data format from the input data 1113.
  • the receiving unit 1131 may convert the received data into the input data 1113.
  • a conversion unit or a CPU (not shown) of the decoding device 1130 may convert the received data into the input data 1113.
  • the received data is, for example, a bit stream, multiplexed data, a file, or a packet described in the encoding device 1120.
  • Fig. 4A is a block diagram showing an example of the configuration of a decoder. Specifically, Fig. 4A shows the configuration of a decoder 1200, which is an example of the decoder 1112 in Fig. 3B or 3D.
  • the input data 1113 is an encoded bitstream and includes encoded audio data, which is an encoded audio signal, and metadata used in the acoustic processing.
  • the spatial information management unit 1201 acquires metadata contained in the input data 1113 and analyzes the metadata.
  • the metadata includes information describing elements that act on sounds arranged in a sound space.
  • the spatial information management unit 1201 manages the spatial information used for acoustic processing obtained by analyzing the metadata, and provides the spatial information to the rendering unit 1203.
  • the information used in the acoustic processing is expressed as spatial information, but other expressions may be used.
  • the information used in the acoustic processing may be expressed as sound spatial information or as scene information.
  • the spatial information input to the rendering unit 1203 may be information expressed as a spatial state, a sound spatial state, a scene state, or the like.
  • the spatial information may also be managed for each sound space or for each scene. For example, when multiple different rooms are each represented as a virtual space, the rooms may be managed as separate scenes. Furthermore, even for the same space, the spatial information may be managed as different scenes depending on the situation being represented.
  • multiple pieces of spatial information may be managed for multiple sound spaces or multiple scenes.
  • an identifier that identifies each piece of the multiple pieces of spatial information may be assigned to the spatial information.
  • the spatial information data may be included in a bitstream, which is an example of input data 1113.
  • the bitstream may include an identifier for the spatial information, and the spatial information data may be obtained from an information source other than the bitstream.
  • the identifier for the spatial information may be used in rendering to obtain the spatial information data stored in a memory within the device or an external server as input data 1113.
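  • The identifier-based acquisition described above can be pictured with the following minimal sketch. The class `SpatialInfoStore`, its methods, and the rule that spatial information carried in the bitstream takes precedence over a stored copy are all assumptions introduced for illustration; a real system might instead query an external server.

```python
from typing import Dict, Optional


class SpatialInfoStore:
    """Hypothetical lookup table of spatial information keyed by identifier."""

    def __init__(self) -> None:
        self._by_id: Dict[str, dict] = {}

    def register(self, identifier: str, spatial_info: dict) -> None:
        """Store spatial information obtained from a source other than the bitstream."""
        self._by_id[identifier] = spatial_info

    def resolve(self, identifier: str, inline: Optional[dict] = None) -> dict:
        """Return spatial information for an identifier found in the bitstream."""
        # Assumption: data carried in the bitstream itself wins over a stored copy.
        if inline is not None:
            return inline
        if identifier not in self._by_id:
            raise KeyError(f"no spatial information for identifier {identifier!r}")
        return self._by_id[identifier]
```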
  • the information managed by the spatial information management unit 1201 is not limited to information contained in the bitstream.
  • the input data 1113 may include data that is not included in the bitstream and indicates the characteristics and structure of the space obtained from software or a server that provides VR or AR.
  • the input data 1113 may also include data indicating the characteristics and position of a listener or an object.
  • the input data 1113 may also include information regarding the listener's position acquired by a sensor provided in a terminal including a decoding device (1110, 1130), or may include information indicating the terminal's position estimated based on information acquired by the sensor.
  • the spatial information management unit 1201 may communicate with an external system or server to acquire spatial information and listener positions.
  • the spatial information management unit 1201 may also acquire clock synchronization information from an external system and execute processing to synchronize with the clock of the rendering unit 1203.
  • the space in the above description may be a virtually formed space, i.e., a VR space, or may be a real space or a virtual space corresponding to a real space, i.e., an AR space or an MR space.
  • the virtual space may also be expressed as a sound field or sound space.
  • the information indicating a position in the above description may be information such as coordinate values indicating a position within a space, information indicating a relative position with respect to a predetermined reference position, or information indicating the movement or acceleration of a position within a space.
  • the audio data decoder 1202 decodes the encoded audio data contained in the input data 1113 to obtain an audio signal.
  • the encoded audio data acquired by the stereophonic sound reproduction system 1000 is a bitstream encoded in a specific format, such as MPEG-H 3D Audio (ISO/IEC 23008-3).
  • MPEG-H 3D Audio is merely one example of an encoding method that can be used to generate the encoded audio data contained in the bitstream.
  • the encoded audio data may be a bitstream encoded using another encoding method.
  • the encoding method may be a lossy codec such as MP3 (MPEG-1 Audio Layer-3), AAC (Advanced Audio Coding), WMA (Windows Media Audio), AC3 (Audio Codec-3) or Vorbis.
  • the encoding method may be a lossless codec such as ALAC (Apple Lossless Audio Codec) or FLAC (Free Lossless Audio Codec).
  • PCM data may be a type of encoded audio data.
  • the decoding process may be, for example, a process of converting an N-bit binary number into a number format (e.g., floating-point format) that can be processed by the rendering unit 1203 when the number of quantization bits of the PCM data is N.
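  • A minimal sketch of such a conversion is given below. It assumes that the PCM samples are signed two's-complement integers that have already been unpacked into Python integers, and that the rendering side expects floating-point values in the range [-1.0, 1.0); the function name `pcm_to_float` is hypothetical.

```python
from typing import Iterable, List


def pcm_to_float(samples: Iterable[int], n_bits: int) -> List[float]:
    """Map signed N-bit PCM integers to floats in [-1.0, 1.0)."""
    if n_bits < 2:
        raise ValueError("n_bits must be at least 2")
    full_scale = 1 << (n_bits - 1)      # e.g. 32768 for 16-bit PCM
    lowest, highest = -full_scale, full_scale - 1
    converted = []
    for sample in samples:
        if not lowest <= sample <= highest:
            raise ValueError(f"sample {sample} does not fit in {n_bits} bits")
        converted.append(sample / full_scale)
    return converted
```

  • For 16-bit data, for example, the integer 16384 maps to 0.5 and the most negative value -32768 maps to -1.0.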
  • the rendering unit 1203 acquires the audio signal and spatial information, performs acoustic processing on the audio signal using the spatial information, and outputs the audio signal after acoustic processing (audio signal 1111).
  • the spatial information management unit 1201 reads the metadata of the input signal, detects rendering items such as objects and sounds defined in the spatial information, and transmits them to the rendering unit 1203. After rendering begins, the spatial information management unit 1201 tracks changes over time in the spatial information and the listener's position, and updates and manages the spatial information. It then transmits the updated spatial information to the rendering unit 1203.
  • the rendering unit 1203 generates and outputs an audio signal to which acoustic processing has been added based on the audio signal contained in the input data 1113 and the spatial information received from the spatial information management unit 1201.
  • the spatial information update process and the audio signal output process with added acoustic processing may be executed in the same thread. Furthermore, the spatial information management unit 1201 and the rendering unit 1203 may each allocate processing to an independent thread. When the spatial information management unit 1201 and the rendering unit 1203 execute the spatial information update process and the audio signal output process with added acoustic processing in different threads, the thread startup frequency may be set individually, or the processes may be executed in parallel.
  • the allocation of computational resources to the spatial information management unit 1201 may be limited.
  • Because updating spatial information is a low-frequency process compared with the output processing of audio signals (for example, a process such as updating the direction of the listener's face), it does not necessarily have to be performed as promptly as the output processing of audio signals. Therefore, even if the allocation of computational resources is limited, there is no significant impact on acoustic quality.
  • the spatial information may be updated periodically at preset times or intervals, or when preset conditions are met.
  • the spatial information may also be updated manually by the listener or the sound space manager, or may be updated in response to a change in an external system.
  • the spatial information may be updated when a listener operates a controller to instantly warp the position of his/her avatar or instantly advance or reverse the time.
  • the spatial information may be updated when an administrator of the virtual space suddenly changes the environment of the venue.
  • the thread for updating the spatial information managed by the spatial information management unit 1201 may be started as a one-off interrupt process in addition to being started periodically.
  • the update process of the spatial information managed by the spatial information management unit 1201 is performed in an information update thread.
  • the role of the information update thread is, for example, to update the position and orientation of the listener's avatar placed in the virtual space based on the position and orientation of the VR goggles worn by the listener, or to update the position of objects moving in the virtual space.
  • Such processing is handled within a processing thread that runs at a relatively low frequency of around a few tens of Hz.
  • processing for updating information indicating the characteristics of the direct sound may be performed.
  • the reason for this is that the characteristics of the direct sound change less frequently than audio processing frames for audio output occur. This makes it possible to relatively reduce the computational load of the processing. Also, updating information at an unnecessarily high frequency runs the risk of generating impulsive noise. By updating information infrequently, it is possible to avoid such risks.
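  • The relationship between a low-rate information update and per-frame audio output can be sketched as follows. This is a single-threaded stand-in for the two threads, written so that its behavior is deterministic; the names `SpatialInfoManager`, `SpatialSnapshot`, and `render_frames`, and the rule of one update every `frames_per_update` audio frames, are assumptions for illustration only. The lock shows how a snapshot could be shared safely if the update were moved to its own thread.

```python
import threading
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class SpatialSnapshot:
    listener_position: Vec3
    version: int  # incremented on every update


class SpatialInfoManager:
    """Holds the latest spatial information behind a lock."""

    def __init__(self, position: Vec3) -> None:
        self._lock = threading.Lock()
        self._snapshot = SpatialSnapshot(position, 0)

    def update_listener(self, position: Vec3) -> None:
        with self._lock:
            self._snapshot = SpatialSnapshot(position, self._snapshot.version + 1)

    def snapshot(self) -> SpatialSnapshot:
        with self._lock:
            return self._snapshot


def render_frames(manager: SpatialInfoManager, n_frames: int,
                  frames_per_update: int, positions: Iterable[Vec3]) -> List[int]:
    """Return the spatial-information version seen by each audio frame."""
    pending = iter(positions)
    versions = []
    for frame in range(n_frames):
        # The update runs only once every frames_per_update audio frames.
        if frame > 0 and frame % frames_per_update == 0:
            position = next(pending, None)
            if position is not None:
                manager.update_listener(position)
        versions.append(manager.snapshot().version)
    return versions
```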
  • FIG. 4B is a block diagram showing another example of the configuration of a decoder. Specifically, FIG. 4B shows the configuration of a decoder 1210, which is another example of the decoder 1112 in FIG. 3B or 3D.
  • FIG. 4B differs from FIG. 4A in that the input data 1113 includes an unencoded audio signal rather than encoded audio data.
  • the input data 1113 includes a bitstream including metadata and an audio signal.
  • the spatial information management unit 1211 is the same as the spatial information management unit 1201 in FIG. 4A, so a description thereof will be omitted.
  • the rendering unit 1213 is the same as the rendering unit 1203 in FIG. 4A, so a description thereof will be omitted.
  • decoders 1112, 1200, and 1210 may be expressed as audio processing units that perform audio processing.
  • the decoding devices 1110 and 1130 may be the audio signal processing device 1001, and may be expressed as audio processing devices.
  • FIG. 5 is a diagram showing an example of a physical configuration of an audio signal processing device 1001.
  • the audio signal processing device 1001 in Fig. 5 may be the decoding device 1110 in Fig. 3B or the decoding device 1130 in Fig. 3D.
  • the multiple components shown in Fig. 3B or Fig. 3D may be implemented by the multiple components shown in Fig. 5.
  • a part of the configuration described here may be provided in the audio presentation device 1002.
  • the audio signal processing device 1001 in FIG. 5 includes a processor 1402, a memory 1404, a communication IF (Interface) 1403, a sensor 1405, and a speaker 1401.
  • the processor 1402 is, for example, a CPU, a DSP (Digital Signal Processor), or a GPU (Graphics Processing Unit).
  • the CPU, DSP, or GPU may execute a program stored in the memory 1404 to perform the acoustic processing or decoding processing of the present disclosure.
  • the processor 1402 is, for example, a circuit that performs information processing.
  • the processor 1402 may be a dedicated circuit that performs signal processing on audio signals, including the acoustic processing of the present disclosure.
  • the memory 1404 is composed of, for example, a RAM (Random Access Memory) or a ROM (Read Only Memory).
  • the memory 1404 may include a magnetic recording medium such as a hard disk or a semiconductor memory such as an SSD.
  • the memory 1404 may also be an internal memory incorporated in the CPU or GPU.
  • the memory 1404 may also store spatial information managed by the spatial information management units (1201, 1211), and may also store threshold data, which will be described later.
  • the communication IF 1403 is a communication module compatible with a communication method such as Bluetooth (registered trademark) or WiGig (registered trademark).
  • the audio signal processing device 1001 communicates with another communication device via the communication IF 1403, for example, to obtain a bitstream to be decoded.
  • the obtained bitstream is stored in the memory 1404, for example.
  • the communication IF 1403 is composed of, for example, a signal processing circuit and an antenna corresponding to the communication method.
  • the communication method is not limited to Bluetooth (registered trademark) and WiGig (registered trademark), but may be LTE (Long Term Evolution), NR (New Radio), Wi-Fi (registered trademark), etc.
  • the communication method is not limited to the wireless communication method described above.
  • the communication method may be a wired communication method such as Ethernet (registered trademark), USB (Universal Serial Bus), or HDMI (registered trademark) (High-Definition Multimedia Interface).
  • Sensor 1405 performs sensing to estimate the position and orientation of the listener. Specifically, sensor 1405 estimates the position and/or orientation of the listener based on one or more detection results of the position, orientation, movement, velocity, angular velocity, acceleration, etc. of a part or the whole of the listener's body, and generates position and/or orientation information indicating the position and/or orientation of the listener.
  • the part of the body may be the listener's head, etc.
  • the position and/or orientation information may be information indicating the position and/or orientation of the listener in real space, or may be information indicating the displacement of the position and/or orientation of the listener relative to the position and/or orientation of the listener at a specific time.
  • the position and/or orientation information may also be information indicating the relative position and/or orientation with respect to the stereophonic sound reproduction system 1000 or an external device equipped with the sensor 1405.
  • the sensor 1405 is, for example, an imaging device such as a camera or a ranging device such as a LiDAR (Laser Imaging Detection and Ranging).
  • the sensor 1405 may capture the movement of the listener's head and detect the movement of the listener's head by processing the captured image.
  • a device that performs position estimation using wireless signals of any frequency band, such as millimeter waves, may be used as the sensor 1405.
  • the audio signal processing device 1001 may also acquire position information from an external device equipped with a sensor 1405 via the communication IF 1403.
  • the audio signal processing device 1001 may not include the sensor 1405.
  • the external device is, for example, the audio presentation device 1002 described in FIG. 2, or a stereoscopic image playback device worn on the listener's head.
  • the sensor 1405 is configured by combining various sensors such as a gyro sensor and an acceleration sensor.
  • the sensor 1405 may detect, for example, as the speed of movement of the listener's head, the angular velocity of rotation about at least one of three mutually orthogonal axes in the sound space, or the acceleration of displacement along at least one of those three axes.
  • the sensor 1405 may detect, for example, as the amount of movement of the listener's head, the amount of rotation about at least one of three mutually orthogonal axes in the sound space, or the amount of displacement along at least one of those three axes. Specifically, the sensor 1405 detects the 6DoF position (x, y, z) and angle (yaw, pitch, roll) as the listener's position and orientation.
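  • The 6DoF value described above, and the displacement relative to the listener's position and orientation at a specific time, can be represented as in the following minimal sketch. The type `Pose6DoF`, the use of degrees, and the component-wise angle difference are assumptions for illustration; subtracting Euler angles component by component is a simplification and is not a general relative rotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose6DoF:
    x: float
    y: float
    z: float
    yaw: float    # degrees (assumed unit)
    pitch: float  # degrees
    roll: float   # degrees


def wrap_degrees(angle: float) -> float:
    """Wrap an angle to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def displacement(reference: Pose6DoF, current: Pose6DoF) -> Pose6DoF:
    """Per-component change of the current pose relative to a reference pose."""
    return Pose6DoF(
        current.x - reference.x,
        current.y - reference.y,
        current.z - reference.z,
        wrap_degrees(current.yaw - reference.yaw),
        wrap_degrees(current.pitch - reference.pitch),
        wrap_degrees(current.roll - reference.roll),
    )
```

  • Wrapping the angle differences keeps a turn from 170 degrees to -170 degrees from being reported as a 340 degree rotation.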
  • the sensor 1405 is configured by combining various sensors used for detecting movement, such as a gyro sensor and an acceleration sensor.
  • the sensor 1405 may be realized by a camera for detecting the position of the listener or a GPS (Global Positioning System) receiver, etc. Position information obtained by performing self-position estimation using LiDAR or the like as the sensor 1405 may be used. For example, when the stereophonic sound reproduction system 1000 is realized by a smartphone, the sensor 1405 is built into the smartphone.
  • the sensor 1405 may also include a temperature sensor such as a thermocouple that detects the temperature of the audio signal processing device 1001.
  • the sensor 1405 may also include a sensor that detects the remaining charge of a battery provided in the audio signal processing device 1001 or a battery connected to the audio signal processing device 1001.
  • Speaker 1401 has, for example, a diaphragm, a drive mechanism such as a magnet or voice coil, and an amplifier, and presents the audio signal after acoustic processing as sound to the listener. Speaker 1401 operates the drive mechanism in response to the audio signal (more specifically, a waveform signal indicating the waveform of the sound) amplified via the amplifier, and causes the drive mechanism to vibrate the diaphragm. In this way, the diaphragm vibrating in response to the audio signal generates sound waves, which propagate through the air and are transmitted to the listener's ears, causing the listener to perceive the sound.
  • Although the audio signal processing device 1001 described here includes a speaker 1401 and presents the audio signal after acoustic processing via the speaker 1401, the means for presenting the audio signal is not limited to this configuration.
  • the audio signal after acoustic processing may be output to an external audio presentation device 1002 connected via a communication module. Communication via the communication module may be wired or wireless.
  • the audio signal processing device 1001 may have a terminal for outputting an analog audio signal, and an audio signal may be presented from the earphone or the like by connecting a cable for earphones or the like to the terminal.
  • the audio presentation device 1002 may be headphones, earphones, a head-mounted display, a neck speaker, a wearable speaker, or the like that are worn on the listener's head or part of the body.
  • the audio presentation device 1002 may be a surround speaker composed of multiple fixed speakers, or the like. The audio presentation device 1002 may then reproduce the audio signal.
  • Fig. 6 is a diagram showing an example of a physical configuration of an encoding device.
  • the encoding device 1500 in Fig. 6 may be the encoding device 1100 in Fig. 3A or the encoding device 1120 in Fig. 3C, and multiple components shown in Fig. 3A or 3C may be implemented by multiple components shown in Fig. 6.
  • the encoding device 1500 in FIG. 6 includes a processor 1501, a memory 1503, and a communication IF 1502.
  • the processor 1501 is, for example, a CPU, a DSP, or a GPU.
  • the CPU, DSP, or GPU may execute a program stored in the memory 1503 to perform the encoding process of the present disclosure.
  • the processor 1501 is, for example, a circuit that performs information processing.
  • the processor 1501 may be a dedicated circuit that performs signal processing on an audio signal, including the encoding process of the present disclosure.
  • Memory 1503 is composed of, for example, RAM or ROM.
  • Memory 1503 may include a magnetic recording medium such as a hard disk or a semiconductor memory such as an SSD.
  • Memory 1503 may also be an internal memory built into the CPU or GPU.
  • the communication IF 1502 is a communication module that supports communication methods such as Bluetooth (registered trademark) or WiGig (registered trademark).
  • the encoding device 1500 communicates with other communication devices via the communication IF 1502, for example, and transmits an encoded bitstream.
  • the communication IF 1502 is composed of, for example, a signal processing circuit and an antenna corresponding to the communication method.
  • the communication method is not limited to Bluetooth (registered trademark) and WiGig (registered trademark), but may be LTE, NR, Wi-Fi (registered trademark), etc.
  • the communication method is not limited to a wireless communication method.
  • the communication method may be a wired communication method such as Ethernet (registered trademark), USB, or HDMI (registered trademark).
  • Fig. 7 is a block diagram showing an example of the configuration of a rendering unit. Specifically, Fig. 7 shows an example of the detailed configuration of a rendering unit 1300 corresponding to the rendering units 1203 and 1213 in Figs. 4A and 4B.
  • the rendering unit 1300 is composed of an analysis unit 1301, a selection unit 1302, and a synthesis unit 1303, and applies acoustic processing to the sound data contained in the input signal and outputs it.
  • the input signal is composed of, for example, spatial information, sensor information, and sound data.
  • the input signal may include a bitstream composed of sound data and metadata (control information), in which case the metadata may include spatial information.
  • Spatial information is information about the sound space (three-dimensional sound field) created by the stereophonic sound reproduction system 1000, and is composed of information about the objects contained in the sound space and information about the listener.
  • Objects include sound source objects that emit sound and are sound sources, and non-sound-emitting objects that do not emit sound. Sound source objects can also be simply expressed as sound sources.
  • a non-sound-emitting object acts as an obstacle object that reflects the sound emitted by a sound source object, but a sound source object may also act as an obstacle object that reflects the sound emitted by another sound source object. Obstacle objects may also be referred to as reflective objects.
  • Information that is commonly assigned to sound source objects and non-sound-emitting objects includes position information, shape information, and the rate at which the sound volume decays when the object reflects sound.
  • the position information is expressed by coordinate values on three axes, for example the X-axis, Y-axis, and Z-axis, in Euclidean space, but it does not necessarily have to be three-dimensional information.
  • the position information may be two-dimensional information expressed by coordinate values on two axes, the X-axis and the Y-axis.
  • the position information of an object is determined by the representative position of a shape expressed by a mesh or voxels.
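  • The disclosure does not say how the representative position of a mesh is chosen. As one possible illustration only, the following sketch takes the arithmetic mean of the mesh vertices; the function name `vertex_centroid` and the choice of the vertex centroid (rather than, say, a bounding-box center) are assumptions.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def vertex_centroid(vertices: Sequence[Vec3]) -> Vec3:
    """Mean of the mesh vertices, used here as a stand-in representative position."""
    if not vertices:
        raise ValueError("a mesh needs at least one vertex")
    count = len(vertices)
    return (
        sum(v[0] for v in vertices) / count,
        sum(v[1] for v in vertices) / count,
        sum(v[2] for v in vertices) / count,
    )
```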
  • the shape information may also include information about the surface material.
  • the attenuation rate may be expressed as a real number between 0 and 1, or may be expressed as a negative decibel value.
  • Normally, sound volume is not amplified by reflection, so the attenuation rate is set to a negative decibel value. However, to create, for example, the eerie feeling of an unreal space, an attenuation rate of 1 or more, i.e., a positive decibel value, may be set.
  • the attenuation rate may be set for each of multiple frequency bands, with the value for each band set independently.
  • if the shape information includes information about the surface material, an attenuation rate value corresponding to that material may be used.
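  • The two representations of the attenuation rate and its per-band application can be sketched as follows. The sketch assumes that the rate is an amplitude ratio, so that the decibel value is 20·log10(ratio); the function names and the band labels are hypothetical.

```python
import math
from typing import Dict


def ratio_to_db(ratio: float) -> float:
    """Convert an amplitude ratio to decibels (assumed 20*log10 convention)."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if ratio == 0:
        return -math.inf  # total absorption
    return 20.0 * math.log10(ratio)


def db_to_ratio(db: float) -> float:
    """Convert decibels back to an amplitude ratio."""
    return 10.0 ** (db / 20.0)


def reflect(band_amplitudes: Dict[str, float],
            attenuation_db: Dict[str, float]) -> Dict[str, float]:
    """Apply an independent attenuation rate to each frequency band."""
    return {
        band: amplitude * db_to_ratio(attenuation_db[band])
        for band, amplitude in band_amplitudes.items()
    }
```

  • Under this convention a ratio between 0 and 1 corresponds to a negative decibel value, and a ratio above 1 corresponds to a positive decibel value.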
  • the spatial information may also include information indicating whether the object belongs to a living thing, and information indicating whether the object is a moving object. If the object is a moving object, the position indicated by the position information may move over time. In this case, information on the changed position or the amount of change is transmitted to the rendering unit 1300.
  • Information about sound source objects includes the information commonly assigned to sound source objects and non-sound-emitting objects, as well as sound data and information necessary for radiating the sound data into the sound space.
  • Sound data is data that indicates information about the frequency and strength of sound, and is data that expresses the sound perceived by the listener.
  • the sound data is typically a PCM signal, but may also be data compressed using an encoding method such as MP3.
  • if the sound data is compressed, the rendering unit 1300 may include a decoding unit (not shown) that decodes it.
  • alternatively, the compressed sound data may be decoded by the audio data decoder 1202.
  • One piece of sound data may be set for one sound source object, or multiple pieces of sound data may be set for one sound source object. Furthermore, identification information for identifying each piece of sound data may be assigned to the sound data, and the information relating to the sound source object may include the identification information for the sound data.
  • the information required to radiate sound data into a sound space may include, for example, information on the reference volume used as a reference for playing back the sound data, information on the position of the sound source object, and information on the orientation of the sound source object (i.e., information on the directionality of the sound emitted by the sound source object).
  • the reference volume information may be, for example, the effective amplitude value of the sound data at the sound source position when the sound data is emitted into the sound space, and may be expressed as a floating-point decibel (dB) value.
  • For example, a reference volume of 0 dB may indicate that sound is emitted into the sound space from the position indicated by the information regarding the position of the sound source object at the same volume as the signal level indicated by the sound data, without increasing or decreasing the volume.
  • If the reference volume is -6 dB, it may indicate that sound is emitted into the sound space from the position indicated by the information regarding the position of the sound source object, with the volume of the signal level indicated by the sound data reduced to approximately half.
  • the reference volume information may be added to each sound data, or may be added to multiple sound data collectively.
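  • The effect of a reference volume expressed in decibels can be checked with the following minimal sketch, which assumes an amplitude convention (gain = 10^(dB/20)) and uses the root-mean-square value as the effective amplitude. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Effective (root-mean-square) amplitude of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def apply_reference_volume(samples: Sequence[float], reference_db: float) -> List[float]:
    """Scale sound data by a reference volume given in decibels."""
    gain = 10.0 ** (reference_db / 20.0)  # assumed amplitude convention
    return [s * gain for s in samples]
```

  • With this convention, a reference volume of -6 dB gives a gain of about 0.501, which matches the description of the volume being reduced to approximately half.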
  • the information required to radiate sound data into a sound space may include volume information, for example, information indicating time-series fluctuations in the volume of the sound source.
  • For some sound sources, the volume transitions intermittently over short periods of time; in other words, sound and silence alternate. If the sound space is a concert hall and the sound source is a performer, the volume is maintained for a certain length of time. If the sound space is a battlefield and the sound source is an explosive, the volume of the explosion sound will increase for a moment and then remain silent or low.
  • the information on the volume of the sound source may include not only information on the loudness of the sound, but also information on the transition of the loudness of the sound. Such information may be used as information indicating the nature of the sound data.
  • the transition information may be represented by data showing frequency characteristics in a time series.
  • the transition information may be represented by data showing the duration of a sound section.
  • the transition information may be represented by data showing a time series of the duration of a sound section and the duration of a silent section.
  • the transition information may be represented by data listing, in a time series, multiple pairs of durations during which the amplitude of a sound signal can be considered steady (approximately constant) and the amplitude values of the signal during those periods.
  • the transition information may be represented by data on the duration for which the frequency characteristics of the sound signal can be considered stationary.
  • the transition information may be represented by data that lists in chronological order multiple sets of durations for which the frequency characteristics of the sound signal can be considered stationary and the frequency characteristics during those periods.
  • the transition information may be represented, for example, in the form of data that shows the outline of a spectrogram.
  • the volume used as the standard for the above frequency characteristics may be the reference volume.
  • Information on the reference volume and information indicating the properties of the sound data may be used in the process of calculating the volume of direct sound or reflected sound to be perceived by the listener, or may be used in the process of selecting whether or not to perceive it by the listener. Other examples of volume information and methods of using it will be described later.
  • orientation information is typically expressed using yaw, pitch, and roll.
  • the roll rotation may be omitted, and the orientation information of the sound source object may be expressed using azimuth (yaw) and elevation (pitch).
  • the orientation information of the sound source object may change over time, and if it does change, it is transmitted to the rendering unit 1300.
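  • When roll is omitted, the orientation can be reduced to a unit direction vector. The sketch below is illustrative only: the axis convention (x forward, y left, z up, azimuth measured counterclockwise from the x axis) is an assumption that the text does not specify, and the function name is hypothetical.

```python
import math


def direction_from_azimuth_elevation(azimuth_deg: float, elevation_deg: float):
    """Return a unit vector for the given azimuth (yaw) and elevation (pitch).

    Assumed convention: x forward, y left, z up.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


forward = direction_from_azimuth_elevation(0.0, 0.0)
```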
  • Information about the listener is information about the listener's position and orientation in sound space.
  • Information about the position is expressed as a position on the XYZ axes in Euclidean space, but it does not necessarily have to be three-dimensional information and can be two-dimensional information.
  • Information about the listener's orientation is typically expressed in yaw, pitch, and roll. Alternatively, the roll rotation may be omitted, and the listener's orientation information may be expressed in azimuth (yaw) and elevation (pitch).
  • the listener's position and orientation information may change over time, and if so, is transmitted to the rendering unit 1300.
  • the sensor information includes the amount of rotation or displacement detected by the sensor 1405 worn by the listener, and the listener's position and orientation.
  • the sensor information is transmitted to the rendering unit 1300, which updates the listener's position and orientation information based on the sensor information.
  • the sensor information may include position information obtained by the mobile terminal performing self-position estimation using a GPS, a camera, LiDAR, or the like, for example.
  • information obtained from the outside through a communication module, rather than from the sensor 1405, may be detected as sensor information.
  • Information indicating the temperature of the audio signal processing device 1001 and information indicating the remaining battery charge may be obtained from the sensor 1405.
  • the computational resources (CPU capacity, memory resources, PC performance, etc.) of the audio signal processing device 1001 or the audio presentation device 1002 may be obtained in real time.
  • the analysis unit 1301 analyzes the audio signal contained in the input signal and the spatial information received from the spatial information management units (1201, 1211) to detect the information necessary to generate direct sound and reflected sound, as well as the information necessary to select whether or not to generate reflected sound.
  • Information required to generate direct sound and reflected sound is, for example, information on the characteristics of direct sound and reflected sound that may occur in the sound space.
  • the reflected sound detected here is a candidate for reflected sound that is selected by the selection unit 1302 as the reflected sound that will ultimately be generated by the synthesis unit 1303.
  • the characteristics of direct sound and reflected sound are, for example, the arrival time and the volume at the time of arrival of the direct sound and reflected sound to the listener.
  • the information required to select the reflected sound to be output may be, for example, information indicating the evaluation value of the reflected sound and the upper limit of the computational resources, or information for calculating the evaluation value of the reflected sound and the information indicating the upper limit of the computational resources.
  • the analysis unit 1301 may obtain the evaluation value of the reflected sound from an external device, a memory unit, or an input signal.
  • the analysis unit 1301 or the selection unit 1302 may calculate the evaluation value of the reflected sound and the information indicating the upper limit of the computational resources using information obtained by the analysis unit 1301 from an external device, a memory unit, or an input signal.
  • the selection unit 1302 decides whether to select a reflected sound based on the evaluation value of the reflected sound. In other words, the selection unit 1302 preferentially selects a reflected sound with a high evaluation value over a reflected sound with a low evaluation value.
  • the evaluation value of a reflected sound is the value of the reflected sound, and corresponds to, for example, the perceptual importance of the reflected sound. The higher the perceptual importance of the reflected sound, the higher the evaluation value.
  • the perceptual importance of the reflected sound is the degree of necessity of the reflected sound used for the listener to correctly grasp the position of the sound source object in the sound space and the width of the space.
  • the listener is able to grasp the positioning of the sound image, such as the direction from which the sound comes and the sense of distance to the sound source object, as well as the size and material of the space.
  • the selection unit 1302 evaluates the perceptual importance of the reflected sound based on, for example, the volume of the sound source, the visibility of the sound source, the positioning of the sound source, the visibility of the reflecting object (obstacle object), information about the material of the reflecting object, and the geometric relationship between the direct sound and the reflected sound, and calculates an evaluation value.
  • Other indices may be used to evaluate the perceptual importance of the reflected sound.
  • the evaluation value of the reflected sound may be calculated based on any one of multiple indices related to the perceptual importance of the reflected sound, or the evaluation value of the reflected sound may be calculated comprehensively using multiple indices.
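  • One possible way to combine multiple indices into a single evaluation value is a weighted sum. This is only a sketch: the text does not specify how the indices are combined, and the index names, the weights, and the function name below are all hypothetical.

```python
def evaluation_value(indices: dict, weights: dict) -> float:
    """Combine normalized importance indices into one evaluation value.

    Indices without a weight contribute nothing (assumed behavior).
    """
    return sum(weights.get(name, 0.0) * value for name, value in indices.items())


# Hypothetical indices for one reflected sound, each normalized to 0..1.
score = evaluation_value(
    {"source_volume": 0.5, "angle_difference": 1.0},
    {"source_volume": 2.0, "angle_difference": 1.0},
)
```

  Using a single index corresponds to giving all other indices a weight of zero.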
  • the selection unit 1302 may also obtain the evaluation value of the reflected sound from an external device or memory unit, or may obtain the evaluation value from the input signal.
  • the louder the sound source volume the higher the evaluation value may be.
  • the evaluation value may be high.
  • the difference in the angle of arrival between direct sound and reflected sound, and the difference in the time of arrival between direct sound and reflected sound have a significant impact on the perception of space. Therefore, if the difference in the angle of arrival between direct sound and reflected sound is large, or if the difference in the time of arrival between direct sound and reflected sound is large, the evaluation value may be high.
  • an evaluation value of the reflected sound may be calculated using information on the difference in arrival time between the direct sound and the reflected sound.
  • a masking threshold in the well-known phenomenon of temporal masking (post-masking phenomenon) may be used.
  • the synthesis unit 1303 synthesizes the audio signal of the direct sound with the audio signal of the reflected sound that the selection unit 1302 has selected to generate.
  • the synthesis unit 1303 processes the input audio signal to generate a direct sound based on the information on the direct sound arrival time and volume at the time of direct sound arrival calculated by the analysis unit 1301.
  • the synthesis unit 1303 also processes the input audio signal to generate a reflected sound based on the information on the reflected sound arrival time and volume at the time of reflected sound arrival for the reflected sound selected by the selection unit 1302.
  • the synthesis unit 1303 then synthesizes and outputs the generated direct sound and reflected sound.
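  • The synthesis step can be pictured as a delay-and-sum of the input signal, once for the direct sound and once for each selected reflected sound. The sketch below is an assumption-laden simplification: it treats the volume at arrival as a single linear gain, rounds arrival times to whole samples, and ignores binaural and frequency-dependent processing. The function name and parameters are hypothetical.

```python
def synthesize(signal, sample_rate: int, paths):
    """Mix delayed, scaled copies of `signal`.

    paths: list of (arrival_time_seconds, linear_gain), one entry for the
    direct sound and one per selected reflected sound.
    """
    delays = [round(t * sample_rate) for t, _ in paths]
    out = [0.0] * (len(signal) + max(delays, default=0))
    for delay, (_, gain) in zip(delays, paths):
        for i, sample in enumerate(signal):
            out[delay + i] += gain * sample
    return out


# Direct sound at 0 s with gain 1.0, one reflection at 0.2 s with gain 0.5.
mixed = synthesize([1.0, 0.0], 10, [(0.0, 1.0), (0.2, 0.5)])
```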
  • FIG. 8 is a flowchart showing an example of the operation of the audio signal processing device 1001. FIG. 8 mainly shows the process executed by the rendering unit 1300 of the audio signal processing device 1001.
  • the analysis unit 1301 analyzes the input signal input to the audio signal processing device 1001 to detect direct sound and reflected sound that may occur in the sound space.
  • the reflected sound detected here is a candidate for the reflected sound selected by the selection unit 1302 as the reflected sound to be finally generated by the synthesis unit 1303.
  • the analysis unit 1301 also analyzes the input signal to calculate information required for generating the direct sound and reflected sound, and information required for selecting the reflected sound to be generated.
  • the characteristics of the direct sound and the reflected sound are calculated. Specifically, the arrival time and volume of the direct sound and the reflected sound when they reach the listener are calculated. If multiple objects exist in the sound space as reflecting objects, the characteristics of the reflected sound are calculated for each of the multiple objects.
  • the direct sound arrival time (td) is calculated based on the direct sound arrival path (pd).
  • the direct sound arrival path (pd) is the path connecting the position information S (xs, ys, zs) of the sound source object and the position information A (xa, ya, za) of the listener.
  • the direct sound arrival time (td) is the value obtained by dividing the length of the path connecting the position information S (xs, ys, zs) and the position information A (xa, ya, za) by the speed of sound (approximately 340 m/sec).
  • the path length (X) is calculated as X = ((xs-xa)^2 + (ys-ya)^2 + (zs-za)^2)^0.5.
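  • The path length and direct sound arrival time described above can be sketched as follows. The speed of sound of 340 m/s is taken from the text; the function and constant names are hypothetical.

```python
import math

SPEED_OF_SOUND_M_PER_S = 340.0  # approximate value given in the text


def path_length(source, listener) -> float:
    """Euclidean distance X between S (xs, ys, zs) and A (xa, ya, za)."""
    return math.dist(source, listener)


def direct_arrival_time(source, listener, speed=SPEED_OF_SOUND_M_PER_S) -> float:
    """Direct sound arrival time td = X / speed of sound."""
    return path_length(source, listener) / speed


td = direct_arrival_time((0.0, 0.0, 0.0), (340.0, 0.0, 0.0))
```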
  • the volume N at the sound source position may be the reference volume described above.
  • the reflected sound arrival time (tr) is calculated based on the reflected sound arrival path (pr).
  • the reflected sound arrival path (pr) is the path that connects the position of the sound image of the reflected sound and the position information A (xa, ya, za).
  • the position of the sound image of the reflected sound may be derived using, for example, the "mirror method" or "ray tracing method," or any other method for deriving the sound image position.
  • the mirror method is a method for simulating a sound image by assuming that a mirror image of the reflected wave on the wall of a room exists in a position symmetrical to the sound source with respect to the wall, and that sound waves are emitted from the position of that mirror image.
  • the ray tracing method is a method for simulating an image (sound image) observed at a certain point by tracing waves that propagate in a straight line, such as light rays or sound rays.
  • FIG. 9 is a diagram showing a positional relationship in which the listener and an obstacle object are relatively far apart.
  • FIG. 10 is a diagram showing a positional relationship in which the listener and an obstacle object are relatively close together. That is, each of FIG. 9 and FIG. 10 shows an example in which a sound image of a reflected sound is formed at a position symmetrical to the sound source position across a wall. By determining the position of the sound image of a reflected sound on the x, y and z axes based on such a relationship, the arrival time of the reflected sound can be determined in a similar manner to the method of calculating the arrival time of a direct sound.
  • the arrival time of the reflected sound (tr) is the value obtained by dividing the length (Y) of the path connecting the position of the sound image of the reflected sound and the position information A (xa, ya, za) by the speed of sound (approximately 340 m/sec).
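  • A minimal sketch of the mirror method for a single flat wall follows. It assumes the wall is an infinite plane given by a point and a normal vector, and it does not check whether the reflection point actually lies on a finite wall or whether the path is occluded. The function names are hypothetical.

```python
import math


def mirror_image(source, plane_point, plane_normal):
    """Position of the sound image: the source reflected across the wall plane."""
    norm = math.sqrt(sum(c * c for c in plane_normal))
    n = tuple(c / norm for c in plane_normal)
    # Signed distance from the source to the plane along the unit normal.
    d = sum((s - p) * c for s, p, c in zip(source, plane_point, n))
    return tuple(s - 2.0 * d * c for s, c in zip(source, n))


def reflected_arrival_time(source, listener, plane_point, plane_normal, speed=340.0):
    """Reflected sound arrival time tr = Y / speed, where Y is image-to-listener distance."""
    image = mirror_image(source, plane_point, plane_normal)
    return math.dist(image, listener) / speed


# Wall on the plane x = 0; the source at x = 1 is mirrored to x = -1.
image = mirror_image((1.0, 2.0, 3.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```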
  • the attenuation rate G may be expressed as a real number between 0 and 1, or may be expressed as a negative decibel value.
  • the volume of the entire signal is attenuated by G.
  • the attenuation rate may also be set for each frequency band that constitutes multiple frequency bands.
  • the analysis unit 1301 multiplies each frequency component of the signal by a specified attenuation rate.
  • the analysis unit 1301 may also use a representative value or average value of multiple attenuation rates for multiple frequency bands as the overall attenuation rate, and attenuate the volume of the entire signal by that amount.
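  • The attenuation handling described above can be sketched as follows. The sketch assumes that a decibel attenuation uses the 20·log10 amplitude convention and that the "average value" is an arithmetic mean; both are assumptions, and the function names are hypothetical.

```python
def attenuation_to_linear(g: float, is_db: bool = False) -> float:
    """Return the attenuation rate G as a linear factor between 0 and 1."""
    if is_db:
        if g > 0:
            raise ValueError("a decibel attenuation is expected to be non-positive")
        return 10.0 ** (g / 20.0)
    if not 0.0 <= g <= 1.0:
        raise ValueError("a linear attenuation rate is expected to lie in [0, 1]")
    return g


def attenuate_bands(band_amplitudes, band_rates):
    """Multiply each frequency band component by its own attenuation rate."""
    if len(band_amplitudes) != len(band_rates):
        raise ValueError("one attenuation rate is required per band")
    return [a * r for a, r in zip(band_amplitudes, band_rates)]


def overall_rate(band_rates) -> float:
    """Average of the per-band rates, usable as one overall attenuation rate."""
    return sum(band_rates) / len(band_rates)


per_band = attenuate_bands([1.0, 2.0], [0.5, 0.25])
```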
  • the selection unit 1302 selects whether or not to generate the reflected sound calculated by the analysis unit 1301. In other words, the selection unit 1302 determines whether or not to select the reflected sound as a target reflected sound to be generated. When there are multiple reflected sounds, the selection unit 1302 selects whether or not to generate each reflected sound. As a result of selecting whether or not to generate each reflected sound, the selection unit 1302 may select one or more target reflected sounds to be generated from among the multiple reflected sounds, or may not select any target reflected sounds to be generated.
  • the selection unit 1302 may select reflected sounds to which other processes are to be applied, not limited to generation processes. For example, the selection unit 1302 may select reflected sounds to which binaural processing is to be applied. Furthermore, the selection unit 1302 basically selects only one or more reflected sounds to be processed. However, the selection unit 1302 may select only one or more reflected sounds that are not to be processed. Then, processing may be applied to the one or more reflected sounds that are not selected.
  • the selection of reflected sounds is performed based on the allowable computational load and the perceptual importance of the reflected sounds.
  • the flow of the reflected sound selection process is explained using the flowchart in FIG. 11.
  • FIG. 11 is a flowchart showing an example of a selection process for reflected sounds.
  • the selection process is performed based on the computational load and the perceptual importance of the reflected sounds, but the selection process may be performed based on only one of them.
  • the selection unit 1302 acquires (S201) information indicating an upper limit of the computational load in the audio signal processing device 1001.
  • the information indicating the upper limit of the computational load may be determined in advance by a listener or may be acquired from an input signal.
  • the information indicating the upper limit of the computational load may indicate the number of reflected sounds (one or more) as the upper limit, or may indicate the processing amount of reflected sounds (one or more).
  • the predicted value of the number of reflected sounds is also used as the predicted value of the computational load of the reflected sound candidates described below, so it is possible to reduce the processing amount of the selection unit 1302 compared to calculating a predicted value of the processing amount of the reflected sounds.
  • the predicted value of the processing volume of the reflected sound is also used for the predicted value of the computational load of the reflected sound candidate described below, making it possible to predict the computational load more accurately.
  • the processing volume of (one or more) reflected sounds is, for example, the processing volume required to generate (one or more) reflected sounds, and is the total computation volume required for the processing to generate (one or more) reflected sounds.
  • the reflected sound processing is, for example, processing for generating reflected sound, and is included in the pipeline processing.
  • the pipeline processing includes, for example, reverberation processing, early reflection processing, distance attenuation processing, binaural processing, diffraction processing, and occlusion processing.
  • the pipeline process may include other processes or may not include some of the processes.
  • the rendering unit 1300 may perform diffraction processing and occlusion processing as the pipeline process.
  • reverberation processing may be omitted if it is not necessary.
  • the information indicating the upper limit of the computational load may be determined according to the computational resources (CPU capabilities, memory resources, PC performance, remaining battery power, etc.) of the audio signal processing device 1001 or the audio presentation device 1002. For example, since CPU processing capabilities generally increase in the order of head-mounted displays, VR/AR goggles, smartphones, notebook PCs, desktop PCs, and supercomputers, the upper limit of the computational load may also be set to increase in the same order.
  • the selection unit 1302 may also acquire information indicating the temperature of the device or information indicating the remaining battery power from a sensor 1405 provided in the audio signal processing device 1001 or the audio presentation device 1002.
  • the selection unit 1302 may also acquire the computational resources (CPU capacity, memory resources, PC performance, etc.) of the audio signal processing device 1001 or the audio presentation device 1002 in real time.
  • the selection unit 1302 may obtain information indicating the upper limit of the computational load in real time, or may obtain the information periodically each time the spatial information is updated by the spatial information management unit (1201, 1211).
  • the information indicating the upper limit of the computational load may be set according to the battery life of the audio signal processing device 1001 or the audio presentation device 1002.
  • an upper limit on the computational load may be set for each mode, such as an "energy saving mode" that requires less computation and allows the device to be used for a long time, or a "high performance mode" that requires more computation but allows more reflected sound to be heard.
  • the desired battery life or desired mode may be specified by the listener, an administrator managing the stereophonic sound reproduction system 1000, or a creator of the stereophonic sound content.
  • the upper limit on the computational load may be input directly without selecting a mode.
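  • A sketch of how the upper limit might be resolved from a mode or from a directly entered value is shown below. The mode keys and the numeric limits are hypothetical placeholders, not values from the text; here the limit is expressed as a number of reflected sounds.

```python
# Hypothetical per-mode upper limits, expressed as a number of reflected sounds.
MODE_UPPER_LIMIT = {
    "energy_saving": 8,
    "high_performance": 64,
}


def resolve_upper_limit(mode=None, direct_value=None):
    """Prefer a directly entered upper limit; otherwise look up the selected mode."""
    if direct_value is not None:
        return direct_value
    if mode is None:
        raise ValueError("either a mode or a direct value is required")
    return MODE_UPPER_LIMIT[mode]


limit = resolve_upper_limit(mode="energy_saving")
```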
  • information indicating an upper limit of the computational load may be set for each piece of content reproduced by the stereophonic sound reproduction system 1000.
  • For example, for content in which immersion is more important, the upper limit of the computational load may be set high and more reflected sounds may be selected.
  • For content in which real-time performance is important, the upper limit of the computational load may be set low so as to prevent delays associated with increased processing volume. This prevents too many reflected sounds from being selected.
  • the input signal including the content may include information indicating an upper limit of the computational load. Furthermore, the selection unit 1302 may determine the upper limit of the computational load based on information indicating the type of content or the type of mode included in the input signal. Alternatively, the selection unit 1302 may determine the upper limit of the computational load based on other flags or parameters included in the input signal, not limited to information indicating the type of content or the type of mode.
  • the selection unit 1302 extracts, as selection candidates, one or more reflected sounds whose arrival volume is equal to or greater than a threshold value from among one or more reflected sounds detected by the analysis unit 1301 (S202). In other words, the selection unit 1302 determines not to execute subsequent processing for one or more reflected sounds whose arrival volume is smaller than a threshold value.
  • the selection unit 1302 does not need to extract the reflected sound caused by the direct sound.
  • the volume of the reflected sound when it arrives is smaller than the volume of the direct sound when it arrives. Therefore, if the volume of the direct sound when it arrives is smaller than the threshold, the volume of the reflected sound when it arrives caused by the direct sound is also smaller than the threshold.
  • the selection unit 1302 may extract reflected sounds whose volume upon arrival is equal to or greater than a threshold from reflected sounds caused by direct sounds whose volume upon arrival is equal to or greater than a threshold.
  • the selection unit 1302 may first compare the volume of the direct sound when it arrives with a threshold value. This makes it possible to determine not to extract multiple reflected sounds caused by the direct sound if the volume of the direct sound when it arrives is smaller than the threshold value. Therefore, it is possible to reduce the amount of calculations compared to the case where the volume of the reflected sound when it arrives is calculated for each of the multiple reflected sounds caused by the direct sound, and then it is determined whether or not to extract the reflected sound.
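  • The candidate extraction in S202, including the early check on the direct sound, can be sketched as follows. The data classes and field names are hypothetical, and volumes are treated as plain numbers on a common scale.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reflection:
    name: str
    arrival_volume: float


@dataclass
class Source:
    name: str
    direct_arrival_volume: float
    reflections: List[Reflection] = field(default_factory=list)


def extract_candidates(sources, threshold: float):
    """Keep reflected sounds whose arrival volume is at or above the threshold (S202)."""
    candidates = []
    for src in sources:
        # A reflected sound is quieter than its direct sound, so a direct sound
        # below the threshold rules out all of its reflections at once.
        if src.direct_arrival_volume < threshold:
            continue
        candidates.extend(
            r for r in src.reflections if r.arrival_volume >= threshold
        )
    return candidates


sources = [
    Source("X", 0.8, [Reflection("x1", 0.5), Reflection("x2", 0.05)]),
    Source("Y", 0.05, [Reflection("y1", 0.04)]),
]
names = [r.name for r in extract_candidates(sources, 0.1)]
```

  In a fuller implementation the arrival volumes of the reflections of a skipped source would not even be computed, which is where the saving described above comes from.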
  • the threshold value to be compared with the volume of the direct sound or reflected sound at the time of arrival may be the minimum volume reproduced in the sound space.
  • the threshold value may be the minimum audible limit indicating the volume at the boundary between whether or not a sound can be perceived by the listener. For example, sounds with a volume lower than this threshold may not be reproduced in the virtual space as sounds that cannot be perceived by the listener.
  • the threshold value may be determined in advance by the listener or may be obtained from the input signal.
  • the threshold value of the volume upon arrival may be determined according to the computational resources (CPU capability, memory resources, PC performance, remaining battery power, etc.) of the audio signal processing device 1001 or the audio presentation device 1002. For example, since CPU processing capabilities generally increase in the order of head mounted displays, VR/AR goggles, smartphones, notebook PCs, desktop PCs, and supercomputers, the threshold value of the volume upon arrival may be set to decrease in the same order, so that more reflected sounds are extracted on more capable devices.
  • the selection unit 1302 may also acquire information indicating the temperature of the device or information indicating the remaining battery power from a sensor 1405 provided in the audio signal processing device 1001 or the audio presentation device 1002.
  • the selection unit 1302 may also acquire the computational resources (CPU capacity, memory resources, PC performance, etc.) of the audio signal processing device 1001 or the audio presentation device 1002 in real time.
  • the selection unit 1302 may obtain the threshold value of the sound volume at the time of arrival in real time, or may obtain it periodically each time the spatial information is updated by the spatial information management unit (1201, 1211).
  • the threshold for the arrival volume may also be set according to the battery life of the audio signal processing device 1001 or the audio presentation device 1002.
  • the threshold value of the sound volume at the time of arrival may be set for each mode, such as an "energy saving mode" that requires less calculation and allows the device to be used for a long time, or a "high performance mode" that requires more calculation but allows more reflected sounds to be received.
  • the desired battery life or desired mode may be specified by the listener, an administrator who manages the stereophonic sound reproduction system 1000, or a creator of the stereophonic sound content.
  • the threshold value of the sound volume at the time of arrival may be input directly without selecting a mode.
  • a threshold for the volume of sound at the time of arrival may be set for each piece of content played back by the stereophonic sound playback system 1000. For example, for content in which immersion is more important, the threshold for the volume of sound at the time of arrival may be set low, so that more reflected sounds are selected. For content in which real-time performance is important, the threshold for the volume of sound at the time of arrival may be set high, so as to prevent delays associated with increased processing volume. This prevents too many reflected sounds from being selected.
  • the input signal including the content may include a threshold for the volume upon arrival.
  • the selection unit 1302 may also determine the threshold for the volume upon arrival based on information indicating the type of content or the type of mode included in the input signal. Alternatively, the selection unit 1302 may determine the threshold for the volume upon arrival based on other flags or parameters included in the input signal, not limited to information indicating the type of content or the type of mode.
  • the selection unit 1302 calculates a predicted value of the total computation load of all the reflected sounds extracted as selection candidates whose arrival volume is equal to or greater than the threshold (S203).
  • the predicted value of the computation load may be the number of (one or more) reflected sounds or a predicted value of the processing amount of (one or more) reflected sounds.
  • Which of the predicted values of the number of reflected sounds or the processing volume of reflected sounds is used as the predicted value of the computational load may be determined depending on whether the information indicating the upper limit of the computational load indicates the number of reflected sounds or the processing volume of reflected sounds as the upper limit.
  • When the predicted value of the computational load is the number of reflected sounds, it is possible to reduce the processing volume of the selection unit 1302 more than when the predicted value of the computational load is a predicted value of the processing volume of reflected sounds.
  • When the predicted value of the computational load is a predicted value of the processing volume of reflected sounds, it is possible to predict the computational load more accurately by calculating the total amount of computation required to generate (one or more) reflected sounds.
  • the processing of reflected sound is, for example, processing for generating reflected sound, and is included in pipeline processing.
  • the predicted value of the amount of calculation in the pipeline processing i.e., the amount of processing for one reflected sound
  • the predicted value of the amount of calculation in the pipeline processing may be different for each reflected sound.
  • a predicted value for the processing amount for all reflected sounds may be calculated by assuming that the same processing is performed for each reflected sound.
  • the same predicted value may be applied to the predicted value for the processing amount for each reflected sound to calculate a predicted value for the processing amount for all reflected sounds.
  • the predicted value of the total computational load of multiple reflected sounds may be calculated, or the predicted value of the total computational load of one reflected sound may be calculated.
  • the reflected sounds used to calculate the predicted value of the total computational load may be all reflected sounds extracted as selection candidates, or only some of the reflected sounds extracted as selection candidates.
  • the predicted value of the number or processing amount of some of the reflected sounds may be used as the predicted value of the total computational load.
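  • The two kinds of predicted value in S203 can be sketched as follows. The `limit_kind` labels, the per-reflection cost table, and the default cost are hypothetical; the text only states that the prediction may be either a count or a processing amount, and that the same predicted cost may be applied to every reflected sound.

```python
def predict_total_load(candidates, limit_kind, cost_per_reflection=None, default_cost=1.0):
    """Predict the total computational load of the selection candidates (S203).

    limit_kind "count": the load is simply the number of reflected sounds.
    limit_kind "processing": the load is the sum of per-reflection cost estimates,
    falling back to the same default cost when no individual estimate exists.
    """
    if limit_kind == "count":
        return float(len(candidates))
    if limit_kind == "processing":
        costs = cost_per_reflection or {}
        return sum(costs.get(name, default_cost) for name in candidates)
    raise ValueError(f"unknown limit kind: {limit_kind}")


count_load = predict_total_load(["x1", "x2", "y1"], "count")
processing_load = predict_total_load(["x1", "x2"], "processing", {"x1": 2.5})
```

  The kind of prediction is chosen to match the unit in which the upper limit is expressed, so that the comparison in S204 is meaningful.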
  • the selection unit 1302 compares the calculated predicted value of the total calculation load with the upper limit of the calculation load, and determines whether or not the predicted value of the total calculation load exceeds the upper limit of the calculation load (S204). If the predicted value of the total calculation load exceeds the upper limit of the calculation load (Yes in S204), the selection unit 1302 performs selection processing (S205 to S211) based on the evaluation value. If the predicted value of the total calculation load does not exceed the upper limit of the calculation load (No in S204), the selection unit 1302 selects all of the reflected sounds extracted as selection candidates, and ends the processing.
  • the selection unit 1302 calculates an evaluation value of the reflected sound for each of the selection candidates based on the perceptual importance, and controls whether or not to select the reflected sound based on the evaluation value. For example, the selection unit 1302 selects the reflected sounds in descending order of evaluation value. A specific method for calculating the evaluation value of the reflected sounds will be described later. Here, an example of the selection process for selecting the reflected sounds based on the evaluation value will be described.
  • the selection unit 1302 for example, executes a loop process in which the computational loads of the selected reflected sounds are sequentially added up, and ends the selection process when the cumulative total exceeds the upper limit of the computational load (S205 to S211). In other words, when the cumulative total value of the computational loads of one or more selected reflected sounds exceeds the upper limit of the computational load (Yes in S209), the selection unit 1302 determines that the remaining undetermined reflected sounds are not selected reflected sounds, and ends the selection process.
  • the selection unit 1302 first sets the count of the total calculation load to zero (S205). The selection unit 1302 also calculates an evaluation value for each extracted reflected sound (S206). The selection unit 1302 then decides to select the reflected sound with the highest evaluation value among the reflected sounds not yet determined (S207). The selection unit 1302 also adds the calculation load of the reflected sound that has been decided to be selected to the total calculation load (S208).
  • If the total calculation load exceeds the upper limit of the calculation load (Yes in S209), the selection unit 1302 determines that the remaining undetermined reflected sounds are not to be selected and ends the selection process. In this case, the selection unit 1302 may re-determine that the last reflected sound determined to be selected is not to be selected. This makes it possible to keep the total computational load at or below the upper limit of the computational load.
  • the selection unit 1302 repeats the process (S207 to S209), and if there is no undetermined reflected sound (No in S210), the selection process ends.
  • a process may be performed to lower the importance of the sound source object and the reflecting object that generate the reflected sound by a predetermined amount (S211).
  • the value of the "sound source" of that reflected sound may be lowered. This makes it more likely that a reflected sound related to a different sound source will be selected in the next turn. Also, when a reflected sound is selected, the value of the "wall" (reflective object) that generated that reflected sound may be lowered. This makes it more likely that a reflected sound generated by a different wall will be selected in the next turn.
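  • As an illustrative sketch only, the loop of S205 to S211 may be written as follows. The class name Reflection, the function name select_reflections, the decay factor, and the use of the product of the source importance and the wall importance as the evaluation value are all assumptions made for this example and are not specified in this disclosure. The sketch also applies the optional step of deselecting the last reflected sound when the upper limit is exceeded.

```python
from dataclasses import dataclass


@dataclass
class Reflection:
    name: str    # reflected sound, e.g. "x1"
    source: str  # sound source object, e.g. "X"
    wall: str    # reflecting object, e.g. "R1"
    load: float  # computational load of rendering this reflected sound


def select_reflections(candidates, source_importance, wall_importance,
                       load_limit, decay=0.5):
    """Greedy selection loosely following S205 to S211 (hypothetical scoring)."""
    src = dict(source_importance)   # copies, so the caller's values are kept
    wall = dict(wall_importance)
    undetermined = list(candidates)
    selected = []
    total_load = 0.0                                    # S205
    while undetermined:                                 # S210
        # S206: assumed evaluation value = source importance * wall importance
        best = max(undetermined, key=lambda r: src[r.source] * wall[r.wall])
        undetermined.remove(best)
        selected.append(best)                           # S207
        total_load += best.load                         # S208
        if total_load > load_limit:                     # S209
            selected.pop()  # optional: deselect the last reflected sound
            break
        src[best.source] *= decay                       # S211
        wall[best.wall] *= decay
    return [r.name for r in selected]
```

  • For example, with sound sources X (importance 1.0) and Y (importance 0.6), walls R1 and R2 of equal importance, a load of 1 per reflected sound, and an upper limit of 2, this sketch selects x1 and then y2 rather than x1 and x2, because the importance of X and of R1 is lowered after the first turn.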
  • the sound source objects are represented as X, Y, and Z
  • the walls are represented as R1 to R6
  • the reflected sounds are represented as x1 to x6, y1 to y6, and z1 to z6.
  • the sense of reality of the walls R3 to R6 in the sound space will not be reproduced.
  • the volume of sound source Y is almost zero, expressing the sense of reality of sound source Y is not that important. Therefore, the evaluation values of the reflected sounds y1 to y6 may be low. In this way, when it comes to selecting which reflected sounds to use when computational resources are limited, the reflected sounds may not be selected randomly, but may be selected evenly based on their importance from the acoustic, auditory, and visual perspectives.
  • the method of determining the reflected sounds to be selected is not limited to determining the reflected sounds in order of highest evaluation value. For example, reflected sounds with evaluation values equal to or greater than a threshold may be selected, and reflected sounds with evaluation values below the threshold may not be selected. Also, reflected sounds from layers with high evaluation values may be selected at a predetermined rate. Alternatively, reflected sounds from layers with low evaluation values may not be selected at a predetermined rate. In these cases, a loop process of sequentially adding up the computational load of reflected sounds may not be executed.
  • (Evaluation process) FIG. 12 is a flowchart showing an example of the evaluation process. A specific method for determining the evaluation value will be described with reference to the flowchart shown in FIG. 12.
  • the selection unit 1302 may calculate an evaluation value of the reflected sound using a pre-set evaluation method according to, for example, the volume of the sound source, the visibility of the sound source, the positioning of the sound source, the visibility of a reflecting object (obstacle object), or the geometric relationship between the direct sound and the reflected sound.
  • the selection unit 1302 acquires a plurality of reflected sounds each extracted as a selection candidate, and calculates an evaluation value of each of the plurality of reflected sounds based on the perceptual importance of the reflected sound.
  • an evaluation score may be assigned to the reflected sound for each of the multiple indicators described below, and an evaluation value may be assigned to the reflected sound based on the evaluation score.
  • the multiple indicators for evaluation are not limited to the multiple indicators described below.
  • any one of the multiple indicators may be used, any two or more of the multiple indicators may be used, or all of the multiple indicators may be used.
  • the order of evaluation of the multiple indicators may be determined based on a predetermined priority order of the indicators.
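  • One possible way to derive an evaluation value from the evaluation scores of the individual indicators is a weighted sum, sketched below. The function name evaluation_value, the indicator keys, and the weighting scheme are assumptions for illustration. This disclosure does not prescribe how the scores are combined. An indicator that is not used can simply be omitted from the scores or given no weight.

```python
def evaluation_value(scores, weights=None):
    """Combine per-indicator evaluation scores into one evaluation value.

    scores:  dict mapping an indicator name (e.g. "A", "V", "S") to its score.
    weights: optional dict of indicator weights (assumed scheme); an indicator
             missing from `weights` contributes nothing to the result.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}  # default: plain sum
    return sum(weights.get(name, 0.0) * score for name, score in scores.items())
```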
  • an index related to a sound source object may be used as an evaluation index for a reflected sound. Furthermore, when a reflected sound caused by a sound source object is selected as described above, the value of the sound source object may be reduced. This makes it possible to reproduce reflected sounds caused by many sound source objects evenly without being biased toward reflected sounds caused by a specific sound source object. Therefore, it is possible to secure clues for the listener to correctly perceive the localization of each sound source.
  • evaluation points may be assigned to the reflected sounds caused by a sound source object based on the importance of the sound source object or the importance of the direct sound emitted by the sound source object.
  • the importance of a sound source object, i.e., the importance of the direct sound, may be evaluated.
  • This evaluation may be used as an evaluation of the reflected sound caused by the direct sound generated from the sound source object.
  • an evaluation score for an index related to the sound source object may be assigned to the reflected sound based on the audibility of the direct sound or the visibility of the sound source object.
  • the evaluation of the sound source object and the direct sound may be used not only to select the reflected sound, but also to select the direct sound.
  • the selection unit 1302 may evaluate the sound source object based on the audibility, i.e., ease of hearing, of the direct sound, and use the evaluation as an evaluation index for the reflected sound (S301). For example, an evaluation point A obtained by evaluating the audibility using information on the loudness of the direct sound may be assigned to the sound source object (direct sound) and the reflected sound.
  • a higher evaluation point A may be assigned to a loud sound source object compared to a quiet sound source object.
  • a higher evaluation point A may be assigned to a reflected sound caused by a loud sound source object compared to a reflected sound caused by a quiet sound source object.
  • the information relating to the loudness of a sound may be either the volume (decibel value) or the amplitude value. Since the volume or amplitude value of a sound usually changes from moment to moment, it goes without saying that the information relating to the loudness of a sound used in the evaluation may be a reference volume assigned to the sound source object, or information indicating the loudness of a sound as it transitions over time.
  • both the reference volume information and the volume information that transitions over time may be used as information indicating the loudness of the direct sound.
  • the evaluation score of the sound source object may be calculated based on the reference volume information, and then the evaluation score of the direct sound may be calculated by correcting the evaluation score using information indicating the loudness of the transitioning sound.
  • the evaluation score of the direct sound may first be calculated using information indicating the loudness of the transitioning sound, and then the evaluation score of the direct sound may be corrected using the reference volume assigned to the sound source object.
  • the evaluation score of the sound source object may be calculated using only either the reference volume information or the volume information that transitions over time.
  • the volume transitions intermittently over a short period of time. In other words, sound and silence alternate. If the virtual space is a concert hall and the direct sound is a musical performance, the volume is maintained for a certain length of time. If the virtual space is a battlefield and the direct sound is an explosion, the volume increases for a moment and then remains silent or low.
  • the volume information of the sound source may include not only loudness information, but also information on the transition of loudness.
  • the information may be information that lists in chronological order multiple pairs of time lengths during which the volume is roughly constant and the volume values for those time periods.
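  • The chronological list of pairs described above may, for example, be held as a list of (time length, volume) tuples. The following sketch looks up the volume at a given time. The function name volume_at, the units (seconds and decibels), and the return of None outside the described range are assumptions for this example.

```python
def volume_at(segments, t):
    """Return the volume at time t from chronological (duration, volume) pairs.

    segments: list of (duration_seconds, volume_db) tuples, in time order.
    Returns None when t lies outside the time range covered by the list.
    """
    start = 0.0
    for duration, volume in segments:
        if start <= t < start + duration:
            return volume
        start += duration
    return None
```

  • With segments of 0.5 s at -20 dB, 1.0 s at -90 dB, and 0.5 s at -25 dB, a query at 0.2 s falls in the first segment and a query at 1.6 s falls in the third.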
  • the selection unit 1302 may evaluate the sound source object based on its visibility and use the evaluation as an evaluation index for the reflected sound (S302).
  • the selection unit 1302 may detect a sound source object that is visible to the listener in the video provided by the video providing device in synchronization with the sound provided by the audio presentation device 1002.
  • a sound source object included in a video provided by a video providing device in synchronization with a sound provided by the audio presentation device 1002 may be detected as a visible sound source object.
  • the determination of whether or not it is visible may be made according to an update process of the spatial information managed by the spatial information management unit (1201, 1211), that is, according to a process in an information update thread.
  • the selection unit 1302 may assign a higher evaluation score V to the sound source object detected as a visible object, compared to a sound source object that is not visible to the listener.
  • the selection unit 1302 may assign a higher evaluation score V to direct sounds and reflected sounds resulting from a sound source object that is visible to the listener, compared to direct sounds and reflected sounds resulting from a sound source object that is not visible to the listener.
  • the method of detecting an object visible to the listener is not limited to the method based on an image provided in synchronization with the sound as described above.
  • a visible object may be determined based on the relationship between the listener's position in the sound space and the object's position.
  • the sound source object may be identified as being visible to the listener. More specifically, if there is no obstacle object on the propagation path of the direct sound and reflected sound that may occur in the sound space calculated by the analysis unit 1301, the sound source object or the reflecting object may be identified as being visible to the listener.
  • sound source objects located within a predetermined distance range from the listener's position may be identified as being visible to the listener.
  • the selection unit 1302 may then evaluate the importance of reflected sounds resulting from sounds generated from sound source objects identified as being visible to the listener as being high, and assign a high evaluation score V to such reflected sounds.
  • By using an index of visibility of the sound source object, it becomes possible to appropriately select reflected sound that matches the visual localization in the video with the auditory localization (acoustic localization) in the sound. If the visual localization of the sound source object visible to the listener does not match the acoustic localization based on the direct sound, reflected sound, and their relationship provided by the sound presentation device 1002, the sense of localization becomes unnatural, causing the listener to feel uncomfortable and reducing the sense of immersion.
  • the audio presentation device 1002 and the video providing device may be the same device, such as VR goggles and a head-mounted display, or may be separate devices, such as earphones and a smartphone.
  • the selection unit 1302 may evaluate the sound source object based on its localization and use the evaluation as an evaluation index for the reflected sound (S303).
  • the selection unit 1302 may detect the moving speed of a sound source object visible to the listener in the video provided by the video providing device in synchronization with the sound provided by the audio presentation device 1002. The selection unit 1302 may then assign a higher evaluation point S to a sound source object with a slower moving speed than to a sound source object with a faster moving speed. Similarly, the selection unit 1302 may assign a higher evaluation point S to direct sound and reflected sound caused by a sound source object with a slower moving speed than to direct sound and reflected sound caused by a sound source object with a faster moving speed.
  • the selection unit 1302 may assign the highest evaluation score S to the direct sound and reflected sound caused by the sound source object in the index of the localization of the sound source object. For example, the selection unit 1302 may assign a higher evaluation score to the reflected sound caused by the sound emitted by a stopped sound source object than to the reflected sound caused by the sound emitted by a moving sound source object.
  • the selection unit 1302 may assign a higher evaluation score to the direct sound and reflected sound caused by the sound source object, the slower the moving speed of the sound source object.
  • the selection unit 1302 may assign a low evaluation score to a reflected sound from a fast-moving sound source so that the reflected sound is not selected.
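  • A minimal sketch of an evaluation point S that decreases as the moving speed increases is shown below. The function name localization_score, the inverse-type formula, and the parameters max_score and speed_scale are assumptions. This disclosure only states that a slower sound source object may receive a higher evaluation point.

```python
def localization_score(speed, max_score=10.0, speed_scale=1.0):
    """Hypothetical evaluation point S: highest when stopped, lower when faster.

    speed:       moving speed of the sound source object (non-negative).
    max_score:   score assigned to a stopped sound source object (assumed).
    speed_scale: speed at which the score falls to half of max_score (assumed).
    """
    if speed < 0:
        raise ValueError("speed must be non-negative")
    return max_score / (1.0 + speed / speed_scale)
```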
  • the selection unit 1302 may use the importance of the reflecting object as an evaluation index for the reflected sound. In other words, the selection unit 1302 may evaluate the importance of the reflecting object (S304).
  • the spatial information may include information about a reflecting object.
  • the selection unit 1302 may then evaluate the importance of the reflecting object based on the information about the reflecting object.
  • the selection unit 1302 may then assign an evaluation point to the reflected sound caused by the object based on the importance of the reflecting object.
  • the selection unit 1302 may determine the importance of a reflective object based on information included in the input signal or metadata included in the bitstream.
  • the selection unit 1302 may also determine the importance of a reflective object based on other flags or parameters included in the input signal.
  • the importance of a reflective object may be determined based on the visibility of the reflective object (obstacle object) or information about the material of the reflective object. For example, the importance may be determined to be high according to the visibility of the reflective object (obstacle object), that is, for reflective objects that are visible to the listener.
  • a reflective object (obstacle object) visible to the listener may be detected in the video provided by the video providing device in synchronization with the sound provided by the audio presentation device 1002.
  • the selection unit 1302 may then assign a higher importance to the reflective object detected as a visible object compared to a reflective object that is not visible to the listener.
  • the selection unit 1302 may assign a higher evaluation score V to a reflected sound caused by a reflecting object visible to the listener, compared to a reflected sound caused by a reflecting object not visible to the listener. In other words, the selection unit 1302 may evaluate the importance of a reflected sound caused by a reflecting object within the listener's field of vision as being high, and assign a high evaluation score V to such a reflected sound.
  • a method for detecting a reflective object visible to a listener can be the same as the method for detecting a visible sound source object described above.
  • information about the material of a reflective object may be used as an index for evaluating the perceptual importance of a reflected sound.
  • multiple parameters such as a reflection coefficient (reflectance), diffusion coefficient, transmittance, and sound absorption coefficient may be obtained from metadata as information about the material of a reflective object.
  • the perceptual importance of a reflected sound may then be evaluated according to the ratio of each parameter.
  • if the ratio of reflectance or diffusion rate is high, the volume of the reflected sound that is reflected from the reflective surface and reaches the listener will be louder than if the ratio of transmittance or sound absorption rate is high. In this case, the perceptual importance is likely to be high. Therefore, if the ratio of reflectance or diffusion rate is high among multiple parameters related to the material that can be set for the reflective surface of a reflective object, the evaluation value for the reflected sound reflected by that reflective object may be high.
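  • The ratio-based evaluation described above may be sketched as follows. The function name material_score and the specific formula (the share of reflectance plus diffusion among the four parameters) are assumptions chosen for illustration. The parameter values in the usage note are hypothetical and are not measured material data.

```python
def material_score(reflectance, diffusion, transmittance, absorption):
    """Hypothetical material index in [0, 1].

    Returns the share of reflectance and diffusion among the four material
    parameters, so that a material that mostly reflects or diffuses sound
    scores higher than one that mostly transmits or absorbs it.
    """
    total = reflectance + diffusion + transmittance + absorption
    if total <= 0:
        return 0.0  # no usable material information
    return (reflectance + diffusion) / total
```

  • For instance, a hard wall given hypothetical values (0.6, 0.2, 0.1, 0.1) scores about 0.8, whereas a curtain given (0.1, 0.1, 0.3, 0.5) scores about 0.2.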
  • the information regarding the material of the reflective object is not limited to the reflection coefficient (reflectance), diffusion rate, transmittance, and sound absorption rate, but may be information that can identify the importance of the material.
  • a set of multiple parameters such as the reflection coefficient (reflectance), diffusion rate, transmittance, and sound absorption rate may be obtained from metadata as information that identifies the material.
  • the importance may be predefined for each material identifier. Then, an evaluation value of the reflected sound may be calculated according to the importance associated with the material identifier.
  • the information that specifies (identifies) the material is not limited to information for uniquely identifying the material (material identification information), but may be, for example, information that classifies the material (material classification information).
  • the information that classifies the material may be, for example, information that classifies the material according to a classification method preset by the content creator.
  • the importance of the reflective object may be updated to be lower. This makes it possible to reproduce reflected sounds caused by more reflective objects evenly, without being biased toward reflected sounds caused by a specific reflective object (e.g., a specific wall or a specific ceiling). This therefore makes it possible to provide clues for the listener to correctly perceive the width of the sound space.
  • the selection unit 1302 may use the relationship between the direct sound and the reflected sound (e.g., a geometric relationship) as an evaluation index for the reflected sound. Specifically, the selection unit 1302 may evaluate the geometric relationship between the direct sound and the reflected sound using the arrival angle between the direct sound and the reflected sound, and use the evaluation as an evaluation index for the reflected sound (S305).
  • the arrival angle between the direct sound and the reflected sound corresponds to the angle between the arrival direction of the direct sound and the arrival direction of the reflected sound, and corresponds to the angle difference between the angle of the arrival direction of the direct sound relative to a reference direction and the angle of the arrival direction of the reflected sound relative to the reference direction.
  • the angle between the direction from which the direct sound arrives and the direction from which the reflected sound arrives may be detected, and the greater the angle, the higher the evaluation score given to the reflected sound.
  • the analysis unit 1301 calculates the direct sound arrival path (pd) and the reflected sound arrival direction path (pr).
  • the analysis unit 1301 or the selection unit 1302 calculates the arrival direction of the direct sound and the arrival direction of the reflected sound based on the direct sound arrival path (pd), the reflected sound arrival direction path (pr), and the orientation information (D) of the avatar (listener) included in the input signal.
  • the arrival direction of the direct sound and the arrival direction of the reflected sound are expressed using the orientation of the listener as a reference.
  • the selection unit 1302 calculates an evaluation score for the reflected sound based on the angle between the direction from which the direct sound arrives and the direction from which the reflected sound arrives.
  • FIG. 13 is a diagram showing an example of the arrival angles of direct sound and reflected sound.
  • an avatar, a sound source object, and an obstacle object are positioned as shown in FIG. 13.
  • Position information of the avatar, sound source object, and obstacle object, as well as orientation information (D) of the avatar are obtained from the input signal. Then, from this information, the direction of the direct sound and the direction of the sound image of the reflected sound are calculated, assuming that the orientation of the avatar is 0 degrees.
  • the direction of the direct sound is about 20 degrees
  • the direction of the sound image of the reflected sound is about 265 degrees (-95 degrees).
  • the angle between the direction from which the direct sound comes and the direction from which the reflected sound comes is about 115 degrees.
  • the reflected sound is given a higher score. This means that, for example, a reflected sound that originates from a sound source visible in front of the listener but is heard from behind the listener will receive a higher score. As a result, it becomes possible to give priority to the selection of reflected sounds that help the listener anticipate the presence of a large object behind them, creating a sense of claustrophobia and tension.
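  • The angle calculation in the example of FIG. 13 may be sketched as follows. The function names arrival_angle and angle_score, and the linear mapping of the angle to a score between 0 and 1, are assumptions for illustration. Directions are given in degrees relative to the orientation of the avatar.

```python
def arrival_angle(direct_deg, reflected_deg):
    """Angle in degrees (0 to 180) between two arrival directions."""
    diff = abs(direct_deg - reflected_deg) % 360.0
    return min(diff, 360.0 - diff)


def angle_score(direct_deg, reflected_deg):
    """Hypothetical evaluation score: larger arrival angle, higher score."""
    return arrival_angle(direct_deg, reflected_deg) / 180.0
```

  • With the direct sound at 20 degrees and the reflected sound at 265 degrees (or equivalently -95 degrees), arrival_angle returns 115 degrees, which matches the value given above.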
  • the selection unit 1302 may evaluate the relationship between the direct sound and the reflected sound based on the time difference between the direct sound and the reflected sound, and use this evaluation as an evaluation index for the reflected sound (S306). For example, the selection unit 1302 may assign a higher evaluation point to a reflected sound with a large difference in arrival time between the direct sound and the reflected sound than to a reflected sound with a small difference in arrival time. For example, the echo that returns when you shout "Yahoo!" from the top of a mountain has a decisive impact on the perception of space. For this reason, such a reflected sound may be assigned a high evaluation point.
  • the selection unit 1302 may evaluate the relationship between the direct sound and the reflected sound using the time difference between the direct sound and the reflected sound and a threshold value corresponding to the time difference. For example, a reflected sound that arrives at the listener's position immediately after the direct sound is likely to be masked by the direct sound and is difficult to perceive. On the other hand, a reflected sound that arrives at the listener's position with a time lag from the direct sound is unlikely to be masked by the direct sound and is easy to perceive. An evaluation score may be assigned to the reflected sound based on such a model of perception.
  • the time difference (T) between the direct sound and the reflected sound may be, for example, the time difference between the time it takes for the direct sound and the reflected sound to arrive at the listening position.
  • the comparison process is performed using a threshold determined corresponding to the time difference between the direct sound and the reflected sound.
  • the threshold indicates a volume that is set in advance corresponding to the time difference between the direct sound and the reflected sound, and is determined by referring to threshold data.
  • the threshold data may be an index that indicates the boundary between whether or not the reflected sound is perceived by the listener relative to the direct sound.
  • the threshold refers to a value expressed as a numerical value or the like that is determined corresponding to the time difference (T)
  • the threshold data refers to table data or a relational expression that is used to identify or calculate the threshold at the time difference (T).
  • the format and type of the threshold data are not limited to table data or a relational expression.
  • FIG. 14 is a diagram showing an example of a method for setting threshold data based on the temporal masking phenomenon.
  • the threshold data may be set by referring to a masking threshold, which is a known threshold.
  • the temporal masking phenomenon is widely known, as described in Non-Patent Document 1 and elsewhere.
  • the shaded area in the figure shows the time period during which a masker (an inhibitory signal that interferes with the perception of the signal S to be heard) occurs and its amplitude.
  • the masking threshold indicates the audible level (SPL: Sound Pressure Level) of the signal S.
  • the masking threshold is high while the masker is occurring.
  • the masking threshold does not instantly become zero, but gradually decays. In other words, the masking threshold is high for a while (the period during which post-masking exists) immediately after the masker stops.
  • the post-masking tendency shown in the area surrounded by the dotted line in FIG. 14 may be used as threshold data for evaluating reflected sound based on the relationship between direct sound and reflected sound.
  • the threshold data may be determined based on the post-masking tendency, assuming that the direct sound corresponds to the masker and the reflected sound corresponds to the signal S to be heard.
  • FIG. 15 is a diagram showing an example of threshold data.
  • the threshold data may be determined as shown in the curve in FIG. 15.
  • the boundary (threshold) at which the reflected sound is perceived or not is shown by a curve.
  • the curve corresponds to the threshold data.
  • the threshold data according to this embodiment is stored in the memory 1404 of the audio signal processing device 1001.
  • the stored threshold data may be in any format and type.
  • the threshold data may be expressed as an approximation formula having the time difference between the direct sound and the reflected sound as a variable.
  • the threshold data may also be expressed as an array of the time difference between the direct sound and the reflected sound and the threshold.
  • FIG. 16 is a diagram showing the relationship between the time difference between direct sound and reflected sound and the threshold value.
  • the threshold value data may be stored in an area of memory 1404 as an array of indexes of the time difference between direct sound and reflected sound and threshold values corresponding to the indexes.
  • the memory 1404 may store information regarding a relational equation showing the relationship between the time difference (T) and the threshold value.
  • an equation having the time difference (T) as a variable may be stored.
  • the threshold value of each time difference (T) may be approximated by a straight line or a curve, and parameters indicating the geometric shape of the line or curve may be stored. For example, if the geometric shape is a straight line, the starting point and the slope for expressing the straight line may be stored.
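  • The two storage formats described above, an array of time differences with corresponding thresholds and a straight line stored as a starting point and a slope, may be sketched as follows. The function names, the linear interpolation between table entries, the clamping outside the table range, and the numerical values in the usage note are assumptions. They are not the threshold data of FIG. 15 or FIG. 16.

```python
from bisect import bisect_right


def threshold_from_table(table, t):
    """Look up the threshold for time difference t.

    table: list of (time_difference, threshold) pairs sorted by time difference.
    Values between entries are linearly interpolated (assumed); values outside
    the table range are clamped to the first or last entry (assumed).
    """
    times = [entry[0] for entry in table]
    if t <= times[0]:
        return table[0][1]
    if t >= times[-1]:
        return table[-1][1]
    i = bisect_right(times, t)
    (t0, v0), (t1, v1) = table[i - 1], table[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def threshold_from_line(start_point, slope, t):
    """Threshold from a straight line stored as a starting point and a slope."""
    t0, v0 = start_point
    return v0 + slope * (t - t0)
```

  • For a hypothetical table [(0, 0), (10, -10), (50, -30)] in milliseconds and decibels, a time difference of 30 ms gives a threshold of -20 dB.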
  • the threshold for evaluating the reflected sound is not limited to the known masking threshold.
  • Other thresholds may be determined in relation to the value of the time difference between the direct sound and the reflected sound, and a value indicating the amplitude or volume.
  • the threshold may be determined based on the minimum time difference at which the listener's perception detects a discrepancy between the two sounds.
  • Specific numerical values may be derived from already known research results, or may be determined by listening experiments conducted on the premise of application to the virtual space.
  • the threshold value is set by referring to threshold data based on the time difference between the arrival time of the direct sound and the arrival time of the reflected sound.
  • the selection unit 1302 may increase the evaluation score if the volume of the reflected sound when it arrives is greater than the set threshold value.
  • the time difference between the arrival time of the direct sound and the arrival time of the reflected sound is, in other words, the difference in the time it takes for the direct sound and the reflected sound to arrive at the listening position. Therefore, the difference in the distance of the arrival path of the direct sound and the reflected sound may be used as a value related to the time difference in the arrival time of the direct sound and the reflected sound.
  • the time difference between the end of the direct sound and the arrival of the reflected sound at the listening position may be used as the time difference between the direct sound and the reflected sound.
  • the end time of the direct sound may be calculated by adding the duration of the direct sound to the arrival time of the direct sound, for example.
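  • The two definitions of the time difference described above may be sketched as follows. The function names, the propagation speed of 343 m/s, and the assumption that the sound is emitted at time zero are introduced for this example only.

```python
SPEED_OF_SOUND = 343.0  # metres per second; assumed value for air


def arrival_time_difference(direct_path_m, reflected_path_m, c=SPEED_OF_SOUND):
    """Time difference (s) between the arrivals of direct and reflected sound,
    derived from the difference in path length."""
    return (reflected_path_m - direct_path_m) / c


def difference_from_direct_end(direct_path_m, reflected_path_m,
                               direct_duration_s, c=SPEED_OF_SOUND):
    """Time (s) from the end of the direct sound to the arrival of the
    reflected sound, assuming emission starts at time zero."""
    direct_end = direct_path_m / c + direct_duration_s  # arrival + duration
    reflected_arrival = reflected_path_m / c
    return reflected_arrival - direct_end
```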
  • the graph in Figure 15 has the time difference between direct sound and reflected sound on the horizontal axis, and the volume ratio between direct sound and reflected sound on the vertical axis.
  • the curve represents the threshold at which the reflected sound is perceived or not.
  • A, B, and C in the graph each represent reflected sound. Note that here, the volume ratio, i.e., the volume of the reflected sound determined relatively to the volume of the direct sound, is used on the vertical axis, but the volume of the reflected sound determined absolutely regardless of the volume of the direct sound may also be used.
  • the volume ratio of two signals is expressed as the difference in decibel values.
  • the volume ratio of two signals may be the difference when the amplitude values of each signal are expressed in the decibel domain. This value may be calculated based on an energy value or a power value, etc. Furthermore, this difference may be called the gain difference in the decibel domain, or simply the gain difference.
  • the volume ratio in this disclosure is essentially the ratio of signal amplitudes, and may be expressed as Sound volume ratio, Volume ratio, Amplitude ratio, Sound level ratio, Sound intensity ratio, Gain ratio, or the like. Also, when the unit of volume is decibels, it goes without saying that the volume ratio in this disclosure can be rephrased as volume difference.
  • volume ratio typically refers to the gain difference when the volumes of two sounds are expressed in decibel units
  • the threshold data is also typically defined as a gain difference expressed in the decibel domain.
  • the volume ratio is not limited to a gain difference in the decibel domain.
  • the threshold data defined in the decibel domain may be converted into the unit of the calculated volume ratio and used.
  • the threshold data defined in each unit may be stored in advance in memory.
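  • The conversion between an amplitude ratio and a gain difference in the decibel domain may be sketched as follows, using the standard relation of 20 times the base-10 logarithm of the amplitude ratio. The function names are assumptions for illustration.

```python
import math


def volume_ratio_db(direct_amplitude, reflected_amplitude):
    """Gain difference (dB) of the reflected sound relative to the direct sound."""
    return 20.0 * math.log10(reflected_amplitude / direct_amplitude)


def db_to_amplitude_ratio(gain_db):
    """Convert a threshold defined in decibels into a linear amplitude ratio."""
    return 10.0 ** (gain_db / 20.0)
```

  • A reflected sound with one tenth of the amplitude of the direct sound corresponds to a gain difference of about -20 dB, and a threshold of -20 dB corresponds to an amplitude ratio of about 0.1.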
  • Figure 9 shows the positional relationship between the listener, the sound source object, and an obstacle object (wall). In Figure 9, the sound source object and the obstacle object are relatively far away, and the listener hears reflected sound C in Figure 15.
  • Figure 10 shows another positional relationship between the listener, the sound source object, and an obstacle object (wall). In Figure 10, the sound source object and the obstacle object are relatively close, and the listener hears reflected sound A or B in Figure 15.
  • reflected sound C is located to the right of reflected sounds A and B on the graph.
  • the greater the time difference between the direct sound and the reflected sound, the smaller the threshold value.
  • reflected sound B, which has the same volume as reflected sound C, is below the threshold value, whereas reflected sound C is above the threshold value. Therefore, the evaluation score of reflected sound C is higher than the evaluation score of reflected sound B.
  • the arrival times of reflected sounds A and B are the same, but the volume of reflected sound A is greater than the volume of reflected sound B. Furthermore, the volume of reflected sound A is greater than the threshold value shown by the curve, and the volume of reflected sound B is smaller than the threshold value shown by the curve. In this case, reflected sound A is given a higher evaluation score than reflected sound B.
  • the reflected sound is evaluated based on a threshold value indicating the volume that is determined according to the time difference between the direct sound and the reflected sound. This allows the evaluation of the reflected sound to reflect the nature of human perception, whereby reflected sound that arrives at the listener's position with a time difference from the direct sound is not masked by the direct sound and is therefore easily perceived.
  • calculation of the arrival time and volume at the time of arrival of the direct sound and the reflected sound may be omitted, and the reflected sound may be evaluated based on the path length.
  • a threshold value for the path length of the reflected sound may be set corresponding to the value of the path length difference. In that case, the reflected sound may be evaluated based on whether the path length of the reflected sound is greater than a threshold value set corresponding to the value of the path length difference.
  • a parameter that indicates the sound propagation speed or a parameter that affects the sound propagation speed parameter may be used.
  • the geometric relationship may be the relationship between the positions of the sound source, the listener, and the reflecting object in the virtual space. These relationships make it possible to geometrically calculate the path lengths along which the direct sound and the reflected sound arrive. Therefore, by utilizing the relationship in which the volume is inversely proportional to the distance, it is possible to calculate the reference volume of the reflected sound relative to the reference volume of the direct sound.
  • the reflection coefficient of the reflecting object may be used to calculate the reference volume of the reflected sound.
  • a typical value that is generally used may be used as the reflection coefficient.
  • a specially assigned reflection coefficient may be used as the reflection coefficient of the reflecting object.
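  • As a sketch of the geometric calculation described above, the following example computes the path lengths of the direct sound and of a single reflection, and from them the reference volume of the reflected sound relative to the direct sound. It assumes a two-dimensional scene with one flat wall at a fixed x coordinate, with the sound source and the listener on the same side of the wall, and it assumes that the reflection coefficient is applied as a simple multiplier. The mirror-image construction and all function names are illustrative choices, not details stated in the disclosure.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def mirror_across_wall(point: Point, wall_x: float) -> Point:
    """Mirror a 2D point across a vertical wall located at x = wall_x."""
    x, y = point
    return (2.0 * wall_x - x, y)


def path_lengths(source: Point, listener: Point, wall_x: float) -> Tuple[float, float]:
    """Return (direct path length, reflected path length).

    The reflected path length equals the straight-line distance from the
    mirrored source to the listener.
    """
    direct = math.dist(source, listener)
    reflected = math.dist(mirror_across_wall(source, wall_x), listener)
    return direct, reflected


def relative_reference_volume(direct_len: float, reflected_len: float,
                              reflection_coefficient: float = 1.0) -> float:
    """Volume of the reflected sound relative to the direct sound.

    Volume is taken to be inversely proportional to distance, so the ratio
    is direct_len / reflected_len, scaled by the (assumed multiplicative)
    reflection coefficient.
    """
    return reflection_coefficient * direct_len / reflected_len


direct_len, reflected_len = path_lengths(source=(0.0, 0.0), listener=(3.0, 0.0), wall_x=5.0)
ratio = relative_reference_volume(direct_len, reflected_len, reflection_coefficient=0.8)
```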
  • the reflected sound may be evaluated based on its volume.
  • the volume of the reflected sound may be calculated from the geometric relationship between the direct sound and the reflected sound, as described above, and from an index assigned to the reflecting object.
  • the reflected sound may be evaluated by comparing the volume with a predetermined threshold value.
  • information indicating the temporal transition of the volume of the sound source may be reflected in the evaluation. For example, if the information indicating the temporal transition of the volume of the sound source indicates the duration of a section with sound, and the time is within the section with sound, the evaluation value of the reflected sound may be maintained as is. On the other hand, if the time is outside the section with sound, even if the reference volume of the reflected sound exceeds the threshold, a process may be performed to reduce the evaluation value of the reflected sound or set it to zero.
  • the information indicating the temporal transition of the volume of the sound source may be data that lists in chronological order multiple pairs of durations during which the amplitude of a sound signal is considered to be roughly constant, and the amplitude values of the signal during those periods.
  • a process may be performed to evaluate the reflected sound by changing the reference volume of the reflected sound in conjunction with changes in the amplitude values in the data.
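  • The use of the temporal transition of the source volume can be sketched as follows. The data layout (a list of duration and amplitude pairs) follows the description above; the function names, the treatment of an amplitude of zero as a section without sound, and the choice to set the evaluation value to zero in that case are assumptions for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (duration in seconds, amplitude during that period)


def amplitude_at(segments: List[Segment], t: float) -> float:
    """Return the amplitude of the segment that contains time t (0.0 if none)."""
    start = 0.0
    for duration, amplitude in segments:
        if start <= t < start + duration:
            return amplitude
        start += duration
    return 0.0


def adjusted_evaluation(base_value: float, reference_volume: float,
                        threshold: float, segments: List[Segment], t: float) -> float:
    """Scale the reference volume by the current amplitude before thresholding.

    Outside a section with sound the evaluation value is set to zero even if
    the unscaled reference volume would exceed the threshold.
    """
    amplitude = amplitude_at(segments, t)
    if amplitude == 0.0:
        return 0.0
    return base_value if reference_volume * amplitude > threshold else 0.0


# Hypothetical transition data: 1.0 s at full level, 0.5 s of silence, 2.0 s at quarter level.
segments = [(1.0, 1.0), (0.5, 0.0), (2.0, 0.25)]
value_loud = adjusted_evaluation(5.0, 0.4, 0.2, segments, t=0.5)
value_silent = adjusted_evaluation(5.0, 0.4, 0.2, segments, t=1.2)
value_quiet = adjusted_evaluation(5.0, 0.4, 0.2, segments, t=2.0)
```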
  • evaluation points may be assigned for all of the above-mentioned indices, or for some of the indices. Also, the number of indices used for evaluation may differ for each reflected sound, or the same indices may be used for all reflected sounds.
  • the indices used to assign evaluation points to the reflected sounds may be set based on predetermined information, and may be determined, for example, based on information included in the input signal, or based on information set by the listener or administrator.
  • a high evaluation score corresponds to a large evaluation score
  • a low evaluation score corresponds to a small evaluation score
  • a high evaluation value corresponds to a large evaluation value
  • a low evaluation value corresponds to a small evaluation value.
  • the selection unit 1302 calculates an evaluation value indicating the importance of the reflected sound based on the evaluation points for the reflected sound to which evaluation points have been assigned using each index. For example, the selection unit 1302 determines the sum of the multiple evaluation points as the evaluation value of the reflected sound (S307). The sum of the multiple evaluation points may be a weighted sum. If there is an unevaluated reflected sound (Yes in S308), the selection unit 1302 repeats the above-mentioned processing (S301 to S307), and if there is no unevaluated reflected sound (No in S308), the evaluation processing ends.
  • the evaluation value of a reflected sound is not limited to the sum of multiple evaluation points obtained from multiple indices. For example, a predetermined standard evaluation value and an already calculated evaluation value may be corrected with multiple evaluation points. Furthermore, only the evaluation points of some of the indices may be used for the evaluation value of the reflected sound, or may be used to correct the evaluation value of the reflected sound. Furthermore, when multiple evaluation points are assigned to one reflected sound from multiple indices, the highest evaluation point may be determined as the evaluation value of the reflected sound.
  • the index score to be used to calculate or correct the evaluation value may be determined based on predetermined information, may be determined based on information included in the input signal, or may be determined based on information set by the listener or administrator.
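  • The alternatives described above for deriving an evaluation value from multiple evaluation points (a plain sum, a weighted sum, or the highest point) can be sketched as follows. The index names and the function `combine_points` are hypothetical and serve only to make the example concrete.

```python
from typing import Dict, Optional


def combine_points(points: Dict[str, float],
                   weights: Optional[Dict[str, float]] = None,
                   mode: str = "sum") -> float:
    """Combine per-index evaluation points into one evaluation value.

    mode="sum": weighted sum (weights default to 1.0 for every index).
    mode="max": the highest evaluation point is used as the evaluation value.
    """
    if not points:
        raise ValueError("at least one evaluation point is required")
    if mode == "max":
        return max(points.values())
    if mode == "sum":
        weights = weights or {}
        return sum(weights.get(name, 1.0) * value for name, value in points.items())
    raise ValueError(f"unknown mode: {mode}")


points = {"time_difference": 1.0, "volume": 2.0, "direction": 0.5}
plain_sum = combine_points(points)
weighted_sum = combine_points(points, weights={"volume": 2.0})
highest = combine_points(points, mode="max")
```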
  • for convenience, evaluation points and evaluation values are distinguished here: an evaluation point is obtained from each index, and an evaluation value is obtained using multiple evaluation points obtained from multiple indexes.
  • the evaluation points and evaluation values may be treated in the same way.
  • the audio signal processing device 1001 may use an evaluation point obtained from one index as an evaluation value as it is in the selection process of the reflected sound, or may use multiple evaluation points obtained from multiple indexes in the selection process of the reflected sound.
  • the audio signal processing device 1001 determines whether or not to select the reflected sound based on each of the multiple evaluation points. Then, the audio signal processing device 1001 may finally determine that the reflected sound is selected when all of the multiple determination results based on the multiple evaluation points indicate that the reflected sound is selected. Alternatively, the audio signal processing device 1001 may finally determine that the reflected sound is selected when any one of the multiple determination results based on the multiple evaluation points indicates that the reflected sound is selected.
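  • The two ways of combining per-point decisions described above (select only when all decisions agree, or select when any one decision indicates selection) can be sketched as follows. The per-index thresholds and the function name `final_decision` are assumptions for illustration.

```python
from typing import Dict


def final_decision(points: Dict[str, float],
                   thresholds: Dict[str, float],
                   rule: str = "all") -> bool:
    """Decide whether a reflected sound is selected.

    Each evaluation point yields its own decision (point >= threshold).
    rule="all": selected only if every decision says "select".
    rule="any": selected if at least one decision says "select".
    """
    decisions = [points[name] >= thresholds[name] for name in thresholds]
    if rule == "all":
        return all(decisions)
    if rule == "any":
        return any(decisions)
    raise ValueError(f"unknown rule: {rule}")


points = {"time_difference": 1.0, "volume": 0.2}
thresholds = {"time_difference": 0.5, "volume": 0.5}
selected_all = final_decision(points, thresholds, rule="all")
selected_any = final_decision(points, thresholds, rule="any")
```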
  • priorities may be assigned to multiple evaluation points based on multiple indices.
  • the audio signal processing device 1001 determines whether or not to select a reflected sound based on each of the first to third evaluation points based on the first to third indices.
  • the audio signal processing device 1001 may make a final judgment that the reflected sound is not to be selected, without relying on the judgment results based on the second and third evaluation points.
  • the audio signal processing device 1001 may ultimately judge that the reflected sound should be selected without relying on the judgment results based on the third evaluation point.
  • processing is carried out as described above according to the flowchart shown in FIG. 11.
  • the selection process of the reflected sounds is performed based on both the computational load and the evaluation value (importance) of the reflected sounds.
  • the selection process of the reflected sounds may be performed based on only one of them.
  • the selection unit 1302 may omit the calculation of the evaluation value of each reflected sound, and may determine not to select the reflected sound if the computational load of the reflected sound is greater than a threshold value.
  • the selection unit 1302 may omit the acquisition of information indicating the upper limit of the computational load, the calculation of the total computational load of the extracted reflected sounds, and the comparison of the total computational load with the upper limit of the computational load, and may perform the selection process of the reflected sounds based only on the evaluation value of the reflected sounds.
  • the extraction of reflected sounds whose volume is equal to or greater than the threshold value may be performed after the evaluation value is determined, or after it is determined that a reflected sound is to be selected. For example, even if it is determined that a reflected sound is to be selected based on the evaluation value or the computational load, if the volume of the reflected sound falls below the threshold value, the reflected sound may be redetermined not to be selected.
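  • One possible way to combine the volume threshold, the evaluation value, and the upper limit of the computational load is sketched below. The greedy strategy (highest evaluation value first, skipping any reflected sound that would exceed the limit) is an assumption chosen for the example; the disclosure does not prescribe this particular procedure, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reflection:
    name: str
    volume: float      # volume at the time of arrival
    evaluation: float  # evaluation value (importance)
    load: float        # estimated computational load of generating this sound


def select_reflections(candidates: List[Reflection],
                       volume_threshold: float,
                       load_limit: float) -> List[Reflection]:
    """Extract audible reflections, then pick them in order of importance
    while the total computational load stays within the upper limit."""
    audible = [r for r in candidates if r.volume >= volume_threshold]
    selected: List[Reflection] = []
    total_load = 0.0
    for reflection in sorted(audible, key=lambda r: r.evaluation, reverse=True):
        if total_load + reflection.load <= load_limit:
            selected.append(reflection)
            total_load += reflection.load
    return selected


candidates = [
    Reflection("a", volume=0.5, evaluation=3.0, load=2.0),
    Reflection("b", volume=0.05, evaluation=9.0, load=1.0),  # below the volume threshold
    Reflection("c", volume=0.4, evaluation=2.0, load=2.0),
    Reflection("d", volume=0.3, evaluation=1.0, load=1.0),
]
chosen = select_reflections(candidates, volume_threshold=0.1, load_limit=3.0)
chosen_names = [r.name for r in chosen]
```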
  • the synthesis unit 1303 generates and synthesizes an audio signal of the direct sound and an audio signal of the reflected sound selected by the selection unit 1302 as the reflected sound to be generated.
  • the audio signal of the direct sound is generated by applying the arrival time (td) and arrival volume (ld) calculated by the analysis unit 1301 to the sound data of the sound source object included in the input information. Specifically, the sound data is delayed by the arrival time (td) and multiplied by the arrival volume (ld).
  • the process of delaying the sound data is a process of moving the position of the sound data forward or backward on the time axis. For example, a process of delaying sound data without degrading sound quality as disclosed in Patent Document 2 may be applied.
  • the audio signal of the reflected sound is generated by applying the arrival time (tr) and arrival volume (lr) calculated by the analysis unit 1301 to the sound data of the sound source object.
  • the volume at the time of arrival (lr) when generating reflected sound is different from the volume at the time of arrival of direct sound, and is a value to which the attenuation rate G of the volume at the reflection is applied.
  • G may be an attenuation rate that is applied to all frequency bands at once.
  • a reflectance rate may be specified for each specified frequency band to reflect the bias of frequency components caused by reflection.
  • the process of applying the volume at the time of arrival (lr) may be implemented as a frequency equalizer process that multiplies each band by an attenuation rate.
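  • The generation and synthesis of the direct sound and a reflected sound can be sketched as a delay followed by a gain, using the arrival times (td, tr) and arrival volumes (ld, lr) described above. The example assumes integer-sample delays, a single broadband attenuation rate already folded into lr, and plain Python lists as sound data; the function names and the sample rate are hypothetical. Delay methods that avoid sound quality degradation and per-band equalization are outside the scope of this sketch.

```python
from typing import List


def render_path(samples: List[float], delay_s: float, gain: float,
                sample_rate: int) -> List[float]:
    """Delay the sound data by delay_s seconds and multiply it by gain."""
    if delay_s < 0:
        raise ValueError("delay must be non-negative")
    delay_samples = round(delay_s * sample_rate)
    return [0.0] * delay_samples + [gain * s for s in samples]


def mix(a: List[float], b: List[float]) -> List[float]:
    """Sum two signals sample by sample, padding the shorter one with zeros."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


# Assumed example values: a 1 kHz sample rate keeps the numbers easy to follow.
source = [1.0, 0.5]
direct = render_path(source, delay_s=0.002, gain=0.5, sample_rate=1000)      # td, ld
reflected = render_path(source, delay_s=0.004, gain=0.25, sample_rate=1000)  # tr, lr
output = mix(direct, reflected)
```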
  • FIG. 17 is a block diagram showing an example of the configuration for the rendering unit 1300 to perform pipeline processing.
  • the rendering unit 1300 in FIG. 17 includes a reverberation processing unit 1311, an early reflection processing unit 1312, a distance attenuation processing unit 1313, a selection unit 1314, a generation unit 1315, and a binaural processing unit 1316.
  • the reverberation processing unit 1311, the early reflection processing unit 1312, and the distance attenuation processing unit 1313 perform reverberation processing, early reflection processing, and distance attenuation processing, respectively.
  • the selection unit 1314 selects a reflected sound
  • the generation unit 1315 generates a direct sound and a reflected sound
  • the binaural processing unit 1316 applies binaural processing to the direct sound and the reflected sound.
  • These multiple components may be composed of multiple components of the rendering unit 1300 shown in FIG. 7, or may be composed of at least some of the multiple components of the audio signal processing device 1001 shown in FIG. 5.
  • Pipeline processing refers to dividing the process for creating sound effects into multiple processes and executing the multiple processes one by one in sequence. Each of the multiple processes performs, for example, signal processing on an audio signal, or the generation of parameters used in signal processing.
  • the rendering unit 1300 may perform reverberation processing, early reflection processing, distance attenuation processing, binaural processing, and the like as pipeline processing.
  • these processes are merely examples, and the pipeline processing may include other processes than these, or may not include some of the processes.
  • the pipeline processing may include diffraction processing and occlusion processing.
  • reverberation processing may be omitted if it is not necessary.
  • Each process may be expressed as a stage.
  • audio signals such as reflected sounds generated as a result of each process may be expressed as rendering items.
  • the multiple stages in pipeline processing and their order are not limited to the example shown in FIG. 17.
  • the parameters used in the selection process are calculated at one of multiple stages for generating a rendering item.
  • the parameters used to select reflected sound are calculated as part of the pipeline processing for generating a rendering item. Note that not all stages need to be performed by the rendering unit 1300. For example, some stages may be omitted, or may be performed outside the rendering unit 1300.
  • the following describes the reverberation processing, early reflection processing, distance attenuation processing, selection processing, generation processing, and binaural processing that may be included as stages in the pipeline processing.
  • metadata included in the input signal may be analyzed to calculate parameters used to generate the reflected sound.
  • In reverberation processing, the reverberation processor 1311 generates an audio signal indicating reverberation sound, or parameters used to generate an audio signal.
  • Reverberation sound is sound that reaches the listener as reverberation after direct sound.
  • reverberation sound is sound that reaches the listener after being reflected more times (e.g., several tens of times) than the initial reflection sound, at a relatively late stage (e.g., about 150 ms after the direct sound arrives) after the initial reflection sound described below reaches the listener.
  • the reverberation processor 1311 refers to the audio signal and spatial information contained in the input signal, and calculates the reverberation using a predetermined function prepared in advance as a function for generating the reverberation.
  • the reverberation processor 1311 may generate reverberation sound by applying a known reverberation generation method to the audio signal included in the input signal.
  • a known reverberation generation method is the Schroeder method, but known reverberation generation methods are not limited to the Schroeder method.
  • the reverberation processor 1311 uses the shape and acoustic characteristics of the sound reproduction space indicated by the spatial information. This allows the reverberation processor 1311 to calculate parameters for generating reverberation sound.
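  • As an illustration of the Schroeder method mentioned above, the following sketch builds a reverberator from feedback comb filters connected in parallel and all-pass filters connected in series. The delay lengths and gains in the example are arbitrary placeholders; in practice they would be derived from the shape and acoustic characteristics of the sound reproduction space, and how the reverberation processor 1311 does this is not specified here. The output has the same length as the input, so the caller would pad the input with zeros to hear the reverberation tail.

```python
from typing import List, Tuple


def feedback_comb(x: List[float], delay: int, gain: float) -> List[float]:
    """Feedback comb filter: y[n] = x[n] + gain * y[n - delay]."""
    if delay < 1:
        raise ValueError("delay must be at least one sample")
    y: List[float] = []
    for n, sample in enumerate(x):
        feedback = y[n - delay] if n >= delay else 0.0
        y.append(sample + gain * feedback)
    return y


def allpass(x: List[float], delay: int, gain: float) -> List[float]:
    """Schroeder all-pass: y[n] = -gain * x[n] + x[n - delay] + gain * y[n - delay]."""
    if delay < 1:
        raise ValueError("delay must be at least one sample")
    y: List[float] = []
    for n, sample in enumerate(x):
        x_delayed = x[n - delay] if n >= delay else 0.0
        y_delayed = y[n - delay] if n >= delay else 0.0
        y.append(-gain * sample + x_delayed + gain * y_delayed)
    return y


def schroeder_reverb(x: List[float],
                     comb_params: List[Tuple[int, float]],
                     allpass_params: List[Tuple[int, float]]) -> List[float]:
    """Parallel comb filters (averaged) followed by all-pass filters in series."""
    if not comb_params:
        raise ValueError("at least one comb filter is required")
    combs = [feedback_comb(x, delay, gain) for delay, gain in comb_params]
    y = [sum(values) / len(combs) for values in zip(*combs)]
    for delay, gain in allpass_params:
        y = allpass(y, delay, gain)
    return y


impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
comb_out = feedback_comb(impulse, delay=2, gain=0.5)
allpass_out = allpass(impulse, delay=1, gain=0.5)
reverb_out = schroeder_reverb(impulse, [(2, 0.5), (3, 0.5)], [(1, 0.5)])
```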
  • the early reflection processor 1312 calculates parameters for generating early reflection sounds based on spatial information.
  • Early reflection sounds are reflected sounds that arrive at the listener after one or more reflections at a relatively early stage (e.g., about several tens of milliseconds after the direct sound arrives) after the direct sound from the sound source object arrives at the listener.
  • the early reflection processing unit 1312 refers to the audio signal and metadata and calculates the path of the reflected sound that travels from the sound source object to the listener after being reflected by the reflecting object.
  • the shape of the three-dimensional sound field (space), the size of the three-dimensional sound field, the position of the reflecting object such as a structure, and the reflectance of the reflecting object may be used in calculating the path.
  • the early reflection processing unit 1312 may also calculate the path of the direct sound.
  • the information on the path may be used as a parameter by which the early reflection processing unit 1312 generates the early reflected sound, or may be used as a parameter by which the selection unit 1314 selects the reflected sound.
  • the distance attenuation processing unit 1313 calculates the volume of the direct sound and reflected sound that reach the listener based on the path length of the direct sound and reflected sound.
  • the volume of the direct sound and reflected sound that reach the listener attenuates, relative to the volume of the sound source, in inverse proportion to the length of the path to the listener. Therefore, the distance attenuation processing unit 1313 can calculate the volume of the direct sound by dividing the volume of the sound source by the path length of the direct sound, and can calculate the volume of the reflected sound by dividing the volume of the sound source by the path length of the reflected sound.
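  • Written as formulas, the distance attenuation described above takes the following form. The symbols $l_s$ (volume of the sound source), $d_d$ (path length of the direct sound), and $d_r$ (path length of the reflected sound) are introduced here for illustration; $l_d$ and $l_r$ correspond to the arrival volumes of the direct sound and the reflected sound, before any attenuation at the reflecting object is applied.

```latex
l_d = \frac{l_s}{d_d}, \qquad
l_r = \frac{l_s}{d_r}, \qquad
\frac{l_r}{l_d} = \frac{d_d}{d_r}
```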
  • the selection unit 1314 selects the reflected sound to be generated based on the parameters calculated before the selection process. Any of the selection methods disclosed herein may be used to select the reflected sound to be generated.
  • the processing after the selection process may not be executed in the pipeline processing for the reflected sounds that were not selected in the selection process.
  • by not executing the processing after the selection process for the reflected sounds that were not selected, it is possible to reduce the computational load of the audio signal processing device 1001 more than by omitting only the binaural processing.
  • when the selection process is included in the pipeline processing, assigning the selection process an earlier position among the multiple processes makes it possible to omit more processes and reduce the amount of calculation even further.
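  • The effect of placing the selection process early in the pipeline can be sketched as follows. Each stage is a function that returns the rendering item, or `None` when the item is dropped, so that later stages are never executed for reflected sounds that were not selected. The stage functions, the dictionary-based rendering items, and the volume threshold are simplified stand-ins assumed for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Item = Dict[str, object]
Stage = Callable[[Item], Optional[Item]]


def attenuate(item: Item) -> Optional[Item]:
    # Distance attenuation: volume is the source volume divided by the path length.
    item["volume"] = item["source_volume"] / item["path_length"]
    return item


def select(item: Item, threshold: float = 0.1) -> Optional[Item]:
    # Selection: drop reflected sounds whose volume is below the threshold.
    return item if item["volume"] >= threshold else None


def generate(item: Item) -> Optional[Item]:
    item["generated"] = True  # placeholder for audio signal generation
    return item


def binaural_stage(item: Item) -> Optional[Item]:
    item["binaural"] = True  # placeholder for binaural processing
    return item


def run_pipeline(items: List[Item],
                 stages: List[Tuple[str, Stage]]) -> Tuple[List[Item], List[Tuple[str, str]]]:
    """Run the stages in order; items dropped by a stage skip all later stages."""
    executed: List[Tuple[str, str]] = []
    survivors = list(items)
    for name, stage in stages:
        remaining: List[Item] = []
        for item in survivors:
            executed.append((name, str(item["id"])))
            result = stage(item)
            if result is not None:
                remaining.append(result)
        survivors = remaining
    return survivors, executed


items = [
    {"id": "r1", "source_volume": 1.0, "path_length": 2.0},
    {"id": "r2", "source_volume": 1.0, "path_length": 50.0},
]
stages = [("attenuate", attenuate), ("select", select),
          ("generate", generate), ("binaural", binaural_stage)]
survivors, executed = run_pipeline(items, stages)
```

  • In this example the generation and binaural stages each run once instead of twice, because the second reflected sound is dropped at the selection stage.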
  • the binaural processing unit 1316 performs signal processing so that the audio signal of the direct sound is perceived by the listener as a sound arriving from the direction of the sound source object. Furthermore, the binaural processing unit 1316 performs signal processing so that the reflected sound selected by the selection unit 1314 is perceived by the listener as a sound arriving from the reflecting object.
  • the binaural processing unit 1316 performs processing to apply the HRIR DB so that sound arrives at the listener from the position of a sound source object or the position of an obstacle object based on the listener's position and orientation in the sound space.
  • HRIR stands for Head-Related Impulse Response.
  • HRIR is the response characteristic when one impulse is generated.
  • HRIR is a response characteristic obtained by converting the head-related transfer function, which expresses as a transfer function the changes in sound caused by surrounding objects including the auricle, the human head, and the shoulders, from a frequency domain expression to a time domain expression using an inverse Fourier transform.
  • the HRIR DB is a database that contains this kind of information.
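  • Applying an HRIR to a sound can be sketched as a lookup followed by a convolution for each ear. The toy database below, keyed by azimuth in degrees with 90 meaning the listener's right, contains made-up impulse responses and is not real measurement data; the nearest-direction lookup and all names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

HrirPair = Tuple[List[float], List[float]]  # (left ear response, right ear response)


def convolve(signal: List[float], impulse_response: List[float]) -> List[float]:
    """Direct-form convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def nearest_hrir(hrir_db: Dict[float, HrirPair], azimuth_deg: float) -> HrirPair:
    """Pick the database entry whose azimuth is closest on the circle."""
    def angular_distance(a: float, b: float) -> float:
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    key = min(hrir_db, key=lambda az: angular_distance(az, azimuth_deg))
    return hrir_db[key]


def binauralize(signal: List[float], hrir_db: Dict[float, HrirPair],
                azimuth_deg: float) -> Tuple[List[float], List[float]]:
    """Return (left, right) signals for a sound arriving from azimuth_deg."""
    left_ir, right_ir = nearest_hrir(hrir_db, azimuth_deg)
    return convolve(signal, left_ir), convolve(signal, right_ir)


# Toy HRIR DB (hypothetical values): sound from the right reaches the right ear
# earlier and louder than the left ear, and vice versa.
toy_db: Dict[float, HrirPair] = {
    0.0: ([1.0, 0.0], [1.0, 0.0]),
    90.0: ([0.0, 0.5], [1.0, 0.0]),
    270.0: ([1.0, 0.0], [0.0, 0.5]),
}
left, right = binauralize([1.0, 2.0], toy_db, azimuth_deg=80.0)
```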
  • the position and orientation of the listener in the sound space are, for example, the position and orientation of the virtual listener in the virtual sound space.
  • the position and orientation of the virtual listener in the virtual sound space may change in accordance with the movement of the listener's head.
  • the position and orientation of the virtual listener in the virtual sound space may be determined based on information acquired from the sensor 1405.
  • the programs, spatial information, HRIR DB, threshold data, and other parameters used in the above processing are obtained from the memory 1404 provided in the audio signal processing device 1001 or from outside the audio signal processing device 1001.
  • the pipeline processing may also include other processes.
  • the rendering unit 1300 may also include processing units (not shown) for performing other processes included in the pipeline processing.
  • the rendering unit 1300 may include a diffraction processing unit and an occlusion processing unit.
  • the diffraction processing unit executes processing to generate an audio signal that indicates sound including diffracted sound caused by an obstacle object between the listener and the sound source object in a three-dimensional sound field (space).
  • diffracted sound is sound that travels from the sound source object to the listener, going around the obstacle object.
  • the diffraction processing unit refers to the audio signal and metadata, calculates the path of the diffracted sound that travels from the sound source object to the listener, bypassing the obstacle object, and generates the diffracted sound based on the path.
  • the positions of the sound source object, the listener, and the obstacle object in the three-dimensional sound field (space), as well as the shape and size of the obstacle object, etc. may be used.
  • When a sound source object is present behind an obstacle object, the occlusion processor generates an audio signal for the sound that leaks from the sound source object through the obstacle object based on spatial information and information such as the material of the obstacle object.
  • the position information given to the sound source object indicates a "point" in the virtual space as the position of the sound source object. That is, in the above, the sound source is defined as a "point sound source."
  • a sound source in a virtual space may be defined as an object having length, size, shape, etc., that is, as a spatially extended sound source that is not a point sound source.
  • for such a spatially extended sound source, the distance between the listener and the sound source and the direction from which the sound arrives are not uniquely determined. Therefore, the reflected sound caused by such a sound source may always be selected by the selection unit 1302, either without analysis by the analysis unit 1301 or regardless of the analysis results. This makes it possible to avoid the deterioration in sound quality that could result from not selecting the reflected sound.
  • a representative point such as the center of gravity of the object may be determined, and the processing of the present disclosure may be applied on the assumption that sound is generated from that representative point.
  • the threshold may be adjusted according to information on the spatial extension of the sound source.
  • a direct sound is a sound that is not reflected by a reflecting object
  • a reflected sound is a sound that is reflected by a reflecting object.
  • a direct sound may be a sound that arrives at a listener from a sound source without being reflected by a reflecting object
  • a reflected sound may be a sound that arrives at a listener from a sound source after being reflected by a reflecting object.
  • each of the direct sound and the reflected sound is not limited to the sound that has arrived at the listener, but may be the sound before it arrives at the listener.
  • the direct sound may be the sound output from the sound source, or in other words, the sound of the sound source.
  • the bit stream includes, for example, an audio signal and metadata.
  • the audio signal is sound data that represents sound, and indicates information about the frequency and intensity of the sound.
  • the metadata includes spatial information about the sound space, which is the space of the sound field.
  • spatial information is information about the space in which a listener who hears sound based on an audio signal is located.
  • spatial information is information about a specific position (localization position) for localizing a sound image at that position in a sound space (e.g., a three-dimensional sound field), that is, for allowing the listener to perceive sound coming from a direction corresponding to the specific position.
  • Spatial information includes, for example, sound source object information and position information indicating the position of the listener.
  • Sound source object information is information about a sound source object that generates sound based on an audio signal.
  • sound source object information is information about an object (sound source object) that reproduces an audio signal, and is information about a virtual sound source object that is placed in a virtual sound space.
  • the virtual sound space may correspond to a real space in which an object that generates sound is placed, and the sound source object in the virtual sound space may correspond to an object that generates sound in the real space.
  • the sound source object information may indicate the position of the sound source object placed in the sound space, the orientation of the sound source object, the directionality of the sound emitted by the sound source object, whether the sound source object belongs to a living thing, and whether the sound source object is a moving object.
  • the audio signal is associated with one or more sound source objects indicated by the sound source object information.
  • the bitstream has a data structure that consists of, for example, metadata (control information) and an audio signal.
  • the audio signal and metadata may be contained in a single bitstream or may be contained separately in multiple bitstreams. Also, the audio signal and metadata may be contained in a single file or may be contained separately in multiple files.
  • a bitstream may exist for each sound source, or for each playback time. Even if a bitstream exists for each playback time, multiple bitstreams may be processed in parallel at the same time.
  • Metadata may be added to each bitstream, or may be added to multiple bitstreams collectively as information for controlling multiple bitstreams. In this case, multiple bitstreams may share metadata. Metadata may also be added for each playback time.
  • one or more of the bitstreams or one or more of the files may contain information indicating the associated bitstreams or associated files.
  • each of all of the bitstreams or each of all of the files may contain information indicating the associated bitstreams or associated files.
  • the related bitstreams or related files are, for example, bitstreams or files that may be used simultaneously during audio processing. Also, a bitstream or file that collectively describes information indicating related bitstreams or related files may be included.
  • the information indicating the related bitstream or related file may be, for example, an identifier indicating the related bitstream or related file.
  • the information indicating the related bitstream or related file may be, for example, a file name indicating the related bitstream or related file, a URL (Uniform Resource Locator), or a URI (Uniform Resource Identifier), etc.
  • the acquisition unit identifies and acquires the related bitstream or related file based on the information indicating the related bitstream or related file.
  • a bitstream or file may contain information indicating the related bitstream or related file, and another bitstream or another file may contain information indicating the related bitstream or related file.
  • the file containing information indicating the associated bitstream or associated file may be a control file such as a manifest file used for content distribution.
  • All or some of the metadata may be obtained from a source other than the bitstream of the audio signal.
  • the metadata for controlling the sound or the metadata for controlling the video may be obtained from a source other than the bitstream, or both may be obtained from a source other than the bitstream.
  • Metadata for controlling the video may be included in the bitstream acquired by the stereophonic sound reproduction system 1000.
  • the stereophonic sound reproduction system 1000 may output the metadata for controlling the video to a display device that displays the image, or a stereophonic video reproduction device that reproduces the stereophonic video.
  • the metadata may be information used to describe a scene represented in sound space, the term scene being used to refer to the collection of all elements representing 3D video and audio events in sound space that are modeled by the stereophonic reproduction system 1000 using the metadata.
  • the metadata may include not only information for controlling audio processing, but also information for controlling video processing.
  • the metadata may include only one of information for controlling audio processing and information for controlling video processing, or may include both.
  • the stereophonic sound reproduction system 1000 performs acoustic processing on the audio signal using metadata included in the bitstream and interactive listener position information that is additionally acquired, thereby generating virtual acoustic effects.
  • as acoustic effects, early reflection processing, obstacle processing, diffraction processing, blocking processing, and reverberation processing may be performed, and other acoustic processing may be performed using metadata.
  • acoustic effects such as distance attenuation effect, localization, or Doppler effect may be added.
  • information for switching all or some of the sound effects on and off, or priority information for multiple sound effect processes may be added to the metadata.
  • the metadata includes information about a sound space including sound source objects and obstacle objects, and information about a localization position for localizing a sound image at a specific position within the sound space (i.e., allowing a listener to perceive a sound coming from a specific direction).
  • an obstacle object is an object that may affect the sound perceived by the listener, for example by blocking or reflecting the sound emitted by the sound source object before it reaches the listener.
  • Obstacle objects may include stationary objects as well as moving objects such as animals or machines. Animals may also be people, etc.
  • the other sound source objects can be obstacle objects for any of the sound source objects.
  • both non-sound-producing objects, which are objects that do not emit sound, such as building materials or inanimate objects, and sound source objects that emit sound can be obstacle objects.
  • the metadata includes information that represents all or part of the shape of the sound space, the shape and position of obstacle objects in the sound space, the shape and position of sound source objects in the sound space, and the position and orientation of the listener in the sound space.
  • the sound space may be either a closed space or an open space.
  • the metadata may also include information that indicates the reflectance of obstacle objects that may reflect sound in the sound space. For example, the floor, walls, or ceiling that form the boundaries of the sound space may also constitute obstacle objects.
  • Reflectance is the ratio of the energy of reflected sound to incident sound, and may be set for each frequency band of sound. Of course, reflectance may be set uniformly regardless of the frequency band of sound. When the sound space is an open space, parameters such as attenuation rate, diffracted sound, and early reflected sound that are set uniformly may be used.
  • the metadata may include information other than reflectance as a parameter related to an obstacle object or sound source object.
  • the metadata may include information related to the material of the object as a parameter related to both sound source objects and non-sound-producing objects.
  • the metadata may include information such as diffusion rate, transmittance, and sound absorption rate.
  • Information about a sound source object may include information indicating the volume, radiation characteristics (directivity), playback conditions, the number and type of sound sources in an object, and the sound source area in the object.
  • the playback conditions may, for example, determine whether the sound is a sound that continues to play continuously or a sound that triggers an event.
  • the sound source area in the object may be determined by the relative relationship between the position of the listener and the position of the object, or may be determined using the object as a reference.
  • When the sound source area is determined based on the relative relationship between the listener's position and the object's position, the listener can perceive sound A coming from the right side of the object and sound B coming from the left side of the object.
  • When the sound source area is determined using the object as a reference, which area of the object emits which sound can be fixed relative to the object. For example, a listener viewing the object from the front can perceive a high-pitched sound from the right side of the object and a low-pitched sound from the left side, whereas a listener viewing the object from the back can perceive a low-pitched sound from the right side and a high-pitched sound from the left side.
  • Spatial metadata may include time to early reflections, reverberation time, and the ratio of direct sound to diffuse sound. If the ratio of direct sound to diffuse sound is zero, it is possible for the listener to perceive only direct sound.
  • a process executed by a specific component may be executed by another component instead of the specific component.
  • the order of multiple processes may be changed, and multiple processes may be executed in parallel.
  • ordinal numbers such as first and second used in the description may be changed, removed, or newly added as appropriate. These ordinal numbers do not necessarily correspond to a meaningful order and may be used to identify elements.
  • being equal to or greater than the threshold value and being greater than the threshold value may be interpreted as interchangeable.
  • being equal to or less than the threshold value and being smaller than the threshold value may be interpreted as interchangeable.
  • time and hour may be interpreted as interchangeable.
  • In the process of selecting one or more processing target sounds from a plurality of sounds, if no sound satisfies the conditions, none of the sounds need be selected as a processing target sound.
  • the process of selecting one or more processing target sounds from a plurality of sounds may include cases in which no processing target sound is selected.
  • The expression "at least one of a first element, a second element, and a third element" may correspond to the first element, the second element, the third element, or any combination thereof.
  • the aspects understood based on this disclosure are described as being implemented as an audio processing device, an encoding device, or a decoding device.
  • the aspects understood based on this disclosure are not limited to these, and may be implemented as software for executing an audio processing method, an encoding method, or a decoding method.
  • a program for executing the above-mentioned acoustic processing method, encoding method, or decoding method may be stored in the ROM in advance.
  • the CPU may then operate according to the program.
  • a program for executing the above-mentioned acoustic processing method, encoding method, or decoding method may be stored in a computer-readable recording medium.
  • The computer may then load the program stored on the recording medium into the computer's RAM and operate according to the program.
  • the above components may be realized as an LSI, which is an integrated circuit typically having input and output terminals. These may be individually formed into single chips, or may be formed into a single chip that includes all or some of the components of the embodiments. Depending on the degree of integration, the LSI may be expressed as an IC, a system LSI, a super LSI, or an ultra LSI.
  • Circuit integration is not limited to LSI; a dedicated circuit or a general-purpose processor may be used.
  • a programmable FPGA or a reconfigurable processor that allows the connections or settings of the circuit cells inside the LSI to be reconfigured may be used.
  • If an integrated circuit technology that can replace LSI emerges through advances in semiconductor technology or another derived technology, that technology may naturally be used to integrate the components. The application of biotechnology or the like is also a possibility.
  • the FPGA or CPU, etc. may download all or part of the software for realizing the acoustic processing method, encoding method, or decoding method described in this disclosure via wireless or wired communication. Furthermore, all or part of the software for updates may be downloaded via wireless or wired communication. Then, the FPGA or CPU, etc. may store the downloaded software in memory and operate based on the stored software to execute the digital signal processing described in this disclosure.
  • the device equipped with an FPGA or a CPU, etc. may be connected to the signal processing device wirelessly or via a wire, or may be connected to the signal processing server via a network.
  • This device and the signal processing device or the signal processing server may then carry out the acoustic processing method, encoding method, or decoding method described in this disclosure.
  • the sound processing device, encoding device, or decoding device in this disclosure may include an FPGA or a CPU, etc.
  • the sound processing device, encoding device, or decoding device may include an interface for obtaining software for operating the FPGA or CPU, etc. from the outside, and a memory for storing the obtained software. Then, the FPGA or CPU, etc. may execute the signal processing described in this disclosure by operating based on the stored software.
  • a server may provide software related to the acoustic processing, encoding processing, or decoding processing of the present disclosure. Then, a terminal or device may operate as the acoustic processing device, encoding device, or decoding device described in the present disclosure by installing the software. Note that the terminal or device may be connected to the server via a network and the software may be installed.
  • a device other than the terminal or device may connect to a server via a network to obtain data for installing the software, and the other device may provide the data for installing the software to the terminal or device, thereby installing the software in the terminal or device.
  • the software may be VR software or AR software for causing a terminal or device to execute the acoustic processing method described in the embodiment.
  • each component may be configured with dedicated hardware, or may be realized by executing a software program suitable for each component.
  • Each component may be realized by a program execution unit such as a CPU or processor reading and executing a software program recorded on a recording medium such as a hard disk or semiconductor memory.
  • A sound processing device that includes a circuit and a memory, in which the circuit uses the memory to acquire sound space information including information on a sound source in a sound space, information on objects in the sound space, and information on the position of a listener in the sound space, and uses the sound space information to calculate an evaluation value of a reflected sound that occurs in response to a sound generated from the sound source.
  • the circuit calculates a total computation load of the one or more reflected sounds, and if the total computation load exceeds a predetermined upper limit, calculates the evaluation value of each of the one or more reflected sounds.
  • The circuit calculates the evaluation value of each of the multiple reflected sounds generated as the reflected sound in the sound space, and adds the computation load of each reflected sound to a total computation load in descending order of the evaluation value. Each time the computation load of a reflected sound is added, the circuit compares the total computation load with a predetermined upper limit. If the total does not exceed the upper limit, the circuit selects that reflected sound; if the total exceeds the upper limit, the circuit does not select the one or more reflected sounds remaining after that reflected sound among the multiple reflected sounds.
  • a sound processing device according to any one of techniques 10 to 16, in which the circuit increases the index value indicating the relationship between the direct sound and the reflected sound the more the amplitude value of the reflected sound exceeds a temporal masking threshold, which is the threshold of a temporal masking phenomenon in which the reflected sound is masked by the direct sound when the amplitude value of the reflected sound is equal to or less than a threshold.
  • a sound processing device according to any one of techniques 10 to 17, in which the circuit reduces an index value for the object related to a selected reflected sound among a plurality of reflected sounds generated as the reflected sound in the sound space, calculates the evaluation value for a reflected sound that has not yet been selected, and repeatedly performs a process of selecting reflected sounds in descending order of the evaluation value, and terminates the repeatedly performed process when the total computation load of one or more reflected sounds selected from the plurality of reflected sounds exceeds a predetermined upper limit.
  • A sound processing device that includes a circuit and a memory, in which the circuit uses the memory to acquire volume information of a sound output from a sound source, corrects an evaluation value of a reflected sound corresponding to the sound using the volume information, and controls whether or not to select the reflected sound based on the corrected evaluation value.
  • An acoustic processing method including a step of acquiring sound space information including information on a sound source in a sound space, information on objects in the sound space, and information on the position of a listener in the sound space, and a step of calculating an evaluation value of a reflected sound generated in response to a sound generated from the sound source using the sound space information.
  • the present disclosure includes aspects that can be applied, for example, to an audio processing device, an encoding device, a decoding device, or a terminal or device equipped with any of these devices.
  • Audio signal processing device; 1002 Audio presentation device; 1100, 1120, 1500 Encoding device; 1101, 1113 Input data; 1102 Encoder; 1103 Encoded data; 1104, 1114, 1404, 1503 Memory; 1110, 1130 Decoding device; 1111 Audio signal; 1112, 1200, 1210 Decoder; 1121 Transmitting unit; 1122 Transmitted signal; 1131 Receiving unit; 1132 Received signal; 1201, 1211 Spatial information management unit; 1202 Audio data decoder; 1203, 1213, 1300 Rendering unit; 1301 Analysis unit; 1302, 1314 Selection unit; 1303 Synthesis unit; 1311 Reverberation processing unit; 1312 Early reflection processing unit; 1313 Distance attenuation processing unit; 1315 Generation unit; 1316 Binaural processing unit; 1401 Speaker; 1402, 1501 Processor; 1403, 1502 Communication IF; 1405 Sensor
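
The selection of reflected sounds under a computation-load upper limit, described above, can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the disclosure: the names `ReflectedSound` and `select_reflections`, the load units, and the sample values are all hypothetical. It also assumes one reading of the text, namely that the reflected sound whose load pushes the total over the upper limit is itself not selected, and that selection stops at that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReflectedSound:
    """Hypothetical record for one candidate reflected sound."""
    name: str          # illustrative label, e.g. the reflecting object
    evaluation: float  # evaluation value (higher = more important)
    load: int          # computation load, in arbitrary units


def select_reflections(candidates, upper_limit):
    """Select reflected sounds in descending order of evaluation value.

    The load of each candidate is added to a running total; after each
    addition the total is compared with upper_limit. A candidate is
    selected only while the total does not exceed the limit. Once the
    limit is exceeded, that candidate and all remaining ones are skipped
    (assumed interpretation).
    """
    selected = []
    total_load = 0
    ordered = sorted(candidates, key=lambda s: s.evaluation, reverse=True)
    for sound in ordered:
        total_load += sound.load
        if total_load > upper_limit:
            break  # stop: remaining reflected sounds are not selected
        selected.append(sound)
    return selected


candidates = [
    ReflectedSound("ceiling", 0.4, 2),
    ReflectedSound("wall", 0.9, 3),
    ReflectedSound("pillar", 0.2, 1),
    ReflectedSound("floor", 0.7, 4),
]
chosen = select_reflections(candidates, upper_limit=8)
print([s.name for s in chosen])  # ['wall', 'floor']
```

In this sketch the low-load "pillar" reflection is not selected even though it would fit within the limit, because processing stops at the first reflected sound that causes the total to exceed the upper limit. A variant that keeps scanning for smaller candidates would be a different design choice.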

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
PCT/JP2023/036494 2022-10-19 2023-10-06 Acoustic processing device and acoustic processing method Ceased WO2024084997A1 (ja)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR1020257010711A KR20250091193A (ko) 2022-10-19 2023-10-06 Acoustic processing device and acoustic processing method
EP23879643.7A EP4607965A4 (en) 2022-10-19 2023-10-06 SOUND PROCESSING DEVICE AND SOUND PROCESSING METHOD
JP2024551488A JPWO2024084997A1 2022-10-19 2023-10-06
CN202380071402.8A CN119999236A (zh) 2022-10-19 2023-10-06 Acoustic processing device and acoustic processing method

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263417410P 2022-10-19 2022-10-19
US63/417,410 2022-10-19
JP2023110710 2023-07-05
JP2023-110710 2023-07-05

Publications (1)

Publication Number Publication Date
WO2024084997A1 (ja) 2024-04-25

Family

ID=90737461

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/036494 Ceased WO2024084997A1 (ja) 2022-10-19 2023-10-06 Acoustic processing device and acoustic processing method

Country Status (6)

Country Link
EP (1) EP4607965A4
JP (1) JPWO2024084997A1
KR (1) KR20250091193A
CN (1) CN119999236A
TW (1) TW202501241A
WO (1) WO2024084997A1

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0546193A (ja) * 1991-08-19 1993-02-26 Matsushita Electric Ind Co Ltd Reflected sound extraction device
JP2006047523A (ja) * 2004-08-03 2006-02-16 Sony Corp Information processing device and method, and program
JP6288100B2 (ja) 2013-10-17 2018-03-07 Socionext Inc. Audio encoding device and audio decoding device
JP2019022049A (ja) 2017-07-14 2019-02-07 Yamaha Corporation Signal processing device
WO2021180938A1 (en) 2020-03-13 2021-09-16 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for rendering a sound scene using pipeline stages

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11146905B2 (en) * 2017-09-29 2021-10-12 Apple Inc. 3D audio rendering using volumetric audio rendering and scripted audio level-of-detail

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Psychoacoustic Models for Perceptual Audio Coding - A Tutorial Review", APPL. SCI., vol. 9, 2019, pages 2854
NAKAHARA, MASATAKA ET AL.: "Post-Production 3D Sampling Reverberator Utilization Methods.", PROCEEDINGS OF THE ACOUSTICAL SOCIETY OF JAPAN, JP, 19 February 2019 (2019-02-19) - 7 March 2019 (2019-03-07), JP, pages 1481 - 1484, XP009554076 *
See also references of EP4607965A4

Also Published As

Publication number Publication date
JPWO2024084997A1 2024-04-25
CN119999236A (zh) 2025-05-13
TW202501241A (zh) 2025-01-01
EP4607965A1 (en) 2025-08-27
KR20250091193A (ko) 2025-06-20
EP4607965A4 (en) 2026-01-14

Similar Documents

Publication Publication Date Title
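WO2024084997A1 (ja) Acoustic processing device and acoustic processing method
WO2025075147A1 (ja) Audio signal processing method, computer program, and audio signal processing device
WO2025075136A1 (ja) Audio signal processing method, computer program, and audio signal processing device
WO2024084999A1 (ja) Acoustic processing device and acoustic processing method
WO2025075135A1 (ja) Audio signal processing method, computer program, and audio signal processing device
US20250310717A1 (en) Acoustic processing device and acoustic processing method
US20250247667A1 (en) Acoustic processing method, acoustic processing device, and recording medium
WO2025075108A1 (ja) Acoustic processing device, threshold identification device, and acoustic processing method
WO2025075149A1 (ja) Audio signal processing method, computer program, and audio signal processing device
CN122003880A (zh) Sound signal processing method, computer program, and sound signal processing device
US20250150776A1 (en) Acoustic signal processing method, recording medium, and acoustic signal processing device
WO2025205328A1 (ja) Information processing device, information processing method, and program
WO2025075079A1 (ja) Acoustic processing device, acoustic processing method, and program
WO2024084949A1 (ja) Acoustic signal processing method, computer program, and acoustic signal processing device
WO2025135070A1 (ja) Acoustic information processing method, information processing device, and program
WO2025075102A1 (ja) Acoustic processing device, acoustic processing method, and program
KR20250037456A (ko) Acoustic signal processing method, information generation method, and acoustic signal processing device
WO2026018859A1 (ja) Information processing method, information processing system, and program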

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 23879643; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: 2024551488; Country of ref document: JP
WWE WIPO information: entry into national phase. Ref document number: 202380071402.8; Country of ref document: CN
WWP WIPO information: published in national office. Ref document number: 202380071402.8; Country of ref document: CN
WWE WIPO information: entry into national phase. Ref document number: 202547046897; Country of ref document: IN
WWE WIPO information: entry into national phase. Ref document number: 2023879643; Country of ref document: EP
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2023879643; Country of ref document: EP; Effective date: 20250519
WWP WIPO information: published in national office. Ref document number: 202547046897; Country of ref document: IN
WWP WIPO information: published in national office. Ref document number: 1020257010711; Country of ref document: KR
WWP WIPO information: published in national office. Ref document number: 2023879643; Country of ref document: EP