WO2024011937A1 - Audio processing method, system and electronic device - Google Patents

Audio processing method, system and electronic device

Info

Publication number
WO2024011937A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
audio
signal
rir
target
Prior art date
Application number
PCT/CN2023/081669
Other languages
English (en)
French (fr)
Inventor
寇毅伟 (Kou Yiwei)
秦鹏 (Qin Peng)
林远鹏 (Lin Yuanpeng)
范泛 (Fan Fan)
周雷 (Zhou Lei)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2024011937A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303 Tracking of listener position or orientation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • the embodiments of the present application relate to the field of data processing, and in particular, to an audio processing method, system and electronic device.
  • Spatial audio technology can render sound sources of different formats into binaural signals, allowing users wearing headphones to perceive the position, distance and spatial context of the sound image in the audio, and can provide users with an immersive listening experience when using headphones.
  • this application provides an audio processing method, system and electronic device.
  • the rendering effect of the binaural signal can be adjusted according to the user's settings for the rendering effect.
  • embodiments of the present application provide an audio processing method.
  • the method includes: first, in response to a user's playback operation, performing spatial audio processing on an initial audio segment in the source audio signal to obtain an initial binaural signal and playing it.
  • then, while the initial binaural signal is playing, receiving the user's settings for rendering effect options, which include at least one of the following: a sound image position option, a distance sense option, or a spatial sense option; and, according to those settings, continuing to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal to obtain the target binaural signal.
  • the initial audio segment in the source audio signal can be subjected to spatial audio processing according to the system's settings for the rendering effect options and/or the user's historical settings for those options, to obtain the initial binaural signal, which is then played.
  • while the initial binaural signal is playing (that is, while the user is listening to it), the user can set the rendering effect options according to the user's own requirements; spatial audio processing then continues, according to this setting, on the audio segments after the initial audio segment in the source audio signal, to obtain the target binaural signal.
  • the target binaural signal can be played.
  • while the target binaural signal is playing (that is, while the user is listening to it), the user can set the rendering effect options again; according to these new settings, spatial audio processing continues on the audio segments after the last segment that was processed, to obtain a new target binaural signal; and so on.
  • in this way, the rendering effect of the binaural signal corresponding to the source audio signal can be continuously adjusted according to the user's settings, that is, "tuning while listening", thereby improving the user experience.
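The "tune while listening" flow above can be pictured as a per-segment loop in which the renderer re-reads the user's current settings before processing each segment. The sketch below is illustrative only: the segment size, parameter names and the placeholder renderer are assumptions, not details from the application.

```python
import numpy as np

def render_segment(segment, params):
    # Placeholder spatial renderer: returns a (left, right) pair.
    # A real implementation would apply direct/early/late RIR convolution.
    gain = params.get("distance_gain", 1.0)
    return gain * segment, gain * segment

def play_source(source, segment_len, get_user_settings):
    """Render the source segment by segment, re-reading the user's
    rendering-effect settings before each segment ("tune while listening")."""
    params = {"distance_gain": 1.0}            # system/default settings
    rendered = []
    for start in range(0, len(source), segment_len):
        params.update(get_user_settings())     # pick up any new settings
        seg = source[start:start + segment_len]
        rendered.append(render_segment(seg, params))
    return rendered

# Example: the user halves the distance gain after the first segment plays.
settings_stream = iter([{}, {"distance_gain": 0.5}, {}])
source = np.ones(30)
out = play_source(source, 10, lambda: next(settings_stream))
```

Later segments keep the most recent setting until the user changes it again, which is exactly the "and so on" behaviour described above.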
  • the system can set rendering effect options based on the user's personal information. For example, the system can analyze the user's head type, preferred rendering effects, etc., based on the user's personal information, to set rendering effect options.
  • the system can set default settings for rendering effect options.
  • the source audio signal is a media file.
  • the source audio signal may be an audio signal of a song, an audio signal of an audiobook, an audio signal contained in a video, etc. This application does not limit this.
  • both the target binaural signal and the initial binaural signal may include one signal for left earphone playback and one signal for right earphone playback.
  • this application may also include other rendering effect options, and this application will not limit this.
  • the sound image position option is used to adjust the sound image position in the target binaural signal.
  • the sound image position may refer to the direction of the sound subjectively felt by the user relative to the center of the head.
  • the distance sense option is used to adjust the distance sense of the sound image in the target binaural signal.
  • the sense of distance may refer to the distance of the sound relative to the center of the head that the user subjectively feels.
  • the spatial sense option is used to adjust the spatial sense of the target binaural signal.
  • the sense of space may refer to the size of the acoustic environment space subjectively felt by the user.
  • in a possible implementation, when the rendering effect option includes the sound image position option, continuing spatial audio processing on the audio segments after the initial audio segment in the source audio signal to obtain the target binaural signal includes: adjusting the sound image position parameters according to the setting of the sound image position option; performing direct sound rendering on those audio segments according to the sound image position parameters to obtain the first binaural signal; and determining the target binaural signal based on the first binaural signal.
  • in this way, the sound image position of the target binaural signal can be adjusted according to the user's personalized setting of the sound image position.
  • when the rendering effect option includes the distance sense option, continuing spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings to obtain the target binaural signal includes: adjusting the distance sense parameters according to the setting of the distance sense option; performing early reflection sound rendering on those audio segments according to the distance sense parameters to obtain the second binaural signal; and determining the target binaural signal based on the second binaural signal. In this way, the sense of distance of the sound image in the target binaural signal can be adjusted according to the user's personalized setting of the sense of distance.
  • when the rendering effect option includes the spatial sense option, continuing spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings to obtain the target binaural signal includes: adjusting the spatial sense parameters according to the setting of the spatial sense option; performing late reflection sound rendering on those audio segments according to the spatial sense parameters to obtain the third binaural signal; and determining the target binaural signal based on the third binaural signal. In this way, the spatial sense of the target binaural signal can be adjusted according to the user's personalized setting of the spatial sense.
  • in a possible implementation, continuing spatial audio processing on the audio segments after the initial audio segment in the source audio signal to obtain the target binaural signal also includes: adjusting the sound image position parameters according to the setting of the sound image position option, and performing direct sound rendering on those audio segments according to the sound image position parameters to obtain the first binaural signal; determining the target binaural signal based on the second binaural signal then includes: mixing the first binaural signal, the second binaural signal and the third binaural signal to obtain the target binaural signal. In this way, the sound image position, sound image distance and spatial sense in the target binaural signal can be adjusted according to the user's personalized settings for position, distance and spatial sense.
  • by performing direct sound rendering, early reflected sound rendering and late reflected sound rendering on the audio segments after the initial audio segment in the source audio signal, this application can restore the sound image position, sense of distance and sense of space with high precision, achieving a more realistic and immersive binaural rendering effect.
  • the direct sound part of the target binaural signal refers to the part of the source audio signal that reaches the human ear through a direct path (that is, it propagates to the ear in a straight line without any reflection); the early reflected sound part refers to the first portion of the source audio signal that reaches the ear through reflection paths; and the late reflected sound part refers to the later portion that reaches the ear through reflection paths.
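The three-part decomposition above is commonly realized by splitting a measured room impulse response into time windows around the first arrival. The sketch below is a generic illustration of that idea; the 80 ms boundary and the peak-picking heuristic are assumptions, not values from the application.

```python
import numpy as np

def split_rir(rir, fs, early_end_ms=80.0):
    """Split an RIR into direct, early-reflection and late-reflection parts.

    The direct part is taken up to the main peak (first/strongest arrival);
    reflections up to `early_end_ms` after it count as early, the rest as late.
    """
    direct_idx = int(np.argmax(np.abs(rir)))
    early_end = direct_idx + int(early_end_ms * fs / 1000)
    direct = np.zeros_like(rir)
    direct[:direct_idx + 1] = rir[:direct_idx + 1]
    early = np.zeros_like(rir)
    early[direct_idx + 1:early_end] = rir[direct_idx + 1:early_end]
    late = np.zeros_like(rir)
    late[early_end:] = rir[early_end:]
    return direct, early, late

fs = 48000
rir = np.zeros(fs // 2)
rir[100] = 1.0       # direct path arrival
rir[2000] = 0.3      # an early reflection
rir[10000] = 0.1     # late reverberation tail
d, e, l = split_rir(rir, fs)
```

The three windows are disjoint, so summing them reconstructs the original RIR exactly.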
  • in a possible implementation, continuing spatial audio processing on the audio segments after the initial audio segment in the source audio signal to obtain the target binaural signal also includes: adjusting the spatial sense parameters according to the setting of the spatial sense option, and performing late reflection sound rendering on those audio segments according to the spatial sense parameters to obtain the third binaural signal; determining the target binaural signal based on the second binaural signal then includes: mixing the second binaural signal and the third binaural signal to obtain the fourth binaural signal; adjusting the sound image position parameters according to the setting of the sound image position option; performing direct sound rendering on the fourth binaural signal according to the sound image position parameters to obtain the fifth binaural signal; and determining the target binaural signal based on the fifth binaural signal.
  • in this way, the sound image position, sound image distance and spatial sense in the target binaural signal can be adjusted according to the user's personalized settings for the corresponding options.
  • by performing direct sound rendering, early reflected sound rendering and late reflected sound rendering on the audio segments after the initial audio segment in the source audio signal, this application can restore the sound image position, distance sense and space sense with high precision, achieving a more realistic and immersive binaural rendering effect.
  • in a possible implementation, performing direct sound rendering on the audio segments after the initial audio segment in the source audio signal according to the sound image position parameters to obtain the first binaural signal includes: selecting a candidate direct sound RIR from a preset direct sound RIR (Room Impulse Response) library, and determining a sound image position correction factor according to the sound image position parameters; correcting the candidate direct sound RIR according to the sound image position correction factor to obtain the target direct sound RIR; and performing direct sound rendering on those audio segments according to the target direct sound RIR to obtain the first binaural signal.
  • the direct sound RIR library includes multiple first sets, one first set corresponding to one head type, and each first set includes preset direct sound RIRs at multiple positions.
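As a generic illustration of the direct-sound step, one can pick the stored HRIR pair closest to the requested sound image position, apply a broadband correction factor, and convolve the audio segment with it. The library layout, nearest-neighbour selection and gain-style correction below are assumptions; a real renderer would typically interpolate between measured HRIRs.

```python
import numpy as np

def render_direct(segment, azimuth_deg, hrir_library, correction=1.0):
    """hrir_library: {azimuth_deg: (hrir_left, hrir_right)}."""
    # Candidate HRIR: nearest stored azimuth to the requested position.
    nearest = min(hrir_library, key=lambda a: abs(a - azimuth_deg))
    h_l, h_r = hrir_library[nearest]
    # Target HRIR: candidate corrected toward the exact position
    # (reduced here to a broadband gain).
    h_l, h_r = correction * h_l, correction * h_r
    # Direct sound rendering: convolve the segment with each ear's HRIR.
    return np.convolve(segment, h_l), np.convolve(segment, h_r)

# Toy library: two azimuths with 2-tap HRIRs.
lib = {0.0: (np.array([1.0, 0.0]), np.array([1.0, 0.0])),
       90.0: (np.array([0.2, 0.0]), np.array([1.0, 0.3]))}
left, right = render_direct(np.array([1.0, 1.0]), azimuth_deg=80.0,
                            hrir_library=lib)
```

Requesting 80° selects the 90° entry, whose attenuated left-ear tap models head shadowing of a source on the right.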
  • in a possible implementation, before receiving the user's settings for the rendering effect options, the method further includes: obtaining a selection of a target scene option, and displaying the rendering effect options corresponding to the target scene option.
  • a target scene option corresponds to a spatial scene. In this way, the spatial scene for binaural signal playback can be set, which increases the diversity of spatial audio effect settings.
  • obtaining the selection of the target scene option may include receiving the user's selection operation of the target scene option.
  • the user can be provided with a choice of spatial scenes for binaural signal playback, increasing the diversity of spatial audio effect settings.
  • different target scene options correspond to different rendering effect options. Users can set different rendering effects for different spatial scenes to achieve refined adjustment of spatial audio effects.
  • obtaining the selection of the target scene option may be the selection of the target scene option by the system of the electronic device.
  • for example, the system can analyze the user's preferred spatial scene based on the user's personal information and select the target scene accordingly.
  • the target scene options may include any of the following: a cinema option, a recording studio option, a concert hall option, a KTV (karaoke) option, and so on.
  • the space scene corresponding to the cinema option is a cinema
  • the space scene corresponding to the recording studio option is a recording studio
  • the space scene corresponding to the concert hall option is a concert hall
  • the space scene corresponding to the KTV option is KTV.
  • the target scene option can also be another option; this is not limited by this application.
  • in a possible implementation, performing early reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the distance sense parameters to obtain the second binaural signal includes: selecting a candidate early reflection sound RIR from a preset early reflection sound RIR library, and determining a distance sense correction factor according to the distance sense parameters; correcting the candidate early reflection sound RIR according to the distance sense correction factor to obtain the target early reflection sound RIR; and performing early reflection sound rendering on those audio segments based on the target early reflection sound RIR to obtain the second binaural signal.
  • in a possible implementation, the early reflection sound RIR library includes a plurality of second sets, one second set corresponding to one spatial scene, and each second set includes preset early reflection sound RIRs at multiple positions. Selecting the candidate early reflection sound RIR from the preset early reflection sound RIR library includes: selecting a second target set from the multiple second sets according to the spatial scene parameters corresponding to the target scene option; and selecting the candidate early reflection sound RIR from the second target set according to the user's head position information, the position information of the source audio signal, and the position information of the preset early reflection sound RIRs in the second target set. In this way, head motion tracking rendering can be achieved.
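A minimal sketch of the candidate selection just described: pick the set for the chosen spatial scene, compute the source direction relative to the tracked head orientation, and take the stored RIR measured closest to that direction. The data layout and angular-distance metric are assumptions for illustration.

```python
import numpy as np

def select_early_rir(scene, source_pos, head_yaw_deg, rir_sets):
    """rir_sets: {scene: [(measured_azimuth_deg, rir_array), ...]}."""
    candidates = rir_sets[scene]                 # the second target set
    # Head-relative azimuth of the source (head-tracking input).
    relative_az = (source_pos - head_yaw_deg) % 360.0

    def ang_dist(a):
        # Smallest angular distance on the circle.
        d = abs(a - relative_az) % 360.0
        return min(d, 360.0 - d)

    # Candidate RIR: measurement position closest to the relative azimuth.
    az, rir = min(candidates, key=lambda item: ang_dist(item[0]))
    return az, rir

sets = {"concert_hall": [(0.0, np.array([1.0])), (90.0, np.array([0.5]))]}
az, rir = select_early_rir("concert_hall", source_pos=30.0,
                           head_yaw_deg=-50.0, rir_sets=sets)
```

When the head turns, `relative_az` changes and a different stored RIR is selected, which is what makes head motion tracking rendering possible.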
  • in a possible implementation, performing late reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the spatial sense parameters to obtain the third binaural signal includes: selecting a candidate late reflection sound RIR from a preset late reflection sound RIR library, and determining a spatial sense correction factor according to the spatial sense parameters; correcting the candidate late reflection sound RIR according to the spatial sense correction factor to obtain the target late reflection sound RIR; and performing late reflection sound rendering on those audio segments based on the target late reflection sound RIR to obtain the third binaural signal.
  • in a possible implementation, the late reflection sound RIR library includes multiple third sets, one third set corresponding to one spatial scene, and each third set includes preset late reflection sound RIRs at multiple positions. The candidate late reflection sound RIR is selected from a third target set according to the user's head position information, the position information of the source audio signal, and the position information of the preset late reflection sound RIRs in the third target set. In this way, head motion tracking rendering can be achieved.
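The correction step for the late-reflection RIR might look like the following sketch, where the spatial sense setting stretches or shrinks the decay of the candidate RIR. The exponential weighting is an illustrative assumption, not the application's correction formula.

```python
import numpy as np

def correct_late_rir(rir, fs, spatial_sense):
    """Re-weight the tail of a candidate late-reflection RIR.

    spatial_sense > 1 lengthens the decay (larger perceived space),
    spatial_sense < 1 shortens it (smaller perceived space).
    """
    t = np.arange(len(rir)) / fs
    decay = np.exp(-3.0 * t / spatial_sense)   # slower decay for larger rooms
    return rir * decay

fs = 1000
candidate = np.ones(1000)                      # flat toy "late" RIR, 1 s long
small_room = correct_late_rir(candidate, fs, spatial_sense=0.5)
large_room = correct_late_rir(candidate, fs, spatial_sense=2.0)
```

The corrected ("target") RIR for the larger spatial sense keeps more tail energy, which listeners perceive as a bigger acoustic space.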
  • in a possible implementation, determining the target binaural signal based on the first binaural signal, the second binaural signal and the third binaural signal includes: determining, according to a preset relationship, a sound effect parameter group that matches the spatial scene parameters, where the preset relationship includes relationships between multiple spatial scenes and multiple sound effect parameter groups, and the matching sound effect parameter group includes direct sound effect parameters, early reflection sound effect parameters and late reflection sound effect parameters; performing sound effect processing on the first binaural signal according to the direct sound effect parameters, on the second binaural signal according to the early reflection sound effect parameters, and on the third binaural signal according to the late reflection sound effect parameters; and determining the target binaural signal based on the first, second and third binaural signals after sound effect processing. In this way, the audio signal can be modified.
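Reducing the per-path sound effect parameters to simple gains, the final mixing step described above can be sketched as follows; the scene table and gain values are invented for illustration, and real sound effect processing would typically include equalization and dynamics as well.

```python
import numpy as np

# Preset relationship: spatial scene -> (direct, early, late) gains.
SCENE_EFFECTS = {
    "cinema": (1.0, 0.6, 0.8),
    "studio": (1.0, 0.3, 0.1),
}

def mix_binaural(scene, direct, early, late):
    """Apply the scene's per-path parameters and sum the three
    binaural signals into the target binaural signal."""
    g_d, g_e, g_l = SCENE_EFFECTS[scene]
    return g_d * direct + g_e * early + g_l * late

sig = np.ones((2, 4))                 # (left/right channel, samples)
target = mix_binaural("studio", sig, sig, sig)
```

A "studio" scene here suppresses the reflected paths relative to a "cinema" scene, matching the intuition that a studio sounds drier.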
  • the source audio signal includes at least one of the following formats: multi-channel format, multi-object format and Ambisonics format.
  • the Ambisonics format refers to a spherical harmonic surround sound field format.
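For background on the Ambisonics format mentioned above, the following sketch encodes a mono source at a given direction into first-order B-format (W, X, Y, Z) using an SN3D-style convention. This is general Ambisonics practice shown for illustration, not part of the claimed method.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg=0.0):
    """Encode a mono signal into first-order Ambisonics (W, X, Y, Z)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    w = mono * 1.0                          # omnidirectional component
    x = mono * np.cos(az) * np.cos(el)      # front/back
    y = mono * np.sin(az) * np.cos(el)      # left/right
    z = mono * np.sin(el)                   # up/down
    return np.stack([w, x, y, z])

foa = encode_foa(np.ones(8), azimuth_deg=90.0)   # source hard left
```

Higher-order Ambisonics (HOA) extends the same spherical-harmonic idea to more components for sharper spatial resolution.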
  • the target direct sound RIR is an HRIR (Head Related Impulse Response).
  • the target early reflected sound RIR is an HOA (High-Order Ambisonics) RIR.
  • the target late reflection sound RIR is an HOA RIR.
  • this application uses a spherical microphone to collect the RIR in all directions in one acquisition, which can reduce the workload of producing the late reflection sound RIRs.
  • in a possible implementation, the audio processing method is applied to a headset, and the head position information is determined based on the user's head movement information collected by the headset; or the audio processing method is applied to a mobile terminal, and the head position information is obtained from a headset connected to the mobile terminal; or the audio processing method is applied to a VR (Virtual Reality) device, and the head position information is determined based on the user's head movement information collected by the VR device.
  • the implementation and effect of performing spatial audio processing on the initial audio segment in the source audio signal to obtain the initial binaural signal, and of continuing spatial audio processing on the audio segments after the initial audio segment to obtain the target binaural signal, can refer to any implementation described in the first aspect and will not be described again here.
  • embodiments of the present application provide an audio processing method.
  • the method includes: acquiring a source audio signal to be processed; and performing direct sound rendering, early reflected sound rendering and late reflected sound rendering on the source audio signal to obtain a binaural signal. The direct sound part of the binaural signal affects the user's perception of the sound image position, the early reflected sound part affects the user's perception of the sound image distance, and the late reflected sound part affects the user's perception of the acoustic environment space.
  • therefore, this application can restore the sound image position, distance sense and spatial sense with high precision, thereby achieving a more realistic and immersive binaural rendering effect.
  • in a possible implementation, performing direct sound rendering, early reflected sound rendering and late reflected sound rendering on the source audio signal to obtain the binaural signal includes: performing direct sound rendering on the source audio signal to obtain a first binaural signal; performing early reflection sound rendering on the source audio signal to obtain a second binaural signal; performing late reflection sound rendering on the source audio signal to obtain a third binaural signal; and determining the binaural signal based on the first, second and third binaural signals.
  • in a possible implementation, performing direct sound rendering, early reflected sound rendering and late reflected sound rendering on the source audio signal to obtain the binaural signal includes: performing early reflection sound rendering on the source audio signal to obtain a second binaural signal; performing late reflection sound rendering on the source audio signal to obtain a third binaural signal; mixing the second binaural signal and the third binaural signal to obtain a fourth binaural signal; performing direct sound rendering on the fourth binaural signal to obtain a fifth binaural signal; and determining the binaural signal based on the fifth binaural signal.
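The second rendering order above (early and late reflections first, then direct-sound rendering applied to their mix) can be sketched as below, with every rendering step reduced to a toy convolution on a mono signal; a real renderer would use HOA RIRs for the reflections and HRIRs for the direct step, as the claims describe.

```python
import numpy as np

def render_serial(source, early_rir, late_rir, hrir_l, hrir_r):
    second = np.convolve(source, early_rir)   # early-reflection rendering
    third = np.convolve(source, late_rir)     # late-reflection rendering
    fourth = second + third                   # mix -> fourth signal
    # Direct-sound rendering applied to the mix yields the fifth signal.
    return np.convolve(fourth, hrir_l), np.convolve(fourth, hrir_r)

src = np.array([1.0, 0.0])
left, right = render_serial(src,
                            early_rir=np.array([0.5]),
                            late_rir=np.array([0.25]),
                            hrir_l=np.array([1.0]),
                            hrir_r=np.array([0.8]))
```

Compared with the parallel order of the previous implementation, this serial order localizes the already-mixed reflections as a whole.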
  • in a possible implementation, the room impulse response (RIR) used for direct sound rendering is a head-related impulse response (HRIR); the RIR used for early reflected sound rendering is an HOA RIR; and the RIR used for late reflected sound rendering is also an HOA RIR.
  • this application uses a spherical microphone to collect the RIR in all directions in one acquisition, which can reduce the workload of producing the early/late reflected sound RIRs.
  • the source audio signal includes at least one of the following formats: multi-channel format, multi-object format and Ambisonics format.
  • the source audio signal to be processed in the second aspect and any implementation of the second aspect may refer to the initial audio segment of the source audio signal in the first aspect and any implementation of the first aspect; the binaural signal in the second aspect and any implementation of the second aspect may then refer to the initial binaural signal.
  • alternatively, the source audio signal to be processed in the second aspect and any implementation of the second aspect may refer to the audio segments after the initial audio segment of the source audio signal in the first aspect and any implementation of the first aspect; the binaural signal in the second aspect and any implementation of the second aspect may then refer to the target binaural signal.
  • this application provides an audio processing system, which includes a mobile terminal and a headset connected to the mobile terminal; wherein,
  • the mobile terminal is configured to: in response to the user's playback operation, perform spatial audio processing on the initial audio segment in the source audio signal to obtain the initial binaural signal and play it, where the source audio signal is a media file; receive the user's settings for the rendering effect options, which include at least one of the following: a sound image position option, a distance sense option or a spatial sense option; according to the settings, continue to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal to obtain the target binaural signal; and send the target binaural signal to the headset;
  • and the headset is configured to play the target binaural signal.
  • the headset is also used to collect the user's head movement information, determine the user's head position information based on the head movement information; and send the head position information to the mobile terminal;
  • the mobile terminal is specifically configured to continue spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings and head position information to obtain the target binaural signal.
  • the mobile terminal of the third aspect can be used to perform the audio processing method in the first aspect and any implementation of the first aspect.
  • alternatively, the mobile terminal of the third aspect can be used to perform the audio processing method in the second aspect and any implementation of the second aspect; this application is not limited in this respect.
  • embodiments of the present application provide a mobile terminal for executing the audio processing method in the first aspect and any implementation of the first aspect.
  • the fourth aspect and any implementation manner of the fourth aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
  • the technical effects corresponding to the fourth aspect and any implementation manner of the fourth aspect may be referred to the technical effects corresponding to the above-mentioned first aspect and any implementation manner of the first aspect, and will not be described again here.
  • embodiments of the present application provide a mobile terminal for executing the audio processing method in the second aspect and any implementation of the second aspect.
  • the fifth aspect and any implementation manner of the fifth aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
  • the technical effects corresponding to the fifth aspect and any implementation manner of the fifth aspect please refer to the technical effects corresponding to the above-mentioned second aspect and any implementation manner of the second aspect, which will not be described again here.
  • embodiments of the present application provide a headset for performing the audio processing method in the first aspect and any implementation of the first aspect.
  • the sixth aspect and any implementation manner of the sixth aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
  • the technical effects corresponding to the sixth aspect and any implementation manner of the sixth aspect may be referred to the technical effects corresponding to the above-mentioned first aspect and any implementation manner of the first aspect, and will not be described again here.
  • embodiments of the present application provide a headset for performing the audio processing method in the second aspect and any implementation of the second aspect.
  • the seventh aspect and any implementation manner of the seventh aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
  • the technical effects corresponding to the seventh aspect and any implementation manner of the seventh aspect please refer to the technical effects corresponding to the above-mentioned second aspect and any implementation manner of the second aspect, which will not be described again here.
  • embodiments of the present application provide an electronic device, including a memory and a processor, the memory being coupled to the processor; the memory stores program instructions, and when the program instructions are executed by the processor, the electronic device executes the audio processing method in the first aspect or any possible implementation of the first aspect.
  • the eighth aspect and any implementation manner of the eighth aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
  • the technical effects corresponding to the eighth aspect and any implementation manner of the eighth aspect may be referred to the technical effects corresponding to the above-mentioned first aspect and any implementation manner of the first aspect, and will not be described again here.
  • embodiments of the present application provide an electronic device, including a memory and a processor, the memory being coupled to the processor; the memory stores program instructions, and when the program instructions are executed by the processor, the electronic device executes the audio processing method in the second aspect or any possible implementation of the second aspect.
  • the ninth aspect and any implementation manner of the ninth aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
  • the technical effects corresponding to the ninth aspect and any implementation manner of the ninth aspect may be referred to the technical effects corresponding to the above-mentioned second aspect and any implementation manner of the second aspect, and will not be described again here.
  • embodiments of the present application provide a chip, including one or more interface circuits and one or more processors; the interface circuit is used to receive signals from the memory of the electronic device and send the signals to the processor, the signals including computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is caused to execute the audio processing method in the first aspect or any possible implementation of the first aspect.
  • the tenth aspect and any implementation manner of the tenth aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
  • the technical effects corresponding to the tenth aspect and any implementation manner of the tenth aspect may be referred to the technical effects corresponding to the above-mentioned first aspect and any implementation manner of the first aspect, and will not be described again here.
  • embodiments of the present application provide a chip, including one or more interface circuits and one or more processors; the interface circuit is used to receive signals from the memory of the electronic device and send the signals to the processor, the signals including computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is caused to execute the audio processing method in the second aspect or any possible implementation of the second aspect.
  • the eleventh aspect and any implementation manner of the eleventh aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
  • the technical effects corresponding to the eleventh aspect and any implementation manner of the eleventh aspect may be referred to the technical effects corresponding to the above-mentioned second aspect and any implementation manner of the second aspect, and will not be described again here.
  • embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • When the computer program is run on a computer or processor, it causes the computer or processor to execute the audio processing method in the first aspect or any possible implementation of the first aspect.
  • the twelfth aspect and any implementation manner of the twelfth aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
  • the technical effects corresponding to the twelfth aspect and any implementation manner of the twelfth aspect please refer to the technical effects corresponding to the above-mentioned first aspect and any implementation manner of the first aspect, which will not be described again here.
  • embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • When the computer program is run on a computer or processor, it causes the computer or processor to execute the audio processing method in the second aspect or any possible implementation of the second aspect.
  • the thirteenth aspect and any implementation manner of the thirteenth aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
  • the technical effects corresponding to the thirteenth aspect and any implementation manner of the thirteenth aspect please refer to the technical effects corresponding to the above-mentioned second aspect and any implementation manner of the second aspect, which will not be described again here.
  • Embodiments of the present application provide a computer program product.
  • the computer program product includes a software program.
  • When the software program is executed by a computer or a processor, it causes the computer or processor to execute the audio processing method in the first aspect or any possible implementation of the first aspect.
  • The fourteenth aspect and any implementation manner of the fourteenth aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
  • the technical effects corresponding to the fourteenth aspect and any implementation manner of the fourteenth aspect please refer to the technical effects corresponding to the above-mentioned first aspect and any implementation manner of the first aspect, which will not be described again here.
  • Embodiments of the present application provide a computer program product.
  • the computer program product includes a software program.
  • When the software program is executed by a computer or a processor, it causes the computer or processor to execute the audio processing method in the second aspect or any possible implementation of the second aspect.
  • the fifteenth aspect and any implementation manner of the fifteenth aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
  • the technical effects corresponding to the fifteenth aspect and any implementation manner of the fifteenth aspect please refer to the technical effects corresponding to the above-mentioned second aspect and any implementation manner of the second aspect, which will not be described again here.
  • Figure 1a is a schematic diagram of an exemplary application scenario
  • Figure 1b is a schematic diagram of an exemplary application scenario
  • Figure 2 is a schematic diagram of an exemplary audio processing process
  • Figure 3a is a schematic diagram of an exemplary audio processing process
  • Figure 3b is a schematic diagram of an exemplary audio processing process
  • Figure 4a is a schematic diagram of an exemplary audio processing process
  • Figure 4b is a schematic diagram of an exemplary audio processing process
  • Figure 5 is a schematic diagram of an exemplary processing process
  • Figure 6 is a schematic diagram of an exemplary processing process
  • Figure 7a is a schematic diagram of an exemplary audio processing process
  • Figure 7b is a schematic diagram of an exemplary audio processing process
  • Figure 8a is a schematic diagram of an exemplary audio processing process
  • Figure 8b is a schematic diagram of an exemplary audio processing process
  • Figure 9 is a schematic diagram of an exemplary audio processing system
  • Figure 10 is a schematic structural diagram of an exemplary device.
  • A and/or B can mean three situations: A exists alone, A and B exist simultaneously, or B exists alone.
  • first and second in the description and claims of the embodiments of this application are used to distinguish different objects, rather than to describe a specific order of objects.
  • The first target object, the second target object, etc. are used to distinguish between different target objects, rather than to describe a specific order of the target objects.
  • multiple processing units refer to two or more processing units; multiple systems refer to two or more systems.
  • this application can be applied to the scenario of using headphones to listen to audio in a mobile terminal.
  • The earphones can be wireless earphones (such as TWS (True Wireless Stereo) Bluetooth earphones, head-mounted Bluetooth earphones, neck-mounted Bluetooth earphones, etc.) or wired earphones, which is not limited in this application.
  • the connection between the mobile terminal and the headset may be a wireless connection or a wired connection, which is not limited in this application.
  • the mobile terminal can be a mobile phone, a tablet computer, a smart watch, a personal notebook, etc., which is not limited in this application.
  • the audio in the mobile terminal listened to using headphones can be songs, audio parts of videos, audio books, etc., and this application is not limited to this.
  • Figure 1a is a schematic diagram of an exemplary application scenario.
  • the mobile terminal is a mobile phone and the earphones are wireless earphones;
  • Figure 1a shows a scene of using earphones to listen to songs on the mobile phone.
  • the headset remains connected to the mobile phone.
  • the user wants to listen to song A, he can open the audio application on the phone, find song A in the audio application and perform the playback operation.
  • The mobile phone can respond to the play operation and send the audio signal of song A to the earphones, which play it; in this way, the user can hear song A through the earphones.
  • this application can be applied to various VR (Virtual Reality, virtual reality) scenes such as VR movies, VR games, etc., where audio is played by a VR device or by a headset connected to the VR device.
  • VR Virtual Reality, virtual reality
  • VR equipment may include VR glasses, VR helmets, etc., which is not limited in this application.
  • FIG 1b is a schematic diagram of an exemplary application scenario.
  • the VR device is VR glasses. Shown in Figure 1b is the scene of watching a VR movie using VR glasses.
  • the VR movie image can be displayed on the inside of the lens of the VR glasses, and the audio signal in the VR movie can be played on a speaker near the human ear on the VR glasses.
  • VR glasses can also be connected to headphones; in this way, during the playback of a VR movie, the VR movie picture can be displayed on the inside of the lenses of the VR glasses while the audio signals in the VR movie are sent to the headphones for playback; this application does not limit this.
  • This application proposes an audio processing method that can perform spatial audio processing on audio signals to obtain binaural signals for earphone playback, allowing the user to perceive the sound image position, sense of distance and sense of space when listening through earphones.
  • the sound image position may refer to the direction of the sound subjectively felt by the user relative to the center of the head.
  • the sense of distance can refer to the distance of the sound relative to the center of the head that the user subjectively feels.
  • the sense of space can refer to the size of the acoustic environment space subjectively felt by the user.
  • FIG. 2 is a schematic diagram illustrating an audio processing process.
  • an audio signal to be processed can be obtained, and the audio signal to be processed is called a source audio signal.
  • the source audio signal is a media file.
  • the source audio signal may be an audio signal corresponding to a song, an audio signal corresponding to an audiobook, an audio signal contained in a video, etc. This application does not limit this.
  • S202 Perform direct sound rendering, early reflected sound rendering and late reflected sound rendering on the source audio signal to obtain binaural signals.
  • the source audio signal may travel to the human ear via a direct path, and may travel to the human ear via a reflected path.
  • The part of the sound wave that the source audio signal propagates to the human ear through the direct path will affect the user's perception of the sound image position.
  • The first part of the sound wave that reaches the human ear through the reflection path (the time range is generally taken as the sound waves arriving within 50 ms or 95 ms after the human ear receives the direct-path part; these are mainly produced by primary or secondary reflections) will affect the user's perception of the sound-image distance.
  • The last part of the sound wave that reaches the human ear after reflection (the time range is generally taken as the sound waves arriving later than 50 ms or 95 ms after the human ear receives the direct-path part) will affect the user's perception of the sense of space.
  • the binaural signals may include one signal for left earphone playback and one signal for right earphone playback.
  • The sound image position, sense of distance and sense of space restored by this application are more accurate, thereby achieving a more realistic and immersive binaural rendering effect.
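As an illustrative sketch (not part of the patent disclosure), the three-way decomposition above can be approximated by splitting a measured room impulse response at the direct-sound peak plus a mixing-time threshold; 50 ms is used here as one of the example values from the text, and all function and variable names are hypothetical:

```python
import numpy as np

def split_rir(rir, fs, early_ms=50.0):
    """Split a room impulse response into direct, early-reflection,
    and late-reverberation parts around the direct-sound peak."""
    peak = int(np.argmax(np.abs(rir)))            # direct-sound arrival
    early_end = peak + int(early_ms * fs / 1000)  # e.g. 50 ms after arrival
    direct = np.zeros_like(rir)
    direct[:peak + 1] = rir[:peak + 1]
    early = np.zeros_like(rir)
    early[peak + 1:early_end] = rir[peak + 1:early_end]
    late = np.zeros_like(rir)
    late[early_end:] = rir[early_end:]
    return direct, early, late

# toy RIR: impulse at sample 10 plus a decaying noise tail
fs = 48000
rir = np.zeros(fs)                                # 1 s of response
rir[10] = 1.0
rng = np.random.default_rng(0)
rir[11:] += 0.05 * rng.standard_normal(fs - 11) * np.exp(-np.arange(fs - 11) / 5000)

direct, early, late = split_rir(rir, fs)
```

The three parts sum back to the original response, so rendering each separately and mixing (as in S202) is equivalent to rendering with the full RIR when no per-part processing is applied. In practice the early/late boundary depends on the room; 50 ms and 95 ms are the example thresholds the text mentions.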
  • Figure 3a is a schematic diagram illustrating an audio processing process.
  • a method of performing direct sound rendering, early reflected sound rendering and late reflected sound rendering on the source audio signal respectively is shown.
  • S301 may refer to the description of S201, which will not be described again here.
  • the source audio signal includes at least one of the following formats: multi-channel format, multi-object format and Ambisonics format (spherical harmonic surround sound field format).
  • If the source audio signal is in a multi-channel format, it can be assumed that the source audio signal includes n1 (n1 is an integer greater than 1) channels; the source audio signal can then be expressed as: [ch_1(t, x_1, y_1, z_1), ch_2(t, x_2, y_2, z_2), ..., ch_n1(t, x_n1, y_n1, z_n1)], where t represents time and (x, y, z) represents the position of the sound source in the Cartesian coordinate system.
  • ch_1(t, x_1, y_1, z_1) represents the audio signal of the first channel, ch_2(t, x_2, y_2, z_2) represents the audio signal of the second channel, and so on, up to ch_n1(t, x_n1, y_n1, z_n1), the audio signal of the n1-th channel.
  • the sound source position corresponding to each channel remains unchanged.
  • the source audio signal can be expressed as:
  • the source audio signal is in a multi-object format, it can be assumed that the source audio signal includes n2 (n2 is an integer greater than 1) objects.
  • the source audio signal can be expressed as:
  • Here t represents time, and the accompanying spherical coordinates represent the position of the sound source in the spherical coordinate system.
  • The first expression represents the audio signal of the first object, the second represents the audio signal of the second object, and so on, up to the audio signal of the n2-th object.
  • each object is a moving sound source, and the position of the audio signal of each object changes with time; that is, the audio signal of each object can include multiple groups, and one group of audio signals corresponds to one position.
  • the source audio signal is in Ambisonics format, it can be assumed that the source audio signal contains n3 (n3 is a positive integer greater than 1) channels, and the source audio signal can be expressed as:
  • The audio signal of each channel may include 2n+1 groups.
  • i represents the index of the currently processed audio signal in the source audio signal; the corresponding expression denotes the i-th audio signal, and its associated coordinates denote the position of the i-th audio signal.
  • S302 Perform direct sound rendering on the source audio signal to obtain the first binaural signal.
  • a direct sound RIR (Room Impulse Response) library can be established in advance.
  • An artificial head recording device can be used in advance under free sound field conditions (such as an anechoic chamber environment) to collect the responses of sound sources located at p1 (p1 is a positive integer) positions, obtaining the direct sound RIRs, i.e. HRIRs (Head Related Impulse Responses), of the p1 positions. The direct sound RIRs of the p1 positions can then be used to form the direct sound RIR library.
  • the direct sound RIR library can be expressed as:
  • The subscript BIN indicates that the HRIR distinguishes the left and right ears; that is, the HRIR of each position includes 2 groups (the HRIR of the left ear and the HRIR of the right ear). The entries of the library represent, in order, the direct sound RIR at the first position, the direct sound RIR at the second position, and so on, up to the direct sound RIR at the p1-th position.
  • the direct sound RIR can be converted to Ambisonics format and saved, and can be expressed as HRIR BIN-AMB .
  • the direct sound RIR can be convolved with the source audio signal to implement direct sound rendering of the source audio signal to obtain the first binaural signal.
  • For the i-th audio signal of the source audio signal, direct sound rendering can be performed with reference to the following formula:
  • The "*" in the above formula represents convolution.
  • The audio signal obtained after direct sound rendering of the i-th audio signal of the source audio signal is produced by convolving it with the direct sound RIR whose position in the direct sound RIR library corresponds to the position of the i-th audio signal.
  • If the source audio signal includes N (N is an integer greater than 1) channels, direct sound rendering is performed on the source audio signal to obtain the first binaural signal out_1(t) as follows:
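The per-channel convolution and summation described above can be sketched as follows (a simplified illustration rather than the patent's implementation; the toy HRIRs, the `render_direct` name and the position-index lookup are all assumptions):

```python
import numpy as np

def render_direct(channels, hrir_lib):
    """Direct-sound rendering: convolve each source channel with the
    left/right HRIR for its position, then sum over channels.
    channels: list of (signal, pos_index); hrir_lib[pos] = (hrir_L, hrir_R)."""
    n = max(len(sig) + len(hrir_lib[p][0]) - 1 for sig, p in channels)
    out_l = np.zeros(n)
    out_r = np.zeros(n)
    for sig, pos in channels:
        h_l, h_r = hrir_lib[pos]
        yl = np.convolve(sig, h_l)   # left-ear contribution of this channel
        yr = np.convolve(sig, h_r)   # right-ear contribution of this channel
        out_l[:len(yl)] += yl
        out_r[:len(yr)] += yr
    return out_l, out_r

# toy example: two channels, HRIRs modeled as scaled/delayed impulses
hrir_lib = {
    0: (np.array([1.0, 0.0]), np.array([0.0, 0.5])),   # source at the left
    1: (np.array([0.0, 0.5]), np.array([1.0, 0.0])),   # source at the right
}
channels = [(np.array([1.0, 0.0, 0.0]), 0), (np.array([0.0, 1.0, 0.0]), 1)]
out_l, out_r = render_direct(channels, hrir_lib)
```

In the actual method, the position of each audio signal selects the corresponding HRIR_BIN entry from the pre-built direct sound RIR library; real HRIRs are of course much longer than the two-tap placeholders used here.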
  • an early reflection sound RIR library can be established in advance.
  • A spherical microphone can be used in advance in the acoustic environment to collect the responses of sound sources located at p2 (p2 is a positive integer) positions in the acoustic environment, obtaining the RIR data of the p2 positions. The first part of the impulse response of the reflection path between the sound source and the spherical microphone is then determined in the RIR data of each of the p2 positions (that is, the early reflection part, which can be represented by ER (Early Reflections)), yielding the early reflected sound RIRs, i.e. HOA (High-Order Ambisonics) RIRs, of the p2 positions. The early reflected sound RIRs of the p2 positions can then be used to form an early reflected sound RIR library.
  • By using a spherical microphone, this application completes the recording of the RIR in all directions in one acquisition, which can reduce the workload of producing the early reflected sound RIR library.
  • the early reflected sound RIR library can be expressed as:
  • AMB means that the ER is saved in Ambisonics format, and the HOA RIR at each location may include 2n+1 groups.
  • the early reflected sound RIR can be converted to BIN format and saved, and the calculation method is as follows:
  • the early reflection sound RIR can be convolved with the source audio signal to perform early reflection sound rendering on the source audio signal to obtain the second binaural signal.
  • For the i-th audio signal of the source audio signal, early reflection sound rendering can be performed with reference to the following formula:
  • the "*" in the above formula represents convolution.
  • If the source audio signal includes N (N is an integer greater than 1) channels, early reflection sound rendering is performed on the source audio signal to obtain the second binaural signal out_2(t) as follows:
  • a late reflection sound RIR library can be established in advance.
  • A spherical microphone can be used in advance in the acoustic environment to collect the responses of sound sources located at p3 (p3 is a positive integer) positions in the acoustic environment, obtaining the RIR data of the p3 positions. The latter part of the impulse response of the reflection path between the sound source and the spherical microphone is then determined in the RIR data of each of the p3 positions (that is, the late reflection part, which can be represented by LR (Late Reflections)), yielding the late reflected sound RIRs (i.e. HOA RIRs) of the p3 positions. The late reflected sound RIRs of the p3 positions can then be used to form a late reflected sound RIR library.
  • By using a spherical microphone, this application completes the recording of the RIR in all directions in one acquisition, which can reduce the workload of producing the late reflected sound RIR library.
  • The late reflected sound RIR library can be expressed as:
  • AMB means that the LR is saved in Ambisonics format, and the late reflected sound RIR at each position may include 2n+1 groups.
  • the late reflection sound RIR can be converted to BIN format and saved, and the calculation method is as follows:
  • the late reflection sound RIR can be convolved with the source audio signal to implement late reflection sound rendering of the source audio signal to obtain the third binaural signal.
  • For the i-th audio signal of the source audio signal, late reflection sound rendering can be performed with reference to the following formula:
  • The audio data obtained after late reflection sound rendering of the i-th audio data of the source audio signal is produced by convolving it with the late reflected sound RIR whose position in the late reflected sound RIR library corresponds to the position of the i-th audio signal.
  • If the source audio signal includes N (N is an integer greater than 1) channels, late reflection sound rendering is performed on the source audio signal to obtain the third binaural signal out_3(t) as follows:
  • p1, p2 and p3 may be equal or unequal, and this application does not limit this.
  • out_B(t) is the binaural signal.
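The mixing in S305 amounts to a sample-wise sum of the three binaural signals, out_B(t) = out_1(t) + out_2(t) + out_3(t). A minimal sketch with hypothetical names follows; shorter signals are zero-padded, since the three renderings can produce different lengths:

```python
import numpy as np

def mix_binaural(signals):
    """Mix binaural signals by sample-wise summation.
    signals: list of (left, right) array pairs; returns the summed
    (left, right) pair, zero-padding shorter inputs to the longest length."""
    n = max(max(len(l), len(r)) for l, r in signals)
    out_l, out_r = np.zeros(n), np.zeros(n)
    for l, r in signals:
        out_l[:len(l)] += l
        out_r[:len(r)] += r
    return out_l, out_r

# out_B(t) = out_1(t) + out_2(t) + out_3(t), with toy component signals
out1 = (np.array([1.0, 0.0]), np.array([0.0, 1.0]))          # direct part
out2 = (np.array([0.5, 0.5, 0.5]), np.array([0.5]))          # early reflections
out3 = (np.array([0.0, 0.0, 0.0, 0.25]), np.array([0.25]))   # late reverberation
out_b = mix_binaural([out1, out2, out3])
```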
  • Figure 3b is a schematic diagram illustrating an audio processing process.
  • On the basis of Figure 3a, sound effect processing can also be performed on the first binaural signal, the second binaural signal and the third binaural signal to modify the audio.
  • In step S306, sound effect processing is performed on the first binaural signal according to the preset sound effect parameter 1 to obtain the audio signal 1.
  • S306 may be executed after S302 and before S305, that is, after obtaining the first binaural signal, sound effect processing may be performed on the first binaural signal to obtain the audio signal 1.
  • The preset sound effect parameter 1 (that is, the direct sound effect parameter) may refer to the parameter used to perform sound effect processing on the direct sound part.
  • A set of filters can be generated based on the preset sound effect parameter 1, which can be represented by AudioEffects_1-BIN(t); AudioEffects_1-BIN(t) is then used to filter the first binaural signal to implement the sound effect processing.
  • In step S307, sound effect processing is performed on the second binaural signal according to the preset sound effect parameter 2 to obtain the audio signal 2.
  • S307 is executed after S303 and before S305, that is, after obtaining the second binaural signal, sound effect processing can be performed on the second binaural signal to obtain the audio signal 2.
  • sound effect processing can be performed on the second binaural signal according to the preset sound effect parameter 2 (that is, the early reflection sound effect parameter, which may refer to the parameter used to perform sound effect processing on the early reflection sound part) to obtain Audio signal 2.
  • A set of filters can be generated based on the preset sound effect parameter 2, which can be represented by AudioEffects_2-BIN(t); AudioEffects_2-BIN(t) is then used to filter the second binaural signal to implement the sound effect processing.
  • In step S308, sound effect processing is performed on the third binaural signal according to the preset sound effect parameter 3 to obtain the audio signal 3.
  • S308 is executed after S304 and before S305, that is, after obtaining the third binaural signal, sound effect processing can be performed on the third binaural signal to obtain the audio signal 3.
  • sound effect processing can be performed on the third binaural signal according to the preset sound effect parameter 3 (that is, the late reflection sound effect parameter, which may refer to the parameter used to perform sound effect processing on the late reflection sound part) to obtain Audio signal 3.
  • A set of filters can be generated based on the preset sound effect parameter 3, which can be represented by AudioEffects_3-BIN(t); AudioEffects_3-BIN(t) is then used to filter the third binaural signal to implement the sound effect processing.
  • Exemplarily, S305 may include S305a and S305b, wherein:
  • S305a Mix audio signal 1, audio signal 2 and audio signal 3 to obtain audio signal 4.
  • S305b Perform sound effect processing on the audio signal 4 according to the preset sound effect parameter 4 to obtain a binaural signal.
  • Sound effect processing can be performed on audio signal 4 according to the preset sound effect parameter 4 (that is, the first mixing sound effect parameter, which may refer to the parameters used to perform sound effect processing on the direct sound part, the early reflected sound part and the late reflected sound part together) to obtain the binaural signal; for details, please refer to the above description, which will not be repeated here.
  • The first binaural signal, the second binaural signal, the third binaural signal, audio signal 1, audio signal 2, audio signal 3 and audio signal 4 obtained above all include the audio signals of the left and right ears.
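The sound effect processing in S306-S308 can be illustrated as filtering each binaural component with a filter generated from its preset parameter (a stand-in for the AudioEffects_x-BIN(t) filters above). The filters below are modeled as trivial FIR coefficient arrays, and every name and coefficient is a hypothetical placeholder:

```python
import numpy as np

def apply_effect(binaural, fir):
    """Filter both ears of a binaural signal with the same FIR filter
    (a stand-in for an AudioEffects_x-BIN(t) filter)."""
    left, right = binaural
    return np.convolve(left, fir), np.convolve(right, fir)

# hypothetical effect filters derived from preset sound effect parameters 1-3:
fx1 = np.array([0.8])        # attenuate the direct-sound part
fx2 = np.array([1.0, 0.2])   # add slight diffusion to the early reflections
fx3 = np.array([0.5])        # damp the late reverberation

bin1 = (np.array([1.0, 0.0]), np.array([0.0, 1.0]))   # first binaural signal
bin2 = (np.array([0.0, 1.0]), np.array([1.0, 0.0]))   # second binaural signal
bin3 = (np.array([0.2, 0.2]), np.array([0.2, 0.2]))   # third binaural signal

audio1 = apply_effect(bin1, fx1)   # audio signal 1 (S306)
audio2 = apply_effect(bin2, fx2)   # audio signal 2 (S307)
audio3 = apply_effect(bin3, fx3)   # audio signal 3 (S308)
```

The resulting audio signals 1-3 would then be mixed (S305a) and optionally filtered again with the first mixing sound effect parameter (S305b) to produce the final binaural signal.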
  • Figure 4a is a schematic diagram illustrating an audio processing process.
  • In Figure 4a, another way of performing direct sound rendering, early reflected sound rendering and late reflected sound rendering on the source audio signal is shown.
  • S401 may refer to the description of S301 above, which will not be described again here.
  • S402 may refer to the description of S303 above, which will not be described again here.
  • The second binaural signal is represented by out_2(t).
  • S403 may refer to the description of S304 above, which will not be described again here.
  • The third binaural signal is represented by out_3(t).
  • S404 Mix the second binaural signal and the third binaural signal to obtain a fourth binaural signal.
  • out_4(t) is the fourth binaural signal.
  • S405 Perform direct sound rendering on the fourth binaural signal to obtain the fifth binaural signal.
  • S405 may refer to the description of S302, which will not be described again here.
  • S406 Determine the binaural signal based on the fifth binaural signal.
  • the fifth binaural signal is used as a binaural signal.
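The alternative order of Figure 4a (render early and late reflections first, mix them, then apply direct sound rendering to the mix) can be sketched as below. For brevity the early/late RIRs and HRIRs are single-tap mono filters, which is a deliberate simplification, and all names are hypothetical:

```python
import numpy as np

def render_fig4a(src, er_rir, lr_rir, hrir_l, hrir_r):
    """Alternative pipeline: early-reflection rendering (S402) and
    late-reflection rendering (S403), mixing (S404), then direct
    sound rendering of the mix (S405)."""
    out2 = np.convolve(src, er_rir)     # second signal: early reflections
    out3 = np.convolve(src, lr_rir)     # third signal: late reverberation
    n = max(len(out2), len(out3))
    out4 = np.zeros(n)                  # fourth signal: early + late mix
    out4[:len(out2)] += out2
    out4[:len(out3)] += out3
    # fifth signal: direct-sound rendering applied to the mixed signal
    return np.convolve(out4, hrir_l), np.convolve(out4, hrir_r)

src = np.array([1.0, 0.0])
left, right = render_fig4a(src, np.array([0.5]), np.array([0.25]),
                           np.array([1.0]), np.array([0.9]))
```

Because convolution is linear, applying the direct-sound RIR after mixing is equivalent to applying it to each reflected component separately and then mixing, which is what makes this reordering possible.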
  • Figure 4b is a schematic diagram illustrating an audio processing process.
  • On the basis of Figure 4a, after obtaining the second binaural signal, the third binaural signal, the fourth binaural signal and the fifth binaural signal, sound effect processing can be performed on the second, third, fourth and fifth binaural signals to modify the audio.
  • In step S407, sound effect processing is performed on the second binaural signal according to the preset sound effect parameter 2 to obtain the audio signal 2.
  • S407 is executed after S402 and before S404, that is, after obtaining the second binaural signal, sound effect processing can be performed on the second binaural signal to obtain the audio signal 2.
  • For details, reference may be made to the description of S307 above, which will not be described again here.
  • Exemplarily, in step S408, sound effect processing is performed on the third binaural signal according to the preset sound effect parameter 3 to obtain the audio signal 3.
  • S408 is executed after S403 and before S404, that is, after obtaining the third binaural signal, sound effect processing can be performed on the third binaural signal to obtain audio signal 3.
  • S404 may include: performing mixing processing on audio signal 2 and audio signal 3 to obtain a fourth binaural signal.
  • In step S409, sound effect processing is performed on the fourth binaural signal according to the preset sound effect parameter 5 to obtain the audio signal 5.
  • S409 is executed after S404 and before S405, that is, after the fourth binaural signal is obtained, sound effect processing can be performed on the fourth binaural signal to obtain the audio signal 5.
  • the fourth binaural signal can be processed according to the preset sound effect parameter 5 (that is, the second mixed sound effect parameter, which may refer to the parameters used to perform sound effect processing on the early reflection sound part and the late reflection sound part). Sound effect processing is performed to obtain the audio signal 5; for details, please refer to the above description and will not be repeated here.
  • S405 may include performing direct sound rendering on the audio signal 5 to obtain a fifth binaural signal.
  • The above-mentioned S406 may include S406_X, wherein in S406_X, sound effect processing is performed on the fifth binaural signal according to the preset sound effect parameter 1 to obtain the binaural signal.
  • The second binaural signal, the third binaural signal, the fourth binaural signal, the fifth binaural signal, audio signal 2, audio signal 3 and audio signal 5 obtained above all include the audio signals of the left and right ears.
  • In addition, this application proposes an audio processing method that can support "tuning while listening": after the user performs a playback operation on the source audio signal, in response to the playback operation, spatial audio processing can first be performed on the initial audio segment of the source audio signal according to the system's default settings of the rendering effect options and/or the user's historical settings of the rendering effect options, to obtain the initial binaural signal and play it.
  • During playback of the initial binaural signal, the user can be supported to set the spatial audio effect; then, according to the user's settings for the spatial audio effect, spatial audio processing continues to be performed on the audio segments of the source audio signal that follow the initial audio segment.
  • the rendering effect of the binaural signal corresponding to the source audio signal can be continuously adjusted according to the user's settings for the rendering effect; it can also meet the user's personalized needs for spatial audio effects.
  • the spatial audio effect may include a rendering effect
  • the rendering effect may include sound and image position, sense of distance, sense of space, etc. This application is not limited to this.
  • this application can provide an application (or applet, web page, toolbar, etc.) for setting spatial audio effects.
  • FIG. 5 is a schematic diagram of an exemplary processing procedure.
  • FIG. 5(1) is a schematic diagram of an exemplary interface.
  • The spatial audio effect setting interface 51 in Figure 5(1) may be an interface provided by the system or an interface configured by the user, which is not limited in this application. This application takes the user making settings on the spatial audio effect setting interface 51 to adjust the spatial audio effect as an example for explanation.
  • The exemplary spatial audio effect setting interface 51 may include one or more setting areas, including but not limited to the rendering effect setting area 52, which is not limited in this application.
  • Options can be set in the rendering effect setting area 52 according to different rendering effects.
  • the rendering effects may include a variety of effects, such as sound and image position, sense of distance, sense of space, etc.; of course, other rendering effects may also be included, and this application is not limited to this.
  • the rendering effect setting area 52 may include but is not limited to: sound and image position options 521, distance sense options 522, space sense options 523, etc., and of course other rendering effect options may also be included.
  • This application takes the rendering effect setting area 52 including a sound image position option 521, a distance sense option 522 and a space sense option 523 as an example for illustration.
  • the sound image position option 521, the distance sense option 522, and the space sense option 523 may be slider controls, and the slider control may include a slider.
  • The user can operate the slider of the sound image position option 521 to raise or lower the sound image position. For example, when the user performs an upward sliding operation on the slider of the sound image position option 521, the sound image position can be raised; when the user performs a downward sliding operation on the slider, the sound image position can be lowered.
  • The user can operate the slider of the distance sense option 522 to increase or shorten the sense of distance. For example, when the user performs an upward sliding operation on the slider of the distance sense option 522, the distance between the sound image and the user can be increased; when the user performs a downward sliding operation on the slider, the distance between the sound image and the user can be shortened.
• Among them, the user can operate the slider of the spatial sense option 523 to increase or decrease the sense of space. For example, when the user performs an upward sliding operation on the slider of the spatial sense option 523, the spatial sense of the audio can be increased; when the user performs a downward sliding operation on the slider of the spatial sense option 523, the spatial sense of the audio can be reduced.
• In other embodiments, the sound image position option 521, the distance sense option 522 and the space sense option 523 can be other types of controls, such as knob controls (a knob control includes a knob). For example, the user can turn the knob of the sound image position option 521 to raise or lower the sound image position; turn the knob of the distance sense option 522 to increase or decrease the sense of distance; and turn the knob of the space sense option 523 to increase or decrease the sense of space.
  • This application does not limit the display form of the sound and image position option 521, the distance sense option 522 and the space sense option 523.
  • Figure 5(2) is a schematic diagram showing an exemplary audio processing process.
  • the source audio signal is a media file.
  • the source audio signal may be an audio signal of a song, an audio signal of an audiobook, an audio signal contained in a video, etc. This application does not limit this.
  • the user when the user wants to listen to song A, he can open the audio application in the mobile phone, find song A in the audio application, and perform the playback operation. At this time, in response to the user's playback operation, spatial audio processing can be performed on the initial audio segment in the audio signal corresponding to song A (ie, the source audio signal), and then the initial binaural signal can be obtained and played.
  • the source audio signal can be divided into multiple audio segments in a preset manner.
• Among them, the preset manner can be set according to demand; for example, the source audio signal is divided into multiple audio segments of the same duration; as another example, the source audio signal is divided into a preset number of audio segments (the number can be set according to demand); and so on.
  • the first X1 (X1 is a positive integer) audio segments among the multiple audio segments included in the source audio signal can be determined as the initial audio segments.
• That is, according to the description of the above embodiments, spatial audio processing can be performed on the first X1 audio segments in the source audio signal to obtain an initial binaural signal.
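• The segmentation described above can be sketched as follows. This is an illustrative reading of the description, not code from the patent; the function names and the equal-duration splitting choice are assumptions.

```python
# Hypothetical sketch: divide a source audio signal into equal-duration
# segments (one of the preset manners described above) and take the first
# X1 segments as the initial audio segments.

def split_into_segments(samples, segment_len):
    # Consecutive segments of segment_len samples; the last may be shorter.
    return [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]

def initial_segments(segments, x1):
    # The first X1 segments form the initial audio segments.
    return segments[:x1]

source = list(range(10))                 # stand-in for audio samples
segments = split_into_segments(source, 4)
init = initial_segments(segments, 2)     # first X1 = 2 segments
```

The remaining segments (`segments[x1:]`) are the ones processed later according to the user's setting operations.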
  • One possible way is to perform spatial audio processing on the first X1 audio segments in the source audio signal according to the system's settings for rendering effect options to obtain the initial binaural signal.
• That is, the system can pre-set each rendering effect option; after receiving the user's playback operation, spatial audio processing can be performed on the first X1 audio segments in the source audio signal according to the system's settings to obtain the initial binaural signal. For details, please refer to the description above, which will not be repeated here.
  • spatial audio processing can be performed on the first X1 audio segments in the source audio signal to obtain the initial binaural signal according to the user's historical (such as the last time) settings for rendering effect options.
• That is, the first X1 audio segments in the source audio signal can be spatially processed according to the user's last setting of the rendering effect options to obtain the initial binaural signal; for details, please refer to the description above, which will not be repeated here.
  • the rendering effect options include at least one of the following: sound and image position options, distance sense options, or space sense options.
• In this way, the user can hear the initial binaural signal; when the user determines that the rendering effect does not meet his needs, he can set the rendering effect options, that is, enter the spatial audio effect setting interface 51 shown in Figure 5(1) and perform setting operations on the rendering effect options, so as to set the rendering effect according to his own needs.
• Among them, the user can perform a setting operation for at least one rendering effect option in the rendering effect setting area 52; for example, a setting operation can be performed on at least one of the sound image position option 521, the distance sense option 522 and the space sense option 523, to set any one of the rendering effects of sound image position, sense of distance, and sense of space when the source audio signal is played.
• Among them, X2 is a positive integer. The X2 audio segments may be the X2 consecutive audio segments after the first X1 audio segments in the source audio signal, and the first audio segment of the X2 audio segments is adjacent to the last audio segment of the first X1 audio segments.
• S503 Continue to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal to obtain the target binaural signal.
• For example, the corresponding rendering effect parameters can be adjusted according to the user's setting operation for the rendering effect options; then, according to the adjusted rendering effect parameters, the audio segments after the initial audio segment in the source audio signal continue to undergo spatial audio processing to obtain the target binaural signal for headphone playback. The specific spatial audio processing process will be explained later.
• For example, according to the adjusted rendering effect parameters, the X3 (X3 is a positive integer) audio segments after the first X1 audio segments in the source audio signal continue to undergo spatial audio processing to obtain the target binaural signal for headphone playback.
  • X3 can be set according to requirements, and this application does not impose restrictions on this.
• Among them, the X3 audio segments may be the X3 consecutive audio segments after the first X1 audio segments in the source audio signal, and the first audio segment of the X3 audio segments is adjacent to the last audio segment of the first X1 audio segments.
• Alternatively, according to the adjusted rendering effect parameters, the X3 (X3 is a positive integer) audio segments after the first X1+X2 audio segments in the source audio signal continue to undergo spatial audio processing to obtain the target binaural signal for headphone playback.
• Among them, the X3 audio segments can be the X3 consecutive audio segments after the first X1+X2 audio segments in the source audio signal, and the first audio segment of the X3 audio segments is adjacent to the last audio segment of the first X1+X2 audio segments.
  • the target binaural signal can be played after the target binaural signal is obtained.
• Further, after hearing the target binaural signal, if the user determines that the rendering effect still does not meet his or her needs, the rendering effect options can be set again; at this time, according to the user's new settings for the rendering effect options, spatial audio processing can continue to be performed on the audio segments in the source audio signal after the last audio segment that underwent spatial audio rendering, to obtain a new target binaural signal; and so on.
  • the rendering effect of the binaural signal corresponding to the source audio signal can be continuously adjusted according to the settings for the rendering effect.
• In this way, the terminal device can also perform personalized spatial audio processing on the source audio signal according to the user's personalized spatial audio effect needs to obtain the target binaural signal for headphone playback, thereby meeting the user's personalized needs for spatial audio effects.
  • FIG. 6 is a schematic diagram of an exemplary processing process. Among them, FIG. 6(1) is a schematic diagram of an exemplary interface.
  • spatial audio effects may also include spatial scenes.
  • different users have different needs for spatial scenes in which audio signals are played.
  • some users prefer spatial scenes such as cinemas
  • some users prefer spatial scenes such as recording studios
  • some users prefer spatial scenes such as KTV, and so on.
  • a spatial scene selection area 53 can be added to the spatial audio effect setting interface 51 in Figure 5(1), as shown in Figure 6(1).
  • the space scene selection area 53 can include multiple scene options according to different space scenes.
  • the space scenes may include a variety of spaces, such as cinemas, concert halls, recording studios, KTVs, etc., and of course may also include other space scenes, which are not limited by this application.
• For example, the space scene selection area 53 may include but is not limited to: cinema option 531, concert hall option 532, recording studio option 533, KTV option 534, etc., and of course may also include other scene options, which is not limited in this application.
  • the movie theater option 531 in the spatial scene setting area 53 can be selected.
  • the concert hall option 532 in the space scene setting area 53 can be selected.
  • the recording studio option 533 in the space scene setting area 53 can be selected.
  • the KTV option 534 in the space scene setting area 53 can be selected.
  • the rendering effect options in the rendering effect setting area 52 are associated with the scene options in the spatial scene selection area 53; different scene options have different corresponding rendering effect options.
  • the rendering effect setting area 52 may display the rendering effect options corresponding to the cinema option 531 .
  • the rendering effect setting area 52 may display the rendering effect options corresponding to the concert hall option 532 .
  • the rendering effect setting area 52 may display rendering effect options corresponding to the recording studio option 533 .
  • the rendering effect setting area 52 may display the rendering effect options corresponding to the KTV option 534 .
• Among them, different scene options corresponding to different rendering effect options may mean that, under different scene options, the rendering effect parameters have different default parameter values.
  • the positions of the sliders (or knobs) of the displayed rendering effect options may be the same or different, and this application does not limit this.
  • Figure 6(2) is a schematic diagram showing an exemplary audio processing process.
  • the source audio signal is a media file.
  • S602 In response to the user's selection operation on the target scene option, display the rendering effect options corresponding to the target scene option.
  • the rendering effect options include at least one of the following: sound and image position options, distance sense options, or space sense options.
  • the terminal device can display the rendering effect options corresponding to the target scene option in the rendering effect setting area 52 in response to the user's selection operation for the target scene option.
  • the user can perform a setting operation for at least one rendering effect option in the rendering effect setting area 52, specifically according to the description of S501 above, which will not be described again here.
• S604 Continue to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal to obtain the target binaural signal.
• In one possible way, the rendering effect parameters can be adjusted according to the user's setting operation for the rendering effect options; then, according to the adjusted rendering effect parameters, the audio segments after the initial audio segment in the source audio signal are subjected to spatial audio processing to obtain the target binaural signal for headphone playback; details will be explained later.
• In another possible way, the rendering effect parameters can be adjusted according to the user's setting operation for the rendering effect options; then, the scene parameters are updated according to the target scene option; and then, according to the adjusted rendering effect parameters and the updated scene parameters, spatial audio processing is performed on the audio segments after the initial audio segment in the source audio signal to obtain the target binaural signal for headphone playback; details will be explained later.
  • S603 can be implemented by performing direct sound rendering, early reflected sound rendering and late reflected sound rendering on the audio segments after the initial audio segment in the source audio signal respectively.
• That is, according to the settings, the audio segments after the initial audio segment in the source audio signal are respectively subjected to direct sound rendering, early reflected sound rendering and late reflected sound rendering to obtain the target binaural signal. In this way, in any type of spatial scene, high-precision sound image position, audio spatial sense and distance sense can be restored, thereby achieving a more realistic and immersive binaural rendering effect.
  • the following describes the process of performing direct sound rendering, early reflected sound rendering, and late reflected sound rendering on the audio segments after the initial audio segment in the source audio signal according to the settings.
  • Figure 7a is a schematic diagram illustrating an audio processing process.
  • a method of performing direct sound rendering, early reflected sound rendering and late reflected sound rendering respectively on the audio segments after the initial audio segment in the source audio signal is described.
  • the source audio signal is a media file.
  • the rendering effect options include: sound and image position options, distance sense options, and space sense options.
  • S701 to S703 may refer to the description of S601 to S603, which will not be described again.
  • S704 Adjust the sound image position parameters according to the setting operation for the sound image position option.
  • the rendering effect parameters corresponding to the sound and image position options may be called sound and image position parameters.
• For example, the slider position of the sound image position option 521 can be determined according to the user's setting operation for the sound image position option 521, and then the sound image position parameters can be adjusted according to the slider position of the sound image position option 521.
  • adjusting the sound image position parameter means adjusting the parameter value of the sound image position parameter.
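• A minimal sketch of how a slider position might be mapped to a parameter value follows. The patent does not specify the mapping; the linear form, the normalized slider range and the example parameter range are all assumptions.

```python
def slider_to_parameter(slider_pos, lo, hi):
    # slider_pos is assumed normalized to [0.0, 1.0] (bottom to top);
    # map it linearly onto the parameter's value range [lo, hi].
    return lo + (hi - lo) * slider_pos

# e.g. raising the slider raises the sound image position parameter
low = slider_to_parameter(0.0, -30.0, 30.0)    # slider at the bottom
high = slider_to_parameter(1.0, -30.0, 30.0)   # slider at the top
```

The same mapping could serve the distance sense option 522 and space sense option 523, each with its own assumed range.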
  • direct sound rendering is performed on the audio segments after the initial audio segment in the source audio signal according to the sound image position parameters to obtain the first binaural signal.
  • S705 Select a candidate direct sound RIR from the preset direct sound RIR library, and determine the sound image position correction factor according to the sound image position parameter.
  • S706 Modify the candidate direct sound RIR according to the sound image position correction factor to obtain the target direct sound RIR.
  • S707 Perform direct sound rendering on the audio segments after the initial audio segment in the source audio signal according to the target direct sound RIR to obtain the first binaural signal.
  • a direct sound RIR library can be established in advance.
• For example, an artificial head recording device of a certain head type can be used in advance under free sound field conditions (such as an anechoic chamber environment) to collect the responses when the sound source is located at each of p1 (p1 is a positive integer) positions in the free sound field, so as to obtain the direct sound RIRs (i.e., HRIRs) at the p1 positions. Then, the HRIRs of the p1 positions can be used to form the direct sound RIRs corresponding to one head type (for convenience of description, the direct sound RIRs corresponding to one head type are called a first set).
  • the first set can be expressed as:
  • a first set may include preset direct sound RIRs at p1 positions.
  • m1 first sets can be recorded, and m1 is a positive integer. Then use these m1 first sets to form a direct sound RIR library; the direct sound RIR library can be expressed as:
• Among them, the head type may include but is not limited to: female head type, male head type, elderly head type, middle-aged head type, young person head type, child head type, European head type, Asian head type, etc. This application does not limit this.
  • the first target set may be selected from the m1 first sets of the direct sound RIR library according to the user's head type.
  • the user's head type can be determined based on the gender, age and other information entered by the user when logging into the system account.
  • the spatial audio effect setting interface 51 in Figure 6(1) may also include a head type setting area.
• Among them, the head type setting area includes multiple head type options, such as female head type options, male head type options, elderly head type options, middle-aged head type options, young people's head type options, children's head type options, European head type options, Asian head type options, and so on.
  • the user can select the corresponding head type option according to his or her own situation; in this way, the user's head type can be determined based on the head type option selected by the user.
  • different head types correspond to different spatial scenes, and the user's head type can be determined based on the spatial scene parameters.
• For example, the user can be prompted to use a mobile phone to take an image of the user's auricle; then, based on the image of the auricle taken by the user, the head type most similar to the user's can be found from a variety of preset head types and identified as the user's head type.
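• The library lookup described above can be sketched as follows. The nested-dictionary layout, the head type keys and the dummy HRIR values are all hypothetical; the patent only specifies that the library contains m1 first sets and that the first target set is selected by the user's head type.

```python
# Hypothetical structure for the direct sound RIR library: m1 first sets
# keyed by head type, each holding preset HRIRs keyed by source position.
direct_sound_rir_library = {
    "male":   {(0, 0): [1.0, 0.2], (90, 0): [0.8, 0.4]},
    "female": {(0, 0): [1.0, 0.3], (90, 0): [0.7, 0.5]},
}

def select_first_target_set(library, head_type):
    # Select the first set matching the user's head type as the first target set.
    return library[head_type]

target_set = select_first_target_set(direct_sound_rir_library, "female")
```

The early and late reflection sound RIR libraries could be organized the same way, keyed by spatial scene instead of head type.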
• In one possible way, based on the position information of the currently processed audio signal in the source audio signal (that is, the audio segment after the initial audio segment in the source audio signal) and the position information of the preset direct sound RIRs at the p1 positions in the first target set, a candidate direct sound RIR can be selected from the first target set.
• For example, the preset direct sound RIR whose position information is closest to the position information of the currently processed audio signal in the source audio signal (that is, the audio segment after the initial audio segment in the source audio signal) can be selected from the first target set as the candidate direct sound RIR.
• In another possible way, the position information of the currently processed audio signal in the source audio signal (that is, the audio segment after the initial audio segment in the source audio signal) and the offset value of the user's head position information can be determined, and then the preset direct sound RIR whose position information is closest to that position information plus the offset value can be selected from the first target set as the candidate direct sound RIR. In this way, head motion tracking rendering can be achieved.
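• The nearest-position selection with a head offset can be sketched as follows. The position representation (2-D tuples), the subtraction of the offset and the Euclidean distance metric are assumptions; the patent only states that the preset RIR with the closest position information is chosen.

```python
import math

def select_candidate_rir(target_set, source_pos, head_offset=(0.0, 0.0)):
    # target_set: {position tuple: preset RIR}. Pick the preset RIR whose
    # recorded position is closest to the source position shifted by the
    # listener's head offset (one simple reading of head tracking above).
    adjusted = tuple(s - o for s, o in zip(source_pos, head_offset))
    best_pos = min(target_set, key=lambda p: math.dist(p, adjusted))
    return best_pos, target_set[best_pos]
```

The same selection applies unchanged to the second and third target sets for early and late reflection sound RIRs.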
  • the sound image position correction factor may be determined according to the parameter value of the adjusted sound image position parameter.
• Among them, the target direct sound RIR can be obtained by applying the sound image position correction factor to the candidate direct sound RIR, where HRIR' is the target direct sound RIR, the sound image position correction factor can be represented by a set of filters, and HRIR is the candidate direct sound RIR.
  • the sound image position can be adjusted by correcting the candidate direct sound RIR by the sound image position correction factor, and the target direct sound RIR can be obtained.
• For example, the target direct sound RIR can be convolved with the audio segments after the initial audio segment in the source audio signal, to achieve direct sound rendering of the audio segments after the initial audio segment in the source audio signal and obtain the first binaural signal.
• For example, direct sound rendering of the audio segment following the initial audio segment in the i-th channel of the source audio signal can refer to the following formula: out1_i(t) = x_i(t) * HRIR'_i(t), where "*" represents convolution, x_i(t) is the audio segment after the initial audio segment in the i-th channel of the source audio signal, out1_i(t) is the audio signal obtained by performing direct sound rendering on it, and HRIR'_i(t) is the target direct sound RIR.
• If the source audio signal includes N (N is an integer greater than 1) channels, direct sound rendering is performed on the audio segments after the initial audio segment in each channel and the results are superimposed to obtain the first binaural signal out1(t): out1(t) = Σ_{i=1}^{N} x_i(t) * HRIR'_i(t).
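• The per-channel convolution and summation just described can be sketched in pure Python as follows. This is illustrative only; a real implementation would use FFT-based convolution, and the equal-length assumptions in the comments are not from the patent.

```python
def convolve(x, h):
    # Direct-form linear convolution, y = x * h.
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def render_direct_sound(channels, target_hrirs):
    # channels: N audio segments (assumed equal length); target_hrirs:
    # per-channel (left, right) target direct sound RIRs (assumed equal
    # length). Convolve each channel with its RIR pair and sum across
    # channels to obtain the first binaural signal (left, right).
    n = len(channels[0]) + len(target_hrirs[0][0]) - 1
    out_l, out_r = [0.0] * n, [0.0] * n
    for x, (h_l, h_r) in zip(channels, target_hrirs):
        for k, v in enumerate(convolve(x, h_l)):
            out_l[k] += v
        for k, v in enumerate(convolve(x, h_r)):
            out_r[k] += v
    return out_l, out_r
```

Substituting the target early reflection sound RIRs or late reflection sound RIRs for the HRIR pairs yields the second and third binaural signals in the same way.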
  • the rendering effect parameters corresponding to the distance sense option may be called distance sense parameters.
  • the slider position of the distance sense option 522 can be determined according to the user's setting operation for the distance sense option 522, and then the distance sense parameter can be adjusted according to the slider position of the distance sense option 522. .
  • adjusting the distance sensing parameter refers to adjusting the parameter value of the distance sensing parameter.
  • S709 Select candidate early reflection sound RIR from the preset early reflection sound RIR library, and determine the distance sense correction factor according to the distance sense parameter.
  • S710 Modify the candidate early reflection sound RIR according to the distance perception correction factor to obtain the target early reflection sound RIR.
  • S711 Perform early reflection sound rendering on the audio segment after the initial audio segment in the source audio signal according to the target early reflection sound RIR to obtain the second binaural signal.
  • an early reflection sound RIR library can be established in advance.
• For example, a spherical microphone can be used in advance in the acoustic environment corresponding to a spatial scene to collect the responses when the sound source is located at each of p2 (p2 is a positive integer) positions in that acoustic environment, obtaining the RIRs at the p2 positions; then, the impulse response of the early part of the reflection path between the sound source and the spherical microphone can be intercepted from the RIRs at the p2 positions to obtain the early reflected sound RIRs (i.e., HOA RIRs) at the p2 positions.
  • the early reflection sound RIR of p2 positions can be used to form an early reflection sound RIR corresponding to a space scene (for the convenience of description, the early reflection sound RIR corresponding to a space scene is called a second set).
  • the second set can be expressed as:
  • a second set may include preset early reflection sound RIRs at p2 positions.
  • m2 second sets can be recorded, and m2 is a positive integer. Then these m2 second sets are used to form an early reflection sound RIR library.
  • the early reflected sound RIR library can be expressed as:
  • the second set corresponding to the spatial scene parameters may be selected from the m2 second sets of the early reflection sound RIR library as the second target set.
• In one possible way, based on the position information of the currently processed audio signal in the source audio signal (that is, the audio segment after the initial audio segment in the source audio signal) and the position information of the preset early reflection sound RIRs at the p2 positions in the second target set, a candidate early reflection sound RIR can be selected from the second target set.
• For example, the preset early reflection sound RIR whose position information is closest to the position information of the currently processed audio signal in the source audio signal (that is, the audio segment after the initial audio segment in the source audio signal) can be selected from the second target set as the candidate early reflection sound RIR.
• In another possible way, the position information of the currently processed audio signal in the source audio signal (that is, the audio segment after the initial audio segment in the source audio signal) and the offset value of the user's head position information can be determined, and then the preset early reflection sound RIR whose position information is closest to that position information plus the offset value can be selected from the second target set as the candidate early reflection sound RIR. In this way, head motion tracking rendering can be achieved.
• Among them, the target early reflection sound RIR can be obtained by applying the distance sense correction factor to the candidate early reflection sound RIR, where ER' is the target early reflection sound RIR, the distance sense correction factor can be represented by a gain, and ER is the candidate early reflection sound RIR; by increasing the amplitude of the candidate early reflection sound RIR, the sense of distance can be shortened.
  • the distance perception can be adjusted by correcting the candidate early reflection sound RIR by the distance perception correction factor.
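• As an illustration of the gain-based correction above, the target early reflection sound RIR can be formed by scaling the candidate RIR. The function name and the simple sample-wise multiplication are assumptions; the patent only states that the correction factor can be represented by a gain.

```python
def apply_gain_correction(candidate_rir, gain):
    # Scale every sample of the candidate early reflection sound RIR by
    # the gain derived from the distance sense parameter (ER' = gain * ER).
    return [gain * v for v in candidate_rir]

target_er = apply_gain_correction([1.0, -0.5, 0.25], 2.0)
```

The spatial sense correction factor for the late reflection sound RIR described later is applied the same way, with the gain derived from the spatial sense parameter instead.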
• For example, the target early reflection sound RIR can be convolved with the audio segments after the initial audio segment in the source audio signal, to achieve early reflection sound rendering of the audio segments after the initial audio segment in the source audio signal and obtain the second binaural signal.
  • "*" in the above formula represents convolution. is the data obtained by performing early reflection sound rendering on the audio segments after the initial audio segment in the i-th audio data of the source audio signal, RIR for early reflected sound from the target.
• If the source audio signal includes N (N is an integer greater than 1) channels, early reflection sound rendering is performed on the audio segments after the initial audio segment in each channel and the results are superimposed to obtain the second binaural signal out2(t): out2(t) = Σ_{i=1}^{N} x_i(t) * ER'_i(t).
  • the rendering effect parameters corresponding to the spatial sense options may be called spatial sense parameters.
• For example, the slider position of the spatial sense option 523 can be determined according to the user's setting operation for the spatial sense option 523, and then the spatial sense parameters can be adjusted according to the slider position of the spatial sense option 523.
  • adjusting the spatial sense parameter refers to adjusting the parameter value of the spatial sense parameter.
  • late reflection sound rendering can be performed on the audio segments after the initial audio segment in the source audio signal according to the spatial sense parameters to obtain the third binaural signal; refer to S713 to S715 as follows:
  • S713 Select candidate late reflection sound RIRs from the preset late reflection sound RIR library, and determine the spatial perception correction factor according to the spatial perception parameters.
  • S714 Modify the candidate late reflection sound RIR according to the spatial sense correction factor to obtain the target late reflection sound RIR.
  • S715 Perform late reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the target late reflection sound RIR to obtain a third binaural signal.
  • a late reflection sound RIR library can be established in advance.
• For example, a spherical microphone can be used in advance in the acoustic environment corresponding to a spatial scene to collect the responses when the sound source is located at each of p3 (p3 is a positive integer) positions in that acoustic environment, obtaining the RIRs at the p3 positions; then, the impulse response of the latter part of the reflection path between the sound source and the spherical microphone can be intercepted from the RIRs at the p3 positions to obtain the late reflected sound RIRs (i.e., HOA RIRs) at the p3 positions.
• Then, the late reflection sound RIRs of the p3 positions can be used to form the late reflection sound RIR corresponding to a space scene (for convenience of description, the late reflection sound RIRs corresponding to one space scene are called a third set).
  • the third set can be expressed as:
  • a third set may include preset late reflection sound RIRs at p3 positions.
  • m3 third sets can be collected, and m3 is a positive integer. Then these m3 third sets are used to form a late reflection sound RIR library.
  • the late reflection sound RIR library can be expressed as:
• Among them, m2 and m3 can be equal.
  • the third set corresponding to the spatial scene parameters can be selected from the m3 third sets of the late reflection sound RIR library as the third target set.
• In one possible way, based on the position information of the currently processed audio signal in the source audio signal (that is, the audio segment after the initial audio segment in the source audio signal) and the position information of the preset late reflection sound RIRs at the p3 positions in the third target set, candidate late reflection sound RIRs can be selected from the third target set.
• For example, the preset late reflection sound RIR whose position information is closest to the position information of the currently processed audio signal in the source audio signal (that is, the audio segment after the initial audio segment in the source audio signal) can be selected from the third target set as the candidate late reflection sound RIR.
• In another possible way, the position information of the currently processed audio signal in the source audio signal (that is, the audio segment after the initial audio segment in the source audio signal) and the offset value of the user's head position information can be determined, and then the preset late reflection sound RIR whose position information is closest to that position information plus the offset value can be selected from the third target set as the candidate late reflection sound RIR. In this way, head motion tracking rendering can be achieved.
• Among them, the target late reflection sound RIR can be obtained by applying the spatial sense correction factor to the candidate late reflection sound RIR, where LR' is the target late reflection sound RIR, the spatial sense correction factor can be expressed as a gain, and LR is the candidate late reflection sound RIR; by increasing the amplitude of the candidate late reflection sound RIR, the sense of space can be increased.
• In this way, the spatial perception can be adjusted by modifying the candidate late reflection sound RIR with the spatial perception correction factor.
• For example, the target late reflection sound RIR can be convolved with the audio segment after the initial audio segment in the source audio signal to realize the late reflection sound rendering of the audio segment after the initial audio segment in the source audio signal, and obtain the third binaural signal.
• For example, late reflection sound rendering of the audio segment following the initial audio segment in the i-th channel of the source audio signal can refer to the following formula: out3_i(t) = x_i(t) * LR'_i(t), where "*" represents convolution, out3_i(t) is the data obtained by performing late reflection sound rendering, and LR'_i(t) is the target late reflection sound RIR.
• If the source audio signal includes N (N is an integer greater than 1) channels, late reflection sound rendering is performed on the audio segments after the initial audio segment in each channel and the results are superimposed to obtain the third binaural signal out3(t): out3(t) = Σ_{i=1}^{N} x_i(t) * LR'_i(t).
  • S716 Determine the target binaural signal based on the first binaural signal, the second binaural signal and the third binaural signal.
  • S716 may refer to the description of S305 above, which will not be described again here.
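• The patent defers the details of S716 to S305; one plausible combination, shown here purely as an assumption, is a sample-wise sum of the three renderings.

```python
def combine_binaural(out1, out2, out3):
    # One plausible combination of the direct (out1), early-reflection (out2)
    # and late-reflection (out3) binaural renderings: a sample-wise sum.
    # The three signals are assumed to be equal in length here.
    return [a + b + c for a, b, c in zip(out1, out2, out3)]
```

A practical implementation would pad the shorter renderings before summing, since the three RIRs generally differ in length.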
  • S704 ⁇ S707, S708 ⁇ S711 and S712 ⁇ S715 can be executed in parallel or serially.
  • Figure 7b is a schematic diagram illustrating an audio processing process.
• Among them, sound effect processing can be performed on the first binaural signal, the second binaural signal and the third binaural signal, to modify the audio.
  • correspondence relationships between multiple spatial scenes and multiple sound effect parameter groups can be established in advance to obtain a preset relationship.
  • the preset relationships may include: cinema—sound effect parameter group 1, concert hall—sound effect parameter group 2, recording studio—sound effect parameter group 3, and KTV—sound effect parameter group 4.
  • Each sound effect parameter group may include multiple sound effect parameters.
  • a sound effect parameter group matching the scene parameters can be determined according to the preset relationship.
  • the sound effect parameter group matching the scene parameters may include direct sound effect parameters (sound effect parameter 1), early reflection sound effect parameters (sound effect parameter 2), late reflection sound effect parameters (sound effect parameter 3) and the first mixing sound effect parameter (sound effect parameter 4).
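The scene-to-parameter-group preset relationship described above is essentially a lookup table. A hedged sketch, where the scene names follow the text but every numeric value is a placeholder, not a value from the embodiment:

```python
# Illustrative sketch of the preset relationship between spatial scenes and
# sound-effect parameter groups. Numeric values are placeholders.

PRESET_RELATION = {
    "cinema":       {"direct": 1.0, "early": 0.8, "late": 0.7, "mix": 0.9},
    "concert hall": {"direct": 0.9, "early": 0.9, "late": 0.9, "mix": 0.8},
    "studio":       {"direct": 1.0, "early": 0.5, "late": 0.3, "mix": 1.0},
    "ktv":          {"direct": 0.8, "early": 0.7, "late": 0.6, "mix": 0.9},
}

def match_parameter_group(scene):
    """Return the sound-effect parameter group matching the scene parameter."""
    return PRESET_RELATION[scene]

group = match_parameter_group("cinema")
```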
  • In step S717, sound effect processing is performed on the first binaural signal according to sound effect parameter 1 to obtain audio signal 1.
  • S717 can be executed after S707 and before S716, that is, after obtaining the first binaural signal, sound effect processing can be performed on the first binaural signal to obtain the audio signal 1.
  • For details, refer to the description of S306 above, which will not be described again here.
  • step S718 perform sound effect processing on the second binaural signal according to the sound effect parameter 2 to obtain the audio signal 2.
  • S718 is executed after S711 and before S716, that is, after obtaining the second binaural signal, sound effect processing can be performed on the second binaural signal to obtain the audio signal 2.
  • For details, refer to the description of S307 above, which will not be described again here.
  • step S719 sound effect processing is performed on the third binaural signal according to the sound effect parameter 3 to obtain the audio signal 3.
  • S719 is executed after S715 and before S716, that is, after obtaining the third binaural signal, sound effect processing can be performed on the third binaural signal to obtain audio signal 3.
  • S308 For details, reference may be made to the description of S308 above, which will not be described again here.
  • Exemplarily, S716 may include S716a and S716b, wherein:
  • S716a Mix audio signal 1, audio signal 2 and audio signal 3 to obtain audio signal 4.
  • S716b Perform sound effect processing on the audio signal 4 according to the sound effect parameter 4 to obtain the target binaural signal.
  • S716a and S716b may refer to S305a and S305b; specifically, refer to the above description, which will not be described again here.
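Steps S716a/S716b can be sketched as a sample-wise mix followed by a sound-effect stage. Here a plain gain stands in for sound effect parameter 4; the gain value and names are illustrative assumptions.

```python
# Sketch of S716a/S716b: audio signals 1-3 are mixed sample-by-sample into
# audio signal 4, then a simple sound-effect stage (a plain gain standing in
# for sound effect parameter 4) yields the target binaural signal.

def mix(*signals):
    return [sum(samples) for samples in zip(*signals)]

def apply_sound_effect(signal, gain):
    return [gain * s for s in signal]

audio1 = [0.1, 0.2]
audio2 = [0.3, 0.1]
audio3 = [0.0, 0.2]
audio4 = mix(audio1, audio2, audio3)           # S716a
target = apply_sound_effect(audio4, gain=0.5)  # S716b
```

A real sound-effect stage would apply equalization, reverberation or dynamics processing rather than a single gain; the structure (mix, then process) is the point.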
  • the audio segments after the initial audio segment in the source audio signal can be subjected to direct sound rendering, early reflection sound rendering and late reflection sound rendering to achieve the spatial audio processing in S603. That is, according to the settings, the audio segments after the initial audio segment in the source audio signal are respectively subjected to direct sound rendering, early reflection sound rendering and late reflection sound rendering to obtain the target binaural signal. In this way, in any type of spatial scene, the sound image position, the sense of space and the sense of distance of the audio can be restored with high precision, thereby achieving a more realistic and immersive binaural rendering effect.
  • the following describes the process of performing direct sound rendering, early reflected sound rendering, and late reflected sound rendering on the audio segments after the initial audio segment in the source audio signal according to the settings.
  • Figure 8a is a schematic diagram illustrating an audio processing process.
  • a method of performing direct sound rendering, early reflected sound rendering and late reflected sound rendering respectively on the audio segments after the initial audio segment in the source audio signal is described.
  • the source audio signal is a media file.
  • the rendering effect options include: a sound image position option, a distance sense option and a space sense option.
  • S801 to S803 may refer to the description of S701 to S703, which will not be described again.
  • S805 Select a candidate early reflection sound RIR from the preset early reflection sound RIR library, and determine the distance sense correction factor according to the distance sense parameter.
  • S806 Modify the candidate early reflection sound RIR according to the distance correction factor to obtain the target early reflection sound RIR.
  • S807 Perform early reflection sound rendering on the audio segment after the initial audio segment in the source audio signal according to the target early reflection sound RIR to obtain the second binaural signal.
  • S804 to S807 may refer to the description of S708 to S711 above, which will not be described again.
  • S809 Select candidate late reflection sound RIR from the preset late reflection sound RIR library, and determine the spatial perception correction factor according to the spatial perception parameter.
  • S810 Modify the candidate late reflection sound RIR according to the spatial sense correction factor to obtain the target late reflection sound RIR.
  • S811 Perform late reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the target late reflection sound RIR to obtain a third binaural signal.
  • S808 to S811 may refer to the description of S712 to S715 above, which will not be described again here.
  • S812 Mix the second binaural signal and the third binaural signal to obtain a fourth binaural signal.
  • S812 may refer to the description of S405 above, which will not be described again here.
  • S814 Select a candidate direct sound RIR from the preset direct sound RIR library, and determine the sound image position correction factor according to the sound image position parameter.
  • S815 Modify the candidate direct sound RIR according to the sound image position correction factor to obtain the target direct sound RIR.
  • S816 Perform direct sound rendering on the fourth binaural signal according to the target direct sound RIR to obtain the fifth binaural signal.
  • S813 to S816 may refer to the description of S704 to S707 above, and will not be described again.
  • S817 Determine the target binaural signal based on the fifth binaural signal.
  • S817 may refer to the description of S407 above, which will not be described again here.
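The serial flow of Figure 8a (early and late reflections mixed first, then direct-sound rendering applied to the mix) can be sketched end to end. All RIRs below are toy placeholders, and the step labels in the comments map to the steps above.

```python
# End-to-end sketch of the Figure 8a flow: early- and late-reflection
# renderings of the source segment are mixed into the fourth binaural
# signal, which is then direct-sound rendered (convolved with the target
# direct-sound RIR, e.g. an HRIR) to give the fifth binaural signal.

def convolve(signal, rir):
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, r in enumerate(rir):
            out[i + j] += s * r
    return out

source = [1.0, 0.0]
second = convolve(source, [0.5])                  # early reflections (S807)
third = convolve(source, [0.25])                  # late reflections (S811)
fourth = [a + b for a, b in zip(second, third)]   # mix (S812)
fifth = convolve(fourth, [1.0, 0.5])              # direct sound rendering (S816)
```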
  • Figure 8b is a schematic diagram illustrating an audio processing process.
  • In Figure 8a, after the second binaural signal, the third binaural signal, the fourth binaural signal and the fifth binaural signal are obtained, sound effect processing may be performed on them to modify the audio.
  • a sound effect parameter group matching the spatial scene parameters is determined; reference can be made to the above description, which will not be described again here.
  • the sound effect parameter group matching the spatial scene parameters may include: direct sound effect parameter (sound effect parameter 1), early reflection sound effect parameter (sound effect parameter 2), late reflection sound effect parameter (sound effect parameter 3) and The second mixing sound effect parameter (sound effect parameter 5).
  • In step S818, sound effect processing is performed on the second binaural signal according to sound effect parameter 2 to obtain audio signal 2.
  • S818 is executed after S807 and before S813, that is, after the second binaural signal is obtained, the Perform sound effect processing on the second binaural signal to obtain audio signal 2.
  • For details, refer to the description of S307 above, which will not be described again here.
  • step S819 sound effect processing is performed on the third binaural signal according to the sound effect parameter 3 to obtain the audio signal 3.
  • S819 is executed after S812 and before S813, that is, after the third binaural signal is obtained, sound effect processing can be performed on the third binaural signal to obtain the audio signal 3.
  • For details, refer to the description of S308 above, which will not be described again here.
  • S813 may include mixing audio signal 2 and audio signal 3 to obtain a fourth binaural signal.
  • In step S820, sound effect processing is performed on the fourth binaural signal according to sound effect parameter 5 to obtain audio signal 6.
  • S820 is executed after S813 and before S816, that is, after the fourth binaural signal is obtained, sound effect processing can be performed on the fourth binaural signal to obtain the audio signal 6.
  • For details, refer to the description of S409 above, which will not be described again here.
  • the above-mentioned S817 may include S817_X, wherein in S817_X sound effect processing is performed on the fifth binaural signal according to sound effect parameter 1 to obtain the target binaural signal.
  • rendering effect parameters can be adjusted according to the setting operations for some of the rendering effect options; for rendering effect options on which the user has performed no setting operation, rendering can be performed with the default parameter values of the corresponding rendering effect parameters.
  • the sound image position parameter can be adjusted according to the setting operation for the sound image position option; the audio segments after the initial audio segment in the source audio signal can then be subjected to direct sound rendering according to the adjusted parameter value of the sound image position parameter to obtain the first binaural signal (or the fourth binaural signal can be subjected to direct sound rendering according to the adjusted parameter value to obtain the fifth binaural signal), and the subsequent audio segments are rendered with late reflections to obtain the third binaural signal.
  • the distance sense parameter can be adjusted according to the setting operation for the distance sense option, and the audio segments after the initial audio segment in the source audio signal are rendered with early reflections according to the adjusted parameter value of the distance sense parameter to obtain the second binaural signal.
  • direct sound rendering is performed on the audio segments after the initial audio segment in the source audio signal according to the default value of the sound image position parameter to obtain the first binaural signal (or direct sound rendering is performed on the fourth binaural signal according to the default value of the sound image position parameter to obtain the fifth binaural signal), and late reflection sound rendering is performed on the audio segments after the initial audio segment in the source audio signal according to the default value of the spatial sense parameter to obtain the third binaural signal.
  • the spatial sense parameter can be adjusted according to the setting operation for the sense of space option, and the audio segments after the initial audio segment in the source audio signal are rendered with late reflections according to the adjusted parameter value of the spatial sense parameter to obtain the third binaural signal.
  • the audio processing method provided by this application can be applied to headphones.
  • the headset can obtain the source audio signal and the audio processing parameters from the mobile terminal connected to it (where the audio processing parameters may refer to parameters used for spatial audio processing and may include rendering effect parameters, sound effect parameter groups, correction factors, etc.), then perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the audio processing parameters to obtain the target binaural signal, and then play the target binaural signal.
  • when the headset is equipped with sensors (such as gyroscopes and inertial sensors) for collecting head movement information, the headset can determine the user's head position information based on the collected head movement information, and then perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the audio processing parameters and the head position information to obtain the target binaural signal.
  • the audio processing method provided by this application can be applied to a mobile terminal.
  • the mobile terminal can obtain the source audio signal and the audio processing parameters through interaction with the user, and then perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the audio processing parameters to obtain the target binaural signal.
  • After the mobile terminal obtains the target binaural signal, it can send the target binaural signal to the earphone connected to the mobile terminal, and the earphone plays the target binaural signal.
  • the mobile terminal can obtain the head position information from the headset connected to the mobile terminal, and then perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the audio processing parameters and the head position information to obtain Target binaural signals.
  • the audio processing method provided by this application can be applied to VR equipment.
  • the VR device can obtain the source audio signal and the audio processing parameters based on interaction with the user, and then perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the audio processing parameters to obtain the target binaural signal. Then, the VR device can play the target binaural signal (or send the target binaural signal to the headset, which plays it).
  • when the VR device is equipped with sensors (such as gyroscopes and inertial sensors) for collecting head movement information, the VR device can determine the user's head position information based on the collected head movement information, and then perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the audio processing parameters and the head position information to obtain the target binaural signal. (Alternatively, the head position information can be obtained from the headset connected to the VR device, and spatial audio processing is then performed on the audio segments after the initial audio segment in the source audio signal according to the audio processing parameters and the head position information to obtain the target binaural signal.)
  • FIG. 9 is a schematic diagram of an audio processing system provided by an embodiment of the present application.
  • the audio processing system includes a mobile terminal 901 and an earphone 902 connected to the mobile terminal 901, wherein:
  • the mobile terminal 901 is used to execute the audio processing method of the above embodiments and send the target binaural signal to the earphone 902;
  • the earphone 902 is used to play the target binaural signal.
  • the headset 902 is used to collect the user's head movement information, determine the user's head position information based on the head movement information; and send the head position information to the mobile terminal;
  • the mobile terminal 901 is configured to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings and head position information to obtain the target binaural signal.
  • FIG. 10 shows a schematic block diagram of a device 1000 according to an embodiment of the present application.
  • the device 1000 may include: a processor 1001 and a transceiver/transceiver pin 1002, and optionally, a memory 1003.
  • the bus 1004 includes a power bus, a control bus and a status signal bus in addition to a data bus. However, for clarity of description, the various buses are all referred to as bus 1004 in the figure.
  • the memory 1003 may be used to store instructions in the foregoing method embodiments.
  • the processor 1001 can be used to execute instructions in the memory 1003, and control the receiving pin to receive signals, and control the transmitting pin to send signals.
  • the device 1000 may be the electronic device or a chip of the electronic device in the above method embodiment.
  • This embodiment also provides a computer-readable storage medium.
  • Computer instructions are stored in the computer-readable storage medium.
  • When the computer instructions are run on an electronic device, the electronic device is caused to execute the above related method steps to implement the audio processing method in the above embodiments.
  • This embodiment also provides a computer program product.
  • When the computer program product is run on a computer, the computer is caused to perform the above related steps to implement the audio processing method in the above embodiments.
  • The embodiments of the present application also provide a device.
  • This device may be a chip, a component or a module.
  • the device may include a connected processor and a memory.
  • the memory is used to store computer execution instructions.
  • the processor can execute computer execution instructions stored in the memory, so that the chip executes the audio processing method in each of the above method embodiments.
  • the electronic devices, computer-readable storage media, computer program products and chips provided in this embodiment are all used to execute the corresponding methods provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding methods provided above, which will not be described again here.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of modules or units is only a logical function division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not implemented.
  • the coupling, direct coupling or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separate.
  • a component shown as a unit may be one physical unit or multiple physical units; that is, it may be located in one place, or it may be distributed across multiple different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • If the integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solutions, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions to cause a device (which can be a microcontroller, a chip, etc.) or a processor to execute all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc and other media that can store program code.
  • the steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application can be implemented in hardware or by a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules.
  • Software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM) or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage media may be located in an ASIC.
  • Computer-readable media includes computer-readable storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • Storage media can be any available media that can be accessed by a general purpose or special purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stereophonic System (AREA)

Abstract

The embodiments of the present application provide an audio processing method, system and electronic device. The method includes: in response to a user's play operation, performing spatial audio processing on an initial audio segment in a source audio signal to obtain an initial binaural signal, and playing the initial binaural signal; receiving the user's settings for rendering effect options, the rendering effect options including at least one of the following: a sound image position option, a distance sense option or a space sense option; and, according to the settings, performing spatial audio processing on the audio segments after the initial audio segment in the source audio signal to obtain a target binaural signal. In this way, during playback of the audio signal, the rendering effect of the binaural signal corresponding to the source audio signal can be continuously adjusted according to the user's settings for the rendering effect.

Description

Audio processing method, system and electronic device
This application claims priority to the Chinese patent application No. 202210813749.2, entitled "Audio processing method, system and electronic device", filed with the China National Intellectual Property Administration on July 12, 2022, the entire contents of which are incorporated herein by reference. This application also claims priority to the Chinese patent application No. 202310127907.3, entitled "Audio processing method, system and electronic device", filed with the China National Intellectual Property Administration on January 30, 2023, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of data processing, and in particular to an audio processing method, system and electronic device.
Background
Spatial audio technology can render sound sources of different formats into binaural signals, so that a user wearing headphones can perceive the sound image position, the sense of distance and the sense of space in the audio, bringing an immersive listening experience to users of headphones.
For the same binaural signal, different users have different perceptions and preferences. For example, when user A listens to the binaural signal, the perceived sound source position is above the horizontal plane of the two ears and needs to be adjusted to that plane; when user B listens to the binaural signal, the perceived sense of distance and sense of space are weak, and the user wishes to enhance them; and so on. However, the rendering effect of the binaural signals obtained by rendering in the prior art cannot be adjusted.
Summary
To solve the above technical problems, the present application provides an audio processing method, system and electronic device. In this method, the rendering effect of a binaural signal can be adjusted according to the user's settings for the rendering effect.
In a first aspect, an embodiment of the present application provides an audio processing method, including: first, in response to a user's play operation, performing spatial audio processing on an initial audio segment in a source audio signal to obtain an initial binaural signal, and playing the initial binaural signal; then, receiving the user's settings for rendering effect options, the rendering effect options including at least one of the following: a sound image position option, a distance sense option or a space sense option; and then, according to the settings, continuing to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal to obtain a target binaural signal.
That is, after the user performs a play operation on the source audio signal, in response to the play operation, spatial audio processing may first be performed on the initial audio segment in the source audio signal according to the system's settings for the rendering effect options and/or the user's historical settings for the rendering effect options, to obtain and play the initial binaural signal. While the initial binaural signal is being played (i.e. while the user is listening to it), if the user determines that the rendering effect does not meet his or her needs, the user can set the rendering effect options; at this time, spatial audio processing can be continued on the audio segments after the initial audio segment in the source audio signal according to this setting, to obtain the target binaural signal.
It should be understood that, after the target binaural signal is obtained, it can be played. While the target binaural signal is being played (i.e. while the user is listening to it), if the user determines that the rendering effect does not meet his or her needs, the user can set the rendering effect options again; at this time, according to the new settings, spatial audio processing can be continued on the audio segments following the segment most recently processed, to obtain a new target binaural signal; and so on.
In this way, during playback of the audio signal, the rendering effect of the binaural signal corresponding to the source audio signal can be continuously adjusted according to the user's settings for the rendering effect, i.e. "adjust while listening", thereby improving the user experience.
In addition, personalized spatial audio processing can be performed on the source audio signal according to the user's personalized spatial audio effect requirements, to obtain a target binaural signal for headphone playback, thereby meeting the user's personalized requirements for spatial audio effects.
In a possible manner, the system can set the rendering effect options according to the user's personal information. For example, the system can analyze the user's head type, preferred rendering effect, etc. according to the user's personal information, and set the rendering effect options accordingly.
In a possible manner, the system can apply default settings to the rendering effect options.
Exemplarily, the source audio signal is a media file. The source audio signal may be the audio signal of a song, the audio signal of an audiobook, the audio signal contained in a video, etc.; this application does not limit this.
Exemplarily, both the target binaural signal and the initial binaural signal may include one signal for playback by the left earphone and one signal for playback by the right earphone.
It should be understood that, in addition to the above rendering effect options, this application may also include other rendering effect options; this application does not limit this.
Exemplarily, the sound image position option is used to adjust the sound image position in the target binaural signal, where the sound image position may refer to the orientation of the sound relative to the center of the head as subjectively perceived by the user.
Exemplarily, the distance sense option is used to adjust the sense of distance of the sound image in the target binaural signal, where the sense of distance may refer to the distance of the sound relative to the center of the head as subjectively perceived by the user.
Exemplarily, the space sense option is used to adjust the sense of space of the target binaural signal, where the sense of space may refer to the size of the acoustic environment as subjectively perceived by the user.
According to the first aspect, when the rendering effect options include the sound image position option, continuing to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings to obtain the target binaural signal includes: adjusting the sound image position parameter according to the setting for the sound image position option; performing direct sound rendering on the audio segments after the initial audio segment in the source audio signal according to the sound image position parameter to obtain the first binaural signal; and determining the target binaural signal according to the first binaural signal. In this way, the sound image position of the target binaural signal can be adjusted according to the user's personalized settings for the sound image position.
According to the first aspect or any implementation of the first aspect above, when the rendering effect options include the distance sense option, continuing to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings to obtain the target binaural signal includes: adjusting the distance sense parameter according to the setting for the distance sense option; performing early reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the distance sense parameter to obtain the second binaural signal; and determining the target binaural signal according to the second binaural signal. In this way, the sense of distance of the sound image in the target binaural signal can be adjusted according to the user's personalized settings for the sense of distance.
According to the first aspect or any implementation of the first aspect above, when the rendering effect options include the space sense option, continuing to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings to obtain the target binaural signal includes: adjusting the space sense parameter according to the setting for the space sense option; performing late reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the space sense parameter to obtain the third binaural signal; and determining the target binaural signal according to the third binaural signal. In this way, the sense of space of the target binaural signal can be adjusted according to the user's personalized settings for the sense of space.
According to the first aspect or any implementation of the first aspect above, when the rendering effect options further include the sound image position option and the space sense option, continuing to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings to obtain the target binaural signal further includes: adjusting the sound image position parameter according to the setting for the sound image position option, and performing direct sound rendering on the audio segments after the initial audio segment in the source audio signal according to the sound image position parameter to obtain the first binaural signal; adjusting the space sense parameter according to the setting for the space sense option, and performing late reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the space sense parameter to obtain the third binaural signal. Determining the target binaural signal according to the second binaural signal includes: mixing the first binaural signal, the second binaural signal and the third binaural signal to obtain the target binaural signal. In this way, the sound image position, the distance of the sound image and the sense of space in the target binaural signal can be adjusted according to the user's personalized settings for the acoustic position, the sense of distance and the sense of space.
In addition, the direct sound part of the target binaural signal affects the user's perception of the sound image position, the early reflection part of the target binaural signal affects the user's perception of the sound image distance, and the late reflection part of the target binaural signal affects the user's perception of the acoustic environment. Therefore, by performing direct sound rendering, early reflection sound rendering and late reflection sound rendering respectively on the audio segments after the initial audio segment in the source audio signal, this application can restore the sound image position, the sense of distance and the sense of space with high precision, thereby achieving a more realistic and immersive binaural rendering effect.
Exemplarily, the direct sound part of the target binaural signal refers to the part of the source audio signal that reaches the ear via the direct path (i.e. propagates to the ear in a straight line without any reflection); the early reflection part of the target binaural signal refers to the earlier part of the source audio signal that reaches the ear via reflection paths; and the late reflection part of the target binaural signal refers to the later part of the source audio signal that reaches the ear via reflection paths.
According to the first aspect or any implementation of the first aspect above, when the rendering effect options further include the sound image position option and the space sense option, continuing to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings to obtain the target binaural signal further includes: adjusting the space sense parameter according to the setting for the space sense option, and performing late reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the space sense parameter to obtain the third binaural signal. Determining the target binaural signal according to the second binaural signal includes: mixing the second binaural signal and the third binaural signal to obtain the fourth binaural signal; adjusting the sound image position parameter according to the setting for the sound image position option, and performing direct sound rendering on the fourth binaural signal according to the sound image position parameter to obtain the fifth binaural signal; and determining the target binaural signal according to the fifth binaural signal. In this way, the sound image position, the distance of the sound image and the sense of space in the target binaural signal can be adjusted according to the user's personalized settings for the acoustic position, the sense of distance and the sense of space.
In addition, by performing direct sound rendering, early reflection sound rendering and late reflection sound rendering respectively on the audio segments after the initial audio segment in the source audio signal, this application can restore the sound image position, the sense of distance and the sense of space with high precision, thereby achieving a more realistic and immersive binaural rendering effect.
According to the first aspect or any implementation of the first aspect above, performing direct sound rendering on the audio segments after the initial audio segment in the source audio signal according to the sound image position parameter to obtain the first binaural signal includes: selecting a candidate direct sound RIR from a preset direct sound RIR (Room Impulse Response) library, and determining a sound image position correction factor according to the sound image position parameter; modifying the candidate direct sound RIR according to the sound image position correction factor to obtain the target direct sound RIR; and performing direct sound rendering on the audio segments after the initial audio segment in the source audio signal according to the target direct sound RIR to obtain the first binaural signal.
According to the first aspect or any implementation of the first aspect above, the direct sound RIR library includes multiple first sets, one first set corresponding to one head type, and a first set including preset direct sound RIRs for multiple positions. Selecting a candidate direct sound RIR from the preset direct sound RIR library includes: selecting a first target set from the multiple first sets according to the user's head type; and selecting a candidate direct sound RIR from the first target set according to the user's head position information, the position information of the source audio signal and the position information of the preset direct sound RIRs in the first target set. In this way, head-motion-tracking rendering can be achieved.
According to the first aspect or any implementation of the first aspect above, before receiving the user's settings for the rendering effect options, the method further includes: obtaining a selection of a target scene option, and displaying the rendering effect options corresponding to the target scene option, where one target scene option corresponds to one spatial scene. In this way, the spatial scene for binaural signal playback can be set, increasing the diversity of spatial audio effect settings.
Exemplarily, obtaining the selection of the target scene option may include receiving the user's selection operation on the target scene option. This can provide the user with a choice of spatial scene for binaural signal playback, increasing the diversity of spatial audio effect settings. In addition, different target scene options correspond to different rendering effect options, so the user can set different rendering effects for different spatial scenes, achieving fine-grained adjustment of spatial audio effects.
Exemplarily, obtaining the selection of the target scene option may be a selection of the target scene option by the system of the electronic device. Exemplarily, the system may select the target scene according to the user's personal information, for example, by analyzing the user's preferred spatial scene, etc.
Exemplarily, the target scene option may include any of the following: a cinema option, a recording studio option, a concert hall option, a KTV (Karaoke TV) option, etc. The spatial scene corresponding to the cinema option is a cinema, the spatial scene corresponding to the recording studio option is a recording studio, the spatial scene corresponding to the concert hall option is a concert hall, and the spatial scene corresponding to the KTV option is a KTV.
It should be understood that the target scene option may also be other options; this application does not limit this.
According to the first aspect or any implementation of the first aspect above, performing early reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the distance sense parameter to obtain the second binaural signal includes: selecting a candidate early reflection sound RIR from a preset early reflection sound RIR library, and determining a distance sense correction factor according to the distance sense parameter; modifying the candidate early reflection sound RIR according to the distance sense correction factor to obtain the target early reflection sound RIR; and performing early reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the target early reflection sound RIR to obtain the second binaural signal.
According to the first aspect or any implementation of the first aspect above, the early reflection sound RIR library includes multiple second sets, one second set corresponding to one spatial scene, and a second set including preset early reflection sound RIRs for multiple positions. Selecting a candidate early reflection sound RIR from the preset early reflection sound RIR library includes: selecting a second target set from the multiple second sets according to the spatial scene parameter corresponding to the target scene option; and selecting a candidate early reflection sound RIR from the second target set according to the user's head position information, the position information of the source audio signal and the position information of the preset early reflection sound RIRs in the second target set. In this way, head-motion-tracking rendering can be achieved.
According to the first aspect or any implementation of the first aspect above, performing late reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the space sense parameter to obtain the third binaural signal includes: selecting a candidate late reflection sound RIR from a preset late reflection sound RIR library, and determining a space sense correction factor according to the space sense parameter; modifying the candidate late reflection sound RIR according to the space sense correction factor to obtain the target late reflection sound RIR; and performing late reflection sound rendering on the audio segments after the initial audio segment in the source audio signal according to the target late reflection sound RIR to obtain the third binaural signal.
According to the first aspect or any implementation of the first aspect above, the late reflection sound RIR library includes multiple third sets, one third set corresponding to one spatial scene, and a third set including preset late reflection sound RIRs for multiple positions. Selecting a candidate late reflection sound RIR from the preset late reflection sound RIR library includes: selecting a third target set from the multiple third sets according to the spatial scene parameter corresponding to the target scene option; and selecting a candidate late reflection sound RIR from the third target set according to the user's head position information, the position information of the source audio signal and the position information of the preset late reflection sound RIRs in the third target set. In this way, head-motion-tracking rendering can be achieved.
According to the first aspect or any implementation of the first aspect above, determining the target binaural signal based on the first binaural signal, the second binaural signal and the third binaural signal includes: determining, according to a preset relationship, a sound effect parameter group matching the spatial scene parameter, the preset relationship including relationships between multiple spatial scenes and multiple sound effect parameter groups, and the sound effect parameter group matching the spatial scene parameter including: direct sound effect parameters, early reflection sound effect parameters and late reflection sound effect parameters; performing sound effect processing on the first binaural signal according to the direct sound effect parameters, performing sound effect processing on the second binaural signal according to the early reflection sound effect parameters, and performing sound effect processing on the third binaural signal according to the late reflection sound effect parameters; and determining the target binaural signal based on the sound-effect-processed first binaural signal, the sound-effect-processed second binaural signal and the sound-effect-processed third binaural signal. In this way, the audio signal can be modified.
According to the first aspect or any implementation of the first aspect above, the source audio signal includes at least one of the following formats: a multi-channel format, a multi-object format and an Ambisonics format.
Exemplarily, the Ambisonics format refers to a spherical-harmonic surround sound field format.
According to the first aspect or any implementation of the first aspect above, the target direct sound RIR is an HRIR (Head Related Impulse Response).
According to the first aspect or any implementation of the first aspect above, the target early reflection sound RIR is an HOA (High-Order Ambisonics) RIR. Compared with the prior art, which requires multiple acquisition passes to record RIRs in all directions in order to produce the early reflection sound RIR, this application uses a spherical microphone to complete the recording of RIRs in all directions in a single acquisition, which can reduce the workload of producing the early reflection sound RIR.
According to the first aspect or any implementation of the first aspect above, the target late reflection sound RIR is an HOA RIR. Compared with the prior art, which requires multiple acquisition passes to record RIRs in all directions in order to produce the late reflection sound RIR, this application uses a spherical microphone to complete the recording of RIRs in all directions in a single acquisition, which can reduce the workload of producing the late reflection sound RIR.
According to the first aspect or any implementation of the first aspect above, the audio processing method is applied to a headset, and the head position information is determined according to the user's head movement information collected by the headset; or, the audio processing method is applied to a mobile terminal, and the head position information is obtained from a headset connected to the mobile terminal; or, the audio processing method is applied to a VR (Virtual Reality) device, and the head position information is determined according to the user's head movement information collected by the VR device.
It should be understood that, for the implementation and effects of performing spatial audio processing on the initial audio segment in the source audio signal to obtain the initial binaural signal, reference may be made to the implementation and effects, described in any implementation of the first aspect, of continuing to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings to obtain the target binaural signal, which will not be repeated here.
In a second aspect, an embodiment of the present application provides an audio processing method, including: obtaining a source audio signal to be processed; and performing direct sound rendering, early reflection sound rendering and late reflection sound rendering respectively on the source audio signal to obtain a binaural signal. Since the direct sound part of the binaural signal affects the user's perception of the sound image position, the early reflection part affects the user's perception of the sound image distance, and the late reflection part affects the user's perception of the acoustic environment, by performing direct sound rendering, early reflection sound rendering and late reflection sound rendering respectively on the source audio signal, this application can restore the sound image position, the sense of distance and the sense of space with high precision, thereby achieving a more realistic and immersive binaural rendering effect.
According to the second aspect, performing direct sound rendering, early reflection sound rendering and late reflection sound rendering respectively on the source audio signal to obtain the binaural signal includes: performing direct sound rendering on the source audio signal to obtain the first binaural signal; performing early reflection sound rendering on the source audio signal to obtain the second binaural signal; performing late reflection sound rendering on the source audio signal to obtain the third binaural signal; and determining the binaural signal based on the first binaural signal, the second binaural signal and the third binaural signal.
According to the second aspect or any implementation of the second aspect above, performing direct sound rendering, early reflection sound rendering and late reflection sound rendering respectively on the source audio signal to obtain the binaural signal includes: performing early reflection sound rendering on the source audio signal to obtain the second binaural signal; performing late reflection sound rendering on the source audio signal to obtain the third binaural signal; mixing the second binaural signal and the third binaural signal to obtain the fourth binaural signal; performing direct sound rendering on the fourth binaural signal to obtain the fifth binaural signal; and determining the binaural signal based on the fifth binaural signal.
According to the second aspect or any implementation of the second aspect above, the room impulse response RIR used for direct sound rendering is a head-related impulse response HRIR; the RIR used for early reflection sound rendering is an HOA RIR; and the RIR used for late reflection sound rendering is an HOA RIR. Compared with the prior art, which requires multiple acquisition passes to record RIRs in all directions in order to produce the early/late reflection sound RIRs, this application uses a spherical microphone to complete the recording of RIRs in all directions in a single acquisition, which can reduce the workload of producing the early/late reflection sound RIRs.
According to the second aspect or any implementation of the second aspect above, the source audio signal includes at least one of the following formats: a multi-channel format, a multi-object format and an Ambisonics format.
It should be understood that the source audio signal to be processed in the second aspect and any implementation of the second aspect may refer to the initial audio segment of the source audio signal in the first aspect and any implementation of the first aspect; the binaural signal in the second aspect and any implementation of the second aspect may refer to the initial binaural signal.
It should be understood that the source audio signal to be processed in the second aspect and any implementation of the second aspect may also refer to the audio segments after the initial audio segment of the source audio signal in the first aspect and any implementation of the first aspect; in this case, the binaural signal in the second aspect and any implementation of the second aspect may refer to the target binaural signal.
In a third aspect, the present application provides an audio processing system, including a mobile terminal and a headset connected to the mobile terminal, wherein:
the mobile terminal is configured to: in response to a user's play operation, perform spatial audio processing on the initial audio segment in the source audio signal to obtain an initial binaural signal and play the initial binaural signal, the source audio signal being a media file; receive the user's settings for rendering effect options, the rendering effect options including at least one of the following: a sound image position option, a distance sense option or a space sense option; continue to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings to obtain a target binaural signal; and send the target binaural signal to the headset;
the headset is configured to play the target binaural signal.
According to the third aspect, the headset is further configured to collect the user's head movement information, determine the user's head position information according to the head movement information, and send the head position information to the mobile terminal;
the mobile terminal is specifically configured to continue to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal according to the settings and the head position information, to obtain the target binaural signal.
Exemplarily, the mobile terminal of the third aspect may be configured to execute the audio processing method in the first aspect and any implementation of the first aspect above.
Exemplarily, the mobile terminal of the third aspect may also be configured to execute the audio processing method in the second aspect and any implementation of the second aspect above; this application does not limit this.
In a fourth aspect, an embodiment of the present application provides a mobile terminal configured to execute the audio processing method in the first aspect or any implementation of the first aspect above.
The fourth aspect and any implementation of the fourth aspect correspond, respectively, to the first aspect and any implementation of the first aspect. For the technical effects corresponding to the fourth aspect and any implementation of the fourth aspect, reference may be made to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
In a fifth aspect, an embodiment of the present application provides a mobile terminal configured to execute the audio processing method in the second aspect or any implementation of the second aspect above.
The fifth aspect and any implementation of the fifth aspect correspond, respectively, to the second aspect and any implementation of the second aspect. For the technical effects corresponding to the fifth aspect and any implementation of the fifth aspect, reference may be made to the technical effects corresponding to the second aspect and any implementation of the second aspect, which will not be repeated here.
In a sixth aspect, an embodiment of the present application provides a headset configured to execute the audio processing method in the first aspect or any implementation of the first aspect above.
The sixth aspect and any implementation of the sixth aspect correspond, respectively, to the first aspect and any implementation of the first aspect. For the technical effects corresponding to the sixth aspect and any implementation of the sixth aspect, reference may be made to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
In a seventh aspect, an embodiment of the present application provides a headset configured to execute the audio processing method in the second aspect or any implementation of the second aspect above.
The seventh aspect and any implementation of the seventh aspect correspond, respectively, to the second aspect and any implementation of the second aspect. For the technical effects corresponding to the seventh aspect and any implementation of the seventh aspect, reference may be made to the technical effects corresponding to the second aspect and any implementation of the second aspect, which will not be repeated here.
In an eighth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, the memory being coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the electronic device to execute the audio processing method in the first aspect or any possible implementation of the first aspect.
The eighth aspect and any implementation of the eighth aspect correspond, respectively, to the first aspect and any implementation of the first aspect. For the technical effects corresponding to the eighth aspect and any implementation of the eighth aspect, reference may be made to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
In a ninth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, the memory being coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the electronic device to execute the audio processing method in the second aspect or any possible implementation of the second aspect.
The ninth aspect and any implementation of the ninth aspect correspond, respectively, to the second aspect and any implementation of the second aspect. For the technical effects corresponding to the ninth aspect and any implementation of the ninth aspect, reference may be made to the technical effects corresponding to the second aspect and any implementation of the second aspect, which will not be repeated here.
In a tenth aspect, an embodiment of the present application provides a chip, including one or more interface circuits and one or more processors; the interface circuit is configured to receive signals from a memory of an electronic device and send the signals to the processor, the signals including computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is caused to execute the audio processing method in the first aspect or any possible implementation of the first aspect.
The tenth aspect and any implementation of the tenth aspect correspond, respectively, to the first aspect and any implementation of the first aspect. For the technical effects corresponding to the tenth aspect and any implementation of the tenth aspect, reference may be made to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
In an eleventh aspect, an embodiment of the present application provides a chip, including one or more interface circuits and one or more processors; the interface circuit is configured to receive signals from a memory of an electronic device and send the signals to the processor, the signals including computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is caused to execute the audio processing method in the second aspect or any possible implementation of the second aspect.
The eleventh aspect and any implementation of the eleventh aspect correspond, respectively, to the second aspect and any implementation of the second aspect. For the technical effects corresponding to the eleventh aspect and any implementation of the eleventh aspect, reference may be made to the technical effects corresponding to the second aspect and any implementation of the second aspect, which will not be repeated here.
第十二方面,本申请实施例提供一种计算机可读存储介质,计算机可读存储介质存储有计算机程序,当计算机程序运行在计算机或处理器上时,使得计算机或处理器执行第一方面或第一方面的任意可能的实现方式中的音频处理方法。
第十二方面以及第十二方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第十二方面以及第十二方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第十三方面,本申请实施例提供一种计算机可读存储介质,计算机可读存储介质存储有计算机程序,当计算机程序运行在计算机或处理器上时,使得计算机或处理器执行第二方面或第二方面的任意可能的实现方式中的音频处理方法。
第十三方面以及第十三方面的任意一种实现方式分别与第二方面以及第二方面的任意一种实现方式相对应。第十三方面以及第十三方面的任意一种实现方式所对应的技术效果可参见上述第二方面以及第二方面的任意一种实现方式所对应的技术效果,此处不再赘述。
第十四方面,本申请实施例提供一种计算机程序产品,计算机程序产品包括软件程序,当软件程序被计算机或处理器执行时,使得计算机或处理器执行第一方面或第一方面的任意可能的实现方式中的音频处理方法。
第十四方面以及第十四方面的任意一种实现方式分别与第一方面以及第一方面的任意一种实现方式相对应。第十四方面以及第十四方面的任意一种实现方式所对应的技术效果可参见上述第一方面以及第一方面的任意一种实现方式所对应的技术效果，此处不再赘述。
第十五方面,本申请实施例提供一种计算机程序产品,计算机程序产品包括软件程序,当软件程序被计算机或处理器执行时,使得计算机或处理器执行第二方面或第二方面的任意可能的实现方式中的音频处理方法。
第十五方面以及第十五方面的任意一种实现方式分别与第二方面以及第二方面的任意一种实现方式相对应。第十五方面以及第十五方面的任意一种实现方式所对应的技术效果可参见上述第二方面以及第二方面的任意一种实现方式所对应的技术效果,此处不再赘述。
附图说明
图1a为示例性示出的应用场景示意图;
图1b为示例性示出的应用场景示意图;
图2为示例性示出的音频处理过程示意图;
图3a为示例性示出的音频处理过程示意图;
图3b为示例性示出的音频处理过程示意图;
图4a为示例性示出的音频处理过程示意图;
图4b为示例性示出的音频处理过程示意图;
图5为示例性示出的处理过程的示意图;
图6为示例性示出的处理过程的示意图;
图7a为示例性示出的音频处理过程示意图;
图7b为示例性示出的音频处理过程示意图;
图8a为示例性示出的音频处理过程示意图;
图8b为示例性示出的音频处理过程示意图;
图9为示例性示出的音频处理系统示意图;
图10为示例性示出的装置的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。
本申请实施例的说明书和权利要求书中的术语“第一”和“第二”等是用于区别不同的对象，而不是用于描述对象的特定顺序。例如，第一目标对象和第二目标对象等是用于区别不同的目标对象，而不是用于描述目标对象的特定顺序。
在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
在本申请实施例的描述中,除非另有说明,“多个”的含义是指两个或两个以上。例如,多个处理单元是指两个或两个以上的处理单元;多个系统是指两个或两个以上的系统。
示例性的,本申请可以应用于使用耳机听移动终端中音频的场景。
示例性的,耳机可以是无线耳机(如TWS(True Wireless Stereo,真无线立体声)蓝牙耳机、头戴式蓝牙耳机、挂脖式蓝牙耳机等等),也可以是有线耳机,本申请对此不作限制。对应的,移动终端与耳机之间的连接可以是无线连接,也可以是有线连接,本申请对此不做限制。
示例性的,移动终端可以是手机、平板电脑、智能手表、个人笔记本等等,本申请对此不作限制。
示例性的,使用耳机聆听的移动终端中音频可以是歌曲、视频中的音频部分、有声读物等等,本申请对此不作限制。
图1a为示例性示出的应用场景示意图。在图1a中移动终端为手机,耳机为无线耳机;图1a中示出的是使用耳机听手机中歌曲的场景。
参照图1a,示例性的,耳机与手机保持连接状态。当用户想要听歌曲A时,可以打开手机中的音频应用,在音频应用中查找到歌曲A并执行播放操作。此时,手机可以响应于播放操作,将歌曲A的音频信号发送给耳机,由耳机播放;这样,用户可以在耳机中听到歌曲A。
示例性的,本申请可以应用于各种VR(Virtual Reality,虚拟现实)场景如VR电影、VR游戏等,由VR设备播放音频或者由与VR设备连接的耳机播放音频。
示例性的,VR设备可以包括VR眼镜、VR头盔等等,本申请对此不作限制。
图1b为示例性示出的应用场景示意图。在图1b中VR设备为VR眼镜。图1b中示出的是,使用VR眼镜观看VR电影的场景。
参照图1b,示例性的,在VR眼镜中播放VR电影的过程中,可以在VR眼镜的镜片内侧显示VR电影画面,以及在VR眼镜上人耳附近的扬声器播放VR电影中的音频信号。
应该理解的是,VR眼镜可以与耳机连接;这样,在VR眼镜中播放VR电影的过程中,可以在VR眼镜的镜片内侧显示VR电影画面,以及将VR电影中的音频信号发送给耳机由耳机播放;本申请对此不作限制。
示例性的，本申请提出一种音频处理方法，能够对音频信号进行空间音频处理，得到用于耳机播放的双耳信号；使得用户佩戴耳机听音时，能够感受到声像位置、距离感以及空间感。
其中,声像位置可以是指用户主观感受到的声音相对人头中心的方位。距离感可以是指用户主观感受到的声音相对人头中心的距离。空间感可以是指用户主观感受到的声学环境空间的大小。
图2为示例性示出的音频处理过程示意图。
S201,获取待处理的源音频信号。
示例性的,可以获取待处理的音频信号,将待处理的音频信号称为源音频信号。
示例性的,源音频信号是媒体文件。源音频信号可以是歌曲对应的音频信号、有声读物对应的音频信号、视频包含的音频信号等等,本申请对此不作限制。
S202,对源音频信号分别进行直达声渲染、早期反射声渲染和晚期反射声渲染,以得到双耳信号。
示例性的,源音频信号可以经过直接路径传播到人耳,以及可以经过反射路径到达人耳。其中,源音频信号经过直接路径传播到人耳的部分声波,会影响用户对于声像位置的感知;源音频信号经过反射路径达到人耳的前一部分声波(例如,时间范围一般取人耳接收到源音频信号经过直接路径传播到人耳的部分声波以后,50ms或95ms内到达人耳的声波,主要由初次反射或二次反射产生),会影响用户对于声像距离的感知;源音频信号经过反射达到人耳的后一部分声波(例如,时间范围一般取人耳接收到源音频信号经过直接路径传播到人耳的部分声波以后,50ms或95ms之后到达人耳的声波,主要由多次反射产生),会影响用户对于声学环境空间的感知。因此,本申请可以通过对源音频信号分别进行直达声渲染、早期反射声渲染和晚期反射声渲染,来高精度的还原出声像位置、距离感和空间感,以达到更真实沉浸的双耳渲染效果。
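上述按到达时间区分直达声、早期反射声与晚期反射声的思路，可用如下Python草图示意；其中采样率、各时间分界与RIR数据均为示例性假设，并非本申请方案的限定实现：

```python
import numpy as np

def split_rir(rir, fs, direct_ms=5.0, early_ms=50.0):
    """按时间窗把一条房间脉冲响应(RIR)粗略切分为直达声/早期反射/晚期反射三段。

    early_ms 取 50(或95)毫秒, 对应文中早期/晚期反射的经验分界;
    direct_ms 为示意用的直达声窗长, 实际取值取决于具体声学环境。
    """
    d = int(fs * direct_ms / 1000)   # 直达声窗内的采样点数
    e = int(fs * early_ms / 1000)    # 早期反射窗的结束位置
    return rir[:d], rir[d:e], rir[e:]

fs = 1000                            # 示例采样率 1 kHz
rir = np.arange(100, dtype=float)    # 100 个采样点的示例 RIR
direct, early, late = split_rir(rir, fs)
print(len(direct), len(early), len(late))  # 5 45 50
```

三段脉冲响应分别用于直达声渲染、早期反射声渲染和晚期反射声渲染。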
示例性的,双耳信号可以包括用于左耳机播放的一路信号和用于右耳机播放的一路信号。
相对于现有技术仅对源音频信号进行直达声渲染,或者对源音频信号进行直达声与反射声分离后渲染,或者将源音频信号转换为虚拟扬声器信号后渲染等等而言,本申请还原出的声像位置、距离感和空间感的精度更高,进而能够达到更真实沉浸的双耳渲染效果。
图3a为示例性示出的音频处理过程示意图。在图3a的实施例中,示出了对源音频信号分别进行直达声渲染、早期反射声渲染和晚期反射声渲染的一种方式。
S301,获取待处理的源音频信号。
示例性的,S301可以参照S201的描述,在此不再赘述。
示例性的,源音频信号包括以下至少一种格式:多声道格式、多对象格式和Ambisonics格式(球谐环绕声场格式)。
示例性的,若源音频信号为多声道格式,可以假设源音频信号包括n1(n1为大于1的整数)个声道,则源音频信号可以表示为:
[ch1(t,x1,y1,z1),ch2(t,x2,y2,z2),...,chn1(t,xn1,yn1,zn1)]
其中,t表示时间,(x,y,z)表示声源在笛卡尔坐标系中的位置。ch1(t,x1,y1,z1)表示第1声道的音频信号,ch2(t,x2,y2,z2)表示第2个声道的音频信号,......以此类推,chn1(t,xn1,yn1,zn1)表示第n1个声道的音频信号。其中,每个声道对应的声源位置不变。
在球坐标系中，源音频信号可以表示为：
[ch1(t,r1,θ1,φ1),ch2(t,r2,θ2,φ2),...,chn1(t,rn1,θn1,φn1)]
其中，(ri,θi,φi)表示声源在球坐标系中的位置，ri、θi、φi分别表示距离、水平角和俯仰角。为统一描述，后续均采用球坐标系表示。
示例性的，若源音频信号为多对象格式，可以假设源音频信号包含n2（n2为大于1的整数）个对象，源音频信号可以表示为：
[obj1(t,r1(t),θ1(t),φ1(t)),obj2(t,r2(t),θ2(t),φ2(t)),...,objn2(t,rn2(t),θn2(t),φn2(t))]
其中，t表示时间，(r(t),θ(t),φ(t))表示声源在球坐标系中的位置。obj1(t,r1(t),θ1(t),φ1(t))表示第1个对象的音频信号，obj2(t,r2(t),θ2(t),φ2(t))表示第2个对象的音频信号，......以此类推，objn2(t,rn2(t),θn2(t),φn2(t))表示第n2个对象的音频信号。其中，每个对象都是运动的声源，每个对象的音频信号的位置随着时间发生变化；也就是说，每个对象的音频信号可以包括多组，一组音频信号对应一个位置。
示例性的，若源音频信号为Ambisonics格式，可以假设源音频信号包含n3（n3为大于1的整数）个通道，源音频信号可以表示为：
[amb1(t,r1,θ1,φ1),amb2(t,r2,θ2,φ2),...,ambn3(t,rn3,θn3,φn3)]
其中，t表示时间，(ri,θi,φi)表示声源在球坐标系中的位置。amb1(t,r1,θ1,φ1)表示第1个通道的音频信号，amb2(t,r2,θ2,φ2)表示第2个通道的音频信号，......以此类推，ambn3(t,rn3,θn3,φn3)表示第n3个通道的音频数据。其中，假设Ambisonics为n阶，则每个通道的音频信号可以包括2n+1组。
为了便于后续说明，将以上几种格式的源音频信号统一表示为：
si(t,ri,θi,φi)
其中，i表示源音频信号中当前处理的音频信号的序号，si(t,ri,θi,φi)表示音频信号中的第i个音频信号，(ri,θi,φi)为第i个音频信号的位置。
S302,对源音频信号进行直达声渲染,以得到第一双耳信号。
示例性的,可以预先建立直达声RIR(Room Impulse Response,房间脉冲响应)库。示例性的,可以预先在自由声场条件下(如消声室环境)采用人工头录音装置,分别采集声源位于自由声场条件中p1(p1为正整数)个位置的响应,可以得到p1个位置的直达声RIR(即HRIR(Head Related Impulse Response,头相关脉冲响应))。然后可以采用p1个位置的HRIR,组成直达声RIR库。
其中，直达声RIR库可以表示为：
HRIRBIN=[HRIRBIN,1,HRIRBIN,2,...,HRIRBIN,p1]
其中，下标BIN表示HRIR区分左右耳，也就是说，每个位置的HRIR包括2组（即左耳的HRIR和右耳的HRIR）。HRIRBIN,1表示第1个位置的直达声RIR，HRIRBIN,2表示第2个位置的直达声RIR，......以此类推，HRIRBIN,p1表示第p1个位置的直达声RIR。
示例性的，直达声RIR可转为Ambisonics格式进行保存，可以表示为HRIRBIN-AMB。
示例性的，可以采用直达声RIR与源音频信号进行卷积，来实现对源音频信号进行直达声渲染，以得到第一双耳信号。针对源音频信号的第i个音频信号si(t,ri,θi,φi)，可以参照如下公式进行直达声渲染：
out1,i(t)=si(t,ri,θi,φi)*HRIRBIN,i
其中，上述公式中的“*”为卷积，out1,i(t)为源音频信号的第i个音频信号进行直达声渲染后得到的音频信号，HRIRBIN,i为直达声RIR库中位置与si(t,ri,θi,φi)的位置对应的直达声RIR。
假设，源音频信号包括N（N为大于1的整数）个通道，则对源音频信号进行直达声渲染，以得到的第一双耳信号out1(t)可以如下：
out1(t)=out1,1(t)+out1,2(t)+...+out1,N(t)
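上述“逐通道与对应位置的左右耳HRIR卷积后叠加”的直达声渲染计算，可用如下Python草图示意；其中的信号与HRIR均为随机示例数据，函数与变量名为说明而设，并非本申请的限定实现：

```python
import numpy as np

def render_direct(channels, hrirs):
    """对每个通道与其位置对应的左右耳 HRIR 卷积后叠加, 得到双耳信号。

    channels: 形如 [N, T] 的数组, 每行是一个通道(或对象)的音频信号
    hrirs:    形如 [N, 2, L] 的数组, 每个通道对应一组左/右耳 HRIR
    """
    n, t = channels.shape
    out = np.zeros((2, t + hrirs.shape[2] - 1))   # 卷积后长度为 T+L-1
    for i in range(n):
        for ear in (0, 1):                        # 0: 左耳, 1: 右耳
            out[ear] += np.convolve(channels[i], hrirs[i, ear])
    return out

# 示例: 2 个通道、长度 16 的随机 HRIR
rng = np.random.default_rng(0)
sig = rng.standard_normal((2, 48))
hrir = rng.standard_normal((2, 2, 16))
binaural = render_direct(sig, hrir)
print(binaural.shape)  # (2, 63)
```

早期/晚期反射声渲染在结构上与之相同，只是把HRIR换成对应位置的ER/LR脉冲响应。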
S303,对源音频信号进行早期反射声渲染,以得到第二双耳信号。
示例性的,可以预先建立早期反射声RIR库。示例性的,可以预先在声学环境中采用球形麦克风,分别采集声源位于声学环境中p2(p2为正整数)个位置的响应,可以得到p2个位置的RIR数据。然后分别确定p2个位置的RIR数据中声源到球形麦克风之间反射路径的前一部分脉冲响应(即早期反射部分,可以采用ER(Early Reflections,早期反射)表示),可以得到p2个位置的早期反射声RIR(即HOA(High-Order Ambisonics,高阶Ambisonics)RIR)。然后可以采用p2个位置的早期反射声RIR,组成早期反射声RIR库。
相对于现有技术需要经过多次采集才能完成各个方向的RIR录制，来制作早期反射声RIR库而言，本申请采用球形麦克风采集一次即完成各个方向的RIR的录制，能够降低制作早期反射声RIR库的工作量。
其中，早期反射声RIR库可以表示为：
ERAMB=[ERAMB,1,ERAMB,2,...,ERAMB,p2]
其中，AMB表示ER使用Ambisonics格式进行保存，每个位置的HOA RIR可以包括2n+1组。
其中，ERAMB,1表示第1个位置的早期反射声RIR，ERAMB,2表示第2个位置的早期反射声RIR，......以此类推，ERAMB,p2表示第p2个位置的早期反射声RIR。
示例性的，早期反射声RIR可转为BIN格式进行保存，计算方式如下：
ERBIN=ERAMB*HRIRBIN-AMB
其中，上述公式中的“*”为卷积。
示例性的，可以采用早期反射声RIR与源音频信号卷积，来实现对源音频信号进行早期反射声渲染，以得到第二双耳信号。针对源音频信号的第i个音频信号si(t,ri,θi,φi)，可以参照如下公式进行早期反射声渲染：
out2,i(t)=si(t,ri,θi,φi)*ERBIN,i
其中，上述公式中的“*”为卷积，out2,i(t)为源音频信号的第i个音频信号进行早期反射声渲染后得到的音频信号，ERBIN,i为早期反射声RIR库中位置与si(t,ri,θi,φi)的位置对应的早期反射声RIR。
假设，源音频信号包括N（N为大于1的整数）个通道，则对源音频信号进行早期反射声渲染，以得到的第二双耳信号out2(t)可以如下：
out2(t)=out2,1(t)+out2,2(t)+...+out2,N(t)
S304,对源音频信号进行晚期反射声渲染,以得到第三双耳信号。
示例性的,可以预先建立晚期反射声RIR库。示例性的,可以预先在声学环境中采用球形麦克风,分别采集声源位于该声学环境中p3(p3为正整数)个位置的响应,可以得到p3个位置的RIR数据。然后分别确定p3个位置的RIR数据中声源到球形麦克风之间反射路径的后一部分脉冲响应(即晚期反射部分,可以采用LR(Late Reflections,晚期反射)表示)),可以得到p3个位置的晚期反射声RIR(即HOA RIR)。然后可以采用p3个位置的晚期反射声RIR,组成晚期反射声RIR库。
相对于现有技术需要经过多次采集才能完成各个方向的RIR录制，来制作晚期反射声RIR库而言，本申请采用球形麦克风采集一次即完成各个方向的RIR的录制，能够降低制作晚期反射声RIR库的工作量。
其中，晚期反射声RIR库可以表示为：
LRAMB=[LRAMB,1,LRAMB,2,...,LRAMB,p3]
其中，AMB表示LR使用Ambisonics格式进行保存，每个位置的晚期反射声RIR可以包括2n+1组。
其中，LRAMB,1表示第1个位置的晚期反射声RIR，LRAMB,2表示第2个位置的晚期反射声RIR，......以此类推，LRAMB,p3表示第p3个位置的晚期反射声RIR。
示例性的，晚期反射声RIR可转为BIN格式进行保存，计算方式如下：
LRBIN=LRAMB*HRIRBIN-AMB
其中，上述公式中的“*”为卷积。
示例性的，可以采用晚期反射声RIR与源音频信号卷积，来实现对源音频信号进行晚期反射声渲染，以得到第三双耳信号。针对源音频信号的第i个音频信号si(t,ri,θi,φi)，可以参照如下公式进行晚期反射声渲染：
out3,i(t)=si(t,ri,θi,φi)*LRBIN,i
其中，上述公式中的“*”为卷积，out3,i(t)为源音频信号的第i个音频信号进行晚期反射声渲染后得到的音频信号，LRBIN,i为晚期反射声RIR库中位置与si(t,ri,θi,φi)的位置对应的晚期反射声RIR。
假设，源音频信号包括N（N为大于1的整数）个通道，则对源音频信号进行晚期反射声渲染，以得到的第三双耳信号out3(t)可以如下：
out3(t)=out3,1(t)+out3,2(t)+...+out3,N(t)
需要说明的是,p1、p2和p3可以相等,也可以不等,本申请对此不作限制。
S305,基于第一双耳信号、第二双耳信号和第三双耳信号,以确定双耳信号。
示例性的,可以对第一双耳信号、第二双耳信号和第三双耳信号进行混音处理,以得到双耳信号;可以参照如下公式进行混音处理:
outB(t)=out1(t)+out2(t)+out3(t)
其中,outB(t)为双耳信号。
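上述混音处理outB(t)=out1(t)+out2(t)+out3(t)可示意如下；考虑到三路渲染所用RIR长度可能不同，草图中先按最长长度补零对齐再逐样本相加（对齐方式为示例性假设）：

```python
import numpy as np

def mix_binaural(*signals):
    """将多路双耳信号按最长长度补零对齐后逐样本相加 (outB = out1 + out2 + out3)。"""
    t = max(s.shape[1] for s in signals)
    out = np.zeros((2, t))
    for s in signals:
        out[:, :s.shape[1]] += s
    return out

# 三路长度不同的示例双耳信号
out1 = np.ones((2, 4))
out2 = np.ones((2, 6))
out3 = np.ones((2, 5))
outB = mix_binaural(out1, out2, out3)
print(outB[0])  # [3. 3. 3. 3. 2. 1.]
```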
需要说明的是,本申请不限制S302、S303和S304的执行顺序,这三个步骤可以同步执行。
图3b为示例性示出的音频处理过程示意图。在图3a的基础上，在得到第一双耳信号、第二双耳信号和第三双耳信号后，还可以对第一双耳信号、第二双耳信号和第三双耳信号进行音效处理，以对音频进行修饰。
参照图3b,示例性的,S306,依据预设的音效参数1对第一双耳信号进行音效处理,以得到音频信号1。
示例性的，S306可以在S302之后且在S305之前执行，即在得到第一双耳信号后，可以对第一双耳信号进行音效处理，以得到音频信号1。示例性的，可以根据预设的音效参数1（即直达声音效参数，可以是指用于对直达声部分进行音效处理的参数），对第一双耳信号进行音效处理，以得到音频信号1。
示例性的,可以根据预设的音效参数1,生成一组滤波器,可以采用AudioEffects1-BIN(t)表示;然后采用AudioEffects1-BIN(t)对第一双耳信号进行滤波,来实现对第一双耳信号的音效处理,以得到音频信号1(可以采用out1-BIN(t)表示),可以参照如下公式:
out1-BIN(t)=out1(t)*AudioEffects1-BIN(t)
其中,上述公式中的“*”表示卷积。
参照图3b,示例性的,S307,依据预设的音效参数2对第二双耳信号进行音效处理,以得到音频信号2。
示例性的,S307在S303之后且在S305之前执行,即在得到第二双耳信号后,可以对第二双耳信号进行音效处理,以得到音频信号2。示例性的,可以根据预设的音效参数2(也就是早期反射声音效参数,可以是指用于对早期反射声部分进行音效处理的参数),对第二双耳信号进行音效处理,以得到音频信号2。
示例性的,可以根据预设的音效参数2,生成一组滤波器,可以采用AudioEffects2-BIN(t)表示;然后采用AudioEffects2-BIN(t)对第二双耳信号进行滤波,来实现对第二双耳信号的音效处理,以得到音频信号2(可以采用out2-BIN(t)表示),可以参照如下公式:
out2-BIN(t)=out2(t)*AudioEffects2-BIN(t)
其中,上述公式中的“*”表示卷积。
参照图3b,示例性的,S308,依据预设的音效参数3对第三双耳信号进行音效处理,以得到音频信号3。
示例性的,S308在S304之后且在S305之前执行,即在得到第三双耳信号后,可以对第三双耳信号进行音效处理,以得到音频信号3。示例性的,可以根据预设的音效参数3(也就是晚期反射声音效参数,可以是指用于对晚期反射声部分进行音效处理的参数),对第三双耳信号进行音效处理,以得到音频信号3。
示例性的,可以根据预设的音效参数3,生成一组滤波器,可以采用AudioEffects3-BIN(t)表示;然后采用AudioEffects3-BIN(t)对第三双耳信号进行滤波,来实现对第三双耳信号的音效处理,以得到音频信号3(可以采用out3-BIN(t)表示),可以参照如下公式:
out3-BIN(t)=out3(t)*AudioEffects3-BIN(t)
其中,上述公式中的“*”表示卷积。
参照图3b,示例性的,S305可以包括S305a和S305b,其中:
S305a,对音频信号1、音频信号2和音频信号3进行混音处理,以得到音频信号4。
S305b,依据预设的音效参数4对音频信号4进行音效处理,以得到双耳信号。
示例性的,可以根据预设的音效参数4(也就是第一混合音效参数,可以是指用于对直达声部分、早期反射声部分和晚期反射声部分均进行音效处理的参数),对音频信号4进行音效处理,得到双耳信号;具体可以参照上述描述,在此不再赘述。
需要说明的是,上述得到的第一双耳信号、第二双耳信号、第三双耳信号、音频信号1、音频信号2、音频信号3和音频信号4,都是包括左右双耳的音频信号。
图4a为示例性示出的音频处理过程示意图。在图4a的实施例中,示出了对源音频信号分别进行直达声渲染、早期反射声渲染和晚期反射声渲染的另一种方式。
S401,获取待处理的源音频信号。
示例性的,S401可以参照上述S301的描述,在此不再赘述。
S402,对源音频信号进行早期反射声渲染,以得到第二双耳信号。
示例性的,S402可以参照上述S303的描述,在此不再赘述。示例性的,第二双耳信号采用out2(t)表示。
S403,对源音频信号进行晚期反射声渲染,以得到第三双耳信号。
示例性的,S403可以参照上述S304的描述,在此不再赘述。示例性的,第三双耳信号采用out3(t)表示。
S404,对第二双耳信号和第三双耳信号进行混音处理,以得到第四双耳信号。
示例性的,可以参照如下公式,对第二双耳信号和第三双耳信号进行混音处理,以得到第四双耳信号:
out4(t)=out2(t)+out3(t)
其中,out4(t)为第四双耳信号。
S405,对第四双耳信号进行直达声渲染,以得到第五双耳信号。
示例性的,S405可以参照S302的描述,在此不再赘述。
S406,基于第五双耳信号,确定双耳信号。
一种可能的方式中,将第五双耳信号,作为双耳信号。
图4b为示例性示出的音频处理过程示意图。在图4a的基础上，在得到第二双耳信号、第三双耳信号、第四双耳信号和第五双耳信号后，可以对第二双耳信号、第三双耳信号、第四双耳信号和第五双耳信号进行音效处理，以对音频进行修饰。
参照图4b,示例性的,S407,依据预设的音效参数2对第二双耳信号进行音效处理,以得到音频信号2。
示例性的,S407在S402之后且在S404之前执行,即在得到第二双耳信号后,可以对第二双耳信号进行音效处理,以得到音频信号2。具体可以参照上述S307的描述,在此不再赘述。
参照图4b,示例性的,S408,依据预设的音效参数3对第三双耳信号进行音效处理,以得到音频信号3。
示例性的,S408在S403之后且在S404之前执行,即在得到第三双耳信号后,可以对第三双耳信号进行音效处理,以得到音频信号3。具体可以参照上述S308的描述,在此不再赘述。
这样,S404可以包括:对音频信号2和音频信号3进行混音处理,以得到第四双耳信号。
参照图4b,示例性的,S409,依据预设的音效参数5对第四双耳信号进行音效处理,以得到音频信号5。
示例性的,S409在S404之后且在S405之前执行,即在得到第四双耳信号后,可以对第四双耳信号进行音效处理,以得到音频信号5。示例性的,可以根据预设的音效参数5(也就是第二混合音效参数,可以是指用于对早期反射声部分和晚期反射声部分进行音效处理的参数),对第四双耳信号进行音效处理,以得到音频信号5;具体可以参照上述描述,在此不再赘述。此时,S405可以包括:对音频信号5进行直达声渲染,以得到第五双耳信号。
参照图4b,示例性的,上述S406可以包含S406_X;其中,S406_X,依据预设的音效参数1对第五双耳信号进行音效处理,以得到双耳信号。具体可以参照上述描述,在此不再赘述。
需要说明的是,上述得到的第二双耳信号、第三双耳信号、第四双耳信号、第五双耳信号、音频信号2、音频信号3和音频信号5,都是包括左右双耳的音频信号。
在上述实施例的基础上，本申请提出一种音频处理方法，可以支持“边听边调”，即在用户针对源音频信号执行播放操作后，可以响应于用户的播放操作，先按照系统针对渲染效果选项的设置和/或用户针对渲染效果选项的历史设置，对源音频信号中的初始音频片段进行空间音频处理，得到初始双耳信号并播放。在播放初始双耳信号的过程中（即用户在收听初始双耳信号的过程中），可以支持用户设置空间音频效果；然后根据用户针对空间音频效果的设置，对源音频信号中初始音频片段之后的音频片段继续进行空间音频处理。这样，能够在播放音频信号过程中，根据用户针对渲染效果的设置，来不断地调整源音频信号对应的双耳信号的渲染效果；还能够满足用户针对空间音频效果的个性化需求。
示例性的,空间音频效果可以包括渲染效果,渲染效果可以包括声像位置、距离感以及空间感等等,本申请对此不作限制。
示例性的,本申请可以提供用于针对空间音频效果进行设置的应用程序(或小程序或网页或工具栏等等)。
图5为示例性示出的处理过程的示意图。其中,图5(1)为示例性示出的界面的示意图。需要说明的是,图5(1)中空间音频效果设置界面51,可以是由系统进行设置的界面,也可以是由用户进行设置的界面,本申请对此不作限制。本申请以用户在空间音频效果设置界面51进行设置,来实现空间音频效果调整为例进行说明。
参照图5(1),示例性的,空间音频效果设置界面51可以包括一个或多个设置区域,包括但不限于:渲染效果设置区域52等等,本申请对此不作限制。
示例性的,可以根据不同的渲染效果,在渲染效果设置区域52设置多个渲染效果选项。示例性的,渲染效果可以包括多种,如声像位置、距离感和空间感等等;当然还可以包括其他渲染效果,本申请对此不作限制。参照图5(1),示例性的,渲染效果设置区域52可以包括但不限于:声像位置选项521,距离感选项522和空间感选项523等等,当然还可以包括其他渲染效果选项,本申请对此不作限制。本申请以渲染效果设置区域52包括:声像位置选项521,距离感选项522和空间感选项523为例进行示例性说明。
参照图5(1),示例性的,声像位置选项521,距离感选项522和空间感选项523可以是滑块控件,滑块控件可以包括滑块。
示例性的,用户可以针对声像位置选项521的滑块进行操作,来升高或降低声像位置。示例性的,当用户针对声像位置选项521的滑块执行上滑操作时,可以升高声像位置。当用户针对声像位置选项521的滑块执行下滑操作时,可以降低声像位置。
示例性的,用户可以针对距离感选项522的滑块进行操作,来增大或缩短距离感。示例性的,当用户针对距离感选项522的滑块执行上滑操作时,则可以增大声像与用户的距离;当用户针对距离感选项522的滑块执行下滑操作时,则可以缩短声像与用户的距离。
示例性的,用户可以针对空间感选项523的滑块进行操作,来增大或缩小空间感。示例性的,当用户针对空间感选项523的滑块执行上滑操作时,则可以增加音频的空间感;当用户针对空间感选项523的滑块执行下滑操作时,则可以缩小音频的空间感。
应该理解的是，图5(1)仅是本申请的一个示例，声像位置选项521，距离感选项522和空间感选项523可以是其他类型的控件，例如旋钮控件（旋钮控件包括旋钮），用户可以转动声像位置选项521的旋钮，来升高或降低声像位置；转动距离感选项522的旋钮，来增大或缩短距离感；以及转动空间感选项523的旋钮，来增大或缩小空间感。本申请对声像位置选项521，距离感选项522和空间感选项523的显示形式不作限制。
以下在图5(1)的基础上,对根据用户针对渲染效果选项的设置操作,来进行空间音频处理的过程进行示例性说明。
图5(2)为示例性示出的音频处理过程示意图。
S501,响应于用户的播放操作,对源音频信号中的初始音频片段进行空间音频处理,以得到初始双耳信号并播放初始双耳信号,源音频信号为媒体文件。
示例性的,源音频信号为媒体文件。源音频信号可以是歌曲的音频信号、有声读物的音频信号、视频包含的音频信号等等,本申请对此不作限制。
再次参照图1a,示例性的,当用户想要听歌曲A时,可以打开手机中的音频应用,在音频应用中查找到歌曲A并执行播放操作。此时,可以响应于用户的播放操作,对歌曲A对应的音频信号(即源音频信号)中的初始音频片段进行空间音频处理,进而可以得到初始双耳信号并播放初始双耳信号。
示例性的,可以按照预设方式,将源音频信号划分为多个音频片段。其中,预设方式可以按照需求设置,例如,将源音频信号划分为时长相同的多个音频片段;又例如,将源音频信号划分为预设数量(可以按照需求设置)的音频片段;等等。然后可以将源音频信号包括的多个音频片段中的前X1(X1为正整数)个音频片段,确定为初始音频片段。之后,可以按照上述实施例的描述,对源音频信号中的前X1个音频片段进行空间音频处理,以得到初始双耳信号。
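上述“将源音频信号划分为多个音频片段，先按当前设置处理初始片段、再按新设置继续处理后续片段”的流程，可用如下Python草图示意；其中的空间音频处理仅以一个增益占位，片段长度与参数名均为示例性假设：

```python
import numpy as np

def split_segments(signal, seg_len):
    """按预设时长将源音频信号划分为多个音频片段(最后一段可能不足 seg_len)。"""
    return [signal[i:i + seg_len] for i in range(0, len(signal), seg_len)]

def process(segment, params):
    # 占位的空间音频处理: 这里仅示意用渲染参数对片段加权
    return segment * params["gain"]

signal = np.arange(10, dtype=float)
segments = split_segments(signal, 4)
print(len(segments))  # 3

# 先按系统/历史设置处理初始片段, 其余片段待用户"边听边调"后按新设置继续处理
params = {"gain": 1.0}
initial = process(segments[0], params)
params["gain"] = 0.5          # 模拟用户调整渲染效果选项后的新设置
rest = [process(s, params) for s in segments[1:]]
```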
一种可能的方式中,可以按照系统针对渲染效果选项的设置,对源音频信号中的前X1个音频片段进行空间音频处理,以得到初始双耳信号。其中,系统可以针对各渲染效果选项预先进行设置,在接收到用户的播放操作后,可以根据系统的设置,对源音频信号中的前X1个音频片段进行空间音频处理,以得到初始双耳信号;具体可以参照上文中的描述,在此不再赘述。
一种可能的方式中,可以按照用户针对渲染效果选项的历史(如上一次)设置,对源音频信号中的前X1个音频片段进行空间音频处理,以得到初始双耳信号。其中,在接收到用户的播放操作后,可以根据用户针对渲染效果选项的上一次设置,对源音频信号中的前X1个音频片段进行空间音频处理,以得到初始双耳信号;具体可以参照上文中的描述,在此不再赘述。
应该理解的是,当用户上一次仅设置声像位置选项、距离感选项或空间感选项中的部分选项时,可以根据用户针对部分渲染效果选项的设置和系统针对另一部分渲染效果选项的设置,对源音频信号中前X1个音频片段进行空间音频处理,以得到初始双耳信号。
S502,接收用户针对渲染效果选项的设置,渲染效果选项包括以下至少一种:声像位置选项、距离感选项或空间感选项。
示例性的,在播放初始双耳信号的过程中,用户可以听到初始双耳信号;当用户确定渲染效果不满足自身需求时,可以针对渲染效果选项进行设置,即进入图5(1)中的空间音频效果设置界面51,针对渲染效果选项执行设置操作,以按照自身需求设置渲染效果。
示例性的,在进入空间音频效果设置界面51后,用户可以针对渲染效果设置区域52中至少一个渲染效果选项执行设置操作;例如,可以针对声像位置选项521、距离感选项522和空间感选项523中的至少一个选项执行设置操作,来设置源音频信号播放时的声像位置、距离感和空间感中的任一项渲染效果。
应该理解的是,当用户确定渲染效果满足自身需求时,可以无需针对渲染效果选项进行设置;此时,若在播放到初始音频片段的最后一个音频片段时,没有接收到用户针对渲染效果选项的设置,则可以按照系统针对渲染效果选项的设置和/或用户针对渲染效果选项的历史设置,对源音频信号中初始音频片段之后的X2(X2为正整数)个音频片段,继续进行空间音频处理,以得到新的初始双耳信号。其中,X2可以按照需求设置,本申请对此不作限制。这X2个音频片段可以是源音频信号中前X1个音频片段之后连续的X2个音频片段,且这X2个音频片段的第一个音频片段与前X1个音频片段的最后一个音频片段相邻。
S503,根据设置,对源音频信号中初始音频片段之后的音频片段继续进行空间音频处理,以得到目标双耳信号。
示例性的,待接收到用户针对渲染效果选项的设置操作后,可以根据用户针对渲染效果选项的设置操作,对对应的渲染效果参数进行调节;然后根据调节后的渲染效果参数,对源音频信号中初始音频片段之后的音频片段继续进行空间音频处理,以得到用于耳机播放的目标双耳信号。具体空间音频处理过程在后续进行说明。
示例性的，当在接收用户针对渲染效果选项的设置之前，仅对源音频信号中前X1个音频片段进行了空间音频处理，则可以按照用户针对渲染效果选项的本次设置，对源音频信号中前X1个音频片段之后的X3（X3为正整数）个音频片段，继续进行空间音频处理，以得到用于耳机播放的目标双耳信号。其中，X3可以按照需求设置，本申请对此不作限制。这X3个音频片段可以是源音频信号中前X1个音频片段之后连续的X3个音频片段，且这X3个音频片段的第一个音频片段与前X1个音频片段的最后一个音频片段相邻。
示例性的,当在接收用户针对渲染效果选项的设置之前,对源音频信号中前X1+X2个音频片段进行了空间音频处理,则可以按照用户针对渲染效果选项的本次设置,对源音频信号中前X1+X2个音频片段之后的X3(X3为正整数)个音频片段,继续进行空间音频处理,以得到用于耳机播放的目标双耳信号。其中,这X3个音频片段可以是源音频信号中前X1+X2个音频片段之后连续的X3个音频片段,且这X3个音频片段的第一个音频片段与X2个音频片段的最后一个音频片段相邻。
应该理解的是,在得到目标双耳信号后可以播放目标双耳信号,在播放目标双耳信号的过程中(即用户在收听目标双耳信号的过程中),用户确定渲染效果不满足自身需求时,可以再次针对渲染效果选项进行设置;此时,可以根据用户再次针对渲染效果选项的设置,对源音频信号中上次进行空间音频渲染的音频片段之后的音频片段继续进行空间音频处理,得到新的目标双耳信号;以此类推。
这样,能够在播放音频信号过程中,根据针对渲染效果的设置,来不断的调整源音频信号对应的双耳信号的渲染效果。
此外,还可以按照用户个性化空间音频效果需求,对源音频信号进行个性化空间音频处理,得到用于耳机播放的目标双耳信号;进而能够满足用户针对空间音频效果个性化需求。
图6为示例性示出的处理过程示意图。其中,图6(1)为示例性示出的界面的示意图。
示例性的,空间音频效果还可以包括空间场景。示例性的,不同用户针对播放音频信号的空间场景的需求不同,例如,部分用户偏好电影院这种空间场景,部分用户偏好录音棚这种空间场景,部分用户偏好KTV这种空间场景等等。进而,为了满足用户针对空间场景的需求,可以在图5(1)的空间音频效果设置界面51中增加空间场景选择区域53,如图6(1)所示。
示例性的，可以根据不同的空间场景，在空间场景选择区域53设置多个场景选项。示例性的，空间场景可以包括多种，如电影院、音乐厅、录音棚以及KTV等等，当然还可以包括其他空间场景，本申请对此不作限制。参照图6(1)，示例性的，空间场景选择区域53可以包括但不限于：电影院选项531、音乐厅选项532、录音棚选项533以及KTV选项534等等，当然还可以包括其他场景选项，本申请对此不作限制。
示例性的,当用户想要选择的空间场景为电影院时,可以选中空间场景设置区域53中的电影院选项531。当用户想要选择的空间场景为音乐厅时,可以选中空间场景设置区域53中的音乐厅选项532。当用户想要选择的空间场景为录音棚时,可以选中空间场景设置区域53中的录音棚选项533。当用户想要选择的空间场景为KTV时,可以选中空间场景设置区域53中的KTV选项534。
参照图6(1),示例性的,渲染效果设置区域52中的渲染效果选项,与空间场景选择区域53中的场景选项是关联的;不同的场景选项,对应的渲染效果选项不同。
例如,当用户在空间场景选择区域53中,选取电影院选项531后,渲染效果设置区域52可以显示与电影院选项531对应的渲染效果选项。当用户在空间场景选择区域53中,选取音乐厅选项532后,渲染效果设置区域52可以显示与音乐厅选项532对应的渲染效果选项。当用户在空间场景选择区域53中,选取录音棚选项533后,渲染效果设置区域52可以显示与录音棚选项533对应的渲染效果选项。当用户在空间场景选择区域53中,选取KTV选项534后,渲染效果设置区域52可以显示与KTV选项534对应的渲染效果选项。
示例性的,不同的场景选项对应的渲染效果选项不同可以是指:不同的场景选项,渲染效果选项对应渲染效果参数的默认参数值不同。
示例性的,针对不同场景选项,显示的渲染效果选项的滑块(或者旋钮)的位置可以相同,也可以不同,本申请对此不作限制。
以下在图6(1)的基础上,对根据用户针对渲染效果选项的设置操作,来进行空间音频处理的过程进行示例性说明。
图6(2)为示例性示出的音频处理过程示意图。
S601,响应于用户的播放操作,对源音频信号中的初始音频片段进行空间音频处理, 以得到初始双耳信号并播放初始双耳信号,源音频信号为媒体文件。
S602,响应于用户针对目标场景选项的选取操作,显示目标场景选项对应的渲染效果选项。
S603,接收用户针对渲染效果选项的设置操作,渲染效果选项包括以下至少一种:声像位置选项、距离感选项或空间感选项。
示例性的，当用户需要设置渲染效果时，可以进入图6(1)中的空间音频效果设置界面51，然后从空间场景选择区域53中，选取所需的目标场景选项。这样，终端设备可以响应于用户针对目标场景选项的选取操作，在渲染效果设置区域52显示目标场景选项对应的渲染效果选项。接着，用户可以针对渲染效果设置区域52中至少一个渲染效果选项执行设置操作，具体可以按照上述S502的描述，在此不再赘述。
S604,根据设置,对源音频信号中初始音频片段之后的音频片段继续进行空间音频处理,以得到目标双耳信号。
一种可能的方式中,待接收到用户针对渲染效果选项的设置操作后,可以根据用户针对渲染效果选项的设置操作对渲染效果参数进行调节;然后根据调节后的渲染效果参数,对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到用于耳机播放的目标双耳信号;具体在后续进行说明。
一种可能的方式中,待接收到用户针对渲染效果选项的设置操作后,可以根据用户针对渲染效果选项的设置操作对渲染效果参数进行调节;接着,根据目标场景选项更新场景参数;然后,根据调节后的渲染效果参数和更新后的场景参数,对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到用于耳机播放的目标双耳信号;具体在后续进行说明。
以下在图6的基础上,以用户针对声像位置选项521,距离感选项522和空间感选项523均执行了设置操作为例进行说明。
示例性的,可以参照上述图3a实施例描述的方法,可以通过对源音频信号中初始音频片段之后的音频片段分别进行直达声渲染、早期反射声渲染和晚期反射声渲染,来实现S603中的空间音频处理。即,根据设置,对源音频信号中初始音频片段之后的音频片段分别进行直达声渲染、早期反射声渲染和晚期反射声渲染,以得到目标双耳信号。这样,在任一类型的空间场景下,都能够还原出高精度的声像位置、音频的空间感和距离感,进而达到更真实沉浸的双耳渲染效果。
以下对根据设置,对源音频信号中初始音频片段之后的音频片段分别进行直达声渲染、早期反射声渲染和晚期反射声渲染的过程进行说明。
图7a为示例性示出的音频处理过程示意图。在图7a的实施例中,描述了对源音频信号中初始音频片段之后的音频片段分别进行直达声渲染、早期反射声渲染和晚期反射声渲染的一种方式。
S701,响应于用户的播放操作,对源音频信号中的初始音频片段进行空间音频处理,以得到初始双耳信号并播放初始双耳信号,源音频信号为媒体文件。
S702，响应于用户针对目标场景选项的选取操作，显示目标场景选项对应的渲染效果选项。
S703,接收用户针对渲染效果选项的设置操作,渲染效果选项包括:声像位置选项、距离感选项和空间感选项。
示例性的,S701~S703可以参照S601~S603的描述,在此不再赘述。
S704,根据针对声像位置选项的设置操作,调节声像位置参数。
示例性的,可以将与声像位置选项对应的渲染效果参数,称为声像位置参数。
参照图6(1),示例性的,可以根据用户针对声像位置选项521的设置操作,确定声像位置选项521的滑块位置,然后根据声像位置选项521的滑块位置,对声像位置参数进行调节。
需要说明的是,对声像位置参数进行调节是指,对声像位置参数的参数值进行调节。
示例性的,根据声像位置参数对源音频信号中初始音频片段之后的音频片段进行直达声渲染,以得到第一双耳信号。可以按照如下S705~S707:
S705,从预设的直达声RIR库中选取候选直达声RIR,以及根据声像位置参数确定声像位置修正因子。
S706,根据声像位置修正因子对候选直达声RIR进行修正,以得到目标直达声RIR。
S707,根据目标直达声RIR对源音频信号中初始音频片段之后的音频片段进行直达声渲染,以得到第一双耳信号。
示例性的，可以预先建立直达声RIR库。示例性的，可以预先在自由声场条件下（如消声室环境）采用一种头部类型的人工头录音装置，分别采集声源位于自由声场条件中p1（p1为正整数）个位置时的响应，可以得到p1个位置的直达声RIR（即HRIR）。然后可以采用p1个位置的HRIR，组成一种头部类型对应的直达声RIR（为了便于描述，将一种头部类型对应的直达声RIR，称为一个第一集合）。第一集合可以表示为：
HRIRBIN(j)=[HRIRBIN(j),1,HRIRBIN(j),2,...,HRIRBIN(j),p1]
其中，一个第一集合可以包括p1个位置的预设直达声RIR，上标(j)表示第j种头部类型。
按照上述方式，针对m1种头部类型，可以录制得到m1个第一集合，m1为正整数。然后采用这m1个第一集合，组成直达声RIR库；直达声RIR库可以表示为：
[HRIRBIN(1),HRIRBIN(2),...,HRIRBIN(m1)]
示例性的,头部类型可以包括但不限于:女性头部类型、男性头部类型、老年人头部类型、中年人头部类型、青年人头部类型、儿童头部类型、欧洲人种头部类型、亚洲人种头部类型等等,本申请对此不作限制。
示例性的,可以从直达声RIR库的m1个第一集合中,根据用户的头部类型,选取第一目标集合。
一种可能的方式中，可以根据用户在登录系统账号时输入的性别、年龄等信息，确定用户的头部类型。一种可能的方式中，图6(1)中空间音频效果设置界面51还可以包括头部类型设置区域，头部类型设置区域包括多个头部类型选项，如女性头部类型选项、男性头部类型选项、老年人头部类型选项、中年人头部类型选项、青年人头部类型选项、儿童头部类型选项、欧洲人种头部类型选项、亚洲人种头部类型选项等等。用户可以根据自身情况，选取对应的头部类型选项；这样，可以根据用户选中的头部类型选项，确定用户的头部类型。一种可能的方式中，不同空间场景对应的头部类型不同，可以根据空间场景参数，确定用户的头部类型。一种可能的方式中，可以提示用户使用手机拍摄用户耳廓的图像；然后可以根据用户拍摄的耳廓的图像，从预设的多种头部类型中查找与用户最相似的头部类型，确定为用户的头部类型。
示例性的,当无法获取到用户头部位置信息时,可以根据源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息,以及第一目标集合中p1个位置的预设直达声RIR的位置信息,从第一目标集合中选取候选直达声RIR。示例性的,可以从第一目标集合中选取位置信息与源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息距离最近的预设直达声RIR,作为候选直达声RIR。
示例性的,当可以获取到用户头部位置信息时,可以根据用户的头部位置信息、源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息、以及第一目标集合中p1个位置的预设直达声RIR的位置信息,从第一目标集合中选取候选直达声RIR。示例性的,可以确定源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息和用户的头部位置信息的偏移值,然后可以从第一目标集合中选取位置信息与偏移值距离最近的预设直达声RIR,作为候选直达声RIR。这样,能够实现头动跟踪渲染。
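上述“以声源位置与头部位置的偏移为目标，从集合中选取位置距离最近的预设RIR”的选取方式，可用如下Python草图示意；其中的球坐标约定（距离、水平角、俯仰角，弧度制）与示例位置均为假设：

```python
import numpy as np

def sph_to_cart(r, az, el):
    """球坐标(距离 r, 水平角 az, 俯仰角 el, 弧度)转笛卡尔坐标。"""
    return np.array([r * np.cos(el) * np.cos(az),
                     r * np.cos(el) * np.sin(az),
                     r * np.sin(el)])

def pick_rir(source_pos, head_pos, rir_positions):
    """以声源位置与头部位置的偏移为目标, 返回 RIR 库中位置最近的一条的索引。"""
    target = sph_to_cart(*source_pos) - np.asarray(head_pos)
    dists = [np.linalg.norm(sph_to_cart(*p) - target) for p in rir_positions]
    return int(np.argmin(dists))

# 库中三个预设 RIR 的位置(球坐标), 头部位于原点
positions = [(1.0, 0.0, 0.0), (1.0, np.pi / 2, 0.0), (2.0, 0.0, 0.0)]
idx = pick_rir((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), positions)
print(idx)  # 0
```

头部位置变化时重新计算偏移并重选RIR，即可实现文中所述的头动跟踪渲染。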
示例性的,可以根据调节后的声像位置参数的参数值,确定声像位置修正因子。示例性的,可以预先建立声像位置参数的参数值与对应声像位置修正因子之间的关系,然后根据声像位置参数调节后的参数值查找该关系,确定对应的声像位置修正因子。接着,采用声像位置修正因子对候选直达声RIR进行修正,可以得到目标直达声RIR;可以参照如下公式:
HRIR'=α·HRIR
其中,HRIR'为目标直达声RIR,α为声像位置修正因子,HRIR为候选直达声RIR。
示例性的,α可以用一组滤波器表示,通过对候选直达声RIR中高频部分进行衰减,可以降低声像位置。
进而,可以通过声像位置修正因子对候选直达声RIR的修正,来实现对声像位置的调节,可以得到目标直达声RIR。
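以滤波器衰减候选直达声RIR高频成分、从而降低声像位置的修正HRIR'=α·HRIR，可用如下Python草图示意；其中用滑动平均充当低通滤波器、用strength刻画修正强度，均为示例性假设，并非本申请限定的滤波器设计：

```python
import numpy as np

def lower_image(hrir, strength, taps=5):
    """用简单的滑动平均低通滤波衰减 HRIR 的高频成分, 以降低感知的声像位置。

    strength 取 0~1: 0 表示不修正, 1 表示完全使用低通后的 HRIR。
    """
    kernel = np.ones(taps) / taps
    lowpassed = np.convolve(hrir, kernel, mode="same")
    # 在原 HRIR 与低通后的 HRIR 之间线性插值, 相当于一个可调的修正因子 α
    return (1 - strength) * hrir + strength * lowpassed

hrir = np.array([0., 1., 0., -1., 0., 1., 0.])
print(lower_image(hrir, 0.0))   # strength=0 时不做修正, 原样返回
```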
示例性的，可以采用目标直达声RIR与源音频信号中初始音频片段之后的音频片段进行卷积，来实现对源音频信号中初始音频片段之后的音频片段进行直达声渲染，以得到第一双耳信号。针对源音频信号的第i个音频信号中初始音频片段之后的音频片段si(t,ri,θi,φi)，可以参照如下公式进行直达声渲染：
out1,i(t)=si(t,ri,θi,φi)*HRIR'i
其中，上述公式中的“*”表示卷积，out1,i(t)为源音频信号的第i个音频信号中初始音频片段之后的音频片段进行直达声渲染后得到的音频信号，HRIR'i为目标直达声RIR。
假设，源音频信号包括N（N为大于1的整数）个通道，则对源音频信号中初始音频片段之后的音频片段进行直达声渲染，得到的第一双耳信号out1(t)可以如下：
out1(t)=out1,1(t)+out1,2(t)+...+out1,N(t)
S708,根据针对距离感选项的设置操作,调节距离感参数。
示例性的,可以将与距离感选项对应的渲染效果参数,称为距离感参数。
参照图6(1),示例性的,可以根据用户针对距离感选项522的设置操作,确定距离感选项522的滑块位置,然后根据距离感选项522的滑块位置,对距离感参数进行调节。
需要说明的是,对距离感参数进行调节是指对距离感参数的参数值进行调节。
示例性的,根据距离感参数对源音频信号中初始音频片段之后的音频片段进行早期反射声渲染,以得到第二双耳信号;可以参照如下S709~S711:
S709,从预设的早期反射声RIR库中选取候选早期反射声RIR,以及根据距离感参数确定距离感修正因子。
S710,根据距离感修正因子对候选早期反射声RIR进行修正,以得到目标早期反射声RIR。
S711,根据目标早期反射声RIR对源音频信号中初始音频片段之后的音频片段进行早期反射声渲染,以得到第二双耳信号。
示例性的，可以预先建立早期反射声RIR库。示例性的，可以预先在一种空间场景对应的声学环境中采用球形麦克风，分别采集声源位于该空间场景对应的声学环境中p2（p2为正整数）个位置的响应，可以得到p2个位置的RIR。然后分别确定p2个位置的RIR中声源至球形麦克风之间反射路径的前一部分脉冲响应，可以得到p2个位置的早期反射声RIR（即HOA RIR）。然后可以采用p2个位置的早期反射声RIR，组成一种空间场景对应的早期反射声RIR（为了便于描述，将一种空间场景对应的早期反射声RIR，称为一个第二集合）。第二集合可以表示为：
ERAMB(j)=[ERAMB(j),1,ERAMB(j),2,...,ERAMB(j),p2]
其中，一个第二集合可以包括p2个位置的预设早期反射声RIR，上标(j)表示第j种空间场景。
按照上述方式，针对m2种空间场景，可以录制得到m2个第二集合，m2为正整数。然后采用这m2个第二集合，组成早期反射声RIR库。其中，早期反射声RIR库可以表示为：
[ERAMB(1),ERAMB(2),...,ERAMB(m2)]
示例性的，可以从早期反射声RIR库的m2个第二集合中，选取与空间场景参数对应的第二集合，作为第二目标集合。
示例性的,当无法获取到用户头部位置信息时,可以根据源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息,以及第二目标集合中p2个位置的预设早期反射声RIR的位置信息,从第二目标集合中选取候选早期反射声RIR。示例性的,可以从第二目标集合中选取位置信息与源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息距离最近的预设早期反射声RIR,作为候选早期反射声RIR。
示例性的,当可以获取到用户头部位置信息时,可以根据用户的头部位置信息、源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息、以及第二目标集合中p2个位置预设早期反射声RIR的位置信息,从第二目标集合中选取候选早期反射声RIR。示例性的,可以确定源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息和用户的头部位置信息的偏移值,然后可以从第二目标集合中选取位置信息与偏移值距离最近的预设早期反射声RIR,作为候选早期反射声RIR。这样,能够实现头动跟踪渲染。
示例性的，可以根据调节后的距离感参数，确定距离感修正因子。然后，采用距离感修正因子对候选早期反射声RIR进行修正，可以得到目标早期反射声RIR；可以参照如下公式：
ER'=β·ER
其中，ER'为目标早期反射声RIR，β为距离感修正因子，ER为候选早期反射声RIR。
示例性的，β可以采用增益表示，通过增加候选早期反射声RIR的幅值，降低距离感。
进而,可以通过距离感修正因子对候选早期反射声RIR的修正,来实现对距离感的调节。
示例性的，可以采用目标早期反射声RIR与源音频信号中初始音频片段之后的音频片段进行卷积，来实现对源音频信号中初始音频片段之后的音频片段进行早期反射声渲染，以得到第二双耳信号。针对源音频信号的第i个音频信号中初始音频片段之后的音频片段si(t,ri,θi,φi)，可以参照如下公式进行早期反射声渲染：
out2,i(t)=si(t,ri,θi,φi)*ER'i
其中，上述公式中的“*”表示卷积，out2,i(t)为源音频信号的第i个音频信号中初始音频片段之后的音频片段进行早期反射声渲染后得到的音频信号，ER'i为目标早期反射声RIR。
假设，源音频信号包括N（N为大于1的整数）个通道，则对源音频信号中初始音频片段之后的音频片段进行早期反射声渲染，得到的第二双耳信号out2(t)可以如下：
out2(t)=out2,1(t)+out2,2(t)+...+out2,N(t)
S712,根据针对空间感选项的设置操作,调节空间感参数。
示例性的,可以将与空间感选项对应的渲染效果参数,称为空间感参数。
参照图6(1)，示例性的，可以根据用户针对空间感选项523的设置操作，确定空间感选项523的滑块位置，然后根据空间感选项523的滑块位置，对空间感参数进行调节。
需要说明的是,对空间感参数进行调节是指对空间感参数的参数值进行调节。
示例性的,可以根据空间感参数对源音频信号中初始音频片段之后的音频片段进行晚期反射声渲染,以得到第三双耳信号;可以参照如下S713~S715:
S713,从预设的晚期反射声RIR库中选取候选晚期反射声RIR,以及根据空间感参数确定空间感修正因子。
S714,根据空间感修正因子对候选晚期反射声RIR进行修正,以得到目标晚期反射声RIR。
S715,根据目标晚期反射声RIR对源音频信号中初始音频片段之后的音频片段进行晚期反射声渲染,以得到第三双耳信号。
示例性的，可以预先建立晚期反射声RIR库。示例性的，可以预先在一种空间场景对应的声学环境中采用球形麦克风，分别采集声源位于该空间场景对应的声学环境中p3（p3为正整数）个位置的响应，可以得到p3个位置的RIR。然后分别确定p3个位置的RIR中声源至球形麦克风之间反射路径的后一部分脉冲响应，可以得到p3个位置的晚期反射声RIR（即HOA RIR）。然后可以采用p3个位置的晚期反射声RIR，组成一种空间场景对应的晚期反射声RIR（为了便于描述，将一种空间场景对应的晚期反射声RIR，称为一个第三集合）。第三集合可以表示为：
LRAMB(j)=[LRAMB(j),1,LRAMB(j),2,...,LRAMB(j),p3]
其中，一个第三集合可以包括p3个位置的预设晚期反射声RIR，上标(j)表示第j种空间场景。
按照上述方式，针对m3种空间场景类型，可以采集得到m3个第三集合，m3为正整数。然后采用这m3个第三集合，组成晚期反射声RIR库。其中，晚期反射声RIR库可以表示为：
[LRAMB(1),LRAMB(2),...,LRAMB(m3)]
需要说明的是,m2与m3可以相等。
示例性的，可以从晚期反射声RIR库的m3个第三集合中，选取与空间场景参数对应的第三集合，作为第三目标集合。
示例性的,当无法获取用户的头部位置信息时,可以根据源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息,以及第三目标集合中p3个位置的预设晚期反射声RIR的位置信息,从第三目标集合中选取候选晚期反射声RIR。示例性的,可以从第三目标集合中选取位置信息与源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息距离最近的预设晚期反射声RIR,作为候选晚期反射声RIR。
示例性的,当可以获取用户的头部位置信息时,可以根据用户的头部位置信息、源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息、以及第三目标集合中p3个位置的预设晚期反射声RIR的位置信息,从第三目标集合中选取候选晚期反射声RIR。示例性的,可以确定源音频信号中当前处理的音频信号(即源音频信号中初始音频片段之后的音频片段)的位置信息和用户的头部位置信息的偏移值,然后可以从第三目标集合中选取位置信息与偏移值距离最近的预设晚期反射声RIR,作为候选晚期反射声RIR。这样,能够实现头动跟踪渲染。
示例性的，可以根据调节后的空间感参数，确定空间感修正因子。然后，采用空间感修正因子对候选晚期反射声RIR进行修正，可以得到目标晚期反射声RIR；可以参照如下公式：
LR'=γ·LR
其中，LR'为目标晚期反射声RIR，γ为空间感修正因子，LR为候选晚期反射声RIR。
示例性的，γ可以采用增益表示，通过增加候选晚期反射声RIR的幅值，可以增加空间感。
进而，可以通过空间感修正因子对候选晚期反射声RIR的修正，来实现对空间感的调节。
示例性的，可以采用目标晚期反射声RIR与源音频信号中初始音频片段之后的音频片段进行卷积，来实现对源音频信号中初始音频片段之后的音频片段进行晚期反射声渲染，以得到第三双耳信号。针对源音频信号的第i个音频信号中初始音频片段之后的音频片段si(t,ri,θi,φi)，可以参照如下公式进行晚期反射声渲染：
out3,i(t)=si(t,ri,θi,φi)*LR'i
其中，上述公式中的“*”表示卷积，out3,i(t)为源音频信号的第i个音频信号中初始音频片段之后的音频片段进行晚期反射声渲染后得到的音频信号，LR'i为目标晚期反射声RIR。
假设，源音频信号包括N（N为大于1的整数）个通道，则对源音频信号中初始音频片段之后的音频片段进行晚期反射声渲染，得到的第三双耳信号out3(t)可以如下：
out3(t)=out3,1(t)+out3,2(t)+...+out3,N(t)
S716,基于第一双耳信号、第二双耳信号和第三双耳信号,确定目标双耳信号。
示例性的,S716可以参照上述S305的描述,在此不再赘述。
需要说明的是,S704~S707,S708~S711以及S712~S715,可以并行执行,也可以串行执行。
图7b为示例性示出的音频处理过程示意图。在图7a的基础上,在得到第一双耳信号、第二双耳信号和第三双耳信号后,可以对第一双耳信号、第二双耳信号和第三双耳信号进行音效处理,以对音频进行修饰。
示例性的,可以预先建立多种空间场景与多个音效参数组之间的对应关系,以得到预设关系。例如,预设关系可以包括:电影院—音效参数组1,音乐厅—音效参数组2,录音棚—音效参数组3,KTV—音效参数组4。其中,每个音效参数组可以包括多个音效参数。
示例性的,可以根据预设关系,确定与场景参数(根据目标场景选项更新后的场景参数)匹配的音效参数组。
示例性的,与场景参数匹配的音效参数组可以包括直达声音效参数(音效参数1)、早前反射声音效参数(音效参数2)、晚期反射声音效参数(音效参数3)和第一混合音效参数(音效参数4)。
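上述“空间场景—音效参数组”的预设关系，本质上是一个按场景查表的过程，可用如下Python草图示意；其中的场景键名与各参数值均为示例性假设，并非本申请限定的音效参数：

```python
# 假设的预设关系: 空间场景 -> 音效参数组(直达声/早期反射/晚期反射/混合音效参数)
PRESETS = {
    "cinema":  {"direct": 1.0, "early": 0.8, "late": 1.2, "mix": 1.0},
    "concert": {"direct": 1.0, "early": 1.1, "late": 1.5, "mix": 1.0},
    "studio":  {"direct": 1.0, "early": 0.6, "late": 0.4, "mix": 1.0},
    "ktv":     {"direct": 1.0, "early": 1.3, "late": 0.9, "mix": 1.0},
}

def effect_params(scene):
    """根据预设关系, 查找与场景参数匹配的音效参数组。"""
    return PRESETS[scene]

print(effect_params("studio")["late"])  # 0.4
```

用户选中目标场景选项后，用更新后的场景参数查此表即可得到对应的音效参数组。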
参照图7b,示例性的,S717,依据音效参数1对第一双耳信号进行音效处理,以得到音频信号1。
示例性的,S717可以在S707之后且在S716之前执行,即在得到第一双耳信号后,可以对第一双耳信号进行音效处理,以得到音频信号1。具体可以参照上述S306的描述,在此不再赘述。
参照图7b,示例性的,S718,依据音效参数2对第二双耳信号进行音效处理,以得到音频信号2。
示例性的,S718在S711之后且在S716之前执行,即在得到第二双耳信号后,可以对第二双耳信号进行音效处理,以得到音频信号2。具体可以参照上述S307的描述,在此不再赘述。
参照图7b,示例性的,S719,依据音效参数3对第三双耳信号进行音效处理,以得到音频信号3。
示例性的,S719在S715之后且在S716之前执行,即在得到第三双耳信号后,可以对第三双耳信号进行音效处理,以得到音频信号3。具体可以参照上述S308的描述,在此不再赘述。
参照图7b,示例性的,S716可以包括S716a和S716b,其中:
S716a,对音频信号1、音频信号2和音频信号3进行混音处理,以得到音频信号4。
S716b,依据音效参数4对音频信号4进行音效处理,以得到目标双耳信号。
示例性的,S716a和S716b可以参照S305a和S305b;具体可以参照上述描述,在此不再赘述。
示例性的,可以在图6(1)的基础上,参照上述图4a实施例描述的方法,可以通过对源音频信号中初始音频片段之后的音频片段分别进行直达声渲染、早期反射声渲染和晚期反射声渲染,来实现S603中的空间音频处理。即,根据设置,对源音频信号中初始音频片段之后的音频片段分别进行直达声渲染、早期反射声渲染和晚期反射声渲染,以得到目标双耳信号。这样,在任一类型的空间场景下,都能够还原出高精度的声像位置、音频的空间感和距离感,进而达到更真实沉浸的双耳渲染效果。
以下对根据设置,对源音频信号中初始音频片段之后的音频片段分别进行直达声渲染、早期反射声渲染和晚期反射声渲染的过程进行说明。
图8a为示例性示出的音频处理过程示意图。在图8a的实施例中,描述了对源音频信号中初始音频片段之后的音频片段分别进行直达声渲染、早期反射声渲染和晚期反射声渲染的一种方式。
S801,响应于用户的播放操作,对源音频信号中的初始音频片段进行空间音频处理,以得到初始双耳信号并播放初始双耳信号,源音频信号为媒体文件。
S802,响应于用户针对目标场景选项的选取操作,显示目标场景选项对应的渲染效果选项。
S803,接收用户针对渲染效果选项的设置操作,渲染效果选项包括:声像位置选项、距离感选项和空间感选项。
示例性的,S801~S803可以参照S701~S703的描述,在此不再赘述。
S804,根据针对距离感选项的设置操作,调节距离感参数。
S805,从预设的早期反射声RIR库中选取候选早期反射声RIR,以及根据距离感参数确定距离感修正因子。
S806,根据距离感修正因子对候选早期反射声RIR进行修正,以得到目标早期反射声RIR。
S807,根据目标早期反射声RIR对源音频信号中初始音频片段之后的音频片段进行早期反射声渲染,以得到第二双耳信号。
示例性的,S804~S807可以参照上述S708~S711的描述,在此不再赘述。
S808,根据针对空间感选项的设置操作,调节空间感参数。
S809,从预设的晚期反射声RIR库中选取候选晚期反射声RIR,以及根据空间感参数确定空间感修正因子。
S810,根据空间感修正因子对候选晚期反射声RIR进行修正,以得到目标晚期反射声RIR。
S811,根据目标晚期反射声RIR对源音频信号中初始音频片段之后的音频片段进行晚期反射声渲染,以得到第三双耳信号。
示例性的,S808~S811可以参照上述S712~S715的描述,在此不再赘述。
S812,对第二双耳信号和第三双耳信号进行混音处理,以得到第四双耳信号。
示例性的，S812可以参照上述S404的描述，在此不再赘述。
S813,根据针对声像位置选项的设置操作,调节声像位置参数。
S814,从预设的直达声RIR库中选取候选直达声RIR,以及根据声像位置参数确定声像位置修正因子。
S815,根据声像位置修正因子对候选直达声RIR进行修正,以得到目标直达声RIR。
S816,根据目标直达声RIR对第四双耳信号进行直达声渲染,以得到第五双耳信号。
示例性的,S813~S816可以参照上述S704~S707的描述,在此不再赘述。
S817,基于第五双耳信号,确定目标双耳信号。
示例性的，S817可以参照上述S406的描述，在此不再赘述。
图8b为示例性示出的音频处理过程示意图。在图8a的基础上，在得到第二双耳信号、第三双耳信号、第四双耳信号和第五双耳信号后，可以对第二双耳信号、第三双耳信号、第四双耳信号和第五双耳信号进行音效处理，以对音频进行修饰。
示例性的,根据预设关系,确定与空间场景参数匹配的音效参数组;可以参照上述的描述,在此不再赘述。
示例性的,与空间场景参数匹配的音效参数组,可以包括:直达声音效参数(音效参数1)、早前反射声音效参数(音效参数2)、晚期反射声音效参数(音效参数3)和第二混合音效参数(音效参数5)。
参照图8b,示例性的,S818,依据音效参数2对第二双耳信号进行音效处理,以得到音频信号2。
示例性的，S818在S807之后且在S812之前执行，即在得到第二双耳信号后，可以对第二双耳信号进行音效处理，以得到音频信号2。具体可以参照上述S307的描述，在此不再赘述。
参照图8b，示例性的，S819，依据音效参数3对第三双耳信号进行音效处理，以得到音频信号3。
示例性的，S819在S811之后且在S812之前执行，即在得到第三双耳信号后，可以对第三双耳信号进行音效处理，以得到音频信号3。具体可以参照上述S308的描述，在此不再赘述。
示例性的，S812可以包括对音频信号2和音频信号3进行混音处理，以得到第四双耳信号。
参照图8b,示例性的,S820,依据音效参数5对第四双耳信号进行音效处理,以得到音频信号6。
示例性的，S820在S812之后且在S816之前执行，即在得到第四双耳信号后，可以对第四双耳信号进行音效处理，以得到音频信号6。具体可以参照上述S409的描述，在此不再赘述。
参照图8b,示例性的,上述S817可以包含S817_X;其中,S817_X,依据音效参数1对第五双耳信号进行音效处理,以得到目标双耳信号。具体可以参照上述描述,在此不再赘述。
需要说明的是,当用户仅针对部分渲染效果选项执行了设置操作时,可以根据针对部分渲染效果选项的设置操作,调节部分渲染效果参数;对于用户未执行的渲染效果选项,可以使用对应渲染效果参数的默认参数值进行渲染。
例如，当用户仅针对声像位置选项执行设置操作时，可以根据针对声像位置选项的设置操作，调节声像位置参数；根据声像位置参数调整后的参数值对源音频信号中初始音频片段之后的音频片段进行直达声渲染，以得到第一双耳信号（或者，根据声像位置参数调整后的参数值对第四双耳信号进行直达声渲染，以得到第五双耳信号）。然后根据距离感参数的默认值对源音频信号中初始音频片段之后的音频片段进行早期反射声渲染，以得到第二双耳信号，以及根据空间感参数的默认值对源音频信号中初始音频片段之后的音频片段进行晚期反射声渲染，以得到第三双耳信号。
例如，当用户仅针对距离感选项执行设置操作时，可以根据针对距离感选项的设置操作，调节距离感参数；根据距离感参数调整后的参数值对源音频信号中初始音频片段之后的音频片段进行早期反射声渲染，以得到第二双耳信号。然后根据声像位置参数的默认值对源音频信号中初始音频片段之后的音频片段进行直达声渲染，以得到第一双耳信号（或者，根据声像位置参数的默认值对第四双耳信号进行直达声渲染，以得到第五双耳信号），以及根据空间感参数的默认值对源音频信号中初始音频片段之后的音频片段进行晚期反射声渲染，以得到第三双耳信号。
例如，当用户仅针对空间感选项执行设置操作时，可以根据针对空间感选项的设置操作，调节空间感参数；根据空间感参数调整后的参数值对源音频信号中初始音频片段之后的音频片段进行晚期反射声渲染，以得到第三双耳信号。然后根据声像位置参数的默认值对源音频信号中初始音频片段之后的音频片段进行直达声渲染，以得到第一双耳信号（或者，根据声像位置参数的默认值对第四双耳信号进行直达声渲染，以得到第五双耳信号），以及根据距离感参数的默认值对源音频信号中初始音频片段之后的音频片段进行早期反射声渲染，以得到第二双耳信号。以此类推，在此不再赘述。
一种可能的方式中,本申请提供的音频处理方法可以应用于耳机中。这种情况下,耳机可以从与其连接的移动终端获取源音频信号和音频处理参数(其中,音频处理参数可以是指用于进行空间音频处理的参数,音频处理参数可以包括渲染效果参数、音效参数组、修正因子等等),然后执行根据音频处理参数对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到目标双耳信号的步骤;接着,对目标双耳信号进行播放。
示例性的,耳机中可以布设有头部运动信息采集的传感器(如陀螺仪、惯性传感器)时,耳机可以根据采集的头部运动信息确定用户的头部位置信息;然后可以根据音频处理参数和头部位置信息,对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到目标双耳信号。
一种可能的方式中,本申请提供的音频处理方法可以应用于移动终端中。这种情况下,移动终端可以通过与用户的交互,获取源音频信号和音频处理参数,然后执行根据音频处理参数对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到目标双耳信号的步骤。移动终端在得到目标双耳信号后,可以将目标双耳信号发送给与移动终端连接的耳机,由耳机播放目标双耳信号。
示例性的,移动终端可以从与移动终端连接的耳机获取头部位置信息,然后根据音频处理参数和头部位置信息,对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到目标双耳信号。
一种可能的方式中,本申请提供的音频处理方法可以应用于VR设备中。这种情况下,VR设备可以根据与用户的交互,获取源音频信号和音频处理参数,然后,执行根据音频处理参数对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到目标双耳信号的步骤。接着,VR设备可以播放目标双耳信号(或者,将目标双耳信号发送给耳机,由耳机播放目标双耳信号)。
示例性的,VR设备中可以布设有头部运动信息采集的传感器(如陀螺仪、惯性传感器)时,VR设备可以根据采集的头部运动信息确定用户的头部位置信息;然后可以根据音频处理参数和头部位置信息,对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到目标双耳信号。(或者,从与VR设备连接的耳机获取头部位置信息,然后根据音频处理参数和头部位置信息,对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到目标双耳信号)。
图9为示例性示出的音频处理系统示意图。图9示出的是,本申请实施例提供的一种音频处理系统,该音频处理系统包括移动终端和与移动终端901连接的耳机902;其中,
移动终端901,用于执行上述实施例的音频处理方法,以及将目标双耳信号发送给耳机;
耳机902,用于播放目标双耳信号。
示例性的,耳机902,用于采集用户的头部运动信息,根据头部运动信息确定用户的头部位置信息;以及将头部位置信息发送至移动终端;
移动终端901,用于根据设置和头部位置信息,对源音频信号中初始音频片段之后的音频片段进行空间音频处理,以得到目标双耳信号。
一个示例中，图10示出了本申请实施例的一种装置1000的示意性框图。装置1000可包括：处理器1001和收发器/收发管脚1002，可选地，还包括存储器1003。
装置1000的各个组件通过总线1004耦合在一起,其中总线1004除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图中将各种总线都称为总线1004。
可选地,存储器1003可以用于存储前述方法实施例中的指令。该处理器1001可用于执行存储器1003中的指令,并控制接收管脚接收信号,以及控制发送管脚发送信号。
装置1000可以是上述方法实施例中的电子设备或电子设备的芯片。
其中,上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能模块的功能描述,在此不再赘述。
本实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机指令,当该计算机指令在电子设备上运行时,使得电子设备执行上述相关方法步骤实现上述实施例中的音频处理方法。
本实施例还提供了一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述相关步骤,以实现上述实施例中的音频处理方法。
另外,本申请的实施例还提供一种装置,这个装置具体可以是芯片,组件或模块,该装置可包括相连的处理器和存储器;其中,存储器用于存储计算机执行指令,当装置运行时,处理器可执行存储器存储的计算机执行指令,以使芯片执行上述各方法实施例中的音频处理方法。
其中,本实施例提供的电子设备、计算机可读存储介质、计算机程序产品或芯片均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
通过以上实施方式的描述,所属领域的技术人员可以了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个装置,或一些特征可以忽略,或不执行。另一点, 所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是一个物理单元或多个物理单元,即可以位于一个地方,或者也可以分布到多个不同地方。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
本申请各个实施例的任意内容,以及同一实施例的任意内容,均可以自由组合。对上述内容的任意组合均在本申请的范围之内。
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该软件产品存储在一个存储介质中,包括若干指令用以使得一个设备(可以是单片机,芯片等)或处理器(processor)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
上面结合附图对本申请的实施例进行了描述,但是本申请并不局限于上述的具体实施方式,上述的具体实施方式仅仅是示意性的,而不是限制性的,本领域的普通技术人员在本申请的启示下,在不脱离本申请宗旨和权利要求所保护的范围情况下,还可做出很多形式,均属于本申请的保护之内。
结合本申请实施例公开内容所描述的方法或者算法的步骤可以硬件的方式来实现,也可以是由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于随机存取存储器(Random Access Memory,RAM)、闪存、只读存储器(Read Only Memory,ROM)、可擦除可编程只读存储器(Erasable Programmable ROM,EPROM)、电可擦可编程只读存储器(Electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、只读光盘(CD-ROM)或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。
本领域技术人员应该可以意识到,在上述一个或多个示例中,本申请实施例所描述的功能可以用硬件、软件、固件或它们的任意组合来实现。当使用软件实现时,可以将这些功能存储在计算机可读介质中或者作为计算机可读介质上的一个或多个指令或代码进行传输。计算机可读介质包括计算机可读存储介质和通信介质,其中通信介质包括便于从一个地方向另一个地方传送计算机程序的任何介质。存储介质可以是通用或专用计算机能够存取的任何可用介质。
上面结合附图对本申请的实施例进行了描述,但是本申请并不局限于上述的具体实施方式,上述的具体实施方式仅仅是示意性的,而不是限制性的,本领域的普通技术人员在本申请的启示下,在不脱离本申请宗旨和权利要求所保护的范围情况下,还可做出很多形式,均属于本申请的保护之内。

Claims (23)

  1. 一种音频处理方法,其特征在于,所述方法包括:
    响应于用户的播放操作,对源音频信号中的初始音频片段进行空间音频处理,以得到初始双耳信号并播放所述初始双耳信号,所述源音频信号为媒体文件;
    接收用户针对渲染效果选项的设置,所述渲染效果选项包括以下至少一种:声像位置选项、距离感选项或空间感选项;
    根据所述设置,对所述源音频信号中所述初始音频片段之后的音频片段继续进行空间音频处理,以得到目标双耳信号。
  2. 根据权利要求1所述的方法,其特征在于,当所述渲染效果选项包括所述声像位置选项时,所述根据所述设置,对所述源音频信号中所述初始音频片段之后的音频片段继续进行空间音频处理,以得到目标双耳信号,包括:
    根据针对所述声像位置选项的设置,调节声像位置参数;
    根据所述声像位置参数对所述源音频信号中所述初始音频片段之后的音频片段进行直达声渲染,以得到第一双耳信号;
    根据所述第一双耳信号,确定所述目标双耳信号。
  3. 根据权利要求1所述的方法,其特征在于,当所述渲染效果选项包括所述距离感选项时,所述根据所述设置,对所述源音频信号中所述初始音频片段之后的音频片段继续进行空间音频处理,以得到目标双耳信号,包括:
    根据针对所述距离感选项的设置,调节距离感参数;
    根据所述距离感参数对所述源音频信号中所述初始音频片段之后的音频片段进行早期反射声渲染,以得到第二双耳信号;
    根据所述第二双耳信号,确定所述目标双耳信号。
  4. 根据权利要求1所述的方法,其特征在于,当所述渲染效果选项包括所述空间感选项时,所述根据所述设置,对所述源音频信号中所述初始音频片段之后的音频片段继续进行空间音频处理,以得到目标双耳信号,包括:
    根据针对所述空间感选项的设置,调节空间感参数;
    根据所述空间感参数对所述源音频信号中所述初始音频片段之后的音频片段进行晚期反射声渲染,以得到第三双耳信号;
    根据所述第三双耳信号,确定所述目标双耳信号。
  5. 根据权利要求3所述的方法,其特征在于,当所述渲染效果选项还包括所述声像位置选项和所述空间感选项时,所述根据所述设置,对所述源音频信号中所述初始音频片段之后的音频片段继续进行空间音频处理,以得到目标双耳信号,还包括:
    根据针对所述声像位置选项的设置，调节声像位置参数；以及根据所述声像位置参数对所述源音频信号中所述初始音频片段之后的音频片段进行直达声渲染，以得到第一双耳信号；
    根据针对所述空间感选项的设置,调节空间感参数;以及根据所述空间感参数对所述源音频信号中所述初始音频片段之后的音频片段进行晚期反射声渲染,以得到第三双耳信号;
    所述依据所述第二双耳信号,确定所述目标双耳信号,包括:
    对所述第一双耳信号、所述第二双耳信号和所述第三双耳信号进行混音处理,以得到所述目标双耳信号。
  6. The method according to claim 3, characterized in that, when the rendering effect options further comprise the sound image position option and the sense-of-space option, the continuing, according to the settings, to perform spatial audio processing on audio segments after the initial audio segment in the source audio signal to obtain a target binaural signal further comprises:
    adjusting a sense-of-space parameter according to the setting for the sense-of-space option; and performing late reflection rendering on the audio segments after the initial audio segment in the source audio signal according to the sense-of-space parameter, to obtain a third binaural signal;
    and the determining the target binaural signal according to the second binaural signal comprises:
    mixing the second binaural signal and the third binaural signal, to obtain a fourth binaural signal;
    adjusting a sound image position parameter according to the setting for the sound image position option; and performing direct sound rendering on the fourth binaural signal according to the sound image position parameter, to obtain a fifth binaural signal;
    determining the target binaural signal according to the fifth binaural signal.
  7. The method according to claim 2, characterized in that the performing direct sound rendering on the audio segments after the initial audio segment in the source audio signal according to the sound image position parameter, to obtain a first binaural signal, comprises:
    selecting a candidate direct sound RIR from a preset direct sound RIR library, and determining a sound image position correction factor according to the sound image position parameter;
    correcting the candidate direct sound RIR according to the sound image position correction factor, to obtain a target direct sound RIR;
    performing direct sound rendering on the audio segments after the initial audio segment in the source audio signal according to the target direct sound RIR, to obtain the first binaural signal.
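Claim 7 (and, analogously, claims 10 and 12 for the reflection paths) corrects a library RIR with a factor derived from the user parameter, then renders by applying the corrected RIR to the audio. A sketch under assumed semantics: here the "correction" is a simple per-ear gain and the rendering is plain FIR convolution, which is one plausible reading, not the patent's definition of either step.

```python
def correct_rir(rir_lr, factor_lr):
    """Apply a per-ear correction factor to a candidate RIR (left/right impulse responses)."""
    return [[h * g for h in ear] for ear, g in zip(rir_lr, factor_lr)]

def convolve(x, h):
    """Plain FIR convolution: y[n] = sum_k h[k] * x[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def render_direct(segment, candidate_rir, factor_lr):
    # Correct the candidate RIR, then convolve the segment with each ear's target RIR.
    target_rir = correct_rir(candidate_rir, factor_lr)   # target direct sound RIR
    return convolve(segment, target_rir[0]), convolve(segment, target_rir[1])
```

In practice the convolution would be done with FFT-based partitioned convolution for RIRs thousands of taps long; the nested loop is only for clarity.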
  8. The method according to claim 7, characterized in that the direct sound RIR library comprises a plurality of first sets, each first set corresponds to one head type, and each first set comprises preset direct sound RIRs at a plurality of positions;
    and the selecting a candidate direct sound RIR from a preset direct sound RIR library comprises:
    selecting a first target set from the plurality of first sets according to the head type of the user;
    selecting the candidate direct sound RIR from the first target set according to head position information of the user, position information of the source audio signal, and position information of the preset direct sound RIRs in the first target set.
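The selection in claim 8 can be read as a two-stage lookup: pick the set matching the user's head type, then pick the stored RIR whose measurement position is closest to the source position relative to the listener's head. A sketch with an assumed data layout (a dict keyed by head type, each entry a list of `(position, rir)` pairs); the nearest-neighbor criterion is an assumption, since the claim only says the selection uses the three kinds of position information:

```python
import math

def select_candidate_rir(library, head_type, head_pos, source_pos):
    """library: {head_type: [(position, rir), ...]} -> RIR nearest to the relative source position."""
    first_target_set = library[head_type]                     # set for this head type
    rel = tuple(s - h for s, h in zip(source_pos, head_pos))  # source position relative to the head
    def dist(entry):
        pos, _rir = entry
        return math.dist(pos, rel)                            # Euclidean distance to the stored position
    return min(first_target_set, key=dist)[1]
```

Claims 11 and 13 follow the same pattern for the reflection libraries, except the first-stage key is the spatial scene rather than the head type.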
  9. The method according to claim 5 or 6, characterized in that, before the receiving the user's settings for rendering effect options, the method further comprises:
    obtaining a selection of a target scene option, and displaying the rendering effect options corresponding to the target scene option.
  10. The method according to claim 9, characterized in that the performing early reflection rendering on the audio segments after the initial audio segment in the source audio signal according to the sense-of-distance parameter, to obtain a second binaural signal, comprises:
    selecting a candidate early reflection RIR from a preset early reflection RIR library, and determining a sense-of-distance correction factor according to the sense-of-distance parameter;
    correcting the candidate early reflection RIR according to the sense-of-distance correction factor, to obtain a target early reflection RIR;
    performing early reflection rendering on the audio segments after the initial audio segment in the source audio signal according to the target early reflection RIR, to obtain the second binaural signal.
  11. The method according to claim 10, characterized in that the early reflection RIR library comprises a plurality of second sets, each second set corresponds to one spatial scene, and each second set comprises preset early reflection RIRs at a plurality of positions;
    and the selecting a candidate early reflection RIR from a preset early reflection RIR library comprises:
    selecting a second target set from the plurality of second sets according to the spatial scene parameter corresponding to the target scene option;
    selecting the candidate early reflection RIR from the second target set according to the head position information of the user, the position information of the source audio signal, and position information of the preset early reflection RIRs in the second target set.
  12. The method according to claim 9, characterized in that the performing late reflection rendering on the audio segments after the initial audio segment in the source audio signal according to the sense-of-space parameter, to obtain a third binaural signal, comprises:
    selecting a candidate late reflection RIR from a preset late reflection RIR library, and determining a sense-of-space correction factor according to the sense-of-space parameter;
    correcting the candidate late reflection RIR according to the sense-of-space correction factor, to obtain a target late reflection RIR;
    performing late reflection rendering on the audio segments after the initial audio segment in the source audio signal according to the target late reflection RIR, to obtain the third binaural signal.
  13. The method according to claim 12, characterized in that the late reflection RIR library comprises a plurality of third sets, each third set corresponds to one spatial scene, and each third set comprises preset late reflection RIRs at a plurality of positions;
    and the selecting a candidate late reflection RIR from a preset late reflection RIR library comprises:
    selecting a third target set from the plurality of third sets according to the spatial scene parameter corresponding to the target scene option;
    selecting the candidate late reflection RIR from the third target set according to the head position information of the user, the position information of the source audio signal, and position information of the preset late reflection RIRs in the third target set.
  14. The method according to any one of claims 1 to 13, characterized in that the source audio signal comprises at least one of the following formats: a multi-channel format, a multi-object format, or a spherical-harmonic surround sound (Ambisonics) format.
  15. The method according to claim 10 or 11, characterized in that
    the target early reflection RIR is a higher-order Ambisonics (HOA) RIR.
  16. The method according to claim 12 or 13, characterized in that
    the target late reflection RIR is an HOA RIR.
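Claims 15 and 16 state that the reflection-path RIRs are higher-order Ambisonics (HOA) RIRs, i.e. one impulse response per spherical-harmonic channel, so the reflection field can be rotated with the listener's head before being decoded to two ears. As a toy illustration only: a first-order (W/X/Y) signal decoded to left/right through two virtual cardioid microphones. Real HOA binauralization instead convolves each spherical-harmonic channel with per-ear HRTF filters; nothing below is from the patent.

```python
import math

def virtual_mic(w, x, y, azimuth_deg):
    """Sample a first-order Ambisonics (W/X/Y) signal with a virtual cardioid microphone."""
    a = math.radians(azimuth_deg)
    # Cardioid pointed at azimuth a: 0.5*W + 0.5*(X*cos(a) + Y*sin(a))
    return [0.5 * wi + 0.5 * (xi * math.cos(a) + yi * math.sin(a))
            for wi, xi, yi in zip(w, x, y)]

def foa_to_binaural(w, x, y):
    # Two virtual cardioids at +/-90 degrees stand in for proper HRTF-based decoding.
    return virtual_mic(w, x, y, 90.0), virtual_mic(w, x, y, -90.0)
```

A plane wave arriving from the listener's left (encoded mostly in W and Y) decodes loud in the left ear and quiet in the right, which is the behavior the decode is meant to show.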
  17. The method according to claim 8, 11 or 13, characterized in that
    the audio processing method is applied to a headphone, and the head position information is determined according to head movement information of the user collected by the headphone; or
    the audio processing method is applied to a mobile terminal, and the head position information is obtained from a headphone connected to the mobile terminal; or
    the audio processing method is applied to a virtual reality (VR) device, and the head position information is determined according to head movement information of the user collected by the VR device.
  18. An audio processing system, characterized in that the audio processing system comprises a mobile terminal and a headphone connected to the mobile terminal; wherein
    the mobile terminal is configured to: in response to a playback operation of a user, perform spatial audio processing on an initial audio segment in a source audio signal to obtain an initial binaural signal, and play the initial binaural signal, wherein the source audio signal is a media file; receive the user's settings for rendering effect options, wherein the rendering effect options comprise at least one of the following: a sound image position option, a sense-of-distance option, or a sense-of-space option; continue, according to the settings, to perform spatial audio processing on audio segments after the initial audio segment in the source audio signal, to obtain a target binaural signal; and send the target binaural signal to the headphone;
    and the headphone is configured to play the target binaural signal.
  19. The system according to claim 18, characterized in that
    the headphone is further configured to collect head movement information of the user, determine head position information of the user according to the head movement information, and send the head position information to the mobile terminal;
    and the mobile terminal is specifically configured to continue, according to the settings and the head position information, to perform spatial audio processing on the audio segments after the initial audio segment in the source audio signal, to obtain the target binaural signal.
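In the claim-19 arrangement, the headphone's head tracker feeds the terminal's renderer so the sound image stays fixed in the world as the head turns. The usual minimal form of that compensation is to subtract the head yaw from the source azimuth before the RIR lookup; the function below is such a sketch (the angle convention and wrap-around range are assumptions, not taken from the disclosure):

```python
def compensate_head_yaw(source_azimuth_deg, head_yaw_deg):
    """World-fixed source: subtract head yaw so the image stays put when the head turns.

    Returns the head-relative azimuth wrapped into [-180, 180) degrees.
    """
    return (source_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0
```

For example, a source at 30 degrees heard while the head has turned 30 degrees toward it should be rendered straight ahead (0 degrees).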
  20. An electronic device, characterized by comprising:
    a memory and a processor, wherein the memory is coupled to the processor;
    and the memory stores program instructions which, when executed by the processor, cause the electronic device to perform the audio processing method according to any one of claims 1 to 17.
  21. A chip, characterized by comprising one or more interface circuits and one or more processors, wherein the interface circuits are configured to receive signals from a memory of an electronic device and send the signals to the processors, the signals comprising computer instructions stored in the memory; and when the processors execute the computer instructions, the electronic device is caused to perform the audio processing method according to any one of claims 1 to 17.
  22. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when run on a computer or a processor, causes the computer or the processor to perform the audio processing method according to any one of claims 1 to 17.
  23. A computer program product, characterized in that the computer program product comprises a software program which, when executed by a computer or a processor, causes the steps of the method according to any one of claims 1 to 17 to be performed.
PCT/CN2023/081669 2022-07-12 2023-03-15 Audio processing method, system and electronic device WO2024011937A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210813749 2022-07-12
CN202210813749.2 2022-07-12
CN202310127907.3A CN117395592A (zh) 2022-07-12 2023-01-30 Audio processing method, system and electronic device
CN202310127907.3 2023-01-30

Publications (1)

Publication Number Publication Date
WO2024011937A1 (zh) 2024-01-18

Family

ID=89463631

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081669 WO2024011937A1 (zh) Audio processing method, system and electronic device

Country Status (2)

Country Link
CN (1) CN117395592A (zh)
WO (1) WO2024011937A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016042410A1 (en) * 2014-09-17 2016-03-24 Symphonova, Ltd Techniques for acoustic reverberance control and related systems and methods
CN109076305A (zh) * 2016-02-02 2018-12-21 DTS (BVI) Limited Augmented reality headphone environment rendering
CN111142838A (zh) * 2019-12-30 2020-05-12 Guangzhou Kugou Computer Technology Co., Ltd. Audio playing method and apparatus, computer device, and storage medium
US11246002B1 (en) * 2020-05-22 2022-02-08 Facebook Technologies, Llc Determination of composite acoustic parameter value for presentation of audio content

Also Published As

Publication number Publication date
CN117395592A (zh) 2024-01-12

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 23838439
    Country of ref document: EP
    Kind code of ref document: A1