CN116634350A

CN116634350A - Audio processing method and device and electronic equipment

Info

Publication number: CN116634350A
Application number: CN202310903809.4A
Authority: CN
Inventors: 陈绍天; 丁幸运; 胡贝贝
Original assignee: Honor Device Co Ltd
Current assignee: Honor Device Co Ltd
Priority date: 2023-07-24
Filing date: 2023-07-24
Publication date: 2023-08-22
Anticipated expiration: 2043-07-24
Also published as: CN116634350B

Abstract

The application provides an audio processing method, an audio processing device and electronic equipment. Comprising the following steps: the system comprises an audio input module, a voice extraction module, a spatial audio rendering module, a content self-adaptive frequency division module, a self-adaptive crosstalk elimination module, a down-mixing module and an audio output module. The voice extraction module performs distinguishing processing on the signals of the left and right voice channels and the voice signals so as to obtain better fidelity and voice definition. The spatial audio rendering module renders signals of left and right channels based on the head azimuth information of the user, and can improve the sound image height sense. The content self-adaptive frequency division module divides the frequency of signals of the left channel and the right channel, and the self-adaptive crosstalk elimination module eliminates the crosstalk of the intermediate frequency channel signals obtained by frequency division so as to widen the sound field. The down-mixing module performs down-mixing processing on signals of left and right channels and human voice signals, and the audio output module outputs a first output audio signal. The embodiment of the application can improve the sound image height sense, the width sense and the immersion sense of the audio on the basis of ensuring the tone quality of the audio.

Description

Audio processing method and device and electronic equipment

Technical Field

The present application relates to the field of terminal technologies, and in particular, to an audio processing method, an audio processing device, and an electronic device.

Background

With the continuous development of electronic technology, electronic devices with mobile convenience features are rapidly popularized, for example: notebook computers, tablet computers, smart phones, etc., which are usually small, light and thin, the layout of internal devices of these electronic devices is often limited. In this way, in the process of playing audio through the loudspeaker of the electronic equipment, the playing audio of the electronic equipment has the problems of thin sound, narrower sound field, lower height and insufficient immersion feeling due to the influence of the layout of the loudspeaker and the limitation of devices. Meanwhile, the dessert position problem exists in the process of playing the audio out of the electronic equipment, and the dessert position problem is a problem that the optimal hearing effect cannot be obtained when the position of the user changes because the position of the user is not considered when the audio is played out. Thus, the user experience is affected.

Disclosure of Invention

The application provides an audio processing method, an audio processing device and electronic equipment, which can solve the problems of thinness, narrower sound field, lower height and insufficient immersion of the audio played by the electronic equipment, improve the sound image height, width and immersion of the audio on the basis of guaranteeing the tone quality of the audio, and greatly inhibit dessert position, thereby improving user experience.

In a first aspect, the present application provides an audio processing method applied to an electronic device, where the electronic device is provided with at least one speaker, the method comprising: acquiring a first audio signal, wherein the first audio signal comprises a first left channel signal and a first right channel signal; performing human voice separation on the first left channel signal and the first right channel signal to obtain a first human voice signal, a second left channel signal and a second right channel signal, wherein the first human voice signal comprises a first human voice sub-signal separated from the first left channel signal and a second human voice sub-signal separated from the first right channel signal, the second left channel signal is a channel signal of the first left channel signal from which the first human voice sub-signal is separated, and the second right channel signal is a channel signal of the first right channel signal from which the second human voice sub-signal is separated; respectively performing audio rendering processing on the second left channel signal, the second right channel signal and the first human voice signal based on the first orientation information to obtain a third left channel signal, a third right channel signal and the second human voice signal, wherein the first orientation information comprises the orientation information of the head of the user relative to the loudspeaker; performing content adaptive frequency division processing on the third left channel signal and the third right channel signal respectively to divide the third left channel signal into a first left channel sub-signal of a first frequency band, a second left channel sub-signal of a second frequency band and a third left channel sub-signal of a third frequency band, and divide the third right channel signal into a first right channel sub-signal of the first frequency band, a second right channel sub-signal of the second frequency band and a third right channel sub-signal of the third frequency band, wherein the minimum value of the first frequency band is greater than or equal to the maximum value of the second frequency band, and the minimum value of the second frequency band is greater than or equal to the maximum value of the third frequency band; performing adaptive crosstalk cancellation processing on the second right channel sub-signal, overlapping the second right channel sub-signal to generate a fourth left channel sub-signal, performing adaptive crosstalk cancellation processing on the second left channel sub-signal, overlapping the second right channel sub-signal to generate a fourth right channel sub-signal, respectively performing delay processing on the first left channel sub-signal and the third left channel sub-signal, respectively, overlapping the first left channel sub-signal and the third left channel sub-signal to generate a first overlapping signal, respectively performing delay processing on the first right channel sub-signal and the third right channel sub-signal, respectively performing gain processing on the first right channel sub-signal and the third right channel sub-signal, respectively performing delay processing on the first left channel sub-signal and the third left channel sub-signal, respectively, and overlapping the first left channel sub-signal and the third left channel sub-signal to generate a second target overlapping signal; superposing the fourth left channel sub-signal, the first superposition signal and the first target superposition signal to generate a fourth left channel signal, and superposing the fourth right channel sub-signal, the second superposition signal and the second target superposition signal to generate a fourth right channel signal; performing down-mixing processing on the fourth left channel signal, the fourth right channel signal and the second human voice signal to obtain a fifth left channel signal and a fifth right channel signal; a first output audio signal generated from the fifth left channel signal and the fifth right channel signal is output.

The audio processing method can distinguish signals of the left and right channels of the stereo from the human voice signals, and can enable the signals of the left and right channels to obtain better fidelity and human voice definition. The audio rendering is carried out on the signals of the left channel and the right channel based on the azimuth information of the head of the user, and the audio rendering is carried out on the human voice signals, so that the sound image height sense can be improved. The signals of the middle frequency band of the signals of the left channel and the right channel are subjected to self-adaptive crosstalk elimination, so that the sound field can be widened. The audio processing method has the advantages that the signals of the left and right channels after a series of processing and the human voice signals are processed in a downmixing mode and output, the output first output audio signals can improve the sound image height sense, the width sense and the immersion sense of the audio on the basis of guaranteeing the tone quality of the audio, and the dessert position problem is restrained to a large extent, so that the user experience is improved.

In one implementation, before performing the downmix processing on the fourth left channel signal, the fourth right channel signal, and the second human voice signal to obtain the fifth left channel signal and the fifth right channel signal, the method further includes: and carrying out parameter equalization processing on the fourth left channel signal, the fourth right channel signal and the second human voice signal to obtain an equalized fourth left channel signal, an equalized fourth right channel signal and an equalized second human voice signal. By adopting the implementation mode, the tone quality of the left and right channels can be improved by carrying out parameter equalization processing on the left and right channel signals.

In one implementation, performing downmix processing on the fourth left channel signal, the fourth right channel signal, and the second human voice signal to obtain a fifth left channel signal and a fifth right channel signal, including: and performing down-mixing processing on the equalized fourth left channel signal, the equalized fourth right channel signal and the equalized second human voice signal to obtain a fifth left channel signal and a fifth right channel signal. By adopting the implementation mode, the tone quality of the left and right channels can be improved by carrying out parameter equalization processing on the left and right channel signals.

In one implementation, performing a human voice separation on a first left channel signal and a first right channel signal to obtain a first human voice signal, a second left channel signal, and a second right channel signal, including: and separating the first left channel signal and the first right channel signal based on a principal component analysis method or based on a neural network to obtain a first human sound signal, a second left channel signal and a second right channel signal. By adopting the implementation mode, the signals of the left and right channels and the voice signals are subjected to distinguishing processing, so that better voice definition can be obtained.

In one implementation, performing audio rendering processing on the second left channel signal, the second right channel signal, and the first human voice signal based on the first direction information to obtain a third left channel signal, a third right channel signal, and a second human voice signal, respectively, including: performing segmented convolution on the first human voice signal and the first height filter based on an overlap preservation method to obtain a second human voice signal, performing segmented convolution on the second left channel signal and the second height filter based on the overlap preservation method to obtain a third left channel signal, and performing segmented convolution on the second right channel signal and the third height filter based on the overlap preservation method to obtain a third right channel signal; the first height filter is a filter obtained according to the head height of the user and a first angle of the first human voice signal relative to the electronic device, the second height filter is a filter obtained according to the head height of the user and a second angle of the second left channel signal relative to the electronic device, and the third height filter is a filter obtained according to the head height of the user and a third angle of the second right channel signal relative to the electronic device. By adopting the implementation mode, the filter can be obtained based on the head azimuth information of the user, so that the filter can adjust different channel signals, and audio with sound image height sense, width sense and immersion sense is provided for the user. In this way, the dessert level problem can be effectively suppressed.

In one implementation, the first angle includes a horizontal angle and a pitch angle corresponding to a first head related transform function HRTF corresponding to the first vocal signal; the first height filter obtaining mode includes: acquiring first orientation information, wherein the first orientation information comprises a first distance between the center of the head of the user and the center of the electronic equipment, the head height of the user, the head size of the user and the head rotation angle of the user, and the left ear orientation of the user and the right ear orientation of the user can be determined according to the head size of the user; judging whether the first distance is equal to the measurement radius of a preset HRTF or not; if the first distance is larger or smaller than the measurement radius of the preset HRTF, correcting the preset HRTF according to the first distance, the left ear position of the user and the right ear position of the user to obtain the first HRTF; acquiring a head height of a user and a Head Related Impulse Response (HRIR) data set corresponding to a first angle, wherein the HRIR data set comprises at least one HRIR of a measurement user; carrying out frequency domain change on the HRIR data set to obtain an HRTF data set; acquiring a time-frequency domain average signal of an HRTF data set; performing frequency multiplication smoothing or envelope extraction on the time-frequency domain average signal to obtain an initial height filter; and performing inter-ear time difference (ITD) adjustment on the discretized form of the initial height filter to obtain a first height filter. With the implementation, a specific acquisition mode of the height filter is shown, so that the filter can adjust different channel signals, and audio with sound image height sense, width sense and immersion sense is provided for a user. In this way, the dessert level problem can be effectively suppressed.

In one implementation, the first HRTF includes a first left ear HRTF and a first right ear HRTF; if the first distance is greater than or less than the measurement radius of the preset HRTF, correcting the HRTF according to the first distance, the left ear position of the user, and the right ear position of the user to obtain a first HRTF, including: if the first distance is larger than the measurement radius of the preset HRTF, a first intersection point of a connecting line of the center of the electronic device and the left ear azimuth on the measurement area of the HRTF is obtained, and a second intersection point of a connecting line of the center of the electronic device and the right ear azimuth on the measurement area of the HRTF is obtained, wherein the first intersection point is used for determining a first horizontal azimuth angle and a first pitch angle, and the second intersection point is used for determining a second horizontal azimuth angle and a second pitch angle; the first left ear HRTF is determined according to the first horizontal azimuth and the first pitch angle, and the first right ear HRTF is determined according to the second horizontal azimuth and the second pitch angle. By adopting the implementation mode, a specific correction mode of the HRTF is shown, so that a filter corresponding to the voice signal can be obtained through the corrected HRTF, different channel signals can be regulated by the filter, and audio with sound image height sense, width sense and immersion sense is provided for a user. In this way, the dessert level problem can be effectively suppressed.

In one implementation, the first HRTF includes a first left ear HRTF and a first right ear HRTF; if the first distance is greater than or less than the measurement radius of the preset HRTF, correcting the HRTF according to the first distance, the left ear position of the user, and the right ear position of the user to obtain a first HRTF, including: if the first distance is smaller than the measurement radius of the preset HRTF, a first intersection point of a connecting line of the center of the electronic device and the left ear azimuth on the measurement area of the HRTF is obtained, and a second intersection point of a connecting line of the center of the electronic device and the right ear azimuth on the measurement area of the HRTF is obtained, wherein the first intersection point is used for determining a first horizontal azimuth angle and a first pitch angle, and the second intersection point is used for determining a second horizontal azimuth angle and a second pitch angle; the first right ear HRTF is determined according to the first horizontal azimuth and the first pitch angle, and the first left ear HRTF is determined according to the second horizontal azimuth and the second pitch angle. By adopting the implementation mode, a specific correction mode of the HRTF is shown, so that a filter corresponding to the voice signal can be obtained through the corrected HRTF, different channel signals can be regulated by the filter, and audio with sound image height sense, width sense and immersion sense is provided for a user. In this way, the dessert level problem can be effectively suppressed.

In one implementation, performing a content adaptive divide process on a third left channel signal to divide the third left channel signal into a first left channel sub-signal of a first frequency band, a second left channel sub-signal of a second frequency band, and a third left channel sub-signal of a third frequency band, includes: classifying the third left channel signal through a neural audio multi-classification network to obtain at least one audio object, wherein each audio object corresponds to a sound probability which is used for determining the classification accuracy of the audio object; comparing each sound probability with a preset sound probability threshold; if at least two sound probabilities are larger than a sound probability threshold, acquiring a first target audio object corresponding to the two sound probabilities with the sound probability value at the front, or if one sound probability is larger than the sound probability threshold, acquiring a second target audio object corresponding to the sound probability; determining a first frequency division point and a second frequency division point based on the first target audio object or the second target audio object, wherein the frequency value of the first frequency division point is larger than that of the second frequency division point; determining a frequency range larger than a first frequency division point as a first frequency band, determining a frequency range smaller than or equal to the first frequency division point and larger than a second frequency division point as a second frequency band, and determining a frequency range smaller than or equal to the second frequency division point as a third frequency band; the sound signal of the third left channel signal with the frequency in the first frequency band is determined as a first left channel sub-signal, the sound signal of the third left channel signal with the frequency in the second frequency band is determined as a second left channel sub-signal, and the sound signal of the third left channel signal with the frequency in the third frequency band is determined as a third left channel sub-signal. By adopting the implementation mode, a specific mode of content self-adaptive frequency division is shown, so that the intermediate frequency signals of the left channel and the right channel can be obtained, the self-adaptive crosstalk elimination is carried out on the subsequent intermediate frequency signals, and the effect of widening a sound field is achieved.

In one implementation, determining a first crossover point and a second crossover point based on a first target audio object includes: judging whether a first target audio object contains a preset first concerned object or not; if the first target audio object contains a first attention object, determining the maximum value of the frequency range of the first attention object as a first frequency division point, and determining the minimum value of the frequency range of the first attention object as a second frequency division point; if the first target audio object does not contain the first concerned object, acquiring a first frequency range of the first target audio object, and acquiring a second frequency range of the second first target audio object; if the first frequency range includes the second frequency range, determining a maximum value of the first frequency range as a first frequency division point, and determining a minimum value of the first frequency range as a second frequency division point; if the second frequency range includes the first frequency range, determining a maximum value of the second frequency range as a first frequency division point, and determining a minimum value of the second frequency range as a second frequency division point; if the first frequency range and the second frequency range have a first intersection, determining a maximum value of the first intersection as a first frequency division point, and determining a minimum value of the first intersection as a second frequency division point; if the first frequency range and the second frequency range are not intersected, and the minimum value of the first frequency range is larger than the maximum value of the second frequency range, determining the minimum value of the first frequency range as a first frequency division point, and determining the maximum value of the second frequency range as a second frequency division point; if the first frequency range and the second frequency range have no intersection, and the minimum value of the second frequency range is larger than the maximum value of the first frequency range, the minimum value of the second frequency range is determined as a first frequency division point, and the maximum value of the first frequency range is determined as a second frequency division point. By adopting the implementation mode, a specific mode of frequency division is shown, so that the intermediate frequency signals of the left channel and the right channel can be obtained, the adaptive crosstalk elimination is carried out on the subsequent intermediate frequency signals, and the effect of widening the sound field is achieved.

In one implementation, determining the first crossover point and the second crossover point based on the second target audio object includes: the maximum value of the frequency range of the second target audio object is determined as the first crossover point, and the minimum value of the frequency range of the second target audio object is determined as the second crossover point. By adopting the implementation mode, a specific mode of content self-adaptive frequency division is shown, so that the intermediate frequency signals of the left channel and the right channel can be obtained, the self-adaptive crosstalk elimination is carried out on the subsequent intermediate frequency signals, and the effect of widening a sound field is achieved.

In one implementation, performing adaptive crosstalk cancellation processing on a second right channel sub-signal, superimposing the second right channel sub-signal on the second left channel sub-signal, generating a fourth left channel sub-signal, and performing adaptive crosstalk cancellation processing on the second left channel sub-signal, superimposing the second left channel sub-signal on the second right channel sub-signal, generating the fourth right channel sub-signal, includes: and sequentially performing inversion processing, attenuation processing, delay processing and gain processing on the second right channel sub-signal, superposing the second right channel sub-signal on the second left channel sub-signal to generate a fourth left channel sub-signal, and sequentially performing inversion processing, attenuation processing, delay processing and gain processing on the second left channel sub-signal, superposing the second left channel sub-signal on the second left channel sub-signal to generate the fourth right channel sub-signal. By adopting the implementation mode, a specific mode of self-adaptive crosstalk elimination processing is shown, so that self-adaptive crosstalk elimination can be carried out on the medium-frequency band signals, and the effect of widening a sound field is achieved. Meanwhile, the high-frequency channel signal and the low-frequency channel signal with low user perceptibility are overlapped, so that the texture of the low-frequency channel signal and the tone of the high-frequency channel signal can be ensured.

In one implementation, before performing audio rendering processing on the second left channel signal, the second right channel signal, and the first human voice signal based on the first direction information to obtain a third left channel signal, a third right channel signal, and a second human voice signal, the method further includes: the method comprises the steps that first orientation information is obtained based on an image acquisition device of electronic equipment, or the first orientation information is obtained based on user-defined content filled by a user. By adopting the implementation mode, the application applies the UI interface with the head-out movement tracking function for the first time, and can provide better hearing experience based on the direction of the user.

In one implementation, an image capturing device based on an electronic device obtains first orientation information, including: the electronic equipment displays a first interface, wherein the first interface comprises a first control, and the first control is used for starting or closing a personalized setting function in a playback audio mode; responding to the opening operation of the user on the first control, displaying a second interface on the electronic equipment, and displaying the second control on the second interface, wherein the second control is used for opening or closing a head tracking function in a playback audio mode; and responding to the opening operation of the user on the second control, the electronic equipment starts the image acquisition device to acquire the first orientation information. By adopting the implementation mode, the application applies the UI interface with the head-out movement tracking function for the first time, and can provide better hearing experience based on the direction of the user.

In one implementation, obtaining first orientation information based on user-filled custom content includes: in response to a user closing operation of the second control, the electronic device displays at least one filling item in an editable state, wherein the filling item is used for filling the first orientation information. By adopting the implementation mode, the application applies the UI interface with the head-out movement tracking function for the first time, and can provide better hearing experience based on the direction of the user.

In a second aspect, the present application provides an audio processing method applied to an electronic device, where the electronic device is provided with at least one speaker, the method comprising: acquiring a second audio signal, wherein the second audio signal comprises a sixth left channel signal, a sixth right channel signal, a first left surround sound signal, a first right surround sound signal and a first center voice signal; respectively performing audio rendering processing on the sixth left channel signal, the sixth right channel signal, the first left surround sound signal, the first right surround sound signal and the first middle-set human voice signal based on second azimuth information to obtain a seventh left channel signal, a seventh right channel signal, a second left surround sound signal, a second right surround sound signal and a second middle-set human voice signal, wherein the second azimuth information comprises azimuth information of a user head relative to a loudspeaker; performing content adaptive frequency division processing on the seventh left channel signal, the seventh right channel signal, the second left surround sound signal, and the second right surround sound signal, respectively, to divide the seventh left channel signal into a fifth left channel sub-signal of the first frequency band, a sixth left channel sub-signal of the second frequency band, and a seventh left channel sub-signal of the third frequency band, and divide the seventh right channel signal into a fifth right channel sub-signal of the first frequency band, a sixth right channel sub-signal of the second frequency band, and a seventh right channel sub-signal of the third frequency band, and divide the second left surround sound signal into a first left surround sound sub-signal of the first frequency band, a second left surround sound sub-signal of the second frequency band, and a third left surround sound sub-signal of the third frequency band, and divide the second right surround sound signal into a first right surround sound sub-signal of the first frequency band, a second right surround sound sub-signal of the second frequency band, and a third right surround sound sub-signal of the third frequency band, wherein a minimum value of the first frequency band is greater than or equal to a maximum value of the second frequency band or greater than or equal to a maximum value of the third frequency band; performing adaptive crosstalk cancellation processing on the sixth right channel sub-signal, overlapping the sixth right channel sub-signal to generate an eighth left channel sub-signal, performing adaptive crosstalk cancellation processing on the sixth left channel sub-signal, overlapping the sixth right channel sub-signal to generate an eighth right channel sub-signal, performing delay processing on the fifth left channel sub-signal and the seventh left channel sub-signal respectively, overlapping the fifth left channel sub-signal and the seventh right channel sub-signal respectively to generate a third overlapping signal, performing delay processing on the fifth right channel sub-signal and the seventh right channel sub-signal respectively to generate a fourth overlapping signal, performing delay processing on the fifth right channel sub-signal and the seventh right channel sub-signal respectively, performing gain processing on the fifth right channel sub-signal and the seventh right channel sub-signal respectively, overlapping the fifth left channel sub-signal and the seventh left channel sub-signal respectively to generate a third target overlapping signal, and after gain processing, generating a fourth target superimposed signal by superimposing, performing adaptive crosstalk cancellation processing on the second right surround sound signal, superimposing the fourth target superimposed signal on the second left surround sound signal, generating a fourth left surround sound signal, performing adaptive crosstalk cancellation processing on the second left surround sound signal, superimposing the fourth right surround sound signal, generating a first surround sound superimposed signal by superimposing the first left surround sound signal and the third left surround sound signal after delay processing, generating a second surround sound superimposed signal by superimposing the first right surround sound signal and the third right surround sound signal after delay processing, and performing delay processing on the first right surround sound signal and the third right surround sound signal after delay processing, superposing to generate a first target surround sound superposition signal, respectively carrying out delay processing on the first left surround sound sub-signal and the third left surround sound sub-signal, respectively carrying out gain processing on the first left surround sound sub-signal and the third left surround sound sub-signal, and superposing to generate a second target surround sound superposition signal; superposing the eighth left channel sub-signal, the third superposition signal and the third target superposition signal to generate an eighth left channel signal, superposing the eighth right channel sub-signal, the fourth superposition signal and the fourth target superposition signal to generate an eighth right channel signal, superposing the fourth left surround sound sub-signal, the first surround sound superposition signal and the first target surround sound superposition signal to generate a fourth left surround sound signal, and superposing the fourth right surround sound sub-signal, the second surround sound superposition signal and the second target surround sound superposition signal to generate a fourth right surround sound signal; performing down-mixing processing on the eighth left channel signal, the eighth right channel signal, the fourth left surround sound signal, the fourth right surround sound signal and the second center human voice signal to obtain a ninth left channel signal and a ninth right channel signal; a second output audio signal generated from the ninth left channel signal and the ninth right channel signal is output.

The audio processing method can process the channel signals of the 5.1 channels, perform audio rendering on the signals of the left and right channels and the left and right surrounding channels based on the azimuth information of the head of the user, and perform audio rendering on the human voice signals, so that the sound image height sense can be improved. The signals of the middle frequency band of the signals of the left channel, the right channel and the left surrounding channel are subjected to self-adaptive crosstalk elimination, so that the sound field can be widened. The audio signal processing method has the advantages that the signals of the left and right channels and the left and right surrounding channels after a series of processing and the human sound signals are subjected to downmixing processing and output in stereo, the output second output audio signals can improve the sound image height sense, the width sense and the immersion sense of the audio on the basis of guaranteeing the tone quality of the audio, and the dessert position problem is restrained to a large extent, so that the user experience is improved.

In one implementation, before the second audio signal is acquired, further comprising: acquiring a third audio signal, wherein the third audio signal comprises a first left front surround sound signal, a first right front surround sound signal, a third center human sound signal, a first left surround sound signal, a first right surround sound signal, a first left rear surround sound signal and a first right rear surround sound signal; acquiring a second audio signal, comprising: and performing down-mixing processing on the first left front surround sound signal, the first right front surround sound signal, the third middle-set human sound signal, the first left side surround sound signal, the first right side surround sound signal, the first left rear surround sound signal and the first right rear surround sound signal to obtain a sixth left channel signal, a sixth right channel signal, the first left surround sound signal, the first right surround sound signal and the first middle-set human sound signal. By adopting the implementation mode, the channel signals of the 7.1 channels can be processed and downmixed to the 5.1 channels, the signals of the left and right channels and the left and right surrounding channels are subjected to audio rendering based on the azimuth information of the head of the user, and the human voice signals are subjected to audio rendering, so that the sound image height sense can be improved. The signals of the middle frequency band of the signals of the left channel, the right channel and the left surrounding channel are subjected to self-adaptive crosstalk elimination, so that the sound field can be widened. The audio processing method has the advantages that the signals of the left and right channels and the left and right surrounding channels after a series of processing and the human voice signals are processed in a downmixing mode and output, the output second output audio signals can improve the sound image height sense, the width sense and the immersion sense of the audio on the basis of guaranteeing the tone quality of the audio, the dessert position problem is restrained to a large extent, and the user experience is improved.

In one implementation, before the second audio signal is acquired, further comprising: acquiring a fourth audio signal, wherein the fourth audio signal comprises a second left front surround sound signal, a second right front surround sound signal, a fourth middle-set human sound signal, a second left surround sound signal, a second right surround sound signal, a second left rear surround sound signal, a second right rear surround sound signal, a first front longitudinal sound signal and a second front longitudinal sound signal; acquiring a second audio signal, comprising: and performing down-mixing processing on the second left front surround sound signal, the second right front surround sound signal, the fourth middle-set human sound signal, the second left side surround sound signal, the second right side surround sound signal, the second left rear surround sound signal, the second right rear surround sound signal, the first front longitudinal sound signal and the second front longitudinal sound signal to obtain a sixth left channel signal, a sixth right channel signal, a first left surround sound signal, a first right surround sound signal and a first middle-set human sound signal. By adopting the implementation mode, the channel signals of the 9.1 channels can be processed, the channel signals are downmixed to the 5.1 channels, the signals of the left and right channels and the left and right surrounding channels are subjected to audio rendering based on the azimuth information of the head of the user, and the human voice signals are subjected to audio rendering, so that the sound image height sense can be improved. The signals of the middle frequency band of the signals of the left channel, the right channel and the left surrounding channel are subjected to self-adaptive crosstalk elimination, so that the sound field can be widened. The audio processing method has the advantages that the signals of the left and right channels and the left and right surrounding channels after a series of processing and the human voice signals are processed in a downmixing mode and output, the output second output audio signals can improve the sound image height sense, the width sense and the immersion sense of the audio on the basis of guaranteeing the tone quality of the audio, the dessert position problem is restrained to a large extent, and the user experience is improved.

In a third aspect, the present application provides an audio processing apparatus comprising: the audio input module is used for acquiring a first audio signal, and the first audio signal comprises a first left channel signal and a first right channel signal; the voice extraction module is used for performing voice separation on a first left channel signal and a first right channel signal to obtain a first voice signal, a second left channel signal and a second right channel signal, wherein the first voice signal comprises a first voice sub-signal separated from the first left channel signal and a second voice sub-signal separated from the first right channel signal, the second left channel signal is a channel signal after the first voice sub-signal is separated from the first left channel signal, and the second right channel signal is a channel signal after the second voice sub-signal is separated from the first right channel signal; the spatial audio rendering module is used for respectively performing audio rendering processing on the second left channel signal, the second right channel signal and the first human voice signal based on the first direction information to obtain a third left channel signal, a third right channel signal and the second human voice signal, wherein the first direction information comprises the direction information of the head of the user relative to the loudspeaker; the content adaptive frequency division module is used for respectively carrying out content adaptive frequency division processing on the third left channel signal and the third right channel signal so as to divide the third left channel signal into a first left channel sub-signal of a first frequency band, a second left channel sub-signal of a second frequency band and a third left channel sub-signal of the third frequency band, and divide the third right channel signal into a first right channel sub-signal of the first frequency band, a second right channel sub-signal of the second frequency band and a third right channel sub-signal of the third frequency band, wherein the minimum value of the first frequency band is larger than or equal to the maximum value of the second frequency band, and the minimum value of the second frequency band is larger than or equal to the maximum value of the third frequency band; the adaptive crosstalk cancellation module is configured to perform adaptive crosstalk cancellation processing on the second right channel sub-signal, superimpose the second right channel sub-signal on the second left channel sub-signal to generate a fourth left channel sub-signal, perform adaptive crosstalk cancellation processing on the second left channel sub-signal on the second right channel sub-signal to generate a fourth right channel sub-signal, and superimpose the first left channel sub-signal and the third left channel sub-signal after performing delay processing respectively on the first left channel sub-signal and the third left channel sub-signal, and superimpose the first right channel sub-signal and the third right channel sub-signal after performing delay processing respectively on the first right channel sub-signal and the third right channel sub-signal to generate a second superimposed signal, and perform delay processing respectively on the first left channel sub-signal and the third left channel sub-signal and perform gain processing respectively on the first left channel sub-signal and the third left channel sub-signal to generate a second target superimposed signal; superposing the fourth left channel sub-signal, the first superposition signal and the first target superposition signal to generate a fourth left channel signal, and superposing the fourth right channel sub-signal, the second superposition signal and the second target superposition signal to generate a fourth right channel signal; the down-mixing module is used for performing down-mixing processing on the fourth left channel signal, the fourth right channel signal and the second human voice signal to obtain a fifth left channel signal and a fifth right channel signal; and the audio output module is used for outputting a first output audio signal generated according to the fifth left channel signal and the fifth right channel signal.

According to the audio processing device, after the audio input module acquires the first audio signal, the voice extraction module carries out distinguishing processing on the signals of the left and right channels and the voice signal so as to obtain better fidelity and voice definition. The spatial audio rendering module renders signals of left and right channels based on azimuth information of the head of the user, and renders human voice signals, so that the sense of sound image height can be improved. The content self-adaptive frequency division module obtains signals of left and right channels to carry out content self-adaptive frequency division, and the self-adaptive crosstalk elimination module can carry out self-adaptive crosstalk elimination on intermediate frequency signals obtained by the content self-adaptive frequency division so as to widen a sound field. The down-mixing module performs down-mixing processing on signals of the left channel and the right channel and the human voice signals and sends the processed signals to the audio output module, so that the audio output module outputs a first output audio signal. Therefore, the embodiment of the application can improve the sound image height sense, the width sense and the immersion sense of the audio on the basis of ensuring the tone quality of the audio, can inhibit the dessert position problem to a greater extent, and can improve the user experience.

In one implementation, the method further comprises: the parameter equalization module is used for performing parameter equalization processing on the first left channel sub-signal, the third left channel sub-signal and the fourth left channel sub-signal to obtain an equalized first left channel sub-signal, an equalized third left channel sub-signal and an equalized fourth left channel sub-signal, and performing parameter equalization processing on the first right channel sub-signal, the third right channel sub-signal and the fourth right channel sub-signal to obtain an equalized first right channel sub-signal, an equalized third right channel sub-signal and an equalized fourth right channel sub-signal. By adopting the implementation mode, the parameter equalization module can perform parameter equalization processing on the left and right channel signals, and can improve tone quality of the left and right channels.

In a fourth aspect, the present application provides an electronic device, comprising: a processor and a memory; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the audio processing method in any of the implementations of the first and second aspects.

In a fifth aspect, the present application also provides a computer readable storage medium having instructions stored therein, which when executed on an electronic device, cause the electronic device to perform the audio processing method in any implementation manner of the first and second aspects.

In a sixth aspect, the present application also provides a computer program product for, when run on an electronic device, causing the electronic device to perform the audio processing method of any one of the implementations of the first and second aspects.

It will be appreciated that the electronic device of the second aspect, the computer storage medium of the third aspect, and the computer program product of the fourth aspect provided above are all configured to perform the corresponding methods provided above, and therefore, the advantages achieved by the electronic device of the second aspect may refer to the advantages of the corresponding methods provided above, which are not described herein.

Drawings

FIG. 1 is a schematic diagram of a speaker playback audio scenario;

FIG. 2 is a schematic diagram of a hardware structure according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an outer-air audio link provided by an embodiment of the present application;

FIG. 4 is a first flowchart of an audio processing method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a neural network for separating a first human voice signal provided by an embodiment of the present application;

FIG. 6A is a schematic diagram of a first interface for an electronic device to obtain user head orientation information according to an embodiment of the present application;

FIG. 6B is a schematic diagram of a second interface for the electronic device to obtain the head orientation information of the user according to the embodiment of the present application;

FIG. 7A is a third interface diagram of an electronic device for obtaining user head orientation information according to an embodiment of the present application;

FIG. 7B is a fourth interface diagram of an electronic device for obtaining user head orientation information according to an embodiment of the present application;

fig. 8 is a schematic diagram of HRTF correction mode according to an embodiment of the present application;

FIG. 9 is a schematic diagram of left channel spectral envelope extraction provided by an embodiment of the present application;

fig. 10 is a schematic diagram of right channel spectral envelope extraction according to an embodiment of the present application;

FIG. 11 is a schematic view of a head rotation angle provided by an embodiment of the present application;

FIG. 12 is a schematic diagram of a neural audio multi-classification network classification process according to an embodiment of the present application;

fig. 13 is a schematic diagram of an adaptive crosstalk cancellation process according to an embodiment of the present application;

FIG. 14 is a second flowchart of an audio processing method according to an embodiment of the present application;

fig. 15 is a schematic view of a scene of audio playback of an electronic device according to an embodiment of the present application;

fig. 16 is a schematic software structure of an audio processing device according to an embodiment of the present application;

fig. 17 is a schematic hardware structure of an audio processing apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application. It will be apparent that the described embodiments are some, but not all, embodiments of the application. Based on the embodiments of the present application, other embodiments that may be obtained by those of ordinary skill in the art without making any inventive effort are within the scope of the present application.

The terms first, second, third and the like in the description and in the claims, are used for distinguishing between different objects and not for limiting the specified sequence. In the description of the present application, unless otherwise indicated, the meaning of "at least one" is one, two and more than two.

In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

The terminology used in the description of the embodiments of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application, as will be described in detail with reference to the accompanying drawings.

In order to facilitate the technical solution of the embodiments of the present application to be understood by the skilled person, technical terms related to the embodiments of the present application are explained below.

Sound channels (Sound channels) refer to mutually independent audio signals that are collected or played back at different control positions when Sound is recorded or played. The number of channels refers to the number of sound sources at the time of sound recording or the corresponding number of speakers at the time of playback.

Based on the difference in the number of channels, audio may have the following categories.

1. Stereophonic sound means that sound is distributed to two independent channels during recording, thereby achieving a better sound localization effect.

2. The 5.1 channels, 5.1 channels refer to center channel, front left channel, front right channel, rear left surround channel, rear right surround channel, and subwoofer channel.

3. The 7.1 channels, the 7.1 channels refer to five actual channels, two virtual channels and one subwoofer channel, wherein the five actual channels include: the two virtual channels include a left surround channel and a right surround channel which are allocated from five actual channels.

4. The 9.1 channel and the 9.1 channel are formed by adding two front longitudinal speakers on the basis of the 7.1 channel.

The audio may also have more categorization patterns in the present application, which is only illustrative of the types of audio described above.

Among the above-mentioned types of audio, the audio corresponding to the 5.1 channel, the 7.1 channel and the 9.1 channel is actually spatial audio, and the spatial audio is audio with three-dimensional quality, and compared with stereo sound, the audio has the effects of widening sound image, improving sound layering and distance feeling, and enabling a user to obtain extremely good audio-video experience.

With the continuous development of electronic technology, electronic devices with mobile convenience features are rapidly popularized, for example: notebook computers, tablet computers, smart phones, etc., which are usually small, light and thin, the layout of internal devices of these electronic devices is often limited. In this way, in the process of playing audio through the loudspeaker of the electronic equipment, the playing audio of the electronic equipment has the problems of thin sound, narrower sound field, lower height and insufficient immersion feeling due to the influence of the layout of the loudspeaker and the limitation of devices.

Fig. 1 is a schematic view of a speaker playback audio scene.

Taking a notebook computer as an example of an electronic device as shown in fig. 1, the notebook computer is generally provided with two speakers, and as shown in fig. 1 (a), one speaker may be disposed in an area a at the upper left corner of the keyboard, and the other speaker may be disposed in an area B at the upper right corner of the keyboard. In this arrangement, as can be seen from the top view of the notebook computer shown in fig. 1 (b) and the side view of the notebook computer shown in fig. 1 (C), in the process of playing audio through the speaker, the sound field in the C region perceived by the user has the problems of narrower sound field and lower height.

In order to solve the above problems, an embodiment of the present application provides an audio processing method.

The audio processing method provided by the embodiment of the application can be applied to electronic equipment, wherein the types of the electronic equipment include, but are not limited to, mobile phones, tablet computers, notebook computers, large-screen equipment (such as intelligent televisions and intelligent screens), personal computers (personal computers, PCs), handheld computers, netbooks, personal digital assistants (Personal Digital Assistant, PDAs), wearable electronic equipment, vehicle-mounted equipment, virtual reality equipment and the like.

Fig. 2 is a schematic hardware structure of an electronic device according to an embodiment of the present application.

As shown in fig. 2, the electronic device 100 may include a processor 110, a memory 120, a universal serial bus (Universal Serial Bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, a camera 192, a display 193, and a subscriber identity module (Subscriber Identification Module, SIM) card interface 194, etc. The sensor module 180 may include a touch sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a geomagnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, and the like. Among them, the gyro sensor 180B, the air pressure sensor 180C, the geomagnetic sensor 180D, the acceleration sensor 180E, and the like can be used to detect a motion state of an electronic apparatus, and thus, may also be referred to as a motion sensor.

It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (Application Processor, AP), a modem processor, a graphics processor (Graphics Processing Unit, GPU), an image signal processor (Image Signal Processor, ISP), a controller, a video codec, a digital signal processor (Digital Signal Processor, DSP), a baseband processor, and/or a Neural network processor (Neural-network Processing Unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.

Memory 120 may be used to store computer-executable program code that includes instructions. The memory 120 may include a stored program area and a stored data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the memory 120 may include a high-speed random access memory, and may also include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (Universal Flash Storage, UFS), and the like. The processor 110 performs various functional applications and data processing of the electronic device 100 by executing instructions stored in the memory 120 and/or instructions stored in a memory provided in the processor.

The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transfer data between the electronic device 100 and a peripheral device. And can also be used for connecting with a headset, and playing audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices, etc.

It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and is not meant to limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also employ different interfacing manners in the above embodiments, or a combination of multiple interfacing manners.

The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.

The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the memory 120, the display 193, the camera 192, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor battery capacity, battery cycle number, battery health (leakage, impedance) and other parameters. In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may be disposed in the same device.

The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.

The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.

The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (Low Noise Amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.

The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 193. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.

The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (Wireless Local Area Networks, WLAN) (e.g., wireless fidelity (Wireless Fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (Global Navigation Satellite System, GNSS), frequency modulation (Frequency Modulation, FM), near field wireless communication technology (Near Field Communication, NFC), infrared technology (IR), etc., as applied to the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.

In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques.

The electronic device 100 implements display functions through a GPU, a display screen 193, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 193 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.

The display 193 is used to display images, videos, and the like. The display 193 includes a display panel. The display panel may employ a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-Emitting Diode (OLED), an Active-matrix Organic Light-Emitting Diode (AMOLED) or an Active-matrix Organic Light-Emitting Diode (Matrix Organic Light Emitting Diode), a flexible Light-Emitting Diode (Flex), a mini, micro-OLED, a quantum dot Light-Emitting Diode (Quantum dot Light Emitting Diodes, QLED), or the like.

The electronic device 100 may implement photographing functions through an ISP, a camera 192, a video codec, a GPU, a display screen 193, an application processor, and the like.

The ISP is used to process the data fed back by the camera 192. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be located in the camera 192.

The camera 192 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, RYYB, YUV, or the like format. In some embodiments, the electronic device 100 may include 1 or N cameras 192, N being a positive integer greater than 1.

The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.

Wherein the number of speakers 170A may be set to at least one.

The touch sensor 180A, also referred to as a "touch device". The touch sensor 180A may be disposed on the display 193, and the touch sensor 180A and the display 193 form a touch screen, which is also referred to as a "touch screen". The touch sensor 180A is used to detect a touch operation acting thereon or thereabout. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display 193. In other embodiments, the touch sensor 180A may also be disposed on a surface of the electronic device 100 at a location different from the location of the display 193.

The gyro sensor 180B may be used to determine a motion gesture of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., x, y, and z axes) may be determined by gyro sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance to be compensated by the lens module according to the angle, and makes the lens counteract the shake of the electronic device 100 through the reverse motion, so as to realize anti-shake. The gyro sensor 180B may also be used for navigating, somatosensory game scenes.

The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude from barometric pressure values measured by barometric pressure sensor 180C, aiding in positioning and navigation.

The geomagnetic sensor 180D includes a hall sensor. The electronic device 100 may detect the opening and closing of the flip cover using the geomagnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip machine, the electronic device 100 may detect the opening and closing of the flip according to the geomagnetic sensor 180D. And then according to the detected opening and closing state of the leather sheath or the opening and closing state of the flip, the characteristics of automatic unlocking of the flip and the like are set.

The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity may be detected when the electronic device 100 is stationary. The electronic equipment gesture recognition method can also be used for recognizing the gesture of the electronic equipment, and is applied to horizontal and vertical screen switching, pedometers and other applications.

A distance sensor 180F for measuring a distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, the electronic device 100 may range using the distance sensor 180F to achieve quick focus.

The proximity light sensor 180G may include, for example, a light emitting diode and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode. The electronic device 100 emits infrared light outward through the light emitting diode. The electronic device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it may be determined that there is an object in the vicinity of the electronic device 100. When insufficient reflected light is detected, the electronic device 100 may determine that there is no object in the vicinity of the electronic device 100. The electronic device 100 can detect that the user holds the electronic device 100 close to the ear by using the proximity light sensor 180G, so as to automatically extinguish the screen for the purpose of saving power. The proximity light sensor 180G may also be used in holster mode, pocket mode to automatically unlock and lock the screen.

The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 may utilize the collected fingerprint feature to unlock the fingerprint, access the application lock, photograph the fingerprint, answer the incoming call, etc.

The temperature sensor 180J is for detecting temperature. In some embodiments, the electronic device 100 performs a temperature processing strategy using the temperature detected by the temperature sensor 180J. For example, when the temperature reported by temperature sensor 180J exceeds a threshold, electronic device 100 performs a reduction in the performance of a processor located in the vicinity of temperature sensor 180J in order to reduce power consumption to implement thermal protection. In other embodiments, when the temperature is below another threshold, the electronic device 100 heats the battery 142 to avoid the low temperature causing the electronic device 100 to be abnormally shut down. In other embodiments, when the temperature is below a further threshold, the electronic device 100 performs boosting of the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperatures.

The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The electronic device 100 may receive key inputs, generating key signal inputs related to user settings and function controls of the electronic device 100.

The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback. For example, touch operations acting on different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 191 may also correspond to different vibration feedback effects by touch operations applied to different areas of the display screen 193. Different application scenarios (such as time reminding, receiving information, alarm clock, game, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.

The SIM card interface 194 is used to connect to a SIM card. The SIM card may be inserted into the SIM card interface 194, or removed from the SIM card interface 194 to enable contact and separation with the electronic device 100. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 194 may support a Nano SIM card, micro SIM card, etc. The same SIM card interface 194 may be used to insert multiple cards simultaneously. The types of the plurality of cards may be the same or different. The SIM card interface 194 may also be compatible with different types of SIM cards. The SIM card interface 194 may also be compatible with external memory cards. The electronic device 100 interacts with the network through the SIM card to realize functions such as communication and data communication. In some embodiments, the electronic device 100 employs esims, i.e.: an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.

It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, the electronic device may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The audio processing method provided by the embodiment of the application can be applied to a scene that the electronic equipment is in a play audio mode and the space audio is started.

When the electronic equipment is not connected with the audio output equipment such as the earphone, the electronic equipment can be in a loudspeaker external playing mode by default. When the electronic equipment is connected with audio output equipment such as a headset and the like, the audio of the headset output mode can be changed into an audio-out mode.

The embodiment of the application provides a specific mode for changing the output mode of the earphone into the play audio mode of the electronic equipment.

Taking the electronic device as a notebook computer with a Windows operating system as an example, a user can click on a speaker icon in a task bar of a main screen page, the notebook computer displays a volume bar in response to clicking operation of the speaker icon by the user, and displays a device identifier corresponding to a current audio output device on one side of the volume bar, and the notebook computer can display a sound output interface in response to clicking operation of the device identifier by the user, and display output device options on the sound output interface, wherein the output device options can comprise speaker options. The user can click on the speaker option, and the notebook computer is in the play audio mode in response to the click operation of the speaker option by the user.

It should be noted that, in the embodiment of the present application, the specific manner of enabling the electronic device to be in the audio playback mode is only used for illustrative description, and other manners of enabling the electronic device to be in the audio playback mode may be adopted in the present application, which is not described herein.

The embodiment of the application provides a specific mode for opening spatial audio of electronic equipment.

Taking the electronic device as a notebook computer with a Windows operating system as an example, a user can press a combined key of "WIN" and "I" on a keyboard, and the notebook computer starts a setting interface in response to the pressing operation of the combined key of "WIN" and "I" by the user. The setting interface includes a sound option, the user can click on the sound option, and the notebook computer displays the sound interface in response to clicking operation on the sound option with the user. The sound interface comprises an output interface, wherein the output interface comprises a loudspeaker option, a user can click on the loudspeaker option, the notebook computer displays an attribute interface in response to clicking operation of the loudspeaker option by the user, the attribute interface comprises a spatial audio option, an opening/closing option is arranged in the spatial audio option, the user can select the opening option, and the notebook computer opens spatial audio in response to selecting operation of the opening option by the user.

It should be noted that, in the embodiment of the present application, a specific manner of opening spatial audio of an electronic device is only used for performing an exemplary illustration, and other manners of opening spatial audio may also be adopted in the present application, which is not described herein.

After the electronic device is in the play-out audio mode and the spatial audio is started, the audio processing method provided by the embodiment of the application can be realized based on the play-out spatial audio link.

Fig. 3 is a schematic diagram of an outer space audio link according to an embodiment of the present application.

As shown in fig. 3, the outer spatial audio link may include, for example, an audio input unit 1, a digital signal processing unit (Digital Signal Processing, DSP) 2, a digital-to-analog conversion unit (Digital to Analog Converter, DAC) 3, and an audio output unit 4, etc.

Wherein the audio input unit 1 may be used for inputting audio of different audio formats, for example, MP3, PCM, WAV, etc. The DSP2 may be used to perform digital signal processing on audio, where the digital signal processing includes multiple processing methods, for example, first, to increase speaker power, so that sound can be delayed, with better stereo effect. Second, the sound output of a partial frequency band is increased. Because of the problem that the sound of the middle frequency band or the low frequency band may not be prominent in part of the audio, the DSP2 can increase the sound output of the frequency bands so as to improve the hearing effect of the user on the sound of the frequency bands. Third, the audio is processed through various algorithms such as transformation, modulation and demodulation, so that the sound is clearer, and the hearing comfort of the user is improved. The embodiment of the application only exemplifies the digital signal processing mode of the DSP 2. The DSP2 may also have other digital signal processing manners, which are not described in detail in the embodiment of the present application. The digital signal processing mode related to the audio processing method provided by the application concretely refers to the following embodiment. The DSP2 sends the processed digital signal to the DAC3. The DAC3 may convert the processed digital signal into an analog signal, and transmit the analog signal to the audio output unit 4, and the audio output unit 4 may output audio.

It should be noted that, the outer spatial audio link shown in the embodiment of the present application is only for exemplary illustration, and the outer spatial audio link may specifically include more or less units, which is not described in detail in the embodiment of the present application.

The following describes exemplary steps of the audio processing method according to the embodiment of the present application.

Fig. 4 is a first flowchart of an audio processing method according to an embodiment of the present application.

As shown in fig. 4, in one implementation, the method may include the following steps S101-S108.

In step S101, a first audio signal is acquired, the first audio signal including a first left channel signal and a first right channel signal.

The electronic device may obtain the first audio signal from a file having a different audio/video format, and the first audio signal may be a stereo signal. For example, the audio/video format may include MP3, MP4, PCM, WAV, etc., and the embodiment of the present application does not limit the audio/video format.

The stereo signal may transmit two signals, including a first left channel signal, which may use a channel of a left speaker, and a first right channel signal, which may use a channel of a right speaker, which may use two different channels.

Step S102, performing human voice separation on the first left channel signal and the first right channel signal to obtain a first human voice signal, a second left channel signal and a second right channel signal.

The first human voice signal comprises a first human voice sub-signal separated from a first left channel signal and a second human voice sub-signal separated from a first right channel signal, the second left channel signal is a channel signal after the first human voice sub-signal is separated from the first left channel signal, and the second right channel signal is a channel signal after the second human voice sub-signal is separated from the first right channel signal.

In one implementation, the electronic device may extract the first human voice signal using a correlation of the first left channel signal and the first right channel signal.

Wherein separating the first sub-human voice signal from the first left channel signal generally serves to cancel human voice in the first left channel signal and separating the second sub-human voice signal from the first right channel signal generally serves to cancel human voice in the first right channel signal.

In particular implementations, the electronic device may separate the first human voice signal for the first left channel signal and the first right channel signal using a principal component analysis (Principal Component Analysis, PCA) method.

The method may include the following steps S1021-S1025.

Step S1021, based on the fourier transform, models the first left channel signal as a first time-frequency sequence and models the first right channel signal as a second time-frequency sequence.

Specifically, the time-frequency sequence is established according to the following formula 1.

Equation 1:

。

wherein, the liquid crystal display device comprises a liquid crystal display device,is the characteristic vector expression form of the channel signal, m is the number of channels, k is the frequency index, ++>Is a time index.

Since the number of channels of the first audio signal is 2, it can be corresponded with m=1Representing the first left channel signal with m=2 +.>Representing the first right channel signal.

Can be modeled as a first time-frequency sequence +.>。

Can be modeled as a second time-frequency sequence +.>。

Step S1022, the first time-frequency sequence corresponding to the first left channel signal is changed to a form of a sum of the first principal component vector and the first environment component, and the first time-frequency sequence corresponding to the first right channel signal is changed to a form of a sum of the second principal component vector and the second environment component.

In particular according to the following equation 2.

Equation 2:

。

wherein, the liquid crystal display device comprises a liquid crystal display device,is the principal component vector, ++>Is an ambient component.

In this way the first and second light sources,，/>。

first left channel signalEqual to the first principal component vector->And a first ambient component- >And (3) summing. First right channel signal->Equal to the second principal component vector->And a second ambient component->And (3) summing.

Step S1023, changing the form of the sum of the first principal component vector and the first ambient component corresponding to the first left channel signal to the form of the sum of the product of the first scaling factor and the unit base vector and the first ambient component, and changing the form of the sum of the second principal component vector and the second ambient component corresponding to the first right channel signal to the form of the sum of the product of the second scaling factor and the unit base vector and the second ambient component.

In particular according to the following equation 3.

Equation 3:

。/>

wherein, the liquid crystal display device comprises a liquid crystal display device,is a unit basis vector, ">Is the unit basis vector +.>Is a major component vector +.>Is the unit basis vector +.>Is a scaled version of (a).

In this way the first and second light sources,，/>。

first left channel signalEqual to the first scaling factor->And unit basis vector->Is multiplied by the first ambient component->And (3) summing. First right channel signal->Equal to the second scaling factor->And unit basis vector->Product of (2) and the second ambient component->And (3) summing.

Step S1024, creating a first matrix based on the first left channel signal being in the form of the sum of the product of the first scaling factor and the unit base vector and the first ambient component, and based on the first right channel signal being in the form of the sum of the product of the second scaling factor and the unit base vector and the second ambient component.

Specifically, the following equation 4 is used.

Equation 4:

。

wherein P is a principal component matrix, and A is an environment matrix.

Thus, since m=2, then。

Step S1025, solving unit basis vectors based on the first matrix。

Specifically, the solution is performed according to the following equation 5.

Equation 5:

。

wherein X is ^H Is the conjugate transpose of X, A ^H Is the conjugate transpose of a.

Assuming that the principal component matrix P has the greatest energy and the environmental matrix A has the least energy, a reasonable optimization criterion is based on the least mean square error method, such that tr (A ^H A) Is equivalent to maximization of the minimum value of (2)Thus, unit basis vector +.>Corresponding to XX ^H A feature vector corresponding to the maximum feature value of (a). Thus, a unit basis vector +.>。

Step S1026, based on unit basis vectorA first product of the conjugate transpose of (a) and the first left channel signal, and the first product and the unit basis vector +.>The product of (2) results in a first principal component vector and, based on the unit basis vector +.>A second product of the conjugate transpose of (a) and the first right channel signal, and the second product and the unit basis vector +.>The product of (2) yields a second principal component vector.

Specifically, the solution is performed according to the following equation 6.

Equation 6:

。

in this way the first and second light sources, ，/>。

Step S1027, obtaining a first ambient component based on the difference between the first left channel signal and the first principal component vector, and obtaining a second ambient component based on the difference between the first right channel signal and the second principal component vector.

Specifically, the solution is performed according to the following equation 7.

Equation 7:

。/>

thus, it is possible to obtainAnd->。

Therefore, based on the obtained first principal component vector, second principal component vector, first environment component and second environment component, the principal component vector can be separated, and the principal component vector is in the form of a characteristic vector of the human voice signal, so that the electronic equipment can separate the first human voice signal.

In one implementation, an electronic device may separate a first human voice signal based on a Neural Network (NN) Network.

Fig. 5 is a schematic diagram of a neural network for separating a first human voice signal according to an embodiment of the present application.

As shown in fig. 5, the electronic device may input the first audio signal into an encoder composed of a convolutional neural network or a cyclic neural network, the encoder converts the first audio signal into a plurality of different features, and inputs the features into a separation network to form a plurality of matrixes corresponding to the features, further, the electronic device combines the matrixes with the plurality of features, inputs the combined matrixes into a decoder, decodes the combined matrixes, and separates human voice and other sounds after decoding. In this way, the electronic device may separate the first human voice signal.

The electronic equipment performs voice separation, so that the left and right channels can obtain better fidelity and better voice definition.

Step S103, respectively performing audio rendering processing on the second left channel signal, the second right channel signal and the first human voice signal based on the first orientation information to obtain a third left channel signal, a third right channel signal and the second human voice signal, wherein the first orientation information comprises the orientation information of the user head relative to the loudspeaker.

In one implementation, the first human voice signal and the first height filter are subjected to segmented convolution based on an overlap-save method to obtain a second human voice signal, the second left channel signal and the second height filter are subjected to segmented convolution based on the overlap-save method to obtain a third left channel signal, and the second right channel signal and the third height filter are subjected to segmented convolution based on the overlap-save method to obtain a third right channel signal.

The overlap preservation method is proposed for calculating the linear convolution of an infinitely long sequence, and replaces the linear convolution by the circular convolution to facilitate computer calculation, and is a fast convolution method which prolongs the signal sequence by preserving the original input sequence with a certain bit number at the front end of a segmented signal, discards the error sequence with the bit number after the convolution is finished, and adds the bit number to ensure that the circular convolution result is the same as the linear convolution result. The specific convolution process of the application is referred to in the following.

The first height filter is a filter obtained according to the head height of the user and a first angle of the first human voice signal relative to the electronic device, the second height filter is a filter obtained according to the head height of the user and a second angle of the second left channel signal relative to the electronic device, and the third height filter is a filter obtained according to the head height of the user and a third angle of the second right channel signal relative to the electronic device.

The first angle comprises a horizontal angle and a pitch angle corresponding to a first head related transfer function (Head Related Transfer Function, HRTF) corresponding to the first vocal signal. Is obtained during the process of obtaining the first height filter.

Among them, HRTF is a sound localization algorithm for describing the transmission of sound waves from a sound source to both ears. When sound is transmitted to a user, the HRTF will respond with a phase and frequency corresponding to the user's head.

The volume difference between the user's left and right ears is mainly dependent on the interaural time difference (Interaural Time Different, ITD), interaural intensity difference (Interaural Level Difference, ILD) and interaural color difference. Among them, ITD and ILD are generally used to act on sound source localization in the horizontal direction, and interaural tone differences generated by body structures such as auricles are generally used to generate localization in the vertical direction. Based on these differences, the HRTF can provide the user with a suitable impulse response.

The first height filter may be obtained by the following steps S1031 to S1038.

In step S1031, first orientation information is acquired.

The electronic device may acquire the first orientation information according to a manner that the camera tracks the orientation information of the head of the user or a manner that the user defines the orientation information of the head of the user. The first orientation information may be determined according to various parameters such as a first distance between a center of a user's head and a center of the electronic device, a height of the user's head, a size of the user's head, and a rotation angle of the user's head. Wherein, the left ear position of the user and the right ear position of the user can be determined according to the head size of the user.

The user needs to authorize the electronic device so that the electronic device can track the user information according to the camera.

Fig. 6A is a schematic diagram of a first interface for an electronic device to obtain user head orientation information according to an embodiment of the present application.

Fig. 6B is a schematic diagram of a second interface for the electronic device to obtain the head orientation information of the user according to the embodiment of the present application.

Fig. 7A is a schematic diagram of a third interface for an electronic device to obtain user head orientation information according to an embodiment of the present application.

Fig. 7B is a schematic diagram of a fourth interface for the electronic device to obtain the head orientation information of the user according to the embodiment of the present application.

As shown in fig. 6A, fig. 6B, fig. 7A, and fig. 7B, the electronic device displays a first interface 101, where the first interface 101 includes a first control 102 and a third control 103, where the first control 102 may be used to turn on or off a personalized setting function in the audio-out mode, and the third control 103 may be used to turn on spatial audio in the audio-out mode.

In response to the opening operation of the first control 102 by the user, the electronic device displays a second interface 104, where the second interface 104 includes a second control 105, and the second control 105 is used to turn on or off a head tracking function in an audio playback mode of the electronic device.

In response to the user's opening operation of the second control 105, the electronic device opens the camera and tracks the user's head orientation information through the camera.

The electronic device may also obtain user head orientation information in other ways if the user closes the second control 105. For example, the user fills in the head orientation information of the user in a custom manner in the electronic device.

In response to a user closing operation of the second control 105, the electronic device displays at least one filling item. The filling item may be displayed in the second interface 104, or the electronic device jumps to a new interface to be displayed in the new interface, which is not limited in the embodiment of the present application.

The filling-in item may include: a first filling item 106, a second filling item 107 and a third filling item 108. The first filling item 106 is used for filling the head height of the user, the second filling item 107 is used for filling the head rotation angle of the user, and the third filling item 108 is used for filling the head rotation distance of the user.

The electronic device may display a cursor in the first filling item 106. In this way, the user can input corresponding content behind the cursor to customize the head orientation information.

In the embodiment of the present application, the interface setting manner is only used for exemplary illustration, and in a specific implementation, the electronic device may also implement the functions by using other interface setting manners, which is not limited in the embodiment of the present application.

Based on the implementation manner, the electronic device may acquire the first orientation information.

In step S1032, it is determined whether the first distance is equal to the measurement radius of the preset HRTF.

Specifically, the HRTF is a set of filters, which are the result of the integrated filtering of sound waves by the physiological structures of the user (e.g., the user's head, the user's pinna, the user's torso, etc.). Sound waves emitted by the sound source reach ears after being scattered by the head, auricle, trunk and the like of a user, wherein the physical process can be regarded as a linear unchanged sound filtering system, and the characteristics of the linear unchanged sound filtering system can be represented by a frequency domain transmission function.

The electronic device may preset an HRTF in the DSP2, so that the electronic device may process sound sources of the virtual world in real time through the preset HRTF.

However, since the relative positions of the user and the electronic device are not fixed, the processing procedure of the preset HRTF for the sound may not meet the user's requirement, and the preset HRTF needs to be corrected. In this way, rendering deviation in the audio rendering process can be avoided.

Here, the expression of HRTF in the time domain is called head-related impulse response (Head Related Impulse Response, HRIR), and HRIR and HRTF are fourier transform pairs.

In step S1033, if the first distance is greater than or less than the measurement radius of the preset HRTF, correcting the preset HRTF according to the first distance, the left ear position of the user, and the right ear position of the user, to obtain the first HRTF.

Wherein the first HRTF is a corrected HRTF, may include a first left-ear HRTF and a first right-ear HRTF,

fig. 8 is a schematic diagram of HRTF correction according to an embodiment of the present application.

As shown in fig. 8, the M area is an area where the head of the user is located, M1 is the left ear position of the user, M2 is the right ear position of the user, the R area is a measurement area corresponding to a preset HRTF, and R1 is a measurement radius corresponding to the preset HRTF.

In the present application, fig. 8 is only a schematic plan view, and each region should be a three-dimensional region.

In the process of playing audio, the electronic device is a sound source, the first distance is the actual distance between the sound source and the user, the first distance is not necessarily equal to the measurement radius of the HRTF, and when the first distance is smaller than the measurement radius of the HRTF or larger than the measurement radius of the HRTF, the preset HRTF needs to be corrected.

Specifically, when the first distance is greater than a measurement radius of a preset HRTF, a first intersection point of a connection line between the center of the electronic device and the left ear azimuth of the user on a measurement area of the HRTF is obtained, and a second intersection point of a connection line between the center of the electronic device and the right ear azimuth of the user on the measurement area of the HRTF is obtained, wherein the first intersection point is used for determining a first horizontal azimuth angle and a first pitch angle, and the second intersection point is used for determining a second horizontal azimuth angle and a second pitch angle. In this way, the first left ear HRTF can be determined from the first horizontal azimuth and the first pitch angle, and the first right ear HRTF can be determined from the second horizontal azimuth and the second pitch angle.

For example, when the sound source is positioned at the X1 azimuth as shown in fig. 8, a first intersection AL1 is formed by a line between the actual azimuth of the sound source and the left ear azimuth M1 of the user and the R area, and a second intersection AR1 is formed by a line between the actual azimuth of the sound source and the right ear azimuth M2 of the user and the R area, so that two corrected HRTFs can be obtained. The first left ear HRTF may be a function including a first horizontal azimuth angle θ1 and a first pitch angle Φ1 corresponding to the left ear, and the first right ear HRTF may be a function including a second horizontal azimuth angle θ2 and a second pitch angle Φ2 corresponding to the right ear.

Illustratively, the center coordinates of the user's head are (0, 0), and the coordinates of the user's left ear are (0, -r) _people 0), the right ear of the userThe coordinates are (0, r) _people 0), the sound source coordinates are (x ₁ ，y ₁ ，z ₁ ) The first horizontal azimuth θ1 corresponding to the first left-ear HRTF can be solved according to the following equation 8.

Equation 8:

。

the first pitch angle Φ1 corresponding to the first left-ear HRTF can be solved according to the following equation 9.

Equation 9:

。

similarly, the second horizontal azimuth angle θ2 and the second pitch angle Φ2 corresponding to the first right ear HRTF can be solved, which is not described in detail in the embodiment of the present application.

Thus, the first angle may include a first horizontal azimuth angle θ1, a first pitch angle Φ1, a second horizontal azimuth angle θ2, and a second pitch angle Φ2, which can be used to obtain a first altitude filter corresponding to the first human voice signal.

When the first distance is smaller than the measurement radius of the preset HRTF, a first intersection point of a connecting line of the center of the electronic device and the left ear azimuth on the measurement area of the HRTF is obtained, and a second intersection point of a connecting line of the center of the electronic device and the right ear azimuth on the measurement area of the HRTF is obtained. Thus, a first right ear HRTF is determined based on the first horizontal azimuth and the first pitch angle, and a first left ear HRTF is determined based on the second horizontal azimuth and the second pitch angle.

For example, when the sound source is located in the X2 direction as shown in fig. 8, a first intersection AL2 is formed between the connection line between the actual direction of the sound source and the direction M1 of the left ear of the user and the R region, and a second intersection AR2 is formed between the connection line between the actual direction of the sound source and the direction M2 of the right ear of the user and the R region, after AL2 and AR2 are obtained, AL2 and AR2 need to be interchanged, and HRTF correction is performed based on the horizontal azimuth angles, pitch angles, and the like corresponding to AL2 and AR2 after the interchange. For example, AL2 corresponds to a first horizontal azimuth angle θ3, a first pitch angle Φ3, ar2 corresponds to a second horizontal azimuth angle θ4, a second pitch angle Φ4, the first left ear HRTF may be a function including the second horizontal azimuth angle θ4 and the second pitch angle Φ4 corresponding to the left ear, and the first right ear HRTF may be a function including the first horizontal azimuth angle θ3 and the first pitch angle Φ3 corresponding to the right ear.

The solution method of the first horizontal azimuth angle θ3, the first pitch angle Φ3, the second horizontal azimuth angle θ4, and the second pitch angle Φ4 may refer to the above embodiment, which is not described in detail in the present disclosure.

When the first distance is equal to the measured radius of the preset HRTF, no correction of the preset HRTF is required.

In step S1034, an HRIR dataset corresponding to the head height and the first angle of the user is obtained, where the HRIR dataset includes at least one HRIR of the measurement user.

The method for acquiring the HRIR data set in the embodiment of the application is as follows:

the HRIR dataset is a time domain sequence and may include multiple sets of HRIR data, which are typically impulse responses corresponding to different head heights, different horizontal azimuth angles, and different pitch angles of different users.

For example, the HRIR dataset has impulse responses corresponding to head data of 100 users, and because the head of each user is different in size at the same height, the horizontal azimuth angle and the pitch angle corresponding to the left ear azimuth of each user are different in impulse response. Thus, acquiring HRIRs for 100 users at a first height and a first angle may generate a HRIR dataset in an embodiment of the present application.

It should be noted here that the number of users in the HRIR dataset is only for illustrative purposes, and the actual number of users is not limited in this regard.

Step S1035, performing frequency domain variation on the HRIR dataset to obtain an HRTF dataset.

Since the HRIR data sets are all time-series, it is necessary to change the impulse response in the HRIR data sets from the time domain to the frequency domain.

In step S1036, a time-frequency domain average signal of the HRTF data set is acquired.

Specifically, the time-frequency domain average signal is obtained according to the following formula 10.

Equation 10:

。

in this equation 10, the data of the first equation,is a frequency domain variation function of the impulse response corresponding to the header data at a first height, a first angle, for each user in the HRIR data set. M represents the number of HRIRs in the HRIR dataset, < >>Is a time-frequency domain average signal.

Step S1037, performing frequency multiplication smoothing or envelope extraction processing on the time-frequency domain average signal to obtain an initial height filter.

An example is given in which the electronic device performs envelope extraction processing on the time-frequency domain average signal.

Fig. 9 is a schematic diagram of left channel spectral envelope extraction according to an embodiment of the present application.

Fig. 10 is a schematic diagram of right channel spectral envelope extraction according to an embodiment of the present application.

As shown in fig. 9 and 10, the time-frequency domain average signal includes a left channel spectrum and a right channel spectrum, and extraction of the left channel spectrum envelope and extraction of the right channel spectrum envelope can make the time-frequency domain average signal smoother, and improve spectrum stability.

The envelope extraction can be specifically performed according to the following formula 11.

Equation 11:

。

wherein envelope extraction, also known as Hilbert transform, can be used to construct the analytic signal.

Is a smoothed time-frequency domain average signal, +.>Hilbert transform is imaginary.

Further, the time-frequency domain average signal is inversely transformed according to the following formula 12 to obtain a time domain pulse.

Equation 12:

。

where h (t) is a time domain pulse.

After the time domain pulse is acquired, an initial height filter can be obtained according to equation 13.

Equation 13:

。

in step S1038, ITD adjustment is performed on the discretized form of the initial height filter, to obtain a first height filter.

The target height filter adjusted by the ITD can play a better role in filtering.

The discrete form of the initial height filter may be obtained according to equation 14.

Equation 14:

。

ITD information is determined specifically according to the following equation 15.

Equation 15:

。

wherein, the liquid crystal display device comprises a liquid crystal display device,the head rotation angle, r is the head half of a personDiameter, c, is the speed of sound.

Fig. 11 is a schematic view of a head rotation angle according to an embodiment of the present application.

As shown in fig. 11, taking the example that the electronic device symmetrically arranges the first speaker P1 and the second speaker P2 at the upper left corner and the upper right corner of the keyboard, the center O1 of the line connecting P1 and P2 may be determined as the center of the electronic device. If the center of the head of the user is O2, the included angle between the connecting line of O1 and O2 relative to the connecting line of P1 and P2The head rotation angle of the user can be determined. />

The first height filter may be represented by the following equation 16.

Equation 16:

。

where f (n) in equation 14 is actually a two-channel corresponding filter, specifically including a left channel filter and a right channel filter,for representing the left channel filter->For representing the right channel filter.

Thus, a first height filter corresponding to the first human voice signal can be obtained according to the above process. That is, the first height filter f _c (n) includeAnd->。

The second height filter is obtained in a similar manner to the first height filter, and in particular relates to the following differences.

The specific implementation in step S1034 is different.

And acquiring an HRIR data set corresponding to the head height and the second angle of the user.

The second angle is obtained based on the first angle and a first preset angle threshold, specifically, a horizontal azimuth angle in the first angle is added to the first preset angle threshold, and the first preset angle threshold is-60 degrees by way of example, so that the second angle can comprise a third horizontal azimuth angle theta 1-60 degrees, a first pitch angle phi 1, a fourth horizontal azimuth angle theta 2-60 degrees, and a second pitch angle phi 2.

Since the HRIR data sets are different, the values of the filters obtained after performing steps S1035-S1038 are different in the same manner, and based on this, a second height filter corresponding to the second left channel signal is obtained.

The third height filter is similar to the first height filter in terms of its acquisition, and in particular relates to the following differences.

The specific implementation in step S1034 is different.

And acquiring an HRIR data set corresponding to the head height and the third angle of the user.

The third angle is obtained based on the first angle and a first preset angle threshold, specifically, a horizontal azimuth angle in the first angle is added to a second preset angle threshold, and the second preset angle threshold is 60 ° by way of example, so that the second angle may include a fifth horizontal azimuth angle θ1+60°, a first pitch angle Φ1, a sixth horizontal azimuth angle θ2+60°, and a second pitch angle Φ2.

Since the HRIR data sets are different, the values of the filters obtained after performing steps S1035-S1038 are different in the same manner, and based on this, a third height filter corresponding to the second right channel signal is obtained.

After the first, second, and third height filters are acquired, a segmented convolution operation may be performed on the first, second, and third right channel signals.

The first human voice signal and the first height filter are subjected to segmented convolution based on an overlap preservation method, and the second human voice signal is obtained according to the following formula 17:

Equation 17:

。

in the formula 17 of the present invention,for the first person sound signal +.>For the second voice signal, f _c (n) is a first height filter.

The method comprises the steps of carrying out segmented convolution on a second left channel signal and a second height filter based on an overlap preservation method to obtain a third left channel signal, and carrying out segmented convolution on a second right channel signal and the third height filter based on the overlap preservation method to obtain a third right channel signal, wherein the third right channel signal is realized according to the following formula 18:

equation 18:

。

in the equation 18,for the second left channel signal->For the second right channel signal->In the case of a second-height filter,is a third height filter, +>For the third left channel signal->Is the third right channel signal.

Thus, based on the above process, the sound field width can be widened by convolving the second left channel signal and the second right channel signal, the sound image height sense and the externalization sense can be optimized by highly filtering the first human sound signal, and the audio rendering of the second left channel signal, the second right channel signal and the first human sound signal is realized.

Step S104, performing content adaptive frequency division processing on the third left channel signal and the third right channel signal to divide the third left channel signal into a first left channel sub-signal of the first frequency band, a second left channel sub-signal of the second frequency band, and a third left channel sub-signal of the third frequency band, and divide the third right channel signal into a first right channel sub-signal of the first frequency band, a second right channel sub-signal of the second frequency band, and a third right channel sub-signal of the third frequency band.

Content adaptive crossover (Content Adaptive Bitrate Streaming) is a streaming media transmission technique that aims to dynamically adjust the video or audio code rate to provide a better user experience based on changes in network bandwidth and device performance. The minimum value of the first frequency band is larger than or equal to the maximum value of the second frequency band, and the minimum value of the second frequency band is larger than or equal to the maximum value of the third frequency band.

Taking the content adaptive divide processing of the third left channel signal as an example.

The specific implementation includes the following steps S1041-S1045.

In step S1041, the third left channel signal is classified by the neural audio multi-classification network to obtain at least one audio object, where each audio object corresponds to a sound probability.

Fig. 12 is a schematic diagram of a neural audio multi-classification network classification process according to an embodiment of the present application.

As shown in fig. 12, taking the third left channel signal as an example, the third left channel signal is first converted into a spectrogram, the spectrogram is input into the neural audio multi-classification network to obtain features in the spectrogram, and the features are classified by a classifier to obtain at least one audio object and a sound probability corresponding to the audio object.

By way of example, the audio object includes: car horn, dog call, drum, etc. The embodiment of the application does not limit the specific form of the audio object.

Each audio object corresponds to a sound probability, for example, the sound probability corresponding to a car horn is 0.05, the sound probability corresponding to a dog call is 0.81, and the sound probability corresponding to a drum sound is 0.03. The sound probabilities are used to determine accuracy in classifying the audio objects.

In step S1042, each sound probability is compared with a preset sound probability threshold.

The electronic device is preset with a sound probability threshold, and the sound probability threshold is 0.2 in an example.

In step S1043, if at least two sound probabilities are greater than the sound probability threshold, a first target audio object corresponding to the two sound probabilities with the front sound probability value is obtained, or if one sound probability is greater than the sound probability threshold, a second target audio object corresponding to the sound probability is obtained.

For example, the sound probability corresponding to piano sound is 0.3, the sound probability corresponding to violin sound is 0.4, the sound probability corresponding to flute sound is 0.2, since 0.4 > 0.3 > 0.2, the sound probabilities 0.4 and 0.3 are determined to be the two sound probabilities with the sound probability value being the front, and the violin sound and piano sound corresponding to the two sound probabilities are determined to be the first target audio object, which indicates that there are two sound classifications in the third left channel signal.

If the sound probability of the piano sound is 0.03, the sound probability of the violin sound is 0.7, and the sound probability of only the violin sound is greater than 0.2, this means that there is one sound classification in the third left channel signal.

In step S1043, a first frequency division point and a second frequency division point are determined based on the first target audio object or the second target audio object, where the frequency value of the first frequency division point is greater than the frequency value of the second frequency division point.

Here, each audio object is preset with a corresponding frequency range. By way of example, the frequency range of the partial instrument is 262Hz-1976Hz for the flute, 110Hz-880Hz for the circle, and 523Hz-3951Hz for the siren.

In one implementation, determining a first crossover point and a second crossover point based on a first target audio object includes: and judging whether the first target audio object contains a preset first attention object or not. If the first target audio object contains a first object of interest, determining a maximum value of the frequency range of the first object of interest as a first frequency division point, and determining a minimum value of the frequency range of the first object of interest as a second frequency division point.

By way of example, the first object of interest is provided as a musical instrument, including, for example, a flute, a round, a siren, etc.

If the first target audio object includes a flute and a dog call, since the flute belongs to the first object of interest, a maximum 1976Hz of the frequency range 262Hz-1976Hz of the flute is determined as the first crossover point, and a minimum of the frequency range 262Hz-1976Hz of the flute is determined as the second crossover point.

If the first target audio object does not contain the first object of interest, a first frequency range of the first target audio object is acquired, and a second frequency range of the second first target audio object is acquired.

Illustratively, the first target audio object includes a dog call and an automobile horn. Thus, the first target audio object does not contain the first object of interest.

If the first frequency range includes the second frequency range, a maximum value of the first frequency range is determined as a first frequency division point, and a minimum value of the first frequency range is determined as a second frequency division point. If the second frequency range includes the first frequency range, a maximum value of the second frequency range is determined as the first frequency division point, and a minimum value of the second frequency range is determined as the second frequency division point.

For example, the first frequency range is 100Hz-1000Hz and the second frequency range is 200Hz-800Hz, and thus, the maximum value of 1000Hz of 100Hz-1000Hz in the first frequency range is determined as a first frequency division point, and the minimum value of 100Hz-1000Hz in the first frequency range is determined as a second frequency division point.

If the first frequency range and the second frequency range have a first intersection, a maximum value of the first intersection is determined as a first frequency division point, and a minimum value of the first intersection is determined as a second frequency division point.

For example, the first frequency range is 100Hz-1000Hz, the second frequency range is 800Hz-2000Hz, then the first intersection is [800Hz,1000Hz ], and therefore, the maximum value 1000Hz in the first intersection [800Hz,1000Hz ] is determined as the first frequency division point, and the minimum value 800Hz in the first intersection [800Hz,1000Hz ] is determined as the second frequency division point.

If the first frequency range and the second frequency range have no intersection, and the minimum value of the first frequency range is larger than the maximum value of the second frequency range, the minimum value of the first frequency range is determined as a first frequency division point, and the maximum value of the second frequency range is determined as a second frequency division point. If the first frequency range and the second frequency range have no intersection, and the minimum value of the second frequency range is larger than the maximum value of the first frequency range, the minimum value of the second frequency range is determined as a first frequency division point, and the maximum value of the first frequency range is determined as a second frequency division point.

For example, the first frequency range is 2000Hz-3000Hz, the second frequency range is 100Hz-1000Hz, and since the minimum value 2000Hz of the first frequency range is greater than the maximum value 1000Hz of the second frequency range, the minimum value 2000Hz of the first frequency range 2000Hz-3000Hz is determined as the first frequency division point, and the maximum value 1000Hz of the second frequency range 100Hz-1000Hz is determined as the second frequency division point.

In one implementation, determining the first crossover point and the second crossover point based on the second target audio object includes: the maximum value of the frequency range of the second target audio object is determined as the first crossover point, and the minimum value of the frequency range of the second target audio object is determined as the second crossover point.

By way of example, the second target audio object is a flute, the maximum 1976Hz of the frequency range 262Hz-1976Hz of the flute is determined as the first crossover point, and the minimum 262Hz of the frequency range 262Hz-1976Hz of the flute is determined as the second crossover point.

In step S1044, the frequency range greater than the first frequency division point is determined as the first frequency band, the frequency range less than or equal to the first frequency division point is determined as the second frequency band, and the frequency range less than or equal to the second frequency division point is determined as the third frequency band.

For example, the first frequency division point is 1976Hz, the second frequency division point is 262Hz, and the first frequency band is (1976 Hz, +) The second frequency band is (262 Hz,1976Hz]The third band is (-)>，262Hz]。

In step S1045, a sound signal with a frequency in the first frequency band in the third left channel signal is determined as a first left channel sub-signal, a sound signal with a frequency in the second frequency band is determined as a second left channel sub-signal, and a sound signal with a frequency in the third frequency band is determined as a third left channel sub-signal.

Similarly, the third right channel signal may be content-adaptively divided according to a similar process as the third left channel signal.

In this way, frequency division of the high, medium and low frequency bands is achieved for the third left channel signal and the third right channel signal.

Specifically, the first left channel sub-signal and the first right channel sub-signal are channel signals of a high frequency band, the second left channel sub-signal and the second right channel sub-signal are channel signals of a medium frequency band, and the third left channel sub-signal and the third right channel sub-signal are sound signals of a low frequency band.

Fig. 13 is a schematic diagram of an adaptive crosstalk cancellation process according to an embodiment of the present application.

As shown in fig. 13, after the content-adaptive frequency division is performed on the third left channel signal and the third right channel signal, adaptive crosstalk cancellation processing is required for the sub-signals after the content-adaptive frequency division.

Step S105, performing adaptive crosstalk cancellation processing on the second right channel sub-signal, overlapping the second right channel sub-signal to generate a fourth left channel sub-signal, performing adaptive crosstalk cancellation processing on the second left channel sub-signal, overlapping the second left channel sub-signal to generate a fourth right channel sub-signal, performing delay processing on the first left channel sub-signal and the third left channel sub-signal respectively, overlapping the first left channel sub-signal and the third left channel sub-signal to generate a first overlapping signal, performing delay processing on the first right channel sub-signal and the third right channel sub-signal respectively, overlapping the first right channel sub-signal and the third right channel sub-signal to generate a second overlapping signal, performing delay processing on the first right channel sub-signal and the third right channel sub-signal respectively, performing gain processing on the first left channel sub-signal and the third left channel sub-signal respectively, and overlapping the first left channel sub-signal and the third left channel sub-signal to generate a second target overlapping signal.

Here, in the audio playback mode, the user's left ear receives the left channel signal and the user's right ear receives the right channel signal, and in the audio playback mode, the user's left ear receives the right channel signal and the user's right ear receives the left channel signal, respectively, so that a crosstalk path is formed. The crosstalk path can cause spectrum staining, reduce the correlation of the left and right ear received signals, cause the stereoscopic damage of sound image, and deteriorate the tone quality.

Adaptive crosstalk cancellation (Adaptive Interference Cancellation) is a technique used in signal processing and communication systems that aims to reduce or eliminate signal crosstalk due to spectral overlap.

The adaptive crosstalk elimination processing can eliminate crosstalk signals of left and right ears on the basis of ensuring the original tone quality. The method is realized by performing crosstalk suppression on the sound channel signals of the middle frequency band sensitive to human ears.

Specifically, the adaptive crosstalk cancellation processing includes inverse processing, attenuation processing, delay processing, and gain processing.

The inversion processing is to make the vibration direction of sound waves of the left and right channels have a phase difference of 180 degrees. The attenuation process refers to a process in which the acoustic wave gradually decreases in the course of transmission. Delay processing refers to a period of time that is intermittently stopped in the channel signal propagation process. Gain processing refers to amplifying a channel signal by a certain multiple.

Step S106, the fourth left channel sub-signal, the first superposition signal and the first target superposition signal are superposed to generate a fourth left channel signal, and the fourth right channel sub-signal, the second superposition signal and the second target superposition signal are superposed to generate a fourth right channel signal.

Step S105 and step S106 are specifically implemented according to the following formulas 19, 20, and 21.

Equation 19:

；

equation 20:

；

。

wherein, the liquid crystal display device comprises a liquid crystal display device,this can be obtained according to the following equation 21.

Equation 21:

。

wherein, in the formula 19,is the first left channel sub-signal,/and/or>Is the second left channel sub-signal,/and/or>Is the third left channel sub-signal,/o>Is the first right channel sub-signal,/o>Is the second right channel sub-signal,/and>is the third right channel sub-signal,/o>Is the number of iterations, +.>Is the iterative attenuation of the left channel, +.>Is the iterative attenuation of the right channel, d _L Is the sample delay time of the left channel, d _R Is the sample delay time of the right channel, +.>Is the gain of the low frequency of the channel, +.>Is the gain of the high frequency of the channel, d ₁ Is the sample delay preset for the left channel, d ₂ Is the sample delay preset for the right channel, y _L (t) is the fourth left channel signal, y _R And (t) is a fourth right channel signal.

D in formula 19 _L And d _R Is obtained according to formula 20.

Further as shown in FIG. 11, Q1 is the left ear orientation of the user, Q2 is the right ear orientation of the user, and the distance between Q2 and P1 is l _LR The distance between Q2 and P2 is l _RR The distance between Q1 and P1 isl _LL The distance between Q1 and P2 is l _RL D is the vertical distance between O2 and the connection of P1 and P2, K is the distance between O1 and P1, c is the speed of sound, and fs is the sampling rate.

In equation 19And->Obtained according to formula 21.

Wherein, the liquid crystal display device comprises a liquid crystal display device,and->Is determined based on the current context information.

Is a room environment adjustment factor, which is related to the room size S in which the user is located. Meanwhile, in equation 19And also the room size S in which the user is located.

See in particular table 1 below:

；

the room type and the room size can be obtained by the electronic device when the camera is started to track the user information, or can be obtained by customizing the room size.

As further shown in fig. 7A and 7B, the filling-in items may further include a fourth filling-in item 109, and a fifth filling-in item 110. The fourth filling item 109 is used for filling out estimated room types, such as a hall, a room, a conference room, a living room, a concert, a court, etc., and the fifth filling item 110 is used for filling out estimated room sizes.

Based on the above process, the electronic device completes the adaptive crosstalk cancellation process for the channel signal. Therefore, the electronic equipment can eliminate crosstalk signals of the left ear and the right ear on the basis of ensuring the original tone quality, and can perform crosstalk suppression on the middle frequency band which is most sensitive to the human ear. Specifically, first, the second right channel sub-signal generates a constant-amplitude reverse signal of the second left channel sub-signal through reverse processing, attenuation processing and delay processing, and the constant-amplitude reverse signal is superimposed on the second left channel sub-signal, so that crosstalk sound formed by the second right channel sub-signal to the left ear is counteracted, and the purpose of self-adaptive crosstalk elimination is achieved. Secondly, as the perception of the high-frequency band and the low-frequency band channel signals such as the first left channel sub-signal and the third left channel sub-signal of the human ear is weaker, the first superposition signal generated by the two sub-signals can ensure the texture of bass and the tone of high frequency. And thirdly, generating a first target superposition signal by the first right channel sub-signal and the third right channel sub-signal, so that the effect of sound field widening can be enhanced. In addition, in the adaptive crosstalk elimination process, the attenuation process and the delay process are adaptively adjusted according to the head direction of the user, so that the dessert position problem can be restrained to a greater extent, and the user experience is improved.

Step S107, performing a downmix process on the fourth left channel signal, the fourth right channel signal, and the second human voice signal to obtain a fifth left channel signal and a fifth right channel signal.

Wherein, the down-mixing process is a process of synthesizing a plurality of channels into a few channels. Since the stereo sound itself has two channels, after separating out the human voice signal, it is also necessary to re-synthesize the human voice signal into the fourth left channel signal and the fourth right channel signal after a series of processing is performed on the human voice signal.

Specifically, the down-mixing process is performed according to the following formula 22:

equation 22:

。

wherein L is ₄ Is the fourth left channel signal, R ₄ Is the fourth right channel signal, C ₂ Is a second human voice signal, L ₅ Is the fifth left channel signal, R ₅ Is the fifth right channel signal.

Step S108, outputting a first output audio signal generated from the fifth left channel signal and the fifth right channel signal.

Thus, after the audio processing process, the output first output audio signal has better sound image height sense, width sense and immersion sense.

In some embodiments, step S109 is further included before step S107.

Step S109, performing parameter equalization processing on the fourth left channel signal, the fourth right channel signal, and the second human voice signal to obtain an equalized fourth left channel signal, an equalized fourth right channel signal, and an equalized second human voice signal.

After step S109 is performed, the fourth left channel signal, the fourth right channel signal, and the second human voice signal in step S107 are actually the signals after the parameter equalization processing.

The parameter equalization processing is to process the channel signal by an EQ (equal) equalizer, and the function of the EQ equalizer is to adjust gain values of each frequency band of the channel signal. Specifically, the EQ equalizer may include at least one of a peak filter, a first order low frequency shelf filter, a low pass filter, a high pass filter, a band pass filter, a first order high frequency shelf filter, a second order low frequency shelf filter, a second order high frequency shelf filter. The channel signals are processed by these filters, so that the tone color can be optimized.

Fig. 14 is a second flowchart of an audio processing method according to an embodiment of the present application.

As shown in fig. 14, in one implementation, the method may include the following steps S201-S207.

In step S201, a second audio signal is acquired, where the second audio signal includes a sixth left channel signal, a sixth right channel signal, a first left surround signal, a first right surround signal, and a first center-set human voice signal.

The electronic device may obtain the second audio signal from a file having a different audio/video format, and the second audio signal may be a 5.1 channel signal. For example, the audio/video format may include MP3, MP4, PCM, WAV, etc., and the embodiment of the present application does not limit the audio/video format.

Step S202, performing audio rendering processing on the sixth left channel signal, the sixth right channel signal, the first left surround sound signal, the first right surround sound signal, and the first center-positioned human sound signal based on the second azimuth information, to obtain a seventh left channel signal, a seventh right channel signal, a second left surround sound signal, a second right surround sound signal, and a second center-positioned human sound signal, where the second azimuth information includes azimuth information of the user head relative to the speaker.

Since the 5.1 channel has a center channel, unlike the previous embodiment, no separation of human voice is required.

In one implementation, the sixth left channel signal and the fourth height filter are subjected to a segmented convolution based on an overlap-save method to obtain a seventh left channel signal, the sixth right channel signal and the fifth height filter are subjected to a segmented convolution based on the overlap-save method to obtain a seventh right channel signal, the first left surround sound signal and the sixth height filter are subjected to a segmented convolution based on the overlap-save method to obtain a second left surround sound signal, the first right surround sound signal and the seventh height filter are subjected to a segmented convolution based on the overlap-save method to obtain a second right surround sound signal, and the first center human sound signal and the eighth height filter are subjected to a segmented convolution based on the overlap-save method to obtain a second center human sound signal.

The fourth height filter may refer to the obtaining mode of the second height filter, the obtaining mode of the fifth height filter may refer to the obtaining mode of the third height filter, and the obtaining mode of the eighth height filter may refer to the obtaining mode of the first height filter.

The sixth height filter is similar to the first height filter in terms of its acquisition, and specifically relates to the following differences.

The specific implementation in step S1034 is different.

And acquiring an HRIR data set corresponding to the head height and the fourth angle of the user.

The fourth angle is obtained based on the first angle and a third preset angle threshold, specifically, the horizontal azimuth angle in the first angle is added to the third preset angle threshold, and the third preset angle threshold is-120 degrees by way of example, so that the fourth angle can comprise a seventh horizontal azimuth angle theta 1-120 degrees, a first pitch angle phi 1, an eighth horizontal azimuth angle theta 2-120 degrees, and a second pitch angle phi 2.

Since the HRIR data sets are different, the values of the filters obtained after performing steps S1035-S1038 are different in the same manner, and based on this, a sixth-height filter corresponding to the first left surround sound signal is obtained.

The seventh height filter is obtained in a similar manner to the first height filter, and specifically relates to the following differences.

The specific implementation in step S1034 is different.

And acquiring an HRIR data set corresponding to the head height and the fifth angle of the user.

The fifth angle is obtained based on the first angle and a fourth preset angle threshold, specifically, a horizontal azimuth angle in the first angle is added to the fourth preset angle threshold, and the fourth preset angle threshold is 120 ° in an example, so the fifth angle may include a ninth horizontal azimuth angle θ1+120°, a first pitch angle Φ1, a tenth horizontal azimuth angle θ2+120°, and a second pitch angle Φ2.

Because the HRIR data sets are different, the values of the filters obtained after performing steps S1035-S1038 are different in the same manner, and based on this, a seventh-height filter corresponding to the first right surround sound signal is obtained.

The specific way of performing the segmented convolution on the first mid-voice signal and the eighth height filter based on the overlap preservation method to obtain the second mid-voice signal may refer to the above formula 17.

The specific manner of performing the piecewise convolution on the sixth left channel signal and the fourth height filter based on the overlap-save method to obtain the seventh left channel signal, and performing the piecewise convolution on the sixth right channel signal and the fifth height filter based on the overlap-save method to obtain the seventh right channel signal may refer to the above formula 18.

The step of performing a piecewise convolution on the first left surround sound signal and the sixth height filter based on the overlap-save method to obtain a second left surround sound signal, and the step of performing a piecewise convolution on the first right surround sound signal and the seventh height filter based on the overlap-save method to obtain a second right surround sound signal may refer to the following formula 23:

equation 23:

。

in the formula 23 of the present invention,for the first left surround signal, +.>For the first right surround sound signal, +.>Is a sixth-height filter, +.>Is a seventh high-level filter, +.>For the second left surround signal +.>Is the second right surround sound signal.

Thus, based on the above-mentioned process, by convolving the sixth left channel signal, the sixth right channel signal, the first left surround sound signal, and the first right surround sound signal, the sound field width can be widened, by highly filtering the first center-positioned human voice signal, the sound image height sense can be optimized, and the externalization sense, the audio rendering of the sixth left channel signal, the sixth right channel signal, the first left surround sound signal, the first right surround sound signal, and the first center-positioned human voice signal is realized.

In step S203, the content adaptive frequency division processing is performed on the seventh left channel signal, the seventh right channel signal, the second left surround sound signal, and the second right surround sound signal, so as to divide the seventh left channel signal into a fifth left channel sub-signal of the first frequency band, a sixth left channel sub-signal of the second frequency band, and a seventh left channel sub-signal of the third frequency band, and divide the seventh right channel signal into a fifth right channel sub-signal of the first frequency band, a sixth right channel sub-signal of the second frequency band, and a seventh right channel sub-signal of the third frequency band, and divide the second left surround sound signal into a first left surround sound sub-signal of the first frequency band, a second left surround sound sub-signal of the second frequency band, and a third left surround sound sub-signal of the third frequency band, and divide the second right surround sound signal into a first right surround sound sub-signal of the first frequency band, a second right surround sound sub-signal of the second frequency band, and a third right surround sound sub-signal of the third frequency band, wherein the first frequency band is the maximum value of the first frequency band or the second frequency band is the maximum value of the second frequency band or the third frequency band is the maximum value of the first frequency band or the maximum value of the second frequency band or the maximum value is the maximum value of the third frequency band or the maximum value of the third frequency band.

Step S204 of performing adaptive crosstalk cancellation processing on the sixth right channel sub-signal, superimposing the sixth right channel sub-signal to generate an eighth left channel sub-signal, performing adaptive crosstalk cancellation processing on the sixth left channel sub-signal, superimposing the sixth right channel sub-signal to generate an eighth right channel sub-signal, and, after delay processing on the fifth left channel sub-signal and the seventh left channel sub-signal, respectively, superimposing the fifth right channel sub-signal and the seventh right channel sub-signal to generate a third superimposed signal, and, after delay processing on the fifth right channel sub-signal and the seventh right channel sub-signal, respectively, performing gain processing on the fifth right channel sub-signal and the seventh left channel sub-signal, respectively, performing delay processing on the fifth left channel sub-signal and the seventh left channel sub-signal, superimposing the fourth left channel sub-signal, generating a fourth superimposed signal, and the fourth left channel sub-signal, respectively, superimposing the fifth right channel sub-signal and the seventh left channel sub-signal, respectively, performing delay processing on the fourth superimposed signal, and the fourth surround signal, respectively, performing the first surround signal and the fourth surround signal, respectively, and the fourth surround signal are generated after the first surround signal and the surround signal, and the fourth surround signal are superimposed, and the fourth surround signals are superimposed, and the surround signals are generated, and superposing the first left surround sound superposition signal and the third left surround sound superposition signal to respectively delay the first left surround sound signal and the third left surround sound signal, and respectively gain the first left surround sound signal and the third left surround sound signal to respectively generate a second target surround sound superposition signal.

In step S205, the eighth left channel sub-signal, the third superimposed signal and the third target superimposed signal are superimposed to generate an eighth left channel signal, and the eighth right channel sub-signal, the fourth superimposed signal and the fourth target superimposed signal are superimposed to generate an eighth right channel signal, and the fourth left surround sound sub-signal, the first surround sound superimposed signal and the first target surround sound superimposed signal are superimposed to generate a fourth left surround sound signal, and the fourth right surround sound sub-signal, the second surround sound superimposed signal and the second target surround sound superimposed signal are superimposed to generate a fourth right surround sound signal.

The specific implementation manners of step S203 to step S205 may refer to the foregoing embodiments, and the disclosure is not repeated herein.

In the process of executing step S204 and step S205, it is necessary to select a room environment adjustment coefficient different from the left and right channels for the processing of the left and right surround sound signalsAnd a room size S.

In step S206, the eighth left channel signal, the eighth right channel signal, the fourth left surround signal, the fourth right surround signal, and the second center human voice signal are subjected to down-mixing processing to obtain a ninth left channel signal and a ninth right channel signal.

The downmix processing manner of the eighth left channel signal and the eighth right channel signal in the 5.1 channel can be referred to the above formula 22. The down-mixing process of the fourth left surround sound signal and the fourth right surround sound signal may be implemented according to the following formula 23.

Equation 23:

。

wherein L is ₉ Is a ninth left channel signal, R ₉ Is the ninth right channel signal, cz ₂ Is a second middle-set human voice signal L ₈ Is the eighth left channel signal, R ₈ Is the eighth right channel signal, ls ₄ Fourth left surround sound signal, rs ₄ And a fourth right surround sound signal.

Step S207 outputs a second output audio signal generated from the ninth left channel signal and the ninth right channel signal.

In other embodiments of the present application, step S200 is further included before step S201.

In step S200, a third audio signal is obtained, where the third audio signal includes a first left front surround signal, a first right front surround signal, a third center human voice signal, a first left surround signal, a first right surround signal, a first left rear surround signal, and a first right rear surround signal.

That is, the embodiment of the application can also process files with audio/video formats with more channels, such as 7.1 channels.

For 7.1 channel audio, it is downmixed to 5.1 channels to realize according to the specific implementation mode corresponding to 5.1 channels.

In a specific implementation, for the audio of 7.1 channels, step S201 includes:

and performing down-mixing processing on the first left front surround sound signal, the first right front surround sound signal, the third middle-set human sound signal, the first left side surround sound signal, the first right side surround sound signal, the first left rear surround sound signal and the first right rear surround sound signal to obtain a sixth left channel signal, a sixth right channel signal, the first left surround sound signal, the first right surround sound signal and the first middle-set human sound signal.

At this time, the second audio signal is not directly obtained from the file, but is generated by the third audio signal downmixing process of the 7.1 channel.

In other embodiments of the present application, step S199 is further included before step S201.

In step S199, a fourth audio signal is obtained, where the fourth audio signal includes a second left front surround signal, a second right front surround signal, a fourth center human sound signal, a second left surround signal, a second right surround signal, a second left rear surround signal, a second right rear surround signal, a first front longitudinal sound signal, and a second front longitudinal sound signal.

That is, the embodiment of the application can also process files with audio/video formats with more channels, such as 9.1 channels.

For 9.1 channel audio, it is downmixed to 5.1 channels to realize according to the specific implementation mode corresponding to 5.1 channels.

In a specific implementation, for the audio of 9.1 channels, step S201 includes:

and performing down-mixing processing on the second left front surround sound signal, the second right front surround sound signal, the fourth middle-set human sound signal, the second left side surround sound signal, the second right side surround sound signal, the second left rear surround sound signal, the second right rear surround sound signal, the first front longitudinal sound signal and the second front longitudinal sound signal to obtain a sixth left channel signal, a sixth right channel signal, a first left surround sound signal, a first right surround sound signal and a first middle-set human sound signal.

At this time, the second audio signal is not directly obtained from the file, but is generated by the 9.1-channel fourth audio signal downmixing process.

Fig. 15 is a schematic view of a scene of audio playback of an electronic device according to an embodiment of the present application.

As shown in fig. 15, taking an electronic device as a notebook computer as an example, as seen from a top view of the notebook computer shown in fig. 15 (a) and a side view of the notebook computer shown in fig. 15 (b), after the above audio processing process, in the embodiment of the present application, a sound field felt by a user is wide and full in a process of playing audio through a speaker, so that the user has an immersion feeling.

The scheme provided by the embodiment of the application is mainly described from the perspective of the electronic equipment. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those skilled in the art will readily appreciate that an audio processing method step of each of the examples described in connection with the disclosed embodiments of the application may be implemented in hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or electronic device software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The embodiment of the application can divide the functional modules or functional units of the electronic device according to the method example, for example, each functional module or functional unit can be divided corresponding to each function, or two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware, or in software functional modules or functional units. The division of the modules or units in the embodiment of the present application is schematic, which is merely a logic function division, and other division manners may be implemented in practice.

Fig. 16 is a schematic diagram of a software structure of an audio processing device according to an embodiment of the present application. In one embodiment, the electronic device may implement the corresponding functionality through the software means shown in fig. 16.

As shown in fig. 16, the audio processing apparatus may include:

the audio input module 2001 is configured to obtain a first audio signal, where the first audio signal includes a first left channel signal and a first right channel signal.

The voice extraction module 2002 is configured to perform voice separation on a first left channel signal and a first right channel signal, so as to obtain a first voice signal, a second left channel signal, and a second right channel signal, where the first voice signal includes a first voice sub-signal separated from the first left channel signal and a second voice sub-signal separated from the first right channel signal, the second left channel signal is a channel signal after the first left channel signal separates the first voice sub-signal, and the second right channel signal is a channel signal after the first right channel signal separates the second voice sub-signal.

The spatial audio rendering module 2003 is configured to perform audio rendering processing on the second left channel signal, the second right channel signal, and the first human voice signal based on the first orientation information, to obtain a third left channel signal, a third right channel signal, and a second human voice signal, where the first orientation information includes orientation information of a user head relative to a speaker.

The content adaptive frequency division module 2004 is configured to perform content adaptive frequency division processing on the third left channel signal and the third right channel signal, respectively, so as to divide the third left channel signal into a first left channel sub-signal in a first frequency band, a second left channel sub-signal in a second frequency band, and a third left channel sub-signal in a third frequency band, and divide the third right channel signal into a first right channel sub-signal in the first frequency band, a second right channel sub-signal in the second frequency band, and a third right channel sub-signal in the third frequency band, where a minimum value in the first frequency band is greater than or equal to a maximum value in the second frequency band, and a minimum value in the second frequency band is greater than or equal to a maximum value in the third frequency band.

The adaptive crosstalk cancellation module 2005 is configured to perform adaptive crosstalk cancellation processing on the second right channel sub-signal, superimpose the second right channel sub-signal on the second left channel sub-signal to generate a fourth left channel sub-signal, perform adaptive crosstalk cancellation processing on the second left channel sub-signal on the second right channel sub-signal to generate a fourth right channel sub-signal, and superimpose the first left channel sub-signal and the third left channel sub-signal after performing delay processing respectively on the first left channel sub-signal and the third left channel sub-signal, and superimpose the first right channel sub-signal and the third right channel sub-signal after performing delay processing respectively on the first right channel sub-signal and the third right channel sub-signal to generate a second superimposed signal, and perform delay processing respectively on the first left channel sub-signal and the third left channel sub-signal and perform gain processing respectively on the first left channel sub-signal and the third left channel sub-signal to generate a second target superimposed signal; and superposing the fourth left channel sub-signal, the first superposition signal and the first target superposition signal to generate a fourth left channel signal, and superposing the fourth right channel sub-signal, the second superposition signal and the second target superposition signal to generate a fourth right channel signal.

A downmix module 2006, configured to perform downmix processing on the fourth left channel signal, the fourth right channel signal, and the second human voice signal, to obtain a fifth left channel signal and a fifth right channel signal;

the audio output module 2007 is configured to output a first output audio signal generated according to a fifth left channel signal and a fifth right channel signal.

In one implementation, the method further comprises:

the parameter equalization module 2008 is configured to perform parameter equalization processing on the fourth left channel signal, the fourth right channel signal, and the second human voice signal, so as to obtain an equalized fourth left channel signal, an equalized fourth right channel signal, and an equalized second human voice signal.

The audio processing device provided by the embodiment of the application can also set more or fewer modules based on the type of the input audio, and the embodiment of the application is not limited to this.

As shown in fig. 17, in one embodiment, the electronic device may implement the corresponding functions through the hardware apparatus shown in fig. 17. The apparatus may include: a touch screen 701, a memory 702, a processor 703 and a communication module 704. The devices described above may be connected by one or more communication buses 705. The devices described above may be connected by one or more communication buses 705. The display 701 may include a display panel 7011 and a touch sensor 7012, wherein the display panel 7011 is used to display an image and the touch sensor 7012 may communicate a detected touch operation to an application processor to determine a touch event type, and provide visual output related to the touch operation through the display panel 7011. The processor 703 may include one or more processing units, such as: the processor 703 may include an application processor, a modem processor, a graphics processor, an image signal processor, a controller, a video codec, a digital signal processor, a baseband processor, and/or a neural network processor, etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors. Memory 702 is coupled to processor 703 for storing various software programs and/or computer instructions, and memory 702 may include volatile memory and/or non-volatile memory. The electronic device, when executing computer instructions, can perform the functions or steps of the method embodiments described above.

In one embodiment, the touch screen 701 may include a display panel 7011 and a touch sensor 7012, wherein the display panel 7011 is configured to display an image and the touch sensor 7012 may communicate a detected touch operation to an application processor to determine a type of touch event and provide visual output related to the touch operation through the display panel 7011. The processor 703 may include one or more processing units, such as: the processor 703 may include an application processor, a modem processor, a graphics processor, an image signal processor, a controller, a video codec, a digital signal processor, a baseband processor, and/or a neural network processor, etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors. The memory 702 is coupled to the processor 703 for storing various software programs, capacity data, etc., and the memory 702 may include volatile memory and/or non-volatile memory.

The software programs and/or sets of instructions in the memory 702, when executed by the processor 703, cause the electronic device to implement the audio processing method in any of the implementations described above.

The application also provides an electronic device, comprising: a processor, a memory, and a touch screen; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the audio processing method in any of the implementations described above.

The embodiment of the application also provides a chip system which comprises at least one processor and at least one interface circuit. The processors and interface circuits may be interconnected by wires. For example, the interface circuit may be used to receive signals from other devices (e.g., a memory of an electronic apparatus). For another example, the interface circuit may be used to send signals to other devices. The interface circuit may, for example, read instructions stored in the memory and send the instructions to the processor. The instructions, when executed by a processor, may cause an electronic device to perform the various steps of the embodiments described above. Of course, the system-on-chip may also include other discrete devices, which are not particularly limited in accordance with embodiments of the present application.

Embodiments of the present application also provide a computer-readable storage medium including computer instructions which, when executed on an electronic device as described above, cause the electronic device to perform the functions or steps performed in the method embodiments described above.

The embodiment of the application also provides a computer program product which, when run on a computer, causes the computer to execute the functions or steps executed by the mobile phone in the above method embodiment.

It will be apparent to those skilled in the art from this description that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and the parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The foregoing is merely illustrative of specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. An audio processing method, characterized by being applied to an electronic device provided with at least one speaker, comprising:

acquiring a first audio signal, wherein the first audio signal comprises a first left channel signal and a first right channel signal;

performing human voice separation on the first left channel signal and the first right channel signal to obtain a first human voice signal, a second left channel signal and a second right channel signal, wherein the first human voice signal comprises a first human voice sub-signal separated from the first left channel signal and a second human voice sub-signal separated from the first right channel signal, the second left channel signal is a channel signal of the first left channel signal from which the first human voice sub-signal is separated, and the second right channel signal is a channel signal of the first right channel signal from which the second human voice sub-signal is separated;

Respectively performing audio rendering processing on the second left channel signal, the second right channel signal and the first human voice signal based on first azimuth information to obtain a third left channel signal, a third right channel signal and a second human voice signal, wherein the first azimuth information comprises azimuth information of a user head relative to the loudspeaker;

performing content adaptive frequency division processing on the third left channel signal and the third right channel signal respectively to divide the third left channel signal into a first left channel sub-signal of a first frequency band, a second left channel sub-signal of a second frequency band and a third left channel sub-signal of a third frequency band, and divide the third right channel signal into a first right channel sub-signal of the first frequency band, a second right channel sub-signal of the second frequency band and a third right channel sub-signal of the third frequency band, wherein the minimum value of the first frequency band is greater than or equal to the maximum value of the second frequency band, and the minimum value of the second frequency band is greater than or equal to the maximum value of the third frequency band;

performing adaptive crosstalk cancellation processing on the second right channel sub-signal, adding the second right channel sub-signal to generate a fourth left channel sub-signal, performing adaptive crosstalk cancellation processing on the second left channel sub-signal, adding the second left channel sub-signal to the second right channel sub-signal to generate a fourth right channel sub-signal, adding the first left channel sub-signal and the third left channel sub-signal to generate a first added signal after respectively performing delay processing, and, the first right channel sub-signal and the third right channel sub-signal are respectively subjected to delay processing and then are overlapped to generate a second overlapped signal, the first right channel sub-signal and the third right channel sub-signal are respectively subjected to delay processing and then are respectively subjected to gain processing and then are overlapped to generate a first target overlapped signal, and the first left channel sub-signal and the third left channel sub-signal are respectively subjected to delay processing and then are respectively subjected to gain processing and then are overlapped to generate a second target overlapped signal;

Superposing the fourth left channel sub-signal, the first superposition signal and the first target superposition signal to generate a fourth left channel signal, and superposing the second superposition signal and the second target superposition signal to generate a fourth right channel signal;

performing down-mixing processing on the fourth left channel signal, the fourth right channel signal and the second human voice signal to obtain a fifth left channel signal and a fifth right channel signal;

a first output audio signal generated from the fifth left channel signal and the fifth right channel signal is output.

2. The audio processing method according to claim 1, wherein before the downmixing the fourth left channel signal, the fourth right channel signal, and the second human voice signal to obtain a fifth left channel signal and a fifth right channel signal, further comprising:

and carrying out parameter equalization processing on the fourth left channel signal, the fourth right channel signal and the second human voice signal to obtain an equalized fourth left channel signal, an equalized fourth right channel signal and an equalized second human voice signal.

3. The audio processing method according to claim 2, wherein the performing the downmix processing on the fourth left channel signal, the fourth right channel signal, and the second human voice signal to obtain a fifth left channel signal and a fifth right channel signal includes:

and performing down-mixing processing on the equalized fourth left channel signal, the equalized fourth right channel signal and the equalized second human voice signal to obtain a fifth left channel signal and a fifth right channel signal.

4. The audio processing method according to claim 1, wherein the performing the human voice separation on the first left channel signal and the first right channel signal to obtain a first human voice signal, a second left channel signal, and a second right channel signal includes:

and separating the first left channel signal and the first right channel signal based on a principal component analysis method or based on a neural network to obtain the first human sound signal, the second left channel signal and the second right channel signal.

5. The audio processing method according to claim 1, wherein the audio rendering process is performed on the second left channel signal, the second right channel signal, and the first human voice signal based on the first direction information, respectively, to obtain a third left channel signal, a third right channel signal, and a second human voice signal, and the audio processing method comprises:

Performing segmented convolution on the first human voice signal and the first height filter based on an overlap-save method to obtain the second human voice signal, performing segmented convolution on the second left channel signal and the second height filter based on the overlap-save method to obtain the third left channel signal, and performing segmented convolution on the second right channel signal and the third height filter based on the overlap-save method to obtain the third right channel signal;

the first height filter is a filter acquired according to the head height of a user and a first angle of the first human voice signal relative to the electronic equipment, the second height filter is a filter acquired according to the head height of the user and a second angle of the second left channel signal relative to the electronic equipment, and the third height filter is a filter acquired according to the head height of the user and a third angle of the second right channel signal relative to the electronic equipment.

6. The audio processing method according to claim 5, wherein the first angle includes a horizontal angle and a pitch angle corresponding to a first head related transform function HRTF corresponding to a first vocal signal;

The obtaining mode of the first height filter comprises the following steps:

acquiring first orientation information, wherein the first orientation information comprises a first distance between the center of the head of the user and the center of the electronic equipment, the head height of the user, the head size of the user and the head rotation angle of the user, and the left ear orientation of the user and the right ear orientation of the user can be determined according to the head size of the user;

judging whether the first distance is equal to a measurement radius of a preset HRTF or not;

if the first distance is larger or smaller than the measurement radius of the preset HRTF, correcting the preset HRTF according to the first distance, the left ear azimuth of the user and the right ear azimuth of the user to obtain the first HRTF;

acquiring a head height of the user and a Head Related Impulse Response (HRIR) data set corresponding to the first angle, wherein the HRIR data set comprises at least one HRIR of a measurement user;

carrying out frequency domain change on the HRIR data set to obtain an HRTF data set;

acquiring a time-frequency domain average signal of the HRTF data set;

performing frequency multiplication smoothing or envelope extraction on the time-frequency domain average signal to obtain an initial height filter;

And performing inter-ear time difference (ITD) adjustment on the discretized form of the initial height filter to obtain the first height filter.

7. The audio processing method of claim 6 wherein the first HRTF comprises a first left-ear HRTF and a first right-ear HRTF;

and if the first distance is greater than or less than the measurement radius of the preset HRTF, correcting the HRTF according to the first distance, the left ear position of the user, and the right ear position of the user, to obtain the first HRTF, including:

if the first distance is larger than the measurement radius of the preset HRTF, a first intersection point of a connecting line of the center of the electronic equipment and the left ear azimuth on a measurement area of the HRTF is obtained, and a second intersection point of a connecting line of the center of the electronic equipment and the right ear azimuth on the measurement area of the HRTF is obtained, wherein the first intersection point is used for determining a first horizontal azimuth angle and a first pitch angle, and the second intersection point is used for determining a second horizontal azimuth angle and a second pitch angle;

and determining the first left ear HRTF according to the first horizontal azimuth angle and the first pitch angle, and determining the first right ear HRTF according to the second horizontal azimuth angle and the second pitch angle.

8. The audio processing method of claim 6 wherein the first HRTF comprises a first left-ear HRTF and a first right-ear HRTF;

if the first distance is smaller than the measurement radius of the preset HRTF, a first intersection point of a connecting line of the center of the electronic equipment and the left ear azimuth on a measurement area of the HRTF is obtained, and a second intersection point of a connecting line of the center of the electronic equipment and the right ear azimuth on the measurement area of the HRTF is obtained, wherein the first intersection point is used for determining a first horizontal azimuth angle and a first pitch angle, and the second intersection point is used for determining a second horizontal azimuth angle and a second pitch angle;

and determining the first right ear HRTF according to the first horizontal azimuth angle and the first pitch angle, and determining the first left ear HRTF according to the second horizontal azimuth angle and the second pitch angle.

9. The audio processing method according to claim 1, wherein performing content adaptive crossover processing on the third left channel signal to divide the third left channel signal into a first left channel sub-signal of a first frequency band, a second left channel sub-signal of a second frequency band, and a third left channel sub-signal of a third frequency band, comprises:

Classifying the third left channel signal through a neural audio multi-classification network to obtain at least one audio object, wherein each audio object corresponds to a sound probability which is used for determining the classification accuracy of the audio object;

comparing each sound probability with a preset sound probability threshold;

if at least two sound probabilities are larger than the sound probability threshold, acquiring a first target audio object corresponding to two sound probabilities with the sound probability value being front, or acquiring a second target audio object corresponding to the sound probability if one sound probability is larger than the sound probability threshold;

determining a first frequency division point and a second frequency division point based on the first target audio object or the second target audio object, wherein the frequency value of the first frequency division point is larger than that of the second frequency division point;

determining a frequency range larger than the first frequency division point as the first frequency band, determining a frequency range smaller than or equal to the first frequency division point and larger than the second frequency division point as the second frequency band, and determining a frequency range smaller than or equal to the second frequency division point as the third frequency band;

And determining the sound signal with the frequency in the first frequency band in the third left channel signal as the first left channel sub-signal, the sound signal with the frequency in the second frequency band as the second left channel sub-signal, and the sound signal with the frequency in the third frequency band as the third left channel sub-signal.

10. The audio processing method of claim 9, wherein determining the first crossover point and the second crossover point based on the first target audio object comprises:

judging whether the first target audio object contains a preset first concerned object or not;

if the first target audio object contains the first attention object, determining the maximum value of the frequency range of the first attention object as the first frequency division point, and determining the minimum value of the frequency range of the first attention object as the second frequency division point;

if the first target audio object does not contain the first concerned object, acquiring a first frequency range of a first target audio object, and acquiring a second frequency range of a second first target audio object;

if the first frequency range includes the second frequency range, determining a maximum value of the first frequency range as the first frequency division point, and determining a minimum value of the first frequency range as the second frequency division point;

If the second frequency range includes the first frequency range, determining a maximum value of the second frequency range as the first frequency division point, and determining a minimum value of the second frequency range as the second frequency division point;

if the first frequency range and the second frequency range have a first intersection, determining a maximum value of the first intersection as the first frequency division point, and determining a minimum value of the first intersection as the second frequency division point;

if the first frequency range is not intersected with the second frequency range, and the minimum value of the first frequency range is larger than the maximum value of the second frequency range, determining the minimum value of the first frequency range as the first frequency division point, and determining the maximum value of the second frequency range as the second frequency division point;

if the first frequency range is not intersected with the second frequency range, and the minimum value of the second frequency range is larger than the maximum value of the first frequency range, determining the minimum value of the second frequency range as the first frequency division point, and determining the maximum value of the first frequency range as the second frequency division point.

11. The audio processing method of claim 10, wherein determining the first crossover point and the second crossover point based on the second target audio object comprises:

the maximum value of the frequency range of the second target audio object is determined as the first frequency division point, and the minimum value of the frequency range of the second target audio object is determined as the second frequency division point.

12. The audio processing method according to claim 1, wherein performing adaptive crosstalk cancellation processing on the second right channel sub-signal, overlapping the second left channel sub-signal, generating a fourth left channel sub-signal, and performing adaptive crosstalk cancellation processing on the second left channel sub-signal, overlapping the second right channel sub-signal, generating a fourth right channel sub-signal, comprises:

and sequentially performing inversion processing, attenuation processing, delay processing and gain processing on the second right channel sub-signal, overlapping the second right channel sub-signal to generate the fourth left channel sub-signal, and sequentially performing inversion processing, attenuation processing, delay processing and gain processing on the second left channel sub-signal to overlap the second left channel sub-signal to generate the fourth right channel sub-signal.

13. The audio processing method according to claim 1, wherein before the audio rendering processing is performed on the second left channel signal, the second right channel signal, and the first human voice signal based on the first direction information, respectively, obtaining a third left channel signal, a third right channel signal, and a second human voice signal, further comprises:

and acquiring the first orientation information based on the image acquisition device of the electronic equipment, or acquiring the first orientation information based on user-defined content filled by a user.

14. The audio processing method according to claim 13, wherein the acquiring the first orientation information by the image capturing apparatus based on the electronic device includes:

the electronic equipment displays a first interface, wherein the first interface comprises a first control, and the first control is used for starting or closing a personalized setting function in a play audio mode;

responding to the opening operation of the user on the first control, displaying a second interface on the electronic equipment, and displaying a second control on the second interface, wherein the second control is used for opening or closing a head tracking function in a playback audio mode;

And responding to the opening operation of the user on the second control, the electronic equipment starts the image acquisition device to acquire the first orientation information.

15. The audio processing method according to claim 14, wherein the acquiring the first orientation information based on the user-filled custom content includes:

in response to a closing operation of the second control by the user, the electronic device displays at least one filling item in an editable state, wherein the filling item is used for filling the first orientation information.

16. An audio processing method, characterized by being applied to an electronic device provided with at least one speaker, comprising:

acquiring a second audio signal, wherein the second audio signal comprises a sixth left channel signal, a sixth right channel signal, a first left surround sound signal, a first right surround sound signal and a first center voice signal;

respectively performing audio rendering processing on the sixth left channel signal, the sixth right channel signal, the first left surround sound signal, the first right surround sound signal and the first center-positioned human voice signal based on second azimuth information to obtain a seventh left channel signal, a seventh right channel signal, a second left surround sound signal, a second right surround sound signal and a second center-positioned human voice signal, wherein the second azimuth information comprises azimuth information of a user head relative to the loudspeaker;

Performing content adaptive frequency division processing on the seventh left channel signal, the seventh right channel signal, the second left surround sound signal, and the second right surround sound signal, respectively, to divide the seventh left channel signal into a fifth left channel sub-signal of a first frequency band, a sixth left channel sub-signal of a second frequency band, and a seventh left channel sub-signal of a third frequency band, and divide the seventh right channel signal into a fifth right channel sub-signal of the first frequency band, a sixth right channel sub-signal of the second frequency band, and a seventh right channel sub-signal of the third frequency band, and divide the second left surround sound signal into a first left surround sound sub-signal of the first frequency band, a second left surround sound sub-signal of the second frequency band, and a third left surround sound sub-signal of the third frequency band, and divide the second right surround sound signal into a first right surround sound sub-signal of the first frequency band, a second right surround sound sub-signal of the second frequency band, and a third surround sound sub-signal of the third frequency band, wherein the maximum value of the second left surround sound signal is equal to or greater than the maximum value of the first frequency band, and the maximum value of the second frequency band is equal to or greater than the maximum value of the third frequency band;

Performing adaptive crosstalk cancellation processing on the sixth right channel sub-signal, adding the sixth right channel sub-signal to the sixth left channel sub-signal to generate an eighth left channel sub-signal, performing adaptive crosstalk cancellation processing on the sixth left channel sub-signal, adding the sixth left channel sub-signal to the sixth right channel sub-signal to generate an eighth right channel sub-signal, respectively performing delay processing on the fifth left channel sub-signal and the seventh left channel sub-signal, adding the fifth left channel sub-signal and the seventh left channel sub-signal to generate a third added signal, respectively performing delay processing on the fifth right channel sub-signal and the seventh right channel sub-signal, respectively performing gain processing, adding the fifth right channel sub-signal and the seventh right channel sub-signal to generate a third target added signal, respectively, the fifth left channel sub-signal and the seventh left channel sub-signal are respectively subjected to delay processing and gain processing, and then are overlapped to generate a fourth target overlapped signal, the second right surround sound sub-signal is subjected to adaptive crosstalk cancellation processing and is overlapped to the second left surround sound sub-signal to generate a fourth left surround sound sub-signal, the second left surround sound sub-signal is subjected to adaptive crosstalk cancellation processing and is overlapped to the second right surround sound sub-signal to generate a fourth right surround sound sub-signal, and the first left surround sound sub-signal and the third left surround sound sub-signal are respectively subjected to delay processing and then are overlapped to generate a first surround sound overlapped signal, and the first right surround sound sub-signal and the third right surround sound sub-signal are respectively subjected to delay processing and then are overlapped to generate a second surround sound overlapped signal, the first right surround sound signal and the third right surround sound signal are respectively subjected to delay processing and gain processing, and then are overlapped to generate a first target surround sound overlapped signal;

Superposing the eighth left channel sub-signal, the third superposition signal and the third target superposition signal to generate an eighth left channel signal, and superposing the eighth right channel sub-signal, the fourth superposition signal and the fourth target superposition signal to generate an eighth right channel signal, and superposing the fourth left surround sound sub-signal, the first surround sound superposition signal and the first target surround sound superposition signal to generate a fourth left surround sound signal, and superposing the fourth right surround sound sub-signal, the second surround sound superposition signal and the second target surround sound superposition signal to generate a fourth right surround sound signal;

performing down-mixing processing on the eighth left channel signal, the eighth right channel signal, the fourth left surround sound signal, the fourth right surround sound signal and the second center-set human voice signal to obtain a ninth left channel signal and a ninth right channel signal;

and outputting a second output audio signal generated according to the ninth left channel signal and the ninth right channel signal.

17. The audio processing method of claim 16, further comprising, prior to the acquiring the second audio signal:

Acquiring a third audio signal, wherein the third audio signal comprises a first left front surround sound signal, a first right front surround sound signal, a third center human sound signal, a first left surround sound signal, a first right surround sound signal, a first left rear surround sound signal and a first right rear surround sound signal;

the acquiring the second audio signal includes:

and performing down-mixing processing on the first left front surround sound signal, the first right front surround sound signal, the third middle-set human voice signal, the first left surround sound signal, the first right surround sound signal, the first left rear surround sound signal and the first right rear surround sound signal to obtain a sixth left channel signal, a sixth right channel signal, the first left surround sound signal, the first right surround sound signal and the first middle-set human voice signal.

18. The audio processing method of claim 16, further comprising, prior to the acquiring the second audio signal:

acquiring a fourth audio signal, wherein the fourth audio signal comprises a second left front surround sound signal, a second right front surround sound signal, a fourth center human sound signal, a second left side surround sound signal, a second right side surround sound signal, a second left rear surround sound signal, a second right rear surround sound signal, a first front longitudinal sound signal and a second front longitudinal sound signal;

The acquiring the second audio signal includes:

and performing down-mixing processing on the second left front surround sound signal, the second right front surround sound signal, the fourth middle voice signal, the second left side surround sound signal, the second right side surround sound signal, the second left rear surround sound signal, the second right rear surround sound signal, the first front longitudinal sound signal and the second front longitudinal sound signal to obtain a sixth left channel signal, a sixth right channel signal, the first left surround sound signal, the first right surround sound signal and the first middle voice signal.

19. An audio processing apparatus, comprising:

the audio input module is used for acquiring a first audio signal, and the first audio signal comprises a first left channel signal and a first right channel signal;

the voice extraction module is used for performing voice separation on the first left channel signal and the first right channel signal to obtain a first voice signal, a second left channel signal and a second right channel signal, the first voice signal comprises a first voice sub-signal separated from the first left channel signal and a second voice sub-signal separated from the first right channel signal, the second left channel signal is a channel signal after the first voice sub-signal is separated from the first left channel signal, and the second right channel signal is a channel signal after the first right channel signal is separated from the second voice sub-signal;

The spatial audio rendering module is used for respectively performing audio rendering processing on the second left channel signal, the second right channel signal and the first human voice signal based on first orientation information to obtain a third left channel signal, a third right channel signal and a second human voice signal, wherein the first orientation information comprises orientation information of a user head relative to a loudspeaker;

the content adaptive frequency division module is configured to perform content adaptive frequency division processing on the third left channel signal and the third right channel signal, so as to divide the third left channel signal into a first left channel sub-signal in a first frequency band, a second left channel sub-signal in a second frequency band, and a third left channel sub-signal in a third frequency band, and divide the third right channel signal into a first right channel sub-signal in the first frequency band, a second right channel sub-signal in the second frequency band, and a third right channel sub-signal in the third frequency band, where a minimum value of the first frequency band is greater than or equal to a maximum value of the second frequency band, and a minimum value of the second frequency band is greater than or equal to a maximum value of the third frequency band;

The adaptive crosstalk cancellation module is configured to perform adaptive crosstalk cancellation processing on the second right channel sub-signal, and superimpose the second right channel sub-signal on the second left channel sub-signal to generate a fourth left channel sub-signal, and perform adaptive crosstalk cancellation processing on the second left channel sub-signal, and superimpose the second left channel sub-signal on the second right channel sub-signal to generate a fourth right channel sub-signal, and superimpose the first left channel sub-signal and the third left channel sub-signal on the second left channel sub-signal to generate a first superimposed signal, and superimpose the first right channel sub-signal and the third right channel sub-signal on the first right channel sub-signal to generate a first target superimposed signal, and superimpose the first left channel sub-signal and the third left channel sub-signal on the first right channel sub-signal to generate a second target superimposed signal; superposing the fourth left channel sub-signal, the first superposition signal and the first target superposition signal to generate a fourth left channel signal, and superposing the second superposition signal and the second target superposition signal to generate a fourth right channel signal;

The downmix module is used for carrying out downmix processing on the fourth left channel signal, the fourth right channel signal and the second human voice signal to obtain a fifth left channel signal and a fifth right channel signal;

and the audio output module is used for outputting a first output audio signal generated according to the fifth left channel signal and the fifth right channel signal.

20. The audio processing apparatus of claim 19, further comprising:

and the parameter equalization module is used for carrying out parameter equalization processing on the fourth left channel signal, the fourth right channel signal and the second human voice signal to obtain an equalized fourth left channel signal, an equalized fourth right channel signal and an equalized second human voice signal.

21. An electronic device, comprising: a processor and a memory; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the audio processing method of any one of claims 1-15 or any one of claims 16-18.