CN117202083A - Earphone stereo audio processing method and earphone - Google Patents

Earphone stereo audio processing method and earphone

Info

Publication number: CN117202083A
Application number: CN202311205936.3A
Authority: CN
Original language: Chinese (zh)
Inventors: 宋明辉, 王红丽, 韦莎丽
Applicant/Assignee: Shenzhen Zhongke Lanxun Technology Co., Ltd.
Legal status: Pending

Landscapes

  • Stereophonic System (AREA)

Abstract

The embodiments of the present application relate to the technical field of audio processing, and in particular to a stereo audio processing method for an earphone, and an earphone. A target stereo audio signal to be processed is obtained from a first channel of the earphone, and decorrelation processing is performed on the target stereo audio signal to obtain an ambient audio signal; the target stereo audio signal to be processed and the ambient audio signal are then mixed to obtain a mixed audio signal of the first channel. The spatial impression of the stereo image can be effectively enhanced without degrading the sound quality of the original signal.

Description

Earphone stereo audio processing method and earphone
Technical Field
The embodiments of the present application relate to the technical field of audio processing, and in particular to a stereo audio processing method for an earphone, and an earphone.
Background
Sound reproduction has long been accomplished with simple loudspeakers. Early loudspeakers were relatively crude, and the sound they reproduced was far from realistic; a listener could not feel present at the original performance. With the advent of devices capable of electronic recording and playback, the pursuit of faithful sound reproduction rose to a new level.
Headphones are favored for their portability, low price and good privacy, but stereo content reproduced over headphones suffers from "in-head localization": compared with reproduction over multiple loudspeakers, the sound field is narrower and the spatial impression weaker, so the listener cannot experience the feeling of "being there".
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide a stereo audio processing method for an earphone, and an earphone, which give the stereo content presented by the earphone a wider sound field and a more realistic sense of space without loss of sound quality.
In a first aspect, an embodiment of the present application provides a method for processing stereo audio of an earphone, including:
obtaining a target stereo audio signal to be processed from a first channel of the earphone, and performing decorrelation processing on the target stereo audio signal to obtain an environmental audio signal;
and mixing the target stereo audio signal to be processed and the environment audio signal to obtain a mixed audio signal of the first channel.
In some embodiments, the method further comprises:
and mixing the mixed audio signal with head-related pulse signals of at least two directions to obtain an output audio signal of the first channel.
In some embodiments, the decorrelating the target stereo audio signal to obtain an ambient audio signal includes:
obtaining a reference audio signal from a second channel of the headset;
filtering the reference audio signal to obtain a first reference audio signal;
the first reference audio signal is subtracted from the target stereo audio signal to obtain the ambient audio signal.
In some embodiments, the filtering the reference audio signal to obtain a first reference audio signal includes:
the reference audio signal is filtered by the following formula:
S(n)=F(n)*W(n-1);
wherein S represents the first reference audio signal, F represents the reference audio signal, W represents a filter coefficient, and n represents a frame number.
In some embodiments, the method further comprises:
updating the filter coefficients by the following formula:
E(n)=D(n)–S(n),
T(n)=abs(F(n)),
Pe=a*P(n)*T(n)+abs(E(n))^2,
Mu(n)=P(n)/(Pe+eps),
P(n+1)=(1–a*Mu*T(n))*P(n)+delta*abs(W(n-1)),
G=Mu(n)*E(n),
PP=G*conj(F(n)),
W(n)=W(n-1)+PP;
wherein E represents the ambient audio signal, D represents the target stereo audio signal, T represents the absolute value of the reference audio signal, abs denotes taking the absolute value, Pe represents the smoothed error energy power spectrum, a represents the error control factor, P represents the error covariance, Mu represents the step-size factor, eps represents the division protection factor, delta represents the filter energy smoothing factor, W represents the filter coefficient, G represents the Kalman gain factor, PP represents the filter update value, and conj denotes the complex conjugate.
In some embodiments, the mixing the target stereo audio signal to be processed and the ambient audio signal to obtain the mixed audio signal of the first channel includes:
mixing the target stereo audio signal to be processed and the ambient audio signal using the following formula:
X(n)=b*D(n)+(1-a)*E(n);
where X represents the mixed audio signal of the first channel, b represents the mixing factor, and (1-a) represents the error acceptance probability factor.
In some embodiments, the mixing the mixed audio signal with the head related pulse signals of at least two directions to obtain the output audio signal of the first channel includes:
dividing the mixed audio signal into a plurality of first sequence partitions;
dividing the head related pulse signals of the at least two directions into a plurality of second sequence blocks respectively;
respectively performing dot product processing on the plurality of first sequence blocks of the mixed audio signal and the plurality of second sequence blocks of the head related pulse signal of each azimuth to obtain second mixed audio signals of a plurality of azimuths;
and mixing the second mixed audio signals of the plurality of azimuths to obtain the output audio signal of the first channel.
In some embodiments, the dividing the mixed audio signal into a plurality of first sequence partitions comprises:
dividing the mixed audio signal into equal-length sequence blocks with the frame length of B, and then supplementing B zeros after each frame block;
dividing the head related pulse signal for each azimuth into a plurality of second sequence blocks, comprising:
dividing the head related pulse signal into equal-length sequence blocks with the frame length of B, and then supplementing B zeros after each frame block.
In some embodiments, the plurality of first sequence partitions and the plurality of second sequence partitions are time domain signals, the dot product processing is performed on the plurality of first sequence partitions of the mixed audio signal and the plurality of second sequence partitions of the head related pulse signal of each azimuth, to obtain a second mixed audio signal of the azimuth, including:
performing a Fourier transform on the first sequence blocks and the second sequence blocks to obtain frequency domain signals of the first sequence blocks and the second sequence blocks;
performing dot product processing on the frequency domain signals of the first sequence blocks and the second sequence blocks to obtain a frequency domain signal of the second mixed audio signal;
performing an inverse Fourier transform on the frequency domain signal of the second mixed audio signal to obtain a time domain signal of the second mixed audio signal;
for each frame of the time domain signal of the second mixed audio signal, adding the values of the first B points of that frame to the values of the last B points of the previous frame to update the time domain signal of that frame.
In some embodiments, the mixing the second mixed audio signals of the plurality of directions to obtain the output audio signal of the first channel includes:
mixing the second mixed audio signals of the plurality of orientations by the following formula:
y_out = α*y_30° + α*y_(-30°) + β*y_120° + β*y_(-120°) + γ*y_270°;
wherein y_out represents the output audio signal of the first channel; y_30°, y_(-30°), y_120°, y_(-120°) and y_270° represent the second mixed audio signals of the 30°, -30°, 120°, -120° and 270° channel orientations; and α, β, γ represent mixing factors.
In a second aspect, an embodiment of the present application provides an earphone, including:
at least one processor, and
a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the method as described above in the first aspect.
In a third aspect, embodiments of the present application also provide a non-transitory computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions that, when executed by at least one processor, cause the at least one processor to perform the method as described in the first aspect above.
The embodiments of the present application have the following beneficial effect: unlike the prior art, the earphone stereo audio processing method provided by the embodiments of the present application obtains an ambient audio signal by performing decorrelation processing on the signal to be processed, and then mixes the ambient audio signal with the original sound.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
Fig. 1 is a flowchart of a method for processing a stereo audio signal of an earphone according to an embodiment of the application;
fig. 2 is a flowchart of a method for processing a stereo audio signal of an earphone according to another embodiment of the present application;
fig. 3 is a flowchart of a method for processing a stereo audio signal of an earphone according to another embodiment of the present application;
fig. 4 is a flowchart of a method for processing a stereo audio signal of an earphone according to another embodiment of the present application;
fig. 5 is a block diagram of an earphone according to an embodiment of the present application.
Detailed Description
The present application will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit the application in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present application.
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that, if not in conflict, the features of the embodiments of the present application may be combined with each other, which is within the protection scope of the present application. In addition, while functional block division is performed in a device diagram and logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. Moreover, the words "first," "second," "third," and the like as used herein do not limit the data and order of execution, but merely distinguish between identical or similar items that have substantially the same function and effect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.
In addition, the technical features of the embodiments of the present application described below may be combined with each other as long as they do not collide with each other.
Fig. 1 shows a flow diagram of one embodiment of a headphone stereo audio processing method including, but not limited to, the steps of:
s101: a target stereo audio signal to be processed is obtained from a first channel of the headset.
In this embodiment, a time domain signal of a target stereo audio signal to be processed is acquired from a first channel of an earphone.
Wherein the first channel may be a left channel or a right channel.
S102: and performing decorrelation processing on the target stereo audio signal to obtain an environment audio signal.
Specifically, in some embodiments, a Fourier transform is performed on the target stereo audio signal to convert the time domain signal into a frequency domain signal, and decorrelation processing is performed on the frequency domain signal to obtain the frequency domain signal of the ambient audio signal.
The decorrelation technology can extract the environmental sound in the stereophonic sound, and the environmental sound represents the width and depth of the sound field of the stereophonic sound.
S103: and mixing the target stereo audio signal to be processed and the environment audio signal to obtain a mixed audio signal of the first channel.
In this embodiment, the frequency domain signal of the target stereo audio signal and the frequency domain signal of the environmental audio signal are mixed to obtain the frequency domain signal of the mixed audio signal of the first channel, so that the sound field depth of the audio signal can be greatly improved.
In this embodiment, the environmental sound is obtained through decorrelation processing, and the environmental sound and the original sound are mixed to improve the sound field depth of the original sound, so that the sound field is wider, the space sense is stronger, and the experience is better when the user listens to the audio by using the earphone.
In some embodiments, the decorrelation may be performed using a Kalman filtering algorithm, a normalized least mean square (NLMS) algorithm, or a recursive least squares (RLS) algorithm.
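As an illustration of the NLMS alternative mentioned above, a single frequency-domain update might look like the following Python/NumPy sketch. This is not the application's implementation; the step size `mu` and regularizer `eps` are illustrative values.

```python
import numpy as np

def nlms_step(d, f, w, mu=0.5, eps=1e-9):
    """One normalized-LMS update per frequency bin (a generic sketch).

    d: target-channel spectrum, f: reference-channel spectrum,
    w: current filter coefficients. Returns the decorrelation error
    (the ambient component) and the updated coefficients."""
    e = d - f * w                                        # error = target minus filtered reference
    w = w + mu * e * np.conj(f) / (np.abs(f) ** 2 + eps)  # normalized gradient step
    return e, w
```

Iterating this step on correlated channels drives the error (the shared component) to zero, leaving only the decorrelated residue.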
In some embodiments, referring to fig. 2, the target stereo audio signal is subjected to decorrelation processing to obtain an ambient audio signal, including but not limited to the following steps:
s201: a reference audio signal is obtained from a second channel of the headset.
In this embodiment, the time domain signal of the reference audio signal is acquired from the second channel of the earphone.
When the first channel is a left channel, the second channel is a right channel; when the first channel is a right channel, the second channel is a left channel;
s202: and filtering the reference audio signal to obtain a first reference audio signal.
In this embodiment, a Fourier transform is performed on the reference audio signal to convert the time domain signal into a frequency domain signal, and filtering is performed on the frequency domain signal to obtain the frequency domain signal of the first reference audio signal.
S203: the first reference audio signal is subtracted from the target stereo audio signal to obtain an ambient audio signal.
In this embodiment, the frequency domain signal of the environmental audio signal is obtained by subtracting the frequency domain signal of the first reference audio signal from the frequency domain signal of the target stereo audio signal. The process can remove the part of the first reference audio signal existing in the target stereo audio signal, thereby achieving the decorrelation effect and obtaining the environment audio signal.
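Steps S201 to S203 can be sketched per frame as follows. This is a minimal Python/NumPy illustration only: windowing and filter adaptation are omitted, and the frame length and variable names are chosen here for readability.

```python
import numpy as np

def decorrelate_frame(target_td, reference_td, w):
    """One frame of decorrelation: E = D - F * W(n-1).

    target_td, reference_td: time-domain frames from the first and
    second channel; w: current frequency-domain filter coefficients.
    Returns the frequency-domain ambient signal."""
    d = np.fft.rfft(target_td)      # D: first-channel (target) spectrum
    f = np.fft.rfft(reference_td)   # F: second-channel (reference) spectrum
    s = f * w                       # S = F * W: filtered reference
    return d - s                    # ambient (decorrelated) component
```

When both channels carry the same signal and the filter has converged, the result is near zero, which is exactly the "remove the part of the reference present in the target" behavior described above.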
In some embodiments, referring to fig. 3, the headphone stereo audio processing method further includes:
s104: and mixing the mixed audio signal with at least two head related pulse signals in at least two directions to obtain an output audio signal of the first channel.
In this embodiment, the mixed audio signal mixed with the environmental sound is mixed with the head-related pulse signals of a plurality of directions to be simulated, so as to obtain the output audio signal of the first channel.
In this embodiment, a head-related transfer function (HRTF) processing technique is used: the HRTF corresponding to each azimuth to be simulated is convolved with the stereo signal, so that a sound source at that azimuth can be simulated. For example, the sound field of a theatre's 5.1 or 7.1 channel layout can be simulated.
A head-related transfer function (HRTF) describes the transfer of sound waves from a sound source to the two ears. It is the result of the combined filtering of sound waves by physiological structures such as the head, pinna and torso. In practical applications, various spatial hearing effects can be virtualized by replaying the HRTF-processed signal over headphones or loudspeakers.
In some embodiments, the reference audio signal is filtered to obtain a first reference audio signal, including but not limited to the following steps:
the reference audio signal is filtered by the following formula:
S(n)=F(n)*W(n-1);
wherein S represents the first reference audio signal, F represents the reference audio signal, W represents a filter coefficient, and n represents a frame number.
In this embodiment, the filtering operation is performed by multiplying the reference audio signal of the current frame by the filter coefficient of the previous frame to obtain the first reference audio signal.
In some embodiments, the filter coefficients are continually modified and updated, including, but not limited to, the steps of:
updating the filter coefficients by the following formula:
E(n)=D(n)–S(n),
T(n)=abs(F(n)),
Pe=a*P(n)*T(n)+abs(E(n))^2,
Mu(n)=P(n)/(Pe+eps),
P(n+1)=(1–a*Mu*T(n))*P(n)+delta*abs(W(n-1)),
G=Mu(n)*E(n),
PP=G*conj(F(n)),
W(n)=W(n-1)+PP;
wherein E represents the ambient audio signal, D represents the target stereo audio signal, T represents the absolute value of the reference audio signal, abs denotes taking the absolute value, Pe represents the smoothed error energy power spectrum, a represents the error control factor, P represents the error covariance, Mu represents the step-size factor, eps represents the division protection factor, delta represents the filter energy smoothing factor, W represents the filter coefficient, G represents the Kalman gain factor, PP represents the filter update value, and conj denotes the complex conjugate.
Here, the error control factor represents the error between the measurement and the true value and may take a value between 0 and 1, for example a=0.5. The error covariance is updated from the error covariance of the previous frame; at the first frame (n=1), P(n)=0. The division protection factor prevents division by zero and may be, for example, eps=0.000000001. The filter energy smoothing factor may take a value between 0 and 1. Because the step-size factor is variable, the Kalman filter is more robust and the decorrelation effect is stronger.
In this embodiment, by continuously modifying and updating the filter coefficients, the filter is kept in an optimal state, so that a better filtering effect can be achieved, and stability and reliability can be maintained under different conditions.
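The coefficient-update formulas above can be transcribed almost literally into NumPy. The sketch below is an illustration, not the application's implementation; in particular, it assumes a nonzero initial error covariance so that the recursion makes progress, and the values of a, delta and eps are illustrative.

```python
import numpy as np

def kalman_decorrelate_step(d, f, w, p, a=0.5, delta=0.01, eps=1e-9):
    """One per-bin update of the adaptive decorrelation filter.

    d: target-channel spectrum D(n); f: reference-channel spectrum F(n);
    w: filter coefficients W(n-1); p: error covariance P(n)."""
    s = f * w                                           # S(n) = F(n) * W(n-1)
    e = d - s                                           # E(n): ambient (error) signal
    t = np.abs(f)                                       # T(n) = abs(F(n))
    pe = a * p * t + np.abs(e) ** 2                     # Pe: smoothed error energy
    mu = p / (pe + eps)                                 # Mu(n): variable step size
    p_next = (1 - a * mu * t) * p + delta * np.abs(w)   # P(n+1)
    g = mu * e                                          # G = Mu(n) * E(n)
    w_next = w + g * np.conj(f)                         # W(n) = W(n-1) + G * conj(F(n))
    return e, w_next, p_next
```

Run frame by frame, the filter tracks the correlated component of the two channels; the residual e is the ambient signal used in the mixing step.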
In some embodiments, the target stereo audio signal to be processed and the ambient audio signal are mixed to obtain a mixed audio signal of the first channel, including but not limited to the steps of:
the target stereo audio signal to be processed and the ambient audio signal are mixed using the following formula:
X(n)=b*D(n)+(1-a)*E(n);
where X represents the mixed audio signal of the first channel, b represents the mixing factor, and (1-a) represents the error acceptance probability factor.
In this embodiment, the frequency domain signal of the target stereo audio signal to be processed is multiplied by the mixing factor, the frequency domain signal of the ambient audio signal is multiplied by the error acceptance probability factor, and the two products are added to obtain the frequency domain signal of the mixed audio signal of the first channel. An inverse Fourier transform is then performed on this frequency domain signal to obtain the time domain signal.
The value of the mixing factor is obtained by empirical adjustment according to different actual devices, and can be generally between 0 and 1, for example, b=0.5.
In other embodiments, the target stereo audio signal to be processed and the ambient audio signal may be mixed using the following formulas:
X_L(n)=b*D_L(n)+(1-a)*E_L(n)
X_R(n)=b*D_R(n)+(1-a)*E_R(n)
wherein X_L represents the mixed left-channel audio signal, D_L the left-channel target stereo audio signal, E_L the left-channel ambient audio signal, X_R the mixed right-channel audio signal, D_R the right-channel target stereo audio signal, and E_R the right-channel ambient audio signal.
In this embodiment, the frequency domain signal of the left-channel target stereo audio signal to be processed is multiplied by the mixing factor, the frequency domain signal of the left-channel ambient audio signal is multiplied by the error acceptance probability factor, and the two products are added to obtain the frequency domain signal of the mixed left-channel audio signal, on which an inverse Fourier transform is performed to obtain the time domain signal. The right channel is processed in the same way: the frequency domain signal of the right-channel target stereo audio signal is multiplied by the mixing factor, the frequency domain signal of the right-channel ambient audio signal is multiplied by the error acceptance probability factor, the two products are added to obtain the frequency domain signal of the mixed right-channel audio signal, and an inverse Fourier transform yields the time domain signal.
The value of the mixing factor is obtained by empirical adjustment according to different actual devices, and can be generally between 0 and 1, for example, b=0.5.
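The mixing formula reduces to a single weighted sum per frequency bin. A trivial sketch, using the illustrative values b=0.5 and a=0.5 mentioned above:

```python
import numpy as np

def mix_ambient(d, e, b=0.5, a=0.5):
    """X(n) = b*D(n) + (1-a)*E(n): weight the original spectrum by the
    mixing factor b and the ambient spectrum by the error acceptance
    probability factor (1-a), then add them."""
    return b * np.asarray(d) + (1 - a) * np.asarray(e)
```

The same function applies unchanged to the per-channel variants X_L and X_R, called once per channel.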
In some embodiments, referring to fig. 4, the headphone stereo audio processing method further includes:
s301: the mixed audio signal is divided into a plurality of first sequence partitions.
S302: the head related pulse signals of at least two orientations are divided into a plurality of second sequence blocks, respectively.
S303: and respectively performing dot product processing on the first sequence blocks of the mixed audio signal and the second sequence blocks of the head related pulse signal of each azimuth to obtain second mixed audio signals of a plurality of azimuths.
S304: and mixing the second mixed audio signals in a plurality of directions to obtain the output audio signals of the first channel.
In this embodiment, the time domain signal of the mixed audio signal is divided into a plurality of first sequence blocks, and then the time domain signal of the head related pulse signal of the azimuth to be simulated is also divided into second sequence blocks of the same length. And carrying out dot product processing on the first sequence blocks and the second sequence blocks of each azimuth to obtain time domain signals of the second mixed audio signals of a plurality of azimuth. And mixing the second mixed audio signals in a plurality of directions to obtain a time domain signal of the output audio signal of the first channel.
In particular, in some embodiments, the at least two azimuth head related pulse signals may be multi-channel multi-azimuth head related pulse signals, such as cinema 5.1 channel azimuth or 7.1 channel azimuth head related pulse signals.
In other embodiments, the head-related pulse signals of the at least two directions may be those of the theatre 5.1-channel azimuths of 30°, -30°, 120°, -120° and 270°; the second mixed audio signals obtained after the block-wise dot product of these head-related pulse signals with the mixed audio signal are then the second mixed audio signals of the 30°, -30°, 120°, -120° and 270° azimuths.
In this embodiment, the Fourier transform is used to implement the uniformly partitioned block dot product, which greatly reduces the amount of computation compared with direct convolution, whose operational complexity is much higher.
In some embodiments, dividing the mixed audio signal into a plurality of first sequence partitions includes:
dividing the mixed audio signal into equal-length sequence blocks with the frame length of B, and then supplementing B zeros after each frame of blocks;
dividing the head related pulse signal for each azimuth into a plurality of second sequence blocks, comprising:
the head related pulse signal is divided into equal length sequence blocks with the frame length of B, and then B zeros are added after each frame block.
Specifically, in some embodiments, the mixed audio signal and the head-related pulse signals of the at least two directions may be partitioned as described below, where x denotes the mixed audio signal, k its frame index, B the block length, h the head-related pulse signal, l its frame index, and j the number of frames of the head-related pulse signal.
In this embodiment, the fourier transform is used to implement a uniform block convolution, where the time domain signal x (k) of the mixed audio signal is divided into equal length sequences with frame length B, and B zeros are appended after each frame. The time domain signal h (l) of the head related pulse signal is also divided into equal length sequences of length B, the number of frames is j, and B zeros are added after each frame.
In some embodiments, the plurality of first sequence partitions and the plurality of second sequence partitions are time domain signals, the dot product processing is performed on the plurality of first sequence partitions of the mixed audio signal and the plurality of second sequence partitions of the head related pulse signal of each azimuth, to obtain a second mixed audio signal of the azimuth, including:
performing a Fourier transform on the first sequence blocks and the second sequence blocks to obtain frequency domain signals of the first sequence blocks and the second sequence blocks;
performing dot product processing on the frequency domain signals of the first sequence blocks and the second sequence blocks to obtain a frequency domain signal of the second mixed audio signal;
performing an inverse Fourier transform on the frequency domain signal of the second mixed audio signal to obtain a time domain signal of the second mixed audio signal;
for each frame of the time domain signal of the second mixed audio signal, adding the values of the first B points of that frame to the values of the last B points of the previous frame to update the time domain signal of that frame.
Specifically, in some embodiments, the mixed audio signal and the head related pulse signal may be convolved to obtain the mixed audio signal by the following formula:
X(k)=FFT(x(k)), k=0,1,2,…
H(l)=FFT(h(l)), l=0,1,2,…,j
Y(k)=X(k)*H(0)+X(k-1)*H(1)+X(k-2)*H(2)+…+X(k-j)*H(j)
y(k)=IFFT(Y(k))
y'_k = y_k(1:B) + y_(k-1)(B+1:2*B)
wherein x represents the time domain signal of the mixed audio signal and X its frequency domain form, h represents the time domain signal of the head-related pulse signal and H its frequency domain form, FFT denotes the fast Fourier transform, Y represents the frequency domain signal of the second mixed audio signal and y its time domain form, and IFFT denotes the inverse fast Fourier transform.
In this embodiment, the block-processed mixed audio signal is Fourier transformed to obtain X(k), where X(k) represents the frequency domain signal of the k-th frame of the mixed audio signal and has length 2B; similarly, the block-processed head-related pulse signal is Fourier transformed to obtain H(l), where H(l) represents the frequency domain signal of the l-th frame of the head-related pulse signal and also has length 2B. The dot product of X and H in the frequency domain yields Y(k), the frequency domain signal of the second mixed audio signal. An inverse Fourier transform of Y(k) gives y(k), the time domain signal of the second mixed audio signal. An overlap-add-and-discard step is then applied: the first B values of each frame are added to the last B values of the previous frame, and the last B values of the current frame are held over for the next frame, giving the time domain signal y'_k of the second mixed audio signal, of length B.
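The whole partition/FFT/dot-product/overlap-add pipeline can be sketched in a few lines of NumPy. This is an offline illustration under the assumption that the filter partitions are paired in standard convolution order; a real-time earphone implementation would stream input blocks and cache past FFT results rather than transform everything up front.

```python
import numpy as np

def partitioned_convolve(x, h, B):
    """Uniformly partitioned overlap-add convolution (sketch).

    Both signals are cut into blocks of length B, zero-padded to 2B,
    multiplied block-wise in the frequency domain, and recombined by
    adding the first B samples of each frame to the tail of the
    previous frame."""
    n_out = len(x) + len(h) - 1
    nx = -(-len(x) // B)                 # number of input blocks (ceil)
    nh = -(-len(h) // B)                 # number of filter blocks (ceil)
    xp = np.pad(np.asarray(x, float), (0, nx * B - len(x)))
    hp = np.pad(np.asarray(h, float), (0, nh * B - len(h)))
    # FFT each zero-padded block once and cache it
    X = [np.fft.rfft(xp[k * B:(k + 1) * B], 2 * B) for k in range(nx)]
    H = [np.fft.rfft(hp[j * B:(j + 1) * B], 2 * B) for j in range(nh)]
    out = np.zeros((nx + nh) * B)
    prev_tail = np.zeros(B)
    for k in range(nx + nh - 1):
        Y = np.zeros(B + 1, dtype=complex)
        for j in range(nh):              # Y(k) = sum_j X(k-j) * H(j)
            if 0 <= k - j < nx:
                Y += X[k - j] * H[j]
        y = np.fft.irfft(Y)              # length-2B time-domain frame
        out[k * B:(k + 1) * B] = y[:B] + prev_tail   # overlap-add with tail
        prev_tail = y[B:]                # carry last B samples forward
    out[(nx + nh - 1) * B:] = prev_tail  # final tail
    return out[:n_out]
```

The result matches direct time-domain convolution, while each frame costs only FFTs of length 2B plus per-bin multiplications.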
In some embodiments, to further reduce the computational effort, the fourier transform results of the past j frames of the mixed audio signal and the head related pulse signal may be preserved, such that fourier transforms of the past multiple frames are not required each time. Only the Fourier transform result of the current frame is needed to be calculated, the Fourier transform result of the past frame can be directly inquired, a plurality of shift transformation operations are omitted, and the operation amount is greatly reduced.
In some embodiments, mixing the second mixed audio signals of the plurality of directions to obtain the output audio signal of the first channel includes, but is not limited to, the steps of:
mixing the second mixed audio signals of the plurality of orientations by the following formula:
y_out = α*y_30° + α*y_(-30°) + β*y_120° + β*y_(-120°) + γ*y_270°
wherein y_out represents the output audio signal of the first channel; y_30°, y_(-30°), y_120°, y_(-120°) and y_270° represent the second mixed audio signals of the 30°, -30°, 120°, -120° and 270° channel orientations; and α, β, γ represent the mixing factors.
In this embodiment, the second mixed audio signals of the 30°, -30°, 120°, -120°, 270° channel orientations are multiplied by their mixing factors and summed to obtain the output audio signal of the first channel; by controlling the several mixing factors, the mixing ratio is controlled, yielding stereo audio data with a virtual surround feeling.

Here α, β and γ represent the mixing factors; the values giving the best mixing effect are obtained by tuning based on practical experience and are generally no greater than 1.
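As a concrete illustration, the weighted sum above could be written as follows; the default factor values are illustrative placeholders chosen by us, not values from the patent:

```python
import numpy as np

def mix_orientations(y, alpha=0.6, beta=0.4, gamma=0.3):
    """y maps an orientation in degrees to its second mixed audio signal.
    Implements y_OUT = a*(y_30 + y_-30) + b*(y_120 + y_-120) + g*y_270.
    The factor defaults are illustrative, not taken from the patent."""
    return (alpha * (y[30] + y[-30])
            + beta * (y[120] + y[-120])
            + gamma * y[270])

# usage: five equal-length per-orientation signals
signals = {deg: np.zeros(256) for deg in (30, -30, 120, -120, 270)}
out = mix_orientations(signals, alpha=0.5, beta=0.3, gamma=0.2)
```

Keeping the factors no greater than 1, as the text suggests, bounds the output against clipping when the per-orientation signals are already at full scale.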
Specifically, in some embodiments, the second mixed audio signal of the plurality of directions may be subjected to a mixing process by the following formula:
L_OUT = α*L_30° + α*L_-30° + β*L_120° + β*L_-120° + γ*L_270°;

R_OUT = α*R_30° + α*R_-30° + β*R_120° + β*R_-120° + γ*R_270°;

wherein L_OUT represents the left channel output audio signal and L_30°, L_-30°, L_120°, L_-120°, L_270° represent the left channel second mixed audio signals of the 30°, -30°, 120°, -120°, 270° channel orientations; R_OUT represents the right channel output audio signal and R_30°, R_-30°, R_120°, R_-120°, R_270° represent the right channel second mixed audio signals of the same orientations; α, β, γ represent the mixing factors.
In this embodiment, the left channel second mixed audio signals in the 30°, -30°, 120°, -120°, 270° channel orientations are multiplied by their mixing factors and summed to obtain the left channel output audio signal, and the right channel second mixed audio signals in the same orientations are multiplied by their mixing factors and summed to obtain the right channel output audio signal; by controlling the several mixing factors, the mixing ratio is controlled, yielding binaural audio data with a virtual surround feeling.
In summary, in the method for processing stereo audio of headphones according to the embodiment of the present application, a target stereo audio signal to be processed is obtained from a first channel of headphones, and decorrelation processing is performed on the target stereo audio signal to obtain an environmental audio signal; and mixing the target stereo audio signal to be processed and the environment audio signal to obtain a mixed audio signal of the first channel. Because the environment audio signal reflects the width and depth of the sound field of the stereophonic sound, the width and depth of the sound field of the stereophonic sound can be effectively enhanced after the environment audio signal is mixed with the original sound, so that the space sense of the stereophonic sound can be effectively enhanced without losing the tone quality of the original sound.
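The decorrelate-then-mix pipeline summarized above (and spelled out in claims 3 through 6) can be sketched end to end. Here a plain NLMS adaptive filter stands in for the patent's Kalman-style coefficient update of claim 5; all names and parameter values are ours, and the complementary mixing weight is taken as (1-b):

```python
import numpy as np

def extract_ambient_and_mix(d, f, b=0.7, L=64, mu=0.5, eps=1e-8):
    """d: target stereo audio signal (first channel); f: reference audio
    signal (second channel).  An adaptive FIR filter predicts the part of
    d correlated with f; the residual e is the ambient audio signal, which
    is mixed back as x = b*d + (1-b)*e.  NLMS is a simplified stand-in
    for the update in claim 5; b, L, mu, eps are illustrative values."""
    w = np.zeros(L)                 # filter coefficients W
    buf = np.zeros(L)               # most recent reference samples
    e = np.zeros(len(d))
    for n in range(len(d)):
        buf = np.roll(buf, 1)
        buf[0] = f[n]
        s = w @ buf                 # first reference audio signal S(n)
        e[n] = d[n] - s             # ambient audio signal E(n)
        w += mu * e[n] * buf / (buf @ buf + eps)   # NLMS coefficient update
    return b * d + (1 - b) * e, e
```

Because the adaptive filter removes only the component of the target that is predictable from the other channel, the residual e captures the uncorrelated (ambient) content that the mixing step then reinforces.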
Another embodiment of the application provides an earphone. As shown in fig. 5, the earphone 10 comprises at least one processor 11 and a memory 12 in communication with each other (connected via a bus in fig. 5, with one processor taken as an example). Those of ordinary skill in the art will appreciate that the configuration shown in fig. 5 is merely illustrative and does not limit the configuration of the electronic device described above. For example, the earphone may include more or fewer components than shown in fig. 5, or have a different configuration than that shown in fig. 5.
The processor 11 is configured to provide computing and control capabilities to control the earphone 10 to perform corresponding tasks, for example, to control the earphone 10 to perform any one of the earphone stereo audio processing methods provided in the foregoing embodiments of the application.
It is understood that the processor 11 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 12, as a non-transitory computer readable storage medium, is used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the headphone stereo audio processing method in the embodiments of the present application. The processor 11 implements the headphone stereo audio processing method in any of the method embodiments described above by running the non-transitory software programs, instructions, and modules stored in the memory 12. The memory 12 may include high-speed random access memory and may further include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 12 may also include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Another embodiment of the present application also provides a non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by at least one processor, cause the at least one processor to perform the headphone stereo audio processing method according to any one of the above embodiments.
It should be noted that the above-described apparatus embodiments are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or by hardware. Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the program may be stored in a computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application and are not limiting. Within the idea of the application, the technical features of the above embodiments, or of different embodiments, may also be combined, and the steps may be implemented in any order; many other variations of the different aspects of the application as described above exist, which are not provided in detail for the sake of brevity. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical schemes described in the foregoing embodiments may still be modified, or some of their technical features replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (12)

1. A method for headphone stereo audio processing, comprising:
obtaining a target stereo audio signal to be processed from a first channel of the earphone, and performing decorrelation processing on the target stereo audio signal to obtain an environmental audio signal;
and mixing the target stereo audio signal to be processed and the environment audio signal to obtain a mixed audio signal of the first channel.
2. The method according to claim 1, wherein the method further comprises:
and mixing the mixed audio signal with at least two head related pulse signals in directions to obtain an output audio signal of the first channel.
3. The method according to any one of claims 1-2, wherein said decorrelating the target stereo audio signal to obtain an ambient audio signal comprises:
obtaining a reference audio signal from a second channel of the headset;
filtering the reference audio signal to obtain a first reference audio signal;
the first reference audio signal is subtracted from the target stereo audio signal to obtain the ambient audio signal.
4. A method according to claim 3, wherein said filtering the reference audio signal to obtain a first reference audio signal comprises:
the reference audio signal is filtered by the following formula:
S(n)=F(n)*W(n-1);
wherein S represents a first reference audio signal, F represents the reference audio signal, W represents a filter coefficient, and n represents a frame number.
5. The method according to claim 4, wherein the method further comprises:
updating the filter coefficients by the following formula:
E(n)=D(n)-S(n),
T(n)=abs(F(n)),
Pe=a*P(n)*T(n)+abs(E(n))^2,
Mu(n)=P(n)/(Pe+eps),
P(n+1)=(1-a*Mu(n)*T(n))*P(n)+delta*abs(W(n-1)),
G=Mu(n)*E(n),
PP=G*conj(F(n)),
W(n)=W(n-1)+PP;
wherein E represents the environmental audio signal, D represents the target stereo audio signal, T represents the absolute value of the reference audio signal, abs represents taking the absolute value, pe represents the smoothed error energy power spectrum, a represents the error control factor, P represents the error covariance, mu represents the step size factor, eps represents the division protection factor, delta represents the filter energy smoothing factor, G represents the kalman gain factor, PP represents the filter update value, conj represents the conjugate complex number.
6. The method according to any one of claims 1-2, wherein said mixing the target stereo audio signal to be processed with the ambient audio signal to obtain a mixed audio signal of the first channel comprises:
mixing the target stereo audio signal to be processed and the ambient audio signal using the following formula:
X(n)=b*D(n)+(1-b)*E(n);
where X represents the mixed audio signal of the first channel and b represents the mixing factor.
7. The method of claim 2, wherein mixing the mixed audio signal with at least two azimuth head related pulse signals to obtain the output audio signal of the first channel comprises:
dividing the mixed audio signal into a plurality of first sequence partitions;
dividing the head related pulse signals of the at least two directions into a plurality of second sequence blocks respectively;
respectively carrying out dot product processing on a plurality of first sequence blocks of the mixed audio signal and a plurality of second sequence blocks of the head related pulse signal of each azimuth to obtain a second mixed audio signal of a plurality of azimuth;
and mixing the second mixed audio signals in the multiple directions to obtain the output audio signals of the first channel.
8. The method of claim 7, wherein said dividing the mixed audio signal into a plurality of first sequence partitions comprises:
dividing the mixed audio signal into equal-length sequence blocks with the frame length of B, and then supplementing B zeros after each frame block;
dividing the head related pulse signal for each azimuth into a plurality of second sequence blocks, comprising:
dividing the head related pulse signal into equal-length sequence blocks with the frame length of B, and then supplementing B zeros after each frame block.
9. The method of claim 8, wherein the plurality of first sequence partitions and the plurality of second sequence partitions are time domain signals, wherein performing dot product processing on the plurality of first sequence partitions of the mixed audio signal and the plurality of second sequence partitions of the head related pulse signal for each azimuth to obtain a second mixed audio signal for the azimuth comprises:
fourier transforming the first sequence block and the second sequence block to transform into frequency domain signals of the first sequence block and the second sequence block;
performing dot product processing on the frequency domain signals of the first sequence block and the second sequence block to obtain a frequency domain signal of a second mixed audio signal;
performing inverse fourier transform on the frequency domain signal of the second mixed audio signal to obtain a time domain signal of the second mixed audio signal;
for each frame of the time domain signal of the second mixed audio signal, the values of the first B points of the time domain signal of the second mixed audio signal of the frame are added with the values of the last B points of the time domain signal of the second mixed audio signal of the previous frame to update the time domain signal of the second mixed audio signal of the frame.
10. The method of claim 9, wherein mixing the second mixed audio signals of the plurality of directions to obtain the output audio signal of the first channel comprises:
mixing the second mixed audio signals of the plurality of orientations by the following formula:
y_OUT = α*y_30° + α*y_-30° + β*y_120° + β*y_-120° + γ*y_270°;

wherein y_OUT represents the output audio signal of the first channel, y_30°, y_-30°, y_120°, y_-120°, y_270° represent the second mixed audio signals of the 30°, -30°, 120°, -120°, 270° channel orientations, and α, β, γ represent the mixing factors.
11. An earphone, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
12. A non-transitory computer-readable storage medium storing computer-executable instructions which, when executed by at least one processor, cause the at least one processor to perform the method of any of claims 1-10.
CN202311205936.3A 2023-09-18 2023-09-18 Earphone stereo audio processing method and earphone Pending CN117202083A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311205936.3A CN117202083A (en) 2023-09-18 2023-09-18 Earphone stereo audio processing method and earphone


Publications (1)

Publication Number Publication Date
CN117202083A 2023-12-08



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination