WO2020020247A1

WO2020020247A1 - Signal processing method and device, and computer storage medium

Info

Publication number: WO2020020247A1
Application number: PCT/CN2019/097552
Authority: WO
Inventors: 崔腾飞
Original assignee: 西安中兴新软件有限责任公司
Priority date: 2018-07-25
Filing date: 2019-07-24
Publication date: 2020-01-30
Also published as: CN110769352B; CN110769352A

Abstract

Embodiments of the present invention provide a signal processing method, a signal processing device, and a computer storage medium. The method comprises: receiving a first audio signal and playing the first audio signal by at least one loudspeaker; obtaining, according to at least one loudspeaker model, at least one microphone model, and the first audio signal, at least one echo estimation signal corresponding to the first audio signal; receiving a second audio signal using at least one microphone, wherein the second audio signal comprises an echo signal generated by the first audio signal, output by the at least one loudspeaker, and received by the at least one microphone; and removing, from the second audio signal, at least one echo estimation signal corresponding to the first audio signal to obtain an echo-processed audio signal.

Description

Signal processing method, device and computer storage medium

Technical field

The present disclosure relates to, but is not limited to, the field of audio signal processing technology.

Background technique

During a call, people can sometimes hear their own speech. This is mainly because the sound played by the other party's speaker is picked up by the other party's microphone (MIC) and transmitted back, which is an acoustic effect. Therefore, echo generally exists in scenarios where a speaker and a MIC operate in duplex, such as terminal calls, personal computer (PC) Internet calls, personal digital assistant (PDA) Internet calls, and scenarios that record and play back at the same time.

Summary of the Invention

An embodiment of the present disclosure provides a signal processing method, which is applied to a signal processing device having at least one speaker and at least one microphone. The method includes: receiving a first audio signal and playing the first audio signal by the at least one speaker; obtaining at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model, and the first audio signal, wherein the at least one speaker model is obtained based on the at least one speaker, and the at least one microphone model is obtained based on the at least one microphone; receiving a second audio signal using the at least one microphone, wherein the second audio signal includes an echo signal generated by the first audio signal, output by the at least one speaker, and received by the at least one microphone; and removing the at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an echo-processed audio signal.

An embodiment of the present disclosure further provides a signal processing device including at least one speaker, at least one microphone, a first receiving component, a first acquisition component, a second receiving component, and a second acquisition component, wherein the first receiving component is configured to receive a first audio signal and play the first audio signal by the at least one speaker; the first acquisition component is configured to acquire at least one echo estimation signal corresponding to the first audio signal based on at least one speaker model, at least one microphone model, and the first audio signal, wherein the at least one speaker model is obtained based on the at least one speaker, and the at least one microphone model is obtained based on the at least one microphone; the second receiving component is configured to receive a second audio signal by using the at least one microphone, wherein the second audio signal includes an echo signal generated by the first audio signal, output by the at least one speaker, and received by the at least one microphone; and the second acquisition component is configured to remove the at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an echo-processed audio signal.

An embodiment of the present disclosure further provides a signal processing apparatus including a memory and a processor, wherein the memory stores a computer program, and when the processor runs the computer program, the processor executes the signal processing method according to the present disclosure.

An embodiment of the present disclosure further provides a computer storage medium on which a computer program is stored. When the computer program is executed by at least one processor, the at least one processor executes a signal processing method according to the present disclosure.

Brief description of the drawings

FIG. 1 shows a schematic circuit structure of a single speaker and a single microphone;

FIG. 2 shows a schematic circuit structure of a single speaker and a dual microphone;

FIG. 3 shows a schematic circuit structure of a dual speaker and a single microphone;

FIG. 4 is a schematic diagram showing a curve comparison between an echo reference signal and a microphone recording signal;

FIG. 5 is another schematic diagram of comparison between an echo reference signal and a microphone recording signal;

FIG. 6 shows another schematic circuit structure diagram of a dual speaker and a single microphone;

FIG. 7 shows a circuit structure diagram of a dual speaker and a dual microphone;

FIG. 8 is a schematic flowchart of a signal processing method according to an embodiment of the present disclosure;

FIG. 9 is a schematic diagram of a circuit structure of a four-speaker and a single microphone according to an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of another circuit structure of a four-speaker and a single microphone according to an embodiment of the present disclosure;

FIG. 11 is a schematic diagram showing a circuit structure of a four-speaker and a dual microphone according to an embodiment of the present disclosure;

FIG. 12 is a schematic diagram of another circuit structure of a four-speaker and a single microphone according to an embodiment of the present disclosure;

FIG. 13 is a schematic flowchart of a signal processing method according to an embodiment of the present disclosure;

FIG. 14 is a schematic flowchart of a signal processing method according to an embodiment of the present disclosure;

FIG. 15 is a schematic flowchart of a signal processing method according to an embodiment of the present disclosure;

FIG. 16 is a schematic flowchart of a signal processing method according to an embodiment of the present disclosure;

FIG. 17 is a schematic flowchart of a signal processing method according to an embodiment of the present disclosure;

FIG. 18 is a schematic diagram of a circuit structure of a single speaker and a single microphone according to an embodiment of the present disclosure;

FIG. 19 is a schematic flowchart of a signal processing method according to an embodiment of the present disclosure;

FIG. 20 is a schematic diagram of a circuit structure of a dual speaker and a single microphone according to an embodiment of the present disclosure;

FIG. 21 is a schematic structural diagram of a signal processing device according to an embodiment of the present disclosure;

FIG. 22 is another schematic structural diagram of a signal processing device according to an embodiment of the present disclosure;

FIG. 23 is another schematic structural diagram of a signal processing device according to an embodiment of the present disclosure;

FIG. 24 is another schematic structural diagram of a signal processing apparatus according to an embodiment of the present disclosure;

FIG. 25 is another schematic structural diagram of a signal processing device according to an embodiment of the present disclosure; and

FIG. 26 is a schematic diagram of a hardware structure of a signal processing apparatus according to an embodiment of the present disclosure.

Detailed description

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

Echo is mainly the repetition of sound caused by the reflection of sound waves; that is, the sound emitted by a sound source is reflected back to the position of the sound source. Electronic devices often use microphones and speakers together. While the speaker plays sound data received from the far end, the microphone transmits voice or other sound data from the near end to the far end. In a typical hands-free system, the speaker is disposed adjacent to the microphone, so the sound emitted by the speaker is picked up by the microphone; this is called an echo. Without processing, the echo is heard by the far-end user, resulting in unexpectedly loud noise and an unpleasant psychoacoustic experience. Therefore, echo cancellation technology is introduced to process the echoes picked up by the microphone.

Echo cancellation technology processes the speaker sound picked up by the microphone and retains only the sound that was not played by the speaker. The echo cancellation technology commonly used at present is single-speaker (earpiece) echo cancellation; stereo (dual-speaker) echo cancellation and echo cancellation for more channels (speakers) are used relatively less often.

Single-speaker echo cancellation technology is widely used; that is, only one speaker plays sound during a call. For example, FIG. 1 shows an application example of a circuit structure 10 of a single speaker and a single microphone. As shown in FIG. 1, the circuit structure 10 includes a speaker 101, a microphone 102, an adder 103, an adaptive filter (AF) module 104, a voice processing module 105, a noise reduction and other processing module 106, a decoder 107, an encoder 108, and a radio frequency end 109. The radio frequency end 109 sends the received audio signal to the decoder 107, and the decoder 107 demodulates the audio signal. The demodulated audio signal enters the voice processing module 105 for voice processing such as noise reduction and filtering, and is then played by the speaker 101. The audio signal recorded by the microphone 102 includes an echo signal generated by the audio signal played by the speaker 101. In order to eliminate this echo signal as much as possible, an audio signal is extracted at the front end of the speaker 101 as a reference signal. The reference signal is input to the adder 103 after passing through the AF module 104, and the audio signal recorded by the microphone 102 is also input to the adder 103. The two signals are subtracted in the adder 103, so that the audio signal recorded by the microphone 102 is subjected to echo processing. The echo-processed audio signal is then processed by the noise reduction and other processing module 106, modulated in the encoder 108, and finally transmitted by the radio frequency end 109 through the transmission line.
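The reference-signal path of FIG. 1 (AF module followed by a subtraction in the adder) can be illustrated with a minimal sketch. The text does not state which adaptation algorithm the AF module 104 uses, so the normalized least-mean-squares (NLMS) update below is an assumption chosen only for illustration; the function name, tap count, step size, and the simulated echo path are likewise hypothetical.

```python
import random

def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively filtered reference from the microphone signal.

    reference: samples tapped at the front end of the speaker
    mic:       samples recorded by the microphone (they contain the echo)
    Returns the echo-processed (residual) samples.
    NLMS is an assumed adaptation rule, not one named by the source text.
    """
    weights = [0.0] * taps   # coefficients of the adaptive filter (AF module)
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate  # subtraction performed by the adder
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual

# Hypothetical echo path: the microphone hears a delayed, attenuated copy.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.0, 0.6, 0.3, -0.1]
mic = [sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
       for n in range(len(far_end))]
out = nlms_echo_cancel(far_end, mic)
```

In this idealized simulation the echo path is linear and shorter than the filter, so the residual decays toward zero. The defect discussed later in the text arises when the real speaker and microphone responses differ from what such a digital reference can represent.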

FIG. 2 shows an application example of the circuit structure 20 of a single speaker and a dual microphone. As shown in FIG. 2, the circuit structure 20 includes a speaker 201, a first microphone 202a, a second microphone 202b, a first adder 203a, a second adder 203b, a first AF module 204a, a second AF module 204b, a voice processing module 205, a noise reduction and other processing module 206, a decoder 207, an encoder 208, and a transmitting end 209. The audio signal at the front end of the speaker 201 is selected as a reference signal. The reference signal is used to perform echo processing on the audio signal received by the first microphone 202a through the first AF module 204a and the first adder 203a, and on the audio signal received by the second microphone 202b through the second AF module 204b and the second adder 203b. The audio signal after the echo processing is input to the noise reduction and other processing module 206 for noise reduction processing, is modulated in the encoder 208, and is finally transmitted by the transmitting end 209 through the transmission line.

FIG. 3 shows an application example of the circuit structure 30 of a dual speaker and a single microphone. As shown in FIG. 3, the circuit structure 30 includes a first speaker 301a, a second speaker 301b, a microphone 302, an adder 303, an AF module 304, a voice processing module 305, a noise reduction and other processing module 306, a decoder 307, an encoder 308, and a transmitting end 309. The audio signals at the front ends of the first speaker 301a and the second speaker 301b are selected as reference signals for echo processing. The reference signal is used to perform echo processing on the audio signal received by the microphone 302 through the AF module 304 and the adder 303. The audio signal after the echo processing is input to the noise reduction and other processing module 306 for noise reduction processing, is then modulated in the encoder 308, and is finally sent by the transmitting end 309 through the transmission line.

Research has found that the above echo cancellation technologies have obvious defects. Referring to FIG. 4 and FIG. 5, when the high-frequency resonance peak of the frequency response of the speaker or microphone is relatively low, for example when the resonance peak of the speaker is around 4 kHz, the echo signal cannot be removed cleanly, and echo howling easily occurs if the speaker is loud. This is because the reference signal is only the digital signal at the front end of the speaker, and the acoustic effects introduced by the speaker and microphone are not taken into account. The signal near the resonance peak is amplified, so the sound signal loop is amplified near the resonance peak and the echo cannot be removed. It can be seen from FIG. 4 that the echo reference signal is larger than the signal recorded by the microphone (that is, the echo signal to be processed), and the linearity between the two is good. In this case, the echo is easier to remove cleanly. It can be seen from FIG. 5 that, at around 4 kHz, the microphone recording signal (that is, the echo signal to be processed) is larger than the echo reference signal. In this case, the echo is difficult to process.

In addition, the circuit structures in FIG. 1 and FIG. 2 are suitable for a single-speaker application scenario. For a multi-speaker application scenario such as FIG. 3, if only the signal at the front end of one speaker is used as the reference signal for echo processing, the effect of the sound superposition of multiple speakers is not considered, and the echo processing becomes worse, especially when the sound played by the speakers is loud.

In the multi-channel echo cancellation technology, considering that each speaker affects the echo signal, multiple echo processing modules are generally added to the algorithm of the microphone input path. Generally speaking, an electronic device uses as many echo processing modules as it has speakers, and each echo processing module introduces the audio signal of one speaker as a reference signal. FIG. 6 shows an application example of the circuit structure 60 of a dual speaker and a single microphone. As shown in FIG. 6, the circuit structure 60 includes a first speaker 601a, a second speaker 601b, a microphone 602, a first adder 603a, a second adder 603b, a first AF module 604a, a second AF module 604b, a voice processing module 605, a noise reduction and other processing module 606, a decoder 607, an encoder 608, and a transmitting end 609. The audio signals at the front ends of the first speaker 601a and the second speaker 601b are selected as reference signals for echo processing. The audio signal at the front end of the first speaker 601a passes through the first AF module 604a and is input to the first adder 603a. The audio signal received by the microphone 602 is also input to the first adder 603a. The first adder 603a performs echo processing on the echo, within the audio signal recorded by the microphone 602, that is generated by the audio signal played by the first speaker 601a. The audio signal at the front end of the second speaker 601b passes through the second AF module 604b and is input to the second adder 603b. The audio signal received by the microphone 602 is also input to the second adder 603b. The second adder 603b performs echo processing on the echo, within the audio signal recorded by the microphone 602, that is generated by the audio signal played by the second speaker 601b. The audio signal after the echo processing is input to the noise reduction and other processing module 606 for processing, is then modulated in the encoder 608, and is finally sent by the transmitting end 609 through the transmission line. This processing method considers the effect of the superposition of multiple speakers on the echo signal, so it performs better than the approach in FIG. 3, which introduces the digital signal at the front end of only one speaker as the echo reference signal. However, it has the same disadvantage described above: because the acoustic effects introduced by the speakers and microphones are not considered, obvious echo howling still occurs when the resonance peaks of the speakers and microphones are low. In addition, because multiple echo processing modules are introduced, the more speakers and microphones there are, the more echo processing modules are needed, and the complexity increases significantly. Furthermore, since one microphone path may contain multiple echo processing modules, their mutual influence significantly increases the complexity of debugging.

In multi-channel echo cancellation technology, a transfer function may also be used to calculate the signal between speaker playback and microphone recording. FIG. 7 shows an application example of the circuit structure 70 of a dual speaker and a dual microphone. It can be seen from FIG. 7 that the left/right two-channel stereo signals XL and XR input to the line input terminals LI(L) and LI(R) do not pass through the sum/difference signal generating device 52; they are output through the sound output terminals SO(L) and SO(R), respectively, reproduced at the speakers SP(L) and SP(R), and then collected by the microphones MC(L) and MC(R) and input to the sound input terminals SI(L) and SI(R). The filters 40-1, 40-2, 40-3, and 40-4 are formed by, for example, FIR filters. The impulse responses set in the filters 40-1, 40-2, 40-3, and 40-4 correspond respectively to the transfer functions between the speakers SP(L) and SP(R) and the microphones MC(L) and MC(R), thereby generating the echo processing signals EC1, EC2, EC3, and EC4.

Adders 44 and 46 and subtractors 48 and 50 are used for echo processing. The echo-processed signals are output from the line output terminals LO(L) and LO(R), respectively. Here, the sum/difference signal generating device 52 includes an adder 54 and a subtractor 56 for generating a sum signal X_M and a difference signal X_S. The correlation detection device 59 detects a correlation between the sum signal X_M and the difference signal X_S based on a correlation value calculation (or a similar calculation). The transfer function calculation device 58 is used to calculate the transfer functions of the four audio transmission systems between the speakers SP(L), SP(R) and the microphones MC(L), MC(R). This technical solution uses the sum signal and the difference signal of the stereo sound signal as reference signals, and calculates the transfer functions of the four audio transmission systems between the two speakers and the two microphones from the cross-spectrum of the reference signals and the sound signals recorded by the microphones. The obtained transfer functions are subjected to an inverse Fourier transform to obtain impulse responses, and these impulse responses are set in a filter device to generate the reference signals for echo processing and perform the echo processing. This technical solution also considers the impact of the acoustic path, the acoustic structure, and the components on the echo signal. In addition, the analysis of the transfer function takes into account the sum signal and difference signal of the two speaker signals as well as the echo signal recorded by the microphone, which is more comprehensive. However, obtaining the transfer function requires the speaker to play the audio signal while the microphone receives it, which is strongly affected by environmental fluctuations, and the working conditions of the speaker and the microphone differ under different audio signals and environments. As a result, the echo processing is not ideal in some situations. In addition, as the number of speakers and microphones increases, so does the complexity of the transfer function.
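The cross-spectrum step described above can be sketched for a single speaker-to-microphone path. This is only an illustration under strong assumptions that are not in the source: one data block, no averaging or noise, a circular (periodic) echo path, and a naive discrete Fourier transform written out for self-containment. All names are hypothetical.

```python
import cmath
import random

def dft(x, inverse=False):
    """Naive discrete Fourier transform (adequate for short illustrative blocks)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def estimate_impulse_response(reference, recorded):
    """Transfer function from the cross-spectrum, then inverse transform."""
    X = dft(reference)
    Y = dft(recorded)
    cross = [x.conjugate() * y for x, y in zip(X, Y)]  # cross-spectrum
    auto = [abs(x) ** 2 for x in X]                    # reference power spectrum
    H = [c / a for c, a in zip(cross, auto)]           # transfer function estimate
    return [v.real for v in dft(H, inverse=True)]      # impulse response

# Hypothetical path: a short impulse response applied circularly to the reference.
rng = random.Random(0)
N = 32
reference = [rng.gauss(0.0, 1.0) for _ in range(N)]
true_h = [0.5, 0.25, -0.1] + [0.0] * (N - 3)
recorded = [sum(true_h[k] * reference[(n - k) % N] for k in range(N)) for n in range(N)]
estimated = estimate_impulse_response(reference, recorded)
```

The estimated impulse response would then be loaded into an FIR filter to generate the echo processing signal. In practice the estimate depends on what is being played and on the environment, which is the weakness the text points out.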

In the application examples of the circuit structures shown in FIG. 1 to FIG. 3 and FIG. 6 to FIG. 7, the acoustic response models of the speaker and microphone are not considered. As a result, when the high-frequency resonance peak of the speaker or microphone is relatively low (such as around 4 kHz) and the signal recorded by the microphone has a higher amplitude than the reference signal at the high-frequency resonance peak, echo howling is easily produced. In order to effectively solve the echo howling phenomenon generated when the high-frequency resonance peak of a speaker or a microphone is relatively low, the technical solutions of the embodiments of the present disclosure are proposed. The embodiments of the present disclosure will be described in detail below with reference to the drawings.

FIG. 8 is a schematic flowchart of a signal processing method according to an embodiment of the present disclosure.

Referring to FIG. 8, a signal processing method according to an embodiment of the present disclosure is applied to a signal processing apparatus having at least one speaker and at least one microphone. The method may include steps S801 to S804.

In step S801, a first audio signal is received and the first audio signal is played by at least one speaker.

In step S802, at least one echo estimation signal corresponding to the first audio signal is obtained according to at least one speaker model, at least one microphone model, and the first audio signal, wherein the at least one speaker model is obtained based on the at least one speaker, and the at least one microphone model is obtained based on the at least one microphone.

In step S803, a second audio signal is received by using the at least one microphone, wherein the second audio signal includes an echo signal generated by the first audio signal, output by the at least one speaker, and received by the at least one microphone.

In step S804, at least one echo estimation signal corresponding to the first audio signal is removed from the second audio signal to obtain an echo-processed audio signal.

Based on the signal processing method shown in FIG. 8, the echo howling phenomenon generated when the high-frequency resonance peak of the speaker or the microphone is low can be effectively solved, and the calculation workload in the multi-microphone design is also reduced.

For the signal processing method shown in FIG. 8, in a possible implementation manner, before step S801, the method further includes: performing demodulation and voice preprocessing on the first audio signal, wherein the first audio signal is generated and transmitted by the remote device.

It should be noted that, generally speaking, the first audio signal is generated and transmitted by the remote device. After the signal processing device receives the first audio signal, it performs demodulation and speech preprocessing. The processed first audio signal enters the speaker and the speaker plays the first audio signal.

For the signal processing method shown in FIG. 8, in a possible implementation manner, before step S802, the method further includes: correspondingly establishing at least one speaker model according to the characteristic information of the at least one speaker, wherein the characteristic information of the at least one speaker includes circuit information and structural information corresponding to the at least one speaker; and correspondingly establishing at least one microphone model according to the characteristic information of the at least one microphone, wherein the characteristic information of the at least one microphone includes circuit information and structural information corresponding to the at least one microphone.

It should be noted that, in the embodiment of the present disclosure, a speaker model and a microphone model are introduced. The speaker model simulates the acoustic response of the speaker based on the circuit information and structural information of the speaker, and the microphone model simulates the acoustic response of the microphone based on the circuit information and structural information of the microphone. In this way, the acquired reference signal of the first audio signal is more accurate and closer to the echo signal of the first audio signal played by the speaker, so that the echo signal is processed more effectively.

Understandably, the number of microphones may be one or more, which is not specifically limited in the embodiments of the present disclosure. When the number of microphones is one, the number of corresponding microphone models is one. In this implementation, step S802 may include: inputting the first audio signal to each speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker; obtaining a first reference signal of each speaker according to the acoustic response signal of each speaker and first delay and attenuation information corresponding to the acoustic response signal, wherein the first delay and attenuation information is obtained based on distance and position information between each speaker and the microphone; performing superposition and bit expansion processing on the first reference signals of the speakers to obtain a first superimposed reference signal; and performing acoustic response processing on the first superimposed reference signal based on the microphone model to obtain a first echo estimation signal of the first audio signal.

It should be noted that, in the embodiment of the present disclosure, the echo estimation signal of the first audio signal is obtained through a series of signal processing steps, namely the speaker model, delay and attenuation, signal superposition, and the microphone model, so that the echo estimation signal is closer to the echo signal of the first audio signal played by the speaker, which makes the subsequent echo signal processing more effective.
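The chain of steps described for step S802 (speaker model, delay and attenuation, superposition, microphone model) can be sketched as follows. The source does not say how the models are represented, so treating each model as a short FIR impulse response is an assumption made for illustration, and all function names and numeric values are hypothetical. Bit expansion is omitted here because Python floats do not overflow in this way.

```python
def fir(signal, taps):
    """Convolve a signal with FIR taps, truncated to the input length."""
    return [sum(taps[k] * signal[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(signal))]

def delay_and_attenuate(signal, delay_samples, gain):
    """Shift the signal by a whole number of samples and scale it."""
    delayed = [0.0] * delay_samples + list(signal)
    return [gain * s for s in delayed[:len(signal)]]

def estimate_echo(first_audio, speaker_models, paths, microphone_model):
    """Speaker models -> delay/attenuation -> superposition -> microphone model.

    speaker_models:   one FIR tap list per speaker (assumed representation)
    paths:            one (delay_samples, gain) pair per speaker
    microphone_model: FIR tap list for the microphone (assumed representation)
    """
    references = []
    for taps, (delay_samples, gain) in zip(speaker_models, paths):
        response = fir(first_audio, taps)  # acoustic response of this speaker
        references.append(delay_and_attenuate(response, delay_samples, gain))
    superimposed = [sum(column) for column in zip(*references)]
    return fir(superimposed, microphone_model)

# Two hypothetical speakers driven by a unit impulse.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
echo_estimate = estimate_echo(
    impulse,
    speaker_models=[[1.0], [1.0, 0.5]],
    paths=[(1, 0.5), (2, 0.25)],
    microphone_model=[1.0],
)
```

The resulting echo estimate is what would be removed from the second audio signal in step S804.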

For example, FIG. 9 shows an application example of a circuit structure 90 of four speakers and a single microphone according to an embodiment of the present disclosure. As shown in FIG. 9, the circuit structure 90 includes a speaker group 901, a microphone 902, a speaker model group 903, a delay attenuation module group 904, a summing module 905, a microphone model 906, an echo processing module 907, a voice processing module 908, a noise reduction and other processing module 909, a decoder 910, an encoder 911, and a transmitting end 912. The speaker group 901 includes a first speaker 901a, a second speaker 901b, a third speaker 901c, and a fourth speaker 901d. The speaker model group 903 includes a first speaker model 903a, a second speaker model 903b, a third speaker model 903c, and a fourth speaker model 903d. The delay attenuation module group 904 includes a first delay attenuation module 904a, a second delay attenuation module 904b, a third delay attenuation module 904c, and a fourth delay attenuation module 904d. The speaker model group 903 is correspondingly established based on the speaker group 901, and the delay attenuation module group 904 applies the corresponding delay and attenuation to the audio signal after it passes through the speaker model group 903.

The radio frequency end 912 sends the received first audio signal to the decoder 910, and the decoder 910 demodulates the first audio signal. The demodulated first audio signal enters the voice processing module 908 for pre-processing such as noise reduction and filtering, and then the first audio signal is played by the speaker group 901. The second audio signal recorded by the microphone 902 includes an echo signal generated by the first audio signal played by the speaker group 901. In order to process the echo signal, the first audio signal is extracted at the front end of the speaker group 901 and input to the first speaker model 903a, the second speaker model 903b, the third speaker model 903c, and the fourth speaker model 903d, respectively, for acoustic response analysis processing to obtain an acoustic response signal of each speaker in the speaker group 901. Sound takes time to propagate, and its energy also decays during propagation. To represent the delay and attenuation of the sound transmitted from each speaker to the microphone position (that is, the sound transmission time and attenuation), the acoustic response signals are correspondingly input to the first delay attenuation module 904a, the second delay attenuation module 904b, the third delay attenuation module 904c, and the fourth delay attenuation module 904d for delay and attenuation processing, which yields four delayed and attenuated first reference signals. The summing module 905 superimposes these signals to obtain a superimposed reference signal. Because overflow may occur after the superposition, bit expansion processing is also required, and the word length can be extended by at least two binary bits. In general, it is extended up to four times to prevent overflow after superposition. The superimposed reference signal is input to the microphone model 906 to perform acoustic response processing according to the sound pressure excitation of the microphone model 906, thereby obtaining an echo reference signal as recorded through the microphone 902, that is, a first echo estimation signal of the first audio signal. The first echo estimation signal and the second audio signal recorded by the microphone 902 are input to the echo processing module 907 for echo processing. The echo-processed audio signal is then subjected to noise reduction processing through the noise reduction and other processing module 909, modulated in the encoder 911, and finally sent by the transmitting end 912 through the transmission line.
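The text states that the delay and attenuation follow from the distance and position between each speaker and the microphone, but gives no formula. As a rough sketch, assuming free-field propagation at a fixed speed of sound and an inverse-distance amplitude law (both assumptions, as are the function name and the reference distance), the parameters of a delay attenuation module could be derived like this:

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 degrees Celsius

def delay_and_gain(distance_m, sample_rate_hz, reference_distance_m=0.01):
    """Return (delay in samples, amplitude gain) for one speaker-to-microphone path.

    Assumes free-field propagation: delay = distance / speed of sound, and
    amplitude falling off as 1 / distance relative to a hypothetical
    reference distance. Real device enclosures will deviate from this.
    """
    delay_samples = round(distance_m / SPEED_OF_SOUND_M_S * sample_rate_hz)
    gain = min(1.0, reference_distance_m / distance_m)
    return delay_samples, gain

# Hypothetical speaker placed 10 cm from the microphone, 48 kHz sampling.
delay_samples, gain = delay_and_gain(0.10, 48000)
```

For a handheld device the distances are a few centimeters, so the delays amount to only a handful of samples at common audio sampling rates.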

It should be noted that the sound signals of multiple speakers (such as the four speakers 901a, 901b, 901c, and 901d in FIG. 9) are superimposed and the number of bits is expanded, mainly because overflow may occur after superposition, so the number of bits needs to be extended. The superimposed sound signal is input to the echo processing module 907 as an echo estimation signal, and is mainly used to perform echo processing on the echo signal, within the second audio signal recorded by the microphone 902, that is generated by the first audio signal played by the speakers. Because the echo processing module 907 does not increase the amplitude of the original audio signal, no bit expansion processing is needed for the second audio signal recorded by the microphone 902.
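The need for bit expansion can be checked with a small worked example. The sketch below assumes 16-bit signed samples, which the source does not specify, and uses hypothetical helper names. Summing four full-scale 16-bit samples wraps around in a 16-bit accumulator, whereas two extra bits (18 bits in total) hold the sum exactly, which matches the statement that at least two binary bits are needed for four speakers.

```python
def wrap_to_bits(value, bits):
    """Emulate two's-complement wraparound in a signed word of the given width."""
    half = 1 << (bits - 1)
    return (value + half) % (1 << bits) - half

def extra_bits_needed(num_signals):
    """Headroom bits so that summing num_signals full-scale samples cannot overflow."""
    return (num_signals - 1).bit_length()  # equals ceil(log2(num_signals))

samples = [32767, 32767, 32767, 32767]       # four speakers at positive full scale
true_sum = sum(samples)                      # 131068
wrapped_16 = wrap_to_bits(true_sum, 16)      # result in a 16-bit accumulator
expanded = wrap_to_bits(true_sum, 16 + extra_bits_needed(len(samples)))
```

Here `wrapped_16` differs from `true_sum` while `expanded` equals it, so the superimposed reference signal is represented correctly only after the word length is extended.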

Understandably, when the high-frequency resonance peak of the microphone is relatively high (for example, greater than 8 kHz), the acoustic response processing performed by the microphone model on the above-mentioned first superimposed reference signal may be omitted. Therefore, in the above specific implementation manner, after the step of obtaining the first superimposed reference signal, the method further includes: obtaining a high-frequency resonance peak of the microphone; comparing the high-frequency resonance peak of the microphone with a preset high-frequency resonance peak; and in response to the high-frequency resonance peak of the microphone being higher than the preset high-frequency resonance peak, canceling the acoustic response processing of the first superimposed reference signal by the microphone model.

It should be noted that if the bandwidth of the first microphone (such as the microphone 902 in FIG. 9) is relatively wide, that is, the high-frequency resonance peak of the microphone 902 is relatively high, it is not necessary to use the microphone model 906 to perform acoustic response processing on the obtained first superimposed reference signal. In other words, the echo estimation signal does not need to be corrected by the microphone model 906, and the obtained first superimposed reference signal can be used directly as the echo estimation signal. For example, in combination with the circuit structure 90 shown in FIG. 9, after the first superimposed reference signal is obtained through the summing module 905, a determination can be made according to the bandwidth of the microphone 902, that is, by comparing the high-frequency resonance peak of the microphone 902 with a preset high-frequency resonance peak. Assume that the preset high-frequency resonance peak is 8 kHz. If the high-frequency resonance peak of the microphone 902 is 5 kHz, that is, lower than the preset high-frequency resonance peak, the microphone model 906 is required to perform acoustic response processing on the first superimposed reference signal, so that a corrected echo estimation signal can be obtained. If the high-frequency resonance peak of the microphone 902 is 9 kHz, that is, higher than the preset high-frequency resonance peak, the microphone model 906 is not required to perform acoustic response processing on the first superimposed reference signal.
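The comparison in this example can be sketched as a simple threshold test. The 8 kHz preset and the 5 kHz and 9 kHz cases come from the example above; the function names and the stand-in microphone model passed as `apply_mic_model` are hypothetical.

```python
PRESET_PEAK_HZ = 8000.0  # preset high-frequency resonance peak from the example


def needs_microphone_model(mic_peak_hz, preset_peak_hz=PRESET_PEAK_HZ):
    """The microphone model is applied unless the microphone's peak is
    higher than the preset peak."""
    return mic_peak_hz <= preset_peak_hz


def select_echo_estimate(superimposed, mic_peak_hz, apply_mic_model):
    """Return the echo estimation signal for a given microphone peak."""
    if needs_microphone_model(mic_peak_hz):
        return apply_mic_model(superimposed)   # corrected estimate
    return list(superimposed)                  # used directly


def halve(signal):
    """Hypothetical stand-in for the microphone model."""
    return [0.5 * v for v in signal]


corrected = select_echo_estimate([1.0, 2.0], 5000.0, halve)
direct = select_echo_estimate([1.0, 2.0], 9000.0, halve)
```

A peak exactly equal to the preset value is treated here as "not higher", so the microphone model is still applied in that case.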

Referring to FIG. 10, an application example of a circuit structure 100 with four speakers and a single microphone according to an embodiment of the present disclosure is shown. As shown in FIG. 10, the circuit structure 100 includes a speaker group 1001, a microphone 1002, a speaker model group 1003, a delay attenuation module group 1004, a summing module 1005, an echo processing module 1006, a voice processing module 1007, a noise reduction and other processing module 1008, a decoder 1009, an encoder 1010, and a transmitting end 1011. The speaker group 1001 includes a first speaker 1001a, a second speaker 1001b, a third speaker 1001c, and a fourth speaker 1001d. The speaker model group 1003 includes a first speaker model 1003a, a second speaker model 1003b, a third speaker model 1003c, and a fourth speaker model 1003d. The delay attenuation module group 1004 includes a first delay attenuation module 1004a, a second delay attenuation module 1004b, a third delay attenuation module 1004c, and a fourth delay attenuation module 1004d. Compared with the circuit structure 90 shown in FIG. 9, the circuit structure 100 shown in FIG. 10 omits the microphone model. That is, in the circuit structure shown in FIG. 10, the step in which the microphone model performs acoustic response processing on the first superimposed reference signal is eliminated. After the superimposed reference signal is obtained, it is directly input as an echo estimation signal to the echo processing module 1006 to perform echo processing on the second audio signal recorded by the microphone 1002. The other operation processes are the same as those of the circuit structure 90 shown in FIG. 9 described above, and will not be described in detail here.

Understandably, the number of microphones may be one or more, which is not specifically limited in the embodiments of the present disclosure. When the number of microphones is two, the number of correspondingly established microphone models is two. In this implementation, step S802 may include: inputting the first audio signal to each speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker; obtaining a first reference signal and a second reference signal of each speaker according to the acoustic response signal of each speaker, first delay and attenuation information corresponding to the acoustic response signal, and second delay and attenuation information corresponding to the acoustic response signal, wherein the first delay and attenuation information is obtained based on the distance and position information between each speaker and a first microphone of the at least one microphone, and the second delay and attenuation information is obtained based on the distance and position information between each speaker and a second microphone of the at least one microphone; separately superimposing and bit-expanding the first reference signals and the second reference signals of the speakers to obtain a first superimposed reference signal and a second superimposed reference signal; performing, based on a first microphone model of the at least one microphone model, acoustic response processing on the first superimposed reference signal to obtain a first echo estimation signal of the first audio signal; and performing, based on a second microphone model of the at least one microphone model, acoustic response processing on the second superimposed reference signal to obtain a second echo estimation signal of the first audio signal.

For example, referring to FIG. 11, an application example of a circuit structure 110 with four speakers and dual microphones according to an embodiment of the present disclosure is shown. As shown in FIG. 11, the circuit structure 110 includes a speaker group 1101, a first microphone 1102a, a second microphone 1102b, a speaker model group 1103, a first delay attenuation module group 1104-1, a second delay attenuation module group 1104-2, a first summing module 1105a, a second summing module 1105b, a first microphone model 1106a, a second microphone model 1106b, a first echo processing module 1107a, a second echo processing module 1107b, a voice processing module 1108, a noise reduction and other processing module 1109, the decoder 1111, the encoder 1111, and the transmitting end 1112. The speaker group 1101 includes a first speaker 1101a, a second speaker 1101b, a third speaker 1101c, and a fourth speaker 1101d. The speaker model group 1103 includes a first speaker model 1103a, a second speaker model 1103b, a third speaker model 1103c, and a fourth speaker model 1103d. The first delay attenuation module group 1104-1 includes a first delay attenuation module 1104-1a, a second delay attenuation module 1104-1b, a third delay attenuation module 1104-1c, and a fourth delay attenuation module 1104-1d. The second delay attenuation module group 1104-2 includes a fifth delay attenuation module 1104-2a, a sixth delay attenuation module 1104-2b, a seventh delay attenuation module 1104-2c, and an eighth delay attenuation module 1104-2d. The speaker model group 1103 is correspondingly established based on the speaker group 1101, and the first delay attenuation module group 1104-1 and the second delay attenuation module group 1104-2 apply the corresponding delay and attenuation to the audio signals that have passed through the speaker model group 1103.

The radio frequency end 1112 sends the received first audio signal to the decoder 1111, and the decoder 1111 demodulates the first audio signal. The demodulated first audio signal enters the voice processing module 1108 for preprocessing such as noise reduction and filtering, and the first audio signal is then played by the speaker group 1101. The second audio signal recorded by the first microphone 1102a includes an echo signal generated by the first audio signal played by the speaker group 1101, and the second audio signal recorded by the second microphone 1102b also includes an echo signal generated by the first audio signal played by the speaker group 1101. In order to eliminate these echo signals as much as possible, the first audio signal is extracted at the front end of the speaker group 1101. If the first audio signals input to the four speakers are different, each extracted first audio signal is correspondingly input to the first speaker model 1103a, the second speaker model 1103b, the third speaker model 1103c, and the fourth speaker model 1103d, and acoustic response analysis processing is then performed to obtain an acoustic response signal of each speaker in the speaker group 1101. Since sound takes time to propagate and its energy decays during propagation, the delay and attenuation of the sound transmitted from each speaker to each microphone position must be represented. The corresponding delays and attenuations in the first delay attenuation module group 1104-1 are obtained according to the distance and position information between the first microphone 1102a and each speaker in the speaker group 1101, and the corresponding delays and attenuations in the second delay attenuation module group 1104-2 are obtained according to the distance and position information between the second microphone 1102b and each speaker in the speaker group 1101. 
The acoustic response signals are input to the first delay attenuation module group 1104-1 and the second delay attenuation module group 1104-2, respectively, for delay attenuation processing, so as to obtain eight channels of delayed and attenuated sound signals. The four delayed and attenuated signals obtained by the first delay attenuation module group 1104-1 are the first reference signals. The first reference signals are input to the first summing module 1105a for superposition and bit expansion processing to obtain a first superimposed reference signal. The first superimposed reference signal is input to the first microphone model 1106a, and acoustic response processing is performed according to the sound pressure excitation of the first microphone model 1106a, thereby obtaining a first echo estimation signal as recorded by the first microphone 1102a. The four delayed and attenuated signals obtained by the second delay attenuation module group 1104-2 are the second reference signals. The second reference signals are input to the second summing module 1105b for superposition and bit expansion processing to obtain a second superimposed reference signal. The second superimposed reference signal is input to the second microphone model 1106b, and acoustic response processing is performed according to the sound pressure excitation of the second microphone model 1106b, thereby obtaining a second echo estimation signal as recorded by the second microphone 1102b. The first echo estimation signal and the second audio signal recorded by the first microphone 1102a are input to the first echo processing module 1107a for echo processing, so that the echo signal generated by the first audio signal played by the speaker group 1101 can be eliminated as much as possible from the second audio signal recorded by the first microphone 1102a. 
The second echo estimation signal and the second audio signal recorded by the second microphone 1102b are input to the second echo processing module 1107b for echo processing, so that the echo signal generated by the first audio signal played by the speaker group 1101 can be eliminated as much as possible from the second audio signal recorded by the second microphone 1102b. The echo-processed audio signals are then subjected to noise reduction processing by the noise reduction and other processing module 1109, modulated in the encoder 1111, and finally transmitted by the transmitting end 1112 through the transmission line.
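The text states that the delay and attenuation for each speaker and microphone pair are obtained from distance and position information, but does not give a formula. The sketch below is one possible mapping under loudly stated assumptions: free-field propagation at 343 m/s, a 16 kHz sample rate, delay rounded to whole samples, and a gain inversely proportional to distance relative to a hypothetical reference distance. The distances and all names are illustrative only.

```python
SPEED_OF_SOUND_M_S = 343.0   # assumed speed of sound in air
SAMPLE_RATE_HZ = 16000       # assumed sample rate


def delay_and_gain(distance_m, reference_m=0.01):
    """Map a speaker-to-microphone distance to (delay in samples, gain).

    Assumption: the gain falls off as 1/distance (spherical spreading)
    and is capped at 1.0 for distances below the reference distance.
    """
    delay_samples = round(distance_m / SPEED_OF_SOUND_M_S * SAMPLE_RATE_HZ)
    gain = min(1.0, reference_m / distance_m)
    return delay_samples, gain


# Hypothetical distances from the four speakers to each microphone.
mic1_distances_m = [0.05, 0.05, 0.15, 0.15]
mic2_distances_m = [0.15, 0.15, 0.05, 0.05]

# One (delay, gain) pair per module in groups 1104-1 and 1104-2.
group_1 = [delay_and_gain(d) for d in mic1_distances_m]
group_2 = [delay_and_gain(d) for d in mic2_distances_m]
```

Each microphone gets its own table of four pairs, which is why the two-microphone structure has eight delay attenuation modules in total.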

It should be noted that the first echo processing module 1107a and the second echo processing module 1107b do not need to perform bit expansion on the second audio signals recorded by the first microphone and the second microphone.

It should also be noted that when there are multiple microphones (such as the first microphone 1102a and the second microphone 1102b in FIG. 11), the reference signal of the echo signal (that is, the echo estimation signal) for each microphone needs to be obtained separately, and echo processing also needs to be performed separately (such as by the first echo processing module 1107a and the second echo processing module 1107b in FIG. 11). As for the acoustic response processing of the speaker models, the speaker group is shared, and the speaker models and the delay attenuation modules are provided separately. The speaker models are therefore applied first, and the position and distance information between each speaker and each microphone is applied afterwards, so that the audio signal at the front end of each speaker only needs to undergo the acoustic response processing of the speaker model once, which helps reduce the computational workload in a multi-microphone design.
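The saving described here can be illustrated by counting how often the speaker model runs. In the sketch below the speaker model is a hypothetical stand-in (a single gain), and the delay and gain values for each microphone are made up for illustration. The speaker responses are computed once and then reused for both microphones.

```python
calls = {"speaker_model": 0}


def speaker_model(signal, gain):
    """Hypothetical stand-in for a speaker model; counts its invocations."""
    calls["speaker_model"] += 1
    return [gain * s for s in signal]


def delay_attenuate(signal, delay_samples, gain):
    return [0.0] * delay_samples + [gain * s for s in signal]


def superpose(signals):
    length = max(len(s) for s in signals)
    return [sum(s[n] for s in signals if n < len(s)) for n in range(length)]


def superimposed_references(first_audio, speaker_gains, per_mic_paths):
    # Speaker models run once per speaker, regardless of microphone count.
    responses = [speaker_model(first_audio, g) for g in speaker_gains]
    result = []
    for paths in per_mic_paths:  # one list of (delay, gain) per microphone
        refs = [delay_attenuate(r, d, a) for r, (d, a) in zip(responses, paths)]
        result.append(superpose(refs))
    return result


first_ref, second_ref = superimposed_references(
    first_audio=[1.0, 0.0],
    speaker_gains=[1.0, 1.0, 1.0, 1.0],
    per_mic_paths=[
        [(0, 0.5), (0, 0.5), (1, 0.25), (1, 0.25)],   # paths to microphone 1
        [(1, 0.5), (1, 0.5), (2, 0.25), (2, 0.25)],   # paths to microphone 2
    ],
)
```

With four speakers and two microphones the speaker model is evaluated four times rather than eight, while the delay attenuation step still runs once per speaker and microphone pair.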

Understandably, when the high-frequency resonance peak of the first microphone or the second microphone is relatively high (for example, greater than 8 kHz), the acoustic response processing performed by the corresponding microphone model on the first superimposed reference signal or the second superimposed reference signal may be omitted. Therefore, in the above specific implementation manner, after the step of obtaining the first superimposed reference signal and the second superimposed reference signal, the method further includes: acquiring a high-frequency resonance peak of the first microphone and a high-frequency resonance peak of the second microphone; comparing the high-frequency resonance peak of the first microphone and the high-frequency resonance peak of the second microphone with a preset high-frequency resonance peak, respectively; in response to the high-frequency resonance peak of the first microphone being higher than the preset high-frequency resonance peak, canceling the acoustic response processing of the first superimposed reference signal by the first microphone model; and in response to the high-frequency resonance peak of the second microphone being higher than the preset high-frequency resonance peak, canceling the acoustic response processing of the second superimposed reference signal by the second microphone model.

It should be noted that if the bandwidth of the first microphone or the second microphone is relatively wide, that is, the high-frequency resonance peak of the microphone is relatively high, it is not necessary to use the first microphone model or the second microphone model to perform acoustic response processing on the obtained superimposed reference signal; that is, the echo reference signal does not need to be corrected by the first microphone model or the second microphone model. For example, in combination with the circuit structure 110 shown in FIG. 11, after the first superimposed reference signal is obtained through the first summing module 1105a, a determination can be made according to the bandwidth of the first microphone 1102a, that is, by comparing the high-frequency resonance peak of the first microphone 1102a with a preset high-frequency resonance peak. Assume that the preset high-frequency resonance peak is 8 kHz. If the high-frequency resonance peak of the first microphone 1102a is 5 kHz, that is, lower than the preset high-frequency resonance peak, the first microphone model 1106a is required to perform acoustic response processing on the first superimposed reference signal, so that a corrected first echo estimation signal can be obtained. If the high-frequency resonance peak of the first microphone 1102a is 9 kHz, that is, higher than the preset high-frequency resonance peak, the first microphone model 1106a does not need to perform acoustic response processing on the superimposed reference signal; that is, the first echo estimation signal does not need to be corrected by the first microphone model 1106a, and the step in which the first microphone model 1106a performs acoustic response processing on the first superimposed reference signal may be canceled. 
A determination can likewise be made according to the bandwidth of the second microphone 1102b, that is, by comparing the high-frequency resonance peak of the second microphone 1102b with the preset high-frequency resonance peak, so as to determine whether the second microphone model 1106b is required to correct the second echo estimation signal. If the high-frequency resonance peak of the second microphone 1102b is higher than the preset high-frequency resonance peak, the second microphone model 1106b does not need to perform acoustic response processing on the superimposed reference signal, and the step in which the second microphone model 1106b performs acoustic response processing on the second superimposed reference signal may be canceled.

For the signal processing method shown in FIG. 8, in a possible implementation manner, the step of inputting the first audio signal into each speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker includes: in response to the first audio signals received by the at least one speaker being the same, inputting the first audio signal to the at least one speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker; and in response to the first audio signals received by the at least one speaker being different, correspondingly inputting the first audio signals to the at least one speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker.

It should be noted that the first audio signals received by the multiple speakers may be the same. In this case, the same audio signal is input to the multiple speakers and is also input to the multiple speaker models, as in the circuit structure 90 shown in FIG. 9. Alternatively, the first audio signals received by the multiple speakers may be different. In this case, the first audio signals are correspondingly input to the multiple speakers and, at the same time, correspondingly input to the multiple speaker models, as in the circuit structure 120 shown in FIG. 12.
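The two cases can be sketched as a small routing helper. The function name and the convention of passing a single-element list for the shared-signal case are hypothetical choices made for this illustration.

```python
def route_to_speaker_models(first_audio_signals, num_speakers):
    """Return one input signal per speaker model.

    A single signal is shared by every speaker model (as in FIG. 9);
    otherwise there must be exactly one signal per speaker (as in FIG. 12).
    """
    if len(first_audio_signals) == 1:
        return [first_audio_signals[0]] * num_speakers
    if len(first_audio_signals) != num_speakers:
        raise ValueError("expected one signal or one signal per speaker")
    return list(first_audio_signals)


# Same signal for all four speakers.
shared = route_to_speaker_models([[0.1, 0.2]], 4)

# A different signal for each of the four speakers.
separate = route_to_speaker_models([[0.1], [0.2], [0.3], [0.4]], 4)
```

In both cases the result has one entry per speaker model, so the downstream speaker model, delay attenuation, and summing steps are unchanged.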

Referring to FIG. 12, an application example of a circuit structure 120 with four speakers and a single microphone according to an embodiment of the present disclosure is shown. As shown in FIG. 12, the circuit structure 120 includes a speaker group 1201, a microphone 1202, a speaker model group 1203, a delay attenuation module group 1204, a summing module 1205, a microphone model 1206, an echo processing module 1207, a voice processing module 1208, a noise reduction and other processing module 1209, a decoder 1210, an encoder 1211, and a transmitting end 1212. The speaker group 1201 includes a first speaker 1201a, a second speaker 1201b, a third speaker 1201c, and a fourth speaker 1201d. The speaker model group 1203 includes a first speaker model 1203a, a second speaker model 1203b, a third speaker model 1203c, and a fourth speaker model 1203d. The delay attenuation module group 1204 includes a first delay attenuation module 1204a, a second delay attenuation module 1204b, a third delay attenuation module 1204c, and a fourth delay attenuation module 1204d. It should also be noted that the circuit structure 120 shown in FIG. 12 is similar to the circuit structure 90 shown in FIG. 9, except that the first audio signal is divided into four first audio signals after passing through the voice processing module 1208. Each first audio signal correspondingly enters and is played by a speaker of the speaker group 1201, and the first audio signals received by the first speaker 1201a, the second speaker 1201b, the third speaker 1201c, and the fourth speaker 1201d are correspondingly input to the first speaker model 1203a, the second speaker model 1203b, the third speaker model 1203c, and the fourth speaker model 1203d. The other operations are the same as those of the circuit structure 90 shown in FIG. 9, and will not be described in detail here.

According to the signal processing method provided in this embodiment, an echo howling phenomenon generated when a high-frequency resonance peak of a speaker or a microphone is low can be effectively solved, and a calculation workload in a multi-microphone design is also reduced.

Referring to FIG. 13, which shows an application example of a signal processing method according to an embodiment of the present disclosure applied to the circuit structure 100 of the four-speaker and single-microphone shown in FIG. 10, the method may include steps S1301 to S1305.

In step S1301, a digital signal is extracted from the front end of the speaker group 1001 to obtain a first audio signal.

In step S1302, the first audio signal is input to the first speaker model 1003a, the second speaker model 1003b, the third speaker model 1003c, and the fourth speaker model 1003d, respectively, and an acoustic response process is performed to obtain an acoustic response signal of each speaker.

In step S1303, a reference signal of each speaker is obtained according to the acoustic response signal of each speaker in the speaker group 1001 and the first delay and attenuation information corresponding to the acoustic response signal.

In step S1304, the reference signals of the speakers are superimposed and bit-expanded to obtain a superimposed reference signal, and the superimposed reference signal is used as an echo estimation signal.

In step S1305, the echo estimation signal is input to an echo processing module 1006 to perform echo processing on a second audio signal recorded by the microphone 1002.

It should be noted that the above process does not consider correction of the echo estimation signal by a microphone model. For example, taking the circuit structure 100 shown in FIG. 10 as an example and in combination with the foregoing example, after the superimposed reference signal is obtained, it can be used directly as the echo estimation signal, and the echo estimation signal is then used to perform echo processing on the second audio signal recorded by the microphone 1002.
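The final step of this flow, removing the echo estimation signal from the second audio signal, can be sketched as follows. The text does not specify how the echo processing module operates internally; the sketch assumes plain sample-by-sample subtraction of a perfectly aligned estimate, which is an idealization. The signal values and the function name are hypothetical.

```python
def remove_echo(second_audio, echo_estimate):
    """Subtract the echo estimate from the microphone signal, sample by sample.

    The estimate is zero-padded if it is shorter than the microphone signal.
    """
    padding = [0.0] * max(0, len(second_audio) - len(echo_estimate))
    padded = list(echo_estimate) + padding
    return [m - e for m, e in zip(second_audio, padded)]


# Hypothetical near-end speech and echo picked up by the microphone.
near_end = [0.25, -0.5, 0.125, 0.0]
echo = [0.5, 1.0, 0.5, 0.0]
second_audio = [n + e for n, e in zip(near_end, echo)]

# In this idealized case the estimate equals the true echo exactly.
residual = remove_echo(second_audio, echo)
```

When the estimate matches the true echo, the residual is the near-end speech alone. In practice the estimate is only approximate, so some echo remains after this step.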

Referring to FIG. 14, which shows an application example of a signal processing method according to an embodiment of the present disclosure applied to the circuit structure 90 of the four-speaker and single-microphone shown in FIG. 9, the method may include steps S1401 to S1409.

In step S1401, a digital signal is extracted from the front end of the speaker group 901 to obtain a first audio signal.

In step S1402, the first audio signal is input to the first speaker model 903a, the second speaker model 903b, the third speaker model 903c, and the fourth speaker model 903d, respectively, and an acoustic response process is performed to obtain an acoustic response signal of each speaker.

In step S1403, a reference signal of each speaker is obtained according to an acoustic response signal of each speaker in the speaker group 901 and first delay and attenuation information corresponding to the acoustic response signal.

In step S1404, the reference signals of the speakers are subjected to superposition and bit expansion processing to obtain a superimposed reference signal.

In step S1405, a high-frequency resonance peak of the microphone 902 is acquired.

In step S1406, the high-frequency resonance peak of the microphone 902 is compared with a preset high-frequency resonance peak.

In step S1407, if the high-frequency resonance peak of the microphone 902 is higher than a preset high-frequency resonance peak, the superimposed reference signal is directly used as an echo estimation signal.

In step S1408, if the high-frequency resonance peak of the microphone 902 is not higher than a preset high-frequency resonance peak, an acoustic response process is performed on the superimposed reference signal based on the microphone model 906 to obtain an echo estimation signal.

In step S1409, the echo estimation signal is input to an echo processing module 907 to perform echo processing on the second audio signal recorded by the microphone 902.

It should be noted that, compared with the process shown in FIG. 13, the process shown in FIG. 14 adds a microphone model and a determination as to whether the microphone model is required to perform acoustic response processing on the superimposed reference signal. For example, taking the circuit structure 90 shown in FIG. 9 as an example and in combination with the foregoing example, after the superimposed reference signal is obtained, a high-frequency resonance peak of the microphone 902 is obtained. Assume that the preset high-frequency resonance peak is 8 kHz. If the high-frequency resonance peak of the microphone 902 is 5 kHz, that is, lower than the preset high-frequency resonance peak, the microphone model 906 is required to perform acoustic response processing on the superimposed reference signal, so that a corrected echo estimation signal can be obtained, and the second audio signal recorded by the microphone 902 is subjected to echo processing using the corrected echo estimation signal. If the high-frequency resonance peak of the microphone 902 is 9 kHz, that is, higher than the preset high-frequency resonance peak, the microphone model 906 is not required to perform acoustic response processing on the superimposed reference signal; that is, the echo estimation signal does not need to be corrected by the microphone model 906, and the superimposed reference signal is used directly as the echo estimation signal to perform echo processing on the second audio signal recorded by the microphone 902.

Referring to FIG. 15, there is shown an application example in which the signal processing method according to the embodiment of the present disclosure is applied to the circuit structure 120 of the four-speaker and single-microphone shown in FIG. 12, and the method may include steps S1501 to S1509.

In step S1501, one digital signal is extracted from the front end of each speaker in the speaker group 1201 to obtain four first audio signals.

In step S1502, the four first audio signals are correspondingly input to the first speaker model 1203a, the second speaker model 1203b, the third speaker model 1203c, and the fourth speaker model 1203d, and acoustic response processing is performed to obtain the acoustic response signal of each speaker.

In step S1503, a reference signal for each speaker is obtained according to the acoustic response signal of each speaker in the speaker group 1201 and the first delay and attenuation information corresponding to the acoustic response signal.

In step S1504, the reference signals of the speakers are subjected to superposition and bit expansion processing to obtain a superimposed reference signal.

In step S1505, a high-frequency resonance peak of the microphone 1202 is acquired.

In step S1506, the high-frequency resonance peak of the microphone 1202 is compared with a preset high-frequency resonance peak.

In step S1507, if the high-frequency resonance peak of the microphone 1202 is higher than a preset high-frequency resonance peak, the superimposed reference signal is used as an echo estimation signal.

In step S1508, if the high-frequency resonance peak of the microphone 1202 is not higher than a preset high-frequency resonance peak, an acoustic response process is performed on the superimposed reference signal based on the microphone model 1206 to obtain an echo estimation signal.

In step S1509, the echo estimation signal is input to an echo processing module 1207 to perform echo processing on the second audio signal recorded by the microphone 1202.

It should be noted that, compared with the process shown in FIG. 14, the process shown in FIG. 15 considers that the audio signals input by each speaker in the speaker group may be different. The other processing steps are the same as the corresponding processing steps shown in FIG. 14 and will not be described in detail here.

Referring to FIG. 16, which shows an application example of a signal processing method according to an embodiment of the present disclosure applied to the circuit structure 110 of the four speakers and dual microphones shown in FIG. 11, the method may include steps S1601 to S1617.

In step S1601, one digital signal is extracted from the front end of each speaker in the speaker group 1101 to obtain four first audio signals.

In step S1602, the four first audio signals are correspondingly input to the first speaker model 1103a, the second speaker model 1103b, the third speaker model 1103c, and the fourth speaker model 1103d, and an acoustic response process is performed to obtain an acoustic response signal of each speaker.

In step S1603, a first reference signal of each speaker is obtained according to an acoustic response signal of each speaker in the speaker group 1101 and first delay and attenuation information corresponding to the acoustic response signal.

In step S1604, the first reference signals of the speakers are subjected to superposition and bit expansion processing to obtain a first superimposed reference signal.

In step S1605, a high-frequency resonance peak of the first microphone 1102a is acquired.

In step S1606, the high-frequency resonance peak of the first microphone 1102a is compared with a preset high-frequency resonance peak.

In step S1607, if the high-frequency resonance peak of the first microphone 1102a is higher than a preset high-frequency resonance peak, the first superimposed reference signal is used as a first echo estimation signal.

In step S1608, if the high-frequency resonance peak of the first microphone 1102a is not higher than a preset high-frequency resonance peak, an acoustic response process is performed on the first superimposed reference signal based on the first microphone model 1106a to obtain a first echo estimation signal.

In step S1609, the first echo estimation signal is input to a first echo processing module 1107a to perform a first echo processing on a second audio signal recorded by the first microphone 1102a.

In step S1610, the second reference signal of each speaker is obtained according to the acoustic response signal of each speaker in the speaker group 1101 and the second delay and attenuation information corresponding to the acoustic response signal.

In step S1611, the second reference signal of each speaker is subjected to superposition and bit expansion processing to obtain a second superimposed reference signal.

In step S1612, a high-frequency resonance peak of the second microphone 1102b is acquired.

In step S1613, the high-frequency resonance peak of the second microphone 1102b is compared with a preset high-frequency resonance peak.

In step S1614, if the high-frequency resonance peak of the second microphone 1102b is higher than a preset high-frequency resonance peak, the second superimposed reference signal is used as a second echo estimation signal.

In step S1615, if the high-frequency resonance peak of the second microphone 1102b is not higher than the preset high-frequency resonance peak, acoustic response processing is performed on the second superimposed reference signal based on the second microphone model 1106b to obtain a second echo estimation signal.

In step S1616, the second echo estimation signal is input to a second echo processing module 1107b to perform a second echo processing on the second audio signal recorded by the second microphone 1102b.

In step S1617, the audio signals after the first echo processing and the second echo processing are input to the noise reduction and other processing module 1109.

It should be noted that, compared with the process shown in FIG. 15, the process shown in FIG. 16 considers the echo processing situation of the dual microphones. Other processing steps are basically the same as the corresponding processing steps shown in FIG. 15, and will not be described in detail here.

It should be noted that for dual-microphone applications, the reference signal of the echo signal (i.e., the echo estimation signal) in each microphone needs to be obtained separately, and echo processing needs to be performed separately (for example, by the first echo processing module 1107a and the second echo processing module 1107b in FIG. 11). As for the acoustic response processing of the speaker model, the speaker group is shared, and the speaker model and the delay attenuation module are provided separately. The speaker model is therefore considered first, followed by the position and distance information between the speaker and each microphone, so that the audio signal at the front end of the speaker only needs to undergo the acoustic response processing of the speaker model once, which helps to reduce the computational workload in a multi-microphone design.
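The saving described above can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the speaker model is assumed here to be a short FIR filter, and the names `fir_filter`, `delay_and_attenuate`, `superimposed_references`, `speaker_taps` and `paths` are hypothetical. The point it shows is only the ordering: each speaker model is applied once, and the per-microphone delay and attenuation are applied afterwards to the shared result.

```python
def fir_filter(signal, taps):
    """Apply an FIR filter (assumed stand-in for a speaker model)."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def delay_and_attenuate(signal, delay_samples, gain):
    """Delay by whole samples and scale, keeping the original length."""
    delayed = [0.0] * delay_samples + list(signal)
    return [gain * s for s in delayed[:len(signal)]]


def superimposed_references(first_audio, speaker_taps, paths):
    """Return one superimposed reference signal per microphone.

    first_audio[s]  : signal fed to speaker s
    speaker_taps[s] : assumed FIR taps modelling speaker s
    paths[m][s]     : (delay_samples, gain) from speaker s to microphone m
    """
    # The speaker models run once, independent of the number of microphones.
    responses = [fir_filter(x, taps)
                 for x, taps in zip(first_audio, speaker_taps)]
    references = []
    for mic_paths in paths:
        total = [0.0] * len(responses[0])
        for response, (delay, gain) in zip(responses, mic_paths):
            ref = delay_and_attenuate(response, delay, gain)
            total = [a + b for a, b in zip(total, ref)]
        references.append(total)
    return references
```

With two speakers and two microphones, `fir_filter` is called twice rather than four times; only the inexpensive delay and gain step is repeated per microphone.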

The foregoing embodiments perform echo processing on the echo signals of four speakers. The embodiments of the present disclosure are also suitable for echo processing with other numbers of speakers and microphones, as well as for echo processing of a single speaker. In the embodiments of the present disclosure, the number of speakers is not specifically limited.

Referring to FIG. 17, which shows an application example of the signal processing method according to the embodiment of the present disclosure applied to the circuit structure 180 of the single speaker and single microphone shown in FIG. 18, the method may include steps S1701 to S1708.

In step S1701, a digital signal is extracted from the front end of the speaker 1801 to obtain a first audio signal.

In step S1702, the first audio signal is input to a speaker model 1803 for acoustic response processing to obtain an acoustic response signal of the speaker 1801.

In step S1703, a reference signal of the speaker 1801 is obtained according to the acoustic response signal of the speaker 1801 and the delay and attenuation information corresponding to the acoustic response signal.

In step S1704, a high-frequency resonance peak of the microphone 1802 is acquired.

In step S1705, the high-frequency resonance peak of the microphone 1802 is compared with a preset high-frequency resonance peak.

In step S1706, if the high-frequency resonance peak of the microphone 1802 is higher than a preset high-frequency resonance peak, the reference signal is used as an echo estimation signal.

In step S1707, if the high-frequency resonance peak of the microphone 1802 is not higher than a preset high-frequency resonance peak, an acoustic response process is performed on the reference signal based on the microphone model 1805 to obtain an echo estimation signal.

In step S1708, the echo estimation signal is input to the adder 1806 to perform echo processing on the second audio signal recorded by the microphone 1802.

The circuit structure 180 shown in FIG. 18 is taken as an example. As shown in FIG. 18, the circuit structure 180 includes a speaker 1801, a microphone 1802, a speaker model 1803, a delay attenuation module 1804, a microphone model 1805, an adder 1806, a speech processing module 1807, a noise reduction and other processing module 1808, a decoder 1809, an encoder 1810, and a transmitting end 1811. The speaker model 1803 is established based on the speaker 1801, and the delay attenuation module 1804 applies the corresponding delay and attenuation to the audio signal that has passed through the speaker model 1803. The delay and attenuation information is obtained according to the distance and position information between the speaker 1801 and the microphone 1802. The adder 1806 has the same function as the echo processing module: it performs echo processing on the echo signal, contained in the second audio signal recorded by the microphone 1802, that is generated by the first audio signal played by the speaker 1801. The operation of the circuit structure shown in FIG. 18 is similar to that of the aforementioned multi-speaker circuit structure and is not described in detail here.
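As a rough illustration of how delay and attenuation information might be derived from distance, and of what the adder does with the resulting echo estimation signal, consider the following Python sketch. The disclosure does not give formulas, so everything here is an assumption: the speed of sound constant, the inverse-distance gain law, sample-wise subtraction as the removal step, and the names `delay_attenuation_from_distance` and `remove_echo`.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def delay_attenuation_from_distance(distance_m, sample_rate_hz,
                                    ref_distance_m=1.0):
    """Return (delay_samples, gain) for a speaker-to-microphone path.

    Assumptions, not taken from the disclosure:
    - delay is the acoustic travel time rounded to whole samples;
    - gain follows an inverse-distance law and is capped at 1.0.
    """
    delay_samples = round(distance_m / SPEED_OF_SOUND_M_S * sample_rate_hz)
    gain = ref_distance_m / max(distance_m, ref_distance_m)
    return delay_samples, gain


def remove_echo(second_audio, echo_estimate):
    """Adder stage, assumed to subtract the echo estimate sample by sample."""
    return [m - e for m, e in zip(second_audio, echo_estimate)]
```

In a handset the speaker and microphone are only centimetres apart, so the delay is typically a handful of samples; a real device would more likely use measured values than this geometric estimate.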

Referring to FIG. 19, which shows an application example of a signal processing method according to an embodiment of the present disclosure applied to the circuit structure 200 of the dual speaker and single microphone shown in FIG. 20, the method may include steps S1901 to S1909.

In step S1901, one digital signal is extracted from the front ends of the first speaker 2001a and the second speaker 2001b respectively to obtain two first audio signals.

In step S1902, two channels of the first audio signal are correspondingly input to the first speaker model 2003a and the second speaker model 2003b to perform an acoustic response process to obtain an acoustic response signal of each speaker.

In step S1903, a reference signal of each speaker is obtained according to the acoustic response signal of each speaker and the first delay attenuation module 2004a and the second delay attenuation module 2004b.

In step S1904, the reference signal of each speaker is subjected to superposition and bit expansion processing to obtain a superimposed reference signal.

In step S1905, a high-frequency resonance peak of the microphone 2002 is acquired.

In step S1906, the high-frequency resonance peak of the microphone 2002 is compared with a preset high-frequency resonance peak.

In step S1907, if the high-frequency resonance peak of the microphone 2002 is higher than a preset high-frequency resonance peak, the superimposed reference signal is used as an echo estimation signal.

In step S1908, if the high-frequency resonance peak of the microphone 2002 is not higher than a preset high-frequency resonance peak, an acoustic response process is performed on the superimposed reference signal based on the microphone model 2006 to obtain an echo estimation signal.

In step S1909, the echo estimation signal is input to an echo processing module 2007 to perform echo processing on the second audio signal recorded by the microphone 2002.

The circuit structure 200 shown in FIG. 20 is taken as an example. As shown in FIG. 20, the circuit structure 200 includes a first speaker 2001a, a second speaker 2001b, a microphone 2002, a first speaker model 2003a, a second speaker model 2003b, a first delay attenuation module 2004a, a second delay attenuation module 2004b, a summing module 2005, a microphone model 2006, an adder 2007, a speech processing module 2008, a noise reduction and other processing module 2009, a decoder 2010, an encoder 2011, and a transmitting end 2012. The adder 2007 has the same function as the echo processing module. The operation of the circuit structure shown in FIG. 20 is similar to that of the aforementioned multi-speaker circuit structure (such as that of FIG. 11) and is not described in detail here.
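The superposition and bit expansion performed by the summing module can be sketched as follows. The disclosure does not define "bit expansion"; this sketch reads it as widening the sample word length so that the sum of several 16-bit reference signals cannot overflow, which is an interpretation. The function name and the 16-bit and 32-bit widths are illustrative assumptions.

```python
INT16_MIN, INT16_MAX = -32768, 32767
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def superimpose_with_bit_expansion(references):
    """Sum equal-length 16-bit reference signals into 32-bit-range samples.

    Python integers do not overflow, so the range checks below stand in
    for the word-length widening that fixed-point hardware would need.
    """
    length = len(references[0])
    out = []
    for n in range(length):
        total = 0
        for ref in references:
            sample = ref[n]
            if not INT16_MIN <= sample <= INT16_MAX:
                raise ValueError("input sample does not fit in 16 bits")
            total += sample
        if not INT32_MIN <= total <= INT32_MAX:
            raise OverflowError("sum does not fit in 32 bits")
        out.append(total)
    return out
```

Two loud speakers at 30000 each sum to 60000, which no longer fits in 16 bits; this is the case the widening is meant to cover.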

The foregoing embodiments describe the specific implementation in detail. It can be seen that the technical solutions of the foregoing embodiments effectively solve the problem of echo whistling generated when the high-frequency resonance peak of the speaker or microphone is low, and also reduce the computational workload in multi-microphone designs.

FIGS. 21 to 25 are schematic structural diagrams of a signal processing device according to an embodiment of the present disclosure.

Referring to FIG. 21, the composition of a signal processing device 210 according to an embodiment of the present disclosure is shown. The signal processing device 210 may include at least one speaker 2101, at least one microphone 2102, a first receiving part 2103, a first obtaining part 2104, a second receiving part 2105, and a second obtaining part 2106.

The first receiving part 2103 is configured to receive a first audio signal and play the first audio signal by the at least one speaker.

The first obtaining part 2104 is configured to obtain at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model, and the first audio signal, wherein the at least one speaker model is obtained based on the at least one speaker, and the at least one microphone model is obtained based on the at least one microphone.

The second receiving part 2105 is configured to receive a second audio signal by using the at least one microphone, wherein the second audio signal includes an echo signal generated by the first audio signal that is output by the at least one speaker and received by the at least one microphone.

The second acquisition component 2106 is configured to remove at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an echo-processed audio signal.

Referring to FIG. 22, the signal processing device 210 may further include a pre-processing component 2107 configured to demodulate and pre-process the first audio signal, where the first audio signal is generated and transmitted by a remote device.

Referring to FIG. 23, the signal processing device 210 may further include a modeling component 2108 configured to: correspondingly establish at least one speaker model according to the characteristic information of the at least one speaker, where the characteristic information of the at least one speaker includes circuit information and structure information corresponding to the at least one speaker; and correspondingly establish at least one microphone model according to the characteristic information of the at least one microphone, where the characteristic information of the at least one microphone includes circuit information and structure information corresponding to the at least one microphone.

The first obtaining part 2104 may be configured to: input the first audio signal into each speaker model and perform acoustic response processing to obtain an acoustic response signal of each speaker; obtain a first reference signal of each speaker according to the acoustic response signal of each speaker and first delay and attenuation information corresponding to the acoustic response signal, wherein the first delay and attenuation information is correspondingly obtained based on the distance and position information between each speaker and the microphone; perform superposition and bit expansion processing on the first reference signal of each speaker to obtain a first superimposed reference signal; and perform acoustic response processing on the first superimposed reference signal based on the microphone model to obtain a first echo estimation signal of the first audio signal.

Referring to FIG. 24, the signal processing device 210 may further include a first comparison component 2109 configured to: obtain a high-frequency resonance peak of the microphone; compare the high-frequency resonance peak of the microphone with a preset high-frequency resonance peak; and, in response to the high-frequency resonance peak of the microphone being higher than the preset high-frequency resonance peak, cancel the acoustic response processing performed by the microphone model on the first superimposed reference signal.

The first obtaining part 2104 may be configured to: input the first audio signal into each speaker model and perform acoustic response processing to obtain an acoustic response signal of each speaker; obtain a first reference signal and a second reference signal of each speaker according to the acoustic response signal of each speaker, first delay and attenuation information corresponding to the acoustic response signal, and second delay and attenuation information corresponding to the acoustic response signal, wherein the first delay and attenuation information is correspondingly obtained based on the distance and position information between each speaker and a first microphone of the at least one microphone, and the second delay and attenuation information is correspondingly obtained based on the distance and position information between each speaker and a second microphone of the at least one microphone; perform superposition and bit expansion processing on the first reference signal and the second reference signal of each speaker, respectively, to obtain a first superimposed reference signal and a second superimposed reference signal; perform acoustic response processing on the first superimposed reference signal based on a first microphone model of the at least one microphone model to obtain a first echo estimation signal of the first audio signal; and perform acoustic response processing on the second superimposed reference signal based on a second microphone model of the at least one microphone model to obtain a second echo estimation signal of the first audio signal.

Referring to FIG. 25, the signal processing device 210 may further include a second comparison component 2110 configured to: obtain a high-frequency resonance peak of the first microphone and a high-frequency resonance peak of the second microphone; compare the high-frequency resonance peak of the first microphone and the high-frequency resonance peak of the second microphone, respectively, with a preset high-frequency resonance peak; in response to the high-frequency resonance peak of the first microphone being higher than the preset high-frequency resonance peak, cancel the acoustic response processing performed by the first microphone model on the first superimposed reference signal; and, in response to the high-frequency resonance peak of the second microphone being higher than the preset high-frequency resonance peak, cancel the acoustic response processing performed by the second microphone model on the second superimposed reference signal.
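The comparison logic of the comparison components can be summarised in a short Python sketch. It assumes that each microphone model is available as a callable and that resonance peaks are plain numbers in whatever unit the device uses; the disclosure specifies neither. The name `echo_estimates` and its parameters are hypothetical.

```python
def echo_estimates(superimposed_refs, mic_peaks, preset_peak, mic_models):
    """Return one echo estimation signal per microphone.

    For each microphone, the microphone model is skipped when the
    microphone's high-frequency resonance peak is higher than the preset
    peak; otherwise the model is applied to the superimposed reference.
    """
    estimates = []
    for ref, peak, model in zip(superimposed_refs, mic_peaks, mic_models):
        if peak > preset_peak:
            # Peak above the preset: use the superimposed reference as is.
            estimates.append(list(ref))
        else:
            # Peak not above the preset: apply the microphone model.
            estimates.append(model(ref))
    return estimates
```

A peak exactly equal to the preset value counts as "not higher", so the microphone model is still applied in that case.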

The first obtaining part 2104 may be configured to: in response to the first audio signals received by the at least one speaker being the same, input the first audio signal to each of the at least one speaker model and perform acoustic response processing to obtain an acoustic response signal of each speaker; and, in response to the first audio signals received by the at least one speaker being different, input each first audio signal to the corresponding one of the at least one speaker model and perform acoustic response processing to obtain an acoustic response signal of each speaker.
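The two routing cases can be sketched in Python as follows. The sketch assumes that "the same first audio signal" is represented by a single channel and "different first audio signals" by one channel per speaker, and that speaker models are callables; the name `route_to_speaker_models` is hypothetical.

```python
def route_to_speaker_models(first_audio_channels, speaker_models):
    """Return the acoustic response signal of each speaker.

    - One channel: every speaker plays the same signal, so that channel
      is fed to every speaker model.
    - One channel per speaker: each channel is fed to its own model.
    """
    if len(first_audio_channels) == 1:
        shared = first_audio_channels[0]
        return [model(shared) for model in speaker_models]
    if len(first_audio_channels) != len(speaker_models):
        raise ValueError("need one channel in total or one per speaker")
    return [model(channel)
            for model, channel in zip(speaker_models, first_audio_channels)]
```

In both cases the result has one acoustic response signal per speaker, so the later delay, attenuation and superposition steps do not need to know which case applied.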

It can be understood that, in this embodiment, a “component” may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a unit, and may be modular or non-modular.

In addition, each component in this embodiment may be integrated into one processing unit, or each unit may exist separately physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of software functional modules.

If the integrated unit is implemented in the form of a software functional module and is not sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of this embodiment may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method described in this embodiment. The foregoing storage media include U disks, mobile hard disks, read-only memories (ROM), random access memories (RAM), magnetic disks, optical disks, and other media that can store program code.

Therefore, an embodiment of the present disclosure provides a computer storage medium on which a computer program is stored. When the computer program is executed by at least one processor, the at least one processor executes a signal processing method according to embodiments of the present disclosure.

Referring to FIG. 26, which illustrates a hardware structure of a signal processing device 210 according to an embodiment of the present disclosure, the signal processing device 210 may include a network interface 2601, a memory 2602, and a processor 2603. The various components are coupled together by a bus system 2604. It can be understood that the bus system 2604 is used to implement connection and communication between these components. The bus system 2604 may include a data bus, and may further include a power bus, a control bus, and a status signal bus. For the sake of clarity, various buses are labeled as the bus system 2604 in FIG. 26.

The network interface 2601 is used to receive and send signals during the process of transmitting and receiving information with other external network elements.

The memory 2602 stores a computer program capable of running on the processor 2603.

When the processor 2603 runs the computer program, it can execute a signal processing method according to various embodiments of the present disclosure.

It can be understood that the memory 2602 in the embodiment of the present disclosure may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), Synchlink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 2602 of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.

The processor 2603 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 2603 or by instructions in the form of software. The processor 2603 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in combination with the embodiments of the present disclosure may be directly embodied as being executed by a hardware decoding processor, or may be executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium that is well established in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 2602, and the processor 2603 reads the information in the memory 2602 and completes the steps of the foregoing method in combination with its hardware.

It can be understood that the embodiments described herein may be implemented by hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit can be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described in this disclosure, or a combination thereof.

For software implementation, the techniques described herein can be implemented through modules (e.g., procedures, functions, etc.) that perform the functions described herein. Software codes may be stored in a memory and executed by a processor. The memory may be implemented in the processor or external to the processor.

It should be noted that the technical solutions described in the embodiments of the present disclosure can be arbitrarily combined without conflict.

The above is only a specific implementation of the present disclosure, but the scope of protection of the present disclosure is not limited to this. Any changes or replacements that a person skilled in the art can readily conceive of within the technical scope disclosed in the present disclosure shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims

A signal processing method applied to a signal processing device having at least one speaker and at least one microphone, the method comprising:

Receiving a first audio signal and playing the first audio signal by the at least one speaker;

Obtaining at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model, and the first audio signal, wherein the at least one speaker model is obtained based on the at least one speaker, and the at least one microphone model is obtained based on the at least one microphone;

Receiving a second audio signal using the at least one microphone, wherein the second audio signal includes an echo signal generated by the first audio signal output by the at least one speaker and received through the at least one microphone; and

Removing at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an echo-processed audio signal.
The signal processing method according to claim 1, wherein before the step of receiving a first audio signal and playing the first audio signal by the at least one speaker, the method further comprises:

Demodulate and pre-process the first audio signal, wherein the first audio signal is generated and transmitted by a remote device.
The signal processing method according to claim 1, wherein before the step of obtaining at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model, and the first audio signal, the method further comprises:

Establish at least one speaker model corresponding to the characteristic information of the at least one speaker, wherein the characteristic information of the at least one speaker includes circuit information and structure information corresponding to the at least one speaker; and

According to the characteristic information of the at least one microphone, at least one microphone model is correspondingly established, wherein the characteristic information of the at least one microphone includes circuit information and structure information corresponding to the at least one microphone.
The signal processing method according to claim 3, wherein the number of the at least one microphone is one, the number of corresponding microphone models is one, and the step of obtaining at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model, and the first audio signal comprises:

Inputting the first audio signal to each speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker;

Obtaining a first reference signal of each speaker according to the acoustic response signal of each speaker and first delay and attenuation information corresponding to the acoustic response signal, wherein the first delay and attenuation information is correspondingly obtained based on distance and position information between each speaker and the microphone;

Performing superposition and bit expansion processing on the first reference signal of each speaker to obtain a first superimposed reference signal; and

Based on the microphone model, an acoustic response process is performed on the first superimposed reference signal to obtain a first echo estimation signal of the first audio signal.
The signal processing method according to claim 4, wherein after the step of obtaining a first superimposed reference signal, the method further comprises:

Obtaining a high-frequency resonance peak of the microphone;

Comparing the high-frequency resonance peak of the microphone with a preset high-frequency resonance peak; and

In response to that the high-frequency resonance peak of the microphone is higher than a preset high-frequency resonance peak, the acoustic response processing of the microphone model to the first superimposed reference signal is cancelled.
The signal processing method according to claim 3, wherein the number of the at least one microphone is two, the number of corresponding microphone models is two, and the step of obtaining at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model, and the first audio signal comprises:

Inputting the first audio signal to each speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker;

Obtaining a first reference signal and a second reference signal of each speaker according to the acoustic response signal of each speaker, first delay and attenuation information corresponding to the acoustic response signal, and second delay and attenuation information corresponding to the acoustic response signal, wherein the first delay and attenuation information is correspondingly obtained based on distance and position information between each speaker and a first microphone of the at least one microphone, and the second delay and attenuation information is correspondingly obtained based on distance and position information between each speaker and a second microphone of the at least one microphone;

Performing superposition and bit expansion processing on the first reference signal and the second reference signal of each speaker, respectively, to obtain a first superimposed reference signal and a second superimposed reference signal;

Performing an acoustic response process on the first superimposed reference signal based on a first microphone model of the at least one microphone model to obtain a first echo estimation signal of the first audio signal; and

Based on the second microphone model of the at least one microphone model, performing acoustic response processing on the second superimposed reference signal to obtain a second echo estimation signal of the first audio signal.
The signal processing method according to claim 6, wherein after the step of obtaining a first superimposed reference signal and a second superimposed reference signal, the method further comprises:

Obtaining a high-frequency resonance peak of the first microphone and a high-frequency resonance peak of the second microphone;

Comparing the high-frequency resonance peak of the first microphone and the high-frequency resonance peak of the second microphone with preset high-frequency resonance peaks, respectively;

In response to the high-frequency resonance peak of the first microphone being higher than a preset high-frequency resonance peak, canceling the acoustic response processing of the first microphone model to the first superimposed reference signal; and

In response to the high-frequency resonance peak value of the second microphone being higher than a preset high-frequency resonance peak value, canceling the acoustic response processing of the second microphone model to the second superimposed reference signal.
The signal processing method according to any one of claims 4 to 7, wherein the step of inputting the first audio signal into each speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker comprises:

In response to the first audio signals received by the at least one speaker being the same, inputting the first audio signals to the at least one speaker model separately and performing acoustic response processing to obtain an acoustic response signal of each speaker; and

In response to different first audio signals received by the at least one speaker, the first audio signal is correspondingly input to the at least one speaker model and an acoustic response process is performed to obtain an acoustic response signal of each speaker.
A signal processing device, comprising at least one speaker, at least one microphone, a first receiving component, a first obtaining component, a second receiving component, and a second obtaining component, wherein:

The first receiving component is configured to receive a first audio signal and play the first audio signal by the at least one speaker;

The first obtaining component is configured to obtain at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model, and the first audio signal, wherein the at least one speaker model is obtained based on the at least one speaker, and the at least one microphone model is obtained based on the at least one microphone;

The second receiving component is configured to receive a second audio signal using the at least one microphone, wherein the second audio signal includes an echo signal generated by the first audio signal that is output by the at least one speaker and received by the at least one microphone; and

The second acquisition component is configured to remove at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an echo-processed audio signal.
The signal processing device according to claim 9, further comprising a pre-processing component configured to demodulate and pre-process the first audio signal, wherein the first audio signal is generated and transmitted by a remote device.
The signal processing device according to claim 9, further comprising a modeling component configured to:

Establish, according to the characteristic information of the at least one speaker, at least one corresponding speaker model, wherein the characteristic information of the at least one speaker includes circuit information and structure information corresponding to the at least one speaker; and

Establish, according to the characteristic information of the at least one microphone, at least one corresponding microphone model, wherein the characteristic information of the at least one microphone includes circuit information and structure information corresponding to the at least one microphone.
The signal processing device according to claim 11, wherein the first acquisition component is configured to:

Inputting the first audio signal to each speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker;

Obtaining a first reference signal of each speaker according to the acoustic response signal of each speaker and the first delay and attenuation information corresponding to the acoustic response signal, wherein the first delay and attenuation information is obtained based on distance and position information between each speaker and the microphone;

Performing superposition and bit expansion processing on the first reference signal of each speaker to obtain a first superimposed reference signal; and

Performing, based on the microphone model, acoustic response processing on the first superimposed reference signal to obtain a first echo estimation signal of the first audio signal.
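The delay, attenuation, superposition, and bit expansion steps of this claim can be sketched as below. Several points are assumptions made only for illustration: samples are treated as signed integers, the delay is given as a whole number of samples, the attenuation as a linear gain, and "bit expansion" is read as widening the word length so that the sum of several signals cannot overflow. The claims do not define these details. The speaker and microphone model stages are omitted, and the names `delay_and_attenuate` and `superimpose` are hypothetical.

```python
import math


def delay_and_attenuate(signal, delay_samples, gain):
    """Shift `signal` by `delay_samples` and scale it by `gain`,
    keeping the original length and integer samples."""
    delayed = ([0] * delay_samples + list(signal))[:len(signal)]
    return [round(gain * s) for s in delayed]


def superimpose(reference_signals, sample_bits=16):
    """Sum the per-speaker reference signals sample by sample and return
    the sum together with a word length wide enough to hold it."""
    count = len(reference_signals)
    extra_bits = math.ceil(math.log2(count)) if count > 1 else 0
    summed = [sum(samples) for samples in zip(*reference_signals)]
    return summed, sample_bits + extra_bits
```

For example, two 16-bit reference signals need 17 bits after superposition under this reading, and three or four need 18 bits.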
The signal processing device according to claim 12, further comprising a first comparison component configured to:

Obtaining a high-frequency resonance peak of the microphone;

Comparing the high-frequency resonance peak of the microphone with a preset high-frequency resonance peak; and

In response to the high-frequency resonance peak of the microphone being higher than the preset high-frequency resonance peak, cancelling the acoustic response processing performed by the microphone model on the first superimposed reference signal.
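The comparison in this claim amounts to a conditional bypass of the microphone model stage, which might be sketched as follows. The claim does not say in which quantity the resonance peak is measured, so the sketch simply compares two numbers in the same unit. The function name `apply_microphone_stage` and the callable `mic_model` are hypothetical.

```python
def apply_microphone_stage(superimposed, mic_peak, preset_peak, mic_model):
    """Return the echo estimate for one microphone.

    If the microphone's high-frequency resonance peak is higher than the
    preset peak, the microphone model is skipped and the superimposed
    reference signal is used unchanged; otherwise the model is applied.
    """
    if mic_peak > preset_peak:
        return list(superimposed)
    return mic_model(superimposed)
```

A peak equal to the preset value does not trigger the bypass here, which follows the "higher than" wording of the claim.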
The signal processing device according to claim 11, wherein the first acquisition component is configured to:

Inputting the first audio signal to each speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker;

Obtaining a first reference signal and a second reference signal of each speaker according to the acoustic response signal of each speaker, the first delay and attenuation information corresponding to the acoustic response signal, and the second delay and attenuation information corresponding to the acoustic response signal, wherein the first delay and attenuation information is obtained based on distance and position information between each speaker and a first microphone of the at least one microphone, and the second delay and attenuation information is obtained based on distance and position information between each speaker and a second microphone of the at least one microphone;

Performing superposition and bit expansion on the first reference signals and on the second reference signals of the speakers, respectively, to obtain a first superimposed reference signal and a second superimposed reference signal;

Performing an acoustic response process on the first superimposed reference signal based on a first microphone model of the at least one microphone model to obtain a first echo estimation signal of the first audio signal; and

Based on the second microphone model of the at least one microphone model, performing acoustic response processing on the second superimposed reference signal to obtain a second echo estimation signal of the first audio signal.
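With two microphones, each speaker-to-microphone pair has its own delay and attenuation, and each microphone has its own model. A minimal sketch of that per-microphone loop is given below. It assumes whole-sample delays, linear gains, and microphone models supplied as callables, none of which is specified in the claims; bit expansion is left out because floating-point samples are used. The name `echo_estimates` and the layout of `paths` are hypothetical.

```python
def echo_estimates(speaker_responses, paths, mic_models):
    """Return one echo estimation signal per microphone.

    speaker_responses: one acoustic response signal per speaker.
    paths[m][s]: (delay_samples, gain) from speaker s to microphone m.
    mic_models[m]: callable applying the model of microphone m.
    """
    length = len(speaker_responses[0])
    estimates = []
    for mic_paths, mic_model in zip(paths, mic_models):
        superimposed = [0.0] * length
        for response, (delay, gain) in zip(speaker_responses, mic_paths):
            # Add this speaker's delayed and attenuated reference signal.
            for n in range(delay, length):
                superimposed[n] += gain * response[n - delay]
        estimates.append(mic_model(superimposed))
    return estimates
```

The first and second echo estimation signals of the claim correspond to the first and second entries of the returned list.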
The signal processing device according to claim 14, further comprising a second comparison component configured to:

Obtaining a high-frequency resonance peak of the first microphone and a high-frequency resonance peak of the second microphone;

Comparing the high-frequency resonance peak of the first microphone and the high-frequency resonance peak of the second microphone with preset high-frequency resonance peaks, respectively;

In response to the high-frequency resonance peak of the first microphone being higher than a preset high-frequency resonance peak, canceling the acoustic response processing of the first microphone model to the first superimposed reference signal; and

In response to the high-frequency resonance peak of the second microphone being higher than a preset high-frequency resonance peak, canceling the acoustic response processing of the second microphone model to the second superimposed reference signal.
The signal processing device according to any one of claims 12 to 15, wherein the first acquisition component is configured to:

In response to the first audio signals received by the at least one speaker being the same, inputting the first audio signals to the at least one speaker model separately and performing acoustic response processing to obtain an acoustic response signal of each speaker; and

In response to the first audio signals received by the at least one speaker being different, inputting each first audio signal to the corresponding speaker model and performing acoustic response processing to obtain an acoustic response signal of each speaker.
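The two cases of this claim differ only in how signals are routed to the speaker models, which can be sketched as follows. The convention that a single-element list means "all speakers receive the same signal" is an assumption made for this illustration, as are the name `speaker_responses` and the use of callables for the speaker models.

```python
def speaker_responses(first_audio_signals, speaker_models):
    """Return the acoustic response signal of each speaker.

    If one signal is given, it is fed to every speaker model separately.
    Otherwise signal i is fed to speaker model i.
    """
    if len(first_audio_signals) == 1:
        signals = first_audio_signals * len(speaker_models)
    elif len(first_audio_signals) == len(speaker_models):
        signals = first_audio_signals
    else:
        raise ValueError("expected one signal, or one signal per speaker")
    return [model(sig) for model, sig in zip(speaker_models, signals)]
```

A stereo device, for instance, would pass two signals and two models, while a device that plays the same mono signal on both speakers would pass one signal and two models.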
A signal processing device includes a memory and a processor, wherein:

The memory stores a computer program, and when the processor runs the computer program, the processor executes the signal processing method according to any one of claims 1 to 8.
A computer storage medium stores a computer program thereon, and when the computer program is executed by at least one processor, the at least one processor executes the signal processing method according to any one of claims 1 to 8.