CN115665602A

CN115665602A - Echo cancellation method, echo cancellation device, conference system, electronic device, and storage medium

Info

Publication number: CN115665602A
Application number: CN202211248585.XA
Authority: CN
Inventors: 鲍晓; 许丽; 熊世富; 万根顺; 高建清; 潘嘉; 刘聪
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2022-10-12
Filing date: 2022-10-12
Publication date: 2023-01-31

Abstract

The invention provides an echo cancellation method, an echo cancellation device, a conference system, an electronic device and a storage medium. The method comprises: acquiring a reference signal of each terminal and a microphone signal of any one of the terminals; performing feature extraction on the reference signal of each terminal and on the microphone signal of that terminal, and determining echo signal features based on the extracted reference signal features of each terminal and the extracted microphone signal features of that terminal; and performing echo cancellation on the microphone signal of that terminal based on the echo signal features to obtain an echo cancellation signal of that terminal. The method overcomes the inability of conventional echo cancellation methods to cancel echo in a multi-person conference scene, automates the management of audio acquisition and playback on the terminals, avoids the inconvenience of manual control, and improves the stability of the conference.

Description

Echo cancellation method, echo cancellation device, conference system, electronic device, and storage medium

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to an echo cancellation method and apparatus, a conference system, an electronic device, and a storage medium.

Background

With the development of information technology, the application of intelligent equipment in various fields is increasingly wide. Echo cancellation, as an indispensable link in intelligent device interaction, has been a hot spot of research by technicians in related fields.

Where a loudspeaker is acoustically coupled to a microphone, echo cancellation removes the far-end audio signal that is output by the loudspeaker and picked up by the microphone, so that the far-end audio signal is not sent back to the far end. A common echo cancellation method is implemented with an adaptive filter, i.e. an algorithm adaptively updates an estimate of the transfer function between the loudspeaker and the microphone.
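As a non-limiting illustration of the adaptive-filter approach mentioned above, the following sketch uses a normalized least-mean-squares (NLMS) filter with a single reference signal. The choice of NLMS, the function name `nlms_echo_cancel`, the step size, the tap count and the synthetic echo path are all assumptions made for this example; the text only states that an adaptive filter is used.

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Single-reference NLMS echo canceller (illustrative sketch only).

    mic: microphone samples (near-end speech plus echo of ref)
    ref: loudspeaker (far-end) samples, same length as mic
    Returns the error signal, i.e. mic with the estimated echo removed.
    """
    w = np.zeros(taps)      # estimate of the loudspeaker-to-microphone path
    buf = np.zeros(taps)    # most recent reference samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_hat = w @ buf                       # predicted echo sample
        e = mic[n] - echo_hat                    # residual after cancellation
        w += mu * e * buf / (buf @ buf + eps)    # normalized LMS update
        out[n] = e
    return out

# Synthetic check: echo only, no near-end talker.
rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)
path = np.array([0.0, 0.0, 0.6, 0.3, -0.1])     # hypothetical echo path
mic = np.convolve(ref, path)[:len(ref)]
residual = nlms_echo_cancel(mic, ref, taps=8)
```

The filter tracks one loudspeaker-to-microphone path with one delay, which is the limitation the following paragraph turns to.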

However, in a multi-person conference scenario there are often several active loudspeakers, and owing to differences in distance, network, hardware and the like, the audio signals played by these loudspeakers arrive with different time delays, so the conventional echo cancellation method cannot cancel the echo effectively.

Disclosure of Invention

The invention provides an echo cancellation method, an echo cancellation device, a conference system, an electronic device and a storage medium, which address the defect of the prior art that echo cancellation cannot be performed effectively in a multi-person conference scene because multiple time delays cannot be estimated simultaneously.

The invention provides an echo cancellation method, which comprises the following steps:

acquiring a reference signal of each terminal and a microphone signal of any terminal in each terminal;

respectively extracting the characteristics of the reference signal of each terminal and the microphone signal of any terminal, and determining the characteristics of the echo signal based on the reference signal characteristics of each terminal and the microphone signal characteristics of any terminal obtained by characteristic extraction;

and based on the echo signal characteristics, carrying out echo cancellation on the microphone signal of any terminal to obtain an echo cancellation signal of any terminal.

According to an echo cancellation method provided by the present invention, the determining an echo signal characteristic based on a reference signal characteristic of each terminal obtained by characteristic extraction and a microphone signal characteristic of any one terminal includes:

fusing the reference signal characteristics of each terminal to obtain reference signal fusion characteristics;

and determining echo signal characteristics based on the reference signal fusion characteristics and the microphone signal characteristics of any terminal.

The echo cancellation method provided by the invention further comprises the following steps:

determining the participation state of each participant corresponding to each terminal based on the state of each terminal and/or the voice detection result of each terminal;

performing pickup control on each terminal based on the participation state of each participant corresponding to each terminal;

wherein the state of a terminal is a handheld state or a placed state, and the participation state is any one of a discussion state, a speech state and a listening state.

According to an echo cancellation method provided by the present invention, the determining the participation state of each participant corresponding to each terminal based on the state of each terminal and/or the voice detection result of each terminal includes:

determining that the participation state of every participant corresponding to the terminals is the discussion state when all the terminals are in the placed state;

and, when any terminal is in the handheld state and the voice detection result of that terminal indicates that it has detected voice, determining that the participation state of the participant corresponding to that terminal is the speech state and that the participation states of the other participants are the listening state.
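As a non-limiting illustration, the two rules above can be sketched as follows. The enum and function names are hypothetical. The text does not say what happens when neither rule applies (for example, a terminal is held but no voice is detected) or when several held terminals detect voice at once; in this sketch the first case returns `None` and the second picks the first matching terminal, both of which are assumptions.

```python
from enum import Enum

class TerminalState(Enum):
    PLACED = "placed"
    HANDHELD = "handheld"

class ParticipationState(Enum):
    DISCUSSION = "discussion"
    SPEECH = "speech"
    LISTENING = "listening"

def participation_states(terminal_states, voice_detected):
    """terminal_states: dict terminal_id -> TerminalState
    voice_detected:  dict terminal_id -> bool (voice detection result)
    Returns dict terminal_id -> ParticipationState, or None when neither
    rule applies (that case is left open in this sketch)."""
    # Rule 1: every terminal is placed -> everyone is in the discussion state.
    if all(s is TerminalState.PLACED for s in terminal_states.values()):
        return {t: ParticipationState.DISCUSSION for t in terminal_states}
    # Rule 2: a held terminal that detects voice -> its participant speaks,
    # all other participants listen.
    for t, s in terminal_states.items():
        if s is TerminalState.HANDHELD and voice_detected.get(t, False):
            return {
                u: ParticipationState.SPEECH if u == t else ParticipationState.LISTENING
                for u in terminal_states
            }
    return None

states = participation_states(
    {"A": TerminalState.HANDHELD, "B": TerminalState.PLACED},
    {"A": True, "B": False},
)
```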

According to an echo cancellation method provided by the present invention, the performing pickup control on each terminal based on the participation state of each participant corresponding to each terminal includes:

when the participation state of the participant corresponding to any terminal is the speech state, acquiring the echo cancellation signal of that terminal;

and performing voice separation on the echo cancellation signal of that terminal based on the voiceprint characteristics of the participants to obtain a voice separation signal of that terminal, wherein the voiceprint characteristics of the participants are obtained by voiceprint extraction from the historical echo cancellation signals of the terminals;

when the participation states of all the participants are the discussion state, acquiring the microphone signals of the terminals;

and performing audio alignment on the microphone signals of the terminals, determining a target terminal based on the energy of each terminal's microphone signal after audio alignment, and taking the target terminal as the pickup terminal for the current speaker among the participants in the discussion state.
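As a non-limiting illustration of the energy-based selection in the last step, the sketch below assumes the microphone signals have already been time-aligned and picks the terminal with the highest mean energy. Choosing the maximum is an assumption of this example, since the text says only that the target terminal is determined based on energy; the function name `select_pickup_terminal` is hypothetical.

```python
import numpy as np

def select_pickup_terminal(aligned_signals):
    """aligned_signals: dict terminal_id -> 1-D array of time-aligned samples.
    Returns (terminal with the highest mean energy, energy per terminal)."""
    energies = {
        t: float(np.mean(np.square(np.asarray(x, dtype=float))))
        for t, x in aligned_signals.items()
    }
    return max(energies, key=energies.get), energies

# Three terminals hearing the same talker at different levels.
t = np.arange(100)
sigs = {"A": 0.1 * np.sin(t), "B": 0.9 * np.sin(t), "C": 0.3 * np.sin(t)}
best, energies = select_pickup_terminal(sigs)
```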

The present invention also provides an echo cancellation device, comprising:

a signal acquisition unit, configured to acquire a reference signal of each terminal and a microphone signal of any terminal in each terminal;

an echo determining unit, configured to perform feature extraction on the reference signal of each terminal and the microphone signal of any terminal, and determine an echo signal feature based on a reference signal feature of each terminal and a microphone signal feature of any terminal obtained through feature extraction;

and the echo cancellation unit is used for performing echo cancellation on the microphone signal of any terminal based on the echo signal characteristics to obtain an echo cancellation signal of any terminal.

The invention also provides a conference system, which comprises all the terminals and an echo cancellation device;

the echo cancellation device is configured to determine an echo signal characteristic based on a reference signal characteristic of the reference signal of each terminal and a microphone signal characteristic of a microphone signal of any terminal in the terminals, and perform echo cancellation on the microphone signal of any terminal based on the echo signal characteristic to obtain an echo cancellation signal of any terminal.

According to the conference system provided by the invention, the echo cancellation device is further configured to determine the participation state of each participant corresponding to each terminal based on the state of each terminal and/or the voice detection result of each terminal, and to perform pickup control on each terminal based on the participation state of each participant corresponding to each terminal;

wherein the state of a terminal is a handheld state or a placed state, and the participation state is any one of a discussion state, a speech state and a listening state.

According to the conference system provided by the invention, the echo cancellation device is specifically configured to acquire the echo cancellation signal of any terminal when the participation state of the participant corresponding to that terminal is the speech state, and to perform voice separation on the echo cancellation signal of that terminal based on the voiceprint characteristics of the participants to obtain a voice separation signal of that terminal;

wherein the voiceprint characteristics of the participants are obtained by voiceprint extraction from the historical echo cancellation signals of the terminals.

According to the conference system provided by the present invention, the echo cancellation device is specifically configured to, when the participation states of all the participants are the discussion state, acquire the microphone signals of the terminals, perform audio alignment on them, determine a target terminal based on the energy of each terminal's microphone signal after audio alignment, and use the target terminal as the pickup terminal for the current speaker among the participants in the discussion state.

The present invention also provides an electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the echo cancellation method as described in any of the above when executing the program.

The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the echo cancellation method as described in any of the above.

The echo cancellation method, the device, the conference system, the electronic equipment and the storage medium provided by the invention can simultaneously acquire the reference signal of each terminal, and utilize the reference signal of each terminal and the microphone signal of any terminal in each terminal to perform time delay estimation so as to acquire the channel propagation parameter of the reference signal of each terminal, thereby acquiring the echo signal characteristic, and perform echo cancellation on the microphone signal of the terminal according to the echo signal characteristic to acquire the echo cancellation signal of the terminal.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of an echo cancellation method provided by the present invention;

FIG. 2 is a schematic diagram of the echo signal feature determination process provided by the present invention;

FIG. 3 is an overall block diagram of the echo cancellation process provided by the present invention;

FIG. 4 is a schematic flow chart of the pickup control process provided by the present invention;

FIG. 5 is a schematic flow chart of voice separation during pickup control according to the present invention;

FIG. 6 is an overall framework diagram of the speech separation process provided by the present invention;

FIG. 7 is a schematic flow chart of microphone selection in the pickup control process provided by the present invention;

FIG. 8 is a schematic structural diagram of the echo cancellation device provided by the present invention;

FIG. 9 is a schematic diagram of a conference system provided by the present invention;

FIG. 10 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Multi-person meetings are a frequent activity in everyday office work and help people solve problems efficiently. In a multi-person conference, the acquisition, processing and playback of audio signals are key technologies; the whole process involves collecting the audio signals of participants at different positions, handling noise in the audio, cancelling echo between devices, and the like.

At present, the following audio acquisition and processing schemes in a multi-person conference scene mainly exist:

First, the gooseneck microphone scheme: several gooseneck microphones are deployed to collect audio signals, dynamic switching is supported, and the collected audio signals are fed centrally into a sound card and processed there. This scheme is mainly used for large, formal conferences and is common in large professional conference rooms, but its general applicability is poor.

Second, the single-terminal scheme, which is mainly used for conference discussions: the participants share one microphone array to join the conference. In an online multi-person conference, however, this scheme can result in poor sound quality at the far end.

Third, the distributed multi-terminal scheme: to make collaboration easier, participants join the conference with their own devices (terminals), so several terminals in the same conference room collect and play audio signals at the same time. To avoid echo interference, the participants must then control their terminals manually, that is, switch microphones and loudspeakers on and off by hand, which is very inconvenient.

Furthermore, in a multi-person conference scene there are often several active loudspeakers. The audio signal played by each loudspeaker reaches a microphone after propagation and spatial reflection, and owing to differences in distance, network, hardware and the like, the signals played by the different loudspeakers arrive with different time delays. Because current echo cancellation methods cannot estimate the time delays of audio signals played by several sound sources at the same time, they cannot cancel the echo effectively; in other words, they cannot perform echo cancellation for the multi-person conference scene.

To this end, the present invention provides a multi-source echo cancellation method applicable to a multi-person conference scene. The method acquires the reference signals of a plurality of terminals and estimates their time delays simultaneously, and performs echo cancellation on that basis. It thereby overcomes the inability of conventional echo cancellation methods to cancel echo in a multi-person conference scene, automates the management of audio acquisition and playback on the terminals, and avoids the inconvenience of manual control. FIG. 1 is a schematic flow chart of the echo cancellation method provided by the present invention. As shown in FIG. 1, the method is executed by a conference system and includes:

Step 110, acquiring a reference signal of each terminal and a microphone signal of any terminal in each terminal;

Here, the terminal is the device through which a participant accesses the conference; it may be a smartphone, a tablet computer, or the like, which is not specifically limited in this embodiment of the present invention.

The microphone signal is an audio signal picked up by a microphone, which may be the microphone of any terminal accessing the conference. In a multi-person conference scene, the microphone signal contains several echo signals, so several reference signals are needed to perform echo cancellation on it. A reference signal can be understood as a source signal that echo cancellation needs to remove. It is also an audio signal, namely the loudspeaker signal of a terminal accessing the conference; in other words, the conference system acquires the loudspeaker signal of each terminal as that terminal's reference signal. Taking a hands-free mobile phone call as an example, the microphone signal is the audio signal picked up by the phone's microphone, and the reference signal is the audio signal output by the phone's loudspeaker.

Step 120, respectively extracting the features of the reference signal of each terminal and the microphone signal of the terminal, and determining the echo signal features based on the reference signal features of each terminal and the microphone signal features of the terminal, which are obtained by feature extraction;

Specifically, in the embodiment of the present invention, after acquiring the reference signal of each terminal and the microphone signal of any terminal, the conference system may estimate a plurality of channel propagation parameters to achieve echo cancellation for multiple sound sources. The specific process is as follows:

For example, in the conference system, after the audio signal of a first participant is output by the loudspeaker of the terminal corresponding to a second participant, it forms an echo through spatial reflection and re-enters through the microphone of the second participant's terminal. The audio signal captured by that microphone then contains not only the audio signal of the second participant but also the echo signal of the first participant, so the audio signal output by the loudspeaker of the first participant's terminal contains both. In other words, the first participant hears the second participant's voice superimposed with the first participant's own voice, which seriously degrades conference quality. In this case echo cancellation is required: the echo signal is estimated by an echo cancellation technique, and the estimate is subtracted from the microphone signal to cancel the echo.

Because the microphone signal in a multi-person conference scene contains several echo signals, and an acoustic propagation path lies between the reference signal of each terminal and the actual echo signal, the conventional adaptive echo cancellation method has to estimate the parameters of each propagation path in real time. Owing to differences in distance, network, hardware and the like, each reference signal has a different time delay, so it is difficult to estimate the delays of several sound sources at the same time, and echo cancellation therefore cannot be performed effectively.

In view of this, the embodiment of the present invention takes into account that the microphone signal is the superposition of a target signal and several echo signals, each formed by the reference signal of one terminal propagating through a channel; here the target signal can be understood as the audio signal of the participant corresponding to the terminal. The reference signal of each terminal and the microphone signal of the terminal can therefore be subjected to multi-source fusion processing, which estimates the channel propagation parameters of the reference signal of each terminal and thereby yields the echo signal in the microphone signal. Specifically, feature extraction may be performed on the reference signal of each terminal and on the microphone signal of the terminal to obtain the reference signal features of each terminal and the microphone signal features of the terminal; the echo signal features can then be solved from the reference signal features of each terminal and the microphone signal features of the terminal.
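As a non-limiting illustration of the superposition described above, the following sketch builds a synthetic microphone signal from a target signal and one echo per terminal. Modelling each channel as a pure delay plus a gain is a simplifying assumption of this example (a real room also adds reverberation), and the function name `simulate_mic` and all numeric values are hypothetical.

```python
import numpy as np

def simulate_mic(target, refs, delays, gains):
    """Microphone signal = target speech + one delayed, attenuated echo
    per terminal. All signals are assumed to have the same length.

    target: 1-D array, audio of the participant at this terminal
    refs:   list of 1-D arrays, loudspeaker (reference) signal per terminal
    delays: list of non-negative integer delays in samples, one per terminal
    gains:  list of attenuation factors, one per terminal
    """
    mic = np.array(target, dtype=float)
    for ref, d, g in zip(refs, delays, gains):
        n = max(len(mic) - d, 0)          # samples of this echo that fit
        echo = np.zeros_like(mic)
        echo[d:] = g * np.asarray(ref, dtype=float)[:n]
        mic += echo
    return mic

rng = np.random.default_rng(0)
target = rng.standard_normal(1000)
refs = [rng.standard_normal(1000) for _ in range(3)]
# Each terminal's echo arrives with its own delay and level.
mic = simulate_mic(target, refs, delays=[40, 125, 310], gains=[0.5, 0.3, 0.2])
```

Because each echo has its own delay, a canceller that assumes a single delay can at best match one of the three components.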

Here, both the feature extraction for the reference signals and the microphone signal and the determination of the echo signal features may be implemented by a multi-source signal fusion network. That is, the reference signal of each terminal and the microphone signal of the terminal are input to the multi-source signal fusion network. The network performs feature extraction on these inputs and, taking the microphone signal features of the terminal as a reference, predicts the echo signal that enters the terminal's microphone after the reference signal of each terminal has propagated through the channel. In other words, it predicts the echo component in the microphone signal from the reference signal features of each terminal and the microphone signal features of the terminal, and outputs the echo signal features of the terminal.

It is worth noting that the multi-source signal fusion network can be obtained through pre-training before the reference signal of each terminal and the microphone signal of the terminal are input to it. The training process is as follows: first, a large number of sample microphone signals, sample reference signals of a plurality of terminals, and sample echo signal features are collected; then an initial multi-source signal fusion network is trained on the sample reference signals of the terminals, the sample microphone signal of any one of the terminals, and the sample echo signal features, which yields a multi-source signal fusion network capable of predicting echo signals.

In the embodiment of the invention, training allows the multi-source signal fusion network to learn the implicit relationship among the sample reference signals, the sample microphone signals and the sample echo signal features, so that during application the network can rely directly on this relationship to compute the delays of the multiple sound sources implicitly and to predict the echo component in the microphone signal effectively.

In addition, because the reference signals of all the terminals and the microphone signal of any terminal are collected at the same time and used as the input of the multi-source signal fusion network, the network can, during multi-source fusion processing, predict from the microphone signal and the reference signals the echo signals that reach the terminal's microphone after propagating through the channel.

Step 130, based on the echo signal characteristics, performing echo cancellation on the microphone signal of the terminal to obtain an echo cancellation signal of the terminal.

In the embodiment of the invention, after the echo signal characteristics are obtained through multi-source fusion processing, the conference system can utilize the echo signal characteristics to perform echo cancellation on the microphone signal of the terminal so as to offset a plurality of echo signals in the microphone signal, thereby ensuring the conference quality.

Specifically, in step 130, the conference system may cancel the echo signals present in the microphone signal of the terminal based on the echo signal features to obtain the echo cancellation signal. The echo signal features and the microphone signal of the terminal are input to a deep neural network, which can be understood as a pre-trained echo cancellation network. The echo cancellation network predicts the echo component in the microphone signal of the terminal based on the echo signal features, produces a mask for that component, and outputs an audio mask for the microphone signal of the terminal.

The audio mask is then used to remove the echo component from the microphone signal of the terminal; that is, the microphone signal of the terminal is multiplied by the audio mask output by the echo cancellation network, which cancels the echo signals present in the microphone signal of the terminal.
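As a non-limiting illustration of multiplying the microphone signal by an audio mask, the sketch below applies a mask per frequency bin and per frame. Applying the mask in the time-frequency domain, the frame length, and the function name `apply_mask` are assumptions of this example; the text states only that the microphone signal is multiplied by the mask. Non-overlapping rectangular frames are used to keep the sketch short, whereas a practical system would more likely use an overlapping windowed transform.

```python
import numpy as np

def apply_mask(mic, mask, frame=256):
    """Multiply a time-frequency mask into the microphone signal.

    mic:  1-D samples; a tail shorter than one frame is dropped
    mask: (n_frames, frame // 2 + 1) array with values in [0, 1]
    Returns the masked signal as 1-D samples.
    """
    n_frames = len(mic) // frame
    frames = np.reshape(np.asarray(mic, dtype=float)[:n_frames * frame],
                        (n_frames, frame))
    spec = np.fft.rfft(frames, axis=1)                 # per-frame spectrum
    cleaned = np.fft.irfft(spec * mask, n=frame, axis=1)
    return cleaned.reshape(-1)

# An all-ones mask leaves the signal unchanged; an all-zeros mask removes it.
x = np.sin(np.arange(1024) / 10.0)
unchanged = apply_mask(x, np.ones((4, 129)))
silenced = apply_mask(x, np.zeros((4, 129)))
```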

It should be noted that the echo cancellation network may likewise be obtained through pre-training before the echo signal features and the microphone signal are input to it. The network is trained with the criterion of obtaining the optimal echo cancellation signal, that is, with the goal of minimizing the echo component remaining in the echo cancellation signal. The training process is as follows: first, a large number of sample microphone signals, sample reference signals of a plurality of terminals, and sample echo cancellation signals are collected, and the sample echo signal features are determined from the sample reference signals of the terminals and the sample microphone signal of any one of the terminals; then an initial network is trained on the sample echo signal features, the sample microphone signals and the sample echo cancellation signals, which yields the trained echo cancellation network.

Here, the initial Network for training may be constructed on the basis of a Long Short-Term Memory (LSTM) Network, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and the like, which is not specifically limited in this embodiment of the present invention.

The echo cancellation method provided by the invention can simultaneously acquire the reference signal of each terminal, and utilize the reference signal of each terminal and the microphone signal of any terminal in each terminal to perform time delay estimation so as to acquire the channel propagation parameter of the reference signal of each terminal, thereby acquiring the echo signal characteristic, and perform echo cancellation on the microphone signal of the terminal according to the echo signal characteristic to acquire the echo cancellation signal of the terminal.

Based on the foregoing embodiment, FIG. 2 is a schematic diagram of the echo signal feature determination process provided by the present invention. As shown in FIG. 2, determining the echo signal features in step 120 based on the reference signal features of each terminal and the microphone signal features of the terminal includes:

Step 210, fusing the reference signal features of each terminal to obtain reference signal fusion features;

Step 220, determining the echo signal features based on the reference signal fusion features and the microphone signal features of the terminal.

Specifically, in step 120, after feature extraction is performed on the reference signal of each terminal and the microphone signal of the terminal respectively to obtain the reference signal feature of each terminal and the microphone signal feature of the terminal, a process of determining the echo signal feature according to the reference signal feature of each terminal and the microphone signal feature of the terminal may specifically include the following steps:

Step 210: first, the reference signal features of each terminal obtained by feature extraction are fused to obtain the reference signal fusion features; that is, feature fusion is performed on the basis of the reference signal features of each terminal.

Here, the reference signal features of the terminals may be fused by concatenation (splicing), addition, weighted fusion, or the like, which is not specifically limited in the embodiment of the present invention. Fusing the reference signal features of the terminals benefits the subsequent echo prediction: the multi-source signal fusion network can predict the echo component in the microphone signal more accurately, which in turn facilitates echo cancellation for the microphone signal.
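As a non-limiting illustration of the three fusion options just named, the sketch below fuses per-terminal feature matrices by concatenation, addition or weighted fusion. The feature shape `(T, D)` (frames by feature dimension), the function name `fuse_reference_features`, and the normalization of the weights are assumptions of this example.

```python
import numpy as np

def fuse_reference_features(features, mode="concat", weights=None):
    """features: list of (T, D) arrays, one per terminal.
    mode: "concat", "add" or "weighted" (weights: one value per terminal)."""
    stacked = np.stack(features)                    # (N, T, D)
    if mode == "concat":                            # splice along the feature axis
        return np.concatenate(features, axis=1)     # (T, N * D)
    if mode == "add":
        return stacked.sum(axis=0)                  # (T, D)
    if mode == "weighted":
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                             # assumed: weights sum to 1
        return np.tensordot(w, stacked, axes=1)     # (T, D)
    raise ValueError(f"unknown fusion mode: {mode}")

a = np.ones((5, 3))
b = 3 * np.ones((5, 3))
spliced = fuse_reference_features([a, b], "concat")
summed = fuse_reference_features([a, b], "add")
averaged = fuse_reference_features([a, b], "weighted", weights=[1, 1])
```

Concatenation keeps each terminal's features separate at the cost of an input size that grows with the number of terminals, while addition and weighted fusion keep the size fixed.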

Step 220: the echo signal features are then determined from the reference signal fusion features and the microphone signal features of the terminal. Specifically, taking the reference signal fusion features as a reference, the channel propagation parameters of the reference signal of each terminal are estimated using the microphone signal features of the terminal, which solves for the echo signal features of the several echo signals in the microphone signal. In other words, on the basis of the microphone signal features of the terminal, the echo signal that enters the terminal's microphone after the reference signal of each terminal has propagated through the channel is predicted from the reference signal fusion features; that is, the echo component in the microphone signal is predicted, which yields the echo signal features.

Based on the foregoing embodiment, FIG. 3 is an overall framework diagram of the echo cancellation process provided by the present invention. As shown in FIG. 3, the reference signal of each terminal and the microphone signal of any one of the terminals are first acquired. The reference signal of each terminal and the microphone signal of that terminal are then input to the multi-source signal fusion network to obtain the echo signal features output by the network. Next, the microphone signal of the terminal and the echo signal features are input to the echo cancellation network to obtain the audio mask of the microphone signal output by that network. The echo cancellation network consists of a CNN layer (convolutional layer), a Bi-LSTM layer (bidirectional long short-term memory layer) and an FC layer (fully connected layer). Specifically, the microphone signal of the terminal is input to the convolutional layer to obtain the microphone signal features output by the convolutional layer; these features and the echo signal features output by the multi-source signal fusion network are input to the Bi-LSTM layer, which solves for the audio mask of the microphone signal; and the audio mask is output through the fully connected layer. Finally, the audio mask output by the fully connected layer is multiplied by the microphone signal of the terminal to obtain the echo cancellation signal.
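As a non-limiting illustration of the data flow through the convolutional layer, the Bi-LSTM layer and the fully connected layer, the following sketch runs a forward pass with random, untrained weights. The use of a magnitude spectrogram as the microphone input, the concatenation of the two feature streams before the Bi-LSTM, the sigmoid on the output, and all layer sizes and names are assumptions of this example, not details given in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(x, w, b):
    """1-D convolution over time with zero padding and ReLU.
    x: (T, F), w: (K, F, C), b: (C,) -> (T, C)"""
    k = w.shape[0]
    xp = np.pad(x, ((k // 2, k - 1 - k // 2), (0, 0)))
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=2) for t in range(len(x))])
    return np.maximum(out + b, 0.0)

def lstm(x, W, U, b, reverse=False):
    """One LSTM direction. x: (T, D), W: (D, 4H), U: (H, 4H), b: (4H,) -> (T, H)"""
    T, H = len(x), U.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((T, H))
    for t in (range(T - 1, -1, -1) if reverse else range(T)):
        i, f, g, o = np.split(x[t] @ W + h @ U + b, 4)   # gate pre-activations
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)     # cell state update
        h = sigmoid(o) * np.tanh(c)                      # hidden state
        out[t] = h
    return out

def echo_cancel_forward(mic_spec, echo_feat, p):
    """mic_spec: (T, F) magnitude spectrogram; echo_feat: (T, E)."""
    mic_feat = conv_layer(mic_spec, p["conv_w"], p["conv_b"])
    z = np.concatenate([mic_feat, echo_feat], axis=1)    # joint Bi-LSTM input
    fwd = lstm(z, p["Wf"], p["Uf"], p["bf"])
    bwd = lstm(z, p["Wb"], p["Ub"], p["bb"], reverse=True)
    hidden = np.concatenate([fwd, bwd], axis=1)          # (T, 2H)
    mask = sigmoid(hidden @ p["fc_w"] + p["fc_b"])       # (T, F), values in (0, 1)
    return mask, mask * mic_spec                         # mask times microphone input

def init_params(F, E, C=16, H=32, K=3, seed=0):
    """Random weights; a real network would be trained as described above."""
    rng = np.random.default_rng(seed)
    r = lambda *s: 0.1 * rng.standard_normal(s)
    D = C + E
    return {"conv_w": r(K, F, C), "conv_b": np.zeros(C),
            "Wf": r(D, 4 * H), "Uf": r(H, 4 * H), "bf": np.zeros(4 * H),
            "Wb": r(D, 4 * H), "Ub": r(H, 4 * H), "bb": np.zeros(4 * H),
            "fc_w": r(2 * H, F), "fc_b": np.zeros(F)}

rng = np.random.default_rng(1)
T, F, E = 20, 65, 24
mic_spec = np.abs(rng.standard_normal((T, F)))
echo_feat = rng.standard_normal((T, E))
mask, cleaned = echo_cancel_forward(mic_spec, echo_feat, init_params(F, E))
```

With untrained weights the mask carries no meaning; the sketch only shows the order of the layers and the shapes passed between them.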

In the embodiment of the invention, acquisition of the reference signals, time delay estimation and echo cancellation can be carried out simultaneously for multiple sound sources, which overcomes the defect of the traditional scheme that echo cancellation cannot be performed effectively because time delay estimation cannot be carried out simultaneously on the reference signals of multiple sound sources.

Based on the above embodiment, fig. 4 is a schematic flow chart of a pickup control process provided by the present invention, and as shown in fig. 4, the method further includes:

step 410, determining the conference state of each participant corresponding to each terminal based on the state of each terminal and/or the voice detection result of each terminal;

step 420, sound pickup control is carried out on each terminal based on the participation state of each participant corresponding to each terminal; the state is a hand-held state or a placement state, and the participation state is any one of a discussion state, a speech state and a listening state.

Specifically, in addition to performing echo cancellation on the microphone signal of each terminal, the conference system may determine the participation state of each participant according to the state of each terminal, and perform sound pickup control on each terminal according to the participation state. The specific process includes the following steps:

Because of the mobility of the terminals and the randomness of speech during a multi-person conference, the participation state of the participants can be adjusted according to changes in the terminal state and/or the voice detection result of the terminal. Specifically, step 410 is executed first: the participation state of each participant corresponding to each terminal is determined by taking the state of each terminal and/or the voice detection result of each terminal as a reference. Here, the state of a terminal is a placement state or a handheld state, that is, the terminal is either placed on a desktop or held in the hand of the corresponding participant; the participation state may be any one of a discussion state, a speech state and a listening state; and the voice detection result indicates whether the corresponding terminal detects a sound.

Step 420 is then executed: sound pickup control is performed on each terminal according to the participation state of the participant corresponding to each terminal. Specifically, based on that participation state, the microphones may be regulated dynamically during the sound pickup of each terminal, and/or voice separation may be performed on the echo cancellation signal obtained by echo cancellation of the audio picked up by each terminal. That is, on the premise of the participation state of the participant corresponding to each terminal, dynamic microphone selection during sound pickup is controlled, and/or voice separation of the echo cancellation signal obtained after sound pickup and echo cancellation is controlled.

In the embodiment of the invention, the mobility and the flexibility of the terminal and the randomness of speaking in the conference process are fully utilized to adjust the conference mode or the conference state of each participant, thereby realizing the dynamic switching between different conference modes or the conference states, improving the communication convenience in the conference process and optimizing the conference quality.

Based on the above embodiment, step 410 includes:

under the condition that the states of all the terminals are all placement states, determining that the participation states of all the participants corresponding to all the terminals are all discussion states;

and under the condition that the state of any terminal is a handheld state and the voice detection result of the terminal indicates that the terminal detects sound, determining that the conference state of the participant corresponding to the terminal is a speech state and the conference states of other participants are listening states.

Specifically, in step 410, the process of determining the conference state of each participant corresponding to each terminal according to the state of each terminal and/or the voice detection result of each terminal may be specifically divided into the following two cases:

First, when the state of every terminal is the placement state, that is, when every participant has placed the corresponding terminal on the desktop during the conference, the participation state of each participant can be determined to be the discussion state. The conference mode at this time is a free discussion mode in which any participant may speak at any time, so the microphone can be selected dynamically during sound pickup.

Second, in addition to the discussion state, the embodiment of the present invention takes the flexibility of the terminal into account and also defines the speech state and the corresponding listening state. When the state of any terminal is the handheld state and the voice detection result of that terminal indicates that it detects a sound, in other words, when the terminal is held in the hand of the corresponding participant and that participant is speaking, the participation state of that participant can be determined to be the speech state, and the participation state of every other participant is the listening state.

It should be noted that, when any participant is in the speech state, only the microphone of the terminal corresponding to the participant is turned on, and the microphones of the other terminals are all turned off, so as to ensure the concentration of the participant during speech, that is, avoid the interference of the microphone signals input from the microphones of the other terminals.
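The two cases above, together with the microphone gating in the speech state, can be sketched as follows. This is a minimal illustration under stated assumptions: the names `TerminalState`, `ParticipationState`, `determine_states` and `open_microphones` are hypothetical, the voice detection result is reduced to a boolean, and the combination that the text does not describe (for example a handheld terminal with no detected sound) is returned as `None` rather than guessed.

```python
from enum import Enum
from typing import Dict, Optional, Set, Tuple


class TerminalState(Enum):
    PLACED = "placed"
    HANDHELD = "handheld"


class ParticipationState(Enum):
    DISCUSSION = "discussion"
    SPEECH = "speech"
    LISTENING = "listening"


def determine_states(
    terminals: Dict[str, Tuple[TerminalState, bool]]
) -> Optional[Dict[str, ParticipationState]]:
    """Map terminal id -> participation state from (terminal state, voice detected)."""
    # Case 1: every terminal is placed on the desktop -> free discussion.
    if all(state is TerminalState.PLACED for state, _ in terminals.values()):
        return {tid: ParticipationState.DISCUSSION for tid in terminals}
    # Case 2: a handheld terminal that detects sound -> its participant is speaking.
    for tid, (state, voice_detected) in terminals.items():
        if state is TerminalState.HANDHELD and voice_detected:
            return {
                other: ParticipationState.SPEECH
                if other == tid
                else ParticipationState.LISTENING
                for other in terminals
            }
    return None  # assumption: this combination is not specified in the text


def open_microphones(states: Dict[str, ParticipationState]) -> Set[str]:
    """Terminals whose microphones stay on for the given participation states."""
    speakers = {t for t, s in states.items() if s is ParticipationState.SPEECH}
    # Speech state: only the speaker's microphone is on. Discussion state:
    # all microphones are candidates before dynamic selection.
    return speakers if speakers else set(states)


# Hypothetical terminal identifiers.
speech = determine_states({
    "t1": (TerminalState.HANDHELD, True),
    "t2": (TerminalState.PLACED, False),
})
```

Here terminal `t1` is handheld and detects sound, so its participant is placed in the speech state, the participant of `t2` is placed in the listening state, and only the microphone of `t1` remains on.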

Based on the foregoing embodiment, fig. 5 is a schematic flow chart of voice separation in the pickup control process provided by the present invention, and as shown in fig. 5, step 420 includes:

step 421-A, under the condition that the conference state of the conference participant corresponding to any terminal is the speech state, acquiring an echo cancellation signal of the terminal;

and step 422-A, performing voice separation on the echo cancellation signal of the terminal based on the voiceprint characteristics of the participants to obtain a voice separation signal of the terminal, wherein the voiceprint characteristics of the participants are obtained by voiceprint extraction from the historical echo cancellation signals of the terminals.

Specifically, in step 420, in the process of performing sound pickup control on each terminal according to the conference state of each participant corresponding to each terminal, a process of performing voice separation on an echo cancellation signal obtained by echo cancellation after sound pickup may specifically include the following steps:

step 421-A: when any participant is in the speech state, that is, when the participation state of the participant corresponding to one terminal is the speech state and the participation states of the participants corresponding to the other terminals are the listening state, the echo cancellation signal of that terminal can be obtained. The echo cancellation signal is the signal obtained by performing echo cancellation on the microphone signal picked up by the terminal; it is described in detail above and is not described again here;

step 422-A: after the echo cancellation signal of the terminal is obtained and before voice separation is performed on it, the voiceprint features of the participants are determined. The voiceprint features may be determined from the historical echo cancellation signals of the terminals, that is, voiceprint extraction is performed on the historical echo cancellation signal of each terminal to obtain the voiceprint features of the corresponding participant. Voice separation is then performed on the echo cancellation signal of the terminal according to the voiceprint features of the participants, so as to obtain the voice separation signal.

Here, the voice separation of the echo cancellation signal may be implemented by a voice separation model, that is, voiceprint features of the participants and the echo cancellation signal of the terminal may be input to the voice separation model, the voice separation model performs voice separation on the echo cancellation signal of the terminal according to the input voiceprint features of the participants, and finally obtains an audio mask of the echo cancellation signal output by the voice separation model, and the audio mask is multiplied by the echo cancellation signal to obtain the voice separation signal.

In the embodiment of the invention, the echo cancellation signal of the terminal is subjected to voice separation, and the audio signal of the participant corresponding to the terminal can be separated from the echo cancellation signal, so that the audio signal is not interfered by background noise and the audio signals of other participants when being played through a loudspeaker, and the concentration of the speech process and the definition of the audio signal in the speech state are ensured.

Based on the foregoing embodiment, fig. 6 is a general framework diagram of the voice separation process provided by the present invention. As shown in fig. 6, when the conference state of the conference participant corresponding to any terminal is the speech state, first, an echo cancellation signal of the terminal is obtained. Then, the voiceprint characteristics of each participant and the echo cancellation signal of the terminal are input into a voice separation model to obtain an audio mask of the echo cancellation signal output by the voice separation model, where the voice separation model consists of a CNN layer (convolutional layer), a Bi-LSTM layer (bidirectional long short-term memory layer) and an FC layer (fully connected layer). Specifically, the echo cancellation signal of the terminal is input into the convolutional layer to obtain the echo cancellation signal characteristics output by the convolutional layer; these characteristics and the voiceprint characteristics of each participant are input into the bidirectional long short-term memory layer, which solves for the audio mask of the echo cancellation signal; and the audio mask is then output through the fully connected layer. Finally, the audio mask output by the fully connected layer is multiplied by the echo cancellation signal of the terminal to obtain the voice separation signal.

The voiceprint characteristics of each participant are determined based on the historical echo cancellation signals of each terminal, that is, voiceprint extraction can be performed by taking the historical echo cancellation signal of each terminal as a reference to obtain the voiceprint characteristics of the participant corresponding to that terminal. The historical echo cancellation signal may be, among the echo cancellation signals previously obtained by echo cancellation of the audio picked up by the corresponding terminal, the one with the longest duration, so that the accuracy of the extracted voiceprint characteristics can be ensured.
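The selection of the longest historical echo cancellation signal per terminal can be sketched as below. This is only an illustration: it assumes that all signals share one sampling rate, so that duration can be compared by sample count, and the names `longest_history_signal` and `select_voiceprint_sources` are hypothetical. The voiceprint extraction itself is not shown, since the text does not specify the extractor.

```python
from typing import Dict, List, Sequence


def longest_history_signal(history: Sequence[List[float]]) -> List[float]:
    """Return the historical echo cancellation signal with the most samples."""
    if not history:
        raise ValueError("no historical echo cancellation signal available")
    # Assumption: equal sampling rate, so more samples means longer duration.
    return max(history, key=len)


def select_voiceprint_sources(
    histories: Dict[str, List[List[float]]]
) -> Dict[str, List[float]]:
    """For each terminal with history, pick the signal to extract a voiceprint from."""
    return {
        terminal_id: longest_history_signal(signals)
        for terminal_id, signals in histories.items()
        if signals  # terminals without history are skipped
    }


# Hypothetical toy history: terminal "t1" has two past signals, "t2" has none.
sources = select_voiceprint_sources({
    "t1": [[0.1, 0.2], [0.3, 0.1, 0.0, 0.2]],
    "t2": [],
})
```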

In the embodiment of the invention, under the guidance of the voiceprint characteristics of each participant, the voice separation is carried out aiming at the echo cancellation signal, and the audio signal of the participant corresponding to the terminal can be stripped and reserved, so that the speech state of the participant is not interfered by background noise and the audio signals of other participants, in other words, the interference of the background noise and other voices can be inhibited under the condition that the conference mode is the speech mode, the attentiveness of the speech process is ensured, and the definition of the audio signal under the speech state can be ensured by only playing the stripped audio signal of the participant by the loudspeaker, thereby improving the conference quality.

Based on the foregoing embodiment, fig. 7 is a schematic flow chart illustrating microphone selection in the sound pickup control process provided by the present invention, and as shown in fig. 7, step 420 includes:

step 421-B, acquiring the microphone signals of each terminal under the condition that the conference state of each participant is a discussion state;

and step 422-B, performing audio alignment on the microphone signals of the terminals, determining a target terminal based on the energy of the terminals corresponding to the microphone signals after audio alignment, and taking the target terminal as a sound pickup terminal of the current speaker in the participants in the discussion state.

Specifically, in step 420, in the process of performing sound pickup control on each terminal according to the participation state of each participant corresponding to each terminal, the process of controlling dynamic microphone selection during sound pickup may include the following steps:

step 421-B, under the condition that each participant is in the discussion state, that is, under the condition that the participant states of each participant corresponding to each terminal are all discussion states, acquiring the microphone signal of each terminal; the microphone signal here is an audio signal picked up by a microphone on each terminal accessing the conference;

step 422-B: during a multi-person conference, multiple microphones are usually picking up sound at the same time, and because the microphones are at different positions, that is, at different distances from the speaker, the microphone signals they pick up have different transmission delays. In the embodiment of the present invention, the microphone signals of the terminals can therefore be audio-aligned by using an autocorrelation function to eliminate the transmission delay. This is a preprocessing step and a precondition for the subsequent energy comparison: the energy judgment can be made only after audio alignment has ensured that the microphone signals are synchronized in time. In other words, this step is a key prerequisite for the subsequent dynamic selection of microphones;

then, the energy of each terminal corresponding to each audio-aligned microphone signal can be determined, and the target terminal is determined according to these energies. Specifically, the terminal whose microphone signal has the largest energy is determined as the target terminal, and the target terminal is then used as the sound pickup terminal of the current speaker among the participants in the discussion state, that is, as the terminal that picks up the audio signal of the current speaker.

It is worth mentioning that the target terminal selected by energy is necessarily the terminal closest to the current speaker, so that the sound pickup quality can be ensured to the greatest extent. Once the target terminal is used as the sound pickup terminal of the current speaker, the microphones of the other terminals can be closed. The terminal closest to the current speaker is therefore always the sound pickup terminal, and different speakers can correspond to different sound pickup terminals, which realizes dynamic selection of the sound pickup terminal or the sound pickup microphone.
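Audio alignment followed by energy-based selection can be sketched as follows. This is a minimal sketch under stated assumptions, not the embodiment's implementation: the text names an autocorrelation function, whereas this sketch substitutes a brute-force cross-correlation search against the first terminal's signal; the helper names (`best_lag`, `shift`, `energy`, `select_pickup_terminal`) and the `max_lag` search range are hypothetical; and energy is computed over the whole toy signal rather than over a short analysis window.

```python
from typing import Dict, List


def best_lag(ref: List[float], sig: List[float], max_lag: int) -> int:
    """Lag in samples by which sig trails ref, maximising cross-correlation."""
    def corr(lag: int) -> float:
        total = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):
                total += r * sig[m]
        return total
    return max(range(-max_lag, max_lag + 1), key=corr)


def shift(sig: List[float], lag: int) -> List[float]:
    """Advance sig by lag samples (delay it if lag < 0), zero-padded to same length."""
    n = len(sig)
    if lag >= 0:
        return sig[lag:] + [0.0] * min(lag, n)
    return ([0.0] * min(-lag, n) + sig)[:n]


def energy(sig: List[float]) -> float:
    """Sum of squared samples."""
    return sum(x * x for x in sig)


def select_pickup_terminal(mic_signals: Dict[str, List[float]], max_lag: int = 8) -> str:
    """Align every microphone signal to the first one, then pick the most energetic."""
    if not mic_signals:
        raise ValueError("at least one microphone signal is required")
    ref = next(iter(mic_signals.values()))
    aligned = {
        tid: shift(sig, best_lag(ref, sig, max_lag))
        for tid, sig in mic_signals.items()
    }
    return max(aligned, key=lambda tid: energy(aligned[tid]))


# Hypothetical toy signals: "far" hears the same pulse 3 samples later at half level.
near = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
far = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0]
target = select_pickup_terminal({"far": far, "near": near})
```

In this toy case the estimated lag of `far` relative to `near` is three samples, and `near` is selected because its aligned signal carries more energy.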

This differs from the traditional conference scene, in which every microphone starts picking up sound once its terminal has joined, no selection can be made among the microphones, and the sound pickup quality during the conference cannot be guaranteed.

Based on the above embodiment, the general process of the echo cancellation method provided by the present invention includes the following steps: the conference system firstly issues a conference APP (Application), and the conference APP can realize functions such as remote conference creation and conference access.

After each terminal accesses the conference, the conference system firstly needs to acquire a reference signal of each terminal and a microphone signal of any terminal in each terminal; then, feature extraction can be respectively carried out on the reference signal of each terminal and the microphone signal of the terminal to obtain the reference signal feature of each terminal and the microphone signal feature of the terminal, the reference signal feature of each terminal can be fused to obtain the reference signal fusion feature, and the echo signal feature is determined based on the reference signal fusion feature and the microphone signal feature of the terminal; then, based on the echo signal characteristics, the echo cancellation is performed on the microphone signal of the terminal, so as to obtain an echo cancellation signal of the terminal.

In addition, the conference system can also determine the conference participation state of each participant corresponding to each terminal based on the state of each terminal and/or the voice detection result of each terminal; performing pickup control on each terminal based on the participation state of each participant corresponding to each terminal; the state here is a hand-held state or a placed state, and the conference state is any one of a discussion state, a speech state, and a listening state.

Further, the process of determining the conference state of each participant corresponding to each terminal based on the state of each terminal and/or the voice detection result of each terminal may specifically include: under the condition that the states of all the terminals are all placement states, determining that the participation states of all the participants corresponding to all the terminals are all discussion states; and under the condition that the state of any terminal is a handheld state and the voice detection result of the terminal indicates that the terminal detects sound, determining that the conference state of the participant corresponding to the terminal is a speech state and the conference states of other participants are listening states.

The process of performing sound pickup control on each terminal based on the conference state of each participant corresponding to each terminal may include the following steps: under the condition that the conference state of the conference participant corresponding to any terminal is a speech state, acquiring an echo cancellation signal of the terminal; and performing voice separation on the echo cancellation signal of the terminal based on the voiceprint characteristics of the participants to obtain a voice separation signal of the terminal, wherein the voiceprint characteristics of the participants are obtained by voiceprint extraction from the historical echo cancellation signals of the terminals. Alternatively or additionally, under the condition that the participation states of all participants are discussion states, acquiring the microphone signals of each terminal; and performing audio alignment on the microphone signals of the terminals, determining a target terminal based on the energy of each terminal corresponding to each audio-aligned microphone signal, and taking the target terminal as the sound pickup terminal of the current speaker among the participants in the discussion state.

In the embodiment of the invention, the conference system can connect a plurality of terminals through the conference APP, and each terminal connected to the conference is the sound pickup terminal of the corresponding participant. Meanwhile, the conference system can adjust the participation state by means of the state of the terminal and/or the voice detection result. Furthermore, when the participant corresponding to any terminal is in the speech state, voice separation can be performed on the echo cancellation signal of that terminal, so that background noise and the interfering voices of other people are suppressed in the speech state, and the ability of the conference system to handle complex scenes is improved to the greatest extent.

In addition, in a scheme with multiple distributed terminals, and in view of the need for intelligent interaction among the terminals, the embodiment of the invention can make full use of the networked linkage of the terminals, bring out the individual characteristics of each terminal, strengthen the pickup gain of the microphone and realize dynamic selection of the microphone.

The method provided by the embodiment of the invention can simultaneously acquire the reference signal of each terminal, and utilize the reference signal of each terminal and the microphone signal of any terminal in each terminal to perform time delay estimation so as to acquire the channel propagation parameter of the reference signal of each terminal, thereby acquiring the echo signal characteristic, and perform echo cancellation on the microphone signal of the terminal according to the echo signal characteristic to acquire the echo cancellation signal of the terminal.

The echo cancellation device provided by the present invention is described below, and the echo cancellation device described below and the echo cancellation method described above may be referred to in correspondence with each other.

Fig. 8 is a schematic structural diagram of an echo cancellation device provided in the present invention, and as shown in fig. 8, the device includes:

a signal obtaining unit 810, configured to obtain a reference signal of each terminal and a microphone signal of any one of the terminals;

an echo determining unit 820, configured to perform feature extraction on the reference signal of each terminal and the microphone signal of the terminal, and determine an echo signal feature based on the reference signal feature of each terminal and the microphone signal feature of the terminal obtained through feature extraction;

the echo cancellation unit 830 is configured to perform echo cancellation on the microphone signal of the terminal based on the echo signal characteristic, so as to obtain an echo cancellation signal of the terminal.

The echo cancellation device provided by the invention can simultaneously acquire the reference signal of each terminal, and utilize the reference signal of each terminal and the microphone signal of any terminal in each terminal to perform time delay estimation so as to acquire the channel propagation parameter of the reference signal of each terminal, thereby acquiring the echo signal characteristic, and perform echo cancellation on the microphone signal of the terminal according to the echo signal characteristic to acquire the echo cancellation signal of the terminal.

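Based on the above embodiment, the echo determination unit 820 is configured to:

performing feature fusion on the reference signal characteristics of each terminal to obtain a reference signal fusion characteristic;

and determining the echo signal characteristic based on the reference signal fusion characteristic and the microphone signal characteristic of the terminal.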

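Based on the above embodiment, the apparatus further includes a pickup control unit configured to:

determining the conference participation state of each participant corresponding to each terminal based on the state of each terminal and/or the voice detection result of each terminal;

performing sound pickup control on each terminal based on the participation state of each participant corresponding to each terminal; the state is a hand-held state or a placement state, and the participation state is any one of a discussion state, a speech state and a listening state.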

Based on the above embodiment, the sound pickup control unit is configured to:

under the condition that the conference state of the conference participant corresponding to any terminal is a speech state, acquiring an echo cancellation signal of the terminal;

and performing voice separation on the echo cancellation signal of the terminal based on the voiceprint characteristics of the participants to obtain a voice separation signal of the terminal, and performing voiceprint extraction on the voiceprint characteristics of the participants based on the historical echo cancellation signal of the terminal to obtain the voiceprint characteristics of the participants.

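Based on the above embodiment, the sound pickup control unit is configured to:

acquiring the microphone signals of each terminal under the condition that the conference state of each participant is a discussion state;

and performing audio alignment on the microphone signals of the terminals, determining a target terminal based on the energy of the terminals corresponding to the microphone signals after audio alignment, and taking the target terminal as a sound pickup terminal of the current speaker in the participants in the discussion state.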

Fig. 9 is a schematic structural diagram of a conference system provided in the present invention, as shown in fig. 9, the conference system includes terminals 910, and an echo cancellation device 800;

the echo cancellation device 800 is configured to determine an echo signal characteristic based on a reference signal characteristic of a reference signal of each terminal 910 and a microphone signal characteristic of a microphone signal of any one of the terminals 910, and perform echo cancellation on the microphone signal of the terminal based on the echo signal characteristic to obtain an echo cancellation signal of the terminal.

Specifically, the conference system in the embodiment of the present invention includes the terminals 910 and the echo cancellation device 800. A terminal is a device through which a participant accesses the conference, such as a smart phone or a tablet computer; for convenience of work and cooperation, a participant can connect his or her own device, that is, the terminal, to the conference when participating. The microphone signal is an audio signal picked up by a microphone. In a multi-person conference scene, a plurality of echo signals exist in the microphone signal, so a plurality of reference signals are correspondingly needed when performing echo cancellation on the microphone signal. A reference signal is also an audio signal and can be understood as the source signal to be cancelled by echo cancellation, namely the loudspeaker signal of each terminal accessing the conference.

In the conference system, for example, after the audio signal of a first participant is output by the speaker of the terminal corresponding to a second participant, an echo is formed through spatial reflection and is input again through the microphone of the terminal corresponding to the second participant. At this time, the audio signal input through that microphone includes not only the audio signal of the second participant but also the echo signal of the first participant, so the audio signal output by the speaker of the terminal corresponding to the first participant includes both. In other words, the sound of the second participant heard by the first participant is superimposed on the first participant's own sound, which seriously affects the conference quality. In this case echo cancellation is required: the echo signal is estimated by an echo cancellation technique, and the estimate is subtracted from the microphone signal to cancel the echo.

Because a plurality of echo signals exist in a microphone signal in a multi-person conference scene, and an acoustic transmission process still exists between a reference signal of each terminal and an actual echo signal, the traditional adaptive echo cancellation method needs to estimate the transmission path parameter in real time, and due to the difference of distances, networks, hardware and the like, the time delay of each reference signal is different, so that the time delay estimation is difficult to be simultaneously carried out on a plurality of sound sources, and further, the echo cancellation cannot be effectively carried out.

In view of this, in the embodiment of the present invention, the echo cancellation device 800 may first obtain a reference signal of each terminal 910 (terminal 1, terminal 2, terminal 3, etc.) in the conference system and a microphone signal of any terminal among the terminals 910. The microphone signal is a superposition of a target signal, which may be understood as the audio signal of the participant corresponding to the terminal, and a plurality of echo signals formed when the reference signal of each terminal 910 propagates through its channel. The echo cancellation device 800 may therefore perform multi-source fusion processing on the reference signal of each terminal 910 and the microphone signal of the terminal to estimate the channel propagation parameters of the reference signal of each terminal 910 and thereby obtain the echo signals in the microphone signal. That is, the echo cancellation device 800 may perform feature extraction on the reference signal of each terminal 910 and on the microphone signal of the terminal to obtain the reference signal characteristics of each terminal 910 and the microphone signal characteristics of the terminal, and may then solve for the echo signal characteristics from the reference signal characteristics of each terminal 910 and the microphone signal characteristics of the terminal.
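The superposition described above can be illustrated with a small simulation. This is a simplified sketch only: each channel is reduced to a single delay and gain standing in for the channel propagation parameters, whereas real acoustic channels are considerably more complex, and the names `delayed` and `simulate_microphone` and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple


def delayed(sig: List[float], delay: int, gain: float) -> List[float]:
    """Delay sig by `delay` samples and scale it by `gain`, keeping its length."""
    n = len(sig)
    kept = sig[: max(n - delay, 0)]
    return [0.0] * min(delay, n) + [gain * x for x in kept]


def simulate_microphone(
    target: List[float],
    references: Sequence[List[float]],
    channels: Sequence[Tuple[int, float]],
) -> List[float]:
    """Microphone signal = target signal + one echo per reference signal."""
    mic = [float(x) for x in target]
    for ref, (delay, gain) in zip(references, channels):
        echo = delayed(ref, delay, gain)
        # Add the echo sample by sample, up to the length of the target.
        for i, e in enumerate(echo[: len(mic)]):
            mic[i] += e
    return mic


# Hypothetical toy case with two remote terminals.
target = [1.0, 0.0, 0.0, 0.0]
references = [[2.0, 2.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]]
channels = [(1, 0.5), (2, 0.25)]  # (delay in samples, gain) per reference
mic = simulate_microphone(target, references, channels)
```

Because each reference reaches the microphone with its own delay and gain, an echo canceller has to account for every channel at once, which is the difficulty that the multi-source fusion processing is intended to address.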

After that, the echo cancellation device 800 may perform echo cancellation on the microphone signal of the terminal by using the echo signal characteristic, so as to cancel the multiple echo signals present in the microphone signal and ensure the conference quality. Specifically, the echo cancellation device 800 may predict the echo component in the microphone signal of the terminal by using the echo signal characteristic as a reference, and output a mask for the echo component, so as to obtain an audio mask of the microphone signal of the terminal. The audio mask is then used to cancel the echo component, that is, the microphone signal of the terminal is multiplied by the audio mask to cancel the multiple echo signals present in it. In other words, the echo signals in the microphone signal of the terminal are set to zero and the audio signal of the participant corresponding to the terminal, that is, the target signal, is retained, which finally yields the echo cancellation signal.

The conference system provided by the invention comprises each terminal and an echo cancellation device, wherein the echo cancellation device can perform time delay estimation by using a reference signal of each terminal and a microphone signal of any terminal in each terminal to obtain a channel propagation parameter of the reference signal of each terminal so as to obtain an echo signal characteristic, and can perform echo cancellation according to the echo signal characteristic to obtain an echo cancellation signal of the terminal.

Based on the above embodiment, the echo cancellation device 800 is further configured to determine the conference state of each participant corresponding to each terminal 910 based on the state of each terminal 910 and/or the voice detection result of each terminal 910, and perform sound pickup control on each terminal 910 based on the conference state of each participant corresponding to each terminal 910;

the state is a hand-held state or a placement state, and the participation state is any one of a discussion state, a speech state and a listening state.

Specifically, in addition to performing echo cancellation on the microphone signal of each terminal 910, the echo cancellation device 800 in the conference system may be configured to perform sound pickup control on each terminal 910. The specific process may be as follows: when the conference system is applied to a multi-person conference scene, owing to the mobility of the terminals and the randomness of speech during a multi-person conference, the echo cancellation device 800 may determine the participation state of each participant corresponding to each terminal 910 from a change in the terminal state and/or the voice detection result of the terminal, and may adjust the participation state of the participant accordingly. That is, the echo cancellation device 800 may perform sound pickup control on each terminal 910 based on the state of each terminal 910 and/or the voice detection result of each terminal 910. Here, the sound pickup control is in essence the dynamic regulation of the microphones during the sound pickup process of each terminal 910, and/or the voice separation of the echo cancellation signal of each terminal 910, both performed on the premise that the participation state of each participant corresponding to each terminal 910 has been determined.

Here, the state of the terminal is a placement state or a handheld state, that is, the terminal is either placed on a desktop or held in the hand of the corresponding participant. The participation state of the participant corresponding to the terminal may be any one of a discussion state, a speech state and a listening state, and the voice detection result indicates whether the corresponding terminal detects a sound.

In the embodiment of the invention, the echo cancellation device in the conference system makes full use of the mobility and flexibility of the terminals and the randomness of speaking during the conference to adjust the conference mode or the participation state of each participant, thereby realizing dynamic switching between different conference modes or participation states in the conference system and improving the flexibility and convenience of the conference system.
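The state determination described above can be sketched as a small rule-based function. The rule that a handheld terminal detecting sound puts its participant in the speech state and all other participants in the listening state follows claim 4; the fallback to the discussion state when no such terminal exists is an assumption made only for this sketch, and all names (`TerminalState`, `ParticipantState`, `determine_states`) are hypothetical.

```python
from enum import Enum
from typing import Dict, Tuple

class TerminalState(Enum):
    HANDHELD = "handheld"
    PLACED = "placed"

class ParticipantState(Enum):
    DISCUSSION = "discussion"
    SPEECH = "speech"
    LISTENING = "listening"

def determine_states(
    terminals: Dict[str, Tuple[TerminalState, bool]]
) -> Dict[str, ParticipantState]:
    """Map each terminal id to the participation state of its participant.

    `terminals` maps a terminal id to (terminal state, voice detected?).
    """
    # Rule from claim 4: a handheld terminal that detects sound marks the speaker.
    speaker = next(
        (tid for tid, (state, voiced) in terminals.items()
         if state is TerminalState.HANDHELD and voiced),
        None,
    )
    if speaker is None:
        # Assumption (not stated in the text): otherwise everyone is discussing.
        return {tid: ParticipantState.DISCUSSION for tid in terminals}
    return {
        tid: ParticipantState.SPEECH if tid == speaker else ParticipantState.LISTENING
        for tid in terminals
    }

speech_case = determine_states({
    "t1": (TerminalState.HANDHELD, True),
    "t2": (TerminalState.PLACED, False),
})
discussion_case = determine_states({
    "t1": (TerminalState.PLACED, True),
    "t2": (TerminalState.PLACED, True),
})
```

If several handheld terminals detect sound at once, this sketch simply takes the first one; the embodiment does not specify how such a tie would be resolved.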

Based on the foregoing embodiment, the echo cancellation device 800 is specifically configured to, when the participation state of the participant corresponding to any terminal is the speech state, obtain the echo cancellation signal of the terminal, and perform voice separation on the echo cancellation signal of the terminal based on the voiceprint features of the participants to obtain a voice separation signal of the terminal;

the voiceprint characteristics of each participant are extracted based on the historical echo cancellation signals of each terminal 910.

Specifically, when the echo cancellation device 800 in the conference system performs sound pickup control on each terminal 910, the voice separation of the echo cancellation signal may proceed as follows. When any participant (the participant corresponding to any terminal) is in the speech state and the other participants are in the listening state, the echo cancellation device 800 may obtain the echo cancellation signal of that terminal and perform voice separation on it to obtain the voice separation signal. The voice separation is performed based on the voiceprint features of the participants, and the voiceprint features of the participants are determined from the historical echo cancellation signals of the terminals 910.

Here, the echo cancellation signal is the signal obtained after the echo cancellation device 800 performs echo cancellation on the microphone signal picked up by the terminal; the echo cancellation process is described in detail above and is not repeated here.

In the embodiment of the invention, the voice separation function of the echo cancellation device can separate the audio signal of the participant corresponding to the terminal from the echo cancellation signal, so that the audio signal is not interfered with by background noise or by the audio signals of other participants when it is played through a loudspeaker, which ensures the focus of the speech process and the clarity of the audio signal in the speech state.
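As a rough illustration of how voiceprint features might drive the separation, the sketch below averages speaker embeddings from historical echo cancellation signals into one voiceprint per participant, and then marks the frames of the current echo cancellation signal that are closest to the target participant. The embodiment does not specify the embedding extractor or the separation model, so the embeddings are given here as plain arrays, and the cosine-similarity frame selection is only a simplified stand-in for voice separation; all names are hypothetical.

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)

def enroll_voiceprint(history_embeddings: np.ndarray) -> np.ndarray:
    """Average embeddings of shape (n_segments, dim) into one voiceprint (dim,)."""
    return unit(unit(history_embeddings).mean(axis=0))

def target_frame_mask(frame_embeddings: np.ndarray,
                      voiceprints: dict,
                      target: str) -> np.ndarray:
    """Return True for frames whose closest voiceprint is the target's."""
    names = list(voiceprints)
    prints = np.stack([voiceprints[n] for n in names])   # (participants, dim)
    sims = unit(frame_embeddings) @ prints.T              # cosine similarities
    return sims.argmax(axis=1) == names.index(target)

# Toy 2-dimensional embeddings (illustrative values, not real voiceprints).
voiceprints = {
    "alice": enroll_voiceprint(np.array([[1.0, 0.0], [0.9, 0.1]])),
    "bob": enroll_voiceprint(np.array([[0.0, 1.0], [0.1, 0.9]])),
}
frames = np.array([[1.0, 0.05], [0.1, 1.0], [0.8, 0.2]])
alice_frames = target_frame_mask(frames, voiceprints, "alice")
```

A practical system would more likely condition a separation network on the target voiceprint rather than pick whole frames, but the enrollment-then-matching structure is the same.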

Based on the foregoing embodiment, the echo cancellation device 800 is specifically configured to, when the participation states of the participants are all the discussion state, acquire the microphone signals of the terminals 910, perform audio alignment on the microphone signals of the terminals 910, determine a target terminal based on the energy of the audio-aligned microphone signal of each terminal 910, and use the target terminal as the sound pickup terminal of the current speaker among the participants in the discussion state.

Specifically, when the echo cancellation device 800 in the conference system performs sound pickup control on each terminal 910, the selection of the microphone direction during sound pickup may proceed as follows. During a multi-person conference, multiple microphones are often used to pick up sound, and the microphones are located at different positions; in other words, the distances between the microphones and the speaker differ, so the microphone signals picked up by the microphones exhibit different transmission delays. For this reason, in the embodiment of the present invention, when every participant is in the discussion state, the echo cancellation device 800 may acquire the microphone signals of the terminals 910 and perform audio alignment on them, that is, use an autocorrelation function to align the audio and eliminate the transmission delays. This is a preprocessing step and a precondition for the next step (determining the energy): the energy can be determined only after audio alignment has ensured that the microphone signals are synchronized in time. In other words, this step lays the groundwork for the subsequent dynamic selection of microphones.

Then, the echo cancellation device 800 may determine the energy of the audio-aligned microphone signal of each terminal 910, determine a target terminal according to these energies, and use the target terminal as the sound pickup terminal of the current speaker. That is, the terminal with the largest microphone signal energy may be selected from the terminals 910 as the target terminal, and the target terminal is then used to pick up the audio signal of the current speaker among the participants in the discussion state.
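The alignment and energy-based selection described above can be sketched as follows. The text names an autocorrelation function; this sketch instead uses the cross-correlation between each microphone signal and the signal of an arbitrarily chosen reference terminal as a stand-in for the delay estimate, which is an assumption. The function names, the synthetic signals, and the gains and delays are all illustrative.

```python
import numpy as np

def estimate_delay(signal: np.ndarray, reference: np.ndarray) -> int:
    """Lag in samples by which `signal` trails `reference` (cross-correlation peak)."""
    corr = np.correlate(signal, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)

def align(signal: np.ndarray, delay: int) -> np.ndarray:
    """Shift `signal` by `delay` samples so it lines up with the reference."""
    aligned = np.zeros_like(signal)
    if delay > 0:
        aligned[:-delay] = signal[delay:]
    elif delay < 0:
        aligned[-delay:] = signal[:delay]
    else:
        aligned[:] = signal
    return aligned

def select_pickup_terminal(mic_signals: dict):
    """Align all microphone signals, then pick the terminal with the largest energy."""
    ids = list(mic_signals)
    reference = mic_signals[ids[0]]  # assumption: first terminal is the time reference
    delays, energies = {}, {}
    for tid in ids:
        delays[tid] = estimate_delay(mic_signals[tid], reference)
        aligned = align(mic_signals[tid], delays[tid])
        energies[tid] = float(np.sum(aligned ** 2))
    target = max(energies, key=energies.get)
    return target, delays, energies

def delayed(x: np.ndarray, d: int, gain: float) -> np.ndarray:
    """Synthetic microphone signal: `x` attenuated by `gain` and delayed by `d` samples."""
    return gain * np.concatenate([np.zeros(d), x[:len(x) - d]])

rng = np.random.default_rng(0)
speech = rng.standard_normal(256)          # stand-in for the current speaker's voice
mic_signals = {
    "terminal_a": delayed(speech, 0, 1.0),   # closest to the speaker
    "terminal_b": delayed(speech, 5, 0.5),
    "terminal_c": delayed(speech, 12, 0.2),  # farthest from the speaker
}
target, delays, energies = select_pickup_terminal(mic_signals)
```

In a streaming system the energies would be compared over short frames rather than over the whole signal, which is where time alignment matters most: without it, the same utterance would fall into different frames on different terminals.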

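Unlike a traditional conference scene, in which every microphone starts picking up sound as soon as its terminal is connected, dynamic selection is impossible, and the sound pickup quality during the conference cannot be guaranteed, the embodiment of the present invention dynamically selects the sound pickup terminal according to the energy of the audio-aligned microphone signals, which helps to ensure the sound pickup quality during the conference.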

Based on the above embodiment, the conference system provided by the present invention includes each terminal 910, and the echo cancellation apparatus 800;

The echo cancellation device 800 is further configured to determine, based on the state of each terminal 910 and/or the voice detection result of each terminal 910, the participation state of each participant corresponding to each terminal 910, and perform sound pickup control on each terminal 910 based on the participation state of each participant corresponding to each terminal 910; the state is a hand-held state or a placement state, and the participation state is any one of a discussion state, a speech state and a listening state.

The echo cancellation device 800 is specifically configured to, when the participation state of the participant corresponding to any terminal is the speech state, obtain the echo cancellation signal of the terminal, and perform voice separation on the echo cancellation signal of the terminal based on the voiceprint features of the participants to obtain a voice separation signal of the terminal; the voiceprint features of the participants are extracted based on the historical echo cancellation signals of the terminals 910.

The echo cancellation device 800 is specifically configured to, when the participation states of the participants are all the discussion state, acquire the microphone signals of the terminals 910, perform audio alignment on the microphone signals of the terminals 910, determine a target terminal based on the energy of the audio-aligned microphone signal of each terminal 910, and use the target terminal as the sound pickup terminal of the current speaker among the participants in the discussion state.

In the embodiment of the invention, each terminal can access the conference system through a conference APP (the conference APP can realize functions such as remote conference creation and conference access), and each terminal that has joined the conference serves as the sound pickup terminal of its corresponding participant. Meanwhile, the conference system can adjust the participation state by means of the state of the terminal and/or the voice detection result. Furthermore, when the participant corresponding to any terminal is in the speech state, voice separation can be performed on the echo cancellation signal of the terminal, so that background noise and interference from other voices in the speech state are suppressed, and the ability of the conference system to handle complex scenes is improved to a great extent.

Fig. 10 illustrates a physical structure diagram of an electronic device. As shown in Fig. 10, the electronic device may include: a processor 1010, a communication interface 1020, a memory 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. The processor 1010 may invoke logic instructions in the memory 1030 to perform an echo cancellation method comprising: acquiring a reference signal of each terminal and a microphone signal of any terminal in each terminal; respectively extracting the characteristics of the reference signal of each terminal and the microphone signal of the terminal, and determining the characteristics of the echo signal based on the characteristics of the reference signal of each terminal and the microphone signal of the terminal, which are obtained by characteristic extraction; and based on the echo signal characteristics, carrying out echo cancellation on the microphone signal of the terminal to obtain an echo cancellation signal of the terminal.

Furthermore, the logic instructions in the memory 1030 can be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the echo cancellation method provided by the above methods, the method comprising: acquiring a reference signal of each terminal and a microphone signal of any terminal in each terminal; respectively extracting the characteristics of the reference signal of each terminal and the microphone signal of the terminal, and determining the characteristics of the echo signal based on the characteristics of the reference signal of each terminal and the microphone signal of the terminal, which are obtained by characteristic extraction; and based on the echo signal characteristics, carrying out echo cancellation on the microphone signal of the terminal to obtain an echo cancellation signal of the terminal.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the echo cancellation method provided by the above methods, the method comprising: acquiring a reference signal of each terminal and a microphone signal of any terminal in each terminal; respectively extracting the characteristics of the reference signal of each terminal and the microphone signal of the terminal, and determining the characteristics of the echo signal based on the characteristics of the reference signal of each terminal and the microphone signal of the terminal, which are obtained by characteristic extraction; and based on the echo signal characteristics, carrying out echo cancellation on the microphone signal of the terminal to obtain an echo cancellation signal of the terminal.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An echo cancellation method, comprising:

2. The method of claim 1, wherein the determining the echo signal characteristic based on the reference signal characteristic of each terminal and the mic signal characteristic of any terminal obtained by the characteristic extraction comprises:

3. The echo cancellation method of claim 1, further comprising:

4. The echo cancellation method according to claim 3, wherein the determining, based on the states of the terminals and/or the voice detection results of the terminals, the conference state of each participant corresponding to each terminal includes:

and under the condition that the state of any terminal is a handheld state and the voice detection result of any terminal indicates that any terminal detects sound, determining that the conference state of the participant corresponding to any terminal is a speech state and the conference states of other participants are listening states.

5. The echo cancellation method according to claim 3, wherein the sound pickup control for each terminal based on the conference state of each participant corresponding to each terminal includes:

and performing voice separation on the echo cancellation signal of any terminal based on the voiceprint characteristics of all the participants to obtain the voice separation signal of any terminal, wherein the voiceprint characteristics of all the participants are extracted based on the historical echo cancellation signals of all the terminals.

6. The echo cancellation method according to claim 3, wherein the sound pickup control for each terminal based on the conference state of each participant corresponding to each terminal includes:

7. An echo cancellation device, comprising:

a signal acquisition unit, configured to acquire a reference signal of each terminal and a microphone signal of any terminal in the terminals;

8. A conference system, comprising terminals and an echo cancellation device;

the echo cancellation device is configured to determine an echo signal characteristic based on a reference signal characteristic of a reference signal of each terminal and a microphone signal characteristic of a microphone signal of any one of the terminals, and perform echo cancellation on the microphone signal of any one of the terminals based on the echo signal characteristic to obtain an echo cancellation signal of any one of the terminals.

9. The conferencing system of claim 8,

the echo cancellation device is further configured to determine, based on the state of each terminal and/or the voice detection result of each terminal, a participation state of each participant corresponding to each terminal, and perform pickup control on each terminal based on the participation state of each participant corresponding to each terminal;

10. The conferencing system of claim 9,

the echo cancellation device is specifically configured to, when the participation state of the participant corresponding to any terminal is the speech state, acquire an echo cancellation signal of the any terminal, and perform voice separation on the echo cancellation signal of the any terminal based on voiceprint features of the participants to obtain a voice separation signal of the any terminal;

11. The conferencing system of claim 9,

the echo cancellation device is specifically configured to, when the participation states of the participants are all the discussion state, acquire microphone signals of the terminals, perform audio alignment on the microphone signals of the terminals, determine a target terminal based on the energy of the audio-aligned microphone signal of each terminal, and use the target terminal as a sound pickup terminal of a current speaker among the participants in the discussion state.

12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the echo cancellation method of any of claims 1 to 6 when executing the program.

13. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the echo cancellation method of any one of claims 1 to 6.