CN110648678B - Scene identification method and system for conference with multiple microphones - Google Patents

Scene identification method and system for conference with multiple microphones

Info

Publication number
CN110648678B
CN110648678B
Authority
CN
China
Prior art keywords: microphone, voice, energy, frequency, speech
Legal status: Active
Application number
CN201910893667.1A
Other languages
Chinese (zh)
Other versions
CN110648678A (en)
Inventor
周建明
康元勋
冯万健
Current Assignee
Xiamen Yealink Network Technology Co Ltd
Original Assignee
Xiamen Yealink Network Technology Co Ltd
Application filed by Xiamen Yealink Network Technology Co Ltd
Priority to CN201910893667.1A
Publication of CN110648678A
Application granted
Publication of CN110648678B

Classifications

    • G10L21/0202
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/18 Speech or voice analysis, the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis, the extracted parameters being the cepstrum
    • G10L25/45 Speech or voice analysis characterised by the type of analysis window
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses a scene recognition method and system for a conference with multiple microphones. The method includes: in response to detecting speech signals of a plurality of microphone channels, storing them in frame alignment; calculating the speech energy of each frame of the microphone channel signals based on the aligned signals; and, based on speech energy tracking and scene recognition, identifying a scene in which a single person speaks and a scene in which multiple persons speak simultaneously, then switching the microphone output channels accordingly. By jointly considering speech energy, reverberation degree, noise and the like, the scheme helps select the microphone output channel with the best voice quality in both single-speaker and multi-speaker scenes.

Description

Scene identification method and system for conference with multiple microphones
Technical Field
The present application relates to the field of sound processing, and in particular, to a method and system for scene recognition for conferences with multiple microphones.
Background
In recent years, with the progress and development of VOIP technology, demand for video conferencing has been increasing: video conferences enable simultaneous voice communication among multiple users and have broad application prospects in the communication field. With the rise of intelligent voice, microphone-array sound pickup has also become one of the current popular technologies.
In the prior art, a microphone array is often used to mitigate the reduced recognition rate of far-field speech, with beamforming and adaptive filtering methods employed to suppress noise. In a real conference, several built-in microphones may be cascaded, and each built-in microphone may in turn be cascaded with several extension microphones; the built-in microphones are generally directional, while the extension microphones may be directional or omnidirectional. The pickup microphones of a typical conference system thus fall into omnidirectional and directional types: an omnidirectional microphone has a large pickup range but severe reverberation and poor sound quality, whereas a directional microphone has good sound quality but a narrow pickup range. When both types are present in a conference and their sensitivities are similar, the directional microphone sounds better at the same distance, yet the speech energy picked up by the omnidirectional microphone may be larger, especially in a highly reverberant conference room; if the microphone output channel is selected directly by speech energy, the omnidirectional microphone is likely to be chosen. Differences between the characteristics of different microphone types, deviations in microphone placement, and inaccurate estimates of the target speech direction all degrade pickup quality, so selecting the optimal microphone output channel has become a problem to be solved urgently.
Disclosure of Invention
The application aims to provide a scene recognition method and a scene recognition system for a conference with multiple microphones, so as to solve the technical problem of selecting a high-quality microphone voice output channel in an environment where multiple microphones pick up sound.
In a first aspect, an embodiment of the present application provides a scene identification method for a conference with multiple microphones, where the method includes:
s1: in response to detecting speech signals of multiple microphone channels, storing in frame alignment.
S2: based on the aligned speech signals, speech energy of each frame of the plurality of microphone channel speech signals is calculated.
S3: based on voice energy tracking and scene recognition, a scene in which a single person speaks and a scene in which multiple persons speak simultaneously are recognized, and therefore switching of microphone output channels is conducted.
In the method, voice detection and speech energy tracking first make a preliminary judgment between a mute scene and a talking scene; specific MFCC features, Euclidean distances and the like then identify whether a single person is speaking or multiple persons are speaking simultaneously, after which either the channels are mixed for output or the microphone speech signal with the least reverberation is selected as the output channel.
In some embodiments, the method further comprises: before step S1, removing the far-end speech signal using an AEC algorithm and a VAD algorithm, and outputting each microphone speech signal whose strength is greater than or equal to a first threshold. An Acoustic Echo Cancellation (AEC) algorithm effectively removes the far-end echo, i.e., the case where a mute scene is detected because no one is speaking at the near end of the microphone while someone is speaking at the far end. A Voice Activity Detection (VAD) algorithm computes a log-energy value; each microphone speech signal whose log-energy value is greater than or equal to the first threshold is used as a microphone-channel speech signal in step S1. This eliminates noise signals and improves the probability of selecting a high-quality microphone voice output channel.
In some embodiments, the first threshold value may be set to [ -70dB, -50dB ].
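As an illustration of the gate described above, the following is a minimal numpy sketch; the function name and the -60 dB default are assumptions chosen inside the stated [-70 dB, -50 dB] range, not values fixed by the patent:

```python
import numpy as np

def vad_gate(frame, threshold_db=-60.0):
    """Pass a frame only if its log energy (in dB) reaches the threshold.

    Illustrative sketch: the patent gates each microphone channel on a
    log-energy value; -60 dB is an assumed midpoint of [-70, -50] dB.
    """
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    log_energy_db = 20.0 * np.log10(rms + 1e-12)  # guard against log(0)
    return log_energy_db >= threshold_db
```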
In some embodiments, the method further comprises: scene recognition in step S3 includes the steps of:
in response to detecting speech signals of the plurality of microphone channels, the speech signals are pre-processed, wherein the pre-processing includes framing, pre-emphasizing, and windowing the speech signals.
And performing FFT (fast Fourier transform) on the voice signals of the plurality of preprocessed microphone channels to obtain corresponding frequency spectrums, calculating band energy through a Mel filter, and transferring the band energy to Mel frequency.
Calculating logarithmic energy of voice signals of the plurality of microphones output by the Mel filter, obtaining an MFCC coefficient through DCT transformation, and performing differential operation based on the MFCC coefficient to further calculate Euclidean distances of the plurality of microphones.
And recognizing a scene in which multiple persons speak simultaneously in response to the Euclidean distance between the two microphone channels with the maximum logarithmic energy being greater than or equal to a first threshold value, and mixing the voice signals of the two microphone channels with the maximum logarithmic energy and outputting the mixed voice signals.
In the method, MFCC analysis is based on the auditory mechanism of the human ear: perceived pitch is approximately linear in Mel frequency, so the Mel scale expresses the common mapping from physical speech frequency to perceived frequency. If the MFCC values of two microphone channels differ greatly, multiple people are speaking at the same time. The Euclidean distance is simply the straight-line distance between two points. Combining the Euclidean distance with the MFCC values of the two microphones with the largest logarithmic energy measures how far apart the two channels' MFCC features are, effectively distinguishing the multi-talker case from the single-talker case, with better robustness than existing scene recognition methods.
In some embodiments, the method further comprises: identifying the situation that the Euclidean distance of the two microphone channels with the largest logarithmic energy is smaller than a first threshold value as a scene of single speaking, and executing the following steps:
in response to detecting speech signals of the plurality of microphone channels, the speech signals are pre-processed, wherein the pre-processing includes framing and windowing the speech signals.
And performing FFT (fast Fourier transform) on the preprocessed voice signals of the multiple microphone channels to acquire corresponding frequency spectrums, and calculating the high-frequency voice energy average value of the current frame of each microphone channel.
And calculating the ratio of the high-frequency voice energy average value of each microphone channel to the high-frequency voice energy average value of the currently selected microphone output channel, and selecting the microphone channel with the ratio being greater than or equal to a second threshold value as a new microphone output channel.
In this method, different microphones in the same conference room pick up different degrees of reverberation. In a single-person speaking scene where the speech energies are similar, the microphone with the higher proportion of high-frequency energy is generally the one the speaker faces, because direct speech is concentrated at high frequencies while reverberation is concentrated at medium and low frequencies. Sound signals of different frequencies differ in their ability to bend around obstacles: a high-frequency signal has a shorter wavelength, does not easily bypass obstacles, and attenuates quickly, whereas a low-frequency signal has a longer wavelength, bypasses obstacles easily, and attenuates slowly. Moreover, when propagating in air, high-frequency signals are absorbed more strongly by obstacles such as walls and therefore attenuate more readily, while low-frequency signals are absorbed less. Given these differences in reverberation behavior between frequencies, this embodiment checks whether the ratio of each microphone channel's average high-frequency speech energy to that of the current microphone channel is greater than or equal to the second threshold, and outputs the channel with the higher average high-frequency energy, avoiding the highly reverberant, poor-sounding output channel that can result from selecting solely by speech energy.
In some embodiments, the method further comprises: the speech energy in step S2 is the root mean square (RMS) value of each frame of the aligned microphone speech signals, calculated as

$$\mathrm{rms}_i=\sqrt{\frac{1}{L}\sum_{j=1}^{L}x_{ij}^{2}}$$

where the voice data of the i-th microphone is denoted $x_{i1}, x_{i2}, \ldots, x_{iL}$ and $L$ is the speech frame length. Using the RMS value reflects the speech energy intuitively and effectively, and is convenient for the long-time and short-time energy tracking performed in subsequent steps.
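A minimal numpy sketch of the per-frame RMS computation above; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def frame_rms(frames):
    """Root-mean-square speech energy per frame.

    frames: array of shape (num_frames, L) holding the aligned samples
    x_i1 ... x_iL of one microphone channel; L is the frame length.
    """
    return np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
```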
In some embodiments, the method further comprises: the energy tracking comprises short-time tracking and long-time tracking; switching from the mute state to the voice state is based on short-time tracking, and switching of the microphone speech output channel during a long mute or voice state is based on long-time tracking. When a microphone channel switches from mute to voice, short-time tracking with a short recording interval (for example, 200 ms) is used to avoid switching too late; when the microphone stays in a voice or mute state for a long time, long-time tracking with a long recording interval (for example, 2 s) is used to avoid frequent operation of the system and save cost.
In some embodiments, the method further comprises: the smoothing process follows a sinusoidal curve, for example

$$\mathrm{smooth1}(i)=\frac{1}{2}\left(1+\cos\frac{\pi i}{L-1}\right),\quad i=0,1,\ldots,L-1$$

$$\mathrm{smooth2}(i)=1-\mathrm{smooth1}(i),\quad i=0,1,\ldots,L-1$$

where i indexes the samples of the smoothing window, $x_1$ is the speech signal of the current microphone, and $x_2$ is the speech signal of the channel with the maximum current speech energy; the smoothed microphone speech signal is then

$$x(i)=\mathrm{smooth1}(i)\,x_1(i)+\mathrm{smooth2}(i)\,x_2(i),\quad i=0,1,\ldots,L-1.$$

Switching the microphone speech output channels with this smoothing avoids noise inside the equipment caused by too large a change of the microphone speech signals during switching.
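A minimal numpy sketch of the crossfade above. The raised-cosine curve is an assumed realization of the "sinusoidal" smoothing consistent with Fig. 3 (smooth1 falling from 1 to 0, smooth2 rising from 0 to 1); the exact curve in the patent may differ:

```python
import numpy as np

def crossfade(x1, x2):
    """Sinusoidally crossfade from channel x1 to channel x2 over one window."""
    L = len(x1)
    i = np.arange(L)
    smooth1 = 0.5 * (1.0 + np.cos(np.pi * i / max(L - 1, 1)))  # 1 -> 0
    smooth2 = 1.0 - smooth1                                     # 0 -> 1
    return smooth1 * x1 + smooth2 * x2
```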
In some embodiments, the method further comprises: the windowing multiplies each frame by a Hamming window:

$$S'(n)=S(n)\times W(n)$$

$$W(n)=(1-a)-a\cos\frac{2\pi n}{N-1},\quad 0\le n\le N-1$$

where S(n) is the microphone speech signal, W(n) is the Hamming window with coefficient a, S′(n) is the windowed speech signal, and N is the frame length.
In some embodiments, the method further comprises: the FFT is calculated as

$$X(k)=\sum_{n=0}^{N_1-1}x(n)\,e^{-j\frac{2\pi nk}{N_1}},\quad k=0,1,\ldots,N_1-1$$

where X(k) is the transformed spectrum, x(n) is the speech time-domain signal of the microphone, j is the imaginary unit, $\frac{2\pi k}{N_1}$ is the angular frequency, and N1 is the number of points of the Fourier transform.
In some embodiments, the method further comprises: the high-frequency speech energy is taken over the range [4 kHz, 8 kHz], and the high-frequency speech energy average is obtained by averaging the high-frequency energy values of the current frame and several historical frames.
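A minimal numpy sketch of the band-energy average above; the sample rate, window, and history length are assumptions (the text later suggests averaging over 5 to 20 frames):

```python
import numpy as np

def hf_energy(frame, fs=16000, band=(4000.0, 8000.0)):
    """Energy of one frame inside the high-frequency band [4 kHz, 8 kHz]."""
    spec = np.fft.rfft(frame * np.hamming(len(frame)))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return float(np.sum(np.abs(spec[mask]) ** 2))

def hf_energy_avg(frames, fs=16000, n_hist=10):
    """Mean high-frequency energy of the current frame plus recent history.

    n_hist = 10 is an assumed history length inside the 5-20 frame range.
    """
    recent = frames[-n_hist:]
    return float(np.mean([hf_energy(f, fs) for f in recent]))
```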
In another aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and when executed by a processor, the computer program implements the method of the above embodiments.
In a third aspect, an embodiment of the present application provides a scene recognition system for a conference with multiple microphones, where the system includes:
a voice detection unit: configured for storing in frame alignment in response to detecting speech signals from a plurality of microphone channels.
An energy calculation unit: and the system is configured to calculate the voice energy of each frame of the plurality of microphones based on the aligned plurality of microphone channel voice signals.
A scene recognition unit: configured to pre-process the speech signals in response to detecting speech signals of the plurality of microphone channels, wherein the pre-processing includes framing, pre-emphasizing and windowing the speech signals; performing FFT (fast Fourier transform) on the voice signals of the plurality of preprocessed microphone channels to obtain corresponding frequency spectrums, calculating band energy through a Mel filter, and transferring the band energy to Mel frequency; calculating logarithmic energy of voice signals of the plurality of microphones output by the Mel filter, obtaining MFCC coefficients through DCT, and performing differential operation based on the MFCC coefficients to further calculate Euclidean distances of the MFCC coefficients of the plurality of microphones; and under the condition that the Euclidean distance between two microphone channels with the largest logarithmic energy is greater than or equal to a first threshold value, determining that the scene is a scene in which multiple persons speak simultaneously, otherwise, determining that the scene is a scene in which a single person speaks.
A minimum-reverberation selection unit: configured to, in a single-person speaking scene, pre-process the speech signals in response to detecting speech signals of the plurality of microphone channels, wherein the pre-processing includes framing and windowing; perform an FFT on the preprocessed speech signals of the microphone channels to obtain the corresponding spectra and calculate the average high-frequency speech energy of the current frame of each microphone channel; and calculate the ratio of each microphone channel's average high-frequency speech energy to that of the currently selected microphone output channel, selecting a microphone channel whose ratio is greater than or equal to the second threshold as the new microphone output channel.
The embodiment of the application provides a scene identification method and system for a conference with multiple microphones. The method includes: in response to detecting speech signals of a plurality of microphone channels, storing them in frame alignment; calculating the speech energy of each frame of the microphone channel signals based on the aligned signals; and, based on speech energy tracking and scene recognition, identifying a scene in which a single person speaks and a scene in which multiple persons speak simultaneously, then switching the microphone output channels accordingly. By jointly considering speech energy, reverberation degree, noise and the like, the scheme helps select the microphone output channel with the best voice quality in both single-speaker and multi-speaker scenes.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a schematic diagram of a scene recognition method for a conference with multiple microphones according to an embodiment of the present application;
fig. 2 is a flow chart of a scene recognition method for a conference with multiple microphones in accordance with an embodiment of the present application;
FIG. 3 is a schematic diagram of speech smoothing coefficients for a scene recognition method with a multi-microphone conference according to an embodiment of the present application;
fig. 4 is a schematic diagram of a scene recognition step of a scene recognition method for a conference with multiple microphones according to an embodiment of the present application;
fig. 5 is a flowchart illustrating the steps of selecting the minimum reverberation for a scene recognition method with a multi-microphone conference according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a scene recognition system for a conference with multiple microphones according to an embodiment of the present application;
FIG. 7 is a schematic distribution diagram of microphone arrays according to an embodiment of the present application;
FIG. 8 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 is a schematic diagram illustrating steps of a scene recognition method for a conference with multiple microphones in an embodiment of the present application, as shown in fig. 1:
s1: in response to detecting speech signals of multiple microphone channels, storing in frame alignment.
In this step, speech signals from the microphone input channels are detected and collected. The microphones may be built-in or extension microphones; because their distances from the sound source differ, the detected speech signal strengths of the input channels also differ. In the voice detection step, the speech signals of the microphone input channels are collected, stored in frame alignment, and passed to the speech energy calculation of step S2.
S2: based on the aligned speech signals, speech energy of each frame of the plurality of microphone channel speech signals is calculated.
In this step, the collected microphone input speech signals are stored in frames and aligned, and the speech energy of each microphone channel is calculated. The RMS value reflects the speech energy intuitively and facilitates the subsequent intermittent tracking based on speech energy.
S3: based on voice energy tracking and scene recognition, a scene in which a single person speaks and a scene in which multiple persons speak simultaneously are recognized, and therefore switching of microphone output channels is conducted.
In this step, Mel-scale Frequency Cepstral Coefficients (MFCC) are used to calculate the Euclidean distance between the two microphone channels with the largest logarithmic energy, so as to identify a single-person speaking scene or a multi-person simultaneous speaking scene. When a single-person scene is identified, high-frequency energy comparison is performed to output the microphone channel with less reverberation; when a multi-person scene is identified, mixing is performed to output the mixed microphone channel.
In some preferred embodiments, an Acoustic Echo Cancellation (AEC) algorithm and a Voice Activity Detection (VAD) algorithm are used to cancel the far-end speech signal, and each microphone input speech signal whose strength is greater than or equal to the first threshold is output as a speech signal for step S1. In a real conference, several built-in microphones may be cascaded, and each built-in microphone may be cascaded with several extension microphones; the built-in microphones are generally directional, and an extension microphone may be either directional or omnidirectional. The AEC and VAD algorithms effectively eliminate the far-end echo and any noise whose log-energy value is below the first threshold, improving the probability of selecting a high-quality microphone voice output channel.
In some embodiments, the microphone speech signal is segmented and stored in frames, each frame 8 ms or 10 ms long, to facilitate delay alignment and energy calculation in subsequent steps.
In some embodiments, the first threshold value may be set to [ -70dB, -50dB ].
In some preferred embodiments, the speech energy in step S2 is the root mean square value of each frame of the aligned microphone speech signals:

$$\mathrm{rms}_i=\sqrt{\frac{1}{L}\sum_{j=1}^{L}x_{ij}^{2}}$$

where the aligned voice data of the i-th microphone is denoted $x_{i1}, x_{i2}, \ldots, x_{iL}$ and L is the speech frame length.

In some preferred embodiments, the speech frame length covers at least 10 frames of the microphone speech signal, with each frame set to 10 ms. The conference system is a real-time system, and each transmitted microphone speech frame is 8 ms or 10 ms long. Because the microphone's speech energy is represented by the RMS value, the window cannot be too short; in the preferred scheme each frame is 10 ms and 10 consecutive frames are averaged, so the computed RMS value is of moderate size and convenient to track.
With continuing reference to fig. 2, a flow diagram of a method for scene recognition for a conference with multiple microphones in accordance with an embodiment of the present application is shown. The method comprises the following steps:
step 201: and (6) pre-gain processing. The sound pickup microphone of a general conference system can be divided into an omnidirectional microphone and a directional microphone (or a directional microphone), wherein the omnidirectional microphone has a large sound pickup range, but has serious reverberation and poor sound quality, and the directional microphone has good sound quality but a narrow sound pickup range. When the omnidirectional microphone and the directional microphone exist in the conference at the same time, and the sensitivities of the omnidirectional microphone and the directional microphone are not greatly different, the sound quality of the directional microphone is better under the same distance, but the voice energy picked up by the omnidirectional microphone is possibly larger at the moment, especially in a conference room with large reverberation, if the output channel of the microphone is directly selected according to the voice energy, the switching to the omnidirectional microphone is likely to be carried out. Therefore, pre-gain processing is added to the Echo processing module of the omnidirectional microphone, and under the condition that the omnidirectional microphone and the directional microphone exist at the same distance, the voice energy of the omnidirectional microphone is reduced, namely, negative gain is applied, the negative gain value can be obtained by calculating the difference between the current Echo Return Loss (ERL) and a target ERL, wherein the target ERL is the time (such as 1m) when the microphone is at a certain value away from the loudspeaker, the microphone and the loudspeaker are taken as a system, and the ratio of the sound energy played by the loudspeaker to the signal energy picked up by the microphone, namely, the Echo signal attenuation is realized. As the omni-directional microphone is closer to the sound source, the more the voice energy drops, thereby reducing the probability of selecting the omni-directional microphone.
Step 202: voice detection. In response to detecting the signal strength of each microphone input channel, an AEC algorithm eliminates the far-end speech signal, and a VAD algorithm judges whether each microphone input channel is greater than or equal to the set first threshold, i.e., whether it is in a voice state. Voice detection helps eliminate noise signals and improves the probability of selecting a high-quality microphone voice output channel.
Step 203: energy calculation. The speech energy of the aligned microphone speech signals is calculated: the speech data input by the microphones is frame-aligned, and the RMS value of each frame of each microphone is computed. The RMS value reflects the speech energy intuitively and is convenient for tracking.
Step 204: energy tracking. An adaptive analysis process intermittently tracks the speech input energy of each microphone. A long-time or short-time tracking mode is selected according to the working state of the microphone, and the scene recognition of step 205 is executed.
Step 205: scene recognition. Mel-scale Frequency Cepstral Coefficients (MFCC) are used to calculate the Euclidean distance between the two microphone channels with the largest logarithmic energy so as to identify single-person and multi-person simultaneous speaking scenes. When a single-person scene is identified, high-frequency energy comparison is performed to output the microphone channel with less reverberation; when a multi-person scene is identified, mixing is performed to output the mixed microphone channel.
Step 206: microphone selection. The microphone voice channel corresponding to the scene identified in step 205 is output.
Step 207: mixing processing. When the MFCC analysis identifies a multi-person simultaneous speaking scene, the speech signals of the two microphone channels with the largest logarithmic energy must be mixed. There are many mixing methods, including direct addition, weighted averaging, attenuated addition, non-uniform Wave-damping (AWS), and the like.
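A minimal sketch of one of the listed options, weighted averaging; the equal weights and the float sample range are assumptions for illustration:

```python
import numpy as np

def mix_weighted_average(ch_a, ch_b, w_a=0.5, w_b=0.5):
    """Mix the two highest-energy microphone channels by weighted average.

    Clipping guards against overflow when the inputs are near full scale;
    assumes float samples in [-1, 1].
    """
    mixed = w_a * ch_a + w_b * ch_b
    return np.clip(mixed, -1.0, 1.0)
```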
Step 208: smoothing. When a single person walks from one microphone to another, or two persons speak alternately into two microphones, smoothing is used to switch the microphone channels, avoiding noise and similar artifacts during the switch of the microphone output channels.
In some preferred embodiments, the method further comprises: the smoothing process follows a sinusoidal curve, for example

$$\mathrm{smooth1}(i)=\frac{1}{2}\left(1+\cos\frac{\pi i}{L-1}\right),\quad i=0,1,\ldots,L-1$$

$$\mathrm{smooth2}(i)=1-\mathrm{smooth1}(i),\quad i=0,1,\ldots,L-1$$

where i indexes the samples of the smoothing window, $x_1$ is the current microphone speech signal, and $x_2$ is the speech signal of the channel with the maximum current speech energy; the smoothed microphone speech signal is

$$x(i)=\mathrm{smooth1}(i)\,x_1(i)+\mathrm{smooth2}(i)\,x_2(i),\quad i=0,1,\ldots,L-1.$$
During switching, the speech energy collected by one of the two microphone output channels gradually decreases while the other gradually increases; smoothing is applied once the speech energy difference exceeds the second threshold. Switching the microphone speech output channels based on this smoothing avoids noise inside the equipment caused by too large a change of the microphone speech signals during the switch.
In some preferred embodiments, the second threshold is set to [3 dB, 6 dB]; different second thresholds may be set for specific scenes, for example, the second threshold may be increased by 6 dB when a duplex scene is identified.
Fig. 3 is a schematic diagram of the speech smoothing coefficients for a scene recognition method for a conference with multiple microphones according to an embodiment of the present application. The abscissa is the smoothing length and the ordinate the smoothing coefficient; smooth1 applies to the microphone output signal with the smaller speech energy and smooth2 to the one with the larger speech energy. The coefficient of smooth1 decreases gradually from 1 to 0 while that of smooth2 increases from 0 to 1, completing a smooth switch between the two microphone output channels.
In some preferred embodiments, the energy tracking includes short-time tracking and long-time tracking: short-time tracking handles switching from the mute state to the voice state, and long-time tracking handles channel switching during a long mute or voice state. Short-time tracking records the speech energy over the previous period T1 (for example, 200 ms) with a relatively short interval per recording; long-time tracking records over the previous period T2 (for example, 2 s) with a relatively long interval.
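A minimal sketch of the two recording rates described above; the class name, the simplified state flag, and the omission of the state-update logic are all assumptions for illustration:

```python
import time

class EnergyTracker:
    """Intermittent short-/long-time energy tracking (step 204), sketched.

    Interval values 200 ms and 2 s follow the examples in the text.
    """
    def __init__(self, short_interval=0.2, long_interval=2.0):
        self.short_interval = short_interval
        self.long_interval = long_interval
        self.last_record = 0.0
        self.in_speech = False  # current channel state; updates omitted here

    def maybe_record(self, energy, now=None):
        now = time.monotonic() if now is None else now
        # short interval while watching for a mute -> speech transition,
        # long interval during a stable voice or mute state
        interval = self.short_interval if not self.in_speech else self.long_interval
        if now - self.last_record >= interval:
            self.last_record = now
            return energy  # record this sample
        return None
```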
It should be noted that when voice detection finds the microphone output channels in a mute state, the noise energies of the microphones are compared: the larger the noise energy, the closer the microphone is to the noise source, and the smaller, the farther away. If the difference between the two noise energies is greater than or equal to the second threshold, the microphone channel with the smaller noise energy is output.
In addition, when the microphone output channel has been in a mute state or outputting for a long time, a short-lived burst of larger speech energy or noise on another microphone, even one reaching the second threshold relative to the current channel, does not trigger switching; the current microphone voice output channel is kept.
With continued reference to fig. 4, a schematic diagram of scene recognition steps for a scene recognition method for a conference with multiple microphones is shown according to an embodiment of the present application. The method comprises the following steps:
step 401: continuous speech. Based on the pre-gain processing, the speech energy calculation and the speech energy tracking, when the microphone channel continuous speech signal is detected, the following 402-411 steps are performed.
Step 402: framing. The speech signals of each microphone channel are re-framed.
Step 403: pre-emphasis. Pre-emphasis is applied to the framed speech signals of each microphone channel:

$$H(z)=1-\mu z^{-1}$$

where μ is the pre-emphasis coefficient. Pre-emphasis boosts the high-frequency part of the speech signal, flattens the spectrum of each microphone channel's speech signal, removes the adverse effects of the speaker's vocal cords and lips on speech generation, highlights the high-frequency formants, and increases the high-frequency resolution of the speech.
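A minimal numpy sketch of the filter above, using the μ = 0.97 value preferred later in the text; the function name is illustrative:

```python
import numpy as np

def pre_emphasis(signal, mu=0.97):
    """Apply the pre-emphasis filter H(z) = 1 - mu * z^-1.

    y[n] = x[n] - mu * x[n-1]; the first sample is passed through.
    """
    return np.append(signal[0], signal[1:] - mu * signal[:-1])
```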
Step 404: windowing. Each frame of each microphone channel's speech signal is multiplied by a Hamming window:

$$S'(n)=S(n)\times W(n)$$

$$W(n)=(1-a)-a\cos\frac{2\pi n}{N-1},\quad 0\le n\le N-1$$

where S(n) is the microphone speech signal, W(n) is the Hamming window, N is the frame length, and a is the Hamming window coefficient. Windowing makes the microphone speech signal periodic, reducing speech energy leakage in the fast Fourier transform.
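A minimal numpy sketch of the window above, using the a = 0.46 coefficient preferred later in the text:

```python
import numpy as np

def hamming_window(N, a=0.46):
    """Hamming window W(n) = (1 - a) - a * cos(2*pi*n / (N - 1))."""
    n = np.arange(N)
    return (1.0 - a) - a * np.cos(2.0 * np.pi * n / (N - 1))

# windowed frame, S'(n) = S(n) * W(n):
# frame_w = frame * hamming_window(len(frame))
```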
Step 405: Fast Fourier Transform (FFT). Each framed and windowed frame is FFT-transformed to obtain the spectrum of each frame of the microphone speech signals, and the squared magnitude of the spectrum gives the energy spectrum:

$$X(k)=\sum_{n=0}^{N_1-1}x(n)\,e^{-j\frac{2\pi nk}{N_1}},\quad k=0,1,\ldots,N_1-1$$

where X(k) is the transformed spectrum, x(n) is the speech time-domain signal of the microphone, j is the imaginary unit, $\frac{2\pi k}{N_1}$ is the angular frequency, and N1 is the number of points of the Fourier transform.
The characteristics of the speech signal are hard to see in the time domain, so each windowed frame is FFT-transformed to obtain the speech energy distribution over the spectrum. Different sound sources show visibly different energy distributions, which makes it possible to identify whether the speech signals of the microphones come from the same sound source.
Step 406: Mel triangular filters (Mel filter bank). The energy spectrum is passed through a bank of Mel-scale triangular filters; a bank of M triangular filters is defined whose center frequencies are equally spaced on the Mel scale. The conversion between frequency and the Mel domain is

$$f_{mel}=2595\log_{10}\left(1+\frac{f}{700}\right)$$

where f denotes frequency and fmel denotes Mel frequency.

The frequency response of the Mel triangular filter is defined as

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\[2pt] \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k\le f(m)\\[2pt] \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)\le k\le f(m+1)\\[2pt] 0, & k>f(m+1)\end{cases}$$

where the Mel frequencies are linearly spaced at equal intervals within the defined range, f(m) is the center frequency of the m-th filter, Hm(k) is the frequency response of the triangular filter, and k indexes the points of the Fourier transform.
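A minimal numpy sketch of the filter bank above. M = 24 and f in [0, 4000] follow the preferred values given later in the text; the FFT size and sample rate are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    """f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M=24, n_fft=512, fs=8000, f_lo=0.0, f_hi=4000.0):
    """M triangular filters with centers equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), M + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((M, n_fft // 2 + 1))
    for m in range(1, M + 1):
        lo, ctr, hi = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(lo, ctr):            # rising slope
            fbank[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):            # falling slope
            fbank[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fbank
```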
Step 407: logarithm operation. The logarithmic energy output by each filter is calculated as

$$En(m)=\ln\left(\sum_{k=0}^{N_1-1}\lvert X(k)\rvert^{2}\,H_m(k)\right),\quad 0\le m<M$$

where En(m) is the logarithmic energy, Hm(k) is the frequency response of the triangular filter, and X(k) is the transformed spectrum. The logarithm is taken because the human ear's perception of sound is nonlinear, approximately logarithmic: human hearing does not perceive loudness on a linear scale.
Step 408: Discrete Cosine Transform (DCT). The MFCC coefficients are obtained by applying a DCT to the log energies:

$$C(l)=\sqrt{\frac{2}{M}}\sum_{m=0}^{M-1}En(m)\cos\left(\frac{\pi l\,(m+\tfrac12)}{M}\right),\quad l=1,2,\ldots,L$$

where L is the order of the MFCC coefficients (also the MFCC dimension), M is the number of triangular filters, and En(m) is the logarithmic energy.
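A minimal numpy sketch of steps 407-408 combined, writing the DCT basis out explicitly so it mirrors the formula above; the function name and the small epsilon guard are assumptions:

```python
import numpy as np

def mfcc_from_power_spectrum(power_spec, fbank, n_mfcc=12):
    """Log filterbank energies followed by a DCT (steps 407-408).

    power_spec: |X(k)|^2 of one windowed frame (length n_fft//2 + 1);
    fbank: the Mel triangular filter bank; n_mfcc = 12 follows the
    typical order mentioned in the text.
    """
    energies = fbank @ power_spec                 # band energy per filter
    log_en = np.log(energies + 1e-12)             # En(m), guarded log
    M = len(log_en)
    l = np.arange(1, n_mfcc + 1)[:, None]         # coefficient index l
    m = np.arange(M)[None, :]                     # filter index m
    basis = np.cos(np.pi * l * (m + 0.5) / M)     # DCT-II basis
    return np.sqrt(2.0 / M) * (basis @ log_en)    # C(l)
```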
Step 409: MFCC features. A difference operation on the MFCC coefficients yields first-order and second-order differences, which incorporate information from the frames before and after the current speech frame:

$$d_t=\frac{\sum_{k=1}^{K}k\,(C_{t+k}-C_{t-k})}{2\sum_{k=1}^{K}k^{2}}$$

where dt is the t-th first-order difference parameter, Q is the dimensionality of the cepstral coefficients, K is the time span of the first derivative, and Ct is the t-th cepstral coefficient. Because speech differs between speakers, a single parameter rarely achieves reliable performance; combining feature parameters improves the performance of a practical system, and the effect of reflecting different characteristics of the speech signal is best when the parameter groups are weakly correlated.
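A minimal numpy sketch of the first-order difference above; K = 2 and edge padding are assumed choices, not values fixed by the patent:

```python
import numpy as np

def delta(coeffs, K=2):
    """First-order difference of MFCC frames (step 409).

    coeffs: array (num_frames, n_mfcc).
    d_t = sum_{k=1..K} k * (C_{t+k} - C_{t-k}) / (2 * sum_{k=1..K} k^2)
    """
    T = len(coeffs)
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    padded = np.pad(coeffs, ((K, K), (0, 0)), mode='edge')  # repeat edges
    out = np.zeros_like(coeffs, dtype=np.float64)
    for t in range(T):
        for k in range(1, K + 1):
            out[t] += k * (padded[t + K + k] - padded[t + K - k])
    return out / denom
```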
Step 410: Euclidean distance. The Euclidean distance between the MFCC coefficients of different microphones is calculated as

$$d(x,y)=\sqrt{\sum_{i=1}^{N}(x_i-y_i)^{2}}$$

where x and y are the MFCC coefficient vectors of different microphones and N is the number of MFCC coefficients.
Step 411: microphone output channel selection. When the Euclidean distance between the two microphone channels with the largest logarithmic energy is greater than or equal to the first threshold, a multi-person simultaneous speaking scene is recognized, and the speech signals of those two channels are mixed and output. When the distance is smaller than the first threshold, a single-person speaking scene is recognized, and high-frequency energy comparison is performed to output the microphone channel with less reverberation.
In the method, MFCC analysis is based on the auditory mechanism of the human ear: perceived pitch is approximately linear in Mel frequency, so the Mel scale expresses the common mapping from physical speech frequency to perceived frequency. If the MFCC values of two microphone channels differ greatly, multiple people are speaking at the same time. The Euclidean distance is simply the straight-line distance between two points. Combining the Euclidean distance with the MFCC values of the two microphones with the largest logarithmic energy measures how far apart the two channels' MFCC features are, effectively distinguishing the multi-talker case from the single-talker case, with better robustness than existing scene recognition methods.
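A minimal sketch of the step 411 decision; the threshold of 30 is an assumed example inside the [20, 40] range given below:

```python
import numpy as np

def is_multi_talker(mfcc_a, mfcc_b, first_threshold=30.0):
    """Scene decision for the two highest-energy channels (step 411)."""
    dist = np.sqrt(np.sum((mfcc_a - mfcc_b) ** 2))  # Euclidean distance
    return dist >= first_threshold                   # True: mix; False: HFEC
```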
In a particularly preferred embodiment, the speech signal is reframed in step 402, which may be set to have a length of 8ms or 10ms per frame.
In a specific preferred embodiment, the value range of μ in step 403 is [0.9,1.0], and in a specific embodiment, 0.97 may be preferred.
In a specific preferred embodiment, the Hamming window coefficient a in step 404 has a value range of [0.1, 1.0]; in a specific embodiment, 0.46 is preferred.
In a specific preferred embodiment, in step 406, the number M of mel triangular filter banks is 24, and the value range of f is [0,4000 ].
In a particularly preferred embodiment, 12 MFCC coefficients are typically taken in step 408; the calculation keeps only the real part, with no imaginary part.
In a particularly preferred embodiment, the order of the MFCC coefficient in step 409 may also be 13, which is the dimension of the MFCC.
In a particularly preferred embodiment, the first threshold in step 411 is set to [20, 40].
When a single-person speaking scene is identified, a microphone farther from the sound source may nevertheless pick up more speech energy in a highly reverberant conference room, owing to microphone sensitivity and room reverberation; directly selecting the channel with the largest speech energy by energy tracking might then choose a channel that is farther from the sound source but has a larger reverberation ratio. Therefore, the distance between the two microphones and the speaker is judged by the absolute value of the energy difference between the two microphones with the largest speech energy, and the reverberation ratio of each microphone can be calculated by two methods, the Speech-to-Reverberation Modulation energy Ratio (SRMR) and High Frequency Energy Contrast (HFEC), to select the microphone output channel with the least reverberation.
The Speech-to-Reverberation Modulation energy Ratio (SRMR) is a current mainstream method for non-intrusive evaluation of speech intelligibility. The specific steps are: pass the speech signal through a 23-channel gammatone filter bank with Equivalent Rectangular Bandwidth (ERB) spacing; compute the instantaneous envelope of each band output via the Hilbert transform; window the instantaneous envelope (Hamming window) and apply the Discrete Fourier Transform (DFT); process the DFT signal with an 8-channel auditory modulation filter bank to obtain 8 modulation bands; compute the SRMR as the ratio of the energy of the first 4 bands to that of the last 4; and compare the SRMR of each microphone with that of the currently selected microphone, selecting a microphone's output channel if its SRMR is greater than or equal to the threshold. The SRMR is calculated as

$$\mathrm{SRMR}=\frac{\sum_{k=1}^{4}b_k}{\sum_{k=5}^{8}b_k}$$

where bk is the energy of the k-th modulation band.
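A minimal sketch of the final ratio only, given the 8 modulation-band energies; the gammatone filter bank, Hilbert envelope, and modulation filters that produce them are omitted, and the epsilon guard is an assumption:

```python
import numpy as np

def srmr_from_band_energies(b):
    """SRMR given modulation-band energies b_1..b_8 (last pipeline step)."""
    b = np.asarray(b, dtype=np.float64)
    return float(np.sum(b[:4]) / (np.sum(b[4:8]) + 1e-12))
```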
In this embodiment, High Frequency Energy Contrast (HFEC) is used to select the microphone output channel with the least reverberation: the high-frequency portion of each channel's FFT is summed, the averaged value is compared with that of the current microphone output channel, and the microphone speech signal with the least reverberation is output. With continuing reference to fig. 5, a flowchart illustrating the steps for selecting the minimum reverberation according to an embodiment of the application is shown; the method comprises the following steps:
step 501: and (5) windowing. Multiplying each frame of voice signal of each microphone channel by a Hamming window to reduce voice energy leakage in voice signal FFT, wherein the specific calculation formula of windowing is as follows:
S′(n)=S(n)×W(n)
Figure BDA0002209566350000171
wherein, s (N) represents a plurality of microphone voice signals, W (N) represents a hamming window, N is the frame length, a is the hamming window coefficient, and W (N, a) is the windowed voice signal.
Step 502: Fast Fourier Transform (FFT). Each framed and windowed frame is FFT-transformed to obtain the spectrum of each frame of the microphone speech signals, and the squared magnitude of the spectrum gives the energy spectrum:

$$X(k)=\sum_{n=0}^{N_1-1}x(n)\,e^{-j\frac{2\pi nk}{N_1}},\quad k=0,1,\ldots,N_1-1$$

where X(k) is the transformed spectrum, x(n) is the speech time-domain signal of the microphone, j is the imaginary unit, $\frac{2\pi k}{N_1}$ is the angular frequency, and N1 is the number of points of the Fourier transform.
The characteristics of the speech signal are hard to see in the time domain; each windowed frame must be FFT-transformed to obtain the speech energy distribution over the spectrum. Different sound sources show visibly different energy distributions, which makes it possible to identify whether the speech signals of the microphones come from the same sound source.
Step 503: high-frequency energy calculation. The preprocessed speech signals of the microphone channels are FFT-transformed in step 502 to obtain the corresponding spectra, and the average high-frequency speech energy of the current frame of each microphone channel is calculated. Averaging the high-frequency speech energy effectively prevents a microphone voice channel from being selected by mistake because of a noise burst or a momentary enlargement of the speech signal.
Step 504: energy comparison. The ratio of each microphone channel's average high-frequency speech energy to that of the currently selected microphone output channel is calculated, and a microphone channel whose ratio is greater than or equal to the second threshold is selected as the new output channel. The new output channel is determined by comparing the calculated ratios with the second threshold; the second threshold can be adjusted to the specific conditions of the conference, thereby controlling the microphone output channel.
Step 505: microphone output channel selection. The microphone channel whose ratio calculated in step 504 is greater than or equal to the second threshold is output.
In this method, the applicant concluded from years of practical experience that different microphones in the same conference room pick up different degrees of reverberation. In a single-person speaking scene where the speech energies are similar, the microphone with the higher proportion of high-frequency energy is generally the one the speaker faces, because direct speech is concentrated at high frequencies while reverberation is concentrated at medium and low frequencies. Sound signals of different frequencies differ in their ability to bend around obstacles: a high-frequency signal has a shorter wavelength, does not easily bypass obstacles, and attenuates quickly, whereas a low-frequency signal has a longer wavelength, bypasses obstacles easily, and attenuates slowly. Moreover, when propagating in air, high-frequency signals are absorbed more strongly by obstacles such as walls and therefore attenuate more readily, while low-frequency signals are absorbed less. Given these differences in reverberation time and degree between frequencies, this embodiment checks whether the ratio of each microphone channel's average high-frequency speech energy to that of the current microphone channel is greater than or equal to the second threshold, and outputs the channel with the higher average high-frequency energy, avoiding the highly reverberant, poor-sounding output channel that can result from selecting solely by speech energy.
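A minimal sketch of the HFEC selection in steps 503-505; the threshold of 2.0 is an assumed point inside the [1, 6] range mentioned below, and the function name is illustrative:

```python
import numpy as np

def select_min_reverb_channel(hf_avgs, current_idx, second_threshold=2.0):
    """HFEC channel selection: switch to the channel whose averaged
    high-frequency energy, relative to the current channel's, reaches
    the second threshold; otherwise keep the current channel."""
    hf = np.asarray(hf_avgs, dtype=np.float64)
    ratios = hf / (hf[current_idx] + 1e-12)
    best = int(np.argmax(ratios))
    return best if ratios[best] >= second_threshold else current_idx
```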
In a specific preferred embodiment, the hamming window coefficient a in step 501 has a value range of [0.1,1.0], and in a specific embodiment, 0.46 may be selected.
In a particularly preferred embodiment, the frequency range of the high-frequency speech energy in step 503 is [4 kHz, 8 kHz], and the average is obtained from the high-frequency energy values of the current frame and several historical frames; in a particularly preferred embodiment, the average is computed over the previous 5 to 20 frames.
In a particularly preferred embodiment, the second threshold in step 504 may be set to [1, 6].
In addition, the application also provides a scene recognition system for a conference with multiple microphones. As shown in fig. 6, it includes a speech detection unit 601, an energy calculation unit 602, a scene recognition unit 603, and a minimum-reverberation selection unit 604. When the speech detection unit 601 detects the speech signals of the microphone input channels, they are stored by frame; the energy calculation unit 602 works with the scene recognition unit 603 to recognize a multi-person simultaneous speaking scene or a single-person speaking scene. When a multi-person scene is recognized, the speech signals of the two microphone channels with the largest logarithmic energy are mixed and output; when the Euclidean distance between those two channels is smaller than the first threshold, a single-person scene is recognized, and the minimum-reverberation selection unit 604 outputs the microphone channel with less reverberation.
In a specific embodiment, the voice detection unit 601 is configured to store the microphone voice signals in frame alignment in response to detecting voice signals from a plurality of microphone channels.
The energy calculation unit 602 is configured to calculate the per-frame voice energy of each microphone based on the aligned voice signals of the plurality of microphone channels.
The scene recognition unit 603 is configured to: preprocess the voice signals in response to detecting voice signals of the plurality of microphone channels, where the preprocessing includes framing, pre-emphasis and windowing; perform an FFT on the preprocessed voice signals of the plurality of microphone channels to obtain the corresponding spectra, calculate the band energy through a Mel filter bank and map it to the Mel frequency scale; calculate the logarithmic energy of the voice signals output by the Mel filter, obtain the MFCC coefficients through a DCT, and perform a differential operation on the MFCC coefficients so that the Euclidean distances between the microphones' MFCC coefficients can be calculated; and determine a scene in which multiple persons speak simultaneously when the Euclidean distance between the two microphone channels with the largest logarithmic energy is greater than or equal to the first threshold, and a single-person speaking scene otherwise.
The selective reverberation minimization processing unit 604 is configured to: preprocess the voice signals in response to detecting voice signals of the plurality of microphone channels in a single-person speaking scene, where the preprocessing includes framing and windowing; perform an FFT on the preprocessed voice signals to obtain the corresponding spectra and calculate the high-frequency voice energy average of the current frame of each microphone channel; and calculate the ratio of each microphone channel's high-frequency voice energy average to that of the currently selected microphone output channel, selecting the channel whose ratio is greater than or equal to the second threshold as the new microphone output channel.
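As a compact sketch of the scene-recognition decision described above, the snippet below leans on librosa, whose MFCC routine folds in the framing, pre-emphasis/windowing, FFT, Mel filter bank, log-energy and DCT steps, and then compares the two loudest channels in MFCC space. The library choice, the per-channel feature averaging and the example first-threshold value are assumptions, not the patented implementation.

```python
import numpy as np
import librosa  # assumed for convenience; the embodiment computes Mel energies, log and DCT directly

def mfcc_vector(signal, sr=16000, n_mfcc=13):
    """Summarize one channel as its mean MFCCs plus their differential (delta)."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)  # the differential operation on the MFCC coefficients
    return np.concatenate([mfcc.mean(axis=1), delta.mean(axis=1)])

def classify_scene(channels, log_energies, first_threshold=50.0, sr=16000):
    """Return 'multi' when the two channels with the largest logarithmic energy
    are far apart in MFCC space (distinct speakers), else 'single'."""
    i, j = np.argsort(log_energies)[-2:]
    distance = np.linalg.norm(mfcc_vector(channels[i], sr) - mfcc_vector(channels[j], sr))
    return "multi" if distance >= first_threshold else "single"
```

The intuition matches the embodiment: two different talkers picked up by different microphones produce clearly different spectral envelopes, while one talker leaking into several microphones produces nearly identical MFCCs, so the Euclidean distance separates the two cases.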
The method of the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 7, which is a schematic diagram of the sound pickup portion of a multi-microphone conference system according to an embodiment of the present invention, INT denotes an internal microphone and EXT denotes an extension microphone. In a real conference there may be several cascaded internal microphones, such as INT1, INT2 and INT3, and several cascaded extension microphones, such as EXT1 and EXT2. The built-in microphones are generally directional, while an extension microphone may be directional or omnidirectional. In this embodiment, the 3 internal microphones are all directional and the 2 extension microphones are all omnidirectional.
In some particularly preferred embodiments, the single-person speaking scene mode may include the following:
When a single person starts speaking toward one microphone, the voice signal intensity of that microphone rises from silence to a level greater than or equal to the first threshold, and the microphone channel with the maximum voice energy is rapidly selected for output by combining voice detection with short-time energy tracking.
When a single person starts speaking toward several microphones, the voice signal intensity of each microphone rises from silence to a level greater than or equal to the first threshold, and the Euclidean distance between the two microphone channels with the largest logarithmic energy is smaller than the first threshold. The system then switches quickly to the microphone channel whose ratio of high-frequency voice energy average to that of the currently selected output channel is greater than or equal to the second threshold, i.e. the output channel with large voice energy and small reverberation energy.
When a single person is between two microphones, the voice signal intensities of both microphones are greater than or equal to the first threshold, the difference between the logarithmic energies of the two channels is not greater than the second threshold, and the Euclidean distance between the two channels is smaller than the first threshold. The system switches quickly to the microphone channel whose ratio of high-frequency voice energy average to that of the currently selected output channel is greater than or equal to the second threshold, i.e. the output channel with comparable voice energy but smaller reverberation energy.
When a single person walks from one microphone to another, the voice energy picked up by the microphone the speaker is leaving gradually decreases while the voice energy picked up by the microphone the speaker is approaching gradually increases, and the Euclidean distance between the two channels remains smaller than the first threshold. When the voice energy difference between the two microphones becomes greater than or equal to the second threshold, a voice smoothing technique (see the crossfade sketch after this list) is used to switch to the microphone channel whose ratio of high-frequency voice energy average to that of the currently selected output channel is greater than or equal to the second threshold, i.e. the output channel with large voice energy and small reverberation energy.
When two persons speak alternately toward two microphones, one microphone picks up more voice energy for a period of time and the other picks up more during the next period, while the Euclidean distance between the two channels remains smaller than the first threshold. When the voice energy difference between the two microphones is greater than or equal to the second threshold, the same voice smoothing technique is used to switch to the microphone channel whose ratio of high-frequency voice energy average to that of the currently selected output channel is greater than or equal to the second threshold, i.e. the output channel with large voice energy and smaller reverberation energy.
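The voice smoothing technique invoked in the last two cases can be pictured as a one-frame sinusoidal crossfade between the outgoing and incoming channels. Below is a minimal sketch assuming a quarter-period cosine ramp for the sinusoidal curve; the exact curve shape is an assumption, but the two weights always sum to 1, as in claim 8.

```python
import numpy as np

def crossfade_switch(x1, x2):
    """Switch smoothly from the current channel x1 to the new channel x2 over
    one frame: smooth1 falls from 1 to 0 and smooth2 = 1 - smooth1 rises,
    so no sample-level discontinuity is audible at the switch point."""
    L = len(x1)
    i = np.arange(L)
    smooth1 = np.cos(np.pi * i / (2 * (L - 1)))  # assumed quarter-period cosine ramp
    smooth2 = 1.0 - smooth1
    return smooth1 * x1 + smooth2 * x2
```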
In some particularly preferred embodiments, the scene mode of multiple simultaneous speaking may include the following:
When two or more persons speak toward two or more microphones simultaneously, those microphone input channels acquire voice signals greater than or equal to the first threshold, the differences between the voice energies of the channels are smaller than the second threshold, and the Euclidean distance between the two channels with the largest logarithmic energy is greater than or equal to the first threshold. The voice signals of the two microphone channels with the largest logarithmic energy are then mixed and output.
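A minimal sketch of that mixing step follows; the equal 0.5 gains, used here to keep the mixed output from clipping, are an assumption rather than a specified part of the embodiment.

```python
import numpy as np

def mix_two_loudest(channels, log_energies):
    """Mix the two microphone channels with the largest logarithmic energy."""
    i, j = np.argsort(log_energies)[-2:]
    return 0.5 * channels[i] + 0.5 * channels[j]
```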
In some particularly preferred embodiments, the scene modes other than the single-person speaking and multi-person speaking scenes may include the following:
When the voice detection unit detects that every microphone channel is silent (someone may be speaking at the far end, but no one is speaking locally, and the VAD measure is below the first threshold), the current microphone voice output channel is maintained.
When one of the microphones is close to a noise source, the noise energies of the microphone voice channels are compared during silence: the microphone with the larger energy is closer to the noise source, the one with smaller energy is farther away, and the noise energies of both microphones are greater than or equal to the first threshold. Voice detection and short-time energy tracking are used to select the microphone voice output channel with the lower noise energy.
When instantaneous noise or a short speech break occurs, that is, very short high-energy noise appears during silence, or short high-energy speech or noise appears at another microphone while someone is speaking, voice detection and short-time tracking (as sketched after this list) are used to keep the current microphone voice output channel.
When voice detection identifies a duplex scene, the second threshold is increased, and the current microphone voice output channel is kept or the system switches to the microphone channel with the best duplex performance.
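The hold-or-switch behavior in these remaining scenes can be sketched as short-time energy tracking with a hangover counter, so that instantaneous noise or a short speech break does not flip the output channel. The class below is a sketch under assumed frame counts; the embodiment does not spell out its tracking rule at this level of detail.

```python
import numpy as np

class ChannelTracker:
    """Keep the current output channel unless another channel stays dominant
    in short-time energy for several consecutive frames."""

    def __init__(self, hold_frames=15):
        self.hold_frames = hold_frames  # assumed hangover length, in frames
        self.current = 0
        self.counter = 0

    def update(self, frame_energies):
        candidate = int(np.argmax(frame_energies))
        if candidate == self.current:
            self.counter = 0              # dominant channel unchanged: reset the hangover
        else:
            self.counter += 1             # a different channel dominates this frame
            if self.counter >= self.hold_frames:
                self.current = candidate  # dominance persisted long enough: switch
                self.counter = 0
        return self.current
```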
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 may also be connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read out from it can be installed into the storage portion 808 as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium of the present application can be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic or optical signals, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, and the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a speech detection unit, an energy calculation unit, a scene recognition unit, and a selective reverberation minimization processing unit. Where the names of these modules do not in some cases constitute a limitation on the module itself, for example, a speech detection unit may also be described as a "unit that is responsive to detecting speech signals from multiple microphone channels and storing the microphone speech signals frame-aligned".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement a voice detection unit, an energy calculation unit, a scene recognition unit and a selective reverberation minimization processing unit. The voice detection unit stores the microphone voice signals in frame alignment in response to detecting voice signals from multiple microphone channels. The energy calculation unit calculates the per-frame voice energy of the plurality of microphones based on the aligned voice signals of the microphone channels. The scene recognition unit preprocesses the voice signals in response to detecting voice signals of the plurality of microphone channels, where the preprocessing includes framing, pre-emphasis and windowing; performs an FFT on the preprocessed voice signals to obtain the corresponding spectra, calculates the band energy through a Mel filter bank and maps it to the Mel frequency scale; calculates the logarithmic energy of the filter outputs, obtains the MFCC coefficients through a DCT, and performs a differential operation on the MFCC coefficients so as to calculate the Euclidean distances between the microphones' MFCC coefficients; and determines a multi-person speaking scene when the Euclidean distance between the two microphone channels with the largest logarithmic energy is greater than or equal to the first threshold, and a single-person speaking scene otherwise. The selective reverberation minimization processing unit preprocesses the voice signals in response to detecting voice signals of the plurality of microphone channels in a single-person speaking scene, where the preprocessing includes framing and windowing; performs an FFT on the preprocessed voice signals to obtain the corresponding spectra and calculates the high-frequency voice energy average of the current frame of each microphone channel; and calculates the ratio of each channel's high-frequency voice energy average to that of the currently selected microphone output channel, selecting the channel whose ratio is greater than or equal to the second threshold as the new microphone output channel.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method for scene recognition for a conference with multiple microphones, the method comprising:
s1: in response to detecting speech signals of multiple microphone channels, storing in frame alignment;
s2: calculating the voice energy of each frame of the voice signals of the plurality of microphone channels based on the aligned voice signals;
s3: based on the voice energy tracking, the Euclidean distance of two microphone channels with the largest logarithmic energy is calculated by adopting a Mel frequency cepstrum coefficient so as to identify a scene in which a single person speaks and a scene in which multiple persons speak simultaneously, and therefore switching of microphone output channels is carried out.
2. The method as claimed in claim 1, wherein the step S3 further comprises the steps of:
in response to detecting speech signals of multiple microphone channels, pre-processing the speech signals, wherein pre-processing includes framing, pre-emphasizing, and windowing the speech signals;
performing FFT (fast Fourier transform) on the preprocessed voice signals of the multiple microphone channels to obtain corresponding frequency spectrums, calculating band energy through a Mel filter, and transferring the band energy to Mel frequency;
calculating logarithmic energy of the voice signals of the plurality of microphones output by the Mel filter, obtaining MFCC coefficients through DCT transformation, and performing differential operation based on the MFCC coefficients to further calculate Euclidean distances of the plurality of microphones;
and recognizing a scene in which multiple persons speak simultaneously in response to the fact that the Euclidean distance between the two microphone channels with the maximum logarithmic energy is larger than or equal to a first threshold value, and outputting the voice signals of the two microphone channels with the maximum logarithmic energy after sound mixing.
3. The method as claimed in claim 2, wherein the case that the Euclidean distance is less than the first threshold is identified as a single-person speaking scene, and the following steps are performed:
calculating the high-frequency voice energy average value of the current frame of each microphone channel;
and calculating the ratio of the high-frequency voice energy average value of each microphone channel to the high-frequency voice energy average value of the currently selected microphone output channel, and selecting the microphone channel with the ratio being greater than or equal to a second threshold value as a new microphone output channel.
4. The method as claimed in claim 1, wherein the speech energy in step S2 is a root mean square value of each frame of the aligned microphone speech signals, and the root mean square value is calculated as follows:
RMS_i = sqrt((x_i1^2 + x_i2^2 + … + x_iL^2) / L)

wherein the voice data of the ith microphone is represented as x_i1, x_i2, …, x_iL, and L is the length of the voice frame.
5. The method as claimed in claim 3, wherein the windowing process is performed by multiplying each frame by a Hamming window, formulated as follows:

S′(n) = S(n) × W(n, a)

W(n, a) = (1 − a) − a × cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1

wherein S′(n) represents the windowed microphone speech signal, S(n) represents the microphone speech signal, W(n, a) represents the Hamming window, N is the frame length, and a is the Hamming window coefficient.
6. A method as claimed in claim 3, wherein the FFT is calculated as follows:

X(k) = Σ_{n=0}^{N1−1} x(n) × e^(−j2πkn/N1), k = 0, 1, …, N1 − 1

wherein X(k) represents the transformed spectrum, x(n) is the time-domain speech signal of the microphone, j is the imaginary unit, e^(−j2πkn/N1) is the complex exponential kernel expressing the angular frequency, and N1 represents the number of points of the Fourier transform.
7. The method as claimed in claim 3, wherein the frequency of the high-frequency speech energy is selected from the range of [4kHz,8kHz ], and the average value of the high-frequency speech energy is obtained by calculating the average value of the high-frequency speech energy values of the current frame and the historical frames.
8. The method as claimed in claim 1, wherein based on the scene recognition and the discontinuous tracking of the speech energy, the switching of the speech signal output channels of the microphones is implemented by using a smoothing process, the smoothing process uses a sinusoidal curve, and the following formula is specifically calculated:
smooth1(i) = cos(πi / (2(L − 1))), i = 0, 1, …, L − 1
smooth2(i) = 1 − smooth1(i), i = 0, 1, …, L − 1
wherein i represents the sample index within a frame of length L, x1 is the voice signal of the current microphone channel, and x2 is the voice signal of the channel with the maximum current voice energy; the smoothed microphone voice signal x is:
x(i)=smooth1(i)*x1(i)+smooth2(i)*x2(i),i=0,1,…,L-1。
9. a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
10. A scene recognition system for conferencing with multiple microphones, the system comprising:
a voice detection unit: configured for storing in frame alignment in response to detecting speech signals from a plurality of microphone channels;
an energy calculation unit: configured to calculate the voice energy of each frame of the plurality of microphones based on the aligned microphone channel voice signals;
a scene recognition unit: configured to preprocess the voice signals in response to detecting voice signals of a plurality of microphone channels, wherein the preprocessing includes framing, pre-emphasizing and windowing the voice signals; perform an FFT on the preprocessed voice signals of the multiple microphone channels to obtain corresponding frequency spectrums, calculate band energy through a Mel filter, and transfer the band energy to Mel frequency; calculate logarithmic energy of the voice signals of the plurality of microphones output by the Mel filter, obtain MFCC coefficients through DCT transformation, and perform a differential operation based on the MFCC coefficients to further calculate Euclidean distances of the plurality of microphones; and determine that the scene is a scene in which multiple persons speak simultaneously under the condition that the Euclidean distance between the two microphone channels with the largest logarithmic energy is larger than or equal to a first threshold value, otherwise determine that the scene is a scene in which a single person speaks;
a selective reverberation minimization processing unit: configured to preprocess the voice signals in response to detecting voice signals of a plurality of microphone channels in a single-person speaking scene, wherein the preprocessing includes framing and windowing the voice signals; perform an FFT on the preprocessed voice signals of the plurality of microphone channels to acquire corresponding frequency spectrums, and calculate the high-frequency voice energy average value of the current frame of each microphone channel; and calculate the ratio of the high-frequency voice energy average value of each microphone channel to the high-frequency voice energy average value of the currently selected microphone output channel, and select the microphone channel with the ratio being greater than or equal to a second threshold value as a new microphone output channel.
CN201910893667.1A 2019-09-20 2019-09-20 Scene identification method and system for conference with multiple microphones Active CN110648678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910893667.1A CN110648678B (en) 2019-09-20 2019-09-20 Scene identification method and system for conference with multiple microphones

Publications (2)

Publication Number Publication Date
CN110648678A CN110648678A (en) 2020-01-03
CN110648678B (en) 2022-04-22

Family

ID=69010890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910893667.1A Active CN110648678B (en) 2019-09-20 2019-09-20 Scene identification method and system for conference with multiple microphones

Country Status (1)

Country Link
CN (1) CN110648678B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant