CN117579770A - Method, device, electronic equipment and medium for determining main speaker in conference - Google Patents

Method, device, electronic equipment and medium for determining main speaker in conference

Info

Publication number
CN117579770A
Authority
CN
China
Prior art keywords
voice signal
signal
paragraphs
voice
sound pressure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311361650.4A
Other languages
Chinese (zh)
Inventor
程建喜 (Cheng Jianxi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haoxin Cloud Beijing Network Communication Co ltd
Original Assignee
Haoxin Cloud Beijing Network Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haoxin Cloud Beijing Network Communication Co ltd filed Critical Haoxin Cloud Beijing Network Communication Co ltd
Priority to CN202311361650.4A priority Critical patent/CN117579770A/en
Publication of CN117579770A publication Critical patent/CN117579770A/en
Pending legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use, for comparison or discrimination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H04L65/403 Arrangements for multi-party communication, e.g. for conferences

Abstract

The embodiment of the application provides a method, an apparatus, an electronic device and a medium for determining a main speaker in a conference. The method includes: performing voice activity detection on multiple voice signal segments and determining at least two voice signal segments in which voice activity exists, where the multiple voice signal segments correspond to multiple users participating in the conference and each user corresponds to one voice signal segment; performing sound pressure level calculation on the at least two voice signal segments to obtain sound pressure level results for the at least two voice signal segments; and determining the user corresponding to one of the at least two voice signal segments as the main speaker according to the sound pressure level results. The technical solution of the embodiments of the application can improve the accuracy of determining the main speaker.

Description

Method, device, electronic equipment and medium for determining main speaker in conference
Technical Field
The present application relates to the field of communications, and in particular, to a method, apparatus, electronic device, and medium for determining a primary speaker in a conference.
Background
With the development of communication technology, video conferencing plays an increasingly important role in our daily work and life.
Accurately determining the main speaker in a conference plays a very important role in the participation experience of the participants. However, there may be multiple people speaking at once, or noise unrelated to speech, such as keyboard strokes, coughing, whispering, or road noise, all of which pose a significant challenge to an accurate determination.
Therefore, how to improve the accuracy of determining the main speaker is a problem to be solved.
Disclosure of Invention
An object of an embodiment of the present application is to provide a method, an apparatus, an electronic device, and a medium for determining a main speaker in a conference, where accuracy of determining the main speaker may be improved through a technical solution of the embodiment of the present application.
In a first aspect, an embodiment of the present application provides a method for determining a main speaker in a conference, including: performing voice activity detection on multiple voice signal segments and determining at least two voice signal segments in which voice activity exists, where the multiple voice signal segments correspond to multiple users participating in the conference and each user corresponds to one voice signal segment; performing sound pressure level calculation on the at least two voice signal segments to obtain sound pressure level results for the at least two voice signal segments; and determining the user corresponding to one of the at least two voice signal segments as the main speaker according to the sound pressure level results.
According to the method and the apparatus, voice activity detection is performed first, sound pressure level detection is then performed only on signals with voice activity, and the main speaker is finally determined according to the sound pressure level results; detecting both voice activity and sound level improves the accuracy with which the main speaker is determined.
In one embodiment, the plurality of users includes a plurality of focus video members; alternatively, the plurality of users includes all members participating in the conference.
In one embodiment, performing voice activity detection on the multiple voice signal segments and determining at least two voice signal segments in which voice activity exists includes: detecting the multiple voice signal segments according to at least one of the following parameters and determining the at least two voice signal segments in which voice activity exists: the energy of the sound signal, the zero-crossing rate of the sound signal, the amplitude difference of the sound signal, the signal-to-noise ratio of the sound signal, the frequency of the sound signal, and the time-domain characteristics of the sound signal.
In one embodiment, performing sound pressure level calculation on the at least two voice signal segments to obtain the sound pressure level results of the at least two voice signal segments includes: calculating the sound pressure level of each of the at least two voice signal segments according to the following formula:

SPL = 20 · log10(P_rms / P_0)

where SPL represents the sound pressure level, P_0 represents the sound pressure of the hearing threshold, and P_rms is the root mean square (RMS) value of the sound sample amplitudes.
In one embodiment, determining the user corresponding to one of the at least two voice signal segments as the main speaker according to the sound pressure level results includes: performing speech rate detection on the voice signal segments ranked highest by sound pressure level to obtain a speech rate detection result; and, according to the speech rate detection result, determining the user corresponding to one of the voice signal segments ranked highest by sound pressure level as the main speaker.
In one embodiment, the method is performed by a conference terminal, and the method further comprises: highlighting the picture of the main speaker.
In one embodiment, the method is performed by a media server, the method further comprising: and sending an indication message to the conference terminal, wherein the indication message is used for indicating the main speaker so that the conference terminal can highlight the picture of the main speaker.
In a second aspect, an embodiment of the present application provides an apparatus for determining a main speaker in a conference, comprising: a detection unit, configured to perform voice activity detection on multiple voice signal segments and determine at least two voice signal segments in which voice activity exists, where the multiple voice signal segments correspond to multiple users participating in the conference and each user corresponds to one voice signal segment; a calculation unit, configured to perform sound pressure level calculation on the at least two voice signal segments to obtain sound pressure level results for the at least two voice signal segments; and a determination unit, configured to determine, according to the sound pressure level results, the user corresponding to one of the at least two voice signal segments as the main speaker.
In one embodiment, the plurality of users includes a plurality of focus video members; alternatively, the plurality of users includes all members participating in the conference.
In one embodiment, the detection unit is specifically configured to: detect the multiple voice signal segments according to at least one of the following parameters and determine the at least two voice signal segments in which voice activity exists: the energy of the sound signal, the zero-crossing rate of the sound signal, the amplitude difference of the sound signal, the signal-to-noise ratio of the sound signal, the frequency of the sound signal, and the time-domain characteristics of the sound signal.
In one embodiment, the calculation unit is specifically configured to: calculate the sound pressure level of each of the at least two voice signal segments according to the following formula:

SPL = 20 · log10(P_rms / P_0)

where SPL represents the sound pressure level, P_0 represents the sound pressure of the hearing threshold, and P_rms is the root mean square (RMS) value of the sound sample amplitudes.
In one embodiment, the determination unit is specifically configured to: perform speech rate detection on the voice signal segments ranked highest by sound pressure level to obtain a speech rate detection result; and, according to the speech rate detection result, determine the user corresponding to one of the voice signal segments ranked highest by sound pressure level as the main speaker.
In one embodiment, the apparatus is a conference terminal, the apparatus further comprising: and the display unit is used for highlighting the picture of the main speaker.
In one embodiment, the apparatus is a media server, the apparatus further comprising: and the sending unit is used for sending an indication message to the conference terminal, wherein the indication message is used for indicating the main speaker so that the conference terminal can highlight the picture of the main speaker.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method according to the first aspect or any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor; when executing the program, the processor can implement the method according to the first aspect or any implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program; when executed by a processor, the computer program can implement the method according to the first aspect or any implementation of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of the scope; a person of ordinary skill in the art may obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic view of a conference system scenario provided in an embodiment of the present application;
FIG. 2 is a flow chart of a method for determining a primary speaker in a conference provided in one embodiment of the present application;
fig. 3 is a flowchart of a method for determining a primary speaker in a conference according to another embodiment of the present application;
fig. 4 is a schematic diagram of a method and apparatus for determining a main speaker in a conference according to an embodiment of the present application;
fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In a video conference, accurately determining the main speaker is very important for improving the participation experience of the participants. In the prior art, the calculation of the main speaker is often inaccurate, which may lead to the following problems. First, low conference efficiency: if the current main speaker is calculated inaccurately, the focused video in the conference may be set incorrectly, affecting conference efficiency and the participant experience. Second, misunderstanding and misleading: an inaccurate determination of the main speaker can cause misunderstanding and misleading, affecting the decisions and outcome of the conference. Third, unfairness: an inaccurate determination may cause some participants to be ignored or treated unfairly, affecting the fairness and rationality of the conference.
In an actual video conference, the current calculation of the main speaker faces the following difficulties, which result in low accuracy: the influence of environmental noise, where background noise and differences in microphone pickup quality can interfere with voice activity detection and cause judgment errors; and the handling of simultaneous speech, where, when several people speak at the same time, neither the conference client nor the media server can accurately judge the current main speaker.
Therefore, how to improve the accuracy of determining the main speaker is a problem to be solved.
In view of the above problems, the embodiment of the application provides a method for determining a main speaker in a conference, which can improve the accuracy of determining the main speaker and enhance the participation experience of participants.
The method for determining a main speaker in a conference according to the embodiment of the present application is described below with reference to specific embodiments. Before describing a specific method, a system scenario in which the embodiments of the present application may be applied is described below with reference to fig. 1, and then a specific method for determining a main speaker in a conference is described with reference to fig. 2.
A schematic diagram of a conference system provided in an embodiment of the present application is exemplarily described below with reference to fig. 1.
The conference system as shown in fig. 1 includes: a plurality of conference terminals and a media server.
In the embodiment of the application, the media server can be used for forwarding the data stream of the conference terminal, so that the real-time online video conference of a plurality of conference terminals is realized.
It should be understood that, in the embodiments of the present application, the conference terminal may also be referred to as a client, a user side, user equipment, or a terminal device. In the examples of the present application, the conference terminal may have a browser installed and conduct real-time conference communication through the browser, or have an APP or applet installed and conduct real-time conference communication through the APP or applet. The conference terminal in the present application may include a smart phone, a tablet computer (portable Android device, PAD), a personal digital assistant (PDA), a computer, a game console, a wearable device, and the like, and the embodiments of the present application are not limited thereto.
It should be understood that the operating system running on the conference terminal may be a Linux-kernel-based operating system such as mobile Android, mobile Ubuntu, or Tizen, or a desktop operating system such as Windows, macOS, or Linux, but the present invention is not limited thereto.
It should be understood that in the conference system scenario shown in fig. 1, the number of conference terminals may be any number according to the actual situation, for example 10, 50, 500, or 5000, depending on the actual number of participants.
It should also be understood that the conference system in the embodiment of the present application only illustrates the case of including the conference terminal and the media server, but the embodiment of the present application is not limited thereto, and the conference system may also include a signaling server and other devices. In the embodiment of the present application, the signaling server may also be referred to as a control end or a management end, and may be used to perform signaling interaction with the conference terminals to transfer necessary information required for establishing a connection between the conference terminals.
The system shown in fig. 1 may be a system based on Web Real-Time Communications (WebRTC), and a WebRTC system may use one of the following three architectures. In a Mesh architecture, the conference terminals are connected pairwise to form a mesh structure, and every conference terminal can communicate with every other. In a Multipoint Conferencing Unit (MCU) architecture, the system consists of a media server and multiple conference clients: each conference terminal sends its own audio and video to the media server, which mixes and encodes the streams of all conference terminals in the same room and sends the mixed audio and video to each conference terminal. A Selective Forwarding Unit (SFU) architecture also comprises a media server and multiple terminals, but unlike the MCU, the SFU does not mix audio and video: after receiving the audio/video stream shared by a conference terminal, it forwards the stream directly to the other conference terminals in the room. Since the system shown in fig. 1 includes a media server, it can be an MCU-architecture or SFU-architecture system. It should be understood that fig. 1 is merely exemplary and the embodiments of the present application are not limited thereto; fig. 1 merely illustrates an architecture with a media server. The methods of the embodiments of the present application described below may also be applicable to cases without a media server, such as a mesh architecture.
A method for determining a primary speaker in a conference provided in one embodiment of the present application is described below by way of example with reference to fig. 2.
It should be understood that in the embodiments of the present application, the main speaker may also be referred to as the presenter, the main focus member, or the like; the main speaker can be understood as the person who is currently doing the main speaking or whose speech currently matters most.
The method shown in fig. 2 may be applied to the real-time communication system shown in fig. 1. It should be understood that the execution subject of the method shown in fig. 2 may be a conference terminal (such as a conference terminal shown in fig. 1 or a conference terminal under a mesh architecture), a media server (such as the media server shown in fig. 1, which may be an MCU or a media server under an SFU architecture), or another device not shown in fig. 1, for example a conference recording device or an intermediate service device that forwards conference audio and video to other devices; the embodiments of the present application are not limited thereto. Hereinafter, unless otherwise specified, any of the above apparatuses may be the execution subject of the method shown in fig. 2. Specifically, the method shown in fig. 2 includes:
210, performing voice activity detection on multiple voice signal segments and determining at least two voice signal segments in which voice activity exists.
The multiple voice signal segments correspond to multiple users participating in the conference, and each user corresponds to one voice signal segment.
Specifically, in the embodiment of the present application, each conference terminal participating in the conference may correspond to one voice signal path. A voice signal segment may be the voice signal over a short historical window ending at the current time, for example the 10 ms or 20 ms immediately preceding it. Because the voice signal segment is short, the main speaker can be determined quickly; from the user's point of view, the determination feels essentially real-time. The embodiment of the application can therefore be considered capable of determining the main speaker in real time during the conference.
It should be understood that each voice signal in the embodiments of the present application may be collected separately by the corresponding conference terminal.
Alternatively, the multiple voice signal segments in the embodiments of the present application may be the voice signal segments of multiple focus video members.
Specifically, the execution subject of the method in the embodiment of the present application may first determine whether any participant is a focus video member. If one focus video member is set, that member is taken as the current main speaker (presenter) by default; if multiple focus video members are set, only the voice signals of the focus video members may be extracted before performing step 210, in which case the multiple voice signals in 210 are the voice signals of the multiple focus video members; if no focus video member is set, the multiple voice signals in 210 are the voice signals of all participating members. Correspondingly, as another embodiment, the plurality of users includes a plurality of focus video members; alternatively, the plurality of users includes all members participating in the conference.
It should be appreciated that in a video conference, the conference host may set a participant as the focus; all members in the conference will then focus on that member's picture, key members do not drop out of focus, and information is conveyed more efficiently.
Specifically, there are two cases in which the moderator sets focus members. First, single-focus video: after the focus video is set successfully, the member becomes the current main speaker and the conference interfaces of all members focus on that member's picture. Second, multi-focus video: during a conference, the presenter can set multiple participants as focus video at the same time. At most as many focus videos as the displayed page layout allows can be set at once, for example 1 to 9; after several focus videos are set, the focus picture on the conference interface of every member in the conference switches among those members. That is, in the second case, the embodiment of the present application dynamically selects one member from the plurality of focus members as the main speaker in real time.
Optionally, as another embodiment, performing voice activity detection on the multiple voice signal segments and determining at least two voice signal segments in which voice activity exists includes:
detecting the multiple voice signal segments according to at least one of the following parameters and determining the at least two voice signal segments in which voice activity exists: the energy of the sound signal, the zero-crossing rate of the sound signal, the amplitude difference of the sound signal, the signal-to-noise ratio of the sound signal, the frequency of the sound signal, and the time-domain characteristics of the sound signal.
Specifically, in the embodiment of the present application, voice activity detection (VAD) is performed first on the multiple voice signal segments to distinguish the voice activity sections from the non-voice-activity sections in each voice signal segment. If a voice signal segment contains a voice activity section, a person is currently speaking; if it contains no voice activity section, for example in a noise-only or unvoiced condition, the voice signal may be identified as silent.
Through voice activity detection, the method and apparatus of the embodiments of the present application can maintain a certain robustness even in non-ideal environments with background noise, improving the accuracy with which the main speaker is determined.
The energy of the sound signal may include instantaneous energy and short-time energy. For instantaneous energy, an energy threshold may be set to determine whether a sound signal is present: when the energy of the sound signal exceeds the threshold, sound is considered present. For short-time energy, a short-time energy threshold may be set, and whether a sound signal is present is judged by calculating the energy of the sound signal over a period of time: when the energy over that period exceeds the short-time energy threshold, sound is considered present. For example, the duration of a voice signal segment may be 10 ms or 20 ms, and the short-time energy threshold in voice activity detection may be set to -50 dB to -60 dB.
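As an illustration, the following is a minimal sketch of the short-time energy test. It assumes 16 kHz mono samples normalized to [-1, 1]; the 20 ms frame length and the -50 dB threshold follow the example values above, while the function and parameter names are our own.

```python
import numpy as np

def energy_vad(samples: np.ndarray, sample_rate: int = 16000,
               frame_ms: int = 20, threshold_db: float = -50.0) -> np.ndarray:
    """Return one boolean per frame: True where sound is considered present."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Mean-square energy per frame, floored to avoid log(0).
    energy = np.maximum(np.mean(frames ** 2, axis=1), 1e-12)
    energy_db = 10.0 * np.log10(energy)  # dB relative to full scale
    return energy_db > threshold_db
```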
The amplitude difference of the sound signal: whether a sound signal exists may be determined by calculating the difference in amplitude of the sound signal over a period of time (which may be equal to or shorter than the duration of the voice signal segment). When the amplitude difference over this period exceeds a certain threshold, such as a short-time average amplitude difference (SAC) threshold, sound is considered present.
The zero-crossing rate of a sound signal refers to the number of times the signal crosses the zero point of the time axis. In voice activity detection, the zero-crossing rate may be used to distinguish silent sections from non-silent sections of the voice signal. The zero-crossing rate of a speech signal is typically high, so a zero-crossing rate threshold may be set to detect voice activity frames. The implementation principle is to divide the voice signal into short periods (which may be equal to or shorter than the duration of the voice signal segment) and count the number of zero crossings in each period to obtain the zero-crossing rate. If the zero-crossing rate is above a predetermined threshold, speech is considered present in that period. For example, with a voice signal segment of 10 ms or 20 ms, the zero-crossing threshold may be set to more than 15 crossings per 10-20 ms.
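A minimal sketch of this zero-crossing test follows, with the same framing assumptions as the energy sketch; the ">15 crossings per 10-20 ms" figure comes from the text, the rest is illustrative.

```python
import numpy as np

def zcr_vad(samples: np.ndarray, sample_rate: int = 16000,
            frame_ms: int = 20, min_crossings: int = 15) -> np.ndarray:
    """Return one boolean per frame: True where the zero-crossing count
    exceeds the threshold suggested above."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    # A zero crossing is a sign change between consecutive samples.
    crossings = np.sum(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return crossings > min_crossings
```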
For signal-to-noise-ratio detection of the sound signal, a method of calculating the short-time energy of the sound signal and comparing it with the background noise may be adopted. The speech signal is divided into short frames, typically 10-40 ms long. For each frame signal s(n), its short-time energy is calculated:

E = ∑ |s(n)|^2

where the summation runs over all sampling points of the frame. A background noise segment is extracted before speech starts, and the average noise energy E_noise is calculated. A threshold on the energy ratio is set, for example 1.5. For each frame of the speech signal, the short-time energy E is calculated and the following condition is checked:

E > 1.5 · E_noise

If the condition is met (the short-time energy is significantly higher than the background noise energy), the frame is judged to be a voice activity frame. Finally, smoothing may be applied to eliminate isolated misjudged frames.
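The energy-ratio test and the final smoothing step can be sketched as follows; the median-filter smoothing and the five-frame window are illustrative assumptions, while the 1.5 energy ratio comes from the text above.

```python
import numpy as np

def snr_vad(frame_energy: np.ndarray, noise_energy: float,
            ratio: float = 1.5, smooth: int = 5) -> np.ndarray:
    """frame_energy: per-frame short-time energy E = sum(|s(n)|^2);
    noise_energy: average energy E_noise of a pre-speech noise segment."""
    active = frame_energy > ratio * noise_energy
    # Median smoothing over `smooth` frames removes isolated misjudged frames.
    pad = smooth // 2
    padded = np.pad(active.astype(int), pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, smooth)
    return np.median(windows, axis=1) > 0.5
```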
To detect voice activity based on the frequency of the sound signal, spectral analysis techniques such as the short-time Fourier transform (STFT) or power spectral density analysis may be used. First, the voice signal undergoes the necessary preprocessing, including denoising, downsampling, smoothing, or equalization; this helps reduce the effects of background noise and improves the effectiveness of the spectral analysis. The speech signal is divided into small time windows, typically 10 ms-30 ms each; the windows may overlap to ensure that instants of voice activity are not missed. For each time window, the spectrum is calculated using the STFT or power spectral density analysis, producing a two-dimensional frequency-versus-time image that shows the energy distribution of the signal across frequencies. Features related to voice activity are then extracted from the spectrum. Human speech generally has significant spectral features, such as harmonic components and morphological features, and various feature extraction methods may be used, such as mel-frequency cepstral coefficients (MFCCs) or fundamental frequency estimation. Based on the energy or other statistics of the features, an appropriate threshold is set to decide when voice activity is present; selecting the threshold may require some experimentation and adjustment to accommodate different environments and speech signal characteristics. The threshold is applied to each time window, and if a window is judged voice-active, the corresponding time period is marked as a voice activity region.
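As a simple frequency-domain sketch, the fragment below flags frames whose spectral energy concentrates in a nominal speech band; the 80-4000 Hz band and the 0.6 concentration ratio are illustrative assumptions standing in for the richer features (MFCCs, fundamental frequency) mentioned above.

```python
import numpy as np

def spectral_vad(samples: np.ndarray, sample_rate: int = 16000,
                 frame_ms: int = 20, band=(80.0, 4000.0),
                 ratio_threshold: float = 0.6) -> np.ndarray:
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)
    # Power spectrum of each windowed frame (a one-frame STFT).
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, 1.0 / sample_rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    total = np.maximum(spectrum.sum(axis=1), 1e-12)
    # Flag frames whose energy is concentrated in the speech band.
    return spectrum[:, in_band].sum(axis=1) / total > ratio_threshold
```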
It should be understood that in the embodiments of the present application, whether a voice signal segment contains voice activity may be determined by one of the parameters alone, or jointly by a weighted sum of multiple parameters; when multiple parameters are used, the weight of each parameter may be adjusted flexibly according to actual needs, and the embodiments of the present application are not limited in this respect. Priorities may also be assigned to the parameters, and voice activity judged in order from high to low priority: once a higher-priority parameter determines that voice activity exists, the lower-priority parameters are not consulted; only when the higher-priority parameters cannot decide is a lower-priority parameter used. The specific parameter priorities may be adjusted according to the actual situation, and the embodiments of the present application are not limited in this respect.
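Both combination strategies can be sketched briefly. The weights, the 0.5 vote threshold, and the tri-state vote encoding below are illustrative assumptions, not values from the text.

```python
import numpy as np

def weighted_vad(flags: list, weights: list, threshold: float = 0.5) -> np.ndarray:
    """flags: per-detector boolean arrays (one entry per frame);
    weights: one weight per detector. Frames whose weighted vote share
    exceeds the threshold are marked voice-active."""
    score = sum(w * f.astype(float) for w, f in zip(weights, flags))
    return score / sum(weights) > threshold

def priority_vad(votes: list) -> np.ndarray:
    """votes: per-detector tri-state arrays ordered from high to low
    priority, where +1 = speech, -1 = no speech, 0 = cannot decide.
    The first detector that decides a frame settles it."""
    decision = np.zeros_like(votes[0])
    for v in votes:
        undecided = decision == 0
        decision[undecided] = v[undecided]
    return decision > 0
```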
220, performing sound pressure level calculation on the at least two voice signal segments to obtain the sound pressure level results of the at least two voice signal segments.
Specifically, for the voice signal segments in which voice activity exists, the embodiments of the present application may further determine the main speaker by sound pressure level.
It should be appreciated that the sound pressure level (SPL), which may also be referred to as the sound level, is measured in decibels (dB). When sound propagates, it disturbs the air particles and causes a change in pressure; this pressure change is the sound pressure. The greater the sound pressure, the higher the sound pressure level.
Optionally, as another embodiment, performing the sound pressure level calculation on the at least two voice signal segments to obtain the sound pressure level results of the at least two voice signal segments includes:
calculating the sound pressure level of each of the at least two voice signal segments according to the following formula:

SPL = 20 · log10(P_rms / P_0)

where SPL represents the sound pressure level, P_0 represents the sound pressure of the hearing threshold, and P_rms is the root mean square (RMS) value of the sound sample amplitudes. In air, P_0 is usually taken as P_0 = 2×10^-5 Pa.
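A minimal sketch of this formula follows. Digital samples are only proportional to physical sound pressure, so the mapping from full-scale amplitude to pascals (pa_per_fullscale) is a calibration assumption; in practice the relative SPL ranking across streams is what matters for comparing speakers.

```python
import numpy as np

P0 = 2e-5  # hearing-threshold sound pressure in air, in Pa

def sound_pressure_level(samples: np.ndarray,
                         pa_per_fullscale: float = 1.0) -> float:
    """SPL = 20 * log10(P_rms / P_0) over one voice signal segment."""
    p_rms = np.sqrt(np.mean((samples * pa_per_fullscale) ** 2))
    return 20.0 * np.log10(max(p_rms, 1e-12) / P0)
```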
230, determining the user corresponding to one of the at least two voice signal segments as the main speaker according to the sound pressure level results of the at least two voice signal segments.
Specifically, after calculating the sound pressure levels, the embodiments of the present application may adopt several strategies to determine the main speaker.
In the first case, as an embodiment, the main speaker is determined by sound level alone: for example, the user corresponding to the voice signal with the highest sound pressure level may be determined as the main speaker, or one of the several users with the highest sound pressure levels may be determined as the main speaker, for example by selecting one of them at random.
In the second case, as an embodiment, the main speaker is determined jointly by sound pressure level and speech rate. For example, determining the user corresponding to one of the at least two voice signal segments as the main speaker according to the sound pressure level results includes:
performing speech rate detection on the voice signal segments ranked highest by sound pressure level to obtain a speech rate detection result;
and, according to the speech rate detection result, determining the user corresponding to one of the voice signal segments ranked highest by sound pressure level as the main speaker.
For example, the several speakers ranked highest by sound pressure level are screened out first, and the main speaker is then selected among them according to speech rate. In this case, the person with the fastest speech rate among those ranked highest by sound pressure level may be determined as the main speaker, or the person with the highest weighted score of speech rate and sound pressure level may be determined as the main speaker.
That is, in the embodiment of the present application, besides loudness, speech rate is also an important factor in determining the current main speaker, and a participant who speaks faster will generally become the current main speaker more easily. For sounds of similar loudness (for example, differing by less than 5 dB), the speaking rate is calculated from the rhythm, speed, intonation, syllable length, pause duration and the like of the speech, the speakers may be re-ranked, and the main speaker can then be further screened out.
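The following is a minimal sketch of this second strategy; the 5 dB closeness margin comes from the text, while the top-3 cut and the equal weights are illustrative assumptions.

```python
def pick_main_speaker(spl: dict, speech_rate: dict, top_n: int = 3,
                      close_db: float = 5.0,
                      w_spl: float = 0.5, w_rate: float = 0.5) -> str:
    """spl and speech_rate map a user ID to that user's sound pressure
    level (dB) and speech rate; returns the user chosen as main speaker."""
    # Rank by SPL and keep the leading candidates.
    ranked = sorted(spl, key=spl.get, reverse=True)[:top_n]
    leader = ranked[0]
    close = [u for u in ranked if spl[leader] - spl[u] < close_db]
    if len(close) == 1:
        return leader
    # Levels are similar: break the tie with a weighted score.
    max_spl = max(spl[u] for u in close) or 1.0
    max_rate = max(speech_rate[u] for u in close) or 1.0
    return max(close, key=lambda u: w_spl * spl[u] / max_spl
                                    + w_rate * speech_rate[u] / max_rate)
```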
The indicators of speech rate detection in the embodiments of the present application may include the following aspects:
Speech rate: represents the speed of speech, typically measured in words per minute.
Intonation: represents the pitch of the utterance, typically measured by the fundamental frequency or pitch.
Syllable length: represents the duration of each syllable, typically measured in milliseconds.
Pause duration: represents the duration of pauses in the speech, typically measured in milliseconds.
Pronunciation accuracy: represents the accuracy of pronunciation while speaking, typically measured by the number of mispronunciations.
These indicators may be used individually or jointly to determine the speech rate, and the embodiments of the present application are not limited in this respect. It should be appreciated that the threshold for speech rate detection is case-specific, and different application scenarios and tasks may require different thresholds; in general, an appropriate threshold can be determined through experimentation and adjustment.
In the embodiments of the present application, the speech rate of a voice signal may be determined using a hidden Markov model (HMM), a Gaussian mixture model (GMM), or the like. The HMM is a common speech-signal-processing method that can be used for tasks such as speech rate detection and speech recognition. In speech rate detection, the input speech signal is typically divided into frames of about 10 ms each; the frequency- and time-domain features of each frame are then computed to obtain the corresponding feature vectors. These feature vectors are modeled with an HMM to obtain a speech rate model, and finally the input signal is decoded, i.e., its probability under the speech rate model is computed, to determine the speaker's speech rate. The GMM is likewise a common speech-signal-processing method usable for tasks such as speech recognition and speaker recognition; for speech rate detection it follows the same framing and feature extraction steps, models the feature vectors with a GMM to obtain a speech rate model, and determines the speaker's speech rate by decoding the input signal against that model.
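The HMM/GMM pipelines above require trained models, so as a much simpler self-contained stand-in, the sketch below estimates speech rate by counting peaks in the smoothed energy envelope, treating each peak as roughly one syllable. This peak-counting heuristic is our illustrative assumption, not the method of the embodiments.

```python
import numpy as np

def syllables_per_second(samples: np.ndarray, sample_rate: int = 16000,
                         frame_ms: int = 10) -> float:
    """Crude speech-rate estimate: energy-envelope peaks per second."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    envelope = np.sqrt(np.mean(frames ** 2, axis=1))  # per-frame RMS
    # Smooth the envelope with a short moving average.
    smooth = np.convolve(envelope, np.ones(5) / 5.0, mode="same")
    # Count local maxima rising above half the mean envelope level.
    floor = 0.5 * smooth.mean()
    peaks = ((smooth[1:-1] > smooth[:-2]) & (smooth[1:-1] > smooth[2:])
             & (smooth[1:-1] > floor))
    duration_s = n_frames * frame_ms / 1000.0
    return float(peaks.sum()) / max(duration_s, 1e-6)
```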
Optionally, as another embodiment, the method shown in fig. 2 is performed by a conference terminal, and the method may further include: highlighting the picture of the main speaker.
Specifically, when the method shown in fig. 2 is executed by a conference terminal, the terminal may be a terminal under any of the three WebRTC architectures. After the terminal device determines the main speaker, it can highlight the main speaker by overlaying a green frame or the like on the outer edge of that speaker's video picture, by enlarging the current main speaker's picture while showing the other participants as small video pictures, or by displaying only the main speaker's picture. It will be appreciated that when the method shown in fig. 2 is performed by terminal devices, each terminal participating in the conference needs to perform the method so that its user learns who the main speaker is; alternatively, one or several conference terminals execute the method and then synchronize the determined result to the other conference terminals.
Alternatively, as another embodiment, the method shown in fig. 2 is performed by a media server, and the method may further include: and sending an indication message to the conference terminal, wherein the indication message is used for indicating the main speaker so that the conference terminal can highlight the picture of the main speaker.
In this case, the media server may be a media server under the MCU architecture, and the terminal devices do not need to perform the method: the media server makes the determination and then synchronizes the result directly to the terminal devices. After determining the main speaker, the media server informs the conference terminals, either directly or through the signaling server.
According to the embodiments of the present application, voice activity detection is performed first, and only the signals with voice activity then undergo sound pressure level detection, or sound pressure level and speech rate detection; through this two-stage or three-stage detection, the accuracy of the determination of the main speaker can be improved and the confusion caused by several people speaking at once avoided.
Meanwhile, because the voice signal segments in the embodiments of the present application are relatively short, the main speaker is determined by processing relatively short-time signals, so changes of speaker can be recognized and responded to quickly, achieving real-time conference interaction. The multi-layer screening also reduces the amount of data to be processed, which lowers the hardware requirements on the device.
The method for determining a main speaker in a conference according to the embodiment of the present application is described in detail below with reference to the detailed example of fig. 3. Fig. 3 is a flowchart of a method for determining a main speaker in a conference according to an embodiment of the present application. The method of fig. 3 may be performed by the execution body described above, and as shown in fig. 3, includes:
301, determining whether there is a focal video member.
Specifically, it is determined whether there is a focus video member among the members of the participant, and in the case where it is determined that there is a focus video member, step 302 is performed; in the event that a determination is made that there is no focal video member, step 305 is performed.
302, a determination is made as to whether there are multiple focal video members.
In other words, when it is determined that there is a focus video member, the number of focus video members is determined.
Specifically, in the case where it is determined that there is only one focus video member, step 303 is performed; in the event that a plurality of focal video members are determined, step 304 is performed.
303, determining the focus video member as the main speaker.
That is, when it is determined that there is only one focus video member, that member is directly determined as the main speaker.
304, extracting the voice of all focus video members.
After step 304, step 306 is performed. Specifically, when focus video members exist, only their signal streams are extracted to determine the main speaker, which reduces the amount of computation.
305, extracting the voice of all members.
Specifically, when there are no focus video members, the signal streams of all members are analyzed.
306, voice activity detection.
Specifically, voice activity detection is performed on the extracted signal stream, and a data stream with voice activity is screened out.
In particular, the specific action of voice activity detection may be referred to the description in step 210 above, and will not be repeated here.
307, sound level detection.
Sound pressure level (sound level) detection is performed on the data streams with voice activity, obtaining the several data streams with the highest sound pressure levels.
Specific actions for specific sound level detection may be referred to above in step 220, and will not be described here.
308, speech rate detection.
Speech rate detection is performed on the data streams with the highest sound pressure levels. After the speech rate detection result is obtained, step 309 is performed.
Specific actions for specific speech rate detection may be referred to above in step 230, and will not be described here.
309, the main speaker is determined.
In other words, the current presenter is determined.
Specifically, the description of determining the main speaker according to the speech rate detection result may refer to the description of step 230 above, which is not repeated herein.
It should be understood that the method shown in fig. 3 corresponds to the method shown in fig. 2, and for the actions in fig. 3 that are the same as in the method of fig. 2, reference may be made to the description of fig. 2 above; to avoid repetition, the corresponding description is omitted here as appropriate. It should also be appreciated that the method shown in fig. 3 is merely exemplary. In practice, step 308 may be omitted, e.g., the user with the highest sound level after 307 is directly determined as the main speaker; or step 307 may be omitted and step 308 performed directly after 306, i.e., the main speaker is determined from the speech rate alone among the data streams with voice activity; or the order of steps 307 and 308 may be reversed, screening first by speech rate and then by sound pressure level to determine the main speaker.
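A minimal end-to-end sketch of the fig. 3 flow, reusing the helper functions sketched earlier (energy_vad, sound_pressure_level, syllables_per_second, pick_main_speaker); the per-user stream dictionary and user IDs are illustrative assumptions.

```python
import numpy as np

def determine_main_speaker(streams: dict, focus_members=None,
                           sample_rate: int = 16000):
    """streams maps a user ID to that user's recent audio samples."""
    # Steps 301-305: restrict to focus video members when they are set.
    if focus_members:
        if len(focus_members) == 1:
            return focus_members[0]  # step 303: sole focus member wins
        streams = {u: s for u, s in streams.items() if u in focus_members}
    # Step 306: keep only the streams in which voice activity exists.
    active = {u: s for u, s in streams.items()
              if energy_vad(s, sample_rate).any()}
    if not active:
        return None
    # Steps 307-308: score each active stream by SPL and speech rate.
    spl = {u: sound_pressure_level(s) for u, s in active.items()}
    rate = {u: syllables_per_second(s, sample_rate) for u, s in active.items()}
    # Step 309: pick the main speaker via the combined ranking.
    return pick_main_speaker(spl, rate)
```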
Referring to fig. 4, fig. 4 shows a block diagram of an apparatus for determining a main speaker in a conference according to an embodiment of the present application. The apparatus 400 shown in fig. 4 may be the execution body of the method of fig. 2 or fig. 3, and it should be understood that the apparatus 400 corresponds to the execution body in the method embodiment and is capable of executing the steps in the method embodiment, and specific functions of the apparatus 400 may be referred to the description above, and detailed descriptions are omitted herein as appropriate to avoid redundancy.
The apparatus 400 shown in fig. 4 includes at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in the apparatus. The apparatus 400 shown in fig. 4 includes: a detection unit 410, configured to perform voice activity detection on multiple voice signal segments and determine at least two voice signal segments in which voice activity exists, where the multiple voice signal segments correspond to multiple users participating in the conference and each user corresponds to one voice signal segment; a calculation unit 420, configured to perform sound pressure level calculation on the at least two voice signal segments to obtain sound pressure level results for the at least two voice signal segments; and a determination unit 430, configured to determine, according to the sound pressure level results, the user corresponding to one of the at least two voice signal segments as the main speaker.
In one embodiment, the plurality of users includes a plurality of focus video members; alternatively, the plurality of users includes all members participating in the conference.
In one embodiment, the detection unit is specifically configured to: detect the multiple voice signal segments according to at least one of the following parameters and determine the at least two voice signal segments in which voice activity exists: the energy of the sound signal, the zero-crossing rate of the sound signal, the amplitude difference of the sound signal, the signal-to-noise ratio of the sound signal, the frequency of the sound signal, and the time-domain characteristics of the sound signal.
In one embodiment, the calculation unit is specifically configured to: calculate the sound pressure level of each of the at least two voice signal segments according to the following formula:

SPL = 20 · log10(P_rms / P_0)

where SPL represents the sound pressure level, P_0 represents the sound pressure of the hearing threshold, and P_rms is the root mean square (RMS) value of the sound sample amplitudes.
In one embodiment, the determination unit is specifically configured to: perform speech rate detection on the voice signal segments ranked highest by sound pressure level to obtain a speech rate detection result; and, according to the speech rate detection result, determine the user corresponding to one of the voice signal segments ranked highest by sound pressure level as the main speaker.
In one embodiment, the apparatus is a conference terminal, the apparatus further comprising: and the display unit is used for highlighting the picture of the main speaker.
In one embodiment, the apparatus is a media server, the apparatus further comprising: and the sending unit is used for sending an indication message to the conference terminal, wherein the indication message is used for indicating the main speaker so that the conference terminal can highlight the picture of the main speaker.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.
As shown in fig. 5, one embodiment of the present application provides an electronic device 500, which includes: a memory 510, a processor 520, and a computer program stored on the memory 510 and executable on the processor 520. When the processor 520 reads the program from the memory 510 and executes it via the bus 530, it can implement the method performed by the execution subject in any of the embodiments described above. Optionally, the device shown in fig. 5 may also include a transceiver, which may be used to transmit and/or receive data and/or signals. Optionally, when the device shown in fig. 5 is a conference terminal, it may further include a collector, which may be used to collect audio/video data. Optionally, the electronic device 500 may also include other input devices or other hardware, and the embodiments of the present application are not limited in this regard.
The processor 520 may process digital signals and may include various computing architectures, such as a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture implementing a combination of instruction sets. In some examples, the processor 520 may be a microprocessor.
Memory 510 may be used for storing instructions to be executed by processor 520 or data related to execution of the instructions. Such instructions and/or data may include code to implement some or all of the functions of one or more modules described in embodiments of the present application. The processor 520 of the disclosed embodiments may be configured to execute instructions in the memory 510 to implement the methods performed by the execution bodies described above. Memory 510 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
An embodiment of the present application further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, can implement a method performed by an executing body in any of the above methods as provided in the above embodiments.
An embodiment of the present application further provides a computer program product, where the computer program product includes a computer program, where the computer program when executed by a processor may implement a method performed by an executing body in any of the above methods provided in the above embodiments.
It should be noted that the processor (e.g., the processor of fig. 5) in embodiments of the present invention may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or another storage medium well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
It will be appreciated that the memory in embodiments of the present invention (e.g., the memory of fig. 5) may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Applications to which embodiments of the present application relate include any application installed on a requesting end, including but not limited to, browser, email, instant messaging service, word processing, keyboard virtualization, widgets (widgets), encryption, digital rights management, voice recognition, voice replication, positioning (e.g., functions provided by the global positioning system), music playing, and so forth.
It should be understood that the transceiver unit or transceiver in the embodiments of the present invention may also be referred to as a communication unit.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity: hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device itself can be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. Furthermore, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes, such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or a distributed system, and/or interacting with other systems across a network such as the internet by way of the signal).
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In addition, the terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
It should be understood that, in embodiments of the present invention, "B corresponding to A" means that B is associated with A, and that B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the foregoing description has generally described the composition and steps of the examples in terms of function. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of units is merely a logical functional division, and in actual implementation there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed between components may be indirect couplings or communication connections via interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
In summary, the foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for determining a main speaker in a conference, comprising:
detecting voice activity on multiple channels of voice signal segments, and determining at least two channels of voice signal segments in which voice activity exists, wherein the multiple channels of voice signal segments correspond to multiple users participating in the conference, and each user corresponds to one channel of voice signal segments;
performing sound pressure level calculation on the at least two voice signal segments to obtain sound pressure level calculation results for the at least two voice signal segments; and
determining, according to the sound pressure level calculation results of the at least two voice signal segments, the user corresponding to one of the at least two voice signal segments as the main speaker.
2. The method according to claim 1, wherein:
the plurality of users includes a plurality of focus video members;
or,
the plurality of users includes all members participating in the conference.
3. The method according to claim 1 or 2, wherein the detecting voice activity on multiple channels of voice signal segments and determining at least two channels of voice signal segments in which voice activity exists comprises:
detecting the multiple channels of voice signal segments according to at least one of the following parameters, and determining the at least two channels of voice signal segments in which voice activity exists: the energy of the voice signal, the zero-crossing rate of the voice signal, the amplitude difference of the voice signal, the signal-to-noise ratio of the voice signal, the frequency of the voice signal, and the time-domain characteristics of the voice signal (see the first illustrative sketch following the claims).
4. The method according to claim 1 or 2, wherein the performing sound pressure level calculation on the at least two voice signal segments to obtain sound pressure level calculation results for the at least two voice signal segments comprises:
calculating the sound pressure level calculation results of the at least two voice signal segments according to the following formula:

$$\mathrm{SPL} = 20 \lg \frac{P_{\mathrm{rms}}}{P_{0}}$$

wherein SPL represents the sound pressure level, $P_{0}$ represents the sound pressure at the hearing threshold, and $P_{\mathrm{rms}}$ is the root mean square (RMS) value of the voice sample point amplitudes (see the second illustrative sketch following the claims).
5. The method according to claim 1 or 2, wherein the determining, according to the sound pressure level calculation results of the at least two voice signal segments, the user corresponding to one of the at least two voice signal segments as the main speaker comprises:
performing speech rate detection on the at least two channels of voice signal segments ranked highest by sound pressure level, to obtain a speech rate detection result; and
determining, according to the speech rate detection result, the user corresponding to one of the voice signal segments ranked highest by sound pressure level as the main speaker (see the third illustrative sketch following the claims).
6. The method according to claim 1 or 2, characterized in that the method is performed by a conference terminal, the method further comprising:
highlighting the picture of the main speaker.
7. The method according to claim 1 or 2, wherein the method is performed by a media server, the method further comprising:
and sending an indication message to the conference terminal, wherein the indication message is used for indicating the main speaker so that the conference terminal can highlight the picture of the main speaker.
8. An apparatus for determining a main speaker in a conference, comprising:
a detection unit, configured to detect voice activity on multiple channels of voice signal segments and determine at least two channels of voice signal segments in which voice activity exists, wherein the multiple channels of voice signal segments correspond to multiple users participating in the conference, and each user corresponds to one channel of voice signal segments;
a calculation unit, configured to perform sound pressure level calculation on the at least two voice signal segments to obtain sound pressure level calculation results for the at least two voice signal segments; and
a determination unit, configured to determine, according to the sound pressure level calculation results of the at least two voice signal segments, the user corresponding to one of the at least two voice signal segments as the main speaker.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the computer program, when run by the processor, performs the method of any one of claims 1-7.
10. A computer readable storage medium having a computer program stored thereon, wherein the computer program, when run by a processor, performs the method of any one of claims 1-7.
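The following three sketches are editorial illustrations only, not part of the claims or the disclosed implementation. First, a minimal sketch of the voice activity detection contemplated by claim 3, using two of the listed parameters (short-time energy and zero-crossing rate). The threshold values, frame handling, and the function name has_voice_activity are illustrative assumptions; the patent does not specify a particular detector or thresholds.

```python
# Illustrative VAD sketch (claim 3): a segment is flagged as active when its
# short-time energy is high enough and its zero-crossing rate is in a range
# typical of voiced speech. Thresholds are illustrative assumptions only.
import numpy as np

def has_voice_activity(segment: np.ndarray,
                       energy_threshold: float = 1e-4,
                       zcr_threshold: float = 0.25) -> bool:
    samples = segment.astype(np.float64)
    energy = np.mean(samples ** 2)                         # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(samples))) > 0)   # zero-crossing rate
    return bool(energy > energy_threshold and zcr < zcr_threshold)
```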
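Second, the sound pressure level calculation of claim 4, transcribed directly from the formula SPL = 20·lg(P_rms / P_0). The reference pressure P0 = 20 µPa assumes samples calibrated in pascals, which digital conference audio generally is not; for the ranking in claim 1, however, an uncalibrated reference only shifts every channel by the same constant, so the relative ordering is unaffected.

```python
# Illustrative sound pressure level computation (claim 4):
# SPL = 20 * lg(P_rms / P_0), with P_rms the RMS of the sample amplitudes.
import numpy as np

P0 = 20e-6  # hearing-threshold reference pressure (20 micropascals), assumed

def sound_pressure_level(segment: np.ndarray) -> float:
    p_rms = np.sqrt(np.mean(segment.astype(np.float64) ** 2))
    return 20.0 * np.log10(max(p_rms, 1e-12) / P0)  # epsilon guards log(0)
```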
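Third, a sketch tying claims 1 and 5 together: run voice activity detection per channel, rank the active channels by sound pressure level, and choose among the top-ranked channels by speech rate. The helper estimate_speech_rate is a hypothetical stand-in (a crude energy-burst counter), since the patent does not specify the speech rate detector; has_voice_activity and sound_pressure_level are the helpers from the two sketches above.

```python
# Illustrative main-speaker selection (claims 1 and 5). Assumes the
# has_voice_activity and sound_pressure_level helpers defined above.
from typing import Dict, Optional
import numpy as np

def estimate_speech_rate(segment: np.ndarray, sample_rate: int) -> float:
    """Hypothetical proxy for speech rate: count energy-burst (syllable-like)
    peaks per second over 25 ms frames."""
    frame = int(0.025 * sample_rate)
    usable = len(segment) // frame * frame
    energies = (segment[:usable].astype(np.float64) ** 2).reshape(-1, frame).mean(axis=1)
    peaks = np.sum((energies[1:-1] > energies[:-2]) &
                   (energies[1:-1] > energies[2:]) &
                   (energies[1:-1] > energies.mean()))
    return float(peaks) / (len(segment) / sample_rate)

def pick_main_speaker(channels: Dict[str, np.ndarray],
                      sample_rate: int, top_n: int = 2) -> Optional[str]:
    # Step 1 (claim 1): keep only channels in which voice activity exists.
    active = {user: seg for user, seg in channels.items() if has_voice_activity(seg)}
    if len(active) < 2:
        return next(iter(active), None)   # zero or one active channel: trivial
    # Step 2 (claim 1): rank the active channels by sound pressure level.
    ranked = sorted(active, key=lambda u: sound_pressure_level(active[u]), reverse=True)
    # Step 3 (claim 5): among the top-ranked channels, pick the fastest speaker.
    return max(ranked[:top_n], key=lambda u: estimate_speech_rate(active[u], sample_rate))
```

Under claims 6 and 7, a conference terminal or media server would then use the returned user to highlight the corresponding picture.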
CN202311361650.4A 2023-10-19 2023-10-19 Method, device, electronic equipment and medium for determining main speaker in conference Pending CN117579770A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311361650.4A CN117579770A (en) 2023-10-19 2023-10-19 Method, device, electronic equipment and medium for determining main speaker in conference


Publications (1)

Publication Number Publication Date
CN117579770A 2024-02-20

Family

ID=89887040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311361650.4A Pending CN117579770A (en) 2023-10-19 2023-10-19 Method, device, electronic equipment and medium for determining main speaker in conference

Country Status (1)

Country Link
CN (1) CN117579770A (en)


Legal Events

Date Code Title Description
PB01 Publication