CN114727194A

CN114727194A - Microphone volume control method, device, equipment and storage medium

Info

Publication number: CN114727194A
Application number: CN202110002583.1A
Authority: CN
Inventors: 高毅; 罗程; 李斌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-01-04
Filing date: 2021-01-04
Publication date: 2022-07-08

Abstract

The embodiments of the application provide a microphone volume control method, device, equipment and storage medium. The method comprises: performing voice detection on a voice signal collected by a microphone to obtain at least two environment indexes corresponding to the voice signal; preprocessing the voice signal with the voice signal processing mode corresponding to each environment index to obtain at least two corresponding voice signal streams; performing parameter feature extraction on each voice signal stream to obtain at least two parameter state streams of the voice signal; determining a digital gain adjustment amount and an analog gain adjustment amount of the microphone according to the at least two parameter state streams; and adjusting the digital gain and the analog gain of the microphone accordingly. In this way, the digital gain and the analog gain of the microphone can be adjusted flexibly and adaptively according to the environment indexes of the microphone's current environment, so that the microphone volume is smoother and the user experience is improved.

Description

Microphone volume control method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of terminals, in particular to a method, a device, equipment and a storage medium for controlling the volume of a microphone.

Background

In current automatic control methods for microphone gain, the digital gain and analog gain of the microphone are usually adjusted by a preset, fixed automatic gain control algorithm to control the microphone volume. The methods in the related art do not consider the multiple indexes that may affect the microphone volume and cannot adjust the digital gain and the analog gain of the microphone flexibly and adaptively according to the microphone's current environment. As a result, the volume fluctuates between too loud and too quiet when it is adjusted, which greatly degrades the user experience.

Disclosure of Invention

The embodiments of the application provide a microphone volume control method, device, equipment and storage medium.

The technical scheme of the embodiment of the application is realized as follows:

the embodiment of the application provides a microphone volume control method, which comprises the following steps:

carrying out voice detection on voice signals collected by the microphone to obtain at least two environment indexes corresponding to the voice signals;

respectively carrying out voice signal preprocessing on the voice signals by adopting a voice signal processing mode corresponding to each environment index to correspondingly obtain at least two voice signal streams;

extracting parameter characteristics of each voice signal stream to obtain at least two parameter state streams of the voice signals;

respectively determining a digital gain adjustment amount and an analog gain adjustment amount of the microphone according to the at least two parameter state streams;

and correspondingly adjusting the digital gain and the analog gain of the microphone according to the digital gain adjustment amount and the analog gain adjustment amount so as to realize volume control of the microphone.

The embodiment of the application provides a microphone volume control device, the device includes:

the voice detection module is used for carrying out voice detection on the voice signals collected by the microphone to obtain at least two environment indexes corresponding to the voice signals;

the preprocessing module is used for respectively preprocessing the voice signals in a voice signal processing mode corresponding to each environment index to correspondingly obtain at least two voice signal streams;

the parameter feature extraction module is used for extracting the parameter feature of each voice signal stream to obtain at least two parameter state streams of the voice signals;

a determining module, configured to determine a digital gain adjustment amount and an analog gain adjustment amount of the microphone according to the at least two parameter state streams;

and the adjusting module is used for correspondingly adjusting the digital gain and the analog gain of the microphone according to the digital gain adjustment quantity and the analog gain adjustment quantity so as to realize volume control of the microphone.

Embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium; the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor is configured to execute the computer instructions to implement the above-mentioned microphone volume control method.

The embodiment of the application provides microphone volume control equipment, including: a memory for storing executable instructions; and a processor for implementing the above microphone volume control method when executing the executable instructions stored in the memory.

The embodiment of the application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the executable instructions to realize the microphone volume control method.

The embodiment of the application has the following beneficial effects: voice detection is performed on a voice signal collected by a microphone to obtain at least two environment indexes corresponding to the voice signal; the voice signal processing mode corresponding to each environment index is used to perform a different type of preprocessing on the voice signal, yielding at least two voice signal streams, where each type of voice signal processing mode corresponds to one environment index; a digital gain adjustment amount and an analog gain adjustment amount of the microphone are then determined based on the at least two voice signal streams, and the volume of the microphone is controlled according to the digital gain adjustment amount and the analog gain adjustment amount.

Drawings

FIG. 1 is a schematic diagram of an alternative architecture of a microphone volume control system according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a server provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of an alternative method for controlling the volume of a microphone according to an embodiment of the present application;

FIG. 4 is a schematic flow chart of an alternative method for controlling the volume of a microphone according to an embodiment of the present application;

FIG. 5 is a schematic flow chart of an alternative method for controlling the volume of a microphone according to an embodiment of the present application;

FIG. 6 is a schematic flow chart of an alternative method for controlling the volume of a microphone according to an embodiment of the present application;

FIG. 7 is a diagram of an application scenario of a method according to an embodiment of the present application;

FIG. 8 is an architecture diagram of an automatic gain control method for a microphone according to an embodiment of the present application;

FIG. 9 is a diagram of a speech feature pool architecture provided by an embodiment of the present application;

FIG. 10 is a block diagram of a gain control module according to an embodiment of the present application;

FIG. 11 is a schematic diagram of a two-stage volume control provided by an embodiment of the present application;

FIG. 12 is an abstract transformation diagram of the two-stage volume control of FIG. 11.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments of the present application belong. The terminology used in the embodiments of the present application is for the purpose of describing the embodiments of the present application only and is not intended to be limiting of the present application.

In existing automatic control methods for microphone gain, one implementation uses Voice Activity Detection (VAD) together with a preset automatic gain control algorithm to adjust the digital gain and analog gain of the microphone, but it does not combine conventional VAD with pitch indicators to control the gain calculation. Another implementation adjusts the microphone volume and the microphone boost during a call according to the gain value of the analog Automatic Gain Control (AGC) or the gain value of the digital AGC. If the gain value of the analog AGC or of the digital AGC is positive, it is first determined whether increasing the microphone volume alone can meet the positive gain requirement; if so, the microphone volume is adjusted; if not, it is determined whether the microphone boost can be adjusted, and if so, the microphone volume and the microphone boost are adjusted together; if both are already at their maximum, no adjustment is made. If the gain value of the analog AGC or of the digital AGC is negative, it is determined whether reducing the microphone volume alone can meet the negative gain requirement; if so, the microphone volume is adjusted; if not, it is determined whether the microphone boost can be adjusted, and if so, the microphone volume and the microphone boost are adjusted together; if both are already at their minimum, no adjustment is made. This technique is therefore an automatic control process for the microphone volume and the microphone boost.

However, the methods in the related art do not consider the multiple environment indexes that may affect the microphone volume and cannot adjust the digital gain and the analog gain of the microphone flexibly according to the microphone's current environment. As a result, the volume fluctuates between too loud and too quiet when it is adjusted, which greatly degrades the user experience.

Based on the above problems in the related art, in the microphone volume control method provided in the embodiment of the present application, voice detection is first performed on a voice signal collected by a microphone to obtain at least two environment indexes corresponding to the voice signal; the voice signal processing mode corresponding to each environment index is then used to preprocess the voice signal collected by the microphone, yielding at least two corresponding voice signal streams; next, parameter feature extraction is performed on each voice signal stream to obtain at least two parameter state streams of the voice signal; a digital gain adjustment amount and an analog gain adjustment amount of the microphone are determined according to the at least two parameter state streams; and finally, the digital gain and the analog gain of the microphone are adjusted according to the digital gain adjustment amount and the analog gain adjustment amount to control the volume of the microphone. Because the influence of multiple environment indexes on the microphone volume is taken into account, the digital gain and the analog gain of the microphone can be adjusted flexibly according to the environment indexes of the microphone's current environment, so that the microphone volume is smoother and the user experience is improved.
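The five steps above can be pictured as a per-frame pipeline. The following Python sketch is illustrative only: the function name control_volume, the callables passed in, and the dictionary layout are hypothetical and do not come from the application, which does not prescribe a code structure.

```python
from typing import Callable, Dict, List, Tuple

Frame = List[float]


def control_volume(
    frame: Frame,
    detect: Callable[[Frame], List[str]],
    preprocessors: Dict[str, Callable[[Frame], Frame]],
    extractors: Dict[str, Callable[[Frame], dict]],
    decide: Callable[[Dict[str, dict]], Tuple[float, float]],
) -> Tuple[float, float]:
    """Run one frame through detection, preprocessing, extraction and decision.

    Returns the (digital, analog) gain adjustment amounts; applying them to the
    microphone (the final step) is left to the caller.
    """
    # Step 1: voice detection yields the environment indexes (e.g. noise, echo).
    indexes = detect(frame)
    # Step 2: one preprocessing mode per environment index, one stream each.
    streams = {name: preprocessors[name](frame) for name in indexes}
    # Step 3: parameter feature extraction on every stream.
    states = {name: extractors[name](stream) for name, stream in streams.items()}
    # Step 4: derive the two adjustment amounts from the parameter states.
    return decide(states)
```

Passing the stages in as callables keeps the sketch independent of any particular detector or filter; a real implementation could equally hard-wire them.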

An exemplary application of the microphone volume control device according to the embodiment of the present application is described below. In one implementation, the microphone volume control device may be implemented as any terminal having a voice capture and entry function, such as a notebook computer, a tablet computer, a desktop computer, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), or an intelligent robot. In another implementation, the microphone volume control device may also be implemented as a server. Next, an exemplary application in which the microphone volume control device is implemented as a server will be explained.

Referring to fig. 1, fig. 1 is a schematic diagram of an alternative architecture of a microphone volume control system 10 according to an embodiment of the present application. In order to implement volume control on a microphone, a microphone volume control system 10 provided in the embodiment of the present application includes a terminal 100, a network 200, and a server 300, where the terminal 100 has a microphone 100-1, the microphone collects a voice to obtain a voice signal, the terminal 100 sends the collected voice signal to the server 300 through the network 200, the server 300 performs voice detection on the voice signal collected by the microphone to obtain at least two environment indexes corresponding to the voice signal, and performs different types of voice signal preprocessing on the voice signal collected by the microphone respectively by using a voice signal processing method corresponding to each of the environment indexes to obtain at least two voice signal streams correspondingly; wherein, each type of voice signal processing mode corresponds to an environment index; extracting parameter characteristics of each voice signal stream to obtain at least two parameter state streams of the voice signals; and respectively determining the digital gain adjustment amount and the analog gain adjustment amount of the microphone according to the at least two parameter state streams. 
After obtaining the digital gain adjustment amount and the analog gain adjustment amount, the server 300 may correspondingly adjust the digital gain and the analog gain of the microphone according to the digital gain adjustment amount and the analog gain adjustment amount to implement the volume control on the microphone, and may also send the digital gain adjustment amount and the analog gain adjustment amount to the terminal 100, so that the terminal 100 correspondingly adjusts the digital gain and the analog gain of the microphone according to the digital gain adjustment amount and the analog gain adjustment amount to implement the volume control on the microphone.

Fig. 2 is a schematic structural diagram of a server 300 according to an embodiment of the present application, where the server 300 shown in fig. 2 includes: at least one processor 310, memory 350, at least one network interface 320, and a user interface 330. The various components in server 300 are coupled together by a bus system 340. It will be appreciated that the bus system 340 is used to enable communications among the components connected. The bus system 340 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 340 in fig. 2.

The Processor 310 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.

The user interface 330 includes one or more output devices 331, including one or more speakers and/or one or more visual display screens, that enable presentation of media content. The user interface 330 also includes one or more input devices 332, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 350 optionally includes one or more storage devices physically located remote from processor 310. The memory 350 may include either volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 350 described in embodiments herein is intended to comprise any suitable type of memory. In some embodiments, memory 350 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below, to support various operations.

An operating system 351 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;

a network communication module 352 for communicating with other computing devices via one or more (wired or wireless) network interfaces 320, exemplary network interfaces 320 including: Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), etc.;

an input processing module 353 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.

In some embodiments, the apparatus provided by the embodiments of the present application can be implemented in software. FIG. 2 shows a microphone volume control device 354 stored in the memory 350; the microphone volume control device 354 may be a microphone volume control device in the server 300 and may be software in the form of programs, plug-ins and the like, including the following software modules: the voice detection module 3540, the preprocessing module 3541, the parameter feature extraction module 3542, the determining module 3543, and the adjusting module 3544. These modules are logical and may therefore be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.

In other embodiments, the apparatus provided in the embodiments of the present Application may be implemented in hardware, and for example, the apparatus provided in the embodiments of the present Application may be a processor in the form of a hardware decoding processor, which is programmed to execute the microphone volume control method provided in the embodiments of the present Application, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.

The microphone volume control method provided by the embodiment of the present application will be described below in conjunction with an exemplary application and implementation of the server 300 provided by the embodiment of the present application. Referring to fig. 3, fig. 3 is an alternative flow chart of a microphone volume control method provided in an embodiment of the present application, and the following description will be made with reference to the steps shown in fig. 3.

Step S301, voice detection is carried out on the voice signals collected by the microphone, and at least two environment indexes corresponding to the voice signals are obtained.

Here, the voice detection is used to detect an environmental sound in the voice signal, for example, the environmental index includes, but is not limited to, an environmental sound such as background noise, echo, and howling, and thus the voice detection includes noise detection, echo detection, and howling detection. By carrying out voice detection on the voice signals, the noise, echo, howling and other environmental sounds in the voice signals are obtained.

In some embodiments, any noise detection method may be used to detect noise in the voice signal. For example, voice activity detection may first be performed on the voice signal: based on the fluctuation of energy in the voice signal, voice is considered present when the energy is large and fluctuates strongly. The detected voice includes both the user's normal voice (i.e., the basic voice) and noise, so the noise must be further extracted and processed. Then, according to the pattern of energy fluctuation in the voice signal, the noise segments in the voice signal can be determined and the noise energy spectrum estimated, so as to determine the noise signal in the voice signal, i.e., to detect the noise in the voice signal.
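As a rough illustration of the energy-based idea above, the sketch below labels frames as voice or noise and tracks a running noise-energy estimate. It is a simplification under stated assumptions: it uses only frame energy (not its fluctuation or a full energy spectrum), treats the first frame as noise, and the names estimate_noise, ratio and the 0.9/0.1 smoothing constants are hypothetical.

```python
from typing import List, Tuple


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def estimate_noise(frames: List[List[float]], ratio: float = 4.0) -> Tuple[List[bool], float]:
    """Label each frame (True = voice) and return the final noise-energy estimate.

    A frame counts as voice when its energy exceeds `ratio` times the current
    noise estimate; otherwise it is treated as noise and the estimate is
    smoothed toward its energy.
    """
    # Assumption: the first frame contains noise only. The floor avoids a zero estimate.
    noise = max(frame_energy(frames[0]), 1e-12)
    labels = []
    for frame in frames:
        energy = frame_energy(frame)
        is_voice = energy > ratio * noise
        if not is_voice:
            # Update the noise estimate slowly, and only on noise frames.
            noise = 0.9 * noise + 0.1 * energy
        labels.append(is_voice)
    return labels, noise
```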

In some embodiments, any echo detection method may be used to detect echo in the voice signal. For example, if the terminal is detected to be in hands-free mode during a call and voice is detected by voice activity detection (that is, voice is considered present when the energy of the voice signal is large and fluctuates strongly), the voice signal at that moment may be considered to contain echo.

In some embodiments, any howling detection method may be used to detect howling in the voice signal. For example, a neural-network-based howling detector may be used, or the distance between the two terminals in a call may be detected: when the distance between the two terminals is smaller than a distance threshold and both terminals are in hands-free mode, an acoustic loop may form between them and produce howling, so the voice signal may be determined to contain howling.
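The two rule-based examples above (echo when hands-free and voice coincide; howling when two hands-free terminals are close together) reduce to simple predicates. The sketch below assumes the inputs are already available as booleans and a distance in metres; the function names and the 1.0 m default threshold are hypothetical, since the application does not give a threshold value.

```python
def echo_suspected(hands_free: bool, voice_detected: bool) -> bool:
    """Echo is assumed possible when voice is detected while in hands-free mode."""
    return hands_free and voice_detected


def howling_suspected(
    distance_m: float,
    a_hands_free: bool,
    b_hands_free: bool,
    threshold_m: float = 1.0,  # hypothetical distance threshold
) -> bool:
    """Howling is assumed possible when both terminals are hands-free and close."""
    return distance_m < threshold_m and a_hands_free and b_hands_free
```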

Step S302, adopting a voice signal processing mode corresponding to each environment index to respectively carry out voice signal preprocessing on voice signals, and correspondingly obtaining at least two voice signal streams.

Here, each type of speech signal processing mode corresponds to an environmental indicator, and the environmental indicator includes, but is not limited to, environmental sounds such as background noise, echo, and howling. In the embodiment of the present application, each voice signal stream corresponds to a voice signal processing method, and after a voice signal is processed by a voice signal processing method, a corresponding voice signal stream is obtained. It should be noted that, multiple types of voice signal processing manners may also be adopted to process the voice signals at the same time, so as to obtain corresponding voice signal streams.

In the embodiment of the present application, the voice signal processing mode may include, but is not limited to, at least one of the following: DC-removal filtering, echo cancellation, noise suppression, howling suppression, and the like. When the voice signal processing mode is DC-removal filtering, DC-removal filtering is performed on the voice signal to obtain a DC-filtered voice signal stream; when the voice signal processing mode is echo cancellation, echo cancellation is performed on the voice signal to obtain an echo-cancelled voice signal stream; and when the voice signal processing mode is noise suppression, noise suppression is performed on the voice signal to obtain a noise-suppressed voice signal stream.
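Of the processing modes listed above, DC-removal filtering is the simplest to show. The sketch below uses a common one-pole DC blocker, y[n] = x[n] - x[n-1] + alpha * y[n-1]; the application does not specify which filter it uses, so the filter form, the name remove_dc and the value alpha = 0.995 are assumptions.

```python
from typing import List


def remove_dc(samples: List[float], alpha: float = 0.995) -> List[float]:
    """One-pole DC blocker: y[n] = x[n] - x[n-1] + alpha * y[n-1]."""
    out = []
    prev_x = 0.0  # x[n-1]
    prev_y = 0.0  # y[n-1]
    for x in samples:
        y = x - prev_x + alpha * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

A constant (pure DC) input decays toward zero at the output, while faster-varying speech content passes largely unchanged; values of alpha closer to 1 lower the cutoff frequency.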

The speech signal processing method in the embodiment of the present application is to pre-process a speech signal, that is, the speech signal can be pre-processed after the speech signal is collected, and then the microphone gain control is performed based on the pre-processed speech signal.

In the embodiment of the present application, the voice signal may be a voice signal of a certain duration. The embodiment may process speech in units of frames, one frame at a time, to realize volume control of the microphone. A speech frame contains a plurality of speech samples, each of which is a sample value of the audio waveform in the time domain, and a frame corresponds to a certain duration. For example, a speech frame may be 10 milliseconds long: the speech samples within 10 milliseconds are collected to obtain the voice signal, which is then processed and used in the calculations.
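Frame-based processing as described above can be sketched as follows. The 10 ms frame length comes from the text; the 16 kHz sample rate and the choice to drop a trailing partial frame are assumptions made for illustration.

```python
from typing import List


def split_frames(
    samples: List[float],
    sample_rate: int = 16000,  # assumed sample rate, not stated in the text
    frame_ms: int = 10,
) -> List[List[float]]:
    """Split a sample sequence into consecutive, non-overlapping frames.

    A trailing partial frame is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # 160 samples at 16 kHz and 10 ms
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]
```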

Step S303, performing parameter feature extraction on each voice signal stream to obtain at least two parameter state streams of the voice signal.

Here, the parameter feature extraction refers to performing feature extraction on a voice signal stream processed by each voice signal processing method to obtain current state parameters corresponding to each voice signal stream in different states, where the current state parameters are parameters for describing different information and different states of the voice signal.

In the embodiment of the present application, the voice signal processing mode may include, but is not limited to, at least one of the following: DC-removal filtering, echo cancellation, and noise suppression. Correspondingly, the voice signal stream may include, but is not limited to, at least one of: a DC-filtered voice signal stream, an echo-cancelled voice signal stream, and a noise-suppressed voice signal stream.

When the voice signal stream is the DC-filtered voice signal stream, parameter feature extraction may include clipping (truncation) detection on the DC-filtered voice signal stream to obtain its clipping flag. For example, it may be detected whether the amplitude of the DC-filtered voice signal stream exceeds the range representable by a 16-bit integer, which would cause clipping distortion; if so, the clipping flag is set to a first clipping flag, and if not, it is set to a second clipping flag. In some embodiments, parameter feature extraction may also include computing the time-domain energy envelope of the DC-filtered voice signal stream to obtain the energy of the voice signal stream, or performing VAD on the DC-filtered microphone recording to obtain a VAD result. After the clipping flag, the energy, and the VAD result are obtained, at least one of them is determined as a parameter state stream of the voice signal.
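The three features named above for the DC-filtered stream (clipping or truncation flag, time-domain energy envelope, VAD result) can be computed per frame roughly as below. The 16-bit range check follows the text; the exponential smoothing of the envelope, the fixed-threshold VAD, and all names and default values are assumptions, since the application does not fix these details.

```python
from typing import Dict, List

INT16_MIN, INT16_MAX = -32768, 32767


def extract_state(
    frame: List[float],
    prev_envelope: float = 0.0,
    smoothing: float = 0.9,        # hypothetical envelope smoothing factor
    vad_threshold: float = 1.0e4,  # hypothetical energy threshold for VAD
) -> Dict[str, object]:
    """Return the clipping flag, smoothed energy and VAD decision for one frame."""
    # Clipping flag: any sample outside the 16-bit integer range.
    clipped = any(s < INT16_MIN or s > INT16_MAX for s in frame)
    # Frame energy, smoothed over time to form an energy envelope.
    energy = sum(s * s for s in frame) / len(frame)
    envelope = smoothing * prev_envelope + (1.0 - smoothing) * energy
    # Simplistic VAD: compare the envelope with a fixed threshold.
    return {"clipped": clipped, "energy": envelope, "vad": envelope > vad_threshold}
```

Feeding each frame's returned energy back in as prev_envelope for the next frame gives the running envelope.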

When the voice signal stream is the voice signal stream after echo cancellation, and parameter feature extraction is performed on the voice signal stream, the voice signal stream after echo cancellation may be subjected to fundamental tone extraction to obtain a fundamental tone frequency of the voice signal stream after echo cancellation; and determining the obtained fundamental tone frequency as the parameter state flow of the voice signal flow after echo cancellation.
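Pitch extraction can be done in many ways, and the application does not say which one it uses. As one plausible illustration, the sketch below picks the lag with the largest time-domain autocorrelation inside an assumed 60 to 400 Hz search range. The function names, the 16 kHz sample rate and the search range are all assumptions; note also that a window longer than one 10 ms frame is needed to resolve low pitches.

```python
import math
from typing import List


def sine_tone(freq_hz: float, n_samples: int, sample_rate: int = 16000) -> List[float]:
    """Generate a test tone (helper for trying out the estimator)."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


def estimate_pitch(
    samples: List[float],
    sample_rate: int = 16000,
    f_min: float = 60.0,
    f_max: float = 400.0,
) -> float:
    """Return the pitch in Hz from the autocorrelation peak, or 0.0 if none is found."""
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        corr = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0
```

For example, estimate_pitch(sine_tone(200.0, 640)) analyses a 40 ms window of a 200 Hz tone. Real speech would normally also need a voicing check and protection against octave errors, which this sketch omits.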

When the voice signal stream is a voice signal stream after noise suppression, and parameter feature extraction is performed on the voice signal stream, howling detection may be performed on the voice signal stream after noise suppression to obtain a howling flag of the voice signal stream after noise suppression; performing VAD detection of a second type of signal on the voice signal flow after the noise suppression to obtain a VAD result of the voice signal flow after the noise suppression; and determining at least one of the howling flag and the VAD result as a parameter state stream of the noise suppressed speech signal stream.

Step S304, according to at least two parameter state streams, respectively determining the digital gain adjustment amount and the analog gain adjustment amount of the microphone.

After the parameter state flow is determined, an adjustment mode and an adjustment amount are determined according to current state parameters corresponding to the parameter state flow, wherein each current state parameter corresponds to a digital gain adjustment amount and/or an analog gain adjustment amount. The digital gain adjustment amount and the analog gain adjustment amount are target amounts to be adjusted for the digital gain and the analog gain of the microphone, and the adjustment amount for the gain control of the microphone can be determined based on the digital gain adjustment amount and the analog gain adjustment amount.
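The mapping from state parameters to adjustment amounts is not spelled out at this point in the text, so the sketch below is purely hypothetical: it backs off the analog gain on clipping or howling, holds both gains when no voice is present, and otherwise nudges the digital gain toward a target level. The keys of the state dictionary, the -20 dB target and the 1 dB step are all illustrative assumptions.

```python
from typing import Tuple


def decide_adjustment(
    state: dict,
    target_db: float = -20.0,  # hypothetical target level
    step_db: float = 1.0,      # hypothetical maximum change per frame
) -> Tuple[float, float]:
    """Map one frame's state parameters to (digital_db, analog_db) adjustments."""
    if state.get("clipped") or state.get("howling"):
        # Overload or acoustic feedback: reduce the analog gain.
        return 0.0, -step_db
    if not state.get("vad"):
        # No voice present: hold both gains.
        return 0.0, 0.0
    # Voice present: move the digital gain toward the target, at most one step.
    error_db = target_db - state["level_db"]
    return max(-step_db, min(step_db, error_db)), 0.0
```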

Step S305, correspondingly adjusting the digital gain and the analog gain of the microphone according to the digital gain adjustment amount and the analog gain adjustment amount, so as to control the volume of the microphone.

Here, after the digital gain adjustment amount and the analog gain adjustment amount are determined, the digital gain and the analog gain are adjusted according to the target amounts corresponding to the digital gain adjustment amount and the analog gain adjustment amount. It should be noted that the digital gain and the analog gain may be adjusted in parallel, that is, both may be adjusted at the same time once the two adjustment amounts are determined, or one may be adjusted after the other. When the digital gain and the analog gain are adjusted in sequence, if the microphone volume already meets the requirement (i.e., the volume is smooth) after either one has been adjusted, the other gain need not be adjusted.

In some embodiments, the microphone gain adjusted value may be a preset value, i.e., a target volume of the microphone is preset. Thus, when determining the digital gain adjustment amount and the analog gain adjustment amount of the microphone, the digital gain adjustment amount and the analog gain adjustment amount can be respectively determined according to at least two parameter state streams and the preset value.

In some embodiments, when the digital gain and the analog gain of the microphone are correspondingly adjusted according to the digital gain adjustment amount and the analog gain adjustment amount, the two gains may be adjusted in parallel, may be adjusted in sequence, or only one of the digital gain and the analog gain may be adjusted. Here, parallel adjustment means that the digital gain and the analog gain are adjusted simultaneously according to the determined digital gain adjustment amount and analog gain adjustment amount. When the digital gain and the analog gain are adjusted in sequence, either of them may be adjusted first; for example, the digital gain is adjusted by the digital gain adjustment amount, and the analog gain is then adjusted by the analog gain adjustment amount. In some embodiments, microphone volume detection may be performed after the digital gain (or analog gain) is adjusted, and when the detection result indicates that the adjusted microphone volume meets a preset condition or reaches a preset target volume, the adjustment of the analog gain (or digital gain) is skipped, that is, the volume control of the microphone is implemented by adjusting only one of the digital gain and the analog gain.

In some embodiments, the adjustment of the digital gain and the analog gain may also be performed according to a certain adjustment ratio. For example, after the digital gain adjustment amount and the analog gain adjustment amount are determined, the digital gain adjustment amount and the analog gain adjustment amount are weighted according to a preset adjustment ratio to determine a final adjustment amount of the digital gain and the analog gain, and then the digital gain and the analog gain are adjusted according to the calculated final adjustment amount. In other embodiments, when determining the digital gain adjustment amount and the analog gain adjustment amount, a ratio between the digital gain adjustment amount and the analog gain adjustment amount may be considered, that is, the determined digital gain adjustment amount and the determined analog gain adjustment amount have the preset adjustment ratio therebetween, so that when adjusting the digital gain and the analog gain, the adjustment may be performed according to the preset adjustment ratio.

In some embodiments, the digital gain adjustment amount and the analog gain adjustment amount may have a direct proportional relationship or an inverse proportional relationship. When they have a direct proportional relationship, the digital gain adjustment amount is used to increase the digital gain of the microphone while the analog gain adjustment amount is used to increase the analog gain of the microphone; alternatively, the digital gain adjustment amount is used to decrease the digital gain of the microphone while the analog gain adjustment amount is used to decrease the analog gain of the microphone. When they have an inverse proportional relationship, the digital gain adjustment amount is used to increase the digital gain of the microphone while the analog gain adjustment amount is used to decrease the analog gain of the microphone; alternatively, the digital gain adjustment amount is used to decrease the digital gain of the microphone while the analog gain adjustment amount is used to increase the analog gain of the microphone.

In some embodiments, after the digital gain adjustment amount and the analog gain adjustment amount are determined, the digital gain adjustment amount and the analog gain adjustment amount may be divided into a plurality of sub-adjustment amounts according to a certain dividing manner, and then the digital gain and the analog gain may be adjusted according to the sub-adjustment amounts obtained by the division, for example, the digital gain adjustment amount and the analog gain adjustment amount may be divided according to equal division intervals or unequal division intervals.

For example, if the currently determined digital gain adjustment amount is 32 dB and the analog gain adjustment amount is 40 dB, the digital gain adjustment amount may be divided into 4 sub-adjustment amounts {+5 dB; +7 dB; +9 dB; +11 dB}, and the analog gain adjustment amount may be divided into 4 sub-adjustment amounts {+10 dB; +10 dB; +10 dB; +10 dB}. Then, at the time of adjustment, the digital gain may be increased by 5 dB, then by 7 dB, then by 9 dB, and then by 11 dB, after which the analog gain may be increased by 10 dB four times in succession. Or, in other embodiments, the adjustments may be interleaved: the digital gain is increased by 5 dB, the analog gain by 10 dB, the digital gain by 7 dB, the analog gain by 10 dB, the digital gain by 9 dB, the analog gain by 10 dB, the digital gain by 11 dB, and the analog gain by 10 dB. In addition, during the adjustment, the microphone volume can be detected once after each sub-adjustment amount is applied, and when the detection result shows that the currently adjusted microphone volume meets the preset condition or reaches the preset target volume, the adjustment of the analog gain and the digital gain is stopped.
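The interleaved variant of this example, with a volume check after each sub-adjustment, might be sketched as follows. This is only an illustration: the function names and the stop condition (a hypothetical total of 30 dB applied) are assumptions, and a real system would measure the microphone volume rather than sum the applied gains.

```python
from typing import Callable, List, Tuple


def split_equal(total_db: float, parts: int) -> List[float]:
    """Divide a total adjustment amount into equal sub-adjustment amounts."""
    return [total_db / parts] * parts


def interleave_steps(digital: List[float], analog: List[float]) -> List[Tuple[str, float]]:
    """Alternate digital and analog sub-adjustments, digital first."""
    steps: List[Tuple[str, float]] = []
    for i in range(max(len(digital), len(analog))):
        if i < len(digital):
            steps.append(("digital", digital[i]))
        if i < len(analog):
            steps.append(("analog", analog[i]))
    return steps


def apply_steps(steps: List[Tuple[str, float]],
                volume_ok: Callable[[float, float], bool]) -> Tuple[float, float]:
    """Apply sub-adjustments one by one; stop once volume_ok reports success.

    Returns the digital and analog gain (in dB) actually applied.
    """
    digital_applied = 0.0
    analog_applied = 0.0
    for kind, step_db in steps:
        if kind == "digital":
            digital_applied += step_db
        else:
            analog_applied += step_db
        # Volume detection after every sub-adjustment (hypothetical check).
        if volume_ok(digital_applied, analog_applied):
            break
    return digital_applied, analog_applied


digital_subs = [5.0, 7.0, 9.0, 11.0]   # unequal division of 32 dB
analog_subs = split_equal(40.0, 4)     # equal division of 40 dB
steps = interleave_steps(digital_subs, analog_subs)
# Assumed stop condition: 30 dB of combined gain is enough.
applied = apply_steps(steps, lambda d, a: d + a >= 30.0)
```

With the assumed stop condition, the loop ends after the second analog step, so only part of each adjustment amount is applied.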

The embodiment of the application can be applied to the following scenario. The terminal collects the voice signal in the current environment through the microphone and sends the collected voice signal to the server. Using the method provided by the embodiment of the application, the server applies at least two different types of voice signal processing modes to perform different types of voice signal preprocessing on the voice signal collected by the microphone, correspondingly obtaining at least two voice signal streams. The server then performs parameter feature extraction on each voice signal stream to obtain at least two parameter state streams of the voice signal, and determines the digital gain adjustment amount and the analog gain adjustment amount of the microphone according to the at least two parameter state streams. After these amounts are obtained, either the server correspondingly adjusts the digital gain and the analog gain of the microphone according to them, or the server sends them to the terminal so that the terminal correspondingly adjusts the digital gain and the analog gain of the microphone; in both cases, volume control of the microphone is achieved.

According to the microphone volume control method provided by the embodiment of the application, different types of voice signal preprocessing are respectively carried out on voice signals collected by a microphone, at least two voice signal streams are obtained, each type of voice signal processing mode corresponds to one environment index, then the digital gain adjustment amount and the analog gain adjustment amount of the microphone are determined based on the obtained at least two voice signal streams, and the volume control of the microphone is realized according to the digital gain adjustment amount and the analog gain adjustment amount.

It should be noted that, in other embodiments, the microphone volume control method may also be implemented by the terminal, that is, the terminal acquires the voice signal acquired by the microphone carried by the terminal, and performs different types of voice signal preprocessing on the voice signal acquired by the microphone respectively by using at least two different types of voice signal processing manners, so as to obtain at least two voice signal streams correspondingly; wherein, each type of voice signal processing mode corresponds to an environment index; then, parameter feature extraction is carried out on each voice signal flow to obtain at least two parameter state flows of the voice signals; respectively determining the digital gain adjustment quantity and the analog gain adjustment quantity of the microphone according to the at least two parameter state flows; and finally, correspondingly adjusting the digital gain and the analog gain of the microphone according to the digital gain adjustment quantity and the analog gain adjustment quantity so as to realize volume control of the microphone. Under the scene, the terminal autonomously realizes the volume control process of the microphone, data transmission with the server is not needed, bandwidth consumption can be greatly saved, and the calculation amount of the server is reduced. In addition, the volume control of the microphone is carried out by the terminal, and the voice signals collected by the microphone can be analyzed in real time, so that the problem of inaccurate volume control caused by data transmission failure or invalid data transmission can be solved.

In some embodiments, the voice signal processing modes include, but are not limited to, at least one of the following: DC-removal filtering processing, echo cancellation processing, and noise suppression processing.

Fig. 4 is an optional flow chart of a microphone volume control method according to an embodiment of the present application. As shown in Fig. 4, in some embodiments, when the voice signal processing mode includes DC-removal filtering processing, the voice signal stream includes a DC-filtered voice signal stream; step S303 may be implemented by:

Step S401, performing truncation detection on the DC-filtered voice signal stream to obtain a truncation flag of the voice signal stream.

Here, the truncation detection is used to detect whether the amplitude of the voice signal picked up by the microphone has exceeded the representation range of a 16-bit integer and thereby caused truncation distortion. Truncation generally means that the microphone hardware gain is too large and the hardware volume setting needs to be reduced as quickly as possible. The truncation detection may be achieved as follows: since the representation range of a 16-bit integer is -32768 to 32767, if the absolute values of the signal amplitudes of N speech samples in a frame of the input voice signal are greater than a threshold (for example, the threshold may be 32760), the voice signal is marked as truncated and the truncation flag Satuate is set to 1; otherwise, the truncation flag Satuate is set to 0.
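A minimal sketch of this truncation check is given below. The threshold 32760 comes from the description; the function name and the default count of 10 samples are illustrative assumptions.

```python
from typing import Sequence


def detect_truncation(frame: Sequence[int],
                      threshold: int = 32760,
                      min_count: int = 10) -> int:
    """Return the truncation flag (1 = truncated, 0 = not truncated).

    A frame is flagged when at least `min_count` samples have an absolute
    amplitude above `threshold`. `min_count` plays the role of N in the
    description; its default here is an assumption.
    """
    over = sum(1 for sample in frame if abs(sample) > threshold)
    return 1 if over >= min_count else 0
```

For a 160-sample frame, `detect_truncation([32767] * 10 + [0] * 150)` returns 1, while a frame with only nine such samples returns 0.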

Step S402, performing time-domain energy envelope calculation on the DC-filtered voice signal stream to obtain the energy of the voice signal stream.

Here, the time-domain energy envelope calculation divides the voice signal into smaller speech segments, for example, divides an original 10 ms voice signal (i.e., a speech frame) into 5 ms segments, and calculates the sum of squares (i.e., the energy) of the speech samples in each segment, so as to obtain the energy of the voice signal stream.
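The envelope calculation can be sketched as below. The 16 kHz sample rate and the function name are assumptions for illustration; the 5 ms segment length follows the example in the text.

```python
from typing import List, Sequence


def energy_envelope(frame: Sequence[float],
                    sample_rate: int = 16000,
                    segment_ms: int = 5) -> List[float]:
    """Split a frame into fixed-length segments and return each segment's energy.

    The energy of a segment is the sum of squares of its samples. A trailing
    partial segment, if any, is kept and reported with its own energy.
    """
    seg_len = sample_rate * segment_ms // 1000
    if seg_len <= 0:
        raise ValueError("segment length must be at least one sample")
    return [sum(s * s for s in frame[i:i + seg_len])
            for i in range(0, len(frame), seg_len)]
```

At 16 kHz, a 10 ms frame holds 160 samples and yields two 80-sample segments, so a frame of eighty 1s followed by eighty 2s gives the envelope `[80, 320]`.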

Step S403, performing VAD detection of the microphone recording on the DC-filtered voice signal stream to obtain a first VAD value of the voice signal stream.

VAD detection here refers to voice activity detection, also called speech endpoint detection or speech boundary detection. VAD detection aims to identify and eliminate long periods of silence from the voice signal stream to save speech channel resources without degrading the quality of service. In the embodiment of the application, VAD detection is implemented according to the energy fluctuation of the voice signal stream. If the energy is large and fluctuates strongly, speech is considered to be present, and the first VAD value of the voice signal stream is the VAD value indicating that speech is present. If the energy is small and fluctuates gently, no speech is considered to be present, i.e., the signal is in a silence period, and the first VAD value of the voice signal stream is the VAD value indicating that no speech is present.
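One way to turn "large energy and strong fluctuation" into a decision is sketched below. The description does not give a concrete rule, so the two thresholds, the max/min fluctuation measure, and the function name are all assumptions chosen only to illustrate the idea.

```python
from typing import Sequence


def energy_vad(segment_energies: Sequence[float],
               energy_threshold: float = 1.0e6,
               fluctuation_ratio: float = 4.0) -> int:
    """Return 1 if speech is judged present, otherwise 0.

    Speech is assumed present when the mean segment energy is large AND the
    energy fluctuates strongly, measured here as the ratio between the
    largest and smallest segment energy (an assumed measure).
    """
    if not segment_energies:
        return 0
    mean_energy = sum(segment_energies) / len(segment_energies)
    # Adding 1.0 avoids division by zero for all-silent segments.
    fluctuation = max(segment_energies) / (min(segment_energies) + 1.0)
    if mean_energy > energy_threshold and fluctuation > fluctuation_ratio:
        return 1
    return 0
```

Under these assumed thresholds, a loud but steady signal (for example, stationary noise) is not flagged as speech, because its energy does not fluctuate.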

Step S404, determining at least one of the truncation flag, the energy, and the first VAD value as a parameter state stream.

Referring to fig. 4, in some embodiments, when the voice signal processing mode includes echo cancellation processing, the voice signal stream includes a voice signal stream after echo cancellation; step S303 may be implemented by:

Step S405, performing pitch extraction on the echo-cancelled voice signal stream to obtain a pitch frequency of the voice signal stream.

When a person speaks, the vocal cords vibrate and generate harmonics, and the frequency of the vocal cord vibration is the pitch frequency. Sound signals such as noise have no pitch frequency, so pitch detection can be used to determine whether a pitch frequency is present in the voice signal stream.

In step S406, the pitch frequency is determined as a parameter state stream.
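The description does not state how the pitch is extracted. As one common possibility, a normalized autocorrelation search can be sketched as follows; the search range of 60 to 400 Hz, the voicing threshold, and the function name are assumptions, and a return value of 0.0 is used to mean that no pitch was found.

```python
import math
from typing import Sequence


def estimate_pitch(frame: Sequence[float],
                   sample_rate: int = 16000,
                   f_min: float = 60.0,
                   f_max: float = 400.0,
                   voicing_threshold: float = 0.5) -> float:
    """Estimate the pitch frequency in Hz, or return 0.0 if none is found."""
    n = len(frame)
    energy = sum(s * s for s in frame)
    if n == 0 or energy == 0:
        return 0.0
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), n - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag, normalized by the frame energy.
        corr = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag == 0 or best_corr < voicing_threshold:
        return 0.0  # treated as "no pitch", e.g. noise or silence
    return sample_rate / best_lag


# A 40 ms, 200 Hz sine at 16 kHz serves as a stand-in for a voiced sound.
tone = [math.sin(2.0 * math.pi * 200.0 * i / 16000.0) for i in range(640)]
pitch_hz = estimate_pitch(tone)
```

A result greater than zero can then feed the "pitch frequency is greater than zero" condition used later when the analog gain adjustment amount is determined.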

In some embodiments, when performing echo cancellation processing on a speech signal, the method may further comprise the steps of: in step S11, when the voice signal is detected to have the first type signal, the energy fluctuation of the first type signal is determined.

Here, the first type of signal may be a far-end signal, where the far-end signal refers to a voice signal of an opposite end of two parties in a call. According to the embodiment of the application, when a far-end person speaks, a far-end signal is also collected and stored and is used as a reference signal when echo cancellation is carried out.

In step S12, a second VAD value of the voice signal stream is determined based on the energy fluctuation.

Here, whether speech is present in the far-end signal is determined according to the energy fluctuation of the far-end signal. If speech is present in the far-end signal, the second VAD value may be set to the VAD value indicating that far-end speech is present; if no speech is present in the far-end signal, the second VAD value may be set to the VAD value indicating that far-end speech is absent.

In step S13, when the first type signal forms an echo signal in the speech signal, an echo state flag of the speech signal is determined according to the echo signal.

Here, if the voice signal of the far-end is played through the loudspeaker of the terminal and is collected by the microphone of the terminal, an echo signal is formed.

Step S14, determining at least one of the second VAD value and the echo state flag as a parameter state stream.

Referring to fig. 4, in some embodiments, when the voice signal processing mode includes a noise suppression process, the voice signal stream includes a noise-suppressed voice signal stream, and step S303 may be implemented by:

Step S407, performing howling detection on the noise-suppressed voice signal stream to obtain a howling flag of the voice signal stream.

Here, howling detection is used to detect whether or not an obvious howling signal is contained in a voice signal stream. For example, when two call terminals are in the same room and at least one of the terminals is in the handsfree mode, an acoustic loop is easily formed, thereby forming howling. When a howling signal is detected, the howling mark is a first howling mark, and when the howling signal is not detected, the howling mark is a second howling mark.

Step S408, performing VAD detection of a second type of signal on the noise-suppressed voice signal stream to obtain a third VAD value of the voice signal stream.

Here, the second type of signal may be a near-end signal. Like the VAD detection of the microphone recording, the VAD detection of the near-end signal determines whether speech is present in the audio signal based on the fluctuation of the signal energy. The difference is that the near-end VAD detection operates on the noise-suppressed voice signal stream, from which most of the noise has been removed. Therefore, the third VAD value is generally the VAD value indicating speech only when there is voice input at the near end, and is the VAD value indicating no speech when the near end has no voice input.

Step S409, determining at least one of the howling flag and the third VAD value as a parameter status stream.

In some embodiments, when the howling flag of the voice signal stream is determined, howling suppression may be further performed according to the howling flag of the voice signal stream, where the howling suppression includes the following steps:

Step S21, when the howling flag is the first howling flag, determining the distribution of the howling energy of the voice signal over the frequency spectrum.

And step S22, according to the distribution rule, reducing the gain of the frequency band with the howling energy to restrain the howling energy.

In step S23, when the howling flag is the second howling flag, the suppression of the gain of the frequency band is cancelled.
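Steps S21 to S23 can be illustrated with per-band gains. In the sketch below, a band is assumed to carry howling energy when its energy exceeds a multiple of the median band energy; this criterion, the attenuation value, and the function name are assumptions, and the first howling flag is represented as 1.

```python
from statistics import median
from typing import List, Sequence


def update_band_gains(band_energies: Sequence[float],
                      howling_flag: int,
                      peak_ratio: float = 10.0,
                      suppress_gain: float = 0.1) -> List[float]:
    """Return a linear gain per frequency band.

    howling_flag == 1 (howling detected): bands whose energy stands far above
    the median are attenuated (S21, S22). Any other flag value: all gains are
    restored to 1.0, cancelling the suppression (S23).
    """
    n = len(band_energies)
    if howling_flag != 1 or n == 0:
        return [1.0] * n
    reference = median(band_energies)
    return [suppress_gain if energy > peak_ratio * reference else 1.0
            for energy in band_energies]
```

For band energies `[1.0, 1.0, 100.0, 1.0]` with howling detected, only the third band is attenuated; once the flag clears, every band returns to unit gain.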

Based on Fig. 4, Fig. 5 is an optional flow chart of the microphone volume control method provided in the embodiment of the present application. As shown in Fig. 5, in some embodiments, after the first VAD value is determined, the method may further include the following steps:

Step S501, determining at least one noise segment in the voice signal according to the first VAD value.

Step S502, noise energy estimation is carried out on each noise section to obtain a noise energy spectrum of the corresponding noise section.

Step S503, subtracting the noise energy spectrum of each noise segment from the voice signal spectrum corresponding to the voice signal to obtain the denoised voice energy spectrum.

Step S504, the voice energy spectrum after denoising is subjected to time-frequency transformation processing, and a voice signal flow after noise suppression is obtained.

In the embodiment of the present application, when the first VAD value is determined, noise suppression processing may be further performed according to the first VAD value. For example, the noise energy spectrum may be estimated and then subtracted from the microphone signal spectrum, and the remaining speech energy spectrum may be used to reconstruct the denoised speech waveform by time-frequency transformation.
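The noise estimation and subtraction of steps S501 to S503 can be sketched on precomputed energy spectra. The sketch assumes that each frame has already been transformed into a list of per-bin energies and that a first VAD value of 0 marks a noise frame; the function names and the averaging estimator are assumptions, and the time-frequency transformation of step S504 is not shown.

```python
from typing import List, Optional, Sequence


def estimate_noise_spectrum(frame_spectra: Sequence[Sequence[float]],
                            vad_values: Sequence[int]) -> Optional[List[float]]:
    """Average the energy spectra of frames marked as noise (VAD value 0)."""
    noise_frames = [spec for spec, vad in zip(frame_spectra, vad_values) if vad == 0]
    if not noise_frames:
        return None  # no noise segment found yet
    n_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / len(noise_frames)
            for k in range(n_bins)]


def spectral_subtract(signal_spectrum: Sequence[float],
                      noise_spectrum: Sequence[float],
                      floor: float = 0.0) -> List[float]:
    """Subtract the noise energy spectrum bin by bin, never going below `floor`."""
    return [max(s - n, floor) for s, n in zip(signal_spectrum, noise_spectrum)]


spectra = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
vad = [0, 0, 1]                      # first two frames are noise
noise = estimate_noise_spectrum(spectra, vad)
denoised = spectral_subtract(spectra[2], noise)
```

Flooring at zero keeps the denoised energy spectrum non-negative when the noise estimate exceeds the signal energy in a bin.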

Correspondingly, after the noise-suppressed speech signal stream is obtained, the parameter feature extraction is performed on the noise-suppressed speech signal stream to obtain at least two parameter state streams of the speech signal, that is, step S407 is performed.

Based on Fig. 4, Fig. 6 is an optional flow chart of the microphone volume control method according to an embodiment of the present application. As shown in Fig. 6, determining the digital gain adjustment amount of the microphone according to the at least two parameter state streams in step S304 may be implemented by the following steps:

Step S601, acquiring the current digital gain of the microphone.

Step S602, when the third VAD value is the first preset value, counting the root mean square energy of the voice signal in a preset time period.

In step S603, if the root mean square energy is smaller than the first energy threshold, the current digital gain of the microphone is increased by a first preset gain, so as to obtain a digital gain adjustment amount.

In the embodiment of the present application, the digital gain may be increased by the first preset gain multiple times, step by step, until the upper limit of the digital gain or the target value of the digital gain is reached, and the sum of the applied first preset gains is determined as the digital gain adjustment amount.

Further, when determining the digital gain adjustment amount of the microphone, the method may further include the following steps:

in step S604, when the howling flag is the first howling flag, the root mean square energy is stopped from being calculated until the howling flag is the second howling flag.

In step S605, when the howling flag is the second howling flag, the root mean square energy is continuously updated.
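The digital gain logic of steps S601 to S605 might look like the following sketch. All numeric values, the window expressed as a frame count, the class name, and the encoding of the first preset value and the first howling flag as 1 are assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class DigitalGainController:
    gain_db: float = 0.0           # S601: current digital gain
    step_db: float = 1.0           # first preset gain (assumed value)
    max_gain_db: float = 12.0      # upper limit of the digital gain (assumed)
    rms_threshold: float = 1000.0  # first energy threshold (assumed)
    window_frames: int = 50        # preset time period, counted in frames
    _mean_squares: List[float] = field(default_factory=list)

    def process(self, frame: Sequence[float], vad3: int, howling: int) -> float:
        """Return the digital gain adjustment (dB) decided for this frame."""
        if howling == 1:
            return 0.0             # S604: RMS statistics paused during howling
        if vad3 != 1 or not frame:
            return 0.0             # only near-end speech frames are counted
        # S602 / S605: keep updating the statistics over the window.
        self._mean_squares.append(sum(s * s for s in frame) / len(frame))
        if len(self._mean_squares) < self.window_frames:
            return 0.0
        rms = math.sqrt(sum(self._mean_squares) / len(self._mean_squares))
        self._mean_squares.clear()
        # S603: speech too quiet, raise the gain by one step within the limit.
        if rms < self.rms_threshold and self.gain_db < self.max_gain_db:
            step = min(self.step_db, self.max_gain_db - self.gain_db)
            self.gain_db += step
            return step
        return 0.0
```

Calling `process` once per frame accumulates the statistics; the sum of the returned steps over time corresponds to the digital gain adjustment amount.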

With continued reference to fig. 6, in some embodiments, the determining the analog gain adjustment amount of the microphone according to the at least two parameter state streams in step S304 can be implemented by:

Step S606, acquiring the current analog gain of the microphone.

Step S607, when the first VAD value is the second preset value and the pitch frequency is greater than zero, calculating the energy of the smoothed voice segment according to the energy of the voice signal stream.

In step S608, when the energy of the smoothed voice segment is greater than the second energy threshold, the current analog gain of the microphone is decreased by a second preset gain, so as to obtain an analog gain adjustment amount.

In the embodiment of the present application, the analog gain may be decreased by the second preset gain multiple times, step by step, until the lower limit of the analog gain or the target value of the analog gain is reached, and the sum of the applied second preset gains is determined as the analog gain adjustment amount.

Further, when determining the analog gain adjustment amount of the microphone, the method may further include the following steps:

Step S609, when it is determined according to the parameter state stream that an echo exists in the voice signal, prohibiting an increase of the current analog gain of the microphone.
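Steps S606 to S609 can be sketched in the same style. The exponential smoothing, every numeric value, the class and method names, and the encoding of the second preset value as 1 are assumptions; the description only states that a smoothed speech-segment energy is compared with a second energy threshold.

```python
from dataclasses import dataclass


@dataclass
class AnalogGainController:
    gain_db: float = 20.0            # S606: current analog gain
    step_db: float = 2.0             # second preset gain (assumed value)
    min_gain_db: float = 0.0         # lower limit of the analog gain (assumed)
    energy_threshold: float = 1.0e9  # second energy threshold (assumed)
    smoothing: float = 0.9           # smoothing factor (assumed)
    smoothed_energy: float = 0.0

    def process(self, energy: float, vad1: int, pitch_hz: float) -> float:
        """Return the analog gain adjustment (dB, negative = decrease)."""
        # S607: only voiced speech (VAD set and a pitch present) is considered.
        if vad1 != 1 or pitch_hz <= 0.0:
            return 0.0
        self.smoothed_energy = (self.smoothing * self.smoothed_energy
                                + (1.0 - self.smoothing) * energy)
        # S608: speech too loud, lower the gain by one step within the limit.
        if self.smoothed_energy > self.energy_threshold and self.gain_db > self.min_gain_db:
            step = min(self.step_db, self.gain_db - self.min_gain_db)
            self.gain_db -= step
            return -step
        return 0.0

    def request_increase(self, step_db: float, echo_present: bool) -> float:
        """S609: an increase is refused while an echo is present."""
        if echo_present:
            return 0.0
        self.gain_db += step_db
        return step_db
```

Requiring a pitch in addition to the VAD value keeps non-stationary noise, which can also trigger the first VAD value, from lowering the hardware gain.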

In this embodiment of the present application, after the digital gain adjustment amount and the analog gain adjustment amount of the microphone are determined, correspondingly adjusting the analog gain of the microphone according to the analog gain adjustment amount in step S305 may be implemented in either of the following manners:

Manner one:

Step S31, mapping the analog gain adjustment amount to a preset gain interval to obtain a mapping value of the analog gain adjustment amount.

Step S32, adjusting the analog gain of the microphone by using the mapping value.

Manner two:

Step S33, performing two-stage gain mapping on the analog gain adjustment amount to obtain a first-stage gain adjustment amount corresponding to the microphone enhancement and a second-stage gain adjustment amount corresponding to the microphone array.

Step S34, adjusting the microphone enhancement of the microphone according to the first-stage gain adjustment amount.

Step S35, adjusting the microphone array of the microphone according to the second-stage gain adjustment amount.

In the embodiment of the application, when the volume of the microphone is adjusted, the microphone enhancement and the microphone array can be adjusted simultaneously, so that the volume of the microphone is adjusted stably and in an orderly manner by combining the first-stage gain adjustment amount and the second-stage gain adjustment amount, and the adjusted volume of the microphone is smoother and meets the use requirements of users.
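A two-stage mapping of this kind can be illustrated as follows. The description does not specify how the adjustment amount is split, so the sketch assumes a coarse first stage with 10 dB per level and a fine second stage with 1 dB per level; these step sizes and the function name are hypothetical, and clamping to the levels actually offered by the hardware is left out.

```python
from typing import Tuple


def two_stage_map(adjust_db: float,
                  coarse_step_db: float = 10.0,
                  fine_step_db: float = 1.0) -> Tuple[int, int]:
    """Split an analog gain adjustment (dB) into coarse and fine level changes.

    Returns (first_stage_levels, second_stage_levels). The first stage takes
    as many whole coarse steps as fit (truncating toward zero); the second
    stage covers the remainder in fine steps.
    """
    first_stage = int(adjust_db / coarse_step_db)
    remainder_db = adjust_db - first_stage * coarse_step_db
    second_stage = round(remainder_db / fine_step_db)
    return first_stage, second_stage
```

Under these assumptions, an adjustment of +25 dB maps to two coarse levels and five fine levels, and an adjustment of -13 dB maps to one coarse level down and three fine levels down.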

Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.

In a voice communication system, voice signals are collected through microphone hardware, and the collected voice volume varies. If the voice volume is too large, the voice is distorted or the effect of other modules is impaired (for example, if the echo recorded by the microphone is too loud, the resulting nonlinear distortion prevents the echo from being cancelled cleanly); if the voice volume is too small, the recipient cannot hear the speech. Therefore, to keep the recording volume within a reasonable range and adjust it smoothly, it is necessary to consider the combined influence of each voice processing module, accurately distinguish signals such as human voice, background noise, echo, and howling, and make full use of the capabilities provided by the device hardware or the underlying audio interface to achieve a better effect.

In order to solve the above problem, an embodiment of the present application provides a method for controlling the microphone volume, which is an automatic gain control method for a microphone. The method systematically considers the influence of signals such as VAD, echo, and howling, and combines features such as the pitch of the human voice so as to better distinguish noise from the human voice. For the adjustment of the analog gain (hardware volume) and the digital gain (software volume), new logic is designed to coordinate the changes of the analog gain and the digital gain more accurately, so that the volume adjustment is smoother overall. Together these form a complete gain control system that achieves the purpose of optimally controlling the volume of the microphone.

It should be noted that, in an actual application scenario of microphone volume adjustment, the digital gain mainly adjusts the pulse amplitude of the digital-to-analog conversion input. If the digital gain is too small, loss bit errors may increase; if it is too large, severe clipping of the data pulses and amplified noise pulses may degrade the signal-to-noise ratio of the digital signal and increase interference bit errors. The optimum digital gain is at the upper limit of the threshold range of the digital circuit input. The magnitude of the digital gain does not increase or decrease the output audio power; it only affects the operating state of the decoder. The analog gain mainly adjusts the signal strength of the linear amplification input. Within a certain range, the magnitude of the analog gain directly affects the output audio power, and a larger analog gain input helps improve the output signal-to-noise ratio and increases the output power proportionally. However, when the input is too large, the output power increases only gradually while the distortion increases rapidly. The optimum analog gain keeps the peak value of the output voltage within the linear range of the amplifier.

Currently, call applications based on Voice over Internet Protocol (VoIP) are increasingly common, and a call application may run on terminals using different operating systems, such as iOS, Mac, Android, and Windows. Fig. 7 is an application scenario diagram of the method in the embodiment of the present application; the method may be applied to any one of the call devices 701 to 704 in Fig. 7.

Common applications supporting VoIP include WeChat and QQ voice calls. When different call terminals are in a call, the initial volume settings of their microphones may differ, so that some terminals' microphones sound loud and others sound weak. If the hardware volume of the microphone (also called the analog gain of the microphone) is too small, the sound recorded by the microphone is too quiet for the other party to hear, and the required volume cannot be reached even if the digital gain (also called the digital volume or the software volume) is increased; moreover, increasing the digital gain also amplifies noise (a small amount of noise may remain even after the noise suppression module), which reduces listening comfort. If the microphone volume is too large, the maximum value of the audio samples exceeds the numerical representation range (for example, the range of a 16-bit integer is -32768 to 32767), causing truncation distortion (also called saturation distortion). In addition, if the microphone volume is too large, nonlinear distortion (such as truncation distortion and hardware distortion) is severe, which may degrade the performance of echo cancellation.

The automatic gain control provided by the embodiment of the application needs to comprehensively consider various acoustic scenes such as voice, noise, echo, howling and the like so as to avoid mistakenly adjusting the volume.

The embodiment of the application provides a complete automatic gain control system for a microphone, so that the digital gain and the analog gain of the microphone at the sending end are controlled automatically and the volume of the microphone is kept at a reasonable level. Fig. 8 is an architecture diagram of a microphone automatic gain control method according to an embodiment of the present application. As shown in Fig. 8, an audio acquisition hardware interface 801 acquires the voice signal of a speaker. At the same time, there may be background noise from the surrounding environment (for example, an office air conditioner or the sound of car engines at a roadside). If a call is in progress with another person, there may also be an echo signal (for example, sound played by the loudspeaker in hands-free mode re-entering the microphone) and a howling signal (for example, in a multi-terminal call where terminals A and B are in the same conference room, a person speaks into the microphone of terminal A, and the sound is played from the loudspeaker of terminal B and picked up again by the microphone of terminal A). The speech is fed into the speech processing system in frames, for example, one frame every 20 milliseconds, and the speech processing system processes each frame of the voice signal in turn. The gain control module 802 is expected to accurately detect the volume of the voice and automatically adjust the analog gain (i.e., the hardware volume) and the digital gain (i.e., the software volume) accordingly.

With continued reference to fig. 8, audio acquisition hardware interface 801 provides a program interface for software to acquire and set microphone related hardware parameters. The dc-removing filtering module 803 removes the dc component in the audio signal collected by the microphone through a preset filter, so as to avoid interfering with the detection and subsequent processing of the voice. The echo cancellation module 804 cancels echo signals that may exist in the microphone signal to prevent the far-end speaker from hearing his own voice. The noise suppression module 805 suppresses the background noise in the microphone signal, and improves the hearing comfort of the speaker. The howling suppression module 806 suppresses the howling that may exist in the microphone signal (for example, when two call terminals are close to each other and hands-free, an acoustic loop may occur between the two call terminals, thereby forming an acoustic feedback or howling). The gain control module 802 adjusts the amplitude of the audio signal, i.e. digital volume or digital gain, according to the acoustic parameters, calculates the gain of the microphone hardware, and obtains or sets the hardware volume (i.e. analog gain) through the audio acquisition hardware interface 801. Finally, the encoding sending module 807 compresses and encodes the audio signal and sends the compressed and encoded audio signal to the receiving party through the network for decoding and playing.

The solid line in fig. 8 represents the speech signal flow and the dashed line the parameter state flow. The microphone hardware status pool 808 in fig. 8 is used to store status information from the hardware, including the current gain value of the hardware microphone, whether two-stage gain control is supported, how many stages there are in each stage of gain, how many dB of gain each stage represents, and so on. The speech feature pool 809 in FIG. 8 is used to compute and store speech features.

Fig. 9 is an architecture diagram of the voice feature pool provided in an embodiment of the present application. Referring to figs. 8 and 9:

The speech signal stream (1) in the speech feature pool is the speech signal after DC-removal filtering and is used for truncation detection, calculation of the time-domain energy envelope, and VAD detection of the microphone recording signal.

The truncation detection module 901 is configured to detect whether the amplitude of the microphone recording signal exceeds the representation range of a 16-bit integer and therefore causes truncation (clipping) distortion. Such a situation generally indicates that the gain of the microphone hardware is too large and that the hardware volume setting needs to be reduced as soon as possible. A simple truncation detection method is as follows: since the representation range of a 16-bit integer is -32768 to 32767, if the absolute value of the signal amplitude of N speech samples in a frame of input speech is greater than a certain threshold (for example, N = 10 and a threshold of 32760), the speech frame is marked as truncated and the truncation flag Satuate is set to 1; otherwise the truncation flag Satuate is set to 0.
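The truncation check described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `detect_truncation` and its parameter names are hypothetical, and the rule "at least N samples above the threshold" is one reading of the description.

```python
def detect_truncation(frame, min_count=10, threshold=32760):
    """Return the truncation flag (called Satuate in the text) for one frame.

    frame      -- iterable of 16-bit integer samples
    min_count  -- N, the number of near-full-scale samples that marks clipping
    threshold  -- absolute amplitude above which a sample counts as clipped
    Both defaults are the example values given in the description.
    """
    clipped = sum(1 for sample in frame if abs(sample) > threshold)
    return 1 if clipped >= min_count else 0


# A 20 ms frame at an assumed 16 kHz sample rate has 320 samples.
clipped_frame = [32767] * 10 + [0] * 310
clean_frame = [32767] * 9 + [0] * 311
print(detect_truncation(clipped_frame), detect_truncation(clean_frame))
```

Using a threshold slightly below full scale lets the check catch samples that were clipped and then slightly attenuated by DC-removal filtering.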

The time-domain energy envelope calculation module 902 divides the speech frame into smaller speech segments, e.g., 5-millisecond segments, and calculates the sum of squares (i.e., the energy) E(t) of the speech samples in each segment.
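As a small sketch of the segment energy calculation, the function below splits a frame into 5 ms segments and returns one energy value E(t) per segment. The function name and the 16 kHz sample rate are assumptions made for illustration; the description does not fix a sample rate.

```python
def energy_envelope(frame, sample_rate=16000, segment_ms=5):
    """Return the list of segment energies E(t) for one frame.

    Each energy is the sum of squared samples in one segment.
    Trailing samples that do not fill a whole segment are ignored.
    """
    seg_len = sample_rate * segment_ms // 1000
    return [
        sum(s * s for s in frame[start:start + seg_len])
        for start in range(0, len(frame) - seg_len + 1, seg_len)
    ]


# A 20 ms frame at 16 kHz yields four 5 ms segments of 80 samples each.
print(energy_envelope([2] * 80 + [0] * 240))
```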

The VAD detection module 903 of the microphone recording signal performs Voice Activity Detection (VAD) on the signal collected by the microphone, generally based on the fluctuation of energy in the audio signal. When the energy is large and fluctuates strongly, voice is considered to be present and VAD1 is set to 1 (where VAD1 corresponds to the first VAD value mentioned above); otherwise voice is considered to be absent and VAD1 is set to 0. Note that VAD1 may be set to 1 not only for speech segments: if the background noise is not stationary, such as a nearby keyboard sound or the sound of a door opening or closing, VAD1 is easily detected as 1. In addition, if there is a strong echo signal, such as the voice of the far-end speaker played through the loudspeaker in hands-free mode and entering the microphone again, VAD1 is also detected as 1. Including the energy envelope of the echo signal in this way means that the volume is also reduced when the echo is too large, which is the desired behavior.

The speech signal stream (2) in the speech feature pool is the microphone signal after echo cancellation. The pitch extraction module 904 uses this signal for pitch detection. When a person speaks, the vocal cords vibrate and generate harmonics; the frequency of the vocal cord vibration is the pitch (fundamental) frequency. Many noises have no pitch frequency or only a weak one, such as the transient noises of opening and closing a door or tapping on a desk or keyboard. Pitch detection can therefore eliminate the interference of most noises with VAD detection, so that speech segments can be located more accurately. The human pitch frequency generally lies between 50 and 500 Hz, and for most people it is around 100 to 200 Hz. If no pitch is detected, the pitch frequency pitch is set to 0.
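The description does not say which pitch detection algorithm the pitch extraction module uses. As an illustration only, the sketch below uses a textbook autocorrelation search restricted to the 50 to 500 Hz range mentioned above, and returns 0 when no clear periodicity is found. The function name, the 8 kHz sample rate, and the voicing threshold of 0.5 are all assumptions.

```python
import math


def estimate_pitch(frame, sample_rate=8000, f_min=50.0, f_max=500.0,
                   voicing_threshold=0.5):
    """Return an estimated pitch frequency in Hz, or 0.0 if none is found."""
    r0 = sum(s * s for s in frame)  # autocorrelation at lag 0 (frame energy)
    if r0 == 0:
        return 0.0
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    # Treat the frame as unvoiced when the peak is weak relative to the energy.
    if best_lag == 0 or best_r / r0 < voicing_threshold:
        return 0.0
    return sample_rate / best_lag


# A 40 ms, 200 Hz test tone at 8 kHz has a period of exactly 40 samples.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
print(estimate_pitch(tone))
```

A single click or a silent frame produces no strong autocorrelation peak in the search range, so the function returns 0, matching the convention that pitch is 0 when no pitch is detected.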

The stream of speech signals (3) in the pool of speech features is the microphone signal after noise suppression and is used for howling detection and VAD detection of the near-end signal.

The howling detection module 905 detects whether the audio signal contains an obvious howling signal. For example, when two call terminals are in the same room and at least one of them is in hands-free mode, an acoustic loop is easily formed, which produces howling. The howling signal may cause the speech energy to be estimated incorrectly, thereby adversely affecting the automatic gain control. The howling detection module may be a neural-network-based howling detector, such as a pre-trained Recurrent Neural Network (RNN) howling detector. When howling is detected, the howling flag howlFlag is set to 1; otherwise howlFlag is set to 0.

The VAD detection module 906 of the near-end signal, like the VAD detection module 903 of the microphone recording signal, detects whether there is a voice signal in the audio signal based on the fluctuation of the signal energy. The difference is that echo and most of the noise have already been removed from the speech signal stream (3), so a VAD flag VAD2 of 1 (where VAD2 corresponds to the third VAD value mentioned above) generally indicates only that there is speech at the near end; when there is no speech at the near end and only echo is present, VAD2 is 0. Since noise is difficult to remove completely, VAD2 is still prone to being set to 1 by non-stationary noise, such as keyboard sounds.

The parameter state stream (4) in the speech feature pool comes from the echo cancellation module, which calculates the echo state 907 and the VAD 908 of the far-end signal as part of a normal echo cancellation algorithm. When the far-end person speaks, the far-end voice signal is present in the reference signal that the echo cancellation module uses for the echo cancellation algorithm. By detecting the energy fluctuation of the far-end signal, it can be determined whether the far-end signal contains voice: if it does, VAD3 is set to 1 (where VAD3 corresponds to the second VAD value mentioned above); otherwise VAD3 is set to 0. If the far-end voice signal is played through the local loudspeaker and picked up again by the local microphone, an echo signal is formed. The echo cancellation module cancels the echo and detects whether echo is currently present: if so, the echo state is set to 1; otherwise the echo state is set to 0.

On the right side of fig. 9, the parameter state flow (5) is VAD1 of the microphone recording, which can be used in a noise suppression algorithm, for example, in a noise segment, a noise energy spectrum can be estimated, and then the noise energy spectrum is subtracted from the microphone signal spectrum, and the remaining speech energy spectrum can reconstruct a denoised speech waveform through time-frequency transformation.
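The noise estimation and subtraction step can be sketched on per-band power values as follows. This is a simplified illustration that assumes the power spectrum of each frame has already been computed; the function names, the recursive averaging with factor `alpha`, and the flooring at zero are assumptions, not details given in the description.

```python
def update_noise_spectrum(noise_power, frame_power, vad1, alpha=0.9):
    """Update the noise power estimate only in noise segments (VAD1 == 0)."""
    if vad1 == 1:
        return list(noise_power)  # speech present: keep the old estimate
    return [alpha * n + (1 - alpha) * p
            for n, p in zip(noise_power, frame_power)]


def subtract_noise(frame_power, noise_power, floor=0.0):
    """Subtract the noise power from the frame power, band by band."""
    return [max(p - n, floor) for p, n in zip(frame_power, noise_power)]


noise = update_noise_spectrum([2.0, 2.0], [4.0, 0.0], vad1=0, alpha=0.5)
print(noise, subtract_noise([4.0, 1.0], [1.0, 2.0]))
```

The remaining power spectrum would then be combined with the original phase and transformed back to the time domain, which is outside the scope of this sketch.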

On the right side of fig. 9, the parameter state stream (6) is a howlFlag for howling detection, and when howlFlag is 1, it indicates that howling is present at this time, the howling suppression module may perform howling suppression using this information, for example, to detect the distribution of howling energy on the spectrum and reduce howling by reducing the gain of the frequency band having howling energy, that is, by suppressing howling energy. When howlFlag is equal to 0, the suppression of the band is gradually canceled to return the band gain to normal.
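The attack and release behavior of the band gains described above might look like the following sketch. It assumes that a separate analysis step has already identified which bands carry howling energy; the function name, the linear gain scale, the step size, and the minimum gain are hypothetical choices.

```python
def update_band_gains(band_gains, howl_bands, howl_flag,
                      step=0.1, min_gain=0.1):
    """Return new per-band linear gains for one frame.

    band_gains -- current gains, 1.0 meaning no suppression
    howl_bands -- set of band indices found to carry howling energy
    howl_flag  -- howlFlag from the speech feature pool (1 or 0)
    """
    new_gains = []
    for index, gain in enumerate(band_gains):
        if howl_flag == 1 and index in howl_bands:
            gain = max(min_gain, gain - step)  # suppress the howling band
        else:
            gain = min(1.0, gain + step)       # gradually return to normal
        new_gains.append(gain)
    return new_gains


gains = update_band_gains([1.0, 1.0], {1}, 1, step=0.25, min_gain=0.25)
print(gains, update_band_gains(gains, {1}, 0, step=0.25))
```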

On the right side of fig. 9, the parameter state stream (7) includes all the calculated features in the pool of speech features mentioned above, which can be used for automatic gain control. The automatic gain control module is described in detail below.

Fig. 10 is an architecture diagram of a gain control module according to an embodiment of the present application. As shown in fig. 10, the digital gain control module 1001 may use a conventional digital gain control algorithm. When the VAD flag of the near-end signal in the speech feature pool indicates speech, that is, VAD2 is equal to 1, the module measures the root mean square energy (RMS level) of the speech signal over a recent period of time, for example, 0.5 seconds. If the root mean square energy is smaller than the target volume, the digital gain is gradually increased until a preset upper limit of the digital gain is reached, for example, an amplification of no more than 30 dB. If the root mean square energy is greater than the target volume, the digital gain is gradually decreased until a preset lower limit of the digital gain is reached, for example, -10 dB.

While the digital gain control module 1001 is increasing or decreasing the digital gain, if there is a howling signal, that is, howlFlag in the speech feature pool is equal to 1, the calculation of the root mean square energy of the speech signal is suspended; updating of the root mean square energy resumes only after the howling signal disappears.

While the digital gain control module 1001 is increasing or decreasing the digital gain, if there is a far-end speech signal, i.e., VAD3 in the speech feature pool is 1, the digital gain is increased or decreased at a slower rate, so that an erroneous speech RMS energy estimate caused by incompletely removed echo does not adversely affect the gain.

The digital gain control module 1001 outputs the volume-adjusted digital audio stream.
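The digital gain rules above can be summarized in a short sketch. It is an illustration under several assumptions: samples are floats normalized to [-1, 1], the level is measured in dB relative to full scale, the 0.5 s window is modeled as the last 25 frames of 20 ms, and the class name, target level, and step sizes are hypothetical. Only the 30 dB and -10 dB limits come from the examples in the description.

```python
import math
from collections import deque


class DigitalGainControl:
    def __init__(self, target_db=-20.0, step_db=0.5, slow_step_db=0.1,
                 max_gain_db=30.0, min_gain_db=-10.0, window_frames=25):
        self.target_db = target_db
        self.step_db = step_db
        self.slow_step_db = slow_step_db
        self.max_gain_db = max_gain_db
        self.min_gain_db = min_gain_db
        self.gain_db = 0.0
        self.history = deque(maxlen=window_frames)  # mean-square per frame

    def rms_db(self):
        """RMS level over the window in dB, or None if unavailable."""
        if not self.history:
            return None
        mean_sq = sum(self.history) / len(self.history)
        return 10.0 * math.log10(mean_sq) if mean_sq > 0 else None

    def update(self, frame, vad2, vad3, howl_flag):
        """Process one frame and return the digital gain in dB."""
        if vad2 != 1:
            return self.gain_db            # no near-end speech: hold the gain
        if howl_flag == 0 and frame:
            # The RMS estimate is frozen while howling is present.
            self.history.append(sum(s * s for s in frame) / len(frame))
        level = self.rms_db()
        if level is None:
            return self.gain_db
        # Far-end speech (VAD3 == 1): adjust more slowly.
        step = self.slow_step_db if vad3 == 1 else self.step_db
        if level < self.target_db:
            self.gain_db = min(self.max_gain_db, self.gain_db + step)
        elif level > self.target_db:
            self.gain_db = max(self.min_gain_db, self.gain_db - step)
        return self.gain_db


agc = DigitalGainControl()
print(agc.update([0.01] * 320, vad2=1, vad3=0, howl_flag=0))
```

The returned gain in dB would be converted to a linear factor, 10 ** (gain_db / 20), and applied to the samples to produce the volume-adjusted stream.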

Referring to fig. 10, the analog gain calculating module 1002 calculates the target analog gain A according to the information provided by the parameter state stream (7). A specific implementation may include the following:

When the microphone recording VAD in the speech feature pool detects speech and the pitch extraction module detects pitch, i.e., VAD1 = 1 and pitch > 0, a smoothed speech segment energy is calculated from the energy E(t) of the 5 ms speech segment calculated earlier in the speech feature pool, according to the following formula (1-1):

E_m(t) = p * E_m(t-1) + (1 - p) * E(t)    (1-1)

where E_m(t) represents the smoothed speech segment energy; E_m(t-1) represents the smoothed speech segment energy obtained for the previous speech segment; and p is a decimal less than 1, such as 0.95. When E_m(t) is greater than a preset upper energy limit, the target analog gain A is gradually decreased; when E_m(t) is less than a preset lower energy limit, the target analog gain A is gradually increased.

When the truncation detection module in the speech feature pool detects truncation, that is, Satuate = 1, the target analog gain A is reduced frame by frame at a faster rate. For example, under normal conditions the target analog gain A at time t is reduced according to the following formula (1-2):

A(t) = A(t-1) - 1    (1-2)

where A(t-1) represents the analog gain of the previous frame.

When truncation is present, the following formula (1-3) can be used instead:

A(t) = A(t-1) - 2    (1-3)

When the echo state in the speech feature pool is 1, residual echo may cause the speech energy to be underestimated and the analog gain to be increased by mistake. To avoid this, the analog gain is only allowed to decrease and is not allowed to increase further.

When the howling detection state in the speech feature pool is 1, that is, howlFlag = 1, the updating of E_m(t) is stopped, so that howling does not cause the speech energy to be estimated incorrectly.
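The rules for the target analog gain, formulas (1-1) to (1-3) together with the echo and howling conditions, can be combined into one per-frame update, sketched below. The class name, the energy limits, the gain range of 0 to 300, and the increase step of +1 are assumptions; the description only says that the gain is "gradually" increased. The parameter `saturate` stands for the flag called Satuate in the text.

```python
class AnalogGainCalculator:
    def __init__(self, gain=150, p=0.95, energy_upper=1.0e8,
                 energy_lower=1.0e6, min_gain=0, max_gain=300):
        self.gain = gain                  # target analog gain A
        self.p = p                        # smoothing factor of formula (1-1)
        self.energy_upper = energy_upper  # hypothetical upper energy limit
        self.energy_lower = energy_lower  # hypothetical lower energy limit
        self.min_gain = min_gain
        self.max_gain = max_gain
        self.smoothed_energy = 0.0        # E_m(t)

    def update(self, segment_energies, vad1, pitch, saturate, echo, howl_flag):
        """Process one frame and return the new target analog gain A(t)."""
        voiced = vad1 == 1 and pitch > 0
        if voiced and howl_flag == 0:     # E_m(t) is frozen during howling
            for energy in segment_energies:
                # Formula (1-1)
                self.smoothed_energy = (self.p * self.smoothed_energy
                                        + (1 - self.p) * energy)
        if saturate == 1:
            self.gain -= 2                # formula (1-3): fast decrease
        elif voiced:
            if self.smoothed_energy > self.energy_upper:
                self.gain -= 1            # formula (1-2): normal decrease
            elif self.smoothed_energy < self.energy_lower and echo == 0:
                self.gain += 1            # assumed step; blocked during echo
        self.gain = max(self.min_gain, min(self.max_gain, self.gain))
        return self.gain


calc = AnalogGainCalculator(gain=150, energy_upper=100.0, energy_lower=10.0)
print(calc.update([1000.0] * 4, vad1=1, pitch=150.0,
                  saturate=0, echo=0, howl_flag=0))
```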

The analog gain conversion module 1003 in fig. 10 converts the calculated target analog gain A into the hardware analog volume value V. For example, if the target analog gain A calculated by the analog gain calculation module 1002 is an integer between 0 and 300, and the value V accepted by the actual hardware gain control interface is an integer between 0 and 100, one approach is to map the values 0 to 256 of A uniformly onto 0 to 100 (256 is used here as the preset value, but other values may be used). That is, 256 is set as the upper limit of A, and the part exceeding 256 is ignored because the digital gain can compensate for that part of the volume. Finally, V = 100 * A / 256.
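A minimal sketch of this mapping, V = 100 * A / 256 with A capped at 256, is shown below. The function name is hypothetical, and rounding to the nearest integer is an assumption, since the description does not say how fractional values are handled.

```python
def analog_gain_to_hardware_volume(a, a_max=256, v_max=100):
    """Map the target analog gain A onto the hardware volume range 0..v_max."""
    a = max(0, min(a, a_max))  # values above a_max are left to the digital gain
    return round(v_max * a / a_max)


print(analog_gain_to_hardware_volume(128), analog_gain_to_hardware_volume(300))
```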

Another implementation of the analog gain conversion module 1003 is as follows. Some communication terminals have a two-stage gain control mechanism, as in the two-stage volume control diagram of fig. 11, where the microphone enhancement 1101 (i.e., the first-stage volume control) and the microphone array 1102 (i.e., the second-stage volume control) are present in the Windows system. The two-stage volume control of fig. 11 can be abstracted into the two-stage gain control of fig. 12, where the first-stage gain 1201 is divided into three steps (in practice there may be more), each representing a gain of 12 dB (10 dB or other values are also possible), for a total of 36 dB. The second-stage gain 1202 is divided into 100 steps, each representing 0.6 dB if a total gain of 60 dB is provided.

To increase the volume control range, both gains in fig. 12 may be controlled. For example, while the first-stage gain 1201 is still at a lower step, if the volume is still not large enough after the second-stage gain 1202 has been increased to position 100, the first-stage gain 1201 may be increased by one step. To make the volume transition smoothly, the second-stage gain 1202 needs to be decreased to a switching point at the same time as the first-stage gain 1201 is increased by one step. In the above example, each step of the first-stage gain 1201 represents 12 dB and each step of the second-stage gain 1202 represents 0.6 dB, so the switching point S of the second-stage gain 1202 is at S = 100 - 12 dB / 0.6 dB = 80. A smooth gain adjustment can thus be achieved by the following logic:

After the above logic processing, the final second-stage analog gain lies in the range 0 to 256 and still needs to be converted into a number between 0 and 100 that the hardware interface accepts, that is, V = 100 * A / 256.
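The switching logic itself is not reproduced in this text. Purely as an illustration of the idea of a switching point, and not as the logic of the embodiment, the following sketch works directly in hardware steps (second stage 0 to 100 at 0.6 dB per step, first stage 0 to 3 at 12 dB per step) and keeps the total gain unchanged at the moment the first stage switches. All names are hypothetical, and the downward switch is an assumption mirrored from the upward case.

```python
STAGE1_STEP = 120   # first-stage step size in tenths of a dB (12 dB)
STAGE2_STEP = 6     # second-stage step size in tenths of a dB (0.6 dB)
STAGE1_MAX = 3      # assumed: position 0 means no first-stage gain
STAGE2_MAX = 100
SHIFT = STAGE1_STEP // STAGE2_STEP   # 20 second-stage steps equal 12 dB
SWITCH_UP = STAGE2_MAX - SHIFT       # switching point S = 80


def step_up(stage1, stage2):
    """Raise the analog gain by one second-stage step where possible."""
    if stage2 < STAGE2_MAX:
        return stage1, stage2 + 1
    if stage1 < STAGE1_MAX:
        return stage1 + 1, SWITCH_UP  # same total gain, with room to go up
    return stage1, stage2             # already at the maximum


def step_down(stage1, stage2):
    """Lower the analog gain by one second-stage step where possible."""
    if stage2 > 0:
        return stage1, stage2 - 1
    if stage1 > 0:
        return stage1 - 1, SHIFT      # same total gain, with room to go down
    return stage1, stage2             # already at the minimum


def total_gain_db(stage1, stage2):
    return (stage1 * STAGE1_STEP + stage2 * STAGE2_STEP) / 10


print(step_up(0, 100), total_gain_db(0, 100), total_gain_db(1, 80))
```

Because 12 dB corresponds to exactly 20 second-stage steps, moving from (0, 100) to (1, 80) leaves the total at 60 dB, so the listener hears no jump at the switch.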

When the first-stage gain needs to be adjusted or the amplitude of the second-stage gain adjustment exceeds a certain amount, for example, 2dB, the gain control module 802 in fig. 8 sends a new analog gain value, i.e., a parameter state stream (8), to the hardware through the audio acquisition hardware interface, thereby adjusting the analog gain. The new hardware configuration, i.e. the parameter state stream (9), is then sent to the microphone hardware state pool for storage, for reference by the gain control algorithm.

The microphone volume control method of the embodiments of the application systematically considers the influence of various signals such as VAD, echo and howling, and incorporates features such as the pitch of the human voice, so as to better distinguish noise from the human voice. For the adjustment of the analog gain (hardware volume) and the digital gain (software volume), new logic is designed to coordinate the changes of the two gains more accurately, so that the overall volume adjustment is smoother. The result is a complete gain control system that achieves the purpose of optimally controlling the volume of the microphone.

The following continues the description of an exemplary structure in which the microphone volume control device 354 provided in the embodiments of the present application is implemented as software modules. In some embodiments, as shown in fig. 2, the software modules of the microphone volume control device 354 stored in the memory 350 may form a microphone volume control device in the server 300, and the device includes:

a voice detection module 3540, configured to perform voice detection on a voice signal acquired by the microphone, so as to obtain at least two environment indexes corresponding to the voice signal; a preprocessing module 3541, configured to perform different types of voice signal preprocessing on the voice signals respectively by using a voice signal processing manner corresponding to each environment index, so as to obtain at least two voice signal streams correspondingly; a parameter feature extraction module 3542, configured to perform parameter feature extraction on each voice signal stream to obtain at least two parameter state streams of the voice signal; a determining module 3543, configured to determine a digital gain adjustment amount and an analog gain adjustment amount of the microphone according to the at least two parameter state streams, respectively; an adjusting module 3544, configured to correspondingly adjust the digital gain and the analog gain of the microphone according to the digital gain adjustment amount and the analog gain adjustment amount, so as to implement volume control on the microphone.

In some embodiments, the speech signal processing means comprises a DC-removal filtering process; the voice signal stream comprises a voice signal stream subjected to direct current filtering; the parameter feature extraction module is further configured to: perform truncation detection on the voice signal stream subjected to direct current filtering to obtain a truncation mark of the voice signal stream; perform time domain energy envelope calculation on the voice signal stream subjected to direct current filtering to obtain the energy of the voice signal stream; perform VAD detection of microphone recording on the voice signal stream subjected to direct current filtering to obtain a first VAD value of the voice signal stream; and determine at least one of the truncation mark, the energy, and the first VAD value as the parameter state stream.

In some embodiments, the apparatus further comprises: a noise segment determining module, configured to determine at least one noise segment in the speech signal according to the first VAD value, and to estimate the noise energy of each noise segment to obtain a noise energy spectrum of the corresponding noise segment; a deleting module, configured to subtract the noise energy spectrum of each noise segment from the speech signal spectrum corresponding to the speech signal, to obtain a denoised speech energy spectrum; and a time-frequency transformation module, configured to perform time-frequency transformation processing on the denoised speech energy spectrum to obtain a noise-suppressed voice signal stream. Correspondingly, the parameter feature extraction module is further configured to perform parameter feature extraction on the noise-suppressed voice signal stream to obtain at least two parameter state streams of the voice signal.

In some embodiments, the speech signal processing means further comprises echo cancellation processing; the voice signal stream comprises a voice signal stream after echo cancellation; the parameter feature extraction module is further configured to: extracting fundamental tone of the voice signal flow after the echo cancellation to obtain fundamental tone frequency of the voice signal flow; and determining the pitch frequency as the parameter state stream.

In some embodiments, in performing the echo cancellation processing on the speech signal, the apparatus comprises: the energy fluctuation determining module is used for determining the energy fluctuation of the first type of signal when the voice signal is detected to have the first type of signal; a second VAD value determination module for determining a second VAD value of the voice signal stream according to the energy fluctuation; an echo state flag determining module, configured to determine an echo state flag of the voice signal according to the echo signal when the first type of signal forms an echo signal in the voice signal; a parameter status flow determination module to determine at least one of the second VAD value and the echo status flag as the parameter status flow.

In some embodiments, the speech signal processing means further comprises noise suppression processing; the voice signal stream comprises a noise-suppressed voice signal stream; the parameter feature extraction module is further configured to: performing howling detection on the voice signal stream after the noise suppression to obtain a howling mark of the voice signal stream; performing VAD detection of a second type of signal on the voice signal flow after the noise suppression to obtain a third VAD value of the voice signal flow; determining at least one of the howling flag and the third VAD value as the parameter status stream.

In some embodiments, the apparatus further comprises: a distribution rule determining module, configured to determine a distribution rule of howling energy of the voice signal on a frequency spectrum when the howling flag is a first howling flag; a howling energy suppression module, configured to reduce, according to the distribution rule, a gain of a frequency band having the howling energy, so as to suppress the howling energy; and the cancellation module is used for canceling the suppression of the gain of the frequency band when the howling mark is the second howling mark.

In some embodiments, the determining module is further to: obtaining a current digital gain of the microphone; when the third VAD value is a first preset value, counting the root mean square energy of the voice signal in a preset time period; and if the root-mean-square energy is smaller than a first energy threshold, increasing the current digital gain of the microphone by a first preset gain to obtain the digital gain adjustment quantity.

In some embodiments, the determining module is further to: when the howling mark is the first howling mark, stopping calculating the root-mean-square energy until the howling mark is the second howling mark; and when the howling mark is the second howling mark, continuously updating the root-mean-square energy.

In some embodiments, the determining module is further to: acquiring the current analog gain of the microphone; when the first VAD value is a second preset value and the pitch frequency is greater than zero, calculating the energy of the smoothed voice section according to the energy of the voice signal flow; and when the energy of the smoothed voice segment is greater than a second energy threshold value, reducing the current analog gain of the microphone by a second preset gain to obtain the analog gain adjustment quantity.

In some embodiments, the adjustment module is further to: mapping the analog gain adjustment quantity to a preset gain interval to obtain a mapping value of the analog gain adjustment quantity; and adjusting the analog gain of the microphone by adopting the mapping value.

In some embodiments, the adjustment module is further configured to: perform two-stage gain mapping on the analog gain adjustment amount to obtain a first-stage gain adjustment amount corresponding to microphone enhancement and a second-stage gain adjustment amount corresponding to a microphone array; adjust the microphone enhancement of the microphone according to the first-stage gain adjustment amount; and adjust the microphone array of the microphone according to the second-stage gain adjustment amount.

It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiment, and has similar beneficial effects to the method embodiment, and therefore, the description is not repeated. For technical details not disclosed in the embodiments of the apparatus, reference is made to the description of the embodiments of the method of the present application for understanding.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method of the embodiment of the present application.

Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform a method provided by embodiments of the present application, for example, the method as illustrated in fig. 3.

In some embodiments, the storage medium may be a computer-readable storage medium, such as a Ferroelectric Random Access Memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read Only Memory (CD-ROM); or may be various devices including one or any combination of the above memories.

In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). By way of example, executable instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.

The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims

1. A microphone volume control method, the method comprising:

performing parameter feature extraction on each voice signal stream to obtain at least two parameter state streams of the voice signals;

2. The method of claim 1, wherein the speech signal processing means comprises a de-dc filtering process; the voice signal stream comprises a voice signal stream subjected to direct current filtering;

the extracting parameter characteristics of each voice signal flow to obtain at least two parameter state flows of the voice signal comprises:

performing truncation detection on the voice signal stream subjected to direct current filtering to obtain a truncation mark of the voice signal stream;

performing time domain energy envelope calculation on the voice signal stream subjected to direct current filtering to obtain the energy of the voice signal stream;

performing VAD detection of microphone recording on the voice signal flow subjected to direct current filtering to obtain a first VAD value of the voice signal flow;

determining at least one of the truncation mark, the energy, and the first VAD value as the parameter state stream.

3. The method of claim 2, further comprising:

determining at least one noise segment in the speech signal according to the first VAD value;

estimating noise energy of each noise section to obtain a noise energy spectrum of the corresponding noise section;

subtracting the noise energy spectrum of each noise section from the voice signal spectrum corresponding to the voice signal to obtain a denoised voice energy spectrum;

performing time-frequency transformation processing on the denoised voice energy spectrum to obtain a voice signal flow with noise suppressed;

correspondingly, parameter feature extraction is carried out on the voice signal flow after the noise suppression, and at least two parameter state flows of the voice signal are obtained.

4. The method of claim 2, wherein the speech signal processing means further comprises echo cancellation processing; the voice signal stream comprises a voice signal stream after echo cancellation;

the extracting of the parameter characteristics of each voice signal flow to obtain at least two parameter state flows of the voice signals further comprises:

extracting fundamental tone of the voice signal flow after the echo cancellation to obtain fundamental tone frequency of the voice signal flow;

and determining the pitch frequency as the parameter state stream.

5. The method of claim 4, wherein in performing the echo cancellation processing on the speech signal, the method further comprises:

when a first type of signal is detected in the voice signals, determining energy fluctuation of the first type of signal;

determining a second VAD value of the voice signal stream according to the energy fluctuation;

when the first type of signal forms an echo signal in the voice signal, determining an echo state mark of the voice signal according to the echo signal;

determining at least one of the second VAD value and the echo status flag as the parameter status stream.

6. The method of claim 4, wherein the speech signal processing means further comprises a noise suppression process; the voice signal stream comprises a noise-suppressed voice signal stream;

performing howling detection on the voice signal stream after the noise suppression to obtain a howling mark of the voice signal stream;

performing VAD detection of a second type of signal on the voice signal flow after the noise suppression to obtain a third VAD value of the voice signal flow;

determining at least one of the howling flag and the third VAD value as the parameter status stream.

7. The method of claim 6, further comprising:

when the howling mark is a first howling mark, determining the distribution rule of the howling energy of the voice signal on a frequency spectrum;

according to the distribution rule, reducing the gain of the frequency band with the howling energy so as to restrain the howling energy;

and when the howling mark is a second howling mark, canceling the suppression of the gain of the frequency band.

8. The method of claim 6, wherein determining the amount of digital gain adjustment for the microphone based on the at least two streams of parameter states comprises:

obtaining a current digital gain of the microphone;

when the third VAD value is a first preset value, counting the root mean square energy of the voice signal in a preset time period;

and if the root-mean-square energy is smaller than a first energy threshold, increasing the current digital gain of the microphone by a first preset gain to obtain the digital gain adjustment quantity.

9. The method of claim 8, wherein determining the amount of digital gain adjustment for the microphone based on the at least two streams of parameter states, further comprises:

when the howling mark is the first howling mark, stopping calculating the root-mean-square energy until the howling mark is the second howling mark;

and when the howling mark is the second howling mark, continuously updating the root-mean-square energy.

10. The method of claim 6, wherein determining an analog gain adjustment for the microphone based on the at least two streams of parameter states comprises:

acquiring the current analog gain of the microphone;

when the first VAD value is a second preset value and the pitch frequency is greater than zero, calculating the energy of the smoothed voice section according to the energy of the voice signal flow;

and when the energy of the smoothed voice segment is greater than a second energy threshold value, reducing the current analog gain of the microphone by a second preset gain to obtain the analog gain adjustment quantity.

11. The method according to any one of claims 1 to 10, wherein correspondingly adjusting the analog gain of the microphone according to the analog gain adjustment amount comprises:

mapping the analog gain adjustment quantity to a preset gain interval to obtain a mapping value of the analog gain adjustment quantity;

and adjusting the analog gain of the microphone by adopting the mapping value.

12. The method according to any one of claims 1 to 10, wherein correspondingly adjusting the analog gain of the microphone according to the analog gain adjustment amount comprises:

carrying out two-stage gain mapping on the analog gain adjustment quantity to obtain a first-stage gain adjustment quantity corresponding to microphone enhancement and a second-stage gain adjustment quantity corresponding to a microphone array;

adjusting the microphone enhancement of the microphone according to the first-stage gain adjustment amount;

and adjusting a microphone array of the microphone according to the second-stage gain adjustment amount.

13. A microphone volume control device, the device comprising:

the preprocessing module is used for respectively preprocessing the voice signals by adopting a voice signal processing mode corresponding to each environment index to correspondingly obtain at least two voice signal streams;

14. A microphone volume control device, comprising:

a memory for storing executable instructions; a processor for implementing the microphone volume control method of any one of claims 1 to 12 when executing executable instructions stored in the memory.

15. A computer-readable storage medium having stored thereon executable instructions for causing a processor to implement the microphone volume control method of any one of claims 1 to 12 when the executable instructions are executed.