CN114974279A - Sound quality control method, device, equipment and storage medium


Info

Publication number
CN114974279A
CN114974279A (application CN202210511637.1A)
Authority
CN
China
Prior art keywords
noise
noise reduction
audio data
scene
automatic gain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210511637.1A
Other languages
Chinese (zh)
Inventor
盛剑锋
周骏华
程宝平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN202210511637.1A
Publication of CN114974279A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks

Abstract

The application discloses a sound quality control method, device, equipment and storage medium, belonging to the technical field of speech processing. The method performs noise reduction processing on audio data based on a real-time speech noise reduction model, where the real-time speech noise reduction model performs the noise reduction according to noise reduction parameters, and then performs double-layer automatic gain control on the noise-reduced audio data. In other words, because the real-time speech noise reduction model applies noise reduction parameters matched to the audio data, the noise reduction effect is improved; and the double-layer automatic gain control on the noise-reduced audio data widens the automatic volume gain range and improves the sound quality of the audio data.

Description

Sound quality control method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for controlling sound quality.
Background
When the sound source is far from the acquisition device, the energy of the noise approaches or even exceeds that of the target audio. A conventional speech noise reduction method cannot distinguish the target audio from the noise, so its noise reduction effect is poor; a conventional automatic gain control algorithm likewise cannot distinguish the target audio from the noise and therefore cannot amplify the target audio, which degrades the subjective listening quality. In other words, the conventional speech noise reduction method and the conventional automatic gain control algorithm cannot improve far-field sound quality, so far-field sound quality remains poor.
Disclosure of Invention
The present application mainly aims to provide a sound quality control method, device, equipment and storage medium, so as to solve the technical problem that far-field sound quality is poor because it cannot be improved by the conventional speech noise reduction method and the conventional automatic gain control algorithm.
In order to achieve the above object, the present application provides a sound quality control method, including the following steps:
based on a real-time voice noise reduction model, carrying out noise reduction processing on the audio data, wherein the real-time voice noise reduction model is used for carrying out noise reduction processing on the audio data according to noise reduction parameters;
and performing double-layer automatic gain control on the audio data subjected to the noise reduction processing.
Optionally, the step of performing noise reduction processing on the audio data based on the real-time speech noise reduction model includes:
carrying out noise scene judgment on audio data, and determining a noise scene corresponding to the audio data;
acquiring a noise reduction parameter matched with the noise scene according to the noise scene;
and carrying out noise reduction processing on the audio data based on the real-time voice noise reduction model of the noise reduction parameters.
Optionally, before the step of performing noise reduction processing on the audio data based on the real-time speech noise reduction model, the method further includes:
acquiring audio training data corresponding to a preset noise scene;
extracting time domain characteristic values and target values of the audio training data;
training the time domain characteristic values and the target values based on a deep learning noise reduction model constructed from voice activity detection, noise spectrum estimation and spectral subtraction to obtain noise reduction parameters;
and updating the parameters of the real-time voice noise reduction model by using the noise reduction parameters.
Optionally, the step of performing noise scene determination on the audio data and determining a noise scene corresponding to the audio data includes:
calculating noise data corresponding to the audio data, and estimating the spectral characteristics of the noise according to the calculated result to obtain a noise spectral estimation value;
and comparing the noise spectrum estimation value with a standard noise spectrum estimation value corresponding to each noise scene, and determining the noise scene corresponding to the minimum value of the comparison result as the noise scene corresponding to the audio data.
Optionally, the step of performing double-layer automatic gain control on the noise-reduced audio data includes:
performing framing processing on the audio data subjected to the noise reduction processing to obtain an audio frame;
if the audio frame is a voice frame, performing digital automatic gain control to obtain a digital automatic gain value;
if the digital automatic gain value is larger than or equal to the gain threshold value, performing analog automatic gain control to obtain an analog automatic gain step length, and feeding back the analog automatic gain step length to audio data acquisition equipment;
and if the digital automatic gain value is smaller than a gain threshold value, outputting the voice frame.
Optionally, the step of performing double-layer automatic gain control on the noise-reduced audio data further includes:
if the audio frame is a noise frame, judging whether a noise scene in the noise frame is a preset noise scene;
if the noise scene in the noise frame is a preset noise scene, generating and outputting comfort noise;
and if the noise scene in the noise frame is not the preset noise scene, performing noise reduction processing on the audio frame and then generating and outputting comfort noise.
Optionally, after the step of performing framing processing on the noise-reduced audio data to obtain an audio frame, the method further includes:
determining a voice activity detection value for the audio frame based on a voice activity detection algorithm;
based on the voice activity detection value, determining a type of the audio frame, the type including a voice frame and a noise frame.
In addition, in order to achieve the above object, the present application also provides a sound quality control apparatus, including:
the real-time voice noise reduction module is used for carrying out noise reduction processing on the audio data based on a real-time voice noise reduction model, and the real-time voice noise reduction model is used for carrying out noise reduction processing on the audio data according to noise reduction parameters;
and the volume automatic gain module is used for carrying out double-layer automatic gain control on the audio data subjected to the noise reduction processing.
In addition, to achieve the above object, the present application also provides a sound quality control apparatus, including: a memory, a processor and a sound quality control program stored on the memory and executable on the processor, the sound quality control program being configured to implement the steps of the sound quality control method as described above.
In addition, to achieve the above object, the present application further provides a storage medium having stored thereon a sound quality control program, which when executed by a processor, implements the steps of the sound quality control method as described above.
Compared with the prior art, in which far-field sound quality is poor because it cannot be improved by the conventional speech noise reduction method and the conventional automatic gain control algorithm, the present application performs noise reduction processing on audio data based on a real-time speech noise reduction model, where the real-time speech noise reduction model performs the noise reduction according to noise reduction parameters, and then performs double-layer automatic gain control on the noise-reduced audio data. In other words, because the real-time speech noise reduction model applies noise reduction parameters matched to the audio data, the noise reduction effect is improved; and the double-layer automatic gain control on the noise-reduced audio data widens the automatic volume gain range and improves the sound quality of the audio data.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below; for those skilled in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of a sound quality control device in a hardware operating environment according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a sound quality control method according to a first embodiment of the present application;
FIG. 3 is a detailed flowchart of step S20 of the present application;
fig. 4 is a functional block diagram of a sound quality control apparatus according to a first embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a sound quality control device in a hardware operating environment according to an embodiment of the present application.
As shown in fig. 1, the sound quality control device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM), or may be a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the sound quality control device, which may include more or fewer components than those shown, a combination of some components, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and a sound quality control program.
In the sound quality control device shown in fig. 1, the network interface 1004 is mainly used for data communication with other devices, and the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 are provided in the sound quality control device, which calls the sound quality control program stored in the memory 1005 through the processor 1001 and executes the sound quality control method provided by the embodiments of the present application.
The embodiment of the present application provides a sound quality control method, and referring to fig. 2, fig. 2 is a schematic flowchart of a first embodiment of the sound quality control method of the present application.
In this embodiment, the sound quality control method includes:
step S10, noise reduction processing is carried out on the audio data based on a real-time voice noise reduction model, and the real-time voice noise reduction model is used for carrying out noise reduction processing on the audio data according to noise reduction parameters;
and step S20, performing double-layer automatic gain control on the audio data after the noise reduction processing.
The method comprises the following specific steps:
and step S10, performing noise reduction processing on the audio data based on a real-time voice noise reduction model, wherein the real-time voice noise reduction model is used for performing noise reduction processing on the audio data according to the noise reduction parameters.
Further, the step of performing noise reduction processing on the audio data based on the real-time speech noise reduction model includes:
and step S11, carrying out noise scene judgment on the audio data, and determining a noise scene corresponding to the audio data.
It should be noted that the noise scene may be a conference noise scene, a traffic noise scene, a crowd noise scene, and the like, where the traffic noise scene may be further divided into a vehicle noise scene and a train noise scene.
Specifically, the determining a noise scene of audio data and determining a noise scene corresponding to the audio data includes:
step S111, calculating noise data corresponding to the audio data, and estimating the spectral characteristics of the noise according to the calculated result to obtain a noise spectral estimation value;
step S112, comparing the noise spectrum estimation value with a standard noise spectrum estimation value corresponding to each noise scene, and determining the noise scene corresponding to the minimum value of the comparison result as the noise scene corresponding to the audio data.
And step S12, acquiring a noise reduction parameter matched with the noise scene according to the noise scene.
In the real-time speech noise reduction model, a plurality of groups of noise reduction parameters exist, and each group of noise reduction parameters respectively corresponds to different noise scenes.
And step S13, performing noise reduction processing on the audio data based on the real-time speech noise reduction model configured with the noise reduction parameters.
In this embodiment, different noise scenes correspond to different noise reduction parameters, and therefore, in this embodiment, noise reduction processing needs to be performed on audio data by using real-time speech noise reduction models with different noise reduction parameters according to the noise scene corresponding to the audio data, so that a noise reduction effect on the audio data is improved.
Further, before the step of performing noise reduction processing on the audio data based on the real-time speech noise reduction model, the method further includes:
and A1, acquiring audio training data corresponding to a preset noise scene.
The audio training data includes clean audio data and noise data of a preset noise scene.
In this embodiment, the audio training data may be acquired in various ways. In one possible implementation, a tester carries an audio acquisition device to record clean audio data in a noise-free environment and to record noise data in an environment corresponding to a preset noise scene. For example, the tester uses the audio acquisition device to record user A saying "noise collection" in a recording studio, and that recording serves as the clean audio data; the tester then uses the audio acquisition device at the roadside during the morning and evening rush hours to record the ambient sound as the noise data of the preset noise scene.
And A2, extracting time domain characteristic values and target values of the audio training data.
In this embodiment, the time-domain feature value of the audio training data is used to indicate the feature of the audio training data in the time domain, and in a possible implementation, the time-domain feature value of the audio training data may include one or more of a noise threshold, a long-term energy value, a short-term energy value, and a noise envelope tracking value. It is understood that the time domain feature value of the audio training data may also include other information, and the embodiment is not limited in this respect.
Specifically, the noise threshold is used to indicate an amplitude range of the noise, the long-time energy value and the short-time energy value are used to indicate energy information of the audio data in a preset time period, and the noise envelope tracking value is used to estimate an amplitude of the noise.
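For illustration only, the sketch below computes one plausible version of these time-domain feature values per frame; the window sizes, the 10% energy quantile used as the noise threshold, and the rise/fall constants of the envelope follower are assumptions of this sketch rather than values disclosed by the application.

```python
import numpy as np

def time_domain_features(audio, frame_len=512, short_win=10, long_win=100,
                         rise=0.999, fall=0.9):
    """Compute illustrative per-frame time-domain feature values: short-term energy,
    long-term energy, a noise threshold, and a noise envelope tracking value."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    frame_energy = (frames ** 2).mean(axis=1)

    # short-term / long-term energy: moving averages over different horizons
    short_energy = np.convolve(frame_energy, np.ones(short_win) / short_win, mode="same")
    long_energy = np.convolve(frame_energy, np.ones(long_win) / long_win, mode="same")

    # noise threshold: a low quantile of the frame energy, marking the amplitude range of noise
    noise_threshold = np.quantile(frame_energy, 0.1)

    # noise envelope tracking: rise slowly so speech bursts do not inflate the estimate,
    # fall quickly when the level drops
    envelope = np.zeros(n_frames)
    level = 0.0
    for i, amp in enumerate(np.abs(frames).max(axis=1)):
        coef = rise if amp > level else fall
        level = coef * level + (1.0 - coef) * amp
        envelope[i] = level

    return np.stack([short_energy, long_energy,
                     np.full(n_frames, noise_threshold), envelope], axis=1)
```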
In this embodiment, the target value of the audio training data may include one or more of a voice activity detection value of clean audio data and a full-band signal-to-noise value of noise data of a preset noise scene.
Wherein the voice activity detection value of the clean audio data may be used to indicate whether speech or noise is currently detected. For example, "1" may indicate that speech is currently detected and "0" that noise is currently detected; alternatively, "0" may indicate speech and "1" noise, and this embodiment is not particularly limited. The full-band signal-to-noise ratio value of the noise data of the preset noise scene may be used to indicate the relationship between the speech and the noise.
Step A3, training the time domain characteristic values and the target values based on a deep learning noise reduction model constructed from voice activity detection, noise spectrum estimation and spectral subtraction, so as to obtain the noise reduction parameters.
In this embodiment, the deep learning noise reduction model may be constructed in a number of ways; in one possible implementation, it may be constructed based on Keras. Specifically, Keras is a highly modular neural network library written in Python, originally built on top of Theano and inspired by Torch, and it can run on both a Graphics Processing Unit (GPU) and a Central Processing Unit (CPU).
In this embodiment, the deep learning noise reduction model may include a voice activity detection module, a noise spectrum estimation module, and a spectral subtraction module. The voice activity detection module may detect the clean audio data and the noise data of the preset noise scene and distinguish speech from noise according to activity indicators of the detected data (for example, the amplitude range). The noise spectrum estimation module may be configured to analyze the noise data of the preset noise scene and estimate the spectral characteristics of the noise from the result of that analysis. The spectral subtraction module may be configured to determine a gain value from the results produced by the voice activity detection module and the noise spectrum estimation module, where the gain value may be used to suppress the noise in the speech information.
In a specific implementation, the time domain characteristic values of the audio training data are used as the input information of the deep learning noise reduction model, the target values of the audio training data are used as its output information, and the deep learning noise reduction model is then trained on this input and output information to obtain the noise reduction parameters. For example, inputting the time domain characteristic values into the voice activity detection module may yield a first model parameter; inputting the first model parameter into the noise spectrum estimation module may yield a second model parameter, and inputting the first model parameter into the spectral subtraction module may yield a third model parameter; finally, the first, second and third model parameters are fed into the spectral subtraction module together to obtain the trained model parameters, and these model parameters are the noise reduction parameters. It should be noted that, in this embodiment, each module in the deep learning noise reduction model may be a functional module constructed with Keras; that is, the voice activity detection module, the noise spectrum estimation module and the spectral subtraction module are introduced only to describe the process of determining the model parameters, and a specific implementation may further include other modules, which is not limited here.
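The following Keras sketch is only meant to make the three-module structure concrete; the layer types and sizes, feature dimension, number of bands, and loss weights are assumptions of this illustration (loosely in the style of lightweight recurrent noise suppressors) and are not the architecture claimed by the application.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 42   # assumed size of the per-frame feature vector
N_BANDS = 22      # assumed number of frequency bands for the gain output

def build_noise_reduction_model():
    """Sketch of a deep-learning noise reduction model with a voice activity branch,
    a noise spectrum estimation branch, and a spectral subtraction (gain) branch."""
    feats = keras.Input(shape=(None, N_FEATURES), name="time_domain_features")

    # voice activity detection branch: per-frame speech probability ("first model parameter")
    vad_h = layers.GRU(24, return_sequences=True, name="vad_gru")(feats)
    vad_out = layers.Dense(1, activation="sigmoid", name="vad")(vad_h)

    # noise spectrum estimation branch, conditioned on the VAD branch ("second model parameter")
    noise_h = layers.GRU(48, return_sequences=True, name="noise_gru")(
        layers.Concatenate()([feats, vad_h]))

    # spectral subtraction branch: per-band gains that suppress the estimated noise
    gain_h = layers.GRU(96, return_sequences=True, name="gain_gru")(
        layers.Concatenate()([feats, vad_h, noise_h]))
    gain_out = layers.Dense(N_BANDS, activation="sigmoid", name="band_gains")(gain_h)

    model = keras.Model(inputs=feats, outputs=[vad_out, gain_out])
    model.compile(optimizer="adam",
                  loss={"vad": "binary_crossentropy", "band_gains": "mse"},
                  loss_weights={"vad": 0.5, "band_gains": 1.0})
    return model

# model.fit(features, {"vad": vad_targets, "band_gains": gain_targets}) would then yield
# trained weights that play the role of the scene-specific noise reduction parameters.
```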
And A4, updating the parameters of the real-time voice noise reduction model by using the noise reduction parameters.
In this embodiment, the time domain characteristic values and the target values of the audio training data are obtained, and the deep learning noise reduction model constructed from voice activity detection, noise spectrum estimation and spectral subtraction is trained on them, so that the time domain and frequency domain characteristics of the audio data are combined during training; this improves the training performance of the deep learning noise reduction model and accelerates its training. Meanwhile, the audio training data include audio data recorded in different noise scenes, so training the deep learning noise reduction model on the different sets of audio training data yields noise reduction parameters corresponding to the different noise scenes. The real-time speech noise reduction model is updated with these scene-specific noise reduction parameters, and the model then performs noise reduction on audio data with the parameters that match the noise scene of that audio data, which further improves the noise reduction effect.
Referring to fig. 3, fig. 3 is a detailed flowchart of step S20 in the present application.
And step S20, performing double-layer automatic gain control on the audio data after the noise reduction processing.
Further, the step of performing double-layer automatic gain control on the noise-reduced audio data includes:
and step S21, performing framing processing on the audio data subjected to the noise reduction processing to obtain an audio frame.
In this embodiment, the audio data after the noise reduction processing may be subjected to framing processing according to a preset length, and the preset length is not specifically limited herein.
And step S22, if the audio frame is a voice frame, performing digital automatic gain control to obtain a digital automatic gain value.
In this embodiment, the digital automatic gain value may be obtained in various ways; in one possible implementation, digital automatic gain control is performed with an automatic gain control (AGC) algorithm to obtain the digital automatic gain value.
And step S23, if the digital automatic gain value is larger than or equal to the gain threshold value, performing analog automatic gain control to obtain an analog automatic gain step length, and feeding the analog automatic gain step length back to the audio data acquisition equipment.
In this embodiment, the analog automatic gain step may be obtained in multiple ways. In one possible implementation, the difference between the digital automatic gain value and the gain threshold is calculated, and the analog automatic gain step is determined from a mapping between that difference and the analog automatic gain step; the mapping may be obtained through repeated experiments.
And step S24, if the digital automatic gain value is smaller than the gain threshold value, outputting the voice frame.
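As a hedged illustration of the double-layer control flow described in steps S21 to S24 (and of the noise-frame branch described next), the sketch below applies digital AGC to speech frames and, when the required digital gain reaches the threshold, feeds an analog gain step back to the capture device. The target level, gain threshold, step table, and the `set_analog_gain_step` and `generate_comfort_noise` callbacks are assumptions of this sketch.

```python
import numpy as np

TARGET_RMS = 0.1       # assumed target level for the digital AGC
GAIN_THRESHOLD = 4.0   # assumed gain threshold (linear factor)
STEP_TABLE = [(2.0, 1), (4.0, 2), (8.0, 3)]  # assumed mapping: gain excess -> analog step

def analog_step_from_excess(excess):
    """Map the difference between the digital gain and the threshold to an analog gain step."""
    for bound, step in STEP_TABLE:
        if excess <= bound:
            return step
    return STEP_TABLE[-1][1]

def double_layer_agc(frames, is_speech, set_analog_gain_step, generate_comfort_noise):
    """Apply digital AGC to speech frames; when the digital gain reaches the threshold,
    feed an analog gain step back to the capture device for the next acquisition."""
    out = []
    for frame, speech in zip(frames, is_speech):
        if not speech:
            out.append(generate_comfort_noise(len(frame)))   # noise frame: output comfort noise
            continue
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
        digital_gain = TARGET_RMS / rms                       # first layer: digital AGC
        if digital_gain >= GAIN_THRESHOLD:
            # second layer: ask the capture device to raise its analog gain next time
            set_analog_gain_step(analog_step_from_excess(digital_gain - GAIN_THRESHOLD))
            digital_gain = GAIN_THRESHOLD
        out.append(np.clip(frame * digital_gain, -1.0, 1.0))  # output the gained speech frame
    return out
```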
Further, the step of performing double-layer automatic gain control on the noise-reduced audio data further includes:
step S25, if the audio frame is a noise frame, judging whether a noise scene in the noise frame is a preset noise scene;
step S26, if the noise scene in the noise frame is a preset noise scene, generating and outputting comfortable noise;
and step S27, if the noise scene in the noise frame is not a preset noise scene, performing noise reduction processing on the audio frame, and generating and outputting comfortable noise.
In this embodiment, a specific process of determining whether the noise scene in the noise frame is a preset noise scene is as follows:
calculating noise data corresponding to the noise frame, and estimating the spectral characteristics of the noise according to the calculated result to obtain a noise spectral estimation value;
comparing the noise spectrum estimation value with the standard noise spectrum estimation value corresponding to each noise scene;
if, for every noise scene, the comparison result between the noise spectrum estimation value and the corresponding standard noise spectrum estimation value is larger than a preset comparison threshold, it is determined that the noise scene in the noise frame is not a preset noise scene;
conversely, if the comparison result between the noise spectrum estimation value and the standard noise spectrum estimation value corresponding to one of the noise scenes is smaller than the preset comparison threshold, the noise scene in the noise frame is determined to be that preset noise scene.
In this embodiment, the comfort noise is generated by a comfort noise generation (CNG) routine, that is, a routine that generates background noise for telephone communication when a brief silence occurs during a call.
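A minimal sketch of such a comfort noise routine is given below; shaping white noise with a previously estimated noise magnitude spectrum is an assumption of this illustration (the application only states that a CNG routine generates the background noise), and the output level is arbitrary.

```python
import numpy as np

def generate_comfort_noise(n_samples, noise_spectrum=None, level=0.01, rng=None):
    """Generate low-level comfort noise; if an estimated noise magnitude spectrum is
    available, shape white noise with it so the inserted noise resembles the background."""
    rng = rng or np.random.default_rng()
    white = rng.standard_normal(n_samples)
    if noise_spectrum is None:
        shaped = white
    else:
        spec = np.fft.rfft(white, n=n_samples)
        # resample the stored magnitude envelope onto this frame's bins and apply it
        envelope = np.interp(np.linspace(0.0, 1.0, len(spec)),
                             np.linspace(0.0, 1.0, len(noise_spectrum)), noise_spectrum)
        shaped = np.fft.irfft(spec * envelope, n=n_samples)
    shaped = shaped / (np.max(np.abs(shaped)) + 1e-12)
    return level * shaped
```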
Further, after the step of performing framing processing on the noise-reduced audio data to obtain an audio frame, the method further includes:
determining a voice activity detection value for the audio frame based on a voice activity detection algorithm;
based on the voice activity detection value, determining a type of the audio frame, the type including a voice frame and a noise frame.
In a possible implementation manner, the voice activity detection algorithm may determine whether voice data exists in the audio data through a feature extraction module, a threshold calculation module, a threshold decision module, and the like; that is, it evaluates the input signal and distinguishes voice data from noise data.
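A minimal sketch of this kind of threshold-based voice activity detection is shown below; the decision margin and the adaptation constant of the noise floor are assumptions of the illustration.

```python
import numpy as np

def vad_by_threshold(frames, noise_floor_init=1e-4, margin=3.0, adapt=0.95):
    """Threshold-based VAD: per-frame energy (feature extraction), an adaptive noise
    floor (threshold calculation), and an energy-vs-floor comparison (threshold decision)."""
    noise_floor = noise_floor_init
    decisions = []
    for frame in frames:
        energy = float(np.mean(frame ** 2))
        is_speech = energy > margin * noise_floor
        if not is_speech:
            # only adapt the noise floor on frames judged to be noise
            noise_floor = adapt * noise_floor + (1.0 - adapt) * energy
        decisions.append(1 if is_speech else 0)   # 1 = voice frame, 0 = noise frame
    return decisions
```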
In another possible implementation manner, a pre-trained acoustic model may perform frame-by-frame recognition on a sound sample (where each audio frame may have a preset length), and the recognition result output by the acoustic model is determined. The recognition result can be represented by 0 or 1 (for example, 0 may be used to characterize that the corresponding audio frame does not include a non-noise signal, and 1 may be used to characterize that it does).
The sound samples and the recognition results output by the acoustic model may then be used as a training set to train a voice activity detection model, so that the trained voice activity detection model can distinguish voice data from noise data.
In this embodiment, the digital automatic gain is used to feed back an analog automatic gain adjustment to the audio data acquisition device for the next round of audio data acquisition, which expands the range of automatic volume gain and improves the sound quality of the audio data.
The embodiment of the present application further provides a sound quality control device, referring to fig. 4, and fig. 4 is a schematic diagram of functional modules of the first embodiment of the sound quality control device of the present application.
In this embodiment, the sound quality control apparatus includes:
the real-time voice noise reduction module 10 is configured to perform noise reduction processing on the audio data based on a real-time voice noise reduction model, where the real-time voice noise reduction model is configured to perform noise reduction processing on the audio data according to noise reduction parameters;
and a volume automatic gain module 20, configured to perform double-layer automatic gain control on the noise-reduced audio data.
Optionally, the real-time speech noise reduction module includes:
the noise scene judging unit is used for judging the noise scene of the audio data and determining the noise scene corresponding to the audio data;
the noise reduction parameter matching unit is used for acquiring noise reduction parameters matched with the noise scene according to the noise scene;
and the noise reduction processing unit is used for carrying out noise reduction processing on the audio data based on the real-time voice noise reduction model of the noise reduction parameters.
Optionally, the sound quality control apparatus further includes:
the acquisition module is used for acquiring audio training data corresponding to a preset noise scene;
the training data extraction module is used for extracting the time domain characteristic values and the target values of the audio training data;
the model training module is used for training the time domain characteristic values and the target values based on a deep learning noise reduction model constructed from voice activity detection, noise spectrum estimation and spectral subtraction to obtain noise reduction parameters;
and the updating module is used for updating the parameters of the real-time voice noise reduction model by using the noise reduction parameters.
Optionally, the noise scene determination unit is configured to implement:
calculating noise data corresponding to the audio data, and estimating the spectral characteristics of the noise according to the calculated result to obtain a noise spectral estimation value;
and comparing the noise spectrum estimation value with a standard noise spectrum estimation value corresponding to each noise scene, and determining the noise scene corresponding to the minimum value of the comparison result as the noise scene corresponding to the audio data.
Optionally, the volume automatic gain module includes:
the framing unit is used for framing the audio data subjected to the noise reduction processing to obtain an audio frame;
the digital automatic gain unit is used for carrying out digital automatic gain control to obtain a digital automatic gain value if the audio frame is a voice frame;
the analog automatic gain unit is used for carrying out analog automatic gain control to obtain an analog automatic gain step length and feeding back the analog automatic gain step length to the audio data acquisition equipment if the digital automatic gain value is greater than or equal to a gain threshold value;
and the output unit is used for outputting the voice frame if the digital automatic gain value is smaller than a gain threshold value.
Optionally, the volume automatic gain module further comprises:
a comfort noise generation unit for implementing:
if the audio frame is a noise frame, judging whether a noise scene in the noise frame is a preset noise scene;
if the noise scene in the noise frame is a preset noise scene, generating and outputting comfort noise;
and if the noise scene in the noise frame is not the preset noise scene, performing noise reduction processing on the audio frame and then generating and outputting comfort noise.
Optionally, the volume automatic gain module further comprises:
a type judgment unit for realizing: determining a voice activity detection value for the audio frame based on a voice activity detection algorithm;
based on the voice activity detection value, determining a type of the audio frame, the type including a voice frame and a noise frame.
The specific implementation of the tone quality control apparatus of the present application is substantially the same as that of the above-mentioned embodiments of the tone quality control method, and is not described herein again.
An embodiment of the present application further provides a storage medium, where the storage medium stores a sound quality control program, and the sound quality control program, when executed by a processor, implements the steps of the sound quality control method described above.
The specific implementation of the storage medium of the present application is substantially the same as that of each embodiment of the sound quality control method, and is not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present application may be substantially or partially embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (10)

1. A sound quality control method is characterized by comprising the following steps:
based on a real-time voice noise reduction model, carrying out noise reduction processing on the audio data, wherein the real-time voice noise reduction model is used for carrying out noise reduction processing on the audio data according to noise reduction parameters;
and performing double-layer automatic gain control on the audio data subjected to the noise reduction processing.
2. The method for controlling sound quality according to claim 1, wherein said step of performing noise reduction processing on the audio data based on a real-time speech noise reduction model comprises:
carrying out noise scene judgment on audio data, and determining a noise scene corresponding to the audio data;
acquiring a noise reduction parameter matched with the noise scene according to the noise scene;
and carrying out noise reduction processing on the audio data based on the real-time voice noise reduction model of the noise reduction parameters.
3. The method for controlling sound quality according to claim 1 or 2, wherein before the step of performing noise reduction processing on the audio data based on the real-time speech noise reduction model, the method further comprises:
acquiring audio training data corresponding to a preset noise scene;
extracting time domain characteristic values and target values of the audio training data;
training the time domain characteristic value and the target value based on a deep learning noise reduction model constructed by voice activity detection, noise spectrum estimation and spectrum subtraction to obtain noise reduction parameters;
and updating the parameters of the real-time voice noise reduction model by using the noise reduction parameters.
4. The method for controlling sound quality according to claim 2, wherein said step of determining a noise scene of the audio data and determining a noise scene corresponding to the audio data includes:
calculating noise data corresponding to the audio data, and estimating the spectral characteristics of the noise according to the calculated result to obtain a noise spectral estimation value;
and comparing the noise spectrum estimation value with the standard noise spectrum estimation value corresponding to each noise scene, and determining the noise scene corresponding to the minimum value of the comparison result as the noise scene corresponding to the audio data.
5. The method for controlling sound quality according to claim 1, wherein said step of performing double-layer automatic gain control on the noise-reduced audio data includes:
performing framing processing on the audio data subjected to the noise reduction processing to obtain an audio frame;
if the audio frame is a voice frame, performing digital automatic gain control to obtain a digital automatic gain value;
if the digital automatic gain value is larger than or equal to the gain threshold value, performing analog automatic gain control to obtain an analog automatic gain step length, and feeding back the analog automatic gain step length to audio data acquisition equipment;
and if the digital automatic gain value is smaller than a gain threshold value, outputting the voice frame.
6. The method for controlling sound quality according to claim 5, wherein said step of performing double-layer automatic gain control on the noise-reduced audio data further comprises:
if the audio frame is a noise frame, judging whether a noise scene in the noise frame is a preset noise scene;
if the noise scene in the noise frame is a preset noise scene, generating and outputting comfort noise;
and if the noise scene in the noise frame is not the preset noise scene, performing noise reduction processing on the audio frame and then generating and outputting comfort noise.
7. The method for controlling sound quality according to claim 5, wherein after the step of performing framing processing on the noise-reduced audio data to obtain an audio frame, the method further comprises:
determining a voice activity detection value for the audio frame based on a voice activity detection algorithm;
based on the voice activity detection value, determining a type of the audio frame, the type including a voice frame and a noise frame.
8. A sound quality control apparatus, characterized by comprising:
the real-time voice noise reduction module is used for carrying out noise reduction processing on the audio data based on a real-time voice noise reduction model, and the real-time voice noise reduction model is used for carrying out noise reduction processing on the audio data according to noise reduction parameters;
and the volume automatic gain module is used for carrying out double-layer automatic gain control on the audio data subjected to the noise reduction processing.
9. A sound quality control apparatus, characterized in that the apparatus comprises: a memory, a processor and a sound quality control program stored on said memory and executable on said processor, said sound quality control program being configured to implement the steps of the sound quality control method as claimed in any one of claims 1 to 7.
10. A storage medium, characterized in that a sound quality control program is stored thereon, which when executed by a processor implements the steps of the sound quality control method according to any one of claims 1 to 7.
CN202210511637.1A, filed 2022-05-10 (priority 2022-05-10): Sound quality control method, device, equipment and storage medium. Status: Pending. Publication: CN114974279A (en)

Priority Applications (1)

Application number: CN202210511637.1A; priority date: 2022-05-10; filing date: 2022-05-10; title: Sound quality control method, device, equipment and storage medium

Publications (1)

Publication number: CN114974279A; publication date: 2022-08-30

Family

ID=82980613

Country Status (1)

CN: CN114974279A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination