CN113327626A - Voice noise reduction method, device, equipment and storage medium - Google Patents
- Publication number
- CN113327626A CN113327626A CN202110699792.6A CN202110699792A CN113327626A CN 113327626 A CN113327626 A CN 113327626A CN 202110699792 A CN202110699792 A CN 202110699792A CN 113327626 A CN113327626 A CN 113327626A
- Authority
- CN
- China
- Prior art keywords
- voice
- scene
- recognition model
- scene recognition
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L15/063—Training
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2015/0631—Creating reference templates; Clustering
- Y02T90/00—Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation
Abstract
The present application relates to the technical field of audio processing and discloses a voice noise reduction method comprising the following steps: acquiring voice data; inputting the voice data into a preset standard scene recognition model to determine the voice scene corresponding to the voice data, where the standard scene recognition model is trained on a noise sample set collected under each scene; and selecting the preset noise reduction model corresponding to that voice scene to denoise the voice data. The application further discloses a voice noise reduction apparatus, device, and storage medium. The present application can improve the accuracy of voice noise reduction.
Description
Technical Field
The present application relates to the field of audio processing technologies, and in particular to a speech noise reduction method, apparatus, device, and storage medium.
Background
Everyday life is filled with all kinds of noisy audio data, for example audio recorded at the roadside, in a park, or in an office. The audio characteristics of the noise differ across voice scenes, and the appropriate voice noise reduction techniques differ accordingly.
The existing voice noise reduction method is based on the user's voiceprint features: the user's voice in the audio is enhanced according to the voiceprint features, thereby weakening the background noise and completing the noise reduction. In a practical application scenario, however, when the background noise is too loud, this method cannot weaken the background noise merely by enhancing the user's voice, so its noise reduction accuracy is low.
Disclosure of Invention
In order to solve the above technical problem, or at least partially solve it, the present application provides a voice noise reduction method, apparatus, device, and storage medium.
In a first aspect, the present application provides a method for speech noise reduction, the method comprising:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and selecting a preset noise reduction model corresponding to the voice scene, and reducing noise of the voice data.
In one embodiment of the first aspect, the step of obtaining the voice data is preceded by:
collecting a noise sample set under each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
and segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model.
In one embodiment of the first aspect, after the steps of segmenting the classified speech set into a training speech set and a testing speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the testing speech set to obtain a standard scene recognition model, the method further includes:
and establishing a noise reduction model corresponding to each scene according to the collected noise sample set under each scene for calling.
In one embodiment of the first aspect, the constructing a scene recognition model by using the training speech set includes:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein a feature label is the category label used to extract the corresponding audio features from the noise sample set under each scene;
sorting the Gini index set and selecting the label corresponding to the smallest Gini index in the set as a segmentation point;
taking the segmentation point as a root node of an initial decision tree, starting from the segmentation point to generate child nodes, distributing the training speech set to the child nodes, and generating the initial decision tree until all labels in the feature labels are traversed;
and pruning the initial decision tree to obtain a scene recognition model.
In one embodiment of the first aspect, the pruning the initial decision tree to obtain a scene recognition model includes:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
and pruning the non-leaf nodes of which the surface error gain values are smaller than a preset gain threshold value to obtain a scene recognition model.
In one embodiment of the first aspect, the performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model includes:
carrying out scene recognition processing on the test voice set by using the scene recognition model to obtain a recognition result corresponding to the test voice set;
and when the recognition result corresponding to the test voice set is inconsistent with the feature labels corresponding to the test voice set, retraining the scene recognition model with the training voice set until the recognition result corresponding to the test voice set is consistent with the feature labels, thereby obtaining the standard scene recognition model.
In one embodiment of the first aspect, the performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set includes:
acquiring a preset standard feature, and calculating a conditional probability value between the audio feature and the standard feature;
and sequencing each noise sample in the noise sample set according to the size of the conditional probability value, and dividing the sequenced noise sample set by using a preset audio interval as a dividing point to obtain a classified voice set.
In one embodiment of the first aspect, acquiring a noise sample set in each scene, and extracting audio features from each noise sample includes:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transform are carried out on the noise sample set to obtain a short-time frequency spectrum of the noise sample set;
performing modular squaring on the short-time frequency spectrum to obtain a power spectrum of the noise sample set;
and calculating the power spectrum by utilizing a preset Mel-scale triangular filter group to obtain logarithmic energy, and performing discrete cosine transform on the logarithmic energy to obtain the audio characteristics corresponding to each noise sample.
In a second aspect, the present application provides a voice noise reduction apparatus, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data;
the voice scene recognition module is used for inputting the voice data into a preset standard scene recognition model and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and the noise reduction module is used for selecting a preset noise reduction model corresponding to the voice scene and reducing noise of the voice data.
In a third aspect, a voice noise reduction device is provided, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the speech noise reduction method according to any embodiment of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the speech noise reduction method according to any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the embodiment of the application, the obtained voice data are input into the preset standard scene recognition model, the standard scene recognition model is used for recognizing the voice scene corresponding to the voice data, the voice scene corresponding to the voice data can be determined, the voice environment where the voice data are located is selected, the preset noise reduction model corresponding to the voice scene is selected, the voice data are subjected to noise reduction, the noise is reduced through the noise reduction model matched with the scene, the noise reduction operation is executed more accurately, and therefore the purpose of improving the accuracy of voice noise reduction is achieved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a speech noise reduction method according to an embodiment of the present application;
fig. 2 is a schematic flowchart illustrating a process of testing and adjusting a scene recognition model in a speech noise reduction method according to an embodiment of the present application;
fig. 3 is a schematic block diagram of an apparatus for speech noise reduction according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device for reducing noise in voice according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of a speech noise reduction method according to an embodiment of the present application. In this embodiment, the voice noise reduction method includes:
and S1, acquiring voice data.
In the embodiments of the present invention, the voice data is noise-containing audio data awaiting noise reduction, so that subsequent audio processing such as speech recognition can be performed. Specifically, the voice data may be audio data collected in any voice scene.
Further, before the step of acquiring voice data, the method includes:
collecting a noise sample set under each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
and segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model.
In detail, in the embodiments of the present application, the noise sample set includes noise audio data collected in each voice scene, for example in a park, at the roadside, or in an office. In the embodiments of the present invention, the noise sample set may further include a feature label for each noise sample, which labels the sample so that the corresponding audio features can be extracted. The audio features may include the zero-crossing rate, Mel-frequency cepstral coefficients, the spectral centroid, spectral spread, spectral entropy, spectral flux, and the like; in the embodiments of the present application, Mel-frequency cepstral coefficients are preferred.
Specifically, the acquiring a noise sample set under each scene, and extracting audio features from each noise sample includes:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transform are carried out on the noise sample set to obtain a short-time frequency spectrum of the noise sample set;
performing modular squaring on the short-time frequency spectrum to obtain a power spectrum of the noise sample set;
and calculating the power spectrum by utilizing a preset Mel-scale triangular filter group to obtain logarithmic energy, and performing discrete cosine transform on the logarithmic energy to obtain the audio characteristics corresponding to each noise sample.
In an alternative embodiment of the present application, the noise sample set is pre-emphasized by a preset high-pass filter to obtain a high-frequency-boosted noise sample set; pre-emphasis strengthens the high-frequency portion of the speech signal in the noise sample set.
Pre-emphasizing the noise sample set highlights the formants in the high-frequency part of each noise sample.
In an optional embodiment of the present application, the high-frequency-boosted noise sample set is segmented into frames of a preset number of sampling points, yielding a framed data set.
Preferably, in the embodiments of the present application, each frame contains 512 or 256 sampling points.
In an optional embodiment of the present application, the windowing process is to perform windowing on each frame in the frame data set according to a preset window function, so as to obtain a windowed signal.
In detail, the preset window function is:
S′(n)=S(n)×W(n)
where S′(n) is the windowed signal, S(n) is the framed data, W(n) is the window function, N is the frame size, and n is the sample index within a frame.
Preferably, in the embodiments of the present application, the preset window function may be a Hamming window, in which case W(n) is the expression of the Hamming window.
Windowing the framed data set improves the continuity between the left and right ends of each frame and reduces spectral leakage.
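As an illustration, the Hamming window, a common choice for W(n), can be computed directly from its standard expression (the 0.54/0.46 coefficients are the conventional values, assumed rather than stated in the patent):

```python
import numpy as np

def hamming(N):
    # W(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), for n = 0 .. N-1
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
```

Multiplying each 512-sample (or 256-sample) frame by this window element-wise yields the windowed signal S′(n).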
Further, the embodiments of the present invention perform the fast Fourier transform using the following formula:
S(k) = Σ_{n=0}^{N−1} S′(n) e^{−j2πnk/N}, 0 ≤ k ≤ N−1
and take the modulus squared of the short-time spectrum using the following formula:
P(k) = |S(k)|² / N
where S(k) is the short-time spectrum, P(k) is the power spectrum, S′(n) is the windowed signal, N is the frame size, n is the sample index, and k is the frequency bin index on the short-time spectrum.
Since the characteristics of a signal are usually difficult to observe from its time-domain form, the embodiments of the present invention convert the noise sample set into an energy distribution in the frequency domain; different energy distributions can represent the characteristics of different voices.
Further, in the embodiments of the present invention, the log energy produced by the Mel-scale triangular filter bank is:
t(m) = ln( Σ_{k=0}^{N−1} P(k) · H_m(k) )
where t(m) is the log energy of the m-th filter, P(k) is the power spectrum, H_m(k) is the frequency response of the m-th triangular filter, N is the frame size, and k is the frequency bin index on the short-time spectrum.
Using the triangular filters to compute the log energy of the power spectrum smooths the short-time spectrum, eliminates harmonics, and highlights the formants in the voice information.
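The whole feature-extraction pipeline described above (pre-emphasis, framing, windowing, FFT, power spectrum, Mel filter bank, log energy, discrete cosine transform) can be sketched in NumPy. This is a minimal sketch, not the patent's implementation: the 0.97 pre-emphasis coefficient, 26 filters, and 13 cepstral coefficients are conventional defaults assumed here.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=512, hop=256, n_filters=26, n_ceps=13):
    """Sketch: pre-emphasis -> framing -> windowing -> FFT -> power
    spectrum -> Mel filter bank -> log energy -> DCT -> MFCCs."""
    # Pre-emphasis (high-pass): highlight the high-frequency formants
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing: overlapping frames of frame_len samples, hop samples apart
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: S'(n) = S(n) * W(n) with a Hamming window
    frames = frames * np.hamming(frame_len)
    # FFT -> short-time spectrum; modulus squared -> power spectrum P(k)
    spectrum = np.fft.rfft(frames, frame_len)
    power = np.abs(spectrum) ** 2 / frame_len
    # Mel-scale triangular filter bank H_m(k)
    high_mel = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    hz = 700.0 * (10.0 ** (np.linspace(0.0, high_mel, n_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((frame_len + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log energy t(m), then an orthonormal DCT-II to decorrelate -> MFCCs
    log_energy = np.log(power @ fbank.T + 1e-10)
    M = log_energy.shape[1]
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(M) + 0.5) / M)
    ceps = log_energy @ basis.T * np.sqrt(2.0 / M)
    ceps[:, 0] /= np.sqrt(2.0)
    return ceps
```

Each row of the returned matrix is the audio feature vector (Mel-frequency cepstral coefficients) for one frame of one noise sample.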
Specifically, the performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set includes:
acquiring a preset standard feature, and calculating a correlation coefficient between the audio feature and the standard feature;
and sequencing each noise sample in the noise sample set according to the magnitude of the correlation coefficient, and dividing the sequenced noise sample set by using a preset audio interval as a dividing point to obtain a classified voice set.
Wherein the classified voice set comprises voices in different scenes, such as voice in a road scene, voice in a park scene, and the like.
In detail, the correlation coefficient between the audio feature corresponding to each noise sample in the noise sample set and the standard feature is calculated using the following formula:
q_{ij} = exp(−‖y_i − y_j‖²) / Σ_{k≠l} exp(−‖y_k − y_l‖²)
where q_{ij} is the correlation coefficient, y_i is the audio feature corresponding to the noise sample, y_j is the standard feature, exp is the exponential function, and y_k and y_l are fixed parameters.
Specifically, the cluster analysis embeds the noise samples, originally distributed in a high-dimensional space, into a low-dimensional subspace so that the data in the low-dimensional space stay as consistent as possible with their high-dimensional characteristics. This preserves the global clustering structure of the high-dimensional data in the low-dimensional space and allows the clustering relations among the noise samples to be analyzed visually, so that samples with similar time-frequency characteristics are grouped into one class for classification and recognition, which improves recognition accuracy.
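The ranking-and-division step that produces the classified voice set can be sketched as follows. The function name, the score values, and the preset cut points are illustrative, not taken from the patent:

```python
def classify_by_score(samples, scores, cut_points):
    # Rank the noise samples by their score (e.g. the correlation
    # coefficient against the preset standard feature), largest first
    ranked = [s for _, s in sorted(zip(scores, samples), key=lambda p: -p[0])]
    # Divide the ranked list at the preset division points into classes
    groups, prev = [], 0
    for cut in list(cut_points) + [len(ranked)]:
        groups.append(ranked[prev:cut])
        prev = cut
    return groups
```

Each returned group then corresponds to one class of the classified voice set (for example, one voice scene).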
Further, the classified voice set is segmented into a training voice set and a test voice set; the scene recognition model is constructed using the training voice set and then test-adjusted using the test voice set to obtain the standard scene recognition model. The classified voice set is segmented according to a preset segmentation ratio to obtain the training voice set and the test voice set.
Preferably, the segmentation ratio is training voice set : test voice set = 7 : 3.
The training voice set is used for subsequent model training and serves as the sample for model fitting; the test voice set is used to tune the model's hyperparameters and to make a preliminary assessment of the model's capability, in particular its generalization ability.
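A minimal sketch of the 7:3 segmentation, assuming a simple shuffled split (the seed and function name are illustrative):

```python
import random

def split_speech_set(samples, ratio=0.7, seed=0):
    # Shuffle the classified voice set, then cut at the 7:3 boundary
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]  # (training set, test set)
```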
Specifically, the constructing and obtaining a scene recognition model by using the training speech set includes:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein a feature label is the category label used to extract the corresponding audio features from the noise sample set under each scene;
sorting the Gini index set and selecting the label corresponding to the smallest Gini index in the set as a segmentation point;
taking the segmentation point as a root node of an initial decision tree, starting from the segmentation point to generate child nodes, distributing the training speech set to the child nodes, and generating the initial decision tree until all labels in the feature labels are traversed;
and pruning the initial decision tree to obtain a scene recognition model.
Specifically, calculating the Gini index between each feature label and the corresponding training data set includes:
calculating the Gini index between each feature label and the training voice set corresponding to that feature label using the following formula:
Gini(p) = Σ_{k=1}^{K} p_k (1 − p_k) = 1 − Σ_{k=1}^{K} p_k²
where Gini(p) is the Gini index, p_k corresponds to the k-th frame of data in the training voice set, and K is the number of frames in the training voice set.
In detail, the Gini index represents the impurity of the model: the smaller the Gini index, the lower the impurity and the better the feature.
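The Gini index over a set of labels can be sketched as follows, assuming (a common reading) that p_k is interpreted as the proportion of class k in the set:

```python
from collections import Counter

def gini_index(labels):
    # Gini(p) = 1 - sum_k p_k^2, with p_k the proportion of class k;
    # 0 means a pure set, larger values mean higher impurity
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
```

During tree construction, the candidate split with the smallest resulting Gini index is chosen, matching the segmentation-point selection described above.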
Further, the pruning the initial decision tree to obtain a scene recognition model includes:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
and pruning the non-leaf nodes of which the surface error gain values are smaller than a preset gain threshold value to obtain a scene recognition model.
In the embodiment of the present application, the preset gain threshold is 0.5.
Further, calculating the surface error gain values of all non-leaf nodes on the initial decision tree includes:
calculating the surface error gain value of each non-leaf node on the initial decision tree using the following gain formulas:
α = (R(t) − R(T_t)) / (|N(T_t)| − 1)
R(t) = r(t) × p(t)
where α is the surface error gain value, R(t) is the error cost of node t treated as a single leaf, R(T_t) is the error cost of the subtree rooted at node t, |N(T_t)| is the number of leaf nodes in that subtree, r(t) is the error rate of the node, and p(t) is the ratio of the number of samples at the node to the number of all samples.
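Under these definitions, the surface error gain of one non-leaf node can be sketched as follows (the parameter names and the values in the usage below are illustrative):

```python
def surface_error_gain(node_error_rate, node_sample_ratio,
                       leaf_error_costs, n_leaves):
    # R(t): error cost of node t if it were collapsed into a single leaf
    R_t = node_error_rate * node_sample_ratio
    # R(T_t): total error cost of the subtree rooted at t (sum over its leaves)
    R_Tt = sum(leaf_error_costs)
    # alpha = (R(t) - R(T_t)) / (|N(T_t)| - 1)
    return (R_t - R_Tt) / (n_leaves - 1)
```

Nodes whose gain value falls below the preset threshold (0.5 in this embodiment) are pruned, i.e. their subtree is replaced by a single leaf.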
Specifically, referring to fig. 2, the performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model includes:
S101: performing scene recognition processing on the test voice set using the scene recognition model to obtain the recognition result corresponding to the test voice set;
S102: when the recognition result corresponding to the test voice set is inconsistent with the feature labels corresponding to the test voice set, retraining the scene recognition model with the training voice set until the recognition result is consistent with the feature labels, thereby obtaining the standard scene recognition model.
Further, after the steps of segmenting the classified speech set into a training speech set and a testing speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the testing speech set to obtain a standard scene recognition model, the method further comprises:
and establishing a noise reduction model corresponding to each scene according to the collected noise sample set under each scene for calling.
S2: inputting the voice data into a preset standard scene recognition model and determining the voice scene corresponding to the voice data, where the standard scene recognition model is trained on a noise sample set collected under each scene.
In the embodiment of the application, the acquired voice data is input into the preset standard scene recognition model, the preset standard scene recognition model performs scene recognition processing on the voice data, and the voice scene corresponding to the voice data is output.
S3: selecting a preset noise reduction model corresponding to the voice scene, and reducing the noise of the voice data.
In the embodiments of the present application, the available noise reduction models include a dynamic time warping model, a vector quantization model, a hidden Markov model, and the like. According to the voice scene corresponding to the voice data and the characteristics of each noise reduction model, the corresponding noise reduction model is selected to perform the noise reduction operation on the voice data and obtain the noise reduction result.
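The selection in steps S2 and S3 amounts to a dispatch table from recognized scene to pre-built noise reduction model. A hypothetical sketch (the scene names and reducer functions are illustrative placeholders, not the patent's models):

```python
def denoise(voice_data, recognize_scene, reducers):
    # Step S2: determine the voice scene via the standard scene
    # recognition model (here passed in as a callable)
    scene = recognize_scene(voice_data)
    # Step S3: select the preset noise reduction model matching that
    # scene from the registry built during training, and apply it
    reducer = reducers[scene]
    return reducer(voice_data)
```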
According to the embodiments of the present application, the acquired voice data is input into the preset standard scene recognition model, which recognizes the voice scene corresponding to the voice data and thereby determines the voice environment in which the data was recorded. The preset noise reduction model corresponding to that voice scene is then selected to denoise the voice data, which improves the accuracy of voice noise reduction.
As shown in fig. 3, an embodiment of the present application provides a schematic block diagram of a speech noise reduction apparatus 10, where the speech noise reduction apparatus 10 includes: a voice data acquisition module 11, a voice scene recognition module 12 and a noise reduction module 13.
The voice data acquisition module 11 is configured to acquire voice data;
the speech scene recognition module 12 is configured to input the speech data into a preset standard scene recognition model, and determine a speech scene corresponding to the speech data, where the standard scene recognition model is obtained by training according to a noise sample set in each scene;
and the noise reduction module 13 is configured to select a preset noise reduction model corresponding to the voice scene to reduce noise of the voice data.
In detail, in the embodiment of the present application, when being used, each module in the speech noise reduction apparatus 10 adopts the same technical means as the speech noise reduction method described in fig. 1, and can produce the same technical effect, which is not described herein again.
As shown in fig. 4, an embodiment of the present application provides a voice noise reduction device, which includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 complete mutual communication through the communication bus 114,
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111 is configured to execute the program stored in the memory 113 so as to implement the voice noise reduction method provided in any of the foregoing method embodiments, including:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and selecting a preset noise reduction model corresponding to the voice scene, and reducing noise of the voice data.
The communication bus 114 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 114 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 112 is used for communication between the above-described voice noise reduction device and other devices.
The memory 113 may include a Random Access Memory (RAM), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory 113 may also be at least one storage device located remotely from the processor 111.
The processor 111 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the voice noise reduction method provided in any one of the foregoing method embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.

The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Drive (SSD)), among others.

It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions.
Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (11)
1. A method for speech noise reduction, the method comprising:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and selecting a preset noise reduction model corresponding to the voice scene, and reducing noise of the voice data.
2. The method of claim 1, wherein the step of obtaining speech data is preceded by the steps of:
collecting a noise sample set under each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
and segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model.
3. The method of claim 2, wherein after the steps of segmenting the classified speech set into a training speech set and a testing speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the testing speech set to obtain a standard scene recognition model, the method further comprises:
and establishing a noise reduction model corresponding to each scene according to the collected noise sample set under each scene for calling.
4. The method of claim 2, wherein the constructing the scene recognition model using the training speech set comprises:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein a feature label is a category label extracted from the noise sample set under each scene when obtaining the corresponding audio features;
sorting the Gini index set from large to small, and selecting the feature label corresponding to the smallest Gini index in the Gini index set as a dividing point;
taking the dividing point as a root node of an initial decision tree, generating child nodes starting from the dividing point, and distributing the training voice set to the child nodes to generate the initial decision tree, until all of the feature labels are traversed;
and pruning the initial decision tree to obtain a scene recognition model.
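The Gini-index split selection recited above follows the standard CART construction. A minimal sketch of the impurity computation, under the assumption (not stated explicitly in the claims) that the "Gini index" is the usual CART impurity 1 - Σ p_k², is:

```python
# Sketch of CART-style Gini split selection; the exact label bookkeeping used in
# the patent is not specified, so this only illustrates the standard computation.
from collections import Counter

def gini(labels):
    """Gini impurity of a set of scene labels: 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_for_split(left, right):
    """Weighted Gini impurity after a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A perfectly separating split has impurity 0; among candidate dividing points,
# the one with the smallest Gini index would be chosen, as in the claim.
print(gini_for_split(["street", "street"], ["office", "office"]))  # 0.0
```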
5. The method of claim 4, wherein the pruning the initial decision tree to obtain a scene recognition model comprises:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
and pruning the non-leaf nodes of which the surface error gain values are smaller than a preset gain threshold value to obtain a scene recognition model.
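The "surface error gain value" above can be read as the per-leaf error increase used in standard CART cost-complexity pruning; this reading is an assumption, since the claims do not define the quantity explicitly.

```python
# Hedged sketch: surface error gain read as the cost-complexity measure
# alpha = (R(t) - R(T_t)) / (|leaves(T_t)| - 1), i.e. the per-leaf increase in
# error from collapsing subtree T_t into its root node t.
def surface_error_gain(node_error: float, subtree_error: float, n_leaves: int) -> float:
    return (node_error - subtree_error) / (n_leaves - 1)

# Non-leaf nodes whose gain falls below a preset threshold would be pruned.
GAIN_THRESHOLD = 0.05
alpha = surface_error_gain(node_error=0.5, subtree_error=0.1, n_leaves=5)
print(alpha, alpha < GAIN_THRESHOLD)  # 0.1 False -> this node is kept
```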
6. The method of claim 4, wherein the performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model comprises:
carrying out scene recognition processing on the test voice set by using the scene recognition model to obtain a recognition result corresponding to the test voice set;
and when the recognition result corresponding to the test voice set is inconsistent with the feature labels corresponding to the test voice set, retraining the scene recognition model with the training voice set until the recognition result corresponding to the test voice set is consistent with the feature labels corresponding to the test voice set, so as to obtain a standard scene recognition model.
7. The method of claim 2, wherein the performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set comprises:
acquiring a preset standard feature, and calculating a conditional probability value between the audio feature and the standard feature;
and sorting each noise sample in the noise sample set according to the magnitude of the conditional probability value, and dividing the sorted noise sample set by using a preset audio interval as a dividing point to obtain a classified voice set.
8. The method according to any one of claims 1 to 5, wherein the collecting a noise sample set in each scene and extracting audio features from each noise sample comprises:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transform are carried out on the noise sample set to obtain a short-time frequency spectrum of the noise sample set;
performing modular squaring on the short-time frequency spectrum to obtain a power spectrum of the noise sample set;
and calculating the power spectrum by utilizing a preset Mel-scale triangular filter group to obtain logarithmic energy, and performing discrete cosine transform on the logarithmic energy to obtain the audio characteristics corresponding to each noise sample.
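The feature-extraction chain above is the classical MFCC pipeline. The following numpy-only sketch uses illustrative frame, FFT, and filter-bank parameters that the claims do not fix (16 kHz sample rate, 25 ms frames, 26 mel filters, 13 cepstral coefficients are all assumptions):

```python
# Sketch of the claimed chain: pre-emphasis -> framing -> windowing -> FFT
# (short-time spectrum) -> modular squaring (power spectrum) -> mel triangular
# filter bank (log energy) -> DCT (audio features). Parameters are illustrative.
import numpy as np

def mfcc_like(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    # 1. Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing + Hamming windowing.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 3. Fast Fourier transform -> short-time spectrum; modular squaring -> power spectrum.
    spectrum = np.fft.rfft(frames, n=512)
    power = (np.abs(spectrum) ** 2) / 512
    # 4. Mel-scale triangular filter bank -> logarithmic energy.
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((512 + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, power.shape[1]))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # 5. Discrete cosine transform (DCT-II) -> cepstral audio features.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_mels))
    return log_energy @ basis.T

feats = mfcc_like(np.random.randn(16000))  # 1 s of noise
print(feats.shape)                         # (98, 13): one 13-d feature vector per frame
```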
9. An apparatus for speech noise reduction, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data;
the voice scene recognition module is used for inputting the voice data into a preset standard scene recognition model and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and the noise reduction module is used for selecting a preset noise reduction model corresponding to the voice scene and reducing noise of the voice data.
10. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the speech noise reduction method according to any one of claims 1 to 8 when executing a program stored in the memory.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for speech noise reduction according to any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110699792.6A CN113327626B (en) | 2021-06-23 | 2021-06-23 | Voice noise reduction method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113327626A true CN113327626A (en) | 2021-08-31 |
CN113327626B CN113327626B (en) | 2023-09-08 |
Family
ID=77424416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110699792.6A Active CN113327626B (en) | 2021-06-23 | 2021-06-23 | Voice noise reduction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113327626B (en) |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101710490A (en) * | 2009-11-20 | 2010-05-19 | 安徽科大讯飞信息科技股份有限公司 | Method and device for compensating noise for voice assessment |
US20120225719A1 (en) * | 2011-03-04 | 2012-09-06 | Mirosoft Corporation | Gesture Detection and Recognition |
KR20120129421A (en) * | 2011-05-20 | 2012-11-28 | 고려대학교 산학협력단 | Apparatus and method of speech recognition for number |
US8438029B1 (en) * | 2012-08-22 | 2013-05-07 | Google Inc. | Confidence tying for unsupervised synthetic speech adaptation |
CN106611183A (en) * | 2016-05-30 | 2017-05-03 | 四川用联信息技术有限公司 | Method for constructing Gini coefficient and misclassification cost-sensitive decision tree |
KR20180046062A (en) * | 2016-10-27 | 2018-05-08 | 에스케이텔레콤 주식회사 | Method for speech endpoint detection using normalizaion and apparatus thereof |
CN108181107A (en) * | 2018-01-12 | 2018-06-19 | 东北电力大学 | The Wind turbines bearing mechanical method for diagnosing faults of meter and more class objects |
CN108198547A (en) * | 2018-01-18 | 2018-06-22 | 深圳市北科瑞声科技股份有限公司 | Sound end detecting method, device, computer equipment and storage medium |
US20180254041A1 (en) * | 2016-04-11 | 2018-09-06 | Sonde Health, Inc. | System and method for activation of voice interactive services based on user state |
CN109285538A (en) * | 2018-09-19 | 2019-01-29 | 宁波大学 | A kind of mobile phone source title method under the additive noise environment based on normal Q transform domain |
WO2019237519A1 (en) * | 2018-06-11 | 2019-12-19 | 平安科技(深圳)有限公司 | General vector training method, voice clustering method, apparatus, device and medium |
CN110769111A (en) * | 2019-10-28 | 2020-02-07 | 珠海格力电器股份有限公司 | Noise reduction method, system, storage medium and terminal |
CN111754988A (en) * | 2020-06-23 | 2020-10-09 | 南京工程学院 | Sound scene classification method based on attention mechanism and double-path depth residual error network |
CN111916066A (en) * | 2020-08-13 | 2020-11-10 | 山东大学 | Random forest based voice tone recognition method and system |
CN111933175A (en) * | 2020-08-06 | 2020-11-13 | 北京中电慧声科技有限公司 | Active voice detection method and system based on noise scene recognition |
CN112614504A (en) * | 2020-12-22 | 2021-04-06 | 平安科技(深圳)有限公司 | Single sound channel voice noise reduction method, system, equipment and readable storage medium |
CN112863667A (en) * | 2021-01-22 | 2021-05-28 | 杭州电子科技大学 | Lung sound diagnosis device based on deep learning |
Non-Patent Citations (1)
Title |
---|
涂晴宇 (TU, Qingyu): "Speech Emotion Recognition and Sensitive Word Detection in Text for Human-Computer Interaction", China Master's Theses Full-text Database (Information Science and Technology), pages 136-201 *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113516985A (en) * | 2021-09-13 | 2021-10-19 | 北京易真学思教育科技有限公司 | Speech recognition method, apparatus and non-volatile computer-readable storage medium |
CN113793620A (en) * | 2021-11-17 | 2021-12-14 | 深圳市北科瑞声科技股份有限公司 | Voice noise reduction method, device and equipment based on scene classification and storage medium |
CN113793620B (en) * | 2021-11-17 | 2022-03-08 | 深圳市北科瑞声科技股份有限公司 | Voice noise reduction method, device and equipment based on scene classification and storage medium |
CN114666092A (en) * | 2022-02-16 | 2022-06-24 | 奇安信科技集团股份有限公司 | Real-time behavior safety baseline data noise reduction method and device for safety analysis |
CN114566160A (en) * | 2022-03-01 | 2022-05-31 | 游密科技(深圳)有限公司 | Voice processing method and device, computer equipment and storage medium |
CN114333881A (en) * | 2022-03-09 | 2022-04-12 | 深圳市迪斯声学有限公司 | Audio transmission noise reduction method, device, equipment and medium based on environment self-adaptation |
CN114974279A (en) * | 2022-05-10 | 2022-08-30 | 中移(杭州)信息技术有限公司 | Sound quality control method, device, equipment and storage medium |
CN115331689A (en) * | 2022-08-11 | 2022-11-11 | 北京声智科技有限公司 | Training method, device, equipment, storage medium and product of voice noise reduction model |
CN116758934A (en) * | 2023-08-18 | 2023-09-15 | 深圳市微克科技有限公司 | Method, system and medium for realizing intercom function of intelligent wearable device |
CN116758934B (en) * | 2023-08-18 | 2023-11-07 | 深圳市微克科技有限公司 | Method, system and medium for realizing intercom function of intelligent wearable device |
CN116994599A (en) * | 2023-09-13 | 2023-11-03 | 湖北星纪魅族科技有限公司 | Audio noise reduction method for electronic equipment, electronic equipment and storage medium |
CN117202071A (en) * | 2023-09-21 | 2023-12-08 | 广东金海纳实业有限公司 | Test method and system of noise reduction earphone |
CN117202071B (en) * | 2023-09-21 | 2024-03-29 | 广东金海纳实业有限公司 | Test method and system of noise reduction earphone |
Also Published As
Publication number | Publication date |
---|---|
CN113327626B (en) | 2023-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113327626B (en) | Voice noise reduction method, device, equipment and storage medium | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
US8160877B1 (en) | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting | |
CN109493881B (en) | Method and device for labeling audio and computing equipment | |
CN103943104B (en) | A kind of voice messaging knows method for distinguishing and terminal unit | |
CN109256138B (en) | Identity verification method, terminal device and computer readable storage medium | |
CN109065071B (en) | Song clustering method based on iterative k-means algorithm | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN111724770B (en) | Audio keyword identification method for generating confrontation network based on deep convolution | |
CN108899033B (en) | Method and device for determining speaker characteristics | |
CN111540342B (en) | Energy threshold adjusting method, device, equipment and medium | |
CN109378014A (en) | A kind of mobile device source discrimination and system based on convolutional neural networks | |
CN111402922B (en) | Audio signal classification method, device, equipment and storage medium based on small samples | |
CN117935789A (en) | Speech recognition method, system, equipment and storage medium | |
Kaminski et al. | Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models | |
Leow et al. | Language-resource independent speech segmentation using cues from a spectrogram image | |
CN107993666B (en) | Speech recognition method, speech recognition device, computer equipment and readable storage medium | |
CN115938346A (en) | Intonation evaluation method, system, equipment and storage medium | |
Marković et al. | Reverberation-based feature extraction for acoustic scene classification | |
CN112309404B (en) | Machine voice authentication method, device, equipment and storage medium | |
CN111933153B (en) | Voice segmentation point determining method and device | |
Patil et al. | Content-based audio classification and retrieval: A novel approach | |
CN114446284A (en) | Speaker log generation method and device, computer equipment and readable storage medium | |
CN114302301A (en) | Frequency response correction method and related product | |
Therese et al. | A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||