CN113327626A - Voice noise reduction method, device, equipment and storage medium - Google Patents
- Publication number
- CN113327626A CN113327626A CN202110699792.6A CN202110699792A CN113327626A CN 113327626 A CN113327626 A CN 113327626A CN 202110699792 A CN202110699792 A CN 202110699792A CN 113327626 A CN113327626 A CN 113327626A
- Authority
- CN
- China
- Prior art keywords
- voice
- scene
- recognition model
- scene recognition
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L15/063—Training
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2015/0631—Creating reference templates; Clustering
- Y02T90/00—Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation
Abstract
The present application relates to the technical field of audio processing and discloses a voice noise reduction method comprising the following steps: acquiring voice data; inputting the voice data into a preset standard scene recognition model to determine the voice scene corresponding to the voice data, where the standard scene recognition model is trained on a noise sample set collected under each scene; and selecting the preset noise reduction model corresponding to that voice scene to denoise the voice data. The application further discloses a voice noise reduction apparatus, device, and storage medium. The present application can improve the accuracy of voice noise reduction.
Description
Technical Field
The present application relates to the field of audio processing technologies, and in particular to a speech noise reduction method, apparatus, device, and storage medium.
Background
Everyday life is filled with all kinds of noisy audio data, for example audio recorded at the roadside, in a park, or in an office. The audio characteristics of the noise differ across voice scenes, and the appropriate voice noise reduction techniques differ accordingly.
The existing voice noise reduction method is based on the user's voiceprint features: the user's voice in the audio is enhanced according to the voiceprint features, thereby weakening the background noise and completing the noise reduction. In a practical application scenario, however, when the background noise is too loud, this method cannot weaken the background noise merely by enhancing the user's voice, so its noise reduction accuracy is low.
Disclosure of Invention
In order to solve the above technical problem, or at least partially solve it, the present application provides a voice noise reduction method, apparatus, device, and storage medium.
In a first aspect, the present application provides a method for speech noise reduction, the method comprising:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and selecting a preset noise reduction model corresponding to the voice scene, and reducing noise of the voice data.
In one embodiment of the first aspect, the step of obtaining the voice data is preceded by:
collecting a noise sample set under each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
and segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model.
In one embodiment of the first aspect, after the steps of segmenting the classified speech set into a training speech set and a testing speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the testing speech set to obtain a standard scene recognition model, the method further includes:
and establishing a noise reduction model corresponding to each scene according to the collected noise sample set under each scene for calling.
In one embodiment of the first aspect, the constructing a scene recognition model by using the training speech set includes:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein a feature label is the category label used to extract the corresponding audio features from the noise sample set under each scene;
sorting the Gini index set and selecting the label corresponding to the smallest Gini index in the set as a segmentation point;
taking the segmentation point as a root node of an initial decision tree, starting from the segmentation point to generate child nodes, distributing the training speech set to the child nodes, and generating the initial decision tree until all labels in the feature labels are traversed;
and pruning the initial decision tree to obtain a scene recognition model.
In one embodiment of the first aspect, the pruning the initial decision tree to obtain a scene recognition model includes:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
and pruning the non-leaf nodes of which the surface error gain values are smaller than a preset gain threshold value to obtain a scene recognition model.
In one embodiment of the first aspect, the performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model includes:
carrying out scene recognition processing on the test voice set by using the scene recognition model to obtain a recognition result corresponding to the test voice set;
and when the recognition result corresponding to the test voice set is inconsistent with the feature labels corresponding to the test voice set, retraining the scene recognition model with the training voice set until the recognition result corresponding to the test voice set is consistent with the feature labels, thereby obtaining the standard scene recognition model.
In one embodiment of the first aspect, the performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set includes:
acquiring a preset standard feature, and calculating a conditional probability value between the audio feature and the standard feature;
and sequencing each noise sample in the noise sample set according to the size of the conditional probability value, and dividing the sequenced noise sample set by using a preset audio interval as a dividing point to obtain a classified voice set.
In one embodiment of the first aspect, acquiring a noise sample set in each scene, and extracting audio features from each noise sample includes:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transform are carried out on the noise sample set to obtain a short-time frequency spectrum of the noise sample set;
performing modular squaring on the short-time frequency spectrum to obtain a power spectrum of the noise sample set;
and calculating the power spectrum by utilizing a preset Mel-scale triangular filter group to obtain logarithmic energy, and performing discrete cosine transform on the logarithmic energy to obtain the audio characteristics corresponding to each noise sample.
In a second aspect, the present application provides a voice noise reduction apparatus, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data;
the voice scene recognition module is used for inputting the voice data into a preset standard scene recognition model and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and the noise reduction module is used for selecting a preset noise reduction model corresponding to the voice scene and reducing noise of the voice data.
In a third aspect, a voice noise reduction device is provided, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the speech noise reduction method according to any embodiment of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the speech noise reduction method according to any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the embodiment of the application, the obtained voice data are input into the preset standard scene recognition model, the standard scene recognition model is used for recognizing the voice scene corresponding to the voice data, the voice scene corresponding to the voice data can be determined, the voice environment where the voice data are located is selected, the preset noise reduction model corresponding to the voice scene is selected, the voice data are subjected to noise reduction, the noise is reduced through the noise reduction model matched with the scene, the noise reduction operation is executed more accurately, and therefore the purpose of improving the accuracy of voice noise reduction is achieved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a speech noise reduction method according to an embodiment of the present application;
fig. 2 is a schematic flowchart illustrating a process of testing and adjusting a scene recognition model in a speech noise reduction method according to an embodiment of the present application;
fig. 3 is a schematic block diagram of an apparatus for speech noise reduction according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device for reducing noise in voice according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of a speech noise reduction method according to an embodiment of the present application. In this embodiment, the voice noise reduction method includes:
and S1, acquiring voice data.
In the embodiments of the present invention, the voice data is noise-containing audio data awaiting noise reduction, so that subsequent audio processing such as speech recognition can be performed. Specifically, the voice data may be audio data collected in any voice scene.
Further, before the step of acquiring voice data, the method includes:
collecting a noise sample set under each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
and segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model.
In detail, in the embodiments of the present application, the noise sample set includes noise audio data collected in each voice scene, for example in a park, at the roadside, or in an office. In the embodiments of the present invention, the noise sample set may further include a feature label for each noise sample, which labels the sample so that the corresponding audio features can be extracted. The audio features may include the zero-crossing rate, Mel-frequency cepstral coefficients, the spectral centroid, spectral spread, spectral entropy, spectral flux, and the like; in the embodiments of the present application, Mel-frequency cepstral coefficients are preferred.
Specifically, the acquiring a noise sample set under each scene, and extracting audio features from each noise sample includes:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transform are carried out on the noise sample set to obtain a short-time frequency spectrum of the noise sample set;
performing modular squaring on the short-time frequency spectrum to obtain a power spectrum of the noise sample set;
and calculating the power spectrum by utilizing a preset Mel-scale triangular filter group to obtain logarithmic energy, and performing discrete cosine transform on the logarithmic energy to obtain the audio characteristics corresponding to each noise sample.
In an alternative embodiment of the present application, the noise sample set is pre-emphasized by a preset high-pass filter to obtain a high-frequency-boosted noise sample set; pre-emphasis strengthens the high-frequency portion of the speech signal in the noise sample set.
Pre-emphasizing the noise sample set highlights the formants in the high-frequency part of each noise sample.
In an optional embodiment of the present application, the high-frequency-boosted noise sample set is segmented into frames of a preset number of sampling points, yielding a framed data set.
Preferably, in the embodiments of the present application, each frame contains 512 or 256 sampling points.
In an optional embodiment of the present application, the windowing process is to perform windowing on each frame in the frame data set according to a preset window function, so as to obtain a windowed signal.
In detail, the preset window function is:
S′(n)=S(n)×W(n)
where S′(n) is the windowed signal, S(n) is the framed data, W(n) is the window function, N is the frame size, and n is the sample index within a frame.
Preferably, in the embodiments of the present application, the preset window function may be a Hamming window, in which case W(n) is the expression of the Hamming window.
Windowing the framed data set improves the continuity between the left and right ends of each frame and reduces spectral leakage.
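As an illustration, the Hamming window, a common choice for W(n), can be computed directly from its standard expression (the 0.54/0.46 coefficients are the conventional values, assumed rather than stated in the patent):

```python
import numpy as np

def hamming(N):
    # W(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), for n = 0 .. N-1
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
```

Multiplying each 512-sample (or 256-sample) frame by this window element-wise yields the windowed signal S′(n).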
Further, the embodiments of the present invention perform the fast Fourier transform using the following formula:
S(k) = Σ_{n=0}^{N−1} S′(n) e^{−j2πnk/N}, 0 ≤ k ≤ N−1
and take the modulus squared of the short-time spectrum using the following formula:
P(k) = |S(k)|² / N
where S(k) is the short-time spectrum, P(k) is the power spectrum, S′(n) is the windowed signal, N is the frame size, n is the sample index, and k is the frequency bin index on the short-time spectrum.
Since the characteristics of a signal are usually difficult to observe from its time-domain form, the embodiments of the present invention convert the noise sample set into an energy distribution in the frequency domain; different energy distributions can represent the characteristics of different voices.
Further, in the embodiments of the present invention, the log energy produced by the Mel-scale triangular filter bank is:
t(m) = ln( Σ_{k=0}^{N−1} P(k) · H_m(k) )
where t(m) is the log energy of the m-th filter, P(k) is the power spectrum, H_m(k) is the frequency response of the m-th triangular filter, N is the frame size, and k is the frequency bin index on the short-time spectrum.
Using the triangular filters to compute the log energy of the power spectrum smooths the short-time spectrum, eliminates harmonics, and highlights the formants in the voice information.
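The whole feature-extraction pipeline described above (pre-emphasis, framing, windowing, FFT, power spectrum, Mel filter bank, log energy, discrete cosine transform) can be sketched in NumPy. This is a minimal sketch, not the patent's implementation: the 0.97 pre-emphasis coefficient, 26 filters, and 13 cepstral coefficients are conventional defaults assumed here.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=512, hop=256, n_filters=26, n_ceps=13):
    """Sketch: pre-emphasis -> framing -> windowing -> FFT -> power
    spectrum -> Mel filter bank -> log energy -> DCT -> MFCCs."""
    # Pre-emphasis (high-pass): highlight the high-frequency formants
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing: overlapping frames of frame_len samples, hop samples apart
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: S'(n) = S(n) * W(n) with a Hamming window
    frames = frames * np.hamming(frame_len)
    # FFT -> short-time spectrum; modulus squared -> power spectrum P(k)
    spectrum = np.fft.rfft(frames, frame_len)
    power = np.abs(spectrum) ** 2 / frame_len
    # Mel-scale triangular filter bank H_m(k)
    high_mel = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    hz = 700.0 * (10.0 ** (np.linspace(0.0, high_mel, n_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((frame_len + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log energy t(m), then an orthonormal DCT-II to decorrelate -> MFCCs
    log_energy = np.log(power @ fbank.T + 1e-10)
    M = log_energy.shape[1]
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(M) + 0.5) / M)
    ceps = log_energy @ basis.T * np.sqrt(2.0 / M)
    ceps[:, 0] /= np.sqrt(2.0)
    return ceps
```

Each row of the returned matrix is the audio feature vector (Mel-frequency cepstral coefficients) for one frame of one noise sample.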
Specifically, the performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set includes:
acquiring a preset standard feature, and calculating a correlation coefficient between the audio feature and the standard feature;
and sequencing each noise sample in the noise sample set according to the magnitude of the correlation coefficient, and dividing the sequenced noise sample set by using a preset audio interval as a dividing point to obtain a classified voice set.
Wherein the classified voice set comprises voices in different scenes, such as voice in a road scene, voice in a park scene, and the like.
In detail, the correlation coefficient between the audio feature corresponding to each noise sample in the noise sample set and the standard feature is calculated using the following formula:
q_{ij} = exp(−‖y_i − y_j‖²) / Σ_{k≠l} exp(−‖y_k − y_l‖²)
where q_{ij} is the correlation coefficient, y_i is the audio feature corresponding to the noise sample, y_j is the standard feature, exp is the exponential function, and y_k and y_l are fixed parameters.
Specifically, the cluster analysis embeds the noise samples, originally distributed in a high-dimensional space, into a low-dimensional subspace so that the data in the low-dimensional space stay as consistent as possible with their high-dimensional characteristics. This preserves the global clustering structure of the high-dimensional data in the low-dimensional space and allows the clustering relations among the noise samples to be analyzed visually, so that samples with similar time-frequency characteristics are grouped into one class for classification and recognition, which improves recognition accuracy.
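The ranking-and-division step that produces the classified voice set can be sketched as follows. The function name, the score values, and the preset cut points are illustrative, not taken from the patent:

```python
def classify_by_score(samples, scores, cut_points):
    # Rank the noise samples by their score (e.g. the correlation
    # coefficient against the preset standard feature), largest first
    ranked = [s for _, s in sorted(zip(scores, samples), key=lambda p: -p[0])]
    # Divide the ranked list at the preset division points into classes
    groups, prev = [], 0
    for cut in list(cut_points) + [len(ranked)]:
        groups.append(ranked[prev:cut])
        prev = cut
    return groups
```

Each returned group then corresponds to one class of the classified voice set (for example, one voice scene).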
Further, the classified voice set is segmented into a training voice set and a test voice set; the scene recognition model is constructed using the training voice set and then test-adjusted using the test voice set to obtain the standard scene recognition model. The classified voice set is segmented according to a preset segmentation ratio to obtain the training voice set and the test voice set.
Preferably, the segmentation ratio is training voice set : test voice set = 7 : 3.
The training voice set is used for subsequent model training and serves as the sample for model fitting; the test voice set is used to tune the model's hyperparameters and to make a preliminary assessment of the model's capability, in particular its generalization ability.
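A minimal sketch of the 7:3 segmentation, assuming a simple shuffled split (the seed and function name are illustrative):

```python
import random

def split_speech_set(samples, ratio=0.7, seed=0):
    # Shuffle the classified voice set, then cut at the 7:3 boundary
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]  # (training set, test set)
```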
Specifically, the constructing and obtaining a scene recognition model by using the training speech set includes:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein a feature label is the category label used to extract the corresponding audio features from the noise sample set under each scene;
sorting the Gini index set and selecting the label corresponding to the smallest Gini index in the set as a segmentation point;
taking the segmentation point as a root node of an initial decision tree, starting from the segmentation point to generate child nodes, distributing the training speech set to the child nodes, and generating the initial decision tree until all labels in the feature labels are traversed;
and pruning the initial decision tree to obtain a scene recognition model.
Specifically, calculating the Gini index between each feature label and the corresponding training data set includes:
calculating the Gini index between each feature label and the training voice set corresponding to that feature label using the following formula:
Gini(p) = Σ_{k=1}^{K} p_k (1 − p_k) = 1 − Σ_{k=1}^{K} p_k²
where Gini(p) is the Gini index, p_k corresponds to the k-th frame of data in the training voice set, and K is the number of frames in the training voice set.
In detail, the Gini index represents the impurity of the model: the smaller the Gini index, the lower the impurity and the better the feature.
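The Gini index over a set of labels can be sketched as follows, assuming (a common reading) that p_k is interpreted as the proportion of class k in the set:

```python
from collections import Counter

def gini_index(labels):
    # Gini(p) = 1 - sum_k p_k^2, with p_k the proportion of class k;
    # 0 means a pure set, larger values mean higher impurity
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
```

During tree construction, the candidate split with the smallest resulting Gini index is chosen, matching the segmentation-point selection described above.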
Further, the pruning the initial decision tree to obtain a scene recognition model includes:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
and pruning the non-leaf nodes of which the surface error gain values are smaller than a preset gain threshold value to obtain a scene recognition model.
In the embodiment of the present application, the preset gain threshold is 0.5.
Further, calculating the surface error gain values of all non-leaf nodes on the initial decision tree includes:
calculating the surface error gain value of each non-leaf node on the initial decision tree using the following gain formulas:
α = (R(t) − R(T_t)) / (|N(T_t)| − 1)
R(t) = r(t) × p(t)
where α is the surface error gain value, R(t) is the error cost of node t treated as a single leaf, R(T_t) is the error cost of the subtree rooted at node t, |N(T_t)| is the number of leaf nodes in that subtree, r(t) is the error rate of the node, and p(t) is the ratio of the number of samples at the node to the number of all samples.
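Under these definitions, the surface error gain of one non-leaf node can be sketched as follows (the parameter names and the values in the usage below are illustrative):

```python
def surface_error_gain(node_error_rate, node_sample_ratio,
                       leaf_error_costs, n_leaves):
    # R(t): error cost of node t if it were collapsed into a single leaf
    R_t = node_error_rate * node_sample_ratio
    # R(T_t): total error cost of the subtree rooted at t (sum over its leaves)
    R_Tt = sum(leaf_error_costs)
    # alpha = (R(t) - R(T_t)) / (|N(T_t)| - 1)
    return (R_t - R_Tt) / (n_leaves - 1)
```

Nodes whose gain value falls below the preset threshold (0.5 in this embodiment) are pruned, i.e. their subtree is replaced by a single leaf.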
Specifically, referring to fig. 2, the performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model includes:
S101: performing scene recognition processing on the test voice set using the scene recognition model to obtain the recognition result corresponding to the test voice set;
S102: when the recognition result corresponding to the test voice set is inconsistent with the feature labels corresponding to the test voice set, retraining the scene recognition model with the training voice set until the recognition result is consistent with the feature labels, thereby obtaining the standard scene recognition model.
Further, after the steps of segmenting the classified speech set into a training speech set and a testing speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the testing speech set to obtain a standard scene recognition model, the method further comprises:
and establishing a noise reduction model corresponding to each scene according to the collected noise sample set under each scene for calling.
S2: inputting the voice data into a preset standard scene recognition model and determining the voice scene corresponding to the voice data, where the standard scene recognition model is trained on a noise sample set collected under each scene.
In the embodiment of the application, the acquired voice data is input into the preset standard scene recognition model, the preset standard scene recognition model performs scene recognition processing on the voice data, and the voice scene corresponding to the voice data is output.
S3: selecting a preset noise reduction model corresponding to the voice scene, and reducing the noise of the voice data.
In the embodiments of the present application, the available noise reduction models include a dynamic time warping model, a vector quantization model, a hidden Markov model, and the like. According to the voice scene corresponding to the voice data and the characteristics of each noise reduction model, the corresponding noise reduction model is selected to perform the noise reduction operation on the voice data and obtain the noise reduction result.
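The selection in steps S2 and S3 amounts to a dispatch table from recognized scene to pre-built noise reduction model. A hypothetical sketch (the scene names and reducer functions are illustrative placeholders, not the patent's models):

```python
def denoise(voice_data, recognize_scene, reducers):
    # Step S2: determine the voice scene via the standard scene
    # recognition model (here passed in as a callable)
    scene = recognize_scene(voice_data)
    # Step S3: select the preset noise reduction model matching that
    # scene from the registry built during training, and apply it
    reducer = reducers[scene]
    return reducer(voice_data)
```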
According to the embodiments of the present application, the acquired voice data is input into the preset standard scene recognition model, which recognizes the voice scene corresponding to the voice data and thereby determines the voice environment in which the data was recorded. The preset noise reduction model corresponding to that voice scene is then selected to denoise the voice data, which improves the accuracy of voice noise reduction.
As shown in fig. 3, an embodiment of the present application provides a schematic block diagram of a speech noise reduction apparatus 10, where the speech noise reduction apparatus 10 includes: a voice data acquisition module 11, a voice scene recognition module 12 and a noise reduction module 13.
The voice data acquisition module 11 is configured to acquire voice data;
the speech scene recognition module 12 is configured to input the speech data into a preset standard scene recognition model, and determine a speech scene corresponding to the speech data, where the standard scene recognition model is obtained by training according to a noise sample set in each scene;
and the noise reduction module 13 is configured to select a preset noise reduction model corresponding to the voice scene to reduce noise of the voice data.
In detail, in the embodiment of the present application, when being used, each module in the speech noise reduction apparatus 10 adopts the same technical means as the speech noise reduction method described in fig. 1, and can produce the same technical effect, which is not described herein again.
As shown in fig. 4, an embodiment of the present application provides a voice noise reduction device, which includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 complete mutual communication through the communication bus 114,
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111 is configured to execute the program stored in the memory 113 so as to implement the voice noise reduction method provided in any of the foregoing method embodiments, including:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and selecting a preset noise reduction model corresponding to the voice scene, and reducing noise of the voice data.
The communication bus 114 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 114 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 112 is used for communication between the above-described voice noise reduction device and other devices.
The memory 113 may include a Random Access Memory (RAM), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory 113 may also be at least one storage device located remotely from the processor 111.
The processor 111 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the voice noise reduction method provided in any one of the foregoing method embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.

The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Drive (SSD)), among others.

It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions.
Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (11)
1. A method for speech noise reduction, the method comprising:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and selecting a preset noise reduction model corresponding to the voice scene, and reducing noise of the voice data.
2. The method of claim 1, wherein the step of obtaining speech data is preceded by the steps of:
collecting a noise sample set under each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
and segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model.
3. The method of claim 2, wherein after the steps of segmenting the classified speech set into a training speech set and a testing speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the testing speech set to obtain a standard scene recognition model, the method further comprises:
and establishing a noise reduction model corresponding to each scene according to the collected noise sample set under each scene for calling.
4. The method of claim 2, wherein the constructing the scene recognition model using the training speech set comprises:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein a feature label is a category label extracted from the noise sample set under each scene when obtaining the corresponding audio features;
sorting the Gini index set from large to small, and selecting the feature label corresponding to the smallest Gini index in the Gini index set as a dividing point;
taking the dividing point as a root node of an initial decision tree, generating child nodes starting from the dividing point, and distributing the training voice set to the child nodes to generate the initial decision tree, until all of the feature labels are traversed;
and pruning the initial decision tree to obtain a scene recognition model.
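The Gini-index split selection recited above follows the standard CART construction. A minimal sketch of the impurity computation, under the assumption (not stated explicitly in the claims) that the "Gini index" is the usual CART impurity 1 - Σ p_k², is:

```python
# Sketch of CART-style Gini split selection; the exact label bookkeeping used in
# the patent is not specified, so this only illustrates the standard computation.
from collections import Counter

def gini(labels):
    """Gini impurity of a set of scene labels: 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_for_split(left, right):
    """Weighted Gini impurity after a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A perfectly separating split has impurity 0; among candidate dividing points,
# the one with the smallest Gini index would be chosen, as in the claim.
print(gini_for_split(["street", "street"], ["office", "office"]))  # 0.0
```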
5. The method of claim 4, wherein the pruning the initial decision tree to obtain a scene recognition model comprises:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
and pruning the non-leaf nodes of which the surface error gain values are smaller than a preset gain threshold value to obtain a scene recognition model.
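The "surface error gain value" above can be read as the per-leaf error increase used in standard CART cost-complexity pruning; this reading is an assumption, since the claims do not define the quantity explicitly.

```python
# Hedged sketch: surface error gain read as the cost-complexity measure
# alpha = (R(t) - R(T_t)) / (|leaves(T_t)| - 1), i.e. the per-leaf increase in
# error from collapsing subtree T_t into its root node t.
def surface_error_gain(node_error: float, subtree_error: float, n_leaves: int) -> float:
    return (node_error - subtree_error) / (n_leaves - 1)

# Non-leaf nodes whose gain falls below a preset threshold would be pruned.
GAIN_THRESHOLD = 0.05
alpha = surface_error_gain(node_error=0.5, subtree_error=0.1, n_leaves=5)
print(alpha, alpha < GAIN_THRESHOLD)  # 0.1 False -> this node is kept
```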
6. The method of claim 4, wherein the performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model comprises:
carrying out scene recognition processing on the test voice set by using the scene recognition model to obtain a recognition result corresponding to the test voice set;
and when the recognition result corresponding to the test voice set is inconsistent with the feature labels corresponding to the test voice set, retraining the scene recognition model with the training voice set until the recognition result corresponding to the test voice set is consistent with the feature labels corresponding to the test voice set, so as to obtain a standard scene recognition model.
7. The method of claim 2, wherein the performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set comprises:
acquiring a preset standard feature, and calculating a conditional probability value between the audio feature and the standard feature;
and sorting each noise sample in the noise sample set according to the magnitude of the conditional probability value, and dividing the sorted noise sample set by using a preset audio interval as a dividing point to obtain a classified voice set.
8. The method according to any one of claims 1 to 5, wherein the collecting a noise sample set in each scene and extracting audio features from each noise sample comprises:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transform are carried out on the noise sample set to obtain a short-time frequency spectrum of the noise sample set;
performing modular squaring on the short-time frequency spectrum to obtain a power spectrum of the noise sample set;
and calculating the power spectrum by utilizing a preset Mel-scale triangular filter group to obtain logarithmic energy, and performing discrete cosine transform on the logarithmic energy to obtain the audio characteristics corresponding to each noise sample.
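The feature-extraction chain above is the classical MFCC pipeline. The following numpy-only sketch uses illustrative frame, FFT, and filter-bank parameters that the claims do not fix (16 kHz sample rate, 25 ms frames, 26 mel filters, 13 cepstral coefficients are all assumptions):

```python
# Sketch of the claimed chain: pre-emphasis -> framing -> windowing -> FFT
# (short-time spectrum) -> modular squaring (power spectrum) -> mel triangular
# filter bank (log energy) -> DCT (audio features). Parameters are illustrative.
import numpy as np

def mfcc_like(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    # 1. Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing + Hamming windowing.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 3. Fast Fourier transform -> short-time spectrum; modular squaring -> power spectrum.
    spectrum = np.fft.rfft(frames, n=512)
    power = (np.abs(spectrum) ** 2) / 512
    # 4. Mel-scale triangular filter bank -> logarithmic energy.
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((512 + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, power.shape[1]))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # 5. Discrete cosine transform (DCT-II) -> cepstral audio features.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_mels))
    return log_energy @ basis.T

feats = mfcc_like(np.random.randn(16000))  # 1 s of noise
print(feats.shape)                         # (98, 13): one 13-d feature vector per frame
```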
9. An apparatus for speech noise reduction, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data;
the voice scene recognition module is used for inputting the voice data into a preset standard scene recognition model and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and the noise reduction module is used for selecting a preset noise reduction model corresponding to the voice scene and reducing noise of the voice data.
10. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the speech noise reduction method according to any one of claims 1 to 8 when executing a program stored in the memory.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for speech noise reduction according to any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110699792.6A CN113327626B (en) | 2021-06-23 | 2021-06-23 | Voice noise reduction method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113327626A true CN113327626A (en) | 2021-08-31 |
CN113327626B CN113327626B (en) | 2023-09-08 |
Family
ID=77424416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110699792.6A Active CN113327626B (en) | 2021-06-23 | 2021-06-23 | Voice noise reduction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113327626B (en) |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101710490A (en) * | 2009-11-20 | 2010-05-19 | 安徽科大讯飞信息科技股份有限公司 | Method and device for compensating noise for voice assessment |
US20120225719A1 (en) * | 2011-03-04 | 2012-09-06 | Mirosoft Corporation | Gesture Detection and Recognition |
KR20120129421A (en) * | 2011-05-20 | 2012-11-28 | 고려대학교 산학협력단 | Apparatus and method of speech recognition for number |
US8438029B1 (en) * | 2012-08-22 | 2013-05-07 | Google Inc. | Confidence tying for unsupervised synthetic speech adaptation |
CN106611183A (en) * | 2016-05-30 | 2017-05-03 | 四川用联信息技术有限公司 | Method for constructing Gini coefficient and misclassification cost-sensitive decision tree |
KR20180046062A (en) * | 2016-10-27 | 2018-05-08 | 에스케이텔레콤 주식회사 | Method for speech endpoint detection using normalizaion and apparatus thereof |
CN108181107A (en) * | 2018-01-12 | 2018-06-19 | 东北电力大学 | The Wind turbines bearing mechanical method for diagnosing faults of meter and more class objects |
CN108198547A (en) * | 2018-01-18 | 2018-06-22 | 深圳市北科瑞声科技股份有限公司 | Sound end detecting method, device, computer equipment and storage medium |
US20180254041A1 (en) * | 2016-04-11 | 2018-09-06 | Sonde Health, Inc. | System and method for activation of voice interactive services based on user state |
CN109285538A (en) * | 2018-09-19 | 2019-01-29 | 宁波大学 | A kind of mobile phone source title method under the additive noise environment based on normal Q transform domain |
WO2019237519A1 (en) * | 2018-06-11 | 2019-12-19 | 平安科技(深圳)有限公司 | General vector training method, voice clustering method, apparatus, device and medium |
CN110769111A (en) * | 2019-10-28 | 2020-02-07 | 珠海格力电器股份有限公司 | Noise reduction method, system, storage medium and terminal |
CN111754988A (en) * | 2020-06-23 | 2020-10-09 | 南京工程学院 | Sound scene classification method based on attention mechanism and double-path depth residual error network |
CN111916066A (en) * | 2020-08-13 | 2020-11-10 | 山东大学 | Random forest based voice tone recognition method and system |
CN111933175A (en) * | 2020-08-06 | 2020-11-13 | 北京中电慧声科技有限公司 | Active voice detection method and system based on noise scene recognition |
CN112614504A (en) * | 2020-12-22 | 2021-04-06 | 平安科技(深圳)有限公司 | Single sound channel voice noise reduction method, system, equipment and readable storage medium |
CN112863667A (en) * | 2021-01-22 | 2021-05-28 | 杭州电子科技大学 | Lung sound diagnosis device based on deep learning |
Non-Patent Citations (1)
Title |
---|
涂晴宇 (TU, Qingyu): "Speech Emotion Recognition and Sensitive Word Detection in Text for Human-Computer Interaction", China Master's Theses Full-text Database (Information Science and Technology), pages 136-201 *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113516985A (en) * | 2021-09-13 | 2021-10-19 | 北京易真学思教育科技有限公司 | Speech recognition method, apparatus and non-volatile computer-readable storage medium |
CN113793620A (en) * | 2021-11-17 | 2021-12-14 | 深圳市北科瑞声科技股份有限公司 | Voice noise reduction method, device and equipment based on scene classification and storage medium |
CN113793620B (en) * | 2021-11-17 | 2022-03-08 | 深圳市北科瑞声科技股份有限公司 | Voice noise reduction method, device and equipment based on scene classification and storage medium |
CN114666092A (en) * | 2022-02-16 | 2022-06-24 | 奇安信科技集团股份有限公司 | Real-time behavior safety baseline data noise reduction method and device for safety analysis |
CN114566160A (en) * | 2022-03-01 | 2022-05-31 | 游密科技(深圳)有限公司 | Voice processing method and device, computer equipment and storage medium |
CN114333881A (en) * | 2022-03-09 | 2022-04-12 | 深圳市迪斯声学有限公司 | Audio transmission noise reduction method, device, equipment and medium based on environment self-adaptation |
CN114974279A (en) * | 2022-05-10 | 2022-08-30 | 中移(杭州)信息技术有限公司 | Sound quality control method, device, equipment and storage medium |
CN115331689A (en) * | 2022-08-11 | 2022-11-11 | 北京声智科技有限公司 | Training method, device, equipment, storage medium and product of voice noise reduction model |
CN116758934A (en) * | 2023-08-18 | 2023-09-15 | 深圳市微克科技有限公司 | Method, system and medium for realizing intercom function of intelligent wearable device |
CN116758934B (en) * | 2023-08-18 | 2023-11-07 | 深圳市微克科技有限公司 | Method, system and medium for realizing intercom function of intelligent wearable device |
CN116994599A (en) * | 2023-09-13 | 2023-11-03 | 湖北星纪魅族科技有限公司 | Audio noise reduction method for electronic equipment, electronic equipment and storage medium |
CN117202071A (en) * | 2023-09-21 | 2023-12-08 | 广东金海纳实业有限公司 | Test method and system of noise reduction earphone |
CN117202071B (en) * | 2023-09-21 | 2024-03-29 | 广东金海纳实业有限公司 | Test method and system of noise reduction earphone |
Also Published As
Publication number | Publication date |
---|---|
CN113327626B (en) | 2023-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113327626B (en) | Voice noise reduction method, device, equipment and storage medium | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
US8160877B1 (en) | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting | |
CN109493881B (en) | Method and device for labeling audio and computing equipment | |
CN103943104B (en) | A kind of voice messaging knows method for distinguishing and terminal unit | |
CN109256138B (en) | Identity verification method, terminal device and computer readable storage medium | |
CN109065071B (en) | Song clustering method based on iterative k-means algorithm | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN111724770B (en) | Audio keyword identification method for generating confrontation network based on deep convolution | |
CN108899033B (en) | Method and device for determining speaker characteristics | |
CN111540342B (en) | Energy threshold adjusting method, device, equipment and medium | |
CN109378014A (en) | A kind of mobile device source discrimination and system based on convolutional neural networks | |
CN111402922B (en) | Audio signal classification method, device, equipment and storage medium based on small samples | |
CN117935789A (en) | Speech recognition method, system, equipment and storage medium | |
Kaminski et al. | Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models | |
Leow et al. | Language-resource independent speech segmentation using cues from a spectrogram image | |
CN107993666B (en) | Speech recognition method, speech recognition device, computer equipment and readable storage medium | |
CN115938346A (en) | Intonation evaluation method, system, equipment and storage medium | |
Marković et al. | Reverberation-based feature extraction for acoustic scene classification | |
CN112309404B (en) | Machine voice authentication method, device, equipment and storage medium | |
CN111933153B (en) | Voice segmentation point determining method and device | |
Patil et al. | Content-based audio classification and retrieval: A novel approach | |
CN114446284A (en) | Speaker log generation method and device, computer equipment and readable storage medium | |
CN114302301A (en) | Frequency response correction method and related product | |
Therese et al. | A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||