CN113327626A - Voice noise reduction method, device, equipment and storage medium - Google Patents

Voice noise reduction method, device, equipment and storage medium

Info

Publication number
CN113327626A
Authority
CN
China
Prior art keywords
voice
scene
recognition model
scene recognition
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110699792.6A
Other languages
Chinese (zh)
Other versions
CN113327626B (en)
Inventor
汪雪
黄石磊
程刚
何竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN202110699792.6A priority Critical patent/CN113327626B/en
Publication of CN113327626A publication Critical patent/CN113327626A/en
Application granted granted Critical
Publication of CN113327626B publication Critical patent/CN113327626B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0631 - Creating reference templates; Clustering
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00 - Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to the technical field of audio processing and discloses a voice noise reduction method comprising the following steps: acquiring voice data; inputting the voice data into a preset standard scene recognition model to determine the voice scene corresponding to the voice data, the standard scene recognition model being trained on a noise sample set for each scene; and selecting the preset noise reduction model corresponding to the voice scene to reduce noise in the voice data. In addition, the application discloses a voice noise reduction apparatus, device, and storage medium. The application can improve the accuracy of voice noise reduction.

Description

Voice noise reduction method, device, equipment and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a voice noise reduction method, apparatus, device, and storage medium.
Background
Daily life is filled with all kinds of noisy audio data, for example noise audio data recorded at the roadside, in a park, or in an office. The audio characteristics of such noise differ from one voice scene to another, and the voice noise reduction means to be adopted differ accordingly.
Existing voice noise reduction methods perform noise reduction based on the user's voiceprint features: the user's voice in the voice data is enhanced according to the user's voiceprint features, thereby weakening the background noise and completing the noise reduction. In practical application scenarios, however, when the background noise is too loud, such a method cannot weaken the background noise merely by enhancing the user's voice, so the noise reduction accuracy is not high.
Disclosure of Invention
In order to solve the above technical problem, or at least partially solve it, the present application provides a voice noise reduction method, apparatus, device, and storage medium.
In a first aspect, the present application provides a method for speech noise reduction, the method comprising:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and selecting a preset noise reduction model corresponding to the voice scene, and reducing noise of the voice data.
In one embodiment of the first aspect, the step of obtaining the voice data is preceded by:
collecting a noise sample set under each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
and segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model.
In one embodiment of the first aspect, after the steps of segmenting the classified speech set into a training speech set and a testing speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the testing speech set to obtain a standard scene recognition model, the method further includes:
and establishing a noise reduction model corresponding to each scene according to the collected noise sample set under each scene for calling.
In one embodiment of the first aspect, the constructing the scene recognition model by using the training voice set includes:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein a feature label is the category label attached to each noise sample when the corresponding audio features are extracted from the noise sample set under each scene;
sorting the Gini index set from large to small, and selecting the label corresponding to the smallest Gini index in the set as a dividing point;
taking the dividing point as the root node of an initial decision tree, generating child nodes starting from the dividing point, and distributing the training voice set to the child nodes, the initial decision tree being generated once all of the feature labels have been traversed;
and pruning the initial decision tree to obtain a scene recognition model.
In one embodiment of the first aspect, the pruning the initial decision tree to obtain a scene recognition model includes:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
and pruning the non-leaf nodes of which the surface error gain values are smaller than a preset gain threshold value to obtain a scene recognition model.
In one embodiment of the first aspect, the performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model includes:
carrying out scene recognition processing on the test voice set by using the scene recognition model to obtain a recognition result corresponding to the test voice set;
and when the recognition result corresponding to the test voice set is inconsistent with the feature label corresponding to the test voice set, the training voice set is reused to train the scene recognition model until the recognition result corresponding to the test voice set is consistent with the feature label corresponding to the test voice set, and a standard scene recognition model is obtained.
In one embodiment of the first aspect, the performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set includes:
acquiring a preset standard feature, and calculating a conditional probability value between the audio feature and the standard feature;
and sequencing each noise sample in the noise sample set according to the size of the conditional probability value, and dividing the sequenced noise sample set by using a preset audio interval as a dividing point to obtain a classified voice set.
In one embodiment of the first aspect, acquiring a noise sample set in each scene, and extracting audio features from each noise sample includes:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transform are carried out on the noise sample set to obtain a short-time frequency spectrum of the noise sample set;
performing modular squaring on the short-time frequency spectrum to obtain a power spectrum of the noise sample set;
and calculating the power spectrum by utilizing a preset Mel-scale triangular filter group to obtain logarithmic energy, and performing discrete cosine transform on the logarithmic energy to obtain the audio characteristics corresponding to each noise sample.
In a second aspect, the present application provides a voice noise reduction apparatus, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data;
the voice scene recognition module is used for inputting the voice data into a preset standard scene recognition model and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and the noise reduction module is used for selecting a preset noise reduction model corresponding to the voice scene and reducing noise of the voice data.
In a third aspect, an electronic device is provided, which comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the speech noise reduction method according to any embodiment of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the speech noise reduction method according to any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the embodiment of the application, the obtained voice data are input into the preset standard scene recognition model, the standard scene recognition model is used for recognizing the voice scene corresponding to the voice data, the voice scene corresponding to the voice data can be determined, the voice environment where the voice data are located is selected, the preset noise reduction model corresponding to the voice scene is selected, the voice data are subjected to noise reduction, the noise is reduced through the noise reduction model matched with the scene, the noise reduction operation is executed more accurately, and therefore the purpose of improving the accuracy of voice noise reduction is achieved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a speech noise reduction method according to an embodiment of the present application;
fig. 2 is a schematic flowchart illustrating a process of testing and adjusting a scene recognition model in a speech noise reduction method according to an embodiment of the present application;
fig. 3 is a schematic block diagram of an apparatus for speech noise reduction according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device for reducing noise in voice according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of a speech noise reduction method according to an embodiment of the present application. In this embodiment, the voice noise reduction method includes:
and S1, acquiring voice data.
In the embodiment of the present application, the voice data is noise-containing audio data that is to undergo noise reduction processing so that subsequent audio processing, such as voice recognition, can be performed. Specifically, the voice data may be audio data collected in any voice scene.
Further, before the step of acquiring voice data, the method includes:
collecting a noise sample set under each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
and segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model.
In detail, in the embodiment of the present application, the noise sample set includes noise audio data from each voice scene, for example noise audio data in a park, at a roadside, or in an office. The noise sample set may further include a feature label corresponding to each noise sample, where the feature label labels the noise sample so that the corresponding audio features can be extracted. The audio features may include the zero-crossing rate, Mel-frequency cepstral coefficients, spectral centroid, spectral spread, spectral entropy, spectral flux, and the like; in the embodiment of the present application, the audio features are preferably Mel-frequency cepstral coefficients.
Specifically, the acquiring a noise sample set under each scene, and extracting audio features from each noise sample includes:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transform are carried out on the noise sample set to obtain a short-time frequency spectrum of the noise sample set;
performing modular squaring on the short-time frequency spectrum to obtain a power spectrum of the noise sample set;
and calculating the power spectrum by utilizing a preset Mel-scale triangular filter group to obtain logarithmic energy, and performing discrete cosine transform on the logarithmic energy to obtain the audio characteristics corresponding to each noise sample.
In an alternative embodiment of the present application, the noise sample set is pre-emphasized by a predetermined high-pass filter to obtain a high-frequency noise sample set, and the pre-emphasis process can enhance the high-frequency portion of the speech signal in the noise sample set.
The embodiment of the application performs pre-emphasis processing on the noise sample set, so that formants of high-frequency parts in the noise sample can be highlighted.
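As a minimal sketch of this step, a first-order high-pass pre-emphasis filter of the kind commonly used before MFCC extraction can be written as follows; the coefficient 0.97 is a conventional assumption, not a value from the patent:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    # y[n] = x[n] - alpha * x[n-1]: boosts the high-frequency part of the signal.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```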
In an optional embodiment of the present application, a preset number of sampling points is used to segment the high-frequency noise sample set into frames of data, obtaining a framed data set.
Preferably, in the embodiment of the present application, the frame length is 512 or 256 sampling points.
In an optional embodiment of the present application, the windowing process applies a preset window function to each frame in the framed data set to obtain a windowed signal.
In detail, the preset window function is:

$$S'(n) = S(n) \times W(n)$$

$$W(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

where S'(n) is the windowed signal, S(n) is the framed data set, W(n) is the window function, n is the sample index within a frame, and N is the frame size.
Preferably, in this embodiment of the present application, the preset window function may be a Hamming window; W(n) above is the functional expression of the Hamming window.
The embodiment of the application performs windowing on the framed data set, which improves the continuity between the left and right ends of each frame and reduces spectral leakage.
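A minimal sketch of the framing and windowing steps, assuming the preferred frame length of 512 sampling points and a 50% hop (the hop size is not specified in the patent):

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    # Split the pre-emphasised signal (assumed longer than one frame) into
    # overlapping frames of frame_len samples each.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # Apply the Hamming window W(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame.
    return frames * np.hamming(frame_len)
```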
Further, the embodiment of the present invention performs the fast Fourier transform using the following formula:

$$S(k) = \sum_{n=0}^{N-1} S'(n)\, e^{-j 2\pi nk/N}, \quad 0 \le k \le N-1$$

and performs modulus squaring on the short-time spectrum using the following formula:

$$P(k) = \frac{\lvert S(k) \rvert^{2}}{N}$$

where S(k) is the short-time spectrum, P(k) is the power spectrum, S'(n) is the windowed signal, N is the frame size, n is the sample index within a frame, and k is the frequency bin index on the short-time spectrum.
Since the characteristics of a signal are usually difficult to observe from its time-domain representation, the embodiment of the present invention converts the noise sample set into an energy distribution in the frequency domain, where different energy distributions can represent the characteristics of different voices.
Further, in the embodiment of the present invention, the logarithmic energy of the preset Mel-scale triangular filter bank is:

$$T(m) = \ln\left(\sum_{k=0}^{N-1} P(k)\, H_m(k)\right), \quad 0 \le m < M$$

where T(m) is the logarithmic energy, P(k) is the power spectrum, H_m(k) is the frequency response of the m-th triangular filter, M is the number of filters, N is the frame size, and k is the frequency bin index on the short-time spectrum.
By using the triangular filters to calculate the logarithmic energy of the power spectrum, the embodiment of the invention can smooth the short-time spectrum, eliminate harmonics, and highlight the formants in the voice information.
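Putting the transform, power spectrum, filter bank, and discrete cosine transform together, the rest of the feature extraction chain might be sketched as follows; the filter count of 26 and the 13 retained cepstral coefficients are conventional assumptions rather than values from the patent:

```python
import numpy as np
from scipy.fft import rfft, dct

def mel_filterbank(n_filters: int, frame_len: int, sr: int) -> np.ndarray:
    # Triangular filters H_m(k), equally spaced on the Mel scale.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, frame_len // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc_from_frames(frames: np.ndarray, sr: int = 16000,
                     n_filters: int = 26, n_ceps: int = 13) -> np.ndarray:
    n = frames.shape[1]
    spectrum = rfft(frames, axis=1)            # S(k): short-time spectrum
    power = np.abs(spectrum) ** 2 / n          # P(k) = |S(k)|^2 / N
    log_energy = np.log(power @ mel_filterbank(n_filters, n, sr).T + 1e-10)  # T(m)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]  # MFCC features
```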
Specifically, the performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set includes:
acquiring a preset standard feature, and calculating a correlation coefficient between the audio feature and the standard feature;
and sequencing each noise sample in the noise sample set according to the magnitude of the correlation coefficient, and dividing the sequenced noise sample set by using a preset audio interval as a dividing point to obtain a classified voice set.
Wherein the classified voice set comprises voices in different scenes, such as voice in a road scene, voice in a park scene, and the like.
In detail, the following formula is used to calculate the correlation coefficient between the audio feature corresponding to each noise sample in the noise sample set and the standard feature:

$$q_{ij} = \frac{\exp\left(-\lVert y_i - y_j \rVert^{2}\right)}{\sum_{k \neq l} \exp\left(-\lVert y_k - y_l \rVert^{2}\right)}$$

where q_ij is the correlation coefficient, y_i is the audio feature corresponding to a noise sample, y_j is the standard feature, exp is the exponential function, and y_k and y_l range over the embedded sample pairs in the normalization term.
Specifically, the cluster analysis embeds the noise samples, which are distributed in a high-dimensional space, into a low-dimensional subspace so that the data in the low-dimensional space stay as consistent as possible with their characteristics in the high-dimensional space. Such cluster analysis preserves the global clustering structure of the high-dimensional data in the low-dimensional space and makes the clustering relations among the noise samples easy to inspect visually, so that noise samples with similar time-frequency characteristics are grouped into one class for classification and recognition, improving recognition accuracy.
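The patent does not name the embedding algorithm; as a rough sketch under the assumption of a t-SNE style embedding (whose low-dimensional similarity matches the q_ij formula above), with hypothetical split points:

```python
import numpy as np
from sklearn.manifold import TSNE

def classify_noise_samples(features: np.ndarray, standard_feature: np.ndarray,
                           split_points: list) -> np.ndarray:
    # Embed the high-dimensional audio features into a low-dimensional subspace.
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
        np.vstack([features, standard_feature]))
    y, y_std = emb[:-1], emb[-1]
    # Similarity of each embedded sample to the preset standard feature,
    # in the spirit of the q_ij formula above.
    q = np.exp(-np.sum((y - y_std) ** 2, axis=1))
    q = q / q.sum()
    # Divide the sorted score range at the preset points to obtain class labels.
    return np.digitize(q, np.sort(split_points))
```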
Further, the classified voice set is segmented into a training voice set and a testing voice set according to a preset segmentation ratio; the scene recognition model is constructed by using the training voice set, and the scene recognition model is tested and adjusted by using the testing voice set to obtain the standard scene recognition model.
Preferably, the segmentation ratio is training voice set : testing voice set = 7 : 3.
The training voice set is used for subsequent model training and provides the samples for model fitting; the testing voice set is used to adjust the hyperparameters of the model and to make a preliminary assessment of its capability, in particular its generalization ability.
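A minimal sketch of the 7:3 segmentation, using scikit-learn's train_test_split for illustration (X and y stand for the feature matrix and scene labels produced by the preceding steps):

```python
from sklearn.model_selection import train_test_split

# X: MFCC feature matrix, y: scene labels from the cluster analysis above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)  # training : testing = 7 : 3
```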
Specifically, the constructing the scene recognition model by using the training voice set includes:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein a feature label is the category label attached to each noise sample when the corresponding audio features are extracted from the noise sample set under each scene;
sorting the Gini index set from large to small, and selecting the label corresponding to the smallest Gini index in the set as a dividing point;
taking the dividing point as the root node of an initial decision tree, generating child nodes starting from the dividing point, and distributing the training voice set to the child nodes, the initial decision tree being generated once all of the feature labels have been traversed;
and pruning the initial decision tree to obtain a scene recognition model.
Specifically, the calculating the Gini index between each feature label and the corresponding training voice set includes:
calculating the Gini index between each feature label and the training voice set corresponding to that feature label by using the following formula:

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^{2}$$

where Gini(p) is the Gini index, p_k represents the k-th frame of data in the training voice set, and K is the number of frames in the training voice set.
In detail, the Gini index represents the impurity of the model: the smaller the Gini index, the lower the impurity and the better the feature.
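A minimal sketch of the computation, assuming the p_k are the per-class proportions implied by the formula above:

```python
import numpy as np

def gini_index(p: np.ndarray) -> float:
    # Gini(p) = sum_k p_k * (1 - p_k) = 1 - sum_k p_k^2
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))
```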
Further, the pruning the initial decision tree to obtain a scene recognition model includes:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
and pruning the non-leaf nodes of which the surface error gain values are smaller than a preset gain threshold value to obtain a scene recognition model.
In the embodiment of the present application, the preset gain threshold is 0.5.
Further, the calculating the surface error gain values of all non-leaf nodes on the initial decision tree includes:
calculating the surface error gain values of all non-leaf nodes on the initial decision tree by using the following gain formula:

$$\alpha = \frac{R(t) - R(T_t)}{\lvert N(T_t) \rvert - 1}$$

$$R(t) = r(t) \times p(t)$$

where α represents the surface error gain value, R(t) represents the error cost of the non-leaf node t when treated as a leaf, R(T_t) represents the error cost of the subtree rooted at t, N(T_t) represents the number of leaf nodes of that subtree, r(t) is the error rate of the node, and p(t) is the ratio of the number of samples on the node to the total number of samples.
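A sketch of the gain computation under the cost-complexity reading of the formula above; the argument names are illustrative, not from the patent:

```python
def surface_error_gain(error_rate: float, sample_ratio: float,
                       subtree_error_cost: float, n_subtree_leaves: int) -> float:
    # R(t) = r(t) * p(t): error cost of node t if it were collapsed to a leaf.
    R_t = error_rate * sample_ratio
    # alpha = (R(t) - R(T_t)) / (|N(T_t)| - 1)
    return (R_t - subtree_error_cost) / max(n_subtree_leaves - 1, 1)

# Nodes with alpha below the preset gain threshold (0.5 in the embodiment) are pruned.
```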
Specifically, referring to fig. 2, the performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model includes:
S101, performing scene recognition processing on the test voice set by using the scene recognition model to obtain a recognition result corresponding to the test voice set;
S102, when the recognition result corresponding to the test voice set is inconsistent with the feature labels corresponding to the test voice set, training the scene recognition model again with the training voice set until the recognition result corresponding to the test voice set is consistent with the feature labels, thereby obtaining the standard scene recognition model.
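Steps S101-S102 could be sketched with a scikit-learn decision tree standing in for the scene recognition model; the bounded retry count and the accuracy-of-1.0 stopping rule are assumptions that approximate the patent's consistency condition:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def fit_standard_scene_model(X_train, y_train, X_test, y_test, max_rounds: int = 20):
    best_model, best_acc = None, -1.0
    for seed in range(max_rounds):
        model = DecisionTreeClassifier(criterion="gini", random_state=seed)
        model.fit(X_train, y_train)                        # (re)train on the training set
        acc = accuracy_score(y_test, model.predict(X_test))
        if acc > best_acc:
            best_model, best_acc = model, acc
        if best_acc == 1.0:  # recognition results consistent with all feature labels
            break
    return best_model
```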
Further, after the steps of segmenting the classified speech set into a training speech set and a testing speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the testing speech set to obtain a standard scene recognition model, the method further comprises:
and establishing a noise reduction model corresponding to each scene according to the collected noise sample set under each scene for calling.
S2, inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene.
In the embodiment of the application, the acquired voice data is input into the preset standard scene recognition model, the preset standard scene recognition model performs scene recognition processing on the voice data, and the voice scene corresponding to the voice data is output.
S3, selecting a preset noise reduction model corresponding to the voice scene, and reducing the noise of the voice data.
In the embodiment of the present application, the noise reduction models include a dynamic time warping model, a vector quantization model, a hidden Markov model, and the like. According to the voice scene corresponding to the voice data and the characteristics of each noise reduction model, the corresponding noise reduction model is selected to perform the noise reduction operation on the voice data and obtain the noise reduction result.
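A hedged sketch of steps S2 and S3 as a scene-to-model dispatch; every handle here (dtw_denoiser, vq_denoiser, hmm_denoiser, extract_mfcc, reduce) is hypothetical, since the patent only states that a scene-matched noise reduction model is selected:

```python
# Hypothetical pre-built noise reduction models keyed by recognised scene;
# the patent mentions dynamic time warping, vector quantization, and hidden
# Markov model based noise reducers as candidates.
NOISE_REDUCTION_MODELS = {
    "roadside": dtw_denoiser,  # assumed handle
    "park": vq_denoiser,       # assumed handle
    "office": hmm_denoiser,    # assumed handle
}

def denoise(voice_data, scene_model):
    scene = scene_model.predict([extract_mfcc(voice_data)])[0]  # step S2
    return NOISE_REDUCTION_MODELS[scene].reduce(voice_data)     # step S3
```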
According to the embodiment of the application, the acquired voice data are input into the preset standard scene recognition model, which recognizes the voice scene corresponding to the voice data; the voice environment in which the voice data were recorded can thus be determined, the preset noise reduction model corresponding to that voice scene is selected, and the voice data are denoised, improving the accuracy of voice noise reduction.
As shown in fig. 3, an embodiment of the present application provides a schematic block diagram of a speech noise reduction apparatus 10, where the speech noise reduction apparatus 10 includes: a voice data acquisition module 11, a voice scene recognition module 12 and a noise reduction module 13.
The voice data acquisition module 11 is configured to acquire voice data;
the speech scene recognition module 12 is configured to input the speech data into a preset standard scene recognition model, and determine a speech scene corresponding to the speech data, where the standard scene recognition model is obtained by training according to a noise sample set in each scene;
and the noise reduction module 13 is configured to select a preset noise reduction model corresponding to the voice scene to reduce noise of the voice data.
In detail, in the embodiment of the present application, when being used, each module in the speech noise reduction apparatus 10 adopts the same technical means as the speech noise reduction method described in fig. 1, and can produce the same technical effect, which is not described herein again.
As shown in fig. 4, an embodiment of the present application provides a voice noise reduction device, which includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 complete mutual communication through the communication bus 114,
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111, configured to execute the program stored in the memory 113, to implement the voice noise reduction method provided in any of the foregoing method embodiments, includes:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and selecting a preset noise reduction model corresponding to the voice scene, and reducing noise of the voice data.
The communication bus 114 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 114 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 112 is used for communication between the above-described electronic apparatus and other apparatuses.
The memory 113 may include a Random Access Memory (RAM), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory 113 may also be at least one storage device located remotely from the processor 111.
The processor 111 may be a general-purpose processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the integrated circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components.
The present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the voice noise reduction method provided in any one of the foregoing method embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that incorporates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method for speech noise reduction, the method comprising:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and selecting a preset noise reduction model corresponding to the voice scene, and reducing noise of the voice data.
2. The method of claim 1, wherein the step of obtaining speech data is preceded by the steps of:
collecting a noise sample set under each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified voice set;
and segmenting the classified voice set into a training voice set and a testing voice set, constructing the scene recognition model by using the training voice set, and testing and adjusting the scene recognition model by using the testing voice set to obtain a standard scene recognition model.
3. The method of claim 2, wherein after the steps of segmenting the classified speech set into a training speech set and a testing speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the testing speech set to obtain a standard scene recognition model, the method further comprises:
and establishing a noise reduction model corresponding to each scene according to the collected noise sample set under each scene for calling.
4. The method of claim 2, wherein the constructing the scene recognition model by using the training voice set comprises:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein a feature label is the category label attached to each noise sample when the corresponding audio features are extracted from the noise sample set under each scene;
sorting the Gini index set from large to small, and selecting the label corresponding to the smallest Gini index in the set as a dividing point;
taking the dividing point as the root node of an initial decision tree, generating child nodes starting from the dividing point, and distributing the training voice set to the child nodes, the initial decision tree being generated once all of the feature labels have been traversed;
and pruning the initial decision tree to obtain a scene recognition model.
5. The method of claim 4, wherein the pruning the initial decision tree to obtain a scene recognition model comprises:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
and pruning the non-leaf nodes of which the surface error gain values are smaller than a preset gain threshold value to obtain a scene recognition model.
6. The method of claim 4, wherein the performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model comprises:
carrying out scene recognition processing on the test voice set by using the scene recognition model to obtain a recognition result corresponding to the test voice set;
and when the recognition result corresponding to the test voice set is inconsistent with the feature label corresponding to the test voice set, the training voice set is reused to train the scene recognition model until the recognition result corresponding to the test voice set is consistent with the feature label corresponding to the test voice set, and a standard scene recognition model is obtained.
7. The method of claim 2, wherein the performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set comprises:
acquiring a preset standard feature, and calculating a conditional probability value between the audio feature and the standard feature;
and sequencing each noise sample in the noise sample set according to the size of the conditional probability value, and dividing the sequenced noise sample set by using a preset audio interval as a dividing point to obtain a classified voice set.
8. The method according to any one of claims 1 to 5, wherein the collecting a noise sample set in each scene and extracting audio features from each noise sample comprises:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transform are carried out on the noise sample set to obtain a short-time frequency spectrum of the noise sample set;
performing modular squaring on the short-time frequency spectrum to obtain a power spectrum of the noise sample set;
and calculating the power spectrum by utilizing a preset Mel-scale triangular filter group to obtain logarithmic energy, and performing discrete cosine transform on the logarithmic energy to obtain the audio characteristics corresponding to each noise sample.
9. An apparatus for speech noise reduction, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data;
the voice scene recognition module is used for inputting the voice data into a preset standard scene recognition model and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to a noise sample set under each scene;
and the noise reduction module is used for selecting a preset noise reduction model corresponding to the voice scene and reducing noise of the voice data.
10. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the speech noise reduction method according to any one of claims 1 to 8 when executing a program stored in the memory.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for speech noise reduction according to any of claims 1 to 8.
CN202110699792.6A 2021-06-23 2021-06-23 Voice noise reduction method, device, equipment and storage medium Active CN113327626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110699792.6A CN113327626B (en) 2021-06-23 2021-06-23 Voice noise reduction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110699792.6A CN113327626B (en) 2021-06-23 2021-06-23 Voice noise reduction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113327626A true CN113327626A (en) 2021-08-31
CN113327626B CN113327626B (en) 2023-09-08

Family

ID=77424416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110699792.6A Active CN113327626B (en) 2021-06-23 2021-06-23 Voice noise reduction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113327626B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710490A (en) * 2009-11-20 2010-05-19 安徽科大讯飞信息科技股份有限公司 Method and device for compensating noise for voice assessment
US20120225719A1 (en) * 2011-03-04 2012-09-06 Microsoft Corporation Gesture Detection and Recognition
KR20120129421A (en) * 2011-05-20 2012-11-28 고려대학교 산학협력단 Apparatus and method of speech recognition for number
US8438029B1 (en) * 2012-08-22 2013-05-07 Google Inc. Confidence tying for unsupervised synthetic speech adaptation
US20180254041A1 (en) * 2016-04-11 2018-09-06 Sonde Health, Inc. System and method for activation of voice interactive services based on user state
CN106611183A (en) * 2016-05-30 2017-05-03 四川用联信息技术有限公司 Method for constructing Gini coefficient and misclassification cost-sensitive decision tree
KR20180046062A (en) * 2016-10-27 2018-05-08 에스케이텔레콤 주식회사 Method for speech endpoint detection using normalizaion and apparatus thereof
CN108181107A (en) * 2018-01-12 2018-06-19 东北电力大学 The Wind turbines bearing mechanical method for diagnosing faults of meter and more class objects
CN108198547A (en) * 2018-01-18 2018-06-22 深圳市北科瑞声科技股份有限公司 Sound end detecting method, device, computer equipment and storage medium
WO2019237519A1 (en) * 2018-06-11 2019-12-19 平安科技(深圳)有限公司 General vector training method, voice clustering method, apparatus, device and medium
CN109285538A (en) * 2018-09-19 2019-01-29 宁波大学 A kind of mobile phone source title method under the additive noise environment based on normal Q transform domain
CN110769111A (en) * 2019-10-28 2020-02-07 珠海格力电器股份有限公司 Noise reduction method, system, storage medium and terminal
CN111754988A (en) * 2020-06-23 2020-10-09 南京工程学院 Sound scene classification method based on attention mechanism and double-path depth residual error network
CN111933175A (en) * 2020-08-06 2020-11-13 北京中电慧声科技有限公司 Active voice detection method and system based on noise scene recognition
CN111916066A (en) * 2020-08-13 2020-11-10 山东大学 Random forest based voice tone recognition method and system
CN112614504A (en) * 2020-12-22 2021-04-06 平安科技(深圳)有限公司 Single sound channel voice noise reduction method, system, equipment and readable storage medium
CN112863667A (en) * 2021-01-22 2021-05-28 杭州电子科技大学 Lung sound diagnosis device based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
涂晴宇: "Speech Emotion Recognition and Text Sensitive-Word Detection for Human-Computer Interaction", China Master's Theses Full-text Database (Information Science and Technology), pages 136-201 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516985A (en) * 2021-09-13 2021-10-19 北京易真学思教育科技有限公司 Speech recognition method, apparatus and non-volatile computer-readable storage medium
CN113793620A (en) * 2021-11-17 2021-12-14 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device and equipment based on scene classification and storage medium
CN113793620B (en) * 2021-11-17 2022-03-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device and equipment based on scene classification and storage medium
CN114333881A (en) * 2022-03-09 2022-04-12 深圳市迪斯声学有限公司 Audio transmission noise reduction method, device, equipment and medium based on environment self-adaptation
CN116758934A (en) * 2023-08-18 2023-09-15 深圳市微克科技有限公司 Method, system and medium for realizing intercom function of intelligent wearable device
CN116758934B (en) * 2023-08-18 2023-11-07 深圳市微克科技有限公司 Method, system and medium for realizing intercom function of intelligent wearable device
CN116994599A (en) * 2023-09-13 2023-11-03 湖北星纪魅族科技有限公司 Audio noise reduction method for electronic equipment, electronic equipment and storage medium
CN117202071A (en) * 2023-09-21 2023-12-08 广东金海纳实业有限公司 Test method and system of noise reduction earphone
CN117202071B (en) * 2023-09-21 2024-03-29 广东金海纳实业有限公司 Test method and system of noise reduction earphone

Also Published As

Publication number Publication date
CN113327626B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN113327626B (en) Voice noise reduction method, device, equipment and storage medium
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
CN105206270B (en) A kind of isolated digit speech recognition categorizing system and method combining PCA and RBM
CN109493881B (en) Method and device for labeling audio and computing equipment
CN109065071B (en) Song clustering method based on iterative k-means algorithm
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN108899033B (en) Method and device for determining speaker characteristics
CN109256138A (en) Auth method, terminal device and computer readable storage medium
CN109378014A (en) A kind of mobile device source discrimination and system based on convolutional neural networks
CN111540342B (en) Energy threshold adjusting method, device, equipment and medium
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
Hsieh et al. Improving perceptual quality by phone-fortified perceptual loss for speech enhancement
CN111402922B (en) Audio signal classification method, device, equipment and storage medium based on small samples
CN115394318A (en) Audio detection method and device
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
CN107993666B (en) Speech recognition method, speech recognition device, computer equipment and readable storage medium
CN115938346A (en) Intonation evaluation method, system, equipment and storage medium
Marković et al. Reverberation-based feature extraction for acoustic scene classification
CN112309404B (en) Machine voice authentication method, device, equipment and storage medium
Patil et al. Content-based audio classification and retrieval: A novel approach
CN111326161B (en) Voiceprint determining method and device
CN114302301A (en) Frequency response correction method and related product
Therese et al. A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant