CN114913876B - Noisy speech end point detection method based on multi-resolution KL divergence and voting mechanism

Publication number
CN114913876B
Authority
CN
China
Prior art keywords
granularity
divergence
value
sub
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210508346.7A
Other languages
Chinese (zh)
Other versions
CN114913876A (en
Inventor
李晔
沈自强
白全民
张存阳
王金颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Institute Of Science And Technology Development Strategy
Original Assignee
Shandong Institute Of Science And Technology Development Strategy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Institute Of Science And Technology Development Strategy
Priority to CN202210508346.7A
Publication of CN114913876A
Application granted
Publication of CN114913876B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a noisy speech endpoint detection method based on multi-resolution KL divergence and a voting mechanism, comprising the following steps: framing the sampled speech signal samples in time order; dividing the frequency band of the current frame at several granularities to obtain a plurality of subbands at each granularity; calculating the KL divergence of the subband energy distribution at each granularity, and comparing the KL divergence at each granularity with a first threshold to obtain a discrimination value for that granularity; and obtaining a composite value from the discrimination values of the different granularities, and comparing the composite value with a second threshold to decide whether the frame is speech or noise. The method effectively addresses the problem of noise being misjudged as speech because of fluctuations of the noise spectrum across adjacent subbands, and thereby improves the accuracy of speech endpoint detection.

Description

Noisy speech end point detection method based on multi-resolution KL divergence and voting mechanism
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a noisy voice endpoint detection method based on multi-resolution KL divergence and a voting mechanism.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Speech endpoint detection is widely used in speech signal processing. For example, in variable-rate speech coding systems, endpoint detection distinguishes speech segments from non-speech segments, so that fewer bits are spent on non-speech segments and the average coding rate is reduced. In speech noise reduction systems, endpoint detection distinguishes speech segments from non-speech segments so that the noise characteristics can be estimated and updated during non-speech segments, which yields a better noise reduction effect. The technique is also used in speech recognition, echo cancellation, voice wake-up, and related fields.
However, background noise degrades the accuracy of speech endpoint detection, and endpoint detection for noisy speech remains both a difficulty and an active research topic. Methods based on frequency-domain features have been proposed to improve detection. Frequency-domain endpoint detection makes better use of the difference between the statistical spectral characteristics of noise and of speech, and detects well; endpoint detection based on spectral entropy is a typical representative of this class of algorithms.
However, the robustness of spectral-entropy endpoint detection still needs improvement. In speech corrupted by noise such as white noise or pink noise, the noise spectrum is relatively stable but still varies to some degree, for example through fluctuations in how energy is distributed across adjacent subbands; with a single fixed subband division, such fluctuations can cause noise to be misjudged as speech. Likewise, under a single fixed subband division, the spectral entropy of an individual speech frame may lie very close to that of a noise frame, so that the speech frame is misjudged as noise. Both problems directly degrade speech endpoint detection.
Disclosure of Invention
In order to solve the problems, the invention provides a noisy speech endpoint detection method based on multi-resolution KL divergence and voting mechanism, which effectively solves the problem that noise caused by fluctuation of a noise spectrum in a nearby subband is misjudged as speech, thereby effectively improving the accuracy of speech endpoint detection.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the present invention provides a noisy speech endpoint detection method based on multi-resolution KL divergence and a voting mechanism, including:
framing the sampled speech signal samples in time order;
dividing the frequency band of the current frame at different granularities to obtain a plurality of subbands at each granularity;
calculating the KL divergence of the subband energy distribution at each granularity, and comparing the KL divergence at each granularity with a first threshold to obtain a discrimination value for each granularity;
and obtaining a composite value by a voting mechanism from the discrimination values of the different granularities, and obtaining the speech/noise decision by comparing the composite value with a second threshold.
As an optional implementation, the first N frames of speech signal samples are treated as a noise segment; while the current frame number is smaller than N, the spectrum of the noise segment is updated from the spectrum of the current frame. The update combines the energy of the j-th subband at the k-th granularity of the noise to be estimated with the energy of the j-th subband at the k-th granularity of the current frame i.
As an optional implementation, the KL divergence KL_k of the subband energy distribution at the k-th granularity is computed over the L subbands at that granularity. It is expressed in terms of four quantities: the energy of the m-th subband and the energy of the j-th subband at the k-th granularity of the noise to be estimated, and the energy of the m-th subband and the energy of the j-th subband at the k-th granularity of the current frame i.
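One form consistent with the quantities listed above is sketched here as an assumption, not as the patent's exact expression: both subband energy vectors are normalized into distributions (the sums over j), and the divergence of the current frame's distribution from the noise distribution is taken. The symbols E_i^k and \hat{E}_N^k are placeholder names introduced for this sketch, and the direction of the divergence (frame relative to noise) is assumed.

```latex
% Assumed form: KL divergence of the frame's normalized subband energies
% from the noise estimate's normalized subband energies, at granularity k.
KL_k = \sum_{m=1}^{L}
       \frac{E_i^k(m)}{\sum_{j=1}^{L} E_i^k(j)}
       \log \frac{E_i^k(m) \,\big/\, \sum_{j=1}^{L} E_i^k(j)}
                 {\hat{E}_N^k(m) \,\big/\, \sum_{j=1}^{L} \hat{E}_N^k(j)}
```

Under this reading, KL_k is zero when the frame's energy distribution matches the noise estimate's and grows as the two distributions diverge, which is what the comparison against the first threshold relies on.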
As an alternative implementation manner, comparing the KL divergence of each granularity with a first threshold value, and if the KL divergence is greater than the first threshold value, determining that the current frame is a voice frame and the discrimination value is 1; otherwise, the current frame is noise, and the discrimination value is zero.
As an optional implementation, a voting mechanism combines the discrimination values of the different granularities into a composite value SN_total, in which each discrimination value SN_k(i) is weighted by w_k, the number of subbands at the k-th granularity, and KN is the total number of granularity categories.
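A weighted vote consistent with these definitions can be written as below. This is a hedged reading: whether the source additionally normalizes the sum (for example by the total weight) cannot be recovered from the text, so the unnormalized form is an assumption.

```latex
% Assumed form: unnormalized weighted sum of the per-granularity decisions.
SN_{total} = \sum_{k=1}^{KN} w_k \, SN_k(i)
```

With the subband counts 64, 32, 16, 8, and 4 given in Example 1, this sum ranges from 0 to 124, so a second threshold chosen offline would lie within that range under this assumption.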
As an optional implementation, the composite value is compared with the second threshold: the current frame is a speech frame if the composite value is greater than the second threshold, and a noise frame otherwise.
As an optional implementation, the frequency band is divided at the different granularities after a fast Fourier transform has been applied to the current frame.
In a second aspect, the present invention provides a noisy speech endpoint detection system based on a multi-resolution KL divergence and voting mechanism, comprising:
the framing module is configured to frame the sampled voice signal samples according to the time sequence;
the sub-band dividing module is configured to divide the frequency band of the current frame according to different granularities, and a plurality of sub-bands are obtained under each granularity;
the first judging module is configured to calculate the KL divergence of the sub-band energy distribution under different granularities respectively, and compare the KL divergence of each granularity with a first threshold value to obtain a judging value of each granularity;
And the second judging module is configured to obtain a comprehensive value by adopting a voting mechanism according to the judging values of different granularities, and obtain a judging result of the voice and the noise according to the comparison of the comprehensive value and the second threshold value.
In a third aspect, the invention provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides noisy speech endpoint detection based on multi-resolution KL divergence and a voting mechanism. The input speech frame is subjected to short-time frequency analysis and divided into subbands at several resolutions (granularities); the KL divergence is calculated for each subband division and used for endpoint detection, which yields a noise/speech decision at each granularity. A voting mechanism then weights these decisions by resolution, with higher resolution receiving a larger weight, and combines the noise/speech decisions at the several granularities into the final noise/speech decision. This effectively addresses the problem of noise being misjudged as speech because of fluctuations of the noise spectrum across adjacent subbands, and thereby improves the accuracy of speech endpoint detection.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of a conventional speech end point detection method based on spectral entropy;
Fig. 2 is a flowchart of a method for detecting a noisy speech endpoint according to embodiment 1 of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms also are intended to include the plural forms, and furthermore, it is to be understood that the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusions, such as, for example, processes, methods, systems, products or devices that comprise a series of steps or units, are not necessarily limited to those steps or units that are expressly listed, but may include other steps or units that are not expressly listed or inherent to such processes, methods, products or devices.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
Fig. 1 is a flowchart of a conventional speech endpoint detection method based on spectral entropy, specifically:
(1) Framing input speech signal samples according to a time sequence;
(2) Performing fast Fourier transform on each frame of signal, and dividing the signal into N sub-bands;
(3) Calculating the energy E_i of each subband, and calculating the spectral entropy H from the proportion of each subband energy E_i to the total energy E_total of the speech frame:
$$H = -\sum_{i=1}^{N} \frac{E_i}{E_{total}} \log \frac{E_i}{E_{total}}$$
(4) Calculating the absolute difference H_err = |H - H_N| between the spectral entropy H of the current frame and the spectral entropy H_N of the noise frames, where H_N is obtained from statistics over a number of input frames;
(5) Comparing the absolute spectral entropy difference H_err with an empirical threshold T to decide between speech frame and non-speech frame: if H_err is greater than T, the current frame is a speech frame; otherwise, the current frame is noise.
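Steps (2) to (5) of this conventional baseline can be sketched in a few lines. The sketch below is illustrative only: the function names, the choice of 16 subbands, the natural logarithm, and the small constant guarding against log(0) are all assumptions, since the source does not fix them.

```python
import numpy as np

def spectral_entropy(frame, n_subbands=16):
    """Spectral entropy of one frame from its subband energy proportions."""
    # Power spectrum of the positive-frequency bins (Nyquist bin dropped).
    power = np.abs(np.fft.rfft(frame)[: len(frame) // 2]) ** 2
    # Group the bins into equal-width subbands and sum the energy in each.
    bins_per_band = len(power) // n_subbands
    energies = power[: bins_per_band * n_subbands]
    energies = energies.reshape(n_subbands, bins_per_band).sum(axis=1)
    # Proportion of each subband energy to the total frame energy.
    p = energies / (energies.sum() + 1e-12)
    return float(-np.sum(p * np.log(p + 1e-12)))

def is_speech_by_entropy(frame, noise_entropy, threshold):
    """Baseline decision: speech if |H - H_N| exceeds the threshold T."""
    return bool(abs(spectral_entropy(frame) - noise_entropy) > threshold)
```

A frame with a flat spectrum gives the maximum entropy log(n_subbands), while a frame whose energy sits in one subband gives an entropy near zero; the baseline relies on noise frames staying near a stable entropy value H_N.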
The robustness of the method for endpoint detection by using the spectral entropy still needs to be further improved, and the problem of misjudgment exists.
Therefore, this embodiment proposes a noisy speech endpoint detection method based on a multi-resolution KL divergence (Kullback-Leibler divergence) and a weighted voting mechanism, as shown in fig. 2, specifically including:
framing sampled voice signal samples according to time sequence;
dividing the frequency band of the current frame according to different granularities to obtain a plurality of sub-bands under each granularity;
Respectively calculating KL divergence of sub-band energy distribution under different granularities, and comparing the KL divergence of each granularity with a first threshold value to obtain a discrimination value of each granularity;
And obtaining a comprehensive value by adopting a voting mechanism according to the judgment values of different granularities, and obtaining a judgment result of the voice and the noise according to the comparison of the comprehensive value and the second threshold value.
In this embodiment, the speech signal is sampled at 8 kHz and high-pass filtered to remove power-line (mains frequency) interference, which yields the speech signal samples; the samples are then framed in time order.
For example, but without limitation, every 32 ms of signal, i.e., 256 speech signal samples, forms one frame.
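The frame length follows directly from the sampling rate: 8000 samples/s × 0.032 s = 256 samples. A minimal framing sketch is given below; it assumes non-overlapping frames and drops any trailing partial frame, neither of which the source specifies, and the high-pass filtering step is omitted. The names are hypothetical.

```python
import numpy as np

SAMPLE_RATE_HZ = 8000
FRAME_MS = 32
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 8000 * 0.032 = 256 samples

def split_into_frames(samples, frame_len=FRAME_LEN):
    """Split a 1-D signal into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are discarded
    (an assumption of this sketch).
    """
    samples = np.asarray(samples, dtype=float)
    n_frames = len(samples) // frame_len
    return samples[: n_frames * frame_len].reshape(n_frames, frame_len)
```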
In this embodiment, after performing fast fourier transform on the current frame, frequency bands are divided according to different granularities, and a plurality of subbands are obtained under each granularity;
In this embodiment, a granularity set K = {2, 4, 8, 16, 32} defines five subband division granularities, where each element k is the number of Fourier transform frequency bins contained in each subband; the total number of granularity categories is denoted KN.
As an optional implementation, a 256-point fast Fourier transform is applied to the current frame and the first 128 values are used for band division; at the five granularities 2, 4, 8, 16, and 32, the current frame is thus divided into 64, 32, 16, 8, and 4 subbands respectively.
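The multi-granularity division can be sketched as follows. The sketch assumes that "energy" means the squared magnitude of each FFT bin summed within a subband; the function and constant names are hypothetical.

```python
import numpy as np

GRANULARITIES = (2, 4, 8, 16, 32)  # FFT bins per subband, as in the embodiment

def subband_energies(frame, granularities=GRANULARITIES, n_bins=128):
    """Return {k: energies} with n_bins // k subband energies per granularity k."""
    # 256-point FFT; keep the first 128 bins, as described in the embodiment.
    power = np.abs(np.fft.fft(frame, n=2 * n_bins)[:n_bins]) ** 2
    # At granularity k, every k adjacent bins form one subband.
    return {k: power.reshape(n_bins // k, k).sum(axis=1) for k in granularities}
```

For a 256-sample frame this yields arrays of length 64, 32, 16, 8, and 4. Each coarser division is a regrouping of the same bin energies, so the total energy is identical at every granularity and only its distribution across subbands changes.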
The present embodiment sets the first N frames of voice signals as noise segments, for example, n=20, but is not limited thereto;
The frame number i is initialized to 0. While the current frame number i is less than N, the spectrum of the noise segment is updated from the spectrum of the current frame, i is incremented by 1, and the next frame is framed and divided into subbands; this repeats until the current frame number i is greater than or equal to N.
The noise spectrum update combines the energy of the j-th subband at the k-th granularity of the background noise to be estimated, whose initial value is 0, with the energy of the j-th subband at the k-th granularity of the current frame i.
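The exact update rule is not legible in the text above, so the sketch below substitutes a plain running mean over the frames seen so far. This is an assumption chosen because it is compatible with an initial value of 0 and with updating once per frame for i < N; the source may use a different recursion. All names are hypothetical.

```python
import numpy as np

def update_noise_estimate(noise_energy, frame_energy, frame_index):
    """Assumed update rule: running mean of subband energies over frames 0..i.

    noise_energy: current estimate for one granularity (zeros before frame 0)
    frame_energy: subband energies of the current frame at that granularity
    frame_index:  zero-based frame number i, with i < N
    """
    noise_energy = np.asarray(noise_energy, dtype=float)
    frame_energy = np.asarray(frame_energy, dtype=float)
    return (frame_index * noise_energy + frame_energy) / (frame_index + 1)
```

Applied once per frame for i = 0, ..., N-1 and separately at each granularity, this leaves the estimate equal to the average subband energy of the first N frames.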
Otherwise (i is greater than or equal to N), the KL divergence of the subband energy distribution is calculated at each granularity. Here KL_k is the KL divergence of the subband energy distribution at the k-th granularity and L is the number of subbands at the k-th granularity; KL_k is computed from the energy of the m-th subband at the k-th granularity of the background noise to be estimated and the energy of the m-th subband at the k-th granularity of the current frame.
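A possible implementation of the per-granularity KL divergence is sketched below. It assumes the divergence is taken from the current frame's normalized energy distribution to the noise estimate's normalized distribution; the direction, the natural logarithm, and the small `eps` guard are assumptions of this sketch, not details confirmed by the source.

```python
import numpy as np

def kl_divergence(frame_energy, noise_energy, eps=1e-12):
    """KL divergence between normalized subband energy distributions.

    Assumed direction: D(frame || noise). Both inputs hold the L subband
    energies at one granularity.
    """
    frame_energy = np.asarray(frame_energy, dtype=float)
    noise_energy = np.asarray(noise_energy, dtype=float)
    # Normalize energies so that each vector sums to (almost exactly) one.
    p = frame_energy / (frame_energy.sum() + eps)
    q = noise_energy / (noise_energy.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

Because both vectors are normalized, the value depends on the shape of the energy distribution rather than on the overall level of the frame: a frame that is simply a louder copy of the noise estimate still scores close to zero.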
In this embodiment, the KL divergence at each granularity is compared with a first threshold T_1. If KL_k > T_1, the current frame is judged to be a speech frame at the k-th granularity and the discrimination value SN_k(i) = 1 is recorded; otherwise, the current frame is judged to be environmental noise and SN_k(i) = 0 is recorded.
T_1 is an empirical statistical value obtained offline.
In this embodiment, the discrimination values of the different granularities are combined by a weighted voting mechanism to obtain the composite value SN_total. The weight w_k of the k-th granularity is the number of subbands at that granularity, so the finer the granularity, the more subbands it has and the larger its weight.
In this embodiment, the composite value is compared with the second threshold T_2: if SN_total > T_2, the current frame is a speech frame; otherwise it is a noise frame. T_2 is an empirical statistical value obtained offline. Each frame of the speech signal is processed in this way in sequence until the whole signal has been processed.
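The two-stage decision can be sketched as below. The unnormalized weighted sum and the example threshold values in the usage note are assumptions; the source states only that T_1 and T_2 are empirical values obtained offline. The function name is hypothetical.

```python
def weighted_vote(kl_by_granularity, subbands_by_granularity, t1, t2):
    """Combine per-granularity decisions into a frame-level speech/noise decision.

    kl_by_granularity:       {k: KL_k} for the current frame
    subbands_by_granularity: {k: w_k}, the number of subbands at granularity k
    Returns (is_speech, sn_total).
    """
    sn_total = 0
    for k, kl in kl_by_granularity.items():
        sn_k = 1 if kl > t1 else 0                      # first-stage decision SN_k(i)
        sn_total += subbands_by_granularity[k] * sn_k   # vote weighted by w_k
    return sn_total > t2, sn_total                      # second-stage decision
```

With the weights 64, 32, 16, 8, and 4 from the embodiment and a hypothetical T_2 of 62 (half the maximum of 124), votes from the two finest granularities alone (64 + 32 = 96) are enough to mark a frame as speech, whereas votes from the three coarsest alone (16 + 8 + 4 = 28) are not. This reflects the stated design choice that finer granularities carry more weight.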
The noisy speech endpoint detection method of this embodiment improves the accuracy of speech endpoint detection in noisy environments. Compared with spectral entropy, KL divergence describes the difference between spectral energy distributions more effectively and is more robust across signal-to-noise ratios. Because a conventional KL divergence computed on a single subband division is not robust to fluctuations of the noise spectrum energy between subbands, this embodiment additionally applies a voting mechanism and derives a composite discrimination value from the KL divergences at the different granularities. The method can serve speech endpoint detection needs in noisy environments across speech signal processing fields such as variable-rate speech compression coding, speech enhancement, speech noise reduction, echo cancellation, speech recognition, and voice wake-up.
Example 2
The embodiment provides a noisy speech endpoint detection system based on multi-resolution KL divergence and voting mechanism, comprising:
the framing module is configured to frame the sampled voice signal samples according to the time sequence;
the sub-band dividing module is configured to divide the frequency band of the current frame according to different granularities, and a plurality of sub-bands are obtained under each granularity;
the first judging module is configured to calculate the KL divergence of the sub-band energy distribution under different granularities respectively, and compare the KL divergence of each granularity with a first threshold value to obtain a judging value of each granularity;
And the second judging module is configured to obtain a comprehensive value by adopting a voting mechanism according to the judging values of different granularities, and obtain a judging result of the voice and the noise according to the comparison of the comprehensive value and the second threshold value.
It should be noted that the above modules correspond to the steps described in embodiment 1; the examples and application scenarios implemented by the modules are the same as those of the corresponding steps, but are not limited to what is disclosed in embodiment 1. The modules may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.
In further embodiments, there is also provided:
An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method described in embodiment 1. For brevity, the description is omitted here.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in embodiment 1 may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in storage media well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory; the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (6)

1. The noisy speech end point detection method based on the multi-resolution KL divergence and voting mechanism is characterized by comprising the following steps:
framing sampled voice signal samples according to time sequence;
dividing the frequency band of the current frame according to different granularities to obtain a plurality of sub-bands under each granularity;
Respectively calculating KL divergence of sub-band energy distribution under different granularities, and comparing the KL divergence of each granularity with a first threshold value to obtain a discrimination value of each granularity;
Obtaining a comprehensive value by adopting a voting mechanism according to the judgment values of different granularities, and obtaining a judgment result of voice and noise according to the comparison of the comprehensive value and a second threshold value;
Setting the first N frames of voice signal sample points as noise segments, and if the current frame number is smaller than N, updating the frequency spectrum of the noise segments according to the frequency spectrum of the current frame, wherein the method specifically comprises the following steps:
wherein the update is expressed in terms of the energy of the j-th subband at the k-th granularity of the noise to be estimated and the energy of the j-th subband at the k-th granularity of the current frame i;
The KL divergence is:
wherein KL_k is the KL divergence of the subband energy distribution at the k-th granularity, and L is the number of subbands at the k-th granularity; the remaining terms are the energy of the m-th subband at the k-th granularity of the noise to be estimated and the energy of the m-th subband at the k-th granularity of the current frame i;
Comparing the KL divergence of each granularity with a first threshold value, and if the KL divergence is larger than the first threshold value, judging that the current frame is a voice frame and the judgment value is 1; otherwise, the current frame is noise, and the discrimination value is zero;
the comprehensive value is obtained by adopting a voting mechanism according to the judgment values of different granularities, and is specifically as follows:
wherein SN_total is the composite value, w_k is the number of subbands at the k-th granularity, SN_k(i) is the discrimination value, and KN is the total number of granularity categories.
2. The method for detecting a noisy speech endpoint based on the multi-resolution KL divergence and voting scheme according to claim 1, wherein the current frame is a speech frame if the integrated value is greater than the second threshold value, and is a noisy frame otherwise, based on a comparison of the integrated value with the second threshold value.
3. The method for detecting a noisy speech endpoint based on the multi-resolution KL divergence and voting mechanism according to claim 1, wherein the current frame is subjected to a fast Fourier transform and is then divided into frequency bands according to the different granularities.
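The transform-then-divide step of claim 3 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the sub-bands are taken to be of roughly equal width, the sub-band energy is taken to be the sum of the squared magnitudes of the FFT bins in the band, and the granularities `(4, 8, 16)` and the function name `subband_energies` are hypothetical.

```python
import numpy as np


def subband_energies(frame, granularities=(4, 8, 16)):
    """Return one array of sub-band energies per granularity.

    frame: 1-D array of time-domain samples for the current frame.
    granularities: assumed numbers of sub-bands, one per granularity.
    """
    # One-sided FFT of the real-valued frame, then the power of each bin.
    power = np.abs(np.fft.rfft(frame)) ** 2
    energies = []
    for n_bands in granularities:
        # Split the bins into n_bands contiguous groups of near-equal size
        # (assumption: equal-width bands) and sum the power in each group.
        bands = np.array_split(power, n_bands)
        energies.append(np.array([band.sum() for band in bands]))
    return energies


# Example call on a 64-sample frame containing a single sinusoid.
t = np.arange(64)
frame = np.sin(2 * np.pi * 8 * t / 64)
per_granularity = subband_energies(frame)
```

`np.array_split` is used rather than `np.split` because the number of FFT bins need not be divisible by the number of sub-bands. Every granularity partitions the same power spectrum, so the total energy is identical across granularities and only its distribution over sub-bands changes.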
4. A noisy speech endpoint detection system based on a multi-resolution KL-divergence and voting mechanism, characterized in that it is configured to perform the noisy speech endpoint detection method based on a multi-resolution KL-divergence and voting mechanism as defined in any one of claims 1 to 3, and that it comprises:
the framing module is configured to frame the sampled voice signal samples according to the time sequence;
the sub-band dividing module is configured to divide the frequency band of the current frame according to different granularities, and a plurality of sub-bands are obtained under each granularity;
the first judging module is configured to calculate the KL divergence of the sub-band energy distribution at each of the different granularities, and to compare the KL divergence at each granularity with a first threshold to obtain a discrimination value for each granularity;
and the second judging module is configured to obtain a comprehensive value from the discrimination values of the different granularities by means of a voting mechanism, and to obtain the speech/noise decision by comparing the comprehensive value with the second threshold.
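As an illustration of the framing module, the sketch below splits a sample sequence into frames in time order. The frame length, the hop size (and hence any overlap between frames), and the choice to drop trailing samples that do not fill a whole frame are assumptions of this sketch; the claims only state that the samples are framed in time order. The function name is hypothetical.

```python
def split_into_frames(samples, frame_length, hop_length):
    """Split samples into consecutive frames, earliest frame first.

    frame_length and hop_length are assumed parameters; trailing samples
    that do not fill a complete frame are dropped.
    """
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += hop_length
    return frames


# Example call: 10 samples, frames of 4 samples, 50% overlap.
frames = split_into_frames(list(range(10)), frame_length=4, hop_length=2)
```

Each resulting frame would then be passed in turn to the sub-band dividing module and the two judging modules.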
5. An electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the method of any one of claims 1 to 3.
6. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of any of claims 1-3.
CN202210508346.7A 2022-05-11 2022-05-11 Noisy speech end point detection method based on multi-resolution KL divergence and voting mechanism Active CN114913876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210508346.7A CN114913876B (en) 2022-05-11 2022-05-11 Noisy speech end point detection method based on multi-resolution KL divergence and voting mechanism

Publications (2)

Publication Number Publication Date
CN114913876A CN114913876A (en) 2022-08-16
CN114913876B CN114913876B (en) 2024-08-20

Family

ID=82767149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210508346.7A Active CN114913876B (en) 2022-05-11 2022-05-11 Noisy speech end point detection method based on multi-resolution KL divergence and voting mechanism

Country Status (1)

Country Link
CN (1) CN114913876B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599269A (en) * 2009-07-02 2009-12-09 中国农业大学 Sound end detecting method and device
CN113628612A (en) * 2020-05-07 2021-11-09 北京三星通信技术研究有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant