CN116665717B - Cross-subband spectral entropy weighted likelihood ratio voice detection method and system
- Publication number: CN116665717B
- Application number: CN202310963463.7A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The application discloses a cross-subband spectral entropy weighted likelihood ratio voice detection method and system. First, the frequency domain is divided into non-uniform, partially overlapping subbands, and the spectral entropy of each subband is extracted. Next, the likelihood ratio weight of each subband is set according to the subband spectral entropy and the ratio of the subband energy spectrum to the average subband energy spectrum of the non-speech frames. Finally, the weighted likelihood ratio is compared with a preset threshold to decide whether a given frame is a speech frame. Because the spectral entropy of a speech signal is robust against background noise, the application uses subband spectral entropy to set the likelihood ratio weights in the likelihood ratio test and uses the weighted likelihood ratio as one basis of the voice detection decision. This improves the detection accuracy of likelihood-ratio-test voice detection in low signal-to-noise-ratio environments, and the method is suitable for speech signal processing fields such as speech recognition and speaker recognition.
Description
Technical Field
The application relates to the technical field of voice detection, in particular to a cross-subband spectral entropy weighted likelihood ratio voice detection method and system.
Background
Voice activity detection (VAD) aims to distinguish speech from non-speech in a signal. Speech signal processing systems often involve VAD. In a speech coding system, VAD determines whether the current signal contains speech, so that different bit allocation schemes or different coding and decoding methods can be applied, reducing the coding rate without degrading the quality of the synthesized speech; in a speech recognition or speaker recognition system, accurate VAD decisions improve the recognition rate and save processing time. Conventional VAD relies mainly on speech feature parameters such as short-time energy, zero-crossing rate, spectral entropy, LPC parameters, cepstral features, and higher-order statistics. These methods perform satisfactorily at high signal-to-noise ratios, but their detection performance degrades sharply as the signal-to-noise ratio decreases.
To address VAD at low signal-to-noise ratios, VAD algorithms based on the likelihood ratio test have been proposed. These methods use a Gaussian statistical model to model the Fourier transform coefficients of the signal under two hypotheses, speech and non-speech, and use the likelihood ratio test to evaluate how well each statistical model fits the current observation, from which the VAD decision is made. Spectral entropy can assist this decision for two reasons. On the one hand, the spectral entropy of a speech signal is fairly robust: when the signal-to-noise ratio decreases, its shape remains substantially unchanged. On the other hand, the spectral entropy of a speech signal does not depend on amplitude but only on the randomness (i.e. the distribution) of the signal; the flatter the spectrum, the larger the spectral entropy, and the spectral entropy of speech is generally smaller than that of noise. Within the same time period, the spectral entropy of different frequency bands differs in its ability to discriminate the presence of speech, so the spectral entropy values of different frequency subbands can serve as auxiliary features for the likelihood ratio decision in the likelihood ratio test. The application therefore provides a cross-subband spectral entropy weighted likelihood ratio voice activity detection method, which divides the spectrum into non-uniform subbands, calculates the spectral entropy of each subband, sets the weights of the likelihood ratios of the subband frequency components according to the subband spectral entropy, and uses the weighted likelihood ratio as the basis for the voice detection decision.
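The likelihood ratio test mentioned above is not written out in this text. As a minimal sketch, assuming the commonly used complex Gaussian model in which the per-frequency-point likelihood ratio depends on the a posteriori SNR (here `gamma`) and an a priori SNR (here `xi`, estimated by a simple maximum-likelihood rule), the value for one frequency point could be computed as follows. The function name, the estimator for `xi`, and the requirement that `noise_power` be positive are assumptions made for illustration.

```python
import math


def bin_likelihood_ratio(power, noise_power):
    """Likelihood ratio of one frequency point (assumed Gaussian model).

    power:       energy of the observed spectral line, |X(k)|^2
    noise_power: estimated noise variance at that line (must be > 0)
    """
    gamma = power / noise_power      # a posteriori SNR
    xi = max(gamma - 1.0, 0.0)       # assumed ML estimate of the a priori SNR
    # Ratio of the "speech" density to the "non-speech" density.
    return math.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi)
```

A spectral line whose energy equals the noise estimate gives a ratio of 1, and the ratio grows as the line rises above the noise floor.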
Disclosure of Invention
In order to solve the technical problems, the application provides a cross-subband spectral entropy weighted likelihood ratio voice detection method and a system.
The first aspect of the application provides a cross-subband spectral entropy weighted likelihood ratio voice detection method, which comprises the following steps:
Step S01: given the sampled signal to be detected $x(n)$, a likelihood ratio decision threshold $\eta$, a time region validity decision threshold $\delta$, and a Fourier transform length $N$, apply windowing and framing preprocessing to $x(n)$, and calculate the likelihood ratio test value $\Lambda(k,l)$ of the spectral line at each frequency point $k$ of the $l$-th frame signal;
Step S02: divide the frequency domain into subbands, the subbands being non-uniform and partially overlapping;
Step S03: according to the upper and lower frequency limits of the subbands divided in step S02, calculate the energy spectrum $E_b(l)$ of the $b$-th subband of the $l$-th frame signal, and calculate the spectral entropy $H_b(l)$ of the $b$-th subband of the $l$-th frame;
Step S04: calculate the average energy spectrum $\bar{E}_b$ of the $b$-th subband over all non-speech signal frames;
Step S05: set the likelihood ratio weight $w_b(l)$ of the $b$-th subband according to the spectral entropy $H_b(l)$ of the $b$-th subband, the energy spectrum $E_b(l)$ of the $b$-th subband, and the average energy spectrum $\bar{E}_b$ of the $b$-th subband;
Step S06: weight and sum the likelihood ratio test values according to the weights, calculate the average value, and decide according to the likelihood ratio threshold $\eta$ whether the $l$-th frame signal is speech.
In this scheme, step S02 specifically includes:
the frequency range of the subband division is 0 to $f_s/2$, where $f_s$ is the sampling rate of the signal, 8000 Hz or 16000 Hz; the whole frequency range is divided into a low-frequency band and a high-frequency band, each with a set range in Hz (for $f_s = 8000$ Hz, Example 2 uses 0 to 1000 Hz and 1000 to 4000 Hz);
according to $f_s$, determine the number of subbands in the low-frequency band and in the high-frequency band, divide each band into that number of subbands, and set adjacent subbands to partially overlap.
In this scheme, determining the number of subbands of the low-frequency band and the high-frequency band according to $f_s$, and dividing each band into that number of subbands, specifically includes:
when the sampling rate $f_s$ is 8000 Hz, the frequency range is divided into 5 subbands: the low-frequency band is evenly divided into 2 subbands and the high-frequency band is evenly divided into 3 subbands (500 Hz and 1000 Hz wide respectively in Example 2);
when the sampling rate $f_s$ is 16000 Hz, the frequency range is divided into 10 subbands: the low-frequency band is evenly divided into 4 subbands of equal width and the high-frequency band is evenly divided into 6 subbands of equal width;
The subbands obtained from this division are non-overlapping, and the boundary frequencies of each subband are taken as its upper and lower frequency limits. Let the upper frequency limit of the $b$-th subband be $f_{up}(b)$ and its lower frequency limit be $f_{low}(b)$. From the Fourier transform length $N$ and the limits $f_{up}(b)$ and $f_{low}(b)$, calculate the frequency points $k_{up}(b)$ and $k_{low}(b)$ corresponding to the upper and lower frequency limits of each subband.
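The non-overlapping division can be sketched as follows. This is an illustration under stated assumptions: the band ranges are passed in explicitly (the values in the usage note are those of Example 2), the function names are hypothetical, and the mapping from a frequency to a frequency point uses floor rounding of $f \cdot N / f_s$, which is an assumption because the rounding rule is not stated.

```python
def non_overlapping_edges(low_band, high_band, n_low, n_high):
    """Return (lower, upper) limits in Hz for every subband.

    low_band / high_band are (start_hz, stop_hz) tuples; each band is
    split uniformly, so the overall division is non-uniform whenever
    the two subband widths differ.
    """
    edges = []
    for (start, stop), count in ((low_band, n_low), (high_band, n_high)):
        width = (stop - start) / count
        for i in range(count):
            edges.append((start + i * width, start + (i + 1) * width))
    return edges


def freq_to_bin(freq_hz, n_fft, fs):
    """Map a frequency in Hz to a frequency point (floor rounding assumed)."""
    return int(freq_hz * n_fft // fs)
```

For example, `non_overlapping_edges((0, 1000), (1000, 4000), 2, 3)` yields two 500 Hz subbands followed by three 1000 Hz subbands, and `freq_to_bin(1000, 360, 8000)` gives frequency point 45.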
In this scheme, setting adjacent subbands to partially overlap specifically includes:
let $B$ be the total number of subbands and $\Delta f$ the frequency shift amount. Each of the first $B-1$ subbands is made to partially overlap the next subband by a backward frequency shift, i.e. when $b = 1, \dots, B-1$, $f'_{low}(b) = f_{low}(b)$ and $f'_{up}(b) = f_{up}(b) + \Delta f$, where $f_{up}(b)$ is the upper frequency limit of the non-overlapping $b$-th subband;
the $B$-th subband is made to partially overlap the previous subband by a forward frequency shift, i.e. when $b = B$, $f'_{low}(B) = f_{low}(B) - \Delta f$ and $f'_{up}(B) = f_{up}(B)$, where $f_{low}(B)$ is the lower frequency limit of the non-overlapping $B$-th subband.
From $f'_{up}(b)$ and $f'_{low}(b)$, calculate the frequency points $k'_{up}(b)$ and $k'_{low}(b)$ corresponding to the upper and lower frequency limits of each partially overlapping subband.
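A short sketch of this overlap rule is given below. It assumes that the backward shift raises the upper limit of each of the first subbands by the shift amount and that the forward shift lowers the lower limit of the last subband by the same amount; the function name and argument layout are hypothetical.

```python
def overlap_subbands(edges, shift_hz):
    """Turn non-overlapping (lower, upper) limits into overlapping ones.

    edges:    list of (lower_hz, upper_hz) tuples, ordered by frequency
    shift_hz: frequency shift amount
    """
    last = len(edges) - 1
    overlapped = []
    for b, (low, up) in enumerate(edges):
        if b < last:
            # Backward shift: extend the upper limit into the next subband.
            overlapped.append((low, up + shift_hz))
        else:
            # Forward shift: extend the lower limit into the previous subband.
            overlapped.append((low - shift_hz, up))
    return overlapped
```

The total frequency range is unchanged: only interior boundaries move, so the first subband still starts at the lowest frequency and the last subband still ends at the highest.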
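In this scheme, step S03 specifically includes:
according to the upper frequency limit $f'_{up}(b)$ and lower frequency limit $f'_{low}(b)$ of each subband obtained in step S02, with corresponding frequency points $k'_{up}(b)$ and $k'_{low}(b)$, calculate the energy spectrum $E_b(l)$ of the $b$-th subband of the $l$-th frame; $E_b(l)$ is the sum of the energy spectra of all frequency-point spectral lines from $k'_{low}(b)$ to $k'_{up}(b)$, $E_b(l) = \sum_{k=k'_{low}(b)}^{k'_{up}(b)} Y(k,l)$, where $Y(k,l)$ denotes the energy spectrum of the $k$-th frequency point of the $l$-th frame;
calculate the normalized probability density function $p_b(k,l)$ of the spectral line at the $k$-th frequency point in the $b$-th subband of the $l$-th frame, $p_b(k,l) = Y(k,l) / E_b(l)$, $k'_{low}(b) \le k \le k'_{up}(b)$;
calculate the spectral entropy $H_b(l)$ of the $b$-th subband of the $l$-th frame, $H_b(l) = -\sum_{k=k'_{low}(b)}^{k'_{up}(b)} p_b(k,l) \log p_b(k,l)$, where $p_b(k,l)$ denotes the normalized probability density function of the spectral line at the $k$-th frequency point in the $b$-th subband of the $l$-th frame.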
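The subband energy and spectral entropy of one frame can be sketched as below. The sketch assumes the energy spectrum of the frame is available as a list indexed by frequency point, that subband limits are inclusive frequency-point indices, and that the natural logarithm is used (the base of the logarithm is not stated in the text). Function names are hypothetical.

```python
import math


def subband_energy(power_spectrum, k_low, k_up):
    """Sum of the energy spectrum over frequency points k_low..k_up (inclusive)."""
    return sum(power_spectrum[k_low:k_up + 1])


def subband_spectral_entropy(power_spectrum, k_low, k_up):
    """Spectral entropy of one subband, using the natural logarithm (assumed)."""
    band = power_spectrum[k_low:k_up + 1]
    total = sum(band)
    if total <= 0.0:
        return 0.0  # silent subband: treat the entropy as zero
    entropy = 0.0
    for y in band:
        p = y / total          # normalized probability of this spectral line
        if p > 0.0:            # 0 * log(0) is taken as 0
            entropy -= p * math.log(p)
    return entropy
```

A flat subband reaches the maximum entropy, the logarithm of the number of spectral lines, while a subband dominated by a single line has entropy close to zero, which is why low entropy is taken as evidence of speech.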
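In this scheme, step S04 specifically includes:
calculate the energy spectrum $\tilde{E}_b(m)$ of the $b$-th subband of the $m$-th non-speech signal frame, $\tilde{E}_b(m) = \sum_{k=k'_{low}(b)}^{k'_{up}(b)} \tilde{Y}(k,m)$, where $\tilde{Y}(k,m)$ denotes the energy spectrum of the $k$-th frequency point of the $m$-th non-speech signal frame;
calculate the average energy spectrum $\bar{E}_b$ of the $b$-th subband over all non-speech signal frames, $\bar{E}_b = \frac{1}{M} \sum_{m=1}^{M} \tilde{E}_b(m)$, where $M$ is the total number of non-speech signal frames.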
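The averaging in step S04 can be sketched as follows, assuming the subband energies of every frame have already been computed and that a per-frame flag marks which frames count as non-speech. The data layout and the function name are assumptions for illustration.

```python
def mean_noise_energy(band_energies, is_non_speech):
    """Average subband energy over the frames flagged as non-speech.

    band_energies: list of frames, each a list of per-subband energies
    is_non_speech: list of booleans, one per frame
    """
    noise_frames = [e for e, flag in zip(band_energies, is_non_speech) if flag]
    if not noise_frames:
        raise ValueError("no non-speech frames available")
    n_bands = len(noise_frames[0])
    count = len(noise_frames)
    return [sum(frame[b] for frame in noise_frames) / count
            for b in range(n_bands)]
```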
In this scheme, step S05 specifically includes:
given the spectral entropy preset threshold $\lambda_b$ of the $b$-th subband: if the ratio of the energy spectrum $E_b(l)$ of the $b$-th subband of the $l$-th frame to the average energy spectrum $\bar{E}_b$ of the $b$-th subband of the non-speech signal frames is greater than a preset threshold $\gamma$, i.e. $E_b(l) / \bar{E}_b > \gamma$, and the spectral entropy $H_b(l)$ of the $b$-th subband is less than the preset threshold $\lambda_b$, i.e. $H_b(l) < \lambda_b$, then the likelihood ratio weight $w_b(l)$ of all frequency points in the $b$-th subband is set to 1;
otherwise, the likelihood ratio weight $w_b(l)$ of all frequency points in the $b$-th subband is set to 0;
that is, the likelihood ratio weight $w(k,l)$ is:
$$w(k,l) = \begin{cases} 1, & E_b(l)/\bar{E}_b > \gamma \ \text{and} \ H_b(l) < \lambda_b \\ 0, & \text{otherwise} \end{cases}, \qquad k_{low}(b) \le k \le k_{up}(b);$$
where $w(k,l)$ denotes the likelihood ratio weight of the $k$-th frequency point, which belongs to the $b$-th subband; note that here the $b$-th subband refers to the non-overlapping subband delimited by the lower limit $k_{low}(b)$ and the upper limit $k_{up}(b)$.
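The weighting rule can be sketched as a pair of small functions. The names are hypothetical, the energy ratio test is written as a multiplication to avoid dividing by a zero noise estimate, and the handling of a frequency point shared by two adjacent non-overlapping subbands (the later subband wins) is an assumption, since the text does not say how boundary points are assigned.

```python
def subband_weight(energy, noise_energy, entropy,
                   ratio_threshold, entropy_threshold):
    """Return 1 if the subband looks like speech, otherwise 0."""
    ratio_ok = energy > ratio_threshold * noise_energy   # energy / noise > threshold
    return 1 if ratio_ok and entropy < entropy_threshold else 0


def bin_weights(n_bins, band_bins, band_weights):
    """Spread subband weights onto frequency points.

    band_bins holds the inclusive (k_low, k_up) limits of the
    NON-overlapping subbands; points outside every subband keep weight 0.
    """
    weights = [0] * n_bins
    for (k_low, k_up), w in zip(band_bins, band_weights):
        for k in range(k_low, k_up + 1):
            weights[k] = w    # assumed: a shared boundary point takes the later subband
    return weights
```

Both conditions must hold for a weight of 1, so a loud but spectrally flat subband (high energy, high entropy) is excluded from the weighted sum in the same way as a quiet one.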
In this scheme, the non-speech signal frames are specifically all signal frames detected as "non-speech" in the valid time region that, within the time range already detected, is nearest in time to the $l$-th frame signal;
before the first valid time region occurs, all signal frames in the first 0.5 s of the signal to be detected $x(n)$ are regarded as non-speech signal frames.
In this scheme, the valid time region is specifically:
a time region is a region formed by a certain number of signal frames; let $M$ be the total number of non-speech signal frames in the time region and $\delta$ the time region validity decision threshold; when $M$ reaches the threshold $\delta$, the time region is considered valid; otherwise, the time region is considered invalid.
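The search for the nearest valid time region can be sketched as below. Several details are assumptions rather than statements of the text: regions are taken to be consecutive, non-overlapping blocks of `region_len` frames counted from the start of the signal, a region is treated as valid when its number of non-speech frames is greater than or equal to the threshold, and the function name is hypothetical. When no valid region exists yet, the caller would fall back to the frames of the first 0.5 s.

```python
def nearest_valid_region(non_speech_flags, region_len, min_non_speech,
                         current_frame):
    """Indices of the non-speech frames in the latest valid region.

    Only regions that end before `current_frame` are considered.
    Returns None when no valid region has occurred yet.
    """
    complete = current_frame // region_len       # regions fully in the past
    for r in range(complete - 1, -1, -1):        # search backwards in time
        start = r * region_len
        frames = [l for l in range(start, start + region_len)
                  if non_speech_flags[l]]
        if len(frames) >= min_non_speech:        # assumed validity test
            return frames
    return None
```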
The second aspect of the present application also provides a cross-subband spectral entropy weighted likelihood ratio voice detection system, the system comprising: the system comprises a memory and a processor, wherein the memory comprises a cross-subband spectrum entropy weighted likelihood ratio voice detection method program, and the processor executes the steps of the cross-subband spectrum entropy weighted likelihood ratio voice detection method.
The application discloses a cross-subband spectral entropy weighted likelihood ratio voice detection method and system. First, the frequency domain is divided into non-uniform, partially overlapping subbands, and the spectral entropy of each subband is extracted. Next, the likelihood ratio weight of each subband is set according to the subband spectral entropy and the ratio of the subband energy spectrum to the average subband energy spectrum of the non-speech frames. Finally, the weighted likelihood ratio is compared with a preset threshold to decide whether a given frame is a speech frame. Because the spectral entropy of a speech signal is robust against background noise, the application uses subband spectral entropy to set the likelihood ratio weights in the likelihood ratio test and uses the weighted likelihood ratio as one basis of the voice detection decision. This improves the detection accuracy of likelihood-ratio-test voice detection in low signal-to-noise-ratio environments, and the method is suitable for speech signal processing fields such as speech recognition and speaker recognition.
Drawings
In order to more clearly illustrate the technical solutions of embodiments or examples of the present application, the drawings that are required to be used in the embodiments or examples of the present application will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive efforts for those skilled in the art.
FIG. 1 shows a flow chart of a cross-subband spectral entropy weighted likelihood ratio speech detection method of the present application;
FIG. 2 shows a flow chart of the present application for calculating spectral entropy of each subband;
FIG. 3 is a flow chart of a method of setting likelihood ratio weights for subbands in accordance with the present application;
FIG. 4 is a schematic diagram showing the comparison of the detection results of the method provided by the application and the conventional method;
FIG. 5 shows a block diagram of a cross-subband spectral entropy weighted likelihood ratio speech detection system of the present application.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited to the specific embodiments disclosed below.
Example 1
FIG. 1 shows a flow chart of a cross-subband spectral entropy weighted likelihood ratio speech detection method of the present application.
As shown in fig. 1, a first aspect of the present application provides a cross-subband spectral entropy weighted likelihood ratio voice detection method, including:
Step S01: given the sampled signal to be detected $x(n)$, a likelihood ratio decision threshold $\eta$, a time region validity decision threshold $\delta$, and a Fourier transform length $N$, apply windowing and framing preprocessing to $x(n)$, and calculate the likelihood ratio test value $\Lambda(k,l)$ of the spectral line at each frequency point $k$ of the $l$-th frame signal;
Step S02: divide the frequency domain into subbands, the subbands being non-uniform and partially overlapping;
Step S03: according to the upper and lower frequency limits of the subbands divided in step S02, calculate the energy spectrum $E_b(l)$ of the $b$-th subband of the $l$-th frame signal, and calculate the spectral entropy $H_b(l)$ of the $b$-th subband of the $l$-th frame;
Step S04: calculate the average energy spectrum $\bar{E}_b$ of the $b$-th subband over all non-speech signal frames;
Step S05: set the likelihood ratio weight $w_b(l)$ according to the spectral entropy $H_b(l)$ of the $b$-th subband, the energy spectrum $E_b(l)$ of the $b$-th subband, and the average energy spectrum $\bar{E}_b$ of the $b$-th subband;
Step S06: weight and sum the likelihood ratio test values according to the weights, calculate the average value, and decide according to the likelihood ratio threshold $\eta$ whether the $l$-th frame signal is speech.
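The final decision of step S06 can be sketched as follows. The text states only that the weighted likelihood ratio test values are summed, averaged, and compared with the threshold; dividing by the total number of frequency points and using a strict "greater than" comparison are assumptions, as is the function name.

```python
def frame_is_speech(lr_values, weights, threshold):
    """Decide speech / non-speech for one frame.

    lr_values: likelihood ratio test value of every frequency point
    weights:   0/1 likelihood ratio weight of every frequency point
    threshold: likelihood ratio decision threshold
    """
    weighted = [w * lr for w, lr in zip(weights, lr_values)]
    score = sum(weighted) / len(weighted)   # assumed: mean over all points
    return score > threshold                # assumed: strict comparison
```

With every weight at 0 the score is 0, so a frame in which no subband passes both the energy ratio and the entropy test is classified as non-speech for any positive threshold.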
It should be noted that step S02 specifically includes: the frequency range of the subband division is 0 to $f_s/2$, where $f_s$ is the sampling rate of the signal, 8000 Hz or 16000 Hz; the whole frequency range is divided into a low-frequency band and a high-frequency band, each with a set range in Hz (for $f_s = 8000$ Hz, Example 2 uses 0 to 1000 Hz and 1000 to 4000 Hz);
according to $f_s$, determine the number of subbands in the low-frequency band and in the high-frequency band, divide each band into that number of subbands, and set adjacent subbands to partially overlap.
It should be noted that, when the sampling rate $f_s$ is 8000 Hz, the frequency range is divided into 5 subbands: the low-frequency band is evenly divided into 2 subbands and the high-frequency band is evenly divided into 3 subbands (500 Hz and 1000 Hz wide respectively in Example 2). When the sampling rate $f_s$ is 16000 Hz, the frequency range is divided into 10 subbands: the low-frequency band is evenly divided into 4 subbands of equal width and the high-frequency band is evenly divided into 6 subbands of equal width. The subbands obtained from this division are non-overlapping, and the boundary frequencies of each subband are taken as its upper and lower frequency limits. Let the upper frequency limit of the $b$-th subband be $f_{up}(b)$ and its lower frequency limit be $f_{low}(b)$. From the Fourier transform length $N$ and the limits $f_{up}(b)$ and $f_{low}(b)$, calculate the frequency points $k_{up}(b)$ and $k_{low}(b)$ corresponding to the upper and lower frequency limits of each subband.
It should be noted that adjacent subbands are further set to partially overlap.
Let $B$ be the total number of subbands and $\Delta f$ the frequency shift amount. Each of the first $B-1$ subbands is made to partially overlap the next subband by a backward frequency shift, i.e. when $b = 1, \dots, B-1$, $f'_{low}(b) = f_{low}(b)$ and $f'_{up}(b) = f_{up}(b) + \Delta f$, where $f_{up}(b)$ is the upper frequency limit of the non-overlapping $b$-th subband and $\Delta f$ is the frequency shift amount;
the $B$-th subband is made to partially overlap the previous subband by a forward frequency shift, i.e. when $b = B$, $f'_{low}(B) = f_{low}(B) - \Delta f$ and $f'_{up}(B) = f_{up}(B)$, where $f_{low}(B)$ is the lower frequency limit of the non-overlapping $B$-th subband and $\Delta f$ is the frequency shift amount. From $f'_{up}(b)$ and $f'_{low}(b)$, calculate the frequency points $k'_{up}(b)$ and $k'_{low}(b)$ corresponding to the upper and lower frequency limits of each partially overlapping subband.
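As shown in fig. 2, step S03 specifically includes:
S302, according to the upper frequency limit $f'_{up}(b)$ and lower frequency limit $f'_{low}(b)$ of each subband obtained in step S02, with corresponding frequency points $k'_{up}(b)$ and $k'_{low}(b)$, calculate the energy spectrum $E_b(l)$ of the $b$-th subband of the $l$-th frame; $E_b(l)$ is the sum of the energy spectra of all frequency-point spectral lines from $k'_{low}(b)$ to $k'_{up}(b)$, $E_b(l) = \sum_{k=k'_{low}(b)}^{k'_{up}(b)} Y(k,l)$, where $Y(k,l)$ denotes the energy spectrum of the $k$-th frequency point of the $l$-th frame;
S304, calculate the normalized probability density function $p_b(k,l)$ of the spectral line at the $k$-th frequency point in the $b$-th subband of the $l$-th frame, $p_b(k,l) = Y(k,l) / E_b(l)$, $k'_{low}(b) \le k \le k'_{up}(b)$;
S306, calculate the spectral entropy $H_b(l)$ of the $b$-th subband of the $l$-th frame, $H_b(l) = -\sum_{k=k'_{low}(b)}^{k'_{up}(b)} p_b(k,l) \log p_b(k,l)$, where $p_b(k,l)$ denotes the normalized probability density function of the spectral line at the $k$-th frequency point in the $b$-th subband of the $l$-th frame.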
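It should be noted that step S04 specifically includes:
calculate the energy spectrum $\tilde{E}_b(m)$ of the $b$-th subband of the $m$-th non-speech signal frame, $\tilde{E}_b(m) = \sum_{k=k'_{low}(b)}^{k'_{up}(b)} \tilde{Y}(k,m)$, where $\tilde{Y}(k,m)$ denotes the energy spectrum of the $k$-th frequency point of the $m$-th non-speech signal frame;
calculate the average energy spectrum $\bar{E}_b$ of the $b$-th subband over all non-speech signal frames, $\bar{E}_b = \frac{1}{M} \sum_{m=1}^{M} \tilde{E}_b(m)$, where $M$ is the total number of non-speech signal frames.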
As shown in fig. 3, step S05 specifically includes:
S502, given the spectral entropy preset threshold $\lambda_b$ of the $b$-th subband: if the ratio of the energy spectrum $E_b(l)$ of the $b$-th subband of the $l$-th frame to the average energy spectrum $\bar{E}_b$ of the $b$-th subband of the non-speech signal frames is greater than a preset threshold $\gamma$, i.e. $E_b(l) / \bar{E}_b > \gamma$, and the spectral entropy $H_b(l)$ of the $b$-th subband is less than the preset threshold $\lambda_b$, i.e. $H_b(l) < \lambda_b$, then the likelihood ratio weight $w_b(l)$ of all frequency points in the $b$-th subband is set to 1;
S504, otherwise, the likelihood ratio weight $w_b(l)$ of all frequency points in the $b$-th subband is set to 0;
that is, the likelihood ratio weight $w(k,l)$ is:
$$w(k,l) = \begin{cases} 1, & E_b(l)/\bar{E}_b > \gamma \ \text{and} \ H_b(l) < \lambda_b \\ 0, & \text{otherwise} \end{cases}, \qquad k_{low}(b) \le k \le k_{up}(b);$$
where $w(k,l)$ denotes the likelihood ratio weight of the $k$-th frequency point, which belongs to the $b$-th subband; note that here the $b$-th subband refers to the non-overlapping subband delimited by the lower limit $k_{low}(b)$ and the upper limit $k_{up}(b)$.
It should be noted that the non-speech signal frames are specifically all signal frames detected as "non-speech" in the valid time region that, within the time range already detected, is nearest in time to the $l$-th frame signal; before the first valid time region occurs, all signal frames in the first 0.5 s of the signal to be detected $x(n)$ are regarded as non-speech signal frames.
A time region is a region formed by a certain number of signal frames. Let $L$ be the number of signal frames in a region, so that the $r$-th time region runs from its start frame to its end frame. Let $M$ be the total number of non-speech signal frames in the time region and $\delta$ the time region validity decision threshold; when $M$ reaches the threshold $\delta$, the time region is considered valid; otherwise, the time region is considered invalid.
Example 2
In this embodiment, a low signal-to-noise-ratio signal sample is produced, voice activity detection is performed on it using the provided method, and the results are compared with those of a conventional likelihood ratio voice activity detection method.
The application provides a cross-subband spectrum entropy weighted likelihood ratio voice detection method, which comprises the following steps:
Step S01: given the sampled signal to be detected $x(n)$, a likelihood ratio decision threshold $\eta$, a time region validity decision threshold $\delta$, and a Fourier transform length $N$, apply windowing and framing preprocessing to $x(n)$, and calculate the likelihood ratio test value $\Lambda(k,l)$ of the spectral line at each frequency point $k$ of the $l$-th frame signal;
A multi-speaker dialogue sample is selected from the Chinese Mandarin natural spoken dialogue corpus (CADCC); its total duration is about 20 minutes 37.526 seconds, it contains 528 speech segments, and its sampling rate is 8000 Hz. The sample is manually labeled for detection accuracy statistics: speech frames (containing vowels and consonants) account for about 75.03% and non-speech frames for about 24.97%. The NOISEX-92 noise database is used as the source of additive noise, and the selected noise samples include Gaussian white noise (stationary noise) and noisy noise (non-stationary noise). Speech and noise are mixed into a low signal-to-noise-ratio speech sample with a signal-to-noise ratio of 0 dB, which serves as the sampled signal to be detected $x(n)$.
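The mixing procedure used to build the 0 dB sample is not detailed. One common way to do it, shown here as a sketch, is to scale the noise so that the ratio of average speech power to average noise power matches the target signal-to-noise ratio over the whole recording. The function name is hypothetical, and measuring power over the entire signal (rather than over speech-active segments only) is an assumption.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at the requested signal-to-noise ratio in dB.

    speech and noise are equal-length lists of samples; the noise is
    rescaled, the speech is left untouched.
    """
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Gain that makes p_speech / (gain^2 * p_noise) equal 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

At 0 dB the scaled noise carries the same average power as the speech.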
When detecting speech samples with Gaussian white noise, the likelihood ratio decision threshold $\eta$ is set to 0.6; when detecting speech samples with noisy noise, $\eta$ is set to 20. The time region validity decision threshold $\delta$ is set to 30 frames. The signal $x(n)$ is preprocessed by framing with a Hamming window, with the frame length set to 45 ms (360 samples at 8000 Hz, matching the Fourier transform length) and the frame shift set to 22.5 ms. The likelihood ratio test value $\Lambda(k,l)$ of each frequency point $k$ of the $l$-th frame signal is then calculated. The Fourier transform length $N$ is set to 360.
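The framing and spectrum computation can be sketched in plain Python as follows, using a 360-sample frame (equal to the Fourier transform length) and a 180-sample shift, which corresponds to 22.5 ms at 8000 Hz. The symmetric form of the Hamming window, the direct (slow) discrete Fourier transform, and the function names are choices made for this sketch; a real implementation would normally use an FFT library.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(signal, frame_len, hop):
    """Split the signal into windowed frames; a trailing partial frame is dropped."""
    window = hamming(frame_len)
    return [[signal[start + i] * window[i] for i in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Energy spectrum |X(k)|^2 for frequency points 0..N/2 (direct DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum
```

With these settings a 1000 Hz tone sampled at 8000 Hz peaks at frequency point 45, and the first 0.5 s of signal yields about 22 frame shifts.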
Step S02: divide the frequency domain into subbands, the subbands being non-uniform and partially overlapping;
(1) Subband division in the frequency domain
Since the signal sampling rate $f_s$ is 8000 Hz, the frequency range of the subband division is 0 to 4000 Hz; the whole frequency range is divided into a low-frequency band and a high-frequency band, with the low-frequency band set to 0 to 1000 Hz and the high-frequency band set to 1000 to 4000 Hz;
When the signal sampling rate $f_s$ is 8000 Hz, the frequency range is divided into 5 subbands: the low-frequency band is evenly divided into 2 subbands, each 500 Hz wide, and the high-frequency band is evenly divided into 3 subbands, each 1000 Hz wide. The subbands obtained from this division are non-overlapping, and the boundary frequencies of each subband are taken as its upper and lower frequency limits. From the Fourier transform length ($N = 360$) and the limits $f_{up}(b)$ and $f_{low}(b)$, the frequency points corresponding to the upper and lower frequency limits are calculated. The upper and lower frequency limits of the non-overlapping subbands and the corresponding frequency-point spectral lines are shown in Table 1.
TABLE 1 non-overlapping subband frequency upper and lower limits and corresponding bins
;
(2) Setting adjacent subbands to partially overlap
Adjacent subbands are further set to partially overlap. The total number of subbands $B$ is 5, and the frequency shift amount $\Delta f$ is set to 500 Hz. Each of the first 4 subbands is made to partially overlap the next subband by a backward frequency shift, i.e. when $b = 1, \dots, 4$, $f'_{low}(b) = f_{low}(b)$ and $f'_{up}(b) = f_{up}(b) + \Delta f$, where $f_{up}(b)$ is the upper frequency limit of the non-overlapping $b$-th subband and $\Delta f$ is the frequency shift amount; the 5th subband is made to partially overlap the previous subband by a forward frequency shift, i.e. when $b = 5$, $f'_{low}(5) = f_{low}(5) - \Delta f$ and $f'_{up}(5) = f_{up}(5)$, where $f_{low}(5)$ is the lower frequency limit of the non-overlapping 5th subband. From $f'_{up}(b)$ and $f'_{low}(b)$, the corresponding frequency points are calculated. The upper and lower frequency limits of the partially overlapping subbands and the corresponding frequency-point spectral lines are shown in Table 2.
TABLE 2 partially overlapping subband frequency upper and lower limits and corresponding bins
;
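As a worked check of the two divisions described in this step, the band limits in Hz can be derived from the stated settings ($f_s = 8000$ Hz, 5 subbands, shift amount 500 Hz), assuming the shift raises the upper limit of subbands 1 to 4 and lowers the lower limit of subband 5. The symbols $f_{low}$, $f_{up}$ and their primed (overlapping) versions are notation chosen for this illustration. The frequency-point columns of Tables 1 and 2 are not reproduced here, because the rounding rule for non-integer frequency points is not stated.

```latex
\begin{array}{c|cc|cc}
b & f_{low}(b) & f_{up}(b) & f'_{low}(b) & f'_{up}(b) \\ \hline
1 & 0    & 500  & 0    & 1000 \\
2 & 500  & 1000 & 500  & 1500 \\
3 & 1000 & 2000 & 1000 & 2500 \\
4 & 2000 & 3000 & 2000 & 3500 \\
5 & 3000 & 4000 & 2500 & 4000
\end{array}
\qquad
k = \frac{f \, N}{f_s} = \frac{360}{8000} f = 0.045 f
```

Under this mapping 1000 Hz corresponds to frequency point 45 and 4000 Hz to frequency point 180, while 500 Hz falls at 22.5 and therefore needs rounding.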
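Step S03: according to the upper and lower frequency limits of the subbands divided in step S02, calculate the energy spectrum $E_b(l)$ of the $b$-th subband of the $l$-th frame signal, and calculate the spectral entropy $H_b(l)$ of the $b$-th subband of the $l$-th frame;
(1) With the set Fourier transform length $N$, perform a fast Fourier transform on the $l$-th frame signal $x_l(n)$, denoted $X(k,l)$; the energy spectrum of each frequency point is $Y(k,l) = |X(k,l)|^2$. According to the upper frequency limit $f'_{up}(b)$ and lower frequency limit $f'_{low}(b)$ of each subband, with corresponding frequency points $k'_{up}(b)$ and $k'_{low}(b)$, calculate the energy spectrum $E_b(l)$ of the $b$-th subband of the $l$-th frame; $E_b(l)$ is the sum of the energy spectra of all frequency-point spectral lines from $k'_{low}(b)$ to $k'_{up}(b)$, $E_b(l) = \sum_{k=k'_{low}(b)}^{k'_{up}(b)} Y(k,l)$, where $Y(k,l)$ denotes the energy spectrum of the $k$-th frequency point of the $l$-th frame;
(2) Calculate the spectral entropy of the $b$-th subband of the $l$-th frame
Calculate the normalized probability density function $p_b(k,l)$ of the spectral line at the $k$-th frequency point in the $b$-th subband of the $l$-th frame, $p_b(k,l) = Y(k,l) / E_b(l)$, $k'_{low}(b) \le k \le k'_{up}(b)$. Then calculate the spectral entropy $H_b(l)$ of the $b$-th subband of the $l$-th frame, $H_b(l) = -\sum_{k=k'_{low}(b)}^{k'_{up}(b)} p_b(k,l) \log p_b(k,l)$, where $p_b(k,l)$ denotes the normalized probability density function of the spectral line at the $k$-th frequency point in the $b$-th subband of the $l$-th frame.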
Step S04: Calculate the average energy spectrum $\bar{E}_N(k)$ of the $k$-th subband over all non-speech signal frames;
Calculate the energy spectrum of the $k$-th subband of the $n$-th non-speech signal frame, $E_N(n,k) = \sum_{j=j_L(k)}^{j_H(k)} Y_N(n,j)$, where $Y_N(n,j)$ is the energy spectrum of the $j$-th frequency point of the $n$-th non-speech signal frame. Then calculate the average energy spectrum of the $k$-th subband over all non-speech signal frames, $\bar{E}_N(k) = \frac{1}{M} \sum_{n=1}^{M} E_N(n,k)$, where $M$ is the total number of non-speech signal frames.
The non-speech signal frames are specifically all signal frames detected as "non-speech" in the valid time zone that, within the time range already detected, is nearest in time to the current frame signal; before the first valid time zone occurs, all signal frames in the first 0.5 s of the signal to be detected are regarded as non-speech signal frames.
A time zone is a region formed by a certain number of signal frames, running from a start frame to an end frame determined by the set zone length. The total number of non-speech signal frames in the time zone is compared with the set time zone validity decision threshold: when the threshold condition is met, the time zone is considered valid; otherwise, the time zone is considered invalid.
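The averaging in step S04 amounts to a per-subband mean over the frames currently treated as non-speech. A minimal sketch, assuming the subband energies of those frames have already been collected into a list of lists; the function name `mean_noise_energy` and the data layout are hypothetical.

```python
from typing import List, Sequence


def mean_noise_energy(noise_band_energies: Sequence[Sequence[float]]) -> List[float]:
    """Average each sub-band's energy over the non-speech frames.

    noise_band_energies[n][k] is the energy of sub-band k in the n-th
    non-speech frame. Returns one average per sub-band.
    """
    n_frames = len(noise_band_energies)
    if n_frames == 0:
        raise ValueError("need at least one non-speech frame")
    n_bands = len(noise_band_energies[0])
    return [
        sum(frame[k] for frame in noise_band_energies) / n_frames
        for k in range(n_bands)
    ]
```

Before the first valid time zone exists, the input would be the frames of the first 0.5 s of the signal, as described above.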
Step S05: According to the spectral entropy $H(m,k)$ of the $k$-th subband, the energy spectrum $E(m,k)$ of the $k$-th subband, and the average energy spectrum $\bar{E}_N(k)$ of the $k$-th subband, set the likelihood ratio weight $w(m,j)$;
Set a preset energy threshold, and calculate the mean spectral entropy of each subband over the first 0.5 s of the signal (about 22 frames) as the preset spectral entropy threshold of that subband, as shown in Table 3. If, for the $m$-th frame, the energy spectrum of the $k$-th subband measured against the average energy spectrum of the $k$-th subband of the non-speech signal frames is greater than the preset energy threshold, and the spectral entropy of the $k$-th subband is less than its preset threshold (Table 3), then the likelihood ratio weights of all frequency points in the $k$-th subband are set to 1; otherwise, the likelihood ratio weights of all frequency points in the $k$-th subband are set to 0. That is:
$$w(m,j) = \begin{cases} 1, & \text{if the energy condition and the entropy condition both hold for the subband } k \text{ containing frequency point } j \\ 0, & \text{otherwise} \end{cases}$$
where $w(m,j)$ denotes the likelihood ratio weight of the $j$-th frequency point of the $m$-th frame, which belongs to the $k$-th subband. Note in particular that the subband $k$ here refers to the non-overlapping subbands divided by the upper and lower frequency limits of Table 1.
Table 3. Preset spectral entropy threshold of each subband
[Table 3 is not reproduced in this text.]
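The weighting rule of step S05 can be sketched as below. Several points are assumptions rather than facts from the source: the energy condition is interpreted as the ratio of the subband energy to the average non-speech energy exceeding a threshold (written as a product to avoid dividing by zero), and the function name, parameter names, and data layout are hypothetical. The weights are written onto the frequency points of the non-overlapping subbands, as the text specifies.

```python
from typing import List, Sequence, Tuple


def likelihood_weights(
    band_energy: Sequence[float],
    noise_energy: Sequence[float],
    band_entropy: Sequence[float],
    entropy_thresholds: Sequence[float],
    band_bins: Sequence[Tuple[int, int]],
    energy_ratio_threshold: float,
    n_bins: int,
) -> List[int]:
    """Return a 0/1 weight per frequency point for one frame.

    band_bins[k] is the (first, last) bin of the k-th NON-overlapping
    sub-band; the energies and entropies are per sub-band values.
    """
    weights = [0] * n_bins
    for k, (j_low, j_high) in enumerate(band_bins):
        # Assumed energy condition: E(m,k) / mean noise energy > threshold.
        loud = band_energy[k] > energy_ratio_threshold * noise_energy[k]
        # Entropy condition: sub-band entropy below its preset threshold.
        ordered = band_entropy[k] < entropy_thresholds[k]
        if loud and ordered:
            for j in range(j_low, j_high + 1):
                weights[j] = 1
    return weights
```

Only subbands that are both energetic relative to the noise estimate and spectrally structured keep their frequency points in the later likelihood ratio average; all other points are masked out.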
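Step S06: Weight the likelihood ratio test values by the likelihood ratio weights, sum them, and calculate the average; finally, decide according to the likelihood ratio decision threshold whether the $m$-th frame signal is speech.
When the weighted average of the likelihood ratio test values of the $m$-th frame is above the likelihood ratio decision threshold, the $m$-th frame signal is judged to be speech; when it is below the threshold, the $m$-th frame signal is judged to be non-speech. The weighted average is formed from the products of each frequency point's likelihood ratio test value and its likelihood ratio weight.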
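The final decision of step S06 can be sketched as a masked average followed by a threshold test. This is an illustration under assumptions: the sum of weighted likelihood ratio values is divided by the total number of frequency points (the source does not show its normalization), a frame exactly at the threshold is treated as non-speech, and the function name `is_speech` is hypothetical.

```python
from typing import Sequence


def is_speech(
    lr_values: Sequence[float],
    weights: Sequence[int],
    threshold: float,
) -> bool:
    """Decide speech / non-speech for one frame.

    lr_values[j] is the likelihood ratio test value of frequency point j
    and weights[j] is its 0/1 likelihood ratio weight.
    """
    if len(lr_values) != len(weights):
        raise ValueError("lr_values and weights must have the same length")
    if not lr_values:
        raise ValueError("need at least one frequency point")
    # Assumed normalization: average over all frequency points.
    weighted_mean = sum(w * v for w, v in zip(weights, lr_values)) / len(lr_values)
    return weighted_mean > threshold
```

For example, a frame whose weights are all 0 has a weighted average of 0 and is judged non-speech for any positive threshold.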
The effectiveness of the provided method is illustrated by comparing it with the conventional likelihood ratio test voice detection method, using detection result examples and detection accuracy statistics. Fig. 4 shows a comparison example (frames 22 to 294) between the method of this embodiment and the conventional method.
In terms of detection accuracy, as shown in Table 4, the provided method is clearly more accurate than the conventional method at a signal-to-noise ratio of 0 dB (white noise and noisy background noise).
Table 4. Comparison of the detection accuracy of the provided method and the conventional method
[Table 4 is not reproduced in this text.]
Example 3
FIG. 5 shows a block diagram of a cross-subband spectral entropy weighted likelihood ratio speech detection system of the present application.
The second aspect of the present application also provides a cross-subband spectral entropy weighted likelihood ratio speech detection system 5, comprising: a memory 51, a processor 52, said memory comprising a cross-subband spectral entropy weighted likelihood ratio speech detection method program which when executed by said processor performs the steps of:
given a sampled signal to be detected, a likelihood ratio decision threshold, a time zone validity decision threshold, and a Fourier transform length, performing windowing and framing preprocessing on the signal, and calculating the likelihood ratio test value of each frequency-point spectral line of the $m$-th frame signal;
dividing subbands in the frequency domain range, wherein the subbands are non-uniform and partially overlapping;
calculating, according to the upper and lower frequency limits of the divided subbands, the energy spectrum of the $k$-th subband of the $m$-th frame signal, and calculating the spectral entropy of the $k$-th subband of the $m$-th frame;
calculating the average energy spectrum of the $k$-th subband over all non-speech signal frames;
setting the likelihood ratio weight according to the spectral entropy of the $k$-th subband, the energy spectrum of the $k$-th subband, and the average energy spectrum of the $k$-th subband;
weighting the likelihood ratio test values by the weights, summing them, calculating the average, and judging according to the likelihood ratio threshold whether the $m$-th frame signal is speech.
The third aspect of the present application also provides a computer readable storage medium, comprising a cross-subband spectral entropy weighted likelihood ratio speech detection method program, which when executed by a processor, implements the steps of a cross-subband spectral entropy weighted likelihood ratio speech detection method as described in any of the preceding claims.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communicative connection between the components shown or discussed may be through some interface, and the indirect coupling or communicative connection between devices or units may be electrical, mechanical, or of another form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Alternatively, the above-described integrated units of the present application may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
The foregoing describes only specific embodiments of the present application, and the scope of the present application is not limited thereto. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein falls within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (9)
1. A cross-subband spectrum entropy weighted likelihood ratio voice detection method is characterized by comprising the following steps:
step S01: given a sampled signal to be detected, a likelihood ratio decision threshold, a time zone validity decision threshold, and a Fourier transform length, performing windowing and framing preprocessing on the signal, and calculating the likelihood ratio test value of each frequency point of the $m$-th frame signal;
step S02: dividing subbands in the frequency domain range, wherein the subbands are non-uniform and partially overlapping;
step S03: according to the upper and lower frequency limits of the subbands divided in step S02, calculating the energy spectrum $E(m,k)$ of the $k$-th subband of the $m$-th frame signal, and calculating the spectral entropy $H(m,k)$ of the $k$-th subband of the $m$-th frame;
step S04: calculating the average energy spectrum $\bar{E}_N(k)$ of the $k$-th subband over all non-speech signal frames;
step S05: according to the spectral entropy of the $k$-th subband, the energy spectrum of the $k$-th subband, and the average energy spectrum of the $k$-th subband, setting the likelihood ratio weight $w(m,j)$;
step S06: weighting the likelihood ratio test values by the weights, summing them, calculating the average, and judging according to the likelihood ratio threshold whether the $m$-th frame signal is speech;
the step S05 specifically includes:
given a preset spectral entropy threshold for the $k$-th subband, if, for the $m$-th frame, the energy spectrum $E(m,k)$ of the $k$-th subband measured against the average energy spectrum $\bar{E}_N(k)$ of the $k$-th subband of the non-speech signal frames is greater than a preset threshold, and the spectral entropy $H(m,k)$ of the $k$-th subband is less than its preset threshold, then the likelihood ratio weights $w(m,j)$ of all frequency points in the $k$-th subband are set to 1;
otherwise, the likelihood ratio weights $w(m,j)$ of all frequency points in the $k$-th subband are set to 0;
wherein the likelihood ratio weight $w(m,j)$ is:
$$w(m,j) = \begin{cases} 1, & \text{if both of the above conditions hold for the subband } k \text{ containing frequency point } j \\ 0, & \text{otherwise} \end{cases}$$
where $w(m,j)$ denotes the likelihood ratio weight of the $j$-th frequency point, which belongs to the $k$-th subband; here the subband $k$ refers to the non-overlapping subbands divided by the lower limit $f_L(k)$ and the upper limit $f_H(k)$.
2. The method for detecting the cross-subband spectral entropy weighted likelihood ratio voice according to claim 1, wherein the step S02 is specifically:
the frequency range of the subband division is determined by the sampling rate $f_s$ of the signal, where $f_s$ is 8000 Hz or 16000 Hz; the whole frequency range is divided into a low-frequency class frequency range and a high-frequency class frequency range, each set to a range in Hz;
according to $f_s$, the number of subbands in the low-frequency class frequency band and in the high-frequency class frequency band is determined, subband division is performed according to those numbers, and adjacent subbands are set to partially overlap.
3. The method for cross-subband spectral entropy weighted likelihood ratio speech detection of claim 2, wherein determining, according to the sampling rate $f_s$, the number of subbands in the low-frequency class frequency band and in the high-frequency class frequency band, and performing subband division according to those numbers, specifically comprises:
when the sampling rate $f_s$ is 8000 Hz, the frequency band is divided into 5 subbands: the low-frequency class frequency band is evenly divided into 2 subbands, each subband width being the width of the low-frequency class frequency band divided by 2, and the high-frequency class frequency band is evenly divided into 3 subbands, each subband width being the width of the high-frequency class frequency band divided by 3;
when the sampling rate $f_s$ is 16000 Hz, the frequency band is divided into 10 subbands: the low-frequency class frequency band is evenly divided into 4 subbands, each subband width being the width of the low-frequency class frequency band divided by 4, and the high-frequency class frequency band is evenly divided into 6 subbands, each subband width being the width of the high-frequency class frequency band divided by 6;
the subbands obtained after division are non-overlapping, and the boundary frequencies of each subband are taken as its upper and lower frequency limits; let the upper frequency limit of the $k$-th subband be $f_H(k)$ and the lower frequency limit of the $k$-th subband be $f_L(k)$; according to the Fourier transform length $N$ and $f_H(k)$ and $f_L(k)$, the frequency points $j_H(k)$ and $j_L(k)$ corresponding to the upper and lower frequency limits of each subband are calculated.
4. The method for detecting the cross-subband spectral entropy weighted likelihood ratio voice according to claim 2, wherein adjacent subbands are set to be partially overlapped, specifically:
let $K$ be the total number of subbands and $\Delta f$ the frequency shift amount; each of the first $K-1$ subbands is made to partially overlap the next subband by a backward frequency shift, i.e. when $k \le K-1$, $f'_H(k) = f_H(k) + \Delta f$ and $f'_L(k) = f_L(k)$, where $f_H(k)$ is the upper frequency limit of the $k$-th subband before the shift, $f_L(k)$ is the lower frequency limit of the $k$-th subband before the shift, $f'_H(k)$ is the upper frequency limit of the $k$-th subband after the shift, and $f'_L(k)$ is the lower frequency limit of the $k$-th subband after the shift;
the $K$-th subband is made to partially overlap the previous subband by a forward frequency shift, i.e. when $k = K$, $f'_L(K) = f_L(K) - \Delta f$ and $f'_H(K) = f_H(K)$;
according to $f'_H(k)$ and $f'_L(k)$, the frequency points $j_H(k)$ and $j_L(k)$ corresponding to the upper and lower frequency limits of each subband are calculated.
5. The method for detecting voice by cross-subband spectral entropy weighted likelihood ratio according to claim 4, wherein step S03 is specifically:
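according to the frequency points $j_H(k)$ and $j_L(k)$ corresponding to the upper and lower frequency limits of each subband, calculating the energy spectrum $E(m,k)$ of the $k$-th subband of the $m$-th frame as the sum of the energy spectra of all frequency-point spectral lines from $j_L(k)$ to $j_H(k)$, $E(m,k) = \sum_{j=j_L(k)}^{j_H(k)} Y(m,j)$, where $Y(m,j)$ denotes the energy spectrum of the $j$-th frequency point of the $m$-th frame;
calculating the normalized probability density function of the $j$-th frequency-point spectral line in the $k$-th subband of the $m$-th frame, $p(m,k,j) = Y(m,j) / E(m,k)$;
calculating the spectral entropy of the $k$-th subband of the $m$-th frame, $H(m,k) = -\sum_{j=j_L(k)}^{j_H(k)} p(m,k,j) \log p(m,k,j)$, where $p(m,k,j)$ denotes the normalized probability density function of the $j$-th frequency-point spectral line in the $k$-th subband of the $m$-th frame.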
6. The method for detecting speech by cross-subband spectral entropy weighted likelihood ratio according to claim 4, wherein step S04 is specifically:
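calculating the energy spectrum of the $k$-th subband of the $n$-th non-speech signal frame, $E_N(n,k) = \sum_{j=j_L(k)}^{j_H(k)} Y_N(n,j)$, where $Y_N(n,j)$ is the energy spectrum of the $j$-th frequency point of the $n$-th non-speech signal frame;
calculating the average energy spectrum of the $k$-th subband over all non-speech signal frames, $\bar{E}_N(k) = \frac{1}{M} \sum_{n=1}^{M} E_N(n,k)$, where $M$ is the total number of non-speech signal frames.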
7. The method of claim 1, wherein the non-speech signal frames are all signal frames detected as "non-speech" in the valid time zone that, within the time range already detected, is nearest in time to the $m$-th frame signal; before the first valid time zone occurs, all signal frames in the first 0.5 s of the signal to be detected are regarded as non-speech signal frames.
8. The method for detecting the voice of the cross-subband spectral entropy weighted likelihood ratio according to claim 7, wherein the effective time area is specifically:
the time zone is a region formed by a certain number of signal frames; the total number of non-speech signal frames in the time zone is compared with the time zone validity decision threshold; when the threshold condition is met, the time zone is considered valid; otherwise, the time zone is considered invalid.
9. A cross-subband spectral entropy weighted likelihood ratio speech detection system, the system comprising a memory and a processor, wherein the memory contains a cross-subband spectral entropy weighted likelihood ratio speech detection method program and the processor, when executing the program, performs the steps of the cross-subband spectral entropy weighted likelihood ratio speech detection method according to any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310963463.7A CN116665717B (en) | 2023-08-02 | 2023-08-02 | Cross-subband spectral entropy weighted likelihood ratio voice detection method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116665717A CN116665717A (en) | 2023-08-29 |
CN116665717B true CN116665717B (en) | 2023-09-29 |
Family
ID=87715797
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310963463.7A Active CN116665717B (en) | 2023-08-02 | 2023-08-02 | Cross-subband spectral entropy weighted likelihood ratio voice detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116665717B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108962285A (en) * | 2018-07-20 | 2018-12-07 | 浙江万里学院 | A kind of sound end detecting method dividing subband based on human ear masking effect |
CN110047519A (en) * | 2019-04-16 | 2019-07-23 | 广州大学 | A kind of sound end detecting method, device and equipment |
CN113838476A (en) * | 2021-09-24 | 2021-12-24 | 世邦通信股份有限公司 | Noise estimation method and device for noisy speech |
WO2022105570A1 (en) * | 2020-11-17 | 2022-05-27 | 深圳壹账通智能科技有限公司 | Speech endpoint detection method, apparatus and device, and computer readable storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
ATE477572T1 (en) * | 2007-10-01 | 2010-08-15 | Harman Becker Automotive Sys | EFFICIENT SUB-BAND AUDIO SIGNAL PROCESSING, METHOD, APPARATUS AND ASSOCIATED COMPUTER PROGRAM |
Non-Patent Citations (2)
Title |
---|
An improved robust statistical voice activity detection based on sub-band periodic intensity;Weijun He et al.;2015 IEEE International Conference on Information and Automation;全文 * |
基于方差和谱熵结合的语音端点检测方法;毛强等;常州工学院学报;第34卷(第2期);36-40+52 * |
Also Published As
Publication number | Publication date |
---|---|
CN116665717A (en) | 2023-08-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||