CN108053842B - Short wave voice endpoint detection method based on image recognition - Google Patents
Short wave voice endpoint detection method based on image recognition

- Publication number: CN108053842B (application CN201711330638.1A)
- Authority: CN (China)
- Prior art keywords: voice, signal, spectrum, spectrogram, window
- Legal status: Active
Classifications
- G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
- G06T7/13: Image analysis; segmentation; edge detection
- G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
- G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
- G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
Abstract
The invention belongs to the field of voice detection, and in particular relates to a short-wave voice endpoint detection method based on image recognition. The technical scheme of the invention is as follows: first, the data are preprocessed to improve the signal-to-noise ratio; then the data are framed at a specific length and a short-time Fourier transform is applied to obtain a spectrogram; finally, image recognition is used to search for voiceprints in the spectrogram, and the speech segments in the data are determined from the voiceprint distribution. With the method of the invention, the preprocessed voice has a similar signal-to-noise ratio regardless of input, and no parameters need to be adjusted in the subsequent steps; the method can therefore adaptively select talk segments from different background noises.
Description
Technical Field
The invention belongs to the field of voice detection, and particularly relates to a short-wave voice endpoint detection method based on image recognition.
Background
Despite the continuous emergence of new radio communication systems, short-wave radio remains widely used thanks to its autonomous communication capability and wide coverage. However, short-wave transmission relies on reflection of the radio wave by the ionosphere, so the received signal is noisy. Strong background noise prevents monitoring personnel from working for long periods, so noise reduction is needed, together with squelch processing of non-voice segments. To prevent hearing loss, the performance of the voice endpoint detection method is therefore important.
In conventional speech processing there are many endpoint detection methods based on different features, such as detection based on the correlation function, on cepstral distance, on the energy-to-zero-crossing ratio, and on wavelet decomposition. With parameters tuned to particular voices, these methods can select talk segments accurately. In a changing environment requiring real-time communication, however, readjusting the endpoint detection parameters is impractical, and the conventional speech processing methods no longer apply.
A voice frequency spectrogram (spectrogram for short) shows, through short-time Fourier analysis of the voice, how the short-time spectrum of speech changes over time. Its horizontal axis is time and its vertical axis is frequency; the gray stripes represent the short-time spectra at successive instants. Because the spectrogram reflects the dynamic spectral characteristics of a voice signal, it has important practical value in speech analysis and is sometimes called "visible speech".
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an adaptive processing method based on the characteristic mechanism of human voice production and on the fact that voiceprints do not appear in a noise spectrum.
The technical scheme of the invention is as follows: first, the data are preprocessed to improve the signal-to-noise ratio; then the data are framed at a specific length and a short-time Fourier transform is applied to obtain a spectrogram; finally, image recognition is used to search for voiceprints in the spectrogram, and the speech segments in the data are determined from the voiceprint distribution.
A short wave voice endpoint detection method based on image recognition specifically comprises the following steps:
S1, perform voice preprocessing, whose aim is to ensure that the voiceprints in the resulting spectrogram have approximately the same definition, which is the precondition for effective image recognition. The specific steps are:
S11, during collection of voice signal data, defects of the test system introduce a linear or slowly varying trend error into the time sequence, so that the zero line of the voice signal deviates from the baseline, and the deviation may even change over time; this distorts the correlation function and power spectrum of the voice in later calculations. The trend error is removed by fitting a trend term with the least-squares method;
S12, normalize the amplitude;
S13, apply low-pass filtering to remove noise above 3500 Hz;
S14, strengthen the voice using spectral subtraction based on the multi-window spectrum.
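The detrending and normalization of S11 and S12 can be sketched in a few lines of Python. This is an illustrative sketch under our own function names, not the patent's implementation:

```python
def detrend_linear(x):
    """Remove a linear trend term fitted by least squares (step S11)."""
    n = len(x)
    t = list(range(n))
    t_mean = sum(t) / n
    x_mean = sum(x) / n
    # closed-form least-squares fit of x over sample index t
    num = sum((ti - t_mean) * (xi - x_mean) for ti, xi in zip(t, x))
    den = sum((ti - t_mean) ** 2 for ti in t)
    slope = num / den
    intercept = x_mean - slope * t_mean
    return [xi - (slope * ti + intercept) for ti, xi in zip(t, x)]

def normalize_amplitude(x):
    """Scale so the peak absolute amplitude is 1 (step S12)."""
    peak = max(abs(v) for v in x)
    return [v / peak for v in x] if peak > 0 else list(x)
```

A purely linear input comes back as (numerically) all zeros after detrending, which is exactly the behaviour S11 asks for.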
S2, perform image recognition on the obtained spectrogram to obtain a structure containing the start points and end points of the voiceprint positions in the spectrogram. The specific steps are:
S21, frame the voice signal and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22, arrange the short-time spectra obtained in S21 in frame order to obtain the spectrogram;
S23, identify the voiceprints in the spectrogram of S22: convert the color spectrogram to a gray-scale image; extract the image edges of the gray-scale image and identify the positions of the line segments in it; form the obtained start and end points of the voiceprint positions into a structure.
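Steps S21 and S22 amount to a framed DFT whose magnitude columns, stacked in time order, form the spectrogram. A minimal pure-Python sketch (a real system would use an FFT; the function name is ours):

```python
import cmath

def stft_spectrogram(x, frame_len, hop):
    """Frame the signal and take a per-frame DFT magnitude (S21-S22).
    spec[frame][bin] holds the short-time magnitude spectrum; the
    frames, ordered in time, form the spectrogram."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    spec = []
    for frame in frames:
        mags = []
        for k in range(frame_len // 2 + 1):  # keep non-negative frequencies
            s = sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / frame_len)
                    for m in range(frame_len))
            mags.append(abs(s))
        spec.append(mags)
    return spec
```

A pure tone placed at DFT bin 2 of an 8-sample frame shows up as a peak at bin 2, which is the behaviour the spectrogram construction relies on.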
S3, perform endpoint detection, specifically:
S31, extract from the structure obtained in S2 the start-point position vector ST = [st₁, st₂, …, stᵢ, …, stₙ] and the end-point position vector EN = [en₁, en₂, …, enᵢ, …, enₙ], where stᵢ is the i-th start-point position and enᵢ is the i-th end-point position; sort ST and EN in ascending order;
S32, judge whether a talk segment exists: where three horizontal line segments coexist they can be regarded as a voiceprint, and the rest is noise; numerically, when enᵢ > stᵢ₊₂, the line segment starting at the i-th point is considered to lie within a talk segment;
S33, take all line segments that are certainly within a talk segment and search 100 frames to the left and right for any further element st′ᵢ of ST; if one exists, it is also included in the talk segment and replaces the original stᵢ, and the search 100 frames to the left and right is repeated until no element of ST remains within 100 frames.
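The decision rule of S31-S33 can be prototyped as below. The merge step is our reading of the 100-frame left/right search, so treat this as a hedged sketch rather than the patent's exact procedure:

```python
def detect_talk_segments(st, en, window=100):
    """S31-S33: with ST and EN sorted ascending, segment i belongs to a
    talk segment when en[i] > st[i + 2], i.e. at least three horizontal
    line segments overlap in time; neighbours whose start point lies
    within `window` frames of an accepted start are then merged in."""
    st, en = sorted(st), sorted(en)
    in_talk = set()
    for i in range(len(st) - 2):
        if en[i] > st[i + 2]:
            in_talk.add(i)
    # grow each accepted segment by absorbing starts within +/- window frames,
    # repeating until no further starts can be absorbed
    changed = True
    while changed:
        changed = False
        for i in list(in_talk):
            for j in range(len(st)):
                if j not in in_talk and abs(st[j] - st[i]) <= window:
                    in_talk.add(j)
                    changed = True
    return sorted(in_talk)
```

Three overlapping segments are accepted and pull in their close neighbours, while an isolated far-away segment (isolated noise) is rejected.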
Further, step S14, strengthening the voice using spectral subtraction of the multi-window spectrum, comprises the following steps:
Step A, let the time sequence of the voice signal be x(n); window and frame x(n) with a Hamming window of length wlen to obtain the i-th frame voice signal xᵢ(m), whose frame length is wlen; the discrete Fourier transform of xᵢ(m) is Xᵢ(k) = Σ_{m=0}^{wlen-1} xᵢ(m)·e^(-j2πmk/wlen);
Step B, with frame i as the center, take the M frames before and after it and use the 2M+1 frames to calculate the average amplitude spectrum of each component of Xᵢ(k) from step A, |X̄ᵢ(k)| = (1/(2M+1))·Σ_{j=-M}^{M} |Xᵢ₊ⱼ(k)|, and the phase angle ∠Xᵢ(k) = arctan(Im[Xᵢ(k)]/Re[Xᵢ(k)]), where j indexes the frames before and after the center frame i, Im denotes the imaginary part and Re the real part;
Step C, apply several mutually orthogonal data windows to the same data sequence and average the resulting direct spectra to obtain the spectral estimate; the multi-window spectrum is defined as S_mt(f) = (1/L)·Σ_{w=1}^{L} S_w(f), where L is the number of data windows and S_w(f) is the direct spectrum of the w-th data window, i.e. S_w(f) = |Σ_{n=0}^{N-1} a_w(n)·x(n)·e^(-j2πfn)|², in which x(n) is the data sequence, N is the sequence length and a_w(n) is the w-th data window; the a_w(n) are a set of mutually orthogonal discrete prolate spheroidal (Slepian) sequences used to take direct spectra of the same signal sequence, satisfying the orthogonality Σ_{n=0}^{N-1} a_w(n)·a_v(n) = δ_wv between the data windows; using this definition, perform multi-window spectral estimation on the framed signal xᵢ(m) to obtain the power spectrum estimate Pᵢ(k);
Step D, smooth the multi-window power spectral density estimate, computing the smoothed power spectral density P̃ᵢ(k) = (1/(2M+1))·Σ_{j=-M}^{M} Pᵢ₊ⱼ(k); compute the mean noise power spectral density Pₙ(k) = (1/NIS)·Σ_{i=1}^{NIS} P̃ᵢ(k), where NIS is the number of frames occupied by the leading non-speech segment; and compute the gain factor gᵢ(k) = √(max(P̃ᵢ(k) - α·Pₙ(k), β·Pₙ(k)) / P̃ᵢ(k)), where α is the over-subtraction factor and β the gain compensation factor;
Step E, from the amplitude spectrum after multi-window spectral subtraction, |X̂ᵢ(k)| = gᵢ(k)·|X̄ᵢ(k)|, synthesize the enhanced speech signal x̂ᵢ(m) by inverse Fourier transform using the retained phase angle ∠Xᵢ(k). Multi-window spectral subtraction uses the leading non-speech segment to calculate the noise power; after the noise component is subtracted from the power of the whole sound, the voice signal is restored through the phase-angle relation. The over-subtraction factor determines the degree of strengthening of the signal, and the gain compensation factor determines the calculation time length.
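A dependency-free sketch of steps C-E. True discrete prolate spheroidal (Slepian) windows require an eigensolver, so we substitute orthogonal sine tapers purely for illustration; `alpha` and `beta` stand for the over-subtraction and gain-compensation factors, and all names are ours:

```python
import cmath
import math

def sine_tapers(n, num):
    """Mutually orthogonal sine tapers, used here as a stand-in for the
    Slepian windows of the patent (an assumption made to keep the sketch
    dependency-free)."""
    return [[math.sqrt(2.0 / (n + 1)) * math.sin(math.pi * (w + 1) * (m + 1) / (n + 1))
             for m in range(n)] for w in range(num)]

def multitaper_power(frame, tapers):
    """Average the direct spectra over the L orthogonal windows (step C)."""
    n = len(frame)
    powers = []
    for k in range(n):
        acc = 0.0
        for taper in tapers:
            s = sum(frame[m] * taper[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                    for m in range(n))
            acc += abs(s) ** 2
        powers.append(acc / len(tapers))
    return powers

def spectral_subtract(power, noise_power, alpha=1.0, beta=0.01):
    """Steps D-E core: subtract alpha times the noise power estimated from
    the leading non-speech frames, flooring at beta times the noise power."""
    return [max(p - alpha * q, beta * q) for p, q in zip(power, noise_power)]
```

The tapers satisfy the orthonormality that step C requires of the a_w(n), and the subtraction step reduces to a per-bin `max`.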
Further, the over-subtraction factor is selected as follows:
I, set the initial value of the over-subtraction factor to 1, and take the initial signal-to-noise ratio snr' = 0;
II, enhance the voice by multi-window spectral subtraction and calculate the signal-to-noise ratio snr of the processed signal;
III, if snr is greater than snr', proceed to the next step; if snr is less than or equal to snr', the voice in the signal is not significant, so do no processing, keep the whole voice signal and output it directly;
IV, if snr is less than 8 dB, increase the over-subtraction factor by 0.5, set snr' = snr, and repeat steps II-IV until the signal-to-noise ratio exceeds 8 dB.
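The iteration I-IV is a simple feedback loop. A sketch with hypothetical callbacks `enhance(alpha)` (runs the multi-window spectral subtraction at a given over-subtraction factor) and `snr_of(signal)`:

```python
def choose_over_subtraction(enhance, snr_of, max_alpha=10.0):
    """Steps I-IV: start at alpha = 1 and raise it in steps of 0.5 while
    the enhanced SNR keeps improving but stays below 8 dB. Returns None
    when enhancement stops helping (step III: output the signal as-is).
    `max_alpha` is our own safety cap, not part of the patent."""
    alpha, snr_prev = 1.0, 0.0
    while alpha <= max_alpha:
        out = enhance(alpha)
        snr = snr_of(out)
        if snr <= snr_prev:   # step III: enhancement no longer helps
            return None
        if snr >= 8.0:        # step IV: target SNR reached
            return alpha
        snr_prev = snr
        alpha += 0.5
    return alpha - 0.5
```

With a toy model where the SNR grows linearly in alpha, the loop stops at the first alpha whose SNR reaches 8 dB; with a flat SNR it immediately falls back to "no processing".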
The invention has the beneficial effects that:
the voice after the preprocessing has similar signal-to-noise ratio by adopting the method of the invention, and the parameters do not need to be adjusted in the subsequent steps, therefore, the method of the invention can self-adaptively select the talking section from different background noises.
Drawings
FIG. 1 is a schematic diagram of spectral subtraction improved by the multi-window spectrum.
FIG. 2 is a flow chart of a speech enhancement process.
FIG. 3 is a flow chart of the method of the present invention.
FIG. 4 is a time domain diagram of speech before speech preprocessing in embodiment 1.
FIG. 5 is a time domain diagram of speech after speech preprocessing in embodiment 1.
FIG. 6 is a spectrogram of each frame of speech in embodiment 1.
FIG. 7 is the spectrogram after gray-level processing in embodiment 1.
Fig. 8 is a horizontal line segment portion in the spectrogram after the gray level processing in embodiment 1.
Fig. 9 shows the end point detection result of the spectrogram after the gray level processing in embodiment 1.
Fig. 10 is a time domain diagram of the endpoint detection result in embodiment 1, where the left is the original speech and the right is the preprocessed speech.
Fig. 11 is a time domain diagram of speech before speech preprocessing in embodiment 2.
Fig. 12 is a time domain diagram of speech after speech preprocessing in embodiment 2.
Fig. 13 is a spectrogram of each frame of speech in embodiment 2.
FIG. 14 is the spectrogram after gray-level processing in embodiment 2.
Fig. 15 is a horizontal line segment portion in the spectrogram after the gray level processing in embodiment 2.
Fig. 16 shows the end point detection result of the spectrogram after the gray level processing in embodiment 2.
Fig. 17 is a time domain diagram of the endpoint detection result in embodiment 2, where the left is the original speech and the right is the preprocessed speech.
Detailed Description
The present invention will be described with reference to the accompanying drawings.
The method selects voiceprint features as the signature of speech. Owing to the unique physiology of human vocalisation, voiceprints can be seen in speech spectrograms. The voiceprint of the human voice has obvious characteristics: within a talk segment, energy is distributed across frequencies according to specific rules, and the spectrogram shows several horizontal, parallel stripes; these stripes are the voiceprint. Since voiceprints embody individual pronunciation characteristics and phoneme characteristics, they are widely used in speech recognition.
As shown in fig. 3, the method of the present invention comprises the following steps:
S1, perform voice preprocessing, whose aim is to ensure that the voiceprints in the resulting spectrogram have approximately the same definition, which is the precondition for effective image recognition. The specific steps are:
S11, during collection of voice signal data, defects of the test system introduce a linear or slowly varying trend error into the time sequence, so that the zero line of the voice signal deviates from the baseline, and the deviation may even change over time; this distorts the correlation function and power spectrum of the voice in later calculations. The trend error is removed by fitting a trend term with the least-squares method;
S12, normalize the amplitude;
S13, apply low-pass filtering to remove noise above 3500 Hz;
S14, strengthen the voice using spectral subtraction based on the multi-window spectrum, specifically as follows:
Step A, let the time sequence of the voice signal be x(n); window and frame x(n) with a Hamming window of length wlen to obtain the i-th frame voice signal xᵢ(m), whose frame length is wlen; the discrete Fourier transform of xᵢ(m) is Xᵢ(k) = Σ_{m=0}^{wlen-1} xᵢ(m)·e^(-j2πmk/wlen);
Step B, with frame i as the center, take the M frames before and after it and use the 2M+1 frames to calculate the average amplitude spectrum of each component of Xᵢ(k) from step A, |X̄ᵢ(k)| = (1/(2M+1))·Σ_{j=-M}^{M} |Xᵢ₊ⱼ(k)|, and the phase angle ∠Xᵢ(k) = arctan(Im[Xᵢ(k)]/Re[Xᵢ(k)]), where j indexes the frames before and after the center frame i, Im denotes the imaginary part and Re the real part;
Step C, apply several mutually orthogonal data windows to the same data sequence and average the resulting direct spectra to obtain the spectral estimate; the multi-window spectrum is defined as S_mt(f) = (1/L)·Σ_{w=1}^{L} S_w(f), where L is the number of data windows and S_w(f) is the direct spectrum of the w-th data window, i.e. S_w(f) = |Σ_{n=0}^{N-1} a_w(n)·x(n)·e^(-j2πfn)|², in which x(n) is the data sequence, N is the sequence length and a_w(n) is the w-th data window; the a_w(n) are a set of mutually orthogonal discrete prolate spheroidal (Slepian) sequences used to take direct spectra of the same signal sequence, satisfying the orthogonality Σ_{n=0}^{N-1} a_w(n)·a_v(n) = δ_wv between the data windows; using this definition, perform multi-window spectral estimation on the framed signal xᵢ(m) to obtain the power spectrum estimate Pᵢ(k);
Step D, smooth the multi-window power spectral density estimate, computing the smoothed power spectral density P̃ᵢ(k) = (1/(2M+1))·Σ_{j=-M}^{M} Pᵢ₊ⱼ(k); compute the mean noise power spectral density Pₙ(k) = (1/NIS)·Σ_{i=1}^{NIS} P̃ᵢ(k), where NIS is the number of frames occupied by the leading non-speech segment; and compute the gain factor gᵢ(k) = √(max(P̃ᵢ(k) - α·Pₙ(k), β·Pₙ(k)) / P̃ᵢ(k)), where α is the over-subtraction factor and β the gain compensation factor;
Step E, from the amplitude spectrum after multi-window spectral subtraction, |X̂ᵢ(k)| = gᵢ(k)·|X̄ᵢ(k)|, synthesize the enhanced speech signal x̂ᵢ(m) by inverse Fourier transform using the retained phase angle ∠Xᵢ(k); multi-window spectral subtraction uses the leading non-speech segment to calculate the noise power, and after the noise component is subtracted from the power of the whole sound, the voice signal is restored through the phase-angle relation; the over-subtraction factor determines the degree of strengthening of the signal, and the gain compensation factor determines the calculation time length;
the over-subtraction factor is selected as follows:
I, set the initial value of the over-subtraction factor to 1, and take the initial signal-to-noise ratio snr' = 0;
II, enhance the voice by multi-window spectral subtraction and calculate the signal-to-noise ratio snr of the processed signal;
III, if snr is greater than snr', proceed to the next step; if snr is less than or equal to snr', the voice in the signal is not significant, so do no processing, keep the whole voice signal and output it directly;
IV, if snr is less than 8 dB, increase the over-subtraction factor by 0.5, set snr' = snr, and repeat steps II-IV until the signal-to-noise ratio exceeds 8 dB;
S2, perform image recognition on the obtained spectrogram to obtain a structure containing the start points and end points of the voiceprint positions in the spectrogram. The specific steps are:
S21, frame the voice signal and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22, arrange the short-time spectra obtained in S21 in frame order to obtain the spectrogram;
S23, identify the voiceprints in the spectrogram of S22: convert the color spectrogram to a gray-scale image; extract the image edges of the gray-scale image and identify the positions of the line segments in it; form the obtained start and end points of the voiceprint positions into a structure;
S3, perform endpoint detection, specifically:
S31, extract from the structure obtained in S2 the start-point position vector ST = [st₁, st₂, …, stᵢ, …, stₙ] and the end-point position vector EN = [en₁, en₂, …, enᵢ, …, enₙ], where stᵢ is the i-th start-point position and enᵢ is the i-th end-point position; sort ST and EN in ascending order;
S32, judge whether a talk segment exists: where three horizontal line segments coexist they can be regarded as a voiceprint, and the rest is noise; numerically, when enᵢ > stᵢ₊₂, the line segment starting at the i-th point is considered to lie within a talk segment;
S33, take all line segments that are certainly within a talk segment and search 100 frames to the left and right for any further element st′ᵢ of ST; if one exists, it is also included in the talk segment and replaces the original stᵢ, and the search 100 frames to the left and right is repeated until no element of ST remains within 100 frames. This is done to prevent the endpoint-detection performance from being degraded by line segments missed during the straight-line extraction.
Embodiment 1: a typical noise background
The file is read in and its time-domain diagram drawn, as shown in FIG. 4; the time-domain diagram after voice preprocessing is shown in FIG. 5.
The voice is framed with frame length 200 and frame shift 80, so the framed data form a 200 × 2964 two-dimensional matrix. A Fourier transform is applied to the 200 samples of each column (one frame) to obtain the spectrum of each frame, giving 2964 spectra. The spectrogram is drawn with time on the horizontal axis and frequency on the vertical axis, as shown in FIG. 6; the low-frequency part (0 Hz to 3500 Hz) is taken and gray-level processed to obtain the spectrogram of FIG. 7. (FIGS. 7, 8 and 9 have been rotated 90 degrees clockwise for clarity of illustration.)
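As a quick check of the framing arithmetic in this embodiment (frame length 200, frame shift 80, 2964 frames), the implied input length is about 200 + 2963 × 80 = 237,240 samples:

```python
def num_frames(n_samples, frame_len, hop):
    # full frames obtainable with the stated frame length and frame shift
    return 1 + (n_samples - frame_len) // hop

# the 200 x 2964 frame matrix of embodiment 1 is consistent with
# roughly 237,240 input samples
print(num_frames(237240, 200, 80))
```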
In FIG. 7, white parts with parallel ripples, i.e. voiceprints, are visible; these are the speech parts. White parts without ripples are caused by strong noise. The horizontal line segments in the figure are selected, see FIG. 8.
The start and end points are stored and re-sorted by horizontal-axis position to obtain the start-point vector and the end-point vector. Where three horizontal line segments coexist they can be regarded as a voiceprint, and the rest is noise. Numerically this is enᵢ > stᵢ₊₂, i.e. the end position of the i-th segment is greater than the start position of the (i+2)-th segment, from which it is judged whether the voice contains a talk segment. To ensure that no information is missed, possible speech segments are searched to the left and right. The results are shown in FIG. 9; converted to a time-domain diagram, see FIG. 10. With the method of the invention, the speech segments are detected against a typical noise background.
Embodiment 2: a strong noise background
The procedure is the same as in embodiment 1; the experimental results are as follows:
it should be noted that, in a strong noise background, a strong noise spectrum still remains after the speech enhancement processing, as shown in fig. 14, a speech segment in the graph is a region where energy is high and parallel lines exist, and after the speech segment, due to the existence of strong noise, a noise spectrum with low energy and existing in a dotted manner remains in the speech spectrum. As shown in fig. 15, when a line segment is identified, a part of the noise spectrum is identified as a line segment, and therefore, erroneous judgment is caused at the time of end point detection. The final detection results are shown in fig. 16 to 17, and it can be seen that all the speech segments in the speech are recognized, but a part of the speech segments containing only strong noise is misjudged as speech.
Claims (3)
1. A short wave voice endpoint detection method based on image recognition is characterized by comprising the following steps:
S1, perform voice preprocessing, whose aim is to ensure that the voiceprints in the resulting spectrogram have approximately the same definition, which is the precondition for effective image recognition. The specific steps are:
S11, during collection of voice signal data, defects of the test system introduce a linear or slowly varying trend error into the time sequence, so that the zero line of the voice signal deviates from the baseline, and the deviation may even change over time; this distorts the correlation function and power spectrum of the voice in later calculations. The trend error is removed by fitting a trend term with the least-squares method;
S12, normalize the amplitude;
S13, apply low-pass filtering to remove noise above 3500 Hz;
S14, strengthen the voice using spectral subtraction based on the multi-window spectrum;
S2, perform image recognition on the obtained spectrogram to obtain a structure containing the start points and end points of the voiceprint positions in the spectrogram. The specific steps are:
S21, frame the voice signal and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22, arrange the short-time spectra obtained in S21 in frame order to obtain the spectrogram;
S23, identify the voiceprints in the spectrogram of S22: convert the color spectrogram to a gray-scale image; extract the image edges of the gray-scale image and identify the positions of the line segments in it; form the obtained start and end points of the voiceprint positions into a structure;
S3, perform endpoint detection, specifically:
S31, extract from the structure obtained in S2 the start-point position vector ST = [st₁, st₂, …, stᵢ, …, stₙ] and the end-point position vector EN = [en₁, en₂, …, enᵢ, …, enₙ], where stᵢ is the i-th start-point position and enᵢ is the i-th end-point position; sort ST and EN in ascending order;
S32, judge whether a talk segment exists: where three horizontal line segments coexist they can be regarded as a voiceprint, and the rest is noise; numerically, when enᵢ > stᵢ₊₂, the line segment starting at the i-th point is considered to lie within a talk segment;
S33, take all line segments that are certainly within a talk segment and search 100 frames to the left and right for any further element st′ᵢ of ST; if one exists, it is also included in the talk segment and replaces the original stᵢ, and the search 100 frames to the left and right is repeated until no element of ST remains within 100 frames.
2. The short-wave voice endpoint detection method based on image recognition according to claim 1, characterized in that the specific steps of enhancing the voice in S14 using spectral subtraction with a multi-window spectrum are as follows:
Step A, letting the time sequence of the voice signal be x(n), windowing and framing x(n) with a Hamming window of length wlen to obtain the i-th frame voice signal xi(m) of frame length wlen; the discrete Fourier transform of xi(m) is Xi(k) = Σ_{m=0}^{wlen−1} xi(m)·e^(−j2πmk/wlen), k = 0, 1, ..., wlen−1;
Step B, taking M frames on each side of frame i and computing over the 2M+1 frames the average amplitude spectrum of each component of Xi(k), |X̄i(k)| = (1/(2M+1))·Σ_{j=−M}^{M} |X_{i+j}(k)|, and the phase angle ∠Xi(k) = arctan(Im[Xi(k)]/Re[Xi(k)]), where j indexes the frames around the central frame i, Im denotes the imaginary part, and Re denotes the real part;
Step C, obtaining a spectrum estimate by averaging, over the same data sequence, the direct spectra of several orthogonal data windows; the multi-window spectrum is defined as S_mt(k) = (1/L)·Σ_{w=1}^{L} S_w(k), where L is the number of data windows and S_w(k) is the direct spectrum under the w-th data window, S_w(k) = |Σ_{n=0}^{N−1} a_w(n)·x(n)·e^(−j2πnk/N)|², with x(n) the data sequence, N the sequence length, and a_w(n) the w-th data window; the a_w(n) are a set of mutually orthogonal discrete prolate spheroidal sequences used to form the direct spectra of the same signal, satisfying the orthogonality condition Σ_{n=0}^{N−1} a_w(n)·a_v(n) = 0 for w ≠ v; according to this definition, multi-window spectrum estimation is performed on the framed signal xi(m), giving P_mt(k, i);
Step D, smoothing the multi-window power-spectral-density estimate: computing the smoothed power spectral density P_y(k, i) = (1/(2M+1))·Σ_{j=−M}^{M} P_mt(k, i+j); computing the mean noise power spectral density P_n(k) = (1/NIS)·Σ_{i=1}^{NIS} P_y(k, i), where NIS is the number of frames occupied by the leading non-speech segment; and computing the gain factor g(k, i) = sqrt(max(P_y(k, i) − α·P_n(k), β·P_n(k)) / P_y(k, i)), where α is the over-subtraction factor and β is the gain compensation factor;
Step E, obtaining the amplitude spectrum after multi-window spectral subtraction, |X̂i(k)| = g(k, i)·|X̄i(k)|, and synthesizing the enhanced speech signal x̂i(m) = IDFT[|X̂i(k)|·e^(j∠Xi(k))]. Multi-window spectral subtraction uses the leading non-speech segment to estimate the noise power; after the noise component is subtracted from the power of the whole utterance, the voice signal is restored through the phase-angle relation. The over-subtraction factor determines the degree of enhancement of the signal, and the gain compensation factor determines the amount of residual noise retained.
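Steps C to E can be sketched with SciPy's DPSS (Slepian) windows as below. This is a minimal sketch rather than the patented implementation: `alpha` (over-subtraction) and `floor` (gain compensation) are illustrative defaults, and the inter-frame smoothing of step D is omitted for brevity.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(frame, n_tapers=4):
    """Step C: average the direct spectra taken under mutually orthogonal
    discrete prolate spheroidal (Slepian) data windows."""
    tapers = dpss(len(frame), NW=3, Kmax=n_tapers)       # (n_tapers, N)
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return spectra.mean(axis=0)

def spectral_subtract_gain(frames, n_noise_frames, alpha=1.0, floor=0.01):
    """Steps D-E: estimate the noise PSD from the leading non-speech
    frames, over-subtract it, and return the per-bin amplitude gain."""
    psd = np.stack([multitaper_psd(f) for f in frames])
    noise = psd[:n_noise_frames].mean(axis=0)            # noise mean PSD
    clean = np.maximum(psd - alpha * noise, floor * noise)
    return np.sqrt(clean / (psd + 1e-12))

# demo: 10 frames of white noise, first 5 taken as the leading noise segment
rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 256))
gain = spectral_subtract_gain(frames, n_noise_frames=5)
```

Multiplying `gain` by the averaged amplitude spectrum and resynthesizing with the stored phase, as in step E, yields the enhanced frames.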
3. The short-wave voice endpoint detection method based on image recognition according to claim 2, characterized in that the over-subtraction factor is selected as follows:
i, setting an initial value of an over-subtraction factor to be 1, and taking an initial signal-to-noise ratio snr' to be 0;
II, performing enhancement processing on the voice by using multi-window spectral subtraction, and calculating the signal-to-noise ratio snr of the processed signal;
III, if the signal-to-noise ratio snr of the processed signal is greater than the initial signal-to-noise ratio snr ', performing the next step, if the signal-to-noise ratio snr of the processed signal is less than or equal to the initial signal-to-noise ratio snr', indicating that the voice in the signal is not significant, not processing, keeping all voice signals, and directly outputting;
and IV, if the signal-to-noise ratio snr of the processed signal is less than 8dB, increasing the over-subtraction factor by 0.5, making snr' equal to snr, and repeating the steps II to IV until the signal-to-noise ratio is greater than 8 dB.
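The iteration in steps I to IV can be sketched as below. Here `enhance` and `snr_of` are hypothetical callables standing in for the multi-window spectral subtraction and the SNR measurement, and `max_alpha` is an assumed safety bound that is not stated in the claim.

```python
def choose_over_subtraction(enhance, snr_of, target_db=8.0, max_alpha=6.0):
    """Steps I-IV: grow the over-subtraction factor in 0.5 steps until the
    enhanced signal's SNR exceeds target_db; return None when the SNR stops
    improving (the signal is then left unprocessed, as in step III)."""
    alpha, snr_prev = 1.0, 0.0
    while alpha <= max_alpha:
        snr = snr_of(enhance(alpha))
        if snr <= snr_prev:        # step III: no improvement, keep signal
            return None
        if snr > target_db:        # step IV exit condition
            return alpha
        alpha += 0.5
        snr_prev = snr
    return alpha - 0.5             # assumed safety bound reached

# demo with a toy SNR model that improves linearly with alpha
best = choose_over_subtraction(lambda a: a, lambda s: 2 * s + 1)
```

With the toy model above, the loop steps alpha from 1.0 upward until the modeled SNR first exceeds 8 dB.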
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711330638.1A CN108053842B (en) | 2017-12-13 | 2017-12-13 | Short wave voice endpoint detection method based on image recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108053842A CN108053842A (en) | 2018-05-18 |
CN108053842B true CN108053842B (en) | 2021-09-14 |
Family
ID=62132480
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711330638.1A Active CN108053842B (en) | 2017-12-13 | 2017-12-13 | Short wave voice endpoint detection method based on image recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108053842B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109346105B (en) * | 2018-07-27 | 2022-04-15 | 南京理工大学 | Pitch period spectrogram method for directly displaying pitch period track |
CN110047470A (en) * | 2019-04-11 | 2019-07-23 | 深圳市壹鸽科技有限公司 | A kind of sound end detecting method |
CN111354378B (en) * | 2020-02-12 | 2020-11-24 | 北京声智科技有限公司 | Voice endpoint detection method, device, equipment and computer storage medium |
CN111429905B (en) * | 2020-03-23 | 2024-06-07 | 北京声智科技有限公司 | Voice signal processing method and device, voice intelligent elevator, medium and equipment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1299126A (en) * | 2001-01-16 | 2001-06-13 | 北京大学 | Method for discriminating acoustic figure with base band components and sounding parameters |
CN101727905A (en) * | 2009-11-27 | 2010-06-09 | 江南大学 | Method for acquiring vocal print picture with refined time-frequency structure |
CN102884575A (en) * | 2010-04-22 | 2013-01-16 | 高通股份有限公司 | Voice activity detection |
CN104637497A (en) * | 2015-01-16 | 2015-05-20 | 南京工程学院 | Speech spectrum characteristic extracting method facing speech emotion identification |
CN105489226A (en) * | 2015-11-23 | 2016-04-13 | 湖北工业大学 | Wiener filtering speech enhancement method for multi-taper spectrum estimation of pickup |
CN105810213A (en) * | 2014-12-30 | 2016-07-27 | 浙江大华技术股份有限公司 | Typical abnormal sound detection method and device |
CN106024010A (en) * | 2016-05-19 | 2016-10-12 | 渤海大学 | Speech signal dynamic characteristic extraction method based on formant curves |
CN106531174A (en) * | 2016-11-27 | 2017-03-22 | 福州大学 | Animal sound recognition method based on wavelet packet decomposition and spectrogram features |
CN106953887A (en) * | 2017-01-05 | 2017-07-14 | 北京中瑞鸿程科技开发有限公司 | A kind of personalized Organisation recommendations method of fine granularity radio station audio content |
CN106971740A (en) * | 2017-03-28 | 2017-07-21 | 吉林大学 | Probability and the sound enhancement method of phase estimation are had based on voice |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040260540A1 (en) * | 2003-06-20 | 2004-12-23 | Tong Zhang | System and method for spectrogram analysis of an audio signal |
US20050288923A1 (en) * | 2004-06-25 | 2005-12-29 | The Hong Kong University Of Science And Technology | Speech enhancement by noise masking |
KR100789084B1 (en) * | 2006-11-21 | 2007-12-26 | 한양대학교 산학협력단 | Speech enhancement method by overweighting gain with nonlinear structure in wavelet packet transform |
CN103117066B (en) * | 2013-01-17 | 2015-04-15 | 杭州电子科技大学 | Low signal to noise ratio voice endpoint detection method based on time-frequency instaneous energy spectrum |
Non-Patent Citations (4)
Title |
---|
Voice Activity Detection Algorithm with Low Signal-to-Noise Ratios Based on Spectrum Entropy; Kun-Ching Wang et al.; 2008 Second International Symposium on Universal Communication; 2008-12-16; pp. 423-428 *
A speech enhancement algorithm based on spectrogram analysis; Xiao Chunzhi; Voice Technology; September 2012; vol. 36, no. 9; pp. 44-48 *
Research on a speech endpoint detection method based on cepstral features and voiced-sound characteristics; Sun Haiying; China Masters' Theses Full-text Database (Information Science & Technology); 2009-05-15; full text *
A speech endpoint detection algorithm based on the spectrogram; Chen Xiangmin et al.; Voice Technology; April 2006; pp. 47-49 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108053842B (en) | Short wave voice endpoint detection method based on image recognition | |
CN103854662B (en) | Adaptive voice detection method based on multiple domain Combined estimator | |
EP1547061B1 (en) | Multichannel voice detection in adverse environments | |
CN109545188A (en) | A kind of real-time voice end-point detecting method and device | |
US20020010581A1 (en) | Voice recognition device | |
EP3411876B1 (en) | Babble noise suppression | |
JPH0916194A (en) | Noise reduction for voice signal | |
CN112735456A (en) | Speech enhancement method based on DNN-CLSTM network | |
JP6272433B2 (en) | Method and apparatus for detecting pitch cycle accuracy | |
CN108597505A (en) | Audio recognition method, device and terminal device | |
Yu et al. | Effect of multi-condition training and speech enhancement methods on spoofing detection | |
CN111091833A (en) | Endpoint detection method for reducing noise influence | |
CN111312275A (en) | Online sound source separation enhancement system based on sub-band decomposition | |
EP0780828A2 (en) | Method and system for performing speech recognition | |
Morales-Cordovilla et al. | Feature extraction based on pitch-synchronous averaging for robust speech recognition | |
CN111599372B (en) | Stable on-line multi-channel voice dereverberation method and system | |
Martín-Doñas et al. | Dual-channel DNN-based speech enhancement for smartphones | |
CN111899750A (en) | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network | |
CN109102823B (en) | Speech enhancement method based on subband spectral entropy | |
Hsu et al. | Voice activity detection based on frequency modulation of harmonics | |
WO2007041789A1 (en) | Front-end processing of speech signals | |
CN112233657A (en) | Speech enhancement method based on low-frequency syllable recognition | |
CN113903344B (en) | Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction | |
CN116665681A (en) | Thunder identification method based on combined filtering | |
TWI749547B (en) | Speech enhancement system based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||