CN108053842B - Short wave voice endpoint detection method based on image recognition


Info

Publication number
CN108053842B
Authority
CN
China
Prior art keywords
voice
signal
spectrum
spectrogram
window
Prior art date
Legal status
Active
Application number
CN201711330638.1A
Other languages
Chinese (zh)
Other versions
CN108053842A (en)
Inventor
陈章鑫
杨孟文
司进修
黄际彦
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201711330638.1A priority Critical patent/CN108053842B/en
Publication of CN108053842A publication Critical patent/CN108053842A/en
Application granted granted Critical
Publication of CN108053842B publication Critical patent/CN108053842B/en


Classifications

    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G06T7/13 Image analysis; segmentation; edge detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention belongs to the field of voice detection and particularly relates to a short-wave voice endpoint detection method based on image recognition. The technical scheme of the invention is as follows: first, the data are preprocessed to improve the signal-to-noise ratio; the signal is then framed at a fixed length while a short-time Fourier transform is applied to obtain a spectrogram; finally, an image recognition method is used to locate the voiceprints in the spectrogram, and the speech segments in the data are determined from the voiceprint distribution. With this method the preprocessed voice has a similar signal-to-noise ratio regardless of input, and no parameters need to be adjusted in the subsequent steps; the method can therefore adaptively select talk segments from different background noises.

Description

Short wave voice endpoint detection method based on image recognition
Technical Field
The invention belongs to the field of voice detection, and particularly relates to a short-wave voice endpoint detection method based on image recognition.
Background
Despite the continuous emergence of new radio communication systems, short-wave radio remains widely used because of its autonomous communication capability and wide coverage. However, short-wave transmission relies on ionospheric reflection, so the received signal is noisy. Strong background noise prevents monitoring personnel from working for long periods, so noise reduction is required, together with squelching of the non-speech segments. To avoid missing speech during squelching, the performance of the voice endpoint detection method is critical.
Conventional speech processing offers many endpoint detection methods based on different features, such as detection based on the correlation function, on the cepstral distance, on the energy-to-zero-crossing ratio, and on wavelet decomposition. With parameters tuned for particular speech material, these methods can select speech segments accurately. In a changing environment that requires real-time communication, however, re-tuning the endpoint detection parameters is impractical, and the conventional methods are no longer applicable.
A speech spectrogram, "spectrogram" for short, shows how the short-time spectrum of speech varies with time and is obtained by short-time Fourier analysis of the speech. The horizontal axis of the spectrogram is time, the vertical axis is frequency, and the gray stripes represent the short-time spectra at successive instants. Because the spectrogram reflects the dynamic spectral characteristics of the speech signal, it has important practical value in speech analysis and is known as "visible speech".
Disclosure of Invention
To address the defects of the prior art, the invention provides an adaptive processing method based on the mechanism of human voice production and on the characteristic that voiceprints do not appear in a noise spectrum.
The technical scheme of the invention is as follows: first, the data are preprocessed to improve the signal-to-noise ratio; the signal is then framed at a fixed length while a short-time Fourier transform is applied to obtain a spectrogram; finally, an image recognition method is used to locate the voiceprints in the spectrogram, and the speech segments in the data are determined from the voiceprint distribution.
A short wave voice endpoint detection method based on image recognition specifically comprises the following steps:
S1. Perform voice preprocessing so that the voiceprints in the resulting spectrogram have approximately the same definition, which is the premise of effective image recognition. Specifically:
S11. During acquisition of the voice signal, the test system may introduce a linear or slowly varying trend error into the time sequence, shifting the zero line of the signal away from the baseline, possibly by an amount that itself changes with time; this distorts the correlation function and the power spectrum computed from the speech. The trend error is removed by fitting a trend term with the least-squares method;
S12. Normalize the amplitude;
S13. Apply low-pass filtering to remove noise above 3500 Hz;
S14. Enhance the speech by spectral subtraction of the multi-window spectrum. (A minimal sketch of S11-S13 appears below.)
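A minimal sketch of the S11-S13 preprocessing chain in Python, for illustration only; the polynomial degree of the trend fit, the filter order, and the sampling rate are assumptions not fixed by the text:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(x, fs, trend_deg=1, cutoff_hz=3500.0):
    """S11-S13: least-squares detrend, amplitude normalization, low-pass filter."""
    n = np.arange(len(x), dtype=float)
    # S11: fit a (linear or slowly varying) trend term by least squares, subtract it
    trend = np.polyval(np.polyfit(n, x, trend_deg), n)
    x = x - trend
    # S12: amplitude normalization
    x = x / (np.max(np.abs(x)) + 1e-12)
    # S13: low-pass filter removing noise above 3500 Hz
    b, a = butter(6, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, x)
```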
S2. Apply image recognition to the obtained spectrogram to produce a structure containing the start and end points of the voiceprint positions, specifically (a sketch of S21-S23 appears below):
S21. Frame the voice signal and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22. Arrange the short-time spectra obtained in S21 in frame order to obtain the spectrogram;
S23. Identify the voiceprints in the spectrogram of S22: convert the color spectrogram to a grayscale image; extract the image edges of the grayscale image and identify the positions of line segments in it; and form the resulting start and end points of the voiceprint positions into a structure;
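One way to realize S21-S23, sketched with scipy's STFT and OpenCV's Canny/Hough primitives; the frame length (200), frame shift (80), and all image-processing thresholds are illustrative assumptions, not values prescribed here:

```python
import numpy as np
import cv2
from scipy.signal import stft

def voiceprint_segments(x, fs, wlen=200, inc=80):
    """S21-S23: spectrogram, grayscale image, edge extraction, horizontal segments."""
    # S21/S22: short-time Fourier transform, frames of wlen samples, hop inc
    f, t, Z = stft(x, fs, nperseg=wlen, noverlap=wlen - inc)
    S = 20 * np.log10(np.abs(Z) + 1e-10)
    S = S[f <= 3500, :]                          # keep the 0-3500 Hz band
    # S23: map to an 8-bit grayscale image
    img = cv2.normalize(S, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(img, 50, 150)              # edge extraction
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=30,
                            minLineLength=20, maxLineGap=5)
    # keep near-horizontal segments: voiceprints run parallel to the time axis
    segs = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            if abs(y2 - y1) <= 2:
                segs.append((min(x1, x2), max(x1, x2)))  # (start frame, end frame)
    return segs
```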
S3. Perform endpoint detection, specifically (a sketch of S31-S33 appears below):
S31. From the structure obtained in S2, extract the start-point position vector $ST = [st_1, st_2, \ldots, st_i, \ldots, st_n]$ and the end-point position vector $EN = [en_1, en_2, \ldots, en_i, \ldots, en_n]$, where $st_i$ is the i-th start position and $en_i$ is the i-th end position. Sort ST and EN in ascending order;
S32. Judge whether a speech segment exists: where three horizontal line segments overlap they can be regarded as a voiceprint, and the rest is noise. Numerically, when $en_i > st_{i+2}$, the line segment starting at the i-th point is considered to lie in a talk segment;
S33. Take every line segment confirmed to be in a talk segment and search within 100 frames to its left and right for a further element $st'_i$ of ST; if one exists, include it in the talk segment as well and replace the original $st_i$, repeating the 100-frame search to the left and right until no element of ST remains within 100 frames.
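The decision rule of S31-S33 in sketch form; the input is assumed to be the start/end frame lists produced by the line-segment detector, and the growth and merge details are one plausible reading of the text:

```python
def detect_talk_segments(starts, ends, win=100):
    """S31-S33: pick voiceprints (>= 3 overlapping segments) and grow them."""
    # S31: sort the start and end position vectors in ascending order
    ST, EN = sorted(starts), sorted(ends)
    n = len(ST)
    # S32: segment i is in a talk segment when en_i > st_{i+2},
    # i.e. at least three horizontal line segments overlap in time
    core = [i for i in range(n - 2) if EN[i] > ST[i + 2]]
    segments = []
    for i in core:
        lo, hi = ST[i], EN[i]
        # S33: repeatedly absorb any start point within 100 frames on either side
        grown = True
        while grown:
            grown = False
            for j in range(n):
                if lo - win <= ST[j] <= hi + win and not (lo <= ST[j] <= hi):
                    lo, hi = min(lo, ST[j]), max(hi, EN[j])
                    grown = True
        segments.append((lo, hi))
    # merge overlapping grown segments into final talk segments
    segments.sort()
    merged = []
    for lo, hi in segments:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```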
Further, S14, enhancing the speech by spectral subtraction of the multi-window spectrum, comprises the following steps:
Step A. Let the time sequence of the voice signal be x(n). Window and frame x(n) with a Hamming window of length wlen to obtain the i-th frame voice signal $x_i(m)$, whose frame length is wlen; the discrete Fourier transform of $x_i(m)$ is

$$X_i(k) = \sum_{m=0}^{wlen-1} x_i(m)\, e^{-j 2\pi m k / wlen}, \qquad 0 \le k < wlen$$
Step B. Taking frame i as the center with M frames before and after it, compute over these 2M+1 frames the average amplitude spectrum of each component of $X_i(k)$ from Step A,

$$|\bar{X}_i(k)| = \frac{1}{2M+1} \sum_{j=-M}^{M} \left| X_{i+j}(k) \right|$$

and the phase angle

$$\angle X_i(k) = \arctan\!\left[ \frac{\operatorname{Im}\{X_i(k)\}}{\operatorname{Re}\{X_i(k)\}} \right]$$

where j indexes the frames around the central frame i, Im denotes the imaginary part, and Re the real part;
step C, averaging a plurality of orthogonal data windows to the same data sequence to obtain spectrum estimation, wherein a multi-window spectrum is defined as
Figure BDA0001506562280000034
Wherein L is the number of data windows, SmtAs a spectrum of a window of data w, i.e.
Figure BDA0001506562280000035
Tx (N) is the data sequence, N is the sequence length, aw(n) is the w-th data window, aw(n) is a set of mutually orthogonal discrete ellipsoid sequences for direct spectrum determination with the same signal sequence, aw(n) satisfy mutual orthogonality between multiple data windows, i.e.
Figure BDA0001506562280000036
Using the multi-window spectrum definition method to divide the frame signal xi(m) performing multi-window spectral estimation, i.e.
Figure BDA0001506562280000037
Step D. Smooth the multi-window power spectral density estimate to obtain the smoothed power spectral density

$$P_y(k, i) = \frac{1}{2M+1} \sum_{j=-M}^{M} P(k, i+j)$$

compute the mean noise power spectral density over the leading noise-only frames

$$P_n(k) = \frac{1}{NIS} \sum_{i=1}^{NIS} P_y(k, i)$$

and compute the gain factor

$$g(k, i) = \sqrt{\frac{P_y(k, i) - \alpha\, P_n(k)}{P_y(k, i)}}$$

where NIS is the number of frames occupied by the leading non-speech segment and α is the over-subtraction factor;
step E, according to the obtained amplitude spectrum after the reduction of the multi-window spectrum
Figure BDA00015065622800000311
Synthesizing an enhanced speech signal
Figure BDA00015065622800000312
Multi-window spectral subtraction uses the leading non-speech segment to compute the noise power; after the noise component is subtracted from the power of the whole recording, the speech signal is restored through the phase-angle relation. The over-subtraction factor determines the degree of enhancement of the signal, and the gain compensation factor determines the computation duration. (A sketch of Steps A-E appears below.)
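A compact sketch of Steps A-E, assuming scipy's DPSS (discrete prolate spheroidal) windows; the taper count, the NW parameter, NIS, the β floor for negative subtracted power, and the overlap-add scaling are illustrative assumptions, not values fixed by the text:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectral_subtraction(x, wlen=200, inc=80, M=1, L=4,
                                    NIS=10, alpha=2.0, beta=0.01):
    """Steps A-E: multi-window (multitaper) spectral subtraction, sketched."""
    n_frames = (len(x) - wlen) // inc + 1
    raw = np.stack([x[i*inc:i*inc + wlen] for i in range(n_frames)])
    frames = raw * np.hamming(wlen)
    # Step A: DFT of each Hamming-windowed frame
    X = np.fft.fft(frames, axis=1)
    phase = np.angle(X)
    # Step B: amplitude spectrum averaged over 2M+1 neighboring frames
    absX = np.abs(X)
    Xbar = np.stack([absX[max(0, i - M):i + M + 1].mean(axis=0)
                     for i in range(n_frames)])
    # Step C: multitaper PSD of each raw frame using L orthogonal DPSS windows
    tapers = dpss(wlen, NW=3, Kmax=L)            # (L, wlen), mutually orthogonal
    P = np.stack([np.mean(np.abs(np.fft.fft(f * tapers, axis=1))**2, axis=0)
                  for f in raw])
    # Step D: smoothed PSD, noise PSD from the leading NIS frames, gain factor
    Py = np.stack([P[max(0, i - M):i + M + 1].mean(axis=0)
                   for i in range(n_frames)])
    Pn = Py[:NIS].mean(axis=0)
    g = np.sqrt(np.maximum((Py - alpha * Pn) / Py, beta))   # beta floors negatives
    # Step E: subtracted amplitude + original phase, overlap-add resynthesis
    Xhat = g * Xbar * np.exp(1j * phase)
    y = np.zeros(len(x))
    for i, frame in enumerate(np.fft.ifft(Xhat, axis=1).real):
        y[i*inc:i*inc + wlen] += frame * inc / wlen          # crude OLA scaling
    return y
```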
Further, the over-subtraction factor is selected as follows (a sketch of this loop appears below):
I. Set the initial value of the over-subtraction factor to 1 and take the initial signal-to-noise ratio snr′ = 0;
II. Enhance the speech with multi-window spectral subtraction and compute the signal-to-noise ratio snr of the processed signal;
III. If snr is greater than snr′, proceed to the next step; if snr is less than or equal to snr′, the speech in the signal is not significant, so do not process it, keep the whole voice signal, and output it directly;
IV. If snr is less than 8 dB, increase the over-subtraction factor by 0.5, set snr′ = snr, and repeat steps II-IV until the signal-to-noise ratio exceeds 8 dB.
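The α-selection loop, reusing the multitaper_spectral_subtraction sketch above; the SNR estimator (noise power taken from the leading segment, assumed noise-only) is an assumption, since the text does not specify how snr is measured:

```python
import numpy as np

def estimate_snr(y, fs, nis_seconds=0.1):
    """Rough SNR estimate: the leading segment is assumed to be noise-only."""
    n0 = int(nis_seconds * fs)
    noise_power = np.mean(y[:n0]**2) + 1e-12
    return 10 * np.log10(np.mean(y**2) / noise_power)

def select_oversubtraction(x, fs, target_db=8.0):
    """Steps I-IV: grow alpha in 0.5 steps until the SNR exceeds 8 dB."""
    alpha, snr_prev = 1.0, 0.0                                # Step I
    while True:
        y = multitaper_spectral_subtraction(x, alpha=alpha)   # Step II
        snr = estimate_snr(y, fs)
        if snr <= snr_prev:              # Step III: speech not significant,
            return x, None               # keep the original signal unprocessed
        if snr > target_db:              # Step IV: target reached
            return y, alpha
        alpha += 0.5                     # Step IV: raise alpha and iterate
        snr_prev = snr
```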
The invention has the beneficial effects that:
the voice after the preprocessing has similar signal-to-noise ratio by adopting the method of the invention, and the parameters do not need to be adjusted in the subsequent steps, therefore, the method of the invention can self-adaptively select the talking section from different background noises.
Drawings
FIG. 1 is a schematic diagram of a multi-window spectral improvement spectral subtraction method.
FIG. 2 is a flow chart of a speech enhancement process.
FIG. 3 is a flow chart of the method of the present invention.
FIG. 4 is a time domain diagram of speech before speech preprocessing in embodiment 1.
FIG. 5 is a time domain diagram of speech after speech preprocessing in embodiment 1.
FIG. 6 is a spectrogram of each frame of speech in embodiment 1.
FIG. 7 is a spectrogram after the gray scale processing in embodiment 1.
FIG. 8 is a horizontal line segment portion in the spectrogram after the gray scale processing in embodiment 1.
FIG. 9 shows the endpoint detection result of the spectrogram after the gray scale processing in embodiment 1.
FIG. 10 is a time domain diagram of the endpoint detection result in embodiment 1, where the left is the original speech and the right is the preprocessed speech.
FIG. 11 is a time domain diagram of speech before speech preprocessing in embodiment 2.
FIG. 12 is a time domain diagram of speech after speech preprocessing in embodiment 2.
FIG. 13 is a spectrogram of each frame of speech in embodiment 2.
FIG. 14 is a spectrogram after the gray scale processing in embodiment 2.
FIG. 15 is a horizontal line segment portion in the spectrogram after the gray scale processing in embodiment 2.
FIG. 16 shows the endpoint detection result of the spectrogram after the gray scale processing in embodiment 2.
FIG. 17 is a time domain diagram of the endpoint detection result in embodiment 2, where the left is the original speech and the right is the preprocessed speech.
Detailed Description
The present invention will be described with reference to the accompanying drawings.
The method uses voiceprint characteristics as the signature of speech. Owing to the unique physiological structure of human vocal production, voiceprints are visible in speech spectrograms. The human voiceprint has distinctive features: within a speech segment, energy follows specific distribution rules across frequencies, and the spectrogram of voice shows several horizontally parallel lines, which are the voiceprints. Voiceprints reflect individual pronunciation and phoneme characteristics and are widely used in speech recognition.
As shown in fig. 3, the method of the present invention comprises the following steps:
S1. Perform voice preprocessing so that the voiceprints in the resulting spectrogram have approximately the same definition, which is the premise of effective image recognition. Specifically:
S11. During acquisition of the voice signal, the test system may introduce a linear or slowly varying trend error into the time sequence, shifting the zero line of the signal away from the baseline, possibly by an amount that itself changes with time; this distorts the correlation function and the power spectrum computed from the speech. The trend error is removed by fitting a trend term with the least-squares method;
S12. Normalize the amplitude;
S13. Apply low-pass filtering to remove noise above 3500 Hz;
S14. Enhance the speech by spectral subtraction of the multi-window spectrum, specifically:
Step A. Let the time sequence of the voice signal be x(n). Window and frame x(n) with a Hamming window of length wlen to obtain the i-th frame voice signal $x_i(m)$, whose frame length is wlen; the discrete Fourier transform of $x_i(m)$ is

$$X_i(k) = \sum_{m=0}^{wlen-1} x_i(m)\, e^{-j 2\pi m k / wlen}, \qquad 0 \le k < wlen$$
Step B. Taking frame i as the center with M frames before and after it, compute over these 2M+1 frames the average amplitude spectrum of each component of $X_i(k)$ from Step A,

$$|\bar{X}_i(k)| = \frac{1}{2M+1} \sum_{j=-M}^{M} \left| X_{i+j}(k) \right|$$

and the phase angle

$$\angle X_i(k) = \arctan\!\left[ \frac{\operatorname{Im}\{X_i(k)\}}{\operatorname{Re}\{X_i(k)\}} \right]$$

where j indexes the frames around the central frame i, Im denotes the imaginary part, and Re the real part.
Step C. Averaging the direct spectra of several orthogonal data windows applied to the same data sequence gives the spectral estimate; the multi-window spectrum is defined as

$$S_{mt}(f) = \frac{1}{L} \sum_{w=1}^{L} S_w(f)$$

where L is the number of data windows and $S_w(f)$ is the direct spectrum obtained with the w-th data window, i.e.

$$S_w(f) = \left| \sum_{n=0}^{N-1} a_w(n)\, x(n)\, e^{-j 2\pi f n} \right|^2$$

where x(n) is the data sequence, N is the sequence length, and $a_w(n)$ is the w-th data window. The $a_w(n)$ are a set of mutually orthogonal discrete prolate spheroidal sequences used to form direct spectra of the same signal sequence, satisfying the mutual orthogonality between data windows

$$\sum_{n=0}^{N-1} a_w(n)\, a_v(n) = \begin{cases} 1, & w = v \\ 0, & w \ne v \end{cases}$$

Using this multi-window spectrum definition, a multi-window spectral estimate is made for each framed signal $x_i(m)$:

$$P(k, i) = S_{mt}\{ x_i(m) \}$$
Step D. Smooth the multi-window power spectral density estimate to obtain the smoothed power spectral density

$$P_y(k, i) = \frac{1}{2M+1} \sum_{j=-M}^{M} P(k, i+j)$$

compute the mean noise power spectral density over the leading noise-only frames

$$P_n(k) = \frac{1}{NIS} \sum_{i=1}^{NIS} P_y(k, i)$$

and compute the gain factor

$$g(k, i) = \sqrt{\frac{P_y(k, i) - \alpha\, P_n(k)}{P_y(k, i)}}$$

where NIS is the number of frames occupied by the leading non-speech segment and α is the over-subtraction factor;
Step E. From the amplitude spectrum obtained after multi-window spectral subtraction,

$$|\hat{X}_i(k)| = g(k, i)\, |\bar{X}_i(k)|$$

synthesize the enhanced speech signal using the original phase:

$$\hat{x}_i(m) = \operatorname{IDFT}\!\left[\, |\hat{X}_i(k)|\, e^{\,j\angle X_i(k)} \right]$$
Multi-window spectral subtraction uses the leading non-speech segment to compute the noise power; after the noise component is subtracted from the power of the whole recording, the speech signal is restored through the phase-angle relation. The over-subtraction factor determines the degree of enhancement of the signal, and the gain compensation factor determines the computation duration;
the selection method of the over-subtraction factor comprises the following steps:
i, setting an initial value of an over-subtraction factor to be 1, and taking an initial signal-to-noise ratio snr' to be 0;
II, performing enhancement processing on the voice by using multi-window spectral subtraction, and calculating the signal-to-noise ratio snr of the processed signal;
III, if the signal-to-noise ratio snr of the processed signal is greater than the initial signal-to-noise ratio snr ', performing the next step, if the signal-to-noise ratio snr of the processed signal is less than or equal to the initial signal-to-noise ratio snr', indicating that the voice in the signal is not significant, not processing, keeping all voice signals, and directly outputting;
IV, if the signal-to-noise ratio snr of the processed signal is less than 8dB, increasing the over-subtraction factor by 0.5, making snr' equal to snr, and repeating the steps II-IV until the signal-to-noise ratio is greater than 8 dB;
S2. Apply image recognition to the obtained spectrogram to produce a structure containing the start and end points of the voiceprint positions, specifically:
S21. Frame the voice signal and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22. Arrange the short-time spectra obtained in S21 in frame order to obtain the spectrogram;
S23. Identify the voiceprints in the spectrogram of S22: convert the color spectrogram to a grayscale image; extract the image edges of the grayscale image and identify the positions of line segments in it; and form the resulting start and end points of the voiceprint positions into a structure;
S3. Perform endpoint detection, specifically:
S31. From the structure obtained in S2, extract the start-point position vector $ST = [st_1, st_2, \ldots, st_i, \ldots, st_n]$ and the end-point position vector $EN = [en_1, en_2, \ldots, en_i, \ldots, en_n]$, where $st_i$ is the i-th start position and $en_i$ is the i-th end position. Sort ST and EN in ascending order;
S32. Judge whether a speech segment exists: where three horizontal line segments overlap they can be regarded as a voiceprint, and the rest is noise. Numerically, when $en_i > st_{i+2}$, the line segment starting at the i-th point is considered to lie in a talk segment;
S33. Take every line segment confirmed to be in a talk segment and search within 100 frames to its left and right for a further element $st'_i$ of ST; if one exists, include it in the talk segment as well and replace the original $st_i$, repeating the 100-frame search to the left and right until no element of ST remains within 100 frames. This is done to prevent the degradation of endpoint detection performance that would otherwise be caused by relying on the detected straight-line segments alone.
Embodiment 1: typical noise background
A file is read in and its time-domain waveform is plotted, as shown in figure 4; the time-domain waveform after voice preprocessing is shown in figure 5.
The voice is framed with a frame length of 200 samples and a frame shift of 80, giving a two-dimensional 200 x 2964 matrix. A Fourier transform is applied to the 200 samples of each frame to obtain its spectrum, yielding 2964 spectra; the spectrogram, with time on the horizontal axis and frequency on the vertical axis, is shown in figure 6. The low-frequency part (0 Hz to 3500 Hz) is taken and converted to grayscale, giving the spectrogram of figure 7 (figures 7, 8 and 9 have been rotated 90 degrees clockwise for clarity of illustration).
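As a consistency check on these numbers (a worked example, not part of the patent; the 8 kHz sampling rate is an assumption typical of short-wave voice):

$$n_{\text{frames}} = \left\lfloor \frac{N - wlen}{inc} \right\rfloor + 1 \;\Rightarrow\; N \approx wlen + (n_{\text{frames}} - 1)\, inc = 200 + 2963 \times 80 = 237{,}240 \ \text{samples} \approx 29.7\ \text{s at } 8\ \text{kHz}$$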
In figure 7, white regions with parallel ripples are visible: these ripples are the voiceprints, i.e. the speech parts; white regions without ripples are caused by strong noise. The horizontal line segments in the figure are selected, as shown in figure 8.
The start and end points are stored and re-sorted by position along the horizontal axis to obtain the start-point and end-point vectors. Where three horizontal line segments overlap, they can be regarded as a voiceprint, and the rest is noise; numerically, $en_i > st_{i+2}$, i.e. the end position of the i-th segment is greater than the start position of the (i+2)-th segment, which is used to judge whether a talk segment is present. To ensure that no information is missed, the neighborhood of each detected segment is searched to the left and right for further speech. The results are shown in figure 9; the conversion to a time-domain view is shown in figure 10. With the method of the present invention, the speech segments are detected in a typical noise background.
Embodiment 2: strong noise background
The procedure is the same as in Embodiment 1; the experimental results are as follows:
it should be noted that, in a strong noise background, a strong noise spectrum still remains after the speech enhancement processing, as shown in fig. 14, a speech segment in the graph is a region where energy is high and parallel lines exist, and after the speech segment, due to the existence of strong noise, a noise spectrum with low energy and existing in a dotted manner remains in the speech spectrum. As shown in fig. 15, when a line segment is identified, a part of the noise spectrum is identified as a line segment, and therefore, erroneous judgment is caused at the time of end point detection. The final detection results are shown in fig. 16 to 17, and it can be seen that all the speech segments in the speech are recognized, but a part of the speech segments containing only strong noise is misjudged as speech.

Claims (3)

1. A short wave voice endpoint detection method based on image recognition is characterized by comprising the following steps:
S1. Perform voice preprocessing so that the voiceprints in the resulting spectrogram have approximately the same definition, which is the premise of effective image recognition. Specifically:
S11. During acquisition of the voice signal, the test system may introduce a linear or slowly varying trend error into the time sequence, shifting the zero line of the signal away from the baseline, possibly by an amount that itself changes with time; this distorts the correlation function and the power spectrum computed from the speech. The trend error is removed by fitting a trend term with the least-squares method;
S12. Normalize the amplitude;
S13. Apply low-pass filtering to remove noise above 3500 Hz;
S14. Enhance the speech by spectral subtraction of the multi-window spectrum;
S2. Apply image recognition to the obtained spectrogram to produce a structure containing the start and end points of the voiceprint positions, specifically:
S21. Frame the voice signal and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22. Arrange the short-time spectra obtained in S21 in frame order to obtain the spectrogram;
S23. Identify the voiceprints in the spectrogram of S22: convert the color spectrogram to a grayscale image; extract the image edges of the grayscale image and identify the positions of line segments in it; and form the resulting start and end points of the voiceprint positions into a structure;
S3. Perform endpoint detection, specifically:
S31. From the structure obtained in S2, extract the start-point position vector $ST = [st_1, st_2, \ldots, st_i, \ldots, st_n]$ and the end-point position vector $EN = [en_1, en_2, \ldots, en_i, \ldots, en_n]$, where $st_i$ is the i-th start position and $en_i$ is the i-th end position; sort ST and EN in ascending order;
S32. Judge whether a speech segment exists: where three horizontal line segments overlap they can be regarded as a voiceprint, and the rest is noise; numerically, when $en_i > st_{i+2}$, the line segment starting at the i-th point is considered to lie in a talk segment;
S33. Take every line segment confirmed to be in a talk segment and search within 100 frames to its left and right for a further element $st'_i$ of ST; if one exists, include it in the talk segment as well and replace the original $st_i$, repeating the 100-frame search to the left and right until no element of ST remains within 100 frames.
2. The short-wave voice endpoint detection method based on image recognition according to claim 1, characterized in that S14, enhancing the speech by spectral subtraction of the multi-window spectrum, comprises the following steps:
Step A. Let the time sequence of the voice signal be x(n). Window and frame x(n) with a Hamming window of length wlen to obtain the i-th frame voice signal $x_i(m)$, whose frame length is wlen; the discrete Fourier transform of $x_i(m)$ is

$$X_i(k) = \sum_{m=0}^{wlen-1} x_i(m)\, e^{-j 2\pi m k / wlen}, \qquad 0 \le k < wlen$$
Step B. Taking frame i as the center with M frames before and after it, compute over these 2M+1 frames the average amplitude spectrum of each component of $X_i(k)$ from Step A,

$$|\bar{X}_i(k)| = \frac{1}{2M+1} \sum_{j=-M}^{M} \left| X_{i+j}(k) \right|$$

and the phase angle

$$\angle X_i(k) = \arctan\!\left[ \frac{\operatorname{Im}\{X_i(k)\}}{\operatorname{Re}\{X_i(k)\}} \right]$$

where j indexes the frames around the central frame i, Im denotes the imaginary part, and Re the real part;
step C, averaging a plurality of orthogonal data windows to the same data sequence to obtain spectrum estimation, wherein a multi-window spectrum is defined as
Figure FDA0003213262830000023
Wherein L is the number of data windows, SmtAs a spectrum of a window of data w, i.e.
Figure FDA0003213262830000024
Tx (N) is a data sequence, N is a sequence length, aw (N) is a w-th data window, aw (N) is a group of discrete ellipsoid sequences which are orthogonal with each other and are used for respectively solving direct spectra with the same column signal, aw (N) satisfies the mutual orthogonality among a plurality of data windows, namely
Figure FDA0003213262830000025
The multi-window spectrum definition method performs multi-window spectrum estimation on the framed signal xi (m), that is, the multi-window spectrum definition method
Figure FDA0003213262830000026
Step D. Smooth the multi-window power spectral density estimate to obtain the smoothed power spectral density

$$P_y(k, i) = \frac{1}{2M+1} \sum_{j=-M}^{M} P(k, i+j)$$

compute the mean noise power spectral density over the leading noise-only frames

$$P_n(k) = \frac{1}{NIS} \sum_{i=1}^{NIS} P_y(k, i)$$

and compute the gain factor

$$g(k, i) = \sqrt{\frac{P_y(k, i) - \alpha\, P_n(k)}{P_y(k, i)}}$$

where NIS is the number of frames occupied by the leading non-speech segment and α is the over-subtraction factor;
step E, according to the obtained amplitude spectrum after the reduction of the multi-window spectrum
Figure FDA00032132628300000210
Synthesizing an enhanced speech signal
Figure FDA00032132628300000211
Multi-window spectral subtraction uses the leading non-speech segment to compute the noise power; after the noise component is subtracted from the power of the whole recording, the speech signal is restored through the phase-angle relation. The over-subtraction factor determines the degree of enhancement of the signal, and the gain compensation factor determines the computation duration.
3. The short-wave voice endpoint detection method based on image recognition according to claim 2, characterized in that the over-subtraction factor is selected as follows:
I. Set the initial value of the over-subtraction factor to 1 and take the initial signal-to-noise ratio snr′ = 0;
II. Enhance the speech with multi-window spectral subtraction and compute the signal-to-noise ratio snr of the processed signal;
III. If snr is greater than snr′, proceed to the next step; if snr is less than or equal to snr′, the speech in the signal is not significant, so do not process it, keep the whole voice signal, and output it directly;
IV. If snr is less than 8 dB, increase the over-subtraction factor by 0.5, set snr′ = snr, and repeat steps II-IV until the signal-to-noise ratio exceeds 8 dB.
CN201711330638.1A 2017-12-13 2017-12-13 Short wave voice endpoint detection method based on image recognition Active CN108053842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711330638.1A CN108053842B (en) 2017-12-13 2017-12-13 Short wave voice endpoint detection method based on image recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711330638.1A CN108053842B (en) 2017-12-13 2017-12-13 Short wave voice endpoint detection method based on image recognition

Publications (2)

Publication Number Publication Date
CN108053842A CN108053842A (en) 2018-05-18
CN108053842B (en) 2021-09-14

Family

ID=62132480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711330638.1A Active CN108053842B (en) 2017-12-13 2017-12-13 Short wave voice endpoint detection method based on image recognition

Country Status (1)

Country Link
CN (1) CN108053842B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346105B (en) * 2018-07-27 2022-04-15 南京理工大学 Pitch period spectrogram method for directly displaying pitch period track
CN110047470A (en) * 2019-04-11 2019-07-23 深圳市壹鸽科技有限公司 A kind of sound end detecting method
CN111354378B (en) * 2020-02-12 2020-11-24 北京声智科技有限公司 Voice endpoint detection method, device, equipment and computer storage medium
CN111429905B (en) * 2020-03-23 2024-06-07 北京声智科技有限公司 Voice signal processing method and device, voice intelligent elevator, medium and equipment


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040260540A1 (en) * 2003-06-20 2004-12-23 Tong Zhang System and method for spectrogram analysis of an audio signal
US20050288923A1 (en) * 2004-06-25 2005-12-29 The Hong Kong University Of Science And Technology Speech enhancement by noise masking
KR100789084B1 (en) * 2006-11-21 2007-12-26 한양대학교 산학협력단 Speech enhancement method by overweighting gain with nonlinear structure in wavelet packet transform
CN103117066B (en) * 2013-01-17 2015-04-15 杭州电子科技大学 Low signal to noise ratio voice endpoint detection method based on time-frequency instaneous energy spectrum

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1299126A (en) * 2001-01-16 2001-06-13 北京大学 Method for discriminating acoustic figure with base band components and sounding parameters
CN101727905A (en) * 2009-11-27 2010-06-09 江南大学 Method for acquiring vocal print picture with refined time-frequency structure
CN102884575A (en) * 2010-04-22 2013-01-16 高通股份有限公司 Voice activity detection
CN105810213A (en) * 2014-12-30 2016-07-27 浙江大华技术股份有限公司 Typical abnormal sound detection method and device
CN104637497A (en) * 2015-01-16 2015-05-20 南京工程学院 Speech spectrum characteristic extracting method facing speech emotion identification
CN105489226A (en) * 2015-11-23 2016-04-13 湖北工业大学 Wiener filtering speech enhancement method for multi-taper spectrum estimation of pickup
CN106024010A (en) * 2016-05-19 2016-10-12 渤海大学 Speech signal dynamic characteristic extraction method based on formant curves
CN106531174A (en) * 2016-11-27 2017-03-22 福州大学 Animal sound recognition method based on wavelet packet decomposition and spectrogram features
CN106953887A (en) * 2017-01-05 2017-07-14 北京中瑞鸿程科技开发有限公司 A kind of personalized Organisation recommendations method of fine granularity radio station audio content
CN106971740A (en) * 2017-03-28 2017-07-21 吉林大学 Probability and the sound enhancement method of phase estimation are had based on voice

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Voice Activity Detection Algorithm with Low Signal-to-Noise Ratios Based on Spectrum Entropy; Kun-Ching Wang et al.; 2008 Second International Symposium on Universal Communication; 2008-12-16; pp. 423-428 *
A speech enhancement algorithm based on spectrogram analysis (一种基于语谱图分析的语音增强算法); Xiao Chunzhi; 《语音技术》 (Speech Technology); Sept. 2012; vol. 36, no. 9; pp. 44-48 *
Research on speech endpoint detection methods based on cepstral features and voiced-sound characteristics (基于倒谱特征和浊音特性的语音端点检测方法的研究); Sun Haiying; China Master's Theses Full-text Database (Information Science and Technology); 2009-05-15 *
Speech endpoint detection algorithm based on the spectrogram (基于语谱图的语音端点检测算法); Chen Xiangmin et al.; 《语音技术》 (Speech Technology); April 2006; pp. 47-49 *

Also Published As

Publication number Publication date
CN108053842A (en) 2018-05-18

Similar Documents

Publication Publication Date Title
CN108053842B (en) Short wave voice endpoint detection method based on image recognition
CN103854662B (en) Adaptive voice detection method based on multiple domain Combined estimator
EP1547061B1 (en) Multichannel voice detection in adverse environments
CN109545188A (en) A kind of real-time voice end-point detecting method and device
US20020010581A1 (en) Voice recognition device
EP3411876B1 (en) Babble noise suppression
JPH0916194A (en) Noise reduction for voice signal
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
JP6272433B2 (en) Method and apparatus for detecting pitch cycle accuracy
CN108597505A (en) Audio recognition method, device and terminal device
Yu et al. Effect of multi-condition training and speech enhancement methods on spoofing detection
CN111091833A (en) Endpoint detection method for reducing noise influence
CN111312275A (en) Online sound source separation enhancement system based on sub-band decomposition
EP0780828A2 (en) Method and system for performing speech recognition
Morales-Cordovilla et al. Feature extraction based on pitch-synchronous averaging for robust speech recognition
CN111599372B (en) Stable on-line multi-channel voice dereverberation method and system
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
CN111899750A (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN109102823B (en) Speech enhancement method based on subband spectral entropy
Hsu et al. Voice activity detection based on frequency modulation of harmonics
WO2007041789A1 (en) Front-end processing of speech signals
CN112233657A (en) Speech enhancement method based on low-frequency syllable recognition
CN113903344B (en) Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction
CN116665681A (en) Thunder identification method based on combined filtering
TWI749547B (en) Speech enhancement system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant