CN108053842B - Short wave voice endpoint detection method based on image recognition


Info

Publication number
CN108053842B
Authority
CN
China
Prior art keywords
voice
signal
spectrum
spectrogram
window
Prior art date
Legal status
Active
Application number
CN201711330638.1A
Other languages
Chinese (zh)
Other versions
CN108053842A (en)
Inventor
陈章鑫
杨孟文
司进修
黄际彦
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201711330638.1A priority Critical patent/CN108053842B/en
Publication of CN108053842A publication Critical patent/CN108053842A/en
Application granted granted Critical
Publication of CN108053842B publication Critical patent/CN108053842B/en


Classifications

    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G06T7/13 Image analysis; segmentation; edge detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention belongs to the field of voice detection and particularly relates to a short-wave voice endpoint detection method based on image recognition. The technical scheme of the invention is as follows: first, the data are preprocessed to improve the signal-to-noise ratio; the signal is then framed at a fixed length while a short-time Fourier transform is applied to obtain a spectrogram; finally, an image recognition method is used to locate the voiceprints in the spectrogram, and the speech segments in the data are determined from the voiceprint distribution. With this method the preprocessed voice has a similar signal-to-noise ratio regardless of input, and no parameters need to be adjusted in the subsequent steps; the method can therefore adaptively select talk segments from different background noises.

Description

Short wave voice endpoint detection method based on image recognition
Technical Field
The invention belongs to the field of voice detection, and particularly relates to a short-wave voice endpoint detection method based on image recognition.
Background
Despite the continuous emergence of new radio communication systems, short-wave radio remains widely used because of its autonomous communication capability and wide coverage. However, short-wave transmission relies on ionospheric reflection, so the received signal is noisy. Strong background noise prevents monitoring personnel from working for long periods, so noise reduction is required, together with squelching of the non-speech segments. To avoid missing speech during squelching, the performance of the voice endpoint detection method is critical.
Conventional speech processing offers many endpoint detection methods based on different features, such as detection based on the correlation function, on the cepstral distance, on the energy-to-zero-crossing ratio, and on wavelet decomposition. With parameters tuned for particular speech material, these methods can select speech segments accurately. In a changing environment that requires real-time communication, however, re-tuning the endpoint detection parameters is impractical, and the conventional methods are no longer applicable.
A speech spectrogram, "spectrogram" for short, shows how the short-time spectrum of speech varies with time and is obtained by short-time Fourier analysis of the speech. The horizontal axis of the spectrogram is time, the vertical axis is frequency, and the gray stripes represent the short-time spectra at successive instants. Because the spectrogram reflects the dynamic spectral characteristics of the speech signal, it has important practical value in speech analysis and is known as "visible speech".
Disclosure of Invention
To address the defects of the prior art, the invention provides an adaptive processing method based on the mechanism of human voice production and on the characteristic that voiceprints do not appear in a noise spectrum.
The technical scheme of the invention is as follows: first, the data are preprocessed to improve the signal-to-noise ratio; the signal is then framed at a fixed length while a short-time Fourier transform is applied to obtain a spectrogram; finally, an image recognition method is used to locate the voiceprints in the spectrogram, and the speech segments in the data are determined from the voiceprint distribution.
A short wave voice endpoint detection method based on image recognition specifically comprises the following steps:
S1. Perform voice preprocessing so that the voiceprints in the resulting spectrogram have approximately the same definition, which is the premise of effective image recognition. Specifically:
S11. During acquisition of the voice signal, the test system may introduce a linear or slowly varying trend error into the time sequence, shifting the zero line of the signal away from the baseline, possibly by an amount that itself changes with time; this distorts the correlation function and the power spectrum computed from the speech. The trend error is removed by fitting a trend term with the least-squares method;
S12. Normalize the amplitude;
S13. Apply low-pass filtering to remove noise above 3500 Hz;
S14. Enhance the speech by spectral subtraction of the multi-window spectrum. (A minimal sketch of S11-S13 appears below.)
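A minimal sketch of the S11-S13 preprocessing chain in Python, for illustration only; the polynomial degree of the trend fit, the filter order, and the sampling rate are assumptions not fixed by the text:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(x, fs, trend_deg=1, cutoff_hz=3500.0):
    """S11-S13: least-squares detrend, amplitude normalization, low-pass filter."""
    n = np.arange(len(x), dtype=float)
    # S11: fit a (linear or slowly varying) trend term by least squares, subtract it
    trend = np.polyval(np.polyfit(n, x, trend_deg), n)
    x = x - trend
    # S12: amplitude normalization
    x = x / (np.max(np.abs(x)) + 1e-12)
    # S13: low-pass filter removing noise above 3500 Hz
    b, a = butter(6, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, x)
```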
S2. Apply image recognition to the obtained spectrogram to produce a structure containing the start and end points of the voiceprint positions, specifically (a sketch of S21-S23 appears below):
S21. Frame the voice signal and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22. Arrange the short-time spectra obtained in S21 in frame order to obtain the spectrogram;
S23. Identify the voiceprints in the spectrogram of S22: convert the color spectrogram to a grayscale image; extract the image edges of the grayscale image and identify the positions of line segments in it; and form the resulting start and end points of the voiceprint positions into a structure;
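One way to realize S21-S23, sketched with scipy's STFT and OpenCV's Canny/Hough primitives; the frame length (200), frame shift (80), and all image-processing thresholds are illustrative assumptions, not values prescribed here:

```python
import numpy as np
import cv2
from scipy.signal import stft

def voiceprint_segments(x, fs, wlen=200, inc=80):
    """S21-S23: spectrogram, grayscale image, edge extraction, horizontal segments."""
    # S21/S22: short-time Fourier transform, frames of wlen samples, hop inc
    f, t, Z = stft(x, fs, nperseg=wlen, noverlap=wlen - inc)
    S = 20 * np.log10(np.abs(Z) + 1e-10)
    S = S[f <= 3500, :]                          # keep the 0-3500 Hz band
    # S23: map to an 8-bit grayscale image
    img = cv2.normalize(S, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(img, 50, 150)              # edge extraction
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=30,
                            minLineLength=20, maxLineGap=5)
    # keep near-horizontal segments: voiceprints run parallel to the time axis
    segs = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            if abs(y2 - y1) <= 2:
                segs.append((min(x1, x2), max(x1, x2)))  # (start frame, end frame)
    return segs
```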
S3. Perform endpoint detection, specifically (a sketch of S31-S33 appears below):
S31. From the structure obtained in S2, extract the start-point position vector $ST = [st_1, st_2, \ldots, st_i, \ldots, st_n]$ and the end-point position vector $EN = [en_1, en_2, \ldots, en_i, \ldots, en_n]$, where $st_i$ is the i-th start position and $en_i$ is the i-th end position. Sort ST and EN in ascending order;
S32. Judge whether a speech segment exists: where three horizontal line segments overlap they can be regarded as a voiceprint, and the rest is noise. Numerically, when $en_i > st_{i+2}$, the line segment starting at the i-th point is considered to lie in a talk segment;
S33. Take every line segment confirmed to be in a talk segment and search within 100 frames to its left and right for a further element $st'_i$ of ST; if one exists, include it in the talk segment as well and replace the original $st_i$, repeating the 100-frame search to the left and right until no element of ST remains within 100 frames.
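The decision rule of S31-S33 in sketch form; the input is assumed to be the start/end frame lists produced by the line-segment detector, and the growth and merge details are one plausible reading of the text:

```python
def detect_talk_segments(starts, ends, win=100):
    """S31-S33: pick voiceprints (>= 3 overlapping segments) and grow them."""
    # S31: sort the start and end position vectors in ascending order
    ST, EN = sorted(starts), sorted(ends)
    n = len(ST)
    # S32: segment i is in a talk segment when en_i > st_{i+2},
    # i.e. at least three horizontal line segments overlap in time
    core = [i for i in range(n - 2) if EN[i] > ST[i + 2]]
    segments = []
    for i in core:
        lo, hi = ST[i], EN[i]
        # S33: repeatedly absorb any start point within 100 frames on either side
        grown = True
        while grown:
            grown = False
            for j in range(n):
                if lo - win <= ST[j] <= hi + win and not (lo <= ST[j] <= hi):
                    lo, hi = min(lo, ST[j]), max(hi, EN[j])
                    grown = True
        segments.append((lo, hi))
    # merge overlapping grown segments into final talk segments
    segments.sort()
    merged = []
    for lo, hi in segments:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```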
Further, S14, enhancing the speech by spectral subtraction of the multi-window spectrum, comprises the following steps:
Step A. Let the time sequence of the voice signal be x(n). Window and frame x(n) with a Hamming window of length wlen to obtain the i-th frame voice signal $x_i(m)$, whose frame length is wlen; the discrete Fourier transform of $x_i(m)$ is

$$X_i(k) = \sum_{m=0}^{wlen-1} x_i(m)\, e^{-j 2\pi m k / wlen}, \qquad 0 \le k < wlen$$
Step B. Taking frame i as the center with M frames before and after it, compute over these 2M+1 frames the average amplitude spectrum of each component of $X_i(k)$ from Step A,

$$|\bar{X}_i(k)| = \frac{1}{2M+1} \sum_{j=-M}^{M} \left| X_{i+j}(k) \right|$$

and the phase angle

$$\angle X_i(k) = \arctan\!\left[ \frac{\operatorname{Im}\{X_i(k)\}}{\operatorname{Re}\{X_i(k)\}} \right]$$

where j indexes the frames around the central frame i, Im denotes the imaginary part, and Re the real part;
step C, averaging a plurality of orthogonal data windows to the same data sequence to obtain spectrum estimation, wherein a multi-window spectrum is defined as
Figure BDA0001506562280000034
Wherein L is the number of data windows, SmtAs a spectrum of a window of data w, i.e.
Figure BDA0001506562280000035
Tx (N) is the data sequence, N is the sequence length, aw(n) is the w-th data window, aw(n) is a set of mutually orthogonal discrete ellipsoid sequences for direct spectrum determination with the same signal sequence, aw(n) satisfy mutual orthogonality between multiple data windows, i.e.
Figure BDA0001506562280000036
Using the multi-window spectrum definition method to divide the frame signal xi(m) performing multi-window spectral estimation, i.e.
Figure BDA0001506562280000037
Step D. Smooth the multi-window power spectral density estimate to obtain the smoothed power spectral density

$$P_y(k, i) = \frac{1}{2M+1} \sum_{j=-M}^{M} P(k, i+j)$$

compute the mean noise power spectral density over the leading noise-only frames

$$P_n(k) = \frac{1}{NIS} \sum_{i=1}^{NIS} P_y(k, i)$$

and compute the gain factor

$$g(k, i) = \sqrt{\frac{P_y(k, i) - \alpha\, P_n(k)}{P_y(k, i)}}$$

where NIS is the number of frames occupied by the leading non-speech segment and α is the over-subtraction factor;
step E, according to the obtained amplitude spectrum after the reduction of the multi-window spectrum
Figure BDA00015065622800000311
Synthesizing an enhanced speech signal
Figure BDA00015065622800000312
Multi-window spectral subtraction uses the leading non-speech segment to compute the noise power; after the noise component is subtracted from the power of the whole recording, the speech signal is restored through the phase-angle relation. The over-subtraction factor determines the degree of enhancement of the signal, and the gain compensation factor determines the computation duration. (A sketch of Steps A-E appears below.)
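A compact sketch of Steps A-E, assuming scipy's DPSS (discrete prolate spheroidal) windows; the taper count, the NW parameter, NIS, the β floor for negative subtracted power, and the overlap-add scaling are illustrative assumptions, not values fixed by the text:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectral_subtraction(x, wlen=200, inc=80, M=1, L=4,
                                    NIS=10, alpha=2.0, beta=0.01):
    """Steps A-E: multi-window (multitaper) spectral subtraction, sketched."""
    n_frames = (len(x) - wlen) // inc + 1
    raw = np.stack([x[i*inc:i*inc + wlen] for i in range(n_frames)])
    frames = raw * np.hamming(wlen)
    # Step A: DFT of each Hamming-windowed frame
    X = np.fft.fft(frames, axis=1)
    phase = np.angle(X)
    # Step B: amplitude spectrum averaged over 2M+1 neighboring frames
    absX = np.abs(X)
    Xbar = np.stack([absX[max(0, i - M):i + M + 1].mean(axis=0)
                     for i in range(n_frames)])
    # Step C: multitaper PSD of each raw frame using L orthogonal DPSS windows
    tapers = dpss(wlen, NW=3, Kmax=L)            # (L, wlen), mutually orthogonal
    P = np.stack([np.mean(np.abs(np.fft.fft(f * tapers, axis=1))**2, axis=0)
                  for f in raw])
    # Step D: smoothed PSD, noise PSD from the leading NIS frames, gain factor
    Py = np.stack([P[max(0, i - M):i + M + 1].mean(axis=0)
                   for i in range(n_frames)])
    Pn = Py[:NIS].mean(axis=0)
    g = np.sqrt(np.maximum((Py - alpha * Pn) / Py, beta))   # beta floors negatives
    # Step E: subtracted amplitude + original phase, overlap-add resynthesis
    Xhat = g * Xbar * np.exp(1j * phase)
    y = np.zeros(len(x))
    for i, frame in enumerate(np.fft.ifft(Xhat, axis=1).real):
        y[i*inc:i*inc + wlen] += frame * inc / wlen          # crude OLA scaling
    return y
```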
Further, the over-subtraction factor is selected as follows (a sketch of this loop appears below):
I. Set the initial value of the over-subtraction factor to 1 and take the initial signal-to-noise ratio snr′ = 0;
II. Enhance the speech with multi-window spectral subtraction and compute the signal-to-noise ratio snr of the processed signal;
III. If snr is greater than snr′, proceed to the next step; if snr is less than or equal to snr′, the speech in the signal is not significant, so do not process it, keep the whole voice signal, and output it directly;
IV. If snr is less than 8 dB, increase the over-subtraction factor by 0.5, set snr′ = snr, and repeat steps II-IV until the signal-to-noise ratio exceeds 8 dB.
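The α-selection loop, reusing the multitaper_spectral_subtraction sketch above; the SNR estimator (noise power taken from the leading segment, assumed noise-only) is an assumption, since the text does not specify how snr is measured:

```python
import numpy as np

def estimate_snr(y, fs, nis_seconds=0.1):
    """Rough SNR estimate: the leading segment is assumed to be noise-only."""
    n0 = int(nis_seconds * fs)
    noise_power = np.mean(y[:n0]**2) + 1e-12
    return 10 * np.log10(np.mean(y**2) / noise_power)

def select_oversubtraction(x, fs, target_db=8.0):
    """Steps I-IV: grow alpha in 0.5 steps until the SNR exceeds 8 dB."""
    alpha, snr_prev = 1.0, 0.0                                # Step I
    while True:
        y = multitaper_spectral_subtraction(x, alpha=alpha)   # Step II
        snr = estimate_snr(y, fs)
        if snr <= snr_prev:              # Step III: speech not significant,
            return x, None               # keep the original signal unprocessed
        if snr > target_db:              # Step IV: target reached
            return y, alpha
        alpha += 0.5                     # Step IV: raise alpha and iterate
        snr_prev = snr
```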
The invention has the beneficial effects that:
the voice after the preprocessing has similar signal-to-noise ratio by adopting the method of the invention, and the parameters do not need to be adjusted in the subsequent steps, therefore, the method of the invention can self-adaptively select the talking section from different background noises.
Drawings
FIG. 1 is a schematic diagram of a multi-window spectral improvement spectral subtraction method.
FIG. 2 is a flow chart of a speech enhancement process.
FIG. 3 is a flow chart of the method of the present invention.
FIG. 4 is a time domain diagram of speech before speech preprocessing in embodiment 1.
FIG. 5 is a time domain diagram of speech after speech preprocessing in embodiment 1.
FIG. 6 is a spectrogram of each frame of speech in embodiment 1.
FIG. 7 is a spectrogram after the gray scale processing in embodiment 1.
FIG. 8 is a horizontal line segment portion in the spectrogram after the gray scale processing in embodiment 1.
FIG. 9 shows the endpoint detection result of the spectrogram after the gray scale processing in embodiment 1.
FIG. 10 is a time domain diagram of the endpoint detection result in embodiment 1, where the left is the original speech and the right is the preprocessed speech.
FIG. 11 is a time domain diagram of speech before speech preprocessing in embodiment 2.
FIG. 12 is a time domain diagram of speech after speech preprocessing in embodiment 2.
FIG. 13 is a spectrogram of each frame of speech in embodiment 2.
FIG. 14 is a spectrogram after the gray scale processing in embodiment 2.
FIG. 15 is a horizontal line segment portion in the spectrogram after the gray scale processing in embodiment 2.
FIG. 16 shows the endpoint detection result of the spectrogram after the gray scale processing in embodiment 2.
FIG. 17 is a time domain diagram of the endpoint detection result in embodiment 2, where the left is the original speech and the right is the preprocessed speech.
Detailed Description
The present invention will be described with reference to the accompanying drawings.
The method uses voiceprint characteristics as the signature of speech. Owing to the unique physiological structure of human vocal production, voiceprints are visible in speech spectrograms. The human voiceprint has distinctive features: within a speech segment, energy follows specific distribution rules across frequencies, and the spectrogram of voice shows several horizontally parallel lines, which are the voiceprints. Voiceprints reflect individual pronunciation and phoneme characteristics and are widely used in speech recognition.
As shown in fig. 3, the method of the present invention comprises the following steps:
S1. Perform voice preprocessing so that the voiceprints in the resulting spectrogram have approximately the same definition, which is the premise of effective image recognition. Specifically:
S11. During acquisition of the voice signal, the test system may introduce a linear or slowly varying trend error into the time sequence, shifting the zero line of the signal away from the baseline, possibly by an amount that itself changes with time; this distorts the correlation function and the power spectrum computed from the speech. The trend error is removed by fitting a trend term with the least-squares method;
S12. Normalize the amplitude;
S13. Apply low-pass filtering to remove noise above 3500 Hz;
S14. Enhance the speech by spectral subtraction of the multi-window spectrum, specifically:
Step A. Let the time sequence of the voice signal be x(n). Window and frame x(n) with a Hamming window of length wlen to obtain the i-th frame voice signal $x_i(m)$, whose frame length is wlen; the discrete Fourier transform of $x_i(m)$ is

$$X_i(k) = \sum_{m=0}^{wlen-1} x_i(m)\, e^{-j 2\pi m k / wlen}, \qquad 0 \le k < wlen$$
Step B. Taking frame i as the center with M frames before and after it, compute over these 2M+1 frames the average amplitude spectrum of each component of $X_i(k)$ from Step A,

$$|\bar{X}_i(k)| = \frac{1}{2M+1} \sum_{j=-M}^{M} \left| X_{i+j}(k) \right|$$

and the phase angle

$$\angle X_i(k) = \arctan\!\left[ \frac{\operatorname{Im}\{X_i(k)\}}{\operatorname{Re}\{X_i(k)\}} \right]$$

where j indexes the frames around the central frame i, Im denotes the imaginary part, and Re the real part.
Step C. Averaging the direct spectra of several orthogonal data windows applied to the same data sequence gives the spectral estimate; the multi-window spectrum is defined as

$$S_{mt}(f) = \frac{1}{L} \sum_{w=1}^{L} S_w(f)$$

where L is the number of data windows and $S_w(f)$ is the direct spectrum obtained with the w-th data window, i.e.

$$S_w(f) = \left| \sum_{n=0}^{N-1} a_w(n)\, x(n)\, e^{-j 2\pi f n} \right|^2$$

where x(n) is the data sequence, N is the sequence length, and $a_w(n)$ is the w-th data window. The $a_w(n)$ are a set of mutually orthogonal discrete prolate spheroidal sequences used to form direct spectra of the same signal sequence, satisfying the mutual orthogonality between data windows

$$\sum_{n=0}^{N-1} a_w(n)\, a_v(n) = \begin{cases} 1, & w = v \\ 0, & w \ne v \end{cases}$$

Using this multi-window spectrum definition, a multi-window spectral estimate is made for each framed signal $x_i(m)$:

$$P(k, i) = S_{mt}\{ x_i(m) \}$$
Step D. Smooth the multi-window power spectral density estimate to obtain the smoothed power spectral density

$$P_y(k, i) = \frac{1}{2M+1} \sum_{j=-M}^{M} P(k, i+j)$$

compute the mean noise power spectral density over the leading noise-only frames

$$P_n(k) = \frac{1}{NIS} \sum_{i=1}^{NIS} P_y(k, i)$$

and compute the gain factor

$$g(k, i) = \sqrt{\frac{P_y(k, i) - \alpha\, P_n(k)}{P_y(k, i)}}$$

where NIS is the number of frames occupied by the leading non-speech segment and α is the over-subtraction factor;
Step E. From the amplitude spectrum obtained after multi-window spectral subtraction,

$$|\hat{X}_i(k)| = g(k, i)\, |\bar{X}_i(k)|$$

synthesize the enhanced speech signal using the original phase:

$$\hat{x}_i(m) = \operatorname{IDFT}\!\left[\, |\hat{X}_i(k)|\, e^{\,j\angle X_i(k)} \right]$$
Multi-window spectral subtraction uses the leading non-speech segment to compute the noise power; after the noise component is subtracted from the power of the whole recording, the speech signal is restored through the phase-angle relation. The over-subtraction factor determines the degree of enhancement of the signal, and the gain compensation factor determines the computation duration;
the selection method of the over-subtraction factor comprises the following steps:
i, setting an initial value of an over-subtraction factor to be 1, and taking an initial signal-to-noise ratio snr' to be 0;
II, performing enhancement processing on the voice by using multi-window spectral subtraction, and calculating the signal-to-noise ratio snr of the processed signal;
III, if the signal-to-noise ratio snr of the processed signal is greater than the initial signal-to-noise ratio snr ', performing the next step, if the signal-to-noise ratio snr of the processed signal is less than or equal to the initial signal-to-noise ratio snr', indicating that the voice in the signal is not significant, not processing, keeping all voice signals, and directly outputting;
IV, if the signal-to-noise ratio snr of the processed signal is less than 8dB, increasing the over-subtraction factor by 0.5, making snr' equal to snr, and repeating the steps II-IV until the signal-to-noise ratio is greater than 8 dB;
S2. Apply image recognition to the obtained spectrogram to produce a structure containing the start and end points of the voiceprint positions, specifically:
S21. Frame the voice signal and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22. Arrange the short-time spectra obtained in S21 in frame order to obtain the spectrogram;
S23. Identify the voiceprints in the spectrogram of S22: convert the color spectrogram to a grayscale image; extract the image edges of the grayscale image and identify the positions of line segments in it; and form the resulting start and end points of the voiceprint positions into a structure;
S3. Perform endpoint detection, specifically:
S31. From the structure obtained in S2, extract the start-point position vector $ST = [st_1, st_2, \ldots, st_i, \ldots, st_n]$ and the end-point position vector $EN = [en_1, en_2, \ldots, en_i, \ldots, en_n]$, where $st_i$ is the i-th start position and $en_i$ is the i-th end position. Sort ST and EN in ascending order;
S32. Judge whether a speech segment exists: where three horizontal line segments overlap they can be regarded as a voiceprint, and the rest is noise. Numerically, when $en_i > st_{i+2}$, the line segment starting at the i-th point is considered to lie in a talk segment;
S33. Take every line segment confirmed to be in a talk segment and search within 100 frames to its left and right for a further element $st'_i$ of ST; if one exists, include it in the talk segment as well and replace the original $st_i$, repeating the 100-frame search to the left and right until no element of ST remains within 100 frames. This is done to prevent the degradation of endpoint detection performance that would otherwise be caused by relying on the detected straight-line segments alone.
Embodiment 1: typical noise background
A file is read in and its time-domain waveform is plotted, as shown in figure 4; the time-domain waveform after voice preprocessing is shown in figure 5.
The voice is framed with a frame length of 200 samples and a frame shift of 80, giving a two-dimensional 200 x 2964 matrix. A Fourier transform is applied to the 200 samples of each frame to obtain its spectrum, yielding 2964 spectra; the spectrogram, with time on the horizontal axis and frequency on the vertical axis, is shown in figure 6. The low-frequency part (0 Hz to 3500 Hz) is taken and converted to grayscale, giving the spectrogram of figure 7 (figures 7, 8 and 9 have been rotated 90 degrees clockwise for clarity of illustration).
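As a consistency check on these numbers (a worked example, not part of the patent; the 8 kHz sampling rate is an assumption typical of short-wave voice):

$$n_{\text{frames}} = \left\lfloor \frac{N - wlen}{inc} \right\rfloor + 1 \;\Rightarrow\; N \approx wlen + (n_{\text{frames}} - 1)\, inc = 200 + 2963 \times 80 = 237{,}240 \ \text{samples} \approx 29.7\ \text{s at } 8\ \text{kHz}$$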
In figure 7, white regions with parallel ripples are visible: these ripples are the voiceprints, i.e. the speech parts; white regions without ripples are caused by strong noise. The horizontal line segments in the figure are selected, as shown in figure 8.
The start and end points are stored and re-sorted by position along the horizontal axis to obtain the start-point and end-point vectors. Where three horizontal line segments overlap, they can be regarded as a voiceprint, and the rest is noise; numerically, $en_i > st_{i+2}$, i.e. the end position of the i-th segment is greater than the start position of the (i+2)-th segment, which is used to judge whether a talk segment is present. To ensure that no information is missed, the neighborhood of each detected segment is searched to the left and right for further speech. The results are shown in figure 9; the conversion to a time-domain view is shown in figure 10. With the method of the present invention, the speech segments are detected in a typical noise background.
Embodiment 2: strong noise background
The procedure is the same as in Embodiment 1; the experimental results are as follows:
it should be noted that, in a strong noise background, a strong noise spectrum still remains after the speech enhancement processing, as shown in fig. 14, a speech segment in the graph is a region where energy is high and parallel lines exist, and after the speech segment, due to the existence of strong noise, a noise spectrum with low energy and existing in a dotted manner remains in the speech spectrum. As shown in fig. 15, when a line segment is identified, a part of the noise spectrum is identified as a line segment, and therefore, erroneous judgment is caused at the time of end point detection. The final detection results are shown in fig. 16 to 17, and it can be seen that all the speech segments in the speech are recognized, but a part of the speech segments containing only strong noise is misjudged as speech.

Claims (3)

1. A short wave voice endpoint detection method based on image recognition is characterized by comprising the following steps:
S1. Perform voice preprocessing so that the voiceprints in the resulting spectrogram have approximately the same definition, which is the premise of effective image recognition. Specifically:
S11. During acquisition of the voice signal, the test system may introduce a linear or slowly varying trend error into the time sequence, shifting the zero line of the signal away from the baseline, possibly by an amount that itself changes with time; this distorts the correlation function and the power spectrum computed from the speech. The trend error is removed by fitting a trend term with the least-squares method;
S12. Normalize the amplitude;
S13. Apply low-pass filtering to remove noise above 3500 Hz;
S14. Enhance the speech by spectral subtraction of the multi-window spectrum;
S2. Apply image recognition to the obtained spectrogram to produce a structure containing the start and end points of the voiceprint positions, specifically:
S21. Frame the voice signal and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22. Arrange the short-time spectra obtained in S21 in frame order to obtain the spectrogram;
S23. Identify the voiceprints in the spectrogram of S22: convert the color spectrogram to a grayscale image; extract the image edges of the grayscale image and identify the positions of line segments in it; and form the resulting start and end points of the voiceprint positions into a structure;
S3. Perform endpoint detection, specifically:
S31. From the structure obtained in S2, extract the start-point position vector $ST = [st_1, st_2, \ldots, st_i, \ldots, st_n]$ and the end-point position vector $EN = [en_1, en_2, \ldots, en_i, \ldots, en_n]$, where $st_i$ is the i-th start position and $en_i$ is the i-th end position; sort ST and EN in ascending order;
S32. Judge whether a speech segment exists: where three horizontal line segments overlap they can be regarded as a voiceprint, and the rest is noise; numerically, when $en_i > st_{i+2}$, the line segment starting at the i-th point is considered to lie in a talk segment;
S33. Take every line segment confirmed to be in a talk segment and search within 100 frames to its left and right for a further element $st'_i$ of ST; if one exists, include it in the talk segment as well and replace the original $st_i$, repeating the 100-frame search to the left and right until no element of ST remains within 100 frames.
2. The short-wave voice endpoint detection method based on image recognition according to claim 1, characterized in that S14, enhancing the speech by spectral subtraction of the multi-window spectrum, comprises the following steps:
Step A. Let the time sequence of the voice signal be x(n). Window and frame x(n) with a Hamming window of length wlen to obtain the i-th frame voice signal $x_i(m)$, whose frame length is wlen; the discrete Fourier transform of $x_i(m)$ is

$$X_i(k) = \sum_{m=0}^{wlen-1} x_i(m)\, e^{-j 2\pi m k / wlen}, \qquad 0 \le k < wlen$$
Step B. Taking frame i as the center with M frames before and after it, compute over these 2M+1 frames the average amplitude spectrum of each component of $X_i(k)$ from Step A,

$$|\bar{X}_i(k)| = \frac{1}{2M+1} \sum_{j=-M}^{M} \left| X_{i+j}(k) \right|$$

and the phase angle

$$\angle X_i(k) = \arctan\!\left[ \frac{\operatorname{Im}\{X_i(k)\}}{\operatorname{Re}\{X_i(k)\}} \right]$$

where j indexes the frames around the central frame i, Im denotes the imaginary part, and Re the real part;
step C, averaging a plurality of orthogonal data windows to the same data sequence to obtain spectrum estimation, wherein a multi-window spectrum is defined as
Figure FDA0003213262830000023
Wherein L is the number of data windows, SmtAs a spectrum of a window of data w, i.e.
Figure FDA0003213262830000024
Tx (N) is a data sequence, N is a sequence length, aw (N) is a w-th data window, aw (N) is a group of discrete ellipsoid sequences which are orthogonal with each other and are used for respectively solving direct spectra with the same column signal, aw (N) satisfies the mutual orthogonality among a plurality of data windows, namely
Figure FDA0003213262830000025
The multi-window spectrum definition method performs multi-window spectrum estimation on the framed signal xi (m), that is, the multi-window spectrum definition method
Figure FDA0003213262830000026
Step D. Smooth the multi-window power spectral density estimate to obtain the smoothed power spectral density

$$P_y(k, i) = \frac{1}{2M+1} \sum_{j=-M}^{M} P(k, i+j)$$

compute the mean noise power spectral density over the leading noise-only frames

$$P_n(k) = \frac{1}{NIS} \sum_{i=1}^{NIS} P_y(k, i)$$

and compute the gain factor

$$g(k, i) = \sqrt{\frac{P_y(k, i) - \alpha\, P_n(k)}{P_y(k, i)}}$$

where NIS is the number of frames occupied by the leading non-speech segment and α is the over-subtraction factor;
step E, according to the obtained amplitude spectrum after the reduction of the multi-window spectrum
Figure FDA00032132628300000210
Synthesizing an enhanced speech signal
Figure FDA00032132628300000211
Multi-window spectral subtraction uses the leading non-speech segment to compute the noise power; after the noise component is subtracted from the power of the whole recording, the speech signal is restored through the phase-angle relation. The over-subtraction factor determines the degree of enhancement of the signal, and the gain compensation factor determines the computation duration.
3. The short-wave voice endpoint detection method based on image recognition according to claim 2, characterized in that the over-subtraction factor is selected as follows:
I. Set the initial value of the over-subtraction factor to 1 and take the initial signal-to-noise ratio snr′ = 0;
II. Enhance the speech with multi-window spectral subtraction and compute the signal-to-noise ratio snr of the processed signal;
III. If snr is greater than snr′, proceed to the next step; if snr is less than or equal to snr′, the speech in the signal is not significant, so do not process it, keep the whole voice signal, and output it directly;
IV. If snr is less than 8 dB, increase the over-subtraction factor by 0.5, set snr′ = snr, and repeat steps II-IV until the signal-to-noise ratio exceeds 8 dB.
CN201711330638.1A 2017-12-13 2017-12-13 Short wave voice endpoint detection method based on image recognition Active CN108053842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711330638.1A CN108053842B (en) 2017-12-13 2017-12-13 Short wave voice endpoint detection method based on image recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711330638.1A CN108053842B (en) 2017-12-13 2017-12-13 Short wave voice endpoint detection method based on image recognition

Publications (2)

Publication Number Publication Date
CN108053842A CN108053842A (en) 2018-05-18
CN108053842B (en) 2021-09-14

Family

ID=62132480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711330638.1A Active CN108053842B (en) 2017-12-13 2017-12-13 Short wave voice endpoint detection method based on image recognition

Country Status (1)

Country Link
CN (1) CN108053842B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346105B (en) * 2018-07-27 2022-04-15 南京理工大学 Pitch period spectrogram method for directly displaying pitch period track
CN110047470A (en) * 2019-04-11 2019-07-23 深圳市壹鸽科技有限公司 A kind of sound end detecting method
CN111354378B (en) * 2020-02-12 2020-11-24 北京声智科技有限公司 Voice endpoint detection method, device, equipment and computer storage medium
CN111429905B (en) * 2020-03-23 2024-06-07 北京声智科技有限公司 Voice signal processing method and device, voice intelligent elevator, medium and equipment


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040260540A1 (en) * 2003-06-20 2004-12-23 Tong Zhang System and method for spectrogram analysis of an audio signal
US20050288923A1 (en) * 2004-06-25 2005-12-29 The Hong Kong University Of Science And Technology Speech enhancement by noise masking
KR100789084B1 (en) * 2006-11-21 2007-12-26 한양대학교 산학협력단 Speech enhancement method by overweighting gain with nonlinear structure in wavelet packet transform
CN103117066B (en) * 2013-01-17 2015-04-15 杭州电子科技大学 Low signal to noise ratio voice endpoint detection method based on time-frequency instaneous energy spectrum

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1299126A (en) * 2001-01-16 2001-06-13 北京大学 Method for discriminating acoustic figure with base band components and sounding parameters
CN101727905A (en) * 2009-11-27 2010-06-09 江南大学 Method for acquiring vocal print picture with refined time-frequency structure
CN102884575A (en) * 2010-04-22 2013-01-16 高通股份有限公司 Voice activity detection
CN105810213A (en) * 2014-12-30 2016-07-27 浙江大华技术股份有限公司 Typical abnormal sound detection method and device
CN104637497A (en) * 2015-01-16 2015-05-20 南京工程学院 Speech spectrum characteristic extracting method facing speech emotion identification
CN105489226A (en) * 2015-11-23 2016-04-13 湖北工业大学 Wiener filtering speech enhancement method for multi-taper spectrum estimation of pickup
CN106024010A (en) * 2016-05-19 2016-10-12 渤海大学 Speech signal dynamic characteristic extraction method based on formant curves
CN106531174A (en) * 2016-11-27 2017-03-22 福州大学 Animal sound recognition method based on wavelet packet decomposition and spectrogram features
CN106953887A (en) * 2017-01-05 2017-07-14 北京中瑞鸿程科技开发有限公司 A kind of personalized Organisation recommendations method of fine granularity radio station audio content
CN106971740A (en) * 2017-03-28 2017-07-21 吉林大学 Probability and the sound enhancement method of phase estimation are had based on voice

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Voice Activity Detection Algorithm with Low Signal-to-Noise Ratios Based on Spectrum Entropy; Kun-Ching Wang et al.; 2008 Second International Symposium on Universal Communication; 2008-12-16; pp. 423-428 *
A speech enhancement algorithm based on spectrogram analysis (一种基于语谱图分析的语音增强算法); Xiao Chunzhi; 《语音技术》 (Speech Technology); Sept. 2012; vol. 36, no. 9; pp. 44-48 *
Research on speech endpoint detection methods based on cepstral features and voiced-sound characteristics (基于倒谱特征和浊音特性的语音端点检测方法的研究); Sun Haiying; China Master's Theses Full-text Database (Information Science and Technology); 2009-05-15 *
Speech endpoint detection algorithm based on the spectrogram (基于语谱图的语音端点检测算法); Chen Xiangmin et al.; 《语音技术》 (Speech Technology); April 2006; pp. 47-49 *

Also Published As

Publication number Publication date
CN108053842A (en) 2018-05-18

Similar Documents

Publication Publication Date Title
CN108053842B (en) Short wave voice endpoint detection method based on image recognition
CN103854662B (en) Adaptive voice detection method based on multiple domain Combined estimator
EP1547061B1 (en) Multichannel voice detection in adverse environments
CN109545188A (en) A kind of real-time voice end-point detecting method and device
US20020010581A1 (en) Voice recognition device
EP3411876B1 (en) Babble noise suppression
JPH0916194A (en) Noise reduction for voice signal
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
JP6272433B2 (en) Method and apparatus for detecting pitch cycle accuracy
CN108597505A (en) Audio recognition method, device and terminal device
Yu et al. Effect of multi-condition training and speech enhancement methods on spoofing detection
CN111091833A (en) Endpoint detection method for reducing noise influence
CN111312275A (en) Online sound source separation enhancement system based on sub-band decomposition
EP0780828A2 (en) Method and system for performing speech recognition
Morales-Cordovilla et al. Feature extraction based on pitch-synchronous averaging for robust speech recognition
CN111599372B (en) Stable on-line multi-channel voice dereverberation method and system
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
CN111899750A (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN109102823B (en) Speech enhancement method based on subband spectral entropy
Hsu et al. Voice activity detection based on frequency modulation of harmonics
WO2007041789A1 (en) Front-end processing of speech signals
CN112233657A (en) Speech enhancement method based on low-frequency syllable recognition
CN113903344B (en) Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction
CN116665681A (en) Thunder identification method based on combined filtering
TWI749547B (en) Speech enhancement system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant