CN104392728A - Color complex spectrogram construction method for speech reconstruction - Google Patents

Color complex spectrogram construction method for speech reconstruction

Info

Publication number
CN104392728A
CN104392728A (application CN201410688088.0A; granted as CN104392728B)
Authority
CN
China
Prior art keywords
matrix
real part
channel
symbol
imaginary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410688088.0A
Other languages
Chinese (zh)
Other versions
CN104392728B (en)
Inventor
王双维
李广岩
梁士利
王春蕾
曹晓林
郑彩侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Normal University
Original Assignee
Northeast Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Normal University
Priority to CN201410688088.0A
Publication of CN104392728A
Application granted
Publication of CN104392728B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Auxiliary Devices For Music (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a color complex spectrogram construction method for speech reconstruction, belonging to the technical field of speech signal processing. Two color channels serve respectively as the real part and the imaginary part of the Fourier transform: in the R-G-B color space, the position coordinates of the R-B composite color correspond to the real and imaginary parts of the Fourier transform, while the G channel encodes the sign combination of the real and imaginary parts. From the R-G-B color proportions, the real part, the imaginary part, and their signs for the corresponding complex value can be recovered. The spectrogram can therefore be processed as an image and the speech then reconstructed, so that techniques such as image-based enhancement can be applied before the inverse Fourier transform, realizing speech reconstruction.

Description

Color complex spectrogram construction method capable of realizing speech reconstruction
Technical Field
The invention belongs to the field of voice signal processing, and relates to a visual color spectrogram construction method capable of realizing voice reconstruction.
Background
The spectrogram is an effective tool for speech analysis and phonetics, and a readable symbol system for studying speech information. It simultaneously shows the closely related time-domain and frequency-domain characteristics and their interrelation, which neither a pure time-domain signal nor a pure frequency-domain signal, nor a simple juxtaposition of the two, can do. The amount of information carried by the spectrogram is therefore far greater than the sum of that carried by the time-domain and frequency-domain signals alone. Recent research includes extracting texture features with image processing techniques and, combined with a subsequent classifier, realizing voice identity authentication and confirmation of specific words of a specific speaker; recognizing singing voice under background music using spectrogram textures; and performing speech recognition based on local gradient computation on the spectrogram. Zhao Shenghui et al. of Beijing Institute of Technology proposed "a speech spectrogram color enhancement method for voice visualization" and patented it (200910235643.3).
However, in past research the spectrogram has mostly served as a visual display of spectral features, and the data actually analyzed were still the original speech signal rather than the spectrogram itself. In particular, since the spectrogram is a visual representation of the amplitude-frequency characteristics of speech and lacks phase information, speech reconstruction from the spectrogram has not been possible. Although the color spectrogram is based on three color channels, it is merely a pseudo-color rendering of the gray spectrogram and gains no additional information dimension from color.
Disclosure of Invention
Technical problem to be solved
The invention aims to provide a visual color spectrogram construction method capable of realizing speech reconstruction, which uses the R channel and the B channel of an RGB color model to represent the real part and the imaginary part of the speech time-frequency analysis respectively, while the G channel marks the sign combination of the real and imaginary parts, forming a complex spectrogram with a three-dimensional information structure. From this spectrogram, the magnitudes of the real and imaginary parts of the time-frequency analysis can be obtained by extracting the R-channel and B-channel data, their signs can be obtained by decoding the G channel, the complex time-frequency analysis matrix can then be generated, and speech reconstruction can finally be realized through the inverse Fourier transform.
The invention is not limited to the decomposition and reconstruction of human speech, nor to sound signals in the audio range (20 Hz-20 kHz).
(II) technical scheme
In order to achieve the purpose, the invention adopts the following scheme:
1. Window and frame the original speech signal to form an N×M speech framing matrix S, where the number of rows N is the number of signal points per frame and the number of columns M is the number of frames of the original speech signal;
2. Perform an N-point DFT on each column of the framing matrix S; the result for the ith column is

X(n,i) = Σ_{k=1}^{N} S(k,i) e^{-j2π(n-1)(k-1)/N},  n = 1, …, N   (1)

and

X(n,i) = X_R(n,i) + j X_I(n,i)   (2)

where X(n,i) is the element in the nth row and ith column of the matrix X, X_R(n,i) is the real part of X(n,i), and X_I(n,i) is its imaginary part; X is an N×M complex matrix whose elements satisfy

X_R(n,i) = Re{X(n,i)}   (3)

and

X_I(n,i) = Im{X(n,i)}   (4)

so that

X = X_R + j X_I   (5)
3. Decompose the complex matrix X into two submatrices, the real part X_R and the imaginary part X_I, take their absolute values, and normalize the data so that the dynamic range is 0-1;
4. Construct the sign coding matrix: the real part and the imaginary part of the complex matrix each take the sign +, - or 0, giving 9 combinations in total. The invention marks these 9 combinations with 9 numerical values so as to preserve the sign information of the real and imaginary parts of the original complex matrix;
5. Construct a 3-dimensional matrix D: the normalized real-part submatrix is layer 1 in the layer dimension, the normalized imaginary-part submatrix is layer 3, and the sign coding matrix is layer 2;
6. Use the 3-dimensional matrix D as the driving matrix of an RGB color model to form a complex spectrogram composed of the red and blue primaries, in which the real-part submatrix corresponds to the red channel R, the imaginary-part submatrix corresponds to the blue channel B, and the sign coding matrix corresponds to the green channel G;
7. Speech reconstruction process: extract the R-channel, B-channel and G-channel data separately; decode the G channel to obtain the signs of the real and imaginary parts; assign these signs to the extracted R-channel and B-channel data; and construct a complex matrix from the two resulting matrices to obtain the normalized speech time-frequency analysis data. An inverse Fourier transform then yields the speech-signal framing matrix, and removing the framing forms the speech sequence, realizing speech reconstruction.
(III) Advantages (beneficial effects)
1. The invention uses two color channels to express the real part and the imaginary part of the Fourier transform respectively: in the R-G-B color space, the position coordinates of the R-B composite color correspond to the real and imaginary parts of the Fourier transform, and the G value encodes their sign combination. For example, the position of point A in the R-B color space of FIG. 1 marks real-part and imaginary-part magnitudes of 0.8 and 0.2 respectively, and the real part and imaginary part of the corresponding complex value, together with their signs, can be recovered from the R-G-B color matching;
2. The significance of this spectrogram is that the spectrogram itself can be processed as an image and the speech then reconstructed, so that speech enhancement can be achieved with image processing techniques. Although the power spectrum and the amplitude spectrum can also be processed with image techniques, they lack phase (sign) information and cannot be inverse-Fourier-transformed, so they cannot be reconstructed into speech.
Drawings
1. FIG. 1 shows that in the R-B color space, two color channels express the real part and the imaginary part of the Fourier transform respectively; the position coordinates of the combined color correspond to the magnitudes of the real and imaginary parts, with the abscissa (red channel) representing the real part and the ordinate (blue channel) the imaginary part. For example, point A is located at (0.8, 0.2) in the R-B color space, where the R-B color matching represents a real-part magnitude of 0.8 and an imaginary-part magnitude of 0.2. With G-channel sign decoding, the real part and the imaginary part of the corresponding complex value can be recovered from the color matching;
2. FIG. 2 is a flow chart for constructing and using the color complex spectrogram capable of realizing speech reconstruction.
Detailed Description
The examples in the schemes are used to illustrate the invention, but not to limit the scope of the invention.
The specific implementation of the invention is divided into 9 modules in two major parts; the flow is shown in FIG. 2. The following description takes a speech signal with a sampling rate of 16 kHz as an example:
1. A speech framing module: first window and frame the speech signal; for example, divide it into frames of 1024 points each, M frames in total, forming a 1024×M framing signal matrix. The frequency-domain resolution is then 16000/1024 = 15.625 Hz;
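As a concrete sketch of this framing module (assuming a Hann window and non-overlapping frames, neither of which the text fixes), one second of 16 kHz samples yields the 1024×M framing matrix:

```python
import numpy as np

fs = 16000                      # sampling rate used in the example (16 kHz)
N = 1024                        # points per frame
speech = np.random.randn(fs)    # stand-in for 1 s of speech; real input would be a recording

# Non-overlapping framing with a Hann window (overlap and window type are
# assumptions -- the patent only specifies windowing and framing).
M = len(speech) // N
frames = speech[:M * N].reshape(M, N).T   # N x M framing matrix (one frame per column)
window = np.hanning(N)
S = frames * window[:, None]              # windowed N x M matrix

freq_resolution = fs / N                  # 16000 / 1024 = 15.625 Hz
```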
2. A Fourier analysis module: according to formula (1), apply the FFT to each column of the 1024×M framing signal matrix to compute the DFT, obtaining the 1024-point DFT of the corresponding column and forming the 1024×M time-frequency analysis matrix of formulas (2)-(5). This matrix is complex; each element corresponds to the real and imaginary parts of the frequency characteristic of a certain frequency band at a certain moment;
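The column-wise DFT of this module can be sketched as follows; `np.fft.fft` along axis 0 transforms every column of the framing matrix at once (stand-in data, illustrative names):

```python
import numpy as np

N, M = 1024, 15
S = np.random.randn(N, M)          # windowed framing matrix (stand-in data)

# Column-wise N-point DFT via the FFT, as in step 2 of the scheme.
X = np.fft.fft(S, n=N, axis=0)     # N x M complex time-frequency matrix
XR, XI = X.real, X.imag            # real-part and imaginary-part submatrices
```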
3. A submatrix forming module: let d be the maximum absolute value of the real or imaginary part over all elements of the matrix X. Construct 2 matrices

R = |X_R| / d   (6)

B = |X_I| / d   (7)

R and B are the normalized submatrices corresponding to the absolute values of the real part X_R and the imaginary part X_I of X. Using d as the single normalization constant makes the dynamic ranges of R and B consistent;
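A minimal sketch of this normalization on a toy 2×2 example; the single constant d keeps both submatrices on the same 0-1 scale:

```python
import numpy as np

XR = np.array([[3.0, -1.5], [0.0, 2.0]])    # toy real-part submatrix
XI = np.array([[-4.0, 0.5], [1.0, -2.0]])   # toy imaginary-part submatrix

# d is the largest absolute value over BOTH submatrices, so the same
# constant normalizes the real and imaginary parts consistently.
d = max(np.abs(XR).max(), np.abs(XI).max())
R = np.abs(XR) / d
B = np.abs(XI) / d
```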
4. A sign coding matrix forming module: use the sign function to extract the signs of the real part X_R and the imaginary part X_I of X in formula (5):

S_R = sgn(X_R)   (8)

S_I = sgn(X_I)   (9)

The function sgn(x) outputs -1 when x < 0, +1 when x > 0, and 0 when x = 0. A weighted sum of formulas (8) and (9) gives the sign-combination code of the real part and the imaginary part:

C = 3 S_R + S_I   (10)
The sign-combination coding results of formula (10) are shown in Table 1; the 9 values in Table 1 mark the 9 states of the real/imaginary sign combinations. In order to visualize the sign-combination code with the G channel, the zero point of the results in Table 1 must be shifted and the values normalized, expressed by

G = (C + 4) / 800   (11)

As formula (11) shows, the value of G lies between 0 and 0.01; the results are shown in Table 2. Taking 800 as the normalization constant makes the maximum value of the G channel much smaller than the values of the R and B channels, so that the green of the G channel does not visually interfere with the R-B two-primary image when the spectrogram is visualized;
TABLE 1 Sign-combination coding C of the real part X_R and the imaginary part X_I

            S_I = -1   S_I = 0   S_I = +1
 S_R = -1     -4         -3        -2
 S_R = 0      -1          0        +1
 S_R = +1     +2         +3        +4

TABLE 2 Normalized sign-combination coding G of the real part X_R and the imaginary part X_I

            S_I = -1   S_I = 0    S_I = +1
 S_R = -1    0          0.00125    0.0025
 S_R = 0     0.00375    0.005      0.00625
 S_R = +1    0.0075     0.00875    0.01
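The sign coding can be sketched as follows. The concrete weighted sum C = 3·sgn(X_R) + sgn(X_I) and the shift by +4 are assumptions inferred from the stated properties (9 distinct codes, normalization constant 800, and G between 0 and 0.01), not taken verbatim from the patent:

```python
import numpy as np

XR = np.array([[3.0, -1.5, 0.0]])    # toy real-part submatrix
XI = np.array([[-4.0, 0.0, 2.0]])    # toy imaginary-part submatrix

# Assumed weighting yielding 9 distinct codes: C = 3*sgn(Re) + sgn(Im),
# then shift by +4 and divide by 800 so that G spans [0, 0.01].
C = 3 * np.sign(XR) + np.sign(XI)    # values in {-4, ..., +4}
G = (C + 4) / 800.0                  # normalized sign-combination code

codes = sorted({3 * a + b for a in (-1, 0, 1) for b in (-1, 0, 1)})
```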
5. An RGB color-model driving-matrix forming and visualization module: construct a 3-dimensional matrix D, with the real-part normalized absolute-value submatrix R as layer 1 in the layer dimension, the imaginary-part normalized absolute-value submatrix B as layer 3, and the sign-combination coding matrix G as layer 2. Use the 3-dimensional matrix D as the driving matrix of the RGB color model to form the color complex spectrogram: R corresponds to the red channel, B to the blue channel, and G to the green channel. Because the values of the G channel are far smaller than those of the R and B channels, the color complex spectrogram visually appears as an R-B two-primary image.
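Forming the driving matrix amounts to stacking the three layers along a depth axis; a sketch in NumPy (array names are illustrative):

```python
import numpy as np

N, M = 4, 3
R = np.random.rand(N, M)            # normalized |real part| (layer 1, red)
G = np.random.rand(N, M) * 0.01     # sign-combination code (layer 2, green), << R and B
B = np.random.rand(N, M)            # normalized |imaginary part| (layer 3, blue)

# Stack the three layers along a third axis to obtain the N x M x 3
# driving matrix of the RGB color model.
D = np.stack([R, G, B], axis=2)
```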
6. A frequency-domain submatrix extraction module: from the 3-dimensional matrix corresponding to the image-processed two-primary color complex spectrogram, extract layer 1 and layer 3 as the matrices R and B for later use;
7. A sign decoding module: take out the G-channel sign-combination code to form the normalized sign-combination coding matrix G;
(1) Real-part sign decoding. First recover the sign-combination coding matrix by

C = 800 G - 4   (11)

Then the real-part sign matrix is

S_R = ε(C - 2) - ε(-C - 2)   (12)

In formula (12), ε(x) is a step function: ε(x) = 1 when x > 0, ε(x) = 1 when x = 0, and ε(x) = 0 when x < 0. The result of formula (12) is: when C ≥ 2, the corresponding real-part sign is positive and S_R = +1; when C ≤ -2, the corresponding real-part sign is negative and S_R = -1; when -2 < C < 2, the corresponding real-part sign is zero and S_R = 0.
(2) Imaginary-part sign decoding, using the real-part sign decoding result:

S_I = C - 3 S_R   (13)

Analyzing formula (13): when C = 4, the corresponding imaginary-part sign is positive; since S_R = +1, formula (13) gives S_I = 4 - 3 = +1, so S_R and S_I are both +1. The remaining cases follow in the same way.
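Assuming the coding C = 3·S_R + S_I with G = (C + 4)/800 (an inference from the stated value ranges, not a quoted formula), the decoding rules above can be checked exhaustively over all 9 sign combinations:

```python
def step(x):
    """Step function assumed as: 1 for x >= 0, 0 for x < 0."""
    return 1 if x >= 0 else 0

# Exhaustive encode/decode round trip over the 9 sign combinations.
for sr in (-1, 0, 1):
    for si in (-1, 0, 1):
        G = (3 * sr + si + 4) / 800.0        # encoder side (assumed coding)
        C = round(800 * G) - 4               # formula (11); rounded for float safety
        sr_dec = step(C - 2) - step(-C - 2)  # formula (12): real-part sign
        si_dec = C - 3 * sr_dec              # formula (13): imaginary-part sign
        assert (sr_dec, si_dec) == (sr, si)
decoded_ok = True
```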
8. A time-frequency characteristic matrix forming module: the real-part and imaginary-part submatrices are generated as the element-wise products S_R·R and S_I·B respectively, and the frequency-domain characteristic matrix is

F = S_R·R + j S_I·B   (14)
9. A speech signal reconstruction module: apply the inverse FFT to each column of F to perform the column-by-column inverse Fourier transform, forming the processed speech-signal framing matrix; connecting all its columns end to end forms a one-dimensional speech sequence, realizing speech reconstruction.
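The whole analysis-reconstruction loop of modules 1-9 can be sketched end to end. The sign coding used here (C = 3·sgn + sgn, shifted by 4 and divided by 800) is an assumption consistent with the stated ranges; under it, the round trip recovers the framing matrix up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 64, 5
S = rng.standard_normal((N, M))        # framing matrix (stand-in for speech)

# Encode: column-wise DFT, split into normalized magnitudes plus a sign code.
X = np.fft.fft(S, axis=0)
d = max(np.abs(X.real).max(), np.abs(X.imag).max())
R = np.abs(X.real) / d
B = np.abs(X.imag) / d
G = (3 * np.sign(X.real) + np.sign(X.imag) + 4) / 800.0

# Decode: recover the signs from G, rebuild the complex matrix, invert the DFT.
C = np.round(800 * G) - 4
SR = (C >= 2).astype(int) - (C <= -2).astype(int)
SI = C - 3 * SR
F = SR * R + 1j * SI * B               # normalized time-frequency matrix
S_rec = np.fft.ifft(d * F, axis=0).real
speech = S_rec.T.reshape(-1)           # connect the columns end to end
```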

Claims (1)

1. A color complex spectrogram construction method capable of realizing speech reconstruction, in which the speech signal is first windowed and framed into frames of N points each, M frames in total, forming an N×M framing signal matrix; the FFT is applied to each column of the N×M framing signal matrix to compute the DFT, obtaining the N-point DFT of the corresponding column and forming the N×M time-frequency analysis matrix X, each element of which corresponds to the real part and the imaginary part of the frequency characteristic of a certain frequency band at a certain time, characterized in that:
1) A submatrix forming module: let d be the maximum absolute value of the real or imaginary part over all elements of the matrix X, and construct 2 matrices R = |X_R|/d and B = |X_I|/d, the normalized absolute-value submatrices corresponding to the real part X_R and the imaginary part X_I of X; d serves as the normalization constant so that the dynamic ranges of R and B are consistent;
2) A sign coding matrix forming module: extract the signs of the real part X_R and the imaginary part X_I of X with the sign function, S_R = sgn(X_R) and S_I = sgn(X_I); the function sgn(x) outputs -1 when x < 0, +1 when x > 0, and 0 when x = 0; a weighted sum of the two formulas gives the sign-combination code of the real part and the imaginary part, C = 3 S_R + S_I;
The sign-combination coding results of the above formula are shown in Table 1; the 9 values in Table 1 mark the 9 states of the real/imaginary sign combinations; in order to visualize the sign-combination code with the G channel, the zero point of the results in Table 1 is shifted and the values normalized as G = (C + 4)/800; according to this formula, the value of G lies between 0 and 0.01, with the results shown in Table 2; taking 800 as the normalization constant makes the maximum value of the G channel far smaller than the values of the R and B channels, so that the green of the G channel does not visually interfere with the R-B two-primary image when the spectrogram is visualized;
TABLE 1 Sign-combination coding of the real part X_R and the imaginary part X_I
TABLE 2 Normalized sign-combination coding of the real part X_R and the imaginary part X_I
3) An RGB color-model driving-matrix forming and visualization module: construct a 3-dimensional matrix D with the real-part normalized absolute-value submatrix R as layer 1 of the layer dimension, the imaginary-part normalized absolute-value submatrix B as layer 3, and the sign-combination coding matrix G as layer 2; use D as the driving matrix of the RGB color model to form the color complex spectrogram, in which R corresponds to the red channel, B to the blue channel, and G to the green channel; since the values of the G channel are far smaller than those of the R and B channels, the color complex spectrogram visually appears as an R-B two-primary image;
4) A frequency-domain submatrix extraction module: from the 3-dimensional matrix corresponding to the image-processed two-primary color complex spectrogram, extract layer 1 and layer 3 as the matrices R and B for later use;
5) A sign decoding module: take out the G-channel sign-combination code to form the normalized sign-combination coding matrix G;
(1) Real-part sign decoding: first recover the sign-combination coding matrix by C = 800 G - 4; then the real-part sign matrix is S_R = ε(C - 2) - ε(-C - 2), where ε(x) is a step function with ε(x) = 1 when x > 0, ε(x) = 1 when x = 0, and ε(x) = 0 when x < 0; the result is: when C ≥ 2, the corresponding real-part sign is positive and S_R = +1; when C ≤ -2, the corresponding real-part sign is negative and S_R = -1; when -2 < C < 2, the corresponding real-part sign is zero and S_R = 0;
(2) Imaginary-part sign decoding, using the real-part sign decoding result: S_I = C - 3 S_R; analyzing this formula, when C = 4 the corresponding imaginary-part sign is positive, and since S_R = +1 the results are S_R = +1 and S_I = +1; the remaining cases follow by analogy;
6) A time-frequency characteristic matrix forming module: the real-part and imaginary-part submatrices are generated as the element-wise products S_R·R and S_I·B respectively; the frequency-domain characteristic matrix is then F = S_R·R + j S_I·B;
Apply the inverse FFT to each column of F to perform the column-by-column inverse Fourier transform, forming the processed speech-signal framing matrix; connecting all its columns end to end forms a one-dimensional speech sequence, realizing speech reconstruction.
CN201410688088.0A 2014-11-26 2014-11-26 Color complex spectrogram construction method for speech reconstruction Expired - Fee Related CN104392728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410688088.0A CN104392728B (en) 2014-11-26 2014-11-26 Color complex spectrogram construction method for speech reconstruction


Publications (2)

Publication Number Publication Date
CN104392728A true CN104392728A (en) 2015-03-04
CN104392728B CN104392728B (en) 2017-04-19

Family

ID=52610620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410688088.0A Expired - Fee Related CN104392728B (en) 2014-11-26 2014-11-26 Color complex spectrogram construction method for speech reconstruction

Country Status (1)

Country Link
CN (1) CN104392728B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101141545A (en) * 2007-10-11 2008-03-12 Fudan University High-speed algorithm for hypercomplex Fourier transform and hypercomplex cross-correlation of color images
CN102044254A (en) * 2009-10-10 2011-05-04 Beijing Institute of Technology Speech spectrum color enhancement method for speech visualization
CN201910239U (en) * 2010-12-21 2011-07-27 Northwest Normal University Speech spectrum analysis system based on a field-programmable gate array (FPGA)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SUN Hongying et al.: "FPGA Implementation of Spectrogram Analysis", Journal of Electronics & Information Technology *
YANG Chunfeng: "Audio Digital Watermarking Algorithm Based on the Spectrogram", China Masters' Theses Full-text Database, Information Science and Technology *
TAO Zhongxing: "Research on FPGA-based Time-domain Signal Analysis Methods", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788608A (en) * 2016-03-03 2016-07-20 Bohai University Chinese initial and final visualization method based on neural network
CN105788608B (en) * 2016-03-03 2019-03-26 Bohai University Neural-network-based visualization method for Chinese initials and finals
CN110310624A (en) * 2019-07-03 2019-10-08 Xinhua College of Sun Yat-sen University Efficient secondary speech detection and recognition method and device

Also Published As

Publication number Publication date
CN104392728B (en) 2017-04-19


Legal Events

Date | Code | Title/Description
C06 / PB01: Publication
C10 / SE01: Entry into substantive examination
GR01: Patent grant (granted publication date: 20170419)
CF01: Termination of patent right due to non-payment of annual fee (termination date: 20191126)