CN110070894B - Improved method for identifying multiple pathological unit tones - Google Patents
- Publication number
- CN110070894B (application CN201910233952.0A)
- Authority
- CN
- China
- Prior art keywords
- line spectrum
- order
- spectrum pair
- parameters
- voice signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
Abstract
An improved multiple pathology unit tone identification method, comprising: calculating line spectrum pair parameters of an input voice signal; calculating adjacent differential line spectrum pair parameters of the input voice signal; performing frequency warping on the line spectrum pair parameters of the input voice signal to obtain the Bark line spectrum pair parameters of the input voice signal; performing feature enhancement on the Bark line spectrum pair parameters of the input voice signal to obtain enhanced Bark line spectrum pair parameters; and inputting the enhanced Bark line spectrum pair parameters of the input voice signal into a deep neural network classifier to identify multiple pathological unit tones. The method achieves a better recognition rate and provides a research foundation for the subsequent voice restoration of unit tones and of more complex words and sentences.
Description
Technical Field
The invention relates to pathological unit tone identification methods, and more particularly to an improved method for identifying multiple pathological unit tones.
Background
Voice is the most direct medium of language, so voice quality directly affects the efficiency of people's daily communication. Statistically, about 7.5 million people in the United States suffer from voice disorders, with a prevalence of 57.7% among education professionals and 28.8% among non-education professionals. Furthermore, in the UK, approximately 2,200 people are diagnosed with laryngeal cancer each year. An unclear voice greatly reduces quality of life, which makes the recognition and restoration of pathological voice particularly important.
Voice disorders can be treated with drugs or physical therapy, but imperfect treatment affects a patient's ability to express themselves, so non-invasive recognition and restoration of pathological voice has become a key research focus. Recognition and restoration of unit tones is the basis for handling complex words and sentences. Research on multi-class unit tone recognition has so far been based on normal voice, with commonly used feature parameters including Linear Prediction Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), and formants. Recognition work on pathological voice, however, mostly addresses the two-class problem of pathological versus normal voice; because most acoustic feature parameters recognize the /a/ sound better than other vowels, the pathological unit tone /a/ is generally chosen at home and abroad as the experimental sample, and its feature parameters are extracted and fed into different classification networks to identify pathological voice. Commonly used recognition features include perturbation features such as fundamental frequency perturbation and amplitude perturbation, and features such as MPEG-7 descriptors and Multidirectional Regression (MDR). But the features applied to recognizing multiple normal unit tones (LPCC, MFCC) are less effective at recognizing multiple pathological unit tones.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an improved method for identifying multiple pathological unit tones that can further improve the recognition rate of pathological voice.
The technical scheme adopted by the invention is as follows: an improved method for identifying multiple pathological unit tones, comprising the steps of:
1) calculating line spectrum pair parameters of an input voice signal;
2) calculating adjacent differential line spectrum pair parameters of the input voice signal;
3) performing frequency warping on the line spectrum pair parameters of the input voice signal to obtain the bark line spectrum pair parameters of the input voice signal;
the frequency warping adopts the following formula:
Bark=26.81/(1+(1960/f))-0.53 (6)
wherein Bark represents Bark frequency; f represents a linear frequency;
4) according to the adjacent differential line spectrum pair parameters, performing feature enhancement on the Bark line spectrum pair parameters of the input voice signal to obtain enhanced Bark line spectrum pair parameters;
5) inputting the enhanced Bark line spectrum pair parameters of the input voice signal into a deep neural network classifier to identify multiple pathological unit tones.
The step 1) comprises the following steps:
(1.1) performing signal preprocessing, including direct current removing processing and framing processing;
(1.2) for each frame of the voice signal, calculating the 12-order linear prediction coefficients a_i by the Levinson-Durbin autocorrelation algorithm according to the set model order p = 12;
(1.3) from the linear prediction coefficients a_i calculated in (1.2), the linear prediction inverse filter system function is computed as follows:
A(z) = 1 - Σ_{i=1}^{p} a_i z^(-i)   (1)
where A(z) represents the linear prediction inverse filter system function; p represents the model order; a_i represents the linear prediction coefficients;
(1.4) calculating the (p+1)-order symmetric and antisymmetric polynomials of the linear prediction inverse filter system function A(z):
P(z) = A(z) + z^(-(p+1)) A(z^(-1))   (2)
where P(z) represents the (p+1)-order symmetric polynomial of A(z), A(z) represents the linear prediction inverse filter system function, and p represents the model order;
Q(z) = A(z) - z^(-(p+1)) A(z^(-1))   (3)
where Q(z) represents the (p+1)-order antisymmetric polynomial of A(z), A(z) represents the linear prediction inverse filter system function, and p represents the model order;
(1.5) calculating the 12-order line spectrum pair parameters of the input speech signal from P(z) and Q(z):
|H(e^(jω))|^2 = 1 / |A(e^(jω))|^2 = 1 / { 2^p [ cos^2(ω/2) Π_{i=1}^{p/2} (cos ω - cos θ_i)^2 + sin^2(ω/2) Π_{i=1}^{p/2} (cos ω - cos ω_i)^2 ] }   (4)
where H(e^(jω)) is the linear prediction spectral amplitude, e^(jω) is the frequency-domain representation of z, P(e^(jω)) is the (p+1)-order symmetric polynomial of A(e^(jω)), Q(e^(jω)) is the (p+1)-order antisymmetric polynomial of A(e^(jω)), cos θ_i and cos ω_i are the representations of the LSP coefficients in the cosine domain, θ_i and ω_i are the line spectrum frequencies corresponding to the line spectrum pair coefficients of the input voice signal, and Π is the cumulative product sign.
Step 2) is calculated according to the following formula:
DAL_i = l_{i+1} - l_i,  i = 1, 2, ..., M  (M < N)   (5)
where DAL_i is the i-th order adjacent differential line spectrum pair parameter, l_{i+1} is the (i+1)-th order line spectrum pair parameter, l_i is the i-th order line spectrum pair parameter, M is the maximum order of the adjacent differential line spectrum pair parameters, and N is the maximum order of the line spectrum pair parameters.
Step 4) adjusts the j-th order Bark line spectrum pair parameter in a bidirectional iterative manner according to the adjacent differential line spectrum pair parameters, where j = 2, ..., N-1; the original Bark line spectrum pair parameter is updated directly after adjustment, and the adjusted j-th order parameter is used when adjusting the next order. Let the Bark line spectrum pair parameters of the current frame be {b_1, b_2, ..., b_N}, where N is the maximum order of the line spectrum pair parameters, and let the adjacent differential coefficients of the current frame be b_{i+1} - b_i, i = 1, 2, ..., N-1. The specific iterative formula is as follows:
c_i = η(b_{i+1} - b_i),  η < 1,  i = 2, 3, ..., N-1   (8)
(1) forward iteration: adjusting the j-th order Bark line spectrum pair parameters forward from j = 2 to j = N-1;
(2) backward iteration: adjusting the j-th order Bark line spectrum pair parameters backward from j = N-1 to j = 2;
(3) averaging: averaging the Bark line spectrum pair parameters obtained by the forward and backward iterations to obtain the enhanced Bark line spectrum pair parameters;
where η controls the degree of formant enhancement; the smaller η is, the more obvious the enhancement effect.
Step 5) first randomly selects 75% of each unit tone data set in the SVD pathological voice database as the training set and 25% as the test set, so that each class of voice data is evenly distributed across the training and test phases of the classification network. The 12-order enhanced Bark line spectrum pair parameters of the 6 unit tones (vocal polyp pathological voice /a/, /i/, /u/ and normal voice /a/, /i/, /u/) are then input into the deep neural network for recognition, with the network parameters set as follows: 2 hidden layers with 100 neurons each, the ReLU function as activation function, and a Softmax function in the last layer of the recognition model to turn the network output into a probability distribution and thus refine the classification result.
The improved method for identifying the multiple pathological unit tones has the following beneficial effects:
1) Compared with the traditional MFCC and LPCC features, the improved method for identifying multiple pathological unit tones achieves a better recognition rate, and the invention proposes the new E-BLSP feature suited to identifying multiple pathological unit tones. The newly proposed E-BLSP feature achieves a high recognition rate for the 6 unit tones: normal /a/, /i/, /u/ and pathological /a/, /i/, /u/;
2) The proposed E-BLSP recognizes pathological /i/ better than pathological /a/, whereas traditional pathological voice recognition is mostly based on the unit tone /a/; this offers a new idea for the recognition and diagnosis of pathological voice and provides a research basis for the voice restoration of unit tones and of more complex words and sentences.
Drawings
FIG. 1 is a schematic structural diagram of an improved multiple pathology unit tone identification method of the present invention;
FIG. 2a is an 11-order DAL parameter box plot of the normal unit tone /a/;
FIG. 2b is an 11-order DAL parameter box plot of the pathological unit tone /a/;
FIG. 2c is an 11-order DAL parameter box plot of the normal unit tone /i/;
FIG. 2d is an 11-order DAL parameter box plot of the pathological unit tone /i/;
FIG. 2e is an 11-order DAL parameter box plot of the normal unit tone /u/;
FIG. 2f is an 11-order DAL parameter box plot of the pathological unit tone /u/;
FIG. 3a is a schematic diagram of the 12-order LSP parameters according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of the 12-order BLSP parameters according to an embodiment of the present invention;
FIG. 4a is a schematic three-dimensional spectrum of the 12-order BLSP parameters according to an embodiment of the present invention;
FIG. 4b is a schematic three-dimensional spectrum of the 12-order E-BLSP parameters of the present invention.
Detailed Description
The following describes an improved multiple pathological unit tone recognition method according to the present invention in detail with reference to the accompanying drawings and examples.
As shown in fig. 1, the improved method for recognizing multiple pathological unit tones of the present invention comprises the following steps:
1) calculating Line Spectrum Pair (LSP) parameters of an input voice signal; the method comprises the following steps:
(1.1) performing signal preprocessing, including direct current removing processing and framing processing;
(1.2) for each frame of the voice signal, calculating the 12-order linear prediction coefficients a_i by the Levinson-Durbin autocorrelation algorithm according to the set model order p = 12;
(1.3) from the linear prediction coefficients a_i calculated in (1.2), the linear prediction inverse filter system function is computed as follows:
A(z) = 1 - Σ_{i=1}^{p} a_i z^(-i)   (1)
where A(z) represents the linear prediction inverse filter system function; p represents the model order; a_i represents the linear prediction coefficients;
(1.4) calculating the (p+1)-order symmetric and antisymmetric polynomials of the linear prediction inverse filter system function A(z):
P(z) = A(z) + z^(-(p+1)) A(z^(-1))   (2)
where P(z) represents the (p+1)-order symmetric polynomial of A(z), A(z) represents the linear prediction inverse filter system function, and p represents the model order;
Q(z) = A(z) - z^(-(p+1)) A(z^(-1))   (3)
where Q(z) represents the (p+1)-order antisymmetric polynomial of A(z), A(z) represents the linear prediction inverse filter system function, and p represents the model order;
(1.5) calculating the 12-order line spectrum pair parameters of the input speech signal from P(z) and Q(z):
|H(e^(jω))|^2 = 1 / |A(e^(jω))|^2 = 1 / { 2^p [ cos^2(ω/2) Π_{i=1}^{p/2} (cos ω - cos θ_i)^2 + sin^2(ω/2) Π_{i=1}^{p/2} (cos ω - cos ω_i)^2 ] }   (4)
where H(e^(jω)) is the linear prediction spectral amplitude, e^(jω) is the frequency-domain representation of z, P(e^(jω)) is the (p+1)-order symmetric polynomial of A(e^(jω)), Q(e^(jω)) is the (p+1)-order antisymmetric polynomial of A(e^(jω)), cos θ_i and cos ω_i are the representations of the LSP coefficients in the cosine domain, θ_i and ω_i are the Line Spectrum Frequencies (LSF) corresponding to the line spectrum pair coefficients of the input voice signal, and Π is the cumulative product sign.
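Steps (1.2)-(1.5) can be sketched in NumPy. This is a minimal illustration, not the patent's implementation: `levinson_durbin` here returns the coefficients of A(z) in the common form A(z) = 1 + Σ a_j z^(-j) (differing from eq. (1) only in the sign convention of the a_i), and the LSFs are obtained as the unit-circle root angles of P(z) and Q(z):

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin recursion: solve the autocorrelation normal
    equations for the inverse-filter polynomial A(z) = 1 + sum a_j z^-j."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # residual prediction error
    return a

def lsp_from_lpc(a):
    """Line spectral frequencies as the angles of the unit-circle roots of
    P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1)."""
    sym = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])   # P(z)
    asym = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])  # Q(z)
    eps = 1e-4                               # exclude the trivial roots at z = +/-1
    lsf = [w for poly in (sym, asym)
           for w in np.angle(np.roots(poly)) if eps < w < np.pi - eps]
    return np.array(sorted(lsf))
```

For each frame, `r` would be the biased autocorrelation sequence r[0..12]; because Levinson-Durbin on a valid autocorrelation yields a minimum-phase A(z), the 12 LSFs interlace strictly on (0, π).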
2) Calculating parameters of Adjacent Difference line spectrum pairs (DAL) of the input voice signals;
is calculated according to the following formula:
DAL_i = l_{i+1} - l_i,  i = 1, 2, ..., M  (M < N)   (5)
where DAL_i is the i-th order adjacent differential line spectrum pair parameter, l_{i+1} is the (i+1)-th order line spectrum pair parameter, l_i is the i-th order line spectrum pair parameter, M is the maximum order of the adjacent differential line spectrum pair parameters, and N is the maximum order of the line spectrum pair parameters.
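Equation (5) is a plain adjacent difference; a one-line sketch assuming NumPy and an LSP vector ordered by increasing frequency:

```python
import numpy as np

def dal(lsp, M=None):
    """Eq. (5): DAL_i = l_{i+1} - l_i for i = 1..M (M < N; default M = N-1)."""
    d = np.diff(np.asarray(lsp, dtype=float))   # adjacent differences
    return d if M is None else d[:M]
```

A 12-order LSP vector thus yields the 11-order DAL vector used in the embodiment.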
3) Performing frequency warping on Line Spectrum Pair parameters of an input voice signal to obtain Bark Line Spectrum Pair (BLSP) parameters of the input voice signal;
the frequency warping adopts the following formula:
Bark=26.81/(1+(1960/f))-0.53 (6)
wherein Bark represents Bark frequency; f represents a linear frequency.
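Equation (6) in code; a sketch that additionally assumes the LSFs are first converted from radians to linear frequency in Hz at the 8 kHz sampling rate used in the embodiment (the helper `lsf_to_blsp` and its conversion step are this sketch's assumption, not stated in the source):

```python
import numpy as np

def hz_to_bark(f_hz):
    """Eq. (6): Bark = 26.81 / (1 + 1960/f) - 0.53 (Traunmueller's approximation)."""
    f_hz = np.asarray(f_hz, dtype=float)
    return 26.81 / (1.0 + 1960.0 / f_hz) - 0.53

def lsf_to_blsp(lsf_rad, fs=8000.0):
    """Warp LSFs (radians) to the Bark scale: omega -> f = omega * fs / (2*pi)."""
    return hz_to_bark(np.asarray(lsf_rad) * fs / (2.0 * np.pi))
```

The warp is monotone, so the ordering of the line spectrum pair parameters is preserved while the low-frequency band is expanded relative to the high-frequency band.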
4) According to the adjacent differential line spectrum pair parameters, performing feature enhancement on the Bark line spectrum pair parameters of the input voice signal to obtain the Enhanced Bark Line Spectrum Pair (E-BLSP) parameters. The j-th order Bark line spectrum pair parameter is adjusted in a bidirectional iterative manner according to the adjacent differential line spectrum pair parameters, where j = 2, ..., N-1; the original Bark line spectrum pair parameter is updated directly after adjustment, and the adjusted j-th order parameter is used when adjusting the next order. Let the Bark line spectrum pair parameters of the current frame be {b_1, b_2, ..., b_N}, where N is the maximum order of the line spectrum pair parameters, and let the adjacent differential coefficients of the current frame be b_{i+1} - b_i, i = 1, 2, ..., N-1. The specific iterative formula is as follows:
c_i = η(b_{i+1} - b_i),  η < 1,  i = 2, 3, ..., N-1   (8)
(1) forward iteration: adjusting the j-th order Bark line spectrum pair parameters forward from j = 2 to j = N-1;
(2) backward iteration: adjusting the j-th order Bark line spectrum pair parameters backward from j = N-1 to j = 2;
(3) averaging: averaging the Bark line spectrum pair parameters obtained by the forward and backward iterations to obtain the enhanced Bark line spectrum pair parameters;
where η controls the degree of formant enhancement; the smaller η is, the more obvious the enhancement effect.
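The bidirectional enhancement can be sketched as follows. The source does not reproduce the per-order update rule (only eq. (8) for c_i survives), so this sketch assumes each adjusted parameter is pulled toward its already-adjusted neighbour so that the adjacent gap shrinks by the factor η, which is consistent with the statement that a smaller η gives a stronger enhancement:

```python
import numpy as np

def eblsp(b, eta=0.4):
    """Bidirectional E-BLSP sketch (assumed update rule; eq. (7) is not
    given in the source). Forward pass j = 2..N-1, backward pass
    j = N-1..2, then average the two results."""
    b = np.asarray(b, dtype=float)
    n = len(b)
    fwd = b.copy()
    for j in range(1, n - 1):                 # forward: j = 2..N-1 (1-based)
        fwd[j] = fwd[j - 1] + eta * (fwd[j] - fwd[j - 1])
    bwd = b.copy()
    for j in range(n - 2, 0, -1):             # backward: j = N-1..2
        bwd[j] = bwd[j + 1] - eta * (bwd[j + 1] - bwd[j])
    out = b.copy()
    out[1:n - 1] = 0.5 * (fwd[1:n - 1] + bwd[1:n - 1])
    return out
```

Under this assumed rule the first and last parameters are left untouched and the ordering of the Bark line spectrum pair parameters is preserved, so the enhanced vector remains a valid parameter set.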
5) Inputting the enhanced Bark line spectrum pair parameters of the input voice signal into a deep neural network classifier to identify multiple pathological unit tones. First, 75% of each unit tone data set in the SVD pathological voice database is randomly selected as the training set and 25% as the test set, so that each class of voice data is evenly distributed across the training and test phases of the classification network. The 12-order enhanced Bark line spectrum pair parameters of the 6 unit tones (vocal polyp pathological voice /a/, /i/, /u/ and normal voice /a/, /i/, /u/) are then input into the deep neural network for recognition, with the network parameters set as follows: 2 hidden layers, 100 neurons per layer, the ReLU function as activation function, and a Softmax function in the last layer of the recognition model to turn the network output into a probability distribution and thus refine the classification result.
Specific examples are given below:
1. Preprocessing: each frame in the framing processing is 30 ms long; at a sampling frequency of 8 kHz, the corresponding frame length is 240 samples and the frame shift is 80 samples
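The preprocessing numbers translate directly into code; a minimal sketch assuming NumPy, with DC removal done by mean subtraction over the whole signal:

```python
import numpy as np

def preprocess(x, frame_len=240, frame_shift=80):
    """DC removal and framing: 30 ms frames (240 samples at 8 kHz),
    80-sample shift; returns an (n_frames, frame_len) array."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                                        # remove DC offset
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len) + frame_shift * np.arange(n_frames)[:, None]
    return x[idx]
```

For a 0.1 s signal (800 samples) this produces 1 + (800 - 240)/80 = 8 frames.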
2. Calculation of the linear prediction coefficients with p = 12
3. The linear prediction inverse filter system function A(z) is obtained from the linear prediction coefficients
4. The (p+1)-order symmetric and antisymmetric polynomials P(z) and Q(z) are calculated from A(z)
5. The 12-order LSP parameters are calculated from P(z) and Q(z)
6. The 11-order DAL (Difference of Adjacent LSP) parameters of the input voice signal are computed from the 12-order LSP parameters
7. Frequency warping is performed on the LSP parameters of the input voice signal to obtain its BLSP (Bark Line Spectrum Pair) parameters
Figs. 2a to 2f are box plots of the DAL parameters of the 6 unit tone signals in the embodiment of the present invention. FIG. 2a shows the 11-order DAL parameter box plot of the normal unit tone /a/; FIG. 2b that of the pathological unit tone /a/; FIG. 2c that of the normal unit tone /i/; FIG. 2d that of the pathological unit tone /i/; FIG. 2e that of the normal unit tone /u/; FIG. 2f that of the pathological unit tone /u/.
As can be seen from Figs. 2a to 2f, for the three normal unit tones /a/, /i/, /u/, the boxes of the first 7 orders of DAL data differ substantially, giving good discrimination between the three tones; for the three pathological unit tone signals /a/, /i/, /u/, the first 7 orders of DAL data are distributed more uniformly than for normal voice. For pathological /a/, the last 4 orders of DAL parameters are distributed entirely differently from normal /a/, while for pathological /i/ and /u/ the last 4 orders overlap considerably and discriminate poorly. Because the low-order DAL parameters correspond to the low-frequency part of the signal, the discrimination of the low-frequency band of the DAL parameters is higher than that of the high-frequency band, and the Bark domain reflects the human ear's perception of the signal more faithfully; the embodiment therefore applies the Bark-domain scale to nonlinearly warp the extracted LSP parameters into the BLSP parameters, using the warping function:
Bark=26.81/(1+(1960/f))-0.53 (6)
wherein Bark represents Bark frequency; f represents a linear frequency.
Figs. 3a and 3b show the 12-order LSP parameters and the 12-order BLSP parameters of an embodiment of the present invention. Compared with Fig. 3a, Fig. 3b amplifies the low-frequency part of the signal and compresses the high-frequency part, improving the discrimination between normal and pathological vowels.
8. Feature enhancement is performed on the BLSP parameters of the input voice signal to obtain the E-BLSP (Enhanced Bark Line Spectrum Pair) parameters: η controls the degree of formant enhancement, and the smaller η is, the more obvious the enhancement effect. In the embodiment of the invention, η = 0.4 is used.
Figs. 4a and 4b are three-dimensional spectrum diagrams of the 12-order BLSP parameters and the 12-order E-BLSP parameters of the embodiment. Compared with Fig. 4a, the amplitude in Fig. 4b is greatly raised at the formant frequencies while spectral broadening is suppressed, greatly enhancing the discrimination between normal and pathological vowels.
9. The E-BLSP parameters of the input voice signal are input into the DNN classifier to recognize multiple pathological unit tones
The embodiment first randomly selects 75% of each unit tone data set as the training set and 25% as the test set, so that each class of voice data is evenly distributed across the training and test phases of the classification network, and then inputs the 12-order E-BLSP parameters of the 6 classes of unit tones into a Deep Neural Network (DNN) for recognition. The network parameters are set as follows: 2 hidden layers, 100 neurons per layer.
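The classifier topology described above (two hidden layers of 100 ReLU neurons, Softmax output over the 6 unit-tone classes) can be sketched as a forward pass. The weights below are random placeholders, since training is not detailed in the source; only the shapes and activations follow the text:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def dnn_forward(x, params):
    """Forward pass: 12-order E-BLSP input -> 100 -> 100 -> 6-class softmax."""
    h = relu(x @ params["W1"] + params["b1"])
    h = relu(h @ params["W2"] + params["b2"])
    return softmax(h @ params["W3"] + params["b3"])

# Placeholder weights with the shapes implied by the text.
rng = np.random.default_rng(0)
params = {
    "W1": 0.1 * rng.standard_normal((12, 100)),  "b1": np.zeros(100),
    "W2": 0.1 * rng.standard_normal((100, 100)), "b2": np.zeros(100),
    "W3": 0.1 * rng.standard_normal((100, 6)),   "b3": np.zeros(6),
}
probs = dnn_forward(rng.standard_normal((4, 12)), params)
```

Each output row is a probability distribution over the 6 unit-tone classes, matching the Softmax description in the text.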
In selecting the unit tone source signals, the embodiment uses the Saarbruecken Voice Database (SVD), a pathological voice database recorded by the Institute of Phonetics at Saarland University. The database contains normal and various pathological voice signals of the sustained vowels /a/, /i/ and /u/, with a uniform sampling rate of 50 kHz and 16-bit resolution. The three sustained vowels /a/, /i/, /u/ of vocal polyp pathological voice and of normal voice are selected, and the sampling rate is uniformly reduced to 8 kHz. Each class has 180 voice samples in total, covering 4 different pitches (normal, low, high, low-high-low).
The evaluation in the embodiment mainly uses two indices: accuracy and AUC. Accuracy is defined as the percentage of cases classified correctly. The ROC (Receiver Operating Characteristic) curve is a comprehensive index reflecting sensitivity and specificity as continuous variables, and it reveals the relationship between them. AUC (Area Under Curve) is defined as the area enclosed between the ROC curve and the coordinate axes; its value ranges from 0.5 to 1, and the larger the AUC, the better the classification effect. To ensure the accuracy and generality of the experiment, each feature-combination experiment is run 10 times and the average is taken as the final classification result.
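Both evaluation indices can be computed directly; a sketch assuming NumPy, with the AUC obtained from the Mann-Whitney statistic, which equals the area under the ROC curve:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Percentage of cases classified correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def binary_auc(y_true, scores):
    """AUC via the Mann-Whitney statistic: the probability that a random
    positive sample is scored above a random negative one (ties count 1/2)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

For the 6-class experiment in the text, a per-class AUC could be computed one-vs-rest with each class in turn treated as the positive label; the averaging scheme used in the source is not specified.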
As can be seen from Table 1, the recognition rate of the proposed feature on multiple pathological unit tones is higher than that of the conventional MFCC and LPCC features. The highest accuracy reaches 97.36%, and the AUC reaches 0.9894.
TABLE 1
Claims (5)
1. An improved method for recognizing multiple pathological unit tones, comprising the steps of:
1) calculating line spectrum pair parameters of an input voice signal;
2) calculating adjacent differential line spectrum pair parameters of the input voice signal;
3) performing frequency warping on the line spectrum pair parameters of the input voice signal to obtain the bark line spectrum pair parameters of the input voice signal; the frequency warping adopts the following formula:
Bark=26.81/(1+(1960/f))-0.53 (6)
wherein Bark represents Bark frequency; f represents a linear frequency;
4) according to the adjacent differential line spectrum pair parameters, performing feature enhancement on the Bark line spectrum pair parameters of the input voice signal to obtain enhanced Bark line spectrum pair parameters;
5) inputting the enhanced Bark line spectrum pair parameters of the input voice signal into a deep neural network classifier to identify multiple pathological unit tones.
2. An improved multiple pathology unit tone identification method according to claim 1, wherein step 1) comprises:
(1.1) performing signal preprocessing, including direct current removing processing and framing processing;
(1.2) for each frame of the voice signal, calculating the 12-order linear prediction coefficients a_i by the Levinson-Durbin autocorrelation algorithm according to the set model order p = 12;
(1.3) from the linear prediction coefficients a_i calculated in (1.2), the linear prediction inverse filter system function is computed as follows:
A(z) = 1 - Σ_{i=1}^{p} a_i z^(-i)   (1)
where A(z) represents the linear prediction inverse filter system function; p represents the model order; a_i represents the linear prediction coefficients;
(1.4) calculating the (p+1)-order symmetric and antisymmetric polynomials of the linear prediction inverse filter system function A(z):
P(z) = A(z) + z^(-(p+1)) A(z^(-1))   (2)
where P(z) represents the (p+1)-order symmetric polynomial of A(z), A(z) represents the linear prediction inverse filter system function, and p represents the model order;
Q(z) = A(z) - z^(-(p+1)) A(z^(-1))   (3)
where Q(z) represents the (p+1)-order antisymmetric polynomial of A(z), A(z) represents the linear prediction inverse filter system function, and p represents the model order;
(1.5) calculating the 12-order line spectrum pair parameters of the input speech signal from P(z) and Q(z):
|H(e^(jω))|^2 = 1 / |A(e^(jω))|^2 = 1 / { 2^p [ cos^2(ω/2) Π_{i=1}^{p/2} (cos ω - cos θ_i)^2 + sin^2(ω/2) Π_{i=1}^{p/2} (cos ω - cos ω_i)^2 ] }   (4)
where H(e^(jω)) is the linear prediction spectral amplitude, e^(jω) is the frequency-domain representation of z, P(e^(jω)) is the (p+1)-order symmetric polynomial of A(e^(jω)), Q(e^(jω)) is the (p+1)-order antisymmetric polynomial of A(e^(jω)), cos θ_i and cos ω_i are the representations of the LSP coefficients in the cosine domain, θ_i and ω_i are the line spectrum frequencies corresponding to the line spectrum pair coefficients of the input voice signal, and Π is the cumulative product sign.
3. An improved method for multiple pathological unit tone recognition according to claim 1, wherein step 2) is calculated according to the following formula:
DAL_i = l_{i+1} - l_i,  i = 1, 2, ..., M  (M < N)   (5)
where DAL_i is the i-th order adjacent differential line spectrum pair parameter, l_{i+1} is the (i+1)-th order line spectrum pair parameter, l_i is the i-th order line spectrum pair parameter, M is the maximum order of the adjacent differential line spectrum pair parameters, and N is the maximum order of the line spectrum pair parameters.
4. The improved multiple pathological unit tone identification method according to claim 1, wherein in step 4), the j-th-order Bark line spectrum pair parameter is adjusted by bidirectional iteration according to the adjacent differential line spectrum pair parameters, j = 2, 3, ..., N-1; the Bark line spectrum pair parameters of the current frame are {b_1, b_2, ..., b_N}, where N is the maximum order of the line spectrum pair parameters, and the adjacent differential line spectrum pair coefficients of the current frame are b_(i+1) - b_i, i = 1, 2, ..., N-1; the specific iterative formula is as follows:
c_i = η(b_(i+1) - b_i),  η < 1,  i = 2, 3, ..., N-1 (8)
(1) forward iteration: the j-th-order Bark line spectrum pair parameters are adjusted forward from j = 2 to j = N-1;
(2) backward iteration: the j-th-order Bark line spectrum pair parameters are adjusted backward from j = N-1 to j = 2;
(3) averaging: the Bark line spectrum pair parameters obtained from the forward and backward iterations are averaged to obtain the enhanced Bark line spectrum pair parameters;
in the formula, η controls the degree of formant enhancement; the smaller η is, the more pronounced the enhancement effect.
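The claim gives only the scaled-difference coefficient c_i = η(b_{i+1} - b_i), not the full update rule, so the sketch below is one plausible reading, not the patent's definitive procedure: each pass rescales the gaps between adjacent Bark-LSP parameters by η < 1 (forward from the low end, backward from the high end), the endpoints stay fixed, and the two passes are averaged:

```python
import numpy as np

def enhance_bark_lsp(b, eta=0.5):
    """Bidirectional gap-rescaling enhancement (one plausible reading).

    b   -- Bark line spectrum pair parameters {b_1, ..., b_N}, ascending.
    eta -- enhancement factor, eta < 1; smaller eta -> stronger effect.
    """
    b = np.asarray(b, dtype=float)
    n = len(b)
    fwd = b.copy()
    for j in range(1, n - 1):          # forward pass, j = 2 .. N-1 (1-based)
        fwd[j] = fwd[j - 1] + eta * (b[j] - b[j - 1])
    bwd = b.copy()
    for j in range(n - 2, 0, -1):      # backward pass, j = N-1 .. 2
        bwd[j] = bwd[j + 1] - eta * (b[j + 1] - b[j])
    return (fwd + bwd) / 2.0           # averaging step
```

Under this reading the output keeps the original ordering and endpoints while pulling interior parameters toward their nearer neighbour, which narrows the gaps that mark formants.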
5. The improved multiple pathological unit tone recognition method according to claim 1, wherein in step 5), 75% of each unit tone data set in the SVD pathological voice database is randomly selected as the training set and 25% as the test set, so that each class of voice data is evenly distributed across the training and testing phases of the classification network; the 12th-order enhanced Bark line spectrum pair parameters of the six pathological unit tones /a/, /i/, /u/ are then input to the deep neural network for recognition, with the network parameters set as follows: 2 hidden layers with 100 neurons each, ReLU as the activation function, and a Softmax function in the last layer of the recognition model to turn the network output into a probability distribution and then optimize the classification result.
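The claimed classifier topology and data split can be sketched as follows. This is a forward-pass skeleton only, under stated assumptions: the class names, the 12-dimensional input, the 6 output classes, and the random weight initialization are illustrative; the patent does not specify the training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_test_split_75_25(n_samples):
    """Random 75% / 25% index split, as described in claim 5."""
    idx = rng.permutation(n_samples)
    cut = int(round(0.75 * n_samples))
    return idx[:cut], idx[cut:]

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class RecognitionMLP:
    """Claimed topology: 2 hidden layers of 100 ReLU neurons each,
    Softmax output turning the network output into a probability
    distribution (6 unit-tone classes assumed)."""

    def __init__(self, d_in=12, d_hidden=100, n_classes=6):
        s = 0.1  # illustrative init scale
        self.W1 = rng.standard_normal((d_in, d_hidden)) * s
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.standard_normal((d_hidden, d_hidden)) * s
        self.b2 = np.zeros(d_hidden)
        self.W3 = rng.standard_normal((d_hidden, n_classes)) * s
        self.b3 = np.zeros(n_classes)

    def forward(self, x):
        h = relu(x @ self.W1 + self.b1)     # hidden layer 1
        h = relu(h @ self.W2 + self.b2)     # hidden layer 2
        return softmax(h @ self.W3 + self.b3)  # class probabilities
```

Each row of the input would be one frame's 12th-order enhanced Bark-LSP vector; training (e.g. cross-entropy with backpropagation) is left out since the claim does not specify it.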
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910233952.0A CN110070894B (en) | 2019-03-26 | 2019-03-26 | Improved method for identifying multiple pathological unit tones |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110070894A CN110070894A (en) | 2019-07-30 |
CN110070894B true CN110070894B (en) | 2021-08-03 |
Family
ID=67366671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910233952.0A Active CN110070894B (en) | 2019-03-26 | 2019-03-26 | Improved method for identifying multiple pathological unit tones |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110070894B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0774750A2 (en) * | 1995-11-15 | 1997-05-21 | Nokia Mobile Phones Ltd. | Determination of line spectrum frequencies for use in a radiotelephone |
US20040042622A1 (en) * | 2002-08-29 | 2004-03-04 | Mutsumi Saito | Speech Processing apparatus and mobile communication terminal |
US7257535B2 (en) * | 1999-07-26 | 2007-08-14 | Lucent Technologies Inc. | Parametric speech codec for representing synthetic speech in the presence of background noise |
CN103730130A (en) * | 2013-12-20 | 2014-04-16 | 中国科学院深圳先进技术研究院 | Detection method and system for pathological voice |
CN106710604A (en) * | 2016-12-07 | 2017-05-24 | 天津大学 | Formant enhancement apparatus and method for improving speech intelligibility |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101527141B (en) * | 2009-03-10 | 2011-06-22 | 苏州大学 | Method of converting whispered voice into normal voice based on radial group neutral network |
CN107705801B (en) * | 2016-08-05 | 2020-10-02 | 中国科学院自动化研究所 | Training method of voice bandwidth extension model and voice bandwidth extension method |
- 2019-03-26 CN CN201910233952.0A patent/CN110070894B/en active Active
Non-Patent Citations (5)
Title |
---|
"Quality-enhanced voice morphing using maximum likelihood transformations"; Hui Ye et al.; IEEE Transactions on Audio, Speech, and Language Processing (Volume 14, Issue 4, July 2006); 20061231; full text *
"Voice Pathology Detection Using Vocal Tract Area"; Ghulam Muhammad et al.; IEEE 2013 European Modelling Symposium; 20141231; full text *
"Application of Voice Analysis in Disease Diagnosis" (in Chinese); Peng Ce et al.; Journal of Biomedical Engineering; 20071231; full text *
"Formant Restoration of Pathological Voice Using an Improved Artificial Neural Network" (in Chinese); Xue Longji et al.; Chinese Journal of Electron Devices; 20190228; full text *
"Pathological Voice Formant Correction Using Piecewise Fixed-Value Offset of Line Spectrum Pairs" (in Chinese); Zhou Jiaqin et al.; Informatization Research; 20160430; full text *
Also Published As
Publication number | Publication date |
---|---|
CN110070894A (en) | 2019-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113012720B (en) | Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction | |
CN107274888B (en) | Emotional voice recognition method based on octave signal strength and differentiated feature subset | |
CN104050965A (en) | English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof | |
CN110827857B (en) | Speech emotion recognition method based on spectral features and ELM | |
Pawar et al. | Review of various stages in speaker recognition system, performance measures and recognition toolkits | |
Zhang et al. | Multiple vowels repair based on pitch extraction and line spectrum pair feature for voice disorder | |
Ling et al. | Attention-Based Convolutional Neural Network for ASV Spoofing Detection. | |
CN106782599A (en) | The phonetics transfer method of post filtering is exported based on Gaussian process | |
Ghezaiel et al. | Hybrid network for end-to-end text-independent speaker identification | |
Nawas et al. | Speaker recognition using random forest | |
Cheng et al. | DNN-based speech enhancement with self-attention on feature dimension | |
Woubie et al. | Voice-quality Features for Deep Neural Network Based Speaker Verification Systems | |
Karthikeyan | Adaptive boosted random forest-support vector machine based classification scheme for speaker identification | |
CN110070894B (en) | Improved method for identifying multiple pathological unit tones | |
Singh et al. | Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition | |
CN112466276A (en) | Speech synthesis system training method and device and readable storage medium | |
Nasr et al. | Text-independent speaker recognition using deep neural networks | |
Boualoulou et al. | Speech analysis for the detection of Parkinson’s disease by combined use of empirical mode decomposition, Mel frequency cepstral coefficients, and the K-nearest neighbor classifier | |
CN105741853A (en) | Digital speech perception hash method based on formant frequency | |
Neto et al. | Feature estimation for vocal fold edema detection using short-term cepstral analysis | |
Velayuthapandian et al. | A focus module-based lightweight end-to-end CNN framework for voiceprint recognition | |
Zailan et al. | Comparative analysis of LPC and MFCC for male speaker recognition in text-independent context | |
Thirumuru et al. | Application of non-negative frequency-weighted energy operator for vowel region detection | |
Zi et al. | Joint filter combination-based central difference feature extraction and attention-enhanced Dense-Res2Block network for short-utterance speaker recognition | |
CN113299295A (en) | Training method and device for voiceprint coding network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||