CN112349297B - Depression detection method based on microphone array - Google Patents
- Publication number
- CN112349297B CN112349297B CN202011248610.5A CN202011248610A CN112349297B CN 112349297 B CN112349297 B CN 112349297B CN 202011248610 A CN202011248610 A CN 202011248610A CN 112349297 B CN112349297 B CN 112349297B
- Authority
- CN
- China
- Prior art keywords
- training
- voice
- neural network
- convolutional neural
- depression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times ; Devices for evaluating the psychological state
- A61B5/165—Evaluating the state of mind, e.g. depression, anxiety
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/48—Other medical applications
- A61B5/4803—Speech analysis specially adapted for diagnostic purposes
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7264—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
- A61B5/7267—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a depression detection method based on a microphone array. The method collects voice signals of a target patient with a microphone array and preprocesses them; extracts MFCC (Mel-frequency cepstral coefficient) features from the preprocessed audio of the target patient and from voice data of known depression patients, and generates audio spectrograms; feeds the MFCC features into a 1D convolutional neural network to obtain P-dimensional MFCC features; feeds the spectrograms into a 2D convolutional neural network to obtain O-dimensional spectrogram features; inputs the O-dimensional features into a generative adversarial network to generate new spectrogram images, which are fed back into the 2D convolutional neural network for training; fuses the P-dimensional MFCC features with the features obtained from training and reduces their dimension through a fully connected layer; trains a classifier on the dimension-reduced features; and uses the trained classifier to recognize test voice and obtain the recognition result. The invention improves the accuracy of depression recognition in non-experimental environments.
Description
Technical Field
The invention belongs to the technical field of voice recognition methods, and particularly relates to a depression detection method based on a microphone array.
Background
At present, voice signals have shown promise in the field of depression detection, but diagnosis still requires the patient to record speech in front of a fixed acquisition device and relies mainly on a clinician's judgment. Common instruments include the Beck Depression Inventory (BDI) and the Hamilton Depression Rating Scale (HAMD), so the diagnostic result depends on the doctor's experience and skill and, more importantly, on the patient's cooperation. Consequently, much of the speech collected during clinical examination is formulaic and mechanical, which can make the collected voice samples inaccurate. A detection device should therefore be able to collect the patient's voice in the natural conditions of daily life while removing background noise.
A microphone array, composed of a number of acoustic sensors, is a system for spatially sampling and processing a sound field. In complex acoustic environments, noise comes from all directions and often overlaps with the speech signal in both time and frequency; combined with echo and reverberation, this makes it very difficult to capture relatively clean speech with a single microphone. A microphone array fuses the spatio-temporal information of the speech signals and can extract sound sources while simultaneously suppressing noise.
A convolutional neural network (CNN) is one of the deep learning algorithms established in recent years and offers good classification performance for large-scale image processing. The greatest advantage of a generative adversarial network (GAN) is that it alleviates the experimental problem of insufficient sample data: an appropriately constructed network model generates realistic synthetic samples, which can effectively aid the diagnosis and prediction of medical conditions and provide additional diagnostic evidence for medical research.
Combining the microphone array's ability to capture clean sound signals with the strengths of the two deep learning methods, GAN and CNN, improves the accuracy of depression recognition.
Disclosure of Invention
The invention aims to provide a depression detection method based on a microphone array, which improves the accuracy of depression identification.
The technical scheme adopted by the invention is as follows: a depression detection method based on a microphone array, comprising the steps of:
step 1, a microphone array is used for collecting voice signals of a target patient and preprocessing the voice signals;
step 2, extracting MFCC features from the audio signal of the target patient preprocessed in step 1 and from the voice data of existing depression patients, and generating an audio spectrogram;
step 3, sending the MFCC features extracted in step 2 into a 1D convolutional neural network to obtain the P-dimensional features of the MFCC;
step 4, sending the audio spectrogram generated in step 2 into a 2D convolutional neural network to obtain the O-dimensional features of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectrogram images, and feeding the generated images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional MFCC features extracted in step 3 with the features obtained by training in step 5, and reducing the dimension through a fully connected layer;
step 7, training a classifier on the dimension-reduced features obtained in step 6;
step 8, recognizing the test voice with the classifier trained in step 7 to obtain the recognition result.
The present invention is also characterized in that,
the step 1 specifically comprises the following steps:
step 1.1, collecting a voice signal of a target patient through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient; transforming the signal from the time domain to the frequency domain by fast Fourier transform; estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum, and outputting the spectral-subtracted signal; finally, computing the energy-entropy ratio to detect the target patient's voice signal and obtain the endpoint values of the speech;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA positioning method;
step 1.4, passing the endpoint-detected and source-localized voice signals through a superdirective beamforming algorithm to synthesize the four channel signals into one, thereby achieving synthesis, noise reduction and enhancement of the microphone array signals.
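As a rough illustration of the capture-align-combine pipeline of steps 1.1-1.4 (not the patent's superdirective beamformer), the following sketch aligns four channels with a plain delay-and-sum and averages them; the per-channel delays are hypothetical stand-ins for the DOA estimate of step 1.3:

```python
import numpy as np

def combine_channels(channels, delays):
    """Align the four microphone channels by integer-sample delays and
    average them into one enhanced signal.  Plain delay-and-sum: a
    simplified stand-in for the superdirective beamformer of step 1.4;
    the delays would come from the DOA estimate of step 1.3."""
    n = min(len(c) - d for c, d in zip(channels, delays))
    aligned = np.stack([c[d:d + n] for c, d in zip(channels, delays)])
    return aligned.mean(axis=0)

# four noisy copies of one pulse, each arriving with a known extra delay
rng = np.random.default_rng(0)
pulse = np.zeros(64)
pulse[10] = 1.0
delays = [0, 2, 4, 6]
channels = [np.concatenate([np.zeros(d), pulse])
            + 0.01 * rng.standard_normal(64 + d)
            for d in delays]
enhanced = combine_channels(channels, delays)
```

After alignment the pulse adds coherently while the uncorrelated noise averages down, which is the basic enhancement effect the array provides.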
The step 2 specifically comprises the following steps:
step 2.1, first dividing the voice signal into frames with a Hamming window function; then generating a cepstral feature vector by computing the discrete Fourier transform of each frame and keeping only the logarithm of the magnitude spectrum; after smoothing the spectrum, collecting 24 spectral components in the Mel frequency range for 44100 Hz audio; applying a Karhunen-Loeve transform, approximated by the discrete cosine transform; finally obtaining the cepstral features [f_1, f_2, ..., f_N] for each frame;
step 2.2, framing and windowing the voice signal of the target patient according to the set frame count, performing a short-time Fourier transform on the discrete voice signal x(m), and computing the power spectrum of the m-th frame to obtain a spectrogram; selecting L filters, selecting L frames of the same size as the filters along the time direction, generating an L×L×3 spectrogram, and resizing the resulting color image to M×M×3.
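A minimal numpy sketch of the framing, Hamming windowing and per-frame power spectrum underlying steps 2.1-2.2 (frame length, hop and sampling rate are illustrative values, not ones specified by the patent):

```python
import numpy as np

def frames_power_spectrum(x, frame_len=256, hop=128):
    """Hamming-window each frame and return its power spectrum,
    mirroring step 2.2: framing, windowing, short-time Fourier
    transform, power spectrum.  Parameter values are illustrative."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)        # short-time FT per frame
    return (np.abs(spec) ** 2) / frame_len    # power spectrum

x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)  # 440 Hz test tone
P = frames_power_spectrum(x)
```

Stacking the per-frame power spectra over time is exactly the spectrogram that is then resized and fed to the 2D network.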
The 1D convolutional neural network of step 3 is built with the open-source TensorFlow-based Keras framework and contains only two 1D convolution layers, each using rectified linear units as the activation function. The input dimension is M×1; the data pass through w_1 convolution filters of size m×1 with a dropout of 0.1 and a max-pooling stride of q_1, outputting a feature vector of size S. In the training stage of the 1D convolutional neural network, the MFCC features containing the time-frequency information of each speech frame are read into memory sequentially by traversal, split into a training set and a test set, and each set is labeled; the processed data are then fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations.
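The arithmetic such a layer performs (1D convolution, ReLU activation, max-pooling) can be sketched in numpy; the single difference filter below is purely illustrative, not the patent's trained weights:

```python
import numpy as np

def conv1d_relu_maxpool(x, kernels, pool=2):
    """One 1D convolution layer with ReLU activation followed by
    max-pooling, as applied to a per-frame MFCC vector (step 3).
    `kernels` has shape (n_filters, k); sizes are illustrative."""
    n_filters, k = kernels.shape
    n_out = len(x) - k + 1
    # flip each kernel so np.convolve performs the cross-correlation
    # that CNN "convolution" layers actually compute
    conv = np.stack([np.convolve(x, kern[::-1], mode="valid")
                     for kern in kernels])          # (n_filters, n_out)
    act = np.maximum(conv, 0.0)                     # ReLU
    trimmed = act[:, : (n_out // pool) * pool]
    return trimmed.reshape(n_filters, -1, pool).max(axis=2)

x = np.arange(10, dtype=float)           # stand-in MFCC vector
kernels = np.array([[-1.0, 1.0]])        # one difference filter
feat = conv1d_relu_maxpool(x, kernels)
```

On the ramp input every local difference is 1, so the output is a constant feature map; a trained network would learn w_1 such filters instead.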
The 2D convolutional neural network of step 4 is built with the open-source TensorFlow-based Keras framework and contains w_2 two-dimensional convolution layers of size n×n, w_1 max-pooling layers and 1 fully connected layer with output dimension L; rectified linear units are used as the activation function in the convolution and fully connected layers. In the training stage, the spectrogram features containing the texture-like information of each speech frame are read into memory sequentially by traversal, split into a training set and a test set, and each set is labeled; the processed data are then fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations. The network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
The generative adversarial network of step 5 comprises a generator and a discriminator. The generator network consists of 1 fully connected layer, 3 transposed convolution layers and 2 batch normalization layers, and outputs a color picture of size M×M×3. The discriminator is a 7-layer convolutional neural network model consisting of 3 convolution layers, 2 batch normalization layers and 2 fully connected layers with a softmax function, and finally outputs a probability value. A probability threshold λ is set; when, after repeated training, the probability value produced by the discriminator exceeds λ, the spectrogram produced by the generator is saved.
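The acceptance rule of step 5 (keep a generated spectrogram only when the discriminator's probability exceeds λ) can be sketched with a stub discriminator standing in for the trained 7-layer model:

```python
import numpy as np

def filter_generated(images, discriminator, lam=0.8):
    """Keep only the generated spectrograms the discriminator scores
    above the probability threshold lambda (step 5).  `discriminator`
    here is any callable returning a probability; in the patent it is
    the trained DCGAN-style discriminator."""
    return [img for img in images if discriminator(img) > lam]

# stub discriminator: "probability" = mean pixel intensity (illustrative)
stub_d = lambda img: float(img.mean())
imgs = [np.full((4, 4, 3), 0.9), np.full((4, 4, 3), 0.3)]
kept = filter_generated(imgs, stub_d, lam=0.8)
```

Only spectrograms passing this check are added back into the 2D network's training data.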
Step 6 is specifically as follows: the P-dimensional MFCC features extracted by the 1D convolutional neural network are fused with the O-dimensional spectrogram features to obtain (P+O)-dimensional features, whose dimension is reduced to 256 through a fully connected layer.
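A sketch of the fusion and dimension reduction of step 6, with assumed sizes P=128 and O=512 and a random, untrained projection standing in for the learned fully connected layer:

```python
import numpy as np

rng = np.random.default_rng(1)
p_feat = rng.standard_normal(128)   # P-dimensional MFCC features (P=128 assumed)
o_feat = rng.standard_normal(512)   # O-dimensional spectrogram features (O=512 assumed)

# fuse by concatenation, then reduce to 256 dims with one FC layer + ReLU
fused = np.concatenate([p_feat, o_feat])             # (P+O,)
W = rng.standard_normal((256, fused.size)) * 0.01    # FC weights (untrained)
b = np.zeros(256)
reduced = np.maximum(W @ fused + b, 0.0)             # 256-d output
```

The 256-dimensional vector is what the classifier of step 7 is trained on.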
The step 7 is specifically as follows:
step 7.1, using the voice of the target patient as test voice and the voice data of existing depression patients as training data; the training data contain the voice information of X persons, whose labels (depressed or not) form a label dictionary in which each label has a corresponding index number, and the index numbers used for classification are set; after a test, the spectrogram generated for the target patient is added to the training data set;
step 7.2, for each label, using the depressed voices as the positive sample set and the non-depressed voices as the negative sample set, and training a two-class SVM with the positive and negative sample sets to obtain the trained two-class SVM.
The beneficial effects of the invention are as follows: the microphone used for voice acquisition is easy to carry and can capture the patient's voice signals in a natural state; building on depression-recognition research that combines CNN and MFCC features with GAN-augmented data sets, the method combines the advantages of MFCC and CNN and improves the accuracy of depression recognition in non-experimental environments.
Drawings
FIG. 1 is a schematic diagram of a method for detecting depression based on a microphone array of the present invention;
FIG. 2 is a schematic diagram of a microphone array used in a method for detecting depression based on a microphone array according to the present invention;
FIG. 3 is a schematic diagram of a CNN model in a microphone array-based depression detection method of the present invention;
fig. 4 is a schematic diagram of GAN model in a method for detecting depression based on a microphone array according to the present invention.
Detailed Description
The invention will be described in detail with reference to the accompanying drawings and detailed description.
The invention provides a depression detection method based on a microphone array, which is shown in fig. 1 to 4 and comprises the following steps:
Step 1: using an annular microphone array enables accurate sound source localization, forms a pickup beam in the direction of the target speaker, suppresses noise and reflected sound, and enhances the sound signal; it can accurately recognize speech at distances of 3-5 m in a noisy environment, meeting the need to capture the patient's daily-life voice signals at any time. Specifically:
step 1.1, collecting a patient voice signal through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient, transforming the signal from the time domain to the frequency domain by fast Fourier transform, estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum, and outputting the spectral-subtracted signal. Finally, the energy-entropy ratio is computed to detect whether the patient's voice signal is present and to obtain the endpoint values of the speech. The energy-entropy ratio is computed as follows:
Let x_i(m) be the signal of the i-th frame, with frame length N. The energy of each frame is
e_i = sum_{m=1}^{N} x_i(m)^2
and the log-energy relation is
E_i = log10(1 + e_i / a)
where a is a constant whose proper adjustment helps distinguish unvoiced segments from noise. The fast Fourier transform of the i-th frame of the voice signal is
X_i(k) = sum_{m=1}^{N} x_i(m) e^{-j 2 pi k m / N}
and the energy spectrum of the frequency component corresponding to the k-th spectral line is
Y_i(k) = |X_i(k)|^2.
The normalized spectral probability density is
p_i(k) = Y_i(k) / sum_{l=1}^{N/2+1} Y_i(l)
and the short-term spectral entropy of the speech frame is
H_i = - sum_{k=1}^{N/2+1} p_i(k) log p_i(k).
The energy-entropy ratio EH_i is the ratio of energy to spectral entropy:
EH_i = E_i / H_i.
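The energy-entropy-ratio computation above can be sketched directly in numpy; the test signals and the constant a are illustrative:

```python
import numpy as np

def energy_entropy_ratio(frame, a=2.0):
    """Per-frame energy-to-entropy ratio used for endpoint detection
    (step 1.2), following the formulas above: log-energy divided by
    short-term spectral entropy.  The constant `a` is illustrative."""
    e = np.sum(frame ** 2)                      # frame energy e_i
    E = np.log10(1.0 + e / a)                   # log-energy E_i
    Y = np.abs(np.fft.rfft(frame)) ** 2         # energy spectrum Y_i(k)
    p = Y / np.sum(Y)                           # normalized spectral prob.
    p = p[p > 0]
    H = -np.sum(p * np.log(p))                  # short-term spectral entropy
    return E / H

fs, N = 16000, 256
t = np.arange(N) / fs
voiced = np.sin(2 * np.pi * 200 * t)                        # tonal "speech"
noise = 0.1 * np.random.default_rng(0).standard_normal(N)   # low-level noise
eh_voiced = energy_entropy_ratio(voiced)
eh_noise = energy_entropy_ratio(noise)
```

Speech frames have high energy and a peaked (low-entropy) spectrum, so their ratio is large; noise frames score low, which is what makes the ratio usable for endpoint detection.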
step 1.3, combining the endpoint detection result, the DOA localization method is used to judge the position of the sound source signal, described here for one frame of signal data: the voice data are read in and the m-th frame is taken as the processing object; the m-th-frame data of the 4 microphone channels are combined into 1 channel and weighted by W_c(k); then the energy sum E_s over the frequency bands is found for each candidate angle, yielding the energy value E_s(i) corresponding to each of the 360 angles of the current frame, with i ranging from 0 to 360 degrees. Taking the maximum E_smax(i) among the 360 energies and the angle i corresponding to that maximum gives the sound source angle determined by the current frame. The band energy of each frame signal for a given angle is
E_s = sum_{k=f_1}^{f_2} |X_sw(k)|^2
where f_1 and f_2 denote the set band limits within 1 to N/2+1, and X_sw(k) is the band-weighted version of the combined single-channel signal:
X_sw(k) = W_e(k) X_s(k)
in which W_e(k) is the band weighting factor, built from an exponent 0 < λ < 1 and a masking weight factor W(k) that selects, within each band of the current data, the frequency band with the maximum signal-to-noise ratio (SNR).
X_s(k) combines the 4 channel signals into 1:
X_s(k) = sum_{i=1}^{4} X_i(k)
where X_i(k) is one of the 4 channel signals.
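The angle scan of step 1.3 can be sketched as a steered-energy search; the delay table mapping angles to per-microphone sample delays is a hypothetical stand-in for the real quaternary-cross geometry and frequency-domain weighting:

```python
import numpy as np

def scan_doa(channels, delay_table):
    """Steered-energy scan (step 1.3, simplified to the time domain):
    for each candidate angle, align the four channels by that angle's
    delay pattern, sum them, and measure the energy E_s(i); the angle
    with maximum energy wins.  `delay_table` maps angle -> per-mic
    sample delays and is a hypothetical stand-in for the true array
    geometry."""
    best_angle, best_e = None, -1.0
    for angle, delays in delay_table.items():
        n = min(len(c) - d for c, d in zip(channels, delays))
        summed = sum(c[d:d + n] for c, d in zip(channels, delays))
        e = float(np.sum(summed ** 2))
        if e > best_e:
            best_angle, best_e = angle, e
    return best_angle

pulse = np.zeros(64)
pulse[5] = 1.0
true_delays = [0, 3, 6, 9]                  # source "at 90 degrees" (assumed)
channels = [np.concatenate([np.zeros(d), pulse]) for d in true_delays]
table = {0: [0, 0, 0, 0], 90: [0, 3, 6, 9], 180: [9, 6, 3, 0]}
angle = scan_doa(channels, table)
```

Only the delay pattern matching the true propagation aligns the pulses coherently, so its summed energy dominates, mirroring the E_smax(i) selection in the text.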
Step 1.4: the voice signals that have undergone endpoint detection and sound source localization are synthesized from 4 channels into 1 channel by a superdirective beamforming algorithm, thereby achieving synthesis, noise reduction and enhancement of the microphone array signals. The superdirective beamforming algorithm is detailed as follows:
The microphone array of the invention is a quaternary cross array, which can be regarded as a kind of uniform circular array. From the array geometry, the direction-of-arrival steering vector of a signal received from angle θ is
a(θ) = [a_1(θ), ..., a_M(θ)]^T
where each element a_i(θ) = e^{-j 2 pi f τ_i(θ)} carries the delay τ_i(θ) of the i-th microphone relative to the array reference.
The voice environment targeted by the method is mainly indoor daily life, so a noise matrix computed for a diffuse noise field has good applicability to the present microphone environment. The diffuse noise field describes a three-dimensional spherically isotropic noise field, with correlation function
Γ_ij(f) = sinc(2 f d_ij / c)
where sinc(x) = sin(pi x) / (pi x) is the sampling function. The microphone array consists of M elements, and the signal received by the i-th microphone is
x_i(t) = A_i e^{j(2 pi f t + φ_i)}
where f denotes the frequency, A_i the amplitude and φ_i the phase. According to the mathematical-model theory of the optimal "superdirectivity" solution, the correlation coefficient of the noise signals between any two points in space is ρ_ij = Γ_ij(f), and the normalized noise covariance matrix is
R_nn = [ρ_ij] (i, j = 1, 2, ..., M)
where d_ij denotes the distance between any two array elements of the microphone array.
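The normalized diffuse-field noise covariance R_nn can be sketched in numpy with the sinc coherence above; the 4 cm element radius is an illustrative geometry, not a value taken from the patent:

```python
import numpy as np

def diffuse_noise_cov(positions, f, c=343.0):
    """Normalized noise covariance for a spherically diffuse noise
    field: rho_ij = sinc(2 f d_ij / c), with sinc(x) = sin(pi x)/(pi x)
    (np.sinc uses exactly this normalized definition).  `positions`
    are microphone coordinates in metres."""
    pos = np.asarray(positions, dtype=float)
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    return np.sinc(2.0 * f * d / c)

# quaternary cross array, elements 4 cm from the centre (illustrative)
r = 0.04
mics = [(r, 0.0), (0.0, r), (-r, 0.0), (0.0, -r)]
R = diffuse_noise_cov(mics, f=1000.0)
```

The diagonal is 1 (each element is fully coherent with itself) and coherence falls off with element spacing and frequency.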
The invention adopts the minimum variance distortionless response (MVDR) beamforming principle, i.e., the LCMV method under the constraint w^H a(θ) = 1. This approach preserves the signal strength while minimizing the noise variance; in other words, MVDR maximizes the signal-to-noise ratio (SNR) of the array output signal. The goal is to choose filter coefficients w that minimize the total output power under the constraint that the speech signal is not distorted, so the key is to solve for the optimal weight vector w. The constrained problem is
min_w w^H R_x w subject to w^H a(θ_s) = 1
where a(θ_s) = [a_1(θ), ..., a_M(θ)]^T is the target steering vector, representing the transfer function between the source direction and the microphones, computable from the delay times τ; R_x is the spatial signal covariance matrix, which, when k mutually temporally uncorrelated noise signals arrive at the array elements from different directions, is defined as
R_x = E[x(t) x^H(t)].
Solving with a Lagrange multiplier gives
w = R_x^{-1} a(θ_s) / (a^H(θ_s) R_x^{-1} a(θ_s)).
Replacing the covariance matrix R_x in the MVDR solution above with the normalized noise covariance matrix R_nn obtained earlier yields the superdirective weighting factor
w = R_nn^{-1} a(θ_s) / (a^H(θ_s) R_nn^{-1} a(θ_s)).
The weighted beamforming of the multi-channel microphone is completed with this optimized superdirective weighting coefficient.
Step 2: extracting the MFCC features and generating the audio spectrogram, i.e., simultaneously extracting a time-frequency representation and a texture-like representation of the audio signal:
step 2.1, first dividing the voice signal into frames with a Hamming window function; then generating a cepstral feature vector by computing the discrete Fourier transform of each frame and keeping only the logarithm of the magnitude spectrum; after the spectrum is smoothed, collecting 24 spectral components in the Mel frequency range for 44100 Hz audio. Because the components of the Mel spectral vector computed for each frame are highly correlated, a Karhunen-Loeve (KL) transform is applied, approximated by the discrete cosine transform (DCT), finally yielding the cepstral features [f_1, f_2, ..., f_N] for each frame;
step 2.2, framing and windowing the patient's voice signal according to the set frame count, performing a short-time Fourier transform on the discrete voice signal x(m), and computing the power spectrum of the m-th frame to obtain a spectrogram. To fit the input of the convolutional neural network, L filters are selected and L frames of the same size as the filters are selected along the time direction, generating an L×L×3 spectrogram; the resulting color image is resized to M×M×3.
Step 4: the spectrogram of step 2 is fed into a 2D convolutional neural network to obtain the O-dimensional features of the spectrogram. The 2D convolutional neural network is built with the open-source TensorFlow-based Keras framework as a simplification of the AlexNet architecture, containing w_2 two-dimensional convolution layers of size n×n, w_1 max-pooling layers and 1 fully connected layer with output dimension L; rectified linear units (ReLUs) serve as the activation function in the convolution and fully connected layers. In the training stage, the spectrogram features containing the texture-like information of each speech frame are read into memory sequentially by traversal, split into labeled training and test sets, and fed into the convolutional neural network according to the set labels for iterative training, for a total of B iterations. The network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
And 5, inputting the features obtained in step 4 into an adversarial generation network to generate new spectrum images, adding the generated spectra to the original spectrum data, and then performing the training of step 4 again. The adversarial generation network is a simplified, re-parameterized network based on the DCGAN structure. The network model comprises a generator and a discriminator: the generator network model consists of 1 fully connected layer, 3 transposed convolution layers, and 2 batch normalization layers, and outputs a color picture of size M×M×3, while the discriminator part comprises 3 convolution layers and a fully connected layer with a softmax function; the discriminator network model consists of 3 convolution layers, 2 batch normalization layers, and 2 fully connected layers, using a 7-layer convolutional neural network model, and finally outputs a probability value. A probability threshold λ is set; when the probability value produced by the discriminator after multiple rounds of training is greater than λ, the spectrogram generated by the generator is saved. The generated spectrograms meeting this standard are fed into the convolutional network of step 4 for retraining.
And step 6, fusing the MFCC features extracted in step 3 with the features obtained from the expanded spectrogram data in step 4, and performing dimension reduction through a fully connected layer, specifically: the P-dimensional MFCC features extracted through the CNN are fused with the O-dimensional spectrogram features to obtain (P+O)-dimensional features, whose dimension is reduced to 256 through a fully connected layer.
And 7, training a classifier on the dimension-reduced features obtained in step 6, specifically as follows:
and 7.1, the voice of the target patient is taken as the test voice, and the voice data of existing depression patients as the training data. The training data comprise voice information of X persons; the labels indicating whether each of the X persons suffers from depression form a label dictionary, each label is given a corresponding index number, and that index number is used as the class index. After one test, the spectrograms generated for the target patient are added to the training data set.
Step 7.2, for each label, the voices of depressed subjects are taken as the positive sample set and the voices of non-depressed subjects as the negative sample set. The two-class SVM is trained with the positive and negative sample sets to obtain a trained two-class SVM; the classifier training process is as follows:
The two parameters of the SVM, the kernel function and the penalty factor, are determined by cyclically checking the accuracy on the SVM training set; after the optimal parameters are selected, model training is performed with those parameters. Let the training sample voice data be {x_i, y_i}, x_i ∈ R^n, i = 1, 2, ..., n, where x_i is the (O+P)-dimensional feature vector and y_i is the depression label. The SVM maps the training set to a high-dimensional space with a nonlinear mapping Φ(x), so that the nonlinear problem becomes linear; the optimal classification surface is described as y = ω^T Φ(x) + b, where ω and b are the weight and bias vector of the SVM.
To find the optimal ω and b, a relaxation factor ξ_i is introduced and the classification surface is transformed into its quadratic optimization problem, namely:

min_{ω,b,ξ} (1/2)‖ω‖² + C Σ_{i=1}^{n} ξ_i

s.t. y_i(ω·Φ(x_i) + b) ≥ 1 − ξ_i

ξ_i ≥ 0, i = 1, 2, ..., n

wherein C represents the penalty parameter. Introducing Lagrange multipliers α_i transforms the quadratic optimization problem into its dual:

max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j Φ(x_i)·Φ(x_j)

s.t. Σ_{i=1}^{n} α_i y_i = 0, 0 ≤ α_i ≤ C.
the weight vector ω is calculated as: omega = Σα i y i Φ(x i ) Φ (x), the decision function of the support vector machine can be described as: f (x) =sgn (α) i y i Φ(x i )·Φ(x j ) +b), simplifying calculation, and introducing a Gaussian direct basis (RBF) kernel function to make a decision function as follows:
where σ represents the width parameter of the RBF.
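The RBF-kernel decision function above can be sketched directly. This is a minimal illustration: the support vectors, multipliers α_i, and bias b are assumed to be given (in practice they come from solving the dual problem), and the toy values in the usage below are invented for demonstration.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel K(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2 * sigma ** 2))

def svm_decision(x, support_x, support_y, alpha, b, sigma=1.0):
    """Decision function f(x) = sgn(sum_i alpha_i * y_i * K(x_i, x) + b).
    alpha and b are assumed inputs obtained from training."""
    s = sum(a * y * rbf_kernel(xi, x, sigma)
            for a, y, xi in zip(alpha, support_y, support_x))
    return 1 if s + b >= 0 else -1
```

With two hypothetical support vectors at [0, 0] (positive) and [2, 2] (negative), a point near the origin is classified positive and a point at [2, 2] negative, matching the sign of the kernel-weighted sum.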
And 8, recognizing the test voice with the classifier trained in step 7. The generated recognition result can be sent over WIFI to the patient's guardian, so that the patient's condition can be observed at any time.
In this way, the microphone used for voice acquisition is convenient to carry, and the voice signal of the patient in a natural state can be acquired; building on depression-recognition research that combines CNN features, MFCCs, and GAN-enhanced data sets, the advantages of the MFCC and the CNN are combined, and the accuracy of depression recognition in a non-laboratory environment is improved.
The depression recognition challenge database of AVEC2013 (audiovisual depression recognition) is used to test the microphone-array-based depression detection method; the data set contains voice information from 340 individuals. The specific operation is as follows:
step 1, preprocessing the voice signals under each sub-directory sequentially by using a traversing method, and dividing the voice signals into frames by using a Hamming window function. A cepstral feature vector is then generated and a discrete fourier transform is computed for each frame. Only the logarithm of the amplitude is retained. After the spectrum is smoothed, 24 spectrum components of 44100 frequency bands are collected in the Mel frequency range. The components of the mel-spectrum vector calculated for each frame are highly correlated. Therefore, KL (Karhunen-Loeve) transform is applied, and then approximated as Discrete Cosine Transform (DCT).
Step 2, after preprocessing, the MFCC features are extracted and normalized; by dividing the voice into segments, the length of each segment is limited to 10 seconds, and at 50 frames per second a 177-dimensional feature vector is obtained for each frame, so the number of channels per second of voice is 50. The voice signal is also converted into a spectrogram, with the sampling limited to 64 frames per second; a color spectrogram picture of 64×64×3 pixels is obtained and resized to 200×200×3 pixels.
And 4, the convolution and pooling layers are built: a 7-layer convolutional neural network model with 3 convolution layers, 3 max-pooling layers, and 1 fully connected layer. The input to the first layer is a 200×200×3 spectrogram, convolved with 3×3 convolution kernels that move along the X and Y axes of the image with a stride of 1 pixel; 64 convolution kernels generate a 198×198×64 pixel layer. The ReLU function is used as the activation function: the pixel layer is processed by ReLU units to produce an activated pixel layer, which is then max-pooled with a 2×2 pooling scale and a default stride of 2, giving a pooled size of 99×99×64. During back-propagation, each convolution kernel corresponds to one bias value, i.e. the 64 convolution kernels of the first layer correspond to 64 bias values for the input from the previous layer. The second layer uses 32 convolution kernels of 3×3×64; the convolution produces a 97×97×32 pixel layer, which is processed by ReLU units and then max-pooled with a pooling scale of 2, giving a pooled size of 48×48×32; a Dropout layer then disconnects input neurons with 10% probability during parameter updates to prevent overfitting. In the back-propagation of this layer, each convolution kernel again corresponds to one bias value, i.e. the 32 convolution kernels of the second layer correspond to 32 bias values for the input from the previous layer. Similarly, the third layer uses 32 convolution kernels of 3×3×32; the convolution produces a 46×46×32 pixel layer, which is processed by ReLU units and max-pooled with a 2×2 pooling scale, giving a pooled size of 23×23×32, followed again by a Dropout layer that disconnects input neurons with 10% probability during parameter updates. A flattening layer then converts the multi-dimensional input into one dimension, outputting a one-dimensional pixel array containing 16928 values in total, which is passed as input into the fully connected layer for further operation.
To extract features of the spectrogram itself to feed into the GAN network for generating new spectrograms, the obtained multi-dimensional features must be dimension-reduced: a fully connected (Dense) layer is built that fully connects the 16928 input values to 128 neural units; after ReLU activation, 128 values are produced, and after Dropout, 128 values are output as the voice emotion features.
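The layer-size arithmetic of step 4 can be checked with a few lines of pure Python (this traces shapes only, not the network itself; "valid" convolutions with stride 1 are assumed, as the sizes in the text imply):

```python
def conv_out(size, kernel, stride=1):
    """Output size of a 'valid' convolution: (size - kernel) // stride + 1."""
    return (size - kernel) // stride + 1

def pool_out(size, scale=2):
    """Max pooling with a scale x scale window and stride = scale."""
    return size // scale

# Trace the three conv + pool stages of step 4 (200 x 200 x 3 input):
s = 200
s = pool_out(conv_out(s, 3))   # layer 1: conv -> 198, pool -> 99
s = pool_out(conv_out(s, 3))   # layer 2: conv -> 97,  pool -> 48
s = pool_out(conv_out(s, 3))   # layer 3: conv -> 46,  pool -> 23
flattened = s * s * 32         # 23 * 23 * 32 = 16928 values into the Dense layer
```

The trace reproduces the 16928 flattened units that the 128-unit Dense layer then reduces.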
Step 5, the GAN generator network model of the invention consists of 1 fully connected layer, 3 transposed convolution layers, and 2 batch normalization layers. The input to the first layer is the 128 values extracted in step 4, which are connected to 4608 neurons through the fully connected layer and reshaped to 3×3×512. The second layer uses a transposed convolution to reduce 512 channels to 256 channels, with kernel_size 3 and stride 3, followed by a batch normalization layer; the third layer uses a transposed convolution to reduce 256 channels to 128 channels, with kernel_size 5 and stride 2, followed by a batch normalization layer; the fourth layer uses a transposed convolution to reduce 128 channels to 3 channels, with kernel_size 4 and stride 3.
the GAN discriminator network model of the invention consists of 3 convolutional layers, 2 batch normalization layers, and 1 fully connected layer using a 7-layer convolutional neural network model. The input data of the first layer is a spectrogram of 64 multiplied by 3, convolution operation is carried out on the spectrogram and a convolution kernel of 5 multiplied by 3, the convolution kernel moves along the X axis and the Y axis of the image, the step length is 1 pixel, 64 convolution kernels are used for generating 60 multiplied by 24 pixel layer data, a leakage-ReLU function is used as an activation function, and the pixel layers are subjected to the processing of the leakage-ReLU unit to generate an activated pixel layer; the second layer uses 128 5×5×128 convolution kernels, and 57×57×128 pixel layers are generated after convolution operation. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer for preventing overfitting; the third layer uses 256 5×5×256 convolution kernels, and generates 53×53×256 pixel layers after convolution operation. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer for preventing overfitting; using a flattening layer to unify multi-dimensional input, after flattening treatment, using the pixels as input to enter a full-connection layer, wherein the final output layer is 1 node, and outputting a probability value; the size of the generated spectrogram which meets the standard and is 64 multiplied by 3 is modified to be 200 multiplied by 3 pixel size, and then the spectrogram is transmitted into the convolution network in the step 4 for retraining.
And 6, a fully connected layer is built: the 1800-dimensional data extracted in step 3 and the 16928-dimensional data extracted in step 4 are concatenated into 18728-dimensional data and fully connected to 256 neural units; after ReLU activation, 256 values are produced, and after Dropout, 256 values are output as the voice emotion features.
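The fusion and dimension reduction of step 6 amount to a concatenation followed by one dense layer. A minimal numpy sketch, with random feature vectors and untrained (random) Dense weights standing in for the real branch outputs and learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mfcc_feat = rng.standard_normal(1800)      # 1800-dim MFCC branch output (step 3)
spec_feat = rng.standard_normal(16928)     # 16928-dim spectrogram branch (step 4)

fused = np.concatenate([mfcc_feat, spec_feat])     # 18728-dim fused features
W = rng.standard_normal((256, fused.size)) * 0.01  # illustrative Dense weights
b = np.zeros(256)
reduced = np.maximum(W @ fused + b, 0.0)           # ReLU(Wx + b) -> 256-dim output
```

In the trained network the 256-dimensional output is the voice emotion feature fed to the classifier.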
Step 7, since the data set contains 292 persons, a total of 43,800 voice clips are used after clipping and screening. The labels indicating whether each of the 292 persons suffers from depression form a label dictionary, each label is given a corresponding index number used as the class index; 90% of the labeled voice signals are used as the training set and the remaining 10% as the test set;
for each label, the voices of depressed subjects are taken as the positive sample set and the voices of non-depressed subjects as the negative sample set. The two-class SVM is trained with the positive and negative sample sets to obtain a trained two-class SVM;
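The label dictionary and 90/10 split of step 7 can be sketched as follows. The person identifiers, dummy labels, and clip names are invented for illustration; only the 292 subjects and 43,800 clips come from the text above.

```python
# Label dictionary mapping each label to its class index (step 7).
label_dict = {"not_depressed": 0, "depressed": 1}

# Dummy per-person labels for 292 subjects (alternating, for illustration).
person_labels = {f"person_{i:03d}": i % 2 for i in range(292)}

# 43,800 (clip, class-index) pairs, split 90% train / 10% test.
clips = [(f"clip_{j:05d}.wav", person_labels[f"person_{j % 292:03d}"])
         for j in range(43800)]
split = int(len(clips) * 0.9)
train_set, test_set = clips[:split], clips[split:]
```

A real split would be stratified by speaker so that no person appears in both sets; this sketch only shows the bookkeeping.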
And 8, recognizing the test voice with the two-class SVM trained in step 7.
Claims (8)
1. A method for detecting depression based on a microphone array, comprising the steps of:
step 1, a microphone array is used for collecting voice signals of a target patient and preprocessing the voice signals;
step 2, extracting the MFCC features of the audio signal of the target patient preprocessed in step 1 and of the voice data of existing depression patients, and generating an audio spectrogram;
step 3, the MFCC features extracted in the step 2 are sent into a 1D convolutional neural network to obtain P-dimensional features of the MFCC;
step 4, sending the audio spectrogram generated in the step 2 into a 2D convolutional neural network to obtain the O-dimensional characteristic of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into an adversarial generation network to generate new spectrum images, and feeding the generated spectrum images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional characteristics of the MFCC extracted in the step 3 with the characteristics obtained by training in the step 5, and reducing the dimension through the full connection layer;
step 7, training a classifier through the feature after dimension reduction obtained in the step 6;
and 8, recognizing the test voice through the trained classifier in the step 7 to obtain a recognition result.
2. The method for detecting depression based on microphone array as claimed in claim 1, wherein the step 1 specifically comprises the steps of:
step 1.1, collecting a voice signal of a target patient through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient, transforming the signal from the time domain to the frequency domain by a fast Fourier transform, completing the estimation of the spectral factor by calculating a smoothed power spectrum and a noise power spectrum, outputting the spectrally subtracted signal, and finally detecting the endpoint values of the target patient's voice by combining the energy-entropy ratio;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA positioning method;
and 1.4, synthesizing four paths of signals into one path of signal through a superdirective beam forming algorithm by using the voice signals subjected to end point detection and sound source positioning processing, so as to realize synthesis, noise reduction and enhancement of microphone array signals.
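The spectral subtraction of step 1.2 can be sketched for a single frame as follows. This is a minimal illustration: the noise magnitude spectrum is assumed to be already estimated (e.g. by averaging noise-only frames), and the smoothing and energy-entropy endpoint detection steps are omitted.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, window=None):
    """One-frame spectral subtraction: FFT to the frequency domain, subtract
    an estimated noise magnitude spectrum, resynthesize with the original
    phase. noise_mag must have len(frame)//2 + 1 bins (rfft layout)."""
    if window is None:
        window = np.hamming(len(frame))
    spec = np.fft.rfft(frame * window)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, 0.0)   # half-wave rectification
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```

With a zero noise estimate the frame passes through (windowed) unchanged; with a real estimate the stationary noise floor is attenuated before endpoint detection.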
3. The method for detecting depression based on microphone array as claimed in claim 2, wherein the step 2 specifically comprises the steps of:
step 2.1, firstly dividing the voice signal into frames through a Hamming window function; then generating a cepstral feature vector: calculating a discrete Fourier transform for each frame, keeping only the logarithm of the amplitude spectrum, collecting 24 spectral components over the 44100 frequency bands in the mel frequency range after the spectrum is smoothed, and, after applying a Karhunen-Loeve transform, approximating it by a discrete cosine transform; finally obtaining a cepstral feature vector [f_1, f_2, ..., f_N] for each frame;
step 2.2, according to the set frame number, framing and windowing the voice signal of the target patient, performing a short-time Fourier transform on the discrete voice signal x(m), and calculating the power spectrum of the m-th frame to obtain a spectrogram; selecting L filters and L frames of the same size as the filters in the time direction, generating an L×L×3 spectrogram, and resizing the generated color image to M×M×3.
4. The method for detecting depression based on microphone array as claimed in claim 3, wherein the 1D convolutional neural network of step 3 is: using the open-source TensorFlow-based Keras framework, only two 1D convolution layers are built, each adopting a rectified linear unit as the activation function; the input dimension is M×1, passed through w_1 convolution-layer filters of size m×1 with a Dropout of 0.1 and a max-pooling stride of q_1, outputting a feature vector S; in the stage of training the 1D convolutional neural network, the MFCC features containing time-frequency information of each frame of the voice signal are read into memory sequentially by a traversal method, divided into a training set and a testing set, and labeled; the processed data are then fed into the convolutional neural network for iterative training according to the set labels, for B iterations in total.
5. The method for detecting depression based on microphone array as claimed in claim 4, wherein the 2D convolutional neural network of step 4 is: using the open-source TensorFlow-based Keras framework, a convolutional neural network is built containing w_2 two-dimensional convolution layers of size n×n, w_1 max-pooling layers, and 1 fully connected layer with output dimension L, wherein rectified linear units are adopted as the activation function in the convolution layers and the fully connected layer; in the stage of training the convolutional neural network, the spectrogram features containing texture-like information of each frame of the voice signal are read into memory sequentially by a traversal method, divided into a training set and a testing set, and labeled; the processed data are then fed into the convolutional neural network for iterative training according to the set labels, for B iterations in total; the convolutional neural network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
6. The method for detecting depression based on microphone array as claimed in claim 5, wherein the adversarial generation network of step 5 is: a network model comprising a generator and a discriminator, wherein the generator network model consists of 1 fully connected layer, 3 transposed convolution layers, and 2 batch normalization layers and outputs a color picture of size M×M×3, while the discriminator part comprises 3 convolution layers and a fully connected layer with a softmax function; the discriminator network model consists of 3 convolution layers, 2 batch normalization layers, and 2 fully connected layers, using a 7-layer convolutional neural network model, and finally outputs a probability value; a probability threshold λ is set, and when the probability value produced by the discriminator after multiple rounds of training is greater than λ, the spectrogram generated by the generator is saved.
7. The method for detecting depression based on microphone array as claimed in claim 6, wherein step 6 is specifically: the P-dimensional MFCC features extracted through the 1D convolutional neural network are fused with the O-dimensional spectrogram features to obtain (P+O)-dimensional features, whose dimension is reduced to 256 through a fully connected layer.
8. The method for detecting depression based on microphone array as claimed in claim 7, wherein said step 7 is specifically:
step 7.1, taking the voice of the target patient as the test voice and the voice data of existing depression patients as the training data; the training data comprise voice information of X persons, the labels indicating whether each of the X persons suffers from depression form a label dictionary, and each label is given a corresponding index number used as the class index; after one test, the spectrograms generated for the target patient are added to the training data set;
and 7.2, for each label, taking the voices of depressed subjects as the positive sample set and the voices of non-depressed subjects as the negative sample set, and training the two-class SVM with the positive and negative sample sets to obtain a trained two-class SVM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011248610.5A CN112349297B (en) | 2020-11-10 | 2020-11-10 | Depression detection method based on microphone array |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112349297A CN112349297A (en) | 2021-02-09 |
CN112349297B true CN112349297B (en) | 2023-07-04 |
Family
ID=74362344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011248610.5A Active CN112349297B (en) | 2020-11-10 | 2020-11-10 | Depression detection method based on microphone array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112349297B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113012720B (en) * | 2021-02-10 | 2023-06-16 | 杭州医典智能科技有限公司 | Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction |
CN112818892B (en) * | 2021-02-10 | 2023-04-07 | 杭州医典智能科技有限公司 | Multi-modal depression detection method and system based on time convolution neural network |
CN112687390B (en) * | 2021-03-12 | 2021-06-18 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN113223507B (en) * | 2021-04-14 | 2022-06-24 | 重庆交通大学 | Abnormal speech recognition method based on double-input mutual interference convolutional neural network |
CN113205803B (en) * | 2021-04-22 | 2024-05-03 | 上海顺久电子科技有限公司 | Voice recognition method and device with self-adaptive noise reduction capability |
CN113476058B (en) * | 2021-07-22 | 2022-11-29 | 北京脑陆科技有限公司 | Intervention treatment method, device, terminal and medium for depression patients |
CN113679413B (en) * | 2021-09-15 | 2023-11-10 | 北方民族大学 | VMD-CNN-based lung sound feature recognition and classification method and system |
CN113820693B (en) * | 2021-09-20 | 2023-06-23 | 西北工业大学 | Uniform linear array element failure calibration method based on generation of countermeasure network |
CN114219005B (en) * | 2021-11-17 | 2023-04-18 | 太原理工大学 | Depression classification method based on high-order spectrum voice features |
CN116978409A (en) * | 2023-09-22 | 2023-10-31 | 苏州复变医疗科技有限公司 | Depression state evaluation method, device, terminal and medium based on voice signal |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107705806A (en) * | 2017-08-22 | 2018-02-16 | 北京联合大学 | A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks |
CN108831495A (en) * | 2018-06-04 | 2018-11-16 | 桂林电子科技大学 | A kind of sound enhancement method applied to speech recognition under noise circumstance |
CN109599129A (en) * | 2018-11-13 | 2019-04-09 | 杭州电子科技大学 | Voice depression recognition methods based on attention mechanism and convolutional neural networks |
CN110047506A (en) * | 2019-04-19 | 2019-07-23 | 杭州电子科技大学 | A kind of crucial audio-frequency detection based on convolutional neural networks and Multiple Kernel Learning SVM |
Non-Patent Citations (4)
Title |
---|
Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals;LE YANG等;IEEE ACCESS;全文 * |
Recognition of Audio Depression Based on Convolutional Neural Network and Generative Antagonism Network Model;ZHIYONG WANG等;IEEE ACCESS;全文 * |
基于深度学习的音频抑郁症识别;李金鸣等;计算机应用与软件;全文 * |
基于自编码器的语音情感识别方法研究;钟昕孜 等;电子设计工程(第06期);全文 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112349297B (en) | Depression detection method based on microphone array | |
US10901063B2 (en) | Localization algorithm for sound sources with known statistics | |
CN109272989B (en) | Voice wake-up method, apparatus and computer readable storage medium | |
US10127922B2 (en) | Sound source identification apparatus and sound source identification method | |
Stöter et al. | Countnet: Estimating the number of concurrent speakers using supervised learning | |
Glodek et al. | Multiple classifier systems for the classification of audio-visual emotional states | |
US5621848A (en) | Method of partitioning a sequence of data frames | |
JPS62201500A (en) | Continuous speech recognition | |
Suvorov et al. | Deep residual network for sound source localization in the time domain | |
CN113314127B (en) | Bird song identification method, system, computer equipment and medium based on space orientation | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion | |
Salvati et al. | A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients | |
Venkatesan et al. | Binaural classification-based speech segregation and robust speaker recognition system | |
US5832181A (en) | Speech-recognition system utilizing neural networks and method of using same | |
Salvati et al. | Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features. | |
Lin et al. | Domestic activities clustering from audio recordings using convolutional capsule autoencoder network | |
CN117762372A (en) | Multi-mode man-machine interaction system | |
CN115952840A (en) | Beam forming method, arrival direction identification method, device and chip thereof | |
Ganchev et al. | Automatic height estimation from speech in real-world setup | |
Amami et al. | A robust voice pathology detection system based on the combined bilstm–cnn architecture | |
Kanisha et al. | Speech recognition with advanced feature extraction methods using adaptive particle swarm optimization | |
Raju et al. | AUTOMATIC SPEECH RECOGNITION SYSTEM USING MFCC-BASED LPC APPROACH WITH BACK PROPAGATED ARTIFICIAL NEURAL NETWORKS. | |
Venkatesan et al. | Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest | |
Sailor et al. | Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection. | |
Kothapally et al. | Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||