CN112349297A - Depression detection method based on microphone array - Google Patents
Depression detection method based on microphone array
- Publication number
- CN112349297A CN112349297A CN202011248610.5A CN202011248610A CN112349297A CN 112349297 A CN112349297 A CN 112349297A CN 202011248610 A CN202011248610 A CN 202011248610A CN 112349297 A CN112349297 A CN 112349297A
- Authority
- CN
- China
- Prior art keywords
- training
- neural network
- voice
- convolutional neural
- microphone array
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times ; Devices for evaluating the psychological state
- A61B5/165—Evaluating the state of mind, e.g. depression, anxiety
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/48—Other medical applications
- A61B5/4803—Speech analysis specially adapted for diagnostic purposes
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7264—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
- A61B5/7267—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a depression detection method based on a microphone array, comprising the steps of: collecting a voice signal of a target patient with a microphone array and preprocessing it; extracting MFCC features from the preprocessed audio of the target patient and from the voice data of existing depression patients, and generating audio spectrograms; feeding the MFCC features into a 1D convolutional neural network to obtain P-dimensional MFCC features; feeding the audio spectrogram into a 2D convolutional neural network to obtain O-dimensional spectrogram features; inputting the O-dimensional features into a generative adversarial network (GAN) to generate new spectrum images, which are passed into the 2D convolutional neural network for training; fusing the P-dimensional MFCC features with the features obtained from training and reducing the dimensionality through a fully-connected layer; training a classifier with the dimension-reduced features; and recognizing the test voice with the trained classifier to obtain the recognition result. The method improves the accuracy of depression identification in non-experimental environments.
Description
Technical Field
The invention belongs to the technical field of voice recognition methods, and particularly relates to a depression detection method based on a microphone array.
Background
Currently, some progress has been made in the field of depression detection, but diagnosis still mainly requires the patient to perform voice signal acquisition in front of a fixed voice acquisition device and relies on a clinician for diagnosis. Common diagnostic instruments include the Beck Depression Inventory (BDI) and the Hamilton Depression Rating Scale (HAMD), so diagnostic results depend heavily on the experience and ability of the physician and, more importantly, require the patient's cooperation. As a result, most speech collected during such examinations is scripted and mechanical, which can make the collected speech unrepresentative of the patient. A detection device should therefore be able to collect the patient's voice, with background noise removed, in the natural conditions of daily life.
A microphone array is composed of a number of acoustic sensors and is a system for sampling and processing the spatial characteristics of a sound field. In a complex acoustic environment, noise arrives from all directions and often overlaps with the speech signal in both time and frequency; adding the effects of echo and reverberation, it is very difficult to capture relatively clean speech with a single microphone. A microphone array, by contrast, fuses the space-time information of the voice signal, so the sound source can be extracted and the noise suppressed at the same time.
Convolutional neural networks (CNN) are one of the established deep learning algorithms of recent years and offer good classification performance on large-scale images. The greatest advantage of the generative adversarial network (GAN) is that it addresses the experimental problem of insufficient sample data: by constructing a suitable network model it generates realistic synthetic samples, which can effectively aid the diagnosis and prediction of medical diseases and provide an additional diagnostic basis for medical research.
The invention combines the microphone array's ability to capture clean sound signals with the advantages of two deep learning methods, GAN and CNN, thereby improving the accuracy of depression identification.
Disclosure of Invention
The invention aims to provide a depression detection method based on a microphone array, which improves the accuracy of depression identification.
The technical scheme adopted by the invention is as follows: a microphone array based depression detection method comprising the steps of:
step 1, collecting a voice signal of a target patient by using a microphone array and preprocessing the voice signal;
step 2, extracting MFCC characteristics of the audio signal preprocessed in step 1 and of the voice data of existing depression patients, and generating an audio spectrogram;
step 3, sending the MFCC features extracted in step 2 into a 1D convolutional neural network to obtain P-dimensional features of the MFCC;
step 4, sending the audio spectrogram generated in step 2 into a 2D convolutional neural network to obtain O-dimensional features of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectrum images, and passing the generated images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional MFCC features extracted in step 3 with the features obtained by training in step 5, and reducing the dimensionality through a fully-connected layer;
step 7, training a classifier with the dimension-reduced features obtained in step 6;
and step 8, recognizing the test voice with the classifier trained in step 7 to obtain the recognition result.
The present invention is also characterized in that,
the step 1 specifically comprises the following steps:
step 1.1, acquiring a target patient voice signal through a quaternary cross microphone array;
step 1.2, framing and windowing the collected voice signal of the target patient; converting the signal from the time domain to the frequency domain by fast Fourier transform; estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum; outputting the signal after spectral subtraction; and finally detecting the target patient's voice signal by computing the energy-to-entropy ratio to obtain the voice endpoint values;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA (direction of arrival) positioning method;
and step 1.4, for the voice signals after endpoint detection and sound-source localization, synthesizing the four signals into one signal by a super-directive beamforming algorithm, realizing synthesis, noise reduction and enhancement of the microphone array signals.
The step 2 specifically comprises the following steps:
step 2.1, first dividing the voice signal into frames with a Hamming window function; then generating cepstral feature vectors: a discrete Fourier transform is computed for each frame and only the logarithm of the amplitude spectrum is retained; after the spectrum is smoothed, 24 spectral components are collected over the Mel frequency range (for audio sampled at 44100 Hz); after applying the Karhunen-Loeve transform, it is approximated by the discrete cosine transform; finally, each frame yields a cepstral feature vector [f_1, f_2, ..., f_N];
step 2.2, according to the set frame number, performing framing and windowing on the voice signal of the target patient, performing short-time Fourier transform on the discrete voice signal x (m), and calculating the power spectrum of the discrete voice signal in the mth frame to obtain a spectrogram; when L filters are selected and L frames having the same size as the filters are selected in the time direction, an L × 3 spectrogram is generated, and the size of the generated color image is adjusted to M × 3.
The 1D convolutional neural network of step 3 is as follows: using the open-source TensorFlow-based Keras framework, only two 1D convolutional layers are built, each using the rectified linear unit (ReLU) as activation function; the input dimension is M×1, passed through w1 convolutional filters of size m×1, with dropout of 0.1 and max-pooling stride q1, outputting a feature vector S. In the stage of training the 1D convolutional neural network, the MFCC features of each frame of the voice signal, containing time-frequency information, are read into memory sequentially by traversal; the training set and test set are divided and labeled; and the processed data are fed into the convolutional neural network according to the set labels and iteratively trained, for B iterations in total.
The 2D convolutional neural network of step 4 is as follows: using the open-source TensorFlow-based Keras framework, a convolutional neural network is constructed containing w2 two-dimensional convolutional layers of size n×n, w1 max-pooling layers and 1 fully-connected layer with output dimension L, with ReLU as the activation function in both the convolutional and fully-connected layers. In the training stage, the spectrogram features of each frame of the voice signal, containing texture-like information, are read into memory sequentially by traversal; the training set and test set are divided and labeled; and the processed data are fed into the convolutional neural network according to the set labels and iteratively trained, for B iterations in total. The network is trained with stochastic gradient descent as the optimizer, with learning rate ε, learning-rate decay μ after each update, and momentum β.
The generative adversarial network of step 5 is: based on the DCGAN network structure, simplified and with adjusted parameters. The network model comprises a generator and a discriminator. The generator network consists of 1 fully-connected layer, 3 transposed-convolution layers and 2 batch-normalization layers, outputting a color picture of size M×3; the discriminator part comprises 3 convolutional layers and a fully-connected layer with a softmax function. The discriminator network is a 7-layer convolutional model composed of 3 convolutional layers, 2 batch-normalization layers and 2 fully-connected layers, finally outputting a probability value. A probability threshold λ is set; when, after multiple rounds of training, the probability produced by the discriminator exceeds λ, the spectrogram generated by the generator is saved.
Step 6 specifically comprises: fusing the P-dimensional MFCC features extracted by the 1D convolutional neural network with the O-dimensional spectrogram features to obtain a (P+O)-dimensional feature, whose dimensionality is reduced to 256 through a fully-connected layer.
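The fusion-and-reduction step can be sketched in numpy (a minimal illustration, not the patent's implementation: the example dimensions P=128 and O=64 are assumptions, and a ReLU activation on the fully-connected layer is assumed):

```python
import numpy as np

def fuse_and_reduce(p_feat, o_feat, W, b):
    # concatenate the P-dim MFCC features and the O-dim spectrogram
    # features, then apply a fully-connected layer (assumed ReLU)
    # mapping the (P+O)-dim vector down to 256 dimensions
    fused = np.concatenate([p_feat, o_feat])
    return np.maximum(W @ fused + b, 0.0)
```

In a trained system W and b would come from the fully-connected layer's learned parameters; here they are stand-ins with the right shapes.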
The step 7 specifically comprises the following steps:
step 7.1, taking the voice of the target patient as the test voice and the voice data of existing depression patients as training data; the training data comprise voice information of X individuals, and the labels indicating whether each of the X individuals suffers from depression form a label dictionary in which each label has a corresponding index number, used as the index of its class; after one test, the spectrogram generated for the target patient is added to the training data set;
and step 7.2, for each label, using the voices of subjects suffering from depression as the positive sample set and the voices of subjects without depression as the negative sample set, and training the binary SVM with the positive and negative sample sets to obtain the trained binary SVM classifier.
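The binary SVM training step can be illustrated with a small numpy sketch. This is a hand-rolled linear SVM trained by Pegasos-style sub-gradient descent on the hinge loss, standing in for whatever SVM library the patent's authors used; the hyperparameters (lam, epochs, lr) are assumed values:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    # Pegasos-style sub-gradient training of a linear binary SVM;
    # labels y are in {-1, +1} (depressed = +1, not depressed = -1)
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                       # hinge-loss violation
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                # only regularize
                w = (1 - lr * lam) * w
    return w, b

def predict(X, w, b):
    # sign of the decision function gives the class label
    return np.where(X @ w + b >= 0, 1, -1)
```

In the patent's pipeline, X would hold the 256-dimensional fused features from step 6 rather than raw points.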
The beneficial effects of the invention are: in the depression detection method based on a microphone array, the microphone used for voice acquisition is convenient to carry and can collect the patient's voice signal in a natural state; and by building on depression-recognition research that combines CNN, MFCC features and GAN-enhanced data sets, the method improves the accuracy of depression recognition in non-experimental environments.
Drawings
FIG. 1 is a schematic diagram of a microphone array based depression detection method of the present invention;
FIG. 2 is a schematic diagram of a microphone array used in a method of the present invention for depression detection based on a microphone array;
FIG. 3 is a schematic diagram of a CNN model in a depression detection method based on a microphone array according to the present invention;
fig. 4 is a schematic diagram of a GAN model in a depression detection method based on a microphone array according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a depression detection method based on a microphone array, which comprises the following steps as shown in figures 1 to 4:
Step 1: using the annular microphone array, a pickup beam is formed in the direction of the target speaker for accurate sound localization, suppressing noise and reflected sound and enhancing the sound signal; speech at a distance of 3-5 m can be recognized accurately in a noisy environment, meeting the need to collect the patient's voice signal at any time in daily life. Specifically:
step 1.1, acquiring a patient voice signal through a quaternary cross microphone array;
Step 1.2: the collected voice signal of the target patient is framed and windowed, converted from the time domain to the frequency domain by fast Fourier transform, and the spectral factor is estimated by computing the smoothed power spectrum and the noise power spectrum; the signal after spectral subtraction is output. Finally, the energy-to-entropy ratio is used to detect whether the patient's voice signal is present and to obtain the voice endpoint values. The energy-to-entropy ratio is calculated as follows:
the energy per frame is calculated as:
xiand (m) is a signal of an ith frame, and the frame length is N. The energy relation expression is as follows:
Ei=log10(1+ei/a)
a is constant and proper adjustment can distinguish between unvoiced sounds and noise. The ith frame of voice signal is subjected to fast Fourier transform to obtain:
obtaining a frequency component energy spectrum corresponding to the kth spectral line:
the normalized spectral probability density is then:
short-time spectral entropy definition of a speech frame:
energy to entropy ratio EHiIs the ratio of energy and entropy spectrum:
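The energy-to-entropy-ratio computation above can be sketched in Python. This is a minimal per-frame illustration; the constant a = 2.0 and the small epsilons guarding the logarithms are assumed values, not taken from the patent:

```python
import numpy as np

def energy_entropy_ratio(frame, a=2.0):
    """Energy-to-entropy ratio EH_i of one speech frame x_i(m)."""
    # frame energy e_i and its log-compressed form E_i = log10(1 + e_i/a)
    e = np.sum(frame.astype(float) ** 2)
    E = np.log10(1.0 + e / a)

    # energy spectrum Y_i(k) = |X_i(k)|^2 via the FFT
    Y = np.abs(np.fft.rfft(frame)) ** 2
    # normalized spectral probability density p_i(k)
    p = Y / (np.sum(Y) + 1e-12)
    # short-time spectral entropy H_i = -sum p log p
    H = -np.sum(p * np.log(p + 1e-12))

    return E / H  # EH_i
```

Voiced frames concentrate energy in few spectral lines (low entropy), so their EH value is higher than that of broadband noise frames of the same energy, which is what makes EH usable for endpoint detection.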
Step 1.3: combining the endpoint detection result, the position of the sound source is determined with a DOA (direction of arrival) localization method. Taking one frame of signal data as an example: the voice data are read and the m-th frame is taken as the processing object; the 4 microphone signals corresponding to the m-th frame are combined into 1 signal, which is weighted by W_e(k). The energy sum E_s over the chosen frequency bands is then computed for each candidate angle, giving the energy value E_s(i) of the current frame for each angle i from 0° to 360°. The maximum E_smax(i) of these 360 energies, and the angle i at which it occurs, determine the sound source angle output for the current frame. The band energy of each frame signal corresponding to a given angle is:

E_s = Σ_{k=f1}^{f2} |X_sw(k)|^2

where f1 and f2 indicate the band limits, ranging from 1 to N/2+1, and X_sw(k) is the band-weighted version of the combined signal:

X_sw(k) = W_e(k) · X_s(k)

where W_e(k) is the band weighting factor:

W_e(k) = [W(k)]^λ

with exponent 0 < λ < 1, and W(k) is a masking weight factor, indicating that the band with the maximum signal-to-noise ratio (SNR) among the bands is selected for the current data. X_s(k) is the combination of the 4 signals into 1:

X_s(k) = Σ_{i=1}^{4} X_i(k)

where X_i(k) is one of the 4 signals.
Step 1.4: for the voice signals after endpoint detection and sound-source localization, the 4 signals are synthesized into 1 signal by a super-directive beamforming algorithm, realizing synthesis, noise reduction and enhancement of the microphone array signals. The super-directive beamforming algorithm is detailed as follows:
the microphone array quaternary cross array can be regarded as one of uniform circular arrays, and the arrival direction vector of a received signal at an angle theta is as follows according to the geometrical relationship of the array:
wherein the content of the first and second substances,
the voice environment used by the method is mainly indoor and daily life, so that the noise matrix calculated based on the scattered noise field has certain applicability to the current microphone voice environment; the scattered noise field only describes the equidirectional noise field of the three-dimensional sphere, and the expression of the correlation function of the scattered noise field is as follows:
where sinc (x) yields the sampling function sin π x/π x. The microphone array is composed of M array elements, and the signal received by the ith microphone is as follows:
wherein f represents frequency, AiThe amplitude is represented by a value representing the amplitude,expressing the phase, according to the mathematical model theory of the optimal solution of the super directivity, the noise signal correlation coefficient between any two points in the space is as follows:
the noise covariance matrix is normalized to:
Rnn=[ρij](i,j=1,2,...,N-1)
dijrepresenting the distance between any two array elements in the microphone array.
The invention adopts the principle of minimum-variance distortionless-response (MVDR) beamforming, which is the LCMV method under the single constraint w^H a(θ) = 1: the signal in the look direction is preserved while the noise variance is minimized, so MVDR maximizes the signal-to-noise ratio (SNR) of the array output. The aim is to select filter coefficients w that minimize the total output power under the constraint that the voice signal is undistorted; the key is therefore to solve for the optimal weight vector w. The constrained optimization is:

min_w  w^H R_x w   subject to   w^H a(θ_s) = 1

where a(θ_s) = [a_1(θ), ..., a_M(θ)]^T is the steering vector of the target signal, representing the transfer function between the sound-source direction and the microphones, obtainable from the propagation delays τ; R_x is the spatial covariance matrix of the signal. When k mutually uncorrelated noise signals arrive at the microphone elements from different directions, the spatial covariance matrix is defined as:

R_x = E[x(t) x^H(t)]

Solving with the Lagrange multiplier method gives:

w_opt = R_x^{-1} a(θ_s) / (a^H(θ_s) R_x^{-1} a(θ_s))

Substituting the normalized diffuse-field noise covariance matrix R_nn obtained above for the noise covariance matrix R_x in the MVDR solution, the super-directivity weighting coefficients are obtained as:

w = R_nn^{-1} a(θ_s) / (a^H(θ_s) R_nn^{-1} a(θ_s))

The optimized super-directive weighting coefficients complete the weighted beamforming of the multichannel microphone signals.
Step 2: extracting the MFCC features and generating the audio spectrogram, i.e., extracting a time-frequency representation and a texture-like representation of the audio signal simultaneously:
Step 2.1: first, the voice signal is divided into frames by a Hamming window function. Cepstral feature vectors are then generated: a discrete Fourier transform is computed for each frame and only the logarithm of the amplitude spectrum is retained; after the spectrum is smoothed, 24 spectral components are collected over the Mel frequency range (for audio sampled at 44100 Hz). The components of the Mel spectral vector computed for each frame are highly correlated; therefore a KL (Karhunen-Loeve) transform is applied, which is approximated by the discrete cosine transform (DCT). Finally, each frame yields a cepstral feature vector [f_1, f_2, ..., f_N];
and 2.2, performing framing and windowing on the voice signal of the patient according to the set frame number, performing short-time Fourier transform on the discrete voice signal x (m), and calculating the power spectrum of the discrete voice signal in the mth frame to obtain a spectrogram. To accommodate the input of the convolutional neural network, L filters are selected, and L frames that are as large as the filters are selected in the time direction, so that an L × 3 spectrogram is generated, and the size of the generated color image is adjusted to M × 3.
Step 3: the MFCC features obtained in step 2 are sent into a 1D convolutional neural network to obtain the P-dimensional MFCC features. The 1D convolutional neural network is as follows: using the open-source TensorFlow-based Keras framework, only two one-dimensional (1D) convolutional layers are built to prevent overfitting, each using the rectified linear unit (ReLU) as activation function; the input dimension is M×1, passed through w1 convolutional filters of size m×1, with dropout of 0.1 and max-pooling stride q1, outputting a feature vector S. In the stage of training the 1D convolutional neural network, the MFCC features of each frame of the voice signal, containing time-frequency information, are read into memory sequentially by traversal; the training set and test set are divided and labeled; and the processed data are fed into the convolutional neural network according to the set labels and iteratively trained, for B iterations in total.
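The convolution-ReLU-max-pooling pipeline of this step can be illustrated with a minimal numpy sketch, a hand-rolled stand-in for the Keras layers rather than the patent's actual model (w1, m and q1 remain free parameters here):

```python
import numpy as np

def conv1d(x, kernels, bias):
    # "valid" 1-D convolution: x has shape (n,), kernels (w1, m), bias (w1,)
    m = kernels.shape[1]
    windows = np.stack([x[i:i + m] for i in range(len(x) - m + 1)])
    return windows @ kernels.T + bias       # shape (n - m + 1, w1)

def relu(z):
    # rectified linear unit activation
    return np.maximum(z, 0.0)

def maxpool1d(z, q):
    # non-overlapping max pooling with stride q along the time axis
    n = (z.shape[0] // q) * q
    return z[:n].reshape(-1, q, z.shape[1]).max(axis=1)
```

Stacking two such conv/ReLU stages followed by pooling and flattening yields the P-dimensional feature vector described in the text; in practice the Keras `Conv1D` and `MaxPooling1D` layers perform the same computation with learned kernels.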
And 4, the spectrogram obtained in step 2 is fed into a 2D convolutional neural network to obtain an O-dimensional spectrogram feature, wherein the 2D convolutional neural network is as follows: using the open-source TensorFlow-based Keras framework and following AlexNet, a convolutional neural network is built containing w2 two-dimensional convolutional layers of size n × n, w1 max-pooling layers, and 1 fully connected layer with output dimension L, wherein rectified linear units (ReLU) are adopted as the activation function in both the convolutional and fully connected layers. In the training stage, the spectrogram features of each voice frame, containing texture-like information, are read into memory sequentially by traversal, the data are divided into a training set and a test set, labels are attached to each, the processed data are passed into the convolutional network according to the set labels, and iterative training is performed for B iterations in total. The network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
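The optimizer settings at the end of step 4 can be sketched as a single update rule; the Keras-style decay schedule lr = ε / (1 + μ·t) and the reading of β as a momentum coefficient are assumptions made for illustration:

```python
def sgd_step(w, grad, state, eps=0.01, mu=1e-6, beta=0.9):
    """One stochastic-gradient-descent update: base learning rate eps,
    per-update decay mu (lr = eps / (1 + mu * t)), momentum beta."""
    state['t'] += 1
    lr = eps / (1.0 + mu * state['t'])
    state['v'] = beta * state['v'] - lr * grad  # momentum accumulator
    return w + state['v']
```

On a toy quadratic objective this update converges to the minimum, which is the behavior the training loop relies on.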
And 5, the features obtained in step 4 are input into a generative adversarial network (GAN) to generate new spectral images; the generated images are added to the original spectrogram data, and the training of step 4 is then repeated. The GAN is as follows: its structure is based on DCGAN, simplified and with adjusted parameters. The network model comprises a generator and a discriminator. The generator network consists of 1 fully connected layer, 3 transposed convolutional layers, and 2 batch normalization layers, and outputs a color picture of size M × M × 3. The discriminator is a 7-layer convolutional neural network consisting of 3 convolutional layers, 2 batch normalization layers, and 2 fully connected layers, the last with a softmax function, and finally outputs a probability value. A probability threshold λ is set; when, after repeated training, the probability the discriminator assigns to a generator output exceeds λ, that spectrogram is stored. The qualifying generated spectrograms are passed into the convolutional network of step 4 for retraining.
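The selection-and-augmentation rule at the end of step 5 reduces to a simple filter; a sketch under the assumption that the discriminator's probabilities for the generated batch are already available:

```python
def augment_dataset(train_images, generated, disc_prob, lam=0.8):
    """Keep a generated spectrogram only when the discriminator assigns it
    a probability above the threshold lambda; the kept images are appended
    to the real training data before the 2D CNN is retrained."""
    kept = [g for g, p in zip(generated, disc_prob) if p > lam]
    return train_images + kept
```
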
And 6, the MFCC features extracted in step 3 are fused with the features obtained in step 4 from the augmented spectrogram data, and the dimensionality is reduced through a fully connected layer, specifically as follows: the P-dimensional MFCC feature extracted by the CNN is concatenated with the O-dimensional spectrogram feature to obtain a (P + O)-dimensional feature, which a fully connected layer reduces to 256 dimensions.
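A sketch of the fusion and dimension reduction in step 6; the dense-layer weights W and bias b are placeholder parameters (in the method itself they are learned), and the ReLU follows the activation used elsewhere in the network:

```python
import numpy as np

def fuse_features(mfcc_feat, spec_feat, W, b):
    """Concatenate the P-dim MFCC feature and the O-dim spectrogram feature
    into a (P+O)-dim vector, then a dense layer maps it to 256 dims."""
    fused = np.concatenate([mfcc_feat, spec_feat])  # (P + O,)
    return np.maximum(W @ fused + b, 0.0)           # dense + ReLU -> (256,)
```
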
And 7, a classifier is trained on the dimensionality-reduced features obtained in step 6, specifically as follows:
and 7.1, the target patient's voice is used as the test voice, and the voice data of known depression patients as training data. The training data comprise voice recordings of X individuals; the labels indicating whether each of the X individuals suffers from depression form a label dictionary, each label has a corresponding index number, and the label index numbers are used as the class indices. After one test, the spectrogram generated for the target patient is added to the training data set.
Step 7.2, for each label, the voices of subjects suffering from depression form the positive sample set and the voices of subjects without depression form the negative sample set. A binary SVM is trained on the positive and negative sample sets to obtain the trained binary SVM. The classifier training process is specifically as follows:
The kernel function parameter and the penalty factor of the SVM are determined by cyclically checking the accuracy on the SVM training set; the optimal pair of these two parameters is selected and the model is then trained with them. Let the training voice data be {xi, yi}, xi ∈ R^n, i = 1, 2, ..., n, where xi is the (O + P)-dimensional feature vector and yi the depression label. The SVM maps the training set into a high-dimensional space with a nonlinear mapping Φ(x); the optimal separating hyperplane that makes the nonlinear problem linearly separable is described as y = ω^T Φ(x) + b, where ω and b denote the weight and bias of the SVM.

To find the optimal ω and b, a slack factor ξi is introduced and the separating-plane problem is transformed into the quadratic optimization problem:

min (1/2)‖ω‖² + C Σi ξi
s.t. yi(ω · Φ(xi) + b) ≥ 1 − ξi,
ξi ≥ 0, i = 1, 2, ..., n

in the formula, C denotes the penalty parameter. Introducing Lagrange multipliers αi transforms the quadratic optimization problem into its dual, whose solution gives the weight vector ω = Σi αi yi Φ(xi). The decision function of the support vector machine can then be described as f(x) = sgn(Σi αi yi Φ(xi) · Φ(x) + b). To simplify computation, a Gaussian radial basis function (RBF) kernel K(xi, x) = exp(−‖x − xi‖² / (2σ²)) is introduced, giving the decision function f(x) = sgn(Σi αi yi K(xi, x) + b),
where σ represents the width parameter of the RBF.
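The RBF-kernel decision function above can be sketched directly; the support vectors, multipliers αi, labels yi, and bias b below are placeholders standing in for the quantities obtained by solving the dual problem:

```python
import numpy as np

def rbf_kernel(xi, x, sigma=1.0):
    """Gaussian RBF kernel K(xi, x) = exp(-||x - xi||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def svm_decide(x, support_vecs, alphas, labels, b, sigma=1.0):
    """Decision function f(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b)."""
    s = sum(a * y * rbf_kernel(sv, x, sigma)
            for sv, a, y in zip(support_vecs, alphas, labels))
    return 1 if s + b >= 0 else -1
```
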
And 8, the test voice is recognized by the classifier trained in step 7. The recognition result can be sent to the patient's guardian over Wi-Fi, so that the patient's condition can be monitored at any time.
In this way, the microphone used for voice acquisition in the microphone-array-based depression detection method is easy to carry and can collect the patient's voice signals in a natural state. Building on depression-recognition research that combines CNNs, MFCC features, and GAN-augmented data, the method exploits the complementary advantages of MFCC and CNN features to improve the accuracy of depression recognition outside the laboratory.
A depression recognition test was performed with the AVEC2013 audio-visual depression recognition challenge database using the microphone-array-based depression detection method of the present invention; the data set contains 340 voice recordings. The specific operation is as follows:
step 1, the voice signals under each subdirectory are preprocessed in turn by traversal and divided into frames by a Hamming window function. A discrete Fourier transform is computed for each frame and only the logarithm of the amplitude spectrum is retained. After the spectrum is smoothed, 24 spectral components spanning the mel frequency range (44100 Hz sampling) are collected. The components of the mel spectral vector computed for each frame are highly correlated, so the Karhunen-Loeve (KL) transform is approximated by a discrete cosine transform (DCT).
Step 2, MFCC features are extracted from the preprocessed signals and normalized. Each voice segment is limited to 10 seconds; at 50 frames per second, a 177-dimensional feature vector is obtained per frame, and the number of channels per voice is set to 50. The voice signal is then converted into a spectrogram, with the number of sampled frames limited to 64 per second; this yields a 64 × 64 × 3-pixel color spectrogram image, which is resized to 200 × 200 × 3 pixels.
And 4, the convolution and pooling layers are built as a 7-layer convolutional neural network model consisting of 3 convolutional layers, 3 max-pooling layers, and 1 fully connected layer. The first layer takes the 200 × 200 × 3 spectrogram as input and convolves it with 3 × 3 × 3 kernels that slide along the x and y axes of the image with a stride of 1 pixel; 64 kernels in total produce a 198 × 198 × 64 pixel layer. A ReLU is used as the activation function; the pixel layer is processed by the ReLU units into an activated layer, which is max-pooled with a 2 × 2 window and the default stride of 2, giving a pooled size of 99 × 99 × 64. During backpropagation, each convolution kernel has a bias value, i.e., the 64 kernels of the first layer correspond to 64 biases on the input from the layer above. The second layer uses 32 kernels of size 3 × 3 × 64 and produces a 97 × 97 × 32 pixel layer after convolution; after ReLU activation and 2 × 2 max pooling the size is 48 × 48 × 32, and a Dropout layer then randomly drops input neurons with probability 10% during parameter updates to prevent overfitting. In this layer's backpropagation each kernel again has a bias, i.e., the 32 kernels of the second layer correspond to 32 biases. Similarly, the third layer uses 32 kernels of size 3 × 3 × 32 and produces a 46 × 46 × 32 pixel layer after convolution.
After ReLU activation and 2 × 2 max pooling, the size is 23 × 23 × 32, and a Dropout layer again randomly drops input neurons with probability 10% during parameter updates. A flattening layer converts the multi-dimensional input into a one-dimensional array containing 16928 values in total, which is then fed into the fully connected layer for the next operation.
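The layer sizes quoted above are mutually consistent under the standard valid-convolution and pooling size formulas, which can be checked directly:

```python
def conv_out(n, k, stride=1):
    """Output width of a valid convolution: (n - k) // stride + 1."""
    return (n - k) // stride + 1

def pool_out(n, p=2):
    """Output width of p x p max pooling with stride p."""
    return n // p

assert conv_out(200, 3) == 198 and pool_out(198) == 99   # layer 1
assert conv_out(99, 3) == 97 and pool_out(97) == 48      # layer 2
assert conv_out(48, 3) == 46 and pool_out(46) == 23      # layer 3
assert 23 * 23 * 32 == 16928                             # flattened size
```
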
In order to extract spectrogram features to feed into the GAN for generating new spectrograms, the multi-dimensional spectrogram features must be reduced in dimension. A fully connected (Dense) layer is built that connects the 16928 inputs to 128 neural units; 128 values are produced after a ReLU activation, and after Dropout processing 128 values are output as the speech-emotion feature.
And step 5, the GAN generator network model consists of 1 fully connected layer, 3 transposed convolutional layers, and 2 batch normalization layers. The first layer takes the 128 values extracted in step 4 as input, connects them to 4608 neurons through the fully connected layer, and reshapes them to 3 × 3 × 512. The second layer uses a transposed convolution (kernel_size 3, stride 3) to reduce 512 channels to 256 and passes through a batch normalization layer; the third layer uses a transposed convolution (kernel_size 5, stride 2) to reduce 256 channels to 128, again followed by batch normalization; the fourth layer uses a transposed convolution (kernel_size 4, stride 3) to reduce 128 channels to 3.
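The generator shapes are mutually consistent under the usual no-padding transposed-convolution size formula out = (in − 1) · stride + kernel, which maps the 3 × 3 reshaped tensor up to the 64 × 64 output:

```python
def tconv_out(n, k, stride):
    """Output width of a no-padding transposed convolution."""
    return (n - 1) * stride + k

assert 3 * 3 * 512 == 4608        # dense layer reshaped to 3 x 3 x 512
assert tconv_out(3, 3, 3) == 9    # layer 2: kernel_size 3, stride 3
assert tconv_out(9, 5, 2) == 21   # layer 3: kernel_size 5, stride 2
assert tconv_out(21, 4, 3) == 64  # layer 4: 64 x 64 x 3 output
```
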
the GAN discriminator network model of the invention uses 7 layers of convolutional neural network model and consists of 3 convolutional layers, 2 batch normalization layers and 1 full connection layer. The input data of the first layer is a spectrogram of 64 multiplied by 3, a convolution operation is carried out on the spectrogram by adopting a convolution kernel of 5 multiplied by 3, the convolution kernel moves along the x axis and the y axis of the image, the step length is 1 pixel, 64 convolution kernels are used together to generate data of 60 multiplied by 24 pixel layers, a Leakly-ReLU function is used as an activation function, and the pixel layers are processed by a Leakly-ReLU unit to generate an activation pixel layer; the second layer uses 128 5 × 5 × 128 convolution kernels, and generates 57 × 57 × 128 pixel layers after convolution operation. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer to prevent overfitting; the third layer generates 53 × 53 × 256 pixel layers by convolution operation using 256 5 × 5 × 256 convolution kernels. The pixel layers are processed by a Leakly-ReLU unit to generate activated pixels, and the activated pixel layers are processed by a batch normalization layer to prevent overfitting; using a flattening layer to carry out one-dimensional input, carrying out flattening treatment, then using the pixels as input to transmit the input into a full-connection layer, wherein the last layer of output layer is provided with 1 node, and outputting a probability value; the standard-compliant 64 × 64 × 3 generated spectrogram size is modified to 200 × 200 × 3 pixel size and introduced into the convolutional network of step 4 for retraining.
And 6, a fully connected layer is built. The 1800-dimensional data extracted in step 3 and the 16928-dimensional data extracted in step 4 are concatenated into 18728-dimensional data, which are fully connected to 256 neural units; 256 values are produced after a ReLU activation and, after Dropout processing, output as the speech-emotion feature.
Step 7, because the data set contains 292 people, 43,800 voice segments in total are used after clipping and screening. The labels indicating whether each of the 292 people suffers from depression form a label dictionary; each label has a corresponding index number, and the label index numbers are used as the class indices. 90% of the voice signals are used as the training set and the remaining 10% as the test set;
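A sketch of the 90%/10% split described in step 7, under illustrative assumptions (the shuffling seed is a placeholder, and each sample is taken to be a (voice_clip, label) pair keyed by the label dictionary):

```python
import random

def split_dataset(samples, test_frac=0.1, seed=0):
    """samples: list of (voice_clip, label) pairs; returns a shuffled
    90%/10% train/test split of the clips."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(samples) * test_frac)
    test = [samples[i] for i in idx[:n_test]]
    train = [samples[i] for i in idx[n_test:]]
    return train, test
```
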
for each label, the voices of subjects suffering from depression form the positive sample set and the voices of subjects without depression form the negative sample set. A binary SVM is trained on the positive and negative sample sets to obtain the trained binary SVM;
and 8, the test voice is recognized by the binary SVM trained in step 7.
Claims (8)
1. A microphone array based depression detection method, comprising the steps of:
step 1, collecting a voice signal of a target patient by using a microphone array and preprocessing the voice signal;
step 2, extracting MFCC features from the audio signal of the target patient preprocessed in step 1 and from the voice data of known depression patients, and generating audio spectrograms;
step 3, sending the MFCC features extracted in the step 2 into a 1D convolutional neural network to obtain P-dimensional features of the MFCC;
step 4, sending the audio frequency spectrogram generated in the step 2 into a 2D convolutional neural network to obtain O-dimensional characteristics of the spectrogram;
step 5, inputting the O-dimensional features obtained in step 4 into a generative adversarial network to generate new spectral images, and passing the generated spectral images into the 2D convolutional neural network of step 4 for training;
step 6, fusing the P-dimensional characteristics of the MFCC extracted in the step 3 and the characteristics obtained by training in the step 5, and reducing the dimensions through a full connection layer;
step 7, training a classifier with the dimensionality-reduced features obtained in step 6;
and 8, identifying the test voice through the classifier trained in the step 7 to obtain an identification result.
2. The microphone array based depression detection method as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1.1, acquiring a target patient voice signal through a quaternary cross microphone array;
step 1.2, performing framing and windowing on the collected voice signal of the target patient, converting the signal from the time domain to the frequency domain by fast Fourier transform, estimating the spectral factor by computing the smoothed power spectrum and the noise power spectrum, outputting the signal after spectral subtraction, and finally detecting the target patient's voice signal in combination with the energy-entropy ratio to obtain the voice endpoint values;
step 1.3, combining the end point detection result, and judging the position of the sound source signal by using a DOA (direction of arrival) positioning method;
and step 1.4, synthesizing four paths of signals into one path of signal through a super-directivity beam forming algorithm according to the voice signals subjected to endpoint detection and sound source positioning processing, and realizing synthesis, noise reduction and enhancement of the microphone array signals.
3. A microphone array based depression detection method according to claim 2, wherein the step 2 comprises the following steps:
step 2.1, firstly dividing the voice signal into frames through a Hamming window function; computing a discrete Fourier transform for each frame and retaining only the logarithm of the amplitude spectrum; after the spectrum is smoothed, collecting 24 spectral components spanning the mel frequency range (44100 Hz sampling), and approximating the Karhunen-Loeve transform by a discrete cosine transform; finally obtaining a cepstral feature vector [f1, f2, ..., fN] for each frame;
step 2.2, according to the set frame number, performing framing and windowing on the target patient's voice signal, performing a short-time Fourier transform on the discrete voice signal x(m), and computing the power spectrum of the m-th frame to obtain a spectrogram; selecting L filters and L frames of the same size as the filters in the time direction generates an L × L × 3 spectrogram, and the generated color image is resized to M × M × 3.
4. The microphone-array-based depression detection method as claimed in claim 3, wherein the 1D convolutional neural network of step 3 is: only two 1D convolutional layers are built using the open-source TensorFlow-based Keras framework, each adopting a rectified linear unit as the activation function; the input, of dimension M × 1, passes through w1 convolutional filters of size m × 1 with a dropout of 0.1 and a max-pooling stride of q1, and a feature vector of size S is output; in the training stage of the 1D convolutional neural network, the MFCC features of each voice frame, containing time-frequency information, are read into memory sequentially by traversal, the data are divided into a training set and a test set, labels are added to each, the processed data are passed into the convolutional network according to the set labels, and iterative training is performed for B iterations in total.
5. The microphone-array-based depression detection method of claim 4, wherein the 2D convolutional neural network of step 4 is: using the open-source TensorFlow-based Keras framework, a convolutional neural network is built containing w2 two-dimensional convolutional layers of size n × n, w1 max-pooling layers, and 1 fully connected layer with output dimension L, wherein rectified linear units are adopted as the activation function in both the convolutional and fully connected layers; in the training stage, the spectrogram features of each voice frame, containing texture-like information, are read into memory sequentially by traversal, the data are divided into a training set and a test set, labels are added to each, the processed data are passed into the convolutional network according to the set labels, and iterative training is performed for B iterations in total; the network is trained with stochastic gradient descent as the optimizer, with learning rate ε, a learning-rate decay of μ after each update, and momentum β.
6. The microphone-array-based depression detection method of claim 5, wherein the adversarial generation network of step 5 is: based on the DCGAN network structure, simplified and with adjusted parameters; the network model comprises a generator and a discriminator; the generator network consists of 1 fully connected layer, 3 transposed convolutional layers, and 2 batch normalization layers, and outputs a color picture of size M × M × 3; the discriminator is a 7-layer convolutional neural network consisting of 3 convolutional layers, 2 batch normalization layers, and 2 fully connected layers, the last with a softmax function, and finally outputs a probability value; a probability threshold λ is set, and when the probability generated by the discriminator after repeated training exceeds λ, the spectrogram generated by the generator is stored.
7. The microphone-array-based depression detection method according to claim 6, wherein step 6 is specifically: the P-dimensional MFCC feature extracted by the 1D convolutional neural network is fused with the O-dimensional spectrogram feature to obtain a (P + O)-dimensional feature, whose dimension is reduced to 256 through a fully connected layer.
8. The microphone array-based depression detection method according to claim 7, wherein the step 7 is specifically:
step 7.1, taking the voice of the target patient as a test voice, and taking the voice data of the existing depression patient as training data; the training data comprises voice information of X individuals, tags of whether the X individuals suffer from depression are used as a tag dictionary, each tag has a corresponding index number, and the tag index numbers are set as index numbers of the class; after one test, adding a spectrogram generated by a target patient into a training data set;
and 7.2, for each label, the voices of subjects suffering from depression are used as the positive sample set and the voices of subjects without depression as the negative sample set, and a binary SVM is trained on the positive and negative sample sets to obtain the trained binary SVM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011248610.5A CN112349297B (en) | 2020-11-10 | 2020-11-10 | Depression detection method based on microphone array |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112349297A true CN112349297A (en) | 2021-02-09 |
CN112349297B CN112349297B (en) | 2023-07-04 |
Family
ID=74362344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011248610.5A Active CN112349297B (en) | 2020-11-10 | 2020-11-10 | Depression detection method based on microphone array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112349297B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112687390A (en) * | 2021-03-12 | 2021-04-20 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN112818892A (en) * | 2021-02-10 | 2021-05-18 | 杭州医典智能科技有限公司 | Multi-modal depression detection method and system based on time convolution neural network |
CN113012720A (en) * | 2021-02-10 | 2021-06-22 | 杭州医典智能科技有限公司 | Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction |
CN113205803A (en) * | 2021-04-22 | 2021-08-03 | 上海顺久电子科技有限公司 | Voice recognition method and device with adaptive noise reduction capability |
CN113223507A (en) * | 2021-04-14 | 2021-08-06 | 重庆交通大学 | Abnormal speech recognition method based on double-input mutual interference convolutional neural network |
CN113476058A (en) * | 2021-07-22 | 2021-10-08 | 北京脑陆科技有限公司 | Intervention treatment method, device, terminal and medium for depression patients |
CN113679413A (en) * | 2021-09-15 | 2021-11-23 | 北方民族大学 | VMD-CNN-based lung sound feature identification and classification method and system |
CN113820693A (en) * | 2021-09-20 | 2021-12-21 | 西北工业大学 | Uniform linear array element failure calibration method based on generation of countermeasure network |
CN114219005A (en) * | 2021-11-17 | 2022-03-22 | 太原理工大学 | Depression classification method based on high-order spectral voice features |
CN116978409A (en) * | 2023-09-22 | 2023-10-31 | 苏州复变医疗科技有限公司 | Depression state evaluation method, device, terminal and medium based on voice signal |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107705806A (en) * | 2017-08-22 | 2018-02-16 | 北京联合大学 | A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks |
CN108831495A (en) * | 2018-06-04 | 2018-11-16 | 桂林电子科技大学 | A kind of sound enhancement method applied to speech recognition under noise circumstance |
CN109599129A (en) * | 2018-11-13 | 2019-04-09 | 杭州电子科技大学 | Voice depression recognition methods based on attention mechanism and convolutional neural networks |
CN110047506A (en) * | 2019-04-19 | 2019-07-23 | 杭州电子科技大学 | A kind of crucial audio-frequency detection based on convolutional neural networks and Multiple Kernel Learning SVM |
- 2020-11-10 CN CN202011248610.5A patent/CN112349297B/en active Active
Non-Patent Citations (4)
Title |
---|
LE YANG et al.: "Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals", IEEE ACCESS *
ZHIYONG WANG et al.: "Recognition of Audio Depression Based on Convolutional Neural Network and Generative Antagonism Network Model", IEEE ACCESS *
LI Jinming et al.: "Audio depression recognition based on deep learning", Computer Applications and Software *
ZHONG Xinzi et al.: "Research on speech emotion recognition methods based on autoencoders", Electronic Design Engineering, no. 06 *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818892A (en) * | 2021-02-10 | 2021-05-18 | 杭州医典智能科技有限公司 | Multi-modal depression detection method and system based on time convolution neural network |
CN113012720A (en) * | 2021-02-10 | 2021-06-22 | 杭州医典智能科技有限公司 | Depression detection method by multi-voice characteristic fusion under spectral subtraction noise reduction |
CN113012720B (en) * | 2021-02-10 | 2023-06-16 | 杭州医典智能科技有限公司 | Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction |
CN112687390B (en) * | 2021-03-12 | 2021-06-18 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN112687390A (en) * | 2021-03-12 | 2021-04-20 | 中国科学院自动化研究所 | Depression state detection method and device based on hybrid network and lp norm pooling |
CN113223507B (en) * | 2021-04-14 | 2022-06-24 | 重庆交通大学 | Abnormal speech recognition method based on double-input mutual interference convolutional neural network |
CN113223507A (en) * | 2021-04-14 | 2021-08-06 | 重庆交通大学 | Abnormal speech recognition method based on double-input mutual interference convolutional neural network |
CN113205803A (en) * | 2021-04-22 | 2021-08-03 | 上海顺久电子科技有限公司 | Voice recognition method and device with adaptive noise reduction capability |
CN113205803B (en) * | 2021-04-22 | 2024-05-03 | 上海顺久电子科技有限公司 | Voice recognition method and device with self-adaptive noise reduction capability |
CN113476058B (en) * | 2021-07-22 | 2022-11-29 | 北京脑陆科技有限公司 | Intervention treatment method, device, terminal and medium for depression patients |
CN113476058A (en) * | 2021-07-22 | 2021-10-08 | 北京脑陆科技有限公司 | Intervention treatment method, device, terminal and medium for depression patients |
CN113679413A (en) * | 2021-09-15 | 2021-11-23 | 北方民族大学 | VMD-CNN-based lung sound feature identification and classification method and system |
CN113679413B (en) * | 2021-09-15 | 2023-11-10 | 北方民族大学 | VMD-CNN-based lung sound feature recognition and classification method and system |
CN113820693A (en) * | 2021-09-20 | 2021-12-21 | 西北工业大学 | Uniform linear array element failure calibration method based on generation of countermeasure network |
CN113820693B (en) * | 2021-09-20 | 2023-06-23 | 西北工业大学 | Uniform linear array element failure calibration method based on generation of countermeasure network |
CN114219005A (en) * | 2021-11-17 | 2022-03-22 | 太原理工大学 | Depression classification method based on high-order spectral voice features |
CN116978409A (en) * | 2023-09-22 | 2023-10-31 | 苏州复变医疗科技有限公司 | Depression state evaluation method, device, terminal and medium based on voice signal |
Also Published As
Publication number | Publication date |
---|---|
CN112349297B (en) | 2023-07-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112349297B (en) | Depression detection method based on microphone array | |
CN111243620B (en) | Voice separation model training method and device, storage medium and computer equipment | |
US10127922B2 (en) | Sound source identification apparatus and sound source identification method | |
CN109841226A (en) | A kind of single channel real-time noise-reducing method based on convolution recurrent neural network | |
JPH02160298A (en) | Noise removal system | |
CN108922543A (en) | Model library method for building up, audio recognition method, device, equipment and medium | |
WO2019232867A1 (en) | Voice discrimination method and apparatus, and computer device, and storage medium | |
Venkatesan et al. | Binaural classification-based speech segregation and robust speaker recognition system | |
Salvati et al. | A late fusion deep neural network for robust speaker identification using raw waveforms and gammatone cepstral coefficients | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
CN113314127B (en) | Bird song identification method, system, computer equipment and medium based on space orientation | |
US6567771B2 (en) | Weighted pair-wise scatter to improve linear discriminant analysis | |
AU2362495A (en) | Speech-recognition system utilizing neural networks and method of using same | |
Lin et al. | Domestic activities clustering from audio recordings using convolutional capsule autoencoder network | |
CN109741733B (en) | Voice phoneme recognition method based on consistency routing network | |
CN115952840A (en) | Beam forming method, arrival direction identification method, device and chip thereof | |
Raju et al. | Automatic speech recognition system using MFCC-based LPC approach with back-propagated artificial neural networks | |
Kothapally et al. | Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments. | |
Sailor et al. | Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection. | |
Venkatesan et al. | Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
CN113903344A (en) | Deep learning voiceprint recognition method based on multi-channel wavelet decomposition common noise reduction | |
CN112259107A (en) | Voiceprint recognition method under meeting scene small sample condition | |
Jannu et al. | An Overview of Speech Enhancement Based on Deep Learning Techniques | |
CN114512133A (en) | Sound object recognition method, sound object recognition device, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||