CN110428364B - Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium - Google Patents


Info

Publication number
CN110428364B
CN110428364B (application CN201910720986.2A; publication of application CN110428364A)
Authority
CN
China
Prior art keywords
spectrogram
pictures
voiceprint
sample
parkinson
Prior art date
Legal status: Active
Application number
CN201910720986.2A
Other languages
Chinese (zh)
Other versions
CN110428364A (en)
Inventor
王娟
徐志京
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN201910720986.2A priority Critical patent/CN110428364B/en
Publication of CN110428364A publication Critical patent/CN110428364A/en
Application granted granted Critical
Publication of CN110428364B publication Critical patent/CN110428364B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/22 Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06T 3/04 Geometric image transformations in the plane of the image; context-preserving transformations, e.g. by using an importance map
    • G06T 7/90 Image analysis; determination of colour characteristics
    • G10L 17/02 Speaker identification or verification; preprocessing operations, e.g. segment selection; pattern representation or modelling; feature selection or extraction
    • G10L 17/22 Speaker identification or verification; interactive procedures; man-machine interfaces
    • G10L 19/02 Speech or audio analysis-synthesis for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G06T 2207/20048, G06T 2207/20056 Transform domain processing; discrete and fast Fourier transform [DFT, FFT]
    • G06T 2207/20081 Training; learning


Abstract

The invention provides a method for expanding samples of a Parkinson voiceprint spectrogram, which comprises the following steps: acquiring a plurality of audios containing vowel pronunciation and segmenting them to obtain corresponding spectrograms; converting the obtained spectrograms into gray-scale spectrograms by Fourier transform; converting the gray-scale spectrograms into pseudo-color spectrograms; converting the pseudo-color spectrograms into a plurality of pictures at preset resolutions, and allocating a first label or a second label to each picture; training on the plurality of pictures through an HR-DCGAN model to generate trained pictures corresponding to the plurality of pictures; obtaining the similarity between the trained pictures and the plurality of pictures; and judging one by one, according to the similarity values, whether each of the trained pictures is to be used as an expanded sample of Parkinson patient speech. In addition, the invention also discloses a Parkinson voiceprint spectrogram sample expansion device and a computer storage medium.

Description

Method and device for expanding samples of Parkinson voiceprint speech spectrogram and computer storage medium
Technical Field
The invention relates to the technical field of voice processing, in particular to a method and a device for expanding a Parkinson voiceprint spectrogram sample and a computer storage medium.
Background
Voiceprints are important biological characteristics of human beings. Parkinson's Disease (PD) is a common degenerative disease of the nervous system, and about 90% of PD patients show vocal cord impairment among their early symptoms, so voiceprints can be applied to the discrimination of diseases such as Parkinson's. However, existing patient voiceprint data sets are small and samples are difficult to obtain, so deep learning algorithms overfit easily and cannot achieve good results. Consequently, when a deep learning algorithm is used to diagnose Parkinson's disease, sample expansion is an urgent problem to be solved.
The audio signal can be converted into a spectrogram, from which a neural network can identify and extract the important voiceprint features related to the research target so as to classify the image automatically. As the number of network layers grows, deep networks show strong performance in classification and recognition; however, a deep convolutional neural network is a data-driven model and relies on a large number of samples to reach its full efficiency. Because samples are scarce, many researchers have not applied deep convolutional neural networks to PD identification.
To expand the samples, traditional image augmentation methods include cropping, flipping, translating, scaling and contrast transformation. These operations change or destroy the voiceprint feature information in the spectrogram and reduce classification and recognition accuracy, so they are not suitable for expanding samples of this category.
Disclosure of Invention
In view of the above disadvantages of the prior art, an object of the present invention is to provide a method and an apparatus for expanding Parkinson voiceprint spectrogram samples. Trained samples are generated through a DCGAN model, their similarity to the original samples is calculated, and samples are selected from the trained samples for expansion. This solves the prior-art problem that augmentation destroys the voiceprint feature information of the spectrogram and affects classification accuracy. The expanded samples are then placed into the voiceprint spectrogram sample library and applied to the identification of Parkinson patients, improving the identification accuracy for PD patients under small-sample conditions.
In order to achieve the above objects and other related objects, the present invention provides a method for expanding a sample of a parkinsonism voiceprint spectrogram, the method comprising:
acquiring a plurality of audios containing vowel pronunciation and segmenting them to obtain corresponding spectrograms;
converting the obtained spectrogram into a gray spectrogram according to Fourier transform;
converting the gray scale spectrogram into a pseudo color spectrogram;
converting the pseudo-color spectrogram into a plurality of pictures according to a preset resolution, and allocating a first label and a second label to each picture, wherein the first label is a spectrogram label corresponding to a Parkinson's disease person, and the second label is a spectrogram label corresponding to a non-Parkinson's disease person;
training the plurality of pictures through an HR-DCGAN model to generate trained pictures corresponding to the plurality of pictures;
obtaining the similarity between the trained picture and the plurality of pictures;
and judging whether each picture in the trained pictures is used as an extended sample of the speech of the Parkinson patient one by one according to the similarity value.
In one implementation manner of the present invention, the step of obtaining and segmenting a plurality of audios including vowel pronunciation to obtain corresponding spectrogram includes:
acquiring a plurality of audios each containing three consecutive vowel pronunciations with a pronunciation duration of 6 s;
cutting each audio into three 2 s audio segments;
and preprocessing each audio segment to obtain a spectrogram, wherein the preprocessing comprises pre-emphasis, framing, windowing and endpoint detection.
In an implementation manner of the present invention, the step of converting the obtained spectrogram into a grayscale spectrogram according to fourier transform includes:
performing Fourier transform on the spectrogram;
performing fast Fourier transform on each frame of signal to obtain short-time Fourier transform and obtain a corresponding short-time power spectrum;
and connecting to form a gray-scale speech spectrogram according to the short-time power spectrum.
In an implementation manner of the present invention, the step of converting the pseudo color spectrogram into a plurality of pictures according to a preset resolution, and allocating a first tag and a second tag to each picture includes:
performing redundancy removing operation on the pseudo color spectrogram, wherein the redundancy removing operation comprises coordinate axis removing and white edge removing operation;
converting the pseudo-color spectrogram after the redundancy removing operation into spectrograms in JPEG and PNG formats at preset resolutions, wherein the preset resolutions are: 128×128, 256×256, 512×512 and 1024×1024;
and adding a label to each spectrogram.
In an implementation manner of the present invention, the step of obtaining the similarity between the trained picture and the plurality of pictures includes:
performing image blocking processing on each image in the plurality of pictures and the corresponding trained image to obtain a first number of image blocks;
for each image block, performing the following: obtaining the mean, variance and covariance of the corresponding image blocks, and calculating the similarity of the two image blocks from them, thereby obtaining a first number of similarity values;
calculating an average of the first number of similarity values;
and taking the average value as the similarity of the trained picture and the plurality of pictures.
In an implementation manner of the present invention, the formula adopted for calculating the similarity between the two image blocks is specifically expressed as:

$$\mathrm{SSIM}(x,g)=\frac{(2\mu_x\mu_g+c_1)(2\sigma_{xg}+c_2)}{(\mu_x^2+\mu_g^2+c_1)(\sigma_x^2+\sigma_g^2+c_2)}$$

wherein μ_x and σ_x² are the mean and variance of the image block from the plurality of pictures, μ_g and σ_g² are the mean and variance of the corresponding image block from the trained picture, σ_xg is the covariance of the two image blocks, and c_1, c_2 are constants.
In an implementation manner of the present invention, the step of determining one by one whether each of the trained pictures is used as an extended sample of the parkinson's patient's voice according to the similarity value includes:
acquiring a preset comparison value;
judging whether a target picture exists among the trained pictures whose similarity value with the corresponding pre-training picture is not less than the preset comparison value;
and determining such a target picture to be an expanded sample of the Parkinson patient's speech.
In addition, the invention also discloses a Parkinson voiceprint spectrogram sample expansion device, which comprises a processor and a memory connected with the processor through a communication bus, wherein:
the memory is used for storing a Parkinson voiceprint spectrogram sample expansion program;
the processor is used for executing the Parkinson voiceprint spectrogram sample expansion program so as to realize the steps of any one of the above Parkinson voiceprint spectrogram sample expansion methods.
Also, a computer storage medium is disclosed, the computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to cause the one or more processors to perform any of the parkinsonian voiceprint spectrogram sample augmentation steps.
As described above, according to the Parkinson voiceprint spectrogram sample expansion method, device and computer storage medium provided by the embodiments of the present invention, trained samples are generated through the DCGAN model and their similarity to the original samples is calculated, so that samples can be selected from the trained samples for expansion. This solves the prior-art problem that augmentation destroys the voiceprint feature information of the spectrogram and affects classification accuracy. The expanded samples are then placed into the voiceprint spectrogram sample library and applied to the identification of Parkinson patients, improving the identification accuracy for PD patients under small-sample conditions.
Drawings
Fig. 1 is a schematic flow chart of a method for expanding a sample of a parkinsonian voiceprint spectrogram according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a first specific application of the method for expanding a parkinsonian voiceprint spectrogram sample according to the embodiment of the present invention.
Fig. 3 is a schematic diagram of a second specific application of the method for expanding a sample of a parkinsonian voiceprint spectrogram according to the embodiment of the present invention.
Fig. 4 is a schematic diagram of a third specific application of the method for expanding a sample of a parkinsonian voiceprint spectrogram according to the embodiment of the present invention.
Fig. 5 is a PD patient speech spectrum diagram of a parkinson voiceprint spectrogram sample expansion method according to an embodiment of the present invention.
Fig. 6 is a non-PD patient speech spectrum diagram of a parkinson voiceprint spectrogram sample expansion method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Please refer to fig. 1-6. It should be noted that the drawings provided in this embodiment are only for schematically illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings and not drawn according to the number, shape and size of the components in actual implementation, and the form, quantity and proportion of each component in actual implementation may be arbitrarily changed, and the component layout may be more complicated.
As shown in fig. 1, an embodiment of the present invention provides a method for expanding a sample of a parkinsonism voiceprint spectrogram, where the method includes the following steps:
s101, acquiring and dividing a plurality of audios containing vowel pronunciations to obtain corresponding spectrogram.
In a specific implementation, the invention adopts the Sakar data set. Each audio sample of a participant comprises three consecutive vowel pronunciations; each pronunciation lasts 6 s, followed by 3 s to 6 s of rest, and all voice recordings are in wav format. Each pronunciation is divided into 2 s voice segments, so one wav voice file yields three 2 s voice segments.
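The 2 s segmentation described above can be sketched as follows. This is an illustrative numpy sketch under stated assumptions: the function name is hypothetical, the signal is assumed to be already loaded as a 1-D array, and the patent itself works on wav files with Matlab tooling.

```python
import numpy as np

def split_recording(signal: np.ndarray, sr: int, seg_seconds: float = 2.0) -> list:
    """Split a 1-D audio signal into consecutive fixed-length segments.

    Any partial segment at the tail is discarded, so a 6 s recording
    yields exactly three 2 s segments.
    """
    seg_len = int(sr * seg_seconds)
    n_full = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

# A 6 s recording at 44.1 kHz splits into three 2 s segments of 88200 samples.
sr = 44100
audio = np.zeros(6 * sr)
segments = split_recording(audio, sr)
print(len(segments), len(segments[0]))  # → 3 88200
```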
It should be noted that participants are parkinson patients and non-parkinson patients (healthy persons) who respectively collect their corresponding voices, and therefore, the collected sample includes two types of voices.
And S102, converting the acquired spectrogram into a gray spectrogram according to Fourier transform.
It can be understood that, due to noise in the audio acquisition process, such as environmental background noise, aliasing interference between the pronunciation organs and the acquisition device, and harmonic distortion, the quality of the acquired audio signals is uneven. Preprocessing of the original audio signal is therefore necessary and is an important step affecting recognition accuracy. The preprocessing comprises four processes: pre-emphasis, framing, windowing and endpoint detection.
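The first three preprocessing steps can be sketched as below. This is a minimal numpy sketch: the pre-emphasis coefficient 0.97 is a common default rather than a value fixed by the patent, endpoint detection is omitted for brevity, and the frame length and overlap follow the parameter settings given later for the 44.1 kHz signal.

```python
import numpy as np

def preprocess(signal: np.ndarray, sr: int, alpha: float = 0.97,
               frame_ms: float = 46.44, overlap: float = 0.75) -> np.ndarray:
    """Pre-emphasis, framing and Hamming windowing (endpoint detection omitted).

    Returns an array of shape (n_frames, frame_len) of windowed frames.
    """
    # Pre-emphasis: first-order FIR high-pass, y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(round(sr * frame_ms / 1000.0))       # 2048 samples at 44.1 kHz
    hop = int(round(frame_len * (1.0 - overlap)))        # frame shift = 1/4 frame
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([emphasized[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])
```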
After preprocessing, the audio signal is converted into a spectrogram and a Parkinson voiceprint spectrogram sample library is established. The flow chart is shown in fig. 2, and the steps are as follows:
(1) Performing a short-time Fourier transform on the preprocessed audio signal x(m):

$$X_n(e^{jw})=\sum_{m=-\infty}^{\infty} x(m)\,w(n-m)\,e^{-jwm}$$

wherein w(n) is the window function, for which a Hamming window may specifically be adopted, and X_n(e^{jw}) is a function of both w and n.
(2) Let w = 2πk/N (0 ≤ k ≤ N−1), where N represents the number of points of the fast Fourier transform (FFT). Performing the FFT on each frame signal gives the short-time Fourier transform shown in the formula:

$$X_n(k)=X_n(e^{j2\pi k/N})=\sum_{m=0}^{N-1} x_n(m)\,e^{-j2\pi km/N}$$
(3) Calculating the short-time power spectrum S_n(e^{jw}) from the short-time Fourier transform:

$$S_n(e^{jw})=X_n(e^{jw})\cdot X_n^{*}(e^{jw})=\left|X_n(e^{jw})\right|^{2}$$

wherein

$$S_n(e^{jw})=\sum_{k=-\infty}^{\infty} R_n(k)\,e^{-jwk},\qquad R_n(k)=\sum_{m} x_n(m)\,x_n(m+k)$$

R_n(k) is the short-time autocorrelation function of x(n), and S_n(e^{jw}) is its Fourier transform. With n and w as the horizontal and vertical coordinates, S_n(e^{jw}) gives the gray level of the pixel at point (n, w).
(4) Gray scale map mapping: and sequentially connecting the gray level representations of each frame to generate a gray level spectrogram.
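Steps (2) to (4) can be sketched as a short numpy routine. This is an illustrative sketch, not the patent's Matlab implementation: the dB scaling and min-max normalization to 8-bit gray levels are common display conventions assumed here, and the frames are the windowed frames from the preprocessing stage.

```python
import numpy as np

def gray_spectrogram(frames: np.ndarray, nfft: int = 2048) -> np.ndarray:
    """FFT each windowed frame, take the short-time power spectrum
    S_n = |X_n|^2, and map it to 8-bit gray levels on a dB scale."""
    spec = np.fft.rfft(frames, n=nfft, axis=1)     # short-time Fourier transform
    power = np.abs(spec) ** 2                      # short-time power spectrum
    log_power = 10.0 * np.log10(power + 1e-12)     # dB scale for visibility
    lo, hi = log_power.min(), log_power.max()
    gray = np.round(255.0 * (log_power - lo) / (hi - lo + 1e-12)).astype(np.uint8)
    return gray.T  # frequency on the vertical axis, frame index horizontal
```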
S103, converting the gray-scale spectrogram into a pseudo-color spectrogram.
In order to improve the recognizability of the image content, the pseudo-color mapping function colormap(map) in Matlab2016a (where map is the adopted pseudo-color mapping matrix, jet by default) may be used to display the power spectrum in pseudo color and obtain a pseudo-color spectrogram.
Specifically, when the sampling frequency of the audio signal is known to be 44.1 kHz, the parameters are set as follows: the number of FFT points (NFFT) is 2048, the frame length is 46.44 ms, and the frame shift is 1/4 of the frame length, i.e., adjacent frames overlap by 3/4 of the frame length. The spectrogram generated with these settings has distinct voiceprint features and the clearest texture. The audio signal was converted to a spectrogram using Matlab2016a.
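The pseudo-color step can be sketched without Matlab as below. The piecewise-linear ramps are a rough jet-style approximation assumed for illustration; they are not Matlab's exact jet colormap matrix.

```python
import numpy as np

def jet_like(gray: np.ndarray) -> np.ndarray:
    """Map 8-bit gray levels to RGB with a simple jet-style colormap
    (blue for low values through green to red for high values),
    roughly approximating Matlab's default 'jet'."""
    t = gray.astype(np.float64) / 255.0
    # Three overlapping triangular ramps centred on blue, green and red.
    r = np.clip(1.5 - np.abs(4.0 * t - 3.0), 0.0, 1.0)
    g = np.clip(1.5 - np.abs(4.0 * t - 2.0), 0.0, 1.0)
    b = np.clip(1.5 - np.abs(4.0 * t - 1.0), 0.0, 1.0)
    return np.round(255.0 * np.stack([r, g, b], axis=-1)).astype(np.uint8)
```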
And S104, converting the pseudo-color spectrogram into a plurality of pictures according to a preset resolution ratio, and allocating a first label and a second label to each picture, wherein the first label is a speech spectrum label corresponding to a Parkinson disease person, and the second label is a speech spectrum label corresponding to a non-Parkinson disease person.
Processing and storage during spectrogram visualization: to reduce redundant information in the spectrogram image, the coordinate axes and white margins are removed. Since texture information such as the pitch frequency and harmonics is concentrated in the middle-lower and bottom parts of the spectrogram, the high-frequency region contains little useful information, and the region above 10000 Hz appears as noise and blank space. Therefore, the default frequency display range [0, 20000 Hz] is modified to the fixed range [0, 10000 Hz], so that texture information such as the harmonics and formants can be visualized more clearly.
Spectrograms in JPEG and PNG formats are saved at different resolutions (128×128, 256×256, 512×512 and 1024×1024), and category labels are added; for example, the labels corresponding to the spectrograms of healthy persons and PD patients are '01' and '10' respectively. These spectrograms form the voiceprint spectrogram sample library for later use, which is convenient for processing by the neural network.
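Building the labelled library can be sketched as below. Function names are hypothetical, nearest-neighbour resizing stands in for whatever interpolation the original tooling uses, and file-format saving is omitted; only the label assignment ('01' for healthy, '10' for PD) follows the text.

```python
import numpy as np

# Category labels as described: healthy person → "01", PD patient → "10".
LABELS = {"healthy": "01", "pd": "10"}

def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize to one of the preset square resolutions
    (128, 256, 512 or 1024)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def build_sample_library(items, size: int = 256) -> list:
    """Resize each (category, image) pair and attach its class label,
    forming entries of the voiceprint spectrogram sample library."""
    return [{"label": LABELS[cat], "image": resize_nearest(img, size)}
            for cat, img in items]
```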
And S105, training the pictures through an HR-DCGAN model to generate the trained pictures corresponding to the pictures.
As shown in fig. 3, in order to ensure the quality of the GAN-generated spectrogram and the stability of training, and so that the spectrogram fits the current mature CNN architectures, this study uses spectrograms in JPEG format with a resolution of 256 × 256 × 3 from the voiceprint spectrogram sample library for sample expansion.
The HR-DCGAN model follows the adversarial idea and generates new spectrogram images in an unsupervised manner with a min-max strategy. The generator G takes as input a random noise vector z sampled from a prior probability distribution P_z (uniform or Gaussian), continuously learns the distribution of the real training samples x, and outputs fake samples G(z) that approximate the latent distribution of the real samples. The discriminator D is essentially a classifier: given an input G(z) or x, it computes the probability that the input belongs to the real data distribution P_data, i.e., it determines whether the input comes from the real samples P_data or from the fake samples G(z). The two networks are trained adversarially, alternately updating the parameters of D and G: D is updated to maximize its discrimination accuracy while G is updated to minimize the distribution error between G(z) and P_data, so the performance of both improves continuously. Eventually a Nash equilibrium is reached: when D can no longer correctly estimate whether an input comes from G(z) or P_data, G has learned the distribution space of the real samples x and generates "fake samples" that approximate real spectrogram images.
The architecture of G and D in the HR-DCGAN model is based on the network structure of the DCGAN model, with increased network depth to accommodate spectrograms with a resolution of 256 × 256 × 3. The number of layers of G is increased, and the size of the generated image grows layer by layer as follows:
4×4→8×8→16×16→32×32→64×64→128×128→256×256
and finally a high-resolution spectrogram is generated. The D network increases its number of layers according to the size of the input image so as to down-sample the high-resolution image layer by layer through its convolution layers; the feature map sizes change as follows:
256×256→128×128→64×64→32×32→16×16→8×8→4×4
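The two size progressions above can be generated mechanically, since each stride-2 layer doubles (in G) or halves (in D) the spatial side length. A small sketch, with hypothetical function names:

```python
def generator_sizes(start: int = 4, target: int = 256) -> list:
    """Side lengths of the generator feature maps: each stride-2
    fractionally-strided convolution doubles the spatial size."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(sizes[-1] * 2)
    return sizes

g_sizes = generator_sizes()          # generator:      4 → 8 → ... → 256
d_sizes = list(reversed(g_sizes))    # discriminator mirrors it: 256 → ... → 4
```

The seven entries in each list correspond to the seven-layer G and D networks described below.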
The loss function is the standard GAN min-max objective:

$$\min_G \max_D V(D,G)=\mathbb{E}_{x\sim P_{data}}[\log D(x)]+\mathbb{E}_{z\sim P_z}[\log(1-D(G(z)))]$$
A feature matching method is then added to the model. Let f(x) be the feature map output by an intermediate layer of the discriminator network; the error between the feature statistics of the generated and real data is minimized, with the objective function:

$$\min_G \left\| \mathbb{E}_{x\sim P_{data}} f(x)-\mathbb{E}_{z\sim P_z} f(G(z)) \right\|_2^2$$
The loss function of D is unchanged, and the discriminator network output is maximized as before:

$$\max_D \mathbb{E}_{x\sim P_{data}}[\log D(x)]+\mathbb{E}_{z\sim P_z}[\log(1-D(G(z)))]$$
The loss function of G becomes the sum of the adversarial error on the "fake" samples generated during training and the error of the feature matching process:

$$\min_G \; -\mathbb{E}_{z\sim P_z}[\log D(G(z))]+\left\| \mathbb{E}_{x\sim P_{data}} f(x)-\mathbb{E}_{z\sim P_z} f(G(z)) \right\|_2^2$$
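The losses above can be evaluated numerically as a sanity check. A minimal numpy sketch, assuming the discriminator outputs and intermediate feature maps are given as arrays; in a real system these would come from the networks and be optimized by a deep learning framework.

```python
import numpy as np

def d_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Discriminator loss: negative of E[log D(x)] + E[log(1 - D(G(z)))],
    so minimizing it maximizes the discriminator objective."""
    eps = 1e-12  # numerical guard against log(0)
    return float(-(np.mean(np.log(d_real + eps))
                   + np.mean(np.log(1.0 - d_fake + eps))))

def g_loss(d_fake: np.ndarray, f_real: np.ndarray, f_fake: np.ndarray) -> float:
    """Generator loss: adversarial term plus the feature-matching term
    ||E f(x) - E f(G(z))||^2 over an intermediate discriminator feature map."""
    eps = 1e-12
    adversarial = float(-np.mean(np.log(d_fake + eps)))
    feature_matching = float(np.sum((f_real.mean(axis=0)
                                     - f_fake.mean(axis=0)) ** 2))
    return adversarial + feature_matching
```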
a network model diagram of G and D in HR-DCGAN is shown in FIG. 3. The study uses a pseudo-color spectrogram, so the output of G and the input of D are three channels. In addition, G and D are set to 64 in the first convolutional layer and 2048 in the first fully-connected layer.
As shown in fig. 4, G comprises a 7-layer network. A 100-dimensional noise vector z obeying a Gaussian distribution is taken as input and projected by convolutional up-sampling to a 4 × 4 spatial extent with 2048 feature maps, yielding a 4 × 4 × 2048 tensor. Layers h0 to h5 are fractionally-strided (micro-step) convolution layers with 5 × 5 convolution kernels and stride 2, which perform spatial up-sampling during the learning of G. After Batch Normalization (BN), the units of each hidden layer are normalized to zero mean and unit variance, which stabilizes the learning process, alleviates generator collapse caused by poor initialization, and allows gradients to propagate deeper. The ReLU activation function is then applied. After each fractionally-strided convolution layer, the size of the generated feature maps doubles and their number halves. Layer h6 is activated by the tanh function and finally outputs a 256 × 256 × 3 spectrogram image as the "fake" input to D.
D comprises a 7-layer network; h0 to h5 are convolution layers with 5 × 5 kernels and stride 2. All layers except the input layer of D have a BN layer and the non-linear Leaky ReLU activation function. The convolution layers extract features from the input spectrogram; after each down-sampling convolution layer, the feature map size halves and the number of feature maps doubles. Layer h6 uses the Sigmoid activation function to discriminate between real samples and generated "fake" samples, its output representing the probability that the input image comes from a real sample.
The original spectrogram samples of PD patients and of healthy persons are input into the HR-DCGAN model separately, matching the unsupervised training process of the model.
During spectrogram training for PD patients and healthy persons, g_loss and d_loss oscillate in the early stage and finally converge stably, generating high-resolution samples with similar texture characteristics, and the results can be visualized. Generated images were taken at random under different epochs; a PD patient spectrogram is shown in fig. 5 and a healthy person spectrogram in fig. 6 (both partial results), where (a) to (h) follow the numerical order of the epochs during HR-DCGAN training. At epoch 0, the generated images are only noise points and rough color contours of a spectrogram. At epoch 50, images with clear pitch frequency and formant positions but blurry texture are generated. At epoch 100, the pitch frequency and formants become clearer. At epoch 200, the harmonic texture of the spectrogram is clear, and the fundamental frequency and the medium- and high-frequency noise are smooth. At epochs 300 and 400, the integrity of the formants and the distribution of the harmonics can be seen, and the texture is clearer still. At epochs 500 and 599, the formants are prominent and the harmonic texture is clearest. The spectrograms under different epochs show that the model converges quickly, the visual quality of the generated spectrograms improves steadily, and they resemble real samples in texture, color contrast and other aspects, so they can be used for sample expansion.
And S106, obtaining the similarity between the trained picture and the plurality of pictures.
It should be noted that evaluating the quality of GAN-generated images is a complex task, and methods that evaluate and select samples by subjective vision are laborious and unconvincing. Moreover, strong correlations exist among the pixels of a spectrogram image, and these correlations carry important voiceprint information such as the energy of the audio signal, the positions of the formants, continuity and harmonic texture.
It can be understood that each of the plurality of pictures corresponds to one or more trained pictures, and in order to determine whether the trained pictures can be used as samples in the gallery, the similarity between the pictures before and after training needs to be obtained.
Assume the image before training is A and the image after training is B. Images A and B are partitioned into blocks with a sliding window, the total number of blocks being N (each block being the corresponding part of the two images), and the mean, variance and covariance of each window are calculated with a Gaussian function.
Specifically, for any one block, the formula for calculating the similarity between two image blocks is specifically expressed as:
SSIM(x, g) = [(2μ_x μ_g + c_1)(2σ_xg + c_2)] / [(μ_x^2 + μ_g^2 + c_1)(σ_x^2 + σ_g^2 + c_2)]
where μ_x and σ_x^2 are the mean and variance of the image block from the plurality of pictures, μ_g and σ_g^2 are the mean and variance of the corresponding image block from the trained pictures, σ_xg is the covariance of the two image blocks, and c_1 and c_2 are constants.
Then, the N block-wise similarity values are averaged to obtain the similarity value of the two images.
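As a concrete illustration, the block-wise statistics and the SSIM formula above can be sketched in Python with NumPy. The window size, stride, and the constants c_1 and c_2 below are illustrative assumptions (the common SSIM defaults), not values fixed by the invention:

```python
import numpy as np

def gaussian_window(size=11, sigma=1.5):
    """2-D Gaussian weights used to compute windowed statistics."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    w = np.outer(g, g)
    return w / w.sum()

def ssim(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2, size=11):
    """Mean SSIM of two equal-sized grayscale images (illustrative sketch)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    w = gaussian_window(size)
    h, wd = a.shape
    scores = []
    step = size  # non-overlapping blocks; the patent's exact stride is not specified
    for i in range(0, h - size + 1, step):
        for j in range(0, wd - size + 1, step):
            pa = a[i:i + size, j:j + size]
            pb = b[i:i + size, j:j + size]
            mu_x = (w * pa).sum()                       # weighted block means
            mu_g = (w * pb).sum()
            var_x = (w * (pa - mu_x) ** 2).sum()        # weighted variances
            var_g = (w * (pb - mu_g) ** 2).sum()
            cov = (w * (pa - mu_x) * (pb - mu_g)).sum() # weighted covariance
            s = ((2 * mu_x * mu_g + c1) * (2 * cov + c2)) / \
                ((mu_x ** 2 + mu_g ** 2 + c1) * (var_x + var_g + c2))
            scores.append(s)
    # Average the N block-wise scores into one similarity value for the pair.
    return float(np.mean(scores))
```

Two identical images give an SSIM of 1.0, and any structural difference lowers the score, which is what makes the index usable as a selection criterion.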
Through structural information (luminance, contrast, and structure), the SSIM index represents the similarity between the structures and pixels of a generated spectrogram and a real spectrogram more directly than most error-sensitivity-based quality metrics such as MSE and PSNR, which decompose signals with linear transformations and do not account for inter-pixel correlation.
Because the network generates poorly in the early stage of training, images from the first 100 epochs are not considered. After 100 epochs, the SSIM values computed between the generated and original spectrograms of PD patients and healthy subjects range from 0.7835 to 0.9374. The SSIM value is unstable early on because training has not yet converged: the positions, extents, and structures of voiceprint features such as formants and harmonics vary, and the smoothing of noise changes pixel values. As the network gradually stabilizes, most SSIM values fall between 0.85 and 0.90, indicating that the spectrograms generated by HR-DCGAN resemble the original, real samples in texture, color contrast, and related respects.
And S107, judging one by one, according to the similarity value, whether each of the trained pictures is to be used as an expansion sample of Parkinson patient speech.
To select spectrogram samples with high similarity, a threshold is established by comparing the SSIM values computed in the experiments and is used as the criterion for sample selection: when the SSIM index is greater than or equal to the threshold, the sample is used for expansion; otherwise, it is not.
In this study, the SSIM threshold is set to 0.85; that is, spectrograms generated at epochs whose SSIM value is at least 0.85 are selected, labeled, and used for sample expansion. In addition, the screened high-similarity samples can expand the original samples by different expansion coefficients (values between 1 and 30), enlarging the original data set by different factors.
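The threshold-based screening and expansion-coefficient replication described above can be sketched as follows; the function name and data layout (a mapping from epoch number to SSIM value and generated image) are hypothetical, chosen only for illustration:

```python
def select_expansion_samples(epoch_ssim, generated, threshold=0.85, factor=3):
    """Keep generated spectrograms whose SSIM against the originals meets the
    threshold, then replicate the survivors `factor` times (the patent allows
    coefficients between 1 and 30) to build the expanded sample set.

    epoch_ssim: dict mapping epoch -> SSIM value of that epoch's output
    generated:  dict mapping epoch -> generated spectrogram (any object)
    """
    # Screening step: only epochs with SSIM >= threshold pass.
    kept = [img for epoch, img in generated.items()
            if epoch_ssim.get(epoch, 0.0) >= threshold]
    # Expansion step: replicate the screened samples by the coefficient.
    return kept * factor
```

With a threshold of 0.85 and coefficient 2, an epoch scoring 0.80 is discarded while epochs scoring 0.86 and 0.91 are kept and duplicated.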
In one implementation of the present invention, the step of judging one by one, according to the similarity value, whether each of the trained pictures is used as an expansion sample of Parkinson patient speech comprises: acquiring a preset comparison value; judging whether a target picture exists among the trained pictures whose similarity value with the corresponding image before training is not less than the preset comparison value; and determining the target picture to be an expansion sample of Parkinson patient speech.
According to the embodiment of the invention, the audio signals in the PD data set are segmented and preprocessed, then converted into spectrograms in JPEG and PNG formats at multiple resolutions, and class labels in an encoded format are added. A multi-resolution Parkinson voiceprint spectrogram sample library is thus established; combined with the advantages of the spectrogram's joint time-frequency analysis, this facilitates neural-network operations such as feature extraction, recognition and classification, and image synthesis.
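The preprocessing chain summarized above — pre-emphasis, framing, windowing, short-time power spectrum, grayscale mapping — can be sketched in Python with NumPy. Frame length, hop size, and the pre-emphasis coefficient below are illustrative assumptions, not parameters fixed by the patent:

```python
import numpy as np

def grayscale_spectrogram(signal, frame_len=512, hop=128, preemph=0.97):
    """Pre-emphasize, frame, window, and FFT an audio segment, returning a
    log-power spectrogram scaled to 8-bit grayscale (illustrative sketch)."""
    # Pre-emphasis boosts the high frequencies attenuated in speech production.
    x = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Framing + windowing: overlapping Hamming-windowed frames.
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    # Short-time power spectrum of each frame via the FFT.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    db = 10 * np.log10(power + 1e-10)
    # Connect the frames and normalize to 0-255 grayscale; the columns of
    # this matrix are the time axis of the grayscale spectrogram.
    gray = 255 * (db - db.min()) / (db.max() - db.min())
    return gray.astype(np.uint8).T  # frequency bins x frames
```

A 1 s segment at 16 kHz with these settings yields a 257 x 122 grayscale image, which can then be mapped through a colormap to obtain the pseudo-color spectrogram.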
By adopting the HR-DCGAN model, which combines the adversarial learning strategy of DCGAN with a feature-matching method, increasing the number of network layers of DCGAN and adding a feature-matching constraint allows the voiceprint features in the spectrogram to be better extracted and retained, generating high-resolution 256 × 256 × 3 spectrogram samples with a stable training process. This compensates for the shortage of audio samples in the identification and diagnosis of PD patients, meets the data requirements of deep learning, and effectively improves the identification accuracy of PD patients under small-sample conditions.
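As a minimal sketch of the feature-matching constraint mentioned above: assuming f(·) denotes an intermediate discriminator layer (an assumption for illustration; the patent does not specify which layer), the generator is penalized for the distance between the mean features of real and generated batches, ||E[f(x)] − E[f(G(z))]||², which encourages the generated spectrograms to reproduce real feature statistics and stabilizes training:

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """L2 distance between batch-mean discriminator features of real and
    generated samples.  Inputs are (batch, feature_dim) arrays as would be
    produced by an intermediate discriminator layer f(.)."""
    mu_real = real_feats.mean(axis=0)  # E[f(x)] over the real batch
    mu_fake = fake_feats.mean(axis=0)  # E[f(G(z))] over the generated batch
    return float(np.sum((mu_real - mu_fake) ** 2))
```

The loss is zero when the two batches share identical mean features and grows with the gap, so adding it to the generator objective constrains the generator beyond the plain adversarial signal.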
By comparing and analyzing the SSIM values between the generated and original spectrograms at different epochs, an SSIM threshold is derived and a selection criterion for sample expansion is established. The SSIM criterion effectively screens out high-similarity spectrogram samples, guaranteeing the validity of the expansion. The screened expansion samples can enlarge the original samples by different expansion coefficients (values between 1 and 30).
In addition, the invention also discloses a Parkinson voiceprint spectrogram sample expansion device, comprising a processor and a memory connected to the processor through a communication bus, wherein:
the memory is used for storing a Parkinson voiceprint spectrogram sample expansion program;
and the processor is configured to execute the Parkinson voiceprint spectrogram sample expansion program to implement any one of the Parkinson voiceprint spectrogram sample expansion steps.
The invention further discloses a computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to cause the one or more processors to perform any one of the Parkinson voiceprint spectrogram sample expansion steps.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (9)

1. A method for expanding Parkinson voiceprint spectrogram samples, characterized by comprising the following steps:
acquiring and segmenting a plurality of audios containing vowel pronunciations to obtain corresponding spectrograms;
converting the obtained spectrograms into grayscale spectrograms according to a Fourier transform;
converting the grayscale spectrograms into pseudo-color spectrograms;
converting the pseudo-color spectrograms into a plurality of pictures according to preset resolutions, and assigning a first label or a second label to each picture, wherein the first label is the spectrogram label corresponding to a Parkinson patient and the second label is the spectrogram label corresponding to a non-Parkinson person;
training the plurality of pictures through an HR-DCGAN model to generate trained pictures corresponding to the plurality of pictures;
obtaining the similarity between the trained pictures and the plurality of pictures;
and judging, one by one according to the similarity value, whether each of the trained pictures is used as an expansion sample of Parkinson patient speech.
2. The method for expanding Parkinson voiceprint spectrogram samples according to claim 1, wherein the step of acquiring and segmenting a plurality of audios containing vowel pronunciations to obtain corresponding spectrograms comprises:
acquiring a plurality of audios in which a vowel is pronounced continuously three times with a pronunciation duration of 6 s;
segmenting each audio into three 2 s audio segments;
and preprocessing each audio segment to obtain a spectrogram, wherein the preprocessing comprises pre-emphasis, framing, windowing, and endpoint detection.
3. The method for expanding Parkinson voiceprint spectrogram samples according to claim 1 or 2, wherein the step of converting the obtained spectrograms into grayscale spectrograms according to a Fourier transform comprises:
performing a Fourier transform on the spectrogram;
performing a fast Fourier transform on each frame of the signal to obtain its short-time Fourier transform and the corresponding short-time power spectrum;
and connecting the short-time power spectra to form the grayscale spectrogram.
4. The method for expanding Parkinson voiceprint spectrogram samples according to claim 3, wherein the step of converting the pseudo-color spectrograms into a plurality of pictures according to preset resolutions and assigning a first label or a second label to each picture comprises:
performing a redundancy-removal operation on the pseudo-color spectrograms, wherein the redundancy-removal operation comprises removing coordinate axes and white margins;
converting the pseudo-color spectrograms after the redundancy-removal operation into spectrograms in JPEG and PNG formats according to preset resolutions, wherein the preset resolutions are: 128 × 128, 256 × 256, 512 × 512, and 1024 × 1024;
and adding a label to each spectrogram.
5. The method for expanding Parkinson voiceprint spectrogram samples according to claim 1, wherein the step of obtaining the similarity between the trained pictures and the plurality of pictures comprises:
performing image blocking on each image among the plurality of pictures and its corresponding trained image to obtain a first number of image blocks;
for each image block, performing the steps of: obtaining the mean, variance, and covariance of the corresponding image blocks; calculating the similarity of the two image blocks according to the mean, variance, and covariance of each image block; and obtaining a first number of similarity values;
calculating the average of the first number of similarity values;
and taking the average as the similarity between the trained picture and the plurality of pictures.
6. The method for expanding Parkinson voiceprint spectrogram samples according to claim 5, wherein the formula for calculating the similarity between the two image blocks is specifically expressed as:

SSIM(x, g) = [(2μ_x μ_g + c_1)(2σ_xg + c_2)] / [(μ_x^2 + μ_g^2 + c_1)(σ_x^2 + σ_g^2 + c_2)]

where μ_x and σ_x^2 are the mean and variance of the image block from the plurality of pictures, μ_g and σ_g^2 are the mean and variance of the corresponding image block from the trained pictures, σ_xg is the covariance of the two image blocks, and c_1 and c_2 are constants.
7. The method for expanding Parkinson voiceprint spectrogram samples according to claim 5 or 6, wherein the step of judging, one by one according to the similarity value, whether each of the trained pictures is used as an expansion sample of Parkinson patient speech comprises:
acquiring a preset comparison value;
judging whether a target picture exists among the trained pictures whose similarity value with the corresponding image before training is not smaller than the preset comparison value;
and determining the target picture to be an expansion sample of Parkinson patient speech.
8. A device for expanding Parkinson voiceprint spectrogram samples, characterized by comprising a processor and a memory connected to the processor through a communication bus, wherein:
the memory is used for storing a Parkinson voiceprint spectrogram sample expansion program;
and the processor is configured to execute the Parkinson voiceprint spectrogram sample expansion program to implement the Parkinson voiceprint spectrogram sample expansion steps of any one of claims 1 to 7.
9. A computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to cause the one or more processors to perform the Parkinson voiceprint spectrogram sample expansion steps of any one of claims 1 to 7.
CN201910720986.2A 2019-08-06 2019-08-06 Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium Active CN110428364B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910720986.2A CN110428364B (en) 2019-08-06 2019-08-06 Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium


Publications (2)

Publication Number Publication Date
CN110428364A CN110428364A (en) 2019-11-08
CN110428364B true CN110428364B (en) 2022-09-30

Family

ID=68414378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910720986.2A Active CN110428364B (en) 2019-08-06 2019-08-06 Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium

Country Status (1)

Country Link
CN (1) CN110428364B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292766B (en) * 2020-02-07 2023-08-08 抖音视界有限公司 Method, apparatus, electronic device and medium for generating voice samples
CN111612799A (en) * 2020-05-15 2020-09-01 中南大学 Face data pair-oriented incomplete reticulate pattern face repairing method and system and storage medium
CN113255433A (en) * 2021-04-06 2021-08-13 北京迈格威科技有限公司 Model training method, device and computer storage medium
CN113642714B (en) * 2021-08-27 2024-02-09 国网湖南省电力有限公司 Insulator pollution discharge state identification method and system based on small sample learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080300867A1 (en) * 2007-06-03 2008-12-04 Yan Yuling System and method of analyzing voice via visual and acoustic data
CN110033018B (en) * 2019-03-06 2023-10-31 平安科技(深圳)有限公司 Graph similarity judging method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN110428364A (en) 2019-11-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant