CN110428364B - Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium - Google Patents


Info

Publication number
CN110428364B
CN110428364B (application CN201910720986.2A; publication of application CN110428364A)
Authority
CN
China
Prior art keywords
spectrogram
pictures
voiceprint
sample
parkinson
Prior art date
Legal status: Active
Application number
CN201910720986.2A
Other languages
Chinese (zh)
Other versions
CN110428364A (en)
Inventor
王娟
徐志京
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN201910720986.2A priority Critical patent/CN110428364B/en
Publication of CN110428364A publication Critical patent/CN110428364A/en
Application granted granted Critical
Publication of CN110428364B publication Critical patent/CN110428364B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/22 Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06T 3/04 Geometric image transformations in the plane of the image; context-preserving transformations, e.g. by using an importance map
    • G06T 7/90 Image analysis; determination of colour characteristics
    • G10L 17/02 Speaker identification or verification; preprocessing operations, e.g. segment selection; pattern representation or modelling; feature selection or extraction
    • G10L 17/22 Speaker identification or verification; interactive procedures; man-machine interfaces
    • G10L 19/02 Speech or audio analysis-synthesis for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G06T 2207/20048, G06T 2207/20056 Transform domain processing; discrete and fast Fourier transform [DFT, FFT]
    • G06T 2207/20081 Training; learning


Abstract

The invention provides a method for expanding samples of a Parkinson voiceprint spectrogram, which comprises the following steps: acquiring a plurality of audios containing vowel pronunciation and segmenting them to obtain corresponding spectrograms; converting the obtained spectrograms into gray-scale spectrograms by Fourier transform; converting the gray-scale spectrograms into pseudo-color spectrograms; converting the pseudo-color spectrograms into a plurality of pictures at preset resolutions, and allocating a first label or a second label to each picture; training on the plurality of pictures through an HR-DCGAN model to generate trained pictures corresponding to the plurality of pictures; obtaining the similarity between the trained pictures and the plurality of pictures; and judging one by one, according to the similarity values, whether each of the trained pictures is to be used as an expanded sample of Parkinson patient speech. In addition, the invention also discloses a Parkinson voiceprint spectrogram sample expansion device and a computer storage medium.

Description

Method and device for expanding samples of Parkinson voiceprint speech spectrogram and computer storage medium
Technical Field
The invention relates to the technical field of voice processing, in particular to a method and a device for expanding a Parkinson voiceprint spectrogram sample and a computer storage medium.
Background
Voiceprints are important biological characteristics of human beings. Parkinson's Disease (PD) is a common degenerative disease of the nervous system, and about 90% of PD patients show vocal cord impairment among their early symptoms, so voiceprints can be applied to the discrimination of diseases such as Parkinson's. However, existing patient voiceprint data sets are small and samples are difficult to obtain, so deep learning algorithms overfit easily and cannot achieve good results. Consequently, when a deep learning algorithm is used to diagnose Parkinson's disease, sample expansion is an urgent problem to be solved.
The audio signal can be converted into a spectrogram, from which a neural network can identify and extract the important voiceprint features related to the research target so as to classify the image automatically. As the number of network layers grows, deep networks show strong performance in classification and recognition; however, a deep convolutional neural network is a data-driven model and relies on a large number of samples to reach its full efficiency. Because samples are scarce, many researchers have not applied deep convolutional neural networks to PD identification.
To expand the samples, traditional image augmentation methods include cropping, flipping, translating, scaling and contrast transformation. These operations change or destroy the voiceprint feature information in the spectrogram and reduce classification and recognition accuracy, so they are not suitable for expanding samples of this category.
Disclosure of Invention
In view of the above disadvantages of the prior art, an object of the present invention is to provide a method and an apparatus for expanding Parkinson voiceprint spectrogram samples. Trained samples are generated through a DCGAN model, their similarity to the original samples is calculated, and samples are selected from the trained samples for expansion. This solves the prior-art problem that augmentation destroys the voiceprint feature information of the spectrogram and affects classification accuracy. The expanded samples are then placed into the voiceprint spectrogram sample library and applied to the identification of Parkinson patients, improving the identification accuracy for PD patients under small-sample conditions.
In order to achieve the above objects and other related objects, the present invention provides a method for expanding a sample of a parkinsonism voiceprint spectrogram, the method comprising:
acquiring a plurality of audios containing vowel pronunciation and segmenting them to obtain corresponding spectrograms;
converting the obtained spectrogram into a gray spectrogram according to Fourier transform;
converting the gray scale spectrogram into a pseudo color spectrogram;
converting the pseudo-color spectrogram into a plurality of pictures according to a preset resolution, and allocating a first label and a second label to each picture, wherein the first label is a spectrogram label corresponding to a Parkinson's disease person, and the second label is a spectrogram label corresponding to a non-Parkinson's disease person;
training the plurality of pictures through an HR-DCGAN model to generate trained pictures corresponding to the plurality of pictures;
obtaining the similarity between the trained picture and the plurality of pictures;
and judging whether each picture in the trained pictures is used as an extended sample of the speech of the Parkinson patient one by one according to the similarity value.
In one implementation manner of the present invention, the step of obtaining and segmenting a plurality of audios including vowel pronunciation to obtain corresponding spectrogram includes:
acquiring a plurality of audios each containing three consecutive vowel pronunciations with a pronunciation duration of 6 s;
cutting each audio into three 2 s audio segments;
and preprocessing each audio segment to obtain a spectrogram, wherein the preprocessing comprises pre-emphasis, framing, windowing and endpoint detection.
In an implementation manner of the present invention, the step of converting the obtained spectrogram into a grayscale spectrogram according to fourier transform includes:
performing Fourier transform on the spectrogram;
performing fast Fourier transform on each frame of signal to obtain short-time Fourier transform and obtain a corresponding short-time power spectrum;
and connecting to form a gray-scale speech spectrogram according to the short-time power spectrum.
In an implementation manner of the present invention, the step of converting the pseudo color spectrogram into a plurality of pictures according to a preset resolution, and allocating a first tag and a second tag to each picture includes:
performing redundancy removing operation on the pseudo color spectrogram, wherein the redundancy removing operation comprises coordinate axis removing and white edge removing operation;
converting the pseudo-color spectrogram after the redundancy removing operation into spectrograms in JPEG and PNG formats at preset resolutions, wherein the preset resolutions are: 128×128, 256×256, 512×512 and 1024×1024;
and adding a label to each spectrogram.
In an implementation manner of the present invention, the step of obtaining the similarity between the trained picture and the plurality of pictures includes:
performing image blocking processing on each image in the plurality of pictures and the corresponding trained image to obtain a first number of image blocks;
for each image block, performing the following: obtaining the mean, variance and covariance of the corresponding image blocks, and calculating the similarity of the two image blocks from them, thereby obtaining a first number of similarity values;
calculating an average of the first number of similarity values;
and taking the average value as the similarity of the trained picture and the plurality of pictures.
In an implementation manner of the present invention, the formula adopted for calculating the similarity between the two image blocks is specifically expressed as:

$$\mathrm{SSIM}(x,g)=\frac{(2\mu_x\mu_g+c_1)(2\sigma_{xg}+c_2)}{(\mu_x^2+\mu_g^2+c_1)(\sigma_x^2+\sigma_g^2+c_2)}$$

wherein μ_x and σ_x² are the mean and variance of the image block from the plurality of pictures, μ_g and σ_g² are the mean and variance of the corresponding image block from the trained picture, σ_xg is the covariance of the two image blocks, and c_1, c_2 are constants.
In an implementation manner of the present invention, the step of determining one by one whether each of the trained pictures is used as an extended sample of the parkinson's patient's voice according to the similarity value includes:
acquiring a preset comparison value;
judging whether a target picture exists among the trained pictures whose similarity value with the corresponding pre-training picture is not less than the preset comparison value;
and determining such a target picture to be an expanded sample of the Parkinson patient's speech.
In addition, the invention also discloses a Parkinson voiceprint spectrogram sample expansion device, which comprises a processor and a memory connected with the processor through a communication bus, wherein:
the memory is used for storing a Parkinson voiceprint spectrogram sample expansion program;
the processor is used for executing the Parkinson voiceprint spectrogram sample expansion program so as to realize the steps of any one of the above Parkinson voiceprint spectrogram sample expansion methods.
Also, a computer storage medium is disclosed, the computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to cause the one or more processors to perform any of the parkinsonian voiceprint spectrogram sample augmentation steps.
As described above, according to the Parkinson voiceprint spectrogram sample expansion method, device and computer storage medium provided by the embodiments of the present invention, trained samples are generated through the DCGAN model and their similarity to the original samples is calculated, so that samples can be selected from the trained samples for expansion. This solves the prior-art problem that augmentation destroys the voiceprint feature information of the spectrogram and affects classification accuracy. The expanded samples are then placed into the voiceprint spectrogram sample library and applied to the identification of Parkinson patients, improving the identification accuracy for PD patients under small-sample conditions.
Drawings
Fig. 1 is a schematic flow chart of a method for expanding a sample of a parkinsonian voiceprint spectrogram according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a first specific application of the method for expanding a parkinsonian voiceprint spectrogram sample according to the embodiment of the present invention.
Fig. 3 is a schematic diagram of a second specific application of the method for expanding a sample of a parkinsonian voiceprint spectrogram according to the embodiment of the present invention.
Fig. 4 is a schematic diagram of a third specific application of the method for expanding a sample of a parkinsonian voiceprint spectrogram according to the embodiment of the present invention.
Fig. 5 is a PD patient speech spectrum diagram of a parkinson voiceprint spectrogram sample expansion method according to an embodiment of the present invention.
Fig. 6 is a non-PD patient speech spectrum diagram of a parkinson voiceprint spectrogram sample expansion method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Please refer to fig. 1-6. It should be noted that the drawings provided in this embodiment are only for schematically illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings and not drawn according to the number, shape and size of the components in actual implementation, and the form, quantity and proportion of each component in actual implementation may be arbitrarily changed, and the component layout may be more complicated.
As shown in fig. 1, an embodiment of the present invention provides a method for expanding a sample of a parkinsonism voiceprint spectrogram, where the method includes the following steps:
s101, acquiring and dividing a plurality of audios containing vowel pronunciations to obtain corresponding spectrogram.
In a specific implementation, the invention adopts the Sakar data set. Each audio sample of a participant comprises three consecutive vowel pronunciations; each pronunciation lasts 6 s, followed by 3 s to 6 s of rest, and all voice recordings are in wav format. Each pronunciation is divided into 2 s voice segments, so one wav voice file yields three 2 s voice segments.
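The 2 s segmentation described above can be sketched as follows. This is an illustrative numpy sketch under stated assumptions: the function name is hypothetical, the signal is assumed to be already loaded as a 1-D array, and the patent itself works on wav files with Matlab tooling.

```python
import numpy as np

def split_recording(signal: np.ndarray, sr: int, seg_seconds: float = 2.0) -> list:
    """Split a 1-D audio signal into consecutive fixed-length segments.

    Any partial segment at the tail is discarded, so a 6 s recording
    yields exactly three 2 s segments.
    """
    seg_len = int(sr * seg_seconds)
    n_full = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

# A 6 s recording at 44.1 kHz splits into three 2 s segments of 88200 samples.
sr = 44100
audio = np.zeros(6 * sr)
segments = split_recording(audio, sr)
print(len(segments), len(segments[0]))  # → 3 88200
```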
It should be noted that participants are parkinson patients and non-parkinson patients (healthy persons) who respectively collect their corresponding voices, and therefore, the collected sample includes two types of voices.
And S102, converting the acquired spectrogram into a gray spectrogram according to Fourier transform.
It can be understood that, due to noise in the audio acquisition process, such as environmental background noise, aliasing interference between the pronunciation organs and the acquisition device, and harmonic distortion, the quality of the acquired audio signals is uneven. Preprocessing of the original audio signal is therefore necessary and is an important step affecting recognition accuracy. The preprocessing comprises four processes: pre-emphasis, framing, windowing and endpoint detection.
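The first three preprocessing steps can be sketched as below. This is a minimal numpy sketch: the pre-emphasis coefficient 0.97 is a common default rather than a value fixed by the patent, endpoint detection is omitted for brevity, and the frame length and overlap follow the parameter settings given later for the 44.1 kHz signal.

```python
import numpy as np

def preprocess(signal: np.ndarray, sr: int, alpha: float = 0.97,
               frame_ms: float = 46.44, overlap: float = 0.75) -> np.ndarray:
    """Pre-emphasis, framing and Hamming windowing (endpoint detection omitted).

    Returns an array of shape (n_frames, frame_len) of windowed frames.
    """
    # Pre-emphasis: first-order FIR high-pass, y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(round(sr * frame_ms / 1000.0))       # 2048 samples at 44.1 kHz
    hop = int(round(frame_len * (1.0 - overlap)))        # frame shift = 1/4 frame
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([emphasized[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])
```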
After preprocessing, the audio signal is converted into a spectrogram and a Parkinson voiceprint spectrogram sample library is established. The flow chart is shown in fig. 2, and the steps are as follows:
(1) Performing a short-time Fourier transform on the preprocessed audio signal x(m):

$$X_n(e^{jw})=\sum_{m=-\infty}^{\infty} x(m)\,w(n-m)\,e^{-jwm}$$

wherein w(n) is the window function, for which a Hamming window may specifically be adopted, and X_n(e^{jw}) is a function of both w and n.
(2) Let w = 2πk/N (0 ≤ k ≤ N−1), where N represents the number of points of the fast Fourier transform (FFT). Performing the FFT on each frame signal gives the short-time Fourier transform shown in the formula:

$$X_n(k)=X_n(e^{j2\pi k/N})=\sum_{m=0}^{N-1} x_n(m)\,e^{-j2\pi km/N}$$
(3) Calculating the short-time power spectrum S_n(e^{jw}) from the short-time Fourier transform:

$$S_n(e^{jw})=X_n(e^{jw})\cdot X_n^{*}(e^{jw})=\left|X_n(e^{jw})\right|^{2}$$

wherein

$$S_n(e^{jw})=\sum_{k=-\infty}^{\infty} R_n(k)\,e^{-jwk},\qquad R_n(k)=\sum_{m} x_n(m)\,x_n(m+k)$$

R_n(k) is the short-time autocorrelation function of x(n), and S_n(e^{jw}) is its Fourier transform. With n and w as the horizontal and vertical coordinates, S_n(e^{jw}) gives the gray level of the pixel at point (n, w).
(4) Gray scale map mapping: and sequentially connecting the gray level representations of each frame to generate a gray level spectrogram.
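Steps (2) to (4) can be sketched as a short numpy routine. This is an illustrative sketch, not the patent's Matlab implementation: the dB scaling and min-max normalization to 8-bit gray levels are common display conventions assumed here, and the frames are the windowed frames from the preprocessing stage.

```python
import numpy as np

def gray_spectrogram(frames: np.ndarray, nfft: int = 2048) -> np.ndarray:
    """FFT each windowed frame, take the short-time power spectrum
    S_n = |X_n|^2, and map it to 8-bit gray levels on a dB scale."""
    spec = np.fft.rfft(frames, n=nfft, axis=1)     # short-time Fourier transform
    power = np.abs(spec) ** 2                      # short-time power spectrum
    log_power = 10.0 * np.log10(power + 1e-12)     # dB scale for visibility
    lo, hi = log_power.min(), log_power.max()
    gray = np.round(255.0 * (log_power - lo) / (hi - lo + 1e-12)).astype(np.uint8)
    return gray.T  # frequency on the vertical axis, frame index horizontal
```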
S103, converting the gray-scale spectrogram into a pseudo-color spectrogram.
In order to improve the recognizability of the image content, the pseudo-color mapping function colormap(map) in Matlab2016a (where map is the adopted pseudo-color mapping matrix, jet by default) may be used to display the power spectrum in pseudo color and obtain a pseudo-color spectrogram.
Specifically, when the sampling frequency of the audio signal is known to be 44.1 kHz, the parameters are set as follows: the number of FFT points (NFFT) is 2048, the frame length is 46.44 ms, and the frame shift is 1/4 of the frame length, i.e., adjacent frames overlap by 3/4 of the frame length. The spectrogram generated with these settings has distinct voiceprint features and the clearest texture. The audio signal was converted to a spectrogram using Matlab2016a.
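The pseudo-color step can be sketched without Matlab as below. The piecewise-linear ramps are a rough jet-style approximation assumed for illustration; they are not Matlab's exact jet colormap matrix.

```python
import numpy as np

def jet_like(gray: np.ndarray) -> np.ndarray:
    """Map 8-bit gray levels to RGB with a simple jet-style colormap
    (blue for low values through green to red for high values),
    roughly approximating Matlab's default 'jet'."""
    t = gray.astype(np.float64) / 255.0
    # Three overlapping triangular ramps centred on blue, green and red.
    r = np.clip(1.5 - np.abs(4.0 * t - 3.0), 0.0, 1.0)
    g = np.clip(1.5 - np.abs(4.0 * t - 2.0), 0.0, 1.0)
    b = np.clip(1.5 - np.abs(4.0 * t - 1.0), 0.0, 1.0)
    return np.round(255.0 * np.stack([r, g, b], axis=-1)).astype(np.uint8)
```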
And S104, converting the pseudo-color spectrogram into a plurality of pictures according to a preset resolution ratio, and allocating a first label and a second label to each picture, wherein the first label is a speech spectrum label corresponding to a Parkinson disease person, and the second label is a speech spectrum label corresponding to a non-Parkinson disease person.
Processing and storage during spectrogram visualization: to reduce redundant information in the spectrogram image, the coordinate axes and white margins are removed. Since texture information such as the pitch frequency and harmonics is concentrated in the middle-lower and bottom parts of the spectrogram, the high-frequency region contains little useful information, and the region above 10000 Hz appears as noise and blank space. Therefore, the default frequency display range [0, 20000 Hz] is modified to the fixed range [0, 10000 Hz], so that texture information such as the harmonics and formants can be visualized more clearly.
Spectrograms in JPEG and PNG formats are saved at different resolutions (128×128, 256×256, 512×512 and 1024×1024), and category labels are added; for example, the labels corresponding to the spectrograms of healthy persons and PD patients are '01' and '10' respectively. These spectrograms form the voiceprint spectrogram sample library for later use, which is convenient for processing by the neural network.
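Building the labelled library can be sketched as below. Function names are hypothetical, nearest-neighbour resizing stands in for whatever interpolation the original tooling uses, and file-format saving is omitted; only the label assignment ('01' for healthy, '10' for PD) follows the text.

```python
import numpy as np

# Category labels as described: healthy person → "01", PD patient → "10".
LABELS = {"healthy": "01", "pd": "10"}

def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize to one of the preset square resolutions
    (128, 256, 512 or 1024)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def build_sample_library(items, size: int = 256) -> list:
    """Resize each (category, image) pair and attach its class label,
    forming entries of the voiceprint spectrogram sample library."""
    return [{"label": LABELS[cat], "image": resize_nearest(img, size)}
            for cat, img in items]
```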
And S105, training the pictures through an HR-DCGAN model to generate the trained pictures corresponding to the pictures.
As shown in fig. 3, in order to ensure the quality of the GAN-generated spectrogram and the stability of training, and so that the spectrogram fits the current mature CNN architectures, this study uses spectrograms in JPEG format with a resolution of 256 × 256 × 3 from the voiceprint spectrogram sample library for sample expansion.
The HR-DCGAN model follows the adversarial idea and generates new spectrogram images in an unsupervised manner with a min-max strategy. The generator G takes as input a random noise vector z sampled from a prior probability distribution P_z (uniform or Gaussian), continuously learns the distribution of the real training samples x, and outputs fake samples G(z) that approximate the latent distribution of the real samples. The discriminator D is essentially a classifier: given an input G(z) or x, it computes the probability that the input belongs to the real data distribution P_data, i.e., it determines whether the input comes from the real samples P_data or from the fake samples G(z). The two networks are trained adversarially, alternately updating the parameters of D and G: D is updated to maximize its discrimination accuracy while G is updated to minimize the distribution error between G(z) and P_data, so the performance of both improves continuously. Eventually a Nash equilibrium is reached: when D can no longer correctly estimate whether an input comes from G(z) or P_data, G has learned the distribution space of the real samples x and generates "fake samples" that approximate real spectrogram images.
The architecture of G and D in the HR-DCGAN model is based on the network structure of the DCGAN model, with increased network depth to accommodate spectrograms with a resolution of 256 × 256 × 3. The number of layers of G is increased, and the size of the generated image grows layer by layer as follows:
4×4→8×8→16×16→32×32→64×64→128×128→256×256
and finally a high-resolution spectrogram is generated. The D network increases its number of layers according to the size of the input image so as to down-sample the high-resolution image layer by layer through its convolution layers; the feature map sizes change as follows:
256×256→128×128→64×64→32×32→16×16→8×8→4×4
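The two size progressions above can be generated mechanically, since each stride-2 layer doubles (in G) or halves (in D) the spatial side length. A small sketch, with hypothetical function names:

```python
def generator_sizes(start: int = 4, target: int = 256) -> list:
    """Side lengths of the generator feature maps: each stride-2
    fractionally-strided convolution doubles the spatial size."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(sizes[-1] * 2)
    return sizes

g_sizes = generator_sizes()          # generator:      4 → 8 → ... → 256
d_sizes = list(reversed(g_sizes))    # discriminator mirrors it: 256 → ... → 4
```

The seven entries in each list correspond to the seven-layer G and D networks described below.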
The loss function is the standard GAN min-max objective:

$$\min_G \max_D V(D,G)=\mathbb{E}_{x\sim P_{data}}[\log D(x)]+\mathbb{E}_{z\sim P_z}[\log(1-D(G(z)))]$$
A feature matching method is then added to the model. Let f(x) be the feature map output by an intermediate layer of the discriminator network; the error between the feature statistics of the generated and real data is minimized, with the objective function:

$$\min_G \left\| \mathbb{E}_{x\sim P_{data}} f(x)-\mathbb{E}_{z\sim P_z} f(G(z)) \right\|_2^2$$
The loss function of D is unchanged, and the discriminator network output is maximized as before:

$$\max_D \mathbb{E}_{x\sim P_{data}}[\log D(x)]+\mathbb{E}_{z\sim P_z}[\log(1-D(G(z)))]$$
The loss function of G becomes the sum of the adversarial error on the "fake" samples generated during training and the error of the feature matching process:

$$\min_G \; -\mathbb{E}_{z\sim P_z}[\log D(G(z))]+\left\| \mathbb{E}_{x\sim P_{data}} f(x)-\mathbb{E}_{z\sim P_z} f(G(z)) \right\|_2^2$$
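The losses above can be evaluated numerically as a sanity check. A minimal numpy sketch, assuming the discriminator outputs and intermediate feature maps are given as arrays; in a real system these would come from the networks and be optimized by a deep learning framework.

```python
import numpy as np

def d_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Discriminator loss: negative of E[log D(x)] + E[log(1 - D(G(z)))],
    so minimizing it maximizes the discriminator objective."""
    eps = 1e-12  # numerical guard against log(0)
    return float(-(np.mean(np.log(d_real + eps))
                   + np.mean(np.log(1.0 - d_fake + eps))))

def g_loss(d_fake: np.ndarray, f_real: np.ndarray, f_fake: np.ndarray) -> float:
    """Generator loss: adversarial term plus the feature-matching term
    ||E f(x) - E f(G(z))||^2 over an intermediate discriminator feature map."""
    eps = 1e-12
    adversarial = float(-np.mean(np.log(d_fake + eps)))
    feature_matching = float(np.sum((f_real.mean(axis=0)
                                     - f_fake.mean(axis=0)) ** 2))
    return adversarial + feature_matching
```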
a network model diagram of G and D in HR-DCGAN is shown in FIG. 3. The study uses a pseudo-color spectrogram, so the output of G and the input of D are three channels. In addition, G and D are set to 64 in the first convolutional layer and 2048 in the first fully-connected layer.
As shown in fig. 4, G comprises a 7-layer network. A 100-dimensional noise vector z obeying a Gaussian distribution is taken as input and projected by convolutional up-sampling to a 4 × 4 spatial extent with 2048 feature maps, yielding a 4 × 4 × 2048 tensor. Layers h0 to h5 are fractionally-strided (micro-step) convolution layers with 5 × 5 convolution kernels and stride 2, which perform spatial up-sampling during the learning of G. After Batch Normalization (BN), the units of each hidden layer are normalized to zero mean and unit variance, which stabilizes the learning process, alleviates generator collapse caused by poor initialization, and allows gradients to propagate deeper. The ReLU activation function is then applied. After each fractionally-strided convolution layer, the size of the generated feature maps doubles and their number halves. Layer h6 is activated by the tanh function and finally outputs a 256 × 256 × 3 spectrogram image as the "fake" input to D.
D comprises a 7-layer network; h0 to h5 are convolution layers with 5 × 5 kernels and stride 2. All layers except the input layer of D have a BN layer and the non-linear Leaky ReLU activation function. The convolution layers extract features from the input spectrogram; after each down-sampling convolution layer, the feature map size halves and the number of feature maps doubles. Layer h6 uses the Sigmoid activation function to discriminate between real samples and generated "fake" samples, its output representing the probability that the input image comes from a real sample.
The original spectrogram samples of PD patients and of healthy persons are input into the HR-DCGAN model separately, matching the unsupervised training process of the model.
During spectrogram training for PD patients and healthy persons, g_loss and d_loss oscillate in the early stage and finally converge stably, generating high-resolution samples with similar texture characteristics, and the results can be visualized. Generated images were taken at random under different epochs; a PD patient spectrogram is shown in fig. 5 and a healthy person spectrogram in fig. 6 (both partial results), where (a) to (h) follow the numerical order of the epochs during HR-DCGAN training. At epoch 0, the generated images are only noise points and rough color contours of a spectrogram. At epoch 50, images with clear pitch frequency and formant positions but blurry texture are generated. At epoch 100, the pitch frequency and formants become clearer. At epoch 200, the harmonic texture of the spectrogram is clear, and the fundamental frequency and the medium- and high-frequency noise are smooth. At epochs 300 and 400, the integrity of the formants and the distribution of the harmonics can be seen, and the texture is clearer still. At epochs 500 and 599, the formants are prominent and the harmonic texture is clearest. The spectrograms under different epochs show that the model converges quickly, the visual quality of the generated spectrograms improves steadily, and they resemble real samples in texture, color contrast and other aspects, so they can be used for sample expansion.
And S106, obtaining the similarity between the trained picture and the plurality of pictures.
It should be noted that evaluating the quality of GAN-generated images is a complex task, and methods that evaluate and select samples by subjective vision are laborious and unconvincing. Moreover, strong correlations exist among the pixels of a spectrogram image, and these correlations carry important voiceprint information such as the energy of the audio signal, the positions of the formants, continuity and harmonic texture.
It can be understood that each of the plurality of pictures corresponds to one or more trained pictures, and in order to determine whether the trained pictures can be used as samples in the gallery, the similarity between the pictures before and after training needs to be obtained.
Assume the image before training is A and the image after training is B. Images A and B are partitioned into blocks with a sliding window, the total number of blocks being N (each block being the corresponding part of the two images), and the mean, variance and covariance of each window are calculated with a Gaussian function.
Specifically, for any one block, the formula for calculating the similarity between two image blocks is specifically expressed as:
SSIM(x, g) = [(2μ_x μ_g + c_1)(2σ_xg + c_2)] / [(μ_x^2 + μ_g^2 + c_1)(σ_x^2 + σ_g^2 + c_2)]
where μ_x and σ_x^2 are the mean and variance of the image block from the plurality of pictures, μ_g and σ_g^2 are the mean and variance of the corresponding image block from the trained pictures, σ_xg is the covariance of the two image blocks, and c_1 and c_2 are constants.
Then, the N block-wise similarity values are averaged to obtain the similarity value of the two images.
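As a concrete illustration, the block-wise statistics and the SSIM formula above can be sketched in Python with NumPy. The window size, stride, and the constants c_1 and c_2 below are illustrative assumptions (the common SSIM defaults), not values fixed by the invention:

```python
import numpy as np

def gaussian_window(size=11, sigma=1.5):
    """2-D Gaussian weights used to compute windowed statistics."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    w = np.outer(g, g)
    return w / w.sum()

def ssim(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2, size=11):
    """Mean SSIM of two equal-sized grayscale images (illustrative sketch)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    w = gaussian_window(size)
    h, wd = a.shape
    scores = []
    step = size  # non-overlapping blocks; the patent's exact stride is not specified
    for i in range(0, h - size + 1, step):
        for j in range(0, wd - size + 1, step):
            pa = a[i:i + size, j:j + size]
            pb = b[i:i + size, j:j + size]
            mu_x = (w * pa).sum()                       # weighted block means
            mu_g = (w * pb).sum()
            var_x = (w * (pa - mu_x) ** 2).sum()        # weighted variances
            var_g = (w * (pb - mu_g) ** 2).sum()
            cov = (w * (pa - mu_x) * (pb - mu_g)).sum() # weighted covariance
            s = ((2 * mu_x * mu_g + c1) * (2 * cov + c2)) / \
                ((mu_x ** 2 + mu_g ** 2 + c1) * (var_x + var_g + c2))
            scores.append(s)
    # Average the N block-wise scores into one similarity value for the pair.
    return float(np.mean(scores))
```

Two identical images give an SSIM of 1.0, and any structural difference lowers the score, which is what makes the index usable as a selection criterion.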
Through structural information (luminance, contrast, and structure), the SSIM index represents the similarity between the structures and pixels of a generated spectrogram and a real spectrogram more directly than most error-sensitivity-based quality metrics such as MSE and PSNR, which decompose signals with linear transformations and do not account for inter-pixel correlation.
Because the network generates poorly in the early stage of training, images from the first 100 epochs are not considered. After 100 epochs, the SSIM values computed between the generated and original spectrograms of PD patients and healthy subjects range from 0.7835 to 0.9374. The SSIM value is unstable early on because training has not yet converged: the positions, extents, and structures of voiceprint features such as formants and harmonics vary, and the smoothing of noise changes pixel values. As the network gradually stabilizes, most SSIM values fall between 0.85 and 0.90, indicating that the spectrograms generated by HR-DCGAN resemble the original, real samples in texture, color contrast, and related respects.
And S107, judging one by one, according to the similarity value, whether each of the trained pictures is to be used as an expansion sample of Parkinson patient speech.
To select spectrogram samples with high similarity, a threshold is established by comparing the SSIM values computed in the experiments and is used as the criterion for sample selection: when the SSIM index is greater than or equal to the threshold, the sample is used for expansion; otherwise, it is not.
In this study, the SSIM threshold is set to 0.85; that is, spectrograms generated at epochs whose SSIM value is at least 0.85 are selected, labeled, and used for sample expansion. In addition, the screened high-similarity samples can expand the original samples by different expansion coefficients (values between 1 and 30), enlarging the original data set by different factors.
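The threshold-based screening and expansion-coefficient replication described above can be sketched as follows; the function name and data layout (a mapping from epoch number to SSIM value and generated image) are hypothetical, chosen only for illustration:

```python
def select_expansion_samples(epoch_ssim, generated, threshold=0.85, factor=3):
    """Keep generated spectrograms whose SSIM against the originals meets the
    threshold, then replicate the survivors `factor` times (the patent allows
    coefficients between 1 and 30) to build the expanded sample set.

    epoch_ssim: dict mapping epoch -> SSIM value of that epoch's output
    generated:  dict mapping epoch -> generated spectrogram (any object)
    """
    # Screening step: only epochs with SSIM >= threshold pass.
    kept = [img for epoch, img in generated.items()
            if epoch_ssim.get(epoch, 0.0) >= threshold]
    # Expansion step: replicate the screened samples by the coefficient.
    return kept * factor
```

With a threshold of 0.85 and coefficient 2, an epoch scoring 0.80 is discarded while epochs scoring 0.86 and 0.91 are kept and duplicated.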
In one implementation of the present invention, the step of judging one by one, according to the similarity value, whether each of the trained pictures is used as an expansion sample of Parkinson patient speech comprises: acquiring a preset comparison value; judging whether a target picture exists among the trained pictures whose similarity value with the corresponding image before training is not less than the preset comparison value; and determining the target picture to be an expansion sample of Parkinson patient speech.
According to the embodiment of the invention, the audio signals in the PD data set are segmented and preprocessed, then converted into spectrograms in JPEG and PNG formats at multiple resolutions, and class labels in an encoded format are added. A multi-resolution Parkinson voiceprint spectrogram sample library is thus established; combined with the advantages of the spectrogram's joint time-frequency analysis, this facilitates neural-network operations such as feature extraction, recognition and classification, and image synthesis.
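The preprocessing chain summarized above — pre-emphasis, framing, windowing, short-time power spectrum, grayscale mapping — can be sketched in Python with NumPy. Frame length, hop size, and the pre-emphasis coefficient below are illustrative assumptions, not parameters fixed by the patent:

```python
import numpy as np

def grayscale_spectrogram(signal, frame_len=512, hop=128, preemph=0.97):
    """Pre-emphasize, frame, window, and FFT an audio segment, returning a
    log-power spectrogram scaled to 8-bit grayscale (illustrative sketch)."""
    # Pre-emphasis boosts the high frequencies attenuated in speech production.
    x = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Framing + windowing: overlapping Hamming-windowed frames.
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    # Short-time power spectrum of each frame via the FFT.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    db = 10 * np.log10(power + 1e-10)
    # Connect the frames and normalize to 0-255 grayscale; the columns of
    # this matrix are the time axis of the grayscale spectrogram.
    gray = 255 * (db - db.min()) / (db.max() - db.min())
    return gray.astype(np.uint8).T  # frequency bins x frames
```

A 1 s segment at 16 kHz with these settings yields a 257 x 122 grayscale image, which can then be mapped through a colormap to obtain the pseudo-color spectrogram.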
By adopting the HR-DCGAN model, which combines the adversarial learning strategy of DCGAN with a feature-matching method, increasing the number of network layers of DCGAN and adding a feature-matching constraint allows the voiceprint features in the spectrogram to be better extracted and retained, generating high-resolution 256 × 256 × 3 spectrogram samples with a stable training process. This compensates for the shortage of audio samples in the identification and diagnosis of PD patients, meets the data requirements of deep learning, and effectively improves the identification accuracy of PD patients under small-sample conditions.
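As a minimal sketch of the feature-matching constraint mentioned above: assuming f(·) denotes an intermediate discriminator layer (an assumption for illustration; the patent does not specify which layer), the generator is penalized for the distance between the mean features of real and generated batches, ||E[f(x)] − E[f(G(z))]||², which encourages the generated spectrograms to reproduce real feature statistics and stabilizes training:

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """L2 distance between batch-mean discriminator features of real and
    generated samples.  Inputs are (batch, feature_dim) arrays as would be
    produced by an intermediate discriminator layer f(.)."""
    mu_real = real_feats.mean(axis=0)  # E[f(x)] over the real batch
    mu_fake = fake_feats.mean(axis=0)  # E[f(G(z))] over the generated batch
    return float(np.sum((mu_real - mu_fake) ** 2))
```

The loss is zero when the two batches share identical mean features and grows with the gap, so adding it to the generator objective constrains the generator beyond the plain adversarial signal.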
By comparing and analyzing the SSIM values between the generated and original spectrograms at different epochs, an SSIM threshold is derived and a selection criterion for sample expansion is established. The SSIM criterion effectively screens out high-similarity spectrogram samples, guaranteeing the validity of the expansion. The screened expansion samples can enlarge the original samples by different expansion coefficients (values between 1 and 30).
In addition, the invention also discloses a Parkinson voiceprint spectrogram sample expansion device, comprising a processor and a memory connected to the processor through a communication bus, wherein:
the memory is used for storing a Parkinson voiceprint spectrogram sample expansion program;
and the processor is configured to execute the Parkinson voiceprint spectrogram sample expansion program to implement any one of the Parkinson voiceprint spectrogram sample expansion steps.
The invention further discloses a computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to cause the one or more processors to perform any one of the Parkinson voiceprint spectrogram sample expansion steps.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (9)

1. A method for expanding Parkinson voiceprint spectrogram samples, characterized by comprising the following steps:
acquiring and segmenting a plurality of audios containing vowel pronunciations to obtain corresponding spectrograms;
converting the obtained spectrograms into grayscale spectrograms according to a Fourier transform;
converting the grayscale spectrograms into pseudo-color spectrograms;
converting the pseudo-color spectrograms into a plurality of pictures according to preset resolutions, and assigning a first label or a second label to each picture, wherein the first label is the spectrogram label corresponding to a Parkinson patient and the second label is the spectrogram label corresponding to a non-Parkinson person;
training the plurality of pictures through an HR-DCGAN model to generate trained pictures corresponding to the plurality of pictures;
obtaining the similarity between the trained pictures and the plurality of pictures;
and judging, one by one according to the similarity value, whether each of the trained pictures is used as an expansion sample of Parkinson patient speech.
2. The method for expanding Parkinson voiceprint spectrogram samples according to claim 1, wherein the step of acquiring and segmenting a plurality of audios containing vowel pronunciations to obtain corresponding spectrograms comprises:
acquiring a plurality of audios in which a vowel is pronounced continuously three times with a pronunciation duration of 6 s;
segmenting each audio into three 2 s audio segments;
and preprocessing each audio segment to obtain a spectrogram, wherein the preprocessing comprises pre-emphasis, framing, windowing, and endpoint detection.
3. The method for expanding Parkinson voiceprint spectrogram samples according to claim 1 or 2, wherein the step of converting the obtained spectrograms into grayscale spectrograms according to a Fourier transform comprises:
performing a Fourier transform on the spectrogram;
performing a fast Fourier transform on each frame of the signal to obtain its short-time Fourier transform and the corresponding short-time power spectrum;
and connecting the short-time power spectra to form the grayscale spectrogram.
4. The method for expanding Parkinson voiceprint spectrogram samples according to claim 3, wherein the step of converting the pseudo-color spectrograms into a plurality of pictures according to preset resolutions and assigning a first label or a second label to each picture comprises:
performing a redundancy-removal operation on the pseudo-color spectrograms, wherein the redundancy-removal operation comprises removing coordinate axes and white margins;
converting the pseudo-color spectrograms after the redundancy-removal operation into spectrograms in JPEG and PNG formats according to preset resolutions, wherein the preset resolutions are: 128 × 128, 256 × 256, 512 × 512, and 1024 × 1024;
and adding a label to each spectrogram.
5. The method for expanding Parkinson voiceprint spectrogram samples according to claim 1, wherein the step of obtaining the similarity between the trained pictures and the plurality of pictures comprises:
performing image blocking on each image among the plurality of pictures and its corresponding trained image to obtain a first number of image blocks;
for each image block, performing the steps of: obtaining the mean, variance, and covariance of the corresponding image blocks; calculating the similarity of the two image blocks according to the mean, variance, and covariance of each image block; and obtaining a first number of similarity values;
calculating the average of the first number of similarity values;
and taking the average as the similarity between the trained picture and the plurality of pictures.
6. The method for expanding Parkinson voiceprint spectrogram samples according to claim 5, wherein the formula for calculating the similarity between the two image blocks is specifically expressed as:

SSIM(x, g) = [(2μ_x μ_g + c_1)(2σ_xg + c_2)] / [(μ_x^2 + μ_g^2 + c_1)(σ_x^2 + σ_g^2 + c_2)]

where μ_x and σ_x^2 are the mean and variance of the image block from the plurality of pictures, μ_g and σ_g^2 are the mean and variance of the corresponding image block from the trained pictures, σ_xg is the covariance of the two image blocks, and c_1 and c_2 are constants.
7. The method for expanding Parkinson voiceprint spectrogram samples according to claim 5 or 6, wherein the step of judging, one by one according to the similarity value, whether each of the trained pictures is used as an expansion sample of Parkinson patient speech comprises:
acquiring a preset comparison value;
judging whether a target picture exists among the trained pictures whose similarity value with the corresponding image before training is not smaller than the preset comparison value;
and determining the target picture to be an expansion sample of Parkinson patient speech.
8. A device for expanding Parkinson voiceprint spectrogram samples, characterized by comprising a processor and a memory connected to the processor through a communication bus, wherein:
the memory is used for storing a Parkinson voiceprint spectrogram sample expansion program;
and the processor is configured to execute the Parkinson voiceprint spectrogram sample expansion program to implement the Parkinson voiceprint spectrogram sample expansion steps of any one of claims 1 to 7.
9. A computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to cause the one or more processors to perform the Parkinson voiceprint spectrogram sample expansion steps of any one of claims 1 to 7.
CN201910720986.2A 2019-08-06 2019-08-06 Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium Active CN110428364B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910720986.2A CN110428364B (en) 2019-08-06 2019-08-06 Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium


Publications (2)

Publication Number Publication Date
CN110428364A CN110428364A (en) 2019-11-08
CN110428364B true CN110428364B (en) 2022-09-30

Family

ID=68414378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910720986.2A Active CN110428364B (en) 2019-08-06 2019-08-06 Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium

Country Status (1)

Country Link
CN (1) CN110428364B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292766B (en) * 2020-02-07 2023-08-08 抖音视界有限公司 Method, apparatus, electronic device and medium for generating voice samples
CN111612799A (en) * 2020-05-15 2020-09-01 中南大学 Face data pair-oriented incomplete reticulate pattern face repairing method and system and storage medium
CN113255433A (en) * 2021-04-06 2021-08-13 北京迈格威科技有限公司 Model training method, device and computer storage medium
CN113642714B (en) * 2021-08-27 2024-02-09 国网湖南省电力有限公司 Insulator pollution discharge state identification method and system based on small sample learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080300867A1 (en) * 2007-06-03 2008-12-04 Yan Yuling System and method of analyzing voice via visual and acoustic data
CN110033018B (en) * 2019-03-06 2023-10-31 平安科技(深圳)有限公司 Graph similarity judging method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN110428364A (en) 2019-11-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant