CN111599347A - Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis

Info

Publication number
CN111599347A
Authority
CN
China
Prior art keywords
voice, MFCC, artificial intelligence, framing, standardized
Prior art date
Legal status
Granted
Application number
CN202010462384.4A
Other languages
Chinese (zh)
Other versions
CN111599347B (en)
Inventor
牟志伟
江晨银
柯慧明
潘正祥
温晓宇
陈亮
朱凌燕
Current Assignee
Guangzhou Kehui Jianyuan Medical Technology Co ltd
Original Assignee
Guangzhou Kehui Jianyuan Medical Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Kehui Jianyuan Medical Technology Co ltd
Priority to CN202010462384.4A
Publication of CN111599347A
Application granted
Publication of CN111599347B
Current legal status: Active

Classifications

    • G PHYSICS > G10 MUSICAL INSTRUMENTS; ACOUSTICS > G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition > G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • A HUMAN NECESSITIES > A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE > A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons > A61B5/48 Other medical applications > A61B5/4803 Speech analysis specially adapted for diagnostic purposes
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes > A61B5/7235 Details of waveform analysis > A61B5/725 Using specific filters therefor, e.g. Kalman or adaptive filters
    • A61B5/7235 Details of waveform analysis > A61B5/7253 Characterised by using transforms > A61B5/7257 Using Fourier transforms
    • G10L17/00 Speaker identification or verification techniques > G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 > G10L25/03 Characterised by the type of extracted parameters > G10L25/24 The extracted parameters being the cepstrum
    • G10L25/00 Speech or voice analysis techniques > G10L25/27 Characterised by the analysis technique > G10L25/30 Using neural networks


Abstract

The invention discloses a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, which comprises the following steps: collecting the voice data of 82 Mandarin Chinese syllables in the order of the Mandarin Chinese speech evaluation system vocabulary; editing the collected voice data to complete the clipping of the 82 syllables, then classifying and archiving them; extracting signals from the 82 clipped syllables and extracting the MFCC features of each syllable through specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, and extended framing; and building the processed data into an MFCC voice library. The invention extracts the specific MFCC features of each syllable through a standardized procedure and constructs a digitized, standardized, and structured voice feature database, which can serve various applications of pathological voice feature big data and artificial intelligence analysis and improves the objectivity and efficiency of pathological voice research and application.

Description

Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis
Technical Field
The invention belongs to the technical field of intelligent recognition, and particularly relates to a standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis.
Background
At present, the number of people with language disorders in China is increasing year by year, and the communication disorders caused by dysarthria seriously hinder patients' return to society. Although the number of dysarthria patients in China is large, research published in 2016 showed that current evaluation methods cannot meet therapists' requirements for precise speech rehabilitation. Assessment in China still relies mainly on subjective auditory evaluation and/or scales requiring subjective judgment, and therefore lacks objectivity and efficiency. In addition, the number of speech therapists in China is seriously insufficient, most have not graduated from professional programs, and their diagnostic and evaluation capability is weak.
In recent years, with the rapid development of artificial intelligence technology, application research on artificial neural networks (ANN) and deep learning (DL) in normal speech analysis and recognition, language education, intelligent voice guidance, and the like has produced notable results. The State Council's "New Generation Artificial Intelligence Development Plan" calls for accelerating artificial intelligence innovation; studying the characteristics and regularities of acoustic parameters in dysarthria and diagnosing and classifying the various dysarthrias with artificial neural networks can improve the objectivity and efficiency of pathological speech assessment and free up manpower. Big data and artificial intelligence analysis of pathological speech require a digitized, standardized, and structured data set. At present there is no unified method or standard for big data analysis and artificial intelligence research on pathological voice at home or abroad, and a unified, efficient pathological voice feature acquisition method is urgently needed.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, serving various applications of pathological voice feature big data and artificial intelligence analysis and improving the objectivity and efficiency of pathological voice research and application.
In order to achieve the purpose, the invention adopts the following technical scheme:
a standardized sampling method for extracting MFCC characteristics of pathological speech for artificial intelligence analysis comprises the following steps:
collecting voice data: the voice data of 82 Mandarin Chinese syllables are collected in the order of the "Mandarin Chinese speech evaluation system vocabulary";
editing the collected voice data to complete the clipping of the 82 syllables, then classifying and archiving them, wherein there are 28 unit vowel-tone items, 23 compound vowels, 21 consonants, and 10 sequential-speech items;
extracting signals from the 82 clipped syllables and extracting the MFCC features of each syllable through specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, and extended framing;
and forming a structured voice library from the processed data, wherein the standardized data of the MFCC voice library are specifically as follows:
after preprocessing, each of the 82 syllable samples has 4 MFCC features, stored in A, B, C, D four groups as standardized MFCC data of 13 frames, 19 frames, 13+3 frames, and 19+3 frames respectively;
structuring the database: the A, B, C, D four groups of data are placed into the vowel-tone sub-library and labelled vowel-tone sub-library group A, group B, group C, and group D.
As a preferred technical solution, the "mandarin chinese speech evaluation system vocabulary" includes 3 parts of table 4 main parts, namely, a unit tone part, a sequential language part, a compound vowel part, and a consonant part;
the unit tone and tone part comprises 24 single syllables of Chinese mandarin, which are composed of 1-4 tones of the same or equivalent tone position initial consonant and single final, and comprises: eighthly, drawing, holding, father, approach, nose, pen, must, all, reading, gambling, Du, Ge, Gege, Ou, wave, neck, lame, jow, silt, fish, rain and jade;
the sequential language part, which is a Chinese mandarin word with the number of 1-10 composed of initial consonant and final consonant, comprises: 1,2, 3,4,5,6, 7,8, 9 and 10;
the compound vowel part, 23 single syllables Chinese mandarin words composed of the same or equal phoneme initial and compound final 1 tone, includes: snapping, shrimp, bag, melon, loss, turtle, cup, hold, label, edge, class, guest, rush, upper, ice, cave, pot, light, guan, ditch, lambkin, boot and brother;
the consonant part, 21 single syllables Chinese mandarin words composed of 21 initials and 1 tone of single vowel a or i, includes: eighthly, lying prone, lapping, He, Ga, Jia, machine, Qin, Zhi, eat, Zi, Defect, Fa, Ha, West, Shi, Si, Ri, Ma, na and La.
As a preferred technical solution, when voice data are collected, the distance between the subject's lips and the recorder is 9 cm-11 cm, the speech rate is natural and stable, the volume is moderate, and the word list is recorded twice.
As a preferred technical solution, the pre-emphasis is specifically:
passing the speech signal through a high-pass filter of the form:
H(z) = 1 - μz^(-1)
where μ takes a value of 0.9 to 1.0.
As a preferred technical solution, the framing is specifically:
each frame covers 20-30 ms, and the overlap between two adjacent frames, i.e. the frame shift, is set to 10-15 ms; the sampling rate of the voice samples is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512.
As a preferred technical solution, the windowing is specifically:
after framing, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends; assuming the framed signal is S(n), n = 0, 1, ..., N-1, where N is the frame length, the signal after multiplying S(n) by the Hamming window is x(n):
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where a takes the value 0.46.
As a preferred technical solution, the fast Fourier transform is specifically:
performing a fast Fourier transform on each framed and windowed frame to obtain its spectrum, and taking the squared magnitude of the spectrum to obtain the power spectrum of the voice signal; the discrete Fourier transform of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k ≤ N - 1
where x(n) is the input voice signal and N is the number of Fourier transform points.
As a preferred technical solution, the triangular band-pass filtering is specifically:
passing X_a(k) through a set of 24 triangular filters with center frequencies f(m), m = 1, 2, ..., 24, the spacing between adjacent f(m) narrowing as m decreases and widening as m increases; the frequency response of the triangular band-pass filter is:
H_m(k) = 0 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0 for k > f(m+1)
where f(m) is the center frequency of the m-th filter, the center frequencies being spaced uniformly on the Mel frequency scale.
As a preferred technical solution, the log energy s(m) of the output of each filter H_m(k) is computed by substitution into the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 1 ≤ m ≤ 24.
as a preferred technical solution, the discrete cosine transform specifically includes:
substituting s (m) into the following equation:
Figure BDA0002511454680000051
the above formula L refers to the MFCC coefficient order, and is 12-16; m is the number of the triangular filters; c (n) MFCC values for each frame; the 13 th and 19 th stages are respectively connected by frames to obtain 2 groups of MFCC values to be stored in a bin, namely a group A and a group B.
As a preferred technical solution, the extended framing is specifically:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 are added to group A and group B as a subframe, yielding 2 further groups of MFCCs for storage, namely group C and group D.
Compared with the prior art, the invention has the following advantages and beneficial effects:
based on the research of pathological speech artificial intelligent recognition by the inventor, the invention designs ' evaluation word list of Mandarin Chinese dysarthria ' (hereinafter, word list '). The word list has 82 syllables of Chinese vocabulary, and specific MFCC characteristics of each syllable are extracted through a standardized flow method to construct a digitalized, standardized and structured voice database. The invention can serve for multiple applications of pathological voice characteristic big data and artificial intelligence analysis, and improves the objectivity and efficiency of pathological voice research and application.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of the Mel frequency filter bank of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
For speech recognition and voiceprint recognition, the most commonly used speech feature is the Mel-frequency cepstral coefficient (MFCC). The human ear has different hearing sensitivities to sound waves of different frequencies, and speech signals from 200 Hz to 5000 Hz have the greatest effect on intelligibility. The critical bandwidth of sound masking is smaller in the low-frequency domain than in the high-frequency domain. Therefore, 28 band-pass filters are arranged from dense to sparse according to critical bandwidth, from low frequency to high frequency, and the input signal is filtered through them. The signal energy output by each band-pass filter is taken as a basic feature of the signal; this acoustic feature, based on the characteristics of the human ear, is the MFCC. The shape of the human vocal tract is reflected in the short-time power spectrum envelope, and the MFCC can accurately represent this envelope; that is, the acoustic feature reflects structural changes of the vocal tract cavity and thereby indirectly reflects pathophysiological changes. In addition, because the feature does not depend on the nature of the signal, makes no assumptions about or restrictions on the input, and incorporates findings from auditory-model research, it is widely used for digital speech recognition. The parameter therefore has better robustness than the LPCC, which is based on a vocal tract model, conforms better to the auditory characteristics of the human ear, and retains good recognition performance when the signal-to-noise ratio decreases. In summary, the MFCC is suitable as a digitized speech input feature for big data research and artificial intelligence analysis.
The main technical scheme of the invention is as follows: the voice data of 82 Mandarin Chinese syllables are collected in the order of the "Mandarin Chinese speech evaluation system vocabulary" (shown in Tables 1-3) and preprocessed according to the method specified by the invention; the clipping of the 82 syllables is completed; MFCC features are extracted from each syllable; and after preprocessing, the syllables enter a structured voice library of unit vowels, compound vowels, consonants, sequential speech, and tones.
Furthermore, the "mandarin chinese speech evaluation system vocabulary" includes 3 parts, namely, a single tone part, a sequential language part, a compound vowel part, and a consonant part:
a unit tone and tone part, 24 single syllable mandarin chinese words consisting of consonants of the same or equal phoneme and 1-4 tones of a single vowel, eight (ba1), plucking (ba2), pinching (ba3), fath (ba4), detritus (bi1), nose (bi2), pen (bi3), bib (bi4), du (du1), reading (du2), gambling (du3), du (du4), brother (ge1), diaphragm (ge2), ge (ge3), individual (ge4), wave (bo1), neck (bo2), lameness (bo3), jolt (bo4), silt (yu1), fish (yu2), rain (yu3), and jade (yu 4);
a sequential language part, namely Chinese mandarin words with numbers of 1-10 consisting of initial consonants and final consonants, 1(yi1), 2(er4), 3(san1),4(si4),5(wu3),6(liu4), 7(qi1),8(ba1), 9(jiu3) and 10(shi 2);
a compound vowel part, 23 single syllable chinese mandarin words consisting of consonants and compound vowel 1 tones of the same or equal phoneme, an arm-break (bai1), shrimp (xia1), bag (bai1), melon (gua1), loss (diou1), turtle (guii 1), cup (bei1), hold (bie1), logo (biao1), edge (bian1), class (ban1), guest (bin1), runner (ben1), upper (bang1), ice (bin1), collapse (beng1), pan (guo1), light (guang1), gate (guan1), ditch (gou1), lambkin (guai1), baue (xiue 1), xiong 1);
the consonant part comprises 21 monosyllabic Chinese mandarins consisting of 21 initials and 1 tone of a simple vowel a or i, eight (ba1), groveling (pa1), overlapping (da1), he (ta1), Ga (ga1), coffee (ka1), machine (ji1), seven (qi1), cicada (zhi1), eating (chi1), funding (zi1), defect (ci1), hair (fa1), haha (ha1), west (xi1), teacher (shi1), thinking (si1), day (ri4), mother (ma1), that (na1) and La (la 1).
Mandarin Chinese speech evaluation system vocabulary, Table 1 (unit vowels, tones, sequential speech): reproduced as images in the original publication; its contents are the unit vowel-tone and sequential-speech word lists enumerated above.
Mandarin Chinese speech evaluation system vocabulary, Table 2 (compound vowels): reproduced as images in the original publication; its contents are the compound vowel word list enumerated above.
Mandarin Chinese speech evaluation system vocabulary, Table 3 (consonants)

No. | Consonant type | Word (gloss) | Initial | Final | Tone
1 | Unaspirated stop | eight | b | a | 1
2 | Aspirated stop | grovel | p | a | 1
3 | Unaspirated stop | build | d | a | 1
4 | Aspirated stop | he | t | a | 1
5 | Unaspirated stop | ga | g | a | 1
6 | Aspirated stop | coffee | k | a | 1
7 | Unaspirated affricate | machine | j | i | 1
8 | Aspirated affricate | seven | q | i | 1
9 | Unaspirated affricate | know | zh | i | 1
10 | Aspirated affricate | eat | ch | i | 1
11 | Unaspirated affricate | fund | z | i | 1
12 | Aspirated affricate | defect | c | i | 1
13 | Voiceless fricative | hair | f | a | 1
14 | Voiceless fricative | ha | h | a | 1
15 | Voiceless fricative | west | x | i | 1
16 | Voiceless fricative | teacher | sh | i | 1
17 | Voiceless fricative | think | s | i | 1
18 | Voiced fricative | day | r | i | 4
19 | Nasal | mother | m | a | 1
20 | Nasal | that | n | a | 1
21 | Lateral | pull | l | a | 1
To further illustrate the technical solution of the present invention, the monosyllabic word "ba" (eight) is taken as an example below:
before the standardized sampling of the present embodiment is performed, the recording environment needs to be selected.
Optionally, the recording environment of this embodiment: recording is carried out in a voice laboratory fitted with a sound-insulating door and sound-absorbing rock wool, with a sound insulation of 45 dB.
Optionally, the recording instrument and parameters of this example: a Sony Zoom H4N recording pen is used, recordings are stored at a 44.1 kHz sampling rate with 16-bit quality, and the recordings are copied to a computer hard disk.
As shown in FIG. 1, the standardized sampling method of the present embodiment for extracting pathological voice MFCC features for artificial intelligence analysis includes the following steps:
s1, collecting voice data, and collecting the voice data of 82 Mandarin Chinese syllables according to the sequence of the Mandarin Chinese voice evaluation system vocabulary; the method specifically comprises the following steps:
the subject was recorded by collecting speech data for 82 mandarin chinese syllables with reference to 82 chinese words of the mandarin chinese speech assessment system vocabularies (table 1). When recording, the subject takes the sitting position, the pen holds the recorder with hands, the lip of the subject is about 10cm away from the recorder, and when seeing that the character 'ba' appears on the screen, the subject reads 'ba (/ b ā /)' at natural smooth speed and moderate volume, and records for 2 times. The amplitude of the wave of the waveform recorded by the recording pen is required to be in the range of 1/3-2/3.
S2, editing the collected voice data, specifically:
the target note/b ā/first recording was cut out separately from the note of each subject using CoolEdit Pro2.1. If the first recording has noise, interference, waveform fluctuation amplitude exceeding the range of the window value 1/3-2/3 and waveform prompting energy is insufficient, the second recording data is selected for processing. The valid preprocessed sample classes are then archived to unit tone groups.
S3, extracting features from the clipped signal: based on the clipped sample, MFCC feature extraction for the digital voice signal of the syllable /bā/ is completed through pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, extended framing, and similar processing; the specific preprocessing steps are as follows:
S31, pre-emphasis:
the speech signal is passed through a high-pass filter of the form:
H(z) = 1 - μz^(-1)
where μ takes the value 0.97.
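A minimal sketch of this pre-emphasis filter in NumPy, assuming a one-dimensional floating-point signal; the function name is illustrative:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply H(z) = 1 - mu*z^(-1), i.e. y[n] = x[n] - mu*x[n-1];
    the first sample is passed through unchanged."""
    return np.append(x[0], x[1:] - mu * x[:-1])
```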
S32, framing:
each 25 ms of signal forms one frame, and the overlap between two adjacent frames, i.e. the frame shift, is set to 10 ms. The sampling rate of the voice sample /bā/ is 16 kHz, so the number of sampling points N per frame is 400. In this embodiment, 13 and 19 frames are taken, zero-padded if there are too few.
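A minimal framing sketch under the parameters above (16 kHz, 25 ms frames, 10 ms shift, padded or truncated to exactly 13 or 19 frames). The patent's wording equates the 10 ms overlap with the frame shift; this sketch follows that reading, and the function name is illustrative:

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 400,
                 frame_shift: int = 160, num_frames: int = 13) -> np.ndarray:
    """Split x into frames of frame_len samples taken every frame_shift
    samples (25 ms / 10 ms at 16 kHz), then zero-pad or truncate the
    result to exactly num_frames frames (13 or 19 in this method)."""
    available = max(1, 1 + (len(x) - frame_len) // frame_shift)
    frames = np.zeros((num_frames, frame_len))
    for i in range(min(available, num_frames)):
        chunk = x[i * frame_shift : i * frame_shift + frame_len]
        frames[i, :len(chunk)] = chunk  # zero-pads a short final chunk
    return frames
```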
S33, windowing:
after framing in step S32, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends. Assuming the framed signal is S(n), n = 0, 1, ..., N-1, where N is the frame length, the signal after multiplying S(n) by the Hamming window is x(n):
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where a takes the value 0.46.
S34, fast Fourier transform:
a fast Fourier transform is performed on each framed and windowed frame to obtain its spectrum, and the squared magnitude of the spectrum gives the power spectrum of the voice signal; the discrete Fourier transform (DFT) of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k ≤ N - 1
where x(n) is the input voice signal and N is the number of Fourier transform points.
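A sketch of the FFT and power-spectrum step, assuming a 512-point transform (np.fft.rfft zero-pads the 400-sample frames and keeps only the non-redundant bins of a real signal); the function name is illustrative:

```python
import numpy as np

def power_spectrum(windowed_frames: np.ndarray, nfft: int = 512) -> np.ndarray:
    """FFT each frame and return |X_a(k)|^2, shape (frames, nfft//2 + 1)."""
    spectrum = np.fft.rfft(windowed_frames, n=nfft, axis=1)
    return np.abs(spectrum) ** 2
```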
S35, triangular band-pass filtering:
X_a(k) is passed through a set of 24 triangular filters with center frequencies f(m), m = 1, 2, ..., 24. The spacing between adjacent f(m) narrows as m decreases and widens as m increases, as shown in FIG. 2.
The triangular band-pass filters smooth the spectrum and eliminate harmonics, so the result is not affected by the tone or pitch of the voice; they also reduce the subsequent amount of computation. The frequency response is:
H_m(k) = 0 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0 for k > f(m+1)
where f(m) is the center frequency of the m-th filter, the center frequencies being spaced uniformly on the Mel frequency scale.
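A sketch of constructing the 24 triangular filters. The patent states only that the spacing is dense at low frequencies and sparse at high frequencies (FIG. 2 shows a Mel filter bank), so uniform spacing on the standard Mel scale is assumed here, and the function names are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters: int = 24, nfft: int = 512, sr: int = 16000) -> np.ndarray:
    """Build triangular filters H_m(k) whose center frequencies f(m) are
    uniformly spaced on the Mel scale between 0 Hz and sr/2."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank
```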
S36, logarithmic operation: the log energy s(m) of the output of each filter H_m(k) is computed by the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 1 ≤ m ≤ 24
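A sketch of this logarithmic step, combining the power spectrum with the filter bank from the previous sketch; the small floor guarding against log(0) is an implementation assumption:

```python
import numpy as np

def log_mel_energies(pow_spec: np.ndarray, fbank: np.ndarray) -> np.ndarray:
    """s(m) = ln( sum_k |X_a(k)|^2 * H_m(k) ), per frame and filter;
    pow_spec has shape (frames, bins), fbank has shape (24, bins)."""
    energies = pow_spec @ fbank.T                 # shape (frames, 24)
    return np.log(np.maximum(energies, 1e-12))    # floor avoids log(0)
```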
S37, discrete cosine transform (DCT): s(m) is substituted into the following equation:
C(n) = Σ_{m=1}^{M} s(m)·cos(πn(m - 0.5)/M), n = 1, 2, ..., L
the above formula L refers to the MFCC coefficient order, the method specifies 13 and 19; m is the number of triangular filters, the method is designated 24; c (n) is the MFCC value for each frame. The 13 th and 19 th sub-frames are connected to obtain 2 MFCC values into A group and B group.
S38, extended framing:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 of /bā/ are added to group A and group B respectively as a subframe, yielding the other 2 MFCC sets, which enter group C and group D.
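A sketch of this extended-framing step. The patent does not specify how the three midpoint values F0, F1, and F2 are laid out within the three extra subframes (13 frames becoming 13+3), so the layout below, one value per pseudo-frame zero-padded to the MFCC width, is an assumption:

```python
import numpy as np

def extend_with_prosody(mfcc: np.ndarray, f0_mid: float,
                        f1_mid: float, f2_mid: float) -> np.ndarray:
    """Append three pseudo-frames carrying the F0, F1 and F2 midpoint
    values, turning a (13, L) matrix into the (13+3, L) layout of
    groups C and D."""
    extra = np.zeros((3, mfcc.shape[1]))
    extra[0, 0], extra[1, 0], extra[2, 0] = f0_mid, f1_mid, f2_mid
    return np.vstack([mfcc, extra])
```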
S4, forming the processed data into an MFCC voice library:
standardized data for the monosyllable "ba" in the MFCC voice library: after each syllable's data are preprocessed as in [0033] and [0034], the voice sample /bā/ of each of the 82 syllable samples has 4 MFCC features, stored in A, B, C, D four groups as standardized MFCC data of 13 frames, 19 frames, 13+3 frames, and 19+3 frames respectively.
Structuring the data: the 4 groups of data are placed into the vowel-tone sub-library and labelled vowel-tone sub-library group A, group B, group C, and group D.
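A sketch of filing one syllable's four MFCC variants into a sub-library; the dictionary layout, the key names, and the array shapes in the usage lines are illustrative assumptions, not structures defined by the patent:

```python
import numpy as np

def build_entry(mfcc13, mfcc19, mfcc13p3, mfcc19p3):
    """Bundle the four MFCC variants of one syllable into the
    A/B/C/D groups of a sub-library entry."""
    return {"A": mfcc13, "B": mfcc19, "C": mfcc13p3, "D": mfcc19p3}

# Hypothetical usage: file the syllable "ba" (tone 1) into the
# vowel-tone sub-library; the shapes are placeholders.
vowel_tone_sublibrary = {
    "ba1": build_entry(np.zeros((13, 13)), np.zeros((19, 19)),
                       np.zeros((16, 13)), np.zeros((22, 19))),
}
```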
The other 81 syllables are processed in the same way as the monosyllable "ba" and are not described in detail here.
The invention provides a novel standardized sampling method for pathological voice based on MFCC features. Unlike traditional voice recording and sampling methods, it formulates a word list of 82 Chinese syllables based on the inventors' earlier research results and, using a standardized, structured data sampling method, processes each syllable into 4 different data sets based on the acoustic index MFCC. The method can be conveniently used for constructing pathological voice databases, for big data voice analysis, and for artificial intelligence computation.
Research on structured sampling standards for pathological speech in China is scarce. The method of the invention has been practiced in applications such as pathological voice libraries, artificial neural networks, and deep learning, and has the advantages of reliability and simple operation, so that it may ultimately become the standard in this field.
An evaluation method based on artificial intelligence and big data frees up manpower and relies on the development of intelligent technology; it is a product of the intelligent era, combined with intelligent development, and a result of the progress of the times and of scientific development. The invention provides a method for artificial intelligence research and scientific diagnosis of pathological voice.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A standardized sampling method for extracting MFCC characteristics of pathological speech for artificial intelligence analysis is characterized by comprising the following steps:
collecting voice data: the voice data of 82 Mandarin Chinese syllables are collected in the order of the "Mandarin Chinese speech evaluation system vocabulary";
editing the collected voice data to complete the clipping of the 82 syllables, then classifying and archiving them, wherein there are 28 unit vowel-tone items, 23 compound vowels, 21 consonants, and 10 sequential-speech items;
extracting signals from the 82 clipped syllables and extracting the MFCC features of each syllable through specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, and extended framing;
and forming a structured voice library from the processed data, wherein the standardized data of the MFCC voice library are specifically as follows:
after preprocessing, each of the 82 syllable samples has 4 MFCC features, stored in A, B, C, D four groups as standardized MFCC data of 13 frames, 19 frames, 13+3 frames, and 19+3 frames respectively;
structuring the database: the A, B, C, D four groups of data are placed into the vowel-tone sub-library and labelled vowel-tone sub-library group A, group B, group C, and group D.
2. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to claim 1, wherein the "Mandarin Chinese speech evaluation system vocabulary" comprises 3 tables covering 4 main parts, namely a unit vowel-tone part, a sequential-speech part, a compound vowel part, and a consonant part;
the unit vowel-tone part comprises 24 monosyllabic Mandarin words consisting of initials of the same or equivalent phoneme and a single final in tones 1-4, namely the syllables ba, bi, du, ge, bo, and yu, each in all four tones;
the sequential-speech part comprises the Mandarin words for the numbers 1-10, consisting of initials and finals: 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10;
the compound vowel part comprises 23 monosyllabic Mandarin words consisting of initials of the same or equivalent phoneme and a compound final in tone 1, namely the syllables bai, xia, bao, gua, diu, gui, bei, bie, biao, bian, ban, bin, ben, bang, bing, beng, guo, guang, guan, gou, guai, xue, and xiong;
the consonant part comprises 21 monosyllabic Mandarin words consisting of the 21 initials and the single final a or i in tone 1 (tone 4 for ri), namely the syllables ba, pa, da, ta, ga, ka, ji, qi, zhi, chi, zi, ci, fa, ha, xi, shi, si, ri, ma, na, and la.
3. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to claim 1, wherein, when voice data are collected, the distance between the subject's lips and the recorder is 9 cm-11 cm, the speech rate is natural and stable, the volume is moderate, and the word list is recorded twice;
the pre-emphasis is specifically:
passing the speech signal through a high-pass filter of the form:
H(z) = 1 - μz^(-1)
where μ takes a value of 0.9 to 1.0.
4. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to claim 1, wherein the framing is specifically:
each frame covers 20-30 ms, and the overlap between two adjacent frames, i.e. the frame shift, is set to 10-15 ms; the sampling rate of the voice samples is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512.
5. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to claim 1, wherein the windowing is specifically:
after framing, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends; assuming the framed signal is S(n), n = 0, 1, ..., N-1, where N is the frame length, the signal after multiplying S(n) by the Hamming window is x(n):
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where a takes the value 0.46.
6. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to claim 1, wherein the fast Fourier transform is specifically:
performing a fast Fourier transform on each framed and windowed frame to obtain its spectrum, and taking the squared magnitude of the spectrum to obtain the power spectrum of the voice signal; the discrete Fourier transform of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k ≤ N - 1
where x(n) is the input voice signal and N is the number of Fourier transform points.
7. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to claim 6, wherein the triangular band-pass filtering is specifically:
passing X_a(k) through a set of 24 triangular filters with center frequencies f(m), m = 1, 2, ..., 24, the spacing between adjacent f(m) narrowing as m decreases and widening as m increases; the frequency response of the triangular band-pass filter is:
H_m(k) = 0 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0 for k > f(m+1)
where f(m) is the center frequency of the m-th filter, the center frequencies being spaced uniformly on the Mel frequency scale.
8. The method according to claim 7, wherein the log energy s(m) of the output of each filter H_m(k) is computed by substitution into the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 1 ≤ m ≤ 24.
9. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to claim 8, wherein the discrete cosine transform is specifically:
substituting s(m) into the following equation:
C(n) = Σ_{m=1}^{M} s(m)·cos(πn(m - 0.5)/M), n = 1, 2, ..., L
where L is the MFCC coefficient order, taken as 12-16, and M is the number of triangular filters; C(n) is the MFCC value of each frame; the order-13 and order-19 coefficients are concatenated frame by frame to obtain 2 groups of MFCC values for storage, namely group A and group B.
10. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to claim 9, wherein the extended framing is specifically:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 are added to group A and group B as a subframe, yielding 2 further groups of MFCCs for storage, namely group C and group D.
CN202010462384.4A 2020-05-27 2020-05-27 Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis Active CN111599347B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010462384.4A | 2020-05-27 | 2020-05-27 | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis

Publications (2)

Publication Number | Publication Date
CN111599347A (en) | 2020-08-28
CN111599347B (en) | 2024-04-16

Family

ID=72192364

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010462384.4A (Active) | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis | 2020-05-27 | 2020-05-27

Country Status (1)

CN | CN111599347B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382293A (en) * 2020-11-11 2021-02-19 广东电网有限责任公司 Intelligent voice interaction method and system for power Internet of things

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1211026A (en) * 1997-09-05 1999-03-17 中国科学院声学研究所 Continuous voice identification technology for Chinese putonghua large vocabulary
WO2001039179A1 (en) * 1999-11-23 2001-05-31 Infotalk Corporation Limited System and method for speech recognition using tonal modeling
CN1412741A (en) * 2002-12-13 2003-04-23 郑方 Chinese speech identification method with dialect background
CN101436403A (en) * 2007-11-16 2009-05-20 创新未来科技有限公司 Method and system for recognizing tone
CN103310273A (en) * 2013-06-26 2013-09-18 南京邮电大学 Method for articulating Chinese vowels with tones and based on DIVA model
CN103366735A (en) * 2012-03-29 2013-10-23 北京中传天籁数字技术有限公司 A voice data mapping method and apparatus
CN104123934A (en) * 2014-07-23 2014-10-29 泰亿格电子(上海)有限公司 Speech composition recognition method and system
CN105788608A (en) * 2016-03-03 2016-07-20 渤海大学 Chinese initial consonant and compound vowel visualization method based on neural network
CN110570842A (en) * 2019-10-25 2019-12-13 南京云白信息科技有限公司 Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
CN110600055A (en) * 2019-08-15 2019-12-20 杭州电子科技大学 Singing voice separation method using melody extraction and voice synthesis technology
CN110808072A (en) * 2019-11-08 2020-02-18 广州科慧健远医疗科技有限公司 Method for evaluating dysarthria of children based on optimized acoustic parameters of data mining technology
CN110827980A (en) * 2019-11-08 2020-02-21 广州科慧健远医疗科技有限公司 Dysarthria grading evaluation method based on acoustic indexes
CN111028863A (en) * 2019-12-20 2020-04-17 广州科慧健远医疗科技有限公司 Method for diagnosing dysarthria tone error after stroke based on neural network and diagnosis device thereof


Also Published As

Publication number Publication date
CN111599347B (en) 2024-04-16


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant