CN111599347A - Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis - Google Patents
- Publication number
- CN111599347A (application number CN202010462384.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- mfcc
- artificial intelligence
- framing
- standardized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/48—Other medical applications
- A61B5/4803—Speech analysis specially adapted for diagnostic purposes
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/725—Details of waveform analysis using specific filters therefor, e.g. Kalman or adaptive filters
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7253—Details of waveform analysis characterised by using transforms
- A61B5/7257—Details of waveform analysis characterised by using transforms using Fourier transforms
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, comprising the following steps: collecting voice data for 82 Mandarin Chinese syllables in the order given by the "Mandarin Chinese Speech Evaluation System Vocabulary"; clipping the collected voice data to isolate the 82 syllables, which are then classified and archived; extracting the signal of each clipped syllable and computing its MFCC features through specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended-framing processing; and assembling the processed data into an MFCC voice library. By extracting the specific MFCC features of each syllable through a standardized procedure, the invention constructs a digitized, standardized and structured voice feature database that can serve a variety of pathological-voice big-data and artificial-intelligence applications, improving the objectivity and efficiency of pathological voice research and application.
Description
Technical Field
The invention belongs to the technical field of intelligent recognition, and particularly relates to a standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis.
Background
At present, the number of people with language disorders in China increases year by year, and the communication disorders caused by dysarthria seriously affect patients' return to society. Although the number of dysarthria patients in China is large, a 2016 study by Lin and Lu showed that current evaluation methods cannot meet therapists' requirements for precise speech rehabilitation. Assessment in China still relies mainly on subjective auditory evaluation and/or scales requiring subjective judgment, and therefore lacks objectivity and efficiency. In addition, the number of speech therapists in China is seriously insufficient, most are not graduates of professional programs, and their diagnostic and evaluation capability is weak.
In recent years, artificial intelligence technologies such as Artificial Neural Networks (ANN) and Deep Learning (DL) have achieved notable results in normal speech analysis and recognition, language education, intelligent voice guidance, and similar applications. The State Council's "New Generation Artificial Intelligence Development Plan" calls for accelerating innovation in artificial intelligence; studying the characteristics and regularities of acoustic parameters in dysarthria and diagnosing and classifying its various forms with artificial neural networks would improve the objectivity and efficiency of pathological speech assessment and free up manpower. Big-data and artificial-intelligence analysis of pathological speech requires a digitized, standardized and structured data set. At present there is no unified method or standard, at home or abroad, for big-data analysis and artificial-intelligence research on pathological voice, so a unified and efficient method for acquiring pathological voice features is urgently needed.
Disclosure of Invention
The invention primarily aims to overcome the defects of the prior art by providing a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, which serves a variety of pathological-voice feature big-data and artificial-intelligence applications and improves the objectivity and efficiency of pathological voice research and application.
In order to achieve the purpose, the invention adopts the following technical scheme:
a standardized sampling method for extracting MFCC characteristics of pathological speech for artificial intelligence analysis comprises the following steps:
collecting voice data: the voice data of 82 Mandarin Chinese syllables are collected in the order of the "Mandarin Chinese Speech Evaluation System Vocabulary";
clipping the collected voice data to complete the editing of the 82 syllables, which are then classified and archived as 28 unit-tone, 23 compound-vowel, 21 consonant and 10 sequential-language samples;
extracting the signals of the 82 clipped syllables, and extracting the MFCC features of each syllable through the specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended-framing processing;
and forming a structured voice library from the processed data, wherein the standardized data of the MFCC voice library are specifically as follows:
after preprocessing, each of the 82 syllable samples has 4 MFCC features, stored in the four groups A, B, C and D as standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively;
structuring the database: the four groups of data A, B, C and D are placed into the vowel-and-tone sub-library and labeled as vowel-and-tone sub-library groups A, B, C and D.
As a preferred technical solution, the "Mandarin Chinese Speech Evaluation System Vocabulary" comprises 4 main parts arranged in 3 tables, namely a unit-tone-and-tone part, a sequential-language part, a compound-vowel part and a consonant part;
the unit-tone-and-tone part comprises 24 Mandarin Chinese monosyllables formed from initials of the same or equivalent phoneme and a single final in tones 1-4: ba1, ba2, ba3, ba4, bi1, bi2, bi3, bi4, du1, du2, du3, du4, ge1, ge2, ge3, ge4, bo1, bo2, bo3, bo4, yu1, yu2, yu3 and yu4;
the sequential-language part comprises the Mandarin Chinese number words 1-10, each composed of an initial and a final: yi1 (1), er4 (2), san1 (3), si4 (4), wu3 (5), liu4 (6), qi1 (7), ba1 (8), jiu3 (9) and shi2 (10);
the compound-vowel part comprises 23 Mandarin Chinese monosyllables composed of initials of the same or equivalent phoneme and a compound final in tone 1: bai1 (snap), xia1 (shrimp), bao1 (bag), gua1 (melon), diu1 (lose), gui1 (turtle), bei1 (cup), bie1 (hold back), biao1 (label), bian1 (edge), ban1 (class), bin1 (guest), ben1 (rush), bang1 (shoe upper), bing1 (ice), beng1 (collapse), guo1 (pot), guang1 (light), guan1 (gate), gou1 (ditch), guai1 (well-behaved), xue1 (boot) and xiong1 (elder brother);
the consonant part comprises 21 Mandarin Chinese monosyllables composed of the 21 initials and the single final a or i in tone 1: ba1 (eight), pa1 (lie prone), da1 (lap), ta1 (he), ga1, ka1 (coffee), ji1 (machine), qi1 (seven), zhi1 (know), chi1 (eat), zi1 (resources), ci1 (defect), fa1 (hair), ha1 (ha), xi1 (west), shi1 (teacher), si1 (think), ri4 (day), ma1 (mother), na1 (that) and la1 (pull).
As a preferred technical scheme, when voice data are collected, the subject's lips are kept 9 cm-11 cm from the recorder, the speech rate is natural and steady, the volume is moderate, and the word list is recorded 2 times.
As a preferred technical solution, the pre-emphasis specifically comprises:
processing the speech signal through a high-pass filter of the form
H(z) = 1 - μz^(-1)
where μ takes a value between 0.9 and 1.0.
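As an illustration of this pre-emphasis step, a minimal sketch in Python follows (NumPy and the function name are assumptions for illustration; the patent itself specifies only the filter formula and the range of μ):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply the high-pass filter H(z) = 1 - mu * z^-1 to a speech signal."""
    # y[n] = x[n] - mu * x[n-1]; the first sample is passed through unchanged.
    return np.append(signal[0], signal[1:] - mu * signal[:-1])
```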
As a preferred technical solution, the framing specifically includes:
each frame spans 20-30 ms of signal; the overlap between two adjacent frames, i.e. the frame shift, is set to 10-15 ms; the sampling rate of the voice samples is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512.
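A minimal framing sketch under these parameters (NumPy assumed; names are illustrative, and the signal is assumed to be at least one frame long):

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (20-30 ms frames, 10-15 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 points at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 points at 16 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(num_frames)])
```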
As a preferred technical scheme, the windowing specifically is as follows:
multiplying each frame by a Hamming window after framing to increase the continuity of the left and right ends of the frame; let the framed signal be S(n), n = 0, 1, ..., N-1, where N is the frame length; the signal value after multiplying S(n) by the Hamming window is x(n),
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N-1
in the above formula, the value of a is 0.46.
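A sketch of this windowing step (NumPy assumed; `frames` is the output of the framing step above):

```python
import numpy as np

def hamming_window(N: int, a: float = 0.46) -> np.ndarray:
    """W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)), n = 0..N-1, with a = 0.46."""
    n = np.arange(N)
    return (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))

def apply_window(frames: np.ndarray) -> np.ndarray:
    """x(n) = S(n) * W(n), applied to every frame."""
    return frames * hamming_window(frames.shape[1])
```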
As a preferred technical solution, the fast fourier transform specifically includes:
performing fast Fourier transform on each framed and windowed frame signal to obtain the frequency spectrum of each frame, and taking the squared magnitude of the spectrum of the voice signal to obtain its power spectrum; the discrete Fourier transform of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πkn/N), 0 ≤ k ≤ N-1
in the above formula, x(n) is the input voice signal, and N represents the number of points of the Fourier transform.
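A sketch of this transform step (NumPy assumed; the FFT size of 512 is an illustrative choice within the 256-512 range given above):

```python
import numpy as np

def power_spectrum(windowed_frames: np.ndarray, nfft: int = 512) -> np.ndarray:
    """FFT each frame to get X_a(k), then take the squared magnitude |X_a(k)|^2."""
    spectrum = np.fft.rfft(windowed_frames, n=nfft)  # one-sided spectrum per frame
    return np.abs(spectrum) ** 2
```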
As a preferred technical solution, the processing of the triangular band-pass filter specifically includes:
let X_a(k) pass through a set of 24 triangular filters with center frequencies designated as f(m), m = 1, 2, ..., 24, where the spacing between adjacent f(m) narrows as m decreases and widens as m increases; the frequency response of the m-th triangular band-pass filter is:
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).
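The patent does not spell out how the center frequencies f(m) are placed; a common construction, given here as an assumption, spaces them uniformly on the mel scale (NumPy assumed, and the FFT size is assumed large enough that adjacent filter edges fall on distinct bins):

```python
import numpy as np

def mel_filterbank(num_filters: int = 24, nfft: int = 512,
                   sample_rate: int = 16000) -> np.ndarray:
    """Build triangular filters H_m(k): dense at low and sparse at high frequency."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # num_filters + 2 boundary points, uniformly spaced on the mel scale
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    H = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising edge of triangle m
            H[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):           # falling edge of triangle m
            H[m - 1, k] = (right - k) / (right - center)
    return H
```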
As a preferred technical solution, the logarithmic energy output by each filter bank is computed by substituting H_m(k) into the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|² H_m(k) ), 0 ≤ m ≤ M.
as a preferred technical solution, the discrete cosine transform specifically includes:
substituting s(m) into the following equation:
C(n) = Σ_{m=1}^{M} s(m) cos(πn(m - 0.5)/M), n = 1, 2, ..., L
in the above formula, L refers to the MFCC coefficient order, taken as 12-16; M is the number of triangular filters; C(n) is the MFCC value of each frame; the 13th-order and 19th-order coefficients are concatenated frame by frame to obtain 2 groups of MFCC values for storage, namely group A and group B.
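A sketch combining the filtering, logarithm and DCT steps above (NumPy assumed; running it once with L = 13 and once with L = 19 would yield the group A and group B coefficients):

```python
import numpy as np

def mfcc_from_power(pow_frames: np.ndarray, fbank: np.ndarray, L: int = 13) -> np.ndarray:
    """s(m) = ln(sum_k |X_a(k)|^2 * H_m(k)); C(n) = sum_m s(m) * cos(pi*n*(m-0.5)/M)."""
    M = fbank.shape[0]                            # number of triangular filters (24)
    s = np.log(pow_frames @ fbank.T + 1e-10)      # log filter-bank energies per frame
    n = np.arange(1, L + 1)[:, None]              # coefficient orders 1..L
    m = np.arange(1, M + 1)[None, :]
    dct_basis = np.cos(np.pi * n * (m - 0.5) / M)
    return s @ dct_basis.T                        # shape: (num_frames, L)
```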
As a preferred technical solution, the extended framing specifically includes:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 are added to group A and group B as extra subframes, yielding 2 further groups of MFCCs for storage, namely group C and group D.
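How the three prosodic values are laid out inside the extra subframes is not detailed in the patent; the following sketch (an assumption for illustration) appends one zero-padded row per value:

```python
import numpy as np

def extend_with_prosody(mfcc: np.ndarray, f1_mid: float, f2_mid: float,
                        f0_mid: float) -> np.ndarray:
    """Append midpoint values of F1, F2 and F0 as three extra pseudo-frames."""
    extra = np.zeros((3, mfcc.shape[1]))          # one row per prosodic value
    extra[0, 0], extra[1, 0], extra[2, 0] = f1_mid, f2_mid, f0_mid
    return np.vstack([mfcc, extra])               # 13 -> 13+3 or 19 -> 19+3 frames
```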
Compared with the prior art, the invention has the following advantages and beneficial effects:
based on the research of pathological speech artificial intelligent recognition by the inventor, the invention designs ' evaluation word list of Mandarin Chinese dysarthria ' (hereinafter, word list '). The word list has 82 syllables of Chinese vocabulary, and specific MFCC characteristics of each syllable are extracted through a standardized flow method to construct a digitalized, standardized and structured voice database. The invention can serve for multiple applications of pathological voice characteristic big data and artificial intelligence analysis, and improves the objectivity and efficiency of pathological voice research and application.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the Mel-frequency filter bank of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
In Speech Recognition and Voiceprint Recognition, the most commonly used speech feature is the Mel-scale Frequency Cepstral Coefficient (MFCC). The human ear has different hearing sensitivity to sound waves of different frequencies: speech signals from 200 Hz to 5000 Hz have the greatest effect on speech intelligibility, and the critical bandwidth of sound masking is smaller in the low-frequency domain than in the high-frequency domain. Accordingly, 28 band-pass filters are arranged from dense to sparse, from low frequency to high frequency, according to the critical bandwidth, and the input signal is filtered through them; the signal energy output by each band-pass filter is taken as a basic feature of the signal, and the acoustic feature built on these characteristics of the human ear is the MFCC. The shape of the human vocal tract is reflected in the envelope of the short-time power spectrum, which the MFCC represents accurately; the acoustic features thus reflect structural changes of the vocal tract cavity and indirectly reflect pathophysiological changes. In addition, because these features do not depend on the properties of the signal, make no assumptions about or restrictions on the input, and incorporate the research results of auditory models, they are widely used in digital speech recognition. The parameter therefore has better robustness than the vocal-tract-model-based LPCC, conforms better to the auditory characteristics of the human ear, and retains good recognition performance when the signal-to-noise ratio decreases. In summary, the MFCC is well suited as a digitized speech input feature for big-data research and artificial-intelligence analysis.
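For reference, the mel scale underlying such filter banks is commonly related to physical frequency by Mel(f) = 2595·log10(1 + f/700), with f in Hz; this standard formula is quoted here as background and is not stated explicitly in the patent.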
The main technical scheme of the invention is as follows: 82 Mandarin Chinese voice samples are collected in the order of the "Mandarin Chinese Speech Evaluation System Vocabulary" (Tables 1-3) and preprocessed by the method specified in the invention to complete the clipping of the 82 syllables; MFCC features are extracted from each syllable, and after preprocessing each syllable enters a structured voice library of unit tones, compound vowels, consonants, sequential language and tones.
Furthermore, the "Mandarin Chinese Speech Evaluation System Vocabulary" comprises 4 parts, namely a unit-tone-and-tone part, a sequential-language part, a compound-vowel part and a consonant part:
a unit-tone-and-tone part: 24 Mandarin Chinese monosyllables consisting of initials of the same or equivalent phoneme and a single final in tones 1-4, namely eight (ba1), pluck (ba2), hold (ba3), father (ba4), approach (bi1), nose (bi2), pen (bi3), must (bi4), all (du1), read (du2), gamble (du3), degree (du4), brother (ge1), diaphragm (ge2), ge (ge3), individual (ge4), wave (bo1), neck (bo2), lame (bo3), jolt (bo4), silt (yu1), fish (yu2), rain (yu3) and jade (yu4);
a sequential-language part: the Mandarin Chinese number words 1-10, each consisting of an initial and a final, namely 1 (yi1), 2 (er4), 3 (san1), 4 (si4), 5 (wu3), 6 (liu4), 7 (qi1), 8 (ba1), 9 (jiu3) and 10 (shi2);
a compound-vowel part: 23 Mandarin Chinese monosyllables consisting of initials of the same or equivalent phoneme and a compound final in tone 1, namely snap (bai1), shrimp (xia1), bag (bao1), melon (gua1), lose (diu1), turtle (gui1), cup (bei1), hold back (bie1), label (biao1), edge (bian1), class (ban1), guest (bin1), rush (ben1), shoe upper (bang1), ice (bing1), collapse (beng1), pot (guo1), light (guang1), gate (guan1), ditch (gou1), well-behaved (guai1), boot (xue1) and elder brother (xiong1);
the consonant part comprises 21 Mandarin Chinese monosyllables consisting of the 21 initials and the single final a or i in tone 1, namely eight (ba1), lie prone (pa1), lap (da1), he (ta1), ga (ga1), coffee (ka1), machine (ji1), seven (qi1), know (zhi1), eat (chi1), resources (zi1), defect (ci1), hair (fa1), ha (ha1), west (xi1), teacher (shi1), think (si1), day (ri4), mother (ma1), that (na1) and pull (la1).
Table 1. Mandarin Chinese Speech Evaluation System Vocabulary (unit tones, tones, sequential language)
Table 2. Mandarin Chinese Speech Evaluation System Vocabulary (compound vowels)
Table 3. Mandarin Chinese Speech Evaluation System Vocabulary (consonants)
| Serial number | Consonant type | Character (gloss) | Initial | Final | Tone |
|---|---|---|---|---|---|
| 1 | Unaspirated stop | eight | b | a | 1 |
| 2 | Aspirated stop | lie prone | p | a | 1 |
| 3 | Unaspirated stop | lap | d | a | 1 |
| 4 | Aspirated stop | he | t | a | 1 |
| 5 | Unaspirated stop | ga | g | a | 1 |
| 6 | Aspirated stop | coffee | k | a | 1 |
| 7 | Unaspirated affricate | machine | j | i | 1 |
| 8 | Aspirated affricate | seven | q | i | 1 |
| 9 | Unaspirated affricate | know | zh | i | 1 |
| 10 | Aspirated affricate | eat | ch | i | 1 |
| 11 | Unaspirated affricate | resources | z | i | 1 |
| 12 | Aspirated affricate | defect | c | i | 1 |
| 13 | Voiceless fricative | hair | f | a | 1 |
| 14 | Voiceless fricative | ha | h | a | 1 |
| 15 | Voiceless fricative | west | x | i | 1 |
| 16 | Voiceless fricative | teacher | sh | i | 1 |
| 17 | Voiceless fricative | think | s | i | 1 |
| 18 | Voiced fricative | day | r | i | 4 |
| 19 | Nasal | mother | m | a | 1 |
| 20 | Nasal | that | n | a | 1 |
| 21 | Lateral | pull | l | a | 1 |
To further illustrate the technical solution of the present invention, the monosyllable "ba" is taken as an example:
before the standardized sampling of the present embodiment is performed, the recording environment needs to be selected.
Optionally, the recording environment of this embodiment: recording is carried out in a speech laboratory fitted with a soundproof door and sound-absorbing rock wool, providing 45 dB of sound insulation.
Optionally, the recording instrument and parameters of this embodiment: a Sony Zoom H4N recording pen is selected, recordings are stored at a 44.1 kHz sampling rate with 16-bit quality, and the recordings are copied to a computer hard disk.
As shown in fig. 1, the standardized sampling method for extracting the MFCC features of the pathological speech for artificial intelligence analysis of the present embodiment includes the following steps:
s1, collecting voice data, and collecting the voice data of 82 Mandarin Chinese syllables according to the sequence of the Mandarin Chinese voice evaluation system vocabulary; the method specifically comprises the following steps:
the subject was recorded by collecting speech data for 82 mandarin chinese syllables with reference to 82 chinese words of the mandarin chinese speech assessment system vocabularies (table 1). When recording, the subject takes the sitting position, the pen holds the recorder with hands, the lip of the subject is about 10cm away from the recorder, and when seeing that the character 'ba' appears on the screen, the subject reads 'ba (/ b ā /)' at natural smooth speed and moderate volume, and records for 2 times. The amplitude of the wave of the waveform recorded by the recording pen is required to be in the range of 1/3-2/3.
S2, editing the collected voice data, specifically:
the target note/b ā/first recording was cut out separately from the note of each subject using CoolEdit Pro2.1. If the first recording has noise, interference, waveform fluctuation amplitude exceeding the range of the window value 1/3-2/3 and waveform prompting energy is insufficient, the second recording data is selected for processing. The valid preprocessed sample classes are then archived to unit tone groups.
S3, extracting features from the clipped signal: based on the clipped sample, the MFCC feature extraction of the digital voice signal of the syllable /bā/ is completed through pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, extended framing and related processing; the specific preprocessing steps are as follows:
s31, pre-emphasis is designated:
processing the speech signal through a high-pass filter of the form
H(z) = 1 - μz^(-1)
in the above formula, μ takes the value 0.97.
S32, frame division is designated:
with 25 ms of signal taken as one frame, the overlap between two adjacent frames, i.e. the frame shift, is set to 10 ms; the sampling rate of the voice sample /bā/ is 16 kHz, so each frame contains N = 400 sampling points; in this embodiment 13 and 19 frames are taken, with zero padding applied if there are too few frames.
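A sketch of fixing the frame count to 13 or 19 with zero padding, as this embodiment specifies (NumPy assumed; whether truncation is also permitted when there are too many frames is not stated, so it is included here only as an assumption):

```python
import numpy as np

def fix_frame_count(frames: np.ndarray, target: int = 13) -> np.ndarray:
    """Zero-pad (or truncate) the frame sequence to exactly `target` frames."""
    if len(frames) >= target:
        return frames[:target]
    pad = np.zeros((target - len(frames), frames.shape[1]))
    return np.vstack([frames, pad])
```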
S33, windowing:
after framing in step S32, each frame is multiplied by a Hamming window to increase the continuity of the left and right ends of the frame. Let the framed signal be S(n), n = 0, 1, ..., N-1, where N is the frame length; the signal value after multiplying S(n) by the Hamming window is x(n),
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N-1
where a takes the value 0.46.
S34, fast Fourier transform:
performing fast Fourier transform on each framed and windowed frame signal to obtain the frequency spectrum of each frame, and taking the squared magnitude of the spectrum of the voice signal to obtain its power spectrum; the Discrete Fourier Transform (DFT) of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πkn/N), 0 ≤ k ≤ N-1
in the above formula, x(n) is the input voice signal, and N represents the number of points of the Fourier transform.
S35, triangular band-pass filtering:
X_a(k) is passed through a set of 24 triangular filters with center frequencies designated as f(m), m = 1, 2, ..., 24. The spacing between adjacent f(m) narrows as m decreases and widens as m increases, as shown in FIG. 2.
The triangular band-pass filters smooth the frequency spectrum and eliminate harmonics, so that the result is not affected by the tone or pitch of the input voice; they also reduce the amount of subsequent computation. The frequency response of the m-th filter is:
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).
S36, logarithmic operation: the output H_m(k) of each filter bank is substituted into the following formula to obtain the logarithmic energy:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|² H_m(k) ), 0 ≤ m ≤ M.
S37, Discrete Cosine Transform (DCT): substituting s(m) into the following equation:
C(n) = Σ_{m=1}^{M} s(m) cos(πn(m - 0.5)/M), n = 1, 2, ..., L
in the above formula, L refers to the MFCC coefficient order (the method specifies 13 and 19) and M is the number of triangular filters (the method specifies 24); C(n) is the MFCC value of each frame. The 13th-order and 19th-order coefficients are concatenated frame by frame to obtain 2 sets of MFCC values, stored as group A and group B.
S38, extended framing:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 of /bā/ are added to group A and group B respectively as extra subframes, giving 2 further sets of MFCCs that are stored as group C and group D.
S4, forming the processed data into an MFCC voice library:
standardized data for the monosyllable "ba" in the MFCC voice library: after the preprocessing described above, the voice sample /bā/ (like each of the 82 syllable samples) has 4 MFCC features, stored in the four groups A, B, C and D as standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively.
Structured data: the 4 groups of data are placed into the vowel-and-tone sub-library and labeled as vowel-and-tone sub-library groups A, B, C and D.
The other 81 syllables are processed in the same way as the monosyllable "ba" and are not described in further detail here.
The invention develops a novel standardized sampling method for pathological voice based on MFCC features. Unlike traditional voice recording and sampling methods, it formulates a word list of 82 Chinese syllables based on the inventors' earlier research results and, using a standardized and structured data sampling method, processes each syllable into 4 different data sets based on the acoustic index MFCC. The method can be conveniently used for constructing pathological voice databases, for big-data voice analysis and for artificial-intelligence computation.
Research on structured sampling standards for pathological speech in China is scarce. The method of the invention has been applied to pathological voice libraries, artificial neural networks, deep learning and similar applications; it is reliable and simple to operate, and may ultimately become the standard in this field.
An evaluation method based on artificial intelligence and big data frees up manpower and builds on the development of intelligent technology; it is a product of the intelligent era and of scientific progress. The invention provides a methodological choice for artificial-intelligence research and the scientific diagnosis of pathological voice.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited thereto; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the scope of the present invention.
Claims (10)
1. A standardized sampling method for extracting MFCC characteristics of pathological speech for artificial intelligence analysis is characterized by comprising the following steps:
collecting voice data: the voice data of 82 Mandarin Chinese syllables are collected in the order of the "Mandarin Chinese Speech Evaluation System Vocabulary";
clipping the collected voice data to complete the editing of the 82 syllables, which are then classified and archived as 28 unit-tone, 23 compound-vowel, 21 consonant and 10 sequential-language samples;
extracting the signals of the 82 clipped syllables, and extracting the MFCC features of each syllable through the specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended-framing processing;
and forming a structured voice library from the processed data, wherein the standardized data of the MFCC voice library are specifically as follows:
after preprocessing, each of the 82 syllable samples has 4 MFCC features, stored in the four groups A, B, C and D as standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively;
structuring the database: the four groups of data A, B, C and D are placed into the vowel-and-tone sub-library and labeled as vowel-and-tone sub-library groups A, B, C and D.
2. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis as claimed in claim 1, wherein said "Mandarin Chinese Speech Evaluation System Vocabulary" comprises 4 main parts arranged in 3 tables, namely a unit-tone-and-tone part, a sequential-language part, a compound-vowel part and a consonant part;
the unit-tone-and-tone part comprises 24 Mandarin Chinese monosyllables formed from initials of the same or equivalent phoneme and a single final in tones 1-4: ba1, ba2, ba3, ba4, bi1, bi2, bi3, bi4, du1, du2, du3, du4, ge1, ge2, ge3, ge4, bo1, bo2, bo3, bo4, yu1, yu2, yu3 and yu4;
the sequential-language part comprises the Mandarin Chinese number words 1-10, each composed of an initial and a final: yi1 (1), er4 (2), san1 (3), si4 (4), wu3 (5), liu4 (6), qi1 (7), ba1 (8), jiu3 (9) and shi2 (10);
the compound-vowel part comprises 23 Mandarin Chinese monosyllables composed of initials of the same or equivalent phoneme and a compound final in tone 1: bai1 (snap), xia1 (shrimp), bao1 (bag), gua1 (melon), diu1 (lose), gui1 (turtle), bei1 (cup), bie1 (hold back), biao1 (label), bian1 (edge), ban1 (class), bin1 (guest), ben1 (rush), bang1 (shoe upper), bing1 (ice), beng1 (collapse), guo1 (pot), guang1 (light), guan1 (gate), gou1 (ditch), guai1 (well-behaved), xue1 (boot) and xiong1 (elder brother);
the consonant part comprises 21 Mandarin Chinese monosyllables composed of the 21 initials and the single final a or i in tone 1: ba1 (eight), pa1 (lie prone), da1 (lap), ta1 (he), ga1, ka1 (coffee), ji1 (machine), qi1 (seven), zhi1 (know), chi1 (eat), zi1 (resources), ci1 (defect), fa1 (hair), ha1 (ha), xi1 (west), shi1 (teacher), si1 (think), ri4 (day), ma1 (mother), na1 (that) and la1 (pull).
3. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis as claimed in claim 1, wherein when voice data are collected, the subject's lips are kept 9-11 cm from the recorder, the speech rate is natural and steady, the volume is moderate, and the word list is recorded 2 times;
the pre-emphasis is specifically:
processing the speech signal through a high-pass filter of the form
H(z) = 1 - μz^(-1)
where μ takes a value between 0.9 and 1.0.
4. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 1, wherein said framing is specifically:
each frame spans 20-30 ms of signal; the overlap between two adjacent frames, i.e. the frame shift, is set to 10-15 ms; the sampling rate of the voice samples is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512.
5. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 1, wherein said windowing is specifically:
multiplying each frame by a Hamming window after framing to increase the continuity of the left and right ends of the frame; let the framed signal be S(n), n = 0, 1, ..., N-1, where N is the frame length; the signal value after multiplying S(n) by the Hamming window is x(n),
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N-1
in the above formula, the value of a is 0.46.
6. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 1, wherein said fast Fourier transform is specifically:
performing fast Fourier transform on each framed and windowed frame signal to obtain the frequency spectrum of each frame, and taking the squared magnitude of the spectrum of the voice signal to obtain its power spectrum; the discrete Fourier transform of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πkn/N), 0 ≤ k ≤ N-1
in the above formula, x(n) is the input voice signal, and N represents the number of points of the Fourier transform.
7. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 6, wherein the processing of said triangular band-pass filter is specifically:
let X_a(k) pass through a set of 24 triangular filters with center frequencies designated as f(m), m = 1, 2, ..., 24, where the spacing between adjacent f(m) narrows as m decreases and widens as m increases; the frequency response of the m-th triangular band-pass filter is:
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).
8. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis as claimed in claim 7, wherein the logarithmic energy output by each filter bank is obtained by substituting H_m(k) into the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|² H_m(k) ), 0 ≤ m ≤ M.
9. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis as claimed in claim 8, wherein said discrete cosine transform is specifically:
substituting s(m) into the following equation:
C(n) = Σ_{m=1}^{M} s(m) cos(πn(m - 0.5)/M), n = 1, 2, ..., L
in the above formula, L refers to the MFCC coefficient order, taken as 12-16; M is the number of triangular filters; C(n) is the MFCC value of each frame; the 13th-order and 19th-order coefficients are concatenated frame by frame to obtain 2 groups of MFCC values for storage, namely group A and group B.
10. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 9, wherein said extended framing is specifically:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 are added to group A and group B as extra subframes, yielding 2 further groups of MFCCs for storage, namely group C and group D.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010462384.4A CN111599347B (en) | 2020-05-27 | 2020-05-27 | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis
Publications (2)
Publication Number | Publication Date |
---|---|
CN111599347A true CN111599347A (en) | 2020-08-28 |
CN111599347B CN111599347B (en) | 2024-04-16 |
Family
ID=72192364
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010462384.4A Active CN111599347B (en) | 2020-05-27 | 2020-05-27 | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111599347B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112382293A (en) * | 2020-11-11 | 2021-02-19 | 广东电网有限责任公司 | Intelligent voice interaction method and system for power Internet of things |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1211026A (en) * | 1997-09-05 | 1999-03-17 | 中国科学院声学研究所 | Continuous voice identification technology for Chinese putonghua large vocabulary |
WO2001039179A1 (en) * | 1999-11-23 | 2001-05-31 | Infotalk Corporation Limited | System and method for speech recognition using tonal modeling |
CN1412741A (en) * | 2002-12-13 | 2003-04-23 | 郑方 | Chinese speech identification method with dialect background |
CN101436403A (en) * | 2007-11-16 | 2009-05-20 | 创新未来科技有限公司 | Method and system for recognizing tone |
CN103310273A (en) * | 2013-06-26 | 2013-09-18 | 南京邮电大学 | Method for articulating Chinese vowels with tones and based on DIVA model |
CN103366735A (en) * | 2012-03-29 | 2013-10-23 | 北京中传天籁数字技术有限公司 | A voice data mapping method and apparatus |
CN104123934A (en) * | 2014-07-23 | 2014-10-29 | 泰亿格电子(上海)有限公司 | Speech composition recognition method and system |
CN105788608A (en) * | 2016-03-03 | 2016-07-20 | 渤海大学 | Chinese initial consonant and compound vowel visualization method based on neural network |
CN110570842A (en) * | 2019-10-25 | 2019-12-13 | 南京云白信息科技有限公司 | Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree |
CN110600055A (en) * | 2019-08-15 | 2019-12-20 | 杭州电子科技大学 | Singing voice separation method using melody extraction and voice synthesis technology |
CN110808072A (en) * | 2019-11-08 | 2020-02-18 | 广州科慧健远医疗科技有限公司 | Method for evaluating dysarthria of children based on optimized acoustic parameters of data mining technology |
CN110827980A (en) * | 2019-11-08 | 2020-02-21 | 广州科慧健远医疗科技有限公司 | Dysarthria grading evaluation method based on acoustic indexes |
CN111028863A (en) * | 2019-12-20 | 2020-04-17 | 广州科慧健远医疗科技有限公司 | Method for diagnosing dysarthria tone error after stroke based on neural network and diagnosis device thereof |
Also Published As
Publication number | Publication date |
---|---|
CN111599347B (en) | 2024-04-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |