CN111599347B - Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis - Google Patents
Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis
- Publication number: CN111599347B (application CN202010462384.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- mfcc
- vowel
- framing
- data
- Prior art date
- Legal status: Active
Classifications
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L25/24—Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
- A61B5/4803—Speech analysis specially adapted for diagnostic purposes
- A61B5/725—Details of waveform analysis using specific filters therefor, e.g. Kalman or adaptive filters
- A61B5/7257—Details of waveform analysis characterised by using transforms using Fourier transforms
Abstract
The invention discloses a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, which comprises the following steps: collecting voice data for the 82 Mandarin Chinese syllables in the order of the word list of the Mandarin Chinese speech evaluation system; clipping the collected voice data to complete the editing of the 82 syllables, then classifying and archiving them; extracting, from the clipped signals of the 82 syllables, the MFCC features of each syllable through the specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended framing; and building the processed data into an MFCC voice library. By a standardized workflow, the invention extracts the specific MFCC features of each syllable and constructs a digitized, standardized and structured voice feature database that can serve various applications of pathological voice big data and artificial intelligence analysis, improving the objectivity and efficiency of pathological voice research and application.
Description
Technical Field
The invention belongs to the technical field of intelligent recognition, and particularly relates to a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis.
Background
The number of people with language disorders in China increases year by year, and the communication disorders caused by dysarthria in particular seriously hinder patients from returning to society. Although the population of dysarthria patients in China is large, research by Lin Jiang and Lu Jianliang in 2016 showed that current evaluation methods cannot meet therapists' requirements for precise speech rehabilitation. Domestic rehabilitation departments and speech rehabilitation institutions rely mainly on subjective auditory evaluation and/or scales requiring subjective judgment, which lack objectivity and efficiency. In addition, the number of speech therapists in China is seriously insufficient; most did not graduate from a relevant specialty, and their diagnosis and evaluation abilities are weak.
In recent years, application research based on the rapid development of artificial intelligence technology, such as artificial neural networks (ANN) and deep learning (DL), has achieved results in normal voice analysis and recognition, language education, intelligent voice guidance and other areas. The medical section of the State Council's New Generation Artificial Intelligence Development Plan calls for accelerating innovative applications of artificial intelligence; studying the characteristics and regularities of acoustic parameters in dysarthria and diagnosing and classifying the various dysarthrias with artificial neural networks would improve the objectivity and efficiency of pathological voice evaluation and free up manpower. To perform big data and artificial intelligence analysis on pathological speech, there must be digitized, standardized and structured data sets. At present there is no unified method or standard for pathological voice big data analysis at home or abroad, and a unified and efficient pathological voice feature acquisition method is urgently needed.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, which serves various applications of pathological voice big data and artificial intelligence analysis and improves the objectivity and efficiency of pathological voice research and application.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, comprising the steps of:
collecting voice data: the voice data of the 82 Mandarin Chinese syllables are collected in the order of the word list of the Mandarin Chinese speech evaluation system;
clipping the collected voice data to complete the editing of the 82 syllables, then classifying and archiving them, wherein there are 28 single vowels, 23 compound vowels, 21 consonants and 10 sequence-language words;
extracting, from the clipped signals of the 82 syllables, the MFCC features of each syllable through the specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended framing;
forming the processed data into a structured voice library, the standardized data of the MFCC voice library being specifically as follows:
after preprocessing, the voice sample of each of the 82 syllable samples has 4 MFCC feature sets, stored in groups A, B, C and D, which are respectively the standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames;
structured database: the four groups of data A, B, C and D are entered into the vowel-and-tone sub-library and labeled, by column, group A, group B, group C and group D of the vowel-and-tone sub-library.
As a preferable technical scheme, the "Mandarin Chinese speech evaluation system vocabulary" comprises 3 sub-tables covering 4 main parts, namely a single-vowel-and-tone part, a sequence-language part, a compound-vowel part and a consonant part;
the single-vowel-and-tone part consists of 24 monosyllabic Mandarin words formed from the 4 tones of initial consonants and single vowels of the same or equivalent phonemes: ba1-ba4, bi1-bi4, du1-du4, ge1-ge4, bo1-bo4 and yu1-yu4;
the sequence-language part consists of initials and finals forming the Mandarin words for the numbers 1-10: yi1, er4, san1, si4, wu3, liu4, qi1, ba1, jiu3 and shi2;
the compound-vowel part comprises 23 monosyllabic first-tone Mandarin words with the same or equivalent compound vowels: bai1, xia1, bao1, gua1, diou1, guei1, bei1, bie1, biao1, bian1, ban1, bin1, ben1, bang1, bing1, beng1, guo1, guang1, guan1, gou1, guai1, xue1 and xiong1;
the consonant part comprises 21 monosyllabic Mandarin words formed from the 21 initials with the single final a or i: ba1, pa1, da1, ta1, ga1, ka1, ji1, qi1, zhi1, chi1, zi1, ci1, fa1, ha1, xi1, shi1, si1, ri4, ma1, na1 and la1.
As a preferable technical scheme, when voice data are collected, the distance between the subject's lips and the recorder is 9-11 cm, the speech rate is natural and steady, the volume is moderate, and each word is recorded twice.
As a preferable technical solution, the pre-emphasis specifically comprises:
passing the voice signal through the following high-pass filter:

H(z) = 1 - μz^(-1)

where μ takes a value between 0.9 and 1.0.
As a preferable technical scheme, the framing specifically comprises:
each frame covers 20-30 ms, and the overlap between two adjacent frames is set by a 10-15 ms frame shift; the sampling rate of the voice samples is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512.
As a preferable technical scheme, the windowing specifically comprises:
after framing, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends; let the framed signal be S(n), n = 0, 1, ..., N - 1, where N is the frame length; the windowed signal is x(n):

x(n) = S(n) × W(n)

where W(n) is the Hamming window:

W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1

in which a takes the value 0.46.
As a preferable technical solution, the fast Fourier transform specifically comprises:
applying a fast Fourier transform to each framed and windowed signal to obtain its spectrum, and modulus-squaring the spectrum of the voice signal to obtain its power spectrum; the discrete Fourier transform of the voice signal is:

X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k < N

where x(n) is the input voice signal and N is the number of Fourier transform points.
As a preferable technical solution, the triangular band-pass filtering specifically comprises:
passing X_a(k) through a set of 24 triangular filters with center frequencies f(m), m = 1, 2, ..., 24; the spacing between the f(m) decreases as m decreases and widens as m increases; the frequency response of the triangular band-pass filter is:

H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).

As a preferred technical solution, the logarithmic energy of each filter bank output, computed with H_m(k), is given by the following formula:

s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 1 ≤ m ≤ M.
as a preferable technical solution, the discrete cosine transform specifically includes:
substituting S (m) into the following formula:
the L refers to the coefficient order of the MFCC, and 12-16 is taken; m is the number of triangular filters; c (n) is the MFCC value for each frame; and (3) connecting the 13 th order and the 19 th order frames to obtain 2 groups of MFCC values, namely, group A and group B.
As a preferable technical solution, the extended framing specifically comprises:
adding the formant peaks F1 and F2 and the midpoint value of F0 each as a frame to groups A and B, obtaining 2 further groups of MFCCs for storage, namely group C and group D.
Compared with the prior art, the invention has the following advantages and beneficial effects:
Based on the inventors' research on artificial intelligence recognition of pathological voice, the Chinese Mandarin dysarthria evaluation vocabulary (hereinafter "the vocabulary") was designed. For the 82 Chinese syllables in the vocabulary, the specific MFCC features of each syllable are extracted by a standardized workflow and a digitized, standardized and structured voice database is constructed. The invention can serve various applications of pathological voice big data and artificial intelligence analysis, improving the objectivity and efficiency of pathological voice research and application.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a schematic diagram of the Mel-frequency filter bank of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
In speech recognition and voiceprint recognition, the most commonly used speech feature is the Mel-frequency cepstral coefficient (MFCC). The human ear has different auditory sensitivities to sound waves of different frequencies; speech signals from 200 Hz to 5000 Hz have the greatest effect on speech intelligibility, and the critical bandwidth for sound masking is smaller in the low-frequency domain than in the high-frequency domain. Accordingly, 28 band-pass filters are arranged from dense to sparse, following the critical bandwidth, from low frequency to high frequency to filter the input signal. The energy of the signal output by each band-pass filter is taken as a basic feature of the signal; this acoustic feature based on the characteristics of the human ear is the MFCC. The shape of the human vocal tract appears in the envelope of the short-time power spectrum, and the MFCC accurately represents this envelope; that is, the acoustic feature reflects structural changes of the vocal tract and indirectly reflects pathophysiological changes. In addition, the feature does not depend on the nature of the signal, makes no assumptions or restrictions on the input signal, and incorporates the research results of auditory models, so it is widely used in digital speech recognition. Such parameters are therefore more robust than the vocal-tract-model-based LPCC, fit the auditory properties of the human ear better, and retain good recognition performance when the signal-to-noise ratio drops. In summary, the MFCC is suitable as a digitized speech input feature for big data research and artificial intelligence analysis.
The main technical scheme of the invention is as follows: voice data for the 82 Mandarin Chinese syllables are collected in the order of the word list of the Mandarin Chinese speech evaluation system (shown in Tables 1-3); preprocessing is performed according to the method specified in this patent to complete the clipping of the 82 syllables; the MFCC features of each syllable are extracted; and after preprocessing the syllables enter the structured voice library of single vowels, compound vowels, consonants, sequence language and tones.
Further, the "Mandarin Chinese speech evaluation system vocabulary" comprises 3 sub-tables covering 4 main parts, namely the single-vowel-and-tone part, the sequence-language part, the compound-vowel part and the consonant part:
the single-vowel-and-tone part consists of 24 monosyllabic Mandarin words formed from the 4 tones of initial consonants and single vowels of the same or equivalent phonemes: eight (ba1), pull (ba2), handle (ba3), father (ba4), force (bi1), nose (bi2), pen (bi3), must (bi4), all (du1), read (du2), bet (du3), du (du4), go (ge1), ge (ge2), kudzu (ge3), individual (ge4), wave (bo1), neck (bo2), lame (bo3), dustpan (bo4), silt (yu1), fish (yu2), rain (yu3) and jade (yu4);
the sequence-language part consists of initials and finals forming the Mandarin words for the numbers 1-10: 1 (yi1), 2 (er4), 3 (san1), 4 (si4), 5 (wu3), 6 (liu4), 7 (qi1), 8 (ba1), 9 (jiu3) and 10 (shi2);
the compound-vowel part comprises 23 monosyllabic first-tone Mandarin words formed from the initial consonants and compound vowels of the same or equivalent phonemes: break off (bai1), shrimp (xia1), bag (bao1), melon (gua1), lose (diou1), tortoise (guei1), cup (bei1), stifle (bie1), mark (biao1), edge (bian1), ban (ban1), guest (bin1), run (ben1), help (bang1), ice (bing1), collapse (beng1), pot (guo1), light (guang1), close (guan1), ditch (gou1), guai (guai1), boot (xue1) and brother (xiong1);
the consonant part comprises 21 monosyllabic first-tone Mandarin words formed from the 21 initials with the single final a or i: eight (ba1), lie prone (pa1), da (da1), he (ta1), ga (ga1), ka (ka1), machine (ji1), seven (qi1), know (zhi1), eat (chi1), asset (zi1), flaw (ci1), send (fa1), ha (ha1), xi (xi1), teacher (shi1), think (si1), day (ri4), mother (ma1), na (na1) and pull (la1).
Table 1 of the Mandarin Chinese speech evaluation system vocabulary (single vowels, tones, sequence language)
Table 2 of the Mandarin Chinese speech evaluation system vocabulary (compound vowels)
Table 3 of the Mandarin Chinese speech evaluation system vocabulary (consonants)
| No. | Consonant type | Word (pinyin) | Initial | Final | Tone |
|----|----|----|----|----|----|
| 1 | Unaspirated stop | eight (ba) | b | a | 1 |
| 2 | Aspirated stop | lie prone (pa) | p | a | 1 |
| 3 | Unaspirated stop | da | d | a | 1 |
| 4 | Aspirated stop | he (ta) | t | a | 1 |
| 5 | Unaspirated stop | ga | g | a | 1 |
| 6 | Aspirated stop | ka | k | a | 1 |
| 7 | Unaspirated affricate | machine (ji) | j | i | 1 |
| 8 | Aspirated affricate | seven (qi) | q | i | 1 |
| 9 | Unaspirated affricate | know (zhi) | zh | i | 1 |
| 10 | Aspirated affricate | eat (chi) | ch | i | 1 |
| 11 | Unaspirated affricate | asset (zi) | z | i | 1 |
| 12 | Aspirated affricate | flaw (ci) | c | i | 1 |
| 13 | Fricative | send (fa) | f | a | 1 |
| 14 | Fricative | ha | h | a | 1 |
| 15 | Fricative | xi | x | i | 1 |
| 16 | Fricative | teacher (shi) | sh | i | 1 |
| 17 | Fricative | think (si) | s | i | 1 |
| 18 | Voiced fricative | day (ri) | r | i | 4 |
| 19 | Nasal | mother (ma) | m | a | 1 |
| 20 | Nasal | na | n | a | 1 |
| 21 | Lateral | pull (la) | l | a | 1 |
To further explain the technical solution of the invention, the monosyllabic word "ba" (eight) is taken as an example below:
before the standardized sampling of this embodiment is performed, a selection of recording environment is required.
Optionally, the recording environment of this embodiment selects: the method is most carried out in a voice laboratory equipped with a sound insulation door and sound absorption rock wool, and the sound insulation degree is 45dB.
Optionally, the recording apparatus and parameters of this example are selected by: a Sony Zoom H4N recording pen is selected, the sampling rate of 44.1kHz and the tone quality of 16 bits are used for storage, and the recorded sound is copied to a computer hard disk.
As shown in fig. 1, a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to this embodiment includes the following steps:
s1, collecting voice data, and collecting voice data of 82 Mandarin syllables according to the sequence of a word list of a Mandarin voice evaluation system; the method comprises the following steps:
referring to 82 Chinese vocabularies of the vocabulary of the Chinese Mandarin language evaluation system (table 1), 82 Chinese Mandarin syllables of the voice data are collected, and the test subject is recorded. When recording, the user takes the sitting position, the pen holds the recorder, the lip of the user is about 10cm away from the recorder, and when the user sees the ' bar ' character on the screen, the user reads the ' bar (/ b ā /) at natural and steady speech speed and moderate volume, and records the sound repeatedly for 2 times. The waveform fluctuation range recorded by the recording pen is required to be in the range of 1/3-2/3 of the screen.
S2, clipping the collected voice data, specifically as follows:
Using Cool Edit Pro 2.1, the target sound /bā/ of the first recording is cut out of each subject's voice file. If the first recording contains noise or interference, if its waveform fluctuation exceeds the 1/3-2/3 window range, or if the waveform indicates insufficient energy, the data of the second recording are used instead. The valid preprocessed samples are then classified and archived into the single-vowel group.
S3, feature extraction from the clipped signals: based on the clipped samples, the MFCC features of the digital voice signal of the syllable /bā/ are extracted through pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, extended framing and related processes; the specific preprocessing steps are as follows:
S31, pre-emphasis:
The voice signal is passed through the following high-pass filter:

H(z) = 1 - μz^(-1)

where μ takes the value 0.97.
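The pre-emphasis above is a one-line difference equation, y(n) = x(n) - μ·x(n-1). A minimal sketch (illustrative Python assuming numpy; the function name is not from the patent):

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """Apply H(z) = 1 - mu*z^-1, i.e. y[n] = x[n] - mu*x[n-1];
    the first sample is passed through unchanged."""
    return np.append(x[0], x[1:] - mu * x[:-1])
```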
S32, framing:
25 ms is taken as one frame, and adjacent frames are advanced by a 10 ms frame shift, so that consecutive frames overlap. The sampling rate of the voice sample /bā/ is 16 kHz, so the frame length N is 400 samples. In this embodiment 13 and 19 frames are taken; recordings that fall short are zero-padded.
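With the values above (16 kHz, 25 ms frames, 10 ms shift), framing can be sketched as follows (illustrative; the function name and zero-padding placement are assumptions consistent with, but not dictated by, the text):

```python
import numpy as np

def frame_signal(x, frame_len=400, frame_shift=160):
    """Split a 16 kHz signal into 25 ms frames (400 samples), advancing by a
    10 ms frame shift (160 samples); the tail is zero-padded to a full frame."""
    n_frames = 1 + int(np.ceil(max(len(x) - frame_len, 0) / frame_shift))
    pad = (n_frames - 1) * frame_shift + frame_len - len(x)
    padded = np.append(x, np.zeros(pad))
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return padded[idx]
```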
S33, windowing:
After the framing of step S32, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends. Let the framed signal be S(n), n = 0, 1, ..., N - 1, where N is the frame length; the windowed signal is x(n):

x(n) = S(n) × W(n)

where W(n) is the Hamming window:

W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1

with a = 0.46.
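The window above can be generated and applied as follows (illustrative sketch assuming numpy; for N = 400 it coincides with numpy's built-in Hamming window):

```python
import numpy as np

N = 400                      # frame length: 25 ms at 16 kHz
a = 0.46
n = np.arange(N)
W = (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))   # Hamming window W(n)

def window_frame(frame):
    """Multiply one frame S(n) by the Hamming window: x(n) = S(n) * W(n)."""
    return frame * W
```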
S34, fast Fourier transform:
A fast Fourier transform is applied to each framed and windowed signal to obtain its spectrum, and the spectrum of the voice signal is modulus-squared to obtain its power spectrum. The discrete Fourier transform (DFT) of the voice signal is:

X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k < N

where x(n) is the input voice signal and N is the number of Fourier transform points.
S35, triangular band-pass filtering:
X_a(k) is passed through a set of 24 triangular filters with center frequencies f(m), m = 1, 2, ..., 24. The spacing between the f(m) decreases as m decreases and widens as m increases, as shown in Fig. 2.
The triangular band-pass filters, first, smooth the spectrum and eliminate harmonics, so that the result is not affected by the tone or pitch of a stretch of speech; second, they reduce the subsequent amount of computation. The frequency response is:

H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).
S36, logarithmic operation: the logarithmic energy of each filter bank output, computed with H_m(k), is given by the following formula:

s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 1 ≤ m ≤ M.
S37, discrete cosine transform (DCT): s(m) is substituted into the following formula:

C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5) / M ), n = 1, 2, ..., L

where L is the order of the MFCC coefficients, specified here as 13 and 19; M is the number of triangular filters, specified as 24; and C(n) is the MFCC value of each frame. Connecting the 13th-order and 19th-order framings yields 2 groups of MFCC values, which enter group A and group B.
S38, extended framing:
The formant peaks F1 and F2 of /bā/ and the midpoint value of F0 are each added as a frame to groups A and B respectively, and the 2 further MFCC groups obtained are placed into group C and group D.
S4, constructing the processed data into an MFCC voice library:
Standardized data for the monosyllable "ba" in the MFCC speech library: after the preprocessing of [0033] and [0034], each of the 82 syllable samples, such as /bā/, has 4 MFCC features, stored in groups A, B, C and D; these are the standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively.
Structured data: the 4 groups of data are put into the vowel and tone sub-library and labeled group A, group B, group C and group D of the vowel and tone sub-library.
The other 81 syllables are processed in the same way as the monosyllable "ba" and are not described again here.
The invention proposes a novel standardized sampling method for pathological voice based on MFCC features. Unlike conventional voice recording and sampling methods, it formulates a word list of 82 Chinese Mandarin syllables based on the inventor's prior research results, adopts a standardized and structured data sampling method, and processes each syllable into 4 different data sets based on the acoustic index MFCC. It can be conveniently used for constructing pathological voice libraries, big-data voice analysis, and artificial intelligence computation.
The method has proven practical in applications such as pathological voice libraries, artificial neural networks and deep learning; it is reliable and simple to operate, and may eventually become a standard in this field.
An evaluation method based on artificial intelligence and big data frees up manpower and is a product of the intelligent age and of scientific progress. The invention provides a method for artificial intelligence research and the scientific diagnosis of pathological voice.
The above examples are preferred embodiments of the present invention, but the embodiments are not limited to them; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.
Claims (1)
1. A standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, comprising the steps of:
collecting voice data: collecting the voice data of the 82 Chinese Mandarin syllables in the order of the word list of the Chinese Mandarin speech evaluation system; the word list of the Chinese Mandarin speech evaluation system comprises 3 sub-tables and 4 main parts, namely a single-vowel tone part, a sequence speech part, a compound vowel part and a consonant part;
the single-vowel tone part consists of 24 monosyllabic Chinese Mandarin words formed from initials and single finals of the same or similar phonemes in tones 1-4, including: eighth, pull, handle, father, force, nose, pen, must, all, read, bet, duu, go, bay, kudzuvine, personal, wave, neck, claudication, dustpan, silted, fish, rain and jade;
the sequence speech part consists of the Chinese Mandarin words for the numbers 1-10, composed of initials and finals, including: 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10;
the compound vowel part consists of 23 monosyllabic Chinese Mandarin words in tone 1 with the same or similar compound vowels, including: breaking, shrimp, bag, melon, loss, tortoise, cup, suffocation, mark, edge, ban, guest, running, upper, ice, collapse, pot, light, closing, ditch, lambkin, boot and brothers;
the consonant part consists of 21 monosyllabic Chinese Mandarin words formed from the 21 initials in tone 1 with the single final a or i, including: eight, groveling, lapping, he, ga, ca, machine, seven, know, eat, fund, defect, hair, ha, si, ri, ma, na and la;
when collecting voice data, the distance between the subject's lips and the recorder is 9 cm-11 cm, the speech rate is natural and steady, the volume is moderate, and the word list is recorded 2 times;
extracting the signals of the 82 clipped syllables, and extracting the MFCC features of each syllable through pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and expanded framing;
the pre-emphasis is specifically as follows:
passing the voice signal through a high-pass filter:
H(z) = 1 - μz^(-1)   (1)
wherein μ takes a value of 0.9 to 1.0;
editing the collected voice data to complete the clipping of the 82 syllables, then classifying and archiving them: 28 single vowels, 23 compound vowels, 21 consonants and 10 sequence words;
the framing is specifically as follows:
each frame is 20-30 ms long, and the overlap between two adjacent frames, i.e. the frame shift, is set to 10-15 ms; the sampling rate of the voice sample is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512;
the windowing specifically comprises the following steps:
after framing, each frame is multiplied by a Hamming window to increase the continuity at the left and right ends of the frame; assuming the framed signal is S(n), n = 0, 1, ..., N - 1, the signal after multiplying S(n) by the Hamming window is x(n):
x(n)=S(n) ×W(n) (2)
in formula (2), W(n) is a Hamming window:
W(n) = (1 - a) - a·cos(2πn/(N - 1))   (3)
in formula (3), a takes the value 0.46 and n takes the values 0, 1, ..., N - 1;
The fast fourier transform is specifically:
performing a fast Fourier transform on each framed and windowed frame to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal; the discrete Fourier transform of the voice signal is:
Xa(k) = Σ x(n)·e^(-j2πnk/N), n = 0, 1, ..., N - 1; 0 ≤ k ≤ N - 1   (4)
in formula (4), x(n) is the input voice signal, and N represents the number of points of the Fourier transform;
the processing of the triangular band-pass filter specifically comprises the following steps:
Xa(k) is input to a bank of 24 triangular filters whose center frequencies are denoted f(m), m = 1, 2, ..., 24, where f(m) is the center frequency of the m-th triangular filter; the spacing between adjacent f(m) narrows as m decreases and widens as m increases; the frequency response of the triangular band-pass filter Hm(k) is:
Hm(k) = 0, for k < f(m - 1)
Hm(k) = (k - f(m - 1)) / (f(m) - f(m - 1)), for f(m - 1) ≤ k ≤ f(m)
Hm(k) = (f(m + 1) - k) / (f(m + 1) - f(m)), for f(m) ≤ k ≤ f(m + 1)
Hm(k) = 0, for k > f(m + 1)   (5)
the processed data form a structured voice library, and the standardized data of the MFCC voice library are specifically as follows:
after preprocessing, each of the 82 syllable samples has 4 MFCC features, stored in groups A, B, C and D, which are the standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively;
structured database: the four groups of data A, B, C and D are put into the vowel and tone sub-library and labeled group A, group B, group C and group D of the vowel and tone sub-library;
the logarithmic operation: the output of each filter bank Hm(k) applied to the power spectrum is substituted into the following formula:
s(m) = ln( Σk |Xa(k)|²·Hm(k) ), m = 1, 2, ..., M   (6)
the discrete cosine transform is specifically:
substituting s(m) into the following formula:
C(n) = Σm s(m)·cos(πn(m - 0.5)/M), n = 1, 2, ..., L   (7)
in formula (7), L is the MFCC coefficient order, taken as 12-16; M is the number of triangular filters; C(n) is the MFCC value of each frame; connecting the 13th-order and the 19th-order frames yields 2 groups of MFCC values for storage, namely group A and group B;
the extended framing is specifically:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 are each added as one extra frame to group A and group B respectively, obtaining 2 further groups of MFCCs for storage, namely group C and group D.
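For reference, the whole claimed pipeline from raw signal to per-frame MFCCs can be assembled as the following sketch; the frame length (400 samples = 25 ms at 16 kHz), frame shift (160 samples = 10 ms), μ = 0.97 and mel-scale filter spacing are illustrative choices within or alongside the ranges stated above:

```python
import numpy as np

def extract_mfcc(signal, fs=16000, mu=0.97, frame_len=400, frame_shift=160,
                 nfft=512, num_filters=24, order=13):
    # Pre-emphasis: H(z) = 1 - mu*z^(-1), i.e. y(n) = x(n) - mu*x(n-1)
    y = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # Framing and Hamming windowing (formula (3) with a = 0.46)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    num_frames = 1 + (len(y) - frame_len) // frame_shift
    frames = np.stack([y[i * frame_shift:i * frame_shift + frame_len] * window
                       for i in range(num_frames)])
    # FFT and squared modulus -> power spectrum |Xa(k)|^2 (formula (4))
    pspec = np.abs(np.fft.rfft(frames, n=nfft)) ** 2
    # Triangular filterbank (formula (5)); mel-scale center spacing assumed
    mel = lambda f: 2595.0 * np.log10(1 + f / 700.0)
    imel = lambda m: 700.0 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(0.0, mel(fs / 2), num_filters + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    H = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        H[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)
    # Logarithm (formula (6)) and DCT (formula (7))
    s = np.log(pspec @ H.T + 1e-10)
    n_idx = np.arange(1, order + 1)[:, None]
    m_idx = np.arange(1, num_filters + 1)[None, :]
    dct = np.cos(np.pi * n_idx * (m_idx - 0.5) / num_filters)
    return s @ dct.T  # shape: (num_frames, order)
```

Running it with order=13 and order=19 produces the two per-syllable feature matrices from which groups A-D are built.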
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010462384.4A CN111599347B (en) | 2020-05-27 | 2020-05-27 | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) characteristics for artificial intelligent analysis
Publications (2)
Publication Number | Publication Date |
---|---|
CN111599347A CN111599347A (en) | 2020-08-28 |
CN111599347B true CN111599347B (en) | 2024-04-16 |
Family
ID=72192364
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010462384.4A Active CN111599347B (en) | 2020-05-27 | 2020-05-27 | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) characteristics for artificial intelligent analysis
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111599347B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112382293A (en) * | 2020-11-11 | 2021-02-19 | 广东电网有限责任公司 | Intelligent voice interaction method and system for power Internet of things |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1211026A (en) * | 1997-09-05 | 1999-03-17 | 中国科学院声学研究所 | Continuous voice identification technology for Chinese putonghua large vocabulary |
WO2001039179A1 (en) * | 1999-11-23 | 2001-05-31 | Infotalk Corporation Limited | System and method for speech recognition using tonal modeling |
CN1412741A (en) * | 2002-12-13 | 2003-04-23 | 郑方 | Chinese speech identification method with dialect background |
CN101436403A (en) * | 2007-11-16 | 2009-05-20 | 创新未来科技有限公司 | Method and system for recognizing tone |
CN103310273A (en) * | 2013-06-26 | 2013-09-18 | 南京邮电大学 | Method for articulating Chinese vowels with tones and based on DIVA model |
CN103366735A (en) * | 2012-03-29 | 2013-10-23 | 北京中传天籁数字技术有限公司 | A voice data mapping method and apparatus |
CN104123934A (en) * | 2014-07-23 | 2014-10-29 | 泰亿格电子(上海)有限公司 | Speech composition recognition method and system |
CN105788608A (en) * | 2016-03-03 | 2016-07-20 | 渤海大学 | Chinese initial consonant and compound vowel visualization method based on neural network |
CN110570842A (en) * | 2019-10-25 | 2019-12-13 | 南京云白信息科技有限公司 | Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree |
CN110600055A (en) * | 2019-08-15 | 2019-12-20 | 杭州电子科技大学 | Singing voice separation method using melody extraction and voice synthesis technology |
CN110808072A (en) * | 2019-11-08 | 2020-02-18 | 广州科慧健远医疗科技有限公司 | Method for evaluating dysarthria of children based on optimized acoustic parameters of data mining technology |
CN110827980A (en) * | 2019-11-08 | 2020-02-21 | 广州科慧健远医疗科技有限公司 | Dysarthria grading evaluation method based on acoustic indexes |
CN111028863A (en) * | 2019-12-20 | 2020-04-17 | 广州科慧健远医疗科技有限公司 | Method for diagnosing dysarthria tone error after stroke based on neural network and diagnosis device thereof |
Also Published As
Publication number | Publication date |
---|---|
CN111599347A (en) | 2020-08-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108922541B (en) | Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models | |
Chi et al. | Subglottal coupling and its influence on vowel formants | |
CN103280220A (en) | Real-time recognition method for baby cry | |
CN105825852A (en) | Oral English reading test scoring method | |
Pao et al. | Mandarin emotional speech recognition based on SVM and NN | |
CN104050965A (en) | English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof | |
CN101976564A (en) | Method for identifying insect voice | |
CN102655003B (en) | Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient) | |
CN103366735B (en) | The mapping method of speech data and device | |
CN108564956B (en) | Voiceprint recognition method and device, server and storage medium | |
CN112397074A (en) | Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning | |
CN102237083A (en) | Portable interpretation system based on WinCE platform and language recognition method thereof | |
CN111599347B (en) | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) characteristics for artificial intelligent analysis | |
Kharamat et al. | Durian ripeness classification from the knocking sounds using convolutional neural network | |
CN114842878A (en) | Speech emotion recognition method based on neural network | |
Cai et al. | The DKU-JNU-EMA electromagnetic articulography database on Mandarin and Chinese dialects with tandem feature based acoustic-to-articulatory inversion | |
Crichton et al. | Linear prediction model of speech production with applications to deaf speech training | |
Chamoli et al. | Detection of emotion in analysis of speech using linear predictive coding techniques (LPC) | |
Watt | Research methods in speech acoustics | |
Malécot | New procedures for descriptive phonetics | |
CN112599119B (en) | Method for establishing and analyzing mobility dysarthria voice library in big data background | |
Kumar et al. | Text dependent speaker identification in noisy environment | |
Khulage et al. | Analysis of speech under stress using linear techniques and non-linear techniques for emotion recognition system | |
Regel | A module for acoustic-phonetic transcription of fluently spoken German speech | |
Prasangini et al. | Sinhala speech to sinhala unicode text conversion for disaster relief facilitation in sri lanka |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||