CN111599347B - Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis - Google Patents

Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis Download PDF

Info

Publication number
CN111599347B
CN111599347B (application CN202010462384.4A)
Authority
CN
China
Prior art keywords
voice
mfcc
vowel
framing
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010462384.4A
Other languages
Chinese (zh)
Other versions
CN111599347A (en
Inventor
牟志伟
江晨银
柯慧明
潘正祥
温晓宇
陈亮
朱凌燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kehui Jianyuan Medical Technology Co ltd
Original Assignee
Guangzhou Kehui Jianyuan Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kehui Jianyuan Medical Technology Co ltd filed Critical Guangzhou Kehui Jianyuan Medical Technology Co ltd
Priority to CN202010462384.4A priority Critical patent/CN111599347B/en
Publication of CN111599347A publication Critical patent/CN111599347A/en
Application granted granted Critical
Publication of CN111599347B publication Critical patent/CN111599347B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • AHUMAN NECESSITIES
    • A61MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61BDIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00Measuring for diagnostic purposes; Identification of persons
    • A61B5/48Other medical applications
    • A61B5/4803Speech analysis specially adapted for diagnostic purposes
    • AHUMAN NECESSITIES
    • A61MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61BDIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00Measuring for diagnostic purposes; Identification of persons
    • A61B5/72Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235Details of waveform analysis
    • A61B5/725Details of waveform analysis using specific filters therefor, e.g. Kalman or adaptive filters
    • AHUMAN NECESSITIES
    • A61MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61BDIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00Measuring for diagnostic purposes; Identification of persons
    • A61B5/72Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235Details of waveform analysis
    • A61B5/7253Details of waveform analysis characterised by using transforms
    • A61B5/7257Details of waveform analysis characterised by using transforms using Fourier transforms
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Pathology (AREA)
  • Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Veterinary Medicine (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Physiology (AREA)
  • Psychiatry (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, comprising the following steps: collecting voice data for the 82 Mandarin Chinese syllables in the order given by the word list of the Mandarin Chinese speech evaluation system; editing the collected voice data to complete the clipping of the 82 syllables, then classifying and archiving them; extracting signals from the 82 clipped syllables and extracting the MFCC features of each syllable through specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, and extended framing; and building the processed data into an MFCC voice library. Through a standardized workflow, the invention extracts the specific MFCC features of each syllable and constructs a digitized, standardized, structured voice feature database that can serve various applications of pathological voice big data and artificial intelligence analysis, improving the objectivity and efficiency of pathological voice research and application.

Description

Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis
Technical Field
The invention belongs to the technical field of intelligent recognition, and particularly relates to a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis.
Background
The number of people with speech and language disorders in China increases year by year, and communication disorders caused by dysarthria in particular seriously hinder patients from returning to society. Although the population of dysarthria patients in China is large, a 2016 study by Lin Jiang and Lu Jianliang showed that current evaluation methods cannot meet therapists' needs for precise speech rehabilitation. Domestic rehabilitation departments and speech rehabilitation institutions mainly rely on subjective auditory evaluation and/or scales requiring subjective judgment, which lack objectivity and efficiency. In addition, the number of speech therapists in China is severely insufficient; most did not graduate from a professional program, and their diagnostic and evaluation abilities are weak.
In recent years, application research based on rapidly developing artificial intelligence technologies, such as artificial neural networks (Artificial Neural Network, ANN) and deep learning (Deep Learning, DL), has achieved results in normal voice analysis and recognition, language education, intelligent voice guidance, and so on. The medical section of the State Council's New Generation Artificial Intelligence Development Plan calls for accelerating innovative applications of artificial intelligence; studying the characteristics and regularities of acoustic parameters in dysarthria and diagnosing and classifying the various dysarthrias with artificial neural networks would improve the objectivity and efficiency of pathological voice evaluation and free up manpower. To conduct big data and artificial intelligence analysis on pathological speech, there must be digitized, standardized, and structured data sets. At present there is no unified method or standard for pathological voice big data analysis at home or abroad, and a unified, efficient pathological voice feature acquisition method is urgently needed.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provide a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, serving various applications of pathological voice big data and artificial intelligence analysis and improving the objectivity and efficiency of pathological voice research and application.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, comprising the steps of:
collecting voice data for the 82 Mandarin Chinese syllables in the order of the word list of the Mandarin Chinese speech evaluation system;
editing the collected voice data to complete the clipping of the 82 syllables, then classifying and archiving them into 28 single vowels, 23 compound vowels, 21 consonants, and 10 sequence-language syllables;
extracting signals from the 82 clipped syllables and extracting the MFCC features of each syllable through specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, and extended framing;
forming a structured voice library from the processed data, the standardized data of the MFCC voice library being specifically as follows:
after preprocessing, the voice sample of each of the 82 syllables has 4 MFCC features, stored in groups A, B, C, and D, which are the standardized MFCC data of 13 frames, 19 frames, 13+3 frames, and 19+3 frames respectively;
structured database: the four groups of data A, B, C, and D are entered into the vowel-and-tone sub-library, recorded by column as vowel-and-tone sub-library group A, group B, group C, and group D.
As a preferable technical scheme, the word list of the Mandarin Chinese speech evaluation system comprises 3 sub-tables covering 4 main parts, namely a single-vowel-and-tone part, a sequence-language part, a compound-vowel part, and a consonant part;
the single-vowel-and-tone part: 24 monosyllabic Mandarin Chinese words formed from initial consonants of the same or equivalent phonemes and single vowels in tones 1-4, comprising: eight, pull, handle, father, force, nose, pen, must, all, read, bet, du, ge, bay, kudzu, individual, wave, neck, lame, dustpan, silt, fish, rain, and jade;
the sequence-language part: the Mandarin Chinese words for the numbers 1-10, formed from initials and finals, comprising: 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10;
the compound-vowel part: 23 monosyllabic Mandarin Chinese words formed from initial consonants of the same or equivalent phonemes and compound vowels in tone 1, comprising: break off, shrimp, bag, melon, lose, tortoise, cup, suffocate, mark, edge, ban, guest, run, bang, ice, collapse, pot, light, close, ditch, well-behaved, boot, and brother;
the consonant part: 21 monosyllabic Mandarin Chinese words formed from the 21 initials and the single final a or i in tone 1, comprising: eight, grovel, lap, he, ga, ka, machine, seven, know, eat, fund, defect, hair, ha, west, teacher, think, day, mother, that, and pull.
As an optimal technical scheme, during voice data collection the distance between the subject's lips and the recorder is 9 cm-11 cm, the speaking rate is natural and steady, the volume is moderate, and each word in the list is recorded 2 times.
As a preferable technical solution, the pre-emphasis is specifically:
passing the voice signal through the following high-pass filter:
H(z) = 1 - μz^(-1)
where μ takes a value of 0.9 to 1.0.
As a preferable technical scheme, the framing is specifically:
a segment of 20-30 ms is taken as one frame, and the overlap between two adjacent frames is set to 10-15 ms, i.e., the frame shift; the sampling rate of the voice sample is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512.
As a preferable technical scheme, the windowing is specifically:
after framing, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends; assuming the signal after framing is S(n), n = 0, 1, ..., N-1, where N is the frame length, the signal obtained by multiplying S(n) by the Hamming window is x(n):
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
in which a takes the value 0.46.
As a preferable technical solution, the fast Fourier transform is specifically:
performing a fast Fourier transform on each framed, windowed frame to obtain its spectrum value, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal; the discrete Fourier transform of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k ≤ N-1
where x(n) is the input voice signal and N is the number of Fourier transform points.
As a preferable technical solution, the triangular band-pass filtering is specifically:
X_a(k) is passed through a set of 24 triangular filters with center frequencies designated f(m), m = 1, 2, ..., 24, the spacing between the f(m) narrowing as m decreases and widening as m increases; the triangular band-pass filter is formulated as:
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
where f(m) denotes the center frequency of the m-th triangular filter.
As a preferred technical solution, the output of each filter bank H_m(k) is converted to a logarithmic energy by substituting it into the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 0 ≤ m ≤ M.
As a preferable technical solution, the discrete cosine transform is specifically:
substituting s(m) into the following formula:
C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5)/M ), n = 1, 2, ..., L
where L is the coefficient order of the MFCC, taken as 12-16; M is the number of triangular filters; and C(n) is the MFCC value of each frame; connecting the frames of order 13 and order 19 yields 2 groups of MFCC values, group A and group B.
As a preferable technical solution, the extended frame is specifically:
peak F of resonance 1 、F 2 And F 0 And each midpoint value is used as a frame to be added into the A group and the B group, so as to obtain 2 groups of MFCCs for warehousing, namely the C group and the D group.
Compared with the prior art, the invention has the following advantages and beneficial effects:
according to the research of the inventor on the artificial intelligent recognition of pathological voice, the Chinese Mandarin dysarthria evaluation vocabulary (hereinafter referred to as vocabulary) is designed. The Chinese vocabulary with 82 syllables in the vocabulary is extracted by a standardized flow method to extract the specific MFCC characteristics of each syllable, and a digitalized, standardized and structured voice database is constructed. The invention can be used for various applications of pathological voice characteristic big data and artificial intelligence analysis, and improves objectivity and efficiency of pathological voice research and application.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the Mel-frequency filter bank of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
In speech recognition (Speech Recognition) and voiceprint recognition (Voice Print Recognition), the most commonly used speech feature is the Mel-frequency cepstral coefficient (Mel-scale Frequency Cepstral Coefficients, MFCC for short). The human ear has different auditory sensitivities to sound waves of different frequencies, and speech signals from 200 Hz to 5000 Hz have the greatest effect on speech intelligibility. Because the critical bandwidth for sound masking is smaller in the low-frequency domain than in the high-frequency domain, 28 band-pass filters are arranged from dense to sparse, according to critical bandwidth, from low frequency to high frequency to filter the input signal. The energy of the signal output by each band-pass filter is taken as a basic feature of the signal, and this acoustic feature based on the characteristics of the human ear is the MFCC. The shape of the human vocal tract appears in the envelope of the short-time power spectrum, and the MFCC accurately represents this envelope; that is, the acoustic feature reflects structural changes of the vocal tract and thereby indirectly reflects pathophysiological changes. In addition, these features do not depend on the nature of the signal, make no assumptions or restrictions on the input signal, and incorporate the findings of auditory-model research, so they are widely used in digital speech recognition. Such parameters are therefore more robust than LPCC features based on the vocal tract model, better fit the auditory properties of the human ear, and retain good recognition performance when the signal-to-noise ratio falls. In summary, the MFCC is well suited as a digitized speech input feature for big data research and artificial intelligence analysis.
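The dense-to-sparse filter spacing described above follows the mel scale. As a minimal sketch, the Hz-to-mel conversion below is the common 2595·log10(1 + f/700) form, an assumption consistent with fig. 2 rather than a formula stated in this patent:

    import numpy as np

    def hz_to_mel(f_hz: float) -> float:
        # map frequency in Hz onto the perceptual mel axis
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(m: float) -> float:
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Equal steps on the mel axis crowd the low frequencies and spread the high ones:
    edges = [round(mel_to_hz(m)) for m in np.linspace(hz_to_mel(0), hz_to_mel(8000), 6)]
    print(edges)  # [0, 459, 1218, 2475, 4556, 8000] -- spacing widens with frequency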
The main technical scheme of the invention is as follows: voice data for the 82 Mandarin Chinese syllables are collected in the order of the word list of the Mandarin Chinese speech evaluation system (shown in Tables 1-3) and preprocessed according to the method specified in this patent; the clipping of the 82 syllables is completed, the MFCC features of each syllable are extracted, and after preprocessing by this method the syllables enter a structured voice library of single vowels, compound vowels, consonants, sequence language, and tones.
Further, the word list of the Mandarin Chinese speech evaluation system comprises 3 sub-tables covering 4 main parts, namely a single-vowel-and-tone part, a sequence-language part, a compound-vowel part, and a consonant part:
the single-vowel-and-tone part consists of 24 monosyllabic Mandarin words formed from initial consonants of the same or equivalent phonemes and single vowels in tones 1-4: eight (ba1), pull (ba2), handle (ba3), father (ba4), force (bi1), nose (bi2), pen (bi3), must (bi4), all (du1), read (du2), bet (du3), du (du4), ge (ge1), bay (ge2), kudzu (ge3), individual (ge4), wave (bo1), neck (bo2), lame (bo3), dustpan (bo4), silt (yu1), fish (yu2), rain (yu3), jade (yu4);
the sequence-language part consists of the Mandarin words for the numbers 1-10, formed from initials and finals: 1 (yi1), 2 (er4), 3 (san1), 4 (si4), 5 (wu3), 6 (liu4), 7 (qi1), 8 (ba1), 9 (jiu3), 10 (shi2);
the compound-vowel part consists of 23 monosyllabic Mandarin words formed from initial consonants of the same or equivalent phonemes and compound vowels in tone 1: break off (bai1), shrimp (xia1), bag (bao1), melon (gua1), lose (diou1), tortoise (guei1), cup (bei1), suffocate (bie1), mark (biao1), edge (bian1), ban (ban1), guest (bin1), run (ben1), bang (bang1), ice (bing1), collapse (beng1), pot (guo1), light (guang1), close (guan1), ditch (gou1), well-behaved (guai1), boot (xue1), brother (xiong1);
the consonant part consists of 21 monosyllabic Mandarin words formed from the 21 initials and the single final a or i in tone 1 (tone 4 for ri, see Table 3): eight (ba1), grovel (pa1), lap (da1), he (ta1), ga (ga1), ka (ka1), machine (ji1), seven (qi1), know (zhi1), eat (chi1), fund (zi1), defect (ci1), hair (fa1), ha (ha1), west (xi1), teacher (shi1), think (si1), day (ri4), mother (ma1), that (na1), pull (la1).
Word list 1 of the Mandarin Chinese speech evaluation system (single vowels, tones, sequence language)
Word list 2 of the Mandarin Chinese speech evaluation system (compound vowels)
Word list 3 of the Mandarin Chinese speech evaluation system (consonants)
No. | Consonant type | Word | Initial | Final | Tone
1 | unaspirated stop | eight | b | a | 1
2 | aspirated stop | grovel | p | a | 1
3 | unaspirated stop | lap | d | a | 1
4 | aspirated stop | he | t | a | 1
5 | unaspirated stop | ga | g | a | 1
6 | aspirated stop | ka | k | a | 1
7 | unaspirated affricate | machine | j | i | 1
8 | aspirated affricate | seven | q | i | 1
9 | unaspirated affricate | know | zh | i | 1
10 | aspirated affricate | eat | ch | i | 1
11 | unaspirated affricate | fund | z | i | 1
12 | aspirated affricate | defect | c | i | 1
13 | voiceless fricative | hair | f | a | 1
14 | voiceless fricative | ha | h | a | 1
15 | voiceless fricative | west | x | i | 1
16 | voiceless fricative | teacher | sh | i | 1
17 | voiceless fricative | think | s | i | 1
18 | voiced fricative | day | r | i | 4
19 | nasal | mother | m | a | 1
20 | nasal | that | n | a | 1
21 | lateral | pull | l | a | 1
To further explain the technical scheme of the invention, the monosyllabic word "ba" is taken as an example below:
before the standardized sampling of this embodiment is performed, a selection of recording environment is required.
Optionally, the recording environment of this embodiment is selected as follows: recording is best carried out in a voice laboratory equipped with a sound-insulating door and sound-absorbing rock wool, with a sound insulation level of 45 dB.
Optionally, the recording apparatus and parameters of this embodiment are selected as follows: a Sony Zoom H4N recording pen is used, recordings are stored at a 44.1 kHz sampling rate with 16-bit quality, and the recorded sound is copied to a computer hard disk.
As shown in fig. 1, a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to this embodiment includes the following steps:
s1, collecting voice data, and collecting voice data of 82 Mandarin syllables according to the sequence of a word list of a Mandarin voice evaluation system; the method comprises the following steps:
referring to 82 Chinese vocabularies of the vocabulary of the Chinese Mandarin language evaluation system (table 1), 82 Chinese Mandarin syllables of the voice data are collected, and the test subject is recorded. When recording, the user takes the sitting position, the pen holds the recorder, the lip of the user is about 10cm away from the recorder, and when the user sees the ' bar ' character on the screen, the user reads the ' bar (/ b ā /) at natural and steady speech speed and moderate volume, and records the sound repeatedly for 2 times. The waveform fluctuation range recorded by the recording pen is required to be in the range of 1/3-2/3 of the screen.
S2, clipping the collected voice data, wherein the method specifically comprises the following steps:
the target sound/b ā/first recording was cut out separately for each subject's voice file using CoolEdit pro 2.1. If the first recording has noise, interference and waveform fluctuation amplitude exceeding the range of 1/3-2/3 of the window value and waveform prompt energy is insufficient, the second recording data is selected for processing. The valid preprocessed samples are then classified and archived to single vowel groups.
S3, extracting features from the clipped signals: based on the clipped samples, the MFCC features of the digital voice signal of the syllable /bā/ are extracted through pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, extended framing, and other processes. The specific preprocessing steps are as follows:
S31, specified pre-emphasis:
the voice signal is passed through the following high-pass filter:
H(z) = 1 - μz^(-1)
where μ takes the value 0.97.
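A minimal sketch of this step, assuming a NumPy environment; "signal" is a hypothetical 1-D array of voice samples:

    import numpy as np

    def pre_emphasis(signal: np.ndarray, mu: float = 0.97) -> np.ndarray:
        # y[n] = x[n] - mu * x[n-1]; the first sample passes through unchanged
        return np.append(signal[0], signal[1:] - mu * signal[:-1])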
S32, specified framing:
a 25 ms segment is taken as one frame, and the overlap between two adjacent frames is set to 10 ms, i.e., the frame shift. The sampling rate of the voice sample /bā/ is 16 kHz, so the frame length N is 400 sampling points. In this embodiment 13 and 19 frames are taken; if there are not enough frames, zero padding is applied.
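A framing sketch under this embodiment's assumptions (16 kHz sampling, 400-sample frames, 160-sample shift); the helper name is illustrative:

    def frame_signal(x: np.ndarray, frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
        # number of frames needed to cover the signal, zero-padding the tail
        n_frames = max(1, 1 + int(np.ceil((len(x) - frame_len) / frame_shift)))
        pad = max(0, (n_frames - 1) * frame_shift + frame_len - len(x))
        x = np.append(x, np.zeros(pad))
        idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
        return x[idx]  # shape: (n_frames, frame_len)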
S33, specified windowing:
after the framing of step S32, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends. Assuming the signal after framing is S(n), n = 0, 1, ..., N-1, where N is the frame length, the signal obtained by multiplying S(n) by the Hamming window is x(n):
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
with a = 0.46.
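A sketch of the window above; "frames" is the assumed output of the framing step:

    def hamming(N: int, a: float = 0.46) -> np.ndarray:
        # W(n) = (1 - a) - a*cos(2*pi*n/(N - 1))
        n = np.arange(N)
        return (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))

    frames = frame_signal(pre_emphasis(signal))
    windowed = frames * hamming(frames.shape[1])  # applied row-wise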
S34, fast Fourier transform:
a fast Fourier transform is performed on each framed, windowed frame to obtain its spectrum value, and the squared modulus of the spectrum gives the power spectrum of the voice signal. The discrete Fourier transform (DFT) of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k ≤ N-1
where x(n) is the input voice signal and N is the number of Fourier transform points.
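A power-spectrum sketch; the transform size NFFT = 512 is an assumed choice that covers the 400-sample frames:

    NFFT = 512
    spectrum = np.fft.rfft(windowed, NFFT)  # X_a(k), one row per frame
    power_spec = np.abs(spectrum) ** 2      # |X_a(k)|^2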
S35, triangular band-pass filtering:
X_a(k) is passed through a set of 24 triangular filters with center frequencies designated f(m), m = 1, 2, ..., 24. The spacing between the f(m) narrows as m decreases and widens as m increases, as shown in FIG. 2.
First, the triangular band-pass filters smooth the spectrum and eliminate harmonics, so the result is not affected by the tone or pitch of a passage of speech; second, they reduce the amount of subsequent computation. The formula is as follows:
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
where f(m) denotes the center frequency of the m-th triangular filter.
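A sketch of one common way to place the 24 filters; the mel spacing here is an assumption consistent with the dense-to-sparse layout of FIG. 2, not a formula stated in the patent:

    def mel_filterbank(n_filters: int = 24, NFFT: int = 512, fs: int = 16000) -> np.ndarray:
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        # filter edge bins f(0)..f(25), equally spaced on the mel axis
        pts = np.floor((NFFT + 1) * inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2)) / fs).astype(int)
        H = np.zeros((n_filters, NFFT // 2 + 1))
        for m in range(1, n_filters + 1):
            # rising slope f(m-1)..f(m), falling slope f(m)..f(m+1)
            H[m - 1, pts[m - 1]:pts[m]] = (np.arange(pts[m - 1], pts[m]) - pts[m - 1]) / max(pts[m] - pts[m - 1], 1)
            H[m - 1, pts[m]:pts[m + 1]] = (pts[m + 1] - np.arange(pts[m], pts[m + 1])) / max(pts[m + 1] - pts[m], 1)
        return H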
S36, logarithmic operation: the output of each filter bank is converted to a logarithmic energy by substituting it into the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 0 ≤ m ≤ M
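Continuing the sketch:

    fbank = power_spec @ mel_filterbank().T                      # per-filter energies
    log_energy = np.log(np.maximum(fbank, np.finfo(float).eps))  # s(m), guarding log(0)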
S37, discrete cosine transform (DCT): s(m) is substituted into the following formula:
C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5)/M ), n = 1, 2, ..., L
where L is the MFCC coefficient order, specified by this method as 13 and 19; M is the number of triangular filters, specified as 24; and C(n) is the MFCC value of each frame. Connecting the frames of order 13 and order 19 yields 2 groups of MFCC values, which enter group A and group B.
S38, extended framing:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 of /bā/ are each added to group A and group B as an additional frame, giving another 2 groups of MFCCs that enter group C and group D.
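A sketch of the extension; the F0/F1/F2 midpoints below are hypothetical placeholder values (in practice measured from the recording), and broadcasting each value across a frame row is an assumed encoding:

    f0_mid, f1_mid, f2_mid = 120.0, 800.0, 1200.0  # Hz, illustrative only
    extra = np.repeat(np.array([[f0_mid], [f1_mid], [f2_mid]]), group_A.shape[1], axis=1)
    group_C = np.vstack([group_A, extra])  # 13 + 3 frames
    group_D = np.vstack([group_B, extra])  # 19 + 3 frames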
S4, building the processed data into the MFCC voice library:
Standardized data for the monosyllable "ba" in the MFCC voice library: after the preprocessing described above, the voice sample /bā/, like each of the 82 syllable samples, has 4 MFCC features, stored in groups A, B, C, and D, which are the standardized MFCC data of 13 frames, 19 frames, 13+3 frames, and 19+3 frames respectively.
Structured data: the 4 groups of data are entered into the vowel-and-tone sub-library, recorded by column as vowel-and-tone sub-library group A, group B, group C, and group D.
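A structured-storage sketch; the dict layout and names below are assumptions for illustration, not the patent's schema:

    voice_library = {"vowel_tone": {}}

    def archive(sub_library: str, syllable: str, groups: dict) -> None:
        # groups maps "A"/"B"/"C"/"D" to the corresponding MFCC matrices
        voice_library.setdefault(sub_library, {})[syllable] = groups

    archive("vowel_tone", "ba1", {"A": group_A, "B": group_B, "C": group_C, "D": group_D})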
The other 81 syllables are processed in the same way as the monosyllable "ba" and are not described again here.
The invention develops a novel standardized sampling method for pathological voice based on MFCC features. Unlike traditional voice recording and sampling methods, it formulates a word list of 82 Chinese syllables based on the inventors' prior research, adopts a standardized, structured data sampling method, and processes each syllable into 4 different data sets based on the acoustic index MFCC. It can conveniently be used for building pathological voice libraries, for big data voice analysis, and for applications requiring artificial intelligence computation.
The method provided by the invention has been put into practice in pathological voice libraries, artificial neural networks, deep learning, and other applications; it is reliable and simple to operate, and may ultimately become a standard in this field.
An evaluation method based on artificial intelligence and big data frees up manpower; it is a product of the intelligent age, combining intelligent development with the progress of the times and of science. The invention provides a methodological option for artificial intelligence research and scientific diagnosis of pathological voice.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and is included in the protection scope of the present invention.

Claims (1)

1. A standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, comprising the steps of:
collecting voice data for the 82 Mandarin Chinese syllables in the order of the word list of the Mandarin Chinese speech evaluation system, the word list comprising 3 sub-tables covering 4 main parts, namely a single-vowel-and-tone part, a sequence-language part, a compound-vowel part, and a consonant part;
the single-vowel-and-tone part: 24 monosyllabic Mandarin Chinese words formed from initial consonants of the same or equivalent phonemes and single vowels in tones 1-4, comprising: eight, pull, handle, father, force, nose, pen, must, all, read, bet, du, ge, bay, kudzu, individual, wave, neck, lame, dustpan, silt, fish, rain, and jade;
the sequence-language part: the Mandarin Chinese words for the numbers 1-10, formed from initials and finals, comprising: 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10;
the compound-vowel part: 23 monosyllabic Mandarin Chinese words formed from initial consonants of the same or equivalent phonemes and compound vowels in tone 1, comprising: break off, shrimp, bag, melon, lose, tortoise, cup, suffocate, mark, edge, ban, guest, run, bang, ice, collapse, pot, light, close, ditch, well-behaved, boot, and brother;
the consonant part: 21 monosyllabic Mandarin Chinese words formed from the 21 initials and the single final a or i in tone 1, comprising: eight, grovel, lap, he, ga, ka, machine, seven, know, eat, fund, defect, hair, ha, west, teacher, think, day, mother, that, and pull;
when collecting voice data, the distance between the subject's lips and the recorder is 9 cm-11 cm, the speaking rate is natural and steady, the volume is moderate, and each word in the list is recorded 2 times;
extracting signals from the 82 clipped syllables, and extracting the MFCC features of each syllable through specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, and extended framing;
the pre-emphasis is specifically:
passing the voice signal through the following high-pass filter:
H(z) = 1 - μz^(-1)    (1)
wherein μ has a value of 0.9 to 1.0;
editing the collected voice data to complete the clipping of the 82 syllables, then classifying and archiving them into 28 single vowels, 23 compound vowels, 21 consonants, and 10 sequence-language syllables;
the framing is specifically:
a segment of 20-30 ms is taken as one frame, and the overlap between two adjacent frames is set to 10-15 ms, i.e., the frame shift; the sampling rate of the voice sample is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512;
the windowing is specifically:
after framing, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends; assuming the signal after framing is S(n), n = 0, 1, ..., N-1, where N is the frame length, the signal obtained by multiplying S(n) by the Hamming window is x(n):
x(n) = S(n) × W(n)    (2)
in formula (2), W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1    (3)
in formula (3), a takes the value 0.46;
the fast Fourier transform is specifically:
performing a fast Fourier transform on each framed, windowed frame to obtain its spectrum value, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal; the discrete Fourier transform of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k ≤ N-1    (4)
in formula (4), x(n) is the input voice signal and N represents the number of Fourier transform points;
the processing of the triangular band-pass filter is specifically:
X_a(k) is input to a set of 24 triangular filters, the center frequency of the m-th triangular filter being designated f(m), m = 1, 2, ..., 24; the spacing between the f(m) narrows as m decreases and widens as m increases; the triangular band-pass filter H_m(k) is given by:
H_m(k) = 0, for k < f(m-1); (k - f(m-1))/(f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m); (f(m+1) - k)/(f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1); 0, for k > f(m+1)    (5)
in formula (5), f(m-1) and f(m+1) are the lower and upper edge frequencies of the m-th filter;
the processed data form a structured voice library, the standardized data of the MFCC voice library being specifically as follows:
after preprocessing, the voice sample of each of the 82 syllables has 4 MFCC features, stored in groups A, B, C, and D, which are the standardized MFCC data of 13 frames, 19 frames, 13+3 frames, and 19+3 frames respectively;
structured database: the four groups of data A, B, C, and D are entered into the vowel-and-tone sub-library, recorded by column as vowel-and-tone sub-library group A, group B, group C, and group D;
the logarithmic energy of each filter bank output H_m(k) is obtained by substituting it into the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 0 ≤ m ≤ M    (6)
the discrete cosine transform is specifically:
substituting s(m) into the following formula:
C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5)/M ), n = 1, 2, ..., L    (7)
in formula (7), L refers to the MFCC coefficient order, taken as 12-16; M is the number of triangular filters; C(n) is the MFCC value of each frame; the frames of order 13 and order 19 are connected to obtain 2 groups of MFCC values for storage, group A and group B;
the extended framing is specifically:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 are each added to group A and group B as an additional frame, obtaining 2 further groups of MFCCs for storage, group C and group D.
CN202010462384.4A 2020-05-27 2020-05-27 Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis Active CN111599347B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010462384.4A CN111599347B (en) 2020-05-27 2020-05-27 Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010462384.4A CN111599347B (en) 2020-05-27 2020-05-27 Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis

Publications (2)

Publication Number Publication Date
CN111599347A CN111599347A (en) 2020-08-28
CN111599347B true CN111599347B (en) 2024-04-16

Family

ID=72192364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010462384.4A Active CN111599347B (en) 2020-05-27 2020-05-27 Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis

Country Status (1)

Country Link
CN (1) CN111599347B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382293A (en) * 2020-11-11 2021-02-19 广东电网有限责任公司 Intelligent voice interaction method and system for power Internet of things

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1211026A (en) * 1997-09-05 1999-03-17 中国科学院声学研究所 Continuous voice identification technology for Chinese putonghua large vocabulary
WO2001039179A1 (en) * 1999-11-23 2001-05-31 Infotalk Corporation Limited System and method for speech recognition using tonal modeling
CN1412741A (en) * 2002-12-13 2003-04-23 郑方 Chinese speech identification method with dialect background
CN101436403A (en) * 2007-11-16 2009-05-20 创新未来科技有限公司 Method and system for recognizing tone
CN103310273A (en) * 2013-06-26 2013-09-18 南京邮电大学 Method for articulating Chinese vowels with tones and based on DIVA model
CN103366735A (en) * 2012-03-29 2013-10-23 北京中传天籁数字技术有限公司 A voice data mapping method and apparatus
CN104123934A (en) * 2014-07-23 2014-10-29 泰亿格电子(上海)有限公司 Speech composition recognition method and system
CN105788608A (en) * 2016-03-03 2016-07-20 渤海大学 Chinese initial consonant and compound vowel visualization method based on neural network
CN110570842A (en) * 2019-10-25 2019-12-13 南京云白信息科技有限公司 Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
CN110600055A (en) * 2019-08-15 2019-12-20 杭州电子科技大学 Singing voice separation method using melody extraction and voice synthesis technology
CN110808072A (en) * 2019-11-08 2020-02-18 广州科慧健远医疗科技有限公司 Method for evaluating dysarthria of children based on optimized acoustic parameters of data mining technology
CN110827980A (en) * 2019-11-08 2020-02-21 广州科慧健远医疗科技有限公司 Dysarthria grading evaluation method based on acoustic indexes
CN111028863A (en) * 2019-12-20 2020-04-17 广州科慧健远医疗科技有限公司 Method for diagnosing dysarthria tone error after stroke based on neural network and diagnosis device thereof


Also Published As

Publication number Publication date
CN111599347A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
Chi et al. Subglottal coupling and its influence on vowel formants
CN103280220A (en) Real-time recognition method for baby cry
CN105825852A (en) Oral English reading test scoring method
Pao et al. Mandarin emotional speech recognition based on SVM and NN
CN104050965A (en) English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
CN101976564A (en) Method for identifying insect voice
CN102655003B (en) Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient)
CN103366735B (en) The mapping method of speech data and device
CN108564956B (en) Voiceprint recognition method and device, server and storage medium
CN112397074A (en) Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
CN111599347B (en) Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis
Kharamat et al. Durian ripeness classification from the knocking sounds using convolutional neural network
CN114842878A (en) Speech emotion recognition method based on neural network
Cai et al. The DKU-JNU-EMA electromagnetic articulography database on Mandarin and Chinese dialects with tandem feature based acoustic-to-articulatory inversion
Crichton et al. Linear prediction model of speech production with applications to deaf speech training
Chamoli et al. Detection of emotion in analysis of speech using linear predictive coding techniques (LPC)
Watt Research methods in speech acoustics
Malécot New procedures for descriptive phonetics
CN112599119B (en) Method for establishing and analyzing mobility dysarthria voice library in big data background
Kumar et al. Text dependent speaker identification in noisy environment
Khulage et al. Analysis of speech under stress using linear techniques and non-linear techniques for emotion recognition system
Regel A module for acoustic-phonetic transcription of fluently spoken German speech
Prasangini et al. Sinhala speech to sinhala unicode text conversion for disaster relief facilitation in sri lanka

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant