CN111599347B - Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis - Google Patents
Standardized sampling method for extracting pathological voice MFCC (Mel-frequency cepstral coefficient) features for artificial intelligence analysis
- Publication number: CN111599347B (application CN202010462384.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- mfcc
- vowel
- framing
- data
- Prior art date
- Legal status: Active
Classifications
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L25/24—Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
- A61B5/4803—Speech analysis specially adapted for diagnostic purposes
- A61B5/725—Details of waveform analysis using specific filters therefor, e.g. Kalman or adaptive filters
- A61B5/7257—Details of waveform analysis characterised by using transforms using Fourier transforms
Abstract
The invention discloses a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, which comprises the following steps: collecting voice data for the 82 Mandarin Chinese syllables in the order of the word list of the Mandarin Chinese speech evaluation system; clipping the collected voice data to complete the editing of the 82 syllables, then classifying and archiving them; extracting, from the clipped signals of the 82 syllables, the MFCC features of each syllable through the specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended framing; and building the processed data into an MFCC voice library. By a standardized workflow, the invention extracts the specific MFCC features of each syllable and constructs a digitized, standardized and structured voice feature database that can serve various applications of pathological voice big data and artificial intelligence analysis, improving the objectivity and efficiency of pathological voice research and application.
Description
Technical Field
The invention belongs to the technical field of intelligent recognition, and particularly relates to a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis.
Background
The number of people with language disorders in China increases year by year, and the communication disorders caused by dysarthria in particular seriously hinder patients from returning to society. Although the population of dysarthria patients in China is large, research by Lin Jiang and Lu Jianliang in 2016 showed that current evaluation methods cannot meet therapists' requirements for precise speech rehabilitation. Domestic rehabilitation departments and speech rehabilitation institutions rely mainly on subjective auditory evaluation and/or scales requiring subjective judgment, which lack objectivity and efficiency. In addition, the number of speech therapists in China is seriously insufficient; most did not graduate from a relevant specialty, and their diagnosis and evaluation abilities are weak.
In recent years, application research based on the rapid development of artificial intelligence technology, such as artificial neural networks (ANN) and deep learning (DL), has achieved results in normal voice analysis and recognition, language education, intelligent voice guidance and other areas. The medical section of the State Council's New Generation Artificial Intelligence Development Plan calls for accelerating innovative applications of artificial intelligence; studying the characteristics and regularities of acoustic parameters in dysarthria and diagnosing and classifying the various dysarthrias with artificial neural networks would improve the objectivity and efficiency of pathological voice evaluation and free up manpower. To perform big data and artificial intelligence analysis on pathological speech, there must be digitized, standardized and structured data sets. At present there is no unified method or standard for pathological voice big data analysis at home or abroad, and a unified and efficient pathological voice feature acquisition method is urgently needed.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, which serves various applications of pathological voice big data and artificial intelligence analysis and improves the objectivity and efficiency of pathological voice research and application.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, comprising the steps of:
collecting voice data: the voice data of the 82 Mandarin Chinese syllables are collected in the order of the word list of the Mandarin Chinese speech evaluation system;
clipping the collected voice data to complete the editing of the 82 syllables, then classifying and archiving them, wherein there are 28 single vowels, 23 compound vowels, 21 consonants and 10 sequence-language words;
extracting, from the clipped signals of the 82 syllables, the MFCC features of each syllable through the specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended framing;
forming the processed data into a structured voice library, the standardized data of the MFCC voice library being specifically as follows:
after preprocessing, the voice sample of each of the 82 syllable samples has 4 MFCC feature sets, stored in groups A, B, C and D, which are respectively the standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames;
structured database: the four groups of data A, B, C and D are entered into the vowel-and-tone sub-library and labeled, by column, group A, group B, group C and group D of the vowel-and-tone sub-library.
As a preferable technical scheme, the "Mandarin Chinese speech evaluation system vocabulary" comprises 3 sub-tables covering 4 main parts, namely a single-vowel-and-tone part, a sequence-language part, a compound-vowel part and a consonant part;
the single-vowel-and-tone part consists of 24 monosyllabic Mandarin words formed from the 4 tones of initial consonants and single vowels of the same or equivalent phonemes: ba1-ba4, bi1-bi4, du1-du4, ge1-ge4, bo1-bo4 and yu1-yu4;
the sequence-language part consists of initials and finals forming the Mandarin words for the numbers 1-10: yi1, er4, san1, si4, wu3, liu4, qi1, ba1, jiu3 and shi2;
the compound-vowel part comprises 23 monosyllabic first-tone Mandarin words with the same or equivalent compound vowels: bai1, xia1, bao1, gua1, diou1, guei1, bei1, bie1, biao1, bian1, ban1, bin1, ben1, bang1, bing1, beng1, guo1, guang1, guan1, gou1, guai1, xue1 and xiong1;
the consonant part comprises 21 monosyllabic Mandarin words formed from the 21 initials with the single final a or i: ba1, pa1, da1, ta1, ga1, ka1, ji1, qi1, zhi1, chi1, zi1, ci1, fa1, ha1, xi1, shi1, si1, ri4, ma1, na1 and la1.
As a preferable technical scheme, when voice data are collected, the distance between the subject's lips and the recorder is 9-11 cm, the speech rate is natural and steady, the volume is moderate, and each word is recorded twice.
As a preferable technical solution, the pre-emphasis specifically comprises:
passing the voice signal through the following high-pass filter:

H(z) = 1 - μz^(-1)

where μ takes a value between 0.9 and 1.0.
As a preferable technical scheme, the framing specifically comprises:
each frame covers 20-30 ms, and the overlap between two adjacent frames is set by a 10-15 ms frame shift; the sampling rate of the voice samples is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512.
As a preferable technical scheme, the windowing specifically comprises:
after framing, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends; let the framed signal be S(n), n = 0, 1, ..., N - 1, where N is the frame length; the windowed signal is x(n):

x(n) = S(n) × W(n)

where W(n) is the Hamming window:

W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1

in which a takes the value 0.46.
As a preferable technical solution, the fast Fourier transform specifically comprises:
applying a fast Fourier transform to each framed and windowed signal to obtain its spectrum, and modulus-squaring the spectrum of the voice signal to obtain its power spectrum; the discrete Fourier transform of the voice signal is:

X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k < N

where x(n) is the input voice signal and N is the number of Fourier transform points.
As a preferable technical solution, the triangular band-pass filtering specifically comprises:
passing X_a(k) through a set of 24 triangular filters with center frequencies f(m), m = 1, 2, ..., 24; the spacing between the f(m) decreases as m decreases and widens as m increases; the frequency response of the triangular band-pass filter is:

H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).

As a preferred technical solution, the logarithmic energy of each filter bank output, computed with H_m(k), is given by the following formula:

s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 1 ≤ m ≤ M.
as a preferable technical solution, the discrete cosine transform specifically includes:
substituting S (m) into the following formula:
the L refers to the coefficient order of the MFCC, and 12-16 is taken; m is the number of triangular filters; c (n) is the MFCC value for each frame; and (3) connecting the 13 th order and the 19 th order frames to obtain 2 groups of MFCC values, namely, group A and group B.
As a preferable technical solution, the extended framing specifically comprises:
adding the formant peaks F1 and F2 and the midpoint value of F0 each as a frame to groups A and B, obtaining 2 further groups of MFCCs for storage, namely group C and group D.
Compared with the prior art, the invention has the following advantages and beneficial effects:
Based on the inventors' research on artificial intelligence recognition of pathological voice, the Chinese Mandarin dysarthria evaluation vocabulary (hereinafter "the vocabulary") was designed. For the 82 Chinese syllables in the vocabulary, the specific MFCC features of each syllable are extracted by a standardized workflow and a digitized, standardized and structured voice database is constructed. The invention can serve various applications of pathological voice big data and artificial intelligence analysis, improving the objectivity and efficiency of pathological voice research and application.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a schematic diagram of the Mel-frequency filter bank of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
In speech recognition and voiceprint recognition, the most commonly used speech feature is the Mel-frequency cepstral coefficient (MFCC). The human ear has different auditory sensitivities to sound waves of different frequencies; speech signals from 200 Hz to 5000 Hz have the greatest effect on speech intelligibility, and the critical bandwidth for sound masking is smaller in the low-frequency domain than in the high-frequency domain. Accordingly, 28 band-pass filters are arranged from dense to sparse, following the critical bandwidth, from low frequency to high frequency to filter the input signal. The energy of the signal output by each band-pass filter is taken as a basic feature of the signal; this acoustic feature based on the characteristics of the human ear is the MFCC. The shape of the human vocal tract appears in the envelope of the short-time power spectrum, and the MFCC accurately represents this envelope; that is, the acoustic feature reflects structural changes of the vocal tract and indirectly reflects pathophysiological changes. In addition, the feature does not depend on the nature of the signal, makes no assumptions or restrictions on the input signal, and incorporates the research results of auditory models, so it is widely used in digital speech recognition. Such parameters are therefore more robust than the vocal-tract-model-based LPCC, fit the auditory properties of the human ear better, and retain good recognition performance when the signal-to-noise ratio drops. In summary, the MFCC is suitable as a digitized speech input feature for big data research and artificial intelligence analysis.
The main technical scheme of the invention is as follows: voice data for the 82 Mandarin Chinese syllables are collected in the order of the word list of the Mandarin Chinese speech evaluation system (shown in Tables 1-3); preprocessing is performed according to the method specified in this patent to complete the clipping of the 82 syllables; the MFCC features of each syllable are extracted; and after preprocessing the syllables enter the structured voice library of single vowels, compound vowels, consonants, sequence language and tones.
Further, the "Mandarin Chinese speech evaluation system vocabulary" comprises 3 sub-tables covering 4 main parts, namely the single-vowel-and-tone part, the sequence-language part, the compound-vowel part and the consonant part:
the single-vowel-and-tone part consists of 24 monosyllabic Mandarin words formed from the 4 tones of initial consonants and single vowels of the same or equivalent phonemes: eight (ba1), pull (ba2), handle (ba3), father (ba4), force (bi1), nose (bi2), pen (bi3), must (bi4), all (du1), read (du2), bet (du3), du (du4), go (ge1), ge (ge2), kudzu (ge3), individual (ge4), wave (bo1), neck (bo2), lame (bo3), dustpan (bo4), silt (yu1), fish (yu2), rain (yu3) and jade (yu4);
the sequence-language part consists of initials and finals forming the Mandarin words for the numbers 1-10: 1 (yi1), 2 (er4), 3 (san1), 4 (si4), 5 (wu3), 6 (liu4), 7 (qi1), 8 (ba1), 9 (jiu3) and 10 (shi2);
the compound-vowel part comprises 23 monosyllabic first-tone Mandarin words formed from the initial consonants and compound vowels of the same or equivalent phonemes: break off (bai1), shrimp (xia1), bag (bao1), melon (gua1), lose (diou1), tortoise (guei1), cup (bei1), stifle (bie1), mark (biao1), edge (bian1), ban (ban1), guest (bin1), run (ben1), help (bang1), ice (bing1), collapse (beng1), pot (guo1), light (guang1), close (guan1), ditch (gou1), guai (guai1), boot (xue1) and brother (xiong1);
the consonant part comprises 21 monosyllabic first-tone Mandarin words formed from the 21 initials with the single final a or i: eight (ba1), lie prone (pa1), da (da1), he (ta1), ga (ga1), ka (ka1), machine (ji1), seven (qi1), know (zhi1), eat (chi1), asset (zi1), flaw (ci1), send (fa1), ha (ha1), xi (xi1), teacher (shi1), think (si1), day (ri4), mother (ma1), na (na1) and pull (la1).
Table 1 of the Mandarin Chinese speech evaluation system vocabulary (single vowels, tones, sequence language)
Table 2 of the Mandarin Chinese speech evaluation system vocabulary (compound vowels)
Table 3 of the Mandarin Chinese speech evaluation system vocabulary (consonants)
| No. | Consonant type | Word (pinyin) | Initial | Final | Tone |
|----|----|----|----|----|----|
| 1 | Unaspirated stop | eight (ba) | b | a | 1 |
| 2 | Aspirated stop | lie prone (pa) | p | a | 1 |
| 3 | Unaspirated stop | da | d | a | 1 |
| 4 | Aspirated stop | he (ta) | t | a | 1 |
| 5 | Unaspirated stop | ga | g | a | 1 |
| 6 | Aspirated stop | ka | k | a | 1 |
| 7 | Unaspirated affricate | machine (ji) | j | i | 1 |
| 8 | Aspirated affricate | seven (qi) | q | i | 1 |
| 9 | Unaspirated affricate | know (zhi) | zh | i | 1 |
| 10 | Aspirated affricate | eat (chi) | ch | i | 1 |
| 11 | Unaspirated affricate | asset (zi) | z | i | 1 |
| 12 | Aspirated affricate | flaw (ci) | c | i | 1 |
| 13 | Fricative | send (fa) | f | a | 1 |
| 14 | Fricative | ha | h | a | 1 |
| 15 | Fricative | xi | x | i | 1 |
| 16 | Fricative | teacher (shi) | sh | i | 1 |
| 17 | Fricative | think (si) | s | i | 1 |
| 18 | Voiced fricative | day (ri) | r | i | 4 |
| 19 | Nasal | mother (ma) | m | a | 1 |
| 20 | Nasal | na | n | a | 1 |
| 21 | Lateral | pull (la) | l | a | 1 |
To further explain the technical solution of the invention, the monosyllabic word "ba" (eight) is taken as an example below:
before the standardized sampling of this embodiment is performed, a selection of recording environment is required.
Optionally, the recording environment of this embodiment selects: the method is most carried out in a voice laboratory equipped with a sound insulation door and sound absorption rock wool, and the sound insulation degree is 45dB.
Optionally, the recording apparatus and parameters of this example are selected by: a Sony Zoom H4N recording pen is selected, the sampling rate of 44.1kHz and the tone quality of 16 bits are used for storage, and the recorded sound is copied to a computer hard disk.
As shown in fig. 1, a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis according to this embodiment includes the following steps:
s1, collecting voice data, and collecting voice data of 82 Mandarin syllables according to the sequence of a word list of a Mandarin voice evaluation system; the method comprises the following steps:
referring to 82 Chinese vocabularies of the vocabulary of the Chinese Mandarin language evaluation system (table 1), 82 Chinese Mandarin syllables of the voice data are collected, and the test subject is recorded. When recording, the user takes the sitting position, the pen holds the recorder, the lip of the user is about 10cm away from the recorder, and when the user sees the ' bar ' character on the screen, the user reads the ' bar (/ b ā /) at natural and steady speech speed and moderate volume, and records the sound repeatedly for 2 times. The waveform fluctuation range recorded by the recording pen is required to be in the range of 1/3-2/3 of the screen.
S2, clipping the collected voice data, specifically as follows:
Using Cool Edit Pro 2.1, the target sound /bā/ of the first recording is cut out of each subject's voice file. If the first recording contains noise or interference, if its waveform fluctuation exceeds the 1/3-2/3 window range, or if the waveform indicates insufficient energy, the data of the second recording are used instead. The valid preprocessed samples are then classified and archived into the single-vowel group.
S3, feature extraction from the clipped signals: based on the clipped samples, the MFCC features of the digital voice signal of the syllable /bā/ are extracted through pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, extended framing and related processes; the specific preprocessing steps are as follows:
S31, pre-emphasis:
The voice signal is passed through the following high-pass filter:

H(z) = 1 - μz^(-1)

where μ takes the value 0.97.
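The pre-emphasis above is a one-line difference equation, y(n) = x(n) - μ·x(n-1). A minimal sketch (illustrative Python assuming numpy; the function name is not from the patent):

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """Apply H(z) = 1 - mu*z^-1, i.e. y[n] = x[n] - mu*x[n-1];
    the first sample is passed through unchanged."""
    return np.append(x[0], x[1:] - mu * x[:-1])
```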
S32, framing:
25 ms is taken as one frame, and adjacent frames are advanced by a 10 ms frame shift, so that consecutive frames overlap. The sampling rate of the voice sample /bā/ is 16 kHz, so the frame length N is 400 samples. In this embodiment 13 and 19 frames are taken; recordings that fall short are zero-padded.
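With the values above (16 kHz, 25 ms frames, 10 ms shift), framing can be sketched as follows (illustrative; the function name and zero-padding placement are assumptions consistent with, but not dictated by, the text):

```python
import numpy as np

def frame_signal(x, frame_len=400, frame_shift=160):
    """Split a 16 kHz signal into 25 ms frames (400 samples), advancing by a
    10 ms frame shift (160 samples); the tail is zero-padded to a full frame."""
    n_frames = 1 + int(np.ceil(max(len(x) - frame_len, 0) / frame_shift))
    pad = (n_frames - 1) * frame_shift + frame_len - len(x)
    padded = np.append(x, np.zeros(pad))
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return padded[idx]
```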
S33, windowing:
After the framing of step S32, each frame is multiplied by a Hamming window to increase the continuity of its left and right ends. Let the framed signal be S(n), n = 0, 1, ..., N - 1, where N is the frame length; the windowed signal is x(n):

x(n) = S(n) × W(n)

where W(n) is the Hamming window:

W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1

with a = 0.46.
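The window above can be generated and applied as follows (illustrative sketch assuming numpy; for N = 400 it coincides with numpy's built-in Hamming window):

```python
import numpy as np

N = 400                      # frame length: 25 ms at 16 kHz
a = 0.46
n = np.arange(N)
W = (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))   # Hamming window W(n)

def window_frame(frame):
    """Multiply one frame S(n) by the Hamming window: x(n) = S(n) * W(n)."""
    return frame * W
```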
S34, fast Fourier transform:
A fast Fourier transform is applied to each framed and windowed signal to obtain its spectrum, and the spectrum of the voice signal is modulus-squared to obtain its power spectrum. The discrete Fourier transform (DFT) of the voice signal is:

X_a(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k < N

where x(n) is the input voice signal and N is the number of Fourier transform points.
S35, triangular band-pass filtering:
X_a(k) is passed through a set of 24 triangular filters with center frequencies f(m), m = 1, 2, ..., 24. The spacing between the f(m) decreases as m decreases and widens as m increases, as shown in Fig. 2.
The triangular band-pass filters, first, smooth the spectrum and eliminate harmonics, so that the result is not affected by the tone or pitch of a stretch of speech; second, they reduce the subsequent amount of computation. The frequency response is:

H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) < k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).
S36, logarithmic operation: the logarithmic energy of each filter bank output, computed with H_m(k), is given by the following formula:

s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|²·H_m(k) ), 1 ≤ m ≤ M.
S37, discrete cosine transform (DCT): s(m) is substituted into the following formula:

C(n) = Σ_{m=1}^{M} s(m)·cos( πn(m - 0.5) / M ), n = 1, 2, ..., L

where L is the order of the MFCC coefficients, specified here as 13 and 19; M is the number of triangular filters, specified as 24; and C(n) is the MFCC value of each frame. Connecting the 13th-order and 19th-order framings yields 2 groups of MFCC values, which enter group A and group B.
S38, extended framing:
The formant peaks F1 and F2 of /bā/ and the midpoint value of F0 are each added as a frame to groups A and B respectively, and the 2 further MFCC groups obtained are placed into group C and group D.
S4, constructing the processed data into an MFCC voice library:
Standardized data for the monosyllable "ba" in the MFCC speech library: after the preprocessing of [0033] and [0034], each of the 82 syllable samples, such as /bā/, has 4 MFCC features, stored in groups A, B, C and D; these are the standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively.
Structured data: the 4 groups of data are put into the vowel and tone sub-library and labeled group A, group B, group C and group D of the vowel and tone sub-library.
The other 81 syllables are processed in the same way as the monosyllable "ba" and are not described again here.
The invention proposes a novel standardized sampling method for pathological voice based on MFCC features. Unlike conventional voice recording and sampling methods, it formulates a word list of 82 Chinese Mandarin syllables based on the inventor's prior research results, adopts a standardized and structured data sampling method, and processes each syllable into 4 different data sets based on the acoustic index MFCC. It can be conveniently used for constructing pathological voice libraries, big-data voice analysis, and artificial intelligence computation.
The method has proven practical in applications such as pathological voice libraries, artificial neural networks and deep learning; it is reliable and simple to operate, and may eventually become a standard in this field.
An evaluation method based on artificial intelligence and big data frees up manpower and is a product of the intelligent age and of scientific progress. The invention provides a method for artificial intelligence research and the scientific diagnosis of pathological voice.
The above examples are preferred embodiments of the present invention, but the embodiments are not limited to them; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.
Claims (1)
1. A standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, comprising the steps of:
collecting voice data: collecting the voice data of the 82 Chinese Mandarin syllables in the order of the word list of the Chinese Mandarin speech evaluation system; the word list of the Chinese Mandarin speech evaluation system comprises 3 sub-tables and 4 main parts, namely a single-vowel tone part, a sequence speech part, a compound vowel part and a consonant part;
the single-vowel tone part consists of 24 monosyllabic Chinese Mandarin words formed from initials and single finals of the same or similar phonemes in tones 1-4, including: eighth, pull, handle, father, force, nose, pen, must, all, read, bet, duu, go, bay, kudzuvine, personal, wave, neck, claudication, dustpan, silted, fish, rain and jade;
the sequence speech part consists of the Chinese Mandarin words for the numbers 1-10, composed of initials and finals, including: 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10;
the compound vowel part consists of 23 monosyllabic Chinese Mandarin words in tone 1 with the same or similar compound vowels, including: breaking, shrimp, bag, melon, loss, tortoise, cup, suffocation, mark, edge, ban, guest, running, upper, ice, collapse, pot, light, closing, ditch, lambkin, boot and brothers;
the consonant part consists of 21 monosyllabic Chinese Mandarin words formed from the 21 initials in tone 1 with the single final a or i, including: eight, groveling, lapping, he, ga, ca, machine, seven, know, eat, fund, defect, hair, ha, si, ri, ma, na and la;
when collecting voice data, the distance between the subject's lips and the recorder is 9 cm-11 cm, the speech rate is natural and steady, the volume is moderate, and the word list is recorded 2 times;
extracting the signals of the 82 clipped syllables, and extracting the MFCC features of each syllable through pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and expanded framing;
the pre-emphasis is specifically as follows:
passing the voice signal through a high-pass filter:
H(z) = 1 - μz^(-1)   (1)
wherein μ takes a value of 0.9 to 1.0;
editing the collected voice data to complete the clipping of the 82 syllables, then classifying and archiving them: 28 single vowels, 23 compound vowels, 21 consonants and 10 sequence words;
the framing is specifically as follows:
each frame is 20-30 ms long, and the overlap between two adjacent frames, i.e. the frame shift, is set to 10-15 ms; the sampling rate of the voice sample is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512;
the windowing specifically comprises the following steps:
after framing, each frame is multiplied by a Hamming window to increase the continuity at the left and right ends of the frame; assuming the framed signal is S(n), n = 0, 1, ..., N - 1, the signal after multiplying S(n) by the Hamming window is x(n):
x(n)=S(n) ×W(n) (2)
in formula (2), W(n) is a Hamming window:
W(n) = (1 - a) - a·cos(2πn/(N - 1))   (3)
in formula (3), a takes the value 0.46 and n takes the values 0, 1, ..., N - 1;
The fast fourier transform is specifically:
performing a fast Fourier transform on each framed and windowed frame to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal; the discrete Fourier transform of the voice signal is:
Xa(k) = Σ x(n)·e^(-j2πnk/N), n = 0, 1, ..., N - 1; 0 ≤ k ≤ N - 1   (4)
in formula (4), x(n) is the input voice signal, and N represents the number of points of the Fourier transform;
the processing of the triangular band-pass filter specifically comprises the following steps:
Xa(k) is input to a bank of 24 triangular filters whose center frequencies are denoted f(m), m = 1, 2, ..., 24, where f(m) is the center frequency of the m-th triangular filter; the spacing between adjacent f(m) narrows as m decreases and widens as m increases; the frequency response of the triangular band-pass filter Hm(k) is:
Hm(k) = 0, for k < f(m - 1)
Hm(k) = (k - f(m - 1)) / (f(m) - f(m - 1)), for f(m - 1) ≤ k ≤ f(m)
Hm(k) = (f(m + 1) - k) / (f(m + 1) - f(m)), for f(m) ≤ k ≤ f(m + 1)
Hm(k) = 0, for k > f(m + 1)   (5)
the processed data form a structured voice library, and the standardized data of the MFCC voice library are specifically as follows:
after preprocessing, each of the 82 syllable samples has 4 MFCC features, stored in groups A, B, C and D, which are the standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively;
structured database: the four groups of data A, B, C and D are put into the vowel and tone sub-library and labeled group A, group B, group C and group D of the vowel and tone sub-library;
the logarithmic operation: the output of each filter bank Hm(k) applied to the power spectrum is substituted into the following formula:
s(m) = ln( Σk |Xa(k)|²·Hm(k) ), m = 1, 2, ..., M   (6)
the discrete cosine transform is specifically:
substituting s(m) into the following formula:
C(n) = Σm s(m)·cos(πn(m - 0.5)/M), n = 1, 2, ..., L   (7)
in formula (7), L is the MFCC coefficient order, taken as 12-16; M is the number of triangular filters; C(n) is the MFCC value of each frame; connecting the 13th-order and the 19th-order frames yields 2 groups of MFCC values for storage, namely group A and group B;
the extended framing is specifically:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 are each added as one extra frame to group A and group B respectively, obtaining 2 further groups of MFCCs for storage, namely group C and group D.
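For reference, the whole claimed pipeline from raw signal to per-frame MFCCs can be assembled as the following sketch; the frame length (400 samples = 25 ms at 16 kHz), frame shift (160 samples = 10 ms), μ = 0.97 and mel-scale filter spacing are illustrative choices within or alongside the ranges stated above:

```python
import numpy as np

def extract_mfcc(signal, fs=16000, mu=0.97, frame_len=400, frame_shift=160,
                 nfft=512, num_filters=24, order=13):
    # Pre-emphasis: H(z) = 1 - mu*z^(-1), i.e. y(n) = x(n) - mu*x(n-1)
    y = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # Framing and Hamming windowing (formula (3) with a = 0.46)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    num_frames = 1 + (len(y) - frame_len) // frame_shift
    frames = np.stack([y[i * frame_shift:i * frame_shift + frame_len] * window
                       for i in range(num_frames)])
    # FFT and squared modulus -> power spectrum |Xa(k)|^2 (formula (4))
    pspec = np.abs(np.fft.rfft(frames, n=nfft)) ** 2
    # Triangular filterbank (formula (5)); mel-scale center spacing assumed
    mel = lambda f: 2595.0 * np.log10(1 + f / 700.0)
    imel = lambda m: 700.0 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(0.0, mel(fs / 2), num_filters + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    H = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        H[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)
    # Logarithm (formula (6)) and DCT (formula (7))
    s = np.log(pspec @ H.T + 1e-10)
    n_idx = np.arange(1, order + 1)[:, None]
    m_idx = np.arange(1, num_filters + 1)[None, :]
    dct = np.cos(np.pi * n_idx * (m_idx - 0.5) / num_filters)
    return s @ dct.T  # shape: (num_frames, order)
```

Running it with order=13 and order=19 produces the two per-syllable feature matrices from which groups A-D are built.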
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010462384.4A CN111599347B (en) | 2020-05-27 | 2020-05-27 | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) characteristics for artificial intelligent analysis
Publications (2)
Publication Number | Publication Date |
---|---|
CN111599347A CN111599347A (en) | 2020-08-28 |
CN111599347B true CN111599347B (en) | 2024-04-16 |
Family
ID=72192364
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010462384.4A Active CN111599347B (en) | 2020-05-27 | 2020-05-27 | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) characteristics for artificial intelligent analysis
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111599347B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112382293A (en) * | 2020-11-11 | 2021-02-19 | 广东电网有限责任公司 | Intelligent voice interaction method and system for power Internet of things |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1211026A (en) * | 1997-09-05 | 1999-03-17 | 中国科学院声学研究所 | Continuous voice identification technology for Chinese putonghua large vocabulary |
WO2001039179A1 (en) * | 1999-11-23 | 2001-05-31 | Infotalk Corporation Limited | System and method for speech recognition using tonal modeling |
CN1412741A (en) * | 2002-12-13 | 2003-04-23 | 郑方 | Chinese speech identification method with dialect background |
CN101436403A (en) * | 2007-11-16 | 2009-05-20 | 创新未来科技有限公司 | Method and system for recognizing tone |
CN103310273A (en) * | 2013-06-26 | 2013-09-18 | 南京邮电大学 | Method for articulating Chinese vowels with tones and based on DIVA model |
CN103366735A (en) * | 2012-03-29 | 2013-10-23 | 北京中传天籁数字技术有限公司 | A voice data mapping method and apparatus |
CN104123934A (en) * | 2014-07-23 | 2014-10-29 | 泰亿格电子(上海)有限公司 | Speech composition recognition method and system |
CN105788608A (en) * | 2016-03-03 | 2016-07-20 | 渤海大学 | Chinese initial consonant and compound vowel visualization method based on neural network |
CN110570842A (en) * | 2019-10-25 | 2019-12-13 | 南京云白信息科技有限公司 | Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree |
CN110600055A (en) * | 2019-08-15 | 2019-12-20 | 杭州电子科技大学 | Singing voice separation method using melody extraction and voice synthesis technology |
CN110808072A (en) * | 2019-11-08 | 2020-02-18 | 广州科慧健远医疗科技有限公司 | Method for evaluating dysarthria of children based on optimized acoustic parameters of data mining technology |
CN110827980A (en) * | 2019-11-08 | 2020-02-21 | 广州科慧健远医疗科技有限公司 | Dysarthria grading evaluation method based on acoustic indexes |
CN111028863A (en) * | 2019-12-20 | 2020-04-17 | 广州科慧健远医疗科技有限公司 | Method for diagnosing dysarthria tone error after stroke based on neural network and diagnosis device thereof |
Also Published As
Publication number | Publication date |
---|---|
CN111599347A (en) | 2020-08-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108922541B (en) | Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models | |
Chi et al. | Subglottal coupling and its influence on vowel formants | |
CN103280220A (en) | Real-time recognition method for baby cry | |
CN105825852A (en) | Oral English reading test scoring method | |
Pao et al. | Mandarin emotional speech recognition based on SVM and NN | |
CN104050965A (en) | English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof | |
CN101976564A (en) | Method for identifying insect voice | |
CN102655003B (en) | Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient) | |
CN103366735B (en) | The mapping method of speech data and device | |
CN108564956B (en) | Voiceprint recognition method and device, server and storage medium | |
CN112397074A (en) | Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning | |
CN102237083A (en) | Portable interpretation system based on WinCE platform and language recognition method thereof | |
CN111599347B (en) | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) characteristics for artificial intelligent analysis | |
Kharamat et al. | Durian ripeness classification from the knocking sounds using convolutional neural network | |
CN114842878A (en) | Speech emotion recognition method based on neural network | |
Cai et al. | The DKU-JNU-EMA electromagnetic articulography database on Mandarin and Chinese dialects with tandem feature based acoustic-to-articulatory inversion | |
Crichton et al. | Linear prediction model of speech production with applications to deaf speech training | |
Chamoli et al. | Detection of emotion in analysis of speech using linear predictive coding techniques (LPC) | |
Watt | Research methods in speech acoustics | |
Malécot | New procedures for descriptive phonetics | |
CN112599119B (en) | Method for establishing and analyzing mobility dysarthria voice library in big data background | |
Kumar et al. | Text dependent speaker identification in noisy environment | |
Khulage et al. | Analysis of speech under stress using linear techniques and non-linear techniques for emotion recognition system | |
Regel | A module for acoustic-phonetic transcription of fluently spoken German speech | |
Prasangini et al. | Sinhala speech to sinhala unicode text conversion for disaster relief facilitation in sri lanka |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||