CN111599347A - Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis - Google Patents
- Publication number
- CN111599347A (application number CN202010462384.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- mfcc
- artificial intelligence
- framing
- standardized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/48—Other medical applications
- A61B5/4803—Speech analysis specially adapted for diagnostic purposes
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/725—Details of waveform analysis using specific filters therefor, e.g. Kalman or adaptive filters
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7253—Details of waveform analysis characterised by using transforms
- A61B5/7257—Details of waveform analysis characterised by using transforms using Fourier transforms
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, comprising the following steps: collecting voice data for 82 Mandarin Chinese syllables in the order given by the "Mandarin Chinese Speech Evaluation System Vocabulary"; clipping the collected voice data to isolate the 82 syllables, which are then classified and archived; extracting the signal of each clipped syllable and computing its MFCC features through specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended-framing processing; and assembling the processed data into an MFCC voice library. By extracting the specific MFCC features of each syllable through a standardized procedure, the invention constructs a digitized, standardized and structured voice feature database that can serve a variety of pathological-voice big-data and artificial-intelligence applications, improving the objectivity and efficiency of pathological voice research and application.
Description
Technical Field
The invention belongs to the technical field of intelligent recognition, and particularly relates to a standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis.
Background
At present, the number of people with language disorders in China increases year by year, and the communication disorders caused by dysarthria seriously affect patients' return to society. Although the number of dysarthria patients in China is large, a 2016 study by Lin and Lu showed that current evaluation methods cannot meet therapists' requirements for precise speech rehabilitation. Assessment in China still relies mainly on subjective auditory evaluation and/or scales requiring subjective judgment, and therefore lacks objectivity and efficiency. In addition, the number of speech therapists in China is seriously insufficient, most are not graduates of professional programs, and their diagnostic and evaluation capability is weak.
In recent years, artificial intelligence technologies such as Artificial Neural Networks (ANN) and Deep Learning (DL) have achieved notable results in normal speech analysis and recognition, language education, intelligent voice guidance, and similar applications. The State Council's "New Generation Artificial Intelligence Development Plan" calls for accelerating innovation in artificial intelligence; studying the characteristics and regularities of acoustic parameters in dysarthria and diagnosing and classifying its various forms with artificial neural networks would improve the objectivity and efficiency of pathological speech assessment and free up manpower. Big-data and artificial-intelligence analysis of pathological speech requires a digitized, standardized and structured data set. At present there is no unified method or standard, at home or abroad, for big-data analysis and artificial-intelligence research on pathological voice, so a unified and efficient method for acquiring pathological voice features is urgently needed.
Disclosure of Invention
The invention primarily aims to overcome the defects of the prior art by providing a standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis, which serves a variety of pathological-voice feature big-data and artificial-intelligence applications and improves the objectivity and efficiency of pathological voice research and application.
In order to achieve the purpose, the invention adopts the following technical scheme:
a standardized sampling method for extracting MFCC characteristics of pathological speech for artificial intelligence analysis comprises the following steps:
collecting voice data: the voice data of 82 Mandarin Chinese syllables are collected in the order of the "Mandarin Chinese Speech Evaluation System Vocabulary";
clipping the collected voice data to complete the editing of the 82 syllables, which are then classified and archived as 28 unit-tone, 23 compound-vowel, 21 consonant and 10 sequential-language samples;
extracting the signals of the 82 clipped syllables, and extracting the MFCC features of each syllable through the specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended-framing processing;
and forming a structured voice library from the processed data, wherein the standardized data of the MFCC voice library are specifically as follows:
after preprocessing, each of the 82 syllable samples has 4 MFCC features, stored in the four groups A, B, C and D as standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively;
structuring the database: the four groups of data A, B, C and D are placed into the vowel-and-tone sub-library and labeled as vowel-and-tone sub-library groups A, B, C and D.
As a preferred technical solution, the "Mandarin Chinese Speech Evaluation System Vocabulary" comprises 4 main parts arranged in 3 tables, namely a unit-tone-and-tone part, a sequential-language part, a compound-vowel part and a consonant part;
the unit-tone-and-tone part comprises 24 Mandarin Chinese monosyllables formed from initials of the same or equivalent phoneme and a single final in tones 1-4: ba1, ba2, ba3, ba4, bi1, bi2, bi3, bi4, du1, du2, du3, du4, ge1, ge2, ge3, ge4, bo1, bo2, bo3, bo4, yu1, yu2, yu3 and yu4;
the sequential-language part comprises the Mandarin Chinese number words 1-10, each composed of an initial and a final: yi1 (1), er4 (2), san1 (3), si4 (4), wu3 (5), liu4 (6), qi1 (7), ba1 (8), jiu3 (9) and shi2 (10);
the compound-vowel part comprises 23 Mandarin Chinese monosyllables composed of initials of the same or equivalent phoneme and a compound final in tone 1: bai1 (snap), xia1 (shrimp), bao1 (bag), gua1 (melon), diu1 (lose), gui1 (turtle), bei1 (cup), bie1 (hold back), biao1 (label), bian1 (edge), ban1 (class), bin1 (guest), ben1 (rush), bang1 (shoe upper), bing1 (ice), beng1 (collapse), guo1 (pot), guang1 (light), guan1 (gate), gou1 (ditch), guai1 (well-behaved), xue1 (boot) and xiong1 (elder brother);
the consonant part comprises 21 Mandarin Chinese monosyllables composed of the 21 initials and the single final a or i in tone 1: ba1 (eight), pa1 (lie prone), da1 (lap), ta1 (he), ga1, ka1 (coffee), ji1 (machine), qi1 (seven), zhi1 (know), chi1 (eat), zi1 (resources), ci1 (defect), fa1 (hair), ha1 (ha), xi1 (west), shi1 (teacher), si1 (think), ri4 (day), ma1 (mother), na1 (that) and la1 (pull).
As a preferred technical scheme, when voice data are collected, the subject's lips are kept 9 cm-11 cm from the recorder, the speech rate is natural and steady, the volume is moderate, and the word list is recorded 2 times.
As a preferred technical solution, the pre-emphasis specifically comprises:
processing the speech signal through a high-pass filter of the form
H(z) = 1 - μz^(-1)
where μ takes a value between 0.9 and 1.0.
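As an illustration of this pre-emphasis step, a minimal sketch in Python follows (NumPy and the function name are assumptions for illustration; the patent itself specifies only the filter formula and the range of μ):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply the high-pass filter H(z) = 1 - mu * z^-1 to a speech signal."""
    # y[n] = x[n] - mu * x[n-1]; the first sample is passed through unchanged.
    return np.append(signal[0], signal[1:] - mu * signal[:-1])
```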
As a preferred technical solution, the framing specifically includes:
each frame spans 20-30 ms of signal; the overlap between two adjacent frames, i.e. the frame shift, is set to 10-15 ms; the sampling rate of the voice samples is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512.
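A minimal framing sketch under these parameters (NumPy assumed; names are illustrative, and the signal is assumed to be at least one frame long):

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (20-30 ms frames, 10-15 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 points at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 points at 16 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(num_frames)])
```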
As a preferred technical scheme, the windowing specifically is as follows:
multiplying each frame by a Hamming window after framing to increase the continuity of the left and right ends of the frame; let the framed signal be S(n), n = 0, 1, ..., N-1, where N is the frame length; the signal value after multiplying S(n) by the Hamming window is x(n),
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N-1
in the above formula, the value of a is 0.46.
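A sketch of this windowing step (NumPy assumed; `frames` is the output of the framing step above):

```python
import numpy as np

def hamming_window(N: int, a: float = 0.46) -> np.ndarray:
    """W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)), n = 0..N-1, with a = 0.46."""
    n = np.arange(N)
    return (1 - a) - a * np.cos(2 * np.pi * n / (N - 1))

def apply_window(frames: np.ndarray) -> np.ndarray:
    """x(n) = S(n) * W(n), applied to every frame."""
    return frames * hamming_window(frames.shape[1])
```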
As a preferred technical solution, the fast fourier transform specifically includes:
performing fast Fourier transform on each framed and windowed frame signal to obtain the frequency spectrum of each frame, and taking the squared magnitude of the spectrum of the voice signal to obtain its power spectrum; the discrete Fourier transform of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πkn/N), 0 ≤ k ≤ N-1
in the above formula, x(n) is the input voice signal, and N represents the number of points of the Fourier transform.
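A sketch of this transform step (NumPy assumed; the FFT size of 512 is an illustrative choice within the 256-512 range given above):

```python
import numpy as np

def power_spectrum(windowed_frames: np.ndarray, nfft: int = 512) -> np.ndarray:
    """FFT each frame to get X_a(k), then take the squared magnitude |X_a(k)|^2."""
    spectrum = np.fft.rfft(windowed_frames, n=nfft)  # one-sided spectrum per frame
    return np.abs(spectrum) ** 2
```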
As a preferred technical solution, the processing of the triangular band-pass filter specifically includes:
let X_a(k) pass through a set of 24 triangular filters with center frequencies designated as f(m), m = 1, 2, ..., 24, where the spacing between adjacent f(m) narrows as m decreases and widens as m increases; the frequency response of the m-th triangular band-pass filter is:
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).
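The patent does not spell out how the center frequencies f(m) are placed; a common construction, given here as an assumption, spaces them uniformly on the mel scale (NumPy assumed, and the FFT size is assumed large enough that adjacent filter edges fall on distinct bins):

```python
import numpy as np

def mel_filterbank(num_filters: int = 24, nfft: int = 512,
                   sample_rate: int = 16000) -> np.ndarray:
    """Build triangular filters H_m(k): dense at low and sparse at high frequency."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # num_filters + 2 boundary points, uniformly spaced on the mel scale
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    H = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising edge of triangle m
            H[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):           # falling edge of triangle m
            H[m - 1, k] = (right - k) / (right - center)
    return H
```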
As a preferred technical solution, the logarithmic energy output by each filter bank is computed by substituting H_m(k) into the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|² H_m(k) ), 0 ≤ m ≤ M.
as a preferred technical solution, the discrete cosine transform specifically includes:
substituting s(m) into the following equation:
C(n) = Σ_{m=1}^{M} s(m) cos(πn(m - 0.5)/M), n = 1, 2, ..., L
in the above formula, L refers to the MFCC coefficient order, taken as 12-16; M is the number of triangular filters; C(n) is the MFCC value of each frame; the 13th-order and 19th-order coefficients are concatenated frame by frame to obtain 2 groups of MFCC values for storage, namely group A and group B.
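A sketch combining the filtering, logarithm and DCT steps above (NumPy assumed; running it once with L = 13 and once with L = 19 would yield the group A and group B coefficients):

```python
import numpy as np

def mfcc_from_power(pow_frames: np.ndarray, fbank: np.ndarray, L: int = 13) -> np.ndarray:
    """s(m) = ln(sum_k |X_a(k)|^2 * H_m(k)); C(n) = sum_m s(m) * cos(pi*n*(m-0.5)/M)."""
    M = fbank.shape[0]                            # number of triangular filters (24)
    s = np.log(pow_frames @ fbank.T + 1e-10)      # log filter-bank energies per frame
    n = np.arange(1, L + 1)[:, None]              # coefficient orders 1..L
    m = np.arange(1, M + 1)[None, :]
    dct_basis = np.cos(np.pi * n * (m - 0.5) / M)
    return s @ dct_basis.T                        # shape: (num_frames, L)
```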
As a preferred technical solution, the extended framing specifically includes:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 are added to group A and group B as extra subframes, yielding 2 further groups of MFCCs for storage, namely group C and group D.
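How the three prosodic values are laid out inside the extra subframes is not detailed in the patent; the following sketch (an assumption for illustration) appends one zero-padded row per value:

```python
import numpy as np

def extend_with_prosody(mfcc: np.ndarray, f1_mid: float, f2_mid: float,
                        f0_mid: float) -> np.ndarray:
    """Append midpoint values of F1, F2 and F0 as three extra pseudo-frames."""
    extra = np.zeros((3, mfcc.shape[1]))          # one row per prosodic value
    extra[0, 0], extra[1, 0], extra[2, 0] = f1_mid, f2_mid, f0_mid
    return np.vstack([mfcc, extra])               # 13 -> 13+3 or 19 -> 19+3 frames
```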
Compared with the prior art, the invention has the following advantages and beneficial effects:
based on the research of pathological speech artificial intelligent recognition by the inventor, the invention designs ' evaluation word list of Mandarin Chinese dysarthria ' (hereinafter, word list '). The word list has 82 syllables of Chinese vocabulary, and specific MFCC characteristics of each syllable are extracted through a standardized flow method to construct a digitalized, standardized and structured voice database. The invention can serve for multiple applications of pathological voice characteristic big data and artificial intelligence analysis, and improves the objectivity and efficiency of pathological voice research and application.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the Mel-frequency filter bank of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
In Speech Recognition and Voiceprint Recognition, the most commonly used speech feature is the Mel-scale Frequency Cepstral Coefficient (MFCC). The human ear has different hearing sensitivity to sound waves of different frequencies: speech signals from 200 Hz to 5000 Hz have the greatest effect on speech intelligibility, and the critical bandwidth of sound masking is smaller in the low-frequency domain than in the high-frequency domain. Accordingly, 28 band-pass filters are arranged from dense to sparse, from low frequency to high frequency, according to the critical bandwidth, and the input signal is filtered through them; the signal energy output by each band-pass filter is taken as a basic feature of the signal, and the acoustic feature built on these characteristics of the human ear is the MFCC. The shape of the human vocal tract is reflected in the envelope of the short-time power spectrum, which the MFCC represents accurately; the acoustic features thus reflect structural changes of the vocal tract cavity and indirectly reflect pathophysiological changes. In addition, because these features do not depend on the properties of the signal, make no assumptions about or restrictions on the input, and incorporate the research results of auditory models, they are widely used in digital speech recognition. The parameter therefore has better robustness than the vocal-tract-model-based LPCC, conforms better to the auditory characteristics of the human ear, and retains good recognition performance when the signal-to-noise ratio decreases. In summary, the MFCC is well suited as a digitized speech input feature for big-data research and artificial-intelligence analysis.
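For reference, the mel scale underlying such filter banks is commonly related to physical frequency by Mel(f) = 2595·log10(1 + f/700), with f in Hz; this standard formula is quoted here as background and is not stated explicitly in the patent.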
The main technical scheme of the invention is as follows: 82 Mandarin Chinese voice samples are collected in the order of the "Mandarin Chinese Speech Evaluation System Vocabulary" (Tables 1-3) and preprocessed by the method specified in the invention to complete the clipping of the 82 syllables; MFCC features are extracted from each syllable, and after preprocessing each syllable enters a structured voice library of unit tones, compound vowels, consonants, sequential language and tones.
Furthermore, the "Mandarin Chinese Speech Evaluation System Vocabulary" comprises 4 parts, namely a unit-tone-and-tone part, a sequential-language part, a compound-vowel part and a consonant part:
a unit-tone-and-tone part: 24 Mandarin Chinese monosyllables consisting of initials of the same or equivalent phoneme and a single final in tones 1-4, namely eight (ba1), pluck (ba2), hold (ba3), father (ba4), approach (bi1), nose (bi2), pen (bi3), must (bi4), all (du1), read (du2), gamble (du3), degree (du4), brother (ge1), diaphragm (ge2), ge (ge3), individual (ge4), wave (bo1), neck (bo2), lame (bo3), jolt (bo4), silt (yu1), fish (yu2), rain (yu3) and jade (yu4);
a sequential-language part: the Mandarin Chinese number words 1-10, each consisting of an initial and a final, namely 1 (yi1), 2 (er4), 3 (san1), 4 (si4), 5 (wu3), 6 (liu4), 7 (qi1), 8 (ba1), 9 (jiu3) and 10 (shi2);
a compound-vowel part: 23 Mandarin Chinese monosyllables consisting of initials of the same or equivalent phoneme and a compound final in tone 1, namely snap (bai1), shrimp (xia1), bag (bao1), melon (gua1), lose (diu1), turtle (gui1), cup (bei1), hold back (bie1), label (biao1), edge (bian1), class (ban1), guest (bin1), rush (ben1), shoe upper (bang1), ice (bing1), collapse (beng1), pot (guo1), light (guang1), gate (guan1), ditch (gou1), well-behaved (guai1), boot (xue1) and elder brother (xiong1);
the consonant part comprises 21 Mandarin Chinese monosyllables consisting of the 21 initials and the single final a or i in tone 1, namely eight (ba1), lie prone (pa1), lap (da1), he (ta1), ga (ga1), coffee (ka1), machine (ji1), seven (qi1), know (zhi1), eat (chi1), resources (zi1), defect (ci1), hair (fa1), ha (ha1), west (xi1), teacher (shi1), think (si1), day (ri4), mother (ma1), that (na1) and pull (la1).
Table 1. Mandarin Chinese Speech Evaluation System Vocabulary (unit tones, tones, sequential language)
Table 2. Mandarin Chinese Speech Evaluation System Vocabulary (compound vowels)
Table 3. Mandarin Chinese Speech Evaluation System Vocabulary (consonants)
| Serial number | Consonant type | Character (gloss) | Initial | Final | Tone |
|---|---|---|---|---|---|
| 1 | Unaspirated stop | eight | b | a | 1 |
| 2 | Aspirated stop | lie prone | p | a | 1 |
| 3 | Unaspirated stop | lap | d | a | 1 |
| 4 | Aspirated stop | he | t | a | 1 |
| 5 | Unaspirated stop | ga | g | a | 1 |
| 6 | Aspirated stop | coffee | k | a | 1 |
| 7 | Unaspirated affricate | machine | j | i | 1 |
| 8 | Aspirated affricate | seven | q | i | 1 |
| 9 | Unaspirated affricate | know | zh | i | 1 |
| 10 | Aspirated affricate | eat | ch | i | 1 |
| 11 | Unaspirated affricate | resources | z | i | 1 |
| 12 | Aspirated affricate | defect | c | i | 1 |
| 13 | Voiceless fricative | hair | f | a | 1 |
| 14 | Voiceless fricative | ha | h | a | 1 |
| 15 | Voiceless fricative | west | x | i | 1 |
| 16 | Voiceless fricative | teacher | sh | i | 1 |
| 17 | Voiceless fricative | think | s | i | 1 |
| 18 | Voiced fricative | day | r | i | 4 |
| 19 | Nasal | mother | m | a | 1 |
| 20 | Nasal | that | n | a | 1 |
| 21 | Lateral | pull | l | a | 1 |
To further illustrate the technical solution of the present invention, the monosyllable "ba" is taken as an example:
before the standardized sampling of the present embodiment is performed, the recording environment needs to be selected.
Optionally, the recording environment of this embodiment: recording is carried out in a speech laboratory fitted with a soundproof door and sound-absorbing rock wool, providing 45 dB of sound insulation.
Optionally, the recording instrument and parameters of this embodiment: a Sony Zoom H4N recording pen is selected, recordings are stored at a 44.1 kHz sampling rate with 16-bit quality, and the recordings are copied to a computer hard disk.
As shown in fig. 1, the standardized sampling method for extracting the MFCC features of the pathological speech for artificial intelligence analysis of the present embodiment includes the following steps:
s1, collecting voice data, and collecting the voice data of 82 Mandarin Chinese syllables according to the sequence of the Mandarin Chinese voice evaluation system vocabulary; the method specifically comprises the following steps:
the subject was recorded by collecting speech data for 82 mandarin chinese syllables with reference to 82 chinese words of the mandarin chinese speech assessment system vocabularies (table 1). When recording, the subject takes the sitting position, the pen holds the recorder with hands, the lip of the subject is about 10cm away from the recorder, and when seeing that the character 'ba' appears on the screen, the subject reads 'ba (/ b ā /)' at natural smooth speed and moderate volume, and records for 2 times. The amplitude of the wave of the waveform recorded by the recording pen is required to be in the range of 1/3-2/3.
S2, editing the collected voice data, specifically:
the target note/b ā/first recording was cut out separately from the note of each subject using CoolEdit Pro2.1. If the first recording has noise, interference, waveform fluctuation amplitude exceeding the range of the window value 1/3-2/3 and waveform prompting energy is insufficient, the second recording data is selected for processing. The valid preprocessed sample classes are then archived to unit tone groups.
S3, extracting features from the clipped signal: based on the clipped sample, the MFCC feature extraction of the digital voice signal of the syllable /bā/ is completed through pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering, extended framing and related processing; the specific preprocessing steps are as follows:
s31, pre-emphasis is designated:
processing the speech signal through a high-pass filter of the form
H(z) = 1 - μz^(-1)
in the above formula, μ takes the value 0.97.
S32, frame division is designated:
with 25 ms of signal taken as one frame, the overlap between two adjacent frames, i.e. the frame shift, is set to 10 ms; the sampling rate of the voice sample /bā/ is 16 kHz, so each frame contains N = 400 sampling points; in this embodiment 13 and 19 frames are taken, with zero padding applied if there are too few frames.
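A sketch of fixing the frame count to 13 or 19 with zero padding, as this embodiment specifies (NumPy assumed; whether truncation is also permitted when there are too many frames is not stated, so it is included here only as an assumption):

```python
import numpy as np

def fix_frame_count(frames: np.ndarray, target: int = 13) -> np.ndarray:
    """Zero-pad (or truncate) the frame sequence to exactly `target` frames."""
    if len(frames) >= target:
        return frames[:target]
    pad = np.zeros((target - len(frames), frames.shape[1]))
    return np.vstack([frames, pad])
```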
S33, windowing:
after framing in step S32, each frame is multiplied by a Hamming window to increase the continuity of the left and right ends of the frame. Let the framed signal be S(n), n = 0, 1, ..., N-1, where N is the frame length; the signal value after multiplying S(n) by the Hamming window is x(n),
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N-1
where a takes the value 0.46.
S34, fast Fourier transform:
performing fast Fourier transform on each framed and windowed frame signal to obtain the frequency spectrum of each frame, and taking the squared magnitude of the spectrum of the voice signal to obtain its power spectrum; the Discrete Fourier Transform (DFT) of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πkn/N), 0 ≤ k ≤ N-1
in the above formula, x(n) is the input voice signal, and N represents the number of points of the Fourier transform.
S35, triangular band-pass filtering:
X_a(k) is passed through a set of 24 triangular filters with center frequencies designated as f(m), m = 1, 2, ..., 24. The spacing between adjacent f(m) narrows as m decreases and widens as m increases, as shown in FIG. 2.
The triangular band-pass filters smooth the frequency spectrum and eliminate harmonics, so that the result is not affected by the tone or pitch of the input voice; they also reduce the amount of subsequent computation. The frequency response of the m-th filter is:
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).
S36, logarithmic operation: the output H_m(k) of each filter bank is substituted into the following formula to obtain the logarithmic energy:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|² H_m(k) ), 0 ≤ m ≤ M.
S37, Discrete Cosine Transform (DCT): substituting s(m) into the following equation:
C(n) = Σ_{m=1}^{M} s(m) cos(πn(m - 0.5)/M), n = 1, 2, ..., L
in the above formula, L refers to the MFCC coefficient order (the method specifies 13 and 19) and M is the number of triangular filters (the method specifies 24); C(n) is the MFCC value of each frame. The 13th-order and 19th-order coefficients are concatenated frame by frame to obtain 2 sets of MFCC values, stored as group A and group B.
S38, extended framing:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 of /bā/ are added to group A and group B respectively as extra subframes, giving 2 further sets of MFCCs that are stored as group C and group D.
S4, forming the processed data into an MFCC voice library:
standardized data for the monosyllable "ba" in the MFCC voice library: after the preprocessing described above, the voice sample /bā/ (like each of the 82 syllable samples) has 4 MFCC features, stored in the four groups A, B, C and D as standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively.
Structured data: the 4 groups of data are placed into the vowel-and-tone sub-library and labeled as vowel-and-tone sub-library groups A, B, C and D.
The other 81 syllables are processed in the same way as the monosyllable "ba" and are not described in further detail here.
The invention develops a novel standardized sampling method for pathological voice based on MFCC features. Unlike traditional voice recording and sampling methods, it formulates a word list of 82 Chinese syllables based on the inventors' earlier research results and, using a standardized and structured data sampling method, processes each syllable into 4 different data sets based on the acoustic index MFCC. The method can be conveniently used for constructing pathological voice databases, for big-data voice analysis and for artificial-intelligence computation.
Research on structured sampling standards for pathological speech in China is scarce. The method of the invention has been applied to pathological voice libraries, artificial neural networks, deep learning and similar applications; it is reliable and simple to operate, and may ultimately become the standard in this field.
An evaluation method based on artificial intelligence and big data frees up manpower and builds on the development of intelligent technology; it is a product of the intelligent era and of scientific progress. The invention provides a methodological choice for artificial-intelligence research and the scientific diagnosis of pathological voice.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited thereto; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the scope of the present invention.
Claims (10)
1. A standardized sampling method for extracting MFCC characteristics of pathological speech for artificial intelligence analysis is characterized by comprising the following steps:
collecting voice data: the voice data of 82 Mandarin Chinese syllables are collected in the order of the "Mandarin Chinese Speech Evaluation System Vocabulary";
clipping the collected voice data to complete the editing of the 82 syllables, which are then classified and archived as 28 unit-tone, 23 compound-vowel, 21 consonant and 10 sequential-language samples;
extracting the signals of the 82 clipped syllables, and extracting the MFCC features of each syllable through the specified pre-emphasis, framing, windowing, fast Fourier transform, triangular band-pass filtering and extended-framing processing;
and forming a structured voice library from the processed data, wherein the standardized data of the MFCC voice library are specifically as follows:
after preprocessing, each of the 82 syllable samples has 4 MFCC features, stored in the four groups A, B, C and D as standardized MFCC data of 13 frames, 19 frames, 13+3 frames and 19+3 frames respectively;
structuring the database: the four groups of data A, B, C and D are placed into the vowel-and-tone sub-library and labeled as vowel-and-tone sub-library groups A, B, C and D.
2. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis as claimed in claim 1, wherein said "Mandarin Chinese Speech Evaluation System Vocabulary" comprises 4 main parts arranged in 3 tables, namely a unit-tone-and-tone part, a sequential-language part, a compound-vowel part and a consonant part;
the unit-tone-and-tone part comprises 24 Mandarin Chinese monosyllables formed from initials of the same or equivalent phoneme and a single final in tones 1-4: ba1, ba2, ba3, ba4, bi1, bi2, bi3, bi4, du1, du2, du3, du4, ge1, ge2, ge3, ge4, bo1, bo2, bo3, bo4, yu1, yu2, yu3 and yu4;
the sequential-language part comprises the Mandarin Chinese number words 1-10, each composed of an initial and a final: yi1 (1), er4 (2), san1 (3), si4 (4), wu3 (5), liu4 (6), qi1 (7), ba1 (8), jiu3 (9) and shi2 (10);
the compound-vowel part comprises 23 Mandarin Chinese monosyllables composed of initials of the same or equivalent phoneme and a compound final in tone 1: bai1 (snap), xia1 (shrimp), bao1 (bag), gua1 (melon), diu1 (lose), gui1 (turtle), bei1 (cup), bie1 (hold back), biao1 (label), bian1 (edge), ban1 (class), bin1 (guest), ben1 (rush), bang1 (shoe upper), bing1 (ice), beng1 (collapse), guo1 (pot), guang1 (light), guan1 (gate), gou1 (ditch), guai1 (well-behaved), xue1 (boot) and xiong1 (elder brother);
the consonant part comprises 21 Mandarin Chinese monosyllables composed of the 21 initials and the single final a or i in tone 1: ba1 (eight), pa1 (lie prone), da1 (lap), ta1 (he), ga1, ka1 (coffee), ji1 (machine), qi1 (seven), zhi1 (know), chi1 (eat), zi1 (resources), ci1 (defect), fa1 (hair), ha1 (ha), xi1 (west), shi1 (teacher), si1 (think), ri4 (day), ma1 (mother), na1 (that) and la1 (pull).
3. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis as claimed in claim 1, wherein when voice data are collected, the subject's lips are kept 9-11 cm from the recorder, the speech rate is natural and steady, the volume is moderate, and the word list is recorded 2 times;
the pre-emphasis is specifically:
processing the speech signal through a high-pass filter of the form
H(z) = 1 - μz^(-1)
where μ takes a value between 0.9 and 1.0.
4. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 1, wherein said framing is specifically:
each frame spans 20-30 ms of signal; the overlap between two adjacent frames, i.e. the frame shift, is set to 10-15 ms; the sampling rate of the voice samples is 8 kHz or 16 kHz, and the number of sampling points N per frame is 256-512.
5. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 1, wherein said windowing is specifically:
multiplying each frame by a Hamming window after framing to increase the continuity of the left and right ends of the frame; let the framed signal be S(n), n = 0, 1, ..., N-1, where N is the frame length; the signal value after multiplying S(n) by the Hamming window is x(n),
x(n) = S(n) × W(n)
where W(n) is the Hamming window, given by:
W(n) = (1 - a) - a·cos(2πn/(N - 1)), 0 ≤ n ≤ N-1
in the above formula, the value of a is 0.46.
6. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 1, wherein said fast Fourier transform is specifically:
performing fast Fourier transform on each framed and windowed frame signal to obtain the frequency spectrum of each frame, and taking the squared magnitude of the spectrum of the voice signal to obtain its power spectrum; the discrete Fourier transform of the voice signal is:
X_a(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πkn/N), 0 ≤ k ≤ N-1
in the above formula, x(n) is the input voice signal, and N represents the number of points of the Fourier transform.
7. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 6, wherein the processing of said triangular band-pass filter is specifically:
let X_a(k) pass through a set of 24 triangular filters with center frequencies designated as f(m), m = 1, 2, ..., 24, where the spacing between adjacent f(m) narrows as m decreases and widens as m increases; the frequency response of the m-th triangular band-pass filter is:
H_m(k) = 0 for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0 for k > f(m+1).
8. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis as claimed in claim 7, wherein the logarithmic energy output by each filter bank is obtained by substituting H_m(k) into the following formula:
s(m) = ln( Σ_{k=0}^{N-1} |X_a(k)|² H_m(k) ), 0 ≤ m ≤ M.
9. The standardized sampling method for extracting pathological voice MFCC features for artificial intelligence analysis as claimed in claim 8, wherein said discrete cosine transform is specifically:
substituting s(m) into the following equation:
C(n) = Σ_{m=1}^{M} s(m) cos(πn(m - 0.5)/M), n = 1, 2, ..., L
in the above formula, L refers to the MFCC coefficient order, taken as 12-16; M is the number of triangular filters; C(n) is the MFCC value of each frame; the 13th-order and 19th-order coefficients are concatenated frame by frame to obtain 2 groups of MFCC values for storage, namely group A and group B.
10. The standardized sampling method for extracting MFCC features of pathological speech for artificial intelligence analysis as claimed in claim 9, wherein said extended framing is specifically:
the midpoint values of the formants F1 and F2 and of the fundamental frequency F0 are added to group A and group B as extra subframes, yielding 2 further groups of MFCCs for storage, namely group C and group D.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010462384.4A CN111599347B (en) | 2020-05-27 | 2020-05-27 | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis
Publications (2)
Publication Number | Publication Date |
---|---|
CN111599347A true CN111599347A (en) | 2020-08-28 |
CN111599347B CN111599347B (en) | 2024-04-16 |
Family
ID=72192364
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010462384.4A Active CN111599347B (en) | 2020-05-27 | 2020-05-27 | Standardized sampling method for extracting pathological voice MFCC (Mel frequency cepstrum coefficient) features for artificial intelligence analysis
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111599347B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112382293A (en) * | 2020-11-11 | 2021-02-19 | 广东电网有限责任公司 | Intelligent voice interaction method and system for power Internet of things |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1211026A (en) * | 1997-09-05 | 1999-03-17 | 中国科学院声学研究所 | Continuous voice identification technology for Chinese putonghua large vocabulary |
WO2001039179A1 (en) * | 1999-11-23 | 2001-05-31 | Infotalk Corporation Limited | System and method for speech recognition using tonal modeling |
CN1412741A (en) * | 2002-12-13 | 2003-04-23 | 郑方 | Chinese speech identification method with dialect background |
CN101436403A (en) * | 2007-11-16 | 2009-05-20 | 创新未来科技有限公司 | Method and system for recognizing tone |
CN103310273A (en) * | 2013-06-26 | 2013-09-18 | 南京邮电大学 | Method for articulating Chinese vowels with tones and based on DIVA model |
CN103366735A (en) * | 2012-03-29 | 2013-10-23 | 北京中传天籁数字技术有限公司 | A voice data mapping method and apparatus |
CN104123934A (en) * | 2014-07-23 | 2014-10-29 | 泰亿格电子(上海)有限公司 | Speech composition recognition method and system |
CN105788608A (en) * | 2016-03-03 | 2016-07-20 | 渤海大学 | Chinese initial consonant and compound vowel visualization method based on neural network |
CN110570842A (en) * | 2019-10-25 | 2019-12-13 | 南京云白信息科技有限公司 | Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree |
CN110600055A (en) * | 2019-08-15 | 2019-12-20 | 杭州电子科技大学 | Singing voice separation method using melody extraction and voice synthesis technology |
CN110808072A (en) * | 2019-11-08 | 2020-02-18 | 广州科慧健远医疗科技有限公司 | Method for evaluating dysarthria of children based on optimized acoustic parameters of data mining technology |
CN110827980A (en) * | 2019-11-08 | 2020-02-21 | 广州科慧健远医疗科技有限公司 | Dysarthria grading evaluation method based on acoustic indexes |
CN111028863A (en) * | 2019-12-20 | 2020-04-17 | 广州科慧健远医疗科技有限公司 | Method for diagnosing dysarthria tone error after stroke based on neural network and diagnosis device thereof |
Also Published As
Publication number | Publication date |
---|---|
CN111599347B (en) | 2024-04-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |