CN112927696A - System and method for automatically evaluating dysarthria based on voice recognition

Info

Publication number
CN112927696A
Authority
CN
China
Prior art keywords
frame
feature extraction
extraction unit
phoneme
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911234291.XA
Other languages
Chinese (zh)
Inventor
茹克艳木·肉孜
苏荣锋
王岚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201911234291.XA
Publication of CN112927696A
Legal status: Pending

Classifications

    • A: HUMAN NECESSITIES
        • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
            • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
                • A61B 5/00: Measuring for diagnostic purposes; Identification of persons
                    • A61B 5/48: Other medical applications
                        • A61B 5/4803: Speech analysis specially adapted for diagnostic purposes
                • A61B 2503/00: Evaluating a particular growth phase or type of persons or animals
                    • A61B 2503/06: Children, e.g. for attention deficit diagnosis
    • G: PHYSICS
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00: Speech recognition
                    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
                        • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units
                    • G10L 15/08: Speech classification or search
                        • G10L 15/18: Speech classification or search using natural language modelling
                            • G10L 15/183: Speech classification or search using context dependencies, e.g. language models
                                • G10L 15/19: Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
                                    • G10L 15/197: Probabilistic grammars, e.g. word n-grams
                • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
                    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
                        • G10L 25/30: Speech or voice analysis techniques using neural networks
                    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
                        • G10L 25/51: Speech or voice analysis techniques for comparison or discrimination
                            • G10L 25/66: Speech or voice analysis techniques for extracting parameters related to health condition

Abstract

The invention provides a system and a method for automatically evaluating dysarthria based on speech recognition. The system comprises a first feature extraction unit, a second feature extraction unit, a feature concatenation unit, a multilayer perceptron and an evaluation unit, the feature concatenation unit being communicatively connected to the first feature extraction unit, the second feature extraction unit and the multilayer perceptron. The second feature extraction unit extracts frame-level audio labels and the frame phoneme-probability correspondence; the feature concatenation unit concatenates the features extracted by the first feature extraction unit with those extracted by the second feature extraction unit; the multilayer perceptron outputs an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features; and the evaluation unit uses the per-sentence prediction probabilities to obtain an overall assessment. The invention improves the accuracy and stability of dysarthria assessment.

Description

System and method for automatically evaluating dysarthria based on voice recognition
Technical Field
The invention relates to the technical field of dysarthria assessment, and in particular to a system and method for automatic dysarthria assessment based on speech recognition.
Background
Dysarthria manifests as slurred speech, poor fluency, inaccurate pronunciation, and abnormal volume and rhythm. Physicians usually confirm whether a patient suffers from dysarthria, and its severity, through examination of the vocal organs and speech assessment. For preschool children, these symptoms can be improved or resolved through language training. Because physician resources and time are limited, and because the internet and mobile devices are widely available, dysarthria language training is increasingly delivered through mobile applications (apps). Evaluating the effect of mobile language training provides timely feedback to the user and gives course designers important information for personalizing training programs.
Existing evaluation methods rely mainly on subjective auditory perception; objective analysis has received little attention, and a complete scheme for automatic dysarthria assessment is lacking. One existing dysarthria recognition scheme extracts formants of dysarthric speech to compute acoustic parameters, computes tongue-lip displacement from articulatory movement data, and correlates the two to recognize dysarthria. The eGeMAPS acoustic parameter set integrated in the OpenSMILE tool has also been used to analyze other speech-related disorders, such as aphasic speech assessment, but it has not been applied to the analysis and assessment of dysarthric speech.
In academic research, assessment of dysarthric speech has focused mainly on vowels and a limited set of acoustic features. For example, the correlation with dysarthria of the formant centralization ratio (FCR), the triangular vowel space area (tVSA), and the voice onset time (VOT) extracted from short phrases containing target consonants has been discussed. Because fixed prompted utterances and continuous speech in daily conversation differ in pronunciation quality and duration, the vowel features described in the prior art are not suitable for the continuous-speech portion of a language training session. For consonants, such methods consider only the six consonants b, p, d, t, g and k, and features such as voice onset time are difficult to extract automatically and accurately. In addition, these features do not sufficiently reflect the problems of dysarthric speech; in particular, mispronunciation caused by consonant substitution is not considered.
In summary, the prior art still lacks an effective means of automatic dysarthria assessment. The main problems are that subjective assessment based on auditory perception lacks objectivity, accuracy and stability; automatic assessment of dysarthric speech has not been achieved; and the inputs used by existing assessment methods are restricted to a limited number of isolated phone sounds rather than continuous speech.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a system and method for automatically evaluating dysarthria based on speech recognition, which extract speech features by means of speech recognition and combine them with a deep-learning classifier to assess dysarthria automatically.
According to a first aspect of the present invention, a system for automatic dysarthria assessment based on speech recognition is provided. The system comprises a first feature extraction unit, a second feature extraction unit, a feature concatenation unit, a multilayer perceptron and an evaluation unit, wherein the feature concatenation unit is communicatively connected to the first feature extraction unit, the second feature extraction unit and the multilayer perceptron, and the evaluation unit is communicatively connected to the multilayer perceptron. The first feature extraction unit extracts traditional sentence-level acoustic features. The second feature extraction unit extracts frame-level audio labels and the frame phoneme-probability correspondence, where the frame phoneme-probability correspondence is the set of two-tuples formed by the phonemes contained in a frame and their posterior probabilities. The feature concatenation unit concatenates the features extracted by the first feature extraction unit with those extracted by the second feature extraction unit to obtain concatenated features. The multilayer perceptron outputs an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features. The evaluation unit obtains an overall assessment from the prediction probabilities of the individual sentences.
In one embodiment, the second feature extraction unit is configured to extract, for each sentence audio, one or more of phoneme duration, phoneme substitution rate, approximate pronunciation quality, frame blur rate, and frame phoneme count.
In one embodiment, the multilayer perceptron comprises an input layer, a hidden layer and an output layer, wherein the output layer has 4 nodes corresponding to the four dysarthria categories normal, mild, moderate and severe.
In one embodiment, the second feature extraction unit is configured to:
input the standard text labels and the actual pronunciation audio into a deep neural network acoustic model and obtain, through forced alignment, frame-level pronunciation labels over the 118 phonemes;
input the actual pronunciation audio into the deep neural network acoustic model to obtain, for each node of its output layer, the corresponding phoneme and the output of the corresponding Gaussian probability density function;
and compute the phonemes contained in each frame and their posterior probabilities, where the outputs of the Gaussian probability density functions belonging to the same phoneme are added to obtain that phoneme's posterior probability, thereby obtaining the frame phoneme-probability correspondence.
In one embodiment, the second feature extraction unit is configured to extract, for each sentence audio, one or more of vowel phoneme duration, consonant phoneme duration, overall phoneme duration, consonant substitution rate, vowel substitution rate, overall substitution rate, mean consonant approximate pronunciation quality, mean vowel approximate pronunciation quality, mean overall approximate pronunciation quality, sentence frame blur rate, consonant count, vowel count, and frame phoneme count.
In one embodiment, the feature concatenation unit is configured to apply max-min normalization to the features extracted by the first feature extraction unit and the features extracted by the second feature extraction unit before they are input to the multilayer perceptron.
According to a second aspect of the present invention, a method for automatic dysarthria assessment based on speech recognition is provided. The method comprises the following steps:
extracting traditional sentence-level acoustic features;
extracting frame-level audio labels and the frame phoneme-probability correspondence, where the frame phoneme-probability correspondence is the set of two-tuples formed by the phonemes contained in a frame and their posterior probabilities;
concatenating the traditional sentence-level acoustic features with the features extracted from the frame phoneme-probability correspondence to obtain concatenated features;
using a multilayer perceptron to output an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features;
and obtaining an overall assessment from the prediction probabilities of the individual sentences.
In one embodiment, the overall evaluation result is expressed as:
P_average = (1/N) · Σ_{i=1}^{N} p_prediction,i
where N denotes the number of evaluated speech sentences, P_average and p_prediction,i are multi-dimensional vectors, each dimension representing one dysarthria degree category, and the value of p_prediction,i in each dimension is the predicted probability of the corresponding degree of dysarthria.
Compared with the prior art, the invention has the following advantages: objective analysis based on speech recognition is used to assess dysarthria, so the result is accurate and stable; features are extracted from the pronounced phonemes in continuous speech, so they carry as much information related to dysarthric speech as possible; the extracted feature set combines traditional acoustic features with features derived from automatic speech recognition, representing the problems of dysarthric speech more completely and improving assessment accuracy; and the automatic evaluation process feeds the language-training effect back to the evaluated person in a timely manner, saving labor and time.
Drawings
The invention is illustrated and described, by way of example and not of limitation, in the following drawings, in which:
FIG. 1 is a schematic diagram of a system for automated assessment of dysarthria based on speech recognition according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of a system for automated assessment of dysarthria based on speech recognition according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a DNN acoustic model framework according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
Referring to fig. 1, the system for automatic dysarthria assessment based on speech recognition according to an embodiment of the present invention includes a feature extraction unit 110, a feature extraction unit 120, a feature concatenation unit 130, a multilayer perceptron 140 and an evaluation unit 150. The feature extraction unit 110 extracts traditional acoustic features, the feature extraction unit 120 extracts the custom acoustic features defined herein, the feature concatenation unit 130 concatenates the traditional and custom features and feeds the result to the multilayer perceptron 140, the multilayer perceptron 140 outputs an impairment-degree category and the corresponding prediction probability, and the evaluation unit 150 obtains an overall assessment from the prediction probabilities of individual sentences. The units shown in fig. 1 may be implemented as software modules, processors or hardware logic circuits.
Referring to fig. 1, an automatic dysarthria assessment system based on speech recognition according to an embodiment of the present invention generally comprises two parts. The feature extraction part produces a sentence-level feature description of dysarthric speech: for example, a custom 13-dimensional feature is first extracted using automatic speech recognition (ASR) technology, an 88-dimensional eGeMAPS parameter vector is then extracted with the OpenSMILE tool as the traditional acoustic feature, and the two are concatenated into a new 101-dimensional sentence-level feature (a minimal sketch of this concatenation follows). The classification and evaluation part classifies the sentence-level features with a multilayer perceptron, which outputs an impairment-degree category and a prediction probability for each sentence; the prediction probabilities of all of an individual's sentences are then used to automatically assess that individual's degree of dysarthria.
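The sketch below (Python, not the patented implementation) illustrates the sentence-level feature assembly just described: 13 ASR-derived dimensions concatenated with 88 eGeMAPS dimensions into a 101-dimensional vector.

```python
# Minimal sketch of the 101-dimensional sentence-level feature:
# 13 custom ASR-based dimensions + 88 eGeMAPS dimensions.
import numpy as np

def build_sentence_feature(asr_features: np.ndarray, egemaps_features: np.ndarray) -> np.ndarray:
    """Concatenate the custom ASR-based features (13-dim) with the
    traditional eGeMAPS acoustic features (88-dim) into one 101-dim vector."""
    assert asr_features.shape == (13,), "expected 13 ASR-based dimensions"
    assert egemaps_features.shape == (88,), "expected 88 eGeMAPS functionals"
    return np.concatenate([asr_features, egemaps_features])  # shape (101,)
```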
More specifically, see the automated evaluation framework shown in fig. 2. The ASR system is trained as follows. For normal children's speech data, acoustic features are extracted for each training sample, comprising 12-dimensional static PLP (perceptual linear prediction) features and a 1-dimensional static pitch feature together with their first- and second-order differences, 39 dimensions in total. A GMM-HMM baseline model containing 3652 tied triphone states is then trained with maximum likelihood estimation. The GMM-HMM baseline model is next used for forced alignment to obtain the tied triphone state of each frame as the supervision label for subsequent network training. Finally, a deep neural network (DNN) acoustic model is trained on these supervision labels with the backpropagation (BP) algorithm. In the assessment phase, a dysarthric child first listens carefully to a given sentence and then reads the sentence aloud 3 times. Each sentence is 4-7 words long, and the sentences jointly cover all 118 phonemes, see table 1. Each sentence has its standard pronunciation and standard text labels, as well as the actual pronunciation audio of the dysarthric child.
Table 1: the 118 phonemes used for Chinese speech recognition
a1 a2 a3 a4 a5 ai1 ai2 ai3 ai4 ai5 ao1 ao2
ao3 ao4 ao5 b c ch d e1 e2 e3 e4 e5
ei1 ei2 ei3 ei4 er2 er3 er4 f g h i1 i2
i3 i4 i5 ia1 ia2 ia3 ia4 ia5 iao1 iao2 iao3 iao4
ie1 ie2 ie3 ie4 ie5 iu1 iu2 iu3 iu4 j k l
m n ng o1 o2 o3 o4 o5 ou1 ou2 ou3 ou4
ou5 p q r s sh sil t u1 u2 u3 u4
u5 ua1 ua2 ua3 ua4 ua5 uai1 uai2 uai3 uai4 ue1 ue2
ue3 ue4 ui1 ui2 ui3 ui4 uo1 uo2 uo3 uo4 uo5 v1
v2 v3 v4 v5 w x y z zh sp
Finally, the following steps are performed with the ASR system to obtain the information used for acoustic feature extraction:
Step (1): the standard text labels and the actual pronunciation audio are input into the DNN acoustic model, and frame-level pronunciation labels over the 118 phonemes are obtained through forced alignment;
Step (2): the actual pronunciation audio is input into the DNN acoustic model, and preliminary recognition information for each frame is obtained from its output layer; this information includes the phoneme corresponding to each node of the last layer and the output of the corresponding Gaussian probability density function, see fig. 3 (o_t denotes the input feature vector);
Step (3): the phonemes contained in each frame and their posterior probabilities are computed from the information of step (2), as shown in step (3) of fig. 3; that is, the outputs of the Gaussian probability density functions belonging to the same phoneme are added to obtain that phoneme's posterior probability. The frame phoneme-probability table is defined as the set of two-tuples formed by the phonemes contained in one frame and their posterior probabilities.
It should be noted that the basic input unit in fig. 3 is one frame, and the outputs of the output-layer Gaussian probability density functions sum to 1. Accordingly, the multiple phonemes obtained in step (3) are contained in one frame, i.e. the posterior probabilities of all phonemes in a frame sum to 1.
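As an illustration of step (3), the sketch below collapses per-node output probabilities into a frame phoneme-probability table by summing the probabilities of nodes that share a phoneme. The input names (node_phonemes, frame_node_probs) are illustrative assumptions, not terms from the patent.

```python
# Sketch: build the frame phoneme-probability table from output-layer nodes.
from collections import defaultdict
from typing import Dict, List, Sequence

def frame_phoneme_probability_table(
    node_phonemes: Sequence[str],                   # phoneme attached to each output node
    frame_node_probs: Sequence[Sequence[float]],    # per frame, one probability per node (sums to 1)
) -> List[Dict[str, float]]:
    tables = []
    for probs in frame_node_probs:
        table: Dict[str, float] = defaultdict(float)
        for phoneme, p in zip(node_phonemes, probs):
            table[phoneme] += p                     # sum nodes sharing the same phoneme
        tables.append(dict(table))                  # per-frame posteriors still sum to ~1
    return tables
```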
In the embodiment of the invention, the following 5 types of features are extracted based on an ASR system:
1) Phoneme duration
The frame-level audio labels of each sentence are obtained with step (1), and the number of frames of each phoneme occurring in the sentence is counted from these labels. The counting rule is as follows:
The frame count of a phoneme is the number of frames in which the phoneme appears consecutively; for example, if the phoneme /a/ appears for 7 consecutive frames, its frame count is 7. If the same phoneme appears in two (or more) places in a sentence, the average of the frame counts is taken as the frame count of that phoneme.
The phoneme duration equals the frame duration (25 milliseconds) multiplied by the frame count of the phoneme.
The average duration of the phonemes in the sentence is computed per phoneme type (vowels and consonants) as the feature value. This feature has three dimensions: vowel phoneme duration, consonant phoneme duration and overall phoneme duration.
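A sketch of this duration computation under the stated 25 ms frame assumption; the handling of silence labels and the exact averaging are assumptions where the description leaves them implicit.

```python
# Sketch: per-phoneme run lengths from the frame-level alignment, averaged over
# repeated occurrences, then averaged within vowel / consonant / all phonemes.
from itertools import groupby
from statistics import mean
from typing import Dict, List, Sequence

FRAME_MS = 25.0  # frame duration stated in the description

def phoneme_duration_features(frame_labels: Sequence[str], consonants: set,
                              silences: set = {"sil", "sp"}) -> List[float]:
    runs: Dict[str, List[int]] = {}
    for phoneme, group in groupby(frame_labels):
        if phoneme in silences:
            continue
        runs.setdefault(phoneme, []).append(sum(1 for _ in group))
    # average run length per phoneme, converted to milliseconds
    per_phoneme = {ph: mean(lengths) * FRAME_MS for ph, lengths in runs.items()}
    vowel_dur = mean([d for ph, d in per_phoneme.items() if ph not in consonants] or [0.0])
    cons_dur = mean([d for ph, d in per_phoneme.items() if ph in consonants] or [0.0])
    overall = mean(list(per_phoneme.values()) or [0.0])
    return [vowel_dur, cons_dur, overall]   # 3-dim feature (vowel, consonant, overall)
```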
2) Phoneme substitution rate
The main phoneme of a frame is defined as the phoneme with the largest probability in that frame; from the frame phoneme-probability table obtained in step (3), the frame-level main-phoneme sequence of each sentence is computed.
The audio labels obtained in step (1) are compared frame by frame with the main-phoneme sequence (silence frames excluded). If the two phonemes agree, the frame counts as a "match"; if they differ and the phoneme in the main-phoneme sequence is not a silence phoneme, the frame counts as a "substitution".
The phoneme substitution rate is the ratio of the number of "substitution" frames to the sum of the "substitution" and "match" frames, i.e.:
R = N_substitution / (N_substitution + N_match)    (1)
where R denotes the substitution rate, and N_substitution and N_match denote the numbers of "substitution" and "match" frames, respectively.
The substitution rates of consonants, vowels and all phonemes are computed with this formula, forming a three-dimensional phoneme substitution-rate feature consisting of the consonant substitution rate, vowel substitution rate and overall substitution rate.
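An illustrative frame-by-frame computation of the substitution rate R of formula (1); skipping silence frames on either side is an assumption consistent with the description.

```python
# Sketch: R = N_substitution / (N_substitution + N_match), computed frame by frame
# between the forced-alignment labels and the highest-probability ("main") phonemes.
from typing import Sequence

def substitution_rate(aligned: Sequence[str], main_phonemes: Sequence[str],
                      silences: set = {"sil", "sp"}) -> float:
    n_sub = n_match = 0
    for ref, hyp in zip(aligned, main_phonemes):
        if ref in silences or hyp in silences:
            continue                      # silence frames are not counted
        if ref == hyp:
            n_match += 1
        else:
            n_sub += 1
    return n_sub / (n_sub + n_match) if (n_sub + n_match) else 0.0
```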
3) Approximate pronunciation quality (approximate goodness of pronunciation, aGOP). Goodness of pronunciation is a quantity describing pronunciation quality whose computation requires labels of what was actually uttered; in practice, because of the characteristics of dysarthric speech, neither manual annotation nor automatic recognition can provide accurate information about the actual pronunciation. An approximate pronunciation quality, aGOP, is therefore defined here to represent pronunciation quality. It is computed as the probability value of the main phoneme (the phoneme with the largest probability) in the frame phoneme-probability table obtained in step (3):
aGOP = max_{q ∈ Q} p(q | o_t)    (2)
where o_t denotes the input feature vector, Q denotes the set of all phonemes in Chinese, q denotes a phoneme in the frame phoneme-probability table, and p(q | o_t) denotes the posterior probability of phoneme q.
The aGOP values of the different phonemes in the sentence are summed and averaged per phoneme type (vowels and consonants) to obtain the mean aGOP of that type.
This feature has three dimensions: the mean consonant aGOP, the mean vowel aGOP and the mean overall aGOP.
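A sketch of the aGOP feature: per frame, the maximum posterior in the frame phoneme-probability table, averaged over consonant frames, vowel frames and all frames; the exclusion of silence frames is an assumption.

```python
# Sketch: mean aGOP per phoneme type from the per-frame phoneme-probability tables.
from statistics import mean
from typing import Dict, List, Sequence

def agop_features(tables: Sequence[Dict[str, float]], consonants: set,
                  silences: set = {"sil", "sp"}) -> List[float]:
    cons, vows, allv = [], [], []
    for table in tables:
        phoneme, p = max(table.items(), key=lambda kv: kv[1])   # main phoneme and its posterior
        if phoneme in silences:
            continue
        allv.append(p)
        (cons if phoneme in consonants else vows).append(p)
    return [mean(cons or [0.0]), mean(vows or [0.0]), mean(allv or [0.0])]
```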
4) Frame blur rate (BFR)
For each frame, the three phonemes with the highest probabilities are taken from the frame phoneme-probability table. A frame is called a "blurred frame" if it satisfies the following condition: the second-highest probability is greater than or equal to 0.2, or the third-highest probability is greater than or equal to 0.1. Otherwise it is called a "non-blurred frame". Silence frames are not counted. The frame blur rate is calculated as follows:
BFR = N_blurred / (N_blurred + N_non-blurred)    (3)
where N_blurred denotes the number of "blurred frames" and N_non-blurred denotes the number of "non-blurred frames".
This feature has one dimension: the sentence frame blur rate.
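A sketch of the frame blur rate of formula (3), using the 0.2 / 0.1 thresholds on the second- and third-largest posteriors.

```python
# Sketch: a frame is "blurred" when its 2nd-highest posterior >= 0.2 or its
# 3rd-highest posterior >= 0.1; silence frames are skipped.
from typing import Dict, Sequence

def frame_blur_rate(tables: Sequence[Dict[str, float]],
                    silences: set = {"sil", "sp"}) -> float:
    blurred = clear = 0
    for table in tables:
        ranked = sorted(table.items(), key=lambda kv: kv[1], reverse=True)[:3]
        if ranked and ranked[0][0] in silences:
            continue                                   # skip silence frames
        second = ranked[1][1] if len(ranked) > 1 else 0.0
        third = ranked[2][1] if len(ranked) > 2 else 0.0
        if second >= 0.2 or third >= 0.1:
            blurred += 1
        else:
            clear += 1
    return blurred / (blurred + clear) if (blurred + clear) else 0.0
```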
5) Frame phoneme count
The frame phoneme count is defined as the number of phonemes contained in the frame phoneme-probability table obtained in step (3).
The consonant count is the number of consonants in the frame phoneme-probability table, and the vowel count is the number of vowels in the frame phoneme-probability table.
This feature has three dimensions: the consonant count, the vowel count and the overall frame phoneme count.
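A sketch of the frame phoneme count feature; averaging the per-frame counts over the sentence is an assumption, since the description leaves the per-sentence aggregation implicit.

```python
# Sketch: consonant / vowel / total phoneme counts per frame, averaged per sentence.
from statistics import mean
from typing import Dict, List, Sequence

def frame_phoneme_count_features(tables: Sequence[Dict[str, float]], consonants: set,
                                 silences: set = {"sil", "sp"}) -> List[float]:
    n_cons, n_vows, n_all = [], [], []
    for table in tables:
        phonemes = [ph for ph in table if ph not in silences]
        n_all.append(len(phonemes))
        n_cons.append(sum(ph in consonants for ph in phonemes))
        n_vows.append(sum(ph not in consonants for ph in phonemes))
    return [mean(n_cons or [0]), mean(n_vows or [0]), mean(n_all or [0])]
```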
In summary, in the embodiment of the invention, the extracted custom features are 5 groups (13 dimensions in total) extracted from each sentence audio: phoneme duration, phoneme substitution rate, approximate pronunciation quality, frame blur rate and frame phoneme count.
Traditional acoustic features are extracted with the OpenSMILE tool. The eGeMAPS parameter set integrated in OpenSMILE is widely used for acoustic analysis of audio and contains 88 statistical feature parameters. In the embodiment of the invention, OpenSMILE is used to extract these sentence-level features from the dysarthric speech.
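For reference, the 88 eGeMAPS functionals can be extracted with the open-source opensmile Python package (audEERING); the patent only states that the OpenSMILE tool and the eGeMAPS set are used, so this particular API is a tooling assumption rather than the patented implementation.

```python
# Sketch: extracting the 88 eGeMAPS functionals for one sentence recording.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)
egemaps = smile.process_file("sentence_001.wav")        # hypothetical file; 1 row x 88 columns
sentence_acoustic_features = egemaps.to_numpy().ravel() # shape (88,)
```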
For the classification and evaluation part, the degree of speech impairment of the children's speech in the training data set is first quantified. For example, the existing children's recordings are graded by medical evaluation into the four categories "normal", "mild", "moderate" and "severe".
Feature concatenation is then performed for each sentence: the 13-dimensional ASR-based feature and the 88-dimensional acoustic feature are concatenated into 101 dimensions. The 101-dimensional sentence feature values are max-min normalized and used as the input of the multilayer perceptron. The multilayer perceptron comprises three layers: an input layer (e.g. 101 nodes), a hidden layer (e.g. 52 nodes) and an output layer (e.g. 4 nodes). Its output is the impairment-degree category of the current sentence and the corresponding prediction probability. The prediction probability is a 4-dimensional vector; each dimension corresponds to one of the 4 dysarthria degree categories, its value lies between 0 and 1, and the 4 probability values sum to 1.
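A minimal training sketch, assuming scikit-learn: max-min normalization of the 101-dimensional sentence features followed by a multilayer perceptron with one hidden layer of 52 nodes and 4 output classes. The file names and all hyperparameters other than the layer sizes are illustrative assumptions.

```python
# Sketch: normalize sentence features and train a 101-52-4 multilayer perceptron.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

X = np.load("sentence_features.npy")   # (num_sentences, 101), hypothetical file
y = np.load("sentence_labels.npy")     # impairment class 0..3 per sentence, hypothetical file

scaler = MinMaxScaler()                # max-min normalization to [0, 1]
X_norm = scaler.fit_transform(X)

mlp = MLPClassifier(hidden_layer_sizes=(52,), max_iter=500, random_state=0)
mlp.fit(X_norm, y)

probs = mlp.predict_proba(X_norm)      # per-sentence 4-dim prediction probability
```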
Next, an overall assessment is obtained from the prediction probabilities of the individual sentences. For example, the prediction probabilities (4-dimensional vectors) of all sentences of the evaluated individual are summed and averaged dimension by dimension:
P_average = (1/N) · Σ_{i=1}^{N} p_prediction,i    (4)
where N denotes the number of speech sentences of the evaluated child, and P_average and p_prediction,i are 4-dimensional vectors whose dimensions correspond to the quantified impairment degrees, each value being the probability of belonging to that degree. The finally assessed dysarthria category is the impairment degree represented by the dimension of P_average with the largest probability.
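A sketch of this individual-level decision: average the per-sentence 4-dimensional prediction probabilities and take the category with the largest averaged probability.

```python
# Sketch: formula (4) followed by the arg-max decision over the four categories.
import numpy as np

CLASSES = ["normal", "mild", "moderate", "severe"]

def overall_assessment(sentence_probs: np.ndarray) -> str:
    """sentence_probs: array of shape (N, 4), one prediction probability vector per sentence."""
    p_average = sentence_probs.mean(axis=0)          # (1/N) * sum of p_prediction,i
    return CLASSES[int(np.argmax(p_average))]
```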
The automatic assessment of the embodiment of the invention assigns the dysarthric speech one of the predefined, quantified impairment levels. Such a classification provides a more objective and stable result than subjective evaluation. In addition, the ASR-based features are extracted according to the degree of mismatch between an acoustic model trained on normal children's speech and the dysarthric, pathological speech; the whole feature extraction and classification process needs only the model texts used for language training and the speech audio of the person being evaluated, without any manual annotation, so the technical scheme achieves fully automatic assessment.
In summary, the invention uses speech recognition technology to extract speech features of dysarthric speech from the continuous speech of dysarthric patients. The features capture both acoustic and linguistic information, and the feature extraction uses all vowels and consonants contained in the patient's speech, fully reflecting the pathological characteristics of dysarthria. From the extracted features, an automatic assessment is produced by classifier-based classification and overall judgment. Feature extraction covers all phonemes contained in the dysarthric speech and operates on short sentences, i.e. continuous speech. Verification shows that the method provides objective, reliable and stable assessment results, achieves automatic evaluation of dysarthric speech, and saves labor and time.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (8)

1. A system for automatic dysarthria assessment based on speech recognition, characterized by comprising a first feature extraction unit, a second feature extraction unit, a feature concatenation unit, a multilayer perceptron and an evaluation unit, wherein the feature concatenation unit is communicatively connected to the first feature extraction unit, the second feature extraction unit and the multilayer perceptron, and the evaluation unit is communicatively connected to the multilayer perceptron, wherein: the first feature extraction unit extracts traditional sentence-level acoustic features; the second feature extraction unit extracts frame-level audio labels and the frame phoneme-probability correspondence, the frame phoneme-probability correspondence being the set of two-tuples formed by the phonemes contained in a frame and their posterior probabilities; the feature concatenation unit concatenates the features extracted by the first feature extraction unit with those extracted by the second feature extraction unit to obtain concatenated features; the multilayer perceptron outputs an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features; and the evaluation unit obtains an overall assessment from the prediction probabilities of the individual sentences.
2. The system according to claim 1, wherein the second feature extraction unit is configured to extract, for each sentence audio, one or more of phoneme duration, phoneme substitution rate, approximate pronunciation quality, frame blur rate, and frame phoneme count.
3. The system according to claim 1, wherein the multilayer perceptron comprises an input layer, a hidden layer and an output layer, the output layer having 4 nodes corresponding to the four dysarthria categories normal, mild, moderate and severe.
4. The system according to claim 1, wherein the second feature extraction unit is configured to:
input the standard text labels and the actual pronunciation audio into a deep neural network acoustic model and obtain, through forced alignment, frame-level pronunciation labels over the 118 phonemes;
input the actual pronunciation audio into the deep neural network acoustic model to obtain, for each node of its output layer, the corresponding phoneme and the output of the corresponding Gaussian probability density function;
and compute the phonemes contained in each frame and their posterior probabilities, where the outputs of the Gaussian probability density functions belonging to the same phoneme are added to obtain that phoneme's posterior probability, thereby obtaining the frame phoneme-probability correspondence.
5. The system according to claim 1, wherein the second feature extraction unit is configured to extract, for each sentence audio, one or more of vowel phoneme duration, consonant phoneme duration, overall phoneme duration, consonant substitution rate, vowel substitution rate, overall substitution rate, mean consonant approximate pronunciation quality, mean vowel approximate pronunciation quality, mean overall approximate pronunciation quality, sentence frame blur rate, consonant count, vowel count, and frame phoneme count.
6. The system according to claim 1, wherein the feature concatenation unit is configured to apply max-min normalization to the features extracted by the first feature extraction unit and the features extracted by the second feature extraction unit before they are input to the multilayer perceptron.
7. A method for automatic dysarthria assessment based on speech recognition, comprising the following steps:
extracting traditional sentence-level acoustic features;
extracting frame-level audio labels and the frame phoneme-probability correspondence, the frame phoneme-probability correspondence being the set of two-tuples formed by the phonemes contained in a frame and their posterior probabilities;
concatenating the traditional sentence-level acoustic features with the features extracted from the frame phoneme-probability correspondence to obtain concatenated features;
using a multilayer perceptron to output an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features;
and obtaining an overall assessment from the prediction probabilities of the individual sentences.
8. The method for automatic dysarthria assessment based on speech recognition according to claim 7, wherein the overall assessment is expressed as:
P_average = (1/N) · Σ_{i=1}^{N} p_prediction,i
where N denotes the number of evaluated speech sentences, P_average and p_prediction,i are multi-dimensional vectors, each dimension representing one dysarthria degree category, and the value of p_prediction,i in each dimension is the predicted probability of the corresponding degree of dysarthria.
CN201911234291.XA 2019-12-05 2019-12-05 System and method for automatically evaluating dysarthria based on voice recognition Pending CN112927696A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911234291.XA CN112927696A (en) 2019-12-05 2019-12-05 System and method for automatically evaluating dysarthria based on voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911234291.XA CN112927696A (en) 2019-12-05 2019-12-05 System and method for automatically evaluating dysarthria based on voice recognition

Publications (1)

Publication Number Publication Date
CN112927696A true CN112927696A (en) 2021-06-08

Family

ID=76161355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911234291.XA Pending CN112927696A (en) 2019-12-05 2019-12-05 System and method for automatically evaluating dysarthria based on voice recognition

Country Status (1)

Country Link
CN (1) CN112927696A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727903A (en) * 2008-10-29 2010-06-09 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
CN103065626A (en) * 2012-12-20 2013-04-24 中国科学院声学研究所 Automatic grading method and automatic grading equipment for read questions in test of spoken English
CN103705218A (en) * 2013-12-20 2014-04-09 中国科学院深圳先进技术研究院 Dysarthria identifying method, system and device
CN106205603A (en) * 2016-08-29 2016-12-07 北京语言大学 A kind of tone appraisal procedure
CN107578772A (en) * 2017-08-17 2018-01-12 天津快商通信息技术有限责任公司 Merge acoustic feature and the pronunciation evaluating method and system of pronunciation movement feature
CN108597542A (en) * 2018-03-19 2018-09-28 华南理工大学 A kind of dysarthrosis severity method of estimation based on depth audio frequency characteristics
CN109545243A (en) * 2019-01-23 2019-03-29 北京猎户星空科技有限公司 Pronunciation quality evaluating method, device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Hongtao et al., "English read-aloud pronunciation quality assessment model based on acoustic model adaptation and support vector regression", Journal of Guilin University of Electronic Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113671031A (en) * 2021-08-20 2021-11-19 北京房江湖科技有限公司 Wall hollowing detection method and device
WO2023032553A1 (en) * 2021-09-02 2023-03-09 パナソニックホールディングス株式会社 Articulation abnormality detection method, articulation abnormality detection device, and program


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination