CN112927696A - System and method for automatically evaluating dysarthria based on voice recognition

Info

Publication number
CN112927696A
Authority
CN
China
Prior art keywords
frame
feature extraction
extraction unit
phoneme
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911234291.XA
Other languages
Chinese (zh)
Inventor
茹克艳木·肉孜
苏荣锋
王岚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201911234291.XA
Publication of CN112927696A
Legal status: Pending

Classifications

    • A: HUMAN NECESSITIES
        • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
            • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
                • A61B 5/00: Measuring for diagnostic purposes; Identification of persons
                    • A61B 5/48: Other medical applications
                        • A61B 5/4803: Speech analysis specially adapted for diagnostic purposes
                • A61B 2503/00: Evaluating a particular growth phase or type of persons or animals
                    • A61B 2503/06: Children, e.g. for attention deficit diagnosis
    • G: PHYSICS
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00: Speech recognition
                    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
                        • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units
                    • G10L 15/08: Speech classification or search
                        • G10L 15/18: Speech classification or search using natural language modelling
                            • G10L 15/183: Speech classification or search using context dependencies, e.g. language models
                                • G10L 15/19: Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
                                    • G10L 15/197: Probabilistic grammars, e.g. word n-grams
                • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
                    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
                        • G10L 25/30: Speech or voice analysis techniques using neural networks
                    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
                        • G10L 25/51: Speech or voice analysis techniques for comparison or discrimination
                            • G10L 25/66: Speech or voice analysis techniques for extracting parameters related to health condition

Abstract

The invention provides a system and a method for automatically evaluating dysarthria based on speech recognition. The system comprises a first feature extraction unit, a second feature extraction unit, a feature concatenation unit, a multilayer perceptron and an evaluation unit, the feature concatenation unit being communicatively connected to the first feature extraction unit, the second feature extraction unit and the multilayer perceptron. The second feature extraction unit extracts frame-level audio labels and the frame phoneme-probability correspondence; the feature concatenation unit concatenates the features extracted by the first feature extraction unit with those extracted by the second feature extraction unit; the multilayer perceptron outputs an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features; and the evaluation unit uses the per-sentence prediction probabilities to obtain an overall assessment. The invention improves the accuracy and stability of dysarthria assessment.

Description

System and method for automatically evaluating dysarthria based on voice recognition
Technical Field
The invention relates to the technical field of dysarthria assessment, and in particular to a system and method for automatic dysarthria assessment based on speech recognition.
Background
Dysarthria manifests as slurred speech, poor fluency, inaccurate pronunciation, and abnormal volume and rhythm. Physicians usually confirm whether a patient suffers from dysarthria, and its severity, through examination of the vocal organs and speech assessment. For preschool children, these symptoms can be improved or resolved through language training. Because physician resources and time are limited, and because the internet and mobile devices are widely available, dysarthria language training is increasingly delivered through mobile applications (apps). Evaluating the effect of mobile language training provides timely feedback to the user and gives course designers important information for personalizing training programs.
Existing evaluation methods rely mainly on subjective auditory perception; objective analysis has received little attention, and a complete scheme for automatic dysarthria assessment is lacking. One existing dysarthria recognition scheme extracts formants of dysarthric speech to compute acoustic parameters, computes tongue-lip displacement from articulatory movement data, and correlates the two to recognize dysarthria. The eGeMAPS acoustic parameter set integrated in the OpenSMILE tool has also been used to analyze other speech-related disorders, such as aphasic speech assessment, but it has not been applied to the analysis and assessment of dysarthric speech.
In academic research, assessment of dysarthric speech has focused mainly on vowels and a limited set of acoustic features. For example, the correlation with dysarthria of the formant centralization ratio (FCR), the triangular vowel space area (tVSA), and the voice onset time (VOT) extracted from short phrases containing target consonants has been discussed. Because fixed prompted utterances and continuous speech in daily conversation differ in pronunciation quality and duration, the vowel features described in the prior art are not suitable for the continuous-speech portion of a language training session. For consonants, such methods consider only the six consonants b, p, d, t, g and k, and features such as voice onset time are difficult to extract automatically and accurately. In addition, these features do not sufficiently reflect the problems of dysarthric speech; in particular, mispronunciation caused by consonant substitution is not considered.
In summary, the prior art still lacks an effective means of automatic dysarthria assessment. The main problems are that subjective assessment based on auditory perception lacks objectivity, accuracy and stability; automatic assessment of dysarthric speech has not been achieved; and the inputs used by existing assessment methods are restricted to a limited number of isolated phone sounds rather than continuous speech.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a system and method for automatically evaluating dysarthria based on speech recognition, which extract speech features by means of speech recognition and combine them with a deep-learning classifier to assess dysarthria automatically.
According to a first aspect of the present invention, a system for automatic dysarthria assessment based on speech recognition is provided. The system comprises a first feature extraction unit, a second feature extraction unit, a feature concatenation unit, a multilayer perceptron and an evaluation unit, wherein the feature concatenation unit is communicatively connected to the first feature extraction unit, the second feature extraction unit and the multilayer perceptron, and the evaluation unit is communicatively connected to the multilayer perceptron. The first feature extraction unit extracts traditional sentence-level acoustic features. The second feature extraction unit extracts frame-level audio labels and the frame phoneme-probability correspondence, where the frame phoneme-probability correspondence is the set of two-tuples formed by the phonemes contained in a frame and their posterior probabilities. The feature concatenation unit concatenates the features extracted by the first feature extraction unit with those extracted by the second feature extraction unit to obtain concatenated features. The multilayer perceptron outputs an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features. The evaluation unit obtains an overall assessment from the prediction probabilities of the individual sentences.
In one embodiment, the second feature extraction unit is configured to extract, for each sentence audio, one or more of phoneme duration, phoneme substitution rate, approximate pronunciation quality, frame blur rate, and frame phoneme count.
In one embodiment, the multilayer perceptron comprises an input layer, a hidden layer and an output layer, wherein the output layer has 4 nodes corresponding to the four dysarthria categories normal, mild, moderate and severe.
In one embodiment, the second feature extraction unit is configured to:
input the standard text labels and the actual pronunciation audio into a deep neural network acoustic model and obtain, through forced alignment, frame-level pronunciation labels over the 118 phonemes;
input the actual pronunciation audio into the deep neural network acoustic model to obtain, for each node of its output layer, the corresponding phoneme and the output of the corresponding Gaussian probability density function;
and compute the phonemes contained in each frame and their posterior probabilities, where the outputs of the Gaussian probability density functions belonging to the same phoneme are added to obtain that phoneme's posterior probability, thereby obtaining the frame phoneme-probability correspondence.
In one embodiment, the second feature extraction unit is configured to extract, for each sentence audio, one or more of vowel phoneme duration, consonant phoneme duration, overall phoneme duration, consonant substitution rate, vowel substitution rate, overall substitution rate, mean consonant approximate pronunciation quality, mean vowel approximate pronunciation quality, mean overall approximate pronunciation quality, sentence frame blur rate, consonant count, vowel count, and frame phoneme count.
In one embodiment, the feature concatenation unit is configured to apply max-min normalization to the features extracted by the first feature extraction unit and the features extracted by the second feature extraction unit before they are input to the multilayer perceptron.
According to a second aspect of the present invention, a method for automatic dysarthria assessment based on speech recognition is provided. The method comprises the following steps:
extracting traditional sentence-level acoustic features;
extracting frame-level audio labels and the frame phoneme-probability correspondence, where the frame phoneme-probability correspondence is the set of two-tuples formed by the phonemes contained in a frame and their posterior probabilities;
concatenating the traditional sentence-level acoustic features with the features extracted from the frame phoneme-probability correspondence to obtain concatenated features;
using a multilayer perceptron to output an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features;
and obtaining an overall assessment from the prediction probabilities of the individual sentences.
In one embodiment, the overall evaluation result is expressed as:
P_average = (1/N) · Σ_{i=1}^{N} p_prediction,i
where N denotes the number of evaluated speech sentences, P_average and p_prediction,i are multi-dimensional vectors, each dimension representing one dysarthria degree category, and the value of p_prediction,i in each dimension is the predicted probability of the corresponding degree of dysarthria.
Compared with the prior art, the invention has the following advantages: objective analysis based on speech recognition is used to assess dysarthria, so the result is accurate and stable; features are extracted from the pronounced phonemes in continuous speech, so they carry as much information related to dysarthric speech as possible; the extracted feature set combines traditional acoustic features with features derived from automatic speech recognition, representing the problems of dysarthric speech more completely and improving assessment accuracy; and the automatic evaluation process feeds the language-training effect back to the evaluated person in a timely manner, saving labor and time.
Drawings
The invention is illustrated and described, by way of example and not of limitation, in the following drawings, in which:
FIG. 1 is a schematic diagram of a system for automated assessment of dysarthria based on speech recognition according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of a system for automated assessment of dysarthria based on speech recognition according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a DNN acoustic model framework according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
Referring to fig. 1, the system for automatic dysarthria assessment based on speech recognition according to an embodiment of the present invention includes a feature extraction unit 110, a feature extraction unit 120, a feature concatenation unit 130, a multilayer perceptron 140 and an evaluation unit 150. The feature extraction unit 110 extracts traditional acoustic features, the feature extraction unit 120 extracts the custom acoustic features defined herein, the feature concatenation unit 130 concatenates the traditional and custom features and feeds the result to the multilayer perceptron 140, the multilayer perceptron 140 outputs an impairment-degree category and the corresponding prediction probability, and the evaluation unit 150 obtains an overall assessment from the prediction probabilities of individual sentences. The units shown in fig. 1 may be implemented as software modules, processors or hardware logic circuits.
Referring to fig. 1, an automatic dysarthria assessment system based on speech recognition according to an embodiment of the present invention generally comprises two parts. The feature extraction part produces a sentence-level feature description of dysarthric speech: for example, a custom 13-dimensional feature is first extracted using automatic speech recognition (ASR) technology, an 88-dimensional eGeMAPS parameter vector is then extracted with the OpenSMILE tool as the traditional acoustic feature, and the two are concatenated into a new 101-dimensional sentence-level feature (a minimal sketch of this concatenation follows). The classification and evaluation part classifies the sentence-level features with a multilayer perceptron, which outputs an impairment-degree category and a prediction probability for each sentence; the prediction probabilities of all of an individual's sentences are then used to automatically assess that individual's degree of dysarthria.
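The sketch below (Python, not the patented implementation) illustrates the sentence-level feature assembly just described: 13 ASR-derived dimensions concatenated with 88 eGeMAPS dimensions into a 101-dimensional vector.

```python
# Minimal sketch of the 101-dimensional sentence-level feature:
# 13 custom ASR-based dimensions + 88 eGeMAPS dimensions.
import numpy as np

def build_sentence_feature(asr_features: np.ndarray, egemaps_features: np.ndarray) -> np.ndarray:
    """Concatenate the custom ASR-based features (13-dim) with the
    traditional eGeMAPS acoustic features (88-dim) into one 101-dim vector."""
    assert asr_features.shape == (13,), "expected 13 ASR-based dimensions"
    assert egemaps_features.shape == (88,), "expected 88 eGeMAPS functionals"
    return np.concatenate([asr_features, egemaps_features])  # shape (101,)
```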
More specifically, see the automated evaluation framework shown in fig. 2. The ASR system is trained as follows. For normal children's speech data, acoustic features are extracted for each training sample, comprising 12-dimensional static PLP (perceptual linear prediction) features and a 1-dimensional static pitch feature together with their first- and second-order differences, 39 dimensions in total. A GMM-HMM baseline model containing 3652 tied triphone states is then trained with maximum likelihood estimation. The GMM-HMM baseline model is next used for forced alignment to obtain the tied triphone state of each frame as the supervision label for subsequent network training. Finally, a deep neural network (DNN) acoustic model is trained on these supervision labels with the backpropagation (BP) algorithm. In the assessment phase, a dysarthric child first listens carefully to a given sentence and then reads the sentence aloud 3 times. Each sentence is 4-7 words long, and the sentences jointly cover all 118 phonemes, see table 1. Each sentence has its standard pronunciation and standard text labels, as well as the actual pronunciation audio of the dysarthric child.
Table 1: the 118 phonemes used for Chinese speech recognition
a1 a2 a3 a4 a5 ai1 ai2 ai3 ai4 ai5 ao1 ao2
ao3 ao4 ao5 b c ch d e1 e2 e3 e4 e5
ei1 ei2 ei3 ei4 er2 er3 er4 f g h i1 i2
i3 i4 i5 ia1 ia2 ia3 ia4 ia5 iao1 iao2 iao3 iao4
ie1 ie2 ie3 ie4 ie5 iu1 iu2 iu3 iu4 j k l
m n ng o1 o2 o3 o4 o5 ou1 ou2 ou3 ou4
ou5 p q r s sh sil t u1 u2 u3 u4
u5 ua1 ua2 ua3 ua4 ua5 uai1 uai2 uai3 uai4 ue1 ue2
ue3 ue4 ui1 ui2 ui3 ui4 uo1 uo2 uo3 uo4 uo5 v1
v2 v3 v4 v5 w x y z zh sp
Finally, the following steps are performed with the ASR system to obtain the information used for acoustic feature extraction:
Step (1): the standard text labels and the actual pronunciation audio are input into the DNN acoustic model, and frame-level pronunciation labels over the 118 phonemes are obtained through forced alignment;
Step (2): the actual pronunciation audio is input into the DNN acoustic model, and preliminary recognition information for each frame is obtained from its output layer; this information includes the phoneme corresponding to each node of the last layer and the output of the corresponding Gaussian probability density function, see fig. 3 (o_t denotes the input feature vector);
Step (3): the phonemes contained in each frame and their posterior probabilities are computed from the information of step (2), as shown in step (3) of fig. 3; that is, the outputs of the Gaussian probability density functions belonging to the same phoneme are added to obtain that phoneme's posterior probability. The frame phoneme-probability table is defined as the set of two-tuples formed by the phonemes contained in one frame and their posterior probabilities.
It should be noted that the basic input unit in fig. 3 is one frame, and the outputs of the output-layer Gaussian probability density functions sum to 1. Accordingly, the multiple phonemes obtained in step (3) are contained in one frame, i.e. the posterior probabilities of all phonemes in a frame sum to 1.
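As an illustration of step (3), the sketch below collapses per-node output probabilities into a frame phoneme-probability table by summing the probabilities of nodes that share a phoneme. The input names (node_phonemes, frame_node_probs) are illustrative assumptions, not terms from the patent.

```python
# Sketch: build the frame phoneme-probability table from output-layer nodes.
from collections import defaultdict
from typing import Dict, List, Sequence

def frame_phoneme_probability_table(
    node_phonemes: Sequence[str],                   # phoneme attached to each output node
    frame_node_probs: Sequence[Sequence[float]],    # per frame, one probability per node (sums to 1)
) -> List[Dict[str, float]]:
    tables = []
    for probs in frame_node_probs:
        table: Dict[str, float] = defaultdict(float)
        for phoneme, p in zip(node_phonemes, probs):
            table[phoneme] += p                     # sum nodes sharing the same phoneme
        tables.append(dict(table))                  # per-frame posteriors still sum to ~1
    return tables
```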
In the embodiment of the invention, the following 5 types of features are extracted based on an ASR system:
1) Phoneme duration
The frame-level audio labels of each sentence are obtained with step (1), and the number of frames of each phoneme occurring in the sentence is counted from these labels. The counting rule is as follows:
The frame count of a phoneme is the number of frames in which the phoneme appears consecutively; for example, if the phoneme /a/ appears for 7 consecutive frames, its frame count is 7. If the same phoneme appears in two (or more) places in a sentence, the average of the frame counts is taken as the frame count of that phoneme.
The phoneme duration equals the frame duration (25 milliseconds) multiplied by the frame count of the phoneme.
The average duration of the phonemes in the sentence is computed per phoneme type (vowels and consonants) as the feature value. This feature has three dimensions: vowel phoneme duration, consonant phoneme duration and overall phoneme duration.
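A sketch of this duration computation under the stated 25 ms frame assumption; the handling of silence labels and the exact averaging are assumptions where the description leaves them implicit.

```python
# Sketch: per-phoneme run lengths from the frame-level alignment, averaged over
# repeated occurrences, then averaged within vowel / consonant / all phonemes.
from itertools import groupby
from statistics import mean
from typing import Dict, List, Sequence

FRAME_MS = 25.0  # frame duration stated in the description

def phoneme_duration_features(frame_labels: Sequence[str], consonants: set,
                              silences: set = {"sil", "sp"}) -> List[float]:
    runs: Dict[str, List[int]] = {}
    for phoneme, group in groupby(frame_labels):
        if phoneme in silences:
            continue
        runs.setdefault(phoneme, []).append(sum(1 for _ in group))
    # average run length per phoneme, converted to milliseconds
    per_phoneme = {ph: mean(lengths) * FRAME_MS for ph, lengths in runs.items()}
    vowel_dur = mean([d for ph, d in per_phoneme.items() if ph not in consonants] or [0.0])
    cons_dur = mean([d for ph, d in per_phoneme.items() if ph in consonants] or [0.0])
    overall = mean(list(per_phoneme.values()) or [0.0])
    return [vowel_dur, cons_dur, overall]   # 3-dim feature (vowel, consonant, overall)
```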
2) Phoneme substitution rate
The main phoneme of a frame is defined as the phoneme with the largest probability in that frame; from the frame phoneme-probability table obtained in step (3), the frame-level main-phoneme sequence of each sentence is computed.
The audio labels obtained in step (1) are compared frame by frame with the main-phoneme sequence (silence frames excluded). If the two phonemes agree, the frame counts as a "match"; if they differ and the phoneme in the main-phoneme sequence is not a silence phoneme, the frame counts as a "substitution".
The phoneme substitution rate is the ratio of the number of "substitution" frames to the sum of the "substitution" and "match" frames, i.e.:
R = N_substitution / (N_substitution + N_match)    (1)
where R denotes the substitution rate, and N_substitution and N_match denote the numbers of "substitution" and "match" frames, respectively.
The substitution rates of consonants, vowels and all phonemes are computed with this formula, forming a three-dimensional phoneme substitution-rate feature consisting of the consonant substitution rate, vowel substitution rate and overall substitution rate.
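An illustrative frame-by-frame computation of the substitution rate R of formula (1); skipping silence frames on either side is an assumption consistent with the description.

```python
# Sketch: R = N_substitution / (N_substitution + N_match), computed frame by frame
# between the forced-alignment labels and the highest-probability ("main") phonemes.
from typing import Sequence

def substitution_rate(aligned: Sequence[str], main_phonemes: Sequence[str],
                      silences: set = {"sil", "sp"}) -> float:
    n_sub = n_match = 0
    for ref, hyp in zip(aligned, main_phonemes):
        if ref in silences or hyp in silences:
            continue                      # silence frames are not counted
        if ref == hyp:
            n_match += 1
        else:
            n_sub += 1
    return n_sub / (n_sub + n_match) if (n_sub + n_match) else 0.0
```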
3) Approximate pronunciation quality (approximate goodness of pronunciation, aGOP). Goodness of pronunciation is a quantity describing pronunciation quality whose computation requires labels of what was actually uttered; in practice, because of the characteristics of dysarthric speech, neither manual annotation nor automatic recognition can provide accurate information about the actual pronunciation. An approximate pronunciation quality, aGOP, is therefore defined here to represent pronunciation quality. It is computed as the probability value of the main phoneme (the phoneme with the largest probability) in the frame phoneme-probability table obtained in step (3):
aGOP = max_{q ∈ Q} p(q | o_t)    (2)
where o_t denotes the input feature vector, Q denotes the set of all phonemes in Chinese, q denotes a phoneme in the frame phoneme-probability table, and p(q | o_t) denotes the posterior probability of phoneme q.
The aGOP values of the different phonemes in the sentence are summed and averaged per phoneme type (vowels and consonants) to obtain the mean aGOP of that type.
This feature has three dimensions: the mean consonant aGOP, the mean vowel aGOP and the mean overall aGOP.
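A sketch of the aGOP feature: per frame, the maximum posterior in the frame phoneme-probability table, averaged over consonant frames, vowel frames and all frames; the exclusion of silence frames is an assumption.

```python
# Sketch: mean aGOP per phoneme type from the per-frame phoneme-probability tables.
from statistics import mean
from typing import Dict, List, Sequence

def agop_features(tables: Sequence[Dict[str, float]], consonants: set,
                  silences: set = {"sil", "sp"}) -> List[float]:
    cons, vows, allv = [], [], []
    for table in tables:
        phoneme, p = max(table.items(), key=lambda kv: kv[1])   # main phoneme and its posterior
        if phoneme in silences:
            continue
        allv.append(p)
        (cons if phoneme in consonants else vows).append(p)
    return [mean(cons or [0.0]), mean(vows or [0.0]), mean(allv or [0.0])]
```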
4) Frame blur rate (BFR)
For each frame, the three phonemes with the highest probabilities are taken from the frame phoneme-probability table. A frame is called a "blurred frame" if it satisfies the following condition: the second-highest probability is greater than or equal to 0.2, or the third-highest probability is greater than or equal to 0.1. Otherwise it is called a "non-blurred frame". Silence frames are not counted. The frame blur rate is calculated as follows:
BFR = N_blurred / (N_blurred + N_non-blurred)    (3)
where N_blurred denotes the number of "blurred frames" and N_non-blurred denotes the number of "non-blurred frames".
This feature has one dimension: the sentence frame blur rate.
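A sketch of the frame blur rate of formula (3), using the 0.2 / 0.1 thresholds on the second- and third-largest posteriors.

```python
# Sketch: a frame is "blurred" when its 2nd-highest posterior >= 0.2 or its
# 3rd-highest posterior >= 0.1; silence frames are skipped.
from typing import Dict, Sequence

def frame_blur_rate(tables: Sequence[Dict[str, float]],
                    silences: set = {"sil", "sp"}) -> float:
    blurred = clear = 0
    for table in tables:
        ranked = sorted(table.items(), key=lambda kv: kv[1], reverse=True)[:3]
        if ranked and ranked[0][0] in silences:
            continue                                   # skip silence frames
        second = ranked[1][1] if len(ranked) > 1 else 0.0
        third = ranked[2][1] if len(ranked) > 2 else 0.0
        if second >= 0.2 or third >= 0.1:
            blurred += 1
        else:
            clear += 1
    return blurred / (blurred + clear) if (blurred + clear) else 0.0
```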
5) Frame phoneme count
The frame phoneme count is defined as the number of phonemes contained in the frame phoneme-probability table obtained in step (3).
The consonant count is the number of consonants in the frame phoneme-probability table, and the vowel count is the number of vowels in the frame phoneme-probability table.
This feature has three dimensions: the consonant count, the vowel count and the overall frame phoneme count.
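A sketch of the frame phoneme count feature; averaging the per-frame counts over the sentence is an assumption, since the description leaves the per-sentence aggregation implicit.

```python
# Sketch: consonant / vowel / total phoneme counts per frame, averaged per sentence.
from statistics import mean
from typing import Dict, List, Sequence

def frame_phoneme_count_features(tables: Sequence[Dict[str, float]], consonants: set,
                                 silences: set = {"sil", "sp"}) -> List[float]:
    n_cons, n_vows, n_all = [], [], []
    for table in tables:
        phonemes = [ph for ph in table if ph not in silences]
        n_all.append(len(phonemes))
        n_cons.append(sum(ph in consonants for ph in phonemes))
        n_vows.append(sum(ph not in consonants for ph in phonemes))
    return [mean(n_cons or [0]), mean(n_vows or [0]), mean(n_all or [0])]
```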
In summary, in the embodiment of the invention, the extracted custom features are 5 groups (13 dimensions in total) extracted from each sentence audio: phoneme duration, phoneme substitution rate, approximate pronunciation quality, frame blur rate and frame phoneme count.
Traditional acoustic features are extracted with the OpenSMILE tool. The eGeMAPS parameter set integrated in OpenSMILE is widely used for acoustic analysis of audio and contains 88 statistical feature parameters. In the embodiment of the invention, OpenSMILE is used to extract these sentence-level features from the dysarthric speech.
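For reference, the 88 eGeMAPS functionals can be extracted with the open-source opensmile Python package (audEERING); the patent only states that the OpenSMILE tool and the eGeMAPS set are used, so this particular API is a tooling assumption rather than the patented implementation.

```python
# Sketch: extracting the 88 eGeMAPS functionals for one sentence recording.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)
egemaps = smile.process_file("sentence_001.wav")        # hypothetical file; 1 row x 88 columns
sentence_acoustic_features = egemaps.to_numpy().ravel() # shape (88,)
```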
For the classification and evaluation part, the degree of speech impairment of the children's speech in the training data set is first quantified. For example, the existing children's recordings are graded by medical evaluation into the four categories "normal", "mild", "moderate" and "severe".
Feature concatenation is then performed for each sentence: the 13-dimensional ASR-based feature and the 88-dimensional acoustic feature are concatenated into 101 dimensions. The 101-dimensional sentence feature values are max-min normalized and used as the input of the multilayer perceptron. The multilayer perceptron comprises three layers: an input layer (e.g. 101 nodes), a hidden layer (e.g. 52 nodes) and an output layer (e.g. 4 nodes). Its output is the impairment-degree category of the current sentence and the corresponding prediction probability. The prediction probability is a 4-dimensional vector; each dimension corresponds to one of the 4 dysarthria degree categories, its value lies between 0 and 1, and the 4 probability values sum to 1.
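A minimal training sketch, assuming scikit-learn: max-min normalization of the 101-dimensional sentence features followed by a multilayer perceptron with one hidden layer of 52 nodes and 4 output classes. The file names and all hyperparameters other than the layer sizes are illustrative assumptions.

```python
# Sketch: normalize sentence features and train a 101-52-4 multilayer perceptron.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

X = np.load("sentence_features.npy")   # (num_sentences, 101), hypothetical file
y = np.load("sentence_labels.npy")     # impairment class 0..3 per sentence, hypothetical file

scaler = MinMaxScaler()                # max-min normalization to [0, 1]
X_norm = scaler.fit_transform(X)

mlp = MLPClassifier(hidden_layer_sizes=(52,), max_iter=500, random_state=0)
mlp.fit(X_norm, y)

probs = mlp.predict_proba(X_norm)      # per-sentence 4-dim prediction probability
```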
Next, an overall assessment is obtained from the prediction probabilities of the individual sentences. For example, the prediction probabilities (4-dimensional vectors) of all sentences of the evaluated individual are summed and averaged dimension by dimension:
P_average = (1/N) · Σ_{i=1}^{N} p_prediction,i    (4)
where N denotes the number of speech sentences of the evaluated child, and P_average and p_prediction,i are 4-dimensional vectors whose dimensions correspond to the quantified impairment degrees, each value being the probability of belonging to that degree. The finally assessed dysarthria category is the impairment degree represented by the dimension of P_average with the largest probability.
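A sketch of this individual-level decision: average the per-sentence 4-dimensional prediction probabilities and take the category with the largest averaged probability.

```python
# Sketch: formula (4) followed by the arg-max decision over the four categories.
import numpy as np

CLASSES = ["normal", "mild", "moderate", "severe"]

def overall_assessment(sentence_probs: np.ndarray) -> str:
    """sentence_probs: array of shape (N, 4), one prediction probability vector per sentence."""
    p_average = sentence_probs.mean(axis=0)          # (1/N) * sum of p_prediction,i
    return CLASSES[int(np.argmax(p_average))]
```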
The automatic assessment of the embodiment of the invention assigns the dysarthric speech one of the predefined, quantified impairment levels. Such a classification provides a more objective and stable result than subjective evaluation. In addition, the ASR-based features are extracted according to the degree of mismatch between an acoustic model trained on normal children's speech and the dysarthric, pathological speech; the whole feature extraction and classification process needs only the model texts used for language training and the speech audio of the person being evaluated, without any manual annotation, so the technical scheme achieves fully automatic assessment.
In summary, the invention uses speech recognition technology to extract speech features of dysarthric speech from the continuous speech of dysarthric patients. The features capture both acoustic and linguistic information, and the feature extraction uses all vowels and consonants contained in the patient's speech, fully reflecting the pathological characteristics of dysarthria. From the extracted features, an automatic assessment is produced by classifier-based classification and overall judgment. Feature extraction covers all phonemes contained in the dysarthric speech and operates on short sentences, i.e. continuous speech. Verification shows that the method provides objective, reliable and stable assessment results, achieves automatic evaluation of dysarthric speech, and saves labor and time.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (8)

1. A system for automatic dysarthria assessment based on speech recognition, characterized by comprising a first feature extraction unit, a second feature extraction unit, a feature concatenation unit, a multilayer perceptron and an evaluation unit, wherein the feature concatenation unit is communicatively connected to the first feature extraction unit, the second feature extraction unit and the multilayer perceptron, and the evaluation unit is communicatively connected to the multilayer perceptron, wherein: the first feature extraction unit extracts traditional sentence-level acoustic features; the second feature extraction unit extracts frame-level audio labels and the frame phoneme-probability correspondence, the frame phoneme-probability correspondence being the set of two-tuples formed by the phonemes contained in a frame and their posterior probabilities; the feature concatenation unit concatenates the features extracted by the first feature extraction unit with those extracted by the second feature extraction unit to obtain concatenated features; the multilayer perceptron outputs an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features; and the evaluation unit obtains an overall assessment from the prediction probabilities of the individual sentences.
2. The system according to claim 1, wherein the second feature extraction unit is configured to extract, for each sentence audio, one or more of phoneme duration, phoneme substitution rate, approximate pronunciation quality, frame blur rate, and frame phoneme count.
3. The system according to claim 1, wherein the multilayer perceptron comprises an input layer, a hidden layer and an output layer, the output layer having 4 nodes corresponding to the four dysarthria categories normal, mild, moderate and severe.
4. The system according to claim 1, wherein the second feature extraction unit is configured to:
input the standard text labels and the actual pronunciation audio into a deep neural network acoustic model and obtain, through forced alignment, frame-level pronunciation labels over the 118 phonemes;
input the actual pronunciation audio into the deep neural network acoustic model to obtain, for each node of its output layer, the corresponding phoneme and the output of the corresponding Gaussian probability density function;
and compute the phonemes contained in each frame and their posterior probabilities, where the outputs of the Gaussian probability density functions belonging to the same phoneme are added to obtain that phoneme's posterior probability, thereby obtaining the frame phoneme-probability correspondence.
5. The system according to claim 1, wherein the second feature extraction unit is configured to extract, for each sentence audio, one or more of vowel phoneme duration, consonant phoneme duration, overall phoneme duration, consonant substitution rate, vowel substitution rate, overall substitution rate, mean consonant approximate pronunciation quality, mean vowel approximate pronunciation quality, mean overall approximate pronunciation quality, sentence frame blur rate, consonant count, vowel count, and frame phoneme count.
6. The system according to claim 1, wherein the feature concatenation unit is configured to apply max-min normalization to the features extracted by the first feature extraction unit and the features extracted by the second feature extraction unit before they are input to the multilayer perceptron.
7. A method for automatic dysarthria assessment based on speech recognition, comprising the following steps:
extracting traditional sentence-level acoustic features;
extracting frame-level audio labels and the frame phoneme-probability correspondence, the frame phoneme-probability correspondence being the set of two-tuples formed by the phonemes contained in a frame and their posterior probabilities;
concatenating the traditional sentence-level acoustic features with the features extracted from the frame phoneme-probability correspondence to obtain concatenated features;
using a multilayer perceptron to output an impairment-degree category and a corresponding prediction probability for each individual sentence based on the concatenated features;
and obtaining an overall assessment from the prediction probabilities of the individual sentences.
8. The method for automatic dysarthria assessment based on speech recognition according to claim 7, wherein the overall assessment is expressed as:
P_average = (1/N) · Σ_{i=1}^{N} p_prediction,i
where N denotes the number of evaluated speech sentences, P_average and p_prediction,i are multi-dimensional vectors, each dimension representing one dysarthria degree category, and the value of p_prediction,i in each dimension is the predicted probability of the corresponding degree of dysarthria.
CN201911234291.XA 2019-12-05 2019-12-05 System and method for automatically evaluating dysarthria based on voice recognition Pending CN112927696A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911234291.XA CN112927696A (en) 2019-12-05 2019-12-05 System and method for automatically evaluating dysarthria based on voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911234291.XA CN112927696A (en) 2019-12-05 2019-12-05 System and method for automatically evaluating dysarthria based on voice recognition

Publications (1)

Publication Number Publication Date
CN112927696A true CN112927696A (en) 2021-06-08

Family

ID=76161355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911234291.XA Pending CN112927696A (en) 2019-12-05 2019-12-05 System and method for automatically evaluating dysarthria based on voice recognition

Country Status (1)

Country Link
CN (1) CN112927696A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727903A (en) * 2008-10-29 2010-06-09 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
CN103065626A (en) * 2012-12-20 2013-04-24 中国科学院声学研究所 Automatic grading method and automatic grading equipment for read questions in test of spoken English
CN103705218A (en) * 2013-12-20 2014-04-09 中国科学院深圳先进技术研究院 Dysarthria identifying method, system and device
CN106205603A (en) * 2016-08-29 2016-12-07 北京语言大学 A kind of tone appraisal procedure
CN107578772A (en) * 2017-08-17 2018-01-12 天津快商通信息技术有限责任公司 Merge acoustic feature and the pronunciation evaluating method and system of pronunciation movement feature
CN108597542A (en) * 2018-03-19 2018-09-28 华南理工大学 A kind of dysarthrosis severity method of estimation based on depth audio frequency characteristics
CN109545243A (en) * 2019-01-23 2019-03-29 北京猎户星空科技有限公司 Pronunciation quality evaluating method, device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Hongtao et al., "English read-aloud pronunciation quality assessment model based on acoustic model adaptation and support vector regression", Journal of Guilin University of Electronic Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113671031A (en) * 2021-08-20 2021-11-19 北京房江湖科技有限公司 Wall hollowing detection method and device
WO2023032553A1 (en) * 2021-09-02 2023-03-09 パナソニックホールディングス株式会社 Articulation abnormality detection method, articulation abnormality detection device, and program


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination