WO2019034184A1 - Pronunciation evaluation method and system fusing acoustic features and articulatory movement features - Google Patents
Pronunciation evaluation method and system fusing acoustic features and articulatory movement features
- Publication number
- WO2019034184A1 (PCT application PCT/CN2018/105942)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- pronunciation
- motion
- fusion
- intelligibility
- Prior art date
Classifications
- A61B5/4803: Speech analysis specially adapted for diagnostic purposes
- A61B5/1113: Local tracking of patients, e.g. in a hospital or private home
- A61B5/1114: Tracking parts of the body
- A61B5/1121: Determining geometric values, e.g. centre of rotation or angular range of movement
- A61B5/6814: Detecting, measuring or recording means specially adapted to be attached to the head
- A61B5/682: Detecting, measuring or recording means specially adapted to be attached to the mouth, e.g. oral cavity, tongue, lips, teeth
- A61B5/7264: Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
- A61B5/7267: Classification of physiological signals or data, involving training the classification device
- G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
- G10L25/60: Speech or voice analysis techniques specially adapted for measuring the quality of voice signals
- G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/24: Speech or voice analysis techniques where the extracted parameters are the cepstrum
- A61B2562/0204: Acoustic sensors
- A61B2562/0219: Inertial sensors, e.g. accelerometers, gyroscopes, tilt switches
Definitions
- The invention relates to the field of pronunciation evaluation technology, and in particular to a pronunciation evaluation method that fuses acoustic features and pronunciation motion features, and to a system using the same.
- The perception and generation of speech are the result of multiple auditory and vocal organs working together within a short period of time.
- Some people suffer brain or nerve damage, whether congenital or traumatic, that prevents them from controlling the specific muscles needed to produce correct speech; the resulting disorder, characterized by impaired articulation, phonation, resonance, and prosody, is dysarthria.
- Speech intelligibility is the degree to which a listener can accurately recover the information in the speaker's speech.
- The severity of dysarthria is therefore often assessed through speech intelligibility: the more severe the disorder, the lower the intelligibility.
- Research on dysarthria has gradually increased, but most studies analyze intelligibility from acoustic parameters alone, ignoring that abnormal vocal-organ movement is the source of the abnormal sound; this makes the evaluation insufficiently comprehensive and its results unreliable. It is therefore especially important to develop reliable, objective, and accurate evaluation criteria that do not depend on subjective judgment.
- The present invention provides a pronunciation evaluation method and system that fuse acoustic features and pronunciation motion features: audio data and corresponding pronunciation motion data are collected, acoustic features and pronunciation motion features are extracted from them respectively, and the two kinds of features are fused to obtain a more accurate and reliable fusion evaluation result, making the pronunciation evaluation more objective and accurate.
- A pronunciation evaluation method fusing acoustic features and pronunciation motion features comprises the following steps: (10) collecting audio data and pronunciation motion data, extracting acoustic features from the audio data and pronunciation motion features from the pronunciation motion data, the audio data and the pronunciation motion data corresponding in time; (20) fusing the acoustic features and the pronunciation motion features according to the time correspondence to obtain fusion features; (30) training a fusion-feature intelligibility discrimination model from the fusion features; and (40) using the fusion-feature intelligibility discrimination model to obtain the feature fusion evaluation result.
- Further, an acoustic-feature intelligibility discrimination model and a pronunciation-motion-feature intelligibility discrimination model are trained separately from the acoustic features and the pronunciation motion features, and the evaluation result of the acoustic-feature intelligibility discrimination model is fused with the evaluation result of the pronunciation-motion-feature intelligibility discrimination model at the decision level to obtain the strategy fusion evaluation result.
- In step (10), the audio data and the pronunciation motion data are collected using an electromagnetic articulography (EMA) tracing system: spatial sensors are placed on the vocal organs, and the three-dimensional spatial coordinates and angles of each sensor in the magnetic field are calculated to obtain the pronunciation motion data, while the audio data corresponding in time is collected simultaneously. The vocal organs include the lips, and the pronunciation motion data includes lip motion data.
- A spatial sensor is further placed on the bridge of the nose. In step (10), the pronunciation motion features are extracted from the pronunciation motion data by taking the nose-bridge sensor as the coordinate origin and calculating the relative distance of each lip sensor from that origin. The three-dimensional coordinate distances x, y, z of the four lip sensors are used as motion features; each sampling point is treated as one frame, and for each frame the pronunciation motion feature is extracted according to the following formula: lip = [x1 ... x4, y1 ... y4, z1 ... z4]^T,
- where the subscripts of x, y, and z denote the upper-lip motion data, the lower-lip motion data, the left-mouth-corner motion data, and the right-mouth-corner motion data, respectively.
- The feature fusion in step (20) sets the window lengths of the acoustic features and the pronunciation motion features according to the sampling rates of the audio data and the pronunciation motion data, sets the window shift according to those window lengths, and fuses the acoustic features and the pronunciation motion features frame by frame with that window shift.
- The strategy fusion sets different weights for the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the pronunciation-motion-feature intelligibility discrimination model, and computes the strategy fusion evaluation result from the weighted scores as follows:
- LL = argmax_k [ w * LL_A(k) + (1 - w) * LL_M(k) ]
- where LL represents the strategy fusion evaluation result, LL_A(k) and LL_M(k) represent the evaluation scores of the acoustic-feature and pronunciation-motion-feature intelligibility discrimination models for class k, k represents the class of the evaluation result, w represents the weight, and the argmax function selects the class with the largest score.
- The present invention also provides a pronunciation evaluation system that combines acoustic features and pronunciation motion features, including:
- a feature extraction module, configured to collect audio data and pronunciation motion data, extract acoustic features from the audio data, and extract pronunciation motion features from the pronunciation motion data, wherein the audio data and the pronunciation motion data correspond in time;
- a feature fusion module, which fuses the acoustic features and the pronunciation motion features according to the time correspondence to obtain fusion features;
- a model training module, which trains a fusion-feature intelligibility discrimination model from the fusion features;
- a pronunciation evaluation module, which uses the fusion-feature intelligibility discrimination model to obtain the feature fusion evaluation result.
- The system further includes a strategy fusion module.
- The model training module further trains an acoustic-feature intelligibility discrimination model and a pronunciation-motion-feature intelligibility discrimination model from the acoustic features and the pronunciation motion features, respectively.
- The strategy fusion module performs strategy (decision) fusion on the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the pronunciation-motion-feature intelligibility discrimination model to obtain the strategy fusion evaluation result.
- The system further comprises a data acquisition module, which uses the electromagnetic articulography system to collect the audio data and the pronunciation motion data: spatial sensors are placed on the vocal organs, the three-dimensional spatial coordinates and angles of each sensor in the magnetic field are calculated to obtain the pronunciation motion data, and the time-corresponding audio data is collected while the pronunciation motion data is collected.
- The vocal organs comprise one or more of the following: tongue, lips, mouth corners, and incisors. The tongue sensors are placed at the tongue tip, the tongue middle, and the tongue back; the lip sensors are placed at the middle of the upper lip and the middle of the lower lip; the mouth-corner sensors are placed at the left and right mouth corners; and the incisor sensor is placed on the lower incisors and is used to track jaw movement.
- The method further includes placing spatial sensors at head positions to detect head motion data and correcting the pronunciation motion data according to the head motion data. The head positions include one or more of the following: the forehead, the bridge of the nose, and behind the ears; the behind-the-ear sensors are placed on the mastoid bones behind the ears.
- The model training module trains by inputting the acoustic features, the pronunciation motion features, or the fusion features into a Gaussian mixture model-hidden Markov model (GMM-HMM), obtaining the corresponding acoustic-feature intelligibility discrimination model, pronunciation-motion-feature intelligibility discrimination model, and fusion-feature intelligibility discrimination model.
- The present invention collects audio data and corresponding pronunciation motion data, extracts acoustic features and pronunciation motion features from them, fuses the two kinds of features, and trains a model on the fusion features, yielding a more accurate and reliable feature fusion evaluation result and making the pronunciation evaluation more objective and accurate.
- The present invention further trains an acoustic-feature intelligibility discrimination model and a pronunciation-motion-feature intelligibility discrimination model from the acoustic features and the pronunciation motion features, respectively, fuses the evaluation results of the two models at the decision level to obtain a strategy fusion evaluation result, and cross-checks the strategy fusion evaluation result against the feature fusion evaluation result, making the pronunciation evaluation result more objective and accurate.
- The present invention not only detects the pronunciation motion data of the vocal organs but also places spatial sensors at head positions to detect head motion data and corrects the pronunciation motion data according to the head motion data, making the data more accurate and reliable.
- FIG. 1 is a schematic flow chart of the pronunciation evaluation method fusing acoustic features and pronunciation motion features according to the present invention;
- FIG. 2 is a schematic structural view of the pronunciation evaluation system fusing acoustic features and pronunciation motion features according to the present invention;
- Figure 3 is one of the schematic diagrams of the spatial sensor distribution
- Figure 4 is a second schematic diagram of the spatial sensor distribution.
- The pronunciation evaluation method fusing acoustic features and pronunciation motion features of the present invention comprises the following steps:
- The audio data and the pronunciation motion data are collected using an electromagnetic articulography (EMA) tracing system.
- In this embodiment, the AG500 three-dimensional electromagnetic articulograph is used.
- The audio data corresponding in time is recorded while the pronunciation motion data is collected; the vocal organs include the lips, and the pronunciation motion data includes lip motion data. Because the tongue moves abnormally in patients with dysarthria, tongue sensors tend to fall off during articulation, which makes it difficult to obtain valid tongue movement data; therefore, in the present embodiment, the lip motion data is selected as the main pronunciation motion data.
- The EMA system exploits the alternating current induced in each spatial sensor by the alternating magnetic field to calculate the sensor's three-dimensional spatial coordinates and angles in the field, and thereby collects the motion data; the audio signal is recorded at the same time as the sensor position information.
- Each spatial sensor is attached to the recording device by a thin, lightweight cable that does not interfere with free movement of the head within the EMA cube.
- Windowing is then applied to each frame of the weighted (pre-emphasized) data to obtain windowed data.
- A frame length of 20 ms is used, and the Hanning window is chosen because spectral energy may leak at the frame boundaries; windowing is performed for each frame.
- The acoustic features are obtained using Mel-frequency cepstral coefficients (MFCC) as the characteristic parameters.
- The Mel-frequency cepstral coefficient is based on the frequency-domain characteristics of the human ear: it maps the linear amplitude spectrum to the Mel nonlinear amplitude spectrum based on auditory perception and then converts it to the cepstrum.
- The change information between neighboring frames also helps to distinguish different speech characteristics, so the MFCC is generally augmented with the first-order and second-order differences of each cepstral dimension.
- A 13-dimensional MFCC is employed and, together with its first-order and second-order differences, forms the 39-dimensional acoustic features.
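By way of illustration only (this sketch is not part of the original disclosure), the acoustic feature extraction described above can be written as follows; librosa is an assumed toolkit choice, and the 5 ms hop anticipates the window shift given in the embodiment below.

```python
# Acoustic feature sketch: 16 kHz audio, 20 ms Hanning-windowed frames, 5 ms hop,
# 13 MFCCs plus first- and second-order differences -> 39 dimensions per frame.
import librosa
import numpy as np

def extract_acoustic_features(wav_path, sr=16000, win_ms=20, hop_ms=5):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * win_ms / 1000)        # 320 samples = 20 ms frame
    hop_length = int(sr * hop_ms / 1000)   # 80 samples = 5 ms shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
                                hop_length=hop_length, window="hann")
    d1 = librosa.feature.delta(mfcc, order=1)   # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)   # second-order difference
    return np.vstack([mfcc, d1, d2]).T          # shape: (n_frames, 39)
```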
- The feature fusion in step (20) sets the window lengths of the acoustic features and the pronunciation motion features according to the sampling rates of the audio data and the pronunciation motion data, sets the window shift according to those window lengths, and fuses the two kinds of features with that window shift, so that their complementary strengths can be exploited effectively in modeling.
- In this embodiment, the sampling rate of the audio data is 16,000 Hz and the sampling rate of the pronunciation motion data is 200 Hz. The window length of the acoustic features is set to 20 ms, the window length of the motion features to 5 ms, and the window shift to 5 ms for feature extraction, so that each acoustic frame is paired with exactly one motion frame. The resulting fusion (Acoustic-Articulatory, A-A) feature has 51 dimensions (39 acoustic + 12 motion).
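A minimal sketch of this frame-level alignment follows; it is illustrative rather than the patent's reference implementation, and it assumes the 39-dimensional acoustic frames are produced at a 5 ms hop while the 12-dimensional lip features arrive at 200 Hz (one sample every 5 ms), so frames can be paired one-to-one by index.

```python
import numpy as np

def fuse_features(acoustic, articulatory):
    """acoustic: (n_a, 39) MFCC+deltas; articulatory: (n_m, 12) lip features -> (n, 51)."""
    n = min(len(acoustic), len(articulatory))             # trim to the shorter stream
    return np.hstack([acoustic[:n], articulatory[:n]])    # 39 + 12 = 51 dimensions
```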
- A GMM-HMM model for four-level intelligibility discrimination (normal, mild, moderate, severe) is trained using the fusion features.
- This hidden Markov model has three states, and the number of Gaussian mixtures is 24.
- The model training is performed by inputting the acoustic features, the pronunciation motion features, and the fusion features, respectively, into a Gaussian mixture model-hidden Markov model (GMM-HMM).
- For the intelligibility assessment, intelligibility discrimination models for the different levels are trained with the GMM-HMM approach using the acoustic features and the pronunciation motion features, respectively. Considering the temporal nature of the speech signal, it is modeled with an HMM, and the state emission probability of each HMM state is computed with a GMM.
- The intelligibility level reflects the severity of the disorder: according to the speech pathologist's diagnosis, speakers are divided into mild, moderate, and severe groups, plus a normal control group, four groups in total, and a GMM-HMM model is trained for each group separately. To verify that different features affect intelligibility discrimination differently, GMM-HMM models are trained on the acoustic features and on the pronunciation motion features separately.
- The hidden Markov model is a left-to-right model without skips, with three states.
- The number of Gaussian mixtures is 8, yielding the acoustic-feature intelligibility discrimination model (denoted Acoustic-GMM-HMM) and the pronunciation-motion-feature intelligibility discrimination model (denoted Articulatory-GMM-HMM).
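The per-level training can be sketched as follows; this is illustrative only, hmmlearn is an assumed toolkit (the patent names none), and the upper-triangular transition initialization approximates the left-to-right, no-skip topology described above. The same routine would be called with n_mix=8 for the single-feature models and n_mix=24 for the fusion-feature model.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_level_model(feature_seqs, n_states=3, n_mix=8):
    """feature_seqs: list of (n_frames, dim) arrays from one intelligibility level."""
    X = np.vstack(feature_seqs)
    lengths = [len(seq) for seq in feature_seqs]
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", init_params="mcw")
    # Left-to-right topology without skips: start in state 0, self/forward transitions only.
    model.startprob_ = np.array([1.0, 0.0, 0.0])
    model.transmat_ = np.array([[0.5, 0.5, 0.0],
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])
    model.fit(X, lengths)
    return model

# One model is trained per level (normal, mild, moderate, severe); an utterance is
# scored against all four with model.score(features) and assigned the best-scoring level.
```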
- The fusion-feature intelligibility discrimination model is then used to obtain the feature fusion evaluation result, i.e., to judge the different levels of intelligibility.
- The strategy fusion sets different weights for the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the pronunciation-motion-feature intelligibility discrimination model, and computes the strategy fusion evaluation result from the weighted scores. That is, the acoustic-feature intelligibility discrimination model (Acoustic-GMM-HMM) and the pronunciation-motion-feature intelligibility discrimination model (Articulatory-GMM-HMM) are decision-fused according to the following formula:
- LL = argmax_k [ w * LL_A(k) + (1 - w) * LL_M(k) ]
- where LL represents the strategy fusion evaluation result (i.e., the class with the maximum likelihood score after decision fusion), LL_A(k) represents the evaluation score of the acoustic-feature intelligibility discrimination model for class k, LL_M(k) represents the evaluation score of the pronunciation-motion-feature intelligibility discrimination model for class k, k represents the grade classification of the evaluation result, w represents the weight, and the argmax function selects the class with the largest score. In this embodiment k = 1, 2, 3, 4, representing the normal, mild, moderate, and severe levels respectively; w is the weight of the acoustic-feature intelligibility discrimination model (Acoustic-GMM-HMM) and is set to 0.5, and 1 - w is the weight of the pronunciation-motion-feature intelligibility discrimination model (Articulatory-GMM-HMM).
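A minimal sketch of this decision rule, assuming the per-class scores are the log-likelihoods returned by the four GMM-HMMs of each model family (the function and variable names are illustrative):

```python
import numpy as np

LEVELS = ["normal", "mild", "moderate", "severe"]   # k = 1..4

def decision_fusion(acoustic_scores, articulatory_scores, w=0.5):
    """Each argument holds four log-likelihoods, one per intelligibility level."""
    fused = w * np.asarray(acoustic_scores) + (1.0 - w) * np.asarray(articulatory_scores)
    k = int(np.argmax(fused))
    return LEVELS[k], fused[k]

# Example: decision_fusion([-1200.0, -1150.0, -1100.0, -1300.0],
#                          [-980.0, -1010.0, -940.0, -1020.0]) -> ("moderate", -1020.0)
```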
- The present invention also provides a pronunciation evaluation system that combines acoustic features and pronunciation motion features, including:
- a data acquisition module, which uses the electromagnetic articulography (EMA) tracing system to collect the audio data and the pronunciation motion data, placing spatial sensors on the vocal organs and calculating the three-dimensional spatial coordinates and angles of the sensors in the magnetic field to obtain the pronunciation motion data, while collecting the time-corresponding audio data at the same time;
- a feature extraction module, configured to collect audio data and pronunciation motion data, extract acoustic features from the audio data, and extract pronunciation motion features from the pronunciation motion data, wherein the audio data and the pronunciation motion data correspond in time;
- a feature fusion module, which fuses the acoustic features and the pronunciation motion features according to the time correspondence to obtain fusion features;
- a model training module, which trains a fusion-feature intelligibility discrimination model from the fusion features;
- a pronunciation evaluation module, which uses the fusion-feature intelligibility discrimination model to obtain the feature fusion evaluation result;
- a strategy fusion module, wherein the model training module further trains an acoustic-feature intelligibility discrimination model and a pronunciation-motion-feature intelligibility discrimination model from the acoustic features and the pronunciation motion features, respectively;
- the strategy fusion module fuses the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the pronunciation-motion-feature intelligibility discrimination model at the decision level to obtain the strategy fusion evaluation result.
- The vocal organs include one or more of the following: tongue, lips, mouth corners, and incisors. The tongue sensors are placed at the tongue tip (TT, about 1 cm behind the anatomical tongue tip), the tongue middle (TM, 3 cm behind the tip sensor), and the tongue back (TB, 2 cm behind the tongue-middle sensor); the lip sensors are placed at the middle of the upper lip (UL) and the middle of the lower lip (LL);
- the mouth-corner sensors are placed at the left mouth corner (LM) and the right mouth corner (RM); the incisor sensor is placed on the lower incisors (JA) and is used to track the movement of the lower jaw.
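For reference, these placements can be summarized as a channel map; this is a hypothetical data layout using the abbreviations above, not a format defined by the patent.

```python
# Hypothetical EMA channel map; names follow the abbreviations in the text above.
SENSOR_LAYOUT = {
    "TT": "tongue tip, about 1 cm behind the anatomical tip",
    "TM": "tongue middle, 3 cm behind the TT sensor",
    "TB": "tongue back, 2 cm behind the TM sensor",
    "UL": "middle of the upper lip",
    "LL": "middle of the lower lip",
    "LM": "left mouth corner",
    "RM": "right mouth corner",
    "JA": "lower incisors (jaw tracking)",
    "REF": "bridge of the nose (head reference / coordinate origin)",
}
```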
- The articulatory organs mainly comprise the lips, teeth, tongue, palate, and the like.
- The tongue and lips cooperate closely with the other parts to block the airflow and change the shape of the oral resonator, and they play an important role in pronunciation; therefore, the tongue data are analyzed first.
- However, the tongue sensors tend to fall off during movement, which makes it difficult to obtain valid tongue movement data; therefore, in the present embodiment, the motion data of the lips is selected as the main pronunciation motion data.
- The method further includes placing spatial sensors at head positions to detect head motion data and correcting the pronunciation motion data according to the head motion data.
- The head positions include one or more of the following: the forehead, the bridge of the nose, and behind the ears; the behind-the-ear sensors are placed on the mastoid bones behind the ears and serve as references for recording head movement.
- The pronunciation motion features are extracted from the pronunciation motion data by taking the nose-bridge sensor as the coordinate origin and calculating the relative distance of each lip sensor from that origin. The three-dimensional coordinate distances x, y, z of the four lip sensors are used as motion features; each sampling point is treated as one frame, and for each frame the pronunciation motion feature is extracted according to the following formula: lip = [x1 ... x4, y1 ... y4, z1 ... z4]^T,
- where the subscripts of x, y, and z denote the upper-lip motion data, the lower-lip motion data, the left-mouth-corner motion data, and the right-mouth-corner motion data, respectively.
- The pronunciation motion feature thus has 12 dimensions in total.
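As an illustration only (not part of the original disclosure), the 12-dimensional lip feature can be computed per EMA sample as follows; the dictionary-per-sample input format and the "REF" nose-bridge channel name are assumptions.

```python
import numpy as np

LIP_CHANNELS = ["UL", "LL", "LM", "RM"]   # upper lip, lower lip, left / right mouth corner

def lip_feature(sample):
    """sample: {"UL": (x, y, z), ..., "REF": (x, y, z)} -> 12-dimensional feature vector."""
    ref = np.asarray(sample["REF"], dtype=float)
    rel = np.array([np.asarray(sample[c], dtype=float) - ref for c in LIP_CHANNELS])  # (4, 3)
    # lip = [x1..x4, y1..y4, z1..z4]^T: stack all x coordinates, then all y, then all z
    return rel.T.reshape(-1)   # shape: (12,)
```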
- The model training module trains by inputting the acoustic features, the pronunciation motion features, and the fusion features, respectively, into a Gaussian mixture model-hidden Markov model, obtaining the corresponding acoustic-feature intelligibility discrimination model, pronunciation-motion-feature intelligibility discrimination model, and fusion-feature intelligibility discrimination model.
- The TORGO data set, which contains both audio data and pronunciation motion data, is taken as an example to describe the algorithm flow of the entire system.
- The specific steps are as follows:
- The input of the system covers four levels of intelligibility: severe, moderate, mild, and normal.
- The intelligibility level of each speaker is determined according to the diagnosis of a speech pathologist.
- The numbers of participants in the data set are 3, 2, 2, and 7, respectively, and the numbers of pronunciation samples are 567, 876, 671, and 4289, respectively.
- The EMA device simultaneously acquires audio data and pronunciation motion data; acoustic features, motion features, and fused A-A features are extracted from the two data types with the settings in Table 2.
- The intelligibility discrimination models are trained with the GMM-HMM method.
- The GMM-HMM discrimination model using motion features shows a clear accuracy improvement for speech-impaired speakers, whereas for normal speakers the MFCC-based acoustic features are more accurate.
- Overall, the GMM-HMM using motion features improved by an average of 0.56 percentage points over the GMM-HMM using acoustic features.
- The motion features are therefore very effective for discriminating the intelligibility of speech-impaired speakers.
- The fused A-A features are used to train a GMM-HMM model, and the acoustic-feature GMM-HMM and the motion-feature GMM-HMM are additionally combined through decision fusion.
- Feature fusion and decision fusion combine the complementary advantages of the two types of features and further improve the discrimination performance.
- The present invention not only utilizes audio data but also uses the pronunciation motion data of speech-impaired speakers, judging the intelligibility level of dysarthric speech from the perspective of articulatory movement.
- The focus in processing the pronunciation motion data is to extract feature data that characterize the impaired speakers.
- Because the tongue movement data is unstable and difficult to obtain, this embodiment relies mainly on the lip pronunciation motion data, which effectively distinguishes the intelligibility levels of speech-impaired speakers.
- The invention combines traditional speech acoustic features and pronunciation motion features through feature fusion and decision fusion, effectively exploits the complementarity of the two types of features, and ensures the objectivity and comprehensiveness of the evaluation.
- The results are better than those obtained with the acoustic features alone.
- Even used alone, the pronunciation motion features show a clear advantage in classifying the degree of intelligibility.
Claims (12)
- A pronunciation evaluation method fusing acoustic features and pronunciation motion features, characterized by comprising the following steps: (10) collecting audio data and pronunciation motion data, extracting acoustic features from the audio data, and extracting pronunciation motion features from the pronunciation motion data, wherein the audio data and the pronunciation motion data correspond in time; (20) performing feature fusion on the acoustic features and the pronunciation motion features according to the time correspondence to obtain fusion features; (30) training according to the fusion features to obtain a fusion-feature intelligibility discrimination model; (40) using the fusion-feature intelligibility discrimination model to obtain a feature fusion evaluation result.
- The pronunciation evaluation method fusing acoustic features and pronunciation motion features according to claim 1, characterized in that: an acoustic-feature intelligibility discrimination model and a pronunciation-motion-feature intelligibility discrimination model are further trained separately according to the acoustic features and the pronunciation motion features, and the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the pronunciation-motion-feature intelligibility discrimination model are subjected to strategy fusion to obtain a strategy fusion evaluation result.
- The pronunciation evaluation method fusing acoustic features and pronunciation motion features according to claim 1 or 2, characterized in that: in step (10), the audio data and the pronunciation motion data are collected using an electromagnetic articulography tracing system, by placing spatial sensors on the vocal organs and calculating the three-dimensional spatial coordinates and angles of the spatial sensors in the magnetic field to obtain the pronunciation motion data, and the time-corresponding audio data is collected while the pronunciation motion data is collected; wherein the vocal organs include the lips, and the pronunciation motion data includes lip motion data.
- The pronunciation evaluation method fusing acoustic features and pronunciation motion features according to claim 3, characterized in that: a spatial sensor is further placed on the bridge of the nose, and in step (10) the pronunciation motion features are extracted from the pronunciation motion data by taking the nose-bridge sensor as the coordinate origin and calculating the relative distances of the lip sensors from the coordinate origin; the three-dimensional coordinate distances x, y, z of the four lip sensors are used as motion features, each sampling point is treated as one frame, and for each frame of data the pronunciation motion feature is extracted according to the following formula: lip = [x1 ... x4, y1 ... y4, z1 ... z4]^T; where the subscripts of x, y, and z denote the upper-lip motion data, the lower-lip motion data, the left-mouth-corner motion data, and the right-mouth-corner motion data, respectively.
- The pronunciation evaluation method fusing acoustic features and pronunciation motion features according to claim 1 or 2, characterized in that: the feature fusion in step (20) comprises setting the window lengths of the acoustic features and the pronunciation motion features according to the sampling rates of the audio data and the pronunciation motion data, setting the window shift according to the window lengths, and performing feature fusion on the acoustic features and the pronunciation motion features with the window shift.
- A pronunciation evaluation system fusing acoustic features and pronunciation motion features, characterized by comprising: a feature extraction module, configured to collect audio data and pronunciation motion data, extract acoustic features from the audio data, and extract pronunciation motion features from the pronunciation motion data, wherein the audio data and the pronunciation motion data correspond in time; a feature fusion module, which performs feature fusion on the acoustic features and the pronunciation motion features according to the time correspondence to obtain fusion features; a model training module, which trains according to the fusion features to obtain a fusion-feature intelligibility discrimination model; and a pronunciation evaluation module, which uses the fusion-feature intelligibility discrimination model to obtain a feature fusion evaluation result.
- The pronunciation evaluation system fusing acoustic features and pronunciation motion features according to claim 7, characterized by further comprising a strategy fusion module; the model training module further trains an acoustic-feature intelligibility discrimination model and a pronunciation-motion-feature intelligibility discrimination model separately according to the acoustic features and the pronunciation motion features; and the strategy fusion module performs strategy fusion on the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the pronunciation-motion-feature intelligibility discrimination model to obtain a strategy fusion evaluation result.
- The pronunciation evaluation system fusing acoustic features and pronunciation motion features according to claim 7 or 8, characterized by further comprising a data acquisition module, which collects the audio data and the pronunciation motion data using an electromagnetic articulography tracing system, by placing spatial sensors on the vocal organs and calculating the three-dimensional spatial coordinates and angles of the spatial sensors in the magnetic field to obtain the pronunciation motion data, and collecting the time-corresponding audio data while collecting the pronunciation motion data.
- The pronunciation evaluation system fusing acoustic features and pronunciation motion features according to claim 9, characterized in that: the vocal organs include one or more of the following: the tongue, the lips, the mouth corners, and the incisors; wherein the tongue sensors are placed at the tongue tip, the tongue middle, and the tongue back; the lip sensors are placed at the middle of the upper lip and the middle of the lower lip; the mouth-corner sensors are placed at the left mouth corner and the right mouth corner; and the incisor sensor is placed on the lower incisors and is used to track the movement of the lower jaw.
- The pronunciation evaluation system fusing acoustic features and pronunciation motion features according to claim 9, characterized by further comprising placing spatial sensors at head positions to detect head motion data and correcting the pronunciation motion data according to the head motion data; the head positions include one or more of the following: the forehead, the bridge of the nose, and behind the ears; wherein the behind-the-ear sensors are placed on the mastoid bones behind the ears.
- The pronunciation evaluation system fusing acoustic features and pronunciation motion features according to claim 7 or 8, characterized in that: the model training module trains by inputting the acoustic features, the pronunciation motion features, or the fusion features respectively into a Gaussian mixture model-hidden Markov model, obtaining the corresponding acoustic-feature intelligibility discrimination model, pronunciation-motion-feature intelligibility discrimination model, and fusion-feature intelligibility discrimination model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/616,459 US11786171B2 (en) | 2017-08-17 | 2018-09-17 | Method and system for articulation evaluation by fusing acoustic features and articulatory movement features |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710708049.6A CN107578772A (zh) | 2017-08-17 | 2017-08-17 | Pronunciation evaluation method and system fusing acoustic features and articulatory movement features |
CN201710708049.6 | 2017-08-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019034184A1 true WO2019034184A1 (zh) | 2019-02-21 |
Family
ID=61034267
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/105942 WO2019034184A1 (zh) | 2017-08-17 | 2018-09-17 | Pronunciation evaluation method and system fusing acoustic features and articulatory movement features |
Country Status (3)
Country | Link |
---|---|
US (1) | US11786171B2 (zh) |
CN (1) | CN107578772A (zh) |
WO (1) | WO2019034184A1 (zh) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107578772A (zh) | 2017-08-17 | 2018-01-12 | 天津快商通信息技术有限责任公司 | Pronunciation evaluation method and system fusing acoustic features and articulatory movement features |
CN108922563B (zh) * | 2018-06-17 | 2019-09-24 | 海南大学 | Spoken-language learning correction method based on visualization of deviant articulator morphology and behavior |
CN109360645B (zh) * | 2018-08-01 | 2021-06-11 | 太原理工大学 | Statistical classification method for abnormal distributions of articulatory movements in dysarthria |
EP3618061B1 (en) * | 2018-08-30 | 2022-04-27 | Tata Consultancy Services Limited | Method and system for improving recognition of disordered speech |
CN109697976B (zh) * | 2018-12-14 | 2021-05-25 | 北京葡萄智学科技有限公司 | Pronunciation recognition method and device |
CN111951828A (zh) * | 2019-05-16 | 2020-11-17 | 上海流利说信息技术有限公司 | Pronunciation evaluation method, device, system, medium, and computing equipment |
CN110223671B (zh) * | 2019-06-06 | 2021-08-10 | 标贝(深圳)科技有限公司 | Method, device, system, and storage medium for predicting prosodic boundaries |
CN112927696A (zh) * | 2019-12-05 | 2021-06-08 | 中国科学院深圳先进技术研究院 | Automatic dysarthria assessment system and method based on speech recognition |
CN111210838B (zh) * | 2019-12-05 | 2023-09-15 | 中国船舶工业综合技术经济研究院 | Method for evaluating speech cognitive ability |
CN113496696A (zh) * | 2020-04-03 | 2021-10-12 | 中国科学院深圳先进技术研究院 | Automatic speech function assessment system and method based on speech recognition |
CN113314100B (zh) * | 2021-07-29 | 2021-10-08 | 腾讯科技(深圳)有限公司 | Method, device, equipment, and storage medium for evaluating spoken-language tests and displaying results |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101292281A (zh) * | 2005-09-29 | 2008-10-22 | 独立行政法人产业技术综合研究所 | Pronunciation diagnosis device, pronunciation diagnosis method, storage medium, and pronunciation diagnosis program |
CN102063903A (zh) * | 2010-09-25 | 2011-05-18 | 中国科学院深圳先进技术研究院 | Speech interaction training system and method |
JP2012088675A (ja) * | 2010-10-19 | 2012-05-10 | Inokuma Kazuhito | Language pronunciation learning device with speech analysis function, and system thereof |
CN102663928A (zh) * | 2012-03-07 | 2012-09-12 | 天津大学 | Electronic teaching method for deaf people learning to speak |
CN103218924A (zh) * | 2013-03-29 | 2013-07-24 | 上海众实科技发展有限公司 | Spoken-language learning monitoring method based on audio-video bimodality |
WO2015030471A1 (en) * | 2013-08-26 | 2015-03-05 | Seli Innovations Inc. | Pronunciation correction apparatus and method thereof |
CN106409030A (zh) * | 2016-12-08 | 2017-02-15 | 河南牧业经济学院 | Personalized spoken foreign language learning system |
CN107578772A (zh) * | 2017-08-17 | 2018-01-12 | 天津快商通信息技术有限责任公司 | Pronunciation evaluation method and system fusing acoustic features and articulatory movement features |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4836218A (en) * | 1984-04-09 | 1989-06-06 | Arthrotek, Inc. | Method and apparatus for the acoustic detection and analysis of joint disorders |
CN1924994B (zh) * | 2005-08-31 | 2010-11-03 | 中国科学院自动化研究所 | Embedded speech synthesis method and system |
KR101035768B1 (ko) * | 2009-01-02 | 2011-05-20 | 전남대학교산학협력단 | Method and apparatus for setting a lip region for lip reading |
US8913103B1 (en) * | 2012-02-01 | 2014-12-16 | Google Inc. | Method and apparatus for focus-of-attention control |
US9159321B2 (en) * | 2012-02-27 | 2015-10-13 | Hong Kong Baptist University | Lip-password based speaker verification system |
US20140365221A1 (en) * | 2012-07-31 | 2014-12-11 | Novospeech Ltd. | Method and apparatus for speech recognition |
US9911358B2 (en) * | 2013-05-20 | 2018-03-06 | Georgia Tech Research Corporation | Wireless real-time tongue tracking for speech impairment diagnosis, speech therapy with audiovisual biofeedback, and silent speech interfaces |
WO2014194439A1 (en) * | 2013-06-04 | 2014-12-11 | Intel Corporation | Avatar-based video encoding |
JP2016129661A (ja) * | 2015-01-09 | 2016-07-21 | パナソニックIpマネジメント株式会社 | Determination system, control signal output system, rehabilitation system, determination method, control signal output method, computer program, and electroencephalogram signal acquisition system |
US10888265B2 (en) * | 2015-10-07 | 2021-01-12 | Donna Edwards | Jaw function measurement apparatus |
EP3226570A1 (en) * | 2016-03-31 | 2017-10-04 | Thomson Licensing | Synchronizing audio and video signals rendered on different devices |
- 2017-08-17: CN application CN201710708049.6A filed; published as CN107578772A (status: pending)
- 2018-09-17: US application US16/616,459 filed; granted as US11786171B2 (status: active)
- 2018-09-17: PCT application PCT/CN2018/105942 filed; published as WO2019034184A1 (application filing)
Also Published As
Publication number | Publication date |
---|---|
US11786171B2 (en) | 2023-10-17 |
US20200178883A1 (en) | 2020-06-11 |
CN107578772A (zh) | 2018-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019034184A1 (zh) | Pronunciation evaluation method and system fusing acoustic features and articulatory movement features | |
Rudzicz et al. | The TORGO database of acoustic and articulatory speech from speakers with dysarthria | |
Wang et al. | Articulatory distinctiveness of vowels and consonants: A data-driven approach | |
Jin et al. | Adventitious sounds identification and extraction using temporal–spectral dominance-based features | |
Golabbakhsh et al. | Automatic identification of hypernasality in normal and cleft lip and palate patients with acoustic analysis of speech | |
Zañartu et al. | Subglottal impedance-based inverse filtering of voiced sounds using neck surface acceleration | |
Wang et al. | Phoneme-level articulatory animation in pronunciation training | |
JP2003255993A (ja) | Speech recognition system, speech recognition method, speech recognition program, speech synthesis system, speech synthesis method, speech synthesis program | |
US10959661B2 (en) | Quantification of bulbar function | |
US20150351663A1 (en) | Determining apnea-hypopnia index ahi from speech | |
CN105023573A (zh) | 使用听觉注意力线索的语音音节/元音/音素边界检测 | |
Whitfield et al. | Examining acoustic and kinematic measures of articulatory working space: Effects of speech intensity | |
CN105448291A (zh) | 基于语音的帕金森症检测方法及检测系统 | |
El Emary et al. | Towards developing a voice pathologies detection system | |
Vojtech et al. | Refining algorithmic estimation of relative fundamental frequency: Accounting for sample characteristics and fundamental frequency estimation method | |
Diaz-Cadiz et al. | Adductory vocal fold kinematic trajectories during conventional versus high-speed videoendoscopy | |
CN115346561B (zh) | 基于语音特征的抑郁情绪评估预测方法及系统 | |
TWI749663B (zh) | Method and system for vocalization monitoring | |
Ribeiro et al. | Speaker-independent classification of phonetic segments from raw ultrasound in child speech | |
JP4381404B2 (ja) | Speech synthesis system, speech synthesis method, speech synthesis program | |
Sofwan et al. | Normal and Murmur Heart Sound Classification Using Linear Predictive Coding and k-Nearest Neighbor Methods | |
Lee et al. | An exploratory study of emotional speech production using functional data analysis techniques | |
Talkar et al. | Acoustic Indicators of Speech Motor Coordination in Adults With and Without Traumatic Brain Injury. | |
Sebkhi et al. | Evaluation of a wireless tongue tracking system on the identification of phoneme landmarks | |
Berger | Measurement of vowel nasalization by multi-dimensional acoustic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18846909; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 18846909; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.10.2020) |