WO2019034184A1 - Pronunciation evaluation method and system fusing acoustic features and articulatory movement features - Google Patents

Pronunciation evaluation method and system fusing acoustic features and articulatory movement features

Info

Publication number
WO2019034184A1
WO2019034184A1 (PCT/CN2018/105942)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
pronunciation
motion
fusion
intelligibility
Prior art date
Application number
PCT/CN2018/105942
Other languages
English (en)
French (fr)
Inventor
党建武
原梦
王龙标
Original Assignee
厦门快商通科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 厦门快商通科技股份有限公司 filed Critical 厦门快商通科技股份有限公司
Priority to US16/616,459 priority Critical patent/US11786171B2/en
Publication of WO2019034184A1 publication Critical patent/WO2019034184A1/zh

Links

Images

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/48 Other medical applications
    • A61B5/4803 Speech analysis specially adapted for diagnostic purposes
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/103 Detecting, measuring or recording devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B5/11 Measuring movement of the entire body or parts thereof, e.g. head or hand tremor, mobility of a limb
    • A61B5/1113 Local tracking of patients, e.g. in a hospital or private home
    • A61B5/1114 Tracking parts of the body
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/103 Detecting, measuring or recording devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B5/11 Measuring movement of the entire body or parts thereof, e.g. head or hand tremor, mobility of a limb
    • A61B5/1121 Determining geometric values, e.g. centre of rotation or angular range of movement
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/68 Arrangements of detecting, measuring or recording means, e.g. sensors, in relation to patient
    • A61B5/6801 Arrangements of detecting, measuring or recording means, e.g. sensors, in relation to patient specially adapted to be attached to or worn on the body surface
    • A61B5/6813 Specially adapted to be attached to a specific body part
    • A61B5/6814 Head
    • A61B5/682 Mouth, e.g., oral cavity; tongue; Lips; Teeth
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B2562/00 Details of sensors; Constructional details of sensor housings or probes; Accessories for sensors
    • A61B2562/02 Details of sensors specially adapted for in-vivo measurements
    • A61B2562/0204 Acoustic sensors
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B2562/00 Details of sensors; Constructional details of sensor housings or probes; Accessories for sensors
    • A61B2562/02 Details of sensors specially adapted for in-vivo measurements
    • A61B2562/0219 Inertial sensors, e.g. accelerometers, gyroscopes, tilt switches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • The invention relates to the field of pronunciation evaluation technology, and in particular to a pronunciation evaluation method that fuses acoustic features and articulatory movement features, and to a system applying the method.
  • The perception and production of speech is the result of multiple auditory and articulatory organs working together within a short period of time.
  • Some people suffer congenital or traumatic brain or nerve damage that prevents them from controlling the specific muscles needed to produce correct speech, which manifests as abnormal articulation, phonation, resonance and prosody. This condition is dysarthria.
  • Speech intelligibility is the degree to which a listener can accurately recover the information conveyed by the speaker's speech.
  • The severity of dysarthria is often assessed through speech intelligibility: the more severe the disorder, the lower the intelligibility of the speech.
  • Research on dysarthria has gradually increased in recent years, but most studies analyse intelligibility using acoustic parameters alone, ignoring the fact that abnormal movement of the articulatory organs is the source of the abnormal sound. This makes such evaluation methods insufficiently comprehensive and their results unreliable. It is therefore particularly important to establish a reliable, objective and accurate evaluation standard that does not depend on subjective judgement.
  • To solve these problems, the present invention provides a pronunciation evaluation method and system that fuse acoustic features and articulatory movement features: audio data and corresponding articulatory movement data are collected, acoustic features and corresponding articulatory movement features are extracted, and the two kinds of features are fused, yielding a more accurate and reliable fusion evaluation result and making the pronunciation evaluation more objective and accurate.
  • A pronunciation evaluation method fusing acoustic features and articulatory movement features comprises the following steps.
  • An acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model are further trained separately from the acoustic features and the articulatory movement features, and the evaluation result of the acoustic-feature intelligibility discrimination model is combined with the evaluation result of the articulatory-movement-feature intelligibility discrimination model by strategy fusion to obtain a strategy fusion evaluation result.
  • The audio data and articulatory movement data are collected with an electromagnetic articulography system: spatial sensors are placed on the articulatory organs, the three-dimensional spatial coordinates and angles of the sensors in the magnetic field are calculated to obtain the articulatory movement data, and the audio data corresponding in time is recorded while the movement data is collected. The articulatory organs include the lips, and the articulatory movement data include lip movement data.
  • A spatial sensor is additionally placed on the bridge of the nose. In step (10), the articulatory movement features are extracted from the movement data by taking the nose-bridge sensor as the coordinate origin and computing the relative distance of each lip sensor from that origin. The three-dimensional coordinate distances x, y, z of the four lip sensors are taken as movement features, each sampling point is treated as one frame, and for each frame the articulatory movement feature is extracted according to the following formula:
  • lip = [x_1 ... x_4, y_1 ... y_4, z_1 ... z_4]^T
  • where the subscripts of x, y and z denote the upper-lip, lower-lip, left mouth-corner and right mouth-corner movement data, respectively.
  • The feature fusion in step (20) sets the window lengths of the acoustic features and the articulatory movement features according to the sampling rates of the audio data and the movement data, sets the window shift according to those window lengths, and fuses the acoustic and articulatory movement features with that window shift.
  • The strategy fusion assigns different weights to the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model and computes the strategy fusion evaluation result according to those weights, as follows:
  • LL = argmax_k [ w · LL_A(k) + (1 − w) · LL_M(k) ]
  • where LL denotes the strategy fusion evaluation result, LL_A(k) and LL_M(k) denote the evaluation results (likelihood scores) of the acoustic-feature and articulatory-movement-feature intelligibility discrimination models for class k, k denotes the class of the evaluation result, w denotes the weight, and the argmax function selects the class with the highest score.
  • Correspondingly, the present invention also provides a pronunciation evaluation system fusing acoustic features and articulatory movement features, including:
  • a feature extraction module, configured to collect audio data and articulatory movement data, extract acoustic features from the audio data and extract articulatory movement features from the movement data, where the audio data and the movement data correspond to each other in time;
  • a feature fusion module, which fuses the acoustic features and the articulatory movement features according to their time correspondence to obtain fused features;
  • a model training module, which trains on the fused features to obtain a fused-feature intelligibility discrimination model;
  • a pronunciation evaluation module, which uses the fused-feature intelligibility discrimination model to obtain the feature fusion evaluation result.
  • The system preferably further includes a strategy fusion module:
  • the model training module further trains separately on the acoustic features and the articulatory movement features to obtain an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model;
  • the strategy fusion module combines the evaluation result of the acoustic-feature intelligibility discrimination model with the evaluation result of the articulatory-movement-feature intelligibility discrimination model by strategy fusion to obtain a strategy fusion evaluation result.
  • The system preferably further includes a data acquisition module, which collects the audio data and articulatory movement data with an electromagnetic articulography system: spatial sensors are placed on the articulatory organs, their three-dimensional spatial coordinates and angles in the magnetic field are calculated to obtain the articulatory movement data, and the audio data corresponding in time is recorded while the movement data is collected.
  • The articulatory organs include one or more of the following: tongue, lips, mouth corners, incisors. The tongue sensors are placed at the tongue tip, tongue middle and tongue back; the lip sensors are placed at the middle of the upper lip and the middle of the lower lip; the mouth-corner sensors are placed at the left and right mouth corners; the incisor sensor is placed on a lower incisor and is used to track jaw movement.
  • The system further includes spatial sensors placed on the head to detect head movement data, and the articulatory movement data are corrected according to the head movement data. The head positions include one or more of the following: forehead, bridge of the nose, behind the ears; the sensors behind the ears are placed on the mastoid bones behind the ears.
  • The model training module trains by feeding the acoustic features, the articulatory movement features or the fused features into a Gaussian mixture model-hidden Markov model, obtaining the corresponding acoustic-feature intelligibility discrimination model, articulatory-movement-feature intelligibility discrimination model and fused-feature intelligibility discrimination model.
  • The present invention collects audio data and corresponding articulatory movement data, extracts acoustic features and corresponding articulatory movement features, fuses them at the feature level and trains a model on the fused features, producing a more accurate and reliable feature fusion evaluation result and making the pronunciation evaluation more objective and accurate.
  • The present invention further trains an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model separately and combines the evaluation results of the individual models by strategy fusion to obtain a strategy fusion evaluation result; the strategy fusion result and the feature fusion result verify and cross-reference each other, making the evaluation result still more objective and accurate.
  • The present invention not only detects the articulatory movement data of the vocal organs but also places spatial sensors on the head to detect head movement data and corrects the articulatory movement data accordingly, making the data more accurate and reliable.
  • FIG. 1 is a flow chart of the pronunciation evaluation method fusing acoustic features and articulatory movement features according to the present invention;
  • FIG. 2 is a schematic structural diagram of the pronunciation evaluation system fusing acoustic features and articulatory movement features according to the present invention;
  • FIG. 3 is a first schematic diagram of the spatial sensor distribution;
  • FIG. 4 is a second schematic diagram of the spatial sensor distribution.
  • As shown in FIG. 1, a pronunciation evaluation method fusing acoustic features and articulatory movement features of the present invention comprises the following steps.
  • In step (10), the audio data and the articulatory movement data are collected with an electromagnetic articulography (EMA) system.
  • In this embodiment, the 3D AG500 electromagnetic articulograph is used.
  • The audio data corresponding in time is recorded while the articulatory movement data is collected; the articulatory organs include the lips, and the articulatory movement data include lip movement data. Because the abnormal tongue movements of patients with dysarthria can cause the sensors to fall off during articulation, it is difficult to obtain valid tongue movement data; in this embodiment, lip movement data are therefore chosen as the main articulatory movement data.
  • The EMA system exploits the alternating current induced in a spatial sensor by an alternating magnetic field to compute the sensor's three-dimensional spatial coordinates and angles in the field and thereby collect the movement data. The audio signal is recorded at the same time as the sensor position information.
  • The spatial sensors are attached to the recording device by thin, lightweight cables that do not hinder free movement of the head within the EMA cube.
  • Each frame of the pre-emphasised data is windowed to obtain windowed data.
  • 20 ms is taken as one frame, and because spectral energy may leak at the frame boundaries, a Hanning window is chosen and applied to every frame.
  • The acoustic features are obtained using Mel-frequency cepstral coefficients (MFCCs) as the characteristic parameters.
  • The Mel-frequency cepstral coefficient is based on the frequency-domain characteristics of human hearing: the linear amplitude spectrum is mapped onto the Mel nonlinear amplitude spectrum based on auditory perception and then converted to the cepstrum.
  • The change between neighbouring frames also helps distinguish different speech characteristics, so the first-order and second-order differences of each cepstral dimension are generally appended to the MFCCs.
  • In this embodiment, 13-dimensional MFCCs together with their first-order and second-order differences are used as the acoustic features.
  • The feature fusion in step (20) sets the window lengths of the acoustic features and the articulatory movement features according to the sampling rates of the audio data and the movement data, sets the window shift according to the window lengths, and fuses the acoustic and articulatory movement features with that window shift, so that the complementary strengths of the two kinds of features can be exploited effectively in the modelling.
  • The sampling rate of the audio data is 16,000 Hz and that of the articulatory movement data is 200 Hz; the acoustic window length is set to 20 ms, the movement-feature window length to 5 ms, and the window shift during feature extraction to 5 ms.
  • The dimensionality of the resulting fused (Acoustic-Articulatory) feature is 51.
  • The fused features are used to train a GMM-HMM model that discriminates four intelligibility levels (normal, mild, moderate, severe).
  • This hidden Markov model has three states and 24 Gaussian mixture components.
  • The model training is performed by feeding the acoustic features, the articulatory movement features or the fused features into a Gaussian mixture model-hidden Markov model (GMM-HMM).
  • The intelligibility assessment is performed by training, with the GMM-HMM and using the acoustic features and the articulatory movement features respectively, intelligibility discrimination models that distinguish the different intelligibility levels. Considering the temporal structure of the speech signal, it is modelled with an HMM, and the state emission probability of each HMM state is computed with a GMM.
  • The intelligibility level reflects the severity of the disorder. According to the diagnosis of speech pathologists, the speakers are divided into mild, moderate and severe groups, plus a control group of normal speakers, four groups in total, and a GMM-HMM model is trained for each group. To verify that different features affect intelligibility discrimination differently, separate GMM-HMM models are trained on the acoustic features and on the articulatory movement features.
  • These hidden Markov models are left-to-right models without skips, each with three states and eight Gaussian mixture components, yielding the acoustic-feature intelligibility discrimination model (denoted Acoustic-GMM-HMM) and the articulatory-movement-feature intelligibility discrimination model (denoted Articulatory-GMM-HMM).
  • In step (40), the fused-feature intelligibility discrimination model is used to obtain the feature fusion evaluation result, i.e. the fused-feature model is applied to judge the different intelligibility levels.
  • The strategy fusion assigns different weights to the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model and computes the strategy fusion evaluation result according to those weights; that is, the acoustic-feature model (Acoustic-GMM-HMM) and the articulatory-movement-feature model (Articulatory-GMM-HMM) are decision-fused according to the following formula:
  • LL = argmax_k [ w · LL_A(k) + (1 − w) · LL_M(k) ]
  • where LL denotes the strategy fusion evaluation result (i.e. the maximum likelihood score after decision fusion), LL_A(k) denotes the evaluation result of the Acoustic-GMM-HMM, LL_M(k) denotes the evaluation result of the Articulatory-GMM-HMM, k denotes the class of the evaluation result, w denotes the weight, and the argmax function finds the class with the highest score. In this embodiment, k = 1, 2, 3, 4 corresponds to the four levels normal, mild, moderate and severe; w, the weight of the Acoustic-GMM-HMM, is set to 0.5; and 1 − w is the weight of the Articulatory-GMM-HMM.
  • As shown in FIG. 2, the present invention also provides a pronunciation evaluation system fusing acoustic features and articulatory movement features, including:
  • a data acquisition module, which collects the audio data and the articulatory movement data with an electromagnetic articulography system: spatial sensors are placed on the articulatory organs, their three-dimensional spatial coordinates and angles in the magnetic field are calculated to obtain the articulatory movement data, and the audio data corresponding in time is recorded while the movement data is collected;
  • a feature extraction module, configured to collect audio data and articulatory movement data, extract acoustic features from the audio data and extract articulatory movement features from the movement data, where the audio data and the movement data correspond to each other in time;
  • a feature fusion module, which fuses the acoustic features and the articulatory movement features according to their time correspondence to obtain fused features;
  • a model training module, which trains on the fused features to obtain a fused-feature intelligibility discrimination model;
  • a pronunciation evaluation module, which uses the fused-feature intelligibility discrimination model to obtain the feature fusion evaluation result;
  • a strategy fusion module, wherein the model training module further trains separately on the acoustic features and the articulatory movement features to obtain an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model;
  • the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model are combined by strategy fusion to obtain the strategy fusion evaluation result.
  • The articulatory organs include one or more of the following: tongue, lips, mouth corners, incisors. The tongue sensors are placed at the tongue tip (TT, 1 cm behind the anatomical tongue tip), tongue middle (TM, 3 cm behind the TT sensor) and tongue back (TB, 2 cm behind the TM sensor); the lip sensors are placed at the middle of the upper lip (UL) and the middle of the lower lip (LL);
  • the mouth-corner sensors are placed at the left mouth corner (LM) and the right mouth corner (RM); the incisor sensor is placed on a lower incisor (JA) and is used to track jaw movement.
  • The articulatory organs consist mainly of the lips, teeth, tongue, palate and other parts.
  • The tongue and lips, working closely with the other parts, block the airflow and change the shape of the oral resonator, and therefore play an important role in articulation. The tongue data were therefore analysed first.
  • However, the sensors can fall off during articulation, which makes it difficult to obtain valid tongue movement data; in this embodiment, the movement data of the lips are therefore used as the main articulatory movement data.
  • The method further includes placing spatial sensors on the head to detect head movement data and correcting the articulatory movement data according to the head movement data;
  • the head positions include one or more of the following: forehead, bridge of the nose, behind the ears; the sensors behind the ears are placed on the mastoid bones behind the ears and serve as a reference for recording head movement.
  • The articulatory movement features are extracted from the movement data by taking the nose-bridge spatial sensor as the coordinate origin and calculating the relative distance of each lip sensor from that origin; the three-dimensional coordinate distances x, y, z of the four lip sensors are used as movement features, each sampling point is taken as one frame, and for each frame the articulatory movement feature is extracted according to the following formula:
  • lip = [x_1 ... x_4, y_1 ... y_4, z_1 ... z_4]^T
  • where the subscripts of x, y and z denote the upper-lip, lower-lip, left mouth-corner and right mouth-corner movement data, respectively.
  • The articulatory movement feature has 12 dimensions in total.
  • The model training module trains by feeding the acoustic features, the articulatory movement features or the fused features into a Gaussian mixture model-hidden Markov model, obtaining the corresponding acoustic-feature intelligibility discrimination model, articulatory-movement-feature intelligibility discrimination model and fused-feature intelligibility discrimination model.
  • The Torgo data set, which contains both audio data and articulatory movement data, is taken as an example to outline the algorithm flow of the whole system.
  • The specific steps are as follows:
  • The input to the system covers four intelligibility levels: severe, moderate, mild and normal.
  • The intelligibility level is determined from the diagnosis of speech pathologists.
  • The numbers of participants in the data set are 3, 2, 2 and 7 respectively, and the numbers of utterance samples are 567, 876, 671 and 4289 respectively.
  • The EMA device records the audio data and the articulatory movement data simultaneously; acoustic features, movement features and the fused A-A features are extracted with the settings of Table 2.
  • The intelligibility discrimination models are trained with the GMM-HMM method.
  • The GMM-HMM discrimination model using movement features clearly improves the accuracy for speech-impaired speakers, but for normal speakers the MFCC-based acoustic features are more accurate.
  • Overall, the GMM-HMM using movement features outperforms the GMM-HMM using acoustic features by 0.56 percentage points on average.
  • Movement features are thus very effective for discriminating the intelligibility of speech-impaired speakers.
  • The movement features discriminate well for impaired speakers.
  • The feature-fused A-A features are used to train a GMM-HMM model, and the acoustic-feature GMM-HMM and the movement-feature GMM-HMM are combined by decision fusion.
  • Feature fusion and decision fusion combine the complementary advantages of the two kinds of features and further improve the discrimination performance.
  • The present invention uses not only audio data but also the articulatory movement data of speech-impaired speakers, judging the intelligibility level of dysarthria from the articulatory movement itself.
  • The key point of the articulatory movement data is to extract features from the movements of speech-impaired speakers.
  • The tongue movement data are unstable and hard to obtain; in this embodiment the lip articulatory movement data are therefore used as the main basis, which effectively distinguishes the intelligibility levels of speech-impaired speakers.
  • Through feature fusion and decision fusion, the invention combines traditional speech acoustic features with articulatory movement features, effectively exploits the complementarity of the two kinds of features and ensures the objectivity and comprehensiveness of the evaluation.
  • The fused results show a clear advantage in classifying the degree of intelligibility over using acoustic features alone or articulatory movement features alone.

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Veterinary Medicine (AREA)
  • Animal Behavior & Ethology (AREA)
  • Surgery (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Dentistry (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physiology (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Geometry (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Epidemiology (AREA)
  • Probability & Statistics with Applications (AREA)

Abstract

A pronunciation evaluation method and system fusing acoustic features and articulatory movement features: audio data and articulatory movement data are collected, acoustic features are extracted from the audio data and articulatory movement features from the movement data, and the acoustic features and articulatory movement features are combined by feature fusion and strategy fusion according to their time correspondence, effectively exploiting the complementarity of the two kinds of features and ensuring the objectivity and comprehensiveness of the evaluation, thereby yielding more accurate and reliable feature fusion and decision fusion evaluation results and making the pronunciation evaluation more objective and accurate.

Description

Pronunciation evaluation method and system fusing acoustic features and articulatory movement features
Technical Field
The present invention relates to the technical field of pronunciation evaluation, and in particular to a pronunciation evaluation method fusing acoustic features and articulatory movement features and to a system applying the method.
Background Art
The perception and production of speech is the result of multiple auditory and articulatory organs working together within a short period of time. Some people suffer congenital or traumatic brain or nerve damage that prevents them from controlling the specific muscles needed to produce correct speech, manifesting as abnormal articulation, phonation, resonance and prosody; this is dysarthria.
Speech intelligibility is the degree to which a listener can accurately recover the information conveyed by the speaker's speech signal. The severity of dysarthria is often evaluated through speech intelligibility: the more severe the disorder, the lower the intelligibility. In recent years research on dysarthria has gradually increased, but most studies analyse intelligibility with acoustic parameters alone, ignoring the fact that abnormal articulator movement is the source of the abnormal sound, which makes the evaluation insufficiently comprehensive and its results unreliable. It is therefore particularly important to establish a reliable, objective and accurate evaluation standard that does not depend on subjective judgement.
Summary of the Invention
To solve the above problems, the present invention provides a pronunciation evaluation method and system fusing acoustic features and articulatory movement features: audio data and corresponding articulatory movement data are collected, acoustic features and corresponding articulatory movement features are extracted, and the two kinds of features are fused, yielding a more accurate and reliable fusion evaluation result and making the pronunciation evaluation more objective and accurate.
To achieve the above object, the technical solution adopted by the present invention is:
A pronunciation evaluation method fusing acoustic features and articulatory movement features, comprising the following steps:
(10) collecting audio data and articulatory movement data, and extracting acoustic features from the audio data and articulatory movement features from the articulatory movement data, wherein the audio data and the articulatory movement data correspond to each other in time;
(20) fusing the acoustic features and the articulatory movement features at the feature level according to their time correspondence to obtain fused features;
(30) training on the fused features to obtain a fused-feature intelligibility discrimination model;
(40) using the fused-feature intelligibility discrimination model to obtain a feature fusion evaluation result.
Preferably, an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model are further trained separately from the acoustic features and the articulatory movement features, and the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model are combined by strategy fusion to obtain a strategy fusion evaluation result.
Preferably, in step (10) the audio data and articulatory movement data are collected with an electromagnetic articulography system: spatial sensors are placed on the articulatory organs, the three-dimensional spatial coordinates and angles of the sensors in the magnetic field are calculated to obtain the articulatory movement data, and the audio data corresponding in time is recorded while the movement data is collected; the articulatory organs include the lips, and the articulatory movement data include lip movement data.
Preferably, a spatial sensor is additionally placed on the bridge of the nose. In step (10), the articulatory movement features are extracted from the movement data by taking the nose-bridge sensor as the coordinate origin and computing the relative distance of each lip sensor from that origin; the three-dimensional coordinate distances x, y, z of the four lip sensors are used as movement features, each sampling point is taken as one frame, and for each frame the articulatory movement feature is extracted according to the following formula:
lip = [x_1 ... x_4, y_1 ... y_4, z_1 ... z_4]^T
where the subscripts of x, y and z denote the upper-lip, lower-lip, left mouth-corner and right mouth-corner movement data, respectively.
Preferably, the feature fusion in step (20) sets the window lengths of the acoustic features and the articulatory movement features according to the sampling rates of the audio data and the articulatory movement data, sets the window shift according to the window lengths, and fuses the acoustic features and the articulatory movement features with that window shift.
Preferably, the strategy fusion assigns different weights to the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model and computes the strategy fusion evaluation result according to those weights, as follows:
LL = argmax_k [ w · LL_A(k) + (1 − w) · LL_M(k) ]
where LL denotes the strategy fusion evaluation result, LL_A(k) denotes the evaluation result of the acoustic-feature intelligibility discrimination model, LL_M(k) denotes the evaluation result of the articulatory-movement-feature intelligibility discrimination model, k denotes the class of the evaluation result, w denotes the weight, and the argmax function selects the class with the highest score.
Correspondingly, the present invention also provides a pronunciation evaluation system fusing acoustic features and articulatory movement features, comprising:
a feature extraction module, configured to collect audio data and articulatory movement data, extract acoustic features from the audio data and extract articulatory movement features from the movement data, wherein the audio data and the articulatory movement data correspond to each other in time;
a feature fusion module, which fuses the acoustic features and the articulatory movement features according to their time correspondence to obtain fused features;
a model training module, which trains on the fused features to obtain a fused-feature intelligibility discrimination model;
a pronunciation evaluation module, which uses the fused-feature intelligibility discrimination model to obtain a feature fusion evaluation result.
Preferably, the system further comprises a strategy fusion module;
the model training module further trains separately on the acoustic features and the articulatory movement features to obtain an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model;
the strategy fusion module combines the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model by strategy fusion to obtain a strategy fusion evaluation result.
Preferably, the system further comprises a data acquisition module, which collects the audio data and the articulatory movement data with an electromagnetic articulography system: spatial sensors are placed on the articulatory organs, their three-dimensional spatial coordinates and angles in the magnetic field are calculated to obtain the articulatory movement data, and the audio data corresponding in time is recorded while the movement data is collected.
Preferably, the articulatory organs include one or more of the following: tongue, lips, mouth corners, incisors; the tongue sensors are placed at the tongue tip, tongue middle and tongue back; the lip sensors are placed at the middle of the upper lip and the middle of the lower lip; the mouth-corner sensors are placed at the left and right mouth corners; the incisor sensor is placed on a lower incisor and is used to track jaw movement.
Further, spatial sensors are placed on the head to detect head movement data, and the articulatory movement data are corrected according to the head movement data; the head positions include one or more of the following: forehead, bridge of the nose, behind the ears; the sensors behind the ears are placed on the mastoid bones behind the ears.
Preferably, the model training module trains by feeding the acoustic features, the articulatory movement features or the fused features into a Gaussian mixture model-hidden Markov model, obtaining the corresponding acoustic-feature intelligibility discrimination model, articulatory-movement-feature intelligibility discrimination model and fused-feature intelligibility discrimination model.
The beneficial effects of the present invention are:
(1) The present invention collects audio data and corresponding articulatory movement data, extracts acoustic features and corresponding articulatory movement features, fuses them at the feature level and trains a model on the fused features, yielding a more accurate and reliable feature fusion evaluation result and making the pronunciation evaluation more objective and accurate;
(2) The present invention further trains an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model separately and combines the evaluation results of the individual models by strategy fusion to obtain a strategy fusion evaluation result; the strategy fusion result and the feature fusion result verify and cross-reference each other, making the evaluation result still more objective and accurate;
(3) The present invention not only detects the articulatory movement data of the vocal organs but also places spatial sensors on the head to detect head movement data and corrects the articulatory movement data accordingly, making the data more accurate and reliable.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and form a part of it; the illustrative embodiments of the present invention and their description are used to explain the invention and do not constitute an undue limitation of it. In the drawings:
FIG. 1 is a flow chart of the pronunciation evaluation method fusing acoustic features and articulatory movement features of the present invention;
FIG. 2 is a schematic structural diagram of the pronunciation evaluation system fusing acoustic features and articulatory movement features of the present invention;
FIG. 3 is a first schematic diagram of the spatial sensor distribution;
FIG. 4 is a second schematic diagram of the spatial sensor distribution.
Detailed Description of the Embodiments
To make the technical problems to be solved, the technical solutions and the beneficial effects of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit it.
As shown in FIG. 1, a pronunciation evaluation method fusing acoustic features and articulatory movement features of the present invention comprises the following steps:
(10) collecting audio data and articulatory movement data, and extracting acoustic features from the audio data and articulatory movement features from the articulatory movement data, wherein the audio data and the articulatory movement data correspond to each other in time;
(20) fusing the acoustic features and the articulatory movement features at the feature level according to their time correspondence to obtain fused features;
(30) training on the fused features to obtain a fused-feature intelligibility discrimination model;
(40) using the fused-feature intelligibility discrimination model to obtain a feature fusion evaluation result;
(50) training separately on the acoustic features and the articulatory movement features to obtain an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model, and combining the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model by strategy fusion to obtain a strategy fusion evaluation result.
In step (10), the audio data and articulatory movement data are collected with an electromagnetic articulography system; in this embodiment, the articulatory movement data and audio data are collected with a 3D AG500 electromagnetic articulograph (EMA system). Spatial sensors are placed on the articulatory organs and their three-dimensional spatial coordinates and angles in the magnetic field are calculated to obtain the articulatory movement data, while the audio data corresponding in time is recorded simultaneously. The articulatory organs include the lips, and the articulatory movement data include lip movement data. Because the abnormal tongue movements of patients with dysarthria can cause the sensors to fall off during articulation, it is difficult to obtain valid tongue movement data; in this embodiment, lip movement data are therefore used as the main articulatory movement data.
The EMA system exploits the alternating current induced in a spatial sensor by an alternating magnetic field to compute the sensor's three-dimensional spatial coordinates and angles in the field and thereby collect the movement data. The audio signal is recorded synchronously while the sensor position information is collected. The spatial sensors are connected to the recording device by thin, lightweight cables so that they do not hinder free movement of the head inside the EMA cube.
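To make the synchronous acquisition concrete, the following is a minimal sketch of loading one recording session. It assumes, purely hypothetically, that the AG500 positions have been exported to a CSV file with one row per 200 Hz sample and one x/y/z column triplet per sensor, and that the audio is a 16 kHz mono WAV; the patent specifies the hardware, not a file format, so the loader below is illustrative only.

    import numpy as np
    import soundfile as sf

    def load_recording(wav_path, ema_csv_path, ema_rate=200):
        """Load one synchronised audio + EMA recording (hypothetical export format)."""
        audio, sr = sf.read(wav_path)                          # 16 kHz audio track
        ema = np.genfromtxt(ema_csv_path, delimiter=",", skip_header=1)
        # the two streams are recorded simultaneously, so their durations should agree
        if abs(len(audio) / sr - len(ema) / ema_rate) > 0.05:
            raise ValueError("audio and EMA streams are not time-aligned")
        return audio, sr, ema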
In step (10), extracting acoustic features from the audio data further comprises:
(11) passing the audio data s(n) through a high-pass filter for pre-emphasis to obtain pre-emphasised data; the high-pass filter can be expressed as H(z) = 1 − a·z^(−1) (a ∈ [0.9, 1]); the pre-emphasised signal is s'(n) = s(n) − a·s(n−1); in this embodiment a = 0.95.
(12) windowing each frame of the pre-emphasised data to obtain windowed data; in this embodiment 20 ms is taken as one frame, and because spectral energy may leak at the frame boundaries, a Hanning window is applied to every frame.
(13) applying a fast Fourier transform (FFT) to each frame to convert the time-domain data to frequency-domain data and computing its spectral line energy;
(14) passing the spectral line energy of each windowed frame through the Mel filter bank and computing the energy in each Mel filter;
(15) taking the logarithm of the Mel filter energies and computing the DCT (discrete cosine transform) cepstrum to obtain the Mel-frequency cepstral coefficients (MFCCs);
(16) using the Mel-frequency cepstral coefficients as the characteristic parameters to obtain the acoustic features.
The Mel-frequency cepstral coefficients (MFCCs) are based on the frequency-domain characteristics of human hearing: the linear amplitude spectrum is mapped onto the Mel nonlinear amplitude spectrum based on auditory perception and then converted to the cepstrum. The change between neighbouring frames also helps distinguish different speech characteristics, so the first-order and second-order differences of each cepstral dimension are usually appended to the MFCCs. In this embodiment, 13-dimensional MFCCs together with their first-order and second-order differences are used as the acoustic features.
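As a concrete illustration of steps (11) to (16), the following Python sketch extracts the 39-dimensional acoustic features (13 MFCCs plus first- and second-order differences) with 20 ms Hann-windowed frames and a 5 ms shift. It assumes the librosa library and a 16 kHz mono waveform; the patent does not name any particular toolkit, so this is only one possible realisation.

    import numpy as np
    import librosa

    def extract_acoustic_features(wav_path, sr=16000, pre_emph=0.95):
        y, _ = librosa.load(wav_path, sr=sr)
        # (11) pre-emphasis: s'(n) = s(n) - a*s(n-1), with a = 0.95
        y = np.append(y[0], y[1:] - pre_emph * y[:-1])
        # (12)-(15) 20 ms Hann-windowed frames with a 5 ms shift; FFT, Mel filter bank,
        # log energies and DCT are all performed inside librosa's MFCC routine
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.020 * sr),
                                    hop_length=int(0.005 * sr),
                                    window="hann")
        # (16) append first- and second-order differences: 13 * 3 = 39 dimensions
        feat = np.vstack([mfcc,
                          librosa.feature.delta(mfcc),
                          librosa.feature.delta(mfcc, order=2)])
        return feat.T                                          # shape (num_frames, 39)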
In step (20), the feature fusion sets the window lengths of the acoustic features and the articulatory movement features according to the sampling rates of the audio data and the movement data, sets the window shift according to the window lengths, and fuses the two kinds of features with that window shift, so that the complementary strengths of the two kinds of features can be exploited effectively in the modelling. In this embodiment, the sampling rate of the audio data is 16,000 Hz and that of the articulatory movement data is 200 Hz. To synchronise the two kinds of features, the acoustic window length is set to 20 ms, the movement-feature window length to 5 ms, and the window shift during feature extraction to 5 ms. The dimensionality of the resulting fused (Acoustic-Articulatory) feature is 51. The fused features are used to train a GMM-HMM model discriminating four intelligibility levels (normal, mild, moderate, severe); this hidden Markov model has three states and 24 Gaussian mixture components.
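Because both feature streams end up on the same 5 ms grid (the acoustic features through the 5 ms window shift, the articulatory features through the 200 Hz sampling), the feature fusion reduces to frame-wise concatenation. A minimal sketch, assuming the acoustic and articulatory features come from the earlier sketches:

    import numpy as np

    def fuse_features(acoustic, articulatory):
        """Concatenate time-aligned frames into 51-dim Acoustic-Articulatory features."""
        n = min(len(acoustic), len(articulatory))            # trim to the shorter stream
        return np.hstack([acoustic[:n], articulatory[:n]])   # shape (n, 39 + 12)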
In step (30), the model training is performed by feeding the acoustic features, the articulatory movement features or the fused features into a Gaussian mixture model-hidden Markov model (GMM-HMM), obtaining the corresponding acoustic-feature intelligibility discrimination model, articulatory-movement-feature intelligibility discrimination model and fused-feature intelligibility discrimination model. With the GMM-HMM, intelligibility discrimination models that distinguish the different intelligibility levels are trained from the acoustic features and the articulatory movement features respectively, and intelligibility is then assessed. Considering the temporal structure of the speech signal, it is modelled with an HMM, while a GMM computes the state emission probability of each HMM state; this is the GMM-HMM model. The intelligibility level reflects the severity of the disorder: according to the diagnosis of speech pathologists, the speakers are divided into mild, moderate and severe groups, plus a control group of normal speakers, four groups in total, and a GMM-HMM model is trained for each group. To verify that different features affect intelligibility discrimination differently, separate GMM-HMM models are trained on the acoustic features and on the articulatory movement features; the hidden Markov model is a left-to-right model without skips, with three states and eight Gaussian mixture components, yielding the acoustic-feature intelligibility discrimination model (denoted Acoustic-GMM-HMM) and the articulatory-movement-feature intelligibility discrimination model (denoted Articulatory-GMM-HMM).
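A minimal sketch of training one class-specific GMM-HMM, using the hmmlearn library as an assumed stand-in (the patent does not name a toolkit): each intelligibility class in {normal, mild, moderate, severe} gets its own left-to-right model with three states.

    import numpy as np
    from hmmlearn.hmm import GMMHMM

    def train_class_model(sequences, n_states=3, n_mix=8):
        """sequences: list of (num_frames, feat_dim) arrays from one intelligibility class."""
        model = GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=20,
                       init_params="mcw")          # keep our start/transition initialisation
        # left-to-right topology without skips; the zero entries stay zero during EM
        model.startprob_ = np.array([1.0, 0.0, 0.0])
        model.transmat_ = np.array([[0.5, 0.5, 0.0],
                                    [0.0, 0.5, 0.5],
                                    [0.0, 0.0, 1.0]])
        model.fit(np.vstack(sequences), lengths=[len(s) for s in sequences])
        return model

    # one model per intelligibility level, on acoustic, articulatory or fused features:
    # models = {k: train_class_model(seqs) for k, seqs in training_sequences.items()}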
In step (40), the fused-feature intelligibility discrimination model is used to obtain the feature fusion evaluation result, i.e. the fused-feature intelligibility discrimination model is applied to judge the different intelligibility levels.
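The evaluation step then reduces to scoring an utterance against each class model and picking the best one, as in the following sketch (models is assumed to map each intelligibility level to a trained GMM-HMM from the previous sketch):

    def classify(models, features):
        """features: (num_frames, feat_dim) array for one utterance."""
        scores = {k: m.score(features) for k, m in models.items()}   # log-likelihood per class
        best = max(scores, key=scores.get)                           # highest-scoring class
        return best, scores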
In step (50), the strategy fusion assigns different weights to the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model and computes the strategy fusion evaluation result according to those weights; that is, the acoustic-feature intelligibility discrimination model (Acoustic-GMM-HMM) and the articulatory-movement-feature intelligibility discrimination model (Articulatory-GMM-HMM) are decision-fused according to the following formula:
LL = argmax_k [ w · LL_A(k) + (1 − w) · LL_M(k) ]
where LL denotes the strategy fusion evaluation result (i.e. the maximum likelihood score after decision fusion), LL_A(k) denotes the evaluation result of the acoustic-feature intelligibility discrimination model, LL_M(k) denotes the evaluation result of the articulatory-movement-feature intelligibility discrimination model, k denotes the class of the evaluation result, w denotes the weight, and the argmax function finds the class with the highest score. In this embodiment, k = 1, 2, 3, 4 corresponds to the four levels normal, mild, moderate and severe; w, the weight of the Acoustic-GMM-HMM, is set to 0.5; and 1 − w is the weight of the Articulatory-GMM-HMM.
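A minimal sketch of this decision fusion, assuming acoustic_models and articulatory_models map the same class labels k to trained Acoustic-GMM-HMM and Articulatory-GMM-HMM models, and that score() returns the likelihood scores used as LL_A(k) and LL_M(k):

    def decision_fusion(acoustic_models, articulatory_models,
                        acoustic_feats, articulatory_feats, w=0.5):
        """Weighted decision fusion: argmax_k [ w*LL_A(k) + (1-w)*LL_M(k) ]."""
        fused = {k: w * acoustic_models[k].score(acoustic_feats)
                    + (1 - w) * articulatory_models[k].score(articulatory_feats)
                 for k in acoustic_models}
        return max(fused, key=fused.get)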
As shown in FIG. 2, the present invention also provides a pronunciation evaluation system fusing acoustic features and articulatory movement features, comprising:
a data acquisition module, which collects the audio data and the articulatory movement data with an electromagnetic articulography system: spatial sensors are placed on the articulatory organs, their three-dimensional spatial coordinates and angles in the magnetic field are calculated to obtain the articulatory movement data, and the audio data corresponding in time is recorded while the movement data is collected;
a feature extraction module, configured to collect audio data and articulatory movement data, extract acoustic features from the audio data and extract articulatory movement features from the movement data, wherein the audio data and the articulatory movement data correspond to each other in time;
a feature fusion module, which fuses the acoustic features and the articulatory movement features according to their time correspondence to obtain fused features;
a model training module, which trains on the fused features to obtain a fused-feature intelligibility discrimination model;
a pronunciation evaluation module, which uses the fused-feature intelligibility discrimination model to obtain the feature fusion evaluation result;
a strategy fusion module, wherein the model training module further trains separately on the acoustic features and the articulatory movement features to obtain an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model, and the strategy fusion module combines the evaluation result of the acoustic-feature intelligibility discrimination model with the evaluation result of the articulatory-movement-feature intelligibility discrimination model by strategy fusion to obtain a strategy fusion evaluation result.
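To show how these modules fit together, the following sketch wires the earlier helper functions into one pipeline; the class and method names are illustrative and follow the module description above rather than any implementation disclosed in the patent.

    class PronunciationEvaluationSystem:
        def __init__(self):
            self.fused_models = {}         # fused-feature intelligibility models
            self.acoustic_models = {}      # Acoustic-GMM-HMM per class
            self.articulatory_models = {}  # Articulatory-GMM-HMM per class

        def train(self, data_by_class):
            # data_by_class[k] = list of (acoustic, articulatory) feature pairs for class k
            for k, pairs in data_by_class.items():
                self.acoustic_models[k] = train_class_model([a for a, _ in pairs])
                self.articulatory_models[k] = train_class_model([m for _, m in pairs])
                self.fused_models[k] = train_class_model(
                    [fuse_features(a, m) for a, m in pairs],
                    n_mix=24)   # 24 mixtures for the fused model in this embodiment

        def evaluate(self, acoustic, articulatory, w=0.5):
            feature_fusion_result, _ = classify(
                self.fused_models, fuse_features(acoustic, articulatory))
            strategy_fusion_result = decision_fusion(
                self.acoustic_models, self.articulatory_models,
                acoustic, articulatory, w)
            return feature_fusion_result, strategy_fusion_result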
As shown in FIG. 3 and FIG. 4, in this embodiment the articulatory organs include one or more of the following: tongue, lips, mouth corners and incisors. The tongue sensors are placed at the tongue tip (TT, 1 cm behind the anatomical tongue tip), tongue middle (TM, 3 cm behind the TT sensor) and tongue back (TB, 2 cm behind the TM sensor); the lip sensors are placed at the middle of the upper lip (UL) and the middle of the lower lip (LL); the mouth-corner sensors are placed at the left mouth corner (LM) and the right mouth corner (RM); the incisor sensor is placed on a lower incisor (JA) and is used to track jaw movement. The articulatory organs consist mainly of the lips, teeth, tongue, palate and other parts. The tongue and lips, working closely with the other parts, block the airflow and change the shape of the oral resonator, and therefore play an important role in articulation. The tongue data were therefore analysed first; however, the abnormal tongue movements of dysarthric patients can cause the sensors to fall off during articulation, making it difficult to obtain valid tongue movement data. In this embodiment, the movement data of the lips are therefore used as the main articulatory movement data.
Further, spatial sensors are also placed on the head to detect head movement data, and the articulatory movement data are corrected according to the head movement data. The head positions include one or more of the following: forehead, bridge of the nose, behind the ears; the sensors behind the ears are placed on the mastoid bones behind the ears and serve as a reference for recording head movement.
In this embodiment, the three-dimensional coordinates collected by the spatial sensors are used for the analysis, and a spatial sensor is additionally placed on the bridge of the nose. In step (10), the articulatory movement features are extracted from the movement data by taking the nose-bridge sensor as the coordinate origin and computing the relative distance of each lip sensor from that origin; the three-dimensional coordinate distances x, y, z of the four lip sensors are used as movement features, each sampling point is taken as one frame, and for each frame the articulatory movement feature is extracted according to the following formula:
lip = [x_1 ... x_4, y_1 ... y_4, z_1 ... z_4]^T
where the subscripts of x, y and z denote the upper-lip, lower-lip, left mouth-corner and right mouth-corner movement data, respectively. The articulatory movement feature has 12 dimensions in total.
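A minimal sketch of the 12-dimensional lip feature, assuming the EMA stream provides, per 200 Hz sample, the 3-D position of the nose-bridge reference sensor and of the four lip sensors (upper lip, lower lip, left and right mouth corners):

    import numpy as np

    def lip_features(nose_bridge, lip_sensors):
        """
        nose_bridge: (num_samples, 3) positions of the nose-bridge sensor.
        lip_sensors: (num_samples, 4, 3) positions of the four lip sensors.
        Returns (num_samples, 12) features [x_1..x_4, y_1..y_4, z_1..z_4]
        relative to the nose-bridge origin, one frame per sample.
        """
        rel = lip_sensors - nose_bridge[:, None, :]      # coordinates relative to the origin
        return rel.transpose(0, 2, 1).reshape(len(rel), 12)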
The model training module trains by feeding the acoustic features, the articulatory movement features or the fused features into a Gaussian mixture model-hidden Markov model, obtaining the corresponding acoustic-feature intelligibility discrimination model, articulatory-movement-feature intelligibility discrimination model and fused-feature intelligibility discrimination model.
In this embodiment, the Torgo data set, which contains both audio data and articulatory movement data, is taken as an example to outline the algorithm flow of the whole system; the specific steps are as follows:
1) Input of the Torgo data set
[Table 1: Experimental data set information (table image not reproduced)]
As shown in Table 1, the input to the system covers four intelligibility levels: severe, moderate, mild and normal; the intelligibility level is determined from the diagnosis of speech pathology experts. The numbers of participants in the data set are 3, 2, 2 and 7 respectively, and the numbers of utterance samples are 567, 876, 671 and 4289 respectively.
2) Extraction of data features
[Table 2: Feature extraction conditions (table image not reproduced)]
The EMA device records audio data and articulatory movement data simultaneously; here the acoustic features, the movement features and the fused A-A features are extracted with the settings of Table 2.
3) Training the intelligibility discrimination models
[Table 3: Intelligibility discrimination evaluation results (table image not reproduced)]
After the audio features and the movement features have been obtained, the intelligibility discrimination models are trained with the GMM-HMM method. As shown in the first two columns of Table 3, the GMM-HMM discrimination model using movement features clearly improves the accuracy for speech-impaired speakers, whereas for normal speakers the MFCC-based acoustic features are more accurate. Overall, the GMM-HMM using movement features outperforms the GMM-HMM using acoustic features by 0.56 percentage points on average, showing that movement features are very effective for discriminating the intelligibility of speech-impaired speakers.
4) Model training with feature fusion and decision fusion
[Table 4: Kappa coefficient metrics for intelligibility discrimination (table image not reproduced)]
Considering that the acoustic features discriminate well for normal speakers while the movement features discriminate well for impaired speakers, and in order to make better use of the complementary effect of the two kinds of features, it is proposed to train a GMM-HMM model on the feature-fused A-A features and to perform decision fusion between the acoustic-feature GMM-HMM and the movement-feature GMM-HMM. As shown in the last two columns of Table 3, feature fusion and decision fusion combine the complementary advantages of the two kinds of features and further improve the discrimination performance.
The present invention uses not only audio data but also the articulatory movement data of speech-impaired speakers, judging the intelligibility level of dysarthria from the aspect of articulatory movement. The key point of the articulatory movement data is to extract features from the movements of speech-impaired speakers; analysis of the data shows that the tongue movement data are unstable and hard to obtain, so this embodiment relies mainly on the articulatory movement data of the lips, which can effectively distinguish the intelligibility levels of speech-impaired speakers.
At the same time, in the intelligibility evaluation of speech-impaired speakers, extracting articulatory movement features improves on the traditional method based on acoustic features of audio data, and its feasibility is demonstrated on the Torgo data set in terms of accuracy and the kappa coefficient.
Through feature fusion and decision fusion, the present invention combines traditional speech acoustic features with articulatory movement features, effectively exploits the complementarity of the two kinds of features and ensures the objectivity and comprehensiveness of the evaluation; with the fusion methods, the results show a clear advantage in classifying the degree of intelligibility over using acoustic features alone or articulatory movement features alone.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the system embodiment is basically similar to the method embodiment, its description is relatively brief, and reference may be made to the corresponding description of the method embodiment. Moreover, in this document the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of additional identical elements in the process, method, article or device that comprises it. In addition, a person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above description shows and describes the preferred embodiments of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; it can be used in various other combinations, modifications and environments and can be altered within the scope of the inventive concept described herein through the above teachings or the skill and knowledge of the relevant art. Alterations and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the scope of protection of the appended claims of the present invention.

Claims (12)

  1. A pronunciation evaluation method fusing acoustic features and articulatory movement features, characterised by comprising the following steps:
    (10) collecting audio data and articulatory movement data, and extracting acoustic features from the audio data and articulatory movement features from the articulatory movement data, wherein the audio data and the articulatory movement data correspond to each other in time;
    (20) fusing the acoustic features and the articulatory movement features at the feature level according to their time correspondence to obtain fused features;
    (30) training on the fused features to obtain a fused-feature intelligibility discrimination model;
    (40) using the fused-feature intelligibility discrimination model to obtain a feature fusion evaluation result.
  2. The pronunciation evaluation method fusing acoustic features and articulatory movement features according to claim 1, characterised in that: an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model are further trained separately from the acoustic features and the articulatory movement features, and the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model are combined by strategy fusion to obtain a strategy fusion evaluation result.
  3. The pronunciation evaluation method fusing acoustic features and articulatory movement features according to claim 1 or 2, characterised in that: in step (10), the audio data and the articulatory movement data are collected with an electromagnetic articulography system; spatial sensors are placed on the articulatory organs and the three-dimensional spatial coordinates and angles of the spatial sensors in the magnetic field are calculated to obtain the articulatory movement data, and the audio data corresponding in time is collected while the articulatory movement data is collected; wherein the articulatory organs include the lips and the articulatory movement data include lip movement data.
  4. The pronunciation evaluation method fusing acoustic features and articulatory movement features according to claim 3, characterised in that: a spatial sensor is further placed on the bridge of the nose, and in step (10) the articulatory movement features are extracted from the articulatory movement data by taking the nose-bridge spatial sensor as the coordinate origin and calculating the relative distance of the spatial sensors on the lips from the coordinate origin; the three-dimensional coordinate distances x, y, z of the four spatial sensors on the lips are taken as movement features, each sampling point is taken as one frame, and the articulatory movement feature of each frame of data is extracted according to the following formula:
    lip = [x_1 ... x_4, y_1 ... y_4, z_1 ... z_4]^T
    wherein the subscripts of x, y and z denote the upper-lip movement data, the lower-lip movement data, the left mouth-corner movement data and the right mouth-corner movement data, respectively.
  5. The pronunciation evaluation method fusing acoustic features and articulatory movement features according to claim 1 or 2, characterised in that: the feature fusion in step (20) sets the window lengths of the acoustic features and the articulatory movement features according to the sampling rates of the audio data and the articulatory movement data, sets the window shift according to the window lengths, and fuses the acoustic features and the articulatory movement features at the feature level with the window shift.
  6. The pronunciation evaluation method fusing acoustic features and articulatory movement features according to claim 2, characterised in that: the strategy fusion assigns different weights to the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model and computes the strategy fusion evaluation result according to the weights, calculated as follows:
    LL = argmax_k [ w · LL_A(k) + (1 − w) · LL_M(k) ]
    wherein LL denotes the strategy fusion evaluation result, LL_A(k) denotes the evaluation result of the acoustic-feature intelligibility discrimination model, LL_M(k) denotes the evaluation result of the articulatory-movement-feature intelligibility discrimination model, k denotes the class of the evaluation result, w denotes the weight, and the argmax function denotes finding the parameter with the highest score.
  7. A pronunciation evaluation system fusing acoustic features and articulatory movement features, characterised by comprising:
    a feature extraction module, configured to collect audio data and articulatory movement data, and to extract acoustic features from the audio data and articulatory movement features from the articulatory movement data, wherein the audio data and the articulatory movement data correspond to each other in time;
    a feature fusion module, which fuses the acoustic features and the articulatory movement features according to their time correspondence to obtain fused features;
    a model training module, which trains on the fused features to obtain a fused-feature intelligibility discrimination model;
    a pronunciation evaluation module, which uses the fused-feature intelligibility discrimination model to obtain a feature fusion evaluation result.
  8. The pronunciation evaluation system fusing acoustic features and articulatory movement features according to claim 7, characterised by further comprising a strategy fusion module;
    the model training module further trains separately on the acoustic features and the articulatory movement features to obtain an acoustic-feature intelligibility discrimination model and an articulatory-movement-feature intelligibility discrimination model;
    the strategy fusion module combines the evaluation result of the acoustic-feature intelligibility discrimination model and the evaluation result of the articulatory-movement-feature intelligibility discrimination model by strategy fusion to obtain a strategy fusion evaluation result.
  9. The pronunciation evaluation system fusing acoustic features and articulatory movement features according to claim 7 or 8, characterised by further comprising a data acquisition module, which collects the audio data and the articulatory movement data with an electromagnetic articulography system: spatial sensors are placed on the articulatory organs and the three-dimensional spatial coordinates and angles of the spatial sensors in the magnetic field are calculated to obtain the articulatory movement data, and the audio data corresponding in time is collected while the articulatory movement data is collected.
  10. The pronunciation evaluation method fusing acoustic features and articulatory movement features according to claim 9, characterised in that the articulatory organs include one or more of the following: tongue, lips, mouth corners, incisors; the tongue spatial sensors are placed at the tongue tip, tongue middle and tongue back; the lip spatial sensors are placed at the middle of the upper lip and the middle of the lower lip; the mouth-corner spatial sensors are placed at the left mouth corner and the right mouth corner; the incisor spatial sensor is placed on a lower incisor and is used to track the movement of the lower jaw.
  11. The pronunciation evaluation method fusing acoustic features and articulatory movement features according to claim 9, characterised by further comprising placing spatial sensors on the head to detect head movement data and correcting the articulatory movement data according to the head movement data; the head positions include one or more of the following: forehead, bridge of the nose, behind the ears; wherein the spatial sensors behind the ears are placed on the mastoid bones behind the ears.
  12. The pronunciation evaluation system fusing acoustic features and articulatory movement features according to claim 7 or 8, characterised in that the model training module trains by feeding the acoustic features, the articulatory movement features or the fused features into a Gaussian mixture model-hidden Markov model, obtaining the corresponding acoustic-feature intelligibility discrimination model, articulatory-movement-feature intelligibility discrimination model and fused-feature intelligibility discrimination model.
PCT/CN2018/105942 2017-08-17 2018-09-17 Pronunciation evaluation method and system fusing acoustic features and articulatory movement features WO2019034184A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/616,459 US11786171B2 (en) 2017-08-17 2018-09-17 Method and system for articulation evaluation by fusing acoustic features and articulatory movement features

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710708049.6A CN107578772A (zh) 2017-08-17 2017-08-17 融合声学特征和发音运动特征的发音评估方法和系统
CN201710708049.6 2017-08-17

Publications (1)

Publication Number Publication Date
WO2019034184A1 true WO2019034184A1 (zh) 2019-02-21

Family

ID=61034267

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/105942 WO2019034184A1 (zh) 2017-08-17 2018-09-17 融合声学特征和发音运动特征的发音评估方法和系统

Country Status (3)

Country Link
US (1) US11786171B2 (zh)
CN (1) CN107578772A (zh)
WO (1) WO2019034184A1 (zh)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578772A (zh) 2017-08-17 2018-01-12 天津快商通信息技术有限责任公司 Pronunciation evaluation method and system fusing acoustic features and articulatory movement features
CN108922563B (zh) * 2018-06-17 2019-09-24 海南大学 Oral language learning correction method based on visualisation of deviated organ morphology and behaviour
CN109360645B (zh) * 2018-08-01 2021-06-11 太原理工大学 Statistical classification method for the distribution of abnormal articulatory movements in dysarthria
EP3618061B1 (en) * 2018-08-30 2022-04-27 Tata Consultancy Services Limited Method and system for improving recognition of disordered speech
CN109697976B (zh) * 2018-12-14 2021-05-25 北京葡萄智学科技有限公司 Pronunciation recognition method and device
CN111951828A (zh) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation evaluation method, apparatus, system, medium and computing device
CN110223671B (zh) * 2019-06-06 2021-08-10 标贝(深圳)科技有限公司 Method, apparatus, system and storage medium for predicting speech prosodic boundaries
CN112927696A (zh) * 2019-12-05 2021-06-08 中国科学院深圳先进技术研究院 Automatic dysarthria assessment system and method based on speech recognition
CN111210838B (zh) * 2019-12-05 2023-09-15 中国船舶工业综合技术经济研究院 Method for evaluating speech cognitive ability
CN113496696A (zh) * 2020-04-03 2021-10-12 中国科学院深圳先进技术研究院 Automatic speech function assessment system and method based on speech recognition
CN113314100B (zh) * 2021-07-29 2021-10-08 腾讯科技(深圳)有限公司 Method, apparatus, device and storage medium for evaluating a spoken-language test and displaying its results

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101292281A (zh) * 2005-09-29 2008-10-22 独立行政法人产业技术综合研究所 Pronunciation diagnosis device, pronunciation diagnosis method, storage medium, and pronunciation diagnosis program
CN102063903A (zh) * 2010-09-25 2011-05-18 中国科学院深圳先进技术研究院 Speech interaction training system and method
JP2012088675A (ja) * 2010-10-19 2012-05-10 Inokuma Kazuhito Language pronunciation learning device with speech analysis function and system thereof
CN102663928A (zh) * 2012-03-07 2012-09-12 天津大学 Electronic teaching method for deaf people learning to speak
CN103218924A (zh) * 2013-03-29 2013-07-24 上海众实科技发展有限公司 Spoken-language learning monitoring method based on audio-video bimodality
WO2015030471A1 (en) * 2013-08-26 2015-03-05 Seli Innovations Inc. Pronunciation correction apparatus and method thereof
CN106409030A (zh) * 2016-12-08 2017-02-15 河南牧业经济学院 Personalised spoken foreign language learning system
CN107578772A (zh) * 2017-08-17 2018-01-12 天津快商通信息技术有限责任公司 Pronunciation evaluation method and system fusing acoustic features and articulatory movement features

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4836218A (en) * 1984-04-09 1989-06-06 Arthrotek, Inc. Method and apparatus for the acoustic detection and analysis of joint disorders
CN1924994B (zh) * 2005-08-31 2010-11-03 中国科学院自动化研究所 Embedded speech synthesis method and system
KR101035768B1 (ko) * 2009-01-02 2011-05-20 전남대학교산학협력단 Method and apparatus for setting a lip region for lip reading
US8913103B1 (en) * 2012-02-01 2014-12-16 Google Inc. Method and apparatus for focus-of-attention control
US9159321B2 (en) * 2012-02-27 2015-10-13 Hong Kong Baptist University Lip-password based speaker verification system
US20140365221A1 (en) * 2012-07-31 2014-12-11 Novospeech Ltd. Method and apparatus for speech recognition
US9911358B2 (en) * 2013-05-20 2018-03-06 Georgia Tech Research Corporation Wireless real-time tongue tracking for speech impairment diagnosis, speech therapy with audiovisual biofeedback, and silent speech interfaces
WO2014194439A1 (en) * 2013-06-04 2014-12-11 Intel Corporation Avatar-based video encoding
JP2016129661A (ja) * 2015-01-09 2016-07-21 パナソニックIpマネジメント株式会社 Determination system, control signal output system, rehabilitation system, determination method, control signal output method, computer program, and electroencephalogram signal acquisition system
US10888265B2 (en) * 2015-10-07 2021-01-12 Donna Edwards Jaw function measurement apparatus
EP3226570A1 (en) * 2016-03-31 2017-10-04 Thomson Licensing Synchronizing audio and video signals rendered on different devices

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101292281A (zh) * 2005-09-29 2008-10-22 独立行政法人产业技术综合研究所 Pronunciation diagnosis device, pronunciation diagnosis method, storage medium, and pronunciation diagnosis program
CN102063903A (zh) * 2010-09-25 2011-05-18 中国科学院深圳先进技术研究院 Speech interaction training system and method
JP2012088675A (ja) * 2010-10-19 2012-05-10 Inokuma Kazuhito Language pronunciation learning device with speech analysis function and system thereof
CN102663928A (zh) * 2012-03-07 2012-09-12 天津大学 Electronic teaching method for deaf people learning to speak
CN103218924A (zh) * 2013-03-29 2013-07-24 上海众实科技发展有限公司 Spoken-language learning monitoring method based on audio-video bimodality
WO2015030471A1 (en) * 2013-08-26 2015-03-05 Seli Innovations Inc. Pronunciation correction apparatus and method thereof
CN106409030A (zh) * 2016-12-08 2017-02-15 河南牧业经济学院 Personalised spoken foreign language learning system
CN107578772A (zh) * 2017-08-17 2018-01-12 天津快商通信息技术有限责任公司 Pronunciation evaluation method and system fusing acoustic features and articulatory movement features

Also Published As

Publication number Publication date
US11786171B2 (en) 2023-10-17
US20200178883A1 (en) 2020-06-11
CN107578772A (zh) 2018-01-12

Similar Documents

Publication Publication Date Title
WO2019034184A1 (zh) Pronunciation evaluation method and system fusing acoustic features and articulatory movement features
Rudzicz et al. The TORGO database of acoustic and articulatory speech from speakers with dysarthria
Wang et al. Articulatory distinctiveness of vowels and consonants: A data-driven approach
Jin et al. Adventitious sounds identification and extraction using temporal–spectral dominance-based features
Golabbakhsh et al. Automatic identification of hypernasality in normal and cleft lip and palate patients with acoustic analysis of speech
Zañartu et al. Subglottal impedance-based inverse filtering of voiced sounds using neck surface acceleration
Wang et al. Phoneme-level articulatory animation in pronunciation training
JP2003255993A (ja) Speech recognition system, speech recognition method, speech recognition program, speech synthesis system, speech synthesis method, and speech synthesis program
US10959661B2 (en) Quantification of bulbar function
US20150351663A1 (en) Determining apnea-hypopnia index ahi from speech
CN105023573A (zh) Speech syllable/vowel/phoneme boundary detection using auditory attention cues
Whitfield et al. Examining acoustic and kinematic measures of articulatory working space: Effects of speech intensity
CN105448291A (zh) Speech-based Parkinson's disease detection method and detection system
El Emary et al. Towards developing a voice pathologies detection system
Vojtech et al. Refining algorithmic estimation of relative fundamental frequency: Accounting for sample characteristics and fundamental frequency estimation method
Diaz-Cadiz et al. Adductory vocal fold kinematic trajectories during conventional versus high-speed videoendoscopy
CN115346561B (zh) Method and system for assessing and predicting depressive mood based on speech features
TWI749663B (zh) Method and system for vocalisation monitoring
Ribeiro et al. Speaker-independent classification of phonetic segments from raw ultrasound in child speech
JP4381404B2 (ja) Speech synthesis system, speech synthesis method, and speech synthesis program
Sofwan et al. Normal and Murmur Heart Sound Classification Using Linear Predictive Coding and k-Nearest Neighbor Methods
Lee et al. An exploratory study of emotional speech production using functional data analysis techniques
Talkar et al. Acoustic Indicators of Speech Motor Coordination in Adults With and Without Traumatic Brain Injury.
Sebkhi et al. Evaluation of a wireless tongue tracking system on the identification of phoneme landmarks
Berger Measurement of vowel nasalization by multi-dimensional acoustic analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18846909

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18846909

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.10.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18846909

Country of ref document: EP

Kind code of ref document: A1