CN112863263B - Korean pronunciation correction system based on big data mining technology - Google Patents
- Publication number
- CN112863263B CN112863263B CN202110060609.8A CN202110060609A CN112863263B CN 112863263 B CN112863263 B CN 112863263B CN 202110060609 A CN202110060609 A CN 202110060609A CN 112863263 B CN112863263 B CN 112863263B
- Authority
- CN
- China
- Prior art keywords
- pronunciation
- tongue
- korean
- signal
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/04—Electrically-operated educational appliances with audible presentation of the material to be studied
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B19/00—Teaching not covered by other main groups of this subclass
- G09B19/06—Foreign languages
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
- G10L2025/906—Pitch tracking
Abstract
The invention relates to a Korean pronunciation correction system based on big data mining technology. Sensors detect the formant frequencies and the positional changes of the tongue and jaw during pronunciation to determine the jaw articulation parameters related to pitch. Magnetic resonance imaging and electropalatography data capture the three-dimensional vocal-tract geometry of approximant consonants during pronunciation, and the system guides dynamic adjustment of the learner's mandible, tongue, and larynx movements according to the actual phoneme string and the standard pronunciation.
Description
Technical Field
The invention relates to the field of language learning, in particular to a Korean pronunciation correction system based on a big data mining technology.
Background Art
For historical reasons, Korean has been strongly influenced by Chinese, and the two languages share many similarities. This similarity is a great convenience for Koreans learning Chinese, but it also produces substantial negative transfer. Although many Korean pronunciations resemble Chinese ones, most obviously in Sino-Korean words, the articulation methods and places of articulation in fact differ considerably. These differences confront Korean students with obstacles that are hard to overcome when learning Chinese and complicate the teaching of Chinese phonetics to Korean speakers. It is therefore necessary to study the consonant differences between Chinese and Korean pronunciation and to develop corresponding teaching strategies.
Consonants are sounds formed by significant obstruction of the airflow at the place of articulation. Consonants in Chinese and Korean differ in articulation method, place of articulation, and articulatory strength. The consonant systems of Mandarin Chinese and Korean do not correspond one to one. Some sounds exist in Mandarin but not in Korean, such as f; some sounds appear to share a place and manner of articulation but are in fact pronounced differently, such as the Korean counterparts of g and k; and some sounds exist in Korean but not in Chinese. Korean also has tense consonants, which are distinguished from the lax consonants by a stronger airflow. In addition, the Korean consonant system contains a laryngeal sound, a nasal sound, and a flap (the original Korean glyphs are not reproduced here) that do not exist in Chinese: the nasal is silent at the beginning of a syllable, the laryngeal sound is pronounced like h, and the flap, when it occurs as a syllable final, is pronounced similarly to r.
In the learning process, learners often depend heavily on their native language. Learners generally prefer to approach the second language through their first, substituting similar native sounds for target-language sounds or reasoning about the target language in native-language terms, which causes errors. (1) Similar sounds cause errors: because Mandarin and Korean are inherently similar, substitution is common, as with the approximations above, e.g. replacing g and k with similar Korean sounds. (2) Sounds absent from the native language are replaced by native sounds, e.g. substituting the Korean laryngeal sound for h, or a Korean pronunciation for l or r. (3) Sound changes within Korean itself cause errors. Learning Mandarin through native-language phonological reasoning therefore also leads to errors.
In summary, understanding the relationship between articulatory characteristics and acoustic signals is crucial to solving the articulatory inversion problem.
Disclosure of Invention
The invention provides a Korean pronunciation correction system based on big data mining technology that detects and automatically corrects spoken Korean pronunciation errors and provides technical support for students learning Korean.
A Korean pronunciation correction system based on big data mining technology comprises an audio signal acquisition module, a data analysis module, a correction module, a control module, a terminal module, and a cloud module. The signal transmission device comprises a vocal cord vibration sensor and an electromagnetic sensor. The electromagnetic sensor, a wearable permanent-magnet tracer, captures the movements of the tongue and jaw for speech recognition; a magnetic sensor array tracks the tongue's movement wirelessly, and ultrasonic imaging measures the coordinates and curvature of the tongue to represent it during speech. Meanwhile, the formant frequencies of vowels in the articulation model are estimated from the combination of the mandible, tongue, and larynx. The data analysis module optimizes the first two formants of Korean vowels and consonants through the following steps:
S1. For vowels, the first formant, denoted F1, is inversely proportional to the tongue height h, i.e. F1 ∝ 1/h.
The second formant, denoted F2, is inversely proportional to the horizontal advancement l of the tongue during vowel production, i.e. F2 ∝ 1/l.
The oral cavity is treated as a tube model acting as a resonator, and the model is modified accordingly. In the resulting expressions, β1 and β2 (β1, β2 ∈ ℝ) are the constants that best fit the formant response of the tongue-based vowel articulation system, and c is the speed of sound, c = 340 m/s.
S2. Determine the values of β1 and β2, which are computed from formant values of the oral system measured experimentally with the permanent-magnet tracer. To improve accuracy, a loss function between the estimated formants and those of the tongue articulation system is evaluated with the mean squared error; the partial derivatives of the loss with respect to β1 and β2 are then computed and used to update their current values.
S3. First- and second-formant expressions are given for each of the lax, tense, and aspirated consonants.
In these formulas, γ1 and γ2 are the constants that best fit the formant response of the tongue-based consonant articulation system, c is the speed of sound, B is the burst release time, and Duration is the duration of the pronunciation;
S4. The simplified tongue-based oral system is cascaded with the laryngeal system to give the calculation formula of the vocal tract system. The transfer function of the k-th formant frequency of the vocal tract system is written V_k(z), and the transfer functions of the formant frequencies of the laryngeal and tongue systems are written L_k(z).
A1 and A2 represent the formant frequencies of the laryngeal and tongue articulation systems respectively, T represents the duration of each formant, z denotes the formant bandwidth, and F_ik denotes the terms obtained for the different values of i and k.
S5. The correction module acquires the formant frequencies and the positional changes of the tongue and jaw through the sensors to determine the jaw articulation parameter related to pitch. During pronunciation, acoustic and electromyographic analyses are performed; magnetic resonance imaging and electropalatography data capture the three-dimensional vocal-tract geometry of approximant consonants; and the movements of the learner's mandible, tongue, and larynx are dynamically adjusted according to the actual phoneme string and the standard pronunciation.
Furthermore, introducing error-elimination calculation enables high-precision correction of spoken pronunciation. First, data processing and error calculation are performed as follows:
where the error E is the error threshold, H is the extreme value of the vibration trough, C is the effective period law of the audio, D is a constant frequency parameter, and PAH is the standard amplitude of Korean speech;
the collected Korean spoken utterances are normalized:
where η_E is the discrete function value in the Korean pronunciation process, n is the weight of the discrete function value, T represents the hop count between two audio nodes, and d_ij represents the shortest path between audio nodes i and j;
the pronunciation is corrected as follows:
V_i = R · U_i · (A^T · S^(-1))^(-1)
where A^T is the natural skewness of the audio, a parameter for measuring the note; S^(-1) is the combination of audio attributes, a function parameter for audio proofreading; R is the weight applied to the high-grade audio; U_i is the measurement of the audio; and V_i is the audio error protection limit.
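Since the patent gives only symbol names for these parameters, the correction formula can at least be exercised as scalar arithmetic. The sketch below is purely illustrative; every value is made up and carries no claimed unit or range:

```python
# Purely illustrative scalar reading of V_i = R * U_i * (A^T * S^-1)^-1.
# All parameter values are invented for demonstration only.
A_T = 0.6      # "natural skewness" of the audio (hypothetical value)
S = 1.2        # audio-attribute combination parameter (hypothetical value)
R = 1.5        # weight applied to the high-grade audio (hypothetical value)
U_i = 0.9      # measurement of audio frame i (hypothetical value)

# For scalars, (A_T * S^-1)^-1 collapses to S / A_T.
V_i = R * U_i * (A_T / S) ** -1
print(round(V_i, 3))   # 1.5 * 0.9 * 2.0 = 2.7
```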
Further, the vocal cord vibration sensor comprises a speech-signal acquisition sensor array, and the frequency-domain feature for Korean speech signal detection is V(t, θ), that is:
where ω_i(θ) is the instantaneous time-domain weighting vector of the i-th Korean pronunciation output, x_i(t) is the instantaneous time-domain signal component of the Korean pronunciation output, θ is a speech-signal parameter, * denotes the conjugation operator, and the sensor index runs up to a maximum of M;
Time-domain matching and filtering of the speech signal are performed with an adaptive beamforming method. The frequency-domain characteristic of the output signal is:
V(t, θ) = x^H(t) · ω(θ)
where H denotes the complex conjugate transpose;
The weighting vector and the components of the instantaneous time-domain signal of the Korean speech output can be expressed as:

x(t) = [x_1(t), x_2(t), …, x_M(t)]^T

ω(θ) = [ω_1(θ), ω_2(θ), …, ω_M(θ)]^T;
Adaptive filtering and blind source separation are combined to decompose the speech signal and obtain the FM components of Korean speech detection, output as:

T_m(θ) = (m − 1) · T_0(θ);
where T_0(θ) represents the initial FM component. Combined with the sensor-array signal processing method, the signal model for Korean pronunciation error detection is obtained as:
where g_m are the calculation coefficients and n_m(t) is an auxiliary parameter.
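The beamformer output form V(t, θ) = x^H(t)·ω(θ) above can be sketched numerically. The example below assumes a hypothetical 4-element uniform linear array with simple delay-and-sum weights; the array geometry, spacing, and analysis frequency are illustrative choices, not taken from the patent:

```python
import numpy as np

# Hypothetical 4-sensor uniform linear array; spacing and frequency are assumptions.
M, d, c, f = 4, 0.05, 340.0, 1000.0   # sensors, spacing (m), speed of sound (m/s), Hz

def steering(theta):
    # Per-sensor phase of a plane wave arriving from direction theta (radians)
    tau = d * np.arange(M) * np.sin(theta) / c
    return np.exp(-2j * np.pi * f * tau)

def beam_output(x, theta):
    # V(t, theta) = x^H(t) * omega(theta), with delay-and-sum weights omega = steering / M
    return np.vdot(x, steering(theta) / M)   # np.vdot conjugates its first argument

x = steering(np.deg2rad(30.0))                # snapshot x(t) from a source at 30 degrees
print(abs(beam_output(x, np.deg2rad(30.0))))  # steered at the source: magnitude 1.0
print(abs(beam_output(x, np.deg2rad(-60.0)))) # steered away: attenuated
```

Steering toward the true source direction aligns the sensor phases and yields unit gain; mismatched directions partially cancel, which is the time-domain matching the text describes.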
Furthermore, the audio signal acquisition module comprises a signal transmission device, an audio signal modulator, a demodulator and a voice acquisition device.
Furthermore, the audio signal modulator modulates a low-frequency digital signal into a high-frequency digital signal by digital signal processing and transmits it. The modulator and demodulator are used as a pair: the modulator converts the digital signal into a high-frequency signal for transmission, and the demodulator restores it to the original signal.
Further, the demodulator restores the low-frequency digital signal that was modulated onto the high-frequency digital signal.
Furthermore, the control module consists of a program counter, an instruction register, an instruction decoder, a timing generator, and an operation controller, and is used to issue commands and to coordinate and direct the operation of the whole system.
Further, the terminal module comprises a client UI module and a visualization module, and the client UI module is suitable for collecting terminal user information.
Further, the cloud module comprises a signal receiving module and a database of Korean standard pronunciations and of the oral and laryngeal systems.
During pronunciation, the sensors detect the formant frequencies and the positional changes of the tongue and jaw to determine the jaw articulation parameters related to pitch. Acoustic and electromyographic analyses are performed, magnetic resonance imaging and electropalatography data are used to capture the three-dimensional vocal-tract geometry of approximant consonants, and the movements of the learner's mandible, tongue, and larynx are dynamically adjusted according to the actual phoneme string and the standard pronunciation.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The Korean pronunciation error correction system recognizes spoken Korean and detects and automatically corrects spoken pronunciation errors. Spoken pronunciation is the first step in learning Korean and the foundation of the whole process. The first task in learning Korean is memorizing words, and the first task in memorizing a word is memorizing its pronunciation. Correct spoken pronunciation habits also greatly improve listening comprehension: a learner with idiosyncratic pronunciation may fail to recognize even familiar words in another speaker's correct pronunciation, which hinders spoken interaction in Korean. Accurate Korean pronunciation is therefore very important for students' listening skills.
The system hardware architecture is constructed according to the requirements of the Korean spoken language pronunciation error automatic error correction system, and comprises an audio signal acquisition module, a data analysis module, a correction module, a control module, a terminal module and a cloud module.
The audio signal modulator is a device that modulates a low-frequency digital signal into a high-frequency digital signal by digital signal processing and transmits it; it is used together with a demodulator, which restores the digital signal to the original signal. The main function of the voice collector is to capture spoken Korean pronunciation. The controller mainly comprises a program counter, an instruction register, an instruction decoder, a timing generator, and an operation controller; it issues commands and coordinates and directs the operation of the entire system, acting as the "decision-making center".
A traditional spoken-pronunciation correction system extracts features from the speech signal, recognizes the information, and compares the extracted voiceprint with a standard voiceprint, but it does not correct pronunciation on the basis of the articulation mechanism. The present invention studies the speech system and, through a signal transmission device worn at the neck, lets users sense and detect the muscle movement patterns of their own vocal organs (including the lips, jaw, tongue, and teeth) during phonation so as to correct and adjust it. The speech system records the activity of the articulation system (including the facial muscles), detects speech-signal synthesis using electromagnetic signals, and determines the acoustic properties of the articulation map by describing the articulation trajectories of the mandible, lips, tongue body, and tongue tip.
The vocal cord vibration device is located at the larynx and captures sensor signals, which are sent to the control system to detect the periodic vibrations associated with phonation. Meanwhile, the electromagnetic sensor is attached to the face and records pulses, while the tongue-and-ear interface is a wearable system that captures the movements of the tongue and jaw for speech recognition.
In the present invention, the tongue's behavior in vowel production is considered the primary factor in generating speech through the mouth. A wearable permanent-magnet tracer is fixed on the tongue, and a magnetic sensor array tracks the tongue's movement wirelessly; the wearable system is physically non-invasive. Ultrasonic imaging measures the coordinates and curvature of the tongue to represent it during speech, while the formant frequencies of vowels in the articulation model are estimated from the combination of the mandible, tongue, and larynx. Vowel formant frequency values were statistically derived from recorded speech of ten thousand Korean speakers and associated with their tongue curvatures, obtained by ultrasonically analyzing the resonance mechanism of the oral vocal-tract system. From the relationship between the tongue coordinates and the formant frequencies it is concluded that the first formant frequency depends on the tongue height and the second formant depends on the horizontal advancement of the tongue.
During pronunciation, the sensors detect the formant frequencies and the positional changes of the tongue and jaw to determine the jaw articulation parameters related to pitch. Acoustic and electromyographic analyses are performed, and magnetic resonance imaging and electropalatography data are used to capture the three-dimensional vocal-tract geometry of approximant consonants.
The first formant is inversely proportional to tongue height, and the second formant frequency is related to the size of the front cavity, that is, the degree of tongue advancement, based on the displayed tongue and lip positions. Formant frequencies are speaker-dependent and vary with gender and age. The invention derives an optimized statistical formula for vowel formant frequencies from the accumulated vowel results and extends it to consonants; all of this research is based on mapping tongue motion during vowel and consonant articulation. The tongue-based oral-cavity statistical model provided by the invention is linked to the laryngeal model and compared in detail with speech generated by the vocal-tract model. The algorithm is based on a formant expression and is applicable to vowel and consonant generation across age groups and genders.
The invention provides an optimized statistical relationship for the first two formants of Korean vowels and consonants, defines an age- and gender-independent speech generation system using human tongue motion, and associates the tongue articulation system with a known laryngeal model.
When the vocal cords close abruptly, a pulse-like excitation at the vibration source closes the glottis; at this stage the subglottal and supraglottal regions are separated, the effective length of the vocal tract is reduced, and resonance is produced only by the supraglottal portion. This variation in vocal-tract length changes the dominant resonances of the spectrum. Extracting the resonance frequencies and their associated bandwidths accurately is difficult, because they vary continuously with the changing vocal-tract shape, not only across pitch periods but also within a pitch period (i.e., from the closed phase to the open phase of the glottis), so resonance bandwidths must be estimated carefully from short speech segments. When the speech spectrum is decomposed into amplitude and phase components, the prominent resonance locations and their associated bandwidths are called formants. During vowels, the first two formants of the oral system are inversely proportional to tongue height and tongue advancement, respectively. Statistical estimation is performed by mapping tongue direction features using a vocal-tract synthesizer and vowel-space theory. The vocal-tract shape and the vowel quadrilateral are displayed in pairs for each vowel. In vowel-space theory the pattern is a quadrilateral whose horizontal axis l represents tongue advancement (front, central, back), describing how far forward the tongue is raised during vowel articulation, and whose sloping axis h represents tongue height (close, mid, open).
The first formant, denoted F1, is inversely proportional to the tongue height h for vowel production, i.e. F1 ∝ 1/h.
The second formant, denoted F2, is inversely proportional to the horizontal advancement l of the tongue for vowel production, i.e. F2 ∝ 1/l.
The oral cavity is treated as a tube model and assumed to be a resonator; correcting the model yields expressions in which β1 and β2 (β1, β2 ∈ ℝ) are the constants that best fit the formant response of the tongue-based vowel articulation system, and c is the speed of sound, c = 340 m/s.
The next step is to determine the values of β1 and β2, which are computed from formant values of the oral system measured experimentally with the permanent-magnet tracer. To improve accuracy, a loss function between the estimated formants and those of the tongue articulation system is evaluated with the mean squared error; the partial derivatives of the loss are then computed and used to update the current values of β1 and β2.
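The gradient-descent update of the constants can be sketched under an assumed model. The text states only that the first formant is inversely proportional to tongue height, so the sketch below fits a hypothetical F1_est = β1·c/h to synthetic measurements by minimizing the mean squared error; the model form, data values, and learning rate are all assumptions:

```python
import numpy as np

# Assumed model: F1_est = beta1 * c / h (the patent's exact formula is not reproduced here).
c = 340.0                                   # speed of sound, m/s (as in the text)
h = np.array([0.02, 0.03, 0.04, 0.05])      # hypothetical tongue heights (m)
f1_measured = 0.025 * c / h                 # synthetic "measured" formants (true beta1 = 0.025)

beta1, lr = 0.01, 1e-9                      # initial guess and learning rate (assumptions)
for _ in range(200):
    residual = beta1 * c / h - f1_measured  # estimation error per sample
    grad = np.mean(2.0 * residual * c / h)  # partial derivative of the MSE w.r.t. beta1
    beta1 -= lr * grad                      # gradient-descent update

print(beta1)                                # converges to ~0.025
```

The same loop extends to β2 with the advancement l in place of h; with real tracer measurements the loss would compare estimated against measured formants exactly as the text describes.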
For consonant articulation, the position and movement of the tongue are represented by the relationship between the tongue height h and the horizontal advancement l of the consonant. In a manner similar to vowels, a relationship is established between the tongue height h of the consonant quadrilateral and the horizontal advancement l of the tongue, and a statistical formula for the consonant oral-cavity formants is obtained and optimized by gradient descent. Consonants are described and distinguished by a place-and-manner system, on the basis of which they are divided into three distinct groups: lax, tense, and aspirated. In terms of the acoustic properties of consonants, the first and second formants are affected by the size of the constriction, the manner of articulation (tongue height) and the burst (sudden release of air), the position of the tongue, voicing, and tongue advancement.
The first formants of the relaxing tone, the relaxing tone and the air supply tone are respectively expressed as follows:
the second formants of the lax, tense and aspirated consonants are respectively expressed as:
in the formula, γ1 and γ2 are the constants that best fit the formant response of the given tongue consonant pronunciation system, c is the speed of sound, B is the burst release time, and Duration is the duration of pronunciation.
After the formants of the complete set of vowels and consonants are established, the present invention proposes a new method of quantifying speech intelligibility from these results, based on the differences between the first two formants of the tongue pronunciation system.
The vocal tract model includes the lungs (glottal source), the larynx, and a single-conduit oral cavity. The lungs act as the motive force, providing airflow to the larynx. The larynx regulates the airflow from the lungs and provides a periodic or noisy airflow source. The output is therefore a modulated airflow obtained by spectrally shaping that source. A calculation formula for the vocal tract system is developed by cascading the simplified tongue-based oral system (tongue articulatory system) with the laryngeal system; the transfer function of the vocal tract formant frequencies is denoted V(z)k, and the transfer functions of the formant frequencies of the laryngeal system and the tongue are denoted L(z)k and O(z)k:
A1 and A2 represent the formant frequencies of the laryngeal and tongue articulatory systems respectively, T represents the duration of each formant, z represents the bandwidth of the formant, and Fik denotes the terms for differing values of i and k.
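The cascade of the laryngeal and oral subsystems can be illustrated with standard two-pole digital formant resonators, multiplying the subsystem responses just as the text multiplies the transfer functions. The resonator form below is a common approximation, not necessarily the patent's exact expressions:

```python
# Cascade of the laryngeal and tongue (oral) systems as digital resonators.
# Each formant is modelled as the standard two-pole resonator
# H(z) = 1 / (1 - 2 r cos(w0) z^-1 + r^2 z^-2); the overall vocal-tract
# response is the product of the subsystem responses, as in the text.

import cmath
import math

def resonator_response(freq_hz, formant_hz, bandwidth_hz, fs=16000):
    """Magnitude response of one two-pole formant resonator at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs)  # pole radius from bandwidth
    w0 = 2 * math.pi * formant_hz / fs          # pole angle from frequency
    z = cmath.exp(1j * 2 * math.pi * freq_hz / fs)
    h = 1.0 / (1 - 2 * r * math.cos(w0) / z + (r ** 2) / z ** 2)
    return abs(h)

def cascade_response(freq_hz, formants):
    """Cascaded response: product of each subsystem's resonator response."""
    resp = 1.0
    for f0, bw in formants:
        resp *= resonator_response(freq_hz, f0, bw)
    return resp

# The cascaded gain peaks near the formant frequencies (500 Hz and 1500 Hz
# here) and drops between them.
peak = cascade_response(500, [(500, 80), (1500, 120)])
off = cascade_response(1000, [(500, 80), (1500, 120)])
print(peak > off)  # True
```

Formant frequencies and bandwidths here are illustrative placeholders for the A and z values defined above.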
In addition, the formant bandwidth obtained by short-time processing approximates the instantaneous bandwidth of each formant, so formants can be extracted from the instantaneous bandwidth as well as from the amplitude component. The formant bandwidth is determined by decomposing the speech signal through a bank of band-pass filters and then demodulating each band to obtain an amplitude envelope and an instantaneous-frequency signal. The bandwidths of the formants are then extracted from these instantaneous-frequency signals using an energy separation algorithm; the bandwidth values are normalized with respect to their maximum and plotted as histogram curves, and the bandwidth at the dominant resonance frequency of the spectral response is extracted from short speech segments to highlight the variation in bandwidth across vowel and consonant segments.
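The demodulation step for one band-pass channel can be sketched with the DESA-2 energy separation estimator, one common realization of the "energy separation algorithm" named above; the patent does not specify which variant it uses:

```python
# Energy-separation (Teager) demodulation of one band-pass channel into an
# instantaneous frequency and amplitude, using the standard DESA-2 estimator.
# This is offered as a plausible sketch of the step described in the text,
# not the patent's exact algorithm.

import math

def teager(x, n):
    """Discrete Teager energy operator Psi[x](n) = x(n)^2 - x(n-1)x(n+1)."""
    return x[n] ** 2 - x[n - 1] * x[n + 1]

def desa2(x, n):
    """Instantaneous frequency (rad/sample) and amplitude of x at index n."""
    z = [x[k + 1] - x[k - 1] for k in range(1, len(x) - 1)]  # central difference
    psi_x = teager(x, n)
    psi_z = teager(z, n - 1)  # z is offset from x by one sample
    omega = 0.5 * math.acos(1 - psi_z / (2 * psi_x))
    amp = 2 * psi_x / math.sqrt(psi_z)
    return omega, amp

# On a pure sinusoid, DESA-2 recovers the frequency and amplitude exactly.
OMEGA, A = 0.3, 1.5
x = [A * math.cos(OMEGA * n) for n in range(64)]
w, a = desa2(x, 10)
print(round(w, 4), round(a, 4))  # ~0.3 and ~1.5
```

Applied frame-by-frame to each filter-bank output, the spread of these instantaneous-frequency estimates gives the bandwidth values the text then normalizes and plots.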
The vocal cord vibration sensor comprises a voice-signal acquisition sensor array, and the frequency domain of Korean voice-signal feature detection is V(t, θ), namely:
in the formula, ωi(θ) represents the instantaneous time-domain signal weighting vector of the i-th Korean pronunciation output, xi(t) represents the instantaneous time-domain signal component of the Korean pronunciation output, θ is the speech-signal parameter, φ represents the conjugation operator, and m indexes the sensors, of which there are M in total.
And performing time domain matching and filtering on the voice signals by adopting an adaptive beam forming method. The frequency domain characteristics of the output signal are as follows:
V(t,θ) = x^H(t)ω(θ)
in the formula, H represents a complex conjugate transpose.
The weight vector and components of the instantaneous time-domain signal of the korean speech output can be expressed as:
x(t) = [x1(t), x2(t), …, xM(t)]^T
ω(θ) = [ω1(θ), ω2(θ), …, ωM(θ)]^T
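The inner-product output V(t,θ) = x^H(t)ω(θ) can be illustrated with a simple steered-array sketch. The uniform-linear-array, half-wavelength-spacing geometry and the steering-weight form are assumptions for illustration; the text specifies only the inner-product form of the output:

```python
# Steered-array (delay-and-sum style) beamformer output
# V(t, theta) = x^H(t) w(theta) for an M-sensor array, assuming a uniform
# linear array with half-wavelength spacing (an illustrative geometry).

import cmath
import math

def steering_weights(theta, m_sensors):
    """Phase weights for a uniform linear array steered to angle theta."""
    return [cmath.exp(-1j * math.pi * m * math.sin(theta)) / m_sensors
            for m in range(m_sensors)]

def beam_output(x, theta):
    """V(t, theta) = sum_m conj(x_m) * w_m(theta), i.e. the x^H w product."""
    w = steering_weights(theta, len(x))
    return sum(xm.conjugate() * wm for xm, wm in zip(x, w))

# A plane wave arriving from 30 degrees: the output magnitude is largest
# when the steering angle matches the arrival angle.
M, doa = 8, math.pi / 6
x = [cmath.exp(-1j * math.pi * m * math.sin(doa)) for m in range(M)]
on_target = abs(beam_output(x, doa))
off_target = abs(beam_output(x, 0.0))
print(on_target > off_target)  # True
```

An adaptive beamformer, as used in the text, would compute ω(θ) from the data covariance instead of from geometry alone, but the output has the same x^H ω form.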
Combining adaptive filtering and blind source separation, the voice signal is decomposed to obtain the FM components of Korean voice detection, output as follows:
Tm(θ)=(m-1)T0(θ)
in the formula, T0(θ) represents the initial FM component. Combined with the sensor-array signal processing method, the following signal model for Korean pronunciation-error detection is obtained:
in the formula, gm is a calculation coefficient and nm(t) is an auxiliary parameter.
Speech error detection
After the learner pronounces according to the system's prompts, the system combines the standard pronunciation dictionary with the pronunciation rules to form a phoneme detection network. Meanwhile, the formant frequency and the position changes of the tongue and jaw are obtained through sensors to determine the jaw pronunciation parameters related to pitch. During pronunciation, acoustic and electromyographic analyses are performed, the three-dimensional vocal-tract geometry of approximant consonants is captured using magnetic resonance imaging and electropalatography data, and the learner is guided to dynamically adjust the movements of the lower jaw, tongue and throat according to the actual phoneme string and the standard pronunciation.
Introducing error-elimination calculation enables high-precision spoken-pronunciation correction. First, data processing and error calculation are performed as follows:
in the formula, E is the error, H is the error threshold, B is the extreme value of the vibration trough, C is the effective period law of the audio, D is the constant frequency parameter, and PAH is the standard amplitude of the Korean speech.
By the above method, the collected Korean spoken utterances are normalized:
in the formula, ηE is the discrete function value in the Korean pronunciation process, n is the weight of the discrete function value, T represents the number of hops between two audio nodes, and dij represents the shortest path between audio node i and node j.
The pronunciation is corrected as follows:
Vi = RUi(A^T S^−1)^−1
in the formula, A^T is the natural skewness of the audio, a parameter for measuring notes; S^−1 is the combination of audio attributes, a function parameter for audio proofreading; R is the lifting weight of the high-grade audio; Ui is the audio measurement; and Vi is the audio error protection limit.
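Evaluated literally, the correction rule is a small matrix expression. The sketch below treats A and S as square matrices and all numeric values as hypothetical placeholders, since the text does not fix their shapes or contents:

```python
# Direct evaluation of the correction rule V_i = R * U_i * (A^T S^-1)^-1
# from the text, treating A (audio skewness) and S (attribute combination)
# as square matrices. All values below are hypothetical placeholders.

import numpy as np

def correct_pronunciation(R, U_i, A, S):
    """Apply V_i = R * U_i * (A^T S^-1)^-1 to one audio measurement U_i."""
    core = np.linalg.inv(A.T @ np.linalg.inv(S))  # (A^T S^-1)^-1 = S (A^T)^-1
    return R * U_i @ core

A = np.array([[2.0, 0.0], [0.0, 4.0]])  # hypothetical skewness matrix
S = np.array([[1.0, 0.0], [0.0, 2.0]])  # hypothetical attribute combination
U = np.array([1.0, 1.0])                # one audio measurement vector
print(correct_pronunciation(0.5, U, A, S))
```

The identity (A^T S^−1)^−1 = S (A^T)^−1 used in the comment is just the matrix-inverse product rule, so the computation needs two small inversions at most.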
Through this research on the vocal tract and oral cavity models, Korean spoken-pronunciation errors are automatically corrected on the basis of pronunciation phonemes, providing technical support for students learning Korean.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the scope of the technical solution of the present invention.
Claims (8)
1. A Korean pronunciation correction system based on big data mining technology, characterized by comprising an audio signal acquisition module, a data analysis module, a correction module, a control module, a terminal module and a cloud module, wherein a signal transmission device comprises a vocal cord vibration sensor and an electromagnetic sensor, the electromagnetic sensor is used for capturing the movement of the tongue and chin in speech recognition, the electromagnetic sensor is a wearable permanent-magnet tracer whose tongue movement is wirelessly tracked by a magnetic sensor array, ultrasonic imaging measurement of the coordinates and curvature positions of the tongue is performed to represent the tongue during speech, the formant frequencies of vowels in the pronunciation model are estimated based on the combination of the lower jaw, tongue and throat, and the data analysis module optimizes the first two formants of Korean vowels and consonants through the following specific steps:
S1. For vowels, the first formant, denoted F1, is inversely proportional to the tongue height h:
the second formant, denoted F2, is, for vowel production, inversely proportional to the horizontal-axis advance l of the tongue:
the mouth is considered as a tubular model and as a resonator, and the model is modified to obtain:
β1 and β2 are the constants that best fit the formant response of the given tongue vowel pronunciation system, β1, β2 ∈ R; c is the speed of sound, c = 340 m/s;
S2. The values of β1 and β2 are determined; they are calculated from the formant values of the oral system obtained experimentally with the permanent-magnet tracer, and to improve accuracy a loss function between the formants of the estimation system and those of the tongue pronunciation system is computed using the mean squared error:
the partial derivatives of the loss function are computed and used to update the current values of β1 and β2:
S3. The first formants of the lax, tense and aspirated consonants are respectively expressed as:
the second formants of the lax, tense and aspirated consonants are respectively expressed as:
in the formula, γ1 and γ2 are the constants that best fit the formant response of the given tongue consonant pronunciation system, c is the speed of sound, B is the burst release time, and Duration is the duration of pronunciation;
S4. The simplified tongue-based oral system is cascaded with the laryngeal system to give a calculation formula for the vocal tract system, where the transfer function of the vocal tract formant frequencies is denoted V(z)k, and the transfer functions of the formant frequencies of the laryngeal system and the tongue are denoted L(z)k and O(z)k:
A1 and A2 represent the formant frequencies of the laryngeal and tongue articulatory systems respectively, T represents the duration of each formant, z represents the bandwidth of the formant, and Fik denotes the terms for differing values of i and k;
S5. The correction module acquires the formant frequency and the position changes of the tongue and chin through sensors to determine the chin pronunciation parameters related to pitch; during pronunciation, acoustic and electromyographic analyses are performed, the three-dimensional vocal-tract geometry of approximant consonants is captured using magnetic resonance imaging and electropalatography data, and the learner is guided to dynamically adjust the movements of the lower jaw, tongue and throat according to the actual phoneme string and the standard pronunciation;
the audio signal acquisition module comprises a signal transmission device, an audio signal modulator, a demodulator and a voice acquisition device.
2. The Korean pronunciation correction system based on big data mining technology as claimed in claim 1, wherein introducing error-elimination calculation enables high-precision spoken-pronunciation correction; data processing and error calculation are performed first, as follows:
in the formula, E is the error, H is the error threshold, B is the extreme value of the vibration trough, C is the effective period law of the audio, D is the constant frequency parameter, and PAH is the standard amplitude of the Korean speech;
combining adaptive filtering and blind source separation, the voice signal is decomposed to obtain the FM components of Korean voice detection, output as follows:
Tm(θ)=(m-1)T0(θ);
in the formula, T0(θ) represents the initial FM component and Tm(θ) is the m-th FM component;
and (3) carrying out normalized calculation on the collected Korean spoken pronunciation:
in the formula, ηE is the discrete function value in the Korean pronunciation process, n is the weight of the discrete function value, T represents the number of hops between two audio nodes, and dij represents the shortest path between audio node i and node j;
the pronunciation is corrected as follows:
Vi = RUi(A^T S^−1)^−1
in the formula, A^T is the natural skewness of the audio, a parameter for measuring notes; S^−1 is the combination of audio attributes, a function parameter for audio proofreading; R is the lifting weight of the high-grade audio; Ui is the audio measurement; and Vi is the audio error protection limit.
3. The system of claim 1, wherein the vocal cord vibration sensor comprises a voice-signal acquisition sensor array, and the frequency domain of Korean voice-signal feature detection is V(t, θ), that is:
in the formula, ωi(θ) represents the instantaneous time-domain signal weighting vector of the i-th Korean pronunciation output, xi(t) represents the instantaneous time-domain signal component of the Korean pronunciation output, θ is the speech-signal parameter, φ represents the conjugation operator, and m indexes the sensors, of which there are M in total;
the time domain matching and filtering are carried out on the voice signals by adopting a self-adaptive beam forming method, and the frequency domain characteristics of the output signals are as follows:
V(t,θ) = x^H(t)ω(θ)
in the formula, H represents complex conjugate transpose;
the weight vector and components of the instantaneous time-domain signal of the korean speech output can be expressed as:
x(t) = [x1(t), x2(t), …, xM(t)]^T
ω(θ) = [ω1(θ), ω2(θ), …, ωM(θ)]^T;
combining with the signal processing method of the sensor array, a signal model for detecting the pronunciation error of the Korean language is obtained as follows:
in the formula, gm is a calculation coefficient and nm(t) is an auxiliary parameter.
4. The system of claim 3, wherein the audio modulator modulates a low-frequency digital signal into a high-frequency digital signal by digital signal processing techniques and transmits it, the audio modulator being used in pair with the demodulator, which restores the digital signal to the original signal.
5. The system of claim 4, wherein the demodulator recovers a low frequency digital signal modulated in a high frequency digital signal.
6. The korean pronunciation correction system based on big data mining technology as claimed in claim 1, wherein the control module is composed of a program counter, an instruction register, an instruction decoder, a timing generator and an operation controller for issuing commands, coordinating and directing the operation of the whole system.
7. The system of claim 1, wherein the terminal module comprises a client UI module, a visualization module, and the client UI module is adapted to collect information of a terminal user.
8. The system of claim 1, wherein the cloud module comprises a signal receiving module and a database of standard Korean pronunciations and of the oral and throat systems.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110060609.8A CN112863263B (en) | 2021-01-18 | 2021-01-18 | Korean pronunciation correction system based on big data mining technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112863263A CN112863263A (en) | 2021-05-28 |
CN112863263B true CN112863263B (en) | 2021-12-07 |
Family
ID=76005979
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110060609.8A Active CN112863263B (en) | 2021-01-18 | 2021-01-18 | Korean pronunciation correction system based on big data mining technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112863263B (en) |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20150024180A (en) * | 2013-08-26 | 2015-03-06 | 주식회사 셀리이노베이션스 | Pronunciation correction apparatus and method |
CN104732977B (en) * | 2015-03-09 | 2018-05-11 | 广东外语外贸大学 | A kind of online spoken language pronunciation quality evaluating method and system |
CN105261246B (en) * | 2015-12-02 | 2018-06-05 | 武汉慧人信息科技有限公司 | A kind of Oral English Practice error correction system based on big data digging technology |
KR20180115602A (en) * | 2017-04-13 | 2018-10-23 | 인하대학교 산학협력단 | Imaging Element and Apparatus for Recognition Speech Production and Intention Using Derencephalus Action |
KR20190066314A (en) * | 2017-12-05 | 2019-06-13 | 순천향대학교 산학협력단 | Pronunciation and vocal practice device and method for deaf and dumb person |
CN108922563B (en) * | 2018-06-17 | 2019-09-24 | 海南大学 | Based on the visual verbal learning antidote of deviation organ morphology behavior |
CN112185186B (en) * | 2020-09-30 | 2022-07-01 | 北京有竹居网络技术有限公司 | Pronunciation correction method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||