WO2020027394A1 - Apparatus and method for evaluating accuracy of phoneme unit pronunciation - Google Patents

Apparatus and method for evaluating accuracy of phoneme unit pronunciation Download PDF

Info

Publication number
WO2020027394A1
WO2020027394A1 (PCT/KR2019/000147)
Authority
WO
WIPO (PCT)
Prior art keywords
score
time interval
unit
information
phoneme
Prior art date
Application number
PCT/KR2019/000147
Other languages
French (fr)
Korean (ko)
Inventor
윤종성
권용대
홍연정
김서현
조영선
양형원
Original Assignee
미디어젠 주식회사 (MediaZen Inc.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 미디어젠 주식회사 (MediaZen Inc.)
Publication of WO2020027394A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/01Assessment or evaluation of speech recognition systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/10Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Definitions

  • The present invention relates to an apparatus and method for evaluating phoneme-level pronunciation accuracy, and more particularly to an apparatus and method that improve on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence.
  • By providing a score for each phoneme (pronunciation), the smallest unit of the speech signal, the apparatus can feed back not only the overall pronunciation score but also the score for each phoneme.
  • Pronunciation is the spoken realization of a language, and its characteristics differ with the language and the individual speaker.
  • Even allowing for individual differences, pronunciation of the same language should be articulated so that speakers can communicate with one another accurately.
  • Such pronunciation-correction methods not only presuppose good listening ability on the learner's part, but are also difficult to apply uniformly to varied pronunciations.
  • The usual way to learn foreign-language speaking and conversation is to attend a language school and learn directly from a native-speaking instructor.
  • HMM: hidden Markov model
  • The speech recognition system extracts feature vectors, in system-defined frame units, from a speech signal that has undergone preprocessing such as spectral subtraction, sound-source separation, and noise filtering, and performs the subsequent signal processing on the extracted feature vectors.
  • A first object of the present invention is to improve on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence.
  • By comparing, interval by interval, the speech recognition result obtained with a native-speaker acoustic model against the forced alignment of the spoken text, both a pronunciation evaluation score for each phoneme and an overall pronunciation evaluation score are provided.
  • The present invention thus aims to provide a phoneme-level pronunciation accuracy evaluation apparatus and method that can feed back not only the overall pronunciation score but also the score for each phoneme (pronunciation).
  • A second object of the present invention is to provide the per-phoneme scores as values between 0 and 100 points through a web page or mobile app that is easily accessible to the user.
  • To achieve the objects of the present invention, the phoneme-level pronunciation accuracy evaluation apparatus comprises:
  • a voice information extraction unit (100) that obtains spoken text information and the learner's pronounced speech for that text, divides the acquired speech into units of a set time interval, and extracts a speech feature vector for each time interval;
  • a forced alignment unit (400) that forcibly aligns the spoken text information obtained by the voice information extraction unit to the time intervals, generating forced alignment result information;
  • an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval;
  • and a score output unit (800) that computes the per-phoneme average scores of the input speech from the per-interval adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
  • a log likelihood calculation step (S400) of computing the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
  • the problem of conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence
  • scores for each phoneme (pronunciation)
  • FIG. 1 is an overall configuration diagram schematically showing an apparatus for evaluating phoneme pronunciation accuracy according to a first embodiment of the present invention.
  • FIG. 2 is an exemplary waveform graph of a speech signal acquired by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
  • FIG. 3 is an exemplary diagram of phoneme average scores calculated by a phoneme pronunciation accuracy evaluation apparatus according to a first exemplary embodiment of the present invention.
  • FIG. 4 is an exemplary view showing the per-phoneme average scores and the overall average score calculated by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
  • FIG. 5 is an overall flowchart of a phoneme unit pronunciation accuracy evaluation method according to a first embodiment of the present invention
  • Terms such as 'first' and 'second' may be used to describe various components, but the components are not limited by these terms.
  • For example, a first component may be called a second component, and similarly a second component may be called a first component.
  • When a component is said to be connected or coupled to another component, it may be directly connected or coupled to that other component, but intervening components may also be present.
  • a voice information extraction unit (100) that obtains spoken text information and the learner's uttered speech for that text from the learner, divides the acquired speech into units of a set time interval, and extracts a speech feature vector for each time interval;
  • a forced alignment unit (400) that forcibly aligns the spoken text information obtained by the voice information extraction unit to the time intervals, generating forced alignment result information;
  • a log likelihood score conversion unit (600) that generates log likelihood conversion scores by converting the per-interval log likelihood of the forced alignment result into scores between 0 and 100 points;
  • an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval;
  • and a score output unit (800) that computes the per-phoneme average scores from the per-interval adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
  • The native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) includes per-phoneme native pronunciation characteristic information obtained by analyzing, with a deep learning model, the native speaker's speaking rate, the length of the silent sections between pronunciations, and so on.
  • The score output unit (800) treats the per-phoneme average scores as score values between 0 and 100 points.
  • The score output unit (800) outputs at least one of the per-phoneme average scores and the overall average score on the screen.
  • The interval unit is a time interval in the range of 1 msec to 20 msec, preferably 10 msec.
  • The log likelihood calculation unit (500) computes the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted by the voice information extraction unit (100) and the forced alignment result information generated by the forced alignment unit (400), according to the log likelihood formula below.
  • Using the log likelihood score conversion formula below, the per-interval log likelihood of the forced alignment result is converted into a score between 0 and 100 points.
  • For a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree, the adjustment score of that interval is set to 100 and provided to the score output unit; for a time interval in which they do not agree, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit as the adjustment score.
  • a log likelihood calculation step (S400) of computing the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
  • In the log likelihood calculation step, the log likelihood calculation unit (500) computes the per-interval log likelihood of the forced alignment result from the per-interval speech feature vectors extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step, using the log likelihood formula below.
  • In the log likelihood score conversion step (S500), the log likelihood score conversion unit (600) converts the per-interval log likelihood of the forced alignment result into a score between 0 and 100 points using the log likelihood score conversion formula below.
  • In the adjustment score providing step (S600), the adjustment score providing unit (700), following the log likelihood adjustment rule below, sets the adjustment score to 100 for a time interval in which the speech recognition result information and the forced alignment result information agree, and provides it to the score output unit; for a time interval in which they do not agree, it provides the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, to the score output unit as the adjustment score.
  • FIG. 1 is an overall configuration diagram schematically showing an apparatus for evaluating phoneme pronunciation accuracy according to a first embodiment of the present invention.
  • The phoneme-level pronunciation accuracy evaluation apparatus (1000) of the present invention improves on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence.
  • By providing a score for each phoneme (pronunciation), the smallest unit of the speech signal, it can feed back not only the overall pronunciation score but also the score for each phoneme, which enhances the learning effect.
  • That is, the shortcoming of the related art, which provides only an overall pronunciation evaluation score for the input speech signal, is remedied by providing a score for each phoneme, a detailed unit of the speech signal.
  • The smallest sound unit that distinguishes meaning is called a phoneme, and when learning a foreign language it is important to learn pronunciation at the phoneme level of that language.
  • Existing pronunciation-evaluation scoring methods have in common that they evaluate the input foreign-language speech signal as a whole, so the user receives only limited feedback because no per-phoneme score is provided.
  • With per-phoneme scores, the learning effect of the pronunciation evaluation feedback is enhanced.
  • For example, a conventional method of calculating a pronunciation score provides only a total score, such as 'the pronunciation score for cat is 80 points.'
  • In contrast, the present invention can report that the pronunciation score for 'c' in cat is 80 points,
  • the pronunciation score for 'a' is 90 points,
  • the pronunciation score for 't' is 90 points,
  • and the overall pronunciation score is 86.6 points.
  • The phoneme-level pronunciation accuracy evaluation apparatus (1000) includes a voice information extraction unit (100), a native speaker acoustic model storage unit (200), a speech recognition unit (300), a forced alignment unit (400), a log likelihood calculation unit (500), a log likelihood score conversion unit (600), an adjustment score providing unit (700), and a score output unit (800).
  • The voice information extraction unit (100) obtains the spoken text information and the speech the learner pronounced for that text, divides the acquired speech into units of the set time interval, and extracts a speech feature vector for each time interval.
  • For example, spoken text information corresponding to the text 'cat' and the speech of a learner pronouncing 'cat' are obtained.
  • To this end, the apparatus is provided with an input means for entering text and an input means for entering speech.
  • The learner provides the spoken text information to the voice information extraction unit (100) through the text input means.
  • The speech in which the learner pronounces 'cat', the spoken text, is input through the speech input means (for example, a microphone).
  • Having received the spoken text information and the speech, the voice information extraction unit (100) acquires the speech signal for 'cat', divides it into units of the set time interval, and extracts a speech feature vector for each time interval, as shown in FIG. 2.
  • Time intervals of 10 ms are marked off over the speech signal illustrated in FIG. 2, and a feature vector (MFCC) of the speech signal is extracted for each time interval.
  • MFCC: Mel-frequency cepstral coefficients
  • The time interval unit for extracting the speech feature vectors is a time unit in the range of 1 msec to 20 msec.
  • Preferably, the unit is set to 10 msec.
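  • For illustration only, per-interval MFCC extraction of this kind could be sketched with an off-the-shelf library such as librosa; the 10 msec hop follows the preferred interval stated above, while the 16 kHz sampling rate and 13 coefficients are assumptions not taken from the patent:

    import librosa

    def extract_interval_features(wav_path, interval_sec=0.010, n_mfcc=13):
        # Load the learner's recording; 16 kHz is an assumed, typical rate.
        signal, sr = librosa.load(wav_path, sr=16000)
        hop = int(interval_sec * sr)  # one feature vector per 10 msec interval
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=2 * hop, hop_length=hop)
        # Each row of the returned array is the feature vector o_i of one interval.
        return mfcc.T  # shape: (num_intervals, n_mfcc)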
  • The native speaker acoustic model storage unit (200) stores native speaker acoustic model information.
  • The native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) is per-phoneme native pronunciation characteristic information obtained using a deep learning model.
  • That is, per-phoneme native pronunciation characteristic information, the result of analyzing a native speaker's speaking rate and the length of the silent sections between pronunciations, is stored in the native speaker acoustic model storage unit (200) and used.
  • The speech recognition unit (300) performs speech recognition on the speech pronounced by the learner.
  • Specifically, the speech recognition unit (300) performs speech recognition on the per-interval speech feature vectors extracted by the voice information extraction unit (100), using the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200), to generate speech recognition result information.
  • As shown in FIG. 3, the speech recognition result information is, for example: the phoneme 'b' in 0-10 ms (interval 1), 'b' in 10-20 ms (interval 2), 'b' in 20-30 ms (interval 3), 'b' in 30-40 ms (interval 4), 'æ' in 40-50 ms (interval 5), 'æ' in 50-60 ms (interval 6), 't' in 60-70 ms (interval 7), 't' in 70-80 ms (interval 8), and 's' in 80-90 ms (interval 9).
  • The forced alignment unit (400) generates the forced alignment result information by forcibly aligning the spoken text information obtained by the voice information extraction unit (100) to the time intervals.
  • That is, the forced alignment unit (400) forcibly aligns the phoneme-level pronunciation corresponding to the text 'cat' to the speech signal, as shown in FIG. 3.
  • In other words, the phoneme-level pronunciation corresponding to the spoken text is forcibly aligned to each 10 ms time interval.
  • As shown in FIG. 3, the forced alignment result information is: the phoneme 'k' in 0-10 ms (interval 1), 'k' in 10-20 ms (interval 2), 'k' in 20-30 ms (interval 3), 'æ' in 30-40 ms (interval 4), 'æ' in 40-50 ms (interval 5), 'æ' in 50-60 ms (interval 6), 't' in 60-70 ms (interval 7), 't' in 70-80 ms (interval 8), and 't' in 80-90 ms (interval 9).
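  • For reference, the two per-interval label sequences of the 'cat' example above can be written out directly (an illustrative sketch, not part of the disclosed apparatus):

    # Per-interval phoneme labels from the FIG. 3 example for 'cat' (10 ms intervals).
    recognized = ['b', 'b', 'b', 'b', 'æ', 'æ', 't', 't', 's']  # speech recognition result
    forced     = ['k', 'k', 'k', 'æ', 'æ', 'æ', 't', 't', 't']  # forced alignment result
    assert len(recognized) == len(forced)  # one label per time interval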
  • The log likelihood calculation unit (500) uses the per-interval speech feature vectors extracted by the voice information extraction unit (100) and the forced alignment result information generated by the forced alignment unit (400) to compute the log likelihood of the forced alignment result for each time interval.
  • That is, the log likelihood calculation unit (500) computes the per-interval log likelihood of the forced alignment result using the log likelihood formula log(p(oi│qi)).
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • The log likelihood of each time interval of the forced alignment result can take a negative value, because p(oi│qi), the probability of observing oi given qi in the i-th time interval, is a value between 0 and 1, and the logarithm of such a value is negative.
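  • A minimal sketch of this per-interval computation follows; note the patent's acoustic model is a deep-learning model, so the single Gaussian per phoneme assumed here is only a stand-in:

    import numpy as np
    from scipy.stats import multivariate_normal

    def interval_log_likelihoods(features, forced_phonemes, models):
        """Compute log p(o_i | q_i) for every time interval i.

        features:        (num_intervals, dim) array of feature vectors o_i
        forced_phonemes: phoneme labels q_i from the forced alignment result
        models:          dict phoneme -> (mean, cov); Gaussian stand-ins for
                         the native-speaker acoustic model (an assumption,
                         not the patent's deep-learning model)
        """
        lls = []
        for o_i, q_i in zip(features, forced_phonemes):
            mean, cov = models[q_i]
            # Likelihoods are typically far below 1, so the log value is negative.
            lls.append(multivariate_normal.logpdf(o_i, mean=mean, cov=cov))
        return np.array(lls)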
  • The log likelihood score conversion unit (600) converts the log likelihood computed for each time interval of the forced alignment result into a score between 0 and 100 points.
  • The reason for this conversion is that the computed per-interval log likelihood is a negative value, whereas a score in the positive range of 0 to 100 points is easier to interpret.
  • That is, using the log likelihood score conversion formula, the per-interval log likelihood of the forced alignment result is converted into a score between 0 and 100 points.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • For example, the converted log likelihood of 90 points for interval 1 (0-10 ms) shown in FIG. 3 is the value obtained by adding 100 to the log of the probability that the speech feature vector of interval 1 is produced by 'k', the phoneme of interval 1 according to the forced alignment result information.
  • Likewise, the converted log likelihood of 80 points for interval 5 (40-50 ms) is the log of the probability that the speech feature vector of interval 5 is produced by 'æ', the phoneme of interval 5 according to the forced alignment result information, plus 100.
  • And the converted log likelihood of 90 points for interval 7 (60-70 ms) shown in FIG. 3 is the log of the probability that the speech feature vector of interval 7 is produced by 't', the phoneme of interval 7 according to the forced alignment result information, plus 100.
  • That is, p(o1│q1) is the probability that o1, the speech feature vector of the first time interval, is produced by the phoneme 'k'; the log likelihood is the logarithm of this probability value.
  • Since the probability value is much smaller than 1, its logarithm is negative (here, -10), and adding 100 yields the converted log likelihood value of 90 points shown in FIG. 3.
  • In other words, as shown in FIG. 3, the log likelihood score conversion unit (600) produces converted log likelihoods (for the phonemes per the forced alignment result information) of 90 points for 'k' in interval 1 (0-10 ms), 80 points for 'k' in interval 2 (10-20 ms), 100 points for 'k' in interval 3 (20-30 ms), 40 points for 'æ' in interval 4 (30-40 ms), 80 points for 'æ' in interval 5 (40-50 ms), 80 points for 'æ' in interval 6 (50-60 ms), 90 points for 't' in interval 7 (60-70 ms), 90 points for 't' in interval 8 (70-80 ms), and 70 points for 't' in interval 9 (80-90 ms).
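  • A sketch of this conversion is shown below; the '+100' offset follows the worked example above, while the clamping to the 0 to 100 range is an assumption, since the patent states only that the result lies between 0 and 100 points:

    def to_conversion_score(log_likelihood):
        # Converted score = per-interval log likelihood + 100, as in the worked
        # example above; clamping to [0, 100] is an assumption for extreme values.
        return max(0.0, min(100.0, log_likelihood + 100.0))

    # Interval 1 of the example: log p(o1|q1) = -10 gives the converted score 90.
    assert to_conversion_score(-10.0) == 90.0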
  • The adjustment score providing unit (700) provides the adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval.
  • For a time interval in which they agree, the adjustment score of that interval is set to 100 and provided to the score output unit.
  • For a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit do not agree, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • In FIG. 3, the speech recognition result and the forced alignment result agree in intervals 5 and 6 ('æ' and 'æ') and in intervals 7 and 8 ('t' and 't').
  • Although the log likelihood conversion scores for these time intervals are 80, 80, 90, and 90 points respectively, the adjustment score of 100 points is what is provided to the score output unit for each of them.
  • The reason for setting the adjustment score to 100 is that, even though the speech recognition result information and the forced alignment result information agree, leaving the score unadjusted would lower the evaluation score considerably; the adjustment score of 100 points is introduced to eliminate this scoring error.
  • For the time intervals in which they do not agree, the log likelihood conversion score of the interval is provided to the score output unit as the adjustment score.
  • Accordingly, 90 points for interval 1, 80 points for interval 2, 100 points for interval 3, 40 points for interval 4, and 70 points for interval 9 are provided to the score output unit.
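  • The adjustment rule reduces to a per-interval comparison; the sketch below reproduces the adjustment scores of the 'cat' example (illustrative only):

    def adjustment_scores(recognized, forced, conversion_scores):
        # Matching interval: adjustment score 100; otherwise keep the converted score.
        return [100.0 if r == f else float(s)
                for r, f, s in zip(recognized, forced, conversion_scores)]

    recognized = ['b', 'b', 'b', 'b', 'æ', 'æ', 't', 't', 's']
    forced     = ['k', 'k', 'k', 'æ', 'æ', 'æ', 't', 't', 't']
    conversion = [90, 80, 100, 40, 80, 80, 90, 90, 70]   # FIG. 3 converted scores
    print(adjustment_scores(recognized, forced, conversion))
    # -> [90.0, 80.0, 100.0, 40.0, 100.0, 100.0, 100.0, 100.0, 70.0]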
  • The score output unit (800) computes the per-phoneme average scores of the input speech from the adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
  • As shown in FIG. 3, 'cat' is composed of the phonemes 'k', 'æ', and 't'.
  • The per-interval adjustment scores for the 'k' phoneme are 90, 80, and 100 points, so its average is (90 + 80 + 100) / 3 = 90 points; the adjustment scores for the 'æ' phoneme are 40, 100, and 100 points, so its average is (40 + 100 + 100) / 3 = 80 points; and the adjustment scores for the 't' phoneme are 100, 100, and 70 points, so its average is (100 + 100 + 70) / 3 = 90 points. These averages are calculated and output.
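  • The per-phoneme and overall averages of the example can be reproduced with a short sketch (illustrative only):

    from itertools import groupby

    def phoneme_averages(forced, adjusted):
        # Average the adjustment scores over each contiguous run of intervals
        # that the forced alignment assigns to the same phoneme.
        result = []
        for phoneme, run in groupby(zip(forced, adjusted), key=lambda p: p[0]):
            scores = [s for _, s in run]
            result.append((phoneme, sum(scores) / len(scores)))
        return result

    forced   = ['k', 'k', 'k', 'æ', 'æ', 'æ', 't', 't', 't']
    adjusted = [90, 80, 100, 40, 100, 100, 100, 100, 70]
    print(phoneme_averages(forced, adjusted))  # [('k', 90.0), ('æ', 80.0), ('t', 90.0)]
    print(sum(adjusted) / len(adjusted))       # 86.66... = overall average score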
  • At least one of the per-phoneme average scores and the overall average score of the input speech is output on the screen.
  • The per-phoneme average scores and the overall average score may be provided at the same time, or only the per-phoneme average scores or only the overall average score may be provided.
  • FIG. 5 is a flowchart illustrating a method for evaluating phoneme unit pronunciation accuracy according to a first embodiment of the present invention.
  • The phoneme-level pronunciation accuracy evaluation method includes a voice information extraction step (S100), a speech recognition step (S200), a forced alignment step (S300), a log likelihood calculation step (S400), a log likelihood score conversion step (S500), an adjustment score providing step (S600), and a score output step (S700).
  • In the voice information extraction step (S100), the voice information extraction unit (100) obtains the spoken text information and the learner's uttered speech for that text, divides the acquired speech into units of the set time interval, and extracts a speech feature vector for each time interval.
  • In the speech recognition step (S200), the speech recognition unit (300) generates speech recognition result information by performing speech recognition on the per-interval speech feature vectors extracted in the voice information extraction step (S100), using the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200).
  • In the forced alignment step (S300), the forced alignment unit (400) forcibly aligns the spoken text information obtained in the voice information extraction step (S100) to the time intervals, generating forced alignment result information.
  • In the log likelihood calculation step (S400), the log likelihood calculation unit (500) computes the per-interval log likelihood of the forced alignment result from the per-interval speech feature vectors extracted in the voice information extraction step (S100) and the forced alignment result information generated in the forced alignment step (S300), using the log likelihood formula below.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • The log likelihood score conversion step (S500) converts the per-interval log likelihood of the forced alignment result into a score between 0 and 100 points using the log likelihood score conversion formula below.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • In the adjustment score providing step (S600), following the log likelihood adjustment rule below, the adjustment score is set to 100 for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree, and is provided to the score output unit; for a time interval in which they do not agree, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
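  • Putting steps S200 through S700 together, an end-to-end evaluation could look like the following sketch, in which the recognizer, aligner, and acoustic model are hypothetical callables standing in for components the patent does not specify in code:

    def evaluate_pronunciation(features, text, recognize, force_align, log_likelihood):
        """Illustrative pipeline for steps S200-S700 (assumed interfaces)."""
        recognized = recognize(features)                          # S200
        forced = force_align(text, len(features))                 # S300
        converted = [max(0.0, min(100.0, log_likelihood(o, q) + 100.0))
                     for o, q in zip(features, forced)]           # S400-S500
        adjusted = [100.0 if r == f else s                        # S600
                    for r, f, s in zip(recognized, forced, converted)]
        overall = sum(adjusted) / len(adjusted)                   # S700
        return adjusted, overall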
  • The present invention improves on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence, by providing a score for each phoneme (pronunciation), a detailed unit of the speech signal.
  • Since not only the overall pronunciation score but also the score for each phoneme (pronunciation) can be fed back, the learner can concentrate on the insufficient phonemes, enhancing the learning effect.
  • Industrial applicability is thereby also increased.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention relates to an apparatus and a method for evaluating the accuracy of phoneme-level pronunciation. More specifically, it alleviates the problem of conventional automatic pronunciation evaluation devices, which provide only a total pronunciation evaluation score for the speech signal pronounced by a learner for a given word or sentence, by providing a score for each phoneme (pronunciation), a detailed unit of the speech signal. Both the total pronunciation score and the per-phoneme scores can thus be fed back, so that an unsatisfactory phoneme can be studied intensively, enhancing the learning effect.

Description

Apparatus and Method for Evaluating the Accuracy of Phoneme-Level Pronunciation
The present invention relates to an apparatus and method for evaluating phoneme-level pronunciation accuracy, and more particularly to an apparatus and method that improve on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence, by providing a score for each phoneme (pronunciation), the smallest unit of the speech signal; since not only the overall pronunciation score but also the score for each phoneme can be fed back, the learner can concentrate intensively on whichever phonemes are insufficient, enhancing the learning effect.
As information exchange increases, communication between people has become more important than ever in modern society.
The development of information and communication technology has diversified the means of communication, but conversation carried by the human voice remains the most important method of communication.
Even when communicating by voice there are various factors to consider, and an important one among them is pronunciation.
Pronunciation is the spoken realization of a language, and its characteristics differ with the language and the individual.
Fundamentally, pronunciation of the same language should, even allowing for individual differences, be articulated so that speakers can communicate with one another accurately.
However, not everyone pronounces a language accurately according to its characteristics; because of this, the same words often have to be repeated several times, or miscommunication occurs.
Various methods of correcting pronunciation toward accurate speech have been proposed, but most are sensory methods that are not quantitatively analyzed, such as imitating the pronunciation of a person widely judged to pronounce correctly, or repeatedly saying particular words or sentences that are difficult to pronounce.
That is, the method mainly used has been simple repetition, without knowing the pronunciation characteristics of the person judged to pronounce accurately.
Such pronunciation-correction methods not only presuppose good listening ability on the learner's part, but are also difficult to apply uniformly to varied pronunciations.
Meanwhile, the recent development of the Internet and the expansion of trade have increased opportunities to meet people from many countries; in particular, as businesses meet foreign buyers more often, the demand for foreign languages keeps growing.
As meetings with foreigners increase, conversation-oriented foreign language education is gaining attention, unlike the conventional reading-oriented approach.
Generally, the way to learn foreign-language speaking and conversation is to attend a language school and learn directly from a native-speaking instructor.
However, attending a language school raises problems of time and cost, and even when learning directly from a foreign instructor it is not easy to obtain feedback.
Therefore, a foreign-language learning method that solved the time and cost problems and provided appropriate feedback would be efficient in terms of both time and cost.
Recently, with the development of speech recognition technology, many attempts have been made to apply it to foreign language education.
Among these, a method attempted frequently in recent years uses the hidden Markov model (hereinafter 'HMM').
Here, the speech recognition system extracts feature vectors, in system-defined frame units, from a speech signal that has undergone preprocessing such as spectral subtraction, sound-source separation, and noise filtering, and performs the subsequent signal processing on the extracted feature vectors.
Existing methods and systems for evaluating foreign-language speaking did no more than measure, with an HMM recognizer, the accuracy of the unit to be evaluated.
This is because other elements of the speaker's pronunciation (length, energy, intonation, stress, and so on) could not be reflected in the feature vector.
In other words, the learner simply read sentences aloud, and the evaluation was based on the results obtained through the HMM recognizer.
In foreign languages, however, unlike Korean, it is elements such as length, energy, intonation, and stress that carry an important axis of meaning.
For example, in Chinese the meaning can change completely with the tones, which are related to intonation, and in English-speaking languages stress plays an important part in conveying meaning.
Foreign-language automatic pronunciation evaluation devices in common use today provide only an overall pronunciation evaluation score for the input speech signal; they do not teach pronunciation at the level of the phoneme, the smallest sound unit that distinguishes meaning.
They therefore provide only limited feedback information to the user, and have been limited in their ability to enhance the learning effect.
<Prior Art Documents>
Korean Registered Patent No. 10-0733469
Accordingly, the present invention is proposed in view of the problems of the prior art described above. A first object of the present invention is to improve on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence, by comparing, interval by interval, the speech recognition result obtained for the spoken foreign-language speech signal using a native-speaker acoustic model against the forced alignment of the spoken text, and thereby to provide a phoneme-level pronunciation accuracy evaluation apparatus and method that deliver a pronunciation evaluation score for each phoneme as well as an overall pronunciation evaluation score, feeding back not only the overall score but also the score for each phoneme (pronunciation).
A second object of the present invention is to provide the per-phoneme scores as values between 0 and 100 points through a web page or mobile app that is easily accessible to the user.
To achieve the objects of the present invention, the phoneme-level pronunciation accuracy evaluation apparatus comprises:
a voice information extraction unit (100) that obtains spoken text information and the learner's pronounced speech for that text, divides the acquired speech into units of a set time interval, and extracts a speech feature vector for each time interval;
a native speaker acoustic model storage unit (200) in which native speaker acoustic model information is stored;
a speech recognition unit (300) that generates speech recognition result information by performing speech recognition on the per-interval speech feature vectors extracted by the voice information extraction unit, using the native speaker acoustic model information stored in the native speaker acoustic model storage unit;
a forced alignment unit (400) that forcibly aligns the spoken text information obtained by the voice information extraction unit to the time intervals, generating forced alignment result information;
a log likelihood calculation unit (500) that computes the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted by the voice information extraction unit (100) and the forced alignment result information generated by the forced alignment unit (400);
a log likelihood score conversion unit (600) that generates log likelihood conversion scores by converting the per-interval log likelihood computed for the forced alignment result into scores between 0 and 100 points;
an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval;
and a score output unit (800) that computes the per-phoneme average scores of the input speech from the per-interval adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
Meanwhile, the phoneme-level pronunciation accuracy evaluation method of the present invention comprises:
a voice information extraction step (S100) of obtaining spoken text information and the learner's uttered speech for that text, dividing the acquired speech into units of a set time interval, and extracting a speech feature vector for each time interval;
a speech recognition step (S200) of generating speech recognition result information by performing speech recognition on the per-interval speech feature vectors extracted in the voice information extraction step (S100), using native speaker acoustic model information;
a forced alignment step (S300) of generating forced alignment result information by forcibly aligning the spoken text information obtained in the voice information extraction step to the time intervals;
a log likelihood calculation step (S400) of computing the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
a log likelihood score conversion step (S500) of converting the per-interval log likelihood computed for the forced alignment result into scores between 0 and 100 points;
an adjustment score providing step (S600) of providing an adjustment score for each time interval, depending on whether the speech recognition result information and the forced alignment result information agree in that interval;
and a score output step (S700) of computing the per-phoneme average scores of the input speech from the per-interval adjustment scores, or computing and outputting the overall average score of the input speech.
Through the phoneme-level pronunciation accuracy evaluation apparatus and method of the present invention, configured and operating as above, the problem of conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence, is improved by providing a score for each phoneme (pronunciation), the smallest unit of the speech signal; since not only the overall pronunciation score but also the score for each phoneme can be fed back, the learner can concentrate on whichever phonemes are weak, enhancing the learning effect.
In addition, by providing the per-phoneme scores as values between 0 and 100 points through a web page or mobile app that is easily accessible to the user, the technology is easy to disseminate.
FIG. 1 is an overall configuration diagram schematically showing the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
FIG. 2 is an exemplary waveform graph of a speech signal acquired by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
FIG. 3 is an exemplary diagram of the per-phoneme average scores calculated by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
FIG. 4 is an exemplary view showing, on a screen, the per-phoneme average scores and the overall average score calculated by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
FIG. 5 is an overall flowchart of the phoneme-level pronunciation accuracy evaluation method according to the first embodiment of the present invention.
<Reference Numerals>
100: voice information extraction unit
200: native speaker acoustic model storage unit
300: speech recognition unit
400: forced alignment unit
500: log likelihood calculation unit
600: log likelihood score conversion unit
700: adjustment score providing unit
800: score output unit
1000: phoneme-level pronunciation accuracy evaluation apparatus
The following merely illustrates the principles of the invention. Those skilled in the art, although not explicitly described or shown herein, can embody the principles of the invention and devise various apparatuses that fall within its concept and scope.
In addition, all conditional terms and embodiments listed herein are, in principle, expressly intended only for the purpose of making the concept of the invention understood, and are not to be construed as limited to the specifically listed embodiments and states.
In describing the present invention, terms such as 'first' and 'second' may be used to describe various components, but the components are not limited by these terms.
For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component.
When a component is said to be connected or coupled to another component, it may be directly connected or coupled to that other component, but intervening components may also be present.
The terminology used herein is for describing particular embodiments only and is not intended to limit the invention; singular expressions include plural expressions unless the context clearly indicates otherwise.
In this specification, terms such as 'comprise' or 'include' are intended to designate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not exclude in advance the possibility of the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.
The phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention comprises:
a voice information extraction unit (100) that obtains spoken text information and the learner's uttered speech for that text, divides the acquired speech into units of a set time interval, and extracts a speech feature vector for each time interval;
a native speaker acoustic model storage unit (200) in which native speaker acoustic model information is stored;
a speech recognition unit (300) that generates speech recognition result information by performing speech recognition on the per-interval speech feature vectors extracted by the voice information extraction unit, using the native speaker acoustic model information stored in the native speaker acoustic model storage unit;
a forced alignment unit (400) that forcibly aligns the spoken text information obtained by the voice information extraction unit to the time intervals, generating forced alignment result information;
a log likelihood calculation unit (500) that computes the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted by the voice information extraction unit (100) and the forced alignment result information generated by the forced alignment unit (400);
a log likelihood score conversion unit (600) that generates log likelihood conversion scores by converting the computed per-interval log likelihood of the forced alignment result into scores between 0 and 100 points;
an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval;
and a score output unit (800) that computes the per-phoneme average scores from the per-interval adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
Here, the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) includes per-phoneme native speaker pronunciation characteristic information obtained by analyzing, with a deep learning model, the native speaker's speaking rate, the length of the silent interval between pronunciations, and the like.
The score output unit (800) processes the per-phoneme average score as a score value between 0 and 100.
The score output unit (800) outputs at least one of the per-phoneme average score and the overall average score to a screen.
The time interval unit is a time interval in the range of 1 msec to 20 msec, preferably 10 msec.
The log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors extracted by the voice information extraction unit (100) for each time interval and the forced alignment result information generated by the forced alignment unit (400), according to the following log likelihood formula:
log(p(o_i | q_i))    (log likelihood formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
The log likelihood score conversion unit (600) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
According to the following log likelihood adjustment formula, the adjustment score providing unit (700) provides, for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match, an adjustment score of 100 for that interval to the score output unit; for a time interval in which they do not match, it provides the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, to the score output unit as the adjustment score.
adjusted score_i = 100, if the recognized phoneme of the i-th time interval matches q_i;
adjusted score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100, otherwise    (log likelihood adjustment formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
Meanwhile, a phoneme-unit pronunciation accuracy evaluation method according to the first embodiment of the present invention comprises:
a voice information extraction step (S100) of acquiring, from a learner, spoken text information and the learner's uttered voice information for that text, dividing the acquired voice information into units of a set time interval, and extracting a speech feature vector for each time interval;
a speech recognition step (S200) of performing speech recognition on the speech feature vectors for each time interval extracted in the voice information extraction step (S100), using native speaker acoustic model information, to generate speech recognition result information;
a forced alignment step (S300) of forcibly aligning the spoken text information acquired in the voice information extraction step by time interval to generate forced alignment result information;
a log likelihood calculation step (S400) of calculating a log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
a log likelihood score conversion step (S500) of converting the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100;
an adjustment score providing step (S600) of providing an adjustment score for each time interval according to whether the speech recognition result information and the forced alignment result information match in each time interval;
and a score output step (S700) of calculating an average score for each phoneme of the input voice information based on the adjustment scores for each time interval, or calculating and outputting an overall average score for the input voice information.
In this case, in the log likelihood calculation step (S400), the log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step, according to the following log likelihood formula:
log(p(o_i | q_i))    (log likelihood formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
In this case, in the log likelihood score conversion step (S500), the log likelihood score conversion unit (600) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
In this case, in the adjustment score providing step (S600), the adjustment score providing unit (700), according to the following log likelihood adjustment formula, provides to the score output unit an adjustment score of 100 for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match, and, for a time interval in which they do not match, provides to the score output unit the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, as the adjustment score.
adjusted score_i = 100, if the recognized phoneme of the i-th time interval matches q_i;
adjusted score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100, otherwise    (log likelihood adjustment formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
Hereinafter, the phoneme-unit pronunciation accuracy evaluation apparatus and evaluation method according to the present invention will be described in detail through embodiments.
FIG. 1 is an overall configuration diagram schematically showing a phoneme-unit pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
As shown in FIG. 1, the phoneme-unit pronunciation accuracy evaluation apparatus (1000) of the present invention improves on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for an uttered speech signal corresponding to a given word or sentence, by providing a score for each phoneme (pronunciation unit), the detailed unit of the speech signal. Since not only the overall pronunciation score but also the score for each phoneme can be fed back, the learner can study intensively the phonemes that are weak, enhancing the learning effect.
In addition, by providing the per-phoneme score as a value between 0 and 100 through a web page or mobile app that is easy for users to access, the technology can be easily disseminated.
That is, the problem of the prior art, which provides only an overall pronunciation evaluation score for the input speech signal, is improved by providing a score for each phoneme, the detailed unit of the speech signal.
Here, the smallest sound unit that produces a difference in meaning is called a phoneme, and when learning a foreign language, learning the pronunciation of the phoneme units of that language is important.
Existing pronunciation evaluation scoring methods have in common that they provide a score evaluating the input foreign-language speech signal as a whole; since no per-phoneme score is provided, the user receives only limited feedback.
In the present invention, however, not only the overall score but also a score for each pronunciation (phoneme) is provided, enhancing the learning effect through pronunciation evaluation feedback.
For example, a conventional pronunciation evaluation scoring method provides only an overall score such as "the pronunciation score for cat is 80 points," whereas the present invention provides per-phoneme evaluation scores together with the overall score: "for cat, the pronunciation score for c is 80 points, the pronunciation score for a is 90 points, the pronunciation score for t is 90 points, and the overall pronunciation score is 86.6 points."
As shown in FIG. 1, the phoneme-unit pronunciation accuracy evaluation apparatus (1000) comprises a voice information extraction unit (100), a native speaker acoustic model storage unit (200), a speech recognition unit (300), a forced alignment unit (400), a log likelihood calculation unit (500), a log likelihood score conversion unit (600), an adjustment score providing unit (700), and a score output unit (800).
More specifically:
The voice information extraction unit (100) acquires, from the learner, spoken text information and the voice information the learner pronounced for that text, divides the acquired voice information into units of a set time interval, and extracts a speech feature vector for each time interval.
For example, it acquires the spoken text information corresponding to the text 'cat' and the voice information of the learner pronouncing 'cat'.
The present invention is provided with an input means for entering text and an input means for entering voice information: the learner provides the spoken text 'cat' to the voice information extraction unit (100) through the text input means, and then inputs the voice information of the pronounced text 'cat' through the voice input means (e.g., a microphone).
Having received the spoken text information and the voice information, the voice information extraction unit (100) acquires the speech signal for 'cat' as shown in FIG. 2, divides the acquired voice information into units of the set time interval, and extracts a speech feature vector for each time interval as shown in FIG. 3.
For example, the uttered speech signal of FIG. 2 is divided into time intervals of 10 ms, and a feature vector (MFCC) for the speech signal is extracted for each time interval.
In speech recognition, the speaking rate and the length of the silent interval between pronunciations are very important factors.
As a technique for extracting speech feature vectors, MFCC (Mel Frequency Cepstrum Coefficient) parameters are widely used; since this algorithm is well known in speech recognition technology, a detailed description is omitted.
Here, the time interval unit for extracting the speech feature vectors is a time unit in the range of 1 msec to 20 msec; considering that the time interval over which a homogeneous pronunciation signal exists is roughly 10 msec, the time interval unit is preferably set to 10 msec, as in the sketch below.
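As an illustration of the framing and feature extraction described above, the following minimal Python sketch divides an utterance into 10 msec intervals and extracts one MFCC vector per interval. It assumes the librosa library; the sampling rate, MFCC dimensionality, and window length are illustrative choices, not values fixed by this specification.

    import librosa

    SR = 16000          # assumed sampling rate (illustrative)
    HOP = SR // 100     # 10 msec hop: one feature vector per time interval
    N_MFCC = 13         # assumed MFCC dimensionality

    def extract_interval_features(wav_path):
        # Load the learner's utterance and compute one MFCC column per 10 msec interval.
        y, _ = librosa.load(wav_path, sr=SR)
        mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=N_MFCC,
                                    hop_length=HOP, n_fft=2 * HOP)
        return mfcc.T   # shape: (number of time intervals, N_MFCC)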
Native speaker acoustic model information is stored in the native speaker acoustic model storage unit (200).
The native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) is per-phoneme native speaker pronunciation characteristic information obtained using a deep learning model.
That is, per-phoneme native speaker pronunciation characteristic information, the result of analyzing the native speaker's speaking rate, the length of the silent interval between pronunciations, and so on with a deep learning model, is stored in the native speaker acoustic model storage unit (200), and the speech recognition unit (300) uses it to perform speech recognition on the voice pronounced by the learner.
The speech recognition unit (300) performs speech recognition on the speech feature vectors for each time interval extracted by the voice information extraction unit (100), using the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200), to generate speech recognition result information.
For example, as shown in FIG. 3, the speech recognition result information is result information in which the phoneme b is recognized at 0-10 ms (interval 1), b at 10-20 ms (interval 2), b at 20-30 ms (interval 3), b at 30-40 ms (interval 4), æ at 40-50 ms (interval 5), æ at 50-60 ms (interval 6), t at 60-70 ms (interval 7), t at 70-80 ms (interval 8), and s at 80-90 ms (interval 9).
The forced alignment unit (400) forcibly aligns the spoken text information acquired by the voice information extraction unit (100) by time interval to generate forced alignment result information.
For example, when the voice information extraction unit (100) acquires the spoken text information 'cat', the forced alignment unit (400) forcibly aligns the phoneme-unit pronunciations corresponding to the text 'cat' to the utterance strip as shown in FIG. 3.
That is, the phoneme-unit pronunciations corresponding to the spoken text are forcibly aligned per 10 ms time interval; as shown in FIG. 3, the forced alignment result information is result information in which the phoneme k is aligned at 0-10 ms (interval 1), k at 10-20 ms (interval 2), k at 20-30 ms (interval 3), æ at 30-40 ms (interval 4), æ at 40-50 ms (interval 5), æ at 50-60 ms (interval 6), t at 60-70 ms (interval 7), t at 70-80 ms (interval 8), and t at 80-90 ms (interval 9); the two per-interval sequences are restated in code form below.
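Restating the FIG. 3 example, the two per-interval phoneme sequences that the later steps compare can be written as plain Python lists (a restatement of the figure, not additional data):

    # Per-interval phonemes for the 'cat' example of FIG. 3 (intervals 1-9).
    recognized   = ['b', 'b', 'b', 'b', 'æ', 'æ', 't', 't', 's']  # speech recognition result
    forced_align = ['k', 'k', 'k', 'æ', 'æ', 'æ', 't', 't', 't']  # forced alignment result
    # Interval i matches when recognized[i] == forced_align[i]
    # (here: intervals 5 to 8 in the figure's 1-based numbering).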
The log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors extracted by the voice information extraction unit (100) for each time interval and the forced alignment result information generated by the forced alignment unit (400).
Specifically, the log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information using the following log likelihood formula:
log(p(o_i | q_i))    (log likelihood formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
The log likelihood for each time interval of the forced alignment result information has a negative value, because p(o_i | q_i), the probability that o_i is produced by q_i in the i-th time interval, lies between 0 and 1, and taking the logarithm of such a value yields a non-positive result.
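A minimal sketch of this per-interval computation is given below. It assumes an acoustic model exposed through a hypothetical function phoneme_prob(o, q) returning p(o | q); that name is an assumption for illustration, not an API defined by this specification.

    import math

    def interval_log_likelihoods(features, forced_align, phoneme_prob):
        # log p(o_i | q_i) for each time interval i; each value is <= 0
        # because the probability lies between 0 and 1.
        return [math.log(phoneme_prob(o, q))
                for o, q in zip(features, forced_align)]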
The log likelihood score conversion unit (600) converts the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100.
The reason for converting the per-interval log likelihood into a score between 0 and 100 is that the calculated log likelihood values are negative, so they must be mapped into the positive range.
Specifically, the log likelihood score conversion unit (600) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
For example, the converted log likelihood of 90 for interval 1 (0-10 ms) shown in FIG. 3 is the log of the probability that the speech feature vector of interval 1 comes from the phoneme 'k', the interval-1 phoneme according to the forced alignment result information, plus 100; the converted log likelihood of 80 for interval 5 (40-50 ms) is the log of the probability that the speech feature vector of interval 5 comes from the phoneme 'æ', the interval-5 phoneme according to the forced alignment result information, plus 100; and the converted log likelihood of 90 for interval 7 (60-70 ms) is the log of the probability that the speech feature vector of interval 7 comes from the phoneme 't', the interval-7 phoneme according to the forced alignment result information, plus 100.
As a more concrete example, p(o_1 | q_1) is the probability that the speech feature vector of the first time interval (the feature vector of the spoken phoneme b) comes from the phoneme 'k', the first-interval phoneme according to the forced alignment result information; taking the log of this probability and adding 100 gives the converted log likelihood value of 90.
That is, for the first time interval, a speech feature vector corresponding to the phoneme 'k' would have had to appear for the pronunciation to be judged accurate, in which case the probability would be close to 1; but since a speech feature vector corresponding to the phoneme 'b' actually appeared, the probability is much smaller than 1, yielding the converted log likelihood value of 90 shown in FIG. 3.
In summary, as shown in FIG. 3, the log likelihood score conversion unit (600) produces converted scores such as: 90 for the 'k' phoneme (the phoneme according to the forced alignment result information) of interval 1 (0-10 ms), 80 for the 'k' phoneme of interval 2 (10-20 ms), 100 for the 'k' phoneme of interval 3 (20-30 ms), 40 for the 'æ' phoneme of interval 4 (30-40 ms), 80 for the 'æ' phoneme of interval 5 (40-50 ms), 80 for the 'æ' phoneme of interval 6 (50-60 ms), 90 for the 't' phoneme of interval 7 (60-70 ms), 90 for the 't' phoneme of interval 8 (70-80 ms), and 70 for the 't' phoneme of interval 9 (80-90 ms), as sketched below.
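A minimal sketch of this conversion, consistent with the worked examples above (log value plus 100, kept within the 0 to 100 range), is:

    def to_score(log_likelihood):
        # Shift the negative log likelihood into the positive range and clamp.
        return max(0.0, min(100.0, 100.0 + log_likelihood))

    # e.g. scores = [to_score(ll) for ll in log_likelihoods]  # one score per interval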
After the log likelihood score conversion as above, the adjustment score providing unit (700) provides an adjustment score for each time interval to the score output unit according to whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match in each time interval.
More specifically, according to the following log likelihood adjustment formula, for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match, the adjustment score of that interval is set to 100 and provided to the score output unit; for a time interval in which they do not match, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit as the adjustment score.
adjusted score_i = 100, if the recognized phoneme of the i-th time interval matches q_i;
adjusted score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100, otherwise    (log likelihood adjustment formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
For example, as shown in FIG. 3, for 'æ and æ' in interval 5, 'æ and æ' in interval 6, 't and t' in interval 7, and 't and t' in interval 8, the speech recognition result and the forced alignment result match; so although the log likelihood conversion scores for those time intervals are 80, 80, 90, and 90 respectively, an adjustment score of 100 is provided to the score output unit.
The reason the adjustment score is set to 100 is that, even though the speech recognition result information and the forced alignment result information match, the evaluation score would be considerably lowered if no adjustment were made; the adjustment score of 100 is introduced to eliminate this scoring error.
Also, as shown in FIG. 3, for 'k and b' in interval 1, 'k and b' in interval 2, 'k and b' in interval 3, 'æ and b' in interval 4, and 't and s' in interval 9, the speech recognition result and the forced alignment result do not match, so the log likelihood conversion score of each time interval is provided to the score output unit as the adjustment score: 90 for interval 1, 80 for interval 2, 100 for interval 3, 40 for interval 4, and 70 for interval 9 (see the sketch below).
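The adjustment rule can be sketched as follows, using the lists from the FIG. 3 example above: where the recognized phoneme agrees with the forced-aligned phoneme, the interval score becomes 100; otherwise the converted score is kept.

    def adjusted_scores(recognized, forced_align, scores):
        return [100.0 if r == q else s
                for r, q, s in zip(recognized, forced_align, scores)]

    # With the FIG. 3 converted scores [90, 80, 100, 40, 80, 80, 90, 90, 70]
    # this yields [90, 80, 100, 40, 100, 100, 100, 100, 70].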
Then, the score output unit (800) calculates an average score for each phoneme of the input voice information based on the adjustment scores provided by the adjustment score providing unit, or calculates and outputs an overall average score for the input voice information.
For example, when the learner provides the voice information and spoken text information for 'cat', 'cat' consists of the phonemes 'k', 'æ', and 't'. As shown in FIG. 3, the per-interval adjustment scores for the 'k' phoneme are 90, 80, and 100, so its average score is (90+80+100)/3 = 90; the per-interval adjustment scores for the 'æ' phoneme are 40, 100, and 100, so its average score is (40+100+100)/3 = 80; and the per-interval adjustment scores for the 't' phoneme are 100, 100, and 70, so its average score is (100+100+70)/3 = 90. These averages are calculated and output.
The overall average score for the input voice information 'cat' is (90+80+90)/3, so 86.7 is calculated and output.
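A minimal sketch of this averaging step, grouping the adjusted interval scores by the forced-aligned phoneme of each interval and then averaging the phoneme scores for the overall result:

    def phoneme_averages(forced_align, adj_scores):
        groups = {}
        for q, s in zip(forced_align, adj_scores):
            groups.setdefault(q, []).append(s)
        return {q: sum(v) / len(v) for q, v in groups.items()}

    # FIG. 3 example: {'k': 90.0, 'æ': 80.0, 't': 90.0};
    # overall score = (90 + 80 + 90) / 3 = 86.7 (rounded).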
In addition, at least one of the per-phoneme average score and the overall average score for the input voice information is output to the screen.
For example, as shown in FIG. 4, the per-phoneme average scores and the overall average score may be provided at the same time, only the per-phoneme average scores may be provided, or only the overall average score may be provided.
Hereinafter, a phoneme-unit pronunciation accuracy evaluation method according to the first embodiment of the present invention will be described in detail with reference to FIG. 5.
FIG. 5 is an overall flowchart of the phoneme-unit pronunciation accuracy evaluation method according to the first embodiment of the present invention.
As shown in FIG. 5, the phoneme-unit pronunciation accuracy evaluation method includes a voice information extraction step (S100), a speech recognition step (S200), a forced alignment step (S300), a log likelihood calculation step (S400), a log likelihood score conversion step (S500), an adjustment score providing step (S600), and a score output step (S700).
Specifically, the phoneme-unit pronunciation accuracy evaluation method of the present invention comprises:
a voice information extraction step (S100) in which the voice information extraction unit (100) acquires, from a learner, spoken text information and the learner's uttered voice information for that text, divides the acquired voice information into units of a set time interval, and extracts a speech feature vector for each time interval;
a speech recognition step (S200) in which the speech recognition unit (300) performs speech recognition on the speech feature vectors for each time interval extracted in the voice information extraction step (S100), using the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200), to generate speech recognition result information;
a forced alignment step (S300) in which the forced alignment unit (400) forcibly aligns the spoken text information acquired in the voice information extraction step (S100) by time interval to generate forced alignment result information;
a log likelihood calculation step (S400) in which the log likelihood calculation unit (500) calculates a log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step (S100) and the forced alignment result information generated in the forced alignment step (S300);
a log likelihood score conversion step (S500) in which the log likelihood score conversion unit (600) converts the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100;
an adjustment score providing step (S600) in which the adjustment score providing unit (700) provides an adjustment score for each time interval to the score output unit according to whether the speech recognition result information and the forced alignment result information match in each time interval;
and a score output step (S700) in which the score output unit (800) calculates an average score for each phoneme based on the provided adjustment scores for each time interval, or calculates and outputs an overall average score for the input voice information.
The log likelihood calculation step (S400) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step (S100) and the forced alignment result information generated in the forced alignment step (S300), according to the following log likelihood formula:
log(p(o_i | q_i))    (log likelihood formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
The log likelihood score conversion step (S500) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
In the adjustment score providing step (S600), according to the following log likelihood adjustment formula, for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match, the adjustment score of that interval is set to 100 and provided to the score output unit; for a time interval in which they do not match, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit as the adjustment score.
adjusted score_i = 100, if the recognized phoneme of the i-th time interval matches q_i;
adjusted score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100, otherwise    (log likelihood adjustment formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
The specific features of the phoneme-unit pronunciation accuracy evaluation method according to the first embodiment of the present invention are the same as those described for the phoneme-unit pronunciation accuracy evaluation apparatus according to the first embodiment, so a detailed description is omitted.
According to the present invention, the problem of conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for an uttered speech signal corresponding to a given word or sentence, is improved by providing a score for each phoneme, the detailed unit of the speech signal; since not only the overall pronunciation score but also the score for each phoneme can be fed back, the learner can study intensively the phonemes that are weak, enhancing the learning effect.
In addition, by providing the per-phoneme score as a value between 0 and 100 through a web page or mobile app that is easy for users to access, the technology can be easily disseminated.
While the preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described; various modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood individually apart from the technical idea or outlook of the present invention.
Through the phoneme-unit pronunciation accuracy evaluation apparatus and evaluation method according to the present invention, having the configuration and operation described above, scores are provided for each phoneme, the detailed unit of the speech signal, rather than only an overall pronunciation evaluation score; since both the overall pronunciation score and the score for each phoneme can be fed back, the learner can study intensively the phonemes that are weak, enhancing the learning effect, and the industrial applicability is accordingly high.

Claims (10)

1. A phoneme-unit pronunciation accuracy evaluation apparatus, comprising:
    a voice information extraction unit (100) that acquires, from a learner, spoken text information and the voice information the learner pronounced for that text, divides the acquired voice information into units of a set time interval, and extracts a speech feature vector for each time interval;
    a native speaker acoustic model storage unit (200) in which native speaker acoustic model information is stored;
    a speech recognition unit (300) that performs speech recognition on the speech feature vectors extracted by the voice information extraction unit for each time interval, using the native speaker acoustic model information stored in the native speaker acoustic model storage unit, to generate speech recognition result information;
    a forced alignment unit (400) that forcibly aligns the spoken text information acquired by the voice information extraction unit by time interval to generate forced alignment result information;
    a log likelihood calculation unit (500) that calculates a log likelihood for each time interval of the forced alignment result information, using the speech feature vectors extracted by the voice information extraction unit (100) for each time interval and the forced alignment result information generated by the forced alignment unit (400);
    a log likelihood score conversion unit (600) that generates a log likelihood conversion score by converting the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100;
    an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit according to whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match in each time interval;
    and a score output unit (800) that calculates an average score for each phoneme of the input voice information based on the adjustment scores for each time interval provided by the adjustment score providing unit, or calculates and outputs an overall average score for the input voice information.
2. The apparatus of claim 1, wherein the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) includes per-phoneme native speaker pronunciation characteristic information obtained by analyzing, with a deep learning model, the native speaker's speaking rate, the length of the silent interval between pronunciations, and the like.
3. The apparatus of claim 1, wherein the score output unit (800) processes the per-phoneme average score as a score value between 0 and 100.
4. The apparatus of claim 1, wherein the score output unit (800) outputs at least one of the per-phoneme average score and the overall average score for the input voice information to a screen.
5. The apparatus of claim 1, wherein the time interval unit is a time interval in the range of 1 msec to 20 msec.
6. The apparatus of claim 1, wherein the log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors extracted by the voice information extraction unit (100) for each time interval and the forced alignment result information generated by the forced alignment unit (400), according to the following log likelihood formula:
    log(p(o_i | q_i))    (log likelihood formula)
    (where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
7. The apparatus of claim 1, wherein the log likelihood score conversion unit (600) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
    score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
    (where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
8. A phoneme-unit pronunciation accuracy evaluation method, comprising:
    a voice information extraction step (S100) of acquiring, from a learner, spoken text information and the learner's uttered voice information for that text, dividing the acquired voice information into units of a set time interval, and extracting a speech feature vector for each time interval;
    a speech recognition step (S200) of performing speech recognition on the speech feature vectors for each time interval extracted in the voice information extraction step (S100), using native speaker acoustic model information, to generate speech recognition result information;
    a forced alignment step (S300) of forcibly aligning the spoken text information acquired in the voice information extraction step by time interval to generate forced alignment result information;
    a log likelihood calculation step (S400) of calculating a log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
    a log likelihood score conversion step (S500) of converting the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100;
    an adjustment score providing step (S600) of providing an adjustment score for each time interval according to whether the speech recognition result information and the forced alignment result information match in each time interval;
    and a score output step (S700) of calculating an average score for each phoneme of the input voice information based on the adjustment scores for each time interval, or calculating and outputting an overall average score for the input voice information.
9. The method of claim 8, wherein the log likelihood calculation step (S400) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step, according to the following log likelihood formula:
    log(p(o_i | q_i))    (log likelihood formula)
    (where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
10. The method of claim 8, wherein the log likelihood score conversion step (S500) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
    score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
    (where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050074298A (en) * 2004-01-08 2005-07-18 정보통신연구진흥원 Pronunciation test system and method of foreign language
KR20100049201A (en) * 2008-11-03 2010-05-12 윤병원 Electronic dictionary service method having drill on pronunciation and electronic dictionary using the same
KR20150001189A (en) * 2013-06-26 2015-01-06 한국전자통신연구원 System and method for evaluating and training capability of speaking in foreign language using voice recognition
KR101609473B1 (en) * 2014-10-14 2016-04-05 충북대학교 산학협력단 System and method for automatic fluency evaluation of english speaking tests
KR20160122542A (en) * 2015-04-14 2016-10-24 주식회사 셀바스에이아이 Method and apparatus for measuring pronounciation similarity

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986650A (en) * 2020-08-07 2020-11-24 云知声智能科技股份有限公司 Method and system for assisting speech evaluation by means of language identification
CN111986650B (en) * 2020-08-07 2024-02-27 云知声智能科技股份有限公司 Method and system for assisting voice evaluation by means of language identification
WO2022048354A1 (en) * 2020-09-07 2022-03-10 北京世纪好未来教育科技有限公司 Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium
US11749257B2 (en) 2020-09-07 2023-09-05 Beijing Century Tal Education Technology Co., Ltd. Method for evaluating a speech forced alignment model, electronic device, and storage medium
CN112331180A (en) * 2020-11-03 2021-02-05 北京猿力未来科技有限公司 Spoken language evaluation method and device
CN112466288A (en) * 2020-12-18 2021-03-09 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112767919A (en) * 2021-01-22 2021-05-07 北京读我科技有限公司 Voice evaluation method and device
CN112908360A (en) * 2021-02-02 2021-06-04 早道(大连)教育科技有限公司 Online spoken language pronunciation evaluation method and device and storage medium
CN113823329A (en) * 2021-07-30 2021-12-21 腾讯科技(深圳)有限公司 Data processing method and computer device
CN115376547A (en) * 2022-08-12 2022-11-22 腾讯科技(深圳)有限公司 Pronunciation evaluation method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2020027394A1 (en) Apparatus and method for evaluating accuracy of phoneme unit pronunciation
WO2020213996A1 (en) Method and apparatus for interrupt detection
WO2020231181A1 (en) Method and device for providing voice recognition service
WO2020145439A1 (en) Emotion information-based voice synthesis method and device
WO2020190050A1 (en) Speech synthesis apparatus and method therefor
WO2020189850A1 (en) Electronic device and method of controlling speech recognition by electronic device
WO2017217661A1 (en) Word sense embedding apparatus and method using lexical semantic network, and homograph discrimination apparatus and method using lexical semantic network and word embedding
WO2021112642A1 (en) Voice user interface
WO2019139431A1 (en) Speech translation method and system using multilingual text-to-speech synthesis model
WO2017082447A1 (en) Foreign language reading aloud and displaying device and method therefor, motor learning device and motor learning method based on foreign language rhythmic action detection sensor, using same, and electronic medium and studying material in which same is recorded
WO2019078615A1 (en) Method and electronic device for translating speech signal
WO2020230926A1 (en) Voice synthesis apparatus for evaluating quality of synthesized voice by using artificial intelligence, and operating method therefor
WO2020085794A1 (en) Electronic device and method for controlling the same
WO2020050509A1 (en) Voice synthesis device
WO2015099464A1 (en) Pronunciation learning support system utilizing three-dimensional multimedia and pronunciation learning support method thereof
WO2020145472A1 (en) Neural vocoder for implementing speaker adaptive model and generating synthesized speech signal, and method for training neural vocoder
WO2022260432A1 (en) Method and system for generating composite speech by using style tag expressed in natural language
WO2020153717A1 (en) Electronic device and controlling method of electronic device
WO2022080774A1 (en) Speech disorder assessment device, method, and program
WO2021040490A1 (en) Speech synthesis method and apparatus
EP3841460A1 (en) Electronic device and method for controlling the same
WO2023085584A1 (en) Speech synthesis device and speech synthesis method
WO2022035183A1 (en) Device for recognizing user&#39;s voice input and method for operating same
WO2021085661A1 (en) Intelligent voice recognition method and apparatus
WO2023177095A1 (en) Patched multi-condition training for robust speech recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19843830

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.07.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19843830

Country of ref document: EP

Kind code of ref document: A1