WO2020027394A1 - Apparatus and method for evaluating accuracy of phoneme unit pronunciation - Google Patents

Apparatus and method for evaluating accuracy of phoneme unit pronunciation Download PDF

Info

Publication number
WO2020027394A1
WO2020027394A1 (PCT/KR2019/000147)
Authority
WO
WIPO (PCT)
Prior art keywords
score
time interval
unit
information
phoneme
Prior art date
Application number
PCT/KR2019/000147
Other languages
French (fr)
Korean (ko)
Inventor
윤종성
권용대
홍연정
김서현
조영선
양형원
Original Assignee
미디어젠 주식회사 (MediaZen Inc.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 미디어젠 주식회사 (MediaZen Inc.)
Publication of WO2020027394A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/01Assessment or evaluation of speech recognition systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/10Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Definitions

  • The present invention relates to an apparatus and method for evaluating phoneme-level pronunciation accuracy, and more particularly to an apparatus and method that improve on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence.
  • By providing a score for each phoneme (pronunciation), the smallest unit of the speech signal, the apparatus can feed back not only the overall pronunciation score but also the score for each phoneme.
  • Pronunciation is the spoken realization of a language, and its characteristics differ with the language and the individual speaker.
  • Even allowing for individual differences, pronunciation of the same language should be articulated so that speakers can communicate with one another accurately.
  • Such pronunciation-correction methods not only presuppose good listening ability on the learner's part, but are also difficult to apply uniformly to varied pronunciations.
  • The usual way to learn foreign-language speaking and conversation is to attend a language school and learn directly from a native-speaking instructor.
  • HMM: hidden Markov model
  • The speech recognition system extracts feature vectors, in system-defined frame units, from a speech signal that has undergone preprocessing such as spectral subtraction, sound-source separation, and noise filtering, and performs the subsequent signal processing on the extracted feature vectors.
  • A first object of the present invention is to improve on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence.
  • By comparing, interval by interval, the speech recognition result obtained with a native-speaker acoustic model against the forced alignment of the spoken text, both a pronunciation evaluation score for each phoneme and an overall pronunciation evaluation score are provided.
  • The present invention thus aims to provide a phoneme-level pronunciation accuracy evaluation apparatus and method that can feed back not only the overall pronunciation score but also the score for each phoneme (pronunciation).
  • A second object of the present invention is to provide the per-phoneme scores as values between 0 and 100 points through a web page or mobile app that is easily accessible to the user.
  • To achieve the objects of the present invention, the phoneme-level pronunciation accuracy evaluation apparatus comprises:
  • a voice information extraction unit (100) that obtains spoken text information and the learner's pronounced speech for that text, divides the acquired speech into units of a set time interval, and extracts a speech feature vector for each time interval;
  • a forced alignment unit (400) that forcibly aligns the spoken text information obtained by the voice information extraction unit to the time intervals, generating forced alignment result information;
  • an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval;
  • and a score output unit (800) that computes the per-phoneme average scores of the input speech from the per-interval adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
  • a log likelihood calculation step (S400) of computing the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
  • the problem of conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence
  • scores for each phoneme (pronunciation)
  • FIG. 1 is an overall configuration diagram schematically showing an apparatus for evaluating phoneme pronunciation accuracy according to a first embodiment of the present invention.
  • FIG. 2 is an exemplary waveform graph of a speech signal acquired by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
  • FIG. 3 is an exemplary diagram of phoneme average scores calculated by a phoneme pronunciation accuracy evaluation apparatus according to a first exemplary embodiment of the present invention.
  • FIG. 4 is an exemplary view showing the per-phoneme average scores and the overall average score calculated by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
  • FIG. 5 is an overall flowchart of a phoneme unit pronunciation accuracy evaluation method according to a first embodiment of the present invention
  • Terms such as 'first' and 'second' may be used to describe various components, but the components are not limited by these terms.
  • For example, a first component may be called a second component, and similarly a second component may be called a first component.
  • When a component is said to be connected or coupled to another component, it may be directly connected or coupled to that other component, but intervening components may also be present.
  • a voice information extraction unit (100) that obtains spoken text information and the learner's uttered speech for that text from the learner, divides the acquired speech into units of a set time interval, and extracts a speech feature vector for each time interval;
  • a forced alignment unit (400) that forcibly aligns the spoken text information obtained by the voice information extraction unit to the time intervals, generating forced alignment result information;
  • a log likelihood score conversion unit (600) that generates log likelihood conversion scores by converting the per-interval log likelihood of the forced alignment result into scores between 0 and 100 points;
  • an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval;
  • and a score output unit (800) that computes the per-phoneme average scores from the per-interval adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
  • The native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) includes per-phoneme native pronunciation characteristic information obtained by analyzing, with a deep learning model, the native speaker's speaking rate, the length of the silent sections between pronunciations, and so on.
  • The score output unit (800) treats the per-phoneme average scores as score values between 0 and 100 points.
  • The score output unit (800) outputs at least one of the per-phoneme average scores and the overall average score on the screen.
  • The interval unit is a time interval in the range of 1 msec to 20 msec, preferably 10 msec.
  • The log likelihood calculation unit (500) computes the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted by the voice information extraction unit (100) and the forced alignment result information generated by the forced alignment unit (400), according to the log likelihood formula below.
  • Using the log likelihood score conversion formula below, the per-interval log likelihood of the forced alignment result is converted into a score between 0 and 100 points.
  • For a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree, the adjustment score of that interval is set to 100 and provided to the score output unit; for a time interval in which they do not agree, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit as the adjustment score.
  • a log likelihood calculation step (S400) of computing the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
  • In the log likelihood calculation step, the log likelihood calculation unit (500) computes the per-interval log likelihood of the forced alignment result from the per-interval speech feature vectors extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step, using the log likelihood formula below.
  • In the log likelihood score conversion step (S500), the log likelihood score conversion unit (600) converts the per-interval log likelihood of the forced alignment result into a score between 0 and 100 points using the log likelihood score conversion formula below.
  • In the adjustment score providing step (S600), the adjustment score providing unit (700), following the log likelihood adjustment rule below, sets the adjustment score to 100 for a time interval in which the speech recognition result information and the forced alignment result information agree, and provides it to the score output unit; for a time interval in which they do not agree, it provides the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, to the score output unit as the adjustment score.
  • FIG. 1 is an overall configuration diagram schematically showing an apparatus for evaluating phoneme pronunciation accuracy according to a first embodiment of the present invention.
  • The phoneme-level pronunciation accuracy evaluation apparatus (1000) of the present invention improves on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence.
  • By providing a score for each phoneme (pronunciation), the smallest unit of the speech signal, it can feed back not only the overall pronunciation score but also the score for each phoneme, which enhances the learning effect.
  • That is, the shortcoming of the related art, which provides only an overall pronunciation evaluation score for the input speech signal, is remedied by providing a score for each phoneme, a detailed unit of the speech signal.
  • The smallest sound unit that distinguishes meaning is called a phoneme, and when learning a foreign language it is important to learn pronunciation at the phoneme level of that language.
  • Existing pronunciation-evaluation scoring methods have in common that they evaluate the input foreign-language speech signal as a whole, so the user receives only limited feedback because no per-phoneme score is provided.
  • With per-phoneme scores, the learning effect of the pronunciation evaluation feedback is enhanced.
  • For example, a conventional method of calculating a pronunciation score provides only a total score, such as 'the pronunciation score for cat is 80 points.'
  • In contrast, the present invention can report that the pronunciation score for 'c' in cat is 80 points,
  • the pronunciation score for 'a' is 90 points,
  • the pronunciation score for 't' is 90 points,
  • and the overall pronunciation score is 86.6 points.
  • The phoneme-level pronunciation accuracy evaluation apparatus (1000) includes a voice information extraction unit (100), a native speaker acoustic model storage unit (200), a speech recognition unit (300), a forced alignment unit (400), a log likelihood calculation unit (500), a log likelihood score conversion unit (600), an adjustment score providing unit (700), and a score output unit (800).
  • The voice information extraction unit (100) obtains the spoken text information and the speech the learner pronounced for that text, divides the acquired speech into units of the set time interval, and extracts a speech feature vector for each time interval.
  • For example, spoken text information corresponding to the text 'cat' and the speech of a learner pronouncing 'cat' are obtained.
  • To this end, the apparatus is provided with an input means for entering text and an input means for entering speech.
  • The learner provides the spoken text information to the voice information extraction unit (100) through the text input means.
  • The speech in which the learner pronounces 'cat', the spoken text, is input through the speech input means (for example, a microphone).
  • Having received the spoken text information and the speech, the voice information extraction unit (100) acquires the speech signal for 'cat', divides it into units of the set time interval, and extracts a speech feature vector for each time interval, as shown in FIG. 2.
  • Time intervals of 10 ms are marked off over the speech signal illustrated in FIG. 2, and a feature vector (MFCC) of the speech signal is extracted for each time interval.
  • MFCC: Mel-frequency cepstral coefficients
  • The time interval unit for extracting the speech feature vectors is a time unit in the range of 1 msec to 20 msec.
  • Preferably, the unit is set to 10 msec.
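  • For illustration only, per-interval MFCC extraction of this kind could be sketched with an off-the-shelf library such as librosa; the 10 msec hop follows the preferred interval stated above, while the 16 kHz sampling rate and 13 coefficients are assumptions not taken from the patent:

    import librosa

    def extract_interval_features(wav_path, interval_sec=0.010, n_mfcc=13):
        # Load the learner's recording; 16 kHz is an assumed, typical rate.
        signal, sr = librosa.load(wav_path, sr=16000)
        hop = int(interval_sec * sr)  # one feature vector per 10 msec interval
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=2 * hop, hop_length=hop)
        # Each row of the returned array is the feature vector o_i of one interval.
        return mfcc.T  # shape: (num_intervals, n_mfcc)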
  • The native speaker acoustic model storage unit (200) stores native speaker acoustic model information.
  • The native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) is per-phoneme native pronunciation characteristic information obtained using a deep learning model.
  • That is, per-phoneme native pronunciation characteristic information, the result of analyzing a native speaker's speaking rate and the length of the silent sections between pronunciations, is stored in the native speaker acoustic model storage unit (200) and used.
  • The speech recognition unit (300) performs speech recognition on the speech pronounced by the learner.
  • Specifically, the speech recognition unit (300) performs speech recognition on the per-interval speech feature vectors extracted by the voice information extraction unit (100), using the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200), to generate speech recognition result information.
  • As shown in FIG. 3, the speech recognition result information is, for example: the phoneme 'b' in 0-10 ms (interval 1), 'b' in 10-20 ms (interval 2), 'b' in 20-30 ms (interval 3), 'b' in 30-40 ms (interval 4), 'æ' in 40-50 ms (interval 5), 'æ' in 50-60 ms (interval 6), 't' in 60-70 ms (interval 7), 't' in 70-80 ms (interval 8), and 's' in 80-90 ms (interval 9).
  • The forced alignment unit (400) generates the forced alignment result information by forcibly aligning the spoken text information obtained by the voice information extraction unit (100) to the time intervals.
  • That is, the forced alignment unit (400) forcibly aligns the phoneme-level pronunciation corresponding to the text 'cat' to the speech signal, as shown in FIG. 3.
  • In other words, the phoneme-level pronunciation corresponding to the spoken text is forcibly aligned to each 10 ms time interval.
  • As shown in FIG. 3, the forced alignment result information is: the phoneme 'k' in 0-10 ms (interval 1), 'k' in 10-20 ms (interval 2), 'k' in 20-30 ms (interval 3), 'æ' in 30-40 ms (interval 4), 'æ' in 40-50 ms (interval 5), 'æ' in 50-60 ms (interval 6), 't' in 60-70 ms (interval 7), 't' in 70-80 ms (interval 8), and 't' in 80-90 ms (interval 9).
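  • For reference, the two per-interval label sequences of the 'cat' example above can be written out directly (an illustrative sketch, not part of the disclosed apparatus):

    # Per-interval phoneme labels from the FIG. 3 example for 'cat' (10 ms intervals).
    recognized = ['b', 'b', 'b', 'b', 'æ', 'æ', 't', 't', 's']  # speech recognition result
    forced     = ['k', 'k', 'k', 'æ', 'æ', 'æ', 't', 't', 't']  # forced alignment result
    assert len(recognized) == len(forced)  # one label per time interval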
  • The log likelihood calculation unit (500) uses the per-interval speech feature vectors extracted by the voice information extraction unit (100) and the forced alignment result information generated by the forced alignment unit (400) to compute the log likelihood of the forced alignment result for each time interval.
  • That is, the log likelihood calculation unit (500) computes the per-interval log likelihood of the forced alignment result using the log likelihood formula log(p(oi│qi)).
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • The log likelihood of each time interval of the forced alignment result can take a negative value, because p(oi│qi), the probability of observing oi given qi in the i-th time interval, is a value between 0 and 1, and the logarithm of such a value is negative.
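  • A minimal sketch of this per-interval computation follows; note the patent's acoustic model is a deep-learning model, so the single Gaussian per phoneme assumed here is only a stand-in:

    import numpy as np
    from scipy.stats import multivariate_normal

    def interval_log_likelihoods(features, forced_phonemes, models):
        """Compute log p(o_i | q_i) for every time interval i.

        features:        (num_intervals, dim) array of feature vectors o_i
        forced_phonemes: phoneme labels q_i from the forced alignment result
        models:          dict phoneme -> (mean, cov); Gaussian stand-ins for
                         the native-speaker acoustic model (an assumption,
                         not the patent's deep-learning model)
        """
        lls = []
        for o_i, q_i in zip(features, forced_phonemes):
            mean, cov = models[q_i]
            # Likelihoods are typically far below 1, so the log value is negative.
            lls.append(multivariate_normal.logpdf(o_i, mean=mean, cov=cov))
        return np.array(lls)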
  • The log likelihood score conversion unit (600) converts the log likelihood computed for each time interval of the forced alignment result into a score between 0 and 100 points.
  • The reason for this conversion is that the computed per-interval log likelihood is a negative value, whereas a score in the positive range of 0 to 100 points is easier to interpret.
  • That is, using the log likelihood score conversion formula, the per-interval log likelihood of the forced alignment result is converted into a score between 0 and 100 points.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • For example, the converted log likelihood of 90 points for interval 1 (0-10 ms) shown in FIG. 3 is the value obtained by adding 100 to the log of the probability that the speech feature vector of interval 1 is produced by 'k', the phoneme of interval 1 according to the forced alignment result information.
  • Likewise, the converted log likelihood of 80 points for interval 5 (40-50 ms) is the log of the probability that the speech feature vector of interval 5 is produced by 'æ', the phoneme of interval 5 according to the forced alignment result information, plus 100.
  • And the converted log likelihood of 90 points for interval 7 (60-70 ms) shown in FIG. 3 is the log of the probability that the speech feature vector of interval 7 is produced by 't', the phoneme of interval 7 according to the forced alignment result information, plus 100.
  • That is, p(o1│q1) is the probability that o1, the speech feature vector of the first time interval, is produced by the phoneme 'k'; the log likelihood is the logarithm of this probability value.
  • Since the probability value is much smaller than 1, its logarithm is negative (here, -10), and adding 100 yields the converted log likelihood value of 90 points shown in FIG. 3.
  • In other words, as shown in FIG. 3, the log likelihood score conversion unit (600) produces converted log likelihoods (for the phonemes per the forced alignment result information) of 90 points for 'k' in interval 1 (0-10 ms), 80 points for 'k' in interval 2 (10-20 ms), 100 points for 'k' in interval 3 (20-30 ms), 40 points for 'æ' in interval 4 (30-40 ms), 80 points for 'æ' in interval 5 (40-50 ms), 80 points for 'æ' in interval 6 (50-60 ms), 90 points for 't' in interval 7 (60-70 ms), 90 points for 't' in interval 8 (70-80 ms), and 70 points for 't' in interval 9 (80-90 ms).
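  • A sketch of this conversion is shown below; the '+100' offset follows the worked example above, while the clamping to the 0 to 100 range is an assumption, since the patent states only that the result lies between 0 and 100 points:

    def to_conversion_score(log_likelihood):
        # Converted score = per-interval log likelihood + 100, as in the worked
        # example above; clamping to [0, 100] is an assumption for extreme values.
        return max(0.0, min(100.0, log_likelihood + 100.0))

    # Interval 1 of the example: log p(o1|q1) = -10 gives the converted score 90.
    assert to_conversion_score(-10.0) == 90.0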
  • The adjustment score providing unit (700) provides the adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval.
  • For a time interval in which they agree, the adjustment score of that interval is set to 100 and provided to the score output unit.
  • For a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit do not agree, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • In FIG. 3, the speech recognition result and the forced alignment result agree in intervals 5 and 6 ('æ' and 'æ') and in intervals 7 and 8 ('t' and 't').
  • Although the log likelihood conversion scores for these time intervals are 80, 80, 90, and 90 points respectively, the adjustment score of 100 points is what is provided to the score output unit for each of them.
  • The reason for setting the adjustment score to 100 is that, even though the speech recognition result information and the forced alignment result information agree, leaving the score unadjusted would lower the evaluation score considerably; the adjustment score of 100 points is introduced to eliminate this scoring error.
  • For the time intervals in which they do not agree, the log likelihood conversion score of the interval is provided to the score output unit as the adjustment score.
  • Accordingly, 90 points for interval 1, 80 points for interval 2, 100 points for interval 3, 40 points for interval 4, and 70 points for interval 9 are provided to the score output unit.
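  • The adjustment rule reduces to a per-interval comparison; the sketch below reproduces the adjustment scores of the 'cat' example (illustrative only):

    def adjustment_scores(recognized, forced, conversion_scores):
        # Matching interval: adjustment score 100; otherwise keep the converted score.
        return [100.0 if r == f else float(s)
                for r, f, s in zip(recognized, forced, conversion_scores)]

    recognized = ['b', 'b', 'b', 'b', 'æ', 'æ', 't', 't', 's']
    forced     = ['k', 'k', 'k', 'æ', 'æ', 'æ', 't', 't', 't']
    conversion = [90, 80, 100, 40, 80, 80, 90, 90, 70]   # FIG. 3 converted scores
    print(adjustment_scores(recognized, forced, conversion))
    # -> [90.0, 80.0, 100.0, 40.0, 100.0, 100.0, 100.0, 100.0, 70.0]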
  • The score output unit (800) computes the per-phoneme average scores of the input speech from the adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
  • As shown in FIG. 3, 'cat' is composed of the phonemes 'k', 'æ', and 't'.
  • The per-interval adjustment scores for the 'k' phoneme are 90, 80, and 100 points, so its average is (90 + 80 + 100) / 3 = 90 points; the adjustment scores for the 'æ' phoneme are 40, 100, and 100 points, so its average is (40 + 100 + 100) / 3 = 80 points; and the adjustment scores for the 't' phoneme are 100, 100, and 70 points, so its average is (100 + 100 + 70) / 3 = 90 points. These averages are calculated and output.
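  • The per-phoneme and overall averages of the example can be reproduced with a short sketch (illustrative only):

    from itertools import groupby

    def phoneme_averages(forced, adjusted):
        # Average the adjustment scores over each contiguous run of intervals
        # that the forced alignment assigns to the same phoneme.
        result = []
        for phoneme, run in groupby(zip(forced, adjusted), key=lambda p: p[0]):
            scores = [s for _, s in run]
            result.append((phoneme, sum(scores) / len(scores)))
        return result

    forced   = ['k', 'k', 'k', 'æ', 'æ', 'æ', 't', 't', 't']
    adjusted = [90, 80, 100, 40, 100, 100, 100, 100, 70]
    print(phoneme_averages(forced, adjusted))  # [('k', 90.0), ('æ', 80.0), ('t', 90.0)]
    print(sum(adjusted) / len(adjusted))       # 86.66... = overall average score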
  • At least one of the per-phoneme average scores and the overall average score of the input speech is output on the screen.
  • The per-phoneme average scores and the overall average score may be provided at the same time, or only the per-phoneme average scores or only the overall average score may be provided.
  • FIG. 5 is a flowchart illustrating a method for evaluating phoneme unit pronunciation accuracy according to a first embodiment of the present invention.
  • The phoneme-level pronunciation accuracy evaluation method includes a voice information extraction step (S100), a speech recognition step (S200), a forced alignment step (S300), a log likelihood calculation step (S400), a log likelihood score conversion step (S500), an adjustment score providing step (S600), and a score output step (S700).
  • In the voice information extraction step (S100), the voice information extraction unit (100) obtains the spoken text information and the learner's uttered speech for that text, divides the acquired speech into units of the set time interval, and extracts a speech feature vector for each time interval.
  • In the speech recognition step (S200), the speech recognition unit (300) generates speech recognition result information by performing speech recognition on the per-interval speech feature vectors extracted in the voice information extraction step (S100), using the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200).
  • In the forced alignment step (S300), the forced alignment unit (400) forcibly aligns the spoken text information obtained in the voice information extraction step (S100) to the time intervals, generating forced alignment result information.
  • In the log likelihood calculation step (S400), the log likelihood calculation unit (500) computes the per-interval log likelihood of the forced alignment result from the per-interval speech feature vectors extracted in the voice information extraction step (S100) and the forced alignment result information generated in the forced alignment step (S300), using the log likelihood formula below.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • The log likelihood score conversion step (S500) converts the per-interval log likelihood of the forced alignment result into a score between 0 and 100 points using the log likelihood score conversion formula below.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
  • In the adjustment score providing step (S600), following the log likelihood adjustment rule below, the adjustment score is set to 100 for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree, and is provided to the score output unit; for a time interval in which they do not agree, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit.
  • oi denotes the speech feature vector of the i-th time interval,
  • qi denotes the phoneme of the i-th time interval according to the forced alignment result information,
  • and p(oi│qi) denotes the probability of observing oi given qi in the i-th time interval.
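  • Putting steps S200 through S700 together, an end-to-end evaluation could look like the following sketch, in which the recognizer, aligner, and acoustic model are hypothetical callables standing in for components the patent does not specify in code:

    def evaluate_pronunciation(features, text, recognize, force_align, log_likelihood):
        """Illustrative pipeline for steps S200-S700 (assumed interfaces)."""
        recognized = recognize(features)                          # S200
        forced = force_align(text, len(features))                 # S300
        converted = [max(0.0, min(100.0, log_likelihood(o, q) + 100.0))
                     for o, q in zip(features, forced)]           # S400-S500
        adjusted = [100.0 if r == f else s                        # S600
                    for r, f, s in zip(recognized, forced, converted)]
        overall = sum(adjusted) / len(adjusted)                   # S700
        return adjusted, overall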
  • The present invention improves on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence, by providing a score for each phoneme (pronunciation), a detailed unit of the speech signal.
  • Since not only the overall pronunciation score but also the score for each phoneme (pronunciation) can be fed back, the learner can concentrate on the insufficient phonemes, enhancing the learning effect.
  • Industrial applicability is thereby also increased.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention relates to an apparatus and a method for evaluating the accuracy of phoneme-level pronunciation. More specifically, it alleviates the problem of conventional automatic pronunciation evaluation devices, which provide only a total pronunciation evaluation score for the speech signal pronounced by a learner for a given word or sentence, by providing a score for each phoneme (pronunciation), a detailed unit of the speech signal. Both the total pronunciation score and the per-phoneme scores can thus be fed back, so that an unsatisfactory phoneme can be studied intensively, enhancing the learning effect.

Description

Apparatus and Method for Evaluating the Accuracy of Phoneme-Level Pronunciation
The present invention relates to an apparatus and method for evaluating phoneme-level pronunciation accuracy, and more particularly to an apparatus and method that improve on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence, by providing a score for each phoneme (pronunciation), the smallest unit of the speech signal; since not only the overall pronunciation score but also the score for each phoneme can be fed back, the learner can concentrate intensively on whichever phonemes are insufficient, enhancing the learning effect.
As information exchange increases, communication between people has become more important than ever in modern society.
The development of information and communication technology has diversified the means of communication, but conversation carried by the human voice remains the most important method of communication.
Even when communicating by voice there are various factors to consider, and an important one among them is pronunciation.
Pronunciation is the spoken realization of a language, and its characteristics differ with the language and the individual.
Fundamentally, pronunciation of the same language should, even allowing for individual differences, be articulated so that speakers can communicate with one another accurately.
However, not everyone pronounces a language accurately according to its characteristics; because of this, the same words often have to be repeated several times, or miscommunication occurs.
Various methods of correcting pronunciation toward accurate speech have been proposed, but most are sensory methods that are not quantitatively analyzed, such as imitating the pronunciation of a person widely judged to pronounce correctly, or repeatedly saying particular words or sentences that are difficult to pronounce.
That is, the method mainly used has been simple repetition, without knowing the pronunciation characteristics of the person judged to pronounce accurately.
Such pronunciation-correction methods not only presuppose good listening ability on the learner's part, but are also difficult to apply uniformly to varied pronunciations.
Meanwhile, the recent development of the Internet and the expansion of trade have increased opportunities to meet people from many countries; in particular, as businesses meet foreign buyers more often, the demand for foreign languages keeps growing.
As meetings with foreigners increase, conversation-oriented foreign language education is gaining attention, unlike the conventional reading-oriented approach.
Generally, the way to learn foreign-language speaking and conversation is to attend a language school and learn directly from a native-speaking instructor.
However, attending a language school raises problems of time and cost, and even when learning directly from a foreign instructor it is not easy to obtain feedback.
Therefore, a foreign-language learning method that solved the time and cost problems and provided appropriate feedback would be efficient in terms of both time and cost.
Recently, with the development of speech recognition technology, many attempts have been made to apply it to foreign language education.
Among these, a method attempted frequently in recent years uses the hidden Markov model (hereinafter 'HMM').
Here, the speech recognition system extracts feature vectors, in system-defined frame units, from a speech signal that has undergone preprocessing such as spectral subtraction, sound-source separation, and noise filtering, and performs the subsequent signal processing on the extracted feature vectors.
Existing methods and systems for evaluating foreign-language speaking did no more than measure, with an HMM recognizer, the accuracy of the unit to be evaluated.
This is because other elements of the speaker's pronunciation (length, energy, intonation, stress, and so on) could not be reflected in the feature vector.
In other words, the learner simply read sentences aloud, and the evaluation was based on the results obtained through the HMM recognizer.
In foreign languages, however, unlike Korean, it is elements such as length, energy, intonation, and stress that carry an important axis of meaning.
For example, in Chinese the meaning can change completely with the tones, which are related to intonation, and in English-speaking languages stress plays an important part in conveying meaning.
Foreign-language automatic pronunciation evaluation devices in common use today provide only an overall pronunciation evaluation score for the input speech signal; they do not teach pronunciation at the level of the phoneme, the smallest sound unit that distinguishes meaning.
They therefore provide only limited feedback information to the user, and have been limited in their ability to enhance the learning effect.
<Prior Art Documents>
Korean Registered Patent No. 10-0733469
Accordingly, the present invention is proposed in view of the problems of the prior art described above. A first object of the present invention is to improve on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence, by comparing, interval by interval, the speech recognition result obtained for the spoken foreign-language speech signal using a native-speaker acoustic model against the forced alignment of the spoken text, and thereby to provide a phoneme-level pronunciation accuracy evaluation apparatus and method that deliver a pronunciation evaluation score for each phoneme as well as an overall pronunciation evaluation score, feeding back not only the overall score but also the score for each phoneme (pronunciation).
A second object of the present invention is to provide the per-phoneme scores as values between 0 and 100 points through a web page or mobile app that is easily accessible to the user.
To achieve the objects of the present invention, the phoneme-level pronunciation accuracy evaluation apparatus comprises:
a voice information extraction unit (100) that obtains spoken text information and the learner's pronounced speech for that text, divides the acquired speech into units of a set time interval, and extracts a speech feature vector for each time interval;
a native speaker acoustic model storage unit (200) in which native speaker acoustic model information is stored;
a speech recognition unit (300) that generates speech recognition result information by performing speech recognition on the per-interval speech feature vectors extracted by the voice information extraction unit, using the native speaker acoustic model information stored in the native speaker acoustic model storage unit;
a forced alignment unit (400) that forcibly aligns the spoken text information obtained by the voice information extraction unit to the time intervals, generating forced alignment result information;
a log likelihood calculation unit (500) that computes the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted by the voice information extraction unit (100) and the forced alignment result information generated by the forced alignment unit (400);
a log likelihood score conversion unit (600) that generates log likelihood conversion scores by converting the per-interval log likelihood computed for the forced alignment result into scores between 0 and 100 points;
an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval;
and a score output unit (800) that computes the per-phoneme average scores of the input speech from the per-interval adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
Meanwhile, the phoneme-level pronunciation accuracy evaluation method of the present invention comprises:
a voice information extraction step (S100) of obtaining spoken text information and the learner's uttered speech for that text, dividing the acquired speech into units of a set time interval, and extracting a speech feature vector for each time interval;
a speech recognition step (S200) of generating speech recognition result information by performing speech recognition on the per-interval speech feature vectors extracted in the voice information extraction step (S100), using native speaker acoustic model information;
a forced alignment step (S300) of generating forced alignment result information by forcibly aligning the spoken text information obtained in the voice information extraction step to the time intervals;
a log likelihood calculation step (S400) of computing the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
a log likelihood score conversion step (S500) of converting the per-interval log likelihood computed for the forced alignment result into scores between 0 and 100 points;
an adjustment score providing step (S600) of providing an adjustment score for each time interval, depending on whether the speech recognition result information and the forced alignment result information agree in that interval;
and a score output step (S700) of computing the per-phoneme average scores of the input speech from the per-interval adjustment scores, or computing and outputting the overall average score of the input speech.
Through the phoneme-level pronunciation accuracy evaluation apparatus and method of the present invention, configured and operating as above, the problem of conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for the speech signal uttered for a given word or sentence, is improved by providing a score for each phoneme (pronunciation), the smallest unit of the speech signal; since not only the overall pronunciation score but also the score for each phoneme can be fed back, the learner can concentrate on whichever phonemes are weak, enhancing the learning effect.
In addition, by providing the per-phoneme scores as values between 0 and 100 points through a web page or mobile app that is easily accessible to the user, the technology is easy to disseminate.
FIG. 1 is an overall configuration diagram schematically showing the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
FIG. 2 is an exemplary waveform graph of a speech signal acquired by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
FIG. 3 is an exemplary diagram of the per-phoneme average scores calculated by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
FIG. 4 is an exemplary view showing, on a screen, the per-phoneme average scores and the overall average score calculated by the phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
FIG. 5 is an overall flowchart of the phoneme-level pronunciation accuracy evaluation method according to the first embodiment of the present invention.
<Reference Numerals>
100: voice information extraction unit
200: native speaker acoustic model storage unit
300: speech recognition unit
400: forced alignment unit
500: log likelihood calculation unit
600: log likelihood score conversion unit
700: adjustment score providing unit
800: score output unit
1000: phoneme-level pronunciation accuracy evaluation apparatus
The following merely illustrates the principles of the invention. Those skilled in the art, although not explicitly described or shown herein, can embody the principles of the invention and devise various apparatuses that fall within its concept and scope.
In addition, all conditional terms and embodiments listed herein are, in principle, expressly intended only for the purpose of making the concept of the invention understood, and are not to be construed as limited to the specifically listed embodiments and states.
In describing the present invention, terms such as 'first' and 'second' may be used to describe various components, but the components are not limited by these terms.
For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component.
When a component is said to be connected or coupled to another component, it may be directly connected or coupled to that other component, but intervening components may also be present.
The terminology used herein is for describing particular embodiments only and is not intended to limit the invention; singular expressions include plural expressions unless the context clearly indicates otherwise.
In this specification, terms such as 'comprise' or 'include' are intended to designate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not exclude in advance the possibility of the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.
The phoneme-level pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention comprises:
a voice information extraction unit (100) that obtains spoken text information and the learner's uttered speech for that text, divides the acquired speech into units of a set time interval, and extracts a speech feature vector for each time interval;
a native speaker acoustic model storage unit (200) in which native speaker acoustic model information is stored;
a speech recognition unit (300) that generates speech recognition result information by performing speech recognition on the per-interval speech feature vectors extracted by the voice information extraction unit, using the native speaker acoustic model information stored in the native speaker acoustic model storage unit;
a forced alignment unit (400) that forcibly aligns the spoken text information obtained by the voice information extraction unit to the time intervals, generating forced alignment result information;
a log likelihood calculation unit (500) that computes the per-interval log likelihood of the forced alignment result, using the per-interval speech feature vectors extracted by the voice information extraction unit (100) and the forced alignment result information generated by the forced alignment unit (400);
a log likelihood score conversion unit (600) that generates log likelihood conversion scores by converting the computed per-interval log likelihood of the forced alignment result into scores between 0 and 100 points;
an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit, depending on whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit agree in that interval;
and a score output unit (800) that computes the per-phoneme average scores from the per-interval adjustment scores provided by the adjustment score providing unit, or computes and outputs the overall average score of the input speech.
Here, the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) includes per-phoneme native speaker pronunciation characteristic information obtained by analyzing, with a deep learning model, the native speaker's speaking rate, the length of the silent interval between pronunciations, and the like.
The score output unit (800) processes the per-phoneme average score as a score value between 0 and 100.
The score output unit (800) outputs at least one of the per-phoneme average score and the overall average score to a screen.
The time interval unit is a time interval in the range of 1 msec to 20 msec, preferably 10 msec.
The log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors extracted by the voice information extraction unit (100) for each time interval and the forced alignment result information generated by the forced alignment unit (400), according to the following log likelihood formula:
log(p(o_i | q_i))    (log likelihood formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
The log likelihood score conversion unit (600) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
According to the following log likelihood adjustment formula, the adjustment score providing unit (700) provides, for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match, an adjustment score of 100 for that interval to the score output unit; for a time interval in which they do not match, it provides the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, to the score output unit as the adjustment score.
adjusted score_i = 100, if the recognized phoneme of the i-th time interval matches q_i;
adjusted score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100, otherwise    (log likelihood adjustment formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
Meanwhile, a phoneme-unit pronunciation accuracy evaluation method according to the first embodiment of the present invention comprises:
a voice information extraction step (S100) of acquiring, from a learner, spoken text information and the learner's uttered voice information for that text, dividing the acquired voice information into units of a set time interval, and extracting a speech feature vector for each time interval;
a speech recognition step (S200) of performing speech recognition on the speech feature vectors for each time interval extracted in the voice information extraction step (S100), using native speaker acoustic model information, to generate speech recognition result information;
a forced alignment step (S300) of forcibly aligning the spoken text information acquired in the voice information extraction step by time interval to generate forced alignment result information;
a log likelihood calculation step (S400) of calculating a log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
a log likelihood score conversion step (S500) of converting the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100;
an adjustment score providing step (S600) of providing an adjustment score for each time interval according to whether the speech recognition result information and the forced alignment result information match in each time interval;
and a score output step (S700) of calculating an average score for each phoneme of the input voice information based on the adjustment scores for each time interval, or calculating and outputting an overall average score for the input voice information.
In this case, in the log likelihood calculation step (S400), the log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step, according to the following log likelihood formula:
log(p(o_i | q_i))    (log likelihood formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
In this case, in the log likelihood score conversion step (S500), the log likelihood score conversion unit (600) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
In this case, in the adjustment score providing step (S600), the adjustment score providing unit (700), according to the following log likelihood adjustment formula, provides to the score output unit an adjustment score of 100 for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match, and, for a time interval in which they do not match, provides to the score output unit the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, as the adjustment score.
adjusted score_i = 100, if the recognized phoneme of the i-th time interval matches q_i;
adjusted score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100, otherwise    (log likelihood adjustment formula)
(where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
Hereinafter, the phoneme-unit pronunciation accuracy evaluation apparatus and evaluation method according to the present invention will be described in detail through embodiments.
FIG. 1 is an overall configuration diagram schematically showing a phoneme-unit pronunciation accuracy evaluation apparatus according to the first embodiment of the present invention.
As shown in FIG. 1, the phoneme-unit pronunciation accuracy evaluation apparatus (1000) of the present invention improves on conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for an uttered speech signal corresponding to a given word or sentence, by providing a score for each phoneme (pronunciation unit), the detailed unit of the speech signal. Since not only the overall pronunciation score but also the score for each phoneme can be fed back, the learner can study intensively the phonemes that are weak, enhancing the learning effect.
In addition, by providing the per-phoneme score as a value between 0 and 100 through a web page or mobile app that is easy for users to access, the technology can be easily disseminated.
That is, the problem of the prior art, which provides only an overall pronunciation evaluation score for the input speech signal, is improved by providing a score for each phoneme, the detailed unit of the speech signal.
Here, the smallest sound unit that produces a difference in meaning is called a phoneme, and when learning a foreign language, learning the pronunciation of the phoneme units of that language is important.
Existing pronunciation evaluation scoring methods have in common that they provide a score evaluating the input foreign-language speech signal as a whole; since no per-phoneme score is provided, the user receives only limited feedback.
In the present invention, however, not only the overall score but also a score for each pronunciation (phoneme) is provided, enhancing the learning effect through pronunciation evaluation feedback.
For example, a conventional pronunciation evaluation scoring method provides only an overall score such as "the pronunciation score for cat is 80 points," whereas the present invention provides per-phoneme evaluation scores together with the overall score: "for cat, the pronunciation score for c is 80 points, the pronunciation score for a is 90 points, the pronunciation score for t is 90 points, and the overall pronunciation score is 86.6 points."
As shown in FIG. 1, the phoneme-unit pronunciation accuracy evaluation apparatus (1000) comprises a voice information extraction unit (100), a native speaker acoustic model storage unit (200), a speech recognition unit (300), a forced alignment unit (400), a log likelihood calculation unit (500), a log likelihood score conversion unit (600), an adjustment score providing unit (700), and a score output unit (800).
More specifically:
The voice information extraction unit (100) acquires, from the learner, spoken text information and the voice information the learner pronounced for that text, divides the acquired voice information into units of a set time interval, and extracts a speech feature vector for each time interval.
For example, it acquires the spoken text information corresponding to the text 'cat' and the voice information of the learner pronouncing 'cat'.
The present invention is provided with an input means for entering text and an input means for entering voice information: the learner provides the spoken text 'cat' to the voice information extraction unit (100) through the text input means, and then inputs the voice information of the pronounced text 'cat' through the voice input means (e.g., a microphone).
Having received the spoken text information and the voice information, the voice information extraction unit (100) acquires the speech signal for 'cat' as shown in FIG. 2, divides the acquired voice information into units of the set time interval, and extracts a speech feature vector for each time interval as shown in FIG. 3.
For example, the uttered speech signal of FIG. 2 is divided into time intervals of 10 ms, and a feature vector (MFCC) for the speech signal is extracted for each time interval.
In speech recognition, the speaking rate and the length of the silent interval between pronunciations are very important factors.
As a technique for extracting speech feature vectors, MFCC (Mel Frequency Cepstrum Coefficient) parameters are widely used; since this algorithm is well known in speech recognition technology, a detailed description is omitted.
Here, the time interval unit for extracting the speech feature vectors is a time unit in the range of 1 msec to 20 msec; considering that the time interval over which a homogeneous pronunciation signal exists is roughly 10 msec, the time interval unit is preferably set to 10 msec, as in the sketch below.
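As an illustration of the framing and feature extraction described above, the following minimal Python sketch divides an utterance into 10 msec intervals and extracts one MFCC vector per interval. It assumes the librosa library; the sampling rate, MFCC dimensionality, and window length are illustrative choices, not values fixed by this specification.

    import librosa

    SR = 16000          # assumed sampling rate (illustrative)
    HOP = SR // 100     # 10 msec hop: one feature vector per time interval
    N_MFCC = 13         # assumed MFCC dimensionality

    def extract_interval_features(wav_path):
        # Load the learner's utterance and compute one MFCC column per 10 msec interval.
        y, _ = librosa.load(wav_path, sr=SR)
        mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=N_MFCC,
                                    hop_length=HOP, n_fft=2 * HOP)
        return mfcc.T   # shape: (number of time intervals, N_MFCC)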
Native speaker acoustic model information is stored in the native speaker acoustic model storage unit (200).
The native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) is per-phoneme native speaker pronunciation characteristic information obtained using a deep learning model.
That is, per-phoneme native speaker pronunciation characteristic information, the result of analyzing the native speaker's speaking rate, the length of the silent interval between pronunciations, and so on with a deep learning model, is stored in the native speaker acoustic model storage unit (200), and the speech recognition unit (300) uses it to perform speech recognition on the voice pronounced by the learner.
The speech recognition unit (300) performs speech recognition on the speech feature vectors for each time interval extracted by the voice information extraction unit (100), using the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200), to generate speech recognition result information.
For example, as shown in FIG. 3, the speech recognition result information is result information in which the phoneme b is recognized at 0-10 ms (interval 1), b at 10-20 ms (interval 2), b at 20-30 ms (interval 3), b at 30-40 ms (interval 4), æ at 40-50 ms (interval 5), æ at 50-60 ms (interval 6), t at 60-70 ms (interval 7), t at 70-80 ms (interval 8), and s at 80-90 ms (interval 9).
The forced alignment unit (400) forcibly aligns the spoken text information acquired by the voice information extraction unit (100) by time interval to generate forced alignment result information.
For example, when the voice information extraction unit (100) acquires the spoken text information 'cat', the forced alignment unit (400) forcibly aligns the phoneme-unit pronunciations corresponding to the text 'cat' to the utterance strip as shown in FIG. 3.
That is, the phoneme-unit pronunciations corresponding to the spoken text are forcibly aligned per 10 ms time interval; as shown in FIG. 3, the forced alignment result information is result information in which the phoneme k is aligned at 0-10 ms (interval 1), k at 10-20 ms (interval 2), k at 20-30 ms (interval 3), æ at 30-40 ms (interval 4), æ at 40-50 ms (interval 5), æ at 50-60 ms (interval 6), t at 60-70 ms (interval 7), t at 70-80 ms (interval 8), and t at 80-90 ms (interval 9); the two per-interval sequences are restated in code form below.
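Restating the FIG. 3 example, the two per-interval phoneme sequences that the later steps compare can be written as plain Python lists (a restatement of the figure, not additional data):

    # Per-interval phonemes for the 'cat' example of FIG. 3 (intervals 1-9).
    recognized   = ['b', 'b', 'b', 'b', 'æ', 'æ', 't', 't', 's']  # speech recognition result
    forced_align = ['k', 'k', 'k', 'æ', 'æ', 'æ', 't', 't', 't']  # forced alignment result
    # Interval i matches when recognized[i] == forced_align[i]
    # (here: intervals 5 to 8 in the figure's 1-based numbering).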
The log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors extracted by the voice information extraction unit (100) for each time interval and the forced alignment result information generated by the forced alignment unit (400).
Specifically, the log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information using the following log likelihood formula:
log(p(o_i | q_i))    (log likelihood formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
The log likelihood for each time interval of the forced alignment result information has a negative value, because p(o_i | q_i), the probability that o_i is produced by q_i in the i-th time interval, lies between 0 and 1, and taking the logarithm of such a value yields a non-positive result.
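A minimal sketch of this per-interval computation is given below. It assumes an acoustic model exposed through a hypothetical function phoneme_prob(o, q) returning p(o | q); that name is an assumption for illustration, not an API defined by this specification.

    import math

    def interval_log_likelihoods(features, forced_align, phoneme_prob):
        # log p(o_i | q_i) for each time interval i; each value is <= 0
        # because the probability lies between 0 and 1.
        return [math.log(phoneme_prob(o, q))
                for o, q in zip(features, forced_align)]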
The log likelihood score conversion unit (600) converts the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100.
The reason for converting the per-interval log likelihood into a score between 0 and 100 is that the calculated log likelihood values are negative, so they must be mapped into the positive range.
Specifically, the log likelihood score conversion unit (600) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
For example, the converted log likelihood of 90 for interval 1 (0-10 ms) shown in FIG. 3 is the log of the probability that the speech feature vector of interval 1 comes from the phoneme 'k', the interval-1 phoneme according to the forced alignment result information, plus 100; the converted log likelihood of 80 for interval 5 (40-50 ms) is the log of the probability that the speech feature vector of interval 5 comes from the phoneme 'æ', the interval-5 phoneme according to the forced alignment result information, plus 100; and the converted log likelihood of 90 for interval 7 (60-70 ms) is the log of the probability that the speech feature vector of interval 7 comes from the phoneme 't', the interval-7 phoneme according to the forced alignment result information, plus 100.
As a more concrete example, p(o_1 | q_1) is the probability that the speech feature vector of the first time interval (the feature vector of the spoken phoneme b) comes from the phoneme 'k', the first-interval phoneme according to the forced alignment result information; taking the log of this probability and adding 100 gives the converted log likelihood value of 90.
That is, for the first time interval, a speech feature vector corresponding to the phoneme 'k' would have had to appear for the pronunciation to be judged accurate, in which case the probability would be close to 1; but since a speech feature vector corresponding to the phoneme 'b' actually appeared, the probability is much smaller than 1, yielding the converted log likelihood value of 90 shown in FIG. 3.
In summary, as shown in FIG. 3, the log likelihood score conversion unit (600) produces converted scores such as: 90 for the 'k' phoneme (the phoneme according to the forced alignment result information) of interval 1 (0-10 ms), 80 for the 'k' phoneme of interval 2 (10-20 ms), 100 for the 'k' phoneme of interval 3 (20-30 ms), 40 for the 'æ' phoneme of interval 4 (30-40 ms), 80 for the 'æ' phoneme of interval 5 (40-50 ms), 80 for the 'æ' phoneme of interval 6 (50-60 ms), 90 for the 't' phoneme of interval 7 (60-70 ms), 90 for the 't' phoneme of interval 8 (70-80 ms), and 70 for the 't' phoneme of interval 9 (80-90 ms), as sketched below.
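A minimal sketch of this conversion, consistent with the worked examples above (log value plus 100, kept within the 0 to 100 range), is:

    def to_score(log_likelihood):
        # Shift the negative log likelihood into the positive range and clamp.
        return max(0.0, min(100.0, 100.0 + log_likelihood))

    # e.g. scores = [to_score(ll) for ll in log_likelihoods]  # one score per interval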
After the log likelihood score conversion as above, the adjustment score providing unit (700) provides an adjustment score for each time interval to the score output unit according to whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match in each time interval.
More specifically, according to the following log likelihood adjustment formula, for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match, the adjustment score of that interval is set to 100 and provided to the score output unit; for a time interval in which they do not match, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit as the adjustment score.
adjusted score_i = 100, if the recognized phoneme of the i-th time interval matches q_i;
adjusted score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100, otherwise    (log likelihood adjustment formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
For example, as shown in FIG. 3, for 'æ and æ' in interval 5, 'æ and æ' in interval 6, 't and t' in interval 7, and 't and t' in interval 8, the speech recognition result and the forced alignment result match; so although the log likelihood conversion scores for those time intervals are 80, 80, 90, and 90 respectively, an adjustment score of 100 is provided to the score output unit.
The reason the adjustment score is set to 100 is that, even though the speech recognition result information and the forced alignment result information match, the evaluation score would be considerably lowered if no adjustment were made; the adjustment score of 100 is introduced to eliminate this scoring error.
Also, as shown in FIG. 3, for 'k and b' in interval 1, 'k and b' in interval 2, 'k and b' in interval 3, 'æ and b' in interval 4, and 't and s' in interval 9, the speech recognition result and the forced alignment result do not match, so the log likelihood conversion score of each time interval is provided to the score output unit as the adjustment score: 90 for interval 1, 80 for interval 2, 100 for interval 3, 40 for interval 4, and 70 for interval 9 (see the sketch below).
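The adjustment rule can be sketched as follows, using the lists from the FIG. 3 example above: where the recognized phoneme agrees with the forced-aligned phoneme, the interval score becomes 100; otherwise the converted score is kept.

    def adjusted_scores(recognized, forced_align, scores):
        return [100.0 if r == q else s
                for r, q, s in zip(recognized, forced_align, scores)]

    # With the FIG. 3 converted scores [90, 80, 100, 40, 80, 80, 90, 90, 70]
    # this yields [90, 80, 100, 40, 100, 100, 100, 100, 70].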
Then, the score output unit (800) calculates an average score for each phoneme of the input voice information based on the adjustment scores provided by the adjustment score providing unit, or calculates and outputs an overall average score for the input voice information.
For example, when the learner provides the voice information and spoken text information for 'cat', 'cat' consists of the phonemes 'k', 'æ', and 't'. As shown in FIG. 3, the per-interval adjustment scores for the 'k' phoneme are 90, 80, and 100, so its average score is (90+80+100)/3 = 90; the per-interval adjustment scores for the 'æ' phoneme are 40, 100, and 100, so its average score is (40+100+100)/3 = 80; and the per-interval adjustment scores for the 't' phoneme are 100, 100, and 70, so its average score is (100+100+70)/3 = 90. These averages are calculated and output.
The overall average score for the input voice information 'cat' is (90+80+90)/3, so 86.7 is calculated and output.
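A minimal sketch of this averaging step, grouping the adjusted interval scores by the forced-aligned phoneme of each interval and then averaging the phoneme scores for the overall result:

    def phoneme_averages(forced_align, adj_scores):
        groups = {}
        for q, s in zip(forced_align, adj_scores):
            groups.setdefault(q, []).append(s)
        return {q: sum(v) / len(v) for q, v in groups.items()}

    # FIG. 3 example: {'k': 90.0, 'æ': 80.0, 't': 90.0};
    # overall score = (90 + 80 + 90) / 3 = 86.7 (rounded).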
In addition, at least one of the per-phoneme average score and the overall average score for the input voice information is output to the screen.
For example, as shown in FIG. 4, the per-phoneme average scores and the overall average score may be provided at the same time, only the per-phoneme average scores may be provided, or only the overall average score may be provided.
Hereinafter, a phoneme-unit pronunciation accuracy evaluation method according to the first embodiment of the present invention will be described in detail with reference to FIG. 5.
FIG. 5 is an overall flowchart of the phoneme-unit pronunciation accuracy evaluation method according to the first embodiment of the present invention.
As shown in FIG. 5, the phoneme-unit pronunciation accuracy evaluation method includes a voice information extraction step (S100), a speech recognition step (S200), a forced alignment step (S300), a log likelihood calculation step (S400), a log likelihood score conversion step (S500), an adjustment score providing step (S600), and a score output step (S700).
Specifically, the phoneme-unit pronunciation accuracy evaluation method of the present invention comprises:
a voice information extraction step (S100) in which the voice information extraction unit (100) acquires, from a learner, spoken text information and the learner's uttered voice information for that text, divides the acquired voice information into units of a set time interval, and extracts a speech feature vector for each time interval;
a speech recognition step (S200) in which the speech recognition unit (300) performs speech recognition on the speech feature vectors for each time interval extracted in the voice information extraction step (S100), using the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200), to generate speech recognition result information;
a forced alignment step (S300) in which the forced alignment unit (400) forcibly aligns the spoken text information acquired in the voice information extraction step (S100) by time interval to generate forced alignment result information;
a log likelihood calculation step (S400) in which the log likelihood calculation unit (500) calculates a log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step (S100) and the forced alignment result information generated in the forced alignment step (S300);
a log likelihood score conversion step (S500) in which the log likelihood score conversion unit (600) converts the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100;
an adjustment score providing step (S600) in which the adjustment score providing unit (700) provides an adjustment score for each time interval to the score output unit according to whether the speech recognition result information and the forced alignment result information match in each time interval;
and a score output step (S700) in which the score output unit (800) calculates an average score for each phoneme based on the provided adjustment scores for each time interval, or calculates and outputs an overall average score for the input voice information.
The log likelihood calculation step (S400) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step (S100) and the forced alignment result information generated in the forced alignment step (S300), according to the following log likelihood formula:
log(p(o_i | q_i))    (log likelihood formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
The log likelihood score conversion step (S500) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
In the adjustment score providing step (S600), according to the following log likelihood adjustment formula, for a time interval in which the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match, the adjustment score of that interval is set to 100 and provided to the score output unit; for a time interval in which they do not match, the log likelihood conversion score of that interval, as converted by the log likelihood score conversion unit, is provided to the score output unit as the adjustment score.
adjusted score_i = 100, if the recognized phoneme of the i-th time interval matches q_i;
adjusted score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100, otherwise    (log likelihood adjustment formula)
Here, o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval.
The specific features of the phoneme-unit pronunciation accuracy evaluation method according to the first embodiment of the present invention are the same as those described for the phoneme-unit pronunciation accuracy evaluation apparatus according to the first embodiment, so a detailed description is omitted.
According to the present invention, the problem of conventional automatic pronunciation evaluation devices, which provide only an overall pronunciation evaluation score for an uttered speech signal corresponding to a given word or sentence, is improved by providing a score for each phoneme, the detailed unit of the speech signal; since not only the overall pronunciation score but also the score for each phoneme can be fed back, the learner can study intensively the phonemes that are weak, enhancing the learning effect.
In addition, by providing the per-phoneme score as a value between 0 and 100 through a web page or mobile app that is easy for users to access, the technology can be easily disseminated.
While the preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described; various modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood individually apart from the technical idea or outlook of the present invention.
Through the phoneme-unit pronunciation accuracy evaluation apparatus and evaluation method according to the present invention, having the configuration and operation described above, scores are provided for each phoneme, the detailed unit of the speech signal, rather than only an overall pronunciation evaluation score; since both the overall pronunciation score and the score for each phoneme can be fed back, the learner can study intensively the phonemes that are weak, enhancing the learning effect, and the industrial applicability is accordingly high.

Claims (10)

1. A phoneme-unit pronunciation accuracy evaluation apparatus, comprising:
    a voice information extraction unit (100) that acquires, from a learner, spoken text information and the voice information the learner pronounced for that text, divides the acquired voice information into units of a set time interval, and extracts a speech feature vector for each time interval;
    a native speaker acoustic model storage unit (200) in which native speaker acoustic model information is stored;
    a speech recognition unit (300) that performs speech recognition on the speech feature vectors extracted by the voice information extraction unit for each time interval, using the native speaker acoustic model information stored in the native speaker acoustic model storage unit, to generate speech recognition result information;
    a forced alignment unit (400) that forcibly aligns the spoken text information acquired by the voice information extraction unit by time interval to generate forced alignment result information;
    a log likelihood calculation unit (500) that calculates a log likelihood for each time interval of the forced alignment result information, using the speech feature vectors extracted by the voice information extraction unit (100) for each time interval and the forced alignment result information generated by the forced alignment unit (400);
    a log likelihood score conversion unit (600) that generates a log likelihood conversion score by converting the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100;
    an adjustment score providing unit (700) that provides an adjustment score for each time interval to the score output unit according to whether the speech recognition result information of the speech recognition unit and the forced alignment result information of the forced alignment unit match in each time interval;
    and a score output unit (800) that calculates an average score for each phoneme of the input voice information based on the adjustment scores for each time interval provided by the adjustment score providing unit, or calculates and outputs an overall average score for the input voice information.
2. The apparatus of claim 1, wherein the native speaker acoustic model information stored in the native speaker acoustic model storage unit (200) includes per-phoneme native speaker pronunciation characteristic information obtained by analyzing, with a deep learning model, the native speaker's speaking rate, the length of the silent interval between pronunciations, and the like.
3. The apparatus of claim 1, wherein the score output unit (800) processes the per-phoneme average score as a score value between 0 and 100.
4. The apparatus of claim 1, wherein the score output unit (800) outputs at least one of the per-phoneme average score and the overall average score for the input voice information to a screen.
5. The apparatus of claim 1, wherein the time interval unit is a time interval in the range of 1 msec to 20 msec.
6. The apparatus of claim 1, wherein the log likelihood calculation unit (500) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors extracted by the voice information extraction unit (100) for each time interval and the forced alignment result information generated by the forced alignment unit (400), according to the following log likelihood formula:
    log(p(o_i | q_i))    (log likelihood formula)
    (where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
7. The apparatus of claim 1, wherein the log likelihood score conversion unit (600) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
    score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
    (where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
8. A phoneme-unit pronunciation accuracy evaluation method, comprising:
    a voice information extraction step (S100) of acquiring, from a learner, spoken text information and the learner's uttered voice information for that text, dividing the acquired voice information into units of a set time interval, and extracting a speech feature vector for each time interval;
    a speech recognition step (S200) of performing speech recognition on the speech feature vectors for each time interval extracted in the voice information extraction step (S100), using native speaker acoustic model information, to generate speech recognition result information;
    a forced alignment step (S300) of forcibly aligning the spoken text information acquired in the voice information extraction step by time interval to generate forced alignment result information;
    a log likelihood calculation step (S400) of calculating a log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step;
    a log likelihood score conversion step (S500) of converting the log likelihood calculated for each time interval of the forced alignment result information into a score between 0 and 100;
    an adjustment score providing step (S600) of providing an adjustment score for each time interval according to whether the speech recognition result information and the forced alignment result information match in each time interval;
    and a score output step (S700) of calculating an average score for each phoneme of the input voice information based on the adjustment scores for each time interval, or calculating and outputting an overall average score for the input voice information.
9. The method of claim 8, wherein the log likelihood calculation step (S400) calculates the log likelihood for each time interval of the forced alignment result information, using the speech feature vectors for each time interval extracted in the voice information extraction step and the forced alignment result information generated in the forced alignment step, according to the following log likelihood formula:
    log(p(o_i | q_i))    (log likelihood formula)
    (where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)
10. The method of claim 8, wherein the log likelihood score conversion step (S500) converts the log likelihood for each time interval of the forced alignment result information into a score between 0 and 100 using the following log likelihood score conversion formula:
    score_i = 100 + log(p(o_i | q_i)), limited to the range 0 to 100    (log likelihood score conversion formula)
    (where o_i is the speech feature vector of the i-th time interval, q_i is the phoneme of the i-th time interval according to the forced alignment result information, and p(o_i | q_i) is the probability that o_i is produced by q_i in the i-th time interval)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050074298A (en) * 2004-01-08 2005-07-18 정보통신연구진흥원 Pronunciation test system and method of foreign language
KR20100049201A (en) * 2008-11-03 2010-05-12 윤병원 Electronic dictionary service method having drill on pronunciation and electronic dictionary using the same
KR20150001189A (en) * 2013-06-26 2015-01-06 한국전자통신연구원 System and method for evaluating and training capability of speaking in foreign language using voice recognition
KR101609473B1 (en) * 2014-10-14 2016-04-05 충북대학교 산학협력단 System and method for automatic fluency evaluation of english speaking tests
KR20160122542A (en) * 2015-04-14 2016-10-24 주식회사 셀바스에이아이 Method and apparatus for measuring pronounciation similarity

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986650A (en) * 2020-08-07 2020-11-24 云知声智能科技股份有限公司 Method and system for assisting speech evaluation by means of language identification
CN111986650B (en) * 2020-08-07 2024-02-27 云知声智能科技股份有限公司 Method and system for assisting voice evaluation by means of language identification
WO2022048354A1 (en) * 2020-09-07 2022-03-10 北京世纪好未来教育科技有限公司 Speech forced alignment model evaluation method and apparatus, electronic device, and storage medium
US11749257B2 (en) 2020-09-07 2023-09-05 Beijing Century Tal Education Technology Co., Ltd. Method for evaluating a speech forced alignment model, electronic device, and storage medium
CN112331180A (en) * 2020-11-03 2021-02-05 北京猿力未来科技有限公司 Spoken language evaluation method and device
CN112466288A (en) * 2020-12-18 2021-03-09 北京百度网讯科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112767919A (en) * 2021-01-22 2021-05-07 北京读我科技有限公司 Voice evaluation method and device
CN112908360A (en) * 2021-02-02 2021-06-04 早道(大连)教育科技有限公司 Online spoken language pronunciation evaluation method and device and storage medium
CN113823329A (en) * 2021-07-30 2021-12-21 腾讯科技(深圳)有限公司 Data processing method and computer device
CN115376547A (en) * 2022-08-12 2022-11-22 腾讯科技(深圳)有限公司 Pronunciation evaluation method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2020027394A1 (en) Apparatus and method for evaluating accuracy of phoneme unit pronunciation
WO2020213996A1 (en) Method and apparatus for interrupt detection
WO2020231181A1 (en) Method and device for providing voice recognition service
WO2020145439A1 (en) Emotion information-based voice synthesis method and device
WO2020190050A1 (en) Speech synthesis apparatus and method therefor
WO2020189850A1 (en) Electronic device and method of controlling speech recognition by electronic device
WO2017217661A1 (en) Word sense embedding apparatus and method using lexical semantic network, and homograph discrimination apparatus and method using lexical semantic network and word embedding
WO2021112642A1 (en) Voice user interface
WO2019139431A1 (en) Speech translation method and system using multilingual text-to-speech synthesis model
WO2017082447A1 (en) Foreign language reading aloud and displaying device and method therefor, motor learning device and motor learning method based on foreign language rhythmic action detection sensor, using same, and electronic medium and studying material in which same is recorded
WO2019078615A1 (en) Method and electronic device for translating speech signal
WO2020230926A1 (en) Voice synthesis apparatus for evaluating quality of synthesized voice by using artificial intelligence, and operating method therefor
WO2020085794A1 (en) Electronic device and method for controlling the same
WO2020050509A1 (en) Voice synthesis device
WO2015099464A1 (en) Pronunciation learning support system utilizing three-dimensional multimedia and pronunciation learning support method thereof
WO2020145472A1 (en) Neural vocoder for implementing speaker adaptive model and generating synthesized speech signal, and method for training neural vocoder
WO2022260432A1 (en) Method and system for generating composite speech by using style tag expressed in natural language
WO2020153717A1 (en) Electronic device and controlling method of electronic device
WO2022080774A1 (en) Speech disorder assessment device, method, and program
WO2021040490A1 (en) Speech synthesis method and apparatus
EP3841460A1 (en) Electronic device and method for controlling the same
WO2023085584A1 (en) Speech synthesis device and speech synthesis method
WO2022035183A1 (en) Device for recognizing user&#39;s voice input and method for operating same
WO2021085661A1 (en) Intelligent voice recognition method and apparatus
WO2023177095A1 (en) Patched multi-condition training for robust speech recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19843830

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.07.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19843830

Country of ref document: EP

Kind code of ref document: A1